MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

Lixing Xiao1 · Shunlin Lu2 · Huaijin Pi3 · Ke Fan4 · Liang Pan3 · Yueer Zhou1 · Ziyong Feng5 ·
Xiaowei Zhou1 · Sida Peng1† · Jingbo Wang6

1Zhejiang University 2The Chinese University of Hong Kong, Shenzhen 3The University of Hong Kong
4Shanghai Jiao Tong University 5DeepGlint 6Shanghai AI Lab

ICCV 2025


🔥 News

  • [2025-06] MotionStreamer has been accepted to ICCV 2025! 🎉

TODO List

  • Release the processing script for the 272-dim motion representation.
  • Release the processed 272-dim Motion Representation of the HumanML3D dataset (for academic use only).
  • Release the training code and checkpoint of our TMR-based motion evaluator trained on the processed 272-dim HumanML3D dataset.
  • Release the training and evaluation code, as well as the checkpoint, of the Causal TAE.
  • Release the training code of the original motion generation model and the streaming generation model (MotionStreamer).
  • Release the checkpoint and demo inference code of the original motion generation model.
  • Release the complete code for MotionStreamer.

πŸƒ Motion Representation

For more details on how to obtain the 272-dim motion representation, as well as other useful tools (e.g., visualization and conversion to BVH format), please refer to our GitHub repo.
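As a quick sanity check after downloading the data described below, a motion clip can be loaded directly with NumPy. This is a minimal sketch, assuming each .npy file stores one sequence of shape (num_frames, 272) and that the per-dimension Mean.npy/Std.npy files shipped with the datasets are used for standard z-score normalization; the exact layout is defined by the processing scripts in the repo.

import numpy as np

# Paths follow the dataset layout shown in Data Preparation below.
motion = np.load("humanml3d_272/motion_data/000000.npy")  # assumed shape: (num_frames, 272)
mean = np.load("humanml3d_272/mean_std/Mean.npy")         # per-dimension mean, shape (272,)
std = np.load("humanml3d_272/mean_std/Std.npy")           # per-dimension std, shape (272,)

print(motion.shape)
normalized = (motion - mean) / std  # z-score normalization (assumed convention)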

Installation

🐍 Python Virtual Environment

conda env create -f environment.yaml
conda activate mgpt

🤗 Hugging Face Mirror

All of our models and data are hosted on Hugging Face. If Hugging Face is not directly accessible from your network, you can route downloads through the HF-Mirror endpoint:

pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com

📥 Data Preparation

To facilitate researchers, we provide the processed 272-dim Motion Representation of:

HumanML3D dataset at this link.

BABEL dataset at this link.

❗️❗️❗️ The processed data is solely for academic purposes. Make sure you read through the AMASS License.

  1. Download the processed 272-dim HumanML3D dataset as follows:
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-HumanML3D --local-dir ./humanml3d_272
cd ./humanml3d_272
unzip texts.zip
unzip motion_data.zip

The dataset is organized as:

./humanml3d_272
  ├── mean_std
      ├── Mean.npy
      ├── Std.npy
  ├── split
      ├── train.txt
      ├── val.txt
      ├── test.txt
  ├── texts
      ├── 000000.txt
      ...
  ├── motion_data
      ├── 000000.npy
      ...
  2. Download the processed 272-dim BABEL dataset as follows:
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL --local-dir ./babel_272
cd ./babel_272
unzip texts.zip
unzip motion_data.zip

The dataset is organized as:

./babel_272
  ├── t2m_babel_mean_std
      ├── Mean.npy
      ├── Std.npy
  ├── split
      ├── train.txt
      ├── val.txt
  ├── texts
      ├── 000000.txt
      ...
  ├── motion_data
      ├── 000000.npy
      ...
  3. Download the processed streaming 272-dim BABEL dataset as follows:
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL-stream --local-dir ./babel_272_stream
cd ./babel_272_stream
unzip train_stream.zip
unzip train_stream_text.zip
unzip val_stream.zip
unzip val_stream_text.zip

The dataset is organized as:

./babel_272_stream
  ├── train_stream
      ├── seq1.npy
      ...
  ├── train_stream_text
      ├── seq1.txt
      ...
  ├── val_stream
      ├── seq1.npy
      ...
  ├── val_stream_text
      ├── seq1.txt
      ...

NOTE: We process the original BABEL dataset to support the training of streaming motion generation. For example, if a motion sequence A is annotated in BABEL as four subsequences (A1, A2, A3, A4) with text descriptions (A1_t, A2_t, A3_t, A4_t), then our BABEL-stream is constructed from overlapping pairs:

seq1: (A1, A2) --- seq1_text: (A1_t*A2_t#A1_length)

seq2: (A2, A3) --- seq2_text: (A2_t*A3_t#A2_length)

seq3: (A3, A4) --- seq3_text: (A3_t*A4_t#A3_length)

Here, * and # are separator symbols, and A1_length denotes the number of frames in subsequence A1.
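For illustration, here is a minimal sketch of how such an annotation could be parsed, assuming exactly one * and one # per file as in the format above (the parsing function and usage below are hypothetical, not part of the repo):

# Parse a BABEL-stream annotation "A1_t*A2_t#A1_length" into its parts.
def parse_stream_text(annotation: str):
    texts, length = annotation.rsplit("#", 1)      # "#" separates the frame count
    first_text, second_text = texts.split("*", 1)  # "*" separates the two descriptions
    return first_text, second_text, int(length)

# Hypothetical usage with a file from babel_272_stream/train_stream_text/:
with open("babel_272_stream/train_stream_text/seq1.txt") as f:
    a1_text, a2_text, a1_frames = parse_stream_text(f.read().strip())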

🚀 Training

  1. Train our TMR-based motion evaluator on the processed 272-dim HumanML3D dataset:

    bash TRAIN_evaluator_272.sh

    After training for 100 epochs, the checkpoint will be stored at: Evaluator_272/experiments/temos/EXP1/checkpoints/.

    ⬇️ We provide the evaluator checkpoint on Hugging Face; download it as follows:

    python humanml3d_272/prepare/download_evaluator_ckpt.py

    The downloaded checkpoint will be stored at: Evaluator_272/.

  2. Train the Causal TAE:

    bash TRAIN_causal_TAE.sh ${NUM_GPUS}

    e.g., if you have 8 GPUs, run: bash TRAIN_causal_TAE.sh 8

    The checkpoint will be stored at: Experiments/causal_TAE_t2m_272/

    Tensorboard visualization:

    tensorboard --logdir='Experiments/causal_TAE_t2m_272'

    ⬇️ We provide the Causal TAE checkpoint on Hugging Face; download it as follows:

    python humanml3d_272/prepare/download_Causal_TAE_t2m_272_ckpt.py
  3. Train the text-to-motion model:

    We provide scripts to train the original text-to-motion generation model with LLaMA blocks, the Two-Forward strategy, and QK-Norm (see the QK-Norm sketch after this list), using the motion latents encoded by the Causal TAE trained in the first stage.

    3.1 Get motion latents:

    python get_latent.py --resume-pth Causal_TAE/net_last.pth --latent_dir humanml3d_272/t2m_latents

    3.2 Download the sentence-T5-XXL model from Hugging Face:

    huggingface-cli download --resume-download sentence-transformers/sentence-t5-xxl --local-dir sentencet5-xxl/

    3.3 Train the text-to-motion generation model:

    bash TRAIN_t2m.sh ${NUM_GPUS}

    e.g., if you have 8 GPUs, run: bash TRAIN_t2m.sh 8

    The checkpoint will be stored at: Experiments/t2m_model/

    Tensorboard visualization:

    tensorboard --logdir='Experiments/t2m_model'

    ⬇️ We provide the text-to-motion model checkpoint on Hugging Face; download it as follows:

    python humanml3d_272/prepare/download_t2m_model_ckpt.py
  4. Train the streaming motion generation model (MotionStreamer):

    We provide scripts to train the streaming motion generation model (MotionStreamer) with LLaMA blocks, the Two-Forward strategy, and QK-Norm, using motion latents encoded by the Causal TAE. Note that this stage requires a new Causal TAE trained on both the HumanML3D-272 and BABEL-272 data.

    4.1 Train a Causal TAE using both HumanML3D-272 and BABEL-272 data:

    bash TRAIN_causal_TAE.sh ${NUM_GPUS} t2m_babel_272

    e.g., if you have 8 GPUs, run: bash TRAIN_causal_TAE.sh 8 t2m_babel_272

    The checkpoint will be stored at: Experiments/causal_TAE_t2m_babel_272/

    Tensorboard visualization:

    tensorboard --logdir='Experiments/causal_TAE_t2m_babel_272'

    ⬇️ We provide the Causal TAE checkpoint trained on both HumanML3D-272 and BABEL-272 data on Hugging Face; download it as follows:

    python humanml3d_272/prepare/download_Causal_TAE_t2m_babel_272_ckpt.py

    4.2 Get motion latents of both HumanML3D-272 and the processed BABEL-272-stream dataset:

    python get_latent.py --resume-pth Causal_TAE_t2m_babel/net_last.pth --latent_dir babel_272_stream/t2m_babel_latents --dataname t2m_babel_272

    4.3 Train MotionStreamer model:

    bash TRAIN_motionstreamer.sh ${NUM_GPUS}

    e.g., if you have 8 GPUs, run: bash TRAIN_motionstreamer.sh 8

    The checkpoint will be stored at: Experiments/motionstreamer_model/

    Tensorboard visualization:

    tensorboard --logdir='Experiments/motionstreamer_model'
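For readers unfamiliar with QK-Norm, mentioned in steps 3 and 4 above: the idea is to normalize queries and keys before the attention dot product, which bounds the attention logits and stabilizes training. Below is a generic, self-contained PyTorch sketch of the technique, not the repository's actual implementation; all class and variable names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Single-head self-attention with QK-Norm (illustrative sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.q_norm = nn.LayerNorm(dim)  # QK-Norm: normalize queries...
        self.k_norm = nn.LayerNorm(dim)  # ...and keys before the dot product
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.q_norm(self.q_proj(x))
        k = self.k_norm(self.k_proj(x))
        v = self.v_proj(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v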

πŸ“ Evaluation

  1. Evaluate the metrics of the processed 272-dim HumanML3D dataset:

    bash EVAL_GT.sh

    (FID, R@1, R@2, R@3, Diversity, and MM-Dist (Matching Score) are reported.)

  2. Evaluate the metrics of the Causal TAE:

    bash EVAL_causal_TAE.sh

    (FID and MPJPE (mm) are reported.)

  3. Evaluate the metrics of the text-to-motion model:

    bash EVAL_t2m.sh

    (FID, R@1, R@2, R@3, Diversity, and MM-Dist (Matching Score) are reported; see the retrieval-metric sketch after this list.)
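For reference, the retrieval-style metrics above (R@k and MM-Dist) are computed from paired text and motion embeddings produced by the evaluator. The following is a generic sketch of the standard protocol, assuming matched embedding matrices where row i of each corresponds to the same sample; it is illustrative, not the repository's evaluation code.

import numpy as np

def retrieval_metrics(text_emb: np.ndarray, motion_emb: np.ndarray, ks=(1, 2, 3)):
    """R@k and MM-Dist from paired (N, D) embeddings, row i matching row i."""
    # Pairwise Euclidean distances between every text and every motion embedding.
    dists = np.linalg.norm(text_emb[:, None, :] - motion_emb[None, :, :], axis=-1)
    ranks = np.argsort(dists, axis=1)  # for each text, motions sorted by distance
    match = ranks == np.arange(len(text_emb))[:, None]
    r_at_k = {k: match[:, :k].any(axis=1).mean() for k in ks}
    mm_dist = dists.diagonal().mean()  # mean distance of the matched pairs
    return r_at_k, mm_dist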

🎬 Demo Inference

  1. Inference of the text-to-motion model:

    [Option 1] Recover from joint positions

    python demo_t2m.py --text 'a person is walking like a mummy.' --mode pos --resume-pth Causal_TAE/net_last.pth --resume-trans Experiments/t2m_model/latest.pth

    [Option 2] Recover from joint rotations

    python demo_t2m.py --text 'a person is walking like a mummy.' --mode rot --resume-pth Causal_TAE/net_last.pth --resume-trans Experiments/t2m_model/latest.pth

    With our 272-dim representation, Inverse Kinematics (IK) is not needed. For further conversion to the BVH format, please refer to this repo (Step 6: Representation_272 to BVH conversion). The resulting BVH animation can be visualized and edited in Blender.
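For a quick look at recovered joint positions without going through Blender, here is a rough matplotlib sketch. It assumes the demo writes joints as an array of shape (num_frames, num_joints, 3); both this layout and the file path are assumptions for illustration.

import numpy as np
import matplotlib.pyplot as plt

joints = np.load("output_joints.npy")  # hypothetical path; assumed shape (T, J, 3)
frame = joints[0]                      # first frame: (J, 3) joint positions

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(frame[:, 0], frame[:, 1], frame[:, 2])  # one point per joint
ax.set_title("Recovered joints, frame 0")
plt.show()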

🌹 Acknowledgement

This repository builds upon the following awesome datasets and projects:

🤝🏼 Citation

If our project is helpful for your research, please consider citing:

@InProceedings{Xiao_2025_ICCV,
    author    = {Xiao, Lixing and Lu, Shunlin and Pi, Huaijin and Fan, Ke and Pan, Liang and Zhou, Yueer and Feng, Ziyong and Zhou, Xiaowei and Peng, Sida and Wang, Jingbo},
    title     = {MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {10086-10096}
}
