DOPO adapts a text-to-motion diffusion model (MDM) pretrained on a source dataset to generate motions that align with the distribution of a target dataset — without requiring paired data. It uses a DenseTMR evaluator trained on the target dataset as a dense reward signal during online preference optimization (SPO).
The full pipeline consists of three stages:
- Pretrain MDM — train a motion diffusion model on the source dataset.
- Train DenseTMR evaluator — train a text–motion retrieval model on the target dataset. This model serves as the reward during DOPO fine-tuning.
- DOPO fine-tune (MDM-SPO) — fine-tune the pretrained MDM on the target dataset distribution using online preference optimization guided by the DenseTMR reward.
Supported datasets: humanml3d, kitml, babel, motionx.
conda create -n dopo python=3.10 -y
conda activate dopo
# PyTorch (CUDA 12.8)
pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu128 --extra-index-url https://pypi.nvidia.com --no-deps
# Project dependencies
pip install -r requirements_clean.txt
pip install git+https://github.com/openai/CLIP.git

Train an MDM model from scratch on a source dataset using the SMPL-RiFKE motion representation (introduced in link).
# HumanML3D
python train_model.py \
data=humanml3d \
model=mdm \
data.with_noise=false \
data/motion_loader=smplrifke
# KiTML
python train_model.py \
data=kitml \
model=mdm \
data.with_noise=false \
data.text_to_token_emb=false \
data.text_to_sent_emb=false
# BABEL
python train_model.py \
data=babel \
model=mdm \
data.with_noise=false \
data.text_to_token_emb=false \
data.text_to_sent_emb=false
# MotionX
python train_model.py \
data=motionx \
model=mdm \
data.with_noise=false \
data.text_to_token_emb=false \
data.text_to_sent_emb=false

Checkpoints are saved to outputs/mdm_{dataset}_smplrifke/.
Convenience script for all datasets: bash/train_mdm_ALL_smplrifke.sh
Train a DenseTMR text–motion retrieval model on each target dataset. This evaluator is used both as a reward signal during DOPO fine-tuning and as the evaluation metric.
# HumanML3D
python train_evaluator.py \
data=humanml3d \
model=densetmr \
data.with_noise=true \
data.clip_embedder=false
# KiTML
python train_evaluator.py \
data=kitml \
model=densetmr \
data.with_noise=true \
data.clip_embedder=false
# BABEL
python train_evaluator.py \
data=babel \
model=densetmr \
data.with_noise=true \
data.clip_embedder=false
# MotionX
python train_evaluator.py \
data=motionx \
model=densetmr \
data.with_noise=true \
data.clip_embedder=false

Checkpoints are saved to outputs/densetmr_{dataset}_smplrifke/.
Convenience script: bash/train_densetmr_ALL_smplrifke.sh
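Conceptually, the dense reward that DenseTMR provides is text–motion similarity in a shared embedding space. A minimal sketch, assuming unit-normalized embeddings and hypothetical function names (the actual reward is computed by the trained evaluator, not this function):

```python
import numpy as np

def dense_reward(text_emb, motion_embs):
    """Per-segment reward: cosine similarity between one text embedding
    and a sequence of motion-segment embeddings.

    Hypothetical sketch of the idea; the repo's DenseTMR evaluator
    defines the real reward.
    """
    t = text_emb / np.linalg.norm(text_emb)
    m = motion_embs / np.linalg.norm(motion_embs, axis=-1, keepdims=True)
    return m @ t  # shape: (num_segments,), values in [-1, 1]

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
text = rng.normal(size=32)           # toy text embedding
motion = rng.normal(size=(8, 32))    # 8 motion segments
rewards = dense_reward(text, motion)
print(rewards.shape)
```

Because the reward is computed per motion segment rather than once per sequence, it gives the fine-tuning stage a denser learning signal than a single sequence-level score.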
Fine-tune a pretrained MDM on a target dataset using online preference optimization. The DenseTMR evaluator trained on the target dataset is used as the reward model.
Key arguments:
- model.checkpoint_dir — path to the pretrained source MDM checkpoint.
- evaluator.checkpoint_dir — path to the target DenseTMR evaluator checkpoint.
- data — target dataset.
- run_dir — output directory for the fine-tuned model.
python train_model_spo.py \
data=kitml \
model=mdm_spo \
data.with_noise=false \
data.text_to_token_emb=false \
data.text_to_sent_emb=false \
dataloader.batch_size=160 \
model.train_batch_size=160 \
model.lr=1e-7 \
model.ckpt="last" \
evaluator.checkpoint_dir='outputs/densetmr_kitml_smplrifke' \
model.checkpoint_dir='outputs/mdm_humanml3d_smplrifke' \
run_dir='outputs/mdmspo_humanml3d_to_kitml_smplrifke_lr1e-7' \
trainer.max_epochs=20 \
group_name='H2K'

bash bash/train_mdmspo_babel_2_ALL.sh

Convenience scripts for all source→target pairs are in bash/.
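The online preference step can be pictured as ranking several generations per prompt by evaluator reward and applying a DPO-style logistic loss to the (winner, loser) pair. This is an illustrative sketch only; the actual objective, hyperparameters, and pair-selection logic live in train_model_spo.py:

```python
import numpy as np

def pick_pair(rewards):
    """Winner/loser indices among K generations, ranked by evaluator reward."""
    return int(np.argmax(rewards)), int(np.argmin(rewards))

def preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style logistic loss on one preference pair. The form and the
    beta value here are assumptions, not the repo's code."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.log1p(np.exp(-margin)))  # -log(sigmoid(margin))

# Toy usage: 4 generations for one prompt, scored by a (fake) reward.
rewards = np.array([0.2, 0.7, 0.1, 0.5])
w, l = pick_pair(rewards)
loss = preference_loss(logp_w=-10.0, logp_l=-12.0,
                       ref_logp_w=-11.0, ref_logp_l=-11.0)
print(w, l, round(loss, 4))
```

Pushing the winner's log-probability up relative to the frozen reference model (and the loser's down) shrinks the loss, which is what nudges the diffusion model toward the target distribution.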
python eval_model.py \
model=mdm \
evaluator=densetmr \
data=kitml \
model_checkpoint_dir='outputs/mdm_humanml3d_smplrifke' \
evaluator_checkpoint_dir='outputs/densetmr_kitml_smplrifke' \
data.clip_embedder=false \
data.text_to_token_emb=false \
data.text_to_sent_emb=false \
distance_metric='cosine' \
model_ckpt="last" \
output_dir='evaluation_results/mdm_humanml3d_on_kitml'

All cross-dataset baselines: bash/eval_mdm_baselines_ALL.sh
python eval_model.py \
model=mdm_spo \
evaluator=densetmr \
data=kitml \
model_checkpoint_dir='outputs/mdmspo_humanml3d_to_kitml_smplrifke_lr1e-7' \
evaluator_checkpoint_dir='outputs/densetmr_kitml_smplrifke' \
data.clip_embedder=false \
data.text_to_token_emb=false \
data.text_to_sent_emb=false \
distance_metric='cosine' \
model_ckpt="best" \
output_dir='evaluation_results/mdmspo_humanml3d_to_kitml_best'

All fine-tuned models: bash/eval_mdmspo_best_ALL.sh
Reported metrics: R@1/R@3/R@10, MedR (T2M and M2T), FID, diversity, multimodality. Results are saved to evaluation_results/*/results.json.
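The retrieval metrics can be computed from an N×N text–motion similarity matrix whose diagonal holds the ground-truth pairs. A self-contained sketch of that computation (not the repo's evaluation code):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 3, 10)):
    """R@k and median rank from an (N, N) similarity matrix where
    entry (i, i) is the ground-truth text-motion pair."""
    order = np.argsort(-sim, axis=1)  # best match first, per row
    ranks = 1 + np.array([np.where(order[i] == i)[0][0]
                          for i in range(len(sim))])
    out = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    out["MedR"] = float(np.median(ranks))
    return out

# Toy check: a near-diagonal similarity matrix retrieves everything at rank 1.
sim = np.eye(12) + 0.01 * np.random.default_rng(1).normal(size=(12, 12))
metrics = retrieval_metrics(sim)
print(metrics)
```

Running the rows as texts gives T2M metrics; transposing the matrix gives M2T.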
├── train_model.py # Stage 1: pretrain MDM
├── train_evaluator.py # Stage 2: train DenseTMR evaluator
├── train_model_spo.py # Stage 3: DOPO fine-tuning
├── eval_model.py # Evaluation script
├── configs/ # Hydra configs (data, model, trainer)
│ ├── DenseTMR/
│ ├── MDM_SPO/
│ └── ...
├── models/ # Model implementations
├── evaluators/ # Evaluator implementations (TMR, DenseTMR, Guo)
├── bash/ # Convenience training/evaluation scripts
└── outputs/ # Saved checkpoints (created at runtime)
- Add StableMoFusion support as a generative backbone
- Add support for additional datasets
- Add arXiv link
- Release checkpoints for pretrained models
- Release checkpoints for trained evaluators
- Release checkpoints for fine-tuned models
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
