DancingBox: A Lightweight MoCap System for Character Animation from Physical Proxies

CHI 2026 | Honorable Mention Award | Project Page

DancingBox captures motion from everyday physical objects (boxes, cups, etc.) using a single RGB camera (your phone, for example) and generates full-body character animation — no markers, suits, or depth sensors required.

I'm open to collaboration. Contact [email protected] if you are interested.

Pipeline Overview

The system runs in two stages:

Step 1 — Bounding Box Extraction (step1_bbox/): Takes an input video and extracts 3D bounding boxes of physical proxy objects using SAM2 segmentation, CoTracker3 point tracking, and Pi3 monocular 3D reconstruction.
Step 2 — Motion Generation (step2_motion/): A conditional diffusion model takes the extracted bounding boxes as spatial hints and generates plausible full-body human motion (SMPL format).

Installation

The two pipeline steps use separate conda environments.

Step 1 — Bounding Box Extraction:

conda env create -f step1_bbox/environment.yml
conda activate s1bbox

Step 2 — Motion Generation:

conda env create -f step2_motion/environment.yml
conda activate s2motion

Model Downloads

All model files below are git-ignored (each > 50 MB). Download them before running the pipeline.

SAM2 checkpoints

File	Size	Download
`step1_bbox/sam2/checkpoints/sam2.1_hiera_tiny.pt`	149 MB	Meta official
`step1_bbox/sam2/checkpoints/sam2.1_hiera_small.pt`	176 MB	Meta official
`step1_bbox/sam2/checkpoints/sam2.1_hiera_base_plus.pt`	309 MB	Meta official
`step1_bbox/sam2/checkpoints/sam2.1_hiera_large.pt`	858 MB	Meta official

cd step1_bbox/sam2/checkpoints
wget https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_tiny.pt
wget https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_small.pt
wget https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_base_plus.pt
wget https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt

CoTracker3 checkpoints

File	Size	Download
`step1_bbox/co-tracker/checkpoints/scaled_offline.pth`	97 MB	HuggingFace
`step1_bbox/co-tracker/checkpoints/scaled_online.pth`	97 MB	HuggingFace

cd step1_bbox/co-tracker/checkpoints
wget https://huggingface.co/facebook/cotracker3/resolve/main/scaled_offline.pth
wget https://huggingface.co/facebook/cotracker3/resolve/main/scaled_online.pth

Pi3

The Pi3 model auto-downloads from HuggingFace (yyfz233/Pi3) at runtime — no manual download needed.

Motion generation model

File	Size	Download
`step2_motion/ckpt/model.pt`	168 MB	Google Drive

Place the checkpoint at step2_motion/ckpt/model.pt before running Step 2.
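Before running the pipeline, it can help to confirm that every checkpoint listed above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and the list includes all four SAM2 variants because the README does not say which one the pipeline loads.

```python
from pathlib import Path

# Checkpoint paths copied from the tables above, relative to the repository root.
# Assumption: all listed files are wanted; trim the list if you only use one SAM2 size.
REQUIRED_CHECKPOINTS = [
    "step1_bbox/sam2/checkpoints/sam2.1_hiera_tiny.pt",
    "step1_bbox/sam2/checkpoints/sam2.1_hiera_small.pt",
    "step1_bbox/sam2/checkpoints/sam2.1_hiera_base_plus.pt",
    "step1_bbox/sam2/checkpoints/sam2.1_hiera_large.pt",
    "step1_bbox/co-tracker/checkpoints/scaled_offline.pth",
    "step1_bbox/co-tracker/checkpoints/scaled_online.pth",
    "step2_motion/ckpt/model.pt",
]


def missing_checkpoints(repo_root=".", required=REQUIRED_CHECKPOINTS):
    """Return the required checkpoint paths that are not files under repo_root."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All checkpoints found.")
```

Run it from the repository root; an empty result means the manual downloads are complete (Pi3 is not checked because it downloads at runtime).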

Usage

Step 1: Extract bounding boxes

Place your input video under step1_bbox/myvideos/, then run:

conda activate s1bbox
cd step1_bbox
python run_single_video.py

Note 1: In the interactive prompt, use label 0 for the ground-plane object. In a headless environment, start an X server if you want save_vis=True.

Note 2: A positive click on one part is automatically treated as a negative click for all other parts. In my experience, relying on this saves a lot of clicks.

The output bounding boxes will be saved to step1_bbox/working_dir_<video_name>/bboxs/bboxs.npy.
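If you prefer a quick sanity check before opening Blender, the saved file can be inspected with NumPy. This is a sketch under assumptions: it presumes bboxs.npy holds a plain numeric array (not a pickled object), and it does not assume any particular axis layout, since the README does not document one. The helper name `describe_bboxes` is hypothetical.

```python
import numpy as np


def describe_bboxes(path):
    """Load a .npy file and report its shape, dtype, and whether all values are finite."""
    # allow_pickle=False fails loudly if the file is not a plain array (an assumption here).
    arr = np.load(path, allow_pickle=False)
    numeric = np.issubdtype(arr.dtype, np.number)
    return {
        "shape": arr.shape,
        "dtype": str(arr.dtype),
        # NaN or inf values would suggest a failed reconstruction or tracking step.
        "all_finite": bool(np.isfinite(arr).all()) if numeric else None,
    }


# Example (replace <video_name> with your own working directory name):
# print(describe_bboxes("step1_bbox/working_dir_<video_name>/bboxs/bboxs.npy"))
```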

You can optionally inspect the 3D bounding boxes in Blender using tools/vis_hints.blend (modify the bbox path in the Blender script, then run).

Step 2: Generate motion

From step2_motion/, run the generation script with your extracted bounding boxes:

conda activate s2motion
cd step2_motion

python -m sample.custom_generate_sequence \
  --model_path ./ckpt/model.pt \
  --output_dir ../results/ \
  --hint_path ../step1_bbox/working_dir_<video_name>/bboxs/bboxs.npy \
  --text_prompt 'put your prompt here' \
  --target_height 1.3 --y_offset 0.2 \
  --torso_idx 2 \
  --rotation_idx 3 \
  --no_vis_bbox

# Test with the bundled example hint:
python -m sample.custom_generate_sequence \
  --model_path ./ckpt/model.pt \
  --num_repetitions 1 \
  --output_dir ../results/ \
  --hint_path ../examples/ASingleBoxJumpping/bboxs.npy \
  --torso_idx 0 \
  --target_height 1.3 --y_offset 0.2 \
  --no_vis_bbox --no_dataset \
  --text_prompt 'jump'

--text_prompt: optional; sometimes the boxes alone are enough to express a motion.
--target_height / --y_offset: normalization parameters that may need hand-tuning per input (see Known Limitations).
--rotation_idx: camera viewpoint index (0–9). Set it to generate multiple motions from one fixed rotation; if omitted, the script generates 10 rotated poses by default.
--torso_idx: index of the torso bounding box. If you used label n for the torso in Step 1, set this to n-1, because the ground object (label 0) is ignored. You can also use the Blender tool to check the torso box's index.
--no_vis_bbox: remove this flag to visualize how the bounding boxes align with the generated motion; saving the videos then takes longer.
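The label-to-index rule for --torso_idx is easy to get off by one, so here is a small illustration of it. The function name `torso_idx_from_label` is hypothetical and exists only to spell out the arithmetic described above.

```python
def torso_idx_from_label(label):
    """Map the Step 1 click label of the torso object to the --torso_idx value.

    Label 0 is reserved for the ground object, which is ignored, so the
    remaining labels shift down by one.
    """
    if label < 1:
        raise ValueError("label 0 is the ground object; the torso label must be >= 1")
    return label - 1


# A single box labeled 1 gives index 0, matching the bundled example command.
print(torso_idx_from_label(1))
```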

You can first run without --rotation_idx to find the best angle for the input sequence, then re-run with that index to sample more motions from that view.
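The two-pass workflow can be scripted by assembling the command line programmatically. The sketch below uses only flags that appear in this README; the helper `build_generate_command` and its defaults are assumptions for illustration, not part of the repository.

```python
import shlex


def build_generate_command(hint_path, torso_idx, rotation_idx=None, text_prompt=None,
                           target_height=1.3, y_offset=0.2, output_dir="../results/"):
    """Build the argv list for sample.custom_generate_sequence (run from step2_motion/)."""
    cmd = [
        "python", "-m", "sample.custom_generate_sequence",
        "--model_path", "./ckpt/model.pt",
        "--output_dir", output_dir,
        "--hint_path", hint_path,
        "--target_height", str(target_height),
        "--y_offset", str(y_offset),
        "--torso_idx", str(torso_idx),
        "--no_vis_bbox",
    ]
    if text_prompt:
        cmd += ["--text_prompt", text_prompt]
    if rotation_idx is not None:
        # The README gives 0-9 as the valid range for the viewpoint index.
        if not 0 <= rotation_idx <= 9:
            raise ValueError("rotation_idx must be between 0 and 9")
        cmd += ["--rotation_idx", str(rotation_idx)]
    return cmd


hint = "../examples/ASingleBoxJumpping/bboxs.npy"
# Pass 1: no --rotation_idx, so the script produces its default set of rotated poses.
print(shlex.join(build_generate_command(hint, torso_idx=0, text_prompt="jump")))
# Pass 2: fix the rotation you liked best (3 here is only an example value).
print(shlex.join(build_generate_command(hint, torso_idx=0, text_prompt="jump", rotation_idx=3)))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from inside step2_motion/ once the s2motion environment is active.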

Step 3: Export to BVH (optional)

Convert the generated motion to BVH format for use in Blender or other 3D software:

python tools/get_bvh.py ./results/results.npy -o results/

Known Limitations

Inference speed. The full pipeline (reconstruction + diffusion sampling) is not yet real-time. I'm considering pushing performance toward real-time.
Normalization parameters require hand-tuning. The rotation of the input also influences results. The --target_height and --y_offset flags often need manual adjustment per input to get good results. I'm investigating the root cause (likely a suboptimal training scheme).
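Until the normalization issue is resolved, one way to make the hand-tuning less tedious is to enumerate a small grid of candidate values and try each. The sketch below only generates the flag combinations; the candidate values are arbitrary examples around the 1.3 / 0.2 used elsewhere in this README, and the helper name `normalization_grid` is hypothetical.

```python
from itertools import product


def normalization_grid(heights, offsets):
    """Return one list of CLI flags per (target_height, y_offset) combination."""
    return [
        ["--target_height", str(h), "--y_offset", str(o)]
        for h, o in product(heights, offsets)
    ]


# Example candidates (assumed values, not recommendations from the authors).
for flags in normalization_grid([1.1, 1.3, 1.5], [0.1, 0.2]):
    print(" ".join(flags))
```

Append each flag set to the Step 2 command, give every run its own --output_dir, and compare the results visually.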

Citation

If you find this work useful, please cite:

ToBeAdded

Training Data and Scripts

Coming soon.

Contact me if you are OK with a messy version.

License

This project is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
