CHI 2026 | Honorable Mention Award | Project Page
DancingBox captures motion from everyday physical objects (boxes, cups, etc.) using a single RGB camera (your phone, for example) and generates full-body character animation — no markers, suits, or depth sensors required.
I'm open for collaboration. You may contact [email protected] if interested.
The system runs in two stages:
-
Step 1 — Bounding Box Extraction (
step1_bbox/): Takes an input video and extracts 3D bounding boxes of physical proxy objects using SAM2 segmentation, CoTracker3 point tracking, and Pi3 monocular 3D reconstruction. -
Step 2 — Motion Generation (
step2_motion/): A conditional diffusion model takes the extracted bounding boxes as spatial hints and generates plausible full-body human motion (SMPL format).
The two pipeline steps use separate conda environments.
Step 1 — Bounding Box Extraction:
conda env create -f step1_bbox/environment.yml
conda activate s1bboxStep 2 — Motion Generation:
conda env create -f step2_motion/environment.yml
conda activate s2motionAll model files below are git-ignored (each > 50 MB). Download them before running the pipeline.
| File | Size | Download |
|---|---|---|
step1_bbox/sam2/checkpoints/sam2.1_hiera_tiny.pt |
149 MB | Meta official |
step1_bbox/sam2/checkpoints/sam2.1_hiera_small.pt |
176 MB | Meta official |
step1_bbox/sam2/checkpoints/sam2.1_hiera_base_plus.pt |
309 MB | Meta official |
step1_bbox/sam2/checkpoints/sam2.1_hiera_large.pt |
858 MB | Meta official |
cd step1_bbox/sam2/checkpoints
wget https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_tiny.pt
wget https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_small.pt
wget https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_base_plus.pt
wget https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt| File | Size | Download |
|---|---|---|
step1_bbox/co-tracker/checkpoints/scaled_offline.pth |
97 MB | HuggingFace |
step1_bbox/co-tracker/checkpoints/scaled_online.pth |
97 MB | HuggingFace |
cd step1_bbox/co-tracker/checkpoints
wget https://huggingface.co/facebook/cotracker3/resolve/main/scaled_offline.pth
wget https://huggingface.co/facebook/cotracker3/resolve/main/scaled_online.pthThe Pi3 model auto-downloads from HuggingFace (yyfz233/Pi3) at runtime — no manual download needed.
| File | Size | Download |
|---|---|---|
step2_motion/ckpt/model.pt |
168 MB | Google Drive |
Place the checkpoint at step2_motion/ckpt/model.pt before running Step 2.
Place your input video under step1_bbox/myvideos/, then run:
conda activate s1bbox
cd step1_bbox
python run_single_video.pyNote1: Use label
0to indicate the ground plane object in the interactive prompt. Start an X server if you wantsave_vis=Truein a headless environment.
Note2: When clicking, positive clicks will be automatically considered as negative clicks for other parts. Utilizing this feature saves a lot of clicks from my experience.
The output bounding boxes will be saved to step1_bbox/working_dir_<video_name>/bboxs/bboxs.npy.
You can optionally inspect the 3D bounding boxes in Blender using tools/vis_hints.blend (modify the bbox path in the Blender script, then run).
From step2_motion/, run the generation script with your extracted bounding boxes:
conda activate s2motion
cd step2_motion
python -m sample.custom_generate_sequence \
--model_path ./ckpt/model.pt \
--output_dir ../results/ \
--hint_path ../step1_bbox/working_dir_<video_name>/bboxs/bboxs.npy \
--text_prompt 'put your prompt here'\
--target_height 1.3 --y_offset 0.2 \
--torso_idx 2 \
--rotation_idx 3 \
--no_vis_bbox
# an example hint can be tested by
python -m sample.custom_generate_sequence --model_path ./ckpt/model.pt --num_repetitions 1 --output_dir ../results/ --hint_path ../examples/ASingleBoxJumpping/bboxs.npy --torso_idx 0 --target_height 1.3 --y_offset 0.2 --no_vis_bbox --no_dataset --text_prompt 'jump'--text_prompt: this is optional, sometimes the cubes are enough to express a motion.--target_height/--y_offset: normalization parameters that may need hand-tuning per input (see Known Limitations)--rotation_idx: camera viewpoint index (0–9) if you want multiple generation from a fixed rotation; the script generates 10 rotating poses without this parameter by default--torso_idx: index of the torso bounding box, if you use label n for the torso, you need to set this as n-1, since the ground object is ignored. You may also use the blender tool to check the torso cube's idx.- Remove
--no_vis_bboxto visualize bounding box alignment with the generated motion, this will take longer when saving videos.
You can first run without --rotation_idx to find best angle for the input sequence. Then you may re-run with the best idx to sample more motions from that view.
Convert the generated motion to BVH format for use in Blender or other 3D software:
python tools/get_bvh.py ./results/results.npy -o results/- Inference speed. The full pipeline (reconstruction + diffusion sampling) takes time. I'm considering pushing the performance towards real-time.
- Normalization parameters require hand-tuning. Different rotation influences results. The
--target_heightand--y_offsetflags often need manual adjustment per input to get good results. I'm investigating in the root cause (likely our training scheme is not optimal)
If you find this work useful, please cite:
ToBeAddedComing soon :
Contact me if you are ok with a messy version.
This project is licensed under the
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
