- [2026-02]: 🔥 We released training scripts for CodeDance.
- [2025-12]: 🤗 We released the CodeDance paper, project website, and the CodeDance-SFT and CodeDance-RL datasets.
- [2025-12]: 🚀 We introduced CodeDance, a dynamic tool-integrated MLLM that treats executable code as a general solver for visual reasoning.
CodeDance is a dynamic tool-integrated multimodal large language model that treats executable code as a general solver for visual reasoning.
CodeDance scales up multimodal tool-based reasoning by letting the model think, write code, execute it, and reflect in a single loop. Instead of relying on rigid, text-only pipelines, CodeDance:
- Plans & Composes: Dynamically decides when and how to invoke tools.
- Executes: Orchestrates visual-symbolic operations (crop, draw, count, plot) in a sandbox.
- Reflects: Uses intermediate visual evidence to guide subsequent reasoning.
This design yields transparent, self-checkable solutions to challenging visual search and reasoning tasks.
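The think, write code, execute, reflect loop described above can be sketched as a minimal driver. This is an illustrative sketch, not the released implementation: `model_step` and `execute_code` are hypothetical stand-ins for the model's generation step and the code sandbox.

```python
def run_codedance_loop(model_step, execute_code, max_turns=8):
    """Alternate model reasoning with sandboxed code execution.

    model_step(history) returns either {"code": ...} to invoke a tool
    or {"answer": ...} to stop; execute_code(code) returns the observation
    (stdout, a cropped image, a count, ...) fed back as visual evidence.
    """
    history = []
    for _ in range(max_turns):
        action = model_step(history)
        if "answer" in action:                 # model is confident: stop
            return action["answer"], history
        observation = execute_code(action["code"])      # run in sandbox
        history.append((action["code"], observation))   # evidence for reflection
    return None, history                        # turn budget exhausted
```

The accumulated `history` is what makes the solution transparent and self-checkable: every intermediate tool call and its output is preserved.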
The CodeDance pipeline consists of three stages:
We construct a 34k high-quality dataset of executable multi-turn trajectories to initialize the model.
- Weak-to-strong filtering: Pruning trivial cases that Qwen2.5-VL-7B already solves and stratifying the remainder by difficulty.
- Multi-turn atomic supervision: Decomposing hard cases into verifiable executable trajectories:
- Predefined visual operations
- Mathematical computation
- Open-ended operations
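The weak-to-strong filtering step can be sketched roughly as follows. The pass-rate signal and the threshold values are assumptions for illustration; the text only specifies that cases the weak model solves are pruned and the rest are stratified by difficulty.

```python
def stratify(samples, pass_rate, easy=0.9, hard=0.3):
    """Prune cases the weak model nearly always solves; bucket the rest.

    pass_rate(sample) -> fraction of weak-model attempts that succeed
    (here Qwen2.5-VL-7B would play the weak model).
    """
    kept = {"medium": [], "hard": []}
    for sample in samples:
        p = pass_rate(sample)
        if p >= easy:
            continue  # trivial for the weak model: prune
        kept["hard" if p <= hard else "medium"].append(sample)
    return kept
```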
We optimize with a composite reward mechanism, Balanced Adaptive Tool-call, which operates at two levels:
- Sequence-level: Difficulty-aware incentives to discourage redundant calls on easy problems.
- Turn-level: Immediate penalties for failed executions plus dense correction advantages.
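A hedged sketch of this two-level reward shaping is below. The weights and the exact functional form are assumptions; only the structure, a difficulty-aware sequence-level incentive plus turn-level execution penalties, follows the description above.

```python
def composite_reward(correct, difficulty, tool_calls, failed_calls,
                     call_cost=0.05, fail_penalty=0.1):
    """Illustrative Balanced Adaptive Tool-call reward.

    difficulty in [0, 1]: 0 = easy, 1 = hard. On easy problems each
    tool call is costly, discouraging redundant calls; on hard problems
    the cost vanishes, so the model is free to use tools.
    """
    # Sequence-level: correctness bonus minus difficulty-aware call cost.
    seq = (1.0 if correct else 0.0) - (1.0 - difficulty) * call_cost * tool_calls
    # Turn-level: immediate penalty for each failed execution.
    turn = -fail_penalty * failed_calls
    return seq + turn
```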
Without task-specific fine-tuning, CodeDance exhibits emergent capabilities beyond supervised primitives.
git clone https://github.com/CodeDance-VL/CodeDance.git
cd CodeDance
bash install.sh

| Dataset | Description | Size | Download |
|---|---|---|---|
| CodeDance-SFT | Executable multi-turn/single-turn trajectories for cold-start | 34k | HuggingFace |
| CodeDance-RL | Data for reinforcement learning optimization | 63k | HuggingFace |
CodeDance/
├── CodeDance-RL/
│   ├── data
│   │   ├── train-00000-of-00039.parquet
│   │   ├── train-00001-of-00039.parquet
│   │   ├── ...
│   │   └── train-00038-of-00039.parquet
│   └── README.md
└── scripts/
    └── train.sh
The RL dataset is formatted as follows:
{
"data_source": "DATA_SOURCE",
"prompt": [
{
"role": "system",
"content": "You are a helpful assistant.\n\nSolve the following problem step by step. You may write python code to assist with the user query. When an image is supplied, you can either use the preloaded PIL Image object `input_image` or access the image file directly via the **relative path** `'input_image.jpg'`."
},
{
"role": "user",
"content": "<image>Is the car on the left side of the person...."
}
],
"images": [
"<IMAGE_BYTES_OR_PATH>"
],
"ability": "ABILITIES",
"reward_model": {
"style": "rule",
"ground_truth": "No, the car is not on the left side of the person."
},
"extra_info": {
"split": "train",
"index": 0,
"answer": "No, the car is not on the left side of the person.",
"question": "Is the car on the left side of the person",
"need_tools_kwargs": true,
"tools_kwargs": {
"execute_python_code": {
"create_kwargs": {
"image": "<IMAGE_BYTES_OR_PATH>"
}
}
}
}
}

RL training scripts are provided in the examples/ directory.
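For a quick sanity check before training, records can be validated against the schema shown above. This is an illustrative helper, not part of the released scripts (in practice you would apply it to records loaded from the parquet shards):

```python
REQUIRED_KEYS = {"data_source", "prompt", "images", "ability",
                 "reward_model", "extra_info"}

def validate_record(record):
    """Check one RL training record against the schema shown above."""
    missing = REQUIRED_KEYS - set(record)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    roles = [turn["role"] for turn in record["prompt"]]
    if roles[:1] != ["system"] or "user" not in roles:
        raise ValueError("prompt must open with a system turn and include a user turn")
    if "ground_truth" not in record["reward_model"]:
        raise ValueError("reward_model needs a ground_truth field")
    return True
```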
vllm serve Qwen/Qwen2.5-72B-Instruct \
--port 18901 \
--gpu-memory-utilization 0.8 \
--max-model-len 32768 \
--tensor-parallel-size 8 \
--served-model-name "judge" \
--trust-remote-code \
--disable-log-requests \
    --host "::"

export LLM_AS_A_JUDGE_BASE="http://[Your_IP_here]:18901/v1"

On the master node:
ray start --head --port=<PORT>

On worker nodes (replace <HEAD_IP> with the head node's IP):
ray start --address=<HEAD_IP>:<PORT>

Note: You may need to modify the paths (e.g., PROJECT_DIR, PT_CKPT_PATH, data paths) in the shell scripts to match your local environment.
We provide an evaluation script in eval/eval.py. To run the evaluation, you need to first deploy both the judge model and the model to be evaluated as OpenAI-compatible APIs.
Step 1: Set up the Judge Model API

Ensure the judge model API is running (as described in Step 1 of RL Training) and set the environment variable:

export LLM_AS_A_JUDGE_BASE="http://[Your_IP_here]:18901/v1"

Step 2: Deploy the Evaluated Model

Deploy the model you want to evaluate using vLLM or a similar serving engine:
vllm serve /path/to/your/model \
--port 18902 \
--gpu-memory-utilization 0.8 \
--max-model-len 32768 \
--trust-remote-code \
    --disable-log-requests

Step 3: Run the Evaluation Script

Execute the evaluation script with the corresponding API endpoints:
python eval/eval.py \
--model_name "CodeDance-7B" \
--api_url "http://127.0.0.1:18902/v1" \
--data_path "/Path/Eval_data.parquet" \
--save_path "./save/" \
--num_workers 8 \
    --data_source "default"

- `--model_name`: Name of the model for saving results.
- `--api_url`: API URL(s) of the model being evaluated. Supports a comma-separated list for load balancing.
- `--data_path`: Path to the evaluation dataset in `.parquet` format.
- `--save_path`: Directory to save the evaluation results and statistics.
- `--data_source`: Dataset type for specific scoring logic.
The script will output two files in the save_path directory:
- `result_{model_name}_{dataset}.jsonl`: Detailed results for each sample, including predicted answers, multi-modal context, and tool execution history.
- `stats_{model_name}.json`: Summary statistics, including accuracy (ACC), judge scores, and code execution success rates.
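As an illustration, summary accuracy can be recomputed offline from the per-sample `.jsonl` file. The `correct` field name is an assumption about the result format, not a documented field:

```python
import json

def summarize_results(jsonl_lines):
    """Recompute accuracy from per-sample result lines (one JSON object each).

    Assumes each record carries a boolean "correct" field; adapt the key
    to the actual result schema if it differs.
    """
    records = [json.loads(line) for line in jsonl_lines]
    n = len(records)
    correct = sum(1 for r in records if r.get("correct"))
    return {"n": n, "acc": correct / n if n else 0.0}
```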
If you find our work helpful, please cite:
@article{song2025codedance,
title={CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning},
author={Song, Qi and Li, Honglin and Yu, Yingchen and Zhou, Haoyi and Yang, Lin and Bai, Song and She, Qi and Huang, Zilong and Zhao, Yunqing},
journal={arXiv preprint arXiv:2512.17312},
year={2025}
}

CodeDance is built upon excellent open-source works:
Related Projects:


