TVP is a visual programming framework that builds reusable tools from its own problem-solving experience. TVP implements a closed-loop paradigm via two interconnected libraries: an Example Library that accumulates program solutions as experience, and a Tool Library that maintains functions abstracted from those programs. These dual libraries enable a circular program-tool-program cycle: solving problems generates experience, experience guides tool creation, and newly created tools improve future problem-solving.
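The cycle above can be sketched in a few lines of Python. All names here are illustrative stand-ins for the idea, not the actual TVP API:

```python
# Toy sketch of the program-tool-program cycle (hypothetical names, not TVP's API).

class ExampleLibrary:
    """Accumulates program solutions as experience."""
    def __init__(self):
        self.programs = []

    def add(self, program):
        self.programs.append(program)

class ToolLibrary:
    """Maintains functions abstracted from accumulated programs."""
    def __init__(self):
        self.tools = set()

    def abstract_from(self, examples):
        # "Abstraction" here: promote any step that recurs across solutions to a tool.
        counts = {}
        for program in examples.programs:
            for step in program:
                counts[step] = counts.get(step, 0) + 1
        self.tools |= {step for step, n in counts.items() if n >= 2}

examples = ExampleLibrary()
tools = ToolLibrary()

# Two "solutions" share a common step, so that step becomes a reusable tool.
examples.add(["detect_objects", "measure_depth"])
examples.add(["detect_objects", "count"])
tools.abstract_from(examples)
print(sorted(tools.tools))  # -> ['detect_objects']
```

In the real system, abstraction is driven by the clustering and quality thresholds described in the usage sections below rather than by simple step counting.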
- [2026-01-06] Check out our X thread
- [2025-12-24] Code released.
- [2025-12-23] Paper available on arXiv.
Create a conda environment and run the setup script:
```bash
conda create -n tvp python=3.10 -y
conda activate tvp
bash env_setup.sh
```

Create a file named `api.key` in the project root with your OpenAI API key:
```bash
echo "your-openai-api-key" > api.key
```

Omni3D-Bench contains 501 challenging 3D spatial reasoning questions on real-world indoor scenes. Download and prepare the dataset:
```bash
mkdir -p data
huggingface-cli download dmarsili/Omni3D-Bench --repo-type dataset --include "omni3d-bench.zip" --local-dir ./data
unzip ./data/omni3d-bench.zip -d ./data
rm -rf ./data/.cache ./data/omni3d-bench.zip
```

After setup, your data directory should look like:
```
data/
└── omni3d-bench/
    ├── images/
    │   └── ARKitScenes/
    │       └── Training/
    │           └── .../*.jpg
    └── annotations.json
```
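A short script along these lines can sanity-check the layout before launching a run (the paths are taken from the tree above; this is not an official verification script):

```python
import os

def check_layout(root="data/omni3d-bench"):
    """Return the list of expected dataset paths that are missing."""
    expected = [
        os.path.join(root, "images"),            # image folders (e.g. ARKitScenes)
        os.path.join(root, "annotations.json"),  # question annotations
    ]
    return [p for p in expected if not os.path.exists(p)]

if __name__ == "__main__":
    missing = check_layout()
    print("dataset layout OK" if not missing else f"missing: {missing}")
```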
To quickly see how TVP works, run a toy subset on 10 datapoints with relaxed thresholds. This demonstrates all pipeline components—example accumulation, tool abstraction, and deduplication—in a short run:
```bash
python main.py \
    --resume tvp_toy_subset \
    --num-questions 10 \
    --num-iterations 2 \
    --programs-per-question 2 \
    --quality-threshold 8.0 \
    --similarity-threshold 0.5 \
    --abstraction-interval 1 \
    --clustering-threshold 0.5 \
    --min-cluster-size 2 \
    --abstraction-potential-threshold 5.0 \
    --min-correctness-rate 0.5 \
    --deduplication-interval 1 \
    --deduplication-threshold 0.6 \
    --save-exe-trace
```

Results will be saved to `results/tvp_toy_subset/`. The `--save-exe-trace` flag generates HTML visualizations of each program execution for inspection.
To experiment on the full Omni3D-Bench dataset:
```bash
python main.py \
    --resume tvp_omni3d_full \
    --num-questions -1 \
    --num-iterations 3 \
    --programs-per-question 4 \
    --quality-threshold 8.5 \
    --similarity-threshold 0.8 \
    --abstraction-interval 1 \
    --clustering-threshold 0.8 \
    --min-cluster-size 4 \
    --abstraction-potential-threshold 9.0 \
    --min-correctness-rate 0.85 \
    --deduplication-interval 1 \
    --deduplication-threshold 0.95
```

Estimated cost: ~$80 per iteration with GPT-4o + o4-mini.
See `python main.py --help` for all available configuration options.
TVP automatically saves checkpoints after processing each question. To resume an interrupted run:
```bash
python main.py --resume your_run_id
```

State files are stored in `results/your_run_id/state/`.
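The save-and-resume pattern is roughly the following (a sketch only; TVP's actual state schema and filenames may differ):

```python
import json
import os

def save_state(run_id, state, root="results"):
    # Persist run state after each question so an interrupted run can resume.
    state_dir = os.path.join(root, run_id, "state")
    os.makedirs(state_dir, exist_ok=True)
    with open(os.path.join(state_dir, "state.json"), "w") as f:
        json.dump(state, f)

def load_state(run_id, root="results"):
    # Return the saved state if present, else None (start a fresh run).
    path = os.path.join(root, run_id, "state", "state.json")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```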
Evaluation is a two-step process: (1) extract results from the TVP pipeline run, and (2) compute metrics.
After running TVP, extract solutions and associated metadata into a single JSON file:
```bash
python eval/extract_results.py --run-id tvp_omni3d_full
```

This reads from `results/tvp_omni3d_full/` and outputs to `eval/extracted_results/tvp_omni3d_full.json`.
Evaluate metrics on the extracted results:
```bash
python eval/evaluate_metrics.py --run-id tvp_omni3d_full
```

This computes per-iteration metrics and saves detailed results to `eval/metrics/tvp_omni3d_full/`.
The evaluation reports the following metrics by question type:
| Question Type | Metric |
|---|---|
| Yes/No | Accuracy |
| Multiple Choice | Accuracy |
| Count | Accuracy (exact match) |
| Float | Accuracy (±10% relative error) |
| Float | MRA (Mean Relative Accuracy across thresholds 5%–50%) |
| Overall | Accuracy (using Float Accuracy ±10%) |
Results are saved as both a summary JSON and per-iteration CSV files with details on each question.
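The two Float metrics can be illustrated with a short sketch. One plausible formulation is shown below, assuming the MRA thresholds run from 5% to 50% in 5% steps (the step size is not pinned down in the table above):

```python
def relative_error(pred, gt):
    """Relative error of a float prediction against the ground truth."""
    return abs(pred - gt) / abs(gt)

def float_accuracy(pred, gt, tol=0.10):
    """Float Accuracy: prediction within +/-10% relative error."""
    return relative_error(pred, gt) <= tol

def mra(pred, gt):
    """Mean Relative Accuracy: fraction of thresholds (assumed 5%..50%,
    step 5%) at which the prediction is within the threshold."""
    thresholds = [t / 100 for t in range(5, 55, 5)]  # 0.05, 0.10, ..., 0.50
    return sum(relative_error(pred, gt) <= t for t in thresholds) / len(thresholds)

print(float_accuracy(9.2, 10.0))  # 8% relative error -> True
print(mra(9.2, 10.0))             # within 9 of 10 thresholds -> 0.9
```

MRA rewards predictions by how tightly they bracket the ground truth, rather than scoring a single hard cutoff.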
If you find TVP useful, please consider citing our paper:
```bibtex
@article{wu2025transductive,
  title={Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning},
  author={Wu, Shengguang and Wang, Xiaohan and Zhang, Yuhui and Zhu, Hao and Yeung-Levy, Serena},
  journal={arXiv preprint arXiv:2512.20934},
  year={2025}
}
```

TVP builds upon several open-source projects. We thank the authors of VADAR, GroundingDINO, UniDepth, and FlagEmbedding for their codebases.
