TVP is a visual programming framework that builds reusable tools from its own problem-solving experience. TVP implements a closed-loop paradigm via two interconnected libraries: an Example Library that accumulates program solutions as experience, and a Tool Library that maintains functions abstracted from those programs. These dual libraries enable a circular program-tool-program cycle: solving problems generates experience, experience guides tool creation, and newly created tools improve future problem-solving.
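The cycle above can be sketched in a few lines of Python. All names here are illustrative stand-ins for the idea, not the actual TVP API:

```python
# Toy sketch of the program-tool-program cycle (hypothetical names, not TVP's API).

class ExampleLibrary:
    """Accumulates program solutions as experience."""
    def __init__(self):
        self.programs = []

    def add(self, program):
        self.programs.append(program)

class ToolLibrary:
    """Maintains functions abstracted from accumulated programs."""
    def __init__(self):
        self.tools = set()

    def abstract_from(self, examples):
        # "Abstraction" here: promote any step that recurs across solutions to a tool.
        counts = {}
        for program in examples.programs:
            for step in program:
                counts[step] = counts.get(step, 0) + 1
        self.tools |= {step for step, n in counts.items() if n >= 2}

examples = ExampleLibrary()
tools = ToolLibrary()

# Two "solutions" share a common step, so that step becomes a reusable tool.
examples.add(["detect_objects", "measure_depth"])
examples.add(["detect_objects", "count"])
tools.abstract_from(examples)
print(sorted(tools.tools))  # -> ['detect_objects']
```

In the real system, abstraction is driven by the clustering and quality thresholds described in the usage sections below rather than by simple step counting.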
- [2026-01-06] Check out our X thread
- [2025-12-24] Code released.
- [2025-12-23] Paper available on arXiv.
Create a conda environment and run the setup script:
```bash
conda create -n tvp python=3.10 -y
conda activate tvp
bash env_setup.sh
```

Create a file named `api.key` in the project root with your OpenAI API key:
```bash
echo "your-openai-api-key" > api.key
```

Omni3D-Bench contains 501 challenging 3D spatial reasoning questions on real-world indoor scenes. Download and prepare the dataset:
```bash
mkdir -p data
huggingface-cli download dmarsili/Omni3D-Bench --repo-type dataset --include "omni3d-bench.zip" --local-dir ./data
unzip ./data/omni3d-bench.zip -d ./data
rm -rf ./data/.cache ./data/omni3d-bench.zip
```

After setup, your data directory should look like:
```
data/
└── omni3d-bench/
    ├── images/
    │   └── ARKitScenes/
    │       └── Training/
    │           └── .../*.jpg
    └── annotations.json
```
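A short script along these lines can sanity-check the layout before launching a run (the paths are taken from the tree above; this is not an official verification script):

```python
import os

def check_layout(root="data/omni3d-bench"):
    """Return the list of expected dataset paths that are missing."""
    expected = [
        os.path.join(root, "images"),            # image folders (e.g. ARKitScenes)
        os.path.join(root, "annotations.json"),  # question annotations
    ]
    return [p for p in expected if not os.path.exists(p)]

if __name__ == "__main__":
    missing = check_layout()
    print("dataset layout OK" if not missing else f"missing: {missing}")
```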
To quickly see how TVP works, run a toy subset on 10 datapoints with relaxed thresholds. This demonstrates all pipeline components—example accumulation, tool abstraction, and deduplication—in a short run:
```bash
python main.py \
    --resume tvp_toy_subset \
    --num-questions 10 \
    --num-iterations 2 \
    --programs-per-question 2 \
    --quality-threshold 8.0 \
    --similarity-threshold 0.5 \
    --abstraction-interval 1 \
    --clustering-threshold 0.5 \
    --min-cluster-size 2 \
    --abstraction-potential-threshold 5.0 \
    --min-correctness-rate 0.5 \
    --deduplication-interval 1 \
    --deduplication-threshold 0.6 \
    --save-exe-trace
```

Results will be saved to `results/tvp_toy_subset/`. The `--save-exe-trace` flag generates HTML visualizations of each program execution for inspection.
To experiment on the full Omni3D-Bench dataset:
```bash
python main.py \
    --resume tvp_omni3d_full \
    --num-questions -1 \
    --num-iterations 3 \
    --programs-per-question 4 \
    --quality-threshold 8.5 \
    --similarity-threshold 0.8 \
    --abstraction-interval 1 \
    --clustering-threshold 0.8 \
    --min-cluster-size 4 \
    --abstraction-potential-threshold 9.0 \
    --min-correctness-rate 0.85 \
    --deduplication-interval 1 \
    --deduplication-threshold 0.95
```

Estimated cost: ~$80 per iteration with GPT-4o + o4-mini.
See `python main.py --help` for all available configuration options.
TVP automatically saves checkpoints after processing each question. To resume an interrupted run:
```bash
python main.py --resume your_run_id
```

State files are stored in `results/your_run_id/state/`.
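The save-and-resume pattern is roughly the following (a sketch only; TVP's actual state schema and filenames may differ):

```python
import json
import os

def save_state(run_id, state, root="results"):
    # Persist run state after each question so an interrupted run can resume.
    state_dir = os.path.join(root, run_id, "state")
    os.makedirs(state_dir, exist_ok=True)
    with open(os.path.join(state_dir, "state.json"), "w") as f:
        json.dump(state, f)

def load_state(run_id, root="results"):
    # Return the saved state if present, else None (start a fresh run).
    path = os.path.join(root, run_id, "state", "state.json")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```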
Evaluation is a two-step process: (1) extract results from the TVP pipeline run, and (2) compute metrics.
After running TVP, extract solutions and associated metadata into a single JSON file:
```bash
python eval/extract_results.py --run-id tvp_omni3d_full
```

This reads from `results/tvp_omni3d_full/` and outputs to `eval/extracted_results/tvp_omni3d_full.json`.
Evaluate metrics on the extracted results:
```bash
python eval/evaluate_metrics.py --run-id tvp_omni3d_full
```

This computes per-iteration metrics and saves detailed results to `eval/metrics/tvp_omni3d_full/`.
The evaluation reports the following metrics by question type:
| Question Type | Metric |
|---|---|
| Yes/No | Accuracy |
| Multiple Choice | Accuracy |
| Count | Accuracy (exact match) |
| Float | Accuracy (±10% relative error) |
| Float | MRA (Mean Relative Accuracy across thresholds 5%–50%) |
| Overall | Accuracy (using Float Accuracy ±10%) |
Results are saved as both a summary JSON and per-iteration CSV files with details on each question.
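The two Float metrics can be illustrated with a short sketch. One plausible formulation is shown below, assuming the MRA thresholds run from 5% to 50% in 5% steps (the step size is not pinned down in the table above):

```python
def relative_error(pred, gt):
    """Relative error of a float prediction against the ground truth."""
    return abs(pred - gt) / abs(gt)

def float_accuracy(pred, gt, tol=0.10):
    """Float Accuracy: prediction within +/-10% relative error."""
    return relative_error(pred, gt) <= tol

def mra(pred, gt):
    """Mean Relative Accuracy: fraction of thresholds (assumed 5%..50%,
    step 5%) at which the prediction is within the threshold."""
    thresholds = [t / 100 for t in range(5, 55, 5)]  # 0.05, 0.10, ..., 0.50
    return sum(relative_error(pred, gt) <= t for t in thresholds) / len(thresholds)

print(float_accuracy(9.2, 10.0))  # 8% relative error -> True
print(mra(9.2, 10.0))             # within 9 of 10 thresholds -> 0.9
```

MRA rewards predictions by how tightly they bracket the ground truth, rather than scoring a single hard cutoff.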
If you find TVP useful, please consider citing our paper:
```bibtex
@article{wu2025transductive,
  title={Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning},
  author={Wu, Shengguang and Wang, Xiaohan and Zhang, Yuhui and Zhu, Hao and Yeung-Levy, Serena},
  journal={arXiv preprint arXiv:2512.20934},
  year={2025}
}
```

TVP builds upon several open-source projects. We thank the authors of VADAR, GroundingDINO, UniDepth, and FlagEmbedding for their codebases.
