Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering
Noah Frahm, Prakrut Patel, Yue Zhang, Shoubin Yu, Mohit Bansal, Roni Sengupta
This is the official repository of Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering.
- [2025/11] Paper is on arXiv.
Set up the environment (Linux, Python 3.9):

```shell
conda create -n ptp python=3.9 -y && conda activate ptp
conda install -c conda-forge -y "numpy=1.26.3"
conda config --env --append pinned_packages "numpy=1.26.3"
pip install -c constraints.txt torch==2.2.2+cu121 torchvision==0.17.2+cu121 \
    --index-url https://download.pytorch.org/whl/cu121
conda install -c conda-forge -c aihabitat -y habitat-sim=0.2.5 headless faiss-cpu=1.7.4
conda install -y https://anaconda.org/pytorch3d/pytorch3d/0.7.8/download/linux-64/pytorch3d-0.7.8-py39_cu121_pyt222.tar.bz2
```
```shell
pip install -r requirements.txt -c constraints.txt
```

Please download the train and val splits of HM3D and specify the path in cfg/eval_aeqa_template.yaml. For example, if your download path is /your_path/hm3d/, containing /your_path/hm3d/train/ and /your_path/hm3d/val/, set scene_data_path in the config file to /your_path/hm3d/.
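For instance, the corresponding entry in cfg/eval_aeqa_template.yaml would look like the excerpt below (only the field named above is shown; the actual template contains other settings):

```yaml
# cfg/eval_aeqa_template.yaml (excerpt)
scene_data_path: /your_path/hm3d/
```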
The test questions of A-EQA are provided in the data/ folder. We provide two subsets of different sizes: aeqa_questions-41.json and aeqa_questions-573.json. aeqa_questions-573.json is the official A-EQA subset provided by OpenEQA, while aeqa_questions-41.json is a smaller subset for quick evaluation.
First, run the following script to generate predictions for the A-EQA dataset:

```shell
bash eval.sh --cf cfg/eval_aeqa_template.yaml
```

To split tasks, you can add --start_ratio and --end_ratio to specify the range of tasks to evaluate. For example, to evaluate the first half of the dataset, run:

```shell
bash eval.sh --cf cfg/eval_aeqa_template.yaml --start_ratio 0.0 --end_ratio 0.5
```

After the scripts finish, the results from all splits are automatically aggregated and saved.
The default evaluation config saves visualization results, including top-down maps, egocentric views, memory snapshots, and frontier snapshots at each step. Although these visualizations are very helpful, saving them can slow down evaluation. Set save_visualization to false if you would like to run a large-scale evaluation without visuals.
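For example, in the evaluation config (assuming the flag is a boolean entry, as the text suggests):

```yaml
# cfg/eval_aeqa_template.yaml (excerpt)
save_visualization: false
```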
Download the calibration trajectories from the Google Drive link here and unzip them in the data/ directory. We provide our dataset of annotated frontiers in data/annotated_frontier_data.json.
To calibrate, you will need to spin up a VLM inference server. We currently support the following models and provide a helper script to do this:
- Qwen2.5-VL 7B Instruct
- Qwen2.5-VL 32B Instruct
- Qwen3-VL 30B A3B Instruct
For example, to spin up Qwen3-VL 30B on port 8000, run:

```shell
python -m models.Qwen.Qwen_server -m 3_30 --host localhost -p 8000
```

For additional argument information, run:

```shell
python -m models.Qwen.Qwen_server -h
```

We use Qwen VL models for our evaluation. If you want to calibrate a different model, refer to the Qwen_VL_Server class and the process_payload method in models/Qwen/Qwen_server.py for implementation details, and update the model initialization and prompt formatting for your model.
To generate model confidence values and save the ECDF for your own VLM, run:

```shell
python -m src.calibration -cf cfg/offline-calibration.yaml
```

You can update the following settings in the config file (cfg/offline-calibration.yaml):
- General settings
  - `calibration_trajectories_dir`: location of calibration trajectories (default: `data/calibration_trajectories`)
- VLM settings
  - `vlm_host_name`: inference server hostname/IP (default: `localhost`)
  - `vlm_port`: inference server port (default: `8000`)
- ECDF save settings
  - `output_dir`: ECDF save directory (default: `models/Holm-Bonferroni`)
  - `output_name`: ECDF save name (default: `Holm-Bonferroni-Offline`)
  - `noise`: noise added to annotated data (default: `0.0`)
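The saved ECDF is what maps a raw VLM confidence score to a calibrated, step-level value. As background only (this is not the repository's implementation), an empirical CDF fitted on calibration-set scores sends a new score to the fraction of calibration scores at or below it; a minimal sketch with synthetic scores:

```python
import numpy as np

def fit_ecdf(calibration_scores):
    """Return a function mapping a raw score to its empirical quantile."""
    sorted_scores = np.sort(np.asarray(calibration_scores, dtype=float))
    n = len(sorted_scores)

    def ecdf(score):
        # Fraction of calibration scores <= score (right-continuous ECDF).
        return np.searchsorted(sorted_scores, score, side="right") / n

    return ecdf

# Raw confidence values collected on calibration trajectories (synthetic here).
calib = [0.2, 0.4, 0.4, 0.7, 0.9]
ecdf = fit_ecdf(calib)
print(ecdf(0.4))   # 3 of 5 calibration scores are <= 0.4 -> 0.6
print(ecdf(0.95))  # all 5 calibration scores are <= 0.95 -> 1.0
```

Mapping scores through the ECDF makes confidence values comparable across models, since each is reinterpreted relative to that model's own calibration distribution.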
The codebase is heavily built upon 3D-Mem, OpenEQA, Explore-EQA, and ConceptGraph. We thank the authors for their great work.
```bibtex
@misc{frahm2025prunethenplansteplevelcalibrationstable,
      title={Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering},
      author={Noah Frahm and Prakrut Patel and Yue Zhang and Shoubin Yu and Mohit Bansal and Roni Sengupta},
      year={2025},
      eprint={2511.19768},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.19768},
}
```