Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering
Noah Frahm, Prakrut Patel, Yue Zhang, Shoubin Yu, Mohit Bansal, Roni Sengupta
This is the official repository of Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering.
- [2025/11] Paper is on arXiv.
Set up the environment (Linux, Python 3.9):

```shell
conda create -n ptp python=3.9 -y && conda activate ptp
conda install -c conda-forge -y "numpy=1.26.3"
conda config --env --append pinned_packages "numpy=1.26.3"
pip install -c constraints.txt torch==2.2.2+cu121 torchvision==0.17.2+cu121 \
    --index-url https://download.pytorch.org/whl/cu121
conda install -c conda-forge -c aihabitat -y habitat-sim=0.2.5 headless faiss-cpu=1.7.4
conda install -y https://anaconda.org/pytorch3d/pytorch3d/0.7.8/download/linux-64/pytorch3d-0.7.8-py39_cu121_pyt222.tar.bz2
```
```shell
pip install -r requirements.txt -c constraints.txt
```

Please download the train and val splits of HM3D and specify the path in cfg/eval_aeqa_template.yaml. For example, if your download path is /your_path/hm3d/, containing /your_path/hm3d/train/ and /your_path/hm3d/val/, set scene_data_path in the config file to /your_path/hm3d/.
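For instance, the corresponding entry in cfg/eval_aeqa_template.yaml would look like the excerpt below (only the field named above is shown; the actual template contains other settings):

```yaml
# cfg/eval_aeqa_template.yaml (excerpt)
scene_data_path: /your_path/hm3d/
```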
The test questions of A-EQA are provided in the data/ folder. We provide two subsets of different sizes: aeqa_questions-41.json and aeqa_questions-573.json. aeqa_questions-573.json is the official A-EQA subset provided by OpenEQA, while aeqa_questions-41.json is a smaller subset for quick evaluation.
First, run the following script to generate predictions for the A-EQA dataset:

```shell
bash eval.sh --cf cfg/eval_aeqa_template.yaml
```

To split tasks, you can add --start_ratio and --end_ratio to specify the range of tasks to evaluate. For example, to evaluate the first half of the dataset, run:

```shell
bash eval.sh --cf cfg/eval_aeqa_template.yaml --start_ratio 0.0 --end_ratio 0.5
```

After the scripts finish, the results from all splits are automatically aggregated and saved.
The default evaluation config saves visualization results, including top-down maps, egocentric views, memory snapshots, and frontier snapshots at each step. Although these visualizations are very helpful, saving them can slow down evaluation. Set save_visualization to false if you would like to run a large-scale evaluation without visuals.
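For example, in the evaluation config (assuming the flag is a boolean entry, as the text suggests):

```yaml
# cfg/eval_aeqa_template.yaml (excerpt)
save_visualization: false
```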
Download the calibration trajectories from the Google Drive link here and unzip them in the data/ directory. We provide our dataset of annotated frontiers in data/annotated_frontier_data.json.
To calibrate, you will need to spin up a VLM inference server. We currently support the following models and provide a helper script to do this:
- Qwen2.5-VL 7B Instruct
- Qwen2.5-VL 32B Instruct
- Qwen3-VL 30B A3B Instruct
For example, to spin up Qwen3-VL 30B on port 8000, run:

```shell
python -m models.Qwen.Qwen_server -m 3_30 --host localhost -p 8000
```

For additional argument information, run:

```shell
python -m models.Qwen.Qwen_server -h
```

We use Qwen VL models for our evaluation. If you want to calibrate a different model, refer to the Qwen_VL_Server class and the process_payload method in models/Qwen/Qwen_server.py for implementation details, and update the model initialization and prompt formatting for your model.
To generate model confidence values and save the ECDF for your own VLM, run:

```shell
python -m src.calibration -cf cfg/offline-calibration.yaml
```

You can update the following settings in the config file (cfg/offline-calibration.yaml):
- General settings
  - `calibration_trajectories_dir`: location of calibration trajectories (default: `data/calibration_trajectories`)
- VLM settings
  - `vlm_host_name`: inference server hostname/IP (default: `localhost`)
  - `vlm_port`: inference server port (default: `8000`)
- ECDF save settings
  - `output_dir`: ECDF save directory (default: `models/Holm-Bonferroni`)
  - `output_name`: ECDF save name (default: `Holm-Bonferroni-Offline`)
  - `noise`: noise added to annotated data (default: `0.0`)
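The saved ECDF is what maps a raw VLM confidence score to a calibrated, step-level value. As background only (this is not the repository's implementation), an empirical CDF fitted on calibration-set scores sends a new score to the fraction of calibration scores at or below it; a minimal sketch with synthetic scores:

```python
import numpy as np

def fit_ecdf(calibration_scores):
    """Return a function mapping a raw score to its empirical quantile."""
    sorted_scores = np.sort(np.asarray(calibration_scores, dtype=float))
    n = len(sorted_scores)

    def ecdf(score):
        # Fraction of calibration scores <= score (right-continuous ECDF).
        return np.searchsorted(sorted_scores, score, side="right") / n

    return ecdf

# Raw confidence values collected on calibration trajectories (synthetic here).
calib = [0.2, 0.4, 0.4, 0.7, 0.9]
ecdf = fit_ecdf(calib)
print(ecdf(0.4))   # 3 of 5 calibration scores are <= 0.4 -> 0.6
print(ecdf(0.95))  # all 5 calibration scores are <= 0.95 -> 1.0
```

Mapping scores through the ECDF makes confidence values comparable across models, since each is reinterpreted relative to that model's own calibration distribution.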
The codebase is heavily built upon 3D-Mem, OpenEQA, Explore-EQA, and ConceptGraph. We thank the authors for their great work.
```bibtex
@misc{frahm2025prunethenplansteplevelcalibrationstable,
      title={Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering},
      author={Noah Frahm and Prakrut Patel and Yue Zhang and Shoubin Yu and Mohit Bansal and Roni Sengupta},
      year={2025},
      eprint={2511.19768},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.19768},
}
```