We propose DiVE-k (Differential Visual rEasoning using top-k generations), a framework addressing a key weakness in large vision-language models: their struggle with fine-grained distinctions. Our analysis reveals that world knowledge alone is not enough; the model needs to learn how to apply it with precision. DiVE-k treats the base model's top-k generations, obtained via
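As a rough illustration of the top-k idea (the label names and scores below are hypothetical, not from the paper), the candidates to reason over are simply the highest-scoring labels in the model's output distribution:

```python
import heapq

def topk_candidates(label_scores, k=5):
    """Return the k highest-scoring labels from a label -> score mapping."""
    return heapq.nlargest(k, label_scores, key=label_scores.get)

# Hypothetical fine-grained bird scores from a VLM's output distribution.
scores = {
    "Laysan Albatross": 0.41,
    "Black-footed Albatross": 0.33,
    "Sooty Albatross": 0.12,
    "Northern Fulmar": 0.09,
    "Herring Gull": 0.05,
}
print(topk_candidates(scores, k=3))
# -> ['Laysan Albatross', 'Black-footed Albatross', 'Sooty Albatross']
```

Fine-grained errors concentrate in exactly these near-miss candidates, which is what makes them useful training signal.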
Please follow this link to prepare the datasets. For CUB, download the images and annotations from here.
We also provide pre-processed data in this HF collection for the Qwen2.5-VL-7B model.
Pretrained models are released in this HF collection.
We recommend using Docker to ensure a consistent environment.
Clone the repository:
git clone https://github.com/raja-kumar/DiVE-k
cd DiVE-k
Build the image:
docker build --build-arg CACHE_BUSTER=$(date +%s) -t dive_k .

Run the container:
Replace the placeholders (e.g., /path/to/...) with your local paths.
docker run --rm --gpus all --shm-size=10g \
-v /path/to/local/repo:/app/DiVE-k \
-v /path/to/data_root:/data2/ \
-v /path/to/hf_cache:/root/.cache/huggingface/ \
-v /path/to/saved_models:/app/saved_models \
-it dive_k bash

Data must be arranged in the following format under your data_root:
<data_root>/
└── <dataset_name>/ # e.g., CUB_200_2011
├── zero_shot/
│ ├── subsample_{split}.json # e.g., subsample_base_train.json
│ ├── base_categories.txt
│ └── new_categories.txt
└── fewshot/
├── 4_shots_all_train.json
└── all_categories.txt
(Note: You can skip this step if you are using our provided MCQ data).
Step 1: Add Prompts
Add or modify your prompts in prompts.py.
Step 2: Run the Generator
Navigate to the generation directory inside the container and run the script:
cd /app/DiVE-k/topk-mcq-data
python generate_mcq.py --mcq_type qwen --data <DATASET_NAME> --split base --phase train

(Example usage: --data CUB_200_2011)
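The MCQ construction step can be sketched as follows. This is an illustrative reimplementation, not the repo's `generate_mcq.py`: the lettered-option format and the way the ground truth is injected are assumptions.

```python
import random

def build_mcq(candidates, answer, seed=0):
    """Form a multiple-choice question from top-k candidate labels.

    `candidates` are the model's top-k generations; `answer` is the
    ground-truth label, inserted if the model missed it entirely.
    """
    options = list(dict.fromkeys(candidates))  # dedupe, keep order
    if answer not in options:
        options[-1] = answer  # ensure the correct label is present
    rng = random.Random(seed)
    rng.shuffle(options)
    letters = "ABCDEFGH"[: len(options)]
    question = "Which category best matches the image?\n" + "\n".join(
        f"{l}. {o}" for l, o in zip(letters, options)
    )
    return {"question": question, "answer": letters[options.index(answer)]}

mcq = build_mcq(
    ["Laysan Albatross", "Sooty Albatross", "Northern Fulmar", "Herring Gull"],
    answer="Laysan Albatross",
)
print(mcq["question"])
print("Correct option:", mcq["answer"])
```

Because the distractors come from the model's own near-miss generations, the resulting questions target exactly the fine-grained confusions the model makes.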
Step 3: Build a HuggingFace Dataset
Convert the generated JSON to a HuggingFace dataset:
python build_hf_dataset.py <data_root>/<dataset_name>/qwen_mcq/subsample_base_train_qwen_mcq.json
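Conceptually, this conversion amounts to loading the generated JSON and keeping the training-relevant fields. A stdlib-only sketch of the loading step (the field names are illustrative, not the repo's actual schema):

```python
import json

def load_mcq_records(path):
    """Load generated MCQ entries from JSON, keeping only the fields needed
    for training. Field names here are illustrative assumptions."""
    with open(path) as f:
        entries = json.load(f)
    return [
        {"image": e["image"], "question": e["question"], "answer": e["answer"]}
        for e in entries
    ]
```

The repo's `build_hf_dataset.py` then presumably wraps such records with the `datasets` library; consult that script for the actual field schema.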
To train the model, use the sample training script:
- Open scripts/others/test.sh.
- Modify the variables DATA_PATH, SAVE_PATH, and RUN_NAME to match your specific experiment.
- Run the script:

bash scripts/others/test.sh

We provide evaluation scripts for both one-step and two-step inference. Modify the paths in the bash scripts and run:
Two Step Inference
cd eval
./two_step_inference.sh
One Step Inference
./one_step_inference.sh
After inference completes, the output is saved as a JSON file. Use the LLM eval script to compute accuracy:
./run_llm_eval.sh
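For intuition, accuracy over such a predictions file could be computed with a simple exact-match check; note the repo's `run_llm_eval.sh` uses an LLM judge instead, and the field names below are assumptions about the output JSON:

```python
import json

def exact_match_accuracy(path):
    """Fraction of records whose prediction equals the ground truth.

    Assumes a JSON list of {"prediction": ..., "ground_truth": ...} records;
    these field names are an assumption about the inference output.
    """
    with open(path) as f:
        records = json.load(f)
    correct = sum(
        r["prediction"].strip().lower() == r["ground_truth"].strip().lower()
        for r in records
    )
    return correct / len(records)
```

Exact match undercounts correct answers phrased differently (e.g. "Laysan albatross" vs. "the Laysan Albatross"), which is why an LLM-based judge is used for the reported numbers.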
We provide a sample inference script under demo/. Modify main in inference.py based on your input:
cd demo
python inference.py
@inproceedings{
kumar2026divek,
title={{DiVE-k}: Differential Visual Reasoning for Fine-Grained Image Recognition},
author={Raja Kumar and Arka Sadhu and Ram Nevatia},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=flE6M5zFL6}
}
This repo uses code from Visual-RFT. We would like to thank the authors for their amazing work.
