We propose DiVE-k (Differential Visual rEasoning using top-k generations), a framework addressing a key weakness in large vision-language models: their struggle with fine-grained distinctions. Our analysis reveals that world knowledge alone is not enough; the model needs to learn how to apply it with precision. DiVE-k treats the base model's top-k generations, obtained via
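As a rough illustration of the top-k idea (the label names and scores below are hypothetical, not from the paper), the candidates to reason over are simply the highest-scoring labels in the model's output distribution:

```python
import heapq

def topk_candidates(label_scores, k=5):
    """Return the k highest-scoring labels from a label -> score mapping."""
    return heapq.nlargest(k, label_scores, key=label_scores.get)

# Hypothetical fine-grained bird scores from a VLM's output distribution.
scores = {
    "Laysan Albatross": 0.41,
    "Black-footed Albatross": 0.33,
    "Sooty Albatross": 0.12,
    "Northern Fulmar": 0.09,
    "Herring Gull": 0.05,
}
print(topk_candidates(scores, k=3))
# -> ['Laysan Albatross', 'Black-footed Albatross', 'Sooty Albatross']
```

Fine-grained errors concentrate in exactly these near-miss candidates, which is what makes them useful training signal.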
Please follow this link to prepare the datasets. For CUB, download the images and annotations from here.
We also provide pre-processed data in this HF collection for the Qwen2.5-VL-7B model.
Pretrained models are released in this HF collection.
We recommend using Docker to ensure a consistent environment.
Clone the repository:
git clone https://github.com/raja-kumar/DiVE-k
cd DiVE-k
Build the image:
docker build --build-arg CACHE_BUSTER=$(date +%s) -t dive_k .

Run the container:
Replace the placeholders (e.g., /path/to/...) with your local paths.
docker run --rm --gpus all --shm-size=10g \
-v /path/to/local/repo:/app/DiVE-k \
-v /path/to/data_root:/data2/ \
-v /path/to/hf_cache:/root/.cache/huggingface/ \
-v /path/to/saved_models:/app/saved_models \
-it dive_k bash

Data must be arranged in the following format under your data_root:
<data_root>/
└── <dataset_name>/ # e.g., CUB_200_2011
├── zero_shot/
│ ├── subsample_{split}.json # e.g., subsample_base_train.json
│ ├── base_categories.txt
│ └── new_categories.txt
└── fewshot/
├── 4_shots_all_train.json
└── all_categories.txt
(Note: You can skip this step if you are using our provided MCQ data).
Step 1: Add Prompts
Add or modify your prompts in prompts.py.
Step 2: Run the Generator
Navigate to the generation directory inside the container and run the script:
cd /app/DiVE-k/topk-mcq-data
python generate_mcq.py --mcq_type qwen --data <DATASET_NAME> --split base --phase train

(Example usage: --data CUB_200_2011)
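The MCQ construction step can be sketched as follows. This is an illustrative reimplementation, not the repo's `generate_mcq.py`: the lettered-option format and the way the ground truth is injected are assumptions.

```python
import random

def build_mcq(candidates, answer, seed=0):
    """Form a multiple-choice question from top-k candidate labels.

    `candidates` are the model's top-k generations; `answer` is the
    ground-truth label, inserted if the model missed it entirely.
    """
    options = list(dict.fromkeys(candidates))  # dedupe, keep order
    if answer not in options:
        options[-1] = answer  # ensure the correct label is present
    rng = random.Random(seed)
    rng.shuffle(options)
    letters = "ABCDEFGH"[: len(options)]
    question = "Which category best matches the image?\n" + "\n".join(
        f"{l}. {o}" for l, o in zip(letters, options)
    )
    return {"question": question, "answer": letters[options.index(answer)]}

mcq = build_mcq(
    ["Laysan Albatross", "Sooty Albatross", "Northern Fulmar", "Herring Gull"],
    answer="Laysan Albatross",
)
print(mcq["question"])
print("Correct option:", mcq["answer"])
```

Because the distractors come from the model's own near-miss generations, the resulting questions target exactly the fine-grained confusions the model makes.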
Step 3: Build a HuggingFace Dataset
Convert the generated JSON to a HuggingFace dataset:
python build_hf_dataset.py <data_root>/<dataset_name>/qwen_mcq/subsample_base_train_qwen_mcq.json
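Conceptually, this conversion amounts to loading the generated JSON and keeping the training-relevant fields. A stdlib-only sketch of the loading step (the field names are illustrative, not the repo's actual schema):

```python
import json

def load_mcq_records(path):
    """Load generated MCQ entries from JSON, keeping only the fields needed
    for training. Field names here are illustrative assumptions."""
    with open(path) as f:
        entries = json.load(f)
    return [
        {"image": e["image"], "question": e["question"], "answer": e["answer"]}
        for e in entries
    ]
```

The repo's `build_hf_dataset.py` then presumably wraps such records with the `datasets` library; consult that script for the actual field schema.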
To train the model, use the sample training script:
- Open scripts/others/test.sh.
- Modify the variables DATA_PATH, SAVE_PATH, and RUN_NAME to match your specific experiment.
- Run the script:

bash scripts/others/test.sh

We provide evaluation scripts for both one-step and two-step inference. Modify the paths in the bash scripts and run:
Two Step Inference
cd eval
./two_step_inference.sh
One Step Inference
./one_step_inference.sh
After inference completes, the output is saved as a JSON file. Use the LLM eval script to compute accuracy:
./run_llm_eval.sh
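For intuition, accuracy over such a predictions file could be computed with a simple exact-match check; note the repo's `run_llm_eval.sh` uses an LLM judge instead, and the field names below are assumptions about the output JSON:

```python
import json

def exact_match_accuracy(path):
    """Fraction of records whose prediction equals the ground truth.

    Assumes a JSON list of {"prediction": ..., "ground_truth": ...} records;
    these field names are an assumption about the inference output.
    """
    with open(path) as f:
        records = json.load(f)
    correct = sum(
        r["prediction"].strip().lower() == r["ground_truth"].strip().lower()
        for r in records
    )
    return correct / len(records)
```

Exact match undercounts correct answers phrased differently (e.g. "Laysan albatross" vs. "the Laysan Albatross"), which is why an LLM-based judge is used for the reported numbers.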
We provide a sample inference script under demo/. Modify main in inference.py based on your input:
cd demo
python inference.py
@inproceedings{
kumar2026divek,
title={{DiVE-k}: Differential Visual Reasoning for Fine-Grained Image Recognition},
author={Raja Kumar and Arka Sadhu and Ram Nevatia},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=flE6M5zFL6}
}
This repo uses code from Visual-RFT. We would like to thank the authors for their amazing work.
