# DiVE-k: Differential Visual Reasoning For Fine-Grained Image Recognition [ICLR 2026]

We propose DiVE-k (Differential Visual rEasoning using top-k generations), a framework that addresses a key weakness in large vision-language models: their struggle with fine-grained distinctions. Our analysis reveals that world knowledge alone isn't enough; the model needs to learn how to apply it with precision. DiVE-k treats the base model's top-k generations, obtained via $K$ rollouts, as a training primitive that enables differential visual reasoning.
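As a rough illustration of the idea (not the repo's actual API), the top-k rollouts can be distilled into a multiple-choice training sample whose distractor options are the model's own confusable candidates:

```python
from collections import Counter

def topk_candidates(rollouts, k=4):
    """Return the k most frequent labels among K independent rollouts."""
    return [label for label, _ in Counter(rollouts).most_common(k)]

def build_mcq_sample(image_id, rollouts, gold_label, k=4):
    """Build one MCQ training sample from the base model's rollouts.

    Field names here are illustrative; the real pipeline lives in
    topk-mcq-data/generate_mcq.py.
    """
    options = topk_candidates(rollouts, k)
    if gold_label not in options:  # make sure the correct answer is an option
        options[-1] = gold_label
    return {"image_id": image_id, "options": sorted(options), "answer": gold_label}
```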

*Figure: Example of DiVE-k differential reasoning.*

## Data

Please follow this link to prepare the datasets. For CUB, download the images and annotations from here.

We also provide pre-processed data for the Qwen2.5-VL-7B model in this HF collection.

## Pretrained Models

Pretrained models are released in this HF collection.

## 1. Installation

### Docker Setup

We recommend using Docker to ensure a consistent environment.

Clone the repository:

```shell
git clone https://github.com/raja-kumar/DiVE-k
cd DiVE-k
```

Build the image:

```shell
docker build --build-arg CACHE_BUSTER=$(date +%s) -t dive_k .
```

Run the container, replacing the placeholders (e.g., `/path/to/...`) with your local paths:

```shell
docker run --rm --gpus all --shm-size=10g \
    -v /path/to/local/repo:/app/DiVE-k \
    -v /path/to/data_root:/data2/ \
    -v /path/to/hf_cache:/root/.cache/huggingface/ \
    -v /path/to/saved_models:/app/saved_models \
    -it dive_k bash
```

## 2. Prepare Data

### Directory Structure

Data must be arranged in the following format under your data_root:

```
<data_root>/
└── <dataset_name>/             # e.g., CUB_200_2011
    ├── zero_shot/
    │   ├── subsample_{split}.json  # e.g., subsample_base_train.json
    │   ├── base_categories.txt
    │   └── new_categories.txt
    └── fewshot/
        ├── 4_shots_all_train.json
        └── all_categories.txt
```
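Before training, it can be worth sanity-checking that the layout matches. A minimal sketch (the helper name is ours, not the repo's):

```python
from pathlib import Path

def missing_files(data_root, dataset_name, split="base_train"):
    """List expected files that are absent under <data_root>/<dataset_name>.

    Mirrors the directory structure described above; returns an empty
    list when everything is in place.
    """
    d = Path(data_root) / dataset_name
    expected = [
        d / "zero_shot" / f"subsample_{split}.json",
        d / "zero_shot" / "base_categories.txt",
        d / "zero_shot" / "new_categories.txt",
        d / "fewshot" / "4_shots_all_train.json",
        d / "fewshot" / "all_categories.txt",
    ]
    return [str(p) for p in expected if not p.exists()]
```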

## 3. Generate MCQ

(Note: You can skip this step if you are using our provided MCQ data).

**Step 1: Add prompts.** Add or modify your prompts in `prompts.py`.

**Step 2: Run the generator.** Navigate to the generation directory inside the container and run the script:

```shell
cd /app/DiVE-k/topk-mcq-data
python generate_mcq.py --mcq_type qwen --data <DATASET_NAME> --split base --phase train
```

(Example usage: `--data CUB_200_2011`)

**Step 3: Build the HuggingFace dataset.** Convert the generated JSON to a HuggingFace dataset:

```shell
python build_hf_dataset.py <data_root>/<dataset_name>/qwen_mcq/subsample_base_train_qwen_mcq.json
```
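For reference, the conversion amounts to loading the JSON records into a list of dicts, the shape that `datasets.Dataset.from_list` accepts. A sketch under assumed field names (the actual schema is defined by `build_hf_dataset.py`):

```python
import json

def load_mcq_records(path):
    """Load the generated MCQ JSON and normalize it to a list of dicts."""
    with open(path) as f:
        data = json.load(f)
    # Accept either a top-level list or an id -> record mapping.
    return data if isinstance(data, list) else list(data.values())
```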

## 4. GRPO Training

To train the model, use the sample training script:

1. Open `scripts/others/test.sh`.
2. Modify the variables `DATA_PATH`, `SAVE_PATH`, and `RUN_NAME` to match your experiment.
3. Run the script:

```shell
bash scripts/others/test.sh
```

## Evaluation

We provide evaluation scripts for both one-step and two-step inference. Modify the paths in the bash scripts and run:

### Two-Step Inference

```shell
cd eval
./two_step_inference.sh
```

### One-Step Inference

```shell
./one_step_inference.sh
```

After inference completes, the output is saved as a JSON file. Use the LLM eval script to compute accuracy:

```shell
./run_llm_eval.sh
```
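If you want a quick sanity check without an LLM judge, exact-match accuracy over the saved JSON can be computed directly. A sketch: the `prediction` and `ground_truth` field names are assumptions, not the script's actual schema:

```python
import json

def exact_match_accuracy(output_path):
    """Compute case-insensitive exact-match accuracy from the inference JSON."""
    with open(output_path) as f:
        records = json.load(f)
    correct = sum(
        r["prediction"].strip().lower() == r["ground_truth"].strip().lower()
        for r in records
    )
    return correct / len(records)
```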

## Inference

We provide a sample inference script under `demo`. Modify `main` based on your input:

```shell
cd demo
python inference.py
```

## Citation

```bibtex
@inproceedings{
  kumar2026divek,
  title={Di{VE}-k: {DIFFERENTIAL} {VISUAL} {REASONING} {FOR} {FINE}-{GRAINED} {IMAGE} {RECOGNITION}},
  author={Raja Kumar and Arka Sadhu and Ram Nevatia},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=flE6M5zFL6}
}
```

## Attributions

This repo uses code from Visual-RFT. We would like to thank the authors for their amazing work.
