This repository contains the official implementation of GPT-OSS-VQA, a modular and extensible framework for Visual Question Answering (VQA). Our approach leverages the dense captioning capabilities of the Gemini 2.5 model to provide rich visual context to Large Language Models (LLMs), enabling more accurate and detailed answers to visual queries.
This framework is built with a focus on modularity, performance, and ease of use, making it suitable for both research and practical applications.
Features
- Modular Architecture: Decoupled components for data handling, model inference, and captioning, allowing for easy extension and modification.
- Multi-Model Support: Seamlessly integrates multiple LLMs, including Gemini, Qwen3, and DeepSeek, for comparative analysis.
- High-Performance Inference: Utilizes `vLLM` and `unsloth` for optimized, high-throughput inference on modern GPUs.
- Rich Context Generation: Employs the `Gemini-Pro-2.5` model to generate detailed, object-aware image descriptions that enhance LLM reasoning.
- Comprehensive Tooling: Includes scripts for batch processing, automated evaluation on standard benchmarks (OpenViVQA, ViVQA-X), and detailed result analysis.
- Resilience and Efficiency: Features automatic job resumption, multi-threaded data loading, and periodic progress saving (a minimal sketch of this pattern follows the list).
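
The pattern behind automatic resumption and periodic saving is straightforward: completed items are appended to the output file as the job runs, and a restarted job skips anything already recorded. Below is a minimal, framework-agnostic sketch of this pattern; the `process_image` helper and the CSV layout are illustrative assumptions, not the repository's actual API.

```python
import csv
import os

def process_image(image_path: str) -> str:
    """Hypothetical per-image worker; in the framework this would call the captioner."""
    return f"caption for {image_path}"

def load_done(csv_path: str) -> set:
    """Collect image paths already processed by a previous (interrupted) run."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row["image"] for row in csv.DictReader(f)}

def run_resumable(images, csv_path, save_every=10):
    """Process only unfinished images, appending results and flushing periodically."""
    done = load_done(csv_path)
    is_new = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["image", "caption"])
        if is_new:
            writer.writeheader()
        for i, image in enumerate(p for p in images if p not in done):
            writer.writerow({"image": image, "caption": process_image(image)})
            if (i + 1) % save_every == 0:
                f.flush()  # periodic progress saving
```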
Benchmark results for the three GPT-OSS reasoning modes, one table per evaluation benchmark:

| Mode | BLEU@4 | ROUGE-L | METEOR | CIDEr | SPICE | BERTScore | FLOPs ↓ | Time (s) ↓ |
|---|---|---|---|---|---|---|---|---|
| GPT-OSS Low | 0.135 | 0.635 | 0.407 | 1.181 | 0.504 | 0.848 | 54,854 | 282 |
| GPT-OSS High | 0.141 | 0.641 | 0.413 | 0.840 | 0.570 | 0.820 | 98,884 | 3,317 |
| GPT-OSS Ada (Ours) | 0.142 | 0.654 | 0.421 | 1.022 | 0.503 | 0.839 | 62,356 | 815 |

| Mode | BLEU@4 | ROUGE-L | METEOR | CIDEr | SPICE | BERTScore | FLOPs ↓ | Time (s) ↓ |
|---|---|---|---|---|---|---|---|---|
| GPT-OSS Low | 0.091 | 0.340 | 0.214 | 0.584 | 0.156 | 0.732 | 113,699 | 480 |
| GPT-OSS High | 0.072 | 0.295 | 0.171 | 0.452 | 0.185 | 0.695 | 258,089 | 7,602 |
| GPT-OSS Ada (Ours) | 0.085 | 0.326 | 0.200 | 0.542 | 0.212 | 0.719 | 163,574 | 3,009 |
To get started with the framework, please follow the detailed setup and usage guides.
Our comprehensive setup guide provides detailed instructions for environment preparation, dependency installation, and model/dataset acquisition. It covers system requirements, virtual environment setup, and verification steps to ensure a smooth start.
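
As a quick post-setup sanity check, you can verify from Python that a GPU is visible and that the inference dependencies import cleanly. A minimal sketch, assuming PyTorch plus the `vllm` and `unsloth` packages mentioned above are installed:

```python
import importlib

import torch

# Confirm a CUDA-capable GPU is visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

# Confirm the optimized-inference dependencies import cleanly.
for pkg in ("vllm", "unsloth"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError as exc:
        print(f"{pkg}: missing ({exc})")
```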
The usage guide explains how to run inference, perform batch processing, and evaluate models on benchmark datasets. It includes command-line examples for various scenarios.
For advanced users and researchers, we provide in-depth guides on configuring the framework and running evaluation protocols.
The configuration system is designed for flexibility. You can easily modify data paths, model parameters, and processing settings. This guide details the structure of the configuration files and how to customize them.
➡️ View Full Configuration Guide
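
To give a feel for the pattern before you open the guide: configurations are plain Python objects whose fields cover data paths, model parameters, and processing settings, and any field can be overridden per run. The following is a hypothetical sketch of that pattern; only `model_path` and `csv_path` appear in this README's usage example, and every other field here is an illustrative assumption, not the repository's actual schema.

```python
from dataclasses import dataclass

@dataclass
class DAMConfig:
    """Model-side settings (sketch of the pattern, not the repo's class)."""
    model_path: str = "nvidia/DAM-3B-Self-Contained"  # checkpoint to load
    device: str = "cuda"          # illustrative: where inference runs
    max_new_tokens: int = 256     # illustrative: caption length budget

@dataclass
class CaptioningConfig:
    """Output-side settings (sketch of the pattern, not the repo's class)."""
    csv_path: str = "captions.csv"  # where captions are written
    save_every: int = 10            # illustrative: periodic save interval

# Any field can be overridden per run:
cfg = DAMConfig(max_new_tokens=128)
print(cfg.model_path, cfg.max_new_tokens)
```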
This guide provides instructions on how to run the evaluation scripts, interpret the results, and perform comparative analysis between different models and configurations.
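
One lightweight way to compare two runs is to diff their metric summaries programmatically. A minimal sketch, assuming each run writes a one-row CSV with a column per metric from the tables above (the file names and layout are illustrative assumptions):

```python
import csv

METRICS = ["BLEU@4", "ROUGE-L", "METEOR", "CIDEr", "SPICE", "BERTScore"]

def read_metrics(path: str) -> dict:
    """Read a one-row metrics CSV into a {metric: value} dict."""
    with open(path, newline="", encoding="utf-8") as f:
        row = next(csv.DictReader(f))
    return {m: float(row[m]) for m in METRICS}

baseline = read_metrics("results/gpt_oss_low.csv")   # illustrative path
candidate = read_metrics("results/gpt_oss_ada.csv")  # illustrative path

for m in METRICS:
    delta = candidate[m] - baseline[m]
    print(f"{m:>10}: {baseline[m]:.3f} -> {candidate[m]:.3f} ({delta:+.3f})")
```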
The modular design of the framework allows for easy integration into your own Python projects. You can import and use the components directly for custom workflows.
```python
from src.config import DAMConfig, CaptioningConfig
from src.captioner import run_captioning

# 1. Set up configurations
dam_config = DAMConfig(model_path='nvidia/DAM-3B-Self-Contained')
caption_config = CaptioningConfig(csv_path='my_captions.csv')

# 2. Define the image source
image_folder = '/path/to/your/images'

# 3. Run the captioning process
run_captioning(image_folder, dam_config, caption_config)

print(f"Captions saved to {caption_config.csv_path}")
```

If you use this framework or our findings in your research, please consider citing our paper.
```bibtex
@misc{nhan2025gptossvqa,
  author       = {Dai-Nhan Tran and Phuc-Thinh Nguyen and Tue-Anh Vu and Bao Do and Hai-Au Trinh and Anh-Khoi Nguyen},
  title        = {Difficulty-Aware Adaptive Reasoning for Vietnamese VQA with GPT-OSS},
  year         = {2025},
  howpublished = {\url{https://github.com/hugebenevolence/ViVQA_GPT-OSS_DRA}},
}
```

This project is built upon the excellent work of several open-source projects and research contributions. We would like to extend our gratitude to the teams behind vLLM, unsloth, and the models and benchmark datasets used in this work.