This repository contains code and resources for benchmarking foundational Vision-Language Models (VLMs) on a custom single-word Visual Question Answering (VQA) task, with a focus on e-commerce product images. The project includes dataset curation, baseline evaluation of multiple models, fine-tuning experiments, and model optimization.
Benchmarking_Generative_Models/
├── dataset.ipynb # Notebook for dataset curation and processing
├── baseline_model_evaluation/ # Scripts for loading and evaluating 7 VLMs (CLIP, ViLBERT, BLIP2, BLIP, Qwen, OFA, SmolVLM)
├── dataset_csv/ # Contains 18 CSV files with curated QA pairs
├── requirements.txt # Python dependencies
├── README.md # Project documentation
└── ... # Additional scripts and resources
git clone https://github.com/Niranjan-GopaL/FineTuning_VLM_Foundational_Models.git
cd FineTuning_VLM_Foundational_Models

It is recommended to use a virtual environment (e.g., venv or conda).

pip install -r requirements.txt

- Run dataset.ipynb to process the ABO dataset and generate single-word VQA pairs.
- Curated data is stored as CSV files in the dataset_csv/ folder (18 files, each containing question-answer pairs with difficulty levels).
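
As a minimal sketch of how the curated files might be consumed, the snippet below reads every CSV in a folder and counts rows per difficulty level. The column names `question`, `answer`, and `difficulty` are assumptions made for illustration, and `load_qa_pairs` and `count_by_difficulty` are hypothetical helpers; check the actual headers in dataset_csv/ before relying on them.

```python
import csv
from collections import Counter
from pathlib import Path


def load_qa_pairs(folder):
    """Read every *.csv file in `folder` into a list of row dicts."""
    rows = []
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                row["source_file"] = path.name  # remember which file the row came from
                rows.append(row)
    return rows


def count_by_difficulty(rows, column="difficulty"):
    """Count rows per difficulty level; `column` is an assumed header name."""
    return Counter(row.get(column, "unknown") for row in rows)
```

Calling `load_qa_pairs("dataset_csv")` from the repository root would return one dict per QA pair, which can then be passed to `count_by_difficulty` to see how the difficulty levels are distributed.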
- The baseline_model_evaluation/ directory contains scripts to load and evaluate the following foundational VLMs on the curated dataset:
  - CLIP
  - ViLBERT
  - BLIP
  - BLIP2
  - Qwen
  - OFA
  - SmolVLM
- Models are evaluated on VQA tasks using the curated CSVs.
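
The scoring rule for single-word answers is not spelled out here. One common choice is case-insensitive exact match, and the sketch below assumes that rule. The helper names `normalize` and `exact_match_accuracy` are hypothetical and are not taken from the evaluation scripts.

```python
import string


def normalize(answer):
    """Lowercase, trim whitespace, and strip surrounding punctuation."""
    return answer.strip().lower().strip(string.punctuation + " ")


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Normalizing both sides keeps a model from being penalized for answering "Red." when the reference is "red", which matters when the target is a single word.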
- Fine-tuning experiments (e.g., LoRA) and model compression (e.g., quantization) are included for efficient adaptation and deployment.
- Refer to the respective scripts and documentation for details.
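
To give a sense of why LoRA is considered efficient: it freezes a weight matrix of shape d × k and trains two low-rank factors of shapes d × r and r × k, so the trainable count for that matrix drops from d·k to r·(d + k). The sketch below only does that arithmetic. The layer size and rank are illustrative values, not settings taken from this repository.

```python
def lora_trainable_params(d, k, r):
    """Trainable parameters LoRA adds for one d x k weight matrix at rank r."""
    return r * (d + k)


def full_params(d, k):
    """Parameters updated when the same matrix is fine-tuned directly."""
    return d * k


# Illustrative numbers only: a 4096 x 4096 projection with rank 8.
d, k, r = 4096, 4096, 8
lora = lora_trainable_params(d, k, r)
full = full_params(d, k)
print(f"LoRA: {lora:,} trainable vs {full:,} full ({lora / full:.2%})")
```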