Homepage | Model | Dataset | Code | Arxiv | PDF | Demo
This repository provides the resources and guidelines for training and evaluating MAmmoTH-VL.
Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.
The repository is organized into the following directories:
- train: Contains scripts and instructions for pretraining and finetuning the MAmmoTH-VL model, adapted from the open-source LLaVA-NeXT repository.
- evaluation: Includes code and datasets to assess the model's performance across various tasks and languages, adapted from the lmms-eval repository.
To get started with MAmmoTH-VL:
- Clone the Repository: Use Git to clone the repository to your local environment.
- Install Dependencies: Ensure you have the required dependencies installed. For training, run:

```shell
cd train/LLaVA-NeXT
pip install -e ".[train]"
```

For evaluation, run:

```shell
cd evaluation/lmms-eval
pip install -e .
```

- Download Datasets: Acquire the necessary pretraining and fine-tuning datasets. For pretraining, download the LLaVA-Pretrain dataset from HuggingFace. For finetuning, download the MAmmoTH-VL-12M dataset from HuggingFace.
Here is an example of training data:
```json
{
    "id": str,
    "image": str/array,
    "video": str,
    "conversations": array,
}
```

- id: Unique identifier for the data sample.
- image: The path to the image file used in this instance.
- video: The path to the video file used in this instance.
- conversations: A series of conversations between the "human" and the model (in this case, referred to as "gpt"). Each turn contains:
  - from: Identifies the speaker (either "human" or "gpt").
  - value: The content of the message, which can include both text and image references.
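As an illustration, a minimal single-image sample in this format can be built and sanity-checked with a few lines of Python. The `<image>` placeholder token inside the human turn and the `validate` helper are assumptions for this sketch, not part of the official data pipeline:

```python
import json

# A minimal single-image training sample following the schema above.
# The "<image>" token marks where the image is inserted into the prompt
# (a LLaVA-style convention; an assumption here, not from this README).
sample = {
    "id": "sample-0001",
    "image": "images/sample-0001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat trend does the chart show?"},
        {"from": "gpt", "value": "The chart shows an upward trend because ..."},
    ],
}

def validate(sample: dict) -> None:
    """Lightweight sanity checks on one training sample."""
    assert isinstance(sample["id"], str), "id must be a string"
    # A sample should reference at least one visual input.
    assert "image" in sample or "video" in sample, "missing image/video"
    turns = sample["conversations"]
    assert len(turns) >= 2, "need at least one human/gpt exchange"
    for turn in turns:
        assert turn["from"] in {"human", "gpt"}, "unknown speaker"
        assert isinstance(turn["value"], str), "value must be a string"

validate(sample)
print(json.dumps(sample, indent=2))
```

Samples with a `video` field follow the same structure, with the path stored under `"video"` instead of `"image"`.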
After setting up, initiate the pretraining phase:
- Run the Pretraining Script:
```shell
cd train
bash LLaVA-NeXT/scripts/train/mammoth_vl/pretrain_qwen_2_5.sh
```

This results in the creation of a mm_projector.bin file, which is essential for the finetuning stage.
Once pretraining is complete, proceed to finetune the model. Ensure the fine-tuning data is available.
After obtaining the fine-tuning data, run the following script to begin the single-image (si) fine-tuning stage:

```shell
cd train
bash LLaVA-NeXT/scripts/train/mammoth_vl/finetune_qwen_2_5_si.sh
```

Then run the following script for the OneVision (ov) fine-tuning stage:

```shell
cd train
bash LLaVA-NeXT/scripts/train/mammoth_vl/finetune_qwen_2_5_ov.sh
```
To evaluate the model's capabilities:
- Navigate to the Evaluation Directory:

```shell
cd eval
```

- Run the Evaluation Script:
To run the evaluation, use the following command:
```shell
export HF_HOME=xxx
export HF_TOKEN=xxx
export MLP_WORKER_0_PORT=xxx
export OPENAI_API_KEY=xxx

source yourpath/miniconda3/bin/activate lmms-eval

FINAL_RUN_NAME=$1
Task_Name=$2

CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 -m lmms_eval --model llava_onevision --model_args pretrained=${FINAL_RUN_NAME},conv_template=qwen_2_5,model_name=llava_qwen --tasks mmmu_val --batch_size 1 --log_samples --log_samples_suffix ${Task_Name} --output_path xxx
```

Here, ${FINAL_RUN_NAME} refers to either a locally available model or a model on HuggingFace, identified by its repository ID. Note that we use conv_template=qwen_2_5 for MAmmoTH-VL; remove this or change it to another conv_template when appropriate.
eval/lmms-eval/eval_mammoth_vl_example.sh shows an example script to run evaluation.
```bibtex
@article{guo2024mammothvlelicitingmultimodalreasoning,
  title={MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale},
  author={Jarvis Guo and Tuney Zheng and Yuelin Bai and Bo Li and Yubo Wang and King Zhu and Yizhi Li and Graham Neubig and Wenhu Chen and Xiang Yue},
  year={2024},
  eprint={2412.05237},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.05237},
}
```