[Paper] | [Full Dataset (2.3M)] | [SFT + RL (1.8M)] | [SFT (586K)] | [SFT (123K)]
This repository implements the complete data construction pipeline described in the MMFineReason paper, for building large-scale, high-quality multimodal reasoning datasets (1.8M samples, 5.1B solution tokens). The pipeline consists of four stages: Collection → Cleaning → Distillation → Selection.
The figure above compares our models (Qwen3-VL-8B/32B-SFT) against mainstream multimodal models such as GPT-4V, Gemini Ultra, and Qwen-VL on the MMFineReason validation set. The SFT-finetuned Qwen3-VL-32B-Thinking achieves significant improvements in multimodal reasoning, even surpassing some closed-source commercial models — validating the effectiveness of large-scale open data with chain-of-thought annotations.
┌─────────────────────────────────────────────────────────────────────────┐
│ MMFineReason Pipeline │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ ┌───────────┐ │
│ │Collection│───▶│ Cleaning │───▶│ Distillation │───▶│ Selection │ │
│ └──────────┘ └──────────┘ └──────────────┘ └───────────┘ │
│ │
│ HF Dataset Text/Image CoT Distill + Quality + │
│ Download Cleaning Caption Gen Difficulty │
│ (pre-processed) Standardization 4-phase reasoning Content/Struct │
│ Refinement Visual description N-gram dedup │
│ Suitability Consistency │
│ Difficulty │
│ │
│ 2.3M raw ──────────────────────────────────────────▶ 1.8M selected │
└─────────────────────────────────────────────────────────────────────────┘
mmfinereason/
├── README.md # This file
├── README_zh.md # Chinese version
├── requirements.txt # Dependencies
├── config.yaml # Configuration
├── mfr_pipeline.py # Main entry point (4-stage pipeline)
│
├── pipeline/
│ ├── __init__.py
│ ├── collection.py # Stage 1: Dataset download
│ ├── cleaning.py # Stage 2: Text + image cleaning
│ ├── distillation.py # Stage 3: CoT distillation & caption
│ ├── selection.py # Stage 4: Quality & difficulty filtering
│ └── model_client.py # Model client (OpenAI/vLLM compatible)
│
├── prompts/ # Prompt templates
│ ├── distill.txt # CoT distillation prompt
│ ├── caption.txt # Image captioning prompt
│ ├── clean.txt # Data cleaning prompt
│ ├── answer_extraction.txt # Answer extraction prompt
│ └── verify.txt # Correctness verification prompt
│
└── train/ # Training configs
└── config/
└── mfr_8b.yaml # Qwen3-VL-8B SFT config
| Paper Section | Pipeline Stage | Code Module |
|---|---|---|
| Section 3.1 - Data Collection | Stage 1: Collection | pipeline/collection.py |
| Section 3.1 - Data Cleaning & Image Cleaning | Stage 2: Cleaning | pipeline/cleaning.py |
| Section 3.2 - Data Annotation | Stage 3: Distillation | pipeline/distillation.py |
| Section 3.3 - Data Selection | Stage 4: Selection | pipeline/selection.py |
Corresponds to Section 3.1: Data Collection
Since the data curation and standardization process is highly fragmented across sources, this codebase directly downloads the pre-processed MMFineReason-Full dataset from HuggingFace, which already includes cleaned questions, images, and extracted answers.
Data Sources:
| Category | Datasets | Samples |
|---|---|---|
| Mathematics (79.4%) | MMR1, WaltonColdStart, ViRL39K, Euclid30K, MMK12, Geo170K, Geo3K, mm-openr1, WeMath series | ~1.41M |
| Science (13.8%) | VisualWebInstruct, BMMR, TQA, AI2D, Zebra-CoT, ScienceQA | ~244K |
| Puzzle/Game (4.6%) | GameQA-140K, Raven, VisualSphinx, PuzzleQA | ~82K |
| General/OCR (2.2%) | LLaVA-CoT | ~39K |
Features:
- Download MMFineReason-Full from HuggingFace
- Subset selection (e.g., `--subsets BMMR Euclid30K`)
- Sample limiting for quick testing (e.g., `--max-samples 100`)
Output: collected.parquet
# Download specific subsets
python mfr_pipeline.py --step collection --output output/ --subsets BMMR Euclid30K
# Quick test (max 100 samples per subset)
python mfr_pipeline.py --step collection --output output/ --subsets BMMR --max-samples 100

Corresponds to Section 3.1: Data Cleaning + Image Cleaning
Performs comprehensive text and image cleaning on the collected data, ensuring language consistency, text cleanliness, and reasoning suitability.
Text Cleaning:
| Operation | Description |
|---|---|
| Language Standardization | Translate non-English text (e.g., Chinese in BMMR, Euclid30K) to English |
| Noise Removal | Remove URLs, corrupted characters, formatting artifacts, question numbers, score annotations |
| Instruction Refinement | Rewrite shallow reasoning prompts (e.g., "just give the answer" → "provide your answer after careful reasoning") |
| Task Suitability Filtering | Filter out coding, drawing, and other non-visual-reasoning tasks |
Image Cleaning:
| Operation | Description |
|---|---|
| Corrupt image removal | Discard unreadable or corrupted images |
| Resizing | Proportionally scale images with longest edge > 2048px |
| Color space normalization | Convert all images to RGB |
Output: cleaned.parquet
python mfr_pipeline.py --step cleaning --input output/collected.parquet --output output/

Corresponds to Section 3.2: Data Annotation
Uses teacher models to generate high-quality chain-of-thought reasoning and visual descriptions for each sample. The HuggingFace dataset already includes distillation results; this stage skips samples with existing data by default.
CoT Distillation (Core):
- Teacher Model: `Qwen3-VL-235B-A22B-Thinking` (strongest open-source VLM)
- Four-phase Reasoning Framework:
  1. Comprehensive Information Extraction: extract all visual information from the image
  2. Strategic Problem Setup: understand the problem and determine the reasoning type
  3. Rigorous Solution Execution: step-by-step derivation with visual elements as core components
  4. Solution Validation: verify the logical consistency of the answer
- Output Format: `<think>...</think>` + `<answer>...</answer>`
- Key Design: captions are not provided during CoT distillation, ensuring the model reasons from visual information rather than taking text-only shortcuts
Caption Generation:
- Model: `Qwen3-VL-235B-A22B-Instruct`
- Generates structured dense descriptions: image type classification, global layout, symbolic elements, spatial relations, key visual cues
- Average 609 tokens/caption, 100% coverage
Output: distilled.parquet (adds qwen3vl_235b_thinking_response and qwen3vl_235b_instruct_caption fields)
python mfr_pipeline.py --step distillation --input output/cleaned.parquet --output output/

Corresponds to Section 3.3: Data Selection
Applies multi-stage filtering to select 1.8M high-quality samples from the 2.3M raw pool, with optional difficulty filtering for efficient training subsets.
| Filter Step | Description |
|---|---|
| Content Filter | `[ERROR]` keyword check + response word count ≥ 200 + caption word count ≥ 100 |
| Structure Validation | Verify `<think>` / `<answer>` tag completeness (skip answer check for specific datasets) |
| N-gram Deduplication | Detect templated CoT where a 50-gram repeats ≥ 3 times |
| Correctness Verification | Extract `<answer>` and compare with ground truth; discard incorrect reasoning |
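The n-gram deduplication rule can be sketched directly from the config values (`ngram_n: 50`, `ngram_freq: 3`). Whitespace tokenization is an assumption here; the real pipeline may tokenize differently.

```python
# Sketch of the templated-CoT check: flag a response when any n-token
# window (default n=50) occurs at least `max_freq` times (default 3).
from collections import Counter

def has_repeated_ngram(text: str, n: int = 50, max_freq: int = 3) -> bool:
    tokens = text.split()  # assumption: simple whitespace tokenization
    if len(tokens) < n:
        return False
    counts = Counter(tuple(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return max(counts.values()) >= max_freq
```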
| Strategy | Description |
|---|---|
| Model: `Qwen3-VL-4B-Thinking` | Generates 4 independent answers per question |
| pass rate = 0 → MMFineReason-123K | Small model answers all incorrectly (hardest subset, only 7%) |
| pass rate ≠ 1 → MMFineReason-586K | Small model fails to answer all correctly |
"Less is More" Finding: Only 7% of the data (123K) achieves performance comparable to the full dataset.
Output: selected.parquet
# Quality filtering only
python mfr_pipeline.py --step selection --input output/distilled.parquet --output output/
# Quality + difficulty filtering (keep only pass_rate=0, hardest samples)
python mfr_pipeline.py --step selection --input output/distilled.parquet --output output/ \
    --difficulty 0.0

pip install -r requirements.txt

# Download + quality filtering, no model API needed
python mfr_pipeline.py --step all --output output/ \
    --subsets BMMR --max-samples 100

# Edit config.yaml to set up model APIs (if distilling new data)
vim config.yaml
# Download full dataset and run all four stages
python mfr_pipeline.py --config config.yaml --step all --output output/

# Stage 1: Collection — Download dataset from HuggingFace
python mfr_pipeline.py --step collection --output output/ --subsets BMMR Euclid30K
# Stage 2: Cleaning — Text/image cleaning
python mfr_pipeline.py --step cleaning --input output/collected.parquet --output output/
# Stage 3: Distillation — CoT distillation (skipped if HF data already has results)
python mfr_pipeline.py --step distillation --input output/cleaned.parquet --output output/
# Stage 4: Selection — Quality + difficulty filtering
python mfr_pipeline.py --step selection --input output/distilled.parquet --output output/
# Enable difficulty filtering (keep only pass_rate=0, hardest samples)
python mfr_pipeline.py --step selection --input output/distilled.parquet --output output/ \
    --difficulty 0.0

The configuration file is organized by stage, with each stage's model clients and parameters in its own section:
# Global
num_proc: 8

# Stage 1: Collection
collection:
  dataset_id: "OpenDataArena/MMFineReason-Full-2.3M-Qwen3-VL-235B-Thinking"
  subsets: [BMMR, Euclid30K]
  cache_dir: null

# Stage 2: Cleaning
cleaning:
  max_image_size: 2048
  llm_client:
    type: vllm
    api_base: "http://localhost:8000/v1"
    model: "Qwen3-30B-A3B-Thinking"

# Stage 3: Distillation
distillation:
  cot:
    client:
      type: vllm
      api_base: "http://localhost:8002/v1"
      model: "Qwen3-VL-235B-A22B-Thinking"
      temperature: 1.0
      top_p: 0.95
      max_tokens: 16384
      thinking: true
  caption:
    client:
      type: vllm
      api_base: "http://localhost:8001/v1"
      model: "Qwen3-VL-235B-A22B-Instruct"
      temperature: 1.0
      top_p: 0.95
      max_tokens: 16384

# Stage 4: Selection
selection:
  content_filter:
    min_response_words: 200
    min_caption_words: 100
    error_keyword: "[ERROR]"
  structure_filter:
    no_answer_check_datasets: [...]
  quality_filter:
    ngram_n: 50
    ngram_freq: 3
    consistency: true  # whether to verify CoT vs ground-truth consistency
  difficulty_filter:
    threshold: null  # null = disabled, 0.0 = hardest only, 0.5 = moderately hard
    num_samples: 4
    client:
      type: vllm
      api_base: "http://localhost:8003/v1"
      model: "Qwen3-VL-4B-Thinking"

The HuggingFace dataset includes the following fields:
| Category | Field | Description |
|---|---|---|
| Metadata | `source` | Source dataset |
| | `id` | Unique identifier |
| Raw Data | `original_question` | Original question |
| | `original_answer` | Original answer |
| Input/Output | `image` | Image |
| | `question` | Question |
| | `answer` | Standardized answer |
| Augmented | `qwen3vl_235b_instruct_caption` | Dense image caption |
| | `qwen3vl_235b_thinking_response` | CoT reasoning trace |
| Metrics | `qwen3vl_4b_pass_rate` | Difficulty score (0–1) |
| | `is_consistent` | Correctness verification result |
| | `consistency_analysis` | Consistency analysis |
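With those fields, selecting e.g. the hardest, verified samples from a pipeline output file takes a few lines of pandas. The function name is illustrative; only the field names come from the table above.

```python
# Sketch: filter a pipeline parquet down to samples the 4B model never
# solved (pass rate 0) whose CoT passed correctness verification.
import pandas as pd

def hardest_subset(df: pd.DataFrame) -> pd.DataFrame:
    mask = (df["qwen3vl_4b_pass_rate"] == 0.0) & df["is_consistent"]
    return df[mask]

# usage (file name from this README):
# df = pd.read_parquet("output/selected.parquet")
# hard = hardest_subset(df)
```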
We use LLaMA-Factory for SFT training:
# Qwen3-VL-8B SFT
llamafactory-cli train train/config/mfr_8b.yaml

Key training settings:
- Base model: `Qwen/Qwen3-VL-8B-Instruct`
- Freeze Vision Tower and Projector; full fine-tuning of the LLM
- DeepSpeed ZeRO-2, Flash Attention 2
- Learning Rate: 1e-5, Cosine Scheduler, 10% Warmup
@article{lin2026mmfinereason,
title={MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods},
author={Lin, Honglin and Liu, Zheng and Zhu, Yun and Qin, Chonghan and Lin, Juekai and Shang, Xiaoran and He, Conghui and Zhang, Wentao and Wu, Lijun},
journal={arXiv preprint arXiv:2601.21821},
year={2026}
}

This project is for research purposes only. Please comply with the original licenses of all data sources.
