This repository contains the official implementation of LLaMA-OSS, a knowledge distillation framework that curates multi-mode chain-of-thought (CoT) reasoning from GPT-OSS for efficient mathematical question answering. Our approach addresses the challenge of noisy and overly verbose supervision in dataset-based distillation with a two-step curation pipeline that emphasizes quality over quantity.
This framework is built with a focus on modularity, performance, and ease of use, making it suitable for both research and practical applications.
Key Features
- Multi-Mode CoT Generation: Leverages GPT-OSS's low/medium/high inference modes for controllable reasoning generation
- Two-Step Curation Pipeline:
- Final-answer verification to filter incorrect reasoning traces
- Length distribution-based filtering with median-length selection to eliminate verbosity
- SFT + GRPO Training: Complete pipeline from supervised fine-tuning to policy optimization
- Comprehensive Evaluation: Automated evaluation on GSM8K and MATH500 benchmarks
- LLaMA-Factory Integration: Built on LLaMA-Factory for efficient training workflows
- MS-SWIFT Support: Compatible with ModelScope-SWIFT framework
- Modular Design: Easy to extend for other reasoning tasks or teacher models
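The two-step curation idea can be sketched in plain Python. This is a minimal illustration under assumed data shapes, not the repository's actual implementation: each trace is assumed to be a dict with `question`, `cot`, `pred`, and `answer` fields, answer verification is a simple string match, and the length filter keeps traces inside a percentile band before selecting the trace closest to the median length.

```python
import statistics

def curate(traces, lo_pct=25, hi_pct=75):
    """Two-step curation sketch: (1) keep only traces whose final answer
    matches the gold answer, (2) drop length outliers outside the given
    percentile band, then keep the trace closest to the median length.
    Each trace is a dict: {"question", "cot", "pred", "answer"}."""
    # Step 1: final-answer verification
    verified = [t for t in traces if t["pred"].strip() == t["answer"].strip()]
    if not verified:
        return None

    # Step 2: length-distribution filtering
    lengths = sorted(len(t["cot"]) for t in verified)
    lo = lengths[int(len(lengths) * lo_pct / 100)]
    hi = lengths[min(int(len(lengths) * hi_pct / 100), len(lengths) - 1)]
    in_band = [t for t in verified if lo <= len(t["cot"]) <= hi]

    # Median-length selection: pick the trace nearest the median length
    target = statistics.median(len(t["cot"]) for t in in_band)
    return min(in_band, key=lambda t: abs(len(t["cot"]) - target))
```

The key design point is ordering: verification runs first so the length statistics are computed only over correct traces, and median selection then discards both terse and verbose survivors.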
All experiments use Llama 3.2 3B as the student model, distilled from GPT-OSS teacher models.
| Model | Training | Dataset | GSM8K 0-shot | GSM8K 5-shot |
|---|---|---|---|---|
| Llama3.2 | - | πorig | 0.7043 | 0.7043 |
| Llama3.2 | - | π* | 0.7043 | 0.7104 |
| Llama3.2 | SFT | π* | 0.6876 | 0.5762 |
| Llama3.2 | SFT | π*low | 0.7111 | 0.7142 |
| Llama3.2 | SFT | π*med | 0.7051 | 0.7074 |
| Llama3.2 | SFT | π*high | 0.7013 | 0.7051 |
| Llama3.2-πorig | GRPO | π | 0.7771 | 0.6603 |
| Llama3.2-π* | GRPO | π | 0.7847 | 0.6156 |
| Llama3.2-π*low | GRPO | π | 0.6308 | 0.5861 |
| Llama3.2-π*low | GRPO | π | 0.8006 | 0.7195 |
| Llama3.2-π*med | GRPO | π | 0.7771 | 0.6323 |
| Llama3.2-π*high | GRPO | π | 0.7559 | 0.7225 |
| Model | Training | Dataset | MATH500 0-shot | MATH500 4-shot |
|---|---|---|---|---|
| Llama3.2 | - | πorig | 0.3960 | 0.4340 |
| Llama3.2 | - | π* | 0.4060 | 0.4240 |
| Llama3.2 | SFT | π* | 0.3400 | 0.2420 |
| Llama3.2 | SFT | π*low | 0.4100 | 0.4400 |
| Llama3.2 | SFT | π*med | 0.4000 | 0.4160 |
| Llama3.2 | SFT | π*high | 0.4140 | 0.3920 |
| Llama3.2-πorig | GRPO | π | 0.4540 | 0.4380 |
| Llama3.2-π* | GRPO | π | 0.4520 | 0.4560 |
| Llama3.2-π*low | GRPO | π | 0.4400 | 0.4220 |
| Llama3.2-π*low | GRPO | π | 0.4760 | 0.4520 |
| Llama3.2-π*med | GRPO | π | 0.4480 | 0.4600 |
| Llama3.2-π*high | GRPO | π | 0.4740 | 0.4600 |
To get started with the framework, please follow the detailed setup and usage guides.
Our comprehensive setup guide provides detailed instructions for environment preparation, dependency installation, and model/dataset acquisition. It covers system requirements, virtual environment setup, and verification steps to ensure a smooth start.
➡️ View Full Setup Guide
The usage guide explains how to run inference, perform batch processing, and evaluate models on benchmark datasets. It includes command-line examples for various scenarios.
➡️ View Full Usage Guide
For advanced users and researchers, we provide in-depth guides on configuring the framework and running evaluation protocols.
The configuration system is designed for flexibility. You can easily modify data paths, model parameters, and processing settings. This guide details the structure of the configuration files and how to customize them.
➡️ View Full Configuration Guide
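As a rough illustration of the override pattern such a configuration system typically uses (the field names below are hypothetical, not the framework's actual schema), defaults can be merged recursively with user overrides so you only specify what you change:

```python
def merge_config(defaults, overrides):
    """Recursively merge an override dict into a default config,
    so users only specify the settings they want to change."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical defaults mirroring the kinds of settings the guide describes
DEFAULTS = {
    "data": {"train_path": "outputs/cot_med.jsonl", "max_samples": None},
    "model": {"name": "meta-llama/Llama-3.2-3B-Instruct", "max_length": 2048},
    "training": {"lr": 1e-5, "epochs": 3},
}

# Override only the learning rate and sample cap; everything else is inherited
config = merge_config(DEFAULTS, {"training": {"lr": 5e-6}, "data": {"max_samples": 1000}})
```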
This guide provides instructions on how to run the evaluation scripts, interpret the results, and perform comparative analysis between different models and configurations.
➡️ View Full Evaluation Guide
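At its core, benchmark scoring extracts a final numeric answer from each model output and compares it to the gold label. A minimal sketch of that logic (the repository's actual extraction rules may differ; GSM8K gold answers end in `#### <number>`, and the "last number in the output" heuristic is a common simplification):

```python
import re

def extract_final_number(text):
    """Heuristic: take the last number in a model's output as its answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def gsm8k_accuracy(predictions, gold_answers):
    """predictions: raw model outputs; gold_answers: GSM8K-style strings
    whose final answer follows a '####' marker."""
    correct = 0
    for pred, gold in zip(predictions, gold_answers):
        gold_num = gold.split("####")[-1].strip().replace(",", "")
        pred_num = extract_final_number(pred)
        if pred_num is not None and float(pred_num) == float(gold_num):
            correct += 1
    return correct / len(gold_answers)
```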
The modular design of the framework allows for easy integration into your own Python projects. You can import and use the components directly for custom workflows.
```python
from src.curation import CurationPipeline, AnswerVerifier, LengthFilter
from src.generator import GPTOSSGenerator

# 1. Set up the GPT-OSS teacher generator for multi-mode CoT
generator = GPTOSSGenerator(
    model_name='gpt-oss-20b',
    modes=['low', 'medium', 'high']
)

# 2. Generate CoT traces from your math dataset
math_problems = [
    {"question": "What is 15 + 27?", "answer": "42"},
    {"question": "Janet's ducks lay 16 eggs per day...", "answer": "18"}
]
cot_data = generator.generate_multi_mode_cot(
    problems=math_problems,
    output_path='raw_cot_data.jsonl'
)

# 3. Apply the two-step curation pipeline
curation = CurationPipeline(
    answer_verifier=AnswerVerifier(),
    length_filter=LengthFilter(percentile_range=(25, 75))
)

# Step 1: Answer verification
verified_data = curation.verify_answers(cot_data)

# Step 2: Length-based filtering with median selection
curated_data = curation.filter_by_length(verified_data, select_median=True)

# 4. Save mode-specific curated datasets
curation.save_by_mode(
    curated_data,
    output_dir='outputs',
    filenames={
        'low': 'cot_low.jsonl',
        'medium': 'cot_med.jsonl',
        'high': 'cot_high.jsonl'
    }
)

print(f"Curated {len(curated_data)} high-quality reasoning traces")
print("Files saved: cot_low.jsonl, cot_med.jsonl, cot_high.jsonl")
```

If you use this framework or find our work helpful, please consider citing:
```bibtex
@misc{llama-oss-2025,
  author       = {Hai-Au Trinh and Tue-Anh Vu and Dai-Nhan Tran and Uyen Khoi-Minh Huynh and Anh-Khoi Nguyen},
  title        = {Curating Multi-Mode CoT for Efficient Math Reasoning with GPT-OSS},
  year         = {2025},
  howpublished = {\url{https://github.com/Koii2k3/LLaMA-OSS}},
}
```

This project is built upon the excellent work of several open-source projects and research contributions. We would like to extend our gratitude to:
- The teams behind LLaMA-Factory and MS-SWIFT for their efficient training frameworks.
- Meta for the Llama foundation models.
- The Hugging Face team for the `transformers` and `accelerate` libraries.
Special thanks to the research community for advancing efficient LLM training techniques.
Note: This is an active research project. Contributions, issues, and feature requests are welcome! Please check our contributing guidelines before submitting PRs.


