A lightweight post-training framework for LLMs and VLMs, built to maximize developer speed. Scales to billions of parameters with DeepSpeed, vLLM, and Ray.
| | oxRL | TRL | OpenRLHF |
|---|---|---|---|
| Algorithms | 51 | ~12 | ~5 |
| Verified Models | 38 | — | — |
| Lines to train | 3 | ~30 | ~50 |
| RL Engine | vLLM + Ray | Native | vLLM + Ray |
| SL Engine | DeepSpeed | Native | DeepSpeed |
Context-length-minimized principle, following LLM-Oriented Design: the codebase is kept small and legible so that an LLM agent working on it does not run out of context or degrade in quality.
Post-train any model with one command. oxRL auto-detects your hardware, auto-prepares datasets, and scales to multi-GPU automatically.
```bash
# CLI — one command to train
oxrl train --model Qwen/Qwen2.5-0.5B-Instruct --task math --epochs 3

# With a custom dataset (JSONL or Parquet)
oxrl train --model Qwen/Qwen2.5-7B-Instruct --dataset ./my_data.jsonl --task reasoning
```

```python
# Python API — under 3 lines
from oxrl import Trainer

trainer = Trainer(model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
trainer.train(task="reasoning")
```

Datasets can be Parquet or JSONL — format is auto-detected from the file extension.
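Since the format is auto-detected, a custom dataset can be as simple as one JSON object per line. A minimal sketch of producing such a file — the field names `prompt` and `response` are assumptions here; check your task's preprocessor for the schema oxRL actually expects:

```python
import json

# Hypothetical schema: one JSON object per line with "prompt" and "response"
# fields. Verify the expected field names against your task's preprocessor.
records = [
    {"prompt": "What is 7 * 8?", "response": "56"},
    {"prompt": "Simplify 12/16.", "response": "3/4"},
]

with open("my_data.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Round-trip check: every line parses back to the original dict.
with open("my_data.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```

The resulting `my_data.jsonl` can then be passed via `--dataset ./my_data.jsonl`.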
oxRL works with any HuggingFace model that supports `AutoModelForCausalLM`, including multimodal models via `AutoModelForImageTextToText`. No special integration needed — just pass the model name.
These models have been explicitly verified through our automated onboarding pipeline:
| Model | Size | Task | Strategy |
|---|---|---|---|
| Qwen3-0.6B | 0.6B | Instruct | Full-tuning |
| Qwen2.5-0.5B-Instruct | 0.5B | Math | Full-tuning |
| Gemma-3-1b-it | 1.0B | Instruct | Full-tuning |
| Qwen2.5-1.5B-Instruct | 1.5B | Math | Full-tuning |
| Qwen2.5-Coder-1.5B-Instruct | 1.5B | Coding | Full-tuning |
| SmolLM2-1.7B-Instruct | 1.7B | Instruct | Full-tuning |
| Qwen2.5-3B-Instruct | 3.0B | Math | Full-tuning |
| DeepSeek-R1-Distill-Qwen-7B | 7.6B | Reasoning | LoRA |
| Qwen2.5-7B-Instruct | 7.0B | Math | LoRA |
| Qwen2.5-Coder-7B-Instruct | 7.6B | Coding | LoRA |
| Mistral-7B-Instruct-v0.3 | 7.0B | Instruct | LoRA |
| Qwen2-Audio-7B-Instruct | 7.0B | Audio | LoRA |
| Qwen2-VL-7B-Instruct | 7.0B | Vision | LoRA |
| DeepSeek-R1-Distill-Llama-8B | 8.0B | Reasoning | LoRA |
| Qwen3.5-35B-A3B | 35.0B (3B active) | Reasoning | LoRA |
| Qwen2.5-Coder-0.5B-Instruct | 0.5B | Coding | Full-tuning |
| Llama-3.2-1B-Instruct | 1.2B | Instruct | Full-tuning |
| Qwen2.5-Math-1.5B-Instruct | 1.5B | Math | Full-tuning |
| Qwen2-VL-2B-Instruct | 2.0B | Vision | Full-tuning |
| Qwen3-VL-2B-Instruct | 2.0B | Vision | Full-tuning |
| SmolVLM-Instruct | 2.3B | Vision | Full-tuning |
| Gemma-2-2b-it | 2.6B | Instruct | Full-tuning |
| Qwen2.5-Coder-3B-Instruct | 3.0B | Coding | Full-tuning |
| Qwen2.5-VL-3B-Instruct | 3.0B | Vision | Full-tuning |
| Llama-3.2-3B-Instruct | 3.2B | Instruct | Full-tuning |
| Phi-3.5-mini-instruct | 3.8B | Math | Full-tuning |
| Qwen3-4B | 4.0B | Math | Full-tuning |
| Qwen3-VL-4B-Instruct | 4.0B | Vision | Full-tuning |
| Qwen2.5-Math-7B-Instruct | 7.0B | Math | LoRA |
| Qwen2.5-VL-7B-Instruct | 7.0B | Vision | LoRA |
| Qwen3-8B | 8.0B | Math | LoRA |
| Llama-3.1-8B-Instruct | 8.0B | Reasoning | LoRA |
| Gemma-3-4b-it | 4.3B | Instruct | Full-tuning |
| GLM-4-9B-Chat | 9.4B | Instruct | LoRA |
| Kimi-VL-A3B-Instruct | 16.4B (2.8B active) | Vision | LoRA |
| Kimi-VL-A3B-Thinking | 16.4B (2.8B active) | Vision | LoRA |
| Phi-4 | 14.7B | Math | LoRA |
| Qwen3.5-27B | 27.0B | Instruct | LoRA |
```
┌──────────────────────────────────────────────────────────────────┐
│                          oxRL Framework                          │
├────────────────────────────────┬─────────────────────────────────┤
│ RL Path (main_rl.py)           │ SL Path (main_sl.py)            │
│ SGRPO / GSPO / CISPO / PPO     │ SFT / DPO / ORPO / KTO          │
│ RLHF / RLAIF                   │ CPT / KD / RM / RFT             │
│ Ray actors + vLLM rollouts     │ OnlineDPO / SPIN / IPO / SimPO  │
│                                │ DeepSpeed distributed training  │
├────────────────────────────────┴─────────────────────────────────┤
│ oxrl/algs/     Algorithms      │ oxrl/rollouts/  vLLM + Replay   │
│ oxrl/configs/  Pydantic cfg    │ oxrl/rewards/   Verifiable      │
│ oxrl/datasets/ HF loaders      │ oxrl/utils/     Setup + Logs    │
└──────────────────────────────────────────────────────────────────┘
```
- Scout Agent: Discovers model metadata and ensures `chat_template` compatibility.
- Multimodal Pipeline: Converts base64 images/audio into PIL/NumPy for vLLM rollouts.
- LoRA Lifecycle: Train with adapters, save with gathered ZeRO-3 weights, and auto-strip PEFT prefixes for immediate vLLM compatibility.
- Verifiable Rewards: Programmatic verification of CoT tags and mathematical correctness.
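The prefix-stripping step in the LoRA lifecycle can be illustrated with plain dict manipulation. A minimal sketch under the assumption that the wrapper prefix is `base_model.model.` (the common PEFT convention); oxRL's actual implementation may handle more cases:

```python
def strip_peft_prefixes(state_dict: dict) -> dict:
    """Remove PEFT wrapper prefixes so keys match the base model's names.

    'base_model.model.' is the usual prefix PEFT adds when wrapping a
    model; vLLM expects the unwrapped key names when loading weights.
    """
    prefix = "base_model.model."
    cleaned = {}
    for key, value in state_dict.items():
        if key.startswith(prefix):
            key = key[len(prefix):]
        cleaned[key] = value
    return cleaned

# Toy state dict: one wrapped key, one already-clean key.
wrapped = {
    "base_model.model.model.layers.0.self_attn.q_proj.weight": 1,
    "lm_head.weight": 2,
}
cleaned = strip_peft_prefixes(wrapped)
```

After stripping, `model.layers.0.self_attn.q_proj.weight` matches the base checkpoint's naming and can be consumed by vLLM directly.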
```bash
# From source (recommended for development)
git clone https://github.com/warlockee/oxRL.git
cd oxRL
pip install -e .

# Or from PyPI
pip install oxrl
```

To run the test suite:

```bash
pip install pytest
pytest tests/test_bugs.py -v
```

Before starting a long training run, verify your environment (GPUs, CUDA Toolkit, DeepSpeed, Ray) with our diagnostic tool:

```bash
oxrl doctor
```

oxRL uses YAML config files. See `oxrl/configs/rl_args.yaml` (RL) and `oxrl/configs/sl_args.yaml` (SL) for all available options with documentation. Example configs are in `registry/examples/`.
Key environment variables:
- `OXRL_DATA_DIR` — Override default data directory (default: `./data`)
- `OXRL_CHECKPOINT_DIR` — Override default checkpoint directory (default: `./checkpoints`)
- `HF_TOKEN` — HuggingFace token for gated models
- `GITHUB_TOKEN` — For autonomous bug reporting (optional)
```yaml
# config.yaml
model:
  name: "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
lora:
  enabled: true
reward:
  reward_func: "reasoning_reward_func"
data:
  dataset: "openr1_math"
```

```bash
python main_rl.py --config-file config.yaml
```

| Algorithm | File | When to use |
|---|---|---|
| SGRPO | `oxrl/algs/grpo.py` | Default for dense models. Token-level clipped surrogate, no critic needed. |
| GSPO | `oxrl/algs/grpo.py` | MoE models (Qwen3-MoE, DeepSeek-V3). Sequence-level ratios absorb routing noise between vLLM and HF/DeepSpeed. |
| CISPO | `oxrl/algs/grpo.py` | When SGRPO shows reward hacking or instability. Clipped ratio as detached weight on log-prob — more conservative. |
| PPO | `oxrl/algs/ppo.py` | When you need fine-grained credit assignment. Full PPO with value head + GAE. ~2x memory cost. |
| RLHF | `oxrl/algs/grpo.py` | Alias for SGRPO. Use for readability with reward-model setups. |
| RLAIF | `oxrl/algs/grpo.py` | Alias for SGRPO. Use for readability with AI-feedback setups. |
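The token-level clipped surrogate shared by SGRPO and PPO can be sketched without any framework code. A toy single-token illustration with `eps=0.2` — the real implementation in `oxrl/algs/grpo.py` operates on batched tensors, so this is a conceptual sketch only:

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate loss for one token.

    ratio = pi_new(token) / pi_old(token). The loss takes the pessimistic
    (min) of the unclipped and clipped terms, so the policy cannot gain
    from pushing the ratio far outside [1 - eps, 1 + eps].
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return -min(ratio * advantage, clipped * advantage)

# Positive advantage: small probability increase is rewarded as-is...
small_update = clipped_surrogate(-1.1, -1.2, advantage=1.0)  # ratio ≈ 1.105, unclipped
# ...but a large increase is capped at ratio 1 + eps = 1.2.
large_update = clipped_surrogate(-0.2, -1.2, advantage=1.0)  # ratio ≈ 2.718, clipped
```

The clipping keeps updates trust-region-like without a critic, which is why the table above describes SGRPO as "no critic needed".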
| Algorithm | File | Description |
|---|---|---|
| SFT | `oxrl/algs/sft.py` | Supervised Fine-Tuning — Cross-entropy loss with masking and normalization. |
| DPO | `oxrl/algs/dpo.py` | Direct Preference Optimization — Pairwise preference learning with a reference model. |
| ORPO | `oxrl/algs/orpo.py` | Odds Ratio Preference Optimization — Reference-free preference alignment via log-odds. |
| KTO | `oxrl/algs/kto.py` | Kahneman-Tversky Optimization — Prospect-theory-inspired alignment with moving-average KL baseline. |
| CPT | `oxrl/algs/cpt.py` | Continued Pre-Training — Full-sequence language modeling on domain-specific text. |
| KD | `oxrl/algs/kd.py` | Knowledge Distillation — Teacher-student training with combined CE and KL divergence loss. |
| RM | `oxrl/algs/rm.py` | Reward Model Training — Bradley-Terry pairwise ranking with a learned scalar head. |
| OnlineDPO | `oxrl/algs/online_dpo.py` | Online DPO — DPO with on-the-fly rejection sampling in the data pipeline. |
| RFT | `oxrl/algs/rft.py` | Rejection Sampling Fine-Tuning — SFT on reward-filtered responses above a threshold. |
| SPIN | `oxrl/algs/spin.py` | Self-Play Improvement — DPO where rejected samples are the model's own prior outputs. |
| IPO | `oxrl/algs/ipo.py` | Identity Preference Optimization — Squared-loss variant of DPO for improved stability. |
| SimPO | `oxrl/algs/simpo.py` | Simple Preference Optimization — Reference-free, length-normalized preference alignment. |
| CPO | `oxrl/algs/cpo.py` | Contrastive Preference Optimization — Reference-free DPO + behavioral cloning regularizer (ICML 2024). |
| AlphaPO | `oxrl/algs/alphapo.py` | Generalizes SimPO with nonlinear reward shaping parameter alpha. |
| R-DPO | `oxrl/algs/rdpo.py` | Robust DPO — Length regularization to prevent length exploitation. |
| cDPO | `oxrl/algs/cdpo.py` | Conservative DPO — Label smoothing for noisy preference data. |
| SPO | `oxrl/algs/spo.py` | Self Preference Optimization — SiLU-based bounded loss (EMNLP 2025). |
| DPNLL | `oxrl/algs/dpnll.py` | DPO + NLL — Adds NLL on chosen to prevent chosen probability collapse. |
| MinorDPO | `oxrl/algs/minor_dpo.py` | Clamped reject penalty — stops penalizing once pi_l < pi_ref_l. |
| C2DPO | `oxrl/algs/c2dpo.py` | Constrained Controlled DPO — Quadratic penalty constraining deviation from reference. |
| AlphaDPO | `oxrl/algs/alpha_dpo.py` | Alpha-divergence DPO — Generalizes KL to alpha-divergence. |
| AOT | `oxrl/algs/aot.py` | Alignment via Optimal Transport — Wasserstein-based preference alignment. |
| APO | `oxrl/algs/apo.py` | Anchored Preference Optimization — Anchored to reference with adaptive margin. |
| BCO | `oxrl/algs/bco.py` | Binary Classifier Optimization — Trains binary classifier on preference pairs. |
| BetaDPO | `oxrl/algs/betadpo.py` | Dynamic Beta DPO — Adaptive beta scheduling (NeurIPS 2024). |
| BPO | `oxrl/algs/bpo.py` | Balanced Preference Optimization — Balances chosen/rejected gradients. |
| CalDPO | `oxrl/algs/caldpo.py` | Calibrated DPO — Calibration-aware preference learning (NeurIPS 2024). |
| ChiPO | `oxrl/algs/chipo.py` | Chi-squared Preference Optimization — Chi-squared divergence variant. |
| CPOSimPO | `oxrl/algs/cposimpo.py` | CPO + SimPO hybrid — Reference-free with length normalization. |
| DiscoPOP | `oxrl/algs/discopop.py` | Discovery of Optimal PO — Learns optimal loss function. |
| DPOP | `oxrl/algs/dpop.py` | DPO-Positive — Prevents chosen probability decrease. |
| DPOShift | `oxrl/algs/dposhift.py` | DPO with shifted baseline for improved stability. |
| DRDPO | `oxrl/algs/drdpo.py` | Distributionally Robust DPO — Minimax formulation over uncertainty set. |
| EXO | `oxrl/algs/exo.py` | Efficient Exact Optimization — Closed-form preference optimization. |
| FDPO | `oxrl/algs/fdpo.py` | Filtered DPO — Filters pairs by quality margin before training. |
| FocalPO | `oxrl/algs/focalpo.py` | Focal Preference Optimization — Focal loss weighting for hard examples. |
| GPO | `oxrl/algs/gpo.py` | Generalized Preference Optimization — Parameterized loss family. |
| HDPO | `oxrl/algs/hdpo.py` | Hybrid DPO — Combines multiple DPO objectives. |
| Hinge | `oxrl/algs/hinge.py` | Hinge loss preference optimization — SVM-style margin loss. |
| NCA | `oxrl/algs/nca.py` | Noise Contrastive Alignment — InfoNCE-based preference learning. |
| ODPO | `oxrl/algs/odpo.py` | Offset DPO — Adds learned offset to preference margin. |
| RobustDPO | `oxrl/algs/robust_dpo.py` | Robust DPO — Outlier-resistant with Huber-style loss. |
| SamPO | `oxrl/algs/sampo.py` | Sample-weighted Preference Optimization — Importance-weighted pairs. |
| SPPO | `oxrl/algs/sppo.py` | Self-Play Preference Optimization — Iterative self-play alignment. |
| WPO | `oxrl/algs/wpo.py` | Weighted Preference Optimization — Quality-weighted preference pairs. |
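Most of the preference algorithms above are variants of one core objective. A pure-Python sketch of the vanilla DPO loss for a single preference pair — the summed log-probabilities are treated as scalars, and `beta=0.1` is a typical literature default, not necessarily oxRL's:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Vanilla DPO: -log sigmoid(beta * margin), where margin compares how
    much the policy prefers chosen over rejected relative to the frozen
    reference model.

    Each argument is a summed log-probability of a full response under the
    policy (logp_*) or the reference (ref_*).
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy prefers the chosen response more than the reference does,
# the margin is positive and the loss drops below log(2) ≈ 0.693.
loss = dpo_loss(-10.0, -12.0, ref_chosen=-11.0, ref_rejected=-11.0)
```

Variants like IPO swap the log-sigmoid for a squared loss, SimPO drops the `ref_*` terms and length-normalizes, and cDPO smooths the implicit labels — which is why so many of them share one file-per-loss structure in `oxrl/algs/`.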
All reward functions share the signature `(prompt_ids, response_ids, finish_reason, metadata) → (rewards, is_per_token)`. Set via `reward_func` in your config YAML.
| Function | Signal | When to use |
|---|---|---|
| `default_reward_func` | Binary (EOS check) | Sanity checks or when reward comes from an external source. |
| `gsm8k_reward_func` | Binary | GSM8K and grade-school math with numeric answers. |
| `math_reward_func` | Binary | MATH dataset / competition math with `\boxed{}` answers. |
| `soft_math_reward_func` | Graduated (1.0/0.5/0.2) | Math tasks where binary reward is too sparse. Switch to binary once accuracy > ~20%. |
| `code_reward_func` | Binary | MBPP / HumanEval code-gen. Runs code against test cases. Requires `test_cases` in metadata. |
| `format_reward_func` | 0–1.0 (0.25 steps) | Instruction-following / style alignment without ground-truth answers. |
| `mcqa_reward_func` | Binary | MMLU-Pro / multiple-choice QA benchmarks. |
| `reasoning_reward_func` | 0–1.0 (tags + correctness) | DeepSeek-R1 style chain-of-thought training. Rewards `<think>` + `<answer>` tags. |
| `multimodal_reward_func` | 0–1.0 | Vision/audio tasks. Correctness + 0.2 fallback for modality awareness. |
| `rm_reward_func` | Continuous | RLHF with a trained reward model. Requires `reward_model_path` in config. |
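Following the shared signature, a custom reward function is just a callable. A hypothetical sketch — the `text` and `answer` metadata keys are illustrative assumptions (a real implementation would decode `response_ids` with the tokenizer); see `oxrl/rewards/base.py` for the actual contract:

```python
def keyword_reward_func(prompt_ids, response_ids, finish_reason, metadata):
    """Hypothetical sequence-level reward: 1.0 if the response contains
    the expected answer string from metadata, else 0.0.

    Returns (rewards, is_per_token); is_per_token=False means one scalar
    reward per sequence rather than one reward per token.
    """
    rewards = []
    for resp_ids, meta in zip(response_ids, metadata):
        # In practice, decode resp_ids with the tokenizer here; this sketch
        # assumes a pre-decoded "text" field in metadata for illustration.
        text = meta.get("text", "")
        rewards.append(1.0 if meta.get("answer", "") in text else 0.0)
    return rewards, False

rewards, per_token = keyword_reward_func(
    prompt_ids=[[1, 2]],
    response_ids=[[3, 4]],
    finish_reason=["stop"],
    metadata=[{"text": "The answer is 56.", "answer": "56"}],
)
```

Once registered, such a function would be selected via `reward_func` in the config YAML, like the built-ins above.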
```
oxRL/
├── oxrl/                # Core Framework Package
│   ├── trainer.py       # High-level Trainer API
│   ├── rewards/         # Verifiable reasoning and coding rewards (math, code, etc.)
│   ├── algs/            # 51 algorithm implementations (see tables above)
│   ├── swarm/           # Autonomous model onboarding (Scout, Bugfixer)
│   ├── preprocessing/   # Reasoning (OpenR1), Multimodal (Vision/Audio) preprocessors
│   ├── rollouts/        # vLLM inference with structured prompt support
│   └── datasets/        # Dataset loaders and samplers
├── main_rl.py           # RL training loop (Ray + DeepSpeed)
├── main_sl.py           # SL training loop (DeepSpeed) — 12 algorithms
├── registry/examples/   # Example configs for all 18 algorithms
├── examples/            # Ready-to-use recipes and training scripts
└── pyproject.toml       # Packaging and Installation
```
Debuggability over Pipelining. oxRL avoids complex async pipelining to ensure that failure states are 100% reproducible and logs are clear.
Robust Environment Handling. oxRL is designed to work even in constrained environments. It automatically handles common CUDA/DeepSpeed mismatches by providing actionable warnings instead of fatal crashes.
Autonomous Bug Reporting. On framework failure, oxRL provides structured diagnostic signals for AI agents to automatically generate and submit GitHub issues (requires GITHUB_TOKEN environment variable).
LoRA-first for 7B+. We default to LoRA for larger models to enable high-quality research on consumer-grade and restricted high-end hardware.
Verification-driven RL. We prioritize datasets where the reward is verifiable (Math, Code, Format) to drive logical discovery.
This repository is optimized for LLM-assisted development (Claude/Gemini). If you are asking an AI to work on this framework, refer them to these "High-Signal" files:
- Bug Reporting: See `BUG_REPORTING.md` for instructions on autonomous issue submission.
- Adding a New Algorithm: See `oxrl/algs/base.py` (Base Class) and `oxrl/algs/grpo.py` (Implementation).
- Adding a Reward Function: Add to `oxrl/rewards/` using the signature in `oxrl/rewards/base.py`.
- Changing Model Loading: See `oxrl/utils/setup.py` -> `load_model_and_ref`.
- Training Logic: RL loop in `main_rl.py`, SL loop in `main_sl.py`.
- Config Validation: Logic is in `oxrl/configs/load.py`.
Contributions are welcome. Please follow the existing architectural patterns and style.
Check out the FAQ for details on LoRA merging and Multimodal input formatting.
There is a growing claim in the community that the concept of a "rollout" is outdated and that inference servers are the new paradigm. We disagree. Here is our analysis of the debate and why oxRL is designed the way it is.
"Rollout is outdated. Inference server is the new trend."
An inference server is not a replacement for rollouts — it is an implementation strategy for rollouts. Every RL training loop for LLMs still performs rollouts: the policy model autoregressively generates trajectories that are scored and used for gradient updates. The question is only where and how that generation happens.
| | Co-located (e.g. veRL HybridEngine) | Disaggregated (e.g. OpenRLHF, StreamRL) |
|---|---|---|
| Rollout runs on | Same GPUs as training | Separate inference server |
| Weight sync | In-memory NCCL | Network transfer |
| Scaling | Coupled | Independent |
Both approaches still generate rollouts. The "inference server" camp is running rollouts on a dedicated vLLM/SGLang process instead of in-process. That is an infrastructure decision, not a conceptual paradigm shift.
- NVIDIA is scaling rollouts up, not phasing them out. BroRL pushes rollout counts from N=16 to N=512 per prompt and demonstrates this breaks through performance ceilings where step-scaling cannot. If rollouts were obsolete, the latest NVIDIA research would not make them the primary scaling axis.
- Disaggregation has real costs. Serialization overhead, network bandwidth for weight sync, and added orchestration complexity. Co-located approaches like veRL's HybridEngine avoid weight transfer overhead entirely by keeping generation and training on the same devices.
- The motivation for disaggregation is agentic RL, not rollout obsolescence. Co-located rollouts break down when you need multi-turn tool use, because synchronous batch processing cannot handle variable-latency tool calls. That is a real constraint for agentic workloads — but it is a specific use case, not a universal indictment of rollouts.
- "Inference server" is rebranding. Putting an HTTP endpoint in front of your rollout worker does not change the fundamental computation. StreamRL achieves a 2.66x throughput improvement through stream generation and skewness-aware scheduling — engineering wins on the rollout pipeline, not alternatives to it.
oxRL uses co-located vLLM rollouts by design. For standard RL post-training (GRPO, PPO, RLHF) on math, reasoning, and coding tasks — which is the vast majority of practical post-training — this architecture is simpler, avoids network overhead, and produces fully reproducible training runs. We prioritize debuggability over pipelining.
If and when agentic multi-turn RL with tool use becomes the dominant training paradigm, disaggregated inference will be the right call. Until then, saying "rollout is outdated" is like saying "compilation is outdated because we use build servers now." The build server does the compilation. The inference server does the rollout. The abstraction changed; the computation did not.
If you find oxRL useful in your research, please cite our paper:
```bibtex
@article{oxrl2026,
  title={Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales with Scale-Dependent Ranking Inversions},
  author={Li, Xiaoyi},
  journal={arXiv preprint arXiv:2603.19335},
  year={2026},
  url={https://arxiv.org/abs/2603.19335}
}
```