Olmix

Warning

This project is under active development. We are migrating from our internal infrastructure to open source, so expect rough edges, missing docs, and breaking changes.

Toolkit for optimizing pretraining data mixtures. Learns from small-scale proxy experiments ("swarms") to predict how data mixing ratios affect downstream performance, then proposes optimized mixtures for full-scale training.

End-To-End Mixing Flow

+--------------------------------------------------------+
| Define Data Sources                                    |
+--------------------------------------------------------+
             |
             v
+--------------------------------------------------------+
| Compute Priors                                         |
| (olmix priors compute --config ...)                    |
+--------------------------------------------------------+
             |
             v
+--------------------------------------------------------+
| Generate Swarm Variants                                |
| (olmix generate --config ... --base ... --output ...)  |
+--------------------------------------------------------+
             |
             v
+--------------------------------------------------------+
| Launch Proxy Runs                                      |
| (olmix launch run --variants ...)                      |
+--------------------------------------------------------+
             |
             v
+--------------------------------------------------------+
| Collect + Export ratios.csv + metrics.csv              |
+--------------------------------------------------------+
             |
             v
+--------------------------------------------------------+
| Fit Regressors + Propose Optimized Mix                 |
| (olmix fit --config ... --output-dir ...)              |
+--------------------------------------------------------+
             |
             v
+--------------------------------------------------------+
| Use Mix for Full-Scale Training                        |
+--------------------------------------------------------+

Installation

git clone https://github.com/allenai/olmix.git
cd olmix
uv pip install -e ".[dev]"

Quickstart

Choose one path:

Pattern 1 (fastest): you already have swarm results in CSV, and want optimized mix weights.
Pattern 2 (end-to-end): you want to generate variants, launch proxy runs, then fit.

Pattern 1: CSV -> Fit

Prepare:

ratios.csv with one row per run and domain columns summing to ~1.0.
metrics.csv with one row per run and evaluation metric columns.
A fit config YAML (see configs/examples/fit/example.yaml).

Minimal example files:

# ratios.csv
run,name,index,dclm,wikipedia,arxiv
hz0dfydj,my-swarm-0000,0,0.45,0.30,0.25
pj0hxxl7,my-swarm-0001,1,0.60,0.20,0.20
sqleanmq,my-swarm-0002,2,0.33,0.33,0.34

# metrics.csv
run,name,index,arc_challenge_bpb,hellaswag_bpb,mmlu_stem_bpb
hz0dfydj,my-swarm-0000,0,1.23,0.87,1.45
pj0hxxl7,my-swarm-0001,1,1.15,0.91,1.38
sqleanmq,my-swarm-0002,2,1.20,0.89,1.42
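
Before fitting, it can help to sanity-check the two files. The sketch below is not part of Olmix. It assumes the column layout of the example files above (`run`, `name`, and `index` as identifier columns, every other column numeric), and the `check_swarm` helper and its 0.01 tolerance are hypothetical choices made for illustration.

```python
import csv
import io

ID_COLUMNS = ("run", "name", "index")  # identifier columns, as in the example files


def read_rows(text):
    """Parse CSV text into a list of dicts, skipping blank and '#' comment lines."""
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    return list(csv.DictReader(io.StringIO("\n".join(lines))))


def check_swarm(ratios_text, metrics_text, tol=0.01):
    """Return a list of problems; an empty list means the files look consistent."""
    ratios, metrics = read_rows(ratios_text), read_rows(metrics_text)
    problems = []
    # Each row of domain weights should sum to roughly 1.0.
    for row in ratios:
        total = sum(float(v) for k, v in row.items() if k not in ID_COLUMNS)
        if abs(total - 1.0) > tol:
            problems.append(f"{row['run']}: ratios sum to {total:.3f}")
    # Every run should appear in both files.
    ratio_runs = {r["run"] for r in ratios}
    metric_runs = {m["run"] for m in metrics}
    for run in sorted(ratio_runs ^ metric_runs):
        problems.append(f"{run}: present in only one file")
    return problems
```

To check files on disk, read them first, for example `check_swarm(open("ratios.csv").read(), open("metrics.csv").read())`.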

Run:

olmix fit --config configs/examples/fit/example.yaml --output-dir output/my_fit

Pattern 2: Generate -> Launch -> Fit

Compute priors for your generation config:

olmix priors compute --config configs/examples/generate/example.yaml

Generate launch variants:

olmix generate \
  --config configs/examples/generate/example.yaml \
  --base configs/examples/launch/data_proportions/mix_baseline.yaml \
  --output output/my_variants/
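
For intuition about what a swarm of variants looks like, here is a minimal sketch that samples mixtures around a set of priors. It is an illustration only: how Olmix actually samples variants is determined by the generation config and is not described in this README. Dirichlet sampling is used here as one common way to draw mixtures near a prior, and the `sample_mixture` helper, the prior values, and the `concentration` parameter are all hypothetical.

```python
import random


def sample_mixture(priors, concentration, rng):
    """Draw one mixture from a Dirichlet distribution centred on `priors`.

    A Dirichlet sample is a set of independent Gamma draws normalised to sum to 1.
    A larger `concentration` keeps samples closer to the priors.
    """
    draws = {name: rng.gammavariate(concentration * p, 1.0) for name, p in priors.items()}
    total = sum(draws.values())
    return {name: value / total for name, value in draws.items()}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
priors = {"dclm": 0.6, "wikipedia": 0.25, "arxiv": 0.15}  # hypothetical priors
swarm = [sample_mixture(priors, concentration=20.0, rng=rng) for _ in range(4)]
for mix in swarm:
    print({name: round(weight, 3) for name, weight in mix.items()})
```

Each sampled mixture sums to 1 and would correspond to one row of ratios.csv once the proxy run finishes.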

Launch the swarm:

olmix launch run --variants output/my_variants/

Export swarm outputs to ratios.csv and metrics.csv, then fit:

olmix fit --config configs/examples/fit/example.yaml --output-dir output/my_fit
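
The export step depends on where your runs log their results, so the following is only a sketch of the target format. It assumes you have already gathered each run's mixture and metrics into in-memory records; the record layout (`mix`, `metrics`) and the `write_table` helper are hypothetical, while the column layout follows the example files in Pattern 1.

```python
import csv
import io

ID_COLUMNS = ["run", "name", "index"]


def write_table(runs, field, out):
    """Write one row per run: identifier columns, then the keys of run[field]."""
    columns = list(runs[0][field])  # column order follows the first run
    writer = csv.DictWriter(out, fieldnames=ID_COLUMNS + columns, lineterminator="\n")
    writer.writeheader()
    for index, run in enumerate(runs):
        writer.writerow({"run": run["run"], "name": run["name"], "index": index, **run[field]})


# Hypothetical records; in practice these come from wherever your runs log results.
runs = [
    {"run": "hz0dfydj", "name": "my-swarm-0000",
     "mix": {"dclm": 0.45, "wikipedia": 0.30, "arxiv": 0.25},
     "metrics": {"arc_challenge_bpb": 1.23, "hellaswag_bpb": 0.87, "mmlu_stem_bpb": 1.45}},
    {"run": "pj0hxxl7", "name": "my-swarm-0001",
     "mix": {"dclm": 0.60, "wikipedia": 0.20, "arxiv": 0.20},
     "metrics": {"arc_challenge_bpb": 1.15, "hellaswag_bpb": 0.91, "mmlu_stem_bpb": 1.38}},
]

ratios_csv, metrics_csv = io.StringIO(), io.StringIO()
write_table(runs, "mix", ratios_csv)
write_table(runs, "metrics", metrics_csv)
print(ratios_csv.getvalue())
```

To write to disk instead of memory, pass a file opened with `open("ratios.csv", "w", newline="")` as `out`.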

Required Inputs (At A Glance)

For fit: a YAML config with swarm.ratios, swarm.metrics, and priors.
For generate: a generation config with data, priors, swarm, and max_tokens.
For launch: generated variant YAML files plus Beaker environment access.
For end-to-end: a way to export run mixtures and eval metrics to CSV for fitting.
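
A quick way to catch a missing field before running a command is to check the loaded config for the keys listed above. This sketch assumes the dotted names (such as `swarm.ratios`) map to nested YAML keys and that the YAML has already been loaded into a plain dict; the `missing_keys` helper and the sample config are hypothetical, and the example configs under configs/examples/ remain the reference for the real layout.

```python
def missing_keys(config, required):
    """Return the dotted paths from `required` that are absent from a nested dict."""
    missing = []
    for path in required:
        node = config
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(path)
                break
            node = node[part]
    return missing


FIT_KEYS = ["swarm.ratios", "swarm.metrics", "priors"]
GENERATE_KEYS = ["data", "priors", "swarm", "max_tokens"]

# Hypothetical config, as it might look after loading the YAML into a dict.
fit_config = {"swarm": {"ratios": "ratios.csv"}, "priors": {}}
print(missing_keys(fit_config, FIT_KEYS))  # ['swarm.metrics']
```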

What You Get

A hashed fit output directory with config.json for reproducibility.
Regression diagnostics (*_fit.png, *_correlations.json, interaction_matrix.*).
Proposed optimal mixture files (*_optimal.json, *_optimal.png) unless fit_only: true.
Launch metadata under output/mixes/... when using olmix launch run.
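
To see which of these artifacts a fit actually produced, you can group the files in the output directory by the patterns listed above. This is a sketch under assumptions: the `collect_outputs` helper is hypothetical, and it searches recursively because the exact nesting of the hashed directory is not specified here.

```python
from pathlib import Path

# Glob patterns taken from the list above.
PATTERNS = {
    "config": ["config.json"],
    "diagnostics": ["*_fit.png", "*_correlations.json", "interaction_matrix.*"],
    "proposed_mix": ["*_optimal.json", "*_optimal.png"],
}


def collect_outputs(output_dir):
    """Group files under `output_dir` (searched recursively) by artifact kind."""
    root = Path(output_dir)
    return {
        kind: sorted(p.relative_to(root).as_posix() for pat in patterns for p in root.rglob(pat))
        for kind, patterns in PATTERNS.items()
    }
```

For example, `collect_outputs("output/my_fit")` returns a dict whose `proposed_mix` entry is empty when the fit ran with `fit_only: true`.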

CLI And Config Reference

Detailed arguments and config-field breakdowns live in docs/cli/.

olmix fit is the canonical fit command. olmix-fit is a legacy compatibility alias.

Development

make run-checks   # format + lint + typecheck + test

Citation

@article{chen2026olmix,
  title={Olmix: A Framework for Data Mixing Throughout LM Development},
  author={Chen, Mayee F and Murray, Tyler and Heineman, David and Jordan, Matt and Hajishirzi, Hannaneh and Re, Christopher and Soldaini, Luca and Lo, Kyle},
  year={2026},
  month={February}
}

License

Apache 2.0
