The AI Research Science Benchmark (AIRS-Bench) is an eval that quantifies the autonomous research abilities of LLM agents in machine learning. AIRS-Bench comprises 20 tasks drawn from state-of-the-art machine learning papers spanning diverse domains such as NLP, code, math, biochemical modelling and time series forecasting.
Each task is specified by a <problem, dataset, metric> triplet and a SOTA value. The problem defines the core computational challenge to be solved (e.g. text similarity); the dataset specifies which data to solve the challenge over (e.g. SICK); the metric quantifies performance (e.g. Spearman Correlation); finally, the SOTA value is the state-of-the-art value of the metric achieved by humans (i.e. reported in a published paper). The agent receives the full task specification and is expected to develop a solution that, in most cases, writes predictions for the test set to a submission file (e.g. submission.csv), which is then scored and compared against the SOTA solution.
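To make the task specification concrete, the <problem, dataset, metric> triplet plus the SOTA value can be modelled as a small record. This is an illustrative sketch, not the benchmark's actual data model; the SICK field values come from the example above, except the SOTA number, which is a placeholder rather than the value reported in the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    problem: str   # core computational challenge to be solved
    dataset: str   # which data to solve the challenge over
    metric: str    # how performance is quantified
    sota: float    # human SOTA value of the metric

# SICK example from the text; the SOTA number is a placeholder.
sick_task = TaskSpec(
    problem="text similarity",
    dataset="SICK",
    metric="Spearman Correlation",
    sota=0.90,  # placeholder, not the real reported value
)
```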
The following image provides an overview of the TextualSimilaritySickSpearmanCorrelation task, a specification of which is provided by the files under the TextualSimilaritySickSpearmanCorrelation folder.
We sourced tasks from 17 different machine learning papers and 16 datasets across a wide range of categories, a distribution of which appears below:
We evaluated a variety of agents across these tasks, where in our setting we define an agent as a pair consisting of a large language model (LLM) and a scaffold. A scaffold comprises a set of mechanisms, such as operators and search algorithms, that enable the LLM to explore the solution space effectively. Scaffolds are instantiated by a harness, which serves as a system that encapsulates the agent and manages its research process. The environment provides the agent with the problem specifications, as well as any constraints and resources available for its exploration. The picture below illustrates the interplay between agents, scaffolds and harnesses, as well as how they relate to the problem and solution spaces.
In order to evaluate the impact of different scaffolds on agentic task performance, we tested agents using both linear and parallel harness frameworks. For the linear case, we measured the performance of a ReAct scaffold built on the open-source MLGym framework. For the parallel case, we benchmarked two scaffolds, One-Shot and Greedy, using the open-source aira-dojo parallel harness framework. One-Shot agents attempt to solve the problem only once, whereas Greedy agents perform a tree-based best-first search to tackle the task. Agents are powered by LLMs, such as Meta's open-weights Code World Model (CWM).
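As a rough intuition for how a Greedy scaffold explores the solution space, a tree-based best-first search repeatedly expands the most promising candidate solution found so far. The sketch below is a generic illustration of that search pattern, not the aira-dojo implementation; the function names and the integer toy example are hypothetical:

```python
import heapq

def greedy_best_first(root, expand, score, budget):
    """Best-first search: repeatedly expand the highest-scoring candidate.

    root   -- initial solution candidate
    expand -- maps a candidate to a list of child candidates
    score  -- maps a candidate to a number (higher is better)
    budget -- maximum number of expansions
    """
    best = root
    # Max-heap via negated scores; a counter breaks ties so candidates
    # themselves are never compared.
    frontier = [(-score(root), 0, root)]
    counter = 1
    for _ in range(budget):
        if not frontier:
            break
        _, _, node = heapq.heappop(frontier)
        for child in expand(node):
            if score(child) > score(best):
                best = child
            heapq.heappush(frontier, (-score(child), counter, child))
            counter += 1
    return best

# Toy example: search integers for the value closest to 5.
result = greedy_best_first(0, lambda n: [n + 1, n + 2],
                           lambda n: -abs(n - 5), budget=10)
```

In the real setting, candidates would be code solutions, `expand` would apply LLM-driven operators, and `score` would come from evaluating a submission.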
To evaluate benchmark performance we designed a normalized score metric

$$\hat{s}_{t,a} = \frac{s_{t,a} - s_t^{\text{base}}}{s_t^{\text{SOTA}} - s_t^{\text{base}}},$$

where $s_{t,a}$ is the raw metric value achieved by agent $a$ on task $t$, with $s_t^{\text{SOTA}}$ the human SOTA value for the task and $s_t^{\text{base}}$ the baseline score, taken as the worst-performing run on the task, so that the baseline maps to a normalized score of 0 and SOTA to 1.
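Per-task scores are normalized so that the task baseline (the worst-performing run) maps to 0 and human SOTA maps to 1. A minimal sketch of that mapping, assuming a higher-is-better metric:

```python
def normalized_score(raw, baseline, sota):
    """Map a raw metric value so that baseline -> 0 and SOTA -> 1.

    Assumes a higher-is-better metric; for lower-is-better metrics the
    same formula applies once the sign convention is flipped.
    """
    if sota == baseline:
        raise ValueError("SOTA and baseline must differ")
    return (raw - baseline) / (sota - baseline)

normalized_score(0.85, 0.5, 0.9)  # 0.875
```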
Combining the ReAct, One-Shot and Greedy scaffolds with a range of LLMs yielded 14 different agents, the performance of which is compared in the figure below:
Finally, the graphic below depicts average normalized scores, with each row corresponding to an AIRS-Bench task and each point to an agent’s normalized score for that task averaged across multiple seeds. For each task, the outcome of the worst-performing run is used as the baseline score (normalized score of 0), whereas SOTA always corresponds to a normalized score of 1. Tasks are ranked in decreasing order according to the average score across all agents. One can see that tasks are of varying difficulty, ranging from agents struggling to submit any solution to agents routinely surpassing human SOTA.
The leaderboard below shows benchmark performance for the range of agents we evaluated. We welcome further contributions from the agentic AI research community, especially work built on open components (both the scaffold and the LLM) that can be inspected and extended end to end.
| Agent | Avg. norm. score | # seeds | Date |
|---|---|---|---|
| Greedy gpt-oss-120b | 0.402 ± 0.031 | 10 | 2026-02-16 |
| Greedy gpt-oss-20b | 0.400 ± 0.032 | 10 | 2026-02-16 |
| Greedy o3-mini | 0.391 ± 0.022 | 10 | 2026-02-16 |
| Greedy GPT-4o | 0.309 ± 0.028 | 10 | 2026-02-16 |
| MLGym CWM | 0.302 ± 0.026 | 10 | 2026-02-16 |
| Greedy CWM | 0.287 ± 0.026 | 10 | 2026-02-16 |
| Greedy Devstral | 0.179 ± 0.021 | 10 | 2026-02-16 |
| MLGym GPT-4o | 0.178 ± 0.025 | 10 | 2026-02-16 |
| One-Shot o3-mini | 0.171 ± 0.017 | 20 | 2026-02-16 |
| One-Shot gpt-oss-120b | 0.161 ± 0.020 | 20 | 2026-02-16 |
| One-Shot gpt-oss-20b | 0.077 ± 0.019 | 20 | 2026-02-16 |
| One-Shot GPT-4o | 0.057 ± 0.015 | 20 | 2026-02-16 |
| One-Shot CWM | 0.041 ± 0.011 | 20 | 2026-02-16 |
| One-Shot Devstral | 0.018 ± 0.009 | 20 | 2026-02-16 |
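The ± values in the leaderboard report uncertainty across seeds. As an illustration of how such an entry could be computed, here is a sketch of a mean ± standard-error calculation over per-seed normalized scores (an assumption about the statistic used, not a confirmed description of the benchmark's code):

```python
import math

def mean_and_stderr(scores):
    """Return (mean, standard error) of per-seed normalized scores."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance (Bessel's correction), then standard error of the mean.
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(var / n)
```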
AIRS-Bench consists of 20 tasks whose definitions can be found under airsbench/tasks. For each task, we provide two specifications corresponding to two different AI research agent frameworks: one for aira-dojo (under airsbench/tasks/rad) and one for MLGym (under airsbench/tasks/mlgym).
```
📂 airs-bench/
┣ 📂 airsbench/tasks/
┃ ┣ 📂 rad/
┃ ┃ ┣ 📂 CodeGenerationAPPSPassAt5/
┃ ┃ ┃ ┣ 📄 evaluate_prepare.py
┃ ┃ ┃ ┣ 📄 evaluate.py
┃ ┃ ┃ ┣ 📄 metadata.yaml
┃ ┃ ┃ ┣ 📄 prepare.py
┃ ┃ ┃ ┣ 📄 project_description.md
┃ ┃ ┃ ┗ 📄 utils.py
┃ ┃ ┣ 📂 CodeRetrievalCodeXGlueMRR/
┃ ┃ ┃ ...
┃ ┃ ┗ 📂 U0MolecularPropertyPredictionQm9MeanAbsoluteError/
┃ ┗ 📂 mlgym/
┃   ┣ 📂 CodeGenerationAPPSPassAt5/
┃   ┃ ...
┃   ┗ 📂 U0MolecularPropertyPredictionQm9MeanAbsoluteError/
┣ 📂 datasets/
┣ 📂 images/
┣ 📂 notebooks/
┣ 📂 scripts/
┣ 📄 README.md
┗ 📄 pyproject.toml
```
Each task is specified using the following files:
- `metadata.yaml` contains core information about the task, such as its name, a pointer to its HuggingFace dataset along with train/test splits, the research problem, the evaluation metric and SOTA information
- `project_description.md` contains the task prompt provided to the agent, with information about the objective of the task along with dataset and evaluation details
- `prepare.py` contains the dataset preparation logic, so that the agent has access to all the data needed to iterate on a solution but no access to the solution itself (i.e. it extracts input features and labels for the train set and input features only for the test set)
- `evaluate.py` contains the evaluation script used to score the agent's submission against the test data
- `evaluate_prepare.py` contains the dataset preparation logic needed to evaluate the agent's submission (i.e. labels for the test set)
- `utils.py` is an optional file that consolidates code shared between `prepare.py`, `evaluate.py` and `evaluate_prepare.py`
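For illustration only, a task's `metadata.yaml` might look like the following. The field names and values here are hypothetical sketches based on the description above, not the benchmark's actual schema:

```yaml
# Hypothetical sketch -- not the real AIRS-Bench schema.
name: TextualSimilaritySickSpearmanCorrelation
problem: text similarity
dataset:
  hf_path: sick          # hypothetical HuggingFace dataset pointer
  train_split: train
  test_split: test
metric: spearman_correlation
sota:
  value: 0.90            # placeholder, not the reported value
  source: reported in the source paper
```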
The above task specification can be directly ingested by the aira-dojo agentic harness and can be programmatically converted into task definition files for other frameworks. We provide a conversion script below that transforms the files above into task definition files for the MLGym agentic framework.
```shell
python scripts/converter_rad_mlgym_enhanced.py airsbench/tasks/rad/TextualClassificationSickAccuracy
```
Download the train and test data for each of the tasks using:

```shell
pip install datasets==3.6.0
./datasets/download_hf_datasets_text.sh datasets/datasets_download_location/
```
The datasets download location (e.g. `datasets/datasets_download_location/` in the example above) is used as the `--global-shared-data-dir` argument in `prepare.py` and `evaluate_prepare.py`.
Our AI research agents are powered by the open-source aira-dojo and MLGym agentic frameworks. For One-Shot/Greedy agents, check out the aira-dojo installation guide; for ReAct agents, see the MLGym setup instructions.
Install airsbench locally using:

```shell
git clone git@github.com:facebookresearch/airs-bench.git
cd airs-bench
conda create -n airsbench python=3.12
conda activate airsbench
pip install -e .
```
Please cite using the following BibTeX entry:
```bibtex
@article{lupidi2026airsbenchsuitetasksfrontier,
  title={AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents},
  author={Alisia Lupidi and Bhavul Gauri and Thomas Simon Foster and Bassel Al Omari and Despoina Magka and Alberto Pepe and Alexis Audran-Reiss and Muna Aghamelu and Nicolas Baldwin and Lucia Cipolina-Kun and Jean-Christophe Gagnon-Audet and Chee Hau Leow and Sandra Lefdal and Hossam Mossalam and Abhinav Moudgil and Saba Nazir and Emanuel Tewolde and Isabel Urrego and Jordi Armengol Estape and Amar Budhiraja and Gaurav Chaurasia and Abhishek Charnalia and Derek Dunfield and Karen Hambardzumyan and Daniel Izcovich and Martin Josifoski and Ishita Mediratta and Kelvin Niu and Parth Pathak and Michael Shvartsman and Edan Toledo and Anton Protopopov and Roberta Raileanu and Alexander Miller and Tatiana Shavrina and Jakob Foerster and Yoram Bachrach},
  year={2026},
  eprint={2602.06855},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.06855},
}
```
This codebase is released under the CC BY-NC 4.0 license.