Announcement: Accepted to the ICLR 2026 Agents in the Wild Workshop on March 1, 2026. 🎉
ResearchGym

Evaluating Language Model Agents on Real-World AI Research

Aniketh Garikaparthi¹   Manasi Patwardhan¹   Arman Cohan²
¹TCS Research   ²Yale University
5 | 39
Tasks · Sub-tasks
3 | 3
Scaffolds · Seeds
$10–$20
Budget per Run
12–24h
Autonomous Runs
ACL, ICML, ICLR oral & spotlight
Objective SOTA evaluation
Runs on a single GPU
Full ML research cycle

Leaderboard

Best scores are normalized against state-of-the-art baselines. Higher is better; 100% means the agent matched SOTA.

# Agent Model CL MDT CMR TIM IRB Avg
1 RG-Agent gpt-5-high 94.3% 49.4% 96.3% 107.2% 34.3% 76.3%
2 Codex gpt-5.2-codex xhigh 97.9% 49.0% 96.6% 17.1% 49.7% 62.1%
3 Claude Code claude-opus-4.5 98.5% -- -- -- 21.7% 24.0%
CL = Continual Learning · MDT = Materials Tokenization · CMR = Cross-Modal Retrieval · TIM = Time-Series Explanation · IRB = Improving Replay Buffers · "--" = no valid submission
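The normalization behind the leaderboard can be read as a simple ratio of the agent's raw metric to the published SOTA number. A minimal sketch, using hypothetical scores rather than the table's actual raw values:

```python
def normalize(agent_score: float, sota_score: float) -> float:
    """Express an agent's raw metric as a percentage of the SOTA baseline.

    100.0 means the agent matched SOTA; values above 100 mean it exceeded
    SOTA (as with the 107.2% TIM entry in the leaderboard).
    """
    return 100.0 * agent_score / sota_score

# Hypothetical raw scores, for illustration only (not the paper's numbers).
print(round(normalize(0.82, 0.87), 1))  # → 94.3
```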

Research Tasks

Each task mirrors the full ML research cycle: problem understanding, approach design, implementation, and experimentation with fixed compute.

Vision

Continual Learning

Scalable continual learning for foundation models without rehearsal. ImageNet-R/A, CIFAR-100, CUB-200.

Metrics: Accuracy, AAA · Baseline: InfLoRA 86.75% · Peak: 91% of SOTA
Vision-Lang

Cross-Modal Retrieval

Query shift adaptation in cross-modal retrieval. COCO-C, Flickr-C with 16 image + 15 text corruptions.

Metric: Recall@1 · Baseline: EATA 54.4% · Peak: 96% of SOTA
RL

Improving Replay Buffers

Memory systems for efficient experience replay. DeepMind Control Suite, OpenAI Gym environments.

Metric: Average Return · Baseline: SynthER 727 · Peak: 34% of SOTA
NLP / Sci

Materials Tokenization

Domain-preserving tokenization for materials science. SciBERT backbone, MatSci-NLP benchmark.

Metrics: Micro-F1, Macro-F1 · Baseline: PickyBPE 75.6% · Peak: 48% of SOTA
XAI

Time-Series Explanation

Directionally-aware explanations for time series. PAM, Boiler, Epilepsy, Wafer, Freezer datasets.

Metrics: CPD, AUP, AUR · Baseline: IG 0.573 CPD · Peak: 107% (exceeded SOTA)

Agent Scaffolds

Three agent scaffolds, each paired with a frontier model backend, cover the research capabilities evaluated in ResearchGym.

RG-Agent

gpt-5-high

Primary research agent with comprehensive tooling for file operations, code execution, and experiment management.

OpenAI Scaffold

gpt-5.2-codex xhigh

OpenAI's coding scaffold used for the Codex-based runs in ResearchGym.

Anthropic Scaffold

claude-opus-4.5

Anthropic's coding scaffold used for the Claude Code runs in ResearchGym.

Architecture

Modular adapter pattern with unified interfaces for agents, runtimes, and evaluation systems.

Entry Point — CLI, Workspace, Resume
↓
Agent Adapters — prepare_workspace() → run()
↓
Runtime Systems — UV Local | Docker
↓
AgenticEnv — Gym-like Interface
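The adapter layer above might look roughly like the following sketch. Only prepare_workspace() and run() come from the document; the class names, signatures, and return shape are assumptions for illustration:

```python
from abc import ABC, abstractmethod
from pathlib import Path


class AgentAdapter(ABC):
    """Unified interface each scaffold (RG-Agent, Codex, Claude Code) plugs into."""

    @abstractmethod
    def prepare_workspace(self, task_dir: Path, workspace: Path) -> None:
        """Copy task files and scaffold config into an isolated workspace."""

    @abstractmethod
    def run(self, workspace: Path, budget_usd: float, hours: float) -> dict:
        """Execute the agent loop; return a run summary (cost, artifacts, ...)."""


class EchoAdapter(AgentAdapter):
    """Trivial adapter used here only to show the prepare -> run call order."""

    def prepare_workspace(self, task_dir: Path, workspace: Path) -> None:
        workspace.mkdir(parents=True, exist_ok=True)

    def run(self, workspace: Path, budget_usd: float, hours: float) -> dict:
        return {"cost_usd": 0.0, "workspace": str(workspace)}


adapter = EchoAdapter()
adapter.prepare_workspace(Path("tasks/test/continual-learning"), Path("/tmp/rg-ws"))
result = adapter.run(Path("/tmp/rg-ws"), budget_usd=20.0, hours=12)
```

Keeping the two phases separate lets the runtime (UV local or Docker) stage the workspace once and then resume run() later without re-preparing.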
💰 Budget Enforcement

Real-time cost tracking with per-model pricing via LiteLLM
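Per-call cost accounting against a fixed cap can be sketched as below. The dollar rates here are placeholders for illustration; a real run would look prices up via LiteLLM's pricing tables rather than hard-coding them:

```python
class BudgetTracker:
    """Accumulate per-call LLM cost and flag when the run budget is exhausted."""

    # Placeholder ($/1M input tokens, $/1M output tokens) rates — assumptions,
    # not LiteLLM's actual numbers.
    PRICES = {"openai/gpt-5-high": (1.25, 10.00)}

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def charge(self, model: str, prompt_tokens: int, completion_tokens: int) -> float:
        in_rate, out_rate = self.PRICES[model]
        cost = (prompt_tokens * in_rate + completion_tokens * out_rate) / 1_000_000
        self.spent_usd += cost
        return cost

    def exhausted(self) -> bool:
        return self.spent_usd >= self.budget_usd


tracker = BudgetTracker(budget_usd=20.0)
tracker.charge("openai/gpt-5-high", prompt_tokens=100_000, completion_tokens=20_000)
```

Checking exhausted() before every model call is what turns the $10–$20 budget into a hard stop rather than a post-hoc report.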

🔄 Resume & Continuation

Session state preservation with transcript-based context recovery
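Transcript-based recovery can be as simple as replaying a JSON-lines log of prior turns to rebuild the message context. A minimal sketch; the file layout and helper names are assumptions, not ResearchGym's actual format:

```python
import json
from pathlib import Path


def save_turn(transcript: Path, role: str, content: str) -> None:
    """Append one conversation turn to the transcript as a JSON line."""
    with transcript.open("a") as f:
        f.write(json.dumps({"role": role, "content": content}) + "\n")


def resume_context(transcript: Path) -> list[dict]:
    """Rebuild the message list from the transcript so a run can continue."""
    if not transcript.exists():
        return []
    return [json.loads(line) for line in transcript.read_text().splitlines() if line]


log = Path("/tmp/rg-transcript.jsonl")
log.unlink(missing_ok=True)
save_turn(log, "user", "Implement the baseline.")
save_turn(log, "assistant", "Cloning the repo and running uv sync.")
messages = resume_context(log)
```

Append-only JSON lines survive a crash mid-run: every completed turn is already on disk when the session is resumed.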

๐Ÿณ

Docker Isolation

Containerized execution with deterministic environments
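Containerized execution might translate to a docker invocation along these lines. The image name, mount point, and inner entry script are assumptions for illustration:

```python
def docker_command(image: str, workspace: str, hours: float) -> list[str]:
    """Build a `docker run` command that exposes only the agent workspace."""
    return [
        "docker", "run", "--rm",
        "-v", f"{workspace}:/workspace",  # only the workspace is mounted
        "-w", "/workspace",
        image,
        # Hypothetical inner entry point; enforces the wall-clock limit.
        "timeout", f"{int(hours * 3600)}s", "python", "agent_entry.py",
    ]


cmd = docker_command("researchgym:latest", "/tmp/rg-ws", hours=12)
```

Mounting a single directory and deleting the container afterwards (`--rm`) is what makes repeated runs deterministic: nothing an agent writes outside `/workspace` survives.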

📊 Comprehensive Logging

Transcripts, cost summaries, and execution artifacts

🔒 Workspace Security

Path-bounded file operations with process isolation
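Path-bounded file operations usually come down to one check: resolve every agent-supplied path and refuse anything that escapes the workspace root. A sketch of that guard (the function name is an assumption):

```python
from pathlib import Path


def safe_resolve(workspace: Path, requested: str) -> Path:
    """Resolve an agent-supplied path, rejecting escapes via `..`,
    symlinks, or absolute paths outside the workspace root."""
    root = workspace.resolve()
    target = (root / requested).resolve()
    if not target.is_relative_to(root):  # pathlib, Python 3.9+
        raise PermissionError(f"path escapes workspace: {requested}")
    return target


ws = Path("/tmp/rg-ws")
ok = safe_resolve(ws, "src/train.py")
```

Because `Path.resolve()` normalizes `..` segments and follows symlinks before the containment check, a request like `../../etc/passwd` fails even though it is a relative path.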

Citation

If you use ResearchGym in your research, please cite our paper.

ResearchGym: Evaluating Language Model Agents on Real-World AI Research
Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan
arXiv preprint, 2026
BibTeX
@misc{garikaparthi2026researchgymevaluatinglanguagemodel,
      title={ResearchGym: Evaluating Language Model Agents on Real-World AI Research}, 
      author={Aniketh Garikaparthi and Manasi Patwardhan and Arman Cohan},
      year={2026},
      eprint={2602.15112},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.15112}, 
}

Quick Start

Requires Python 3.12+ and the uv package manager.

# Clone and setup
git clone https://github.com/Anikethh/ResearchGym.git
cd ResearchGym
uv sync

# Run an agent on a task
python run_agent.py tasks/test/continual-learning rg-agent \
    --runtime docker \
    --model openai/gpt-5-high \
    --basic_hours 12