LABBench2 is a benchmark for measuring real-world capabilities of AI systems performing scientific research tasks. It is an evolution of the Language Agent Biology Benchmark (LAB-Bench), comprising nearly 1,900 tasks that measure similar capabilities but in more realistic contexts.
LABBench2 provides a meaningful jump in difficulty over LAB-Bench (model-specific accuracy differences range from −26% to −46% across subtasks), underscoring continued room for improvement. LABBench2 aims to be a standard benchmark for evaluating and advancing AI capabilities in scientific research.
This repository provides a public evaluation harness for running LABBench2 evaluations against any model or agent system. The task dataset is available at huggingface.co/datasets/EdisonScientific/labbench2.
Installation
Note: Go 1.21+ is required for cloning questions validation.
git clone [email protected]:EdisonScientific/labbench2.git
cd labbench2
uv syncDevelopment setup (optional):
uv sync --extra dev && uv run pre-commit installRunning Evals
export HF_TOKEN=your-huggingface-token
export ANTHROPIC_API_KEY=your-key
uv run python -m evals.run_evals --agent anthropic:claude-opus-4-5 --tag seqqa2 --limit 5| Option | Description |
|---|---|
--agent AGENT |
Agent to evaluate (see Agent Formats below) |
--tag TAG |
Filter by problem type (see tags below) |
--mode MODE |
File processing: file (default), inject, or retrieve |
--limit N |
Limit number of questions |
--parallel N |
Parallel workers (default: 30) |
--ids ID [...] |
Filter by specific question IDs |
--ids-file FILE |
Load question IDs from file (one per line) |
--report-path FILE |
Output path for report JSON file |
--retry-from FILE |
Retry failed IDs from a previous report, saves as *_retry.json |
Available tags: cloning, dbqa2, figqa2, figqa2-img, figqa2-pdf, litqa3, patentqa, protocolqa2, seqqa2, sourcequality, suppqa2, tableqa2, tableqa2-img, tableqa2-pdf, trialqa
The --agent flag supports three formats:
1. Pydantic-AI Models — provider:model[@flags]
--agent anthropic:claude-opus-4-5 # Basic
--agent anthropic:claude-opus-4-5@tools # All tools (WebSearch, CodeExecution, WebFetch)
--agent anthropic:claude-opus-4-5@search # WebSearch only
--agent anthropic:claude-opus-4-5@code # CodeExecution only
--agent anthropic:claude-opus-4-5@high # High reasoning effort
--agent anthropic:claude-opus-4-5@tools,high # Combine flags2. Native SDK Runners — native:provider:model[@flags]
Uses provider SDKs directly for better file handling.
--agent native:anthropic:claude-opus-4-5
--agent native:openai-responses:gpt-5.2
--agent native:openai-completions:gpt-5.2
--agent native:google-vertex:gemini-3-pro-preview3. Custom Runners — external:path/to/runner.py:ClassName
--agent external:./external_runners/edison_analysis_runner.py:EdisonAnalysisRunner| Mode | Description |
|---|---|
file |
Upload files via API with smart routing (default) |
inject |
Concatenate text file contents into prompt |
retrieve |
Instruct agent to retrieve from external sources |
Smart routing (file mode): PDFs/images always go to context. Other files go to filesystem when supported by the runner.
| Runner | Filesystem Support |
|---|---|
| Anthropic (native SDK) | Yes (with @tools/@code) |
| OpenAI (native SDK) | Yes (with @tools/@code) |
| Google (native SDK) | No (context only) |
| Pydantic-AI | No (context only) |
# Anthropic with tools and high effort
# Requires: export ANTHROPIC_API_KEY=your-key
uv run python -m evals.run_evals \
--agent anthropic:claude-opus-4-5@tools,high \
--tag seqqa2
# OpenAI with tools
# Requires: export OPENAI_API_KEY=your-key
uv run python -m evals.run_evals \
--agent openai-responses:gpt-5.2@tools \
--tag seqqa2
# Google Vertex AI with search
# Requires: gcloud auth application-default login
# export GOOGLE_CLOUD_PROJECT=your-project-id
# export GOOGLE_CLOUD_LOCATION=global
uv run python -m evals.run_evals \
--agent google-vertex:gemini-3-pro-preview@search \
--tag seqqa2
# Native runner
uv run python -m evals.run_evals \
--agent native:anthropic:claude-opus-4-5 \
--tag figqa2
# Custom runner
uv run python -m evals.run_evals \
--agent external:./external_runners/edison_analysis_runner.py:EdisonAnalysisRunner \
--tag seqqa2Evaluating Custom Agents
To evaluate a custom agent, create a class implementing the AgentRunner protocol (typed method signatures available there):
# my_runner.py
import os
from evals.runners import AgentResponse
class MyRunner:
def __init__(self):
self.api_url = os.environ.get("AGENT_API_URL")
async def upload_files(self, files, gcs_prefix=None):
"""Upload files to your agent's backend. Returns: dict mapping path -> reference."""
return {str(f): f"ref:{f.name}" for f in files}
async def execute(self, question, file_refs=None):
"""Call your agent and return the answer."""
return AgentResponse(text="answer")
def extract_answer(self, response):
"""Parse response. Default returns response.text."""
return response.text
async def download_outputs(self, dest_dir):
"""Download agent-generated files (e.g., primers) to dest_dir. Returns list of filenames."""
return []
async def cleanup(self):
"""Clean up resources."""
passuv run python -m evals.run_evals --agent external:./my_runner.py:MyRunner --tag seqqa2See external_runners/edison_analysis_runner.py for a complete example.
Reports
By default, evaluation reports are saved to assets/reports/{tag}/{mode}/{model}.json (and .txt). You can specify a custom output path using the --report-path option.
The reports used in the paper are available at assets/reports_paper/. Paper results were generated using native SDK runners (native:provider:model) for direct API access and better file handling support.
Reproducing Paper Evaluations
To run the same evaluations as in the paper for a different agent:
./run_evals.sh <agent> [options]
# Examples
./run_evals.sh native:anthropic:claude-opus-4-5
./run_evals.sh native:openai-responses:gpt-5.2 --limit 1
./run_evals.sh 'external:./my_runner.py:MyAgent' -j 4 -w 10The script runs all tag/mode combinations from the paper for the specified agent. Run ./run_evals.sh --help to see all options.
Report Summarization
To generate a summary table:
uv run python evals/summarize_report.py assets/reports_paper/seqqa2/file/claude-opus-4-5.jsonPass --show-failed-outputs to display unique error messages from failed tasks.
Multiple reports can be merged (later files patch earlier ones). This is useful when combining original reports with retries from --retry-from:
uv run python evals/summarize_report.py original.json original_retry.jsonData
Evaluation data (PDFs, images, bioinformatics files etc.) is hosted on Google Cloud Storage. Files are downloaded and cached locally on-demand when running evals, so no manual data setup is required. Cache is stored at ~/.cache/labbench2.
If you use LABBench2 in your research, please cite:
@article{labbench2_2026,
title={LABBench2: An Improved Benchmark for AI Systems Performing Biology Research},
author={Jon M. Laurent and Albert Bou and Michael Pieler and Conor Igoe and Alex Andonian and Siddharth Narayanan and James Braza and Alexandros Sanchez Vassopoulos and Jacob L. Steenwyk and Blake Lash and Andrew D. White and Samuel G. Rodriques},
year={2026},
url={https://github.com/EdisonScientific/labbench2}
}Notable changes to LABBench2 will be documented here. We expect to update the harness and dataset only in the case of clear issues, and do not intend to meaningfully change the benchmark over time.
2026-03-13 - We corrected an inadvertent data issue with sourcequality tasks. This has resulted in an entirely new set of 150 sourcequality tasks being incorporated into the dataset. We've also made a corresponding harness update to work with the new task structure. Published results have been updated accordingly.
