Skip to content

EdisonScientific/labbench2

Repository files navigation

LABBench2

Paper CI Coverage License: CC BY-SA 4.0 Python Go

Overview

Overview

LABBench2 is a benchmark for measuring real-world capabilities of AI systems performing scientific research tasks. It is an evolution of the Language Agent Biology Benchmark (LAB-Bench), comprising nearly 1,900 tasks that measure similar capabilities but in more realistic contexts.

LABBench2 provides a meaningful jump in difficulty over LAB-Bench (model-specific accuracy differences range from −26% to −46% across subtasks), underscoring continued room for improvement. LABBench2 aims to be a standard benchmark for evaluating and advancing AI capabilities in scientific research.

This repository provides a public evaluation harness for running LABBench2 evaluations against any model or agent system. The task dataset is available at huggingface.co/datasets/EdisonScientific/labbench2.


Evaluation Harness

Installation

Note: Go 1.21+ is required for cloning questions validation.

git clone [email protected]:EdisonScientific/labbench2.git
cd labbench2
uv sync

Development setup (optional):

uv sync --extra dev && uv run pre-commit install
Running Evals

Quick Start

export HF_TOKEN=your-huggingface-token
export ANTHROPIC_API_KEY=your-key
uv run python -m evals.run_evals --agent anthropic:claude-opus-4-5 --tag seqqa2 --limit 5

CLI Options

Option Description
--agent AGENT Agent to evaluate (see Agent Formats below)
--tag TAG Filter by problem type (see tags below)
--mode MODE File processing: file (default), inject, or retrieve
--limit N Limit number of questions
--parallel N Parallel workers (default: 30)
--ids ID [...] Filter by specific question IDs
--ids-file FILE Load question IDs from file (one per line)
--report-path FILE Output path for report JSON file
--retry-from FILE Retry failed IDs from a previous report, saves as *_retry.json

Available tags: cloning, dbqa2, figqa2, figqa2-img, figqa2-pdf, litqa3, patentqa, protocolqa2, seqqa2, sourcequality, suppqa2, tableqa2, tableqa2-img, tableqa2-pdf, trialqa

Agent Formats

The --agent flag supports three formats:

1. Pydantic-AI Modelsprovider:model[@flags]

--agent anthropic:claude-opus-4-5              # Basic
--agent anthropic:claude-opus-4-5@tools        # All tools (WebSearch, CodeExecution, WebFetch)
--agent anthropic:claude-opus-4-5@search       # WebSearch only
--agent anthropic:claude-opus-4-5@code         # CodeExecution only
--agent anthropic:claude-opus-4-5@high         # High reasoning effort
--agent anthropic:claude-opus-4-5@tools,high   # Combine flags

2. Native SDK Runnersnative:provider:model[@flags]

Uses provider SDKs directly for better file handling.

--agent native:anthropic:claude-opus-4-5
--agent native:openai-responses:gpt-5.2
--agent native:openai-completions:gpt-5.2
--agent native:google-vertex:gemini-3-pro-preview

3. Custom Runnersexternal:path/to/runner.py:ClassName

--agent external:./external_runners/edison_analysis_runner.py:EdisonAnalysisRunner

File Processing Modes

Mode Description
file Upload files via API with smart routing (default)
inject Concatenate text file contents into prompt
retrieve Instruct agent to retrieve from external sources

Smart routing (file mode): PDFs/images always go to context. Other files go to filesystem when supported by the runner.

Runner Filesystem Support
Anthropic (native SDK) Yes (with @tools/@code)
OpenAI (native SDK) Yes (with @tools/@code)
Google (native SDK) No (context only)
Pydantic-AI No (context only)

Examples

# Anthropic with tools and high effort
# Requires: export ANTHROPIC_API_KEY=your-key
uv run python -m evals.run_evals \
  --agent anthropic:claude-opus-4-5@tools,high \
  --tag seqqa2

# OpenAI with tools
# Requires: export OPENAI_API_KEY=your-key
uv run python -m evals.run_evals \
  --agent openai-responses:gpt-5.2@tools \
  --tag seqqa2

# Google Vertex AI with search
# Requires: gcloud auth application-default login
#           export GOOGLE_CLOUD_PROJECT=your-project-id
#           export GOOGLE_CLOUD_LOCATION=global
uv run python -m evals.run_evals \
  --agent google-vertex:gemini-3-pro-preview@search \
  --tag seqqa2

# Native runner
uv run python -m evals.run_evals \
  --agent native:anthropic:claude-opus-4-5 \
  --tag figqa2

# Custom runner
uv run python -m evals.run_evals \
  --agent external:./external_runners/edison_analysis_runner.py:EdisonAnalysisRunner \
  --tag seqqa2
Evaluating Custom Agents

To evaluate a custom agent, create a class implementing the AgentRunner protocol (typed method signatures available there):

# my_runner.py
import os
from evals.runners import AgentResponse


class MyRunner:
    def __init__(self):
        self.api_url = os.environ.get("AGENT_API_URL")

    async def upload_files(self, files, gcs_prefix=None):
        """Upload files to your agent's backend. Returns: dict mapping path -> reference."""
        return {str(f): f"ref:{f.name}" for f in files}

    async def execute(self, question, file_refs=None):
        """Call your agent and return the answer."""
        return AgentResponse(text="answer")

    def extract_answer(self, response):
        """Parse response. Default returns response.text."""
        return response.text

    async def download_outputs(self, dest_dir):
        """Download agent-generated files (e.g., primers) to dest_dir. Returns list of filenames."""
        return []

    async def cleanup(self):
        """Clean up resources."""
        pass
uv run python -m evals.run_evals --agent external:./my_runner.py:MyRunner --tag seqqa2

See external_runners/edison_analysis_runner.py for a complete example.


Additional Details

Reports

By default, evaluation reports are saved to assets/reports/{tag}/{mode}/{model}.json (and .txt). You can specify a custom output path using the --report-path option.

The reports used in the paper are available at assets/reports_paper/. Paper results were generated using native SDK runners (native:provider:model) for direct API access and better file handling support.

Reproducing Paper Evaluations

To run the same evaluations as in the paper for a different agent:

./run_evals.sh <agent> [options]

# Examples
./run_evals.sh native:anthropic:claude-opus-4-5
./run_evals.sh native:openai-responses:gpt-5.2 --limit 1
./run_evals.sh 'external:./my_runner.py:MyAgent' -j 4 -w 10

The script runs all tag/mode combinations from the paper for the specified agent. Run ./run_evals.sh --help to see all options.

Report Summarization

To generate a summary table:

uv run python evals/summarize_report.py assets/reports_paper/seqqa2/file/claude-opus-4-5.json

Pass --show-failed-outputs to display unique error messages from failed tasks.

Multiple reports can be merged (later files patch earlier ones). This is useful when combining original reports with retries from --retry-from:

uv run python evals/summarize_report.py original.json original_retry.json
Data

Evaluation data (PDFs, images, bioinformatics files etc.) is hosted on Google Cloud Storage. Files are downloaded and cached locally on-demand when running evals, so no manual data setup is required. Cache is stored at ~/.cache/labbench2.


Citation

If you use LABBench2 in your research, please cite:

@article{labbench2_2026,
  title={LABBench2: An Improved Benchmark for AI Systems Performing Biology Research},
  author={Jon M. Laurent and Albert Bou and Michael Pieler and Conor Igoe and Alex Andonian and Siddharth Narayanan and James Braza and Alexandros Sanchez Vassopoulos and Jacob L. Steenwyk and Blake Lash and Andrew D. White and Samuel G. Rodriques},
  year={2026},
  url={https://github.com/EdisonScientific/labbench2}
}

Changelog

Notable changes to LABBench2 will be documented here. We expect to update the harness and dataset only in the case of clear issues, and do not intend to meaningfully change the benchmark over time.

2026-03-13 - We corrected an inadvertent data issue with sourcequality tasks. This has resulted in an entirely new set of 150 sourcequality tasks being incorporated into the dataset. We've also made a corresponding harness update to work with the new task structure. Published results have been updated accordingly.

About

labbench2

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages