This is the official code repository for the paper *Towards Autonomous Mechanistic Reasoning in Virtual Cells*.
Abstract: Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building upon this, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with verifier-based filtering to generate and validate mechanistic reasoning autonomously. Using this framework, we release the VC-Traces dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of multi-agent generation and rigorous verification.
- Input: Perturbation + Context data (JSON)
- Preprocessing: NER entity extraction (HunFlair2) + PubChem synonym search
- Multi-Agent: Report and structured explanation generation with external tools
```bash
uv sync
```

Create a `.env` file (or export the variables) with:

```bash
# Required
DATA_DIR=data/curation_v1  # Path to directory with action_primitives.json, templates/, mondo.json

# Required for KG tool
STARK_PRIMEKG_DIR=/path/to/stark_prime_kg  # Directory with edge_index.pt, node_info.json, etc. (https://stark.stanford.edu/dataset_prime.html)
DRUGBANK_XML_PATH=/path/to/full_database.xml  # DrugBank XML (optional, for chemical entity matching)

# Required for LLM providers
ANTHROPIC_API_KEY=...  # For Anthropic Claude
# or
OPENAI_API_KEY=...     # For OpenAI
```

Clone the repositories and download the PubMed database (~34GB fully built) and binary search data (~2.5GB). If the download process does not work well, refer to https://github.com/domluna/pubmedFastRAG.git for a full explanation:
```bash
git clone https://github.com/kyunghyuncho/pubmed-vectors.git
git clone https://github.com/domluna/pubmedFastRAG.git

# Download PubMed SQLite database (takes a while)
python pubmed-vectors/download_pubmed.py

# Download binary RAG search data (from Google Drive, ~2.5GB)
pip install gdown
gdown "1LuCaUcILQuQgkDm3_tWBWr4X7AQ518kX" -O pubmedFastRAG/bindata.zip
cd pubmedFastRAG && unzip bindata.zip -d bindata/ && rm bindata.zip && cd ..

# Move database to data/
mv pubmed_data.db data/pubmed_data.db
```

Install Python dependencies for the embed server:

```bash
pip install fastapi uvicorn einops
```

Install Julia (if not already installed) and resolve Julia packages:

```bash
cd pubmedFastRAG
julia --project=. -e 'using Pkg; Pkg.update(); Pkg.precompile()'
cd ..
```

Finally, unzip `data/mondo.json.zip`.
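Before running the pipeline, it can help to sanity-check that the `.env` variables are set. A minimal, self-contained sketch (the variable names match the `.env` keys above; the `missing_vars` helper itself is hypothetical, not part of the repo):

```python
# Hypothetical helper for checking the .env keys listed above.
REQUIRED = ["DATA_DIR"]  # always needed
OPTIONAL = ["STARK_PRIMEKG_DIR", "DRUGBANK_XML_PATH",
            "ANTHROPIC_API_KEY", "OPENAI_API_KEY"]

def missing_vars(env, names):
    """Return the names that are unset or empty in the given environment mapping."""
    return [n for n in names if not env.get(n)]

# Demo against a toy environment (pass os.environ in real use):
env = {"DATA_DIR": "data/curation_v1", "ANTHROPIC_API_KEY": "sk-..."}
print(missing_vars(env, REQUIRED))  # []
print(missing_vars(env, OPTIONAL))  # ['STARK_PRIMEKG_DIR', 'DRUGBANK_XML_PATH', 'OPENAI_API_KEY']
```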
The input is a JSON list of perturbation records. Some fields may be missing.
```json
[
  {
    "index": 0,
    "perturbation": {
      "context": [
        {
          "perturbation type": "soluble factor",
          "description": "Soluble factor addition of VEGF",
          "cell_type": "N/A",
          "disease_model": "Angiogenic factor/tumors",
          "cell type": null,
          "subtype": null
        }
      ],
      "perturbations": [
        {
          "type": "chemical",
          "smiles": "CN1CCN(CC1)CC(=O)N(C)C2=CC=C(C=C2)N=C(C3=CC=CC=C3)C4=C(NC5=C4C=CC(=C5)C(=O)OC)O",
          "name": "Nintedanib",
          "target": "VEGFR",
          "moa_type": "antibody",
          "known targets": []
        }
      ]
    }
  }
]
```

This step extracts NER entities and PubChem compound information from the input perturbations:
```bash
INPUT_PATH=data/example_input.json
PREPROCESSED_PATH=data/example_preprocessed.json
DATA_DIR=data/curation_v1 uv run script/preprocess.py --input_path "$INPUT_PATH" --preprocessed_path "$PREPROCESSED_PATH"
```

The preprocessed output adds `perturbation_entity` and `context_entity` fields (NER results) and an optional `pubchem_info` field for chemical perturbations.
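An input file can also be written programmatically. A minimal sketch following the schema above (field values are copied from the example record; optional fields are simply omitted, and the output path here is arbitrary):

```python
import json
import os
import tempfile

# One record following the schema above; optional fields may be omitted or set to null.
record = {
    "index": 0,
    "perturbation": {
        "context": [{
            "perturbation type": "soluble factor",
            "description": "Soluble factor addition of VEGF",
            "cell_type": "N/A",
        }],
        "perturbations": [{
            "type": "chemical",
            "name": "Nintedanib",
            "target": "VEGFR",
        }],
    },
}

# The file must be a JSON *list* of records.
path = os.path.join(tempfile.gettempdir(), "example_input.json")
with open(path, "w") as f:
    json.dump([record], f, indent=2)
```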
The agent generates both a report and a structured explanation using external tools:
| Tool | Description | Requires |
|---|---|---|
| `pubmed-fast-ner` | PubMed paper search via NER entities | Julia RAG server (ports 8002, 8003) |
| `kg-ner` | KG node lookup via NER entities | `STARK_PRIMEKG_DIR` env var |
| `harmonizome` | Gene/gene-set information | Internet access |
| `wikipedia` | Wikipedia articles for NER entities | Internet access |
```bash
# 1) Start PubMed servers (skip if not using pubmed-fast-ner)
( cd pubmedFastRAG && uv run embed.py --port 8002 --device cpu ) &
EMBED_PID=$!
until nc -z 127.0.0.1 8002; do sleep 1; done

( cd pubmedFastRAG && julia --project=. -t auto -e \
  'using Pkg; Pkg.instantiate(); include("rag.jl"); rag = RAGServer("../data/pubmed_data.db"); start_server(rag; port=8003)' ) &
JULIA_PID=$!
until nc -z 127.0.0.1 8003; do sleep 1; done

# 2) Run LLM agent
DATA_DIR=data/curation_v1 uv run script/generate.py \
  --experiment_name multi_tool_order \
  --wandb_mode disabled \
  --mode report-explain \
  --model_type anthropic \
  --tool_list '["pubmed-fast-ner", "kg-ner", "harmonizome", "wikipedia"]' \
  --folder_name multi_tool \
  --pert_path "$PREPROCESSED_PATH" \
  --kg_with_rel

# 3) Stop servers (optional)
kill $EMBED_PID $JULIA_PID || true
```

| Argument | Default | Description |
|---|---|---|
| `--model_type` | `anthropic` | LLM provider (`anthropic`, `openai`, `gemini`); uses litellm with API keys in `.env` |
| `--tool_list` | `["pubmed-fast-ner", "kg-ner", "harmonizome", "wikipedia"]` | JSON list of tools to use |
| `--mode` | `report-explain` | `report-explain` or `explain-only` |
| `--kg_with_rel` | `true` | Include KG relation info |
| `--wandb_mode` | `disabled` | W&B logging (`online`, `offline`, `disabled`) |
| `--max_items` | `0` | Limit perturbations to process (0 = all) |
| `--pert_path` | `` | Preprocessed perturbation file (result of the preprocessing step) |
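Note that `--tool_list` takes a JSON array passed as a single shell-quoted string. A minimal sketch of how such a flag can be parsed (a hypothetical `argparse` setup, not the repo's actual parser):

```python
import argparse
import json

# Hypothetical sketch: accept a JSON-encoded list as a CLI flag by using
# json.loads as the argparse type converter.
parser = argparse.ArgumentParser()
parser.add_argument("--tool_list", type=json.loads,
                    default=["pubmed-fast-ner", "kg-ner", "harmonizome", "wikipedia"])
parser.add_argument("--max_items", type=int, default=0)

args = parser.parse_args(['--tool_list', '["kg-ner", "wikipedia"]'])
print(args.tool_list)  # ['kg-ner', 'wikipedia'] -- a real Python list, not a string
```

This is why the shell example wraps the list in single quotes: the inner double quotes must reach the parser intact for the JSON decode to succeed.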
- `notebooks/data-generation/data_generation.ipynb`: interactive data generation walkthrough
```bash
lsof -iTCP:8002 -sTCP:LISTEN -n -P
lsof -iTCP:8003 -sTCP:LISTEN -n -P
lsof -tiTCP:8002 -sTCP:LISTEN | xargs -r kill
lsof -tiTCP:8003 -sTCP:LISTEN | xargs -r kill
```

Ensure Julia is installed and on `PATH`, and that `pubmed_data.db` exists under `data/`.
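The `nc -z` readiness checks can also be done from Python if `nc` is unavailable; a small stdlib-only sketch:

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (same idea as `nc -z`)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# True only once the embed server is actually listening on 8002.
print(port_open("127.0.0.1", 8002))
```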
Set `STARK_PRIMEKG_DIR` to the directory containing `edge_index.pt`, `node_info.json`, and the other KG files. Set `DATA_DIR` to the directory containing `mondo.json` and `action_primitives.json`.
