
VCR-Agent (Towards Autonomous Mechanistic Reasoning in Virtual Cells)


This is the official code repository for the paper titled Towards Autonomous Mechanistic Reasoning in Virtual Cells.

Abstract: Large language models (LLMs) have recently gained significant attention as a promising approach to accelerating scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building upon this, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with verifier-based filtering to generate and validate mechanistic reasoning autonomously. Using this framework, we release the VC-Traces dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of multi-agent collaboration and rigorous verification.

Overview

  1. Input: Perturbation + Context data (JSON)
  2. Preprocessing: NER entity extraction (HunFlair2) + PubChem synonym search
  3. Multi-Agent: Report and structured explanation generation with external tools

Setup

1. Install Dependencies

uv sync

2. Environment Variables

Create a .env file (or export the variables directly) with:

# Required
DATA_DIR=data/curation_v1              # Path to directory with action_primitives.json, templates/, mondo.json

# Required for KG tool
STARK_PRIMEKG_DIR=/path/to/stark_prime_kg   # Directory with edge_index.pt, node_info.json, etc. (https://stark.stanford.edu/dataset_prime.html)
DRUGBANK_XML_PATH=/path/to/full_database.xml  # DrugBank XML (optional, for chemical entity matching)

# Required for LLM providers
ANTHROPIC_API_KEY=...                  # For Anthropic Claude
# or
OPENAI_API_KEY=...                     # For OpenAI
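As a quick sanity check before running anything, the environment above can be validated up front. The variable and file names below come from this README; the helper itself is an illustrative sketch, not part of the repo:

```python
import os
from pathlib import Path

def check_env():
    """Return a list of problems with the environment described above."""
    problems = []
    data_dir = os.environ.get("DATA_DIR", "")
    if not data_dir or not Path(data_dir, "action_primitives.json").exists():
        # DATA_DIR must point at the curation directory (action_primitives.json etc.)
        problems.append("DATA_DIR must contain action_primitives.json")
    kg_dir = os.environ.get("STARK_PRIMEKG_DIR", "")
    if kg_dir and not Path(kg_dir, "edge_index.pt").exists():
        # Only checked when the KG tool is configured
        problems.append("STARK_PRIMEKG_DIR is set but edge_index.pt was not found there")
    if not (os.environ.get("ANTHROPIC_API_KEY") or os.environ.get("OPENAI_API_KEY")):
        problems.append("set ANTHROPIC_API_KEY or OPENAI_API_KEY")
    return problems
```

An empty list means the required pieces are in place.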

3. PubMed RAG Server (required for pubmed-fast-ner tool)

Clone the repositories and download the PubMed SQLite database (~34 GB fully built) and the binary RAG search data (~2.5 GB). If the downloads fail, see https://github.com/domluna/pubmedFastRAG.git for detailed instructions:

git clone https://github.com/kyunghyuncho/pubmed-vectors.git
git clone https://github.com/domluna/pubmedFastRAG.git

# Download PubMed SQLite database (takes a while)
python pubmed-vectors/download_pubmed.py

# Download binary RAG search data (from Google Drive, ~2.5GB)
pip install gdown
gdown "1LuCaUcILQuQgkDm3_tWBWr4X7AQ518kX" -O pubmedFastRAG/bindata.zip
cd pubmedFastRAG && unzip bindata.zip -d bindata/ && rm bindata.zip && cd ..

# Move database to data/
mv pubmed_data.db data/pubmed_data.db

Install Python dependencies for the embed server:

pip install fastapi uvicorn einops

Install Julia (if not already installed) and resolve Julia packages:

cd pubmedFastRAG
julia --project=. -e 'using Pkg; Pkg.update(); Pkg.precompile()'
cd ..

Unzip data/mondo.json.zip.

Input Format

The input is a JSON list of perturbation records. Some fields may be missing.

[
  {
    "index": 0,
    "perturbation": {
      "context": [
        {
          "perturbation type": "soluble factor",
          "description": "Soluble factor addition of VEGF",
          "cell_type": "N/A",
          "disease_model": "Angiogenic factor/tumors",
          "cell type": null,
          "subtype": null
        }
      ],
      "perturbations": [
        {
          "type": "chemical",
          "smiles": "CN1CCN(CC1)CC(=O)N(C)C2=CC=C(C=C2)N=C(C3=CC=CC=C3)C4=C(NC5=C4C=CC(=C5)C(=O)OC)O",
          "name": "Nintedanib",
          "target": "VEGFR",
          "moa_type": "antibody",
          "known targets": []
        }
      ]
    }
  }
]
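Since some fields may be missing, it can help to normalize records when loading them. A minimal sketch, assuming the field names from the example above (the defaults chosen here are assumptions, not repo behavior):

```python
import json

def load_records(path):
    """Load perturbation records, filling in commonly missing fields."""
    with open(path) as f:
        records = json.load(f)
    for rec in records:
        ctx = rec.get("perturbation", {})
        for entry in ctx.get("context", []):
            # Context entries may omit cell/disease annotations entirely
            entry.setdefault("cell_type", None)
            entry.setdefault("disease_model", None)
        for pert in ctx.get("perturbations", []):
            # Chemical perturbations may lack SMILES or known targets
            pert.setdefault("smiles", None)
            pert.setdefault("known targets", [])
    return records
```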

Preprocessing

Extracts NER entities and PubChem compound information from the input perturbations:

INPUT_PATH=data/example_input.json
PREPROCESSED_PATH=data/example_preprocessed.json

DATA_DIR=data/curation_v1 uv run script/preprocess.py --input_path "$INPUT_PATH" --preprocessed_path "$PREPROCESSED_PATH"

The preprocessed output adds perturbation_entity and context_entity fields (NER results) and optional pubchem_info for chemical perturbations.
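To spot-check a preprocessed file, the added fields can be counted. The field names come from the paragraph above; treating perturbation_entity and context_entity as lists is an illustrative assumption about whatever preprocess.py actually emits:

```python
import json

def summarize_preprocessed(path):
    """Count NER entities and PubChem hits in a preprocessed file."""
    with open(path) as f:
        records = json.load(f)
    return {
        "records": len(records),
        "perturbation_entities": sum(len(r.get("perturbation_entity", [])) for r in records),
        "context_entities": sum(len(r.get("context_entity", [])) for r in records),
        # pubchem_info is only present for chemical perturbations
        "with_pubchem_info": sum(1 for r in records if r.get("pubchem_info")),
    }
```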

Multi-Agent

The agent generates both a report and a structured explanation using external tools.

Available Tools

| Tool | Description | Requires |
|---|---|---|
| pubmed-fast-ner | PubMed paper search via NER entities | Julia RAG server (ports 8002, 8003) |
| kg-ner | KG node lookup via NER entities | STARK_PRIMEKG_DIR env var |
| harmonizome | Gene/gene-set information | Internet access |
| wikipedia | Wikipedia articles for NER entities | Internet access |
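The --tool_list argument below takes a JSON list of these tool names. A hypothetical sketch of validating such a list against the table (the requirement strings are transcribed from the table; the function itself is not repo code):

```python
# Requirements per tool, transcribed from the table above
TOOL_REQUIREMENTS = {
    "pubmed-fast-ner": "Julia RAG server (ports 8002, 8003)",
    "kg-ner": "STARK_PRIMEKG_DIR env var",
    "harmonizome": "Internet access",
    "wikipedia": "Internet access",
}

def validate_tool_list(tools):
    """Reject unknown tool names and report what each selected tool needs."""
    unknown = [t for t in tools if t not in TOOL_REQUIREMENTS]
    if unknown:
        raise ValueError(f"unknown tools: {unknown}")
    return {t: TOOL_REQUIREMENTS[t] for t in tools}
```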

Running

# 1) Start PubMed servers (skip if not using pubmed-fast-ner)
( cd pubmedFastRAG && uv run embed.py --port 8002 --device cpu ) &
EMBED_PID=$!
until nc -z 127.0.0.1 8002; do sleep 1; done

( cd pubmedFastRAG && julia --project=. -t auto -e \
  'using Pkg; Pkg.instantiate(); include("rag.jl"); rag = RAGServer("../data/pubmed_data.db"); start_server(rag; port=8003)' ) &
JULIA_PID=$!
until nc -z 127.0.0.1 8003; do sleep 1; done

# 2) Run LLM agent
DATA_DIR=data/curation_v1 uv run script/generate.py \
  --experiment_name multi_tool_order \
  --wandb_mode disabled \
  --mode report-explain \
  --model_type anthropic \
  --tool_list '["pubmed-fast-ner", "kg-ner", "harmonizome", "wikipedia"]' \
  --folder_name multi_tool \
  --pert_path "$PREPROCESSED_PATH" \
  --kg_with_rel

# 3) Stop servers (optional)
kill $EMBED_PID $JULIA_PID || true
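If nc is unavailable, the `until nc -z` polling above can be done portably from Python. A sketch with an explicit timeout (illustrative, not part of the repo):

```python
import socket
import time

def wait_for_port(port, host="127.0.0.1", timeout=60.0):
    """Block until host:port accepts TCP connections, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Succeeds as soon as the server's listen socket is up
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    raise TimeoutError(f"port {port} did not open within {timeout}s")
```

For example, `wait_for_port(8002)` then `wait_for_port(8003)` before launching the agent.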

Key Arguments

| Argument | Default | Description |
|---|---|---|
| --model_type | anthropic | LLM provider (anthropic, openai, gemini); requests go through litellm using the API keys in .env |
| --tool_list | ["pubmed-fast-ner", "kg-ner", "harmonizome", "wikipedia"] | JSON list of tools to use |
| --mode | report-explain | report-explain or explain-only |
| --kg_with_rel | true | Include KG relation info |
| --wandb_mode | disabled | W&B logging (online, offline, disabled) |
| --max_items | 0 | Limit on perturbations to process (0 = all) |
| --pert_path | `` | Preprocessed perturbation file (result of the preprocessing step) |

Notebooks

  • notebooks/data-generation/data_generation.ipynb - Interactive data generation walkthrough

Troubleshooting

Ports already in use

lsof -iTCP:8002 -sTCP:LISTEN -n -P
lsof -iTCP:8003 -sTCP:LISTEN -n -P
lsof -tiTCP:8002 -sTCP:LISTEN | xargs -r kill
lsof -tiTCP:8003 -sTCP:LISTEN | xargs -r kill

Julia RAG server not starting

Ensure Julia is installed and on PATH, and that pubmed_data.db exists under data/.

KG data not found

Set STARK_PRIMEKG_DIR to the directory containing edge_index.pt, node_info.json, and other KG files. Set DATA_DIR to the directory containing mondo.json and action_primitives.json.

About

Implementation of VCR-Agent, the multi-agent explanation system for virtual cells described in Towards Autonomous Mechanistic Reasoning in Virtual Cells (arXiv:2604.11661v2).
