This is the official code repository for the paper *Towards Autonomous Mechanistic Reasoning in Virtual Cells*.
Abstract: Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building upon this, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with verifier-based filtering to generate and validate mechanistic reasoning autonomously. Using this framework, we release the VC-Traces dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of multi-agent generation and rigorous verification.
- Input: Perturbation + Context data (JSON)
- Preprocessing: NER entity extraction (HunFlair2) + PubChem synonym search
- Multi-Agent: Report and structured explanation generation with external tools
```bash
uv sync
```

Create a `.env` file (or export the variables) with:

```bash
# Required
DATA_DIR=data/curation_v1  # Path to directory with action_primitives.json, templates/, mondo.json

# Required for KG tool
STARK_PRIMEKG_DIR=/path/to/stark_prime_kg  # Directory with edge_index.pt, node_info.json, etc. (https://stark.stanford.edu/dataset_prime.html)
DRUGBANK_XML_PATH=/path/to/full_database.xml  # DrugBank XML (optional, for chemical entity matching)

# Required for LLM providers
ANTHROPIC_API_KEY=...  # For Anthropic Claude
# or
OPENAI_API_KEY=...     # For OpenAI
```

Clone the repositories and download the PubMed database (~34GB fully built) and binary search data (~2.5GB). If the download process does not work well, refer to https://github.com/domluna/pubmedFastRAG.git for a full explanation:
```bash
git clone https://github.com/kyunghyuncho/pubmed-vectors.git
git clone https://github.com/domluna/pubmedFastRAG.git

# Download PubMed SQLite database (takes a while)
python pubmed-vectors/download_pubmed.py

# Download binary RAG search data (from Google Drive, ~2.5GB)
pip install gdown
gdown "1LuCaUcILQuQgkDm3_tWBWr4X7AQ518kX" -O pubmedFastRAG/bindata.zip
cd pubmedFastRAG && unzip bindata.zip -d bindata/ && rm bindata.zip && cd ..

# Move database to data/
mv pubmed_data.db data/pubmed_data.db
```

Install Python dependencies for the embed server:

```bash
pip install fastapi uvicorn einops
```

Install Julia (if not already installed) and resolve Julia packages:

```bash
cd pubmedFastRAG
julia --project=. -e 'using Pkg; Pkg.update(); Pkg.precompile()'
cd ..
```

Finally, unzip `data/mondo.json.zip`.
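Before running the pipeline, it can help to sanity-check that the `.env` variables are set. A minimal, self-contained sketch (the variable names match the `.env` keys above; the `missing_vars` helper itself is hypothetical, not part of the repo):

```python
# Hypothetical helper for checking the .env keys listed above.
REQUIRED = ["DATA_DIR"]  # always needed
OPTIONAL = ["STARK_PRIMEKG_DIR", "DRUGBANK_XML_PATH",
            "ANTHROPIC_API_KEY", "OPENAI_API_KEY"]

def missing_vars(env, names):
    """Return the names that are unset or empty in the given environment mapping."""
    return [n for n in names if not env.get(n)]

# Demo against a toy environment (pass os.environ in real use):
env = {"DATA_DIR": "data/curation_v1", "ANTHROPIC_API_KEY": "sk-..."}
print(missing_vars(env, REQUIRED))  # []
print(missing_vars(env, OPTIONAL))  # ['STARK_PRIMEKG_DIR', 'DRUGBANK_XML_PATH', 'OPENAI_API_KEY']
```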
The input is a JSON list of perturbation records. Some fields may be missing.
```json
[
  {
    "index": 0,
    "perturbation": {
      "context": [
        {
          "perturbation type": "soluble factor",
          "description": "Soluble factor addition of VEGF",
          "cell_type": "N/A",
          "disease_model": "Angiogenic factor/tumors",
          "cell type": null,
          "subtype": null
        }
      ],
      "perturbations": [
        {
          "type": "chemical",
          "smiles": "CN1CCN(CC1)CC(=O)N(C)C2=CC=C(C=C2)N=C(C3=CC=CC=C3)C4=C(NC5=C4C=CC(=C5)C(=O)OC)O",
          "name": "Nintedanib",
          "target": "VEGFR",
          "moa_type": "antibody",
          "known targets": []
        }
      ]
    }
  }
]
```

This step extracts NER entities and PubChem compound information from the input perturbations:
```bash
INPUT_PATH=data/example_input.json
PREPROCESSED_PATH=data/example_preprocessed.json
DATA_DIR=data/curation_v1 uv run script/preprocess.py --input_path "$INPUT_PATH" --preprocessed_path "$PREPROCESSED_PATH"
```

The preprocessed output adds `perturbation_entity` and `context_entity` fields (NER results) and an optional `pubchem_info` field for chemical perturbations.
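An input file can also be written programmatically. A minimal sketch following the schema above (field values are copied from the example record; optional fields are simply omitted, and the output path here is arbitrary):

```python
import json
import os
import tempfile

# One record following the schema above; optional fields may be omitted or set to null.
record = {
    "index": 0,
    "perturbation": {
        "context": [{
            "perturbation type": "soluble factor",
            "description": "Soluble factor addition of VEGF",
            "cell_type": "N/A",
        }],
        "perturbations": [{
            "type": "chemical",
            "name": "Nintedanib",
            "target": "VEGFR",
        }],
    },
}

# The file must be a JSON *list* of records.
path = os.path.join(tempfile.gettempdir(), "example_input.json")
with open(path, "w") as f:
    json.dump([record], f, indent=2)
```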
The agent generates both a report and a structured explanation using external tools:
| Tool | Description | Requires |
|---|---|---|
| `pubmed-fast-ner` | PubMed paper search via NER entities | Julia RAG server (ports 8002, 8003) |
| `kg-ner` | KG node lookup via NER entities | `STARK_PRIMEKG_DIR` env var |
| `harmonizome` | Gene/gene-set information | Internet access |
| `wikipedia` | Wikipedia articles for NER entities | Internet access |
```bash
# 1) Start PubMed servers (skip if not using pubmed-fast-ner)
( cd pubmedFastRAG && uv run embed.py --port 8002 --device cpu ) &
EMBED_PID=$!
until nc -z 127.0.0.1 8002; do sleep 1; done

( cd pubmedFastRAG && julia --project=. -t auto -e \
  'using Pkg; Pkg.instantiate(); include("rag.jl"); rag = RAGServer("../data/pubmed_data.db"); start_server(rag; port=8003)' ) &
JULIA_PID=$!
until nc -z 127.0.0.1 8003; do sleep 1; done

# 2) Run LLM agent
DATA_DIR=data/curation_v1 uv run script/generate.py \
  --experiment_name multi_tool_order \
  --wandb_mode disabled \
  --mode report-explain \
  --model_type anthropic \
  --tool_list '["pubmed-fast-ner", "kg-ner", "harmonizome", "wikipedia"]' \
  --folder_name multi_tool \
  --pert_path "$PREPROCESSED_PATH" \
  --kg_with_rel

# 3) Stop servers (optional)
kill $EMBED_PID $JULIA_PID || true
```

| Argument | Default | Description |
|---|---|---|
| `--model_type` | `anthropic` | LLM provider (`anthropic`, `openai`, `gemini`); uses litellm with API keys in `.env` |
| `--tool_list` | `["pubmed-fast-ner", "kg-ner", "harmonizome", "wikipedia"]` | JSON list of tools to use |
| `--mode` | `report-explain` | `report-explain` or `explain-only` |
| `--kg_with_rel` | `true` | Include KG relation info |
| `--wandb_mode` | `disabled` | W&B logging (`online`, `offline`, `disabled`) |
| `--max_items` | `0` | Limit perturbations to process (0 = all) |
| `--pert_path` | `` | Preprocessed perturbation file (result of the preprocessing step) |
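Note that `--tool_list` takes a JSON array passed as a single shell-quoted string. A minimal sketch of how such a flag can be parsed (a hypothetical `argparse` setup, not the repo's actual parser):

```python
import argparse
import json

# Hypothetical sketch: accept a JSON-encoded list as a CLI flag by using
# json.loads as the argparse type converter.
parser = argparse.ArgumentParser()
parser.add_argument("--tool_list", type=json.loads,
                    default=["pubmed-fast-ner", "kg-ner", "harmonizome", "wikipedia"])
parser.add_argument("--max_items", type=int, default=0)

args = parser.parse_args(['--tool_list', '["kg-ner", "wikipedia"]'])
print(args.tool_list)  # ['kg-ner', 'wikipedia'] -- a real Python list, not a string
```

This is why the shell example wraps the list in single quotes: the inner double quotes must reach the parser intact for the JSON decode to succeed.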
- `notebooks/data-generation/data_generation.ipynb`: interactive data generation walkthrough
```bash
lsof -iTCP:8002 -sTCP:LISTEN -n -P
lsof -iTCP:8003 -sTCP:LISTEN -n -P
lsof -tiTCP:8002 -sTCP:LISTEN | xargs -r kill
lsof -tiTCP:8003 -sTCP:LISTEN | xargs -r kill
```

Ensure Julia is installed and on `PATH`, and that `pubmed_data.db` exists under `data/`.
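The `nc -z` readiness checks can also be done from Python if `nc` is unavailable; a small stdlib-only sketch:

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (same idea as `nc -z`)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# True only once the embed server is actually listening on 8002.
print(port_open("127.0.0.1", 8002))
```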
Set `STARK_PRIMEKG_DIR` to the directory containing `edge_index.pt`, `node_info.json`, and the other KG files. Set `DATA_DIR` to the directory containing `mondo.json` and `action_primitives.json`.
