GeneAI

Inspiration

Most drugs are prescribed based on symptoms and population averages — not the patient sitting in front of the clinician. Yet genetic variants in enzymes like CYP2D6 and CYP2C19 can make a standard dose toxic for one person and ineffective for another. CPIC pharmacogenomic guidelines exist to address this, but they're buried in academic tables clinicians rarely have time to consult mid-appointment.

Every year, about 1.3 million people end up in the ER because of adverse reactions to drugs their doctors prescribed. Two people with the same illness and the same medication can have completely different outcomes because their genes differ.

GeneAI was built to close that gap: take a patient's genetic profile, run it through a trained model, and surface a personalized risk assessment with a CPIC-backed recommendation — instantly.

What it does

GeneAI is a pharmacogenomics REST API and web UI that accepts a patient's gene profile and a drug name, and returns:

Per-gene activity scores from a trained Cross-Attention Set Transformer model
CPIC clinical recommendation text (e.g., "Avoid codeine use because of possibility of diminished analgesia")
Plain-English explanation via GPT-4o-mini, suitable for patients

Input can be either structured JSON or natural language ("I'm a CYP2D6 poor metabolizer taking codeine").

API Endpoints:

Method	Endpoint	Description
POST	`/predict`	Structured gene profile + drug → risk assessment
POST	`/predict/natural`	Natural language input → parsed and predicted
POST	`/explain`	Risk score → plain-English explanation
GET	`/drugs`	List all 323 supported drugs (search, pagination)
GET	`/drugs/{id}`	Drug details + known gene interactions
GET	`/genes`	List all 17 supported pharmacogenes
GET	`/genes/{symbol}/alleles`	Allele details with function status
GET	`/validate`	Check if a gene-drug pair exists in CPIC
GET	`/health`	API status, model version, data version
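
As a rough illustration of calling the structured endpoint in the table above, the sketch below builds a `POST /predict` request with Python's standard library. The base URL (a local FastAPI dev server) and the JSON field names `genes` and `drug` are assumptions, not the documented schema; the Swagger page at `/docs` is the place to confirm the real request body.

```python
import json
import urllib.request

# Assumption: a locally running dev server; replace with the real API host.
BASE_URL = "http://localhost:8000"


def build_predict_request(genes: dict[str, str], drug: str) -> urllib.request.Request:
    """Build (but do not send) a POST /predict request.

    The payload shape {"genes": {...}, "drug": "..."} is hypothetical.
    """
    payload = {"genes": genes, "drug": drug}
    return urllib.request.Request(
        url=f"{BASE_URL}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request({"CYP2D6": "Poor Metabolizer"}, "codeine")
# To send it: urllib.request.urlopen(request), then json-decode the response body.
```

Building the request separately from sending it keeps the example runnable without a live server.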

How we built it

Data Pipeline

Extracted clinical recommendations from the CPIC database v1.54.0 (2,129 gene-drug pairs across 17 pharmacogenes and 323 drugs)
Fetched SMILES strings from PubChem (299/323 drugs) and drug-gene target flags from DrugBank (3,422 mappings)
Engineered risk score labels from CPIC recommendation severity text using GPT-4o-mini

Model

We built a Cross-Attention Set Transformer in PyTorch:

Gene sequences → frozen ESM-2 (Meta) embeddings → projected to 128-dim
Drug SMILES → Morgan fingerprints (RDKit, 1024-bit) → projected to 128-dim
Each gene vector is scaled by its phenotype activity level (0.0 for poor metabolizer → 2.0 for ultrarapid)
DrugBank target flags are injected as a learned embedding (biological mechanism prior)
Self-attention over genes captures gene-gene interactions
Cross-attention (drug queries genes) learns which genes drive risk for a given molecule
Prediction head outputs risk score ∈ [1, 10], with attention weights providing per-gene interpretability
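
The cross-attention step above can be sketched in plain Python to show where the per-gene weights come from. This is a simplified, single-head illustration and not the trained PyTorch model: it omits the learned query/key/value projections, the self-attention layer, and the DrugBank embedding. The function names and the sigmoid-based squashing into [1, 10] are assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(drug_query, gene_vectors, activity_levels):
    """One drug query attends over a set of gene vectors.

    Each gene vector is first scaled by its phenotype activity level, then
    scored against the drug query with a scaled dot product.
    """
    scaled = [[a * x for x in vec] for vec, a in zip(gene_vectors, activity_levels)]
    dim = len(drug_query)
    scores = [dot(drug_query, g) / math.sqrt(dim) for g in scaled]
    weights = softmax(scores)  # one weight per gene, summing to 1
    pooled = [sum(w * g[i] for w, g in zip(weights, scaled)) for i in range(dim)]
    return pooled, weights


def to_risk_score(raw):
    """Squash an unbounded value into [1, 10] (assumed output mapping)."""
    if raw >= 0:
        sig = 1.0 / (1.0 + math.exp(-raw))
    else:
        e = math.exp(raw)
        sig = e / (1.0 + e)
    return 1.0 + 9.0 * sig


# Toy 2-dim stand-ins for the 128-dim projected embeddings.
drug_query = [1.0, 0.0]
gene_vectors = [[1.0, 0.0], [0.0, 1.0]]
activity = [2.0, 1.0]  # e.g. ultrarapid and a mid-range activity level

pooled, weights = cross_attention(drug_query, gene_vectors, activity)
risk = to_risk_score(sum(pooled))  # sum() stands in for a learned linear head
```

In this toy input the first gene aligns with the drug query and has the higher activity, so it receives the larger attention weight; that weight is what a per-gene attribution would report.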

API & Services

FastAPI backend with routes for health checks, drug/gene lookup, structured prediction, natural language prediction (via GPT-4o-mini parsing), and AI explanation generation. Data is served from CPIC-processed CSVs loaded at startup. Full Swagger documentation is available at `/docs`.

Frontend

React app with a Three.js 3D DNA helix, gene/drug input panels, and live prediction results — all in an iPhone-style liquid glass UI. Deployed at geneai.tech.

Challenges we ran into

Data sparsity — Many gene-drug combinations are underdocumented in CPIC. We handled unknown genes/drugs gracefully with fuzzy matching suggestions and fallback responses rather than crashing.
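
A minimal sketch of the fuzzy-matching fallback, using `difflib` from the standard library. The gene list is an illustrative subset, and the function name, response shape, and the 0.6 similarity cutoff are assumptions rather than the project's actual code.

```python
from difflib import get_close_matches

# Illustrative subset only; the real API supports 17 pharmacogenes.
KNOWN_GENES = ("CYP2D6", "CYP2C19", "CYP2C9", "TPMT", "DPYD")


def resolve_gene(symbol: str, known: tuple[str, ...] = KNOWN_GENES) -> dict:
    """Return the matched gene, or suggestions instead of raising on unknown input."""
    candidate = symbol.strip().upper()
    if candidate in known:
        return {"found": True, "gene": candidate, "suggestions": []}
    # Up to three near matches, best first; empty if nothing is close enough.
    suggestions = get_close_matches(candidate, known, n=3, cutoff=0.6)
    return {"found": False, "gene": None, "suggestions": suggestions}


exact = resolve_gene(" cyp2d6 ")
typo = resolve_gene("CYP2D")
unknown = resolve_gene("ZZZZ")
```

Returning a structured "not found" result with suggestions lets the caller decide how to respond, instead of surfacing an exception to the user.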

Phenotype-to-activity mapping — Translating clinical phenotype labels ("Intermediate Metabolizer", "No Function") into continuous activity scalars required careful CPIC-domain rules. CYP2D6 alone has activity values of 0, 0.25, 0.5, 0.75, 1.0, and 2.0.
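
One way to picture this mapping is a normalized lookup table. In the sketch below, only the endpoints (0.0 for a poor metabolizer, 2.0 for ultrarapid) come from this write-up; the other values, the label spellings, and the function name are illustrative assumptions, and real rules would be gene-specific.

```python
from typing import Optional

PHENOTYPE_ACTIVITY = {
    "poor metabolizer": 0.0,
    "intermediate metabolizer": 0.5,  # assumed value
    "normal metabolizer": 1.0,        # assumed value
    "ultrarapid metabolizer": 2.0,
    "no function": 0.0,               # assumed to behave like a poor metabolizer
}


def phenotype_to_activity(label: str) -> Optional[float]:
    """Map a clinical phenotype label to an activity scalar.

    Returns None for unrecognized labels so the caller can fall back
    explicitly rather than silently guessing a value.
    """
    key = " ".join(label.lower().split())  # normalize case and whitespace
    return PHENOTYPE_ACTIVITY.get(key)
```

The resulting scalar is what would scale the gene's embedding before attention.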

Label quality — Initial rule-based risk scores from keyword matching were too coarse. We replaced them with GPT-4o-mini-scored continuous labels for better model targets.
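
To see why keyword matching is coarse, consider a small sketch of such a scorer. The keywords, the scores, and the second recommendation string are hypothetical and are not the rules or data the project actually used.

```python
# Hypothetical keyword rules, checked in order; the first match wins.
KEYWORD_SCORES = [
    ("avoid", 9.0),
    ("alternative", 7.0),
    ("reduce", 5.0),
    ("standard", 2.0),
]


def keyword_risk_score(recommendation: str, default: float = 5.0) -> float:
    """Assign a risk score from the first keyword found in the text."""
    text = recommendation.lower()
    for keyword, score in KEYWORD_SCORES:
        if keyword in text:
            return score
    return default


# Two recommendations with different clinical reasons for avoidance.
lack_of_effect = keyword_risk_score(
    "Avoid codeine use because of possibility of diminished analgesia"
)
toxicity = keyword_risk_score(
    "Avoid codeine use because of potential for serious toxicity"
)
```

Both strings hit the same keyword and collapse to the same score, even though one describes an ineffective drug and the other a toxic one. A continuous, text-aware label can separate cases like these.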

Interpretability — Returning a single risk score isn't enough clinically. The cross-attention weights give per-gene attribution, and we pair every prediction with CPIC recommendation text.

Accomplishments that we're proud of

End-to-end working system: trained model → FastAPI → live UI at geneai.tech
Real Set Transformer architecture with cross-attention interpretability — not just a lookup table
Natural language interface that parses free-text clinical descriptions into structured predictions
CPIC text enrichment pairs every model prediction with evidence-based clinical language
10+ API endpoints with proper error handling, pagination, search, fuzzy matching, and Swagger docs — built for developers
Core data and model components come from openly licensed sources (CPIC CC0, PharmGKB CC BY-SA, PubChem public domain, ESM-2 MIT)

What we learned

Building for clinical use is less about raw accuracy and more about:

Safe fallbacks — the system should never crash on unknown input
Transparent outputs — per-gene attention weights let users see why a score was assigned
Domain alignment — risk scores only matter if they map to actionable CPIC recommendations

We also learned that ESM-2 protein embeddings + Morgan fingerprints give surprisingly rich features even before any fine-tuning.