GeneAI
Inspiration
Most drugs are prescribed based on symptoms and population averages — not the patient sitting in front of the clinician. Yet genetic variants in enzymes like CYP2D6 and CYP2C19 can make a standard dose toxic for one person and ineffective for another. CPIC pharmacogenomic guidelines exist to address this, but they're buried in academic tables clinicians rarely have time to consult mid-appointment.
Every year, adverse effects from prescribed drugs send an estimated 1.3 million people to the ER. Two people with the same illness and the same medication can have completely different outcomes because their genes are different.
GeneAI was built to close that gap: take a patient's genetic profile, run it through a trained model, and surface a personalized risk assessment with a CPIC-backed recommendation — instantly.
What it does
GeneAI is a pharmacogenomics REST API and web UI that accepts a patient's gene profile and a drug name, either as structured JSON or as natural language ("I'm a CYP2D6 poor metabolizer taking codeine"), and returns:
- A risk score with per-gene attention weights from a trained Cross-Attention Set Transformer model
- CPIC clinical recommendation text (e.g., "Avoid codeine use because of possibility of diminished analgesia")
- Plain-English explanation via GPT-4o-mini, suitable for patients
API Endpoints:
| Method | Endpoint | Description |
|---|---|---|
| POST | /predict | Structured gene profile + drug → risk assessment |
| POST | /predict/natural | Natural language input → parsed and predicted |
| POST | /explain | Risk score → plain-English explanation |
| GET | /drugs | List all 323 supported drugs (search, pagination) |
| GET | /drugs/{id} | Drug details + known gene interactions |
| GET | /genes | List all 17 supported pharmacogenes |
| GET | /genes/{symbol}/alleles | Allele details with function status |
| GET | /validate | Check if a gene-drug pair exists in CPIC |
| GET | /health | API status, model version, data version |
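As a rough illustration of calling the structured endpoint, the sketch below builds (but does not send) a POST /predict request with the Python standard library. The HTTPS base URL and the payload field names (`genes`, `symbol`, `phenotype`, `drug`) are assumptions made for this example; the Swagger page at /docs is the place to check the real request schema.

```python
import json
import urllib.request

BASE_URL = "https://geneai.tech"  # assumption: API served over HTTPS at the site root


def build_predict_request(genes: dict[str, str], drug: str) -> urllib.request.Request:
    """Build a POST /predict request. The JSON field names are hypothetical."""
    payload = {
        "genes": [{"symbol": s, "phenotype": p} for s, p in genes.items()],
        "drug": drug,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_predict_request({"CYP2D6": "Poor Metabolizer"}, "codeine")
# To send it, pass `req` to urllib.request.urlopen and JSON-decode the response body.
```

Building the request separately from sending it keeps the example runnable offline and makes the payload easy to inspect before it goes over the wire.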
How we built it
Data Pipeline
- Extracted clinical recommendations from the CPIC database v1.54.0 (2,129 gene-drug pairs across 17 pharmacogenes and 323 drugs)
- Fetched SMILES strings from PubChem (299/323 drugs) and drug-gene target flags from DrugBank (3,422 mappings)
- Engineered risk score labels from CPIC recommendation severity text using GPT-4o-mini
Model
We built a Cross-Attention Set Transformer in PyTorch:
- Protein sequences for each gene → frozen ESM-2 (Meta) embeddings → projected to 128-dim
- Drug SMILES → Morgan fingerprints (RDKit, 1024-bit) → projected to 128-dim
- Each gene vector is scaled by its phenotype activity level (0.0 for poor metabolizer → 2.0 for ultrarapid)
- DrugBank target flags are injected as a learned embedding (biological mechanism prior)
- Self-attention over genes captures gene-gene interactions
- Cross-attention (drug queries genes) learns which genes drive risk for a given molecule
- Prediction head outputs risk score ∈ [1, 10], with attention weights providing per-gene interpretability
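The cross-attention step described above can be sketched in a few lines of dependency-free Python. This is a toy, not the PyTorch model: it uses a single attention head, tiny hand-written vectors instead of ESM-2 and Morgan features, no self-attention or DrugBank embedding, and a sigmoid head as one assumed way to keep the score between 1 and 10. All names and numbers here are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_risk(drug_query, gene_vectors, activities, w_out, bias=0.0):
    """Single-head cross-attention: one drug query attends over a set of genes.

    Returns (risk score between 1 and 10, attention weight per gene).
    """
    # Scale each gene vector by its phenotype activity level.
    scaled = {g: [activities[g] * x for x in vec] for g, vec in gene_vectors.items()}
    names = list(scaled)
    dim = len(drug_query)
    # Scaled dot-product score between the drug query and each gene vector.
    scores = [
        sum(q * k for q, k in zip(drug_query, scaled[g])) / math.sqrt(dim)
        for g in names
    ]
    weights = softmax(scores)
    # Attention-weighted sum of gene vectors (keys double as values in this toy).
    context = [
        sum(w * scaled[g][i] for w, g in zip(weights, names)) for i in range(dim)
    ]
    # Linear head squashed with a sigmoid, then mapped onto the 1-10 range.
    logit = sum(c * w for c, w in zip(context, w_out)) + bias
    risk = 1.0 + 9.0 / (1.0 + math.exp(-logit))
    return risk, dict(zip(names, weights))


risk, attn = cross_attention_risk(
    drug_query=[0.5, -0.2, 0.1],
    gene_vectors={"CYP2D6": [1.0, 0.0, 0.5], "CYP2C19": [0.2, 0.3, -0.1]},
    activities={"CYP2D6": 0.0, "CYP2C19": 1.0},
    w_out=[0.4, -0.3, 0.2],
)
```

The returned `attn` dictionary is the part that supports interpretability: the weights sum to 1, so each value reads as the share of attention the drug query placed on that gene.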
API & Services
FastAPI backend with routes for health checks, drug/gene lookup, structured prediction, natural language prediction (via GPT-4o-mini parsing), and AI explanation generation. Data served from CPIC-processed CSVs loaded at startup. Full Swagger documentation at /docs.
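Loading the processed CSVs once at startup might look roughly like the sketch below, which indexes rows by gene and drug so a request handler can do a dictionary lookup. The column names (`gene`, `drug`, `phenotype`, `recommendation`) and the in-memory sample are assumptions for illustration; the actual file layout is not described here.

```python
import csv
import io


def load_recommendations(csv_file):
    """Index recommendation rows by (GENE, drug), normalizing case for lookup."""
    index = {}
    for row in csv.DictReader(csv_file):
        key = (row["gene"].upper(), row["drug"].lower())
        index.setdefault(key, []).append(row)
    return index


# Stand-in for a processed CSV on disk; the columns are hypothetical.
sample = io.StringIO(
    "gene,drug,phenotype,recommendation\n"
    "CYP2D6,codeine,Poor Metabolizer,"
    "Avoid codeine use because of possibility of diminished analgesia\n"
)
RECOMMENDATIONS = load_recommendations(sample)
rows = RECOMMENDATIONS.get(("CYP2D6", "codeine"), [])
```

Keeping the index in memory trades a slightly longer startup for constant-time lookups on every request, which suits a dataset of a few thousand gene-drug pairs.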
Frontend
React app with a Three.js 3D DNA helix, gene/drug input panels, and live prediction results — all in an iPhone-style liquid glass UI. Deployed at geneai.tech.
Challenges we ran into
Data sparsity — Many gene-drug combinations are underdocumented in CPIC. We handled unknown genes/drugs gracefully with fuzzy matching suggestions and fallback responses rather than crashing.
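One simple way to get this kind of fallback is the standard library's `difflib.get_close_matches`. The sketch below is an assumption about how such a lookup could work, not the project's actual code; the drug list is a small illustrative subset and the response fields are hypothetical.

```python
from difflib import get_close_matches

SUPPORTED_DRUGS = ["codeine", "clopidogrel", "warfarin", "simvastatin"]  # illustrative subset


def resolve_drug(name, known=SUPPORTED_DRUGS):
    """Return a match, or a fallback payload with suggestions, without raising."""
    cleaned = name.strip().lower()
    if cleaned in known:
        return {"found": True, "drug": cleaned}
    # Up to three near matches; the 0.6 similarity cutoff is difflib's default.
    suggestions = get_close_matches(cleaned, known, n=3, cutoff=0.6)
    return {"found": False, "drug": None, "suggestions": suggestions}


result = resolve_drug("codiene")  # misspelled on purpose
```

Returning a structured "not found" payload instead of raising lets the API answer with suggestions the client can show to the user.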
Phenotype-to-activity mapping — Translating clinical phenotype labels ("Intermediate Metabolizer", "No Function") into continuous activity scalars required careful CPIC-domain rules. CYP2D6 alone has activity values of 0, 0.25, 0.5, 0.75, 1.0, and 2.0.
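A minimal sketch of such a mapping is shown below. Only the 0.0 (poor metabolizer) and 2.0 (ultrarapid) endpoints come from the description above; the intermediate, normal, and "no function" values, and the choice to default unknown labels to 1.0, are assumptions for illustration. Real CPIC rules are gene-specific and more detailed.

```python
# Phenotype label -> activity scalar. Entries marked "assumption" are illustrative.
PHENOTYPE_ACTIVITY = {
    "poor metabolizer": 0.0,
    "no function": 0.0,               # assumption
    "intermediate metabolizer": 0.5,  # assumption
    "normal metabolizer": 1.0,        # assumption
    "ultrarapid metabolizer": 2.0,
}
DEFAULT_ACTIVITY = 1.0  # assumption: treat unrecognized labels as normal function


def phenotype_to_activity(label: str) -> float:
    """Normalize case and whitespace, then look up the activity scalar."""
    key = " ".join(label.lower().split())
    return PHENOTYPE_ACTIVITY.get(key, DEFAULT_ACTIVITY)
```

Defaulting to 1.0 keeps the pipeline running on unexpected labels, though a stricter design could flag the input for review instead of silently assuming normal function.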
Label quality — Initial rule-based risk scores from keyword matching were too coarse. We replaced them with GPT-4o-mini-scored continuous labels for better model targets.
Interpretability — Returning a single risk score isn't enough clinically. The cross-attention weights give per-gene attribution, and we pair every prediction with CPIC recommendation text.
Accomplishments that we're proud of
- End-to-end working system: trained model → FastAPI → live UI at geneai.tech
- Real Set Transformer architecture with cross-attention interpretability — not just a lookup table
- Natural language interface that parses free-text clinical descriptions into structured predictions
- CPIC text enrichment pairs every model prediction with evidence-based clinical language
- 10+ API endpoints with proper error handling, pagination, search, fuzzy matching, and Swagger docs — built for developers
- The core datasets and embeddings come from openly licensed sources (CPIC CC0, PharmGKB CC BY-SA, PubChem public domain, ESM-2 MIT)
What we learned
Building for clinical use is less about raw accuracy and more about:
- Safe fallbacks — the system should never crash on unknown input
- Transparent outputs — per-gene attention weights let users see why a score was assigned
- Domain alignment — risk scores only matter if they map to actionable CPIC recommendations
We also learned that ESM-2 protein embeddings + Morgan fingerprints give surprisingly rich features even before any fine-tuning.
What's next for GeneAI
- Expand gene coverage beyond the current 17 pharmacogenes
- Add multi-drug support for polypharmacy patients
- Retrospective validation against real clinical outcomes
- Confidence intervals and uncertainty quantification on risk scores
- EHR integration prototype for hospital systems
Prototype only — not validated for clinical use.
Built With
- Python
- PyTorch
- FastAPI
- ESM-2 (Meta)
- RDKit
- OpenAI GPT-4o-mini
- Three.js
- React
- CPIC Database
- PharmGKB
- PubChem
- DrugBank
Try it out
- Live API: geneai.tech
- API Docs: geneai.tech/docs
- GitHub: github.com/Topupchips/HackIllinoisWinningIdea
