Bacter.ai

Home Page
File Upload Page
Results Overview
Antibiogram
SHAP Data for Ampicillin
SHAP Data for Ceftriaxone
Genome Map
Information on Trimethoprim-resistant dihydrofolate reductase
Information on Class A β-lactamase
Resistance Mechanism Diagram for dfrA17
Resistance Mechanism Diagram for gyrA-S83L
General Data Statistics
Statistics for trimethoprim/sulfamethoxazole
Recommendations Tab

Inspiration

Antibiotic resistance directly kills an estimated 1.27 million people per year, more than HIV/AIDS or malaria. When a patient arrives with a bacterial infection, doctors often have to prescribe antibiotics before knowing which ones will actually work, because lab results take 48-72 hours. For sepsis patients, every hour on the wrong antibiotic increases mortality by an estimated 4-8%. We built bacter.ai to close that gap.

What it does

bacter.ai takes a raw bacterial DNA sequence (FASTA file) and predicts resistance or susceptibility across 10 major antibiotics in seconds. It outputs a full antibiogram showing which drugs will work, which won't, and why, with explainable AI highlighting the specific DNA patterns driving each prediction. Doctors get actionable treatment guidance before traditional lab results are even available.

How we built it

We sourced 3,030 lab-verified E. coli genomes and their corresponding antibiotic resistance test results from the PATRIC/BV-BRC database, a public NIH-funded repository of bacterial genome data.

Each genome, roughly 5 million characters of raw DNA, was converted into a numerical fingerprint through a process called k-mer frequency extraction. We slide a window of 6 characters across the entire DNA sequence and count how often each of the 4,096 possible 6-letter DNA patterns appears, then normalize by genome length. This produces a compact numerical representation that captures the genetic content of the entire genome, including resistance genes, mutations, and regulatory elements.
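The k-mer step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not our production code: the function names `read_fasta` and `kmer_frequencies` are hypothetical, windows containing ambiguous bases such as `N` are skipped, and counts are normalized by the number of valid windows (an assumption that is close to, but not identical to, dividing by genome length).

```python
from itertools import product

K = 6
# All 4**6 = 4,096 possible 6-mers in a fixed order, so feature columns line up across genomes.
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def read_fasta(text):
    """Return one uppercase sequence string per FASTA record (contig)."""
    records, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current:
                records.append("".join(current))
            current = []
        elif line:
            current.append(line.upper())
    if current:
        records.append("".join(current))
    return records


def kmer_frequencies(sequences):
    """Slide a 6-character window over every contig and return 4,096 normalized counts."""
    counts = [0] * len(KMERS)
    total = 0
    for seq in sequences:
        for i in range(len(seq) - K + 1):
            idx = KMER_INDEX.get(seq[i:i + K])
            if idx is not None:  # skip windows with N or other ambiguity codes
                counts[idx] += 1
                total += 1
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


fasta = ">contig1\nACGTACGT\n>contig2\nNNACGTAC\n"
features = kmer_frequencies(read_fasta(fasta))
print(len(features), features[KMER_INDEX["ACGTAC"]])
```

Counting within each contig separately avoids creating artificial 6-mers that span the boundary between two unrelated contigs.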

We trained 10 independent XGBoost classifiers, one per antibiotic. XGBoost is a gradient-boosted decision tree ensemble: it builds 200 decision trees sequentially, and each tree corrects the mistakes of the previous ones by fitting to the gradient of the loss (the residual errors of the current ensemble). Each tree splits on k-mer frequency thresholds to separate resistant from susceptible genomes. The model minimizes binary cross-entropy loss and uses L2 regularization on the leaf weights to prevent overfitting.
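For reference, the objective described above can be written out as follows. This is the standard XGBoost formulation rather than anything specific to our setup; the symbols are assumptions introduced here: $\eta$ is the learning rate, $J_t$ the number of leaves in tree $t$, $w_{t,j}$ the leaf weights, $\gamma$ and $\lambda$ the regularization strengths, and $I_j$ the set of training genomes that land in leaf $j$. The constant initial score is omitted.

```latex
\[
\begin{aligned}
\hat{y}_i &= \sum_{t=1}^{T} \eta\, f_t(x_i), \qquad
p_i = \frac{1}{1 + e^{-\hat{y}_i}}, \qquad T = 200 \\[4pt]
\mathcal{L} &= -\sum_{i=1}^{n} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big]
  + \sum_{t=1}^{T} \Big( \gamma J_t + \tfrac{1}{2} \lambda \sum_{j=1}^{J_t} w_{t,j}^{2} \Big) \\[4pt]
g_i &= p_i - y_i, \qquad h_i = p_i (1 - p_i), \qquad
w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
\end{aligned}
\]
```

The last line shows why "learning from the previous trees' mistakes" holds: the gradient $g_i$ is exactly the gap between the current predicted probability and the true label, and $\lambda$ shrinks each leaf weight toward zero.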

We validated every model with 5-fold stratified cross-validation, ensuring every prediction is made on data the model never saw during training. We computed 95% bootstrap confidence intervals over 1,000 resampling iterations to quantify the reliability of our accuracy estimates.
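The validation procedure above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not our exact code: `stratified_folds` and `bootstrap_accuracy_ci` are hypothetical helper names, the interval is a simple percentile bootstrap, and the predictions passed in are assumed to be the pooled out-of-fold predictions from cross-validation.

```python
import random


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign sample indices to folds so each fold keeps roughly the overall class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    return folds


def bootstrap_accuracy_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    stats = []
    for _ in range(n_boot):
        # Resample n predictions with replacement and recompute accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


labels = [1] * 20 + [0] * 80
folds = stratified_folds(labels)
print([sum(labels[i] for i in fold) for fold in folds])  # resistant samples per fold
```

Each fold serves once as the held-out set while the model trains on the other four, so every genome receives exactly one out-of-fold prediction.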

To make predictions explainable, we integrated SHAP (SHapley Additive exPlanations), which decomposes each prediction into individual feature contributions, showing exactly which DNA patterns pushed the prediction toward resistant or susceptible.
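The decomposition SHAP provides can be stated compactly. The notation here is introduced for illustration: $f$ is the model output for one antibiotic (for tree models this is typically the log-odds of resistance rather than the probability), $M = 4096$ is the number of k-mer features, $F$ is the full feature set, and $f_S(x)$ denotes the expected model output when only the features in $S$ are known.

```latex
\[
f(x) = \phi_0 + \sum_{j=1}^{M} \phi_j(x), \qquad \phi_0 = \mathbb{E}\big[f(X)\big]
\]
\[
\phi_j(x) = \sum_{S \subseteq F \setminus \{j\}}
  \frac{|S|!\,\big(M - |S| - 1\big)!}{M!}
  \Big[ f_{S \cup \{j\}}(x) - f_S(x) \Big]
\]
```

A positive $\phi_j$ means the frequency of k-mer $j$ in this genome pushed the prediction toward resistant, a negative one toward susceptible, and the contributions always sum to the difference between this prediction and the average prediction.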

The backend serves real-time predictions through a Flask REST API. The frontend, built in React with D3.js, visualizes results through an interactive antibiogram, a circular genome map showing resistance gene locations, a verification panel comparing predictions to lab results, and SHAP-based model explanation charts.

Challenges we ran into

Downloading and processing thousands of genome sequences from the BV-BRC API, where each genome is approximately 5 MB of raw DNA text; 9,255 out of 12,627 genomes failed to download due to API timeouts and missing records
Only a fraction of downloaded genomes had matching lab-tested resistance data for our target antibiotics, limiting our usable training set to 3,030 genomes out of 4,693 downloaded
Some antibiotics had severe class imbalance: gentamicin had only 13% resistant samples, making it hard for the model to learn resistance patterns without biasing toward the majority class
Converting raw MIC (minimum inhibitory concentration) lab values into binary resistant/susceptible labels using CLSI clinical breakpoint standards
Keeping backend models, API endpoints, and frontend components in sync across development branches while everything was continuously changing under a 36-hour deadline
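Two of the challenges above, turning MIC values into labels and handling class imbalance, can be sketched as follows. Everything here is illustrative: the breakpoint numbers are placeholders and NOT actual CLSI values, `parse_mic`, `mic_to_label`, and `positive_class_weight` are hypothetical names, and the negative-to-positive ratio is simply the value commonly passed to XGBoost's `scale_pos_weight` parameter, not a statement of how we configured our models.

```python
import re

# Placeholder breakpoints in ug/mL for a made-up drug; real values come from the CLSI tables.
ILLUSTRATIVE_BREAKPOINTS = {
    "drug_a": {"susceptible_max": 1.0, "resistant_min": 4.0},
}

MIC_PATTERN = re.compile(r"\s*(<=|>=|<|>|=)?\s*(\d+(?:\.\d+)?)(?:/\d+(?:\.\d+)?)?\s*")


def parse_mic(raw):
    """Parse '<=0.25', '>32', '4', or '2/38' (first component of a combination) into (sign, value)."""
    match = MIC_PATTERN.fullmatch(raw)
    if match is None:
        raise ValueError(f"unrecognized MIC value: {raw!r}")
    return match.group(1) or "=", float(match.group(2))


def mic_to_label(raw, breakpoints):
    """Return 1 (resistant), 0 (susceptible), or None (intermediate or undecidable)."""
    sign, value = parse_mic(raw)
    upper_bounded = sign in ("=", "<=", "<")  # true MIC is at most `value`
    lower_bounded = sign in ("=", ">=", ">")  # true MIC is at least `value`
    if upper_bounded and value <= breakpoints["susceptible_max"]:
        return 0
    if lower_bounded and value >= breakpoints["resistant_min"]:
        return 1
    return None  # excluded from binary training data


def positive_class_weight(labels):
    """Ratio of susceptible (0) to resistant (1) samples, used to up-weight the rare class."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    return negatives / positives


bp = ILLUSTRATIVE_BREAKPOINTS["drug_a"]
print([mic_to_label(v, bp) for v in ["<=0.25", "2", ">32"]])
print(round(positive_class_weight([1] * 13 + [0] * 87), 2))
```

Values that fall between the two breakpoints, or censored values that cannot be placed on either side, are returned as `None` so they can be dropped rather than mislabeled.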

Accomplishments that we're proud of

Trained 10 working antibiotic resistance classifiers from scratch in under 36 hours
Achieved 85% accuracy on ciprofloxacin with 94.6% sensitivity, and 81% on ceftriaxone with balanced sensitivity and specificity, using only raw genomic sequence data
Built a verification system proving predictions work on genomes the model has never seen, with actual lab results shown side by side
Created a complete end-to-end pipeline (raw DNA upload, k-mer extraction, real-time prediction, explainable results), all running in seconds
Processed 243,000 lab-verified AMR phenotype records and 4,693 bacterial genome sequences to build our training dataset
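For readers unfamiliar with the metrics quoted above, this short sketch shows how accuracy, sensitivity, and specificity follow from a confusion matrix. The function name `classification_metrics` and the toy labels are hypothetical; resistant is encoded as 1 and susceptible as 0.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity (recall on resistant), and specificity (recall on susceptible)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # resistant, predicted resistant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # resistant, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # susceptible, predicted susceptible
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # susceptible, flagged resistant
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(classification_metrics(y_true, y_pred))
```

Sensitivity matters most clinically here: a missed resistant case means a patient may receive a drug that will not work.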

What we learned

How antibiotic resistance works at the molecular level - resistance genes like TEM-1 beta-lactamases and gyrA mutations, horizontal gene transfer via plasmids, and why different antibiotics fail through fundamentally different biochemical mechanisms
That feature engineering matters more than model complexity - XGBoost on well-crafted k-mer features matches or outperforms neural networks on this task, according to published benchmarks
The critical importance of rigorous validation - cross-validation, bootstrap confidence intervals, and blind held-out test sets are what separate a real ML tool from a demo that memorized its training data
How to build and coordinate a full-stack ML application under extreme time pressure, from data acquisition through model training to deployment

What's next for bacter.ai

Train species-specific models for Staphylococcus aureus, Klebsiella pneumoniae, and Pseudomonas aeruginosa using the same pipeline
Add hyperparameter tuning and ensemble methods to push accuracy above 90%
Integrate with portable DNA sequencers (Oxford Nanopore MinION) for point-of-care deployment directly in hospitals
Validate on hospital-specific datasets to account for regional variation in resistance patterns
Build a clinician-facing report generator with treatment recommendations ranked by predicted efficacy