About the Project

Meet AmPy, an LLM specialized in antibiotic treatments and antimicrobial resistance, built on Google DeepMind's lightweight Gemma3 model.

Inspiration

Antibiotic resistance is a growing global health threat, making rapid and accurate treatment decisions critical for patient outcomes. AmPy aims to be a user-friendly platform that helps physicians quickly choose effective treatment options based on bacterial genomic data. Traditional lab-based resistance testing is slow and resource-intensive. We were inspired to build a tool that leverages bacterial genome data and machine learning to instantly predict resistance profiles for multiple antibiotics, empowering clinicians to make informed choices in real time. AmPy provides the reasoning and evidence behind its treatment recommendations and can have thorough discussions with physicians about the specific cases they are treating!

What We Learned

  • Bioinformatics: We learned how to process and represent genome sequences using k-mer extraction, a technique that breaks DNA into short substrings for feature engineering.
  • Machine Learning: We explored multi-label classification, model evaluation (AUC, AUPRC), and feature importance for interpretability.
  • Data Engineering: We tackled large-scale data streaming, merging genome and phenotype data, and handling missing/imbalanced labels.
  • Model Deployment: We experimented with user interfaces (Gradio) for real-world accessibility.
  • LLM Deployment: We trained our LLM, AmPy, on the World Health Organization's antibiotic recommendation documentation. AmPy was built on Google's Gemma3 open-weights model.

How We Built It

  • Data Preparation: We streamed bacterial genome data and resistance labels, sampled genomes, and merged them for analysis.
  • Feature Extraction: We used k-mer extraction and TF-IDF vectorization to convert DNA sequences into machine-readable features.
  • Model Training: For each antibiotic, we trained an XGBoost classifier, using stratified splits and robust metrics (AUC, AUPRC).
  • Evaluation: We summarized model performance, identified top-performing antibiotics, and extracted the most predictive k-mers.
  • Prediction & UI: We built a pipeline to predict resistance for any genome and deployed a Gradio interface for user-friendly access.
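
The stratified split mentioned in the Model Training step can be illustrated with a minimal standard-library sketch. It is not the project's actual code: the function name `stratified_split`, the 20% test fraction, and the seed are assumptions chosen for illustration. The idea is that each class (resistant or susceptible) is shuffled and split separately, so rare classes keep the same proportion in the training and test sets.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test while preserving class ratios.

    `labels` holds one class label per sample (e.g. 1 = resistant,
    0 = susceptible). The name and defaults are hypothetical.
    """
    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out at least one sample of every class for testing.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy example: 10 resistant and 40 susceptible genomes.
toy_labels = [1] * 10 + [0] * 40
train_idx, test_idx = stratified_split(toy_labels, test_fraction=0.2)
```

In this toy example the test set receives 2 resistant and 8 susceptible samples, matching the 1:4 ratio of the full data. A class with a single sample would end up entirely in the test set, which is one reason very rare labels are hard to train on.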

Challenges

  • Data Imbalance: Many antibiotics had few resistant or susceptible samples, requiring careful selection of trainable targets.
  • Computational Load: Genome data is massive; we optimized by sampling, limiting k-mer features, and using efficient vectorization.
  • Interpretability: Making predictions explainable for clinicians was key, so we included feature importance and confidence scores.
  • Integration: Merging streaming genome data with static label files and handling missing data was non-trivial.
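
To make the data imbalance point concrete, here is a minimal sketch of how trainable targets could be selected. The function name `select_trainable_targets`, the `min_per_class` threshold, and the antibiotic names in the example are all hypothetical; the project's real criteria are not stated above. Missing labels are represented as `None` and ignored when counting.

```python
from typing import Optional


def select_trainable_targets(
    labels: dict[str, list[Optional[int]]],
    min_per_class: int = 10,
) -> list[str]:
    """Keep antibiotics with enough resistant (1) AND susceptible (0) samples.

    `None` marks a genome with no phenotype recorded for that antibiotic.
    The threshold is an illustrative assumption, not a project setting.
    """
    trainable = []
    for antibiotic, values in labels.items():
        observed = [v for v in values if v is not None]  # drop missing labels
        resistant = sum(1 for v in observed if v == 1)
        susceptible = sum(1 for v in observed if v == 0)
        if resistant >= min_per_class and susceptible >= min_per_class:
            trainable.append(antibiotic)
    return trainable


# Toy label table: one list of per-genome labels per antibiotic.
toy_table = {
    "ciprofloxacin": [1, 0, 1, 0, None],
    "ampicillin": [1, 1, 1, None, None],  # no susceptible samples
    "gentamicin": [0, 0, 1, None, 0],     # only one resistant sample
}
targets = select_trainable_targets(toy_table, min_per_class=2)
```

With `min_per_class=2`, only the first antibiotic in the toy table passes, because the other two lack enough samples of one class.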

Math & Algorithms

We used k-mer extraction: $$ \text{For a sequence } S, \text{ k-mers are } \{ S_i = S[i:i+k] \mid 0 \leq i \leq |S| - k \} $$
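
The definition above translates almost directly into Python slicing. The following is a minimal sketch; the function name `extract_kmers` and the example sequence are illustrative assumptions rather than code from the project.

```python
from collections import Counter


def extract_kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping k-mer S[i:i+k] for 0 <= i <= len(S) - k."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # range() is empty when the sequence is shorter than k,
    # so short inputs simply yield no k-mers.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


kmers = extract_kmers("ATGCGATG", k=3)
# kmers == ["ATG", "TGC", "GCG", "CGA", "GAT", "ATG"]
kmer_counts = Counter(kmers)  # "ATG" occurs twice
```

A sequence of length n yields n - k + 1 overlapping k-mers, so the k-mer counts of a whole genome form a large but sparse feature vector.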

TF-IDF vectorization: $$ \text{TF-IDF}(t, d) = \text{tf}(t, d) \times \log\left(\frac{N}{\text{df}(t)}\right) $$
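
As a worked illustration of this formula, the sketch below computes the weights by hand, treating each genome as a list of k-mers and using the raw count as tf. The function name `tfidf` and the toy data are assumptions. Note that scikit-learn's `TfidfVectorizer` defaults to a smoothed IDF and L2-normalizes each row, so its numbers will differ from this textbook version.

```python
import math
from collections import Counter


def tfidf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Compute tf(t, d) * log(N / df(t)) for every term in every document."""
    n_docs = len(documents)

    # df(t): number of documents that contain term t at least once.
    df = Counter()
    for doc in documents:
        df.update(set(doc))

    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw term counts within this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Two toy "genomes", already split into 3-mers.
docs = [["ATG", "TGC", "ATG"], ["ATG", "GCG"]]
weights = tfidf(docs)
```

Here "ATG" appears in both documents, so log(2/2) = 0 and its weight vanishes, while "TGC" and "GCG" each appear in one document and receive weight log(2). This is the intended behaviour: k-mers shared by every genome carry no discriminative signal.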

Model evaluation: $$ AUC = \int_0^1 TPR(FPR^{-1}(x)) dx $$
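
In practice the integral is approximated from a finite set of predictions. The sketch below, with the hypothetical name `roc_auc`, sweeps the decision threshold from high to low, records each (FPR, TPR) point, and integrates with the trapezoidal rule. It is a standard-library illustration under those assumptions, not the evaluation code used in the project.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """Area under the ROC curve via the trapezoidal rule.

    `labels` are 1 (resistant) or 0 (susceptible); `scores` are the
    predicted probabilities or scores for the positive class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    # Highest scores first: lowering the threshold admits them in this order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    auc = 0.0
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        # Trapezoid between the previous ROC point and the current one.
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_tpr, prev_fpr = tpr, fpr
    return auc


auc_value = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

A perfect ranking gives 1.0, and a model that assigns every sample the same score gives 0.5, the value expected from random guessing.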

Final Thoughts

We built a scalable, explainable, and fast pipeline for multi-antibiotic resistance prediction from genome data. This project showed us the power of combining bioinformatics and AI to address urgent clinical needs. We hope our tool can help accelerate effective treatment and combat the rise of superbugs.

GITHUB: https://github.com/hackbio-ca/genome-based-antibiotic-recommendation

Built With

  • gradio
  • matplotlib
  • numpy
  • hugging-face-datasets
  • pandas
  • pickle
  • python
  • scikit-learn
  • seaborn
  • tfidfvectorizer
  • tqdm
  • xgboost