🧬 Biomarker Discovery Pipeline

An end-to-end, reproducible computational framework for transcriptomic biomarker discovery and validation.
This repository demonstrates how raw RNA-seq or microarray data can be processed, analyzed, and modeled using a combination of statistical and machine learning methods.

🧠 Experience Gained & Responsibilities

Gained foundational understanding of transcriptomic technologies (RNA-seq, microarray) and public data repositories, including NCBI GEO, NCBI SRA, and ArrayExpress.
Analyzed raw RNA-seq data from FASTQ to quantified counts using splice-aware aligners (STAR, HISAT2) and alignment-free quantifiers (Salmon, kallisto) for robust gene expression profiling.
Retrieved and processed GEO datasets using GEOquery, followed by manual metadata curation for comparative transcriptomic analysis.
Performed differential gene expression analysis using edgeR, limma, and DESeq2, including normalization, outlier detection, and batch effect correction.
Applied downstream visualization and statistical methods: heatmaps, boxplots, PCA, UMAP, t-SNE, hierarchical clustering, volcano plots, and phylogenetic analyses.
Conducted functional enrichment analysis using DAVID, STRING, Enrichr, Reactome, and GSEA to identify enriched pathways and biological processes associated with Mpox virus infection.
Predicted candidate therapeutic compounds using DSigDB based on DEG signatures.
Validated key DEGs as potential biomarkers via ROC curve analysis (AUC-based).
Developed and benchmarked machine learning models (Random Forest, Extra Trees, SVM) to identify predictive biomarkers, including model tuning and feature importance ranking.
Utilized a range of bioinformatics tools and libraries: R (DESeq2, limma, edgeR, GOplot, ggplot2, pROC); Cytoscape (ClueGO, CluePedia, cytoHubba, MCODE); Python (PyCaret, scikit-learn, pandas); and TBtools.
First-author publication:
Debnath, J.P., et al. (2025). Identification of potential biomarkers for 2022 Mpox virus infection: a transcriptomic network analysis and machine learning approach. Scientific Reports, 15, 2922.
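
The ROC-based validation of individual DEGs listed above can be illustrated with a minimal standard-library sketch. It computes the AUC for one gene as the probability that a randomly chosen case sample has higher expression than a randomly chosen control (the Mann-Whitney formulation), counting ties as one half. The function name `gene_auc` and the toy expression values are assumptions made for illustration, not code or data from this repository; in practice pROC (R) or scikit-learn (Python) would be the usual choice.

```python
def gene_auc(expression, labels):
    """Return the ROC AUC of one gene's expression used as a classifier score.

    expression: per-sample expression values (e.g. normalized counts).
    labels: 1 for case (e.g. infected), 0 for control, in the same order.
    """
    cases = [x for x, y in zip(expression, labels) if y == 1]
    controls = [x for x, y in zip(expression, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control sample")

    # Count case/control pairs where the case has higher expression.
    # Tied pairs contribute 0.5, as in the Mann-Whitney U statistic.
    concordant = 0.0
    for case_value in cases:
        for control_value in controls:
            if case_value > control_value:
                concordant += 1.0
            elif case_value == control_value:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


# Toy values, made up for illustration only (not results from the study).
toy_expression = [8.1, 7.9, 9.4, 5.2, 6.0, 8.0]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = gene_auc(toy_expression, toy_labels)
print(f"AUC = {toy_auc:.3f}")
```

An AUC well below 0.5 means the gene is lower in cases than in controls; it is still informative as a biomarker, with the direction reversed.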

🧩 Project Workflow

1️⃣ Preprocessing & Quality Control    →  FastQC, Trimmomatic  
2️⃣ Alignment & Quantification         →  STAR / Salmon / featureCounts  
3️⃣ Differential Expression Analysis   →  DESeq2, edgeR, limma  
4️⃣ Functional Enrichment Analysis     →  GSEA, Enrichr, GOstats  
5️⃣ Network Modeling                   →  Cytoscape (cytoHubba, MCODE): hub genes, functional modules  
6️⃣ Machine Learning Benchmarking      →  PyCaret (AUPRC, AUROC, F1, MCC)
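
Two of the benchmarking metrics named in step 6, F1 and MCC, can be written out directly from the confusion matrix. The sketch below uses only the standard library to show what the numbers mean; the function names and the toy labels are illustrative assumptions, and the pipeline itself reports these metrics through PyCaret.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return tp, tn, fp, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 if undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient in [-1, 1]; 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels, made up for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
f1 = f1_score(tp, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"F1 = {f1:.3f}, MCC = {mcc:.3f}")
```

Unlike F1, MCC uses all four cells of the confusion matrix, which makes it a useful complement when case and control groups are unbalanced.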

📥 Data Access

Raw RNA-seq and microarray datasets used in this pipeline can be downloaded from public repositories:

SRA (Sequence Read Archive): raw sequencing reads
GEO (Gene Expression Omnibus): microarray and processed expression datasets
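
As a small convenience for locating a GEO series, the sketch below builds the landing-page and download-directory URLs from a series accession. The URL patterns reflect GEO's public site layout as assumed here (series grouped into `GSEnnn`-style folders) and should be checked against the GEO documentation; the helper name `geo_series_urls` and the accession `GSE123456` are placeholders, not the dataset used in this project.

```python
import re


def geo_series_urls(accession):
    """Build GEO URLs for a series accession such as 'GSE123456'.

    Assumes GEO's layout where the download folder replaces the last
    three digits of the accession with 'nnn' (e.g. GSE123nnn).
    """
    match = re.fullmatch(r"GSE(\d+)", accession)
    if match is None:
        raise ValueError(f"not a GEO series accession: {accession!r}")
    digits = match.group(1)
    # Accessions with three or fewer digits fall under the 'GSEnnn' folder.
    folder = "GSE" + digits[:-3] + "nnn"
    return {
        "landing_page": (
            "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=" + accession
        ),
        "download_directory": (
            f"https://ftp.ncbi.nlm.nih.gov/geo/series/{folder}/{accession}/"
        ),
    }


# Placeholder accession for illustration only.
urls = geo_series_urls("GSE123456")
print(urls["landing_page"])
print(urls["download_directory"])
```

Within R, GEOquery (already used in this project) retrieves the same series directly from the accession.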

📜 License

This project is distributed under the MIT License.
See LICENSE for details.

