Chencheng Mo 09/08/2025
This pipeline combines two-step host sequence removal, species-level taxonomic annotation, and multi-center contamination filtering for large-scale, cross-cohort microbiome studies. Depleting host reads against two human references before classification, then filtering contaminants across cohorts, is intended to give lower contamination rates and higher taxonomic precision than a conventional single-reference metagenomic workflow.
- Two-step host removal using both GRCh38.p14 and CHM13 v2.0 reference genomes to maximize removal of host sequences.
- High-accuracy species annotation with Kraken2 + Bracken using the Kraken2 PlusPF reference database.
- Cross-cohort contamination removal using a multi-center contamination database and decontam filtering.
- Final microbiome abundance tables generated from high-confidence reads for downstream statistical analyses.
Convert raw sequencing output (e.g., SRA, BAM formats) into FASTQ files, ensuring proper sample demultiplexing, metadata integrity, and concatenation of sequencing reads when necessary.
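The conversion step above can be sketched as a pair of command builders. This is a minimal sketch, not the pipeline's actual script: it assumes `fasterq-dump` (SRA Toolkit) and `samtools` are the conversion tools, which the text does not state, and the function names, file names, and accession are hypothetical placeholders. The functions only build argument lists; nothing is executed.

```python
import shlex


def fasterq_dump_command(accession, out_dir, threads=8):
    """SRA accession -> split paired FASTQ files (assumes SRA Toolkit)."""
    return ["fasterq-dump", accession, "--split-files",
            "-O", out_dir, "-e", str(threads)]


def bam_to_fastq_commands(bam, r1, r2, threads=8):
    """BAM -> paired FASTQ (assumes samtools).

    Mates are grouped with `samtools collate` first, because
    `samtools fastq` expects the two reads of a pair to be adjacent.
    """
    collated = bam.removesuffix(".bam") + ".collated.bam"
    return [
        ["samtools", "collate", "-@", str(threads), "-o", collated, bam],
        ["samtools", "fastq", "-@", str(threads),
         "-1", r1, "-2", r2,
         "-0", "/dev/null", "-s", "/dev/null",  # discard unpaired/singleton reads
         "-n", collated],
    ]


if __name__ == "__main__":
    # Placeholder accession and file names, for illustration only.
    print(shlex.join(fasterq_dump_command("SRR000001", "fastq")))
    for cmd in bam_to_fastq_commands("s1.bam", "s1_R1.fastq.gz", "s1_R2.fastq.gz"):
        print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed.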
Perform read trimming and quality filtering using fastp (v0.23.4) to remove low-quality bases, adapters, and sequencing artifacts.
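As an illustration of the trimming step above, the following sketch builds a paired-end fastp command. The output naming scheme, the thread count, and the use of `--detect_adapter_for_pe` are assumptions; the text does not list the pipeline's actual fastp parameters.

```python
import shlex
from pathlib import Path


def fastp_command(r1, r2, out_dir, sample, threads=8):
    """Build a paired-end fastp command as an argv list (not executed)."""
    out = Path(out_dir)
    return [
        "fastp",
        "-i", str(r1), "-I", str(r2),                      # raw R1 / R2
        "-o", str(out / f"{sample}_R1.trimmed.fastq.gz"),  # trimmed R1
        "-O", str(out / f"{sample}_R2.trimmed.fastq.gz"),  # trimmed R2
        "--detect_adapter_for_pe",                         # assumption: auto-detect adapters
        "--json", str(out / f"{sample}.fastp.json"),       # per-sample QC report
        "--html", str(out / f"{sample}.fastp.html"),
        "--thread", str(threads),
    ]


if __name__ == "__main__":
    # Hypothetical sample name and paths.
    print(shlex.join(fastp_command("S1_R1.fastq.gz", "S1_R2.fastq.gz", "qc", "S1")))
```

Keeping the JSON report per sample makes it straightforward to tabulate read counts before and after filtering across cohorts.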
Align reads to the human reference genome GRCh38.p14 using bowtie2 (v2.5.2) to remove human-origin sequences.
Align the remaining reads to the CHM13 v2.0 telomere-to-telomere (T2T) human reference using bowtie2 (v2.5.2) for deeper host sequence depletion.
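The two host-removal passes above can be chained so that the unaligned pairs from the GRCh38.p14 alignment become the input to the CHM13 v2.0 alignment. The sketch below is one way to express that, under stated assumptions: the index prefixes and file names are hypothetical, and `--un-conc-gz` (keep pairs that do not align concordantly) is an assumed choice, since the text does not say which bowtie2 options the pipeline uses.

```python
import shlex


def bowtie2_unaligned_command(index_prefix, r1, r2, out_template, threads=8):
    """Align a read pair and keep only pairs that fail to align concordantly.

    bowtie2 replaces '%' in out_template with 1 or 2 for the two mates.
    The SAM output is discarded because only unaligned reads are needed.
    """
    return ["bowtie2", "-p", str(threads), "-x", index_prefix,
            "-1", r1, "-2", r2,
            "--un-conc-gz", out_template,
            "-S", "/dev/null"]


def two_step_host_removal(sample, r1, r2, indexes, out_dir, threads=8):
    """Chain one bowtie2 pass per (label, index_prefix) in `indexes`.

    Returns the list of commands and the final host-depleted (R1, R2) paths.
    """
    commands = []
    for label, index_prefix in indexes:
        stem = f"{out_dir}/{sample}.no_{label}.R"
        commands.append(bowtie2_unaligned_command(
            index_prefix, r1, r2, stem + "%.fastq.gz", threads))
        # The output of this pass is the input of the next one.
        r1, r2 = stem + "1.fastq.gz", stem + "2.fastq.gz"
    return commands, (r1, r2)


if __name__ == "__main__":
    # Hypothetical index locations.
    refs = [("grch38", "ref/GRCh38.p14"), ("chm13", "ref/chm13v2.0")]
    cmds, final = two_step_host_removal(
        "S1", "S1_R1.trimmed.fastq.gz", "S1_R2.trimmed.fastq.gz", refs, "host_removed")
    for cmd in cmds:
        print(shlex.join(cmd))
    print("host-depleted reads:", final)
```

Note that `--un-conc-gz` retains any pair that does not align concordantly, including pairs where only one mate hits the host genome. A stricter alternative is to write SAM/BAM output and keep only pairs in which both mates are unmapped.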
Classify microbial reads using Kraken2 (v2.1.3) with the PlusPF (2024.1.12) database.
Refine species-level abundance estimates using Bracken for more accurate microbial quantitation.
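Classification and abundance re-estimation run back to back: Kraken2 writes a report, and Bracken consumes that report. The sketch below builds both commands. The read length (150), taxonomic level (`S`), and read threshold (10) are illustrative assumptions rather than values from this document, and the paths are hypothetical.

```python
import shlex


def kraken2_command(db, r1, r2, sample, out_dir, threads=8):
    """Classify host-depleted paired reads against a Kraken2 database."""
    return ["kraken2", "--db", db, "--threads", str(threads),
            "--paired", "--gzip-compressed",
            "--report", f"{out_dir}/{sample}.kreport",  # per-taxon summary, input to Bracken
            "--output", f"{out_dir}/{sample}.kraken",   # per-read assignments
            r1, r2]


def bracken_command(db, sample, out_dir, read_len=150, level="S", threshold=10):
    """Re-estimate abundances at `level` from the Kraken2 report.

    The database directory must contain a k-mer distribution file
    matching `read_len`.
    """
    return ["bracken", "-d", db,
            "-i", f"{out_dir}/{sample}.kreport",
            "-o", f"{out_dir}/{sample}.bracken",
            "-w", f"{out_dir}/{sample}.bracken.kreport",
            "-r", str(read_len), "-l", level, "-t", str(threshold)]


if __name__ == "__main__":
    db = "db/k2_pluspf"  # hypothetical path to the PlusPF database
    print(shlex.join(kraken2_command(db, "S1.R1.fastq.gz", "S1.R2.fastq.gz", "S1", "kraken")))
    print(shlex.join(bracken_command(db, "S1", "kraken")))
```

Using the same database directory for both tools keeps the taxonomy consistent between the classification and the re-estimated counts.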
Use a multi-center contamination database and decontam filtering to remove background contaminants, ensuring cross-cohort comparability.
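To make the database-lookup half of this step concrete, here is a deliberately simplified sketch: it drops taxa that appear in a contaminant list and recomputes relative abundances. It is a stand-in for illustration only. It does not reproduce decontam's statistical tests (decontam is an R package), and the taxon names, counts, and function names are hypothetical.

```python
def remove_contaminants(counts, contaminants):
    """Drop listed contaminant taxa from every sample.

    counts: {sample: {taxon: read_count}}
    contaminants: set of taxon names flagged as contaminants
    """
    return {
        sample: {taxon: n for taxon, n in taxa.items() if taxon not in contaminants}
        for sample, taxa in counts.items()
    }


def relative_abundance(taxa):
    """Convert read counts for one sample to fractions summing to 1."""
    total = sum(taxa.values())
    if total == 0:
        return {taxon: 0.0 for taxon in taxa}
    return {taxon: n / total for taxon, n in taxa.items()}


if __name__ == "__main__":
    # Toy counts; the contaminant entry is an illustrative example only.
    counts = {"S1": {"Bacteroides fragilis": 60,
                     "Ralstonia pickettii": 20,
                     "Escherichia coli": 20}}
    cleaned = remove_contaminants(counts, {"Ralstonia pickettii"})
    print(relative_abundance(cleaned["S1"]))
```

Renormalizing after removal matters for cross-cohort comparisons: otherwise samples with heavier contamination would show systematically lower relative abundances for every remaining taxon.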
Extract high-confidence microbial reads and generate final microbiome abundance tables using KrakenTools, ready for downstream statistical and ecological analyses.
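The final step can be sketched with two KrakenTools scripts: `extract_kraken_reads.py` to pull reads assigned to a clade, and `combine_bracken_outputs.py` to merge per-sample Bracken files into one table. The choice of taxid 2 (Bacteria) with `--include-children` is an assumption for illustration, since the text does not define which reads count as high-confidence, and all paths are hypothetical.

```python
import shlex


def extract_reads_command(kraken_out, report, r1, r2, out_r1, out_r2, taxid="2"):
    """Extract read pairs assigned to `taxid` or any of its descendants.

    --include-children needs the Kraken2 report (-r) to resolve the clade.
    """
    return ["extract_kraken_reads.py",
            "-k", kraken_out, "-r", report,
            "-s1", r1, "-s2", r2,
            "-o", out_r1, "-o2", out_r2,
            "-t", str(taxid), "--include-children",
            "--fastq-output"]


def combine_bracken_command(bracken_files, out_table):
    """Merge per-sample Bracken outputs into one multi-sample abundance table."""
    return ["combine_bracken_outputs.py", "--files", *bracken_files, "-o", out_table]


if __name__ == "__main__":
    print(shlex.join(extract_reads_command(
        "kraken/S1.kraken", "kraken/S1.kreport",
        "S1.R1.fastq.gz", "S1.R2.fastq.gz",
        "S1.microbial.R1.fastq", "S1.microbial.R2.fastq")))
    print(shlex.join(combine_bracken_command(
        ["kraken/S1.bracken", "kraken/S2.bracken"], "abundance_table.tsv")))
```

The combined table has one row per taxon and per-sample count and fraction columns, which is a convenient input for downstream statistical packages.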
- Reduced Contamination — Multi-step host removal and cohort-specific contaminant filtering.
- Improved Precision — Dual-host genome alignment combined with Bracken refinement for species-level accuracy.
- Scalable — Optimized for integration of datasets from multiple global cohorts.
- Reproducible — Modular structure with version-controlled tools and databases.
- Python 3.12.2
- R 3.4