Instantaneous Metagenomic Taxonomic Profiling with MetaKSSD

MetaKSSD is the second version of KSSD (K-mer Substring Space Sampling/Shuffling Decomposition), designed for instantaneous metagenome taxonomic profiling using WGS fastq data.

K-mer Substring Space Decomposition (KSSD) enables highly efficient genome sketching and lossless sketch operations, including union, intersection, and subtraction. Building on the KSSD framework, MetaKSSD adds a new feature that tracks k-mer counts within the sketch. On top of these functions, MetaKSSD provides methods for constructing a taxonomic marker database (MarkerDB), metagenome taxonomic profiling, and profile searching.

For users unfamiliar with the Linux command line, we provide on-the-spot, online metagenomic analysis through the client apps:

Mac OS MetaKSSD Clients, see tutorial video.

Windows OS MetaKSSD Clients, see tutorial video.

Give it a try; we think you will like it!

Users may also be interested in The philosophy behind MetaKSSD, and its advantages over Sylph.

We also provide 382,016 QCed NCBI SRA metagenomic profiles generated by MetaKSSD as of Dec. 27, 2023.

Note: MetaKSSD is best suited to prokaryotic metagenomic NGS whole-genome sequencing (WGS) data. It does not currently support amplicon (e.g. 16S) data, and viral metagenomes might not be detected with the default 4096-fold reduction (L3K*.shuf) and the GTDB MarkerDB.

1. Installation

git clone https://github.com/yhg926/MetaKSSD.git &&
cd MetaKSSD && make
export PATH=$(pwd)/bin:$PATH
# To remember METAKSSD PATH permanently
echo "export PATH=$(pwd):$(pwd)/bin:\$PATH" >> ~/.bashrc
echo "export METAKSSD_PATH=$(pwd)" >> ~/.bashrc
source ~/.bashrc

You can also install via conda:

conda install --channel https://conda.anaconda.org/luxiaoxin2 metakssd

or via apt:

# 1. Add the PPA source
sudo add-apt-repository ppa:metakssd/metakssd -y
# 2. Update the package index
sudo apt update
# 3. Install metakssd
sudo apt install metakssd
# 4. Verify that the installation succeeded
metakssd --help
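
Whichever installation route you choose, a quick sanity check can save time later. The function below is only a sketch: `check_metakssd_setup` is a hypothetical helper name, and it assumes the source-install layout used elsewhere in this README (`$METAKSSD_PATH/shuf_files/L3K11.shuf`). For conda or apt installs, `METAKSSD_PATH` may not be set, in which case the check reports that and you need to point it at your copy of the repository data.

```shell
# Hypothetical helper: verify that metakssd is callable and that the
# default shuffle file referenced in this README can be found.
check_metakssd_setup() {
    local status=0
    if ! command -v metakssd >/dev/null 2>&1; then
        echo "metakssd not found on PATH" >&2
        status=1
    fi
    if [ -z "${METAKSSD_PATH:-}" ]; then
        echo "METAKSSD_PATH is not set" >&2
        status=1
    elif [ ! -f "$METAKSSD_PATH/shuf_files/L3K11.shuf" ]; then
        echo "missing $METAKSSD_PATH/shuf_files/L3K11.shuf" >&2
        status=1
    fi
    return "$status"
}

# Usage: check_metakssd_setup && echo "setup looks fine"
```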

1.1 (Optional) Get pre-built MarkerDB (L3K11)

# download one of the two archives below
# latest version
wget https://zenodo.org/records/16317275/files/GTDBr226_genomes_L3K11_sketch_markerdb.tar.gz
# older version
wget https://zenodo.org/records/14353354/files/markerdb.gtdb_r214v2_L3K11_sketch.tar.gz
# extract the downloaded archive (the glob must match exactly one file)
tar xf *markerdb*.tar.gz

This step is only needed when you do not have a MarkerDB. You can also prepare your own MarkerDB, see build custom MarkerDB.

1.2 (Optional) Prepare GTDB-to-NCBI taxonomy conversion tables

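# only needed if you use a MarkerDB earlier than r226
# [214|226] is a placeholder, not a shell pattern: substitute the release number that matches your files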
gunzip -d $METAKSSD_PATH/data/best.gtdbr[214|226]_psid2ncbi_specid.tsv.gz;
gunzip -d $METAKSSD_PATH/data/scienficaname.ncbitaxid_rank_parentnode_name.gtdbr[214|226]_pseudoidrelated.tsv.gz

These files are only needed when you have to convert GTDB taxonomy to NCBI taxonomy and the MarkerDB is earlier than r226.

1.3 (Optional) Get pre-built Abundance Vector Database (L3K11)

# only r214 is currently available
wget https://zenodo.org/records/11437234/files/markerdb.abvdb231227.L3K11_gtdb_r214.tar.gz
tar xf markerdb.abvdb231227.L3K11_gtdb_r214.tar.gz

The Abundance Vector Database is only needed when you perform abundance vector searching. You can also prepare your own Abundance Vector Database, see Index abundance vector database.

2. Metagenome profiling

One-stop profiling (GTDB taxonomy only)

# one .fq file per sample 
run_profiling.sh <MarkerDB> <sample1.fq> ...

Step-by-step profiling pipeline (for user customization)

#sketching with k-mer counts tracking
metakssd dist -L $METAKSSD_PATH/shuf_files/L3K11.shuf -A -o <sample1_sketch> <sample1.fastq>
#generate raw profile
metakssd composite -r <markerdb> -q <sample1_sketch> > <species_coverage.tsv>
#abundance normalization
perl $METAKSSD_PATH/scripts/possion.kssd2out.pl <species_coverage.tsv> <minimum overlapped k-mer S (default:18)> > <species_relative_abundance_profile>
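
When customizing the pipeline for many samples, the three steps above can be wrapped in a small function. This is a sketch rather than part of MetaKSSD: `profile_sample`, the output directory layout, and the file suffixes are assumptions chosen for illustration, while the `metakssd` and `perl` invocations are copied from the steps above. It expects `METAKSSD_PATH` to be set as in the installation section.

```shell
# Hypothetical wrapper around the three step-by-step commands.
# Arguments: <markerdb> <sample.fastq> <outdir> [minimum overlapped k-mers, default 18]
profile_sample() {
    local markerdb=$1 fastq=$2 outdir=$3 min_kmers=${4:-18}
    local name
    name=$(basename "$fastq")
    name=${name%%.*}                      # sample name = file name without extensions
    mkdir -p "$outdir" || return 1

    # 1. sketching with k-mer count tracking
    metakssd dist -L "$METAKSSD_PATH/shuf_files/L3K11.shuf" -A \
        -o "$outdir/${name}_sketch" "$fastq" || return 1

    # 2. raw profile (species coverage)
    metakssd composite -r "$markerdb" -q "$outdir/${name}_sketch" \
        > "$outdir/$name.species_coverage.tsv" || return 1

    # 3. abundance normalization
    perl "$METAKSSD_PATH/scripts/possion.kssd2out.pl" \
        "$outdir/$name.species_coverage.tsv" "$min_kmers" \
        > "$outdir/$name.abundance.tsv"
}

# Usage (one call per sample):
#   for fq in reads/*.fastq; do profile_sample <markerdb> "$fq" profiles; done
```

Keeping each sample's sketch and intermediate coverage table makes it possible to rerun only the normalization step with a different k-mer threshold.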

To convert species abundances to a full GTDB taxonomy profile, use

perl $METAKSSD_PATH/scripts/kssd2out2gtdb_taxonomy_profile.pl <species_relative_abundance_profile> data/gtdbr[214|226]_psid2krona_taxonomy.tsv

To convert to a CAMI-format profile with NCBI taxonomy, use

# for profile annotation later than r226 (need taxonkit installed)
$METAKSSD_PATH/scripts/mk_cami_from_profile.sh -m $METAKSSD_PATH/data/r226.best.gtdb_species2ncbi_species_taxid.tsv.gz -i <species_relative_abundance_profile> -o all_samples_cami.profile
# for earlier profile annotation
perl $METAKSSD_PATH/scripts/possion.kssdcomposite2taxonomy_profilefmt.pl <species_coverage.tsv> data/best.gtdbr[214|226]_psid2ncbi_specid.tsv data/scienficaname.ncbitaxid_rank_parentnode_name.gtdbr[214|226]_pseudoidrelated.tsv 18 > sample1.profile

To convert species_coverage.tsv to a Krona-format profile, use

perl $METAKSSD_PATH/scripts/kssdcomposite2gtdb_tax_kronafmt.pl <species_coverage.tsv> data/gtdbr[214|226]_psid2krona_taxonomy.tsv <outdir>

3. Abundance Vector Searching

3.1 Generate your abundance vector in a given path

metakssd composite -r <markerdb> -q <metagenome sketch> -b -o <path>

3.2 Abundance Vector Searching

To retrieve abundance vectors similar to an abundance vector "input.abv" from a markerdb with indexed abv:

metakssd composite -r <markerdb.abvdb> -s<0 or 1> <path/input.abv>

Here, the options -s0 and -s1 enable searching based on L1 norm and cosine similarity, respectively.
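
To make the difference between the two measures concrete, the toy function below computes the L1 distance and the cosine similarity of two small numeric vectors. It only illustrates the standard definitions: `vec_compare` is a hypothetical name, it works on plain space-separated numbers rather than on `.abv` files, and it makes no claim about how MetaKSSD normalizes or ranks vectors internally.

```shell
# Toy illustration: L1 distance and cosine similarity of two
# equal-length vectors passed as space-separated strings.
vec_compare() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, " "); m = split(b, y, " ")
        if (n != m || n == 0) { print "vectors must have equal, non-zero length" > "/dev/stderr"; exit 1 }
        l1 = 0; dot = 0; na = 0; nb = 0
        for (i = 1; i <= n; i++) {
            d = x[i] - y[i]
            l1 += (d < 0 ? -d : d)        # sum of absolute differences
            dot += x[i] * y[i]
            na += x[i] * x[i]
            nb += y[i] * y[i]
        }
        if (na == 0 || nb == 0) { print "cosine undefined for a zero vector" > "/dev/stderr"; exit 1 }
        printf "L1=%.4f cosine=%.4f\n", l1, dot / sqrt(na * nb)
    }'
}

# Usage: vec_compare "0.5 0.5 0" "0.5 0 0.5"
```

A smaller L1 distance and a larger cosine similarity both indicate more similar abundance profiles; the L1 distance is sensitive to absolute differences in each species, whereas cosine similarity depends only on the direction of the vectors.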

4. Index abundance vector database

Suppose you have many *.abv files generated as described in section 3.1. The abundance vector database can then be indexed as follows:

#make folder named abundance_Vec under your markerdb path
mkdir -p <markerdb path>/abundance_Vec
#collect all *.abv to the folder
cp *.abv <markerdb path>/abundance_Vec
#index 
metakssd composite -r <markerdb path> -i
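
With a very large number of vectors, `cp *.abv` can fail with an "Argument list too long" error because the shell expands the glob into a single command line. The sketch below performs the same three steps using `find`, which copies files one at a time. `index_abv_db` is a hypothetical helper name; the `abundance_Vec` folder and the `metakssd composite -r ... -i` call are taken from the steps above.

```shell
# Hypothetical helper: collect *.abv files from <src_dir> into
# <markerdb path>/abundance_Vec and index them.
index_abv_db() {
    local src_dir=$1 markerdb=$2
    local dest="$markerdb/abundance_Vec"
    mkdir -p "$dest" || return 1
    # copy file by file so that huge collections do not overflow the command line
    find "$src_dir" -maxdepth 1 -type f -name '*.abv' -exec cp {} "$dest/" \; || return 1
    metakssd composite -r "$markerdb" -i
}

# Usage: index_abv_db <dir with *.abv> <markerdb path>
```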

5. Build custom MarkerDB

One-stop MarkerDB construction

# build gtdbr226
build_MarkerDB_gtdbr226.sh <all_gtdbr226_genomes_dir>
# gtdbr214
build_MarkerDB.sh <all_gtdbr214_genomes_dir>

Step-by-step MarkerDB construction (for user customization)

# sketching reference genomes
metakssd dist -L <L3K11.shuf> -o <L3K11_sketch> <all genomes Dir>
# print genome name
metakssd set -P <L3K11_sketch> > <genome_name.txt>

The command above prints the genome names in the sketch, one per line, into a file named "genome_name.txt". Next, prepare a grouping file named "group_name.txt" with the same number of lines as "genome_name.txt", where each line gives the species of the genome on the corresponding line of "genome_name.txt". Each line follows the format "ID species_name\n", where ID is a non-negative integer that uniquely labels the species name; the species name of each genome can be found in the metadata files {ar53,bac120}_metadata_r*.tsv.gz. For example, "1 Escherichia_coli\n" represents the species "Escherichia coli" labeled with ID 1. ID 0 is reserved for excluding genomes from the resulting sketch. Once "group_name.txt" is prepared, genomes can be grouped by their originating species using the following command:

metakssd set -g <group_name.txt> -o <L3K11_pan-sketch> <L3K11_sketch>
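
One way to prepare "group_name.txt" is sketched below. It assumes you have already derived a two-column, tab-separated mapping file (called `genome2species.tsv` here, a hypothetical name) from the GTDB metadata, with the genome name exactly as printed in "genome_name.txt" in column 1 and the species name in column 2. The function name `make_group_file`, the single-space separator (modeled on the "1 Escherichia_coli" example), and the placeholder label `excluded` for ID 0 are also assumptions; compare the output with what build_MarkerDB.sh produces before relying on it.

```shell
# Hypothetical helper: write one "ID species_name" line per genome,
# in the same order as genome_name.txt.
# Arguments: <genome_name.txt> <genome2species.tsv>
make_group_file() {
    awk -F'\t' '
        NR == FNR { species[$1] = $2; next }    # first file: genome -> species mapping
        {
            if (!($0 in species)) {             # unmapped genome: exclude with ID 0
                print "0 excluded"
                next
            }
            sp = species[$0]
            gsub(/ /, "_", sp)                  # "Escherichia coli" -> "Escherichia_coli"
            if (!(sp in id)) id[sp] = ++n       # assign IDs 1, 2, ... in order of first appearance
            print id[sp], sp
        }' "$2" "$1"
}

# Usage: make_group_file genome_name.txt genome2species.tsv > group_name.txt
```

Assigning ID 0 to genomes without a species mapping keeps the two files line-aligned while leaving those genomes out of the pan-sketch.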

Here, "L3K11_pan-sketch" represents the consolidated ‘pangenome’ sketches for all species.

Next, obtain the union sketch of all species-specific markers, named "L3K11_union_sp-sketch", using this command:

metakssd set -q -o <L3K11_union_sp-sketch> <L3K11_pan-sketch>

Finally, generate the MarkerDB, named "markerdb_L3K11", by overlapping "L3K11_union_sp-sketch" with "L3K11_pan-sketch" using this command:

metakssd set -i <L3K11_union_sp-sketch> -o <markerdb_L3K11> <L3K11_pan-sketch>

6. MetaKSSD benchmarking results (with GTDBr214 MarkerDB)

6.1 Speed and Memory

[Figure: speed and memory benchmark]

6.2 OPAL benchmarking results on CAMI datasets are available:

  1. Mouse gut
  2. Marine
  3. Strain_madness
  4. Rhizosphere
  5. New_released

Species level results summary:

[Figure: species-level results summary]

7. Related papers

  1. Yi, H., Lu, X. & Chang, Q. MetaKSSD: boosting the scalability of the reference taxonomic marker database and the performance of metagenomic profiling using sketch operations. Nat Comput Sci 5, 884–897 (2025). https://doi.org/10.1038/s43588-025-00855-0

  2. Yi, H., Lin, Y., Lin, C. & Jin, W. KSSD: Sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis. Genome Biol 22, (2021).
