Code for comparing SubseqHash2 with minimizer, syncmer, and strobemer on the seeding step of overlap detection. For this purpose, sequences that share at least one common seed are considered overlapping.
All the seeding programs take ordinary fasta files as input. A sample file with 3 reads, SRX533603-sample3.efa, is provided.
git clone https://github.com/Shao-Group/SubseqHash2.git
cd SubseqHash2/overlap_detection
make product
-
To generate random tables used by SubseqHash2 and SubseqHash2w: Run
../src/gentable_rc.out k d tables/table_namewherekanddare parameters for SubseqHash2. Two sample tables are provided in tables/. -
All seeding programs by default use 15 threads to generate seeds in parallel (one thread per read), the number of threads used can be changed by modifying
NUMTHREADSin src/seedfactory.h. -
To generate seeds with SubseqHash2: Run
./genSubseq2Seeds.out readFile n k d randTableFile. Example:./genSubseq2Seeds.out sample-reads/SRX533603-sample3.efa 60 48 11 tables/48_11which corresponds to the second left most point of theSH2r 60 11 kcurve in Fig.5 of our paper. This call creates a directorysample-reads/SRX533603-sample3.efa-locSeeds-n60-k48-d11/containing all the seeds. SubseqHash2 generatesksets of seeds for each read. -
To generate seeds with minimizer: Run
./genMinimizerSeeds.out readFile w k. Note thatwhere is the number ofk-mers in a window. Example:./genMinimizerSeeds.out sample-reads/SRX533603-sample3.efa 4 22which corresponds to the second left most point of theminimizer 25curve. This call creates a directorysample-reads/RX533603-sample3.efa-minimizerSeeds-w4-k22/containing all the seeds. Two seed files per read are stored, one for the forward direction, the other for the reverse complement. -
To generate seeds with syncmer: Run
./genSyncmerSeeds.out readFile k sto selectk-mers according to the location of their smallests-mers. Example:./genSyncmerSeeds.out sample-reads/SRX533603-sample3.efa 14 11which corresponds to the second right most point of thesyncmer 14curve. This call creates a directorysample-reads/SRX533603-sample3.efa-syncmerSeeds-k14-s11/containing all the seeds. Similar as minimizer, two seed files per read are stored. -
To generate seeds with strobemer: Run
./genStrobemerSeeds.out readFile k w w_min w_max seednumto generate seeds withwstrobes of lengthkeach, the second strobe is picked from the window[w_min, w_max]relative to the beginning of the seed; theseednumparameter controls the number of repeats applied, namely,2*seednumsets of seeds with different hash functions are generated for each read. Example:./genStrobemerSeeds.out sample-reads/SRX533603-sample3.efa 22 2 22 38 12which corresponds to the left most point of thestrobemer 60 kcurve. This call creates a directorysample-reads/SRX533603-sample3.efa-strobemerSeeds-k22-w2-w_min22-w_max38/. -
To generate seeds with SubseqHash2w: Run
./genSubseq2strobeSeeds.out readFile n k d randTableFile w pre_kto produce seeds withw+1parts -- a leadingpre_k-mer followed bywsubsequences of lengthkpicked from a window of sizen. Example:./genSubseq2strobeSeeds.out sample-reads/SRX533603-sample3,efa 44 32 11 tables/32_11 1 16which corresponds to the second left most point of theSH2w 60 11 kcurve. This call creates a directorysample-reads/SRX533603-sample3.efa-subseq2strobeSeeds-n44-k32-d11-w1-pre_k16/. SubseqHash2w generates2ksets of seeds for each read. -
To get overlapping pairs with the seeds generated using any of the above methods: Run
overlapBySeeds.out seedsDir k repeatId numReads fileExtwhereseedsDiris the directory containing the seed files;kis the length of each seed;repeatIdspecifies seeds from which repeat is to be used; only the topnumReadsreads are considered for overlapping; andfileExtis the filename extension of the seed files, all possible choices are listed in the following table.Seeding Method fileExt SubseqHash2 subseqseed2 minimizer mmseed syncmer syncmerseed strobemer strobemerseed SubseqHash2w ss2sseed For seeding methods without repeat (minimizer, syncmer), the
repeatIdshould be 0; for others, it starts from 0. Each call creates a fileoverlap-n{numReads}-r{repeatId}.all-pairwithin the seed directory. In this file, each line represents an overlapping pair in the formatread_id1 read_id2 number_of_shared_seeds. -
The
overlapFileOp.jsscript can be used to take union/intersection of overlap detection results. It requires the k8 library.- Example for obtaining the union of two result files (used for repeating SubseqHash2):
./overlapFileOp.js union sample-reads/SRX533603-sample3.efa-locSeeds-n60-k48-d11/overlap-n3-r{0,1}.all-pair > union2.all-pair. - Example for obtaining the intersection of two files (used for comparing reported overlapping pairs versus the ground truth):
./overlapFileOp.js intersection sample-reads/SRX533603-sample3.truepairs sample-reads/SRX533603-sample3.efa-minimizerSeeds-w4-k22/overlap-n3-r0.all-pairwhere the truepairs file is in the same format as result files except that the third column is the number of overlapping base pairs.
- Example for obtaining the union of two result files (used for repeating SubseqHash2):