This project stores the structural variant (SV) detection results generated by the HitSV tool on HG002/3/4/5/6/7 and the 1000 Genomes Project datasets.
GitHub repository of the tool: https://github.com/hitbc/HitSV
All variants were called against the GRCh38 reference genome. Algorithm alias: gcSV.
All datasets are at 30× coverage. All results have been locally phased (pseudo-phased). The output includes the original contig sequences as well as small variants surrounding each SV.
The data were realigned to the GRCh38 reference genome.
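As a quick way to inspect one of the result files after downloading it, the following minimal sketch tallies records by SV type. It relies only on the standard VCF column layout. It assumes that HitSV writes the SV type in the `SVTYPE` INFO key (the key reserved by the VCF specification), which this README does not state; records without that key, such as the small variants surrounding each SV, are counted as `UNKNOWN`. The function names are illustrative and not part of HitSV.

```python
import gzip
from collections import Counter


def parse_info(info_field):
    """Turn a VCF INFO string such as 'SVTYPE=DEL;SVLEN=-120' into a dict."""
    info = {}
    if info_field == ".":
        return info
    for item in info_field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Flag-style keys (no '=') are stored as True.
        info[key] = value if sep else True
    return info


def count_sv_types(vcf_path):
    """Count records per SVTYPE in a plain or gzip/bgzip-compressed VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8:
                continue  # not a complete VCF record
            info = parse_info(fields[7])
            # Assumption: SVTYPE is present for SV records.
            counts[info.get("SVTYPE", "UNKNOWN")] += 1
    return counts
```

For example, `count_sv_types("ccs.30X.vcf.gz")` would return a `Counter` keyed by SV type for a locally downloaded copy of that file.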
Original data source: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb/alignment/HG002.Sequel.15kb.pbmm2.hs37d5.whatshap.haplotag.RTG.10x.trio.bam
HitSV variant calling results: https://github.com/hitbc/HitSV_call_results/HG002-LRS/ccs.30X.vcf.gz
Original data source: s3://ont-open-data/giab_2025.01/basecalling/sup/HG002/PAW70337/calls.sorted.bam
HitSV variant calling results: https://github.com/hitbc/HitSV_call_results/HG002-LRS/ont.30X.vcf.gz
Original data sources (downloaded using aws s3 cp):
s3://ont-open-data/giab_2025.01/basecalling/sup/HG002/PAW70337/calls.sorted.bam
s3://ont-open-data/giab_2025.01/basecalling/sup/HG003/PAY87794/calls.sorted.bam
s3://ont-open-data/giab_2025.01/basecalling/sup/HG004/PAY87778/calls.sorted.bam
s3://ont-open-data/giab_2025.01/basecalling/sup/HG005/PAW87816/calls.sorted.bam
s3://ont-open-data/giab_2025.01/basecalling/sup/HG006/PAY77227/calls.sorted.bam
s3://ont-open-data/giab_2025.01/basecalling/sup/HG007/PAY12990/calls.sorted.bam
HitSV variant calling results: https://github.com/hitbc/HitSV_call_results/LRS-TRIO/sup_HG00*.vcf.gz
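The ONT BAM files listed above can be fetched with `aws s3 cp`. The sketch below only builds the six commands from the sample and run IDs given in this README. Two things are assumptions rather than statements from the source: the `--no-sign-request` option (a standard AWS CLI option for anonymous access, assumed here to be appropriate for this bucket) and the local output file names, which are hypothetical.

```python
import shlex

# Sample ID -> run ID, copied from the S3 paths listed in this README.
SAMPLE_RUNS = {
    "HG002": "PAW70337",
    "HG003": "PAY87794",
    "HG004": "PAY87778",
    "HG005": "PAW87816",
    "HG006": "PAY77227",
    "HG007": "PAY12990",
}
BASE = "s3://ont-open-data/giab_2025.01/basecalling/sup"


def build_download_commands(out_dir="."):
    """Return one `aws s3 cp` argument list per sample."""
    commands = []
    for sample, run in SAMPLE_RUNS.items():
        source = f"{BASE}/{sample}/{run}/calls.sorted.bam"
        # Hypothetical local name; prefixing the sample avoids overwriting,
        # since every remote file is called calls.sorted.bam.
        target = f"{out_dir}/{sample}.calls.sorted.bam"
        # Assumption: the bucket allows unsigned (anonymous) requests.
        commands.append(["aws", "s3", "cp", "--no-sign-request", source, target])
    return commands


if __name__ == "__main__":
    for command in build_download_commands():
        print(shlex.join(command))
```

Printing the commands instead of running them makes it easy to review them first; each list can be passed to `subprocess.run(command, check=True)` to start the download.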
Original data source: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/NHGRI_Illumina300X_AJtrio_novoalign_bams/HG002.hs37d5.60x.1.bam
HitSV variant calling results: https://github.com/hitbc/HitSV_call_results/HG002-SRS/ILL_60X.vcf.gz
Original data source: https://opendata.nist.gov/pdrsrv/mds2-2336/input_fastqs/HG002.novaseq.pcr-free.35x.R1.fastq.gz
HitSV variant calling results: https://github.com/hitbc/HitSV_call_results/HG002-SRS/ILL_35X.vcf.gz
Original data sources: See the sections above.
HitSV variant calling results:
ONT + Illumina hybrid: https://github.com/hitbc/HitSV_call_results/HG002-Hybrid/HYBRID_ILL.30X.ont.4X.vcf.gz
CCS + Illumina hybrid: https://github.com/hitbc/HitSV_call_results/HG002-Hybrid/HYBRID_ILL.30X.ccs.4X.vcf.gz
Original data source: https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/
Variant calling results (results from 10 randomly selected samples are provided): https://github.com/hitbc/HitSV_call_results/tree/main/1KGP-single%20sample/*.vcf.gz
Original data source: Same as above.
Variant calling results (stored separately by chromosome): https://github.com/hitbc/HitSV_call_results/blob/main/1KGP_3202_samples_gcSV_v1.0_grch38_SURVIVOR_merge/1KGP_3202_samples_gcSV_v1.0_grch38_SURVIVOR_merge_sort_chr*.vcf.gz
Description:
Lines starting with VNTR_REGION:
Each line describes a simple repeat region annotated by RepeatMasker.
The columns represent: chromosome ID (0-based), start and end positions of the region, annotation type, and repeat unit.
Lines not starting with VNTR_REGION:
Each line describes a structural variant (SV) located within a specific simple repeat region.
The columns represent: chromosome ID (0-based), start position of the variant, variant length, REF, ALT, and allele count (AC) in different superpopulations.
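The two line types described above can be read with a small parser. This is a minimal sketch under several assumptions that the description does not confirm: columns are whitespace-separated; `VNTR_REGION` is its own leading token followed by the five listed columns; each variant line belongs to the most recent `VNTR_REGION` line before it; and every column after ALT is one allele count per superpopulation. The number and order of superpopulations are not specified here, so the counts are kept as raw strings. The class and function names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RepeatVariant:
    chrom_id: int             # 0-based chromosome ID
    start: int                # start position of the variant
    length: int               # variant length
    ref: str
    alt: str
    allele_counts: List[str]  # one entry per superpopulation, order unspecified


@dataclass
class RepeatRegion:
    chrom_id: int             # 0-based chromosome ID
    start: int
    end: int
    annotation: str           # annotation type
    repeat_unit: str
    variants: List[RepeatVariant] = field(default_factory=list)


def parse_vntr_lines(lines):
    """Group variant lines under the VNTR_REGION line that precedes them."""
    regions = []
    current = None
    for raw in lines:
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "VNTR_REGION":
            chrom, start, end, annotation, unit = parts[1:6]
            current = RepeatRegion(int(chrom), int(start), int(end), annotation, unit)
            regions.append(current)
        else:
            if current is None:
                raise ValueError("variant line found before any VNTR_REGION line")
            chrom, start, length, ref, alt = parts[:5]
            current.variants.append(
                RepeatVariant(int(chrom), int(start), int(length), ref, alt, parts[5:])
            )
    return regions
```

Passing an open file handle, for example `parse_vntr_lines(open(path))`, returns one `RepeatRegion` per repeat region with its variants attached. If the real files use a different delimiter or layout, only the two unpacking lines need to change.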