HitSV: Maximizing discovery of structural variants across sequencing technologies

Project Description

This repository stores the structural variant (SV) calls produced by the HitSV tool on the GIAB samples HG002 through HG007 and on the 1000 Genomes Project datasets.

GitHub repository of the tool: https://github.com/hitbc/HitSV

All variant calling was performed against the GRCh38 reference genome. HitSV is also known by the algorithm alias gcSV, which appears in some result file names.

HitSV LRS SV Detection Information

All datasets are at 30× coverage. All results have been locally phased (pseudo-phased). The output includes the original contig sequences as well as small variants surrounding each SV.

HG002 CCS Dataset

The reads were realigned to the GRCh38 reference genome; the source BAM listed below is aligned to hs37d5.

Original data source: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb/alignment/HG002.Sequel.15kb.pbmm2.hs37d5.whatshap.haplotag.RTG.10x.trio.bam

HitSV variant calling results: https://github.com/hitbc/HitSV_call_results/HG002-LRS/ccs.30X.vcf.gz
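As a rough illustration of how a downloaded result file such as ccs.30X.vcf.gz might be inspected, the sketch below tallies records by the `SVTYPE` key in the INFO column. It assumes the file is a standard gzip- or bgzip-compressed VCF and that SV records carry `SVTYPE`; the exact INFO fields HitSV writes are not documented here, so records without that key (for example, the small variants reported around each SV) are counted as `other`. The function names are hypothetical.

```python
import gzip
from collections import Counter
from typing import Iterable


def count_sv_types(lines: Iterable[str]) -> Counter:
    """Tally VCF records by the SVTYPE INFO key (assumed to be present on SV records)."""
    counts: Counter = Counter()
    for line in lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        svtype = "other"  # records without SVTYPE, e.g. small variants
        for entry in fields[7].split(";"):
            if entry.startswith("SVTYPE="):
                svtype = entry[len("SVTYPE="):]
                break
        counts[svtype] += 1
    return counts


def count_sv_types_in_file(path: str) -> Counter:
    """Open a (b)gzip-compressed VCF in text mode and tally its records."""
    with gzip.open(path, "rt") as handle:
        return count_sv_types(handle)
```

For example, `count_sv_types_in_file("ccs.30X.vcf.gz")` would return a `Counter` keyed by SV type, provided the file has been downloaded to the working directory.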

HG002 ONT Dataset

Original data source: s3://ont-open-data/giab_2025.01/basecalling/sup/HG002/PAW70337/calls.sorted.bam

HitSV variant calling results: https://github.com/hitbc/HitSV_call_results/HG002-LRS/ont.30X.vcf.gz

Trio ONT Dataset: HG002/HG003/HG004 and HG005/HG006/HG007

Original data sources (downloaded using aws s3 cp):

s3://ont-open-data/giab_2025.01/basecalling/sup/HG002/PAW70337/calls.sorted.bam

s3://ont-open-data/giab_2025.01/basecalling/sup/HG003/PAY87794/calls.sorted.bam

s3://ont-open-data/giab_2025.01/basecalling/sup/HG004/PAY87778/calls.sorted.bam

s3://ont-open-data/giab_2025.01/basecalling/sup/HG005/PAW87816/calls.sorted.bam

s3://ont-open-data/giab_2025.01/basecalling/sup/HG006/PAY77227/calls.sorted.bam

s3://ont-open-data/giab_2025.01/basecalling/sup/HG007/PAY12990/calls.sorted.bam
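Because every source BAM shares the file name calls.sorted.bam, it can help to derive a distinct local name from each S3 URI before downloading. The sketch below splits a URI into its sample and run directory and builds an `aws s3 cp` argument list. The helper names and the local naming scheme are hypothetical, and the `--no-sign-request` flag is an assumption that the bucket permits anonymous access; drop it if you use configured credentials.

```python
from typing import List, NamedTuple


class OntSource(NamedTuple):
    sample: str   # e.g. "HG003"
    run_id: str   # directory directly above the BAM, e.g. "PAY87794"
    uri: str


def parse_ont_uri(uri: str) -> OntSource:
    """Split an s3://.../<sample>/<run>/calls.sorted.bam URI into its parts."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    parts = uri[len("s3://"):].split("/")
    if len(parts) < 4:
        raise ValueError(f"unexpected URI layout: {uri}")
    return OntSource(sample=parts[-3], run_id=parts[-2], uri=uri)


def download_command(uri: str) -> List[str]:
    """Build an `aws s3 cp` argument list with a per-sample local file name."""
    src = parse_ont_uri(uri)
    local_name = f"{src.sample}.{src.run_id}.calls.sorted.bam"  # hypothetical naming scheme
    # --no-sign-request: assumption that anonymous access is allowed.
    return ["aws", "s3", "cp", "--no-sign-request", src.uri, local_name]
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where the AWS CLI is installed.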

HitSV variant calling results: https://github.com/hitbc/HitSV_call_results/LRS-TRIO/sup_HG00*.vcf.gz

HitSV SRS SV Detection Information

HG002 SRS 60× Dataset

Original data source: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/NHGRI_Illumina300X_AJtrio_novoalign_bams/HG002.hs37d5.60x.1.bam

HitSV variant calling results: https://github.com/hitbc/HitSV_call_results/HG002-SRS/ILL_60X.vcf.gz

HG002 SRS 35× Dataset

Original data source: https://opendata.nist.gov/pdrsrv/mds2-2336/input_fastqs/HG002.novaseq.pcr-free.35x.R1.fastq.gz

HitSV variant calling results: https://github.com/hitbc/HitSV_call_results/HG002-SRS/ILL_35X.vcf.gz

HitSV Hybrid SV Detection Information

Original data sources: See the sections above.

HitSV variant calling results:

ONT + Illumina hybrid: https://github.com/hitbc/HitSV_call_results/HG002-Hybrid/HYBRID_ILL.30X.ont.4X.vcf.gz

CCS + Illumina hybrid: https://github.com/hitbc/HitSV_call_results/HG002-Hybrid/HYBRID_ILL.30X.ccs.4X.vcf.gz

EASY and HARD Regions for Benchmarking

The data source for the HARD regions is:

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.3/GRCh38@all/Union/GRCh38_alldifficultregions.bed.gz

The data source for the EASY regions is:

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.3/GRCh38@all/Union/GRCh38_notinalldifficultregions.bed.gz
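One way to use these BED files is to test whether an SV interval overlaps any HARD region. The sketch below is a minimal illustration, not the procedure used to produce the results in this repository: it assumes tab-separated BED lines with 0-based, half-open coordinates, and it treats any overlap as a hit, which is an assumption about how SVs would be assigned to a stratum. Function names are hypothetical.

```python
import bisect
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Index = Dict[str, Tuple[List[int], List[int]]]


def build_region_index(bed_lines: Iterable[str]) -> Index:
    """Sort and merge BED intervals per chromosome into parallel start/end lists."""
    by_chrom = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        by_chrom[chrom].append((int(start), int(end)))

    index: Index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged: List[Tuple[int, int]] = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps_region(index: Index, chrom: str, start: int, end: int) -> bool:
    """Return True if the half-open interval [start, end) overlaps any indexed region."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # First merged region whose end lies strictly after the query start.
    i = bisect.bisect_right(ends, start)
    return i < len(starts) and starts[i] < end
```

The lines can come from `gzip.open("GRCh38_alldifficultregions.bed.gz", "rt")`; VCF positions are 1-based, so convert them to 0-based coordinates before querying.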

1000 Genomes Project (3,202 Samples) HitSV Calls

Single-Sample Variant Calling

Original data source: https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/

Variant calling results (results from 10 randomly selected samples are provided): https://github.com/hitbc/HitSV_call_results/tree/main/1KGP-single%20sample/*.vcf.gz

Population-Level Variant Calling

Original data source: Same as above.

Variant calling results (stored separately by chromosome): https://github.com/hitbc/HitSV_call_results/blob/main/1KGP_3202_samples_gcSV_v1.0_grch38_SURVIVOR_merge/1KGP_3202_samples_gcSV_v1.0_grch38_SURVIVOR_merge_sort_chr*.vcf.gz

1000 Genomes Project VNTR Analysis: SVs in Simple Repeat Regions

Description:

Lines starting with VNTR_REGION: Each line describes a simple repeat region based on RepeatMasker annotation. The columns are: chromosome ID (0-based), start and end positions of the region, annotation type, and repeat unit.

Lines not starting with VNTR_REGION: Each line describes a structural variant (SV) located within a specific simple repeat region. The columns represent: chromosome ID (0-based), start position of the variant, variant length, REF, ALT, and allele count (AC) in different superpopulations.

Results: https://github.com/hitbc/HitSV_call_results/blob/main/1KGP_3202_samples_gcSV_v1.0_grch38_VNTR_ANALYSIS.txt.gz
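The two line types described above can be read with a small parser. The sketch below rests on several assumptions that the description does not confirm: fields are whitespace-separated, a region line begins with a `VNTR_REGION` token followed by the five region columns, each SV line belongs to the most recent region line before it, and every column after ALT is a per-superpopulation allele count (kept as raw strings because their exact format is not specified). Class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class RepeatSV:
    chrom_id: int             # 0-based chromosome ID
    start: int
    length: int
    ref: str
    alt: str
    allele_counts: List[str]  # raw AC columns, one per superpopulation (assumed)


@dataclass
class RepeatRegion:
    chrom_id: int             # 0-based chromosome ID
    start: int
    end: int
    annotation: str
    repeat_unit: str
    svs: List[RepeatSV] = field(default_factory=list)


def parse_vntr_lines(lines: Iterable[str]) -> List[RepeatRegion]:
    """Group SV lines under the preceding VNTR_REGION line (assumed layout)."""
    regions: List[RepeatRegion] = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0].startswith("VNTR_REGION"):
            chrom, start, end, annotation, unit = tokens[1:6]
            regions.append(
                RepeatRegion(int(chrom), int(start), int(end), annotation, unit)
            )
        elif regions:
            chrom, start, length, ref, alt = tokens[:5]
            regions[-1].svs.append(
                RepeatSV(int(chrom), int(start), int(length), ref, alt, tokens[5:])
            )
    return regions
```

If the real file uses a different delimiter or column order, only the two unpacking lines need to change.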
