This project contains code for simultaneously detecting family inheritance patterns, identity-by-descent, crossovers, and inherited deletions in nuclear families, as published in
Paskov K, Chrisman B, Stockham N, Washington PY, Dunlap K, Jung JY, Wall DP. Identifying crossovers and shared genetic material in whole genome sequencing data from families. Genome Research. 2023 Oct 1;33(10):1747-56.
This code starts with VCF files (with accompanying .tbi files), split by chromosome. It produces
- family inheritance patterns across the genome (.BED)
- crossovers (.JSON, .BED)
- inherited deletions (.JSON, .BED)
The code adds to a directory structure created by the https://github.com/kpaskov/VCFtoNPZ project.
[data_dir]
- genotypes
- family_genotype_counts
- sequencing_error_rates
- phase
- - info.json
- - crossovers.json
- - gene_conversions.json
- - inherited_deletions.json
- - sibpairs.json
- - inheritance_patterns
- - - [family_1].bed
- - - [family_2].bed
...
The info.json file contains metadata including the reference assembly (GRch37 or GRch38), vcf directory, ped file, sequencing error parameter files, and any flags provided to the phasing algorithm.
The crossovers.json and crossovers.bed files contains all called crossovers in the cohort in .json and .bed format respectively.
The gene_conversions.json and gene_conversions.bed files contains possible gene-conversion events in the cohort in .json and .bed format respectively. Gene conversions are more difficult to detect than crossovers due to their small size. We have not validated the gene conversions called by our algorithm, so they should be examined carefully before use.
sibpairs.json contains all sibling-pairs in the dataset as well as genome-wide autosomal IBD values and chromosome-level IBD values.
The inherited_deletions.json and inherited_deletions.bed files contains all called inherited deletions in the cohort in .json and .bed format respectively.
The [family].bed files contain inheritance patterns for each family, across all chromosomes in .bed format.
using https://github.com/kpaskov/VCFtoNPZ.
using https://github.com/kpaskov/FamilySeqError.
If using whole-genome sequencing data, follow the instructions for estimating sequencing error rates in both low-complexity and high-complexity regions. A good source of low-complexity regions is the supplementary materials file from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4271055/.
3. Run our hidden Markov model family phasing algorithm.
The algorithm detects crossovers and inherited deletions. Run using
python phase/phase_chromosome.py [ped_file] [data_dir] [high_complexity_sequencing_error_profile] [low_complexity_sequencing_error_profile]
The script has options
--detect_inherited_deletionsmodels inherited deletions while phasing.--detect_updmodels uniparental disomy while phasing. We detect only heterodisomy because isodisomy is essentially indistinguishable from a denovo deletion. (This option is still under development.)--chrom [chrom]phases a single chromosome. If this option is not used, all autosomal chromosomes are phased.--family_size [family_size]only families of size [family_size] are phased.--family [family]only [family] is phased.--batch_sizeand--batch_numoptions can be used to parallelize when running on a large cohort.--phase_name [name]phase data will be written to directory[data_dir]/phase_[name]/inheritance_patterns. If this option is not used, phase data will be written to directory[data_dir]/phase/inheritance_patterns.
The example below phases all autosomal chromosomes for all families of size 4 in the ssc.hg38 dataset.
python phase/phase_chromosome.py ../DATA/ssc.hg38/ssc.ped HCR LCR --detect_inherited_deletions --family_size 4
You can use as many sequencing error profiles as you'd like. This comes in handy when phasing the X chromosome, since the PAR and non-PAR regions have different error profiles. This is an example of phasing the X chromosome using four different error profiles.
python phase/phase_chromosome.py ../DATA/ssc.hg38/ssc.ped ../DATA/ssc.hg38 X_PAR_HCR X_PAR_LCR X_nonPAR_HCR X_nonPAR_LCR --chrom X --batch_size 3 --batch_num $SLURM_ARRAY_TASK_ID --detect_inherited_deletions --family_size 4 --phase_name X
If no --chrom option is given, phase_chromosome.py will phase all autosomal chromosomes. The X chromosome must be phased explicitly using the --chrom option.
This script pulls genome-wide autosomal IBD for sibling pairs as well as chromosome-level IBD. It marks siblings who appear to be identical twins or based on their IBD sharing, and identifies siblings with unusual IBD as outliers.
python phase/pull_sibpair_ibd.py [data_dir]
The --phase_name [name] flag indicates that phase data from the directory [data_dir]/phase_[name]/inheritance_patterns will be analyzed.
This script produces [data_dir]/phase/sibpairs.json (or [data_dir]/phase_[name]/sibpairs.json if the --phase_name flag is used) which contains an entry for every sibling pair in the dataset with the following fields:
familythe family ID for the sibling pairsibling1the ID for the first sibling in the pair (the first sibling's ID is alphabetically first)sibling2the ID for the second sibling in the pair (the second sibling's ID is alphabetically last)maternal_ibdthe genome-wide autosomal maternal IBD between the siblingsmaternal_unknown_fractionthe fraction of the autosomal genome where maternal IBD could not be determinedmaternal_ibd_chromsa list of chromosome-level maternal IBD between the siblings in order from chr1-chr22, ending with chrXmaternal_unknown_fraction_chromsa list of the fraction of each chromosome where maternal IBD could not be determined in order from chr1-chr22, ending with chrXpaternal_ibdthe genome-wide autosomal paternal IBD between the siblingspaternal_unknown_fractionthe fraction of the autosomal genome where paternal IBD could not be determinedpaternal_ibd_chromsa list of chromosome-level paternal IBD between the siblings in order from chr1-chr22, ending with chrXpaternal_unknown_fraction_chromsa list of the fraction of each chromosome where paternal IBD could not be determined in order from chr1-chr22, ending with chrXmatxpat_ibdthe genome-wide autosomal IBD2 (simultaneous maternal and paternal IBD) between the siblingsmatxpat_unknown_fractionthe fraction of the autosomal genome where IBD2 could not be determinedmatxpat_ibd_chromsa list of chromosome-level IBD2 between the siblings in order from chr1-chr22, ending with chrXmatxpat_unknown_fraction_chromsa list of the fraction of each chromosome where IBD2 could not be determined in order from chr1-chr22, ending with chrXis_identicalsiblings are marked as identical if both maternal and paternal genome-wide autosomal IBD is greater than 0.8.is_ibd_outliersiblings are marked as ibd outliers using gaussian kernel density outlier detection
The qc/Visualize-Sibpair-IBD.ipynb jupyter notebook can be used to visualize IBD in the dataset.
This script pulls crossovers for the cohort. It identifies individuals with too many or too few crossovers as outliers.
python phase/pull_crossovers.py [data_dir]
The script has options
--phase_name [name]flag indicates that phase data from the directory[data_dir]/phase_[name]/inheritance_patternswill be analyzed.--hts_loss_regions [loss_region_index1] [loss_region_index2]should be used if multiple sequencing error profiles are used to represent hard-to-sequence regions. For example, the X_chromosome phasing command shown in the example above has two sequencing error profiles corresponding to hard-to-sequence regions (index 1 and 3).
The script produces [data_dir]/phase/crossovers.json and [data_dir]/phase/gene_conversions.json which contain crossover and gene_conversion events respectively. Each event has fields:
familythe family ID where the event was detectedchildthe child id the event was detected withinchromthe chrom where the event ocurredstart_posandend_posthe window where the event occuredis_matandis_patindicating whether the event occured during maternal or paternal meiosisis_complextrue if the event is a complex event (composed of multiple closely spaced events)is_htstrue if the event occurs in a region marked error-prone for this family
The qc/Visualize-Crossovers.ipynb jupyter notebook can be used to visualize crossovers in the dataset.
This script pulls inherited deletions for the cohort.
python phase/pull_deletions.py [data_dir]
The script has options
--phase_name [name]flag indicates that phase data from the directory[data_dir]/phase_[name]/inheritance_patternswill be analyzed.
The script produces [data_dir]/phase/deletions.json. Each deletion has fields:
familythe family ID where the deletion was detectedchromthe chrom where the deletion ocurredstart_posandend_posthe window where the deletion occuredis_matandis_patindicating whether the deletion occured during maternal or paternal meiosisis_htstrue if the deletion occurs in a region marked error-prone for this familyparentthe parent with the deletiontransa list of children the deletion was transmitted tonotransa list of children the deletion was not transmitted to
The qc/Visualize-Deletions.ipynb jupyter notebook can be used to visualize deletions in the dataset.