ANOMALY

Analysis of Nuclear inserts Of MitochondriA using Long-read sequencing in Your data (ANOMALY) is the snakemake pipeline for calling NuMTs (Nuclear Embedded Mitochondrial Sequences) from Long-Read Sequencing data.

Feature

Snakemake pipeline for calling NUMTs from long-read FASTQ files.
Support for Multiple Species.
Support for both ONT and PacBio Sequencing Data.
Visualisation of Called NUMTs.

Installation

To run the Snakemake pipeline locally, the following tools and databases are required:

Snakemake >= 8.26.0
Minimap2 >= 2.28
SAMtools >= 1.21
Sniffles >= 2.5.2
Seqkit >= 2.9.0
BLAST >= 2.12.0
R >= 4.4.2
MT Genome Database for BLAST

The user can install all the required tools and database using the following command:

bash setup.sh && source ~/.bashrc

NOTE:

1. The pipeline is tested in Ubuntu 22.04, CentOS 7 and Rocky Linux 9.5.

2. Only the BAM file generated by Minimap2 with the '-Y' parameter will work with the pipeline. Please use the FASTQ file in case the BAM file is not generated by Minimap2.

3. -Y = use soft clipping for supplementary alignments.

4. Supports species with at most 24 nuclear chromosomes.

Test Example

The pipeline can be tested using the following script:

bash run_example.sh

This script will first download the raw fastq file of simulated data from human chromosome 22 with known NuMT insertions.
It will then download the nuclear genome fasta file (CHM13), followed by Read mapping (Minimap2) and SV calling (Sniffles2).
The insertion calls will then be mapped to the concatenated mitochondrial genome to get the putative NUMTs.
The supplementary nuclear genome reads aligned to the mitochondrial genome will be utilised to call longer NUMTs missed by SV calling.
The final text file will contain insertion-based and supplementary alignment-based NUMTs with overlaps removed, if any.

Steps to follow

Create the Snakemake Config file:

bash get_config.sh [-d b] [-r /path/to/reference/fasta] [-m 16] [-s 16] [-p ONT] [-i /path/to/input/directory] [-o /path/to/output/directory] [-l /path/to/headers/list] [-q Minimum Mapping Quality] [-n Minimum Read Support for SV] [-g Genotype Ploidy] [-e Blast E-value Cutoff] [-c NuMT Coverage Cutoff] [-a Minimum supplementary alignments support]
-d: Specify the Input type (BAM 'b' or FASTQ 'f'). [Default: b]
-r: Absolute/Relative path of Reference Nuclear Genome FASTA file.
-m: Number of threads to be used for Minimap2. [Default: 16]
-s: Number of threads to be used for Sniffles. [Default: 16]
-p: Sequencing Platform (ONT/pb/HiFi). [Default: ONT]
-i: Absolute/Relative path of the directory that contains the FASTQ or BAM Files [Do not pass the FASTQ name].
-o: Absolute/Relative path of the directory for Pipeline Output and Intermediate Files.
-l: Absolute/Relative path of the file containing reference chromosome headers. [Default: ref_headers.txt]
-q: Minimum Map Quality for calling SVs. [default: 1]
-n: Minimum Support for calling SVs. [default: 4]
-g: Genotype Ploidy. [default: 2]
-e: Blast E-value Cutoff. [default: 1e-3]
-c: NUMT Coverage Cutoff. [default: 70]
-a: Minimum Support for calling NUMTs from Supplementary Alignments. [default: 5]

# Absolute/Relative Paths of Reference Genome, Input Data, Pipeline Output Directory, and a list with reference chromosome headers are Mandatory.
# Please note that only the BAM file generated by Minimap2 with the '-Y' parameter will work with the pipeline. Please use the FASTQ file in case the BAM file is not generated by Minimap2.
# -Y = use soft clipping for supplementary alignments.

Run the pipeline using SnakeMake:

conda activate snakemake
python main.py -c /path/to/working/directory/snake_config.yml
-c: Snakemake configuration YML file.

bash run_snakemake.sh -w /path/to/working/directory -t 96
-w: Absolute/Relative Path of the Working Directory.
-t: Maximum Number of Threads Allowed for the Snakemake Pipeline. [Default: 48]

Output Table

CHROMOSOME	POSITION	MT GENOME HEADER	MT START	MT END	LENGTH
22	10669630	MT	318	391	74
22	16645321	MT	354	16414	510
22	27052372	MT	8672	8797	126
22	32722334	MT	3951	8331	4381
22	43340044	MT	10795	10908	114
22	50408296	MT	13056	14429	1374

This is the expected output after running ANOMALY on the test dataset. The test dataset was simulated using chromosome 22 of the CHM13 reference genome.

Output NUMT Circos Plot

The Circos plot illustrates the integration of Nuclear Mitochondrial DNA Sequences (NuMTs) across various chromosomes in the nuclear genome. The outermost circular segments represent individual chromosomes (1–22 and X) and concatenated mitochondrial genome segments (Concat_1 and Concat_2). The curved links in the inner region indicate the insertion events of NUMTs from the mitochondrial genome into nuclear chromosomes.

To visualise the NUMTs containing the control region of the mitochondrial genome, we have utilised the concatenated genome of Mitochondria (shown as Concat_1 and Concat_2). Hence, all the NUMTs containing control regions will be split into two links, one coming from the end of the first mitochondrial segment (Concat_1) and the second coming from the start of the second mitochondrial segment (Concat_2).

Preparing the User-Specific Mitochondrial Reference Genome

The users can prepare their species of interest's Mitochondrial genome for the tool by using the following steps:

bash path/to/tool/Scripts/prepare_genome_fasta.sh /path/to/mitochondrial/genome/fasta && source ~/.bashrc

Things to remember:

The above script will save the mitochondrial genome fasta and blast database indices to the DATA directory in the tool's directory. If you move the reference genome and the indices, please make changes in the ~/.bashrc.
The script will also change the header of the input fasta file to "MT" to execute downstream analysis.
By default, the last reference genome prepared using this script will be used in the blast command. If you want to use some other reference genome, prepare the genome using the above step or make changes in the ~/.bashrc.
The $MT_DATA variable in the ~/.bashrc represents the path of the modified reference genome and blast indices. Please modify this variable accordingly if any changes are required.

Pipeline Overview:

ANOMALY Workflow: The workflow is implemented as a Snakemake pipeline designed to detect NuMTs. It accepts raw sequencing data in FASTQ format or pre-aligned data in BAM format as input. The pipeline produces a TSV file containing NuMT calls and visual representations as a Circos plot, saved in PNG and SVG formats. In the schematic representation, the open-source tools used in the pipeline are highlighted in purple. The steps are described along the arrows connecting the workflow components. Outputs of each step are indicated within circles.

Feedback/Help

If you have any feedback/issues, please report the issue via GitHub.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ANOMALY

Feature

Installation

NOTE:

Test Example

Steps to follow

Output Table

Output NUMT Circos Plot

Preparing the User-Specific Mitochondrial Reference Genome

The users can prepare their species of interest's Mitochondrial genome for the tool by using the following steps:

Things to remember:

Pipeline Overview:

Feedback/Help

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors 1

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 120 Commits
DATA		DATA
Envs		Envs
Manuscript Data		Manuscript Data
Scripts		Scripts
ANOMALY-Pipeline-Downsample.png		ANOMALY-Pipeline-Downsample.png
LICENSE		LICENSE
NUMT_Circos_Example.png		NUMT_Circos_Example.png
README.md		README.md
get_config.sh		get_config.sh
main.py		main.py
ref_headers.txt		ref_headers.txt
run_example.sh		run_example.sh
run_snakemake.sh		run_snakemake.sh
setup.sh		setup.sh
snake_config.yml		snake_config.yml

Folders and files

Latest commit

History

Repository files navigation

ANOMALY

Feature

Installation

NOTE:

Test Example

Steps to follow

Output Table

Output NUMT Circos Plot

Preparing the User-Specific Mitochondrial Reference Genome

The users can prepare their species of interest's Mitochondrial genome for the tool by using the following steps:

Things to remember:

Pipeline Overview:

Feedback/Help

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors 1

Languages

Packages