Analysis of Nuclear inserts Of MitochondriA using Long-read sequencing in Your data (ANOMALY) is the snakemake pipeline for calling NuMTs (Nuclear Embedded Mitochondrial Sequences) from Long-Read Sequencing data.
- Snakemake pipeline for calling NUMTs from long-read FASTQ files.
- Support for Multiple Species.
- Support for both ONT and PacBio Sequencing Data.
- Visualisation of Called NUMTs.
To run the Snakemake pipeline locally, the following tools and databases are required:
- Snakemake >= 8.26.0
- Minimap2 >= 2.28
- SAMtools >= 1.21
- Sniffles >= 2.5.2
- Seqkit >= 2.9.0
- BLAST >= 2.12.0
- R >= 4.4.2
- MT Genome Database for BLAST
The user can install all the required tools and database using the following command:
bash setup.sh && source ~/.bashrc1. The pipeline is tested in Ubuntu 22.04, CentOS 7 and Rocky Linux 9.5.
2. Only the BAM file generated by Minimap2 with the '-Y' parameter will work with the pipeline. Please use the FASTQ file in case the BAM file is not generated by Minimap2.
3. -Y = use soft clipping for supplementary alignments.
4. Supports species with at most 24 nuclear chromosomes.
The pipeline can be tested using the following script:
bash run_example.sh- This script will first download the raw fastq file of simulated data from human chromosome 22 with known NuMT insertions.
- It will then download the nuclear genome fasta file (CHM13), followed by Read mapping (Minimap2) and SV calling (Sniffles2).
- The insertion calls will then be mapped to the concatenated mitochondrial genome to get the putative NUMTs.
- The supplementary nuclear genome reads aligned to the mitochondrial genome will be utilised to call longer NUMTs missed by SV calling.
- The final text file will contain insertion-based and supplementary alignment-based NUMTs with overlaps removed, if any.
- Create the Snakemake Config file:
bash get_config.sh [-d b] [-r /path/to/reference/fasta] [-m 16] [-s 16] [-p ONT] [-i /path/to/input/directory] [-o /path/to/output/directory] [-l /path/to/headers/list] [-q Minimum Mapping Quality] [-n Minimum Read Support for SV] [-g Genotype Ploidy] [-e Blast E-value Cutoff] [-c NuMT Coverage Cutoff] [-a Minimum supplementary alignments support]
-d: Specify the Input type (BAM 'b' or FASTQ 'f'). [Default: b]
-r: Absolute/Relative path of Reference Nuclear Genome FASTA file.
-m: Number of threads to be used for Minimap2. [Default: 16]
-s: Number of threads to be used for Sniffles. [Default: 16]
-p: Sequencing Platform (ONT/pb/HiFi). [Default: ONT]
-i: Absolute/Relative path of the directory that contains the FASTQ or BAM Files [Do not pass the FASTQ name].
-o: Absolute/Relative path of the directory for Pipeline Output and Intermediate Files.
-l: Absolute/Relative path of the file containing reference chromosome headers. [Default: ref_headers.txt]
-q: Minimum Map Quality for calling SVs. [default: 1]
-n: Minimum Support for calling SVs. [default: 4]
-g: Genotype Ploidy. [default: 2]
-e: Blast E-value Cutoff. [default: 1e-3]
-c: NUMT Coverage Cutoff. [default: 70]
-a: Minimum Support for calling NUMTs from Supplementary Alignments. [default: 5]
# Absolute/Relative Paths of Reference Genome, Input Data, Pipeline Output Directory, and a list with reference chromosome headers are Mandatory.
# Please note that only the BAM file generated by Minimap2 with the '-Y' parameter will work with the pipeline. Please use the FASTQ file in case the BAM file is not generated by Minimap2.
# -Y = use soft clipping for supplementary alignments.- Run the pipeline using SnakeMake:
conda activate snakemake
python main.py -c /path/to/working/directory/snake_config.yml
-c: Snakemake configuration YML file.bash run_snakemake.sh -w /path/to/working/directory -t 96
-w: Absolute/Relative Path of the Working Directory.
-t: Maximum Number of Threads Allowed for the Snakemake Pipeline. [Default: 48]| CHROMOSOME | POSITION | MT GENOME HEADER | MT START | MT END | LENGTH |
|---|---|---|---|---|---|
| 22 | 10669630 | MT | 318 | 391 | 74 |
| 22 | 16645321 | MT | 354 | 16414 | 510 |
| 22 | 27052372 | MT | 8672 | 8797 | 126 |
| 22 | 32722334 | MT | 3951 | 8331 | 4381 |
| 22 | 43340044 | MT | 10795 | 10908 | 114 |
| 22 | 50408296 | MT | 13056 | 14429 | 1374 |
This is the expected output after running ANOMALY on the test dataset. The test dataset was simulated using chromosome 22 of the CHM13 reference genome.
The Circos plot illustrates the integration of Nuclear Mitochondrial DNA Sequences (NuMTs) across various chromosomes in the nuclear genome. The outermost circular segments represent individual chromosomes (1–22 and X) and concatenated mitochondrial genome segments (Concat_1 and Concat_2). The curved links in the inner region indicate the insertion events of NUMTs from the mitochondrial genome into nuclear chromosomes.
To visualise the NUMTs containing the control region of the mitochondrial genome, we have utilised the concatenated genome of Mitochondria (shown as Concat_1 and Concat_2). Hence, all the NUMTs containing control regions will be split into two links, one coming from the end of the first mitochondrial segment (Concat_1) and the second coming from the start of the second mitochondrial segment (Concat_2).
The users can prepare their species of interest's Mitochondrial genome for the tool by using the following steps:
bash path/to/tool/Scripts/prepare_genome_fasta.sh /path/to/mitochondrial/genome/fasta && source ~/.bashrc- The above script will save the mitochondrial genome fasta and blast database indices to the DATA directory in the tool's directory. If you move the reference genome and the indices, please make changes in the ~/.bashrc.
- The script will also change the header of the input fasta file to "MT" to execute downstream analysis.
- By default, the last reference genome prepared using this script will be used in the blast command. If you want to use some other reference genome, prepare the genome using the above step or make changes in the ~/.bashrc.
- The $MT_DATA variable in the ~/.bashrc represents the path of the modified reference genome and blast indices. Please modify this variable accordingly if any changes are required.
ANOMALY Workflow: The workflow is implemented as a Snakemake pipeline designed to detect NuMTs. It accepts raw sequencing data in FASTQ format or pre-aligned data in BAM format as input. The pipeline produces a TSV file containing NuMT calls and visual representations as a Circos plot, saved in PNG and SVG formats. In the schematic representation, the open-source tools used in the pipeline are highlighted in purple. The steps are described along the arrows connecting the workflow components. Outputs of each step are indicated within circles.
If you have any feedback/issues, please report the issue via GitHub.

