INVPG_annot

A tool to annotate inversions from pangenome graph bubbles.

Installation

Warning

Prerequisites:

minimap2
python >= 3.10

git clone https://github.com/SandraLouise/INVPG_annot.git
cd INVPG_annot
pip install -r requirements.txt --upgrade
python -m pip install . --quiet

Usage

Warning

Prerequisites:

GFA - A pangenome graph in GFA format.
VCF - The bubbles extracted from the .gfa, in VCF format. Note that INVPG-annot has only been developed for and tested on VCFs produced by vg deconstruct or the minigraph-call pipeline. Please see Commands to learn more about how to prepare your files for invpg-annot.

You can use a single command to execute the whole pipeline or do a step-by-step analysis (see Steps).

usage: invpg [-h] [-v INPUT_VCF_FILE] [-g INPUT_GFA_FILE] [-o OUTPUT_PREFIX] [-d DIV_PERCENTAGE] [-m MINCOV] [-k] [-t THREADS] [-O OUTPUT_VCF_FILE]

A tool to annotate inversions from pangenome graph bubbles.
  -h, --help            show this help message and exit
  -v, --input_vcf_file INPUT_VCF_FILE
                        Path to a VCF file.
  -g, --input_gfa_file INPUT_GFA_FILE
                        Path to a GFA-like file. Should be provided solely when not using minigraph graphs.
  -o, --output_prefix OUTPUT_PREFIX
                        Name/path of output VCF file. If parent folder of output VCF file doesn't already exist, it will be created.
  -d, --div_percentage DIV_PERCENTAGE
                        This parameter controls the leniency of the algorithm towards allele size difference (in nt) in the first step of variant/bubble
                        filtering. Only the non-reference alleles that have a size difference <= (d * max allele size / 100) will go through the annotation
                        step. (default: 10)
  -m, --mincov MINCOV   Minimum coverage of inversion signal as fraction of bubble length. (default: 0.5)
  -k, --keep_files      Keep temporary files after pipeline completion (mostly for debugging purposes).
  -t, --threads THREADS
                        Number of threads used for parallelization (minimap2).

Test with a small dataset

To check that INVPG-annot behaves as expected on your device, you can run:

cd test-dir/
invpg -v test_bubbles.vcf -g test_graph.gfa -o test_annotation.vcf -m 0.5 -d 10
diff expected_annotation.vcf test_annotation.vcf

To explore the intermediate output files (described under Intermediate files) on a small dataset, run with option -k:

mkdir outputfiles
cd outputfiles
invpg -v ../test-dir/test_bubbles.vcf -g ../test-dir/test_graph.gfa -o test_annotation.vcf -k -m 0.5 -d 10
cd res_*

Impact of the `-d` and `-m` parameters

The -d parameter controls how much allele size difference (in nucleotides) the algorithm tolerates in the first step, variant/bubble filtering. Only the bubbles that pass this filter are processed in the annotation step. The filter exists mainly to speed up annotation: inversion bubbles are expected to represent only a small portion of the bubbles in a pangenome graph, so bubbles whose allele sizes make an inversion unlikely (e.g. SNPs, insertions, deletions) are ignored.

The sizes of two inversion alleles may not be strictly identical in a pangenome graph, because of biological factors (inner genomic variation) and/or artificial factors (alignment artefacts generated during pangenome graph inference). The algorithm therefore accepts an allele size difference of up to len_Am * d / 100 in a potential inversion bubble, len_Am being the size of the largest allele of the bubble. With -d 0, only the variants/bubbles that have at least two alleles of identical size go through the annotation step. The higher the -d value, the less likely inversion bubbles are to be wrongly discarded because of a large size difference between alleles, but the longer the annotation step takes. Based on several tests, we advise using -d 10 (allowing an allele size difference of 10% of the largest allele), even with low divergence between genomes.
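The size filter described above can be sketched in a few lines of Python. This is only an illustration, not the code used by invpg: the function name `passes_size_filter` is hypothetical, and the sketch assumes that each non-reference allele is compared with the reference allele.

```python
def passes_size_filter(ref_len, alt_lens, d=10.0):
    """Return, for each non-reference allele, whether it passes the size filter.

    Assumption: the size difference is measured against the reference allele.
    """
    # len_Am in the text: size of the largest allele of the bubble
    max_len = max([ref_len, *alt_lens])
    # Largest allele size difference tolerated for this bubble
    tolerance = d * max_len / 100
    return [abs(alt_len - ref_len) <= tolerance for alt_len in alt_lens]


# Bubble with a 1000 nt reference allele and three alternate alleles
print(passes_size_filter(1000, [950, 1, 1200]))  # [True, False, False]
```

Here the largest allele is 1200 nt, so with -d 10 the tolerance is 120 nt: only the 950 nt allele would go through the annotation step.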

The -m parameter sets the minimum coverage of inversion signal (as a fraction of the bubble length in nucleotides) that must be found on a bubble for it to be reported in the output. We advise setting it to -m 0.5. For more details on how this coverage is calculated, please see the Method section of our paper.
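To give an intuition of what a coverage threshold means, here is a minimal sketch that computes the fraction of a bubble covered by a set of intervals and compares it with -m. It is an assumption about how such a coverage could be computed (union of half-open intervals), not the method implemented in invpg, which is described in the paper. The function name `inversion_coverage` and the interval values are hypothetical.

```python
def inversion_coverage(intervals, bubble_len):
    """Fraction of the bubble covered by the union of half-open [start, end) intervals.

    Assumption: overlapping intervals are counted once.
    """
    covered = 0
    current_end = 0  # right-most position already counted
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip the part already counted
        end = min(end, bubble_len)       # clip to the bubble
        if end > start:
            covered += end - start
            current_end = end
    return covered / bubble_len


# Two overlapping signal intervals on a 1000 nt bubble (made-up values)
coverage = inversion_coverage([(0, 300), (200, 600)], 1000)
mincov = 0.5  # value given to -m
print(coverage >= mincov)  # True
```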

Output files

Final VCF annotation file

The main output file of the invpg command is a VCF file in which each line describes a bubble annotated as an inversion. The first three fields contain vg deconstruct-like VCF information (i.e. reference chromosome ID, position, paths, alternates...). The INFO field contains additional information about the annotation of each non-reference allele that passed the first step filter (separated by ;), with the following specification:

##INFO=<ID=INVANNOT,Number=A,Type=String,Description="Source of inversion annotation (PATH=path-explicit,ALN=alignment-rescued,NOINV=insufficient inversion signal,NA=not tested)">
##INFO=<ID=INVCOV,Number=A,Type=Float,Description="Inversion signal coverage">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of SV">
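As an illustration of how these fields could be read downstream, the sketch below parses an INFO string into one record per non-reference allele. It assumes that the per-allele values of INVANNOT and INVCOV are comma-separated, as is usual for Number=A fields in VCF. The function name `parse_inv_info` and the example INFO string are made up for this example, not taken from an actual invpg output.

```python
def parse_inv_info(info):
    """Return one {"annot", "cov"} record per non-reference allele."""
    # INFO entries are separated by ';' and written as KEY=VALUE
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    annots = fields["INVANNOT"].split(",") if "INVANNOT" in fields else []
    covs = fields["INVCOV"].split(",") if "INVCOV" in fields else []
    return [
        # Assumption: a missing coverage may be written as '.' or 'NA'
        {"annot": annot, "cov": None if cov in (".", "NA") else float(cov)}
        for annot, cov in zip(annots, covs)
    ]


# Hypothetical INFO field for a bubble with two non-reference alleles
records = parse_inv_info("SVTYPE=INV;INVANNOT=PATH,NOINV;INVCOV=0.93,0.12")
print(records[0])  # {'annot': 'PATH', 'cov': 0.93}
```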

Text file with inversion bubbles statistics

invpg also outputs a text file, with a name ending in .stats, which summarizes the bubble annotation statistics at each step of the program. It contains the number of input bubbles, balanced bubbles, and annotated inversion bubbles. It also indicates the total numbers of path-explicit and alignment-rescued topologies, as numbers of paths (one bubble can have several alternative paths, each with its own annotation).

Intermediate files (when using `-k` parameter)

There are two types of intermediate files: one VCF file and multiple PAF files.

The VCF file (name ending with .balancedSV.vcf) contains all input VCF lines that pass the first step filter. It is the file used as input for the second step of annotation.
The PAF files contain the minimap2 alignment results for each allele going through the second step of annotation. The number of PAF files can be very large depending on the input VCF contents. We plan to optimize the number of PAF files generated in the future.
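When exploring the PAF files kept with -k, it can help to list the alignments found on the reverse strand. The sketch below relies only on the standard PAF column layout written by minimap2 (query name, query start, query end and strand are columns 1, 3, 4 and 5). Which alignments invpg actually uses as inversion signal is not specified here, so this filter is an assumption, and `reverse_strand_hits` and the example records are hypothetical.

```python
def reverse_strand_hits(paf_lines):
    """Return (query name, query start, query end) for reverse-strand alignments."""
    hits = []
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        # A PAF record has at least 12 mandatory columns; column 5 is the strand
        if len(cols) < 12 or cols[4] != "-":
            continue
        hits.append((cols[0], int(cols[2]), int(cols[3])))
    return hits


# Two made-up PAF records: one forward, one reverse
example = [
    "alt1\t1000\t0\t400\t+\tref\t1000\t0\t400\t390\t400\t60",
    "alt1\t1000\t400\t1000\t-\tref\t1000\t400\t1000\t580\t600\t60",
]
print(reverse_strand_hits(example))  # [('alt1', 400, 1000)]
```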

Citation

Romain, S., Dubois, S., Legeai, F., & Lemaitre, C. (2025). Investigating the topological motifs of inversions in pangenome graphs. bioRxiv, 2025-03, https://doi.org/10.1101/2025.03.14.643331.

Contact

INVPG-annot is a Genscale tool developed by Sandra Romain, Siegfried Dubois, Fabrice Legeai and Claire Lemaitre. For any bug report or feedback, please use the GitHub Issues form.
