cov2db: a low frequency variant DB for SARS-CoV-2

(SARS-CoV-2 Illustration image credit: Davian Ho for the Innovative Genomics Institute)

Problem

Global SARS-CoV-2 sequencing efforts have produced a massive public genomic dataset that supports a wide variety of analyses. However, the two most common resources are genome assemblies (e.g. deposited in GISAID and GenBank) and raw sequencing reads, and both limit the information that is readily available, especially about variants within SARS-CoV-2 populations. Genome assemblies contain only consensus-level information, which does not reflect the full genomic diversity within a sample (even a sample derived from a single patient represents a viral population within the host). Raw sequencing reads, on the other hand, require further analysis to extract variant information and can be prohibitively large.

Thus, we propose cov2db, a database resource for collecting low frequency variant information for available SARS-CoV-2 data (as of October 12th, 2021 there are more than 1.2 million SARS-CoV-2 sequencing datasets in SRA and ENA). Our goal is to provide an easy-to-use query system and to contribute to a database of VCF files that contain variant calls for SARS-CoV-2 samples. We hope that such an interactive database will speed up downstream analyses and encourage collaboration.

An illustration of low frequency intrahost single nucleotide variants (iSNVs) within two viral populations inside infected hosts (DOI: 10.1101/gr.268961.120).

Timeline

Sunday:

During Hackathon

Annotated 2,000 VCFs
Converted 2,000 VCFs to JSON
Scaled up our database to handle the data
Prototyped an R Shiny UI for database interactions

Wednesday:

Features

cov2db is a unified database of SARS-CoV-2 variant information that is easily available and searchable for the scientific community. cov2db is hosted on a MongoDB server and can also be accessed through our graphical user interface, built with R Shiny. Our pipeline also provides tools for any user to add their own datasets to the database, building a substantial resource for the study of SARS-CoV-2.

Queries are supported on the following fields.

Annotation:

Reference amino acid
Variant amino acid
Gene name
Mutation type (missense, synonymous, upstream, etc.)

Variant call information:

Sample metadata:

R Shiny UI

Follow the link below for a quick video demo (no sound) of the R Shiny interface to cov2db.

Accessing the database

To access the database and run custom queries, you will first need to install MongoDB Compass or MongoDB Shell. The following examples use a MongoDB Compass installation.

To connect to the cov2db database, open the MongoDB Compass app and press ⌘N on a Mac, or navigate to the menu at the top and pick the Connect->Connect to item.

In the new window, paste the connection string mongodb://sno.cs.rice.edu:27017 and click Connect.

Next, select the cov2db database and navigate to the annotated_vcf collection.

To begin using the shell and start issuing queries, click on the mongosh button in the lower left corner and a shell with >test prompt will appear.

Finally, type the command use cov2db in the shell to switch to the cov2db database.

You are now ready to run queries.

Example queries

Get the count of missense variants reported for ORF1ab

db.annotated_vcf.countDocuments( { info_SequenceOntology: "missense_variant", info_GeneName: "ORF1ab" } )

Get sample accession numbers for samples that have a variant at position 23403 in the genome

db.annotated_vcf.find( { start: 23403 }, {VCF_SAMPLE: 1, _id: 0})

Get the count of missense variants occurring at a frequency below 1% in samples with depth of coverage >100000x at the variant call position

db.annotated_vcf.countDocuments( { info_SequenceOntology: "missense_variant", info_af: { $lt: 0.01 }, info_dp: { $gt: 100000 } } )

Get sample accession numbers for samples that have a missense variant in gene ORF1ab with allele frequency below 0.001

db.annotated_vcf.find( { info_SequenceOntology: "missense_variant", info_GeneName: "ORF1ab", info_af: { $lt: 0.001 }},{VCF_SAMPLE:1, _id:0} )
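The filter documents used in the queries above can also be assembled programmatically. The following is a minimal Python sketch, not part of the cov2db code base: `build_variant_filter` is a hypothetical helper, and it only assumes the field names that appear in the example queries (`info_SequenceOntology`, `info_GeneName`, `info_af`, `info_dp`, `start`).

```python
from typing import Any, Dict, Optional


def build_variant_filter(
    effect: Optional[str] = None,
    gene: Optional[str] = None,
    max_af: Optional[float] = None,
    min_depth: Optional[int] = None,
    position: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a MongoDB filter document from optional criteria (hypothetical helper)."""
    query: Dict[str, Any] = {}
    if effect is not None:
        query["info_SequenceOntology"] = effect
    if gene is not None:
        query["info_GeneName"] = gene
    if max_af is not None:
        query["info_af"] = {"$lt": max_af}      # allele frequency strictly below max_af
    if min_depth is not None:
        query["info_dp"] = {"$gt": min_depth}   # depth strictly above min_depth
    if position is not None:
        query["start"] = position
    return query


# Same filter as the last example query above.
rare_orf1ab = build_variant_filter(
    effect="missense_variant", gene="ORF1ab", max_af=0.001
)
print(rare_orf1ab)

# With pymongo installed and the server reachable, the filter could be used
# like this (not executed here):
#   from pymongo import MongoClient
#   coll = MongoClient("mongodb://sno.cs.rice.edu:27017")["cov2db"]["annotated_vcf"]
#   coll.count_documents(rare_orf1ab)
#   coll.find(rare_orf1ab, {"VCF_SAMPLE": 1, "_id": 0})
```

Keeping the filter construction separate from the database call makes it easy to inspect a query before sending it to the server.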

Methods

How to handle iVar data

iVar TSV output was converted to VCF using the script from here.

python ivar_variants_to_vcf.py example.tsv example.vcf
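To illustrate what such a conversion involves, here is a simplified Python sketch. It is not the referenced ivar_variants_to_vcf.py script: the function names are hypothetical, it writes depth and frequency into the INFO column without declaring them in the header, and the two input rows are made-up values for illustration only. It assumes the standard iVar column names (REGION, POS, REF, ALT, ALT_FREQ, TOTAL_DP, PASS) and iVar's "+"/"-" prefix notation for insertions and deletions.

```python
import csv
import io

VCF_HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"


def ivar_row_to_vcf(row: dict) -> str:
    """Convert one iVar variants row (as a dict) into a VCF data line."""
    ref, alt = row["REF"], row["ALT"]
    if alt.startswith("+"):
        # Insertion: iVar writes "+<inserted bases>"; VCF wants REF base + insert.
        alt = ref + alt[1:]
    elif alt.startswith("-"):
        # Deletion: iVar writes "-<deleted bases>"; VCF wants the deleted bases in REF.
        ref, alt = ref + alt[1:], ref
    filt = "PASS" if row["PASS"] == "TRUE" else "FAIL"
    info = f"DP={row['TOTAL_DP']};AF={row['ALT_FREQ']}"
    return "\t".join([row["REGION"], row["POS"], ".", ref, alt, ".", filt, info])


def ivar_tsv_to_vcf(tsv_text: str) -> str:
    """Convert the text of an iVar TSV file into minimal VCF text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = ["##fileformat=VCFv4.2", VCF_HEADER]
    lines.extend(ivar_row_to_vcf(row) for row in reader)
    return "\n".join(lines) + "\n"


# Illustrative input only (values are invented, columns reduced to those used).
example_tsv = (
    "REGION\tPOS\tREF\tALT\tALT_FREQ\tTOTAL_DP\tPASS\n"
    "NC_045512.2\t23403\tA\tG\t0.004\t12000\tTRUE\n"
    "NC_045512.2\t11287\tG\t-TCTGGTTTT\t0.02\t5000\tFALSE\n"
)
vcf_text = ivar_tsv_to_vcf(example_tsv)
print(vcf_text)
```

A production converter would also emit INFO/FILTER header definitions and handle multi-allelic sites, which this sketch leaves out.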

VCF annotation

The VCF output was then annotated with snpEff. To install snpEff:

wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
unzip snpEff_latest_core.zip

To annotate:

java -Xmx8g -jar ../path/to/snpEff/snpEff.jar NC_045512.2 your_input.vcf > output.ann.vcf
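snpEff writes its annotations into the ANN entry of the VCF INFO column as pipe-separated fields, with multiple annotations separated by commas. The Python sketch below shows one way to read that entry; `parse_ann` and the short field labels in `ANN_FIELDS` are names chosen here for illustration, and the example INFO string is hand-written rather than real snpEff output.

```python
# Short labels for the 16 pipe-separated ANN sub-fields, in order.
ANN_FIELDS = [
    "allele", "annotation", "impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank",
    "hgvs_c", "hgvs_p", "cdna_pos", "cds_pos", "aa_pos",
    "distance", "messages",
]


def parse_ann(info: str) -> list:
    """Return one dict per annotation found in the ANN entry of an INFO string."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            return [
                dict(zip(ANN_FIELDS, ann.split("|")))
                for ann in entry[len("ANN="):].split(",")
            ]
    return []  # no ANN entry present


# Hand-written example for the spike A23403G (D614G) change; IDs are illustrative.
example_info = (
    "DP=12000;AF=0.004;"
    "ANN=G|missense_variant|MODERATE|S|GU280_gp02|transcript|GU280_gp02|"
    "protein_coding|1/1|c.1841A>G|p.Asp614Gly|1841/3822|1841/3822|614/1273||"
)
for ann in parse_ann(example_info):
    print(ann["gene_name"], ann["annotation"], ann["hgvs_p"])
```

The `annotation` and `gene_name` values are the kind of information the mutation type and gene name query fields are built on, although the exact mapping used by cov2db is not shown here.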

Annotated VCF to JSON conversion
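As a rough idea of what this step produces, the Python sketch below flattens one VCF data line into a JSON-ready document. It is not the conversion script from this repository: `vcf_line_to_doc` is a hypothetical function, the "info_" prefix and lower-casing are assumptions inferred from field names such as info_af and info_dp in the example queries, and the sample accession is a placeholder.

```python
import json


def _coerce(value: str):
    """Turn numeric-looking strings into int or float; leave the rest as text."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def vcf_line_to_doc(line: str, sample: str) -> dict:
    """Flatten one VCF data line into a dict suitable for JSON export."""
    chrom, pos, _id, ref, alt, _qual, filt, info = line.rstrip("\n").split("\t")[:8]
    doc = {
        "VCF_SAMPLE": sample,
        "chrom": chrom,
        "start": int(pos),
        "ref": ref,
        "alt": alt,
        "filter": filt,
    }
    if info != ".":  # "." means the INFO column is empty
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            # Flag-style INFO keys carry no value; store them as True.
            doc["info_" + key.lower()] = _coerce(value) if sep else True
    return doc


# Placeholder accession and illustrative variant line.
example_line = "NC_045512.2\t23403\t.\tA\tG\t.\tPASS\tDP=12000;AF=0.004"
doc = vcf_line_to_doc(example_line, sample="SRR0000000")
print(json.dumps(doc, indent=2))
```

Storing depth and allele frequency as numbers rather than strings is what allows range operators such as $lt and $gt to work in MongoDB queries.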

Workflow figure✍️

Related work

VAPr is an excellent MongoDB-based database for storing variant information. The UCSC SARS-CoV-2 genome browser also provides visualization of intrahost variants here.

Team members

Daniel Agustinho, Washington University (data acquisition, writer)
Li Chuin Chong, Twincore GmbH/HZI-DKFZ under auspices MHH (Sysadmin, mongodb)
Maria Jose, Pondicherry Central University (data acquisition, mongodb)
BaiWei Lo, University of Konstanz (data acquisition, QC)
Ramanandan Prabhakaran, Roche Canada (Sysadmin, python wrapper, mongodb database development, workflow development)
Sophie Poon, (Data acquisition, QC)
Suresh Kumar, ICAR-NIVEDI (QC)
Nick Sapoval, Rice University (Team co-lead, data acquisition, writer, R Shiny development)
Todd Treangen, (Team Lead)

####### CITATION:

CITE: Walker K, Kalra D, Lowdon R et al. The third international hackathon for applying insights into large-scale genomic composition to use cases in a wide range of organisms [version 1; peer review: awaiting peer review]. F1000Research 2022, 11:530 (https://doi.org/10.12688/f1000research.110194.1)

###########
