cov2db: a low frequency variant DB for SARS-CoV-2

[cov2db logo] (SARS-CoV-2 illustration image credit: Davian Ho for the Innovative Genomics Institute)


Problem

Global SARS-CoV-2 sequencing efforts have produced a massive public genomic dataset that supports a wide variety of analyses. However, the two most common resources are genome assemblies (e.g. those deposited in GISAID and GenBank) and raw sequencing reads, and both limit the information that is readily available, especially about variants found within SARS-CoV-2 populations. Genome assemblies contain only consensus-level information, which does not reflect the full genomic diversity within a given sample (even a single patient-derived sample represents a viral population within the host). Raw sequencing reads, on the other hand, require further analysis to extract variant information and can often be prohibitively large.

Thus, we propose cov2db, a database resource for collecting low frequency variant information from available SARS-CoV-2 data (as of October 12th, 2021 there are more than 1.2 million SARS-CoV-2 sequencing datasets in SRA and ENA). Our goal is to provide an easy-to-use query system and to contribute a database of VCF files that contain variant calls for SARS-CoV-2 samples. We hope that such an interactive database will speed up downstream analyses and encourage collaboration.

Figure: An illustration of low frequency single nucleotide variants (iSNVs) within two viral populations inside infected hosts (DOI: 10.1101/gr.268961.120).

Timeline

Sunday:

During Hackathon

  • Annotated 2,000 VCFs
  • Converted 2,000 VCFs to JSON
  • Scaled up our database to handle the data
  • Prototyped an R Shiny UI for database interactions

Wednesday:

Features

cov2db is a unified database of information about SARS-CoV-2 strain variants that is easily available to, and searchable by, the scientific community. cov2db is hosted on a MongoDB server and can also be accessed through our graphical user interface, built with R Shiny. Our pipeline also provides the tools for any user to add their own datasets to the database, building a formidable resource for the study of SARS-CoV-2.

Queries are supported on the following fields.

Annotation:

  • Reference amino acid
  • Variant amino acid
  • Gene name
  • Mutation type (missense, synonymous, upstream, etc.)

Variant call information:

  • Position
  • Allele frequency
  • Reference allele
  • Alternative allele
  • Coverage depth
  • Strand bias

Sample metadata:

  • Sequencing device
  • Library layout
  • Submission date
  • Study accession
  • Variant caller

R Shiny UI

Follow the link below for a quick video demo (no sound) of the R Shiny interface to cov2db.

R Shiny Demo

Accessing the database

To access the database and run custom queries, you will first need to install MongoDB Compass or the MongoDB Shell. The following examples assume a MongoDB Compass installation.

To connect to the cov2db database, open the MongoDB Compass app and press ⌘N on a Mac, or navigate to the menu at the top and pick the Connect -> Connect to item.

In the new window, paste in the connection string mongodb://sno.cs.rice.edu:27017 and click Connect.

Next, select the cov2db database and navigate to the annotated_vcf collection.

To start issuing queries from the shell, click the mongosh button in the lower left corner; a shell with a >test prompt will appear.

Finally, type the command use cov2db in the shell to switch to the cov2db database.

You are now ready to run queries.

Example queries

  1. Get the count of missense variants reported for ORF1ab

db.annotated_vcf.count( { info_SequenceOntology: "missense_variant", info_GeneName: "ORF1ab" } )

  2. Get sample accession numbers for samples that have a variant at position 23403 in the genome

db.annotated_vcf.find( { start: 23403 }, { VCF_SAMPLE: 1, _id: 0 } )

  3. Get the count of missense variants occurring at a frequency below 1% in samples with depth of coverage >100000x at the variant call position

db.annotated_vcf.count( { info_SequenceOntology: "missense_variant", info_af: { $lt: 0.01 }, info_dp: { $gt: 100000 } } )

  4. Get sample accession numbers for samples that have a missense variant in gene ORF1ab with allele frequency less than 0.001

db.annotated_vcf.find( { info_SequenceOntology: "missense_variant", info_GeneName: "ORF1ab", info_af: { $lt: 0.001 } }, { VCF_SAMPLE: 1, _id: 0 } )
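The same filters can also be built programmatically. The sketch below is not part of the cov2db scripts: it assumes the third-party pymongo driver (this README itself only uses Compass and mongosh), and the helper names low_frequency_filter and count_variants are hypothetical. The field names, database, collection, and connection string are the ones used in the example queries above.

```python
def low_frequency_filter(effect=None, gene=None, max_af=None, min_depth=None):
    """Build a MongoDB filter document from optional criteria.

    Field names are the ones used in the example queries; criteria left
    as None are simply not included in the filter.
    """
    query = {}
    if effect is not None:
        query["info_SequenceOntology"] = effect
    if gene is not None:
        query["info_GeneName"] = gene
    if max_af is not None:
        query["info_af"] = {"$lt": max_af}      # strictly below max_af
    if min_depth is not None:
        query["info_dp"] = {"$gt": min_depth}   # strictly above min_depth
    return query


def count_variants(query, uri="mongodb://sno.cs.rice.edu:27017"):
    """Count matching documents in cov2db.annotated_vcf (needs pymongo)."""
    # Imported lazily so the filter helper works without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    try:
        return client["cov2db"]["annotated_vcf"].count_documents(query)
    finally:
        client.close()


# Filter equivalent to example query 3 above.
query3 = low_frequency_filter(
    effect="missense_variant", max_af=0.01, min_depth=100000
)
print(query3)
```

Calling count_variants(query3) would then run the count against the server, provided pymongo is installed and the server is reachable from your machine.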

Methods

How to handle iVar data

iVar TSV output was converted to VCF using the script from here.

python ivar_variants_to_vcf.py example.tsv example.vcf

VCF annotation

The VCF output was then annotated with snpEff. To install snpEff:

wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
unzip snpEff_latest_core.zip

To annotate:

java -Xmx8g -jar ../path/to/snpEff/snpEff.jar NC_045512.2 your_input.vcf > output.ann.vcf
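When many VCFs need annotating, the command above can be wrapped in a small loop. This is only a sketch under assumptions: the helper names (snpeff_command, annotated_path, annotate_directory) and the sample.vcf to sample.ann.vcf naming scheme are hypothetical and not taken from the cov2db scripts; the snpEff arguments mirror the command shown above.

```python
from pathlib import Path
import subprocess


def snpeff_command(snpeff_jar, vcf_path, genome="NC_045512.2", heap="8g"):
    """Return the argument list for annotating one VCF with snpEff."""
    return ["java", f"-Xmx{heap}", "-jar", str(snpeff_jar), genome, str(vcf_path)]


def annotated_path(vcf_path):
    """Map sample.vcf to sample.ann.vcf in the same directory (assumed naming)."""
    vcf_path = Path(vcf_path)
    return vcf_path.with_name(vcf_path.stem + ".ann.vcf")


def annotate_directory(snpeff_jar, vcf_dir):
    """Annotate every *.vcf in vcf_dir, skipping files that are already outputs."""
    for vcf_path in sorted(Path(vcf_dir).glob("*.vcf")):
        if vcf_path.name.endswith(".ann.vcf"):
            continue
        # snpEff writes the annotated VCF to stdout, so redirect it to a file.
        with open(annotated_path(vcf_path), "w") as out:
            subprocess.run(
                snpeff_command(snpeff_jar, vcf_path), stdout=out, check=True
            )


print(snpeff_command("../path/to/snpEff/snpEff.jar", "your_input.vcf"))
```

Passing check=True makes the loop stop at the first failed annotation instead of silently leaving an empty output file behind.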

Annotated VCF to JSON conversion
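The conversion script itself is not shown in this README. As a rough illustration only, the sketch below turns snpEff-annotated VCF lines into JSON-ready records using the Python standard library. It assumes single-ALT records, DP and AF keys in the INFO column, and the standard snpEff ANN layout (Allele|Annotation|Annotation_Impact|Gene_Name|...). The output keys reuse the field names from the example queries; the chrom, ref, and alt keys, the function names, the sample accession, and the example VCF line are made up for illustration.

```python
import json

# Positions of the fields we keep inside a pipe-separated snpEff ANN entry.
ANN_ANNOTATION = 1  # sequence ontology term, e.g. missense_variant
ANN_GENE_NAME = 3


def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def vcf_line_to_record(line, sample):
    """Convert one VCF data line into a flat dict (assumes a single ALT)."""
    chrom, pos, _id, ref, alt, _qual, _filter, info = line.rstrip("\n").split("\t")[:8]
    info_fields = parse_info(info)
    record = {"VCF_SAMPLE": sample, "chrom": chrom, "start": int(pos),
              "ref": ref, "alt": alt}
    if "DP" in info_fields:
        record["info_dp"] = int(info_fields["DP"])
    if "AF" in info_fields:
        record["info_af"] = float(info_fields["AF"])
    ann = info_fields.get("ANN")
    if isinstance(ann, str):
        first = ann.split(",")[0].split("|")  # keep only the first annotation
        if len(first) > ANN_GENE_NAME:
            record["info_SequenceOntology"] = first[ANN_ANNOTATION]
            record["info_GeneName"] = first[ANN_GENE_NAME]
    return record


def vcf_to_records(lines, sample):
    """Convert all data lines, skipping headers (#...) and blank lines."""
    return [vcf_line_to_record(line, sample)
            for line in lines if line.strip() and not line.startswith("#")]


# Hand-written example input; the values are illustrative, not real data.
EXAMPLE_VCF_LINES = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NC_045512.2\t23403\t.\tA\tG\t.\tPASS\t"
    "DP=1200;AF=0.004;ANN=G|missense_variant|MODERATE|S|gene_id|transcript",
]

records = vcf_to_records(EXAMPLE_VCF_LINES, sample="SRR0000000")
print(json.dumps(records, indent=2))
```

Only the first ANN entry is kept here for brevity; a variant that affects several transcripts or genes carries multiple comma-separated entries, and a real converter would need to decide how to store them.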

Workflow figure✍️

[Figure: cov2db workflow diagram]

Related work

VAPr is an excellent MongoDB-based database for storing variant information. The UCSC SARS-CoV-2 genome browser also provides visualization of intrahost variants here.


Team members

  • Daniel Agustinho, Washington University (data acquisition, writer)
  • Li Chuin Chong, Twincore GmbH/HZI-DKFZ under auspices MHH (sysadmin, MongoDB)
  • Maria Jose, Pondicherry Central University (data acquisition, MongoDB)
  • BaiWei Lo, University of Konstanz (data acquisition, QC)
  • Ramanandan Prabhakaran, Roche Canada (sysadmin, Python wrapper, MongoDB database development, workflow development)
  • Sophie Poon (data acquisition, QC)
  • Suresh Kumar, ICAR-NIVEDI (QC)
  • Nick Sapoval, Rice University (team co-lead, data acquisition, writer, R Shiny development)
  • Todd Treangen (team lead)

Citation

Walker K, Kalra D, Lowdon R et al. The third international hackathon for applying insights into large-scale genomic composition to use cases in a wide range of organisms [version 1; peer review: awaiting peer review]. F1000Research 2022, 11:530 (https://doi.org/10.12688/f1000research.110194.1)
