edahelsinki/vilma-db-doc

VILMA Database Access Documentation

The VILMA database is a collection of molecular databases relevant to the VILMA project. It is implemented as a document-oriented NoSQL MongoDB server instance. The instance hosts multiple databases, each constructed from one or more molecular datasets. Each database is in turn divided into one or more partitions called collections. Because the database is NoSQL, no collection has a predefined schema (that is, a strictly enforced structure); each collection is instead a store of BSON (Binary JSON) documents.

Thus the general hierarchy of the data can be described as: Instance (the host server) → Database (constructed from some dataset(s)) → Collection (some partition of the dataset(s)) → Document (each individual data element). Each document has one or more key-value pairs referred to as fields. In general it can be assumed that all of the documents within the same collection have the same fields, but in theory this is not guaranteed or enforced by default by the database engine.
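Because the engine does not enforce a common set of fields, it can be worth checking a sample of documents before relying on a field being present. The following is a minimal sketch, not part of the VILMA tooling: the helper name field_signatures and the sample documents are hypothetical, and the check only looks at top-level field names.

```python
from collections import Counter


def field_signatures(documents):
    """Count how many documents share each distinct set of top-level field names."""
    return Counter(frozenset(doc.keys()) for doc in documents)


# Hypothetical documents; the second one lacks "num_conformers".
docs = [
    {"_id": 1, "smiles": "CCO", "num_conformers": 12},
    {"_id": 2, "smiles": "CC=O"},
    {"_id": 3, "smiles": "CCC", "num_conformers": 7},
]

signatures = field_signatures(docs)
for fields, count in signatures.items():
    print(sorted(fields), count)
```

Any iterable of dictionaries works as input, so a PyMongo cursor over a limited sample of a collection could be passed in place of the hard-coded list.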

The following sections describe the general steps for connecting to the databases within the VILMA MongoDB instance and for accessing the data within specific collections using the Python client.

Available Databases

GeckoQ

This database has been constructed by parsing the GeckoQ dataset and its addendum.

Available collections

📊 Core Data
  • molecules: Parent molecules extracted from the GeckoQ data and its addendum.

  • molecule_isomers: Isomers of the parent molecules extracted from the GeckoQ data and its addendum.

📊 Machine Learning (GNN Project)
  • ml_test: Test partition for the GNN project.

  • ml_train: Training partition for the GNN project.

  • ml_val: Validation partition for the GNN project.

  • test_holdout: Test holdout partition for the GNN project.

  • ml_metadata: Metadata document for the data partitions for the GNN project.

  • local_development: Subset of 1000 molecules for local development for the GNN project.

📊 System & Logs
  • molecule_log: Log documents related to parsing the data to the database.

Schema Details

📑 Collection: molecules

| Field | Type | Description |
| --- | --- | --- |
| _id | ObjectId | Unique identifier for the document generated by MongoDB. |
| molecule_index | Int32 | Unique identifier for the molecule. |
| smiles | String | Canonical SMILES string. |
| ref_structure | Array | 3D reference structure. |
| flexibility | Double | Initial flexibility in Hartrees based on all conformers from GeckoQ and H-bond information from COSMO. |
| min_h_index_conformers_idxs | Array | The ids of the conformers with the minimum number of hydrogen bonds. |
| conformers | Array | All the conformers of the molecule including 3D structure and metadata. |
| source | String | The original source of the data (original GeckoQ or addendum). |
| processing_metadata | Object | Processing metadata. |
| reestimated_flexibility | Double | Flexibility in Hartrees recalculated with valid conformers and improved algorithms for H-bond estimation. |
| valid_conformer_ids | Array | Ids of the chemically valid conformers. |
| valid_flexibility | Double | Flexibility in Hartrees calculated using only conformers that pass validation criteria (e.g., have valid internal bonds). |
| num_conformers | Int32 | Total number of conformers. |
| isomer_count | Int32 | Total number of isomers of the molecule within the molecule_isomers collection. |
| psat_COSMO_atm | Double | Saturation vapor pressure in standard atmospheres calculated by COSMOtherm. |
| psat_SIMPOL_atm | Double | Saturation vapor pressure in standard atmospheres calculated by SIMPOL. |

📑 Collection: molecule_isomers

| Field | Type | Description |
| --- | --- | --- |
| _id | ObjectId | Unique identifier for the document generated by MongoDB. |
| isomer_id | Int32 | Unique identifier for the isomer. |
| conformers | Array | All the conformers of the molecule including 3D structure and metadata. |
| flexibility | Double | Flexibility in Hartrees. |
| is_sole_isomer | Boolean | True if the isomer is the only isomer of the parent molecule; otherwise, false. |
| parent_molecule_index | Int32 | Id of the parent molecule within the molecules collection. |
| ref_structure | Array | 3D reference structure. |
| smiles | String | Canonical SMILES string of the parent molecule. |
| stereo_smiles | String | Stereo SMILES string of the isomer. |
| num_conformers | Int32 | Total number of conformers. |
| psat_COSMO_atm | Double | Saturation vapor pressure in standard atmospheres calculated by COSMOtherm. |
| psat_SIMPOL_atm | Double | Saturation vapor pressure in standard atmospheres calculated by SIMPOL. |
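
The parent_molecule_index field of an isomer refers to the molecule_index of its parent in the molecules collection, so the two collections can be linked with a simple equality query. The sketch below assumes a PyMongo collection object for molecule_isomers is passed in; the helper name isomers_of and the choice of projected fields are illustrative only.

```python
def isomers_of(isomers_collection, molecule_index):
    """Return the isomer documents whose parent is the given molecule."""
    # A bare value in a query is shorthand for an equality match.
    query = {"parent_molecule_index": molecule_index}
    # Keep the result small: only the isomer id and its stereo SMILES.
    projection = {"_id": 0, "isomer_id": 1, "stereo_smiles": 1}
    return list(isomers_collection.find(query, projection))
```

A call could look like isomers_of(db["molecule_isomers"], 42), where db is a database handle obtained as described under Connecting to the Database and 42 is a placeholder molecule index.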

GeckoAP

This database has been constructed by parsing the filtered GeckoAP datasets.

Available collections

📊 Core Data
  • molecules: Molecules extracted from the filtered GeckoAP data.

Getting Started

Access Credentials

To access the MongoDB instance you need credentials in the form of a connection string. It has the general structure mongodb://<username>:<password>@<hostname>:<port>. The :<port> segment is optional; if it is omitted, the driver defaults to port 27017.
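
If you ever need to assemble the connection string yourself, note that characters such as @, : or / in the username or password must be percent-encoded, otherwise the string is parsed incorrectly. The following is a minimal sketch using only the standard library; the helper name build_mongo_uri and the example credentials are hypothetical.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, hostname, port=None):
    """Assemble a mongodb:// connection string with percent-encoded credentials."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    # The port segment is optional; the driver falls back to 27017 without it.
    host = hostname if port is None else f"{hostname}:{port}"
    return f"mongodb://{user}:{pwd}@{host}"


# Placeholder values, not real credentials.
print(build_mongo_uri("alice", "p@ss/word", "db.example.org"))
```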

Make sure the connection string provided to you follows this format. By default the credentials given will be read-only, meaning you can query and retrieve data but cannot modify or delete existing records.

For security, never hard-code your connection string in your source code. Store it as an environment variable in a .env file at your project root:

MONGO_URI=mongodb://<username>:<password>@<hostname>

Warning

Remember to add the .env file to your .gitignore file before pushing your project to any remote repository, so that the credentials are never committed.

To use these credentials in Python, use the python-dotenv library, which can be installed with pip install python-dotenv. Then load the environment variable as follows:

import os
from dotenv import load_dotenv

load_dotenv()
uri = os.getenv('MONGO_URI')
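
os.getenv returns None when the variable is missing, which tends to surface later as a confusing connection error. One way to fail early is a small guard like the sketch below; the helper name require_env is hypothetical and the error message is only a suggestion.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return value
```

After load_dotenv() has run, uri = require_env("MONGO_URI") could then replace the plain os.getenv call.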

Connecting to the Database

MongoDB Compass and Shell

MongoDB Compass provides a Graphical User Interface (GUI) while MongoDB Shell (not to be confused with MongoDB CLI) provides a terminal-based interface for data interaction. Both support direct data access via queries and advanced data processing via aggregation pipelines.

MongoDB Compass:

MongoDB Shell:

Within Python

You can connect to the MongoDB instance within Python by using the PyMongo library, which can be installed with pip install pymongo. To create a connection and access a specific collection within a specific database, do the following:

from pymongo import MongoClient

db_name = "database_name"
collection_name = "collection_name"

client = MongoClient(uri)
db = client[db_name]
collection = db[collection_name]

Here the variable uri is the connection string read from the environment variable, as described in the section Access Credentials.

If you are not sure what the available database names are within the MongoDB instance, you can list them with the following:

print(client.list_database_names())

If you are not sure what the available collection names are within a specific database, you can list them with the following:

print(db.list_collection_names())
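
The two listing calls can be combined to get an overview of the whole Instance → Database → Collection layout at once. The sketch below assumes a connected MongoClient is passed in; the helper name describe_instance is hypothetical, and which databases appear depends on what your credentials are allowed to see.

```python
def describe_instance(client):
    """Map each visible database name to a sorted list of its collection names."""
    return {
        name: sorted(client[name].list_collection_names())
        for name in client.list_database_names()
    }
```

For example, the result of describe_instance(client) could be printed one database per line: for name, collections in describe_instance(client).items(): print(name, collections).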

The rest of the documentation will describe how typical tasks are achieved using PyMongo.

Basics of Querying Data

A general query can be defined as a dictionary where the keys correspond to fields within the collection and values to dictionaries of restrictions on those fields. Each query can include multiple fields and each field can have multiple restrictions. The restriction dictionary keys are query operators and values the operands of the operators.

To run the query we use the find method applied on a collection. The output will be a cursor that can be iterated over to retrieve documents matching the query as dictionaries.
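
Since find returns a cursor rather than a list, documents are only fetched as the cursor is iterated, and a cursor can be consumed only once. For a quick look at a few matches it is usually enough to cap the cursor and materialise it, as in this sketch. It assumes a PyMongo collection object is passed in, and the helper name preview is hypothetical.

```python
def preview(collection, query, n=5):
    """Return up to n documents matching the query as a list of dictionaries."""
    # limit() caps how many documents the cursor will yield.
    cursor = collection.find(query).limit(n)
    # list() iterates the cursor to completion and keeps the documents in memory.
    return list(cursor)
```

A call such as preview(collection, {"num_conformers": {"$gt": 100}}, n=3) would then return at most three documents.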

Query Operators

Comparison Operators

These operators allow you to filter documents based on value comparisons.

| Operator | Name | Description |
| --- | --- | --- |
| $eq | Equal | Matches values that are equal to a specified value. |
| $ne | Not Equal | Matches all values that are not equal to a specified value. |
| $gt | Greater Than | Matches values that are greater than a specified value. |
| $gte | Greater Than or Equal | Matches values that are greater than or equal to a specified value. |
| $lt | Less Than | Matches values that are less than a specified value. |
| $lte | Less Than or Equal | Matches values that are less than or equal to a specified value. |

Logical Operators

These operators allow you to combine multiple query conditions.

| Operator | Name | Description |
| --- | --- | --- |
| $and | Logical AND | Returns all documents that match the conditions of all clauses. |
| $or | Logical OR | Returns all documents that match the conditions of at least one clause. |
| $nor | Logical NOR | Returns all documents that match none of the clauses. |
| $not | Logical NOT | Inverts the effect of a query expression on a single field. |
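
As a rough illustration of how the logical operators are written in PyMongo, the sketch below builds three query dictionaries against fields of the molecules collection. The value "addendum" for the source field and the thresholds are example values only. Note that $and, $or and $nor take a list of sub-queries at the top level, whereas $not wraps an operator expression for one field and also matches documents in which that field is missing.

```python
# Clauses reused by the $or and $nor examples.
clauses = [
    {"source": "addendum"},
    {"isomer_count": {"$gt": 1}},
]

# Documents that come from the addendum OR have more than one isomer.
either = {"$or": clauses}

# Documents for which NEITHER clause holds.
neither = {"$nor": clauses}

# Documents whose num_conformers is NOT greater than 100.
not_many = {"num_conformers": {"$not": {"$gt": 100}}}

for query in (either, neither, not_many):
    print(query)
```

Each dictionary can be passed directly as the first argument of the find method.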

Note

By default expressions without an explicit logical operator are expanded with implicit AND. So for example the query

{"num_conformers": {"$gt": 100, "$lt": 500}}

is implicitly expanded into

{"$and": [{"num_conformers": {"$gt": 100}}, {"num_conformers": {"$lt": 500}}]}
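
To make the equivalence concrete without a database connection, the following toy evaluator applies a query dictionary to plain Python dictionaries. It is only a sketch of the operator semantics, not how MongoDB evaluates queries: the function matches is hypothetical, it supports only the comparison operators plus $and, $or and $nor, and it assumes every queried field is present and holds a scalar value.

```python
import operator

# Comparison operators mapped to their Python equivalents.
COMPARISONS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(doc, query):
    """Toy check of whether doc satisfies query (scalar, present fields only)."""
    for key, condition in query.items():
        if key == "$and":
            ok = all(matches(doc, clause) for clause in condition)
        elif key == "$or":
            ok = any(matches(doc, clause) for clause in condition)
        elif key == "$nor":
            ok = not any(matches(doc, clause) for clause in condition)
        else:
            # A bare value is shorthand for {"$eq": value}.
            if not isinstance(condition, dict):
                condition = {"$eq": condition}
            # Several operators on one field are combined with implicit AND.
            ok = all(
                COMPARISONS[op](doc[key], operand)
                for op, operand in condition.items()
            )
        if not ok:
            return False
    return True


docs = [{"num_conformers": n} for n in (50, 100, 250, 500, 900)]
implicit = {"num_conformers": {"$gt": 100, "$lt": 500}}
explicit = {"$and": [{"num_conformers": {"$gt": 100}}, {"num_conformers": {"$lt": 500}}]}

implicit_hits = [d for d in docs if matches(d, implicit)]
explicit_hits = [d for d in docs if matches(d, explicit)]
print(implicit_hits == explicit_hits)  # True
```

Both forms select only the document with 250 conformers, since the bounds are exclusive.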

Projecting Data

By default the find method returns all the fields of the matching documents. We can, however, specify which fields should be included in or excluded from the returned documents to save bandwidth and memory. This is called projecting the data.

The projection is a dictionary where the keys are the names of the fields and the values are 1 for fields to include and 0 for fields to exclude. This dictionary is then given as the second argument to the find method.

Note

The default ID field _id automatically generated by MongoDB is always included by default unless explicitly excluded.

Important

Each projection can consist only of field inclusions or only of field exclusions, not both. The exception is the _id field, which can be excluded from a projection that otherwise only includes fields.
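
The rules above can be mimicked on a plain dictionary to see what a projection keeps. The following is a toy sketch for top-level fields only, not the server's implementation: the function apply_projection and the sample document are hypothetical.

```python
def apply_projection(doc, projection):
    """Toy version of a top-level projection applied to a single document."""
    # Flags for every field except _id, which follows its own rule.
    flags = {k: v for k, v in projection.items() if k != "_id"}
    if len({bool(v) for v in flags.values()}) > 1:
        raise ValueError("a projection cannot mix inclusions and exclusions")

    # _id is kept unless it is explicitly excluded.
    keep_id = bool(projection.get("_id", 1))
    # {"_id": 1} on its own counts as an inclusion projection.
    including = any(flags.values()) or (not flags and bool(projection.get("_id", 0)))

    if including:
        result = {k: v for k, v in doc.items() if k in flags}
    else:
        result = {k: v for k, v in doc.items() if k != "_id" and k not in flags}
    if keep_id and "_id" in doc:
        result = {"_id": doc["_id"], **result}
    return result


# Hypothetical document with a subset of the molecules fields.
doc = {"_id": "abc", "molecule_index": 7, "valid_flexibility": 0.5, "num_conformers": 3}

print(apply_projection(doc, {"_id": 0, "molecule_index": 1, "valid_flexibility": 1}))
print(apply_projection(doc, {"num_conformers": 0}))
```

The first call keeps only molecule_index and valid_flexibility; the second keeps everything except num_conformers, including _id.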

Example queries

Tip

For more in-depth examples, take a look at the notebooks within the examples directory.

  1. The simplest query is the empty dictionary which will return all the documents within the collection:
results = collection.find({})
  2. The following query finds all the documents within the collection, but projects to include only the fields molecule_index and valid_flexibility:
projection = {
    "_id": 0,
    "molecule_index": 1,
    "valid_flexibility": 1
}

results = collection.find({}, projection)
  3. The following query finds all the documents with the field num_conformers within the range $[100, 500]$:
query = {
    "num_conformers": {"$gte": 100, "$lte": 500}
}

results = collection.find(query)
  4. The following query finds all the documents with the field source not equal to "addendum", field isomer_count equal to $1$, and field valid_flexibility greater than $0$, and projects to include only the fields molecule_index, valid_flexibility, and num_conformers:
query = {
    "source": {"$ne": "addendum"},
    "isomer_count": {"$eq": 1},
    "valid_flexibility" : {"$gt": 0}
}

projection = {
    "_id": 0,
    "molecule_index": 1,
    "valid_flexibility": 1,
    "num_conformers": 1
}

results = collection.find(query, projection)
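
Cursors returned by find can also be sorted and limited before any documents are fetched, which is often convenient for inspecting the extremes of a field. The sketch below assumes a PyMongo collection object for molecules is passed in; the helper name top_by_field is hypothetical, and -1 is the value PyMongo uses for descending order.

```python
def top_by_field(collection, query, field, n=10):
    """Return the n matching documents with the largest values of field."""
    projection = {"_id": 0, "molecule_index": 1, field: 1}
    # sort(field, -1) orders descending; limit(n) keeps only the first n.
    cursor = collection.find(query, projection).sort(field, -1).limit(n)
    return list(cursor)
```

For instance, top_by_field(collection, {"valid_flexibility": {"$gt": 0}}, "valid_flexibility", n=5) would return the five matching molecules with the highest valid_flexibility.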
