VILMA Database Access Documentation

The VILMA database is a collection of molecular databases relevant to the VILMA project. It is implemented as a document-oriented NoSQL MongoDB server instance. The instance hosts multiple databases, each constructed from one or more molecular datasets. Each database is in turn divided into one or more partitions called collections. Because the database is NoSQL, there is no predefined schema (i.e., no strict structure) for any collection; each collection is instead a data store of BSON (Binary JSON) documents.

Thus the general hierarchy of the data can be described as: Instance (the host server) → Database (constructed from some dataset(s)) → Collection (some partition of the dataset(s)) → Document (each individual data element). Each document has one or more key-value pairs referred to as fields. In general, it can be assumed that all documents within the same collection have the same fields, but this is not guaranteed, because the database engine does not enforce it by default.
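Because a shared set of fields is not enforced, it can be worth checking a sample of documents before relying on a particular field. The following is a minimal sketch that uses plain dictionaries as stand-ins for retrieved documents; the sample values and the helper name field_variants are illustrative and not part of the VILMA data.

```python
from collections import Counter


def field_variants(documents):
    """Count how many documents share each distinct set of field names."""
    return Counter(frozenset(doc.keys()) for doc in documents)


# Hypothetical documents standing in for query results.
sample_docs = [
    {"_id": 1, "smiles": "CCO", "num_conformers": 3},
    {"_id": 2, "smiles": "CC=O", "num_conformers": 5},
    {"_id": 3, "smiles": "C"},  # this one lacks num_conformers
]

variants = field_variants(sample_docs)
# True only if every sampled document has exactly the same fields.
consistent = len(variants) == 1
```

If more than one variant shows up, reading the field with doc.get("field_name") instead of doc["field_name"] avoids a KeyError on the documents that lack it.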

The following sections describe the general steps for connecting to the databases within the VILMA MongoDB instance and for accessing the data within specific collections with the Python client.

Available Databases

GeckoQ

This database has been constructed by parsing the GeckoQ and its addendum datasets.

Available collections

📊 Core Data

molecules: Parent molecules extracted from the GeckoQ and its addendum data.
molecule_isomers: Isomers of the parent molecules extracted from the GeckoQ and its addendum data.

📊 Machine Learning (GNN Project)

ml_test: Test partition for the GNN project.
ml_train: Train partition for the GNN project.
ml_val: Validation partition for the GNN project.
test_holdout: Test holdout partition for the GNN project.
ml_metadata: Metadata document for the data partitions for the GNN project.
local_development: Subset of 1000 molecules for local development for the GNN project.

📊 System & Logs

molecule_log: Log documents related to parsing the data to the database.

Schema Details

📑 Collection: molecules

Field	Type	Description
`_id`	ObjectId	Unique identifier for the document generated by MongoDB.
`molecule_index`	Int32	Unique identifier for the molecule.
`smiles`	String	Canonical SMILES string.
`ref_structure`	Array	3D reference structure.
`flexibility`	Double	Initial flexibility in Hartrees based on all conformers from GeckoQ and H-bonds information from COSMO.
`min_h_index_conformers_idxs`	Array	The ids of the conformers with minimum number of hydrogen bonds.
`conformers`	Array	All the conformers of the molecule including 3D structure and metadata.
`source`	String	The original source of the data (original GeckoQ or addendum).
`processing_metadata`	Object	Processing metadata.
`reestimated_flexibility`	Double	Flexibility in Hartrees recalculated with valid conformers and improved algorithms for H-bond estimation.
`valid_conformer_ids`	Array	Ids of the chemically valid conformers.
`valid_flexibility`	Double	Flexibility in Hartrees calculated using only conformers that pass validation criteria (e.g., have valid internal bonds).
`num_conformers`	Int32	Total number of conformers.
`isomer_count`	Int32	Total number of isomers of the molecule within the `molecule_isomers` collection.
`psat_COSMO_atm`	Double	Saturation vapor pressure in standard atmospheres calculated by COSMOtherm.
`psat_SIMPOL_atm`	Double	Saturation vapor pressure in standard atmospheres calculated by SIMPOL.

📑 Collection: molecule_isomers

Field	Type	Description
`_id`	ObjectId	Unique identifier for the document generated by MongoDB.
`isomer_id`	Int32	Unique identifier for the isomer.
`conformers`	Array	All the conformers of the molecule including 3D structure and metadata.
`flexibility`	Double	Flexibility in Hartrees.
`is_sole_isomer`	Boolean	True if the isomer is the only isomer of the parent molecule; otherwise, false.
`parent_molecule_index`	Int32	Id of the parent molecule within the `molecules` collection.
`ref_structure`	Array	3D reference structure.
`smiles`	String	Canonical SMILES string of the parent molecule.
`stereo_smiles`	String	Stereo SMILES string of the isomer.
`num_conformers`	Int32	Total number of conformers.
`psat_COSMO_atm`	Double	Saturation vapor pressure in standard atmospheres calculated by COSMOtherm.
`psat_SIMPOL_atm`	Double	Saturation vapor pressure in standard atmospheres calculated by SIMPOL.
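The two GeckoQ collections are linked through parent_molecule_index. As a rough sketch, assuming that this field holds the molecule_index value of the parent document, the isomers of one molecule could be looked up as follows. The helper name isomers_of is hypothetical, and db is expected to be a PyMongo database object for the GeckoQ database.

```python
def isomers_of(db, molecule_index):
    """Return the isomer documents whose parent is the given molecule."""
    # Assumption: parent_molecule_index stores the parent's molecule_index.
    query = {"parent_molecule_index": molecule_index}
    # Keep the result small: only the isomer id and its stereo SMILES.
    projection = {"_id": 0, "isomer_id": 1, "stereo_smiles": 1}
    return list(db["molecule_isomers"].find(query, projection))
```

The length of the returned list can then be compared with the isomer_count field of the parent document as a sanity check.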

GeckoAP

This database has been constructed by parsing the filtered GeckoAP datasets.

Available collections

📊 Core Data

molecules: Molecules extracted from the filtered GeckoAP data.

Getting Started

Access Credentials

To access the MongoDB instance you need credentials in the form of a connection string. It has the general structure: mongodb://<username>:<password>@<hostname>:<port>. The :<port> segment is optional. If omitted, the driver defaults to port 27017.

Make sure the connection string provided to you follows this format. By default the credentials given will be read-only, meaning you can query and retrieve data but cannot modify or delete existing records.
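The format above can be assembled and inspected with the Python standard library. The sketch below is only an illustration: the helper names build_mongo_uri and effective_port are hypothetical, and the username, password, and hostname are placeholders. It percent-encodes the username and password, which is needed when they contain characters such as @, : or /.

```python
from urllib.parse import quote_plus, urlsplit

DEFAULT_MONGODB_PORT = 27017


def build_mongo_uri(username, password, hostname, port=None):
    """Assemble mongodb://<username>:<password>@<hostname>[:<port>]."""
    user = quote_plus(username)  # percent-encode special characters
    pwd = quote_plus(password)
    host = hostname if port is None else f"{hostname}:{port}"
    return f"mongodb://{user}:{pwd}@{host}"


def effective_port(uri):
    """Return the port in the URI, or the default when it is omitted."""
    port = urlsplit(uri).port
    return DEFAULT_MONGODB_PORT if port is None else port


# Placeholder credentials, not real ones.
example_uri = build_mongo_uri("reader", "p@ss/word", "db.example.org")
```

A connection string that you received ready-made should already be encoded, so it does not need to pass through build_mongo_uri.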

For security, never hard-code your connection string in your source code. Store it as an environment variable in a .env file at your project root:

MONGO_URI=mongodb://<username>:<password>@<hostname>

Warning

Remember to add the .env file to your .gitignore file before pushing your project to any remote repository.

To use these credentials in Python, load them with the python-dotenv library, which can be installed with pip install python-dotenv. Then read the environment variable as follows:

import os
from dotenv import load_dotenv

load_dotenv()
uri = os.getenv('MONGO_URI')
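Note that os.getenv returns None when the variable is missing, which would only surface later as a less obvious connection error. One possible way to fail early is sketched below; the helper name require_env is hypothetical.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check your .env file."
        )
    return value
```

After calling load_dotenv(), the connection string can then be read with uri = require_env("MONGO_URI").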

Connecting to the Database

MongoDB Compass and Shell

MongoDB Compass provides a Graphical User Interface (GUI) while MongoDB Shell (not to be confused with MongoDB CLI) provides a terminal-based interface for data interaction. Both support direct data access via queries and advanced data processing via aggregation pipelines.

MongoDB Compass:

Download: MongoDB Compass Download
Connection Guide: How to Connect
Official Documentation: Compass Docs

MongoDB Shell:

Download: MongoDB Shell Download
Connection Guide: How to Connect
Official Documentation: Shell Docs

Within Python

You can connect to the MongoDB instance within Python by using the PyMongo library. To install it with pip, use pip install pymongo. Then, to create a connection and access a specific collection within a specific database, do the following:

from pymongo import MongoClient

db_name = "database_name"
collection_name = "collection_name"

client = MongoClient(uri)
db = client[db_name]
collection = db[collection_name]

Here the variable uri is the connection string read from the environment variable as described in the section Access Credentials.
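MongoClient does not contact the server until the first operation, so a wrong hostname or an unreachable server only shows up later. A common way to verify the connection up front is to send a ping command. The following is a sketch under that assumption; the helper name open_collection and the 5 second timeout are illustrative choices, not project settings.

```python
def open_collection(uri, db_name, collection_name, timeout_ms=5000):
    """Connect, verify the server is reachable, and return (client, collection)."""
    # Imported inside the function so that defining it does not require PyMongo.
    from pymongo import MongoClient
    from pymongo.errors import ConnectionFailure

    # serverSelectionTimeoutMS limits how long the driver waits for a server.
    client = MongoClient(uri, serverSelectionTimeoutMS=timeout_ms)
    try:
        # The ping command forces a round trip to the server.
        client.admin.command("ping")
    except ConnectionFailure as exc:
        client.close()
        raise RuntimeError("Could not reach the MongoDB instance") from exc
    return client, client[db_name][collection_name]
```

Call client.close() once you are done with the connection.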

If you are not sure what the available database names are within the MongoDB instance, you can list them with the following:

print(client.list_database_names())

If you are not sure what the available collection names are within a specific database, you can list them with the following:

print(db.list_collection_names())

The rest of the documentation will describe how typical tasks are achieved using PyMongo.

Full Official Documentation: PyMongo Docs

Basics of Querying Data

A general query can be defined as a dictionary where the keys correspond to fields within the collection and values to dictionaries of restrictions on those fields. Each query can include multiple fields and each field can have multiple restrictions. The restriction dictionary keys are query operators and values the operands of the operators.

To run the query we use the find method applied on a collection. The output will be a cursor that can be iterated over to retrieve documents matching the query as dictionaries.
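A cursor fetches documents lazily and can only be iterated once. While exploring a large collection it is usually convenient to cap the number of returned documents. A minimal sketch, where the helper name preview is hypothetical and collection is a PyMongo collection object:

```python
def preview(collection, query, n=5):
    """Return at most n matching documents as a list of dictionaries."""
    # limit() is applied before any documents are fetched from the server.
    cursor = collection.find(query).limit(n)
    # Materialise the cursor; it is exhausted after this single pass.
    return list(cursor)
```

For example, preview(collection, {}) returns up to five documents of the collection.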

Query Operators

Comparison Operators

These operators allow you to filter documents based on value comparisons.

Operator	Name	Description
`$eq`	Equal	Matches values that are equal to a specified value.
`$ne`	Not Equal	Matches all values that are not equal to a specified value.
`$gt`	Greater Than	Matches values that are greater than a specified value.
`$gte`	Greater Than or Equal	Matches values that are greater than or equal to a specified value.
`$lt`	Less Than	Matches values that are less than a specified value.
`$lte`	Less Than or Equal	Matches values that are less than or equal to a specified value.

Logical Operators

These operators allow you to combine multiple query conditions.

Operator	Name	Description
`$and`	Logical AND	Returns all documents that match the conditions of all clauses.
`$or`	Logical OR	Returns all documents that match the conditions of at least one clause.
`$nor`	Logical NOR	Returns all documents that match none of the clauses.
`$not`	Logical NOT	Inverts the effect of a query expression.
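As an illustration of how these operators are written as Python dictionaries, the sketch below builds three queries against fields of the molecules collection. The thresholds are arbitrary values chosen for the example.

```python
# $or takes a list of clauses; a document matches if any clause matches.
few_or_many = {
    "$or": [
        {"num_conformers": {"$lt": 10}},
        {"isomer_count": {"$gt": 4}},
    ]
}

# $nor matches documents for which none of the clauses hold.
neither = {
    "$nor": [
        {"source": "addendum"},
        {"valid_flexibility": {"$lte": 0}},
    ]
}

# $not wraps an operator expression for a single field. It also matches
# documents in which that field is missing.
not_large = {"num_conformers": {"$not": {"$gt": 500}}}
```

Each dictionary can be passed to collection.find in the same way as any other query.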

Note

By default, conditions without an explicit logical operator are combined with an implicit AND. For example, the query

{"num_conformers": {"$gt": 100, "$lt": 500}}

is implicitly expanded into

{"$and": [{"num_conformers": {"$gt": 100}}, {"num_conformers": {"$lt": 500}}]}

Projecting Data

By default the find method returns all the fields of the matching documents. We can, however, specify which fields should be included in or excluded from the resulting cursor to save bandwidth and memory. This is called projecting the data.

The projection is a dictionary where the keys are the names of the fields and the values are 1 for fields to include and 0 for fields to exclude. This dictionary is then given as the second argument to the find method.

Note

The ID field _id, which MongoDB generates automatically, is included in the results unless explicitly excluded.

Important

Each projection can consist only of field inclusions or only of field exclusions, not both. The exception is the _id field, which can be excluded from a projection that otherwise consists of inclusions.
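The rule above can be checked on the client side before a query is sent. The sketch below covers only plain 1/0 projections and is not part of PyMongo; the helper name check_projection is hypothetical.

```python
def check_projection(projection):
    """Raise ValueError if a 1/0 projection mixes inclusions and exclusions."""
    # _id is exempt from the rule, so it is left out of the comparison.
    modes = {bool(flag) for field, flag in projection.items() if field != "_id"}
    if len(modes) > 1:
        raise ValueError(
            "A projection cannot mix included (1) and excluded (0) fields"
        )
    return projection


# Inclusion projection with _id excluded: allowed.
ok = check_projection({"_id": 0, "smiles": 1, "num_conformers": 1})
```

A projection such as {"smiles": 1, "conformers": 0} would raise a ValueError here, mirroring the error the server is expected to report for it.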

Example queries

Tip

For more in-depth examples, take a look at the notebooks within the examples directory.

The simplest query is the empty dictionary which will return all the documents within the collection:

results = collection.find({})

The following query finds all the documents within the collection, but projects to include only the fields molecule_index and valid_flexibility:

projection = {
    "_id": 0,
    "molecule_index": 1,
    "valid_flexibility": 1
}

results = collection.find({}, projection)

The following query finds all the documents with the field num_conformers within the range $[100, 500]$:

query = {
    "num_conformers": {"$gte": 100, "$lte": 500}
}

results = collection.find(query)

The following query finds all the documents with the field source not equal to "addendum", field isomer_count equal to $1$, and field valid_flexibility greater than $0$, and projects to include only the fields molecule_index, valid_flexibility, and num_conformers:

query = {
    "source": {"$ne": "addendum"},
    "isomer_count": {"$eq": 1},
    "valid_flexibility" : {"$gt": 0}
}

projection = {
    "_id": 0,
    "molecule_index": 1,
    "valid_flexibility": 1,
    "num_conformers": 1
}

results = collection.find(query, projection)
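The cursor returned by find can also be sorted and limited before it is iterated. As a sketch, the hypothetical helper below returns the n documents with the largest (or smallest) value of a field, together with their molecule_index. It assumes collection is a PyMongo collection such as molecules.

```python
def top_by_field(collection, field, n=10, descending=True):
    """Return the n documents with the highest (or lowest) value of a field."""
    # -1 and 1 are the values of pymongo.DESCENDING and pymongo.ASCENDING.
    direction = -1 if descending else 1
    projection = {"_id": 0, "molecule_index": 1, field: 1}
    cursor = collection.find({}, projection)
    # Sorting and limiting are applied on the server before data is returned.
    return list(cursor.sort(field, direction).limit(n))
```

For example, top_by_field(collection, "valid_flexibility", n=5) would return the five documents with the largest valid_flexibility.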
