The VILMA database is a collection of molecular databases relevant to the VILMA project. It is implemented as a document-oriented NoSQL MongoDB server instance. The instance hosts multiple databases, each constructed from one or more molecular datasets. Each database is in turn divided into one or more partitions called collections. Because the database is NoSQL, no collection has a predefined schema (i.e. a strict structure); instead, each collection is a data store of BSON (Binary JSON) documents.
Thus the general hierarchy of the data can be described as: Instance (the host server) → Database (constructed from some dataset(s)) → Collection (some partition of the dataset(s)) → Document (each individual data element). Each document has one or more key-value pairs referred to as fields. In general it can be assumed that all of the documents within the same collection have the same fields, but in theory this is not guaranteed or enforced by default by the database engine.
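Since identical fields across a collection are not enforced, it can be useful to check a sample of documents before relying on a field. The following is a minimal sketch of such a check; the helper names `field_coverage` and `inconsistent_fields` and the sample documents are hypothetical and only stand in for a real query result.

```python
from collections import Counter


def field_coverage(documents):
    """Count in how many documents each top-level field appears."""
    counts = Counter()
    total = 0
    for doc in documents:
        total += 1
        counts.update(doc.keys())
    return total, counts


def inconsistent_fields(documents):
    """Return the fields that are missing from at least one document."""
    total, counts = field_coverage(documents)
    return {field: n for field, n in counts.items() if n < total}


# Hypothetical sample documents standing in for a query result.
sample = [
    {"molecule_index": 1, "smiles": "CCO", "num_conformers": 12},
    {"molecule_index": 2, "smiles": "CC=O"},
]
print(inconsistent_fields(sample))  # {'num_conformers': 1}
```

Any iterable of dictionaries works as input, so a PyMongo cursor over a limited number of documents could be passed in place of `sample`.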
The following sections describe the general steps for connecting to the databases within the VILMA MongoDB instance and how to access the data within specific collections with the Python client.
This database has been constructed by parsing the GeckoQ and its addendum datasets.
Core Data
- molecules: Parent molecules extracted from the GeckoQ data and its addendum.
- molecule_isomers: Isomers of the parent molecules extracted from the GeckoQ data and its addendum.
Machine Learning (GNN Project)
- ml_test: Test partition for the GNN project.
- ml_train: Train partition for the GNN project.
- ml_val: Validation partition for the GNN project.
- test_holdout: Test holdout partition for the GNN project.
- ml_metadata: Metadata document for the data partitions for the GNN project.
- local_development: Subset of 1000 molecules for local development for the GNN project.
System & Logs
- molecule_log: Log documents related to parsing the data to the database.
Collection: molecules

| Field | Type | Description |
|---|---|---|
| _id | ObjectId | Unique identifier for the document generated by MongoDB. |
| molecule_index | Int32 | Unique identifier for the molecule. |
| smiles | String | Canonical SMILES string. |
| ref_structure | Array | 3D reference structure. |
| flexibility | Double | Initial flexibility in Hartrees based on all conformers from GeckoQ and H-bonds information from COSMO. |
| min_h_index_conformers_idxs | Array | The ids of the conformers with minimum number of hydrogen bonds. |
| conformers | Array | All the conformers of the molecule including 3D structure and metadata. |
| source | String | The original source of the data (original GeckoQ or addendum). |
| processing_metadata | Object | Processing metadata. |
| reestimated_flexibility | Double | Flexibility in Hartrees recalculated with valid conformers and improved algorithms for H-bond estimation. |
| valid_conformer_ids | Array | Ids of the chemically valid conformers. |
| valid_flexibility | Double | Flexibility in Hartrees calculated using only conformers that pass validation criteria (e.g., have valid internal bonds). |
| num_conformers | Int32 | Total number of conformers. |
| isomer_count | Int32 | Total number of isomers of the molecule within the molecule_isomers collection. |
| psat_COSMO_atm | Double | Saturation vapor pressure in standard atmospheres calculated by COSMOtherm. |
| psat_SIMPOL_atm | Double | Saturation vapor pressure in standard atmospheres calculated by SIMPOL. |
Collection: molecule_isomers

| Field | Type | Description |
|---|---|---|
| _id | ObjectId | Unique identifier for the document generated by MongoDB. |
| isomer_id | Int32 | Unique identifier for the isomer. |
| conformers | Array | All the conformers of the molecule including 3D structure and metadata. |
| flexibility | Double | Flexibility in Hartrees. |
| is_sole_isomer | Boolean | True if the isomer is the only isomer of the parent molecule; otherwise, false. |
| parent_molecule_index | Int32 | Id of the parent molecule within the molecules collection. |
| ref_structure | Array | 3D reference structure. |
| smiles | String | Canonical SMILES string of the parent molecule. |
| stereo_smiles | String | Stereo SMILES string of the isomer. |
| num_conformers | Int32 | Total number of conformers. |
| psat_COSMO_atm | Double | Saturation vapor pressure in standard atmospheres calculated by COSMOtherm. |
| psat_SIMPOL_atm | Double | Saturation vapor pressure in standard atmospheres calculated by SIMPOL. |
This database has been constructed by parsing the filtered GeckoAP datasets.
Core Data
- molecules: Molecules extracted from the filtered GeckoAP data.
To access the MongoDB instance you need credentials in the form of a connection string. It has the general structure mongodb://<username>:<password>@<hostname>:<port>. The :<port> segment is optional; if omitted, the driver defaults to port 27017.
Make sure the connection string provided to you follows this format. By default the credentials given will be read-only, meaning you can query and retrieve data but cannot modify or delete existing records.
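If a connection string ever has to be assembled from its parts, reserved characters such as `@`, `:` or `/` in the username or password must be percent-encoded, otherwise the string no longer follows the format above. The sketch below shows one way to do this with the standard library; the function name `build_mongo_uri` and all example values are hypothetical placeholders, not real credentials.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, hostname, port=None):
    """Assemble a MongoDB connection string from its parts."""
    # Percent-encode reserved characters in the credentials.
    user = quote_plus(username)
    pwd = quote_plus(password)
    # The port segment is optional.
    host = hostname if port is None else f"{hostname}:{port}"
    return f"mongodb://{user}:{pwd}@{host}"


# Hypothetical placeholder values, for illustration only.
print(build_mongo_uri("reader", "p@ss:word/1", "db.example.org"))
# mongodb://reader:p%40ss%3Aword%2F1@db.example.org
```

In practice the parts would themselves come from the environment rather than from literals in the source code.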
For security, never hard-code your connection string in your source code. Store it as an environment variable in a .env file at your project root:
MONGO_URI=mongodb://<username>:<password>@<hostname>
Warning
Remember to add the .env file to your .gitignore file before pushing your project to any remote repository.
To use these credentials in Python, use the python-dotenv library. To install it with pip, use pip install python-dotenv. Then to access the environment variable do:
import os
from dotenv import load_dotenv

# Read the variables defined in the .env file into the process environment.
load_dotenv()
# Connection string; None if MONGO_URI is not defined.
uri = os.getenv('MONGO_URI')

MongoDB Compass provides a graphical user interface (GUI), while MongoDB Shell (not to be confused with MongoDB CLI) provides a terminal-based interface for data interaction. Both support direct data access via queries and advanced data processing via aggregation pipelines.
MongoDB Compass:
- Download: MongoDB Compass Download
- Connection Guide: How to Connect
- Official Documentation: Compass Docs
MongoDB Shell:
- Download: MongoDB Shell Download
- Connection Guide: How to Connect
- Official Documentation: Shell Docs
You can connect to the MongoDB instance within Python by using the PyMongo library. To install it with pip, use pip install pymongo. Then, to create a connection and access a specific collection within a specific database, do the following:
from pymongo import MongoClient

# Replace with the names of the database and collection you want to access.
db_name = "database_name"
collection_name = "collection_name"

# Connect to the instance, then select the database and the collection.
client = MongoClient(uri)
db = client[db_name]
collection = db[collection_name]

Here the variable uri is the connection string read from the environment variable as described under the section Access Credentials.
If you are not sure what the available database names are within the MongoDB instance, you can list them with the following:
print(client.list_database_names())

If you are not sure what the available collection names are within a specific database, you can list them with the following:
print(db.list_collection_names())

The rest of the documentation will describe how typical tasks are achieved using PyMongo.
- Full Official Documentation: PyMongo Docs
A general query can be defined as a dictionary where the keys correspond to fields within the collection and the values are dictionaries of restrictions on those fields. Each query can include multiple fields and each field can have multiple restrictions. In each restriction dictionary, the keys are query operators and the values are the operands of those operators.
To run the query we use the find method applied on a collection. The output will be a cursor that can be iterated over to retrieve documents matching the query as dictionaries.
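As a minimal sketch of running a query and consuming the cursor, the helper below returns the first few matching documents as a list. The name `preview_documents` is hypothetical, and `collection` is assumed to be a PyMongo collection object obtained as described earlier; `limit` is a keyword argument of PyMongo's `find`.

```python
def preview_documents(collection, query, n=5):
    """Return at most n documents matching query as a list of dictionaries."""
    # find() returns a cursor; documents are only fetched while iterating.
    # Note that in PyMongo limit=0 means "no limit".
    cursor = collection.find(query, limit=n)
    return [doc for doc in cursor]
```

Collecting the results into a list is convenient for small previews; for large result sets it is usually better to iterate over the cursor directly so that not all documents are held in memory at once.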
These operators allow you to filter documents based on value comparisons.
| Operator | Name | Description |
|---|---|---|
| $eq | Equal | Matches values that are equal to a specified value. |
| $ne | Not Equal | Matches all values that are not equal to a specified value. |
| $gt | Greater Than | Matches values that are greater than a specified value. |
| $gte | Greater Than or Equal | Matches values that are greater than or equal to a specified value. |
| $lt | Less Than | Matches values that are less than a specified value. |
| $lte | Less Than or Equal | Matches values that are less than or equal to a specified value. |
These operators allow you to combine multiple query conditions.
| Operator | Name | Description |
|---|---|---|
| $and | Logical AND | Returns all documents that match the conditions of all clauses. |
| $or | Logical OR | Returns all documents that match the conditions of at least one clause. |
| $nor | Logical NOR | Returns all documents that match none of the clauses. |
| $not | Logical NOT | Inverts the effect of a query expression. |
Note
By default, expressions without an explicit logical operator are combined with an implicit AND. For example, the query
{"num_conformers": {"$gt": 100, "$lt": 500}}
is implicitly expanded into
{"$and": [{"num_conformers": {"$gt": 100}}, {"num_conformers": {"$lt": 500}}]}
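To make the meaning of the comparison and logical operators concrete, the following sketch evaluates a query dictionary against plain Python dictionaries in memory. It is only an illustration under simplifying assumptions (every queried field is present and holds a single scalar value, and every field restriction is written as an operator dictionary) and is not how MongoDB evaluates queries; the names `matches` and `field_matches` are hypothetical.

```python
import operator

# Comparison operators mapped to their Python equivalents.
COMPARISONS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def field_matches(value, condition):
    """Check one field value against its restriction dictionary."""
    for op, operand in condition.items():
        if op == "$not":
            ok = not field_matches(value, operand)
        else:
            ok = COMPARISONS[op](value, operand)
        if not ok:
            return False
    return True


def matches(doc, query):
    """Return True if doc satisfies query (simplified semantics)."""
    # Every top-level entry must hold: this is the implicit AND.
    for key, condition in query.items():
        if key == "$and":
            ok = all(matches(doc, clause) for clause in condition)
        elif key == "$or":
            ok = any(matches(doc, clause) for clause in condition)
        elif key == "$nor":
            ok = not any(matches(doc, clause) for clause in condition)
        else:
            ok = field_matches(doc[key], condition)
        if not ok:
            return False
    return True


# Hypothetical documents; the implicit and explicit forms select the same ones.
docs = [{"num_conformers": n} for n in (50, 100, 250, 500, 900)]
implicit = {"num_conformers": {"$gt": 100, "$lt": 500}}
explicit = {"$and": [{"num_conformers": {"$gt": 100}},
                     {"num_conformers": {"$lt": 500}}]}
print([d["num_conformers"] for d in docs if matches(d, implicit)])  # [250]
print([d["num_conformers"] for d in docs if matches(d, explicit)])  # [250]
```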
By default the find method returns all the fields of the matching documents. We can, however, specify which fields should be included in or excluded from the resulting cursor to save bandwidth and memory. This is called projecting the data.
The projection is a dictionary where the keys are the names of the fields and the values are 1 for fields to include and 0 for fields to exclude. This dictionary is then given as the second argument to the find method.
Note
The default ID field _id automatically generated by MongoDB is always included by default unless explicitly excluded.
Important
Each projection can only consist of field inclusions or field exclusions, not both. The exception is the _id field, which can be excluded from an otherwise inclusion-only projection.
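The rules above can be captured in a small helper that builds the projection dictionary and refuses to mix inclusions with exclusions. This is a sketch only; the function name `make_projection` and its parameters are hypothetical and not part of PyMongo.

```python
def make_projection(include=None, exclude=None, keep_id=True):
    """Build a projection dictionary from field names.

    Pass either include or exclude, not both. Set keep_id=False to drop
    the _id field, which is otherwise returned by default.
    """
    include = list(include or [])
    exclude = list(exclude or [])
    if include and exclude:
        raise ValueError("A projection cannot mix inclusions and exclusions.")
    projection = {field: 1 for field in include}
    projection.update({field: 0 for field in exclude})
    if not keep_id:
        # Excluding _id is allowed even in an inclusion projection.
        projection["_id"] = 0
    return projection


print(make_projection(include=["molecule_index", "valid_flexibility"], keep_id=False))
# {'molecule_index': 1, 'valid_flexibility': 1, '_id': 0}
```

The resulting dictionary can be passed as the second argument to find.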
Tip
For more in-depth examples, take a look at the notebooks within the examples directory.
- The simplest query is the empty dictionary which will return all the documents within the collection:

results = collection.find({})

- The following query finds all the documents within the collection, but projects to include only the fields molecule_index and valid_flexibility:

projection = {
    "_id": 0,
    "molecule_index": 1,
    "valid_flexibility": 1
}
results = collection.find({}, projection)

- The following query finds all the documents with the field num_conformers within the range $[100, 500]$:

query = {
    "num_conformers": {"$gte": 100, "$lte": 500}
}
results = collection.find(query)

- The following query finds all the documents with the field source not equal to "addendum", field isomer_count equal to $1$, and field valid_flexibility greater than $0$, and projects to include only the fields molecule_index, valid_flexibility, and num_conformers:

query = {
    "source": {"$ne": "addendum"},
    "isomer_count": {"$eq": 1},
    "valid_flexibility": {"$gt": 0}
}
projection = {
    "_id": 0,
    "molecule_index": 1,
    "valid_flexibility": 1,
    "num_conformers": 1
}
results = collection.find(query, projection)
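The examples above rely on the implicit AND only. As a further sketch, an explicit logical operator can be combined with an ordinary field restriction: the query below selects documents where either of the two saturation vapor pressure estimates is below a threshold and which also have a minimum number of conformers. The function name `low_volatility_query` and the threshold values are hypothetical and chosen only for illustration.

```python
def low_volatility_query(threshold_atm, min_conformers):
    """Build a query and projection for molecules with low vapor pressure."""
    query = {
        # At least one of the two estimates must be below the threshold ...
        "$or": [
            {"psat_COSMO_atm": {"$lt": threshold_atm}},
            {"psat_SIMPOL_atm": {"$lt": threshold_atm}},
        ],
        # ... and (implicit AND) the molecule must have enough conformers.
        "num_conformers": {"$gte": min_conformers},
    }
    projection = {
        "_id": 0,
        "molecule_index": 1,
        "psat_COSMO_atm": 1,
        "psat_SIMPOL_atm": 1,
    }
    return query, projection


# Hypothetical values for illustration.
query, projection = low_volatility_query(threshold_atm=1e-10, min_conformers=10)
# With a connected collection: results = collection.find(query, projection)
```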