Alex H. Yang

Scraping Reddit, part 2

2021-04-09T00:00:00-05:00

The last post dealt with using pushshift and handling requests to access posts and comments from Reddit. This post deals with using the Python Reddit API Wrapper (praw) to access posts and comments from Reddit, then applying some NLP tools for basic sentiment analysis.

There is some work to set up an application to use praw with OAuth, but it is straightforward enough for anyone who’s just using this as a script.

After setting up the praw application, we can build up a small pipeline:

Use praw to download posts and comments from r/nba
Format them into a dataframe
Use huggingface and spacy for sentiment analysis

from dataclasses import dataclass
import itertools as it
from functools import reduce, partial
import datetime as dt

import pandas as pd
pd.set_option('display.max_colwidth', 150)
import praw
from praw.models import MoreComments
import matplotlib.pyplot as plt
import hfapi
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load("en_core_web_sm")
spacy_text_blob = SpacyTextBlob()
nlp.add_pipe(spacy_text_blob)

client = hfapi.Client()

reddit = praw.Reddit("bot1") # Pulls from praw.ini file
rnba = reddit.subreddit('nba')

Compiling praw objects into a dataframe

@dataclass
class RedditSubmission:
    title: str 
    body: str 
    permalink: str 
    author: str 
    score: float
    timestamp: dt.datetime
    
    def to_dict(self):
        return {
            'title': self.title,
            'body': self.body,
            'permalink': self.permalink,
            'author': self.author,
            'score': self.score,
            'timestamp': self.timestamp
        }
    
    @classmethod
    def from_praw_submission(
        cls,
        praw_submission: praw.models.Submission
    ):      
        return cls(
            praw_submission.title,
            praw_submission.selftext,
            praw_submission.permalink,
            praw_submission.author,
            praw_submission.score,
            dt.datetime.fromtimestamp(praw_submission.created_utc)
        )
        
@dataclass
class RedditComment:                                                                  
    body: str
    permalink: str
    author: str
    score: float
    timestamp: dt.datetime

    def to_dict(self):
        return {
            'body': self.body,
            'permalink': self.permalink,
            'author': self.author,
            'score': self.score,
            'timestamp': self.timestamp
        }

    @classmethod
    def from_praw_comment(
        cls,
        praw_comment: praw.models.Comment
    ):
        return cls(
            praw_comment.body,
            praw_comment.permalink,
            praw_comment.author,
            praw_comment.score,
            dt.datetime.fromtimestamp(praw_comment.created_utc)
        )

        
def process_submission_from_praw(praw_submission_generator):
    for praw_submission in praw_submission_generator:
        yield RedditSubmission.from_praw_submission(praw_submission)
        
def process_comment_from_praw_submission(praw_submission_generator):
    for praw_submission in praw_submission_generator:
        for praw_comment in praw_submission.comments:
            if isinstance(praw_comment, MoreComments):
                continue
            else:
                yield RedditComment.from_praw_comment(praw_comment)

praw_submission_generator1 = rnba.hot(limit=100)
praw_submission_generator2 = rnba.hot(limit=100)

submissions = process_submission_from_praw(praw_submission_generator1)
comments = process_comment_from_praw_submission(praw_submission_generator2)

submission_df = pd.DataFrame(a.to_dict() for a in submissions)
comment_df = pd.DataFrame(a.to_dict() for a in comments)
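As a side note, the hand-written to_dict() methods could probably be replaced with dataclasses.asdict from the standard library. Below is a minimal sketch using a hypothetical CommentRecord class with made-up values (not the praw-backed classes above). One caveat: asdict deep-copies field values, so it may be safer to convert praw's author object to a plain string first.

```python
from dataclasses import dataclass, asdict
import datetime as dt


@dataclass
class CommentRecord:
    """Hypothetical stand-in for the RedditComment class above."""
    body: str
    permalink: str
    author: str
    score: float
    timestamp: dt.datetime


record = CommentRecord(
    body="Best scorer on the Bulls since MJ",
    permalink="/r/nba/comments/example",  # made-up permalink
    author="example_user",                # made-up author
    score=120,
    timestamp=dt.datetime(2021, 4, 9, 12, 0, 0),
)

# asdict walks the dataclass fields and returns a plain dict,
# the same shape that the hand-written to_dict() produces.
row = asdict(record)
print(sorted(row))
```

A list of such dicts can be handed to pd.DataFrame in the same way as the to_dict() output.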

Using huggingface for sentiment analysis

Specifically, using huggingface api

def classification_single_body(client, sentence):
    classification = client.text_classification(sentence)
    if 'error' in classification:
        return None, None
    neg_sentiment, pos_sentiment = classification[0]

    return neg_sentiment['score'], pos_sentiment['score']

def classification_multiple_body(client, bunch_of_sentences, colnames=None):
    if colnames is None:
        colnames = ['negative_score', 'positive_score']
    df = pd.DataFrame(
        map(lambda x: classification_single_body(client, x), bunch_of_sentences),
        columns=colnames
    )

    
    return df

client = hfapi.Client()
classification_multiple_bodies_partial = partial(classification_multiple_body, client)

submission_df = pd.concat([
    submission_df, classification_multiple_bodies_partial(submission_df['title'].to_list())
], axis=1)

Scoring the submissions, here’s a title with an appropriately positive score: “Nikola Jokic leads the league in offensive win shares at 8.9. This is also more than any player’s OVERALL win shares for the current season.”

Here’s a title that is scored as incredibly negative but in reality is pretty positive: “Kyrie Irving needs one more 3 point make to enter the 50-40-90 club for the 2020-2021 season”. Being even close to the 50-40-90 club is incredible.

submission_df.sort_values("negative_score")[['title', 'score', 'negative_score', 'positive_score']]

	title	score	negative_score	positive_score
19	[Orsborn]: Mike Malone on Pop still going strong at 72: "For him to be as engaged and as locked in and as committed as he is at this juncture of h...	241	0.000185	0.999816
8	Kevin Durant: “Stephen Curry and Klay Thompson are the best shooters I’ve played with.”	1610	0.000185	0.999815
12	[Thinking Basketball] The 10 Best NBA peaks since 1977	1346	0.000283	0.999717
25	[Highlight] Russell banks in the 3 to tie it at 124	92	0.000615	0.999385
23	Nikola Jokic leads the league in offensive win shares at 8.9. This is also more than any player's OVERALL win shares for the current season.	406	0.000845	0.999155
...	...	...	...	...
15	Charles Barkley: "I've been poor, I've been rich, I've been fat, I've been in the Hall of Fame, and one thing I can tell you is that the Clippers ...	23341	0.999229	0.000771
38	Kyrie Irving needs one more 3 point make to enter the 50-40-90 club for the 2020-2021 season	443	0.999282	0.000718
75	[Stein] The Bucks' too-long-to-list-it-all injury report tonight against Charlotte includes no Giannis Antetokounmpo (left knee soreness) or Jrue ...	43	0.999286	0.000714
40	Bucks missing all five starters against Hornets	79	0.999449	0.000551
93	China’s Forced-Labor Backlash Threatens to Put N.B.A. in Unwanted Spotlight	174	0.999517	0.000483

100 rows × 4 columns

I think we were querying the API too quickly, so these responses started timing out, but you get the idea here.

comment_df = pd.concat([
    comment_df, classification_multiple_bodies_partial(comment_df['body'].to_list())
], axis=1)
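One way to soften the timeouts mentioned above might be to retry with an increasing delay between attempts. This is only a sketch: classify_with_retry and flaky_classifier are hypothetical helpers, and the "error payload is a dict with an 'error' key" convention is assumed from the check in classification_single_body rather than taken from any API documentation.

```python
import time


def classify_with_retry(classify, sentence, max_attempts=4,
                        base_delay=1.0, sleep=time.sleep):
    """Call classify(sentence), backing off exponentially on error payloads.

    An "error payload" is assumed to be a dict containing an 'error' key.
    Returns None if every attempt fails.
    """
    for attempt in range(max_attempts):
        result = classify(sentence)
        if not (isinstance(result, dict) and "error" in result):
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    return None


# Demonstration with a fake classifier that fails twice, then succeeds.
calls = {"n": 0}


def flaky_classifier(sentence):
    calls["n"] += 1
    if calls["n"] < 3:
        return {"error": "timeout"}
    # Made-up labels and scores, shaped like the nested list unpacked above.
    return [[{"label": "NEGATIVE", "score": 0.1},
             {"label": "POSITIVE", "score": 0.9}]]


delays = []  # record the requested delays instead of actually sleeping
result = classify_with_retry(flaky_classifier, "Great game!", sleep=delays.append)
```

In real use the sleep argument would be left at its default, and classify would be something like client.text_classification.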

Using spacy for sentiment analysis

submission_df['title_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(submission_df['title']))]
submission_df['body_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(submission_df['body']))]
comment_df['body_sentiment'] = [*map(lambda x: x._.sentiment.polarity, nlp.pipe(comment_df['body']))]

Here’s a simple title to score “Kevin Durant: “Stephen Curry and Klay Thompson are the best shooters I’ve played with.””

submission_df[['title', 'score', 'title_sentiment']].sort_values("title_sentiment")

	title	score	title_sentiment
99	The Mavs will play 3 back-to-backs over a 7 game span to start April. Over April and May, 62% of their games will be part of a b2b	15	-0.400000
83	[Post Game Thread] The Los Angeles Clippers (35-18) defeat the Phoenix Suns (36-15), 113 - 103	727	-0.400000
43	[Post Game Thread] The Boston Celtics (27-26) defeat the Minnesota Timberwolves (13-40) in OT, 145 - 136	49	-0.400000
91	[Post Game Thread] The Dallas Mavericks (29-22) defeat the Milwaukee Bucks (32-19), 116 - 101	754	-0.400000
37	The Denver Nuggets came onto the floor for their game against the Spurs with "X Gon' Give it to Ya" playing in the background	88	-0.400000
...	...	...	...
19	[Orsborn]: Mike Malone on Pop still going strong at 72: "For him to be as engaged and as locked in and as committed as he is at this juncture of h...	241	0.505556
18	Steve Kerr on leaving the Warriors: “I have a great job right now. I love coaching the Warriors, so I'm not going anywhere.”	465	0.528571
84	[Highlight] Cody Zeller perfectly blocks Sam Merrill's layup off the backboard	15	1.000000
8	Kevin Durant: “Stephen Curry and Klay Thompson are the best shooters I’ve played with.”	1610	1.000000
12	[Thinking Basketball] The 10 Best NBA peaks since 1977	1346	1.000000

100 rows × 3 columns

I want to point out one comment, “Goes off 😎😎 in OT ⌛⌛ against the worst team in the league 🐺🐺”, which has a negative sentiment, probably because of the words “off” and “worst”, but the sentence itself is more positive because it’s about a player performing very well.

comment_df[['body', 'score', 'body_sentiment']].sort_values("body_sentiment")

	body	score	body_sentiment
2480	he has some of the worst luck with injuries.	591	-1.0
118	I tea bagged your fucking drum set!!!	3	-1.0
2081	RIP to the insane plus/minus of the Spurs bench	71	-1.0
1379	Goes off 😎😎 in OT ⌛⌛ against the worst team in the league 🐺🐺	1	-1.0
1287	fucking disgusting	1	-1.0
...	...	...	...
2270	Perfect.... boost his confidence, while we continue to tank	5	1.0
273	It’s almost like he’s one of the best point guards of all time!	2	1.0
31	Best scorer on the Bulls since MJ	120	1.0
1632	Remember when DSJ was like the mavs best player? What a time	1	1.0
436	I will zag and point out another thing here. KD doesn't want to outright say Steph is the greatest shooter ever. He needs to add Klay to this stat...	-1	1.0

3200 rows × 3 columns

Closing remarks

Thanks to praw, it was really easy to pull and gather raw data. On top of that, the plethora of available NLP software has made it really easy to apply these models to whatever context you want.

To really take this further, an important middle step would be data cleaning (correcting typos, slang, and abbreviations), and maybe filters or named entity resolution to look for specific players. You might also want some way to weight highly up-voted submissions and comments, or some way to combine the sentiments from both submissions and comments. Lastly, the big caveat in NLP for reddit is needing a language model sophisticated enough to capture the sarcasm, nuance, and toxicity of the reddit community (and specifically r/nba).
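As a rough illustration of the up-vote weighting idea, here is one possible scheme: a score-weighted average that gives down-voted items zero weight. The function name and the clipping rule are my own assumptions, and the numbers are toy values rather than real r/nba data.

```python
def weighted_sentiment(scores, sentiments):
    """Average sentiment, weighting each item by its non-negative Reddit score."""
    weights = [max(s, 0) for s in scores]  # down-voted items get zero weight
    total = sum(weights)
    if total == 0:
        return 0.0  # nothing to weight by, so fall back to neutral
    return sum(w * x for w, x in zip(weights, sentiments)) / total


scores = [120, 5, -1]           # toy up-vote scores
sentiments = [1.0, -1.0, 1.0]   # toy polarity values in [-1, 1]
overall = weighted_sentiment(scores, sentiments)
print(round(overall, 2))
```

Other choices, such as weighting by the log of the score, would damp the influence of a few very popular comments.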

Scraping Reddit, part 1

2021-02-01T00:00:00-06:00

In light of recent internet trends about retail investors, I’m sure many of us have questions about the kinds of content that gets posted on reddit, and whether there are home-grown, analytical ways of addressing these questions. I’ll be showing two ways of parsing Reddit submissions and comments. This one focuses on querying pushshift API endpoints with the requests library, some custom classes for processing the responses, and asyncio to run multiple requests to pushshift concurrently.

This code ran quickly on my Chromebook (dual-core, dual-thread, 1.90 GHz, 4 GB memory), but querying lots of data from pushshift makes some of the final cells take ~10 minutes.

Note: at the time of putting this together, parts of pushshift appear to be down for repair/upgrade, but at least the github repo is still online

Raw notebook here, but I didn’t bother adding an environment – most of these packages are in the python standard library or easily available on conda or pip

import pandas as pd
import requests
import datetime as dt
import asyncio
import io

At its core, we are submitting queries to a URL and getting responses to these queries. Technically speaking, this means we are submitting get requests to pushshift endpoints.

The endpoint generally takes the form of something like “https://api.pushshift.io/reddit/search/submission”, with the “payload” or params kwarg to our request being some set of search parameters (like a keyword, subreddit, or timestamp info), pushshift API parameters here. With this endpoint, we’re searching the Reddit submissions (not the comments)

One of the simpler payloads could be searching a subreddit within a particular time window. This requires before and after timestamps, which can easily be handled with python’s datetime library

today = dt.datetime.today().replace(hour=8, minute=0, second=0, microsecond=0).timestamp()
today_minus_seven = (dt.datetime.today().replace(hour=8, minute=0, second=0, microsecond=0) - 
                     dt.timedelta(days=7)).timestamp()
today_minus_eight = (dt.datetime.today().replace(hour=8, minute=0, second=0, microsecond=0) - 
                     dt.timedelta(days=8)).timestamp()

This is the actual GET request. Observe the URL as the main arg and the various search parameters in the params kwarg.

reddit_response = requests.get("https://api.pushshift.io/reddit/search/submission",
                              params={'subreddit': 'stocks',
                                      'before': int(today_minus_seven), 
                                      'after': int(today_minus_eight)})

reddit_response.status_code

There are a variety of ways to parse request responses, but here’s one way to parse the title and text from the response to a Reddit submission get request

reddit_response.json()['data'][0]['title'],reddit_response.json()['data'][0]['selftext'],

('Would it be wise to increase the geographical diversity of my portfolio?',
 'Hello everyone, \n\nMy portfolio of 16 companies consists of 13 US stocks because they all seem to have some of the highest potential returns but in the midst of the pandemic I feel I should reallocate some resources towards European and UK stocks. Is anyone watching any interesting non-US stocks at the moment?')

As a little bit of dressing on top, we can grab a list of stock tickers. There are a lot of sources to pull tickers from (yfinance is a popular one), but we can also pull a list of tickers from the SEC

ticker_response = requests.get("https://www.sec.gov/include/ticker.txt")

tickers = pd.read_csv(
    io.StringIO(ticker_response.text), 
    delimiter='\t', 
    header=None, 
    usecols=[0],
)[0].to_list()

tickers[:5]

['aapl', 'msft', 'amzn', 'goog', 'tcehy']

import string                                                                         
from typing import List, Union, Dict, Optional, Any 
from collections import Counter
from requests import Response
from dataclasses import dataclass

We have all the raw information contained within the request response object, but for data processing purposes, we can define a class and some functions to simplify the work.

Key characteristics:

- A corresponding python object property for each relevant property of a typical reddit submission. Unfortunately, the score property from pushshift isn’t the most reliable because it’s only a snapshot from when the data were indexed.
- summarize(), which uses collections.Counter to tally up how frequently a stock ticker appears
- to_dict() for serialization and conversion for pandas
- from_response() to quickly instantiate a List[RedditSubmission] from a single response

@dataclass
class RedditSubmission:
    title: str 
    body: str 
    permalink: str 
    author: str 
    score: float
    timestamp: dt.datetime

    def summarize(self, 
        tickers: List[str], 
        weighted: bool = True
    ) -> Dict[str, Union[float, int]]:
        """ Process RedditSubmission for tickers 
        
        Use a Counter to count the number of times a ticker occurs.
        Include some corrections for punctuation
        """
        if self.title is not None:
            title_no_punctuation = self.title.translate(
                str.maketrans('', '', string.punctuation)
            )
            tickers_title = Counter(
                filter(lambda x: x in tickers, title_no_punctuation.split())
            )
        else:
            tickers_title = Counter()
        if self.body is not None:
            body_no_punctuation = self.body.translate(
                str.maketrans('', '', string.punctuation)
            )

            tickers_body = Counter(
                filter(lambda x: x in tickers, body_no_punctuation.split())
            )
        else:
            tickers_body = Counter()
        total_tickers = tickers_title + tickers_body
        
        return total_tickers
    
    def to_dict(self):
        return {
            'title': self.title,
            'body': self.body,
            'permalink': self.permalink,
            'author': self.author,
            'score': self.score,
            'timestamp': self.timestamp
        }
    
    @classmethod
    def from_response(
        cls, 
        resp_object: Response
    ) -> Optional[List[Any]]:
        """ Create a list of RedditSubmission objects from response"""
        if resp_object.status_code == 200:
            processed_response = [
                cls(
                    msg.get("title", None),
                    msg.get("selftext", None),  # submission text lives under "selftext", not "body"
                    msg.get("permalink", None),
                    msg.get("author", None),
                    msg.get("score", None),
                    (
                        dt.datetime.fromtimestamp(msg['created_utc']) 
                        if msg['created_utc'] is not None else None
                    )
                ) for msg in resp_object.json()['data']
            ]
            return processed_response
        else:
            return None

In reality, there’s a decently-long wait time after we make the initial get request. The time to make and process the request is actually fairly quick, so this is a good opportunity to use python’s asyncio library.

Asyncio allows for concurrency in a different manner than multiprocessing or multithreading. You can have many tasks running, but only one is “controlling” the CPU, and gives up control when it’s not actively doing any work (like waiting for a response from the pushshift server).

The overall syntax is very similar to writing any other python function

async def submission_request_coroutine(**kwargs):
    await asyncio.sleep(5)
    reddit_response = requests.get("https://api.pushshift.io/reddit/search/submission",
                              params=kwargs)
    return reddit_response
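One caveat with the coroutine above: requests.get is a blocking call, so only the asyncio.sleep part actually yields control, and the HTTP requests themselves still run one after another. A possible workaround, sketched here with a fake blocking_fetch standing in for requests.get (no network involved), is to push the blocking call onto a worker thread with asyncio.to_thread, which requires Python 3.9 or newer.

```python
import asyncio
import time


def blocking_fetch(window_id):
    """Stand-in for requests.get(...): blocks the calling thread for a while."""
    time.sleep(0.2)  # pretend we are waiting on the server
    return {"window": window_id, "status": 200}


async def fetch_in_thread(window_id):
    # asyncio.to_thread runs the blocking call in a worker thread,
    # so the event loop stays free to start the other requests.
    return await asyncio.to_thread(blocking_fetch, window_id)


async def fetch_all(window_ids):
    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(fetch_in_thread(w) for w in window_ids))


start = time.perf_counter()
# In a notebook, use `await fetch_all(range(5))` instead of asyncio.run.
responses = asyncio.run(fetch_all(range(5)))
elapsed = time.perf_counter() - start
print(f"{len(responses)} responses in {elapsed:.2f}s")
```

Five sequential calls would take about a second here, while the threaded version should finish in roughly the time of a single call. An async HTTP client would be another route, but that is outside what this post uses.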

Define a range of timestamps, initialize an async coroutine for each timestamp, then use asyncio to submit each request and gather them back together

snapshots = pd.date_range(
    start=dt.datetime.now(tz=dt.timezone.utc) - dt.timedelta(days=7),
    end=dt.datetime.now(tz=dt.timezone.utc) - dt.timedelta(days=1),
    freq='10min'
)

tasks = [
    submission_request_coroutine(subreddit='stocks', 
                 after=int(snapshot.timestamp()),
                 before=int(snapshots[i+1].timestamp()),
                 size=10
                ) 
    for i, snapshot in enumerate(snapshots[:-1])
]
all_submission_responses = await asyncio.gather(
    *tasks
)

The data is a List[Response], which we can convert to a List[List[RedditSubmission]], then flatten into a List[RedditSubmission] with itertools

import itertools as it

reddit_submissions = [*it.chain.from_iterable(
    RedditSubmission.from_response(resp) for resp in all_submission_responses
    if resp.status_code == 200
)]

We can get a ticker counter for each RedditSubmission, but we’d like to quickly aggregate them all into a single summary ticker counter over all the reddit submissions in our time window. This can be easily achieved with functools.reduce

from functools import reduce
from collections import Counter

def aggregate_dictionaries(d1, d2):
    """ Given two dictionaries, aggregate key-value pairs """
    if len(d1) == 0:
        return dict(Counter(**d2).most_common())
    my_counter = Counter(**d1)
    my_counter.update(d2)
    return dict(my_counter.most_common())

submissions_breakdown = reduce(
    aggregate_dictionaries, 
    (submission.summarize(tickers) for submission in reddit_submissions)
)
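Since summarize() already returns Counter objects, a slightly shorter variant is possible, because Counters can be added directly. This is a sketch with toy counters rather than the real submissions. Passing an empty Counter as the initial value also avoids the TypeError that reduce raises on an empty sequence with no initial value.

```python
from collections import Counter
from functools import reduce
import operator

# Toy per-submission counters standing in for submission.summarize(tickers)
per_submission = [
    Counter({"gme": 2, "amc": 1}),
    Counter({"gme": 1}),
    Counter(),
    Counter({"nok": 3}),
]

# Counters support +, so reduce can add them pairwise; the Counter()
# initial value keeps this working when there are no submissions at all.
total = reduce(operator.add, per_submission, Counter())

# most_common() sorts by count, highest first
ranked = dict(total.most_common())
print(ranked)
```

Note that Counter addition drops zero and negative counts, which is fine for tallies like these.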

It seems the list of tickers from the SEC was pretty generous ($A appears to be a ticker), but we can subselect for some of the recent trending tickers

submissions_breakdown['gme'], submissions_breakdown['amc']

(18, 11)

submissions_breakdown

{'a': 322,
 'on': 234,
 'for': 181,
 'it': 105,
 'or': 76,
 'be': 76,
 'next': 71,
 'are': 62,
 'new': 56,
 'good': 54,
 'now': 53,
 'can': 52,
 'all': 49,
 'at': 45,
 'out': 40,
 'amp': 34,
 'an': 33,
 'by': 31,
 'go': 30,
 'has': 26,
 'am': 24,
 'any': 22,
 'when': 21,
 'best': 20,
 'vs': 20,
 'one': 19,
 'so': 18,
 'gme': 18,
 'big': 17,
 'free': 15,
 'play': 13,
 'apps': 13,
 'amc': 11,
 'cash': 10,
 'see': 10,
 'find': 9,
 'run': 8,
 'rise': 7,
 'else': 7,
 'ever': 7,
 'work': 6,
 'real': 6,
 'open': 6,
 'wall': 5,
 'fund': 5,
 'post': 5,
 'love': 5,
 'well': 5,
 'very': 5,
 'ago': 5,
 'info': 5,
 'plan': 5,
 'pay': 5,
 'bit': 5,
 'ride': 4,
 'life': 4,
 'huge': 4,
 'low': 4,
 'nok': 4,
 'grow': 4,
 'cap': 4,
 'link': 3,
 'safe': 3,
 'plus': 3,
 'fast': 3,
 'stay': 3,
 'tech': 3,
 'fun': 3,
 'he': 3,
 'step': 3,
 'turn': 3,
 'live': 3,
 'site': 3,
 'ways': 3,
 'hear': 2,
 'teva': 2,
 'bb': 2,
 'co': 2,
 'boom': 2,
 'nice': 2,
 'mass': 2,
 'peak': 2,
 'max': 2,
 'wash': 2,
 'pump': 2,
 'tell': 2,
 'fly': 2,
 'pros': 2,
 'rock': 1,
 'both': 1,
 'gt': 1,
 'loan': 1,
 'nga': 1,
 'invu': 1,
 'most': 1,
 'ofc': 1,
 'nio': 1,
 'spot': 1,
 'min': 1,
 'onto': 1,
 'evfm': 1,
 'blue': 1,
 'nat': 1,
 'pure': 1,
 'sign': 1,
 'man': 1,
 'st': 1,
 'de': 1,
 'w': 1,
 'trtc': 1,
 'form': 1,
 'hi': 1,
 'joe': 1,
 'true': 1,
 'home': 1,
 'vrs': 1,
 'med': 1,
 'sqz': 1,
 'five': 1,
 'ship': 1,
 'trxc': 1,
 'wish': 1,
 're': 1,
 'car': 1,
 'nakd': 1,
 'rkt': 1,
 'flex': 1,
 'pm': 1,
 'ppl': 1,
 'earn': 1,
 'flow': 1,
 'lscc': 1,
 'peg': 1,
 'two': 1,
 'gain': 1,
 'wow': 1,
 'pro': 1,
 'team': 1,
 'fix': 1,
 'fnko': 1,
 'et': 1,
 'al': 1,
 'muh': 1,
 'save': 1,
 'gold': 1,
 'beat': 1,
 'vive': 1,
 'u': 1,
 'rh': 1,
 'x': 1,
 'vxrt': 1,
 'mind': 1,
 'ehth': 1,
 'job': 1,
 'road': 1,
 'box': 1}
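The tally above is dominated by ordinary English words that happen to double as ticker symbols. One possible refinement, sketched below, is to count a word only when it is written like a ticker, either as a cashtag ($gme) or in capitals (GME). The regex, the count_tickers helper, and the sample sentence are illustrative assumptions, not part of the original analysis.

```python
import re
from collections import Counter

# Matches $gme-style cashtags, or standalone runs of 2-5 capital letters.
TICKER_PATTERN = re.compile(r"\$([A-Za-z]{1,5})\b|\b([A-Z]{2,5})\b")


def count_tickers(text, tickers):
    """Count symbols written as cashtags ($gme) or in capitals (GME)."""
    known = set(tickers)  # set membership is faster than scanning a list
    found = Counter()
    for cashtag, capitals in TICKER_PATTERN.findall(text):
        symbol = (cashtag or capitals).lower()
        if symbol in known:
            found[symbol] += 1
    return found


sample = "Is it a good time to buy GME or $amc? I am all in on GME."
known_tickers = ["gme", "amc", "a", "it", "good", "am", "all", "on", "or"]
counts = count_tickers(sample, known_tickers)
print(counts)
```

This still miscounts all-caps shouting (an emphatic "ALL" would register as a ticker) and misses tickers typed in lowercase without a dollar sign, so it trades one kind of error for another.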

Lastly, if we’re not interested in the tickers that occur, we can still boil all the data into a single dataframe

df = pd.DataFrame(a.to_dict() for a in reddit_submissions)

df

	title	body	permalink	author	score	timestamp
0	KSTR ETF "The nasdaq of china"	None	/r/stocks/comments/l664ce/kstr_etf_the_nasdaq_...	GioDesa	1	2021-01-27 09:56:46
1	Opinions/Projections on AMC?	None	/r/stocks/comments/l665a0/opinionsprojections_...	Double_jn_it	1	2021-01-27 09:58:03
2	GE, SPCE, & PLUG	None	/r/stocks/comments/l6668r/ge_spce_plug/	_MeatLoafLover	1	2021-01-27 09:59:21
3	Reddit is under DDOS attack. Certain gaming re...	None	/r/stocks/comments/l66692/reddit_is_under_ddos...	theBacillus	1	2021-01-27 09:59:22
4	#GainStock	None	/r/stocks/comments/l66777/gainstock/	lxPHENOMENONxl	1	2021-01-27 10:00:19
...	...	...	...	...	...	...
2338	AN OPEN LETTER TO GAMESTOP CEO	None	/r/stocks/comments/l98k85/an_open_letter_to_ga...	Artuhan	1	2021-01-31 03:55:51
2339	AN OPEN LETTER TO GAMESTOP CEO	None	/r/stocks/comments/l98lai/an_open_letter_to_ga...	Artuhan	1	2021-01-31 03:58:05
2340	Thoughts on YOLO (AdvisorShares Pure Cannabis ...	None	/r/stocks/comments/l98nly/thoughts_on_yolo_adv...	ConfidentProgrammer1	1	2021-01-31 04:02:29
2341	Daily advice	None	/r/stocks/comments/l98pic/daily_advice/	Bukprotingas	1	2021-01-31 04:06:24
2342	AMC- Next stop?	None	/r/stocks/comments/l98pif/amc_next_stop/	Hj-Fish	1	2021-01-31 04:06:24

2343 rows × 6 columns

Next up

While we just built our own Reddit API wrapper from some fundamental python libraries, there are more sophisticated APIs out there that do a better job of querying Reddit, like praw, and then we could try some other things like sentiment analysis.

Accessing FoldingAtHome data on AWS

2020-12-29T00:00:00-06:00

Some F@H data is freely accessible on AWS. This will be a relatively short post on accessing and navigating the data on AWS.

If you regularly use AWS, this will be nothing new. If you’re a grad student who has only ever navigated local file directories or used scp/rsync/ssh to interact with remote clusters, this might be your first time interacting with files on AWS S3.

The python environment is a fairly straightforward analytical environment, but with s3fs, boto3, and botocore to interact with files on S3

conda create -n fahaws python=3.7 pandas s3fs jupyter ipykernel -c conda-forge -yq

(Active environment)

python -m pip install boto3 botocore

The AWS CLI

The AWS CLI commands for navigating files on S3 mirror those of unix-like systems (see the AWS CLI installation instructions).

Use aws s3 ls s3://fah-public-data-covid19-absolute-free-energy/ --no-sign-request to list files within this particular S3 bucket. The --no-sign-request flag at the end lets us skip credentials, since the data is public.

You can read from stdout or pipe the output to a textfile, but this will be your bread and butter for wading through terabytes and terabytes of F@H data.

As of this post (Dec 2020), it looks like the files in free_energy_data/ were last updated at the end of Sept 2020

Summary of free energy results data

Fortunately, loading remote files via pandas is a common task, so there are convenient functions. Loading a dataframe over S3 is just like loading a dataframe locally (note the S3 string syntax)

The column febkT looks like the binding free energies in units of $k_B T$ (multiply by $k_B T$, or by $RT$ for molar quantities, to get energies in kJ/mol or kcal/mol). It’s worth mentioning that the absolute value of a single binding free energy is less helpful than the relative binding free energies for finding the best binder of the bunch (how do these free energies compare against each other?)

import pandas as pd

df = pd.read_pickle("s3://fah-public-data-covid19-absolute-free-energy/free_energy_data/results.pkl")

df.head()

	dataset	fah	identity	receptor	score	febkT	error	ns_RL	ns_L	wl_RL	L_error	RL_error
1155	MS0323_v3	PROJ14822/RUN258	DAR-DIA-43a-5	protein-0387.pdb	-5.201610	-25.546943	3.773523	[131, 89, 74, 113, 80]	[450, 490, 540, 410, 620]	[0.18446, 0.14757, 0.18446, 0.18446, 0.18446]	0.116912	3.280887
609	MS0326_v3	PROJ14823/RUN1202	MUS-SCH-c2f-13	Mpro-x0107-protein.pdb	-9.550890	-25.259420	22.776358	[121, 138, 96, 16, 5]	[200, 200, 200, 200, 200]	[0.18446, 0.18446, 0.23058, 0.23058, 0.23058]	16.216396	0.109175
759	MS0331_v3	PROJ14825/RUN685	MAK-UNK-129-18	Mpro-x0107_0.pdb	-8.425830	-24.789359	18.021078	[58, 68, 5, 7]	[200]	[0.37782, 0.30226, 0.9224, 0.59034]	0.000000	9.238496
615	MS0326_v3	PROJ14823/RUN2911	ÁLV-UNI-7ff-30	Mpro-x0540-protein.pdb	-2.774634	-24.447756	6.605737	[174, 124, 70]	[200, 200, 200, 200, 200]	[0.14757, 0.14757, 0.18446]	0.042010	5.184169
1086	MS0326_v3	PROJ14823/RUN2580	SEL-UNI-842-3	Mpro-x0397-protein.pdb	-4.474095	-23.705301	1.248983	[166, 134, 45]	[200, 200, 200, 200, 200]	[0.18015, 0.22519, 0.35183]	0.212546	2.529874
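To make the unit conversion concrete, here is a small sketch that turns a febkT value into kcal/mol. The temperature of 300 K is purely my assumption (the simulations may have been run at a different temperature), and the helper name is made up for illustration.

```python
# Gas constant in kcal/(mol*K); using R instead of k_B gives per-mole energies.
R_KCAL_PER_MOL_K = 1.987204e-3


def kT_to_kcal_per_mol(energy_in_kT, temperature_kelvin=300.0):
    """Convert an energy expressed in units of k_B*T into kcal/mol.

    The 300 K default is an assumption, not a value from the dataset.
    """
    return energy_in_kT * R_KCAL_PER_MOL_K * temperature_kelvin


febkT_example = -25.546943  # first febkT value in the table above
converted = kT_to_kcal_per_mol(febkT_example)
print(f"{converted:.2f} kcal/mol")
```

At 300 K one unit of $k_B T$ is about 0.596 kcal/mol, so the conversion is a simple rescaling and does not change how the ligands rank against each other.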

Some code to iterate through these buckets

Pythonically, we can build some S3 code to list each object in this S3 bucket.

import boto3
from botocore import UNSIGNED
from botocore.client import Config

s3 = boto3.resource('s3', config=Config(signature_version=UNSIGNED))
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

bucket_name = "fah-public-data-covid19-absolute-free-energy"
bucket = s3.Bucket(bucket_name)

This S3 bucket is very large – all the simulation inputs, trajectories, and outputs are in here, so it will take a while to enumerate every object. Instead, we’ll just make a generator and pull out a single item for proof-of-concept.

paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name)

def page_iterator(pages):
    for page in pages:
        for item in page.get('Contents', []):  # an empty page has no 'Contents' key
            yield item['Key']

all_objects = page_iterator(pages)

next(all_objects)

'PROJ14377/RUN0/CLONE0/frame0.tpr'

And if you wanted to, you could layer a filter over the generator to impose some logic like filtering for the top-level directories

first_level_dirs = filter(lambda x: x.count('/')==1, all_objects)
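Note that the filter above keeps keys containing exactly one slash, and it still has to walk every key in the bucket. If the goal is just the top-level "directories", S3 can group keys itself when list_objects_v2 is called with Delimiter='/', and the grouped prefixes come back under CommonPrefixes. The sketch below parses that response shape from hand-written fake pages, so it runs without network access. The top_level_prefixes helper is hypothetical.

```python
def top_level_prefixes(pages):
    """Yield prefixes from list_objects_v2 pages requested with Delimiter='/'."""
    for page in pages:
        # Pages without any grouped keys simply omit 'CommonPrefixes'.
        for entry in page.get("CommonPrefixes", []):
            yield entry["Prefix"]


# Fake pages shaped like the list_objects_v2 response, for illustration only.
fake_pages = [
    {"CommonPrefixes": [{"Prefix": "PROJ14377/"}, {"Prefix": "PROJ14378/"}]},
    {"Contents": [{"Key": "receptor_structures.tar.gz"}]},
    {"CommonPrefixes": [{"Prefix": "free_energy_data/"}]},
]
prefixes = list(top_level_prefixes(fake_pages))
print(prefixes)

# With the real client this would look like (not run here, needs network):
# pages = s3_client.get_paginator("list_objects_v2").paginate(
#     Bucket=bucket_name, Delimiter="/")
```

Because the grouping happens on the S3 side, this should return far fewer entries than enumerating every object.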

Unix-like python filesystem libraries

S3FS, built on botocore and fsspec, has a very unix-like syntax to navigate and open files

import s3fs
fs = s3fs.S3FileSystem(anon=True)

fs.ls(bucket_name)

['fah-public-data-covid19-absolute-free-energy/PROJ14377',
 'fah-public-data-covid19-absolute-free-energy/PROJ14378',
 'fah-public-data-covid19-absolute-free-energy/PROJ14379',
 'fah-public-data-covid19-absolute-free-energy/PROJ14380',
 'fah-public-data-covid19-absolute-free-energy/PROJ14383',
 'fah-public-data-covid19-absolute-free-energy/PROJ14384',
 'fah-public-data-covid19-absolute-free-energy/PROJ14630',
 'fah-public-data-covid19-absolute-free-energy/PROJ14631',
 'fah-public-data-covid19-absolute-free-energy/PROJ14650',
 'fah-public-data-covid19-absolute-free-energy/PROJ14651',
 'fah-public-data-covid19-absolute-free-energy/PROJ14652',
 'fah-public-data-covid19-absolute-free-energy/PROJ14653',
 'fah-public-data-covid19-absolute-free-energy/PROJ14654',
 'fah-public-data-covid19-absolute-free-energy/PROJ14655',
 'fah-public-data-covid19-absolute-free-energy/PROJ14656',
 'fah-public-data-covid19-absolute-free-energy/PROJ14665',
 'fah-public-data-covid19-absolute-free-energy/PROJ14666',
 'fah-public-data-covid19-absolute-free-energy/PROJ14667',
 'fah-public-data-covid19-absolute-free-energy/PROJ14668',
 'fah-public-data-covid19-absolute-free-energy/PROJ14669',
 'fah-public-data-covid19-absolute-free-energy/PROJ14670',
 'fah-public-data-covid19-absolute-free-energy/PROJ14671',
 'fah-public-data-covid19-absolute-free-energy/PROJ14702',
 'fah-public-data-covid19-absolute-free-energy/PROJ14703',
 'fah-public-data-covid19-absolute-free-energy/PROJ14704',
 'fah-public-data-covid19-absolute-free-energy/PROJ14705',
 'fah-public-data-covid19-absolute-free-energy/PROJ14723',
 'fah-public-data-covid19-absolute-free-energy/PROJ14724',
 'fah-public-data-covid19-absolute-free-energy/PROJ14726',
 'fah-public-data-covid19-absolute-free-energy/PROJ14802',
 'fah-public-data-covid19-absolute-free-energy/PROJ14803',
 'fah-public-data-covid19-absolute-free-energy/PROJ14804',
 'fah-public-data-covid19-absolute-free-energy/PROJ14805',
 'fah-public-data-covid19-absolute-free-energy/PROJ14806',
 'fah-public-data-covid19-absolute-free-energy/PROJ14807',
 'fah-public-data-covid19-absolute-free-energy/PROJ14808',
 'fah-public-data-covid19-absolute-free-energy/PROJ14809',
 'fah-public-data-covid19-absolute-free-energy/PROJ14810',
 'fah-public-data-covid19-absolute-free-energy/PROJ14811',
 'fah-public-data-covid19-absolute-free-energy/PROJ14812',
 'fah-public-data-covid19-absolute-free-energy/PROJ14813',
 'fah-public-data-covid19-absolute-free-energy/PROJ14823',
 'fah-public-data-covid19-absolute-free-energy/PROJ14824',
 'fah-public-data-covid19-absolute-free-energy/PROJ14826',
 'fah-public-data-covid19-absolute-free-energy/PROJ14833',
 'fah-public-data-covid19-absolute-free-energy/SVR51748107',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data',
 'fah-public-data-covid19-absolute-free-energy/receptor_structures.tar.gz',
 'fah-public-data-covid19-absolute-free-energy/setup_files']

fs.ls(bucket_name + "/free_energy_data")

['fah-public-data-covid19-absolute-free-energy/free_energy_data/',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_L_14382.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14717.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14718.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14719.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14720.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14817.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14818.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14819.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14820.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_L_14676.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_RL_14730.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_RL_14830.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_L_14374.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_RL_14721.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_RL_14821.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_L_14364.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_RL_14722.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_RL_14822.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_L_14369_14372_14370_14371.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14723.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14724.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14823.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14824.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_L_14376.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_RL_14725.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_RL_14825.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_L_14380.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14727.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14728.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14827.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14828.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_L_14378.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_RL_14752.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_RL_14852.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/hello.txt',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/organization.pkl',
 'fah-public-data-covid19-absolute-free-energy/free_energy_data/results.pkl']

with fs.open('fah-public-data-covid19-absolute-free-energy/free_energy_data/hello.txt', 'r') as f:
    print(f.read())

hello aws!

with fs.open("fah-public-data-covid19-absolute-free-energy/free_energy_data/organization.pkl", 'rb') as f:
    organization_df = pd.read_pickle(f)

organization_df.head()

	dataset	identity	receptor	score	v1_project	v1_run	v2_project	v2_run	v3_project	v3_run	project	run
0	72_RL	CCNCC(COC)Oc1ccccc1	receptor-270-343.pdb	0.999790	14600	0	14700	0	14800	0	NaN	NaN
1	72_RL	O=C(Cc1cccnc1)c1ccccc1	receptor-343.pdb	0.999652	14600	1	14700	1	14800	1	NaN	NaN
2	72_RL	CCCCC(N)c1cc(C)ccn1	receptor-343.pdb	0.999256	14600	2	14700	2	14800	2	NaN	NaN
3	72_RL	COCC(C)Nc1ccncn1	receptor-343.pdb	0.999096	14600	3	14700	3	14800	3	NaN	NaN
4	72_RL	CCN(CC)CCNc1ccc(C#N)cn1	receptor-270-343.pdb	0.998980	14600	4	14700	4	14800	4	NaN	NaN

Notebook itself can be found here

Poetry and Docker

2020-12-23T00:00:00-06:00

What is poetry and where does this fit in the python software/DS ecosystem? And some beginner forays into docker.

To skip the reading and jump to the code, go to this repo

Personal opinions motivating this work

The data science world is large. Data science is kind of like an intersection of statistics/math, subject matter, software engineering, algorithms, and all the collaboration/teamwork that comes with a job. Starting out, you definitely cannot be expected to have a mastery over everything, but you should have at least some minimal competencies and a capacity to learn (this basically applies to all jobs).

I’m about 11 months into technically being labeled a “data scientist” and I’ve observed that each data scientist ends up cultivating their own set of skills they find valuable and/or interesting – generalist/specialist is basically skilling up however you want. Seeing the work of other data scientists has built up a long laundry list of things I’d like to learn but don’t have the business hours to devote to because of more pressing project demands. Among these is this concept of packaging and building “applications”. For a graduate student, your “application” might be a codebase and set of functions that others can reliably and consistently use in their own hacky codes. For a software engineer, your “application” might be deployed onto some cloud server, where the code needs to be self-sufficient and robust, listening for input, processing this input, and pumping out some output without hands on a keyboard. For a data scientist, you may eventually need to think about how an application gets deployed, from consistency of functions and numerical accuracy to considering the entire technical stack involved. Day-to-day, I think consistency of functions and numerical accuracy are generally kept front-of-mind with unit tests, mainly because you’re always thinking of the mathematical model.

If you’re a little more software-savvy, you’ll think about your python environment, using conda or something to control your python software dependencies, your software build, and any compilation that has to happen. Since I’m on a socially-distanced, self-quarantined holiday, this is a great time to do some learning.

Poetry

“Dependency” hell, an introduction

Most software depends on other software, and if the dependencies change some core functionality, then your own software may no longer function as intended. To resolve this, you venture through “dependency hell” to figure out whose code broke your code, and how to fix this.

Data scientists like to use python virtual environments to ensure dependencies are compatible and runnable. Some like to use pip and venv, which is fine for installing packages, but pip has only recently started to attempt dependency resolution. Conda is also very popular for managing software packages, compiling software, and resolving software dependencies.

What does poetry do?

A new contestant, poetry finds itself in some python packaging and dependency conversations like “oh I’ve heard of poetry but never really tried it”. Poetry helps manage the python package dependencies for a given software, with a simple CLI to add and update package dependencies. Poetry generally involves the binary (available on conda and pip), but interacts with your package via two files, the poetry.lock and pyproject.toml. If someone gives you those files, you should be able to build your own compatible python environment. In tandem, the two specify the necessary dependencies for your project: poetry.lock pins exact versions, while pyproject.toml lets them float within version ranges. Poetry also has some convenient functions for compiling source distributions and wheels so you can distribute this code on somewhere like pypi (but it doesn’t look like there’s any mention of conda recipes).
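To make the pinned-versus-floating distinction concrete, here is a minimal sketch in plain python. It assumes the caret-style constraint (like ^1.2.3) that poetry documents for pyproject.toml, where the upper bound bumps the left-most non-zero version component, and it only handles full major.minor.patch versions. The function names are hypothetical and are not part of poetry.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Split 'major.minor.patch' into a tuple of ints (full three-part versions only)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def caret_upper_bound(version: str) -> Version:
    """Exclusive upper bound of a caret constraint (assumed semantics, see lead-in)."""
    major, minor, patch = parse(version)
    if major > 0:
        return (major + 1, 0, 0)
    if minor > 0:
        return (0, minor + 1, 0)
    return (0, 0, patch + 1)


def satisfies_floating(candidate: str, constraint: str) -> bool:
    """A floating (caret) dependency accepts a whole range of versions."""
    return parse(constraint) <= parse(candidate) < caret_upper_bound(constraint)


def satisfies_pinned(candidate: str, pinned: str) -> bool:
    """A pinned dependency accepts exactly one version."""
    return parse(candidate) == parse(pinned)


floating_ok = satisfies_floating("1.4.0", "1.2.3")  # inside the floating range
pinned_ok = satisfies_pinned("1.4.0", "1.2.3")      # not the pinned version
```

The floating check is what lets an environment pick up compatible updates, and the pinned check is what makes a lock file reproducible.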

What about docker?

Docker provides a lot of virtualization and environment control so you can put together an entire tech stack just for your application to run on a bare-bones, nothing-installed server somewhere. This comes in the form of a dockerfile, which is like a set of instructions on how to build your container. For an early career data scientist, that’s probably all you need to know. Software engineers deal with this all the time, and data scientists eventually dip their toes here as a model/project comes to maturity.

You can learn a lot about dockerfiles by reading them and writing your own, so take a look at the repo linked at the beginning of this post. In general, it kind of resembles a lot of shell commands. Getting conda to work with docker comes with some sticking points:

conda commands within each layer won’t work unless you run the shell script that comes with conda, so you have to remember to run that script throughout the dockerfile
Note the use of the entrypoint.sh file, which becomes the final script that is executed when you call docker run. Observe the necessary chmod to make it executable, and note the conda.sh command even inside the entrypoint.sh file if you want the container to run some code within a conda environment.
docker run -it poetry /bin/bash if you want to open an interactive shell session to the container, running commands/codes inside the docker container like you would an SSH session.
Technically, since you have absolute control over the image, you might not need the virtual environment for small python packages. As the packages get more complex and package builds become more complicated, it becomes easier to let conda handle the package management rather than try to correctly install everything in a dockerfile

If you envision running lots of python code or calculations on cloud servers, docker containers and python environments are the sorts of tech that make it happen (and if you and your project are up for it, container-orchestration and workflow tools).

Bare bones example

I’ve documented my experiences in this sandbox for using docker and poetry. There are a lot of tutorials on the internet, so I won’t bother here. But, for a data scientist versed in python environments, this repo showcases how to build your docker images for conda/poetry/python. For a “real” industrial application, things will likely get messier as the environments and software stack get more complex, but this is a decent start for an amateur.

Exploring PyTorch + ANI + MD

2020-08-15T00:00:00-05:00

PyTorch + ANI + MD

PyTorch provides nice utilities for differentiation. ANI provides interatomic potentials built on neural networks. Combining the two for molecular dynamics might be interesting.

Some basic pytorch functionality, a 1-D spring

Pytorch replicates a lot of numpy functionality, and we can build python functions that take pytorch tensors as input

import torch
import matplotlib.pyplot as plt

x = torch.ones((2,2), requires_grad=True)

A simple quadratic function

def sq_function(x):
    return x**2

Since we have an array of 1s, the square won’t look very interesting…

foo = sq_function(x)

foo

tensor([[1., 1.],
        [1., 1.]], grad_fn=<PowBackward0>)

More interestingly, we can compute the gradient of this function.

To compute the gradient, the value/function needs to be a scalar, but this scalar could be computed from a bunch of other functions stemming from some independent variables (our tensor x). In this case, our final scalar looks like this, $ Y = x_0^2 + x_1^2 + x_2^2 + x_3^2 $. Taking the gradient means taking 4 partial derivatives, one for each input. Fortunately, each partial derivative is simple to compute, $ \frac{\partial Y}{\partial x_i} = 2*x_i $, where $i \in \{0, 1, 2, 3\}$. Since this is an array of 1s, each partial derivative evaluates to 2.

torch.autograd.grad(foo.sum(), x)

(tensor([[2., 2.],
         [2., 2.]]),)
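As a cross-check that doesn’t rely on autograd, a central finite difference should land on the same numbers. This is a minimal sketch in plain python (no pytorch); the helper names are my own and the step size h is an arbitrary choice.

```python
def sq_sum(xs):
    """Y = sum of x_i^2, the same scalar that foo.sum() computes."""
    return sum(x ** 2 for x in xs)


def finite_difference_gradient(f, xs, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    grad = []
    for i in range(len(xs)):
        forward = list(xs)
        backward = list(xs)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


point = [1.0, 1.0, 1.0, 1.0]  # the flattened 2x2 tensor of ones
numeric = finite_difference_gradient(sq_sum, point)
analytic = [2 * x for x in point]  # dY/dx_i = 2 * x_i
```

Both lists should agree to within floating-point noise, which is a cheap way to catch a sign or factor error before trusting a more complicated energy function.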

We’ve evaluated the function and its gradient at just one point, but we can use some numpy-esque functions to evaluate the square-function and its gradient at a range of points.

Yup, looks right to me

some_xvals = torch.arange(-12., 12., step=0.5, requires_grad=True)
some_yvals = sq_function(some_xvals)
fig, ax = plt.subplots(1,1)
ax.plot(some_xvals.detach().numpy(), some_yvals.detach().numpy())
ax.plot(some_xvals.detach().numpy(), 
       torch.autograd.grad(some_yvals.sum(), some_xvals)[0])

[<matplotlib.lines.Line2D at 0x7f9c907aa910>]

Slightly more book-keeping, 3x 1-D harmonic springs

Define an energy function as the sum of 3 harmonic springs

$ V(x, y, z) = V_x + V_y + V_z = (x-x_0)^2 + (y-y_0)^2 + (z-z_0)^2 $

The gradient, the 3 partial derivatives, are computed as such (being verbose with the chain rule)

$ \frac{\partial V}{\partial x} = 2 * (x-x_0) * 1 $

$ \frac{\partial V}{\partial y} = 2 * (y-y_0) * 1 $

$ \frac{\partial V}{\partial z} = 2 * (z-z_0) * 1 $

def harmonic_spring_3d(coord, origin=torch.tensor([0,0,0])):
    V_x = (coord[0]-origin[0])**2
    V_y = (coord[1]-origin[1])**2
    V_z = (coord[2]-origin[2])**2
    
    return V_x + V_y + V_z 

We can evaluate the potential energy at 1 point, which involves computing the energy in 3 dimensions.

Our “anchor” will be the origin, and our endpoint will be (1,2,3)

$ 1^2 + 2^2 + 3^2 = 14 $

my_coords = torch.tensor([1.,2.,3.], requires_grad=True)
total_energy = harmonic_spring_3d(my_coords)
total_energy

tensor(14., grad_fn=<AddBackward0>)

Computing the gradient, partial derivatives in each direction, which is simply 2 times the distance in each dimension

$ \nabla V = < 2*1, 2*2, 2*3 > = <2,4,6> $

torch.autograd.grad(total_energy, my_coords)

(tensor([2., 4., 6.]),)

More involved: Lennard Jones

The Lennard-Jones potential describes the potential energy between two particles. Not the most accurate potential, but has been decent for a long time now. Some background information on the Lennard-Jones potential. For simplicity, assume $\epsilon =1$ and $\sigma=1$ in unitless quantities:

$ V_{LJ} = 4 * \left( \left(\frac{1}{r}\right)^{12} - \left(\frac{1}{r}\right)^{6} \right) $

$ -\frac{\partial V}{\partial r} = -4 * (-12 * r^{-13} + 6 * r^{-7}) $

def lj(val):
    return 4 * ((1/val)**12 - (1/val)**6)

r_values = torch.arange(0.1, 12., step=0.001, requires_grad=True)

energy = lj(r_values)

forces = -torch.autograd.grad(energy.sum(), r_values)[0]

For sanity check, we can confirm that energy reaches a critical point (local minimum) when the force is 0.

Also, this definitely looks like a LJ potential to me

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1,1, dpi=100)
ax.plot(r_values.detach().numpy(), energy.detach().numpy(), label='energy')
ax.plot(r_values.detach().numpy(), forces.detach().numpy(), label='force')
ax.set_ylim([-2,1])
ax.legend()
ax.set_xlim([0,2])
ax.axhline(y=0, color='r', linestyle='--')

<matplotlib.lines.Line2D at 0x7f9d2859b2d0>
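The same sanity check can be done by hand. Setting the force expression above to zero gives $ r^6 = 2 $, so the minimum should sit at $ r = 2^{1/6} $ with an energy of -1 (in these reduced units). A minimal sketch in plain python, with hypothetical function names:

```python
def lj_energy(r):
    """Lennard-Jones energy with epsilon = sigma = 1."""
    return 4.0 * (r ** -12 - r ** -6)


def lj_force(r):
    """Force = -dV/dr, written out from the analytic derivative."""
    return -4.0 * (-12.0 * r ** -13 + 6.0 * r ** -7)


r_min = 2.0 ** (1.0 / 6.0)      # where 12 * r^-13 == 6 * r^-7
force_at_min = lj_force(r_min)  # should vanish up to rounding
energy_at_min = lj_energy(r_min)

# Repulsive (positive force) inside the minimum, attractive (negative) outside
force_inside = lj_force(1.0)
force_outside = lj_force(1.5)
```

If the autograd curve crosses zero anywhere other than roughly r = 1.12, something is off in the energy function.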

Moving to torchani

ANI is an interatomic potential built upon neural networks. Rather than write our own function to evaluate the energy between atoms, maybe we can just use ANI. Since this is pytorch-based, this is still available for autodifferentiation to get the forces

https://github.com/aiqm/torchani

To begin, we have to define our elements (a tensor of atomic numbers). For the molecular mechanics people, each atom is identifiable by its element, and not one of many atom-types.

We have to define the positions (units of Angstrom), which is also a multi-dimensional tensor.

Load the model, specifying to convert the atomic numbers to indices suitable for ANI.

We can compute the energies and forces from the model. The energy comes from the model, but the force is obtained via an autograd call, observing that we are differentiating the sum of the energies with respect to the positions (and negating the result).

import torchani

elements = torch.tensor([[6, 6]])
positions = torch.tensor([[[3.0, 3.0, 3.0],
                           [3.5, 3.5, 3.5]]], requires_grad=True)

model = torchani.models.ANI2x(periodic_table_index=True)

energy = model((elements, positions)).energies

forces = -1.0 * torch.autograd.grad(energy.sum(), positions)[0]

/home/ayang41/miniconda3/envs/torch37/lib/python3.7/site-packages/torchani/aev.py:195: UserWarning: This overload of nonzero is deprecated:
	nonzero()
Consider using one of the following signatures instead:
	nonzero(*, bool as_tuple) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  in_cutoff = (distances <= cutoff).nonzero()

energy

tensor([-75.7952], dtype=torch.float64, grad_fn=<AddBackward0>)

forces

tensor([[[-0.4016, -0.4016, -0.4016],
         [ 0.4016,  0.4016,  0.4016]]])

Going a step further, we can try to visualize the interaction potential by evaluating the energy at a variety of distances. We can also do some autodifferentiation to compute the forces.

In this example, the 2 atoms share X and Y coordinates, and we pull them apart in the Z direction.

all_z = torch.arange(3.0, 12.0, step=0.1)
all_energy = []
all_forces = []
for z in all_z:
    # Generate a new set of positions
    positions = torch.tensor([[[3.0, 3.0, 3.0],
                               [3.0, 3.0, z]]], requires_grad=True
                            )
    # Compute energy
    energy = model((elements, positions)).energies
    # Compute force
    forces = -1.0 * torch.autograd.grad(energy.sum(), positions)[0]
    
    # Get the force vector on the first atom
    one_atom_forces = forces[0,0]
    # Compute the magnitude of this force vector
    force_magnitude = torch.sqrt(torch.dot(one_atom_forces, one_atom_forces))
    # Calculate the unit vector for this force vector,
    # although it's a little unnecessary because the only distance is in the
    # z direction
    unit_vector_force = one_atom_forces/force_magnitude
    # Get z-component of force vector
    force_vector_z = unit_vector_force[2]*force_magnitude
    # Some nans will form if the force magnitude is zero, but this
    # is really just a 0 force vector
    if torch.isnan(force_vector_z).any():
        force_vector_z = 0.0
    else:
        force_vector_z = float(force_vector_z.detach().numpy())
    
    # Accumulate
    all_energy.append(float(energy.detach().numpy()))
    all_forces.append(force_vector_z)

Hmmm… this does not resemble the Lennard-Jones potential (or basic chemistry for that matter)

fig, ax = plt.subplots(1,1, dpi=100)
ax.plot(all_z-3, all_energy)
ax.set_xlabel(r"Distance ($\AA$)")
ax.set_ylabel("Energy (Hartree)")

Text(0, 0.5, 'Energy (Hartree)')

fig, ax = plt.subplots(1,1, dpi=100)
ax.plot(all_z-3, all_forces)

ax.set_xlabel(r"Distance ($\AA$)")
ax.set_ylabel("Force (Hartree / $\AA$)")

Text(0, 0.5, 'Force (Hartree / $\\AA$)')

Combining torchani with some other molecular modeling libraries

We’re going to use mbuild to initialize some particles, mdtraj as a convenient library to hold molecular information, and torchani to calculate some energies. As with the 2-atom potential example, this pentane example is a little fishy, but this code snippet should hopefully serve as a nice framework to combine some open source molecular modeling libraries.

from mbuild.lib.recipes import Alkane

# The mBuild alkane recipe is mainly used to generate 
# some particles and positions
cmpd = Alkane(n=5)

# Convert to mdtraj trajectory out of convenience for atomic numbers
traj = cmpd.to_trajectory()

# Periodic cell, from nm to angstrom
cell = torch.tensor(traj.unitcell_vectors[0]*10)

# We just need atomic numbers
species = torch.tensor([[
    a.element.atomic_number for a in traj.top.atoms
]])

# Make tensor for coordinates
# Since we are differentiating WRT coordinates, we need the
# requires_grad=True
coordinates = torch.tensor(traj.xyz*10, requires_grad=True)

# PBC flag necessary for computing energies with periodic boundaries
pbc = torch.tensor([True, True, True], dtype=torch.bool)

energies = model((species, coordinates), cell=cell, pbc=pbc).energies

forces = -1.0 * (
    torch.autograd.grad(energies.sum(), coordinates)[0]
)

energies

tensor([-197.1103], dtype=torch.float64, grad_fn=<AddBackward0>)

forces

tensor([[[ 6.5805e-02,  5.5707e-02,  4.9085e-02],
         [ 1.3603e-03, -2.1826e-02, -1.0588e-02],
         [ 1.5610e-02, -5.9448e-02,  1.4180e-03],
         [-7.4506e-09,  1.1921e-07,  1.1461e-02],
         [-7.2804e-03,  2.5767e-02, -6.2775e-04],
         [ 7.2804e-03, -2.5766e-02, -6.2775e-04],
         [-6.5805e-02, -5.5707e-02,  4.9085e-02],
         [-1.5610e-02,  5.9448e-02,  1.4180e-03],
         [-1.3604e-03,  2.1826e-02, -1.0588e-02],
         [ 6.9919e-02,  1.0938e-01, -4.7381e-02],
         [ 4.2583e-02,  1.5188e-01, -9.1655e-03],
         [-3.5887e-02, -5.4712e-03,  4.6396e-02],
         [ 3.4462e-03,  3.7552e-02, -3.4868e-02],
         [-6.9919e-02, -1.0938e-01, -4.7381e-02],
         [-4.2583e-02, -1.5188e-01, -9.1655e-03],
         [ 3.5887e-02,  5.4712e-03,  4.6396e-02],
         [-3.4462e-03, -3.7552e-02, -3.4868e-02]]])

To be continued …

One might imagine trying to incorporate ANI potentials into MD simulations (which has been done in ASE). However, the torchani-API is general enough that you could use any number of computational chemistry packages to feed into torchani. The output is also general enough you could imagine trying to apply your own integrators and make your own simulation. But… from the weird 2-atom interatomic potentials, some of these methods might require some debugging.
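To give a flavor of what “apply your own integrators” could look like, here is a minimal velocity Verlet sketch in plain python. It uses the 1-D spring from earlier ($ V = (x - x_0)^2 $, so the force is $ -2(x - x_0) $) rather than ANI, and the function names, unit mass, and timestep are all assumptions for illustration. In principle the force function could wrap an autograd call like the ones above, but I haven’t tried that here.

```python
def spring_energy(x, x0=0.0):
    """Potential energy of the 1-D spring, V = (x - x0)^2."""
    return (x - x0) ** 2


def spring_force(x, x0=0.0):
    """Force = -dV/dx = -2 * (x - x0)."""
    return -2.0 * (x - x0)


def velocity_verlet(x, v, force_fn, dt, n_steps, mass=1.0):
    """Integrate one particle; returns a list of (position, velocity) pairs."""
    trajectory = [(x, v)]
    f = force_fn(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass  # half-step velocity update
        x = x + dt * v_half               # full-step position update
        f = force_fn(x)                   # force at the new position
        v = v_half + 0.5 * dt * f / mass  # finish the velocity update
        trajectory.append((x, v))
    return trajectory


trajectory = velocity_verlet(x=1.0, v=0.0, force_fn=spring_force,
                             dt=0.01, n_steps=1000)
# Total energy = potential + kinetic (mass = 1); it should stay close to 1.0
total_energies = [spring_energy(x) + 0.5 * v ** 2 for x, v in trajectory]
energy_drift = max(abs(e - total_energies[0]) for e in total_energies)
```

Watching the total energy is the usual first test of an integrator: if it drifts badly with a simple spring, it won’t behave with a neural network potential either.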

Files and environment can be found here

Reference

Xiang Gao, Farhad Ramezanghorbani, Olexandr Isayev, Justin S. Smith, and Adrian E. Roitberg. TorchANI: A Free and Open Source PyTorch Based Deep Learning Implementation of the ANI Neural Network Potentials. Journal of Chemical Information and Modeling 2020 60 (7), 3408-3415

Downloading and studying my message behavior

2020-08-07T00:00:00-05:00

Digital privacy is everywhere, and recent laws are pushing companies to disclose whatever personal information they may have on you. In the spirit of science, I’m going to make myself my own study subject and observe what Facebook has stored from my messenger history. Along the way, I’ll do some recursion, a little parallelization, some generators for data processing, and basic visualization to observe my messenger behavior. Notebooks can be found here, but this one you can’t reproduce because I won’t be providing my messenger data (try this notebook on your own messenger data if you’re curious).

No real conclusion to this memo, but it’s interesting to see firsthand that a lot of data gets preserved from your messages – pictures, gifs, videos, audio, files, emotes, participants, timestamps.

The message data from Facebook is organized like this:

inbox/
- chat1/
  - message1.json
  - message2.json
  - audio/
  - files/
  - gifs/
  - photos/
  - videos/
- chat2/
  - message1.json

We can start with some basic tree-walking to identify which is the largest chat group

import os
from pathlib import Path
import json
import multiprocessing
from multiprocessing import Pool
import dask
from dask import delayed
import pandas as pd
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np


def size_of_tree(p):
    if 'json' in p.suffix:
        with open(p.as_posix()) as f:
            message_data = json.load(f)
            return len(message_data['messages'])
    elif p.is_dir():
        return sum([size_of_tree(a) for a in p.iterdir()])
    else:
        return 0

def parent_function(p):
    return {p: size_of_tree(p)}

def parent_function_chunk(p):
    return {folder: size_of_tree(folder) for folder in p}


p = Path('/censored/so/you/cant/find/my/facebook/inbox')
all_dirs = [a for a in p.iterdir() if a.is_dir()]

Since this is an embarrassingly parallel situation, we can easily show the serial version is slower than the parallel version (using dask or multiprocessing), with or without some chunking

%%time

sizes = [parent_function(folder) for folder in all_dirs]

CPU times: user 5.3 s, sys: 17.6 s, total: 22.9 s
Wall time: 1min 30s

%%time

all_delayed = [delayed(parent_function)(folder) for folder in all_dirs]

results = dask.compute(all_delayed)

CPU times: user 9.65 s, sys: 1min 4s, total: 1min 14s
Wall time: 30.4 s

%%time

with Pool() as p:
    pool_results = p.map(parent_function, all_dirs)

CPU times: user 131 ms, sys: 171 ms, total: 302 ms
Wall time: 27.5 s

%%time

all_delayed = [delayed(parent_function_chunk)(all_dirs[i::6]) for i in range(6)]

results = dask.compute(all_delayed)

CPU times: user 8.13 s, sys: 59.7 s, total: 1min 7s
Wall time: 31.4 s

%%time

with Pool() as p:
    pool_results = p.map(parent_function_chunk, [all_dirs[i::6] for i in range(6)])

CPU times: user 242 ms, sys: 33.7 ms, total: 276 ms
Wall time: 28.9 s
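The chunked runs above lean on strided slicing (all_dirs[i::6]) to deal the folders out into 6 roughly equal piles. Here is a minimal sketch of that pattern on made-up folder names. The counting function is a stand-in for size_of_tree, and I use a thread pool from the standard library only to keep the example self-contained, not because threads are what the timings above used.

```python
from concurrent.futures import ThreadPoolExecutor


def count_items(name):
    """Stand-in for size_of_tree: pretend the 'size' is the name length."""
    return len(name)


def process_chunk(chunk):
    """Same shape as parent_function_chunk: one dict per chunk of folders."""
    return {name: count_items(name) for name in chunk}


folders = [f"chat{i}" for i in range(20)]  # hypothetical folder names
n_chunks = 6

# Strided slicing: chunk i gets items i, i + 6, i + 12, ...
chunks = [folders[i::n_chunks] for i in range(n_chunks)]

with ThreadPoolExecutor(max_workers=n_chunks) as pool:
    chunk_results = list(pool.map(process_chunk, chunks))

# Merge the per-chunk dictionaries back into one mapping
merged = {}
for partial in chunk_results:
    merged.update(partial)
```

Every folder lands in exactly one chunk and chunk sizes differ by at most one, so no worker gets a much longer to-do list by count. That doesn’t balance the actual work if a few chats are huge, which is probably part of why chunking didn’t change the timings much.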

For those curious, I have a pretty skewed chat message distribution…

message_sizes = [size for chunk in results[0] for size in chunk.values()]

fig, ax = plt.subplots(1,1, figsize=(8,6))
ax.hist(message_sizes)
ax.set_ylabel("Number of chats")
ax.set_xlabel("Number of messages within chat")

Text(0.5, 0, 'Number of messages within chat')

fig, ax = plt.subplots(1,1, figsize=(8,6))
ax.hist(np.log(message_sizes))
ax.set_ylabel("Number of chats")
ax.set_xlabel("Log number of messages within chat")

Text(0.5, 0, 'Log number of messages within chat')

We can make a small data pipeline for my message history by using two iterators, one after the other. The first iterator, get_json_files_iter, is simple: it burrows its way through each directory, grabs all the json files, and yields them one at a time as a generator. The second iterator, process_json_iter, takes each item from the get_json_files_iter generator and actually processes some information, in this case the sender, timestamp, and length of each message.

from typing import Iterator, Dict, Any, List
import pathlib
import json
from datetime import datetime

def get_json_files_iter(dirs) -> Iterator[Path]:
    """ For each dir, yield the paths of its json files """
    root = Path('.')
    for directory in dirs:
        subdir = root / Path(directory)
        for jsonfile in subdir.glob('*.json'):
            yield Path(jsonfile)

def process_json_iter(json_iter: Iterator[Path]) -> Iterator[Dict[str, Any]]:
    """ Given json files, parse and summarize the info of each message """
    for jsonfile in json_iter:
        with open(jsonfile.as_posix()) as f:
            message_data = json.load(f)
        for message in message_data['messages']:
            # Some messages have no text, like an image/emoji post
            content = message.get('content')
            yield {
                'sender': message['sender_name'],
                'timestamp': datetime.fromtimestamp(message['timestamp_ms']/1000),
                # Count whitespace-separated words; len(content) would count characters
                'n_words': len(content.split()) if content else None
            }
   

process_json_iter(get_json_files_iter(all_dirs))

<generator object process_json_iter at 0x7f26228366d0>
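Note that chaining the two generators does no work yet; nothing is read from disk until something iterates over the result. A toy sketch of the same two-stage shape makes the laziness visible. The names and the log list here are made up for illustration.

```python
log = []


def produce(names):
    """First stage: hand out items one at a time."""
    for name in names:
        log.append(f"produce {name}")
        yield name


def transform(items):
    """Second stage: process each item as it arrives."""
    for item in items:
        log.append(f"transform {item}")
        yield item.upper()


pipeline = transform(produce(["a", "b"]))
log_before = list(log)     # still empty: no stage has run yet

first = next(pipeline)     # pulls exactly one item through both stages
log_after_first = list(log)

rest = list(pipeline)      # drains whatever is left
```

Each item travels through both stages before the next one is produced, so only one file’s worth of messages needs to be in flight at a time.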

Getting through all the files (7 GB) isn’t too bad

%%time

extracted_messages = [*process_json_iter(get_json_files_iter(all_dirs))]

CPU times: user 6.01 s, sys: 0 ns, total: 6.01 s
Wall time: 10.8 s

%%time

df = pd.DataFrame(extracted_messages)

CPU times: user 909 ms, sys: 0 ns, total: 909 ms
Wall time: 900 ms

Conveniently, we can pass the generator itself to create a dataframe. This doesn’t provide much speedup, but it helps keep the code concise

%%time

df = pd.DataFrame(process_json_iter(get_json_files_iter(all_dirs)))

CPU times: user 7.42 s, sys: 0 ns, total: 7.42 s
Wall time: 13.7 s

df.columns

Index(['sender', 'timestamp', 'n_words'], dtype='object')

df.shape

(1003527, 3)

We can look at how my chat history has changed over the years…

# Zero-padded YYYY-MM-DD strings sort chronologically,
# which unpadded strings like 2015-10-1 vs 2015-2-1 do not
df['date'] = df['timestamp'].dt.strftime('%Y-%m-%d')

grouped_by_date = df.groupby('date').agg('count')

fig, ax = plt.subplots(1,1, figsize=(18,10))
ax.plot(grouped_by_date.index.tolist(),
       grouped_by_date['sender'])

ticks = np.linspace(0, len(grouped_by_date.index)-1, num=50, dtype=int)
ax.set_xticks(ticks)
ax.set_xticklabels([list(grouped_by_date.index)[i] for i in ticks], rotation=90, ha='right')
ax.set_ylabel("Number of messages", size=18)

Text(0, 0.5, 'Number of messages')

Maybe trying to smooth things out. The timestamps aren’t evenly distributed so the averages could be computed better, but they work well enough for now

rolling = grouped_by_date.rolling(10, min_periods=1).mean()

fig, ax = plt.subplots(1,1, figsize=(18,10))
ax.plot(rolling.index.tolist(),
       rolling['sender'])

ticks = np.linspace(0, len(rolling.index)-1, num=50, dtype=int)
ax.set_xticks(ticks)
ax.set_xticklabels([list(rolling.index)[i] for i in ticks], rotation=90, ha='right')
ax.set_ylabel("Number of messages", size=18)

Text(0, 0.5, 'Number of messages')
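For anyone unsure what rolling(10, min_periods=1).mean() is doing, here is a minimal sketch of a trailing-window mean in plain python. It mirrors my understanding of the pandas behavior (returning None where pandas would give NaN) and is not pandas’ actual implementation. The window counts rows, meaning days that had messages, rather than calendar days, which is why the averages above could be computed better.

```python
def rolling_mean(values, window, min_periods=1):
    """Trailing-window mean: each output uses the current row and up to
    window - 1 rows before it; None if fewer than min_periods rows exist."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        if len(chunk) >= min_periods:
            out.append(sum(chunk) / len(chunk))
        else:
            out.append(None)
    return out


daily_counts = [1, 2, 3, 4]  # hypothetical messages-per-day counts
smoothed = rolling_mean(daily_counts, window=2)
strict = rolling_mean(daily_counts, window=2, min_periods=2)
```

With min_periods=1 the first few points are averaged over whatever rows are available, so the smoothed curve starts immediately instead of with a gap.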

Lessons learned from accelerating foyer with dask

2020-06-20T00:00:00-05:00

Combining Foyer + Dask

Another foray into combining modern molecular modeling tools with modern data science libraries…

Foyer uses graph algorithms to parametrize your molecular model

Given a system of molecules and atoms, how do we parametrize each atom according to our molecular model, our force field? The parameters for each atom depend on its bonded neighbors. Framing this as a graph problem (vertices are atoms and edges are bonds), subgraph isomorphisms are used to match each atom’s bonding pattern to the template bonding patterns that define our force field’s atom types.

Dask helps distribute parallel workloads

Generally, most of these molecular modeling packages operate on a shared memory data structure - a list, a dictionary. To parallelize the atomtyping operation, we need to identify how the work can be split up. For graph problems, sometimes each node (atom) needs to know about every other node. We are left with a couple of options:

Broadcast the entire molecular graph to all workers, and divvy up which atoms each worker is responsible for atomtyping. This risks some large overhead because the entire molecular graph can span tens of thousands (or more) nodes.
Broadcast only the relevant molecular graph to each worker, so each worker becomes responsible for parametrizing that small subgraph. This one doesn’t involve broadcasting large graphs, but now the problem becomes identifying what the relevant graph is. I refer readers to the concept of a graph component.
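For the second option, the “relevant graph” for an atom is its connected component, which for a box of non-bonded molecules is just the molecule it belongs to. Here is a minimal sketch of finding components with a breadth-first search in plain python. The function name and the toy bond list are made up for illustration, and a graph library would normally do this for you.

```python
from collections import deque


def connected_components(n_nodes, edges):
    """Group node indices into connected components using breadth-first search."""
    adjacency = {node: set() for node in range(n_nodes)}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in range(n_nodes):
        if start in seen:
            continue
        # Explore everything reachable from this unvisited node
        queue = deque([start])
        seen.add(start)
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


# Two separate 3-atom chains: atoms 0-1-2 and atoms 3-4-5
bonds = [(0, 1), (1, 2), (3, 4), (4, 5)]
molecules = connected_components(6, bonds)
```

Each component can then be shipped to a worker on its own, since no bond crosses from one component to another.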

What to expect in this notebook

First, I’ll be breaking up the entire chemical system into smaller subgraphs. I’ll try to atom-type each subgraph serially. Then, I’ll try to distribute the workload of each subgraph using dask. I’ll try to do some timings - against different numbers of homogeneous molecules and different numbers of heterogeneous molecules. Along the way, I’ll be observing some friction points for using dask (casual user here) and for using foyer/parmed

Parallelization’s value is hard to demonstrate in this use case

Dask did not show improvements compared to canonical foyer. With the data structures we, and foyer, usually deal with, there’s some extra work in formatting them into easily-distributable data structures for parallelization. There are always communication costs for parallel workloads. Foyer has molecule caching that accelerates atom-typing for molecules you’ve already atom-typed; this isn’t leveraged well in a distributed scenario. Foyer uses networkx, which likely already comes with its own optimizations for simplifying the workload, so evaluating a singular large graph may not be as bad as we think compared to lots of small graphs. As written, the foyer code may be best utilized serially. Future foyer implementations and refactors might better expose elements of parallelization.

Distributing the workload: split a chemical system into smaller components, parametrize each molecule, in serial

Use mbuild to create our molecule, replicate to 10 molecules, foyer to apply the OPLS-AA force field

import mbuild as mb
from mbuild.lib.recipes import Alkane
import foyer
import parmed as pmd
import networkx as nx

_ColormakerRegistry()

ff = foyer.forcefields.load_OPLSAA()

/home/ayang41/programs/foyer/foyer/forcefield.py:449: UserWarning: No force field version number found in force field XML file.
  'No force field version number found in force field XML file.'
/home/ayang41/programs/foyer/foyer/forcefield.py:461: UserWarning: No force field name found in force field XML file.
  'No force field name found in force field XML file.'
/home/ayang41/programs/foyer/foyer/validator.py:132: ValidationWarning: You have empty smart definition(s)
  warn("You have empty smart definition(s)", ValidationWarning)

single = Alkane(n=5)
cmpd = mb.fill_box(single, n_compounds=10, box=[10,10,10])

/home/ayang41/programs/mbuild/mbuild/compound.py:2139: UserWarning: No simulation box detected for mdtraj.Trajectory <mdtraj.Trajectory with 1 frames, 3 atoms, 1 residues, without unitcells>
  "mdtraj.Trajectory {}".format(traj)
/home/ayang41/programs/mbuild/mbuild/compound.py:2139: UserWarning: No simulation box detected for mdtraj.Trajectory <mdtraj.Trajectory with 1 frames, 4 atoms, 1 residues, without unitcells>
  "mdtraj.Trajectory {}".format(traj)
/home/ayang41/programs/mbuild/mbuild/compound.py:2527: UserWarning: No box specified and no Compound.box detected. Using Compound.boundingbox + 0.5 nm buffer. Setting all box angles to 90 degrees.
  "No box specified and no Compound.box detected. "

view = single.visualize(backend='nglview')
view

NGLWidget()

structure = cmpd.to_parmed()

Box of pentanes as parmed structures

import nglview
nglview.show_parmed(structure)

NGLWidget()

Creating the molecule graph for all molecules in our system

graph = nx.Graph()
graph.add_nodes_from([a.idx for a in structure.atoms])
graph.add_edges_from([(b.atom1.idx, b.atom2.idx) for b in structure.bonds])

Here we can see there are a few different connected components in the graph, one for each molecule

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1,1, figsize=(8,8), dpi=100)

nx.draw_networkx(graph, node_size=100, with_labels=False, ax=ax)

Fortunately, the networkx API has a connected components implementation. It gives us a list of sets of atom indices, where each set of atom indices refers to one connected component

individual_molecule_graphs = [*nx.connected_components(graph)]
individual_molecule_graphs[0:3]

[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16},
 {17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33},
 {34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50}]
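To make it concrete what a connected-components search is doing, here is a minimal sketch in plain Python. It uses a breadth-first search as a stand-in for networkx (this is not networkx's code), and the toy node and edge lists are made up for illustration.

```python
from collections import deque

def connected_components(nodes, edges):
    """Yield each connected component as a set of node ids (breadth-first search)."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    for start in nodes:
        if start in seen:
            continue
        # Grow one component outward from an unvisited starting node
        component = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        yield component

# Two toy "molecules": a 3-atom chain (0-1-2) and a 2-atom pair (3-4)
toy_nodes = [0, 1, 2, 3, 4]
toy_edges = [(0, 1), (1, 2), (3, 4)]
toy_components = list(connected_components(toy_nodes, toy_edges))
print(toy_components)
```

Each yielded set plays the same role as one entry of individual_molecule_graphs above: the atom indices that belong to a single molecule.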

For each individual molecule graph, we can create a parmed structure. Our entire box of pentanes was one parmed structure, but now we’re interested in creating N different parmed structures, one for each molecule. You could imagine creating another kind of object, like an mbuild compound or openmm topology, but to fit the foyer workflow, we operate on parmed structures.

all_substructures = []
for molecule_graph in individual_molecule_graphs:
    individual_structure = pmd.Structure()
    for idx in molecule_graph:
        individual_structure.add_atom(structure.atoms[idx], structure.atoms[idx].residue.name,
                                     structure.atoms[idx].residue.number)
        for neighbor_idx in graph[idx]:
            if idx < neighbor_idx:
                individual_structure.bonds.append(pmd.Bond(structure.atoms[idx], 
                                                           structure.atoms[neighbor_idx]))
    all_substructures.append(individual_structure)

all_substructures

[<Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>]
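The idx < neighbor_idx check in the loop above is what keeps each bond from being added twice, since an undirected graph lists every bond from both of its ends. A small sketch with a made-up adjacency dictionary (standing in for the networkx graph) shows the effect:

```python
# Adjacency view of a toy 4-atom chain 0-1-2-3; each bond shows up from both ends
toy_adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

# Without the check, every bond is visited twice
bonds_both_directions = [
    (idx, neighbor_idx)
    for idx in toy_adjacency
    for neighbor_idx in toy_adjacency[idx]
]

# With the check, only the "low index -> high index" direction is kept
bonds_once = [
    (idx, neighbor_idx)
    for idx in toy_adjacency
    for neighbor_idx in toy_adjacency[idx]
    if idx < neighbor_idx
]

print(len(bonds_both_directions), bonds_once)
```

That is consistent with each pentane above reporting 16 bonds for 17 atoms rather than 32.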

Simple iteration through each molecular substructure, applying the force field to each

parametrized_substructures = []
for substructure in all_substructures:
    output_struc = ff.apply(substructure)
    parametrized_substructures.append(output_struc)

/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 20, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)

parametrized_substructures

[<Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>]

Because parmed structures overload the addition operator, we can combine structures via addition

parametrized_substructures[0] + parametrized_substructures[1]

<Structure 34 atoms; 2 residues; 32 bonds; parametrized>

Using functools, we can quickly and conveniently combine all N parametrized structures into 1 structure

from functools import reduce
parametrized_structure = reduce(lambda x,y: x+y, parametrized_substructures)

parametrized_structure

<Structure 170 atoms; 10 residues; 160 bonds; parametrized>
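The reduce pattern works for any class that defines addition. Here is a minimal sketch with a hypothetical Fragment class that only tracks atom and bond counts; it is a stand-in for illustration, not a parmed object.

```python
from dataclasses import dataclass
from functools import reduce
from operator import add

@dataclass(frozen=True)
class Fragment:
    """Hypothetical stand-in for a structure: only tracks atom and bond counts."""
    n_atoms: int
    n_bonds: int

    def __add__(self, other):
        # Combining two fragments sums their counts, like merging two molecules
        return Fragment(self.n_atoms + other.n_atoms, self.n_bonds + other.n_bonds)

# Ten pentane-sized fragments (17 atoms, 16 bonds each)
pentanes = [Fragment(n_atoms=17, n_bonds=16) for _ in range(10)]

# reduce folds the list pairwise from the left: ((f0 + f1) + f2) + ...
combined = reduce(add, pentanes)
print(combined)
```

operator.add does the same job as lambda x,y: x+y. Note that this left fold creates a new, larger object at every step, so N structures mean N-1 additions.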

Rather than parametrize one, big parmed structure, we are parametrizing a bunch of small parmed structures, in serial. We’re not distributing the workload, but we are simplifying the workload – rather than match subgraphs among large, complex graphs of hundreds of nodes and edges, we are matching subgraphs among smaller, simpler graphs

Split a chemical system into smaller components, parametrize each molecule, in parallel

import dask
from dask import delayed, bag as db

Streamline our code into functions that are mostly-compatible with dask.

Use tuples rather than lists, because tuples are hashable (important for dask)
Add extra functions to map atomic indices to parmed Atoms. If we’re going to create different parmed structures, we need to track parmed atoms
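The hashability difference between tuples and lists is plain Python behaviour and easy to check. A quick sketch, with made-up atom indices:

```python
atom_indices_tuple = (0, 1, 2)
atom_indices_list = [0, 1, 2]

# A tuple of ints can be hashed, so it can be a dictionary key or a set member
lookup = {atom_indices_tuple: "pentane"}

# A list is mutable and defines no hash, so hashing it raises TypeError
try:
    hash(atom_indices_list)
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

print(lookup[(0, 1, 2)], list_is_hashable)
```

A tuple is only hashable if everything inside it is hashable too, so a tuple of lists would fail in the same way.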

from typing import Dict, Iterator, Tuple

def structure_to_graph(structure: pmd.Structure) -> nx.Graph:
    """ Build a graph with atom indices as nodes and bonds as edges"""
    graph = nx.Graph()
    graph.add_nodes_from([a.idx for a in structure.atoms])
    graph.add_edges_from([(b.atom1.idx, b.atom2.idx) for b in structure.bonds])
    
    return graph
        
def separate_molecule_graphs(structure: pmd.Structure, graph: nx.Graph) -> Iterator[Tuple[int, ...]]:
    """ Use connected components to identify individual molecules

    Returns a single-use generator that yields one tuple of atom indices per molecule
    """
    individual_molecule_graphs = (tuple(a) for a in nx.connected_components(graph))
    
    return individual_molecule_graphs

def subselect_atoms(structure: pmd.Structure, indices: Tuple[int, ...]) -> Dict[int, pmd.Atom]:
    """ Create a mapping of index to atom """
    return {idx: structure.atoms[idx] for idx in indices}

def make_structure_from_graph(molecule_vertices: Tuple[int, ...],
                             relevant_atoms: Dict[int, pmd.Atom],
                             molecule_graph: nx.Graph) -> pmd.Structure:
    """ From networkx graph and individual parmed atoms, make parmed structure"""
    individual_structure = pmd.Structure()
    
    for idx in molecule_vertices:
        individual_structure.add_atom(relevant_atoms[idx], relevant_atoms[idx].residue.name,
                                     relevant_atoms[idx].residue.number)

        for neighbor_idx in molecule_graph[idx]:
            # Only add each bond once, from its lower-index atom
            if idx < neighbor_idx:
                individual_structure.bonds.append(pmd.Bond(relevant_atoms[idx], 
                                                           relevant_atoms[neighbor_idx]))
    return individual_structure

def parametrize(ff: foyer.Forcefield, structure: pmd.Structure, **kwargs) -> pmd.Structure:
    """ Apply the force field to a single parmed structure"""
    return ff.apply(structure, **kwargs)

Exercising our functions in serial

We’ll get to some timings later…

%%time 

single = Alkane(n=5)
cmpd = mb.fill_box(single, n_compounds=10, box=[5,5,5])
structure = cmpd.to_parmed()
big_graph = structure_to_graph(structure)
individual_molecule_graphs = separate_molecule_graphs(structure, big_graph)
individual_structures = [make_structure_from_graph(molecule_graph, subselect_atoms(structure, molecule_graph), big_graph) 
     for molecule_graph in individual_molecule_graphs]
parametrized_structures = [parametrize(ff, struc) for struc in individual_structures]

parametrized_structure = reduce(lambda x,y: x+y, parametrized_structures)

parametrized_structure

/home/ayang41/programs/mbuild/mbuild/compound.py:2139: UserWarning: No simulation box detected for mdtraj.Trajectory <mdtraj.Trajectory with 1 frames, 3 atoms, 1 residues, without unitcells>
  "mdtraj.Trajectory {}".format(traj)
/home/ayang41/programs/mbuild/mbuild/compound.py:2139: UserWarning: No simulation box detected for mdtraj.Trajectory <mdtraj.Trajectory with 1 frames, 4 atoms, 1 residues, without unitcells>
  "mdtraj.Trajectory {}".format(traj)
/home/ayang41/programs/mbuild/mbuild/compound.py:2527: UserWarning: No box specified and no Compound.box detected. Using Compound.boundingbox + 0.5 nm buffer. Setting all box angles to 90 degrees.
  "No box specified and no Compound.box detected. "

CPU times: user 2.69 s, sys: 56.1 ms, total: 2.75 s
Wall time: 2.7 s

<Structure 170 atoms; 10 residues; 160 bonds; parametrized>

Here’s a first attempt at daskifying everything with delayed objects. Once we’ve created our entire system graph, we can start creating dask objects, starting with each molecule graph, and chaining the following operations:

From each molecule graph, grab the relevant parmed Atoms
From the molecule graph and parmed Atoms, create the (unparametrized) parmed Structure

%%time 

single = Alkane(n=5)
cmpd = mb.fill_box(single, n_compounds=10, box=[5,5,5])
structure = cmpd.to_parmed()
big_graph = structure_to_graph(structure)
individual_molecule_graphs = [*separate_molecule_graphs(structure, big_graph)]

all_subselected_atoms = [delayed(subselect_atoms)(structure, molecule_graph) 
                         for molecule_graph in individual_molecule_graphs]

raw_structures = [delayed(make_structure_from_graph)(molecule_graph, subselected_atoms, big_graph)
               for molecule_graph, subselected_atoms in zip(individual_molecule_graphs, all_subselected_atoms)]

CPU times: user 149 ms, sys: 23.2 ms, total: 173 ms
Wall time: 60.4 ms
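The cell above finishes in milliseconds because nothing has actually run yet; it only records which function to call on which inputs. The toy Lazy class below is a hypothetical illustration of that idea in plain Python. It is not how dask is implemented, and select and build are made-up stand-ins for subselect_atoms and make_structure_from_graph.

```python
class Lazy:
    """Toy deferred call: stores a function and its arguments until .compute()."""
    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve any Lazy arguments first, then call the stored function
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)

calls = []  # records the order in which the real work happens

def select(indices):
    calls.append("select")
    return {i: f"atom{i}" for i in indices}

def build(indices, atoms):
    calls.append("build")
    return [atoms[i] for i in indices]

# Chaining: the second step takes the (not yet computed) first step as input
selected = Lazy(select, (0, 1, 2))
built = Lazy(build, (0, 1, 2), selected)
calls_before_compute = list(calls)  # still empty, nothing has executed

result = built.compute()
print(calls_before_compute, result, calls)
```

Only the final compute call walks back through the chain and does the work, which is why the timings worth watching are the ones on the compute cells.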

Pulse check: can we flush the task graph and actually get our (still unparametrized) molecule structures?

%%time

[a.compute() for a in raw_structures]

CPU times: user 18.8 ms, sys: 889 µs, total: 19.7 ms
Wall time: 11.8 ms





[<Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized>]

Next step, parametrization

%%time

param_structures = [delayed(parametrize)(ff, struc) for struc in raw_structures]

CPU times: user 5.69 ms, sys: 594 µs, total: 6.28 ms
Wall time: 2.81 ms

(Another) pulse check, does the FF application work?

%%time

all_parametrized = [op.compute() for op in param_structures]
all_parametrized

CPU times: user 2.69 s, sys: 15.9 ms, total: 2.71 s
Wall time: 2.7 s





[<Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>,
 <Structure 17 atoms; 1 residues; 16 bonds; parametrized>]

Final step, putting the structures back together

%%time

reduce(lambda x,y: x+y, all_parametrized)

CPU times: user 52.7 ms, sys: 23 µs, total: 52.7 ms
Wall time: 50.8 ms

<Structure 170 atoms; 10 residues; 160 bonds; parametrized>

%%time 

single = Alkane(n=5)
cmpd = mb.fill_box(single, n_compounds=10, box=[5,5,5])
structure = cmpd.to_parmed()
big_graph = structure_to_graph(structure)

individual_molecule_graphs = [*separate_molecule_graphs(structure, big_graph)]

all_subselected_atoms = [delayed(subselect_atoms)(structure, molecule_graph) 
                         for molecule_graph in individual_molecule_graphs]

raw_structures = [delayed(make_structure_from_graph)(molecule_graph, subselected_atoms, big_graph)
               for molecule_graph, subselected_atoms in zip(individual_molecule_graphs, all_subselected_atoms)]

param_structures = [delayed(parametrize)(ff, struc) for struc in raw_structures]

CPU times: user 65.5 ms, sys: 41.5 ms, total: 107 ms
Wall time: 51.4 ms

The last step is to combine all the parametrized structures. We can try some dask fold/reduce operations

param_structures_bag = db.from_sequence(param_structures)
param_structures_bag

dask.bag<from_sequence, npartitions=10>

Unfortunately, some of these parmed AtomType objects are not hashable, so we cannot use dask to efficiently reduce parmed structures

from operator import add

param_structures_bag.fold(add).compute()  

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-37-2285eb100c4e> in <module>
      1 from operator import add
      2 
----> 3 param_structures_bag.fold(add).compute()

...

~/miniconda3/envs/md37/lib/python3.7/site-packages/cloudpickle/cloudpickle.py in save_global(self, obj, name, pack)
    828         elif obj is type(NotImplemented):
    829             return self.save_reduce(type, (NotImplemented,), obj=obj)
--> 830         elif obj in _BUILTIN_TYPE_NAMES:
    831             return self.save_reduce(
    832                 _builtin_type, (_BUILTIN_TYPE_NAMES[obj],), obj=obj)


TypeError: unhashable type: '_UnassignedAtomType'
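The failing line in the traceback is a membership test against a dictionary, which needs to hash the object first. One common way a Python object ends up unhashable is a class that defines __eq__ without __hash__. Whether that is exactly what happens inside parmed is an assumption on my part; WithEqOnly below is a hypothetical class that just reproduces the same kind of error.

```python
class WithEqOnly:
    """Hypothetical class: defines __eq__ but not __hash__."""
    def __eq__(self, other):
        return isinstance(other, WithEqOnly)

# Defining __eq__ alone makes Python set __hash__ to None on the class
hash_slot = WithEqOnly.__hash__

registry = {int: "int", str: "str"}
obj = WithEqOnly()

try:
    obj in registry  # a dict membership test has to hash obj
    membership_error = None
except TypeError as err:
    membership_error = str(err)

print(hash_slot, membership_error)
```

The usual fix on the class side is to define a __hash__ that is consistent with __eq__, but that is a change to the library rather than something we can do from the notebook.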

At this point, we can use dask to parallelize most of the steps in our process, but we still need to collect all of our parametrized structures prior to summing them all up

Timing isn’t so great but we’ll see how this scales

%%time

computed_parametrized_structures = [d.compute() for d in param_structures]

final_structure = reduce(lambda x,y: x+y, computed_parametrized_structures)

final_structure

CPU times: user 2.75 s, sys: 0 ns, total: 2.75 s
Wall time: 2.74 s

<Structure 170 atoms; 10 residues; 160 bonds; parametrized>

Putting all of our parallelized code together …

%%time 

# Make our molecular system
single = Alkane(n=5)
cmpd = mb.fill_box(single, n_compounds=10, box=[5,5,5])
structure = cmpd.to_parmed()

# Convert to graphs
big_graph = structure_to_graph(structure)
individual_molecule_graphs = [*separate_molecule_graphs(structure, big_graph)]

# Grab parmed atoms for each node in the graph
all_subselected_atoms = [delayed(subselect_atoms)(structure, molecule_graph) 
                         for molecule_graph in individual_molecule_graphs]

# Generate parmed structures for each molecule
raw_structures = [delayed(make_structure_from_graph)(molecule_graph, subselected_atoms, big_graph)
               for molecule_graph, subselected_atoms in zip(individual_molecule_graphs, all_subselected_atoms)]

# Parametrize with our force field
param_structures = [delayed(parametrize)(ff, struc) for struc in raw_structures]

computed_parametrized_structures = [d.compute() for d in param_structures]

final_structure = reduce(lambda x,y: x+y, computed_parametrized_structures)

CPU times: user 2.7 s, sys: 58.3 ms, total: 2.76 s
Wall time: 2.7 s
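One thing to keep in mind when reading these timings: each .compute() call blocks until its result is ready, so a list comprehension of .compute() calls works through the delayed objects one after another. dask also has dask.compute(*objects), which takes many delayed objects in a single call. The sketch below uses concurrent.futures from the standard library (not dask) to show the difference between waiting on each task before submitting the next and submitting everything up front. ConcurrencyProbe and the sleep-based task are made up for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class ConcurrencyProbe:
    """Records the largest number of tasks that were running at the same time."""
    def __init__(self):
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0

    def task(self, value):
        with self._lock:
            self._running += 1
            self.peak = max(self.peak, self._running)
        time.sleep(0.05)  # stand-in for real work
        with self._lock:
            self._running -= 1
        return value * 2

def one_at_a_time(values):
    probe = ConcurrencyProbe()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Block on each result before the next task is even submitted
        results = [pool.submit(probe.task, v).result() for v in values]
    return results, probe.peak

def all_at_once(values):
    probe = ConcurrencyProbe()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Submit everything first, then collect the results
        futures = [pool.submit(probe.task, v) for v in values]
        results = [f.result() for f in futures]
    return results, probe.peak

sequential_results, sequential_peak = one_at_a_time(range(4))
batched_results, batched_peak = all_at_once(range(4))
print(sequential_peak, batched_peak)
```

Both versions return the same results, but only the second ever has more than one task in flight. I haven't tested whether batching would help foyer here: dask's default scheduler for delayed objects is thread-based, and threads share Python's GIL, so CPU-bound pure-Python work may not speed up much either way.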

Visualizing our task graph

param_structures[0].visualize()

Before moving to timing comparisons, it’s important to note foyer’s use_residue_map functionality. If a “residue” (molecule type) has already been atom-typed within a single call to foyer’s apply, foyer doesn’t need to re-iterate and re-discover the atom types for later copies of that residue; the atom-typing is effectively cached. When apply is called separately for each molecule, as we do here, this caching doesn’t get leveraged.
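A toy model of why the cache stops helping once the system is split up. This is not foyer's implementation; expensive_atomtype and apply_once are hypothetical names, and the only assumption is that the cache lives for the duration of one apply call.

```python
def expensive_atomtype(molecule_name, counter):
    """Stand-in for the costly atom-typing step; counts how often it runs."""
    counter["expensive_calls"] += 1
    return f"types-for-{molecule_name}"

def apply_once(molecule_names, counter):
    """One 'apply' call: the cache only exists while this call is running."""
    cache = {}
    typed = []
    for name in molecule_names:
        if name not in cache:
            cache[name] = expensive_atomtype(name, counter)
        typed.append(cache[name])
    return typed

system = ["pentane"] * 10

# Whole system in one call: the first pentane is typed, the other nine hit the cache
whole_counter = {"expensive_calls": 0}
apply_once(system, whole_counter)

# One call per molecule: every call starts with an empty cache
split_counter = {"expensive_calls": 0}
for molecule in system:
    apply_once([molecule], split_counter)

print(whole_counter["expensive_calls"], split_counter["expensive_calls"])
```

In this toy, the single call does the expensive step once while the split version does it ten times, which is the shape of the penalty to look for in the homogeneous timings.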

Timing comparisons

We have 3 methods to compare:

Canonical foyer: the standard way to use foyer on a single parmed structure that represents your entire molecular system. This actually takes the most advantage of the use_residue_map functionality
Distributed foyer in serial: divide your parmed structure into smaller parmed structures and parametrize them one by one in a plain loop
Distributed foyer in parallel: divide your parmed structure into smaller parmed structures and parametrize them individually through dask delayed objects

We’ll notice the number of residues in the final, parametrized structures differs between methods; this is a consequence of how parmed.structure.__add__ and parmed.structure.__iadd__ work when you try to combine different parmed structures. What’s important is that the numbers of atoms and bonds are consistent

def canonical_foyer(ff, structure, **kwargs):
    """ Standard way of using foyer, no parallelization"""
    return ff.apply(structure, **kwargs)

def distributed_foyer_serial(ff, structure):
    """ Apply foyer N times to N different molecules in serial"""
    big_graph = structure_to_graph(structure)
    individual_molecule_graphs = separate_molecule_graphs(structure, big_graph)
    individual_structures = [make_structure_from_graph(molecule_graph, subselect_atoms(structure, molecule_graph), big_graph) 
         for molecule_graph in individual_molecule_graphs]
    parametrized_structures = [parametrize(ff, struc) for struc in individual_structures]

    parametrized_structure = reduce(lambda x,y: x+y, parametrized_structures)
    
    return parametrized_structure

def distributed_foyer_parallel(ff, structure):
    """Apply foyer N times to N different molecules in parallel"""
    big_graph = structure_to_graph(structure)

    individual_molecule_graphs = [*separate_molecule_graphs(structure, big_graph)]

    # Grab parmed atoms for each node in the graph
    all_subselected_atoms = [delayed(subselect_atoms)(structure, molecule_graph) 
                             for molecule_graph in individual_molecule_graphs]

    # Generate parmed structures for each molecule
    raw_structures = [delayed(make_structure_from_graph)(molecule_graph, subselected_atoms, big_graph)
                   for molecule_graph, subselected_atoms in zip(individual_molecule_graphs, all_subselected_atoms)]

    # Parametrize with our force field
    param_structures = [delayed(parametrize)(ff, struc) for struc in raw_structures]

    computed_parametrized_structures = [d.compute() for d in param_structures]

    final_structure = reduce(lambda x,y: x+y, computed_parametrized_structures)
    
    return final_structure

Small, homogeneous system

10 pentane molecules

Method	Time
Canonical foyer	2.53 s
Distributed foyer serial	3.15 s
Distributed foyer parallel	3.22 s

%%time 

ff = foyer.forcefields.load_OPLSAA()
single = Alkane(n=5)
cmpd = mb.fill_box(single, n_compounds=10, box=[10,10,10])
structure = cmpd.to_parmed()

canonical_foyer(ff, structure)

CPU times: user 2.52 s, sys: 57.7 ms, total: 2.58 s
Wall time: 2.53 s

/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 200, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)

<Structure 170 atoms; 1 residues; 160 bonds; PBC (orthogonal); parametrized>

%%time

ff = foyer.forcefields.load_OPLSAA()
single = Alkane(n=5)
cmpd = mb.fill_box(single, n_compounds=10, box=[10,10,10])
structure = cmpd.to_parmed()

distributed_foyer_serial(ff, structure)

CPU times: user 3.17 s, sys: 28.5 ms, total: 3.2 s
Wall time: 3.15 s

<Structure 170 atoms; 10 residues; 160 bonds; parametrized>

%%time

ff = foyer.forcefields.load_OPLSAA()
single = Alkane(n=5)
cmpd = mb.fill_box(single, n_compounds=10, box=[10,10,10])
structure = cmpd.to_parmed()

distributed_foyer_parallel(ff, structure)

CPU times: user 3.21 s, sys: 69.5 ms, total: 3.28 s
Wall time: 3.22 s

<Structure 170 atoms; 10 residues; 160 bonds; parametrized>

Large, homogeneous system

100 pentane molecules

Method	Time
Canonical foyer	21 s
Distributed foyer serial	35.1 s
Distributed foyer parallel	34.7 s

%%time 

ff = foyer.forcefields.load_OPLSAA()
single = Alkane(n=5)
cmpd = mb.fill_box(single, n_compounds=100, box=[1000,1000,1000])
structure = cmpd.to_parmed()

canonical_foyer(ff, structure)

CPU times: user 20.9 s, sys: 196 ms, total: 21.1 s
Wall time: 21 s

/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 2000, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)

<Structure 1700 atoms; 1 residues; 1600 bonds; PBC (orthogonal); parametrized>

%%time 

ff = foyer.forcefields.load_OPLSAA()
single = Alkane(n=5)
cmpd = mb.fill_box(single, n_compounds=100, box=[1000,1000,1000])
structure = cmpd.to_parmed()

distributed_foyer_serial(ff, structure)

CPU times: user 35.1 s, sys: 124 ms, total: 35.2 s
Wall time: 35.1 s

<Structure 1700 atoms; 100 residues; 1600 bonds; parametrized>

%%time 

ff = foyer.forcefields.load_OPLSAA()
single = Alkane(n=5)
cmpd = mb.fill_box(single, n_compounds=100, box=[1000,1000,1000])
structure = cmpd.to_parmed()

distributed_foyer_parallel(ff, structure)

CPU times: user 34.8 s, sys: 159 ms, total: 34.9 s
Wall time: 34.7 s

<Structure 1700 atoms; 100 residues; 1600 bonds; parametrized>

Small, heterogeneous system

10 pentane, 10 decane, 10 icosane (C20 alkane)

Method	Time
Canonical foyer	14 s
Distributed foyer serial	16.9 s
Distributed foyer parallel	16.6 s

%%time 

ff = foyer.forcefields.load_OPLSAA()
templates = [Alkane(n=5), Alkane(n=10), Alkane(n=20)]
cmpd = mb.fill_box(templates, n_compounds=[10,10,10], box=[100,100,100])
structure = cmpd.to_parmed()

canonical_foyer(ff, structure)

CPU times: user 14.2 s, sys: 130 ms, total: 14.3 s
Wall time: 14 s

/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 1400, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)

<Structure 1110 atoms; 1 residues; 1080 bonds; PBC (orthogonal); parametrized>

%%time 

ff = foyer.forcefields.load_OPLSAA()
templates = [Alkane(n=5), Alkane(n=10), Alkane(n=20)]
cmpd = mb.fill_box(templates, n_compounds=[10,10,10], box=[100,100,100])
structure = cmpd.to_parmed()

distributed_foyer_serial(ff, structure)

/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 40, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 80, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)

CPU times: user 17.1 s, sys: 170 ms, total: 17.3 s
Wall time: 16.9 s

<Structure 1110 atoms; 30 residues; 1080 bonds; parametrized>

%%time 

ff = foyer.forcefields.load_OPLSAA()
templates = [Alkane(n=5), Alkane(n=10), Alkane(n=20)]
cmpd = mb.fill_box(templates, n_compounds=[10,10,10], box=[100,100,100])
structure = cmpd.to_parmed()

distributed_foyer_parallel(ff, structure)

CPU times: user 16.7 s, sys: 222 ms, total: 16.9 s
Wall time: 16.6 s

<Structure 1110 atoms; 30 residues; 1080 bonds; parametrized>

Large, heterogeneous system

100 pentane, 100 decane, 100 icosane

Method	Time
Canonical foyer	2 min 31 s
Distributed foyer serial	4 min 20 s
Distributed foyer parallel	4 min 17 s

%%time 

ff = foyer.forcefields.load_OPLSAA()
templates = [Alkane(n=5), Alkane(n=10), Alkane(n=20)]
cmpd = mb.fill_box(templates, n_compounds=[100,100,100], box=[1000,1000,1000])
structure = cmpd.to_parmed()

canonical_foyer(ff, structure)

CPU times: user 2min 30s, sys: 1.27 s, total: 2min 31s
Wall time: 2min 31s

/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 14000, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)

<Structure 11100 atoms; 1 residues; 10800 bonds; PBC (orthogonal); parametrized>

%%time 

ff = foyer.forcefields.load_OPLSAA()
templates = [Alkane(n=5), Alkane(n=10), Alkane(n=20)]
cmpd = mb.fill_box(templates, n_compounds=[100,100,100], box=[1000,1000,1000])
structure = cmpd.to_parmed()

distributed_foyer_serial(ff, structure)

CPU times: user 4min 20s, sys: 1.05 s, total: 4min 21s
Wall time: 4min 20s

<Structure 11100 atoms; 300 residues; 10800 bonds; parametrized>

%%time 

ff = foyer.forcefields.load_OPLSAA()
templates = [Alkane(n=5), Alkane(n=10), Alkane(n=20)]
cmpd = mb.fill_box(templates, n_compounds=[100,100,100], box=[1000,1000,1000])
structure = cmpd.to_parmed()

distributed_foyer_parallel(ff, structure)

CPU times: user 4min 17s, sys: 972 ms, total: 4min 18s
Wall time: 4min 17s

<Structure 11100 atoms; 300 residues; 10800 bonds; parametrized>

Random heterogeneous system

200 alkanes with chain lengths drawn at random from 5 to 19 carbons

Method	Time
Canonical foyer	1 min 37 s
Distributed foyer serial	2 min 56 s
Distributed foyer parallel	3 min 1 s

import numpy as np
random_compounds = mb.Compound(subcompounds=[Alkane(n=i) for i in np.random.randint(5, high=20, size=200)])

/home/ayang41/programs/mbuild/mbuild/compound.py:2139: UserWarning: No simulation box detected for mdtraj.Trajectory <mdtraj.Trajectory with 1 frames, 3 atoms, 1 residues, without unitcells>
  "mdtraj.Trajectory {}".format(traj)
/home/ayang41/programs/mbuild/mbuild/compound.py:2139: UserWarning: No simulation box detected for mdtraj.Trajectory <mdtraj.Trajectory with 1 frames, 4 atoms, 1 residues, without unitcells>
  "mdtraj.Trajectory {}".format(traj)

%%time 

ff = foyer.forcefields.load_OPLSAA()

structure = random_compounds.to_parmed()

canonical_foyer(ff, structure, use_residue_map=False)

CPU times: user 1min 37s, sys: 158 ms, total: 1min 37s
Wall time: 1min 37s

<Structure 7663 atoms; 1 residues; 7463 bonds; PBC (orthogonal); parametrized>

%%time 

ff = foyer.forcefields.load_OPLSAA()

structure = random_compounds.to_parmed()

distributed_foyer_serial(ff, structure)

/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 28, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 44, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 60, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 24, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 56, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 52, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 64, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 68, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 36, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 32, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 76, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 72, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 48, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)


CPU times: user 2min 56s, sys: 839 ms, total: 2min 57s
Wall time: 2min 56s





<Structure 7663 atoms; 200 residues; 7463 bonds; parametrized>

%%time 

ff = foyer.forcefields.load_OPLSAA()

structure = random_compounds.to_parmed()

distributed_foyer_parallel(ff, structure)

CPU times: user 3min 1s, sys: 551 ms, total: 3min 2s
Wall time: 3min 1s

<Structure 7663 atoms; 200 residues; 7463 bonds; parametrized>

Making individual structures

Parallelization is slowing our operations down considerably. My hunch is that this is due to the extra steps involved in splitting up the molecular graphs.

When molecular modelers make these systems, we already know which collection of atoms and bonds forms a molecule, so we can use that to circumvent any use of connected components. In this iteration, we’ve added a shortcut where we already know the individual structures.
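To make the shortcut concrete, here is a minimal sketch in plain Python (not foyer or parmed code). If every atom already carries a molecule label, grouping atoms is a single pass; recovering the same groups from bonds alone needs a connected-components search. The function names and the toy bond list are hypothetical.

```python
from collections import defaultdict


def group_by_known_molecule(molecule_ids):
    """Group atom indices using labels we already have (one pass)."""
    groups = defaultdict(list)
    for atom_index, mol_id in enumerate(molecule_ids):
        groups[mol_id].append(atom_index)
    return sorted(groups.values())


def group_by_connected_components(n_atoms, bonds):
    """Recover the same groups from bonds alone using union-find."""
    parent = list(range(n_atoms))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for atom_index in range(n_atoms):
        groups[find(atom_index)].append(atom_index)
    return sorted(groups.values())


# Toy system: two 3-atom chains (atoms 0-2 and 3-5).
toy_molecule_ids = [0, 0, 0, 1, 1, 1]
toy_bonds = [(0, 1), (1, 2), (3, 4), (4, 5)]

known = group_by_known_molecule(toy_molecule_ids)
recovered = group_by_connected_components(len(toy_molecule_ids), toy_bonds)
```

Both routes give the same grouping here; the labelled route simply skips the graph search.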

Canonical foyer is still faster. For a parallel library comparison, I tried using multiprocessing but got infinite recursion errors, so multiprocessing was not as easy to use as dask for this particular application.

random_compounds = mb.Compound(subcompounds=[Alkane(n=i) for i in np.random.randint(5, high=20, size=200)])

%%time 

individual_structures = [cmpd.to_parmed() for cmpd in random_compounds.children]

param_structures =  [delayed(parametrize)(ff, struc) for struc in individual_structures]

computed_parametrized_structures = [d.compute() for d in param_structures]

final_structure = reduce(lambda x,y: x+y, computed_parametrized_structures)

final_structure

/home/ayang41/programs/mbuild/mbuild/compound.py:2527: UserWarning: No box specified and no Compound.box detected. Using Compound.boundingbox + 0.5 nm buffer. Setting all box angles to 90 degrees.
  "No box specified and no Compound.box detected. "
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 76, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 36, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 24, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 72, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 68, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 60, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 32, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 64, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 28, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 56, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 52, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 40, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 48, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)
/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 44, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)


CPU times: user 3min 19s, sys: 1.04 s, total: 3min 20s
Wall time: 3min 19s





<Structure 7735 atoms; 200 residues; 7535 bonds; PBC (orthogonal); parametrized>

param_structures[0].visualize(rankdir='LR')

%%time

one_structure = random_compounds.to_parmed()

canonical_foyer(ff, one_structure)

/home/ayang41/programs/mbuild/mbuild/compound.py:2527: UserWarning: No box specified and no Compound.box detected. Using Compound.boundingbox + 0.5 nm buffer. Setting all box angles to 90 degrees.
  "No box specified and no Compound.box detected. "

CPU times: user 1min 50s, sys: 1.05 s, total: 1min 51s
Wall time: 1min 51s

/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 9780, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers
  warnings.warn(msg)

<Structure 7735 atoms; 1 residues; 7535 bonds; PBC (orthogonal); parametrized>

Lessons and Takeaways

This was a little disheartening: no attempt to distribute foyer atom-typing or combine it with dask accelerated anything. There are several likely explanations:

We had to convert our structure to a graph, run a connected components algorithm (which has its own scaling issues), create separate parmed structures, then re-join/add the individual structures together. Each of those steps is bound to slow things down.
Data communication also plays a role here – communicating the molecular graphs and the entire structure to each dask worker will add some slowness to our pipeline.
Doing everything in one foyer function allows the use of caching, which we lose when executing the function lots of different times.
Even simplifying the pipeline didn’t show much improvement for the dask implementation.

There probably is room for the foyer API to be more accommodating for dask and other parallel computations, but it might require a refactoring effort to properly expose the functions-to-parallelize and utilize data structures/approaches more amenable to parallelization. Breaking up a large chemical system into smaller substructures didn’t seem to help.

In all honesty, since most molecular systems usually have fewer than a dozen different molecular species, just replicated into thousands of molecules, the best bet is to parametrize each molecular species once, then propagate the parameters appropriately, all in the canonical foyer style without any parallelization. The current foyer implementation already has implicit acceleration with caching, and networkx may already have some graph optimizations for subgraph isomorphisms, mitigating any need for us to explicitly decompose one big graph into lots of small connected components.
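The "parametrize each species once, then propagate" idea can be sketched with a plain dictionary cache. This is only an illustration under stated assumptions: `expensive_parametrize` is a hypothetical stand-in for an atom-typing call, not a foyer function, and molecules are identified by a species name.

```python
def expensive_parametrize(species_name):
    """Stand-in for an expensive atom-typing call (hypothetical)."""
    expensive_parametrize.calls += 1
    return {"species": species_name, "atom_types": f"types-for-{species_name}"}


expensive_parametrize.calls = 0


def parametrize_system(molecule_species):
    """Parametrize each unique species once, then reuse the result."""
    cache = {}
    parametrized = []
    for species in molecule_species:
        if species not in cache:
            cache[species] = expensive_parametrize(species)
        parametrized.append(cache[species])
    return parametrized


# 1000 molecules drawn from only three species.
system = ["water", "ethanol", "hexane"] * 300 + ["water"] * 100
result = parametrize_system(system)
```

For 1000 molecules the expensive call runs three times, once per species, no matter how many copies of each species the system holds.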

Notebooks can be found in this repo

Big data tools for MD simulation analysis

2020-05-13T00:00:00-05:00


(Updated 2020-05-15)

Trajectories are sets of coordinates over time. While gathering data and conducting simulations are exhaustively parallelized, some analysis methods are not. Speaking from experience, parallelizing analysis using Python multiprocessing can get very messy if you don’t have a clear idea of how you want to parallelize the analysis, and how exactly you’re going to code it up.

Here, I’m going to attempt to use some parallel libraries for MD trajectory analysis.

Some big data tools

Since grad school, I’ve been exposed to a variety of big data tools (Dask, Spark, Rapids), and it’s been a point of interest to test their utility for molecular simulation. Each tool comes with its own set of advantages and disadvantages, and I encourage everyone to actively try each to see which is most appropriate for the desired application.

Rapids is very fast, but requires GPUs. Depending on your tech stack and tech constraints, you may or may not have cheap and easy access to sufficient GPUs. Rapids is a little more sensitive to data types than the others - but as an amateur, I could be misusing the libraries.
Spark is fast, but requires some Hadoop and Spark know-how to stand up properly. Many tech stacks and constraints seem to be well-suited for Spark applications. Spark scales out well, is very flexible with datatypes, and spares the user a lot of parallel-programming know-how. At my own work, some primitive tests have shown that Spark outperforms Dask for dataframe operations on strings and some ML operations - but, as I am an amateur, there is probably some Dask tuning that could be done.
Dask is also fast, but your mileage may vary. Some tech stacks are suitable for Dask, but cloud resources/tech constraints might make Dask adoption hard. Dask exposes various levels of parallelism, so proper Dask users will end up learning a lot about parallel computing along the way.

I defer to this pydata video for a Dask, Rapids, Spark comparison

For those like me who are not used to setting up parallel compute

The one thing I will observe as I dabble away on my personal computer: I am familiar with neither setting up a Hadoop cluster nor exposing my GPU to WSL, and single-node pyspark is not going to be worth the overhead. Given the proper infrastructure and resources, I can use these libraries, but at this moment it would take time for me to set up the resources to properly utilize Spark or Rapids on my PC. Dask, in my case, seems like the simplest parallel compute library to use. If you’re a grad student or a data scientist unfamiliar with software environments and infrastructure beyond Conda environments, Dask might also be the easiest to adopt.

Computing atomic distances from a molecular dynamics simulation

Trivial MD analysis involves looking at each atom within a frame, without having to look at time correlations from frame to frame. I’m going to use MDTraj to load in a trajectory and look at distances between atoms in each frame. I’ll do this serially, with just MDTraj, and then with one level of Dask parallelism, Dask delayed.

import itertools as it
from pathlib import Path
import numpy as np


import mdtraj
import dask
from dask import delayed
import dask.bag as db

Saving myself the effort of generating my own trajectory, I will use one of the trajectories in MDTraj’s unit tests

path_to_data = Path('/home/ayang41/programs/mdtraj/tests/data')
tip3p_xtc = path_to_data / 'tip3p_300K_1ATM.xtc'
tip3p_pdb = path_to_data / 'tip3p_300K_1ATM.pdb'

This trajectory is only 401 frames - parallel analysis incurs too much overhead to be useful. I’m going to artificially lengthen the trajectory out to 1604 frames, where the gain from parallelization will hopefully be more apparent. In reality, most grad students will have many, many more frames to analyze.

traj = mdtraj.load(tip3p_xtc.as_posix(), top=tip3p_pdb.as_posix())
for i in range(2):
    traj = traj.join(traj)
traj

<mdtraj.Trajectory with 1604 frames, 774 atoms, 258 residues, and unitcells at 0x7f9bf4cce150>

Additionally, to load up the computational expense, I’ll look at all pairwise atomic distances in each frame

atom_pairs = [*it.permutations(np.arange(0, traj.n_atoms),2)]

Simple implementation with MDTraj

On my PC with 6 cores, this took about 23 seconds (and also nearly froze my computer).

It should be noted that MDTraj already does a lot of parallelization and acceleration under the hood with some C optimizations. “Simple”, in this case, means a user relying on MDTraj’s optimizations.

%%time

displacements = mdtraj.compute_displacements(traj, atom_pairs)

CPU times: user 5.94 s, sys: 17.5 s, total: 23.5 s
Wall time: 23.7 s

Combining Dask with MDTraj

Like most parallel computing applications, it’s important to recognize how and what you will be parallelizing/distributing. In this case, we will be distributing our one trajectory across 4 partitions, creating Delayed objects. Each Delayed object isn’t an actual execution - it’s a scheduled operation (like queueing something up in SLURM or PBS).

It helps that mdtraj.Trajectory objects can be sliced by frame, so we can easily break up the trajectory into 4 even-sized chunks with a Python list comprehension.
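One caveat with chunk sizes computed as `int(n_frames/4)`: 1604 divides evenly by 4, but any other length would silently drop the trailing frames. A minimal sketch of chunk boundaries that always cover every frame follows; `chunk_bounds` is a hypothetical helper, not part of MDTraj or Dask.

```python
def chunk_bounds(n_frames, n_chunks):
    """Return (start, stop) pairs that cover all frames as evenly as possible."""
    base, extra = divmod(n_frames, n_chunks)
    bounds = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional frame.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


even = chunk_bounds(1604, 4)    # the trajectory length used in this post
uneven = chunk_bounds(1606, 4)  # a length that 4 does not divide
```

Each `(start, stop)` pair could then be used as `traj[start:stop]` in place of the chunksize arithmetic.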

%%time
chunksize = int(traj.n_frames/4)
bag = db.from_sequence([traj[chunksize*i: chunksize*(i+1)] for i in range(4)] , npartitions=4)
bunch_of_delayed = bag.to_delayed()

CPU times: user 62.5 ms, sys: 172 ms, total: 234 ms
Wall time: 293 ms

bag

dask.bag<from_sequence, npartitions=4>

bunch_of_delayed

[Delayed(('from_sequence-b688539387c3c167fe82241b18a1670a', 0)),
 Delayed(('from_sequence-b688539387c3c167fe82241b18a1670a', 1)),
 Delayed(('from_sequence-b688539387c3c167fe82241b18a1670a', 2)),
 Delayed(('from_sequence-b688539387c3c167fe82241b18a1670a', 3))]

If we want to, we can still pluck out and execute the Delayed objects, and get the number of atoms with MDTraj-like syntax.

bunch_of_delayed[0].compute()[0].n_atoms

We can also validate that each Delayed object is computing a quarter of our trajectory

bunch_of_delayed[0].compute(), bunch_of_delayed[1].compute()

([<mdtraj.Trajectory with 401 frames, 774 atoms, 258 residues, and unitcells at 0x7fb555e2df50>],
 [<mdtraj.Trajectory with 401 frames, 774 atoms, 258 residues, and unitcells at 0x7fb2a1a12d10>])

To queue up additional computations, we will take each Delayed object, and add on one additional operation - mdtraj.compute_displacements. Now the delayed objects have two operations - distributing the trajectory and computing the displacements. It’s worth noting that none of these operations involved rewriting MDTraj code or adding function decorators. These MDTraj functions are wrapped using the Delayed objects

Again, the computation has not been performed yet

%%time
all_displacements = [delayed(mdtraj.compute_displacements)(traj[0], atom_pairs) for traj in bunch_of_delayed]
all_displacements

CPU times: user 26.5 s, sys: 2.55 s, total: 29 s
Wall time: 29.2 s

[Delayed('compute_displacements-c1ef5c08-6bb2-4508-8f1a-166000d2cd3e'),
 Delayed('compute_displacements-5a9fd8cd-2993-4c4b-be90-a2523e47c09a'),
 Delayed('compute_displacements-35c48042-fecf-4eb4-adc5-931c097b6e8d'),
 Delayed('compute_displacements-d8699960-98e0-4b74-a320-2b2e1f3870a9')]
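The deferred-execution idea can be illustrated with a toy wrapper in plain Python. To be clear, `Lazy` below is a made-up class for illustration and is not how Dask implements Delayed; it only shows that building the chain of calls runs nothing until `compute()` is called.

```python
class Lazy:
    """Toy stand-in for a deferred call; this is not Dask's Delayed class."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve any Lazy arguments first, then run the wrapped function.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


call_log = []


def load_chunk(i):
    call_log.append(f"load {i}")
    return list(range(i * 3, i * 3 + 3))


def total(chunk):
    call_log.append("total")
    return sum(chunk)


# Building the chain of calls runs nothing yet.
tasks = [Lazy(total, Lazy(load_chunk, i)) for i in range(2)]
calls_before = list(call_log)

# Only compute() triggers the work.
results = [t.compute() for t in tasks]
```

`calls_before` stays empty, and the log only fills in once `compute()` runs, which mirrors the queue-then-flush pattern used with the Delayed objects here.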

If we want to “flush” the queue and run all our Delayed computations, we use Dask to finally compute them.

At this point, the actual calculation took 3min 6s (hey, this is terrible!), but the overhead involved 27 seconds

%%time
displacements = dask.compute(all_displacements)

CPU times: user 17.8 s, sys: 27.9 s, total: 45.7 s
Wall time: 3min 6s

dask.compute returns a tuple with one entry per argument, so displacements[0] is the list of 4 results, and each result is a numpy array of shape 401 x 598302 x 3 (n_frames x n_atompairs x n_spatialdimensions).

len(displacements[0])

displacements[0][1].shape

(401, 598302, 3)
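A quick back-of-the-envelope check on that shape, as a sketch: the pair count follows from taking ordered pairs of 774 atoms, and the memory estimate assumes 4 bytes per value (32-bit floats), which is an assumption on my part rather than something verified here.

```python
import math

n_atoms = 774
n_frames_per_chunk = 401
n_chunks = 4
bytes_per_value = 4  # assumption: 32-bit floats

ordered_pairs = math.perm(n_atoms, 2)    # what it.permutations(..., 2) yields
unordered_pairs = math.comb(n_atoms, 2)  # what it.combinations(..., 2) would yield

values_per_chunk = n_frames_per_chunk * ordered_pairs * 3
gb_per_chunk = values_per_chunk * bytes_per_value / 1e9
gb_total = gb_per_chunk * n_chunks

print(ordered_pairs, unordered_pairs)
print(f"{gb_per_chunk:.2f} GB per chunk, {gb_total:.2f} GB for all {n_chunks} chunks")
```

Under that assumption each chunk's result is close to 3 GB and all four together exceed 11 GB, which could help explain why these runs strained my computer. Using combinations instead of permutations would halve the number of pairs, since each unordered pair is otherwise counted twice.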

Visualizing the dask graph

Spark and Dask both use task graphs to schedule function after function, with Spark doing some implicit optimizations.

Dask has a nice visualize functionality to show what the task graphs and parallelization look like for two of our Delayed objects

dask.visualize(all_displacements[0:2])

This Dask parallelization slowed the MDTraj operation down! What gives?

MDTraj is very well-optimized, so any attempts to distribute work end up slowing down the array multiplications

We’ll use our own, crude distance function that has no optimizations (and doesn’t obey the minimum image convention)

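def crude_distances(traj, atom_pairs):
    # Unoptimized on purpose: pure-Python loops over frames and atom pairs,
    # with no minimum image convention applied
    all_distances = []
    for frame in traj:
        distances = []
        for pair in atom_pairs:
            # Separation vector between the two atoms in this frame
            diff = frame.xyz[0, pair[0], :] - frame.xyz[0, pair[1], :]
            distance = np.sqrt(np.dot(diff, diff))
            distances.append(distance)
        all_distances.append(distances)
    return np.array(all_distances)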

%%time
traj = mdtraj.load(tip3p_xtc.as_posix(), top=tip3p_pdb.as_posix())
chunksize = int(traj.n_frames/4)
bag = db.from_sequence([traj[chunksize*i: chunksize*(i+1)] for i in range(4)] , npartitions=4)
bunch_of_delayed = bag.to_delayed()

CPU times: user 125 ms, sys: 0 ns, total: 125 ms
Wall time: 505 ms

atom_pairs = [*it.combinations(np.arange(0,100),2)]

%%time
all_displacements = [delayed(crude_distances)(traj[0], atom_pairs) for traj in bunch_of_delayed]
all_displacements

CPU times: user 156 ms, sys: 46.9 ms, total: 203 ms
Wall time: 169 ms

[Delayed('crude_distances-fb865e6f-232a-4a24-8a37-0b0f6ce13f22'),
 Delayed('crude_distances-438627d2-a181-4127-85a1-1cfbe99f64f6'),
 Delayed('crude_distances-543f6412-6dcc-4a30-922b-2f963e978a5d'),
 Delayed('crude_distances-78883eb2-520e-4c75-9af9-d06b82b746d1')]

%%time
output = dask.compute(all_displacements)

CPU times: user 54.6 s, sys: 1min, total: 1min 55s
Wall time: 1min 7s

%%time

output = crude_distances(traj, atom_pairs)

CPU times: user 1min 28s, sys: 1min 40s, total: 3min 8s
Wall time: 1min 51s

So there was a ~44 second speedup from the crude function (1min 51s serial versus 1min 7s with Dask) - that’s a small win.
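The arithmetic behind that comparison, using the two wall times reported above. The efficiency figure is only a rough sketch: it assumes all 4 partitions could have run at the same time.

```python
serial_seconds = 1 * 60 + 51   # wall time of the serial crude_distances run
dask_seconds = 1 * 60 + 7      # wall time of dask.compute on the 4 Delayed objects
n_partitions = 4

seconds_saved = serial_seconds - dask_seconds
speedup = serial_seconds / dask_seconds
# Assumption: all 4 partitions could have run at the same time.
efficiency = speedup / n_partitions

print(f"saved {seconds_saved} s, speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

A speedup well below 4x on 4 partitions suggests a good share of the time is going to overhead rather than to the distance calculation itself.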

And here’s the task graph for one of the Delayed objects

all_displacements[0].visualize()

Aiming for memory-efficiency

Up until now, we’ve had the whole trajectory loaded into memory prior to any parallelization with Dask. We can use MDTraj’s iterload function to read the trajectory in smaller chunks, but still pass different chunks around.

As another consideration for parallelization, increasing the number of disk reads will slow down your process, so make sure the gain from parallelization makes it worth it

%%time

delayed_load = db.from_sequence(a for a in mdtraj.iterload(tip3p_xtc.as_posix(), top=tip3p_pdb.as_posix())).to_delayed()

CPU times: user 172 ms, sys: 172 ms, total: 344 ms
Wall time: 312 ms

Confirming that each Delayed object has different frames

delayed_load[0].compute()[0].time, delayed_load[1].compute()[0].time

(array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
        13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
        26., 27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37., 38.,
        39., 40., 41., 42., 43., 44., 45., 46., 47., 48., 49., 50., 51.,
        52., 53., 54., 55., 56., 57., 58., 59., 60., 61., 62., 63., 64.,
        65., 66., 67., 68., 69., 70., 71., 72., 73., 74., 75., 76., 77.,
        78., 79., 80., 81., 82., 83., 84., 85., 86., 87., 88., 89., 90.,
        91., 92., 93., 94., 95., 96., 97., 98., 99.], dtype=float32),
 array([100., 101., 102., 103., 104., 105., 106., 107., 108., 109., 110.,
        111., 112., 113., 114., 115., 116., 117., 118., 119., 120., 121.,
        122., 123., 124., 125., 126., 127., 128., 129., 130., 131., 132.,
        133., 134., 135., 136., 137., 138., 139., 140., 141., 142., 143.,
        144., 145., 146., 147., 148., 149., 150., 151., 152., 153., 154.,
        155., 156., 157., 158., 159., 160., 161., 162., 163., 164., 165.,
        166., 167., 168., 169., 170., 171., 172., 173., 174., 175., 176.,
        177., 178., 179., 180., 181., 182., 183., 184., 185., 186., 187.,
        188., 189., 190., 191., 192., 193., 194., 195., 196., 197., 198.,
        199.], dtype=float32))

%%time
all_displacements = [delayed(crude_distances)(traj[0], atom_pairs) for traj in delayed_load]
all_displacements

CPU times: user 188 ms, sys: 93.8 ms, total: 281 ms
Wall time: 294 ms

[Delayed('crude_distances-d2f8fad8-663a-41b4-a97c-9277cc086fba'),
 Delayed('crude_distances-4a9116d4-96f0-4c35-bca1-0525622976c8'),
 Delayed('crude_distances-6aa987e4-0e71-462f-9293-55e6deed1425'),
 Delayed('crude_distances-cd230400-cf2f-4ad9-9cc7-ab573848e397'),
 Delayed('crude_distances-627969a1-2726-4f7e-87a9-7a97665c46b0')]

There is still a ~40 second gain for the crude distance calculation with Dask.

%%time
out = dask.compute(all_displacements)

CPU times: user 52.1 s, sys: 1min 3s, total: 1min 55s
Wall time: 1min 10s

%%time
all_displacements = []
for traj in mdtraj.iterload(tip3p_xtc.as_posix(), top=tip3p_pdb.as_posix()):
    all_displacements.append(crude_distances(traj, atom_pairs))

CPU times: user 1min 26s, sys: 1min 46s, total: 3min 13s
Wall time: 1min 51s

atom_pairs = [*it.combinations(np.arange(0, traj.n_atoms),2)]

delayed_load = db.from_sequence(a for a in mdtraj.iterload(tip3p_xtc.as_posix(), top=tip3p_pdb.as_posix())).to_delayed()

%%time
all_displacements = [delayed(mdtraj.compute_displacements)(traj[0], atom_pairs) for traj in delayed_load]

CPU times: user 15.6 s, sys: 688 ms, total: 16.3 s
Wall time: 16.4 s

%%time
out = dask.compute(all_displacements)

CPU times: user 7.98 s, sys: 938 ms, total: 8.92 s
Wall time: 8.92 s

%%time
all_displacements = []
for traj in mdtraj.iterload(tip3p_xtc.as_posix(), top=tip3p_pdb.as_posix()):
    all_displacements.append(mdtraj.compute_displacements(traj, atom_pairs))

CPU times: user 1.17 s, sys: 1.09 s, total: 2.27 s
Wall time: 2.26 s

Trying Dask distributed

We could try another level of parallelism using Dask’s distributed framework on a single node, but there appear to be Dask distributed issues with WSL.

Regardless, we can still see what happens

from distributed import Client

client = Client()
client

distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available

Client

Scheduler: tcp://127.0.0.1:54022
Dashboard: http://127.0.0.1:8787/status

Cluster

Workers: 3
Cores: 6
Memory: 17.11 GB


With default settings, we’re working with 3 workers across 6 cores.

We can see from the Dask dashboard that there are certainly concurrent operations, but the yellow operation (disk-read-compute_displacements) is adding a lot of overhead beyond that purple operation (the actual compute_displacements)

%%time
delayed_load = db.from_sequence(a for a in mdtraj.iterload(tip3p_xtc.as_posix(), top=tip3p_pdb.as_posix())).to_delayed()
all_displacements = [delayed(mdtraj.compute_displacements)(traj[0], atom_pairs) for traj in delayed_load]
out = dask.compute(all_displacements)

distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available


CPU times: user 37.6 s, sys: 12.4 s, total: 50 s
Wall time: 57.6 s

client.close()
client = Client(processes=False)
client

distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available
distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available

Client

Scheduler: inproc://192.168.0.15/667/12
Dashboard: http://192.168.0.15:8787/status

Cluster

Workers: 1
Cores: 6
Memory: 17.11 GB

Running all workers on the same process, there’s still some room for multithreading, but the same slow-downs rear their heads

%%time
delayed_load = db.from_sequence(a for a in mdtraj.iterload(tip3p_xtc.as_posix(), top=tip3p_pdb.as_posix())).to_delayed()
all_displacements = [delayed(mdtraj.compute_displacements)(traj[0], atom_pairs) for traj in delayed_load]
out = dask.compute(all_displacements)

distributed.utils_perf - WARNING - full garbage collections took 45% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 44% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 44% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 45% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 45% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 46% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 46% CPU time recently (threshold: 10%)


CPU times: user 51 s, sys: 1.22 s, total: 52.2 s
Wall time: 52.9 s

Takeaways from some Dask tests

The observations here were surprising, but maybe a good lesson before anyone immediately tries to jump into some big data tools

MDTraj is really performant

If you’re able to use MDTraj-optimized functions, use those. If you want to be memory efficient and stream trajectory data, use MDTraj for that; you don’t need to schedule loading different slices of a trajectory with Dask.

An optimized library can beat the bloat of a scheduler

Combining Dask + MDTraj was worse in all cases than just using MDTraj exclusively. Dask’s parallelization didn’t make anything run faster, and Dask’s delayed scheduling didn’t introduce anything better compared to MDTraj’s iterloading. This might be because of multiple reads, communication between workers, or the overhead of building out the task graph.
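One rough way to get a feel for the communication cost is to measure how large the list of atom pairs becomes when serialized. This is only a proxy: it uses the standard library's pickle on plain Python ints, whereas Dask may serialize things differently and the post builds its pairs from np.arange.

```python
import itertools as it
import pickle
import time

n_atoms = 774  # atom count of the water trajectory in this post

# Plain Python ints are used here; the post builds pairs from np.arange instead.
atom_pairs = list(it.permutations(range(n_atoms), 2))

start = time.perf_counter()
payload = pickle.dumps(atom_pairs)
elapsed = time.perf_counter() - start

print(f"{len(atom_pairs)} pairs -> {len(payload) / 1e6:.1f} MB pickled in {elapsed:.2f} s")
```

Anything of that size that has to be hashed, copied, or shipped once per task adds up quickly, separate from the cost of moving the trajectory chunks themselves.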

If the opportunity, resources, and need exist, optimizing a library can go farther than trying to lump Dask on top of any code. Dask + my-bad-distance-code made things faster than my-bad-distance-code alone, but my-bad-distance-code was completely devoid of optimization. Throw an optimized library like MDTraj in, and you likely won’t need Dask (or your poorly-written code!).

If you have a particularly unique function you don’t know how to optimize, then it’s time to think about what Dask can offer

MDTraj is great because it provides a set of common, optimized functions. For a lot of work in this field, there will be unique analyses that are not common to many MD libraries, and even when they are available, they may not be optimized. If both hold true for your particular studies, then your options become:

1) Optimize your analysis code. Simplify routines for time and space complexity, reduce for-loops if you can, reduce the number of read/write operations, write Cython/C/CUDA/compiled code.

2) Use a parallel/scheduler framework like Dask

If you’re not a (parallel) programming wiz or lack the time to become one, then option 2 may be for you
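As a small illustration of option 1, here is a sketch of the "reduce for-loops" idea: the same pairwise distances computed once with explicit Python loops and once with NumPy indexing. The coordinates are made-up random numbers, the function names are hypothetical, and neither version applies the minimum image convention.

```python
import numpy as np


def loop_distances(xyz, atom_pairs):
    """Pairwise distances with explicit Python loops."""
    out = np.empty((xyz.shape[0], len(atom_pairs)))
    for f in range(xyz.shape[0]):
        for p, (i, j) in enumerate(atom_pairs):
            diff = xyz[f, i] - xyz[f, j]
            out[f, p] = np.sqrt(np.dot(diff, diff))
    return out


def vectorized_distances(xyz, atom_pairs):
    """Same result using NumPy fancy indexing instead of loops."""
    pairs = np.asarray(atom_pairs)
    diff = xyz[:, pairs[:, 0], :] - xyz[:, pairs[:, 1], :]
    return np.linalg.norm(diff, axis=-1)


rng = np.random.default_rng(0)
xyz = rng.random((5, 10, 3))  # made-up coordinates: 5 frames, 10 atoms
atom_pairs = [(i, j) for i in range(10) for j in range(i + 1, 10)]

looped = loop_distances(xyz, atom_pairs)
vectorized = vectorized_distances(xyz, atom_pairs)
```

The vectorized version pushes the loops into compiled NumPy code, which is the same kind of gain an optimized library provides, at the cost of holding the full array of separation vectors in memory at once.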

It doesn’t help that we’re working with different data

A lot of Dask use-cases and APIs are built around arrays and dataframes, so there’s already a lot of built-in optimization for those data structures. There may be room to build a Dask-trajectory object that creates room for computational optimization (rather than stringing together a bunch of non-Dask operations) and that might be able to beat MDTraj.

Lastly, the notebook can be found here

Digging through some Folding@Home data

2020-05-06T00:00:00-05:00

Learning cheminformatics from some Folding@Home data

Top 10 (based on Hybrid2 docking score) small molecules

2020-05-06 - 2020-05-11

I have no formal training in cheminformatics, so I am going to be stumbling and learning as I wade through this dataset. I welcome any learning lessons from experts.

This will be an ongoing foray

Source: https://github.com/FoldingAtHome/covid-moonshot

Introduction

Folding@Home is a distributed computing project - allowing molecular simulations to be run in parallel across thousands of different computers with minimal communication. This, combined with other molecular modeling methods, has yielded a lot of open data for others to examine. In particular, I’m interested in the docking screens and compounds targeted by the F@H and PostEra collaborations.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
pd.options.display.max_columns = 999

moonshot_df = pd.read_csv('moonshot-submissions/covid_submissions_all_info.csv')

moonshot_df.head()

	SMILES	CID	creator	fragments	link	real_space	SCR	BB	extended_real_space	in_molport_or_mcule	in_ultimate_mcule	in_emolecules	covalent_frag	covalent_warhead	acrylamide	acrylamide_adduct	chloroacetamide	chloroacetamide_adduct	vinylsulfonamide	vinylsulfonamide_adduct	nitrile	nitrile_adduct	MW	cLogP	HBD	HBA	TPSA	BMS	Dundee	Glaxo	Inpharmatica	LINT	MLSMR	PAINS	SureChEMBL	PostEra	ORDERED	MADE	ASSAYED
0	CCN(Cc1cccc(-c2ccncc2)c1)C(=O)Cn1nnc2ccccc21	AAR-POS-8a4e0f60-1	Aaron Morris, PostEra	x0072	https://covid.postera.ai/covid/submissions/AAR...	Z1260533612	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	371.444	3.5420	0	5	63.91	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	True	False	False
1	O=C(Cn1nnc2ccccc21)NCc1ccc(Oc2cccnc2)c(F)c1	AAR-POS-8a4e0f60-10	Aaron Morris, PostEra	x0072	https://covid.postera.ai/covid/submissions/AAR...	Z826180044	FALSE	FALSE	s_22____1723102____13206668	False	False	False	False	False	False	False	False	False	False	False	False	False	377.379	3.0741	1	6	81.93	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	True	False	False
2	CN(Cc1nnc2ccccn12)C(=O)N(Cc1cccs1)c1ccc(Br)cc1	AAR-POS-8a4e0f60-11	Aaron Morris, PostEra	x0072	https://covid.postera.ai/covid/submissions/AAR...	FALSE	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	456.369	4.8119	0	5	53.74	PASS	PASS	PASS	Filter9_metal	aryl bromide	PASS	PASS	PASS	PASS	True	False	False
3	CCN(Cc1cccc(-c2ccncc2)c1)C(=O)Cc1noc2ccccc12	AAR-POS-8a4e0f60-2	Aaron Morris, PostEra	x0072	https://covid.postera.ai/covid/submissions/AAR...	Z1260535907	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	371.440	4.4810	0	4	59.23	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	True	False	False
4	O=C(NCc1noc2ccccc12)N(Cc1cccs1)c1ccc(F)cc1	AAR-POS-8a4e0f60-3	Aaron Morris, PostEra	x0072	https://covid.postera.ai/covid/submissions/AAR...	FALSE	FALSE	FALSE	s_272164____9388766____17338746	False	False	False	False	False	False	False	False	False	False	False	False	False	381.432	4.9448	1	4	58.37	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	True	False	False

The moonshot data has a lot of logging/metadata information, some one-hot-encoded information about functional groups, and some additional columns labelled BMS, Dundee, Glaxo, LINT, PAINS, SureChEMBL. I’m not sure what those additional columns mean; the values are mostly PASS, with the occasional named flag (for example, "aryl bromide"), so they are possibly the results of some other test or availability in other databases.

I’m going to focus on the molecular properties: MW, cLogP, HBD, HBA, TPSA

MW: Molecular Weight
cLogP: The logarithm of the partition coefficient (ratio of concentrations in octanol vs water, $\log{\frac{c_{octanol}}{c_{water}}}$)
HBD: Hydrogen bond donors
HBA: Hydrogen bond acceptors
TPSA: Topological polar surface area
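
To make the cLogP definition concrete, here is a minimal sketch of the partition-coefficient arithmetic. The function name and the concentrations are made up for illustration, and the logarithm is taken as base 10; the cLogP values in the dataset are computed estimates rather than measurements like this.

```python
import math

def log_partition_coefficient(c_octanol: float, c_water: float) -> float:
    """Return log10 of the octanol/water concentration ratio."""
    if c_octanol <= 0 or c_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(c_octanol / c_water)

# Hypothetical solute that is 100x more concentrated in octanol than in water
logp = log_partition_coefficient(c_octanol=10.0, c_water=0.1)
print(f"logP = {logp:.2f}")
```

A positive value means the compound prefers the octanol (greasy) phase; a negative value means it prefers water.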

Some of the correlations make some chemical sense: heavier molecules have more heavy atoms (O, N, F, etc.), and these heavy atoms are also the hydrogen bond acceptors. By that logic, more heavy atoms also coincide with more electronegative atoms, increasing TPSA. It’s a little convoluted because TPSA looks at the surface, not necessarily the volume of the compound; geometry/shape will influence TPSA. There don’t appear to be any strong correlations with cLogP. Partition coefficients are a complex function of polarity, size/sterics, and shape, so a clean correlation with any single other variable will be hard to pinpoint

This csv file doesn’t have much other numerical data, but maybe some of those true/false, pass/fail data might be relevant…but I definitely need more context here

fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
cols = ['MW', 'cLogP', 'HBD', 'HBA', 'TPSA']
ax.matshow(moonshot_df[cols].corr(), cmap='RdBu')

ax.set_xticks([i for i,_ in enumerate(cols)])
ax.set_xticklabels(cols)

ax.set_yticks([i for i,_ in enumerate(cols)])
ax.set_yticklabels(cols)

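for i, (rowname, row) in enumerate(moonshot_df[cols].corr().iterrows()):
    for j, (key, val) in enumerate(row.items()):  # Series.iteritems() was removed in pandas 2.0
        # matshow puts columns on x and rows on y, so cell (row i, column j) sits at xy=(j, i)
        ax.annotate(f"{val:0.2f}", xy=(j,i), xytext=(-10, -5), textcoords="offset points")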

Some docking results

Okay here’s a couple other CSVs I found, these include some docking scores

Repurposing scores: “The Drug Repurposing Hub is a curated and annotated collection of FDA-approved drugs, clinical trial drugs, and pre-clinical tool compounds with a companion information resource” source here, so a public dataset of some drugs
Redock scores: “This directory contains experiments in redocking all screened fragments into the entire ensemble of X-ray structures.” Taking fragments and re-docking them

repurposing_df = pd.read_csv('repurposing-screen/drugset-docked.csv')
redock_df = pd.read_csv('redock-fragments/all-screened-fragments-docked.csv')

SMILES strings, names, docking scores

repurposing_df.head()

	SMILES	TITLE	Hybrid2	docked_fragment	Mpro-_dock	site
0	C[C@@H](c1ccc-2c(c1)Cc3c2cccc3)C(=O)[O-]	CHEMBL2104122	-11.519580	x0749	0.509349	active-covalent
1	C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@]2(C#C)O)CCC4...	CHEMBL1387	-10.580162	x0749	2.706928	active-covalent
2	CC(C)(C)c1cc(cc(c1O)C(C)(C)C)/C=C\2/C(=O)NC(=[...	CHEMBL275835	-10.557229	x0107	1.801830	active-noncovalent
3	C[C@]12CC[C@@H]3[C@H]4CCCCC4=CC[C@H]3[C@@H]1CC...	CHEMBL2104104	-10.480992	x0749	3.791700	active-covalent
4	CC(=O)[C@]1(CC[C@@H]2[C@@]1(CCC3=C4CCC(=O)C=C4...	CHEMBL2104231	-10.430775	x0749	4.230903	active-covalent

Hybrid2 looks like a docking method provided via OpenEye. Mpro likely refers to the SARS-CoV-2 main protease (SARS-CoV-2 being the virus behind COVID-19). I’m not entirely sure what the receptor for “Hybrid2” is, but there seem to be multiple “sites” or “fragments” for docking. There are lots of different fragments, but very few sites. For each site-fragment combination, multiple small molecules may have been tested.

repurposing_df['docked_fragment'].value_counts()

x0195    114
x0749     69
x0678     58
x0397     45
x0104     24
x0161     21
x1077     19
x0072     14
x0874     13
x0354     13
x0689     10
x1382      7
x0708      4
x0434      4
x1093      3
x1392      2
x0395      2
x1402      2
x0831      2
x0107      2
x1385      2
x1418      2
x0387      2
x0830      2
x1478      1
x0786      1
x1187      1
x0692      1
x0967      1
x0426      1
x0305      1
x0946      1
x1386      1
x0759      1
Name: docked_fragment, dtype: int64

repurposing_df['site'].value_counts()

active-noncovalent    338
active-covalent       107
dimer-interface         1
Name: site, dtype: int64

repurposing_df.groupby(["docked_fragment", "site"]).count()

		SMILES	TITLE	Hybrid2	Mpro-_dock
docked_fragment	site
x0072	active-noncovalent	14	14	14	14
x0104	active-noncovalent	24	24	24	24
x0107	active-noncovalent	2	2	2	2
x0161	active-noncovalent	21	21	21	21
x0195	active-noncovalent	114	114	114	114
x0305	active-noncovalent	1	1	1	1
x0354	active-noncovalent	13	13	13	13
x0387	active-noncovalent	2	2	2	2
x0395	active-noncovalent	2	2	2	2
x0397	active-noncovalent	45	45	45	45
x0426	active-noncovalent	1	1	1	1
x0434	active-noncovalent	4	4	4	4
x0678	active-noncovalent	58	58	58	58
x0689	active-covalent	10	10	10	10
x0692	active-covalent	1	1	1	1
x0708	active-covalent	4	4	4	4
x0749	active-covalent	69	69	69	69
x0759	active-covalent	1	1	1	1
x0786	active-covalent	1	1	1	1
x0830	active-covalent	2	2	2	2
x0831	active-covalent	2	2	2	2
x0874	active-noncovalent	13	13	13	13
x0946	active-noncovalent	1	1	1	1
x0967	active-noncovalent	1	1	1	1
x1077	active-noncovalent	19	19	19	19
x1093	active-noncovalent	3	3	3	3
x1187	dimer-interface	1	1	1	1
x1382	active-covalent	7	7	7	7
x1385	active-covalent	2	2	2	2
x1386	active-covalent	1	1	1	1
x1392	active-covalent	2	2	2	2
x1402	active-covalent	2	2	2	2
x1418	active-covalent	2	2	2	2
x1478	active-covalent	1	1	1	1

Some molecules show up multiple times - why? Upon further investigation, this is mainly due to the molecule’s presence in multiple databases

repurposing_df.groupby(['SMILES']).count().sort_values("TITLE")

	TITLE	Hybrid2	docked_fragment	Mpro-_dock	site
SMILES
B(CCCC)(O)O	1	1	1	1	1
CCCc1ccccc1N	1	1	1	1	1
CCCc1cc(=O)[nH]c(=S)[nH]1	1	1	1	1	1
CCC[N@@H+]1CCO[C@H]2[C@H]1CCc3c2cc(cc3)O	1	1	1	1	1
CCC[N@@H+]1CCC[C@H]2[C@H]1Cc3c[nH]nc3C2	1	1	1	1	1
...	...	...	...	...	...
C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@]2(C#C)O)CCC4=CC(=O)CC[C@H]34	2	2	2	2	2
C[C@]12CC[C@H]3[C@H]([C@@H]1CCC2=O)CC(=C)C4=CC(=O)C=C[C@]34C	2	2	2	2	2
CC(C)C[C@@H](C1(CCC1)c2ccc(cc2)Cl)[NH+](C)C	2	2	2	2	2
CC[C@](/C=C/Cl)(C#C)O	2	2	2	2	2
CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CCC4=CC(=O)CC[C@H]34	2	2	2	2	2

432 rows × 5 columns

repurposing_df[repurposing_df['SMILES']=="CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CCC4=CC(=O)CC[C@H]34"]

	SMILES	TITLE	Hybrid2	docked_fragment	Mpro-_dock	site
82	CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CC...	CHEMBL2107797	-9.002963	x0749	2.616094	active-covalent
105	CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CC...	EDRUG178	-8.705896	x0104	2.248707	active-noncovalent
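
A quick way to pull out every repeated molecule at once, rather than sorting counts and looking one up by hand, is `DataFrame.duplicated` with `keep=False`. This is a minimal sketch on a toy frame; the SMILES and titles below are placeholders, not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for repurposing_df; the SMILES and titles here are made up
toy = pd.DataFrame({
    "SMILES": ["CCO", "CCO", "CCN", "c1ccccc1"],
    "TITLE": ["CHEMBL_A", "EDRUG_A", "CHEMBL_B", "DB_C"],
})

# keep=False marks every member of a duplicated group, not just the repeats
dupes = toy[toy.duplicated("SMILES", keep=False)]
print(dupes)
```

Applied to the real frame, the same call would list each duplicated SMILES next to the different database identifiers it appears under.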

There doesn’t seem to be a very good correlation between the two docking scores (a Pearson coefficient of about 0.58). If these are docking scores to different receptors, that would help explain things. It’s worth noting that the correlation doesn’t test whether the two numbers agree for each molecule, only whether the trends persist (both scores go up for one molecule and both go down for another). The weak correlation suggests the trends do not persist between the two docking measures

repurposing_df[['Hybrid2', 'Mpro-_dock']].corr()

	Hybrid2	Mpro-_dock
Hybrid2	1.000000	0.581966
Mpro-_dock	0.581966	1.000000
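
The distinction between "the numbers agree" and "the trends persist" can be seen in a small sketch. The scores below are made up: the second column is a shifted and rescaled copy of the first, so the two never agree numerically, yet the Pearson coefficient from `.corr()` is 1.

```python
import pandas as pd

# Made-up docking scores, not taken from the dataset
score_a = pd.Series([-11.5, -10.6, -9.0, -7.2, -5.1])
# A second "method" on a different scale and offset, but with the same trend
score_b = 0.5 * score_a + 6.0

toy = pd.DataFrame({"score_a": score_a, "score_b": score_b})
pearson = toy.corr().loc["score_a", "score_b"]
max_gap = (toy["score_a"] - toy["score_b"]).abs().max()

print(f"Pearson r = {pearson:.3f}, largest disagreement = {max_gap:.2f}")
```

So a high correlation would not mean the two docking scores are interchangeable, and a low one (as here) means even the ordering of molecules is only loosely shared.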

Redocking dataframe: SMILES, names, data collection information, docking scores

redock_df.head()

	SMILES	TITLE	fragments	CompoundCode	Unnamed: 4	covalent_warhead	MountingResult	DataCollectionOutcome	DataProcessingResolutionHigh	RefinementOutcome	Deposition_PDB_ID	Hybrid2	docked_fragment	Mpro-x0500_dock	site
0	c1ccc(c(c1)NCc2ccn[nH]2)F	x0500	x0500	Z1545196403	NaN	False	OK: No comment:No comment	success	2.19	7 - Analysed & Rejected	NaN	-11.881923	x0678	-2.501554	active-noncovalent
1	Cc1ccccc1OCC(=O)Nc2ncccn2	x0415	x0415	Z53834613	NaN	False	OK: No comment:No comment	success	1.62	7 - Analysed & Rejected	NaN	-11.622278	x0678	NaN	active-noncovalent
2	Cc1csc(n1)CNC(=O)c2ccn[nH]2	x0356	x0356	Z466628048	NaN	False	OK: No comment:No comment	success	3.25	7 - Analysed & Rejected	NaN	-11.435024	x0678	NaN	active-noncovalent
3	Cc1csc(n1)CNC(=O)c2ccn[nH]2	x1113	x1113	Z466628048	NaN	False	OK: No comment:No comment	success	1.57	7 - Analysed & Rejected	NaN	-11.435024	x0678	NaN	active-noncovalent
4	c1cc(cnc1)NC(=O)CC2CCCCC2	x0678	x0678	Z31792168	NaN	False	Mounted_Clear	success	1.83	6 - Deposited	5R84	-11.355046	x0678	NaN	active-noncovalent

There don’t seem to be many Mpro docking scores in this dataset (only one molecule has a non-null Mpro docking score)

redock_df[redock_df['Mpro-x0500_dock'].isnull()].count()

SMILES                          1452
TITLE                           1452
fragments                       1452
CompoundCode                    1452
Unnamed: 4                         0
covalent_warhead                1452
MountingResult                  1452
DataCollectionOutcome           1452
DataProcessingResolutionHigh    1357
RefinementOutcome               1306
Deposition_PDB_ID                 78
Hybrid2                         1452
docked_fragment                 1452
Mpro-x0500_dock                    0
site                            1452
dtype: int64

redock_df[~redock_df['Mpro-x0500_dock'].isnull()].count()

SMILES                          1
TITLE                           1
fragments                       1
CompoundCode                    1
Unnamed: 4                      0
covalent_warhead                1
MountingResult                  1
DataCollectionOutcome           1
DataProcessingResolutionHigh    1
RefinementOutcome               1
Deposition_PDB_ID               0
Hybrid2                         1
docked_fragment                 1
Mpro-x0500_dock                 1
site                            1
dtype: int64
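
If only the number of missing and present scores is needed, the two filtered `.count()` calls can be condensed. This is a sketch on a toy frame with illustrative values; only the column name is taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for redock_df; values are illustrative only
toy = pd.DataFrame({
    "Hybrid2": [-11.9, -11.6, -11.4],
    "Mpro-x0500_dock": [-2.5, np.nan, np.nan],
})

n_missing = toy["Mpro-x0500_dock"].isna().sum()
n_present = toy["Mpro-x0500_dock"].notna().sum()
print(f"missing: {n_missing}, present: {n_present}")
```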

Are there overlaps in the molecules in each of these datasets?

repurpose_redock = repurposing_df.merge(redock_df, on='SMILES', how='inner',suffixes=("_L", "_R"))

moonshot_redock = moonshot_df.merge(redock_df, on='SMILES', how='inner',suffixes=("_L", "_R"))

repurpose_redock

	SMILES	TITLE_L	Hybrid2_L	docked_fragment_L	Mpro-_dock	site_L	TITLE_R	fragments	CompoundCode	Unnamed: 4	covalent_warhead	MountingResult	DataCollectionOutcome	DataProcessingResolutionHigh	RefinementOutcome	Deposition_PDB_ID	Hybrid2_R	docked_fragment_R	Mpro-x0500_dock	site_R
0	Cc1cc(=O)n([nH]1)c2ccccc2	CHEMBL290916	-7.889587	x0195	-2.068452	active-noncovalent	x0297	x0297	Z50145861	NaN	False	OK: No comment:No comment	success	1.98	7 - Analysed & Rejected	NaN	-7.889587	x0195	NaN	active-noncovalent
1	CC(C)Nc1ncccn1	CHEMBL1740513	-7.178702	x0072	-1.248482	active-noncovalent	x0583	x0583	Z31190928	NaN	False	OK: No comment:No comment	success	3.08	7 - Analysed & Rejected	NaN	-7.293537	x1093	NaN	active-noncovalent
2	CC(C)Nc1ncccn1	CHEMBL1740513	-7.178702	x0072	-1.248482	active-noncovalent	x1102	x1102	Z31190928	NaN	False	OK: No comment:No comment	success	1.46	7 - Analysed & Rejected	NaN	-7.293537	x1093	NaN	active-noncovalent
3	C[C@H](C(=O)[O-])O	CHEMBL1200559	-5.675188	x0397	-0.179049	active-noncovalent	x1035	x1035	Z1741982441	NaN	False	OK: No comment:No comment	Failed - no diffraction	NaN	NaN	NaN	-6.505556	x0397	NaN	active-noncovalent
4	CC(=O)C(=O)[O-]	DB00119	-5.448891	x0689	-0.494791	active-covalent	x1037	x1037	Z1741977082	NaN	False	OK: No comment:No comment	Failed - no diffraction	NaN	NaN	NaN	-5.448891	x0689	NaN	active-covalent
5	CCC(=O)[O-]	CHEMBL14021	-5.374838	x0397	-0.555688	active-noncovalent	x1029	x1029	Z955123616	NaN	False	OK: No comment:No comment	success	1.73	7 - Analysed & Rejected	NaN	-5.135675	x0689	NaN	active-covalent
6	C1CNCC[NH2+]1	CHEMBL1412	-5.079155	x0354	1.716032	active-noncovalent	x0996	x0996	Z1245537944	NaN	False	OK: No comment:No comment	success	1.96	7 - Analysed & Rejected	NaN	-4.675085	x0354	NaN	active-noncovalent
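
The repeated CC(C)Nc1ncccn1 rows in this table come from how an inner merge pairs rows: one output row per matching left/right pair. A minimal sketch with toy frames (the second left-hand row and its title are hypothetical) shows the effect and one way to collapse it.

```python
import pandas as pd

# Toy frames; the first SMILES matches the duplicated rows above, the rest is placeholder
left = pd.DataFrame({"SMILES": ["CC(C)Nc1ncccn1", "CCO"],
                     "TITLE": ["CHEMBL1740513", "HYPOTHETICAL_1"]})
right = pd.DataFrame({"SMILES": ["CC(C)Nc1ncccn1", "CC(C)Nc1ncccn1"],
                      "TITLE": ["x0583", "x1102"]})

merged = left.merge(right, on="SMILES", how="inner", suffixes=("_L", "_R"))
print(merged)

# One row per unique molecule, if that is what the comparison needs
unique_molecules = merged.drop_duplicates("SMILES")
```

Whether to keep or drop the repeats depends on the question: the two redock entries are separate crystal-soaking experiments of the same compound, so they may be worth keeping when looking at the experimental columns.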

We joined on SMILES string, and now we can compare the docking scores between the repurposing and redocking datasets.

Some Hybrid2 scores are identical between the two datasets, and for those that differ the ranking is still preserved. Looking at the SARS-CoV-2 main protease (Mpro, I believe), the docking scores don’t follow quite the same ranking: rows 3 to 5 come out in a different order. Docking scores aren’t transferable to different receptors (this might be a fairly obvious observation)

repurpose_redock[['SMILES', "TITLE_L", "TITLE_R", "Hybrid2_L", "Hybrid2_R", 'Mpro-_dock', 'Mpro-x0500_dock']]

	SMILES	TITLE_L	TITLE_R	Hybrid2_L	Hybrid2_R	Mpro-_dock	Mpro-x0500_dock
0	Cc1cc(=O)n([nH]1)c2ccccc2	CHEMBL290916	x0297	-7.889587	-7.889587	-2.068452	NaN
1	CC(C)Nc1ncccn1	CHEMBL1740513	x0583	-7.178702	-7.293537	-1.248482	NaN
2	CC(C)Nc1ncccn1	CHEMBL1740513	x1102	-7.178702	-7.293537	-1.248482	NaN
3	C[C@H](C(=O)[O-])O	CHEMBL1200559	x1035	-5.675188	-6.505556	-0.179049	NaN
4	CC(=O)C(=O)[O-]	DB00119	x1037	-5.448891	-5.448891	-0.494791	NaN
5	CCC(=O)[O-]	CHEMBL14021	x1029	-5.374838	-5.135675	-0.555688	NaN
6	C1CNCC[NH2+]1	CHEMBL1412	x0996	-5.079155	-4.675085	1.716032	NaN
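
One way to put a number on "the ranking is still there" is a rank (Spearman) correlation. The sketch below copies the scores from the table above and computes it as the Pearson correlation of the ranks. With only seven rows, two of which are the same molecule, this is an illustration rather than evidence.

```python
import pandas as pd

# Scores copied from the repurpose_redock table above
scores = pd.DataFrame({
    "Hybrid2_L": [-7.889587, -7.178702, -7.178702, -5.675188,
                  -5.448891, -5.374838, -5.079155],
    "Hybrid2_R": [-7.889587, -7.293537, -7.293537, -6.505556,
                  -5.448891, -5.135675, -4.675085],
    "Mpro-_dock": [-2.068452, -1.248482, -1.248482, -0.179049,
                   -0.494791, -0.555688, 1.716032],
})

# Spearman correlation is the Pearson correlation of the (tie-averaged) ranks
rank_corr = scores.rank().corr()
print(rank_corr.round(3))
```

The two Hybrid2 columns rank the molecules identically, while the Mpro column comes out below 1 because of the reordering among rows 3 to 5.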

Joining the moonshot submission and redocking datasets does not yield too many overlapping molecules

moonshot_redock

	SMILES	CID	creator	fragments_L	link	real_space	SCR	BB	extended_real_space	in_molport_or_mcule	in_ultimate_mcule	in_emolecules	covalent_frag	covalent_warhead_L	acrylamide	acrylamide_adduct	chloroacetamide	chloroacetamide_adduct	vinylsulfonamide	vinylsulfonamide_adduct	nitrile	nitrile_adduct	MW	cLogP	HBD	HBA	TPSA	BMS	Dundee	Glaxo	Inpharmatica	LINT	MLSMR	PAINS	SureChEMBL	PostEra	ORDERED	MADE	ASSAYED	TITLE	fragments_R	CompoundCode	Unnamed: 4	covalent_warhead_R	MountingResult	DataCollectionOutcome	DataProcessingResolutionHigh	RefinementOutcome	Deposition_PDB_ID	Hybrid2	docked_fragment	Mpro-x0500_dock	site
0	CC(C)Nc1cccnc1	MAK-UNK-2c1752f0-4	Maksym Voznyy	x1093	https://covid.postera.ai/covid/submissions/MAK...	FALSE	Z2574930241	EN300-56005	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	136.198	1.9019	1	2	24.92	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	False	False	False	x1098	x1098	Z1259341037	NaN	False	OK: No comment:No comment	success	1.66	7 - Analysed & Rejected	NaN	-7.474369	x0678	NaN	active-noncovalent
1	CC(C)Nc1cccnc1	MAK-UNK-2c1752f0-4	Maksym Voznyy	x1093	https://covid.postera.ai/covid/submissions/MAK...	FALSE	Z2574930241	EN300-56005	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	136.198	1.9019	1	2	24.92	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	False	False	False	x0572	x0572	Z1259341037	NaN	False	OK: No comment:No comment	success	2.98	7 - Analysed & Rejected	NaN	-7.474369	x0678	NaN	active-noncovalent
2	CCS(=O)(=O)Nc1ccccc1F	MAK-UNK-2c1752f0-5	Maksym Voznyy	x1093	https://covid.postera.ai/covid/submissions/MAK...	FALSE	Z53825177	EN300-116204	FALSE	False	True	False	False	False	False	False	False	False	False	False	False	False	203.238	1.5873	1	2	46.17	PASS	PASS	PASS	PASS	PASS	Hetero_hetero	PASS	PASS	PASS	False	False	False	x0247	x0247	Z53825177	NaN	False	OK: No comment:No comment	success	1.83	7 - Analysed & Rejected	NaN	-7.413380	x0678	NaN	active-noncovalent

Comparing other databases

CHEMBL, DrugBank, and “EDrug”(?) look to be the 3 prefixes in the “TITLE” column

from chembl_webresource_client.new_client import new_client
molecule = new_client.molecule
res = molecule.search('CHEMBL1387')

res_df = pd.DataFrame.from_dict(res)

res_df.columns

Index(['atc_classifications', 'availability_type', 'biotherapeutic',
       'black_box_warning', 'chebi_par_id', 'chirality', 'cross_references',
       'dosed_ingredient', 'first_approval', 'first_in_class', 'helm_notation',
       'indication_class', 'inorganic_flag', 'max_phase', 'molecule_chembl_id',
       'molecule_hierarchy', 'molecule_properties', 'molecule_structures',
       'molecule_synonyms', 'molecule_type', 'natural_product', 'oral',
       'parenteral', 'polymer_flag', 'pref_name', 'prodrug', 'score',
       'structure_type', 'therapeutic_flag', 'topical', 'usan_stem',
       'usan_stem_definition', 'usan_substem', 'usan_year', 'withdrawn_class',
       'withdrawn_country', 'withdrawn_flag', 'withdrawn_reason',
       'withdrawn_year'],
      dtype='object')

res_df[['chirality', 'molecule_properties', 'molecule_structures', 'score']]

	chirality	molecule_properties	molecule_structures	score
0	1	{'alogp': '3.64', 'aromatic_rings': 0, 'cx_log...	{'canonical_smiles': 'C#C[C@]1(O)CC[C@H]2[C@@H...	17.0

res_df[['molecule_properties']].values[0]

array([{'alogp': '3.64', 'aromatic_rings': 0, 'cx_logd': '2.81', 'cx_logp': '2.81', 'cx_most_apka': None, 'cx_most_bpka': None, 'full_molformula': 'C20H26O2', 'full_mwt': '298.43', 'hba': 2, 'hba_lipinski': 2, 'hbd': 1, 'hbd_lipinski': 1, 'heavy_atoms': 22, 'molecular_species': None, 'mw_freebase': '298.43', 'mw_monoisotopic': '298.1933', 'num_lipinski_ro5_violations': 0, 'num_ro5_violations': 0, 'psa': '37.30', 'qed_weighted': '0.55', 'ro3_pass': 'N', 'rtb': 0}],
      dtype=object)

res_df['molecule_properties'].apply(pd.Series)

	alogp	aromatic_rings	cx_logd	cx_logp	cx_most_apka	cx_most_bpka	full_molformula	full_mwt	hba	hba_lipinski	hbd	hbd_lipinski	heavy_atoms	molecular_species	mw_freebase	mw_monoisotopic	num_lipinski_ro5_violations	num_ro5_violations	psa	qed_weighted	ro3_pass	rtb
0	3.64	0	2.81	2.81	None	None	C20H26O2	298.43	2	2	1	1	22	None	298.43	298.1933	0	0	37.30	0.55	N	0
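
An alternative to `.apply(pd.Series)` for unpacking nested dictionaries is `pd.json_normalize`, which flattens every nested key in one pass. This is a sketch on trimmed-down records shaped like the search results; only a few keys are kept, with values copied from the output shown in this post.

```python
import pandas as pd

# Trimmed-down records shaped like the ChEMBL search results
records = [
    {"molecule_chembl_id": "CHEMBL1387",
     "molecule_properties": {"alogp": "3.64", "hba": 2, "hbd": 1}},
    {"molecule_chembl_id": "CHEMBL2104122",
     "molecule_properties": {"alogp": "3.45", "hba": 1, "hbd": 1}},
]

# Nested keys become dotted column names such as molecule_properties.alogp
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

This keeps the ChEMBL ID in the same row as its properties, which avoids having to re-attach identifiers afterwards.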

all_results = [molecule.search(a) for a in repurposing_df['TITLE']]

Here’s a big Python function tangent.

For each ChEMBL ID, we’ve searched the ChEMBL database, which returns a list (of length 1) containing a dictionary of properties.

All molecules have been compiled into a list, so we have a list of lists of dictionaries.

For sanity, we can use a Python filter to only retain the non-None results.

We can chain that with a Python map function to parse the first item from each molecule’s list. Recall, each molecule was a list with just one element, a dictionary. We can boil this down to only returning the dictionary (eliminating the list wrapper).

For validation, I’ve called next to look at the results

filtered = map(lambda x: x[0], filter(lambda x: x is not None, all_results))

next(filtered)

{'atc_classifications': [],
 'availability_type': -1,
 'biotherapeutic': None,
 'black_box_warning': 0,
 'chebi_par_id': None,
 'chirality': 0,
 'cross_references': [],
 'dosed_ingredient': False,
 'first_approval': None,
 'first_in_class': 0,
 'helm_notation': None,
 'indication_class': 'Anti-Inflammatory',
 'inorganic_flag': 0,
 'max_phase': 0,
 'molecule_chembl_id': 'CHEMBL2104122',
 'molecule_hierarchy': {'molecule_chembl_id': 'CHEMBL2104122',
  'parent_chembl_id': 'CHEMBL2104122'},
 'molecule_properties': {'alogp': '3.45',
  'aromatic_rings': 2,
  'cx_logd': '1.26',
  'cx_logp': '3.92',
  'cx_most_apka': '4.68',
  'cx_most_bpka': None,
  'full_molformula': 'C16H14O2',
  'full_mwt': '238.29',
  'hba': 1,
  'hba_lipinski': 2,
  'hbd': 1,
  'hbd_lipinski': 1,
  'heavy_atoms': 18,
  'molecular_species': 'ACID',
  'mw_freebase': '238.29',
  'mw_monoisotopic': '238.0994',
  'num_lipinski_ro5_violations': 0,
  'num_ro5_violations': 0,
  'psa': '37.30',
  'qed_weighted': '0.74',
  'ro3_pass': 'N',
  'rtb': 2},
 'molecule_structures': {'canonical_smiles': 'CC(C(=O)O)c1ccc2c(c1)Cc1ccccc1-2',
  'molfile': '\n     RDKit          2D\n\n 18 20  0  0  0  0  0  0  0  0999 V2000\n   -0.5375    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -0.5375    1.1083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.4458    1.1083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.4458    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    1.3625    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.4875   -0.5125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.4125   -0.5125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    3.3292    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.4125    1.6500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    2.3417   -0.5292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    1.3625    1.1083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    3.3500    1.1958    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n    4.2167   -0.6292    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.3958    1.6500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.3958   -0.5125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    2.3417   -1.6417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -4.3458    1.1083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -4.3458    0.0250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n  2  1  2  0\n  3  2  1  0\n  4  6  1  0\n  5  7  2  0\n  6  1  1  0\n  7  1  1  0\n  8 10  1  0\n  9  2  1  0\n 10  5  1  0\n 11  5  1  0\n 12  8  2  0\n 13  8  1  0\n 14  3  1  0\n 15  4  1  0\n 16 10  1  0\n 17 14  2  0\n 18 15  2  0\n  3  4  2  0\n  9 11  2  0\n 17 18  1  0\nM  END\n\n> <chembl_id>\nCHEMBL2104122\n\n> <chembl_pref_name>\nCICLOPROFEN\n\n',
  'standard_inchi': 'InChI=1S/C16H14O2/c1-10(16(17)18)11-6-7-15-13(8-11)9-12-4-2-3-5-14(12)15/h2-8,10H,9H2,1H3,(H,17,18)',
  'standard_inchi_key': 'LRXFKKPEBXIPMW-UHFFFAOYSA-N'},
 'molecule_synonyms': [{'molecule_synonym': 'Cicloprofen',
   'syn_type': 'BAN',
   'synonyms': 'CICLOPROFEN'},
  {'molecule_synonym': 'Cicloprofen',
   'syn_type': 'INN',
   'synonyms': 'CICLOPROFEN'},
  {'molecule_synonym': 'Cicloprofen',
   'syn_type': 'USAN',
   'synonyms': 'CICLOPROFEN'},
  {'molecule_synonym': 'SQ-20824',
   'syn_type': 'RESEARCH_CODE',
   'synonyms': 'SQ 20824'}],
 'molecule_type': 'Small molecule',
 'natural_product': 0,
 'oral': False,
 'parenteral': False,
 'polymer_flag': False,
 'pref_name': 'CICLOPROFEN',
 'prodrug': 0,
 'score': 16.0,
 'structure_type': 'MOL',
 'therapeutic_flag': False,
 'topical': False,
 'usan_stem': '-profen',
 'usan_stem_definition': 'anti-inflammatory/analgesic agents (ibuprofen type)',
 'usan_substem': '-profen',
 'usan_year': 1974,
 'withdrawn_class': None,
 'withdrawn_country': None,
 'withdrawn_flag': False,
 'withdrawn_reason': None,
 'withdrawn_year': None}
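
The filter/map chain above can be tried out on toy data. One caveat, stated as an assumption: if a missed search comes back as an empty list rather than None (the length check used later in this post suggests it does), then `x is not None` lets it through and `x[0]` raises an IndexError. Filtering on truthiness covers both cases. The records below are placeholders.

```python
# Toy stand-in for all_results: each entry is a list of result dicts
toy_results = [
    [{"molecule_chembl_id": "CHEMBL_A"}],
    [],                                   # a search that found nothing
    [{"molecule_chembl_id": "CHEMBL_B"}],
]

# filter/map version; bool(x) is False for both None and empty lists
first_hits_lazy = map(lambda x: x[0], filter(bool, toy_results))

# Equivalent list comprehension
first_hits = [x[0] for x in toy_results if x]

print(next(first_hits_lazy))
print(first_hits)
```

The map/filter version is lazy, so nothing is evaluated until `next` or `list` is called; the list comprehension does all the work up front.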

For now, I’m only really interested in the molecule_properties dictionary

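# Pair each title with its search result so rows stay aligned when a search comes back empty
hits = [(title, res[0]['molecule_properties'])
        for title, res in zip(repurposing_df['TITLE'], all_results) if len(res) > 0]
filtered = [props for _, props in hits]

chembl_df = pd.DataFrame(filtered)
chembl_df['TITLE'] = [title for title, _ in hits]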

Molecular properties contained in the chembl database

Here are the definitions I can dig up

alogp: (lipophilicity) partition coefficient
aromatic_rings: number of aromatic rings
cx_logd: distribution coefficient taking into account ionized and non-ionized forms
cx_most_apka: acidic pKa
cx_most_bpka: basic pKa
full_mwt: molecular weight (and also free base and monoisotopic masses)
hba: hydrogen bond acceptors (and hba_lipinski for the Lipinski definition)
hbd: hydrogen bond donors (and hbd_lipinski)
heavy_atoms: number of heavy atoms
num_lipinski_ro5_violations: how many of the criteria in Lipinski’s rule of five this molecule violates
num_ro5_violations: not sure; it appears to be the same rule-of-five count, using the hba/hbd definitions instead of the Lipinski ones
psa: polar surface area
qed_weighted: “quantitative estimate of druglikeness” (ranges between 0 and 1, with 1 being more favorable). This is based on a weighted geometric mean of desirability functions
ro3_pass: whether the molecule passes the rule of three
rtb: number of rotatable bonds
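
As a rough illustration of what the rule-of-five counts measure, here is a minimal sketch using the commonly quoted thresholds (molecular weight over 500, logP over 5, more than 5 hydrogen bond donors, more than 10 acceptors). The helper name is hypothetical, and ChEMBL’s own counts may differ in which logP estimate and donor/acceptor definitions they use.

```python
def count_ro5_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count violations of Lipinski's rule of five (hypothetical helper)."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        logp > 5,           # partition coefficient above 5
        h_donors > 5,       # more than 5 hydrogen bond donors
        h_acceptors > 10,   # more than 10 hydrogen bond acceptors
    ]
    return sum(checks)

# Cicloprofen values from the ChEMBL record shown earlier
print(count_ro5_violations(238.29, 3.45, 1, 2))

# A made-up large, greasy molecule for contrast
print(count_ro5_violations(720.0, 6.2, 2, 8))
```

For cicloprofen this gives zero violations, in line with the num_ro5_violations value in its ChEMBL record.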

chembl_df.head()

	alogp	aromatic_rings	cx_logd	cx_logp	cx_most_apka	cx_most_bpka	full_molformula	full_mwt	hba	hba_lipinski	hbd	hbd_lipinski	heavy_atoms	molecular_species	mw_freebase	mw_monoisotopic	psa	qed_weighted	ro3_pass	rtb	TITLE
0	3.45	2.0	1.26	3.92	4.68	None	C16H14O2	238.29	1.0	2.0	1.0	1.0	18.0	ACID	238.29	238.0994	37.30	0.74	N	2.0	CHEMBL2104122
1	3.64	0.0	2.81	2.81	None	None	C20H26O2	298.43	2.0	2.0	1.0	1.0	22.0	None	298.43	298.1933	37.30	0.55	N	0.0	CHEMBL1387
2	3.92	1.0	4.25	4.25	10.15	2.86	C18H24N2O2S	332.47	4.0	4.0	2.0	3.0	23.0	NEUTRAL	332.47	332.1558	75.68	0.76	N	1.0	CHEMBL275835
3	4.31	0.0	4.04	4.04	None	None	C20H28O	284.44	1.0	1.0	1.0	1.0	21.0	None	284.44	284.2140	20.23	0.52	N	0.0	CHEMBL2104104
4	4.79	0.0	3.96	3.96	None	None	C21H28O2	312.45	2.0	2.0	0.0	0.0	23.0	None	312.45	312.2089	34.14	0.70	N	1.0	CHEMBL2104231

chembl_df.columns

Index(['alogp', 'aromatic_rings', 'cx_logd', 'cx_logp', 'cx_most_apka',
       'cx_most_bpka', 'full_molformula', 'full_mwt', 'hba', 'hba_lipinski',
       'hbd', 'hbd_lipinski', 'heavy_atoms', 'molecular_species',
       'mw_freebase', 'mw_monoisotopic', 'num_lipinski_ro5_violations',
       'num_ro5_violations', 'psa', 'qed_weighted', 'ro3_pass', 'rtb',
       'TITLE'],
      dtype='object')

chembl_df.corr(numeric_only=True)

	aromatic_rings	hba	hba_lipinski	hbd	hbd_lipinski	heavy_atoms	num_lipinski_ro5_violations	num_ro5_violations	rtb
aromatic_rings	1.000000	0.192569	0.178507	0.014928	0.036106	0.249022	0.031094	0.031094	0.229124
hba	0.192569	1.000000	0.868859	0.084553	0.054409	0.451560	-0.047705	-0.047705	-0.023690
hba_lipinski	0.178507	0.868859	1.000000	0.348600	0.294276	0.295864	-0.070783	-0.070783	0.021812
hbd	0.014928	0.084553	0.348600	1.000000	0.935710	-0.172866	-0.060462	-0.060462	0.040505
hbd_lipinski	0.036106	0.054409	0.294276	0.935710	1.000000	-0.211899	-0.085660	-0.085660	0.084225
heavy_atoms	0.249022	0.451560	0.295864	-0.172866	-0.211899	1.000000	0.397240	0.397240	0.259011
num_lipinski_ro5_violations	0.031094	-0.047705	-0.070783	-0.060462	-0.085660	0.397240	1.000000	1.000000	0.345308
num_ro5_violations	0.031094	-0.047705	-0.070783	-0.060462	-0.085660	0.397240	1.000000	1.000000	0.345308
rtb	0.229124	-0.023690	0.021812	0.040505	0.084225	0.259011	0.345308	0.345308	1.000000

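At a glance, the only strong linear correlations in this table are between near-duplicate descriptors (hba/hba_lipinski, hbd/hbd_lipinski, and the two rule-of-five counts), with heavy_atoms/hba a moderate 0.45. The pKa, partition coefficient, and molecular weight columns are stored as strings, so they don’t appear here at all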
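
Columns such as alogp and full_mwt come back from ChEMBL as strings (note the quoted values in the property dictionaries), which is why they are absent from the correlation table. A minimal sketch of converting them first, on a toy frame with a handful of values:

```python
import pandas as pd

# Toy frame mimicking the string-typed ChEMBL properties
toy = pd.DataFrame({
    "alogp": ["3.45", "3.64", "3.92", None],
    "full_mwt": ["238.29", "298.43", "332.47", "284.44"],
    "hba": [1, 2, 4, 1],
})

# errors="coerce" turns anything unparseable (including None) into NaN
numeric = toy.apply(pd.to_numeric, errors="coerce")
print(numeric.dtypes)
print(numeric.corr())
```

On the real frame, a text column such as full_molformula would become all NaN under `errors="coerce"`, so it is simplest to select the property columns of interest before converting.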

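corr_df = chembl_df.corr(numeric_only=True)
cols = corr_df.columns  # only the numeric columns that survive .corr()

fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)

ax.imshow(corr_df, cmap='RdBu')

# Pin one tick per column so the labels line up with the cells
ax.set_xticks(range(len(cols)))
ax.set_xticklabels(cols)
ax.tick_params(axis='x', rotation=90)

ax.set_yticks(range(len(cols)))
ax.set_yticklabels(cols)

for i, (rowname, row) in enumerate(corr_df.iterrows()):
    for j, (key, val) in enumerate(row.items()):  # Series.iteritems() was removed in pandas 2.0
        ax.annotate(f"{val:0.2f}", xy=(j,i), xytext=(-10, -5), textcoords="offset points")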

Maybe there are higher-order correlations and relationships more appropriate for clustering and decomposition

cols = ['aromatic_rings', 'cx_logp',  'full_mwt', 'hba']
# Drop rows where all four properties are missing, then zero-fill any remaining gaps
cleaned = (chembl_df[cols]
           .dropna(how='all')
           .astype('float')
           .fillna(0))

from sklearn import preprocessing

normalized = preprocessing.scale(cleaned)
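
For reference, `preprocessing.scale` standardizes each column to zero mean and unit variance, so that molecular weight (hundreds) does not swamp ring counts (single digits). A NumPy sketch of the same calculation on a made-up feature matrix:

```python
import numpy as np

# Made-up feature matrix: rows are molecules, columns are properties
features = np.array([[2.0, 238.29],
                     [0.0, 298.43],
                     [1.0, 332.47]])

# Column-wise z-score using the population standard deviation (ddof=0)
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

print(standardized.mean(axis=0).round(6))
print(standardized.std(axis=0).round(6))
```

Note that the zero-filled missing values are standardized along with everything else, so they end up as whatever z-score zero maps to in that column.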

There appear to be maybe 4 clusters among these compounds examined by the covid-moonshot group

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

tsne_analysis = TSNE(n_components=2)
output = tsne_analysis.fit_transform(normalized)
fig,ax = plt.subplots(1,1)
ax.scatter(output[:,0], output[:,1])
ax.set_title("Aromatic rings, cx_logp, mwt, hba")

By taking turns leaving out some features, it looks like leaving out aromatic rings or hydrogen bond acceptors will diminish the cluster distinction.

Aromatic rings are large, bulky components of small molecules, so it makes sense that a chunk of the behavior corresponds to the aromatic rings. Similarly, hydrogen bond acceptors (heavy atoms) also exert van der Waals and electrostatic influences on small molecules. Left with only weight and partition coefficient, the behavior is mainly continuous

def clean_df(cols):
    """Drop rows where every column in cols is null, zero-fill the rest, and standardize."""
    cleaned = (chembl_df[cols]
               .dropna(how='all')
               .astype('float')
               .fillna(0))

    return preprocessing.scale(cleaned)

cols = ['cx_logp',  'full_mwt', 'hba']
normalized = clean_df(cols)
tsne_analysis = TSNE(n_components=2)
output = tsne_analysis.fit_transform(normalized)
fig,ax = plt.subplots(3,1, figsize=(8,8))
ax[0].scatter(output[:,0], output[:,1])
ax[0].set_title("cx_logp, mwt, hba")

cols = ['cx_logp',  'full_mwt', 'aromatic_rings']
normalized = clean_df(cols)

tsne_analysis = TSNE(n_components=2)
output = tsne_analysis.fit_transform(normalized)

ax[1].scatter(output[:,0], output[:,1])
ax[1].set_title("aromatic_rings, cx_logp, mwt")

cols = ['cx_logp',  'full_mwt']
normalized = clean_df(cols)

ax[2].scatter(normalized[:,0], normalized[:,1])
ax[2].set_title("cx_logp, mwt")

fig.tight_layout()

DrugBank

I found that someone had already downloaded the database. I may go back over these dataframes, but query the DrugBank dataset rather than ChEMBL

Some docking data

We have some SMILES strings, molecular properties, docking scores, and information about the docking fragments

moonshot = pd.read_csv('moonshot-submissions/covid_submissions_all_info-docked-overlap.csv')

moonshot

	SMILES	TITLE	creator	fragments	link	real_space	SCR	BB	extended_real_space	in_molport_or_mcule	in_ultimate_mcule	in_emolecules	covalent_frag	covalent_warhead	acrylamide	acrylamide_adduct	chloroacetamide	chloroacetamide_adduct	vinylsulfonamide	vinylsulfonamide_adduct	nitrile	nitrile_adduct	MW	cLogP	HBD	HBA	TPSA	num_criterion_violations	BMS	Dundee	Glaxo	Inpharmatica	LINT	MLSMR	PAINS	SureChEMBL	PostEra	ORDERED	MADE	ASSAYED	Hybrid2	docked_fragment	Mpro-x1418_dock	site	number_of_overlapping_fragments	overlapping_fragments	overlap_score	volume
0	c1ccc(cc1)n2c3cc(c(cc3c(=O)c(c2[O-])c4cccnc4)F)Cl	MAK-UNK-9e4a73aa-2	Maksym Voznyy	x1418	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	366.779	4.51890	0	3	50.27	0	PASS	beta-keto/anhydride	PASS	PASS	PASS	Ketone, Dye 11	PASS	PASS	PASS	False	False	False	-11.881256	x1418	1.206534	active-covalent	3	x0434,x0678,x0830	3.208124	271.986084
1	Cc1ccncc1n2c(=O)ccc3c2CCCN3CC(=[NH2+])N	KIM-UNI-60f168f5-7	Kim Tai Tran, University of Copenhagen	x0107,x0991	https://covid.postera.ai/covid/submissions/KIM...	FALSE	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	297.362	1.22949	2	5	88.00	0	PASS	imine, imine	PASS	PASS	acyclic C=N-H	Imine 3	PASS	PASS	PASS	False	False	False	-11.654112	x0107	NaN	active-noncovalent	3	x0107,x1412,x1392	4.753475	232.815506
2	c1ccc(cc1)n2c3cc(c(cc3c(=O)n(c2=O)c4cnccn4)F)Cl	MAK-UNK-9e4a73aa-14	Maksym Voznyy	x1418	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	368.755	2.72410	0	6	69.78	0	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	False	False	False	-10.460650	x0678	2.716276	active-noncovalent	3	x0678,x1412,x1392	5.520980	266.688721
3	Cc1ccncc1N(C=C)[C@H]([C@@H](C)[C@@H]2CN=Cc3c2c...	AUS-WAB-916db9c0-1	Austin D. Chivington, Wabash College	x0107,x1077,x1374	https://covid.postera.ai/covid/submissions/AUS...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	351.450	3.51932	1	5	57.95	0	non_ring_acetal	het-C-het not in ring	PASS	Filter10_Terminal_vinyl	PASS	PASS	PASS	PASS	PASS	False	False	False	-9.516450	x0678	NaN	active-noncovalent	3	x0434,x0831,x0678	3.446572	284.195312
4	c1ccc2c(c1)ncc(n2)/C=C/C(=O)c3cccc(c3)O	DRV-DNY-ae159ed1-12	Dr. Vidya Desai, Dnyanprassarak Mandals Colleg...	x1249	https://covid.postera.ai/covid/submissions/DRV...	FALSE	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	276.295	3.23150	1	4	63.08	0	PASS	PASS	PASS	Filter44_michael_acceptor2	PASS	Ketone, Dye 9, vinyl michael acceptor1	PASS	PASS	PASS	False	False	False	-9.243208	x0678	NaN	active-noncovalent	3	x0434,x0678,x0830	2.865147	220.275421
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4630	C[C@H]([C@@H](C(=O)N[C@H](Cc1ccccc1)C(=O)N[C@@...	PAU-UNI-6d15a9f5-4	paul brear, University of cambridge	x1086	https://covid.postera.ai/covid/submissions/PAU...	FALSE	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	714.821	-0.91270	8	11	256.10	4	PASS	PASS	PASS	PASS	PASS	Long aliphatic chain, Dipeptide	PASS	PASS	PASS	False	False	False	3.175111	x0305	NaN	active-noncovalent	0	NaN	5.297134	548.583191
4631	c1cc2cc(c(cc2c(c1)S(=O)(=O)N3CC[NH+](CC3)Cc4cc...	MAK-UNK-e05327b2-2	Maksym Voznyy	x1402	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	True	False	False	True	False	False	False	False	False	837.964	6.63190	0	9	98.31	2	PASS	PASS	PASS	PASS	PASS	Hetero_hetero	PASS	PASS	PASS	False	False	False	3.561681	x1392	NaN	active-covalent	0	NaN	3.297014	591.877563
4632	Cc1cccc(c1)C[NH+]2CCN(CC2)C(=O)c3ccc(cc3)C#Cc4...	MAK-UNK-e4a48a85-16	Maksym Voznyy	x0387,x0692	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	574.794	6.18892	0	5	39.68	2	PASS	triple bond	PASS	PASS	PASS	PASS	PASS	PASS	PASS	False	False	False	4.056698	x0978	NaN	active-covalent	0	NaN	4.360606	470.944824
4633	c1cc2cc(c(cc2c(c1)S(=O)(=O)N3CC[NH+](CC3)Cc4cc...	MAK-UNK-e05327b2-6	Maksym Voznyy	x1402	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	True	False	False	True	False	False	False	False	False	990.183	5.19160	0	12	138.93	3	alpha_halo_heteroatom, secondary_halide_sulfate	PASS	PASS	PASS	PASS	Hetero_hetero	PASS	Dithiomethylene_acetal	Alkyl Halide	False	False	False	4.242827	x0731	NaN	active-covalent	0	NaN	4.193186	694.333069
4634	Cc1cccc(c1)C[NH+]2CCN(CC2)c3cc(c(c(c3)Cl)c4cc5...	MAK-UNK-e4a48a85-15	Maksym Voznyy	x0387,x0692	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	659.687	7.36362	1	7	68.36	2	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	False	False	False	5.966927	x0705	NaN	active-covalent	0	NaN	1.473711	503.583801

4635 rows × 48 columns

moonshot.head(5)

	SMILES	TITLE	creator	fragments	link	real_space	SCR	BB	extended_real_space	in_molport_or_mcule	in_ultimate_mcule	in_emolecules	covalent_frag	covalent_warhead	acrylamide	acrylamide_adduct	chloroacetamide	chloroacetamide_adduct	vinylsulfonamide	vinylsulfonamide_adduct	nitrile	nitrile_adduct	MW	cLogP	HBD	HBA	TPSA	BMS	Dundee	Glaxo	Inpharmatica	LINT	MLSMR	PAINS	SureChEMBL	PostEra	ORDERED	MADE	ASSAYED	Hybrid2	docked_fragment	Mpro-x1418_dock	site	number_of_overlapping_fragments	overlapping_fragments	overlap_score	volume
0	c1ccc(cc1)n2c3cc(c(cc3c(=O)c(c2[O-])c4cccnc4)F)Cl	MAK-UNK-9e4a73aa-2	Maksym Voznyy	x1418	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	366.779	4.51890	0	3	50.27	PASS	beta-keto/anhydride	PASS	PASS	PASS	Ketone, Dye 11	PASS	PASS	PASS	False	False	False	-11.881256	x1418	1.206534	active-covalent	3	x0434,x0678,x0830	3.208124	271.986084
1	Cc1ccncc1n2c(=O)ccc3c2CCCN3CC(=[NH2+])N	KIM-UNI-60f168f5-7	Kim Tai Tran, University of Copenhagen	x0107,x0991	https://covid.postera.ai/covid/submissions/KIM...	FALSE	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	297.362	1.22949	2	5	88.00	PASS	imine, imine	PASS	PASS	acyclic C=N-H	Imine 3	PASS	PASS	PASS	False	False	False	-11.654112	x0107	NaN	active-noncovalent	3	x0107,x1412,x1392	4.753475	232.815506
2	c1ccc(cc1)n2c3cc(c(cc3c(=O)n(c2=O)c4cnccn4)F)Cl	MAK-UNK-9e4a73aa-14	Maksym Voznyy	x1418	https://covid.postera.ai/covid/submissions/MAK...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	368.755	2.72410	0	6	69.78	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	PASS	False	False	False	-10.460650	x0678	2.716276	active-noncovalent	3	x0678,x1412,x1392	5.520980	266.688721
3	Cc1ccncc1N(C=C)[C@H]([C@@H](C)[C@@H]2CN=Cc3c2c...	AUS-WAB-916db9c0-1	Austin D. Chivington, Wabash College	x0107,x1077,x1374	https://covid.postera.ai/covid/submissions/AUS...	FALSE	FALSE	FALSE	FALSE	False	False	False	True	False	False	False	False	False	False	False	False	False	351.450	3.51932	1	5	57.95	non_ring_acetal	het-C-het not in ring	PASS	Filter10_Terminal_vinyl	PASS	PASS	PASS	PASS	PASS	False	False	False	-9.516450	x0678	NaN	active-noncovalent	3	x0434,x0831,x0678	3.446572	284.195312
4	c1ccc2c(c1)ncc(n2)/C=C/C(=O)c3cccc(c3)O	DRV-DNY-ae159ed1-12	Dr. Vidya Desai, Dnyanprassarak Mandals Colleg...	x1249	https://covid.postera.ai/covid/submissions/DRV...	FALSE	FALSE	FALSE	FALSE	False	False	False	False	False	False	False	False	False	False	False	False	False	276.295	3.23150	1	4	63.08	PASS	PASS	PASS	Filter44_michael_acceptor2	PASS	Ketone, Dye 9, vinyl michael acceptor1	PASS	PASS	PASS	False	False	False	-9.243208	x0678	NaN	active-noncovalent	3	x0434,x0678,x0830	2.865147	220.275421

moonshot['Mpro-x1418_dock'].isnull().sum() # Lots of missing Mpro dock scores

While there are many different fragments to which the small molecule can bind, the sites fall into two “classes”: active-covalent and active-noncovalent (possibly referring to sites that covalently bond?).

This presents a way to logically bisect the data based on some fundamental chemistry of the binding pocket.

moonshot['docked_fragment'].value_counts()

x0678    940
x0749    771
x0104    347
x0831    283
x0830    281
x0195    269
x0161    252
x0107    201
x0072    172
x1077    127
x1392    107
x1093    107
x0434    105
x0874     81
x1385     69
x1418     58
x1334     50
x0967     46
x0397     42
x0946     38
x0692     37
x0759     37
x1386     35
x0395     29
x0305     24
x1311     16
x0708     13
x0774     12
x1380     10
x1412      7
x1374      7
x1348      6
x0770      5
x1249      5
x0387      5
x0736      4
x0705      4
x1358      3
x0426      3
x1375      3
x0734      3
x0540      3
x0354      3
x1382      3
x0755      1
x1458      1
x0689      1
x0769      1
x0981      1
x0978      1
x0731      1
x1493      1
x0771      1
x1478      1
x1384      1
x1351      1
Name: docked_fragment, dtype: int64

moonshot['site'].value_counts()

active-noncovalent    2799
active-covalent       1836
Name: site, dtype: int64

We can examine the same correlations, but now for each type of site, and look at the hybrid docking score correlations.

The biggest differences in trend appear with the partition coefficient (cLogP) and the number of hydrogen bond donors (HBD), but the correlations are still extremely weak.

site_type = 'active-noncovalent'
fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
cols = ['MW', 'cLogP', 'HBD', 'HBA', 'TPSA', 'Hybrid2']
ax.matshow(moonshot[moonshot['site']==site_type][cols].corr(), cmap='RdBu')

ax.set_xticks([i for i,_ in enumerate(cols)])
ax.set_xticklabels(cols)

ax.set_yticks([i for i,_ in enumerate(cols)])
ax.set_yticklabels(cols)

for i, (rowname, row) in enumerate(moonshot[moonshot['site']==site_type][cols].corr().iterrows()):
    for j, (key, val) in enumerate(row.items()):  # Series.iteritems() was removed in pandas 2.0
        ax.annotate(f"{val:0.2f}", xy=(j,i), xytext=(-10, -5), textcoords="offset points")  # x = column, y = row
ax.set_title(f"Docking to {site_type}")

Text(0.5, 1.05, 'Docking to active-noncovalent')

site_type = 'active-covalent'
fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
cols = ['MW', 'cLogP', 'HBD', 'HBA', 'TPSA', 'Hybrid2']
ax.matshow(moonshot[moonshot['site']==site_type][cols].corr(), cmap='RdBu')

ax.set_xticks([i for i,_ in enumerate(cols)])
ax.set_xticklabels(cols)

ax.set_yticks([i for i,_ in enumerate(cols)])
ax.set_yticklabels(cols)

for i, (rowname, row) in enumerate(moonshot[moonshot['site']==site_type][cols].corr().iterrows()):
    for j, (key, val) in enumerate(row.items()):  # Series.iteritems() was removed in pandas 2.0
        ax.annotate(f"{val:0.2f}", xy=(j,i), xytext=(-10, -5), textcoords="offset points")  # x = column, y = row
ax.set_title(f"Docking to {site_type}")

Text(0.5, 1.05, 'Docking to active-covalent')
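The two heatmaps can also be condensed into one small table with a row per site type and the correlation of each descriptor with the docking score. The sketch below is one way to do that with a groupby. The helper name corr_with_score and the toy numbers are made up for illustration, and only the column names mirror the moonshot table.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the moonshot table.
toy = pd.DataFrame({
    'site': ['active-covalent'] * 4 + ['active-noncovalent'] * 4,
    'MW': [300.0, 350.0, 400.0, 450.0, 280.0, 320.0, 360.0, 400.0],
    'cLogP': [1.0, 2.5, 2.0, 4.0, 3.0, 1.5, 2.5, 0.5],
    'Hybrid2': [-9.0, -7.5, -8.0, -5.0, -10.0, -9.5, -8.0, -8.5],
})

def corr_with_score(df, score='Hybrid2', group='site'):
    """Correlation of every numeric descriptor with `score`, one row per group."""
    pieces = {}
    for name, sub in df.groupby(group):
        # Full correlation matrix for this group, then keep only the score column
        corr = sub.drop(columns=[group]).corr()[score].drop(score)
        pieces[name] = corr
    return pd.DataFrame(pieces).T

per_site = corr_with_score(toy)
print(per_site.round(2))
```

On the real data, the same call would be corr_with_score(moonshot[['site', 'MW', 'cLogP', 'HBD', 'HBA', 'TPSA', 'Hybrid2']]), assuming those columns are all numeric apart from site.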

In general, lower docking scores seem better, so the noncovalent sites might present more favorable binding locations (see the histogram below). This seems counterintuitive: if active-covalent really means sites that bond covalently, then covalent bonds would seem more energetically favorable than non-covalent interactions. Alternatively, forming covalent bonds might suggest an unstable region of the complex that could be shielded from the surroundings, inhibiting any sort of small molecule from binding the pocket? Expert opinion would be much appreciated here.

fig, ax = plt.subplots(1,1, figsize=(8,6), dpi=100)
covalent_mean = moonshot[moonshot['site']=='active-covalent']['Hybrid2'].mean()
noncovalent_mean = moonshot[moonshot['site']=='active-noncovalent']['Hybrid2'].mean()

ax.hist(moonshot[moonshot['site']=='active-covalent']['Hybrid2'], alpha=0.5, 
        label=f'active-covalent (mean={covalent_mean:.3f})')
ax.hist(moonshot[moonshot['site']=='active-noncovalent']['Hybrid2'], alpha=0.5, 
        label=f'active-noncovalent (mean={noncovalent_mean:.3f})')

ax.set_title(f"Hybrid2 histogram")
ax.set_xlabel("Hybrid2 score")
ax.legend()

<matplotlib.legend.Legend at 0x7fac6b459850>
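To put numbers next to the histogram, a per-site summary of the score can help. This is a minimal sketch on made-up scores (not values from the moonshot table), using a plain pandas groupby. The variable names summary and gap are illustrative.

```python
import pandas as pd

# Hypothetical scores for illustration; not values from the moonshot table.
toy = pd.DataFrame({
    'site': ['active-covalent', 'active-covalent', 'active-covalent',
             'active-noncovalent', 'active-noncovalent', 'active-noncovalent'],
    'Hybrid2': [-4.0, -6.0, -5.0, -8.0, -7.0, -9.0],
})

# One row per site type: how many molecules, and where the scores sit
summary = toy.groupby('site')['Hybrid2'].agg(['count', 'mean', 'median', 'std'])

# Negative gap means the noncovalent group has the lower (better) mean score
gap = summary.loc['active-noncovalent', 'mean'] - summary.loc['active-covalent', 'mean']
print(summary)
print(f"noncovalent minus covalent mean: {gap:.3f}")
```

Comparing the gap to the per-group standard deviations gives a rough sense of whether the two distributions really differ or mostly overlap.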

from rdkit import Chem
from rdkit.Chem import Draw  # Draw is a submodule and needs its own import

# Ten best-scoring (lowest Hybrid2) submissions
top10 = moonshot.sort_values('Hybrid2', ascending=True).head(10)
rdkit_mols = [Chem.MolFromSmiles(a) for a in top10['SMILES']]
scores = [f"{a:.3f}" for a in top10['Hybrid2']]

img = Draw.MolsToGridImage(rdkit_mols, molsPerRow=5, subImgSize=(200,200),
                           legends=scores)

img

Is being “clutch” a myth?

2020-04-05T00:00:00-05:00

Are some players more “clutch” than others?

Clutch time is defined as “the last 5 minutes of a game in which the point differential is 5 or less”. Do some players really rise to the challenge and perform better in the clutch?

To address this, we can use some nba_api functionality to get clutch stats and compare them to full-game stats. Due to the small sample size of clutch stats, we look at the total field goal percentage (total field goals made divided by total field goals attempted during clutch time). For the full-game stats (regular season and playoffs), however, we look at the field goal percentage per game, averaged over all games. This allows us to get a sense of the game-by-game variation of a player’s field goal percentage.
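The two ways of computing a field goal percentage generally give different numbers, because the per-game average weights every game equally regardless of how many shots were taken. A small sketch with made-up box scores (the games list is hypothetical) shows the difference:

```python
from statistics import mean, stdev

# Hypothetical (made, attempted) field goals for four games.
games = [(10, 20), (3, 12), (8, 10), (5, 18)]

# Pooled percentage: total makes over total attempts (used for clutch time).
pooled_fg_pct = sum(m for m, _ in games) / sum(a for _, a in games)

# Per-game percentages, then their mean and spread (used for full games).
per_game = [m / a for m, a in games]
mean_fg_pct = mean(per_game)
std_fg_pct = stdev(per_game)

print(f"pooled:   {pooled_fg_pct:.3f}")
print(f"per-game: {mean_fg_pct:.3f} +/- {std_fg_pct:.3f}")
```

The pooled number has no game-to-game spread attached to it, which is why only the per-game version yields a standard deviation.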

import numpy as np
import pandas as pd
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import time
import ballDontLie 
from ballDontLie.util.api_nba import find_player_id
from nba_api.stats.endpoints import PlayerDashboardByClutch, PlayerGameLog, LeagueGameLog

Sample players and data pull

We will examine some famous players, some thought to be more “clutch” than others. We look at these players season by season, specifically their MVP seasons. We could look beyond MVP seasons (some players didn’t win an MVP in a given season but still had historic regular seasons or playoff runs). We could also expand to non-MVP players who have many clutch moments (Damian Lillard, Brandon Roy, among others).

players_mvps = {
    'Michael Jordan': ['1987-88', '1990-91', '1991-92', '1995-96', '1997-98'],
    'Kobe Bryant': ['2007-08'],
    'LeBron James': ['2008-09', '2009-10', '2011-12', '2012-13'], 
    'Kevin Durant': ['2013-14'],
    'Russell Westbrook': ['2016-17'],
    'Allen Iverson': ['2000-01'],
    'Stephen Curry': ['2014-15', '2015-16'],
    'Derrick Rose': ['2010-11'],
    'Steve Nash': ['2004-05', '2005-06']
}

NBA stat recording imposes some constraints: some seasons have no clutch-time stats recorded, so we will have to account for the missing data.

df = pd.DataFrame()
for i, (player, mvp_seasons) in enumerate(players_mvps.items()):
    for mvp_season in mvp_seasons:
        print(player, mvp_season)
        player_id = find_player_id(player)[0]
        season_results = {'player': player, 
                          'player_id': player_id,
                         'season': mvp_season}

        regular_season_game_log = PlayerGameLog(player_id, season=mvp_season, 
                                                season_type_all_star='Regular Season')
        regular_season_clutch_games = PlayerDashboardByClutch(player_id, season=mvp_season, 
                                                              season_type_playoffs='Regular Season')

        season_results['regular_fg_pct'] = regular_season_game_log.get_data_frames()[0]['FG_PCT'].mean()
        season_results['regular_fg_pct_std'] = regular_season_game_log.get_data_frames()[0]['FG_PCT'].std()
        try:
            season_results['regular_clutch_fg_pct'] = (regular_season_clutch_games
                                           .last5_min_plus_minus5_point_player_dashboard
                                           .get_data_frame()['FG_PCT'].values[0])
        except IndexError:
            season_results['regular_clutch_fg_pct'] = -1.0

        playoffs_game_log = PlayerGameLog(player_id, season=mvp_season, 
                                                season_type_all_star='Playoffs')
        playoffs_clutch_games = PlayerDashboardByClutch(player_id, season=mvp_season, 
                                                              season_type_playoffs='Playoffs')

        season_results['playoff_fg_pct'] = playoffs_game_log.get_data_frames()[0]['FG_PCT'].mean()
        season_results['playoff_fg_pct_std'] = playoffs_game_log.get_data_frames()[0]['FG_PCT'].std()
        try:
            season_results['playoff_clutch_fg_pct'] = (playoffs_clutch_games
                                           .last5_min_plus_minus5_point_player_dashboard
                                           .get_data_frame()['FG_PCT'].values[0])
        except IndexError:
            season_results['playoff_clutch_fg_pct'] = -1.0
        summary_dict = {i: season_results}
        # DataFrame.append was removed in pandas 2.0; pd.concat keeps the same index
        df = pd.concat([df, pd.DataFrame.from_dict(summary_dict, orient='index')])
        time.sleep(10)

Michael Jordan 1987-88
Michael Jordan 1990-91
Michael Jordan 1991-92
Michael Jordan 1995-96
Michael Jordan 1997-98
Kobe Bryant 2007-08
LeBron James 2008-09
LeBron James 2009-10
LeBron James 2011-12
LeBron James 2012-13
Kevin Durant 2013-14
Russell Westbrook 2016-17
Allen Iverson 2000-01
Stephen Curry 2014-15
Stephen Curry 2015-16
Derrick Rose 2010-11
Steve Nash 2004-05
Steve Nash 2005-06

Looking at the data, we have a somewhat neat dataframe of the players in their MVP seasons, and some information about their fg%. Unfortunately, we’re missing a lot of clutch information for Michael Jordan (the -1.000 entries mark missing data).

df

	player	player_id	season	regular_fg_pct	regular_fg_pct_std	regular_clutch_fg_pct	playoff_fg_pct	playoff_fg_pct_std	playoff_clutch_fg_pct
0	Michael Jordan	893	1987-88	0.536207	0.100253	-1.000	0.526700	0.093797	-1.000
0	Michael Jordan	893	1990-91	0.543220	0.098105	-1.000	0.532059	0.105735	-1.000
0	Michael Jordan	893	1991-92	0.517175	0.098628	-1.000	0.496273	0.083527	-1.000
0	Michael Jordan	893	1995-96	0.494646	0.098950	-1.000	0.455333	0.104624	-1.000
0	Michael Jordan	893	1997-98	0.463866	0.100443	0.430	0.468381	0.087482	0.440
1	Kobe Bryant	977	2007-08	0.464744	0.110210	0.448	0.485619	0.102819	0.484
2	LeBron James	2544	2008-09	0.490827	0.099416	0.556	0.512929	0.100286	0.526
2	LeBron James	2544	2009-10	0.499855	0.086847	0.488	0.487182	0.139674	0.714
2	LeBron James	2544	2011-12	0.533387	0.111258	0.453	0.500000	0.094863	0.370
2	LeBron James	2544	2012-13	0.572526	0.114762	0.442	0.496000	0.120758	0.440
3	Kevin Durant	201142	2013-14	0.510494	0.116384	0.379	0.460632	0.100532	0.515
4	Russell Westbrook	201566	2016-17	0.425136	0.119384	0.446	0.382400	0.078567	0.286
5	Allen Iverson	947	2000-01	0.412845	0.092642	0.441	0.380773	0.114597	0.306
6	Stephen Curry	201939	2014-15	0.483463	0.109280	0.441	0.456667	0.104391	0.381
6	Stephen Curry	201939	2015-16	0.499405	0.111825	0.442	0.436722	0.117035	0.538
7	Derrick Rose	201565	2010-11	0.447617	0.106040	0.402	0.400062	0.102861	0.409
8	Steve Nash	959	2004-05	0.508400	0.159107	0.447	0.510667	0.126669	0.444
8	Steve Nash	959	2005-06	0.513215	0.157780	0.425	0.500800	0.126879	0.385

Visualizing the results

We can plot the differences between the clutch fg% and the average fg% for each player’s season. If this number is above 0, then their clutch performances are better than their average performance. As a rough gauge of significance, we can check whether this difference is larger than the standard deviation of the player’s per-game fg%.
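The same comparison can be done as a table before plotting. The sketch below uses three rows copied (rounded) from the dataframe above, treats the -1.0 placeholder as missing, and flags whether the gap exceeds one standard deviation. The column names delta_pts and beyond_1_std are illustrative, and the one-standard-deviation check is a rough heuristic, not a formal significance test.

```python
import numpy as np
import pandas as pd

# Three rows copied (rounded) from the dataframe above.
rows = pd.DataFrame({
    'player': ['Michael Jordan', 'Kobe Bryant', 'LeBron James'],
    'season': ['1987-88', '2007-08', '2008-09'],
    'regular_fg_pct': [0.536, 0.465, 0.491],
    'regular_fg_pct_std': [0.100, 0.110, 0.099],
    'regular_clutch_fg_pct': [-1.0, 0.448, 0.556],
})

# Treat the -1.0 placeholder as missing so it cannot leak into the difference.
clutch = rows['regular_clutch_fg_pct'].replace(-1.0, np.nan)
rows['delta_pts'] = 100 * (clutch - rows['regular_fg_pct'])

# Rough check: is the gap bigger than one standard deviation of per-game fg%?
rows['beyond_1_std'] = rows['delta_pts'].abs() > 100 * rows['regular_fg_pct_std']
print(rows[['player', 'season', 'delta_pts', 'beyond_1_std']])
```

Masking the placeholder matters because a raw -1.0 would otherwise show up as a difference of roughly -150 percentage points.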

Regular season

LeBron (in 2008-09 only), Russ, and AI are the only players to show a clutch fg% higher than their average fg%. Unfortunately, these performance differences are very slight.

Playoffs

LeBron, KD, Steph, and DRose show clutch fg%s higher than their average playoff fg%.

import itertools as it

fig, ax = plt.subplots(2,1, sharex=True, figsize=(8,6), dpi=100)
unique_players = df['player'].unique()
for i, player in enumerate(unique_players):
    sub_df = df[df['player']==player]
    ax[0].errorbar([i]*len(sub_df), 100*(sub_df['regular_clutch_fg_pct'] - sub_df['regular_fg_pct']),
                  yerr=100*sub_df['regular_fg_pct_std'], linestyle='', marker='o', capsize=3)
    ax[1].errorbar([i] * len(sub_df), 100*(sub_df['playoff_clutch_fg_pct'] - sub_df['playoff_fg_pct']),
                  yerr=100*sub_df['playoff_fg_pct_std'], linestyle='', marker='o', capsize=3)
    
ax[0].axhline(y=0, color='r', linestyle='--')
ax[1].axhline(y=0, color='r', linestyle='--')

ax[0].set_title("Clutch vs Average FG%")

ax[1].set_xlim([-1, len(unique_players)])
ax[1].set_xticks(np.arange(0, len(unique_players)))
ax[1].set_xticklabels(unique_players)
ax[1].xaxis.set_tick_params(rotation=90)


ax[0].set_ylabel("Regular season")
ax[1].set_ylabel("Playoffs ")
ax[0].set_ylim([-30, 30])
ax[1].set_ylim([-30, 30])

(-30, 30)

Commentary on the analysis

It’s not particularly fair to just take the difference between clutch fg% and average fg%. During clutch time, it’s usually assumed the team will put the ball in their best player’s hands. For the players we sampled, this will naturally lower their fg% because defenses are focusing more strongly on them, not necessarily because the pressure of the moment is getting to them. Honestly, if your clutch fg% is the same as your average fg%, I’d be satisfied enough to call that player clutch.

It’s also at least fun to confirm that superstars play worse in the playoffs (if you compare the regular_fg_pct and playoff_fg_pct columns in the dataframe). General consensus is that these players get guarded more tightly and schemed against, so their playoff fg% will be worse than their regular season fg%.

To better evaluate “clutch”, it might help to do this on a game-by-game basis. If a player had a hot hand and cooled off during the clutch, that’s bad. If a player was cold and hit some big shots during the clutch, that’s great. In the manner conducted here, these game-by-game fluctuations are averaged out. With a more granular game-by-game method, we would see more dramatic changes in a player’s fg% from game to game, and also in that player’s clutch fg% from game to game (more noise in the data).
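A game-by-game version of the comparison might look like the sketch below. The shot counts are entirely hypothetical, and this is only one possible way to define a per-game clutch difference (clutch fg% minus non-clutch fg% within the same game).

```python
# Hypothetical per-game shot counts: (non-clutch made, non-clutch attempted,
# clutch made, clutch attempted).
games = [
    (9, 16, 0, 3),   # hot early, cold late
    (4, 15, 3, 4),   # cold early, hot late
    (7, 14, 1, 2),
    (8, 17, 0, 0),   # no clutch shots taken
]

deltas = []
for made, att, c_made, c_att in games:
    # Skip games where either percentage is undefined
    if att == 0 or c_att == 0:
        continue
    deltas.append(c_made / c_att - made / att)

mean_delta = sum(deltas) / len(deltas)
print([round(d, 3) for d in deltas], round(mean_delta, 3))
```

With only a handful of clutch attempts per game, each individual difference swings wildly, which is the extra noise mentioned above.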

Averaging over all the games also ignores things like game severity/importance, the teams and players they were up against, and other important factors like the player-in-question’s state of mind when they went into the game or the pressure of the moment. For example, LeBron’s Game 6 of the 2012 ECF was a very clutch performance (technically not even during clutch time), but a performance like that just gets averaged out against all other games. Other moments like losing 3-1 leads should count as very anti-clutch performances, but those get averaged out too.

Conclusion

Yes, one could try to take a data-driven approach to study the clutch myth. In this day and age, there’s some data for someone to try to build a case and argue for its validity. However, I would argue there are still many “unquantifiables” that prevent the clutch myth from truly being scrutinized with numbers. All the complicated, “you had to be there”, test-your-composure moments demonstrate the limitations of data-driven analytics.

This notebook can be found here