<p>Alex H. Yang, My personal website (https://ahy3nz.github.io/feed.xml, updated 2023-08-30)</p> <p>Scraping Reddit, part 2 (2021-04-09), https://ahy3nz.github.io/posts/2021/04/reddit2</p> <p>The <a href="./2021-02-01-reddit1.md">last post</a> dealt with using pushshift and handling requests to access posts and comments from Reddit. This post deals with using the <a href="https://praw.readthedocs.io/en/latest/">Python Reddit API wrapper</a> (praw) to access posts and comments from Reddit and then using some NLP tools for some basic sentiment analysis.</p> <p>There is some work to set up an application to use <a href="https://github.com/reddit-archive/reddit/wiki/OAuth2-App-Types">praw</a> with <a href="https://github.com/reddit-archive/reddit/wiki/OAuth2-Quick-Start-Example">oauth</a>, but it is straightforward enough for anyone who’s just using this as a script.</p> <p>After setting up the praw application, we can build up a small pipeline:</p> <ol> <li>Use praw to download posts and comments from r/nba</li> <li>Format them into a dataframe</li> <li>Use huggingface and spacy for sentiment analysis</li> </ol> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span> <span class="kn">import</span> <span class="nn">itertools</span> <span class="k">as</span> <span class="n">it</span> <span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="nb">reduce</span><span class="p">,</span> <span class="n">partial</span> <span class="kn">import</span> <span class="nn">datetime</span> <span class="k">as</span> <span class="n">dt</span> <span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span> <span class="n">pd</span><span class="p">.</span><span
class="n">set_option</span><span class="p">(</span><span class="s">'display.max_colwidth'</span><span class="p">,</span> <span class="mi">150</span><span class="p">)</span> <span class="kn">import</span> <span class="nn">praw</span> <span class="kn">from</span> <span class="nn">praw.models</span> <span class="kn">import</span> <span class="n">MoreComments</span> <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="kn">import</span> <span class="nn">hfapi</span> <span class="kn">import</span> <span class="nn">spacy</span> <span class="kn">from</span> <span class="nn">spacytextblob.spacytextblob</span> <span class="kn">import</span> <span class="n">SpacyTextBlob</span> <span class="n">nlp</span> <span class="o">=</span> <span class="n">spacy</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="s">"en_core_web_sm"</span><span class="p">)</span> <span class="n">spacy_text_blob</span> <span class="o">=</span> <span class="n">SpacyTextBlob</span><span class="p">()</span> <span class="n">nlp</span><span class="p">.</span><span class="n">add_pipe</span><span class="p">(</span><span class="n">spacy_text_blob</span><span class="p">)</span> <span class="n">client</span> <span class="o">=</span> <span class="n">hfapi</span><span class="p">.</span><span class="n">Client</span><span class="p">()</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">reddit</span> <span class="o">=</span> <span class="n">praw</span><span class="p">.</span><span class="n">Reddit</span><span class="p">(</span><span class="s">"bot1"</span><span class="p">)</span> <span class="c1"># Pulls from praw.ini file </span><span class="n">rnba</span> <span class="o">=</span> <span class="n">reddit</span><span class="p">.</span><span class="n">subreddit</span><span class="p">(</span><span 
class="s">'nba'</span><span class="p">)</span> </code></pre></div></div> <h2 id="compiling-praw-objects-into-a-dataframe">Compiling praw objects into a dataframe</h2> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span> <span class="k">class</span> <span class="nc">RedditSubmission</span><span class="p">:</span> <span class="n">title</span><span class="p">:</span> <span class="nb">str</span> <span class="n">body</span><span class="p">:</span> <span class="nb">str</span> <span class="n">permalink</span><span class="p">:</span> <span class="nb">str</span> <span class="n">author</span><span class="p">:</span> <span class="nb">str</span> <span class="n">score</span><span class="p">:</span> <span class="nb">float</span> <span class="n">timestamp</span><span class="p">:</span> <span class="n">dt</span><span class="p">.</span><span class="n">datetime</span> <span class="k">def</span> <span class="nf">to_dict</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="p">{</span> <span class="s">'title'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">title</span><span class="p">,</span> <span class="s">'body'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">body</span><span class="p">,</span> <span class="s">'permalink'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">permalink</span><span class="p">,</span> <span class="s">'author'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">author</span><span class="p">,</span> <span class="s">'score'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">score</span><span class="p">,</span> <span 
class="s">'timestamp'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">timestamp</span> <span class="p">}</span> <span class="o">@</span><span class="nb">classmethod</span> <span class="k">def</span> <span class="nf">from_praw_submission</span><span class="p">(</span> <span class="n">cls</span><span class="p">,</span> <span class="n">praw_submission</span><span class="p">:</span> <span class="n">praw</span><span class="p">.</span><span class="n">models</span><span class="p">.</span><span class="n">Submission</span> <span class="p">):</span> <span class="k">return</span> <span class="n">cls</span><span class="p">(</span> <span class="n">praw_submission</span><span class="p">.</span><span class="n">title</span><span class="p">,</span> <span class="n">praw_submission</span><span class="p">.</span><span class="n">selftext</span><span class="p">,</span> <span class="n">praw_submission</span><span class="p">.</span><span class="n">permalink</span><span class="p">,</span> <span class="n">praw_submission</span><span class="p">.</span><span class="n">author</span><span class="p">,</span> <span class="n">praw_submission</span><span class="p">.</span><span class="n">score</span><span class="p">,</span> <span class="n">dt</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">fromtimestamp</span><span class="p">(</span><span class="n">praw_submission</span><span class="p">.</span><span class="n">created_utc</span><span class="p">)</span> <span class="p">)</span> <span class="o">@</span><span class="n">dataclass</span> <span class="k">class</span> <span class="nc">RedditComment</span><span class="p">:</span> <span class="n">body</span><span class="p">:</span> <span class="nb">str</span> <span class="n">permalink</span><span class="p">:</span> <span class="nb">str</span> <span class="n">author</span><span class="p">:</span> <span class="nb">str</span> <span class="n">score</span><span 
class="p">:</span> <span class="nb">float</span> <span class="n">timestamp</span><span class="p">:</span> <span class="n">dt</span><span class="p">.</span><span class="n">datetime</span> <span class="k">def</span> <span class="nf">to_dict</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="p">{</span> <span class="s">'body'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">body</span><span class="p">,</span> <span class="s">'permalink'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">permalink</span><span class="p">,</span> <span class="s">'author'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">author</span><span class="p">,</span> <span class="s">'score'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">score</span><span class="p">,</span> <span class="s">'timestamp'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">timestamp</span> <span class="p">}</span> <span class="o">@</span><span class="nb">classmethod</span> <span class="k">def</span> <span class="nf">from_praw_comment</span><span class="p">(</span> <span class="n">cls</span><span class="p">,</span> <span class="n">praw_comment</span><span class="p">:</span> <span class="n">praw</span><span class="p">.</span><span class="n">models</span><span class="p">.</span><span class="n">Comment</span> <span class="p">):</span> <span class="k">return</span> <span class="n">cls</span><span class="p">(</span> <span class="n">praw_comment</span><span class="p">.</span><span class="n">body</span><span class="p">,</span> <span class="n">praw_comment</span><span class="p">.</span><span class="n">permalink</span><span class="p">,</span> <span class="n">praw_comment</span><span class="p">.</span><span 
class="n">author</span><span class="p">,</span> <span class="n">praw_comment</span><span class="p">.</span><span class="n">score</span><span class="p">,</span> <span class="n">dt</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">fromtimestamp</span><span class="p">(</span><span class="n">praw_comment</span><span class="p">.</span><span class="n">created_utc</span><span class="p">)</span> <span class="p">)</span> <span class="k">def</span> <span class="nf">process_submission_from_praw</span><span class="p">(</span><span class="n">praw_submission_generator</span><span class="p">):</span> <span class="k">for</span> <span class="n">praw_submission</span> <span class="ow">in</span> <span class="n">praw_submission_generator</span><span class="p">:</span> <span class="k">yield</span> <span class="n">RedditSubmission</span><span class="p">.</span><span class="n">from_praw_submission</span><span class="p">(</span><span class="n">praw_submission</span><span class="p">)</span> <span class="k">def</span> <span class="nf">process_comment_from_praw_submission</span><span class="p">(</span><span class="n">praw_submission_generator</span><span class="p">):</span> <span class="k">for</span> <span class="n">praw_submission</span> <span class="ow">in</span> <span class="n">praw_submission_generator</span><span class="p">:</span> <span class="k">for</span> <span class="n">praw_comment</span> <span class="ow">in</span> <span class="n">praw_submission</span><span class="p">.</span><span class="n">comments</span><span class="p">:</span> <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">praw_comment</span><span class="p">,</span> <span class="n">MoreComments</span><span class="p">):</span> <span class="k">continue</span> <span class="k">else</span><span class="p">:</span> <span class="k">yield</span> <span class="n">RedditComment</span><span class="p">.</span><span 
class="n">from_praw_comment</span><span class="p">(</span><span class="n">praw_comment</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">praw_submission_generator1</span> <span class="o">=</span> <span class="n">rnba</span><span class="p">.</span><span class="n">hot</span><span class="p">(</span><span class="n">limit</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">praw_submission_generator2</span> <span class="o">=</span> <span class="n">rnba</span><span class="p">.</span><span class="n">hot</span><span class="p">(</span><span class="n">limit</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">submissions</span> <span class="o">=</span> <span class="n">process_submission_from_praw</span><span class="p">(</span><span class="n">praw_submission_generator1</span><span class="p">)</span> <span class="n">comments</span> <span class="o">=</span> <span class="n">process_comment_from_praw_submission</span><span class="p">(</span><span class="n">praw_submission_generator2</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">submission_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">to_dict</span><span class="p">()</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">submissions</span><span class="p">)</span> <span class="n">comment_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">a</span><span class="p">.</span><span 
class="n">to_dict</span><span class="p">()</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">comments</span><span class="p">)</span> </code></pre></div></div> <h2 id="using-huggingface-for-sentiment-analysis">Using huggingface for sentiment analysis</h2> <p>Specifically, using <a href="https://github.com/huggingface/hfapi">huggingface api</a></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">classification_single_body</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">sentence</span><span class="p">):</span> <span class="n">classification</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">text_classification</span><span class="p">(</span><span class="n">sentence</span><span class="p">)</span> <span class="k">if</span> <span class="s">'error'</span> <span class="ow">in</span> <span class="n">classification</span><span class="p">:</span> <span class="k">return</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">None</span> <span class="n">neg_sentiment</span><span class="p">,</span> <span class="n">pos_sentiment</span> <span class="o">=</span> <span class="n">classification</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">return</span> <span class="n">neg_sentiment</span><span class="p">[</span><span class="s">'score'</span><span class="p">],</span> <span class="n">pos_sentiment</span><span class="p">[</span><span class="s">'score'</span><span class="p">]</span> <span class="k">def</span> <span class="nf">classification_multiple_body</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">bunch_of_sentences</span><span class="p">,</span> <span class="n">colnames</span><span class="o">=</span><span 
class="bp">None</span><span class="p">):</span> <span class="k">if</span> <span class="n">colnames</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span> <span class="n">colnames</span> <span class="o">=</span> <span class="p">[</span><span class="s">'negative_score'</span><span class="p">,</span> <span class="s">'positive_score'</span><span class="p">]</span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span> <span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">classification_single_body</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">x</span><span class="p">),</span> <span class="n">bunch_of_sentences</span><span class="p">),</span> <span class="n">columns</span><span class="o">=</span><span class="n">colnames</span> <span class="p">)</span> <span class="k">return</span> <span class="n">df</span> <span class="n">client</span> <span class="o">=</span> <span class="n">hfapi</span><span class="p">.</span><span class="n">Client</span><span class="p">()</span> <span class="n">classification_multiple_bodies_partial</span> <span class="o">=</span> <span class="n">partial</span><span class="p">(</span><span class="n">classification_multiple_body</span><span class="p">,</span> <span class="n">client</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">submission_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span> <span class="n">submission_df</span><span class="p">,</span> <span class="n">classification_multiple_bodies_partial</span><span class="p">(</span><span 
class="n">submission_df</span><span class="p">[</span><span class="s">'title'</span><span class="p">].</span><span class="n">to_list</span><span class="p">())</span> <span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> </code></pre></div></div> <p>Scoring the submissions, here’s a title with an appropriately positive score “Nikola Jokic leads the league in offensive win shares at 8.9. This is also more than any player’s OVERALL win shares for the current season.”</p> <p>Here’s a title that is scored as incredibly negative, but in reality is pretty positive “Kyrie Irving needs one more 3 point make to enter the 50-40-90 club for the 2020-2021 season” – being even close to the 50-40-90 club is incredible</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">submission_df</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">"negative_score"</span><span class="p">)[[</span><span class="s">'title'</span><span class="p">,</span> <span class="s">'score'</span><span class="p">,</span> <span class="s">'negative_score'</span><span class="p">,</span> <span class="s">'positive_score'</span><span class="p">]]</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>title</th> <th>score</th> <th>negative_score</th> <th>positive_score</th> </tr> </thead> <tbody> <tr> <th>19</th> <td>[Orsborn]: Mike Malone on Pop still going strong at 72: "For him to be as engaged and as locked in and as committed as he is at this juncture of h...</td> <td>241</td> <td>0.000185</td> <td>0.999816</td> </tr> <tr> <th>8</th> <td>Kevin Durant: “Stephen Curry and 
Klay Thompson are the best shooters I’ve played with.”</td> <td>1610</td> <td>0.000185</td> <td>0.999815</td> </tr> <tr> <th>12</th> <td>[Thinking Basketball] The 10 Best NBA peaks since 1977</td> <td>1346</td> <td>0.000283</td> <td>0.999717</td> </tr> <tr> <th>25</th> <td>[Highlight] Russell banks in the 3 to tie it at 124</td> <td>92</td> <td>0.000615</td> <td>0.999385</td> </tr> <tr> <th>23</th> <td>Nikola Jokic leads the league in offensive win shares at 8.9. This is also more than any player's OVERALL win shares for the current season.</td> <td>406</td> <td>0.000845</td> <td>0.999155</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>15</th> <td>Charles Barkley: "I've been poor, I've been rich, I've been fat, I've been in the Hall of Fame, and one thing I can tell you is that the Clippers ...</td> <td>23341</td> <td>0.999229</td> <td>0.000771</td> </tr> <tr> <th>38</th> <td>Kyrie Irving needs one more 3 point make to enter the 50-40-90 club for the 2020-2021 season</td> <td>443</td> <td>0.999282</td> <td>0.000718</td> </tr> <tr> <th>75</th> <td>[Stein] The Bucks' too-long-to-list-it-all injury report tonight against Charlotte includes no Giannis Antetokounmpo (left knee soreness) or Jrue ...</td> <td>43</td> <td>0.999286</td> <td>0.000714</td> </tr> <tr> <th>40</th> <td>Bucks missing all five starters against Hornets</td> <td>79</td> <td>0.999449</td> <td>0.000551</td> </tr> <tr> <th>93</th> <td>China’s Forced-Labor Backlash Threatens to Put N.B.A. 
in Unwanted Spotlight</td> <td>174</td> <td>0.999517</td> <td>0.000483</td> </tr> </tbody> </table> <p>100 rows × 4 columns</p> </div> <p>I think we were querying the API too quickly, so these responses started timing out, but you get the idea here</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">comment_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span> <span class="n">comment_df</span><span class="p">,</span> <span class="n">classification_multiple_bodies_partial</span><span class="p">(</span><span class="n">comment_df</span><span class="p">[</span><span class="s">'body'</span><span class="p">].</span><span class="n">to_list</span><span class="p">())</span> <span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> </code></pre></div></div> <h2 id="using-spacy-for-sentiment-analysis">Using spacy for sentiment analysis</h2> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">submission_df</span><span class="p">[</span><span class="s">'title_sentiment'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">.</span><span class="n">_</span><span class="p">.</span><span class="n">sentiment</span><span class="p">.</span><span class="n">polarity</span><span class="p">,</span> <span class="n">nlp</span><span class="p">.</span><span class="n">pipe</span><span class="p">(</span><span class="n">submission_df</span><span class="p">[</span><span class="s">'title'</span><span class="p">]))]</span> <span class="n">submission_df</span><span class="p">[</span><span 
class="s">'body_sentiment'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">.</span><span class="n">_</span><span class="p">.</span><span class="n">sentiment</span><span class="p">.</span><span class="n">polarity</span><span class="p">,</span> <span class="n">nlp</span><span class="p">.</span><span class="n">pipe</span><span class="p">(</span><span class="n">submission_df</span><span class="p">[</span><span class="s">'body'</span><span class="p">]))]</span> <span class="n">comment_df</span><span class="p">[</span><span class="s">'body_sentiment'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">.</span><span class="n">_</span><span class="p">.</span><span class="n">sentiment</span><span class="p">.</span><span class="n">polarity</span><span class="p">,</span> <span class="n">nlp</span><span class="p">.</span><span class="n">pipe</span><span class="p">(</span><span class="n">comment_df</span><span class="p">[</span><span class="s">'body'</span><span class="p">]))]</span> </code></pre></div></div> <p>Here’s a simple title to score “Kevin Durant: “Stephen Curry and Klay Thompson are the best shooters I’ve played with.””</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">submission_df</span><span class="p">[[</span><span class="s">'title'</span><span class="p">,</span> <span class="s">'score'</span><span class="p">,</span> <span class="s">'title_sentiment'</span><span class="p">]].</span><span class="n">sort_values</span><span class="p">(</span><span 
class="s">"title_sentiment"</span><span class="p">)</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>title</th> <th>score</th> <th>title_sentiment</th> </tr> </thead> <tbody> <tr> <th>99</th> <td>The Mavs will play 3 back-to-backs over a 7 game span to start April. Over April and May, 62% of their games will be part of a b2b</td> <td>15</td> <td>-0.400000</td> </tr> <tr> <th>83</th> <td>[Post Game Thread] The Los Angeles Clippers (35-18) defeat the Phoenix Suns (36-15), 113 - 103</td> <td>727</td> <td>-0.400000</td> </tr> <tr> <th>43</th> <td>[Post Game Thread] The Boston Celtics (27-26) defeat the Minnesota Timberwolves (13-40) in OT, 145 - 136</td> <td>49</td> <td>-0.400000</td> </tr> <tr> <th>91</th> <td>[Post Game Thread] The Dallas Mavericks (29-22) defeat the Milwaukee Bucks (32-19), 116 - 101</td> <td>754</td> <td>-0.400000</td> </tr> <tr> <th>37</th> <td>The Denver Nuggets came onto the floor for their game against the Spurs with "X Gon' Give it to Ya" playing in the background</td> <td>88</td> <td>-0.400000</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>19</th> <td>[Orsborn]: Mike Malone on Pop still going strong at 72: "For him to be as engaged and as locked in and as committed as he is at this juncture of h...</td> <td>241</td> <td>0.505556</td> </tr> <tr> <th>18</th> <td>Steve Kerr on leaving the Warriors: “I have a great job right now. 
I love coaching the Warriors, so I'm not going anywhere.”</td> <td>465</td> <td>0.528571</td> </tr> <tr> <th>84</th> <td>[Highlight] Cody Zeller perfectly blocks Sam Merrill's layup off the backboard</td> <td>15</td> <td>1.000000</td> </tr> <tr> <th>8</th> <td>Kevin Durant: “Stephen Curry and Klay Thompson are the best shooters I’ve played with.”</td> <td>1610</td> <td>1.000000</td> </tr> <tr> <th>12</th> <td>[Thinking Basketball] The 10 Best NBA peaks since 1977</td> <td>1346</td> <td>1.000000</td> </tr> </tbody> </table> <p>100 rows × 3 columns</p> </div> <p>I want to point out one comment, “Goes off 😎😎 in OT ⌛⌛ against the worst team in the league 🐺🐺”, which has a negative sentiment, probably because of the words “off” and “worst”, but the sentence itself is more positive because it’s about a player performing very well.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">comment_df</span><span class="p">[[</span><span class="s">'body'</span><span class="p">,</span> <span class="s">'score'</span><span class="p">,</span> <span class="s">'body_sentiment'</span><span class="p">]].</span><span class="n">sort_values</span><span class="p">(</span><span class="s">"body_sentiment"</span><span class="p">)</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>body</th> <th>score</th> <th>body_sentiment</th> </tr> </thead> <tbody> <tr> <th>2480</th> <td>he has some of the worst luck with injuries.</td> <td>591</td> <td>-1.0</td> </tr> <tr> <th>118</th> <td>I tea bagged your fucking drum set!!!</td> <td>3</td> <td>-1.0</td> </tr> <tr> <th>2081</th> <td>RIP to the insane plus/minus of the Spurs bench</td> <td>71</td> <td>-1.0</td> </tr> <tr> <th>1379</th>
<td>Goes off 😎😎 in OT ⌛⌛ against the worst team in the league 🐺🐺</td> <td>1</td> <td>-1.0</td> </tr> <tr> <th>1287</th> <td>fucking disgusting</td> <td>1</td> <td>-1.0</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>2270</th> <td>Perfect.... boost his confidence, while we continue to tank</td> <td>5</td> <td>1.0</td> </tr> <tr> <th>273</th> <td>It’s almost like he’s one of the best point guards of all time!</td> <td>2</td> <td>1.0</td> </tr> <tr> <th>31</th> <td>Best scorer on the Bulls since MJ</td> <td>120</td> <td>1.0</td> </tr> <tr> <th>1632</th> <td>Remember when DSJ was like the mavs best player? What a time</td> <td>1</td> <td>1.0</td> </tr> <tr> <th>436</th> <td>I will zag and point out another thing here. KD doesn't want to outright say Steph is the greatest shooter ever. He needs to add Klay to this stat...</td> <td>-1</td> <td>1.0</td> </tr> </tbody> </table> <p>3200 rows × 3 columns</p> </div> <h2 id="closing-remarks">Closing remarks</h2> <p>Thanks to praw, it was really easy to pull and gather raw data. On top of that, the plethora of NLP software now available has made it really easy to apply these models to whatever context you want.</p> <p>To really take this further, an important intermediate step would be data cleaning (correcting typos, slang, and abbreviations), and maybe filters or named entity resolution to look for specific players. Maybe you want to find some way to add weights to highly up-voted submissions/comments, or maybe you want some way to combine the sentiments from both submissions and comments. Lastly, the <em>big</em> caveat in NLP for reddit is the need for a language model sophisticated enough to capture the sarcasm, nuance, and toxicity of the reddit community (and specifically r/nba).</p> <p>Scraping Reddit, part 1 (2021-02-01), https://ahy3nz.github.io/posts/2021/02/reddit1</p> <p>In light of recent internet trends about retail investors, I’m sure many of us have questions about the kinds of content that gets posted on reddit, and if there are home-grown, analytical ways of addressing these questions. I’ll be showing two ways of parsing Reddit submissions and comments. This one focuses on querying <a href="http://pushshift.io/">pushshift API endpoints</a> with the <code class="language-plaintext highlighter-rouge">requests</code> library, some custom classes for processing the responses, and <code class="language-plaintext highlighter-rouge">asyncio</code> to handle multiple requests to pushshift asynchronously.</p> <p>This code ran quickly on my chromebook (dual-core, dual-thread, 1.90 GHz, 4 GB memory), but querying lots of data from pushshift makes some of the final cells take ~10 minutes.</p> <p>Note: at the time of putting this together, parts of pushshift appear to be down for repair/upgrade, but at least the <a href="https://github.com/pushshift/api">github repo</a> is still online.</p> <p>Raw notebook <a href="https://github.com/ahy3nz/ahy3nz.github.io/tree/master/files/notebooks">here</a>, but I didn’t bother adding an environment; most of these packages are in the Python standard library or easily available on conda or pip.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span> <span class="kn">import</span> <span class="nn">requests</span> <span
class="kn">import</span> <span class="nn">datetime</span> <span class="k">as</span> <span class="n">dt</span> <span class="kn">import</span> <span class="nn">asyncio</span> <span class="kn">import</span> <span class="nn">io</span> </code></pre></div></div> <p>At its core, we are submitting queries to a URL and getting responses to these queries. Technically speaking, this means we are submitting GET requests to pushshift endpoints.</p> <p>The endpoint generally looks something like “https://api.pushshift.io/reddit/search/submission”, and the “payload” or <code class="language-plaintext highlighter-rouge">params</code> kwarg to our request is some set of search parameters (like a keyword, subreddit, or timestamp info); see the <a href="https://pushshift.io/api-parameters/">pushshift API parameters here</a>. With this endpoint, we’re searching Reddit submissions (not comments).</p> <p>One of the simpler payloads searches a subreddit within a particular time window. 
This requires before and after timestamps, which can easily be handled with Python’s <code class="language-plaintext highlighter-rouge">datetime</code> library.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">today</span> <span class="o">=</span> <span class="n">dt</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">today</span><span class="p">().</span><span class="n">replace</span><span class="p">(</span><span class="n">hour</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">minute</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">second</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">microsecond</span><span class="o">=</span><span class="mi">0</span><span class="p">).</span><span class="n">timestamp</span><span class="p">()</span> <span class="n">today_minus_seven</span> <span class="o">=</span> <span class="p">(</span><span class="n">dt</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">today</span><span class="p">().</span><span class="n">replace</span><span class="p">(</span><span class="n">hour</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">minute</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">second</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">microsecond</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="o">-</span> <span class="n">dt</span><span class="p">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="mi">7</span><span class="p">)).</span><span class="n">timestamp</span><span 
class="p">()</span> <span class="n">today_minus_eight</span> <span class="o">=</span> <span class="p">(</span><span class="n">dt</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">today</span><span class="p">().</span><span class="n">replace</span><span class="p">(</span><span class="n">hour</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">minute</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">second</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">microsecond</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="o">-</span> <span class="n">dt</span><span class="p">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="mi">8</span><span class="p">)).</span><span class="n">timestamp</span><span class="p">()</span> </code></pre></div></div> <p>This is the actual GET request; note the URL as the main arg, and the various search parameters in the <code class="language-plaintext highlighter-rouge">params</code> kwarg.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">reddit_response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"https://api.pushshift.io/reddit/search/submission"</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="p">{</span><span class="s">'subreddit'</span><span class="p">:</span> <span class="s">'stocks'</span><span class="p">,</span> <span class="s">'before'</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">today_minus_seven</span><span class="p">),</span> <span 
class="s">'after'</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">today_minus_eight</span><span class="p">)})</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">reddit_response</span><span class="p">.</span><span class="n">status_code</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>200 </code></pre></div></div> <p>There are a variety of ways to parse <a href="https://requests.readthedocs.io/en/master/">request responses</a>, but here’s one way to parse the title and text from the response to a Reddit submission get request</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">reddit_response</span><span class="p">.</span><span class="n">json</span><span class="p">()[</span><span class="s">'data'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s">'title'</span><span class="p">],</span><span class="n">reddit_response</span><span class="p">.</span><span class="n">json</span><span class="p">()[</span><span class="s">'data'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s">'selftext'</span><span class="p">],</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>('Would it be wise to increase the geographical diversity of my portfolio?', 'Hello everyone, \n\nMy portfolio of 16 companies consists of 13 US stocks because they all seem to have some of the highest potential returns but in the midst of the pandemic I feel I should reallocate some resources towards European and UK stocks. 
Is anyone watching any interesting non-US stocks at the moment?') </code></pre></div></div> <p>As a little bit of dressing on top, we can grab a list of stock tickers. There are a lot of sources to pull tickers from (<code class="language-plaintext highlighter-rouge">yfinance</code> is a popular one), but we can also pull a list of tickers from the SEC</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ticker_response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"https://www.sec.gov/include/ticker.txt"</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tickers</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span> <span class="n">io</span><span class="p">.</span><span class="n">StringIO</span><span class="p">(</span><span class="n">ticker_response</span><span class="p">.</span><span class="n">text</span><span class="p">),</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">'</span><span class="se">\t</span><span class="s">'</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">)[</span><span class="mi">0</span><span class="p">].</span><span class="n">to_list</span><span class="p">()</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tickers</span><span class="p">[:</span><span class="mi">5</span><span class="p">]</span> </code></pre></div></div> 
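<p>As a rough, offline illustration of the parsing step above (a sketch, not part of the original notebook): the sample text below is invented to mimic a tab-separated file with a ticker column followed by a numeric ID column, and <code class="language-plaintext highlighter-rouge">parse_tickers</code> is a hypothetical helper name. It swaps pandas for the standard library <code class="language-plaintext highlighter-rouge">csv</code> module, which reads every field as a plain string, so a ticker that happens to look like a missing-value marker stays text.</p>

```python
import csv
import io

# Hypothetical sample: ticker, tab, numeric ID. The values are invented
# for illustration and are not taken from the real SEC file.
SAMPLE_TEXT = "aaa\t1\nbbb\t2\nccc\t3\n"


def parse_tickers(text):
    """Return the first tab-separated column of `text` as a list of strings."""
    # Treat the downloaded text as an in-memory file, one record per line.
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    # Skip blank lines; keep only the first field (the ticker) of each row.
    return [row[0] for row in reader if row]


sample_tickers = parse_tickers(SAMPLE_TEXT)
print(sample_tickers)
```

<p>Back in the notebook, the first five tickers from the real SEC file are:</p>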
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['aapl', 'msft', 'amzn', 'goog', 'tcehy'] </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">string</span> <span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Union</span><span class="p">,</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">Optional</span><span class="p">,</span> <span class="n">Any</span> <span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span> <span class="kn">from</span> <span class="nn">requests</span> <span class="kn">import</span> <span class="n">Response</span> <span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span> </code></pre></div></div> <p>We have all the raw information contained within the request response object, but for data processing purposes, we can define a class and some functions to simplify the work.</p> <p>Key characteristics:</p> <ul> <li>A corresponding python object property for each relevant property of a typical reddit submission. 
<ul> <li>Unfortunately, the <code class="language-plaintext highlighter-rouge">score</code> property from pushshift isn’t the most reliable because it’s only a snapshot from when the data were indexed</li> </ul> </li> <li><code class="language-plaintext highlighter-rouge">summarize()</code>, which uses <code class="language-plaintext highlighter-rouge">collections.Counter</code> to tally up how frequently each stock ticker appears</li> <li><code class="language-plaintext highlighter-rouge">to_dict()</code> for serialization and conversion to pandas</li> <li><code class="language-plaintext highlighter-rouge">from_response()</code> to quickly instantiate a <code class="language-plaintext highlighter-rouge">List[RedditSubmission]</code> from a single response</li> </ul> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span> <span class="k">class</span> <span class="nc">RedditSubmission</span><span class="p">:</span> <span class="n">title</span><span class="p">:</span> <span class="nb">str</span> <span class="n">body</span><span class="p">:</span> <span class="nb">str</span> <span class="n">permalink</span><span class="p">:</span> <span class="nb">str</span> <span class="n">author</span><span class="p">:</span> <span class="nb">str</span> <span class="n">score</span><span class="p">:</span> <span class="nb">float</span> <span class="n">timestamp</span><span class="p">:</span> <span class="n">dt</span><span class="p">.</span><span class="n">datetime</span> <span class="k">def</span> <span class="nf">summarize</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tickers</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span> <span class="n">weighted</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span 
class="bp">True</span> <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Union</span><span class="p">[</span><span class="nb">float</span><span class="p">,</span> <span class="nb">int</span><span class="p">]]:</span> <span class="s">""" Process RedditSubmission for tickers Use a Counter to count the number of times a ticker occurs. Include some corrections for punctuation """</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">title</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span> <span class="n">title_no_punctuation</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">title</span><span class="p">.</span><span class="n">translate</span><span class="p">(</span> <span class="nb">str</span><span class="p">.</span><span class="n">maketrans</span><span class="p">(</span><span class="s">''</span><span class="p">,</span> <span class="s">''</span><span class="p">,</span> <span class="n">string</span><span class="p">.</span><span class="n">punctuation</span><span class="p">)</span> <span class="p">)</span> <span class="n">tickers_title</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span> <span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">tickers</span><span class="p">,</span> <span class="n">title_no_punctuation</span><span class="p">.</span><span class="n">split</span><span class="p">())</span> <span class="p">)</span> <span class="k">else</span><span class="p">:</span> <span class="n">tickers_title</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">()</span> <span class="k">if</span> 
<span class="bp">self</span><span class="p">.</span><span class="n">body</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span> <span class="n">body_no_punctuation</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">body</span><span class="p">.</span><span class="n">translate</span><span class="p">(</span> <span class="nb">str</span><span class="p">.</span><span class="n">maketrans</span><span class="p">(</span><span class="s">''</span><span class="p">,</span> <span class="s">''</span><span class="p">,</span> <span class="n">string</span><span class="p">.</span><span class="n">punctuation</span><span class="p">)</span> <span class="p">)</span> <span class="n">tickers_body</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span> <span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">tickers</span><span class="p">,</span> <span class="n">body_no_punctuation</span><span class="p">.</span><span class="n">split</span><span class="p">())</span> <span class="p">)</span> <span class="k">else</span><span class="p">:</span> <span class="n">tickers_body</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">()</span> <span class="n">total_tickers</span> <span class="o">=</span> <span class="n">tickers_title</span> <span class="o">+</span> <span class="n">tickers_body</span> <span class="k">return</span> <span class="n">total_tickers</span> <span class="k">def</span> <span class="nf">to_dict</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="p">{</span> <span class="s">'title'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">title</span><span 
class="p">,</span> <span class="s">'body'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">body</span><span class="p">,</span> <span class="s">'permalink'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">permalink</span><span class="p">,</span> <span class="s">'author'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">author</span><span class="p">,</span> <span class="s">'score'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">score</span><span class="p">,</span> <span class="s">'timestamp'</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">timestamp</span> <span class="p">}</span> <span class="o">@</span><span class="nb">classmethod</span> <span class="k">def</span> <span class="nf">from_response</span><span class="p">(</span> <span class="n">cls</span><span class="p">,</span> <span class="n">resp_object</span><span class="p">:</span> <span class="n">Response</span> <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Optional</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="n">Any</span><span class="p">]]:</span> <span class="s">""" Create a list of RedditSubmission objects from response"""</span> <span class="k">if</span> <span class="n">resp_object</span><span class="p">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span><span class="p">:</span> <span class="n">processed_response</span> <span class="o">=</span> <span class="p">[</span> <span class="n">cls</span><span class="p">(</span> <span class="n">msg</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"title"</span><span class="p">,</span> <span class="bp">None</span><span class="p">),</span> <span 
class="n">msg</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"body"</span><span class="p">,</span> <span class="bp">None</span><span class="p">),</span> <span class="n">msg</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"permalink"</span><span class="p">,</span> <span class="bp">None</span><span class="p">),</span> <span class="n">msg</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"author"</span><span class="p">,</span> <span class="bp">None</span><span class="p">),</span> <span class="n">msg</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"score"</span><span class="p">,</span> <span class="bp">None</span><span class="p">),</span> <span class="p">(</span> <span class="n">dt</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">fromtimestamp</span><span class="p">(</span><span class="n">msg</span><span class="p">[</span><span class="s">'created_utc'</span><span class="p">])</span> <span class="k">if</span> <span class="n">msg</span><span class="p">[</span><span class="s">'created_utc'</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="k">else</span> <span class="bp">None</span> <span class="p">)</span> <span class="p">)</span> <span class="k">for</span> <span class="n">msg</span> <span class="ow">in</span> <span class="n">resp_object</span><span class="p">.</span><span class="n">json</span><span class="p">()[</span><span class="s">'data'</span><span class="p">]</span> <span class="p">]</span> <span class="k">return</span> <span class="n">processed_response</span> <span class="k">else</span><span class="p">:</span> <span class="k">return</span> <span class="bp">None</span> </code></pre></div></div> <p>In reality, there’s a decently-long wait time after 
we make the initial get request. The time to make and process the request is actually fairly quick, so this is a good opportunity to use python’s <a href="https://docs.python.org/3/library/asyncio.html">asyncio</a> library.</p> <p>Asyncio allows for concurrency in a different manner than multiprocessing or multithreading. You can have many tasks running, but only one is “controlling” the CPU, and gives up control when it’s not actively doing any work (like waiting for a response from the pushshift server).</p> <p>The overall syntax is very similar to writing any other python function</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">def</span> <span class="nf">submission_request_coroutine</span><span class="p">(</span><span class="o">**</span><span class="n">kwargs</span><span class="p">):</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span> <span class="n">reddit_response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"https://api.pushshift.io/reddit/search/submission"</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="n">kwargs</span><span class="p">)</span> <span class="k">return</span> <span class="n">reddit_response</span> </code></pre></div></div> <p>Define a range of timestamps, initialize an async coroutine for each timestamp, then use asyncio to submit each request and gather them back together</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">snapshots</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">date_range</span><span class="p">(</span> <span 
class="n">start</span><span class="o">=</span><span class="n">dt</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">tz</span><span class="o">=</span><span class="n">dt</span><span class="p">.</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">)</span> <span class="o">-</span> <span class="n">dt</span><span class="p">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="mi">7</span><span class="p">),</span> <span class="n">end</span><span class="o">=</span><span class="n">dt</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">tz</span><span class="o">=</span><span class="n">dt</span><span class="p">.</span><span class="n">timezone</span><span class="p">.</span><span class="n">utc</span><span class="p">)</span> <span class="o">-</span> <span class="n">dt</span><span class="p">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">freq</span><span class="o">=</span><span class="s">'10min'</span> <span class="p">)</span> <span class="n">tasks</span> <span class="o">=</span> <span class="p">[</span> <span class="n">submission_request_coroutine</span><span class="p">(</span><span class="n">subreddit</span><span class="o">=</span><span class="s">'stocks'</span><span class="p">,</span> <span class="n">after</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">snapshot</span><span class="p">.</span><span class="n">timestamp</span><span class="p">()),</span> <span class="n">before</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span 
class="n">snapshots</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">].</span><span class="n">timestamp</span><span class="p">()),</span> <span class="n">size</span><span class="o">=</span><span class="mi">10</span> <span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">snapshot</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">snapshots</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="p">]</span> <span class="n">all_submission_responses</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">gather</span><span class="p">(</span> <span class="o">*</span><span class="n">tasks</span> <span class="p">)</span> </code></pre></div></div> <p>The result is a <code class="language-plaintext highlighter-rouge">List[Response]</code>, which we can convert to a <code class="language-plaintext highlighter-rouge">List[List[RedditSubmission]]</code>, then flatten into a <code class="language-plaintext highlighter-rouge">List[RedditSubmission]</code> with itertools.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">itertools</span> <span class="k">as</span> <span class="n">it</span> <span class="n">reddit_submissions</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">it</span><span class="p">.</span><span class="n">chain</span><span class="p">.</span><span class="n">from_iterable</span><span class="p">(</span> <span class="n">RedditSubmission</span><span class="p">.</span><span class="n">from_response</span><span class="p">(</span><span class="n">resp</span><span class="p">)</span> <span class="k">for</span> 
<span class="n">resp</span> <span class="ow">in</span> <span class="n">all_submission_responses</span> <span class="k">if</span> <span class="n">resp</span><span class="p">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span> <span class="p">)]</span> </code></pre></div></div> <p>We can get a ticker counter for each <code class="language-plaintext highlighter-rouge">RedditSubmission</code>, but we’d like to quickly aggregate them all into a single summary ticker counter over all the Reddit submissions in our time window. This can be easily achieved with <code class="language-plaintext highlighter-rouge">functools.reduce</code>.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="nb">reduce</span> <span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span> <span class="k">def</span> <span class="nf">aggregate_dictionaries</span><span class="p">(</span><span class="n">d1</span><span class="p">,</span> <span class="n">d2</span><span class="p">):</span> <span class="s">""" Given two dictionaries, aggregate key-value pairs """</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">d1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span> <span class="k">return</span> <span class="nb">dict</span><span class="p">(</span><span class="n">Counter</span><span class="p">(</span><span class="o">**</span><span class="n">d2</span><span class="p">).</span><span class="n">most_common</span><span class="p">())</span> <span class="n">my_counter</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span><span class="o">**</span><span class="n">d1</span><span class="p">)</span> <span 
class="n">my_counter</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">d2</span><span class="p">)</span> <span class="k">return</span> <span class="nb">dict</span><span class="p">(</span><span class="n">my_counter</span><span class="p">.</span><span class="n">most_common</span><span class="p">())</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">submissions_breakdown</span> <span class="o">=</span> <span class="nb">reduce</span><span class="p">(</span> <span class="n">aggregate_dictionaries</span><span class="p">,</span> <span class="p">(</span><span class="n">submission</span><span class="p">.</span><span class="n">summarize</span><span class="p">(</span><span class="n">tickers</span><span class="p">)</span> <span class="k">for</span> <span class="n">submission</span> <span class="ow">in</span> <span class="n">reddit_submissions</span><span class="p">)</span> <span class="p">)</span> </code></pre></div></div> <p>It seems the list of tickers from the SEC was pretty generous ($A appears to be a ticker), but we can subselect for some of the recent trending tickers</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">submissions_breakdown</span><span class="p">[</span><span class="s">'gme'</span><span class="p">],</span> <span class="n">submissions_breakdown</span><span class="p">[</span><span class="s">'amc'</span><span class="p">]</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(18, 11) </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">submissions_breakdown</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>{'a': 322, 'on': 234, 'for': 181, 'it': 105, 'or': 76, 'be': 76, 'next': 71, 'are': 62, 'new': 56, 'good': 54, 'now': 53, 'can': 52, 'all': 49, 'at': 45, 'out': 40, 'amp': 34, 'an': 33, 'by': 31, 'go': 30, 'has': 26, 'am': 24, 'any': 22, 'when': 21, 'best': 20, 'vs': 20, 'one': 19, 'so': 18, 'gme': 18, 'big': 17, 'free': 15, 'play': 13, 'apps': 13, 'amc': 11, 'cash': 10, 'see': 10, 'find': 9, 'run': 8, 'rise': 7, 'else': 7, 'ever': 7, 'work': 6, 'real': 6, 'open': 6, 'wall': 5, 'fund': 5, 'post': 5, 'love': 5, 'well': 5, 'very': 5, 'ago': 5, 'info': 5, 'plan': 5, 'pay': 5, 'bit': 5, 'ride': 4, 'life': 4, 'huge': 4, 'low': 4, 'nok': 4, 'grow': 4, 'cap': 4, 'link': 3, 'safe': 3, 'plus': 3, 'fast': 3, 'stay': 3, 'tech': 3, 'fun': 3, 'he': 3, 'step': 3, 'turn': 3, 'live': 3, 'site': 3, 'ways': 3, 'hear': 2, 'teva': 2, 'bb': 2, 'co': 2, 'boom': 2, 'nice': 2, 'mass': 2, 'peak': 2, 'max': 2, 'wash': 2, 'pump': 2, 'tell': 2, 'fly': 2, 'pros': 2, 'rock': 1, 'both': 1, 'gt': 1, 'loan': 1, 'nga': 1, 'invu': 1, 'most': 1, 'ofc': 1, 'nio': 1, 'spot': 1, 'min': 1, 'onto': 1, 'evfm': 1, 'blue': 1, 'nat': 1, 'pure': 1, 'sign': 1, 'man': 1, 'st': 1, 'de': 1, 'w': 1, 'trtc': 1, 'form': 1, 'hi': 1, 'joe': 1, 'true': 1, 'home': 1, 'vrs': 1, 'med': 1, 'sqz': 1, 'five': 1, 'ship': 1, 'trxc': 1, 'wish': 1, 're': 1, 'car': 1, 'nakd': 1, 'rkt': 1, 'flex': 1, 'pm': 1, 'ppl': 1, 'earn': 1, 'flow': 1, 'lscc': 1, 'peg': 1, 'two': 1, 'gain': 1, 'wow': 1, 'pro': 1, 'team': 1, 'fix': 1, 'fnko': 1, 'et': 1, 'al': 1, 'muh': 1, 'save': 1, 'gold': 1, 'beat': 1, 'vive': 1, 'u': 1, 'rh': 1, 'x': 1, 'vxrt': 1, 'mind': 1, 'ehth': 1, 'job': 1, 'road': 1, 'box': 1} </code></pre></div></div> <p>Lastly, if we’re not interested in the tickers that occur, we can still boil all the data into a single dataframe</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span 
class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">to_dict</span><span class="p">()</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">reddit_submissions</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>title</th> <th>body</th> <th>permalink</th> <th>author</th> <th>score</th> <th>timestamp</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>KSTR ETF "The nasdaq of china"</td> <td>None</td> <td>/r/stocks/comments/l664ce/kstr_etf_the_nasdaq_...</td> <td>GioDesa</td> <td>1</td> <td>2021-01-27 09:56:46</td> </tr> <tr> <th>1</th> <td>Opinions/Projections on AMC?</td> <td>None</td> <td>/r/stocks/comments/l665a0/opinionsprojections_...</td> <td>Double_jn_it</td> <td>1</td> <td>2021-01-27 09:58:03</td> </tr> <tr> <th>2</th> <td>GE, SPCE, &amp;amp; PLUG</td> <td>None</td> <td>/r/stocks/comments/l6668r/ge_spce_plug/</td> <td>_MeatLoafLover</td> <td>1</td> <td>2021-01-27 09:59:21</td> </tr> <tr> <th>3</th> <td>Reddit is under DDOS attack. 
Certain gaming re...</td> <td>None</td> <td>/r/stocks/comments/l66692/reddit_is_under_ddos...</td> <td>theBacillus</td> <td>1</td> <td>2021-01-27 09:59:22</td> </tr> <tr> <th>4</th> <td>#GainStock</td> <td>None</td> <td>/r/stocks/comments/l66777/gainstock/</td> <td>lxPHENOMENONxl</td> <td>1</td> <td>2021-01-27 10:00:19</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>2338</th> <td>AN OPEN LETTER TO GAMESTOP CEO</td> <td>None</td> <td>/r/stocks/comments/l98k85/an_open_letter_to_ga...</td> <td>Artuhan</td> <td>1</td> <td>2021-01-31 03:55:51</td> </tr> <tr> <th>2339</th> <td>AN OPEN LETTER TO GAMESTOP CEO</td> <td>None</td> <td>/r/stocks/comments/l98lai/an_open_letter_to_ga...</td> <td>Artuhan</td> <td>1</td> <td>2021-01-31 03:58:05</td> </tr> <tr> <th>2340</th> <td>Thoughts on YOLO (AdvisorShares Pure Cannabis ...</td> <td>None</td> <td>/r/stocks/comments/l98nly/thoughts_on_yolo_adv...</td> <td>ConfidentProgrammer1</td> <td>1</td> <td>2021-01-31 04:02:29</td> </tr> <tr> <th>2341</th> <td>Daily advice</td> <td>None</td> <td>/r/stocks/comments/l98pic/daily_advice/</td> <td>Bukprotingas</td> <td>1</td> <td>2021-01-31 04:06:24</td> </tr> <tr> <th>2342</th> <td>AMC- Next stop?</td> <td>None</td> <td>/r/stocks/comments/l98pif/amc_next_stop/</td> <td>Hj-Fish</td> <td>1</td> <td>2021-01-31 04:06:24</td> </tr> </tbody> </table> <p>2343 rows × 6 columns</p> </div> <h1 id="next-up">Next up</h1> <p>While we just built our own Reddit API from some fundamental python libraries, there are more sophisticated APIs out there that do a better job of querying Reddit, like <a href="https://praw.readthedocs.io/en/latest/">praw</a>. With those, we could try some other things like sentiment analysis.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> </code></pre></div></div>Alex H. 
Yang[email protected]In light of recent internet trends about retail investors, I’m sure many of us have questions about the kinds of content that get posted on Reddit, and if there are home-grown, analytical ways of addressing these questions. I’ll be showing two ways of parsing submissions and comments to Reddit, this one focusing on using pushshift API endpoints using the requests library, some custom classes for processing these responses, and asyncio to handle multiple requests to pushshift concurrently.Accessing FoldingAtHome data on AWS2020-12-29T00:00:00-06:002020-12-29T00:00:00-06:00https://ahy3nz.github.io/posts/2020/12/fahonaws<p>Some F@H data is <a href="https://registry.opendata.aws/foldingathome-covid19/">freely accessible on AWS</a>. This will be a relatively short post on accessing and navigating the data on AWS.</p> <p>If you regularly use AWS, this will be nothing new. If you’re a grad student who has only ever navigated local file directories or used <code class="language-plaintext highlighter-rouge">scp</code>/<code class="language-plaintext highlighter-rouge">rsync</code>/<code class="language-plaintext highlighter-rouge">ssh</code> to interact with remote clusters, this might be your first time interacting with files on AWS S3.</p> <p>The python environment is a fairly straightforward analytical environment, with s3fs, boto3, and botocore added to interact with files on S3.</p> <p><code class="language-plaintext highlighter-rouge">conda create -n fahaws python=3.7 pandas s3fs jupyter ipykernel -c conda-forge -yq</code></p> <p>(Activate the environment)</p> <p><code class="language-plaintext highlighter-rouge">python -m pip install boto3 botocore</code></p> <h2 id="the-aws-cli">The AWS CLI</h2> <p>The tools to navigate files within AWS directories follow those of unix-like systems. 
See the <a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html">AWS CLI installation</a> guide.</p> <p>Run <code class="language-plaintext highlighter-rouge">aws s3 ls s3://fah-public-data-covid19-absolute-free-energy/ --no-sign-request</code> to list files within this particular S3 bucket. The <code class="language-plaintext highlighter-rouge">--no-sign-request</code> flag at the end bypasses the need for any credentials.</p> <p>You can read from stdout or pipe the output to a text file; either way, this command will be your bread and butter for wading through terabytes and terabytes of F@H data.</p> <p>As of this post (Dec 2020), it looks like the files in <code class="language-plaintext highlighter-rouge">free_energy_data/</code> were last updated at the end of Sept 2020.</p> <h2 id="summary-of-free-energy-results-data">Summary of free energy results data</h2> <p>Fortunately, loading remote files via pandas is a common task, so there are convenient functions. Loading a dataframe over S3 is just like loading a dataframe locally (note the S3 string syntax).</p> <p>The column <code class="language-plaintext highlighter-rouge">febkT</code> looks like the binding free energies in units of $k_B T$ (multiply by Boltzmann’s constant and temperature to get energies in kJ or kcal). 
It’s worth mentioning that the value of the binding free energy is not as helpful as the <em>relative</em> binding free energy to find the best binder of the bunch (how do these free energies compare against each other?)</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_pickle</span><span class="p">(</span><span class="s">"s3://fah-public-data-covid19-absolute-free-energy/free_energy_data/results.pkl"</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>dataset</th> <th>fah</th> <th>identity</th> <th>receptor</th> <th>score</th> <th>febkT</th> <th>error</th> <th>ns_RL</th> <th>ns_L</th> <th>wl_RL</th> <th>L_error</th> <th>RL_error</th> </tr> </thead> <tbody> <tr> <th>1155</th> <td>MS0323_v3</td> <td>PROJ14822/RUN258</td> <td>DAR-DIA-43a-5</td> <td>protein-0387.pdb</td> <td>-5.201610</td> <td>-25.546943</td> <td>3.773523</td> <td>[131, 89, 74, 113, 80]</td> <td>[450, 490, 540, 410, 620]</td> <td>[0.18446, 0.14757, 0.18446, 0.18446, 0.18446]</td> <td>0.116912</td> <td>3.280887</td> </tr> <tr> <th>609</th> <td>MS0326_v3</td> <td>PROJ14823/RUN1202</td> 
<td>MUS-SCH-c2f-13</td> <td>Mpro-x0107-protein.pdb</td> <td>-9.550890</td> <td>-25.259420</td> <td>22.776358</td> <td>[121, 138, 96, 16, 5]</td> <td>[200, 200, 200, 200, 200]</td> <td>[0.18446, 0.18446, 0.23058, 0.23058, 0.23058]</td> <td>16.216396</td> <td>0.109175</td> </tr> <tr> <th>759</th> <td>MS0331_v3</td> <td>PROJ14825/RUN685</td> <td>MAK-UNK-129-18</td> <td>Mpro-x0107_0.pdb</td> <td>-8.425830</td> <td>-24.789359</td> <td>18.021078</td> <td>[58, 68, 5, 7]</td> <td>[200]</td> <td>[0.37782, 0.30226, 0.9224, 0.59034]</td> <td>0.000000</td> <td>9.238496</td> </tr> <tr> <th>615</th> <td>MS0326_v3</td> <td>PROJ14823/RUN2911</td> <td>ÁLV-UNI-7ff-30</td> <td>Mpro-x0540-protein.pdb</td> <td>-2.774634</td> <td>-24.447756</td> <td>6.605737</td> <td>[174, 124, 70]</td> <td>[200, 200, 200, 200, 200]</td> <td>[0.14757, 0.14757, 0.18446]</td> <td>0.042010</td> <td>5.184169</td> </tr> <tr> <th>1086</th> <td>MS0326_v3</td> <td>PROJ14823/RUN2580</td> <td>SEL-UNI-842-3</td> <td>Mpro-x0397-protein.pdb</td> <td>-4.474095</td> <td>-23.705301</td> <td>1.248983</td> <td>[166, 134, 45]</td> <td>[200, 200, 200, 200, 200]</td> <td>[0.18015, 0.22519, 0.35183]</td> <td>0.212546</td> <td>2.529874</td> </tr> </tbody> </table> </div> <h2 id="some-code-to-iterate-through-these-buckets">Some code to iterate through these buckets</h2> <p>Pythonically, we can build some S3 code to list each object in this S3 bucket.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">boto3</span> <span class="kn">from</span> <span class="nn">botocore</span> <span class="kn">import</span> <span class="n">UNSIGNED</span> <span class="kn">from</span> <span class="nn">botocore.client</span> <span class="kn">import</span> <span class="n">Config</span> <span class="n">s3</span> <span class="o">=</span> <span class="n">boto3</span><span class="p">.</span><span class="n">resource</span><span class="p">(</span><span 
class="s">'s3'</span><span class="p">,</span> <span class="n">config</span><span class="o">=</span><span class="n">Config</span><span class="p">(</span><span class="n">signature_version</span><span class="o">=</span><span class="n">UNSIGNED</span><span class="p">))</span> <span class="n">s3_client</span> <span class="o">=</span> <span class="n">boto3</span><span class="p">.</span><span class="n">client</span><span class="p">(</span><span class="s">'s3'</span><span class="p">,</span> <span class="n">config</span><span class="o">=</span><span class="n">Config</span><span class="p">(</span><span class="n">signature_version</span><span class="o">=</span><span class="n">UNSIGNED</span><span class="p">))</span> <span class="n">bucket_name</span> <span class="o">=</span> <span class="s">"fah-public-data-covid19-absolute-free-energy"</span> <span class="n">bucket</span> <span class="o">=</span> <span class="n">s3</span><span class="p">.</span><span class="n">Bucket</span><span class="p">(</span><span class="n">bucket_name</span><span class="p">)</span> </code></pre></div></div> <p>This S3 bucket is very large – all the simulation inputs, trajectories, and outputs are in here, so it will take a while to enumerate every object. 
Instead, we’ll just make a generator and pull out a single item for proof-of-concept.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">paginator</span> <span class="o">=</span> <span class="n">s3_client</span><span class="p">.</span><span class="n">get_paginator</span><span class="p">(</span><span class="s">'list_objects_v2'</span><span class="p">)</span> <span class="n">pages</span> <span class="o">=</span> <span class="n">paginator</span><span class="p">.</span><span class="n">paginate</span><span class="p">(</span><span class="n">Bucket</span><span class="o">=</span><span class="n">bucket_name</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">page_iterator</span><span class="p">(</span><span class="n">pages</span><span class="p">):</span> <span class="k">for</span> <span class="n">page</span> <span class="ow">in</span> <span class="n">pages</span><span class="p">:</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">page</span><span class="p">[</span><span class="s">'Contents'</span><span class="p">]:</span> <span class="k">yield</span> <span class="n">item</span><span class="p">[</span><span class="s">'Key'</span><span class="p">]</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">all_objects</span> <span class="o">=</span> <span class="n">page_iterator</span><span class="p">(</span><span class="n">pages</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">next</span><span class="p">(</span><span class="n">all_objects</span><span class="p">)</span> </code></pre></div></div> <div 
class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'PROJ14377/RUN0/CLONE0/frame0.tpr' </code></pre></div></div> <p>And if you wanted to, you could layer a filter over the generator to impose some logic, like filtering for objects that sit directly inside the top-level directories.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">first_level_dirs</span> <span class="o">=</span> <span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">'/'</span><span class="p">)</span><span class="o">==</span><span class="mi">1</span><span class="p">,</span> <span class="n">all_objects</span><span class="p">)</span> </code></pre></div></div> <h1 id="unix-like-python-filesytem-libraries">Unix-like python filesystem libraries</h1> <p><a href="https://s3fs.readthedocs.io/en/latest/">S3FS</a>, built on botocore and fsspec, has a very unix-like syntax to navigate and open files.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">s3fs</span> <span class="n">fs</span> <span class="o">=</span> <span class="n">s3fs</span><span class="p">.</span><span class="n">S3FileSystem</span><span class="p">(</span><span class="n">anon</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fs</span><span class="p">.</span><span class="n">ls</span><span class="p">(</span><span class="n">bucket_name</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>['fah-public-data-covid19-absolute-free-energy/PROJ14377', 'fah-public-data-covid19-absolute-free-energy/PROJ14378', 'fah-public-data-covid19-absolute-free-energy/PROJ14379', 'fah-public-data-covid19-absolute-free-energy/PROJ14380', 'fah-public-data-covid19-absolute-free-energy/PROJ14383', 'fah-public-data-covid19-absolute-free-energy/PROJ14384', 'fah-public-data-covid19-absolute-free-energy/PROJ14630', 'fah-public-data-covid19-absolute-free-energy/PROJ14631', 'fah-public-data-covid19-absolute-free-energy/PROJ14650', 'fah-public-data-covid19-absolute-free-energy/PROJ14651', 'fah-public-data-covid19-absolute-free-energy/PROJ14652', 'fah-public-data-covid19-absolute-free-energy/PROJ14653', 'fah-public-data-covid19-absolute-free-energy/PROJ14654', 'fah-public-data-covid19-absolute-free-energy/PROJ14655', 'fah-public-data-covid19-absolute-free-energy/PROJ14656', 'fah-public-data-covid19-absolute-free-energy/PROJ14665', 'fah-public-data-covid19-absolute-free-energy/PROJ14666', 'fah-public-data-covid19-absolute-free-energy/PROJ14667', 'fah-public-data-covid19-absolute-free-energy/PROJ14668', 'fah-public-data-covid19-absolute-free-energy/PROJ14669', 'fah-public-data-covid19-absolute-free-energy/PROJ14670', 'fah-public-data-covid19-absolute-free-energy/PROJ14671', 'fah-public-data-covid19-absolute-free-energy/PROJ14702', 'fah-public-data-covid19-absolute-free-energy/PROJ14703', 'fah-public-data-covid19-absolute-free-energy/PROJ14704', 'fah-public-data-covid19-absolute-free-energy/PROJ14705', 'fah-public-data-covid19-absolute-free-energy/PROJ14723', 'fah-public-data-covid19-absolute-free-energy/PROJ14724', 'fah-public-data-covid19-absolute-free-energy/PROJ14726', 'fah-public-data-covid19-absolute-free-energy/PROJ14802', 'fah-public-data-covid19-absolute-free-energy/PROJ14803', 'fah-public-data-covid19-absolute-free-energy/PROJ14804', 'fah-public-data-covid19-absolute-free-energy/PROJ14805', 'fah-public-data-covid19-absolute-free-energy/PROJ14806', 
'fah-public-data-covid19-absolute-free-energy/PROJ14807', 'fah-public-data-covid19-absolute-free-energy/PROJ14808', 'fah-public-data-covid19-absolute-free-energy/PROJ14809', 'fah-public-data-covid19-absolute-free-energy/PROJ14810', 'fah-public-data-covid19-absolute-free-energy/PROJ14811', 'fah-public-data-covid19-absolute-free-energy/PROJ14812', 'fah-public-data-covid19-absolute-free-energy/PROJ14813', 'fah-public-data-covid19-absolute-free-energy/PROJ14823', 'fah-public-data-covid19-absolute-free-energy/PROJ14824', 'fah-public-data-covid19-absolute-free-energy/PROJ14826', 'fah-public-data-covid19-absolute-free-energy/PROJ14833', 'fah-public-data-covid19-absolute-free-energy/SVR51748107', 'fah-public-data-covid19-absolute-free-energy/free_energy_data', 'fah-public-data-covid19-absolute-free-energy/receptor_structures.tar.gz', 'fah-public-data-covid19-absolute-free-energy/setup_files'] </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fs</span><span class="p">.</span><span class="n">ls</span><span class="p">(</span><span class="n">bucket_name</span> <span class="o">+</span> <span class="s">"/free_energy_data"</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['fah-public-data-covid19-absolute-free-energy/free_energy_data/', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_L_14382.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14717.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14718.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14719.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14720.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14817.pkl', 
'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14818.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14819.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/BRO_RL_14820.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_L_14676.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_RL_14730.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/HITS_RL_14830.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_L_14374.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_RL_14721.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MLTN_RL_14821.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_L_14364.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_RL_14722.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0323_RL_14822.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_L_14369_14372_14370_14371.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14723.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14724.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14823.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0326_RL_14824.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_L_14376.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_RL_14725.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0331_RL_14825.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_L_14380.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14727.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14728.pkl', 
'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14827.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406-2_RL_14828.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_L_14378.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_RL_14752.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/MS0406_RL_14852.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/hello.txt', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/organization.pkl', 'fah-public-data-covid19-absolute-free-energy/free_energy_data/results.pkl'] </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">fs</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">'fah-public-data-covid19-absolute-free-energy/free_energy_data/hello.txt'</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">read</span><span class="p">())</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hello aws! 
</code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">fs</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">"fah-public-data-covid19-absolute-free-energy/free_energy_data/organization.pkl"</span><span class="p">,</span> <span class="s">'rb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span> <span class="n">organization_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_pickle</span><span class="p">(</span><span class="n">f</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">organization_df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>dataset</th> <th>identity</th> <th>receptor</th> <th>score</th> <th>v1_project</th> <th>v1_run</th> <th>v2_project</th> <th>v2_run</th> <th>v3_project</th> <th>v3_run</th> <th>project</th> <th>run</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>72_RL</td> <td>CCNCC(COC)Oc1ccccc1</td> <td>receptor-270-343.pdb</td> <td>0.999790</td> <td>14600</td> <td>0</td> <td>14700</td> <td>0</td> <td>14800</td> <td>0</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>1</th> <td>72_RL</td> <td>O=C(Cc1cccnc1)c1ccccc1</td> <td>receptor-343.pdb</td> <td>0.999652</td> <td>14600</td> <td>1</td> <td>14700</td> <td>1</td> <td>14800</td> <td>1</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>2</th> <td>72_RL</td> <td>CCCCC(N)c1cc(C)ccn1</td> 
<td>receptor-343.pdb</td> <td>0.999256</td> <td>14600</td> <td>2</td> <td>14700</td> <td>2</td> <td>14800</td> <td>2</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>3</th> <td>72_RL</td> <td>COCC(C)Nc1ccncn1</td> <td>receptor-343.pdb</td> <td>0.999096</td> <td>14600</td> <td>3</td> <td>14700</td> <td>3</td> <td>14800</td> <td>3</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>4</th> <td>72_RL</td> <td>CCN(CC)CCNc1ccc(C#N)cn1</td> <td>receptor-270-343.pdb</td> <td>0.998980</td> <td>14600</td> <td>4</td> <td>14700</td> <td>4</td> <td>14800</td> <td>4</td> <td>NaN</td> <td>NaN</td> </tr> </tbody> </table> </div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> </code></pre></div></div> <p>The notebook itself can be found <a href="../files/notebooks/2020-12-29-fahonaws.ipynb">here</a>.</p>Alex H. Yang[email protected]Some F@H data is freely accessible on AWS. This will be a relatively short post on accessing and navigating the data on AWS.Poetry and Docker2020-12-23T00:00:00-06:002020-12-23T00:00:00-06:00https://ahy3nz.github.io/posts/2020/12/poetrypackaging<p>What is <a href="https://python-poetry.org/">poetry</a> and where does this fit in the python software/DS ecosystem? And some beginner forays into docker.</p> <p>To skip the reading and jump to the code, go to <a href="https://github.com/ahy3nz/poetry-demo">this repo</a>.</p> <h1 id="personal-opinions-motivating-this-work">Personal opinions motivating this work</h1> <p>The data science world is large. Data science is kind of like an intersection of statistics/math, subject-matter expertise, software engineering, algorithms, and all the collaboration/teamwork that comes with a job. 
Starting out, you definitely cannot be expected to have mastery over everything, but you should have at least some minimal competencies and a capacity to learn (this basically applies to all jobs).</p> <p>I’m about 11 months into technically being labeled a “data scientist” and I’ve observed that each data scientist ends up cultivating their own sets of skills they find valuable and/or interesting – generalist/specialist is basically skilling up however you want. Seeing the work of other data scientists has built up a long laundry list of things I’d like to learn but don’t have the business hours to devote to because of more-pressing project demands. Among these is this concept of packaging and building “applications”. For a graduate student, your “application” might be a codebase and set of functions that others can reliably and consistently use in their own hacky codes. For a software engineer, your “application” might be deployed onto some cloud server, where the code needs to be self-sufficient and robust, listening for input, processing this input, and pumping out some output without hands on a keyboard. For a data scientist, you may eventually need to think about how an application gets deployed, from consistency of functions and numerical accuracy to considering the entire technical stack involved. Day-to-day, I think consistency of functions and numerical accuracy are generally kept front-of-mind with unit tests, mainly because you’re always thinking of the mathematical model.</p> <p>If you’re a little more software-savvy, you’ll think about your python environment, using conda or something to control your python software dependencies, your software build, and any compilation that has to happen. 
Since I’m on a socially-distanced, self-quarantined holiday, this is a great time to do some learning.</p> <h1 id="poetry">Poetry</h1> <h2 id="dependency-hell-an-introduction">“Dependency” hell, an introduction</h2> <p>Most software depends on other software, and if the dependencies change some core functionality, then your own software may no longer function as intended. To resolve this, you venture through <a href="https://en.wikipedia.org/wiki/Dependency_hell">“dependency hell”</a> to figure out whose code broke your code, and how to fix this.</p> <p>Data scientists like to use python virtual environments to ensure dependencies are compatible and runnable. Some like to use pip and venv, which is fine for installing packages, but only recently has pip attempted to address dependency resolution. Conda is also very popular for managing software packages, compiling software, and resolving software dependencies.</p> <h2 id="what-does-poetry-do">What does poetry do?</h2> <p>A new contestant, <a href="https://python-poetry.org/">poetry</a> finds itself in some python packaging and dependency conversations like “oh I’ve heard of poetry but never really tried it”. Poetry helps manage the python package dependencies for a given software, with a simple CLI to add and update new package dependencies. Poetry generally involves the binary (available on conda and pip), but interacts with your package via two files, the <code class="language-plaintext highlighter-rouge">poetry.lock</code> and <code class="language-plaintext highlighter-rouge">pyproject.toml</code>. If someone gives you those files, you should be able to build your own compatible python environment. In tandem, the two specify the necessary dependencies for your project, with the former pinning dependencies and the latter floating dependencies. 
Poetry also has some convenient functions for compiling source distributions and wheels so you can distribute this code somewhere like PyPI (but it doesn’t look like there’s any mention of conda recipes).</p> <h2 id="what-about-docker">What about docker?</h2> <p>Docker provides a lot of virtualization and environment control so you can put together an entire tech stack just for your application to run on a bare-bones, nothing-installed server somewhere. This comes in the form of a dockerfile, which is like a set of instructions on how to build your container image. For an early career data scientist, that’s probably all you need to know. Software engineers deal with this all the time, and data scientists eventually dip their toes here as a model/project comes to maturity.</p> <p>You can learn a lot about dockerfiles by reading them and writing your own, so take a look at the repo linked at the beginning of this post. In general, a dockerfile resembles a sequence of shell commands. Getting conda to work with docker comes with some sticking points:</p> <ul> <li><code class="language-plaintext highlighter-rouge">conda</code> commands within each layer won’t work unless you run the shell script that comes with conda, so you have to remember to run that script throughout the dockerfile.</li> <li>Note the use of the <code class="language-plaintext highlighter-rouge">entrypoint.sh</code> file, which becomes the final script that is executed when you call <code class="language-plaintext highlighter-rouge">docker run</code>. 
Observe the necessary <code class="language-plaintext highlighter-rouge">chmod</code> to make it executable, and note the <code class="language-plaintext highlighter-rouge">conda.sh</code> command even inside the <code class="language-plaintext highlighter-rouge">entrypoint.sh</code> file if you want the container to run some code within a conda environment.</li> <li>Use <code class="language-plaintext highlighter-rouge">docker run -it poetry /bin/bash</code> if you want to open an interactive shell session to the container, running commands/codes inside the docker container like you would in an SSH session.</li> <li>Technically, since you have absolute control over the image, you might not need the virtual environment for small python packages. As the packages get more complex and package builds become more complicated, it becomes easier to let conda handle the package management rather than try to correctly install everything in a dockerfile.</li> </ul> <p>If you envision running lots of python code or calculations on cloud servers, docker containers and python environments are the sorts of tech that make it happen (and if you and your project are up for it, container-orchestration and workflow tools).</p> <h1 id="bare-bones-example">Bare bones example</h1> <p>I’ve documented my experiences in this <a href="https://github.com/ahy3nz/poetry-demo">sandbox for using docker and poetry</a>. There are a lot of tutorials on the internet, so I won’t bother here. But, for a data scientist versed in python environments, this repo showcases how to build your docker images for conda/poetry/python. For a “real” industrial application, things will likely get messier as the environments and software stack get more complex, but this is a decent start for an amateur.</p>Alex H. Yang[email protected]What is poetry and where does this fit in the python software/DS ecosystem? 
And some beginner forays into docker.Exploring PyTorch + ANI + MD2020-08-15T00:00:00-05:002020-08-15T00:00:00-05:00https://ahy3nz.github.io/posts/2020/08/torchanimd<h1 id="pytorch--ani--md">PyTorch + ANI + MD</h1> <p>PyTorch provides nice utilities for differentiation. ANI provides interatomic potentials built on neural networks. Combining the two for molecular dynamics (MD) might be interesting.</p> <h2 id="some-basic-pytorch-functionality-a-1-d-spring">Some basic pytorch functionality, a 1-D spring</h2> <p>PyTorch replicates a lot of numpy functionality, and we can build python functions that take PyTorch tensors as input.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span> <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> </code></pre></div></div> <p>A simple quadratic function</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">sq_function</span><span class="p">(</span><span class="n">x</span><span class="p">):</span> <span class="k">return</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span> </code></pre></div></div> <p>Since we have an array of 1s, the square won’t look very interesting…</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">foo</span> <span class="o">=</span> <span class="n">sq_function</span><span 
class="p">(</span><span class="n">x</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">foo</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tensor([[1., 1.], [1., 1.]], grad_fn=&lt;PowBackward0&gt;) </code></pre></div></div> <p>More interestingly, we can compute the gradient of this function.</p> <p>To compute the gradient, the value/function needs to be a scalar, but this scalar could be computed from a bunch of other functions stemming from some independent variables (our tensor x). In this case, our final scalar looks like this, $ Y = x_0^2 + x_1^2 + x_2^2 + x_3^2 $. Taking the gradient means taking 4 partial derivatives, one for each input. Fortunately, each partial derivative is simple to compute, $ \frac{\partial Y}{\partial x_i} = 2 x_i $, where $i \in \{0, 1, 2, 3\}$. Since this is an array of 1s, each partial derivative evaluates to 2.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">grad</span><span class="p">(</span><span class="n">foo</span><span class="p">.</span><span class="nb">sum</span><span class="p">(),</span> <span class="n">x</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(tensor([[2., 2.], [2., 2.]]),) </code></pre></div></div> <p>We’ve evaluated the function and its gradient at just one point, but we can use some numpy-esque functions to evaluate the square function and its gradient over a range of points.</p> <p>Yup, looks right to me</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span 
class="n">some_xvals</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mf">12.</span><span class="p">,</span> <span class="mf">12.</span><span class="p">,</span> <span class="n">step</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">some_yvals</span> <span class="o">=</span> <span class="n">sq_function</span><span class="p">(</span><span class="n">some_xvals</span><span class="p">)</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">some_xvals</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">some_yvals</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">())</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">some_xvals</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">grad</span><span class="p">(</span><span class="n">some_yvals</span><span class="p">.</span><span class="nb">sum</span><span class="p">(),</span> <span class="n">some_xvals</span><span 
class="p">)[</span><span class="mi">0</span><span class="p">])</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[&lt;matplotlib.lines.Line2D at 0x7f9c907aa910&gt;] </code></pre></div></div> <p><img src="/images/2020-08-15-torchanimd_files/2020-08-15-torchanimd_11_1.png" alt="png" /></p> <h2 id="slightly-more-book-keeping-3x-1-d-harmonic-springs">Slightly more book-keeping, 3x 1-D harmonic springs</h2> <p>Define an energy function as the sum of 3 harmonic springs</p> <p>$ V(x, y, z) = V_x + V_y + V_z = (x-x_0)^2 + (y-y_0)^2 + (z-z_0)^2 $</p> <p>The gradient, the 3 partial derivatives, are computed as such (being verbose with the chain rule)</p> <p>$ \frac{\partial V}{\partial x} = 2 * (x-x_0) * 1 $</p> <p>$ \frac{\partial V}{\partial y} = 2 * (y-y_0) * 1 $</p> <p>$ \frac{\partial V}{\partial z} = 2 * (z-z_0) * 1 $</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">harmonic_spring_3d</span><span class="p">(</span><span class="n">coord</span><span class="p">,</span> <span class="n">origin</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">])):</span> <span class="n">V_x</span> <span class="o">=</span> <span class="p">(</span><span class="n">coord</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">-</span><span class="n">origin</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span><span class="o">**</span><span class="mi">2</span> <span class="n">V_y</span> <span class="o">=</span> <span class="p">(</span><span class="n">coord</span><span class="p">[</span><span class="mi">1</span><span 
class="p">]</span><span class="o">-</span><span class="n">origin</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span><span class="o">**</span><span class="mi">2</span> <span class="n">V_z</span> <span class="o">=</span> <span class="p">(</span><span class="n">coord</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span><span class="o">-</span><span class="n">origin</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span><span class="o">**</span><span class="mi">2</span> <span class="k">return</span> <span class="n">V_x</span> <span class="o">+</span> <span class="n">V_y</span> <span class="o">+</span> <span class="n">V_z</span> </code></pre></div></div> <p>We can evaluate the potential energy at 1 point, which involves computing the energy in 3 dimensions.</p> <p>Our “anchor” will be the origin, and our endpoint will be (1,2,3)</p> <p>$ 1^2 + 2^2 + 3^2 = 14 $</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">my_coords</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">1.</span><span class="p">,</span><span class="mf">2.</span><span class="p">,</span><span class="mf">3.</span><span class="p">],</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">total_energy</span> <span class="o">=</span> <span class="n">harmonic_spring_3d</span><span class="p">(</span><span class="n">my_coords</span><span class="p">)</span> <span class="n">total_energy</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tensor(14., grad_fn=&lt;AddBackward0&gt;) </code></pre></div></div> <p>Computing the gradient, partial derivatives in each direction, which is simply 2 times the distance 
in each dimension</p> <p>$ \nabla V = \langle 2 \cdot 1, 2 \cdot 2, 2 \cdot 3 \rangle = \langle 2, 4, 6 \rangle $</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">grad</span><span class="p">(</span><span class="n">total_energy</span><span class="p">,</span> <span class="n">my_coords</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(tensor([2., 4., 6.]),) </code></pre></div></div> <h2 id="more-involved-lennard-jones">More involved: Lennard-Jones</h2> <p>The Lennard-Jones potential describes the potential energy between two particles. It is not the most accurate potential, but it has been a serviceable model for a long time now. <a href="http://www.sklogwiki.org/SklogWiki/index.php/Lennard-Jones_model">Some background information on the Lennard-Jones potential</a>. 
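For reference, the usual textbook form of the potential is $ V_{LJ}(r) = 4 \epsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6} \right] $, where $\epsilon$ is the depth of the energy well and $\sigma$ is the separation at which the energy crosses zero. Setting $ \frac{\partial V_{LJ}}{\partial r} = 4 \epsilon \left( -12 \sigma^{12} r^{-13} + 6 \sigma^{6} r^{-7} \right) = 0 $ gives $ r^6 = 2 \sigma^6 $, so the minimum should sit at $ r = 2^{1/6} \sigma \approx 1.12 \sigma $ with $ V_{LJ} = -\epsilon $, which is a handy number to check any plot of the force against. 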
For simplicity, assume $\epsilon =1$ and $\sigma=1$ in unitless quantities:</p> <p>$ V_{LJ} = 4 * \left( \left(\frac{1}{r}\right)^{12} - \left(\frac{1}{r}\right)^{6} \right) $</p> <p>$ -\frac{\partial V_{LJ}}{\partial r} = -4 * (-12 * r^{-13} + 6 * r^{-7}) $</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">lj</span><span class="p">(</span><span class="n">val</span><span class="p">):</span> <span class="k">return</span> <span class="mi">4</span> <span class="o">*</span> <span class="p">((</span><span class="mi">1</span><span class="o">/</span><span class="n">val</span><span class="p">)</span><span class="o">**</span><span class="mi">12</span> <span class="o">-</span> <span class="p">(</span><span class="mi">1</span><span class="o">/</span><span class="n">val</span><span class="p">)</span><span class="o">**</span><span class="mi">6</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">r_values</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">12.</span><span class="p">,</span> <span class="n">step</span><span class="o">=</span><span class="mf">0.001</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">energy</span> <span class="o">=</span> <span class="n">lj</span><span class="p">(</span><span class="n">r_values</span><span class="p">)</span> <span class="n">forces</span> <span class="o">=</span> <span class="o">-</span><span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">grad</span><span class="p">(</span><span class="n">energy</span><span class="p">.</span><span 
class="nb">sum</span><span class="p">(),</span> <span class="n">r_values</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> </code></pre></div></div> <p>For sanity check, we can confirm that energy reaches a critical point (local minimum) when the force is 0.</p> <p>Also, this <em>definitely</em> looks like a LJ potential to me</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">r_values</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">energy</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">label</span><span class="o">=</span><span class="s">'energy'</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">r_values</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">forces</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">label</span><span 
class="o">=</span><span class="s">'force'</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">([</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span> <span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xlim</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">])</span> <span class="n">ax</span><span class="p">.</span><span class="n">axhline</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;matplotlib.lines.Line2D at 0x7f9d2859b2d0&gt; </code></pre></div></div> <p><img src="/images/2020-08-15-torchanimd_files/2020-08-15-torchanimd_23_1.png" alt="png" /></p> <h2 id="moving-to-torchani">Moving to torchani</h2> <p>ANI is an interatomic potential built upon neural networks. Rather than write our own function to evaluate the energy between atoms, maybe we can just use ANI. Since torchani is PyTorch-based, its energies are still available for autodifferentiation to get the forces.</p> <p><a href="https://github.com/aiqm/torchani">https://github.com/aiqm/torchani</a></p> <p>To begin, we have to define our elements (a tensor of atomic numbers). 
For the molecular mechanics people, each atom is identified only by its element, not by one of many atom-types.</p> <p>We have to define the positions (in units of Angstrom), which are also a multi-dimensional tensor.</p> <p>Load the model, passing <code class="language-plaintext highlighter-rouge">periodic_table_index=True</code> so that atomic numbers are converted to indices suitable for ANI.</p> <p>We can compute the energies and forces from the model. The energy comes directly from the model, but the force is obtained via an autograd call: we differentiate the sum of the energies with respect to the positions and negate the result.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torchani</span> <span class="n">elements</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([[</span><span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">]])</span> <span class="n">positions</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([[[</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">],</span> <span class="p">[</span><span class="mf">3.5</span><span class="p">,</span> <span class="mf">3.5</span><span class="p">,</span> <span class="mf">3.5</span><span class="p">]]],</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">model</span> <span class="o">=</span> <span class="n">torchani</span><span class="p">.</span><span class="n">models</span><span class="p">.</span><span class="n">ANI2x</span><span class="p">(</span><span class="n">periodic_table_index</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">energy</span> <span class="o">=</span> <span 
class="n">model</span><span class="p">((</span><span class="n">elements</span><span class="p">,</span> <span class="n">positions</span><span class="p">)).</span><span class="n">energies</span> <span class="n">forces</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1.0</span> <span class="o">*</span> <span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">grad</span><span class="p">(</span><span class="n">energy</span><span class="p">.</span><span class="nb">sum</span><span class="p">(),</span> <span class="n">positions</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/ayang41/miniconda3/envs/torch37/lib/python3.7/site-packages/torchani/aev.py:195: UserWarning: This overload of nonzero is deprecated: nonzero() Consider using one of the following signatures instead: nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.) 
in_cutoff = (distances &lt;= cutoff).nonzero() </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">energy</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tensor([-75.7952], dtype=torch.float64, grad_fn=&lt;AddBackward0&gt;) </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">forces</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tensor([[[-0.4016, -0.4016, -0.4016], [ 0.4016, 0.4016, 0.4016]]]) </code></pre></div></div> <p>Going a step further, we can try to visualize the interaction potential by evaluating the energy at a variety of distances. We can also do some autodifferentiation to compute the forces.</p> <p>In this example, we have 2 atoms that share X and Y coordinates, but pull them apart in the Z direction</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">all_z</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">12.0</span><span class="p">,</span> <span class="n">step</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span> <span class="n">all_energy</span> <span class="o">=</span> <span class="p">[]</span> <span class="n">all_forces</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">for</span> <span class="n">z</span> <span class="ow">in</span> <span class="n">all_z</span><span class="p">:</span> <span class="c1"># Generate a new set of positions </span> <span class="n">positions</span> <span class="o">=</span> <span class="n">torch</span><span 
class="p">.</span><span class="n">tensor</span><span class="p">([[[</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">],</span> <span class="p">[</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="n">z</span><span class="p">]]],</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span> <span class="p">)</span> <span class="c1"># Compute energy </span> <span class="n">energy</span> <span class="o">=</span> <span class="n">model</span><span class="p">((</span><span class="n">elements</span><span class="p">,</span> <span class="n">positions</span><span class="p">)).</span><span class="n">energies</span> <span class="c1"># Compute force </span> <span class="n">forces</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1.0</span> <span class="o">*</span> <span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">grad</span><span class="p">(</span><span class="n">energy</span><span class="p">.</span><span class="nb">sum</span><span class="p">(),</span> <span class="n">positions</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># Get the force vector on the first atom </span> <span class="n">one_atom_forces</span> <span class="o">=</span> <span class="n">forces</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># Compute the magnitude of this force vector </span> <span class="n">force_magnitude</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span 
class="n">one_atom_forces</span><span class="p">,</span> <span class="n">one_atom_forces</span><span class="p">))</span> <span class="c1"># Calculate the unit vector for this force vector, </span> <span class="c1"># although it's a little unnecessary because the only distance is in the </span> <span class="c1"># z direction </span> <span class="n">unit_vector_force</span> <span class="o">=</span> <span class="n">one_atom_forces</span><span class="o">/</span><span class="n">force_magnitude</span> <span class="c1"># Get z-component of force vector </span> <span class="n">force_vector_z</span> <span class="o">=</span> <span class="n">unit_vector_force</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span><span class="o">*</span><span class="n">force_magnitude</span> <span class="c1"># Some nans will form if the force magnitude is zero, but this </span> <span class="c1"># is really just a 0 force vector </span> <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">isnan</span><span class="p">(</span><span class="n">force_vector_z</span><span class="p">).</span><span class="nb">any</span><span class="p">():</span> <span class="n">force_vector_z</span> <span class="o">=</span> <span class="mf">0.0</span> <span class="k">else</span><span class="p">:</span> <span class="n">force_vector_z</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">force_vector_z</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">())</span> <span class="c1"># Accumulate </span> <span class="n">all_energy</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">float</span><span class="p">(</span><span class="n">energy</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">()))</span> <span 
class="n">all_forces</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">force_vector_z</span><span class="p">)</span> </code></pre></div></div> <p>Hmmm… this does not resemble the Lennard-Jones potential (or basic chemistry for that matter)</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">all_z</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="n">all_energy</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"Distance ($\AA$)"</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Energy (Hartree)"</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Text(0, 0.5, 'Energy (Hartree)') </code></pre></div></div> <p><img src="/images/2020-08-15-torchanimd_files/2020-08-15-torchanimd_32_1.png" alt="png" /></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span 
class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">all_z</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="n">all_forces</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"Distance ($\AA$)"</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"Force (Hartree / $\AA$)"</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Text(0, 0.5, 'Force (Hartree / $\\AA$)') </code></pre></div></div> <p><img src="/images/2020-08-15-torchanimd_files/2020-08-15-torchanimd_33_1.png" alt="png" /></p> <h2 id="combinng-torchani-with-some-other-molecular-modeling-libraries">Combining torchani with some other molecular modeling libraries</h2> <p>We’re going to use mbuild to initialize some particles, mdtraj as a convenient library to hold molecular information, and torchani to calculate some energies. 
As with the 2-atom potential example, this pentane example is a little fishy, but this code snippet should hopefully serve as a nice framework to combine some open source molecular modeling libraries.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">mbuild.lib.recipes</span> <span class="kn">import</span> <span class="n">Alkane</span> <span class="c1"># The mBuild alkane recipe is mainly used to generate # some particles and positions </span><span class="n">cmpd</span> <span class="o">=</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="c1"># Convert to mdtraj trajectory out of convenience for atomic numbers </span><span class="n">traj</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_trajectory</span><span class="p">()</span> <span class="c1"># Periodic cell, from nm to angstrom </span><span class="n">cell</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">traj</span><span class="p">.</span><span class="n">unitcell_vectors</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="mi">10</span><span class="p">)</span> <span class="c1"># We just need atomic numbers </span><span class="n">species</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([[</span> <span class="n">a</span><span class="p">.</span><span class="n">element</span><span class="p">.</span><span class="n">atomic_number</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">traj</span><span class="p">.</span><span class="n">top</span><span 
class="p">.</span><span class="n">atoms</span> <span class="p">]])</span> <span class="c1"># Make tensor for coordinates # Since we are differentiating WRT coordinates, we need the # requires_grad=True </span><span class="n">coordinates</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">traj</span><span class="p">.</span><span class="n">xyz</span><span class="o">*</span><span class="mi">10</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="c1"># PBC flag necessary for computing energies with periodic boundaries </span><span class="n">pbc</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="bp">True</span><span class="p">,</span> <span class="bp">True</span><span class="p">,</span> <span class="bp">True</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">bool</span><span class="p">)</span> <span class="n">energies</span> <span class="o">=</span> <span class="n">model</span><span class="p">((</span><span class="n">species</span><span class="p">,</span> <span class="n">coordinates</span><span class="p">),</span> <span class="n">cell</span><span class="o">=</span><span class="n">cell</span><span class="p">,</span> <span class="n">pbc</span><span class="o">=</span><span class="n">pbc</span><span class="p">).</span><span class="n">energies</span> <span class="n">forces</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1.0</span> <span class="o">*</span> <span class="p">(</span> <span class="n">torch</span><span class="p">.</span><span class="n">autograd</span><span class="p">.</span><span class="n">grad</span><span class="p">(</span><span 
class="n">energies</span><span class="p">.</span><span class="nb">sum</span><span class="p">(),</span> <span class="n">coordinates</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">energies</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tensor([-197.1103], dtype=torch.float64, grad_fn=&lt;AddBackward0&gt;) </code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">forces</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tensor([[[ 6.5805e-02, 5.5707e-02, 4.9085e-02], [ 1.3603e-03, -2.1826e-02, -1.0588e-02], [ 1.5610e-02, -5.9448e-02, 1.4180e-03], [-7.4506e-09, 1.1921e-07, 1.1461e-02], [-7.2804e-03, 2.5767e-02, -6.2775e-04], [ 7.2804e-03, -2.5766e-02, -6.2775e-04], [-6.5805e-02, -5.5707e-02, 4.9085e-02], [-1.5610e-02, 5.9448e-02, 1.4180e-03], [-1.3604e-03, 2.1826e-02, -1.0588e-02], [ 6.9919e-02, 1.0938e-01, -4.7381e-02], [ 4.2583e-02, 1.5188e-01, -9.1655e-03], [-3.5887e-02, -5.4712e-03, 4.6396e-02], [ 3.4462e-03, 3.7552e-02, -3.4868e-02], [-6.9919e-02, -1.0938e-01, -4.7381e-02], [-4.2583e-02, -1.5188e-01, -9.1655e-03], [ 3.5887e-02, 5.4712e-03, 4.6396e-02], [-3.4462e-03, -3.7552e-02, -3.4868e-02]]]) </code></pre></div></div> <h2 id="to-be-continued-">To be continued …</h2> <p>One might imagine incorporating ANI potentials into MD simulations (which has been done in ASE). However, the torchani API is general enough that any number of computational chemistry packages could feed into it. The output is also general enough that you could apply your own integrators and build your own simulation. But, judging from the weird 2-atom interatomic potentials, some of these methods might require some debugging.</p> <p>Files and environment can be found <a href="https://github.com/ahy3nz/ahy3nz.github.io/tree/master/files/notebooks">here</a>.</p> <h3 id="reference">Reference</h3> <p>Xiang Gao, Farhad Ramezanghorbani, Olexandr Isayev, Justin S. Smith, and Adrian E. Roitberg. TorchANI: A Free and Open Source PyTorch Based Deep Learning Implementation of the ANI Neural Network Potentials. Journal of Chemical Information and Modeling 2020, 60 (7), 3408-3415.</p>Alex H.
Yang[email protected]PyTorch + ANI + MDDownloading and studying my message behavior2020-08-07T00:00:00-05:002020-08-07T00:00:00-05:00https://ahy3nz.github.io/posts/2020/08/fb_messages<p>Digital privacy is everywhere, and recent laws are pushing companies to disclose whatever personal information they may have on you. In the spirit of science, I’m going to make myself my own study subject and observe what Facebook has stored from my messenger history. Along the way, I’ll do some recursion, a little parallelization, some generators for data processing, and basic visualization to observe my messenger behavior. Notebooks can be found <a href="https://github.com/ahy3nz/ahy3nz.github.io/tree/master/files/notebooks">here</a>, but this one you can’t reproduce because I won’t be providing my messenger data (try this notebook on your own messenger data if you’re curious).</p> <p>No real conclusion to this memo, but it’s interesting to see firsthand that a lot of data gets preserved from your messages – pictures, gifs, videos, audio, files, emotes, participants, timestamps.</p> <p>The message data from Facebook is organized like this:</p> <ul> <li>inbox/ <ul> <li>chat1/ <ul> <li>message1.json</li> <li>message2.json</li> <li>audio/</li> <li>files/</li> <li>gifs/</li> <li>photos/</li> <li>videos/</li> </ul> </li> <li>chat2/ <ul> <li>message1.json</li> </ul> </li> </ul> </li> </ul> <p>We can start with some basic tree-walking to identify which chat group is the largest</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span> <span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span> <span class="kn">import</span> <span class="nn">json</span> <span class="kn">import</span> <span class="nn">multiprocessing</span> <span class="kn">from</span> <span class="nn">multiprocessing</span> <span class="kn">import</span> <span
class="n">Pool</span> <span class="kn">import</span> <span class="nn">dask</span> <span class="kn">from</span> <span class="nn">dask</span> <span class="kn">import</span> <span class="n">delayed</span> <span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span> <span class="kn">import</span> <span class="nn">matplotlib</span> <span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span> <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="k">def</span> <span class="nf">size_of_tree</span><span class="p">(</span><span class="n">p</span><span class="p">):</span> <span class="k">if</span> <span class="s">'json'</span> <span class="ow">in</span> <span class="n">p</span><span class="p">.</span><span class="n">suffix</span><span class="p">:</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">as_posix</span><span class="p">())</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span> <span class="n">message_data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span> <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">message_data</span><span class="p">[</span><span class="s">'messages'</span><span class="p">])</span> <span class="k">elif</span> <span class="n">p</span><span class="p">.</span><span class="n">is_dir</span><span class="p">():</span> <span class="k">return</span> <span class="nb">sum</span><span class="p">([</span><span class="n">size_of_tree</span><span class="p">(</span><span 
class="n">a</span><span class="p">)</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">p</span><span class="p">.</span><span class="n">iterdir</span><span class="p">()])</span> <span class="k">else</span><span class="p">:</span> <span class="k">return</span> <span class="mi">0</span> <span class="k">def</span> <span class="nf">parent_function</span><span class="p">(</span><span class="n">p</span><span class="p">):</span> <span class="k">return</span> <span class="p">{</span><span class="n">p</span><span class="p">:</span> <span class="n">size_of_tree</span><span class="p">(</span><span class="n">p</span><span class="p">)}</span> <span class="k">def</span> <span class="nf">parent_function_chunk</span><span class="p">(</span><span class="n">p</span><span class="p">):</span> <span class="k">return</span> <span class="p">{</span><span class="n">folder</span><span class="p">:</span> <span class="n">size_of_tree</span><span class="p">(</span><span class="n">folder</span><span class="p">)</span> <span class="k">for</span> <span class="n">folder</span> <span class="ow">in</span> <span class="n">p</span><span class="p">}</span> <span class="n">p</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s">'/censored/so/you/cant/find/my/facebook/inbox'</span><span class="p">)</span> <span class="n">all_dirs</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">p</span><span class="p">.</span><span class="n">iterdir</span><span class="p">()</span> <span class="k">if</span> <span class="n">a</span><span class="p">.</span><span class="n">is_dir</span><span class="p">()]</span> </code></pre></div></div> <p>Since this is an embarrassingly parallel situation, we can easily show the serial version is slower than the parallel version (using dask or multiprocessing), with or 
without some chunking</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">sizes</span> <span class="o">=</span> <span class="p">[</span><span class="n">parent_function</span><span class="p">(</span><span class="n">folder</span><span class="p">)</span> <span class="k">for</span> <span class="n">folder</span> <span class="ow">in</span> <span class="n">all_dirs</span><span class="p">]</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 5.3 s, sys: 17.6 s, total: 22.9 s Wall time: 1min 30s </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">all_delayed</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">parent_function</span><span class="p">)(</span><span class="n">folder</span><span class="p">)</span> <span class="k">for</span> <span class="n">folder</span> <span class="ow">in</span> <span class="n">all_dirs</span><span class="p">]</span> <span class="n">results</span> <span class="o">=</span> <span class="n">dask</span><span class="p">.</span><span class="n">compute</span><span class="p">(</span><span class="n">all_delayed</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 9.65 s, sys: 1min 4s, total: 1min 14s Wall time: 30.4 s </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="k">with</span> <span class="n">Pool</span><span class="p">()</span> <span class="k">as</span> <span 
class="n">p</span><span class="p">:</span> <span class="n">pool_results</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">parent_function</span><span class="p">,</span> <span class="n">all_dirs</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 131 ms, sys: 171 ms, total: 302 ms Wall time: 27.5 s </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">all_delayed</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">parent_function_chunk</span><span class="p">)(</span><span class="n">all_dirs</span><span class="p">[</span><span class="n">i</span><span class="p">::</span><span class="mi">6</span><span class="p">])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">6</span><span class="p">)]</span> <span class="n">results</span> <span class="o">=</span> <span class="n">dask</span><span class="p">.</span><span class="n">compute</span><span class="p">(</span><span class="n">all_delayed</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 8.13 s, sys: 59.7 s, total: 1min 7s Wall time: 31.4 s </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="k">with</span> <span class="n">Pool</span><span class="p">()</span> <span class="k">as</span> <span class="n">p</span><span class="p">:</span> <span 
class="n">pool_results</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">parent_function_chunk</span><span class="p">,</span> <span class="p">[</span><span class="n">all_dirs</span><span class="p">[</span><span class="n">i</span><span class="p">::</span><span class="mi">6</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">6</span><span class="p">)])</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 242 ms, sys: 33.7 ms, total: 276 ms Wall time: 28.9 s </code></pre></div></div> <p>For those curious, I have a pretty skewed chat message distribution…</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">message_sizes</span> <span class="o">=</span> <span class="p">[</span><span class="n">size</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">size</span> <span class="ow">in</span> <span class="n">chunk</span><span class="p">.</span><span class="n">values</span><span class="p">()]</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span 
class="mi">8</span><span class="p">,</span><span class="mi">6</span><span class="p">))</span> <span class="n">ax</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">message_sizes</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Number of chats"</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Number of messages within chat"</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Text(0.5, 0, 'Number of messages within chat') </code></pre></div></div> <p><img src="/images/2020-08-07-fb_messages_files/2020-08-07-fb_messages_10_1.png" alt="png" /></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span><span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">6</span><span class="p">))</span> <span class="n">ax</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">message_sizes</span><span class="p">))</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Number of chats"</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span 
class="n">set_xlabel</span><span class="p">(</span><span class="s">"Log number of messages within chat"</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Text(0.5, 0, 'Log number of messages within chat') </code></pre></div></div> <p><img src="/images/2020-08-07-fb_messages_files/2020-08-07-fb_messages_11_1.png" alt="png" /></p> <p>We can make a small data pipeline for my message history by chaining two generators, one after the other. The first, <code class="language-plaintext highlighter-rouge">get_json_files_iter</code>, is simple: it burrows its way through each directory, grabs all the json files, and yields them one at a time. The second, <code class="language-plaintext highlighter-rouge">process_json_iter</code>, takes each file from the <code class="language-plaintext highlighter-rouge">get_json_files_iter</code> generator and actually processes some information.
In this case, it extracts the sender, timestamp, and length of each message.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Iterator</span><span class="p">,</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">Any</span><span class="p">,</span> <span class="n">List</span> <span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span> <span class="kn">import</span> <span class="nn">json</span> <span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span> <span class="k">def</span> <span class="nf">get_json_files_iter</span><span class="p">(</span><span class="n">dirs</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">Path</span><span class="p">]:</span> <span class="s">""" For each dir, get the json files """</span> <span class="n">root</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s">'.'</span><span class="p">)</span> <span class="k">for</span> <span class="n">directory</span> <span class="ow">in</span> <span class="n">dirs</span><span class="p">:</span> <span class="n">subdir</span> <span class="o">=</span> <span class="n">root</span> <span class="o">/</span> <span class="n">Path</span><span class="p">(</span><span class="n">directory</span><span class="p">)</span> <span class="k">for</span> <span class="n">jsonfile</span> <span class="ow">in</span> <span class="n">subdir</span><span class="p">.</span><span class="n">glob</span><span class="p">(</span><span class="s">'*.json'</span><span class="p">):</span> <span class="k">yield</span> <span class="n">Path</span><span class="p">(</span><span class="n">jsonfile</span><span class="p">)</span> <span class="k">def</span> <span
class="nf">process_json_iter</span><span class="p">(</span><span class="n">json_iter</span><span class="p">:</span> <span class="n">Iterator</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="n">Iterator</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="n">Dict</span><span class="p">[</span><span class="n">Any</span><span class="p">,</span> <span class="n">Any</span><span class="p">]]]:</span> <span class="s">""" Given a json file, parse and summarize the message info"""</span> <span class="k">for</span> <span class="n">jsonfile</span> <span class="ow">in</span> <span class="n">json_iter</span><span class="p">:</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">jsonfile</span><span class="p">.</span><span class="n">as_posix</span><span class="p">())</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span> <span class="n">message_data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span> <span class="k">for</span> <span class="n">message</span> <span class="ow">in</span> <span class="n">message_data</span><span class="p">[</span><span class="s">'messages'</span><span class="p">]:</span> <span class="k">yield</span> <span class="p">{</span> <span class="s">'sender'</span><span class="p">:</span> <span class="n">message</span><span class="p">[</span><span class="s">'sender_name'</span><span class="p">],</span> <span class="s">'timestamp'</span><span class="p">:</span> <span class="n">datetime</span><span class="p">.</span><span class="n">fromtimestamp</span><span class="p">(</span><span class="n">message</span><span class="p">[</span><span class="s">'timestamp_ms'</span><span class="p">]</span><span class="o">/</span><span class="mi">1000</span><span 
class="p">),</span> <span class="s">'n_words'</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">message</span><span class="p">[</span><span class="s">'content'</span><span class="p">])</span> <span class="k">if</span> <span class="n">message</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'content'</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span> <span class="k">else</span> <span class="bp">None</span> <span class="c1"># Some messages have no text </span> <span class="c1"># like an image/emoji post </span> <span class="p">}</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">process_json_iter</span><span class="p">(</span><span class="n">get_json_files_iter</span><span class="p">(</span><span class="n">all_dirs</span><span class="p">))</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;generator object process_json_iter at 0x7f26228366d0&gt; </code></pre></div></div> <p>Getting through all the files (7 gb) isn’t too bad</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">extracted_messages</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">process_json_iter</span><span class="p">(</span><span class="n">get_json_files_iter</span><span class="p">(</span><span class="n">all_dirs</span><span class="p">))]</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 6.01 s, sys: 0 ns, total: 6.01 s Wall time: 10.8 s </code></pre></div></div> <div class="language-python highlighter-rouge"><div 
class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">extracted_messages</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 909 ms, sys: 0 ns, total: 909 ms Wall time: 900 ms </code></pre></div></div> <p>Conveniently, we can pass the generator itself to create a dataframe. This doesn’t provide much speedup, but it helps keep the code concise</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">process_json_iter</span><span class="p">(</span><span class="n">get_json_files_iter</span><span class="p">(</span><span class="n">all_dirs</span><span class="p">)))</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 7.42 s, sys: 0 ns, total: 7.42 s Wall time: 13.7 s </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">columns</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['sender', 'timestamp', 'n_words'], dtype='object') </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">shape</span> 
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(1003527, 3) </code></pre></div></div> <p>We can look at how my chat history has changed over the years…</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Zero-padded YYYY-MM-DD strings sort chronologically, so the groupby below stays in date order
</span><span class="n">df</span><span class="p">[</span><span class="s">'date'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'timestamp'</span><span class="p">].</span><span class="n">dt</span><span class="p">.</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%Y-%m-%d'</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">grouped_by_date</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'date'</span><span class="p">).</span><span class="n">agg</span><span class="p">(</span><span class="s">'count'</span><span class="p">)</span> </code></pre></div></div> <div class="language-python
highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">18</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">grouped_by_date</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">grouped_by_date</span><span class="p">[</span><span class="s">'sender'</span><span class="p">])</span> <span class="n">ticks</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">grouped_by_date</span><span class="p">.</span><span class="n">index</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">int</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">ticks</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xticklabels</span><span class="p">([</span><span class="nb">list</span><span class="p">(</span><span class="n">grouped_by_date</span><span class="p">.</span><span 
class="n">index</span><span class="p">)[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ticks</span><span class="p">],</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">90</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s">'right'</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Number of messages"</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Text(0, 0.5, 'Number of messages') </code></pre></div></div> <p><img src="/images/2020-08-07-fb_messages_files/2020-08-07-fb_messages_25_1.png" alt="png" /></p> <p>We can try to smooth things out with a rolling average.
The timestamps aren’t evenly distributed so the averages could be computed better, but they work well enough for now</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rolling</span> <span class="o">=</span> <span class="n">grouped_by_date</span><span class="p">.</span><span class="n">rolling</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">min_periods</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">18</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">rolling</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">rolling</span><span class="p">[</span><span class="s">'sender'</span><span class="p">])</span> <span class="n">ticks</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">rolling</span><span class="p">.</span><span class="n">index</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">50</span><span 
class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">int</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">ticks</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xticklabels</span><span class="p">([</span><span class="nb">list</span><span class="p">(</span><span class="n">rolling</span><span class="p">.</span><span class="n">index</span><span class="p">)[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ticks</span><span class="p">],</span> <span class="n">rotation</span><span class="o">=</span><span class="s">'90'</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s">'right'</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Number of messages"</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Text(0, 0.5, 'Number of messages') </code></pre></div></div> <p><img src="/images/2020-08-07-fb_messages_files/2020-08-07-fb_messages_27_1.png" alt="png" /></p>Alex H. Yang[email protected]Digital privacy is everywhere, and recent laws are pushing companies to disclose whatever personal information they may have on you. In the spirit of science, I’m going to make myself my own study subject and observe what Facebook has stored from my messenger history. Along the way, I’ll do some recursion, a little parallelization, some generators for data processing, and basic visualization to observe my messenger behavior. 
Notebooks can be found here, but this one you can’t reproduce because I won’t be providing my messenger data (try this notebook on your own messenger data if you’re curious).Lessons learned from accelerating foyer with dask2020-06-20T00:00:00-05:002020-06-20T00:00:00-05:00https://ahy3nz.github.io/posts/2020/06/foyer-dask<h1 id="combining-foyer--dask">Combining Foyer + Dask</h1> <p>Another foray into combining modern molecular modeling tools with modern data science libraries…</p> <h2 id="foyer-uses-graph-algorithms-to-parametrize-your-molecular-model">Foyer uses graph algorithms to parametrize your molecular model</h2> <p>Given a system of molecules and atoms, how do we parametrize each atom according to our molecular model, our force field? The parameters for each atom depend on its bonded neighbors. Framing this as a graph problem (vertices are atoms and edges are bonds), subgraph isomorphisms are used to match each atom’s bonding pattern to the template bonding patterns specified by our force field’s atom types.</p> <h2 id="dask-helps-distribute-parallel-workloads">Dask helps distribute parallel workloads</h2> <p>Generally, most of these molecular modeling packages operate on a shared memory data structure - a list, a dictionary. To parallelize this atom-typing operation, we need to identify <em>how</em> we can parallelize this. For graph problems, sometimes each node (atom) needs to know about every other node. We are left with a couple of options:</p> <ul> <li>Broadcast the entire molecular graph to all workers, and divvy up which atoms each worker is responsible for atom-typing. This risks some large overhead because the entire molecular graph can span tens of thousands (or more) nodes.</li> <li>Broadcast only <em>the relevant molecular graph</em> to each worker, so each worker becomes responsible for parametrizing that small subgraph. This one doesn’t involve broadcasting large graphs, but now the problem becomes identifying what the relevant graph is. 
I refer readers to the concept of a <a href="https://en.wikipedia.org/wiki/Component_(graph_theory)">graph component</a>.</li> </ul> <h2 id="what-to-expect-in-this-notebook">What to expect in this notebook</h2> <p>First, I’ll be breaking up the entire chemical system into smaller subgraphs. I’ll try to atom-type each subgraph serially. Then, I’ll try to distribute the workload of each subgraph using dask. I’ll try to do some timings - against different numbers of homogeneous molecules and different numbers of heterogeneous molecules. Along the way, I’ll be observing some friction points for using dask (casual user here) and for using foyer/parmed.</p> <h2 id="parallelizations-value-is-hard-to-demonstrate-in-this-use-case">Parallelization’s value is hard to demonstrate in this use case</h2> <p>Dask did not show improvements compared to canonical foyer. With the data structures we, and foyer, usually deal with, there’s some extra work in formatting them into easily-distributable data structures for parallelization. There is always communication overhead in parallel workloads. Foyer has molecule caching that accelerates atom-typing for molecules you’ve already atom-typed; this isn’t leveraged well in a distributed scenario. Foyer uses networkx, which likely already comes with its own optimizations for simplifying the workload, so evaluating a single large graph may not be as bad as we think compared to lots of small graphs. As written, the foyer code may be best utilized serially. 
Future foyer implementations and refactors might better expose elements of parallelization.</p> <h2 id="distributing-the-workload-split-a-chemical-system-into-smaller-components-parametrize-each-molecule-in-serial">Distributing the workload: split a chemical system into smaller components, parametrize each molecule, in serial</h2> <p>Use mbuild to create our molecule, replicate it to 10 molecules, and use foyer to apply the OPLS-AA force field.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">mbuild</span> <span class="k">as</span> <span class="n">mb</span> <span class="kn">from</span> <span class="nn">mbuild.lib.recipes</span> <span class="kn">import</span> <span class="n">Alkane</span> <span class="kn">import</span> <span class="nn">foyer</span> <span class="kn">import</span> <span class="nn">parmed</span> <span class="k">as</span> <span class="n">pmd</span> <span class="kn">import</span> <span class="nn">networkx</span> <span class="k">as</span> <span class="n">nx</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>_ColormakerRegistry() </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/ayang41/programs/foyer/foyer/forcefield.py:449: UserWarning: No force field version number found in force field XML file. 'No force field version number found in force field XML file.' 
/home/ayang41/programs/foyer/foyer/forcefield.py:461: UserWarning: No force field name found in force field XML file. 'No force field name found in force field XML file.' /home/ayang41/programs/foyer/foyer/validator.py:132: ValidationWarning: You have empty smart definition(s) warn("You have empty smart definition(s)", ValidationWarning) </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">single</span> <span class="o">=</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">single</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">])</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/ayang41/programs/mbuild/mbuild/compound.py:2139: UserWarning: No simulation box detected for mdtraj.Trajectory &lt;mdtraj.Trajectory with 1 frames, 3 atoms, 1 residues, without unitcells&gt; "mdtraj.Trajectory {}".format(traj) /home/ayang41/programs/mbuild/mbuild/compound.py:2139: UserWarning: No simulation box detected for mdtraj.Trajectory &lt;mdtraj.Trajectory with 1 frames, 4 atoms, 1 residues, without unitcells&gt; "mdtraj.Trajectory {}".format(traj) /home/ayang41/programs/mbuild/mbuild/compound.py:2527: UserWarning: No box specified and no Compound.box detected. Using Compound.boundingbox + 0.5 nm buffer. 
Setting all box angles to 90 degrees. "No box specified and no Compound.box detected. " </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">view</span> <span class="o">=</span> <span class="n">single</span><span class="p">.</span><span class="n">visualize</span><span class="p">(</span><span class="n">backend</span><span class="o">=</span><span class="s">'nglview'</span><span class="p">)</span> <span class="n">view</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NGLWidget() </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> </code></pre></div></div> <p>Box of pentanes as a parmed structure</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">nglview</span> <span class="n">nglview</span><span class="p">.</span><span class="n">show_parmed</span><span class="p">(</span><span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NGLWidget() </code></pre></div></div> <p>Creating the molecule graph for all molecules in our system</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">graph</span> <span class="o">=</span> <span class="n">nx</span><span class="p">.</span><span class="n">Graph</span><span class="p">()</span> <span class="n">graph</span><span class="p">.</span><span class="n">add_nodes_from</span><span class="p">([</span><span class="n">a</span><span 
class="p">.</span><span class="n">idx</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">structure</span><span class="p">.</span><span class="n">atoms</span><span class="p">])</span> <span class="n">graph</span><span class="p">.</span><span class="n">add_edges_from</span><span class="p">([(</span><span class="n">b</span><span class="p">.</span><span class="n">atom1</span><span class="p">.</span><span class="n">idx</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="n">atom2</span><span class="p">.</span><span class="n">idx</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">structure</span><span class="p">.</span><span class="n">bonds</span><span class="p">])</span> </code></pre></div></div> <p>Here we can see that the graph has several connected components, one for each molecule</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib</span> <span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span> <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">8</span><span class="p">),</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">nx</span><span class="p">.</span><span 
class="n">draw_networkx</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span> <span class="n">node_size</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">with_labels</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span> </code></pre></div></div> <p><img src="/images/2020-06-21_foyer-dask_files/2020-06-21_foyer-dask_12_0.png" alt="png" /></p> <p>Fortunately, <a href="https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.components.connected_components.html#networkx.algorithms.components.connected_components">networkx API has a connected components implementation</a>. We have a list of sets of atom indices, where each set of atom indices refers to a connected component</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">individual_molecule_graphs</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">nx</span><span class="p">.</span><span class="n">connected_components</span><span class="p">(</span><span class="n">graph</span><span class="p">)]</span> <span class="n">individual_molecule_graphs</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">3</span><span class="p">]</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}, {17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33}, {34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50}] </code></pre></div></div> <p>For each individual molecule graph, we can create a parmed structure. 
Our entire box of pentanes was one parmed structure, but now we’re interested in creating N different parmed structures, one for each molecule. You could imagine creating another kind of object, like an mbuild compound or openmm topology, but to fit the foyer workflow, we operate on parmed structures.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">all_substructures</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">for</span> <span class="n">molecule_graph</span> <span class="ow">in</span> <span class="n">individual_molecule_graphs</span><span class="p">:</span> <span class="n">individual_structure</span> <span class="o">=</span> <span class="n">pmd</span><span class="p">.</span><span class="n">Structure</span><span class="p">()</span> <span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">molecule_graph</span><span class="p">:</span> <span class="n">individual_structure</span><span class="p">.</span><span class="n">add_atom</span><span class="p">(</span><span class="n">structure</span><span class="p">.</span><span class="n">atoms</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span> <span class="n">structure</span><span class="p">.</span><span class="n">atoms</span><span class="p">[</span><span class="n">idx</span><span class="p">].</span><span class="n">residue</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">structure</span><span class="p">.</span><span class="n">atoms</span><span class="p">[</span><span class="n">idx</span><span class="p">].</span><span class="n">residue</span><span class="p">.</span><span class="n">number</span><span class="p">)</span> <span class="k">for</span> <span class="n">neighbor_idx</span> <span class="ow">in</span> <span class="n">graph</span><span class="p">[</span><span class="n">idx</span><span class="p">]:</span> <span 
class="k">if</span> <span class="n">idx</span> <span class="o">&lt;</span> <span class="n">neighbor_idx</span><span class="p">:</span> <span class="n">individual_structure</span><span class="p">.</span><span class="n">bonds</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">pmd</span><span class="p">.</span><span class="n">Bond</span><span class="p">(</span><span class="n">structure</span><span class="p">.</span><span class="n">atoms</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span> <span class="n">structure</span><span class="p">.</span><span class="n">atoms</span><span class="p">[</span><span class="n">neighbor_idx</span><span class="p">]))</span> <span class="n">all_substructures</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">individual_structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">all_substructures</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[&lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;] </code></pre></div></div> <p>Simple iteration through each molecular substructure, apply the force 
field to each</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">parametrized_substructures</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">for</span> <span class="n">substructure</span> <span class="ow">in</span> <span class="n">all_substructures</span><span class="p">:</span> <span class="n">output_struc</span> <span class="o">=</span> <span class="n">ff</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">substructure</span><span class="p">)</span> <span class="n">parametrized_substructures</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">output_struc</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 20, Parameterized impropers: 0. 
Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">parametrized_substructures</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[&lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;] </code></pre></div></div> <p>Because parmed structures overload the addition operator, we can combine structures via addition</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">parametrized_substructures</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">parametrized_substructures</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;Structure 34 atoms; 2 residues; 32 bonds; parametrized&gt; </code></pre></div></div> <p>Using functools, we can quickly and conveniently combine all N parametrized structures into 1 structure</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="nb">reduce</span> <span class="n">parametrized_structure</span> <span class="o">=</span> <span class="nb">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">:</span> <span class="n">x</span><span class="o">+</span><span class="n">y</span><span class="p">,</span> <span class="n">parametrized_substructures</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">parametrized_structure</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;Structure 170 atoms; 10 residues; 160 bonds; parametrized&gt; </code></pre></div></div> <p>Rather than parametrize one, big parmed structure, we are parametrizing a bunch of small parmed structures, in serial. 
We’re not distributing the workload, but we are simplifying the workload – rather than match subgraphs among large, complex graphs of hundreds of nodes and edges, we are matching subgraphs among smaller, simpler graphs</p> <h2 id="split-a-chemical-system-into-smaller-components-parametrize-each-molecule-in-parallel">Split a chemical system into smaller components, parametrize each molecule, in parallel</h2> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">dask</span> <span class="kn">from</span> <span class="nn">dask</span> <span class="kn">import</span> <span class="n">delayed</span><span class="p">,</span> <span class="n">bag</span> <span class="k">as</span> <span class="n">db</span> </code></pre></div></div> <p>Streamline our code into functions that are mostly-compatible with dask.</p> <ul> <li>The use of tuples over lists because tuples are hashable (important for dask)</li> <li>Extra functions to map atomic indices to parmed Atoms. 
If we’re going to create different parmed structures, we need to track parmed atoms</li> </ul> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Union</span><span class="p">,</span> <span class="n">Set</span><span class="p">,</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">Tuple</span> <span class="k">def</span> <span class="nf">structure_to_graph</span><span class="p">(</span><span class="n">structure</span><span class="p">:</span> <span class="n">pmd</span><span class="p">.</span><span class="n">Structure</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">nx</span><span class="p">.</span><span class="n">Graph</span><span class="p">:</span> <span class="n">graph</span> <span class="o">=</span> <span class="n">nx</span><span class="p">.</span><span class="n">Graph</span><span class="p">()</span> <span class="n">graph</span><span class="p">.</span><span class="n">add_nodes_from</span><span class="p">([</span><span class="n">a</span><span class="p">.</span><span class="n">idx</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">structure</span><span class="p">.</span><span class="n">atoms</span><span class="p">])</span> <span class="n">graph</span><span class="p">.</span><span class="n">add_edges_from</span><span class="p">([(</span><span class="n">b</span><span class="p">.</span><span class="n">atom1</span><span class="p">.</span><span class="n">idx</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="n">atom2</span><span class="p">.</span><span class="n">idx</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">structure</span><span 
class="p">.</span><span class="n">bonds</span><span class="p">])</span> <span class="k">return</span> <span class="n">graph</span> <span class="k">def</span> <span class="nf">separate_molecule_graphs</span><span class="p">(</span><span class="n">structure</span><span class="p">:</span> <span class="n">pmd</span><span class="p">.</span><span class="n">Structure</span><span class="p">,</span> <span class="n">graph</span><span class="p">:</span> <span class="n">nx</span><span class="p">.</span><span class="n">Graph</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">Tuple</span><span class="p">[</span><span class="nb">int</span><span class="p">,...]]:</span> <span class="s">""" Use connected components to identify individual molecules"""</span> <span class="n">individual_molecule_graphs</span> <span class="o">=</span> <span class="p">(</span><span class="nb">tuple</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">nx</span><span class="p">.</span><span class="n">connected_components</span><span class="p">(</span><span class="n">graph</span><span class="p">))</span> <span class="k">return</span> <span class="n">individual_molecule_graphs</span> <span class="k">def</span> <span class="nf">subselect_atoms</span><span class="p">(</span><span class="n">structure</span><span class="p">:</span> <span class="n">pmd</span><span class="p">.</span><span class="n">Structure</span><span class="p">,</span> <span class="n">indices</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">int</span><span class="p">])</span><span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">int</span><span class="p">,</span> <span class="n">pmd</span><span class="p">.</span><span class="n">Atom</span><span 
class="p">]:</span> <span class="s">""" Create a mapping of index to atom """</span> <span class="k">return</span> <span class="p">{</span><span class="n">idx</span><span class="p">:</span> <span class="n">structure</span><span class="p">.</span><span class="n">atoms</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">indices</span><span class="p">}</span> <span class="k">def</span> <span class="nf">make_structure_from_graph</span><span class="p">(</span><span class="n">molecule_vertices</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">int</span><span class="p">],</span> <span class="n">relevant_atoms</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">int</span><span class="p">,</span> <span class="n">pmd</span><span class="p">.</span><span class="n">Atom</span><span class="p">],</span> <span class="n">molecule_graph</span><span class="p">:</span> <span class="n">nx</span><span class="p">.</span><span class="n">Graph</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pmd</span><span class="p">.</span><span class="n">Structure</span><span class="p">:</span> <span class="s">""" From networkx graph and individual parmed atoms, make parmed structure"""</span> <span class="n">individual_structure</span> <span class="o">=</span> <span class="n">pmd</span><span class="p">.</span><span class="n">Structure</span><span class="p">()</span> <span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">molecule_vertices</span><span class="p">:</span> <span class="n">individual_structure</span><span class="p">.</span><span class="n">add_atom</span><span class="p">(</span><span class="n">relevant_atoms</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span> <span 
class="n">relevant_atoms</span><span class="p">[</span><span class="n">idx</span><span class="p">].</span><span class="n">residue</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">relevant_atoms</span><span class="p">[</span><span class="n">idx</span><span class="p">].</span><span class="n">residue</span><span class="p">.</span><span class="n">number</span><span class="p">)</span> <span class="k">for</span> <span class="n">neighbor_idx</span> <span class="ow">in</span> <span class="n">molecule_graph</span><span class="p">[</span><span class="n">idx</span><span class="p">]:</span> <span class="k">if</span> <span class="n">idx</span> <span class="o">&lt;</span> <span class="n">neighbor_idx</span><span class="p">:</span> <span class="n">individual_structure</span><span class="p">.</span><span class="n">bonds</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">pmd</span><span class="p">.</span><span class="n">Bond</span><span class="p">(</span><span class="n">relevant_atoms</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span> <span class="n">relevant_atoms</span><span class="p">[</span><span class="n">neighbor_idx</span><span class="p">]))</span> <span class="k">return</span> <span class="n">individual_structure</span> <span class="k">def</span> <span class="nf">parametrize</span><span class="p">(</span><span class="n">ff</span><span class="p">:</span> <span class="n">foyer</span><span class="p">.</span><span class="n">Forcefield</span><span class="p">,</span> <span class="n">structure</span><span class="p">:</span> <span class="n">pmd</span><span class="p">.</span><span class="n">Structure</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pmd</span><span class="p">.</span><span class="n">Structure</span><span class="p">:</span> <span 
class="k">return</span> <span class="n">ff</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">structure</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> </code></pre></div></div> <p>Exercising our functions in serial</p> <p>We’ll get to some timings later…</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">single</span> <span class="o">=</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">single</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">5</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">5</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">big_graph</span> <span class="o">=</span> <span class="n">structure_to_graph</span><span class="p">(</span><span class="n">structure</span><span class="p">)</span> <span class="n">individual_molecule_graphs</span> <span class="o">=</span> <span class="n">separate_molecule_graphs</span><span class="p">(</span><span class="n">structure</span><span class="p">,</span> <span class="n">big_graph</span><span class="p">)</span> <span class="n">individual_structures</span> <span class="o">=</span> <span class="p">[</span><span 
class="n">make_structure_from_graph</span><span class="p">(</span><span class="n">molecule_graph</span><span class="p">,</span> <span class="n">subselect_atoms</span><span class="p">(</span><span class="n">structure</span><span class="p">,</span> <span class="n">molecule_graph</span><span class="p">),</span> <span class="n">big_graph</span><span class="p">)</span> <span class="k">for</span> <span class="n">molecule_graph</span> <span class="ow">in</span> <span class="n">individual_molecule_graphs</span><span class="p">]</span> <span class="n">parametrized_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">parametrize</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">struc</span><span class="p">)</span> <span class="k">for</span> <span class="n">struc</span> <span class="ow">in</span> <span class="n">individual_structures</span><span class="p">]</span> <span class="n">parametrized_structure</span> <span class="o">=</span> <span class="nb">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">:</span> <span class="n">x</span><span class="o">+</span><span class="n">y</span><span class="p">,</span> <span class="n">parametrized_structures</span><span class="p">)</span> <span class="n">parametrized_structure</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/ayang41/programs/mbuild/mbuild/compound.py:2139: UserWarning: No simulation box detected for mdtraj.Trajectory &lt;mdtraj.Trajectory with 1 frames, 3 atoms, 1 residues, without unitcells&gt; "mdtraj.Trajectory {}".format(traj) /home/ayang41/programs/mbuild/mbuild/compound.py:2139: UserWarning: No simulation box detected for mdtraj.Trajectory &lt;mdtraj.Trajectory with 1 frames, 4 atoms, 1 residues, without unitcells&gt; "mdtraj.Trajectory 
{}".format(traj) /home/ayang41/programs/mbuild/mbuild/compound.py:2527: UserWarning: No box specified and no Compound.box detected. Using Compound.boundingbox + 0.5 nm buffer. Setting all box angles to 90 degrees. "No box specified and no Compound.box detected. " CPU times: user 2.69 s, sys: 56.1 ms, total: 2.75 s Wall time: 2.7 s &lt;Structure 170 atoms; 10 residues; 160 bonds; parametrized&gt; </code></pre></div></div> <p>Here’s a first attempt at daskifying everything with delayed objects. Once we’ve created our entire system graph, we can start creating dask objects, starting with each molecule graph, and chaining the following operations:</p> <ul> <li>From each molecule graph, grab the relevant parmed Atoms</li> <li>From the molecule graph and parmed Atoms, create the (unparametrized) parmed Structure</li> </ul> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">single</span> <span class="o">=</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">single</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">5</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">5</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">big_graph</span> <span class="o">=</span> <span 
class="n">structure_to_graph</span><span class="p">(</span><span class="n">structure</span><span class="p">)</span> <span class="n">individual_molecule_graphs</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">separate_molecule_graphs</span><span class="p">(</span><span class="n">structure</span><span class="p">,</span> <span class="n">big_graph</span><span class="p">)]</span> <span class="n">all_subselected_atoms</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">subselect_atoms</span><span class="p">)(</span><span class="n">structure</span><span class="p">,</span> <span class="n">molecule_graph</span><span class="p">)</span> <span class="k">for</span> <span class="n">molecule_graph</span> <span class="ow">in</span> <span class="n">individual_molecule_graphs</span><span class="p">]</span> <span class="n">raw_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">make_structure_from_graph</span><span class="p">)(</span><span class="n">molecule_graph</span><span class="p">,</span> <span class="n">subselected_atoms</span><span class="p">,</span> <span class="n">big_graph</span><span class="p">)</span> <span class="k">for</span> <span class="n">molecule_graph</span><span class="p">,</span> <span class="n">subselected_atoms</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">individual_molecule_graphs</span><span class="p">,</span> <span class="n">all_subselected_atoms</span><span class="p">)]</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 149 ms, sys: 23.2 ms, total: 173 ms Wall time: 60.4 ms </code></pre></div></div> <p>Pulse check: can we flush the task graph and actually get our per-molecule structures back (not yet parametrized)?</p> <div 
class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="p">[</span><span class="n">a</span><span class="p">.</span><span class="n">compute</span><span class="p">()</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">raw_structures</span><span class="p">]</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 18.8 ms, sys: 889 µs, total: 19.7 ms Wall time: 11.8 ms [&lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; NOT parametrized&gt;] </code></pre></div></div> <p>Next step, parametrization</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">param_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">parametrize</span><span class="p">)(</span><span class="n">ff</span><span class="p">,</span> <span class="n">struc</span><span class="p">)</span> <span class="k">for</span> <span class="n">struc</span> <span class="ow">in</span> <span class="n">raw_structures</span><span class="p">]</span> </code></pre></div></div> <div 
class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 5.69 ms, sys: 594 µs, total: 6.28 ms Wall time: 2.81 ms </code></pre></div></div> <p>(Another) pulse check, does the FF application work?</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">all_parametrized</span> <span class="o">=</span> <span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">compute</span><span class="p">()</span> <span class="k">for</span> <span class="n">op</span> <span class="ow">in</span> <span class="n">param_structures</span><span class="p">]</span> <span class="n">all_parametrized</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 2.69 s, sys: 15.9 ms, total: 2.71 s Wall time: 2.7 s [&lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;, &lt;Structure 17 atoms; 1 residues; 16 bonds; parametrized&gt;] </code></pre></div></div> <p>Final step, putting the structures back together</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="nb">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span><span 
class="n">y</span><span class="p">:</span> <span class="n">x</span><span class="o">+</span><span class="n">y</span><span class="p">,</span> <span class="n">all_parametrized</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 52.7 ms, sys: 23 µs, total: 52.7 ms Wall time: 50.8 ms &lt;Structure 170 atoms; 10 residues; 160 bonds; parametrized&gt; </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">single</span> <span class="o">=</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">single</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">5</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">5</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">big_graph</span> <span class="o">=</span> <span class="n">structure_to_graph</span><span class="p">(</span><span class="n">structure</span><span class="p">)</span> <span class="n">individual_molecule_graphs</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">separate_molecule_graphs</span><span class="p">(</span><span class="n">structure</span><span class="p">,</span> <span 
class="n">big_graph</span><span class="p">)]</span> <span class="n">all_subselected_atoms</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">subselect_atoms</span><span class="p">)(</span><span class="n">structure</span><span class="p">,</span> <span class="n">molecule_graph</span><span class="p">)</span> <span class="k">for</span> <span class="n">molecule_graph</span> <span class="ow">in</span> <span class="n">individual_molecule_graphs</span><span class="p">]</span> <span class="n">raw_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">make_structure_from_graph</span><span class="p">)(</span><span class="n">molecule_graph</span><span class="p">,</span> <span class="n">subselected_atoms</span><span class="p">,</span> <span class="n">big_graph</span><span class="p">)</span> <span class="k">for</span> <span class="n">molecule_graph</span><span class="p">,</span> <span class="n">subselected_atoms</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">individual_molecule_graphs</span><span class="p">,</span> <span class="n">all_subselected_atoms</span><span class="p">)]</span> <span class="n">param_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">parametrize</span><span class="p">)(</span><span class="n">ff</span><span class="p">,</span> <span class="n">struc</span><span class="p">)</span> <span class="k">for</span> <span class="n">struc</span> <span class="ow">in</span> <span class="n">raw_structures</span><span class="p">]</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 65.5 ms, sys: 41.5 ms, total: 107 ms Wall time: 51.4 ms </code></pre></div></div> <p>Last step is to 
combine all the parametrized structures; we can try some dask fold/reduce operations</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">param_structures_bag</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="n">from_sequence</span><span class="p">(</span><span class="n">param_structures</span><span class="p">)</span> <span class="n">param_structures_bag</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dask.bag&lt;from_sequence, npartitions=10&gt; </code></pre></div></div> <p>Unfortunately, some of these parmed AtomType objects are not hashable, so we cannot use dask to efficiently reduce parmed structures</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">operator</span> <span class="kn">import</span> <span class="n">add</span> <span class="n">param_structures_bag</span><span class="p">.</span><span class="n">fold</span><span class="p">(</span><span class="n">add</span><span class="p">).</span><span class="n">compute</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--------------------------------------------------------------------------- TypeError Traceback (most recent call last) &lt;ipython-input-37-2285eb100c4e&gt; in &lt;module&gt; 1 from operator import add 2 ----&gt; 3 param_structures_bag.fold(add).compute() ... 
~/miniconda3/envs/md37/lib/python3.7/site-packages/cloudpickle/cloudpickle.py in save_global(self, obj, name, pack) 828 elif obj is type(NotImplemented): 829 return self.save_reduce(type, (NotImplemented,), obj=obj) --&gt; 830 elif obj in _BUILTIN_TYPE_NAMES: 831 return self.save_reduce( 832 _builtin_type, (_BUILTIN_TYPE_NAMES[obj],), obj=obj) TypeError: unhashable type: '_UnassignedAtomType' </code></pre></div></div> <p>At this point, we can use dask to parallelize most of the steps in our process, but we still need to collect all of our parametrized structures prior to summing them all up</p> <p>Timing isn’t so great but we’ll see how this scales</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">computed_parametrized_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">.</span><span class="n">compute</span><span class="p">()</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">param_structures</span><span class="p">]</span> <span class="n">final_structure</span> <span class="o">=</span> <span class="nb">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">:</span> <span class="n">x</span><span class="o">+</span><span class="n">y</span><span class="p">,</span> <span class="n">computed_parametrized_structures</span><span class="p">)</span> <span class="n">final_structure</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 2.75 s, sys: 0 ns, total: 2.75 s Wall time: 2.74 s &lt;Structure 170 atoms; 10 residues; 160 bonds; parametrized&gt; </code></pre></div></div> <p>Putting all of our parallelized code together …</p> <div class="language-python 
highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="c1"># Make our molecular system </span><span class="n">single</span> <span class="o">=</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">single</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">5</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">5</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="c1"># Convert to graphs </span><span class="n">big_graph</span> <span class="o">=</span> <span class="n">structure_to_graph</span><span class="p">(</span><span class="n">structure</span><span class="p">)</span> <span class="n">individual_molecule_graphs</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">separate_molecule_graphs</span><span class="p">(</span><span class="n">structure</span><span class="p">,</span> <span class="n">big_graph</span><span class="p">)]</span> <span class="c1"># Grab parmed atoms for each node in the graph </span><span class="n">all_subselected_atoms</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">subselect_atoms</span><span class="p">)(</span><span class="n">structure</span><span class="p">,</span> <span 
class="n">molecule_graph</span><span class="p">)</span> <span class="k">for</span> <span class="n">molecule_graph</span> <span class="ow">in</span> <span class="n">individual_molecule_graphs</span><span class="p">]</span> <span class="c1"># Generate parmed structures for each molecule </span><span class="n">raw_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">make_structure_from_graph</span><span class="p">)(</span><span class="n">molecule_graph</span><span class="p">,</span> <span class="n">subselected_atoms</span><span class="p">,</span> <span class="n">big_graph</span><span class="p">)</span> <span class="k">for</span> <span class="n">molecule_graph</span><span class="p">,</span> <span class="n">subselected_atoms</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">individual_molecule_graphs</span><span class="p">,</span> <span class="n">all_subselected_atoms</span><span class="p">)]</span> <span class="c1"># Parametrize with our force field </span><span class="n">param_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">parametrize</span><span class="p">)(</span><span class="n">ff</span><span class="p">,</span> <span class="n">struc</span><span class="p">)</span> <span class="k">for</span> <span class="n">struc</span> <span class="ow">in</span> <span class="n">raw_structures</span><span class="p">]</span> <span class="n">computed_parametrized_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">.</span><span class="n">compute</span><span class="p">()</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">param_structures</span><span class="p">]</span> <span class="n">final_structure</span> <span class="o">=</span> <span 
class="nb">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">:</span> <span class="n">x</span><span class="o">+</span><span class="n">y</span><span class="p">,</span> <span class="n">computed_parametrized_structures</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 2.7 s, sys: 58.3 ms, total: 2.76 s Wall time: 2.7 s </code></pre></div></div> <p>Visualizing our task graph</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">param_structures</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">visualize</span><span class="p">()</span> </code></pre></div></div> <p><img src="/images/2020-06-21_foyer-dask_files/2020-06-21_foyer-dask_53_0.png" alt="png" /></p> <p>Before moving to timing comparisons, it’s important to note the <code class="language-plaintext highlighter-rouge">residue_map</code> functionality in foyer. If a “residue” (molecule type) has already been parametrized within a single foyer apply call, we don’t need to iterate over it again and re-discover its atom types; the parametrization is effectively cached. When foyer’s apply function is called separately for each molecule, this cache is not shared between calls, so it goes unused.</p> <h2 id="timing-comparisons">Timing comparisons</h2> <p>We have 3 methods to compare:</p> <ol> <li> <p>Canonical foyer: the standard way to use foyer on a single parmed structure that represents your entire molecular system. 
This actually takes the most advantage of the use_residue_map functionality</p> </li> <li> <p>Distributed foyer in serial: divide your parmed structure into smaller parmed structures and parametrize them one after another</p> </li> <li> <p>Distributed foyer in parallel: divide your parmed structure into smaller parmed structures and parametrize each one as a dask delayed task</p> </li> </ol> <p>We’ll notice that the number of residues in the final, parametrized structures differs between methods; this is a consequence of how <code class="language-plaintext highlighter-rouge">parmed.structure.__add__</code> and <code class="language-plaintext highlighter-rouge">parmed.structure.__iadd__</code> work when you try to combine different parmed structures. What’s important is that the numbers of atoms and bonds are consistent</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">canonical_foyer</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span> <span class="s">""" Standard way of using foyer, no parallelization"""</span> <span class="k">return</span> <span class="n">ff</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">structure</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> <span class="k">def</span> <span class="nf">distributed_foyer_serial</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">):</span> <span class="s">""" Apply foyer N times to N different molecules in serial"""</span> <span class="n">big_graph</span> <span class="o">=</span> <span class="n">structure_to_graph</span><span class="p">(</span><span class="n">structure</span><span class="p">)</span> <span 
class="n">individual_molecule_graphs</span> <span class="o">=</span> <span class="n">separate_molecule_graphs</span><span class="p">(</span><span class="n">structure</span><span class="p">,</span> <span class="n">big_graph</span><span class="p">)</span> <span class="n">individual_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">make_structure_from_graph</span><span class="p">(</span><span class="n">molecule_graph</span><span class="p">,</span> <span class="n">subselect_atoms</span><span class="p">(</span><span class="n">structure</span><span class="p">,</span> <span class="n">molecule_graph</span><span class="p">),</span> <span class="n">big_graph</span><span class="p">)</span> <span class="k">for</span> <span class="n">molecule_graph</span> <span class="ow">in</span> <span class="n">individual_molecule_graphs</span><span class="p">]</span> <span class="n">parametrized_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">parametrize</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">struc</span><span class="p">)</span> <span class="k">for</span> <span class="n">struc</span> <span class="ow">in</span> <span class="n">individual_structures</span><span class="p">]</span> <span class="n">parametrized_structure</span> <span class="o">=</span> <span class="nb">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">:</span> <span class="n">x</span><span class="o">+</span><span class="n">y</span><span class="p">,</span> <span class="n">parametrized_structures</span><span class="p">)</span> <span class="k">return</span> <span class="n">parametrized_structure</span> <span class="k">def</span> <span class="nf">distributed_foyer_parallel</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span 
class="p">):</span> <span class="s">"""Apply foyer N times to N different molecules in parallel"""</span> <span class="n">big_graph</span> <span class="o">=</span> <span class="n">structure_to_graph</span><span class="p">(</span><span class="n">structure</span><span class="p">)</span> <span class="n">individual_molecule_graphs</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">separate_molecule_graphs</span><span class="p">(</span><span class="n">structure</span><span class="p">,</span> <span class="n">big_graph</span><span class="p">)]</span> <span class="c1"># Grab parmed atoms for each node in the graph </span> <span class="n">all_subselected_atoms</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">subselect_atoms</span><span class="p">)(</span><span class="n">structure</span><span class="p">,</span> <span class="n">molecule_graph</span><span class="p">)</span> <span class="k">for</span> <span class="n">molecule_graph</span> <span class="ow">in</span> <span class="n">individual_molecule_graphs</span><span class="p">]</span> <span class="c1"># Generate parmed structures for each molecule </span> <span class="n">raw_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">make_structure_from_graph</span><span class="p">)(</span><span class="n">molecule_graph</span><span class="p">,</span> <span class="n">subselected_atoms</span><span class="p">,</span> <span class="n">big_graph</span><span class="p">)</span> <span class="k">for</span> <span class="n">molecule_graph</span><span class="p">,</span> <span class="n">subselected_atoms</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">individual_molecule_graphs</span><span class="p">,</span> <span class="n">all_subselected_atoms</span><span class="p">)]</span> <span 
class="c1"># Parametrize with our force field </span> <span class="n">param_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">parametrize</span><span class="p">)(</span><span class="n">ff</span><span class="p">,</span> <span class="n">struc</span><span class="p">)</span> <span class="k">for</span> <span class="n">struc</span> <span class="ow">in</span> <span class="n">raw_structures</span><span class="p">]</span> <span class="n">computed_parametrized_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">.</span><span class="n">compute</span><span class="p">()</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">param_structures</span><span class="p">]</span> <span class="n">final_structure</span> <span class="o">=</span> <span class="nb">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">:</span> <span class="n">x</span><span class="o">+</span><span class="n">y</span><span class="p">,</span> <span class="n">computed_parametrized_structures</span><span class="p">)</span> <span class="k">return</span> <span class="n">final_structure</span> </code></pre></div></div> <h2 id="small-homogeneous-system">Small, homogeneous system</h2> <p>10 pentane molecules</p> <table> <thead> <tr> <th>Method</th> <th>Time</th> </tr> </thead> <tbody> <tr> <td>Canonical foyer</td> <td>2.53 s</td> </tr> <tr> <td>Distributed foyer serial</td> <td>3.15 s</td> </tr> <tr> <td>Distributed foyer parallel</td> <td>3.22 s</td> </tr> </tbody> </table> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span 
class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">single</span> <span class="o">=</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">single</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">canonical_foyer</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 2.52 s, sys: 57.7 ms, total: 2.58 s Wall time: 2.53 s /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 200, Parameterized impropers: 0. 
Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) &lt;Structure 170 atoms; 1 residues; 160 bonds; PBC (orthogonal); parametrized&gt; </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">single</span> <span class="o">=</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">single</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">distributed_foyer_serial</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 3.17 s, sys: 28.5 ms, total: 3.2 s Wall time: 3.15 s &lt;Structure 170 atoms; 10 residues; 160 bonds; parametrized&gt; 
</code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">single</span> <span class="o">=</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">single</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">distributed_foyer_parallel</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 3.21 s, sys: 69.5 ms, total: 3.28 s Wall time: 3.22 s &lt;Structure 170 atoms; 10 residues; 160 bonds; parametrized&gt; </code></pre></div></div> <h2 id="large-homogeneous-system">Large, homogeneous system</h2> <p>100 pentane molecules</p> <table> <thead> <tr> <th>Method</th> <th>Time</th> </tr> </thead> <tbody> <tr> <td>Canonical foyer</td> <td>21 
s</td> </tr> <tr> <td>Distributed foyer serial</td> <td>35.1 s</td> </tr> <tr> <td>Distributed foyer parallel</td> <td>34.7 s</td> </tr> </tbody> </table> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">single</span> <span class="o">=</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">single</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">1000</span><span class="p">,</span><span class="mi">1000</span><span class="p">,</span><span class="mi">1000</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">canonical_foyer</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 20.9 s, sys: 196 ms, total: 21.1 s Wall time: 21 s /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 2000, Parameterized impropers: 0. 
Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) &lt;Structure 1700 atoms; 1 residues; 1600 bonds; PBC (orthogonal); parametrized&gt; </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">single</span> <span class="o">=</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">single</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">1000</span><span class="p">,</span><span class="mi">1000</span><span class="p">,</span><span class="mi">1000</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">distributed_foyer_serial</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 35.1 s, sys: 124 ms, total: 35.2 s Wall time: 35.1 s &lt;Structure 1700 atoms; 100 residues; 1600 bonds; 
parametrized&gt; </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">single</span> <span class="o">=</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">single</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">1000</span><span class="p">,</span><span class="mi">1000</span><span class="p">,</span><span class="mi">1000</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">distributed_foyer_parallel</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 34.8 s, sys: 159 ms, total: 34.9 s Wall time: 34.7 s &lt;Structure 1700 atoms; 100 residues; 1600 bonds; parametrized&gt; </code></pre></div></div> <h2 id="small-heterogeneous-system">Small, heterogeneous system</h2> <p>10 pentane, 10 decane, 10 icosane (C20 alkane)</p> <table> <thead> <tr> <th>Method</th> <th>Time</th> </tr> 
</thead> <tbody> <tr> <td>Canonical foyer</td> <td>14 s</td> </tr> <tr> <td>Distributed foyer serial</td> <td>16.9 s</td> </tr> <tr> <td>Distributed foyer parallel</td> <td>16.6 s</td> </tr> </tbody> </table> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">templates</span> <span class="o">=</span> <span class="p">[</span><span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">10</span><span class="p">),</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">20</span><span class="p">)]</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">templates</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="p">[</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">],</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">100</span><span class="p">,</span><span class="mi">100</span><span class="p">,</span><span class="mi">100</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span 
class="n">canonical_foyer</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 14.2 s, sys: 130 ms, total: 14.3 s Wall time: 14 s /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 1400, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) &lt;Structure 1110 atoms; 1 residues; 1080 bonds; PBC (orthogonal); parametrized&gt; </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">templates</span> <span class="o">=</span> <span class="p">[</span><span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">10</span><span class="p">),</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">20</span><span class="p">)]</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">templates</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span 
class="p">[</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">],</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">100</span><span class="p">,</span><span class="mi">100</span><span class="p">,</span><span class="mi">100</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">distributed_foyer_serial</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 40, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 80, Parameterized impropers: 0. 
Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) CPU times: user 17.1 s, sys: 170 ms, total: 17.3 s Wall time: 16.9 s &lt;Structure 1110 atoms; 30 residues; 1080 bonds; parametrized&gt; </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">templates</span> <span class="o">=</span> <span class="p">[</span><span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">10</span><span class="p">),</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">20</span><span class="p">)]</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">templates</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="p">[</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">],</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">100</span><span class="p">,</span><span class="mi">100</span><span class="p">,</span><span class="mi">100</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span 
class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">distributed_foyer_parallel</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 16.7 s, sys: 222 ms, total: 16.9 s Wall time: 16.6 s &lt;Structure 1110 atoms; 30 residues; 1080 bonds; parametrized&gt; </code></pre></div></div> <h2 id="large-heterogeneous-system">Large, heterogeneous system</h2> <p>100 pentane, 100 decane, 100 icosane</p> <table> <thead> <tr> <th>Method</th> <th>Time</th> </tr> </thead> <tbody> <tr> <td>Canonical foyer</td> <td>2 min 31 s</td> </tr> <tr> <td>Distributed foyer serial</td> <td>4 min 20 s</td> </tr> <tr> <td>Distributed foyer parallel</td> <td>4 min 17 s</td> </tr> </tbody> </table> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">templates</span> <span class="o">=</span> <span class="p">[</span><span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">10</span><span class="p">),</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">20</span><span class="p">)]</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span 
class="p">(</span><span class="n">templates</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="p">[</span><span class="mi">100</span><span class="p">,</span><span class="mi">100</span><span class="p">,</span><span class="mi">100</span><span class="p">],</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">1000</span><span class="p">,</span><span class="mi">1000</span><span class="p">,</span><span class="mi">1000</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">canonical_foyer</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 2min 30s, sys: 1.27 s, total: 2min 31s Wall time: 2min 31s /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 14000, Parameterized impropers: 0. 
Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) &lt;Structure 11100 atoms; 1 residues; 10800 bonds; PBC (orthogonal); parametrized&gt; </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">templates</span> <span class="o">=</span> <span class="p">[</span><span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">10</span><span class="p">),</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">20</span><span class="p">)]</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">templates</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="p">[</span><span class="mi">100</span><span class="p">,</span><span class="mi">100</span><span class="p">,</span><span class="mi">100</span><span class="p">],</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span class="mi">1000</span><span class="p">,</span><span class="mi">1000</span><span class="p">,</span><span class="mi">1000</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span 
class="n">to_parmed</span><span class="p">()</span> <span class="n">distributed_foyer_serial</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 4min 20s, sys: 1.05 s, total: 4min 21s Wall time: 4min 20s &lt;Structure 11100 atoms; 300 residues; 10800 bonds; parametrized&gt; </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">templates</span> <span class="o">=</span> <span class="p">[</span><span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">10</span><span class="p">),</span> <span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">20</span><span class="p">)]</span> <span class="n">cmpd</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">fill_box</span><span class="p">(</span><span class="n">templates</span><span class="p">,</span> <span class="n">n_compounds</span><span class="o">=</span><span class="p">[</span><span class="mi">100</span><span class="p">,</span><span class="mi">100</span><span class="p">,</span><span class="mi">100</span><span class="p">],</span> <span class="n">box</span><span class="o">=</span><span class="p">[</span><span 
class="mi">1000</span><span class="p">,</span><span class="mi">1000</span><span class="p">,</span><span class="mi">1000</span><span class="p">])</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">distributed_foyer_parallel</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 4min 17s, sys: 972 ms, total: 4min 18s Wall time: 4min 17s &lt;Structure 11100 atoms; 300 residues; 10800 bonds; parametrized&gt; </code></pre></div></div> <h2 id="random-heterogeneous-system">Random heterogeneous system</h2> <p>200 alkanes with chain lengths drawn at random from 5 to 19 carbons</p> <table> <thead> <tr> <th>Method</th> <th>Time</th> </tr> </thead> <tbody> <tr> <td>Canonical foyer</td> <td>1 min 37 s</td> </tr> <tr> <td>Distributed foyer serial</td> <td>2 min 56 s</td> </tr> <tr> <td>Distributed foyer parallel</td> <td>3 min 1 s</td> </tr> </tbody> </table> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="n">random_compounds</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">Compound</span><span class="p">(</span><span class="n">subcompounds</span><span class="o">=</span><span class="p">[</span><span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span 
class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="n">high</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">200</span><span class="p">)])</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/ayang41/programs/mbuild/mbuild/compound.py:2139: UserWarning: No simulation box detected for mdtraj.Trajectory &lt;mdtraj.Trajectory with 1 frames, 3 atoms, 1 residues, without unitcells&gt; "mdtraj.Trajectory {}".format(traj) /home/ayang41/programs/mbuild/mbuild/compound.py:2139: UserWarning: No simulation box detected for mdtraj.Trajectory &lt;mdtraj.Trajectory with 1 frames, 4 atoms, 1 residues, without unitcells&gt; "mdtraj.Trajectory {}".format(traj) </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">random_compounds</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">canonical_foyer</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">,</span> <span class="n">use_residue_map</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 1min 37s, sys: 158 ms, total: 1min 37s Wall time: 1min 37s &lt;Structure 7663 atoms; 1 residues; 7463 bonds; PBC 
(orthogonal); parametrized&gt; </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">random_compounds</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">distributed_foyer_serial</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 28, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 44, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 60, Parameterized impropers: 0. 
Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 24, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 56, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 52, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 64, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 68, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 36, Parameterized impropers: 0. 
Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 32, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 76, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 72, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 48, Parameterized impropers: 0. 
Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) CPU times: user 2min 56s, sys: 839 ms, total: 2min 57s Wall time: 2min 56s &lt;Structure 7663 atoms; 200 residues; 7463 bonds; parametrized&gt; </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">ff</span> <span class="o">=</span> <span class="n">foyer</span><span class="p">.</span><span class="n">forcefields</span><span class="p">.</span><span class="n">load_OPLSAA</span><span class="p">()</span> <span class="n">structure</span> <span class="o">=</span> <span class="n">random_compounds</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">distributed_foyer_parallel</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 3min 1s, sys: 551 ms, total: 3min 2s Wall time: 3min 1s &lt;Structure 7663 atoms; 200 residues; 7463 bonds; parametrized&gt; </code></pre></div></div> <h2 id="making-individual-structures">Making individual structures</h2> <p>Parallelization is fantastically slowing down our operations. I have a hunch this might be due to the extra steps involved in splitting up the molecular graphs.</p> <p>When molecular modelers make these systems, we already know which collection of atoms and bonds forms a molecule, so we can use that to circumvent any use of connected components. In this iteration, we’ve added a shortcut where we already know the individual structures.</p> <p>Canonical foyer is still faster. 
For a parallel library comparison, I tried using multiprocessing but got infinite recursion errors, so multiprocessing was not as easy to use as dask for this particular application</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">random_compounds</span> <span class="o">=</span> <span class="n">mb</span><span class="p">.</span><span class="n">Compound</span><span class="p">(</span><span class="n">subcompounds</span><span class="o">=</span><span class="p">[</span><span class="n">Alkane</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="n">high</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">200</span><span class="p">)])</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">individual_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">cmpd</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="k">for</span> <span class="n">cmpd</span> <span class="ow">in</span> <span class="n">random_compounds</span><span class="p">.</span><span class="n">children</span><span class="p">]</span> <span class="n">param_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">parametrize</span><span class="p">)(</span><span class="n">ff</span><span 
class="p">,</span> <span class="n">struc</span><span class="p">)</span> <span class="k">for</span> <span class="n">struc</span> <span class="ow">in</span> <span class="n">individual_structures</span><span class="p">]</span> <span class="n">computed_parametrized_structures</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">.</span><span class="n">compute</span><span class="p">()</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">param_structures</span><span class="p">]</span> <span class="n">final_structure</span> <span class="o">=</span> <span class="nb">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">:</span> <span class="n">x</span><span class="o">+</span><span class="n">y</span><span class="p">,</span> <span class="n">computed_parametrized_structures</span><span class="p">)</span> <span class="n">final_structure</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/ayang41/programs/mbuild/mbuild/compound.py:2527: UserWarning: No box specified and no Compound.box detected. Using Compound.boundingbox + 0.5 nm buffer. Setting all box angles to 90 degrees. "No box specified and no Compound.box detected. " /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 76, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 36, Parameterized impropers: 0. 
Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 24, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 72, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 68, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 60, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 32, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 64, Parameterized impropers: 0. 
Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 28, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 56, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 52, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 40, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 48, Parameterized impropers: 0. Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 44, Parameterized impropers: 0. 
Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) CPU times: user 3min 19s, sys: 1.04 s, total: 3min 20s Wall time: 3min 19s &lt;Structure 7735 atoms; 200 residues; 7535 bonds; PBC (orthogonal); parametrized&gt; </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">param_structures</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">visualize</span><span class="p">(</span><span class="n">rankdir</span><span class="o">=</span><span class="s">'LR'</span><span class="p">)</span> </code></pre></div></div> <p><img src="/images/2020-06-21_foyer-dask_files/2020-06-21_foyer-dask_80_0.png" alt="png" /></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">one_structure</span> <span class="o">=</span> <span class="n">random_compounds</span><span class="p">.</span><span class="n">to_parmed</span><span class="p">()</span> <span class="n">canonical_foyer</span><span class="p">(</span><span class="n">ff</span><span class="p">,</span> <span class="n">one_structure</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/ayang41/programs/mbuild/mbuild/compound.py:2527: UserWarning: No box specified and no Compound.box detected. Using Compound.boundingbox + 0.5 nm buffer. Setting all box angles to 90 degrees. "No box specified and no Compound.box detected. " CPU times: user 1min 50s, sys: 1.05 s, total: 1min 51s Wall time: 1min 51s /home/ayang41/programs/foyer/foyer/forcefield.py:267: UserWarning: Parameters have not been assigned to all impropers. Total system impropers: 9780, Parameterized impropers: 0. 
Note that if your system contains torsions of Ryckaert-Bellemans functional form, all of these torsions are processed as propers warnings.warn(msg) &lt;Structure 7735 atoms; 1 residues; 7535 bonds; PBC (orthogonal); parametrized&gt; </code></pre></div></div> <h1 id="lessons-and-takeaways">Lessons and Takeaways</h1> <p>This was a little disheartening: no attempt to distribute foyer atom-typing or combine it with dask accelerated anything. This can probably be explained in a variety of ways:</p> <ul> <li>We had to convert our structure to a graph, run a connected components algorithm (which has its own scaling issues), create separate parmed structures, then re-join/add the individual structures together. Each of those steps is bound to slow things down.</li> <li>Data communication also plays a role here: sending the molecular graphs and the entire structure to each dask worker adds overhead to our pipeline.</li> <li>Doing everything in one foyer function allows the use of caching, which we lose when executing the function many separate times.</li> <li>Even simplifying the pipeline didn’t show much improvement for the dask implementation.</li> </ul> <p>There probably is room for the foyer API to be more accommodating for dask and other parallel computations, but it might require a refactoring effort to properly expose the functions-to-parallelize and utilize data structures/approaches more amenable to parallelization. Breaking up a large chemical system into smaller substructures didn’t seem to help.</p> <p>In all honesty, since most molecular systems usually have fewer than a dozen different molecular species, just replicated into thousands of molecules, the best bet is to parametrize each molecular species once, then propagate the parameters appropriately, all in the canonical foyer style without any parallelization. 
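</p> <p>The "parametrize each species once, then propagate" idea can be sketched in plain Python. This is only an illustration under stated assumptions: <code class="language-plaintext highlighter-rouge">parametrize_species</code> and <code class="language-plaintext highlighter-rouge">parametrize_system</code> are hypothetical names, not foyer functions, and the returned dictionaries are placeholders for real parametrized structures.</p>

```python
def parametrize_species(species_name):
    """Hypothetical stand-in for one expensive atom-typing call (not a foyer API)."""
    parametrize_species.calls += 1
    # Placeholder result; a real workflow would return a parametrized structure.
    return {"species": species_name, "atom_types": f"types-for-{species_name}"}


# Count how often the expensive step actually runs.
parametrize_species.calls = 0


def parametrize_system(molecule_species):
    """Parametrize each unique species once and reuse the result for every copy."""
    cache = {}
    system = []
    for name in molecule_species:
        if name not in cache:
            cache[name] = parametrize_species(name)
        system.append(cache[name])
    return system


# A system with three species replicated into thousands of molecules.
molecules = ["hexane"] * 1000 + ["decane"] * 500 + ["water"] * 3000
system = parametrize_system(molecules)
print(len(system), "molecules,", parametrize_species.calls, "expensive calls")
```

<p>The expensive step runs once per species rather than once per molecule, which is the kind of saving that splitting the system into per-molecule tasks gives up.</p> <p>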
The current foyer implementation already has implicit acceleration with caching, and networkx may already have some graph optimizations for subgraph isomorphisms, reducing any need for us to explicitly decompose one big graph into lots of small connected components.</p> <p>Notebooks can be found <a href="https://github.com/ahy3nz/ahy3nz.github.io/tree/master/files/notebooks">in this repo</a></p>Alex H. Yang[email protected]Combining Foyer + DaskBig data tools for MD simulation analysis2020-05-13T00:00:00-05:002020-05-13T00:00:00-05:00https://ahy3nz.github.io/posts/2020/05/dask-mdtraj<h1 id="big-data-tools-for-md-simulation-analysis">Big data tools for MD simulation analysis</h1> <p>(Updated 2020-05-15)</p> <p>Trajectories are sets of coordinates over time. While gathering data and running simulations are exhaustively parallelized, some analysis methods are not. Speaking from experience, parallelizing analysis using <a href="https://docs.python.org/3.7/library/multiprocessing.html">Python multiprocessing</a> can get very messy if you don’t have a clear idea of how you want to parallelize the analysis, and how exactly you’re going to code it up.</p> <p>Here, I’m going to attempt to use some parallel libraries for MD trajectory analysis.</p> <h2 id="some-big-data-tools">Some big data tools</h2> <p>Since grad school, I’ve been exposed to a variety of big data tools (Dask, Spark, Rapids), and it’s been a point of interest to test their utility for molecular simulation. Each tool comes with its own set of advantages and disadvantages, and I encourage everyone to actively try each to see which is most appropriate for the desired application.</p> <ul> <li>Rapids is very fast, but requires GPUs. Depending on your tech stack and tech constraints, you may or may not have cheap and easy access to sufficient GPUs. 
Rapids is a little more sensitive to data types than the others - but as an amateur, I could be misusing the libraries.</li> <li>Spark is fast, but requires some Hadoop and Spark know-how to stand up properly. Many tech stacks and constraints seem to be well-suited for Spark applications. Spark scales out well, is very flexible with datatypes, and spares you a lot of parallel-programming know-how. At my own work, some primitive tests have shown that Spark outperforms Dask for dataframe operations on strings and some ML operations - but as an amateur, there is probably some Dask tuning that could be done.</li> <li>Dask is also fast, but your mileage may vary. Some tech stacks are suitable for Dask, but cloud resources/tech constraints might make Dask adoption hard. Dask exposes various levels of parallelism, so proper Dask users will end up learning a lot about parallel computing along the way.</li> </ul> <p>I defer to <a href="https://www.youtube.com/watch?v=RRtqIagk93k">this pydata video for a Dask, Rapids, Spark comparison</a></p> <h2 id="for-those-like-me-who-are-not-used-to-setting-up-parallel-compute">For those like me who are not used to setting up parallel compute</h2> <p>The one thing I will observe as I dabble away on my personal computer: I am familiar with neither setting up a Hadoop cluster nor exposing my GPU to WSL, and single-node pyspark is not going to be useful given the overhead. If given the proper infrastructure and resources, I can use these libraries, but at this moment it would take time for me to set up the resources to properly utilize Spark or Rapids on my PC. Dask, in my case, seems like the simplest parallel compute library to use. 
If you’re a grad student or a data scientist unfamiliar with software environments and infrastructure beyond Conda environments, Dask might also be the easiest to adopt.</p> <h2 id="computing-atomic-distances-from-a-molecular-dynamics-simulation">Computing atomic distances from a molecular dynamics simulation</h2> <p>Trivial MD analysis involves looking at each atom within a frame, without having to look at time correlations from frame to frame. I’m going to use <a href="https://github.com/mdtraj/mdtraj/">MDTraj</a> to load in a trajectory, and look at distances between atoms in each frame. I’ll do this serially, with just MDTraj, and then again using one level of Dask parallelism, <a href="https://docs.dask.org/en/latest/delayed.html">Dask delayed</a>.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">itertools</span> <span class="k">as</span> <span class="n">it</span> <span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="kn">import</span> <span class="nn">mdtraj</span> <span class="kn">import</span> <span class="nn">dask</span> <span class="kn">from</span> <span class="nn">dask</span> <span class="kn">import</span> <span class="n">delayed</span> <span class="kn">import</span> <span class="nn">dask.bag</span> <span class="k">as</span> <span class="n">db</span> </code></pre></div></div> <p>Saving myself the effort of generating my own trajectory, I will use <a href="https://github.com/mdtraj/mdtraj/tree/master/tests/data">one of the trajectories in MDTraj’s unit tests</a>.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">path_to_data</span> <span class="o">=</span> <span class="n">Path</span><span 
class="p">(</span><span class="s">'/home/ayang41/programs/mdtraj/tests/data'</span><span class="p">)</span> <span class="n">tip3p_xtc</span> <span class="o">=</span> <span class="n">Path</span><span class="p">.</span><span class="n">joinpath</span><span class="p">(</span><span class="n">path_to_data</span><span class="o">/</span><span class="s">'tip3p_300K_1ATM.xtc'</span><span class="p">)</span> <span class="n">tip3p_pdb</span> <span class="o">=</span> <span class="n">Path</span><span class="p">.</span><span class="n">joinpath</span><span class="p">(</span><span class="n">path_to_data</span><span class="o">/</span><span class="s">'tip3p_300K_1ATM.pdb'</span><span class="p">)</span> </code></pre></div></div> <p>This trajectory is only 401 frames - parallel analysis incurs too much overhead to be useful. I’m going to artificially lengthen the trajectory out to 1604 frames, where the gain from parallelization will hopefully be more apparent. In reality, most grad students will have many, many more frames to analyze.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">traj</span> <span class="o">=</span> <span class="n">mdtraj</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">tip3p_xtc</span><span class="p">.</span><span class="n">as_posix</span><span class="p">(),</span> <span class="n">top</span><span class="o">=</span><span class="n">tip3p_pdb</span><span class="p">.</span><span class="n">as_posix</span><span class="p">())</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">):</span> <span class="n">traj</span> <span class="o">=</span> <span class="n">traj</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">traj</span><span class="p">)</span> <span class="n">traj</span> 
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;mdtraj.Trajectory with 1604 frames, 774 atoms, 258 residues, and unitcells at 0x7f9bf4cce150&gt; </code></pre></div></div> <p>Additionally, to increase the computational expense, I’ll compute displacements for every ordered pair of atoms in each frame.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">atom_pairs</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">it</span><span class="p">.</span><span class="n">permutations</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">traj</span><span class="p">.</span><span class="n">n_atoms</span><span class="p">),</span><span class="mi">2</span><span class="p">)]</span> </code></pre></div></div> <h2 id="simple-implementation-with-mdtraj">Simple implementation with MDTraj</h2> <p>On my PC with 6 cores, this took about 23 seconds (and also nearly froze my computer).</p> <p>It should be noted that MDTraj already does a lot of parallelization and acceleration under the hood with some C optimizations. 
“Simple” in this case means a user relying on MDTraj’s own optimizations.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">displacements</span> <span class="o">=</span> <span class="n">mdtraj</span><span class="p">.</span><span class="n">compute_displacements</span><span class="p">(</span><span class="n">traj</span><span class="p">,</span> <span class="n">atom_pairs</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 5.94 s, sys: 17.5 s, total: 23.5 s Wall time: 23.7 s </code></pre></div></div> <h2 id="combining-dask-with-mdtraj">Combining Dask with MDTraj</h2> <p>As with most parallel computing applications, it’s important to recognize how and what you will be parallelizing/distributing. In this case, we will be distributing our one trajectory across 4 partitions, creating <code class="language-plaintext highlighter-rouge">Delayed</code> objects. 
Each <code class="language-plaintext highlighter-rouge">Delayed</code> object isn’t an actual execution - it’s a scheduled operation (like queueing something up in SLURM or PBS).</p> <p>It helps that <code class="language-plaintext highlighter-rouge">mdtraj.Trajectory</code> objects are iterable, so we can easily break up the trajectory into 4 even-sized chunks with some python list comprehensions</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">chunksize</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">traj</span><span class="p">.</span><span class="n">n_frames</span><span class="o">/</span><span class="mi">4</span><span class="p">)</span> <span class="n">bag</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="n">from_sequence</span><span class="p">([</span><span class="n">traj</span><span class="p">[</span><span class="n">chunksize</span><span class="o">*</span><span class="n">i</span><span class="p">:</span> <span class="n">chunksize</span><span class="o">*</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)]</span> <span class="p">,</span> <span class="n">npartitions</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="n">bunch_of_delayed</span> <span class="o">=</span> <span class="n">bag</span><span class="p">.</span><span class="n">to_delayed</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 62.5 ms, sys: 172 ms, total: 234 ms Wall time: 
293 ms </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bag</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dask.bag&lt;from_sequence, npartitions=4&gt; </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bunch_of_delayed</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Delayed(('from_sequence-b688539387c3c167fe82241b18a1670a', 0)), Delayed(('from_sequence-b688539387c3c167fe82241b18a1670a', 1)), Delayed(('from_sequence-b688539387c3c167fe82241b18a1670a', 2)), Delayed(('from_sequence-b688539387c3c167fe82241b18a1670a', 3))] </code></pre></div></div> <p>If we wanted to, we can still pluck out and execute the <code class="language-plaintext highlighter-rouge">Delayed</code> objects, and parse the number of atoms in MDTraj-like syntax</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bunch_of_delayed</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">compute</span><span class="p">()[</span><span class="mi">0</span><span class="p">].</span><span class="n">n_atoms</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>774 </code></pre></div></div> <p>We can also validate that each <code class="language-plaintext highlighter-rouge">Delayed</code> object is computing a quarter of our trajectory</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bunch_of_delayed</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span 
class="n">compute</span><span class="p">(),</span> <span class="n">bunch_of_delayed</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">compute</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>([&lt;mdtraj.Trajectory with 401 frames, 774 atoms, 258 residues, and unitcells at 0x7fb555e2df50&gt;], [&lt;mdtraj.Trajectory with 401 frames, 774 atoms, 258 residues, and unitcells at 0x7fb2a1a12d10&gt;]) </code></pre></div></div> <p>To queue up additional computations, we will take each <code class="language-plaintext highlighter-rouge">Delayed</code> object, and add on one additional operation - <code class="language-plaintext highlighter-rouge">mdtraj.compute_displacements</code>. Now the delayed objects have two operations - distributing the trajectory and computing the displacements. It’s worth noting that none of these operations involved rewriting MDTraj code or adding function decorators. 
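</p> <p>As a minimal sketch of that wrapping pattern, assuming only that dask is installed: <code class="language-plaintext highlighter-rouge">square</code> is a hypothetical stand-in for an expensive function such as <code class="language-plaintext highlighter-rouge">mdtraj.compute_displacements</code>, and it needs no decorator or rewrite to be scheduled lazily.</p>

```python
import dask
from dask import delayed


def square(x):
    # An ordinary, undecorated function (hypothetical stand-in for real analysis).
    return x * x


# Wrapping with delayed() only builds task-graph nodes; nothing runs yet.
lazy_results = [delayed(square)(i) for i in range(4)]

# Passing the whole list as one argument returns a 1-tuple holding a list of
# results, mirroring the dask.compute(all_displacements) call in this post.
computed = dask.compute(lazy_results)
print(len(computed), computed[0])

# Unpacking the list instead returns one tuple entry per Delayed object.
unpacked = dask.compute(*lazy_results)
print(unpacked)
```

<p>The difference between the two call styles is why results computed from a single list argument are reached through an extra <code class="language-plaintext highlighter-rouge">[0]</code> index.</p> <p>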
These MDTraj functions are wrapped using the <code class="language-plaintext highlighter-rouge">Delayed</code> objects</p> <p>Again, the computation has not been performed yet</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">all_displacements</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">mdtraj</span><span class="p">.</span><span class="n">compute_displacements</span><span class="p">)(</span><span class="n">traj</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">atom_pairs</span><span class="p">)</span> <span class="k">for</span> <span class="n">traj</span> <span class="ow">in</span> <span class="n">bunch_of_delayed</span><span class="p">]</span> <span class="n">all_displacements</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 26.5 s, sys: 2.55 s, total: 29 s Wall time: 29.2 s [Delayed('compute_displacements-c1ef5c08-6bb2-4508-8f1a-166000d2cd3e'), Delayed('compute_displacements-5a9fd8cd-2993-4c4b-be90-a2523e47c09a'), Delayed('compute_displacements-35c48042-fecf-4eb4-adc5-931c097b6e8d'), Delayed('compute_displacements-d8699960-98e0-4b74-a320-2b2e1f3870a9')] </code></pre></div></div> <p>If we want to “flush” the queue and run all our <code class="language-plaintext highlighter-rouge">Delayed</code> computations, we use Dask to finally compute them.</p> <p>At this point, the actual calculation took 3min 6s (hey, this is terrible!), but the overhead involved 27 seconds</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">displacements</span> <span class="o">=</span> <span class="n">dask</span><span 
class="p">.</span><span class="n">compute</span><span class="p">(</span><span class="n">all_displacements</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 17.8 s, sys: 27.9 s, total: 45.7 s Wall time: 3min 6s </code></pre></div></div> <p>The returned object is 4 different results, and each result is a numpy array 401 x 598302 x 3 (n_frames x n_atompairs x n_spatialdimensions)</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">len</span><span class="p">(</span><span class="n">displacements</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>4 </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">displacements</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">].</span><span class="n">shape</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(401, 598302, 3) </code></pre></div></div> <h2 id="visualizing-the-dask-graph">Visualizing the dask graph</h2> <p>Spark and Dask both use task graphs to schedule function after function, with Spark doing some implicit optimizations.</p> <p>Dask has a nice visualize functionality to show what the task graphs and parallelization look like for two of our <code class="language-plaintext highlighter-rouge">Delayed</code> objects</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dask</span><span class="p">.</span><span class="n">visualize</span><span class="p">(</span><span 
class="n">all_displacements</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">])</span> </code></pre></div></div> <p><img src="/images/2020-05-13-dask-mdtraj_files/2020-05-13-dask-mdtraj_26_0.png" alt="png" /></p> <h2 id="this-dask-parallelization-slowed-the-mdtraj-operation-down-what-gives">This Dask parallelization slowed the MDTraj operation down! What gives?</h2> <p>MDTraj is very well-optimized, so any attempts to distribute work end up slowing down the array multiplications</p> <p>We’ll use our own, crude distance function that has no optimizations (and doesn’t obey the <a href="https://en.wikipedia.org/wiki/Periodic_boundary_conditions#Practical_implementation:_continuity_and_the_minimum_image_convention">minimum image convention</a>)</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">crude_distances</span><span class="p">(</span><span class="n">traj</span><span class="p">,</span> <span class="n">atom_pairs</span><span class="p">):</span>
    <span class="n">all_distances</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">frame</span> <span class="ow">in</span> <span class="n">traj</span><span class="p">:</span>
        <span class="n">distances</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">pair</span> <span class="ow">in</span> <span class="n">atom_pairs</span><span class="p">:</span>
            <span class="c1"># Euclidean distance between the two atoms of the pair (no periodic wrapping)</span>
            <span class="n">diff</span> <span class="o">=</span> <span class="n">frame</span><span class="p">.</span><span class="n">xyz</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="n">pair</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="p">:]</span> <span class="o">-</span> <span class="n">frame</span><span class="p">.</span><span class="n">xyz</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="n">pair</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="p">:]</span>
            <span class="n">distance</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">diff</span><span class="p">,</span> <span class="n">diff</span><span class="p">))</span>
            <span class="n">distances</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">distance</span><span class="p">)</span>
        <span class="n">all_distances</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">distances</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">all_distances</span><span class="p">)</span>
</code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">traj</span> <span class="o">=</span> <span class="n">mdtraj</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">tip3p_xtc</span><span class="p">.</span><span class="n">as_posix</span><span class="p">(),</span> <span class="n">top</span><span class="o">=</span><span class="n">tip3p_pdb</span><span class="p">.</span><span class="n">as_posix</span><span class="p">())</span> <span class="n">chunksize</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">traj</span><span class="p">.</span><span class="n">n_frames</span><span class="o">/</span><span class="mi">4</span><span class="p">)</span> <span class="n">bag</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="n">from_sequence</span><span class="p">([</span><span 
class="n">traj</span><span class="p">[</span><span class="n">chunksize</span><span class="o">*</span><span class="n">i</span><span class="p">:</span> <span class="n">chunksize</span><span class="o">*</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)]</span> <span class="p">,</span> <span class="n">npartitions</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="n">bunch_of_delayed</span> <span class="o">=</span> <span class="n">bag</span><span class="p">.</span><span class="n">to_delayed</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 125 ms, sys: 0 ns, total: 125 ms Wall time: 505 ms </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">atom_pairs</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">it</span><span class="p">.</span><span class="n">combinations</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">100</span><span class="p">),</span><span class="mi">2</span><span class="p">)]</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">all_displacements</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">crude_distances</span><span class="p">)(</span><span 
class="n">traj</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">atom_pairs</span><span class="p">)</span> <span class="k">for</span> <span class="n">traj</span> <span class="ow">in</span> <span class="n">bunch_of_delayed</span><span class="p">]</span> <span class="n">all_displacements</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 156 ms, sys: 46.9 ms, total: 203 ms Wall time: 169 ms [Delayed('crude_distances-fb865e6f-232a-4a24-8a37-0b0f6ce13f22'), Delayed('crude_distances-438627d2-a181-4127-85a1-1cfbe99f64f6'), Delayed('crude_distances-543f6412-6dcc-4a30-922b-2f963e978a5d'), Delayed('crude_distances-78883eb2-520e-4c75-9af9-d06b82b746d1')] </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">output</span> <span class="o">=</span> <span class="n">dask</span><span class="p">.</span><span class="n">compute</span><span class="p">(</span><span class="n">all_displacements</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 54.6 s, sys: 1min, total: 1min 55s Wall time: 1min 7s </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">output</span> <span class="o">=</span> <span class="n">crude_distances</span><span class="p">(</span><span class="n">traj</span><span class="p">,</span> <span class="n">atom_pairs</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 1min 28s, sys: 1min 40s, total: 3min 8s Wall 
time: 1min 51s </code></pre></div></div> <p>So Dask gave a ~44 second speedup for the crude function (1min 51s down to 1min 7s of wall time) - that’s a small win.</p> <p>And here’s the task graph for one of the <code class="language-plaintext highlighter-rouge">Delayed</code> objects</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">all_displacements</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">visualize</span><span class="p">()</span> </code></pre></div></div> <p><img src="/images/2020-05-13-dask-mdtraj_files/2020-05-13-dask-mdtraj_35_0.png" alt="png" /></p> <h2 id="aiming-for-memory-efficiency">Aiming for memory-efficiency</h2> <p>Up until now, we’ve had the whole trajectory loaded into memory prior to any parallelization with Dask. We can use MDTraj’s <code class="language-plaintext highlighter-rouge">iterload</code> function to hold only one chunk of the trajectory in memory at a time, but still pass different chunks around.</p> <p>As another consideration for parallelization, increasing the number of disk reads will slow down your process, so make sure the gain from parallelization is worth it</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">delayed_load</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="n">from_sequence</span><span class="p">(</span><span class="n">a</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">mdtraj</span><span class="p">.</span><span class="n">iterload</span><span class="p">(</span><span class="n">tip3p_xtc</span><span class="p">.</span><span class="n">as_posix</span><span class="p">(),</span> <span class="n">top</span><span class="o">=</span><span class="n">tip3p_pdb</span><span class="p">.</span><span class="n">as_posix</span><span class="p">())).</span><span class="n">to_delayed</span><span class="p">()</span> 
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 172 ms, sys: 172 ms, total: 344 ms Wall time: 312 ms </code></pre></div></div> <p>Confirming that each <code class="language-plaintext highlighter-rouge">Delayed</code> object has different frames</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">delayed_load</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">compute</span><span class="p">()[</span><span class="mi">0</span><span class="p">].</span><span class="n">time</span><span class="p">,</span> <span class="n">delayed_load</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">compute</span><span class="p">()[</span><span class="mi">0</span><span class="p">].</span><span class="n">time</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43., 44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55., 56., 57., 58., 59., 60., 61., 62., 63., 64., 65., 66., 67., 68., 69., 70., 71., 72., 73., 74., 75., 76., 77., 78., 79., 80., 81., 82., 83., 84., 85., 86., 87., 88., 89., 90., 91., 92., 93., 94., 95., 96., 97., 98., 99.], dtype=float32), array([100., 101., 102., 103., 104., 105., 106., 107., 108., 109., 110., 111., 112., 113., 114., 115., 116., 117., 118., 119., 120., 121., 122., 123., 124., 125., 126., 127., 128., 129., 130., 131., 132., 133., 134., 135., 136., 137., 138., 139., 140., 141., 142., 143., 144., 145., 146., 147., 148., 149., 150., 151., 152., 153., 154., 155., 156., 157., 158., 159., 160., 161., 162., 163., 164., 
165., 166., 167., 168., 169., 170., 171., 172., 173., 174., 175., 176., 177., 178., 179., 180., 181., 182., 183., 184., 185., 186., 187., 188., 189., 190., 191., 192., 193., 194., 195., 196., 197., 198., 199.], dtype=float32)) </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">all_displacements</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">crude_distances</span><span class="p">)(</span><span class="n">traj</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">atom_pairs</span><span class="p">)</span> <span class="k">for</span> <span class="n">traj</span> <span class="ow">in</span> <span class="n">delayed_load</span><span class="p">]</span> <span class="n">all_displacements</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 188 ms, sys: 93.8 ms, total: 281 ms Wall time: 294 ms [Delayed('crude_distances-d2f8fad8-663a-41b4-a97c-9277cc086fba'), Delayed('crude_distances-4a9116d4-96f0-4c35-bca1-0525622976c8'), Delayed('crude_distances-6aa987e4-0e71-462f-9293-55e6deed1425'), Delayed('crude_distances-cd230400-cf2f-4ad9-9cc7-ab573848e397'), Delayed('crude_distances-627969a1-2726-4f7e-87a9-7a97665c46b0')] </code></pre></div></div> <p>Still ~40 second gain with the crude distance calculation with Dask</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">out</span> <span class="o">=</span> <span class="n">dask</span><span class="p">.</span><span class="n">compute</span><span class="p">(</span><span class="n">all_displacements</span><span class="p">)</span> </code></pre></div></div> <div 
class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 52.1 s, sys: 1min 3s, total: 1min 55s Wall time: 1min 10s </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">all_displacements</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">for</span> <span class="n">traj</span> <span class="ow">in</span> <span class="n">mdtraj</span><span class="p">.</span><span class="n">iterload</span><span class="p">(</span><span class="n">tip3p_xtc</span><span class="p">.</span><span class="n">as_posix</span><span class="p">(),</span> <span class="n">top</span><span class="o">=</span><span class="n">tip3p_pdb</span><span class="p">.</span><span class="n">as_posix</span><span class="p">()):</span> <span class="n">all_displacements</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">crude_distances</span><span class="p">(</span><span class="n">traj</span><span class="p">,</span> <span class="n">atom_pairs</span><span class="p">))</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 1min 26s, sys: 1min 46s, total: 3min 13s Wall time: 1min 51s </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">atom_pairs</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">it</span><span class="p">.</span><span class="n">combinations</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">traj</span><span class="p">.</span><span class="n">n_atoms</span><span 
class="p">),</span><span class="mi">2</span><span class="p">)]</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">delayed_load</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="n">from_sequence</span><span class="p">(</span><span class="n">a</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">mdtraj</span><span class="p">.</span><span class="n">iterload</span><span class="p">(</span><span class="n">tip3p_xtc</span><span class="p">.</span><span class="n">as_posix</span><span class="p">(),</span> <span class="n">top</span><span class="o">=</span><span class="n">tip3p_pdb</span><span class="p">.</span><span class="n">as_posix</span><span class="p">())).</span><span class="n">to_delayed</span><span class="p">()</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">all_displacements</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">mdtraj</span><span class="p">.</span><span class="n">compute_displacements</span><span class="p">)(</span><span class="n">traj</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">atom_pairs</span><span class="p">)</span> <span class="k">for</span> <span class="n">traj</span> <span class="ow">in</span> <span class="n">delayed_load</span><span class="p">]</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 15.6 s, sys: 688 ms, total: 16.3 s Wall time: 16.4 s </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span 
class="o">%%</span><span class="n">time</span> <span class="n">out</span> <span class="o">=</span> <span class="n">dask</span><span class="p">.</span><span class="n">compute</span><span class="p">(</span><span class="n">all_displacements</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 7.98 s, sys: 938 ms, total: 8.92 s Wall time: 8.92 s </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">all_displacements</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">for</span> <span class="n">traj</span> <span class="ow">in</span> <span class="n">mdtraj</span><span class="p">.</span><span class="n">iterload</span><span class="p">(</span><span class="n">tip3p_xtc</span><span class="p">.</span><span class="n">as_posix</span><span class="p">(),</span> <span class="n">top</span><span class="o">=</span><span class="n">tip3p_pdb</span><span class="p">.</span><span class="n">as_posix</span><span class="p">()):</span> <span class="n">all_displacements</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">mdtraj</span><span class="p">.</span><span class="n">compute_displacements</span><span class="p">(</span><span class="n">traj</span><span class="p">,</span> <span class="n">atom_pairs</span><span class="p">))</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 1.17 s, sys: 1.09 s, total: 2.27 s Wall time: 2.26 s </code></pre></div></div> <h2 id="trying-dask-distributed">Trying Dask distributed</h2> <p>We could try another level of parallelism using Dask’s distributed framework <a href="https://docs.dask.org/en/latest/setup/single-distributed.html">on a 
single node</a>, but there appear to be <a href="https://github.com/dask/distributed/issues/2543">Dask distributed issues with WSL</a>.</p> <p>Regardless, we can still see what happens</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">distributed</span> <span class="kn">import</span> <span class="n">Client</span> <span class="n">client</span> <span class="o">=</span> <span class="n">Client</span><span class="p">()</span> <span class="n">client</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available </code></pre></div></div> <table style="border: 2px solid white;"> <tr> <td style="vertical-align: top; border: 0px solid white"> <h3 style="text-align: left;">Client</h3> <ul style="text-align: left; list-style: none; margin: 0; padding: 0;"> <li><b>Scheduler: </b>tcp://127.0.0.1:54022</li> <li><b>Dashboard: </b><a href="http://127.0.0.1:8787/status" 
target="_blank">http://127.0.0.1:8787/status</a></li> </ul> </td> <td style="vertical-align: top; border: 0px solid white"> <h3 style="text-align: left;">Cluster</h3> <ul style="text-align: left; list-style:none; margin: 0; padding: 0;"> <li><b>Workers: </b>3</li> <li><b>Cores: </b>6</li> <li><b>Memory: </b>17.11 GB</li> </ul> </td> </tr> </table> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available </code></pre></div></div> <p>With default settings, we’re working with 3 workers across 6 cores.</p> <p>We can see from the Dask dashboard that there are certainly concurrent operations, but the yellow operation (<code class="language-plaintext highlighter-rouge">disk-read-compute_displacements</code>) is adding a lot of overhead beyond that purple operation (the actual <code class="language-plaintext highlighter-rouge">compute_displacements</code>)</p> <p><img src="/images/2020-05-13-dask-mdtraj_files/dask_mdtraj_6workers.png" alt="png" /></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">delayed_load</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="n">from_sequence</span><span class="p">(</span><span class="n">a</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">mdtraj</span><span class="p">.</span><span class="n">iterload</span><span class="p">(</span><span class="n">tip3p_xtc</span><span class="p">.</span><span class="n">as_posix</span><span class="p">(),</span> <span class="n">top</span><span 
class="o">=</span><span class="n">tip3p_pdb</span><span class="p">.</span><span class="n">as_posix</span><span class="p">())).</span><span class="n">to_delayed</span><span class="p">()</span> <span class="n">all_displacements</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">mdtraj</span><span class="p">.</span><span class="n">compute_displacements</span><span class="p">)(</span><span class="n">traj</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">atom_pairs</span><span class="p">)</span> <span class="k">for</span> <span class="n">traj</span> <span class="ow">in</span> <span class="n">delayed_load</span><span class="p">]</span> <span class="n">out</span> <span class="o">=</span> <span class="n">dask</span><span class="p">.</span><span class="n">compute</span><span class="p">(</span><span class="n">all_displacements</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on 
TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available CPU times: user 37.6 s, sys: 12.4 s, total: 50 s Wall time: 57.6 s </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">client</span><span class="p">.</span><span class="n">close</span><span class="p">()</span> <span class="n">client</span> <span class="o">=</span> <span class="n">Client</span><span class="p">(</span><span class="n">processes</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="n">client</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available </code></pre></div></div> <table style="border: 2px solid white;"> <tr> <td style="vertical-align: top; border: 0px solid white"> <h3 style="text-align: left;">Client</h3> <ul style="text-align: left; list-style: none; margin: 0; padding: 0;"> <li><b>Scheduler: </b>inproc://192.168.0.15/667/12</li> <li><b>Dashboard: </b><a href="http://192.168.0.15:8787/status" target="_blank">http://192.168.0.15:8787/status</a></li> </ul> </td> <td style="vertical-align: top; border: 0px solid white"> <h3 style="text-align: left;">Cluster</h3> <ul style="text-align: left; list-style:none; margin: 0; padding: 0;"> <li><b>Workers: </b>1</li> <li><b>Cores: </b>6</li> <li><b>Memory: </b>17.11 GB</li> </ul> </td> </tr> </table> <p>Running all workers on the same 
process, there’s still some room for multithreading, but the same slow-downs rear their heads</p> <p><img src="/images/2020-05-13-dask-mdtraj_files/dask_mdtraj_noproc.png" alt="png" /></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span> <span class="n">delayed_load</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="n">from_sequence</span><span class="p">(</span><span class="n">a</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">mdtraj</span><span class="p">.</span><span class="n">iterload</span><span class="p">(</span><span class="n">tip3p_xtc</span><span class="p">.</span><span class="n">as_posix</span><span class="p">(),</span> <span class="n">top</span><span class="o">=</span><span class="n">tip3p_pdb</span><span class="p">.</span><span class="n">as_posix</span><span class="p">())).</span><span class="n">to_delayed</span><span class="p">()</span> <span class="n">all_displacements</span> <span class="o">=</span> <span class="p">[</span><span class="n">delayed</span><span class="p">(</span><span class="n">mdtraj</span><span class="p">.</span><span class="n">compute_displacements</span><span class="p">)(</span><span class="n">traj</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">atom_pairs</span><span class="p">)</span> <span class="k">for</span> <span class="n">traj</span> <span class="ow">in</span> <span class="n">delayed_load</span><span class="p">]</span> <span class="n">out</span> <span class="o">=</span> <span class="n">dask</span><span class="p">.</span><span class="n">compute</span><span class="p">(</span><span class="n">all_displacements</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>distributed.utils_perf - WARNING - full garbage collections took 45% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 44% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 44% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 45% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 45% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 46% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 46% CPU time recently (threshold: 10%) CPU times: user 51 s, sys: 1.22 s, total: 52.2 s Wall time: 52.9 s </code></pre></div></div> <h2 id="takeaways-from-some-dask-tests">Takeaways from some Dask tests</h2> <p>The observations here were surprising, but maybe a good lesson before anyone immediately tries to jump into some big data tools</p> <h3 id="mdtraj-is-really-performant">MDTraj is really performant</h3> <p>If you’re able to use MDTraj-optimized functions, use those. If you want to be memory efficient and stream trajectory data, use MDTraj for that; you don’t need to schedule loading different slices of a trajectory with Dask.</p> <h3 id="an-optimized-library-can-beat-the-bloat-of-a-scheduler">An optimized library can beat the bloat of a scheduler</h3> <p>Combining Dask + MDTraj was worse in all cases than just using MDTraj exclusively. Dask’s parallelization didn’t make any MDTraj call run faster, and Dask’s delayed loading was no better than MDTraj’s iterload. This might be because of multiple reads, communication between workers, or the overhead of building and scheduling the task graph.</p> <p>If the opportunity, resources, and need exist, optimizing a library can go farther than trying to lump Dask on top of any code. 
Dask + my-bad-distance-code ran faster than my-bad-distance-code alone, but my-bad-distance-code was completely devoid of optimization. Throw in an optimized library like MDTraj, and you likely won’t need Dask (or your poorly-written code!).</p> <h3 id="if-you-have-a-particularly-unique-function-you-dont-know-how-to-optimize-then-its-time-to-think-about-what-dask-can-offer">If you have a particularly unique function you don’t know how to optimize, then it’s time to think about what Dask can offer</h3> <p>MDTraj is great because it provides a set of common, optimized functions. For a lot of work in this field, there will be unique analyses that most MD libraries don’t provide, or provide without optimization. If that’s the case for your particular studies, then your options become</p> <p>1) Optimize your analysis code. Simplify routines for time and space complexity, reduce for-loops if you can, reduce the number of read/write operations, write Cython/C/CUDA/compiled code</p> <p>2) Use a parallel/scheduler framework like Dask</p> <p>If you’re not a (parallel) programming wiz or lack the time to become one, then option 2 may be for you</p> <h3 id="it-doesnt-help-that-were-working-with-different-data">It doesn’t help that we’re working with different data</h3> <p>A lot of Dask use-cases and APIs are built around arrays and dataframes, so there’s already a lot of built-in optimization for those data structures. There may be room to build a Dask-trajectory object that opens the door to computational optimization (rather than stringing together a bunch of non-Dask operations) and might be able to beat MDTraj</p> <p>Lastly, the notebook can be found <a href="https://github.com/ahy3nz/ahy3nz.github.io/tree/master/files/notebooks">here</a></p> <p>Big data tools for MD simulation analysis</p>
<p>Digging through some Folding@Home data, 2020-05-06T00:00:00-05:00, https://ahy3nz.github.io/posts/2020/05/study_covid_moonshot</p> <h1 id="learning-cheminformatics-from-some-foldinghome-data">Learning cheminformatics from some Folding@Home data</h1> <p><img src="/images/2020-05-06-study_covid_moonshot_files/2020-05-06-study_covid_moonshot_75_0.png" alt="png" /></p> <p>Top 10 (based on Hybrid2 docking score) small molecules</p> <p>2020-05-06 - 2020-05-11</p> <p>I have no formal training in cheminformatics, so I am going to be stumbling and learning as I wade through this dataset. I welcome any lessons from experts.</p> <p>This will be an ongoing foray</p> <p>Source: https://github.com/FoldingAtHome/covid-moonshot</p> <h2 id="introduction">Introduction</h2> <p>Folding@Home is a distributed computing project that allows molecular simulations to be run in parallel across thousands of different computers with minimal communication. This, combined with other molecular modeling methods, has yielded a lot of open data for others to examine. 
In particular, I’m interested in the docking screens and compounds targeted by the F@H and postera collaborations</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="kn">import</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="n">pd</span><span class="p">.</span><span class="n">options</span><span class="p">.</span><span class="n">display</span><span class="p">.</span><span class="n">max_columns</span> <span class="o">=</span> <span class="mi">999</span> <span class="n">moonshot_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'moonshot-submissions/covid_submissions_all_info.csv'</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">moonshot_df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>SMILES</th> <th>CID</th> <th>creator</th> <th>fragments</th> <th>link</th> <th>real_space</th> <th>SCR</th> <th>BB</th> <th>extended_real_space</th> <th>in_molport_or_mcule</th> <th>in_ultimate_mcule</th> <th>in_emolecules</th> <th>covalent_frag</th> <th>covalent_warhead</th> <th>acrylamide</th> 
<th>acrylamide_adduct</th> <th>chloroacetamide</th> <th>chloroacetamide_adduct</th> <th>vinylsulfonamide</th> <th>vinylsulfonamide_adduct</th> <th>nitrile</th> <th>nitrile_adduct</th> <th>MW</th> <th>cLogP</th> <th>HBD</th> <th>HBA</th> <th>TPSA</th> <th>num_criterion_violations</th> <th>BMS</th> <th>Dundee</th> <th>Glaxo</th> <th>Inpharmatica</th> <th>LINT</th> <th>MLSMR</th> <th>PAINS</th> <th>SureChEMBL</th> <th>PostEra</th> <th>ORDERED</th> <th>MADE</th> <th>ASSAYED</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>CCN(Cc1cccc(-c2ccncc2)c1)C(=O)Cn1nnc2ccccc21</td> <td>AAR-POS-8a4e0f60-1</td> <td>Aaron Morris, PostEra</td> <td>x0072</td> <td>https://covid.postera.ai/covid/submissions/AAR...</td> <td>Z1260533612</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>371.444</td> <td>3.5420</td> <td>0</td> <td>5</td> <td>63.91</td> <td>0</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>True</td> <td>False</td> <td>False</td> </tr> <tr> <th>1</th> <td>O=C(Cn1nnc2ccccc21)NCc1ccc(Oc2cccnc2)c(F)c1</td> <td>AAR-POS-8a4e0f60-10</td> <td>Aaron Morris, PostEra</td> <td>x0072</td> <td>https://covid.postera.ai/covid/submissions/AAR...</td> <td>Z826180044</td> <td>FALSE</td> <td>FALSE</td> <td>s_22____1723102____13206668</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>377.379</td> <td>3.0741</td> <td>1</td> <td>6</td> <td>81.93</td> <td>0</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>True</td> <td>False</td> <td>False</td> </tr> <tr> <th>2</th> 
<td>CN(Cc1nnc2ccccn12)C(=O)N(Cc1cccs1)c1ccc(Br)cc1</td> <td>AAR-POS-8a4e0f60-11</td> <td>Aaron Morris, PostEra</td> <td>x0072</td> <td>https://covid.postera.ai/covid/submissions/AAR...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>456.369</td> <td>4.8119</td> <td>0</td> <td>5</td> <td>53.74</td> <td>0</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>Filter9_metal</td> <td>aryl bromide</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>True</td> <td>False</td> <td>False</td> </tr> <tr> <th>3</th> <td>CCN(Cc1cccc(-c2ccncc2)c1)C(=O)Cc1noc2ccccc12</td> <td>AAR-POS-8a4e0f60-2</td> <td>Aaron Morris, PostEra</td> <td>x0072</td> <td>https://covid.postera.ai/covid/submissions/AAR...</td> <td>Z1260535907</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>371.440</td> <td>4.4810</td> <td>0</td> <td>4</td> <td>59.23</td> <td>0</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>True</td> <td>False</td> <td>False</td> </tr> <tr> <th>4</th> <td>O=C(NCc1noc2ccccc12)N(Cc1cccs1)c1ccc(F)cc1</td> <td>AAR-POS-8a4e0f60-3</td> <td>Aaron Morris, PostEra</td> <td>x0072</td> <td>https://covid.postera.ai/covid/submissions/AAR...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>s_272164____9388766____17338746</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>381.432</td> <td>4.9448</td> <td>1</td> <td>4</td> <td>58.37</td> 
<td>0</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>True</td> <td>False</td> <td>False</td> </tr> </tbody> </table> </div> <p>The moonshot data has a lot of logging/metadata information, some one-hot-encoding information about functional groups, and some additional columns about Glaxo, Dundee, BMS, LINT, PAINS, SureChEMBL - I’m not sure what those additional columns mean, but the values look like pass/fail flags, possibly the results of some other test or availability in another database.</p> <p>I’m going to focus on the molecular properties: MW, cLogP, HBD, HBA, TPSA</p> <ul> <li>MW: Molecular Weight</li> <li>cLogP: The calculated logarithm of the partition coefficient (ratio of concentrations in octanol vs water, $\log{\frac{c_{octanol}}{c_{water}}}$)</li> <li>HBD: Hydrogen bond donors</li> <li>HBA: Hydrogen bond acceptors</li> <li>TPSA: Topological polar surface area</li> </ul> <p>Some of the correlations make some chemical sense - heavier molecules have more heavy atoms (O, N, F, etc.), and these heavy atoms are also the hydrogen bond acceptors. By that logic, more heavy atoms also coincide with more electronegative atoms, increasing your TPSA. It’s a little convoluted because TPSA looks at the surface, not necessarily the volume of the compound; geometry/shape will influence TPSA. There don’t appear to be any strong correlations with cLogP. 
Partition coefficients are a complex function of polarity, size/sterics, and shape - a 1:1 correlation with a singular, other variable will be hard to pinpoint</p> <p>This csv file doesn’t have much other numerical data, but maybe some of those true/false, pass/fail data might be relevant…but I definitely need more context here</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">6</span><span class="p">),</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">cols</span> <span class="o">=</span> <span class="p">[</span><span class="s">'MW'</span><span class="p">,</span> <span class="s">'cLogP'</span><span class="p">,</span> <span class="s">'HBD'</span><span class="p">,</span> <span class="s">'HBA'</span><span class="p">,</span> <span class="s">'TPSA'</span><span class="p">]</span> <span class="n">ax</span><span class="p">.</span><span class="n">matshow</span><span class="p">(</span><span class="n">moonshot_df</span><span class="p">[</span><span class="n">cols</span><span class="p">].</span><span class="n">corr</span><span class="p">(),</span> <span class="n">cmap</span><span class="o">=</span><span class="s">'RdBu'</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">([</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">_</span> <span class="ow">in</span> <span 
class="nb">enumerate</span><span class="p">(</span><span class="n">cols</span><span class="p">)])</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="n">cols</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_yticks</span><span class="p">([</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">_</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">cols</span><span class="p">)])</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_yticklabels</span><span class="p">(</span><span class="n">cols</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">rowname</span><span class="p">,</span> <span class="n">row</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">moonshot_df</span><span class="p">[</span><span class="n">cols</span><span class="p">].</span><span class="n">corr</span><span class="p">().</span><span class="n">iterrows</span><span class="p">()):</span> <span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">row</span><span class="p">.</span><span class="n">items</span><span class="p">()):</span> <span class="n">ax</span><span class="p">.</span><span class="n">annotate</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">val</span><span class="p">:</span><span class="mf">0.2</span><span class="n">f</span><span 
class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">),</span> <span class="n">xytext</span><span class="o">=</span><span class="p">(</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span> <span class="o">-</span><span class="mi">5</span><span class="p">),</span> <span class="n">textcoords</span><span class="o">=</span><span class="s">"offset points"</span><span class="p">)</span> </code></pre></div></div> <p><img src="/images/2020-05-06-study_covid_moonshot_files/2020-05-06-study_covid_moonshot_4_0.png" alt="png" /></p> <h2 id="some-docking-results">Some docking results</h2> <p>Okay here’s a couple other CSVs I found, these include some docking scores</p> <ul> <li>Repurposing scores: “The Drug Repurposing Hub is a curated and annotated collection of FDA-approved drugs, clinical trial drugs, and pre-clinical tool compounds with a companion information resource” <a href="https://clue.io/repurposing">source here</a>, so a public dataset of some drugs</li> <li>Redock scores: “This directory contains experiments in redocking all screened fragments into the entire ensemble of X-ray structures.” Taking fragments and re-docking them</li> </ul> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">repurposing_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'repurposing-screen/drugset-docked.csv'</span><span class="p">)</span> <span class="n">redock_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'redock-fragments/all-screened-fragments-docked.csv'</span><span class="p">)</span> </code></pre></div></div> 
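<p>Before digging into either CSV, it can help to check which columns are actually populated. The following is a minimal sketch and not part of the original notebook: <code>summarize_frame</code> is a hypothetical helper, and <code>toy_df</code> is made-up data that only borrows a few column names from the docking files.</p>

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, and unique-value count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "non_null": df.notnull().sum(),
            "n_unique": df.nunique(),  # NaN is not counted as a unique value
        }
    )


# Made-up stand-in rows (NOT values from the real CSVs); only the column
# names mirror the docking files.
toy_df = pd.DataFrame(
    {
        "SMILES": ["CCO", "CCO", "c1ccccc1"],
        "TITLE": ["mol_a", "mol_b", "mol_c"],
        "Hybrid2": [-5.0, -4.5, None],
        "site": ["active-covalent", "active-noncovalent", "active-noncovalent"],
    }
)

summary = summarize_frame(toy_df)
print(summary)
```

<p>Calling <code>summarize_frame(repurposing_df)</code> or <code>summarize_frame(redock_df)</code> on the real dataframes should give the same kind of per-column overview, which makes repeated SMILES strings and mostly-empty columns easy to spot.</p>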
<p>Repurposing dataframe: SMILES strings, names, docking scores</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">repurposing_df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>SMILES</th> <th>TITLE</th> <th>Hybrid2</th> <th>docked_fragment</th> <th>Mpro-_dock</th> <th>site</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>C[C@@H](c1ccc-2c(c1)Cc3c2cccc3)C(=O)[O-]</td> <td>CHEMBL2104122</td> <td>-11.519580</td> <td>x0749</td> <td>0.509349</td> <td>active-covalent</td> </tr> <tr> <th>1</th> <td>C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@]2(C#C)O)CCC4...</td> <td>CHEMBL1387</td> <td>-10.580162</td> <td>x0749</td> <td>2.706928</td> <td>active-covalent</td> </tr> <tr> <th>2</th> <td>CC(C)(C)c1cc(cc(c1O)C(C)(C)C)/C=C\2/C(=O)NC(=[...</td> <td>CHEMBL275835</td> <td>-10.557229</td> <td>x0107</td> <td>1.801830</td> <td>active-noncovalent</td> </tr> <tr> <th>3</th> <td>C[C@]12CC[C@@H]3[C@H]4CCCCC4=CC[C@H]3[C@@H]1CC...</td> <td>CHEMBL2104104</td> <td>-10.480992</td> <td>x0749</td> <td>3.791700</td> <td>active-covalent</td> </tr> <tr> <th>4</th> <td>CC(=O)[C@]1(CC[C@@H]2[C@@]1(CCC3=C4CCC(=O)C=C4...</td> <td>CHEMBL2104231</td> <td>-10.430775</td> <td>x0749</td> <td>4.230903</td> <td>active-covalent</td> </tr> </tbody> </table> </div> <p><a href="https://docs.eyesopen.com/toolkits/java/dockingtk/docking.html">Hybrid2</a> looks like a docking method provided via OpenEye. Mpro likely refers to the main protease of SARS-CoV-2, the virus that causes COVID-19. I’m not entirely sure what the receptor for “Hybrid2” is, but there seem to be multiple “sites” or “fragments” for docking. There are lots of different fragments, but very few sites. 
For each site-fragment combination, multiple small molecules may have been tested.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">repurposing_df</span><span class="p">[</span><span class="s">'docked_fragment'</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x0195 114 x0749 69 x0678 58 x0397 45 x0104 24 x0161 21 x1077 19 x0072 14 x0874 13 x0354 13 x0689 10 x1382 7 x0708 4 x0434 4 x1093 3 x1392 2 x0395 2 x1402 2 x0831 2 x0107 2 x1385 2 x1418 2 x0387 2 x0830 2 x1478 1 x0786 1 x1187 1 x0692 1 x0967 1 x0426 1 x0305 1 x0946 1 x1386 1 x0759 1 Name: docked_fragment, dtype: int64 </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">repurposing_df</span><span class="p">[</span><span class="s">'site'</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>active-noncovalent 338 active-covalent 107 dimer-interface 1 Name: site, dtype: int64 </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">repurposing_df</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">"docked_fragment"</span><span class="p">,</span> <span class="s">"site"</span><span class="p">]).</span><span class="n">count</span><span class="p">()</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" 
class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th></th> <th>SMILES</th> <th>TITLE</th> <th>Hybrid2</th> <th>Mpro-_dock</th> </tr> <tr> <th>docked_fragment</th> <th>site</th> <th></th> <th></th> <th></th> <th></th> </tr> </thead> <tbody> <tr> <th>x0072</th> <th>active-noncovalent</th> <td>14</td> <td>14</td> <td>14</td> <td>14</td> </tr> <tr> <th>x0104</th> <th>active-noncovalent</th> <td>24</td> <td>24</td> <td>24</td> <td>24</td> </tr> <tr> <th>x0107</th> <th>active-noncovalent</th> <td>2</td> <td>2</td> <td>2</td> <td>2</td> </tr> <tr> <th>x0161</th> <th>active-noncovalent</th> <td>21</td> <td>21</td> <td>21</td> <td>21</td> </tr> <tr> <th>x0195</th> <th>active-noncovalent</th> <td>114</td> <td>114</td> <td>114</td> <td>114</td> </tr> <tr> <th>x0305</th> <th>active-noncovalent</th> <td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> <tr> <th>x0354</th> <th>active-noncovalent</th> <td>13</td> <td>13</td> <td>13</td> <td>13</td> </tr> <tr> <th>x0387</th> <th>active-noncovalent</th> <td>2</td> <td>2</td> <td>2</td> <td>2</td> </tr> <tr> <th>x0395</th> <th>active-noncovalent</th> <td>2</td> <td>2</td> <td>2</td> <td>2</td> </tr> <tr> <th>x0397</th> <th>active-noncovalent</th> <td>45</td> <td>45</td> <td>45</td> <td>45</td> </tr> <tr> <th>x0426</th> <th>active-noncovalent</th> <td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> <tr> <th>x0434</th> <th>active-noncovalent</th> <td>4</td> <td>4</td> <td>4</td> <td>4</td> </tr> <tr> <th>x0678</th> <th>active-noncovalent</th> <td>58</td> <td>58</td> <td>58</td> <td>58</td> </tr> <tr> <th>x0689</th> <th>active-covalent</th> <td>10</td> <td>10</td> <td>10</td> <td>10</td> </tr> <tr> <th>x0692</th> <th>active-covalent</th> <td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> <tr> <th>x0708</th> <th>active-covalent</th> <td>4</td> <td>4</td> <td>4</td> <td>4</td> </tr> <tr> <th>x0749</th> <th>active-covalent</th> <td>69</td> <td>69</td> <td>69</td> <td>69</td> </tr> <tr> <th>x0759</th> <th>active-covalent</th> 
<td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> <tr> <th>x0786</th> <th>active-covalent</th> <td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> <tr> <th>x0830</th> <th>active-covalent</th> <td>2</td> <td>2</td> <td>2</td> <td>2</td> </tr> <tr> <th>x0831</th> <th>active-covalent</th> <td>2</td> <td>2</td> <td>2</td> <td>2</td> </tr> <tr> <th>x0874</th> <th>active-noncovalent</th> <td>13</td> <td>13</td> <td>13</td> <td>13</td> </tr> <tr> <th>x0946</th> <th>active-noncovalent</th> <td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> <tr> <th>x0967</th> <th>active-noncovalent</th> <td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> <tr> <th>x1077</th> <th>active-noncovalent</th> <td>19</td> <td>19</td> <td>19</td> <td>19</td> </tr> <tr> <th>x1093</th> <th>active-noncovalent</th> <td>3</td> <td>3</td> <td>3</td> <td>3</td> </tr> <tr> <th>x1187</th> <th>dimer-interface</th> <td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> <tr> <th>x1382</th> <th>active-covalent</th> <td>7</td> <td>7</td> <td>7</td> <td>7</td> </tr> <tr> <th>x1385</th> <th>active-covalent</th> <td>2</td> <td>2</td> <td>2</td> <td>2</td> </tr> <tr> <th>x1386</th> <th>active-covalent</th> <td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> <tr> <th>x1392</th> <th>active-covalent</th> <td>2</td> <td>2</td> <td>2</td> <td>2</td> </tr> <tr> <th>x1402</th> <th>active-covalent</th> <td>2</td> <td>2</td> <td>2</td> <td>2</td> </tr> <tr> <th>x1418</th> <th>active-covalent</th> <td>2</td> <td>2</td> <td>2</td> <td>2</td> </tr> <tr> <th>x1478</th> <th>active-covalent</th> <td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> </tbody> </table> </div> <p>Some molecules show up multiple times - why? 
Upon further investigation, this is mainly due to the molecule’s presence in multiple databases</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">repurposing_df</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'SMILES'</span><span class="p">]).</span><span class="n">count</span><span class="p">().</span><span class="n">sort_values</span><span class="p">(</span><span class="s">"TITLE"</span><span class="p">)</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>TITLE</th> <th>Hybrid2</th> <th>docked_fragment</th> <th>Mpro-_dock</th> <th>site</th> </tr> <tr> <th>SMILES</th> <th></th> <th></th> <th></th> <th></th> <th></th> </tr> </thead> <tbody> <tr> <th>B(CCCC)(O)O</th> <td>1</td> <td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> <tr> <th>CCCc1ccccc1N</th> <td>1</td> <td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> <tr> <th>CCCc1cc(=O)[nH]c(=S)[nH]1</th> <td>1</td> <td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> <tr> <th>CCC[N@@H+]1CCO[C@H]2[C@H]1CCc3c2cc(cc3)O</th> <td>1</td> <td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> <tr> <th>CCC[N@@H+]1CCC[C@H]2[C@H]1Cc3c[nH]nc3C2</th> <td>1</td> <td>1</td> <td>1</td> <td>1</td> <td>1</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@]2(C#C)O)CCC4=CC(=O)CC[C@H]34</th> <td>2</td> <td>2</td> <td>2</td> <td>2</td> <td>2</td> </tr> <tr> <th>C[C@]12CC[C@H]3[C@H]([C@@H]1CCC2=O)CC(=C)C4=CC(=O)C=C[C@]34C</th> <td>2</td> <td>2</td> <td>2</td> <td>2</td> <td>2</td> </tr> <tr> <th>CC(C)C[C@@H](C1(CCC1)c2ccc(cc2)Cl)[NH+](C)C</th> <td>2</td> <td>2</td> <td>2</td> 
<td>2</td> <td>2</td> </tr> <tr> <th>CC[C@](/C=C/Cl)(C#C)O</th> <td>2</td> <td>2</td> <td>2</td> <td>2</td> <td>2</td> </tr> <tr> <th>CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CCC4=CC(=O)CC[C@H]34</th> <td>2</td> <td>2</td> <td>2</td> <td>2</td> <td>2</td> </tr> </tbody> </table> <p>432 rows × 5 columns</p> </div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">repurposing_df</span><span class="p">[</span><span class="n">repurposing_df</span><span class="p">[</span><span class="s">'SMILES'</span><span class="p">]</span><span class="o">==</span><span class="s">"CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CCC4=CC(=O)CC[C@H]34"</span><span class="p">]</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>SMILES</th> <th>TITLE</th> <th>Hybrid2</th> <th>docked_fragment</th> <th>Mpro-_dock</th> <th>site</th> </tr> </thead> <tbody> <tr> <th>82</th> <td>CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CC...</td> <td>CHEMBL2107797</td> <td>-9.002963</td> <td>x0749</td> <td>2.616094</td> <td>active-covalent</td> </tr> <tr> <th>105</th> <td>CC[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@]2(C#C)O)CC...</td> <td>EDRUG178</td> <td>-8.705896</td> <td>x0104</td> <td>2.248707</td> <td>active-noncovalent</td> </tr> </tbody> </table> </div> <p>There doesn’t seem to be a very good correlation between the two docking scores - if these are docking scores to different receptors, that would help explain things. It’s worth noting that we’re not seeing if the two numbers agree for each molecule, but if the trends persist (both scores go up for this molecule, but go down for this other molecule). 
The moderate correlation (a coefficient of about 0.58) suggests the trends only partially persist between the two docking measures.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">repurposing_df</span><span class="p">[[</span><span class="s">'Hybrid2'</span><span class="p">,</span> <span class="s">'Mpro-_dock'</span><span class="p">]].</span><span class="n">corr</span><span class="p">()</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Hybrid2</th> <th>Mpro-_dock</th> </tr> </thead> <tbody> <tr> <th>Hybrid2</th> <td>1.000000</td> <td>0.581966</td> </tr> <tr> <th>Mpro-_dock</th> <td>0.581966</td> <td>1.000000</td> </tr> </tbody> </table> </div> <p>Redocking dataframe: SMILES, names, data collection information, docking scores</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">redock_df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>SMILES</th> <th>TITLE</th> <th>fragments</th> <th>CompoundCode</th> <th>Unnamed: 4</th> <th>covalent_warhead</th> <th>MountingResult</th> <th>DataCollectionOutcome</th> <th>DataProcessingResolutionHigh</th> <th>RefinementOutcome</th> <th>Deposition_PDB_ID</th> <th>Hybrid2</th> <th>docked_fragment</th> <th>Mpro-x0500_dock</th> <th>site</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>c1ccc(c(c1)NCc2ccn[nH]2)F</td> <td>x0500</td> <td>x0500</td> 
<td>Z1545196403</td> <td>NaN</td> <td>False</td> <td>OK: No comment:No comment</td> <td>success</td> <td>2.19</td> <td>7 - Analysed &amp; Rejected</td> <td>NaN</td> <td>-11.881923</td> <td>x0678</td> <td>-2.501554</td> <td>active-noncovalent</td> </tr> <tr> <th>1</th> <td>Cc1ccccc1OCC(=O)Nc2ncccn2</td> <td>x0415</td> <td>x0415</td> <td>Z53834613</td> <td>NaN</td> <td>False</td> <td>OK: No comment:No comment</td> <td>success</td> <td>1.62</td> <td>7 - Analysed &amp; Rejected</td> <td>NaN</td> <td>-11.622278</td> <td>x0678</td> <td>NaN</td> <td>active-noncovalent</td> </tr> <tr> <th>2</th> <td>Cc1csc(n1)CNC(=O)c2ccn[nH]2</td> <td>x0356</td> <td>x0356</td> <td>Z466628048</td> <td>NaN</td> <td>False</td> <td>OK: No comment:No comment</td> <td>success</td> <td>3.25</td> <td>7 - Analysed &amp; Rejected</td> <td>NaN</td> <td>-11.435024</td> <td>x0678</td> <td>NaN</td> <td>active-noncovalent</td> </tr> <tr> <th>3</th> <td>Cc1csc(n1)CNC(=O)c2ccn[nH]2</td> <td>x1113</td> <td>x1113</td> <td>Z466628048</td> <td>NaN</td> <td>False</td> <td>OK: No comment:No comment</td> <td>success</td> <td>1.57</td> <td>7 - Analysed &amp; Rejected</td> <td>NaN</td> <td>-11.435024</td> <td>x0678</td> <td>NaN</td> <td>active-noncovalent</td> </tr> <tr> <th>4</th> <td>c1cc(cnc1)NC(=O)CC2CCCCC2</td> <td>x0678</td> <td>x0678</td> <td>Z31792168</td> <td>NaN</td> <td>False</td> <td>Mounted_Clear</td> <td>success</td> <td>1.83</td> <td>6 - Deposited</td> <td>5R84</td> <td>-11.355046</td> <td>x0678</td> <td>NaN</td> <td>active-noncovalent</td> </tr> </tbody> </table> </div> <p>There don’t seem to be many Mpro docking scores in this dataset (only one molecule has a non-null Mpro docking score)</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">redock_df</span><span class="p">[</span><span class="n">redock_df</span><span class="p">[</span><span class="s">'Mpro-x0500_dock'</span><span class="p">].</span><span class="n">isnull</span><span 
class="p">()].</span><span class="n">count</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SMILES 1452 TITLE 1452 fragments 1452 CompoundCode 1452 Unnamed: 4 0 covalent_warhead 1452 MountingResult 1452 DataCollectionOutcome 1452 DataProcessingResolutionHigh 1357 RefinementOutcome 1306 Deposition_PDB_ID 78 Hybrid2 1452 docked_fragment 1452 Mpro-x0500_dock 0 site 1452 dtype: int64 </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">redock_df</span><span class="p">[</span><span class="o">~</span><span class="n">redock_df</span><span class="p">[</span><span class="s">'Mpro-x0500_dock'</span><span class="p">].</span><span class="n">isnull</span><span class="p">()].</span><span class="n">count</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SMILES 1 TITLE 1 fragments 1 CompoundCode 1 Unnamed: 4 0 covalent_warhead 1 MountingResult 1 DataCollectionOutcome 1 DataProcessingResolutionHigh 1 RefinementOutcome 1 Deposition_PDB_ID 0 Hybrid2 1 docked_fragment 1 Mpro-x0500_dock 1 site 1 dtype: int64 </code></pre></div></div> <p>Are there overlaps in the molecules in each of these datasets?</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">repurpose_redock</span> <span class="o">=</span> <span class="n">repurposing_df</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">redock_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'SMILES'</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s">'inner'</span><span class="p">,</span><span class="n">suffixes</span><span 
class="o">=</span><span class="p">(</span><span class="s">"_L"</span><span class="p">,</span> <span class="s">"_R"</span><span class="p">))</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">moonshot_redock</span> <span class="o">=</span> <span class="n">moonshot_df</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">redock_df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'SMILES'</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s">'inner'</span><span class="p">,</span><span class="n">suffixes</span><span class="o">=</span><span class="p">(</span><span class="s">"_L"</span><span class="p">,</span> <span class="s">"_R"</span><span class="p">))</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">repurpose_redock</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>SMILES</th> <th>TITLE_L</th> <th>Hybrid2_L</th> <th>docked_fragment_L</th> <th>Mpro-_dock</th> <th>site_L</th> <th>TITLE_R</th> <th>fragments</th> <th>CompoundCode</th> <th>Unnamed: 4</th> <th>covalent_warhead</th> <th>MountingResult</th> <th>DataCollectionOutcome</th> <th>DataProcessingResolutionHigh</th> <th>RefinementOutcome</th> <th>Deposition_PDB_ID</th> <th>Hybrid2_R</th> <th>docked_fragment_R</th> <th>Mpro-x0500_dock</th> <th>site_R</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>Cc1cc(=O)n([nH]1)c2ccccc2</td> <td>CHEMBL290916</td> <td>-7.889587</td> <td>x0195</td> <td>-2.068452</td> <td>active-noncovalent</td> 
<td>x0297</td> <td>x0297</td> <td>Z50145861</td> <td>NaN</td> <td>False</td> <td>OK: No comment:No comment</td> <td>success</td> <td>1.98</td> <td>7 - Analysed &amp; Rejected</td> <td>NaN</td> <td>-7.889587</td> <td>x0195</td> <td>NaN</td> <td>active-noncovalent</td> </tr> <tr> <th>1</th> <td>CC(C)Nc1ncccn1</td> <td>CHEMBL1740513</td> <td>-7.178702</td> <td>x0072</td> <td>-1.248482</td> <td>active-noncovalent</td> <td>x0583</td> <td>x0583</td> <td>Z31190928</td> <td>NaN</td> <td>False</td> <td>OK: No comment:No comment</td> <td>success</td> <td>3.08</td> <td>7 - Analysed &amp; Rejected</td> <td>NaN</td> <td>-7.293537</td> <td>x1093</td> <td>NaN</td> <td>active-noncovalent</td> </tr> <tr> <th>2</th> <td>CC(C)Nc1ncccn1</td> <td>CHEMBL1740513</td> <td>-7.178702</td> <td>x0072</td> <td>-1.248482</td> <td>active-noncovalent</td> <td>x1102</td> <td>x1102</td> <td>Z31190928</td> <td>NaN</td> <td>False</td> <td>OK: No comment:No comment</td> <td>success</td> <td>1.46</td> <td>7 - Analysed &amp; Rejected</td> <td>NaN</td> <td>-7.293537</td> <td>x1093</td> <td>NaN</td> <td>active-noncovalent</td> </tr> <tr> <th>3</th> <td>C[C@H](C(=O)[O-])O</td> <td>CHEMBL1200559</td> <td>-5.675188</td> <td>x0397</td> <td>-0.179049</td> <td>active-noncovalent</td> <td>x1035</td> <td>x1035</td> <td>Z1741982441</td> <td>NaN</td> <td>False</td> <td>OK: No comment:No comment</td> <td>Failed - no diffraction</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>-6.505556</td> <td>x0397</td> <td>NaN</td> <td>active-noncovalent</td> </tr> <tr> <th>4</th> <td>CC(=O)C(=O)[O-]</td> <td>DB00119</td> <td>-5.448891</td> <td>x0689</td> <td>-0.494791</td> <td>active-covalent</td> <td>x1037</td> <td>x1037</td> <td>Z1741977082</td> <td>NaN</td> <td>False</td> <td>OK: No comment:No comment</td> <td>Failed - no diffraction</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>-5.448891</td> <td>x0689</td> <td>NaN</td> <td>active-covalent</td> </tr> <tr> <th>5</th> <td>CCC(=O)[O-]</td> <td>CHEMBL14021</td> 
<td>-5.374838</td> <td>x0397</td> <td>-0.555688</td> <td>active-noncovalent</td> <td>x1029</td> <td>x1029</td> <td>Z955123616</td> <td>NaN</td> <td>False</td> <td>OK: No comment:No comment</td> <td>success</td> <td>1.73</td> <td>7 - Analysed &amp; Rejected</td> <td>NaN</td> <td>-5.135675</td> <td>x0689</td> <td>NaN</td> <td>active-covalent</td> </tr> <tr> <th>6</th> <td>C1CNCC[NH2+]1</td> <td>CHEMBL1412</td> <td>-5.079155</td> <td>x0354</td> <td>1.716032</td> <td>active-noncovalent</td> <td>x0996</td> <td>x0996</td> <td>Z1245537944</td> <td>NaN</td> <td>False</td> <td>OK: No comment:No comment</td> <td>success</td> <td>1.96</td> <td>7 - Analysed &amp; Rejected</td> <td>NaN</td> <td>-4.675085</td> <td>x0354</td> <td>NaN</td> <td>active-noncovalent</td> </tr> </tbody> </table> </div> <p>We joined on SMILES string, and now we can compare the docking scores between the repurposing and redocking datasets.</p> <p>Some <code class="language-plaintext highlighter-rouge">Hybrid2</code> scores look quantitatively similar, but for those that don’t, the ranking is still there. 
Looking at the COVID-19 main protease (Mpro, I believe), the docking scores don’t follow similar rankings: docking scores aren’t transferable between different receptors (this might be a fairly obvious observation).</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">repurpose_redock</span><span class="p">[[</span><span class="s">'SMILES'</span><span class="p">,</span> <span class="s">"TITLE_L"</span><span class="p">,</span> <span class="s">"TITLE_R"</span><span class="p">,</span> <span class="s">"Hybrid2_L"</span><span class="p">,</span> <span class="s">"Hybrid2_R"</span><span class="p">,</span> <span class="s">'Mpro-_dock'</span><span class="p">,</span> <span class="s">'Mpro-x0500_dock'</span><span class="p">]]</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>SMILES</th> <th>TITLE_L</th> <th>TITLE_R</th> <th>Hybrid2_L</th> <th>Hybrid2_R</th> <th>Mpro-_dock</th> <th>Mpro-x0500_dock</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>Cc1cc(=O)n([nH]1)c2ccccc2</td> <td>CHEMBL290916</td> <td>x0297</td> <td>-7.889587</td> <td>-7.889587</td> <td>-2.068452</td> <td>NaN</td> </tr> <tr> <th>1</th> <td>CC(C)Nc1ncccn1</td> <td>CHEMBL1740513</td> <td>x0583</td> <td>-7.178702</td> <td>-7.293537</td> <td>-1.248482</td> <td>NaN</td> </tr> <tr> <th>2</th> <td>CC(C)Nc1ncccn1</td> <td>CHEMBL1740513</td> <td>x1102</td> <td>-7.178702</td> <td>-7.293537</td> <td>-1.248482</td> <td>NaN</td> </tr> <tr> <th>3</th> <td>C[C@H](C(=O)[O-])O</td> <td>CHEMBL1200559</td> <td>x1035</td> <td>-5.675188</td> <td>-6.505556</td> <td>-0.179049</td> <td>NaN</td> </tr> <tr> <th>4</th> <td>CC(=O)C(=O)[O-]</td> <td>DB00119</td> <td>x1037</td> <td>-5.448891</td> <td>-5.448891</td> 
<td>-0.494791</td> <td>NaN</td> </tr> <tr> <th>5</th> <td>CCC(=O)[O-]</td> <td>CHEMBL14021</td> <td>x1029</td> <td>-5.374838</td> <td>-5.135675</td> <td>-0.555688</td> <td>NaN</td> </tr> <tr> <th>6</th> <td>C1CNCC[NH2+]1</td> <td>CHEMBL1412</td> <td>x0996</td> <td>-5.079155</td> <td>-4.675085</td> <td>1.716032</td> <td>NaN</td> </tr> </tbody> </table> </div> <p>Joining the moonshot submission and redocking datasets does not yield too many overlapping molecules</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">moonshot_redock</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>SMILES</th> <th>CID</th> <th>creator</th> <th>fragments_L</th> <th>link</th> <th>real_space</th> <th>SCR</th> <th>BB</th> <th>extended_real_space</th> <th>in_molport_or_mcule</th> <th>in_ultimate_mcule</th> <th>in_emolecules</th> <th>covalent_frag</th> <th>covalent_warhead_L</th> <th>acrylamide</th> <th>acrylamide_adduct</th> <th>chloroacetamide</th> <th>chloroacetamide_adduct</th> <th>vinylsulfonamide</th> <th>vinylsulfonamide_adduct</th> <th>nitrile</th> <th>nitrile_adduct</th> <th>MW</th> <th>cLogP</th> <th>HBD</th> <th>HBA</th> <th>TPSA</th> <th>num_criterion_violations</th> <th>BMS</th> <th>Dundee</th> <th>Glaxo</th> <th>Inpharmatica</th> <th>LINT</th> <th>MLSMR</th> <th>PAINS</th> <th>SureChEMBL</th> <th>PostEra</th> <th>ORDERED</th> <th>MADE</th> <th>ASSAYED</th> <th>TITLE</th> <th>fragments_R</th> <th>CompoundCode</th> <th>Unnamed: 4</th> <th>covalent_warhead_R</th> <th>MountingResult</th> <th>DataCollectionOutcome</th> <th>DataProcessingResolutionHigh</th> <th>RefinementOutcome</th> <th>Deposition_PDB_ID</th> <th>Hybrid2</th> <th>docked_fragment</th> 
<th>Mpro-x0500_dock</th> <th>site</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>CC(C)Nc1cccnc1</td> <td>MAK-UNK-2c1752f0-4</td> <td>Maksym Voznyy</td> <td>x1093</td> <td>https://covid.postera.ai/covid/submissions/MAK...</td> <td>FALSE</td> <td>Z2574930241</td> <td>EN300-56005</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>136.198</td> <td>1.9019</td> <td>1</td> <td>2</td> <td>24.92</td> <td>0</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>x1098</td> <td>x1098</td> <td>Z1259341037</td> <td>NaN</td> <td>False</td> <td>OK: No comment:No comment</td> <td>success</td> <td>1.66</td> <td>7 - Analysed &amp; Rejected</td> <td>NaN</td> <td>-7.474369</td> <td>x0678</td> <td>NaN</td> <td>active-noncovalent</td> </tr> <tr> <th>1</th> <td>CC(C)Nc1cccnc1</td> <td>MAK-UNK-2c1752f0-4</td> <td>Maksym Voznyy</td> <td>x1093</td> <td>https://covid.postera.ai/covid/submissions/MAK...</td> <td>FALSE</td> <td>Z2574930241</td> <td>EN300-56005</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>136.198</td> <td>1.9019</td> <td>1</td> <td>2</td> <td>24.92</td> <td>0</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>x0572</td> <td>x0572</td> <td>Z1259341037</td> <td>NaN</td> <td>False</td> <td>OK: No comment:No comment</td> <td>success</td> <td>2.98</td> <td>7 - Analysed &amp; Rejected</td> <td>NaN</td> <td>-7.474369</td> <td>x0678</td> <td>NaN</td> <td>active-noncovalent</td> </tr> <tr> <th>2</th> 
<td>CCS(=O)(=O)Nc1ccccc1F</td> <td>MAK-UNK-2c1752f0-5</td> <td>Maksym Voznyy</td> <td>x1093</td> <td>https://covid.postera.ai/covid/submissions/MAK...</td> <td>FALSE</td> <td>Z53825177</td> <td>EN300-116204</td> <td>FALSE</td> <td>False</td> <td>True</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>203.238</td> <td>1.5873</td> <td>1</td> <td>2</td> <td>46.17</td> <td>0</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>Hetero_hetero</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>x0247</td> <td>x0247</td> <td>Z53825177</td> <td>NaN</td> <td>False</td> <td>OK: No comment:No comment</td> <td>success</td> <td>1.83</td> <td>7 - Analysed &amp; Rejected</td> <td>NaN</td> <td>-7.413380</td> <td>x0678</td> <td>NaN</td> <td>active-noncovalent</td> </tr> </tbody> </table> </div> <h2 id="comparing-other-databases">Comparing other databases</h2> <p>CHEMBL, DrugBank, and “EDrug”(?) 
look to be the 3 prefixes in the “TITLE” column</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">chembl_webresource_client.new_client</span> <span class="kn">import</span> <span class="n">new_client</span> <span class="n">molecule</span> <span class="o">=</span> <span class="n">new_client</span><span class="p">.</span><span class="n">molecule</span> <span class="n">res</span> <span class="o">=</span> <span class="n">molecule</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="s">'CHEMBL1387'</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">res_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">.</span><span class="n">from_dict</span><span class="p">(</span><span class="n">res</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">res_df</span><span class="p">.</span><span class="n">columns</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['atc_classifications', 'availability_type', 'biotherapeutic', 'black_box_warning', 'chebi_par_id', 'chirality', 'cross_references', 'dosed_ingredient', 'first_approval', 'first_in_class', 'helm_notation', 'indication_class', 'inorganic_flag', 'max_phase', 'molecule_chembl_id', 'molecule_hierarchy', 'molecule_properties', 'molecule_structures', 'molecule_synonyms', 'molecule_type', 'natural_product', 'oral', 'parenteral', 'polymer_flag', 'pref_name', 'prodrug', 'score', 'structure_type', 'therapeutic_flag', 'topical', 'usan_stem', 'usan_stem_definition', 'usan_substem', 'usan_year', 
'withdrawn_class', 'withdrawn_country', 'withdrawn_flag', 'withdrawn_reason', 'withdrawn_year'], dtype='object') </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">res_df</span><span class="p">[[</span><span class="s">'chirality'</span><span class="p">,</span> <span class="s">'molecule_properties'</span><span class="p">,</span> <span class="s">'molecule_structures'</span><span class="p">,</span> <span class="s">'score'</span><span class="p">]]</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>chirality</th> <th>molecule_properties</th> <th>molecule_structures</th> <th>score</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>{'alogp': '3.64', 'aromatic_rings': 0, 'cx_log...</td> <td>{'canonical_smiles': 'C#C[C@]1(O)CC[C@H]2[C@@H...</td> <td>17.0</td> </tr> </tbody> </table> </div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">res_df</span><span class="p">[[</span><span class="s">'molecule_properties'</span><span class="p">]].</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([{'alogp': '3.64', 'aromatic_rings': 0, 'cx_logd': '2.81', 'cx_logp': '2.81', 'cx_most_apka': None, 'cx_most_bpka': None, 'full_molformula': 'C20H26O2', 'full_mwt': '298.43', 'hba': 2, 'hba_lipinski': 2, 'hbd': 1, 'hbd_lipinski': 1, 'heavy_atoms': 22, 'molecular_species': None, 'mw_freebase': '298.43', 'mw_monoisotopic': '298.1933', 'num_lipinski_ro5_violations': 0, 
'num_ro5_violations': 0, 'psa': '37.30', 'qed_weighted': '0.55', 'ro3_pass': 'N', 'rtb': 0}], dtype=object) </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">res_df</span><span class="p">[</span><span class="s">'molecule_properties'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">)</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>alogp</th> <th>aromatic_rings</th> <th>cx_logd</th> <th>cx_logp</th> <th>cx_most_apka</th> <th>cx_most_bpka</th> <th>full_molformula</th> <th>full_mwt</th> <th>hba</th> <th>hba_lipinski</th> <th>hbd</th> <th>hbd_lipinski</th> <th>heavy_atoms</th> <th>molecular_species</th> <th>mw_freebase</th> <th>mw_monoisotopic</th> <th>num_lipinski_ro5_violations</th> <th>num_ro5_violations</th> <th>psa</th> <th>qed_weighted</th> <th>ro3_pass</th> <th>rtb</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>3.64</td> <td>0</td> <td>2.81</td> <td>2.81</td> <td>None</td> <td>None</td> <td>C20H26O2</td> <td>298.43</td> <td>2</td> <td>2</td> <td>1</td> <td>1</td> <td>22</td> <td>None</td> <td>298.43</td> <td>298.1933</td> <td>0</td> <td>0</td> <td>37.30</td> <td>0.55</td> <td>N</td> <td>0</td> </tr> </tbody> </table> </div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">all_results</span> <span class="o">=</span> <span class="p">[</span><span class="n">molecule</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="k">for</span> 
<span class="n">a</span> <span class="ow">in</span> <span class="n">repurposing_df</span><span class="p">[</span><span class="s">'TITLE'</span><span class="p">]]</span> </code></pre></div></div> <p>Here’s a big Python function tangent.</p> <p>For each ChEMBL molecule, we’ve searched for it within the ChEMBL database, which returns a list (of length 1) containing a dictionary of properties.</p> <p>All of these results have been compiled into a list, so we have a list of lists of dictionaries.</p> <p>For sanity, we can use a Python <code class="language-plaintext highlighter-rouge">filter</code> to retain only the non-None results.</p> <p>We can chain that with a Python <code class="language-plaintext highlighter-rouge">map</code> function to pull the first item out of each molecule’s list. Recall that each molecule’s result is a list with just one element, a dictionary, so this boils each result down to the dictionary itself (eliminating the list wrapper).</p> <p>For validation, I’ve called <code class="language-plaintext highlighter-rouge">next</code> to look at the first result.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">filtered</span> <span class="o">=</span> <span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">,</span> <span class="n">all_results</span><span class="p">))</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">next</span><span class="p">(</span><span class="n">filtered</span><span 
class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'atc_classifications': [], 'availability_type': -1, 'biotherapeutic': None, 'black_box_warning': 0, 'chebi_par_id': None, 'chirality': 0, 'cross_references': [], 'dosed_ingredient': False, 'first_approval': None, 'first_in_class': 0, 'helm_notation': None, 'indication_class': 'Anti-Inflammatory', 'inorganic_flag': 0, 'max_phase': 0, 'molecule_chembl_id': 'CHEMBL2104122', 'molecule_hierarchy': {'molecule_chembl_id': 'CHEMBL2104122', 'parent_chembl_id': 'CHEMBL2104122'}, 'molecule_properties': {'alogp': '3.45', 'aromatic_rings': 2, 'cx_logd': '1.26', 'cx_logp': '3.92', 'cx_most_apka': '4.68', 'cx_most_bpka': None, 'full_molformula': 'C16H14O2', 'full_mwt': '238.29', 'hba': 1, 'hba_lipinski': 2, 'hbd': 1, 'hbd_lipinski': 1, 'heavy_atoms': 18, 'molecular_species': 'ACID', 'mw_freebase': '238.29', 'mw_monoisotopic': '238.0994', 'num_lipinski_ro5_violations': 0, 'num_ro5_violations': 0, 'psa': '37.30', 'qed_weighted': '0.74', 'ro3_pass': 'N', 'rtb': 2}, 'molecule_structures': {'canonical_smiles': 'CC(C(=O)O)c1ccc2c(c1)Cc1ccccc1-2', 'molfile': '\n RDKit 2D\n\n 18 20 0 0 0 0 0 0 0 0999 V2000\n -0.5375 0.0250 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n -0.5375 1.1083 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n -2.4458 1.1083 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n -2.4458 0.0250 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 1.3625 0.0250 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n -1.4875 -0.5125 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 0.4125 -0.5125 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 3.3292 0.0250 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 0.4125 1.6500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 2.3417 -0.5292 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 1.3625 1.1083 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 3.3500 1.1958 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0\n 4.2167 -0.6292 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0\n -3.3958 1.6500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n -3.3958 -0.5125 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 
2.3417 -1.6417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n -4.3458 1.1083 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n -4.3458 0.0250 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 2 1 2 0\n 3 2 1 0\n 4 6 1 0\n 5 7 2 0\n 6 1 1 0\n 7 1 1 0\n 8 10 1 0\n 9 2 1 0\n 10 5 1 0\n 11 5 1 0\n 12 8 2 0\n 13 8 1 0\n 14 3 1 0\n 15 4 1 0\n 16 10 1 0\n 17 14 2 0\n 18 15 2 0\n 3 4 2 0\n 9 11 2 0\n 17 18 1 0\nM END\n\n&gt; &lt;chembl_id&gt;\nCHEMBL2104122\n\n&gt; &lt;chembl_pref_name&gt;\nCICLOPROFEN\n\n', 'standard_inchi': 'InChI=1S/C16H14O2/c1-10(16(17)18)11-6-7-15-13(8-11)9-12-4-2-3-5-14(12)15/h2-8,10H,9H2,1H3,(H,17,18)', 'standard_inchi_key': 'LRXFKKPEBXIPMW-UHFFFAOYSA-N'}, 'molecule_synonyms': [{'molecule_synonym': 'Cicloprofen', 'syn_type': 'BAN', 'synonyms': 'CICLOPROFEN'}, {'molecule_synonym': 'Cicloprofen', 'syn_type': 'INN', 'synonyms': 'CICLOPROFEN'}, {'molecule_synonym': 'Cicloprofen', 'syn_type': 'USAN', 'synonyms': 'CICLOPROFEN'}, {'molecule_synonym': 'SQ-20824', 'syn_type': 'RESEARCH_CODE', 'synonyms': 'SQ 20824'}], 'molecule_type': 'Small molecule', 'natural_product': 0, 'oral': False, 'parenteral': False, 'polymer_flag': False, 'pref_name': 'CICLOPROFEN', 'prodrug': 0, 'score': 16.0, 'structure_type': 'MOL', 'therapeutic_flag': False, 'topical': False, 'usan_stem': '-profen', 'usan_stem_definition': 'anti-inflammatory/analgesic agents (ibuprofen type)', 'usan_substem': '-profen', 'usan_year': 1974, 'withdrawn_class': None, 'withdrawn_country': None, 'withdrawn_flag': False, 'withdrawn_reason': None, 'withdrawn_year': None} </code></pre></div></div> <p>For now, I’m only really interested in the <code class="language-plaintext highlighter-rouge">molecule_properties</code> dictionary</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">filtered</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span 
class="s">'molecule_properties'</span><span class="p">]</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">all_results</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">]</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">chembl_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">filtered</span><span class="p">)</span> <span class="n">chembl_df</span><span class="p">[</span><span class="s">'TITLE'</span><span class="p">]</span> <span class="o">=</span> <span class="n">repurposing_df</span><span class="p">[</span><span class="s">'TITLE'</span><span class="p">]</span> </code></pre></div></div> <h2 id="molecular-properties-contained-in-the-chembl-database">Molecular properties contained in the ChEMBL database</h2> <p>Here are the definitions I could dig up:</p> <ul> <li>alogp: (lipophilicity) partition coefficient</li> <li>aromatic_rings: number of aromatic rings</li> <li>cx_logd: distribution coefficient taking into account ionized and non-ionized forms</li> <li>cx_most_apka: acidic pKa</li> <li>cx_most_bpka: basic pKa</li> <li>full_mwt: molecular weight (and also free base and monoisotopic masses)</li> <li>hba: hydrogen bond acceptors (and hba_lipinski for the Lipinski definition)</li> <li>hbd: hydrogen bond donors (and hbd_lipinski)</li> <li>heavy_atoms: number of heavy atoms</li> <li>num_lipinski_ro5_violations: how many times this molecule violated <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski’s rule of five</a></li> <li>num_ro5_violations: not sure, seems similar to the Lipinski rule of five</li> <li>psa: polar surface area</li> 
<li>qed_weighted: “quantitative estimate of druglikeness” (ranges between 0 and 1, with 1 being more favorable). This is based on a <a href="https://www.nature.com/articles/nchem.1243">weighted geometric mean of desirability functions</a></li> <li>ro3_pass: whether the molecule passes the <a href="https://caz.lab.uic.edu/discovery/Medicinal-Chemistry-2018-Barcelona.pdf">rule of three</a></li> <li>rtb: number of rotatable bonds</li> </ul> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">chembl_df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>alogp</th> <th>aromatic_rings</th> <th>cx_logd</th> <th>cx_logp</th> <th>cx_most_apka</th> <th>cx_most_bpka</th> <th>full_molformula</th> <th>full_mwt</th> <th>hba</th> <th>hba_lipinski</th> <th>hbd</th> <th>hbd_lipinski</th> <th>heavy_atoms</th> <th>molecular_species</th> <th>mw_freebase</th> <th>mw_monoisotopic</th> <th>num_lipinski_ro5_violations</th> <th>num_ro5_violations</th> <th>psa</th> <th>qed_weighted</th> <th>ro3_pass</th> <th>rtb</th> <th>TITLE</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>3.45</td> <td>2.0</td> <td>1.26</td> <td>3.92</td> <td>4.68</td> <td>None</td> <td>C16H14O2</td> <td>238.29</td> <td>1.0</td> <td>2.0</td> <td>1.0</td> <td>1.0</td> <td>18.0</td> <td>ACID</td> <td>238.29</td> <td>238.0994</td> <td>0.0</td> <td>0.0</td> <td>37.30</td> <td>0.74</td> <td>N</td> <td>2.0</td> <td>CHEMBL2104122</td> </tr> <tr> <th>1</th> <td>3.64</td> <td>0.0</td> <td>2.81</td> <td>2.81</td> <td>None</td> <td>None</td> <td>C20H26O2</td> <td>298.43</td> <td>2.0</td> <td>2.0</td> <td>1.0</td> <td>1.0</td> <td>22.0</td> <td>None</td> <td>298.43</td> 
<td>298.1933</td> <td>0.0</td> <td>0.0</td> <td>37.30</td> <td>0.55</td> <td>N</td> <td>0.0</td> <td>CHEMBL1387</td> </tr> <tr> <th>2</th> <td>3.92</td> <td>1.0</td> <td>4.25</td> <td>4.25</td> <td>10.15</td> <td>2.86</td> <td>C18H24N2O2S</td> <td>332.47</td> <td>4.0</td> <td>4.0</td> <td>2.0</td> <td>3.0</td> <td>23.0</td> <td>NEUTRAL</td> <td>332.47</td> <td>332.1558</td> <td>0.0</td> <td>0.0</td> <td>75.68</td> <td>0.76</td> <td>N</td> <td>1.0</td> <td>CHEMBL275835</td> </tr> <tr> <th>3</th> <td>4.31</td> <td>0.0</td> <td>4.04</td> <td>4.04</td> <td>None</td> <td>None</td> <td>C20H28O</td> <td>284.44</td> <td>1.0</td> <td>1.0</td> <td>1.0</td> <td>1.0</td> <td>21.0</td> <td>None</td> <td>284.44</td> <td>284.2140</td> <td>0.0</td> <td>0.0</td> <td>20.23</td> <td>0.52</td> <td>N</td> <td>0.0</td> <td>CHEMBL2104104</td> </tr> <tr> <th>4</th> <td>4.79</td> <td>0.0</td> <td>3.96</td> <td>3.96</td> <td>None</td> <td>None</td> <td>C21H28O2</td> <td>312.45</td> <td>2.0</td> <td>2.0</td> <td>0.0</td> <td>0.0</td> <td>23.0</td> <td>None</td> <td>312.45</td> <td>312.2089</td> <td>0.0</td> <td>0.0</td> <td>34.14</td> <td>0.70</td> <td>N</td> <td>1.0</td> <td>CHEMBL2104231</td> </tr> </tbody> </table> </div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">chembl_df</span><span class="p">.</span><span class="n">columns</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['alogp', 'aromatic_rings', 'cx_logd', 'cx_logp', 'cx_most_apka', 'cx_most_bpka', 'full_molformula', 'full_mwt', 'hba', 'hba_lipinski', 'hbd', 'hbd_lipinski', 'heavy_atoms', 'molecular_species', 'mw_freebase', 'mw_monoisotopic', 'num_lipinski_ro5_violations', 'num_ro5_violations', 'psa', 'qed_weighted', 'ro3_pass', 'rtb', 'TITLE'], dtype='object') </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="n">chembl_df</span><span class="p">.</span><span class="n">corr</span><span class="p">()</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>aromatic_rings</th> <th>hba</th> <th>hba_lipinski</th> <th>hbd</th> <th>hbd_lipinski</th> <th>heavy_atoms</th> <th>num_lipinski_ro5_violations</th> <th>num_ro5_violations</th> <th>rtb</th> </tr> </thead> <tbody> <tr> <th>aromatic_rings</th> <td>1.000000</td> <td>0.192569</td> <td>0.178507</td> <td>0.014928</td> <td>0.036106</td> <td>0.249022</td> <td>0.031094</td> <td>0.031094</td> <td>0.229124</td> </tr> <tr> <th>hba</th> <td>0.192569</td> <td>1.000000</td> <td>0.868859</td> <td>0.084553</td> <td>0.054409</td> <td>0.451560</td> <td>-0.047705</td> <td>-0.047705</td> <td>-0.023690</td> </tr> <tr> <th>hba_lipinski</th> <td>0.178507</td> <td>0.868859</td> <td>1.000000</td> <td>0.348600</td> <td>0.294276</td> <td>0.295864</td> <td>-0.070783</td> <td>-0.070783</td> <td>0.021812</td> </tr> <tr> <th>hbd</th> <td>0.014928</td> <td>0.084553</td> <td>0.348600</td> <td>1.000000</td> <td>0.935710</td> <td>-0.172866</td> <td>-0.060462</td> <td>-0.060462</td> <td>0.040505</td> </tr> <tr> <th>hbd_lipinski</th> <td>0.036106</td> <td>0.054409</td> <td>0.294276</td> <td>0.935710</td> <td>1.000000</td> <td>-0.211899</td> <td>-0.085660</td> <td>-0.085660</td> <td>0.084225</td> </tr> <tr> <th>heavy_atoms</th> <td>0.249022</td> <td>0.451560</td> <td>0.295864</td> <td>-0.172866</td> <td>-0.211899</td> <td>1.000000</td> <td>0.397240</td> <td>0.397240</td> <td>0.259011</td> </tr> <tr> <th>num_lipinski_ro5_violations</th> <td>0.031094</td> <td>-0.047705</td> <td>-0.070783</td> <td>-0.060462</td> <td>-0.085660</td> <td>0.397240</td> <td>1.000000</td> 
<td>1.000000</td> <td>0.345308</td> </tr> <tr> <th>num_ro5_violations</th> <td>0.031094</td> <td>-0.047705</td> <td>-0.070783</td> <td>-0.060462</td> <td>-0.085660</td> <td>0.397240</td> <td>1.000000</td> <td>1.000000</td> <td>0.345308</td> </tr> <tr> <th>rtb</th> <td>0.229124</td> <td>-0.023690</td> <td>0.021812</td> <td>0.040505</td> <td>0.084225</td> <td>0.259011</td> <td>0.345308</td> <td>0.345308</td> <td>1.000000</td> </tr> </tbody> </table> </div> <p>At a glance, there are no definite linear correlations among these properties besides the pKas, the partition coefficients, and mwt/hba.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">corr_df</span> <span class="o">=</span> <span class="n">chembl_df</span><span class="p">.</span><span class="n">corr</span><span class="p">()</span> <span class="n">cols</span> <span class="o">=</span> <span class="n">chembl_df</span><span class="p">.</span><span class="n">columns</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">6</span><span class="p">),</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">chembl_df</span><span class="p">.</span><span class="n">corr</span><span class="p">(),</span> <span class="n">cmap</span><span class="o">=</span><span class="s">'RdBu'</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xticklabels</span><span class="p">([</span><span 
class="s">''</span><span class="p">]</span><span class="o">+</span><span class="n">cols</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'x'</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">90</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_yticklabels</span><span class="p">(</span><span class="n">cols</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">rowname</span><span class="p">,</span> <span class="n">row</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">corr_df</span><span class="p">.</span><span class="n">iterrows</span><span class="p">()):</span> <span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">row</span><span class="p">.</span><span class="n">items</span><span class="p">()):</span> <span class="n">ax</span><span class="p">.</span><span class="n">annotate</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">val</span><span class="p">:</span><span class="mf">0.2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">),</span> <span class="n">xytext</span><span class="o">=</span><span 
class="p">(</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span> <span class="o">-</span><span class="mi">5</span><span class="p">),</span> <span class="n">textcoords</span><span class="o">=</span><span class="s">"offset points"</span><span class="p">)</span> </code></pre></div></div> <p><img src="/images/2020-05-06-study_covid_moonshot_files/2020-05-06-study_covid_moonshot_50_0.png" alt="png" /></p> <p>Maybe there are higher-order correlations and relationships that are better suited to clustering and decomposition.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cols</span> <span class="o">=</span> <span class="p">[</span><span class="s">'aromatic_rings'</span><span class="p">,</span> <span class="s">'cx_logp'</span><span class="p">,</span> <span class="s">'full_mwt'</span><span class="p">,</span> <span class="s">'hba'</span><span class="p">]</span> <span class="n">cleaned</span> <span class="o">=</span> <span class="p">(</span><span class="n">chembl_df</span><span class="p">[</span><span class="o">~</span><span class="n">chembl_df</span><span class="p">[</span><span class="n">cols</span><span class="p">]</span> <span class="p">.</span><span class="n">isnull</span><span class="p">()</span> <span class="p">.</span><span class="nb">all</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'columns'</span><span class="p">,</span> <span class="n">skipna</span><span class="o">=</span><span class="bp">False</span><span class="p">)][</span><span class="n">cols</span><span class="p">]</span> <span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'float'</span><span class="p">)</span> <span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="s">'columns'</span><span 
class="p">))</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">preprocessing</span> <span class="n">normalized</span> <span class="o">=</span> <span class="n">preprocessing</span><span class="p">.</span><span class="n">scale</span><span class="p">(</span><span class="n">cleaned</span><span class="p">)</span> </code></pre></div></div> <p>There appear to be roughly 4 clusters among the compounds examined by the covid-moonshot group.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">PCA</span> <span class="kn">from</span> <span class="nn">sklearn.manifold</span> <span class="kn">import</span> <span class="n">TSNE</span> <span class="n">tsne_analysis</span> <span class="o">=</span> <span class="n">TSNE</span><span class="p">(</span><span class="n">n_components</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span> <span class="n">output</span> <span class="o">=</span> <span class="n">tsne_analysis</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">normalized</span><span class="p">)</span> <span class="n">fig</span><span class="p">,</span><span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">output</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">output</span><span 
class="p">[:,</span><span class="mi">1</span><span class="p">])</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Aromatic rings, cx_logp, mwt, hba"</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Text(0.5, 1.0, 'Aromatic rings, cx_logp, mwt, hba') </code></pre></div></div> <p><img src="/images/2020-05-06-study_covid_moonshot_files/2020-05-06-study_covid_moonshot_55_1.png" alt="png" /></p> <p>Leaving out one feature at a time, it looks like dropping either aromatic rings or hydrogen bond acceptors diminishes the cluster distinction.</p> <p>Aromatic rings are large, bulky components of small molecules, so it makes sense that a chunk of the behavior corresponds to the aromatic rings. Similarly, hydrogen bond acceptors (heavy atoms) also exert van der Waals and electrostatic influences on small molecules. 
Left with only weight and partition coefficient, the behavior is mainly continuous.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">clean_df</span><span class="p">(</span><span class="n">cols</span><span class="p">):</span> <span class="n">cleaned</span> <span class="o">=</span> <span class="p">(</span><span class="n">chembl_df</span><span class="p">[</span><span class="o">~</span><span class="n">chembl_df</span><span class="p">[</span><span class="n">cols</span><span class="p">]</span> <span class="p">.</span><span class="n">isnull</span><span class="p">()</span> <span class="p">.</span><span class="nb">all</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'columns'</span><span class="p">,</span> <span class="n">skipna</span><span class="o">=</span><span class="bp">False</span><span class="p">)][</span><span class="n">cols</span><span class="p">]</span> <span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'float'</span><span class="p">)</span> <span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="s">'columns'</span><span class="p">))</span> <span class="n">normalized</span> <span class="o">=</span> <span class="n">preprocessing</span><span class="p">.</span><span class="n">scale</span><span class="p">(</span><span class="n">cleaned</span><span class="p">)</span> <span class="k">return</span> <span class="n">normalized</span> <span class="n">cols</span> <span class="o">=</span> <span class="p">[</span><span class="s">'cx_logp'</span><span class="p">,</span> <span class="s">'full_mwt'</span><span class="p">,</span> <span class="s">'hba'</span><span class="p">]</span> <span class="n">normalized</span> <span class="o">=</span> <span 
class="n">clean_df</span><span class="p">(</span><span class="n">cols</span><span class="p">)</span> <span class="n">tsne_analysis</span> <span class="o">=</span> <span class="n">TSNE</span><span class="p">(</span><span class="n">n_components</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span> <span class="n">output</span> <span class="o">=</span> <span class="n">tsne_analysis</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">normalized</span><span class="p">)</span> <span class="n">fig</span><span class="p">,</span><span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">8</span><span class="p">))</span> <span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">scatter</span><span class="p">(</span><span class="n">output</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">output</span><span class="p">[:,</span><span class="mi">1</span><span class="p">])</span> <span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"cx_logp, mwt, hba"</span><span class="p">)</span> <span class="n">cols</span> <span class="o">=</span> <span class="p">[</span><span class="s">'cx_logp'</span><span class="p">,</span> <span class="s">'full_mwt'</span><span class="p">,</span> <span class="s">'aromatic_rings'</span><span class="p">]</span> <span class="n">normalized</span> <span class="o">=</span> <span class="n">clean_df</span><span 
class="p">(</span><span class="n">cols</span><span class="p">)</span> <span class="n">tsne_analysis</span> <span class="o">=</span> <span class="n">TSNE</span><span class="p">(</span><span class="n">n_components</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span> <span class="n">output</span> <span class="o">=</span> <span class="n">tsne_analysis</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">normalized</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">scatter</span><span class="p">(</span><span class="n">output</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">output</span><span class="p">[:,</span><span class="mi">1</span><span class="p">])</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"aromatic_rings, cx_logp, mwt"</span><span class="p">)</span> <span class="n">cols</span> <span class="o">=</span> <span class="p">[</span><span class="s">'cx_logp'</span><span class="p">,</span> <span class="s">'full_mwt'</span><span class="p">]</span> <span class="n">normalized</span> <span class="o">=</span> <span class="n">clean_df</span><span class="p">(</span><span class="n">cols</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">scatter</span><span class="p">(</span><span class="n">normalized</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">normalized</span><span class="p">[:,</span><span class="mi">1</span><span class="p">])</span> <span class="n">ax</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">set_title</span><span 
class="p">(</span><span class="s">"cx_logp, mwt"</span><span class="p">)</span> <span class="n">fig</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span> </code></pre></div></div> <p><img src="/images/2020-05-06-study_covid_moonshot_files/2020-05-06-study_covid_moonshot_57_0.png" alt="png" /></p> <p>DrugBank</p> <p>I found someone had already <a href="https://github.com/choderalab/nano-drugbank/blob/master/df_drugbank_smiles.csv">downloaded the database</a>. I may revisit these dataframes, but query the DrugBank dataset rather than ChEMBL.</p> <h2 id="some-docking-data">Some docking data</h2> <p>We have some SMILES strings, molecular properties, docking scores, and information about the docking fragments.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">moonshot</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'moonshot-submissions/covid_submissions_all_info-docked-overlap.csv'</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">moonshot</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>SMILES</th> <th>TITLE</th> <th>creator</th> <th>fragments</th> <th>link</th> <th>real_space</th> <th>SCR</th> <th>BB</th> <th>extended_real_space</th> <th>in_molport_or_mcule</th> <th>in_ultimate_mcule</th> <th>in_emolecules</th> <th>covalent_frag</th> <th>covalent_warhead</th> <th>acrylamide</th> <th>acrylamide_adduct</th> <th>chloroacetamide</th> <th>chloroacetamide_adduct</th> 
<th>vinylsulfonamide</th> <th>vinylsulfonamide_adduct</th> <th>nitrile</th> <th>nitrile_adduct</th> <th>MW</th> <th>cLogP</th> <th>HBD</th> <th>HBA</th> <th>TPSA</th> <th>num_criterion_violations</th> <th>BMS</th> <th>Dundee</th> <th>Glaxo</th> <th>Inpharmatica</th> <th>LINT</th> <th>MLSMR</th> <th>PAINS</th> <th>SureChEMBL</th> <th>PostEra</th> <th>ORDERED</th> <th>MADE</th> <th>ASSAYED</th> <th>Hybrid2</th> <th>docked_fragment</th> <th>Mpro-x1418_dock</th> <th>site</th> <th>number_of_overlapping_fragments</th> <th>overlapping_fragments</th> <th>overlap_score</th> <th>volume</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>c1ccc(cc1)n2c3cc(c(cc3c(=O)c(c2[O-])c4cccnc4)F)Cl</td> <td>MAK-UNK-9e4a73aa-2</td> <td>Maksym Voznyy</td> <td>x1418</td> <td>https://covid.postera.ai/covid/submissions/MAK...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>True</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>366.779</td> <td>4.51890</td> <td>0</td> <td>3</td> <td>50.27</td> <td>0</td> <td>PASS</td> <td>beta-keto/anhydride</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>Ketone, Dye 11</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>-11.881256</td> <td>x1418</td> <td>1.206534</td> <td>active-covalent</td> <td>3</td> <td>x0434,x0678,x0830</td> <td>3.208124</td> <td>271.986084</td> </tr> <tr> <th>1</th> <td>Cc1ccncc1n2c(=O)ccc3c2CCCN3CC(=[NH2+])N</td> <td>KIM-UNI-60f168f5-7</td> <td>Kim Tai Tran, University of Copenhagen</td> <td>x0107,x0991</td> <td>https://covid.postera.ai/covid/submissions/KIM...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>297.362</td> 
<td>1.22949</td> <td>2</td> <td>5</td> <td>88.00</td> <td>0</td> <td>PASS</td> <td>imine, imine</td> <td>PASS</td> <td>PASS</td> <td>acyclic C=N-H</td> <td>Imine 3</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>-11.654112</td> <td>x0107</td> <td>NaN</td> <td>active-noncovalent</td> <td>3</td> <td>x0107,x1412,x1392</td> <td>4.753475</td> <td>232.815506</td> </tr> <tr> <th>2</th> <td>c1ccc(cc1)n2c3cc(c(cc3c(=O)n(c2=O)c4cnccn4)F)Cl</td> <td>MAK-UNK-9e4a73aa-14</td> <td>Maksym Voznyy</td> <td>x1418</td> <td>https://covid.postera.ai/covid/submissions/MAK...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>True</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>368.755</td> <td>2.72410</td> <td>0</td> <td>6</td> <td>69.78</td> <td>0</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>-10.460650</td> <td>x0678</td> <td>2.716276</td> <td>active-noncovalent</td> <td>3</td> <td>x0678,x1412,x1392</td> <td>5.520980</td> <td>266.688721</td> </tr> <tr> <th>3</th> <td>Cc1ccncc1N(C=C)[C@H]([C@@H](C)[C@@H]2CN=Cc3c2c...</td> <td>AUS-WAB-916db9c0-1</td> <td>Austin D. 
Chivington, Wabash College</td> <td>x0107,x1077,x1374</td> <td>https://covid.postera.ai/covid/submissions/AUS...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>True</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>351.450</td> <td>3.51932</td> <td>1</td> <td>5</td> <td>57.95</td> <td>0</td> <td>non_ring_acetal</td> <td>het-C-het not in ring</td> <td>PASS</td> <td>Filter10_Terminal_vinyl</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>-9.516450</td> <td>x0678</td> <td>NaN</td> <td>active-noncovalent</td> <td>3</td> <td>x0434,x0831,x0678</td> <td>3.446572</td> <td>284.195312</td> </tr> <tr> <th>4</th> <td>c1ccc2c(c1)ncc(n2)/C=C/C(=O)c3cccc(c3)O</td> <td>DRV-DNY-ae159ed1-12</td> <td>Dr. Vidya Desai, Dnyanprassarak Mandals Colleg...</td> <td>x1249</td> <td>https://covid.postera.ai/covid/submissions/DRV...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>276.295</td> <td>3.23150</td> <td>1</td> <td>4</td> <td>63.08</td> <td>0</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>Filter44_michael_acceptor2</td> <td>PASS</td> <td>Ketone, Dye 9, vinyl michael acceptor1</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>-9.243208</td> <td>x0678</td> <td>NaN</td> <td>active-noncovalent</td> <td>3</td> <td>x0434,x0678,x0830</td> <td>2.865147</td> <td>220.275421</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> 
<td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>4630</th> <td>C[C@H]([C@@H](C(=O)N[C@H](Cc1ccccc1)C(=O)N[C@@...</td> <td>PAU-UNI-6d15a9f5-4</td> <td>paul brear, University of cambridge</td> <td>x1086</td> <td>https://covid.postera.ai/covid/submissions/PAU...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>714.821</td> <td>-0.91270</td> <td>8</td> <td>11</td> <td>256.10</td> <td>4</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>Long aliphatic chain, Dipeptide</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>3.175111</td> <td>x0305</td> <td>NaN</td> <td>active-noncovalent</td> <td>0</td> <td>NaN</td> <td>5.297134</td> <td>548.583191</td> </tr> <tr> <th>4631</th> <td>c1cc2cc(c(cc2c(c1)S(=O)(=O)N3CC[NH+](CC3)Cc4cc...</td> <td>MAK-UNK-e05327b2-2</td> <td>Maksym Voznyy</td> <td>x1402</td> <td>https://covid.postera.ai/covid/submissions/MAK...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>True</td> <td>True</td> <td>False</td> <td>False</td> <td>True</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>837.964</td> <td>6.63190</td> <td>0</td> <td>9</td> <td>98.31</td> <td>2</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>Hetero_hetero</td> <td>PASS</td> <td>PASS</td> 
<td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>3.561681</td> <td>x1392</td> <td>NaN</td> <td>active-covalent</td> <td>0</td> <td>NaN</td> <td>3.297014</td> <td>591.877563</td> </tr> <tr> <th>4632</th> <td>Cc1cccc(c1)C[NH+]2CCN(CC2)C(=O)c3ccc(cc3)C#Cc4...</td> <td>MAK-UNK-e4a48a85-16</td> <td>Maksym Voznyy</td> <td>x0387,x0692</td> <td>https://covid.postera.ai/covid/submissions/MAK...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>True</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>574.794</td> <td>6.18892</td> <td>0</td> <td>5</td> <td>39.68</td> <td>2</td> <td>PASS</td> <td>triple bond</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>4.056698</td> <td>x0978</td> <td>NaN</td> <td>active-covalent</td> <td>0</td> <td>NaN</td> <td>4.360606</td> <td>470.944824</td> </tr> <tr> <th>4633</th> <td>c1cc2cc(c(cc2c(c1)S(=O)(=O)N3CC[NH+](CC3)Cc4cc...</td> <td>MAK-UNK-e05327b2-6</td> <td>Maksym Voznyy</td> <td>x1402</td> <td>https://covid.postera.ai/covid/submissions/MAK...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>True</td> <td>True</td> <td>False</td> <td>False</td> <td>True</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>990.183</td> <td>5.19160</td> <td>0</td> <td>12</td> <td>138.93</td> <td>3</td> <td>alpha_halo_heteroatom, secondary_halide_sulfate</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>Hetero_hetero</td> <td>PASS</td> <td>Dithiomethylene_acetal</td> <td>Alkyl Halide</td> <td>False</td> <td>False</td> <td>False</td> <td>4.242827</td> <td>x0731</td> <td>NaN</td> <td>active-covalent</td> <td>0</td> <td>NaN</td> <td>4.193186</td> <td>694.333069</td> </tr> <tr> 
<th>4634</th> <td>Cc1cccc(c1)C[NH+]2CCN(CC2)c3cc(c(c(c3)Cl)c4cc5...</td> <td>MAK-UNK-e4a48a85-15</td> <td>Maksym Voznyy</td> <td>x0387,x0692</td> <td>https://covid.postera.ai/covid/submissions/MAK...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>True</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>659.687</td> <td>7.36362</td> <td>1</td> <td>7</td> <td>68.36</td> <td>2</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>5.966927</td> <td>x0705</td> <td>NaN</td> <td>active-covalent</td> <td>0</td> <td>NaN</td> <td>1.473711</td> <td>503.583801</td> </tr> </tbody> </table> <p>4635 rows × 48 columns</p> </div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">moonshot</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>SMILES</th> <th>TITLE</th> <th>creator</th> <th>fragments</th> <th>link</th> <th>real_space</th> <th>SCR</th> <th>BB</th> <th>extended_real_space</th> <th>in_molport_or_mcule</th> <th>in_ultimate_mcule</th> <th>in_emolecules</th> <th>covalent_frag</th> <th>covalent_warhead</th> <th>acrylamide</th> <th>acrylamide_adduct</th> <th>chloroacetamide</th> <th>chloroacetamide_adduct</th> <th>vinylsulfonamide</th> <th>vinylsulfonamide_adduct</th> <th>nitrile</th> <th>nitrile_adduct</th> <th>MW</th> <th>cLogP</th> <th>HBD</th> 
<th>HBA</th> <th>TPSA</th> <th>num_criterion_violations</th> <th>BMS</th> <th>Dundee</th> <th>Glaxo</th> <th>Inpharmatica</th> <th>LINT</th> <th>MLSMR</th> <th>PAINS</th> <th>SureChEMBL</th> <th>PostEra</th> <th>ORDERED</th> <th>MADE</th> <th>ASSAYED</th> <th>Hybrid2</th> <th>docked_fragment</th> <th>Mpro-x1418_dock</th> <th>site</th> <th>number_of_overlapping_fragments</th> <th>overlapping_fragments</th> <th>overlap_score</th> <th>volume</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>c1ccc(cc1)n2c3cc(c(cc3c(=O)c(c2[O-])c4cccnc4)F)Cl</td> <td>MAK-UNK-9e4a73aa-2</td> <td>Maksym Voznyy</td> <td>x1418</td> <td>https://covid.postera.ai/covid/submissions/MAK...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>True</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>366.779</td> <td>4.51890</td> <td>0</td> <td>3</td> <td>50.27</td> <td>0</td> <td>PASS</td> <td>beta-keto/anhydride</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>Ketone, Dye 11</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>-11.881256</td> <td>x1418</td> <td>1.206534</td> <td>active-covalent</td> <td>3</td> <td>x0434,x0678,x0830</td> <td>3.208124</td> <td>271.986084</td> </tr> <tr> <th>1</th> <td>Cc1ccncc1n2c(=O)ccc3c2CCCN3CC(=[NH2+])N</td> <td>KIM-UNI-60f168f5-7</td> <td>Kim Tai Tran, University of Copenhagen</td> <td>x0107,x0991</td> <td>https://covid.postera.ai/covid/submissions/KIM...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>297.362</td> <td>1.22949</td> <td>2</td> <td>5</td> <td>88.00</td> <td>0</td> <td>PASS</td> <td>imine, imine</td> <td>PASS</td> <td>PASS</td> <td>acyclic 
C=N-H</td> <td>Imine 3</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>-11.654112</td> <td>x0107</td> <td>NaN</td> <td>active-noncovalent</td> <td>3</td> <td>x0107,x1412,x1392</td> <td>4.753475</td> <td>232.815506</td> </tr> <tr> <th>2</th> <td>c1ccc(cc1)n2c3cc(c(cc3c(=O)n(c2=O)c4cnccn4)F)Cl</td> <td>MAK-UNK-9e4a73aa-14</td> <td>Maksym Voznyy</td> <td>x1418</td> <td>https://covid.postera.ai/covid/submissions/MAK...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>True</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>368.755</td> <td>2.72410</td> <td>0</td> <td>6</td> <td>69.78</td> <td>0</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>-10.460650</td> <td>x0678</td> <td>2.716276</td> <td>active-noncovalent</td> <td>3</td> <td>x0678,x1412,x1392</td> <td>5.520980</td> <td>266.688721</td> </tr> <tr> <th>3</th> <td>Cc1ccncc1N(C=C)[C@H]([C@@H](C)[C@@H]2CN=Cc3c2c...</td> <td>AUS-WAB-916db9c0-1</td> <td>Austin D. 
Chivington, Wabash College</td> <td>x0107,x1077,x1374</td> <td>https://covid.postera.ai/covid/submissions/AUS...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>True</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>351.450</td> <td>3.51932</td> <td>1</td> <td>5</td> <td>57.95</td> <td>0</td> <td>non_ring_acetal</td> <td>het-C-het not in ring</td> <td>PASS</td> <td>Filter10_Terminal_vinyl</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>-9.516450</td> <td>x0678</td> <td>NaN</td> <td>active-noncovalent</td> <td>3</td> <td>x0434,x0831,x0678</td> <td>3.446572</td> <td>284.195312</td> </tr> <tr> <th>4</th> <td>c1ccc2c(c1)ncc(n2)/C=C/C(=O)c3cccc(c3)O</td> <td>DRV-DNY-ae159ed1-12</td> <td>Dr. Vidya Desai, Dnyanprassarak Mandals Colleg...</td> <td>x1249</td> <td>https://covid.postera.ai/covid/submissions/DRV...</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>FALSE</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>False</td> <td>276.295</td> <td>3.23150</td> <td>1</td> <td>4</td> <td>63.08</td> <td>0</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>Filter44_michael_acceptor2</td> <td>PASS</td> <td>Ketone, Dye 9, vinyl michael acceptor1</td> <td>PASS</td> <td>PASS</td> <td>PASS</td> <td>False</td> <td>False</td> <td>False</td> <td>-9.243208</td> <td>x0678</td> <td>NaN</td> <td>active-noncovalent</td> <td>3</td> <td>x0434,x0678,x0830</td> <td>2.865147</td> <td>220.275421</td> </tr> </tbody> </table> </div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">moonshot</span><span class="p">[</span><span 
class="s">'Mpro-x1418_dock'</span><span class="p">].</span><span class="n">isnull</span><span class="p">().</span><span class="nb">sum</span><span class="p">()</span> <span class="c1"># Lots of missing Mpro dock scores </span></code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>4586 </code></pre></div></div> <p>While there are a lot of different fragments to which the small molecule can bind, there are two “classes”, active-covalent and active-noncovalent (possibly referring to sites that covalently bond?)</p> <p>This presents a way to logically bisect the data based on some fundamental chemistry of the binding pocket.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">moonshot</span><span class="p">[</span><span class="s">'docked_fragment'</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x0678 940 x0749 771 x0104 347 x0831 283 x0830 281 x0195 269 x0161 252 x0107 201 x0072 172 x1077 127 x1392 107 x1093 107 x0434 105 x0874 81 x1385 69 x1418 58 x1334 50 x0967 46 x0397 42 x0946 38 x0692 37 x0759 37 x1386 35 x0395 29 x0305 24 x1311 16 x0708 13 x0774 12 x1380 10 x1412 7 x1374 7 x1348 6 x0770 5 x1249 5 x0387 5 x0736 4 x0705 4 x1358 3 x0426 3 x1375 3 x0734 3 x0540 3 x0354 3 x1382 3 x0755 1 x1458 1 x0689 1 x0769 1 x0981 1 x0978 1 x0731 1 x1493 1 x0771 1 x1478 1 x1384 1 x1351 1 Name: docked_fragment, dtype: int64 </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">moonshot</span><span class="p">[</span><span class="s">'site'</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext 
highlighter-rouge"><div class="highlight"><pre class="highlight"><code>active-noncovalent 2799 active-covalent 1836 Name: site, dtype: int64 </code></pre></div></div> <p>We can examine the same correlations, but now for each type of site, and look at the hybrid docking score correlations.</p> <p>The biggest differences in trend between the two site types appear for the partition coefficient (cLogP) and the number of hydrogen bond donors (HBD), but the correlations are still extremely weak.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">site_type</span> <span class="o">=</span> <span class="s">'active-noncovalent'</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">6</span><span class="p">),</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">cols</span> <span class="o">=</span> <span class="p">[</span><span class="s">'MW'</span><span class="p">,</span> <span class="s">'cLogP'</span><span class="p">,</span> <span class="s">'HBD'</span><span class="p">,</span> <span class="s">'HBA'</span><span class="p">,</span> <span class="s">'TPSA'</span><span class="p">,</span> <span class="s">'Hybrid2'</span><span class="p">]</span> <span class="n">ax</span><span class="p">.</span><span class="n">matshow</span><span class="p">(</span><span class="n">moonshot</span><span class="p">[</span><span class="n">moonshot</span><span class="p">[</span><span class="s">'site'</span><span class="p">]</span><span class="o">==</span><span class="n">site_type</span><span class="p">][</span><span
class="n">cols</span><span class="p">].</span><span class="n">corr</span><span class="p">(),</span> <span class="n">cmap</span><span class="o">=</span><span class="s">'RdBu'</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">([</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">_</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">cols</span><span class="p">)])</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="n">cols</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_yticks</span><span class="p">([</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">_</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">cols</span><span class="p">)])</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_yticklabels</span><span class="p">(</span><span class="n">cols</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">rowname</span><span class="p">,</span> <span class="n">row</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">moonshot</span><span class="p">[</span><span class="n">moonshot</span><span class="p">[</span><span class="s">'site'</span><span class="p">]</span><span class="o">==</span><span class="n">site_type</span><span class="p">][</span><span class="n">cols</span><span class="p">].</span><span class="n">corr</span><span class="p">().</span><span class="n">iterrows</span><span class="p">()):</span> <span class="k">for</span> 
<span class="n">j</span><span class="p">,</span> <span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">row</span><span class="p">.</span><span class="n">items</span><span class="p">()):</span> <span class="n">ax</span><span class="p">.</span><span class="n">annotate</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">val</span><span class="p">:</span><span class="mf">0.2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">),</span> <span class="n">xytext</span><span class="o">=</span><span class="p">(</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span> <span class="o">-</span><span class="mi">5</span><span class="p">),</span> <span class="n">textcoords</span><span class="o">=</span><span class="s">"offset points"</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="sa">f</span><span class="s">"Docking to </span><span class="si">{</span><span class="n">site_type</span><span class="si">}</span><span class="s">"</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Text(0.5, 1.05, 'Docking to active-noncovalent') </code></pre></div></div> <p><img src="/images/2020-05-06-study_covid_moonshot_files/2020-05-06-study_covid_moonshot_68_1.png" alt="png" /></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">site_type</span> <span
class="o">=</span> <span class="s">'active-covalent'</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">6</span><span class="p">),</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">cols</span> <span class="o">=</span> <span class="p">[</span><span class="s">'MW'</span><span class="p">,</span> <span class="s">'cLogP'</span><span class="p">,</span> <span class="s">'HBD'</span><span class="p">,</span> <span class="s">'HBA'</span><span class="p">,</span> <span class="s">'TPSA'</span><span class="p">,</span> <span class="s">'Hybrid2'</span><span class="p">]</span> <span class="n">ax</span><span class="p">.</span><span class="n">matshow</span><span class="p">(</span><span class="n">moonshot</span><span class="p">[</span><span class="n">moonshot</span><span class="p">[</span><span class="s">'site'</span><span class="p">]</span><span class="o">==</span><span class="n">site_type</span><span class="p">][</span><span class="n">cols</span><span class="p">].</span><span class="n">corr</span><span class="p">(),</span> <span class="n">cmap</span><span class="o">=</span><span class="s">'RdBu'</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">([</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">_</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">cols</span><span class="p">)])</span> <span 
class="n">ax</span><span class="p">.</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="n">cols</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_yticks</span><span class="p">([</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">_</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">cols</span><span class="p">)])</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_yticklabels</span><span class="p">(</span><span class="n">cols</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">rowname</span><span class="p">,</span> <span class="n">row</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">moonshot</span><span class="p">[</span><span class="n">moonshot</span><span class="p">[</span><span class="s">'site'</span><span class="p">]</span><span class="o">==</span><span class="n">site_type</span><span class="p">][</span><span class="n">cols</span><span class="p">].</span><span class="n">corr</span><span class="p">().</span><span class="n">iterrows</span><span class="p">()):</span> <span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">row</span><span class="p">.</span><span class="n">items</span><span class="p">()):</span> <span class="n">ax</span><span class="p">.</span><span class="n">annotate</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">val</span><span
class="p">:</span><span class="mf">0.2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">),</span> <span class="n">xytext</span><span class="o">=</span><span class="p">(</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span> <span class="o">-</span><span class="mi">5</span><span class="p">),</span> <span class="n">textcoords</span><span class="o">=</span><span class="s">"offset points"</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="sa">f</span><span class="s">"Docking to </span><span class="si">{</span><span class="n">site_type</span><span class="si">}</span><span class="s">"</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Text(0.5, 1.05, 'Docking to active-covalent') </code></pre></div></div> <p><img src="/images/2020-05-06-study_covid_moonshot_files/2020-05-06-study_covid_moonshot_69_1.png" alt="png" /></p> <p>In general, lower docking scores seem better, so the noncovalent sites might present more favorable binding locations (see histogram below). This seems counterintuitive because, if active-covalent really means sites that bond covalently, covalent bonds should be more energetically favorable than non-covalent interactions. Alternatively, forming covalent bonds might suggest an unstable region of the complex that could be shielded from the surroundings, inhibiting any sort of small molecule from binding the pocket?
Expert opinion would be much appreciated here.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">6</span><span class="p">),</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">covalent_mean</span> <span class="o">=</span> <span class="n">moonshot</span><span class="p">[</span><span class="n">moonshot</span><span class="p">[</span><span class="s">'site'</span><span class="p">]</span><span class="o">==</span><span class="s">'active-covalent'</span><span class="p">][</span><span class="s">'Hybrid2'</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span> <span class="n">noncovalent_mean</span> <span class="o">=</span> <span class="n">moonshot</span><span class="p">[</span><span class="n">moonshot</span><span class="p">[</span><span class="s">'site'</span><span class="p">]</span><span class="o">==</span><span class="s">'active-noncovalent'</span><span class="p">][</span><span class="s">'Hybrid2'</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span> <span class="n">ax</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">moonshot</span><span class="p">[</span><span class="n">moonshot</span><span class="p">[</span><span class="s">'site'</span><span class="p">]</span><span class="o">==</span><span class="s">'active-covalent'</span><span class="p">][</span><span class="s">'Hybrid2'</span><span
class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">'active-covalent (mean=</span><span class="si">{</span><span class="n">covalent_mean</span><span class="p">:.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">)'</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">moonshot</span><span class="p">[</span><span class="n">moonshot</span><span class="p">[</span><span class="s">'site'</span><span class="p">]</span><span class="o">==</span><span class="s">'active-noncovalent'</span><span class="p">][</span><span class="s">'Hybrid2'</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">'active-noncovalent (mean=</span><span class="si">{</span><span class="n">noncovalent_mean</span><span class="p">:.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">)'</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="sa">f</span><span class="s">"Hybrid2 histogram"</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Hybrid2 score"</span><span class="p">)</span> <span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;matplotlib.legend.Legend at 0x7fac6b459850&gt; </code></pre></div></div> <p><img 
src="/images/2020-05-06-study_covid_moonshot_files/2020-05-06-study_covid_moonshot_71_1.png" alt="png" /></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">rdkit</span> <span class="kn">import</span> <span class="n">Chem</span> <span class="kn">from</span> <span class="nn">rdkit.Chem</span> <span class="kn">import</span> <span class="n">Draw</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rdkit_smiles</span> <span class="o">=</span> <span class="p">[</span><span class="n">Chem</span><span class="p">.</span><span class="n">MolFromSmiles</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">moonshot</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">'Hybrid2'</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">)[</span><span class="s">'SMILES'</span><span class="p">].</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)]</span> <span class="n">scores</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">a</span><span class="p">:.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">moonshot</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">'Hybrid2'</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">)[</span><span class="s">'Hybrid2'</span><span class="p">].</span><span class="n">head</span><span class="p">(</span><span
class="mi">10</span><span class="p">)]</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">img</span><span class="o">=</span><span class="n">Chem</span><span class="p">.</span><span class="n">Draw</span><span class="p">.</span><span class="n">MolsToGridImage</span><span class="p">(</span><span class="n">rdkit_smiles</span><span class="p">,</span><span class="n">molsPerRow</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span><span class="n">subImgSize</span><span class="o">=</span><span class="p">(</span><span class="mi">200</span><span class="p">,</span><span class="mi">200</span><span class="p">),</span> <span class="n">legends</span><span class="o">=</span><span class="n">scores</span><span class="p">)</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">img</span> </code></pre></div></div> <p><img src="/images/2020-05-06-study_covid_moonshot_files/2020-05-06-study_covid_moonshot_75_0.png" alt="png" /></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> </code></pre></div></div>Alex H. Yang[email protected]Learning cheminformatics from some Folding@Home dataIs being “clutch” a myth?2020-04-05T00:00:00-05:002020-04-05T00:00:00-05:00https://ahy3nz.github.io/posts/2020/04/clutch_myth<h1 id="are-some-players-more-clutch-than-others">Are some players more “clutch” than others?</h1> <p>Clutch time is defined as “the last 5 minutes of a game in which the point differential is 5 or less”. Do some players really rise to the challenge and perform better in the clutch?</p> <p>To address this, we can use some <code class="language-plaintext highlighter-rouge">nba_api</code> functionality to get clutch stats and compare them to regular season stats. 
Because the sample of clutch stats is small, we use the total clutch field goal percentage (total field goals made divided by total field goals attempted during clutch time). For regular season stats, however, we use the per-game field goal percentage averaged over all games, which also gives a sense of the game-by-game variation in a player’s field goal percentage.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span> <span class="kn">import</span> <span class="nn">matplotlib</span> <span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span> <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="kn">import</span> <span class="nn">time</span> <span class="kn">import</span> <span class="nn">ballDontLie</span> <span class="kn">from</span> <span class="nn">ballDontLie.util.api_nba</span> <span class="kn">import</span> <span class="n">find_player_id</span> <span class="kn">from</span> <span class="nn">nba_api.stats.endpoints</span> <span class="kn">import</span> <span class="n">PlayerDashboardByClutch</span><span class="p">,</span> <span class="n">PlayerGameLog</span><span class="p">,</span> <span class="n">LeagueGameLog</span> </code></pre></div></div> <h2 id="sample-players-and-data-pull">Sample players and data pull</h2> <p>We will examine some famous players, some thought to be more “clutch” than others. We look at these players season by season, restricting ourselves to their MVP seasons. We could look beyond MVP seasons (some players didn’t win the MVP in a given season but still had historic regular seasons or playoff runs).
Further, we could also expand to non-MVP players who have many clutch moments (Damian Lillard, Brandon Roy, among others)</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">players_mvps</span> <span class="o">=</span> <span class="p">{</span> <span class="s">'Michael Jordan'</span><span class="p">:</span> <span class="p">[</span><span class="s">'1987-88'</span><span class="p">,</span> <span class="s">'1990-91'</span><span class="p">,</span> <span class="s">'1991-92'</span><span class="p">,</span> <span class="s">'1995-96'</span><span class="p">,</span> <span class="s">'1997-98'</span><span class="p">],</span> <span class="s">'Kobe Bryant'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2007-08'</span><span class="p">],</span> <span class="s">'LeBron James'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2008-09'</span><span class="p">,</span> <span class="s">'2009-10'</span><span class="p">,</span> <span class="s">'2011-12'</span><span class="p">,</span> <span class="s">'2012-13'</span><span class="p">],</span> <span class="s">'Kevin Durant'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2013-14'</span><span class="p">],</span> <span class="s">'Russell Westbrook'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2016-17'</span><span class="p">],</span> <span class="s">'Allen Iverson'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2000-01'</span><span class="p">],</span> <span class="s">'Stephen Curry'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2014-15'</span><span class="p">,</span> <span class="s">'2015-16'</span><span class="p">],</span> <span class="s">'Derrick Rose'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2010-11'</span><span class="p">],</span> <span class="s">'Steve Nash'</span><span class="p">:</span> <span 
class="p">[</span><span class="s">'2004-05'</span><span class="p">,</span> <span class="s">'2005-06'</span><span class="p">]</span> <span class="p">}</span> </code></pre></div></div> <p>One constraint comes from NBA stat recording: clutch-time stats were not recorded for some seasons, so we have to account for the missing data (flagged with -1.0 in the loop below).</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">()</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">player</span><span class="p">,</span> <span class="n">mvp_seasons</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">players_mvps</span><span class="p">.</span><span class="n">items</span><span class="p">()):</span> <span class="k">for</span> <span class="n">mvp_season</span> <span class="ow">in</span> <span class="n">mvp_seasons</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="n">player</span><span class="p">,</span> <span class="n">mvp_season</span><span class="p">)</span> <span class="n">player_id</span> <span class="o">=</span> <span class="n">find_player_id</span><span class="p">(</span><span class="n">player</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="n">season_results</span> <span class="o">=</span> <span class="p">{</span><span class="s">'player'</span><span class="p">:</span> <span class="n">player</span><span class="p">,</span> <span class="s">'player_id'</span><span class="p">:</span> <span class="n">player_id</span><span class="p">,</span> <span class="s">'season'</span><span class="p">:</span> <span class="n">mvp_season</span><span class="p">}</span> <span
class="n">regular_season_game_log</span> <span class="o">=</span> <span class="n">PlayerGameLog</span><span class="p">(</span><span class="n">player_id</span><span class="p">,</span> <span class="n">season</span><span class="o">=</span><span class="n">mvp_season</span><span class="p">,</span> <span class="n">season_type_all_star</span><span class="o">=</span><span class="s">'Regular Season'</span><span class="p">)</span> <span class="n">regular_season_clutch_games</span> <span class="o">=</span> <span class="n">PlayerDashboardByClutch</span><span class="p">(</span><span class="n">player_id</span><span class="p">,</span> <span class="n">season</span><span class="o">=</span><span class="n">mvp_season</span><span class="p">,</span> <span class="n">season_type_playoffs</span><span class="o">=</span><span class="s">'Regular Season'</span><span class="p">)</span> <span class="n">season_results</span><span class="p">[</span><span class="s">'regular_fg_pct'</span><span class="p">]</span> <span class="o">=</span> <span class="n">regular_season_game_log</span><span class="p">.</span><span class="n">get_data_frames</span><span class="p">()[</span><span class="mi">0</span><span class="p">][</span><span class="s">'FG_PCT'</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span> <span class="n">season_results</span><span class="p">[</span><span class="s">'regular_fg_pct_std'</span><span class="p">]</span> <span class="o">=</span> <span class="n">regular_season_game_log</span><span class="p">.</span><span class="n">get_data_frames</span><span class="p">()[</span><span class="mi">0</span><span class="p">][</span><span class="s">'FG_PCT'</span><span class="p">].</span><span class="n">std</span><span class="p">()</span> <span class="k">try</span><span class="p">:</span> <span class="n">season_results</span><span class="p">[</span><span class="s">'regular_clutch_fg_pct'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span 
class="n">regular_season_clutch_games</span> <span class="p">.</span><span class="n">last5_min_plus_minus5_point_player_dashboard</span> <span class="p">.</span><span class="n">get_data_frame</span><span class="p">()[</span><span class="s">'FG_PCT'</span><span class="p">].</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="k">except</span> <span class="nb">IndexError</span><span class="p">:</span> <span class="n">season_results</span><span class="p">[</span><span class="s">'regular_clutch_fg_pct'</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1.0</span> <span class="n">playoffs_game_log</span> <span class="o">=</span> <span class="n">PlayerGameLog</span><span class="p">(</span><span class="n">player_id</span><span class="p">,</span> <span class="n">season</span><span class="o">=</span><span class="n">mvp_season</span><span class="p">,</span> <span class="n">season_type_all_star</span><span class="o">=</span><span class="s">'Playoffs'</span><span class="p">)</span> <span class="n">playoffs_clutch_games</span> <span class="o">=</span> <span class="n">PlayerDashboardByClutch</span><span class="p">(</span><span class="n">player_id</span><span class="p">,</span> <span class="n">season</span><span class="o">=</span><span class="n">mvp_season</span><span class="p">,</span> <span class="n">season_type_playoffs</span><span class="o">=</span><span class="s">'Playoffs'</span><span class="p">)</span> <span class="n">season_results</span><span class="p">[</span><span class="s">'playoff_fg_pct'</span><span class="p">]</span> <span class="o">=</span> <span class="n">playoffs_game_log</span><span class="p">.</span><span class="n">get_data_frames</span><span class="p">()[</span><span class="mi">0</span><span class="p">][</span><span class="s">'FG_PCT'</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span> <span 
class="n">season_results</span><span class="p">[</span><span class="s">'playoff_fg_pct_std'</span><span class="p">]</span> <span class="o">=</span> <span class="n">playoffs_game_log</span><span class="p">.</span><span class="n">get_data_frames</span><span class="p">()[</span><span class="mi">0</span><span class="p">][</span><span class="s">'FG_PCT'</span><span class="p">].</span><span class="n">std</span><span class="p">()</span> <span class="k">try</span><span class="p">:</span> <span class="n">season_results</span><span class="p">[</span><span class="s">'playoff_clutch_fg_pct'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">playoffs_clutch_games</span> <span class="p">.</span><span class="n">last5_min_plus_minus5_point_player_dashboard</span> <span class="p">.</span><span class="n">get_data_frame</span><span class="p">()[</span><span class="s">'FG_PCT'</span><span class="p">].</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="k">except</span> <span class="nb">IndexError</span><span class="p">:</span> <span class="n">season_results</span><span class="p">[</span><span class="s">'playoff_clutch_fg_pct'</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1.0</span> <span class="n">summary_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">i</span><span class="p">:</span> <span class="n">season_results</span><span class="p">}</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">.</span><span class="n">from_dict</span><span class="p">(</span><span class="n">summary_dict</span><span class="p">,</span> <span class="n">orient</span><span class="o">=</span><span 
class="s">'index'</span><span class="p">))</span> <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Michael Jordan 1987-88 Michael Jordan 1990-91 Michael Jordan 1991-92 Michael Jordan 1995-96 Michael Jordan 1997-98 Kobe Bryant 2007-08 LeBron James 2008-09 LeBron James 2009-10 LeBron James 2011-12 LeBron James 2012-13 Kevin Durant 2013-14 Russell Westbrook 2016-17 Allen Iverson 2000-01 Stephen Curry 2014-15 Stephen Curry 2015-16 Derrick Rose 2010-11 Steve Nash 2004-05 Steve Nash 2005-06 </code></pre></div></div> <p>Looking at the data, we have a reasonably tidy dataframe of the players in their MVP seasons, along with some information about their fg%. Unfortunately, we’re missing clutch data for most of Michael Jordan’s seasons: the -1.000 entries are the placeholder written when the clutch query returned no rows.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> </code></pre></div></div> <div> <style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>player</th> <th>player_id</th> <th>season</th> <th>regular_fg_pct</th> <th>regular_fg_pct_std</th> <th>regular_clutch_fg_pct</th> <th>playoff_fg_pct</th> <th>playoff_fg_pct_std</th> <th>playoff_clutch_fg_pct</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>Michael Jordan</td> <td>893</td> <td>1987-88</td> <td>0.536207</td> <td>0.100253</td> <td>-1.000</td> <td>0.526700</td> <td>0.093797</td> <td>-1.000</td> </tr> <tr> <th>0</th> <td>Michael Jordan</td> <td>893</td> <td>1990-91</td> <td>0.543220</td> <td>0.098105</td> <td>-1.000</td> <td>0.532059</td> <td>0.105735</td> <td>-1.000</td> </tr> <tr> 
<th>0</th> <td>Michael Jordan</td> <td>893</td> <td>1991-92</td> <td>0.517175</td> <td>0.098628</td> <td>-1.000</td> <td>0.496273</td> <td>0.083527</td> <td>-1.000</td> </tr> <tr> <th>0</th> <td>Michael Jordan</td> <td>893</td> <td>1995-96</td> <td>0.494646</td> <td>0.098950</td> <td>-1.000</td> <td>0.455333</td> <td>0.104624</td> <td>-1.000</td> </tr> <tr> <th>0</th> <td>Michael Jordan</td> <td>893</td> <td>1997-98</td> <td>0.463866</td> <td>0.100443</td> <td>0.430</td> <td>0.468381</td> <td>0.087482</td> <td>0.440</td> </tr> <tr> <th>1</th> <td>Kobe Bryant</td> <td>977</td> <td>2007-08</td> <td>0.464744</td> <td>0.110210</td> <td>0.448</td> <td>0.485619</td> <td>0.102819</td> <td>0.484</td> </tr> <tr> <th>2</th> <td>LeBron James</td> <td>2544</td> <td>2008-09</td> <td>0.490827</td> <td>0.099416</td> <td>0.556</td> <td>0.512929</td> <td>0.100286</td> <td>0.526</td> </tr> <tr> <th>2</th> <td>LeBron James</td> <td>2544</td> <td>2009-10</td> <td>0.499855</td> <td>0.086847</td> <td>0.488</td> <td>0.487182</td> <td>0.139674</td> <td>0.714</td> </tr> <tr> <th>2</th> <td>LeBron James</td> <td>2544</td> <td>2011-12</td> <td>0.533387</td> <td>0.111258</td> <td>0.453</td> <td>0.500000</td> <td>0.094863</td> <td>0.370</td> </tr> <tr> <th>2</th> <td>LeBron James</td> <td>2544</td> <td>2012-13</td> <td>0.572526</td> <td>0.114762</td> <td>0.442</td> <td>0.496000</td> <td>0.120758</td> <td>0.440</td> </tr> <tr> <th>3</th> <td>Kevin Durant</td> <td>201142</td> <td>2013-14</td> <td>0.510494</td> <td>0.116384</td> <td>0.379</td> <td>0.460632</td> <td>0.100532</td> <td>0.515</td> </tr> <tr> <th>4</th> <td>Russell Westbrook</td> <td>201566</td> <td>2016-17</td> <td>0.425136</td> <td>0.119384</td> <td>0.446</td> <td>0.382400</td> <td>0.078567</td> <td>0.286</td> </tr> <tr> <th>5</th> <td>Allen Iverson</td> <td>947</td> <td>2000-01</td> <td>0.412845</td> <td>0.092642</td> <td>0.441</td> <td>0.380773</td> <td>0.114597</td> <td>0.306</td> </tr> <tr> <th>6</th> <td>Stephen Curry</td> 
<td>201939</td> <td>2014-15</td> <td>0.483463</td> <td>0.109280</td> <td>0.441</td> <td>0.456667</td> <td>0.104391</td> <td>0.381</td> </tr> <tr> <th>6</th> <td>Stephen Curry</td> <td>201939</td> <td>2015-16</td> <td>0.499405</td> <td>0.111825</td> <td>0.442</td> <td>0.436722</td> <td>0.117035</td> <td>0.538</td> </tr> <tr> <th>7</th> <td>Derrick Rose</td> <td>201565</td> <td>2010-11</td> <td>0.447617</td> <td>0.106040</td> <td>0.402</td> <td>0.400062</td> <td>0.102861</td> <td>0.409</td> </tr> <tr> <th>8</th> <td>Steve Nash</td> <td>959</td> <td>2004-05</td> <td>0.508400</td> <td>0.159107</td> <td>0.447</td> <td>0.510667</td> <td>0.126669</td> <td>0.444</td> </tr> <tr> <th>8</th> <td>Steve Nash</td> <td>959</td> <td>2005-06</td> <td>0.513215</td> <td>0.157780</td> <td>0.425</td> <td>0.500800</td> <td>0.126879</td> <td>0.385</td> </tr> </tbody> </table> </div> <h2 id="visualizing-the-results">Visualizing the results</h2> <p>We can plot the difference between the clutch fg% and the average fg% for each player’s season. If this number is above 0, then their clutch performances are better than their average performance. As a rough gauge of statistical significance, we can check whether this difference is larger than the standard deviation of the player’s fg%.</p> <h3 id="regular-season">Regular season</h3> <p>LeBron (in 2008-09 only), Russ, and AI are the only players to show a clutch fg% higher than their average fg%. 
Unfortunately, these differences are slight: each is smaller than the standard deviation of the player’s regular-season fg%.</p> <h3 id="playoffs">Playoffs</h3> <p>LeBron, KD, Steph, and DRose show clutch fg%s higher than their average playoff fg%.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">sharex</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">6</span><span class="p">),</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="n">unique_players</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'player'</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">player</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">unique_players</span><span class="p">):</span> <span class="n">sub_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'player'</span><span class="p">]</span><span class="o">==</span><span class="n">player</span><span class="p">]</span> <span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">errorbar</span><span 
class="p">([</span><span class="n">i</span><span class="p">]</span><span class="o">*</span><span class="nb">len</span><span class="p">(</span><span class="n">sub_df</span><span class="p">),</span> <span class="mi">100</span><span class="o">*</span><span class="p">(</span><span class="n">sub_df</span><span class="p">[</span><span class="s">'regular_clutch_fg_pct'</span><span class="p">]</span> <span class="o">-</span> <span class="n">sub_df</span><span class="p">[</span><span class="s">'regular_fg_pct'</span><span class="p">]),</span> <span class="n">yerr</span><span class="o">=</span><span class="mi">100</span><span class="o">*</span><span class="n">sub_df</span><span class="p">[</span><span class="s">'regular_fg_pct_std'</span><span class="p">],</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">''</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">capsize</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">errorbar</span><span class="p">([</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">sub_df</span><span class="p">),</span> <span class="mi">100</span><span class="o">*</span><span class="p">(</span><span class="n">sub_df</span><span class="p">[</span><span class="s">'playoff_clutch_fg_pct'</span><span class="p">]</span> <span class="o">-</span> <span class="n">sub_df</span><span class="p">[</span><span class="s">'playoff_fg_pct'</span><span class="p">]),</span> <span class="n">yerr</span><span class="o">=</span><span class="mi">100</span><span class="o">*</span><span class="n">sub_df</span><span class="p">[</span><span class="s">'playoff_fg_pct_std'</span><span class="p">],</span> <span 
class="n">linestyle</span><span class="o">=</span><span class="s">''</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">capsize</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">axhline</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">axhline</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Clutch vs Average FG%"</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_xlim</span><span class="p">([</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">unique_players</span><span class="p">)])</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">np</span><span 
class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">unique_players</span><span class="p">)))</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="n">unique_players</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">xaxis</span><span class="p">.</span><span class="n">set_tick_params</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">90</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Regular season"</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Playoffs "</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_ylim</span><span class="p">([</span><span class="o">-</span><span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">])</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_ylim</span><span class="p">([</span><span class="o">-</span><span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">])</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(-30, 30) </code></pre></div></div> <p><img 
src="/images/2020-04-05-clutch_files/evaluating_clutch_9_1.png" alt="png" /></p> <h2 id="commentary-on-the-analysis">Commentary on the analysis</h2> <p>It’s not particularly fair to just take the difference between clutch fg% and average fg%. During clutch time, it’s usually assumed the team will put the ball in their best player’s hands. For the players we sampled, this will naturally lower their fg% because defenses are focusing more strongly on them, not necessarily because the pressure of the moment is getting to them. Honestly, if your clutch fg% is the same as your average fg%, I’d be satisfied enough to call that player clutch.</p> <p>It’s also at least fun to confirm that superstars play worse in the playoffs (if you compare the two columns in the dataframe). The general consensus is that these players get guarded more tightly and schemed against, so their playoff fg% will be worse than their regular season fg%.</p> <p>To better evaluate “clutch”, it might help to do this on a game-by-game basis. If a player had a hot hand and cooled off during the clutch, that’s bad. If a player was cold and hit some big shots during the clutch, that’s great. In the manner conducted here, these game-by-game fluctuations are averaged out. With a more granular game-by-game method, we would see more dramatic changes in a player’s fg% from game to game, and in that player’s clutch fg% from game to game (more noise in the data).</p> <p>Averaging over all the games also glosses over things like game severity/importance, the teams and players they were up against, and other important factors like the player-in-question’s state of mind when they went into the game or the pressure of the moment. For example, LeBron’s game 6 of the 2012 ECF was a very clutch performance (technically not even during clutch time), but a performance like that just gets averaged out against all other games. 
Other moments like losing 3-1 leads should count as very anti-clutch performances, but those get averaged out too.</p> <h2 id="conclusion">Conclusion</h2> <p>Yes, one could try to take a data-driven approach to study the clutch myth. In this day and age, there’s enough data for someone to try to build a case and argue for its validity. However, I would argue there are still many “unquantifiables” that prevent the clutch myth from truly being scrutinized with numbers. All the complicated, “you had to be there”, test-your-composure moments demonstrate the limitations of data-driven analytics.</p> <p>This notebook can be found <a href="https://github.com/ahy3nz/BallDontLie/tree/master/ballDontLie/clutchMyth">here</a>.</p>Alex H. Yang[email protected]Are some players more “clutch” than others?