Ars de Datus-Scientia Jekyll 2021-05-18T08:04:51+00:00 https://etheleon.github.io/ Wesley GOI https://etheleon.github.io/ [email protected] <![CDATA[Not so big queries, hitchhiker’s guide to datawarehousing with datalakes with Spark]]> https://etheleon.github.io/articles/datalakes 2021-03-17T00:00:00-00:00 2021-03-17T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <h1 id="introduction">Introduction</h1> <p>Since our <a href="https://etheleon.github.io/articles/spark-joy/">earlier post</a>, I’ve defended my thesis and moved on from Honestbee to AirAsia to work on dynamic/automated base pricing for ancillary products. However, COVID-19 happened and I found myself back in the ride-hailing (GrabTransport) and deliveries (GrabFood) industry. Time does pass by quickly.</p> <blockquote> <p>The following post does not represent my employer and reflects my own personal views only.</p> </blockquote> <p>Once again, Spark makes a return in my current job. After using Redshift and GCP’s BigQuery, I’ve developed a working style which separates ETL work (for example, feature engineering), done in SQL, from model training, which starts in a notebook, is then formalised as a class, and finally becomes a script or ML pipeline.</p> <blockquote> <p>In my opinion, <strong>SQL is king</strong> since it is the common tongue amongst data practitioners: Engineers, Scientists and Analysts.</p> </blockquote> <p>As for the ETL portion, nothing too drastic has changed. For quick and dirty data exploration I would use Alation Compose, the web-based SQL workbench offered by the company’s official data catalog, which does not require one to set up login credentials with a local client like DataGrip or DBeaver. Both BigQuery and Alation (Presto) offer the ability to share queries via a link and have excellent access control. Important information about tables is also found in the catalog ie. 
column descriptions, PII and data owners, and is very similar to the features offered by GCP’s Data Catalog.</p> <p>However, this time, instead of Redshift or BigQuery there is Hive. <em>Tables</em> exist within the Hive MetaStore (HMS), which I initially struggled to add and drop tables from. The decoupling of the query engine from the data warehouse was also quite jarring at first, since there were two dialects used by the company: Spark SQL and Presto SQL. With BigQuery, the query engine is part of the data warehouse.</p> <p>This succinct decoupling of (1) storage, (2) query engine and (3) metastore was quite new to me. It was only later, after doing some reading on my own, that I was able to map BigQuery to the current setup. For example, instead of Colossus there is S3 for my file system. And for the query engine, instead of Dremel you would have a constantly available Presto cluster or a transient Spark cluster for heavier lifting jobs. Table definitions and metadata are stored in the Hive MetaStore (HMS).</p> <p><img src="https://etheleon.github.io/images/alation.png" alt="alation" /></p> <p>I only spin up Spark when I need to carry out some in-depth analysis, feature engineering or model building. Previously, at Honestbee, we used an external vendor to maintain our Spark infrastructure, while in the current company we have an in-house team managing the Spark clusters to keep them cost efficient. Similar to BQ’s datasets and tables, we are able to save tables in HMS. However, this was not very clear to me initially, and this post aims to bridge any knowledge gaps when using Spark as the query engine. I might later add an edit or a new post on using Presto to build views in Hive.</p> <h1 id="sparksession">SparkSession</h1> <p>Sadly, the companies I’ve worked in are mostly Python-centric and my use of R has also decreased. 
However, I still use ggplot2 for plotting, although there have been some <a href="http://plotnine.readthedocs.io">developments</a> porting it to Python.</p> <p>Similar to R’s <code class="language-plaintext highlighter-rouge">sparklyr</code> package, in PySpark we create a spark connection of type <code class="language-plaintext highlighter-rouge">SparkSession</code> (it is often aliased as <code class="language-plaintext highlighter-rouge">spark</code>).</p> <p>Since Spark 2.x, <strong>SparkSession</strong> unifies the <strong>SparkContext</strong> and <strong>HiveContext</strong> classes into a single interface. Its use is recommended over the older APIs for code targeting Spark 2.0.0 and above.</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=create_sparksession.r"></script> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=create_session.py"></script> <h2 id="inputoutput">Input/Output</h2> <p><a href="https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html">With HIVE support enabled</a>, Spark can query tables found in HMS whose partitions are known beforehand, or query directly against files stored in buckets if they are not registered in HMS.</p> <h3 id="input">Input</h3> <p>Spark supports querying against a metastore like a traditional data warehouse, but you can also query flat files in S3, much like how you can create tables in BQ from external files, eg. CSV or Parquet.</p> <h4 id="external-files">External files</h4> <p>A common format is Parquet; you could register the data as a table in HMS or just work on it in memory. 
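</p> <p>For instance, when reading only a subset of date partitions directly, you first need the file paths. A plain-Python sketch (no Spark required; the bucket and prefix are from the running example, the helper name is made up) that builds the path list, which you could then hand to <code class="language-plaintext highlighter-rouge">spark.read.parquet(*paths)</code>:</p>

```python
from datetime import date, timedelta

def partition_paths(bucket, prefix, start, end):
    """Build one s3a path per daily partition in [start, end]."""
    n_days = (end - start).days + 1
    return [
        f"s3a://{bucket}/{prefix}/date={start + timedelta(days=d):%Y%m%d}"
        for d in range(n_days)
    ]

paths = partition_paths(
    "datascience-bucket", "wesley.goi/data/pricing/demand_tbl",
    date(2021, 3, 1), date(2021, 3, 3),
)
# three paths, ending in date=20210301, date=20210302, date=20210303
```

<p>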
If you do not need to register this as a table, you can read the files directly into memory like the following.</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=read_partitions.py"></script> <blockquote> <p>You’ll be able to register this as an in-memory view using <code class="language-plaintext highlighter-rouge">df.createTempView("&lt;view_name&gt;")</code> and you might also consider caching to load the whole table into memory. Since this DataFrame only exists in memory and is not registered in HMS, there are no table partitions; however, the in-memory RDDs are partitioned.</p> </blockquote> <h3 id="output">Output</h3> <p>If you would like the results of the ETL/query to persist so you can query them again in the future, you could save them either in Parquet, as an archived intermediate step, or in TensorFlow’s TFRecord format for machine learning.</p> <h4 id="tensorflow">Tensorflow</h4> <p>When working with the TensorFlow framework, the recommended format is TFRecord, read via <code class="language-plaintext highlighter-rouge">tf.data.TFRecordDataset</code>.</p> <p>You can save your results to this format using the following (gzipped to save space):</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=save_to_tfrecord.py"></script> <blockquote> <p>💡 To use the <code class="language-plaintext highlighter-rouge">tfrecords</code> format, remember to include the connector JAR and place it in the <code class="language-plaintext highlighter-rouge">extra_classpath</code>. 
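</p> <p>One way to pull the connector in (a sketch only; the exact mechanism depends on how your clusters are provisioned) is to resolve it from Maven Central at submit time, using the same coordinates as the JAR linked in this note:</p>

```shell
# Resolve the TF connector at submit time; your_etl_job.py is a placeholder.
spark-submit \
  --packages org.tensorflow:spark-tensorflow-connector_2.11:1.15.0 \
  your_etl_job.py
```

<p>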
At the point of writing, <a href="https://repo1.maven.org/maven2/org/tensorflow/spark-tensorflow-connector_2.11/1.15.0/spark-tensorflow-connector_2.11-1.15.0.jar">org.tensorflow:spark-tensorflow-connector_2.11:1.15.0</a> works with the Gzip codec.</p> </blockquote> <h3 id="registering-tables-in-hms-with-parquet-files-in-s3">Registering tables in HMS with parquet files in S3</h3> <p>Similar to how BigQuery stores the underlying data of tables in <code class="language-plaintext highlighter-rouge">capacitor</code>, a columnar file format stored in Google’s file system <code class="language-plaintext highlighter-rouge">colossus</code> (GCS is built on top of Colossus), it’s recommended to store the data in <code class="language-plaintext highlighter-rouge">parquet</code>, also a columnar file format, in S3.</p> <blockquote> <p>TIP: The fastest way to check if the table exists is to run <code class="language-plaintext highlighter-rouge">DESCRIBE schema.table</code></p> </blockquote> <p>In the following example we are going to assume that the parquet files are stored in the following path: <code class="language-plaintext highlighter-rouge">s3://datascience-bucket/wesley.goi/data/pricing/demand_tbl/</code></p> <h3 id="partitioned-tables-in-hms">Partitioned tables in HMS</h3> <p>When working with Spark and HMS, one has to be mindful of the term <strong>partition</strong>. In Spark, the term refers to data partitioning in Resilient Distributed Datasets (RDDs), where partitions are the chunks of data sent to workers/executors for parallel processing. In HMS, the term describes how the data is laid out in the cloud file system, eg. S3, and helps guide queries against the dataset in an efficient manner, which is closer to partitioned tables in databases.</p> <p>First you’ll need to be able to save the data in S3; there’s a specific naming convention for the file path which you’ll need to follow ie. 
<code class="language-plaintext highlighter-rouge">s3://&lt;bucket&gt;/prefix/key=value</code>.</p> <p>As you have seen, one of the most common ways to partition a table is via timestamp eg. <code class="language-plaintext highlighter-rouge">s3://&lt;bucket&gt;/prefix/date=YYYYMMDD</code></p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=simple_partition.sql"></script> <p>One can also partition on multiple columns, although in a nested manner eg. <code class="language-plaintext highlighter-rouge">folder/year=2021/month=03/day=21</code></p> <p>Where the folder structure follows:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Year=yyyy
|---Month=mm
|   |---Day=dd
|   |   |---&lt;parquet-files&gt;
</code></pre></div></div> <blockquote> <p>⚠️: Check whether the external table you’re querying is already partitioned with <code class="language-plaintext highlighter-rouge">SHOW PARTITIONS table</code></p> </blockquote> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=nested_partitions.sql"></script> <blockquote> <p>💡 You can check the number of partitions scanned by running <code class="language-plaintext highlighter-rouge">.explain(mode="formatted")</code> on the DataFrame</p> </blockquote> <h4 id="generate-column-data-type-schema">Generate Column data type schema</h4> <h5 id="manual">Manual</h5> <p>You can manually prepare the table column <a href="https://cloud.google.com/bigquery/docs/schemas">schema</a>, as in BigQuery, save it in a JSON file and parse it.</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=example_schema.json"></script> <h5 id="infer">Infer</h5> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=infer_col_type.py"></script> <h4 id="create-table">Create table</h4> <p>To create a table in Hive, we will be using the CREATE statement from Hive SQL.</p> <script 
src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=create_hive_table.sql"></script> <p>You might also want to check if the table exists: <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=check_if_table_exists.py"></script></p> <h4 id="insert-partitions">Insert partitions</h4> <p>In this example we will be adding <code class="language-plaintext highlighter-rouge">s3://datascience-bucket/wesley.goi/data/pricing/demand_tbl/year=2021/month=01/day=11/hour=01</code></p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=add_partition_to_hive_table.sql"></script> <p>You can check if the partition has been added by running <code class="language-plaintext highlighter-rouge">SHOW PARTITIONS pricing.demand_tbl</code></p> <table> <thead> <tr> <th>partition</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">year=2021/month=01/day=11/hour=01</code></td> </tr> </tbody> </table> <p>However, when you query the table, you’ll notice that you cannot query the partition yet.</p> <p>You’ll still have to refresh the table for that partition:</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=refresh_table.sql"></script> <h4 id="bulk-import">Bulk import</h4> <p>If you have multiple partitions and do not wish to rerun the above for each partition, you can run the MSCK REPAIR command to sync all the partitions to HMS.</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=bulk_update.sql"></script> <h2 id="temp-views--tables">Temp Views / Tables</h2> <p>In the same Spark session, it is possible to create a temp view. 
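</p> <p>Before moving on: the add-partition-then-refresh dance from the previous section is easy to script. A plain-Python sketch (the helper name is made up; table and path are from the running example) that generates the two statements you would pass to <code class="language-plaintext highlighter-rouge">spark.sql(...)</code>:</p>

```python
def add_partition_stmts(table, base_path, **partition):
    """Build the ALTER TABLE + REFRESH statements for one Hive partition.

    Keyword arguments are the partition columns, in order (py3.7+ dicts
    preserve insertion order), e.g. year=2021, month="01".
    """
    spec = ", ".join(f"{col}='{val}'" for col, val in partition.items())
    location = (
        base_path.rstrip("/")
        + "/"
        + "/".join(f"{col}={val}" for col, val in partition.items())
    )
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) LOCATION '{location}'",
        f"REFRESH TABLE {table}",
    ]

stmts = add_partition_stmts(
    "pricing.demand_tbl",
    "s3://datascience-bucket/wesley.goi/data/pricing/demand_tbl",
    year=2021, month="01", day="11", hour="01",
)
```

<p>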
Temp views should not be confused with views in BigQuery; these are not registered in HMS and persist only for the duration of the given <code class="language-plaintext highlighter-rouge">SparkSession</code>.</p> <p>Data is stored in an in-memory columnar format.</p> <p>These are especially useful if the data manipulation is complicated and multi-stepped and you wish to persist some intermediate tables. In BQ, I would just save the intermediate result as a table.</p> <blockquote> <p>NOTE: temp tables == temp views.</p> </blockquote> <p>From a query:</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=create_temp_view.sql"></script> <h2 id="views">Views</h2> <p>Unfortunately, you cannot register a view in Hive using Spark, but you can do so in Presto.</p> <h2 id="sampling">Sampling</h2> <p>Often when training your model, you might need to <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.sample.html">sample</a> from the existing dataset due to memory constraints.</p> <p>You might also want to set a seed when caching if you are doing hyperparameter tuning, so you will get the same dataset on each iteration. 
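</p> <p>In PySpark this is <code class="language-plaintext highlighter-rouge">df.sample(withReplacement=False, fraction=..., seed=...)</code>. The effect of pinning the seed can be illustrated with plain Python’s <code class="language-plaintext highlighter-rouge">random</code> module (a loose analogy only: Spark’s <code class="language-plaintext highlighter-rouge">fraction</code> is a per-row probability, not an exact sample size):</p>

```python
import random

def draw(rows, fraction, seed):
    """Sample without replacement; a fixed seed gives a fixed sample."""
    rng = random.Random(seed)
    return rng.sample(rows, int(len(rows) * fraction))

rows = list(range(1000))
# Same seed -> identical sample on every hyperparameter-tuning iteration
assert draw(rows, 0.1, seed=42) == draw(rows, 0.1, seed=42)
# Different seeds -> (almost certainly) different samples
assert draw(rows, 0.1, seed=1) != draw(rows, 0.1, seed=2)
```

<p>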
Also set the <code class="language-plaintext highlighter-rouge">withReplacement</code> parameter to <code class="language-plaintext highlighter-rouge">False</code>.</p> <h1 id="caching">Caching</h1> <p>With the ANSI SQL statement, caching is not lazy; the table is stored in memory immediately.</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=cache_table.sql"></script> <blockquote> <p>Compared to PySpark’s <code class="language-plaintext highlighter-rouge">df.cache()</code> (where you’ll have to run <code class="language-plaintext highlighter-rouge">df.count()</code> to force the table to be loaded into memory), the above SQL statement is not lazy and will store the table in memory once executed.</p> </blockquote> <h2 id="udfs">UDFs</h2> <p>User-Defined Functions (UDFs) let you define your own functions, which you can write in Python before registering them for use in SQL with <code class="language-plaintext highlighter-rouge">spark.udf.register</code>.</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=create_udf.py"></script> <blockquote> <p><em>NOTE</em>: If a UDF requires C binaries which need to be compiled, you’ll need to install them in the image used by the worker nodes.</p> </blockquote> <h2 id="sql-hints">SQL Hints</h2> <p>Hints go back as early as Spark 2.2, which introduced the <code class="language-plaintext highlighter-rouge">BROADCAST</code> hint. They can be grouped into several categories.</p> <h3 id="repartitioning">Repartitioning</h3> <p>By default, repartitioning produces 200 partitions. You might not want this, and to optimise the query you can <em>hint</em> Spark otherwise:</p> <ol> <li><code class="language-plaintext highlighter-rouge">REPARTITION</code></li> <li><code class="language-plaintext highlighter-rouge">COALESCE</code> only reduces the number of partitions; it is an optimised version of repartition. 
Data is kept on the original nodes, and only the partitions which need to be moved are moved (see example below)</li> <li><code class="language-plaintext highlighter-rouge">REPARTITION_BY_RANGE</code> eg. you have records with a running id from 0 to 100000 and you want to split them into 3 partitions: <code class="language-plaintext highlighter-rouge">repartitionByRange(3, col)</code></li> </ol> <p>When coalescing, you’re shrinking the number of nodes on which the data is kept, eg. from 4 to 2:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># original
Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12

# Coalescing from 4 to 2 partitions:
Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)
</code></pre></div></div> <p>You can also improve query time by including columns when repartitioning, especially if you are joining on these columns. This applies to tables as well as temp views.</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=hints.sql"></script> <blockquote> <p>You can also chain multiple repartition hints: repartition(100), coalesce(500) and repartition by range for column <code class="language-plaintext highlighter-rouge">c</code> into 3 partitions</p> </blockquote> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=chain_hints.sql"></script> <p>Per the <a href="https://spark.apache.org/docs/3.0.0/sql-ref-syntax-qry-select-hints.html">Spark SQL hints documentation</a>, the optimised plan is as follows:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Repartition to 100 partitions with respect to column c.
== Optimized Logical Plan ==
Repartition 100, true
+- Relation[name#29,c#30] parquet
</code></pre></div></div> <p>Often the number of records per partition is not equal, especially if you’re partitioning by time, and you might end up with the number of records per partition following a cyclic pattern, eg. 
traffic at night is much lighter than traffic in the day.</p> <p><img src="https://etheleon.github.io/images/partitions_imbalance.png" alt="partition_imbal" /></p> <h3 id="join-hints">Join hints</h3> <ul> <li><strong>BROADCAST JOIN</strong> replicates the full dataset (<em>if it can fit into the memory</em> of the workers) to all nodes</li> </ul> <p>These are useful for selective joins (where the output is expected to be small), when memory is not an issue and it’s the right table in a left join.</p> <p><img src="https://1.bp.blogspot.com/-s_HQfPph6z4/WcnjxGVNFkI/AAAAAAAAERM/9HfKO6H_SskkykKa_UaDRCo8URafsjixQCLcBGAs/s1600/Screen%2BShot%2B2017-09-25%2Bat%2B10.19.36%2BPM.png" alt="broadcast" /></p> <ul> <li><strong>MERGE</strong>: shuffle sort merge join</li> <li><strong>SHUFFLE_HASH</strong>: shuffle hash join. If both sides have the shuffle hash hints, Spark chooses the smaller side (based on stats) as the build side.</li> <li><strong>SHUFFLE_REPLICATE_NL</strong>: shuffle-and-replicate nested loop join</li> </ul> <h1 id="adaptive-query-execution-aqe">Adaptive Query Execution (AQE)</h1> <p>Another new feature which comes with Spark 3 is AQE. Previously, the query plan was fixed prior to execution and no optimisation was done thereafter.</p> <h2 id="partitions">Partitions</h2> <p>One of the areas ripe for optimisation during execution is determining the optimum number of partitions. 
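</p> <p>The intuition behind AQE’s partition coalescing: after a shuffle, many small partitions can be merged into fewer, target-sized ones using the actual sizes observed at runtime. A toy illustration in plain Python (this is not Spark’s actual algorithm; names and the target size are made up):</p>

```python
def coalesce_small_partitions(sizes_mb, target_mb=64):
    """Greedily merge adjacent shuffle partitions until each merged
    partition is close to the target size (toy version of the idea)."""
    merged, current = [], 0
    for size in sizes_mb:
        if current and current + size > target_mb:
            merged.append(current)
            current = 0
        current += size
    if current:
        merged.append(current)
    return merged

# Eight small shuffle partitions collapse into three near-target ones
print(coalesce_small_partitions([10, 5, 20, 8, 30, 2, 40, 12]))  # [43, 32, 52]
```

<p>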
By default, <code class="language-plaintext highlighter-rouge">spark.sql.shuffle.partitions</code> is set to 200; in cases where the dataset is small this number would be too large, while the reverse is also true.</p> <h3 id="broadcast-joins">Broadcast Joins</h3> <p>If the table on either side is smaller than the broadcast hash join threshold, sort merge joins are automatically converted to broadcast joins.</p> <blockquote> <p>You can try this <a href="https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.53314607.646459968.1595449158-1487382839.1592553333">AQE Demo - Databricks</a></p> </blockquote> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=spark_three_opts.py"></script> <p>A <code class="language-plaintext highlighter-rouge">CustomShuffleReader</code> node indicates that AQE is in use, and the plan ends with <code class="language-plaintext highlighter-rouge">AdaptiveSparkPlan</code>.</p> <p><a href="https://etheleon.github.io/articles/datalakes/">Not so big queries, hitchhiker’s guide to datawarehousing with datalakes with Spark</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on March 17, 2021.</p> <![CDATA[Spark Joy - Saying Konmari to your event logs with grammar of data manipulation]]> https://etheleon.github.io/articles/spark-joy 2019-02-20T00:00:00-00:00 2019-02-20T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <h1 id="sparklyr-joy">Sparklyr Joy</h1> <p>When you have a tonne of event logs to parse, what should the go-to <em>weapon</em> of choice be? In this article I’ll share our experience of using Spark/sparklyr to tackle this.</p> <p>At Honestbee 🐝, our event logs are stored in AWS S3, delivered to us by <a href="https://segment.com/blog/exactly-once-delivery/">Segment</a> at 40-minute intervals. 
The Data(Science) team uses these logs to evaluate the performance of our machine learning models, as well as to compare them against each other via canonical AB testing.</p> <p>In addition, we also use the same logs to track business KPIs like <strong>C</strong>lick <strong>T</strong>hrough <strong>R</strong>ate, <strong>C</strong>onversion <strong>R</strong>ate and GMV.</p> <p>In this article, I will share how we leverage high-memory clusters running Spark to parse the logs generated by the Food Recommender System.</p> <p><img src="https://raw.githubusercontent.com/etheleon/etheleon.github.io/master/images/recommender.png" alt="" /></p> <p><strong>Fig:</strong> Whenever an Honestbee customer proceeds to checkout, our ML models will try their best at making personalised predictions of which items you’ll most likely add to cart, especially things which you may have missed.</p> <blockquote> <p>A <em>post mortem</em> will require us to look through event logs to see which treatment group, based on a weighted distribution, a user has been assigned to.</p> </blockquote> <p>Now, LET’S DIVE IN!</p> <p>Let’s begin by importing the necessary libraries:</p> <script src="https://gist.github.com/etheleon/581eeeefed17530f60caa53262232a84.js"></script> <h1 id="connecting-with-the-high-memory-spark-cluster">Connecting with the high memory spark cluster</h1> <p>Next, we’ll need to connect with the Spark master node.</p> <h3 id="local-cluster">Local Cluster</h3> <p>Normally, if you’re connecting to a locally installed Spark cluster, you’ll set master as <code class="language-plaintext highlighter-rouge">local</code>.</p> <p>Luckily <code class="language-plaintext highlighter-rouge">sparklyr</code> already comes with an inbuilt function to install Spark on your local machine:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sparklyr</span><span class="o">::</span><span class="n">spark_install</span><span class="p">(</span><span 
class="w"> </span><span class="n">version</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"2.4.0"</span><span class="p">,</span><span class="w"> </span><span class="n">hadoop_version</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"2.7"</span><span class="w"> </span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <blockquote> <p>We are installing Hadoop together with Spark because the module required to read files from the S3 filesystem comes with Hadoop.</p> </blockquote> <p>Next you’ll connect with the cluster and establish a spark connection, <code class="language-plaintext highlighter-rouge">sc</code>.</p> <script src="https://gist.github.com/etheleon/673e551a5573358038896b6dada50721.js"></script> <blockquote> <p><strong>Caution:</strong> At Honestbee we do not have a local cluster, so the closest we have is a LARGE EC2 instance which sometimes gives out; you probably want a <em>managed</em> cluster set up by DEs or a 3rd-party vendor who knows how to deal with cluster management.</p> </blockquote> <h3 id="remote-clusters">Remote Clusters</h3> <p><em>Alternatively</em>, there’s also the option of connecting with a remote cluster via a REST API, ie. the R process is not running on the master node but on a remote machine. Often these are managed by 3rd-party vendors. At Honestbee, we chose this option and the clusters are provisioned by <a href="https://www.qubole.com/">Qubole</a> under our AWS account. PS. Pretty good deal!</p> <script src="https://gist.github.com/etheleon/2d61d1f5a83d1026b5f3dfa9eaa989b3.js"></script> <p>The gist above sets up a spark connection, <code class="language-plaintext highlighter-rouge">sc</code>; you will need to use this object in most of the functions that follow.</p> <p>Separately, because we are reading from S3, we will have to set the S3 access keys and secret. 
This has to be set before executing functions like <code class="language-plaintext highlighter-rouge">spark_read_json</code>.</p> <script src="https://gist.github.com/etheleon/f05cc79ad5cd6dc0eb3dbfc2e1bbedcc.js"></script> <blockquote> <p>So you would ask: what are the pros and cons of each? Local clusters are generally good for EDA, while with a remote cluster you will be communicating through a REST API (Livy).</p> </blockquote> <h1 id="reading-json-logs">Reading JSON logs</h1> <p>There are essentially two ways to read logs. The first is to read them in as whole chunks; the other is as a stream — as they get dumped into your bucket.</p> <p>There are two functions, <code class="language-plaintext highlighter-rouge">spark_read_json</code> and <code class="language-plaintext highlighter-rouge">stream_read_json</code>; the former is batched and the latter creates a structured data stream. There are also equivalent functions for reading your Parquet files.</p> <h2 id="batched">Batched</h2> <p>The path should be set with the <code class="language-plaintext highlighter-rouge">s3a</code> protocol. 
<code class="language-plaintext highlighter-rouge">s3a://segment_bucket/segment-logs/&lt;source_id&gt;/1550361600000</code></p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">json_input</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">spark_read_json</span><span class="p">(</span><span class="w"> </span><span class="n">sc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sc</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="o">=</span><span class="w"> </span><span class="s2">"logs"</span><span class="p">,</span><span class="w"> </span><span class="n">path</span><span class="o">=</span><span class="w"> </span><span class="n">s3</span><span class="p">,</span><span class="w"> </span><span class="n">overwrite</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>Below’s where the magic begins:</p> <script src="https://gist.github.com/etheleon/2a513d1000c38020a3a834a34e6d5e03.js"></script> <p>As you can see it’s a simple query,</p> <ol> <li>Filter for all <code class="language-plaintext highlighter-rouge">Added to Cart</code> events from the <code class="language-plaintext highlighter-rouge">Food</code> vertical</li> <li>Select following columns: <ul> <li><code class="language-plaintext highlighter-rouge">CartID</code></li> <li><code class="language-plaintext highlighter-rouge">experiment_id</code></li> <li><code class="language-plaintext highlighter-rouge">variant</code> (treatment_group) and</li> <li><code class="language-plaintext highlighter-rouge">timestamp</code></li> </ul> </li> <li>Remove events where users were not assigned to a model</li> <li>Add new columns <ul> <li><code class="language-plaintext highlighter-rouge">fulltime</code> readable time</li> <li><code class="language-plaintext 
highlighter-rouge">time</code> the hour of the day</li> </ul> </li> <li>Group the logs by service <code class="language-plaintext highlighter-rouge">recommender</code> and count the number of rows</li> <li>Add a new column <code class="language-plaintext highlighter-rouge">event</code> with the value <code class="language-plaintext highlighter-rouge">Added to Cart</code></li> <li>Sort by time</li> </ol> <h2 id="spark-streams">Spark Streams</h2> <p>Alternatively, you could also write the results of the above manipulation to a structured Spark stream.</p> <script src="https://gist.github.com/etheleon/bf72bee8d790cf4f16d76cbb233f7a9d.js"></script> <p>You can preview the results from the stream using the <code class="language-plaintext highlighter-rouge">tbl</code> function coupled with <code class="language-plaintext highlighter-rouge">glimpse</code>.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sc %&gt;% tbl("data_stream") %&gt;% glimpse
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Observations: ??
Variables: 2
Database: spark_connection
$ expt &lt;chr&gt; "Model_A", "Model_B"
$ n    &lt;dbl&gt; 5345, 621
</code></pre></div></div> <p>And that’s it, folks, on using sparklyr with your event logs.</p> <h2 id="model-metadata">Model Metadata</h2> <p><img src="https://raw.githubusercontent.com/etheleon/etheleon.github.io/master/images/model_graph.png" alt="" /></p> <p>With that many models in the wild, it’s hard to keep track of what’s going on. 
During my PhD, I personally worked on using graph databases to store data with complex relationships, and we are currently building such a system to store metadata related to our models.</p> <p>For example:</p> <ol> <li>Which APIs the models are associated with</li> <li>Which Airflow/Argo jobs retrain these models</li> <li>Which Helm charts and deployment metadata these models have</li> <li>And of course metadata like performance and scores.</li> </ol> <p>Come talk to us, we are hiring! <a href="https://boards.greenhouse.io/honestbee/jobs/1426737">Data Engineer</a>, <a href="https://boards.greenhouse.io/honestbee/jobs/1427566">Senior Data Scientist</a></p> <p><a href="https://etheleon.github.io/articles/spark-joy/">Spark Joy - Saying Konmari to your event logs with grammar of data manipulation</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on February 20, 2019.</p> <![CDATA[Tidying Up Pandas]]> https://etheleon.github.io/articles/tidying-up-pandas 2018-12-16T00:00:00-00:00 2018-12-16T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <p>For those who use Python’s pandas module daily, the first thing you notice is that there are often more ways than one to do almost everything.</p> <p>The purpose of this article is to demonstrate how we can limit this by drawing inspiration from R’s <code class="language-plaintext highlighter-rouge">dplyr</code> and <code class="language-plaintext highlighter-rouge">tidyverse</code> libraries.</p> <h1 id="tidying-up-pandas">Tidying up pandas?</h1> <p>As an academic, often enough the go-to <em>lingua franca</em> for data science is R. 
Especially if you’re coming from Computational Biology/Bioinformatics or Statistics.</p> <p>And likely you’ll be hooked on the famous <code class="language-plaintext highlighter-rouge">tidyverse</code> meta-package, which includes <code class="language-plaintext highlighter-rouge">dplyr</code> (previously <code class="language-plaintext highlighter-rouge">plyr</code>, for ply(e)r), <code class="language-plaintext highlighter-rouge">lubridate</code> (time-series) and <code class="language-plaintext highlighter-rouge">tidyr</code>.</p> <blockquote> <p>PS. As I am writing this article I realised it isn’t just <code class="language-plaintext highlighter-rouge">tidyverse</code>, but the whole R ecosystem which I’ve come to love whilst doing metagenomics and computational biology in general.</p> </blockquote> <p>For the benefit of those who started from R, <code class="language-plaintext highlighter-rouge">pandas</code> is <em>the</em> dataframe module for Python; several other packages like <a href="https://datatable.readthedocs.io/en/latest/using-datatable.html">datatable</a> exist, and datatable is heavily inspired by R’s own <a href="https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html">datatable</a>.</p> <p>Now back to how the tidyverse, specifically dplyr, organises dataframe manipulation.</p> <p>In his talk, <a href="https://youtu.be/dWjSYqI7Vog?t=2m7s">Hadley Wickham</a> mentioned that what we really need for table manipulation is just a handful of functions.</p> <ul> <li>filter</li> <li>select</li> <li>arrange</li> <li>mutate</li> <li>group_by</li> <li>summarise</li> <li>merge</li> </ul> <p>Although I would argue you need just a bit more. For example, knowing R’s family of <code class="language-plaintext highlighter-rouge">apply</code> functions helps tonnes. 
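</p> <p>For reference, the handful of verbs listed above map quite directly onto pandas; a minimal sketch (toy data, hypothetical column names):</p>

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "a", "b"],
    "petal":   [1.5, 1.25, 4.5],
})

out = (
    df[df["petal"] > 1.0]                          # filter
      [["species", "petal"]]                       # select
      .assign(petal_mm=lambda d: d["petal"] * 10)  # mutate
      .groupby("species", as_index=False)          # group_by
      .agg(mean_mm=("petal_mm", "mean"))           # summarise
      .sort_values("mean_mm")                      # arrange
)
print(out)  # one row per species with its mean petal length in mm
```

<p>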
Or a couple of summary statistics functions like <code class="language-plaintext highlighter-rouge">summary</code> or <code class="language-plaintext highlighter-rouge">str</code> , although nowadays I use <code class="language-plaintext highlighter-rouge">skimr::skim</code> a lot.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">skim</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span><span class="w"> </span><span class="c1">## Skim summary statistics</span><span class="w"> </span><span class="c1">## n obs: 150 </span><span class="w"> </span><span class="c1">## n variables: 5 </span><span class="w"> </span><span class="c1">## </span><span class="w"> </span><span class="c1">## ── Variable type:factor ──────────────────────────────────────────────────────────────────────────────────────────────────</span><span class="w"> </span><span class="c1">## variable missing complete n n_unique top_counts ordered</span><span class="w"> </span><span class="c1">## Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0 FALSE</span><span class="w"> </span><span class="c1">## </span><span class="w"> </span><span class="c1">## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────────</span><span class="w"> </span><span class="c1">## variable missing complete n mean sd p0 p25 p50 p75 p100 hist</span><span class="w"> </span><span class="c1">## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▁▂▅▅▃▁</span><span class="w"> </span><span class="c1">## Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8 2.5 ▇▁▁▅▃▃▂▂</span><span class="w"> </span><span class="c1">## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9 ▂▇▅▇▆▅▂▂</span><span class="w"> </span><span class="c1">## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4 ▁▂▅▇▃▂▁▁</span><span class="w"> </span></code></pre></div></div> <p>In fact, Google’s Facets behaves somewhat like 
this as well (see image below).</p> <p><img src="https://i.imgur.com/F7yQLnz.png" alt="Facets" /></p> <p>Thus, in this post I’ll try my best to demonstrate 1-to-1 mappings of the <code class="language-plaintext highlighter-rouge">tidyverse</code> vocabularies with <code class="language-plaintext highlighter-rouge">pandas</code> methods.</p> <p>For demonstration, we will be using the famous <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">Iris flower dataset</a>.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># python </span> <span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span> <span class="n">iris</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">load_dataset</span><span class="p">(</span><span class="s">"iris"</span><span class="p">)</span> </code></pre></div></div> <p>I’ve chosen to import the iris data using seaborn rather than sklearn’s datasets, which are numpy arrays.</p> <p>The first thing I usually do when I import a table is to run the <code class="language-plaintext highlighter-rouge">str</code> function on the table.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R (iris is already loaded by default)</span><span class="w"> </span><span class="n">str</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># python </span> <span class="n">iris</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="n">null_counts</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="c1"># if there are too many rows, pandas will skip the null count, # so I have to forcibly set `null_counts` to `True`. </span></code></pre></div></div> <h2 id="filter">Filter</h2> <p>The closest pandas method to R’s <code class="language-plaintext highlighter-rouge">filter</code> is <code class="language-plaintext highlighter-rouge">pd.DataFrame.query</code>.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">cutoff</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="w"> </span><span class="n">iris</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">sepal.width</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">cutoff</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>There are two ways to do this in python. 
The first is probably what you’ll find most python users using.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># python </span> <span class="n">cutoff</span> <span class="o">=</span> <span class="mi">30</span> <span class="n">iris</span><span class="p">[</span><span class="n">iris</span><span class="p">.</span><span class="n">sepal_width</span> <span class="o">&gt;</span> <span class="n">cutoff</span><span class="p">]</span> </code></pre></div></div> <p>However, <code class="language-plaintext highlighter-rouge">pd.DataFrame.query()</code> maps more closely with <code class="language-plaintext highlighter-rouge">dplyr::filter()</code>.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">iris</span><span class="p">.</span> \ <span class="n">query</span><span class="p">(</span><span class="s">"sepal_width &gt; @cutoff"</span><span class="p">)</span> <span class="c1"># this uses a SQL-like mini-language </span></code></pre></div></div> <blockquote> <p>One downside of using this is that linters which follow the <code class="language-plaintext highlighter-rouge">pep8</code> convention, like <code class="language-plaintext highlighter-rouge">flake8</code>, will complain about the <code class="language-plaintext highlighter-rouge">cutoff</code> variable being unused even though it has been declared. This is because the linters are unable to recognise the use of <code class="language-plaintext highlighter-rouge">cutoff</code> inside the quoted query string.</p> </blockquote> <p>Surprisingly, filter makes a return in pySpark. 
:)</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># python (pyspark) </span> <span class="nb">type</span><span class="p">(</span><span class="n">flights</span><span class="p">)</span> <span class="n">pyspark</span><span class="p">.</span><span class="n">sql</span><span class="p">.</span><span class="n">dataframe</span><span class="p">.</span><span class="n">DataFrame</span> <span class="c1"># filters flights which are &gt; 1000 miles long </span><span class="n">flights</span><span class="p">.</span><span class="nb">filter</span><span class="p">(</span><span class="s">'distance &gt; 1000'</span><span class="p">)</span> </code></pre></div></div> <h2 id="select">Select</h2> <p>This is reminiscent of SQL’s <code class="language-plaintext highlighter-rouge">select</code> keyword which allows you to choose columns.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">iris</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">sepal.width</span><span class="p">,</span><span class="w"> </span><span class="n">sepal.length</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">iris</span> \ <span class="p">.</span><span class="n">loc</span><span class="p">[:</span><span class="mi">5</span><span class="p">,</span> <span class="p">[</span><span class="s">"sepal_width"</span><span class="p">,</span> <span class="s">"sepal_length"</span><span class="p">]]</span> <span class="c1"># selects rows with labels 0 to 5 (loc is label-inclusive) </span></code></pre></div></div> <p>Initially, I thought the following <code class="language-plaintext highlighter-rouge">df[['col1', 'col2']]</code> pattern would be a good map. But I quickly realised we cannot do slices of the columns similar to <code class="language-plaintext highlighter-rouge">select</code>.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">iris</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">Sepal.Length</span><span class="o">:</span><span class="n">Petal.Width</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">iris</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">"sepal_length"</span><span class="p">:</span><span class="s">"petal_width"</span><span class="p">]</span> </code></pre></div></div> <p>A thing to note about the <code class="language-plaintext highlighter-rouge">loc</code> method is that it could return a Series instead of a DataFrame when the selection is just one row. 
So you’ll have to slice it with a list in order to return a DataFrame.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Python
iris.loc[1, :]    # returns a Series
iris.loc[[1], :]  # returns a DataFrame
</code></pre></div></div> <p>But the really awesome thing about the <code class="language-plaintext highlighter-rouge">select</code> function is its ability to <em>unselect</em> columns, which is missing in the <code class="language-plaintext highlighter-rouge">loc</code> method.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">col1</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>You have to use the <code class="language-plaintext highlighter-rouge">.drop()</code> method.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"col1"</span><span class="p">])</span> </code></pre></div></div> <blockquote> <p>Note I had to pass the <code class="language-plaintext highlighter-rouge">columns</code> parameter because <code class="language-plaintext highlighter-rouge">drop</code> is not only used to drop columns; the method can also drop rows based on their index.</p> </blockquote> <p>Like <code class="language-plaintext highlighter-rouge">filter</code>, <code class="language-plaintext highlighter-rouge">select</code> is also used in pySpark!</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># python (pySpark) </span> <span class="n">df</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="s">"xyz"</span><span class="p">).</span><span class="n">show</span><span class="p">()</span> <span class="c1"># shows the column xyz of the spark dataframe. </span> <span class="c1"># alternative </span><span class="n">df</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">xyz</span><span class="p">)</span> </code></pre></div></div> <h2 id="arrange">Arrange</h2> <p>The arrange function lets one sort the table by a particular column.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">col1</span><span class="p">))</span><span class="w"> </span></code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">df</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s">"col1"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="c1"># everything is reversed in python fml. 
</span></code></pre></div></div> <h2 id="mutate">Mutate</h2> <p><code class="language-plaintext highlighter-rouge">dplyr</code>’s <code class="language-plaintext highlighter-rouge">mutate</code> was really an upgrade from R’s <code class="language-plaintext highlighter-rouge">apply</code>.</p> <blockquote> <p><strong>NOTE</strong>: Other applies which are useful in R include, for example, <code class="language-plaintext highlighter-rouge">mapply</code> and <code class="language-plaintext highlighter-rouge">lapply</code>.</p> </blockquote> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="n">new</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">something</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">col2</span><span class="p">,</span><span class="w"> </span><span class="n">newcol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col</span><span class="m">+1</span><span class="w"> </span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">iris</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span> <span class="n">new</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">sepal_width</span> <span class="o">/</span> <span class="n">iris</span><span class="p">.</span><span class="n">sepal_length</span><span class="p">,</span> <span class="n">newcol</span> <span class="o">=</span> <span class="k">lambda</span> <span 
class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">"col"</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span> <span class="p">)</span> </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">tidyverse</code>’s <code class="language-plaintext highlighter-rouge">mutate</code> function by default takes the whole column and does vectorised operations on it. If you want to apply the function row by row, you’ll have to couple <code class="language-plaintext highlighter-rouge">rowwise</code> with <code class="language-plaintext highlighter-rouge">mutate</code>.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="c1"># my_function does not take vectorised input of the entire column</span><span class="w"> </span><span class="c1"># this will fail</span><span class="w"> </span><span class="n">iris</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">new_column</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">my_function</span><span class="p">(</span><span class="n">sepal.width</span><span class="p">,</span><span class="w"> </span><span class="n">sepal.length</span><span class="p">))</span><span class="w"> </span><span class="c1"># this will force mutate to be applied row by row</span><span class="w"> </span><span class="n">iris</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">rowwise</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">new_column</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">my_function</span><span 
class="p">(</span><span class="n">sepal.width</span><span class="p">,</span><span class="w"> </span><span class="n">sepal.length</span><span class="p">))</span><span class="w"> </span></code></pre></div></div> <p>To achieve the same using the <code class="language-plaintext highlighter-rouge">.assign</code> method you can nest an <code class="language-plaintext highlighter-rouge">apply</code> inside the function.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="kn">import</span> <span class="nn">re</span> <span class="k">def</span> <span class="nf">do_something_string</span><span class="p">(</span><span class="n">col</span><span class="p">):</span> <span class="k">if</span> <span class="n">re</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s">".*(osa)$"</span><span class="p">,</span> <span class="n">col</span><span class="p">):</span> <span class="n">value</span> <span class="o">=</span> <span class="s">"is_setosa"</span> <span class="k">else</span><span class="p">:</span> <span class="n">value</span> <span class="o">=</span> <span class="s">"not_setosa"</span> <span class="k">return</span> <span class="n">value</span> <span class="n">iris</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span> <span class="n">transformed_species</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">df</span><span class="p">:</span> <span class="n">df</span><span class="p">[</span><span class="s">"species"</span><span class="p">]</span> \ <span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">do_something_string</span><span class="p">)</span> <span class="p">)</span> </code></pre></div></div> <p>If you’re lazy, you could just chain two anonymous functions together.</p> <div 
class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">iris</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span> <span class="n">transformed_species</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">df</span><span class="p">:</span> <span class="n">df</span><span class="p">.</span><span class="n">species</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">do_something_string</span><span class="p">))</span> </code></pre></div></div> <h2 id="apply">Apply</h2> <p>From R’s <code class="language-plaintext highlighter-rouge">apply</code> help docs:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apply(X, MARGIN, FUN, ...) </code></pre></div></div> <p>Where the value of <code class="language-plaintext highlighter-rouge">MARGIN</code> takes either <code class="language-plaintext highlighter-rouge">1</code> or <code class="language-plaintext highlighter-rouge">2</code> for (rows, columns), ie. 
if you want to apply the function to each row, you set <code class="language-plaintext highlighter-rouge">MARGIN</code> to <code class="language-plaintext highlighter-rouge">1</code>.</p> <p>However, in pandas <code class="language-plaintext highlighter-rouge">axis</code> refers to which labels (the index <em>i</em> or the columns <em>j</em>) will be used as the applied function’s input parameter’s index.</p> <p>Axis <code class="language-plaintext highlighter-rouge">0</code> refers to the DataFrame’s index and axis <code class="language-plaintext highlighter-rouge">1</code> refers to the columns.</p> <p><img src="https://i.imgur.com/uNOGXVT.png" alt="Imgur" /></p> <p>So in R, if you wanted to carry out row-wise operations you would set <code class="language-plaintext highlighter-rouge">MARGIN</code> to <code class="language-plaintext highlighter-rouge">1</code>.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">row</span><span class="p">){</span><span class="w"> </span><span class="n">...</span><span class="w"> </span><span class="n">do</span><span class="w"> </span><span class="n">some</span><span class="w"> </span><span class="n">compute</span><span class="w"> </span><span class="n">...</span><span class="w"> </span><span class="p">})</span><span class="w"> </span></code></pre></div></div> <blockquote> <p>I rarely do that now, since <code class="language-plaintext highlighter-rouge">plyr</code> and later <code class="language-plaintext highlighter-rouge">dplyr</code> came along.</p> </blockquote> <p>However there is no <code class="language-plaintext highlighter-rouge">plyr</code> in pandas, so we have to go back to using apply for row-wise operations; the axis, however, is now 1 not 0. I initially found this very confusing. 
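<p>To make the pandas convention concrete, a minimal sketch with toy values (only the column names are borrowed from the iris example):</p>

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.0, 6.0],
    "sepal_width":  [3.0, 4.0],
})

# axis=0: the function receives each *column* as a Series
col_sums = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each *row* as a Series,
# whose index is the DataFrame's column names
row_sums = df.apply(lambda row: row.sum(), axis=1)
```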
The reason is because the <em>row</em> is really just a <code class="language-plaintext highlighter-rouge">pandas.Series</code> whose index is the parent pandas.DataFrame’s columns. Thus the axis argument refers to which axis to set as the index.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># python </span> <span class="n">iris</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">do_something</span><span class="p">(</span><span class="n">row</span><span class="p">),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> </code></pre></div></div> <p>An interesting pattern, which I do not use in R, is to use apply on columns, in this case <code class="language-plaintext highlighter-rouge">pandas.Series</code> objects.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># python </span> <span class="n">iris</span><span class="p">.</span><span class="n">sepal_width</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># for a fancy progress bar, call tqdm.pandas() first, then use progress_apply </span><span class="n">iris</span><span class="p">.</span><span class="n">sepal_width</span><span class="p">.</span><span class="n">progress_apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># if you need a parallel apply # this works with dask underneath </span><span 
class="kn">import</span> <span class="nn">swifter</span> <span class="n">iris</span><span class="p">.</span><span class="n">sepal_width</span><span class="p">.</span><span class="n">swifter</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> </code></pre></div></div> <p>In R, one of the common idioms, which I keep going back to for a parallel version of <code class="language-plaintext highlighter-rouge">groupby</code> is as follows:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">unique_list</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w"> </span><span class="n">...</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">col</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">do_something</span><span class="p">()</span><span class="w"> </span><span class="c1"># do something to the subset</span><span class="w"> </span><span class="n">...</span><span class="w"> </span><span class="p">})</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="n">rbind</span><span class="p">,</span><span class="n">.</span><span class="p">)</span><span class="w"> 
</span></code></pre></div></div> <p>If you want a parallel version, you just have to change the <code class="language-plaintext highlighter-rouge">lapply</code> to <code class="language-plaintext highlighter-rouge">mclapply</code>, which comes from the <code class="language-plaintext highlighter-rouge">parallel</code> / <code class="language-plaintext highlighter-rouge">snow</code> library in R.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">ncores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="c1"># the number of cores</span><span class="w"> </span><span class="n">unique_list</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mclapply</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w"> </span><span class="n">...</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">col</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">do_something</span><span class="p">()</span><span class="w"> </span><span class="c1"># do something to the subset</span><span class="w"> </span><span class="n">...</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="n">mc.cores</span><span class="o">=</span><span class="n">ncores</span><span class="p">)</span><span class="w"> </span><span 
class="o">%&gt;%</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="n">rbind</span><span class="p">,</span><span class="n">.</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>Separately, in pySpark, you can split the whole table into partitions and do the manipulations in parallel.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python (pyspark) </span> <span class="n">dd</span><span class="p">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">my_df</span><span class="p">,</span><span class="n">npartitions</span><span class="o">=</span><span class="n">nCores</span><span class="p">).</span>\ <span class="n">map_partitions</span><span class="p">(</span> <span class="k">lambda</span> <span class="n">df</span> <span class="p">:</span> <span class="n">df</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span> <span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">nearest_street</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">lat</span><span class="p">,</span><span class="n">x</span><span class="p">.</span><span class="n">lon</span><span class="p">),</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)).</span>\ <span class="n">compute</span><span class="p">(</span><span class="n">get</span><span class="o">=</span><span class="n">get</span><span class="p">)</span> <span class="c1"># imports at the end </span></code></pre></div></div> <p>To achieve the same, what we can use the <code class="language-plaintext highlighter-rouge">dask</code>, or a higher level wrapper from the <code class="language-plaintext highlighter-rouge">swiftapply</code> library.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="c1"># Python </span> <span class="c1"># you can easily vectorise the example using by adding the `swift` method before `.apply` </span><span class="n">series</span><span class="p">.</span><span class="n">swift</span><span class="p">.</span><span class="nb">apply</span><span class="p">()</span> </code></pre></div></div> <h2 id="group-by">Group by</h2> <p>The <code class="language-plaintext highlighter-rouge">.groupby</code> method in pandas is equivalent to R function <code class="language-plaintext highlighter-rouge">dplyr::group_by</code> returning a <code class="language-plaintext highlighter-rouge">DataFrameGroupBy</code> object.</p> <blockquote> <p>In Tidyverse there’s the <code class="language-plaintext highlighter-rouge">ungroup</code> function to ungroup grouped DataFrames, in order to achieve the same, there does not exists a1-to-1 mappable function.</p> <p>One way is to complete the <code class="language-plaintext highlighter-rouge">groupby</code> -&gt; <code class="language-plaintext highlighter-rouge">apply</code> (two-step process) and feeding apply with an identity function <code class="language-plaintext highlighter-rouge">apply(lambda x: x)</code>. Which is an identity function.</p> </blockquote> <h2 id="summarise">Summarise</h2> <p>In pandas the equivalent of the <code class="language-plaintext highlighter-rouge">summarise</code> function is <code class="language-plaintext highlighter-rouge">aggregate</code> abbreviated as the <code class="language-plaintext highlighter-rouge">agg</code> function. 
And you will have to couple this with <code class="language-plaintext highlighter-rouge">groupby</code>, so it’ll again be a similar two-step <code class="language-plaintext highlighter-rouge">groupby</code> -&gt; <code class="language-plaintext highlighter-rouge">agg</code> transformation.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># R
r_mt = mtcars %&gt;%
    mutate(model = rownames(mtcars)) %&gt;%
    select(cyl, model, hp, drat) %&gt;%
    filter(cyl &lt; 8) %&gt;%
    group_by(cyl) %&gt;%
    summarise(
        hp_mean   = mean(hp),
        drat_mean = mean(drat),
        drat_std  = sd(drat),
        diff      = max(drat) - min(drat)
    ) %&gt;%
    arrange(drat_mean) %&gt;%
    as.data.frame
</code></pre></div></div> <p>The same series of transformations written in Python would be:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Python
def transform1(x):
    return max(x) - min(x)

def transform2(x):
    return max(x) + 5

py_mt = (
    mtcars.
    loc[:, ["cyl", "model", "hp", "drat"]].  # select
    query("cyl &lt; 8").                        # filter
    groupby("cyl").                          # group_by
    agg(                                     # summarise; agg is an abbreviation of aggregate
        {
            'hp': 'mean',
            'drat': ['mean', 'std', transform1, transform2]  # R wins... this sux for pandas
        }).
    sort_values(by=[("drat", "mean")])       # multiindex sort (unique to pandas)
)
py_mt
</code></pre></div></div> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># R
df %&gt;%
    group_by(col) %&gt;%
    summarise(my_new_column = do_something(some_col))
</code></pre></div></div> <h2 id="join">Join</h2> <p>Natively, R supports the <code class="language-plaintext highlighter-rouge">merge</code> function, and similarly in pandas there’s the <code class="language-plaintext highlighter-rouge">pd.merge</code> function.</p> <p>Alongside <code class="language-plaintext highlighter-rouge">merge</code>, dplyr offers the <code class="language-plaintext highlighter-rouge">join</code> family: <code class="language-plaintext highlighter-rouge">left_join</code>, <code class="language-plaintext highlighter-rouge">right_join</code>, <code class="language-plaintext highlighter-rouge">inner_join</code> and <code class="language-plaintext highlighter-rouge">anti_join</code>.</p> <h2 id="inplace">Inplace</h2> <p>In R there’s the compound assignment pipe-operator <code class="language-plaintext highlighter-rouge">%&lt;&gt;%</code>, which is similar to the <code class="language-plaintext highlighter-rouge">inplace=True</code> argument in some pandas functions <em>but not all</em>.
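</p> <p>A minimal sketch of the difference, on a toy frame of my own:</p>

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# mutating style: modifies df in place and returns None
df.drop(columns=["b"], inplace=True)

# reassignment style: returns a new frame, closer to the tidyverse idiom
df2 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = df2.drop(columns=["b"])
```

<p>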
:( Apparently pandas is going to remove <code class="language-plaintext highlighter-rouge">inplace</code> altogether…</p> <h3 id="debugging">Debugging</h3> <p>In R, we have the <code class="language-plaintext highlighter-rouge">browser()</code> function.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># R
unique(iris$Species) %&gt;%
    lapply(function(s){
        browser()
        iris %&gt;% filter(Species == s)
        ....
    })
</code></pre></div></div> <p>It’ll let you <em>step</em> into the function, which is extremely useful if you want to do some debugging.</p> <p>In Python, there’s the <code class="language-plaintext highlighter-rouge">set_trace</code> function.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Python
from IPython.core.debugger import set_trace

(
    iris
    .groupby("species")
    .apply(lambda groupedDF: set_trace())
)
</code></pre></div></div> <p>Last but not least, if you really need to use some R function you can always rely on the <code class="language-plaintext highlighter-rouge">rpy2</code> package. I rely on it a lot for plotting. ggplot2 ftw!</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># python
import rpy2              # imports the library
%load_ext rpy2.ipython   # load the magic
</code></pre></div></div> <blockquote> <p>Sometimes there are issues installing R packages from within R.
You can run</p> </blockquote> <p><code class="language-plaintext highlighter-rouge">conda install -c r r-tidyverse r-ggplot2</code></p> <p>Thereafter you can use R and Python interchangeably in the same Jupyter notebook.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%%R -i python_df -o transformed_df
transformed_df = python_df %&gt;%
    select(-some_columns) %&gt;%
    mutate(newcol = somecol * 2)
</code></pre></div></div> <blockquote> <p>NOTE: <code class="language-plaintext highlighter-rouge">%%R</code> is cell magic and <code class="language-plaintext highlighter-rouge">%R</code> is line magic.</p> </blockquote> <p>If you need outputs to be printed like a normal pandas DataFrame, you can use the single-percent magic:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%R some_dataFrame %&gt;% skim
</code></pre></div></div> <h2 id="elipisis">Ellipsis</h2> <p>In R, one nifty trick you can do is to pass arguments
to inner functions without ever having to define them in the outer function’s signature.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># R
#' Simple function which takes two parameters `one` and `two` and the ellipsis `...`
somefunction = function(one, two, ...){
    three = one + two
    sometwo = function(x, four){
        x + four
    }
    sometwo(three, ...)  # four exists within the ellipsis
}

# because of the ellipsis, we can pass as many parameters as we want;
# the extras are stored in the ellipsis
somefunction(one=2, two=3, four=5, name="wesley")
</code></pre></div></div> <p>In Python, <code class="language-plaintext highlighter-rouge">**kwargs</code> takes the place of <code class="language-plaintext highlighter-rouge">...</code>. Below is an explanation of how exactly it works.</p> <h4 id="explanation">Explanation</h4> <p>Firstly, the double asterisk <code class="language-plaintext highlighter-rouge">**</code> is called the <em>unpack</em> operator (it’s placed before a parameter name, eg.
<code class="language-plaintext highlighter-rouge">kwargs</code> so together it’ll look like <code class="language-plaintext highlighter-rouge">**kwargs</code>).</p> <blockquote> <p>The convention is to let that variable be named <code class="language-plaintext highlighter-rouge">kwargs</code> (which stands for <strong>k</strong>ey<strong>w</strong>orded arguments) but it could be named anything.</p> </blockquote> <p>Most articles which describe the unpack operator will start off with <strong>this</strong> explanation: where dictionaries are used to pass functions their parameters.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">adictionary</span> <span class="o">=</span> <span class="p">{</span> <span class="s">'first'</span> <span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s">'second'</span><span class="p">:</span> <span class="mi">2</span> <span class="p">}</span> <span class="k">def</span> <span class="nf">some_function</span><span class="p">(</span><span class="n">first</span><span class="p">,</span> <span class="n">second</span><span class="p">):</span> <span class="k">return</span> <span class="n">first</span> <span class="o">+</span> <span class="n">second</span> <span class="n">some_function</span><span class="p">(</span><span class="o">**</span><span class="n">adictionary</span><span class="p">)</span> <span class="c1"># which gives 3 </span></code></pre></div></div> <p><img src="https://i.imgur.com/ggSP2dK.jpg" alt="unpacking" /></p> <p>But you could also twist this around and set <code class="language-plaintext highlighter-rouge">**kwargs</code> as a function signature. 
Doing this lets you key in an arbitrary number of keyword arguments when calling the function.</p> <p>The keyword-value pairs are wrapped into a dictionary named <code class="language-plaintext highlighter-rouge">kwargs</code>, which is accessible inside the function.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Python
# dummy function which prints `kwargs`
def some_function(**kwargs):
    print(kwargs)

some_function(first=1, second=2)
</code></pre></div></div> <p>The previous two cases are not exclusive; you could actually ~<strong><em>mix</em></strong>~ them together, ie.
have named parameters as well as <code class="language-plaintext highlighter-rouge">**kwargs</code>.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Python
adictionary = {
    'first' : 1,
    'second': 2,
    'useless_value' : "wesley"
}

def some_function(first, second, **kwargs):
    print(kwargs)
    return first + second

print(some_function(**adictionary))
</code></pre></div></div> <p>The printed output will be <code class="language-plaintext highlighter-rouge">{'useless_value': 'wesley'}</code> followed by <code class="language-plaintext highlighter-rouge">3</code>.</p> <p>This allows a Python function to accept as many keyword arguments as you supply it. Those which are already named in the function’s declaration are bound directly.
And those which do not appear in the declaration can be accessed from <code class="language-plaintext highlighter-rouge">kwargs</code>.</p> <p>By putting <code class="language-plaintext highlighter-rouge">**kwargs</code> as an argument in the inner function, you’re basically unwrapping the dictionary into the function’s parameters.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Python
def somefunction(one, two, **kwargs):
    print(f"outer function:\n\t{kwargs}")
    three = one + two
    def sometwo(x, four, **kwargs):
        print(f"inner function:\n\t{kwargs}")
        return x + four
    return sometwo(three, **kwargs)

somefunction(one=2, two=3, four=5, name="wesley")
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>outer function:
    {'four': 5, 'name': 'wesley'}
inner function:
    {'name': 'wesley'}
</code></pre></div></div> <p>Let’s now compare this with the original R ellipsis.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># R
#' Simple function which takes two parameters `one` and `two` and the ellipsis `...`
somefunction = function(one, two, ...){
    three = one + two
    sometwo = function(x, four){
        x + four
    }
    sometwo(three, ...)  # four exists within the ellipsis
}

# because of the ellipsis, we can pass as many parameters as we want;
# the extras are stored in the ellipsis
somefunction(one=2, two=3, four=5, name="wesley")
</code></pre></div></div> <h2 id="conclusion">Conclusion</h2> <p>There are many ways to do things in pandas, often more than the one tidyverse way, and I wish this were clearer.</p> <p>Additionally, something which caught me off guard after coming to Honestbee was the amount of SQL I needed.</p> <p>For example, PostgreSQL to query RDS and its dialect for querying Redshift, <a href="https://www.confluent.io/product/ksql/">KSQL</a> for querying data streams via Kafka, and Athena’s query language, built on top of Presto, for querying S3, where most of the data used to exist in parquet files.</p> <p>This shows one big deviation from academia: data in a company is usually stored in a database / data lake / data stream, whereas in academia it’s usually just one big flat data file.</p> <p>We’ve come to the end of this attempt at mapping tidyverse vocabularies
to pandas; hope you’ve found this informative and useful! See you guys soon!</p> <p><a href="https://etheleon.github.io/articles/tidying-up-pandas/">Tidying Up Pandas</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on December 16, 2018.</p> <![CDATA[Has the ship sailed for Microbiome research?]]> https://etheleon.github.io/articles/esearch 2016-12-21T00:00:00-00:00 2017-11-02T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <p>Going about doctoral thesis writing (a love-hate relationship), the thought occurred to me that <em>perhaps</em> the very field I’m writing about has already lived past its Golden Era, <strong>or has it</strong>?</p> <p>A knee-jerk reaction was to see if there’s any Python or R package which would let me search the abstracts for the keyword microbiome… this turned up <a href="https://github.com/titipata/pubmed_parser/">pubmed_parser</a>. However, in order to get it running, I would first have to download a few gigabytes of abstracts in XML from the Open Access subset of pubmed abstracts and run my own pyspark…</p> <p><strong>NOPE! Not going there!</strong></p> <p>Then it occurred to me that perhaps NCBI has something I could use… in the previous post we talked about the <code class="language-plaintext highlighter-rouge">esearch</code> API.
Hmmm, this could be useful.</p> <p>So below is the script which lets me do this:</p> <h4 id="libraries">Libraries</h4> <p>Using the usual tidyverse, with rvest for XML parsing and artyfarty to spruce up the plot.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>suppressPackageStartupMessages({
    library(tidyverse)
    library(magrittr)
    library(rvest)      # for XML
    library(artyfarty)  # because theme_bw is too boring
})
</code></pre></div></div> <h1 id="ncbi-esearch">NCBI ESEARCH</h1> <p>Here’s the NCBI’s esearch <a href="https://www.ncbi.nlm.nih.gov/books/NBK25499/">API</a>. Within it, there are the date-range options <code class="language-plaintext highlighter-rouge">mindate</code> and <code class="language-plaintext highlighter-rouge">maxdate</code>.</p> <h4 id="mindate-maxdate-api-filter">mindate, maxdate API filter</h4> <p>Date range used to limit a search result by the date specified by datetype. These two parameters (mindate, maxdate) must be used together to specify an arbitrary date range.
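</p> <p>To get a concrete feel for how these parameters fit together, here is a small Python sketch of my own that just assembles the esearch URL for one window (no request is actually sent; the endpoint and parameter names are from the NCBI E-utilities docs):</p>

```python
import urllib.parse

# esearch endpoint, as documented by NCBI E-utilities
api = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"

# one 1997-1998 window for the term "microbiome"
params = {"db": "pubmed", "term": "microbiome",
          "mindate": "1997", "maxdate": "1998"}

url = api + urllib.parse.urlencode(params)
```

<p>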
The general date format is YYYY/MM/DD, and these variants are also allowed: YYYY, YYYY/MM.</p> <p>So we will be searching between <em>1997</em> and <em>2017</em>, a 20-year period.</p> <p>So let’s begin…</p> <h4 id="keyword-microbiome">Keyword: Microbiome</h4> <p>together with its synonym, microbiota</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># microbiome
api   = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
query = "db=pubmed&amp;term=%s&amp;mindate=%s&amp;maxdate=%s"
searchTerm = paste0(api, query)

keyword = "microbiome"
df = mapply(function(start, end){
        count = read_xml(sprintf(searchTerm, keyword, start, end)) %&gt;%
            as_list %$%
            Count %&gt;%
            unlist
        tibble(count, start, end)
    },
    start = 1997:2016,
    end   = 1998:2017,
    SIMPLIFY = FALSE
) %&gt;% do.call(rbind, .)
</code></pre></div></div> <h4 id="keyword-cancer">Keyword: Cancer</h4> <p>Used as a comparison.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">keyword</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cancer"</span><span class="w"> </span><span class="n">df2</span><span class="w"> </span><span class="o">=</span><span class="w"> 
</span><span class="n">mapply</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">){</span><span class="w"> </span><span class="n">count</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">read_xml</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="n">searchTerm</span><span class="p">,</span><span class="w"> </span><span class="n">keyword</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as_list</span><span class="w"> </span><span class="o">%$%</span><span class="w"> </span><span class="n">Count</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">unlist</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">count</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">)</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1997</span><span class="o">:</span><span class="m">2016</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1998</span><span class="o">:</span><span class="m">2017</span><span class="p">,</span><span class="w"> </span><span class="n">SIMPLIFY</span><span class="o">=</span><span class="kc">FALSE</span><span class="w"> </span><span 
class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="n">rbind</span><span class="p">,</span><span class="n">.</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>Putting the two together before we start plotting</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"microbiome"</span><span class="p">,</span><span class="w"> </span><span class="s2">"start"</span><span class="p">,</span><span class="w"> </span><span class="s2">"end"</span><span class="p">))</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">cancer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.integer</span><span class="p">(</span><span class="n">df2</span><span class="o">$</span><span class="n">count</span><span class="p">))</span><span class="w"> </span><span class="n">df</span><span class="o">$</span><span class="n">microbiome</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">as.integer</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">,</span><span class="w"> </span><span class="n">microbiome</span><span class="p">,</span><span class="w"> </span><span class="n">cancer</span><span 
class="p">)</span><span class="w"> </span><span class="n">df</span><span class="w"> </span></code></pre></div></div> <p>As you can see, the two series differ by orders of magnitude, so you’ll probably have to do some scaling.</p> <table> <thead> <tr> <th>start</th> <th>end</th> <th>microbiome</th> <th>cancer</th> </tr> </thead> <tbody> <tr> <td>1997</td> <td>1998</td> <td>91</td> <td>116522</td> </tr> <tr> <td>1998</td> <td>1999</td> <td>110</td> <td>124613</td> </tr> <tr> <td>1999</td> <td>2000</td> <td>133</td> <td>131481</td> </tr> <tr> <td>2000</td> <td>2001</td> <td>149</td> <td>139577</td> </tr> <tr> <td>2001</td> <td>2002</td> <td>196</td> <td>153651</td> </tr> <tr> <td>2002</td> <td>2003</td> <td>249</td> <td>166393</td> </tr> <tr> <td>2003</td> <td>2004</td> <td>304</td> <td>170676</td> </tr> <tr> <td>2004</td> <td>2005</td> <td>419</td> <td>181504</td> </tr> <tr> <td>2005</td> <td>2006</td> <td>576</td> <td>190710</td> </tr> <tr> <td>2006</td> <td>2007</td> <td>744</td> <td>198618</td> </tr> <tr> <td>2007</td> <td>2008</td> <td>955</td> <td>210488</td> </tr> <tr> <td>2008</td> <td>2009</td> <td>1285</td> <td>219686</td> </tr> <tr> <td>2009</td> <td>2010</td> <td>1741</td> <td>231079</td> </tr> <tr> <td>2010</td> <td>2011</td> <td>2610</td> <td>248046</td> </tr> <tr> <td>2011</td> <td>2012</td> <td>3899</td> <td>265171</td> </tr> <tr> <td>2012</td> <td>2013</td> <td>5607</td> <td>281240</td> </tr> <tr> <td>2013</td> <td>2014</td> <td>8211</td> <td>308483</td> </tr> <tr> <td>2014</td> <td>2015</td> <td>10951</td> <td>331775</td> </tr> <tr> <td>2015</td> <td>2016</td> <td>13439</td> <td>331631</td> </tr> <tr> <td>2016</td> <td>2017</td> <td>14058</td> <td>285408</td> </tr> </tbody> </table> <p>Since version 2.2.0 of ggplot2, Hadley has included the <code class="language-plaintext highlighter-rouge">sec_axis</code> function in the library, which lets you add a secondary axis as long as it’s related to the primary one by a straightforward transformation.</p> <div
class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ggplot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">end</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">microbiome</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="s2">"Microbiome"</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="o">=</span><span class="m">1.1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">cancer</span><span class="o">/</span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="s2">"Cancer"</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="o">=</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">linetype</span><span class="o">=</span><span class="s2">"dotted"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># manipulated the cancer values by dividing by 20</span><span class="w"> </span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="m">1998</span><span class="o">:</span><span 
class="m">2017</span><span class="p">)</span><span class="o">+</span><span class="w"> </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">sec.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sec_axis</span><span class="p">(</span><span class="o">~</span><span class="n">.</span><span class="o">*</span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Number of Publications [Cancer]"</span><span class="p">))</span><span class="o">+</span><span class="w"> </span><span class="c1"># restores the division</span><span class="w"> </span><span class="c1"># lets we set the axis title</span><span class="w"> </span><span class="n">scale_color_manual</span><span class="p">(</span><span class="s2">"Search Terms"</span><span class="p">,</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pal</span><span class="p">(</span><span class="s2">"five38"</span><span class="p">))</span><span class="o">+</span><span class="w"> </span><span class="n">theme_scientific</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="o">=</span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="o">=</span><span class="m">90</span><span class="p">),</span><span class="w"> </span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.9</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Year"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Number of Publications [Microbiome]"</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p><img src="/images/esearchPublications_9_1.png" alt="publications-with-keyword-microbiome" /></p> <p>There you have it: on the left y-axis, the publication count for the keyword “microbiome” and its synonyms like “microbiota”, and on the right y-axis, the count of abstracts with the keyword “cancer”. As you can see, publications revolving around (or at least associated with) the microbiome have been growing at a breakneck, almost exponential pace, faster than cancer.</p> <p>For the astute among you, you’ll notice a dip in 2017 for cancer and a slowing trend for microbiome; that’s just because we haven’t reached the end of 2017 yet (close 😉), so there are definitely more papers on their way.</p> <p>Hope this will be helpful for future students! Cheers</p> <p><a href="https://etheleon.github.io/articles/esearch/">Has the ship sailed for Microbiome research?</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on November 02, 2017.</p> <![CDATA[Why has downloading fastQ files become so complicated?]]> https://etheleon.github.io/articles/ncbi-sra 2017-08-23T00:00:00-00:00 2017-08-22T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <h2 id="downloading">Downloading</h2> <p>Recently, I had to retrieve sequencing data in <a href="https://en.wikipedia.org/wiki/FASTQ_format">fastQ</a> format belonging to a paper from <a href="https://www.nature.com/articles/srep25719">Law <em>et al</em></a>.
It was for one of two remaining mini-projects standing between me and my PhD.</p> <p>Mainly, they’re for applying my <a href="https://etheleon.github.io/articles/geneCentricApproach/">gene-centric approach (watch out for the next part, it’ll be released soon!)</a> to a time-series dataset of total RNA and to an enriched reactor core.</p> <p>So it begins with the following line in the publication:</p> <blockquote> <p>All raw metagenome, metatranscriptome and amplicon sequencing data used in this study are publicly available from NCBI under BioProject ID: PRJNA320780 (http://www.ncbi.nlm.nih.gov/bioproject/320780).</p> </blockquote> <ul> <li>metagenome, i.e. DNA</li> <li>metatranscriptome, i.e. total RNA</li> <li>amplicon, i.e. 16S only</li> </ul> <p>Sounds easy, ain’t it? Go to the link, click download, and you’ll get everything you need. Well, it wasn’t. =(</p> <p>Previously, my experience with downloading from NCBI had mostly been through their web portal in a browser, not programmatically.</p> <h2 id="day-1-getting-the-files">Day 1: Getting the Files</h2> <p>OK, calm down: all I need now is a link to wget or curl the files. No problem, I’ve heard of the SRA format (SRA stands for <strong>Sequence Read Archives</strong>), nothing is gonna stop me.</p> <p>On the BioProject’s <a href="https://www.ncbi.nlm.nih.gov/bioproject/320780">page</a> I saw I had about 40 SRA files to fetch…</p> <p>Hmmmm. Do I click and download them by hand?
“Of course not; I know how to write scripts, so why should I do this by hand?”, I thought.</p> <p><img src="http://i0.kym-cdn.com/entries/icons/facebook/000/006/725/desk_flip.jpg" alt="flip-table" /></p> <p>After some digging around for ways to get the download links I found this: <a href="https://www.ncbi.nlm.nih.gov/books/NBK179288/">Entrez Direct: E-utilities on the UNIX Command Line</a>.</p> <p>To install the tool you’ll need to install some Perl modules first (forgive the Perl; everyone knows Perl is never going away in bioinformatics).</p> <p>You’ll probably need to CPAN some modules; I recommend installing cpanminus (aka cpanm), Perl’s unofficial package manager.</p> <p>Yes, so that’s a bunch of Perl modules to install, starting with <code class="language-plaintext highlighter-rouge">Net::FTP</code>:</p> <div class="language-perl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">cd</span> <span class="o">~</span> <span class="nv">perl</span> <span class="o">-</span><span class="nn">MNet::</span><span class="nv">FTP</span> <span class="o">-</span><span class="nv">e</span> <span class="o">\</span> <span class="p">'</span><span class="s1">$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive =&gt; 1); $ftp-&gt;login; $ftp-&gt;binary; $ftp-&gt;get("/entrez/entrezdirect/edirect.tar.gz");</span><span class="p">'</span> <span class="nv">gunzip</span> <span class="o">-</span><span class="nv">c</span> <span class="nv">edirect</span><span class="o">.</span><span class="nv">tar</span><span class="o">.</span><span class="nv">gz</span> <span class="o">|</span> <span class="nv">tar</span> <span class="nv">xf</span> <span class="o">-</span> <span class="nv">rm</span> <span class="nv">edirect</span><span class="o">.</span><span class="nv">tar</span><span class="o">.</span><span class="nv">gz</span> <span class="nv">export</span> <span class="nv">PATH</span><span class="o">=</span><span class="nv">$PATH:$HOME</span><span class="o">/</span><span
class="nv">edirect</span> <span class="o">.</span><span class="sr">/edirect/s</span><span class="nv">etup</span><span class="o">.</span><span class="nv">sh</span> </code></pre></div></div> <p>After installing this, you could finally start downloading the SRAs… (you wished).</p> <p>Digging through the website, it was easy to find the button to download the SRAs, but getting the links to all 40 SRAs programmatically, not so easy! And yeap, I was pretty much right, as I found out after looking for a way to get the <code class="language-plaintext highlighter-rouge">runInfo.csv</code>.</p> <h2 id="day2--the-saga-continues-do-dont-need-to-download-the-files">Day 2: The saga continues: you don’t need to download the files</h2> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>EDIRECT=/path/2/eDirect
cd $EDIRECT
esearch -db sra -query PRJNA320780 | ./tools/edirect/efetch --format runinfo
</code></pre></div></div> <p>which looks like this:</p> <table> <thead> <tr> <th style="text-align: left">Run</th> <th style="text-align: left">ReleaseDate</th> <th style="text-align: left">LoadDate</th> <th style="text-align: right">spots</th> <th style="text-align: right">bases</th> <th style="text-align: right">spots_with_mates</th> <th style="text-align: right">avgLength</th> <th style="text-align: right">size_MB</th> <th style="text-align: left">AssemblyName</th> <th style="text-align: left">download_path</th> <th style="text-align: left">Experiment</th> <th style="text-align: left">LibraryName</th> <th style="text-align: left">LibraryStrategy</th> <th style="text-align: left">LibrarySelection</th> <th style="text-align: left">LibrarySource</th> <th style="text-align: left">LibraryLayout</th> <th style="text-align: right">InsertSize</th> <th style="text-align: right">InsertDev</th> <th style="text-align: left">Platform</th> <th style="text-align: left">Model</th> <th style="text-align: left">SRAStudy</th> <th style="text-align: left">BioProject</th>
<th style="text-align: right">Study_Pubmed_id</th> <th style="text-align: right">ProjectID</th> <th style="text-align: left">Sample</th> <th style="text-align: left">BioSample</th> <th style="text-align: left">SampleType</th> <th style="text-align: right">TaxID</th> <th style="text-align: left">ScientificName</th> <th style="text-align: left">SampleName</th> <th style="text-align: left">g1k_pop_code</th> <th style="text-align: left">source</th> <th style="text-align: left">g1k_analysis_group</th> <th style="text-align: left">Subject_ID</th> <th style="text-align: left">Sex</th> <th style="text-align: left">Disease</th> <th style="text-align: left">Tumor</th> <th style="text-align: left">Affection_Status</th> <th style="text-align: left">Analyte_Type</th> <th style="text-align: left">Histological_Type</th> <th style="text-align: left">Body_Site</th> <th style="text-align: left">CenterName</th> <th style="text-align: left">Submission</th> <th style="text-align: left">dbgap_study_accession</th> <th style="text-align: left">Consent</th> <th style="text-align: left">RunHash</th> <th style="text-align: left">ReadHash</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">SRR3501849</td> <td style="text-align: left">2016-05-18 11:35:07</td> <td style="text-align: left">2016-05-13 11:31:37</td> <td style="text-align: right">25818676</td> <td style="text-align: right">7797240152</td> <td style="text-align: right">25818676</td> <td style="text-align: right">302</td> <td style="text-align: right">4224</td> <td style="text-align: left">NA</td> <td style="text-align: left">https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501849</td> <td style="text-align: left">SRX1759558</td> <td style="text-align: left">844</td> <td style="text-align: left">WGS</td> <td style="text-align: left">RANDOM</td> <td style="text-align: left">METAGENOMIC</td> <td style="text-align: left">PAIRED</td> <td style="text-align: right">0</td> <td style="text-align: right">0</td> <td 
style="text-align: left">ILLUMINA</td> <td style="text-align: left">Illumina HiSeq 2500</td> <td style="text-align: left">SRP075031</td> <td style="text-align: left">PRJNA320780</td> <td style="text-align: right">2</td> <td style="text-align: right">320780</td> <td style="text-align: left">SRS1435427</td> <td style="text-align: left">SAMN04957382</td> <td style="text-align: left">simple</td> <td style="text-align: right">942017</td> <td style="text-align: left">activated sludge metagenome</td> <td style="text-align: left">UPWRP_SW_d1_r1</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">no</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">SRA425235</td> <td style="text-align: left">NA</td> <td style="text-align: left">public</td> <td style="text-align: left">8C81A9CE61F9010A73220794D655E084</td> <td style="text-align: left">0AE4D27EB24ECF49E094557AD7255216</td> </tr> <tr> <td style="text-align: left">SRR3501850</td> <td style="text-align: left">2016-05-18 11:50:28</td> <td style="text-align: left">2016-05-13 11:46:22</td> <td style="text-align: right">31189839</td> <td style="text-align: right">9419331378</td> <td style="text-align: right">31189839</td> <td style="text-align: right">302</td> <td style="text-align: right">5112</td> <td style="text-align: left">NA</td> <td style="text-align: left">https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501850</td> <td style="text-align: left">SRX1759559</td> <td style="text-align: left">845</td> <td style="text-align: left">WGS</td> <td style="text-align: left">RANDOM</td> <td style="text-align: left">METAGENOMIC</td> <td style="text-align: 
left">PAIRED</td> <td style="text-align: right">0</td> <td style="text-align: right">0</td> <td style="text-align: left">ILLUMINA</td> <td style="text-align: left">Illumina HiSeq 2500</td> <td style="text-align: left">SRP075031</td> <td style="text-align: left">PRJNA320780</td> <td style="text-align: right">2</td> <td style="text-align: right">320780</td> <td style="text-align: left">SRS1435428</td> <td style="text-align: left">SAMN04957383</td> <td style="text-align: left">simple</td> <td style="text-align: right">942017</td> <td style="text-align: left">activated sludge metagenome</td> <td style="text-align: left">UPWRP_SW_d2_r1</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">no</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">SRA425235</td> <td style="text-align: left">NA</td> <td style="text-align: left">public</td> <td style="text-align: left">E649F6CDCC80915B98BE85CD437B7EFE</td> <td style="text-align: left">B58C5296FB135FCF2E9BFD8544C33B29</td> </tr> <tr> <td style="text-align: left">SRR3501851</td> <td style="text-align: left">2016-05-18 11:47:02</td> <td style="text-align: left">2016-05-13 11:42:17</td> <td style="text-align: right">31966019</td> <td style="text-align: right">9653737738</td> <td style="text-align: right">31966019</td> <td style="text-align: right">302</td> <td style="text-align: right">5244</td> <td style="text-align: left">NA</td> <td style="text-align: left">https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501851</td> <td style="text-align: left">SRX1759560</td> <td style="text-align: left">945</td> <td style="text-align: left">WGS</td> <td style="text-align: 
left">RANDOM</td> <td style="text-align: left">METAGENOMIC</td> <td style="text-align: left">PAIRED</td> <td style="text-align: right">0</td> <td style="text-align: right">0</td> <td style="text-align: left">ILLUMINA</td> <td style="text-align: left">Illumina HiSeq 2500</td> <td style="text-align: left">SRP075031</td> <td style="text-align: left">PRJNA320780</td> <td style="text-align: right">2</td> <td style="text-align: right">320780</td> <td style="text-align: left">SRS1435429</td> <td style="text-align: left">SAMN04957392</td> <td style="text-align: left">simple</td> <td style="text-align: right">942017</td> <td style="text-align: left">activated sludge metagenome</td> <td style="text-align: left">UPWRP_SW_d1_r2</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">no</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">SRA425235</td> <td style="text-align: left">NA</td> <td style="text-align: left">public</td> <td style="text-align: left">81EC07EC8BC6509DBCB00BC4FA7401A9</td> <td style="text-align: left">9AD8B926EF9D20E3A2FD10582C72B592</td> </tr> <tr> <td style="text-align: left">SRR3501852</td> <td style="text-align: left">2016-05-18 12:02:10</td> <td style="text-align: left">2016-05-13 11:57:54</td> <td style="text-align: right">29331148</td> <td style="text-align: right">8858006696</td> <td style="text-align: right">29331148</td> <td style="text-align: right">302</td> <td style="text-align: right">4854</td> <td style="text-align: left">NA</td> <td style="text-align: left">https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501852</td> <td style="text-align: left">SRX1759561</td> <td 
style="text-align: left">946</td> <td style="text-align: left">WGS</td> <td style="text-align: left">RANDOM</td> <td style="text-align: left">METAGENOMIC</td> <td style="text-align: left">PAIRED</td> <td style="text-align: right">0</td> <td style="text-align: right">0</td> <td style="text-align: left">ILLUMINA</td> <td style="text-align: left">Illumina HiSeq 2500</td> <td style="text-align: left">SRP075031</td> <td style="text-align: left">PRJNA320780</td> <td style="text-align: right">2</td> <td style="text-align: right">320780</td> <td style="text-align: left">SRS1435430</td> <td style="text-align: left">SAMN04957393</td> <td style="text-align: left">simple</td> <td style="text-align: right">942017</td> <td style="text-align: left">activated sludge metagenome</td> <td style="text-align: left">UPWRP_SW_d2_r2</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">no</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">SRA425235</td> <td style="text-align: left">NA</td> <td style="text-align: left">public</td> <td style="text-align: left">63B30D9EC717121777A138CECA1F1ACA</td> <td style="text-align: left">35A116CCE17CBA7F425465AA9D7DBB6B</td> </tr> <tr> <td style="text-align: left">SRR3501853</td> <td style="text-align: left">2016-05-18 11:50:18</td> <td style="text-align: left">2016-05-13 11:46:11</td> <td style="text-align: right">34045865</td> <td style="text-align: right">10281851230</td> <td style="text-align: right">34045865</td> <td style="text-align: right">302</td> <td style="text-align: right">5630</td> <td style="text-align: left">NA</td> <td style="text-align: 
left">https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501853</td> <td style="text-align: left">SRX1759562</td> <td style="text-align: left">947</td> <td style="text-align: left">WGS</td> <td style="text-align: left">RANDOM</td> <td style="text-align: left">METAGENOMIC</td> <td style="text-align: left">PAIRED</td> <td style="text-align: right">0</td> <td style="text-align: right">0</td> <td style="text-align: left">ILLUMINA</td> <td style="text-align: left">Illumina HiSeq 2500</td> <td style="text-align: left">SRP075031</td> <td style="text-align: left">PRJNA320780</td> <td style="text-align: right">2</td> <td style="text-align: right">320780</td> <td style="text-align: left">SRS1435431</td> <td style="text-align: left">SAMN04957394</td> <td style="text-align: left">simple</td> <td style="text-align: right">942017</td> <td style="text-align: left">activated sludge metagenome</td> <td style="text-align: left">UPWRP_SW_d3_r2</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">no</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">SRA425235</td> <td style="text-align: left">NA</td> <td style="text-align: left">public</td> <td style="text-align: left">3AEB6D6C4FE383F80D1E16E588C2D374</td> <td style="text-align: left">876D1E61221339EF202EAAEC93AD0C5C</td> </tr> <tr> <td style="text-align: left">SRR3501854</td> <td style="text-align: left">2016-05-18 11:46:21</td> <td style="text-align: left">2016-05-13 11:41:11</td> <td style="text-align: right">29717524</td> <td style="text-align: right">8974692248</td> <td style="text-align: right">29717524</td> <td style="text-align: right">302</td> <td 
style="text-align: right">4935</td> <td style="text-align: left">NA</td> <td style="text-align: left">https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501854</td> <td style="text-align: left">SRX1759563</td> <td style="text-align: left">948</td> <td style="text-align: left">WGS</td> <td style="text-align: left">RANDOM</td> <td style="text-align: left">METAGENOMIC</td> <td style="text-align: left">PAIRED</td> <td style="text-align: right">0</td> <td style="text-align: right">0</td> <td style="text-align: left">ILLUMINA</td> <td style="text-align: left">Illumina HiSeq 2500</td> <td style="text-align: left">SRP075031</td> <td style="text-align: left">PRJNA320780</td> <td style="text-align: right">2</td> <td style="text-align: right">320780</td> <td style="text-align: left">SRS1435432</td> <td style="text-align: left">SAMN04957395</td> <td style="text-align: left">simple</td> <td style="text-align: right">942017</td> <td style="text-align: left">activated sludge metagenome</td> <td style="text-align: left">UPWRP_SW_d4_r2</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">no</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">SRA425235</td> <td style="text-align: left">NA</td> <td style="text-align: left">public</td> <td style="text-align: left">D467387C3A275485CC8EA2025E6044ED</td> <td style="text-align: left">9EB031A8BDAD3C2135E92CF3DBB29169</td> </tr> </tbody> </table> <p>Great the links to the SRAs are in the column download_path</p> <p>So by the way I found this awesome download script which combined pycurl + tqdm (friend recommended me this, if you were wondering what tqdm stands for, it means 
“progress” in Arabic: taqadum)</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span> <span class="kn">import</span> <span class="nn">pycurl</span> <span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span> <span class="n">downloader</span> <span class="o">=</span> <span class="n">pycurl</span><span class="p">.</span><span class="n">Curl</span><span class="p">()</span> <span class="k">def</span> <span class="nf">sanitize</span><span class="p">(</span><span class="n">c</span><span class="p">):</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">UNRESTRICTED_AUTH</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">HTTPAUTH</span><span class="p">,</span> <span class="n">pycurl</span><span class="p">.</span><span class="n">HTTPAUTH_ANYSAFE</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">ACCEPT_ENCODING</span><span class="p">,</span> <span class="sa">b</span><span class="s">''</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">TRANSFER_ENCODING</span><span class="p">,</span> <span class="bp">True</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span 
class="n">SSL_VERIFYPEER</span><span class="p">,</span> <span class="bp">True</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">SSL_VERIFYHOST</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">SSLVERSION</span><span class="p">,</span> <span class="n">pycurl</span><span class="p">.</span><span class="n">SSLVERSION_TLSv1</span><span class="p">)</span> <span class="c1">#c.setopt(pycurl.FOLLOWLOCATION, False) </span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">FOLLOWLOCATION</span><span class="p">,</span> <span class="bp">True</span><span class="p">)</span> <span class="k">def</span> <span class="nf">do_download</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">local</span><span class="p">,</span> <span class="o">*</span><span class="p">,</span> <span class="n">safe</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span> <span class="n">rv</span> <span class="o">=</span> <span class="bp">False</span> <span class="k">with</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">desc</span><span class="o">=</span><span class="n">url</span><span class="p">,</span> <span class="n">total</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">unit</span><span class="o">=</span><span class="s">'b'</span><span class="p">,</span> <span class="n">unit_scale</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">as</span> <span 
class="n">progress</span><span class="p">:</span> <span class="n">xfer</span> <span class="o">=</span> <span class="n">XferInfoDl</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">progress</span><span class="p">)</span> <span class="k">if</span> <span class="n">safe</span><span class="p">:</span> <span class="n">local_tmp</span> <span class="o">=</span> <span class="n">local</span> <span class="o">+</span> <span class="s">'.tmp'</span> <span class="k">else</span><span class="p">:</span> <span class="n">local_tmp</span> <span class="o">=</span> <span class="n">local</span> <span class="n">c</span> <span class="o">=</span> <span class="n">downloader</span> <span class="n">c</span><span class="p">.</span><span class="n">reset</span><span class="p">()</span> <span class="n">sanitize</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">NOPROGRESS</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">XFERINFOFUNCTION</span><span class="p">,</span> <span class="n">xfer</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">URL</span><span class="p">,</span> <span class="n">url</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'utf-8'</span><span class="p">))</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">local_tmp</span><span class="p">,</span> <span class="s">'wb'</span><span 
class="p">)</span> <span class="k">as</span> <span class="n">out</span><span class="p">:</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">WRITEDATA</span><span class="p">,</span> <span class="n">out</span><span class="p">)</span> <span class="k">try</span><span class="p">:</span> <span class="n">c</span><span class="p">.</span><span class="n">perform</span><span class="p">()</span> <span class="k">except</span> <span class="n">pycurl</span><span class="p">.</span><span class="n">error</span><span class="p">:</span> <span class="n">os</span><span class="p">.</span><span class="n">unlink</span><span class="p">(</span><span class="n">local_tmp</span><span class="p">)</span> <span class="k">return</span> <span class="bp">False</span> <span class="k">if</span> <span class="n">c</span><span class="p">.</span><span class="n">getinfo</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">RESPONSE_CODE</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">400</span><span class="p">:</span> <span class="n">os</span><span class="p">.</span><span class="n">unlink</span><span class="p">(</span><span class="n">local_tmp</span><span class="p">)</span> <span class="k">else</span><span class="p">:</span> <span class="k">if</span> <span class="n">safe</span><span class="p">:</span> <span class="n">os</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">local_tmp</span><span class="p">,</span> <span class="n">local</span><span class="p">)</span> <span class="n">rv</span> <span class="o">=</span> <span class="bp">True</span> <span class="n">progress</span><span class="p">.</span><span class="n">total</span> <span class="o">=</span> <span class="n">progress</span><span class="p">.</span><span class="n">n</span> <span class="o">=</span> <span 
class="n">progress</span><span class="p">.</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span> <span class="n">progress</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="k">return</span> <span class="n">rv</span> <span class="k">class</span> <span class="nc">XferInfoDl</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">url</span><span class="p">,</span> <span class="n">progress</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">_tqdm</span> <span class="o">=</span> <span class="n">progress</span> <span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">dltotal</span><span class="p">,</span> <span class="n">dlnow</span><span class="p">,</span> <span class="n">ultotal</span><span class="p">,</span> <span class="n">ulnow</span><span class="p">):</span> <span class="n">n</span> <span class="o">=</span> <span class="n">dlnow</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">_tqdm</span><span class="p">.</span><span class="n">n</span> <span class="bp">self</span><span class="p">.</span><span class="n">_tqdm</span><span class="p">.</span><span class="n">total</span> <span class="o">=</span> <span class="n">dltotal</span> <span class="ow">or</span> <span class="n">guess_size</span><span class="p">(</span><span class="n">dlnow</span><span class="p">)</span> <span class="k">if</span> <span class="n">n</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">_tqdm</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">n</span><span 
class="p">)</span> <span class="k">def</span> <span class="nf">guess_size</span><span class="p">(</span><span class="n">now</span><span class="p">):</span> <span class="s">''' Return a number that is strictly greater than `now`. '''</span> <span class="k">return</span> <span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">now</span><span class="p">.</span><span class="n">bit_length</span><span class="p">()</span> </code></pre></div></div> <p>OK, so I’ve downloaded the SRA files; now I just need to extract the FASTQ from the SRA, which brings us to the <a href="https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/">SRAtoolkit</a>.</p> <p>It’s basically a collection of command-line tools for dealing with SRA files; the one we’re really interested in is <code class="language-plaintext highlighter-rouge">fastq-dump</code>.</p> <p>It’s not exactly clear from NCBI’s readme, but here’s what it does: fastq-dump tries to automatically download the SRAs again even though you’ve got the local file ready. Running <code class="language-plaintext highlighter-rouge">fastq-dump -v</code> shows you it’s trying to download from NCBI.</p> <p>The rationale for this, I assume, is to prevent corrupted files, since there’s another tool in the toolkit, <code class="language-plaintext highlighter-rouge">vdb-validate ./&lt;filename&gt;.sra</code>, which checks the file’s integrity.</p> <p>You could read the whole issues thread, but I think this user’s <a href="https://github.com/ncbi/sra-tools/issues/42#issuecomment-254853204">frustration</a> just sums it up for me as well.</p> <blockquote> <p>@klymenko That is unacceptable. I do not need alignments. just the raw fastq files. This has nothing to do with RefSeq files. Further, neither fastq-dump -h nor online man pages say anything about accompanying refseq files. It simply says you can act on local SRA files. 
Further, all of the above validation tools approve of the downloaded SRA file</p> </blockquote> <p>The owner of the repo goes on to <a href="https://github.com/ncbi/sra-tools/issues/42#issuecomment-254860715">threaten</a> the poor fella who, just like me, is only trying to download a file</p> <blockquote> <p>If you want help, please ask. If you want to flame, then I’ll close the issue.</p> </blockquote> <p><strong>LONG SIGH</strong></p> <p>So the prescribed way of doing this, if you haven’t downloaded the SRA yet, is actually to run the following.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prefetch &lt;SRA ID&gt; fastq-dump &lt;SRA ID&gt; </code></pre></div></div> <p>Yes, you won’t even have to go through downloading 1. the Entrez tools to get 2. the runInfo.csv with the links to get 3. the SRA files.</p> <p>And if you’ve already downloaded a local SRA file like me, you will have to run <code class="language-plaintext highlighter-rouge">prefetch</code> to check the local file; my guess is that it stores the file’s location for <code class="language-plaintext highlighter-rouge">fastq-dump</code> to recognise.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prefetch &lt;localFile&gt; fastq-dump &lt;localFile&gt; </code></pre></div></div> <p>The story deepens: it turns out the extraction of the FASTQ from the SRA is excruciatingly slow, and <a href="https://github.com/ncbi/sra-tools/issues/24#issuecomment-171296735">it’s not just me</a></p> <blockquote> <p>It’s been running for about 3 hours and so far extracted ~15GB of what I expect to be around 60GB. 
An improvement, but still not exactly fast…</p> </blockquote> <p><img src="https://i.imgflip.com/11rujc.jpg" alt="patiently" /></p> <p>I looked around for other ways to speed this up and came across the <a href="https://www.gnu.org/software/parallel/">GNU parallel tool</a>.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>parallel fastq-dump <span class="nt">--split-files</span> <span class="nt">-F</span> <span class="nt">--gzip</span> <span class="o">{}</span> ::: <span class="k">*</span>.sra </code></pre></div></div> <p>But it doesn’t really solve anything, since each file still has to be extracted by a single thread.</p> <p>Thankfully, I later stumbled across <a href="https://github.com/rvalieris/parallel-fastq-dump">parallel-fastq-dump</a>, which makes use of the <code class="language-plaintext highlighter-rouge">-N</code> and <code class="language-plaintext highlighter-rouge">-X</code> flags in the original <code class="language-plaintext highlighter-rouge">fastq-dump</code> to split the extraction over different spot ranges so it can be parallelised.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>parallel-fastq-dump <span class="nt">--sra-id</span> SRR3501865 <span class="nt">-F</span> <span class="nt">--threads</span> 20 <span class="nt">--outdir</span> ../unzipped <span class="nt">--split-files</span> <span class="nt">--gzip</span> <span class="nt">--tmpdir</span> /scratch/uesu/ </code></pre></div></div> <p>The results are stunning:</p> <p><img src="https://cloud.githubusercontent.com/assets/6310472/23962085/bdefef44-098b-11e7-825f-1da53d6568d6.png" alt="results" /></p> <h3 id="conclusion">Conclusion</h3> <p>That’s all folks. The moral of the story: avoid downloading through NCBI if you can, and grab the files straight from the source where possible. 
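</p>

<p>To make the range-splitting idea concrete, here is a minimal sketch of how a run’s spots can be partitioned into chunks handed to separate <code class="language-plaintext highlighter-rouge">fastq-dump</code> workers. This is my own illustration, not parallel-fastq-dump’s actual code; the <code class="language-plaintext highlighter-rouge">spot_ranges</code> helper and the total spot count are assumptions made for the example.</p>

```python
import shlex

def spot_ranges(total_spots, n_chunks):
    # Partition spot IDs 1..total_spots into at most n_chunks
    # contiguous, disjoint, inclusive ranges.
    base, extra = divmod(total_spots, n_chunks)
    ranges, start = [], 1
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        ranges.append((start, start + size - 1))
        start += size
    return ranges

def fastq_dump_cmds(sra_id, total_spots, n_chunks):
    # One fastq-dump invocation per chunk; -N/-X bound the spot range,
    # so the commands can run concurrently.
    return [
        f"fastq-dump --split-files --gzip -N {lo} -X {hi} {shlex.quote(sra_id)}"
        for lo, hi in spot_ranges(total_spots, n_chunks)
    ]

# e.g. a run with 1,000,000 spots split across 4 workers
for cmd in fastq_dump_cmds("SRR3501865", 1_000_000, 4):
    print(cmd)
```

<p>Each command extracts a disjoint spot range, so the chunks can run in parallel and the per-chunk outputs be concatenated afterwards, which is essentially what parallel-fastq-dump automates.</p>

<p>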
Have a good one!</p> <p><a href="https://etheleon.github.io/articles/ncbi-sra/">Why has downloading fastQ files become so complicated?</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on August 22, 2017.</p> <![CDATA[From raw sequencing reads to Gene Centric Analyses PART: 1]]> https://etheleon.github.io/articles/geneCentricApproach 2016-12-21T00:00:00-00:00 2017-07-18T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <p>A recent <a href="https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0233-2">paper</a> that came out in Microbiome, from Daniel Huson’s group, uses a new gene-centric function found within MEGAN 6 CE.</p> <p>You could use a sample fastQ to generate a MEGAN summary file and do this.</p> <p><img src="" alt="geneCentricAssembly" /></p> <h1 id="simulation">Simulation</h1> <p><img src="http://i.dailymail.co.uk/i/pix/2012/10/11/article-0-006542AF00000258-91_634x345.jpg" alt="theMatrix" /></p> <p>Here at Singapore Centre for Environmental Life Sciences Engineering (SCELSE) NUS, <a href="http://www.scelse.sg/People/Detail/fa315ed9-015a-4414-b49e-5e0145e6ce42">Peter</a> and <a href="http://www.scelse.sg/People/Detail/f000cd6a-daf9-442b-b328-a7e3a2b6c64f">I</a> work on a variety of bioinformatics analyses concerning the microbiome of Ulu Pandan’s microbial community. 
This ultimately led to pipelines and tools based on the sequencing data we retrieve from wastewater samples.</p> <p>One of the topics I work on concerns the development of a gene-centric assembly analysis for poorly annotated microbiomes.</p> <p>Briefly, our method is split into the following steps:</p> <ul> <li>Functional binning using MEGAN’s Lowest Common Ancestor (LCA) algorithm,</li> <li>NEWBLER’s implementation of the Overlap Layout Consensus (OLC) assembly and</li> <li>Conserved region analysis using a defined Maximum Diversity Region (<a href="https://github.com/etheleon/pAss">pAss</a>).</li> </ul> <p>Unlike Huson <em>et al.</em>, we explore the alignment of contigs against their respective reference sequences before deciding upon a consensus region, based on a multiple sequence alignment of reference sequences, which captures the largest number of contigs, thus facilitating a diversity analysis. To understand the dynamics of such a workflow, we decided to first run this on an <em>in silico</em> simulation of 329 bacterial and archaeal species, modelled after the abundance curves obtained from an initial whole genome short read analysis.</p> <p>In this post, I won’t be diving too deeply into details, but will outline how one would use the pipeline in general, starting from raw fastQ files.</p> <h2 id="1-homology-search-of-the-short-reads">1. Homology search of the short reads</h2> <p>Many databases could be used, but NCBI’s NR protein database is a good place to begin. A useful tool for comparing short reads with a protein database is <a href="https://github.com/bbuchfink/diamond">DIAMOND</a>.</p> <h2 id="2-binning-short-reads-into-functional-groups">2. Binning short reads into functional groups</h2> <p>Here we use MEGAN’s blast2lca tool. Once you’ve gotten the reads sorted into the proper directories, we can begin assembling them.</p> <h2 id="3-run-newbler-olc-assembler-on-each-of-the-bins">3. 
Run NEWBLER OLC Assembler on each of the bins</h2> <p>However, because you’ll be running the assembler on possibly 9000 different KOs or more, I’ve written a Python <a href="https://github.com/etheleon/newbler">class</a> to run NEWBLER.</p> <h2 id="4-run-pass-and-identify-the-max-diversity-regions">4. Run pAss and identify the Max Diversity Regions</h2> <p>This is truly where our work begins.</p> <h3 id="mdr">MDR</h3> <p>The core algorithm works as follows:</p> <h4 id="implicit-msa-of-contigs">Implicit MSA of contigs</h4> <ol> <li>First, we generate an MSA of protein reference sequences.</li> <li>Thereafter, using MEGAN, we gathered contig-reference sequence (protein) alignments before assigning the single best-aligned reference to each contig.</li> <li>Finally, we lined the contigs up according to their cognate reference sequence’s position in the MSA.</li> </ol> <h4 id="window-of-diversity">Window of diversity</h4> <ol> <li>We ran a 200 bp sliding window across the implicit contig alignment to find the region capturing the largest number of contigs, also known as the maximum diversity region (MDR).</li> </ol> <h3 id="simulation-1">Simulation</h3> <p>With the simulation, we looked specifically at Single Copy Genes (SCGs) to see if the method “worked”:</p> <ol> <li>if the genes had been successfully assembled, and</li> <li>if homology search + LCA was able to assign these assembled genes to the correct genus.</li> </ol> <p>Briefly, the conclusion from this simulation was that the process leads to an overestimation of the number of genes, due to duplicate genes introduced as an artifact of the assembly process.</p> <p>This could be circumvented in several ways, and we have come up with two.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Duplication decreases when we stipulate that contigs span the entire length of the window. 2. 
Additionally, we remove low-quality contigs by thresholding them on read counts until the number of duplicated genes (multiple contigs for the same gene from the same genome) stabilises. (This was only possible in the simulation, where the origin of each contig is known from the identity of its reads.) Alternatively, thresholding on coverage instead of read counts could also be used. </code></pre></div></div> <h3 id="part-2-empirical-data">Part: 2 Empirical data</h3> <p>The next part of this blog will continue with the analysis performed on empirical sewage data.</p> <h1 id="softwares">Software</h1> <ol> <li>Protein homology search using <a href="https://github.com/bbuchfink/diamond">DIAMOND</a>.</li> <li>Binning of short reads using <a href="https://github.com/etheleon/pymegan">pymegan</a> for converting raw reads into LCA-ed taxonomic assignments and KEGG-based functional assignments.</li> <li>Assembly of bins using the OLC assembler NEWBLER, and identification of the maximum diversity regions (MDR) using <a href="https://github.com/etheleon/pAss">pAss</a>.</li> <li>Analysis of MDRs and integration with the noSQL database <a href="https://github.com/etheleon/omics">omics</a> using the R package <a href="https://github.com/etheleon/MetamapsDB">metamapsDB</a>.</li> </ol> <h1 id="future-works">Future work</h1> <p>Make this process more friendly to other types of OLC / k-mer assemblers.</p> <h1 id="references">References</h1> <p><a href="https://etheleon.github.io/articles/geneCentricApproach/">From raw sequencing reads to Gene Centric Analyses PART: 1</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on July 18, 2017.</p> <![CDATA[Metagenomics for the not so beginner]]> https://etheleon.github.io/articles/pythonMEGAN 2017-03-07T00:00:00-00:00 2017-03-07T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <h1 id="blast2lca-a-python-wrapper-for-megan-blast2lca">blast2lca++ a Python 
Wrapper for MEGAN blast2lca</h1> <p>Download now from: <a href="https://github.com/etheleon/blast2lcaPlus">https://github.com/etheleon/blast2lcaPlus</a></p> <blockquote> <p>“Metagenomics (also referred to as environmental and community genomics) is the genomic analysis of microorganisms by direct extraction and cloning of DNA from an assemblage of microorganisms.”</p> </blockquote> <p>Whether you’re an absolute or an intermediate beginner venturing into the field of metagenomics, one tool you’ll almost certainly come across quickly is <a href="http://www-ab.informatik.uni-tuebingen.de/software/megan6/">MEGAN</a> from Daniel Huson’s lab, Tübingen University.</p> <p><a href="http://ab.inf.uni-tuebingen.de/software/megan/"><img src="http://megan.informatik.uni-tuebingen.de/uploads/default/original/1X/c3b77ecaaa6f3b8f4c71d45f070a3a6b9952605b.png" alt="MEGANimg" /></a></p> <p>If you take a closer look inside the <code class="language-plaintext highlighter-rouge">tools</code> directory of the installation, you’ll find a bash executable called <code class="language-plaintext highlighter-rouge">blast2lca</code> (<a href="https://github.com/danielhuson/megan-ce/blob/master/tools/blast2lca">see link to script on github repo</a>) which taps into the java classes used in the desktop version of MEGAN.</p> <p><code class="language-plaintext highlighter-rouge">blast2lca</code> is extremely valuable as a tool for accessing the core algorithms within MEGAN, for example:</p> <ol> <li>the Lowest Common Ancestor (LCA) algorithm and</li> <li>functional assignment (KEGG/COG/eggNOG).</li> </ol> <p>MEGAN’s been around for a while, with its <a href="http://www.genome.org/cgi/reprint/gr.5969107v1.pdf">first release</a> way back in 2007.</p> <p>Its newest iteration, <a href="http://www-ab.informatik.uni-tuebingen.de/software/megan6/">MEGAN6</a>, now includes new additions to deal with increasingly large datasets.</p> <p>However, I would say the updates are still mainly for desktop users and 
if you need to run any huge jobs on multiple large sequencing projects, you’ll be hard-pressed to find a solution unless you pay for the server edition, and even then incorporating MEGAN into a customised pipeline might not be that simple.</p> <p>Discussion of MEGAN server is outside the scope of this blogpost; message the authors if you want to know more.</p> <h2 id="blast2lca">blast2lca++</h2> <p>In this blogpost, I’ll be sharing the Python wrapper <a href="https://github.com/etheleon/blast2lcaPlus">https://github.com/etheleon/blast2lcaPlus</a> I’ve written around <code class="language-plaintext highlighter-rouge">blast2lca</code>. (At the time of writing I tested this with MEGAN6 Community Edition 6.6.0 from Dec 2016.)</p> <h3 id="use-case">Use case</h3> <p>Say, for example, you’ve got a huge number of samples you would like to analyse and you’re using an underpowered MacBook 12, but, luckily, you have access to a powerful headless university server.</p> <p><img src="http://weknowmemes.com/wp-content/uploads/2013/03/i-have-a-lot-of-work-to-do-oh-well-comic.jpg" alt="lotsOfWork" /></p> <p>One option is to install MEGAN server (Ultimate Edition), run the LCA and functional binning algorithms there, and analyse the results via the desktop client. As someone who does further analyses in R and Python, that’s not really what I want; I’d rather make my own plots and run my own analyses. Luckily there’s <code class="language-plaintext highlighter-rouge">blast2lca</code>, kindly provided by the author.</p> <p>However, several steps are still not clear, hence the reason for this wrapper:</p> <ol> <li>Combine Annotations - How does one combine KO and taxonomic annotations such that we have a combined annotation for each query (be it short read, long read or contig)?</li> <li>gi2ko mapping file generator - KEGG annotations. In the Community Edition, the tool to generate the mapping file (GI to KEGG) is not included, unlike in the Ultimate Edition. 
What if you’re not in a position to pay for the Ultimate Edition license (which bundles the KEGG database licence) or you have an older version of KEGG lying around somewhere on the server, what should you do? (NCBI has recently done away with GIs and I’ll update this in the future)</li> <li>A complete pipeline from blast to combined output - How to go all the way from the tabbed blast output to the KO and taxonomy combined output mentioned above.</li> </ol> <p>Use the <code class="language-plaintext highlighter-rouge">blast2lca++</code> tool of course!</p> <p><img src="http://img.memecdn.com/unicorn-farting-rainbows_o_1498739.jpg" alt="unicorn" /></p> <h3 id="combine-annotations">Combine Annotations</h3> <p>In the root directory of the github <a href="https://github.com/etheleon/blast2lcaPlus">repo</a>, you’ll find a <code class="language-plaintext highlighter-rouge">parseMEGAN</code> python script.</p> <p>It requires the blast results to be arranged in the following manner:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/projectDir └── sampleDir/ └── sample.daa / sample.m8 (tabbed blast) </code></pre></div></div> <blockquote> <p>Note: if you only have one sample, then just substitute sampleDir with a <code class="language-plaintext highlighter-rouge">.</code>.</p> </blockquote> <p>You’ll be asked to specify the locations of the mapping files (KEGG and taxonomy) as well as the path to the executable:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>parseMEGAN $PROJECTDIR $SAMPLEDIR $SAMPLENAME taxOutput koOutput </code></pre></div></div> <p>After which you’ll get the outputs from blast2lca along with a merged file <code class="language-plaintext highlighter-rouge">sample-combined.txt</code>:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/projectDir └── sampleDirName/ ├── blast2lca-tax-Output ├── 
blast2lca-ko-Output ├── sampleName-combined.txt └── inputSampleDAAfile.daa </code></pre></div></div> <p>In the <code class="language-plaintext highlighter-rouge">sample-combined.txt</code> file, you’ll find a table:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rank ncbi-taxid KEGG-ko #reads phylum 67820 K00000 4 phylum 1224 K06937 1 phylum 1224 K00656 6 phylum 1224 K04564 2 phylum 1224 K06934 1 phylum 1224 K12524 24 phylum 1224 K00558 7 phylum 1224 K02674 1 phylum 1224 K06694 1 phylum 1224 K01785 12 ... ... species 1262910 K00033 1 species 1262911 K00000 525 species 35760 K00000 14 species 7462 K15421 1 species 7462 K00000 1 species 1262918 K00000 365 species 1262919 K02429 5 species 1262919 K07133 1 species 1262919 K03800 1 species 1262919 K00000 529 </code></pre></div></div> <p>Below’s a small example of what you could do with the data:</p> <h1 id="application">Application</h1> <p>With the above you could generate a reads per million column based on the raw counts</p> <table> <tbody> <tr> <td>level</td> <td>taxon</td> <td>ko</td> <td>rpm</td> <td>c1.raw</td> </tr> <tr> <td>Genus</td> <td>Abiotrophia</td> <td>K00000</td> <td>1.40278158</td> <td>32</td> </tr> <tr> <td>Genus</td> <td>Acanthamoeba</td> <td>K00000</td> <td>2.11362518</td> <td>47</td> </tr> <tr> <td>Genus</td> <td>Acaryochloris</td> <td>K00000</td> <td>61.73957107</td> <td>1423</td> </tr> <tr> <td>Genus</td> <td>Acaryochloris</td> <td>K00013</td> <td>0.00000000</td> <td>0</td> </tr> <tr> <td>Genus</td> <td>Acaryochloris</td> <td>K00016</td> <td>0.00000000</td> <td>0</td> </tr> <tr> <td>Genus</td> <td>Acaryochloris</td> <td>K00091</td> <td>0.05123577</td> <td>1</td> </tr> </tbody> </table> <p>You can now quickly summarise in a gene centric format the contributions made from each genus (or any taxonomic rank you choose) to each KO.</p> <p><img src="/images/posts/combiningTAX-KO.png" alt="mosiac plot" /></p> <p>Here we see the transcriptome 
summary, and one of the most highly expressed KOs (rightmost), one responsible for nitrogen metabolism, is mostly being expressed by a single genus (orange).</p> <p><code class="language-plaintext highlighter-rouge">0</code> in the <code class="language-plaintext highlighter-rouge">ncbi-taxid</code> and <code class="language-plaintext highlighter-rouge">K00000</code> in the KEGG-ko columns stand for unclassified.</p> <p>What if you want to find out the names of organisms from NCBI’s taxids?</p> <p>Check out the <a href="https://github.com/etheleon/MetamapsDB">R package MetamapsDB</a>, which lets you query the names based on taxids and do much more.</p> <h3 id="full-pipeline">Full pipeline</h3> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fullPipeline $PROJECTDIR $SAMPLEDIR $SAMPLENAME $INPUTFILE taxOutput koOutput --blast2lca &lt;path 2 the blast2lca script&gt; --gi2tax &lt;path to the taxonomy mapping file&gt; --gi2kegg &lt;path to the KEGG mapping file&gt; </code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">fullPipeline</code> script will take a <code class="language-plaintext highlighter-rouge">.m8</code> (tabbed blast) file or meganised <code class="language-plaintext highlighter-rouge">.DAA</code> file as input (checked via a regex for <code class="language-plaintext highlighter-rouge">.m8</code> or <code class="language-plaintext highlighter-rouge">.daa</code>), carry out the taxonomic and functional (KEGG) annotation, and combine the outputs into a single file.</p> <h3 id="gi2ko-mapping-file-generator">gi2ko mapping file generator</h3> <p>Although MEGAN UE provides a KEGG mapping file generating tool (not included with MEGAN CE), it doesn’t take into account that NCBI assigns a unique <code class="language-plaintext highlighter-rouge">GI</code> to each representative sequence in the non-redundant database (NCBI NR), under which sit “duplicate” sequence GIs and ref IDs. When blast or diamond does the alignment, it returns only the representative GI and not the rest, which makes the KEGG-to-GI mapping miss many assignments.</p> <p>We’ve separately included in the tools folder of the Python package the <a href="https://github.com/etheleon/blast2lcaPlus/blob/master/tools/ref2kegg.go">ref2kegg.go NR-GI to KEGG ortholog (KO) mapping file generator</a>, written in Go. The output of this can be fed to the blast2lca wrapper via the <code class="language-plaintext highlighter-rouge">--gi2kegg</code> flag. At the time of writing, the parser had been rewritten from Perl into Go (a typed, compiled language) to increase the speed of parsing the NR fasta.</p> <p><img src="https://i.imgflip.com/123oks.jpg" alt="end" /></p> <p>Hope this helps anyone building a customised pipeline with MEGAN6! Personally, I feel strongly about MEGAN now having an open-source version via MEGAN CE.</p> <h2 id="reference">Reference</h2> <ol> <li>Handelsman, J (2004). Metagenomics: application of genomics to uncultured microorganisms. Microbiol. Mol. Biol. Rev., 68, 4:669-85.</li> <li>Huson, D. H., Tappu, R., Bazinet, A. L., Xie, C., Cummings, M. P., Nieselt, K., &amp; Williams, R. (2017). Fast and simple protein-alignment-guided assembly of orthologous gene families from microbiome sequencing reads. Microbiome, 5(1), 11. http://doi.org/10.1186/s40168-017-0233-2</li> </ol> <p><a href="https://etheleon.github.io/articles/pythonMEGAN/">Metagenomics for the not so beginner</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on March 07, 2017.</p> <![CDATA[10 Bioinformatics tools and workflows you should be adopting in 2017.]]> https://etheleon.github.io/articles/Organising DNA sequencing projects 2017-01-15T00:00:00-00:00 2017-01-15T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <p>Coming from a non-computer-science field but finding yourself having a hard time navigating data analysis? 
I won’t blame you if you feel at a loss as to where you should begin.</p> <p style="text-align: center;"><img src="http://m.quickmeme.com/img/da/da7c74e10383ecc127dbb18b7d812abfb7f2aa1092f116d09a3ae70e782fc059.jpg" alt="That's me" /></p> <p>Posts like <em>which language should you learn for datascience</em> often catch our eyes, and it’s no different whether you’re doing <a href="https://www.biostars.org/p/7763/">bioinformatics</a> or something far removed such as <a href="https://www.linkedin.com/pulse/hr-analytics-starter-kit-part-2-intro-r-richard-rosenow-pmp">HR analytics</a>.</p> <p>The only advice I have is to immerse yourself as much as you can whenever you get the chance. This way you’ll be able to gain EXP little by little.</p> <p>There’s even a level-up tree, starting from a junior bioinformatics analyst up to a full-on role, much like how it was described in this <a href="http://homolog.us/blogs/blog/2011/07/22/a-beginners-guide-to-bioinformatics-part-i/">post</a>.</p> <p style="text-align: center;"><img src="https://s.aolcdn.com/hss/storage/midas/daede00598c17d19b29a93ff65147585/200016989/priest+trees.jpg" alt="upgrade ursself" /></p> <p>Below’s a list of skills I’ve successfully mastered in 2016.</p> <p>Now let’s count down from number 10!</p> <h2 id="10-mixing-procedural-scripts-with-object-oriented-programming-oop">10. Mixing procedural scripts with Object Oriented Programming (OOP)</h2> <p>Previously, I organised my research analyses as scripts with running numbers. Mainly procedural stuff…</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ProjName.0100.doTask1.R ProjName.0101.doTask2.py . . ProjName.0109.doTask3.pl </code></pre></div></div> <p>But I quickly realised that much of the analysis I do is often not at all linear.</p> <p>It often works like a network. 
Script N+10 needs functions from Script N+2; dependencies become a real thing once the project grows even to a modest size.</p> <p style="text-align: center;"><img src="https://raw.githubusercontent.com/mikel-egana-aranguren/SADI-Galaxy-Docker/master/workflow_screen.png" alt="galaxy workflow" /></p> <p>The above is a screenshot of a dependency network in a Galaxy workflow. It nicely summarises some of what I do. I searched around for a CLI version and stumbled upon <a href="https://github.com/spotify/luigi">Luigi</a>.</p> <p>In 2016, I tried using Luigi (a python package from Spotify for handling scheduled tasks) but found it too complicated for something I could solve simply with OOP. Not everything is a routine operation, but dependencies, on the other hand, are very real.</p> <p style="text-align: center;"><img src="https://i.stack.imgur.com/xFUmZ.png" alt="luigi" /></p> <p>What OOP does is allow you to use design patterns. Design patterns are <a href="https://simpleprogrammer.com/2016/06/15/dont-get-obsessed-design-patterns/">predefined solutions to specific kinds of problems, proven over time and known by the software community</a>, but just remember not to get too obsessed with them. With OOP, functions and methods are quickly abstracted away, giving you cleaner analysis code (<a href="https://www.sitepoint.com/object-oriented-javascript-deep-dive-es6-classes/">classes and methods</a>).</p> <p>One of the things I see myself doing more of going into 2017 is applying more <a href="https://github.com/faif/python-patterns">OOP design patterns</a>.</p> <h2 id="9-package-your-code">9. Package your code</h2> <p>To be honest, R users are rather spoilt by R’s fantastic package system, <a href="https://cran.r-project.org">CRAN</a>. It’s extremely easy to download, install, and share your packages.</p> <p>Hadley’s devtools package is close to sorcery.
(Do check out more of his packages, collectively known as the <a href="http://adolfoalvarez.cl/the-hitchhikers-guide-to-the-hadleyverse/">hadleyverse</a>.)</p> <p>When I looked outside of R, at other ecosystems like Python’s PyPI and Perl’s CPAN, things just don’t feel as easy.</p> <p>If you’re hacking in Perl, check out <a href="https://github.com/tokuhirom/Minilla">Minilla</a>. As for Python, one rather interesting find is <a href="https://github.com/audreyr/cookiecutter">cookiecutter</a>, a templating package which generates templates for python modules.</p> <p>Both allow you to upload your package to GitHub and let others install directly from it.</p> <p>You might be confused as to why I’m still mentioning Perl; well, that’s because much of bioinformatics is still using it!</p> <h2 id="8-be-a-polyglot-for-package-management-systems">8. Be a polyglot for package management systems</h2> <p>Bioinformatics software is written in almost any language imaginable, e.g. Erlang, Haskell, Perl, C++, Java, Python2.X, Python3.X; you name it, it’s there. Learning how to use them will always be a constant; however, familiarising yourself with common installation methods like make will make a world of difference.</p> <p>This parallels web development quite a fair bit. In my startup life at <a href="https://www.fundmylife.co">fundMyLife</a> we adopted MeteorJS, a modern ES6 web development framework, along with CoffeeScript coupled with Jade/Pug. Javascript’s package system NPM is a beast, but once you get the hang of it many of its sweet packages will be at your fingertips.</p> <p>The community is quick to adopt and change, and everyone wants you to use their standards and formats. One thing is clear: what stays constant are their package managers, so know them well.</p> <h2 id="7-document-your-projects">7.
Document your projects</h2> <p>Read enough documentation, whether for installation or simply to use the functions in a project, and you’ll soon see yourself transforming into a connoisseur of sorts as to what makes documentation good. It’s extremely important if you want people using your work, and it’s just plain <a href="https://twitter.com/blahah404/status/537584999885991936">courtesy</a>.</p> <p>R’s <a href="http://r-pkgs.had.co.nz/man.html">Roxygen</a> is extremely useful. It is by far the most user-friendly way to document your R code.</p> <p>Below is how you would go about annotating a function in your package; the docs are all automatically generated.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#' Add together two numbers.
#'
#' @param x A number.
#' @param y A number.
#' @return The sum of \code{x} and \code{y}.
#' @examples
#' add(1, 1)
#' add(10, 1)
add &lt;- function(x, y) {
  x + y
}
</code></pre></div></div> <p>I’m still learning Python’s documentation system.</p> <p>Perl has its POD documentation system, which allows you to embed documentation between code, but it’s very clunky. I used this in my pAss package, but it still can’t beat R.</p> <h2 id="6-containers-containers-containers">6. Containers, Containers, Containers.</h2> <p>Most of us nowadays start off our journey in server-side analysis in a Debian Linux distro, usually an Ubuntu box, with full root access, while enjoying the privilege to run the package manager 📦 <code class="language-plaintext highlighter-rouge">apt-get</code> as and when we please without even blinking an eye.</p> <p>But when we start using a shared resource, things quickly turn south. The inspiration to start incorporating this into my workflow came when I saw the web community picking this up to deal with dependency hell.
Meanwhile, I found out that one of my mentors, who is now working heavily in industry data science, uses Docker in his day-to-day life.</p> <p>What containers really <em>disrupt</em> is <a href="https://www.gnu.org/software/make/">MAKE</a>. So instead of often-confusing makefiles, one writes Dockerfiles, and everything that gets installed stays within the container without polluting your host environment, pretty much like a function.</p> <p style="text-align: center;"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Function_machine2.svg/1200px-Function_machine2.svg.png" alt="docker is like a function" /></p> <p>Docker is the most mainstream of all container technologies, and you should take a look at the <a href="https://github.com/BioContainers/containers">biocontainers</a> github page; there you can find many bioinformatics tools containerized!</p> <p>Don’t worry about accessing the files in your home directory; it isn’t a problem, as Docker lets you mount the host system’s HDD onto the running container.</p> <h3 id="docker-is-like-building-a-hdd-in-minecraft">Docker is like building a HDD in minecraft</h3> <iframe width="560" height="315" src="https://www.youtube.com/embed/q7clz1TPK8o" frameborder="0" allowfullscreen=""></iframe> <p>Talk about inception</p> <p>Together this solves an acute problem, as it gives the normal user back the ability to be root without ruining the rest of the host system, and still with performance similar to running on bare metal.</p> <p>One disadvantage of using Docker is that installing Docker itself requires root access, and if you’re dealing with a university-wide shared resource, good luck.</p> <p>Which is why #5, Linuxbrew, turns out to be helpful.</p> <h2 id="5-linuxbrew">5. Linuxbrew</h2> <p>Linuxbrew is basically a port of the macOS/OSX package manager <a href="http://brew.sh">HomeBrew</a>.
So if you’re already familiar with Homebrew, Linuxbrew will be a breeze.</p> <p>Just how easy is it?</p> <p>Let’s try installing R. Wait, let’s make it a tad bit more difficult: let’s customize the installation further with the newest, fastest basic linear algebra routines included in <a href="https://github.com/xianyi/OpenBLAS">openblas</a>.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew install R --with-openblas
</code></pre></div></div> <p>Tada, you’re done. Yes, it’s that simple.</p> <p>What Homebrew, or in this case Linuxbrew, does is not only let you install into your <code class="language-plaintext highlighter-rouge">$HOME</code> directory, bypassing all that superuser bullshit; it also lets you do very specific installations and dependencies.</p> <p><a href="https://twitter.com/sjackman">Shaun Jackman</a> (see his <a href="https://github.com/sjackman/linuxbrew-slides">slides</a>) and many others are behind the Science <a href="https://github.com/Homebrew/homebrew-science">tap</a>, with instructions for Linuxbrew/Homebrew to install popular bioinformatics tools.</p> <p>This makes installing, and ultimately doing science, much easier than before.</p> <p style="text-align: center;"><img src="http://imgs.xkcd.com/comics/outreach.png" alt="science tap" /></p> <p>If you’re still confused about how to do local installations, I recommend reading this <a href="http://sneakygcr.net/caged-python-how-to-set-up-a-scientific-python-stack-in-your-home-folder-without-going-insane.html">post</a>; although its title has Python in it, it’s really meant for everything.</p> <h2 id="4-rmarkdown">4.
Rmarkdown</h2> <p>To be honest, I started R Markdown way back in 2015 but got really into it in 2016, because it really helps frame my questions and analyses.</p> <p>Writing your analyses as R Markdown documents forces you to place those tiny bursts of effort and energy into a single compiled document with clearly defined goals and a developing story.</p> <p>R Markdown has a language-engine mechanism, so it allows you to work not only with R but also with Perl or Python.</p> <p>Having Docker installed also allowed me to use the newer versions of R Markdown, <a href="http://rmarkdown.rstudio.com">rmarkdown2</a>.</p> <p>One feature I absolutely love is auto code hiding in the output HTML.</p> <p>The output looks absolutely professional, and when you need it you can always show the code.</p> <p>Interestingly, the RStudio team has also come up with R Notebooks. I’m sure many pythonistas will love this new feature, but personally I’m very happy with the way things are with R Markdown, so I’m giving this a miss.</p> <h2 id="3-tmux-vim-slime">3. Tmux-vim-slime</h2> <p>For those who know me in person, you know I’m a big fan of the terminal, and I do most of my work, if not all, in that one window. So when I’m in the server I’ll always have a tmux session running.</p> <p style="text-align: center;"><img src="https://github.com/jpalardy/vim-slime/raw/master/assets/vim-slime.gif" alt="my tmux" /></p> <p><a href="https://www.google.com.sg/search?q=tmux+r+plugin&amp;oq=tmux+r+plugin&amp;aqs=chrome..69i57j0l5.1754j0j1&amp;sourceid=chrome&amp;ie=UTF-8">This is a good tutorial</a>, which teaches you how to work with R on a server, like a cluster, away from the familiar RStudio. Recently I’ve switched from this to <a href="https://github.com/jpalardy/vim-slime">vim-slime</a>, a vim port of Emacs’ SLIME, because it also supports IPython.</p> <h2 id="2-biojs">2.
BioJS</h2> <p>Learning web development while building <a href="https://www.fundmylife.co">fundMyLife</a> has given me the skills required to build the UI layer, instead of just CLI system tools in python/perl/R.</p> <p>BioJS is one of those interesting developments where important visualisations are now rendered in a browser, and hence on any operating system. You see this trend outside of bioinformatics too, where editors like Atom and Visual Studio Code, and the communications tool Slack, are all built as browser-based applications.</p> <p><img src="https://lh5.googleusercontent.com/FbUHBUY-GmrI727nQd3K2lid0I4nPWpQUydyXEibMdfrnOeLB5wXlKlQWPSAMeBz_rfa8YAFjpQZjWItcpqrSHOoy6BGcCKw6AWjk3SjkBfmopJnzG3k-fxW4hdtO0xAS8Brjv2J" alt="biojs-gosc" /></p> <p>The admins were even featured in 2016’s GSoC; check out the blog posts <a href="https://opensource.googleblog.com/2016/08/from-google-summer-of-code-to-game-of.html">part 1</a> and <a href="https://opensource.googleblog.com/2016/08/from-google-summer-of-code-to-game-of_12.html">part 2</a>, where they went on to build visualisations for Game of Thrones.</p> <h2 id="1-writing-production-ready-code">1. Writing production ready code</h2> <p>There’s a lot of talk about reproducibility, and really, much of it has been solved in industry. Nearing the end of the PhD means most of my packages should be ready for use by the public at large.</p> <p>Personally, I’m aspiring to write code good enough for an industry setting. The crossover from academia into industry isn’t that uncommon; here’s a <a href="https://eng.uber.com/emi-data-science-q-a/">post</a> about a nuclear physicist who is now working at Uber, doing data science and shipping production code; some even do some engineering as well.</p> <p style="text-align: center;"><img src="https://qph.ec.quoracdn.net/main-qimg-9281e2345d6f6adfc2c42c2fa1001094?convert_to_webp=true" alt="pay scale" /></p> <p>That’s all, folks!
I hope this helped you get oriented around bioinformatics.</p> <p>If you know any good workflows, please do share them with me.</p> <p><a href="https://etheleon.github.io/articles/Organising-DNA-sequencing-projects/">10 Bioinformatics tools and workflows you should be adopting in 2017.</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on January 15, 2017.</p> <![CDATA[5 Things You Didnt Know About The Bacteria In Your Gut]]> https://etheleon.github.io/articles/metagenomics 2016-12-21T00:00:00-00:00 2016-12-27T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <h1 id="hype-vs-reality-the-microbiome">Hype vs Reality: The Microbiome</h1> <p>With so much hype today about one’s gut health, you often wonder how much of it is true. A trip down to your local pharmacy’s supplements shelf and you will see a wide range of pro- and pre-biotics, each with their own beneficial claims, often too ludicrous to be taken seriously.</p> <p>In case you still don’t believe me, just head down to the <a href="https://m.reddit.com/r/Microbiome/?compact=true">subreddit r/microbiome</a> with close to 3000 subscribers.</p> <p>Why this hype?
It’s driven by the numerous use cases involving the microbiome which are popping up all over the place.</p> <blockquote> <p><em>From prediction for early <a href="http://www.nature.com/nrmicro/journal/v14/n8/fig_tab/nrmicro.2016.83_F2.html">diagnostics</a> of chronic diseases such as diabetes to <a href="http://www.sciencemag.org/news/2016/03/how-your-microbiome-can-put-you-scene-crime">fingerprinting</a> criminals from samples left behind at crime scenes.</em></p> </blockquote> <h2 id="so-how-do-we-study-microbes-even-those-we-cannot-culture">So how do we study microbes (even those we cannot culture)?</h2> <p><strong>Through DNA sequencing of course!</strong></p> <p><img src="https://imgflip.com/s/meme/X-Everywhere.jpg" alt="dataIsEverywhere" /></p> <p><em>More data for you, me, everybody</em></p> <p>We study them mainly using two techniques:</p> <ol> <li>Amplicon Sequencing</li> <li>Whole Metagenome Sequencing</li> </ol> <h3 id="so-what-is-amplicon-sequencing-and-metagenomics">So what is Amplicon sequencing and Metagenomics?</h3> <table> <tbody> <tr> <td>The investigation of microbes in a given sample, without the need for culture, by directly recording the genetic content using <a href="">next generation DNA sequencing techniques</a>.</td> </tr> </tbody> </table> <h4 id="can-you-be-more-specific-tldr-version">Can you be more specific?
(TL;DR version)</h4> <ol> <li>Amplicon sequencing: sequencing of only a selected representative gene of interest, in this case the variable regions of the rRNA of the 16S ribosome.</li> <li>Whole metagenome sequencing: you basically get the whole repertoire / complement of genes.</li> </ol> <p><em>For non-biologists</em>: you can compare this with many existing data science techniques, where you either churn through all the collected data or zoom in on a very specific signal you’re looking for.</p> <h1 id="5-things-you-should-know-about-your-gut-microbiome">5 things you should know about your gut microbiome</h1> <h2 id="1-community-complexity">1. Community complexity</h2> <p>Microbial communities range widely in their complexity. By complexity we mean the number of unique OTUs (Operational Taxonomic Units) and their proportions.</p> <p>There are several ways to quantify this complexity, through <em>unweighted</em> and <em>weighted</em> indices borrowed from the existing macroecology literature:</p> <h3 id="indices-and-metrics">Indices and metrics</h3> <p><img src="https://media.makeameme.org/created/diversity.jpg" alt="diversity" /></p> <p>There are many of these around, and the famous ones are <em>$\alpha$</em>-diversity and <em>$\beta$</em>-diversity.</p> <p>The former, <em>$\alpha$-diversity</em>, measures within-sample diversity and includes Shannon-Weaver and Simpson. If you want something more weighted, then try Taxonomic Diversity $\Delta$ or Taxonomic Distinctness $\Delta^+$. <em>(See the R package <a href="https://cran.r-project.org/web/packages/vegan/vignettes/diversity-vegan.pdf">vegan</a> for more explanation.)</em></p> <p><em>$\beta$-diversity</em> describes the total species diversity across samples over the average species diversity per sample; it is used essentially as a measure to investigate heterogeneity amongst samples.</p> <h2 id="2-analysis-is-hard">2. Analysis is Hard</h2> <p>Simpler communities are by far easier to study.
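</p>

<p>To make “complexity” concrete, here is a minimal Python sketch (standard library only; the OTU abundance vectors are made up) of the Shannon-Weaver and Gini-Simpson indices mentioned above:</p>

```python
import math

def shannon(counts):
    """Shannon-Weaver index H = -sum(p_i * ln p_i) over non-zero OTUs."""
    total = sum(counts)
    ps = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in ps)

def gini_simpson(counts):
    """Gini-Simpson index 1 - sum(p_i^2): the chance that two randomly
    drawn reads come from different OTUs."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Hypothetical OTU abundance vectors for two communities
even_community   = [25, 25, 25, 25]  # 4 OTUs, perfectly even
skewed_community = [97, 1, 1, 1]     # dominated by a single OTU

print(shannon(even_community))    # ln(4), the maximum for 4 OTUs
print(shannon(skewed_community))  # much lower: less diverse
```

<p>Both indices rise with richness (more OTUs) and evenness (more equal proportions), which is exactly what “complexity” means here.</p>

<p>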
However, one must also consider the number of reference genomes available for the community in question; i.e., if the community is simple but the reference genomes are sparse and few and far between, chances are the analysis will unveil little about the community save for top-level analyses.</p> <p>This places the gut microbiome in a good spot to be studied, as it is relatively well understood, with good reference genomes, and isn’t as complicated as the soil or wastewater microbiome.</p> <p><img src="http://m.quickmeme.com/img/c1/c18124ff89dc0248eadb2b59e842412592735b9b056a576547e2dd34165b7476.jpg" alt="Goldilocks" /></p> <p><em>The gut microbiome is right smack in the Goldilocks zone for analysis</em></p> <h2 id="3-types-of-communities">3. Types of Communities</h2> <h3 id="simple-communities">Simple Communities</h3> <p><img src="http://m.memegen.com/cmtc36.jpg" alt="simple communities" /></p> <p>These are usually found in very harsh or low-nutrient environments.</p> <h3 id="complex-communities">Complex communities</h3> <p><img src="http://img.memecdn.com/reson-of-women-crying_o_1511995.jpg" alt="complex" /></p> <p>Examples of complex communities include soil and sewage, where species numbers reach the 1000-2000 range.</p> <h3 id="synthetic-communities">Synthetic communities</h3> <p><img src="https://s-media-cache-ak0.pinimg.com/originals/dd/be/ed/ddbeed5b6578f12c4d569a01784972b8.jpg" alt="robots" /></p> <p>These are man-made and can be either simulated <em>in silico</em> or sampled from an artificial mixture. They are simple, rarely going beyond a hundred species, and are mainly used for understanding and testing ecological theories.</p> <h4 id="enriched-communities">Enriched communities</h4> <p><img src="https://s-media-cache-ak0.pinimg.com/736x/e3/f8/75/e3f87538b7c52c0df16a6a15da4ee9ef.jpg" alt="borg" /></p> <p>Such communities stand at the intersection between the simple but artificial and the complex but close to naturally occurring communities.
They form a sub-category under artificial communities.</p> <h2 id="4-fingerprinting">4. Fingerprinting</h2> <p>To get an idea of how unique this “key” can get, we can look at how long a typical SSH RSA key 🔑 is:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-----BEGIN PUBLIC KEY-----
MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCq
GKukO1De7zhZj6+H0qtjTkVxwTCpvKe4eCZ0FPq
ri0cb2JZfXJDgYSF6vUpwmJG8wVQZKjeGcjDOL5U
lsuusFncCzWBQ7RKNUSesmQRMSGkVb1/3j+skZ6U
tW+5u09lHNsj6tQ51s1SPrCBkedbNf0Tp0GbMJDy
R4e9T04ZZwIDAQAB
-----END PUBLIC KEY-----
</code></pre></div></div> <p>This identifies your machine as you when you try to log into a host server.</p> <p>Similarly, a microbiome signature would look like this:</p> <p><img src="http://www.frontiersin.org/files/Articles/138115/fmicb-06-00944-HTML/image_m/fmicb-06-00944-g001.jpg" alt="microbiome" /></p> <p>Look at how much the OTU abundances resemble a barcode:</p> <p><img src="http://thewindowsclub.thewindowsclubco.netdna-cdn.com/wp-content/uploads/2011/11/Barcode.jpg" alt="barcode" /></p> <p>We use this to differentiate groups of individuals from one another, usually via the diets that each individual shares with others in the same group.</p> <p>Take for example the plots found in the example analysis below, for the gut microbiomes of two different sets of mice, each fed on a specific diet.</p> <h2 id="5-engineer-your-gut-microbiome-now">5. Engineer Your Gut Microbiome Now.</h2> <p>Ultimately, the gut microbiome is resilient to change and will probably stay the same unless you do something drastic about your diet, like going vegan, as referenced in this <a href="http://www.nature.com/nrmicro/journal/v14/n1/abs/nrmicro3552.html">review article</a>.
However, it is a good diagnostic for identifying groups whose physiology has altered their microbiome.</p> <h1 id="example-analyses">Example analyses</h1> <p>I’m including links to short analyses of three groups of mice: two groups fed on two different diets, and one group which was fed a transition diet.</p> <p><em>DISCLAIMER: The following is based on unpublished data (Little et al.). Any reproduction or use of the analysis and the results for personal/commercial use is prohibited. If you have any enquiries please contact the author of this post at [email protected]</em></p> <p><em>Part 1</em>: <a href="http://metamaps.scelse.nus.edu.sg/analyses/mouse-initial.html">Unbiased 16S profiling of whole metagenome data using RiboTagger</a></p> <p><em>Part 2</em>: <a href="http://metamaps.scelse.nus.edu.sg/analyses/init.0105.wholeMetagenome.html">Whole metagenome profiling</a></p> <h2 id="conclusion">Conclusion</h2> <p>Things are bound to get more interesting as studies with higher-throughput time-series experiments become the norm in the near future.</p> <h1 id="references">References</h1> <p>1: Xie, C., Lui, C., Goi, W., Huson, D. H., Little, P. F. R., &amp; Williams, R. B. H. (2016). RiboTagger: fast and unbiased 16S/18S profiling using whole community shotgun metagenomic or metatranscriptome surveys. BMC Bioinformatics, 17(Suppl 19). http://doi.org/10.1186/s12859-016-1378-x</p> <p>2: Franzosa, E. A., Hsu, T., Sirota-Madi, A., Shafquat, A., Abu-Ali, G., Morgan, X. C., &amp; Huttenhower, C. (2015). Sequencing and beyond: integrating molecular “omics” for microbial community profiling. Nature Reviews Microbiology, 13(6), 360–72. http://doi.org/10.1038/nrmicro3451</p> <p>3: Jari Oksanen, F. Guillaume Blanchet, Michael Friendly, Roeland Kindt, Pierre Legendre, Dan McGlinn, Peter R. Minchin, R. B. O’Hara, Gavin L. Simpson, Peter Solymos, M. Henry H. Stevens, Eduard Szoecs and Helene Wagner (2016). vegan: Community Ecology Package. R package version 2.4-1.
https://CRAN.R-project.org/package=vegan</p> <p><a href="https://etheleon.github.io/articles/metagenomics/">5 Things You Didnt Know About The Bacteria In Your Gut</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on December 27, 2016.</p> <![CDATA[My First post]]> https://etheleon.github.io/blog/my-first-post 2016-12-21T00:00:00-00:00 2016-12-21T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <p>So let’s kick it off with some obligatory self-introduction. Being a PhD candidate here at NUS, Singapore (Computational Biology / Metagenomics), I encounter some pretty nifty data analyses which I would love to share.</p> <p>In particular, the latest statistical / machine learning methods and the code required for the voodoo to happen.</p> <p>This blog fulfills two roles: a platform for sharing my ideas and thoughts (analytical and technical), and a record of my recent foray into the startup scene here in Singapore.</p> <p>Being the tech co-founder of <a href="https://fundmylife.co">fundMyLife</a>, an InsurTech company, has opened my eyes to the many facets of doing a (tech) business. Besides the basics of software development, there’s so much more now open to me: marketing, growth hacking and bizDev.</p> <p>Back to the introductory post: I’m guessing I’ll first blog about my research analyses before going into posts related to startups.</p> <p>The coming posts, 2 in fact, *grins* will be analyzing the gut microbiomes of three groups of mice – two fed on two diets and one which had their diet changed from one to the other. Code will of course be published together.</p> <p>See ya folks!</p> <p>Looking forward to posting the analyses.</p> <p><a href="https://etheleon.github.io/blog/my-first-post/">My First post</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on December 21, 2016.</p>