Ars de Datus-Scientia Jekyll 2021-05-18T08:04:51+00:00 https://etheleon.github.io/ Wesley GOI https://etheleon.github.io/ [email protected] <![CDATA[Not so big queries, hitchhiker’s guide to datawarehousing with datalakes with Spark]]> https://etheleon.github.io/articles/datalakes 2021-03-17T00:00:00-00:00 2021-03-17T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <h1 id="introduction">Introduction</h1> <p>Since our <a href="https://etheleon.github.io/articles/spark-joy/">earlier post</a>, I’ve defended my thesis and moved on from Honestbee to AirAsia to work on dynamic/automated base pricing for ancillary products. However, COVID-19 happened and I found myself back in the ride-hailing (GrabTransport) and deliveries (GrabFood) industry. Time does pass by quickly.</p> <blockquote> <p>The following post does not represent my employer and reflects my own personal views only.</p> </blockquote> <p>Once again, Spark makes a return in my current job. After using Redshift and GCP’s BigQuery, I’ve developed a working style which separates ETL work (for example, feature engineering), done in SQL, from model training, which starts in a notebook, is then formalised as a class, and finally becomes a script or ML pipeline.</p> <blockquote> <p>In my opinion, <strong>SQL is king</strong> since it is the common tongue amongst data practitioners: Engineers, Scientists and Analysts.</p> </blockquote> <p>As for the ETL portion, nothing too drastic has changed. For quick and dirty data exploration I would use Alation Compose, the web-based SQL workbench offered by the company’s official data catalog, which does not require one to set up login credentials with a local client like DataGrip or DBeaver. Both BigQuery and Alation (Presto) offer the ability to share queries via a link and have excellent access control. Important information about tables is also found in the catalog ie. 
column descriptions, PII and data owners, and is very similar to the features offered by GCP’s Data Catalog.</p> <p>However, this time, instead of Redshift or BigQuery there is Hive. <em>Tables</em> exist within the Hive MetaStore (HMS), which I initially struggled to add and drop tables from. The decoupling of the query engine from the data warehouse was also quite jarring at first, since there were two dialects used by the company: Spark SQL and Presto SQL. With BigQuery, the query engine is part of the data warehouse.</p> <p>This succinct decoupling of (1) storage, (2) query engine and (3) metastore was quite new to me. It was only later, after doing some reading on my own, that I was able to map BigQuery to the current setup. For example, instead of Colossus there is S3 for my file system. And for the query engine, instead of Dremel you would have a constantly available Presto cluster or a transient Spark cluster for heavier lifting jobs. Table definitions and metadata are stored in the Hive MetaStore (HMS).</p> <p><img src="https://etheleon.github.io/images/alation.png" alt="alation" /></p> <p>I only spin up Spark when I need to carry out some in-depth analysis, feature engineering or model building. Previously, at Honestbee, we used an external vendor to maintain our Spark infrastructure, while in the current company we have an in-house team managing the Spark clusters to keep them cost efficient. Similar to BQ’s datasets and tables, we are able to save tables in HMS. However, this was not very clear to me initially, and this post aims to bridge any knowledge gaps when using Spark as the query engine. I might later add an edit or a new post on using Presto to build views in Hive.</p> <h1 id="sparksession">SparkSession</h1> <p>Sadly, the companies I’ve worked in are mostly Python-centric and my use of R has also decreased. 
However, I still use ggplot2 for plotting, although there have been some <a href="http://plotnine.readthedocs.io">developments</a> porting it to Python.</p> <p>Similar to R’s <code class="language-plaintext highlighter-rouge">sparklyr</code> package, in PySpark we create a spark connection of type <code class="language-plaintext highlighter-rouge">SparkSession</code> (it is often aliased as <code class="language-plaintext highlighter-rouge">spark</code>).</p> <p>Since Spark 2.x, <strong>SparkSession</strong> unifies the <strong>SparkContext</strong> and <strong>HiveContext</strong> classes into a single interface. Its use is recommended over the older APIs for code targeting Spark 2.0.0 and above.</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=create_sparksession.r"></script> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=create_session.py"></script> <h2 id="inputoutput">Input/Output</h2> <p><a href="https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html">With HIVE support enabled</a>, Spark can query tables found in HMS whose partitions are known beforehand, or query directly against files stored in buckets if they are not registered in HMS.</p> <h3 id="input">Input</h3> <p>Spark supports querying against a metastore like a traditional data warehouse, but you can also query flat files in S3, much like how you can create tables in BQ from external files, eg. CSV or Parquet.</p> <h4 id="external-files">External files</h4> <p>A common format is Parquet; you could register the data as a table in HMS or just work on it in memory. 
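</p> <p>For instance, when reading only a subset of date partitions directly, you first need the file paths. A plain-Python sketch (no Spark required; the bucket and prefix are from the running example, the helper name is made up) that builds the path list, which you could then hand to <code class="language-plaintext highlighter-rouge">spark.read.parquet(*paths)</code>:</p>

```python
from datetime import date, timedelta

def partition_paths(bucket, prefix, start, end):
    """Build one s3a path per daily partition in [start, end]."""
    n_days = (end - start).days + 1
    return [
        f"s3a://{bucket}/{prefix}/date={start + timedelta(days=d):%Y%m%d}"
        for d in range(n_days)
    ]

paths = partition_paths(
    "datascience-bucket", "wesley.goi/data/pricing/demand_tbl",
    date(2021, 3, 1), date(2021, 3, 3),
)
# three paths, ending in date=20210301, date=20210302, date=20210303
```

<p>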
If you do not need to register this as a table, you can read the files directly into memory like the following.</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=read_partitions.py"></script> <blockquote> <p>You’ll be able to register this as an in-memory view using <code class="language-plaintext highlighter-rouge">df.createTempView("&lt;view_name&gt;")</code> and you might also consider caching to load the whole table into memory. Since this DataFrame only exists in memory and is not registered in HMS, there are no table partitions; however, the in-memory RDDs are partitioned.</p> </blockquote> <h3 id="output">Output</h3> <p>If you would like the results of the ETL/query to persist so you can query them again in the future, you could save them either in Parquet, as an archived intermediate step, or in TensorFlow’s TFRecord format for machine learning.</p> <h4 id="tensorflow">Tensorflow</h4> <p>When working with the TensorFlow framework, the recommended format is TFRecord, read via <code class="language-plaintext highlighter-rouge">tf.data.TFRecordDataset</code>.</p> <p>You can save your results to this format using the following (gzipped to save space):</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=save_to_tfrecord.py"></script> <blockquote> <p>💡 To use the <code class="language-plaintext highlighter-rouge">tfrecords</code> format, remember to include the connector JAR and place it in the <code class="language-plaintext highlighter-rouge">extra_classpath</code>. 
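</p> <p>One way to pull the connector in (a sketch only; the exact mechanism depends on how your clusters are provisioned) is to resolve it from Maven Central at submit time, using the same coordinates as the JAR linked in this note:</p>

```shell
# Resolve the TF connector at submit time; your_etl_job.py is a placeholder.
spark-submit \
  --packages org.tensorflow:spark-tensorflow-connector_2.11:1.15.0 \
  your_etl_job.py
```

<p>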
At the point of writing, <a href="https://repo1.maven.org/maven2/org/tensorflow/spark-tensorflow-connector_2.11/1.15.0/spark-tensorflow-connector_2.11-1.15.0.jar">org.tensorflow:spark-tensorflow-connector_2.11:1.15.0</a> works with the Gzip codec.</p> </blockquote> <h3 id="registering-tables-in-hms-with-parquet-files-in-s3">Registering tables in HMS with parquet files in S3</h3> <p>Similar to how BigQuery stores the underlying data of tables in <code class="language-plaintext highlighter-rouge">capacitor</code>, a columnar file format stored in Google’s file system <code class="language-plaintext highlighter-rouge">colossus</code> (GCS is built on top of Colossus), it’s recommended to store the data in <code class="language-plaintext highlighter-rouge">parquet</code>, also a columnar file format, in S3.</p> <blockquote> <p>TIP: The fastest way to check if the table exists is to run <code class="language-plaintext highlighter-rouge">DESCRIBE schema.table</code></p> </blockquote> <p>In the following example we are going to assume that the parquet files are stored in the following path: <code class="language-plaintext highlighter-rouge">s3://datascience-bucket/wesley.goi/data/pricing/demand_tbl/</code></p> <h3 id="partitioned-tables-in-hms">Partitioned tables in HMS</h3> <p>When working with Spark and HMS, one has to be mindful of the term <strong>partition</strong>. In Spark, the term refers to data partitioning in Resilient Distributed Datasets (RDDs), where partitions are the chunks of data sent to workers/executors for parallel processing. In HMS, the term describes how the data is laid out in the cloud file system, eg. S3, and helps guide queries against the dataset in an efficient manner, which is closer to partitioned tables in databases.</p> <p>First you’ll need to be able to save the data in S3; there’s a specific naming convention for the file path which you’ll need to follow ie. 
<code class="language-plaintext highlighter-rouge">s3://&lt;bucket&gt;/prefix/key=value</code>.</p> <p>As you have seen, one of the most common ways to partition a table is via timestamp eg. <code class="language-plaintext highlighter-rouge">s3://&lt;bucket&gt;/prefix/date=YYYYMMDD</code></p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=simple_partition.sql"></script> <p>One can also partition on multiple columns, although in a nested manner eg. <code class="language-plaintext highlighter-rouge">folder/year=2021/month=03/day=21</code></p> <p>Where the folder structure follows:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Year=yyyy
|---Month=mm
|   |---Day=dd
|   |   |---&lt;parquet-files&gt;
</code></pre></div></div> <blockquote> <p>⚠️: Check whether the external table you’re querying is already partitioned with <code class="language-plaintext highlighter-rouge">SHOW PARTITIONS table</code></p> </blockquote> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=nested_partitions.sql"></script> <blockquote> <p>💡 You can check the number of partitions scanned by running <code class="language-plaintext highlighter-rouge">.explain(mode="formatted")</code> on the DataFrame</p> </blockquote> <h4 id="generate-column-data-type-schema">Generate Column data type schema</h4> <h5 id="manual">Manual</h5> <p>You can manually prepare the table column <a href="https://cloud.google.com/bigquery/docs/schemas">schema</a>, as in BigQuery, save it in a JSON file and parse it.</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=example_schema.json"></script> <h5 id="infer">Infer</h5> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=infer_col_type.py"></script> <h4 id="create-table">Create table</h4> <p>To create a table in Hive, we will be using the CREATE statement from Hive SQL.</p> <script 
src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=create_hive_table.sql"></script> <p>You might also want to check if the table exists: <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=check_if_table_exists.py"></script></p> <h4 id="insert-partitions">Insert partitions</h4> <p>In this example we will be adding <code class="language-plaintext highlighter-rouge">s3://datascience-bucket/wesley.goi/data/pricing/demand_tbl/year=2021/month=01/day=11/hour=01</code></p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=add_partition_to_hive_table.sql"></script> <p>You can check if the partition has been added by running <code class="language-plaintext highlighter-rouge">SHOW PARTITIONS pricing.demand_tbl</code></p> <table> <thead> <tr> <th>partition</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">year=2021/month=01/day=11/hour=01</code></td> </tr> </tbody> </table> <p>However, when you query the table, you’ll notice that you cannot query the partition yet.</p> <p>You’ll still have to refresh the table for that partition:</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=refresh_table.sql"></script> <h4 id="bulk-import">Bulk import</h4> <p>If you have multiple partitions and do not wish to rerun the above for each partition, you can run the MSCK REPAIR command to sync all the partitions to HMS.</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=bulk_update.sql"></script> <h2 id="temp-views--tables">Temp Views / Tables</h2> <p>In the same Spark session, it is possible to create a temp view. 
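</p> <p>Before moving on: the add-partition-then-refresh dance from the previous section is easy to script. A plain-Python sketch (the helper name is made up; table and path are from the running example) that generates the two statements you would pass to <code class="language-plaintext highlighter-rouge">spark.sql(...)</code>:</p>

```python
def add_partition_stmts(table, base_path, **partition):
    """Build the ALTER TABLE + REFRESH statements for one Hive partition.

    Keyword arguments are the partition columns, in order (py3.7+ dicts
    preserve insertion order), e.g. year=2021, month="01".
    """
    spec = ", ".join(f"{col}='{val}'" for col, val in partition.items())
    location = (
        base_path.rstrip("/")
        + "/"
        + "/".join(f"{col}={val}" for col, val in partition.items())
    )
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) LOCATION '{location}'",
        f"REFRESH TABLE {table}",
    ]

stmts = add_partition_stmts(
    "pricing.demand_tbl",
    "s3://datascience-bucket/wesley.goi/data/pricing/demand_tbl",
    year=2021, month="01", day="11", hour="01",
)
```

<p>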
Temp views should not be confused with views in BigQuery; these are not registered in HMS and persist only for the duration of the given <code class="language-plaintext highlighter-rouge">SparkSession</code>.</p> <p>Data is stored in an in-memory columnar format.</p> <p>These are especially useful if the data manipulation is complicated and multi-stepped and you wish to persist some intermediate tables. In BQ, I would just save the intermediate result as a table.</p> <blockquote> <p>NOTE: temp tables == temp views.</p> </blockquote> <p>From a query:</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=create_temp_view.sql"></script> <h2 id="views">Views</h2> <p>Unfortunately, you cannot register a view in Hive using Spark, but you can do so in Presto.</p> <h2 id="sampling">Sampling</h2> <p>Often when training your model, you might need to <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.sample.html">sample</a> from the existing dataset due to memory constraints.</p> <p>You might also want to set a seed when caching if you are doing hyperparameter tuning, so you will get the same dataset on each iteration. 
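</p> <p>In PySpark this is <code class="language-plaintext highlighter-rouge">df.sample(withReplacement=False, fraction=..., seed=...)</code>. The effect of pinning the seed can be illustrated with plain Python’s <code class="language-plaintext highlighter-rouge">random</code> module (a loose analogy only: Spark’s <code class="language-plaintext highlighter-rouge">fraction</code> is a per-row probability, not an exact sample size):</p>

```python
import random

def draw(rows, fraction, seed):
    """Sample without replacement; a fixed seed gives a fixed sample."""
    rng = random.Random(seed)
    return rng.sample(rows, int(len(rows) * fraction))

rows = list(range(1000))
# Same seed -> identical sample on every hyperparameter-tuning iteration
assert draw(rows, 0.1, seed=42) == draw(rows, 0.1, seed=42)
# Different seeds -> (almost certainly) different samples
assert draw(rows, 0.1, seed=1) != draw(rows, 0.1, seed=2)
```

<p>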
Also set the <code class="language-plaintext highlighter-rouge">withReplacement</code> parameter to <code class="language-plaintext highlighter-rouge">False</code>.</p> <h1 id="caching">Caching</h1> <p>With the ANSI SQL statement, caching is not lazy; the table is stored in memory immediately.</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=cache_table.sql"></script> <blockquote> <p>Compared to PySpark’s <code class="language-plaintext highlighter-rouge">df.cache()</code> (where you’ll have to run <code class="language-plaintext highlighter-rouge">df.count()</code> to force the table to be loaded into memory), the above SQL statement is not lazy and will store the table in memory once executed.</p> </blockquote> <h2 id="udfs">UDFs</h2> <p>User-Defined Functions (UDFs) let you define your own functions, which you can write in Python before registering them for use in SQL with <code class="language-plaintext highlighter-rouge">spark.udf.register</code>.</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=create_udf.py"></script> <blockquote> <p><em>NOTE</em>: If a UDF requires C binaries which need to be compiled, you’ll need to install them in the image used by the worker nodes.</p> </blockquote> <h2 id="sql-hints">SQL Hints</h2> <p>Hints go back as early as Spark 2.2, which introduced the <code class="language-plaintext highlighter-rouge">BROADCAST</code> hint. They can be grouped into several categories.</p> <h3 id="repartitioning">Repartitioning</h3> <p>By default, repartitioning produces 200 partitions. You might not want this, and to optimise the query you can <em>hint</em> Spark otherwise:</p> <ol> <li><code class="language-plaintext highlighter-rouge">REPARTITION</code></li> <li><code class="language-plaintext highlighter-rouge">COALESCE</code> only reduces the number of partitions; it is an optimised version of repartition. 
Data is kept on the original nodes, and only the partitions which need to be moved are moved (see example below)</li> <li><code class="language-plaintext highlighter-rouge">REPARTITION_BY_RANGE</code> eg. you have records with a running id from 0 to 100000 and you want to split them into 3 partitions: <code class="language-plaintext highlighter-rouge">repartitionByRange(3, col)</code></li> </ol> <p>When coalescing, you’re shrinking the number of nodes on which the data is kept, eg. from 4 to 2:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># original
Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12

# Coalescing from 4 to 2 partitions:
Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)
</code></pre></div></div> <p>You can also improve query time by including columns when repartitioning, especially if you are joining on these columns. This applies to tables as well as temp views.</p> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=hints.sql"></script> <blockquote> <p>You can also chain multiple repartition hints: repartition(100), coalesce(500) and repartition by range for column <code class="language-plaintext highlighter-rouge">c</code> into 3 partitions</p> </blockquote> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=chain_hints.sql"></script> <p>Per the <a href="https://spark.apache.org/docs/3.0.0/sql-ref-syntax-qry-select-hints.html">Spark SQL hints documentation</a>, the optimised plan is as follows:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Repartition to 100 partitions with respect to column c.
== Optimized Logical Plan ==
Repartition 100, true
+- Relation[name#29,c#30] parquet
</code></pre></div></div> <p>Often the number of records per partition is not equal, especially if you’re partitioning by time, and you might end up with the number of records per partition following a cyclic pattern, eg. 
traffic at night is much lighter than traffic in the day.</p> <p><img src="https://etheleon.github.io/images/partitions_imbalance.png" alt="partition_imbal" /></p> <h3 id="join-hints">Join hints</h3> <ul> <li><strong>BROADCAST JOIN</strong> replicates the full dataset (<em>if it can fit into the memory</em> of the workers) to all nodes</li> </ul> <p>These are useful for selective joins (where the output is expected to be small), when memory is not an issue and it’s the right table in a left join.</p> <p><img src="https://1.bp.blogspot.com/-s_HQfPph6z4/WcnjxGVNFkI/AAAAAAAAERM/9HfKO6H_SskkykKa_UaDRCo8URafsjixQCLcBGAs/s1600/Screen%2BShot%2B2017-09-25%2Bat%2B10.19.36%2BPM.png" alt="broadcast" /></p> <ul> <li><strong>MERGE</strong>: shuffle sort merge join</li> <li><strong>SHUFFLE_HASH</strong>: shuffle hash join. If both sides have the shuffle hash hints, Spark chooses the smaller side (based on stats) as the build side.</li> <li><strong>SHUFFLE_REPLICATE_NL</strong>: shuffle-and-replicate nested loop join</li> </ul> <h1 id="adaptive-query-execution-aqe">Adaptive Query Execution (AQE)</h1> <p>Another new feature which comes with Spark 3 is AQE. Previously, the query plan was fixed prior to execution and no optimisation was done thereafter.</p> <h2 id="partitions">Partitions</h2> <p>One of the areas ripe for optimisation during execution is determining the optimum number of partitions. 
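</p> <p>The intuition behind AQE’s partition coalescing: after a shuffle, many small partitions can be merged into fewer, target-sized ones using the actual sizes observed at runtime. A toy illustration in plain Python (this is not Spark’s actual algorithm; names and the target size are made up):</p>

```python
def coalesce_small_partitions(sizes_mb, target_mb=64):
    """Greedily merge adjacent shuffle partitions until each merged
    partition is close to the target size (toy version of the idea)."""
    merged, current = [], 0
    for size in sizes_mb:
        if current and current + size > target_mb:
            merged.append(current)
            current = 0
        current += size
    if current:
        merged.append(current)
    return merged

# Eight small shuffle partitions collapse into three near-target ones
print(coalesce_small_partitions([10, 5, 20, 8, 30, 2, 40, 12]))  # [43, 32, 52]
```

<p>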
By default, <code class="language-plaintext highlighter-rouge">spark.sql.shuffle.partitions</code> is set to 200; in cases where the dataset is small this number would be too large, while the reverse is also true.</p> <h3 id="broadcast-joins">Broadcast Joins</h3> <p>If the table on either side is smaller than the broadcast hash join threshold, sort merge joins are automatically converted to broadcast joins.</p> <blockquote> <p>You can try this <a href="https://docs.databricks.com/_static/notebooks/aqe-demo.html?_ga=2.53314607.646459968.1595449158-1487382839.1592553333">AQE Demo - Databricks</a></p> </blockquote> <script src="https://gist.github.com/etheleon/caa944b36077f83b7a448b9b03779216.js?file=spark_three_opts.py"></script> <p>A <code class="language-plaintext highlighter-rouge">CustomShuffleReader</code> node indicates that AQE is in use, and the plan ends with <code class="language-plaintext highlighter-rouge">AdaptiveSparkPlan</code>.</p> <p><a href="https://etheleon.github.io/articles/datalakes/">Not so big queries, hitchhiker’s guide to datawarehousing with datalakes with Spark</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on March 17, 2021.</p> <![CDATA[Spark Joy - Saying Konmari to your event logs with grammar of data manipulation]]> https://etheleon.github.io/articles/spark-joy 2019-02-20T00:00:00-00:00 2019-02-20T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <h1 id="sparklyr-joy">Sparklyr Joy</h1> <p>When you have a tonne of event logs to parse, what should the go-to <em>weapon</em> of choice be? In this article I’ll share our experience of using Spark/sparklyr to tackle this.</p> <p>At Honestbee 🐝, our event logs are stored in AWS S3, delivered to us by <a href="https://segment.com/blog/exactly-once-delivery/">Segment</a> at 40-minute intervals. 
The Data(Science) team uses these logs to evaluate the performance of our machine learning models, as well as to compare them against each other via canonical AB testing.</p> <p>In addition, we also use the same logs to track business KPIs like <strong>C</strong>lick <strong>T</strong>hrough <strong>R</strong>ate, <strong>C</strong>onversion <strong>R</strong>ate and GMV.</p> <p>In this article, I will share how we leverage high-memory clusters running Spark to parse the logs generated by the Food Recommender System.</p> <p><img src="https://raw.githubusercontent.com/etheleon/etheleon.github.io/master/images/recommender.png" alt="" /></p> <p><strong>Fig:</strong> Whenever an Honestbee customer proceeds to checkout, our ML models will try their best at making personalised predictions of which items you’ll most likely add to cart, especially things which you may have missed.</p> <blockquote> <p>A <em>post mortem</em> will require us to look through event logs to see which treatment group, based on a weighted distribution, a user has been assigned to.</p> </blockquote> <p>Now, LET’S DIVE IN!</p> <p>Let’s begin by importing the necessary libraries:</p> <script src="https://gist.github.com/etheleon/581eeeefed17530f60caa53262232a84.js"></script> <h1 id="connecting-with-the-high-memory-spark-cluster">Connecting with the high memory spark cluster</h1> <p>Next, we’ll need to connect with the Spark master node.</p> <h3 id="local-cluster">Local Cluster</h3> <p>Normally, if you’re connecting to a locally installed Spark cluster, you’ll set master as <code class="language-plaintext highlighter-rouge">local</code>.</p> <p>Luckily <code class="language-plaintext highlighter-rouge">sparklyr</code> already comes with an inbuilt function to install Spark on your local machine:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sparklyr</span><span class="o">::</span><span class="n">spark_install</span><span class="p">(</span><span 
class="w"> </span><span class="n">version</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"2.4.0"</span><span class="p">,</span><span class="w"> </span><span class="n">hadoop_version</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"2.7"</span><span class="w"> </span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <blockquote> <p>We are installing Hadoop together with Spark because the module required to read files from the S3 filesystem comes with Hadoop.</p> </blockquote> <p>Next you’ll connect with the cluster and establish a spark connection, <code class="language-plaintext highlighter-rouge">sc</code>.</p> <script src="https://gist.github.com/etheleon/673e551a5573358038896b6dada50721.js"></script> <blockquote> <p><strong>Caution:</strong> At Honestbee we do not have a local cluster, so the closest we have is a LARGE EC2 instance which sometimes gives out; you probably want a <em>managed</em> cluster set up by DEs or a 3rd-party vendor who knows how to deal with cluster management.</p> </blockquote> <h3 id="remote-clusters">Remote Clusters</h3> <p><em>Alternatively</em>, there’s also the option of connecting with a remote cluster via a REST API, ie. the R process is not running on the master node but on a remote machine. Often these are managed by 3rd-party vendors. At Honestbee, we chose this option and the clusters are provisioned by <a href="https://www.qubole.com/">Qubole</a> under our AWS account. PS. Pretty good deal!</p> <script src="https://gist.github.com/etheleon/2d61d1f5a83d1026b5f3dfa9eaa989b3.js"></script> <p>The gist above sets up a spark connection, <code class="language-plaintext highlighter-rouge">sc</code>; you will need to use this object in most of the functions that follow.</p> <p>Separately, because we are reading from S3, we will have to set the S3 access keys and secret. 
This has to be set before executing functions like <code class="language-plaintext highlighter-rouge">spark_read_json</code>.</p> <script src="https://gist.github.com/etheleon/f05cc79ad5cd6dc0eb3dbfc2e1bbedcc.js"></script> <blockquote> <p>So you would ask: what are the pros and cons of each? Local clusters are generally good for EDA, while with a remote cluster you will be communicating through a REST API (Livy).</p> </blockquote> <h1 id="reading-json-logs">Reading JSON logs</h1> <p>There are essentially two ways to read logs. The first is to read them in as whole chunks; the other is as a stream — as they get dumped into your bucket.</p> <p>There are two functions, <code class="language-plaintext highlighter-rouge">spark_read_json</code> and <code class="language-plaintext highlighter-rouge">stream_read_json</code>; the former is batched and the latter creates a structured data stream. There are also equivalent functions for reading your Parquet files.</p> <h2 id="batched">Batched</h2> <p>The path should be set with the <code class="language-plaintext highlighter-rouge">s3a</code> protocol. 
<code class="language-plaintext highlighter-rouge">s3a://segment_bucket/segment-logs/&lt;source_id&gt;/1550361600000</code></p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">json_input</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">spark_read_json</span><span class="p">(</span><span class="w"> </span><span class="n">sc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sc</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="o">=</span><span class="w"> </span><span class="s2">"logs"</span><span class="p">,</span><span class="w"> </span><span class="n">path</span><span class="o">=</span><span class="w"> </span><span class="n">s3</span><span class="p">,</span><span class="w"> </span><span class="n">overwrite</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>Below’s where the magic begins:</p> <script src="https://gist.github.com/etheleon/2a513d1000c38020a3a834a34e6d5e03.js"></script> <p>As you can see it’s a simple query,</p> <ol> <li>Filter for all <code class="language-plaintext highlighter-rouge">Added to Cart</code> events from the <code class="language-plaintext highlighter-rouge">Food</code> vertical</li> <li>Select following columns: <ul> <li><code class="language-plaintext highlighter-rouge">CartID</code></li> <li><code class="language-plaintext highlighter-rouge">experiment_id</code></li> <li><code class="language-plaintext highlighter-rouge">variant</code> (treatment_group) and</li> <li><code class="language-plaintext highlighter-rouge">timestamp</code></li> </ul> </li> <li>Remove events where users were not assigned to a model</li> <li>Add new columns <ul> <li><code class="language-plaintext highlighter-rouge">fulltime</code> readable time</li> <li><code class="language-plaintext 
highlighter-rouge">time</code> the hour of the day</li> </ul> </li> <li>Group the logs by service <code class="language-plaintext highlighter-rouge">recommender</code> and count the number of rows</li> <li>Add a new column <code class="language-plaintext highlighter-rouge">event</code> with the value <code class="language-plaintext highlighter-rouge">Added to Cart</code></li> <li>Sort by time</li> </ol> <h2 id="spark-streams">Spark Streams</h2> <p>Alternatively, you could also write the results of the above manipulation to a structured Spark stream.</p> <script src="https://gist.github.com/etheleon/bf72bee8d790cf4f16d76cbb233f7a9d.js"></script> <p>You can preview the results from the stream using the <code class="language-plaintext highlighter-rouge">tbl</code> function coupled with <code class="language-plaintext highlighter-rouge">glimpse</code>.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sc %&gt;% tbl("data_stream") %&gt;% glimpse
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Observations: ??
Variables: 2
Database: spark_connection
$ expt &lt;chr&gt; "Model_A", "Model_B"
$ n    &lt;dbl&gt; 5345, 621
</code></pre></div></div> <p>And that’s it, folks, on using sparklyr with your event logs.</p> <h2 id="model-metadata">Model Metadata</h2> <p><img src="https://raw.githubusercontent.com/etheleon/etheleon.github.io/master/images/model_graph.png" alt="" /></p> <p>With that many models in the wild, it’s hard to keep track of what’s going on. 
During my PhD, I personally worked on using graph databases to store data with complex relationships, and we are currently building such a system to store metadata related to our models.</p> <p>For example:</p> <ol> <li>Which APIs the models are associated with</li> <li>Which Airflow/Argo jobs retrain these models</li> <li>Which Helm charts and deployment metadata these models have</li> <li>And of course metadata like performance and scores.</li> </ol> <p>Come talk to us, we are hiring! <a href="https://boards.greenhouse.io/honestbee/jobs/1426737">Data Engineer</a>, <a href="https://boards.greenhouse.io/honestbee/jobs/1427566">Senior Data Scientist</a></p> <p><a href="https://etheleon.github.io/articles/spark-joy/">Spark Joy - Saying Konmari to your event logs with grammar of data manipulation</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on February 20, 2019.</p> <![CDATA[Tidying Up Pandas]]> https://etheleon.github.io/articles/tidying-up-pandas 2018-12-16T00:00:00-00:00 2018-12-16T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <p>For those who use Python’s pandas module daily, the first thing you notice is that there are often more ways than one to do almost everything.</p> <p>The purpose of this article is to demonstrate how we can limit this by drawing inspiration from R’s <code class="language-plaintext highlighter-rouge">dplyr</code> and <code class="language-plaintext highlighter-rouge">tidyverse</code> libraries.</p> <h1 id="tidying-up-pandas">Tidying up pandas?</h1> <p>As an academic, often enough the go-to <em>lingua franca</em> for data science is R. 
Especially if you’re coming from Computational Biology/Bioinformatics or Statistics.</p> <p>And likely you’ll be hooked on the famous <code class="language-plaintext highlighter-rouge">tidyverse</code> meta-package, which includes <code class="language-plaintext highlighter-rouge">dplyr</code> (previously <code class="language-plaintext highlighter-rouge">plyr</code>, for ply(e)r), <code class="language-plaintext highlighter-rouge">lubridate</code> (time-series) and <code class="language-plaintext highlighter-rouge">tidyr</code>.</p> <blockquote> <p>PS. As I am writing this article I realised it isn’t just <code class="language-plaintext highlighter-rouge">tidyverse</code>, but the whole R ecosystem which I’ve come to love whilst doing metagenomics and computational biology in general.</p> </blockquote> <p>For the benefit of those who started from R, <code class="language-plaintext highlighter-rouge">pandas</code> is <em>the</em> dataframe module for Python; several other packages like <a href="https://datatable.readthedocs.io/en/latest/using-datatable.html">datatable</a> exist, and datatable is heavily inspired by R’s own <a href="https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html">datatable</a>.</p> <p>Now back to how the tidyverse, specifically dplyr, organises dataframe manipulation.</p> <p>In his talk, <a href="https://youtu.be/dWjSYqI7Vog?t=2m7s">Hadley Wickham</a> mentioned that what we really need for table manipulation is just a handful of functions.</p> <ul> <li>filter</li> <li>select</li> <li>arrange</li> <li>mutate</li> <li>group_by</li> <li>summarise</li> <li>merge</li> </ul> <p>Although I would argue you need just a bit more. For example, knowing R’s family of <code class="language-plaintext highlighter-rouge">apply</code> functions helps tonnes. 
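</p> <p>For reference, the handful of verbs listed above map quite directly onto pandas; a minimal sketch (toy data, hypothetical column names):</p>

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "a", "b"],
    "petal":   [1.5, 1.25, 4.5],
})

out = (
    df[df["petal"] > 1.0]                          # filter
      [["species", "petal"]]                       # select
      .assign(petal_mm=lambda d: d["petal"] * 10)  # mutate
      .groupby("species", as_index=False)          # group_by
      .agg(mean_mm=("petal_mm", "mean"))           # summarise
      .sort_values("mean_mm")                      # arrange
)
print(out)  # one row per species with its mean petal length in mm
```

<p>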
Or a couple of summary statistics functions like <code class="language-plaintext highlighter-rouge">summary</code> or <code class="language-plaintext highlighter-rouge">str</code> , although nowadays I use <code class="language-plaintext highlighter-rouge">skimr::skim</code> a lot.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">skim</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span><span class="w"> </span><span class="c1">## Skim summary statistics</span><span class="w"> </span><span class="c1">## n obs: 150 </span><span class="w"> </span><span class="c1">## n variables: 5 </span><span class="w"> </span><span class="c1">## </span><span class="w"> </span><span class="c1">## ── Variable type:factor ──────────────────────────────────────────────────────────────────────────────────────────────────</span><span class="w"> </span><span class="c1">## variable missing complete n n_unique top_counts ordered</span><span class="w"> </span><span class="c1">## Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0 FALSE</span><span class="w"> </span><span class="c1">## </span><span class="w"> </span><span class="c1">## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────────</span><span class="w"> </span><span class="c1">## variable missing complete n mean sd p0 p25 p50 p75 p100 hist</span><span class="w"> </span><span class="c1">## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▁▂▅▅▃▁</span><span class="w"> </span><span class="c1">## Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8 2.5 ▇▁▁▅▃▃▂▂</span><span class="w"> </span><span class="c1">## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9 ▂▇▅▇▆▅▂▂</span><span class="w"> </span><span class="c1">## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4 ▁▂▅▇▃▂▁▁</span><span class="w"> </span></code></pre></div></div> <p>In fact, Google’s Facets behaves somewhat like 
this as well (see image below).</p> <p><img src="https://i.imgur.com/F7yQLnz.png" alt="Facets" /></p> <p>Thus, in this post I’ll try my best to demonstrate 1-to-1 mappings of the <code class="language-plaintext highlighter-rouge">tidyverse</code> vocabularies with <code class="language-plaintext highlighter-rouge">pandas</code> methods.</p> <p>For demonstration, we will be using the famous <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">Iris flower dataset</a>.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># python </span> <span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span> <span class="n">iris</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">load_dataset</span><span class="p">(</span><span class="s">"iris"</span><span class="p">)</span> </code></pre></div></div> <p>I’ve chosen to import the iris data using seaborn rather than sklearn’s datasets, which are numpy arrays.</p> <p>The first thing I usually do when I import a table is to run the <code class="language-plaintext highlighter-rouge">str</code> function on the table.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R (iris is already loaded by default)</span><span class="w"> </span><span class="n">str</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># python </span> <span class="n">iris</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="n">null_counts</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="c1"># if there are too many rows, pandas will skip the null count, # so I have to forcibly set `null_counts` to `True`. </span></code></pre></div></div> <h2 id="filter">Filter</h2> <p>The closest pandas method to R’s <code class="language-plaintext highlighter-rouge">filter</code> is <code class="language-plaintext highlighter-rouge">pd.DataFrame.query</code>.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">cutoff</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="w"> </span><span class="n">iris</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">sepal.width</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">cutoff</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>There are two ways to do this in python. 
The first is probably what you’ll find most python users using.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># python </span> <span class="n">cutoff</span> <span class="o">=</span> <span class="mi">30</span> <span class="n">iris</span><span class="p">[</span><span class="n">iris</span><span class="p">.</span><span class="n">sepal_width</span> <span class="o">&gt;</span> <span class="n">cutoff</span><span class="p">]</span> </code></pre></div></div> <p>However, <code class="language-plaintext highlighter-rouge">pd.DataFrame.query()</code> maps more closely with <code class="language-plaintext highlighter-rouge">dplyr::filter()</code>.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">iris</span><span class="p">.</span> \ <span class="n">query</span><span class="p">(</span><span class="s">"sepal_width &gt; @cutoff"</span><span class="p">)</span> <span class="c1"># this uses a SQL-like mini-language </span></code></pre></div></div> <blockquote> <p>One downside of using this is that linters which follow the <code class="language-plaintext highlighter-rouge">pep8</code> convention, like <code class="language-plaintext highlighter-rouge">flake8</code>, will complain about the <code class="language-plaintext highlighter-rouge">cutoff</code> variable being unused even though it has been declared. This is because the linters are unable to recognise the use of <code class="language-plaintext highlighter-rouge">cutoff</code> inside the quoted query string.</p> </blockquote> <p>Surprisingly, filter makes a return in pySpark. 
:)</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># python (pyspark) </span> <span class="nb">type</span><span class="p">(</span><span class="n">flights</span><span class="p">)</span> <span class="n">pyspark</span><span class="p">.</span><span class="n">sql</span><span class="p">.</span><span class="n">dataframe</span><span class="p">.</span><span class="n">DataFrame</span> <span class="c1"># filters flights which are &gt; 1000 miles long </span><span class="n">flights</span><span class="p">.</span><span class="nb">filter</span><span class="p">(</span><span class="s">'distance &gt; 1000'</span><span class="p">)</span> </code></pre></div></div> <h2 id="select">Select</h2> <p>This is reminiscent of SQL’s <code class="language-plaintext highlighter-rouge">select</code> keyword which allows you to choose columns.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">iris</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">sepal.width</span><span class="p">,</span><span class="w"> </span><span class="n">sepal.length</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">iris</span> \ <span class="p">.</span><span class="n">loc</span><span class="p">[:</span><span class="mi">5</span><span class="p">,</span> <span class="p">[</span><span class="s">"sepal_width"</span><span class="p">,</span> <span class="s">"sepal_length"</span><span class="p">]]</span> <span class="c1"># selects rows with labels 0 to 5 (loc is label-inclusive) </span></code></pre></div></div> <p>Initially, I thought the following <code class="language-plaintext highlighter-rouge">df[['col1', 'col2']]</code> pattern would be a good map. But I quickly realised we cannot do slices of the columns similar to <code class="language-plaintext highlighter-rouge">select</code>.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">iris</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">Sepal.Length</span><span class="o">:</span><span class="n">Petal.Width</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">iris</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">"sepal_length"</span><span class="p">:</span><span class="s">"petal_width"</span><span class="p">]</span> </code></pre></div></div> <p>A thing to note about the <code class="language-plaintext highlighter-rouge">loc</code> method is that it could return a Series instead of a DataFrame when the selection is just one row. 
So you’ll have to slice it with a list in order to return a DataFrame.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Python
iris.loc[1, :]    # returns a Series
iris.loc[[1], :]  # returns a DataFrame
</code></pre></div></div> <p>But the really awesome thing about the <code class="language-plaintext highlighter-rouge">select</code> function is its ability to <em>unselect</em> columns, which is missing in the <code class="language-plaintext highlighter-rouge">loc</code> method.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">col1</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>You have to use the <code class="language-plaintext highlighter-rouge">.drop()</code> method.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"col1"</span><span class="p">])</span> </code></pre></div></div> <blockquote> <p>Note I had to pass the <code class="language-plaintext highlighter-rouge">columns</code> parameter because <code class="language-plaintext highlighter-rouge">drop</code> is not only used to drop columns; the method can also drop rows based on their index.</p> </blockquote> <p>Like <code class="language-plaintext highlighter-rouge">filter</code>, <code class="language-plaintext highlighter-rouge">select</code> is also used in pySpark!</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># python (pySpark) </span> <span class="n">df</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="s">"xyz"</span><span class="p">).</span><span class="n">show</span><span class="p">()</span> <span class="c1"># shows the column xyz of the spark dataframe. </span> <span class="c1"># alternative </span><span class="n">df</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">xyz</span><span class="p">)</span> </code></pre></div></div> <h2 id="arrange">Arrange</h2> <p>The arrange function lets one sort the table by a particular column.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">col1</span><span class="p">))</span><span class="w"> </span></code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">df</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s">"col1"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="c1"># everything is reversed in python fml. 
</span></code></pre></div></div> <h2 id="mutate">Mutate</h2> <p><code class="language-plaintext highlighter-rouge">dplyr</code>’s <code class="language-plaintext highlighter-rouge">mutate</code> was really an upgrade from R’s <code class="language-plaintext highlighter-rouge">apply</code>.</p> <blockquote> <p><strong>NOTE</strong>: Other applies which are useful in R include, for example, <code class="language-plaintext highlighter-rouge">mapply</code> and <code class="language-plaintext highlighter-rouge">lapply</code>.</p> </blockquote> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="n">new</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">something</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">col2</span><span class="p">,</span><span class="w"> </span><span class="n">newcol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col</span><span class="m">+1</span><span class="w"> </span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">iris</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span> <span class="n">new</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">sepal_width</span> <span class="o">/</span> <span class="n">iris</span><span class="p">.</span><span class="n">sepal_length</span><span class="p">,</span> <span class="n">newcol</span> <span class="o">=</span> <span class="k">lambda</span> <span 
class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">"col"</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span> <span class="p">)</span> </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">tidyverse</code>’s <code class="language-plaintext highlighter-rouge">mutate</code> function by default takes the whole column and does vectorised operations on it. If you want to apply the function row by row, you’ll have to couple <code class="language-plaintext highlighter-rouge">rowwise</code> with <code class="language-plaintext highlighter-rouge">mutate</code>.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="c1"># my_function does not take vectorised input of the entire column</span><span class="w"> </span><span class="c1"># this will fail</span><span class="w"> </span><span class="n">iris</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">new_column</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">my_function</span><span class="p">(</span><span class="n">sepal.width</span><span class="p">,</span><span class="w"> </span><span class="n">sepal.length</span><span class="p">))</span><span class="w"> </span><span class="c1"># this will force mutate to be applied row by row</span><span class="w"> </span><span class="n">iris</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">rowwise</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">new_column</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">my_function</span><span 
class="p">(</span><span class="n">sepal.width</span><span class="p">,</span><span class="w"> </span><span class="n">sepal.length</span><span class="p">))</span><span class="w"> </span></code></pre></div></div> <p>To achieve the same using the <code class="language-plaintext highlighter-rouge">.assign</code> method you can nest an <code class="language-plaintext highlighter-rouge">apply</code> inside the function.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="kn">import</span> <span class="nn">re</span> <span class="k">def</span> <span class="nf">do_something_string</span><span class="p">(</span><span class="n">col</span><span class="p">):</span> <span class="k">if</span> <span class="n">re</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s">".*(osa)$"</span><span class="p">,</span> <span class="n">col</span><span class="p">):</span> <span class="n">value</span> <span class="o">=</span> <span class="s">"is_setosa"</span> <span class="k">else</span><span class="p">:</span> <span class="n">value</span> <span class="o">=</span> <span class="s">"not_setosa"</span> <span class="k">return</span> <span class="n">value</span> <span class="n">iris</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span> <span class="n">transformed_species</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">df</span><span class="p">:</span> <span class="n">df</span><span class="p">[</span><span class="s">"species"</span><span class="p">]</span> \ <span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">do_something_string</span><span class="p">)</span> <span class="p">)</span> </code></pre></div></div> <p>If you’re lazy, you could just chain two anonymous functions together.</p> <div 
class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">iris</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span> <span class="n">transformed_species</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">df</span><span class="p">:</span> <span class="n">df</span><span class="p">.</span><span class="n">species</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">do_something_string</span><span class="p">))</span> </code></pre></div></div> <h2 id="apply">Apply</h2> <p>From R’s <code class="language-plaintext highlighter-rouge">apply</code> help docs:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apply(X, MARGIN, FUN, ...) </code></pre></div></div> <p>Where the value of <code class="language-plaintext highlighter-rouge">MARGIN</code> takes either <code class="language-plaintext highlighter-rouge">1</code> or <code class="language-plaintext highlighter-rouge">2</code> for (rows, columns), ie. 
if you want to apply the function to each row, you set <code class="language-plaintext highlighter-rouge">MARGIN</code> to <code class="language-plaintext highlighter-rouge">1</code>.</p> <p>However, in pandas <code class="language-plaintext highlighter-rouge">axis</code> refers to which labels (the index <em>i</em> or the columns <em>j</em>) will be used as the applied function’s input parameter’s index.</p> <p>Axis <code class="language-plaintext highlighter-rouge">0</code> refers to the DataFrame’s index and axis <code class="language-plaintext highlighter-rouge">1</code> refers to the columns.</p> <p><img src="https://i.imgur.com/uNOGXVT.png" alt="Imgur" /></p> <p>So in R, if you wanted to carry out row-wise operations you would set <code class="language-plaintext highlighter-rouge">MARGIN</code> to <code class="language-plaintext highlighter-rouge">1</code>.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">row</span><span class="p">){</span><span class="w"> </span><span class="n">...</span><span class="w"> </span><span class="n">do</span><span class="w"> </span><span class="n">some</span><span class="w"> </span><span class="n">compute</span><span class="w"> </span><span class="n">...</span><span class="w"> </span><span class="p">})</span><span class="w"> </span></code></pre></div></div> <blockquote> <p>I rarely do that now, since <code class="language-plaintext highlighter-rouge">plyr</code> and later <code class="language-plaintext highlighter-rouge">dplyr</code> came along.</p> </blockquote> <p>However there is no <code class="language-plaintext highlighter-rouge">plyr</code> in pandas, so we have to go back to using apply for row-wise operations; the axis, however, is now 1 not 0. I initially found this very confusing. 
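<p>To make the pandas convention concrete, a minimal sketch with toy values (only the column names are borrowed from the iris example):</p>

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.0, 6.0],
    "sepal_width":  [3.0, 4.0],
})

# axis=0: the function receives each *column* as a Series
col_sums = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each *row* as a Series,
# whose index is the DataFrame's column names
row_sums = df.apply(lambda row: row.sum(), axis=1)
```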
The reason is because the <em>row</em> is really just a <code class="language-plaintext highlighter-rouge">pandas.Series</code> whose index is the parent pandas.DataFrame’s columns. Thus the axis argument refers to which axis to set as the index.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># python </span> <span class="n">iris</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">do_something</span><span class="p">(</span><span class="n">row</span><span class="p">),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> </code></pre></div></div> <p>An interesting pattern, which I do not use in R, is to use apply on columns, in this case <code class="language-plaintext highlighter-rouge">pandas.Series</code> objects.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># python </span> <span class="n">iris</span><span class="p">.</span><span class="n">sepal_width</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># for a fancy progress bar, call tqdm.pandas() first, then use progress_apply </span><span class="n">iris</span><span class="p">.</span><span class="n">sepal_width</span><span class="p">.</span><span class="n">progress_apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># if you need a parallel apply # this works with dask underneath </span><span 
class="kn">import</span> <span class="nn">swifter</span> <span class="n">iris</span><span class="p">.</span><span class="n">sepal_width</span><span class="p">.</span><span class="n">swifter</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> </code></pre></div></div> <p>In R, one of the common idioms, which I keep going back to for a parallel version of <code class="language-plaintext highlighter-rouge">groupby</code> is as follows:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">unique_list</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w"> </span><span class="n">...</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">col</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">do_something</span><span class="p">()</span><span class="w"> </span><span class="c1"># do something to the subset</span><span class="w"> </span><span class="n">...</span><span class="w"> </span><span class="p">})</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="n">rbind</span><span class="p">,</span><span class="n">.</span><span class="p">)</span><span class="w"> 
</span></code></pre></div></div> <p>If you want a parallel version, you just have to change the <code class="language-plaintext highlighter-rouge">lapply</code> to <code class="language-plaintext highlighter-rouge">mclapply</code>, which comes from the <code class="language-plaintext highlighter-rouge">parallel</code> / <code class="language-plaintext highlighter-rouge">snow</code> library in R.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># R</span><span class="w"> </span><span class="n">ncores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="c1"># the number of cores</span><span class="w"> </span><span class="n">unique_list</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mclapply</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w"> </span><span class="n">...</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">col</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">do_something</span><span class="p">()</span><span class="w"> </span><span class="c1"># do something to the subset</span><span class="w"> </span><span class="n">...</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="n">mc.cores</span><span class="o">=</span><span class="n">ncores</span><span class="p">)</span><span class="w"> </span><span 
class="o">%&gt;%</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="n">rbind</span><span class="p">,</span><span class="n">.</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>Separately, in pySpark, you can split the whole table into partitions and do the manipulations in parallel.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python (pyspark) </span> <span class="n">dd</span><span class="p">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">my_df</span><span class="p">,</span><span class="n">npartitions</span><span class="o">=</span><span class="n">nCores</span><span class="p">).</span>\ <span class="n">map_partitions</span><span class="p">(</span> <span class="k">lambda</span> <span class="n">df</span> <span class="p">:</span> <span class="n">df</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span> <span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">nearest_street</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">lat</span><span class="p">,</span><span class="n">x</span><span class="p">.</span><span class="n">lon</span><span class="p">),</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)).</span>\ <span class="n">compute</span><span class="p">(</span><span class="n">get</span><span class="o">=</span><span class="n">get</span><span class="p">)</span> <span class="c1"># imports at the end </span></code></pre></div></div> <p>To achieve the same, what we can use the <code class="language-plaintext highlighter-rouge">dask</code>, or a higher level wrapper from the <code class="language-plaintext highlighter-rouge">swiftapply</code> library.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="c1"># Python </span> <span class="c1"># you can easily vectorise the example using by adding the `swift` method before `.apply` </span><span class="n">series</span><span class="p">.</span><span class="n">swift</span><span class="p">.</span><span class="nb">apply</span><span class="p">()</span> </code></pre></div></div> <h2 id="group-by">Group by</h2> <p>The <code class="language-plaintext highlighter-rouge">.groupby</code> method in pandas is equivalent to R function <code class="language-plaintext highlighter-rouge">dplyr::group_by</code> returning a <code class="language-plaintext highlighter-rouge">DataFrameGroupBy</code> object.</p> <blockquote> <p>In Tidyverse there’s the <code class="language-plaintext highlighter-rouge">ungroup</code> function to ungroup grouped DataFrames, in order to achieve the same, there does not exists a1-to-1 mappable function.</p> <p>One way is to complete the <code class="language-plaintext highlighter-rouge">groupby</code> -&gt; <code class="language-plaintext highlighter-rouge">apply</code> (two-step process) and feeding apply with an identity function <code class="language-plaintext highlighter-rouge">apply(lambda x: x)</code>. Which is an identity function.</p> </blockquote> <h2 id="summarise">Summarise</h2> <p>In pandas the equivalent of the <code class="language-plaintext highlighter-rouge">summarise</code> function is <code class="language-plaintext highlighter-rouge">aggregate</code> abbreviated as the <code class="language-plaintext highlighter-rouge">agg</code> function. 
And you will have to couple this with <code class="language-plaintext highlighter-rouge">groupby</code>, so it’ll again be a similar two-step <code class="language-plaintext highlighter-rouge">groupby</code> -&gt; <code class="language-plaintext highlighter-rouge">agg</code> transformation.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># R
r_mt = mtcars %&gt;%
    mutate(model = rownames(mtcars)) %&gt;%
    select(cyl, model, hp, drat) %&gt;%
    filter(cyl &lt; 8) %&gt;%
    group_by(cyl) %&gt;%
    summarise(
        hp_mean   = mean(hp),
        drat_mean = mean(drat),
        drat_std  = sd(drat),
        diff      = max(drat) - min(drat)
    ) %&gt;%
    arrange(drat_mean) %&gt;%
    as.data.frame
</code></pre></div></div> <p>The same series of transformations written in Python would be:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Python
def transform1(x):
    return max(x) - min(x)

def transform2(x):
    return max(x) + 5

py_mt = (
    mtcars.
    loc[:, ["cyl", "model", "hp", "drat"]].  # select
    query("cyl &lt; 8").                        # filter
    groupby("cyl").                          # group_by
    agg(                                     # summarise; agg is an abbreviation of aggregate
        {
            'hp': 'mean',
            'drat': ['mean', 'std', transform1, transform2]  # R wins... this sux for pandas
        }).
    sort_values(by=[("drat", "mean")])       # multiindex sort (unique to pandas)
)
py_mt
</code></pre></div></div> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># R
df %&gt;%
    group_by(col) %&gt;%
    summarise(my_new_column = do_something(some_col))
</code></pre></div></div> <h2 id="join">Join</h2> <p>Natively, R supports the <code class="language-plaintext highlighter-rouge">merge</code> function, and similarly in pandas there’s the <code class="language-plaintext highlighter-rouge">pd.merge</code> function.</p> <p>Alongside <code class="language-plaintext highlighter-rouge">merge</code>, dplyr offers the <code class="language-plaintext highlighter-rouge">join</code> family: <code class="language-plaintext highlighter-rouge">left_join</code>, <code class="language-plaintext highlighter-rouge">right_join</code>, <code class="language-plaintext highlighter-rouge">inner_join</code> and <code class="language-plaintext highlighter-rouge">anti_join</code>.</p> <h2 id="inplace">Inplace</h2> <p>In R there’s the compound assignment pipe-operator <code class="language-plaintext highlighter-rouge">%&lt;&gt;%</code>, which is similar to the <code class="language-plaintext highlighter-rouge">inplace=True</code> argument in some pandas functions <em>but not all</em>.
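</p> <p>A minimal sketch of the difference, on a toy frame of my own:</p>

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# mutating style: modifies df in place and returns None
df.drop(columns=["b"], inplace=True)

# reassignment style: returns a new frame, closer to the tidyverse idiom
df2 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = df2.drop(columns=["b"])
```

<p>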
:( Apparently pandas is going to remove <code class="language-plaintext highlighter-rouge">inplace</code> altogether…</p> <h3 id="debugging">Debugging</h3> <p>In R, we have the <code class="language-plaintext highlighter-rouge">browser()</code> function.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># R
unique(iris$Species) %&gt;%
    lapply(function(s){
        browser()
        iris %&gt;% filter(Species == s)
        ....
    })
</code></pre></div></div> <p>It’ll let you <em>step</em> into the function, which is extremely useful if you want to do some debugging.</p> <p>In Python, there’s the <code class="language-plaintext highlighter-rouge">set_trace</code> function.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Python
from IPython.core.debugger import set_trace

(
    iris
    .groupby("species")
    .apply(lambda groupedDF: set_trace())
)
</code></pre></div></div> <p>Last but not least, if you really need to use some R function you can always rely on the <code class="language-plaintext highlighter-rouge">rpy2</code> package. I rely on it a lot for plotting. ggplot2 ftw!</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># python
import rpy2              # imports the library
%load_ext rpy2.ipython   # load the magic
</code></pre></div></div> <blockquote> <p>Sometimes there are issues installing R packages from within R.
You can run</p> </blockquote> <p><code class="language-plaintext highlighter-rouge">conda install -c r r-tidyverse r-ggplot2</code></p> <p>Thereafter you can use R and Python interchangeably in the same Jupyter notebook.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%%R -i python_df -o transformed_df
transformed_df = python_df %&gt;%
    select(-some_columns) %&gt;%
    mutate(newcol = somecol * 2)
</code></pre></div></div> <blockquote> <p>NOTE: <code class="language-plaintext highlighter-rouge">%%R</code> is cell magic and <code class="language-plaintext highlighter-rouge">%R</code> is line magic.</p> </blockquote> <p>If you need outputs to be printed like a normal pandas DataFrame, you can use the single-percent magic:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%R some_dataFrame %&gt;% skim
</code></pre></div></div> <h2 id="elipisis">Ellipsis</h2> <p>In R, one nifty trick you can do is to pass arguments
to inner functions without ever having to define them in the outer function’s signature.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># R
#' Simple function which takes two parameters `one` and `two` and the ellipsis `...`
somefunction = function(one, two, ...){
    three = one + two
    sometwo = function(x, four){
        x + four
    }
    sometwo(three, ...)  # four exists within the ellipsis
}

# because of the ellipsis, we can pass as many parameters as we want;
# the extras are stored in the ellipsis
somefunction(one=2, two=3, four=5, name="wesley")
</code></pre></div></div> <p>In Python, <code class="language-plaintext highlighter-rouge">**kwargs</code> takes the place of <code class="language-plaintext highlighter-rouge">...</code>. Below is an explanation of how exactly it works.</p> <h4 id="explanation">Explanation</h4> <p>Firstly, the double asterisk <code class="language-plaintext highlighter-rouge">**</code> is called the <em>unpack</em> operator (it’s placed before a parameter name, eg.
<code class="language-plaintext highlighter-rouge">kwargs</code> so together it’ll look like <code class="language-plaintext highlighter-rouge">**kwargs</code>).</p> <blockquote> <p>The convention is to let that variable be named <code class="language-plaintext highlighter-rouge">kwargs</code> (which stands for <strong>k</strong>ey<strong>w</strong>orded arguments) but it could be named anything.</p> </blockquote> <p>Most articles which describe the unpack operator will start off with <strong>this</strong> explanation: where dictionaries are used to pass functions their parameters.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python </span> <span class="n">adictionary</span> <span class="o">=</span> <span class="p">{</span> <span class="s">'first'</span> <span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s">'second'</span><span class="p">:</span> <span class="mi">2</span> <span class="p">}</span> <span class="k">def</span> <span class="nf">some_function</span><span class="p">(</span><span class="n">first</span><span class="p">,</span> <span class="n">second</span><span class="p">):</span> <span class="k">return</span> <span class="n">first</span> <span class="o">+</span> <span class="n">second</span> <span class="n">some_function</span><span class="p">(</span><span class="o">**</span><span class="n">adictionary</span><span class="p">)</span> <span class="c1"># which gives 3 </span></code></pre></div></div> <p><img src="https://i.imgur.com/ggSP2dK.jpg" alt="unpacking" /></p> <p>But you could also twist this around and set <code class="language-plaintext highlighter-rouge">**kwargs</code> as a function signature. 
Doing this lets you key in an arbitrary number of keyword arguments when calling the function.</p> <p>The keyword-value pairs are wrapped into a dictionary named <code class="language-plaintext highlighter-rouge">kwargs</code>, which is accessible inside the function.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Python
# dummy function which prints `kwargs`
def some_function(**kwargs):
    print(kwargs)

some_function(first=1, second=2)
</code></pre></div></div> <p>The previous two cases are not exclusive; you could actually ~<strong><em>mix</em></strong>~ them together, ie.
have named parameters as well as <code class="language-plaintext highlighter-rouge">**kwargs</code>.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Python
adictionary = {
    'first' : 1,
    'second': 2,
    'useless_value' : "wesley"
}

def some_function(first, second, **kwargs):
    print(kwargs)
    return first + second

print(some_function(**adictionary))
</code></pre></div></div> <p>The printed output will be <code class="language-plaintext highlighter-rouge">{'useless_value': 'wesley'}</code> followed by <code class="language-plaintext highlighter-rouge">3</code>.</p> <p>This allows a Python function to accept as many keyword arguments as you supply it. Those which are already named in the function’s declaration are bound directly.
And those which do not appear in the declaration can be accessed from <code class="language-plaintext highlighter-rouge">kwargs</code>.</p> <p>By putting <code class="language-plaintext highlighter-rouge">**kwargs</code> as an argument in the inner function, you’re basically unwrapping the dictionary into the function’s parameters.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Python
def somefunction(one, two, **kwargs):
    print(f"outer function:\n\t{kwargs}")
    three = one + two
    def sometwo(x, four, **kwargs):
        print(f"inner function:\n\t{kwargs}")
        return x + four
    return sometwo(three, **kwargs)

somefunction(one=2, two=3, four=5, name="wesley")
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>outer function:
    {'four': 5, 'name': 'wesley'}
inner function:
    {'name': 'wesley'}
</code></pre></div></div> <p>Let’s now compare this with the original R ellipsis.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># R
#' Simple function which takes two parameters `one` and `two` and the ellipsis `...`
somefunction = function(one, two, ...){
    three = one + two
    sometwo = function(x, four){
        x + four
    }
    sometwo(three, ...)  # four exists within the ellipsis
}

# because of the ellipsis, we can pass as many parameters as we want;
# the extras are stored in the ellipsis
somefunction(one=2, two=3, four=5, name="wesley")
</code></pre></div></div> <h2 id="conclusion">Conclusion</h2> <p>There are many ways to do things in pandas, often more than the one tidyverse way, and I wish this were clearer.</p> <p>Additionally, something which caught me off guard after coming to Honestbee was the amount of SQL I needed.</p> <p>For example, PostgreSQL to query RDS and its dialect for querying Redshift, <a href="https://www.confluent.io/product/ksql/">KSQL</a> for querying data streams via Kafka, and Athena’s query language, built on top of Presto, for querying S3, where most of the data used to exist in parquet files.</p> <p>This shows one big deviation from academia: data in a company is usually stored in a database / data lake / data stream, whereas in academia it’s usually just one big flat data file.</p> <p>We’ve come to the end of this attempt at mapping tidyverse vocabularies
to pandas; hope you’ve found this informative and useful! See you guys soon!</p> <p><a href="https://etheleon.github.io/articles/tidying-up-pandas/">Tidying Up Pandas</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on December 16, 2018.</p> <![CDATA[Has the ship sailed for Microbiome research?]]> https://etheleon.github.io/articles/esearch 2016-12-21T00:00:00-00:00 2017-11-02T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <p>Going about doctoral thesis writing (a love-hate relationship), the thought occurred to me that <em>perhaps</em> the very field I’m writing about has already lived past its Golden Era, <strong>or has it</strong>?</p> <p>A knee-jerk reaction was to see if there’s any Python or R package which would let me search the abstracts for the keyword microbiome… this turned up <a href="https://github.com/titipata/pubmed_parser/">pubmed_parser</a>. However, in order to get it running, I would first have to download a few gigabytes of abstracts in XML from the Open Access subset of pubmed abstracts and run my own pyspark…</p> <p><strong>NOPE! Not going there!</strong></p> <p>Then it occurred to me that perhaps NCBI has something I could use… in the previous post we talked about the <code class="language-plaintext highlighter-rouge">esearch</code> API.
Hmmm, this could be useful.</p> <p>So below is the script which lets me do this:</p> <h4 id="libraries">Libraries</h4> <p>Using the usual tidyverse, with rvest for XML parsing and artyfarty to spruce up the plot.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>suppressPackageStartupMessages({
    library(tidyverse)
    library(magrittr)
    library(rvest)      # for XML
    library(artyfarty)  # because theme_bw is too boring
})
</code></pre></div></div> <h1 id="ncbi-esearch">NCBI ESEARCH</h1> <p>Here’s the NCBI’s esearch <a href="https://www.ncbi.nlm.nih.gov/books/NBK25499/">API</a>. Within it, there are the date-range options <code class="language-plaintext highlighter-rouge">mindate</code> and <code class="language-plaintext highlighter-rouge">maxdate</code>.</p> <h4 id="mindate-maxdate-api-filter">mindate, maxdate API filter</h4> <p>Date range used to limit a search result by the date specified by datetype. These two parameters (mindate, maxdate) must be used together to specify an arbitrary date range.
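</p> <p>To get a concrete feel for how these parameters fit together, here is a small Python sketch of my own that just assembles the esearch URL for one window (no request is actually sent; the endpoint and parameter names are from the NCBI E-utilities docs):</p>

```python
import urllib.parse

# esearch endpoint, as documented by NCBI E-utilities
api = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"

# one 1997-1998 window for the term "microbiome"
params = {"db": "pubmed", "term": "microbiome",
          "mindate": "1997", "maxdate": "1998"}

url = api + urllib.parse.urlencode(params)
```

<p>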
The general date format is YYYY/MM/DD, and these variants are also allowed: YYYY, YYYY/MM.</p> <p>So we will be searching between <em>1997</em> and <em>2017</em>, a 20-year period.</p> <p>So let’s begin…</p> <h4 id="keyword-microbiome">Keyword: Microbiome</h4> <p>together with its synonym, microbiota</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># microbiome
api   = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
query = "db=pubmed&amp;term=%s&amp;mindate=%s&amp;maxdate=%s"
searchTerm = paste0(api, query)

keyword = "microbiome"
df = mapply(function(start, end){
        count = read_xml(sprintf(searchTerm, keyword, start, end)) %&gt;%
            as_list %$%
            Count %&gt;%
            unlist
        tibble(count, start, end)
    },
    start = 1997:2016,
    end   = 1998:2017,
    SIMPLIFY = FALSE
) %&gt;% do.call(rbind, .)
</code></pre></div></div> <h4 id="keyword-cancer">Keyword: Cancer</h4> <p>Used as a comparison.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">keyword</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cancer"</span><span class="w"> </span><span class="n">df2</span><span class="w"> </span><span class="o">=</span><span class="w"> 
</span><span class="n">mapply</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">){</span><span class="w"> </span><span class="n">count</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">read_xml</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="n">searchTerm</span><span class="p">,</span><span class="w"> </span><span class="n">keyword</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as_list</span><span class="w"> </span><span class="o">%$%</span><span class="w"> </span><span class="n">Count</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">unlist</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">count</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">)</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1997</span><span class="o">:</span><span class="m">2016</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1998</span><span class="o">:</span><span class="m">2017</span><span class="p">,</span><span class="w"> </span><span class="n">SIMPLIFY</span><span class="o">=</span><span class="kc">FALSE</span><span class="w"> </span><span 
class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="n">rbind</span><span class="p">,</span><span class="n">.</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>Putting the two together before we start plotting</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"microbiome"</span><span class="p">,</span><span class="w"> </span><span class="s2">"start"</span><span class="p">,</span><span class="w"> </span><span class="s2">"end"</span><span class="p">))</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">cancer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.integer</span><span class="p">(</span><span class="n">df2</span><span class="o">$</span><span class="n">count</span><span class="p">))</span><span class="w"> </span><span class="n">df</span><span class="o">$</span><span class="n">microbiome</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">as.integer</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">,</span><span class="w"> </span><span class="n">microbiome</span><span class="p">,</span><span class="w"> </span><span class="n">cancer</span><span 
class="p">)</span><span class="w"> </span><span class="n">df</span><span class="w"> </span></code></pre></div></div> <p>As you can see, the two series differ by orders of magnitude, so you’ll probably have to do some scaling.</p> <table> <thead> <tr> <th>start</th> <th>end</th> <th>microbiome</th> <th>cancer</th> </tr> </thead> <tbody> <tr> <td>1997</td> <td>1998</td> <td>91</td> <td>116522</td> </tr> <tr> <td>1998</td> <td>1999</td> <td>110</td> <td>124613</td> </tr> <tr> <td>1999</td> <td>2000</td> <td>133</td> <td>131481</td> </tr> <tr> <td>2000</td> <td>2001</td> <td>149</td> <td>139577</td> </tr> <tr> <td>2001</td> <td>2002</td> <td>196</td> <td>153651</td> </tr> <tr> <td>2002</td> <td>2003</td> <td>249</td> <td>166393</td> </tr> <tr> <td>2003</td> <td>2004</td> <td>304</td> <td>170676</td> </tr> <tr> <td>2004</td> <td>2005</td> <td>419</td> <td>181504</td> </tr> <tr> <td>2005</td> <td>2006</td> <td>576</td> <td>190710</td> </tr> <tr> <td>2006</td> <td>2007</td> <td>744</td> <td>198618</td> </tr> <tr> <td>2007</td> <td>2008</td> <td>955</td> <td>210488</td> </tr> <tr> <td>2008</td> <td>2009</td> <td>1285</td> <td>219686</td> </tr> <tr> <td>2009</td> <td>2010</td> <td>1741</td> <td>231079</td> </tr> <tr> <td>2010</td> <td>2011</td> <td>2610</td> <td>248046</td> </tr> <tr> <td>2011</td> <td>2012</td> <td>3899</td> <td>265171</td> </tr> <tr> <td>2012</td> <td>2013</td> <td>5607</td> <td>281240</td> </tr> <tr> <td>2013</td> <td>2014</td> <td>8211</td> <td>308483</td> </tr> <tr> <td>2014</td> <td>2015</td> <td>10951</td> <td>331775</td> </tr> <tr> <td>2015</td> <td>2016</td> <td>13439</td> <td>331631</td> </tr> <tr> <td>2016</td> <td>2017</td> <td>14058</td> <td>285408</td> </tr> </tbody> </table> <p>Since version 2.2.0 of ggplot2, Hadley has included the <code class="language-plaintext highlighter-rouge">sec_axis</code> function in the library, which lets you add a secondary axis as long as it’s related to the primary one by a straightforward transformation.</p> <div
class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ggplot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">end</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">microbiome</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="s2">"Microbiome"</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="o">=</span><span class="m">1.1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">cancer</span><span class="o">/</span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="s2">"Cancer"</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="o">=</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">linetype</span><span class="o">=</span><span class="s2">"dotted"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># manipulated the cancer values by dividing by 20</span><span class="w"> </span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="m">1998</span><span class="o">:</span><span 
class="m">2017</span><span class="p">)</span><span class="o">+</span><span class="w"> </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">sec.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sec_axis</span><span class="p">(</span><span class="o">~</span><span class="n">.</span><span class="o">*</span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Number of Publications [Cancer]"</span><span class="p">))</span><span class="o">+</span><span class="w"> </span><span class="c1"># restores the division</span><span class="w"> </span><span class="c1"># lets we set the axis title</span><span class="w"> </span><span class="n">scale_color_manual</span><span class="p">(</span><span class="s2">"Search Terms"</span><span class="p">,</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pal</span><span class="p">(</span><span class="s2">"five38"</span><span class="p">))</span><span class="o">+</span><span class="w"> </span><span class="n">theme_scientific</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="o">=</span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="o">=</span><span class="m">90</span><span class="p">),</span><span class="w"> </span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.9</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> 
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Year"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Number of Publications [Microbiome]"</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p><img src="/images/esearchPublications_9_1.png" alt="publications-with-keyword-microbiome" /></p> <p>There you have it: on the left y-axis, the publication count for the keyword “microbiome” and its synonyms like “microbiota”, and on the right y-axis, the count of abstracts with the keyword “cancer”. As you can see, publications revolving around (or at least associated with) the microbiome have been growing at a breakneck, almost exponential pace, faster than cancer.</p> <p>For the astute among you, you’ll notice a dip in 2017 for cancer and a slowing trend for microbiome; that’s just because we haven’t reached the end of 2017 yet (close 😉), so there are definitely more papers on their way.</p> <p>Hope this will be helpful for future students! Cheers</p> <p><a href="https://etheleon.github.io/articles/esearch/">Has the ship sailed for Microbiome research?</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on November 02, 2017.</p> <![CDATA[Why has downloading fastQ files become so complicated?]]> https://etheleon.github.io/articles/ncbi-sra 2017-08-23T00:00:00-00:00 2017-08-22T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <h2 id="downloading">Downloading</h2> <p>Recently, I had to retrieve sequencing data in <a href="https://en.wikipedia.org/wiki/FASTQ_format">fastQ</a> format belonging to a paper from <a href="https://www.nature.com/articles/srep25719">Law <em>et al</em></a>.
It was for one of two remaining mini-projects standing between me and my PhD.</p> <p>Mainly, they’re for applying my <a href="https://etheleon.github.io/articles/geneCentricApproach/">gene-centric approach (watch out for the next part, it’ll be released soon!)</a> to a time-series dataset of total RNA and to an enriched reactor core.</p> <p>So it begins with the following line in the publication:</p> <blockquote> <p>All raw metagenome, metatranscriptome and amplicon sequencing data used in this study are publicly available from NCBI under BioProject ID: PRJNA320780 (http://www.ncbi.nlm.nih.gov/bioproject/320780).</p> </blockquote> <ul> <li>metagenome, i.e. DNA</li> <li>metatranscriptome, i.e. total RNA</li> <li>amplicon, i.e. 16S only</li> </ul> <p>Sounds easy, ain’t it? Go to the link, click download, and you’ll get everything you need. Well, it wasn’t. =(</p> <p>Previously, my experience with downloading from NCBI had mostly been through their web portal in a browser, not programmatically.</p> <h2 id="day-1-getting-the-files">Day 1: Getting the Files</h2> <p>OK, calm down: all I need now is a link to wget or curl the files. No problem, I’ve heard of the SRA format (SRA stands for <strong>Sequence Read Archives</strong>), nothing is gonna stop me.</p> <p>On the BioProject’s <a href="https://www.ncbi.nlm.nih.gov/bioproject/320780">page</a> I saw I had about 40 SRA files to fetch…</p> <p>Hmmmm. Do I click and download them by hand?
“Of course not; I know how to write scripts, so why should I do this by hand?”, I thought.</p> <p><img src="http://i0.kym-cdn.com/entries/icons/facebook/000/006/725/desk_flip.jpg" alt="flip-table" /></p> <p>After some digging around for ways to get the download links I found this: <a href="https://www.ncbi.nlm.nih.gov/books/NBK179288/">Entrez Direct: E-utilities on the UNIX Command Line</a>.</p> <p>To install the tool you’ll need to install some Perl modules first (forgive the Perl; everyone knows Perl is never going away in bioinformatics).</p> <p>You’ll probably need to CPAN some modules; I recommend installing cpanminus (aka cpanm), Perl’s unofficial package manager.</p> <p>Yes, so that’s a bunch of Perl modules to install, starting with <code class="language-plaintext highlighter-rouge">Net::FTP</code>:</p> <div class="language-perl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">cd</span> <span class="o">~</span> <span class="nv">perl</span> <span class="o">-</span><span class="nn">MNet::</span><span class="nv">FTP</span> <span class="o">-</span><span class="nv">e</span> <span class="o">\</span> <span class="p">'</span><span class="s1">$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive =&gt; 1); $ftp-&gt;login; $ftp-&gt;binary; $ftp-&gt;get("/entrez/entrezdirect/edirect.tar.gz");</span><span class="p">'</span> <span class="nv">gunzip</span> <span class="o">-</span><span class="nv">c</span> <span class="nv">edirect</span><span class="o">.</span><span class="nv">tar</span><span class="o">.</span><span class="nv">gz</span> <span class="o">|</span> <span class="nv">tar</span> <span class="nv">xf</span> <span class="o">-</span> <span class="nv">rm</span> <span class="nv">edirect</span><span class="o">.</span><span class="nv">tar</span><span class="o">.</span><span class="nv">gz</span> <span class="nv">export</span> <span class="nv">PATH</span><span class="o">=</span><span class="nv">$PATH:$HOME</span><span class="o">/</span><span
class="nv">edirect</span> <span class="o">.</span><span class="sr">/edirect/s</span><span class="nv">etup</span><span class="o">.</span><span class="nv">sh</span> </code></pre></div></div> <p>After installing this, you could finally start downloading the SRAs… (you wished).</p> <p>Digging through the website, it was easy to find the button to download the SRAs, but getting the links to all 40 SRAs programmatically, not so easy! And yeap, I was pretty much right, as I found out after looking for a way to get the <code class="language-plaintext highlighter-rouge">runInfo.csv</code>.</p> <h2 id="day2--the-saga-continues-do-dont-need-to-download-the-files">Day 2: The saga continues: you don’t need to download the files</h2> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>EDIRECT=/path/2/eDirect
cd $EDIRECT
esearch -db sra -query PRJNA320780 | ./tools/edirect/efetch --format runinfo
</code></pre></div></div> <p>which looks like this:</p> <table> <thead> <tr> <th style="text-align: left">Run</th> <th style="text-align: left">ReleaseDate</th> <th style="text-align: left">LoadDate</th> <th style="text-align: right">spots</th> <th style="text-align: right">bases</th> <th style="text-align: right">spots_with_mates</th> <th style="text-align: right">avgLength</th> <th style="text-align: right">size_MB</th> <th style="text-align: left">AssemblyName</th> <th style="text-align: left">download_path</th> <th style="text-align: left">Experiment</th> <th style="text-align: left">LibraryName</th> <th style="text-align: left">LibraryStrategy</th> <th style="text-align: left">LibrarySelection</th> <th style="text-align: left">LibrarySource</th> <th style="text-align: left">LibraryLayout</th> <th style="text-align: right">InsertSize</th> <th style="text-align: right">InsertDev</th> <th style="text-align: left">Platform</th> <th style="text-align: left">Model</th> <th style="text-align: left">SRAStudy</th> <th style="text-align: left">BioProject</th>
<th style="text-align: right">Study_Pubmed_id</th> <th style="text-align: right">ProjectID</th> <th style="text-align: left">Sample</th> <th style="text-align: left">BioSample</th> <th style="text-align: left">SampleType</th> <th style="text-align: right">TaxID</th> <th style="text-align: left">ScientificName</th> <th style="text-align: left">SampleName</th> <th style="text-align: left">g1k_pop_code</th> <th style="text-align: left">source</th> <th style="text-align: left">g1k_analysis_group</th> <th style="text-align: left">Subject_ID</th> <th style="text-align: left">Sex</th> <th style="text-align: left">Disease</th> <th style="text-align: left">Tumor</th> <th style="text-align: left">Affection_Status</th> <th style="text-align: left">Analyte_Type</th> <th style="text-align: left">Histological_Type</th> <th style="text-align: left">Body_Site</th> <th style="text-align: left">CenterName</th> <th style="text-align: left">Submission</th> <th style="text-align: left">dbgap_study_accession</th> <th style="text-align: left">Consent</th> <th style="text-align: left">RunHash</th> <th style="text-align: left">ReadHash</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">SRR3501849</td> <td style="text-align: left">2016-05-18 11:35:07</td> <td style="text-align: left">2016-05-13 11:31:37</td> <td style="text-align: right">25818676</td> <td style="text-align: right">7797240152</td> <td style="text-align: right">25818676</td> <td style="text-align: right">302</td> <td style="text-align: right">4224</td> <td style="text-align: left">NA</td> <td style="text-align: left">https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501849</td> <td style="text-align: left">SRX1759558</td> <td style="text-align: left">844</td> <td style="text-align: left">WGS</td> <td style="text-align: left">RANDOM</td> <td style="text-align: left">METAGENOMIC</td> <td style="text-align: left">PAIRED</td> <td style="text-align: right">0</td> <td style="text-align: right">0</td> <td 
style="text-align: left">ILLUMINA</td> <td style="text-align: left">Illumina HiSeq 2500</td> <td style="text-align: left">SRP075031</td> <td style="text-align: left">PRJNA320780</td> <td style="text-align: right">2</td> <td style="text-align: right">320780</td> <td style="text-align: left">SRS1435427</td> <td style="text-align: left">SAMN04957382</td> <td style="text-align: left">simple</td> <td style="text-align: right">942017</td> <td style="text-align: left">activated sludge metagenome</td> <td style="text-align: left">UPWRP_SW_d1_r1</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">no</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">SRA425235</td> <td style="text-align: left">NA</td> <td style="text-align: left">public</td> <td style="text-align: left">8C81A9CE61F9010A73220794D655E084</td> <td style="text-align: left">0AE4D27EB24ECF49E094557AD7255216</td> </tr> <tr> <td style="text-align: left">SRR3501850</td> <td style="text-align: left">2016-05-18 11:50:28</td> <td style="text-align: left">2016-05-13 11:46:22</td> <td style="text-align: right">31189839</td> <td style="text-align: right">9419331378</td> <td style="text-align: right">31189839</td> <td style="text-align: right">302</td> <td style="text-align: right">5112</td> <td style="text-align: left">NA</td> <td style="text-align: left">https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501850</td> <td style="text-align: left">SRX1759559</td> <td style="text-align: left">845</td> <td style="text-align: left">WGS</td> <td style="text-align: left">RANDOM</td> <td style="text-align: left">METAGENOMIC</td> <td style="text-align: 
left">PAIRED</td> <td style="text-align: right">0</td> <td style="text-align: right">0</td> <td style="text-align: left">ILLUMINA</td> <td style="text-align: left">Illumina HiSeq 2500</td> <td style="text-align: left">SRP075031</td> <td style="text-align: left">PRJNA320780</td> <td style="text-align: right">2</td> <td style="text-align: right">320780</td> <td style="text-align: left">SRS1435428</td> <td style="text-align: left">SAMN04957383</td> <td style="text-align: left">simple</td> <td style="text-align: right">942017</td> <td style="text-align: left">activated sludge metagenome</td> <td style="text-align: left">UPWRP_SW_d2_r1</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">no</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">SRA425235</td> <td style="text-align: left">NA</td> <td style="text-align: left">public</td> <td style="text-align: left">E649F6CDCC80915B98BE85CD437B7EFE</td> <td style="text-align: left">B58C5296FB135FCF2E9BFD8544C33B29</td> </tr> <tr> <td style="text-align: left">SRR3501851</td> <td style="text-align: left">2016-05-18 11:47:02</td> <td style="text-align: left">2016-05-13 11:42:17</td> <td style="text-align: right">31966019</td> <td style="text-align: right">9653737738</td> <td style="text-align: right">31966019</td> <td style="text-align: right">302</td> <td style="text-align: right">5244</td> <td style="text-align: left">NA</td> <td style="text-align: left">https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501851</td> <td style="text-align: left">SRX1759560</td> <td style="text-align: left">945</td> <td style="text-align: left">WGS</td> <td style="text-align: 
left">RANDOM</td> <td style="text-align: left">METAGENOMIC</td> <td style="text-align: left">PAIRED</td> <td style="text-align: right">0</td> <td style="text-align: right">0</td> <td style="text-align: left">ILLUMINA</td> <td style="text-align: left">Illumina HiSeq 2500</td> <td style="text-align: left">SRP075031</td> <td style="text-align: left">PRJNA320780</td> <td style="text-align: right">2</td> <td style="text-align: right">320780</td> <td style="text-align: left">SRS1435429</td> <td style="text-align: left">SAMN04957392</td> <td style="text-align: left">simple</td> <td style="text-align: right">942017</td> <td style="text-align: left">activated sludge metagenome</td> <td style="text-align: left">UPWRP_SW_d1_r2</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">no</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">SRA425235</td> <td style="text-align: left">NA</td> <td style="text-align: left">public</td> <td style="text-align: left">81EC07EC8BC6509DBCB00BC4FA7401A9</td> <td style="text-align: left">9AD8B926EF9D20E3A2FD10582C72B592</td> </tr> <tr> <td style="text-align: left">SRR3501852</td> <td style="text-align: left">2016-05-18 12:02:10</td> <td style="text-align: left">2016-05-13 11:57:54</td> <td style="text-align: right">29331148</td> <td style="text-align: right">8858006696</td> <td style="text-align: right">29331148</td> <td style="text-align: right">302</td> <td style="text-align: right">4854</td> <td style="text-align: left">NA</td> <td style="text-align: left">https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501852</td> <td style="text-align: left">SRX1759561</td> <td 
style="text-align: left">946</td> <td style="text-align: left">WGS</td> <td style="text-align: left">RANDOM</td> <td style="text-align: left">METAGENOMIC</td> <td style="text-align: left">PAIRED</td> <td style="text-align: right">0</td> <td style="text-align: right">0</td> <td style="text-align: left">ILLUMINA</td> <td style="text-align: left">Illumina HiSeq 2500</td> <td style="text-align: left">SRP075031</td> <td style="text-align: left">PRJNA320780</td> <td style="text-align: right">2</td> <td style="text-align: right">320780</td> <td style="text-align: left">SRS1435430</td> <td style="text-align: left">SAMN04957393</td> <td style="text-align: left">simple</td> <td style="text-align: right">942017</td> <td style="text-align: left">activated sludge metagenome</td> <td style="text-align: left">UPWRP_SW_d2_r2</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">no</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">SRA425235</td> <td style="text-align: left">NA</td> <td style="text-align: left">public</td> <td style="text-align: left">63B30D9EC717121777A138CECA1F1ACA</td> <td style="text-align: left">35A116CCE17CBA7F425465AA9D7DBB6B</td> </tr> <tr> <td style="text-align: left">SRR3501853</td> <td style="text-align: left">2016-05-18 11:50:18</td> <td style="text-align: left">2016-05-13 11:46:11</td> <td style="text-align: right">34045865</td> <td style="text-align: right">10281851230</td> <td style="text-align: right">34045865</td> <td style="text-align: right">302</td> <td style="text-align: right">5630</td> <td style="text-align: left">NA</td> <td style="text-align: 
left">https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501853</td> <td style="text-align: left">SRX1759562</td> <td style="text-align: left">947</td> <td style="text-align: left">WGS</td> <td style="text-align: left">RANDOM</td> <td style="text-align: left">METAGENOMIC</td> <td style="text-align: left">PAIRED</td> <td style="text-align: right">0</td> <td style="text-align: right">0</td> <td style="text-align: left">ILLUMINA</td> <td style="text-align: left">Illumina HiSeq 2500</td> <td style="text-align: left">SRP075031</td> <td style="text-align: left">PRJNA320780</td> <td style="text-align: right">2</td> <td style="text-align: right">320780</td> <td style="text-align: left">SRS1435431</td> <td style="text-align: left">SAMN04957394</td> <td style="text-align: left">simple</td> <td style="text-align: right">942017</td> <td style="text-align: left">activated sludge metagenome</td> <td style="text-align: left">UPWRP_SW_d3_r2</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">no</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">SRA425235</td> <td style="text-align: left">NA</td> <td style="text-align: left">public</td> <td style="text-align: left">3AEB6D6C4FE383F80D1E16E588C2D374</td> <td style="text-align: left">876D1E61221339EF202EAAEC93AD0C5C</td> </tr> <tr> <td style="text-align: left">SRR3501854</td> <td style="text-align: left">2016-05-18 11:46:21</td> <td style="text-align: left">2016-05-13 11:41:11</td> <td style="text-align: right">29717524</td> <td style="text-align: right">8974692248</td> <td style="text-align: right">29717524</td> <td style="text-align: right">302</td> <td 
style="text-align: right">4935</td> <td style="text-align: left">NA</td> <td style="text-align: left">https://sra-download.ncbi.nlm.nih.gov/srapub/SRR3501854</td> <td style="text-align: left">SRX1759563</td> <td style="text-align: left">948</td> <td style="text-align: left">WGS</td> <td style="text-align: left">RANDOM</td> <td style="text-align: left">METAGENOMIC</td> <td style="text-align: left">PAIRED</td> <td style="text-align: right">0</td> <td style="text-align: right">0</td> <td style="text-align: left">ILLUMINA</td> <td style="text-align: left">Illumina HiSeq 2500</td> <td style="text-align: left">SRP075031</td> <td style="text-align: left">PRJNA320780</td> <td style="text-align: right">2</td> <td style="text-align: right">320780</td> <td style="text-align: left">SRS1435432</td> <td style="text-align: left">SAMN04957395</td> <td style="text-align: left">simple</td> <td style="text-align: right">942017</td> <td style="text-align: left">activated sludge metagenome</td> <td style="text-align: left">UPWRP_SW_d4_r2</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">no</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">NA</td> <td style="text-align: left">SRA425235</td> <td style="text-align: left">NA</td> <td style="text-align: left">public</td> <td style="text-align: left">D467387C3A275485CC8EA2025E6044ED</td> <td style="text-align: left">9EB031A8BDAD3C2135E92CF3DBB29169</td> </tr> </tbody> </table> <p>Great the links to the SRAs are in the column download_path</p> <p>So by the way I found this awesome download script which combined pycurl + tqdm (friend recommended me this, if you were wondering what tqdm stands for, it means 
“progress” in Arabic: taqadum)</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span> <span class="kn">import</span> <span class="nn">pycurl</span> <span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span> <span class="n">downloader</span> <span class="o">=</span> <span class="n">pycurl</span><span class="p">.</span><span class="n">Curl</span><span class="p">()</span> <span class="k">def</span> <span class="nf">sanitize</span><span class="p">(</span><span class="n">c</span><span class="p">):</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">UNRESTRICTED_AUTH</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">HTTPAUTH</span><span class="p">,</span> <span class="n">pycurl</span><span class="p">.</span><span class="n">HTTPAUTH_ANYSAFE</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">ACCEPT_ENCODING</span><span class="p">,</span> <span class="sa">b</span><span class="s">''</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">TRANSFER_ENCODING</span><span class="p">,</span> <span class="bp">True</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span 
class="n">SSL_VERIFYPEER</span><span class="p">,</span> <span class="bp">True</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">SSL_VERIFYHOST</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">SSLVERSION</span><span class="p">,</span> <span class="n">pycurl</span><span class="p">.</span><span class="n">SSLVERSION_TLSv1</span><span class="p">)</span> <span class="c1">#c.setopt(pycurl.FOLLOWLOCATION, False) </span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">FOLLOWLOCATION</span><span class="p">,</span> <span class="bp">True</span><span class="p">)</span> <span class="k">def</span> <span class="nf">do_download</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">local</span><span class="p">,</span> <span class="o">*</span><span class="p">,</span> <span class="n">safe</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span> <span class="n">rv</span> <span class="o">=</span> <span class="bp">False</span> <span class="k">with</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">desc</span><span class="o">=</span><span class="n">url</span><span class="p">,</span> <span class="n">total</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">unit</span><span class="o">=</span><span class="s">'b'</span><span class="p">,</span> <span class="n">unit_scale</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">as</span> <span 
class="n">progress</span><span class="p">:</span> <span class="n">xfer</span> <span class="o">=</span> <span class="n">XferInfoDl</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">progress</span><span class="p">)</span> <span class="k">if</span> <span class="n">safe</span><span class="p">:</span> <span class="n">local_tmp</span> <span class="o">=</span> <span class="n">local</span> <span class="o">+</span> <span class="s">'.tmp'</span> <span class="k">else</span><span class="p">:</span> <span class="n">local_tmp</span> <span class="o">=</span> <span class="n">local</span> <span class="n">c</span> <span class="o">=</span> <span class="n">downloader</span> <span class="n">c</span><span class="p">.</span><span class="n">reset</span><span class="p">()</span> <span class="n">sanitize</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">NOPROGRESS</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">XFERINFOFUNCTION</span><span class="p">,</span> <span class="n">xfer</span><span class="p">)</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">URL</span><span class="p">,</span> <span class="n">url</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'utf-8'</span><span class="p">))</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">local_tmp</span><span class="p">,</span> <span class="s">'wb'</span><span 
class="p">)</span> <span class="k">as</span> <span class="n">out</span><span class="p">:</span> <span class="n">c</span><span class="p">.</span><span class="n">setopt</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">WRITEDATA</span><span class="p">,</span> <span class="n">out</span><span class="p">)</span> <span class="k">try</span><span class="p">:</span> <span class="n">c</span><span class="p">.</span><span class="n">perform</span><span class="p">()</span> <span class="k">except</span> <span class="n">pycurl</span><span class="p">.</span><span class="n">error</span><span class="p">:</span> <span class="n">os</span><span class="p">.</span><span class="n">unlink</span><span class="p">(</span><span class="n">local_tmp</span><span class="p">)</span> <span class="k">return</span> <span class="bp">False</span> <span class="k">if</span> <span class="n">c</span><span class="p">.</span><span class="n">getinfo</span><span class="p">(</span><span class="n">pycurl</span><span class="p">.</span><span class="n">RESPONSE_CODE</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">400</span><span class="p">:</span> <span class="n">os</span><span class="p">.</span><span class="n">unlink</span><span class="p">(</span><span class="n">local_tmp</span><span class="p">)</span> <span class="k">else</span><span class="p">:</span> <span class="k">if</span> <span class="n">safe</span><span class="p">:</span> <span class="n">os</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">local_tmp</span><span class="p">,</span> <span class="n">local</span><span class="p">)</span> <span class="n">rv</span> <span class="o">=</span> <span class="bp">True</span> <span class="n">progress</span><span class="p">.</span><span class="n">total</span> <span class="o">=</span> <span class="n">progress</span><span class="p">.</span><span class="n">n</span> <span class="o">=</span> <span 
class="n">progress</span><span class="p">.</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span> <span class="n">progress</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="k">return</span> <span class="n">rv</span> <span class="k">class</span> <span class="nc">XferInfoDl</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">url</span><span class="p">,</span> <span class="n">progress</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">_tqdm</span> <span class="o">=</span> <span class="n">progress</span> <span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">dltotal</span><span class="p">,</span> <span class="n">dlnow</span><span class="p">,</span> <span class="n">ultotal</span><span class="p">,</span> <span class="n">ulnow</span><span class="p">):</span> <span class="n">n</span> <span class="o">=</span> <span class="n">dlnow</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">_tqdm</span><span class="p">.</span><span class="n">n</span> <span class="bp">self</span><span class="p">.</span><span class="n">_tqdm</span><span class="p">.</span><span class="n">total</span> <span class="o">=</span> <span class="n">dltotal</span> <span class="ow">or</span> <span class="n">guess_size</span><span class="p">(</span><span class="n">dlnow</span><span class="p">)</span> <span class="k">if</span> <span class="n">n</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">_tqdm</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">n</span><span 
class="p">)</span> <span class="k">def</span> <span class="nf">guess_size</span><span class="p">(</span><span class="n">now</span><span class="p">):</span> <span class="s">''' Return a number that is strictly greater than `now`. '''</span> <span class="k">return</span> <span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">now</span><span class="p">.</span><span class="n">bit_length</span><span class="p">()</span> </code></pre></div></div> <p>OK, so I’ve downloaded the SRA files; now I just need to extract the FASTQ from the SRA, which brings us to the <a href="https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/">SRAtoolkit</a>.</p> <p>It’s basically a collection of command-line tools for dealing with SRA files; the one we’re really interested in is <code class="language-plaintext highlighter-rouge">fastq-dump</code>.</p> <p>It’s not exactly clear from NCBI’s readme, but here’s what it does: fastq-dump tries to automatically download the SRAs again even though you’ve got the local file ready. Running <code class="language-plaintext highlighter-rouge">fastq-dump -v</code> shows you it’s trying to download from NCBI.</p> <p>The rationale for this, I assume, is to prevent corrupted files, since there’s another tool in the toolkit, <code class="language-plaintext highlighter-rouge">vdb-validate ./&lt;filename&gt;.sra</code>, which checks the file’s integrity.</p> <p>You could read the whole issues thread, but I think this user’s <a href="https://github.com/ncbi/sra-tools/issues/42#issuecomment-254853204">frustration</a> just sums it up for me as well.</p> <blockquote> <p>@klymenko That is unacceptable. I do not need alignments. just the raw fastq files. This has nothing to do with RefSeq files. Further, neither fastq-dump -h nor online man pages say anything about accompanying refseq files. It simply says you can act on local SRA files. 
Further, all of the above validation tools approve of the downloaded SRA file</p> </blockquote> <p>The owner of the repo goes on to <a href="https://github.com/ncbi/sra-tools/issues/42#issuecomment-254860715">threaten</a> the poor fella who, just like me, is only trying to download a file</p> <blockquote> <p>If you want help, please ask. If you want to flame, then I’ll close the issue.</p> </blockquote> <p><strong>LONG SIGH</strong></p> <p>So the prescribed way of doing this, if you haven’t downloaded the SRA yet, is actually to run the following.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prefetch &lt;SRA ID&gt; fastq-dump &lt;SRA ID&gt; </code></pre></div></div> <p>Yes, you won’t even have to go through downloading 1. the Entrez tools to get 2. the runInfo.csv with the links to get 3. the SRA files.</p> <p>And if you’ve already downloaded a local SRA file like me, you will have to run <code class="language-plaintext highlighter-rouge">prefetch</code> to check the local file; my guess is that it stores the file’s location for <code class="language-plaintext highlighter-rouge">fastq-dump</code> to recognise.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prefetch &lt;localFile&gt; fastq-dump &lt;localFile&gt; </code></pre></div></div> <p>The story deepens: it turns out the extraction of the FASTQ from the SRA is excruciatingly slow, and <a href="https://github.com/ncbi/sra-tools/issues/24#issuecomment-171296735">it’s not just me</a></p> <blockquote> <p>It’s been running for about 3 hours and so far extracted ~15GB of what I expect to be around 60GB. 
An improvement, but still not exactly fast…</p> </blockquote> <p><img src="https://i.imgflip.com/11rujc.jpg" alt="patiently" /></p> <p>I looked around for other ways to speed this up and came across the <a href="https://www.gnu.org/software/parallel/">GNU parallel tool</a>.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>parallel fastq-dump <span class="nt">--split-files</span> <span class="nt">-F</span> <span class="nt">--gzip</span> <span class="o">{}</span> ::: <span class="k">*</span>.sra </code></pre></div></div> <p>But it doesn’t really solve anything, since each file still has to be extracted by a single thread.</p> <p>Thankfully, I later stumbled across <a href="https://github.com/rvalieris/parallel-fastq-dump">parallel-fastq-dump</a>, which makes use of the <code class="language-plaintext highlighter-rouge">-N</code> and <code class="language-plaintext highlighter-rouge">-X</code> flags in the original <code class="language-plaintext highlighter-rouge">fastq-dump</code> to split the extraction over different spot ranges so it can be parallelised.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>parallel-fastq-dump <span class="nt">--sra-id</span> SRR3501865 <span class="nt">-F</span> <span class="nt">--threads</span> 20 <span class="nt">--outdir</span> ../unzipped <span class="nt">--split-files</span> <span class="nt">--gzip</span> <span class="nt">--tmpdir</span> /scratch/uesu/ </code></pre></div></div> <p>The results are stunning:</p> <p><img src="https://cloud.githubusercontent.com/assets/6310472/23962085/bdefef44-098b-11e7-825f-1da53d6568d6.png" alt="results" /></p> <h3 id="conclusion">Conclusion</h3> <p>That’s all folks. The moral of the story: avoid downloading through NCBI if you can, and grab the files straight from the source where possible. 
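</p>

<p>To make the range-splitting idea concrete, here is a minimal sketch of how a run’s spots can be partitioned into chunks handed to separate <code class="language-plaintext highlighter-rouge">fastq-dump</code> workers. This is my own illustration, not parallel-fastq-dump’s actual code; the <code class="language-plaintext highlighter-rouge">spot_ranges</code> helper and the total spot count are assumptions made for the example.</p>

```python
import shlex

def spot_ranges(total_spots, n_chunks):
    # Partition spot IDs 1..total_spots into at most n_chunks
    # contiguous, disjoint, inclusive ranges.
    base, extra = divmod(total_spots, n_chunks)
    ranges, start = [], 1
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        ranges.append((start, start + size - 1))
        start += size
    return ranges

def fastq_dump_cmds(sra_id, total_spots, n_chunks):
    # One fastq-dump invocation per chunk; -N/-X bound the spot range,
    # so the commands can run concurrently.
    return [
        f"fastq-dump --split-files --gzip -N {lo} -X {hi} {shlex.quote(sra_id)}"
        for lo, hi in spot_ranges(total_spots, n_chunks)
    ]

# e.g. a run with 1,000,000 spots split across 4 workers
for cmd in fastq_dump_cmds("SRR3501865", 1_000_000, 4):
    print(cmd)
```

<p>Each command extracts a disjoint spot range, so the chunks can run in parallel and the per-chunk outputs be concatenated afterwards, which is essentially what parallel-fastq-dump automates.</p>

<p>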
Have a good one!</p> <p><a href="https://etheleon.github.io/articles/ncbi-sra/">Why has downloading fastQ files become so complicated?</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on August 22, 2017.</p> <![CDATA[From raw sequencing reads to Gene Centric Analyses PART: 1]]> https://etheleon.github.io/articles/geneCentricApproach 2016-12-21T00:00:00-00:00 2017-07-18T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <p>A recent <a href="https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0233-2">paper</a> that came out in Microbiome, from Daniel Huson’s group, uses a new gene-centric function found within MEGAN 6 CE.</p> <p>You could use a sample fastQ to generate a MEGAN summary file and do this.</p> <p><img src="" alt="geneCentricAssembly" /></p> <h1 id="simulation">Simulation</h1> <p><img src="http://i.dailymail.co.uk/i/pix/2012/10/11/article-0-006542AF00000258-91_634x345.jpg" alt="theMatrix" /></p> <p>Here at Singapore Centre for Environmental Life Sciences Engineering (SCELSE) NUS, <a href="http://www.scelse.sg/People/Detail/fa315ed9-015a-4414-b49e-5e0145e6ce42">Peter</a> and <a href="http://www.scelse.sg/People/Detail/f000cd6a-daf9-442b-b328-a7e3a2b6c64f">I</a> work on a variety of bioinformatics analyses concerning the microbiome of Ulu Pandan’s microbial community. 
This ultimately led to pipelines and tools based on the sequencing data we retrieve from wastewater samples.</p> <p>One of the topics I work on concerns the development of a gene-centric assembly analysis for poorly annotated microbiomes.</p> <p>Briefly, our method is split into the following steps:</p> <ul> <li>Functional binning using MEGAN’s Lowest Common Ancestor (LCA) algorithm,</li> <li>NEWBLER’s implementation of the Overlap Layout Consensus (OLC) assembly and</li> <li>Conserved region analysis using a defined Maximum Diversity Region (<a href="https://github.com/etheleon/pAss">pAss</a>).</li> </ul> <p>Unlike Huson <em>et al.</em>, we explore the alignment of contigs against their respective reference sequences before deciding upon a consensus region, based on a multiple sequence alignment of reference sequences, which captures the largest number of contigs, thus facilitating a diversity analysis. To understand the dynamics of such a workflow, we decided to first run this on an <em>in silico</em> simulation of 329 bacterial and archaeal species, modelled after the abundance curves obtained from an initial whole genome short read analysis.</p> <p>In this post, I won’t be diving too deeply into details, but will outline how one would use the pipeline in general, starting from raw fastQ files.</p> <h2 id="1-homology-search-of-the-short-reads">1. Homology search of the short reads</h2> <p>Many databases could be used, but NCBI’s NR protein database is a good place to begin. A useful tool for comparing short reads with a protein database is <a href="https://github.com/bbuchfink/diamond">DIAMOND</a>.</p> <h2 id="2-binning-short-reads-into-functional-groups">2. Binning short reads into functional groups</h2> <p>Here we use MEGAN’s blast2lca tool. Once you’ve gotten the reads sorted into the proper directories, we can begin assembling them.</p> <h2 id="3-run-newbler-olc-assembler-on-each-of-the-bins">3. 
Run NEWBLER OLC Assembler on each of the bins</h2> <p>However, because you’ll be running the assembler on possibly 9000 different KOs or more, I’ve written a Python <a href="https://github.com/etheleon/newbler">class</a> to run NEWBLER.</p> <h2 id="4-run-pass-and-identify-the-max-diversity-regions">4. Run pAss and identify the Max Diversity Regions</h2> <p>This is truly where our work begins.</p> <h3 id="mdr">MDR</h3> <p>The core algorithm works as follows:</p> <h4 id="implicit-msa-of-contigs">Implicit MSA of contigs</h4> <ol> <li>First, we generate an MSA of protein reference sequences.</li> <li>Thereafter, using MEGAN, we gathered contig-reference sequence (protein) alignments before assigning the single best-aligned reference to each contig.</li> <li>Finally, we lined the contigs up according to their cognate reference sequence’s position in the MSA.</li> </ol> <h4 id="window-of-diversity">Window of diversity</h4> <ol> <li>We ran a 200 bp sliding window across the implicit contig alignment to find the region capturing the largest number of contigs, also known as the maximum diversity region (MDR).</li> </ol> <h3 id="simulation-1">Simulation</h3> <p>With the simulation, we looked specifically at Single Copy Genes (SCGs) to see if the method “worked”:</p> <ol> <li>if the genes had been successfully assembled, and</li> <li>if homology search + LCA was able to assign these assembled genes to the correct genus.</li> </ol> <p>Briefly, the conclusion from this simulation was that the process leads to an overestimation of the number of genes, due to duplicate genes introduced as an artifact of the assembly process.</p> <p>This could be circumvented in several ways, and we have come up with two.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Duplication decreases when we stipulate that contigs span the entire length of the window. 2. 
Additionally, we remove low-quality contigs by thresholding them on read counts until the number of duplicated genes (multiple contigs for the same gene from the same genome) stabilises. (This was only possible in the simulation, where the origin of each contig is known from the identity of its reads.) Alternatively, thresholding on coverage instead of read counts could also be used. </code></pre></div></div> <h3 id="part-2-empirical-data">Part: 2 Empirical data</h3> <p>The next part of this blog will continue with the analysis performed on empirical sewage data.</p> <h1 id="softwares">Software</h1> <ol> <li>Protein homology search using <a href="https://github.com/bbuchfink/diamond">DIAMOND</a>.</li> <li>Binning of short reads using <a href="https://github.com/etheleon/pymegan">pymegan</a> for converting raw reads into LCA-ed taxonomic assignments and KEGG-based functional assignments.</li> <li>Assembly of bins using the OLC assembler NEWBLER, and identification of the maximum diversity regions (MDR) using <a href="https://github.com/etheleon/pAss">pAss</a>.</li> <li>Analysis of MDRs and integration with the noSQL database <a href="https://github.com/etheleon/omics">omics</a> using the R package <a href="https://github.com/etheleon/MetamapsDB">metamapsDB</a>.</li> </ol> <h1 id="future-works">Future work</h1> <p>Make this process more friendly to other types of OLC / k-mer assemblers.</p> <h1 id="references">References</h1> <p><a href="https://etheleon.github.io/articles/geneCentricApproach/">From raw sequencing reads to Gene Centric Analyses PART: 1</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on July 18, 2017.</p> <![CDATA[Metagenomics for the not so beginner]]> https://etheleon.github.io/articles/pythonMEGAN 2017-03-07T00:00:00-00:00 2017-03-07T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <h1 id="blast2lca-a-python-wrapper-for-megan-blast2lca">blast2lca++ a Python 
Wrapper for MEGAN blast2lca</h1> <p>Download now from: <a href="https://github.com/etheleon/blast2lcaPlus">https://github.com/etheleon/blast2lcaPlus</a></p> <blockquote> <p>“Metagenomics (also referred to as environmental and community genomics) is the genomic analysis of microorganisms by direct extraction and cloning of DNA from an assemblage of microorganisms.”</p> </blockquote> <p>Whether you’re an absolute or an intermediate beginner venturing into the field of metagenomics, one tool you’ll almost certainly come across quickly is <a href="http://www-ab.informatik.uni-tuebingen.de/software/megan6/">MEGAN</a> from Daniel Huson’s lab, Tübingen University.</p> <p><a href="http://ab.inf.uni-tuebingen.de/software/megan/"><img src="http://megan.informatik.uni-tuebingen.de/uploads/default/original/1X/c3b77ecaaa6f3b8f4c71d45f070a3a6b9952605b.png" alt="MEGANimg" /></a></p> <p>If you take a closer look inside the <code class="language-plaintext highlighter-rouge">tools</code> directory of the installation, you’ll find a bash executable called <code class="language-plaintext highlighter-rouge">blast2lca</code> (<a href="https://github.com/danielhuson/megan-ce/blob/master/tools/blast2lca">see link to script on github repo</a>) which taps into the java classes used in the desktop version of MEGAN.</p> <p><code class="language-plaintext highlighter-rouge">blast2lca</code> is extremely valuable as a tool for accessing the core algorithms within MEGAN, for example:</p> <ol> <li>the Lowest Common Ancestor (LCA) algorithm and</li> <li>functional assignment (KEGG/COG/eggNOG).</li> </ol> <p>MEGAN’s been around for a while, with its <a href="http://www.genome.org/cgi/reprint/gr.5969107v1.pdf">first release</a> way back in 2007.</p> <p>Its newest iteration, <a href="http://www-ab.informatik.uni-tuebingen.de/software/megan6/">MEGAN6</a>, now includes new additions to deal with increasingly large datasets.</p> <p>However, I would say the updates are still mainly for desktop users and 
if you need to run any huge jobs on multiple large sequencing projects, you’ll be hard-pressed to find a solution unless you pay for the server edition, and even then incorporating MEGAN into a customised pipeline might not be that simple.</p> <p>Discussion of MEGAN server is outside the scope of this blogpost; message the authors if you want to know more.</p> <h2 id="blast2lca">blast2lca++</h2> <p>In this blogpost, I’ll be sharing the Python wrapper <a href="https://github.com/etheleon/blast2lcaPlus">https://github.com/etheleon/blast2lcaPlus</a> I’ve written around <code class="language-plaintext highlighter-rouge">blast2lca</code>. (At the time of writing I tested this with MEGAN6 Community Edition 6.6.0 from Dec 2016.)</p> <h3 id="use-case">Use case</h3> <p>Say, for example, you’ve got a huge number of samples you would like to analyse and you’re using an underpowered MacBook 12, but, luckily, you have access to a powerful headless university server.</p> <p><img src="http://weknowmemes.com/wp-content/uploads/2013/03/i-have-a-lot-of-work-to-do-oh-well-comic.jpg" alt="lotsOfWork" /></p> <p>One option is to install MEGAN server (Ultimate Edition), run the LCA and functional binning algorithms there, and analyse the results via the desktop client. As someone who does further analyses in R and Python, that’s not really what I want; I’d rather make my own plots and run my own analyses. Luckily there’s <code class="language-plaintext highlighter-rouge">blast2lca</code>, kindly provided by the author.</p> <p>However, several steps are still not clear, hence the reason for this wrapper:</p> <ol> <li>Combine Annotations - How does one combine KO and taxonomic annotations such that we have a combined annotation for each query (be it short read, long read or contig)?</li> <li>gi2ko mapping file generator - KEGG annotations. In the Community Edition, the tool to generate the mapping file (GI to KEGG) is not included, unlike in the Ultimate Edition. 
What if you’re not in a position to pay for the Ultimate Edition license (which bundles the KEGG database licence) or you have an older version of KEGG lying around somewhere on the server, what should you do? (NCBI has recently done away with GIs and I’ll update this in the future)</li> <li>A complete pipeline from blast to combined output - How to go all the way from the tabbed blast output to the KO and taxonomy combined output mentioned above.</li> </ol> <p>Use the <code class="language-plaintext highlighter-rouge">blast2lca++</code> tool of course!</p> <p><img src="http://img.memecdn.com/unicorn-farting-rainbows_o_1498739.jpg" alt="unicorn" /></p> <h3 id="combine-annotations">Combine Annotations</h3> <p>In the root directory of the github <a href="https://github.com/etheleon/blast2lcaPlus">repo</a>, you’ll find a <code class="language-plaintext highlighter-rouge">parseMEGAN</code> python script.</p> <p>It requires the blast results to be arranged in the following manner:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/projectDir └── sampleDir/ └── sample.daa / sample.m8 (tabbed blast) </code></pre></div></div> <blockquote> <p>Note: if you only have one sample, then just substitute sampleDir with a <code class="language-plaintext highlighter-rouge">.</code>.</p> </blockquote> <p>You’ll be asked to specify the locations of the mapping files (KEGG and taxonomy) as well as the path to the executable:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>parseMEGAN $PROJECTDIR $SAMPLEDIR $SAMPLENAME taxOutput koOutput </code></pre></div></div> <p>After which you’ll get the outputs from blast2lca along with a merged file <code class="language-plaintext highlighter-rouge">sample-combined.txt</code>:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/projectDir └── sampleDirName/ ├── blast2lca-tax-Output ├── 
blast2lca-ko-Output ├── sampleName-combined.txt └── inputSampleDAAfile.daa </code></pre></div></div> <p>In the <code class="language-plaintext highlighter-rouge">sample-combined.txt</code> file, you’ll find a table:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rank ncbi-taxid KEGG-ko #reads phylum 67820 K00000 4 phylum 1224 K06937 1 phylum 1224 K00656 6 phylum 1224 K04564 2 phylum 1224 K06934 1 phylum 1224 K12524 24 phylum 1224 K00558 7 phylum 1224 K02674 1 phylum 1224 K06694 1 phylum 1224 K01785 12 ... ... species 1262910 K00033 1 species 1262911 K00000 525 species 35760 K00000 14 species 7462 K15421 1 species 7462 K00000 1 species 1262918 K00000 365 species 1262919 K02429 5 species 1262919 K07133 1 species 1262919 K03800 1 species 1262919 K00000 529 </code></pre></div></div> <p>Below’s a small example of what you could do with the data:</p> <h1 id="application">Application</h1> <p>With the above you could generate a reads per million column based on the raw counts</p> <table> <tbody> <tr> <td>level</td> <td>taxon</td> <td>ko</td> <td>rpm</td> <td>c1.raw</td> </tr> <tr> <td>Genus</td> <td>Abiotrophia</td> <td>K00000</td> <td>1.40278158</td> <td>32</td> </tr> <tr> <td>Genus</td> <td>Acanthamoeba</td> <td>K00000</td> <td>2.11362518</td> <td>47</td> </tr> <tr> <td>Genus</td> <td>Acaryochloris</td> <td>K00000</td> <td>61.73957107</td> <td>1423</td> </tr> <tr> <td>Genus</td> <td>Acaryochloris</td> <td>K00013</td> <td>0.00000000</td> <td>0</td> </tr> <tr> <td>Genus</td> <td>Acaryochloris</td> <td>K00016</td> <td>0.00000000</td> <td>0</td> </tr> <tr> <td>Genus</td> <td>Acaryochloris</td> <td>K00091</td> <td>0.05123577</td> <td>1</td> </tr> </tbody> </table> <p>You can now quickly summarise in a gene centric format the contributions made from each genus (or any taxonomic rank you choose) to each KO.</p> <p><img src="/images/posts/combiningTAX-KO.png" alt="mosiac plot" /></p> <p>Here we see the transcriptome 
summary, and one of the most highly expressed KOs (rightmost), one responsible for nitrogen metabolism, is mostly being expressed by a single genus (orange).</p> <p><code class="language-plaintext highlighter-rouge">0</code> in the <code class="language-plaintext highlighter-rouge">ncbi-taxid</code> and <code class="language-plaintext highlighter-rouge">K00000</code> in the KEGG-ko columns stand for unclassified.</p> <p>What if you want to find out the names of organisms from NCBI’s taxids?</p> <p>Check out the <a href="https://github.com/etheleon/MetamapsDB">R package MetamapsDB</a>, which lets you query the names based on taxids and do much more.</p> <h3 id="full-pipeline">Full pipeline</h3> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fullPipeline $PROJECTDIR $SAMPLEDIR $SAMPLENAME $INPUTFILE taxOutput koOutput --blast2lca &lt;path 2 the blast2lca script&gt; --gi2tax &lt;path to the taxonomy mapping file&gt; --gi2kegg &lt;path to the KEGG mapping file&gt; </code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">fullPipeline</code> script will take a <code class="language-plaintext highlighter-rouge">.m8</code> (tabbed blast) file or meganised <code class="language-plaintext highlighter-rouge">.DAA</code> file as input (checked via a regex for <code class="language-plaintext highlighter-rouge">.m8</code> or <code class="language-plaintext highlighter-rouge">.daa</code>), carry out the taxonomic and functional (KEGG) annotation, and combine the outputs into a single file.</p> <h3 id="gi2ko-mapping-file-generator">gi2ko mapping file generator</h3> <p>Although MEGAN UE provides a KEGG mapping file generating tool (not included with MEGAN CE), it doesn’t take into account that NCBI assigns a unique <code class="language-plaintext highlighter-rouge">GI</code> to each representative sequence in the non-redundant database (NCBI NR), under which sit “duplicate” sequence GIs and ref IDs. When blast or diamond does the alignment, it returns only the representative GI and not the rest, which makes the KEGG-to-GI mapping miss many assignments.</p> <p>We’ve separately included in the tools folder of the Python package the <a href="https://github.com/etheleon/blast2lcaPlus/blob/master/tools/ref2kegg.go">ref2kegg.go NR-GI to KEGG ortholog (KO) mapping file generator</a>, written in Go. The output of this can be fed to the blast2lca wrapper via the <code class="language-plaintext highlighter-rouge">--gi2kegg</code> flag. At the time of writing, the parser had been rewritten from Perl into Go (a typed, compiled language) to increase the speed of parsing the NR fasta.</p> <p><img src="https://i.imgflip.com/123oks.jpg" alt="end" /></p> <p>Hope this helps anyone building a customised pipeline with MEGAN6! Personally, I feel strongly about MEGAN now having an open-source version via MEGAN CE.</p> <h2 id="reference">Reference</h2> <ol> <li>Handelsman, J (2004). Metagenomics: application of genomics to uncultured microorganisms. Microbiol. Mol. Biol. Rev., 68, 4:669-85.</li> <li>Huson, D. H., Tappu, R., Bazinet, A. L., Xie, C., Cummings, M. P., Nieselt, K., &amp; Williams, R. (2017). Fast and simple protein-alignment-guided assembly of orthologous gene families from microbiome sequencing reads. Microbiome, 5(1), 11. http://doi.org/10.1186/s40168-017-0233-2</li> </ol> <p><a href="https://etheleon.github.io/articles/pythonMEGAN/">Metagenomics for the not so beginner</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on March 07, 2017.</p> <![CDATA[10 Bioinformatics tools and workflows you should be adopting in 2017.]]> https://etheleon.github.io/articles/Organising DNA sequencing projects 2017-01-15T00:00:00-00:00 2017-01-15T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <p>Coming from a non-computer-science field but finding yourself having a hard time navigating data analysis? 
I won’t blame you if you feel at a loss as to where you should begin.</p> <p style="text-align: center;"><img src="http://m.quickmeme.com/img/da/da7c74e10383ecc127dbb18b7d812abfb7f2aa1092f116d09a3ae70e782fc059.jpg" alt="That's me" /></p> <p>Posts like <em>which language should you learn for datascience</em> often catch our eyes, and it’s no different whether you’re doing <a href="https://www.biostars.org/p/7763/">bioinformatics</a> or something far removed such as <a href="https://www.linkedin.com/pulse/hr-analytics-starter-kit-part-2-intro-r-richard-rosenow-pmp">HR analytics</a>.</p> <p>The only advice I have is to immerse yourself as much as you can whenever you get the chance. This way you’ll be able to gain EXP little by little.</p> <p>There’s even a level-up tree, starting from a junior bioinformatics analyst up to a full-on role, much like how it was described in this <a href="http://homolog.us/blogs/blog/2011/07/22/a-beginners-guide-to-bioinformatics-part-i/">post</a>.</p> <p style="text-align: center;"><img src="https://s.aolcdn.com/hss/storage/midas/daede00598c17d19b29a93ff65147585/200016989/priest+trees.jpg" alt="upgrade ursself" /></p> <p>Below’s a list of skills I’ve successfully mastered in 2016.</p> <p>Now let’s count down from number 10!</p> <h2 id="10-mixing-procedural-scripts-with-object-oriented-programming-oop">10. Mixing procedural scripts with Object Oriented Programming (OOP)</h2> <p>Previously, I organised my research analyses as scripts with running numbers. Mainly procedural stuff…</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ProjName.0100.doTask1.R ProjName.0101.doTask2.py . . ProjName.0109.doTask3.pl </code></pre></div></div> <p>But I quickly realised that much of the analysis I do is often not at all linear.</p> <p>It often works like a network. 
Script N+10 needs functions from Script N+2; dependencies become a real thing once the project grows even to a modest size.</p> <p style="text-align: center;"><img src="https://raw.githubusercontent.com/mikel-egana-aranguren/SADI-Galaxy-Docker/master/workflow_screen.png" alt="galaxy workflow" /></p> <p>The above is a screenshot of a dependency network in a Galaxy workflow. It nicely summarises some of what I do. I searched around for a CLI version and stumbled upon <a href="https://github.com/spotify/luigi">Luigi</a>.</p> <p>In 2016, I tried using Luigi (a python package from Spotify for handling scheduled tasks) but found it too complicated for something I could solve simply with OOP. Not everything is a routine operation, but dependencies, on the other hand, are very real.</p> <p style="text-align: center;"><img src="https://i.stack.imgur.com/xFUmZ.png" alt="luigi" /></p> <p>What OOP does is allow you to use design patterns. Design patterns are <a href="https://simpleprogrammer.com/2016/06/15/dont-get-obsessed-design-patterns/">predefined solutions to specific kinds of problems, proven over time and known by the software community</a>, but just remember not to get too obsessed with them. With OOP, functions and methods are quickly abstracted away, giving you cleaner analysis code (<a href="https://www.sitepoint.com/object-oriented-javascript-deep-dive-es6-classes/">classes and methods</a>).</p> <p>One of the things I see myself doing more of going into 2017 is applying more <a href="https://github.com/faif/python-patterns">OOP design patterns</a>.</p> <h2 id="9-package-your-code">9. Package your code</h2> <p>To be honest, R users are rather spoilt by R’s fantastic package system, <a href="https://cran.r-project.org">CRAN</a>. It’s extremely easy to download, install, and share your packages.</p> <p>Hadley’s devtools package is close to sorcery.
(Do check out more of his packages, collectively known as the <a href="http://adolfoalvarez.cl/the-hitchhikers-guide-to-the-hadleyverse/">hadleyverse</a>.)</p> <p>When I looked outside of R, at other ecosystems like Python’s PyPI and Perl’s CPAN, things just don’t feel as easy.</p> <p>If you’re hacking in Perl, check out <a href="https://github.com/tokuhirom/Minilla">Minilla</a>. As for Python, one rather interesting find is <a href="https://github.com/audreyr/cookiecutter">cookiecutter</a>, a templating package which generates templates for python modules.</p> <p>Both allow you to upload your package to GitHub and let others install directly from it.</p> <p>You might be confused as to why I’m still mentioning Perl; well, that’s because much of bioinformatics is still using it!</p> <h2 id="8-be-a-polyglot-for-package-management-systems">8. Be a polyglot for package management systems</h2> <p>Bioinformatics software is written in almost any language imaginable, e.g. Erlang, Haskell, Perl, C++, Java, Python2.X, Python3.X; you name it, it’s there. Learning how to use them will always be a constant; however, familiarising yourself with common installation methods like make will make a world of difference.</p> <p>This parallels web development quite a fair bit. In my startup life at <a href="https://www.fundmylife.co">fundMyLife</a> we adopted MeteorJS, a modern ES6 web development framework, along with CoffeeScript coupled with Jade/Pug. Javascript’s package system NPM is a beast, but once you get the hang of it many of its sweet packages will be at your fingertips.</p> <p>The community is quick to adopt and change, and everyone wants you to use their standards and formats. One thing is clear: what stays constant are their package managers, so know them well.</p> <h2 id="7-document-your-projects">7.
Document your projects</h2> <p>Read enough documentation, whether for installation or simply to use the functions in a project, and you’ll soon see yourself transforming into a connoisseur of sorts as to what makes documentation good. It’s extremely important if you want people using your work, and it’s just plain <a href="https://twitter.com/blahah404/status/537584999885991936">courtesy</a>.</p> <p>R’s <a href="http://r-pkgs.had.co.nz/man.html">Roxygen</a> is extremely useful. It is by far the most user-friendly way to document your R code.</p> <p>Below is how you would go about annotating a function in your package; the docs are all automatically generated.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#' Add together two numbers.
#'
#' @param x A number.
#' @param y A number.
#' @return The sum of \code{x} and \code{y}.
#' @examples
#' add(1, 1)
#' add(10, 1)
add &lt;- function(x, y) {
  x + y
}
</code></pre></div></div> <p>I’m still learning Python’s documentation system.</p> <p>Perl has its POD documentation system, which allows you to embed documentation between code, but it’s very clunky. I used this in my pAss package, but it still can’t beat R.</p> <h2 id="6-containers-containers-containers">6. Containers, Containers, Containers.</h2> <p>Most of us nowadays start off our journey in server-side analysis in a Debian Linux distro, usually an Ubuntu box, with full root access, while enjoying the privilege to run the package manager 📦 <code class="language-plaintext highlighter-rouge">apt-get</code> as and when we please without even blinking an eye.</p> <p>But when we start using a shared resource, things quickly turn south. The inspiration to start incorporating this into my workflow came when I saw the web community picking this up to deal with dependency hell.
Meanwhile, I found out that one of my mentors, who is now working heavily in industry data science, uses Docker in his day-to-day life.</p> <p>What containers really <em>disrupt</em> is <a href="https://www.gnu.org/software/make/">MAKE</a>. So instead of often-confusing makefiles, one writes Dockerfiles, and everything that gets installed stays within the container without polluting your host environment, pretty much like a function.</p> <p style="text-align: center;"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Function_machine2.svg/1200px-Function_machine2.svg.png" alt="docker is like a function" /></p> <p>Docker is the most mainstream of all container technologies, and you should take a look at the <a href="https://github.com/BioContainers/containers">biocontainers</a> github page; there you can find many bioinformatics tools containerized!</p> <p>Don’t worry about accessing the files in your home directory; it isn’t a problem, as Docker lets you mount the host system’s HDD onto the running container.</p> <h3 id="docker-is-like-building-a-hdd-in-minecraft">Docker is like building a HDD in minecraft</h3> <iframe width="560" height="315" src="https://www.youtube.com/embed/q7clz1TPK8o" frameborder="0" allowfullscreen=""></iframe> <p>Talk about inception</p> <p>Together this solves an acute problem, as it gives the normal user back the ability to be root without ruining the rest of the host system, and still with performance similar to running on bare metal.</p> <p>One disadvantage of using Docker is that installing Docker itself requires root access, and if you’re dealing with a university-wide shared resource, good luck.</p> <p>Which is why #5, Linuxbrew, turns out to be helpful.</p> <h2 id="5-linuxbrew">5. Linuxbrew</h2> <p>Linuxbrew is basically a port of the macOS/OSX package manager <a href="http://brew.sh">HomeBrew</a>.
So if you’re already familiar with Homebrew, Linuxbrew will be a breeze.</p> <p>Just how easy is it?</p> <p>Let’s try installing R. Wait, let’s make it a tad bit more difficult: let’s customize the installation further with the newest, fastest basic linear algebra routines included in <a href="https://github.com/xianyi/OpenBLAS">openblas</a>.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew install R --with-openblas
</code></pre></div></div> <p>Tada, you’re done. Yes, it’s that simple.</p> <p>What Homebrew, or in this case Linuxbrew, does is not only let you install into your <code class="language-plaintext highlighter-rouge">$HOME</code> directory, bypassing all that superuser bullshit; it also lets you do very specific installations and dependencies.</p> <p><a href="https://twitter.com/sjackman">Shaun Jackman</a> (see his <a href="https://github.com/sjackman/linuxbrew-slides">slides</a>) and many others are behind the Science <a href="https://github.com/Homebrew/homebrew-science">tap</a>, with instructions for Linuxbrew/Homebrew to install popular bioinformatics tools.</p> <p>This makes installing, and ultimately doing science, much easier than before.</p> <p style="text-align: center;"><img src="http://imgs.xkcd.com/comics/outreach.png" alt="science tap" /></p> <p>If you’re still confused about how to do local installations, I recommend reading this <a href="http://sneakygcr.net/caged-python-how-to-set-up-a-scientific-python-stack-in-your-home-folder-without-going-insane.html">post</a>; although its title has Python in it, it’s really meant for everything.</p> <h2 id="4-rmarkdown">4.
Rmarkdown</h2> <p>To be honest, I started R Markdown way back in 2015 but got really into it in 2016, because it really helps frame my questions and analyses.</p> <p>Writing your analyses as R Markdown documents forces you to place those tiny bursts of effort and energy into a single compiled document with clearly defined goals and a developing story.</p> <p>R Markdown has a language-engine mechanism, so it allows you to work not only with R but also with Perl or Python.</p> <p>Having Docker installed also allowed me to use the newer versions of R Markdown, <a href="http://rmarkdown.rstudio.com">rmarkdown2</a>.</p> <p>One feature I absolutely love is auto code hiding in the output HTML.</p> <p>The output looks absolutely professional, and when you need it you can always show the code.</p> <p>Interestingly, the RStudio team has also come up with R Notebooks. I’m sure many pythonistas will love this new feature, but personally I’m very happy with the way things are with R Markdown, so I’m giving this a miss.</p> <h2 id="3-tmux-vim-slime">3. Tmux-vim-slime</h2> <p>For those who know me in person, you know I’m a big fan of the terminal, and I do most of my work, if not all, in that one window. So when I’m in the server I’ll always have a tmux session running.</p> <p style="text-align: center;"><img src="https://github.com/jpalardy/vim-slime/raw/master/assets/vim-slime.gif" alt="my tmux" /></p> <p><a href="https://www.google.com.sg/search?q=tmux+r+plugin&amp;oq=tmux+r+plugin&amp;aqs=chrome..69i57j0l5.1754j0j1&amp;sourceid=chrome&amp;ie=UTF-8">This is a good tutorial</a>, which teaches you how to work with R on a server, like a cluster, away from the familiar RStudio. Recently I’ve switched from this to <a href="https://github.com/jpalardy/vim-slime">vim-slime</a>, a vim port of Emacs’ SLIME, because it also supports IPython.</p> <h2 id="2-biojs">2.
BioJS</h2> <p>Learning web development while building <a href="https://www.fundmylife.co">fundMyLife</a> has given me the skills required to build the UI layer, instead of just CLI system tools in python/perl/R.</p> <p>BioJS is one of those interesting developments where important visualisations are now rendered in a browser, and hence on any operating system. You see this trend outside of bioinformatics too, where editors like Atom and Visual Studio Code, and the communications tool Slack, are all built as browser-based applications.</p> <p><img src="https://lh5.googleusercontent.com/FbUHBUY-GmrI727nQd3K2lid0I4nPWpQUydyXEibMdfrnOeLB5wXlKlQWPSAMeBz_rfa8YAFjpQZjWItcpqrSHOoy6BGcCKw6AWjk3SjkBfmopJnzG3k-fxW4hdtO0xAS8Brjv2J" alt="biojs-gosc" /></p> <p>The admins were even featured in 2016’s GSoC; check out the blog posts <a href="https://opensource.googleblog.com/2016/08/from-google-summer-of-code-to-game-of.html">part 1</a> and <a href="https://opensource.googleblog.com/2016/08/from-google-summer-of-code-to-game-of_12.html">part 2</a>, where they went on to build visualisations for Game of Thrones.</p> <h2 id="1-writing-production-ready-code">1. Writing production ready code</h2> <p>There’s a lot of talk about reproducibility, and really, much of it has been solved in industry. Nearing the end of the PhD means most of my packages should be ready for use by the public at large.</p> <p>Personally, I’m aspiring to write code good enough for an industry setting. The crossover from academia into industry isn’t that uncommon; here’s a <a href="https://eng.uber.com/emi-data-science-q-a/">post</a> about a nuclear physicist who is now working at Uber, doing data science and shipping production code; some even do some engineering as well.</p> <p style="text-align: center;"><img src="https://qph.ec.quoracdn.net/main-qimg-9281e2345d6f6adfc2c42c2fa1001094?convert_to_webp=true" alt="pay scale" /></p> <p>That’s all, folks!
I hope this helped you get oriented around bioinformatics.</p> <p>If you know any good workflows, please do share them with me.</p> <p><a href="https://etheleon.github.io/articles/Organising-DNA-sequencing-projects/">10 Bioinformatics tools and workflows you should be adopting in 2017.</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on January 15, 2017.</p> <![CDATA[5 Things You Didnt Know About The Bacteria In Your Gut]]> https://etheleon.github.io/articles/metagenomics 2016-12-21T00:00:00-00:00 2016-12-27T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <h1 id="hype-vs-reality-the-microbiome">Hype vs Reality: The Microbiome</h1> <p>With so much hype today about one’s gut health, you often wonder how much of it is true. A trip down to your local pharmacy’s supplements shelf and you will see a wide range of pro- and pre-biotics, each with their own beneficial claims, often too ludicrous to be taken seriously.</p> <p>In case you still don’t believe me, just head down to the <a href="https://m.reddit.com/r/Microbiome/?compact=true">subreddit r/microbiome</a> with close to 3000 subscribers.</p> <p>Why this hype?
It’s driven by the numerous use cases involving the microbiome which are popping up all over the place.</p> <blockquote> <p><em>From prediction for early <a href="http://www.nature.com/nrmicro/journal/v14/n8/fig_tab/nrmicro.2016.83_F2.html">diagnostics</a> of chronic diseases such as diabetes to <a href="http://www.sciencemag.org/news/2016/03/how-your-microbiome-can-put-you-scene-crime">fingerprinting</a> criminals from samples left behind at crime scenes.</em></p> </blockquote> <h2 id="so-how-do-we-study-microbes-even-those-we-cannot-culture">So how do we study microbes (even those we cannot culture)?</h2> <p><strong>Through DNA sequencing of course!</strong></p> <p><img src="https://imgflip.com/s/meme/X-Everywhere.jpg" alt="dataIsEverywhere" /></p> <p><em>More data for you, me, everybody</em></p> <p>We study them mainly using two techniques:</p> <ol> <li>Amplicon Sequencing</li> <li>Whole Metagenome Sequencing</li> </ol> <h3 id="so-what-is-amplicon-sequencing-and-metagenomics">So what is Amplicon sequencing and Metagenomics?</h3> <table> <tbody> <tr> <td>The investigation of microbes in a given sample, without the need for culture, by directly recording the genetic content using <a href="">next generation DNA sequencing techniques</a>.</td> </tr> </tbody> </table> <h4 id="can-you-be-more-specific-tldr-version">Can you be more specific?
(TL;DR version)</h4> <ol> <li>Amplicon sequencing: sequencing of only a selected representative gene of interest, in this case the variable regions of the rRNA of the 16S ribosome.</li> <li>Whole metagenome sequencing: you basically get the whole repertoire / complement of genes.</li> </ol> <p><em>For non-biologists</em>: you can compare this with many existing data science techniques, where you either churn through all the collected data or zoom in on a very specific signal you’re looking for.</p> <h1 id="5-things-you-should-know-about-your-gut-microbiome">5 things you should know about your gut microbiome</h1> <h2 id="1-community-complexity">1. Community complexity</h2> <p>Microbial communities range widely in their complexity. By complexity we mean the number of unique OTUs (Operational Taxonomic Units) and their proportions.</p> <p>There are several ways to quantify this complexity, through <em>unweighted</em> and <em>weighted</em> indices borrowed from the existing macroecology literature:</p> <h3 id="indices-and-metrics">Indices and metrics</h3> <p><img src="https://media.makeameme.org/created/diversity.jpg" alt="diversity" /></p> <p>There are many of these around, and the famous ones are <em>$\alpha$</em>-diversity and <em>$\beta$</em>-diversity.</p> <p>The former, <em>$\alpha$-diversity</em>, measures within-sample diversity and includes Shannon-Weaver and Simpson. If you want something more weighted, then try Taxonomic Diversity $\Delta$ or Taxonomic Distinctness $\Delta^+$. <em>(See the R package <a href="https://cran.r-project.org/web/packages/vegan/vignettes/diversity-vegan.pdf">vegan</a> for more explanation.)</em></p> <p><em>$\beta$-diversity</em> describes the total species diversity across samples over the average species diversity per sample; it is used essentially as a measure to investigate heterogeneity amongst samples.</p> <h2 id="2-analysis-is-hard">2. Analysis is Hard</h2> <p>Simpler communities are by far easier to study.
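</p>

<p>To make “complexity” concrete, here is a minimal Python sketch (standard library only; the OTU abundance vectors are made up) of the Shannon-Weaver and Gini-Simpson indices mentioned above:</p>

```python
import math

def shannon(counts):
    """Shannon-Weaver index H = -sum(p_i * ln p_i) over non-zero OTUs."""
    total = sum(counts)
    ps = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in ps)

def gini_simpson(counts):
    """Gini-Simpson index 1 - sum(p_i^2): the chance that two randomly
    drawn reads come from different OTUs."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Hypothetical OTU abundance vectors for two communities
even_community   = [25, 25, 25, 25]  # 4 OTUs, perfectly even
skewed_community = [97, 1, 1, 1]     # dominated by a single OTU

print(shannon(even_community))    # ln(4), the maximum for 4 OTUs
print(shannon(skewed_community))  # much lower: less diverse
```

<p>Both indices rise with richness (more OTUs) and evenness (more equal proportions), which is exactly what “complexity” means here.</p>

<p>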
However, one must also consider the number of reference genomes available for the community in question; i.e., if the community is simple but the reference genomes are sparse and few and far between, chances are the analysis will unveil little about the community save for top-level analyses.</p> <p>This places the gut microbiome in a good spot to be studied, as it is relatively well understood, with good reference genomes, and isn’t as complicated as the soil or wastewater microbiome.</p> <p><img src="http://m.quickmeme.com/img/c1/c18124ff89dc0248eadb2b59e842412592735b9b056a576547e2dd34165b7476.jpg" alt="Goldilocks" /></p> <p><em>The gut microbiome is right smack in the Goldilocks zone for analysis</em></p> <h2 id="3-types-of-communities">3. Types of Communities</h2> <h3 id="simple-communities">Simple Communities</h3> <p><img src="http://m.memegen.com/cmtc36.jpg" alt="simple communities" /></p> <p>These are usually found in very harsh or low-nutrient environments.</p> <h3 id="complex-communities">Complex communities</h3> <p><img src="http://img.memecdn.com/reson-of-women-crying_o_1511995.jpg" alt="complex" /></p> <p>Examples of complex communities include soil and sewage, where species numbers reach the 1000-2000 range.</p> <h3 id="synthetic-communities">Synthetic communities</h3> <p><img src="https://s-media-cache-ak0.pinimg.com/originals/dd/be/ed/ddbeed5b6578f12c4d569a01784972b8.jpg" alt="robots" /></p> <p>These are man-made and can be either simulated <em>in silico</em> or sampled from an artificial mixture. They are simple, rarely going beyond a hundred species, and are mainly used for understanding and testing ecological theories.</p> <h4 id="enriched-communities">Enriched communities</h4> <p><img src="https://s-media-cache-ak0.pinimg.com/736x/e3/f8/75/e3f87538b7c52c0df16a6a15da4ee9ef.jpg" alt="borg" /></p> <p>Such communities stand at the intersection between the simple but artificial and the complex but close to naturally occurring communities.
They form a sub-category under artificial communities.</p> <h2 id="4-fingerprinting">4. Fingerprinting</h2> <p>To get an idea of how unique this “key” can get, we can look at how long a typical SSH RSA key 🔑 is:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-----BEGIN PUBLIC KEY-----
MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCq
GKukO1De7zhZj6+H0qtjTkVxwTCpvKe4eCZ0FPq
ri0cb2JZfXJDgYSF6vUpwmJG8wVQZKjeGcjDOL5U
lsuusFncCzWBQ7RKNUSesmQRMSGkVb1/3j+skZ6U
tW+5u09lHNsj6tQ51s1SPrCBkedbNf0Tp0GbMJDy
R4e9T04ZZwIDAQAB
-----END PUBLIC KEY-----
</code></pre></div></div> <p>This identifies your machine as you when you try to log into a host server.</p> <p>Similarly, a microbiome signature would look like this:</p> <p><img src="http://www.frontiersin.org/files/Articles/138115/fmicb-06-00944-HTML/image_m/fmicb-06-00944-g001.jpg" alt="microbiome" /></p> <p>Look at how much the OTU abundances resemble a barcode:</p> <p><img src="http://thewindowsclub.thewindowsclubco.netdna-cdn.com/wp-content/uploads/2011/11/Barcode.jpg" alt="barcode" /></p> <p>We use this to differentiate groups of individuals from one another, usually via the diets that each individual shares with others in the same group.</p> <p>Take for example the plots found in the example analysis below, for the gut microbiomes of two different sets of mice, each fed on a specific diet.</p> <h2 id="5-engineer-your-gut-microbiome-now">5. Engineer Your Gut Microbiome Now.</h2> <p>Ultimately, the gut microbiome is resilient to change and will probably stay the same unless you do something drastic about your diet, like going vegan, as referenced in this <a href="http://www.nature.com/nrmicro/journal/v14/n1/abs/nrmicro3552.html">review article</a>.
However, it is a good diagnostic for identifying groups whose physiology has altered their microbiome.</p> <h1 id="example-analyses">Example analyses</h1> <p>I’m including links to short analyses of three groups of mice: two groups fed on two different diets, and one group which was fed a transition diet.</p> <p><em>DISCLAIMER: The following is based on unpublished data (Little et al.). Any reproduction or use of the analysis and the results for personal/commercial use is prohibited. If you have any enquiries please contact the author of this post at [email protected]</em></p> <p><em>Part 1</em>: <a href="http://metamaps.scelse.nus.edu.sg/analyses/mouse-initial.html">Unbiased 16S profiling of whole metagenome data using RiboTagger</a></p> <p><em>Part 2</em>: <a href="http://metamaps.scelse.nus.edu.sg/analyses/init.0105.wholeMetagenome.html">Whole metagenome profiling</a></p> <h2 id="conclusion">Conclusion</h2> <p>Things are bound to get more interesting as studies with higher-throughput time-series experiments become the norm in the near future.</p> <h1 id="references">References</h1> <p>1: Xie, C., Lui, C., Goi, W., Huson, D. H., Little, P. F. R., &amp; Williams, R. B. H. (2016). RiboTagger: fast and unbiased 16S/18S profiling using whole community shotgun metagenomic or metatranscriptome surveys. BMC Bioinformatics, 17(Suppl 19). http://doi.org/10.1186/s12859-016-1378-x</p> <p>2: Franzosa, E. A., Hsu, T., Sirota-Madi, A., Shafquat, A., Abu-Ali, G., Morgan, X. C., &amp; Huttenhower, C. (2015). Sequencing and beyond: integrating molecular “omics” for microbial community profiling. Nature Reviews Microbiology, 13(6), 360–72. http://doi.org/10.1038/nrmicro3451</p> <p>3: Jari Oksanen, F. Guillaume Blanchet, Michael Friendly, Roeland Kindt, Pierre Legendre, Dan McGlinn, Peter R. Minchin, R. B. O’Hara, Gavin L. Simpson, Peter Solymos, M. Henry H. Stevens, Eduard Szoecs and Helene Wagner (2016). vegan: Community Ecology Package. R package version 2.4-1.
https://CRAN.R-project.org/package=vegan</p> <p><a href="https://etheleon.github.io/articles/metagenomics/">5 Things You Didnt Know About The Bacteria In Your Gut</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on December 27, 2016.</p> <![CDATA[My First post]]> https://etheleon.github.io/blog/my-first-post 2016-12-21T00:00:00-00:00 2016-12-21T00:00:00+00:00 Wesley GOI https://etheleon.github.io [email protected] <p>So let’s kick it off with some obligatory self-introduction. Being a PhD candidate here at NUS, Singapore (Computational Biology / Metagenomics), I encounter some pretty nifty data analyses which I would love to share.</p> <p>In particular, the latest statistical / machine learning methods and the code required for the voodoo to happen.</p> <p>This blog fulfills two roles: a platform for sharing my ideas and thoughts (analytical and technical), and a record of my recent foray into the startup scene here in Singapore.</p> <p>Being the tech co-founder of <a href="https://fundmylife.co">fundMyLife</a>, an InsurTech company, has opened my eyes to the many facets of doing a (tech) business. Besides the basics of software development, there’s so much more now open to me: marketing, growth hacking and bizDev.</p> <p>Back to the introductory post: I’m guessing I’ll first blog about my research analyses before going into posts related to startups.</p> <p>The coming posts, 2 in fact, *grins* will be analyzing the gut microbiomes of three groups of mice – two fed on two diets and one which had their diet changed from one to the other. Code will of course be published together.</p> <p>See ya folks!</p> <p>Looking forward to posting the analyses.</p> <p><a href="https://etheleon.github.io/blog/my-first-post/">My First post</a> was originally published by Wesley GOI at <a href="https://etheleon.github.io">Ars de Datus-Scientia</a> on December 21, 2016.</p>