Laurens Geffert

How to Summarize your Travel History in under 5 Minutes

Mon, 05 Apr 2021 14:30:00 +0000

How to use your location history to compile a breakdown of all your international travel. Fast, simple, and valuable for immigration purposes or visa applications. We will use the Google Maps takeout feature and a small R script.

Introduction

When applying for my US visa, one of the questions that USCIS had for me was a breakdown of all my international travel in the last ten years. If you’re anything like me, that question can cause a pretty big headache. Not because I have anything to hide but because it is challenging to keep track of all the different trips I make. A visa application is a serious business, and it’s vital to answer completely and truthfully. So I started digging into old emails, flight loyalty program records, even paper documents. The whole process took me the better part of two hours. I kept thinking there must be a better way to do this. Now I can say with confidence that there is, thanks to Google Maps’ Location History feature!

Data Extraction

Getting the Files

There are some pre-requisites. You need to have location history enabled and, depending on the time window of travel history that you are interested in, you should disable the auto-delete. Note that auto-delete was recently changed to OPT-OUT! I like having all that data archived and available, so I made sure to disable it (you can do that at https://myactivity.google.com/activitycontrols). I still don’t have a full ten years, but I’ll take what I can get.

Next, we need to extract the data from Google’s systems. I wrote a separate post on how to download location history data programmatically. However, here we’re looking to download data for a large number of days simultaneously, so the Google Takeout feature is the better choice because we won’t end up hitting API request limits.

Go to https://takeout.google.com/, de-select all products (via the button at the top), then re-select “Location History”. Leave the file format as JSON. Set the file size specifications according to your preferences. You’re very unlikely to come up against the GB file limit with just your location history. The export should take about 10 to 15 minutes. When done, you will receive a .zip file to your Gmail account that contains folders for each year and files for each month with location history data.

Getting Coordinates

Many RStats users dislike JSON, but it’s a ubiquitous and useful format! Fear not, Thomas Mock has written an excellent summary on how to process JSON data with R. Looking at the location history JSON files, it seems there are two sections in each: placeVisit and activitySegment. This matches what you can see on the timeline in the Google Maps app, where places that you visited are listed as entries, and your movement between places is labeled with inferred activities like “walking”, “running”, “on a train”, or “flying”.

Originally I thought I could just use the places listed in placeVisit and extract their country from the address field (usually given in the last row). This worked well for most places, including the UK and US, but got really messy for Japan and Korea. I changed my approach and went for extracting raw latitude and longitude values instead. Digging in a little deeper, I found longitude-latitude values in five places:

one pair for each entry in placeVisit given with a startTimestampMs and endTimestampMs
zero to many pairs for each placeVisit in a nested list in simplifiedRawPath with a single timestampMs each
one set of startLocation, startTimestampMs, endLocation, endTimestampMs for each entry in activitySegment
zero to many pairs with timestampMs for each activitySegment in a nested list in simplifiedRawPath
zero to many pairs with timestampMs for each activitySegment in a nested list in waypointPath

Note that simplifiedRawPath and waypointPath aren’t present for every entry. Furthermore, the above list might not be complete. For example, I also noticed parkingEvents, but it was mostly empty for me, so I left it out. There could be other entries, depending on the types of observation, sensors, and features used with Google Maps, that I am not aware of and that I missed here. If that is the case, please let me know in the comments below. I’d love to hear from you!
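To make the structure described above more concrete, here is a minimal sketch that parses a tiny hand-written JSON string shaped like a takeout file. The top-level key timelineObjects and all values are assumptions made up for illustration; only the field names (placeVisit, activitySegment, latitudeE7, longitudeE7, startTimestampMs, endTimestampMs, visitConfidence) are taken from the data discussed in this post.

```r
library(jsonlite)

# toy stand-in for one monthly takeout file (all values are made up)
toy_json <- '{
  "timelineObjects": [
    {"placeVisit": {
      "location": {"latitudeE7": 515007000, "longitudeE7": -1246000},
      "duration": {"startTimestampMs": "1399708800000",
                   "endTimestampMs": "1399712400000"},
      "visitConfidence": 90}},
    {"activitySegment": {
      "startLocation": {"latitudeE7": 515007000, "longitudeE7": -1246000},
      "endLocation": {"latitudeE7": 514700000, "longitudeE7": -4543000},
      "duration": {"startTimestampMs": "1399712400000",
                   "endTimestampMs": "1399716000000"}}}
  ]
}'

json <- fromJSON(txt = toy_json, simplifyVector = TRUE, flatten = FALSE)

# the two sections found in the file
section_names <- names(json[[1]])

# flattening turns nested fields into dot-separated column names
places <- jsonlite::flatten(json[[1]]$placeVisit)
place_cols <- colnames(places)
```

Printing place_cols is a quick way to see which columns are available before writing any transmute() calls against them.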

Based on the information above, I wrote a function to extract latitude-longitude data from the takeout JSON files. I use jsonlite’s flatten() to make nested lists of consistent schema into data frames and then invoke transmute() along with unnest() and pivot_longer() where needed to standardize the formatting and create rows with timestamp, lng, lat observations (and some metadata). Observations from each part of the JSON get combined with bind_rows() to an output df. I’m only using the places info for now, but the same framework is easily generalizable to activities:

library(tidyverse)
library(jsonlite)
library(lubridate)
library(assertthat)

# extract location data from takeout JSON
get_location_from_json <- function(fname, confidence_threshold=75) {

  message(str_glue('now processing {fname}'))
  json <- fromJSON(txt=fname, simplifyVector=TRUE, flatten=FALSE)

  # check JSON sub-lists look okay:
  # exactly two sections, named placeVisit and activitySegment
  assert_that(
    length(json[[1]]) == 2,
    all(c('placeVisit', 'activitySegment') %in% names(json[[1]])))

  # get sub-elements with place and activity data
  df_places <- json %>%
    pluck(1) %>%
    pluck('placeVisit') %>%
    jsonlite::flatten(recursive=FALSE) %>%
    tibble() %>%
    filter(visitConfidence >= confidence_threshold)

  # get location from places
  df <- df_places %>%
    transmute(
      type = 'places',
      time_start = as.numeric(duration.startTimestampMs),
      lat = location.latitudeE7,
      lng = location.longitudeE7,
      time_end = as.numeric(duration.endTimestampMs)) %>%
    # convert to rows with start and end info
    pivot_longer(starts_with('time')) %>%
    drop_na() %>%
    select(type, time=value, lat, lng)

  # get raw place coordinates if there are any
  if ('simplifiedRawPath.points' %in% colnames(df_places) &&
      !all(map_lgl(df_places$simplifiedRawPath.points, is.null))) {
    df <- df_places %>%
      unnest(simplifiedRawPath.points) %>%
      transmute(
        type = 'raw',
        time = as.numeric(timestampMs),
        lat = latE7,
        lng = lngE7) %>%
      bind_rows(df)
  }
  return(df)
}

Now we just loop over all JSON files in our takeout data folder:

library(fs)

# list all files from takeout
fnames <- fs::dir_ls(
  path='data/Semantic Location History',
  type='file',
  recurse=TRUE)

# read files
df <- tibble()
for (fname in fnames) {
  df <- get_location_from_json(fname) %>%
    bind_rows(df, .)
}

Processing

We have the raw data, and we’re almost ready to have a look at it. However, before we do so, we should change the format of both the timestamps and the coordinates. Timestamps come as UNIX milliseconds. The lubridate package makes it easy to convert them to a human-readable date time. Latitude and longitude values are stored as integers scaled by a factor of 10^7, as hinted at by the original field names latitudeE7/longitudeE7. Dividing by 1e7 returns the more commonly used decimal degrees (°) format.

df <- df %>%
  mutate(
    time = as_datetime(time / 1e3),
    lat = lat / 1e7,
    lng = lng / 1e7)

Simple points on a 360 x 180 plot would work, but it would be much better to have a polygon map as a frame of reference for our observations. I opted for rnaturalearth because it offered a quick and convenient way to get an sf country shapefile into R. By the way, if you’re not familiar with sf: it is a relatively new geospatial R package that provides simple geometry features in a tidyverse compatible form. Check out this site which has tutorials, vignettes, presentations, cheat sheets, and a wiki!

library(rnaturalearth)
library(rnaturalearthdata)
world <- ne_countries(scale = "medium", returnclass = "sf")

Visualization

Now we can finally create a map of my location history:

library(ggthemes)  # assumption: theme_map() is taken from ggthemes here

df %>%
  ggplot() +
  geom_sf(data=world) +
  geom_point(aes(x=lng, y=lat), color='red') +
  theme_map()

This seems to be 99% correct. I can see that most observations are in the UK and the rest of Europe, where I spent most of my time between 2014 and today. I can also see the trips to the US, New Zealand, China, and Japan accurately. In an earlier version of this map, I had lots of in-flight observations from the North Atlantic near Greenland, the Island of Taiwan, and Indonesia. I got rid of these by excluding activity data altogether. There is also an observation in the South Pacific Ocean, and initially, I had no idea where that was coming from.

By manually cross-checking the Maps timeline, I found that it is related to a coastal road stop I did in the very south of New Zealand. The stop got mapped to “South Pacific” as a Google Maps Place with a visitConfidence of just 54. I decided to exclude low confidence visits (threshold of 75), which solved the issue.

Country Information

The goal of this project was to get a breakdown of time spent in different countries. I mentioned earlier that I was hoping to be able to just extract that information from the address field in the places data but that it didn’t really work. Instead, I chose the slightly more involved route of intersecting coordinates with country shapefiles. We can use the same sf object from rnaturalearth that was already loaded above. Let’s create a breakdown of places visited per country.

library(sf)

# convert to sf points
pnts <- st_as_sf(df, coords = c('lng', 'lat'), crs = st_crs('WGS84'))
# intersect with countries
pnts_in_countries <- st_intersection(pnts, world)

# breakdown of records per country
pnts_in_countries %>%
  as_tibble() %>%
  group_by(iso_a3) %>%
  summarize(n_places=n()) %>%
  arrange(desc(n_places))

# A tibble: 25 x 2
   iso_a3 n_places
   <chr>     <int>
 1 GBR       11438
 2 USA        1336
 3 DEU         819
 4 NZL         376
 5 HKG         270
 6 FRA         215
 7 ISR         101
 8 JPN          77
 9 GRC          62
10 RUS          61
# ... with 15 more rows

25 countries sound about right.
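One thing to keep in mind is that st_intersection() only returns points that fall inside a polygon, so observations just off the coast silently disappear. As a hedged alternative, the sketch below uses st_join() with st_intersects on made-up squares standing in for countries; unmatched points are kept with a missing country code so they can be inspected. The polygons, codes, and coordinates are purely hypothetical.

```r
library(sf)

# two made-up square "countries" (hypothetical, not real borders)
square <- function(xmin, ymin, size = 1) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmin + size, ymin),
    c(xmin + size, ymin + size), c(xmin, ymin + size),
    c(xmin, ymin))))
}
toy_world <- st_sf(
  iso_a3 = c("AAA", "BBB"),
  geometry = st_sfc(square(0, 0), square(2, 0), crs = 4326))

# three made-up observations, the last one lies outside both squares
toy_pnts <- st_as_sf(
  data.frame(id = 1:3, lng = c(0.5, 2.5, 10), lat = c(0.5, 0.5, 10)),
  coords = c("lng", "lat"), crs = 4326)

# left join: points without a matching polygon are kept with iso_a3 = NA
joined <- st_join(toy_pnts, toy_world, join = st_intersects)
unmatched <- sum(is.na(joined$iso_a3))
```

Counting the unmatched rows gives a quick sense of how many observations the country breakdown would otherwise drop.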

Border Crossings

Finally, the breakdown with immigration / emigration dates. I get those using the lag() function from dplyr and comparing the country of the previous place to the country of the current place as below:

# get immigration / emigration events
pnts_in_countries %>%
  as_tibble() %>%
  arrange(time) %>%
  transmute(
    date = as_date(time),
    country = iso_a3) %>%
  transmute(
    date_from = lag(date),
    date_to = date,
    country_from = lag(country),
    country_to = country) %>%
  filter(country_from != country_to)

# A tibble: 141 x 4
   date_from  date_to    country_from country_to
   <date>     <date>     <chr>        <chr>     
 1 2014-05-10 2014-05-12 GBR          USA       
 2 2014-05-16 2014-05-18 USA          GBR       
 3 2014-05-29 2014-06-03 GBR          DEU       
 4 2014-06-03 2014-06-04 DEU          GBR       
 5 2014-07-09 2014-07-14 GBR          DEU       
 6 2014-07-14 2014-07-14 DEU          GBR       
 7 2014-12-19 2014-12-20 GBR          USA       
 8 2015-01-08 2015-01-09 USA          DEU       
 9 2015-01-10 2015-01-11 DEU          GBR       
10 2015-03-30 2015-04-02 GBR          DEU       
# ... with 131 more rows

Wow, 141 border crossings in total. That’s a bit more than what I would have thought. Then again, just driving from London to Cologne gets you three events (GBR -> FRA -> BEL -> DEU).
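If what you need is the length of each stay rather than the crossing events, the same lag() idea can be extended. The following is a minimal sketch on made-up data, not part of my original script: it starts a new stay whenever the country changes and then summarizes the first and last observed date per stay. The toy tibble and the column names stay_id, first_seen, last_seen, and days_observed are illustrative assumptions.

```r
library(dplyr)

# made-up, time-ordered observations (date and country code)
toy <- tibble(
  date = as.Date(c("2014-05-01", "2014-05-10", "2014-05-12",
                   "2014-05-16", "2014-05-18", "2014-05-30")),
  country = c("GBR", "GBR", "USA", "USA", "GBR", "GBR"))

stays <- toy %>%
  arrange(date) %>%
  # start a new stay whenever the country differs from the previous row
  mutate(stay_id = cumsum(country != lag(country, default = first(country)))) %>%
  group_by(stay_id, country) %>%
  summarize(
    first_seen = min(date),
    last_seen = max(date),
    .groups = "drop") %>%
  # days between first and last observation, both ends included
  mutate(days_observed = as.integer(last_seen - first_seen) + 1L)
```

Note that days_observed only covers the span between the first and last observation in a country, so travel days without a recorded place are not attributed to either side.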

Conclusion

You can see how this would have taken me ages to do from old emails and flight records, and I would have probably missed some trips!

DISCLAIMER: Use this code at your own risk. I do not guarantee correctness, especially not when applying it to your own data. Please DOUBLE-CHECK MY WORK and let me know if you run into any issues. Some caveats that I am aware of: exact arrival dates can be inaccurate. Google Maps might not register a place on your timeline on the day of departure or right after arrival. An excellent way to check this is to look at records that don’t have the same date_from and date_to value (see the first row in my data). Also, I think that all dates and times are in GMT, and so the day you entered or left a given country may be captured incorrectly in local time. I might return later to work on that further.
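To illustrate the time zone caveat, here is a small sketch with a made-up timestamp. It assumes that the raw timestamps are UTC and that the local zone of interest is America/New_York; both are assumptions for the sake of the example.

```r
library(lubridate)

# a made-up UNIX timestamp in milliseconds
ts_ms <- 1399861800000

# interpreted as UTC
time_utc <- as_datetime(ts_ms / 1e3)
date_utc <- as_date(time_utc)

# the same instant expressed in an assumed local time zone
time_local <- with_tz(time_utc, tzone = "America/New_York")
date_local <- as_date(time_local)
```

Here the same observation falls on 12 May 2014 in UTC but on 11 May 2014 in New York, which is exactly the kind of one-day shift to watch out for around arrival and departure dates.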

Thank you so much for reading! Let me know in the comments below if you found it helpful and what you would like to read about next!

Building Our Own Open Source Supercomputer with R and AWS

Sun, 03 Feb 2019 18:00:00 +0000

How to build a scaleable computing cluster on AWS and run hundreds or thousands of models in a short amount of time. We will completely rely on R and open source R packages. This is post 1 out of 2.

Introduction

An ever-increasing number of businesses are moving to the cloud and using platforms such as Amazon Web Services (AWS) for their data infrastructure. This is convenient for Data Scientists like myself because this convergence of tools means that my knowledge from previous jobs becomes much more applicable to a new role and I can hit the ground running.

Lately I have become very excited about the future package and how it makes the scaling of computational tasks easy and intuitive. The basic idea of the future package is to make your code infrastructure independent. Specify your tasks and the future execution plan decides how to run the calculations.
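As a minimal sketch of that separation between task and execution plan: the function slow_square() below is a made-up stand-in for an expensive computation, and the sequential plan is chosen only so that the example runs anywhere. Swapping the plan for a parallel or cluster backend should not require touching the task itself.

```r
library(future)

# the task: defined once, independent of where it will run
slow_square <- function(x) {
  Sys.sleep(0.1)  # stand-in for an expensive computation
  x^2
}

# the execution plan: sequential here, could be a parallel or cluster plan
plan(sequential)

# create the future and collect its value
f <- future(slow_square(4))
result <- value(f)
```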

I wanted to see what we could do with future and other open source R packages such as aws.ec2 by cloudyR, ssh by rOpenSci, remoter by Drew Schmidt, and last but not least furrr by Davis Vaughan.

The basic idea:

use R and AWS to spin up our own cloud compute cluster
log in to the head node and define a computationally expensive task
farm this task out to a number of worker nodes in our cluster
do all of this WITHOUT having to switch between RStudio, RStudioServer, the command line, the AWS console, etc.

Why do I care about the last point? Well, Data Science is a science and should rely on the Scientific Method. One core component of the Scientific Method is reproducibility, and one of the best ways to keep your Data Science workflow reproducible is to write code that can run start to finish without any user intervention. This also allows for greater applicability in the future because you can re-use your previous data product or service in the next project without retracing manual steps. Don’t just take my word for it, here is another great Hadley Wickham video in which he stresses the same point:

So without further ado, let’s get started implementing that bullet point list!

Preparation

There are a few basic requirements that need to be in place:

an active AWS account.
an Amazon Machine Image (AMI) with R, remoter, tidyverse, future, and furrr installed.
a working ssh key pair on your local machine and the AMI that allows you to ssh into and between your ec2 instances.

Detailed instructions on how to fulfil these basic requirements are beyond the scope of this post. You can find more information in the articles linked below.

Setup

Load the required packages. Also make sure your AWS access credentials are set. I do this using Sys.setenv. There are other ways, but I found that this works best for me. We also specify the AMI ID and the instance type (this is a good overview; I am using t2.micro here because it is free). If you have any problems with this step, double-check that the region set in Sys.setenv matches the region of your AMI.

library(aws.ec2)
library(ssh)
library(remoter)
library(future)  # provides plan() and future() used further down
library(tidyverse)

# set access credentials
aws_access <- aws.signature::locate_credentials()
Sys.setenv(
  "AWS_ACCESS_KEY_ID" = aws_access$key,
  "AWS_SECRET_ACCESS_KEY" = aws_access$secret,
  "AWS_DEFAULT_REGION" = aws_access$region
)

# set parameters
aws_ami <- "ami-06485bfe40a86470d"
aws_describe <- describe_images(aws_ami)
aws_type <- "t2.micro"

Ready for launch!

Boot and Connect

We can now fire up our head-node instance.

ec2inst <- run_instances(
  image = aws_ami,
  type = aws_type)

# wait for boot, then refresh description
Sys.sleep(10)
ec2inst <- describe_instances(ec2inst)

# get IP address of the instance
ec2ip <- get_instance_public_ip(ec2inst)
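A fixed Sys.sleep(10) can be too short on a slow boot and wasteful on a fast one. As a hedged alternative, the sketch below defines a hypothetical helper wait_until() that polls a check function until it returns TRUE or a timeout is reached. The helper and the toy check are my own illustration and not part of aws.ec2; in practice the check would wrap whatever instance status query you use.

```r
# hypothetical helper: call check() until it returns TRUE or we time out
wait_until <- function(check, timeout = 120, interval = 5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(check())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# toy check that succeeds on the third call,
# standing in for an instance status query
calls <- 0
ready <- wait_until(
  check = function() {
    calls <<- calls + 1
    calls >= 3
  },
  timeout = 5,
  interval = 0.1)
```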

The instance should be running and we can connect to it via ssh in bash. That works, but personally I’d prefer to stay in RStudio instead of switching to the command line. This is where remoter and ssh come in. We can establish an ssh connection straight from our R session and use that to launch the remoter::server on our instance. By using the future package to run the ssh command we keep our interactive RStudio session free and can subsequently use it to connect to the instance with remoter.

# ssh connection
username <- system("whoami", intern = TRUE)
con <- ssh_connect(host = paste(username, ec2ip, sep = "@"))

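# helper function for a random temporary password
# (generate_password() is not defined in this post; this is a minimal stand-in)
generate_password <- function(n = 16) {
  paste(sample(c(letters, LETTERS, 0:9), n, replace = TRUE), collapse = "")
}
random_tmp_password <- generate_password()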
# CMD string to start remoter::server on instance
r_cmd_start_remoter <- str_c(
  "sudo Rscript -e ",
  "'remoter::server(",
  "port = 55555, ",
  "password = %pwd, ",
  "showmsg = TRUE)'",
  collapse = "") %>%
  str_replace("%pwd", str_c('"', random_tmp_password, '"'))

# connect and execute
plan(multicore)
x <- future(
  ssh_exec_wait(
    session = con,
    command = r_cmd_start_remoter))

remoter::client(
  addr = ec2ip,
  port = 55555,
  password = random_tmp_password,
  prompt = "remote")

Et Voila! We are connected to our remote head node and can run R code in the cloud without ever leaving the comfort of RStudio. And the amazing bit: all of this took me about a day to set up from scratch!

I will leave it here for now. In the next post we will dive into the details of how to scale up the approach above to create an AWS cloud computing cluster. This approach is extremely powerful for embarrassingly parallel problems (which are actually not embarrassing at all, I swear!)
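As a small preview of what an embarrassingly parallel task looks like with furrr, here is a hedged sketch that runs locally. The function fit_one() and the resampling of mtcars are made up for illustration, and the sequential plan is only a placeholder for a cluster plan that points at worker nodes.

```r
library(future)
library(furrr)

# placeholder plan; swap for a cluster plan once worker nodes exist
plan(sequential)

# made-up task: fit one small model on a resampled dataset
fit_one <- function(i) {
  idx <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[idx, ]))[["wt"]]
}

# each iteration is independent, so it can be farmed out to any worker
slopes <- future_map_dbl(
  1:20, fit_one,
  .options = furrr_options(seed = 123))
```

Because no iteration depends on another, the only thing that changes when scaling up is the plan.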

As always, I hope it is useful for you. I’d very much appreciate any thoughts, comments, and feedback so write me a message below or get in touch via twitter!

Nesting Birds and Models in R Dataframes

Sat, 15 Dec 2018 09:30:00 +0000

R Dataframes in the tidyverse are more than just simple tables these days. They can store complex information in list columns, and this becomes an immensely powerful framework when we use it to apply methods to different sets of data in parallel. In this article I illustrate this approach using data for a rare UK bird species to investigate if its distribution has been impacted by climate change.

Motivation

After recently seeing a Hadley Wickham lecture on nested models I became incredibly excited about nested dataframes with s3 objects in list columns again. Here is the video I am talking about:

Hadley uses this approach for data exploration but I think it is also very powerful for iterative workflows and for experimentation or hypothesis testing on large datasets. For example, when working on my PhD thesis I was routinely fitting hundreds of machine learning models at once. All models used the same predictor set and only varied in hyperparameters as well as label data. Yet, I had to run them in separate parallel processes and load the data into each of these. Moreover, when capturing results I often looked to the list class for help. This did the job but also meant that I had to be very careful about which results belonged to which data, which hyperparameters, and which model object.

Enter nested dataframes. They still rely on the list class, but they nicely organise the corresponding data elements together, in accordance with the tidy data framework.
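As a minimal sketch of the idea, using the built-in mtcars data rather than the bird data from this post: each row holds one group, the data for that group, the model fitted on it, and a summary statistic of that model. The choice of lm(mpg ~ wt) is arbitrary and only serves as an illustration.

```r
library(tidyverse)

nested <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    # one model per group, stored next to the data it was fitted on
    model = map(data, ~ lm(mpg ~ wt, data = .x)),
    # a per-model summary statistic in a regular column
    r_squared = map_dbl(model, ~ summary(.x)$r.squared))
```

Data, model object, and result never get separated, which removes the bookkeeping I described above.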

Data

I decided to explore this framework hands-on, using a small exemplary case study in the domain of species distribution modelling. This is what the models I mentioned earlier were. For this type of modelling task we need species occurrence data (our “label”, “response”, or Y) and climatic variables (the “predictors”, or X).

Species Data

After browsing the web for a suitable case study species for a while I decided on the Scottish Crossbill (Loxia scotica). This is a small passerine bird that inhabits the Caledonian Forests of Scotland, and is the only terrestrial vertebrate species unique to the United Kingdom. Only ~ 20,000 individuals of this species are alive today.

Getting species occurrence data used to be the main challenge in Biogeography. Natural Historians such as Charles Darwin and Alexander von Humboldt would travel for years on rustic sail ships around the globe collecting specimens. Today, we are standing on the shoulders of giants. Getting data is fast and easy thanks to the work of two organisations:

the Global Biodiversity Information Facility (GBIF), an international network and research infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth. We will use their data in this project.
rOpenSci, a non-profit initiative that has developed an ecosystem of open source tools, runs annual unconferences, and reviews community developed software. They provide an R package called rgbif that I once made a humble contribution to. It is essentially a wrapper around the GBIF API that will help us pull the species data straight into R.

library(tidyverse)
library(rgbif)

# get the database id ("key") for the Scottish Crossbill
speciesKey <- name_backbone('Loxia scotica')$speciesKey

# get the occurrence records of this species
gbif_response <- occ_search(
  scientificName = "Loxia scotica", country = "GB",
  hasCoordinate = TRUE, hasGeospatialIssue = FALSE,
  limit = 9999)

# backup to reduce API load
write_rds(gbif_response, here::here('gbif_occs_loxsco.rds'))

GBIF and rOpenSci just saved us years of roaming around the highlands with a pair of binoculars, camping in mud, rain, and snow, and chasing crossbills through the forest. Nevertheless, it is still up to us to make sense of the data we got back, in particular to clean it, as data collected on this large scale can have its own issues. Luckily, GBIF provides some useful metadata on each record. Here, I will keep only those records that

are tagged as “present” (others may be artifacts from collections)
don’t have any flagged issues (nobody has noticed anything abnormal with them)
are under a Creative Commons license (we can use them here)
are from 1965 or later

After cleaning the data we use tidyr::nest() to aggregate the data by decade.

library(lubridate)
birds_clean <- gbif_response$data %>%
  # get decade of record from eventDate
  mutate(decade = eventDate %>% ymd_hms() %>% round_date("10y") %>% year() %>% as.numeric()) %>%
  # clean data using metadata filters
  filter(
    # only creative commons license records
    str_detect(license, "http://creativecommons.org/") &
    # only records with no issues
    issues == "" &
    # no records before 1965
    decade >= 1970 &
    # no records after 2015 (there is not a lot of data yet)
    decade < 2020) %>%
  # retain only relevant variables
  select(decimalLongitude, decimalLatitude, decade) %>% arrange(decade)

birds_nested <- birds_clean %>%
  # define the nesting index
  group_by(decade) %>%
  # aggregate data in each group
  nest()

# let's have a look
glimpse(birds_nested)

Climate data

For the UK the MetOffice had some nice climatic datasets available. They were in a horrible format (CSV with timesteps, variable types, and geospatial information spread across rows, columns, and file partitions) but I managed to transform them into something useable. The details of this are beyond the scope of this post, but if you are interested in the code for that you can check it out here.

The final rasters look like this:

Modelling

We’ll split the data into a training and a test set, with a true temporal holdout of all data collected between 2005 and 2015.

# last pre-processing step
df_modelling <- df_nested %>%
  # get into modelling format
  unnest() %>%
  # caret requires a factorial response variable for classification
  mutate(presence = case_when(
    presence == 1 ~ "presence",
    presence == 0 ~ "absence") %>%
    factor()) %>%
  # drop all observations with NA variables
  na.omit()

# create a training set for the model build
df_train <- df_modelling %>%
  # true temporal split as holdout
  filter(decade != "2010") %>%
  # drop decade, it's not needed anymore
  dplyr::select(-decade)

# same steps for test set
df_test <- df_modelling %>%
  filter(decade == "2010") %>%
  dplyr::select(-decade)

Species responses to environmental variables are often non-linear. For example, a species usually can’t survive if it is too cold, but it can’t deal with too much heat either. It needs the “sweet spot” in the middle. Linear models like a glm are not very useful under these circumstances. On the other hand, algorithms such as random forest can easily overfit to this kind of data. I therefore decided to test a regularised random forest (RRF) as introduced by Deng (2013), hoping that it would offer just the right ratio of bias vs variance for this use case.
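The “sweet spot” argument can be illustrated with simulated data. The sketch below is not the model used in this post: it simulates a hypothetical species with an assumed optimum at 15 degrees and compares a glm with a purely linear temperature term against one that adds a squared term. All numbers are made up.

```r
set.seed(42)

# simulated species with an assumed temperature optimum at 15 degrees
temperature <- runif(1000, min = 0, max = 30)
p_presence <- exp(-((temperature - 15)^2) / (2 * 4^2))
presence <- rbinom(1000, size = 1, prob = p_presence)

# linear term only: the response can only rise or fall with temperature
fit_linear <- glm(presence ~ temperature, family = binomial)

# adding a squared term lets the response rise and then fall again
fit_quadratic <- glm(
  presence ~ temperature + I(temperature^2),
  family = binomial)

aic_linear <- AIC(fit_linear)
aic_quadratic <- AIC(fit_quadratic)
```

The linear-only model cannot represent a peak in the middle of the temperature range, which shows up as a much worse (higher) AIC than the model with the squared term.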

Caret makes the model fitting incredibly easy! All we need to do is specify a tuning grid of hyperparameters that we want to optimise, a tune control that adjusts the number of iterations and the loss function used, and then call train with the algorithm we have picked.

library(caret)  # provides train() and trainControl()
library(RRF)

# for reproducibility
set.seed(12345)

# set up model fitting parameters
# tuning grid, trying every possible combination
tuneGrid <- expand.grid(
  mtry = c(3, 6, 9),
  coefReg = c(.01, .03, .1, .3, .7, 1),
  coefImp = c(.0, .1, .3, .6, 1))
tuneControl <- trainControl(
  method = 'repeatedcv',
  classProbs = TRUE,
  number = 10,
  repeats = 2,
  verboseIter = TRUE,
  summaryFunction = twoClassSummary)
# actual model build
model_fit <- train(
  presence ~ .,
  data = df_train,
  method = "RRF",
  metric = "ROC",
  tuneGrid = tuneGrid,
  trControl = tuneControl)

plot(model_fit)

We can evaluate the performance of this model on our hold-out data from 2005 to 2015. Just as during training, we are using the Area under the Receiver Operating Characteristic curve (AUC). With this metric, a model no better than random would score 0.5, while a perfect model making no mistakes would score 1.

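# combine prediction with validation set
df_eval <- tibble(
  "obs" = df_test$presence,
  "pred" = predict(
    object = model_fit,
    newdata = df_test,
    type = "prob") %>%
    # first column = probability of the first factor level ("absence"),
    # which yardstick treats as the event by default
    pull(1))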

# get ROC value
library(yardstick)
roc_auc_vec(estimator = "binary", truth = df_eval$obs, estimate = df_eval$pred)
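To make the reference points of 0.5 and 1 more tangible, here is a small base R sketch that computes the AUC through its rank-sum (Mann-Whitney) formulation. This is an illustration of the metric on made-up scores, not the implementation used by yardstick; the function name auc_rank() is hypothetical.

```r
# AUC via the rank-sum (Mann-Whitney) formulation
auc_rank <- function(truth, score) {
  r <- rank(score)  # tied scores receive average ranks
  n_pos <- sum(truth)
  n_neg <- sum(!truth)
  (sum(r[truth]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

truth <- c(TRUE, TRUE, FALSE, FALSE)

# every positive is scored above every negative
auc_perfect <- auc_rank(truth, score = c(0.9, 0.8, 0.2, 0.1))

# an uninformative model that gives everyone the same score
auc_constant <- auc_rank(truth, score = c(0.5, 0.5, 0.5, 0.5))
```

The perfect ranking yields an AUC of 1, while the constant score yields 0.5.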

Now we can combine the raw data, model performance, and predictions all in one nested dataframe. We can save this for later to make sure we always know what data was used to build which model.

df_eval <- df_modelling %>%
  group_by(decade) %>% nest() %>%
  # combine with climate data
  left_join(climate_nested, by = "decade") %>%
  # evaluate by decade
  mutate(
    "obs" = map(
      .x = data,
      ~ .x$presence),
    "pred" = map(
      .x = data,
      ~ predict(model_fit, newdata = .x, type = "prob") %>% pull("presence")),
    "auc" = map2_dbl(
      .x = obs,
      .y = pred,
      ~ roc_auc_vec(
          estimator = "binary",
          truth = .x,
          estimate = .y,
          # pred holds P(presence), which is the second factor level
          event_level = "second")),
    "climate_data" = map(
      .x = raster_stacks,
      ~ as(.x, "SpatialPixelsDataFrame") %>%
        as_tibble() %>%
        na.omit()),
    "habitat_suitability" = map(
      .x = climate_data,
      ~ predict(model_fit, newdata = .x, type = "prob") %>% pull("presence"))
    )

df_eval

Conclusion

Let’s look at the change over time using gganimate. Unfortunately, we can see that the suitable area for the species in the UK is drastically decreasing after 1985. Not all species are negatively affected by climate change but many are. And this is just one of the many unintended consequences of our impact on planet earth.

I hope that you enjoyed this blog post despite our pessimistic findings. As you can see nested dataframes with list columns are immensely powerful in a range of situations. I will certainly use them a lot more in the future. Please let me know in the comments if you are, too!

Data Science Machine and Command Line Setup

Fri, 12 Oct 2018 13:00:00 +0000

Data Scientists require a very particular toolset for their everyday tasks, but unlike software developers, few of them spend a lot of time optimising this toolset for their specific needs. I compiled a simple step-by-step guide that helps to automate the process of setting up a brand new data science machine and making it work for you by customising the command prompt and using a dotfile approach to manage configuration, identity, and access information. This gets you from zero to Data Science in minutes on MacOS.

I’ve had to set up new data science laptops twice in the last couple of months and got frustrated with the tedious setup procedures. Installing libraries, customising settings, how do I switch RStudio to night mode again? Moreover, I have two new starters joining my team in the coming weeks which means that more system setups are just around the corner. So I decided to compile a guide with scripts and commands that make this process smoother and faster.

There are many things to be said for an automatic setup over manual installation. Speed, reproducibility, a standardised configuration between all team members, and the opportunity for programmatic customisation. Among software developers this approach, called .dotfile configuration, is common practice and great introductions are available here and here. However, so far I have only rarely encountered it on data science teams. This is despite the fact that data scientists frequently work with complex statements at the command line, have to pay particular attention to system setup to ensure reproducibility of their experiments, use version control, and commonly deal with data from a wide range of sources, many of which will require API tokens or access credentials. So think of this as a data science specific dotfile setup. There are three main components to this approach:

using command line tools and package managers instead of graphic installers to automate first-time system setup, because this is faster, more reproducible, and more easily maintainable.
set up a beautiful, efficient, and powerful command line configuration, because it will make everyday tasks easier, because it’s awesome and because we can!
create a .dotfile repository that saves settings, application preferences, api keys, and access tokens, because it is more convenient and more secure than glueing post-its to our monitor or hard-coding passwords and tokens into our code that is then pushed to GitHub.
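On the R side, the last point could look something like the following sketch: instead of hard-coding a token into a script, the code reads it from an environment variable that a private dotfile exports. The helper get_secret() and the variable name MY_API_TOKEN are hypothetical and only serve as an illustration.

```r
# hypothetical helper: fetch a secret from the environment,
# and fail loudly if it is missing
get_secret <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable '", name, "' is not set", call. = FALSE)
  }
  value
}

# in practice the variable would be exported from a private dotfile;
# it is set here only so that the example runs on its own
Sys.setenv(MY_API_TOKEN = "not-a-real-token")
token <- get_secret("MY_API_TOKEN")
```

The script itself then contains no secrets and can be pushed to GitHub safely.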

Most parts of this article can be used in isolation, so unlike the British Prime Minister you are free to “cherry-pick” if you are so inclined.

I am assuming here that you’re using MacOS. Parts of it may be transferable to a linux machine, much of it will need modification. If you’re on Windows… good luck! It may work with the new Ubuntu for Windows? If you get a chance to test this, please let me know in the comment section below.

Initial Setup

We start off by installing Homebrew, the “missing package manager for MacOS”! This bit actually requires some user input ( and ), so we will split that from the rest of the basic installations.

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Once that is done we can use homebrew for some additional household essentials.

# we will need these later
brew install wget htop git git-lfs libgit2 keychain

# I like these, so I'll install them here as well
brew cask install google-chrome atom slack vlc spotify dropbox
# You can launch and configure apps like this
open ~/Applications/Dropbox.app

# install gcc and java,
# a lot of the data science tools we will install later depend on them
# (some of these may require your password again)
brew install gcc

brew tap caskroom/versions

brew cask install java
brew cask install java8
brew install jenv

Powerlevel9k Command Line

Now it’s time to beef up our command line. This is something that many software developers and engineers spend a lot of time on, to the point where some are holding competitions to show off their great shells. Many Data Scientists, on the other hand, seem to neglect command line customisation. I think that this is a mistake. Let me convince you by highlighting some of the neat extra features that we can add with a little bit of extra setup effort:

beautiful command prompt
syntax highlighting
auto completion
read/write flags
execution timing
git support with repo status tracking

The most nerdy set of productivity tools on the block! To make this work we will need iTerm2 and zsh. iTerm2 is a macOS terminal replacement with many additional features, such as more display customisation, better hotkeys, and fantastic split pane functionality. Zsh is a shell designed for interactive use. It works particularly well with oh-my-zsh, a configuration tool that helps with setting up everything just the way we like it. They have great stickers, too ;)

While we’re at it we will also install the Powerline terminal fonts, which will be needed for powerlevel9k, the zsh theme of my choosing.

# install iTerm2
brew cask install iterm2

# install zsh
brew install zsh

# get oh-my-zsh configuration tool
sh -c "$(curl -fsSL https://raw.github.com/robbyrussell/oh-my-zsh/master/tools/install.sh)"
# (this may require your password again)

# get powerlevel9k theme for zsh
git clone https://github.com/bhilburn/powerlevel9k.git ~/.oh-my-zsh/custom/themes/powerlevel9k
# and the corresponding font
# (the URL is quoted so that the shell does not treat the "?" as a glob character)
wget -O /Library/Fonts/font_sourcecodepro_powerline_awesomeregular.ttf 'https://github.com/Falkor/dotfiles/blob/master/fonts/SourceCodePro+Powerline+Awesome+Regular.ttf?raw=true'

The oh-my-zsh installation script changes your default shell to zsh and creates the file .zshrc. Just like .bash_profile for bash, this file is automatically sourced when a new zsh session is launched. From now on you should always use .zshrc instead of .bash_profile, for example when setting a new standard conda environment. Notice that .zshrc comes with a lot of options that are commented out. Feel free to go through the file and uncomment the modifications that may be of interest to you.

You should also add iTerm2 to the dock bar and/or assign a hot key of your choosing. Change the colour scheme (Menu bar > Profiles > Open Profiles... > Select "Default" > Edit Profiles...) as you see fit. Definitely change the font to SourceCodePro+Powerline+Awesome Regular. This last step is important as POWERLEVEL9K WON’T WORK PROPERLY WITHOUT THIS and you will end up with cryptic symbols on your prompt instead.

If you don’t have strong feelings about colour style preference, feel free to use my profile template. You can install it as a dynamic profile with the command below. DynamicProfiles enable you to share your preferences between different machines. You can create your own by exporting your profile from the profile menu to a JSON file and copying it to the same location:

# copy the profile settings for iTerm2 to DynamicProfiles folder
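# (write to a file inside the folder and request the raw file rather than the GitHub HTML page)
wget -O ~/Library/Application\ Support/iTerm2/DynamicProfiles/iterm_profile.json 'https://github.com/JanLauGe/.dotfiles/blob/master/iterm_profile.json?raw=true'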

There is a wide range of plugins available for iTerm2 and zsh. I automatically add a few that I find useful by installing them with homebrew. Afterwards I add them to the .zshrc configuration file with sed or by appending (>>) a string to the end of the file.

In case you’re unfamiliar with these commands: sed looks for a string in a file using a regular expression and replaces each match with a replacement string. The in-place flag -i '' is Mac specific and tells sed to overwrite the old file with the updated version. The >> operator appends to a file, or creates the file if it doesn’t exist.

Side note: Alternatively, we could just copy a pre-existing .zshrc but I felt that adding lines using sed keeps things more transparent and allows for more of a mix-and-match approach where you can choose the bits you like and leave out the ones that are not useful to you.

# change zsh theme to powerlevel9k
sed -i '' 's/ZSH_THEME="robbyrussell"/POWERLEVEL9K_MODE='awesome-patched'\
ZSH_THEME="powerlevel9k\/powerlevel9k"/g' .zshrc

# Auto suggestions (for Oh My Zsh) suggest commands from your terminal history
# as you type. You just have to press → to accept the full suggestion!
# Note: $ZSH_CUSTOM/plugins path is by default ~/.oh-my-zsh/custom/plugins
brew install zsh-autosuggestions zsh-syntax-highlighting

# Add the plugins to the list of plugins in ~/.zshrc configuration file :
sed -i '' '/^plugins=(/  a\
 \ \ zsh-autosuggestions \
 \ \ web-search \
 \ \ jsontools \
 \ \ macports \
 \ \ node \
 \ \ osx \
 \ \ sudo \
 \ \ thor \
 \ \ docker \
' .zshrc

# set default user in .zshrc to avoid the nasty username@machine prompt
echo 'export DEFAULT_USER="$(whoami)"' >> .zshrc
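
One caveat with the sed and >> steps used above: running the setup script a second time appends the same lines to .zshrc again. As a minimal sketch of the same substitute-and-append pattern in Python, assuming we are happy to rewrite the whole file in one go, the two helpers below mimic `sed -i ''` and `>>` but skip lines that are already present. The function names `substitute` and `append_once` are my own invention for illustration and not part of any tool mentioned in this post.

```python
import re
from pathlib import Path


def substitute(path, pattern, replacement):
    """Rough equivalent of `sed -i '' 's/pattern/replacement/g' path`."""
    path = Path(path)
    text = path.read_text()
    # replace every match of the regular expression and overwrite the file
    path.write_text(re.sub(pattern, replacement, text))


def append_once(path, line):
    """Like `echo line >> path`, but does nothing if the line already exists."""
    path = Path(path)
    existing = path.read_text().splitlines() if path.exists() else []
    if line not in existing:
        with path.open("a") as handle:
            handle.write(line + "\n")
```

Calling `append_once` twice with the same line leaves a single copy in the file, which makes the setup script safe to re-run.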

Data Science Essentials

Data science at the command line is great, but I doubt it will be enough to do all of your day-to-day tasks. We need R & Python, and while the GUI installers for RStudio and Anaconda make the installation child’s play, it would be nice to have it as part of this initial setup script as well. Moreover, I find myself accumulating eclectic collections of packages and libraries. Instead of reinstalling all of these manually I have included them here as well:

#### install anaconda
# May need updating for conda version
wget -O anaconda.sh https://repo.anaconda.com/archive/Anaconda3-5.3.0-MacOSX-x86_64.sh
bash anaconda.sh
rm anaconda.sh
# append conda path to zsh profile
# ($HOME instead of ~ because the tilde is not expanded inside double quotes)
echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.zshrc
# reload profile
source ~/.zshrc

# create new anaconda virtual environments
conda update conda
conda config --add channels conda-forge
conda create --name dev2 python=2.7
conda create --name dev3 python=3.6
# and switch to it to avoid using the system python
source activate dev3
# do this every time we start a new session
# assuming you want to use python3 by default
echo 'source activate dev3' >> ~/.zshrc
# Install a few libraries that do not ship with anaconda
pip install awscli tensorflow tensorflow-gpu keras


#### install R and RStudio
# this is required for some advanced plotting
brew cask install xquartz # (will need password again)
brew install --with-x11 r
brew cask install --appdir=/Applications rstudio
# Note the --appdir option which will use /Applications instead of ~/Applications


# set up rJava; this can be a pain!
# I used these instructions: https://zhiyzuo.github.io/installation-rJava/
# consult google if you get stuck here

# set java environmental variables for the profile
echo 'export PATH="$HOME/.jenv/bin:$PATH"' >> ~/.zshrc
# (you may need to update version number here)
echo 'export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home"' >> ~/.zshrc
echo 'eval "$(jenv init -)"' >> ~/.zshrc
source ~/.zshrc
# make sure to set this to the version that you installed (`java -version`)
jenv add /Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home
jenv global oracle64-1.8.0_181
# prepare installation and install rJava by building from source
R CMD javareconf
Rscript -e "install.packages('rJava',\
  repos='http://cran.us.r-project.org',\
  type='source')"

# install R packages
Rscript -e "install.packages(c(\
  'cluster','crayon','crosstalk','curl','CVST','data.table','DBI',\
  'devtools','doMC','dtplyr','foreach','foreign','ggplot2','ggthemes','glmnet',\
  'haven','here','htmltools','htmlwidgets','httr','igraph','jsonlite','knitr',\
  'labeling','lattice','lazyeval','leaflet','lubridate','magrittr','markdown',\
  'mime','praise','psych','purrr','raster','RColorBrewer','Rcpp','readr',\
  'rmarkdown','rpart','rvest','scales','shiny','stringr','survival','testthat',\
  'units','viridis','xml2','aws.s3','checkmate','feather','future',\
  'gapminder','keras','lintr','plotly','plotROC','prettyunits','pROC','progress',\
  'randomForest','ranger','reticulate','rJava','RJDBC','RJSONIO','RODBC',\
  'roxygen2','RPostgreSQL','Rtsne','slackr','sf','stringdist','tensorflow',\
  'text2vec','vegan','xgboost','XML','tidyverse'),\
  repos='http://cran.us.r-project.org')"
# This library for snowflake is only available on github
Rscript -e "library(devtools); install_github('snowflakedb/dplyr-snowflakedb')"

Consider adding /bin/zsh to your RStudio global options under Global Options... > Terminal > Custom shell binary to keep your RStudio Terminal sessions in tune with the custom terminal we set up here.

Settings and Access

So now we are done with the basic setup on our local machine. However, there are still ssh keys, API access tokens, and config files to configure. This can take a lot of time and energy, and having different tokens on different machines can be confusing or even unsafe (I have seen far too many people hard-code their AWS credentials into their notebooks!).

I’ve therefore gone for an approach of creating a folder with all the files for identity management and protecting it with a single strong master password. For obvious reasons I will not go into too much detail on my exact approach to this, but let’s just say that we have synced all our identity files to a local folder called .dotfiles. From there we can sync them into our home directory, as succinctly explained by Ajmal Siddiqui in this post.

# -a recurses into the folder, the trailing slash copies its contents
rsync -a ~/.dotfiles/ ~/

and since we want to do that whenever we start a new terminal session:

echo 'rsync -a ~/.dotfiles/ ~/' >> ~/.zshrc

This will synchronise all files in the .dotfiles folder to the home directory where they are available to the various applications or our custom scripts that may use them. Files that I now use this for include:

.ssh - ssh keys for GitHub, AWS, etc.
.aws - AWS credentials needed for the aws cli
.gitconfig - To track my contributions to version controlled code bases
.kaggle.json - Access token to use the new Kaggle API
.google - Access token for the google maps SDK that I used here
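
For anyone who prefers to keep this step in Python rather than rsync, here is a rough sketch of the same idea using only the standard library. It assumes the identity files sit in a folder and should be copied one level up into a target directory; the function name `sync_dotfiles` is hypothetical and the sketch does not handle file permissions or deletions the way rsync can.

```python
import shutil
from pathlib import Path


def sync_dotfiles(source, target):
    """Copy every file and folder inside `source` into `target`."""
    source, target = Path(source), Path(target)
    copied = []
    for item in sorted(source.iterdir()):
        destination = target / item.name
        if item.is_dir():
            # merge into an existing folder instead of failing (Python 3.8+)
            shutil.copytree(item, destination, dirs_exist_ok=True)
        else:
            # copy2 also preserves timestamps where possible
            shutil.copy2(item, destination)
        copied.append(item.name)
    return copied
```

A call such as `sync_dotfiles(Path.home() / ".dotfiles", Path.home())` would mirror the rsync command, with existing files of the same name being overwritten.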

So that’s all! As always, I hope it is useful for someone. Please let me know any thoughts you may have in the comments below. Also, follow me on Twitter, connect with me on LinkedIn, and feel free to email me.

fastai Deep Learning Image Classification

Wed, 02 May 2018 08:00:00 +0000

Here I summarise learnings from lesson 1 of the fast.ai course on deep learning. fast.ai is a deep learning online course for coders, taught by Jeremy Howard. Its tag line is to “make neural nets uncool again”. I started the class a couple of days ago and have been impressed with how fast it got me to apply the methods, an approach described by them as top-down learning. I am writing this blog post to document and reflect on the things that I learned and to help other people that may be interested in getting started with the class.

The fast.ai library is a high-level library based on PyTorch, which tries to take a selection of best-practice approaches from cutting edge deep learning research and make them into a collection of intelligent default settings.

Lesson 1 outlined the fundamentals of computer vision and building image classification models. My homework: get my hands on my own image dataset and use it to train a classifier myself. I chose to attempt a classifier that can distinguish between sharks and dolphins, using images from Google image search. Read along for a detailed walk-through below.

Getting an Image Dataset

This handy tool allows us to get images directly from google image search.

$ pip install google_images_download

For more than 100 images we also need to install chromedriver and dependencies:

$ apt-get install python-selenium python3-selenium libxi6 libgconf-2-4 chromium-chromedriver

Note that I had some problems with accessing the chrome driver from jupyter notebooks. Changing the permissions of the file with chmod 777 /usr/local/bin/chromedriver solved this for me.

Now we can run a query for images. Results are named as the number of the result plus the original file name, downloaded, and saved locally. Note that you should make sure to utilise the usage_rights flag in order to get images that are cleared for this type of use.

import os
import re

from google_images_download import google_images_download

os.chdir('/home/paperspace/data/sharksdolphins/')
response = google_images_download.googleimagesdownload()  

arguments = {
 "keywords": '"shark","dolphin"',
 "print_urls": False,
 "limit": "10000",
 "output_directory": "sharksdolphins/train/",
 "format": "jpg",
 "usage_rights": "labeled-for-nocommercial-reuse",
 "chromedriver": "/usr/local/bin/chromedriver"
}

response.download(arguments)

# func for renaming an image file
def image_rename(file):
    file_index = re.match(r'\A\d*', file).group(0)
    file_index = file_index.zfill(3)
    return file_index+'.jpg'

# func for renaming all files in a folder
def image_rename_all(folder):
    files = os.listdir(folder)
    [os.rename(folder+file, folder+image_rename(file)) for file in files]

# rename files
folders = ['data/sharksdolphins/train/shark/',
           'data/sharksdolphins/train/dolphin/']
[image_rename_all(folder) for folder in folders]

The above query returned about 800 images for each class. I ended up speeding through these images and removing unsuitable ones manually.

We need a training and a validation set. Making this work with fast.ai is easily done by adopting the recommended folder structure of a data folder with two sub-folders (train and valid), each of which has a subfolder with images for each class.

The code chunk below takes the images downloaded earlier and randomly splits them into 80% training data and 20% validation data. Don’t forget to set a random seed so that your results stay reproducible.

import numpy as np

# split files into training and validation sets
PATH = 'data/sharksdolphins/'
files_sharks = os.listdir(f'{PATH}train/shark')
files_dolphins = os.listdir(f'{PATH}train/dolphin')

# sample from each class to create a validation set
np.random.seed(1234)
files_sharks_val = np.random.choice(
    files_sharks,
    size=round(len(files_sharks) / 5),
    replace=False,
    p=None)
files_dolphins_val = np.random.choice(
    files_dolphins,
    size=round(len(files_dolphins) / 5),
    replace=False,
    p=None)

# move validation set images into validation folder
for label, files_val in [('shark', files_sharks_val),
                         ('dolphin', files_dolphins_val)]:
    # create the target folder first, os.rename fails if it is missing
    os.makedirs(f'{PATH}valid/{label}', exist_ok=True)
    for file in files_val:
        os.rename(f'{PATH}train/{label}/{file}', f'{PATH}valid/{label}/{file}')
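
One detail that is easy to miss: os.listdir returns files in arbitrary order, so a fixed seed alone does not guarantee the same split on another machine. The sketch below shows a standard-library variant of the 80/20 split that sorts the file names before sampling. It swaps numpy's random.choice for Python's random module, and the function name `split_train_valid` is made up for this illustration.

```python
import random


def split_train_valid(files, valid_fraction=0.2, seed=1234):
    """Return (train, valid) lists, reproducible for a given seed."""
    # sort first so the result does not depend on directory listing order
    ordered = sorted(files)
    rng = random.Random(seed)
    n_valid = round(len(ordered) * valid_fraction)
    valid = set(rng.sample(ordered, n_valid))
    train = [name for name in ordered if name not in valid]
    return train, sorted(valid)


# toy file names standing in for the downloaded images
example_files = [f'{i:03d}.jpg' for i in range(10)]
train_files, valid_files = split_train_valid(example_files)
```

With ten files and the default fraction, two end up in the validation list and eight in the training list, and the same seed always picks the same two.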

As you will see, collecting the data was the hardest part and everything from here onward is quite straightforward thanks to the high level abstractions provided by fast.ai.

Training a Model

Now we can finally start to train our image classifier. I am using a paperspace instance with the setup recommended and provided by fast.ai.

Train a First Model

import os
import sys

os.chdir('/home/paperspace/')
# append fast.ai local folder to system path so modules can be imported
sys.path.append('/home/paperspace/fastai/')
# automatically reload updated sub-modules
%reload_ext autoreload
%autoreload 2
# in-line plots
%matplotlib inline

from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *

# set path of data folder
PATH = "data/sharksdolphins/"
# set size images should be resized to
sz = 224
# First model
arch = resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), bs=16)
learn = ConvLearner.pretrained(arch, data, precompute=True)

Jeremy explained the learning rate finder, which uses the approach of cyclical learning rates as outlined by Leslie Smith (https://arxiv.org/abs/1506.01186). We run the lr_find method and then plot the loss against the learning rate. From that plot we want to visually choose “the highest learning rate we can find where the loss is still clearly improving”. Note that the learning rate finder initially did not work so well for my small-ish dataset. I had to adjust the batch size to make it work correctly (see the bs=16 argument in the ImageClassifierData call above).

learn.lr_find()
# learning rate per iteration
learn.sched.plot_lr()
# loss against learning rate
learn.sched.plot()
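
To build some intuition for what a learning rate finder does, here is a toy sketch in plain Python. It is not the fast.ai implementation: it fits a single weight to made-up data, multiplies the learning rate by a constant factor after every mini-batch, and logs the loss. The stopping rule (loss more than four times the best loss seen so far) and all names and default values are assumptions chosen for this illustration.

```python
import random


def lr_range_test(xs, ys, start_lr=1e-5, end_lr=10.0, steps=100,
                  batch_size=8, seed=0):
    """Increase the learning rate after every mini-batch and log the loss."""
    rng = random.Random(seed)
    # constant factor that takes us from start_lr to end_lr in `steps` steps
    factor = (end_lr / start_lr) ** (1 / (steps - 1))
    weight, lr = 0.0, start_lr
    best_loss = float("inf")
    lrs, losses = [], []
    for _ in range(steps):
        batch = rng.sample(range(len(xs)), batch_size)
        # mean squared error of the model y = weight * x on this batch
        errors = [weight * xs[i] - ys[i] for i in batch]
        loss = sum(e * e for e in errors) / batch_size
        grad = sum(2 * e * xs[i] for e, i in zip(errors, batch)) / batch_size
        lrs.append(lr)
        losses.append(loss)
        best_loss = min(best_loss, loss)
        if loss > 4 * best_loss:
            break  # the loss has clearly stopped improving
        weight -= lr * grad
        lr *= factor
    return lrs, losses


# made-up data: y = 3x plus a little noise
data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(200)]
ys = [3 * x + data_rng.gauss(0, 0.1) for x in xs]
lrs, losses = lr_range_test(xs, ys)
```

Plotting `losses` against `lrs` on a logarithmic x-axis gives the kind of curve from which the learning rate is picked by eye. With small batches the logged loss is noisy, which is in line with my experience that the batch size matters for getting a readable curve.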

learn.fit(0.01, 5)

epoch	trn_loss	val_loss	accuracy
0	0.435999	0.267501	0.925
1	0.3081	0.29459	0.9
2	0.2548	0.235716	0.925
3	0.214654	0.229093	0.93125
4	0.237024	0.162347	0.9375

Over 93% accuracy! Really nice results already!

Let’s see if we can improve things even further.

Train a Second Model

Now let’s start training some of the lower layers and retraining these with differential learning rate. Jeremy talks about unfreezing here, but a friend that I talked to about this called it thawing instead, and I like that terminology, as it implies the differential learning rates with fast-changing weights at the last layer but slower updates in the lower layers.

# unfreeze pretrained layers
learn.unfreeze()
# set differential learning rate
lr = np.array([1e-4,1e-3,1e-2])
# train new model
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)

epoch	trn_loss	val_loss	accuracy
0	0.475365	0.365881	0.89375
1	0.278049	0.177107	0.94375
2	0.174708	0.178467	0.95
3	0.120919	0.230777	0.95625
4	0.093978	0.148054	0.95625
5	0.081571	0.200582	0.95
6	0.055903	0.198029	0.95
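
As far as I understand the fast.ai library used here, cycle_len=1 together with cycle_mult=2 means that the first cycle lasts one epoch and every following cycle is twice as long, with the learning rate being annealed along a cosine curve within each cycle and reset at the start of the next. The sketch below only illustrates that arithmetic and is not taken from the library; the function names are hypothetical.

```python
import math


def cycle_lengths(n_cycles, cycle_len, cycle_mult):
    """Length of each cycle in epochs."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycles)]


def cosine_annealed_lr(max_lr, progress):
    """Learning rate at `progress` (0 = start of a cycle, 1 = end of a cycle)."""
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))


# three cycles of 1, 2, and 4 epochs
lengths = cycle_lengths(n_cycles=3, cycle_len=1, cycle_mult=2)
total_epochs = sum(lengths)
```

Three cycles of 1, 2, and 4 epochs add up to 7 epochs, which matches the seven rows (epochs 0 to 6) in the table above.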

A little bit better yet. Let’s look at the confusion matrix and some of the misclassified images:

# get predictions and transform to class probability values
log_preds = learn.predict()
preds = np.argmax(log_preds, axis=1)
probs = np.exp(log_preds[:,1])

# plot confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(data.val_y, preds)
plot_confusion_matrix(cm, data.classes)

We should also plot some of the images to develop an intuition about where our classifier does well and where it doesn’t. Here is the one misclassified dolphin and the top 4 misclassified sharks:

Discussion

You can see that we got some pretty strong results in a very short amount of time and using a very limited dataset! There are obviously many more things that could be done to improve this model further.

Weighting the Classes

In our example of sharks and dolphins, we are currently treating all misclassifications equally. This may not be the right approach! If the model was used to monitor a beach for sharks, for example, failing to recognize a dolphin would not be a problem, while failing to recognize a shark could potentially result in human fatalities. In this case, it might be advisable to retrain the model with a biased weighting function to make sure our recall on shark images is higher.
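
A cheaper alternative to retraining with class weights, and not something I did for this post, is to lower the probability threshold above which an image is called a shark. The sketch below uses made-up labels and probabilities to show how that trades dolphin recall for shark recall. The function names and the numbers are purely illustrative.

```python
def classify(shark_probs, threshold=0.5):
    """Label an image 'shark' if its predicted shark probability is high enough."""
    return ['shark' if p >= threshold else 'dolphin' for p in shark_probs]


def recall(y_true, y_pred, positive):
    """Share of true `positive` examples that were predicted as `positive`."""
    total = sum(1 for t in y_true if t == positive)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return hits / total if total else float('nan')


# made-up ground truth and predicted shark probabilities
y_true = ['shark', 'shark', 'shark', 'dolphin', 'dolphin']
shark_probs = [0.9, 0.6, 0.35, 0.4, 0.1]

default_preds = classify(shark_probs, threshold=0.5)
cautious_preds = classify(shark_probs, threshold=0.3)
```

In this toy example the default threshold misses one of three sharks, while the threshold of 0.3 catches all three at the price of raising a false alarm for one of the two dolphins.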

Data Leakage

Looking at images that were misclassified or low-confidence predictions, I realised that my training set introduced a number of hidden biases to the model. For example, a number of dolphin pictures have people in them that are touching the dolphin. This is less prevalent with the shark pictures, for obvious reasons. As a result, the few shark pictures with human arms and hands in them and near the shark seem to end up with lower confidence for the shark class. For the only misclassified dolphin in the dataset, on the other hand, I think that the shark-like pose (frontal, widely opened mouth) may have played a role in the model mistaking this image for a shark image.

This kind of data leakage is increasingly discussed and criticised in deep learning applications, so it is good to be aware of it and keep an eye out for it. Since my model is not going into production for Baywatch any time soon, I am just glad I found it nicely illustrated in this relatively small dataset.

Airdrop delivery with A* pathfinding

Fri, 06 Apr 2018 18:00:00 +0000

This post is an event report and a quick walk-through of a submission that I developed with a group of participants at an Alibaba / Met Office UK hackathon. We are using the A* algorithm with a couple of tweaks to route cargo balloons from London to a number of cities in the UK.

It’s the year 2050. The invention of anti-gravity engines has led to the creation of unmanned balloons that travel the UK, delivering goods. However, unpredictable weather conditions mean that these balloons are often delayed, damaged or even destroyed(!), so we need your help. We’re inviting you to join our contest AND our hackathon to create an algorithm which allows these balloons to get to their route safely and effectively.

In January of this year, the Met Office teamed up with Alibaba Cloud in organising a hackathon at Huckletree Shoreditch. Here is a short video that gives a good impression of the event

The Problem

The hackathon was organised as a “Future Challenge”, a fictitious scenario for the year 2050. Obviously, drone delivery would be considered so 2020 and completely outdated by then. Instead, people are relying on anti-gravity balloons to deliver goods to cities across the UK via air drop. These anti-gravity balloons are reliable and efficient, but have one major shortcoming: They crash when travelling in areas with high wind speeds. This is where the Met Office comes in. The task was to navigate balloons from origin to destination while avoiding storms by using the Met Office forecasts.

I’ll spare you the full run through of rules, terms, and conditions. You can find them on the competition page. The main facts:

forecast data provides projected wind speed for fields of a grid
forecasts are for hourly intervals
balloons can move up, down, left, right, or stay in place
balloons move one field per 2 minutes
balloons crash when entering a field where the wind speed is ≥15

So the big question is: How do we safely get the balloons from origin to destination while avoiding stormy areas? I teamed up with a nice bunch of people, mostly undergrad or master level university students. It was great to see their enthusiasm for working on this problem!

The Data

The data that was provided to us was:

the coordinates of cities (origin and destinations)
weather data, separated into a training set with 7 days of weather forecasts from 10 models plus observations of the actual conditions that manifested on these days, and a holdout set with 5 days of weather forecasts

You should still be able to get the data here in case you’d like to have a go yourself. Note that the weather forecasts came in at a rather inconvenient file size of 2 x 800 MB, and download speeds were not that great either.

See the map below for illustration purposes. The map shows gridded forecasts of wind speed for a one hour time slice, as well as city locations (origin in yellow, destinations in red).

Weather Prediction

We started by looking at the weather predictions. Our initial plan was to use the first week, for which both forecasts and observations were available, to train a classifier that would identify the likelihood of high wind speeds in a given area at a particular time. This, however, turned out to be a bit of a red herring. The Met Office predictions were so good that averaging them and using a simple threshold of 15 resulted in close to zero false negatives when trying to detect cells with storms.

Lesson learned: Do your EDA properly to check which areas are worth investigating in detail and where you can use a simple ad-hoc solution.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import h5py

def convert_forecast(data_cube):
    # take mean of forecasts across the models
    arr_world = np.mean(data_cube, axis=2)
    # binarize to storm (1) or safe (0)
    arr_world = arr_world >= 15
    # from boolean to binary
    arr_world = arr_world.astype(int)
    # swap axes so x=0, y=1, z=2, day=3
    arr_world = np.swapaxes(arr_world, 0, 2)
    arr_world = np.swapaxes(arr_world, 1, 3)
    arr_world = np.swapaxes(arr_world, 2, 3)

    return arr_world

data_cube = np.load('../data/5D_test.npy')
# convert forecast to world array
forecast = convert_forecast(data_cube)
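
To make the false-negative check mentioned above concrete, here is a small standard-library sketch. It averages the model forecasts per cell, applies the threshold of 15, and compares the result against observed wind speeds. The numbers are invented for illustration and the function names are hypothetical; the real data would of course come from the forecast and observation files.

```python
from statistics import fmean


def ensemble_storm_flags(model_forecasts, threshold=15):
    """Average the per-model wind speeds of each cell and flag storms."""
    return [fmean(cell) >= threshold for cell in zip(*model_forecasts)]


def false_negative_rate(predicted, observed_speeds, threshold=15):
    """Share of truly stormy cells that the ensemble mean declared safe."""
    actual = [speed >= threshold for speed in observed_speeds]
    stormy = sum(actual)
    missed = sum(1 for p, a in zip(predicted, actual) if a and not p)
    return missed / stormy if stormy else 0.0


# invented wind speeds: three models, four grid cells
model_forecasts = [
    [10, 16, 14, 20],
    [12, 15, 13, 18],
    [11, 17, 16, 19],
]
observed = [9, 16, 15, 21]

flags = ensemble_storm_flags(model_forecasts)
fnr = false_negative_rate(flags, observed)
```

In this toy example the third cell is stormy in the observations but safe according to the ensemble mean, so one of three storm cells is missed. On the hackathon data this rate was close to zero, which is why we dropped the idea of training a classifier.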

We wanted balloons to take the shortest path from origin to destination without passing into storms. That means storms can be viewed as obstacles in our path search problem, because we would never, ever, want to pass through them, even if that means a massive detour for the balloon. We therefore chose to use an A* path search algorithm. This algorithm finds the shortest path around obstacles in a reasonable amount of time and is quite straightforward to implement.

The basic approach is to start from the origin and generate a frontier of possible next moves from a list of valid neighbouring fields. Fields with obstacles are excluded, as well as fields that have been visited before. For each field that is part of this frontier we log which field we came from and calculate the cost of our movement so far plus a heuristic estimate of the remaining distance. As soon as the destination field becomes part of the frontier, we can follow the trail laid out in our log back to the origin to find the shortest path.

In case you would like more details or want to compare A* to other pathfinding algorithms, this page has the best summary of pathfinding algorithms I have seen, including a great section on A* with interactive simulations.

In our case, there is one additional complication: our search is two-dimensional (geographical space with latitude and longitude), but our “obstacles” change over time. My way around that was to make the search three-dimensional, with each movement time step (2 minutes) as a slice in the third dimension.

# repeat time slices x30
forecast_stack = np.repeat(forecast, repeats=30, axis=2)

I then forced the search algorithm to always take a step forward in time by restricting the valid neighbours to [..., ..., z+1]. I tried to illustrate this schematically in the diagram below:

The code for my A* implementation in python using heapq can be found below.

Note that I allow for neighbours that have the same x and y coordinates. This essentially allows balloons to “hover in place” to wait out unfavourable weather conditions in the area ahead, should that be the most promising course of action.

Another thing to mention is that this approach massively inflates our frontier. Usually, an advantage of the A* algorithm is that fields that have been visited before do not need to be considered again. In my approach, field [0,0,0] is different from field [0,0,1] (the same latitude and longitude, but a different time step). As a result, computation becomes a lot more resource-intensive, but it is still feasible to run this on your local machine. In the fast-paced setting of a hackathon I think that prioritising developer time over computing time was the right call.

import heapq

def heuristic_function(a, b):
    return (b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2

def astar_3D(space, origin_xy, destination_xy):
    # make origin 3D with timeslice 0
    origin = origin_xy[0], origin_xy[1], 0
    # logs the path
    came_from = {}
    # holds the legal next moves in order of priority
    frontier = []
    # define legal moves:
    # up, down, left, right, stay in place.
    # no diagonals and always move forward one time step (z)
    neighbours = [(0,0,1),(0,1,1),(0,-1,1),(1,0,1),(-1,0,1)]

    cost_so_far = {origin: 0}
    priority = {origin: heuristic_function(origin_xy, destination_xy)}
    heapq.heappush(frontier, (priority[origin], origin))

    # while there are still options to explore
    while frontier:

        current = heapq.heappop(frontier)[1]

        # if current position is destination,
        # break the loop and find the path that led here
        if (current[0], current[1]) == destination_xy:
            data = []
            while current in came_from:
                data.append(current)
                current = came_from[current]
            return data

        for i, j, k in neighbours:
            move = current[0] + i, current[1] + j, current[2] + k

            # check that move is legal
            if ((0 <= move[0] < space.shape[0]) &
                (0 <= move[1] < space.shape[1]) &
                (0 <= move[2] < space.shape[2])):

                if space[move[0], move[1], move[2]] != 1:

                    new_cost = 1
                    new_total = cost_so_far[current] + new_cost

                    if move not in cost_so_far:
                        cost_so_far[move] = new_total
                        # calculate total cost
                        priority[move] = new_total + heuristic_function(move, destination_xy)
                        # update frontier
                        heapq.heappush(frontier, (priority[move], move))
                        # log this move
                        came_from[move] = current

    return 'no solution found :('

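# get city data
cities = pd.read_csv('../data/CityData.csv', header=0)
# run algorithm on the time-expanded forecast of the first day;
# origin and destination are (x, y) tuples of grid indices
x = astar_3D(space=forecast_stack[:,:,:,0],
             origin_xy=origin,
             destination_xy=destination)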
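
One thing worth knowing about the heuristic used above: the squared Euclidean distance can be larger than the number of steps that are actually left (a target three fields away scores 9), and with an overestimating heuristic A* is no longer guaranteed to return the shortest path. The sketch below is a stripped-down variant on a toy grid that uses the Manhattan distance instead, which never overestimates when only up, down, left, and right moves are allowed. It is a simplified illustration under those assumptions, not the code we submitted, and the names and the toy storm layout are made up.

```python
import heapq

# stay in place, or move one field up, down, right, or left
MOVES = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]


def manhattan(a, b):
    """Number of grid steps between two fields, ignoring storms."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def astar_time_expanded(storm, origin_xy, destination_xy):
    """storm[t][x][y] is True where the field is unsafe at time step t."""
    n_steps, n_x, n_y = len(storm), len(storm[0]), len(storm[0][0])
    start = (origin_xy[0], origin_xy[1], 0)
    came_from = {}
    cost_so_far = {start: 0}
    frontier = [(manhattan(origin_xy, destination_xy), start)]
    while frontier:
        _, current = heapq.heappop(frontier)
        x, y, t = current
        if (x, y) == tuple(destination_xy):
            # walk the log back to the origin, then reverse it
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        for dx, dy in MOVES:
            nx, ny, nt = x + dx, y + dy, t + 1
            if not (0 <= nx < n_x and 0 <= ny < n_y and nt < n_steps):
                continue  # outside the grid or past the last time step
            if storm[nt][nx][ny]:
                continue  # never enter a storm
            move = (nx, ny, nt)
            if move not in cost_so_far:
                cost_so_far[move] = cost_so_far[current] + 1
                priority = cost_so_far[move] + manhattan(move, destination_xy)
                heapq.heappush(frontier, (priority, move))
                came_from[move] = current
    return None  # no safe route within the available time steps


# toy world: 3 fields in a row, 6 time steps, a storm in the middle at t=1 and t=2
toy_storm = [[[False] for _ in range(3)] for _ in range(6)]
toy_storm[1][1][0] = True
toy_storm[2][1][0] = True
toy_path = astar_time_expanded(toy_storm, (0, 0), (2, 0))
```

In the toy world the balloon hovers at the origin for two time steps until the storm in the middle field has passed and then moves on to the destination, which is the "wait it out" behaviour described earlier.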

Visualising the Results

We have a predicted optimal route now. That’s great, but it would be even better to visualise these results in a way that allows us to develop some intuition about how our solution is doing and where we could improve it further. I thought that an animation of the time slices with the paths generated would be ideal for this. So I used matplotlib.pyplot to create an image of each time slice and then combined them into an animated gif. Output and code below:

You can see that, for this day, the solution for most cities is relatively straight-forward because of low wind speeds in the majority of the area. However, the A* pathfinding algorithm can be seen nicely at work in the later timeslices and the centre-right of the map, where the purple balloon pauses for a timeslice to wait out unfavourable conditions ahead and then winds around patches of high wind speed towards its target.

import matplotlib.colors

def plot_solution(world, cities, solution, day):
    timesteps = list(range(0, 540, 30))
    solution = solution.loc[solution.day == day,:]
    # colour map for cities
    cmap = plt.cm.cool
    norm = matplotlib.colors.Normalize(vmin=1, vmax=10)
    # colour map for weather
    cm = matplotlib.colors.LinearSegmentedColormap.from_list('grid', [(1, 1, 1), (0.5, 0.5, 0.5)], N=2)

    for t in timesteps:
        timeslice = world[:,:,t]
        moves_sofar = solution.loc[solution.z <= t,:]
        moves_new = solution.loc[(t <= solution.z) & (solution.z <= t + 30),:]

        if len(moves_sofar) > 0:
            plt.figure(figsize=(5,5))
            plt.imshow(timeslice[:,:].T, aspect='equal', cmap = cm)

            # plot old moves
            for city in moves_sofar.city.unique():
                moves_sofar_city = moves_sofar.loc[moves_sofar.city == city,:]
                x = moves_sofar_city.x
                y = moves_sofar_city.y
                z = moves_sofar_city.z
                plt.plot(list(x), list(y), linestyle='-', color='black')

            # plot new moves
            for city in moves_new.city.unique():
                moves_new_city = moves_new.loc[moves_new.city == city,:]
                x = moves_new_city.x
                y = moves_new_city.y
                z = moves_new_city.z
                plt.plot(list(x), list(y), linestyle='-', color=cmap(norm(city)))

            # plot cities
            for city,x,y in zip(cities.cid, cities.xid, cities.yid):
                if city == 0:
                    plt.scatter([x-1], [y-1], c='black')
                else:
                    # balloon still en-route?
                    if city in moves_new.city.unique():
                        plt.scatter([x-1], [y-1], c=cmap(norm(city)))
                    else:
                        plt.scatter([x-1], [y-1], c='black')

            # save and display
            plt.savefig('img_day' + str(day) + '_timestep_' + str(t) + '.png')
            plt.show()

Editable Plots from R to PowerPoint

Sat, 24 Feb 2018 11:00:00 +0000

In this post I am giving a quick overview of how to create editable plots in PowerPoint from R. These plots are composed of simple vector-based shapes and thus allow you to change labels, colours, or text position in seconds. Your project managers will love it!

Motivation

R allows us to create great visualisations, but in most data science settings these need to be presented to key stakeholders and decision makers in presentations or “slideuments”. Having to make small changes to previously compiled plots can be time-consuming and frustrating. A solution to this common problem is to keep your plots and graphs editable as a group of vector shapes in PowerPoint. This way project managers or data scientists themselves can make small changes without having to re-execute a single line of code.

Solution

We will use a tidyverse approach for creating the plot. Furthermore, the officer package enables us to smoothly interact with PowerPoint, and the rvg package is required to save our plots as editable vector graphs.

library(tidyverse)
library(officer)
library(rvg)

For demonstration purposes, let’s create a plot using the diamonds dataset. NB: I am saving the ggplot object to a variable name, but also displaying the plot when executing the lines by appending the ; ggp at the end.

# Using the diamonds dataset, which is shipped with ggplot2
ggp <- diamonds %>%
  # Let's simplify things by only considering natural number carats
  mutate(carat = floor(carat)) %>%
  group_by(carat, cut, clarity, color) %>%
  summarise(price = mean(price)) %>%
  # Create a plot of price by carat, colour, cut, and clarity
  ggplot(aes(x = carat, y = price, fill = color)) +
  geom_bar(stat = 'identity') +
  facet_grid(cut ~ clarity) +
  # Simplify the plot layout a little
  theme_bw() +
  guides(fill = 'none') +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()); ggp

Now we can use officer to create a new PowerPoint document and rvg::ph_with_vg to drop our ggplot object in there.

# Create a new powerpoint document
doc <- read_pptx()
doc <- add_slide(doc, 'Title and Content', 'Office Theme')

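# Add the plot
# (newer rvg releases dropped ph_with_vg; there, use
# ph_with(doc, dml(ggobj = ggp), location = ph_location_type(type = 'body')))
doc <- ph_with_vg(doc, ggobj = ggp, type = 'body')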

# Write the document to a file
print(doc, target = 'plot.pptx')

Now open the document in PowerPoint. Right-click and ungroup the plot. Voila! You should be able to select individual elements, for example the data bars in the plot, change their colour, move them around, change the text in labels, and much more. Have a look at the plot below. A cookie for you if you can find all ten edits that I made in the example.

As always, hope this is helpful. And FYI, I am still looking for a way to achieve the same result using Python. If you know one, collect some bounty here

Extracting location history

Sat, 21 Oct 2017 14:00:00 +0000

If you have an Android phone, then Google logs your location. Fortunately, it makes all of that data available to you via the “timeline” dashboard. Unfortunately, there is no easy way to get it off there and into an IDE. So we’ll have to do this the hard way!

Location History

What did you do yesterday? Last week? Or on August the 17th at 15:00? The answer, at least to the last question, possibly used to be quite a stretch. But like many things, this has changed with the advent of our everyday companion, the smartphone.

If you have a smartphone and you carry it around with you, your whereabouts are constantly logged and saved. For Android phones with Google Maps, Google infers your position based on a mix of GPS, cell phone towers, WiFi name lookup, and other factors, and saves it in a “location history”. This data is available to you via the (relatively new) “timeline” feature in Google Maps.

The Problem

As always, Google’s dashboard is very intuitive to navigate and offers great functionality. There is just one problem: I cannot summarise, report on, or programmatically analyse my data. It’s locked into Google’s systems.

Because our location data is undoubtedly our personal property, Google offers us the option to download it in .kml format. However, this is only available via the web interface. If I wanted to build personalised reports on top of the data, I would need programmatic access via an API that accepts parameters such as date and time, so that I can retrieve data dynamically. That is currently not supported by Google. But with a little bit of manual work, we will get there nonetheless!

The Solution

After a little bit of digging around I found this Medium post and this associated GitHub repository that got me 90% of the way very quickly. Alex, if you’re reading this, thanks a ton for putting all of that together! The code on GitHub was still missing yearly and daily sub-setting functionality, so I added that in my repository (pull request submitted!).

So how to use this? Here are the step-by-step instructions.

In bash:

# Clone the code repository
git clone https://github.com/JanLauGe/c_google_timeline

Next, we’ll have to do some manual work. This is because we will need information from our Google account sign-in, saved in our cookies, in order to authenticate our GET requests for KML file downloads.

Open https://www.google.com/maps/timeline in Mozilla Firefox (I tried Chrome first, it did not work for me)
Inspect the page (Ctrl + Shift + I) and go to the Network tab
Enter the link below in the address line of your browser: https://www.google.com/maps/timeline/kml
A new event will appear in the inspect-network tab as a result of the request. Copy its content as a cURL
Paste the cURL string to a text editor and save it as a key file (I used ‘~/.env/.google_maps_cookie’)
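
If you are curious what is actually inside that cURL string: the part that authenticates you is the Cookie header. Here is a minimal sketch of how it could be pulled out in Python, assuming the browser writes it as a -H 'Cookie: ...' argument. The function name and the sample string are made up for illustration, and whether process_location wants the full string or only the cookie is something to check in the repository.

```python
import shlex


def extract_cookie(curl_command):
    """Return the value of the Cookie header from a 'Copy as cURL' string."""
    # shlex splits the command the way a POSIX shell would, respecting quotes
    tokens = shlex.split(curl_command)
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header") and value.lower().startswith("cookie:"):
            return value.split(":", 1)[1].strip()
    raise ValueError("no Cookie header found in cURL command")


# Made-up example; a real string copied from the browser is much longer
sample = (
    "curl 'https://www.google.com/maps/timeline/kml' "
    "-H 'Accept: */*' -H 'Cookie: SID=abc; HSID=def'"
)
print(extract_cookie(sample))
```

Treat the cookie like a password: anyone who has it can act as your logged-in browser session.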

Now that we have the cookie information, we can go back to the fun part in Python:

import datetime as DT
import os

import process_location

# Get inputs --------------------------
# Date info
today = DT.date.today()
end_day = today.day
end_month = today.month
end_year = today.year

lastweek = today - DT.timedelta(days=7)
begin_day = lastweek.day
begin_month = lastweek.month
begin_year = lastweek.year

# Cookie info (open() does not expand '~', so resolve the path first)
cookie_path = os.path.expanduser('~/.env/.google_maps_cookie')
with open(cookie_path, 'r') as f:
    # Remove line break at end of string
    cookie_content = f.read().rstrip('\n')

# Where to save the files
folder = os.path.expanduser('~/google_timeline/data/')

# Get files --------------------------
process_location.create_kml_files(
    begin_year=begin_year,
    begin_month=begin_month,
    begin_day=begin_day,
    end_year=end_year,
    end_month=end_month,
    end_day=end_day,
    cookie_content=cookie_content,
    folder=folder)

This will download the KML files (one per day) for the last week. I’ll leave it at that for now. If you would like to know how to read the files into a pandas data frame, check out Alexandre Attia’s repo, or come back here later. I already have a specific application in mind for this, but that is a story for another post.
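
In the meantime, here is a rough sketch of how one of those KML files could be turned into rows using only the standard library. I am assuming the usual KML 2.2 layout (Placemark elements with a name, a TimeSpan, and a Point whose coordinates are written as lon,lat,alt); the sample document and the function name are invented for illustration, and real timeline files may contain other geometry types that this ignores.

```python
import xml.etree.ElementTree as ET

# Standard KML 2.2 namespace
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Invented sample, shaped like a typical KML placemark
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
  <Placemark>
    <name>Home</name>
    <Point><coordinates>-0.1276,51.5072,0</coordinates></Point>
    <TimeSpan>
      <begin>2017-10-20T08:00:00.000Z</begin>
      <end>2017-10-20T09:00:00.000Z</end>
    </TimeSpan>
  </Placemark>
</Document>
</kml>"""


def parse_placemarks(kml_text):
    """Return one dict per Placemark that has point coordinates."""
    root = ET.fromstring(kml_text)
    rows = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        text = placemark.findtext(".//kml:coordinates", default="", namespaces=KML_NS)
        coords = text.split()
        if not coords:
            continue
        # KML stores longitude first, then latitude
        lon, lat = (float(v) for v in coords[0].split(",")[:2])
        rows.append({
            "name": placemark.findtext("kml:name", default="", namespaces=KML_NS),
            "begin": placemark.findtext(".//kml:begin", default="", namespaces=KML_NS),
            "end": placemark.findtext(".//kml:end", default="", namespaces=KML_NS),
            "lon": lon,
            "lat": lat,
        })
    return rows


rows = parse_placemarks(SAMPLE_KML)
print(rows)
```

A list of dicts like this can be handed straight to pandas.DataFrame(rows) if you prefer to continue in pandas.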

As always, hope this is useful for you. Please leave a comment below. I have enabled anonymous commenting to remove the entry hurdle of signing up to Disqus.

Exploring Sales Data

Sun, 01 Oct 2017 10:00:00 +0000

A big part of the interview process for many data science positions is a data science task or assignment. Companies usually choose a data set that is typical for them, and only in rare cases a sample of their actual production data. Here, I am exploring such a data set, sent out by a leading UK retailer.

The task

Your task is to read all data into your preferred software environment, get an understanding of the various variables that are included, and clean the dataset. Subsequently, proceed to build a model of your choice to predict store sales in the test dataset. You can choose as many methods of evaluating the model as you like.

Data Preparation

We’ll start by reading the data into R and having a look at the basic structure and variables available. Three tables are supplied: stores, train, and test. Since train and test are in the same format, we will combine the two tables in one (called ‘days’) to make reformatting a little easier.

# Any good R-script should use these ;)
library(plyr)
library(dplyr)
library(tidyr)
library(magrittr)

library(ggplot2)
library(gridExtra)
library(ggthemes)

library(caret)
library(glmnet)
library(doMC)

# Keep results reproducible
set.seed(1234)

# Read the tabular data
stores <- read.csv(file = 'original/store.csv')
train <- read.csv(file = 'original/train.csv')
test <- read.csv(file = 'original/test.csv')
days <- rbind(
  cbind(train, set = 'train'),
  cbind(test, set = 'test'))

# inspect the data
head(stores)
head(days)

Apparently, some of the categorical columns were interpreted as integers by R. We’ll fix that and reformat the data to make sure we use columns in an appropriate fashion. We also combine the data on days and shops in one table via a left join.

# Reformatting 'stores'
stores %<>%
  mutate(
    # Stores number as character
    Store = as.character(Store),
    # And CompetitionStart as combination of month and year
    CompetitionStart = ifelse(
      # Where CompetitionOpenSince values are NA
      # we assume competition has been around for a while
      # (placeholder written as dd-mm-yyyy so it matches the format below)
      is.na(CompetitionOpenSinceYear), '01-01-2000',
      # Otherwise we paste in month and year and make it a date (assuming 1st)
      paste0('01-',
             CompetitionOpenSinceMonth, '-',
             CompetitionOpenSinceYear)) %>%
      as.Date(format = '%d-%m-%Y'))

# Reformatting 'days'
# Then we combine it with the shop data
days %<>%
  mutate(
    # Again, Stores as character
    Store = as.character(Store),
    # And date into a proper time stamp
    Date = as.Date(Date, format = '%d/%m/%Y')) %>%
# Let's convert categorical information into factors
# mutate_each(funs(as.factor), DayOfWeek) %>%
  # Bring in the shop data
  left_join(stores, by = 'Store') %>%
  # Any entries of 'CompetitionDistance' for days that predate
  # 'CompetitionOpenSince...' are probably invalid. We'll replace them with
  # the mean value of 'CompetitionDistance' across all stores
  mutate(CompetitionDistance = ifelse(
    Date < CompetitionStart,
    mean(stores$CompetitionDistance, na.rm = TRUE),
    CompetitionDistance)) %>%
  # Drop the old CompetitionOpenSince fields
  select(-one_of('CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear'))

Now we can take an initial look at the data to better understand what we’ve got and how we can best put it to use in the modelling exercise. I will use a couple of basic questions to guide my exploration of the data set. Since there is no sales data for the hold-out partition of the data I excluded it from most of the plots in this section.

Which Stores do Well?

We can look at the data on a sales-by-store level. Simply aggregate sales and other metrics from all days by store and plot them to visualise the relationship of metrics to one another. First, let’s try to understand what kind of shops are in our datasets. The shops are divided by their store type, and the range of products they offer (basic, medium, or full). The relationship between these factors is shown in the plot below.

# Make a data frame that aggregates the Sales data by Store
# (Sales by Store, SbS)
SbS <- days %>%
  # Use only the train data
  filter(set == 'train') %>%
  group_by(Store) %>%
  # Aggregate data by Store
  summarise(
    meanSales = mean(Sales, na.rm = TRUE),
    meanCustomers = mean(Customers, na.rm = TRUE),
    totalSales = sum(Sales, na.rm = TRUE),
    totalCustomers = sum(Customers, na.rm = TRUE),
    daysOpen = sum(Open, na.rm = TRUE),
    daysPromo = sum(Promo, na.rm = TRUE),
    daysHoliday = sum(ifelse(StateHoliday != 0, 1, 0), na.rm = TRUE),
    daysSchoolHoliday = sum(SchoolHoliday, na.rm = TRUE),
    dailySales = totalSales / daysOpen,
    dailyCustomers = totalCustomers / daysOpen,
    storeType = first(StoreType),
    range = first(Range),
    competitionDistance = mean(CompetitionDistance, na.rm = TRUE)) %>%
  # Change the factor levels for nice labels
  mutate(
    range = factor(range, labels = c('basic','medium','full')),
    storeType = factor(storeType, labels = c('small','medium','big','huge')))

# With the combined dataset, let's try and understand the shops data first
mosaicplot(
  storeType ~ range,
  data = SbS,
  main = 'Figure 1: Store type and range of products',
  xlab = 'Store type',
  ylab = 'Range of Products')

We can see that there is a large number of small shops but only very few medium ones. Another noticeable factor is that barely any shops offer the medium product range. All of those that do fall in the medium size shop type. Furthermore, small shops are more likely to only offer the basic product range, while big and huge shops are more likely to offer the full range of products. However, a number of big and huge shops restrict their product range to the basic selection, which seems surprising. Perhaps the store type does not represent an ordinal scale of shop size as initially assumed?

Next we will look at some of the continuous variables available for our shops. We can plot number of customers against daily sales and differentiate by store type:

SbS %>%
  ggplot(aes(
    x = dailyCustomers,
    y = dailySales,
    size = storeType,
    colour = range)) +
  geom_point(alpha = 0.5) +
  ggtitle('Figure 2: Relationship of daily customers with daily sales') +
  xlab('Mean Daily Number of Customers') + ylab('Mean Daily Sales')

Somewhat unsurprisingly, there seems to be a strong relationship between number of customers and sales. Interestingly though, when splitting this data up by store inventory, both ‘basic range’ stores and ‘full range’ stores seem to be doing well (i.e. high sales per customer) while ‘medium range’ stores have lower sales per customer. Additionally, small shops seem to have more customers than big and huge shops. This is another indicator that my assumption about the nature of the store type was wrong.

Now I will go back to the data by day. Let’s first look at the sales data by day of the week (Figure 3). We can see that sales are highest on Monday, drop a bit during the middle of the week, pick up slightly on Friday, are lower on Saturday and basically nonexistent on Sunday (most shops are closed). This suggests that the day of the week should be included as a predictor in our final model. We can also use the dates to create time series for shops and look at the variation of sales over time. To take out weekly variability I also included a rolling mean of sales for a 7-day window (SalesOfWeek). The resulting plot (Figure 4) shows us the sales of the 2 years from the beginning of 2013 till the end of 2014. We can see patterns of regular variation in sales over the year. Here is what stands out to me:

Sales are low on Sundays (most shops are closed!)
Promotions tend to be every other week
Sales are higher in weeks with a promotion
Sales are slightly higher in the time before Easter and distinctly higher before Christmas

# Sales by WeekDay
SbWD <- days %>%
  filter(set == 'train') %>%
  mutate(
    DateType = paste0(StoreType, '-', Date)) %>%
  group_by(DateType) %>%
  summarise(
    Sales = mean(Sales),
    DayOfWeek = max(DayOfWeek),
    # Compare against '0' directly so this works for factor and character columns
    StateHoliday = max(ifelse(StateHoliday != '0', 1, 0)),
    SchoolHoliday = max(SchoolHoliday),
    Promo = sum(Promo),
    StoreType = first(StoreType))
# Plot sales per day of the week
SbWD %>%
  mutate(
    DayOfWeek = factor(
      DayOfWeek,
      labels = c('Mon','Tue','Wed','Thu','Fri','Sat','Sun'))) %>%
  ggplot(aes(DayOfWeek, Sales, fill = StoreType)) +
  geom_boxplot() + ggtitle('Figure 3: Mean Sales per Day of the Week') +
  xlab('Day of the Week') + ylab('Mean Sales')


# Time series of sales
# (Sales by Day, SbD)
SbD <- days %>%
  filter(set == 'train') %>%
  group_by(Date) %>%
  summarise(
    Sales = mean(Sales),
    DayOfWeek = max(as.numeric(DayOfWeek)-1),
    # Compare against '0' directly so this works for factor and character columns
    StateHoliday = max(ifelse(StateHoliday != '0', 1, 0)),
    SchoolHoliday = max(SchoolHoliday),
    Promo = sum(Promo))
# Get a rolling mean of Sales
library(RcppRoll)
SbD$SalesOfWeek <- roll_mean(
      SbD$Sales,
      n = 7,
      align = 'center',
      fill = mean(SbD$Sales),
      na.rm = TRUE)
# Timeline plot of sales data
SbD %>%
  mutate(
    # Rescale some of the variables
    Sales = Sales / 10000,
    SalesOfWeek = SalesOfWeek / 10000,
    Promo = Promo / 1000,
    DayOfWeek = DayOfWeek / 7) %>%
  gather(
    # Get data into long format for ggplot
    key = category,
    value = measurement,
    Sales, SalesOfWeek, DayOfWeek, StateHoliday, SchoolHoliday, Promo) %>%
  mutate(
    # Change the order they will be plotted by
    category = factor(
      category,
      levels = c('Sales',
                 'SalesOfWeek',
                 'Promo',
                 'DayOfWeek',
                 'SchoolHoliday',
                 'StateHoliday'))) %>%
  # Make the plot
  ggplot(aes(Date, measurement, colour = category)) +
  geom_line() + facet_grid(category ~ .) +
  ggtitle('Figure 4: Time Series Representation of the Data') +
  ylab('Rescaled Variable Value')
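
To make it clearer why the 7-day window removes the weekly pattern, here is a small Python sketch of a centred rolling mean that pads the incomplete windows at both ends with a fill value, similar in spirit to the roll_mean call above. The function name and the toy sales numbers are my own; this is an illustration, not a re-implementation of RcppRoll.

```python
def rolling_mean_centred(values, n=7, fill=None):
    """Centred rolling mean; positions without a full window get `fill`."""
    if n % 2 == 0:
        raise ValueError("use an odd window size so the window can be centred")
    if fill is None:
        # Default to the overall mean, as in the R code above
        fill = sum(values) / len(values)
    half = n // 2
    out = [fill] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        out[i] = sum(window) / n
    return out


# Three identical toy weeks (Mon to Sun) with shops closed on Sunday
sales = [10, 9, 8, 8, 9, 7, 0] * 3
smoothed = rolling_mean_centred(sales, n=7)

# Every complete window contains each weekday exactly once,
# so the weekly ups and downs cancel out
print(sorted(set(round(v, 3) for v in smoothed)))
```

Because the window is exactly one week long, what remains after smoothing is the slower seasonal signal, which is what SalesOfWeek shows in Figure 4.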

Modelling

For benchmarking I will create a very simple model: No data cleaning, no feature engineering, just the raw (reformatted) data in a GLMNET and GBM. I am using dummy variables for categorical variables and raw values for continuous variables. The model is evaluated with ten-fold cross-validation, repeated 5 times.

# Transforming factors into dummy variables
daysdummy <- days %>%
  filter(set == 'train') %>%
  select(one_of('DayOfWeek', 'Promo', 'StateHoliday',
    'SchoolHoliday', 'StoreType', 'Range', 'Sales')) %>%
  # Create dummy variables
  model.matrix(Sales ~ ., data = .) %>%
  # Get the Sales data back
  cbind(select(days,
    -one_of('(Intercept)', 'DayOfWeek', 'Promo', 'StateHoliday',
      'SchoolHoliday', 'StoreType', 'Range')) %>%
      filter(set == 'train')) %>%
  mutate(
    # If CompetitionDistance is NA, use max Competition distance
    CompetitionDistance = ifelse(
      is.na(CompetitionDistance),
      max(days$CompetitionDistance, na.rm = TRUE),
      CompetitionDistance)) %>%
  # Create simple training set treating each day as independent observation,
  # ignoring factors Store and Date for now
  select(-Store, -Date, -Open, -set, -CompetitionStart)

# Inspect the variables
head(daysdummy)
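
If the dummy-variable step looks like magic, this is roughly what it does: each categorical column with k levels becomes k - 1 columns of zeros and ones, with the first level acting as the reference. A minimal Python sketch follows; the function name and the toy rows are made up, and the column naming simply mimics what model.matrix produces.

```python
def dummy_encode(rows, column):
    """Replace a categorical column by 0/1 columns, dropping the first level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        # The first level is the reference and gets no column of its own
        for level in levels[1:]:
            new_row[f"{column}{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


toy = [
    {"Sales": 5000, "StoreType": "a"},
    {"Sales": 7000, "StoreType": "b"},
    {"Sales": 6500, "StoreType": "c"},
    {"Sales": 4800, "StoreType": "a"},
]
for row in dummy_encode(toy, "StoreType"):
    print(row)
```

Rows of store type 'a' end up with zeros in both new columns, which is why no separate column is needed for the reference level.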

GLM

tuneControl <- trainControl(
 method = "repeatedcv",
 number = 10,
 repeats = 5)
model1 <- train(
 Sales ~ .,
 data = daysdummy,
 method = 'glmnet',
 metric = 'RMSE',
 trControl = tuneControl)

model1

glmnet

780829 samples
    19 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 702746, 702747, 702746, 702747, 702746, 702747, …
Resampling results across tuning parameters:

  alpha  lambda     RMSE      Rsquared
  0.10     6.91422  1210.331  0.9013087
  0.10    69.14220  1214.036  0.9009795
  0.10   691.42198  1386.398  0.8870391
  0.55     6.91422  1210.402  0.9012853
  0.55    69.14220  1228.537  0.8988403
  0.55   691.42198  1611.039  0.8521414
  1.00     6.91422  1211.024  0.9011885
  1.00    69.14220  1245.152  0.8963104
  1.00   691.42198  1754.444  0.8288464

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0.1 and lambda = 6.91422.
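
A quick sanity check on the “Summary of sample sizes” line: with ten folds, each model is trained on roughly nine tenths of the 780829 rows. The Python sketch below does that arithmetic for an even split. It is only meant to show where the numbers come from; caret builds its folds in its own way, so individual folds can deviate by an observation or so. The function name is my own.

```python
def training_sizes(n_samples, n_folds):
    """Training-set size for each fold when samples are split as evenly as possible."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds hold one additional observation
    fold_sizes = [base + 1 if i < extra else base for i in range(n_folds)]
    return [n_samples - size for size in fold_sizes]


sizes = training_sizes(780829, 10)
print(sorted(set(sizes)))   # the distinct training-set sizes

# Ten folds repeated five times means this many model fits per parameter combination
n_resamples = 10 * 5
print(n_resamples)
```

With nine parameter combinations in the tuning grid, that adds up to a few hundred model fits, which is why these runs take a while.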

GBM

fitControl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5)
model2 <- train(
  Sales ~ .,
  data = daysdummy,
  method = "gbm",
  trControl = fitControl)
save(model2, file = 'saved/basic.gbm.rda')

Stochastic Gradient Boosting

780829 samples
    19 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 702746, 702746, 702745, 702746, 702747, 702747, …
Resampling results across tuning parameters:

  interaction.depth  n.trees  RMSE      Rsquared
  1                   50      1502.452  0.8639662
  1                  100      1296.930  0.8922199
  1                  150      1225.590  0.9011660
  2                   50      1287.706  0.8954673
  2                  100      1138.885  0.9137963
  2                  150      1101.484  0.9186211
  3                   50      1193.371  0.9078775
  3                  100      1090.395  0.9202973
  3                  150      1063.971  0.9238807

Tuning parameter ‘shrinkage’ was held constant at a value of 0.1
Tuning parameter ‘n.minobsinnode’ was held constant at a value of 10
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 150, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

Now, let’s see how we can improve the above predictions. I feel like I have a better understanding now of how sales behave between stores and over time, so I will try creating some features from what we have learned in the data exploration. I create a variable indicating days before Christmas and Easter, to incorporate the lag between the holiday and the impact on people’s shopping behaviour. I also calculate mean sales per store and include these values in the set of predictors, which acts as a proxy for the population density, infrastructure availability, and similar factors related to the location of each individual store. Finally, I exclude closed days here. Predicting closed days as 0 would decrease the error metric, but that feels like cheating! Instead, let’s only look at days that the shops are open.
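
To see why including closed days flatters the error metric, consider this small Python sketch with invented numbers: adding days where both the observation and the prediction are zero leaves the squared error unchanged but increases the number of observations it is averaged over.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error of two equally long sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Invented sales figures for three open days
open_observed = [5000, 6000, 7000]
open_predicted = [5500, 5800, 7600]

# Three closed days: sales are zero and trivially predicted as zero
closed_observed = [0, 0, 0]
closed_predicted = [0, 0, 0]

rmse_open = rmse(open_observed, open_predicted)
rmse_all = rmse(open_observed + closed_observed,
                open_predicted + closed_predicted)

print(round(rmse_open, 1), round(rmse_all, 1))
```

The model is no better on the days that matter, yet the second number is smaller, so comparing on open days only is the fairer test.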

# Include mean sales by store as predictor
meanSbS <- days %>%
  left_join(
    select(SbS, one_of('Store', 'meanSales')),
    by = 'Store') %>%
  group_by(Store) %>%
  summarise(
    meanSales = mean(Sales, na.rm = TRUE))

# Include days before christmas and easter as predictor
daysBH <- days %>%
  mutate(
    Easter = ifelse(StateHoliday == 'b', 1, 0),
    Christmas = ifelse(StateHoliday == 'c', 1, 0),
    StateHoliday = ifelse(StateHoliday == '0', 0, 1)) %>%
  group_by(Date) %>%
  summarise(
    StateHoliday = max(StateHoliday),
    Christmas = max(Christmas),
    Easter = max(Easter)) %>%
  mutate(
    BH =
      lead(StateHoliday, n = 1) * 3 +
      lead(StateHoliday, n = 2) * 2 +
      lead(StateHoliday, n = 3),
    BC =
      lead(Christmas, n = 1) * 3 +
      lead(Christmas, n = 2) * 2 +
      lead(Christmas, n = 3),
    BE =
      lead(Easter, n = 1) * 3 +
      lead(Easter, n = 2) * 2 +
      lead(Easter, n = 3)) %>%
  select(Date, BH, BC, BE)

# Add it all together
modeldata <- days %>%
  left_join(
    meanSbS, by = 'Store') %>%
  left_join(
    daysBH, by = 'Date') %>%
  mutate(
    # If CompetitionDistance is NA, use max Competition distance
    CompetitionDistance = ifelse(
      is.na(CompetitionDistance),
      max(days$CompetitionDistance, na.rm = TRUE),
      CompetitionDistance),
    # If any holiday lag is NA, use zero
    BH = ifelse(is.na(BH), 0, BH),
    BC = ifelse(is.na(BC), 0, BC),
    BE = ifelse(is.na(BE), 0, BE)) %>%
  select(-CompetitionStart)

# Transforming factors into dummy variables
modeldata %<>%
  select(one_of('DayOfWeek', 'StoreType', 'Range', 'Sales')) %>%
  mutate(
    # Dummy value to include test set in transformation
    Sales = ifelse(is.na(Sales), -999, Sales),
    DayOfWeek = as.factor(DayOfWeek)) %>%
  # Create dummy variables
  model.matrix(Sales ~ ., data = .) %>%
  # Get the Sales data back
  cbind(modeldata) %>%
  mutate(
    # 1 on state holidays, 0 otherwise
    StateHoliday = ifelse(StateHoliday == '0', 0, 1)) %>%
  select(
    -one_of('(Intercept)', 'Store', 'Date',
            'DayOfWeek', 'Promo', 'StoreType', 'Range'))

# Select train and test data
trainset <- filter(modeldata, set == 'train' & Open == 1) %>%
  # Drop the bookkeeping columns so they are not used as predictors
  select(-Open, -set)
testset <- filter(modeldata, set == 'test' & Open == 1) %>%
  select(-Open, -set)
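
The holiday features BH, BC, and BE above are easier to understand with a tiny example. The Python sketch below mirrors the lead-based weighting (3 for the day right before the holiday, 2 for two days before, 1 for three days before) and sets positions without enough look-ahead to zero, as the NA replacement in the R code does. Function names and the toy calendar are my own.

```python
def lead(values, n):
    """Shift values n steps towards the start, padding the end with None."""
    return values[n:] + [None] * n


def days_before(flag):
    """Weight the three days before each flagged day with 1, 2, 3."""
    out = []
    for l1, l2, l3 in zip(lead(flag, 1), lead(flag, 2), lead(flag, 3)):
        if None in (l1, l2, l3):
            # Not enough future days available: treat as 'no holiday ahead'
            out.append(0)
        else:
            out.append(3 * l1 + 2 * l2 + l3)
    return out


# Toy calendar of eight days with a holiday on the fifth day
christmas = [0, 0, 0, 0, 1, 0, 0, 0]
print(days_before(christmas))
```

The feature ramps up over the three days leading to the holiday and drops back to zero on the day itself, which gives the model a handle on pre-holiday shopping.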

From the previous quick tuning runs, we know that the GBM did much better than the GLM. I suspect that this is due to the interaction between variables which the GBM is able to capture while the GLM (with the formula I use) only looks at variables by themselves. We’ll, therefore, use the GBM model. Predictions from this model improved with higher interaction depth and a higher number of trees, so I will choose some higher values here. I would have liked to do more parameter tuning on this, but the model takes quite a while to compute so I’ll make do with these ad-hoc values (if you are rerunning the code, avoid rerunning this bit and load the saved model object instead).

registerDoMC(cores = 4)

fitControl <- trainControl(
 method = "repeatedcv",
 number = 4,
 repeats = 1,
 allowParallel = TRUE,
 verboseIter = TRUE)
gbmGrid <-  expand.grid(
 interaction.depth = 10,
 n.trees = 500,
 shrinkage = 0.2,
 n.minobsinnode = 10)
model4 <- train(
 Sales ~ .,
 data = trainset,
 method = "gbm",
 tuneGrid = gbmGrid,
 trControl = fitControl)

# Predict in-sample (train) and out-of-sample (test)
is.prediction <- predict(model4, newdata = trainset)
os.prediction <- predict(model4, newdata = testset)

# Predict with closed days
fs.prediction <- predict(model4,
  newdata = modeldata,
  na.action = na.pass) %>%
  # Create data frame with predictions
  # Use prediction for both train and test set
  cbind('Prediction' = ., days) %>%
  # Set closed days to zero
  mutate(Prediction = ifelse(Open == 0, 0, Prediction)) %>%
  # Set negative values to zero
  mutate(Prediction = ifelse(Prediction < 0, 0, Prediction))

We have a set of predictions now, but I don’t have the Sales values for 2015 so I can’t evaluate on the blind holdout sample. Instead, I will plot the results for visual inspection.

# Simple scatterplot to see the results
fs.prediction %>%
ggplot(aes(Sales, Prediction)) +
  geom_point() + ylab('Predicted Sales') +
  ggtitle('Simple Scatterplot of observed and predicted sales')

The scatterplot reveals that for the training set there is a very good agreement between the observed and predicted sales, indicated by the long, linear shape of the point cloud. For some values, the model underestimates the values of sales, so it seems that my model is missing out on a part of the signal here. Further research could go into what distinguishes these points from the rest of the data set. When plotting the time series again, it seems that the general trend in the sales data is well captured. The range of variation in the weekly sales data, however, also seems to be slightly underestimated by the model.

Let’s plot the time series again, this time including both the train and the test period.

ts.prediction <- fs.prediction %>%
  group_by(Date) %>%
  summarise(
    Sales = mean(Sales),
    Prediction = mean(Prediction))

ts.prediction$SalesOfWeek <- roll_mean(
      ts.prediction$Sales,
      n = 7,
      align = 'center',
      fill = mean(ts.prediction$Sales),
      na.rm = TRUE)
ts.prediction$PredictionOfWeek <- roll_mean(
      ts.prediction$Prediction,
      n = 7,
      align = 'center',
      fill = mean(ts.prediction$Prediction),
      na.rm = TRUE)
# Timeline plot of sales data
ts.prediction %>%
  gather(
    # Get data into long format for ggplot
    key = category,
    value = measurement,
    Sales, SalesOfWeek, Prediction, PredictionOfWeek) %>%
  mutate(
    # Change the order they will be plotted by
    category = factor(
      category,
      levels = c('Sales',
                 'SalesOfWeek',
                 'Prediction',
                 'PredictionOfWeek'))) %>%
  # Make the plot
  ggplot(aes(Date, measurement, colour = category)) +
  geom_line() + facet_grid(category ~ .) +
  ggtitle('Figure 6: Time Series Representation of the Data') +
  ylab('Variable Value')

And that’s all for now. As always, hope it is interesting. Please leave a comment below!

Database Connections in rMarkdown

Fri, 15 Sep 2017 23:00:00 +0000

Connecting R to an enterprise data warehouse? Do it properly and do not hard-code your passwords! Here is how you can do it in R with rMarkdown and RStudio version 1.0+

The Problem

Working as a data scientist in a large organisation, chances are you will have to get data out of an Enterprise Data Warehouse (EDW) and into your Data Manipulation Environment (DME, usually R, Python, Julia, or SAS). Of course, you could create a manual extract, save it as .csv and read it from disk. However, this approach has a number of downsides:

the manual workflow may be hard to reproduce later on
files use up additional disk space
csv files do not store data types

Generally, I prefer to connect my DME directly to the database, as do many other Data Scientists. What I have repeatedly come across in this context is people hard-coding their passwords and access tokens into their analysis code. In my opinion, this is a dangerous practice! It is most likely in violation of the security regulations of your organisation, and for good reason. It is far too easy for your code to accidentally end up on an unrestricted access GitHub repository, an unprotected S3 bucket, or similar. With GDPR just around the corner, a mistake like that could soon cost your organisation up to 4% of their global annual revenue in fines!

The Solution

So what’s the “proper” way to do this? Well, RStudio (v1.0+) offers some great new features in this context. If you are using Windows (like many big corporations do) and you are connecting to your EDW using the Windows ODBC Data Source Administrator, you can read your connection details directly from there using the “odbc” package.

Note: each code block below should be a chunk in an rMarkdown document.

# Unfortunately, odbc is not on CRAN yet.
# So if you do not have it yet we will need devtools
install.packages('devtools')

# Using devtools, we can now install the odbc package
devtools::install_github('rstats-db/odbc')

# Get connection info from Windows ODBC Data Source Administrator
# Using the name you set manually
con <- DBI::dbConnect(odbc::odbc(), 'EDW_name')

Using this connection object, we can now write and run SQL code snippets in rMarkdowns, rNotebooks, and shiny apps. Just pass the connection as property to the snippet and specify an “output.var” that will capture the output. This “output.var” will be available in your R workspace afterwards.

-- This should be a chunk with the following header:
-- {sql, connection = con, output.var = "result"}
-- As a result, this turns into sql code.
-- comments need to be marked accordingly
SELECT TOP 10 * FROM EDW_database.EDW_table

# result will be available in the next chunk!
result

This code has syntax highlighting, runs start to finish without any manual steps, does not rely on “hacky” string queries, does not have hard-coded passwords, and your data updates as and when new data becomes available in your EDW!
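
The same principle applies if your DME is Python rather than R. The approach above relies on the Windows ODBC Data Source Administrator; the sketch below is a different but related technique, reading credentials from environment variables so that they never appear in the analysis code. The variable names EDW_USER and EDW_PASSWORD and the function name are hypothetical, and the demo values are set in the script only so the example runs on its own.

```python
import os


def get_credentials(prefix="EDW"):
    """Read user and password from environment variables, failing loudly if absent."""
    keys = ("USER", "PASSWORD")
    missing = [f"{prefix}_{key}" for key in keys
               if f"{prefix}_{key}" not in os.environ]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {key.lower(): os.environ[f"{prefix}_{key}"] for key in keys}


# Demo only: in real use these are set outside the code,
# for example in your shell profile or by your scheduler
os.environ["EDW_USER"] = "demo_user"
os.environ["EDW_PASSWORD"] = "demo_password"

credentials = get_credentials()
print(sorted(credentials))
```

Failing with a clear error when a variable is missing is deliberate: it is better than silently connecting with empty credentials.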

As always, hope this is useful for someone. Please leave a comment below!
