Laurens Geffert A collection of personal Data Science projects https://janlauge.github.io/ Tue, 06 Apr 2021 20:11:25 +0000 Tue, 06 Apr 2021 20:11:25 +0000 Jekyll v3.9.0 How to Summarize your Travel History in under 5 Minutes <p><strong>How to use your location history to compile a breakdown of all your international travel. Fast, simple, and valuable for immigration purposes or visa applications. We will use the Google Maps takeout feature and a small R script.</strong></p> <!--more--> <h2 id="introduction">Introduction</h2> <p>When applying for my US visa, one of the questions that USCIS had for me was a breakdown of all my international travel in the last ten years. If you’re anything like me, that question can cause a pretty big headache. Not because I have anything to hide but because it is challenging to keep track of all the different trips I make. A visa application is a serious business, and it’s vital to answer completely and truthfully. So I started digging into old emails, flight loyalty program records, even paper documents. The whole process took me the better part of two hours. I kept thinking there must be a better way to do this. Now I can say with confidence that there is, thanks to Google Maps’ Location History feature!</p> <h2 id="data-extraction">Data Extraction</h2> <h3 id="getting-the-files">Getting the Files</h3> <p>There are some prerequisites. You need to have location history enabled and, depending on the time window of travel history that you are interested in, you should disable the auto-delete. Note that auto-delete was recently changed to OPT-OUT! I like having all that data archived and available, so I made sure to disable it (you can do that at https://myactivity.google.com/activitycontrols). I still don’t have a full ten years, but I’ll take what I can get.</p> <p>Next, we need to extract the data from Google’s systems. 
I wrote a separate post on <a href="https://janlauge.github.io/2017/extracting-location-history/">how to download location history data programmatically</a>. However, here we’re looking to download data for a large number of days simultaneously, so the Google Takeout feature is the better choice because we won’t end up hitting API request limits.</p> <p>Go to https://takeout.google.com/, de-select all products (via the button at the top), then re-select “Location History”. Leave the file format as <code class="language-plaintext highlighter-rouge">JSON</code>. Set the file size specifications according to your preferences. You’re very unlikely to come up against the GB file limit with just your location history. The export should take about 10 to 15 minutes. When done, you will receive a <code class="language-plaintext highlighter-rouge">.zip</code> file in your Gmail account that contains folders for each year and files for each month with location history data.</p> <h3 id="getting-coordinates">Getting Coordinates</h3> <p>Many RStats users dislike <code class="language-plaintext highlighter-rouge">JSON</code>, but it’s a ubiquitous and useful format! Fear not, <a href="https://twitter.com/thomas_mock">Thomas Mock</a> has written an excellent summary on <a href="https://themockup.blog/posts/2020-05-22-parsing-json-in-r-with-jsonlite/">how to process <code class="language-plaintext highlighter-rouge">JSON</code> data with R</a>. Looking at the location history <code class="language-plaintext highlighter-rouge">JSON</code> files, it seems there are two sections in each: <code class="language-plaintext highlighter-rouge">placeVisit</code> and <code class="language-plaintext highlighter-rouge">activitySegment</code>. 
This matches what you can see on the timeline in the Google Maps app, where places that you visited are listed as entries, and your movement between places is labeled with inferred activities like “walking”, “running”, “on a train”, or “flying”.</p> <p>Originally I thought I could just use the places listed in <code class="language-plaintext highlighter-rouge">placeVisit</code> and extract their country from the address field (usually given in the last row). This worked well for most places, such as the UK and the US, but got really messy for Japan and Korea. I changed my approach and went for extracting raw latitude and longitude values instead. Digging in a little deeper, I found longitude-latitude values in five places:</p> <ul> <li>one pair for each entry in <code class="language-plaintext highlighter-rouge">placeVisit</code>, given with a <code class="language-plaintext highlighter-rouge">startTimestampMs</code> and <code class="language-plaintext highlighter-rouge">endTimestampMs</code></li> <li>zero to many pairs for each <code class="language-plaintext highlighter-rouge">placeVisit</code> in a nested list in <code class="language-plaintext highlighter-rouge">simplifiedRawPath</code> with a single <code class="language-plaintext highlighter-rouge">timestampMs</code> each</li> <li>one set of <code class="language-plaintext highlighter-rouge">startLocation</code>, <code class="language-plaintext highlighter-rouge">startTimestampMs</code>, <code class="language-plaintext highlighter-rouge">endLocation</code>, <code class="language-plaintext highlighter-rouge">endTimestampMs</code> for each entry in <code class="language-plaintext highlighter-rouge">activitySegment</code></li> <li>zero to many pairs with <code class="language-plaintext highlighter-rouge">timestampMs</code> for each <code class="language-plaintext highlighter-rouge">activitySegment</code> in a nested list in <code class="language-plaintext highlighter-rouge">simplifiedRawPath</code></li> <li>zero to many 
pairs with <code class="language-plaintext highlighter-rouge">timestampMs</code> for each <code class="language-plaintext highlighter-rouge">activitySegment</code> in a nested list in <code class="language-plaintext highlighter-rouge">waypointPath</code></li> </ul> <p>Note that <code class="language-plaintext highlighter-rouge">simplifiedRawPath</code> and <code class="language-plaintext highlighter-rouge">waypointPath</code> aren’t present for every entry. Furthermore, the above list might not be complete. For example, I also noticed <code class="language-plaintext highlighter-rouge">parkingEvents</code>, but since it was mostly empty for me, I left it out. There could be other entries, depending on the types of observation, sensors, or features used with Google Maps, that I am not aware of and that I missed here. If that is the case, please let me know in the comments below. I’d love to hear from you!</p> <p>Based on the information above, I wrote a function to extract latitude-longitude data from the takeout <code class="language-plaintext highlighter-rouge">JSON</code> files. I use <code class="language-plaintext highlighter-rouge">jsonlite</code>’s <code class="language-plaintext highlighter-rouge">flatten()</code> to make nested lists of consistent schema into data frames and then invoke <code class="language-plaintext highlighter-rouge">transmute()</code> along with <code class="language-plaintext highlighter-rouge">unnest()</code> and <code class="language-plaintext highlighter-rouge">pivot_longer()</code> where needed to standardize the formatting and create rows with <code class="language-plaintext highlighter-rouge">timestamp</code>, <code class="language-plaintext highlighter-rouge">lng</code>, <code class="language-plaintext highlighter-rouge">lat</code> observations (and some metadata). 
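</p> <p>To make the reshaping step concrete, here is a minimal sketch on a made-up, already flattened table. The tibble <code class="language-plaintext highlighter-rouge">toy_places</code> and all of its values are purely illustrative and do not come from a real takeout file; only the column names follow the fields described above. It shows how <code class="language-plaintext highlighter-rouge">transmute()</code> and <code class="language-plaintext highlighter-rouge">pivot_longer()</code> turn one visit with a start and an end time into two timestamped rows:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(tidyverse)

# hypothetical stand-in for a flattened placeVisit table (values are made up)
toy_places &lt;- tibble(
  duration.startTimestampMs = c('1400000000000', '1400003600000'),
  duration.endTimestampMs = c('1400001800000', '1400007200000'),
  location.latitudeE7 = c(515000000L, 407000000L),
  location.longitudeE7 = c(-1200000L, -740000000L))

toy_places %&gt;%
  transmute(
    type = 'places',
    time_start = as.numeric(duration.startTimestampMs),
    time_end = as.numeric(duration.endTimestampMs),
    lat = location.latitudeE7,
    lng = location.longitudeE7) %&gt;%
  # one row per timestamp: each visit yields a start row and an end row
  pivot_longer(starts_with('time')) %&gt;%
  select(type, time = value, lat, lng)
</code></pre></div></div> <p>Each visit ends up as two rows (start and end) that share the same coordinates, which is the long format that the real function aims for.</p> <p>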
Observations from each part of the <code class="language-plaintext highlighter-rouge">JSON</code> get combined with <code class="language-plaintext highlighter-rouge">bind_rows()</code> into an output <code class="language-plaintext highlighter-rouge">df</code>. I’m only using the places info for now, but the same framework is easily generalizable to activities:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">jsonlite</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">lubridate</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">assertthat</span><span class="p">)</span><span class="w"> </span><span class="c1"># extract location data from takeout JSON</span><span class="w"> </span><span class="n">get_location_from_json</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">fname</span><span class="p">,</span><span class="w"> </span><span class="n">confidence_threshold</span><span class="o">=</span><span class="m">75</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">message</span><span class="p">(</span><span class="n">str_glue</span><span class="p">(</span><span class="s1">'now processing {fname}'</span><span class="p">))</span><span class="w"> </span><span class="n">json</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fromJSON</span><span class="p">(</span><span class="n">txt</span><span class="o">=</span><span 
class="n">fname</span><span class="p">,</span><span class="w"> </span><span class="n">simplifyVector</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">flatten</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="c1"># check JSON sub-lists look okay</span><span class="w"> </span><span class="n">assert_that</span><span class="p">({</span><span class="w"> </span><span class="n">assertthat</span><span class="o">::</span><span class="n">are_equal</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">json</span><span class="p">[[</span><span class="m">1</span><span class="p">]]),</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="n">assertthat</span><span class="o">::</span><span class="n">has_name</span><span class="p">(</span><span class="n">json</span><span class="p">[[</span><span class="m">1</span><span class="p">]],</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'placeVisit'</span><span class="p">,</span><span class="w"> </span><span class="s1">'activitySegment'</span><span class="p">))</span><span class="w"> </span><span class="p">})</span><span class="w"> </span><span class="c1"># get sub-elements with place and activity data</span><span class="w"> </span><span class="n">df_places</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">json</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">pluck</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">pluck</span><span class="p">(</span><span class="s1">'placeVisit'</span><span class="p">)</span><span class="w"> 
</span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">jsonlite</span><span class="o">::</span><span class="n">flatten</span><span class="p">(</span><span class="n">recursive</span><span class="o">=</span><span class="nb">F</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">tibble</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">visitConfidence</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">confidence_threshold</span><span class="p">)</span><span class="w"> </span><span class="c1"># get location from places</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">df_places</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">transmute</span><span class="p">(</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'places'</span><span class="p">,</span><span class="w"> </span><span class="n">time_start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">duration.startTimestampMs</span><span class="p">),</span><span class="w"> </span><span class="n">lat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">location.latitudeE7</span><span class="p">,</span><span class="w"> </span><span class="n">lng</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">location.longitudeE7</span><span class="p">,</span><span class="w"> </span><span class="n">time_end</span><span class="w"> </span><span 
class="o">=</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">duration.endTimestampMs</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># convert to rows with start and end info</span><span class="w"> </span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">starts_with</span><span class="p">(</span><span class="s1">'time'</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">drop_na</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">type</span><span class="p">,</span><span class="w"> </span><span class="n">time</span><span class="o">=</span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w"> </span><span class="n">lng</span><span class="p">)</span><span class="w"> </span><span class="c1"># get raw place coordinates if there are any</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="s1">'simplifiedRawPath.points'</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">df_places</span><span class="p">)</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="o">!</span><span class="nf">all</span><span class="p">(</span><span class="n">map_lgl</span><span class="p">(</span><span class="n">df_places</span><span class="o">$</span><span class="n">simplifiedRawPath.points</span><span class="p">,</span><span class="w"> </span><span class="n">is.null</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w"> 
</span><span class="n">df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">df_places</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">unnest</span><span class="p">(</span><span class="n">simplifiedRawPath.points</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">transmute</span><span class="p">(</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'raw'</span><span class="p">,</span><span class="w"> </span><span class="n">time</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">timestampMs</span><span class="p">),</span><span class="w"> </span><span class="n">lat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">latE7</span><span class="p">,</span><span class="w"> </span><span class="n">lng</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lngE7</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">bind_rows</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="nf">return</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre></div></div> <p>Now we just loop over all <code class="language-plaintext highlighter-rouge">JSON</code> files in our takeout data folder:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span 
class="n">fs</span><span class="p">)</span><span class="w"> </span><span class="c1"># list all files from takeout</span><span class="w"> </span><span class="n">fnames</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fs</span><span class="o">::</span><span class="n">dir_ls</span><span class="p">(</span><span class="w"> </span><span class="n">path</span><span class="o">=</span><span class="s1">'data/Semantic Location History'</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s1">'file'</span><span class="p">,</span><span class="w"> </span><span class="n">recurse</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="c1"># read files</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tibble</span><span class="p">()</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">fname</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">fnames</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">get_location_from_json</span><span class="p">(</span><span class="n">fname</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">bind_rows</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">.</span><span class="p">)</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre></div></div> <h2 id="processing">Processing</h2> <p>We have the raw data, and we’re almost ready to have a look 
at it. However, before we do so, we should change the format of both the timestamps and the coordinates. Timestamps come as UNIX milliseconds. The <code class="language-plaintext highlighter-rouge">lubridate</code> package makes it easy to convert them to a human-readable date time. Latitude and longitude values are stored as integers scaled by a factor of ten million, hinted at by the original field names <code class="language-plaintext highlighter-rouge">latitudeE7</code>/<code class="language-plaintext highlighter-rouge">longitudeE7</code>. Dividing by <code class="language-plaintext highlighter-rouge">1e7</code> returns the more commonly used decimal degrees (°) format.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">magrittr</span><span class="p">)</span><span class="w"> </span><span class="c1"># provides the %&lt;&gt;% assignment pipe</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="n">time</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as_datetime</span><span class="p">(</span><span class="n">time</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">1e3</span><span class="p">),</span><span class="w"> </span><span class="n">lat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">1e7</span><span class="p">,</span><span class="w"> </span><span class="n">lng</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lng</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">1e7</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>Simple points on a 360 x 180 plot would work, but it would be much better to have a polygon map as a frame of reference for our observations. 
I opted for <code class="language-plaintext highlighter-rouge">rnaturalearth</code> because it offered a quick and convenient way to get an <code class="language-plaintext highlighter-rouge">sf</code> country shapefile into R. By the way, if you’re not familiar with <code class="language-plaintext highlighter-rouge">sf</code>: it is a relatively new geospatial R package that provides simple geometry features in a tidyverse compatible form. Check out <a href="https://r-spatial.github.io/sf/">this site</a> which has tutorials, vignettes, presentations, cheat sheets, and a wiki!</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">rnaturalearth</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">rnaturalearthdata</span><span class="p">)</span><span class="w"> </span><span class="n">world</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ne_countries</span><span class="p">(</span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"medium"</span><span class="p">,</span><span class="w"> </span><span class="n">returnclass</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sf"</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <h2 id="visualization">Visualization</h2> <p>Now we can finally create a map of my location history:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">ggplot</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_sf</span><span class="p">(</span><span 
class="n">data</span><span class="o">=</span><span class="n">world</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">lng</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">lat</span><span class="p">),</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="s1">'red'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ggthemes</span><span class="o">::</span><span class="n">theme_map</span><span class="p">()</span><span class="w"> </span><span class="c1"># theme_map() is assumed to come from the ggthemes package</span><span class="w"> </span></code></pre></div></div> <p><img src="https://janlauge.github.io/assets/locationhistory_firstmap.jpeg" alt="First simple map" /></p> <p>This seems to be 99% correct. I can see most observations in the UK and the rest of Europe, where I spent most of my time between 2014 and today. I can also see the trips to the US, New Zealand, China, and Japan accurately. In an earlier version of this map, I had lots of in-flight observations from the North Atlantic near Greenland, the Island of Taiwan, and Indonesia. I got rid of these by excluding activity data altogether. There is also an observation in the South Pacific Ocean, and initially, I had no idea where that was coming from.</p> <p>By manually cross-checking the Maps timeline, I found that it is related to a coastal road stop I did in the very south of New Zealand. The stop got mapped to “South Pacific” as a Google Maps Place with a <code class="language-plaintext highlighter-rouge">visitConfidence</code> of just <code class="language-plaintext highlighter-rouge">54</code>. 
I decided to exclude low-confidence visits (threshold of <code class="language-plaintext highlighter-rouge">75</code>), which solved the issue.</p> <h2 id="country-information">Country Information</h2> <p>The goal of this project was to get a breakdown of time spent in different countries. I mentioned earlier that I was hoping to be able to just extract that information from the address field in the places data but that it didn’t really work. Instead, I chose the slightly more involved route of intersecting coordinates with country shapefiles. We can use the same <code class="language-plaintext highlighter-rouge">sf</code> object from <code class="language-plaintext highlighter-rouge">rnaturalearth</code> that was already loaded above. Let’s create a breakdown of places visited per country.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">sf</span><span class="p">)</span><span class="w"> </span><span class="c1"># convert to sf points</span><span class="w"> </span><span class="n">pnts</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">st_as_sf</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">coords</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'lng'</span><span class="p">,</span><span class="w"> </span><span class="s1">'lat'</span><span class="p">),</span><span class="w"> </span><span class="n">crs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">st_crs</span><span class="p">(</span><span class="s1">'WGS84'</span><span class="p">))</span><span class="w"> </span><span class="c1"># intersect with countries</span><span class="w"> </span><span class="n">pnts_in_countries</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">st_intersection</span><span class="p">(</span><span 
class="n">pnts</span><span class="p">,</span><span class="w"> </span><span class="n">world</span><span class="p">)</span><span class="w"> </span><span class="c1"># breakdown of records per country</span><span class="w"> </span><span class="n">pnts_in_countries</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">iso_a3</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summarize</span><span class="p">(</span><span class="n">n_places</span><span class="o">=</span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">n_places</span><span class="p">))</span><span class="w"> </span><span class="c1"># A tibble: 25 x 2</span><span class="w"> </span><span class="n">iso_a3</span><span class="w"> </span><span class="n">n_places</span><span class="w"> </span><span class="o">&lt;</span><span class="n">chr</span><span class="o">&gt;</span><span class="w"> </span><span class="o">&lt;</span><span class="n">int</span><span class="o">&gt;</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="n">GBR</span><span class="w"> </span><span class="m">11438</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="n">USA</span><span class="w"> </span><span class="m">1336</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="n">DEU</span><span class="w"> </span><span class="m">819</span><span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="n">NZL</span><span 
class="w"> </span><span class="m">376</span><span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="n">HKG</span><span class="w"> </span><span class="m">270</span><span class="w"> </span><span class="m">6</span><span class="w"> </span><span class="n">FRA</span><span class="w"> </span><span class="m">215</span><span class="w"> </span><span class="m">7</span><span class="w"> </span><span class="n">ISR</span><span class="w"> </span><span class="m">101</span><span class="w"> </span><span class="m">8</span><span class="w"> </span><span class="n">JPN</span><span class="w"> </span><span class="m">77</span><span class="w"> </span><span class="m">9</span><span class="w"> </span><span class="n">GRC</span><span class="w"> </span><span class="m">62</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="n">RUS</span><span class="w"> </span><span class="m">61</span><span class="w"> </span><span class="c1"># ... with 15 more rows</span><span class="w"> </span></code></pre></div></div> <p>25 countries sound about right.</p> <h2 id="border-crossings">Border Crossings</h2> <p>Finally, the breakdown with immigration / emigration dates. 
I get those using the <code class="language-plaintext highlighter-rouge">lag()</code> function from <code class="language-plaintext highlighter-rouge">dplyr</code> and comparing the country of the previous place to the country of the current place as below:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># get immigration / emigration events</span><span class="w"> </span><span class="n">pnts_in_countries</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">arrange</span><span class="p">(</span><span class="n">time</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">transmute</span><span class="p">(</span><span class="w"> </span><span class="n">date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as_date</span><span class="p">(</span><span class="n">time</span><span class="p">),</span><span class="w"> </span><span class="n">country</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">iso_a3</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">transmute</span><span class="p">(</span><span class="w"> </span><span class="n">date_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lag</span><span class="p">(</span><span class="n">date</span><span class="p">),</span><span class="w"> </span><span class="n">date_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="n">country_from</span><span class="w"> </span><span class="o">=</span><span 
class="w"> </span><span class="n">lag</span><span class="p">(</span><span class="n">country</span><span class="p">),</span><span class="w"> </span><span class="n">country_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">country</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">country_from</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">country_to</span><span class="p">)</span><span class="w"> </span><span class="c1"># A tibble: 141 x 4</span><span class="w"> </span><span class="n">date_from</span><span class="w"> </span><span class="n">date_to</span><span class="w"> </span><span class="n">country_from</span><span class="w"> </span><span class="n">country_to</span><span class="w"> </span><span class="o">&lt;</span><span class="n">date</span><span class="o">&gt;</span><span class="w"> </span><span class="o">&lt;</span><span class="n">date</span><span class="o">&gt;</span><span class="w"> </span><span class="o">&lt;</span><span class="n">chr</span><span class="o">&gt;</span><span class="w"> </span><span class="o">&lt;</span><span class="n">chr</span><span class="o">&gt;</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">2014-05-10</span><span class="w"> </span><span class="m">2014-05-12</span><span class="w"> </span><span class="n">GBR</span><span class="w"> </span><span class="n">USA</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">2014-05-16</span><span class="w"> </span><span class="m">2014-05-18</span><span class="w"> </span><span class="n">USA</span><span class="w"> </span><span class="n">GBR</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="m">2014-05-29</span><span class="w"> </span><span 
class="m">2014-06-03</span><span class="w"> </span><span class="n">GBR</span><span class="w"> </span><span class="n">DEU</span><span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="m">2014-06-03</span><span class="w"> </span><span class="m">2014-06-04</span><span class="w"> </span><span class="n">DEU</span><span class="w"> </span><span class="n">GBR</span><span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="m">2014-07-09</span><span class="w"> </span><span class="m">2014-07-14</span><span class="w"> </span><span class="n">GBR</span><span class="w"> </span><span class="n">DEU</span><span class="w"> </span><span class="m">6</span><span class="w"> </span><span class="m">2014-07-14</span><span class="w"> </span><span class="m">2014-07-14</span><span class="w"> </span><span class="n">DEU</span><span class="w"> </span><span class="n">GBR</span><span class="w"> </span><span class="m">7</span><span class="w"> </span><span class="m">2014-12-19</span><span class="w"> </span><span class="m">2014-12-20</span><span class="w"> </span><span class="n">GBR</span><span class="w"> </span><span class="n">USA</span><span class="w"> </span><span class="m">8</span><span class="w"> </span><span class="m">2015-01-08</span><span class="w"> </span><span class="m">2015-01-09</span><span class="w"> </span><span class="n">USA</span><span class="w"> </span><span class="n">DEU</span><span class="w"> </span><span class="m">9</span><span class="w"> </span><span class="m">2015-01-10</span><span class="w"> </span><span class="m">2015-01-11</span><span class="w"> </span><span class="n">DEU</span><span class="w"> </span><span class="n">GBR</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="m">2015-03-30</span><span class="w"> </span><span class="m">2015-04-02</span><span class="w"> </span><span class="n">GBR</span><span class="w"> </span><span class="n">DEU</span><span class="w"> </span><span 
class="c1"># ... with 131 more rows</span><span class="w"> </span></code></pre></div></div> <p>Wow, 141 border crossings in total. That’s a bit more than I would have thought. Then again, just driving from London to Cologne gets you three events (GBR -&gt; FRA -&gt; BEL -&gt; DEU).</p> <h2 id="conclusion">Conclusion</h2> <p>You can see how this would have taken me ages to do from old emails and flight records, and I would have probably missed some trips!</p> <p>DISCLAIMER: Use this code at your own risk. I do not guarantee correctness, especially not when applying it to your own data. Please DOUBLE-CHECK MY WORK and let me know if you run into any issues. Some caveats that I am aware of: exact arrival dates can be inaccurate. Google Maps might not register a place on your timeline on the day of departure or right after arrival. A good way to spot this is to look for records where <code class="language-plaintext highlighter-rouge">date_from</code> and <code class="language-plaintext highlighter-rouge">date_to</code> differ (see the first row in my data). Also, I think that all dates and times are in GMT, so the day you entered or left a given country may be off by one relative to local time. I might return later to work on that further.</p> <p>Thank you so much for reading! Let me know in the comments below if you found it helpful and what you would like to read about next!</p> Mon, 05 Apr 2021 14:30:00 +0000 https://janlauge.github.io/2021/google_timeline_travel_history/ DataScience Coding Tools R Geospatial Building Our Own Open Source Supercomputer with R and AWS <p><strong>How to build a scalable computing cluster on AWS and run hundreds or thousands of models in a short amount of time. We will rely entirely on R and open source R packages. 
This is post 1 out of 2.</strong></p> <!--more--> <h2 id="introduction">Introduction</h2> <p>An ever-increasing number of businesses are moving to the cloud and using platforms such as <a href="https://aws.amazon.com/">Amazon Web Services</a> (AWS) for their data infrastructure. This is convenient for Data Scientists like myself because this convergence of tools means that my knowledge from previous jobs becomes much more applicable to a new role and I can hit the ground running.</p> <p>Lately I have become very excited about the <a href="https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html"><code class="language-plaintext highlighter-rouge">future</code></a> package and how it makes the scaling of computational tasks easy and intuitive. The basic idea of the <code class="language-plaintext highlighter-rouge">future</code> package is to make your code infrastructure-independent. You specify your tasks, and the <code class="language-plaintext highlighter-rouge">future</code> execution plan decides how to run the calculations.</p> <p>I wanted to see what we could do with <code class="language-plaintext highlighter-rouge">future</code> and other open source R packages such as <a href="https://rdrr.io/github/cloudyr/aws.ec2/f/README.md"><code class="language-plaintext highlighter-rouge">aws.ec2</code></a> by <a href="http://cloudyr.github.io/packages/">cloudyR</a>, <a href="https://ropensci.org/technotes/2018/06/12/ssh-02/"><code class="language-plaintext highlighter-rouge">ssh</code></a> by <a href="https://ropensci.org/">rOpenSci</a>, <a href="https://cran.r-project.org/web/packages/remoter/vignettes/remoter.pdf"><code class="language-plaintext highlighter-rouge">remoter</code></a> by Drew Schmidt, and last but not least <a href="https://davisvaughan.github.io/furrr/"><code class="language-plaintext highlighter-rouge">furrr</code></a> by Davis Vaughan.</p> <p>The basic idea:</p> <ul> <li>use R and AWS to spin up our own cloud compute cluster</li> <li>log in to the head node and define a computationally 
expensive task</li> <li>farm this task out to a number of worker nodes in our cluster</li> <li>do all of this WITHOUT having to switch between RStudio, RStudioServer, the command line, the AWS console, etc.</li> </ul> <p>Why do I care about the last point? Well, Data Science is a science and should rely on the <a href="https://en.wikipedia.org/wiki/Scientific_method">Scientific Method</a>. One core component of the Scientific Method is reproducibility, and one of the best ways to keep your Data Science workflow reproducible is to write code that can run start to finish without any user intervention. This also allows for greater applicability in the future because you can re-use your previous data product or service in the next project without retracing manual steps. Don’t just take my word for it, here is another great Hadley Wickham video in which he stresses the same point:</p> <iframe width="560" height="315" src="https://www.youtube.com/embed/cpbtcsGE0OA" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe> <p>So without further ado, let’s get started implementing that bullet point list!</p> <h2 id="preparation">Preparation</h2> <p>There are a few basic requirements that need to be in place:</p> <ol> <li>an active AWS account.</li> <li>an Amazon Machine Image (<a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html"><code class="language-plaintext highlighter-rouge">AMI</code></a>) with <code class="language-plaintext highlighter-rouge">R</code>, <code class="language-plaintext highlighter-rouge">remoter</code>, <code class="language-plaintext highlighter-rouge">tidyverse</code>, <code class="language-plaintext highlighter-rouge">future</code>, and <code class="language-plaintext highlighter-rouge">furrr</code> installed.</li> <li>a working <code class="language-plaintext highlighter-rouge">ssh</code> key pair on your local machine and the <code class="language-plaintext 
highlighter-rouge">AMI</code> that allows you to ssh into and between your <code class="language-plaintext highlighter-rouge">ec2</code> instances.</li> </ol> <p>Detailed instructions on how to fulfil these basic requirements are beyond the scope of this post. You can find more information in the articles linked below.</p> <ul> <li><a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html">What Is Amazon EC2?</a></li> <li><a href="https://aws.amazon.com/blogs/big-data/running-r-on-aws/">Running R on AWS</a></li> <li><a href="http://cloudyr.github.io/">The CloudyR project</a></li> </ul> <h2 id="setup">Setup</h2> <p>Load the required packages. Also make sure your AWS access credentials are set. I do this using <code class="language-plaintext highlighter-rouge">Sys.setenv</code>. There are other ways, but I found that this works best for me. We also specify the <code class="language-plaintext highlighter-rouge">AMI</code> ID and the instance type (this is a good <a href="https://aws.amazon.com/ec2/instance-types/">overview</a>; I am using <code class="language-plaintext highlighter-rouge">t2.micro</code> here because it is eligible for the AWS free tier). 
If you have any problems with this step, double-check that the region set in <code class="language-plaintext highlighter-rouge">Sys.setenv</code> matches the region of your AMI.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">aws.ec2</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">ssh</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">remoter</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">future</span><span class="p">)</span><span class="w"> </span><span class="c1"># set access credentials</span><span class="w"> </span><span class="n">aws_access</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">aws.signature</span><span class="o">::</span><span class="n">locate_credentials</span><span class="p">()</span><span class="w"> </span><span class="n">Sys.setenv</span><span class="p">(</span><span class="w"> </span><span class="s2">"AWS_ACCESS_KEY_ID"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aws_access</span><span class="o">$</span><span class="n">key</span><span class="p">,</span><span class="w"> </span><span class="s2">"AWS_SECRET_ACCESS_KEY"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aws_access</span><span class="o">$</span><span class="n">secret</span><span class="p">,</span><span class="w"> </span><span class="s2">"AWS_DEFAULT_REGION"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aws_access</span><span class="o">$</span><span class="n">region</span><span class="w"> </span><span class="p">)</span><span class="w"> </span><span 
class="c1"># set parameters</span><span class="w"> </span><span class="n">aws_ami</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"ami-06485bfe40a86470d"</span><span class="w"> </span><span class="n">aws_describe</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">describe_images</span><span class="p">(</span><span class="n">aws_ami</span><span class="p">)</span><span class="w"> </span><span class="n">aws_type</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"t2.micro"</span><span class="w"> </span></code></pre></div></div> <p>Ready for launch!</p> <h2 id="boot-and-connect">Boot and Connect</h2> <p>We can now fire up our head-node instance.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ec2inst</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">run_instances</span><span class="p">(</span><span class="w"> </span><span class="n">image</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aws_ami</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aws_type</span><span class="p">)</span><span class="w"> </span><span class="c1"># wait for boot, then refresh description</span><span class="w"> </span><span class="n">Sys.sleep</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="n">ec2inst</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">describe_instances</span><span class="p">(</span><span class="n">ec2inst</span><span class="p">)</span><span class="w"> </span><span class="c1"># get IP address of the instance</span><span class="w"> </span><span 
class="n">ec2ip</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">get_instance_public_ip</span><span class="p">(</span><span class="n">ec2inst</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>The instance should be running and we can connect to it via ssh in bash. That works, but personally I’d prefer to stay in RStudio instead of switching to the command line. This is where <code class="language-plaintext highlighter-rouge">remoter</code> and <code class="language-plaintext highlighter-rouge">ssh</code> come in. We can establish an ssh connection straight from our R session and use that to launch the <code class="language-plaintext highlighter-rouge">remoter::server</code> on our instance. By using the <code class="language-plaintext highlighter-rouge">future</code> package to run the ssh command we keep our interactive RStudio session free and can subsequently use it to connect to the instance with <code class="language-plaintext highlighter-rouge">remoter</code>.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ssh connection</span><span class="w"> </span><span class="n">username</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">system</span><span class="p">(</span><span class="s2">"whoami"</span><span class="p">,</span><span class="w"> </span><span class="n">intern</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ssh_connect</span><span class="p">(</span><span class="n">host</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste</span><span class="p">(</span><span class="n">username</span><span class="p">,</span><span class="w"> </span><span 
class="n">ec2ip</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"@"</span><span class="p">))</span><span class="w"> </span><span class="c1"># random temporary password (generate_password() is a helper that must be defined separately; it is not shown here)</span><span class="w"> </span><span class="n">random_tmp_password</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">generate_password</span><span class="p">()</span><span class="w"> </span><span class="c1"># CMD string to start remoter::server on instance</span><span class="w"> </span><span class="n">r_cmd_start_remoter</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">str_c</span><span class="p">(</span><span class="w"> </span><span class="s2">"sudo Rscript -e "</span><span class="p">,</span><span class="w"> </span><span class="s2">"'remoter::server("</span><span class="p">,</span><span class="w"> </span><span class="s2">"port = 55555, "</span><span class="p">,</span><span class="w"> </span><span class="s2">"password = %pwd, "</span><span class="p">,</span><span class="w"> </span><span class="s2">"showmsg = TRUE)'"</span><span class="p">,</span><span class="w"> </span><span class="n">collapse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">str_replace</span><span class="p">(</span><span class="s2">"%pwd"</span><span class="p">,</span><span class="w"> </span><span class="n">str_c</span><span class="p">(</span><span class="s1">'"'</span><span class="p">,</span><span class="w"> </span><span class="n">random_tmp_password</span><span class="p">,</span><span class="w"> </span><span class="s1">'"'</span><span class="p">))</span><span class="w"> </span><span class="c1"># connect and execute</span><span 
class="w"> </span><span class="n">plan</span><span class="p">(</span><span class="n">multicore</span><span class="p">)</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">future</span><span class="p">(</span><span class="w"> </span><span class="n">ssh_exec_wait</span><span class="p">(</span><span class="w"> </span><span class="n">session</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">con</span><span class="p">,</span><span class="w"> </span><span class="n">command</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r_cmd_start_remoter</span><span class="p">))</span><span class="w"> </span><span class="n">remoter</span><span class="o">::</span><span class="n">client</span><span class="p">(</span><span class="w"> </span><span class="n">addr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ec2ip</span><span class="p">,</span><span class="w"> </span><span class="n">port</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">55555</span><span class="p">,</span><span class="w"> </span><span class="n">password</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">random_tmp_password</span><span class="p">,</span><span class="w"> </span><span class="n">prompt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"remote"</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>Et voilà! We are connected to our remote head node and can run R code in the cloud without ever leaving the comfort of RStudio. And the amazing bit: all of this took me about a day to set up from scratch!</p> <p>I will leave it here for now. 
In the next post we will dive into the details of how to scale up the approach above to create an AWS cloud computing cluster. This approach is extremely powerful for embarrassingly parallel problems (which are actually not embarrassing at all, I swear!).</p> <p>As always, I hope it is useful for you. I’d very much appreciate any thoughts, comments, and feedback, so write me a message below or get in touch via Twitter!</p> Sun, 03 Feb 2019 18:00:00 +0000 https://janlauge.github.io/2019/building-our-own-open-source-supercomputer-with-R/ DataScience MachineLearning CloudComputing R Nesting Birds and Models in R Dataframes <p><strong>R Dataframes in the tidyverse are more than just simple tables these days. They can store complex information in list columns, and this becomes an immensely powerful framework when we use it to apply methods to different sets of data in parallel. In this article I illustrate this approach using data for a rare UK bird species to investigate if its distribution has been impacted by climate change.</strong></p> <!--more--> <h1 id="motivation">Motivation</h1> <p>After recently seeing a Hadley Wickham lecture on nested models I became incredibly excited about nested dataframes with S3 objects in list columns again. Here is the video I am talking about:</p> <iframe width="560" height="315" src="https://www.youtube.com/embed/rz3_FDVt9eg" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe> <p>Hadley uses this approach for data exploration but I think it is also very powerful for iterative workflows and for experimentation or hypothesis testing on large datasets. For example, when working on my PhD thesis I was routinely fitting hundreds of machine learning models at once. 
All models used the same predictor set and only varied in hyperparameters as well as label data. Yet, I had to run them in separate parallel processes and load the data into each of these. Moreover, when capturing results I often looked to the list class for help. This did the job but also meant that I had to be very careful about which results belonged to which data, which hyperparameters, and which model object.</p> <p>Enter nested dataframes. They still rely on the list class, but they nicely organise the corresponding data elements together, in accordance with the <a href="http://vita.had.co.nz/papers/tidy-data.html">tidy data framework</a>.</p> <h1 id="data">Data</h1> <p>I decided to explore this framework hands-on, using a small exemplary case study in the domain of species distribution modelling. This is the kind of model I mentioned earlier. For this type of modelling task we need species occurrence data (our “label”, “response”, or Y) and climatic variables (the “predictors”, or X).</p> <h2 id="species-data">Species Data</h2> <p>After browsing the web for a suitable case study species for a while I decided on the Scottish Crossbill (<strong>Loxia scotica</strong>). This is a small passerine bird that inhabits the Caledonian Forests of Scotland, and is the only terrestrial vertebrate species unique to the United Kingdom. Only about 20,000 individuals of this species are alive today.</p> <p>Getting species occurrence data used to be the main challenge in Biogeography. Natural Historians such as Charles Darwin and Alexander von Humboldt would travel for years on rustic sailing ships around the globe collecting specimens. Today, we are standing on the shoulders of giants. 
Getting data is fast and easy thanks to the work of two organisations:</p> <ul> <li> <p><a href="https://www.gbif.org/">the Global Biodiversity Information Facility (GBIF)</a>, an international network and research infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth. We will use their data in this project.</p> </li> <li> <p><a href="https://ropensci.org/">rOpenSci</a>, a non-profit initiative that has developed an ecosystem of open source tools, runs annual unconferences, and reviews community developed software. They provide an R package called <code class="language-plaintext highlighter-rouge">rgbif</code> that I once made a humble contribution to. It is essentially a wrapper around the GBIF API that will help us pull the species data straight into R.</p> </li> </ul> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">rgbif</span><span class="p">)</span><span class="w"> </span><span class="c1"># get the database id ("key") for the Scottish Crossbill</span><span class="w"> </span><span class="n">speciesKey</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">name_backbone</span><span class="p">(</span><span class="s1">'Loxia scotica'</span><span class="p">)</span><span class="o">$</span><span class="n">speciesKey</span><span class="w"> </span><span class="c1"># get the occurrence records of this species</span><span class="w"> </span><span class="n">gbif_response</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">occ_search</span><span class="p">(</span><span class="w"> </span><span class="n">scientificName</span><span class="w"> </span><span 
class="o">=</span><span class="w"> </span><span class="s2">"Loxia scotica"</span><span class="p">,</span><span class="w"> </span><span class="n">country</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"GB"</span><span class="p">,</span><span class="w"> </span><span class="n">hasCoordinate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">hasGeospatialIssue</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">limit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">9999</span><span class="p">)</span><span class="w"> </span><span class="c1"># backup to reduce API load</span><span class="w"> </span><span class="n">write_rds</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gbif_response</span><span class="p">,</span><span class="w"> </span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">here</span><span class="o">::</span><span class="n">here</span><span class="p">(</span><span class="s1">'gbif_occs_loxsco.rds'</span><span class="p">))</span><span class="w"> </span></code></pre></div></div> <p>GBIF and rOpenSci just saved us years of roaming around the highlands with a pair of binoculars, camping in mud, rain, and snow, and chasing crossbills through the forest. Nevertheless, it is still up to us to make sense of the data we got back, in particular to clean it, as data collected on this large scale can have its own issues. Luckily, GBIF provides some useful metadata on each record. 
Here, I will exclude those that</p> <ul> <li>are not tagged as “present” (they may be artifacts from collections)</li> <li>have flagged issues (somebody has noticed something abnormal with them)</li> <li>are not under a Creative Commons license (we could not use them here)</li> <li>are older than 1965</li> </ul> <p>After cleaning the data we use <code class="language-plaintext highlighter-rouge">tidyr::nest()</code> to aggregate the data by decade.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">lubridate</span><span class="p">)</span><span class="w"> </span><span class="n">birds_clean</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gbif_response</span><span class="o">$</span><span class="n">data</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># get decade of record from eventDate</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">decade</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">eventDate</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">ymd_hms</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">round_date</span><span class="p">(</span><span class="s2">"10y"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">year</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">())</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># clean data using metadata filters</span><span class="w"> </span><span 
class="n">filter</span><span class="p">(</span><span class="w"> </span><span class="c1"># only creative commons license records</span><span class="w"> </span><span class="n">str_detect</span><span class="p">(</span><span class="n">license</span><span class="p">,</span><span class="w"> </span><span class="s2">"http://creativecommons.org/"</span><span class="p">)</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="c1"># only records with no issues</span><span class="w"> </span><span class="n">issues</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">""</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="c1"># no records before 1965</span><span class="w"> </span><span class="n">decade</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="m">1970</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="c1"># no records after 2015 (there is not a lot of data yet)</span><span class="w"> </span><span class="n">decade</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">2020</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># retain only relevant variables</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">decimalLongitude</span><span class="p">,</span><span class="w"> </span><span class="n">decimalLatitude</span><span class="p">,</span><span class="w"> </span><span class="n">decade</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">arrange</span><span class="p">(</span><span class="n">decade</span><span class="p">)</span><span class="w"> </span><span class="n">birds_nested</span><span class="w"> </span><span 
class="o">&lt;-</span><span class="w"> </span><span class="n">birds_clean</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># define the nesting index</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">decade</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># aggregate data in each group</span><span class="w"> </span><span class="n">nest</span><span class="p">()</span><span class="w"> </span><span class="c1"># let's have a look</span><span class="w"> </span><span class="n">glimpse</span><span class="p">(</span><span class="n">birds_nested</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p><img src="https://janlauge.github.io/assets/nesting_nesdted_df_1.png" alt="Nested Dataframe" /></p> <h2 id="climate-data">Climate data</h2> <p>For the UK the MetOffice had some <a href="https://www.metoffice.gov.uk/climate/uk/data/ukcp09">nice climatic datasets available</a>. They were in a horrible format (CSV with timesteps, variable types, and geospatial information spread across rows, columns, and file partitions) but I managed to transform them into something useable. 
The details of this are beyond the scope of this post, but if you are interested in the code for that you can check it out <a href="https://github.com/JanLauGe/ds-personal-projects/blob/master/datacamp_gbif/00_data_loader.R">here</a>.</p> <p>The final rasters look like this: <img src="https://janlauge.github.io/assets/nesting_climate_plot.png" alt="Climate Plot" /></p> <h1 id="modelling">Modelling</h1> <p>We’ll split the data into a training and a test set, using all data collected between 2005 and 2015 as a true temporal holdout.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># last pre-processing step</span><span class="w"> </span><span class="n">df_modelling</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">df_nested</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># get into modelling format</span><span class="w"> </span><span class="n">unnest</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># caret requires a factor response variable for classification</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">presence</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="w"> </span><span class="n">presence</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"presence"</span><span class="p">,</span><span class="w"> </span><span class="n">presence</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"absence"</span><span 
class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">factor</span><span class="p">())</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># drop all observations with NA variables</span><span class="w"> </span><span class="n">na.omit</span><span class="p">()</span><span class="w"> </span><span class="c1"># create a training set for the model build</span><span class="w"> </span><span class="n">df_train</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">df_modelling</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># true temporal split as holdout</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">decade</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s2">"2010"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># drop decade, it's not needed anymore</span><span class="w"> </span><span class="n">dplyr</span><span class="o">::</span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">decade</span><span class="p">)</span><span class="w"> </span><span class="c1"># same steps for test set</span><span class="w"> </span><span class="n">df_test</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">df_modelling</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">decade</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"2010"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">dplyr</span><span class="o">::</span><span 
class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">decade</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>Species responses to environmental variables are often non-linear. For example, a species usually can’t survive if it is too cold, but it can’t deal with too much heat either. It needs the “sweet spot” in the middle. Linear models like a glm are not very useful under these circumstances. On the other hand, algorithms such as random forest can easily overfit to this kind of data. I therefore decided to test a regularised random forest (RRF) as introduced by <a href="https://arxiv.org/pdf/1306.0237.pdf">Deng (2013)</a>, hoping that it would offer just the right ratio of bias vs variance for this use case.</p> <p>Caret makes the model fitting incredibly easy! All we need to do is specify a tuning grid of hyperparameters that we want to optimise, a tune control that adjusts the number of iterations and the loss function used, and then call train with the algorithm we have picked.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">RRF</span><span class="p">)</span><span class="w"> </span><span class="c1"># for reproducibility</span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">12345</span><span class="p">)</span><span class="w"> </span><span class="c1"># set up model fitting parameters</span><span class="w"> </span><span class="c1"># tuning grid, trying every possible combination</span><span class="w"> </span><span class="n">tuneGrid</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="w"> </span><span class="n">mtry</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span 
class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="m">9</span><span class="p">),</span><span class="w"> </span><span class="n">coefReg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="m">.03</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="m">.3</span><span class="p">,</span><span class="w"> </span><span class="m">.7</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">coefImp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">.0</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="m">.3</span><span class="p">,</span><span class="w"> </span><span class="m">.6</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="n">tuneControl</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">trainControl</span><span class="p">(</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'repeatedcv'</span><span class="p">,</span><span class="w"> </span><span class="n">classProbs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">number</span><span class="w"> </span><span class="o">=</span><span class="w"> 
</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">repeats</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">verboseIter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">summaryFunction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">twoClassSummary</span><span class="p">)</span><span class="w"> </span><span class="c1"># actual model build</span><span class="w"> </span><span class="n">model_fit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">train</span><span class="p">(</span><span class="w"> </span><span class="n">presence</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df_train</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"RRF"</span><span class="p">,</span><span class="w"> </span><span class="n">metric</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ROC"</span><span class="p">,</span><span class="w"> </span><span class="n">tuneGrid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tuneGrid</span><span class="p">,</span><span class="w"> </span><span class="n">trControl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tuneControl</span><span class="p">)</span><span class="w"> </span><span class="n">plot</span><span 
class="p">(</span><span class="n">model_fit</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>We can evaluate the performance of this model on our hold-out data from 2005 to 2015. Just as during training, we are using the Area under the Receiver Operating Characteristic curve (AUC). With this metric, a model no better than random would score 0.5 while a perfect model making no mistakes would score 1.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># combine prediction with validation set</span><span class="w"> </span><span class="n">df_eval</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data_frame</span><span class="p">(</span><span class="w"> </span><span class="s2">"obs"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df_test</span><span class="o">$</span><span class="n">presence</span><span class="p">,</span><span class="w"> </span><span class="s2">"pred"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="w"> </span><span class="n">object</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model_fit</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df_test</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"prob"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">pull</span><span class="p">(</span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="c1"># get ROC value</span><span class="w"> 
</span><span class="n">library</span><span class="p">(</span><span class="n">yardstick</span><span class="p">)</span><span class="w"> </span><span class="n">roc_auc_vec</span><span class="p">(</span><span class="n">estimator</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"binary"</span><span class="p">,</span><span class="w"> </span><span class="n">truth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df_eval</span><span class="o">$</span><span class="n">obs</span><span class="p">,</span><span class="w"> </span><span class="n">estimate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df_eval</span><span class="o">$</span><span class="n">pred</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>Now we can combine the raw data, model performance, and predictions all in one nested dataframe. We can save this for later to make sure we always know what data was used to build which model.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_eval</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">df_modelling</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">decade</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">nest</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># combine with climate data</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="n">climate_nested</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span 
class="w"> </span><span class="s2">"decade"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># evaluate by decade</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="s2">"obs"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">map</span><span class="p">(</span><span class="w"> </span><span class="n">.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.x</span><span class="o">$</span><span class="n">presence</span><span class="p">),</span><span class="w"> </span><span class="s2">"pred"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">map</span><span class="p">(</span><span class="w"> </span><span class="n">.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model_fit</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.x</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"prob"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">pull</span><span class="p">(</span><span class="s2">"presence"</span><span class="p">)),</span><span class="w"> </span><span class="s2">"auc"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span 
class="n">map2_dbl</span><span class="p">(</span><span class="w"> </span><span class="n">.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">obs</span><span class="p">,</span><span class="w"> </span><span class="n">.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">roc_auc_vec</span><span class="p">(</span><span class="w"> </span><span class="n">estimator</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"binary"</span><span class="p">,</span><span class="w"> </span><span class="n">truth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.x</span><span class="p">,</span><span class="w"> </span><span class="n">estimate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.y</span><span class="p">)),</span><span class="w"> </span><span class="s2">"climate_data"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">map</span><span class="p">(</span><span class="w"> </span><span class="n">.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">raster_stacks</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">as</span><span class="p">(</span><span class="n">.x</span><span class="p">,</span><span class="w"> </span><span class="s2">"SpatialPixelsDataFrame"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as_data_frame</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">na.omit</span><span class="p">()),</span><span class="w"> </span><span 
class="s2">"habitat_suitability"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">map</span><span class="p">(</span><span class="w"> </span><span class="n">.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">climate_data</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model_fit</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.x</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"prob"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">pull</span><span class="p">(</span><span class="s2">"presence"</span><span class="p">))</span><span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="n">df_eval</span><span class="w"> </span></code></pre></div></div> <h1 id="conclusion">Conclusion</h1> <p>Let’s look at the change over time using <code class="language-plaintext highlighter-rouge">gganimate</code>. Unfortunately, we can see that the suitable area for the species in the UK is drastically decreasing after 1985. Not all species are negatively affected by climate change but many are. And this is just one of the many unintended consequences of our impact on planet earth.</p> <p><img src="https://janlauge.github.io/assets/nesting_change_animation.png" alt="Change Animation" /></p> <p>I hope that you enjoyed this blog post despite our pessimistic findings. As you can see nested dataframes with list columns are immensely powerful in a range of situations. I will certainly use them a lot more in the future. 
Please let me know in the comments if you are, too!</p> Sat, 15 Dec 2018 09:30:00 +0000 https://janlauge.github.io/2018/nesting-models-in-R-data-frames/ DataScience MachineLearning Tidyverse R Data Science Machine and Command Line Setup <p><strong>Data Scientists require a very particular toolset for their everyday tasks, but unlike software developers, few of them spend a lot of time optimising this toolset for their specific needs. I compiled a simple step-by-step guide that helps to automate the process of setting up a brand new data science machine and making it work for you by customising the command prompt and using a dotfile approach to manage configuration, identity, and access information. This gets you from zero to Data Science in minutes on MacOS.</strong></p> <!--more--> <p>I’ve had to set up new data science laptops twice in the last couple of months and got frustrated with the tedious setup procedures. Installing libraries, customising settings, how do I switch RStudio to night mode again? Moreover, I have two new starters joining my team in the coming weeks which means that more system setups are just around the corner. So I decided to compile a guide with scripts and commands that make this process smoother and faster.</p> <p>There are many things to be said for an automatic setup over manual installation: speed, reproducibility, a standardised configuration between all team members, and the opportunity for programmatic customisation. 
Among software developers this approach, called <code class="language-plaintext highlighter-rouge">.dotfile</code> configuration, is common practice and great introductions are available <a href="https://medium.com/@webprolific/getting-started-with-dotfiles-43c3602fd789">here</a> and <a href="https://medium.freecodecamp.org/dive-into-dotfiles-part-1-e4eb1003cff6">here</a>. However, so far I have only rarely encountered it on data science teams. This is despite the fact that data scientists frequently work with complex statements at the command line, have to pay particular attention to system setup to ensure reproducibility of their experiments, use version control, and commonly deal with data from a wide range of sources, many of which will require API tokens or access credentials. So think of this as a data science specific <code class="language-plaintext highlighter-rouge">dotfile</code> setup. There are three main components to this approach:</p> <ol> <li>use command line tools and package managers instead of graphical installers to automate first-time system setup, because this is faster, more reproducible, and more easily maintainable.</li> <li>set up a beautiful, efficient, and powerful command line configuration, because it will make everyday tasks easier, because it’s awesome and because we can!</li> <li>create a <code class="language-plaintext highlighter-rouge">.dotfile</code> repository that saves settings, application preferences, API keys, and access tokens, because it is more convenient and more secure than glueing post-its to our monitor or hard-coding passwords and tokens into our code that is then pushed to <a href="https://github.com/JanLauGe">GitHub</a>.</li> </ol> <p>Most parts of this article can be used in isolation, so unlike the British Prime Minister you are free to “cherry-pick” if you are so inclined.</p> <p>I am assuming here that you’re using MacOS. Parts of it may be transferable to a Linux machine, but much of it will need modification. 
If you’re on Windows… good luck! It may work with the new <a href="https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux">Ubuntu for Windows</a>? If you get a chance to test this, please let me know in the comment section below.</p> <h3 id="initial-setup">Initial Setup</h3> <p>We start off by installing <a href="https://brew.sh/">Homebrew</a>, the “missing package manager for MacOS”! This bit actually requires some user input (&lt;yes&gt; and &lt;password&gt;), so we will split that from the rest of the basic installations.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/bin/ruby <span class="nt">-e</span> <span class="s2">"</span><span class="si">$(</span>curl <span class="nt">-fsSL</span> https://raw.githubusercontent.com/Homebrew/install/master/install<span class="si">)</span><span class="s2">"</span>
</code></pre></div></div> <p>Once that is done we can use <code class="language-plaintext highlighter-rouge">homebrew</code> for some additional household essentials.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># we will need these later</span>
brew <span class="nb">install </span>wget htop git git-lfs libgit2 keychain
<span class="c"># I like these, so I'll install them here as well</span>
brew cask <span class="nb">install </span>google-chrome atom slack vlc spotify dropbox
<span class="c"># You can launch and configure apps like this</span>
open ~/Applications/Dropbox.app
<span class="c"># install gcc and java,</span>
<span class="c"># a lot of the data science tools we will install later depend on them</span>
<span class="c"># (some of these may require your password again)</span>
brew <span class="nb">install </span>gcc
brew tap caskroom/versions
brew cask <span class="nb">install </span>java
brew cask <span class="nb">install </span>java8
brew <span class="nb">install </span>jenv
</code></pre></div></div> <h3 
id="powerlevel9k-command-line">Powerlevel9k Command Line</h3> <p>Now it’s time to beef up our command line. This is something that many software developers and engineers spend a lot of time on, to the point where some are holding competitions to show off their great shells. Many Data Scientists, on the other hand, seem to neglect command line customisation. I think that this is a mistake. Let me convince you by highlighting some of the neat extra features that we can add with a little bit of extra setup effort:</p> <ul> <li>beautiful command prompt</li> <li>syntax highlighting</li> <li>auto completion</li> <li>read/write flags</li> <li>execution timing</li> <li>git support with repo status tracking</li> </ul> <p><img src="https://janlauge.github.io/assets/bashsetup_powerlevel9k.gif" alt="example powerlevel9k prompt" /></p> <p>The most nerdy set of productivity tools on the block! To make this work we will need <a href="https://www.iterm2.com/">iTerm2</a> and <a href="http://www.zsh.org/">zsh</a>. iTerm2 is a macOS terminal replacement with many additional features, such as more display customisation, better hotkeys, and fantastic split pane functionality. Zsh is a shell designed for interactive use. It works particularly well with <a href="https://ohmyz.sh/"><code class="language-plaintext highlighter-rouge">oh-my-zsh</code></a>, a configuration tool that helps with setting up everything just the way we like it. 
They have great stickers, too ;)</p> <p>While we’re at it we will also install the Powerline terminal fonts, which will be needed for <a href="https://github.com/bhilburn/powerlevel9k">powerlevel9k</a>, the zsh theme of my choosing.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># install iTerm2</span>
brew cask <span class="nb">install </span>iterm2
<span class="c"># install zsh</span>
brew <span class="nb">install </span>zsh
<span class="c"># get oh-my-zsh configuration tool</span>
sh <span class="nt">-c</span> <span class="s2">"</span><span class="si">$(</span>curl <span class="nt">-fsSL</span> https://raw.github.com/robbyrussell/oh-my-zsh/master/tools/install.sh<span class="si">)</span><span class="s2">"</span>
<span class="c"># (this may require your password again)</span>
<span class="c"># get powerlevel9k theme for zsh</span>
git clone https://github.com/bhilburn/powerlevel9k.git ~/.oh-my-zsh/custom/themes/powerlevel9k
<span class="c"># and the corresponding font (URL quoted so that zsh does not treat the ? as a glob)</span>
wget <span class="nt">-O</span> /Library/Fonts/font_sourcecodepro_powerline_awesomeregular.ttf <span class="s2">"https://github.com/Falkor/dotfiles/blob/master/fonts/SourceCodePro+Powerline+Awesome+Regular.ttf?raw=true"</span>
</code></pre></div></div> <p>The oh-my-zsh installation script changes your default shell to zsh and creates the file <code class="language-plaintext highlighter-rouge">.zshrc</code>. Just like <code class="language-plaintext highlighter-rouge">.bash_profile</code> for bash, this file is automatically sourced when a new zsh session is launched. From now on you should always use <code class="language-plaintext highlighter-rouge">.zshrc</code> instead of <code class="language-plaintext highlighter-rouge">.bash_profile</code>, for example when setting a new standard conda environment. 
Notice that <code class="language-plaintext highlighter-rouge">.zshrc</code> comes with a lot of options that are commented out. Feel free to go through the file and uncomment the modifications that may be of interest to you.</p> <p>You should also add iTerm2 to the dock bar and/or assign a hot key of your choosing. Change the colour scheme (<code class="language-plaintext highlighter-rouge">Menu bar</code> &gt; <code class="language-plaintext highlighter-rouge">Profiles</code> &gt; <code class="language-plaintext highlighter-rouge">Open Profiles...</code> &gt; <code class="language-plaintext highlighter-rouge">Select "Default"</code> &gt; <code class="language-plaintext highlighter-rouge">Edit Profiles...</code>) as you see fit. Definitely change the font to <code class="language-plaintext highlighter-rouge">SourceCodePro+Powerline+Awesome Regular</code>. This last step is important as <strong>POWERLEVEL9K WON’T WORK PROPERLY WITHOUT THIS</strong> and you will end up with cryptic symbols on your prompt instead.</p> <p>If you don’t have strong feelings about colour style preference, feel free to use my profile template. You can install it as a dynamic profile with the command below. DynamicProfiles enable you to share your preferences between different machines. You can create your own by exporting your profile from the profile menu to a JSON file and copying it to the same location:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># copy the profile settings for iTerm2 to DynamicProfiles folder</span>
<span class="c"># (-O needs a file path, and ?raw=true fetches the file itself rather than the GitHub page)</span>
wget <span class="nt">-O</span> ~/Library/Application<span class="se">\ </span>Support/iTerm2/DynamicProfiles/iterm_profile.json <span class="s2">"https://github.com/JanLauGe/.dotfiles/blob/master/iterm_profile.json?raw=true"</span>
</code></pre></div></div> <p>There is a wide range of plugins available for iTerm2 and zsh. I automatically add a few that I find useful by installing them with homebrew. 
Afterwards I add them to the <code class="language-plaintext highlighter-rouge">.zshrc</code> configuration file with <code class="language-plaintext highlighter-rouge">sed</code> or by appending (<code class="language-plaintext highlighter-rouge">&gt;&gt;</code>) a string to the end of the file.</p> <p>In case you’re unfamiliar with these commands: <code class="language-plaintext highlighter-rouge">sed</code> looks for a string in a file using a <a href="https://regexr.com/">regular expression</a> and replaces the found string with a replacement string. The in-place flag <code class="language-plaintext highlighter-rouge">-i ''</code> is specific to the BSD version of <code class="language-plaintext highlighter-rouge">sed</code> that ships with macOS and tells <code class="language-plaintext highlighter-rouge">sed</code> to overwrite the old file with the updated version without keeping a backup. The <code class="language-plaintext highlighter-rouge">&gt;&gt;</code> operator appends to a file or creates the file if it doesn’t exist.</p> <p>Side note: Alternatively, we could just copy a pre-existing <code class="language-plaintext highlighter-rouge">.zshrc</code>, but I felt that adding lines using <code class="language-plaintext highlighter-rouge">sed</code> keeps things more transparent and allows for more of a mix-and-match approach where you can choose the bits you like and leave out the ones that are not useful to you.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># change zsh theme to powerlevel9k</span> <span class="nb">sed</span> <span class="nt">-i</span> <span class="s1">''</span> <span class="s1">'s/ZSH_THEME="robbyrussell"/POWERLEVEL9K_MODE='</span>awesome-patched<span class="s1">'\ ZSH_THEME="powerlevel9k\/powerlevel9k"/g'</span> .zshrc <span class="c"># zsh-autosuggestions (for Oh My Zsh) suggests commands you used</span> <span class="c"># in your terminal history. 
You just have to type → to fill it entirely!</span> <span class="c"># Note: $ZSH_CUSTOM/plugins path is by default ~/.oh-my-zsh/custom/plugins</span> brew <span class="nb">install </span>zsh-autosuggestions zsh-syntax-highlighting <span class="c"># Add the plugins to the list of plugins in ~/.zshrc configuration file :</span> <span class="nb">sed</span> <span class="nt">-i</span> <span class="s1">''</span> <span class="s1">'/^plugins=(/ a\ \ \ zsh-autosuggestions \ \ \ web-search \ \ \ jsontools \ \ \ macports \ \ \ node \ \ \ osx \ \ \ sudo \ \ \ thor \ \ \ docker \ '</span> .zshrc <span class="c"># set default user in .zshrc to avoid the nasty username@machine prompt</span> <span class="nb">echo</span> <span class="s1">'export DEFAULT_USER="$(whoami)"'</span> <span class="o">&gt;&gt;</span> .zshrc </code></pre></div></div> <h3 id="data-science-essentials">Data Science Essentials</h3> <p><a href="https://www.datascienceatthecommandline.com/">Data science at the command line</a> is great, but I doubt it will be enough to do all of your day-to-day tasks. We need R &amp; Python, and while the GUI installers for <a href="https://www.rstudio.com/">Rstudio</a> and <a href="https://www.anaconda.com">Anaconda</a> make the installation child’s play, it would be nice to have it as part of this initial setup script as well. Moreover, I find myself accumulating eclectic collections of packages and libraries. 
Instead of reinstalling all of these manually I have included them here as well:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#### install anaconda</span> <span class="c"># May need updating for conda version</span> wget <span class="nt">-O</span> anaconda.sh https://repo.anaconda.com/archive/Anaconda3-5.3.0-MacOSX-x86_64.sh bash anaconda.sh <span class="nb">rm </span>anaconda.sh <span class="c"># append conda path to zsh profile</span> <span class="c"># ($HOME instead of ~ because a tilde is not expanded inside double quotes)</span> <span class="nb">echo</span> <span class="s1">'export PATH="$HOME/anaconda3/bin:$PATH"'</span> <span class="o">&gt;&gt;</span> ~/.zshrc <span class="c"># reload profile</span> <span class="nb">source</span> ~/.zshrc <span class="c"># create new anaconda virtual environments</span> conda update conda conda config <span class="nt">--add</span> channels conda-forge conda create <span class="nt">--name</span> dev2 <span class="nv">python</span><span class="o">=</span>2.7 conda create <span class="nt">--name</span> dev3 <span class="nv">python</span><span class="o">=</span>3.6 <span class="c"># switch to dev3 to avoid using the system python</span> <span class="nb">source </span>activate dev3 <span class="c"># do this every time we start a new session</span> <span class="c"># assuming you want to use python3 by default</span> <span class="nb">echo</span> <span class="s1">'source activate dev3'</span> <span class="o">&gt;&gt;</span> ~/.zshrc <span class="c"># Install a few libraries that do not ship with anaconda</span> pip <span class="nb">install </span>awscli tensorflow tensorflow-gpu keras <span class="c">#### install R and RStudio</span> <span class="c"># this is required for some advanced plotting</span> brew cask <span class="nb">install </span>xquartz <span class="c"># (will need password again)</span> brew <span class="nb">install</span> <span class="nt">--with-x11</span> r brew cask <span class="nb">install</span> <span class="nt">--appdir</span><span 
class="o">=</span>/Applications rstudio <span class="c"># Note the --appdir option which will use /Applications instead of ~/Applications</span> <span class="c"># set up rJava; this can be a pain!</span> <span class="c"># I used these instructions: https://zhiyzuo.github.io/installation-rJava/</span> <span class="c"># consult google if you get stuck here</span> <span class="c"># set java environmental variables for the profile</span> <span class="nb">echo</span> <span class="s1">'export PATH="$HOME/.jenv/bin:$PATH"'</span> <span class="o">&gt;&gt;</span> ~/.zshrc <span class="c"># (you may need to update version number here)</span> <span class="nb">echo</span> <span class="s1">'export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home"'</span> <span class="o">&gt;&gt;</span> ~/.zshrc <span class="nb">echo</span> <span class="s1">'eval "$(jenv init -)"'</span> <span class="o">&gt;&gt;</span> ~/.zshrc <span class="nb">source</span> ~/.zshrc <span class="c"># make sure to set this to the version that you installed (`java -version`)</span> jenv add /Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home jenv global oracle64-1.8.0_181 <span class="c"># prepare installation and install rJava by building from source</span> R CMD javareconf RScript <span class="nt">-e</span> <span class="s2">"install.packages('rJava',</span><span class="se">\</span><span class="s2"> repos='http://cran.us.r-project.org',</span><span class="se">\</span><span class="s2"> type='source')"</span> <span class="c"># install R packages</span> RScript <span class="nt">-e</span> <span class="s2">"install.packages(c(</span><span class="se">\</span><span class="s2"> 'cluster','crayon','crosstalk','curl','CVST','data.table','DBI',</span><span class="se">\</span><span class="s2"> 'devtools','doMC','dtplyr','foreach','foreign','ggplot2','ggthemes','glmnet',</span><span class="se">\</span><span class="s2"> 
'haven','here','htmltools','htmlwidgets','httr','igraph','jsonlite','knitr',</span><span class="se">\</span><span class="s2"> 'labeling','lattice','lazyeval','leaflet','lubridate','magrittr','markdown',</span><span class="se">\</span><span class="s2"> 'mime','praise','psych','purrr','raster','RColorBrewer','Rcpp','readr',</span><span class="se">\</span><span class="s2"> 'rmarkdown','rpart','rvest','scales','shiny','stringr','survival','testthat',</span><span class="se">\</span><span class="s2"> 'units','viridis','xml2','aws.s3','checkmate','feather','future',</span><span class="se">\</span><span class="s2"> 'gapminder','keras','lintr','plotly','plotROC','prettyunits','pROC','progress',</span><span class="se">\</span><span class="s2"> 'randomForest','ranger','reticulate','rJava','RJDBC','RJSONIO','RODBC',</span><span class="se">\</span><span class="s2"> 'roxygen2','RPostgreSQL','Rtsne','slackr','sf','stringdist','tensorflow',</span><span class="se">\</span><span class="s2"> 'text2vec','vegan','xgboost','XML','tidyverse'),</span><span class="se">\</span><span class="s2"> repos='http://cran.us.r-project.org')"</span> <span class="c"># This library for snowflake is only available on github</span> RScript <span class="nt">-e</span> <span class="s2">"library(devtools); install_github('snowflakedb/dplyr-snowflakedb')"</span> </code></pre></div></div> <p>Consider adding <code class="language-plaintext highlighter-rouge">/bin/zsh</code> to your RStudio global options under <code class="language-plaintext highlighter-rouge">Global Options...</code> &gt; <code class="language-plaintext highlighter-rouge">Terminal</code> &gt; <code class="language-plaintext highlighter-rouge">Custom shell binary</code> to keep your RStudio Terminal sessions in tune with the custom terminal we set up here.</p> <h3 id="settings-and-access">Settings and Access</h3> <p>So now we are done with the basic setup on our local machine. 
However, there are still SSH keys, API access tokens, and config files to set up. This can take a lot of time and energy, and having different tokens on different machines can be confusing or even unsafe (I have seen far too many people hard-code their AWS credentials into their notebooks!).</p> <p>I’ve therefore gone for an approach of creating a folder with all the files for identity management and protecting it with a single strong master password. For obvious reasons I will not go into too much detail on my exact approach to this, but let’s just say that we have synced all our identity files to a local folder called <code class="language-plaintext highlighter-rouge">.dotfiles</code>. From there we can sync them into our home directory, as succinctly explained by Ajmal Siddiqui <a href="https://medium.freecodecamp.org/dive-into-dotfiles-part-2-6321b4a73608">in this post</a>.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># -a recurses into the folder; the trailing slash copies its contents rather than the folder itself</span> rsync <span class="nt">-a</span> .dotfiles/ ~/ </code></pre></div></div> <p>and since we want to do that whenever we start a new terminal session:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s1">'rsync -a .dotfiles/ ~/'</span> <span class="o">&gt;&gt;</span> .zshrc </code></pre></div></div> <p>This will synchronise all files in the <code class="language-plaintext highlighter-rouge">.dotfiles</code> folder to the home directory where they are available to the various applications or our custom scripts that may use them. 
Files that I now manage this way include:</p> <ul> <li><code class="language-plaintext highlighter-rouge">.ssh</code> - SSH keys for GitHub, AWS, etc.</li> <li><code class="language-plaintext highlighter-rouge">.aws</code> - AWS credentials needed for the <a href="https://aws.amazon.com/cli/"><code class="language-plaintext highlighter-rouge">aws cli</code></a></li> <li><code class="language-plaintext highlighter-rouge">.gitconfig</code> - To track my contributions to version controlled code bases</li> <li><code class="language-plaintext highlighter-rouge">.kaggle.json</code> - Access token to use the <a href="https://github.com/Kaggle/kaggle-api">new Kaggle API</a></li> <li><code class="language-plaintext highlighter-rouge">.google</code> - Access token for the Google Maps SDK that I used <a href="https://janlauge.github.io/2017/extracting-location-history/">here</a></li> </ul> <p>So that’s all! As always, I hope it is useful for someone. Please let me know any thoughts you may have in the comments below. Also, follow me on <a href="https://twitter.com/JanLauGe">Twitter</a>, connect with me on <a href="https://www.linkedin.com/in/laurensgeffert/">LinkedIn</a>, and feel free to email me.</p> Fri, 12 Oct 2018 13:00:00 +0000 https://janlauge.github.io/2018/data-science-machine-and-command-line-setup/ DataScience Tools Coding fastai Deep Learning Image Classification <p><strong>Here I summarise what I learned from lesson 1 of the fast.ai course on deep learning. fast.ai is a deep learning online course for coders, taught by Jeremy Howard. Its tagline is to “make neural nets uncool again”. I started the class a couple of days ago and have been impressed with how fast it got me to apply the methods, an approach described by them as top-down learning. 
I am writing this blog post to document and reflect on the things that I learned and to help other people who may be interested in getting started with the class.</strong></p> <!--more--> <p>The fast.ai library is a high-level library based on PyTorch, which tries to take a selection of best-practice approaches from cutting-edge deep learning research and make them into a collection of intelligent default settings.</p> <p>Lesson 1 outlined the fundamentals of computer vision and building image classification models. My homework: get my hands on my own image dataset and use it to train a classifier myself. I chose to attempt a classifier that can distinguish between sharks and dolphins, using images from Google image search. Read along for a detailed walk-through below.</p> <h2 id="getting-an-image-dataset">Getting an Image Dataset</h2> <p><a href="https://pypi.org/project/google-images-download/1.4.4/">This handy tool</a> allows us to get images directly from Google image search.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pip <span class="nb">install </span>google_images_download </code></pre></div></div> <p>For more than 100 images we also need to install chromedriver and dependencies:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>apt-get <span class="nb">install </span>python-selenium python3-selenium libxi6 libgconf-2-4 chromium-chromedriver </code></pre></div></div> <p>Note that I had some problems with accessing the chrome driver from Jupyter notebooks. Changing the permissions of the file with <code class="language-plaintext highlighter-rouge">chmod 777 /usr/local/bin/chromedriver</code> solved this for me.</p> <p>Now we can run a query for images. Results are downloaded and saved locally, named with the number of the result followed by the original file name. 
Make sure to use the <code class="language-plaintext highlighter-rouge">usage_rights</code> flag in order to get images that are cleared for this type of use.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span> <span class="kn">import</span> <span class="nn">re</span> <span class="kn">from</span> <span class="nn">google_images_download</span> <span class="kn">import</span> <span class="n">google_images_download</span> <span class="n">os</span><span class="p">.</span><span class="n">chdir</span><span class="p">(</span><span class="s">'/home/paperspace/data/sharksdolphins/'</span><span class="p">)</span> <span class="n">response</span> <span class="o">=</span> <span class="n">google_images_download</span><span class="p">.</span><span class="n">googleimagesdownload</span><span class="p">()</span> <span class="n">arguments</span> <span class="o">=</span> <span class="p">{</span> <span class="s">"keywords"</span><span class="p">:</span> <span class="s">'"shark","dolphin"'</span><span class="p">,</span> <span class="s">"print_urls"</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> <span class="s">"limit"</span><span class="p">:</span> <span class="s">"10000"</span><span class="p">,</span> <span class="s">"output_directory"</span><span class="p">:</span> <span class="s">"sharksdolphins/train/"</span><span class="p">,</span> <span class="s">"format"</span><span class="p">:</span> <span class="s">"jpg"</span><span class="p">,</span> <span class="s">"usage_rights"</span><span class="p">:</span> <span class="s">"labeled-for-nocommercial-reuse"</span><span class="p">,</span> <span class="s">"chromedriver"</span><span class="p">:</span> <span class="s">"/usr/local/bin/chromedriver"</span> <span class="p">}</span> <span class="n">response</span><span class="p">.</span><span class="n">download</span><span class="p">(</span><span class="n">arguments</span><span class="p">)</span> <span class="c1"># func for renaming an image 
file </span><span class="k">def</span> <span class="nf">image_rename</span><span class="p">(</span><span class="nb">file</span><span class="p">):</span> <span class="n">file_index</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="sa">r</span><span class="s">'\A\d*'</span><span class="p">,</span> <span class="nb">file</span><span class="p">).</span><span class="n">group</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="n">file_index</span> <span class="o">=</span> <span class="n">file_index</span><span class="p">.</span><span class="n">zfill</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="k">return</span> <span class="n">file_index</span><span class="o">+</span><span class="s">'.jpg'</span> <span class="c1"># func for renaming all files in a folder </span><span class="k">def</span> <span class="nf">image_rename_all</span><span class="p">(</span><span class="n">folder</span><span class="p">):</span> <span class="n">files</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">folder</span><span class="p">)</span> <span class="p">[</span><span class="n">os</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">folder</span><span class="o">+</span><span class="nb">file</span><span class="p">,</span> <span class="n">folder</span><span class="o">+</span><span class="n">image_rename</span><span class="p">(</span><span class="nb">file</span><span class="p">))</span> <span class="k">for</span> <span class="nb">file</span> <span class="ow">in</span> <span class="n">files</span><span class="p">]</span> <span class="c1"># rename files </span><span class="n">folders</span> <span class="o">=</span> <span class="p">[</span><span 
class="s">'data/sharksdolphins/train/shark/'</span><span class="p">,</span> <span class="s">'data/sharksdolphins/train/dolphin/'</span><span class="p">]</span> <span class="p">[</span><span class="n">image_rename_all</span><span class="p">(</span><span class="n">folder</span><span class="p">)</span> <span class="k">for</span> <span class="n">folder</span> <span class="ow">in</span> <span class="n">folders</span><span class="p">]</span> </code></pre></div></div> <p>The above query returned about 800 images for each class. I ended up skimming through them and removing unsuitable images by hand.</p> <p>We need a training and a validation set. Making this work with fast.ai is easily done by adopting the recommended folder structure of a <code class="language-plaintext highlighter-rouge">data</code> folder with two sub-folders (<code class="language-plaintext highlighter-rouge">train</code> and <code class="language-plaintext highlighter-rouge">valid</code>), each of which has a sub-folder with images for each class.</p> <p>The code chunk below takes the images downloaded earlier and randomly splits them into 80% training data and 20% validation data. 
Don’t forget to set a random seed so that your results stay reproducible.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="c1"># split files into training and validation sets </span><span class="n">PATH</span> <span class="o">=</span> <span class="s">'data/sharksdolphins/'</span> <span class="n">files_sharks</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">listdir</span><span class="p">(</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">PATH</span><span class="si">}</span><span class="s">train/shark'</span><span class="p">)</span> <span class="n">files_dolphins</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">listdir</span><span class="p">(</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">PATH</span><span class="si">}</span><span class="s">train/dolphin'</span><span class="p">)</span> <span class="c1"># sample from each class to create a validation set </span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1234</span><span class="p">)</span> <span class="n">files_sharks_val</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span> <span class="n">files_sharks</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="nb">round</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">files_sharks</span><span class="p">)</span> <span class="o">/</span> <span class="mi">5</span><span class="p">),</span> <span class="n">replace</span><span class="o">=</span><span class="bp">False</span><span 
class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span> <span class="n">files_dolphins_val</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span> <span class="n">files_dolphins</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="nb">round</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">files_dolphins</span><span class="p">)</span> <span class="o">/</span> <span class="mi">5</span><span class="p">),</span> <span class="n">replace</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span> <span class="c1"># move validation set images into validation folder </span><span class="p">[</span><span class="n">os</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">PATH</span><span class="o">+</span><span class="s">'train/shark/'</span><span class="o">+</span><span class="nb">file</span><span class="p">,</span> <span class="n">PATH</span><span class="o">+</span><span class="s">'valid/shark/'</span><span class="o">+</span><span class="nb">file</span><span class="p">)</span> <span class="k">for</span> <span class="nb">file</span> <span class="ow">in</span> <span class="n">files_sharks_val</span><span class="p">]</span> <span class="p">[</span><span class="n">os</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">PATH</span><span class="o">+</span><span class="s">'train/dolphin/'</span><span class="o">+</span><span class="nb">file</span><span class="p">,</span> <span class="n">PATH</span><span class="o">+</span><span class="s">'valid/dolphin/'</span><span class="o">+</span><span 
class="nb">file</span><span class="p">)</span> <span class="k">for</span> <span class="nb">file</span> <span class="ow">in</span> <span class="n">files_dolphins_val</span><span class="p">]</span> </code></pre></div></div> <p>As you will see, collecting the data was the hardest part and everything from here onward is quite straightforward thanks to the high-level abstractions provided by fast.ai.</p> <h2 id="training-a-model">Training a Model</h2> <p>Now we can finally start to train our image classifier. I am using a <a href="https://www.paperspace.com/console/machines">Paperspace</a> instance with the <a href="https://github.com/reshamas/fastai_deeplearn_part1/blob/master/tools/paperspace.md">setup</a> recommended and provided by <a href="http://course.fast.ai/start.html">fast.ai</a>.</p> <p><strong>Train a First Model</strong></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span> <span class="kn">import</span> <span class="nn">sys</span> <span class="n">os</span><span class="p">.</span><span class="n">chdir</span><span class="p">(</span><span class="s">'/home/paperspace/'</span><span class="p">)</span> <span class="c1"># append fast.ai local folder to system path so modules can be imported </span><span class="n">sys</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">'/home/paperspace/fastai/'</span><span class="p">)</span> <span class="c1"># automatically reload updated sub-modules </span><span class="o">%</span><span class="n">reload_ext</span> <span class="n">autoreload</span> <span class="o">%</span><span class="n">autoreload</span> <span class="mi">2</span> <span class="c1"># in-line plots </span><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span> <span class="kn">from</span> <span class="nn">fastai.imports</span> <span class="kn">import</span> <span class="o">*</span> <span class="kn">from</span> <span 
class="nn">fastai.transforms</span> <span class="kn">import</span> <span class="o">*</span> <span class="kn">from</span> <span class="nn">fastai.conv_learner</span> <span class="kn">import</span> <span class="o">*</span> <span class="kn">from</span> <span class="nn">fastai.model</span> <span class="kn">import</span> <span class="o">*</span> <span class="kn">from</span> <span class="nn">fastai.dataset</span> <span class="kn">import</span> <span class="o">*</span> <span class="kn">from</span> <span class="nn">fastai.sgdr</span> <span class="kn">import</span> <span class="o">*</span> <span class="kn">from</span> <span class="nn">fastai.plots</span> <span class="kn">import</span> <span class="o">*</span> </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># set path of data folder </span><span class="n">PATH</span> <span class="o">=</span> <span class="s">"data/sharksdolphins/"</span> <span class="c1"># set size images should be resized to </span><span class="n">sz</span> <span class="o">=</span> <span class="mi">224</span> <span class="c1"># First model </span><span class="n">arch</span> <span class="o">=</span> <span class="n">resnet34</span> <span class="n">data</span> <span class="o">=</span> <span class="n">ImageClassifierData</span><span class="p">.</span><span class="n">from_paths</span><span class="p">(</span><span class="n">PATH</span><span class="p">,</span> <span class="n">tfms</span><span class="o">=</span><span class="n">tfms_from_model</span><span class="p">(</span><span class="n">arch</span><span class="p">,</span> <span class="n">sz</span><span class="p">),</span> <span class="n">bs</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span> <span class="n">learn</span> <span class="o">=</span> <span class="n">ConvLearner</span><span class="p">.</span><span class="n">pretrained</span><span class="p">(</span><span class="n">arch</span><span 
class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">precompute</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> </code></pre></div></div> <p>Jeremy explained the learning rate finder, which builds on the cyclical learning rates approach outlined by Leslie Smith (https://arxiv.org/abs/1506.01186). We run the <code class="language-plaintext highlighter-rouge">lr_find</code> method and then plot the loss against the learning rate. From that plot we visually choose “the highest learning rate we can find where the loss is still clearly improving”. Note that the learning rate finder initially did not work well for my small-ish dataset. I had to adjust the batch size to make it work correctly (see the <code class="language-plaintext highlighter-rouge">bs=16</code> argument in the <code class="language-plaintext highlighter-rouge">ImageClassifierData</code> call above).</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">learn</span><span class="p">.</span><span class="n">lr_find</span><span class="p">()</span> <span class="n">learn</span><span class="p">.</span><span class="n">sched</span><span class="p">.</span><span class="n">plot_lr</span><span class="p">()</span>  <span class="c1"># learning rate per iteration </span><span class="n">learn</span><span class="p">.</span><span class="n">sched</span><span class="p">.</span><span class="n">plot</span><span class="p">()</span>  <span class="c1"># loss against learning rate </span></code></pre></div></div> <p><img src="https://janlauge.github.io/assets/fastai1_learningratefinder.png" alt="learning rate finder" /></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">learn</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="mf">0.01</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span> </code></pre></div></div> <table> <thead> <tr> <th>epoch</th> <th>trn_loss</th> <th>val_loss</th> <th>accuracy</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>0.435999</td> <td>0.267501</td> <td>0.925</td> </tr> <tr> 
<td>1</td> <td>0.3081</td> <td>0.29459</td> <td>0.9</td> </tr> <tr> <td>2</td> <td>0.2548</td> <td>0.235716</td> <td>0.925</td> </tr> <tr> <td>3</td> <td>0.214654</td> <td>0.229093</td> <td>0.93125</td> </tr> <tr> <td>4</td> <td>0.237024</td> <td>0.162347</td> <td>0.9375</td> </tr> </tbody> </table> <p>Over 93% accuracy! Really nice results already!</p> <p>Let’s see if we can improve things even further.</p> <p><strong>Train a Second Model</strong></p> <p>Now let’s unfreeze the lower layers and retrain them with differential learning rates. Jeremy talks about unfreezing here, but a friend who I talked to about this called it thawing instead, and I like that terminology, as it suggests the differential learning rates: fast-changing weights at the last layer but slower updates in the lower layers.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># unfreeze pretrained layers </span><span class="n">learn</span><span class="p">.</span><span class="n">unfreeze</span><span class="p">()</span> <span class="c1"># set differential learning rate </span><span class="n">lr</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mf">1e-4</span><span class="p">,</span><span class="mf">1e-3</span><span class="p">,</span><span class="mf">1e-2</span><span class="p">])</span> <span class="c1"># train new model </span><span class="n">learn</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">lr</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">cycle_len</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">cycle_mult</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span> </code></pre></div></div> <table> <thead> <tr> <th>epoch</th> <th>trn_loss 
      </th> <th>val_loss       </th> <th>accuracy</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>0.475365</td> <td>0.365881</td> <td>0.89375</td> </tr> <tr> <td>1</td> <td>0.278049</td> <td>0.177107</td> <td>0.94375</td> </tr> <tr> <td>2</td> <td>0.174708</td> <td>0.178467</td> <td>0.95</td> </tr> <tr> <td>3</td> <td>0.120919</td> <td>0.230777</td> <td>0.95625</td> </tr> <tr> <td>4</td> <td>0.093978</td> <td>0.148054</td> <td>0.95625</td> </tr> <tr> <td>5</td> <td>0.081571</td> <td>0.200582</td> <td>0.95</td> </tr> <tr> <td>6</td> <td>0.055903</td> <td>0.198029</td> <td>0.95</td> </tr> </tbody> </table> <p>A little bit better yet. Let’s look at the confusion matrix and some of the misclassified images:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># get predictions and transform to class probability values </span><span class="n">log_preds</span> <span class="o">=</span> <span class="n">learn</span><span class="p">.</span><span class="n">predict</span><span class="p">()</span> <span class="n">preds</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">log_preds</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">probs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">log_preds</span><span class="p">[:,</span><span class="mi">1</span><span class="p">])</span> <span class="c1"># plot confusion matrix </span><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">confusion_matrix</span> <span class="n">cm</span> <span class="o">=</span> <span class="n">confusion_matrix</span><span class="p">(</span><span class="n">data</span><span 
class="p">.</span><span class="n">val_y</span><span class="p">,</span> <span class="n">preds</span><span class="p">)</span> <span class="n">plot_confusion_matrix</span><span class="p">(</span><span class="n">cm</span><span class="p">,</span> <span class="n">data</span><span class="p">.</span><span class="n">classes</span><span class="p">)</span> </code></pre></div></div> <p><img src="https://janlauge.github.io/assets/fastai1_confusionmatrix.png" alt="confusion matrix" /></p> <p>We should also plot some of the images to develop an intuition about where our classifier does well and where it doesn’t. Here is the one misclassified dolphin and the top 4 misclassified sharks:</p> <p><img src="https://janlauge.github.io/assets/fastai1_misclass_dolphin1.png" alt="misclassified dolphin" /></p> <p><img src="https://janlauge.github.io/assets/fastai1_missclass_shark1.png" alt="misclassified sharks" /></p> <h2 id="discussion">Discussion</h2> <p>You can see that we got some pretty strong results in a very short amount of time and using a very limited dataset! There are obviously many more things that could be done to improve this model further.</p> <p><strong>Weighting the Classes</strong></p> <p>In our example of sharks and dolphins, we are currently treating all misclassifications equally. This may not be the right approach! If the model were used to monitor a beach for sharks, for example, failing to recognize a dolphin would not be a problem, while failing to recognize a shark could potentially result in human fatalities. In this case, it might be advisable to retrain the model with a loss function that weights the shark class more heavily, so that our recall on shark images is higher.</p> <p><strong>Data Leakage</strong></p> <p>Looking at images that were misclassified or predicted with low confidence, I realised that my training set introduced a number of hidden biases to the model. For example, a number of dolphin pictures have people in them that are touching the dolphin. 
This is less prevalent with the shark pictures, for obvious reasons. As a result, the few shark pictures that show human arms and hands near the shark seem to end up with lower confidence for the shark class. For the only misclassified dolphin in the dataset, on the other hand, I think that the shark-like pose (frontal, widely opened mouth) may have played a role in the model mistaking this image for a shark image.</p> <p>This kind of data leakage is increasingly discussed and criticised in deep learning applications, so it is good to be aware of it and keep an eye out for it. Since my model is not going into production for Baywatch any time soon, I am just glad I found it nicely illustrated in this relatively small dataset.</p> Wed, 02 May 2018 08:00:00 +0000 https://janlauge.github.iohttps://janlauge.github.io//2018/fast-ai-part1_deep-learning-image-classifier-with-nifty-tricks/ https://janlauge.github.iohttps://janlauge.github.io//2018/fast-ai-part1_deep-learning-image-classifier-with-nifty-tricks/ DataScience Python DeepLearning DataScience Python DeepLearning Airdrop delivery with A* pathfinding <p><strong>This post is an event report and a quick walkthrough of a submission that I developed with a group of participants at an Alibaba / Met Office UK hackathon. We use the A* algorithm with a couple of tweaks to route cargo balloons from London to a number of cities in the UK.</strong> <!--more--></p> <blockquote> <p>It’s the year 2050. The invention of anti-gravity engines has led to the creation of unmanned balloons that travel the UK, delivering goods. However, unpredictable weather conditions mean that these balloons are often delayed, damaged or even destroyed(!), so we need your help. 
We’re inviting you to join our contest AND our hackathon to create an algorithm which allows these balloons to get to their route safely and effectively.</p> </blockquote> <p>In January of this year, the <a href="https://www.metoffice.gov.uk/">Met Office</a> teamed up with <a href="https://www.alibabacloud.com/">Alibaba Cloud</a> to organise a hackathon at <a href="https://www.huckletree.com/locations/shoreditch">Huckletree Shoreditch</a>. Here is a short video that gives a good impression of the event:</p> <iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/bhsNmmdkZ7A?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen=""></iframe> <h3 id="the-problem">The Problem</h3> <p>The hackathon was organised as a “Future Challenge”, a fictitious scenario for the year 2050. Obviously, drone delivery would be considered so 2020 and completely outdated by then. Instead, people are relying on anti-gravity balloons to deliver goods to cities across the UK via air drop. These anti-gravity balloons are reliable and efficient, but have one major shortcoming: they crash when travelling in areas with high wind speeds. This is where the Met Office comes in. The task was to navigate balloons from origin to destination while avoiding storms by using the Met Office forecasts.</p> <p>I’ll spare you the full run-through of rules, terms, and conditions. You can find them on the competition page. The main facts:</p> <ul> <li>forecast data provides projected wind speed for fields of a grid</li> <li>forecasts are for hourly intervals</li> <li>balloons can move up, down, left, right, or stay in place</li> <li>balloons move one field per 2 minutes</li> <li>balloons crash when entering a field where the wind speed is ≥15</li> </ul> <p>So the big question is: How do we safely get the balloons from origin to destination while avoiding stormy areas? I teamed up with a nice bunch of people, mostly undergraduate or master’s-level university students. 
It was great to see their enthusiasm for working on this problem!</p> <h3 id="the-data">The Data</h3> <p>We were provided with the following data:</p> <ol> <li>the coordinates of cities (origin and destinations)</li> <li>weather data, separated into: <ul> <li>a training set with 7 days of weather forecasts from 10 models plus observations of the actual conditions on those days</li> <li>a holdout set with 5 days of weather forecasts</li> </ul> </li> </ol> <p>You should still be able to get the data <a href="https://tianchi.aliyun.com/competition/information.htm?spm=5176.100069.5678.2.7c1024fbV8ArTb&amp;raceId=231622">here</a> in case you’d like to have a go yourself. Note that the weather forecasts came in at a rather inconvenient file size of 2 x 800 MB, and download speeds were not that great either.</p> <p>See the map below for illustration purposes. The map shows gridded forecasts of wind speed for a one-hour time slice, as well as city locations (origin in yellow, destinations in red). <a href="https://janlauge.github.io/assets/astar_weatherdata.png">Weather Forecast</a></p> <h3 id="weather-prediction">Weather Prediction</h3> <p>We started by looking at the weather predictions. Our initial plan was to use the first week, for which both forecasts and observations were available, to train a classifier that would identify the likelihood of high wind speeds in a given area at a particular time. This, however, turned out to be a bit of a red herring. 
The Met Office predictions were so good that averaging them and using a simple threshold of 15 resulted in close to zero false negatives when trying to detect cells with storms.</p> <p><em>Lesson learned:</em> Do your EDA properly to check which areas are worth investigating in detail and where you can use a simple ad-hoc solution.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span> <span class="kn">import</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="kn">import</span> <span class="nn">h5py</span> <span class="k">def</span> <span class="nf">convert_forecast</span><span class="p">(</span><span class="n">data_cube</span><span class="p">):</span> <span class="c1"># take mean of forecasts </span> <span class="n">arr_world</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">data_cube</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span> <span class="c1"># binarize to storm (1) or safe (0) </span> <span class="n">arr_world</span> <span class="o">=</span> <span class="n">arr_world</span> <span class="o">&gt;=</span> <span class="mi">15</span> <span class="c1"># from boolean to binary </span> <span class="n">arr_world</span> <span class="o">=</span> <span class="n">arr_world</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span> <span class="c1"># swap axes so x=0, y=1, z=2, day=3 </span> <span class="n">arr_world</span> <span class="o">=</span> <span class="n">np</span><span 
class="p">.</span><span class="n">swapaxes</span><span class="p">(</span><span class="n">arr_world</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span> <span class="n">arr_world</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">swapaxes</span><span class="p">(</span><span class="n">arr_world</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">)</span> <span class="n">arr_world</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">swapaxes</span><span class="p">(</span><span class="n">arr_world</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">)</span> <span class="k">return</span><span class="p">(</span><span class="n">arr_world</span><span class="p">)</span> <span class="n">data_cube</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="s">'../data/5D_test.npy'</span><span class="p">)</span> <span class="c1"># convert forecast to world array </span><span class="n">forecast</span> <span class="o">=</span> <span class="n">convert_forecast</span><span class="p">(</span><span class="n">data_cube</span><span class="p">)</span> </code></pre></div></div> <h3 id="balloon-navigation">Balloon Navigation</h3> <p>We wanted balloons to take the shortest path from origin to destination without passing into storms. That means storms can be viewed as obstacles in our path search problem, because we would never, ever, want to pass through them, even if that means a massive detour for the balloon. We therefore chose to use an A* path search algorithm. 
This algorithm finds the shortest path around obstacles in a reasonable amount of time and is quite straightforward to implement.</p> <p>The basic approach is to start from the origin and generate a frontier of possible next moves from a list of valid neighbouring fields. Fields with obstacles are excluded, as well as fields that have been visited before. For each field that is part of this frontier we log which field we came from and calculate a priority from the cost of our movement so far plus a heuristic estimate of the remaining distance. As soon as the destination field is reached, we can follow the trail laid out in our log, back to the origin, to find the shortest path.</p> <p>In case you would like more details or want to compare A* to other pathfinding algorithms, <a href="https://www.redblobgames.com/pathfinding/a-star/introduction.html">this page</a> has the best summary of pathfinding algorithms I have seen, including a great section on A* with interactive simulations.</p> <p>In our case, there is one additional complication: our search is two-dimensional (geographical space with latitude and longitude), but our “obstacles” change over time. 
My way around that was to make the search three-dimensional, with each movement time step (2 minutes) as a slice in the third dimension.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># repeat each hourly time slice x30 (one slice per 2-minute move) </span><span class="n">forecast_stack</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">repeat</span><span class="p">(</span><span class="n">forecast</span><span class="p">,</span> <span class="n">repeats</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span> </code></pre></div></div> <p>I then forced the search algorithm to always take a step forward in time by restricting the valid neighbours to <code class="language-plaintext highlighter-rouge">[..., ..., z+1]</code>. I tried to illustrate this schematically in the diagram below: <img src="https://janlauge.github.io/assets/astar_3d_schematic.png" alt="3D A* schematic" /></p> <p>The code for my A* implementation in Python using <code class="language-plaintext highlighter-rouge">heapq</code> can be found below.</p> <p>Note that I allow for neighbours that have the same x and y coordinates. This essentially allows balloons to “hover in place” to wait out unfavourable weather conditions in the area ahead, should that be the most promising course of action.</p> <p>Another thing to mention is that this approach massively inflates our frontier. Usually, an advantage of the A* algorithm is that fields that have been visited before do not need to be considered again. In my approach, field <code class="language-plaintext highlighter-rouge">[0,0,0]</code> is different from field <code class="language-plaintext highlighter-rouge">[0,0,1]</code> (the same latitude and longitude, but a different time step). 
As a result, computation becomes a lot more resource intense, but it is still feasible to run this on your local machine and in the fast-paced setting of a hackathon I think that prioritising developer time over computing time was the right call.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">heapq</span> <span class="k">def</span> <span class="nf">heuristic_function</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span> <span class="k">return</span> <span class="p">(</span><span class="n">b</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">**</span> <span class="mi">2</span> <span class="o">+</span> <span class="p">(</span><span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">a</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">**</span> <span class="mi">2</span> <span class="k">def</span> <span class="nf">astar_3D</span><span class="p">(</span><span class="n">space</span><span class="p">,</span> <span class="n">origin_xy</span><span class="p">,</span> <span class="n">destination_xy</span><span class="p">):</span> <span class="c1"># make origin 3D with timeslice 0 </span> <span class="n">origin</span> <span class="o">=</span> <span class="n">origin_xy</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">origin_xy</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="mi">0</span> <span class="c1"># logs the path </span> <span class="n">came_from</span> <span class="o">=</span> <span class="p">{}</span> <span 
class="c1"># holds the legal next moves in order of priority </span> <span class="n">frontier</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># define legal moves: </span> <span class="c1"># up, down, left, right, stay in place. </span> <span class="c1"># no diagonals and always move forward one time step (z) </span> <span class="n">neighbours</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">),(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">),(</span><span class="mi">0</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">),(</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">),(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">)]</span> <span class="n">cost_so_far</span> <span class="o">=</span> <span class="p">{</span><span class="n">origin</span><span class="p">:</span> <span class="mi">0</span><span class="p">}</span> <span class="n">priority</span> <span class="o">=</span> <span class="p">{</span><span class="n">origin</span><span class="p">:</span> <span class="n">heuristic_function</span><span class="p">(</span><span class="n">origin_xy</span><span class="p">,</span> <span class="n">destination_xy</span><span class="p">)}</span> <span class="n">heapq</span><span class="p">.</span><span class="n">heappush</span><span class="p">(</span><span class="n">frontier</span><span class="p">,</span> <span class="p">(</span><span class="n">priority</span><span class="p">[</span><span 
class="n">origin</span><span class="p">],</span> <span class="n">origin</span><span class="p">))</span> <span class="c1"># while there are still options to explore </span> <span class="k">while</span> <span class="n">frontier</span><span class="p">:</span> <span class="n">current</span> <span class="o">=</span> <span class="n">heapq</span><span class="p">.</span><span class="n">heappop</span><span class="p">(</span><span class="n">frontier</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># if current position is destination, </span> <span class="c1"># break the loop and find the path that led here (ordered from destination back to origin) </span> <span class="k">if</span> <span class="p">(</span><span class="n">current</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">current</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">==</span> <span class="n">destination_xy</span><span class="p">:</span> <span class="n">data</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">while</span> <span class="n">current</span> <span class="ow">in</span> <span class="n">came_from</span><span class="p">:</span> <span class="n">data</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">current</span><span class="p">)</span> <span class="n">current</span> <span class="o">=</span> <span class="n">came_from</span><span class="p">[</span><span class="n">current</span><span class="p">]</span> <span class="k">return</span> <span class="n">data</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">,</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">neighbours</span><span class="p">:</span> <span class="n">move</span> <span class="o">=</span> <span class="n">current</span><span class="p">[</span><span class="mi">0</span><span 
class="p">]</span> <span class="o">+</span> <span class="n">i</span><span class="p">,</span> <span class="n">current</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">j</span><span class="p">,</span> <span class="n">current</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="n">k</span> <span class="c1"># check that move is legal </span> <span class="k">if</span> <span class="p">((</span><span class="mi">0</span> <span class="o">&lt;=</span> <span class="n">move</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">space</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">&amp;</span> <span class="p">(</span><span class="mi">0</span> <span class="o">&lt;=</span> <span class="n">move</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">space</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">&amp;</span> <span class="p">(</span><span class="mi">0</span> <span class="o">&lt;=</span> <span class="n">move</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">space</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">])):</span> <span class="k">if</span> <span class="n">space</span><span class="p">[</span><span class="n">move</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">move</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">move</span><span class="p">[</span><span 
class="mi">2</span><span class="p">]]</span> <span class="o">!=</span> <span class="mi">1</span><span class="p">:</span> <span class="n">new_cost</span> <span class="o">=</span> <span class="mi">1</span> <span class="n">new_total</span> <span class="o">=</span> <span class="n">cost_so_far</span><span class="p">[</span><span class="n">current</span><span class="p">]</span> <span class="o">+</span> <span class="n">new_cost</span> <span class="k">if</span> <span class="n">move</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">cost_so_far</span><span class="p">:</span> <span class="n">cost_so_far</span><span class="p">[</span><span class="n">move</span><span class="p">]</span> <span class="o">=</span> <span class="n">new_total</span> <span class="c1"># calculate total cost </span> <span class="n">priority</span><span class="p">[</span><span class="n">move</span><span class="p">]</span> <span class="o">=</span> <span class="n">new_total</span> <span class="o">+</span> <span class="n">heuristic_function</span><span class="p">(</span><span class="n">move</span><span class="p">,</span> <span class="n">destination_xy</span><span class="p">)</span> <span class="c1"># update frontier </span> <span class="n">heapq</span><span class="p">.</span><span class="n">heappush</span><span class="p">(</span><span class="n">frontier</span><span class="p">,</span> <span class="p">(</span><span class="n">priority</span><span class="p">[</span><span class="n">move</span><span class="p">],</span> <span class="n">move</span><span class="p">))</span> <span class="c1"># log this move </span> <span class="n">came_from</span><span class="p">[</span><span class="n">move</span><span class="p">]</span> <span class="o">=</span> <span class="n">current</span> <span class="k">return</span> <span class="s">'no solution found :('</span> <span class="c1"># get city data </span><span class="n">cities</span> <span class="o">=</span> <span class="n">pd</span><span 
class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'../data/CityData.csv'</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># run algorithm for the first day; origin and destination are (x, y) tuples </span><span class="n">x</span> <span class="o">=</span> <span class="n">astar_3D</span><span class="p">(</span><span class="n">space</span><span class="o">=</span><span class="n">forecast_stack</span><span class="p">[:,:,:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">origin_xy</span><span class="o">=</span><span class="n">origin</span><span class="p">,</span> <span class="n">destination_xy</span><span class="o">=</span><span class="n">destination</span><span class="p">)</span> </code></pre></div></div> <h3 id="visualising-the-results">Visualising the Results</h3> <p>We have a predicted optimal route now. That’s great, but it would be even better to visualise these results in a way that allows us to develop some intuition about how our solution is doing and where we could improve it further. I thought that an animation of the time slices with the paths generated would be ideal for this. So I used <code class="language-plaintext highlighter-rouge">matplotlib.pyplot</code> to create an image of each time slice and then combined them into an animated gif. Output and code below:</p> <p><img src="https://janlauge.github.io/assets/astar_animation_day11.gif" alt="Conditions and routes for day 11" /></p> <p>You can see that, for this day, the solution for most cities is relatively straightforward because of low wind speeds in the majority of the area. 
However, the A* pathfinding algorithm can be seen nicely at work in the later timeslices and the centre-right of the map, where the purple balloon pauses for a timeslice to wait out unfavourable conditions ahead and then winds around patches of high wind speed towards its target.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">plot_solution</span><span class="p">(</span><span class="n">world</span><span class="p">,</span> <span class="n">cities</span><span class="p">,</span> <span class="n">solution</span><span class="p">,</span> <span class="n">day</span><span class="p">):</span> <span class="n">timesteps</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">540</span><span class="p">,</span> <span class="mi">30</span><span class="p">))</span> <span class="n">solution</span> <span class="o">=</span> <span class="n">solution</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">solution</span><span class="p">.</span><span class="n">day</span> <span class="o">==</span> <span class="n">day</span><span class="p">,:]</span> <span class="c1"># colour map for cities </span> <span class="n">cmap</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">cm</span><span class="p">.</span><span class="n">cool</span> <span class="n">norm</span> <span class="o">=</span> <span class="n">matplotlib</span><span class="p">.</span><span class="n">colors</span><span class="p">.</span><span class="n">Normalize</span><span class="p">(</span><span class="n">vmin</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">vmax</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> <span class="c1"># 
colour map for weather </span> <span class="n">cm</span> <span class="o">=</span> <span class="n">matplotlib</span><span class="p">.</span><span class="n">colors</span><span class="p">.</span><span class="n">LinearSegmentedColormap</span><span class="p">.</span><span class="n">from_list</span><span class="p">(</span><span class="s">'grid'</span><span class="p">,</span> <span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">)],</span> <span class="n">N</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">timesteps</span><span class="p">:</span> <span class="n">timeslice</span> <span class="o">=</span> <span class="n">world</span><span class="p">[:,:,</span><span class="n">t</span><span class="p">]</span> <span class="n">moves_sofar</span> <span class="o">=</span> <span class="n">solution</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">solution</span><span class="p">.</span><span class="n">z</span> <span class="o">&lt;=</span> <span class="n">t</span><span class="p">,:]</span> <span class="n">moves_new</span> <span class="o">=</span> <span class="n">solution</span><span class="p">.</span><span class="n">loc</span><span class="p">[(</span><span class="n">t</span> <span class="o">&lt;=</span> <span class="n">solution</span><span class="p">.</span><span class="n">z</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">solution</span><span class="p">.</span><span class="n">z</span> <span class="o">&lt;=</span> <span class="n">t</span> <span class="o">+</span> <span 
class="mi">30</span><span class="p">),:]</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">solution</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span> <span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">timeslice</span><span class="p">[:,:].</span><span class="n">T</span><span class="p">,</span> <span class="n">aspect</span><span class="o">=</span><span class="s">'equal'</span><span class="p">,</span> <span class="n">cmap</span> <span class="o">=</span> <span class="n">cm</span><span class="p">)</span> <span class="c1"># plot old moves </span> <span class="k">for</span> <span class="n">city</span> <span class="ow">in</span> <span class="n">moves_sofar</span><span class="p">.</span><span class="n">city</span><span class="p">.</span><span class="n">unique</span><span class="p">():</span> <span class="n">moves_sofar_city</span> <span class="o">=</span> <span class="n">moves_sofar</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">moves_sofar</span><span class="p">.</span><span class="n">city</span> <span class="o">==</span> <span class="n">city</span><span class="p">,:]</span> <span class="n">x</span> <span class="o">=</span> <span class="n">moves_sofar_city</span><span class="p">.</span><span class="n">x</span> <span class="n">y</span> <span class="o">=</span> <span class="n">moves_sofar_city</span><span class="p">.</span><span class="n">y</span> <span class="n">z</span> <span class="o">=</span> <span class="n">moves_sofar_city</span><span class="p">.</span><span 
class="n">z</span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="nb">list</span><span class="p">(</span><span class="n">y</span><span class="p">),</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'-'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">)</span> <span class="c1"># plot new moves </span> <span class="k">for</span> <span class="n">city</span> <span class="ow">in</span> <span class="n">moves_new</span><span class="p">.</span><span class="n">city</span><span class="p">.</span><span class="n">unique</span><span class="p">():</span> <span class="n">moves_new_city</span> <span class="o">=</span> <span class="n">moves_new</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">moves_new</span><span class="p">.</span><span class="n">city</span> <span class="o">==</span> <span class="n">city</span><span class="p">,:]</span> <span class="n">x</span> <span class="o">=</span> <span class="n">moves_new_city</span><span class="p">.</span><span class="n">x</span> <span class="n">y</span> <span class="o">=</span> <span class="n">moves_new_city</span><span class="p">.</span><span class="n">y</span> <span class="n">z</span> <span class="o">=</span> <span class="n">moves_new_city</span><span class="p">.</span><span class="n">z</span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="nb">list</span><span class="p">(</span><span class="n">y</span><span class="p">),</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'-'</span><span class="p">,</span> <span 
class="n">color</span><span class="o">=</span><span class="n">cmap</span><span class="p">(</span><span class="n">norm</span><span class="p">(</span><span class="n">city</span><span class="p">)))</span> <span class="c1"># plot cities </span> <span class="k">for</span> <span class="n">city</span><span class="p">,</span><span class="n">x</span><span class="p">,</span><span class="n">y</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">cities</span><span class="p">.</span><span class="n">cid</span><span class="p">,</span> <span class="n">cities</span><span class="p">.</span><span class="n">xid</span><span class="p">,</span> <span class="n">cities</span><span class="p">.</span><span class="n">yid</span><span class="p">):</span> <span class="k">if</span> <span class="n">city</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span> <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">([</span><span class="n">x</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="n">y</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="s">'black'</span><span class="p">)</span> <span class="k">else</span><span class="p">:</span> <span class="c1"># balloon still en-route? 
</span> <span class="k">if</span> <span class="n">city</span> <span class="ow">in</span> <span class="n">moves_new</span><span class="p">.</span><span class="n">city</span><span class="p">.</span><span class="n">unique</span><span class="p">():</span> <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">([</span><span class="n">x</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="n">y</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">cmap</span><span class="p">(</span><span class="n">norm</span><span class="p">(</span><span class="n">city</span><span class="p">)))</span> <span class="k">else</span><span class="p">:</span> <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">([</span><span class="n">x</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="n">y</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="s">'black'</span><span class="p">)</span> <span class="c1"># save and display </span> <span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'img_day'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">day</span><span class="p">)</span> <span class="o">+</span> <span class="s">'_timestep_'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="o">+</span> <span class="s">'.png'</span><span class="p">)</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span> </code></pre></div></div> Fri, 06 Apr 2018 18:00:00 +0000 
https://janlauge.github.io/2018/airdrop-delivery/ DataScience R Visualisation Editable Plots from R to PowerPoint <p><strong>In this post I give a quick overview of how to create editable plots in PowerPoint from R. These plots are composed of simple vector-based shapes and thus allow you to change labels, colours, or text position in seconds. Your project managers will love it!</strong> <!--more--></p> <h2 id="motivation">Motivation</h2> <p>R allows us to create great visualisations, but in most data science settings these need to be presented to key stakeholders and decision makers in presentations or “slideuments”. Having to make small changes to previously compiled plots can be time-consuming and frustrating. A solution to this common problem is to keep your plots and graphs editable as a group of vector shapes in PowerPoint. This way project managers or data scientists themselves can make small changes without having to re-execute a single line of code.</p> <h2 id="solution">Solution</h2> <p>We will use a tidyverse approach for creating the plot. 
Furthermore, the <code class="language-plaintext highlighter-rouge">officer</code> package enables us to smoothly interact with PowerPoint, and the <code class="language-plaintext highlighter-rouge">rvg</code> package is required to save our plots as editable vector graphics.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">officer</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">rvg</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>For demonstration purposes, let’s create a plot using the diamonds dataset. NB: I am saving the ggplot object to a variable, but also displaying the plot when executing the lines by appending <code class="language-plaintext highlighter-rouge">; ggp</code> at the end.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Using the diamonds dataset, which ships with ggplot2</span><span class="w"> </span><span class="n">ggp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">diamonds</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Let's simplify things by only considering whole-number carats</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">carat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">floor</span><span class="p">(</span><span class="n">carat</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span 
class="n">carat</span><span class="p">,</span><span class="w"> </span><span class="n">cut</span><span class="p">,</span><span class="w"> </span><span class="n">clarity</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summarise</span><span class="p">(</span><span class="n">price</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">price</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Create a plot of price by carat, colour, cut, and clarity</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">carat</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">price</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">color</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'identity'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">facet_grid</span><span class="p">(</span><span class="n">cut</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">clarity</span><span class="p">)</span><span class="w"> </span><span 
class="o">+</span><span class="w"> </span><span class="c1"># Simplify the plot layout a little</span><span class="w"> </span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">guides</span><span class="p">(</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">panel.grid.major.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w"> </span><span class="n">panel.grid.minor.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">());</span><span class="w"> </span><span class="n">ggp</span><span class="w"> </span></code></pre></div></div> <p><img src="https://janlauge.github.io/assets/ppplots_example.png" alt="Plot example" /></p> <p>Now we can use <code class="language-plaintext highlighter-rouge">officer</code> to create a new PowerPoint document and <code class="language-plaintext highlighter-rouge">rvg::dml</code> together with <code class="language-plaintext highlighter-rouge">officer::ph_with</code> to drop our ggplot object in there (older versions of <code class="language-plaintext highlighter-rouge">rvg</code> offered <code class="language-plaintext highlighter-rouge">rvg::ph_with_vg</code> for this).</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create a new powerpoint document</span><span class="w"> </span><span class="n">doc</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read_pptx</span><span class="p">()</span><span class="w"> </span><span class="n">doc</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">add_slide</span><span class="p">(</span><span class="n">doc</span><span class="p">,</span><span class="w"> </span><span 
class="s1">'Title and Content'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Office Theme'</span><span class="p">)</span><span class="w"> </span><span class="c1"># Add the plot as an editable vector graphic in the slide body</span><span class="w"> </span><span class="n">doc</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ph_with</span><span class="p">(</span><span class="n">doc</span><span class="p">,</span><span class="w"> </span><span class="n">dml</span><span class="p">(</span><span class="n">ggobj</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ggp</span><span class="p">),</span><span class="w"> </span><span class="n">location</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ph_location_type</span><span class="p">(</span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'body'</span><span class="p">))</span><span class="w"> </span><span class="c1"># Write the document to a file</span><span class="w"> </span><span class="n">print</span><span class="p">(</span><span class="n">doc</span><span class="p">,</span><span class="w"> </span><span class="n">target</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'plot.pptx'</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>Now open the document in PowerPoint. Right-click and ungroup the plot. Voilà! You should be able to select individual elements, for example the data bars in the plot, change their colour, move them around, change the text in labels, and much more. Have a look at the plot below. A cookie for you if you can find all ten edits that I made in the example.</p> <p><img src="https://janlauge.github.io/assets/ppplots_example_edited.png" alt="Plot example" /></p> <p>As always, hope this is helpful. And FYI, I am still looking for a way to achieve the same result using <code class="language-plaintext highlighter-rouge">Python</code>. 
If you know one, collect some bounty <a href="https://stackoverflow.com/questions/48944296/editable-plots-in-powerpoint-from-python-equivalent-of-officer-and-rvg">here</a>.</p> Sat, 24 Feb 2018 11:00:00 +0000 https://janlauge.github.io/2018/creating-editable-PowerPoint-plots-from-R/ DataScience R Visualisation Extracting location history <p><strong>If you have an Android phone, then Google logs your location. Fortunately, it makes all of that data available to you via the “timeline” dashboard. Unfortunately, there is no easy way to get it off there and into an IDE. So we’ll have to do this the hard way!</strong> <!--more--></p> <h2 id="location-history">Location History</h2> <p>What did you do yesterday? Last week? Or on August the 17th at 15:00? Answering, at least for the last question, used to be quite a stretch. But like many things, this has changed with the advent of our everyday companion, the smartphone.</p> <p>If you have a smartphone and you carry it around with you, your whereabouts are constantly logged and saved. For Android phones with Google Maps, Google infers your position based on a mix of GPS, cell phone towers, WiFi name lookup, and other factors, and saves it in a “location history”. This data is available to you via the (relatively new) “timeline” feature in Google Maps.</p> <p><img src="https://janlauge.github.io/assets/extract-location-history_timeline.png" alt="Game example" /></p> <h2 id="the-problem">The Problem</h2> <p>As always, Google’s dashboard is very intuitive to navigate and offers great functionality. There is just one problem: I cannot summarise, report on, or programmatically analyse my data. 
It’s locked into Google’s systems.</p> <p>Because our location data is undoubtedly our personal property, Google offers us the option to download it in <code class="language-plaintext highlighter-rouge">.kml</code> format. However, this is only available via the web interface. If I wanted to build personalised reports on top of the data, I would need programmatic access via an API to which I can pass parameters such as date and time to retrieve data dynamically. That is currently not supported by Google. But with a little bit of manual work, we will get there nonetheless!</p> <h2 id="the-solution">The Solution</h2> <p>After a little bit of digging around I found <a href="https://medium.com/alex-attia-blog/how-to-take-back-control-and-use-your-google-maps-data-683fb5d4043e">this Medium post</a> and <a href="https://github.com/alexattia/Maps-Location-History">this associated GitHub repository</a> that got me 90% of the way very quickly. Alex, if you’re reading this, thanks a ton for putting all of that together! The code on GitHub was still missing yearly and daily sub-setting functionality, so I added that in <a href="https://github.com/JanLauGe/c_google_timeline">my repository</a> (pull request submitted!).</p> <p>So how to use this? Here are the step-by-step instructions.</p> <p>In bash:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Clone the code repository</span> git clone https://github.com/JanLauGe/c_google_timeline </code></pre></div></div> <p>Next, we’ll have to do some manual work. 
This is because we will need information from our Google account sign-in, saved in our cookies, in order to authenticate our <code class="language-plaintext highlighter-rouge">GET</code> requests for <code class="language-plaintext highlighter-rouge">KML</code> file downloads.</p> <ol> <li>Open https://www.google.com/maps/timeline in Mozilla Firefox (I tried Chrome first, but it did not work for me)</li> <li>Inspect the page (<code class="language-plaintext highlighter-rouge">Ctrl + Shift + I</code>) and go to the Network tab</li> <li>Enter the following link in the address bar of your browser: https://www.google.com/maps/timeline/kml</li> <li>A new event will appear in the inspect-network tab as a result of the request. Copy it as a cURL command</li> <li>Paste the cURL string into a text editor and save it as a key file (I used ‘~/.env/.google_maps_cookie’)</li> </ol> <p>Now that we have the cookie information, we can go back to the fun part in Python:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">datetime</span> <span class="k">as</span> <span class="n">DT</span> <span class="kn">import</span> <span class="nn">os</span> <span class="kn">import</span> <span class="nn">process_location</span> <span class="c1"># Get inputs -------------------------- # Date info </span><span class="n">today</span> <span class="o">=</span> <span class="n">DT</span><span class="p">.</span><span class="n">date</span><span class="p">.</span><span class="n">today</span><span class="p">()</span> <span class="n">end_day</span> <span class="o">=</span> <span class="n">today</span><span class="p">.</span><span class="n">day</span> <span class="n">end_month</span> <span class="o">=</span> <span class="n">today</span><span class="p">.</span><span class="n">month</span> <span class="n">end_year</span> <span class="o">=</span> <span class="n">today</span><span class="p">.</span><span class="n">year</span> <span class="n">lastweek</span> <span 
class="o">=</span> <span class="n">today</span> <span class="o">-</span> <span class="n">DT</span><span class="p">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="mi">7</span><span class="p">)</span> <span class="n">begin_day</span> <span class="o">=</span> <span class="n">lastweek</span><span class="p">.</span><span class="n">day</span> <span class="n">begin_month</span> <span class="o">=</span> <span class="n">lastweek</span><span class="p">.</span><span class="n">month</span> <span class="n">begin_year</span> <span class="o">=</span> <span class="n">lastweek</span><span class="p">.</span><span class="n">year</span> <span class="c1"># Cookie info (open() does not expand '~', so resolve the path first) </span><span class="n">cookie_content</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s">'~/.env/.google_maps_cookie'</span><span class="p">),</span> <span class="s">'r'</span><span class="p">).</span><span class="n">read</span><span class="p">()</span> <span class="c1"># Remove line break at end of string, if there is one </span><span class="n">cookie_content</span> <span class="o">=</span> <span class="n">cookie_content</span><span class="p">.</span><span class="n">rstrip</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span> <span class="c1"># Where to save the files </span><span class="n">folder</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s">'~/google_timeline/data/'</span><span class="p">)</span> <span class="c1"># Get files -------------------------- </span><span class="n">process_location</span><span class="p">.</span><span class="n">create_kml_files</span><span class="p">(</span> <span class="n">begin_year</span><span class="o">=</span><span class="n">begin_year</span><span class="p">,</span> <span class="n">begin_month</span><span class="o">=</span><span class="n">begin_month</span><span class="p">,</span> <span class="n">begin_day</span><span class="o">=</span><span class="n">begin_day</span><span class="p">,</span> <span 
class="n">end_year</span><span class="o">=</span><span class="n">end_year</span><span class="p">,</span> <span class="n">end_month</span><span class="o">=</span><span class="n">end_month</span><span class="p">,</span> <span class="n">end_day</span><span class="o">=</span><span class="n">end_day</span><span class="p">,</span> <span class="n">cookie_content</span><span class="o">=</span><span class="n">cookie_content</span><span class="p">,</span> <span class="n">folder</span><span class="o">=</span><span class="n">folder</span><span class="p">)</span> </code></pre></div></div> <p>This will download the <code class="language-plaintext highlighter-rouge">KML</code> files (one per day) for the last week. I’ll leave it at that for now. If you would like to know how to read the files into a pandas data frame, check out <a href="https://github.com/alexattia/Maps-Location-History">Alexandre Attia’s repo</a>, or come back here later. I already have a specific application in mind for this, but that is a story for another post.</p> <p>As always, hope this is useful for you. Please leave a comment below. I have enabled anonymous commenting to remove the entry hurdle of signing up to Disqus.</p> Sat, 21 Oct 2017 14:00:00 +0000 https://janlauge.github.io/2017/extracting-location-history/ DataScience Python WebScraping Coding Tools Exploring Sales Data <p><strong>A big part of the interview process for many data science positions is a data science task or assignment. Companies usually choose a data set that is typical for them, though only rarely a sample of their actual production data. 
Here, I am exploring such a data set, sent out by a leading UK retailer.</strong> <!--more--></p> <h2 id="the-task">The task</h2> <p><em>Your task is to read all data into your preferred software environment, get an understanding of the various variables that are included, and clean the dataset. Subsequently, proceed to build a model of your choice to predict store sales in the test dataset. You can choose as many methods of evaluating the model.</em></p> <h3 id="data-preparation">Data Preparation</h3> <p>We’ll start by reading the data into R and having a look at the basic structure and variables available. Three tables are supplied: stores, train, and test. Since train and test are in the same format, we will combine the two tables in one (called ‘days’) to make reformatting a little easier.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Any good R-script should use these ;)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">plyr</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">tidyr</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">magrittr</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">gridExtra</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">ggthemes</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">caret</span><span 
class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">glmnet</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">doMC</span><span class="p">)</span><span class="w"> </span><span class="c1"># Keep results reproducible</span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">1234</span><span class="p">)</span><span class="w"> </span><span class="c1"># Read the tabular data</span><span class="w"> </span><span class="n">stores</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'original/store.csv'</span><span class="p">)</span><span class="w"> </span><span class="n">train</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'original/train.csv'</span><span class="p">)</span><span class="w"> </span><span class="n">test</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'original/test.csv'</span><span class="p">)</span><span class="w"> </span><span class="n">days</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">train</span><span class="p">,</span><span class="w"> </span><span class="n">set</span><span class="w"> </span><span 
class="o">=</span><span class="w"> </span><span class="s1">'train'</span><span class="p">),</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">test</span><span class="p">,</span><span class="w"> </span><span class="n">set</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'test'</span><span class="p">))</span><span class="w"> </span><span class="c1"># inspect the data</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">stores</span><span class="p">)</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">days</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p><strong>Put tables here</strong></p> <p>Apparently, some of the categorical columns were interpreted as integers by R. We’ll fix that and reformat the data to make sure we use columns in an appropriate fashion. We also combine the data on days and shops in one table via a left join.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Reformatting 'stores'</span><span class="w"> </span><span class="n">stores</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="c1"># Stores number as character</span><span class="w"> </span><span class="n">Store</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="n">Store</span><span class="p">),</span><span class="w"> </span><span class="c1"># And CompetitionStart as combination of month and year</span><span class="w"> </span><span class="n">CompetitionStart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span 
class="p">(</span><span class="w"> </span><span class="c1"># Where CompetitionOpenSince values are NA</span><span class="w"> </span><span class="c1"># we assume competition has been around for a while</span><span class="w"> </span><span class="c1"># (default written as dd-mm-yyyy to match the format passed to as.Date below)</span><span class="w"> </span><span class="nf">is.na</span><span class="p">(</span><span class="n">CompetitionOpenSinceYear</span><span class="p">),</span><span class="w"> </span><span class="s1">'01-01-2000'</span><span class="p">,</span><span class="w"> </span><span class="c1"># Otherwise we paste in month and year and make it a date (assuming 1st)</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="s1">'01-'</span><span class="p">,</span><span class="w"> </span><span class="n">CompetitionOpenSinceMonth</span><span class="p">,</span><span class="s1">'-'</span><span class="p">,</span><span class="w"> </span><span class="n">CompetitionOpenSinceYear</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'%d-%m-%Y'</span><span class="p">))</span><span class="w"> </span><span class="c1"># Reformatting 'days'</span><span class="w"> </span><span class="c1"># Then we combine it with the shop data</span><span class="w"> </span><span class="n">days</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="c1"># Again, Stores as character</span><span class="w"> </span><span class="n">Store</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="n">Store</span><span class="p">),</span><span class="w"> </span><span class="c1"># And date into a proper time stamp</span><span class="w"> </span><span 
class="n">Date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'%d/%m/%Y'</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Let's convert categorical information into factors</span><span class="w"> </span><span class="c1"># mutate_each(funs(as.factor), DayOfWeek) %&gt;%</span><span class="w"> </span><span class="c1"># Bring in the shop data</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="n">stores</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Store'</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Any entries of 'CompetitionDistance' for days that predate</span><span class="w"> </span><span class="c1"># 'CompetitionOpenSince...' are probably invalid. 
We'll replace them with</span><span class="w"> </span><span class="c1"># the mean value of 'CompetitionDistance' across all stores</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">CompetitionDistance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="w"> </span><span class="n">Date</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">CompetitionStart</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">stores</span><span class="o">$</span><span class="n">CompetitionDistance</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">CompetitionDistance</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Drop the old CompetitionOpenSince fields</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">one_of</span><span class="p">(</span><span class="s1">'CompetitionOpenSinceMonth'</span><span class="p">,</span><span class="w"> </span><span class="s1">'CompetitionOpenSinceYear'</span><span class="p">))</span><span class="w"> </span></code></pre></div></div> <p>Now we can take an initial look at the data to better understand what we’ve got and how we can best put it to use in the modelling exercise. I will use a couple of basic questions to guide my exploration of the data set. 
Since there is no sales data for the hold-out partition of the data, I excluded it from most of the plots in this section.</p> <h3 id="which-stores-do-well">Which Stores do Well?</h3> <p>We can look at the data on a sales-by-store level. We aggregate sales and other metrics across all days by store and plot them to visualise how the metrics relate to one another. First, let’s try to understand what kind of shops are in our dataset. The shops are divided by their store type and by the range of products they offer (basic, medium, or full). The relationship between these factors is shown in the plot below.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Make a data frame that aggregates the Sales data by Store</span><span class="w"> </span><span class="c1"># (Sales by Store, SbS)</span><span class="w"> </span><span class="n">SbS</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">days</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Use only the train data</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'train'</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">Store</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Aggregate data by Store</span><span class="w"> </span><span class="n">summarise</span><span class="p">(</span><span class="w"> </span><span class="n">meanSales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">Sales</span><span 
class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">meanCustomers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">Customers</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">totalSales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">Sales</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">totalCustomers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">Customers</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">daysOpen</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">Open</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">daysPromo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span 
class="nf">sum</span><span class="p">(</span><span class="n">Promo</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">daysHoliday</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">ifelse</span><span class="p">(</span><span class="n">StateHoliday</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">daysSchoolHoliday</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">SchoolHoliday</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">dailySales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">totalSales</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">daysOpen</span><span class="p">,</span><span class="w"> </span><span class="n">dailyCustomers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">totalCustomers</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">daysOpen</span><span 
class="p">,</span><span class="w"> </span><span class="n">storeType</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">first</span><span class="p">(</span><span class="n">StoreType</span><span class="p">),</span><span class="w"> </span><span class="n">range</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">first</span><span class="p">(</span><span class="n">Range</span><span class="p">),</span><span class="w"> </span><span class="n">competitionDistance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">CompetitionDistance</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Change the factor levels for nice labels</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="n">range</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">range</span><span class="p">,</span><span class="w"> </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'basic'</span><span class="p">,</span><span class="s1">'medium'</span><span class="p">,</span><span class="s1">'full'</span><span class="p">)),</span><span class="w"> </span><span class="n">storeType</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">storeType</span><span class="p">,</span><span class="w"> </span><span 
class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'small'</span><span class="p">,</span><span class="s1">'medium'</span><span class="p">,</span><span class="s1">'big'</span><span class="p">,</span><span class="s1">'huge'</span><span class="p">)))</span><span class="w"> </span><span class="c1"># With the combined dataset, let's try and understand the shops data first</span><span class="w"> </span><span class="n">mosaicplot</span><span class="p">(</span><span class="w"> </span><span class="n">storeType</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">range</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SbS</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Figure 1: Store type and range of products'</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Store type'</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Range of Products'</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p><img src="https://janlauge.github.io/assets/sales_analysis_plot_1.jpg" alt="Store type and range of products" /></p> <p>We can see that there is a large number of small shops but only very few medium ones. Another noticeable factor is that barely any shops offer the medium product range. All of those that do fall in the medium size shop type. 
Furthermore, small shops are more likely to only offer the basic product range, while big and huge shops are more likely to offer the full range of products. However, a number of big and huge shops restrict their product range to the basic selection, which seems surprising. Perhaps the store type does not represent an ordinal scale of shop size as initially assumed?</p> <p>Next we will look at some of the continuous variables available for our shops. We can plot number of customers against daily sales and differentiate by store type:</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SbS</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dailyCustomers</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dailySales</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">storeType</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">range</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ggtitle</span><span class="p">(</span><span class="s1">'Figure 2: 
Relationship of daily customers with daily sales'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlab</span><span class="p">(</span><span class="s1">'Mean Daily Number of Customers'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s1">'Mean Daily Sales'</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p><img src="https://janlauge.github.io/assets/sales_analysis_plot_2.jpg" alt="Relationship of daily customers with daily sales" /></p> <p>Somewhat unsurprisingly, there seems to be a strong relationship between number of customers and sales. Interestingly, though, when splitting this data up by store inventory, both ‘basic range’ stores and ‘full range’ stores seem to be doing well (i.e. high sales per customer) while ‘medium range’ stores have lower sales per customer. Additionally, small shops seem to have more customers than big and huge shops. This is another indicator that my assumption about the nature of the store type was wrong.</p> <p>Now I will go back to the data by day. Let’s first look at the sales data by day of the week (Figure 3). We can see that sales are highest on Monday, drop a bit during the middle of the week, pick up slightly on Friday, are lower on Saturday and basically nonexistent on Sunday (most shops are closed). This suggests that the day of the week should be included as a predictor in our final model. We can also use the dates to create time series for shops and look at the variation of sales over time. To take out weekly variability I also included a rolling mean of sales for a 7-day window (SalesOfWeek). The resulting plot (Figure 4) shows us sales over the two years from the beginning of 2013 to the end of 2014. We can see patterns of regular variation in sales over the year. 
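</p> <p>The effect of the 7-day rolling mean can be sketched without extra packages. This is a minimal illustration on made-up numbers, assuming a centred window. The names <code class="language-plaintext highlighter-rouge">rolling_mean_7</code>, <code class="language-plaintext highlighter-rouge">sales</code> and <code class="language-plaintext highlighter-rouge">sales_of_week</code> are hypothetical and not part of the original analysis, which relies on the RcppRoll package instead.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy daily sales for three weeks (made-up numbers):
# higher on weekdays, zero on Sundays
sales &lt;- rep(c(8, 7, 6, 6, 7, 5, 0), times = 3)

# Centred 7-day moving average in base R.
# stats::filter() with equal weights returns NA wherever
# the window is incomplete (first and last three days).
rolling_mean_7 &lt;- function(x) {
  as.numeric(stats::filter(x, rep(1 / 7, 7), sides = 2))
}

sales_of_week &lt;- rolling_mean_7(sales)

# Every complete window contains exactly one of each weekday,
# so the weekly pattern is averaged out
print(sales_of_week)
</code></pre></div></div> <p>Unlike this sketch, which leaves incomplete windows as <code class="language-plaintext highlighter-rouge">NA</code>, the analysis code pads the edges with the overall mean of the series. Back to the time series in Figure 4. 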
Here is what stands out to me:</p> <ul> <li>Sales are low on Sundays (most shops are closed!)</li> <li>Promotions tend to run every other week</li> <li>Sales are higher in weeks with a promotion</li> <li>Sales are slightly higher in the time before Easter and distinctly higher before Christmas</li> </ul> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Sales by WeekDay</span><span class="w"> </span><span class="n">SbWD</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">days</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'train'</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="n">DateType</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="n">StoreType</span><span class="p">,</span><span class="w"> </span><span class="s1">'-'</span><span class="p">,</span><span class="w"> </span><span class="n">Date</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">DateType</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summarise</span><span class="p">(</span><span class="w"> </span><span class="n">Sales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">Sales</span><span class="p">),</span><span class="w"> </span><span 
class="n">DayOfWeek</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">DayOfWeek</span><span class="p">),</span><span class="w"> </span><span class="n">StateHoliday</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">ifelse</span><span class="p">(</span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">StateHoliday</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)),</span><span class="w"> </span><span class="n">SchoolHoliday</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">SchoolHoliday</span><span class="p">),</span><span class="w"> </span><span class="n">Promo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">Promo</span><span class="p">),</span><span class="w"> </span><span class="n">StoreType</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">first</span><span class="p">(</span><span class="n">StoreType</span><span class="p">))</span><span class="w"> </span><span class="c1"># Plot sales per day of the week</span><span class="w"> </span><span class="n">SbWD</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="n">DayOfWeek</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="w"> 
</span><span class="n">DayOfWeek</span><span class="p">,</span><span class="w"> </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Mon'</span><span class="p">,</span><span class="s1">'Tue'</span><span class="p">,</span><span class="s1">'Wed'</span><span class="p">,</span><span class="s1">'Thu'</span><span class="p">,</span><span class="s1">'Fri'</span><span class="p">,</span><span class="s1">'Sat'</span><span class="p">,</span><span class="s1">'Sun'</span><span class="p">)))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">DayOfWeek</span><span class="p">,</span><span class="w"> </span><span class="n">Sales</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">StoreType</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_boxplot</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ggtitle</span><span class="p">(</span><span class="s1">'Figure 3: Mean Sales per Day of the Week'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlab</span><span class="p">(</span><span class="s1">'Day of the Week'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s1">'Mean Sales'</span><span class="p">)</span><span class="w"> </span><span class="c1"># Time series of sales</span><span class="w"> </span><span class="c1"># (Sales by Day, SbD)</span><span class="w"> </span><span 
class="n">SbD</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">days</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'train'</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">Date</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summarise</span><span class="p">(</span><span class="w"> </span><span class="n">Sales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">Sales</span><span class="p">),</span><span class="w"> </span><span class="n">DayOfWeek</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">DayOfWeek</span><span class="p">)</span><span class="m">-1</span><span class="p">),</span><span class="w"> </span><span class="n">StateHoliday</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">ifelse</span><span class="p">(</span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">StateHoliday</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)),</span><span class="w"> </span><span class="n">SchoolHoliday</span><span class="w"> 
</span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">SchoolHoliday</span><span class="p">),</span><span class="w"> </span><span class="n">Promo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">Promo</span><span class="p">))</span><span class="w"> </span><span class="c1"># Get a rolling mean of Sales</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">RcppRoll</span><span class="p">)</span><span class="w"> </span><span class="n">SbD</span><span class="o">$</span><span class="n">SalesOfWeek</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">roll_mean</span><span class="p">(</span><span class="w"> </span><span class="n">SbD</span><span class="o">$</span><span class="n">Sales</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7</span><span class="p">,</span><span class="w"> </span><span class="n">align</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'center'</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">SbD</span><span class="o">$</span><span class="n">Sales</span><span class="p">),</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="c1"># Timeline plot of sales data</span><span class="w"> </span><span class="n">SbD</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span 
class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="c1"># Rescale some of the variables</span><span class="w"> </span><span class="n">Sales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Sales</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="n">SalesOfWeek</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SalesOfWeek</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="n">Promo</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Promo</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="n">DayOfWeek</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">DayOfWeek</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">7</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">gather</span><span class="p">(</span><span class="w"> </span><span class="c1"># Get data into long format for ggplot</span><span class="w"> </span><span class="n">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">category</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">measurement</span><span class="p">,</span><span class="w"> </span><span class="n">Sales</span><span class="p">,</span><span class="w"> </span><span class="n">SalesOfWeek</span><span class="p">,</span><span class="w"> 
</span><span class="n">DayOfWeek</span><span class="p">,</span><span class="w"> </span><span class="n">StateHoliday</span><span class="p">,</span><span class="w"> </span><span class="n">SchoolHoliday</span><span class="p">,</span><span class="w"> </span><span class="n">Promo</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="c1"># Change the order they will be plotted by</span><span class="w"> </span><span class="n">category</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="w"> </span><span class="n">category</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Sales'</span><span class="p">,</span><span class="w"> </span><span class="s1">'SalesOfWeek'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Promo'</span><span class="p">,</span><span class="w"> </span><span class="s1">'DayOfWeek'</span><span class="p">,</span><span class="w"> </span><span class="s1">'SchoolHoliday'</span><span class="p">,</span><span class="w"> </span><span class="s1">'StateHoliday'</span><span class="p">)))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Make the plot</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">measurement</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">category</span><span class="p">))</span><span 
class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">facet_grid</span><span class="p">(</span><span class="n">category</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ggtitle</span><span class="p">(</span><span class="s1">'Figure 4: Time Series Representation of the Data'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s1">'Rescaled Variable Value'</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <h2 id="modelling">Modelling</h2> <p>For benchmarking, I will create a very simple baseline: no data cleaning, no feature engineering, just the raw (reformatted) data fed into a GLMNET and a GBM. I am using dummy variables for categorical variables and raw values for continuous variables. 
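</p> <p>To make the dummy-variable step concrete, here is a minimal sketch on a made-up data frame. The column names mirror the data set, but the values and the object names <code class="language-plaintext highlighter-rouge">toy</code> and <code class="language-plaintext highlighter-rouge">X</code> are hypothetical. With R's default treatment contrasts, <code class="language-plaintext highlighter-rouge">model.matrix()</code> expands each factor into 0/1 indicator columns (leaving out one reference level per factor) and passes numeric columns through unchanged.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Made-up example data: one factor and one numeric predictor
toy &lt;- data.frame(
  Sales     = c(5200, 6100, 4800, 7000),
  StoreType = factor(c("a", "b", "c", "a")),
  Promo     = c(0, 1, 0, 1)
)

# Expand factors into indicator (dummy) columns;
# numeric columns stay as they are
X &lt;- model.matrix(Sales ~ ., data = toy)

# Drop the intercept column that model.matrix() adds by default
X &lt;- X[, colnames(X) != "(Intercept)", drop = FALSE]

# Level "a" is the reference level and gets no column of its own
print(colnames(X))
</code></pre></div></div> <p>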
The model is evaluated with ten-fold cross-validation, repeated 5 times.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Transforming factors into dummy variables</span><span class="w"> </span><span class="n">daysdummy</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">days</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'train'</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">one_of</span><span class="p">(</span><span class="s1">'DayOfWeek'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Promo'</span><span class="p">,</span><span class="w"> </span><span class="s1">'StateHoliday'</span><span class="p">,</span><span class="w"> </span><span class="s1">'SchoolHoliday'</span><span class="p">,</span><span class="w"> </span><span class="s1">'StoreType'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Range'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Sales'</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Create dummy variables</span><span class="w"> </span><span class="n">model.matrix</span><span class="p">(</span><span class="n">Sales</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.</span><span class="p">)</span><span class="w"> </span><span 
class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Get the Sales data back</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">select</span><span class="p">(</span><span class="n">days</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">one_of</span><span class="p">(</span><span class="s1">'(Intercept)'</span><span class="p">,</span><span class="w"> </span><span class="s1">'DayOfWeek'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Promo'</span><span class="p">,</span><span class="w"> </span><span class="s1">'StateHoliday'</span><span class="p">,</span><span class="w"> </span><span class="s1">'SchoolHoliday'</span><span class="p">,</span><span class="w"> </span><span class="s1">'StoreType'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Range'</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'train'</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="c1"># If CompetitionDistance is NA, use max Competition distance</span><span class="w"> </span><span class="n">CompetitionDistance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="w"> </span><span class="nf">is.na</span><span class="p">(</span><span class="n">CompetitionDistance</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">days</span><span class="o">$</span><span class="n">CompetitionDistance</span><span class="p">,</span><span 
class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">CompetitionDistance</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Create simple training set treating each day as independent observation,</span><span class="w"> </span><span class="c1"># ignoring factors Store and Date for now</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">Store</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">Open</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">set</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">CompetitionStart</span><span class="p">)</span><span class="w"> </span><span class="c1"># Inspect the variables</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">daysdummy</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <h3 id="glm">GLM</h3> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tuneControl</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">trainControl</span><span class="p">(</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"repeatedcv"</span><span class="p">,</span><span class="w"> </span><span class="n">number</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> 
</span><span class="n">repeats</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="n">model1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">train</span><span class="p">(</span><span class="w"> </span><span class="n">Sales</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">daysdummy</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'glmnet'</span><span class="p">,</span><span class="w"> </span><span class="n">metric</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'RMSE'</span><span class="p">,</span><span class="w"> </span><span class="n">trControl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tuneControl</span><span class="p">)</span><span class="w"> </span><span class="n">model1</span><span class="w"> </span></code></pre></div></div> <blockquote> <p>glmnet</p> <p>780829 samples<br />19 predictor</p> <p>No pre-processing<br />Resampling: Cross-Validated (10 fold, repeated 5 times)<br />Summary of sample sizes: 702746, 702747, 702746, 702747, 702746, 702747, …<br />Resampling results across tuning parameters:</p>
<table>
<thead>
<tr><th>alpha</th><th>lambda</th><th>RMSE</th><th>Rsquared</th></tr>
</thead>
<tbody>
<tr><td>0.10</td><td>6.91422</td><td>1210.331</td><td>0.9013087</td></tr>
<tr><td>0.10</td><td>69.14220</td><td>1214.036</td><td>0.9009795</td></tr>
<tr><td>0.10</td><td>691.42198</td><td>1386.398</td><td>0.8870391</td></tr>
<tr><td>0.55</td><td>6.91422</td><td>1210.402</td><td>0.9012853</td></tr>
<tr><td>0.55</td><td>69.14220</td><td>1228.537</td><td>0.8988403</td></tr>
<tr><td>0.55</td><td>691.42198</td><td>1611.039</td><td>0.8521414</td></tr>
<tr><td>1.00</td><td>6.91422</td><td>1211.024</td><td>0.9011885</td></tr>
<tr><td>1.00</td><td>69.14220</td><td>1245.152</td><td>0.8963104</td></tr>
<tr><td>1.00</td><td>691.42198</td><td>1754.444</td><td>0.8288464</td></tr>
</tbody>
</table>
<p>RMSE was used to select the optimal model using the smallest 
value. The final values used for the model were alpha = 0.1 and lambda = 6.91422.</p> </blockquote> <h3 id="gbm">GBM</h3> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fitControl</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">trainControl</span><span class="p">(</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"repeatedcv"</span><span class="p">,</span><span class="w"> </span><span class="n">number</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">repeats</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="n">model2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">train</span><span class="p">(</span><span class="w"> </span><span class="n">Sales</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">daysdummy</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gbm"</span><span class="p">,</span><span class="w"> </span><span class="n">trControl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitControl</span><span class="p">)</span><span class="w"> </span><span class="n">save</span><span class="p">(</span><span class="n">model2</span><span class="p">,</span><span class="w"> </span><span class="n">file</span><span 
class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'saved/basic.gbm.rda'</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <blockquote> <p>Stochastic Gradient Boosting</p> <p>780829 samples<br />19 predictor</p> <p>No pre-processing<br />Resampling: Cross-Validated (10 fold, repeated 5 times)<br />Summary of sample sizes: 702746, 702746, 702745, 702746, 702747, 702747, …<br />Resampling results across tuning parameters:</p>
<table>
<thead>
<tr><th>interaction.depth</th><th>n.trees</th><th>RMSE</th><th>Rsquared</th></tr>
</thead>
<tbody>
<tr><td>1</td><td>50</td><td>1502.452</td><td>0.8639662</td></tr>
<tr><td>1</td><td>100</td><td>1296.930</td><td>0.8922199</td></tr>
<tr><td>1</td><td>150</td><td>1225.590</td><td>0.9011660</td></tr>
<tr><td>2</td><td>50</td><td>1287.706</td><td>0.8954673</td></tr>
<tr><td>2</td><td>100</td><td>1138.885</td><td>0.9137963</td></tr>
<tr><td>2</td><td>150</td><td>1101.484</td><td>0.9186211</td></tr>
<tr><td>3</td><td>50</td><td>1193.371</td><td>0.9078775</td></tr>
<tr><td>3</td><td>100</td><td>1090.395</td><td>0.9202973</td></tr>
<tr><td>3</td><td>150</td><td>1063.971</td><td>0.9238807</td></tr>
</tbody>
</table>
<p>Tuning parameter ‘shrinkage’ was held constant at a value of 0.1<br />Tuning parameter ‘n.minobsinnode’ was held constant at a value of 10<br />RMSE was used to select the optimal model using the smallest value. The final values used for the model were n.trees = 150, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.</p> </blockquote> <p>Now, let’s see how we can improve the above predictions. I now have a better understanding of how sales vary between stores and over time, so I will create some features based on what we learned in the data exploration. I create variables that count down the three days before a state holiday, Christmas, and Easter, to capture the effect an upcoming holiday has on people’s shopping behaviour. I also calculate mean sales per store and include these values in the set of predictors, where they act as a proxy for population density, infrastructure availability, and similar factors related to the location of each individual store. Finally, I exclude closed days here. Predicting closed days as 0 would decrease the error metric, but that feels like cheating! 
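</p> <p>As a minimal sketch of the holiday countdown idea, here is a made-up eight-day calendar (the <code class="language-plaintext highlighter-rouge">toy</code> data frame and its dates are assumptions for illustration, not part of the sales data). The weighted <code class="language-plaintext highlighter-rouge">lead()</code> terms turn a single holiday flag into a ramp that rises over the three preceding days:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)

# Hypothetical toy calendar: eight consecutive days, holiday on the fifth
toy &lt;- data.frame(
  Date = as.Date('2014-12-21') + 0:7,
  Christmas = c(0, 0, 0, 0, 1, 0, 0, 0))

toy &lt;- toy %&gt;%
  mutate(
    # lead(x, n) looks n rows ahead, so the day before the holiday
    # scores 3, two days before scores 2, three days before scores 1
    BC = lead(Christmas, n = 1) * 3 +
      lead(Christmas, n = 2) * 2 +
      lead(Christmas, n = 3),
    # The last rows have nothing to look ahead to, so lead() yields NA
    BC = ifelse(is.na(BC), 0, BC))

toy$BC
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">toy$BC</code> evaluates to 0, 1, 2, 3, 0, 0, 0, 0: the value rises as the holiday approaches and drops back to zero on the holiday itself.</p> <p>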
Instead, let’s only look at days that the shops are open.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Include mean sales by store as predictor</span><span class="w"> </span><span class="n">meanSbS</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">days</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">SbS</span><span class="p">,</span><span class="w"> </span><span class="n">one_of</span><span class="p">(</span><span class="s1">'Store'</span><span class="p">,</span><span class="w"> </span><span class="s1">'meanSales'</span><span class="p">)),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Store'</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">Store</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summarise</span><span class="p">(</span><span class="w"> </span><span class="n">meanSales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">Sales</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="c1"># Include days before christmas and easter as predictor</span><span class="w"> </span><span class="n">daysBH</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span 
class="n">days</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="n">Easter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">StateHoliday</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'b'</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">Christmas</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">StateHoliday</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'c'</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">StateHoliday</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">StateHoliday</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'0'</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">Date</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summarise</span><span class="p">(</span><span class="w"> </span><span class="n">StateHoliday</span><span class="w"> 
</span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">StateHoliday</span><span class="p">),</span><span class="w"> </span><span class="n">Christmas</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Christmas</span><span class="p">),</span><span class="w"> </span><span class="n">Easter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Easter</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="n">BH</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lead</span><span class="p">(</span><span class="n">StateHoliday</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">lead</span><span class="p">(</span><span class="n">StateHoliday</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">lead</span><span class="p">(</span><span class="n">StateHoliday</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> 
</span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="n">BC</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lead</span><span class="p">(</span><span class="n">Christmas</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">lead</span><span class="p">(</span><span class="n">Christmas</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">lead</span><span class="p">(</span><span class="n">Christmas</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="n">BE</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lead</span><span class="p">(</span><span class="n">Easter</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">lead</span><span class="p">(</span><span class="n">Easter</span><span 
class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">lead</span><span class="p">(</span><span class="n">Easter</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">BH</span><span class="p">,</span><span class="w"> </span><span class="n">BC</span><span class="p">,</span><span class="w"> </span><span class="n">BE</span><span class="p">)</span><span class="w"> </span><span class="c1"># Add it all together</span><span class="w"> </span><span class="n">modeldata</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">days</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="w"> </span><span class="n">meanSbS</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Store'</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="w"> </span><span class="n">daysBH</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span 
class="s1">'Date'</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="c1"># If CompetitionDistance is NA, use max Competition distance</span><span class="w"> </span><span class="n">CompetitionDistance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="w"> </span><span class="nf">is.na</span><span class="p">(</span><span class="n">CompetitionDistance</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">days</span><span class="o">$</span><span class="n">CompetitionDistance</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">CompetitionDistance</span><span class="p">),</span><span class="w"> </span><span class="c1"># If any holiday lag is NA, use zero</span><span class="w"> </span><span class="n">BH</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">BH</span><span class="p">),</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">BH</span><span class="p">),</span><span class="w"> </span><span class="n">BC</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">BC</span><span class="p">),</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">BC</span><span 
class="p">),</span><span class="w"> </span><span class="n">BE</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">BE</span><span class="p">),</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">BE</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">CompetitionStart</span><span class="p">)</span><span class="w"> </span><span class="c1"># Transforming factors into dummy variables</span><span class="w"> </span><span class="n">modeldata</span><span class="w"> </span><span class="o">%&lt;&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">one_of</span><span class="p">(</span><span class="s1">'DayOfWeek'</span><span class="p">,</span><span class="w"> </span><span class="s1">'StoreType'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Range'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Sales'</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="c1"># Dummy value to include test set in transformation</span><span class="w"> </span><span class="n">Sales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">Sales</span><span class="p">),</span><span class="w"> </span><span class="m">-999</span><span class="p">,</span><span class="w"> </span><span class="n">Sales</span><span class="p">),</span><span class="w"> </span><span 
class="n">DayOfWeek</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.factor</span><span class="p">(</span><span class="n">DayOfWeek</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Create dummy variables</span><span class="w"> </span><span class="n">model.matrix</span><span class="p">(</span><span class="n">Sales</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Get the Sales data back</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">modeldata</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="n">StateHoliday</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">StateHoliday</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="w"> </span><span class="o">-</span><span class="n">one_of</span><span class="p">(</span><span class="s1">'(Intercept)'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Store'</span><span class="p">,</span><span class="w"> 
</span><span class="s1">'Date'</span><span class="p">,</span><span class="w"> </span><span class="s1">'DayOfWeek'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Promo'</span><span class="p">,</span><span class="w"> </span><span class="s1">'StoreType'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Range'</span><span class="p">))</span><span class="w"> </span><span class="c1"># Select train and test data</span><span class="w"> </span><span class="n">trainset</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">modeldata</span><span class="p">,</span><span class="w"> </span><span class="n">set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'train'</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">Open</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">Open</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">set</span><span class="p">)</span><span class="w"> </span><span class="n">testset</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">modeldata</span><span class="p">,</span><span class="w"> </span><span class="n">set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'test'</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">Open</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> 
</span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">Open</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">set</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>From the previous quick tuning runs, we know that the GBM did noticeably better than the GLM (a best RMSE of about 1064 versus about 1210). I suspect this is because the GBM can capture interactions between variables, while the GLM (with the formula I use) only considers each variable on its own. We’ll therefore use the GBM. Predictions from this model improved with higher interaction depth and more trees, so I will choose some higher values here. I would have liked to do more parameter tuning on this, but the model takes quite a while to compute, so I’ll make do with these ad-hoc values (if you are rerunning the code, avoid rerunning this bit and load the saved model object instead).</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">registerDoMC</span><span class="p">(</span><span class="n">cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">)</span><span class="w"> </span><span class="n">fitControl</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">trainControl</span><span class="p">(</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"repeatedcv"</span><span class="p">,</span><span class="w"> </span><span class="n">number</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">repeats</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span 
class="p">,</span><span class="w"> </span><span class="n">allowParallel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">verboseIter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="n">gbmGrid</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="w"> </span><span class="n">interaction.depth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">n.trees</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="n">shrinkage</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">n.minobsinnode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="n">model4</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">train</span><span class="p">(</span><span class="w"> </span><span class="n">Sales</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">trainset</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gbm"</span><span 
class="p">,</span><span class="w"> </span><span class="n">tuneGrid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gbmGrid</span><span class="p">,</span><span class="w"> </span><span class="n">trControl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitControl</span><span class="p">)</span><span class="w"> </span><span class="c1"># Predict in-sample</span><span class="w"> </span><span class="n">is.prediction</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model4</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">trainset</span><span class="p">)</span><span class="w"> </span><span class="n">os.prediction</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model4</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">testset</span><span class="p">)</span><span class="w"> </span><span class="c1"># Predict with closed days</span><span class="w"> </span><span class="n">fs.prediction</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model4</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">modeldata</span><span class="p">,</span><span class="w"> </span><span class="n">na.action</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">na.pass</span><span class="p">)</span><span class="w"> 
</span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Create data frame with predictions</span><span class="w"> </span><span class="c1"># Use prediction for both train and test set</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="s1">'Prediction'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">days</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Set closed days to zero</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">Prediction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">Open</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">Prediction</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Set negative values to zero</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">Prediction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">Prediction</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">Prediction</span><span class="p">))</span><span class="w"> </span></code></pre></div></div> <p>We have a set of predictions now, but I don’t have the Sales values for 2015 so I can’t 
evaluate on the blind holdout sample. Instead, I will plot the results for visual inspection.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Simple scatterplot to see the results</span><span class="w"> </span><span class="n">fs.prediction</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">Sales</span><span class="p">,</span><span class="w"> </span><span class="n">Prediction</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s1">'Predicted Sales'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ggtitle</span><span class="p">(</span><span class="s1">'Simple Scatterplot of observed and predicted sales'</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>The scatterplot reveals that for the training set there is very good agreement between the observed and predicted sales, indicated by the long, linear shape of the point cloud. For some points, the model underestimates sales, which suggests that it is missing part of the signal. Further research could go into what distinguishes these points from the rest of the data set. When plotting the time series again, it seems that the general trend in the sales data is well captured. 
The range of variation in the weekly sales data, however, also seems to be slightly underestimated by the model.</p> <p><img src="https://janlauge.github.io/assets/sales_analysis_plot_4.jpg" alt="Observed and predicted sales" /></p> <p>Let’s plot the time series again, this time including both the train and the test period.</p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ts.prediction</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">fs.prediction</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">Date</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summarise</span><span class="p">(</span><span class="w"> </span><span class="n">Sales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">Sales</span><span class="p">),</span><span class="w"> </span><span class="n">Prediction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">Prediction</span><span class="p">))</span><span class="w"> </span><span class="n">ts.prediction</span><span class="o">$</span><span class="n">SalesOfWeek</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">roll_mean</span><span class="p">(</span><span class="w"> </span><span class="n">ts.prediction</span><span class="o">$</span><span class="n">Sales</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7</span><span class="p">,</span><span class="w"> </span><span class="n">align</span><span class="w"> </span><span 
class="o">=</span><span class="w"> </span><span class="s1">'center'</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">ts.prediction</span><span class="o">$</span><span class="n">Sales</span><span class="p">),</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="n">ts.prediction</span><span class="o">$</span><span class="n">PredictionOfWeek</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">roll_mean</span><span class="p">(</span><span class="w"> </span><span class="n">ts.prediction</span><span class="o">$</span><span class="n">Prediction</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7</span><span class="p">,</span><span class="w"> </span><span class="n">align</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'center'</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">ts.prediction</span><span class="o">$</span><span class="n">Prediction</span><span class="p">),</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="c1"># Timeline plot of sales data</span><span class="w"> </span><span class="n">ts.prediction</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span 
class="n">gather</span><span class="p">(</span><span class="w"> </span><span class="c1"># Get data into long format for ggplot</span><span class="w"> </span><span class="n">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">category</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">measurement</span><span class="p">,</span><span class="w"> </span><span class="n">Sales</span><span class="p">,</span><span class="w"> </span><span class="n">SalesOfWeek</span><span class="p">,</span><span class="w"> </span><span class="n">Prediction</span><span class="p">,</span><span class="w"> </span><span class="n">PredictionOfWeek</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="w"> </span><span class="c1"># Change the order they will be plotted by</span><span class="w"> </span><span class="n">category</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="w"> </span><span class="n">category</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Sales'</span><span class="p">,</span><span class="w"> </span><span class="s1">'SalesOfWeek'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Prediction'</span><span class="p">,</span><span class="w"> </span><span class="s1">'PredictionOfWeek'</span><span class="p">)))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="c1"># Make the plot</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span 
class="n">aes</span><span class="p">(</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">measurement</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">category</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">facet_grid</span><span class="p">(</span><span class="n">category</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ggtitle</span><span class="p">(</span><span class="s1">'Figure 6: Time Series Representation of the Data'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s1">'Variable Value'</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p><img src="https://janlauge.github.io/assets/sales_analysis_plot_5.jpg" alt="Time series plot of the full dataset" /></p> <p>And that’s all for now. As always, I hope it is interesting. Please leave a comment below!</p> Sun, 01 Oct 2017 10:00:00 +0000 https://janlauge.github.io/2017/exploring-sales-data/ DataScience Recruiting R Database Connections in rMarkdown <p><strong>Connecting R to an enterprise data warehouse? Do it properly and do not hard-code your passwords! 
Here is how you can do it in R with rMarkdown and RStudio version 1.0+</strong> <!--more--></p> <h3 id="the-problem">The Problem</h3> <p>Working as a data scientist in a large organisation, chances are you will have to get data out of an Enterprise Data Warehouse (EDW) and into your Data Manipulation Environment (DME, usually R, Python, Julia, or SAS). Of course, you could create a manual extract, save it as .csv and read it from disk. However, this approach has a number of downsides:</p> <ul> <li>the manual workflow may be hard to reproduce later on</li> <li>files use up additional disk space</li> <li>csv files do not store data types</li> </ul> <p>Generally, I prefer to connect my DME directly to the database, as do many other Data Scientists. What I have repeatedly come across in this context is people hard-coding their passwords and access tokens into their analysis code. In my opinion, this is a dangerous practice! It is most likely in violation of the security regulations of your organisation, and for good reason. It is far too easy for your code to accidentally end up in an unrestricted-access GitHub repository, an unprotected S3 bucket, or similar. With GDPR just around the corner, a mistake like that could soon cost your organisation up to 4% of its global annual turnover in fines!</p> <h3 id="the-solution">The Solution</h3> <p>So what’s the “proper” way to do this? Well, RStudio (v1.0+) offers some great new features in this context. 
If you are using Windows (as many big corporations do) and you are connecting to your EDW using the Windows ODBC Data Source Administrator, you can read your connection details directly from there using the “odbc” package.</p> <p><em>Note: each code block below should be a chunk in an rMarkdown document</em></p> <div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Unfortunately, odbc is not on CRAN yet.</span><span class="w"> </span><span class="c1"># So if you do not have it yet we will need devtools</span><span class="w"> </span><span class="n">install.packages</span><span class="p">(</span><span class="s1">'devtools'</span><span class="p">)</span><span class="w"> </span><span class="c1"># Using devtools, we can now install the odbc package</span><span class="w"> </span><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s1">'rstats-db/odbc'</span><span class="p">)</span><span class="w"> </span><span class="c1"># Get connection info from Windows ODBC Data Source Administrator</span><span class="w"> </span><span class="c1"># Using the name you set manually (dbConnect comes from the DBI package)</span><span class="w"> </span><span class="n">con</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">DBI</span><span class="o">::</span><span class="n">dbConnect</span><span class="p">(</span><span class="n">odbc</span><span class="o">::</span><span class="n">odbc</span><span class="p">(),</span><span class="w"> </span><span class="s1">'EDW_name'</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>Using this connection object, we can now write <strong>and run</strong> SQL code snippets in rMarkdowns, rNotebooks, and shiny apps. Just pass the connection as a chunk option to the snippet and specify an “output.var” that will capture the output. 
The variable named in “output.var” will be available in your R workspace afterwards.</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- This should be a chunk with the following header:</span> <span class="c1">-- {sql, connection = con, output.var = "result"}</span> <span class="c1">-- As a result, this turns into sql code.</span> <span class="c1">-- comments need to be marked accordingly</span> <span class="k">SELECT</span> <span class="n">TOP</span> <span class="mi">10</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">EDW_database</span><span class="p">.</span><span class="n">EDW_table</span> </code></pre></div></div> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># result will be available in the next chunk!</span><span class="w"> </span><span class="n">result</span><span class="w"> </span></code></pre></div></div> <p>This code has syntax highlighting, runs start to finish without any manual steps, does not rely on “hacky” string queries, does not have hard-coded passwords, and your data updates as and when new data becomes available in your EDW!</p> <p>As always, I hope this is useful for someone. Please leave a comment below!</p> Fri, 15 Sep 2017 23:00:00 +0000 https://janlauge.github.io/2017/safely-connecting-to-EDWs/ DataScience DataBases R