Jekyll2018-06-25T19:48:19+00:00https://dziganto.github.io/Standard DeviationsMusings in machine learning, data science, and artificial intelligence.David Ziganto[email protected]Setup an EMR Cluster via AWS CLI2018-06-25T00:00:00+00:002018-06-25T00:00:00+00:00https://dziganto.github.io/aws/aws%20cli/emr/big%20data/hadoop/jupyterhub/spark/Setup-an-EMR-Cluster-via-AWS-CLI<p><img src="/assets/images/Amazon_EMR_main.png?raw=true" alt="image" class="center-image" /></p>
<h2 id="objective">Objective</h2>
<p>In this no-frills post, you’ll learn how to set up a big data cluster on Amazon EMR using nothing but the AWS command line.</p>
<h2 id="prerequisites">Prerequisites</h2>
<ol>
<li>You have an AWS account.</li>
<li>You have set up a <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/get-set-up-for-amazon-ec2.html#create-a-key-pair">Key Pair</a>.</li>
<li>You have basic familiarity with the command line.</li>
<li>You have installed AWS CLI for <a href="https://docs.aws.amazon.com/cli/latest/userguide/awscli-install-linux.html">Linux</a>, <a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-install-macos.html">Mac</a> or <a href="https://docs.aws.amazon.com/cli/latest/userguide/awscli-install-windows.html">Windows</a>.</li>
</ol>
<h2 id="overview">Overview</h2>
<p>Before we dive in, let’s get a handle on what we need to cover. First, I’ll show you the main command I typically run to set up a cluster. Then we’ll break down the command to understand all the key pieces. Please note that text in CAPS is something you’ll need to update with your information. For example, you’ll have to provide your own key pair. So without further ado, let’s dive in.</p>
<h2 id="the-command">The Command</h2>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws emr create-cluster \
--release-label emr-5.14.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.xlarge \
--use-default-roles \
--ec2-attributes SubnetIds=subnet-YOUR_SUBNET,KeyName=YOUR_KEY \
--applications Name=JupyterHub Name=Spark Name=Hadoop \
--name="ThisIsMyCluster" \
--log-uri s3://YOUR_BUCKET \
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://REGION.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://YOUR_BUCKET/YOUR_SHELL_SCRIPT.sh"]
</code></pre></div></div>
<h2 id="the-breakdown">The Breakdown</h2>
<p>That’s a long command so let’s break it down to see what’s happening:</p>
<ol>
<li><code class="highlighter-rouge">aws emr create-cluster</code> - simply creates a cluster</li>
<li><code class="highlighter-rouge">--release-label emr-5.14.0</code> - build a cluster with EMR version 5.14.0</li>
<li><code class="highlighter-rouge">--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.xlarge</code> - build 1 Master node of type m4.xlarge and 2 Core nodes also of type m4.xlarge</li>
<li><code class="highlighter-rouge">--use-default-roles</code> - use the default service role (EMR_DefaultRole) and instance profile (EMR_EC2_DefaultRole) for permissions to access other AWS services</li>
<li><code class="highlighter-rouge">--ec2-attributes SubnetIds=subnet-YOUR_SUBNET,KeyName=YOUR_KEY</code> - configures cluster and Amazon EC2 instance configurations (you should provide a specific subnet and key here)</li>
<li><code class="highlighter-rouge">--applications Name=JupyterHub Name=Spark Name=Hadoop</code> - install JupyterHub, Spark, and Hadoop on this cluster</li>
<li><code class="highlighter-rouge">--name="ThisIsMyCluster"</code> - name the cluster <strong>ThisIsMyCluster</strong></li>
<li><code class="highlighter-rouge">--log-uri s3://YOUR_BUCKET</code> - specify the S3 bucket where you want to store log files</li>
<li><code class="highlighter-rouge">--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://REGION.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://YOUR_BUCKET/YOUR_SHELL_SCRIPT.sh"]</code> - allows you to make additional configurations, like adding users to JupyterHub, when building the cluster (this is completely optional)</li>
</ol>
<h2 id="wrap-up">Wrap Up</h2>
<p>There you have it, an easy way to spin up a cluster. A few simple configuration tweaks to the command above and you’ll be off and crunching data on a cluster in no time!</p>David Ziganto[email protected]Introduction to Time Series2018-05-25T00:00:00+00:002018-05-25T00:00:00+00:00https://dziganto.github.io/python/time%20series/Introduction-to-Time-Series<p><img src="/assets/images/time_series_title.png?raw=true" alt="Time Series" class="center-image" /></p>
<h1 id="introduction">Introduction</h1>
<p>Dealing with data that is sequential in nature requires special techniques. Unlike traditional Ordinary Least Squares regression or Decision Trees, where observations are assumed independent, time series data exhibits correlation between successive samples. In other words, order very much matters. Think stock prices or daily temperatures. Identifying time series data and knowing what to do next is a valuable skill for any modeler.</p>
<p>The first step on our journey is to identify the three components of time series data:</p>
<ol>
<li>Trend</li>
<li>Seasonality</li>
<li>Residuals</li>
</ol>
<p>Trend, as its name suggests, is the overall direction of the data. Seasonality is a periodic component. And the residual is what’s left over when the trend and seasonality have been removed. Residuals are random fluctuations. You can think of them as a noise component.</p>
<p>Let’s look at a few plots to make sure we understand trend, seasonality, and residuals.</p>
<h3 id="time-series-data">Time Series Data</h3>
<p><img src="/assets/images/timeseries.png?raw=true" alt="TS Data" class="center-image" /></p>
<h3 id="trend">Trend</h3>
<p><img src="/assets/images/trend_component.png?raw=true" alt="Trend" class="center-image" /></p>
<h3 id="seasonality">Seasonality</h3>
<p><img src="/assets/images/seasonal_component.png?raw=true" alt="Seasonality" class="center-image" /></p>
<h3 id="residuals">Residuals</h3>
<p><img src="/assets/images/residuals.png?raw=true" alt="Residuals" class="center-image" /></p>
<p>Now that you have the big picture, let’s look at the nuts and bolts. I’ll show you how I created the data above, how to create derivatives of the plots shown above, and how to decompose a time series model in Python.</p>
<h1 id="create-time-series-data">Create Time Series Data</h1>
<p>Time series data is data that is measured at equally-spaced intervals. Think of a sensor that takes measurements every minute.</p>
<blockquote>
<p>A sensor that takes measurements at random times does not produce time series data.</p>
</blockquote>
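<p>A quick sanity check for equal spacing: compute the gaps between consecutive timestamps and verify they’re all identical. A minimal sketch (the two timestamp arrays below are made up for illustration):</p>

```python
import numpy as np

# Hypothetical timestamps, in minutes, from two sensors
regular = np.array([0, 1, 2, 3, 4, 5])     # fixed one-minute gaps
irregular = np.array([0, 1, 4, 5, 9, 12])  # random gaps

def equally_spaced(t):
    """True if every gap between consecutive measurements is identical."""
    gaps = np.diff(t)
    return bool(np.all(gaps == gaps[0]))

print(equally_spaced(regular))    # True
print(equally_spaced(irregular))  # False
```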
<h3 id="trend-1">Trend</h3>
<p>The first step is to create a time interval with equal spacing.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
time = np.arange(50)
</code></pre></div></div>
<p>Great. Now to construct the trend.</p>
<p>Sticking with the sensor example, suppose the sensor is oriented towards an oscillating fan that alternates right and left. The trend component captures the wind speed as someone adjusts the fan speed. Increased fan speed translates to increased sensor measurements.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>trend = np.empty_like(time, dtype='float')
for t in time:
    if t < 10:
        trend[t] = t * 2.25
    elif t < 30:
        trend[t] = t * -0.5 + 25
    else:
        trend[t] = t * 1.25 - 28
</code></pre></div></div>
<p>Better plot it.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import matplotlib.pyplot as plt
plt.plot(time, trend, 'b.')
plt.title("Trend vs Time")
plt.xlabel("minutes")
plt.ylabel("sensor measurement")
</code></pre></div></div>
<p>Here’s the result:</p>
<p><img src="/assets/images/trend.png?raw=true" alt="Trend Plot" class="center-image" /></p>
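<p>As an aside, the same three-segment trend can be built without an explicit loop. This is an equivalent vectorized sketch using <code class="highlighter-rouge">np.piecewise</code>, which applies a different function to each condition’s region:</p>

```python
import numpy as np

time = np.arange(50)

# Same three linear segments as the loop version: the conditions
# select each region, the lambdas compute the trend within it
trend = np.piecewise(
    time.astype(float),
    [time < 10, (time >= 10) & (time < 30), time >= 30],
    [lambda t: t * 2.25,
     lambda t: t * -0.5 + 25,
     lambda t: t * 1.25 - 28],
)
```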
<h3 id="seasonality-1">Seasonality</h3>
<p>The next step is to create a periodic element. In the wind speed sensor analogy, this is the rise and fall in measured wind speed as the fan sweeps left to right and back again.</p>
<p>Here’s an example of how we can create that:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seasonal = 10 + np.sin(time) * 10
</code></pre></div></div>
<p>Notice how both trend and seasonality are a function of time but independent of one another.</p>
<p>Also, here’s a plot of the seasonality component:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt.plot(time, seasonal, 'g-.')
plt.title("Seasonality vs Time")
plt.xlabel("minutes")
plt.ylabel("sensor measurement")
</code></pre></div></div>
<p><img src="/assets/images/seasonality.png?raw=true" alt="Trend Plot" class="center-image" /></p>
<h3 id="residual">Residual</h3>
<p>The last component is the residual. This is a noise component, as mentioned earlier. We can fabricate that like so:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>np.random.seed(10) ## reproducible results
residual = np.random.normal(loc=0.0, scale=1, size=len(time))
</code></pre></div></div>
<p>And the plot:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt.plot(time, residual, 'r-.')
plt.title("Residuals vs Time")
plt.xlabel("minutes")
plt.ylabel("sensor measurement")
</code></pre></div></div>
<p><img src="/assets/images/residuals.png?raw=true" alt="Residual Plot" class="center-image" /></p>
<h1 id="aggregating-components">Aggregating Components</h1>
<p>Now comes time to aggregate the three components: trend, seasonality, and residuals. This will give us the time series data we’re looking for.</p>
<p>As it turns out, there are two major ways to aggregate (or decompose, as we’ll see later) time series data.</p>
<h3 id="additive">Additive</h3>
<p>The first way is simply a sum of the three components.</p>
<p><img src="/assets/images/additive_formula.png?raw=true" alt="LaTeX image 1" class="center-image" /></p>
<p>That’s as easy as <code class="highlighter-rouge">additive = trend + seasonal + residual</code>.</p>
<p>The corresponding plot is:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt.plot(time, additive, 'k-.')
plt.title("Additive Time Series")
plt.xlabel("minutes")
plt.ylabel("sensor measurement");
</code></pre></div></div>
<p><img src="/assets/images/additive.png?raw=true" alt="Additive Plot" class="center-image" /></p>
<h3 id="multiplicative">Multiplicative</h3>
<p>The second way is to multiply the three components together.</p>
<p><img src="/assets/images/multiplicative_formula.png?raw=true" alt="LaTeX image 2" class="center-image" /></p>
<p>We can stitch that together with:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># ignore residual to make pattern obvious
ignored_residual = np.ones_like(residual)
multiplicative = trend * seasonal * ignored_residual
</code></pre></div></div>
<p>The corresponding plot is:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt.plot(time, multiplicative, 'k-.')
plt.title("Multiplicative Time Series")
plt.xlabel("minutes")
plt.ylabel("sensor measurement")
</code></pre></div></div>
<p><img src="/assets/images/multiplicative.png?raw=true_" alt="Multiplicative Plot" class="center-image" /></p>
<h1 id="additive-vs-multiplicative">Additive vs Multiplicative?</h1>
<p>The primary question likely bouncing around your head is: how can you tell whether a time series is additive or multiplicative? Simply plotting the original time series data, called a <a href="https://en.wikipedia.org/wiki/Run_chart">run-sequence plot</a>, is one way to do so. If the seasonal and residual components are independent of the trend, you have an additive series. If the seasonal and residual components are in fact dependent, meaning they fluctuate with the trend, you have a multiplicative series. Look at the additive and multiplicative plots above. You’ll notice a big difference in the amplitudes of the peaks and troughs. Specifically, the amplitude of the seasonal component of the multiplicative time series changes with the trend.</p>
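<p>A handy consequence of this distinction: taking the logarithm of a multiplicative series turns it into an additive one, since log(T * S * R) = log(T) + log(S) + log(R). Here is a minimal numerical check, assuming strictly positive components (the trend and seasonality below are made up for illustration):</p>

```python
import numpy as np

time = np.arange(50)
trend = time + 10.0               # strictly positive trend
seasonal = 10 + np.sin(time) * 5  # strictly positive seasonality

multiplicative = trend * seasonal

# On the log scale the product becomes a sum, i.e. an additive series
log_additive = np.log(trend) + np.log(seasonal)
print(np.allclose(np.log(multiplicative), log_additive))  # True
```

<p>This is why a log transform is a common trick for modeling a multiplicative series with additive tools.</p>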
<h1 id="time-series-decomposition-with-python">Time Series Decomposition with Python</h1>
<p>You’ll likely never know how real-world data was generated. However, I’m about to show you a powerful tool that will allow you to decompose a time series into its components. Let’s see how simple it is.</p>
<h3 id="additive-decomposition">Additive Decomposition</h3>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from statsmodels.tsa.seasonal import seasonal_decompose
ss_decomposition = seasonal_decompose(x=additive,
                                      model='additive',
                                      freq=6)
estimated_trend = ss_decomposition.trend
estimated_seasonal = ss_decomposition.seasonal
estimated_residual = ss_decomposition.resid
</code></pre></div></div>
<p>Note that you must provide the frequency. We can see from the additive and multiplicative plots that the frequency is about 6. There are more sophisticated ways to determine this number empirically, but that’s for another tutorial. Let’s keep things simple for now.</p>
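<p>One simple empirical approach, sketched here, is to look for the strongest peak in the series’ autocorrelation; the lag of that peak estimates the seasonal period. This uses the seasonal component defined earlier, whose true period is 2*pi, roughly 6.3 samples:</p>

```python
import numpy as np

time = np.arange(50)
seasonal = 10 + np.sin(time) * 10  # period of sin is 2*pi, roughly 6.3 samples

# Demean, then autocorrelate; lag 0 ends up at index 0
x = seasonal - seasonal.mean()
acf = np.correlate(x, x, mode='full')[len(x) - 1:]
acf = acf / acf[0]

# The strongest peak among lags 2..24 estimates the period
lags = np.arange(2, 25)
period = lags[np.argmax(acf[2:25])]
print(period)  # ~6
```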
<p>Now that we have the pieces let’s put them all together.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fig, axes = plt.subplots(4, 1, sharex=True, sharey=False)
fig.set_figheight(10)
fig.set_figwidth(15)
axes[0].plot(additive, 'k', label='Original')
axes[0].legend(loc='upper left');
axes[1].plot(estimated_trend, label='Trend')
axes[1].legend(loc='upper left');
axes[2].plot(estimated_seasonal, 'g', label='Seasonality')
axes[2].legend(loc='upper left');
axes[3].plot(estimated_residual, 'r', label='Residuals')
axes[3].legend(loc='upper left')
</code></pre></div></div>
<p><img src="/assets/images/additive_all.png?raw=true_" alt="All Additive Plots" class="center-image" /></p>
<h3 id="multiplicative-decomposition">Multiplicative Decomposition</h3>
<p>Multiplicative decomposition follows the exact same pattern. The only major change is setting <code class="highlighter-rouge">model</code> to <code class="highlighter-rouge">'multiplicative'</code>.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ss_decomposition = seasonal_decompose(x=multiplicative,
                                      model='multiplicative',
                                      freq=6)
estimated_trend = ss_decomposition.trend
estimated_seasonal = ss_decomposition.seasonal
estimated_residual = ss_decomposition.resid
</code></pre></div></div>
<p>Some more matplotlib code:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fig, axes = plt.subplots(4, 1, sharex=True, sharey=False)
fig.set_figheight(10)
fig.set_figwidth(15)
axes[0].plot(multiplicative, label='Original')
axes[0].legend(loc='upper left')
axes[1].plot(estimated_trend, label='Trend')
axes[1].legend(loc='upper left')
axes[2].plot(estimated_seasonal, label='Seasonality')
axes[2].legend(loc='upper left')
axes[3].plot(estimated_residual, label='Residuals')
axes[3].legend(loc='upper left')
</code></pre></div></div>
<p>Voilà! We have a multiplicative decomposition.</p>
<p><img src="/assets/images/multiplicative_all.png?raw=true_" alt="All Multiplicative Plots" /></p>
<hr />
<h1 id="summary">Summary</h1>
<p>In this tutorial you learned:</p>
<ol>
<li>Time series data is composed of three components: trend, seasonality, residual</li>
<li>Time series can be additive or multiplicative</li>
<li>How to decompose a time series model with Python</li>
</ol>David Ziganto[email protected]From Python to Scala - Variables2018-05-21T00:00:00+00:002018-05-21T00:00:00+00:00https://dziganto.github.io/python/scala/From-Python-to-Scala-Variables<p><img src="/assets/images/scala_logo.png?raw=true" alt="Scala" class="center-image" /></p>
<h2 id="introduction">Introduction</h2>
<p>Python is a beautiful, high-level programming language. I’ve solved innumerable problems with it over the years, so I have a particular fondness for its abilities. However, no tool is perfect for everything. Each has its strengths and each has its weaknesses. Part of Python’s power comes from its object-oriented construction. With it, you can do some pretty amazing things. However, functional programming has proven itself a powerful tool for massive scale systems. Therefore, it is time to move beyond Python to the wonderful world of Scala.</p>
<p>Scala is short for <strong>Scalable Language</strong>. It is a hybrid language that melds object-oriented structures and functional programming. Basically, it gives you the best of both worlds. Therefore, what follows is a series that will take you on a journey from Python to Scala. I hope you find it helpful!</p>
<h2 id="lesson-1-variables">Lesson 1: Variables</h2>
<p>Our first lesson is variables. In Python, saving a value to a variable is dead simple. It looks like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>myString = "this is a string"
myInt = 42
myFloat = 4.2
</code></pre></div></div>
<p>Python automatically infers the type of each variable. For example, the variable <code class="highlighter-rouge">myString</code> is saved as a string object. Python knows it’s a string because of the quotes around the text <em>this is a string</em>. You could just as easily have saved <code class="highlighter-rouge">"42"</code> or even <code class="highlighter-rouge">'42'</code>. That too would have been saved as a string object. The advantage is obvious: it takes no effort (and no thought) on the part of the user to save variables. The result is clean, easy to read code.</p>
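<p>You can confirm Python’s type inference yourself with <code class="highlighter-rouge">type</code>:</p>

```python
myString = "this is a string"
myInt = 42
myFloat = 4.2

print(type(myString))  # <class 'str'>
print(type(myInt))     # <class 'int'>
print(type(myFloat))   # <class 'float'>
print(type("42"))      # quotes make it a string: <class 'str'>
```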
<p>With Scala, you can do the same with only a minor change. Let’s take a look:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>var myString = "this is a string"
var myInt = 42
var myFloat = 4.2
</code></pre></div></div>
<p>Notice the <code class="highlighter-rouge">var</code> in front of the variables here. That’s important. Scala can infer data types just as Python does, but the keyword gives Scala additional information. It turns out you must provide it or else an error is thrown. Try running this bit of code:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>myString2 = "this is a string"
</code></pre></div></div>
<p>See what I mean?</p>
<p>Should you feel the need to be explicit, Scala has your back:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>var myString: String = "this is a string"
var myInt: Int = 42
var myFloat: Double = 4.2
</code></pre></div></div>
<p>Now if I want to change <code class="highlighter-rouge">myString</code> to <code class="highlighter-rouge">"string string string"</code>, <code class="highlighter-rouge">myInt</code> to <code class="highlighter-rouge">99</code>, and <code class="highlighter-rouge">myFloat</code> to <code class="highlighter-rouge">3.14</code>, it’s as simple as:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>myString = "string string string"
myInt = 99
myFloat = 3.14
</code></pre></div></div>
<p>This is all basic stuff. There’s almost no difference from Python. But wait, there’s more. Scala gives you an alternative way to reference objects. Check this out:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val myStaticString = "you cannot reassign myStaticString"
val myStaticInt: Int = 12345
val myStaticFloat: Double = 2.71828
</code></pre></div></div>
<p>Ok, what’s the difference between <code class="highlighter-rouge">var</code> and <code class="highlighter-rouge">val</code>? Try to reassign <code class="highlighter-rouge">myStaticString</code>, <code class="highlighter-rouge">myStaticInt</code>, or <code class="highlighter-rouge">myStaticFloat</code>.</p>
<p>Run these commands in the interpreter:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>myStaticString = "try to reassign me, I dare you"
myStaticInt: Int = 1010101011
myStaticFloat: Double = 1.2121210
</code></pre></div></div>
<p>Didn’t work, did it? Therein lies the difference. <code class="highlighter-rouge">var</code> lets you reassign while <code class="highlighter-rouge">val</code> does not. <code class="highlighter-rouge">val</code> is a great way to guard against unwanted side effects in your code: if a reference object should never change, you get a guarantee that it won’t. How awesome is that?!</p>
<p>A quick side tangent. You can assign a new reference object if you include <code class="highlighter-rouge">val</code> at the beginning like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val myStaticString = "try to reassign me, I dare you"
val myStaticInt: Int = 1010101011
val myStaticFloat: Double = 1.2121210
</code></pre></div></div>
<p>So be careful. If you’re clumsy with your code, Scala can’t save you.</p>
<h1 id="summary">Summary</h1>
<p>What did we learn today? We learned Python is beautifully simple while Scala is simply beautiful. And we took our first baby step into Scala by leveraging our knowledge of Python. Scala has the same ability to infer object types when saving variables just like Python. The key difference is that Scala requires a keyword, either <code class="highlighter-rouge">var</code> or <code class="highlighter-rouge">val</code>. We learned the difference between <code class="highlighter-rouge">var</code> and <code class="highlighter-rouge">val</code> is that the former can be reassigned whereas the latter cannot. We also learned that if you write sloppy code, well, then that’s on you because no programming language is going to save your ass.</p>David Ziganto[email protected]From Zero to Spark Cluster in Under 10 Minutes2018-04-25T00:00:00+00:002018-04-25T00:00:00+00:00https://dziganto.github.io/amazon%20emr/apache%20spark/apache%20zeppelin/big%20data/From-Zero-to-Spark-Cluster-in-Under-Ten-Minutes<p><img src="/assets/images/Amazon_EMR_main.png?raw=true" alt="image" class="center-image" /></p>
<h2 id="objective">Objective</h2>
<p>In this no-frills post, you’ll learn how to set up a big data cluster on Amazon EMR in less than ten minutes.</p>
<h2 id="prerequisites">Prerequisites</h2>
<ol>
<li>You have an AWS account.</li>
<li>You have set up a <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/get-set-up-for-amazon-ec2.html#create-a-key-pair">Key Pair</a>.</li>
<li>You have <code class="highlighter-rouge">Chrome</code> or <code class="highlighter-rouge">Firefox</code>.</li>
<li>You have basic familiarity with the command line.</li>
<li>You have basic familiarity with Python. (Optional)</li>
</ol>
<h2 id="1---foxy-proxy-setup-optional-only-for-zeppelin">1 - Foxy Proxy Setup (Optional: only for Zeppelin)</h2>
<ol>
<li>In <code class="highlighter-rouge">Chrome</code> or <code class="highlighter-rouge">Firefox</code>, add the <strong>FoxyProxy</strong> extension.</li>
<li>Restart browser after installing FoxyProxy.</li>
<li>Open your favorite text editor and save <a href="https://github.com/dziganto/dziganto.github.io/blob/master/_scripts/foxyproxy-settings.xml">this code</a> as <strong>foxyproxy-settings.xml</strong>. Keep track of where you save it.</li>
<li>In your browser, click on the <code class="highlighter-rouge">FoxyProxy icon</code> located at top right.</li>
<li>Scroll down and click <code class="highlighter-rouge">Options</code>.</li>
<li>Click <code class="highlighter-rouge">Import/Export</code> on left-hand side.</li>
<li>Click <code class="highlighter-rouge">Choose File</code>.</li>
<li>Select <code class="highlighter-rouge">foxyproxy-settings.xml</code>.</li>
<li>Click <code class="highlighter-rouge">Open</code>.</li>
<li>Congratulations, FoxyProxy is now set up.</li>
</ol>
<h2 id="2---emr-cluster-setup">2 - EMR Cluster Setup</h2>
<ol>
<li>Log in to <a href="https://aws.amazon.com/">AWS</a>.</li>
<li>Navigate to <code class="highlighter-rouge">EMR</code> located under <strong>Analytics</strong>.<br />
<img src="/assets/images/EMR.png?raw=true" alt="EMR" class="center-image" /></li>
<li>Click the <code class="highlighter-rouge">Create cluster</code> button.
<img src="/assets/images/EMR_create_cluster.png?raw=true" alt="Create EMR Cluster" class="center-image" /></li>
<li>You are now in <strong>Step 1: Software and Steps</strong>. Click <code class="highlighter-rouge">Go to advanced options</code>. Here you can name your cluster and select whichever S3 bucket you want to connect to.<br />
<img src="/assets/images/EMR_advanced_options.png?raw=true" alt="EMR Advanced Options" class="center-image" /></li>
<li>Click the big data tools you require. I’ll select <code class="highlighter-rouge">Spark</code> and <code class="highlighter-rouge">Zeppelin</code> for this tutorial.<br />
<img src="/assets/images/EMR_select_software.png?raw=true" alt="EMR Software" class="center-image" /></li>
<li>Click <code class="highlighter-rouge">Next</code> at bottom right of screen.</li>
<li>In <strong>Step 2: Hardware</strong>, select the instance types, instance counts, on-demand or spot pricing, and auto-scaling options.</li>
<li>For this tutorial we’ll simply change the instance type to <code class="highlighter-rouge">m4.xlarge</code> and Core to 1 instance. Everything else will remain as default. See the following picture for details.<br />
<img src="/assets/images/EMR_instance_types.png?raw=true" alt="EMR Software" class="center-image" /></li>
<li>Click <code class="highlighter-rouge">Next</code> at bottom right of screen.</li>
<li>The next page is <strong>Step 3: General Cluster Settings</strong>. Here you have the chance to rename your cluster, select an S3 bucket, and add a bootstrap script, among other options.</li>
<li>Click <code class="highlighter-rouge">Next</code> at bottom right of screen.</li>
<li>The next page is <strong>Step 4: Security</strong>. It is imperative that you select a predefined key pair. (Do NOT proceed without a key!)</li>
<li>Click <code class="highlighter-rouge">Create cluster</code> at bottom right of screen. A new screen pops up that looks like this: <br />
<img src="/assets/images/EMR_cluster_creation.png?raw=true" alt="EMR Cluster Creation" class="center-image" /></li>
<li>Your cluster is finished building when you see a status of <strong>Waiting</strong> in green. (Be patient as this will take 5+ minutes depending on which big data software you installed. It’s not unusual for the build process to take 10-15 minutes or more.) Here’s what a complete build looks like:<br />
<img src="/assets/images/EMR_cluster_running.png?raw=true" alt="EMR Cluster Running" class="center-image" /></li>
<li>Congratulations, you have a cluster running Spark!</li>
</ol>
<h2 id="3---update-myip-optional">3 - Update MyIP (Optional)</h2>
<p>I like to set a location-specific IP for each cluster I build. This is completely optional. However, should you choose to do this, you’ll have to update your IP manually or by security group. Here’s how to do that manually:</p>
<ol>
<li>Still in the EMR dashboard, locate <code class="highlighter-rouge">Security groups for Master:</code>. Click it.</li>
<li>On next page select <strong>Master group</strong>.</li>
<li>Towards the bottom of the page select <code class="highlighter-rouge">Inbound</code> tab.</li>
<li>Then click <code class="highlighter-rouge">Edit</code>.</li>
<li>Select <code class="highlighter-rouge">MyIP</code> for SSH type.</li>
<li>Click <code class="highlighter-rouge">Save</code>.</li>
</ol>
<h2 id="4---ssh-into-your-cluster">4 - SSH Into Your Cluster</h2>
<ol>
<li>Navigate to EMR dashboard.</li>
<li>Click <code class="highlighter-rouge">SSH</code> button.<br />
<img src="/assets/images/EMR_SSH.png?raw=true" alt="SSH" class="center-image" /></li>
<li>Copy the command in the code block. Be sure to update the path to your key if it’s not located in your Home.</li>
<li>Open Terminal and paste command.</li>
<li>A prompt will ask if you want to continue connecting. Type <code class="highlighter-rouge">yes</code>.</li>
<li>A large EMR logo will pop up in your Terminal window if you followed all the steps.</li>
<li>Congratulations, you have set up your first EMR cluster and can access it remotely.</li>
</ol>
<h2 id="5---install-miniconda-on-master-optional">5 - Install Miniconda on Master (Optional)</h2>
<p>Let’s install Python and conda on this Master node now that we’re logged in. Copy and paste the following commands to install and configure Miniconda.</p>
<ol>
<li><code class="highlighter-rouge">wget https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O ~/anaconda.sh</code></li>
<li><code class="highlighter-rouge">bash ~/anaconda.sh -b -p $HOME/anaconda</code></li>
<li><code class="highlighter-rouge">echo -e '\nexport PATH=$HOME/anaconda/bin:$PATH' >> $HOME/.bashrc && source $HOME/.bashrc</code></li>
<li>This process is successful if when you type <code class="highlighter-rouge">which python</code> you get <strong>~/anaconda/bin/python</strong>.</li>
<li>You can now install any python package you want with <code class="highlighter-rouge">conda install package_name</code>.</li>
<li>Congratulations, you now have Python and conda on your Master node.
<blockquote>
<p>Note that miniconda is not installed on the Core node.</p>
<p>You can do that separately or consider creating a bootstrap script that will automatically take care of this for you upon build.</p>
</blockquote>
</li>
</ol>
<h2 id="6---access-zeppelin-remotely-optional">6 - Access Zeppelin Remotely (Optional)</h2>
<ol>
<li>Open your browser that has FoxyProxy installed.</li>
<li>Click <code class="highlighter-rouge">FoxyProxy icon</code>.</li>
<li>Click <code class="highlighter-rouge">Use proxies based on their pre-defined patterns and priorities</code>.</li>
<li>On EMR dashboard, click <code class="highlighter-rouge">Enable web connection</code>.</li>
<li>Copy the command in the code block.</li>
<li>Open new Terminal tab.</li>
<li>Paste the command, which opens the connection and forwards the port.
<blockquote>
<p>Note: it will look like it’s not working, but it is, so leave it alone!</p>
</blockquote>
</li>
<li>On EMR dashboard, the <code class="highlighter-rouge">Zeppelin</code> button should now be blue. Click on it.</li>
<li>You are successful if Zeppelin opens in a new tab in your browser.</li>
<li>Congratulations, you can access your EMR cluster through Zeppelin!</li>
</ol>
<h2 id="7---update-zeppelin-for-anaconda-optional">7 - Update Zeppelin for Anaconda (Optional)</h2>
<p>We have to update the Python path in Zeppelin to leverage the new version we installed in step 5.</p>
<ol>
<li>At the top right of Zeppelin, click <code class="highlighter-rouge">anonymous</code>.</li>
<li>In drop down, select <code class="highlighter-rouge">Interpreter</code>.</li>
<li>Search for <strong>python</strong>.</li>
<li>Click <code class="highlighter-rouge">Edit</code>.</li>
<li>Change <strong>zeppelin.python</strong> from <code class="highlighter-rouge">python</code> to <code class="highlighter-rouge">/home/hadoop/anaconda/bin/python</code></li>
<li>Click <code class="highlighter-rouge">Save</code> on bottom left.</li>
<li>Select dropdown for Interpreters again.</li>
<li>Search for spark.</li>
<li>Click <code class="highlighter-rouge">Edit</code>.</li>
<li>Change <strong>zeppelin.pyspark.python</strong> from <code class="highlighter-rouge">python</code> to <code class="highlighter-rouge">/home/hadoop/anaconda/bin/python</code></li>
<li>Click <code class="highlighter-rouge">Save</code> on bottom left.</li>
<li>Navigate back to Zeppelin Home by clicking <code class="highlighter-rouge">Zeppelin</code> top left.</li>
<li>Congratulations, you have all the tools you need to run PySpark on a Spark cluster!</li>
</ol>
<h2 id="8---best-part">8 - Best Part</h2>
<p>Admittedly, while that’s not a complicated process, it is time-consuming. The good news is that you never have to configure FoxyProxy again AND there are neat little tricks you can add to make the build process much easier. For example, you can add a bootstrap script that installs and configures miniconda on all nodes during the build process itself.</p>
<p>Furthermore, if you want to spin up another cluster that is similar or identical to the one we just built, all you have to do is:</p>
<ol>
<li>Navigate to the EMR dashboard.</li>
<li>Select the cluster you want to mimic.</li>
<li>Select <code class="highlighter-rouge">Clone</code>.</li>
</ol>
<p>You can start building another cluster in seconds!</p>
<hr />
<h1 id="reminder-dont-forget-to-terminate-your-cluster-when-youre-done">Reminder: Don’t forget to terminate your cluster when you’re done.</h1>
<hr />
<h1 id="data-science-book-recommendations">Data Science Book Recommendations</h1>
<p>Published 2018-03-29: <a href="https://dziganto.github.io/data%20science/machine%20learning/Data-Science-Book-Recommendations">https://dziganto.github.io/data%20science/machine%20learning/Data-Science-Book-Recommendations</a></p>
<p><img src="/assets/images/ml_books.png?raw=true" alt="image" class="center-image" /></p>
<h2 id="data-cleaning">Data Cleaning</h2>
<p><a href="https://amzn.to/2IeCYjy">Best Practices in Data Cleaning</a></p>
<h2 id="deep-learning">Deep Learning</h2>
<p><a href="https://bit.ly/2pPpLpE">Deep Learning with Python</a></p>
<p><a href="https://bit.ly/2E4T7Wf">Supervised Sequence Labelling with Recurrent Neural Networks</a></p>
<h2 id="ethicsprivacy">Ethics/Privacy</h2>
<p><a href="https://amzn.to/2GU0Eu3">Sharing Big Data Safely: Managing Data Security</a></p>
<h2 id="general-business">General Business</h2>
<p><a href="https://amzn.to/2GjyVpx">Certain to Win</a></p>
<p><a href="https://amzn.to/2GWoLrZ">The Mind Of The Strategist: The Art of Japanese Business</a></p>
<p><a href="https://amzn.to/2GnmZz2">Toyota Kata: Managing People for Improvement, Adaptiveness and Superior Results</a></p>
<h2 id="linear-algebra">Linear Algebra</h2>
<p><a href="https://amzn.to/2pR8MU2">Linear Algebra Done Right</a></p>
<h2 id="machine-learning">Machine Learning</h2>
<p><a href="https://amzn.to/2GjnA94">Applied Predictive Modeling</a></p>
<p><a href="https://amzn.to/2uu2srd">Applied Survival Analysis: Regression Modeling of Time-to-Event Data</a></p>
<p><a href="https://amzn.to/2uv1N8Z">Bayesian Data Analysis</a></p>
<p><a href="https://amzn.to/2pRbQQP">Bayesian Reasoning and Machine Learning</a></p>
<p><a href="https://amzn.to/2GVszd6">Data Analysis Using Regression and Multilevel/Hierarchical Models</a></p>
<p><a href="https://amzn.to/2GjsmiQ">Data Science at the Command Line</a></p>
<p><a href="https://amzn.to/2J5biPk">Doing Data Science: Straight Talk from the Frontline</a></p>
<p><a href="https://amzn.to/2J1ddEu">Elements of Information Theory</a></p>
<p><a href="https://amzn.to/2GYSCjG">Evaluating Learning Algorithms: A Classification Perspective</a></p>
<p><a href="https://amzn.to/2uB5eL0">Gaussian Processes for Machine Learning</a></p>
<p><a href="https://amzn.to/2E4PK1D">Hands-On Machine Learning with Scikit-Learn and TensorFlow</a></p>
<p><a href="https://amzn.to/2GEgMlC">Information Theory, Inference and Learning Algorithms</a></p>
<p><a href="https://amzn.to/2pR8T2w">Learning From Data</a></p>
<p><a href="https://amzn.to/2GjRv0C">Machine Learning: A Probabilistic Perspective</a></p>
<p><a href="https://amzn.to/2J3RV9h">Machine Learning: An Algorithmic Perspective</a></p>
<p><a href="https://amzn.to/2GUDG67">Machine Learning Refined</a></p>
<p><a href="https://amzn.to/2GAQKzT">Machine Learning: The Art and Science of Algorithms that Make Sense of Data</a></p>
<p><a href="https://amzn.to/2GoqC7X">Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis</a></p>
<p><a href="https://amzn.to/2uABVsm">Time Series Analysis and Its Applications</a></p>
<p><a href="https://amzn.to/2IeISB2">Understanding Machine Learning: From Theory to Algorithms</a></p>
<h2 id="non-technical">Non-technical</h2>
<p><a href="https://amzn.to/2GIZ14y">Analytics: How to Win with Intelligence</a></p>
<p><a href="https://oreil.ly/JXlOIo">Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking</a></p>
<p><a href="https://amzn.to/2Ghu5Jv">How to Lie with Statistics</a></p>
<p><a href="https://amzn.to/2GmhWCQ">Mastering Data Mining: The Art and Science of Customer Relationship Management</a></p>
<p><a href="https://amzn.to/2pNBjL8">Naked Statistics: Stripping the Dread from the Data</a></p>
<p><a href="https://amzn.to/2Gjj9XH">Spin Selling</a></p>
<h2 id="other">Other</h2>
<p><a href="https://amzn.to/2GmXP3o">Against the Gods: The Remarkable Story of Risk</a></p>
<p><a href="https://amzn.to/2J7fL46">Gödel, Escher, Bach: An Eternal Golden Braid</a></p>
<p><a href="https://bit.ly/2pTnDgu">The Machine Stops</a></p>
<h2 id="pedagogy">Pedagogy</h2>
<p><a href="https://amzn.to/2GUEXtM">Teaching and Learning STEM: A Practical Guide</a></p>
<p><a href="https://amzn.to/2GmsiyF">Understanding By Design</a></p>
<h2 id="programming">Programming</h2>
<p><a href="https://amzn.to/2pPG1bf">The Pragmatic Programmer: From Journeyman to Master</a></p>
<h2 id="statistics">Statistics</h2>
<p><a href="https://amzn.to/2GUvq5Y">A Course in Large Sample Theory</a></p>
<p><a href="https://amzn.to/2uxET0A">All of Statistics: A Concise Course in Statistical Inference</a></p>
<p><a href="https://amzn.to/2GmNZ1s">An Introduction to Statistical Methods and Data Analysis</a></p>
<p><a href="https://amzn.to/2pQqyXA">Applied Longitudinal Analysis</a></p>
<p><a href="https://amzn.to/2E2Z06k">Categorical Data Analysis</a></p>
<p><a href="https://amzn.to/2pQzkon">Design and Analysis: A Researcher’s Handbook</a></p>
<p><a href="https://amzn.to/2Ih0Y5F">Handbook of Parametric and Nonparametric Statistical Procedures</a></p>
<p><a href="https://amzn.to/2E2YIfK">Multivariate Analysis</a></p>
<p><a href="https://bit.ly/1FNQSUQ">OpenIntro Statistics</a></p>
<p><a href="https://amzn.to/2GjSvOh">Statistics for Experimenters: Design, Innovation, and Discovery</a></p>
<h2 id="visualization">Visualization</h2>
<p><a href="https://amzn.to/2GUllWE">Good Charts</a></p>
<p><a href="https://amzn.to/2GjGlZM">The Functional Art: An introduction to information graphics and visualization (Voices That Matter)</a></p>
<p><a href="https://amzn.to/2E5uDfj">Visualize This: The FlowingData Guide to Design, Visualization, and Statistics</a></p>
<hr />
<h1 id="yet-another-data-science-article">Yet Another Data Science Article</h1>
<p>Published 2018-03-14: <a href="https://dziganto.github.io/data%20science/satire/Yet-Another-Data-Science-Article">https://dziganto.github.io/data%20science/satire/Yet-Another-Data-Science-Article</a></p>
<h1 id="problem">Problem</h1>
<p><img src="/assets/images/magic_algorithm.png?raw=true" alt="image" class="center-image" /></p>
<h1 id="solution">Solution</h1>
<p><img src="/assets/images/ds_solution.png?raw=true" alt="image" class="center-image" /></p>
<hr />
<h1 id="understanding-object-oriented-programming-through-machine-learning">Understanding Object-Oriented Programming Through Machine Learning</h1>
<p>Published 2018-01-28: <a href="https://dziganto.github.io/classes/data%20science/linear%20regression/machine%20learning/object-oriented%20programming/python/Understanding-Object-Oriented-Programming-Through-Machine-Learning">https://dziganto.github.io/classes/data%20science/linear%20regression/machine%20learning/object-oriented%20programming/python/Understanding-Object-Oriented-Programming-Through-Machine-Learning</a></p>
<p><img src="/assets/images/classes.png?raw=true" alt="image" class="center-image" /></p>
<h2 id="introduction">Introduction</h2>
<p>Object-Oriented Programming (OOP) is not easy to wrap your head around. You can read tutorial after tutorial and sift through example after example only to find your head swimming. Don’t worry, you’re not alone.</p>
<p>When I first started learning OOP, I read about bicycles and bank accounts and filing cabinets. I read about all manner of objects with both basic and specific characteristics. It was easy to follow along. However, I always felt I was missing something. It wasn’t until I had that inexplicable eureka moment that I finally glimpsed the power of OOP.</p>
<p>However, I always felt as though my eureka moment took longer than it should have. I doubt I’m alone. Therefore, this post is my attempt to explain the basics of OOP through the lens of my favorite subject: machine learning. I hope you find it helpful.</p>
<h2 id="setup">Setup</h2>
<p>I discussed the basics of linear regression in a previous post entitled <a href="https://dziganto.github.io/data%20science/linear%20regression/machine%20learning/python/Linear-Regression-101-Basics/">Linear Regression 101 (Part 1 - Basics)</a>. If you’re unfamiliar, please start there because I’m going to assume you’re up to speed. Anyway, in that discussion I showed how to find the parameters of a linear regression model using nothing more than simple linear algebra. We defined a function called <strong>ols</strong> (short for Ordinary Least Squares) that looks like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def ols(X, y):
'''returns parameters based on Ordinary Least Squares.'''
xtx = np.dot(X.T, X) ## x-transpose times x
inv_xtx = np.linalg.inv(xtx) ## inverse of x-transpose times x
xty = np.dot(X.T, y) ## x-transpose times y
return np.dot(inv_xtx, xty)
</code></pre></div></div>
<p>The output of the <strong>ols</strong> function is an array of parameter values that minimize the squared residuals. Since the parameters, or coefficients, define the linear regression model, we save those values like so:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>parameters = ols(X,y)
</code></pre></div></div>
<p>In other words, the variable <em>parameters</em>, an array of scalar values, defines our model. To make predictions, we simply take the dot product of our model’s parameters with that of incoming data in the same format as the <em>X</em> that was passed to the <strong>ols</strong> function. Here’s that same idea in code:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>predictions = np.dot(X_new, parameters)
</code></pre></div></div>
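<p>To tie the pieces together, here is a minimal, self-contained sketch. The synthetic data and the true parameter values are illustrative assumptions, not from the original post; because the data is noise-free, <strong>ols</strong> should recover the generating parameters exactly:</p>

```python
import numpy as np

def ols(X, y):
    '''returns parameters based on Ordinary Least Squares.'''
    xtx = np.dot(X.T, X)              ## x-transpose times x
    inv_xtx = np.linalg.inv(xtx)      ## inverse of x-transpose times x
    xty = np.dot(X.T, y)              ## x-transpose times y
    return np.dot(inv_xtx, xty)

# synthetic, noise-free data generated from known parameters [2.0, 3.0]
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
true_params = np.array([2.0, 3.0])
y = np.dot(X, true_params)

parameters = ols(X, y)                # should recover [2.0, 3.0]
predictions = np.dot(X, parameters)   # predictions on the same design matrix
```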
<p>So now we have a model and a way to make predictions. Not too complicated. But as it turns out we can do better. We can simplify. Enter OOP.</p>
<h2 id="object-oriented-programming-overview">Object-Oriented Programming Overview</h2>
<p>In the same way we abstracted away a series of calculations that return the Ordinary Least Squares model parameters in a function called <strong>ols</strong>, we can abstract away <em>functions</em> and <em>data</em> in a single object called a <strong>class</strong>.</p>
<p><img src="/assets/images/class_diagram.png?raw=true" alt="image" class="center-image" /></p>
<p>Let me show you what I mean and then I’ll explain what’s going on.</p>
<h2 id="object-oriented-programming-machine-learning-example">Object-Oriented Programming Machine Learning Example</h2>
<p>We’ll build a class called <strong>MyLinearRegression</strong> one code block at a time so as to manage the complexity. It’s really not too tricky but it’s easier to understand in snippets. Alright, let’s get started.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
class MyLinearRegression:
def __init__(self, fit_intercept=True):
self.coef_ = None
self.intercept_ = None
self._fit_intercept = fit_intercept
</code></pre></div></div>
<p>Have no fear if that looks scary or overwhelming. I’ll break it down for you and you’ll see it’s really not that complicated. Just stay with me.</p>
<p>The first thing to notice is that we’re defining a <em>class</em> as opposed to a function. We do that, unsurprisingly, with the <strong>class</strong> keyword. By convention, you should capitalize your class names. Notice how I named my class <strong>MyLinearRegression</strong>? Starting your classes with a capital letter helps to differentiate them from functions, the latter of which is lowercase by convention.</p>
<p>The next block of code which starts with <code class="highlighter-rouge">def __init__(self, fit_intercept=True):</code> is where things get more complicated. Stay with me; I promise it’s not that bad.</p>
<p>At a high level, <code class="highlighter-rouge">__init__</code> provides a recipe for how to build an <em>instance</em> of <strong>MyLinearRegression</strong>. Think of <code class="highlighter-rouge">__init__</code> like a factory. Let’s pretend you wanted to crank out hundreds of linear regression models. You can do that one of two ways. First, you have the <strong>ols</strong> function that provides the instructions on how to calculate linear regression parameters. So you could, in theory, save off hundreds of copies of the <strong>ols</strong> function with hundreds of appropriate variable names. There’s nothing inherently wrong with that. Or you could save off hundreds of <em>instances</em> of class <strong>MyLinearRegression</strong> with hundreds of appropriate variable names. Both accomplish very similar tasks but do so in very different ways. You’ll understand why as we get a little further along.</p>
<blockquote>
<p>Technical note: the <code class="highlighter-rouge">__init__</code> block of code is optional, though it’s quite common. You’ll know when you need it and when you don’t with a bit more practice with OOP.</p>
</blockquote>
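<p>To make the factory analogy concrete, here is a small sketch of my own (the list of models and the attribute tweak are illustrative): each call to the class runs <code class="highlighter-rouge">__init__</code> once, so every instance gets its own independent attributes.</p>

```python
class MyLinearRegression:
    def __init__(self, fit_intercept=True):
        self.coef_ = None
        self.intercept_ = None
        self._fit_intercept = fit_intercept

# "crank out" several models; __init__ runs once per instance
models = [MyLinearRegression() for _ in range(3)]

# changing one instance's attribute leaves the other instances untouched
models[0]._fit_intercept = False
```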
<p>What the heck is <em>self</em>? Since an instance of <strong>MyLinearRegression</strong> can take on any name a user gives it, we need a way to link the user’s instance name back to the class so we can accomplish certain tasks. Think of <em>self</em> as a variable whose sole job is to learn the name of a particular instance. Say we named a particular instance of the class <strong>MyLinearRegression</strong> <em>mlr</em>, like so:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mlr = MyLinearRegression()
</code></pre></div></div>
<p>Again, the class <strong>MyLinearRegression</strong> provides instructions on how to build a linear regression model. By attaching the variable <em>mlr</em> to the <strong>MyLinearRegression</strong> class, we created an instance, a specific object called <em>mlr</em>, which will have its own data and “functions”. You’ll understand why I placed functions in quotes shortly. Anyway, <em>mlr</em> is a unique model with a unique name, much like you’re a unique person with your own name. The class object <strong>MyLinearRegression</strong> now links <em>self</em> to <em>mlr</em>. If it’s still not clear why that’s important, hang tight because it will become clear when we get to the next code block.</p>
<p>Now this business about <code class="highlighter-rouge">self.coef_</code>, <code class="highlighter-rouge">self.intercept_</code>, and <code class="highlighter-rouge">self._fit_intercept</code>. All three are simply variables, technically called <em>attributes</em>, attached to the class object. When we build <em>mlr</em>, our class provides a blueprint that calls for the creation of three <em>attributes</em>. <code class="highlighter-rouge">self.coef_</code> and <code class="highlighter-rouge">self.intercept_</code> are placeholders. We haven’t calculated model parameters but when we do we’ll place those values into these attributes. <code class="highlighter-rouge">self._fit_intercept</code> is a boolean (True or False) that is set to True by default per the keyword argument. A user can define whether to calculate the intercept by setting this argument to True or avoid it by setting the argument to False. Since we didn’t set <em>fit_intercept</em> to False when we created <em>mlr</em>, <em>mlr</em> will provide the intercept parameter once it’s calculated.</p>
<p>Great, let’s add a “function” called <strong>fit</strong> which will take an array of data and a vector of ground truth values in order to calculate and return linear regression model parameters.</p>
<blockquote>
<p>Note: We’re building this class one piece at a time. I’m doing this simply for pedagogical reasons.</p>
</blockquote>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class MyLinearRegression:
def __init__(self, fit_intercept=True):
self.coef_ = None
self.intercept_ = None
self._fit_intercept = fit_intercept
def fit(self, X, y):
"""
Fit model coefficients.
Arguments:
X: 1D or 2D numpy array
y: 1D numpy array
"""
# check if X is 1D or 2D array
if len(X.shape) == 1:
X = X.reshape(-1,1)
# add bias if fit_intercept is True
if self._fit_intercept:
X = np.c_[np.ones(X.shape[0]), X]
# closed form solution
xTx = np.dot(X.T, X)
inverse_xTx = np.linalg.inv(xTx)
xTy = np.dot(X.T, y)
coef = np.dot(inverse_xTx, xTy)
# set attributes
if self._fit_intercept:
self.intercept_ = coef[0]
self.coef_ = coef[1:]
else:
self.intercept_ = 0
self.coef_ = coef
</code></pre></div></div>
<p>Our focus now is on the <strong>fit</strong> function. Technically a class function is called a <strong>method</strong>. That’s the term I’ll use from here on out. The <strong>fit</strong> method is quite simple.</p>
<p>First comes the docstring which tells us what the method does and what the expected inputs are for <em>X</em> and <em>y</em>.</p>
<p>Next up is a check on the dimensions of the incoming <em>X</em> array. NumPy complains if you perform certain calculations on a 1D array. If a 1D array is passed, the supplied code reshapes it so as to fake a 2D array.</p>
<blockquote>
<p>Technical note: this does not change the output in any way. It simply anticipates and solves a problem for the user.</p>
</blockquote>
<p>The next block of code checks if <code class="highlighter-rouge">fit_intercept=True</code>. If so, then a vector of ones is added to the <em>X</em> array.</p>
<blockquote>
<p>I’ll assume you’ve read my post on linear regression to understand why we need to do this.</p>
</blockquote>
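<p>A quick illustration of both checks, using a made-up 1D feature vector: the reshape turns it into a single-column 2D array, and <code class="highlighter-rouge">np.c_</code> prepends the column of ones for the intercept.</p>

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # 1D array, shape (3,)
X = x.reshape(-1, 1)                  # reshaped to (3, 1): one column, "fake" 2D
Xb = np.c_[np.ones(X.shape[0]), X]    # bias column of ones prepended -> shape (3, 2)
```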
<p>The next block of code simply calculates the model parameters using linear algebra. The parameters are stored in a local variable called <em>coef</em>.</p>
<blockquote>
<p>Yes, <strong>coef</strong> is technically a variable, not an attribute. A variable-like object attached to a class via <strong>self</strong> is called an attribute, whereas a variable defined inside a method is simply a local variable scoped to that method.</p>
</blockquote>
<p>The final block of code parses <em>coef</em> appropriately. If <code class="highlighter-rouge">fit_intercept=True</code>, then the intercept value is copied to <code class="highlighter-rouge">self.intercept_</code>. Otherwise, <code class="highlighter-rouge">self.intercept_</code> is set to 0. The remaining parameters are stored in <code class="highlighter-rouge">self.coef_</code>.</p>
<p>Let’s see how this works.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mlr = MyLinearRegression()
mlr.fit(X_data, y_target)
</code></pre></div></div>
<p>We instantiate a model object called <em>mlr</em> and then find its model parameters on data (<em>X_data</em> and <em>y_target</em>) passed by the user. Once that’s done, we can access the intercept and remaining parameters like so:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>intercept = mlr.intercept_
parameters = mlr.coef_
</code></pre></div></div>
<p>So clean. So elegant. Let’s keep going. Let’s add a <strong>predict</strong> method.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
class MyLinearRegression:
def __init__(self, fit_intercept=True):
self.coef_ = None
self.intercept_ = None
self._fit_intercept = fit_intercept
def fit(self, X, y):
"""
Fit model coefficients.
Arguments:
X: 1D or 2D numpy array
y: 1D numpy array
"""
# check if X is 1D or 2D array
if len(X.shape) == 1:
X = X.reshape(-1,1)
# add bias if fit_intercept is True
if self._fit_intercept:
X = np.c_[np.ones(X.shape[0]), X]
# closed form solution
xTx = np.dot(X.T, X)
inverse_xTx = np.linalg.inv(xTx)
xTy = np.dot(X.T, y)
coef = np.dot(inverse_xTx, xTy)
# set attributes
if self._fit_intercept:
self.intercept_ = coef[0]
self.coef_ = coef[1:]
else:
self.intercept_ = 0
self.coef_ = coef
def predict(self, X):
"""
Output model prediction.
Arguments:
X: 1D or 2D numpy array
"""
# check if X is 1D or 2D array
if len(X.shape) == 1:
X = X.reshape(-1,1)
return self.intercept_ + np.dot(X, self.coef_)
</code></pre></div></div>
<p>The <strong>predict</strong> method is also quite simple. Pass in some data <em>X</em> formatted exactly as <em>X_data</em> in our case, and the model spits out its predictions.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>predictions = mlr.predict(X_new_data)
</code></pre></div></div>
<p>See how everything (data and methods) is contained or encapsulated in a single class object. It’s a wonderful way to keep everything organized.</p>
<p>But wait, there’s more.</p>
<p>Say we had another class called <strong>Metrics</strong>. This class captures a number of key metrics associated with regression models. See <a href="https://dziganto.github.io/data%20science/linear%20regression/machine%20learning/python/Linear-Regression-101-Metrics/">Linear Regression 101 (Part 2 - Metrics)</a> for details.</p>
<p>It looks like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Metrics:
def __init__(self, X, y, model):
self.data = X
self.target = y
self.model = model
# degrees of freedom population dep. variable variance
self._dft = X.shape[0] - 1
# degrees of freedom population error variance
self._dfe = X.shape[0] - X.shape[1] - 1
def sse(self):
'''returns sum of squared errors (model vs actual)'''
squared_errors = (self.target - self.model.predict(self.data)) ** 2
self.sq_error_ = np.sum(squared_errors)
return self.sq_error_
def sst(self):
'''returns total sum of squared errors (actual vs avg(actual))'''
avg_y = np.mean(self.target)
squared_errors = (self.target - avg_y) ** 2
self.sst_ = np.sum(squared_errors)
return self.sst_
def r_squared(self):
'''returns calculated value of r^2'''
self.r_sq_ = 1 - self.sse()/self.sst()
return self.r_sq_
def adj_r_squared(self):
'''returns calculated value of adjusted r^2'''
self.adj_r_sq_ = 1 - (self.sse()/self._dfe) / (self.sst()/self._dft)
return self.adj_r_sq_
def mse(self):
'''returns calculated value of mse'''
self.mse_ = np.mean( (self.model.predict(self.data) - self.target) ** 2 )
return self.mse_
def pretty_print_stats(self):
'''returns report of statistics for a given model object'''
items = ( ('sse:', self.sse()), ('sst:', self.sst()),
('mse:', self.mse()), ('r^2:', self.r_squared()),
('adj_r^2:', self.adj_r_squared()))
for item in items:
print('{0:8} {1:.4f}'.format(item[0], item[1]))
</code></pre></div></div>
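<p>As a sanity check on the formulas, here is a hedged, trimmed-down demo. The <code class="highlighter-rouge">PerfectModel</code> stub and the synthetic data are my own illustrations; the <strong>Metrics</strong> class is copied from above with only <strong>sse</strong>, <strong>sst</strong>, and <strong>r_squared</strong> kept. A model that reproduces the targets exactly should give an sse of 0 and an r^2 of 1.</p>

```python
import numpy as np

class Metrics:
    """Trimmed copy of the Metrics class above (sse, sst, r_squared only)."""
    def __init__(self, X, y, model):
        self.data = X
        self.target = y
        self.model = model
    def sse(self):
        '''sum of squared errors (model vs actual)'''
        return np.sum((self.target - self.model.predict(self.data)) ** 2)
    def sst(self):
        '''total sum of squared errors (actual vs avg(actual))'''
        return np.sum((self.target - np.mean(self.target)) ** 2)
    def r_squared(self):
        return 1 - self.sse() / self.sst()

class PerfectModel:
    """Hypothetical stand-in for a fitted model that predicts y exactly."""
    def __init__(self, coef):
        self.coef = coef
    def predict(self, X):
        return np.dot(X, self.coef)

rng = np.random.RandomState(1)
X = rng.rand(50, 2)
coef = np.array([1.5, -2.0])
y = np.dot(X, coef)                     # targets generated by the same model

m = Metrics(X, y, PerfectModel(coef))   # sse should be 0, r^2 should be 1
```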
<p>The <strong>Metrics</strong> class requires <em>X</em>, <em>y</em>, and a <em>model object</em> to calculate the key metrics. It’s certainly not a bad solution. However, we can do better. With a little tweaking, we can give <strong>MyLinearRegression</strong> access to <strong>Metrics</strong> in a simple yet intuitive way. Let me show you how:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class ModifiedMetrics:
def sse(self):
'''returns sum of squared errors (model vs actual)'''
squared_errors = (self.target - self.predict(self.data)) ** 2
self.sq_error_ = np.sum(squared_errors)
return self.sq_error_
def sst(self):
'''returns total sum of squared errors (actual vs avg(actual))'''
avg_y = np.mean(self.target)
squared_errors = (self.target - avg_y) ** 2
self.sst_ = np.sum(squared_errors)
return self.sst_
def r_squared(self):
'''returns calculated value of r^2'''
self.r_sq_ = 1 - self.sse()/self.sst()
return self.r_sq_
def adj_r_squared(self):
'''returns calculated value of adjusted r^2'''
self.adj_r_sq_ = 1 - (self.sse()/self._dfe) / (self.sst()/self._dft)
return self.adj_r_sq_
def mse(self):
'''returns calculated value of mse'''
self.mse_ = np.mean( (self.predict(self.data) - self.target) ** 2 )
return self.mse_
def pretty_print_stats(self):
'''returns report of statistics for a given model object'''
items = ( ('sse:', self.sse()), ('sst:', self.sst()),
('mse:', self.mse()), ('r^2:', self.r_squared()),
('adj_r^2:', self.adj_r_squared()))
for item in items:
print('{0:8} {1:.4f}'.format(item[0], item[1]))
</code></pre></div></div>
<p>Notice <strong>ModifiedMetrics</strong> no longer has <code class="highlighter-rouge">__init__</code>. Now for a slightly modified version of <strong>MyLinearRegression</strong>.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class MyLinearRegressionWithInheritance(ModifiedMetrics):
def __init__(self, fit_intercept=True):
self.coef_ = None
self.intercept_ = None
self._fit_intercept = fit_intercept
def fit(self, X, y):
"""
Fit model coefficients.
Arguments:
X: 1D or 2D numpy array
y: 1D numpy array
"""
# training data & ground truth data
self.data = X
self.target = y
# degrees of freedom population dep. variable variance
self._dft = X.shape[0] - 1
# degrees of freedom population error variance
self._dfe = X.shape[0] - X.shape[1] - 1
# check if X is 1D or 2D array
if len(X.shape) == 1:
X = X.reshape(-1,1)
# add bias if fit_intercept
if self._fit_intercept:
X = np.c_[np.ones(X.shape[0]), X]
# closed form solution
xTx = np.dot(X.T, X)
inverse_xTx = np.linalg.inv(xTx)
xTy = np.dot(X.T, y)
coef = np.dot(inverse_xTx, xTy)
# set attributes
if self._fit_intercept:
self.intercept_ = coef[0]
self.coef_ = coef[1:]
else:
self.intercept_ = 0
self.coef_ = coef
def predict(self, X):
"""Output model prediction.
Arguments:
X: 1D or 2D numpy array
"""
# check if X is 1D or 2D array
if len(X.shape) == 1:
X = X.reshape(-1,1)
return self.intercept_ + np.dot(X, self.coef_)
</code></pre></div></div>
<p>Notice how I created <strong>MyLinearRegressionWithInheritance</strong>? It contains <strong>ModifiedMetrics</strong> in parentheses right from the start. Here’s the snippet of code I’m referring to:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class MyLinearRegressionWithInheritance(ModifiedMetrics):
</code></pre></div></div>
<p>This means <strong>ModifiedMetrics</strong> acts like a base class and <strong>MyLinearRegressionWithInheritance</strong> can inherit from it. Why might this be helpful? First, it’s far more elegant. Second, imagine you wrote not just a linear regression algorithm but other regression algorithms as well, and you wanted each of those algorithms to have access to the same methods that calculate and return key regression metrics. On the one hand, you could copy all that code into each model object. On another hand, you could pass those model objects to the <strong>Metrics</strong> class. Or you could simply inherit from <strong>ModifiedMetrics</strong>. While all three will work, the last solution is by far the most elegant. It keeps your code modular and ensures you’re constructing your classes in a way that won’t break your code down the line. It’s much easier to change base class methods, or add and delete them, without having to comb through each algorithm to see if you made the required updates. In short, it makes your code manageable at scale.</p>
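<p>The pattern in miniature, with deliberately toy classes of my own invention: the base class’s methods assume the subclass sets <code class="highlighter-rouge">self.data</code> and <code class="highlighter-rouge">self.target</code> and implements <code class="highlighter-rouge">predict</code>, exactly as <strong>ModifiedMetrics</strong> does above.</p>

```python
import numpy as np

class MetricsMixin:
    """Base class: metrics shared by any model that implements predict()."""
    def mse(self):
        return np.mean((self.predict(self.data) - self.target) ** 2)

class MeanModel(MetricsMixin):
    """Toy 'regression' that always predicts the training mean."""
    def fit(self, X, y):
        self.data, self.target = X, y
        self.mean_ = np.mean(y)
        return self
    def predict(self, X):
        return np.full(X.shape[0], self.mean_)

# MeanModel never defines mse(), yet inherits it from MetricsMixin
model = MeanModel().fit(np.zeros((4, 1)), np.array([1.0, 2.0, 3.0, 4.0]))
```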
<p>We covered a lot of ground in short order, so this is a good place to stop for now.</p>
<h2 id="wrap-up">Wrap Up</h2>
<p>OOP is a powerful paradigm, keeping your code organized and manageable at scale. However, it’s not a magic bullet. Like any tool, you have to know where and when it’s appropriate to use it. That means you should spend some time learning at least a handful of OOP design patterns; there are many wonderful resources available. You’ll be surprised how much more powerful, elegant, and efficient your code will be with a little study.</p>
<hr />
<h1 id="simulated-datasets-for-faster-ml-understanding-part-12">Simulated Datasets for Faster ML Understanding (Part 1/2)</h1>
<p>Published 2018-01-23: <a href="https://dziganto.github.io/data%20science/eda/machine%20learning/python/simulated%20data/Simulated-Datasets-for-Faster-ML-Understanding">https://dziganto.github.io/data%20science/eda/machine%20learning/python/simulated%20data/Simulated-Datasets-for-Faster-ML-Understanding</a></p>
<p><img src="/assets/images/innovative_approach.jpg?raw=true" alt="image" class="center-image" /></p>
<h2 id="introduction">Introduction</h2>
<p>Oftentimes, the most difficult part of gaining expertise in machine learning is developing intuition about the strengths and weaknesses of the various algorithms. Common pedagogy follows a familiar pattern: theoretical exposition followed by application on a contrived dataset. For example, suppose you’re learning a classification algorithm for supervised machine learning. For specificity, let’s assume the algorithm du jour is <a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Gaussian_naive_Bayes">Gaussian Naive Bayes</a> (GNB). You learn, as a natural starting point, the mechanics and the fundamental assumptions. That gives you the big idea. Maybe you even code GNB from scratch to gain deeper insight. Great. Now comes time to apply GNB to “real” data. A canonical example is often presented, for example the <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">Iris</a> dataset. You learn to connect theory and application. Makes perfect sense.</p>
<p>So what’s the problem?</p>
<p>The problem is that you don’t know the generative process underlying the Iris dataset. Sure, you’re trying to deduce a proxy by fitting your GNB model. That’s the point of modeling. But that’s not what I’m getting at. No, what I want to help you understand is knowing where and when certain algorithms shine and where and when they don’t. In sum, I want to pull back the curtain; I want to show you how to understand machine learning algorithms at a much deeper level, the level of intuition. How you get there and how quickly you get there is a matter of technique, and it’s this technique that I’ll share with you so you too can gain deep expertise and intuition about machine learning algorithms with great alacrity.</p>
<h2 id="baby-steps">Baby Steps</h2>
<p>Imagine you knew the generative process underlying a dataset - you knew exactly how data was generated and how all the pieces fit together. In short, imagine you have perfect information. Now imagine running GNB on your data. Because you know precisely how the data was generated and because you know how GNB works, you can start piecing together where GNB performs well and in what situations it struggles. Now imagine you knew the generative process of not one but many datasets. Furthermore, imagine applying not just GNB but Logistic Regression, Random Forest, Support Vector Machines, and a slew of other classification algorithms you have at your disposal. All of a sudden you have the ability to garner deep insights into each of the algorithms, and fast.</p>
<p>But how do you move from imagination to reality?</p>
<h2 id="on-the-road-to-something-greater">On the Road to Something Greater</h2>
<p>The answer may surprise you. Create your own datasets! That may sound daunting but really it’s not. Let me walk you through one of my earliest incarnations. I even created a little backstory just to keep things interesting. Without further ado, here are the details.</p>
<h2 id="dataset-description">Dataset Description</h2>
<p>What follows is a full on description of the very first dataset I created. By the way, industry tends to call this type of dataset a <strong>simulated dataset</strong>.</p>
<h3 id="introduction-1">Introduction</h3>
<p>This dataset is built from scratch. It has the following properties:</p>
<blockquote>
<p><strong>Type:</strong> Classification<br />
<strong>Balanced:</strong> No (slightly imbalanced)<br />
<strong>Outliers:</strong> No<br />
<strong>Simulated Human Data Entry Errors:</strong> No<br />
<strong>Missing Values:</strong> No<br />
<strong>Nonsensical Data Types:</strong> No</p>
</blockquote>
<p>Furthermore, the dataset is designed in such a way that relying on gut feel alone will lead you astray.</p>
<h3 id="problem-description">Problem Description</h3>
<p>InstaFace (IF) is a cutting edge startup specializing in facial recognition. As a hot tech startup, IF is always looking to identify and hire the best talent. Because they are the best at what they do, their applicant pool is massive and growing. In fact, the number of applicants has grown so large and so fast that Human Resources just can’t keep up. So they need your help to create an automated way to identify the most promising candidates. In particular, they asked that you create a model that can take a number of predefined inputs and output a probability that a particular candidate will be hired. The good news is IF has hired scores of data scientists in the past, so the dataset is relatively rich.</p>
<h3 id="features">Features</h3>
<p>Below I describe each feature, whether it influences the target variable, and, if so, the likelihood of being hired for a given value of that feature.</p>
<hr />
<table>
  <thead>
    <tr><th>Feature #</th><th>Description</th><th>Important</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>degree</td><td>Y</td></tr>
    <tr><td>2</td><td>age</td><td>N</td></tr>
    <tr><td>3</td><td>gender</td><td>N</td></tr>
    <tr><td>4</td><td>major</td><td>N</td></tr>
    <tr><td>5</td><td>GPA</td><td>N</td></tr>
    <tr><td>6</td><td>experience</td><td>Y</td></tr>
    <tr><td>7</td><td>bootcamp</td><td>Y</td></tr>
    <tr><td>8</td><td>GitHub</td><td>Y</td></tr>
    <tr><td>9</td><td>blogger</td><td>Y</td></tr>
    <tr><td>10</td><td>blogs</td><td>N</td></tr>
  </tbody>
</table>
<hr />
<h4 id="feature-1">Feature 1</h4>
<ul>
<li>description: highest degree achieved</li>
<li>important: Yes</li>
<li>values: [(0=no bachelors, 8%), (1=bachelors, 70%), (2=masters, 80%), (3=PhD, 20%)]</li>
</ul>
<h4 id="feature-2">Feature 2</h4>
<ul>
<li>description: age</li>
<li>important: No</li>
<li>values: [18, 60]</li>
</ul>
<h4 id="feature-3">Feature 3</h4>
<ul>
<li>description: gender</li>
<li>important: No</li>
<li>values: [0=female, 1=male]</li>
</ul>
<h4 id="feature-4">Feature 4</h4>
<ul>
<li>description: major</li>
<li>important: No</li>
<li>values: [0=anthropology, 1=biology, 2=business, 3=chemistry, 4=engineering, 5=journalism, 6=math, 7=political science]</li>
</ul>
<h4 id="feature-5">Feature 5</h4>
<ul>
<li>description: GPA</li>
<li>important: No</li>
<li>values: [1.00, 4.00]</li>
</ul>
<h4 id="feature-6">Feature 6</h4>
<ul>
<li>description: years of experience</li>
<li>important: Yes</li>
<li>values: [(0-10, 90%), (11-25, 20%), (26-50, 5%)]</li>
</ul>
<h4 id="feature-7">Feature 7</h4>
<ul>
<li>description: attended bootcamp</li>
<li>important: Yes</li>
<li>values: [(0=No, 25%), (1=Yes, 75%)]</li>
</ul>
<h4 id="feature-8">Feature 8</h4>
<ul>
<li>description: number of projects on GitHub</li>
<li>important: Yes</li>
<li>values: [(0, 5%), (1-5, 65%), (6-20, 95%)]</li>
</ul>
<h4 id="feature-9">Feature 9</h4>
<ul>
<li>description: writes data science blog posts</li>
<li>important: Yes</li>
<li>values: [(0=No, 30%), (1=Yes, 70%)]</li>
</ul>
<h4 id="feature-10">Feature 10</h4>
<ul>
<li>description: number of blog articles written</li>
<li>important: No</li>
<li>values: [0, 20]</li>
</ul>
<h3 id="more-details">More Details</h3>
<p>Without looking at the data, many people would likely assume that a PhD would have better chances of getting hired than someone with a Master’s, that a Master’s candidate would have better chances of getting hired than someone with a Bachelor’s, and so on. This is simply not true in this case. I specifically created this dataset in such a way that people with Bachelor’s and Master’s degrees are far more likely to get hired than PhD’s or those without a degree.</p>
<p>Regarding <strong>age</strong> and <strong>gender</strong>, one may reasonably conjecture that these attributes would have high impact with regard to hiring decisions since this is a well-known bias in many real companies. However, I specifically created this dataset so that hiring decisions were made independently of these two attributes. Again, the goal is to let the data speak for itself, not to rely on intuition. There is an interesting result lurking beneath the surface, however. <strong>Age</strong> is correlated with <strong>experience</strong> so it exhibits some signal, but the true source of the signal is <strong>experience</strong>.</p>
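<p>You can verify that correlation directly. Here is a quick sketch that regenerates just the <strong>age</strong> and <strong>experience</strong> columns using the rule described below (experience drawn uniformly from 0 up to age minus 18):</p>

```python
import numpy as np
import pandas as pd

np.random.seed(10)
age = np.random.choice(a=range(18, 61), size=5000)
# Experience is capped by age, so the two columns move together.
experience = np.array([np.random.choice(a=range(0, a - 17)) for a in age])
corr = pd.Series(age).corr(pd.Series(experience))
print(corr > 0.4)  # True: age carries some of experience's signal
```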
<p>One may also assume that <strong>major</strong> and <strong>GPA</strong> are strong predictors. That may be the case at some real-world companies but not in this case. They have no impact whatsoever. Any signal present is purely due to chance.</p>
<p>On the other hand, <strong>years of experience</strong>, <strong>bootcamp experience</strong>, <strong>number of projects on GitHub</strong>, and <strong>blog experience</strong> are all strong predictors. Specifically, the dataset was designed such that candidates with light experience, bootcamp experience, numerous independent GitHub projects, and a data science blog are preferred. Surprisingly perhaps, the number of blog articles one has written is irrelevant. This was by design.</p>
<p>One last thing to note: Whether a candidate was hired is not based on any one of the five important features. Rather, five target flags were generated probabilistically based on the values of those features and a simple majority results in being hired. To add a bit more complexity, I randomly flipped 3% of hiring decisions so that learning the hiring decision rule would be more difficult.</p>
<p>Great, so now we have all that background behind us which means it’s time to actually generate the data.</p>
<h2 id="generate-data">Generate Data</h2>
<p>There are many ways to efficiently create datasets using NumPy and Pandas. I tried to keep things simple and understandable, not necessarily efficient. Please bear with me here.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import pandas as pd
# reproducibility
np.random.seed(10)
# number of observations
size = 5000
# feature setup
degree = np.random.choice(a=range(4), size=size)
age = np.random.choice(a=range(18,61), size=size)
gender = np.random.choice(a=range(2), size=size)
major = np.random.choice(a=range(8), size=size)
gpa = np.round(np.random.normal(loc=2.90, scale=0.5, size=size), 2)
experience = None
bootcamp = np.random.choice(a=range(2), size=size)
github = np.random.choice(a=range(21), size=size)
blogger = np.random.choice(a=range(2), size=size)
articles = 0
t1, t2, t3, t4, t5 = None, None, None, None, None
hired = 0
</code></pre></div></div>
<p>Now to create a pandas dataframe.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mydict = {"degree": degree, "age": age,
          "gender": gender, "major": major,
          "gpa": gpa, "experience": experience,
          "github": github, "bootcamp": bootcamp,
          "blogger": blogger, "articles": articles,
          "t1": t1, "t2": t2, "t3": t3, "t4": t4, "t5": t5, "hired": hired}

df = pd.DataFrame(mydict,
                  columns=["degree", "age", "gender", "major", "gpa",
                           "experience", "bootcamp", "github", "blogger", "articles",
                           "t1", "t2", "t3", "t4", "t5", "hired"])
</code></pre></div></div>
<p>We’re not quite there yet. We still need to update some columns. Here’s an inefficient but hopefully understandable way to do that:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>np.random.seed(42)

for i, _ in df.iterrows():

    # Constrain GPA
    if df.loc[i, 'gpa'] < 1.00 or df.loc[i, 'gpa'] > 4.00:
        if df.loc[i, 'gpa'] < 1.00:
            df.loc[i, 'gpa'] = 1.00
        else:
            df.loc[i, 'gpa'] = 4.00

    # Set experience based on age
    df.loc[i, 'experience'] = np.random.choice(a=range(0, df.loc[i, 'age']-17))

    # Set number of articles if blogger flag
    if df.loc[i, 'blogger']:
        df.loc[i, 'articles'] = np.random.choice(a=range(1, 21))

    # Set target flags
    for feature in ['degree', 'experience', 'bootcamp', 'github', 'blogger']:
        if feature == 'degree':
            if df.loc[i, feature] == 0:
                df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.92, 0.08]))  ## no bachelors
            elif df.loc[i, feature] == 1:
                df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.30, 0.70]))  ## bachelors
            elif df.loc[i, feature] == 2:
                df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.20, 0.80]))  ## masters
            else:
                df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.80, 0.20]))  ## PhD
        elif feature == 'experience':
            if df.loc[i, feature] <= 10:
                df.loc[i, 't2'] = int(np.random.choice(a=range(2), size=1, p=[0.10, 0.90]))  ## <= 10 yrs exp
            elif df.loc[i, feature] <= 25:
                df.loc[i, 't2'] = int(np.random.choice(a=range(2), size=1, p=[0.80, 0.20]))  ## 11-25 yrs exp
            else:
                df.loc[i, 't2'] = int(np.random.choice(a=range(2), size=1, p=[0.95, 0.05]))  ## >= 26 yrs exp
        elif feature == 'bootcamp':
            if df.loc[i, feature]:
                df.loc[i, 't3'] = int(np.random.choice(a=range(2), size=1, p=[0.25, 0.75]))  ## bootcamp
            else:
                df.loc[i, 't3'] = int(np.random.choice(a=range(2), size=1, p=[0.50, 0.50]))  ## no bootcamp
        elif feature == 'github':
            if df.loc[i, feature] == 0:
                df.loc[i, 't4'] = int(np.random.choice(a=range(2), size=1, p=[0.95, 0.05]))  ## 0 projects
            elif df.loc[i, feature] <= 5:
                df.loc[i, 't4'] = int(np.random.choice(a=range(2), size=1, p=[0.35, 0.65]))  ## 1-5 projects
            else:
                df.loc[i, 't4'] = int(np.random.choice(a=range(2), size=1, p=[0.05, 0.95]))  ## > 5 projects
        else:
            if df.loc[i, feature]:
                df.loc[i, 't5'] = int(np.random.choice(a=range(2), size=1, p=[0.30, 0.70]))  ## blogger
            else:
                df.loc[i, 't5'] = int(np.random.choice(a=range(2), size=1, p=[0.50, 0.50]))  ## !blogger

    # Set hired value
    if (df.loc[i, 't1'] + df.loc[i, 't2'] + df.loc[i, 't3'] + df.loc[i, 't4'] + df.loc[i, 't5']) >= 3:
        df.loc[i, 'hired'] = 1
</code></pre></div></div>
<p>The big takeaway is the last <em>if</em> statement. That’s where the target variable (aka <em>hired</em>) is set. <em>This is the generative process</em>. It simply states that if the temporary flag variables t1-t5 sum to three or more, then set hired equal to one, otherwise zero. It’s a simple decision based on a simple summation - probably not too far off from many real hiring decisions!</p>
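<p>For reference, that decision rule is a one-liner when vectorized. Here is a sketch on a toy frame (not the full dataset):</p>

```python
import pandas as pd

# Toy frame with the five 0/1 target flags from the post.
df = pd.DataFrame({'t1': [1, 0], 't2': [1, 0], 't3': [1, 1], 't4': [0, 0], 't5': [0, 1]})
# A simple majority of the flags (three or more) means hired.
df['hired'] = (df[['t1', 't2', 't3', 't4', 't5']].sum(axis=1) >= 3).astype(int)
print(df['hired'].tolist())  # [1, 0]
```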
<p>It’s worthwhile to apply just a bit more processing. Specifically, we want to remove those temporary flag variables t1-t5 and convert <strong>experience</strong> from an object type to numeric.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Drop target flags
df.drop(df[['t1', 't2', 't3', 't4', 't5']], axis=1, inplace=True)
# Set 'experience' to numeric (was object type)
df['experience'] = df['experience'].apply(pd.to_numeric)
</code></pre></div></div>
<p>Great, we’re almost there. We just need to add the last bit of complexity where we flip a few hiring decisions. Again, the aim is not efficiency but ease of understanding here.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>np.random.seed(15)
percent_to_flip = 0.03  ## % of hired values to flip
num_to_flip = int(np.floor(percent_to_flip * len(df)))  ## determine number of hired values to flip
flip_idx = np.random.randint(low=0, high=len(df), size=num_to_flip)  ## randomly select indices

for i, _ in df.loc[flip_idx].iterrows():
    if df.loc[i, 'hired'] == 1:
        df.loc[i, 'hired'] = 0
    else:
        df.loc[i, 'hired'] = 1
</code></pre></div></div>
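<p>As an aside, the flip can also be written as a single vectorized assignment. A sketch with an illustrative index list (the post draws these indices at random):</p>

```python
import pandas as pd

df = pd.DataFrame({'hired': [1, 0, 1, 0]})
flip_idx = [0, 3]  # illustrative indices
# 1 - x maps 1 -> 0 and 0 -> 1, flipping each selected hiring decision.
df.loc[flip_idx, 'hired'] = 1 - df.loc[flip_idx, 'hired']
print(df['hired'].tolist())  # [0, 0, 1, 1]
```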
<p>Great, that’s as far as we want to take this dataset.</p>
<h2 id="wrap-up">Wrap Up</h2>
<p>We covered lots of ground already. I introduced the idea of generating your own datasets from scratch. This process is known as simulating datasets. The reason for doing this is simple: you want to truly understand the generative process so you can apply various <em>Exploratory Data Analysis (EDA)</em> and machine learning techniques for the express purpose of building your intuition about which techniques work best on different types of data. That easily elevates you from novice to expert, and all it requires is a little time and practice.</p>
<p>Next time we’ll dig a bit deeper into the data. We’ll apply some basic EDA and then round out the discussion with a few traditional machine learning models to understand a bit better why one performs better than another.</p>

<h2 id="caesar-cipher">Caesar Cipher (2018-01-20)</h2>

<p><img src="/assets/images/code_talkers.png?raw=true" alt="image" class="center-image" /></p>
<h2 id="introduction">Introduction</h2>
<p>There are myriad ways to encrypt text. One of the simplest and easiest to understand is the <strong>Caesar cipher</strong>. It’s extremely easy to crack but it’s a great place to start for the purposes of introducing ciphers.</p>
<h2 id="a-bit-of-terminology">A Bit of Terminology</h2>
<p>The setup is pretty simple. You start with a message you want to codify so no one else can read it. Say the message is <code class="highlighter-rouge">I hope you cannot read this</code>. This is called the <strong>plaintext</strong>. Now we need to apply some algorithm to our text so the output is incoherent. For example, the output may be <code class="highlighter-rouge">O nuvk eua igttuz xkgj znoy</code>. This is called the <strong>ciphertext</strong>. Mapping the plaintext to ciphertext is called <strong>encryption</strong>. Mapping the ciphertext back to plaintext is called <strong>decryption</strong>. The algorithm used to encrypt or decrypt is called a <strong>cipher</strong>.</p>
<h2 id="caesar-cipher-how-it-works">Caesar Cipher: How it Works</h2>
<p>Mapping <code class="highlighter-rouge">I hope you cannot read this</code> to <code class="highlighter-rouge">O nuvk eua igttuz xkgj znoy</code> with the Caesar cipher works like this. First, you start by deciding how much you want to shift the alphabet. Say you choose a shift of six so A becomes G, B becomes H, C becomes I, and so on until you get to the end where Z becomes F. Now you have a way to map any plaintext character to ciphertext. In fact, that’s exactly how I encoded this message:</p>
<blockquote>
<p>plaintext: I hope you cannot read this.<br />
ciphertext: O nuvk eua igttuz xkgj znoy.</p>
</blockquote>
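<p>Under the hood this is just modular arithmetic on letter positions. A minimal sketch (the helper name <code class="highlighter-rouge">shift_char</code> is mine, not part of the class we build below):</p>

```python
def shift_char(c, shift):
    '''Shift a lowercase letter by `shift` positions, wrapping past z.'''
    return chr((ord(c) - ord('a') + shift) % 26 + ord('a'))

# With a shift of six: a -> g, z wraps around to f, and i -> o.
print(shift_char('a', 6), shift_char('z', 6), shift_char('i', 6))  # g f o
```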
<p>Here’s a gif that shows the various mappings:</p>
<p><img src="https://i.stack.imgur.com/D3ypD.gif" alt="Caesar Cipher gif" class="center-image" /></p>
<p>The outer circle represents plaintext letters while the inner circle represents the ciphertext equivalent.</p>
<p>Hopefully you can see right away why this particular cipher is very easy to crack. Just mapping the plaintext to ciphertext while maintaining word lengths and spaces makes the process fairly easy. By converting all the text to lowercase and removing all spaces and punctuation, we can make it a bit more challenging. But just barely. There are only 25 different ways to shift the letters, which means a brute force attack is trivial.</p>
<p>Let’s see what this looks like in code.</p>
<h2 id="the-code">The Code</h2>
<p>We’ll create a class called <strong>CaesarCipher</strong> that can encrypt or decrypt text.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class CaesarCipher:

    def _clean_text(self, text):
        '''converts text to lowercase, removes spaces, and removes punctuation.'''
        import string
        assert type(text) == str, 'input needs to be a string!'
        text = text.lower()
        text = text.replace(' ', '')
        self.clean_text = "".join(character for character in text
                                  if character not in string.punctuation)
        return self.clean_text

    def _string2characters(self, text):
        '''converts a string to individual characters.'''
        assert type(text) == str, 'input needs to be a string!'
        self.str2char = list(text)
        return self.str2char

    def _chars2nums(self, characters):
        '''converts individual characters to integers.'''
        assert type(characters) == list, 'input needs to be a list of characters!'
        codebook = {'a':0, 'b':1, 'c':2, 'd':3, 'e':4, 'f':5, 'g':6, 'h':7, 'i':8, 'j':9,
                    'k':10, 'l':11, 'm':12, 'n':13, 'o':14, 'p':15, 'q':16, 'r':17, 's':18,
                    't':19, 'u':20, 'v':21, 'w':22, 'x':23, 'y':24, 'z':25}
        for i, char in enumerate(characters):
            try:
                characters[i] = codebook[char]
            except KeyError:
                pass
        self.char2num = characters
        return self.char2num

    def _nums2chars(self, numbers):
        '''converts individual integers to characters.'''
        assert type(numbers) == list, 'input needs to be a list of numbers!'
        codebook = {0:'a', 1:'b', 2:'c', 3:'d', 4:'e', 5:'f', 6:'g', 7:'h', 8:'i', 9:'j',
                    10:'k', 11:'l', 12:'m', 13:'n', 14:'o', 15:'p', 16:'q', 17:'r', 18:'s',
                    19:'t', 20:'u', 21:'v', 22:'w', 23:'x', 24:'y', 25:'z'}
        for i, num in enumerate(numbers):
            try:
                numbers[i] = codebook[num]
            except KeyError:
                pass
        self.num2chars = numbers
        return self.num2chars

    def _preprocessing(self, text):
        '''cleans text and converts it to a list of integers.'''
        clean_text = self._clean_text(text)
        list_of_chars = self._string2characters(clean_text)
        list_of_nums = self._chars2nums(list_of_chars)
        return list_of_nums

    def encrypt(self, text, shift=3):
        '''returns text shifted forward according to user's input.'''
        import numpy as np
        preprocess = self._preprocessing(text)
        nums_shifted = list((np.array(preprocess) + shift) % 26)
        return ''.join(self._nums2chars(nums_shifted))

    def decrypt(self, text, shift=3):
        '''returns text shifted back by user-defined shift length.'''
        import numpy as np
        preprocess = self._preprocessing(text)
        num_shift = list((np.array(preprocess) - shift) % 26)
        return ''.join(self._nums2chars(num_shift))
</code></pre></div></div>
<h2 id="code-breakdown">Code Breakdown</h2>
<p>The <strong>CaesarCipher</strong> class contains a number of methods. The first is a method called <strong>_clean_text</strong> which converts all letters to lower case and removes spaces and punctuation. The second, third, and fourth methods called <strong>_string2characters</strong>, <strong>_chars2nums</strong>, and <strong>_nums2chars</strong> should be self-explanatory. The <strong>_preprocessing</strong> method is a meta-function that incorporates and applies all the aforementioned methods in one sequential process. The last two methods are the most interesting: <strong>encrypt</strong> and <strong>decrypt</strong>. They perform as advertised.</p>
<h2 id="setup">Setup</h2>
<p>Great, now let’s instantiate our class and put it through its paces.</p>
<p>To instantiate, we merely type <code class="highlighter-rouge">cc = CaesarCipher()</code>.</p>
<p>Now to encrypt a message: <code class="highlighter-rouge">print(cc.encrypt('I hope you cannot read this.', shift=6))</code>.</p>
<p>The <em>shift</em> parameter tells the class by how much to shift the letters to encrypt the plaintext. In this case I arbitrarily chose 6. The output is <code class="highlighter-rouge">onuvkeuaigttuzxkgjznoy</code>. That sure doesn’t look like anything I can make out.</p>
<p>Let’s try another one for fun. This one will showcase the preprocessing method in all its glory.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>text = 'the QuIcK brown fox jumps over the lazy dog!'
encrypted = cc.encrypt(text, shift=5)
print(encrypted)
</code></pre></div></div>
<p>The output is <code class="highlighter-rouge">ymjvznhpgwtbsktcozruxtajwymjqfeditl</code>.</p>
<h2 id="discussion">Discussion</h2>
<p>Now if you’ve given this a little thought, you should see ways to crack this cipher wide open.</p>
<p>The English language is replete with structure. Certain letters appear far more frequently than others. The letter <em>e</em>, for example, is the most common letter in the English language. Therefore, using letter frequencies is a very effective strategy. Another giveaway is double letters, of which only so many pairings exist. So given longer snippets of text, you can deduce plaintext-to-ciphertext letter mappings with ease.</p>
<p><img src="/assets/images/letter_frequency.png?raw=true" alt="Letter Frequency" class="center-image" /></p>
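<p>Counting letters takes one line with the standard library. A quick sketch on the ciphertext from above (with a snippet this short the most common letter happens to decode to <em>o</em> rather than <em>e</em>, but the idea is the same):</p>

```python
from collections import Counter

ciphertext = 'ymjvznhpgwtbsktcozruxtajwymjqfeditl'
counts = Counter(ciphertext)
# 't' tops the list; shifting it back by 5 gives plaintext 'o'.
print(counts.most_common(1))  # [('t', 4)]
```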
<p>If all else fails or you just want to find the answer quickly, a brute force search will expose the plaintext.</p>
<p>Let’s see how that works.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># show all decryption possibilities
for i in range(1, 26):
    print('shift{:2}: {}'.format(i, cc.decrypt(encrypted, shift=i)))
</code></pre></div></div>
<p>Which outputs:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>shift 1: xliuymgofvsarjsbnyqtwszivxlipedchsk
shift 2: wkhtxlfneurzqiramxpsvryhuwkhodcbgrj
shift 3: vjgswkemdtqyphqzlworuqxgtvjgncbafqi
shift 4: uifrvjdlcspxogpykvnqtpwfsuifmbazeph
shift 5: thequickbrownfoxjumpsoverthelazydog
shift 6: sgdpthbjaqnvmenwitlornudqsgdkzyxcnf
shift 7: rfcosgaizpmuldmvhsknqmtcprfcjyxwbme
shift 8: qebnrfzhyoltkclugrjmplsboqebixwvald
shift 9: pdamqeygxnksjbktfqilokranpdahwvuzkc
shift10: oczlpdxfwmjriajsephknjqzmoczgvutyjb
shift11: nbykocwevliqhzirdogjmipylnbyfutsxia
shift12: maxjnbvdukhpgyhqcnfilhoxkmaxetsrwhz
shift13: lzwimauctjgofxgpbmehkgnwjlzwdsrqvgy
shift14: kyvhlztbsifnewfoaldgjfmvikyvcrqpufx
shift15: jxugkysarhemdvenzkcfieluhjxubqpotew
shift16: iwtfjxrzqgdlcudmyjbehdktgiwtaponsdv
shift17: hvseiwqypfckbtclxiadgcjsfhvszonmrcu
shift18: gurdhvpxoebjasbkwhzcfbiregurynmlqbt
shift19: ftqcguowndaizrajvgybeahqdftqxmlkpas
shift20: espbftnvmczhyqziufxadzgpcespwlkjozr
shift21: droaesmulbygxpyhtewzcyfobdrovkjinyq
shift22: cqnzdrltkaxfwoxgsdvybxenacqnujihmxp
shift23: bpmycqksjzwevnwfrcuxawdmzbpmtihglwo
shift24: aolxbpjriyvdumveqbtwzvclyaolshgfkvn
shift25: znkwaoiqhxuctludpasvyubkxznkrgfejum
</code></pre></div></div>
<p>A quick scan gives away the plaintext: <code class="highlighter-rouge">shift 5: thequickbrownfoxjumpsoverthelazydog</code>.</p>
<h2 id="wrap-up">Wrap Up</h2>
<p>Hopefully you found this a fun introduction to cryptography. It’s a rich and rewarding field with endless applications.</p>
<p>Next time, we’ll build upon what we learned here as we explore a more challenging cipher known as the <strong>Vigenere cipher</strong>.</p>

<h2 id="model-tuning-part-2">Model Tuning (Part 2 - Validation &amp; Cross-Validation) (2018-01-19)</h2>

<p><img src="/assets/images/cv_image.png?raw=true" alt="Comic" class="center-image" /></p>
<h2 id="introduction">Introduction</h2>
<p>Last time in <a href="https://dziganto.github.io/data%20science/machine%20learning/model%20tuning/python/Model-Tuning-Train-Test-Split/">Model Tuning (Part 1 - Train/Test Split)</a> we discussed training error, test error, and train/test split. We learned that training a model on all the available data and then testing on that very same data is an awful way to build models because we have no indication as to how well that model will perform on unseen data. In other words, we don’t know if the model is essentially memorizing the data it’s seen or if it’s truly picking up the pattern inherent in the data (i.e. its ability to generalize).</p>
<p>To remedy that situation, we implemented <em>train/test split</em> that effectively holds some data aside from the model building process for testing at the very end when the model is fully trained. This allows us to see how the model performs on unseen data and gives us some indication as to whether the model generalizes or not.</p>
<p>Now that we have a solid foundation, we can move on to more advanced topics that will take our model-building skills to the next level. Specifically, we’ll dig in to the following topics:</p>
<ul>
<li>Bias-Variance Tradeoff</li>
<li>Validation Set</li>
<li>Model Tuning</li>
<li>Cross-Validation</li>
</ul>
<p>To make this concrete, we’ll combine theory and application. For the latter, we’ll leverage the <strong>Boston</strong> dataset in sklearn.</p>
<blockquote>
<p>Please refer to the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html">Boston dataset</a> for details.</p>
</blockquote>
<p>Our first step is to read in the data and prep it for modeling.</p>
<h2 id="get--prep-data">Get & Prep Data</h2>
<p>Here’s a bit of code to get us going:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_boston

boston = load_boston()
data = boston.data
target = boston.target
</code></pre></div></div>
<p>And now let’s split the data into train/test split like so:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.model_selection import train_test_split

# train/test split
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    target,
                                                    shuffle=True,
                                                    test_size=0.2,
                                                    random_state=15)
</code></pre></div></div>
<h2 id="setup">Setup</h2>
<p>We know we’ll need to calculate training and test error, so let’s go ahead and create functions to do just that. Let’s include a meta-function that will generate a nice report for us while we’re at it. Also, Mean Squared Error (MSE) will be our metric of choice.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.metrics import mean_squared_error

def calc_train_error(X_train, y_train, model):
    '''returns in-sample error for already fit model.'''
    predictions = model.predict(X_train)
    mse = mean_squared_error(y_train, predictions)
    return mse

def calc_validation_error(X_test, y_test, model):
    '''returns out-of-sample error for already fit model.'''
    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    return mse

def calc_metrics(X_train, y_train, X_test, y_test, model):
    '''fits model and returns the MSE for in-sample error and out-of-sample error'''
    model.fit(X_train, y_train)
    train_error = calc_train_error(X_train, y_train, model)
    validation_error = calc_validation_error(X_test, y_test, model)
    return train_error, validation_error
</code></pre></div></div>
<p>Time to dive into a little theory. Stay with it because we’ll come back around to the application side where you’ll see how all the pieces fit together.</p>
<h2 id="theory">Theory</h2>
<h3 id="bias-variance-tradeoff">Bias-Variance Tradeoff</h3>
<p>Pay very close attention to this section. It is one of the most important concepts in all of machine learning. Understanding this concept will help you diagnose all types of models, be they linear regression, XGBoost, or Convolutional Neural Networks.</p>
<p>We already know how to calculate training error and test error. So far we’ve simply been using test error as a way to gauge how well our model will generalize. That was a good first step but it’s not good enough. We can do better. We can tune our model. Let’s drill down.</p>
<p>We can compare training error and something called <em>validation error</em> to figure out what’s going on with our model - more on validation error in a minute. Depending on the values of each, our model can be in one of three regions:</p>
<p>1) <strong>High Bias</strong> - underfitting<br />
2) <strong>Goldilocks Zone</strong> - just right<br />
3) <strong>High Variance</strong> - overfitting</p>
<p><img src="/assets/images/bias-variance-tradeoff.png?raw=true" alt="Bias-Variance Tradeoff" class="center-image" /></p>
<h3 id="plot-orientation">Plot Orientation</h3>
<p>The x-axis represents model complexity. This has to do with how flexible your model is. Some things that add complexity to a model include: additional features, increasing polynomial terms, and increasing the depth for tree-based models. Keep in mind this is far from an exhaustive list but you should get the gist.</p>
<p>The y-axis indicates model error. It’s often measured as <em>Mean-Squared Error (MSE)</em> for Regression and <em>Cross-Entropy</em> or <em>Accuracy</em> for Classification.</p>
<p>The blue curve is <em>Training Error</em>. Notice that it only decreases. What should be painfully obvious is that adding model complexity leads to smaller and smaller training errors. That’s a key finding.</p>
<p>The green curve forms a U-shape. This curve represents <em>Validation Error</em>. Notice the trend. First it decreases, hits a minimum, and then increases. We’ll talk in more detail shortly about what exactly <em>Validation Error</em> is and how to calculate it.</p>
<h3 id="high-bias">High Bias</h3>
<p>The rectangular box outlined by dashes to the left and labeled as <em>High Bias</em> is the first region of interest. Here you’ll notice <em>Training Error</em> and <em>Validation Error</em> are high. You’ll also notice that they are close to one another. This region is defined as the one where the model lacks the flexibility required to really pull out the inherent trend in the data. In machine learning speak, it is <em>underfitting</em>, meaning it’s doing a poor job all around and won’t generalize well. The model doesn’t even do well on the training set.</p>
<p>How do you fix this?</p>
<p>By adding model complexity of course. I’ll go into much more detail about what to do when you realize you’re under or overfitting in another post. For now, assuming you’re using linear regression, a good place to start is by adding additional features. The addition of parameters to your model grants it flexibility that can push your model into the Goldilocks Zone.</p>
<h3 id="goldilocks-zone">Goldilocks Zone</h3>
<p>The middle region without dashes I’ve named the <em>Goldilocks Zone</em>. Your model has just the right amount of flexibility to pick up on the pattern inherent in the data but isn’t so flexible that it’s really just memorizing the training data. This region is marked by <em>Training Error</em> and <em>Validation Error</em> that are both low and close to one another. This is where your model should live.</p>
<h3 id="high-variance">High Variance</h3>
<p>The dashed rectangular box to the right and labeled <em>High Variance</em> is the flip of the <em>High Bias</em> region. Here the model has so much flexibility that it essentially starts to memorize the training data. Not surprisingly, that approach leads to low <em>Training Error</em>. But as was mentioned in the <a href="https://dziganto.github.io/data%20science/machine%20learning/model%20tuning/python/Model-Tuning-Train-Test-Split/">train/test post</a>, a lookup table does not generalize, which is why we see high <em>Validation Error</em> in this region. You know you’re in this region when your <em>Training Error</em> is low but your <em>Validation Error</em> is high. Said another way, if there’s a sizeable delta between the two, you’re overfitting.</p>
<p>How do you fix this?</p>
<p>By decreasing model complexity. Again, I’ll go into much more detail in a separate post about what exactly to do. For now, consider applying regularization or dropping features.</p>
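<p>One concrete way to rein in complexity without dropping features is L2 regularization (ridge regression). Here is a minimal sketch on synthetic data (not the Boston set we use below), just to show the shrinkage effect:</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.randn(50, 10)
y = X[:, 0] + 0.1 * rng.randn(50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
# The L2 penalty shrinks the coefficient vector, trading a little bias
# for lower variance.
print(np.linalg.norm(ridge.coef_) < np.linalg.norm(ols.coef_))  # True
```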
<h3 id="canonical-plot">Canonical Plot</h3>
<p>Let’s look at one more plot to drive these ideas home.</p>
<p><img src="/assets/images/bias-and-variance-targets.jpg?raw=true" alt="Bias-Variance Target Pic" class="center-image" /></p>
<p>Imagine you’ve entered an archery competition. You receive a score based on which portion of the target you hit: 0 for the red circle (bullseye), 1 for the blue, and 2 for the white. The goal is to minimize your score and you do that by hitting as many bullseyes as possible.</p>
<p>The archery metaphor is a useful analog to explain what we’re trying to accomplish by building a model. Given different datasets (equivalent to different arrows), we want a model that predicts as closely as possible to observed data (aka targets).</p>
<p>The top <strong>Low Bias/Low Variance</strong> portion of the graph represents the ideal case. This is the <strong>Goldilocks Zone</strong>. Our model has extracted all the useful information and generalizes well. We know this because the model is accurate and exhibits little variance, even when predicting on unseen data. The model is highly tuned, much like an archer who can adjust to different wind speeds, distances, and lighting conditions.</p>
<p>The <strong>Low Bias/High Variance</strong> portion of the graph represents <em>overfitting</em>. Our model does well on the training data, but we see high variance for specific datasets. This is analogous to an archer who has trained under very stringent conditions - perhaps indoors where there is no wind, the distance is consistent, and the lighting is always the same. Any variation in any of those attributes throws off the archer’s accuracy. The archer lacks consistency.</p>
<p>The <strong>High Bias/Low Variance</strong> portion of the graph represents <em>underfitting</em>. Our model does poorly on any given dataset. In fact, it’s so bad that it does just about as poorly regardless of the data you feed it, hence the small variance. As an analog, consider an archer who has learned to fire with consistency but hasn’t learned to hit the target. This is analogous to a model that always predicts the average value of the training data’s target.</p>
<p>The <strong>High Bias/High Variance</strong> portion of the graph is the worst case: a model that is both inaccurate and inconsistent, like an archer who sprays arrows all over the target. The bias-variance tradeoff doesn’t forbid this region - it says we can’t drive both bias and variance arbitrarily low at the same time - so tuning a model is really about navigating between the overfitting and underfitting regimes toward the Goldilocks Zone.</p>
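<p>Before we move on, the regimes are easy to reproduce on synthetic data. Here’s a quick sketch (my own toy example, not this post’s dataset): fit polynomials of degree 1, 4, and 15 to a noisy sine wave. Degree 1 underfits (high bias), degree 15 overfits (high variance), and degree 4 sits near the Goldilocks Zone.</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)

# interleave points into train and test halves
X_tr, X_te, y_tr, y_te = X[::2], X[1::2], y[::2], y[1::2]

errors = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    errors[degree] = (mean_squared_error(y_tr, model.predict(X_tr)),   # train MSE
                      mean_squared_error(y_te, model.predict(X_te)))   # test MSE
```

<p>You should see degree 1 post high error on both splits, while degree 15 drives train error far below its test error - the overfitting signature.</p>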
<p>Alright, let’s shift gears to see this in practice now that we’ve got the theory down.</p>
<h2 id="application">Application</h2>
<p>Let’s build a linear regression model of the <a href="http://archive.ics.uci.edu/ml/datasets/Forest+Fires">Forest Fire</a> dataset. We’ll investigate whether our model is underfitting, overfitting, or fitting just right. If it’s under or overfitting, we’ll look at one way we can correct that.</p>
<p>Time to build the model.</p>
<blockquote>
<p>Note: I’ll use <strong>train_error</strong> to represent <strong>training error</strong> and <strong>test_error</strong> to represent <strong>validation error</strong>.</p>
</blockquote>
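<p>A quick aside on that helper: <code class="highlighter-rouge">calc_metrics</code> was defined earlier in this post. If you’re dropping in at this section, here’s a minimal stand-in (my sketch - the original may differ in details) that fits a model and returns train and test MSE:</p>

```python
from sklearn.metrics import mean_squared_error

def calc_metrics(X_train, y_train, X_test, y_test, model):
    """Fit model on the training set; return (train_error, test_error) as MSE."""
    model.fit(X_train, y_train)
    train_error = mean_squared_error(y_train, model.predict(X_train))
    test_error = mean_squared_error(y_test, model.predict(X_test))
    return train_error, test_error
```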
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lr = LinearRegression(fit_intercept=True)
train_error, test_error = calc_metrics(X_train, y_train, X_test, y_test, lr)
train_error, test_error = round(train_error, 3), round(test_error, 3)
print('train error: {} | test error: {}'.format(train_error, test_error))
print('train/test: {}'.format(round(test_error/train_error, 1)))
</code></pre></div></div>
<p>The output looks like:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>train error: 21.874 | test error: 23.817
train/test: 1.1
</code></pre></div></div>
<p>Hmm, our training error is somewhat lower than the test error. In fact, the test error is 1.1 times the training error, or about 10% worse. It’s not a big difference, but it’s worth investigating.</p>
<p>Which region does that put us in?</p>
<p>That’s right, it’s ever so slightly in the <em>High Variance</em> region, which means our model is slightly overfitting. Again, that means our model has a tad too much complexity.</p>
<p>Unfortunately, we’re stuck at this point.</p>
<p>You’re probably thinking, <em>“Hey wait, no we’re not. I can drop a feature or two and then recalculate training error and test error.”</em></p>
<p>My response is simply: <em>NOPE. DON’T. PLEASE. EVER. FOR ANY REASON. PERIOD.</em></p>
<p>Why not?</p>
<p>Because if you do that, your test set is no longer a test set: you are using it to train your model. It’s the same as if you had trained your model on all the data from the beginning. Seriously, don’t do this. Unfortunately, practicing data scientists sometimes do; it’s one of the worst mistakes you can make because you’re almost guaranteed to produce a model that cannot generalize.</p>
<p>So what do we do?</p>
<p>We need to go back to the beginning. We need to split our data into three datasets: training, validation, test.</p>
<p>Remember, the test set is data you don’t touch until you’re happy with your model. The test set is used only <strong>ONE</strong> time to see how your model will generalize. That’s it.</p>
<p>Okay, let’s take a look at this thing called a <strong>Validation Set</strong>.</p>
<h2 id="validation-set">Validation Set</h2>
<p>Three datasets from one seems like a lot of work but I promise it’s worth it. First, let’s see how to do this in practice.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.model_selection import train_test_split

# intermediate/test split (gives us test set)
X_intermediate, X_test, y_intermediate, y_test = train_test_split(data,
                                                                  target,
                                                                  shuffle=True,
                                                                  test_size=0.2,
                                                                  random_state=15)

# train/validation split (gives us train and validation sets)
X_train, X_validation, y_train, y_validation = train_test_split(X_intermediate,
                                                                y_intermediate,
                                                                shuffle=False,
                                                                test_size=0.25,
                                                                random_state=2018)
</code></pre></div></div>
<p>Now for a little cleanup and some output:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># delete intermediate variables
del X_intermediate, y_intermediate

# print proportions
print('train: {}% | validation: {}% | test: {}%'.format(round(100*len(y_train)/len(target)),
                                                        round(100*len(y_validation)/len(target)),
                                                        round(100*len(y_test)/len(target))))
</code></pre></div></div>
<p>Which outputs:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>train: 60% | validation: 20% | test: 20%
</code></pre></div></div>
<p>If you’re a visual person, this is how our data has been segmented.</p>
<p><img src="/assets/images/train-validate-test.png?raw=true" alt="Train-Validate-Test Sets" class="center-image" /></p>
<p>We now have three datasets, depicted by the graphic above, where the training set constitutes 60% of all data, the validation set 20%, and the test set 20%. Notice that I haven’t changed the actual test set in any way: I used the same initial split and the same random state. That way we can compare the model we’re about to fit and tune to the linear regression model we built earlier.</p>
<blockquote>
<p>Side note: there is no hard and fast rule about how to proportion your data. Just know that your model is limited in what it can learn if you limit the data you feed it. However, if your test set is too small, it won’t provide an accurate estimate as to how your model will perform. Cross-validation allows us to handle this situation with ease, but more on that later.</p>
</blockquote>
<p>Time to fit and tune our model.</p>
<h2 id="model-tuning">Model Tuning</h2>
<p>We need to decrease complexity. One way to do this is by using <em>regularization</em>. I won’t go into the nitty gritty of how regularization works now because I’ll cover that in a future post. Just know that regularization is a form of constrained optimization that imposes limits on determining model parameters. It effectively allows me to add bias to a model that’s overfitting. I can control the amount of bias with a hyperparameter called <em>lambda</em> or <em>alpha</em> (you’ll see both, though sklearn uses alpha because lambda is a Python keyword) that defines regularization strength.</p>
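<p>To see that bias knob in action, here’s a toy sketch (synthetic data of my own, not the Forest Fire model): as alpha grows, ridge regression shrinks the coefficient vector toward zero, trading variance for bias.</p>

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y = X @ true_coef + rng.normal(0, 0.1, 100)

# the L2 norm of the fitted coefficients shrinks as regularization strength grows
norms = {alpha: np.linalg.norm(Ridge(alpha=alpha).fit(X, y).coef_)
         for alpha in (0.01, 1.0, 100.0)}
```

<p>Printing <code class="highlighter-rouge">norms</code> shows the coefficient norm falling monotonically as alpha climbs from 0.01 to 100.</p>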
<p>The code:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

alphas = [0.001, 0.01, 0.1, 1, 10]
print('All errors are MSE')
print('-'*76)
for alpha in alphas:
    # instantiate and fit model
    ridge = Ridge(alpha=alpha, fit_intercept=True, random_state=99)
    ridge.fit(X_train, y_train)
    # calculate errors
    new_train_error = mean_squared_error(y_train, ridge.predict(X_train))
    new_validation_error = mean_squared_error(y_validation, ridge.predict(X_validation))
    new_test_error = mean_squared_error(y_test, ridge.predict(X_test))
    # print errors as report
    print('alpha: {:7} | train error: {:5} | val error: {:6} | test error: {}'.
          format(alpha,
                 round(new_train_error,3),
                 round(new_validation_error,3),
                 round(new_test_error,3)))
</code></pre></div></div>
<p>And the output:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>All errors are MSE
----------------------------------------------------------------------------
alpha: 0.001 | train error: 22.93 | val error: 19.796 | test error: 23.959
alpha: 0.01 | train error: 22.93 | val error: 19.792 | test error: 23.944
alpha: 0.1 | train error: 22.945 | val error: 19.779 | test error: 23.818
alpha: 1 | train error: 23.324 | val error: 20.135 | test error: 23.522
alpha: 10 | train error: 24.214 | val error: 20.958 | test error: 23.356
</code></pre></div></div>
<p>There are a few key takeaways here. First, notice the U-shaped behavior of the validation error: it starts at 19.796, drops for two steps, then climbs back up. Also notice that validation error and test error tend to move together, though the relationship is by no means perfect - both errors decrease as alpha increases initially, but then test error keeps falling while validation error turns back up. That looseness has a lot to do with the fact that we’re dealing with a very small dataset: each sample represents a much larger proportion of the data than it would in a dataset with a million or more records. Even so, validation error is a good proxy for test error, especially as dataset size increases. With small to medium-sized datasets, we can do better by leveraging cross-validation. We’ll talk about that shortly.</p>
<p>Now that we’ve tuned our model, let’s fit a new ridge regression model on all data except the test data. Then we’ll check the test error and compare it to that of our original linear regression model with all features.</p>
<h4 id="setup-data-model--calculate-errors">Setup Data, Model, & Calculate Errors</h4>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># train/test split
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    target,
                                                    shuffle=True,
                                                    test_size=0.2,
                                                    random_state=15)

# instantiate model
ridge = Ridge(alpha=0.11, fit_intercept=True, random_state=99)

# fit and calculate errors
new_train_error, new_test_error = calc_metrics(X_train, y_train, X_test, y_test, ridge)
new_train_error, new_test_error = round(new_train_error, 3), round(new_test_error, 3)
</code></pre></div></div>
<h4 id="report-errors">Report Errors</h4>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>print('ORIGINAL ERROR')
print('-' * 40)
print('train error: {} | test error: {}\n'.format(train_error, test_error))
print('ERROR w/REGULARIZATION')
print('-' * 40)
print('train error: {} | test error: {}'.format(new_train_error, new_test_error))
</code></pre></div></div>
<p>Here’s that output:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ORIGINAL ERROR
----------------------------------------
train error: 21.874 | test error: 23.817
ERROR w/REGULARIZATION
----------------------------------------
train error: 21.883 | test error: 23.673
</code></pre></div></div>
<p>A very small increase in training error coupled with a small decrease in test error. We’re definitely moving in the right direction. Perhaps not quite the magnitude of change we expected, but we’re simply trying to prove a point here. Remember this is a tiny dataset. Also remember I said we can do better by using something called <em>Cross-Validation</em>. Now’s the time to talk about that.</p>
<h2 id="cross-validation">Cross-Validation</h2>
<p>Let me say this upfront: this method works great on small to medium-sized datasets. This is absolutely not the kind of thing you’d want to try on a massive dataset (think tens or hundreds of millions of rows and/or columns). Alright, let’s dig in now that that’s out of the way.</p>
<p>As we saw in the post about <a href="https://dziganto.github.io/data%20science/machine%20learning/model%20tuning/python/Model-Tuning-Train-Test-Split/">train/test split</a>, how you split smaller datasets makes a significant difference; the results can vary tremendously. As the random state is not a hyperparameter (seriously, please don’t do that), we need a way to extract every last bit of signal from the data that we possibly can. So instead of just one train/validation split, let’s do K of them.</p>
<p>This technique is appropriately named <em>K-fold cross-validation</em>. Again, K represents how many train/validation splits you need. There’s no hard and fast rule about how to choose K but there are better and worse choices. As the size of your dataset grows, you can get away with smaller values for K, like 3 or 5. When your dataset is small, it’s common to select a larger number like 10. Again, these are just rules of thumb.</p>
<p>Here’s the general idea for 10-fold CV:</p>
<p><img src="/assets/images/kfold-cross-validation.png?raw=true" alt="Cross-Validation" class="center-image" /></p>
<p>You segment off a percentage of your training data as a validation fold.</p>
<blockquote>
<p><strong>Technical note:</strong> Be careful with terminology. Some people refer to the <em>validation fold</em> as the <em>test fold</em> and use the terms interchangeably. That’s confusing, and strictly speaking it’s incorrect: the test set is the pristine data that only gets consumed at the very end, if it exists at all. Don’t conflate the two.</p>
</blockquote>
<p>Once data has been segmented off in the validation fold, you fit a fresh model on the remaining training data. Ideally, you calculate train and validation error. Some people only look at validation error, however.</p>
<p>The data included in the first validation fold will never be part of a validation fold again. A new validation fold is created, segmenting off the same percentage of data as in the first iteration. Then the process repeats: fit a fresh model, calculate key metrics, iterate. The algorithm concludes after K rounds, leaving you with K estimates of the validation error; along the way, each data point has landed in a validation fold exactly once and in training folds K-1 times. The last step is to average those K validation errors (for regression). This gives a good estimate of how well a particular model will perform.</p>
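<p>That bookkeeping is easy to verify with sklearn’s <code class="highlighter-rouge">KFold</code> on a toy array: across the K splits, every sample lands in a validation fold exactly once.</p>

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

val_indices = []
for train_idx, val_idx in kf.split(X):
    # each iteration: 16 samples to train on, 4 held out for validation
    val_indices.extend(val_idx)

# the validation folds partition the dataset: each index appears exactly once
assert sorted(val_indices) == list(range(20))
```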
<p>Again, this method is invaluable for tuning hyperparameters on small to medium-sized datasets. You technically don’t even need a test set. That’s great if you just don’t have the data. For large datasets, use a simple train/validation/test split strategy and tune your hyperparameters like we did in the previous section.</p>
<p>Alright, let’s see K-fold CV in action.</p>
<h2 id="sklearn--cv">Sklearn & CV</h2>
<p>There are two ways to do this in sklearn, depending on what you want to get out of it.</p>
<p>The first method I’ll show you is <code class="highlighter-rouge">cross_val_score</code>, which works beautifully if all you care about is validation error.</p>
<p>The second method is <code class="highlighter-rouge">KFold</code>, which is perfect if you require train and validation errors.</p>
<p>Let’s try a new model called <strong>LASSO</strong> just to keep things interesting.</p>
<h3 id="cross_val_score">cross_val_score</h3>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

alphas = [1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1]
val_errors = []
for alpha in alphas:
    lasso = Lasso(alpha=alpha, fit_intercept=True, random_state=77)
    errors = np.sum(-cross_val_score(lasso,
                                     data,
                                     y=target,
                                     scoring='neg_mean_squared_error',
                                     cv=10,
                                     n_jobs=-1))
    val_errors.append(np.sqrt(errors))
</code></pre></div></div>
<p>Let’s checkout the validation errors associated with each alpha.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># RMSE
print(val_errors)
</code></pre></div></div>
<p>Which returns:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[18.64401379981868, 18.636528438323769, 18.578057471596566, 18.503285318281634, 18.565586130742307, 21.412874355105991]
</code></pre></div></div>
<p>Which value of alpha gave us the smallest validation error?</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>print('best alpha: {}'.format(alphas[np.argmin(val_errors)]))
</code></pre></div></div>
<p>Which returns: <code class="highlighter-rouge">best alpha: 0.1</code></p>
<h3 id="k-fold">K-Fold</h3>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.model_selection import KFold

K = 10
kf = KFold(n_splits=K, shuffle=True, random_state=42)

for alpha in alphas:
    train_errors = []
    validation_errors = []
    for train_index, val_index in kf.split(data, target):
        # split data
        X_train, X_val = data[train_index], data[val_index]
        y_train, y_val = target[train_index], target[val_index]
        # instantiate model
        lasso = Lasso(alpha=alpha, fit_intercept=True, random_state=77)
        # calculate errors
        train_error, val_error = calc_metrics(X_train, y_train, X_val, y_val, lasso)
        # append to appropriate list
        train_errors.append(train_error)
        validation_errors.append(val_error)
    # generate report
    print('alpha: {:6} | mean(train_error): {:7} | mean(val_error): {}'.
          format(alpha,
                 round(np.mean(train_errors),4),
                 round(np.mean(validation_errors),4)))
</code></pre></div></div>
<p>Here’s that output:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>alpha: 0.0001 | mean(train_error): 21.8217 | mean(val_error): 23.3633
alpha: 0.001 | mean(train_error): 21.8221 | mean(val_error): 23.3647
alpha: 0.01 | mean(train_error): 21.8583 | mean(val_error): 23.4126
alpha: 0.1 | mean(train_error): 22.9727 | mean(val_error): 24.6014
alpha: 1 | mean(train_error): 26.7371 | mean(val_error): 28.236
alpha: 10.0 | mean(train_error): 40.183 | mean(val_error): 40.9859
</code></pre></div></div>
<p>Comparing the output of <em>cross_val_score</em> to that of <em>KFold</em>, we can see that the general trend holds - an alpha of 10 results in the largest validation error. You may wonder why the actual values differ: the two runs split the data differently, and the <em>cross_val_score</em> block above reports the square root of the <em>summed</em> fold errors while the <em>KFold</em> block reports their mean. The important thing is that each gives us a viable method to calculate whatever we need, whether it be purely validation error or a combination of training and validation error.</p>
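<p>As it turns out, you can line the splits up: <code class="highlighter-rouge">cross_val_score</code> accepts a CV splitter object for its <code class="highlighter-rouge">cv</code> argument, so handing both procedures the same <code class="highlighter-rouge">KFold</code> instance yields identical folds. A sketch on synthetic data (my example, not the Forest Fire set):</p>

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# one KFold object drives both procedures, so the splits are identical
cvs_errors = -cross_val_score(Lasso(alpha=0.1), X, y,
                              scoring='neg_mean_squared_error', cv=kf)

manual_errors = []
for train_idx, val_idx in kf.split(X):
    model = Lasso(alpha=0.1).fit(X[train_idx], y[train_idx])
    manual_errors.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

# fold-by-fold errors from the two procedures now match
```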
<blockquote>
<p>Update: sklearn has a method called <a href="http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html">cross_validate</a> that will capture training and validation errors for you. It’ll even spit out how long it took to train a model for each fold as well as the time it took to score the model on each validation set.</p>
</blockquote>
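<p>Here’s a quick sketch of <code class="highlighter-rouge">cross_validate</code> with <code class="highlighter-rouge">return_train_score=True</code> (again on synthetic data for illustration):</p>

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

results = cross_validate(Ridge(alpha=1.0), X, y,
                         scoring='neg_mean_squared_error',
                         cv=5,
                         return_train_score=True)

# per-fold training and validation errors, plus fit/score timings
train_mse = -results['train_score'].mean()
val_mse = -results['test_score'].mean()
```

<p>The returned dict carries <code class="highlighter-rouge">fit_time</code> and <code class="highlighter-rouge">score_time</code> arrays alongside the per-fold scores.</p>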
<h2 id="wrap-up">Wrap Up</h2>
<p>Once you’ve tuned your hyperparameters, what do you do? Simply train a fresh model on all the data so you can extract as much information as possible. That way your model will have the best predictive power on future data. Mission complete!</p>
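<p>In code, that last step is just a refit on everything. Here’s a sketch with stand-in data (<code class="highlighter-rouge">make_regression</code> substitutes for this post’s <code class="highlighter-rouge">data</code>/<code class="highlighter-rouge">target</code>; the alpha of 0.1 echoes the tuned value from earlier):</p>

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# stand-ins for the post's `data` and `target`
data, target = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# refit the tuned model on ALL the data before putting it to work
final_model = Ridge(alpha=0.1, fit_intercept=True).fit(data, target)
predictions = final_model.predict(data)
```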
<h2 id="summary">Summary</h2>
<p>We discussed the <em>Bias-Variance Tradeoff</em>, where a high bias model is one that is underfit while a high variance model is one that is overfit. We also learned that we can split data into three groups for tuning purposes: train, validation, and test. Remember, the test set is used only <em>one</em> time to check how well a model generalizes on data it’s never seen. This three-group split works exceedingly well for large datasets, but not for small to medium-sized ones. In that case, use cross-validation (CV). CV can help you tune your models and extract as much signal as possible from the small data sample. Remember, with CV you don’t need a test set. By using a K-fold approach, you get the equivalent of K test sets with which to check validation error. This helps you diagnose where you’re at in the bias-variance regime.</p>