Jekyll2018-06-25T19:48:19+00:00https://dziganto.github.io/Standard DeviationsMusings in machine learning, data science, and artificial intelligence.David Ziganto[email protected]Setup an EMR Cluster via AWS CLI2018-06-25T00:00:00+00:002018-06-25T00:00:00+00:00https://dziganto.github.io/aws/aws%20cli/emr/big%20data/hadoop/jupyterhub/spark/Setup-an-EMR-Cluster-via-AWS-CLI<p><img src="/assets/images/Amazon_EMR_main.png?raw=true" alt="image" class="center-image" /></p>
<h2 id="objective">Objective</h2>
<p>In this no-frills post, you’ll learn how to set up a big data cluster on Amazon EMR using nothing but the AWS command line.</p>
<h2 id="prerequisites">Prerequisites</h2>
<ol>
<li>You have an AWS account.</li>
<li>You have set up a <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/get-set-up-for-amazon-ec2.html#create-a-key-pair">Key Pair</a>.</li>
<li>You have basic familiarity with the command line.</li>
<li>You have installed AWS CLI for <a href="https://docs.aws.amazon.com/cli/latest/userguide/awscli-install-linux.html">Linux</a>, <a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-install-macos.html">Mac</a> or <a href="https://docs.aws.amazon.com/cli/latest/userguide/awscli-install-windows.html">Windows</a>.</li>
</ol>
<h2 id="overview">Overview</h2>
<p>Before we dive in, let’s get a handle on what we need to cover. First, I’ll show you the main command I typically run to set up a cluster. Then we’ll break down the command to understand all the key pieces. Please note that text in CAPS is something you’ll need to update with your information. For example, you’ll have to provide your own key pair. So without further ado, let’s dive in.</p>
<h2 id="the-command">The Command</h2>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws emr create-cluster \
--release-label emr-5.14.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.xlarge \
--use-default-roles \
--ec2-attributes SubnetIds=subnet-YOUR_SUBNET,KeyName=YOUR_KEY \
--applications Name=JupyterHub Name=Spark Name=Hadoop \
--name="ThisIsMyCluster" \
--log-uri s3://YOUR_BUCKET \
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://REGION.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://YOUR_BUCKET/YOUR_SHELL_SCRIPT.sh"]
</code></pre></div></div>
<h2 id="the-breakdown">The Breakdown</h2>
<p>That’s a long command so let’s break it down to see what’s happening:</p>
<ol>
<li><code class="highlighter-rouge">aws emr create-cluster</code> - simply creates a cluster</li>
<li><code class="highlighter-rouge">--release-label emr-5.14.0</code> - build a cluster with EMR version 5.14.0</li>
<li><code class="highlighter-rouge">--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.xlarge</code> - build 1 Master node of type m4.xlarge and 2 Core nodes also of type m4.xlarge</li>
<li><code class="highlighter-rouge">--use-default-roles</code> - use the default service role (EMR_DefaultRole) and instance profile (EMR_EC2_DefaultRole) for permissions to access other AWS services</li>
<li><code class="highlighter-rouge">--ec2-attributes SubnetIds=subnet-YOUR_SUBNET,KeyName=YOUR_KEY</code> - configures cluster and Amazon EC2 instance configurations (you should provide a specific subnet and key here)</li>
<li><code class="highlighter-rouge">--applications Name=JupyterHub Name=Spark Name=Hadoop</code> - install JupyterHub, Spark, and Hadoop on this cluster</li>
<li><code class="highlighter-rouge">--name="ThisIsMyCluster"</code> - name the cluster <strong>ThisIsMyCluster</strong></li>
<li><code class="highlighter-rouge">--log-uri s3://YOUR_BUCKET</code> - specify the S3 bucket where you want to store log files</li>
<li><code class="highlighter-rouge">--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://REGION.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://YOUR_BUCKET/YOUR_SHELL_SCRIPT.sh"]</code> - allows you to make additional configurations, like adding users to JupyterHub, when building the cluster (this is completely optional)</li>
</ol>
<h2 id="wrap-up">Wrap Up</h2>
<p>There you have it, an easy way to spin up a cluster. A few simple configuration tweaks to the command above and you’ll be off and crunching data on a cluster in no time!</p>David Ziganto[email protected]Introduction to Time Series2018-05-25T00:00:00+00:002018-05-25T00:00:00+00:00https://dziganto.github.io/python/time%20series/Introduction-to-Time-Series<p><img src="/assets/images/time_series_title.png?raw=true" alt="Time Series" class="center-image" /></p>
<h1 id="introduction">Introduction</h1>
<p>Dealing with data that is sequential in nature requires special techniques. Unlike traditional Ordinary Least Squares regression or Decision Trees, where observations are assumed independent, time series data exhibits correlation between successive samples. In other words, order very much matters. Think stock prices or daily temperatures. Identifying time series data and knowing what to do next is a valuable skill for any modeler.</p>
<p>The first step on our journey is to identify the three components of time series data:</p>
<ol>
<li>Trend</li>
<li>Seasonality</li>
<li>Residuals</li>
</ol>
<p>Trend, as its name suggests, is the overall direction of the data. Seasonality is a periodic component. And the residual is what’s left over when the trend and seasonality have been removed. Residuals are random fluctuations. You can think of them as a noise component.</p>
<p>Let’s look at a few plots to make sure we understand trend, seasonality, and residuals.</p>
<h3 id="time-series-data">Time Series Data</h3>
<p><img src="/assets/images/timeseries.png?raw=true" alt="TS Data" class="center-image" /></p>
<h3 id="trend">Trend</h3>
<p><img src="/assets/images/trend_component.png?raw=true" alt="Trend" class="center-image" /></p>
<h3 id="seasonality">Seasonality</h3>
<p><img src="/assets/images/seasonal_component.png?raw=true" alt="Seasonality" class="center-image" /></p>
<h3 id="residuals">Residuals</h3>
<p><img src="/assets/images/residuals.png?raw=true" alt="Residuals" class="center-image" /></p>
<p>Now that you have the big picture, let’s look at the nuts and bolts. I’ll show you how I created the data above, how to create derivatives of the plots shown above, and how to decompose a time series model in Python.</p>
<h1 id="create-time-series-data">Create Time Series Data</h1>
<p>Time series data is data that is measured at equally-spaced intervals. Think of a sensor that takes measurements every minute.</p>
<blockquote>
<p>A sensor that takes measurements at random times does not produce time series data.</p>
</blockquote>
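<p>A quick sanity check for equal spacing: compute the gaps between consecutive timestamps and verify they’re all identical. A minimal sketch (the two timestamp arrays below are made up for illustration):</p>

```python
import numpy as np

# Hypothetical timestamps, in minutes, from two sensors
regular = np.array([0, 1, 2, 3, 4, 5])     # fixed one-minute gaps
irregular = np.array([0, 1, 4, 5, 9, 12])  # random gaps

def equally_spaced(t):
    """True if every gap between consecutive measurements is identical."""
    gaps = np.diff(t)
    return bool(np.all(gaps == gaps[0]))

print(equally_spaced(regular))    # True
print(equally_spaced(irregular))  # False
```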
<h3 id="trend-1">Trend</h3>
<p>The first step is to create a time interval with equal spacing.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
time = np.arange(50)
</code></pre></div></div>
<p>Great. Now to construct the trend.</p>
<p>Sticking with the sensor example, suppose the sensor is oriented towards an oscillating fan that alternates right and left. The trend component captures the wind speed as someone adjusts the fan speed. Increased fan speed translates to increased sensor measurements.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>trend = np.empty_like(time, dtype='float')
for t in time:
    if t < 10:
        trend[t] = t * 2.25
    elif t < 30:
        trend[t] = t * -0.5 + 25
    else:
        trend[t] = t * 1.25 - 28
</code></pre></div></div>
<p>Better plot it.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import matplotlib.pyplot as plt
plt.plot(time, trend, 'b.')
plt.title("Trend vs Time")
plt.xlabel("minutes")
plt.ylabel("sensor measurement")
</code></pre></div></div>
<p>Here’s the result:</p>
<p><img src="/assets/images/trend.png?raw=true" alt="Trend Plot" class="center-image" /></p>
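<p>As an aside, the same three-segment trend can be built without an explicit loop. This is an equivalent vectorized sketch using <code class="highlighter-rouge">np.piecewise</code>, which applies a different function to each condition’s region:</p>

```python
import numpy as np

time = np.arange(50)

# Same three linear segments as the loop version: the conditions
# select each region, the lambdas compute the trend within it
trend = np.piecewise(
    time.astype(float),
    [time < 10, (time >= 10) & (time < 30), time >= 30],
    [lambda t: t * 2.25,
     lambda t: t * -0.5 + 25,
     lambda t: t * 1.25 - 28],
)
```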
<h3 id="seasonality-1">Seasonality</h3>
<p>The next step is to create a periodic element. In the wind speed sensor analogy, this is the rise and fall in measured wind speed as the fan sweeps left to right and back again.</p>
<p>Here’s an example of how we can create that:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seasonal = 10 + np.sin(time) * 10
</code></pre></div></div>
<p>Notice how both trend and seasonality are a function of time but independent of one another.</p>
<p>Also, here’s a plot of the seasonality component:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt.plot(time, seasonal, 'g-.')
plt.title("Seasonality vs Time")
plt.xlabel("minutes")
plt.ylabel("sensor measurement")
</code></pre></div></div>
<p><img src="/assets/images/seasonality.png?raw=true" alt="Trend Plot" class="center-image" /></p>
<h3 id="residual">Residual</h3>
<p>The last component is the residual. This is a noise component, as mentioned earlier. We can fabricate that like so:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>np.random.seed(10) ## reproducible results
residual = np.random.normal(loc=0.0, scale=1, size=len(time))
</code></pre></div></div>
<p>And the plot:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt.plot(time, residual, 'r-.')
plt.title("Residuals vs Time")
plt.xlabel("minutes")
plt.ylabel("sensor measurement")
</code></pre></div></div>
<p><img src="/assets/images/residuals.png?raw=true" alt="Residual Plot" class="center-image" /></p>
<h1 id="aggregating-components">Aggregating Components</h1>
<p>Now comes time to aggregate the three components: trend, seasonality, and residuals. This will give us the time series data we’re looking for.</p>
<p>As it turns out, there are two major ways to aggregate (or decompose, as we’ll see later) time series data.</p>
<h3 id="additive">Additive</h3>
<p>The first way is simply a sum of the three components.</p>
<p><img src="/assets/images/additive_formula.png?raw=true" alt="LaTeX image 1" class="center-image" /></p>
<p>That’s as easy as <code class="highlighter-rouge">additive = trend + seasonal + residual</code>.</p>
<p>The corresponding plot is:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt.plot(time, additive, 'k-.')
plt.title("Additive Time Series")
plt.xlabel("minutes")
plt.ylabel("sensor measurement");
</code></pre></div></div>
<p><img src="/assets/images/additive.png?raw=true" alt="Additive Plot" class="center-image" /></p>
<h3 id="multiplicative">Multiplicative</h3>
<p>The second way is to multiply the three components together.</p>
<p><img src="/assets/images/multiplicative_formula.png?raw=true" alt="LaTeX image 2" class="center-image" /></p>
<p>We can stitch that together with:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># ignore residual to make pattern obvious
ignored_residual = np.ones_like(residual)
multiplicative = trend * seasonal * ignored_residual
</code></pre></div></div>
<p>The corresponding plot is:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt.plot(time, multiplicative, 'k-.')
plt.title("Multiplicative Time Series")
plt.xlabel("minutes")
plt.ylabel("sensor measurement")
</code></pre></div></div>
<p><img src="/assets/images/multiplicative.png?raw=true_" alt="Multiplicative Plot" class="center-image" /></p>
<h1 id="additive-vs-multiplicative">Additive vs Multiplicative?</h1>
<p>The primary question likely bouncing around your head is: how can you tell whether a time series is additive or multiplicative? Simply plotting the original time series data, called a <a href="https://en.wikipedia.org/wiki/Run_chart">run-sequence plot</a>, is one way to do so. If the seasonal and residual components are independent of the trend, you have an additive series. If the seasonal and residual components are in fact dependent, meaning they fluctuate with the trend, you have a multiplicative series. Look at the additive and multiplicative plots above. You’ll notice a big difference in the amplitudes of the peaks and troughs. Specifically, the amplitude of the seasonal component of the multiplicative time series changes with the trend.</p>
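<p>A handy consequence of this distinction: taking the logarithm of a multiplicative series turns it into an additive one, since log(T * S * R) = log(T) + log(S) + log(R). Here is a minimal numerical check, assuming strictly positive components (the trend and seasonality below are made up for illustration):</p>

```python
import numpy as np

time = np.arange(50)
trend = time + 10.0               # strictly positive trend
seasonal = 10 + np.sin(time) * 5  # strictly positive seasonality

multiplicative = trend * seasonal

# On the log scale the product becomes a sum, i.e. an additive series
log_additive = np.log(trend) + np.log(seasonal)
print(np.allclose(np.log(multiplicative), log_additive))  # True
```

<p>This is why a log transform is a common trick for modeling a multiplicative series with additive tools.</p>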
<h1 id="time-series-decomposition-with-python">Time Series Decomposition with Python</h1>
<p>You’ll likely never know how real-world data was generated. However, I’m about to show you a powerful tool that will allow you to decompose a time series into its components. Let’s see how simple it is.</p>
<h3 id="additive-decomposition">Additive Decomposition</h3>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from statsmodels.tsa.seasonal import seasonal_decompose
ss_decomposition = seasonal_decompose(x=additive,
                                      model='additive',
                                      freq=6)
estimated_trend = ss_decomposition.trend
estimated_seasonal = ss_decomposition.seasonal
estimated_residual = ss_decomposition.resid
</code></pre></div></div>
<p>Note that you must provide the frequency. We can see from the additive and multiplicative plots that the frequency is about 6. There are more sophisticated ways to determine this number empirically, but that’s for another tutorial. Let’s keep things simple for now.</p>
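<p>One simple empirical approach, sketched here, is to look for the strongest peak in the series’ autocorrelation; the lag of that peak estimates the seasonal period. This uses the seasonal component defined earlier, whose true period is 2*pi, roughly 6.3 samples:</p>

```python
import numpy as np

time = np.arange(50)
seasonal = 10 + np.sin(time) * 10  # period of sin is 2*pi, roughly 6.3 samples

# Demean, then autocorrelate; lag 0 ends up at index 0
x = seasonal - seasonal.mean()
acf = np.correlate(x, x, mode='full')[len(x) - 1:]
acf = acf / acf[0]

# The strongest peak among lags 2..24 estimates the period
lags = np.arange(2, 25)
period = lags[np.argmax(acf[2:25])]
print(period)  # ~6
```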
<p>Now that we have the pieces let’s put them all together.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fig, axes = plt.subplots(4, 1, sharex=True, sharey=False)
fig.set_figheight(10)
fig.set_figwidth(15)
axes[0].plot(additive, 'k', label='Original')
axes[0].legend(loc='upper left');
axes[1].plot(estimated_trend, label='Trend')
axes[1].legend(loc='upper left');
axes[2].plot(estimated_seasonal, 'g', label='Seasonality')
axes[2].legend(loc='upper left');
axes[3].plot(estimated_residual, 'r', label='Residuals')
axes[3].legend(loc='upper left')
</code></pre></div></div>
<p><img src="/assets/images/additive_all.png?raw=true_" alt="All Additive Plots" class="center-image" /></p>
<h3 id="multiplicative-decomposition">Multiplicative Decomposition</h3>
<p>Multiplicative decomposition follows the exact same pattern. The only major change is setting <code class="highlighter-rouge">model</code> to <code class="highlighter-rouge">'multiplicative'</code>.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ss_decomposition = seasonal_decompose(x=multiplicative,
                                      model='multiplicative',
                                      freq=6)
estimated_trend = ss_decomposition.trend
estimated_seasonal = ss_decomposition.seasonal
estimated_residual = ss_decomposition.resid
</code></pre></div></div>
<p>Some more matplotlib code:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fig, axes = plt.subplots(4, 1, sharex=True, sharey=False)
fig.set_figheight(10)
fig.set_figwidth(15)
axes[0].plot(multiplicative, label='Original')
axes[0].legend(loc='upper left')
axes[1].plot(estimated_trend, label='Trend')
axes[1].legend(loc='upper left')
axes[2].plot(estimated_seasonal, label='Seasonality')
axes[2].legend(loc='upper left')
axes[3].plot(estimated_residual, label='Residuals')
axes[3].legend(loc='upper left')
</code></pre></div></div>
<p>Voilà! We have a multiplicative decomposition.</p>
<p><img src="/assets/images/multiplicative_all.png?raw=true_" alt="All Multiplicative Plots" /></p>
<hr />
<h1 id="summary">Summary</h1>
<p>In this tutorial you learned:</p>
<ol>
<li>Time series data is composed of three components: trend, seasonality, residual</li>
<li>Time series can be additive or multiplicative</li>
<li>How to decompose a time series model with Python</li>
</ol>David Ziganto[email protected]From Python to Scala - Variables2018-05-21T00:00:00+00:002018-05-21T00:00:00+00:00https://dziganto.github.io/python/scala/From-Python-to-Scala-Variables<p><img src="/assets/images/scala_logo.png?raw=true" alt="Scala" class="center-image" /></p>
<h2 id="introduction">Introduction</h2>
<p>Python is a beautiful, high-level programming language. I’ve solved innumerable problems with it over the years, so I have a particular fondness for its abilities. However, no tool is perfect for everything. Each has its strengths and each has its weaknesses. Part of Python’s power comes from its object-oriented construction. With it, you can do some pretty amazing things. However, functional programming has proven itself a powerful tool for massive scale systems. Therefore, it is time to move beyond Python to the wonderful world of Scala.</p>
<p>Scala is short for <strong>Scalable Language</strong>. It is a hybrid language that melds object-oriented structures and functional programming. Basically, it gives you the best of both worlds. Therefore, what follows is a series that will take you on a journey from Python to Scala. I hope you find it helpful!</p>
<h2 id="lesson-1-variables">Lesson 1: Variables</h2>
<p>Our first lesson is variables. In Python, saving a value to a variable is dead simple. It looks like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>myString = "this is a string"
myInt = 42
myFloat = 4.2
</code></pre></div></div>
<p>Python automatically infers the type of each variable. For example, the variable <code class="highlighter-rouge">myString</code> is saved as a string object. Python knows it’s a string because of the quotes around the text <em>this is a string</em>. You could just as easily have saved <code class="highlighter-rouge">"42"</code> or even <code class="highlighter-rouge">'42'</code>. That too would have been saved as a string object. The advantage is obvious: it takes no effort (and no thought) on the part of the user to save variables. The result is clean, easy to read code.</p>
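<p>You can confirm Python’s type inference yourself with <code class="highlighter-rouge">type</code>:</p>

```python
myString = "this is a string"
myInt = 42
myFloat = 4.2

print(type(myString))  # <class 'str'>
print(type(myInt))     # <class 'int'>
print(type(myFloat))   # <class 'float'>
print(type("42"))      # quotes make it a string: <class 'str'>
```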
<p>With Scala, you can do the same with only a minor change. Let’s take a look:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>var myString = "this is a string"
var myInt = 42
var myFloat = 4.2
</code></pre></div></div>
<p>Notice the <code class="highlighter-rouge">var</code> in front of the variables here. That’s important. Scala can infer data types just as Python does, but the keyword gives Scala additional information. It turns out you must provide it or else an error is thrown. Try running this bit of code:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>myString2 = "this is a string"
</code></pre></div></div>
<p>See what I mean?</p>
<p>Should you feel the need to be explicit, Scala has your back:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>var myString: String = "this is a string"
var myInt: Int = 42
var myFloat: Double = 4.2
</code></pre></div></div>
<p>Now if I want to change <code class="highlighter-rouge">myString</code> to <code class="highlighter-rouge">"string string string"</code>, <code class="highlighter-rouge">myInt</code> to <code class="highlighter-rouge">99</code>, and <code class="highlighter-rouge">myFloat</code> to <code class="highlighter-rouge">3.14</code>, it’s as simple as:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>myString = "string string string"
myInt = 99
myFloat = 3.14
</code></pre></div></div>
<p>This is all basic stuff. There’s almost no difference from Python. But wait, there’s more. Scala gives you an alternative way to reference objects. Check this out:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val myStaticString = "you cannot reassign myStaticString"
val myStaticInt: Int = 12345
val myStaticFloat: Double = 2.71828
</code></pre></div></div>
<p>Ok, what’s the difference between <code class="highlighter-rouge">var</code> and <code class="highlighter-rouge">val</code>? Try to reassign <code class="highlighter-rouge">myStaticString</code>, <code class="highlighter-rouge">myStaticInt</code>, or <code class="highlighter-rouge">myStaticFloat</code>.</p>
<p>Run these commands in the interpreter:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>myStaticString = "try to reassign me, I dare you"
myStaticInt: Int = 1010101011
myStaticFloat: Double = 1.2121210
</code></pre></div></div>
<p>Didn’t work, did it? Therein lies the difference. <code class="highlighter-rouge">var</code> lets you reassign while <code class="highlighter-rouge">val</code> does not. <code class="highlighter-rouge">val</code> is a great way to guard against unwanted side effects in your code: if a reference object should never change, you get a guarantee that it won’t. How awesome is that?!</p>
<p>A quick side tangent. You can assign a new reference object if you include <code class="highlighter-rouge">val</code> at the beginning like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val myStaticString = "try to reassign me, I dare you"
val myStaticInt: Int = 1010101011
val myStaticFloat: Double = 1.2121210
</code></pre></div></div>
<p>So be careful. If you’re clumsy with your code, Scala can’t save you.</p>
<h1 id="summary">Summary</h1>
<p>What did we learn today? We learned Python is beautifully simple while Scala is simply beautiful. And we took our first baby step into Scala by leveraging our knowledge of Python. Scala has the same ability to infer object types when saving variables just like Python. The key difference is that Scala requires a keyword, either <code class="highlighter-rouge">var</code> or <code class="highlighter-rouge">val</code>. We learned the difference between <code class="highlighter-rouge">var</code> and <code class="highlighter-rouge">val</code> is that the former can be reassigned whereas the latter cannot. We also learned that if you write sloppy code, well, then that’s on you because no programming language is going to save your ass.</p>David Ziganto[email protected]From Zero to Spark Cluster in Under 10 Minutes2018-04-25T00:00:00+00:002018-04-25T00:00:00+00:00https://dziganto.github.io/amazon%20emr/apache%20spark/apache%20zeppelin/big%20data/From-Zero-to-Spark-Cluster-in-Under-Ten-Minutes<p><img src="/assets/images/Amazon_EMR_main.png?raw=true" alt="image" class="center-image" /></p>
<h2 id="objective">Objective</h2>
<p>In this no-frills post, you’ll learn how to set up a big data cluster on Amazon EMR in less than ten minutes.</p>
<h2 id="prerequisites">Prerequisites</h2>
<ol>
<li>You have an AWS account.</li>
<li>You have set up a <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/get-set-up-for-amazon-ec2.html#create-a-key-pair">Key Pair</a>.</li>
<li>You have <code class="highlighter-rouge">Chrome</code> or <code class="highlighter-rouge">Firefox</code>.</li>
<li>You have basic familiarity with the command line.</li>
<li>You have basic familiarity with Python. (Optional)</li>
</ol>
<h2 id="1---foxy-proxy-setup-optional-only-for-zeppelin">1 - Foxy Proxy Setup (Optional: only for Zeppelin)</h2>
<ol>
<li>In <code class="highlighter-rouge">Chrome</code> or <code class="highlighter-rouge">Firefox</code>, add the <strong>FoxyProxy</strong> extension.</li>
<li>Restart browser after installing FoxyProxy.</li>
<li>Open your favorite text editor and save <a href="https://github.com/dziganto/dziganto.github.io/blob/master/_scripts/foxyproxy-settings.xml">this code</a> as <strong>foxyproxy-settings.xml</strong>. Keep track of where you save it.</li>
<li>In your browser, click on the <code class="highlighter-rouge">FoxyProxy icon</code> located at top right.</li>
<li>Scroll down and click <code class="highlighter-rouge">Options</code>.</li>
<li>Click <code class="highlighter-rouge">Import/Export</code> on left-hand side.</li>
<li>Click <code class="highlighter-rouge">Choose File</code>.</li>
<li>Select <code class="highlighter-rouge">foxyproxy-settings.xml</code>.</li>
<li>Click <code class="highlighter-rouge">Open</code>.</li>
<li>Congratulations, FoxyProxy is now set up.</li>
</ol>
<h2 id="2---emr-cluster-setup">2 - EMR Cluster Setup</h2>
<ol>
<li>Log in to <a href="https://aws.amazon.com/">AWS</a>.</li>
<li>Navigate to <code class="highlighter-rouge">EMR</code> located under <strong>Analytics</strong>.<br />
<img src="/assets/images/EMR.png?raw=true" alt="EMR" class="center-image" /></li>
<li>Click the <code class="highlighter-rouge">Create cluster</code> button.
<img src="/assets/images/EMR_create_cluster.png?raw=true" alt="Create EMR Cluster" class="center-image" /></li>
<li>You are now in <strong>Step 1: Software and Steps</strong>. Click <code class="highlighter-rouge">Go to advanced options</code>. Here you can name your cluster and select whichever S3 bucket you want to connect to.<br />
<img src="/assets/images/EMR_advanced_options.png?raw=true" alt="EMR Advanced Options" class="center-image" /></li>
<li>Click the big data tools you require. I’ll select <code class="highlighter-rouge">Spark</code> and <code class="highlighter-rouge">Zeppelin</code> for this tutorial.<br />
<img src="/assets/images/EMR_select_software.png?raw=true" alt="EMR Software" class="center-image" /></li>
<li>Click <code class="highlighter-rouge">Next</code> at bottom right of screen.</li>
<li>In <strong>Step 2: Hardware</strong>, select the instance types, instance counts, on-demand or spot pricing, and auto-scaling options.</li>
<li>For this tutorial we’ll simply change the instance type to <code class="highlighter-rouge">m4.xlarge</code> and Core to 1 instance. Everything else will remain as default. See the following picture for details.<br />
<img src="/assets/images/EMR_instance_types.png?raw=true" alt="EMR Software" class="center-image" /></li>
<li>Click <code class="highlighter-rouge">Next</code> at bottom right of screen.</li>
<li>The next page is <strong>Step 3: General Cluster Settings</strong>. Here you have the chance to rename your cluster, select an S3 bucket, and add a bootstrap script, among other options.</li>
<li>Click <code class="highlighter-rouge">Next</code> at bottom right of screen.</li>
<li>The next page is <strong>Step 4: Security</strong>. It is imperative that you select a predefined key pair. (Do NOT proceed without a key!)</li>
<li>Click <code class="highlighter-rouge">Create cluster</code> at bottom right of screen. A new screen pops up that looks like this: <br />
<img src="/assets/images/EMR_cluster_creation.png?raw=true" alt="EMR Cluster Creation" class="center-image" /></li>
<li>Your cluster is finished building when you see a status of <strong>Waiting</strong> in green. (Be patient as this will take 5+ minutes depending on which big data software you installed. It’s not unusual for the build process to take 10-15 minutes or more.) Here’s what a complete build looks like:<br />
<img src="/assets/images/EMR_cluster_running.png?raw=true" alt="EMR Cluster Running" class="center-image" /></li>
<li>Congratulations, you have a cluster running Spark!</li>
</ol>
<h2 id="3---update-myip-optional">3 - Update MyIP (Optional)</h2>
<p>I like to set a location-specific IP for each cluster I build. This is completely optional. However, should you choose to do this, you’ll have to update your IP manually or by security group. Here’s how to do that manually:</p>
<ol>
<li>Still in the EMR dashboard, locate <code class="highlighter-rouge">Security groups for Master:</code>. Click it.</li>
<li>On next page select <strong>Master group</strong>.</li>
<li>Towards the bottom of the page select <code class="highlighter-rouge">Inbound</code> tab.</li>
<li>Then click <code class="highlighter-rouge">Edit</code>.</li>
<li>Select <code class="highlighter-rouge">MyIP</code> for SSH type.</li>
<li>Click <code class="highlighter-rouge">Save</code>.</li>
</ol>
<h2 id="4---ssh-into-your-cluster">4 - SSH Into Your Cluster</h2>
<ol>
<li>Navigate to EMR dashboard.</li>
<li>Click <code class="highlighter-rouge">SSH</code> button.<br />
<img src="/assets/images/EMR_SSH.png?raw=true" alt="SSH" class="center-image" /></li>
<li>Copy the command in the code block. Be sure to update the path to your key if it’s not located in your Home.</li>
<li>Open Terminal and paste command.</li>
<li>A prompt will ask if you want to continue connecting. Type <code class="highlighter-rouge">yes</code>.</li>
<li>A large EMR logo will pop up in your Terminal window if you followed all the steps.</li>
<li>Congratulations, you have set up your first EMR cluster and can access it remotely.</li>
</ol>
<h2 id="5---install-miniconda-on-master-optional">5 - Install Miniconda on Master (Optional)</h2>
<p>Let’s install Python and conda on this Master node now that we’re logged in. Copy and paste the following commands to install and configure Miniconda.</p>
<ol>
<li><code class="highlighter-rouge">wget https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O ~/anaconda.sh</code></li>
<li><code class="highlighter-rouge">bash ~/anaconda.sh -b -p $HOME/anaconda</code></li>
<li><code class="highlighter-rouge">echo -e '\nexport PATH=$HOME/anaconda/bin:$PATH' >> $HOME/.bashrc && source $HOME/.bashrc</code></li>
<li>This process is successful if when you type <code class="highlighter-rouge">which python</code> you get <strong>~/anaconda/bin/python</strong>.</li>
<li>You can now install any python package you want with <code class="highlighter-rouge">conda install package_name</code>.</li>
<li>Congratulations, you now have Python and conda on your Master node.
<blockquote>
<p>Note that miniconda is not installed on the Core node.</p>
<p>You can do that separately or consider creating a bootstrap script that will automatically take care of this for you upon build.</p>
</blockquote>
</li>
</ol>
<h2 id="6---access-zeppelin-remotely-optional">6 - Access Zeppelin Remotely (Optional)</h2>
<ol>
<li>Open your browser that has FoxyProxy installed.</li>
<li>Click <code class="highlighter-rouge">FoxyProxy icon</code>.</li>
<li>Click <code class="highlighter-rouge">Use proxies based on their pre-defined patterns and priorities</code>.</li>
<li>On EMR dashboard, click <code class="highlighter-rouge">Enable web connection</code>.</li>
<li>Copy the command in the code block.</li>
<li>Open new Terminal tab.</li>
<li>Paste the command, which opens the connection and forwards the port.
<blockquote>
<p>Note: it will look like it’s not working, but it is, so leave it alone!</p>
</blockquote>
</li>
<li>On EMR dashboard, the <code class="highlighter-rouge">Zeppelin</code> button should now be blue. Click on it.</li>
<li>You are successful if Zeppelin opens in a new tab in your browser.</li>
<li>Congratulations, you can access your EMR cluster through Zeppelin!</li>
</ol>
<h2 id="7---update-zeppelin-for-anaconda-optional">7 - Update Zeppelin for Anaconda (Optional)</h2>
<p>We have to update the Python path in Zeppelin to leverage the new version we installed in step 5.</p>
<ol>
<li>At the top right of Zeppelin, click <code class="highlighter-rouge">anonymous</code>.</li>
<li>In drop down, select <code class="highlighter-rouge">Interpreter</code>.</li>
<li>Search for <strong>python</strong>.</li>
<li>Click <code class="highlighter-rouge">Edit</code>.</li>
<li>Change <strong>zeppelin.python</strong> from <code class="highlighter-rouge">python</code> to <code class="highlighter-rouge">/home/hadoop/anaconda/bin/python</code></li>
<li>Click <code class="highlighter-rouge">Save</code> on bottom left.</li>
<li>Select dropdown for Interpreters again.</li>
<li>Search for spark.</li>
<li>Click <code class="highlighter-rouge">Edit</code>.</li>
<li>Change <strong>zeppelin.pyspark.python</strong> from <code class="highlighter-rouge">python</code> to <code class="highlighter-rouge">/home/hadoop/anaconda/bin/python</code></li>
<li>Click <code class="highlighter-rouge">Save</code> on bottom left.</li>
<li>Navigate back to Zeppelin Home by clicking <code class="highlighter-rouge">Zeppelin</code> top left.</li>
<li>Congratulations, you have all the tools you need to run PySpark on a Spark cluster!</li>
</ol>
<h2 id="8---best-part">8 - Best Part</h2>
<p>Admittedly, while that’s not a complicated process, it is time-consuming. The good news is that you never have to configure FoxyProxy again AND there are neat little tricks you can add to make the build process much easier. For example, you can add a bootstrap script that installs and configures miniconda on all nodes during the build process itself.</p>
<p>Furthermore, if you want to spin up another cluster that is similar or identical to the one we just built, all you have to do is:</p>
<ol>
<li>Navigate to the EMR dashboard.</li>
<li>Select the cluster you want to mimic.</li>
<li>Select <code class="highlighter-rouge">Clone</code>.</li>
</ol>
<p>You can start building another cluster in seconds!</p>
<hr />
<h1 id="reminder-dont-forget-to-terminate-your-cluster-when-youre-done">Reminder: Don’t forget to terminate your cluster when you’re done.</h1>
<hr />
<h1 id="data-science-book-recommendations">Data Science Book Recommendations</h1>
<p>Published 2018-03-29: <a href="https://dziganto.github.io/data%20science/machine%20learning/Data-Science-Book-Recommendations">https://dziganto.github.io/data%20science/machine%20learning/Data-Science-Book-Recommendations</a></p>
<p><img src="/assets/images/ml_books.png?raw=true" alt="image" class="center-image" /></p>
<h2 id="data-cleaning">Data Cleaning</h2>
<p><a href="https://amzn.to/2IeCYjy">Best Practices in Data Cleaning</a></p>
<h2 id="deep-learning">Deep Learning</h2>
<p><a href="https://bit.ly/2pPpLpE">Deep Learning with Python</a></p>
<p><a href="https://bit.ly/2E4T7Wf">Supervised Sequence Labelling with Recurrent Neural Networks</a></p>
<h2 id="ethicsprivacy">Ethics/Privacy</h2>
<p><a href="https://amzn.to/2GU0Eu3">Sharing Big Data Safely: Managing Data Security</a></p>
<h2 id="general-business">General Business</h2>
<p><a href="https://amzn.to/2GjyVpx">Certain to Win</a></p>
<p><a href="https://amzn.to/2GWoLrZ">The Mind Of The Strategist: The Art of Japanese Business</a></p>
<p><a href="https://amzn.to/2GnmZz2">Toyota Kata: Managing People for Improvement, Adaptiveness and Superior Results</a></p>
<h2 id="linear-algebra">Linear Algebra</h2>
<p><a href="https://amzn.to/2pR8MU2">Linear Algebra Done Right</a></p>
<h2 id="machine-learning">Machine Learning</h2>
<p><a href="https://amzn.to/2GjnA94">Applied Predictive Modeling</a></p>
<p><a href="https://amzn.to/2uu2srd">Applied Survival Analysis: Regression Modeling of Time-to-Event Data</a></p>
<p><a href="https://amzn.to/2uv1N8Z">Bayesian Data Analysis</a></p>
<p><a href="https://amzn.to/2pRbQQP">Bayesian Reasoning and Machine Learning</a></p>
<p><a href="https://amzn.to/2GVszd6">Data Analysis Using Regression and Multilevel/Hierarchical Models</a></p>
<p><a href="https://amzn.to/2GjsmiQ">Data Science at the Command Line</a></p>
<p><a href="https://amzn.to/2J5biPk">Doing Data Science: Straight Talk from the Frontline</a></p>
<p><a href="https://amzn.to/2J1ddEu">Elements of Information Theory</a></p>
<p><a href="https://amzn.to/2GYSCjG">Evaluating Learning Algorithms: A Classification Perspective</a></p>
<p><a href="https://amzn.to/2uB5eL0">Gaussian Processes for Machine Learning</a></p>
<p><a href="https://amzn.to/2E4PK1D">Hands-On Machine Learning with Scikit-Learn and TensorFlow</a></p>
<p><a href="https://amzn.to/2GEgMlC">Information Theory, Inference and Learning Algorithms</a></p>
<p><a href="https://amzn.to/2pR8T2w">Learning From Data</a></p>
<p><a href="https://amzn.to/2GjRv0C">Machine Learning: A Probabilistic Perspective</a></p>
<p><a href="https://amzn.to/2J3RV9h">Machine Learning: An Algorithmic Perspective</a></p>
<p><a href="https://amzn.to/2GUDG67">Machine Learning Refined</a></p>
<p><a href="https://amzn.to/2GAQKzT">Machine Learning: The Art and Science of Algorithms that Make Sense of Data</a></p>
<p><a href="https://amzn.to/2GoqC7X">Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis</a></p>
<p><a href="https://amzn.to/2uABVsm">Time Series Analysis and Its Applications</a></p>
<p><a href="https://amzn.to/2IeISB2">Understanding Machine Learning: From Theory to Algorithms</a></p>
<h2 id="non-technical">Non-technical</h2>
<p><a href="https://amzn.to/2GIZ14y">Analytics: How to Win with Intelligence</a></p>
<p><a href="https://oreil.ly/JXlOIo">Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking</a></p>
<p><a href="https://amzn.to/2Ghu5Jv">How to Lie with Statistics</a></p>
<p><a href="https://amzn.to/2GmhWCQ">Mastering Data Mining: The Art and Science of Customer Relationship Management</a></p>
<p><a href="https://amzn.to/2pNBjL8">Naked Statistics: Stripping the Dread from the Data</a></p>
<p><a href="https://amzn.to/2Gjj9XH">Spin Selling</a></p>
<h2 id="other">Other</h2>
<p><a href="https://amzn.to/2GmXP3o">Against the Gods: The Remarkable Story of Risk</a></p>
<p><a href="https://amzn.to/2J7fL46">Gödel, Escher, Bach: An Eternal Golden Braid</a></p>
<p><a href="https://bit.ly/2pTnDgu">The Machine Stops</a></p>
<h2 id="pedagogy">Pedagogy</h2>
<p><a href="https://amzn.to/2GUEXtM">Teaching and Learning STEM: A Practical Guide</a></p>
<p><a href="https://amzn.to/2GmsiyF">Understanding By Design</a></p>
<h2 id="programming">Programming</h2>
<p><a href="https://amzn.to/2pPG1bf">The Pragmatic Programmer: From Journeyman to Master</a></p>
<h2 id="statistics">Statistics</h2>
<p><a href="https://amzn.to/2GUvq5Y">A Course in Large Sample Theory</a></p>
<p><a href="https://amzn.to/2uxET0A">All of Statistics: A Concise Course in Statistical Inference</a></p>
<p><a href="https://amzn.to/2GmNZ1s">An Introduction to Statistical Methods and Data Analysis</a></p>
<p><a href="https://amzn.to/2pQqyXA">Applied Longitudinal Analysis</a></p>
<p><a href="https://amzn.to/2E2Z06k">Categorical Data Analysis</a></p>
<p><a href="https://amzn.to/2pQzkon">Design and Analysis: A Researcher’s Handbook</a></p>
<p><a href="https://amzn.to/2Ih0Y5F">Handbook of Parametric and Nonparametric Statistical Procedures</a></p>
<p><a href="https://amzn.to/2E2YIfK">Multivariate Analysis</a></p>
<p><a href="https://bit.ly/1FNQSUQ">OpenIntro Statistics</a></p>
<p><a href="https://amzn.to/2GjSvOh">Statistics for Experimenters: Design, Innovation, and Discovery</a></p>
<h2 id="visualization">Visualization</h2>
<p><a href="https://amzn.to/2GUllWE">Good Charts</a></p>
<p><a href="https://amzn.to/2GjGlZM">The Functional Art: An introduction to information graphics and visualization (Voices That Matter)</a></p>
<p><a href="https://amzn.to/2E5uDfj">Visualize This: The FlowingData Guide to Design, Visualization, and Statistics</a></p>
<hr />
<h1 id="yet-another-data-science-article">Yet Another Data Science Article</h1>
<p>Published 2018-03-14: <a href="https://dziganto.github.io/data%20science/satire/Yet-Another-Data-Science-Article">https://dziganto.github.io/data%20science/satire/Yet-Another-Data-Science-Article</a></p>
<h1 id="problem">Problem</h1>
<p><img src="/assets/images/magic_algorithm.png?raw=true" alt="image" class="center-image" /></p>
<h1 id="solution">Solution</h1>
<p><img src="/assets/images/ds_solution.png?raw=true" alt="image" class="center-image" /></p>
<hr />
<h1 id="understanding-object-oriented-programming-through-machine-learning">Understanding Object-Oriented Programming Through Machine Learning</h1>
<p>Published 2018-01-28: <a href="https://dziganto.github.io/classes/data%20science/linear%20regression/machine%20learning/object-oriented%20programming/python/Understanding-Object-Oriented-Programming-Through-Machine-Learning">https://dziganto.github.io/classes/data%20science/linear%20regression/machine%20learning/object-oriented%20programming/python/Understanding-Object-Oriented-Programming-Through-Machine-Learning</a></p>
<p><img src="/assets/images/classes.png?raw=true" alt="image" class="center-image" /></p>
<h2 id="introduction">Introduction</h2>
<p>Object-Oriented Programming (OOP) is not easy to wrap your head around. You can read tutorial after tutorial and sift through example after example only to find your head swimming. Don’t worry, you’re not alone.</p>
<p>When I first started learning OOP, I read about bicycles and bank accounts and filing cabinets. I read about all manner of objects with both basic and specific characteristics. It was easy to follow along. However, I always felt I was missing something. It wasn’t until I had that inexplicable eureka moment that I finally glimpsed the power of OOP.</p>
<p>However, I always felt as though my eureka moment took longer than it should have. I doubt I’m alone. Therefore, this post is my attempt to explain the basics of OOP through the lens of my favorite subject: machine learning. I hope you find it helpful.</p>
<h2 id="setup">Setup</h2>
<p>I discussed the basics of linear regression in a previous post entitled <a href="https://dziganto.github.io/data%20science/linear%20regression/machine%20learning/python/Linear-Regression-101-Basics/">Linear Regression 101 (Part 1 - Basics)</a>. If you’re unfamiliar, please start there because I’m going to assume you’re up to speed. Anyway, in that discussion I showed how to find the parameters of a linear regression model using nothing more than simple linear algebra. We defined a function called <strong>ols</strong> (short for Ordinary Least Squares) that looks like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def ols(X, y):
'''returns parameters based on Ordinary Least Squares.'''
xtx = np.dot(X.T, X) ## x-transpose times x
inv_xtx = np.linalg.inv(xtx) ## inverse of x-transpose times x
xty = np.dot(X.T, y) ## x-transpose times y
return np.dot(inv_xtx, xty)
</code></pre></div></div>
<p>The output of the <strong>ols</strong> function is an array of parameter values that minimize the squared residuals. Since the parameters, or coefficients, define the linear regression model, we save those values like so:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>parameters = ols(X,y)
</code></pre></div></div>
<p>In other words, the variable <em>parameters</em>, an array of scalar values, defines our model. To make predictions, we simply take the dot product of our model’s parameters with that of incoming data in the same format as the <em>X</em> that was passed to the <strong>ols</strong> function. Here’s that same idea in code:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>predictions = np.dot(X_new, parameters)
</code></pre></div></div>
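<p>To tie the pieces together, here is a minimal, self-contained sketch. The synthetic data and the true parameter values are illustrative assumptions, not from the original post; because the data is noise-free, <strong>ols</strong> should recover the generating parameters exactly:</p>

```python
import numpy as np

def ols(X, y):
    '''returns parameters based on Ordinary Least Squares.'''
    xtx = np.dot(X.T, X)              ## x-transpose times x
    inv_xtx = np.linalg.inv(xtx)      ## inverse of x-transpose times x
    xty = np.dot(X.T, y)              ## x-transpose times y
    return np.dot(inv_xtx, xty)

# synthetic, noise-free data generated from known parameters [2.0, 3.0]
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
true_params = np.array([2.0, 3.0])
y = np.dot(X, true_params)

parameters = ols(X, y)                # should recover [2.0, 3.0]
predictions = np.dot(X, parameters)   # predictions on the same design matrix
```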
<p>So now we have a model and a way to make predictions. Not too complicated. But as it turns out we can do better. We can simplify. Enter OOP.</p>
<h2 id="object-oriented-programming-overview">Object-Oriented Programming Overview</h2>
<p>In the same way we abstracted away a series of calculations that return the Ordinary Least Squares model parameters in a function called <strong>ols</strong>, we can abstract away <em>functions</em> and <em>data</em> in a single object called a <strong>class</strong>.</p>
<p><img src="/assets/images/class_diagram.png?raw=true" alt="image" class="center-image" /></p>
<p>Let me show you what I mean and then I’ll explain what’s going on.</p>
<h2 id="object-oriented-programming-machine-learning-example">Object-Oriented Programming Machine Learning Example</h2>
<p>We’ll build a class called <strong>MyLinearRegression</strong> one code block at a time so as to manage the complexity. It’s really not too tricky but it’s easier to understand in snippets. Alright, let’s get started.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
class MyLinearRegression:
def __init__(self, fit_intercept=True):
self.coef_ = None
self.intercept_ = None
self._fit_intercept = fit_intercept
</code></pre></div></div>
<p>Have no fear if that looks scary or overwhelming. I’ll break it down for you and you’ll see it’s really not that complicated. Just stay with me.</p>
<p>The first thing to notice is that we’re defining a <em>class</em> as opposed to a function. We do that, unsurprisingly, with the <strong>class</strong> keyword. By convention, you should capitalize your class names. Notice how I named my class <strong>MyLinearRegression</strong>? Starting your classes with a capital letter helps to differentiate them from functions, the latter of which is lowercase by convention.</p>
<p>The next block of code which starts with <code class="highlighter-rouge">def __init__(self, fit_intercept=True):</code> is where things get more complicated. Stay with me; I promise it’s not that bad.</p>
<p>At a high level, <code class="highlighter-rouge">__init__</code> provides a recipe for how to build an <em>instance</em> of <strong>MyLinearRegression</strong>. Think of <code class="highlighter-rouge">__init__</code> like a factory. Let’s pretend you wanted to crank out hundreds of linear regression models. You can do that one of two ways. First, you have the <strong>ols</strong> function that provides the instructions on how to calculate linear regression parameters. So you could, in theory, save off hundreds of copies of the <strong>ols</strong> function with hundreds of appropriate variable names. There’s nothing inherently wrong with that. Or you could save off hundreds of <em>instances</em> of class <strong>MyLinearRegression</strong> with hundreds of appropriate variable names. Both accomplish very similar tasks but do so in very different ways. You’ll understand why as we get a little further along.</p>
<blockquote>
<p>Technical note: the <code class="highlighter-rouge">__init__</code> block of code is optional, though it’s quite common. You’ll know when you need it and when you don’t with a bit more practice with OOP.</p>
</blockquote>
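<p>To make the factory analogy concrete, here is a small sketch of my own (the list of models and the attribute tweak are illustrative): each call to the class runs <code class="highlighter-rouge">__init__</code> once, so every instance gets its own independent attributes.</p>

```python
class MyLinearRegression:
    def __init__(self, fit_intercept=True):
        self.coef_ = None
        self.intercept_ = None
        self._fit_intercept = fit_intercept

# "crank out" several models; __init__ runs once per instance
models = [MyLinearRegression() for _ in range(3)]

# changing one instance's attribute leaves the other instances untouched
models[0]._fit_intercept = False
```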
<p>What the heck is <em>self</em>? Since an instance of <strong>MyLinearRegression</strong> can take on any name a user gives it, we need a way to link the user’s instance name back to the class so we can accomplish certain tasks. Think of <em>self</em> as a variable whose sole job is to learn the name of a particular instance. Say we named a particular instance of the class <strong>MyLinearRegression</strong> <em>mlr</em>, like so:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mlr = MyLinearRegression()
</code></pre></div></div>
<p>Again, the class <strong>MyLinearRegression</strong> provides instructions on how to build a linear regression model. By attaching the variable <em>mlr</em> to the <strong>MyLinearRegression</strong> class, we created an instance, a specific object called <em>mlr</em>, which will have its own data and “functions”. You’ll understand why I placed functions in quotes shortly. Anyway, <em>mlr</em> is a unique model with a unique name, much like you’re a unique person with your own name. The class object <strong>MyLinearRegression</strong> now links <em>self</em> to <em>mlr</em>. If it’s still not clear why that’s important, hang tight because it will become clear when we get to the next code block.</p>
<p>Now this business about <code class="highlighter-rouge">self.coef_</code>, <code class="highlighter-rouge">self.intercept_</code>, and <code class="highlighter-rouge">self._fit_intercept</code>. All three are simply variables, technically called <em>attributes</em>, attached to the class object. When we build <em>mlr</em>, our class provides a blueprint that calls for the creation of three <em>attributes</em>. <code class="highlighter-rouge">self.coef_</code> and <code class="highlighter-rouge">self.intercept_</code> are placeholders. We haven’t calculated model parameters but when we do we’ll place those values into these attributes. <code class="highlighter-rouge">self._fit_intercept</code> is a boolean (True or False) that is set to True by default per the keyword argument. A user can define whether to calculate the intercept by setting this argument to True or avoid it by setting the argument to False. Since we didn’t set <em>fit_intercept</em> to False when we created <em>mlr</em>, <em>mlr</em> will provide the intercept parameter once it’s calculated.</p>
<p>Great, let’s add a “function” called <strong>fit</strong> which will take an array of data and a vector of ground truth values in order to calculate and return linear regression model parameters.</p>
<blockquote>
<p>Note: We’re building this class one piece at a time. I’m doing this simply for pedagogical reasons.</p>
</blockquote>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class MyLinearRegression:
def __init__(self, fit_intercept=True):
self.coef_ = None
self.intercept_ = None
self._fit_intercept = fit_intercept
def fit(self, X, y):
"""
Fit model coefficients.
Arguments:
X: 1D or 2D numpy array
y: 1D numpy array
"""
# check if X is 1D or 2D array
if len(X.shape) == 1:
X = X.reshape(-1,1)
# add bias if fit_intercept is True
if self._fit_intercept:
X = np.c_[np.ones(X.shape[0]), X]
# closed form solution
xTx = np.dot(X.T, X)
inverse_xTx = np.linalg.inv(xTx)
xTy = np.dot(X.T, y)
coef = np.dot(inverse_xTx, xTy)
# set attributes
if self._fit_intercept:
self.intercept_ = coef[0]
self.coef_ = coef[1:]
else:
self.intercept_ = 0
self.coef_ = coef
</code></pre></div></div>
<p>Our focus now is on the <strong>fit</strong> function. Technically a class function is called a <strong>method</strong>. That’s the term I’ll use from here on out. The <strong>fit</strong> method is quite simple.</p>
<p>First comes the docstring which tells us what the method does and what the expected inputs are for <em>X</em> and <em>y</em>.</p>
<p>Next up is a check on the dimensions of the incoming <em>X</em> array. NumPy complains if you perform certain calculations on a 1D array. If a 1D array is passed, the supplied code reshapes it so as to fake a 2D array.</p>
<blockquote>
<p>Technical note: this does not change the output in any way. It simply anticipates and solves a problem for the user.</p>
</blockquote>
<p>The next block of code checks if <code class="highlighter-rouge">fit_intercept=True</code>. If so, then a vector of ones is added to the <em>X</em> array.</p>
<blockquote>
<p>I’ll assume you’ve read my post on linear regression to understand why we need to do this.</p>
</blockquote>
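<p>A quick illustration of both checks, using a made-up 1D feature vector: the reshape turns it into a single-column 2D array, and <code class="highlighter-rouge">np.c_</code> prepends the column of ones for the intercept.</p>

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # 1D array, shape (3,)
X = x.reshape(-1, 1)                  # reshaped to (3, 1): one column, "fake" 2D
Xb = np.c_[np.ones(X.shape[0]), X]    # bias column of ones prepended -> shape (3, 2)
```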
<p>The next block of code simply calculates the model parameters using linear algebra. The parameters are stored in a local variable called <em>coef</em>.</p>
<blockquote>
<p>Yes, <strong>coef</strong> is technically a variable, not an attribute. A variable-like object attached to a class via <strong>self</strong> is called an attribute, whereas a variable defined inside a method is simply a local variable scoped to that method.</p>
</blockquote>
<p>The final block of code parses <em>coef</em> appropriately. If <code class="highlighter-rouge">fit_intercept=True</code>, then the intercept value is copied to <code class="highlighter-rouge">self.intercept_</code>. Otherwise, <code class="highlighter-rouge">self.intercept_</code> is set to 0. The remaining parameters are stored in <code class="highlighter-rouge">self.coef_</code>.</p>
<p>Let’s see how this works.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mlr = MyLinearRegression()
mlr.fit(X_data, y_target)
</code></pre></div></div>
<p>We instantiate a model object called <em>mlr</em> and then find its model parameters on data (<em>X_data</em> and <em>y_target</em>) passed by the user. Once that’s done, we can access the intercept and remaining parameters like so:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>intercept = mlr.intercept_
parameters = mlr.coef_
</code></pre></div></div>
<p>So clean. So elegant. Let’s keep going. Let’s add a <strong>predict</strong> method.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
class MyLinearRegression:
def __init__(self, fit_intercept=True):
self.coef_ = None
self.intercept_ = None
self._fit_intercept = fit_intercept
def fit(self, X, y):
"""
Fit model coefficients.
Arguments:
X: 1D or 2D numpy array
y: 1D numpy array
"""
# check if X is 1D or 2D array
if len(X.shape) == 1:
X = X.reshape(-1,1)
# add bias if fit_intercept is True
if self._fit_intercept:
X = np.c_[np.ones(X.shape[0]), X]
# closed form solution
xTx = np.dot(X.T, X)
inverse_xTx = np.linalg.inv(xTx)
xTy = np.dot(X.T, y)
coef = np.dot(inverse_xTx, xTy)
# set attributes
if self._fit_intercept:
self.intercept_ = coef[0]
self.coef_ = coef[1:]
else:
self.intercept_ = 0
self.coef_ = coef
def predict(self, X):
"""
Output model prediction.
Arguments:
X: 1D or 2D numpy array
"""
# check if X is 1D or 2D array
if len(X.shape) == 1:
X = X.reshape(-1,1)
return self.intercept_ + np.dot(X, self.coef_)
</code></pre></div></div>
<p>The <strong>predict</strong> method is also quite simple. Pass in some data <em>X</em> formatted exactly as <em>X_data</em> in our case, and the model spits out its predictions.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>predictions = mlr.predict(X_new_data)
</code></pre></div></div>
<p>See how everything (data and methods) is contained or encapsulated in a single class object. It’s a wonderful way to keep everything organized.</p>
<p>But wait, there’s more.</p>
<p>Say we had another class called <strong>Metrics</strong>. This class captures a number of key metrics associated with regression models. See <a href="https://dziganto.github.io/data%20science/linear%20regression/machine%20learning/python/Linear-Regression-101-Metrics/">Linear Regression 101 (Part 2 - Metrics)</a> for details.</p>
<p>It looks like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Metrics:
def __init__(self, X, y, model):
self.data = X
self.target = y
self.model = model
# degrees of freedom population dep. variable variance
self._dft = X.shape[0] - 1
# degrees of freedom population error variance
self._dfe = X.shape[0] - X.shape[1] - 1
def sse(self):
'''returns sum of squared errors (model vs actual)'''
squared_errors = (self.target - self.model.predict(self.data)) ** 2
self.sq_error_ = np.sum(squared_errors)
return self.sq_error_
def sst(self):
'''returns total sum of squared errors (actual vs avg(actual))'''
avg_y = np.mean(self.target)
squared_errors = (self.target - avg_y) ** 2
self.sst_ = np.sum(squared_errors)
return self.sst_
def r_squared(self):
'''returns calculated value of r^2'''
self.r_sq_ = 1 - self.sse()/self.sst()
return self.r_sq_
def adj_r_squared(self):
'''returns calculated value of adjusted r^2'''
self.adj_r_sq_ = 1 - (self.sse()/self._dfe) / (self.sst()/self._dft)
return self.adj_r_sq_
def mse(self):
'''returns calculated value of mse'''
self.mse_ = np.mean( (self.model.predict(self.data) - self.target) ** 2 )
return self.mse_
def pretty_print_stats(self):
'''returns report of statistics for a given model object'''
items = ( ('sse:', self.sse()), ('sst:', self.sst()),
('mse:', self.mse()), ('r^2:', self.r_squared()),
('adj_r^2:', self.adj_r_squared()))
for item in items:
print('{0:8} {1:.4f}'.format(item[0], item[1]))
</code></pre></div></div>
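<p>As a sanity check on the formulas, here is a hedged, trimmed-down demo. The <code class="highlighter-rouge">PerfectModel</code> stub and the synthetic data are my own illustrations; the <strong>Metrics</strong> class is copied from above with only <strong>sse</strong>, <strong>sst</strong>, and <strong>r_squared</strong> kept. A model that reproduces the targets exactly should give an sse of 0 and an r^2 of 1.</p>

```python
import numpy as np

class Metrics:
    """Trimmed copy of the Metrics class above (sse, sst, r_squared only)."""
    def __init__(self, X, y, model):
        self.data = X
        self.target = y
        self.model = model
    def sse(self):
        '''sum of squared errors (model vs actual)'''
        return np.sum((self.target - self.model.predict(self.data)) ** 2)
    def sst(self):
        '''total sum of squared errors (actual vs avg(actual))'''
        return np.sum((self.target - np.mean(self.target)) ** 2)
    def r_squared(self):
        return 1 - self.sse() / self.sst()

class PerfectModel:
    """Hypothetical stand-in for a fitted model that predicts y exactly."""
    def __init__(self, coef):
        self.coef = coef
    def predict(self, X):
        return np.dot(X, self.coef)

rng = np.random.RandomState(1)
X = rng.rand(50, 2)
coef = np.array([1.5, -2.0])
y = np.dot(X, coef)                     # targets generated by the same model

m = Metrics(X, y, PerfectModel(coef))   # sse should be 0, r^2 should be 1
```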
<p>The <strong>Metrics</strong> class requires <em>X</em>, <em>y</em>, and a <em>model object</em> to calculate the key metrics. It’s certainly not a bad solution. However, we can do better. With a little tweaking, we can give <strong>MyLinearRegression</strong> access to <strong>Metrics</strong> in a simple yet intuitive way. Let me show you how:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class ModifiedMetrics:
def sse(self):
'''returns sum of squared errors (model vs actual)'''
squared_errors = (self.target - self.predict(self.data)) ** 2
self.sq_error_ = np.sum(squared_errors)
return self.sq_error_
def sst(self):
'''returns total sum of squared errors (actual vs avg(actual))'''
avg_y = np.mean(self.target)
squared_errors = (self.target - avg_y) ** 2
self.sst_ = np.sum(squared_errors)
return self.sst_
def r_squared(self):
'''returns calculated value of r^2'''
self.r_sq_ = 1 - self.sse()/self.sst()
return self.r_sq_
def adj_r_squared(self):
'''returns calculated value of adjusted r^2'''
self.adj_r_sq_ = 1 - (self.sse()/self._dfe) / (self.sst()/self._dft)
return self.adj_r_sq_
def mse(self):
'''returns calculated value of mse'''
self.mse_ = np.mean( (self.predict(self.data) - self.target) ** 2 )
return self.mse_
def pretty_print_stats(self):
'''returns report of statistics for a given model object'''
items = ( ('sse:', self.sse()), ('sst:', self.sst()),
('mse:', self.mse()), ('r^2:', self.r_squared()),
('adj_r^2:', self.adj_r_squared()))
for item in items:
print('{0:8} {1:.4f}'.format(item[0], item[1]))
</code></pre></div></div>
<p>Notice <strong>ModifiedMetrics</strong> no longer has <code class="highlighter-rouge">__init__</code>. Now for a slightly modified version of <strong>MyLinearRegression</strong>.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class MyLinearRegressionWithInheritance(ModifiedMetrics):
def __init__(self, fit_intercept=True):
self.coef_ = None
self.intercept_ = None
self._fit_intercept = fit_intercept
def fit(self, X, y):
"""
Fit model coefficients.
Arguments:
X: 1D or 2D numpy array
y: 1D numpy array
"""
# training data & ground truth data
self.data = X
self.target = y
# degrees of freedom population dep. variable variance
self._dft = X.shape[0] - 1
# degrees of freedom population error variance
self._dfe = X.shape[0] - X.shape[1] - 1
# check if X is 1D or 2D array
if len(X.shape) == 1:
X = X.reshape(-1,1)
# add bias if fit_intercept
if self._fit_intercept:
X = np.c_[np.ones(X.shape[0]), X]
# closed form solution
xTx = np.dot(X.T, X)
inverse_xTx = np.linalg.inv(xTx)
xTy = np.dot(X.T, y)
coef = np.dot(inverse_xTx, xTy)
# set attributes
if self._fit_intercept:
self.intercept_ = coef[0]
self.coef_ = coef[1:]
else:
self.intercept_ = 0
self.coef_ = coef
def predict(self, X):
"""Output model prediction.
Arguments:
X: 1D or 2D numpy array
"""
# check if X is 1D or 2D array
if len(X.shape) == 1:
X = X.reshape(-1,1)
return self.intercept_ + np.dot(X, self.coef_)
</code></pre></div></div>
<p>Notice how I created <strong>MyLinearRegressionWithInheritance</strong>? It contains <strong>ModifiedMetrics</strong> in parentheses right from the start. Here’s the snippet of code I’m referring to:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class MyLinearRegressionWithInheritance(ModifiedMetrics):
</code></pre></div></div>
<p>This means <strong>ModifiedMetrics</strong> acts like a base class and <strong>MyLinearRegressionWithInheritance</strong> can inherit from it. Why might this be helpful? First, it’s far more elegant. Second, imagine you wrote not just a linear regression algorithm but other regression algorithms as well, and you wanted each of those algorithms to have access to the same methods that calculate and return key regression metrics. On the one hand, you could copy all that code into each model object. On another hand, you could pass those model objects to the <strong>Metrics</strong> class. Or you could simply inherit from <strong>ModifiedMetrics</strong>. While all three will work, the last solution is by far the most elegant. It keeps your code modular and ensures you’re constructing your classes in a way that won’t break your code down the line. It’s much easier to change base class methods, or add and delete them, without having to comb through each algorithm to see if you made the required updates. In short, it makes your code manageable at scale.</p>
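<p>The pattern in miniature, with deliberately toy classes of my own invention: the base class’s methods assume the subclass sets <code class="highlighter-rouge">self.data</code> and <code class="highlighter-rouge">self.target</code> and implements <code class="highlighter-rouge">predict</code>, exactly as <strong>ModifiedMetrics</strong> does above.</p>

```python
import numpy as np

class MetricsMixin:
    """Base class: metrics shared by any model that implements predict()."""
    def mse(self):
        return np.mean((self.predict(self.data) - self.target) ** 2)

class MeanModel(MetricsMixin):
    """Toy 'regression' that always predicts the training mean."""
    def fit(self, X, y):
        self.data, self.target = X, y
        self.mean_ = np.mean(y)
        return self
    def predict(self, X):
        return np.full(X.shape[0], self.mean_)

# MeanModel never defines mse(), yet inherits it from MetricsMixin
model = MeanModel().fit(np.zeros((4, 1)), np.array([1.0, 2.0, 3.0, 4.0]))
```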
<p>We covered a lot of ground in short order, so this is a good place to stop for now.</p>
<h2 id="wrap-up">Wrap Up</h2>
<p>OOP is a powerful paradigm, keeping your code organized and manageable at scale. However, it’s not a magic bullet. Like any tool, you have to know where and when it’s appropriate to use it. That means you should spend some time learning at least a handful of OOP design patterns; there are many wonderful resources available. You’ll be surprised how much more powerful, elegant, and efficient your code will be with a little study.</p>
<hr />
<h1 id="simulated-datasets-for-faster-ml-understanding-part-12">Simulated Datasets for Faster ML Understanding (Part 1/2)</h1>
<p>Published 2018-01-23: <a href="https://dziganto.github.io/data%20science/eda/machine%20learning/python/simulated%20data/Simulated-Datasets-for-Faster-ML-Understanding">https://dziganto.github.io/data%20science/eda/machine%20learning/python/simulated%20data/Simulated-Datasets-for-Faster-ML-Understanding</a></p>
<p><img src="/assets/images/innovative_approach.jpg?raw=true" alt="image" class="center-image" /></p>
<h2 id="introduction">Introduction</h2>
<p>Oftentimes, the most difficult part of gaining expertise in machine learning is developing intuition about the strengths and weaknesses of the various algorithms. Common pedagogy follows a familiar pattern: theoretical exposition followed by application on a contrived dataset. For example, suppose you’re learning a classification algorithm for supervised machine learning. For specificity, let’s assume the algorithm du jour is <a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Gaussian_naive_Bayes">Gaussian Naive Bayes</a> (GNB). You learn, as a natural starting point, the mechanics and the fundamental assumptions. That gives you the big idea. Maybe you even code GNB from scratch to gain deeper insight. Great. Now comes time to apply GNB to “real” data. A canonical example is often presented, for example the <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">Iris</a> dataset. You learn to connect theory and application. Makes perfect sense.</p>
<p>So what’s the problem?</p>
<p>The problem is that you don’t know the generative process underlying the Iris dataset. Sure, you’re trying to deduce a proxy by fitting your GNB model. That’s the point of modeling. But that’s not what I’m getting at. No, what I want to help you understand is knowing where and when certain algorithms shine and where and when they don’t. In sum, I want to pull back the curtain; I want to show you how to understand machine learning algorithms at a much deeper level, the level of intuition. How you get there and how quickly you get there is a matter of technique, and it’s this technique that I’ll share with you so you too can gain deep expertise and intuition about machine learning algorithms with great alacrity.</p>
<h2 id="baby-steps">Baby Steps</h2>
<p>Imagine you knew the generative process underlying a dataset - you knew exactly how data was generated and how all the pieces fit together. In short, imagine you have perfect information. Now imagine running GNB on your data. Because you know precisely how the data was generated and because you know how GNB works, you can start piecing together where GNB performs well and in what situations it struggles. Now imagine you knew the generative process of not one but many datasets. Furthermore, imagine applying not just GNB but Logistic Regression, Random Forest, Support Vector Machines, and a slew of other classification algorithms you have at your disposal. All of a sudden you have the ability to garner deep insights into each of the algorithms, and fast.</p>
<p>But how do you move from imagination to reality?</p>
<h2 id="on-the-road-to-something-greater">On the Road to Something Greater</h2>
<p>The answer may surprise you. Create your own datasets! That may sound daunting but really it’s not. Let me walk you through one of my earliest incarnations. I even created a little backstory just to keep things interesting. Without further ado, here are the details.</p>
<h2 id="dataset-description">Dataset Description</h2>
<p>What follows is a full on description of the very first dataset I created. By the way, industry tends to call this type of dataset a <strong>simulated dataset</strong>.</p>
<h3 id="introduction-1">Introduction</h3>
<p>This dataset is built from scratch. It has the following properties:</p>
<blockquote>
<p><strong>Type:</strong> Classification<br />
<strong>Balanced:</strong> No (slightly imbalanced)<br />
<strong>Outliers:</strong> No<br />
<strong>Simulated Human Data Entry Errors:</strong> No<br />
<strong>Missing Values:</strong> No<br />
<strong>Nonsensical Data Types:</strong> No</p>
</blockquote>
<p>Furthermore, the dataset is designed in such a way that relying on gut feel alone will lead you astray.</p>
<h3 id="problem-description">Problem Description</h3>
<p>InstaFace (IF) is a cutting edge startup specializing in facial recognition. As a hot tech startup, IF is always looking to identify and hire the best talent. Because they are the best at what they do, their applicant pool is massive and growing. In fact, the number of applicants has grown so large and so fast that Human Resources just can’t keep up. So they need your help to create an automated way to identify the most promising candidates. In particular, they asked that you create a model that can take a number of predefined inputs and output a probability that a particular candidate will be hired. The good news is IF has hired scores of data scientists in the past, so the dataset is relatively rich.</p>
<h3 id="features">Features</h3>
<p>Below I describe each feature, whether it influences the target variable, and, if so, the likelihood of being hired for a given value of that feature.</p>
<hr />
<table>
  <thead>
    <tr><th>Feature #</th><th>Description</th><th>Important</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>degree</td><td>Y</td></tr>
    <tr><td>2</td><td>age</td><td>N</td></tr>
    <tr><td>3</td><td>gender</td><td>N</td></tr>
    <tr><td>4</td><td>major</td><td>N</td></tr>
    <tr><td>5</td><td>GPA</td><td>N</td></tr>
    <tr><td>6</td><td>experience</td><td>Y</td></tr>
    <tr><td>7</td><td>bootcamp</td><td>Y</td></tr>
    <tr><td>8</td><td>GitHub</td><td>Y</td></tr>
    <tr><td>9</td><td>blogger</td><td>Y</td></tr>
    <tr><td>10</td><td>blogs</td><td>N</td></tr>
  </tbody>
</table>
<hr />
<h4 id="feature-1">Feature 1</h4>
<ul>
<li>description: highest degree achieved</li>
<li>important: Yes</li>
<li>values: [(0=no bachelors, 8%), (1=bachelors, 70%), (2=masters, 80%), (3=PhD, 20%)]</li>
</ul>
<h4 id="feature-2">Feature 2</h4>
<ul>
<li>description: age</li>
<li>important: No</li>
<li>values: [18, 60]</li>
</ul>
<h4 id="feature-3">Feature 3</h4>
<ul>
<li>description: gender</li>
<li>important: No</li>
<li>values: [0=female, 1=male]</li>
</ul>
<h4 id="feature-4">Feature 4</h4>
<ul>
<li>description: major</li>
<li>important: No</li>
<li>values: [0=anthropology, 1=biology, 2=business, 3=chemistry, 4=engineering, 5=journalism, 6=math, 7=political science]</li>
</ul>
<h4 id="feature-5">Feature 5</h4>
<ul>
<li>description: GPA</li>
<li>important: No</li>
<li>values: [1.00, 4.00]</li>
</ul>
<h4 id="feature-6">Feature 6</h4>
<ul>
<li>description: years of experience</li>
<li>important: Yes</li>
<li>values: [(0-10, 90%), (11-25, 20%), (26-50, 5%)]</li>
</ul>
<h4 id="feature-7">Feature 7</h4>
<ul>
<li>description: attended bootcamp</li>
<li>important: Yes</li>
<li>values: [(0=No, 25%), (1=Yes, 75%)]</li>
</ul>
<h4 id="feature-8">Feature 8</h4>
<ul>
<li>description: number of projects on GitHub</li>
<li>important: Yes</li>
<li>values: [(0, 5%), (1-5, 65%), (6-20, 95%)]</li>
</ul>
<h4 id="feature-9">Feature 9</h4>
<ul>
<li>description: writes data science blog posts</li>
<li>important: Yes</li>
<li>values: [(0=No, 30%), (1=Yes, 70%)]</li>
</ul>
<h4 id="feature-10">Feature 10</h4>
<ul>
<li>description: number of blog articles written</li>
<li>important: No</li>
<li>values: [0, 20]</li>
</ul>
<h3 id="more-details">More Details</h3>
<p>Without looking at the data, many people would likely assume that a PhD would have better chances of getting hired than someone with a Master’s, that a Master’s candidate would have better chances of getting hired than someone with a Bachelor’s, and so on. This is simply not true in this case. I specifically created this dataset in such a way that people with Bachelor’s and Master’s degrees are far more likely to get hired than PhD’s or those without a degree.</p>
<p>Regarding <strong>age</strong> and <strong>gender</strong>, one may reasonably conjecture that these attributes would have high impact with regard to hiring decisions since this is a well-known bias in many real companies. However, I specifically created this dataset so that hiring decisions were made independently of these two attributes. Again, the goal is to let the data speak for itself, not to rely on intuition. There is an interesting result lurking beneath the surface, however. <strong>Age</strong> is correlated with <strong>experience</strong> so it exhibits some signal, but the true source of the signal is <strong>experience</strong>.</p>
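<p>You can verify that correlation directly. Here is a quick sketch that regenerates just the <strong>age</strong> and <strong>experience</strong> columns using the rule described below (experience drawn uniformly from 0 up to age minus 18):</p>

```python
import numpy as np
import pandas as pd

np.random.seed(10)
age = np.random.choice(a=range(18, 61), size=5000)
# Experience is capped by age, so the two columns move together.
experience = np.array([np.random.choice(a=range(0, a - 17)) for a in age])
corr = pd.Series(age).corr(pd.Series(experience))
print(corr > 0.4)  # True: age carries some of experience's signal
```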
<p>One may also assume that <strong>major</strong> and <strong>GPA</strong> are strong predictors. That may be the case at some real-world companies but not in this case. They have no impact whatsoever. Any signal present is purely due to chance.</p>
<p>On the other hand, <strong>years of experience</strong>, <strong>bootcamp experience</strong>, <strong>number of projects on GitHub</strong>, and <strong>blog experience</strong> are all strong predictors. Specifically, the dataset was designed such that candidates with light experience, bootcamp experience, numerous independent GitHub projects, and a data science blog are preferred. Surprisingly perhaps, the number of blog articles one has written is irrelevant. This was by design.</p>
<p>One last thing to note: Whether a candidate was hired is not based on any one of the five important features. Rather, five target flags were generated probabilistically based on the values of those features and a simple majority results in being hired. To add a bit more complexity, I randomly flipped 3% of hiring decisions so that learning the hiring decision rule would be more difficult.</p>
<p>Great, so now we have all that background behind us which means it’s time to actually generate the data.</p>
<h2 id="generate-data">Generate Data</h2>
<p>There are many ways to efficiently create datasets using NumPy and Pandas. I tried to keep things simple and understandable, not necessarily efficient. Please bear with me here.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import pandas as pd
# reproducibility
np.random.seed(10)
# number of observations
size = 5000
# feature setup
degree = np.random.choice(a=range(4), size=size)
age = np.random.choice(a=range(18,61), size=size)
gender = np.random.choice(a=range(2), size=size)
major = np.random.choice(a=range(8), size=size)
gpa = np.round(np.random.normal(loc=2.90, scale=0.5, size=size), 2)
experience = None
bootcamp = np.random.choice(a=range(2), size=size)
github = np.random.choice(a=range(21), size=size)
blogger = np.random.choice(a=range(2), size=size)
articles = 0
t1, t2, t3, t4, t5 = None, None, None, None, None
hired = 0
</code></pre></div></div>
<p>Now to create a pandas dataframe.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mydict = {"degree": degree, "age": age,
          "gender": gender, "major": major,
          "gpa": gpa, "experience": experience,
          "github": github, "bootcamp": bootcamp,
          "blogger": blogger, "articles": articles,
          "t1": t1, "t2": t2, "t3": t3, "t4": t4, "t5": t5, "hired": hired}

df = pd.DataFrame(mydict,
                  columns=["degree", "age", "gender", "major", "gpa",
                           "experience", "bootcamp", "github", "blogger", "articles",
                           "t1", "t2", "t3", "t4", "t5", "hired"])
</code></pre></div></div>
<p>We’re not quite there yet. We still need to update some columns. Here’s an inefficient but hopefully understandable way to do that:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>np.random.seed(42)

for i, _ in df.iterrows():

    # Constrain GPA
    if df.loc[i, 'gpa'] < 1.00 or df.loc[i, 'gpa'] > 4.00:
        if df.loc[i, 'gpa'] < 1.00:
            df.loc[i, 'gpa'] = 1.00
        else:
            df.loc[i, 'gpa'] = 4.00

    # Set experience based on age
    df.loc[i, 'experience'] = np.random.choice(a=range(0, df.loc[i, 'age']-17))

    # Set number of articles if blogger flag
    if df.loc[i, 'blogger']:
        df.loc[i, 'articles'] = np.random.choice(a=range(1, 21))

    # Set target flags
    for feature in ['degree', 'experience', 'bootcamp', 'github', 'blogger']:
        if feature == 'degree':
            if df.loc[i, feature] == 0:
                df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.92, 0.08]))  ## no bachelors
            elif df.loc[i, feature] == 1:
                df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.30, 0.70]))  ## bachelors
            elif df.loc[i, feature] == 2:
                df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.20, 0.80]))  ## masters
            else:
                df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.80, 0.20]))  ## PhD
        elif feature == 'experience':
            if df.loc[i, feature] <= 10:
                df.loc[i, 't2'] = int(np.random.choice(a=range(2), size=1, p=[0.10, 0.90]))  ## <= 10 yrs exp
            elif df.loc[i, feature] <= 25:
                df.loc[i, 't2'] = int(np.random.choice(a=range(2), size=1, p=[0.80, 0.20]))  ## 11-25 yrs exp
            else:
                df.loc[i, 't2'] = int(np.random.choice(a=range(2), size=1, p=[0.95, 0.05]))  ## >= 26 yrs exp
        elif feature == 'bootcamp':
            if df.loc[i, feature]:
                df.loc[i, 't3'] = int(np.random.choice(a=range(2), size=1, p=[0.25, 0.75]))  ## bootcamp
            else:
                df.loc[i, 't3'] = int(np.random.choice(a=range(2), size=1, p=[0.50, 0.50]))  ## no bootcamp
        elif feature == 'github':
            if df.loc[i, feature] == 0:
                df.loc[i, 't4'] = int(np.random.choice(a=range(2), size=1, p=[0.95, 0.05]))  ## 0 projects
            elif df.loc[i, feature] <= 5:
                df.loc[i, 't4'] = int(np.random.choice(a=range(2), size=1, p=[0.35, 0.65]))  ## 1-5 projects
            else:
                df.loc[i, 't4'] = int(np.random.choice(a=range(2), size=1, p=[0.05, 0.95]))  ## > 5 projects
        else:
            if df.loc[i, feature]:
                df.loc[i, 't5'] = int(np.random.choice(a=range(2), size=1, p=[0.30, 0.70]))  ## blogger
            else:
                df.loc[i, 't5'] = int(np.random.choice(a=range(2), size=1, p=[0.50, 0.50]))  ## !blogger

    # Set hired value
    if (df.loc[i, 't1'] + df.loc[i, 't2'] + df.loc[i, 't3'] + df.loc[i, 't4'] + df.loc[i, 't5']) >= 3:
        df.loc[i, 'hired'] = 1
</code></pre></div></div>
<p>The big takeaway is the last <em>if</em> statement. That’s where the target variable (aka <em>hired</em>) is set. <em>This is the generative process</em>. It simply states that if the temporary flag variables t1-t5 sum to three or more, then set hired equal to one, otherwise zero. It’s a simple decision based on a simple summation - probably not too far off from many real hiring decisions!</p>
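<p>For reference, that decision rule is a one-liner when vectorized. Here is a sketch on a toy frame (not the full dataset):</p>

```python
import pandas as pd

# Toy frame with the five 0/1 target flags from the post.
df = pd.DataFrame({'t1': [1, 0], 't2': [1, 0], 't3': [1, 1], 't4': [0, 0], 't5': [0, 1]})
# A simple majority of the flags (three or more) means hired.
df['hired'] = (df[['t1', 't2', 't3', 't4', 't5']].sum(axis=1) >= 3).astype(int)
print(df['hired'].tolist())  # [1, 0]
```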
<p>It’s worthwhile to apply just a bit more processing. Specifically, we want to remove those temporary flag variables t1-t5 and convert <strong>experience</strong> from an object type to numeric.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Drop target flags
df.drop(df[['t1', 't2', 't3', 't4', 't5']], axis=1, inplace=True)
# Set 'experience' to numeric (was object type)
df['experience'] = df['experience'].apply(pd.to_numeric)
</code></pre></div></div>
<p>Great, we’re almost there. We just need to add the last bit of complexity where we flip a few hiring decisions. Again, the aim is not efficiency but ease of understanding here.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>np.random.seed(15)
percent_to_flip = 0.03  ## % of hired values to flip
num_to_flip = int(np.floor(percent_to_flip * len(df)))  ## determine number of hired values to flip
flip_idx = np.random.randint(low=0, high=len(df), size=num_to_flip)  ## randomly select indices

for i, _ in df.loc[flip_idx].iterrows():
    if df.loc[i, 'hired'] == 1:
        df.loc[i, 'hired'] = 0
    else:
        df.loc[i, 'hired'] = 1
</code></pre></div></div>
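<p>As an aside, the flip can also be written as a single vectorized assignment. A sketch with an illustrative index list (the post draws these indices at random):</p>

```python
import pandas as pd

df = pd.DataFrame({'hired': [1, 0, 1, 0]})
flip_idx = [0, 3]  # illustrative indices
# 1 - x maps 1 -> 0 and 0 -> 1, flipping each selected hiring decision.
df.loc[flip_idx, 'hired'] = 1 - df.loc[flip_idx, 'hired']
print(df['hired'].tolist())  # [0, 0, 1, 1]
```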
<p>Great, that’s as far as we want to take this dataset.</p>
<h2 id="wrap-up">Wrap Up</h2>
<p>We covered lots of ground already. I introduced the idea of generating your own datasets from scratch. This process is known as simulating datasets. The reason for doing this is simple: you want to truly understand the generative process so you can apply various <em>Exploratory Data Analysis (EDA)</em> and machine learning techniques for the express purpose of building your intuition about which techniques work best on different types of data. That easily elevates you from novice to expert, and all it requires is a little time and practice.</p>
<p>Next time we’ll dig a bit deeper into the data. We’ll apply some basic EDA and then round out the discussion with a few traditional machine learning models to understand a bit better why one performs better than another.</p>

<h2 id="caesar-cipher">Caesar Cipher (2018-01-20)</h2>

<p><img src="/assets/images/code_talkers.png?raw=true" alt="image" class="center-image" /></p>
<h2 id="introduction">Introduction</h2>
<p>There are myriad ways to encrypt text. One of the simplest and easiest to understand is the <strong>Caesar cipher</strong>. It’s extremely easy to crack but it’s a great place to start for the purposes of introducing ciphers.</p>
<h2 id="a-bit-of-terminology">A Bit of Terminology</h2>
<p>The setup is pretty simple. You start with a message you want to codify so no one else can read it. Say the message is <code class="highlighter-rouge">I hope you cannot read this</code>. This is called the <strong>plaintext</strong>. Now we need to apply some algorithm to our text so the output is incoherent. For example, the output may be <code class="highlighter-rouge">O nuvk eua igttuz xkgj znoy</code>. This is called the <strong>ciphertext</strong>. Mapping the plaintext to ciphertext is called <strong>encryption</strong>. Mapping the ciphertext back to plaintext is called <strong>decryption</strong>. The algorithm used to encrypt or decrypt is called a <strong>cipher</strong>.</p>
<h2 id="caesar-cipher-how-it-works">Caesar Cipher: How it Works</h2>
<p>Mapping <code class="highlighter-rouge">I hope you cannot read this</code> to <code class="highlighter-rouge">O nuvk eua igttuz xkgj znoy</code> with the Caesar cipher works like this. First, you start by deciding how much you want to shift the alphabet. Say you choose a shift of six so A becomes G, B becomes H, C becomes I, and so on until you get to the end where Z becomes F. Now you have a way to map any plaintext character to ciphertext. In fact, that’s exactly how I encoded this message:</p>
<blockquote>
<p>plaintext: I hope you cannot read this.<br />
ciphertext: O nuvk eua igttuz xkgj znoy.</p>
</blockquote>
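<p>Under the hood this is just modular arithmetic on letter positions. A minimal sketch (the helper name <code class="highlighter-rouge">shift_char</code> is mine, not part of the class we build below):</p>

```python
def shift_char(c, shift):
    '''Shift a lowercase letter by `shift` positions, wrapping past z.'''
    return chr((ord(c) - ord('a') + shift) % 26 + ord('a'))

# With a shift of six: a -> g, z wraps around to f, and i -> o.
print(shift_char('a', 6), shift_char('z', 6), shift_char('i', 6))  # g f o
```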
<p>Here’s a gif that shows the various mappings:</p>
<p><img src="https://i.stack.imgur.com/D3ypD.gif" alt="Caesar Cipher gif" class="center-image" /></p>
<p>The outer circle represents plaintext letters while the inner circle represents the ciphertext equivalent.</p>
<p>Hopefully you can see right away why this particular cipher is very easy to crack. Just mapping the plaintext to ciphertext while maintaining word lengths and spaces makes the process fairly easy. By converting all the text to lowercase and removing all spaces and punctuation, we can make it a bit more challenging. But just barely. There are only 25 different ways to shift the letters, which means a brute force attack is trivial.</p>
<p>Let’s see what this looks like in code.</p>
<h2 id="the-code">The Code</h2>
<p>We’ll create a class called <strong>CaesarCipher</strong> that can encrypt or decrypt text.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class CaesarCipher:

    def _clean_text(self, text):
        '''converts text to lowercase, removes spaces, and removes punctuation.'''
        import string
        assert type(text) == str, 'input needs to be a string!'
        text = text.lower()
        text = text.replace(' ', '')
        self.clean_text = "".join(character for character in text
                                  if character not in string.punctuation)
        return self.clean_text

    def _string2characters(self, text):
        '''converts a string to individual characters.'''
        assert type(text) == str, 'input needs to be a string!'
        self.str2char = list(text)
        return self.str2char

    def _chars2nums(self, characters):
        '''converts individual characters to integers.'''
        assert type(characters) == list, 'input needs to be a list of characters!'
        codebook = {'a':0, 'b':1, 'c':2, 'd':3, 'e':4, 'f':5, 'g':6, 'h':7, 'i':8, 'j':9,
                    'k':10, 'l':11, 'm':12, 'n':13, 'o':14, 'p':15, 'q':16, 'r':17, 's':18,
                    't':19, 'u':20, 'v':21, 'w':22, 'x':23, 'y':24, 'z':25}
        for i, char in enumerate(characters):
            try:
                characters[i] = codebook[char]
            except KeyError:
                pass
        self.char2num = characters
        return self.char2num

    def _nums2chars(self, numbers):
        '''converts individual integers to characters.'''
        assert type(numbers) == list, 'input needs to be a list of numbers!'
        codebook = {0:'a', 1:'b', 2:'c', 3:'d', 4:'e', 5:'f', 6:'g', 7:'h', 8:'i', 9:'j',
                    10:'k', 11:'l', 12:'m', 13:'n', 14:'o', 15:'p', 16:'q', 17:'r', 18:'s',
                    19:'t', 20:'u', 21:'v', 22:'w', 23:'x', 24:'y', 25:'z'}
        for i, num in enumerate(numbers):
            try:
                numbers[i] = codebook[num]
            except KeyError:
                pass
        self.num2chars = numbers
        return self.num2chars

    def _preprocessing(self, text):
        '''cleans text and converts it to a list of integers.'''
        clean_text = self._clean_text(text)
        list_of_chars = self._string2characters(clean_text)
        list_of_nums = self._chars2nums(list_of_chars)
        return list_of_nums

    def encrypt(self, text, shift=3):
        '''returns text shifted forward according to user's input.'''
        import numpy as np
        preprocess = self._preprocessing(text)
        nums_shifted = list((np.array(preprocess) + shift) % 26)
        return ''.join(self._nums2chars(nums_shifted))

    def decrypt(self, text, shift=3):
        '''returns text shifted back by user-defined shift length.'''
        import numpy as np
        preprocess = self._preprocessing(text)
        num_shift = list((np.array(preprocess) - shift) % 26)
        return ''.join(self._nums2chars(num_shift))
</code></pre></div></div>
<h2 id="code-breakdown">Code Breakdown</h2>
<p>The <strong>CaesarCipher</strong> class contains a number of methods. The first is a method called <strong>_clean_text</strong> which converts all letters to lower case and removes spaces and punctuation. The second, third, and fourth methods called <strong>_string2characters</strong>, <strong>_chars2nums</strong>, and <strong>_nums2chars</strong> should be self-explanatory. The <strong>_preprocessing</strong> method is a meta-function that incorporates and applies all the aforementioned methods in one sequential process. The last two methods are the most interesting: <strong>encrypt</strong> and <strong>decrypt</strong>. They perform as advertised.</p>
<h2 id="setup">Setup</h2>
<p>Great, now let’s instantiate our class and put it through its paces.</p>
<p>To instantiate, we merely type <code class="highlighter-rouge">cc = CaesarCipher()</code>.</p>
<p>Now to encrypt a message: <code class="highlighter-rouge">print(cc.encrypt('I hope you cannot read this.', shift=6))</code>.</p>
<p>The <em>shift</em> parameter tells the class by how much to shift the letters to encrypt the plaintext. In this case I arbitrarily chose 6. The output is <code class="highlighter-rouge">onuvkeuaigttuzxkgjznoy</code>. That sure doesn’t look like anything I can make out.</p>
<p>Let’s try another one for fun. This one will showcase the preprocessing method in all its glory.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>text = 'the QuIcK brown fox jumps over the lazy dog!'
encrypted = cc.encrypt(text, shift=5)
print(encrypted)
</code></pre></div></div>
<p>The output is <code class="highlighter-rouge">ymjvznhpgwtbsktcozruxtajwymjqfeditl</code>.</p>
<h2 id="discussion">Discussion</h2>
<p>Now if you’ve given this a little thought, you should see ways to crack this cipher wide open.</p>
<p>The English language is replete with structure. Certain letters appear far more frequently than others. The letter <em>e</em>, for example, is the most common letter in the English language. Therefore, using letter frequencies is a very effective strategy. Another giveaway is double letters, of which only so many pairings exist. So given longer snippets of text, you can deduce plaintext-to-ciphertext letter mappings with ease.</p>
<p><img src="/assets/images/letter_frequency.png?raw=true" alt="Letter Frequency" class="center-image" /></p>
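<p>Counting letters takes one line with the standard library. A quick sketch on the ciphertext from above (with a snippet this short the most common letter happens to decode to <em>o</em> rather than <em>e</em>, but the idea is the same):</p>

```python
from collections import Counter

ciphertext = 'ymjvznhpgwtbsktcozruxtajwymjqfeditl'
counts = Counter(ciphertext)
# 't' tops the list; shifting it back by 5 gives plaintext 'o'.
print(counts.most_common(1))  # [('t', 4)]
```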
<p>If all else fails or you just want to find the answer quickly, a brute force search will expose the plaintext.</p>
<p>Let’s see how that works.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># show all decryption possibilities
for i in range(1, 26):
    print('shift{:2}: {}'.format(i, cc.decrypt(encrypted, shift=i)))
</code></pre></div></div>
<p>Which outputs:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>shift 1: xliuymgofvsarjsbnyqtwszivxlipedchsk
shift 2: wkhtxlfneurzqiramxpsvryhuwkhodcbgrj
shift 3: vjgswkemdtqyphqzlworuqxgtvjgncbafqi
shift 4: uifrvjdlcspxogpykvnqtpwfsuifmbazeph
shift 5: thequickbrownfoxjumpsoverthelazydog
shift 6: sgdpthbjaqnvmenwitlornudqsgdkzyxcnf
shift 7: rfcosgaizpmuldmvhsknqmtcprfcjyxwbme
shift 8: qebnrfzhyoltkclugrjmplsboqebixwvald
shift 9: pdamqeygxnksjbktfqilokranpdahwvuzkc
shift10: oczlpdxfwmjriajsephknjqzmoczgvutyjb
shift11: nbykocwevliqhzirdogjmipylnbyfutsxia
shift12: maxjnbvdukhpgyhqcnfilhoxkmaxetsrwhz
shift13: lzwimauctjgofxgpbmehkgnwjlzwdsrqvgy
shift14: kyvhlztbsifnewfoaldgjfmvikyvcrqpufx
shift15: jxugkysarhemdvenzkcfieluhjxubqpotew
shift16: iwtfjxrzqgdlcudmyjbehdktgiwtaponsdv
shift17: hvseiwqypfckbtclxiadgcjsfhvszonmrcu
shift18: gurdhvpxoebjasbkwhzcfbiregurynmlqbt
shift19: ftqcguowndaizrajvgybeahqdftqxmlkpas
shift20: espbftnvmczhyqziufxadzgpcespwlkjozr
shift21: droaesmulbygxpyhtewzcyfobdrovkjinyq
shift22: cqnzdrltkaxfwoxgsdvybxenacqnujihmxp
shift23: bpmycqksjzwevnwfrcuxawdmzbpmtihglwo
shift24: aolxbpjriyvdumveqbtwzvclyaolshgfkvn
shift25: znkwaoiqhxuctludpasvyubkxznkrgfejum
</code></pre></div></div>
<p>A quick scan gives away the plaintext: <code class="highlighter-rouge">shift 5: thequickbrownfoxjumpsoverthelazydog</code>.</p>
<h2 id="wrap-up">Wrap Up</h2>
<p>Hopefully you found this a fun introduction to cryptography. It’s a rich and rewarding field with endless applications.</p>
<p>Next time, we’ll build upon what we learned here as we explore a more challenging cipher known as the <strong>Vigenere cipher</strong>.</p>

<h2 id="model-tuning-part-2">Model Tuning (Part 2 - Validation &amp; Cross-Validation) (2018-01-19)</h2>

<p><img src="/assets/images/cv_image.png?raw=true" alt="Comic" class="center-image" /></p>
<h2 id="introduction">Introduction</h2>
<p>Last time in <a href="https://dziganto.github.io/data%20science/machine%20learning/model%20tuning/python/Model-Tuning-Train-Test-Split/">Model Tuning (Part 1 - Train/Test Split)</a> we discussed training error, test error, and train/test split. We learned that training a model on all the available data and then testing on that very same data is an awful way to build models because we have no indication as to how well that model will perform on unseen data. In other words, we don’t know if the model is essentially memorizing the data it’s seen or if it’s truly picking up the pattern inherent in the data (i.e. its ability to generalize).</p>
<p>To remedy that situation, we implemented <em>train/test split</em> that effectively holds some data aside from the model building process for testing at the very end when the model is fully trained. This allows us to see how the model performs on unseen data and gives us some indication as to whether the model generalizes or not.</p>
<p>Now that we have a solid foundation, we can move on to more advanced topics that will take our model-building skills to the next level. Specifically, we’ll dig in to the following topics:</p>
<ul>
<li>Bias-Variance Tradeoff</li>
<li>Validation Set</li>
<li>Model Tuning</li>
<li>Cross-Validation</li>
</ul>
<p>To make this concrete, we’ll combine theory and application. For the latter, we’ll leverage the <strong>Boston</strong> dataset in sklearn.</p>
<blockquote>
<p>Please refer to the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html">Boston dataset</a> for details.</p>
</blockquote>
<p>Our first step is to read in the data and prep it for modeling.</p>
<h2 id="get--prep-data">Get & Prep Data</h2>
<p>Here’s a bit of code to get us going:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.datasets import load_boston

boston = load_boston()
data = boston.data
target = boston.target
</code></pre></div></div>
<p>And now let’s split the data into train/test split like so:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.model_selection import train_test_split

# train/test split
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    target,
                                                    shuffle=True,
                                                    test_size=0.2,
                                                    random_state=15)
</code></pre></div></div>
<h2 id="setup">Setup</h2>
<p>We know we’ll need to calculate training and test error, so let’s go ahead and create functions to do just that. Let’s include a meta-function that will generate a nice report for us while we’re at it. Also, Mean Squared Error (MSE) will be our metric of choice.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.metrics import mean_squared_error

def calc_train_error(X_train, y_train, model):
    '''returns in-sample error for already fit model.'''
    predictions = model.predict(X_train)
    mse = mean_squared_error(y_train, predictions)
    return mse

def calc_validation_error(X_test, y_test, model):
    '''returns out-of-sample error for already fit model.'''
    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    return mse

def calc_metrics(X_train, y_train, X_test, y_test, model):
    '''fits model and returns the MSE for in-sample error and out-of-sample error'''
    model.fit(X_train, y_train)
    train_error = calc_train_error(X_train, y_train, model)
    validation_error = calc_validation_error(X_test, y_test, model)
    return train_error, validation_error
</code></pre></div></div>
<p>Time to dive into a little theory. Stay with it because we’ll come back around to the application side where you’ll see how all the pieces fit together.</p>
<h2 id="theory">Theory</h2>
<h3 id="bias-variance-tradeoff">Bias-Variance Tradeoff</h3>
<p>Pay very close attention to this section. It is one of the most important concepts in all of machine learning. Understanding this concept will help you diagnose all types of models, be they linear regression, XGBoost, or Convolutional Neural Networks.</p>
<p>We already know how to calculate training error and test error. So far we’ve simply been using test error as a way to gauge how well our model will generalize. That was a good first step but it’s not good enough. We can do better. We can tune our model. Let’s drill down.</p>
<p>We can compare training error and something called <em>validation error</em> to figure out what’s going on with our model - more on validation error in a minute. Depending on the values of each, our model can be in one of three regions:</p>
<p>1) <strong>High Bias</strong> - underfitting<br />
2) <strong>Goldilocks Zone</strong> - just right<br />
3) <strong>High Variance</strong> - overfitting</p>
<p><img src="/assets/images/bias-variance-tradeoff.png?raw=true" alt="Bias-Variance Tradeoff" class="center-image" /></p>
<h3 id="plot-orientation">Plot Orientation</h3>
<p>The x-axis represents model complexity. This has to do with how flexible your model is. Some things that add complexity to a model include: additional features, increasing polynomial terms, and increasing the depth for tree-based models. Keep in mind this is far from an exhaustive list but you should get the gist.</p>
<p>The y-axis indicates model error. It’s often measured as <em>Mean-Squared Error (MSE)</em> for Regression and <em>Cross-Entropy</em> or <em>Accuracy</em> for Classification.</p>
<p>The blue curve is <em>Training Error</em>. Notice that it only decreases. What should be painfully obvious is that adding model complexity leads to smaller and smaller training errors. That’s a key finding.</p>
<p>The green curve forms a U-shape. This curve represents <em>Validation Error</em>. Notice the trend. First it decreases, hits a minimum, and then increases. We’ll talk in more detail shortly about what exactly <em>Validation Error</em> is and how to calculate it.</p>
<h3 id="high-bias">High Bias</h3>
<p>The rectangular box outlined by dashes to the left and labeled as <em>High Bias</em> is the first region of interest. Here you’ll notice <em>Training Error</em> and <em>Validation Error</em> are high. You’ll also notice that they are close to one another. This region is defined as the one where the model lacks the flexibility required to really pull out the inherent trend in the data. In machine learning speak, it is <em>underfitting</em>, meaning it’s doing a poor job all around and won’t generalize well. The model doesn’t even do well on the training set.</p>
<p>How do you fix this?</p>
<p>By adding model complexity of course. I’ll go into much more detail about what to do when you realize you’re under or overfitting in another post. For now, assuming you’re using linear regression, a good place to start is by adding additional features. The addition of parameters to your model grants it flexibility that can push your model into the Goldilocks Zone.</p>
<h3 id="goldilocks-zone">Goldilocks Zone</h3>
<p>The middle region without dashes I’ve named the <em>Goldilocks Zone</em>. Your model has just the right amount of flexibility to pick up on the pattern inherent in the data but isn’t so flexible that it’s really just memorizing the training data. This region is marked by <em>Training Error</em> and <em>Validation Error</em> that are both low and close to one another. This is where your model should live.</p>
<h3 id="high-variance">High Variance</h3>
<p>The dashed rectangular box to the right and labeled <em>High Variance</em> is the flip of the <em>High Bias</em> region. Here the model has so much flexibility that it essentially starts to memorize the training data. Not surprisingly, that approach leads to low <em>Training Error</em>. But as was mentioned in the <a href="https://dziganto.github.io/data%20science/machine%20learning/model%20tuning/python/Model-Tuning-Train-Test-Split/">train/test post</a>, a lookup table does not generalize, which is why we see high <em>Validation Error</em> in this region. You know you’re in this region when your <em>Training Error</em> is low but your <em>Validation Error</em> is high. Said another way, if there’s a sizeable delta between the two, you’re overfitting.</p>
<p>How do you fix this?</p>
<p>By decreasing model complexity. Again, I’ll go into much more detail in a separate post about what exactly to do. For now, consider applying regularization or dropping features.</p>
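<p>One concrete way to rein in complexity without dropping features is L2 regularization (ridge regression). Here is a minimal sketch on synthetic data (not the Boston set we use below), just to show the shrinkage effect:</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.randn(50, 10)
y = X[:, 0] + 0.1 * rng.randn(50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
# The L2 penalty shrinks the coefficient vector, trading a little bias
# for lower variance.
print(np.linalg.norm(ridge.coef_) < np.linalg.norm(ols.coef_))  # True
```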
<h3 id="canonical-plot">Canonical Plot</h3>
<p>Let’s look at one more plot to drive these ideas home.</p>
<p><img src="/assets/images/bias-and-variance-targets.jpg?raw=true" alt="Bias-Variance Target Pic" class="center-image" /></p>
<p>Imagine you’ve entered an archery competition. You receive a score based on which portion of the target you hit: 0 for the red circle (bullseye), 1 for the blue, and 2 for the white. The goal is to minimize your score and you do that by hitting as many bullseyes as possible.</p>
<p>The archery metaphor is a useful analog to explain what we’re trying to accomplish by building a model. Given different datasets (equivalent to different arrows), we want a model that predicts as closely as possible to observed data (aka targets).</p>
<p>The top <strong>Low Bias/Low Variance</strong> portion of the graph represents the ideal case. This is the <strong>Goldilocks Zone</strong>. Our model has extracted all the useful information and generalizes well. We know this because the model is accurate and exhibits little variance, even when predicting on unseen data. The model is highly tuned, much like an archer who can adjust to different wind speeds, distances, and lighting conditions.</p>
<p>The <strong>Low Bias/High Variance</strong> portion of the graph represents <em>overfitting</em>. Our model does well on the training data, but we see high variance for specific datasets. This is analogous to an archer who has trained under very stringent conditions - perhaps indoors where there is no wind, the distance is consistent, and the lighting is always the same. Any variation in any of those attributes throws off the archer’s accuracy. The archer lacks consistency.</p>
<p>The <strong>High Bias/Low Variance</strong> portion of the graph represents <em>underfitting</em>. Our model does poorly on any given dataset. In fact, it’s so bad that it does just about as poorly regardless of the data you feed it, hence the small variance. As an analog, consider an archer who has learned to fire with consistency but hasn’t learned to hit the target. This is analogous to a model that always predicts the average value of the training data’s target.</p>
<p>The <strong>High Bias/High Variance</strong> portion of the graph is the worst case: a model that is both inaccurate and inconsistent, like an archer who sprays arrows all over the target. The bias-variance tradeoff doesn’t forbid this region - it says we can’t drive both bias and variance arbitrarily low at the same time - so tuning a model is really about navigating between the overfitting and underfitting regimes toward the Goldilocks Zone.</p>
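<p>Before we move on, the regimes are easy to reproduce on synthetic data. Here’s a quick sketch (my own toy example, not this post’s dataset): fit polynomials of degree 1, 4, and 15 to a noisy sine wave. Degree 1 underfits (high bias), degree 15 overfits (high variance), and degree 4 sits near the Goldilocks Zone.</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)

# interleave points into train and test halves
X_tr, X_te, y_tr, y_te = X[::2], X[1::2], y[::2], y[1::2]

errors = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    errors[degree] = (mean_squared_error(y_tr, model.predict(X_tr)),   # train MSE
                      mean_squared_error(y_te, model.predict(X_te)))   # test MSE
```

<p>You should see degree 1 post high error on both splits, while degree 15 drives train error far below its test error - the overfitting signature.</p>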
<p>Alright, let’s shift gears to see this in practice now that we’ve got the theory down.</p>
<h2 id="application">Application</h2>
<p>Let’s build a linear regression model of the <a href="http://archive.ics.uci.edu/ml/datasets/Forest+Fires">Forest Fire</a> dataset. We’ll investigate whether our model is underfitting, overfitting, or fitting just right. If it’s under or overfitting, we’ll look at one way we can correct that.</p>
<p>Time to build the model.</p>
<blockquote>
<p>Note: I’ll use <strong>train_error</strong> to represent <strong>training error</strong> and <strong>test_error</strong> to represent <strong>validation error</strong>.</p>
</blockquote>
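<p>A quick aside on that helper: <code class="highlighter-rouge">calc_metrics</code> was defined earlier in this post. If you’re dropping in at this section, here’s a minimal stand-in (my sketch - the original may differ in details) that fits a model and returns train and test MSE:</p>

```python
from sklearn.metrics import mean_squared_error

def calc_metrics(X_train, y_train, X_test, y_test, model):
    """Fit model on the training set; return (train_error, test_error) as MSE."""
    model.fit(X_train, y_train)
    train_error = mean_squared_error(y_train, model.predict(X_train))
    test_error = mean_squared_error(y_test, model.predict(X_test))
    return train_error, test_error
```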
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lr = LinearRegression(fit_intercept=True)
train_error, test_error = calc_metrics(X_train, y_train, X_test, y_test, lr)
train_error, test_error = round(train_error, 3), round(test_error, 3)
print('train error: {} | test error: {}'.format(train_error, test_error))
print('train/test: {}'.format(round(test_error/train_error, 1)))
</code></pre></div></div>
<p>The output looks like:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>train error: 21.874 | test error: 23.817
train/test: 1.1
</code></pre></div></div>
<p>Hmm, our training error is somewhat lower than the test error. In fact, the test error is 1.1 times the training error, or about 10% worse. It’s not a big difference, but it’s worth investigating.</p>
<p>Which region does that put us in?</p>
<p>That’s right, it’s ever so slightly in the <em>High Variance</em> region, which means our model is slightly overfitting. Again, that means our model has a tad too much complexity.</p>
<p>Unfortunately, we’re stuck at this point.</p>
<p>You’re probably thinking, <em>“Hey wait, no we’re not. I can drop a feature or two and then recalculate training error and test error.”</em></p>
<p>My response is simply: <em>NOPE. DON’T. PLEASE. EVER. FOR ANY REASON. PERIOD.</em></p>
<p>Why not?</p>
<p>Because if you do that, your test set is no longer a test set: you are using it to train your model. It’s the same as if you had trained your model on all the data from the beginning. Seriously, don’t do this. Unfortunately, practicing data scientists sometimes do; it’s one of the worst mistakes you can make because you’re almost guaranteed to produce a model that cannot generalize.</p>
<p>So what do we do?</p>
<p>We need to go back to the beginning. We need to split our data into three datasets: training, validation, test.</p>
<p>Remember, the test set is data you don’t touch until you’re happy with your model. The test set is used only <strong>ONE</strong> time to see how your model will generalize. That’s it.</p>
<p>Okay, let’s take a look at this thing called a <strong>Validation Set</strong>.</p>
<h2 id="validation-set">Validation Set</h2>
<p>Three datasets from one seems like a lot of work but I promise it’s worth it. First, let’s see how to do this in practice.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.model_selection import train_test_split

# intermediate/test split (gives us test set)
X_intermediate, X_test, y_intermediate, y_test = train_test_split(data,
                                                                  target,
                                                                  shuffle=True,
                                                                  test_size=0.2,
                                                                  random_state=15)

# train/validation split (gives us train and validation sets)
X_train, X_validation, y_train, y_validation = train_test_split(X_intermediate,
                                                                y_intermediate,
                                                                shuffle=False,
                                                                test_size=0.25,
                                                                random_state=2018)
</code></pre></div></div>
<p>Now for a little cleanup and some output:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># delete intermediate variables
del X_intermediate, y_intermediate

# print proportions
print('train: {}% | validation: {}% | test: {}%'.format(round(100*len(y_train)/len(target)),
                                                        round(100*len(y_validation)/len(target)),
                                                        round(100*len(y_test)/len(target))))
</code></pre></div></div>
<p>Which outputs:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>train: 60% | validation: 20% | test: 20%
</code></pre></div></div>
<p>If you’re a visual person, this is how our data has been segmented.</p>
<p><img src="/assets/images/train-validate-test.png?raw=true" alt="Train-Validate-Test Sets" class="center-image" /></p>
<p>We now have three datasets, depicted by the graphic above, where the training set constitutes 60% of all data, the validation set 20%, and the test set 20%. Notice that I haven’t changed the actual test set in any way: I used the same initial split and the same random state. That way we can compare the model we’re about to fit and tune to the linear regression model we built earlier.</p>
<blockquote>
<p>Side note: there is no hard and fast rule about how to proportion your data. Just know that your model is limited in what it can learn if you limit the data you feed it. However, if your test set is too small, it won’t provide an accurate estimate as to how your model will perform. Cross-validation allows us to handle this situation with ease, but more on that later.</p>
</blockquote>
<p>Time to fit and tune our model.</p>
<h2 id="model-tuning">Model Tuning</h2>
<p>We need to decrease complexity. One way to do this is by using <em>regularization</em>. I won’t go into the nitty gritty of how regularization works now because I’ll cover that in a future post. Just know that regularization is a form of constrained optimization that imposes limits on determining model parameters. It effectively allows me to add bias to a model that’s overfitting. I can control the amount of bias with a hyperparameter called <em>lambda</em> or <em>alpha</em> (you’ll see both, though sklearn uses alpha because lambda is a Python keyword) that defines regularization strength.</p>
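<p>To see that bias knob in action, here’s a toy sketch (synthetic data of my own, not the Forest Fire model): as alpha grows, ridge regression shrinks the coefficient vector toward zero, trading variance for bias.</p>

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y = X @ true_coef + rng.normal(0, 0.1, 100)

# the L2 norm of the fitted coefficients shrinks as regularization strength grows
norms = {alpha: np.linalg.norm(Ridge(alpha=alpha).fit(X, y).coef_)
         for alpha in (0.01, 1.0, 100.0)}
```

<p>Printing <code class="highlighter-rouge">norms</code> shows the coefficient norm falling monotonically as alpha climbs from 0.01 to 100.</p>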
<p>The code:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

alphas = [0.001, 0.01, 0.1, 1, 10]
print('All errors are MSE')
print('-'*76)
for alpha in alphas:
    # instantiate and fit model
    ridge = Ridge(alpha=alpha, fit_intercept=True, random_state=99)
    ridge.fit(X_train, y_train)
    # calculate errors
    new_train_error = mean_squared_error(y_train, ridge.predict(X_train))
    new_validation_error = mean_squared_error(y_validation, ridge.predict(X_validation))
    new_test_error = mean_squared_error(y_test, ridge.predict(X_test))
    # print errors as report
    print('alpha: {:7} | train error: {:5} | val error: {:6} | test error: {}'.
          format(alpha,
                 round(new_train_error,3),
                 round(new_validation_error,3),
                 round(new_test_error,3)))
</code></pre></div></div>
<p>And the output:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>All errors are MSE
----------------------------------------------------------------------------
alpha: 0.001 | train error: 22.93 | val error: 19.796 | test error: 23.959
alpha: 0.01 | train error: 22.93 | val error: 19.792 | test error: 23.944
alpha: 0.1 | train error: 22.945 | val error: 19.779 | test error: 23.818
alpha: 1 | train error: 23.324 | val error: 20.135 | test error: 23.522
alpha: 10 | train error: 24.214 | val error: 20.958 | test error: 23.356
</code></pre></div></div>
<p>There are a few key takeaways here. First, notice the U-shaped behavior of the validation error: it starts at 19.796, drops for two steps, then climbs back up. Also notice that validation error and test error tend to move together, though the relationship is by no means perfect - both errors decrease as alpha increases initially, but then test error keeps falling while validation error turns back up. That looseness has a lot to do with the fact that we’re dealing with a very small dataset: each sample represents a much larger proportion of the data than it would in a dataset with a million or more records. Even so, validation error is a good proxy for test error, especially as dataset size increases. With small to medium-sized datasets, we can do better by leveraging cross-validation. We’ll talk about that shortly.</p>
<p>Now that we’ve tuned our model, let’s fit a new ridge regression model on all data except the test data. Then we’ll check the test error and compare it to that of our original linear regression model with all features.</p>
<h4 id="setup-data-model--calculate-errors">Setup Data, Model, & Calculate Errors</h4>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># train/test split
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    target,
                                                    shuffle=True,
                                                    test_size=0.2,
                                                    random_state=15)

# instantiate model
ridge = Ridge(alpha=0.11, fit_intercept=True, random_state=99)

# fit and calculate errors
new_train_error, new_test_error = calc_metrics(X_train, y_train, X_test, y_test, ridge)
new_train_error, new_test_error = round(new_train_error, 3), round(new_test_error, 3)
</code></pre></div></div>
<h4 id="report-errors">Report Errors</h4>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>print('ORIGINAL ERROR')
print('-' * 40)
print('train error: {} | test error: {}\n'.format(train_error, test_error))
print('ERROR w/REGULARIZATION')
print('-' * 40)
print('train error: {} | test error: {}'.format(new_train_error, new_test_error))
</code></pre></div></div>
<p>Here’s that output:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ORIGINAL ERROR
----------------------------------------
train error: 21.874 | test error: 23.817
ERROR w/REGULARIZATION
----------------------------------------
train error: 21.883 | test error: 23.673
</code></pre></div></div>
<p>A very small increase in training error coupled with a small decrease in test error. We’re definitely moving in the right direction. Perhaps not quite the magnitude of change we expected, but we’re simply trying to prove a point here. Remember this is a tiny dataset. Also remember I said we can do better by using something called <em>Cross-Validation</em>. Now’s the time to talk about that.</p>
<h2 id="cross-validation">Cross-Validation</h2>
<p>Let me say this upfront: this method works great on small to medium-sized datasets. This is absolutely not the kind of thing you’d want to try on a massive dataset (think tens or hundreds of millions of rows and/or columns). Alright, let’s dig in now that that’s out of the way.</p>
<p>As we saw in the post about <a href="https://dziganto.github.io/data%20science/machine%20learning/model%20tuning/python/Model-Tuning-Train-Test-Split/">train/test split</a>, how you split smaller datasets makes a significant difference; the results can vary tremendously. As the random state is not a hyperparameter (seriously, please don’t do that), we need a way to extract every last bit of signal from the data that we possibly can. So instead of just one train/validation split, let’s do K of them.</p>
<p>This technique is appropriately named <em>K-fold cross-validation</em>. Again, K represents how many train/validation splits you need. There’s no hard and fast rule about how to choose K but there are better and worse choices. As the size of your dataset grows, you can get away with smaller values for K, like 3 or 5. When your dataset is small, it’s common to select a larger number like 10. Again, these are just rules of thumb.</p>
<p>Here’s the general idea for 10-fold CV:</p>
<p><img src="/assets/images/kfold-cross-validation.png?raw=true" alt="Cross-Validation" class="center-image" /></p>
<p>You segment off a percentage of your training data as a validation fold.</p>
<blockquote>
<p><strong>Technical note:</strong> Be careful with terminology. Some people refer to the <em>validation fold</em> as the <em>test fold</em> and use the terms interchangeably. That’s confusing, and strictly speaking it’s incorrect: the test set is the pristine data that only gets consumed at the very end, if it exists at all. Don’t conflate the two.</p>
</blockquote>
<p>Once data has been segmented off in the validation fold, you fit a fresh model on the remaining training data. Ideally, you calculate train and validation error. Some people only look at validation error, however.</p>
<p>The data included in the first validation fold will never be part of a validation fold again. A new validation fold is created, segmenting off the same percentage of data as in the first iteration. Then the process repeats: fit a fresh model, calculate key metrics, iterate. The algorithm concludes after K rounds, leaving you with K estimates of the validation error; along the way, each data point has landed in a validation fold exactly once and in training folds K-1 times. The last step is to average those K validation errors (for regression). This gives a good estimate of how well a particular model will perform.</p>
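<p>That bookkeeping is easy to verify with sklearn’s <code class="highlighter-rouge">KFold</code> on a toy array: across the K splits, every sample lands in a validation fold exactly once.</p>

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

val_indices = []
for train_idx, val_idx in kf.split(X):
    # each iteration: 16 samples to train on, 4 held out for validation
    val_indices.extend(val_idx)

# the validation folds partition the dataset: each index appears exactly once
assert sorted(val_indices) == list(range(20))
```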
<p>Again, this method is invaluable for tuning hyperparameters on small to medium-sized datasets. You technically don’t even need a test set. That’s great if you just don’t have the data. For large datasets, use a simple train/validation/test split strategy and tune your hyperparameters like we did in the previous section.</p>
<p>Alright, let’s see K-fold CV in action.</p>
<h2 id="sklearn--cv">Sklearn & CV</h2>
<p>There are two ways to do this in sklearn, depending on what you want to get out of it.</p>
<p>The first method I’ll show you is <code class="highlighter-rouge">cross_val_score</code>, which works beautifully if all you care about is validation error.</p>
<p>The second method is <code class="highlighter-rouge">KFold</code>, which is perfect if you require train and validation errors.</p>
<p>Let’s try a new model called <strong>LASSO</strong> just to keep things interesting.</p>
<h3 id="cross_val_score">cross_val_score</h3>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

alphas = [1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1]
val_errors = []
for alpha in alphas:
    lasso = Lasso(alpha=alpha, fit_intercept=True, random_state=77)
    errors = np.sum(-cross_val_score(lasso,
                                     data,
                                     y=target,
                                     scoring='neg_mean_squared_error',
                                     cv=10,
                                     n_jobs=-1))
    val_errors.append(np.sqrt(errors))
</code></pre></div></div>
<p>Let’s checkout the validation errors associated with each alpha.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># RMSE
print(val_errors)
</code></pre></div></div>
<p>Which returns:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[18.64401379981868, 18.636528438323769, 18.578057471596566, 18.503285318281634, 18.565586130742307, 21.412874355105991]
</code></pre></div></div>
<p>Which value of alpha gave us the smallest validation error?</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>print('best alpha: {}'.format(alphas[np.argmin(val_errors)]))
</code></pre></div></div>
<p>Which returns: <code class="highlighter-rouge">best alpha: 0.1</code></p>
<h3 id="k-fold">K-Fold</h3>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.model_selection import KFold

K = 10
kf = KFold(n_splits=K, shuffle=True, random_state=42)

for alpha in alphas:
    train_errors = []
    validation_errors = []
    for train_index, val_index in kf.split(data, target):
        # split data
        X_train, X_val = data[train_index], data[val_index]
        y_train, y_val = target[train_index], target[val_index]
        # instantiate model
        lasso = Lasso(alpha=alpha, fit_intercept=True, random_state=77)
        # calculate errors
        train_error, val_error = calc_metrics(X_train, y_train, X_val, y_val, lasso)
        # append to appropriate list
        train_errors.append(train_error)
        validation_errors.append(val_error)
    # generate report
    print('alpha: {:6} | mean(train_error): {:7} | mean(val_error): {}'.
          format(alpha,
                 round(np.mean(train_errors),4),
                 round(np.mean(validation_errors),4)))
</code></pre></div></div>
<p>Here’s that output:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>alpha: 0.0001 | mean(train_error): 21.8217 | mean(val_error): 23.3633
alpha: 0.001 | mean(train_error): 21.8221 | mean(val_error): 23.3647
alpha: 0.01 | mean(train_error): 21.8583 | mean(val_error): 23.4126
alpha: 0.1 | mean(train_error): 22.9727 | mean(val_error): 24.6014
alpha: 1 | mean(train_error): 26.7371 | mean(val_error): 28.236
alpha: 10.0 | mean(train_error): 40.183 | mean(val_error): 40.9859
</code></pre></div></div>
<p>Comparing the output of <em>cross_val_score</em> to that of <em>KFold</em>, we can see that the general trend holds - an alpha of 10 results in the largest validation error. You may wonder why the actual values differ: the two runs split the data differently, and the <em>cross_val_score</em> block above reports the square root of the <em>summed</em> fold errors while the <em>KFold</em> block reports their mean. The important thing is that each gives us a viable method to calculate whatever we need, whether it be purely validation error or a combination of training and validation error.</p>
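<p>As it turns out, you can line the splits up: <code class="highlighter-rouge">cross_val_score</code> accepts a CV splitter object for its <code class="highlighter-rouge">cv</code> argument, so handing both procedures the same <code class="highlighter-rouge">KFold</code> instance yields identical folds. A sketch on synthetic data (my example, not the Forest Fire set):</p>

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# one KFold object drives both procedures, so the splits are identical
cvs_errors = -cross_val_score(Lasso(alpha=0.1), X, y,
                              scoring='neg_mean_squared_error', cv=kf)

manual_errors = []
for train_idx, val_idx in kf.split(X):
    model = Lasso(alpha=0.1).fit(X[train_idx], y[train_idx])
    manual_errors.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

# fold-by-fold errors from the two procedures now match
```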
<blockquote>
<p>Update: sklearn has a method called <a href="http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html">cross_validate</a> that will capture training and validation errors for you. It’ll even spit out how long it took to train a model for each fold as well as the time it took to score the model on each validation set.</p>
</blockquote>
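<p>Here’s a quick sketch of <code class="highlighter-rouge">cross_validate</code> with <code class="highlighter-rouge">return_train_score=True</code> (again on synthetic data for illustration):</p>

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

results = cross_validate(Ridge(alpha=1.0), X, y,
                         scoring='neg_mean_squared_error',
                         cv=5,
                         return_train_score=True)

# per-fold training and validation errors, plus fit/score timings
train_mse = -results['train_score'].mean()
val_mse = -results['test_score'].mean()
```

<p>The returned dict carries <code class="highlighter-rouge">fit_time</code> and <code class="highlighter-rouge">score_time</code> arrays alongside the per-fold scores.</p>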
<h2 id="wrap-up">Wrap Up</h2>
<p>Once you’ve tuned your hyperparameters, what do you do? Simply train a fresh model on all the data so you can extract as much information as possible. That way your model will have the best predictive power on future data. Mission complete!</p>
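<p>In code, that last step is just a refit on everything. Here’s a sketch with stand-in data (<code class="highlighter-rouge">make_regression</code> substitutes for this post’s <code class="highlighter-rouge">data</code>/<code class="highlighter-rouge">target</code>; the alpha of 0.1 echoes the tuned value from earlier):</p>

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# stand-ins for the post's `data` and `target`
data, target = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# refit the tuned model on ALL the data before putting it to work
final_model = Ridge(alpha=0.1, fit_intercept=True).fit(data, target)
predictions = final_model.predict(data)
```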
<h2 id="summary">Summary</h2>
<p>We discussed the <em>Bias-Variance Tradeoff</em>, where a high bias model is one that is underfit while a high variance model is one that is overfit. We also learned that we can split data into three groups for tuning purposes: train, validation, and test. Remember, the test set is used only <em>one</em> time to check how well a model generalizes on data it’s never seen. This three-group split works exceedingly well for large datasets, but not for small to medium-sized ones. In that case, use cross-validation (CV). CV can help you tune your models and extract as much signal as possible from the small data sample. Remember, with CV you don’t need a test set. By using a K-fold approach, you get the equivalent of K test sets with which to check validation error. This helps you diagnose where you’re at in the bias-variance regime.</p>