<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Machine Learning Notebook</title>
    <link>/index.xml</link>
    <description>Recent content on Machine Learning Notebook</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 04 Jan 2018 10:13:20 +0000</lastBuildDate>
    <atom:link href="/index.xml" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Data Augmentations for n-Dimensional Image Input to CNNs</title>
      <link>/post/dataaug/</link>
      <pubDate>Thu, 04 Jan 2018 10:13:20 +0000</pubDate>
      
      <guid>/post/dataaug/</guid>
      <description>&lt;p&gt;One of the greatest limiting factors for training effective deep learning frameworks is the availability, quality and organisation of the &lt;em&gt;training data&lt;/em&gt;. To perform well on classification tasks, we need to show our CNNs (and similar models) as many examples as we possibly can. However, this is not always possible, especially in situations where the training data is hard to collect, e.g. medical image data. In this post, we will learn how to apply &lt;em&gt;data augmentation&lt;/em&gt; strategies to n-dimensional images to get the most out of our limited number of examples.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;h2 id=&#34;intro&#34;&gt; Introduction &lt;/h2&gt;

&lt;p&gt;If we take any image, like our little Android below, and shift all of the data in the image to the right by a single pixel, you may struggle to see any difference visually. Numerically, however, this may as well be a completely different image! Imagine taking a stack of 10 of these images, each shifted by a single pixel compared to the previous one. Now consider the pixels at some arbitrary location, say [20, 25]. Focusing on that point, each image has a different colour there, a different average surrounding intensity and so on. A CNN takes these values into account when performing convolutions and deciding upon weights. If we supplied this set of 10 images to a CNN, we would effectively be teaching it to be invariant to these kinds of translations.&lt;/p&gt;
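&lt;p&gt;To see this numerically, here is a minimal sketch (a toy 5 x 5 ramp standing in for the Android image, not the post&amp;rsquo;s actual data): shifting every row one pixel to the right changes the value at most pixel positions even though the picture would look almost identical.&lt;/p&gt;

```python
import numpy as np

# Toy 5x5 "image" (a hypothetical stand-in for the Android picture)
img = np.arange(25, dtype=float).reshape(5, 5)

# Shift every row one pixel to the right, replicating the left edge
shifted = np.empty_like(img)
shifted[:, 1:] = img[:, :-1]
shifted[:, 0] = img[:, 0]

# Fraction of pixel positions whose value changed
print(np.mean(img != shifted))   # 0.8
```

Four out of every five pixels now hold a different value, which is why the network treats the shifted copy as genuinely new information.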

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34;  style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/android.jpg&#34; &gt;&lt;br&gt;
&lt;b&gt;Android&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/android1px.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Shifted 1 pixel right&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/android10px.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Shifted 10 pixels right&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Of course, translations are not the only way in which an image can change whilst still &lt;em&gt;visually&lt;/em&gt; being the same image. Consider rotating the image by a single degree, or 5 degrees: it&amp;rsquo;s still an Android. Training a CNN without including translated and rotated versions of the image may cause the CNN to &lt;strong&gt;overfit&lt;/strong&gt; and assume that all images of Androids have to be perfectly upright and centered.&lt;/p&gt;

&lt;p&gt;Providing deep learning frameworks with images that are translated, rotated, scaled, intensity-adjusted and flipped is what we mean when we talk about &lt;em&gt;data augmentation&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In this post we&amp;rsquo;ll look at how to apply these transformations to an image, even in 3D, and see how they affect the performance of a deep learning framework. We will use an image from &lt;em&gt;flickr&lt;/em&gt; user &lt;a href=&#34;https://www.flickr.com/photos/andy_emcee/6416366321&#34; title=&#34;Cat and Dog Image&#34;&gt;andy_emcee&lt;/a&gt; as an example of a 2D natural image. As this is an RGB (colour) image it has shape [512, 640, 3], one layer for each colour channel. We could take one layer to make this grayscale and truly 2D, but most images we deal with will be colour, so let&amp;rsquo;s leave it. For 3D we will use a 3D MRI scan.&lt;/p&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:49%; margin:auto;min-width:350px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34; height=300 src=&#34;/img/augmentation/naturalimg.jpg&#34;&gt;&lt;br&gt;
&lt;b&gt;RGB Image shape=[512, 640, 3]&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&#34;augs&#34;&gt; Augmentations &lt;/h2&gt;

&lt;p&gt;As usual, we are going to write our augmentation functions in python. We&amp;rsquo;ll just be using simple functions from &lt;code&gt;numpy&lt;/code&gt; and &lt;code&gt;scipy&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&#34;translate&#34;&gt; Translation &lt;/h3&gt;

&lt;p&gt;In our functions, &lt;code&gt;image&lt;/code&gt; is a 2D or 3D array. If it&amp;rsquo;s a 3D array, we need to be careful about specifying our translation directions in the argument called &lt;code&gt;offset&lt;/code&gt;. We don&amp;rsquo;t really want to move images in the &lt;code&gt;z&lt;/code&gt; direction for a couple of reasons. Firstly, if it&amp;rsquo;s a 2D image, the third dimension will be the colour channel; if we shift along this dimension we just push the colour channels out of place, corrupting the colours for shifts of &lt;code&gt;-2&lt;/code&gt; or &lt;code&gt;2&lt;/code&gt; and blanking the image entirely for anything larger. Secondly, in a full 3D image, the third dimension is often the smallest, e.g. in most medical scans. In our translation function below, the &lt;code&gt;offset&lt;/code&gt; is given as a length-2 array defining the shift in the &lt;code&gt;y&lt;/code&gt; and &lt;code&gt;x&lt;/code&gt; directions respectively (don&amp;rsquo;t forget index 0 is which horizontal row we&amp;rsquo;re at in python). We hard-code the z-direction shift to &lt;code&gt;0&lt;/code&gt;, but you&amp;rsquo;re welcome to change this if your use-case demands it. To ensure we get integer-pixel shifts, we enforce type &lt;code&gt;int&lt;/code&gt; too.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import scipy.ndimage

def translateit(image, offset, isseg=False):
    order = 0 if isseg else 5

    # shift only in y and x; the third (z / channel) axis stays fixed
    return scipy.ndimage.shift(image, (int(offset[0]), int(offset[1]), 0), order=order, mode=&#39;nearest&#39;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here we have also provided an option for what kind of interpolation to perform: &lt;code&gt;order = 0&lt;/code&gt; means to just use the nearest-neighbour pixel intensity, and &lt;code&gt;order = 5&lt;/code&gt; means to perform b-spline interpolation of order 5 (taking into account many pixels around the target). This is triggered with a Boolean argument to the &lt;code&gt;translateit&lt;/code&gt; function called &lt;code&gt;isseg&lt;/code&gt;, so named because when dealing with image-segmentations we want to keep their integer class numbers and not get a result which is a float with a value between two classes. This is not a problem with the actual image, as we want to retain as much visual smoothness as possible (though there is an argument that we&amp;rsquo;re introducing data which didn&amp;rsquo;t exist in the original image). Similarly, when we move our image, we leave a gap around the edges it has moved away from. We need a way to fill in this gap: by default &lt;code&gt;shift&lt;/code&gt; will use a constant value set to &lt;code&gt;0&lt;/code&gt;. This may not be helpful in some cases, so it&amp;rsquo;s best to set the &lt;code&gt;mode&lt;/code&gt; to &lt;code&gt;&#39;nearest&#39;&lt;/code&gt;, which takes the closest pixel-value and replicates it. It&amp;rsquo;s barely noticeable with small shifts but looks wrong at larger offsets, so we need to be careful and only apply small translations to our data.&lt;/p&gt;
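&lt;p&gt;A quick way to see the difference between the two fill modes is to shift a small ramp with &lt;code&gt;scipy.ndimage.shift&lt;/code&gt; directly (a sketch on a made-up array, assuming scipy is available):&lt;/p&gt;

```python
import numpy as np
from scipy.ndimage import shift

# 4x4 ramp: row 1 is [4, 5, 6, 7]
img = np.arange(16, dtype=float).reshape(4, 4)

# Shift two pixels right; the default fill is a constant 0.0
zero_fill = shift(img, (0, 2), order=0)

# mode='nearest' replicates the closest edge pixel into the gap
edge_fill = shift(img, (0, 2), order=0, mode='nearest')

print(zero_fill[1])   # [0. 0. 4. 5.]
print(edge_fill[1])   # [4. 4. 4. 5.]
```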

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34;  style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimg.jpg&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimgtrans5px.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Shifted 5 pixels right&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimgtrans25px.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Shifted 25 pixels right&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimg.png&#34; &gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrseg.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image and Segmentation&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimgtrans1.png&#34;&gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrsegtrans1.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Shifted [-3, 1] pixels&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimgtrans2.png&#34;&gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrsegtrans2.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Shifted [4, -5] pixels&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;scale&#34;&gt; Scaling &lt;/h3&gt;

&lt;p&gt;When scaling an image, i.e. zooming in and out, we want to increase or decrease the area our image takes up whilst keeping the image dimensions the same. We scale our image by a certain &lt;code&gt;factor&lt;/code&gt;: a &lt;code&gt;factor &amp;gt; 1.0&lt;/code&gt; means the image scales up, and &lt;code&gt;factor &amp;lt; 1.0&lt;/code&gt; scales the image down. Note that we should provide a factor for each dimension: if we want to keep the same number of layers or slices in our image, we should set the last value to &lt;code&gt;1.0&lt;/code&gt;. To determine the intensity of the resulting image at each pixel, we take the lattice (grid) on which each pixel sits and use it to perform &lt;em&gt;interpolation&lt;/em&gt; of the surrounding pixel intensities. &lt;code&gt;scipy&lt;/code&gt; provides a handy function for this called &lt;code&gt;zoom&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;The definition is probably more complex than one would think:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def scaleit(image, factor, isseg=False):
    order = 0 if isseg else 3

    height, width, depth = image.shape
    zheight             = int(np.round(factor * height))
    zwidth              = int(np.round(factor * width))
    zdepth              = depth

    if factor &amp;lt; 1.0:
        newimg  = np.zeros_like(image)
        row     = (height - zheight) // 2
        col     = (width - zwidth) // 2
        layer   = (depth - zdepth) // 2
        newimg[row:row+zheight, col:col+zwidth, layer:layer+zdepth] = interpolation.zoom(image, (float(factor), float(factor), 1.0), order=order, mode=&#39;nearest&#39;)[0:zheight, 0:zwidth, 0:zdepth]

        return newimg

    elif factor &amp;gt; 1.0:
        row     = (zheight - height) // 2
        col     = (zwidth - width) // 2
        layer   = (zdepth - depth) // 2

        newimg = interpolation.zoom(image[row:row+zheight, col:col+zwidth, layer:layer+zdepth], (float(factor), float(factor), 1.0), order=order, mode=&#39;nearest&#39;)  
        
        extrah = (newimg.shape[0] - height) // 2
        extraw = (newimg.shape[1] - width) // 2
        extrad = (newimg.shape[2] - depth) // 2
        newimg = newimg[extrah:extrah+height, extraw:extraw+width, extrad:extrad+depth]

        return newimg

    else:
        return image
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are three possibilities to consider: scaling up, scaling down, or no scaling. In each case, we want to return an array that is &lt;em&gt;equal in size&lt;/em&gt; to the input &lt;code&gt;image&lt;/code&gt;. For the scaling-down case, this involves making a blank image the same shape as the input and finding the corresponding box in the resulting scaled image. For scaling up, it&amp;rsquo;s unnecessary to perform the scaling on the whole image, just the portion that will be &amp;lsquo;zoomed&amp;rsquo;, so we pass only part of the array to the &lt;code&gt;zoom&lt;/code&gt; function. There may also be some error in the final shape due to rounding, so we trim the extra rows and columns before passing it back. When no scaling is done, we just return the original image.&lt;/p&gt;
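&lt;p&gt;The shape-preserving trick in the scale-up branch can be sketched in a few lines with &lt;code&gt;zoom&lt;/code&gt; on a random toy volume (the sizes here are arbitrary, not from the post): zoom in, then trim the extra border symmetrically so the output matches the input shape.&lt;/p&gt;

```python
import numpy as np
from scipy.ndimage import zoom

img = np.random.rand(40, 40, 3)

# Zoom in by 25% in-plane, leaving the channel axis untouched
big = zoom(img, (1.25, 1.25, 1.0), order=1, mode='nearest')

# Trim the extra border symmetrically, as scaleit does when factor is above 1.0
eh = (big.shape[0] - 40) // 2
ew = (big.shape[1] - 40) // 2
cropped = big[eh:eh + 40, ew:ew + 40, :]

print(big.shape, cropped.shape)   # (50, 50, 3) (40, 40, 3)
```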

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34;  style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimg.jpg&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimgscale075.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Scale-factor 0.75&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimgscale125.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Scale-factor 1.25&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimg.png&#34; &gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrseg.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image and Segmentation&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimgscale1.png&#34;&gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrsegscale1.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Scale-factor 1.07&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimgscale2.png&#34;&gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrsegscale2.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Scale-factor 0.95&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#39;resample&#39;&gt; Resampling &lt;/h3&gt;

&lt;p&gt;It may be the case that we want to change the dimensions of our image such that they fit nicely into the input of our CNN. For example, most images and photographs have one dimension larger than the other or may be of different resolutions. This may not be the case in our training set, but most CNNs prefer to have inputs that are square and of identical sizes. We can use the same &lt;code&gt;scipy&lt;/code&gt; function &lt;code&gt;interpolation.zoom&lt;/code&gt; to do this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def resampleit(image, dims, isseg=False):
    order = 0 if isseg else 5

    image = interpolation.zoom(image, np.array(dims)/np.array(image.shape, dtype=np.float32), order=order, mode=&#39;nearest&#39;)

    if image.shape[-1] == 3: #rgb image
        return image
    else:
        return image if isseg else (image-image.min())/(image.max()-image.min()) 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The key part here is that we&amp;rsquo;ve replaced the &lt;code&gt;factor&lt;/code&gt; argument with &lt;code&gt;dims&lt;/code&gt; of type &lt;code&gt;list&lt;/code&gt;. &lt;code&gt;dims&lt;/code&gt; should have length equal to the number of dimensions of our image i.e. 2 or 3. We are calculating the factor that each dimension needs to change by in order to change the image to the target &lt;code&gt;dims&lt;/code&gt;. We&amp;rsquo;ve forced the denominator of the scaling factor to be of type &lt;code&gt;float&lt;/code&gt; so that the resulting factor is also &lt;code&gt;float&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In this step, we are also changing the intensities of the image to use the full range from &lt;code&gt;0.0&lt;/code&gt; to &lt;code&gt;1.0&lt;/code&gt;. This ensures that all of our image intensities fall over the same range - one fewer thing for the network to be biased against. Again, note that we don&amp;rsquo;t want to do this for our segmentations, as the pixel &amp;lsquo;intensities&amp;rsquo; are actually labels. We could do this in a separate function, but I want this to happen to all of my images at this point. There&amp;rsquo;s no difference in the visual display of the images because they are automatically rescaled to use the full range of display colours.&lt;/p&gt;
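&lt;p&gt;The per-dimension factors and the rescaling to the [0.0, 1.0] range can be checked on a made-up volume (the sizes below are illustrative only):&lt;/p&gt;

```python
import numpy as np
from scipy.ndimage import zoom

vol = np.random.rand(25, 30, 4) * 1000.0   # arbitrary intensity range
dims = [56, 56, 4]

# One zoom factor per dimension, float division as in resampleit
factors = np.array(dims) / np.array(vol.shape, dtype=np.float32)
res = zoom(vol, factors, order=1, mode='nearest')

# Map intensities onto [0.0, 1.0] (skipped for segmentations)
res = (res - res.min()) / (res.max() - res.min())

print(res.shape, res.min(), res.max())
```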

&lt;h3 id=&#34;rotate&#34;&gt; Rotation &lt;/h3&gt;

&lt;p&gt;This function utilises another &lt;code&gt;scipy&lt;/code&gt; function called &lt;code&gt;rotate&lt;/code&gt;. It takes a &lt;code&gt;float&lt;/code&gt; for the &lt;code&gt;theta&lt;/code&gt; argument which specifies the number of degrees of the rotation (negative numbers rotate anti-clockwise). We want the returned image to be of the same shape as the input &lt;code&gt;image&lt;/code&gt;, so &lt;code&gt;reshape = False&lt;/code&gt; is used. Again we need to specify the &lt;code&gt;order&lt;/code&gt; of the interpolation on the new lattice. The rotate function handles 3D images by rotating each slice by the same &lt;code&gt;theta&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def rotateit(image, theta, isseg=False):
    order = 0 if isseg else 5
        
    return rotate(image, float(theta), reshape=False, order=order, mode=&#39;nearest&#39;)
&lt;/code&gt;&lt;/pre&gt;
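&lt;p&gt;A quick sanity check (on a random array, purely illustrative) that &lt;code&gt;reshape=False&lt;/code&gt; keeps the output the same shape as the input:&lt;/p&gt;

```python
import numpy as np
from scipy.ndimage import rotate

img = np.random.rand(40, 50, 3)

# reshape=False clips the rotated corners so the shape is unchanged
rot = rotate(img, 10.0, reshape=False, order=1, mode='nearest')

print(rot.shape)   # (40, 50, 3)
```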

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34;  style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimg.jpg&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimgrotate-10.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Theta = -10.0 &lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimgrotate10.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Theta = 10.0&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimg.png&#34; &gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrseg.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image and Segmentation&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimgrotate1.png&#34;&gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrsegrotate1.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Theta = 6.18&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimgrotate2.png&#34;&gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrsegrotate2.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Theta = -1.91&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;intensify&#34;&gt; Intensity Changes &lt;/h3&gt;

&lt;p&gt;The final augmentation we can perform is a scaling of the intensity of the pixels. This effectively brightens or dims the image by applying a blanket multiplication across all pixels. We specify the amount by a factor: &lt;code&gt;factor &amp;lt; 1.0&lt;/code&gt; will dim the image, and &lt;code&gt;factor &amp;gt; 1.0&lt;/code&gt; will brighten it. Note that we don&amp;rsquo;t want &lt;code&gt;factor = 0.0&lt;/code&gt; as this will blank the image.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def intensifyit(image, factor):

    return image*float(factor)
&lt;/code&gt;&lt;/pre&gt;
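&lt;p&gt;Because the change is a plain multiplication, its effect is easy to verify on a uniform toy image (the values below are made up):&lt;/p&gt;

```python
import numpy as np

img = np.full((2, 2), 100.0)

brighter = img * 1.2   # every pixel becomes 120.0
dimmer   = img * 0.8   # every pixel becomes 80.0
blank    = img * 0.0   # factor 0.0 wipes the image

print(brighter[0, 0], dimmer[0, 0], blank.max())
```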

&lt;h3 id=&#34;flip&#34;&gt; Flipping &lt;/h3&gt;

&lt;p&gt;One of the most common image augmentation procedures for natural images (dogs, cats, landscapes etc.) is flipping. The premise is that a dog is a dog no matter which way it&amp;rsquo;s facing, and it doesn&amp;rsquo;t matter whether a tree is on the right or the left of an image - it&amp;rsquo;s still a tree.&lt;/p&gt;

&lt;p&gt;We can do horizontal flipping (left-to-right) or vertical flipping (up-and-down). It may make sense to do only one of these (if we know that dogs don&amp;rsquo;t walk on their heads, for example). In this case, we can specify a &lt;code&gt;list&lt;/code&gt; of 2 boolean values: if each is &lt;code&gt;1&lt;/code&gt; then both flips are performed. We use the &lt;code&gt;numpy&lt;/code&gt; functions &lt;code&gt;fliplr&lt;/code&gt; and &lt;code&gt;flipud&lt;/code&gt; for these.&lt;/p&gt;

&lt;p&gt;As with resampling, the intensities are mapped onto the range of the display, so there won&amp;rsquo;t be a noticeable difference in the images. The maximum value for display is 255, so increasing intensities beyond this will just scale them back down.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def flipit(image, axes):
    
    if axes[0]:
        image = np.fliplr(image)
    if axes[1]:
        image = np.flipud(image)
    
    return image
&lt;/code&gt;&lt;/pre&gt;
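&lt;p&gt;On a tiny asymmetric array the two flips are easy to tell apart (a minimal sketch of what &lt;code&gt;flipit&lt;/code&gt; does internally):&lt;/p&gt;

```python
import numpy as np

img = np.array([[1, 2],
                [3, 4]])

lr = np.fliplr(img)   # columns reversed: [[2, 1], [4, 3]]
ud = np.flipud(img)   # rows reversed:    [[3, 4], [1, 2]]

print(lr.tolist(), ud.tolist())
```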

&lt;h3 id=&#34;cropping&#34;&gt; Cropping &lt;/h3&gt;

&lt;p&gt;This may be a very niche function, but it&amp;rsquo;s important in my case. Often in natural image processing, random crops are taken of the image in order to give patches - these patches often contain most of the image data, e.g. a 224 x 224 patch rather than the 299 x 299 image. This is just another way of showing the network a very similar but nonetheless different image. Central crops are also done. What&amp;rsquo;s different in my case is that I always want my segmentation to be fully visible in the image that I show to the network (I&amp;rsquo;m working with 3D cardiac MRI segmentations).&lt;/p&gt;

&lt;p&gt;So this function looks at the segmentation and creates a bounding box using the outermost pixels. We&amp;rsquo;re producing &amp;lsquo;square&amp;rsquo; crops with side-length equal to the width of the image (the shortest side not including the depth). In this case, the bounding box is created and, if necessary, the window is moved up and down the image to make sure the full segmentation is visible. It also makes sure that the output is always square in the case that the bounding box moves off the image array.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def cropit(image, seg=None, margin=5):

    fixedaxes = np.argmin(image.shape[:2])
    trimaxes  = 0 if fixedaxes == 1 else 1
    trim    = image.shape[fixedaxes]
    center  = image.shape[trimaxes] // 2

    # debug: print(image.shape, fixedaxes, trimaxes, trim, center)

    if seg is not None:

        hits = np.where(seg != 0)
        mins = np.min(hits, axis=1)
        maxs = np.max(hits, axis=1)

        if center - (trim // 2) &amp;gt; mins[0]:
            while center - (trim // 2) &amp;gt; mins[0]:
                center = center - 1
            center = center + margin

        if center + (trim // 2) &amp;lt; maxs[0]:
            while center + (trim // 2) &amp;lt; maxs[0]:
                center = center + 1
            center = center + margin
    
    top    = max(0, center - (trim //2))
    bottom = trim if top == 0 else center + (trim//2)

    if bottom &amp;gt; image.shape[trimaxes]:
        bottom = image.shape[trimaxes]
        top = image.shape[trimaxes] - trim
  
    if trimaxes == 0:
        image   = image[top: bottom, :, :]
    else:
        image   = image[:, top: bottom, :]

    if seg is not None:
        if trimaxes == 0:
            seg   = seg[top: bottom, :, :]
        else:
            seg   = seg[:, top: bottom, :]

        return image, seg
    else:
        return image
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that this function will work to square an image even when there is no segmentation given. We also have to be careful about which axes we take as the &amp;lsquo;fixed&amp;rsquo; length for the square and which one to trim.&lt;/p&gt;
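&lt;p&gt;The bounding box at the heart of &lt;code&gt;cropit&lt;/code&gt; comes from the coordinate extremes of the non-zero segmentation pixels. A minimal sketch on a hypothetical two-slice segmentation, taking per-dimension minima and maxima of the hit coordinates:&lt;/p&gt;

```python
import numpy as np

# Hypothetical 10x12x2 segmentation with a small labelled blob
seg = np.zeros((10, 12, 2))
seg[3:6, 4:7, :] = 1

hits = np.where(seg != 0)     # tuple of coordinate arrays (rows, cols, slices)
mins = np.min(hits, axis=1)   # outermost top-left corner
maxs = np.max(hits, axis=1)   # outermost bottom-right corner

print(mins.tolist(), maxs.tolist())   # [3, 4, 0] [5, 6, 1]
```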

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34;  style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimg.jpg&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimgcrop.png&#34;&gt;&lt;br&gt;
&lt;b&gt; Cropped &lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimg.png&#34; &gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrseg.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image and Segmentation&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimgcrop.png&#34;&gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrsegcrop.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Cropped&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&#34;application&#34;&gt; Application &lt;/h2&gt;

&lt;p&gt;We should be careful about how we apply our transformations. For example, if we apply multiple transformations to the same image, we need to make sure that we don&amp;rsquo;t apply &amp;lsquo;resampling&amp;rsquo; after &amp;lsquo;intensity changes&amp;rsquo;, because resampling will reset the range of the image, defeating the point of the intensification. However, as we will generally want our data to span the same range, wholesale intensity shifts are less often seen. We also want to make sure that we are not being overzealous with the augmentations - we need to set limits for our factors and other arguments.&lt;/p&gt;

&lt;p&gt;When I implement data augmentation, I put all of these transforms into one script which can be downloaded here: &lt;a href=&#34;/docs/transforms.py&#34; title=&#34;transforms.py&#34;&gt;&lt;code&gt;transforms.py&lt;/code&gt;&lt;/a&gt;. I then call the transforms that I want from another script.&lt;/p&gt;

&lt;p&gt;We create a set of cases, one for each transformation, which draw random (but controlled) parameters for our augmentations - remember, we don&amp;rsquo;t want anything too extreme. We don&amp;rsquo;t want to apply all of these transformations every time, so we also create an array of random length (the number of transformations) with randomly assigned elements (the transformations to apply).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;np.random.seed()
numTrans     = np.random.randint(1, 6, size=1) 
allowedTrans = [0, 1, 2, 3, 4]
whichTrans   = np.random.choice(allowedTrans, numTrans, replace=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We assign a new &lt;code&gt;random.seed&lt;/code&gt; every time to ensure that each pass is different to the last. There are 5 possible transformations so &lt;code&gt;numTrans&lt;/code&gt; is a single random integer between 1 and 5. We then take a &lt;code&gt;random.choice&lt;/code&gt; of the &lt;code&gt;allowedTrans&lt;/code&gt; up to &lt;code&gt;numTrans&lt;/code&gt;. We don&amp;rsquo;t want to apply the same transformation more than once, so &lt;code&gt;replace=False&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;After some trial and error, I&amp;rsquo;ve found that the following parameters are good:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rotations - &lt;code&gt;theta&lt;/code&gt; $ \in [-10.0, 10.0] $ degrees&lt;/li&gt;
&lt;li&gt;scaling - &lt;code&gt;factor&lt;/code&gt; $ \in [0.9, 1.1] $ i.e. 10% zoom-in or zoom-out&lt;/li&gt;
&lt;li&gt;intensity - &lt;code&gt;factor&lt;/code&gt; $ \in [0.8, 1.2] $ i.e. 20% increase or decrease&lt;/li&gt;
&lt;li&gt;translation - &lt;code&gt;offset&lt;/code&gt; $ \in [-5, 5] $ pixels&lt;/li&gt;
&lt;li&gt;margin - I tend to set at either 5 or 10 pixels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For an image called &lt;code&gt;thisim&lt;/code&gt; and segmentation called &lt;code&gt;thisseg&lt;/code&gt;, the cases I use are:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;if 0 in whichTrans:
    theta   = float(np.around(np.random.uniform(-10.0,10.0, size=1), 2))
    thisim  = rotateit(thisim, theta)
    thisseg = rotateit(thisseg, theta, isseg=True) if withseg else np.zeros_like(thisim)

if 1 in whichTrans:
    scalefactor  = float(np.around(np.random.uniform(0.9, 1.1, size=1), 2))
    thisim  = scaleit(thisim, scalefactor)
    thisseg = scaleit(thisseg, scalefactor, isseg=True) if withseg else np.zeros_like(thisim)

if 2 in whichTrans:
    factor  = float(np.around(np.random.uniform(0.8, 1.2, size=1), 2))
    thisim  = intensifyit(thisim, factor)
    #no intensity change on segmentation

if 3 in whichTrans:
    axes    = list(np.random.choice(2, 1, replace=True))
    thisim  = flipit(thisim, axes+[0])
    thisseg = flipit(thisseg, axes+[0]) if withseg else np.zeros_like(thisim)

if 4 in whichTrans:
    offset  = list(np.random.randint(-5, 5, size=2))
    thisim  = translateit(thisim, offset)
    thisseg = translateit(thisseg, offset, isseg=True) if withseg else np.zeros_like(thisim)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In each case, a random set of parameters is found and passed to the transform functions. The image and segmentation are passed separately to each one. In my case, I only choose to flip horizontally by randomly choosing 0 or 1 and appending &lt;code&gt;[0]&lt;/code&gt; such that the transform ignores the second axis. We&amp;rsquo;ve also added a boolean variable called &lt;code&gt;withseg&lt;/code&gt;. When &lt;code&gt;True&lt;/code&gt; the segmentation is augmented, otherwise a blank image is returned.&lt;/p&gt;
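&lt;p&gt;As a concrete illustration of the flip case, a hypothetical &lt;code&gt;flipit&lt;/code&gt; might look like this (a sketch, not the exact helper from &lt;code&gt;transforms.py&lt;/code&gt;):&lt;/p&gt;

```python
import numpy as np

def flipit(image, axes):
    # axes is a list of 0/1 flags for the first two dimensions;
    # appending [0], as in the post, disables the flip on the second axis
    if axes[0]:
        image = np.flipud(image)
    if axes[1]:
        image = np.fliplr(image)
    return image

im = np.arange(6).reshape(2, 3)
print(flipit(im, [1, 0]))  # rows reversed: [[3, 4, 5], [0, 1, 2]]
```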

&lt;p&gt;Finally, we crop the image to make it square before resampling it to the desired &lt;code&gt;dims&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;thisim, thisseg = cropit(thisim, thisseg)
thisim          = resampleit(thisim, dims)
thisseg         = resampleit(thisseg, dims, isseg=True) if withseg else np.zeros_like(thisim)
&lt;/code&gt;&lt;/pre&gt;
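&lt;p&gt;The cropping step can be sketched as a simple centre-crop (a hypothetical stand-in for &lt;code&gt;cropit&lt;/code&gt;, which additionally trims the &lt;code&gt;margin&lt;/code&gt; discussed earlier):&lt;/p&gt;

```python
import numpy as np

def crop_to_square(im):
    # Trim the longer of the two in-plane axes so the image becomes square
    h, w = im.shape[0], im.shape[1]
    side = min(h, w)
    top  = (h - side) // 2
    left = (w - side) // 2
    return im[top:top + side, left:left + side]

im = np.zeros((240, 200, 8))
print(crop_to_square(im).shape)  # (200, 200, 8)
```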

&lt;p&gt;Putting this together in a script makes testing the augmenter easier: you can download the script &lt;a href=&#34;/docs/augmenter.py&#34; title=&#34;augmenter.py&#34;&gt;here&lt;/a&gt;. Some things in the code to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The script takes one mandatory argument (image filename) and an optional segmentation filename&lt;/li&gt;
&lt;li&gt;There&amp;rsquo;s a bit of error checking: can the files be loaded? Is the image RGB or fully 3D (i.e. is the third dimension greater than 3)?&lt;/li&gt;
&lt;li&gt;We specify the final image dimensions, [224, 224, 8] in this case&lt;/li&gt;
&lt;li&gt;We also declare some default values for the parameters so that we can&amp;hellip;&lt;/li&gt;
&lt;li&gt;&amp;hellip;print out the applied transformations and their parameters at the end&lt;/li&gt;
&lt;li&gt;There&amp;rsquo;s a definition for a &lt;code&gt;plotit&lt;/code&gt; function that just creates a 2 x 2 grid where the top two images are the originals and the bottom two are the augmented versions.&lt;/li&gt;
&lt;li&gt;There&amp;rsquo;s a commented out part which is what I used to save the images created in this post&lt;/li&gt;
&lt;/ul&gt;
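&lt;p&gt;The argument handling described in the first bullet can be sketched like this (a hypothetical fragment; the downloadable script may differ in detail):&lt;/p&gt;

```python
def parse_args(argv):
    # One mandatory image filename, one optional segmentation filename
    if len(argv) not in (2, 3):
        raise SystemExit('usage: augmenter.py image [segmentation]')
    image   = argv[1]
    seg     = argv[2] if len(argv) == 3 else None
    withseg = seg is not None
    return image, seg, withseg

print(parse_args(['augmenter.py', 'im.nii.gz', 'seg.nii.gz']))
# ('im.nii.gz', 'seg.nii.gz', True)
```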

&lt;p&gt;In a live setting where we want to do data-augmentation on the fly, we would essentially call this script with the filenames or image arrays to augment and create as many augmentations of the images as we wish. We&amp;rsquo;ll take a look at this as an example in the next post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edit: 15/05/2018&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Added a &lt;code&gt;sliceshift&lt;/code&gt; function to &lt;code&gt;transforms.py&lt;/code&gt;. This takes in a 3D image and randomly shifts a &lt;code&gt;fraction&lt;/code&gt; of the slices using our &lt;code&gt;translateit&lt;/code&gt; function (which I&amp;rsquo;ve also updated slightly). This allows us to simulate motion in medical images.&lt;/li&gt;
&lt;/ul&gt;</description>
    </item>
    
    <item>
      <title>Modifying the Terminal Prompt for Sanity</title>
      <link>/post/ps1terminal/</link>
      <pubDate>Tue, 08 Aug 2017 10:05:14 +0000</pubDate>
      
      <guid>/post/ps1terminal/</guid>
<description>&lt;p&gt;If you&amp;rsquo;re working with more than one computer at a time, then you&amp;rsquo;re probably using some form of remote access framework - most likely &lt;code&gt;ssh&lt;/code&gt;. This is common in machine learning where our scripts are run on some other host with more capabilities. In this post we&amp;rsquo;ll look at how to modify the terminal prompt layout and colours to give us the information we need at a glance: the current user; whether they&amp;rsquo;re &lt;code&gt;root&lt;/code&gt;; what computer we&amp;rsquo;re working on; what folder we&amp;rsquo;re in; and the time that the last command was given.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;p&gt;When we &lt;code&gt;ssh&lt;/code&gt; into another computer, the terminal prompt will most likely change. Often it becomes colourless (usually all-white text) and the structure may change based on the initial setup. I&amp;rsquo;ve often issued commands to the wrong computer because of this so it would be useful if we were able to clearly see which computer we&amp;rsquo;re working on at a glance.&lt;/p&gt;

&lt;p&gt;Many users don&amp;rsquo;t know that they can edit their terminal prompt &lt;em&gt;without root privileges&lt;/em&gt; to give them better indications of their user, host and location. This is done by editing the &lt;code&gt;PS1&lt;/code&gt; variable in the &lt;code&gt;~/.bashrc&lt;/code&gt; file. &lt;code&gt;~/.bashrc&lt;/code&gt; (where &lt;code&gt;~&lt;/code&gt; is the shortcut to our &lt;code&gt;/home/&amp;lt;username&amp;gt;&lt;/code&gt; folder and &lt;code&gt;.&lt;/code&gt; indicates a hidden file) is a set of commands that runs every time a new terminal window is opened. It controls much of how the terminal window functions and also holds &lt;code&gt;alias&lt;/code&gt; shortcuts for longer commands. We edit it with an editor like &lt;code&gt;nano&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;nano ~/.bashrc
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first thing we will do is to make sure that whenever we are in a terminal window (&lt;code&gt;ssh&lt;/code&gt; or otherwise) as the current user, we are seeing colours in the terminal - this is useful for certain text editors as well as the prompt. Find the line that is currently commented out, and uncomment it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# uncomment for a colored prompt, if the terminal has the capability; turned
# off by default to not distract the user: the focus in a terminal window
# should be on the output of commands, not on the prompt
force_color_prompt=yes
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now for the prompt. In this file, we need to find the line where the PS1 format is defined. PS1 is the name for the terminal prompt. It should be a couple of blocks after the &lt;code&gt;force_color_prompt&lt;/code&gt; variable.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;if [ &amp;quot;$color_prompt&amp;quot; = yes ]; then
    PS1=&#39;\A [\[\e[0;36m\]\u\[\e[0m\]@\[\e[1;36m\]\h\[\e[0m\]:\w]\$ &#39;
else
    PS1=&#39;\u@\h:\w\$ &#39;
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here you&amp;rsquo;ll see the difference that the &lt;code&gt;force_color_prompt&lt;/code&gt; variable makes: there is a lot more formatting code in the &lt;code&gt;true&lt;/code&gt; part of this &lt;code&gt;if&lt;/code&gt; block that adds color. The above formatting creates the prompt below on one of my machines:&lt;/p&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:100%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34; width=700px src=&#34;/img/ps1/exampleuser.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Example terminal prompt for regular user account&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;I&amp;rsquo;ll identify the different components here, but you can find a list of all of the possible elements that can be included &lt;a href=&#34;https://ss64.com/bash/syntax-prompt.html&#34; title=&#34;PS1 Prompt Variables&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;\A&lt;/code&gt; - the current time in &lt;code&gt;hh:mm&lt;/code&gt; format&lt;/li&gt;
&lt;li&gt;&lt;code&gt;\u&lt;/code&gt; - the current user&lt;/li&gt;
&lt;li&gt;&lt;code&gt;\h&lt;/code&gt; - the current host&lt;/li&gt;
&lt;li&gt;&lt;code&gt;\w&lt;/code&gt; - the current working directory&lt;/li&gt;
&lt;li&gt;&lt;code&gt;\$&lt;/code&gt; - the $ character (if it&amp;rsquo;s not escaped, the shell reads this as if it&amp;rsquo;s trying to find a variable as in &lt;code&gt;$PATH&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any characters which are not escaped (i.e. not preceded by a backslash &amp;lsquo;&lt;code&gt;\&lt;/code&gt;&amp;rsquo;) are printed as they appear, e.g. &lt;code&gt;@&lt;/code&gt; and &lt;code&gt;:&lt;/code&gt;. Assigning the PS1 variable the value &amp;lsquo;&lt;code&gt;\A \u@\h:\w\$&lt;/code&gt;&amp;rsquo; we get &amp;lsquo;&lt;code&gt;time user@host:/directory$&lt;/code&gt;&amp;rsquo; like this:&lt;/p&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:100%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34; width=700px src=&#34;/img/ps1/plainexample.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Example terminal prompt with no formatting&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;In order to get colors in the prompt, we need to surround our variables, e.g. &amp;lsquo;&lt;code&gt;\A&lt;/code&gt;&amp;rsquo;, with some (very ugly) specific syntax. Where we want the color to start, we write &amp;lsquo;&lt;code&gt;\[\e[0;XXm\]&lt;/code&gt;&amp;rsquo; and where we want to finish the colour and return to normal, we write &amp;lsquo;&lt;code&gt;\[\e[0m\]&lt;/code&gt;&amp;rsquo;. The &amp;lsquo;XX&amp;rsquo; in the first term is a 2-digit code that refers to a color. For example, to make the username green, we change the PS1 variable to this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;PS1=&#39;\A \[\e[0;32m\]\u\[\e[0m\]@\h:\w\$ &#39;
&lt;/code&gt;&lt;/pre&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:100%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34; width=700px src=&#34;/img/ps1/greenuser.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Example terminal prompt with green username&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;A list of colors and their respective numbers can be found &lt;a href=&#34;https://unix.stackexchange.com/a/124408&#34; title=&#34;PS1 Prompt Colors&#34;&gt;here&lt;/a&gt;. I choose green if we&amp;rsquo;re logged in as a regular user (as in green for go) but I choose red if the user is &lt;code&gt;root&lt;/code&gt;. This means I can always see at a glance if I should be careful with the commands that I write.&lt;/p&gt;
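&lt;p&gt;To avoid typing the escape syntax by hand each time, you could wrap it in a small shell function (a hypothetical helper, not part of the original setup):&lt;/p&gt;

```shell
# Wrap a PS1 element in the colour syntax described above
ps1_color() {
    printf '\\[\\e[0;%sm\\]%s\\[\\e[0m\\]' "$1" "$2"
}

# Rebuild the green-username prompt from this post
PS1="\A $(ps1_color 32 '\u')@\h:\w\\$ "
printf '%s\n' "$PS1"   # \A \[\e[0;32m\]\u\[\e[0m\]@\h:\w\$
```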

&lt;p&gt;You&amp;rsquo;ll also notice that we can change the &lt;em&gt;style&lt;/em&gt; of the font along with the color. I find this useful for making the &lt;code&gt;host&lt;/code&gt; stand out by making it bold. This is done by changing the &lt;code&gt;0&lt;/code&gt; before the &amp;lsquo;XX&amp;rsquo; color code to &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;if [ &amp;quot;$color_prompt&amp;quot; = yes ]; then
    PS1=&#39;\A [\[\e[0;31m\]\u\[\e[0m\]@\[\e[1;36m\]\h\[\e[0m\]:\w]\$ &#39;
else
    PS1=&#39;${debian_chroot:+($debian_chroot)}\u@\h:\w\$ &#39;
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:100%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34; width=700px src=&#34;/img/ps1/exampleroot.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Example terminal prompt for a `root` user&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;For my full PS1 variable, I have the colours, the bold host and I also added some square brackets (not escaped!) to make it a little more visually pleasing. You can change the &lt;code&gt;~/.bashrc&lt;/code&gt; file for each user on each computer. So if you have a regular user account &lt;em&gt;and&lt;/em&gt; a root account on the same machine, you can create a different PS1 for both by editing their respective files. So feel free to change colours and formats as you wish!&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Generative Adversarial Network (GAN) in TensorFlow - Part 5</title>
      <link>/post/GAN5/</link>
      <pubDate>Tue, 25 Jul 2017 11:07:22 +0100</pubDate>
      
      <guid>/post/GAN5/</guid>
      <description>&lt;p&gt;This is the final part in our series on Generative Adversarial Networks (GAN). We will write our training script and look at how to run the GAN. We will also take a look at the results we get out. Can you tell the difference between the real and generated faces?&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;h2 id=&#34;introduction&#34;&gt; Introduction &lt;/h2&gt;

&lt;p&gt;In this series we started out with a &lt;a href=&#34;/post/GAN1&#34; title=&#34;GAN - Part 1&#34;&gt;background to GAN&lt;/a&gt; including some of the mathematics behind them. We then downloaded and processed our &lt;a href=&#34;/post/GAN2&#34; title=&#34;GAN - Part 2&#34;&gt;dataset&lt;/a&gt;. In the subsequent posts, we wrote some &lt;a href=&#34;/post/GAN3&#34; title=&#34;GAN - Part 3&#34;&gt;image helper functions&lt;/a&gt; before completing some &lt;a href=&#34;/post/GAN4&#34; title=&#34;GAN - Part 4&#34;&gt;data processing functions&lt;/a&gt; and the &lt;a href=&#34;/post/GAN4&#34; title=&#34;GAN - Part 4&#34;&gt;GAN Class itself&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this final post, we will create the training script and visualise some of the results we get out.&lt;/p&gt;

&lt;h2 id=&#34;script&#34;&gt; Training Script &lt;/h2&gt;

&lt;p&gt;The training script is here: &lt;a href=&#34;/docs/GAN/gantut_trainer.py&#34; title=&#34;gantut_trainer.py&#34;&gt;&lt;code&gt;gantut_trainer.py&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s only short, so there isn&amp;rsquo;t anything to fill in, but let&amp;rsquo;s take a look. We need to make sure we import the GAN &lt;code&gt;class&lt;/code&gt; from our completed &lt;code&gt;gantut_gan.py&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: If you&amp;rsquo;re using the files called &lt;code&gt;gantut_*_complete.py&lt;/code&gt; you&amp;rsquo;ll need to modify this line (add the &lt;code&gt;_complete&lt;/code&gt;). Otherwise, just make sure it&amp;rsquo;s looking for the correctly named file where your GAN class is written.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;#!/usr/bin/python

import os
import numpy as np
import tensorflow as tf

from gantut_gan import DCGAN
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &amp;lsquo;shebang&amp;rsquo; on the first line allows us to call this script from the terminal without typing &lt;code&gt;python&lt;/code&gt; first. This is a useful line if you&amp;rsquo;re going to run this network on a cluster of computers where you will probably need to create your own python (or conda) virtual environment first. This line can then be changed to point to the specific python installation that you want to use to run the script.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: I&amp;rsquo;ll add this note here. The network &lt;em&gt;will&lt;/em&gt; take a long time to train. If you have access to a cluster, I recommend using it.&lt;/p&gt;

&lt;p&gt;Next, we define the possible &amp;lsquo;flags&amp;rsquo; or attributes that we need the network to take:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;#DEFINE THE FLAGS FOR RUNNING SCRIPT FROM THE TERMINAL
# ARG1 = NAME OF THE FLAG
# ARG2 = DEFAULT VALUE
# ARG3 = DESCRIPTION
flags = tf.app.flags
flags.DEFINE_integer(&amp;quot;epoch&amp;quot;, 20, &amp;quot;Number of epochs to train [20]&amp;quot;)
flags.DEFINE_float(&amp;quot;learning_rate&amp;quot;, 0.0002, &amp;quot;Learning rate for adam optimiser [0.0002]&amp;quot;)
flags.DEFINE_float(&amp;quot;beta1&amp;quot;, 0.5, &amp;quot;Momentum term for adam optimiser [0.5]&amp;quot;)
flags.DEFINE_integer(&amp;quot;train_size&amp;quot;, np.inf, &amp;quot;The size of training images [np.inf]&amp;quot;)
flags.DEFINE_integer(&amp;quot;batch_size&amp;quot;, 64, &amp;quot;The batch-size (number of images to train at once) [64]&amp;quot;)
flags.DEFINE_integer(&amp;quot;image_size&amp;quot;, 64, &amp;quot;The size of the images [n x n] [64]&amp;quot;)
flags.DEFINE_string(&amp;quot;dataset&amp;quot;, &amp;quot;lfw-aligned-64&amp;quot;, &amp;quot;Dataset directory.&amp;quot;)
flags.DEFINE_string(&amp;quot;checkpoint_dir&amp;quot;, &amp;quot;checkpoint&amp;quot;, &amp;quot;Directory name to save the checkpoints [checkpoint]&amp;quot;)
flags.DEFINE_string(&amp;quot;sample_dir&amp;quot;, &amp;quot;samples&amp;quot;, &amp;quot;Directory name to save the image samples [samples]&amp;quot;)
FLAGS = flags.FLAGS
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here, we&amp;rsquo;re using the &lt;code&gt;tf.app.flags&lt;/code&gt; module (which is a wrapper for &lt;code&gt;argparse&lt;/code&gt;) to take the arguments that trail the script name in the terminal and turn them into variables we can use in the network. The format for each argument is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;flags.DEFINE_datatype(name, default_value, description)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Where &lt;code&gt;datatype&lt;/code&gt; is what is expected (an integer, float, string etc.), &lt;code&gt;name&lt;/code&gt; is what the resulting variable will be called, &lt;code&gt;default_value&lt;/code&gt; is&amp;hellip; the default value in case it&amp;rsquo;s not explicitly defined at runtime, and &lt;code&gt;description&lt;/code&gt; is a useful descriptor of what this argument does. We package all these variables into one (called &lt;code&gt;FLAGS&lt;/code&gt;) that can be called later to assign values.&lt;/p&gt;
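&lt;p&gt;For readers who want to see the underlying mechanism, the same name/default/description pattern can be sketched with plain &lt;code&gt;argparse&lt;/code&gt; (a hypothetical equivalent, not the code used in this series):&lt;/p&gt;

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--epoch', type=int, default=20,
                    help='Number of epochs to train [20]')
parser.add_argument('--learning_rate', type=float, default=0.0002,
                    help='Learning rate for adam optimiser [0.0002]')

# Parsing an empty list uses the defaults, mimicking a run with no flags
FLAGS = parser.parse_args([])
print(FLAGS.epoch, FLAGS.learning_rate)  # 20 0.0002
```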

&lt;p&gt;Notice that the &lt;code&gt;name&lt;/code&gt; here is the same as those we wrote in the &lt;code&gt;__init__&lt;/code&gt; method of our GAN &lt;code&gt;class&lt;/code&gt; because these will be used to initialise the GAN.&lt;/p&gt;

&lt;p&gt;Our network will need folders to output to and also to check whether there&amp;rsquo;s an existing checkpoint that can be loaded (rather than doing it all over again).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;#CREATE SOME FOLDERS FOR THE DATA
if not os.path.exists(FLAGS.checkpoint_dir):
    os.makedirs(FLAGS.checkpoint_dir)
if not os.path.exists(FLAGS.sample_dir):
    os.makedirs(FLAGS.sample_dir)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Even though we&amp;rsquo;ve just defined some variables for our network, there are plenty of others in the Graph that need some default value. TensorFlow has a handy function for that:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# GET ALL OF THE OPTIONS FOR TENSORFLOW RUNTIME 
config = tf.ConfigProto(intra_op_parallelism_threads=8)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;: I&amp;rsquo;ve included the &lt;code&gt;intra_op_parallelism_threads&lt;/code&gt; argument to &lt;code&gt;tf.ConfigProto&lt;/code&gt; because TensorFlow has the power to take over as many cores as it can see when it&amp;rsquo;s running. This may not be a problem if you&amp;rsquo;re not using your machine too much, but if you&amp;rsquo;re running on a cluster, TF will ignore the &amp;lsquo;requested&amp;rsquo; number of cpus/gpus and leech into other cores. Setting &lt;code&gt;intra_op_parallelism_threads&lt;/code&gt; to the correct number of threads stops this from happening.&lt;/p&gt;

&lt;p&gt;Finally, we initialise the TensorFlow session (with our &lt;code&gt;config&lt;/code&gt; above), initialise the GAN and pass the flags to the &lt;code&gt;.train&lt;/code&gt; method of the GAN &lt;code&gt;class&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;: It is good to initialise the session in this way with &lt;code&gt;with&lt;/code&gt; because it will be automatically closed when the GAN training is finished.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;with tf.Session(config=config) as sess:
    #INITIALISE THE GAN BY CREATING A NEW INSTANCE OF THE DCGAN CLASS
    dcgan = DCGAN(sess, image_size=FLAGS.image_size, batch_size=FLAGS.batch_size,
                  is_crop=False, checkpoint_dir=FLAGS.checkpoint_dir)

    #TRAIN THE GAN
    dcgan.train(FLAGS)
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;training&#34;&gt; Training &lt;/h2&gt;

&lt;p&gt;This is it! 5 posts later and we can train our GAN. From our terminal, we are going to call the training script &lt;code&gt;gantut_trainer.py&lt;/code&gt; and pass it a couple of arguments:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;~/GAN/gantut_trainer.py --dataset ~/GAN/aligned --epoch 20
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Of course, if you&amp;rsquo;ve put your aligned training set somewhere else, make sure that path goes into the &lt;code&gt;--dataset&lt;/code&gt; flag. The other flags can be left at their defaults because that&amp;rsquo;s how we&amp;rsquo;ve written our GAN &lt;code&gt;class&lt;/code&gt;. Now, 20 epochs will take a seriously long time (it took me nearly 4 days using 12 cores on a cluster).&lt;/p&gt;

&lt;p&gt;There will be 3 folders of output from the GAN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;logs&lt;/code&gt; - where the logs from the training will be saved. These can be viewed with TensorBoard&lt;/li&gt;
&lt;li&gt;&lt;code&gt;checkpoints&lt;/code&gt; - where the model itself is saved&lt;/li&gt;
&lt;li&gt;&lt;code&gt;samples&lt;/code&gt; - this is where the image array we created in &lt;code&gt;gantut_imgfuncs.py&lt;/code&gt; will be output to every so often.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&#34;logs&#34;&gt; Logs &lt;/h3&gt;

&lt;p&gt;Whilst the network is training (if you&amp;rsquo;re doing it locally) you can pull up tensorboard and watch how the training is progressing. From the terminal:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;tensorboard --logdir=&amp;quot;~/GAN/logs&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Follow the link it spits out and you&amp;rsquo;ll be presented with a lot of information about the network. You will find graphs of the loss-functions under &amp;lsquo;scalars&amp;rsquo;, some examples from the generator under &amp;lsquo;images&amp;rsquo; and the Graph itself is nicely represented under &amp;lsquo;graph&amp;rsquo;. &amp;lsquo;Histograms&amp;rsquo; show how the distributions are changing over time. We can see in these that our noise distribution $p_{z}$ is uniform (which is what we defined) and that the real and fake images take values around &lt;code&gt;1&lt;/code&gt; and &lt;code&gt;0&lt;/code&gt; at the discriminator, as we also described in &lt;a href=&#34;/post/GAN1&#34; title=&#34;GAN - Part 1&#34;&gt;part 1&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Noise (z) Distribution&#34; width=30% src=&#34;/img/CNN/hist_z_1.png&#34;&gt;
        &lt;img title=&#34;Real Image Discriminator Distribution&#34; width=30% src=&#34;/img/CNN/hist_d.png&#34;&gt;
        &lt;img title=&#34;Fake Image Discriminator Distribution&#34; width=30% src=&#34;/img/CNN/hist_d_.png&#34;&gt;
                        
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 1&lt;/font&gt;: The distributions of (Left to right) the noise vectors $z$ and the real and fake images at the discriminator.
    &lt;/div&gt;
&lt;/div&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;TensorFlow Graph&#34; width=100% src=&#34;/img/CNN/graph.png&#34;&gt;
                        
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 2&lt;/font&gt;: The TensorFlow Graph that we build using our GAN `class`.
    &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;results&#34;&gt; Results &lt;/h3&gt;

&lt;p&gt;Here it is, the output from our GAN (after 14 epochs in this case) showing how well the network has learned to create faces. It may take longer than expected to load as I&amp;rsquo;ve tried to preserve quality.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;GAN Faces&#34; width=30% src=&#34;/img/CNN/faces_gif.gif&#34;&gt;
                        
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 3&lt;/font&gt;: The output of our GAN at the end of each epoch ending at epoch 14. (created at gifmaker.me).
        
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;We can see that some of the faces are still not quite there yet, but there are a few that are unbelievably realistic. In fact, we can perform a kind of &amp;lsquo;Turing Test&amp;rsquo; on this data. The &lt;a href=&#34;https://en.wikipedia.org/wiki/Turing_test&#34; title=&#34;wiki:Turing Test&#34;&gt;Turing Test&lt;/a&gt;, put simply, says that if a user is unable to &lt;em&gt;reliably&lt;/em&gt; tell the difference between a computer and a human performing the same task, then the computer has passed the Turing Test.&lt;/p&gt;

&lt;p&gt;Have a go at the test below: study each face, decide if it is a real or fake image; then click on the image to reveal the true result. If you only guess 50% or less, then the computer has passed this simplistic Turing Test.&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;a href=&#34;/docs/GAN/turing_quiz.html&#34; target=&#34;_blank&#34;&gt;Click Here for the Turing Test&lt;/a&gt;&lt;br&gt;(opens in a new window)&lt;/center&gt;&lt;/p&gt;

&lt;h2 id=&#34;conclusion&#34;&gt; Conclusion &lt;/h2&gt;

&lt;p&gt;So it looks great, but what was the point? Well, remember back to &lt;a href=&#34;/post/GAN1&#34; title=&#34;GAN - Post 1&#34;&gt;part 1&lt;/a&gt; - GANs and other generative networks are used for &lt;em&gt;image completion&lt;/em&gt;. We can use the fact that our network has learned what a face should look like to &amp;lsquo;fill in&amp;rsquo; any missing bits. Let&amp;rsquo;s say someone has a large tattoo across their face: we can reconstruct what the skin would look like without it. Or maybe we have an amazing photo, with a beautiful background, but we&amp;rsquo;re not smiling: the GAN can reconstruct a smile. More advanced work can include learning what glasses are and putting them onto other faces.&lt;/p&gt;

&lt;p&gt;Again, for credit, this series is based on the main code by &lt;a href=&#34;https://github.com/carpedm20/DCGAN-tensorflow&#34; title=&#34;carpedm20/DCGAN-tensorflow&#34;&gt;carpedm20&lt;/a&gt; and inspired from the blog of &lt;a href=&#34;http://bamos.github.io/2016/08/09/deep-completion/#ml-heavy-generative-adversarial-net-gan-building-blocks&#34; title=&#34;bamos.github.io&#34;&gt;B. Amos&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;GANs are powerful networks, but work in a relatively simple way by trying to trick a discriminator by generating more and more realistic-looking images.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Generative Adversarial Network (GAN) in TensorFlow - Part 4</title>
      <link>/post/GAN4/</link>
      <pubDate>Mon, 17 Jul 2017 09:37:58 +0100</pubDate>
      
      <guid>/post/GAN4/</guid>
<description>&lt;p&gt;Now that we&amp;rsquo;re able to import images into our network, we really need to build the GAN itself. This tutorial will build the GAN &lt;code&gt;class&lt;/code&gt; including the methods needed to create the generator and discriminator. We&amp;rsquo;ll also be looking at some of the data functions needed to make this work.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: This table of contents does not follow the order of the post; it is grouped by the methods in the GAN &lt;code&gt;class&lt;/code&gt; and the functions in &lt;code&gt;gantut_imgfuncs.py&lt;/code&gt;.&lt;/p&gt;

&lt;div id=&#34;toctop&#34;&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#intro&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#gan&#34;&gt;The GAN&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#datasetfiles&#34;&gt;dataset_files()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#dcgan&#34;&gt;GAN Class&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#init&#34;&gt;__init__()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#discriminator&#34;&gt;discriminator()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#generator&#34;&gt;generator()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#buildmodel&#34;&gt;build_model()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#save&#34;&gt;save()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#load&#34;&gt;load()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#train&#34;&gt;train()&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#batchnorm&#34;&gt;Data Functions&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#batchnorm&#34;&gt;batch_norm()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conv2d&#34;&gt;conv2d()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#relu&#34;&gt;relu()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#linear&#34;&gt;linear()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conv2dtrans&#34;&gt;conv2d_transpose()&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;intro&#34;&gt; Introduction &lt;/h2&gt;

&lt;p&gt;In the last tutorial, we built the functions in &lt;code&gt;gantut_imgfuncs.py&lt;/code&gt; which allow us to import data into our networks. The completed file is &lt;a href=&#34;/docs/GAN/gantut_imgfuncs_complete.py&#34; title=&#34;gantut_imgfuncs_complete.py&#34;&gt;here&lt;/a&gt;. In this tutorial we will be working on the final two code skeletons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_gan.py&#34; title=&#34;gantut_gan.py&#34;&gt;&lt;code&gt;gantut_gan.py&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_datafuncs.py&#34; title=&#34;gantut_datafuncs.py&#34;&gt;&lt;code&gt;gantut_datafuncs.py&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, let&amp;rsquo;s take a look at the various parts of our GAN in the &lt;code&gt;gantut_gan.py&lt;/code&gt; file and see what they&amp;rsquo;re going to do.&lt;/p&gt;

&lt;h2 id=&#34;gan&#34;&gt; The GAN &lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;re going to import a number of modules for this file including those from our own &lt;code&gt;gantut_datafuncs.py&lt;/code&gt; and &lt;code&gt;gantut_imgfuncs.py&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;from __future__ import division
import os
import time
import math
import itertools
from glob import glob
import tensorflow as tf
import numpy as np
from six.moves import xrange

#IMPORT OUR IMAGE AND DATA FUNCTIONS
from gantut_datafuncs import *
from gantut_imgfuncs import *
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;datasetfiles&#34;&gt; dataset_files() &lt;/h3&gt;

&lt;p&gt;The initial part of this file is a little housekeeping - ensuring that we are only dealing with supported filetypes. I liked this way of doing things in &lt;a href=&#34;http://bamos.github.io/2016/08/09/deep-completion/#ml-heavy-generative-adversarial-net-gan-building-blocks&#34; title=&#34;B. Amos&#34;&gt;B. Amos&amp;rsquo;s blog&lt;/a&gt;. We define accepted file-extensions and then return a list of all of the possible files we can use for training purposes. The &lt;code&gt;itertools.chain.from_iterable&lt;/code&gt; function is useful for creating a single &lt;code&gt;list&lt;/code&gt; of all of the files found in the folders and subfolders of a particular &lt;code&gt;root&lt;/code&gt; that have an appropriate &lt;code&gt;ext&lt;/code&gt;. Notice that it doesn&amp;rsquo;t really matter what we call the images, so this will work for all datasets.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;SUPPORTED_EXTENSIONS = [&amp;quot;png&amp;quot;, &amp;quot;jpg&amp;quot;, &amp;quot;jpeg&amp;quot;]

&amp;quot;&amp;quot;&amp;quot; Returns the list of all SUPPORTED image files in the directory
&amp;quot;&amp;quot;&amp;quot;
def dataset_files(root):
    return list(itertools.chain.from_iterable(
    glob(os.path.join(root, &amp;quot;*.{}&amp;quot;.format(ext))) for ext in SUPPORTED_EXTENSIONS))
&lt;/code&gt;&lt;/pre&gt;
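&lt;p&gt;We can sanity-check this helper in a throwaway directory (a quick self-contained test, not part of the tutorial files):&lt;/p&gt;

```python
import itertools
import os
import tempfile
from glob import glob

SUPPORTED_EXTENSIONS = ['png', 'jpg', 'jpeg']

def dataset_files(root):
    return list(itertools.chain.from_iterable(
        glob(os.path.join(root, '*.{}'.format(ext))) for ext in SUPPORTED_EXTENSIONS))

d = tempfile.mkdtemp()
for name in ['a.png', 'b.jpg', 'notes.txt']:
    open(os.path.join(d, name), 'w').close()

print(sorted(os.path.basename(f) for f in dataset_files(d)))  # ['a.png', 'b.jpg']
```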

&lt;hr&gt;

&lt;h3 id=&#34;dcgan&#34;&gt; DCGAN() &lt;/h3&gt;

&lt;p&gt;This is where the hard work begins. We&amp;rsquo;re going to build the DCGAN &lt;code&gt;class&lt;/code&gt; (i.e. Deep Convolutional Generative Adversarial Network). The skeleton code already has the necessary method names for our model, let&amp;rsquo;s have a look at what we&amp;rsquo;ve got to create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;__init__&lt;/code&gt;:  &amp;emsp;to initialise the model and set parameters&lt;/li&gt;
&lt;li&gt;&lt;code&gt;build_model&lt;/code&gt;: &amp;emsp;creates the model (or &amp;lsquo;graph&amp;rsquo; in TensorFlow-speak) by calling&amp;hellip;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;generator&lt;/code&gt;: &amp;emsp;defines the generator network&lt;/li&gt;
&lt;li&gt;&lt;code&gt;discriminator&lt;/code&gt;: &amp;emsp;defines the discriminator network&lt;/li&gt;
&lt;li&gt;&lt;code&gt;train&lt;/code&gt;: &amp;emsp;is called to begin the training of the network with data&lt;/li&gt;
&lt;li&gt;&lt;code&gt;save&lt;/code&gt;: &amp;emsp;saves the TensorFlow checkpoints of the GAN&lt;/li&gt;
&lt;li&gt;&lt;code&gt;load&lt;/code&gt;: &amp;emsp;loads the TensorFlow checkpoints of the GAN&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We create an instance of our GAN class with &lt;code&gt;DCGAN(args)&lt;/code&gt; and are returned a DCGAN object with the above methods. Let&amp;rsquo;s code.&lt;/p&gt;
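&lt;p&gt;As a sketch, the skeleton we are about to fill in looks something like this (the method bodies here are placeholders, not the real implementation):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;class DCGAN(object):
    def __init__(self, sess, image_size=64, **kwargs):
        self.sess = sess              # the TensorFlow session
        self.image_size = image_size  # parameters are stored on self

    def build_model(self): pass                       # assembles the graph
    def generator(self, z): pass                      # G: z -&amp;gt; fake image
    def discriminator(self, image, reuse=False): pass # D: image -&amp;gt; real/fake
    def train(self, config): pass                     # runs the training loop
    def save(self, checkpoint_dir, step): pass        # checkpoints the weights
    def load(self, checkpoint_dir): pass              # restores a checkpoint
&lt;/code&gt;&lt;/pre&gt;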

&lt;h4 id=&#34;init&#34;&gt; __init__() &lt;/h4&gt;

&lt;p&gt;To initialise our GAN object, we need some initial parameters. It looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def __init__(self, sess, image_size=64, is_crop=False, batch_size=64, sample_size=64, z_dim=100,
             gf_dim=64, df_dim=64, gfc_dim=1024, dfc_dim=1024, c_dim=3, checkpoint_dir=None, lam=0.1):
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The parameters are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;sess&lt;/code&gt;: &amp;emsp; the TensorFlow session to run in&lt;/li&gt;
&lt;li&gt;&lt;code&gt;image_size&lt;/code&gt;: &amp;emsp; the width of the images, which should equal the height since we want square inputs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;is_crop&lt;/code&gt;: &amp;emsp; whether to crop the images or leave them as they are&lt;/li&gt;
&lt;li&gt;&lt;code&gt;batch_size&lt;/code&gt;: &amp;emsp; number of images to use in each run&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sample_size&lt;/code&gt;: &amp;emsp; number of z samples to take on each run, should be equal to batch_size&lt;/li&gt;
&lt;li&gt;&lt;code&gt;z_dim&lt;/code&gt;: &amp;emsp; dimension of the random input vector &lt;em&gt;z&lt;/em&gt; fed to the generator&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gf_dim&lt;/code&gt;: &amp;emsp; dimension of generator filters in first conv layer&lt;/li&gt;
&lt;li&gt;&lt;code&gt;df_dim&lt;/code&gt;: &amp;emsp; dimension of discriminator filters in first conv layer&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gfc_dim&lt;/code&gt;: &amp;emsp; dimension of generator units for fully-connected layer&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dfc_dim&lt;/code&gt;: &amp;emsp; dimension of discriminator units for fully-connected layer&lt;/li&gt;
&lt;li&gt;&lt;code&gt;c_dim&lt;/code&gt;: &amp;emsp; number of image channels (gray=1, RGB=3)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;checkpoint_dir&lt;/code&gt;: &amp;emsp; where to store the TensorFlow checkpoints&lt;/li&gt;
&lt;li&gt;&lt;code&gt;lam&lt;/code&gt;: &amp;emsp;small constant weight for the sum of contextual and perceptual loss&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the controllable parameters for the GAN. As this is the initialising function, we need to transfer these inputs to the &lt;code&gt;self&lt;/code&gt; of the class so they are accessible later on. We will also add two new lines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Let&amp;rsquo;s add a check that the &lt;code&gt;image_size&lt;/code&gt; is a power of 2 (to make the convolutions work cleanly). The &amp;lsquo;bit-wise-and&amp;rsquo; operator &lt;code&gt;&amp;amp;&lt;/code&gt; will do the job for us: it exploits the fact that a power of 2 has exactly one bit set to &lt;code&gt;1&lt;/code&gt; and all others set to &lt;code&gt;0&lt;/code&gt;. Let&amp;rsquo;s also check that the image is at least $[8 \times 8]$ so we don&amp;rsquo;t convolve too far:&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Get the &lt;code&gt;image_shape&lt;/code&gt; which is the width and height of the image along with the number of channels (gray or RGB).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;#image_size must be power of 2 and 8+
assert(image_size &amp;amp; (image_size - 1) == 0 and image_size &amp;gt;= 8)

self.sess = sess
self.is_crop = is_crop
self.batch_size = batch_size
self.image_size = image_size
self.sample_size = sample_size
self.image_shape = [image_size, image_size, c_dim]

self.z_dim = z_dim
self.gf_dim = gf_dim
self.df_dim = df_dim        
self.gfc_dim = gfc_dim
self.dfc_dim = dfc_dim

self.lam = lam
self.c_dim = c_dim
&lt;/code&gt;&lt;/pre&gt;
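&lt;p&gt;The bit-wise check can be tried in isolation. A power of 2 has exactly one bit set, so subtracting 1 flips every bit below it and the &lt;code&gt;&amp;amp;&lt;/code&gt; of the two is zero (a small sketch, not part of the tutorial code):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def is_valid_size(n):
    # powers of 2: 64 = 0b1000000, 63 = 0b0111111, so 64 &amp;amp; 63 == 0
    return (n &amp;amp; (n - 1)) == 0 and n &amp;gt;= 8

# 8, 64 and 256 pass; 48 (not a power of 2) and 4 (too small) fail
&lt;/code&gt;&lt;/pre&gt;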

&lt;p&gt;Later on, we will want to do &amp;lsquo;batch normalisation&amp;rsquo; on our data to make sure none of our images are extremely different from the others. We will need a batch-norm layer for each of the conv layers in our generator and discriminator. We will initialise the layers here, but define them in our &lt;code&gt;gantut_datafuncs.py&lt;/code&gt; file shortly.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;#batchnorm (from funcs.py)
self.d_bns = [batch_norm(name=&#39;d_bn{}&#39;.format(i,)) for i in range(4)]

log_size = int(math.log(image_size) / math.log(2))
self.g_bns = [batch_norm(name=&#39;g_bn{}&#39;.format(i,)) for i in range(log_size)]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This shows that we will be using 4 layers in our discriminator. But we will need more in our generator: our generator starts with a simple vector &lt;em&gt;z&lt;/em&gt; and needs to upscale to the size of &lt;code&gt;image_size&lt;/code&gt;. It does this by a factor of 2 in each layer, thus $\log(\mathrm{image \ size})/\log(2)$ is equal to the number of upsamplings to be done, i.e. $2^{\mathrm{num \ of \ layers}} = 64$ in our case. Also note that we&amp;rsquo;ve created these objects (layers) with an iterator so that each has the name &lt;code&gt;g_bn0&lt;/code&gt;, &lt;code&gt;g_bn1&lt;/code&gt; etc.&lt;/p&gt;
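&lt;p&gt;We can sanity-check that layer count: for a power-of-2 &lt;code&gt;image_size&lt;/code&gt;, the base-2 logarithm is exactly the number of doublings needed. A small sketch (the &lt;code&gt;round&lt;/code&gt; and &lt;code&gt;bit_length&lt;/code&gt; cross-check are asides of ours, not the tutorial&amp;rsquo;s code; &lt;code&gt;round&lt;/code&gt; guards against the float ratio coming out a hair under the true value):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import math

image_size = 64
log_size = round(math.log(image_size) / math.log(2))

# 64 = 2**6, so six doublings are needed: 1 -&amp;gt; 2 -&amp;gt; 4 -&amp;gt; 8 -&amp;gt; 16 -&amp;gt; 32 -&amp;gt; 64
# integer-exact cross-check for powers of 2: image_size.bit_length() - 1 == log_size
&lt;/code&gt;&lt;/pre&gt;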

&lt;p&gt;To finish &lt;code&gt;__init__()&lt;/code&gt; we set the checkpoint directory for TensorFlow saves, instruct the class to build the model and name it &amp;lsquo;DCGAN.model&amp;rsquo;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;self.checkpoint_dir = checkpoint_dir
self.build_model()

self.model_name=&amp;quot;DCGAN.model&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h4 id=&#34;batchnorm&#34;&gt; batch_norm() &lt;/h4&gt;

&lt;p&gt;This is the first of our &lt;code&gt;gantut_datafuncs.py&lt;/code&gt; functions.&lt;/p&gt;

&lt;p&gt;If some of our images are very different from the others then the network will not learn the features correctly. To avoid this, we add batch normalisation (as described in &lt;a href=&#34;http://arxiv.org/abs/1502.03167&#34; title=&#34;Batch Normalization: Sergey Ioffe, Christian Szegedy&#34;&gt;Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift - Ioffe &amp;amp; Szegedy (2015)&lt;/a&gt;). We effectively redistribute the intensities of the images around a common mean with a set variance.&lt;/p&gt;

&lt;p&gt;This is a &lt;code&gt;class&lt;/code&gt; that will be instantiated with set parameters when called. Then, the method will perform batch normalisation whenever the object is called on the set of images &lt;code&gt;x&lt;/code&gt;. We are using TensorFlow&amp;rsquo;s built-in &lt;a href=&#34;https://www.tensorflow.org/api_docs/python/tf/contrib/layers/batch_norm&#34; title=&#34;tf.contrib.layers.batch_norm&#34;&gt;tf.contrib.layers.batch_norm()&lt;/a&gt; layer for this, which implements the method from the paper above.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Parameters&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;epsilon&lt;/code&gt;:    &amp;lsquo;small float added to variance [of the input data] to avoid division by 0&amp;rsquo;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;momentum&lt;/code&gt;:   &amp;lsquo;decay value for the moving average, usually 0.999, 0.99, 0.9&amp;rsquo;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;x&lt;/code&gt;:      the set of input images to be normalised&lt;/li&gt;
&lt;li&gt;&lt;code&gt;train&lt;/code&gt;:  whether or not the network is in training mode [True or False]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A batch_norm &amp;lsquo;object&amp;rsquo; on instantiation&lt;/li&gt;
&lt;li&gt;A tensor representing the output of the batch_norm operation&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot;Batch normalisation function to standardise the input
Initialises an object with all of the batch norm properties
When called, performs batch norm on input &#39;x&#39;
&amp;quot;&amp;quot;&amp;quot;
class batch_norm(object):
    def __init__(self, epsilon=1e-5, momentum = 0.9, name=&amp;quot;batch_norm&amp;quot;):
        with tf.variable_scope(name):
            self.epsilon = epsilon
            self.momentum = momentum

            self.name = name

    def __call__(self, x, train):
        return tf.contrib.layers.batch_norm(x, decay=self.momentum, updates_collections=None, epsilon=self.epsilon,
                                            center=True, scale=True, is_training=train, scope=self.name)
&lt;/code&gt;&lt;/pre&gt;
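&lt;p&gt;Outside TensorFlow, the normalisation itself is easy to see. Here is a minimal NumPy sketch of what the layer does at its core (the scale &lt;code&gt;gamma&lt;/code&gt; and shift &lt;code&gt;beta&lt;/code&gt; are hypothetical stand-ins for the learned &lt;code&gt;scale&lt;/code&gt;/&lt;code&gt;center&lt;/code&gt; parameters):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def batch_norm_np(x, gamma=1.0, beta=0.0, epsilon=1e-5):
    # standardise over the batch, then apply the learned scale and shift
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + epsilon) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=10.0, size=(64, 4))  # a batch with large mean/variance
y = batch_norm_np(x)
# y now has (approximately) zero mean and unit variance per feature
&lt;/code&gt;&lt;/pre&gt;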

&lt;hr&gt;

&lt;h4 id=&#34;discriminator&#34;&gt; discriminator() &lt;/h4&gt;

&lt;p&gt;As the discriminator is a simple &lt;a href=&#34;/post/CNN1&#34; title=&#34;MLNotebook: Convolutional Neural Network&#34;&gt;convolutional neural network (CNN)&lt;/a&gt; this will not take many lines. We will have to create a couple of wrapper functions that will perform the actual convolutions, but let&amp;rsquo;s get the method written in &lt;code&gt;gantut_gan.py&lt;/code&gt; first.&lt;/p&gt;

&lt;p&gt;We want our discriminator to check a real &lt;code&gt;image&lt;/code&gt;, save the variables and then use the same variables to check a fake &lt;code&gt;image&lt;/code&gt;. This way, if the images are fake but fool the discriminator, we know we&amp;rsquo;re on the right track. Thus we use the variable &lt;code&gt;reuse&lt;/code&gt; when calling the &lt;code&gt;discriminator()&lt;/code&gt; method - we will set it to &lt;code&gt;True&lt;/code&gt; when we&amp;rsquo;re using the fake images.&lt;/p&gt;

&lt;p&gt;We add &lt;code&gt;tf.variable_scope()&lt;/code&gt; to our functions so that when we visualise our graph in TensorBoard we can recognise the various pieces of our GAN.&lt;/p&gt;

&lt;p&gt;Next are the definitions of the 4 layers of our discriminator. Each one takes in the images, the kernel (filter) dimensions and has a name to identify it later on. Notice that we also call our &lt;code&gt;d_bns&lt;/code&gt; objects, the batch-norm objects that were set up during instantiation of the GAN. These act on the result of the convolution before it is passed through the non-linear &lt;code&gt;lrelu&lt;/code&gt; function. The last layer is just a &lt;code&gt;linear&lt;/code&gt; layer that outputs the unbounded results from the network.&lt;/p&gt;

&lt;p&gt;As this is a classification task (real or fake) we finish by returning the probabilities in the range $[0, 1]$ by applying the sigmoid function. The full output is also returned.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def discriminator(self, image, reuse=False):
    with tf.variable_scope(&amp;quot;discriminator&amp;quot;) as scope:
        if reuse:
            scope.reuse_variables()

        h0 = lrelu(conv2d(image, self.df_dim, name=&#39;d_h0_conv&#39;))
        h1 = lrelu(self.d_bns[0](conv2d(h0, self.df_dim*2, name=&#39;d_h1_conv&#39;), self.is_training))
        h2 = lrelu(self.d_bns[1](conv2d(h1, self.df_dim*4, name=&#39;d_h2_conv&#39;), self.is_training))
        h3 = lrelu(self.d_bns[2](conv2d(h2, self.df_dim*8, name=&#39;d_h3_conv&#39;), self.is_training))
        h4 = linear(tf.reshape(h3, [-1, 8192]), 1, &#39;d_h4_lin&#39;)

        return tf.nn.sigmoid(h4), h4
&lt;/code&gt;&lt;/pre&gt;
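&lt;p&gt;Where does the magic number 8192 in the reshape come from? Each stride-2 convolution halves the feature map, so with our assumed $[64 \times 64]$ input and &lt;code&gt;df_dim=64&lt;/code&gt; we can trace the shapes:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;image_size, df_dim = 64, 64

size = image_size
for _ in range(4):           # four stride-2 conv layers
    size //= 2               # 64 -&amp;gt; 32 -&amp;gt; 16 -&amp;gt; 8 -&amp;gt; 4

depth = df_dim * 8           # h3 has df_dim*8 = 512 feature maps
flat = size * size * depth   # 4 * 4 * 512 = 8192, the reshape size above
&lt;/code&gt;&lt;/pre&gt;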

&lt;p&gt;This method calls a couple of functions that we haven&amp;rsquo;t defined yet: &lt;code&gt;conv2d&lt;/code&gt;, &lt;code&gt;lrelu&lt;/code&gt; and &lt;code&gt;linear&lt;/code&gt;, so let&amp;rsquo;s do those now.&lt;/p&gt;

&lt;hr&gt;

&lt;h4 id=&#34;conv2d&#34;&gt; conv2d() &lt;/h4&gt;

&lt;p&gt;This function we&amp;rsquo;ve seen before in our &lt;a href=&#34;/post/CNN1&#34; title=&#34;MLNotebook: Convolutional Neural Networks&#34;&gt;CNN&lt;/a&gt; tutorial. We&amp;rsquo;ve defined the weights &lt;code&gt;w&lt;/code&gt; for each kernel with shape &lt;code&gt;[k_h x k_w x number of input channels x number of kernels]&lt;/code&gt;, not forgetting that separate weights are learned for each input channel. We&amp;rsquo;ve initialised these weights by randomly sampling from a truncated normal distribution with standard deviation &lt;code&gt;stddev&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The convolution is done by TensorFlow&amp;rsquo;s &lt;a href=&#34;https://www.tensorflow.org/api_docs/python/tf/nn/conv2d&#34; title=&#34;tf.nn.conv2d&#34;&gt;tf.nn.conv2d&lt;/a&gt; function using the weights &lt;code&gt;w&lt;/code&gt; we&amp;rsquo;ve already defined. The padding option &lt;code&gt;SAME&lt;/code&gt; pads the input so that no border pixels are lost; the output size is then simply the input size divided by the stride (with the default stride of 2 here, each layer halves the feature map). Biases are added (one per kernel, initialised at a constant value) before the result is returned.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;input_&lt;/code&gt;:     the input images (full batch)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;output_dim&lt;/code&gt;: the number of kernels/filters to be learned&lt;/li&gt;
&lt;li&gt;&lt;code&gt;k_h&lt;/code&gt;, &lt;code&gt;k_w&lt;/code&gt;:   height and width of the kernels to be learned&lt;/li&gt;
&lt;li&gt;&lt;code&gt;d_h&lt;/code&gt;, &lt;code&gt;d_w&lt;/code&gt;:   stride of the kernel horizontally and vertically&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stddev&lt;/code&gt;:     standard deviation for the normal func in weight-initialiser&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the convolved images for each kernel&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot;Defines how to perform the convolution for the discriminator,
i.e. traditional conv rather than reverse conv for the generator
&amp;quot;&amp;quot;&amp;quot;
def conv2d(input_, output_dim, k_h=5, k_w=5, d_h=2, d_w=2, stddev=0.02, name=&amp;quot;conv2d&amp;quot;):
    with tf.variable_scope(name):
        w = tf.get_variable(&#39;w&#39;, [k_h, k_w, input_.get_shape()[-1], output_dim],
                            initializer=tf.truncated_normal_initializer(stddev=stddev))
        conv = tf.nn.conv2d(input_, w, strides=[1, d_h, d_w, 1], padding=&#39;SAME&#39;)

        biases = tf.get_variable(&#39;biases&#39;, [output_dim], initializer=tf.constant_initializer(0.0))
        # conv = tf.reshape(tf.nn.bias_add(conv, biases), conv.get_shape())
        conv = tf.nn.bias_add(conv, biases)

        return conv 
&lt;/code&gt;&lt;/pre&gt;
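&lt;p&gt;Under &lt;code&gt;SAME&lt;/code&gt; padding the output size depends only on the stride: TensorFlow pads so that $\mathrm{output} = \lceil \mathrm{input} / \mathrm{stride} \rceil$. A quick sketch of that rule (a side calculation of ours, not tutorial code):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import math

def same_output_size(input_size, stride):
    # TensorFlow&#39;s SAME padding rule: output = ceil(input / stride)
    return math.ceil(input_size / stride)

# with the default stride of 2, each conv layer halves the feature map:
# same_output_size(64, 2) == 32; with stride 1 the size is preserved
&lt;/code&gt;&lt;/pre&gt;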

&lt;hr&gt;

&lt;h4 id=&#34;relu&#34;&gt; lrelu() &lt;/h4&gt;

&lt;p&gt;The network needs to be able to learn complex functions, so we add some non-linearity to the output of our convolution layers. We&amp;rsquo;ve seen this before in our tutorial on &lt;a href=&#34;/post/transfer_functions&#34; title=&#34;Transfer Functions&#34;&gt;transfer functions&lt;/a&gt;. Here we use the leaky rectified linear unit (lReLU).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Parameters&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;leak&lt;/code&gt;:   the &amp;lsquo;leakiness&amp;rsquo; of the lrelu&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;x&lt;/code&gt;: some data with a wide range&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the transformed input data&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot;Neural nets need this non-linearity to build complex functions
&amp;quot;&amp;quot;&amp;quot;
def lrelu(x, leak=0.2, name=&amp;quot;lrelu&amp;quot;):
    with tf.variable_scope(name):
        f1 = 0.5 * (1 + leak)
        f2 = 0.5 * (1 - leak)
        return f1 * x + f2 * abs(x)
&lt;/code&gt;&lt;/pre&gt;
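&lt;p&gt;The &lt;code&gt;f1&lt;/code&gt;/&lt;code&gt;f2&lt;/code&gt; form is just an algebraic rewrite of the usual piecewise definition $\max(x, \mathrm{leak} \cdot x)$: for positive $x$ the two terms sum to $x$, and for negative $x$ they sum to $\mathrm{leak} \cdot x$. A quick NumPy check:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def lrelu_np(x, leak=0.2):
    f1 = 0.5 * (1 + leak)
    f2 = 0.5 * (1 - leak)
    return f1 * x + f2 * np.abs(x)

x = np.linspace(-3, 3, 13)
# identical to the piecewise form: np.maximum(x, leak * x)
&lt;/code&gt;&lt;/pre&gt;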

&lt;hr&gt;

&lt;h4 id=&#34;linear&#34;&gt; linear() &lt;/h4&gt;

&lt;p&gt;This linear layer takes the outputs from the convolution and does a linear transform using some randomly initialised weights. This does not have the same non-linear property as the &lt;code&gt;lrelu&lt;/code&gt; function because we will use this output to calculate probabilities for classification. We return the result of &lt;code&gt;input_ x matrix&lt;/code&gt; by default, but if we need the weights as well, we also output &lt;code&gt;matrix&lt;/code&gt; and &lt;code&gt;bias&lt;/code&gt; through the &lt;code&gt;if&lt;/code&gt; statement.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Parameters&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;stddev&lt;/code&gt;:     standard deviation for weight initialiser&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bias_start&lt;/code&gt;: for the bias initialiser (constant value)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;with_w&lt;/code&gt;:     return the weight matrix (and biases) as well as the output if True&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;input_&lt;/code&gt;:         input data (shape is used to define weight/bias matrices)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;output_size&lt;/code&gt;:    desired output size of the linear layer&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot;For the final layer of the discriminator network to get the
full detail (probabilities etc.) from the output
&amp;quot;&amp;quot;&amp;quot;
def linear(input_, output_size, scope=None, stddev=0.02, bias_start=0.0, with_w=False):
    shape = input_.get_shape().as_list()

    with tf.variable_scope(scope or &amp;quot;Linear&amp;quot;):
        matrix = tf.get_variable(&amp;quot;Matrix&amp;quot;, [shape[1], output_size], tf.float32,
                                 tf.random_normal_initializer(stddev=stddev))
        bias = tf.get_variable(&amp;quot;bias&amp;quot;, [output_size],
            initializer=tf.constant_initializer(bias_start))
        if with_w:
            return tf.matmul(input_, matrix) + bias, matrix, bias
        else:
            return tf.matmul(input_, matrix) + bias
&lt;/code&gt;&lt;/pre&gt;
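&lt;p&gt;In NumPy terms the layer is a single matrix multiply plus bias; a shape sketch using our assumed discriminator sizes (illustration only):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

batch_size, in_dim, out_dim = 64, 8192, 1   # as in the discriminator&#39;s final layer

rng = np.random.default_rng(0)
input_ = rng.standard_normal((batch_size, in_dim))
matrix = 0.02 * rng.standard_normal((in_dim, out_dim))  # stddev=0.02 initialiser
bias = np.zeros(out_dim)                                # bias_start=0.0

output = input_ @ matrix + bias   # shape (64, 1): one logit per image
&lt;/code&gt;&lt;/pre&gt;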

&lt;hr&gt;

&lt;h4 id=&#34;generator&#34;&gt; generator() &lt;/h4&gt;

&lt;p&gt;Finally! We&amp;rsquo;re going to write the code for the generative part of the GAN. This method takes a single input: the randomly-sampled vector $z$ from the well-known distribution $p_z$.&lt;/p&gt;

&lt;p&gt;Remember that the generator is effectively a reverse discriminator in that it is a CNN that works backwards. Thus we start with the &amp;lsquo;values&amp;rsquo; and must perform the linear transformation on them before feeding them through the other layers of the network. As we do not know the weights or biases yet in this network, we need to make sure we output these from the linear layer with &lt;code&gt;with_w=True&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This first hidden layer &lt;code&gt;hs[0]&lt;/code&gt; needs reshaping to be the small image-shaped array that we can send through the network to become the upscaled $[64 \times 64]$ image at the end. So we take the linearly-transformed z-values and reshape to $[4 \times 4 \times \mathrm{num\_kernels}]$. Don&amp;rsquo;t forget the &lt;code&gt;-1&lt;/code&gt; to do this for all images in the batch. As before, we must batch-norm the result and pass it through the non-linearity.&lt;/p&gt;

&lt;p&gt;The number of layers in this network has been calculated earlier (using the logarithm ratio of image size to downsampling factor). We can therefore do the next part of the generator in a loop.&lt;/p&gt;

&lt;p&gt;In each loop/layer we are going to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;give the layer a name&lt;/li&gt;
&lt;li&gt;perform the &lt;em&gt;inverse&lt;/em&gt; convolution&lt;/li&gt;
&lt;li&gt;apply non-linearity&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;1 and 3 are self-explanatory, but the inverse convolution function still needs to be written. This is the function that will take in the small square image and upsample it to a larger image using some weights that are being learnt. We start at layer &lt;code&gt;i=1&lt;/code&gt; where we want the image to go to &lt;code&gt;size=8&lt;/code&gt; from &lt;code&gt;size=4&lt;/code&gt; at layer &lt;code&gt;i=0&lt;/code&gt;. This will increase by a factor of 2 at each layer. As with a regular CNN we want to learn fewer kernels on the larger images, so we need to decrease the &lt;code&gt;depth_mul&lt;/code&gt; by a factor of 2 at each layer. Note that the &lt;code&gt;while&lt;/code&gt; loop will terminate when the size gets to the size of the input images &lt;code&gt;image_size&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The final layer is added, which takes the last output and does the inverse convolution to get the final fake image (that will be tested with the discriminator).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def generator(self, z):
    with tf.variable_scope(&amp;quot;generator&amp;quot;) as scope:
        self.z_, self.h0_w, self.h0_b = linear(z, self.gf_dim*8*4*4, &#39;g_h0_lin&#39;, with_w=True)

        hs = [None]
        hs[0] = tf.reshape(self.z_, [-1, 4, 4, self.gf_dim * 8])
        hs[0] = tf.nn.relu(self.g_bns[0](hs[0], self.is_training))

        i = 1           #iteration number
        depth_mul = 8   #depth decreases as spatial component increases
        size = 8        #size increases as depth decreases

        while size &amp;lt; self.image_size:
            hs.append(None)
            name = &#39;g_h{}&#39;.format(i)
            hs[i], _, _ = conv2d_transpose(hs[i-1], [self.batch_size, size, size, self.gf_dim*depth_mul],
                                           name=name, with_w=True)
            hs[i] = tf.nn.relu(self.g_bns[i](hs[i], self.is_training))

            i += 1
            depth_mul //= 2
            size *= 2

        hs.append(None)
        name = &#39;g_h{}&#39;.format(i)
        hs[i], _, _ = conv2d_transpose(hs[i-1], [self.batch_size, size, size, 3], name=name, with_w=True)

        return tf.nn.tanh(hs[i])
&lt;/code&gt;&lt;/pre&gt;
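&lt;p&gt;Tracing the loop for our assumed $[64 \times 64]$ output makes the schedule concrete: three passes through the loop, then the final layer produces the 3-channel image:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;image_size = 64
schedule = []   # (feature-map size, depth multiplier) for each loop pass

i, depth_mul, size = 1, 8, 8
while size &amp;lt; image_size:
    schedule.append((size, depth_mul))
    i += 1
    depth_mul //= 2
    size *= 2

# schedule == [(8, 8), (16, 4), (32, 2)]; the final layer then outputs 64x64x3
&lt;/code&gt;&lt;/pre&gt;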

&lt;hr&gt;

&lt;h4 id=&#34;conv2dtrans&#34;&gt; conv2d_transpose() &lt;/h4&gt;

&lt;p&gt;The inverse convolution function looks very similar to the forward convolution function. We&amp;rsquo;ve had to make sure that different versions of TensorFlow work here - in newer versions, the correct function is located at &lt;a href=&#34;https://www.tensorflow.org/api_docs/python/tf/nn/conv2d_transpose&#34; title=&#34;tf.nn.conv2d_transpose&#34;&gt;tf.nn.conv2d_transpose&lt;/a&gt; whereas in older ones we must use &lt;code&gt;tf.nn.deconv2d&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;input_&lt;/code&gt;:         a vector (of noise) with dim=batch_size x z_dim&lt;/li&gt;
&lt;li&gt;&lt;code&gt;output_shape&lt;/code&gt;:   the final shape of the generated image&lt;/li&gt;
&lt;li&gt;&lt;code&gt;k_h&lt;/code&gt;, &lt;code&gt;k_w&lt;/code&gt;:       the height and width of the kernels&lt;/li&gt;
&lt;li&gt;&lt;code&gt;d_h&lt;/code&gt;, &lt;code&gt;d_w&lt;/code&gt;:       the stride of the kernel horiz and vert.&lt;br /&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an image (upscaled from the initial data)&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot;Deconv isn&#39;t an accurate word, but is a handy shortener,
so we&#39;ll use that. This is for the generator that has to make
the image from some randomly sampled data
&amp;quot;&amp;quot;&amp;quot;
def conv2d_transpose(input_, output_shape, k_h=5, k_w=5, d_h=2, d_w=2, stddev=0.02,
                     name=&amp;quot;conv2d_transpose&amp;quot;, with_w=False):
    with tf.variable_scope(name):
        w = tf.get_variable(&#39;w&#39;, [k_h, k_w, output_shape[-1], input_.get_shape()[-1]],
                            initializer=tf.random_normal_initializer(stddev=stddev))

        try:
            deconv = tf.nn.conv2d_transpose(input_, w, output_shape=output_shape,
                                strides=[1, d_h, d_w, 1])

        # Support for versions of TensorFlow before 0.7.0
        except AttributeError:
            deconv = tf.nn.deconv2d(input_, w, output_shape=output_shape,
                                strides=[1, d_h, d_w, 1])

        biases = tf.get_variable(&#39;biases&#39;, [output_shape[-1]], initializer=tf.constant_initializer(0.0))
        # deconv = tf.reshape(tf.nn.bias_add(deconv, biases), deconv.get_shape())
        deconv = tf.nn.bias_add(deconv, biases)

        if with_w:
            return deconv, w, biases
        else:
            return deconv    
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h4 id=&#34;buildmodel&#34;&gt; build_model() &lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;build_model()&lt;/code&gt; method brings together the image data and the generator and discriminator methods. This is the &amp;lsquo;graph&amp;rsquo; for TensorFlow to follow. It contains some &lt;code&gt;tf.placeholder&lt;/code&gt; pieces which we must feed values into when we finally train the model.&lt;/p&gt;

&lt;p&gt;We will need to know whether the model is in training or inference mode throughout our code, so we have a placeholder for that variable. We also need a placeholder for the image data itself because there will be a different batch of data being injected at each epoch. These are our &lt;code&gt;real_images&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When we inject the &lt;code&gt;z&lt;/code&gt; vectors into the GAN (served by another placeholder) we will also produce some monitoring output for TensorBoard. By adding &lt;code&gt;tf.summary.histogram()&lt;/code&gt; we are able to keep track of how the different &lt;code&gt;z&lt;/code&gt; vectors look at each epoch.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    def build_model(self):
        self.is_training = tf.placeholder(tf.bool, name=&#39;is_training&#39;)
        self.images = tf.placeholder(
            tf.float32, [None] + self.image_shape, name=&#39;real_images&#39;)

        # the low-resolution factor was not among the __init__ parameters,
        # so define it here (8 gives an 8x8 lowres image for a 64x64 input)
        self.lowres = 8
        self.lowres_size = self.image_size // self.lowres

        self.lowres_images = tf.reduce_mean(tf.reshape(self.images,
            [self.batch_size, self.lowres_size, self.lowres,
             self.lowres_size, self.lowres, self.c_dim]), [2, 4])
        self.z = tf.placeholder(tf.float32, [None, self.z_dim], name=&#39;z&#39;)
        self.z_sum = tf.summary.histogram(&amp;quot;z&amp;quot;, self.z)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next, let&amp;rsquo;s tell the graph to take the injected &lt;code&gt;z&lt;/code&gt; vector and turn it into an image with our &lt;code&gt;generator&lt;/code&gt;. We&amp;rsquo;ll also produce a lowres version of this image. Now, put the &amp;lsquo;real_images&amp;rsquo; into the &lt;code&gt;discriminator&lt;/code&gt;, which gives back our probabilities and the final-layer data (the logits). We then &lt;code&gt;reuse&lt;/code&gt; the same discriminator parameters to test the fake image from the generator. Here we also output some histograms of the probabilities of the &amp;lsquo;real_image&amp;rsquo; and the fake image. We will also output the current fake image from the generator to TensorBoard.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;        self.G = self.generator(self.z)
        self.lowres_G = tf.reduce_mean(tf.reshape(self.G,
            [self.batch_size, self.lowres_size, self.lowres,
             self.lowres_size, self.lowres, self.c_dim]), [2, 4])
        self.D, self.D_logits = self.discriminator(self.images)

        self.D_, self.D_logits_ = self.discriminator(self.G, reuse=True)

        self.d_sum = tf.summary.histogram(&amp;quot;d&amp;quot;, self.D)
        self.d__sum = tf.summary.histogram(&amp;quot;d_&amp;quot;, self.D_)
        self.G_sum = tf.summary.image(&amp;quot;G&amp;quot;, self.G)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now for some of the calculations needed to update the network. Let&amp;rsquo;s find the &amp;lsquo;loss&amp;rsquo; on the current outputs. We will utilise a very efficient loss function here: &lt;a href=&#34;https://www.tensorflow.org/api_docs/python/tf/nn/sigmoid_cross_entropy_with_logits&#34; title=&#34;tf.nn.sigmoid_cross_entropy_with_logits&#34;&gt;tf.nn.sigmoid_cross_entropy_with_logits&lt;/a&gt;. We want to calculate a few things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;how well the discriminator did at letting &lt;em&gt;true&lt;/em&gt; images through (i.e. comparing &lt;code&gt;D&lt;/code&gt; to &lt;code&gt;1&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;how well the discriminator spotted the generator&amp;rsquo;s fakes (i.e. comparing &lt;code&gt;D_&lt;/code&gt; to &lt;code&gt;0&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;how well the generator fooled the discriminator into labelling its fakes as real (i.e. comparing &lt;code&gt;D_&lt;/code&gt; to &lt;code&gt;1&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We&amp;rsquo;ll add the discriminator losses up (1 + 2) and create a TensorBoard summary statistic (a &lt;code&gt;scalar&lt;/code&gt; value) for the discriminator and generator losses in this epoch. These are what we will optimise during training.&lt;/p&gt;

&lt;p&gt;To keep everything tidy, we&amp;rsquo;ll group the discriminator and generator variables into &lt;code&gt;d_vars&lt;/code&gt; and &lt;code&gt;g_vars&lt;/code&gt; respectively.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;        self.d_loss_real = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(logits=self.D_logits,
                                                    labels=tf.ones_like(self.D)))
        self.d_loss_fake = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(logits=self.D_logits_,
                                                    labels=tf.zeros_like(self.D_)))
        self.g_loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(logits=self.D_logits_,
                                                    labels=tf.ones_like(self.D_)))

        self.d_loss_real_sum = tf.summary.scalar(&amp;quot;d_loss_real&amp;quot;, self.d_loss_real)
        self.d_loss_fake_sum = tf.summary.scalar(&amp;quot;d_loss_fake&amp;quot;, self.d_loss_fake)

        self.d_loss = self.d_loss_real + self.d_loss_fake

        self.g_loss_sum = tf.summary.scalar(&amp;quot;g_loss&amp;quot;, self.g_loss)
        self.d_loss_sum = tf.summary.scalar(&amp;quot;d_loss&amp;quot;, self.d_loss)

        t_vars = tf.trainable_variables()

        self.d_vars = [var for var in t_vars if &#39;d_&#39; in var.name]
        self.g_vars = [var for var in t_vars if &#39;g_&#39; in var.name]
&lt;/code&gt;&lt;/pre&gt;
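&lt;p&gt;To make the loss concrete, here is what &lt;code&gt;sigmoid_cross_entropy_with_logits&lt;/code&gt; computes, written out in NumPy using the numerically stable form (a sketch for illustration, not the tutorial&amp;rsquo;s code):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def sigmoid_xent(logits, labels):
    # stable form of -(z*log(sigmoid(x)) + (1-z)*log(1-sigmoid(x)))
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

logits = np.array([-2.0, 0.0, 3.0])   # raw discriminator outputs
labels = np.ones_like(logits)         # &#39;1&#39; targets, as in d_loss_real
loss = sigmoid_xent(logits, labels).mean()
&lt;/code&gt;&lt;/pre&gt;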

&lt;p&gt;We don&amp;rsquo;t want to lose our progress, so let&amp;rsquo;s make sure we set up &lt;code&gt;tf.train.Saver()&lt;/code&gt;, keeping just the most recent set of variables each time.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;        self.saver = tf.train.Saver(max_to_keep=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h4 id=&#34;save&#34;&gt; save() &lt;/h4&gt;

&lt;p&gt;When we want to save a checkpoint (i.e. save all of the weights we&amp;rsquo;ve learned) we will call this function. It will check whether the output directory exists and, if not, create it. Then it will call the &lt;a href=&#34;https://www.tensorflow.org/api_docs/python/tf/train/Saver#save&#34; title=&#34;tf.train.Saver.save&#34;&gt;&lt;code&gt;tf.train.Saver.save()&lt;/code&gt;&lt;/a&gt; function, which takes in the current session &lt;code&gt;sess&lt;/code&gt;, the save directory and model name, and keeps track of the number of steps that have been taken.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    def save(self, checkpoint_dir, step):
        if not os.path.exists(checkpoint_dir):
            os.makedirs(checkpoint_dir)
            
        self.saver.save(self.sess, os.path.join(checkpoint_dir, self.model_name), global_step=step)
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h4 id=&#34;load&#34;&gt; load() &lt;/h4&gt;

&lt;p&gt;Equally, if we&amp;rsquo;ve already spent a long time learning weights, we don&amp;rsquo;t want to start from scratch every time we want to push the network further. This function will load the most recent checkpoint in the save directory. TensorFlow has built-in functions for finding the most recent checkpoint. If there is no checkpoint available, the function returns &lt;code&gt;False&lt;/code&gt; and the appropriate action is taken by the main method that called it.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    def load(self, checkpoint_dir):
        print(&amp;quot; [*] Reading checkpoints...&amp;quot;)
        
        ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
        if ckpt and ckpt.model_checkpoint_path:
            self.saver.restore(self.sess, ckpt.model_checkpoint_path)
            return True
        else:
            return False
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h4 id=&#34;train&#34;&gt; train() &lt;/h4&gt;

&lt;p&gt;The all-important &lt;code&gt;train()&lt;/code&gt; method. This is where the magic happens. When we call &lt;code&gt;DCGAN.train(config)&lt;/code&gt; the networks will begin their fight and train. We will discuss the &lt;code&gt;config&lt;/code&gt; argument later on, but succinctly: it&amp;rsquo;s a list of all hyperparameters TensorFlow will use in the network. Here&amp;rsquo;s how &lt;code&gt;train()&lt;/code&gt; works:&lt;/p&gt;

&lt;p&gt;First we give the trainer the data (using our &lt;code&gt;dataset_files&lt;/code&gt; function) and make sure that it&amp;rsquo;s randomly shuffled. We want to make sure that the images next to each other have nothing in common so that we can truly randomly sample them. There&amp;rsquo;s also a check here, &lt;code&gt;assert(len(data) &amp;gt; 0)&lt;/code&gt;, to make sure that we don&amp;rsquo;t pass in an empty directory&amp;hellip; that wouldn&amp;rsquo;t be useful to learn from.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def train(self, config):
	data = dataset_files(config.dataset)
	np.random.shuffle(data)
	assert(len(data) &amp;gt; 0)
&lt;/code&gt;&lt;/pre&gt;
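&lt;p&gt;As a minimal sketch of this shuffle-and-check step (using a hypothetical list of filenames in place of a real dataset directory):&lt;/p&gt;

```python
import numpy as np

# hypothetical stand-ins for the file paths dataset_files() would return
data = ["00001.jpg", "00002.jpg", "00003.jpg", "00004.jpg"]

np.random.shuffle(data)   # shuffles the list in place
assert len(data) > 0      # guard against an empty dataset directory

# every file is still present exactly once, just in a random order
print(sorted(data))
```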

&lt;p&gt;We&amp;rsquo;re going to use the adaptive non-convex optimization method &lt;a href=&#34;https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer&#34; title=&#34;tf.train.AdamOptimizer&#34;&gt;&lt;code&gt;tf.train.AdamOptimizer()&lt;/code&gt;&lt;/a&gt; from &lt;a href=&#34;https://arxiv.org/pdf/1412.6980.pdf&#34; title=&#34;Adam: A Method for Stochastic Optimization&#34;&gt;Kingma &lt;em&gt;et al&lt;/em&gt; (2014)&lt;/a&gt; to train our networks. Let&amp;rsquo;s set this up for the discriminator (&lt;code&gt;d_optim&lt;/code&gt;) and the generator (&lt;code&gt;g_optim&lt;/code&gt;).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;	d_optim = tf.train.AdamOptimizer(config.learning_rate, beta1=config.beta1).minimize(self.d_loss, var_list=self.d_vars)
	g_optim = tf.train.AdamOptimizer(config.learning_rate, beta1=config.beta1).minimize(self.g_loss, var_list=self.g_vars)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next we will initialize all variables in the network (depending on TensorFlow version) and generate some &lt;code&gt;tf.summary&lt;/code&gt; variables for TensorBoard which group together all of the summaries that we want to keep track of.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;	try:
	    tf.global_variables_initializer().run()
	except:
	    tf.initialize_all_variables().run()
	    
	self.g_sum = tf.summary.merge([self.z_sum, self.d__sum, self.G_sum, self.d_loss_fake_sum, self.g_loss_sum])
	self.d_sum = tf.summary.merge([self.z_sum, self.d_sum, self.d_loss_real_sum, self.d_loss_sum])
	self.writer = tf.summary.FileWriter(&amp;quot;./logs&amp;quot;, self.sess.graph)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So here&amp;rsquo;s the part where we sample the well-known distribution $p_z$ to get the noise vector $z$, using &lt;code&gt;np.random.uniform&lt;/code&gt;. Keep a look out for this when we&amp;rsquo;re watching the network in TensorBoard: we told the GAN &lt;code&gt;class&lt;/code&gt; to output the histogram of $z$ vectors sampled from $p_z$, so they should all approximate a uniform distribution.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;re also going to sample the input &lt;em&gt;real&lt;/em&gt; image files we shuffled earlier taking &lt;code&gt;sample_size&lt;/code&gt; images through to the training process. We will use these later on to assess the loss functions every now and again when we output some examples.&lt;/p&gt;

&lt;p&gt;We need to load in the data using the function &lt;code&gt;get_image()&lt;/code&gt; that we wrote into &lt;code&gt;gantut_imgfuncs.py&lt;/code&gt; during the &lt;a href=&#34;/post/GAN3&#34; title=&#34;MLNotebook: GAN3&#34;&gt;last tutorial&lt;/a&gt;. After loading the images, lets make sure that they&amp;rsquo;re all in one &lt;code&gt;np.array&lt;/code&gt; ready to be used.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;	sample_z = np.random.uniform(-1, 1, size=(self.sample_size, self.z_dim))

	sample_files = data[0:self.sample_size]
	sample = [get_image(sample_file, self.image_size, is_crop=self.is_crop) for sample_file in sample_files]
	sample_images = np.array(sample).astype(np.float32)
&lt;/code&gt;&lt;/pre&gt;
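&lt;p&gt;A quick check of what this sampling gives us (with hypothetical values standing in for &lt;code&gt;self.sample_size&lt;/code&gt; and &lt;code&gt;self.z_dim&lt;/code&gt;):&lt;/p&gt;

```python
import numpy as np

sample_size, z_dim = 64, 100   # hypothetical stand-ins for the class attributes

sample_z = np.random.uniform(-1, 1, size=(sample_size, z_dim))

# one z vector per sample, every entry drawn uniformly from [-1, 1]
assert sample_z.shape == (64, 100)
assert sample_z.min() >= -1.0 and sample_z.max() <= 1.0
```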

&lt;p&gt;Set the step counter and get the start time (it can be frustrating if we can&amp;rsquo;t see how long things are taking). We also want to be sure to load any previous checkpoint from TensorFlow before we start again from scratch.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;	counter = 1
	start_time = time.time()

	if self.load(self.checkpoint_dir):
	    print(&amp;quot;&amp;quot;&amp;quot; An existing model was found - delete the directory or specify a new one with --checkpoint_dir &amp;quot;&amp;quot;&amp;quot;)
	else:
	    print(&amp;quot;&amp;quot;&amp;quot; No model found - initializing a new one&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here&amp;rsquo;s where the actual training takes place. For each epoch that we&amp;rsquo;ve assigned in &lt;code&gt;config&lt;/code&gt;, we create two minibatches: a sampling of real images, and those generated from the $z$ vector. We then update the &lt;code&gt;discriminator&lt;/code&gt; network before updating the &lt;code&gt;generator&lt;/code&gt;. We also write these loss values to the TensorBoard summary. There are two things to notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;By calling &lt;code&gt;sess.run()&lt;/code&gt; with specified variables in the first (or &lt;code&gt;fetch&lt;/code&gt; attribute) we are able to keep the generator steady whilst updating the discriminator, and vice versa.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;The generator is updated twice. This is to make sure that the discriminator loss function does not just converge to zero very quickly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;	for epoch in xrange(config.epoch):
	    data = dataset_files(config.dataset)
	    batch_idxs = min(len(data), config.train_size) // self.batch_size
	    
	    for idx in xrange(0, batch_idxs):
		batch_files = data[idx*config.batch_size:(idx+1)*config.batch_size]
		batch = [get_image(batch_file, self.image_size, is_crop=self.is_crop) for batch_file in batch_files]
		batch_images = np.array(batch).astype(np.float32)
		
		batch_z = np.random.uniform(-1, 1, [config.batch_size, self.z_dim]).astype(np.float32)
		
		#update D network
		_, summary_str = self.sess.run([d_optim, self.d_sum],
		                               feed_dict={self.images: batch_images, self.z: batch_z, self.is_training: True})
		self.writer.add_summary(summary_str, counter)
		
		#update G network
		_, summary_str = self.sess.run([g_optim, self.g_sum],
		                               feed_dict={self.z: batch_z, self.is_training: True})
		self.writer.add_summary(summary_str, counter)
		
		#run g_optim twice to make sure that d_loss does not go to zero
		_, summary_str = self.sess.run([g_optim, self.g_sum],
		                               feed_dict={self.z: batch_z, self.is_training: True})
		self.writer.add_summary(summary_str, counter)

&lt;/code&gt;&lt;/pre&gt;
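&lt;p&gt;Note that the minibatch indexing above drops any remainder that doesn&amp;rsquo;t fill a whole batch. A small sketch with hypothetical numbers in place of &lt;code&gt;config.train_size&lt;/code&gt; and &lt;code&gt;self.batch_size&lt;/code&gt;:&lt;/p&gt;

```python
# 10 files with a batch size of 3 gives 3 full minibatches; the last file is dropped
data = list(range(10))          # stand-ins for shuffled file paths
batch_size, train_size = 3, 100

batch_idxs = min(len(data), train_size) // batch_size
assert batch_idxs == 3

batches = [data[idx * batch_size:(idx + 1) * batch_size] for idx in range(batch_idxs)]
assert batches == [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```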

&lt;p&gt;To get the errors needed for backpropagation, we evaluate &lt;code&gt;d_loss_fake&lt;/code&gt;, &lt;code&gt;d_loss_real&lt;/code&gt; and &lt;code&gt;g_loss&lt;/code&gt;. We run the $z$ vector through the graph to get the fake loss and the generator loss, and use the real &lt;code&gt;batch_images&lt;/code&gt; for the real loss.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;		errD_fake = self.d_loss_fake.eval({self.z: batch_z, self.is_training: False})
		errD_real = self.d_loss_real.eval({self.images: batch_images, self.is_training: False})
		errG = self.g_loss.eval({self.z: batch_z, self.is_training: False})
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s get some output to &lt;code&gt;stdout&lt;/code&gt; for the user. The current epoch and progress through the minibatches is output at each new minibatch. Every 100 minibatches we&amp;rsquo;re going to evaluate the current generator &lt;code&gt;self.G&lt;/code&gt; and calculate the loss against the small set of images we sampled earlier. We will output the result of the generator and use our &lt;code&gt;save_images()&lt;/code&gt; function to create that image array we worked on in the last tutorial.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;		counter += 1
		print(&amp;quot;Epoch [{:2d}] [{:4d}/{:4d}] time: {:4.4f}, d_loss: {:.8f}, g_loss: {:.8f}&amp;quot;.format(
		        epoch, idx, batch_idxs, time.time() - start_time, errD_fake + errD_real, errG))
		
		if np.mod(counter, 100) == 1:
		    samples, d_loss, g_loss = self.sess.run([self.G, self.d_loss, self.g_loss], 
		                                            feed_dict={self.z: sample_z, self.images: sample_images, self.is_training: False})
		    save_images(samples, [8,8], &#39;./samples/train_{:02d}-{:04d}.png&#39;.format(epoch, idx))
		    print(&amp;quot;[Sample] d_loss: {:.8f}, g_loss: {:.8f}&amp;quot;.format(d_loss, g_loss))
&lt;/code&gt;&lt;/pre&gt;
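&lt;p&gt;As an aside, the &lt;code&gt;np.mod(counter, 100) == 1&lt;/code&gt; condition fires whenever the counter is one more than a multiple of 100, so samples are written out roughly every 100 minibatches:&lt;/p&gt;

```python
import numpy as np

# which counter values trigger the sampling branch over 500 minibatches
fired = [c for c in range(1, 501) if np.mod(c, 100) == 1]
assert fired == [1, 101, 201, 301, 401]
```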

&lt;p&gt;Finally, we need to save the current weights from our networks.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;		if np.mod(counter, 500) == 2:
		    self.save(config.checkpoint_dir, counter)
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;conclusion&#34;&gt; Conclusion &lt;/h2&gt;

&lt;p&gt;That&amp;rsquo;s it! We&amp;rsquo;ve completed the &lt;code&gt;gantut_gan.py&lt;/code&gt; and &lt;code&gt;gantut_datafuncs.py&lt;/code&gt; files. Check out the completed files below:&lt;/p&gt;

&lt;p&gt;Completed versions of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_trainer.py&#34; title=&#34;gantut_trainer.py&#34;&gt;gantut_trainer.py&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_imgfuncs_complete.py&#34; title=&#34;gantut_imgfuncs_complete.py&#34;&gt;gantut_imgfuncs_complete.py&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_datafuncs_complete.py&#34; title=&#34;gantut_datafuncs_complete.py&#34;&gt;gantut_datafuncs_complete.py&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_gan_complete.py&#34; title=&#34;gantut_gan_complete.py&#34;&gt;gantut_gan_complete.py&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By following this tutorial series we should now have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A background in how GANs work&lt;/li&gt;
&lt;li&gt;Necessary data, fully pre-processed and ready to use&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;gantut_imgfuncs.py&lt;/code&gt; for loading data into the networks&lt;/li&gt;
&lt;li&gt;A GAN &lt;code&gt;class&lt;/code&gt; with the necessary methods in &lt;code&gt;gantut_gan.py&lt;/code&gt; and the &lt;code&gt;gantut_datafuncs.py&lt;/code&gt; we need to do the computations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the final part of the series, we will run this network and take a look at the outputs in TensorBoard.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Generative Adversarial Network (GAN) in TensorFlow - Part 3</title>
      <link>/post/GAN3/</link>
      <pubDate>Thu, 13 Jul 2017 09:16:32 +0100</pubDate>
      
      <guid>/post/GAN3/</guid>
      <description>&lt;p&gt;We&amp;rsquo;re ready to code! In &lt;a href=&#34;/post/GAN1&#34; title=&#34;GAN Tutorial - Part 1&#34;&gt;Part 1&lt;/a&gt; we looked at how GANs work and &lt;a href=&#34;/post/GAN2&#34; title=&#34;GAN Tutorial - Part 2&#34;&gt;Part 2&lt;/a&gt; showed how to get the data ready. In this part, we will begin creating the functions that handle the image data, including some pre-processing and data normalisation.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;div id=&#34;toctop&#34;&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#intro&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#imagefuncs&#34;&gt;Image Functions&lt;/a&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#importfuncs&#34;&gt;Importing Functions&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#imread&#34;&gt;imread()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#transform&#34;&gt;transform()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#centercrop&#34;&gt;center_crop()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#getimage&#34;&gt;get_image()&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#savingfuncs&#34;&gt;Saving Functions&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#invtransform&#34;&gt;inverse_transform&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#merge&#34;&gt;merge()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#imsave&#34;&gt;imsave()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#saveimages&#34;&gt;save_images()&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;intro&#34;&gt; Introduction &lt;/h2&gt; 

&lt;p&gt;In the &lt;a href=&#34;/post/GAN2&#34; title=&#34;GAN Tutorial - Part 2&#34;&gt;previous post&lt;/a&gt; we downloaded and pre-processed our training data. There were also links to the skeleton code we will be using in the remainder of the tutorial; here they are again:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_imgfuncs.py&#34; title=&#34;gantut_imgfuncs.py&#34;&gt;&lt;code&gt;gantut_imgfuncs.py&lt;/code&gt;&lt;/a&gt;: holds the image-related functions&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_datafuncs.py&#34; title=&#34;gantut_datafuncs.py&#34;&gt;&lt;code&gt;gantut_datafuncs.py&lt;/code&gt;&lt;/a&gt;: contains the data-related functions&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_gan.py&#34; title=&#34;gantut_gan.py&#34;&gt;&lt;code&gt;gantut_gan.py&lt;/code&gt;&lt;/a&gt;: is where we define the GAN &lt;code&gt;class&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_trainer.py&#34; title=&#34;gantut_trainer.py&#34;&gt;&lt;code&gt;gantut_trainer.py&lt;/code&gt;&lt;/a&gt;: is the script that we will call in order to train the GAN&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Again, the code is based on other sources, particularly the repository by &lt;a href=&#34;https://github.com/carpedm20/DCGAN-tensorflow&#34; title=&#34;carpedm20/DCGAN-tensorflow&#34;&gt;carpedm20&lt;/a&gt; and &lt;a href=&#34;http://bamos.github.io/2016/08/09/deep-completion/#ml-heavy-generative-adversarial-net-gan-building-blocks&#34; title=&#34;bamos.github.io&#34;&gt;B. Amos&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, if your folder structure looks something like this, we&amp;rsquo;re ready to go:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;~/GAN
  |- raw
    |-- 00001.jpg
    |-- ...
  |- aligned
    |-- 00001.jpg
    |-- ...
  |- gantut_imgfuncs.py
  |- gantut_datafuncs.py
  |- gantut_gan.py
  |- gantut_trainer.py
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;imagefuncs&#34;&gt; Image Functions &lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;re going to want to be able to read in a set of images. We will also want to be able to output some generated images. We will also add in a fail-safe cropping/transformation procedure in case we need to ensure we have the right input format. The skeleton code &lt;code&gt;gantut_imgfuncs.py&lt;/code&gt; contains the definition headers for these functions; we will fill them in as we go along.&lt;/p&gt;

&lt;h3 id=&#34;importfuncs&#34;&gt; Importing Functions &lt;/h3&gt;

&lt;p&gt;These are the functions needed to get the data from the hard-disk into our network. They are called like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;get_image&lt;/code&gt; which calls&lt;/li&gt;
&lt;li&gt;&lt;code&gt;imread&lt;/code&gt; and&lt;/li&gt;
&lt;li&gt;&lt;code&gt;transform&lt;/code&gt; which calls&lt;/li&gt;
&lt;li&gt;&lt;code&gt;center_crop&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h4 id=&#34;imread&#34;&gt; imread() &lt;/h4&gt;

&lt;p&gt;We are dealing with standard image files and our GAN will support &lt;code&gt;.jpg&lt;/code&gt;, &lt;code&gt;.jpeg&lt;/code&gt; and &lt;code&gt;.png&lt;/code&gt; as input. For these kinds of files, Python already has well-developed tools: specifically, we can use the &lt;a href=&#34;https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.misc.imread.html&#34; title=&#34;imread documentation&#34;&gt;&lt;code&gt;scipy.misc.imread&lt;/code&gt;&lt;/a&gt; function from the &lt;code&gt;scipy.misc&lt;/code&gt; library. This is a one-liner and is already written in the skeleton code.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;path&lt;/code&gt;: location of the image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the image&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; Reads in the image (part of get_image function)
&amp;quot;&amp;quot;&amp;quot;
def imread(path):
    return scipy.misc.imread(path, mode=&#39;RGB&#39;).astype(np.float)
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h4 id=&#34;transform&#34;&gt; transform() &lt;/h4&gt;
&lt;p&gt;This function we will have to write into the skeleton. We are including it to make sure that the image data are all of the same dimensions, so it needs to take in the image, the desired width (the output will be square) and whether to perform the cropping or not. We may have already cropped our images (as we have) because we&amp;rsquo;ve done some registration/alignment etc.&lt;/p&gt;

&lt;p&gt;We check whether we want to crop the image: if we do, we call the &lt;code&gt;center_crop&lt;/code&gt; function, otherwise we just take the &lt;code&gt;image&lt;/code&gt; as it is.&lt;/p&gt;

&lt;p&gt;Before returning our cropped (or uncropped) image, we perform normalisation. Currently the pixels have intensity values in the range $[0 \ 255]$ for each channel (red, green, blue). It is best not to have this kind of skew on our data, so we normalise our images to have intensity values in the range $[-1 \ 1]$ by dividing by half the maximum value (127.5) and subtracting 1, i.e. &lt;code&gt;image/127.5 - 1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We will define the cropping function next, but note that the returned image is simply a &lt;code&gt;numpy&lt;/code&gt; array.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;image&lt;/code&gt;:      the image data to be transformed&lt;/li&gt;
&lt;li&gt;&lt;code&gt;npx&lt;/code&gt;:        the size of the transformed image [&lt;code&gt;npx&lt;/code&gt; x &lt;code&gt;npx&lt;/code&gt;]&lt;/li&gt;
&lt;li&gt;&lt;code&gt;is_crop&lt;/code&gt;:    whether to perform cropping too [&lt;code&gt;True&lt;/code&gt; or &lt;code&gt;False&lt;/code&gt;]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the cropped, normalised image&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; Transforms the image by cropping and resizing and
normalises intensity values between -1 and 1
&amp;quot;&amp;quot;&amp;quot;
def transform(image, npx=64, is_crop=True):
    if is_crop:
        cropped_image = center_crop(image, npx)
    else:
        cropped_image = image
    return np.array(cropped_image)/127.5 - 1.
&lt;/code&gt;&lt;/pre&gt;
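&lt;p&gt;The normalisation is easy to sanity-check on a tiny array; this sketch also shows the inverse mapping we will need when saving images later:&lt;/p&gt;

```python
import numpy as np

# a tiny stand-in image with raw intensities in [0, 255]
image = np.array([[0.0, 127.5, 255.0]])

normalised = image / 127.5 - 1.0          # maps [0, 255] to [-1, 1]
assert np.allclose(normalised, [[-1.0, 0.0, 1.0]])

restored = (normalised + 1.0) * 127.5     # inverse mapping back to [0, 255]
assert np.allclose(restored, image)
```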

&lt;hr&gt;

&lt;h4 id=&#34;centercrop&#34;&gt; center_crop() &lt;/h4&gt;

&lt;p&gt;Let&amp;rsquo;s perform the cropping of the images (if requested). Usually we deal with square images, say $[64 \times 64]$. We can add a quick option to change that with short &lt;code&gt;if&lt;/code&gt; statements looking at the &lt;code&gt;crop_w&lt;/code&gt; argument to this function. We take the current height and width (&lt;code&gt;h&lt;/code&gt; and &lt;code&gt;w&lt;/code&gt;) from the &lt;code&gt;shape&lt;/code&gt; of the image &lt;code&gt;x&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To find the location of the centre of the image around which to take the square crop, we take half the result of &lt;code&gt;h - crop_h&lt;/code&gt; and &lt;code&gt;w - crop_w&lt;/code&gt;, making sure to round both to get a definite pixel value. However, it&amp;rsquo;s not guaranteed (depending on the image dimensions) that we will end up with a nice $[64 \times 64]$ image. Let&amp;rsquo;s fix that at the end.&lt;/p&gt;

&lt;p&gt;As before, &lt;code&gt;scipy&lt;/code&gt; has some efficient functions that we may as well use. &lt;a href=&#34;https://docs.scipy.org/doc/scipy/reference/generated/scipy.misc.imresize.html&#34; title=&#34;imresize documentation&#34;&gt;&lt;code&gt;scipy.misc.imresize&lt;/code&gt;&lt;/a&gt; takes in an image array and the desired size and outputs a resized image. We can give it our array, which may not be a nice square image due to the initial image dimensions, and &lt;code&gt;imresize&lt;/code&gt; will perform interpolation (bilinear by default) to make sure we get a nice square image at the end.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;x&lt;/code&gt;:      the input image&lt;/li&gt;
&lt;li&gt;&lt;code&gt;crop_h&lt;/code&gt;: the height of the crop region&lt;/li&gt;
&lt;li&gt;&lt;code&gt;crop_w&lt;/code&gt;: if None crop width = crop height&lt;/li&gt;
&lt;li&gt;&lt;code&gt;resize_w&lt;/code&gt;: the width of the resized image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the cropped image&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; Crops the input image at the centre pixel
&amp;quot;&amp;quot;&amp;quot;
def center_crop(x, crop_h, crop_w=None, resize_w=64):
    if crop_w is None:
        crop_w = crop_h
    h, w = x.shape[:2]
    j = int(round((h - crop_h)/2.))
    i = int(round((w - crop_w)/2.))
    return scipy.misc.imresize(x[j:j+crop_h, i:i+crop_w],
                               [resize_w, resize_w])
&lt;/code&gt;&lt;/pre&gt;
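&lt;p&gt;Here&amp;rsquo;s a sketch of just the centre-offset arithmetic on a hypothetical 5&amp;times;7 array (omitting the &lt;code&gt;imresize&lt;/code&gt; step):&lt;/p&gt;

```python
import numpy as np

x = np.arange(35).reshape(5, 7)   # hypothetical 5x7 single-channel image

crop_h = crop_w = 3
h, w = x.shape[:2]
j = int(round((h - crop_h) / 2.0))   # top edge of the central window
i = int(round((w - crop_w) / 2.0))   # left edge of the central window

crop = x[j:j + crop_h, i:i + crop_w]
assert (j, i) == (1, 2)
assert crop.shape == (3, 3)
assert crop[0, 0] == x[1, 2]   # the window is centred in the image
```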

&lt;hr&gt;

&lt;h4 id=&#34;getimage&#34;&gt; get_image() &lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;get_image&lt;/code&gt; function is a wrapper that will call the &lt;code&gt;imread&lt;/code&gt; and &lt;code&gt;transform&lt;/code&gt; functions. It is the function that we&amp;rsquo;ll call to get the data rather than doing two separate function calls in the main GAN &lt;code&gt;class&lt;/code&gt;. This is a one-liner and is already written in the skeleton code.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Parameters&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;is_crop&lt;/code&gt;:    whether to crop the image or not [True or False]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;image_path&lt;/code&gt;: location of the image&lt;/li&gt;
&lt;li&gt;&lt;code&gt;image_size&lt;/code&gt;: width (in pixels) of the output image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the cropped image&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; Loads the image and crops it to &#39;image_size&#39;
&amp;quot;&amp;quot;&amp;quot;
def get_image(image_path, image_size, is_crop=True):
    return transform(imread(image_path), image_size, is_crop)
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h3 id=&#34;savingfuncs&#34;&gt; Saving Functions &lt;/h3&gt;

&lt;p&gt;When we&amp;rsquo;re training our network, we will want to see some of the results. The previous functions all deal with getting images from storage &lt;em&gt;into&lt;/em&gt; the networks. We now want to take some images &lt;em&gt;out&lt;/em&gt;. The functions are called like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;save_images&lt;/code&gt; which calls&lt;/li&gt;
&lt;li&gt;&lt;code&gt;inverse_transform&lt;/code&gt; and&lt;/li&gt;
&lt;li&gt;&lt;code&gt;imsave&lt;/code&gt; which calls&lt;/li&gt;
&lt;li&gt;&lt;code&gt;merge&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h4 id=&#34;invtransform&#34;&gt; inverse_transform() &lt;/h4&gt;

&lt;p&gt;Firstly, let&amp;rsquo;s map the intensities back out of the normalised range: we&amp;rsquo;ll just go from $[-1 \ 1]$ to $[0 \ 1]$ here.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;images&lt;/code&gt;:     the image to be transformed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the transformed image&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; This turns the intensities back to a normal range
&amp;quot;&amp;quot;&amp;quot;
def inverse_transform(images):
    return (images+1.)/2.
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h4 id=&#34;merge&#34;&gt; merge() &lt;/h4&gt;

&lt;p&gt;We will create an array of several example images from the network which we can output every now and again to see how things are progressing. We need some &lt;code&gt;images&lt;/code&gt; to go in and a &lt;code&gt;size&lt;/code&gt; which will say how many images in width and height the array should be.&lt;/p&gt;

&lt;p&gt;First get the height &lt;code&gt;h&lt;/code&gt; and width &lt;code&gt;w&lt;/code&gt; of the &lt;code&gt;images&lt;/code&gt; from their &lt;code&gt;shape&lt;/code&gt; (we assume they&amp;rsquo;re all the same size because we will have already used our previous functions to make this happen). &lt;strong&gt;Note&lt;/strong&gt; that &lt;code&gt;images&lt;/code&gt; is a collection of images where each &lt;code&gt;image&lt;/code&gt; has the same &lt;code&gt;h&lt;/code&gt; and &lt;code&gt;w&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We define &lt;code&gt;img&lt;/code&gt; to be the final image array and initialise it to all zeros. Notice that there is a &amp;lsquo;3&amp;rsquo; on the end to denote the number of channels as these are RGB images. This will still work for grayscale images.&lt;/p&gt;

&lt;p&gt;Next we will iterate through each &lt;code&gt;image&lt;/code&gt; in &lt;code&gt;images&lt;/code&gt; and put it into place. The &lt;code&gt;%&lt;/code&gt; operator is the modulo which returns the remainder of the division between two numbers. &lt;code&gt;//&lt;/code&gt; is the floor division operator which returns the integer result of division rounded down. So this will move along the top row of the array (remembering Python indexing starts at 0) and move down placing the image at each iteration.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;images&lt;/code&gt;:     the set of input images&lt;/li&gt;
&lt;li&gt;&lt;code&gt;size&lt;/code&gt;:       [height, width] of the array&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an array of images as a single image&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; Takes a set of &#39;images&#39; and creates an array from them.
&amp;quot;&amp;quot;&amp;quot; 
def merge(images, size):
    h, w = images.shape[1], images.shape[2]
    img = np.zeros((int(h * size[0]), int(w * size[1]), 3))
    for idx, image in enumerate(images):
        i = idx % size[1]
        j = idx // size[1]
        img[j*h:j*h+h, i*w:i*w+w, :] = image
        
    return img
&lt;/code&gt;&lt;/pre&gt;
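&lt;p&gt;To see the modulo/floor-division placement in action, here is the same loop run on four hypothetical 2&amp;times;2 RGB &amp;ldquo;images&amp;rdquo;, each filled with its own index:&lt;/p&gt;

```python
import numpy as np

# four 2x2 RGB images stacked along the first axis; image k is filled with the value k
images = np.arange(4).reshape(4, 1, 1, 1) * np.ones((4, 2, 2, 3))

size = [2, 2]   # a 2x2 grid of images
h, w = images.shape[1], images.shape[2]
img = np.zeros((h * size[0], w * size[1], 3))

for idx, image in enumerate(images):
    i = idx % size[1]    # column in the grid
    j = idx // size[1]   # row in the grid
    img[j*h:j*h+h, i*w:i*w+w, :] = image

assert img.shape == (4, 4, 3)
assert img[0, 0, 0] == 0.0   # image 0 lands in the top-left cell
assert img[3, 3, 0] == 3.0   # image 3 lands in the bottom-right cell
```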

&lt;hr&gt;

&lt;h4 id=&#34;imsave&#34;&gt; imsave() &lt;/h4&gt;

&lt;p&gt;Our image array &lt;code&gt;img&lt;/code&gt; now has intensity values in $[0 \ 1]$; let&amp;rsquo;s scale this to the proper image range $[0 \ 255]$ and cast to integer values before saving with &lt;a href=&#34;https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.misc.imsave.html&#34; title=&#34;imsave documentation&#34;&gt;&lt;code&gt;scipy.misc.imsave&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;images&lt;/code&gt;: the set of input images&lt;/li&gt;
&lt;li&gt;&lt;code&gt;size&lt;/code&gt;:   [height, width] of the array&lt;/li&gt;
&lt;li&gt;&lt;code&gt;path&lt;/code&gt;:   the save location&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an image saved to disk&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; Takes a set of `images` and calls the merge function. Converts
the array to image data and saves to disk.
&amp;quot;&amp;quot;&amp;quot;
def imsave(images, size, path):
    img = merge(images, size)
    return scipy.misc.imsave(path, (255*img).astype(np.uint8))
&lt;/code&gt;&lt;/pre&gt;
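&lt;p&gt;The scale-and-cast step on its own (a sketch with a hypothetical merged array):&lt;/p&gt;

```python
import numpy as np

img = np.array([[0.0, 0.5, 1.0]])          # merged array with values in [0, 1]
as_bytes = (255 * img).astype(np.uint8)    # scale to [0, 255] and truncate to integers

assert as_bytes.tolist() == [[0, 127, 255]]   # 255 * 0.5 = 127.5 truncates to 127
```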

&lt;hr&gt;

&lt;h4 id=&#34;saveimages&#34;&gt; save_images() &lt;/h4&gt;

&lt;p&gt;Finally, let&amp;rsquo;s create the wrapper to pull this together:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;images&lt;/code&gt;: the images to be saved&lt;/li&gt;
&lt;li&gt;&lt;code&gt;size&lt;/code&gt;: the size of the image array [height, width]&lt;/li&gt;
&lt;li&gt;&lt;code&gt;image_path&lt;/code&gt;: where the array is to be stored on disk&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; Takes a set of images and saves them to disk. Redistributes
intensity values from [-1 1] back to [0 255]
&amp;quot;&amp;quot;&amp;quot;
def save_images(images, size, image_path):
    return imsave(inverse_transform(images), size, image_path)
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;conclusion&#34;&gt; Conclusion &lt;/h3&gt;

&lt;p&gt;In this post, we&amp;rsquo;ve dealt with all of the functions that are needed to import image data into our network, and also some that will create outputs so we can see what&amp;rsquo;s going on. We&amp;rsquo;ve made sure that we can import any image size and it will be dealt with correctly.&lt;/p&gt;

&lt;p&gt;Make sure that we&amp;rsquo;ve imported &lt;code&gt;scipy.misc&lt;/code&gt; and &lt;code&gt;numpy&lt;/code&gt; into this script:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np
import scipy.misc
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The complete script can be found &lt;a href=&#34;/docs/GAN/gantut_imgfuncs_complete.py&#34; title=&#34;gantut_imgfuncs_complete.py&#34;&gt;here&lt;/a&gt;. In the next post, we will be working on the GAN itself and building the &lt;code&gt;gantut_datafuncs.py&lt;/code&gt; functions as we go.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Generative Adversarial Network (GAN) in TensorFlow - Part 2</title>
      <link>/post/GAN2/</link>
      <pubDate>Wed, 12 Jul 2017 11:59:45 +0100</pubDate>
      
      <guid>/post/GAN2/</guid>
      <description>&lt;p&gt;This tutorial will provide the data that we will use when training our Generative Adversarial Networks. It will also give an overview of the structure of the code needed to create a GAN, and provide some skeleton code which we can work on in the next post. If you&amp;rsquo;re not up to speed on GANs, please do read the brief introduction in &lt;a href=&#34;/post/GAN1&#34; title=&#34;GAN Part 1 - Some Background and Mathematics&#34;&gt;Part 1&lt;/a&gt; of this series on Generative Adversarial Networks.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;h2 id=&#34;intro&#34;&gt; Introduction &lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;ve looked at &lt;a href=&#34;/post/GAN1&#34; title=&#34;GAN Part 1 - Some Background and Mathematics&#34;&gt;how a GAN works&lt;/a&gt;  and how it is trained, but how do we implement this in Python? There are several stages to this task:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create some initial functions that will read in our training data&lt;/li&gt;
&lt;li&gt;Create some functions that will perform the steps in the CNN&lt;/li&gt;
&lt;li&gt;Write a &lt;code&gt;class&lt;/code&gt; that will hold our GAN and all of its important methods&lt;/li&gt;
&lt;li&gt;Put these together in a script that we can run to train the GAN&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The way I&amp;rsquo;d like to go through this process (in the next post) is by taking the network piece by piece as it would be called by the program. I think this is important to help to understand the flow of the data through the network. The code that I&amp;rsquo;ve used for the basis of these tutorials is from &lt;a href=&#34;https://github.com/carpedm20/DCGAN-tensorflow&#34; title=&#34;carpedm20/DCGAN-tensorflow&#34;&gt;carpedm20&amp;rsquo;s DCGAN-tensorflow repository&lt;/a&gt;, with a lot of influence from other sources including &lt;a href=&#34;http://bamos.github.io/2016/08/09/deep-completion/#ml-heavy-generative-adversarial-net-gan-building-blocks&#34; title=&#34;bamos.github.io&#34;&gt;this blog from B. Amos&lt;/a&gt;. I&amp;rsquo;m hoping that by putting this together in several posts, and fleshing out the code, it will become clearer.&lt;/p&gt;

&lt;h2 id=&#34;skeletons&#34;&gt; Skeleton Code &lt;/h2&gt;

&lt;p&gt;We will structure our code into 4 separate &lt;code&gt;.py&lt;/code&gt; files. Each file represents one of the 4 stages set out above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_imgfuncs.py&#34; title=&#34;gantut_imgfuncs.py&#34;&gt;&lt;code&gt;gantut_imgfuncs.py&lt;/code&gt;&lt;/a&gt;: holds the image-related functions&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_datafuncs.py&#34; title=&#34;gantut_datafuncs.py&#34;&gt;&lt;code&gt;gantut_datafuncs.py&lt;/code&gt;&lt;/a&gt;: contains the data-related functions&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_gan.py&#34; title=&#34;gantut_gan.py&#34;&gt;&lt;code&gt;gantut_gan.py&lt;/code&gt;&lt;/a&gt;: is where we define the GAN &lt;code&gt;class&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_trainer.py&#34; title=&#34;gantut_trainer.py&#34;&gt;&lt;code&gt;gantut_trainer.py&lt;/code&gt;&lt;/a&gt;: is the script that we will call in order to train the GAN&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For our project, let&amp;rsquo;s use the working directory &lt;code&gt;~/GAN&lt;/code&gt;. Download these skeletons using the links above into &lt;code&gt;~/GAN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you look through each of these files, you will see that they contain only a comment for each function/class and the line defining each function/method. Each of these will have to be completed when we go through the next couple of posts. In the remainder of this post, we will take a look at the dataset that we will be using and prepare the images.&lt;/p&gt;

&lt;h2 id=&#34;dataset&#34;&gt; Dataset&lt;/h2&gt;

&lt;p&gt;We clearly need to have some training data to hand to be able to make this work. Several posts have used databases of faces or even the MNIST digit-classification dataset. In our tutorial, we will be using faces - I find this very interesting as it allows the computer to create photo-realistic images of people that don&amp;rsquo;t actually exist!&lt;/p&gt;

&lt;p&gt;To get the dataset prepared we need to download it, and then pre-process the images so that they will be small enough to use in our GAN.&lt;/p&gt;

&lt;h3 id=&#34;dataset-download&#34;&gt; Download &lt;/h3&gt;

&lt;p&gt;We are going to use the &lt;a href=&#34;http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html&#34; title=&#34;CelebA&#34;&gt;CelebA&lt;/a&gt; database. Here is a direct link to the Google Drive which stores the data: &lt;a href=&#34;https://drive.google.com/drive/folders/0B7EVK8r0v71pTUZsaXdaSnZBZzg&#34;&gt;https://drive.google.com/drive/folders/0B7EVK8r0v71pTUZsaXdaSnZBZzg&lt;/a&gt;. You will want to go to the &amp;ldquo;img&amp;rdquo; folder and download the &lt;a href=&#34;https://drive.google.com/open?id=0B7EVK8r0v71pZjFTYXZWM3FlRnM&#34; title=&#34;img_align_celeba.zip&#34;&gt;&amp;ldquo;img_align_celeba.zip&amp;rdquo;&lt;/a&gt; file. The direct download link should be:&lt;/p&gt;

&lt;div align=&#34;center&#34;&gt;
&lt;a href=&#34;https://drive.google.com/open?id=0B7EVK8r0v71pZjFTYXZWM3FlRnM&#34; title=&#34;img_align_celeba.zip&#34;&gt;img_align_celeba.zip (1.3GB)&lt;/a&gt;
&lt;/div&gt;

&lt;p&gt;Download and extract this folder into &lt;code&gt;~/GAN/raw_images&lt;/code&gt;: it contains 200,000+ examples of celebrity faces. Even though the &lt;code&gt;.zip&lt;/code&gt; has &amp;lsquo;align&amp;rsquo; in its name, we still need to resize the images and may therefore need to realign them too.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img src=&#34;http://mmlab.ie.cuhk.edu.hk/projects/celeba/overview.png&#34; width=&#34;75%&#34; title=&#34;CelebA Database&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 1&lt;/font&gt;: Examples from the CelebA Database. Source: &lt;a href=&#34;http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html&#34; alt=&#34;CelebA&#34;&gt;CelebA&lt;/a&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;dataset-process&#34;&gt; Processing &lt;/h3&gt;

&lt;p&gt;To process this volume of images, we need an automated method for resizing and cropping. We will use &lt;a href=&#34;http://cmusatyalab.github.io/openface/&#34; title=&#34;OpenFace&#34;&gt;OpenFace&lt;/a&gt; - specifically, a small tool that comes with it.&lt;/p&gt;

&lt;p&gt;Open a terminal, navigate to (or create) your working directory (we&amp;rsquo;ll use &lt;code&gt;~/GAN&lt;/code&gt;) and follow the instructions below to clone OpenFace and get the Python wrapping sorted:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;cd ~/GAN
git clone https://github.com/cmusatyalab/openface.git openface
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once cloning is complete, move into the &lt;code&gt;openface&lt;/code&gt; folder and install the requirements (handily they&amp;rsquo;re listed in &lt;code&gt;requirements.txt&lt;/code&gt;), like so:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;cd ./openface
sudo pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once installation is complete (make sure you use &lt;code&gt;sudo&lt;/code&gt; to get the permissions to install), we want to download the models that we can use with Python:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;./models/get-models.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This may take a short while. When it&amp;rsquo;s done, you may want to update Scipy, because &lt;code&gt;requirements.txt&lt;/code&gt; asks for an older version than the most recent one. Easily fixed:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;sudo pip install --upgrade scipy
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we have access to the Python tool that will do the aligning and cropping of our faces. This is an important step to ensure that all images going into the network are the same dimensions, but also so that the network can learn the faces well (there&amp;rsquo;s no point in having eyes at the bottom of an image, or a face that&amp;rsquo;s half out of the field of view).&lt;/p&gt;

&lt;p&gt;In our working directory &lt;code&gt;~/GAN&lt;/code&gt;, run the following:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;./openface/util/align-dlib.py ./raw_images align innerEyesAndBottomLip ./aligned --size 64
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will &lt;code&gt;align&lt;/code&gt; the images in &lt;code&gt;./raw_images&lt;/code&gt; by their &lt;code&gt;innerEyesAndBottomLip&lt;/code&gt; landmarks, crop them to &lt;code&gt;64&lt;/code&gt; x &lt;code&gt;64&lt;/code&gt; and put them in &lt;code&gt;./aligned&lt;/code&gt;. This will take a long time (there are 200,000+ images!).&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img src=&#34;/img/CNN/resized_celeba.png&#34; width=&#34;50%&#34; title=&#34;Cropped and Resized CelebA&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 2&lt;/font&gt;: Examples of aligned, cropped and resized images from the &lt;a href=&#34;http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html&#34; alt=&#34;CelebA&#34;&gt;CelebA&lt;/a&gt; database.
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;That&amp;rsquo;s it! Now we will have a good training set to use with our network. We also have the skeletons that we can build up to form our GAN. Our next post will look at the functions that will read-in the images for use with the GAN and begin to work on the GAN &lt;code&gt;class&lt;/code&gt;.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Generative Adversarial Network (GAN) in TensorFlow - Part 1</title>
      <link>/post/GAN1/</link>
      <pubDate>Tue, 11 Jul 2017 09:15:54 +0100</pubDate>
      
      <guid>/post/GAN1/</guid>
<description>&lt;p&gt;We&amp;rsquo;ve seen that CNNs can learn the content of an image for classification purposes, but what else can they do? This tutorial will look at the Generative Adversarial Network (GAN), which is able to learn from a set of images and create an entirely new &amp;lsquo;fake&amp;rsquo; image which isn&amp;rsquo;t in the training set. Why? By the end of this tutorial you&amp;rsquo;ll know why this might be done and how to do it.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;h2 id=&#34;intro&#34;&gt;  Introduction &lt;/h2&gt;

&lt;p&gt;Generative Adversarial Networks (GANs) were proposed by Ian Goodfellow &lt;em&gt;et al.&lt;/em&gt; at the annual Neural Information Processing Systems (NIPS) conference in 2014. The original paper &lt;a href=&#34;https://arxiv.org/pdf/1406.2661&#34; title=&#34;Generative Adversarial Nets 2014&#34;&gt;is available on arXiv&lt;/a&gt;, along with a later tutorial delivered by Goodfellow at NIPS in 2016 &lt;a href=&#34;https://arxiv.org/pdf/1701.00160&#34; title=&#34;NIPS 2016 Tutorial: Generative Adversarial Networks&#34;&gt;here&lt;/a&gt;. I&amp;rsquo;ve read both of these (and others) as well as taking a look at other tutorials, but sometimes things just weren&amp;rsquo;t clear enough for me. &lt;a href=&#34;http://bamos.github.io/2016/08/09/deep-completion/#ml-heavy-generative-adversarial-net-gan-building-blocks&#34; title=&#34;bamos.github.io&#34;&gt;This blog from B. Amos&lt;/a&gt; has been helpful in getting my thoughts organised for this series, and hopefully I can build on it a little and make things more concrete.&lt;/p&gt;

&lt;h3&gt;What&#39;s a GAN?&lt;/h3&gt;

&lt;p&gt;GANs are used in a number of ways, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;to generate new images based upon some training data. For our tutorial, we will train with a database of faces and ask the network to produce a new face.&lt;/li&gt;
&lt;li&gt;to do &amp;lsquo;inpainting&amp;rsquo; or &amp;lsquo;image completion&amp;rsquo;. This is where part of a scene may be missing and we wish to recover the full image. It could be that we want to remove parts of the image e.g. people, and fill-in the background.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are two components in a GAN which try to work against each other (hence the &amp;lsquo;adversarial&amp;rsquo; part).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Generator (&lt;em&gt;G&lt;/em&gt;) starts off by creating a very noisy image based upon some random input data. Its job is to try to come up with images that are as real as possible.&lt;/li&gt;
&lt;li&gt;The Discriminator (&lt;em&gt;D&lt;/em&gt;) is trying to determine whether an image is real or fake.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Though these two are the primary components of the network, we also need to write some functions for importing data and dealing with the training of this two-stage network. Part 1 of this tutorial will go through some background and mathematics, in Part 2 we will do some general housekeeping and get us prepared to write the main model of our network in Part 3.&lt;/p&gt;

&lt;h2 id=&#34;maths&#34;&gt; Background &lt;/h2&gt;

&lt;p&gt;There are a number of situations where you may want to use a GAN. A common task is for image completion or &amp;lsquo;in-painting&amp;rsquo;. This would be where we have an image and would like to remove some obstruction or imperfection by replacing it with the background. Maybe there&amp;rsquo;s a lovely holiday photo of beautiful scenery, but there are some people you don&amp;rsquo;t know spoiling the view. Figure 1 shows an example of the result of image completion using PhotoShop on such an image.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img src=&#34;https://farm5.staticflickr.com/4115/4756059924_e26ae12e46_b.jpg&#34; width=&#34;100%&#34; alt=&#34;Image Completion Example&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 1&lt;/font&gt;: Removal of unwanted parts of a scene with image completion. Source: &lt;a href=&#34;https://www.flickr.com/photos/littleredelf/4756059924/in/photostream/&#34; alt=&#34;littleredelf&#34;&gt;Flickr:littleredelf&lt;/a&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;We have a couple of options if we want to try and do this kind of image completion ourselves. Let&amp;rsquo;s say we draw around an area we want to change:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If we&amp;rsquo;ve never seen a beach or the sky before, we may just have to use the neighbouring pixels to inform our in-filling. If we&amp;rsquo;re feeling fancy, we would look a little further afield and use that information too (i.e. is there just sky around the area, or is there something else?).&lt;/li&gt;
&lt;li&gt;Or&amp;hellip; we could look at the image as a whole and try to see what would fit best. For this we would have to use our knowledge of similar scenes we&amp;rsquo;ve observed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the difference between using (1) contextual and (2) perceptual information. But before we look more heavily into this, let&amp;rsquo;s take a look at the idea behind a GAN.&lt;/p&gt;

&lt;h2 id=&#34;gan&#34;&gt; Generative Adversarial Networks &lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;ve said that there are two components in a GAN, the &lt;em&gt;generator&lt;/em&gt; and the &lt;em&gt;discriminator&lt;/em&gt;. Here, we&amp;rsquo;ll look more closely at what they do.&lt;/p&gt;

&lt;p&gt;Our purpose is to create images which are as realistic as possible. So much so, that they are able to fool not only humans, but the computer that has generated them. You will often see GANs being compared to money counterfeiting: our generator is trying to create fake money whilst our discriminator is trying to tell the difference between the real and fake bills. How does this work?&lt;/p&gt;

&lt;p&gt;Say we have an image $x$ which our discriminator $D$ is analysing. $D(x)$ gives a value near to 1 if the image looks normal or &amp;lsquo;natural&amp;rsquo;, and a value near to 0 if it thinks the image is fake - if it is very noisy, for example. The generator $G$ takes a vector $z$ that has been randomly sampled from a very simple, but well known, distribution e.g. a uniform or normal distribution. The image produced by $G(z)$ then helps to train $D$: we alternate showing the discriminator a real image (which will change its parameters towards giving a high output) and an image from $G$ (which will change $D$ towards giving a low output). At the same time, we want $G$ to be learning to produce more realistic images which are more likely to fool $D$. On these generated images, $G$ wants to &lt;em&gt;maximise&lt;/em&gt; the output of $D$ whilst $D$ is trying to &lt;em&gt;minimise&lt;/em&gt; the same thing. They are playing a &lt;a href=&#34;https://en.wikipedia.org/wiki/Minimax&#34; title=&#34;Wiki: minimax&#34;&gt;&amp;lsquo;minimax&amp;rsquo;&lt;/a&gt; game against each other, which is where we get the term &amp;lsquo;adversarial&amp;rsquo; training.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img src=&#34;/img/CNN/gan1.png&#34; width=&#34;100%&#34; alt=&#34;GAN&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 2&lt;/font&gt;: Generative Adversarial Network concept. Simple, known distribution $p_z$ from which the vector $z$ is drawn. Generator $G(z)$ generates an image. Discriminator tries to determine if image came from $G$ or from the true, unknown distribution $p_{data}$.
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Let&amp;rsquo;s keep going with the maths&amp;hellip;&lt;/p&gt;

&lt;p&gt;This kind of network has a lot of latent (hidden) variables that need to be found. But we can start from a strong position by using a distribution that we know very well like a uniform distribution.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;known&lt;/strong&gt; distribution we denote $p_z$. We will randomly draw a vector $z$ from $p_z$.&lt;/li&gt;
&lt;li&gt;We know that our data must have some distribution, but we do &lt;strong&gt;not&lt;/strong&gt; know what it is. We&amp;rsquo;ll call this $p_{data}$.&lt;/li&gt;
&lt;li&gt;Our generator will try to learn its own distribution $p_g$. Our goal is for $p_g = p_{data}$.&lt;/li&gt;
&lt;/ul&gt;
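&lt;p&gt;Sampling from the known distribution $p_z$ is trivial - that&amp;rsquo;s the whole point of choosing it. As a quick sketch (the 100-dimensional $z$ here is a common DCGAN convention, not something fixed by the theory):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# one minibatch of latent vectors drawn from the known distribution p_z,
# here a uniform distribution on [-1, 1]
batch_size, z_dim = 64, 100
z = rng.uniform(-1.0, 1.0, size=(batch_size, z_dim))
print(z.shape)   # (64, 100)
```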

&lt;p&gt;We have two networks to train, $D$ and $G$:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We want $D$ to &lt;em&gt;maximise&lt;/em&gt; $D(x)$ when $x$ is drawn from our true distribution $p_{data}$, and to &lt;em&gt;minimise&lt;/em&gt; $D(G(z))$ on generated samples i.e. &lt;em&gt;maximise&lt;/em&gt; $1 - D(G(z))$&lt;/li&gt;
&lt;li&gt;whilst $G$ tries to &lt;em&gt;maximise&lt;/em&gt; $D(G(z))$ i.e. &lt;em&gt;minimise&lt;/em&gt; $1 - D(G(z))$.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More formally:&lt;/p&gt;

&lt;div&gt;$$
\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x\sim p_{data}} \left[ \log D(x)  \right]+ \mathbb{E}_{z\sim p_{z}} \left[ \log \left( 1 - D(G(z)) \right) \right]

$$
&lt;/div&gt;

&lt;p&gt;where $\mathbb{E}$ is the expectation. The advantage of working with neural networks is that we can easily compute gradients and use backpropagation for training. This is because the generator and the discriminator are defined by multi-layer perceptrons (MLPs) with parameters $\theta_g$ and $\theta_d$ respectively.&lt;/p&gt;
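&lt;p&gt;In practice the two expectations are estimated by sample means over minibatches. A small &lt;code&gt;numpy&lt;/code&gt; sketch, with stand-in definitions of $D$ and $G$, shows a known result from the Goodfellow paper: at the game&amp;rsquo;s equilibrium, where the discriminator can do no better than $D(x) = 1/2$ everywhere, the value $V$ works out to $-\log 4$:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 100000)       # stand-in for samples from p_data
z = rng.uniform(-1.0, 1.0, 100000)     # z drawn from the known p_z

def D(images):
    # at equilibrium the discriminator can do no better than 0.5 everywhere
    return np.full(len(images), 0.5)

def G(zv):
    return zv  # placeholder generator

# Monte Carlo estimate of V(D, G): mean log D(x) plus mean log(1 - D(G(z)))
V = np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))
print(round(V, 3))   # -1.386, i.e. -log(4)
```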

&lt;p&gt;We will train the networks (the $G$ and the $D$) one at a time, fixing the weights of one whilst training the other. From the GAN paper by Goodfellow &lt;em&gt;et al&lt;/em&gt; we get the &lt;em&gt;pseudo&lt;/em&gt; code for this procedure:&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img src=&#34;/img/CNN/ganalgorithm.png&#34; width=&#34;100%&#34; alt=&#34;GAN&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 3&lt;/font&gt;: &lt;i&gt;pseudo&lt;/i&gt; code for GAN training. With $k=1$ this equates to training $D$ then $G$ one after the other. Adapted from &lt;a href=&#34;https://arxiv.org/pdf/1406.2661&#34; title=&#34;Goodfellow et al. 2014&#34;&gt;Goodfellow &lt;i&gt;et al.&lt;/i&gt; 2014&lt;/a&gt;.
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Notice that with $k=1$ we are training $D$ then $G$ one after the other. What is the training actually doing? Fig. 4 shows the distribution $p_g$ of the generator in green. Notice that with each training step, $p_g$ becomes more like the true distribution of the image data, $p_{data}$, in black. After each alternation, the error is backpropagated to update the weights of the network that is not being held fixed. The discriminator eventually reaches the point where it is no longer able to tell the difference between the true and fake images, outputting $D(x) = 1/2$ everywhere.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img src=&#34;/img/CNN/ganalgographs.png&#34; width=&#34;100%&#34; alt=&#34;GAN&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 4&lt;/font&gt;: Initially (a) the generator&#39;s and true data distributions (green and black) are not very similar. (b) the discriminator (blue) is updated with generator held constant. (c) Generator is updated with discriminator held constant, until (d) $p_g$ and $p_{data}$ are most alike. Adapted from &lt;a href=&#34;https://arxiv.org/pdf/1406.2661&#34; title=&#34;Goodfellow et al. 2014&#34;&gt;Goodfellow &lt;i&gt;et al.&lt;/i&gt; 2014&lt;/a&gt;.
    &lt;/div&gt;
&lt;/div&gt;
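&lt;p&gt;To make the alternating loop of Figure 3 concrete, here is a deliberately tiny 1-D toy in plain &lt;code&gt;numpy&lt;/code&gt; - this is &lt;em&gt;not&lt;/em&gt; the DCGAN we will build later in TensorFlow. The set-up, learning rates and the non-saturating generator update are my own choices for illustration: $p_{data}$ is a Gaussian around 4, $G(z) = \theta + z$, and $D$ is a logistic classifier, with all gradients written out by hand.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30.0, 30.0)))

# p_data is N(4, 1); G(z) = theta + z with z from p_z = N(0, 1); D(x) = sigmoid(w*x + b)
w, b, theta = 0.0, 0.0, 0.0
lr_d, lr_g, k, m = 0.05, 0.1, 5, 64   # k discriminator steps per generator step

for step in range(2000):
    for _ in range(k):  # inner loop of Fig. 3: gradient ASCENT on D's objective
        x_real = rng.normal(4.0, 1.0, m)            # minibatch from p_data
        x_fake = theta + rng.normal(0.0, 1.0, m)    # minibatch of G(z)
        d_real = sigmoid(w * x_real + b)
        d_fake = sigmoid(w * x_fake + b)
        # gradients of mean(log D(x)) + mean(log(1 - D(G(z)))) w.r.t. w and b
        w += lr_d * (np.mean((1.0 - d_real) * x_real) - np.mean(d_fake * x_fake))
        b += lr_d * (np.mean(1.0 - d_real) - np.mean(d_fake))
    # outer step of Fig. 3: update G to push D(G(z)) towards 1
    z = rng.normal(0.0, 1.0, m)
    d_fake = sigmoid(w * (theta + z) + b)
    theta += lr_g * np.mean(1.0 - d_fake) * w

print(round(theta, 1))
```

&lt;p&gt;With these settings the generator&amp;rsquo;s offset $\theta$ drifts from 0 towards 4, the mean of $p_{data}$, exactly the behaviour sketched in Fig. 4: once the two distributions coincide, the gradient driving $\theta$ shrinks along with the discriminator&amp;rsquo;s advantage.&lt;/p&gt;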

&lt;h2 id=&#34;nextsteps&#34;&gt; What&#39;s Next? &lt;/h2&gt;

&lt;p&gt;That really is it. The basics of a GAN are just a game between two networks, the generator $G$, which produces images from some latent variables $z$, and the discriminator $D$ which tries to detect the faked images.&lt;/p&gt;

&lt;p&gt;Implementing this in Python seems old-hat to many and there are many pre-built solutions available. The work in this tutorial series will mostly follow the base-code from &lt;a href=&#34;https://github.com/carpedm20/DCGAN-tensorflow&#34; title=&#34;carpedm20/DCGAN-tensorflow&#34;&gt;carpedm20&amp;rsquo;s DCGAN-tensorflow repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the next post, we&amp;rsquo;ll get ourselves organised, make sure we have some dependencies, create some files and get our training data sorted.&lt;/p&gt;

&lt;p&gt;As always, if there&amp;rsquo;s anything wrong or that doesn&amp;rsquo;t make sense, &lt;strong&gt;please&lt;/strong&gt; get in contact and let me know. A comment here is great.&lt;/p&gt;
    </item>
    
    <item>
      <title>Convolutional Neural Networks - TensorFlow (Basics)</title>
      <link>/post/tensorflow-basics/</link>
      <pubDate>Mon, 03 Jul 2017 09:44:24 +0100</pubDate>
      
      <guid>/post/tensorflow-basics/</guid>
<description>&lt;p&gt;We&amp;rsquo;ve looked at the principles behind how a CNN works, but how do we actually implement this in Python? This tutorial will look at the basic idea behind Google&amp;rsquo;s TensorFlow: an efficient way to build a CNN using purpose-built Python libraries.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;div style=&#34;text-align:center;&#34;&gt;&lt;img width=30% title=&#34;TensorFlow&#34; src=&#34;/img/CNN/TF_logo.png&#34;&gt;&lt;/div&gt;

&lt;h2 id=&#34;intro&#34;&gt;  Introduction &lt;/h2&gt;

&lt;p&gt;Building a CNN from scratch in Python is perfectly possible, but very memory intensive. It can also lead to very long pieces of code. Several libraries have been developed by the community to solve this problem by wrapping the most common parts of CNNs into special methods called from their own libraries. Theano, Keras and PyTorch are notable open-source libraries in use today. However, since TensorFlow was released and Google announced its machine-learning-specific hardware, the Tensor Processing Unit (TPU), TensorFlow has quickly become a much-used tool in the field. If an application being built today is intended for use on mobile devices, TensorFlow is the way to go, as the mobile TPU in the upcoming Google phones will be able to perform inference from machine-learning models in the user&amp;rsquo;s hand. Of course, being a relative newcomer, with updates still very much controlled by Google, TensorFlow may not have the huge body of support that has built up around Theano, say.&lt;/p&gt;

&lt;p&gt;Nevertheless, TensorFlow is powerful and quick to set up so long as you know how: read on to find out. Much of this tutorial is based around the documentation provided by Google, but gives a lot more information that may be useful to less experienced users.&lt;/p&gt;

&lt;h2 id=&#34;install&#34;&gt; Installation &lt;/h2&gt;

&lt;p&gt;TensorFlow is just another set of Python libraries, distributed by Google via the website: &lt;a href=&#34;https://www.tensorflow.org/install&#34; title=&#34;TensorFlow Installation&#34;&gt;https://www.tensorflow.org/install&lt;/a&gt;. There&amp;rsquo;s the option to install the version for use on GPUs, but that&amp;rsquo;s not necessary for this tutorial; we&amp;rsquo;ll be using the MNIST dataset, which is not too memory intensive.&lt;/p&gt;

&lt;p&gt;Go ahead and install the TensorFlow libraries. Even though they suggest using TF in a virtual environment, we will be coding up our CNN in a plain Python script, so don&amp;rsquo;t worry about that if you&amp;rsquo;re not comfortable with it.&lt;/p&gt;

&lt;p&gt;One of the most frustrating things you will find with TF is that much of the documentation on various websites is already out-of-date. Some of the commands have been re-written or renamed since those pages were written. Even some of Google&amp;rsquo;s own tutorials are now old and require tweaking. Currently, the code written here will work on recent versions, but may throw some &amp;lsquo;deprecation&amp;rsquo; warnings.&lt;/p&gt;

&lt;h2 id=&#34;structure&#34;&gt; TensorFlow Structure &lt;/h2&gt;

&lt;p&gt;The idea of &amp;lsquo;flow&amp;rsquo; is central to TF&amp;rsquo;s organisation. The actual CNN is written as a &amp;lsquo;graph&amp;rsquo;. A graph is simply a list of the different layers in your network, each with its own input and output. Whatever data we input at the top will &amp;lsquo;flow&amp;rsquo; through the graph and output some values. We deal with those values using TensorFlow too, which will automatically take care of updating any internal weights via whatever optimisation method and loss function we prefer.&lt;/p&gt;

&lt;p&gt;The graph is called by some initial functions in the script that create the classifier, run the training and output whatever evaluation metrics we like.&lt;/p&gt;

&lt;p&gt;Before writing any functions, let&amp;rsquo;s import the necessary modules and tell TF to limit its program logging:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np
import os
import tensorflow as tf
from tensorflow.contrib import learn
from tensorflow.contrib.learn.python.learn.estimators import model_fn as model_fn_lib


os.environ[&#39;TF_CPP_MIN_LOG_LEVEL&#39;] = &#39;3&#39;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We&amp;rsquo;ve included multiple TF lines to save on the typing later.&lt;/p&gt;

&lt;h3 id=&#34;graph&#34;&gt; The Graph &lt;/h3&gt;

&lt;p&gt;Let&amp;rsquo;s get straight to it and start to build our graph. We will keep it simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2 convolutional layers learning 16 filters (or kernels) of [3 x 3]&lt;/li&gt;
&lt;li&gt;2 max-pooling layers that halve the size of the image using a [2 x 2] kernel&lt;/li&gt;
&lt;li&gt;A fully connected layer at the end.&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;#Hyperparameters
numK = 16               #number of kernels in each conv layer
sizeConvK = 3           #size of the kernels in each conv layer [n x n]
sizePoolK = 2           #size of the kernels in each pool layer [m x m]
inputSize = 28          #size of the input image
numChannels = 1         #number of channels to the input image grayscale=1, RGB=3

def convNet(inputs, labels, mode):
    #reshape the input from a vector to a 2D image
    input_layer = tf.reshape(inputs, [-1, inputSize, inputSize, numChannels])   
    
    #perform convolution and pooling
    conv1 = doConv(input_layer) 
    pool1 = doPool(conv1)      
    
    conv2 = doConv(pool1)
    pool2 = doPool(conv2)

    #flatten the result back to a vector for the FC layer
    flatPool = tf.reshape(pool2, [-1, 7 * 7 * numK])    
    dense = tf.layers.dense(inputs=flatPool, units=1024, activation=tf.nn.relu)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So what&amp;rsquo;s going on here? First we&amp;rsquo;ve defined some parameters for the CNN such as kernel sizes, the height of the input image (assuming it&amp;rsquo;s square) and the number of channels for the image. The number of channels is &lt;code&gt;1&lt;/code&gt; both for black-and-white images, with intensity values of either 0 or 1, and for grayscale images with intensities in the range [0, 255]. Colour images have &lt;code&gt;3&lt;/code&gt; channels: Red, Green and Blue.&lt;/p&gt;

&lt;p&gt;You&amp;rsquo;ll notice that we&amp;rsquo;ve barely used TF so far: we use it to reshape the data. This is important: when we run our script, TF will take our raw data and turn it into its own data type, i.e. a &lt;code&gt;tensor&lt;/code&gt;. That means our normal &lt;code&gt;numpy&lt;/code&gt; operations won&amp;rsquo;t work on it, so we should use the built-in &lt;code&gt;tf.reshape&lt;/code&gt; function, which works in the same way as the one in numpy - it takes the input data and an output shape as arguments.&lt;/p&gt;

&lt;p&gt;But why are we reshaping at all? Well, the data that is input into the network will be in the form of vectors. The image will have been saved along with lots of other images as single lines of a larger file. This is the case with the MNIST dataset and is common in machine learning. So we need to put it back into image-form so that we can perform convolutions.&lt;/p&gt;

&lt;p&gt;&amp;ldquo;Where are those random 7s and the -1 from?&amp;rdquo;&amp;hellip; good question. In this example, we are going to be using the MNIST dataset, whose images are 28 x 28. If we put this through 2 pooling layers, the width will halve (14 x 14) and halve again (7 x 7). Thus the layer needs to know what to expect as input, which will be a 7 x 7 x &lt;code&gt;numK&lt;/code&gt; tensor: one 7 x 7 map for each kernel. Keep in mind that we will be running the network with more than one input image at a time, so in reality when we get to this stage there will be &lt;code&gt;n&lt;/code&gt; images here, each with 7 x 7 x &lt;code&gt;numK&lt;/code&gt; values associated with it. The -1 simply tells TensorFlow to take &lt;em&gt;all&lt;/em&gt; of these images and do the same to each. It&amp;rsquo;s shorthand for &amp;ldquo;do this for the whole batch&amp;rdquo;.&lt;/p&gt;
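&lt;p&gt;The arithmetic is easy to check by hand, and &lt;code&gt;numpy&lt;/code&gt;&amp;rsquo;s &lt;code&gt;reshape&lt;/code&gt; uses the same -1 convention, so we can see it in action without a TF graph:&lt;/p&gt;

```python
import numpy as np

size, numK = 28, 16
for _ in range(2):       # each 'SAME' conv keeps the size; each stride-2 pool halves it
    size = size // 2
print(size)              # 7

batch = np.zeros((5, size, size, numK))       # a batch of 5 images after the second pool
flat = batch.reshape(-1, size * size * numK)  # -1 means 'infer the batch dimension'
print(flat.shape)        # (5, 784)
```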

&lt;p&gt;There&amp;rsquo;s also a &lt;code&gt;tf.layers.dense&lt;/code&gt; method at the end here. This is one of TF&amp;rsquo;s in-built layer types that is very handy. We just tell it what to take as input, how many units we want it to have and what non-linearity we would prefer at the end. Instead of typing this all separately, it&amp;rsquo;s combined into a single line. Neat!&lt;/p&gt;

&lt;p&gt;But what about the &lt;code&gt;conv&lt;/code&gt; and &lt;code&gt;pool&lt;/code&gt; layers? Well, to keep the code nice and tidy, I like to write the convolution and pooling layers in separate functions. This means that if I want to add more &lt;code&gt;conv&lt;/code&gt; or &lt;code&gt;pool&lt;/code&gt; layers, I can just write them in underneath the current ones and the code will still look clean (not that the functions are very long). Here they are:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def doConv(inputs):
    convOut = tf.layers.conv2d(inputs=inputs, filters=numK, kernel_size=[sizeConvK, sizeConvK], \
    	padding=&amp;quot;SAME&amp;quot;, activation=tf.nn.relu)    
    return convOut
    
def doPool(inputs):
    poolOut = tf.layers.max_pooling2d(inputs=inputs, pool_size=[sizePoolK, sizePoolK], strides=2)
    return poolOut
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Again, both the &lt;code&gt;conv&lt;/code&gt; and &lt;code&gt;pool&lt;/code&gt; layers are simple one-liners. They both take in some input data and need to know the size of the kernel you want them to use (which we defined earlier on). The &lt;code&gt;conv&lt;/code&gt; layer needs to know how many &lt;code&gt;filters&lt;/code&gt; to learn too. Alongside this, we need to take care of any mis-match between the image size and the size of the kernels to ensure that we&amp;rsquo;re not changing the size of the image when we get the output. This is easily done in TF by setting the &lt;code&gt;padding&lt;/code&gt; attribute to &lt;code&gt;&amp;quot;SAME&amp;quot;&lt;/code&gt;. We&amp;rsquo;ve got our non-linearity at the end here too. We&amp;rsquo;ve hard-coded &lt;code&gt;strides=2&lt;/code&gt;, so the image will halve in size at each pooling layer.&lt;/p&gt;

&lt;p&gt;Now we have the main part of our network coded up. But it won&amp;rsquo;t do very much unless we ask TF to give us some outputs and compare them to some training data.&lt;/p&gt;

&lt;p&gt;As the MNIST data is used for image-classification problems, we&amp;rsquo;ll be trying to get the network to output the probability that a given image belongs to each specific class i.e. a digit 0-9. The MNIST labels are the plain digits 0-9 which, if fed to the network directly, would encourage it to output arbitrary decimal guesses like 0.143, 4.765 or 8.112. Instead, we want each class to have its own slot to which the network can assign a probability. We use the idea of &amp;lsquo;one-hot&amp;rsquo; labels for this. For example, class 3 becomes [0 0 0 1 0 0 0 0 0 0] and class 9 becomes [0 0 0 0 0 0 0 0 0 1]. This way we&amp;rsquo;re not asking the network to predict the number associated with each class but rather how likely the test image is to belong to each class.&lt;/p&gt;
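&lt;p&gt;As a quick illustration (a NumPy sketch for clarity, not TF code), here is the same one-hot conversion done by hand:&lt;/p&gt;

```python
import numpy as np

# NumPy sketch of one-hot encoding: each label becomes a row of zeros
# with a single 1 in the slot for its class.
def one_hot(labels, depth=10):
    out = np.zeros((len(labels), depth), dtype=np.int32)
    out[np.arange(len(labels)), labels] = 1
    return out

print(one_hot([3, 9]))
# [[0 0 0 1 0 0 0 0 0 0]
#  [0 0 0 0 0 0 0 0 0 1]]
```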

&lt;p&gt;TF has a very handy function for changing class labels into &amp;lsquo;one-hot&amp;rsquo; labels. Let&amp;rsquo;s continue coding our graph in the &lt;code&gt;convNet&lt;/code&gt; function.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;     #Get the output in the form of one-hot labels with x units
    logits = tf.layers.dense(inputs=dense, units=10) 
    
    loss = None
    train_op = None
    #At the end of the network, check how well we did     
    if mode != learn.ModeKeys.INFER:
        #create one-hot labels from the training-labels
        onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=10)
        #check how close the output is to the training-labels
        loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)
    
    #After checking the loss, use it to train the network weights   
    if mode == learn.ModeKeys.TRAIN:
        train_op = tf.contrib.layers.optimize_loss(loss=loss, global_step=tf.contrib.framework.get_global_step(), \
            learning_rate=learning_rate, optimizer=&amp;quot;SGD&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;logits&lt;/code&gt; here is the output of the network which corresponds to the 10 classes of the training labels. The next two sections check whether we should be training the weights right now, or checking how well we&amp;rsquo;ve done. First we check our progress: we use &lt;code&gt;tf.one_hot&lt;/code&gt; to create the one-hot labels from the numeric training labels given to the network in &lt;code&gt;labels&lt;/code&gt;. We&amp;rsquo;ve performed a &lt;code&gt;tf.cast&lt;/code&gt; operation to make sure that the data is of the correct type before doing the conversion.&lt;/p&gt;

&lt;p&gt;Our loss-function is an important part of a CNN (or any machine learning algorithm). There are many different loss functions already built into TensorFlow, from the simple &lt;code&gt;absolute_difference&lt;/code&gt; to more complex functions like our &lt;code&gt;softmax_cross_entropy&lt;/code&gt;. We won&amp;rsquo;t delve into how this is calculated; just know that we can pick any suitable loss function, and more advanced users can write their own. The loss function takes in the output of the network &lt;code&gt;logits&lt;/code&gt; and compares it to our &lt;code&gt;onehot_labels&lt;/code&gt;.&lt;/p&gt;
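&lt;p&gt;For the curious, here&amp;rsquo;s a hand-rolled NumPy sketch of what a softmax cross-entropy loss computes. This illustrates the idea only; it is not the TF implementation:&lt;/p&gt;

```python
import numpy as np

# Sketch of softmax cross-entropy: turn logits into probabilities,
# then penalise with -log(probability assigned to the true class).
def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stabilised
    return e / e.sum(axis=1, keepdims=True)

def softmax_cross_entropy(onehot_labels, logits):
    probs = softmax(logits)
    losses = -np.sum(onehot_labels * np.log(probs), axis=1)
    return losses.mean()   # average over the batch

# Uniform logits over 10 classes give a loss of ln(10), about 2.3 --
# the ballpark you'd expect from an untrained 10-class network.
print(softmax_cross_entropy(np.eye(10)[[3]], np.zeros((1, 10))))
```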

&lt;p&gt;When this is done, we ask TF to perform some updating or &amp;lsquo;optimisation&amp;rsquo; of the network based on the loss that we just calculated. The &lt;code&gt;train_op&lt;/code&gt; is the name the TF support documents give to the operation that updates the fundamentals of the network and its values. Our &lt;code&gt;train_op&lt;/code&gt; here is a simple loss-optimiser that tries to find the minimum loss for our data. As with all machine learning algorithms, the parameters of this optimiser are the subject of much research. Using a pre-built optimiser such as those included with TF will help ensure that your network performs efficiently and trains as quickly as possible. The &lt;code&gt;learning_rate&lt;/code&gt; can be set as a variable at the beginning of our script along with the other parameters. We tend to stick with &lt;code&gt;0.001&lt;/code&gt; to begin with and move in orders of magnitude if we need to e.g. &lt;code&gt;0.01&lt;/code&gt; or &lt;code&gt;0.0001&lt;/code&gt;. Just like the loss functions, there are a number of optimisers to choose from, and the more complex ones may take longer per step. For our purposes on the MNIST dataset, simple stochastic gradient descent (&lt;code&gt;SGD&lt;/code&gt;) will suffice.&lt;/p&gt;
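&lt;p&gt;Under the hood, a single SGD update is nothing mysterious. As a hedged sketch (plain NumPy, not what TF literally executes), each weight is nudged against its gradient, scaled by the learning rate:&lt;/p&gt;

```python
import numpy as np

# Minimal sketch of one SGD update step: w_new = w - learning_rate * grad
def sgd_step(weights, grads, learning_rate=0.001):
    return weights - learning_rate * grads

w = np.array([0.5, -0.2])
g = np.array([10.0, -10.0])   # pretend gradients of the loss w.r.t. w
print(sgd_step(w, g))         # -> [ 0.49 -0.19]
```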

&lt;p&gt;Notice that we are just giving TF some instructions: take my network, calculate the loss and do some optimisation based on that loss.&lt;/p&gt;

&lt;p&gt;We are going to want to show what the network has learned, so we output the current predictions by defining a dictionary of data: the predicted classes and the associated probabilities (found by taking the softmax of the logits tensor).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;predictions ={&amp;quot;classes&amp;quot;: tf.argmax(input=logits, axis=1), &amp;quot;probabilities&amp;quot;: tf.nn.softmax(logits, name=&amp;quot;softmax_tensor&amp;quot;)}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can finish off our graph by making sure it returns the data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;return model_fn_lib.ModelFnOps(mode=mode, predictions=predictions, loss=loss, train_op=train_op)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A &lt;code&gt;ModelFnOps&lt;/code&gt; object is returned that contains the current mode of the network (training or inference), the current predictions, the loss and the &lt;code&gt;train_op&lt;/code&gt; that we use to train the network.&lt;/p&gt;

&lt;h3 id=&#34;setup&#34;&gt;Setting up the Script&lt;/h3&gt;

&lt;p&gt;Now that the graph has been constructed, we need to call it and tell TF to do the training. First, let&amp;rsquo;s take a moment to load the data that we will be using. The MNIST dataset has its own loading method within TF (handy!). Let&amp;rsquo;s define the main body of our script:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def main(unused_argv):
    # Load training and eval data
    mnist = learn.datasets.load_dataset(&amp;quot;mnist&amp;quot;)
    train_data = mnist.train.images # Returns np.array
    train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
    eval_data = mnist.test.images # Returns np.array
    eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next, we create the classifier that will hold the network and all of its data. We have to tell it what our graph is called under &lt;code&gt;model_fn&lt;/code&gt; and where we would like our output stored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you use the &lt;code&gt;/tmp&lt;/code&gt; directory in Linux you will probably find that the model will no longer be there if you restart your computer. If you intend to reload and use your model later on, be sure to save it in a more convenient place.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    mnistClassifier = learn.Estimator(model_fn=convNet, model_dir=&amp;quot;/tmp/mln_MNIST&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We will want to get some information out of our network that tells us about the training performance. For example, we can create a dictionary that will hold the probabilities from the key that we named &amp;lsquo;softmax_tensor&amp;rsquo; in the graph. How often we save this information is controlled with the &lt;code&gt;every_n_iter&lt;/code&gt; attribute. We add this to the &lt;code&gt;tf.train.LoggingTensorHook&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    tensors2log = {&amp;quot;probabilities&amp;quot;: &amp;quot;softmax_tensor&amp;quot;}
    logging_hook = tf.train.LoggingTensorHook(tensors=tensors2log, every_n_iter=100)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally! Let&amp;rsquo;s get TF to actually train the network. We call the &lt;code&gt;.fit&lt;/code&gt; method of the classifier that we created earlier. We pass it the training data and the labels along with the batch size (i.e. how much of the training data we want to use in each iteration). Bear in mind that even though the MNIST images are very small, there are 60,000 of them, which may put a strain on your RAM. We also need to say what the maximum number of iterations we&amp;rsquo;d like TF to perform is and add that we want to &lt;code&gt;monitor&lt;/code&gt; the training by outputting the data we&amp;rsquo;ve requested in &lt;code&gt;logging_hook&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    mnistClassifier.fit(x=train_data, y=train_labels, batch_size=100, steps=1000, monitors=[logging_hook])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When the training is complete, we&amp;rsquo;d like TF to take some test-data and tell us how well the network performs. So we create a special metrics dictionary that TF will populate by calling the &lt;code&gt;.evaluate&lt;/code&gt; method of the classifier.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    metrics = {&amp;quot;accuracy&amp;quot;: learn.MetricSpec(metric_fn=tf.metrics.accuracy, prediction_key=&amp;quot;classes&amp;quot;)}
    
    eval_results = mnistClassifier.evaluate(x=eval_data, y=eval_labels, metrics=metrics)
    print(eval_results)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this case, we&amp;rsquo;ve chosen to find the accuracy of the classifier by using the &lt;code&gt;tf.metrics.accuracy&lt;/code&gt; value for the &lt;code&gt;metric_fn&lt;/code&gt;. We also need to tell the evaluator that it&amp;rsquo;s the &amp;lsquo;classes&amp;rsquo; key we&amp;rsquo;re looking at in the graph. This is then passed to the evaluator along with the test data.&lt;/p&gt;
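&lt;p&gt;The accuracy metric itself is simple to state. As a NumPy sketch (for illustration; TF computes this for us):&lt;/p&gt;

```python
import numpy as np

# Accuracy: the fraction of predicted classes (argmax over the logits)
# that match the true labels.
def accuracy(logits, labels):
    predicted = np.argmax(logits, axis=1)
    return np.mean(predicted == np.asarray(labels))

logits = np.array([[0.1, 2.0, 0.3],
                   [1.5, 0.2, 0.1],
                   [0.1, 0.2, 3.0]])
print(accuracy(logits, [1, 0, 1]))  # 2 of 3 predictions are correct
```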

&lt;h3 id=&#34;running&#34;&gt;Running the Network&lt;/h3&gt;

&lt;p&gt;Adding the final main function to the script and making sure we&amp;rsquo;ve done all the necessary includes, we can run the program. The full script can be found &lt;a href=&#34;/docs/tfCNNMNIST.py&#34; title=&#34;TFCNNMNIST.py&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the current configuration, running the network for 1,000 iterations gave me an output of:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;{&#39;loss&#39;: 1.9025836, &#39;global_step&#39;: 1000, &#39;accuracy&#39;: 0.64929998}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Definitely not a great accuracy for the MNIST dataset! We could just run this for longer and would likely see an increase in accuracy. Instead, let&amp;rsquo;s make some of the easy tweaks to our network that we&amp;rsquo;ve described before: dropout and batch normalisation.&lt;/p&gt;

&lt;p&gt;In our graph, we want to add:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    dense = tf.contrib.layers.batch_norm(dense, decay=0.99, is_training= mode==learn.ModeKeys.TRAIN)
    dense = tf.layers.dropout(inputs=dense, rate=keepProb, training = mode==learn.ModeKeys.TRAIN)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This layer &lt;a href=&#34;https://www.tensorflow.org/api_docs/python/tf/contrib/layers/batch_norm&#34; title=&#34;tf.contrib.layers.batch_norm&#34;&gt;has many different attributes&lt;/a&gt;. Its functionality is taken from &lt;a href=&#34;https://arxiv.org/abs/1502.03167&#34; title=&#34;Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift&#34;&gt;the paper by Ioffe and Szegedy (2015)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The dropout layer&amp;rsquo;s &lt;code&gt;keepProb&lt;/code&gt; is defined in the hyperparameter preamble to the script; it is another value that can be changed to improve the performance of the network. Both of these lines are in the final script &lt;a href=&#34;/docs/tfCNNMNIST.py&#34; title=&#34;tfCNNMNIST.py&#34;&gt;available here&lt;/a&gt;, just uncomment them.&lt;/p&gt;
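&lt;p&gt;To make the dropout mechanics concrete, here&amp;rsquo;s a NumPy sketch of &amp;lsquo;inverted&amp;rsquo; dropout. One caution worth knowing: &lt;code&gt;tf.layers.dropout&lt;/code&gt; interprets its &lt;code&gt;rate&lt;/code&gt; argument as the fraction of units to &lt;em&gt;drop&lt;/em&gt;, not to keep, so a variable holding a keep-probability would need to be passed as &lt;code&gt;1 - keepProb&lt;/code&gt;:&lt;/p&gt;

```python
import numpy as np

# NumPy sketch of inverted dropout. `rate` is the fraction DROPPED
# (the tf.layers.dropout convention); survivors are scaled up by
# 1/(1 - rate) so the expected activation stays the same.
def dropout(x, rate, rng):
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones(8)
print(dropout(x, 0.5, rng))   # roughly half the entries zeroed, the rest scaled to 2.0
```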

&lt;p&gt;If we re-run the script, it will automatically load the most recent state of the network (clever TensorFlow!) but&amp;hellip; it will fail because the checkpoint does not include the two new layers in its graph. So we must either delete our &lt;code&gt;/tmp/mln_MNIST&lt;/code&gt; folder, or give the classifier a new &lt;code&gt;model_dir&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Doing this and rerunning for the same 1,000 iterations, accuracy jumps from around 0.65 to 0.92, roughly a 40% relative increase:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;{&#39;loss&#39;: 0.29391664, &#39;global_step&#39;: 1000, &#39;accuracy&#39;: 0.91680002}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Simply changing the optimiser to use the &amp;ldquo;Adam&amp;rdquo; rather than &amp;ldquo;SGD&amp;rdquo; optimiser yields:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;{&#39;loss&#39;: 0.040745325, &#39;global_step&#39;: 1000, &#39;accuracy&#39;: 0.98500001}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And running for slightly longer (20,000 iterations):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;{&#39;loss&#39;: 0.046967514, &#39;global_step&#39;: 20000, &#39;accuracy&#39;: 0.99129999}
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;conclusion&#34;&gt; Conclusion &lt;/h2&gt;

&lt;p&gt;TensorFlow takes away the tedium of having to write out the full code for each individual layer and is able to perform optimisation and evaluation with minimal effort.&lt;/p&gt;

&lt;p&gt;If you look around online, you will see many methods for using TF that will get you similar results. I actually prefer methods that are a little more explicit. The tutorial on Google&amp;rsquo;s site, for example, leaves some room for us to include more logging features.&lt;/p&gt;

&lt;p&gt;In future posts, we will look more into logging and TensorBoard, but for now, happy coding!&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Convolutional Neural Networks - Basics</title>
      <link>/post/CNN1/</link>
      <pubDate>Fri, 07 Apr 2017 09:46:56 +0100</pubDate>
      
      <guid>/post/CNN1/</guid>
      <description>&lt;p&gt;This series will give some background to CNNs, their architecture, coding and tuning. In particular, this tutorial covers some of the background to CNNs and Deep Learning. We won&amp;rsquo;t go over any coding in this session, but that will come in the next one. What&amp;rsquo;s the big deal about CNNs? What do they look like? Why do they work? Find out in this tutorial.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;h2 id=&#34;intro&#34;&gt;  Introduction &lt;/h2&gt;

&lt;p&gt;A convolutional neural network (CNN) is very much related to the standard NN we&amp;rsquo;ve previously encountered. I found that when I searched for the link between the two, there seemed to be no natural progression from one to the other in terms of tutorials. It would seem that CNNs were developed in the late 1980s and then forgotten about due to the lack of processing power. In fact, it wasn&amp;rsquo;t until the advent of cheap, but powerful GPUs (graphics cards) that the research on CNNs and Deep Learning in general was given new life. Thus you&amp;rsquo;ll find an explosion of papers on CNNs in the last 3 or 4 years.&lt;/p&gt;

&lt;p&gt;Nonetheless, the research that has been churned out is &lt;em&gt;powerful&lt;/em&gt;. CNNs are used in so many applications now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Object recognition in images and videos (think image-search in Google, tagging friends faces in Facebook, adding filters in Snapchat and tracking movement in Kinect)&lt;/li&gt;
&lt;li&gt;Natural language processing (speech recognition in Google Assistant or Amazon&amp;rsquo;s Alexa)&lt;/li&gt;
&lt;li&gt;Playing games (the recent &lt;a href=&#34;https://en.wikipedia.org/wiki/AlphaGo&#34; title=&#34;AlphaGo on Wiki&#34;&gt;defeat of the world &amp;lsquo;Go&amp;rsquo; champion&lt;/a&gt; by DeepMind at Google)&lt;/li&gt;
&lt;li&gt;Medical innovation (from drug discovery to prediction of disease)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Despite the differences between these applications and the ever-increasing sophistication of CNNs, they all start out in the same way. Let&amp;rsquo;s take a look.&lt;/p&gt;

&lt;h2 id=&#34;deep&#34;&gt;  CNN or Deep Learning? &lt;/h2&gt;

&lt;p&gt;
The &#34;deep&#34; part of deep learning comes in a couple of places: the number of layers and the number of features. Firstly, as one may expect, there are usually more layers in a deep learning framework than in your average multi-layer perceptron or standard neural network. We have some architectures that are 150 layers deep. Secondly, each layer of a CNN will learn multiple &#39;features&#39; (multiple sets of weights) that connect it to the previous layer; so in this sense it&#39;s much deeper than a normal neural net too. In fact, some powerful neural networks, even CNNs, only consist of a few layers. So the &#39;deep&#39; in DL acknowledges that each layer of the network learns multiple features. More on this later.
&lt;/p&gt;&lt;p&gt;
Often you may see a conflation of CNNs with DL, but the concept of DL comes some time before CNNs were first introduced. Connecting multiple neural networks together, altering the directionality of their weights and stacking such machines all gave rise to the increasing power and popularity of DL.
&lt;/p&gt;&lt;p&gt;
We won&#39;t delve too deeply into history or mathematics in this tutorial, but if you want to know the timeline of DL in more detail, I&#39;d suggest the paper &#34;On the Origin of Deep Learning&#34; (Wang and Raj 2016) available &lt;a href=&#34;https://t.co/aAw4rEpZEt&#34; title=&#34;On the Origin of Deep Learning&#34;&gt;here&lt;/a&gt;. It&#39;s a lengthy read - 72 pages including references - but shows the logic between progressive steps in DL.
&lt;/p&gt;&lt;p&gt;
As with the study of neural networks, the inspiration for CNNs came from nature: specifically, the visual cortex. It drew upon the idea that the neurons in the visual cortex focus upon different sized patches of an image getting different levels of information in different layers. If a computer could be programmed to work in this way, it may be able to mimic the image-recognition power of the brain. So how can this be done?
&lt;/p&gt;

&lt;p&gt;A CNN takes as input an array, or image (2D or 3D, grayscale or colour) and tries to learn the relationship between this image and some target data e.g. a classification. By &amp;lsquo;learn&amp;rsquo; we are still talking about weights just like in a regular neural network. The difference in CNNs is that these weights connect small subsections of the input to each of the different neurons in the first layer. Fundamentally, there are multiple neurons in a single layer that each have their own weights to the same subsection of the input. These different sets of weights are called &amp;lsquo;kernels&amp;rsquo;.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s important at this stage to make sure we understand this weight or kernel business, because it&amp;rsquo;s the whole point of the &amp;lsquo;convolution&amp;rsquo; bit of the CNN.&lt;/p&gt;

&lt;h2 id=&#34;kernels&#34;&gt; Convolution and Kernels &lt;/h2&gt;

&lt;p&gt;Convolution is something that should be taught in schools along with addition and multiplication - it&amp;rsquo;s &lt;a href=&#34;https://en.wikipedia.org/wiki/Convolution&#34; title=&#34;Convolution on Wiki&#34;&gt;just another mathematical operation&lt;/a&gt;. Perhaps the reason it&amp;rsquo;s not is that it&amp;rsquo;s a little more difficult to visualise.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s say we have a pattern or a stamp that we want to repeat at regular intervals on a sheet of paper. A very convenient way to do this is to perform a convolution of the pattern with a regular grid on the paper. Think about hovering the stamp (or kernel) above the paper and moving it along a grid before pushing it into the page at each interval.&lt;/p&gt;

&lt;p&gt;This idea of wanting to repeat a pattern (kernel) across some domain comes up a lot in the realm of signal processing and computer vision. In fact, if you&amp;rsquo;ve ever used a graphics package such as Photoshop, Inkscape or GIMP, you&amp;rsquo;ll have seen many kernels before. The list of &amp;lsquo;filters&amp;rsquo; such as &amp;lsquo;blur&amp;rsquo;, &amp;lsquo;sharpen&amp;rsquo; and &amp;lsquo;edge-detection&amp;rsquo; are all done with a convolution of a kernel or filter with the image that you&amp;rsquo;re looking at.&lt;/p&gt;

&lt;p&gt;For example, let&amp;rsquo;s find the outline (edges) of the image &amp;lsquo;A&amp;rsquo;.&lt;/p&gt;

&lt;div style=&#34;text-align:center; display:inline-block; width:100%; margin:auto;&#34;&gt;
&lt;img title=&#34;Android&#34; src=&#34;/img/CNN/android.png&#34;&gt;&lt;br&gt;
&lt;b&gt;A&lt;/b&gt;
&lt;/div&gt;

&lt;p&gt;We can use a kernel, or set of weights, like the ones below.&lt;/p&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:49%; margin:auto;min-width:155px;&#34;&gt;
&lt;img title=&#34;Horizontal Filter&#34; height=150 src=&#34;/img/CNN/horizFilter.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Finds horizontals&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:150px;display:inline-block; width:49%;margin:auto;&#34;&gt;
&lt;img title=&#34;Vertical Filter&#34; height=150 src=&#34;/img/CNN/vertFilter.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Finds verticals&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;
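&lt;p&gt;Written out as arrays, the classic 3 x 3 Sobel pair (the kind pictured above) looks like this in NumPy; note that one kernel is simply the transpose of the other:&lt;/p&gt;

```python
import numpy as np

# The classic 3 x 3 Sobel kernels: one responds to horizontal edges,
# the other (its transpose) to vertical edges.
sobel_horizontal = np.array([[-1, -2, -1],
                             [ 0,  0,  0],
                             [ 1,  2,  1]])
sobel_vertical = sobel_horizontal.T

print(sobel_vertical)
# [[-1  0  1]
#  [-2  0  2]
#  [-1  0  1]]
```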

&lt;p&gt;A kernel is placed in the top-left corner of the image. The pixel values covered by the kernel are multiplied with the corresponding kernel values and the products are summed. The result is placed in the new image at the point corresponding to the centre of the kernel. An example of this first step is shown in the diagram below. This takes the vertical Sobel filter (used for edge-detection) and applies it to the pixels of the image.&lt;/p&gt;

&lt;div style=&#34;text-align:center; display:inline-block; width:100%;margin:auto;&#34;&gt;
&lt;img title=&#34;Conv Example&#34; height=&#34;350&#34; src=&#34;/img/CNN/convExample.png&#34;&gt;&lt;br&gt;
&lt;b&gt;A step in the Convolution Process.&lt;/b&gt;
&lt;/div&gt;

&lt;p&gt;The kernel is moved over by one pixel and this process is repeated until all of the possible locations in the image are filtered, as below, this time for the horizontal Sobel filter. Notice that there is a border of empty values around the convolved image. This is because the result of convolution is placed at the centre of the kernel. To deal with this, a process called &amp;lsquo;padding&amp;rsquo; or, more commonly, &amp;lsquo;zero-padding&amp;rsquo; is used. This simply means that a border of zeros is placed around the original image to make it a pixel wider all around. The convolution is then done as normal, and the result will now be an image that is of equal size to the original.&lt;/p&gt;

&lt;div style=&#34;width:100%;margin:auto; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:45%;min-width:455px;margin:auto;&#34;&gt;
&lt;img title=&#34;Sobel Conv Gif&#34; height=&#34;450&#34; src=&#34;/img/CNN/convSobel.gif&#34;&gt;&lt;br&gt;
&lt;b&gt;The kernel is moved over the image performing the convolution as it goes.&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:45%;min-width:450px;margin:auto;&#34;&gt;
&lt;img title=&#34;Zero Padding Conv&#34; height=&#34;450&#34; src=&#34;/img/CNN/convZeros.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Zero-padding is used so that the resulting image doesn&#39;t shrink.&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;
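&lt;p&gt;The sweep-and-sum process above can be written in a few lines of NumPy. This is a sketch for clarity, not an efficient implementation (strictly speaking it computes cross-correlation, which is what most CNN libraries call &amp;lsquo;convolution&amp;rsquo; anyway):&lt;/p&gt;

```python
import numpy as np

# Zero-padded 'convolution': sweep the kernel over every position,
# multiply element-wise with the pixels underneath and sum.
def conv2d_same(image, kernel):
    k = kernel.shape[0]            # assume a square, odd-sized kernel
    pad = k // 2
    padded = np.pad(image, pad)    # the border of zeros keeps the output size equal
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

image = np.ones((4, 4))
identity = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
print(conv2d_same(image, identity))   # the identity kernel leaves the image unchanged
```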

&lt;p&gt;Now that we have our convolved image, we can use a colourmap to visualise the result. Here, I&amp;rsquo;ve just normalised the values between 0 and 255 so that I can apply a grayscale visualisation:&lt;/p&gt;

&lt;div style=&#34;text-align:center; display:inline-block; width:100%;margin:auto;&#34;&gt;
&lt;img title=&#34;Conv Result&#34; height=&#34;150&#34;src=&#34;/img/CNN/convResult.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Result of the convolution&lt;/b&gt;
&lt;/div&gt;

&lt;p&gt;This dummy example could represent the very bottom-left edge of the Android&amp;rsquo;s head and doesn&amp;rsquo;t really look like it&amp;rsquo;s detected anything. To see the proper effect, we need to scale this up so that we&amp;rsquo;re not looking at individual pixels. Performing the horizontal and vertical Sobel filtering on the full 264 x 264 image gives:&lt;/p&gt;

&lt;div style=&#34;width:100%;margin:auto; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; min-width:100px;margin:auto;&#34;&gt;
&lt;img title=&#34;Horizontal Sobel&#34; src=&#34;/img/CNN/horiz.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Horizontal Sobel&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; display:inline-block; margin:auto;min-width:100px&#34;&gt;
&lt;img title=&#34;Vertical Sobel&#34; src=&#34;/img/CNN/vert.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Vertical Sobel&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; display:inline-block;margin:auto;min-width:100px&#34;&gt;
&lt;img title=&#34;Full Sobel&#34; src=&#34;/img/CNN/both.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Combined Sobel&lt;/b&gt;
&lt;/div&gt;  
&lt;/div&gt;

&lt;p&gt;In the combined image, we&amp;rsquo;ve added together the results from both filters to capture the horizontal and the vertical edges at once.&lt;/p&gt;

&lt;h3 id=&#34;relationship&#34;&gt; How does this feed into CNNs? &lt;/h3&gt;

&lt;p&gt;Clearly, convolution is powerful in finding the features of an image &lt;strong&gt;if&lt;/strong&gt; we already know the right kernel to use. Kernel design is an artform and has been refined over the last few decades to do some pretty amazing things with images (just look at the huge list in your graphics software!). But the important question is, what if we don&amp;rsquo;t know the features we&amp;rsquo;re looking for? Or what if we &lt;strong&gt;do&lt;/strong&gt; know, but we don&amp;rsquo;t know what the kernel should look like?&lt;/p&gt;

&lt;p&gt;Well, first we should recognise that every pixel in an image is a &lt;strong&gt;feature&lt;/strong&gt; and that means it represents an &lt;strong&gt;input node&lt;/strong&gt;. The result from each convolution is placed into the next layer in a &lt;strong&gt;hidden node&lt;/strong&gt;. Each feature or pixel of the convolved image is a node in the hidden layer.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve already said that each of these numbers in the kernel is a weight, and that weight is the connection between the feature of the input image and the node of the hidden layer. The kernel is swept across the image, so there are as many hidden nodes as there are input nodes (slightly fewer, in fact, unless we add zero-padding to the input image). This means that the hidden layer is also 2D like the input image. Sometimes, instead of moving the kernel over one pixel at a time, the &lt;strong&gt;stride&lt;/strong&gt;, as it&amp;rsquo;s called, can be increased. This will result in fewer nodes, or fewer pixels, in the convolved image. Consider it like this:&lt;/p&gt;

&lt;div style=&#34;width:100%;margin:auto; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block;margin:auto;min-width:300px;&#34;&gt;
&lt;img title=&#34;Hidden Layer Nodes&#34; height=300 src=&#34;/img/CNN/hiddenLayer.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Hidden Layer Nodes in a CNN&lt;/b&gt;
&lt;/div&gt;  
&lt;div style=&#34;text-align:center; display:inline-block;margin:auto;min-width:300px&#34;&gt;
&lt;img title=&#34;Hidden Layer after Increased Stride&#34; height=225 src=&#34;/img/CNN/strideHidden.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Increased stride means fewer hidden-layer nodes&lt;/b&gt;
&lt;/div&gt;  
&lt;/div&gt;
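&lt;p&gt;The number of hidden nodes per dimension follows the standard formula: for an input of size N, kernel size K, zero-padding P and stride S, the output size is (N - K + 2P)/S + 1. A tiny sketch with illustrative numbers:&lt;/p&gt;

```python
# Output size of a convolved dimension: (N - K + 2P) // S + 1
# for input size n, kernel size k, zero-padding p and stride s.
def conv_output_size(n, k, p=0, s=1):
    return (n - k + 2 * p) // s + 1

print(conv_output_size(28, 3, p=1, s=1))  # stride 1: 28, size preserved
print(conv_output_size(28, 3, p=1, s=2))  # stride 2: 14, fewer hidden nodes
```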

&lt;p&gt;These weights that connect to the nodes need to be learned in exactly the same way as in a regular neural network. The image is passed through these nodes (by being convolved with the weights a.k.a the kernel) and the result is compared to some output (the error of which is then backpropagated and optimised).&lt;/p&gt;

&lt;p&gt;In reality, it isn&amp;rsquo;t just the weights or the kernel for one 2D set of nodes that has to be learned, there is a whole array of nodes which all look at the same area of the image (sometimes, but possibly incorrectly, called the &lt;strong&gt;receptive field&lt;/strong&gt;*). Each of the nodes in this row (or &lt;strong&gt;fibre&lt;/strong&gt;) tries to learn different kernels (different weights) that will show up some different features of the image, like edges. So the hidden-layer may look something more like this:&lt;/p&gt;

&lt;p&gt;* &lt;em&gt;Note: we&amp;rsquo;ll talk more about the receptive field after looking at the pooling layer below&lt;/em&gt;&lt;/p&gt;

&lt;div style=&#34;width:100%;margin:auto; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block;margin:auto;min-width:100px&#34;&gt;
&lt;img title=&#34;Multiple Kernel Hidden Layer&#34; height=350 src=&#34;/img/CNN/deepConv.png&#34;&gt;&lt;br&gt;
&lt;b&gt;For a 2D image learning a set of kernels.&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; display:inline-block;margin:auto;min-width:100px&#34;&gt;
&lt;img title=&#34;3 Channel Image&#34; height=350 src=&#34;/img/CNN/deepConv3.png&#34;&gt;&lt;br&gt;
&lt;b&gt;For a 3 channel RGB image the kernel becomes 3D.&lt;/b&gt; 
&lt;/div&gt;
&lt;/div&gt;  

&lt;p&gt;Now &lt;strong&gt;this&lt;/strong&gt; is why deep learning is called &lt;strong&gt;deep&lt;/strong&gt; learning. Each hidden layer of the convolutional neural network is capable of learning a large number of kernels. The output from this hidden-layer is passed to more layers which are able to learn their own kernels based on the &lt;em&gt;convolved&lt;/em&gt; image output from this layer (after some pooling operation to reduce the size of the convolved output). This is what gives the CNN the ability to see the edges of an image and build them up into larger features.&lt;/p&gt;

&lt;h2 id=&#34;CNN Architecture&#34;&gt;  CNN Architecture &lt;/h2&gt;

&lt;p&gt;It is the &lt;em&gt;architecture&lt;/em&gt; of a CNN that gives it its power. A lot of papers that are published on CNNs tend to be about a new architecture i.e. the number and ordering of different layers and how many kernels are learnt. Though often it&amp;rsquo;s the clever tricks applied to older architectures that really give a network its power. Let&amp;rsquo;s take a look at the other layers in a CNN.&lt;/p&gt;

&lt;h2 id=&#39;layers&#39;&gt; Layers &lt;/h2&gt;

&lt;h3 id=&#34;input&#34;&gt;  Input Layer &lt;/h3&gt;

&lt;p&gt;The input image is placed into this layer. It can be a single-layer 2D image (grayscale), 2D 3-channel image (RGB colour) or 3D. The main difference between how the inputs are arranged comes in the formation of the expected kernel shapes. Kernels need to be learned that are the same depth as the input i.e. 5 x 5 x 3 for a 2D RGB image with dimensions of 5 x 5.&lt;/p&gt;

&lt;p&gt;Inputs to a CNN seem to work best when they&amp;rsquo;re of certain dimensions. This is because of the behaviour of the convolution. Depending on the &lt;em&gt;stride&lt;/em&gt; of the kernel and the subsequent &lt;em&gt;pooling layers&lt;/em&gt;, the outputs may become an &amp;ldquo;illegal&amp;rdquo; size including half-pixels. We&amp;rsquo;ll look at this in the &lt;em&gt;pooling layer&lt;/em&gt; section.&lt;/p&gt;

&lt;h3 id=&#34;convolution&#34;&gt;  Convolutional Layer &lt;/h3&gt;

&lt;p&gt;We&amp;rsquo;ve &lt;a href=&#34;#kernels&#34; title=&#34;Convolution and Kernels&#34;&gt;already looked at what the conv layer does&lt;/a&gt;. Just remember that it takes in an image e.g. [56 x 56 x 3] and, assuming a stride of 1 and zero-padding, will produce an output of [56 x 56 x 32] if 32 kernels are being learnt. It&amp;rsquo;s important to note that the order of these dimensions can be important during the implementation of a CNN in Python. This is because there&amp;rsquo;s a lot of matrix multiplication going on!&lt;/p&gt;

&lt;h3 id=&#34;nonlinear&#34;&gt; Non-linearity&lt;/h3&gt;

&lt;p&gt;The &amp;lsquo;non-linearity&amp;rsquo; here isn&amp;rsquo;t its own distinct layer of the CNN, but comes as part of the convolution layer as it is done on the output of the neurons (just like a normal NN). By this, we mean &amp;ldquo;don&amp;rsquo;t take the data forwards as it is (linearity); let&amp;rsquo;s do something to it (non-linearity) that will help us later on&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;In our neural network tutorials we looked at different &lt;a href=&#34;/post/transfer-functions&#34; title=&#34;Transfer Functions&#34;&gt;activation functions&lt;/a&gt;. These each provide a different mapping of the input to an output, either to [-1, 1], [0, 1] or some other domain, e.g. the Rectified Linear Unit thresholds the data at 0: max(0, x). The &lt;em&gt;ReLU&lt;/em&gt; is very popular as it doesn&amp;rsquo;t require any expensive computation and it&amp;rsquo;s been &lt;a href=&#34;http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf&#34; title=&#34;Krizhevsky et al 2012&#34;&gt;shown to speed up the convergence of stochastic gradient descent algorithms&lt;/a&gt;.&lt;/p&gt;
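&lt;p&gt;As a sketch, the ReLU is a one-liner in numpy, which is part of why it&amp;rsquo;s so cheap to compute:&lt;/p&gt;

```python
import numpy as np

def relu(x):
    # threshold at zero: max(0, x) element-wise
    return np.maximum(0, x)

relu(np.array([-2.0, -0.5, 0.0, 1.5]))  # -> array([0. , 0. , 0. , 1.5])
```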

&lt;h3 id=&#34;pool&#34;&gt;  Pooling Layer &lt;/h3&gt;

&lt;p&gt;The pooling layer is key to making sure that the subsequent layers of the CNN are able to pick up larger-scale detail than just edges and curves. It does this by merging pixel regions in the convolved image together (shrinking the image) before attempting to learn kernels on it. Effectively, this stage takes another kernel, say [2 x 2], and passes it over the entire image, just like in convolution. It is common to have the stride and kernel size equal, i.e. a [2 x 2] kernel has a stride of 2. This example will &lt;em&gt;halve&lt;/em&gt; the size of the convolved image. The number of feature-maps produced by the learned kernels will remain the same as &lt;strong&gt;pooling&lt;/strong&gt; is done on each one in turn. Thus the pooling layer returns an array with the same depth as the convolution layer. The figure below shows the principle.&lt;/p&gt;

&lt;div style=&#34;text-align:center; display:inline-block; width:100%;margin:auto;&#34;&gt;
&lt;img title=&#34;Pooling&#34; height=350 src=&#34;/img/CNN/poolfig.gif&#34;&gt;&lt;br&gt;
&lt;b&gt;Max-pooling: Pooling using a &#34;max&#34; filter with stride equal to the kernel size&lt;/b&gt;
&lt;/div&gt;  
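&lt;p&gt;A minimal numpy sketch of max-pooling with a [2 x 2] kernel and a stride of 2, using a reshape trick rather than an explicit sliding window:&lt;/p&gt;

```python
import numpy as np

def max_pool(img, k=2):
    """Max-pool a 2D array with a [k x k] kernel and stride k."""
    h, w = img.shape
    # crop so the dimensions divide evenly by the pool size
    img = img[:h - h % k, :w - w % k]
    # group pixels into [k x k] blocks, then take the max of each block
    return img.reshape(img.shape[0] // k, k, img.shape[1] // k, k).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
max_pool(x)  # -> [[ 5,  7], [13, 15]]: each output is the max of a 2x2 block
```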

&lt;h3 id=&#34;receptiveField&#34;&gt; A Note on the Receptive Field &lt;/h3&gt;

&lt;p&gt;This is quite an important, but sometimes neglected, concept. We said that the receptive field of a single neuron can be taken to mean the area of the image which it can &amp;lsquo;see&amp;rsquo;. Each neuron therefore has a different receptive field. While this is true, the full impact of it can only be understood when we see what happens after pooling.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s take an image of size [12 x 12] and a kernel size in the first conv layer of [3 x 3]. The output of the conv layer (assuming zero-padding and stride of 1) is going to be [12 x 12 x 10] if we&amp;rsquo;re learning 10 kernels. After pooling with a [3 x 3] kernel (stride 3), we get an output of [4 x 4 x 10]. This gets fed into the next conv layer. Suppose the kernel in the second conv layer is [2 x 2], would we say that the receptive field here is also [2 x 2]? Well, some people do but, actually, no it&amp;rsquo;s not. In fact, a neuron in this layer is not just seeing the [2 x 2] area of the &lt;em&gt;pooled&lt;/em&gt; image, it is actually seeing an [8 x 8] area of the &lt;em&gt;original&lt;/em&gt; image: each pooled pixel summarises a [3 x 3] block of convolved pixels, and each of those convolved pixels itself sees a [3 x 3] patch of the original image (overlapping with its neighbours, remembering we had a stride of 1 in the first layer). Continuing this through the rest of the network, it is possible to end up with a final layer with a receptive field equal to the size of the original image. Understanding this gives us the real insight into how the CNN works, building up the image as it goes.&lt;/p&gt;
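&lt;p&gt;One way to keep track of this is to accumulate the receptive field layer by layer. A rough sketch, assuming each layer is described only by its kernel size and stride (the function name is illustrative):&lt;/p&gt;

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, first layer first."""
    rf = 1      # receptive field of a single input pixel
    jump = 1    # distance, in input pixels, between adjacent outputs
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# two stacked [3 x 3] convs with stride 1: each output sees 5 x 5 of the input
receptive_field([(3, 1), (3, 1)])  # -> 5
```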

&lt;h3 id=&#34;dense&#34;&gt;  Fully-connected (Dense) Layer&lt;/h3&gt;

&lt;p&gt;So this layer took me a while to figure out, despite its simplicity. If I take all of the, say, [3 x 3 x 64] featuremaps of my final pooling layer, I have 3 x 3 x 64 = 576 different values to feed forwards. I need to make sure that my training labels match with the outputs from my output layer. We may only have 10 possibilities in our output layer (say the digits 0 - 9 in the classic MNIST number classification task). Thus we want the final numbers in our output layer to be [10,] and the layer before this to be [? x 10] where the ? represents the number of nodes in the layer before: the fully-connected (FC) layer. If there were only 1 node in this layer, it would have 576 weights attached to it - one for each of the values coming from the previous pooling layer. This is not very useful as it won&amp;rsquo;t allow us to learn any combinations of these low-level features. Increasing the number of neurons to say 1,000 will allow the FC layer to provide many different combinations of features and learn a more complex non-linear function that represents the feature space. The number of nodes in this layer can be whatever we want it to be and isn&amp;rsquo;t constrained by any previous dimensions - this is the thing that kept confusing me when I looked at other CNNs. Sometimes two FC layers are used together; this just increases the possibility of learning a complex function. FC layers are 1D vectors.&lt;/p&gt;
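&lt;p&gt;The shapes above can be sketched in numpy. The sizes are just the ones used in this example, and the weight matrices here are random stand-ins, not trained values:&lt;/p&gt;

```python
import numpy as np

pooled = np.random.rand(3, 3, 64)        # final pooling-layer output
flat = pooled.reshape(-1)                # flatten: 3 * 3 * 64 = 576 values
W_fc = np.random.randn(1000, flat.size)  # FC layer with 1,000 nodes
fc = np.maximum(0, W_fc @ flat)          # 1,000 activations (ReLU'd)
W_out = np.random.randn(10, 1000)        # output layer: 10 classes
scores = W_out @ fc                      # one score per class, shape (10,)
```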

&lt;p&gt;However, FC layers act as &amp;lsquo;black boxes&amp;rsquo; and are notoriously uninterpretable. They&amp;rsquo;re also prone to overfitting, so &lt;strong&gt;dropout&lt;/strong&gt; is often performed (discussed below).&lt;/p&gt;

&lt;h4 id = &#34;fcConv&#34;&gt; Fully-connected as a Convolutional Layer &lt;/h4&gt;

&lt;p&gt;If the idea above doesn&amp;rsquo;t help you, let&amp;rsquo;s remove the FC layer and replace it with another convolutional layer. This is very simple - take the output from the pooling layer as before and apply a convolution to it with a kernel that is the same size as a featuremap in the pooling layer. For this to be of use, the input to the conv should be down to around [5 x 5] or [3 x 3] by making sure there have been enough pooling layers in the network. What does this achieve? By convolving a [3 x 3] image with a [3 x 3] kernel we get a 1 pixel output. There is no striding, just one convolution per featuremap. So our output from this layer will be a [1 x k] vector where &lt;em&gt;k&lt;/em&gt; is the number of featuremaps. This is very similar to the FC layer, except that each output from the conv is created from an individual featuremap rather than being connected to all of the featuremaps.&lt;/p&gt;

&lt;p&gt;But, isn&amp;rsquo;t this more weights to learn? Yes, so it isn&amp;rsquo;t usually done this way. Instead, we perform either &lt;em&gt;global average pooling&lt;/em&gt; or &lt;em&gt;global max pooling&lt;/em&gt; where the &lt;em&gt;global&lt;/em&gt; refers to a whole single feature map (not the whole set of feature maps). So we&amp;rsquo;re taking the average of all points in each feature map and repeating this for every feature map to get the [1 x k] vector as before. Note that the number of channels (kernels/features) in the last conv layer has to be equal to the number of outputs we want, or else we have to include an FC layer to change the [1 x k] vector to what we need.&lt;/p&gt;
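&lt;p&gt;In numpy terms, global average (or max) pooling is just a reduction over the spatial axes of each feature map, leaving the [1 x k] vector described above:&lt;/p&gt;

```python
import numpy as np

fmaps = np.random.rand(3, 3, 8)   # k = 8 feature maps, each [3 x 3]

gap = fmaps.mean(axis=(0, 1))     # global average pooling -> shape (8,)
gmp = fmaps.max(axis=(0, 1))      # global max pooling     -> shape (8,)
```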

&lt;p&gt;This can be powerful as we have represented a very large receptive field by a single pixel and also removed some spatial information, which allows us to take into account translations of the input. We&amp;rsquo;re able to say, if the value of the output is high, that all of the featuremaps visible to this output have activated enough to represent a &amp;lsquo;cat&amp;rsquo; or whatever it is we are training our network to learn.&lt;/p&gt;

&lt;h3 id=&#34;dropout&#34;&gt; Dropout Layer &lt;/h3&gt;

&lt;p&gt;The previously mentioned fully-connected layer is connected to all weights in the previous layer - this can be a very large number. As such, an FC layer is prone to &lt;em&gt;overfitting&lt;/em&gt;, meaning that the network won&amp;rsquo;t generalise well to new data. There are a number of techniques that can be used to reduce overfitting, though the most commonly seen in CNNs is the dropout layer, proposed by Hinton. As the name suggests, this causes the network to &amp;lsquo;drop&amp;rsquo; some nodes on each iteration with a particular probability. The &lt;em&gt;drop probability&lt;/em&gt; is between 0 and 1, most commonly around 0.2-0.5 it seems; the &lt;em&gt;keep probability&lt;/em&gt; is simply 1 minus this - the probability that a particular node survives a training iteration. When back propagation occurs, the weights connected to the dropped nodes are not updated. They are re-added for the next iteration before another set is chosen for dropout.&lt;/p&gt;
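&lt;p&gt;A sketch of dropout during training. This is the common &amp;lsquo;inverted&amp;rsquo; variant, which rescales the surviving nodes so that nothing needs to change at test time:&lt;/p&gt;

```python
import numpy as np

def dropout(activations, drop_prob=0.5, rng=np.random):
    """Zero each node with probability drop_prob; rescale the survivors."""
    # mask of survivors: each node is kept with probability 1 - drop_prob
    keep = rng.rand(*activations.shape) >= drop_prob
    # divide by the keep probability so the expected activation is unchanged
    return activations * keep / (1.0 - drop_prob)
```

With a drop probability of 0.5, each surviving activation is doubled and roughly half are zeroed on any given iteration.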

&lt;h3 id=&#34;output&#34;&gt; Output Layer &lt;/h3&gt;

&lt;p&gt;Of course, depending on the purpose of your CNN, the output layer will be slightly different. In general, the output layer consists of a number of nodes which have a high value if they are &amp;lsquo;true&amp;rsquo; or activated. Consider a classification problem where a CNN is given a set of images containing cats, dogs and elephants. If we&amp;rsquo;re asking the CNN to learn what a cat, dog and elephant looks like, the output layer is going to be a set of three nodes, one for each &amp;lsquo;class&amp;rsquo; or animal. We&amp;rsquo;d expect that when the CNN finds an image of a cat, the value at the node representing &amp;lsquo;cat&amp;rsquo; is higher than the other two. This is the same idea as in a regular neural network. In fact, the FC layer and the output layer can be considered as a traditional NN where we also usually include a softmax activation function. Some output layers are probabilities and as such will sum to 1, whilst others will just achieve a value which could be a pixel intensity in the range 0-255. The output can also consist of a single node if we&amp;rsquo;re doing regression or deciding if an image belongs to a specific class or not e.g. diseased or healthy. Commonly, however, even binary classification is posed with 2 nodes in the output and trained with labels that are &amp;lsquo;one-hot&amp;rsquo; encoded i.e. [1,0] for class 0 and [0,1] for class 1.&lt;/p&gt;
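&lt;p&gt;For the probabilistic case, the softmax maps the raw node values onto probabilities that sum to 1. A minimal sketch, with made-up scores for the cat/dog/elephant example:&lt;/p&gt;

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw activations: cat, dog, elephant
probs = softmax(scores)             # sums to 1; the 'cat' node is highest
```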

&lt;h3 id=&#34;backProp&#34;&gt; A Note on Back Propagation &lt;/h3&gt;

&lt;p&gt;I&amp;rsquo;ve found it helpful to consider CNNs in reverse. It didn&amp;rsquo;t sit properly in my mind that the CNN first learns all different types of edges, curves etc. and then builds them up into large features e.g. a face. It came up in a discussion with a colleague that we could consider the CNN working in reverse, and in fact this is effectively what happens - back propagation updates the weights from the final layer &lt;em&gt;back&lt;/em&gt; towards the first. In fact, the error (or loss) minimisation occurs firstly at the final layer and as such, this is where the network is &amp;lsquo;seeing&amp;rsquo; the bigger picture. The gradient (updates to the weights) vanishes towards the input layer and is greatest at the output layer. We can effectively think that the CNN is learning &amp;ldquo;face - has eyes, nose, mouth&amp;rdquo; at the output layer, then &amp;ldquo;I don&amp;rsquo;t know what a face is, but here are some eyes, noses, mouths&amp;rdquo; in the previous one, then &amp;ldquo;What are eyes? I&amp;rsquo;m only seeing circles, some white bits and a black hole&amp;rdquo; followed by &amp;ldquo;woohoo! round things!&amp;rdquo; and initially by &amp;ldquo;I think that&amp;rsquo;s what a line looks like&amp;rdquo;. Possibly we could think of the CNN as being less sure about itself at the first layers and being more advanced at the end.&lt;/p&gt;

&lt;p&gt;CNNs can be used for segmentation, classification, regression and a whole manner of other processes. On the whole, they only differ by four things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;architecture (number and order of conv, pool and fc layers plus the size and number of the kernels)&lt;/li&gt;
&lt;li&gt;output (probabilistic etc.)&lt;/li&gt;
&lt;li&gt;training method (cost or loss function, regularisation and optimiser)&lt;/li&gt;
&lt;li&gt;hyperparameters (learning rate, regularisation weights, batch size, iterations&amp;hellip;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There may well be other posts which consider these kinds of things in more detail, but for now I hope you have some insight into how CNNs function. Now, let&amp;rsquo;s code it up&amp;hellip;&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>A Simple Neural Network - Simple Performance Improvements</title>
      <link>/post/nn-python-tweaks/</link>
      <pubDate>Fri, 17 Mar 2017 08:53:55 +0000</pubDate>
      
      <guid>/post/nn-python-tweaks/</guid>
      <description>&lt;p&gt;The 5th installment of our tutorial on implementing a neural network (NN) in Python. By the end of this tutorial, our NN should perform much more efficiently giving good results with fewer iterations. We will do this by implementing &amp;ldquo;momentum&amp;rdquo; into our network. We will also put in the other transfer functions for each layer.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;div id=&#34;toctop&#34;&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#intro&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#momentum&#34;&gt;Momentum&lt;/a&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#momentumbackground&#34;&gt;Background&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#momentumpython&#34;&gt;Momentum in Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#momentumtesting&#34;&gt;Testing&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#transferfunctions&#34;&gt;Transfer Functions&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;intro&#34;&gt; Introduction &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve come so far! The initial &lt;a href=&#34;/post/neuralnetwork&#34;&gt;maths&lt;/a&gt; was a bit of a slog, as was the &lt;a href=&#34;/post/nn-more-maths&#34;&gt;vectorisation&lt;/a&gt; of that maths, but it was important to be able to implement our NN in Python which we did in our &lt;a href=&#34;/post/nn-in-python&#34;&gt;previous post&lt;/a&gt;. So what now? Well, you may have noticed when running the NN as it stands that it isn&amp;rsquo;t overly quick: depending on the randomly initialised weights, it may take the network the full number of &lt;code&gt;maxIterations&lt;/code&gt; to converge, or it may not converge at all! But there is something we can do about it. Let&amp;rsquo;s learn about, and implement, &amp;lsquo;momentum&amp;rsquo;.&lt;/p&gt;

&lt;h2 id=&#34;momentum&#34;&gt; Momentum &lt;/h2&gt;

&lt;h3 id=&#34;momentumbackground&#34;&gt; Background &lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s revisit our equation for error in the NN:&lt;/p&gt;

&lt;div id=&#34;eqerror&#34;&gt;$$
\text{E} = \frac{1}{2} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right)^{2}
$$&lt;/div&gt;

&lt;p&gt;This isn&amp;rsquo;t the only error function that could be used. In fact, there&amp;rsquo;s a whole field of study in NNs about the best error or &amp;lsquo;optimisation&amp;rsquo; function that should be used. This one looks at the sum of the squared residuals between the outputs and the expected values at the end of each forward pass (the so-called $l_{2}$-norm). Others, e.g. the $l_{1}$-norm, look at minimising the sum of the absolute differences between the values themselves. There are more complex error (a.k.a. optimisation or cost) functions, for example those that look at the cross-entropy in the data. There may well be a post in the future about different cost functions, but for now we will still focus on the equation above.&lt;/p&gt;
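&lt;p&gt;For a concrete feel, here are the two norms computed on a made-up output/target pair:&lt;/p&gt;

```python
import numpy as np

outputs = np.array([0.8, 0.2, 0.6])
targets = np.array([1.0, 0.0, 1.0])

l2_cost = 0.5 * np.sum((outputs - targets) ** 2)  # the E defined above
l1_cost = np.sum(np.abs(outputs - targets))       # sum of absolute differences
```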

&lt;p&gt;Now this function is described as a &amp;lsquo;convex&amp;rsquo; function. This is an important property if we are to make our NN converge to the correct answer. Take a look at the two functions below:&lt;/p&gt;

&lt;div  id=&#34;fig1&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;convex&#34; src=&#34;/img/simpleNN/convex.png&#34; width=&#34;35%&#34; hspace=&#34;10px&#34;&gt;&lt;img title=&#34;non-convex&#34; src=&#34;/img/simpleNN/non-convex.png&#34; width=&#34;35%&#34; hspace=&#34;10px&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 1&lt;/font&gt;: A convex (left) and non-convex (right) cost function
        &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Let&amp;rsquo;s say that our current error is represented by the green ball. Our NN will calculate the gradient of its cost function at this point then look for the direction which is going to &lt;em&gt;minimise&lt;/em&gt; the error i.e. go down a slope. The NN will feed the result into the back-propagation algorithm which will hopefully mean that on the next iteration, the error will have decreased. For a &lt;em&gt;convex&lt;/em&gt; function, this is very straightforward: the NN just needs to keep going in the direction it found on the first run. But look at the &lt;em&gt;non-convex&lt;/em&gt; function: our current error (green ball) sits at a point where either direction will take it to a lower error i.e. the gradient decreases on both sides. If the error goes to the left, it will hit &lt;strong&gt;one&lt;/strong&gt; of the possible minima of the function, but this will be a higher minimum (higher final error) than if the error had followed the gradient to the right. Clearly the starting point for the error here has a big impact on the final result. Looking down at the 2D perspective (remembering that these are complex multi-dimensional functions), the non-convex case is clearly more ambiguous in terms of the location of the minimum and direction of descent. The convex function, however, nicely guides the error to the minimum with little care for the starting point.&lt;/p&gt;

&lt;div  id=&#34;fig2&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;convexcontour&#34; src=&#34;/img/simpleNN/convexcontourarrows.png&#34; width=&#34;35%&#34; hspace=&#34;10px&#34;&gt;&lt;img title=&#34;non-convexcontour&#34; src=&#34;/img/simpleNN/nonconvexcontourarrows.png&#34; width=&#34;35%&#34; hspace=&#34;10px&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 2&lt;/font&gt;: Contours for a portion of the convex (left) and non-convex (right) cost function
        &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;So let&amp;rsquo;s focus on the convex case and explain what &lt;em&gt;momentum&lt;/em&gt; is and why it works. I don&amp;rsquo;t think you&amp;rsquo;ll ever see a back propagation algorithm without momentum implemented in some way. In its simplest form, it modifies the weight-update equation:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{ \Delta W_{JK} = -\eta \vec{\delta}_{K} \vec{ \mathcal{O}_{J}}}
$$&lt;/div&gt;

&lt;p&gt;by adding an extra &lt;em&gt;momentum&lt;/em&gt; term:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{ \Delta W_{JK}\left(t\right) = -\eta \vec{\delta}_{K} \vec{ \mathcal{O}_{J}}} + m \mathbf{\Delta W_{JK}\left(t-1\right)}
$$&lt;/div&gt;

&lt;p&gt;The weight delta (the update amount to the weights after BP) now relies on its &lt;em&gt;previous&lt;/em&gt; value i.e. the weight delta at iteration $t$ requires the value of itself from $t-1$. The $m$ or momentum term, like the learning rate $\eta$, is just a small number between 0 and 1. What effect does this have?&lt;/p&gt;

&lt;p&gt;Using prior information about the network is beneficial as it stops the network firing wildly into the unknown. If it knows the previous weight updates that gave the current error, it can keep the descent to the minimum roughly pointing in the same direction as before. The effect is that each iteration does not jump around as much as it otherwise would. In effect, the result is similar to that of the learning rate. We should be careful though: a large value for $m$ combined with a large learning rate may cause the result to jump past the minimum and back again. We can think of momentum as changing the path taken to the optimum.&lt;/p&gt;

&lt;h3 id=&#34;momentumpython&#34;&gt; Momentum in Python &lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, implementing momentum into our NN should be pretty easy. We will need to provide a momentum term to the &lt;code&gt;backProp&lt;/code&gt; method of the NN and also create a new matrix in which to store the weight deltas from the current epoch for use in the subsequent one.&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;__init__&lt;/code&gt; method of the NN, we need to initialise the previous-weight matrices and then give them some values - they&amp;rsquo;ll start with zeros:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def __init__(self, numNodes):
	&amp;quot;&amp;quot;&amp;quot;Initialise the NN - setup the layers and initial weights&amp;quot;&amp;quot;&amp;quot;

	# Layer info
	self.numLayers = len(numNodes) - 1
	self.shape = numNodes 

	# Input/Output data from last run
	self._layerInput = []
	self._layerOutput = []
	self._previousWeightDelta = []

	# Create the weight arrays
	self.weights = []
	for (l1,l2) in zip(numNodes[:-1],numNodes[1:]):
	    self.weights.append(np.random.normal(scale=0.1,size=(l2,l1+1))) 
	    self._previousWeightDelta.append(np.zeros((l2,l1+1)))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The only other part of the NN that needs to change is the definition of &lt;code&gt;backProp&lt;/code&gt; adding momentum to the inputs, and updating the weight equation. Finally, we make sure to save the current weights into the previous-weight matrix:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def backProp(self, input, target, trainingRate = 0.2, momentum=0.5):
	&amp;quot;&amp;quot;&amp;quot;Get the error, deltas and back propagate to update the weights&amp;quot;&amp;quot;&amp;quot;
	...
	weightDelta = trainingRate * thisWeightDelta + momentum * self._previousWeightDelta[index]

	self.weights[index] -= weightDelta

	self._previousWeightDelta[index] = weightDelta
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;momentumtesting&#34;&gt; Testing &lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our default values for learning rate and momentum are 0.2 and 0.5 respectively. We can change either of these by including them in the call to &lt;code&gt;backProp&lt;/code&gt;. This is the only change to the iteration process:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;for i in range(maxIterations + 1):
    Error = NN.backProp(Input, Target, trainingRate=0.2, momentum=0.5)
    if i % 2500 == 0:
        print(&amp;quot;Iteration {0}\tError: {1:0.6f}&amp;quot;.format(i,Error))
    if Error &amp;lt;= minError:
        print(&amp;quot;Minimum error reached at iteration {0}&amp;quot;.format(i))
        break
        
Iteration 100000	Error: 0.000076
Input 	Output 		Target
[0 0]	 [ 0.00491572] 	[ 0.]
[1 1]	 [ 0.00421318] 	[ 0.]
[0 1]	 [ 0.99586268] 	[ 1.]
[1 0]	 [ 0.99586257] 	[ 1.]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Feel free to play around with these numbers; however, it is unlikely that much would change right now. I say this because there is only so good a result we can get when using only the sigmoid function as our activation function. If you go back and read the post on &lt;a href=&#34;/post/transfer-functions&#34;&gt;transfer functions&lt;/a&gt; you&amp;rsquo;ll see that it&amp;rsquo;s more common to use &lt;em&gt;linear&lt;/em&gt; functions for the output layer. As it stands, the sigmoid function is unable to output a 1 or a 0 because it is asymptotic at these values. Therefore, no matter what learning rate or momentum we use, the network will never be able to get the best output.&lt;/p&gt;

&lt;p&gt;This seems like a good time to implement the other transfer functions.&lt;/p&gt;

&lt;h3 id=&#34;transferfunctions&#34;&gt; Transfer Functions &lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve already gone through writing the transfer functions in Python in the &lt;a href=&#34;/post/transfer-functions&#34;&gt;transfer functions&lt;/a&gt; post. We&amp;rsquo;ll just put these under the sigmoid function we defined earlier. I&amp;rsquo;m going to use &lt;code&gt;sigmoid&lt;/code&gt;, &lt;code&gt;linear&lt;/code&gt;, &lt;code&gt;gaussian&lt;/code&gt; and &lt;code&gt;tanh&lt;/code&gt; here.&lt;/p&gt;

&lt;p&gt;To modify the network, we need to assign each layer its own activation function, so let&amp;rsquo;s put that in the &amp;lsquo;layer information&amp;rsquo; part of the &lt;code&gt;__init__&lt;/code&gt; method:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def __init__(self, numNodes, transferFunctions=None):
	&amp;quot;&amp;quot;&amp;quot;Initialise the Network&amp;quot;&amp;quot;&amp;quot;

	# Layer information
	self.numLayers = len(numNodes) - 1
	self.shape = numNodes

	if transferFunctions is None:
	    layerTFs = []
	    for i in range(self.numLayers):
	        if i == self.numLayers - 1:
	            layerTFs.append(linear)
	        else:
	            layerTFs.append(sigmoid)
	else:
	    if len(numNodes) != len(transferFunctions):
	        raise ValueError(&amp;quot;Number of transfer functions must match the number of layers: minus input layer&amp;quot;)
	    elif transferFunctions[0] is not None:
	        raise ValueError(&amp;quot;The input layer doesn&#39;t need a transfer function: give it [None,...]&amp;quot;)
	    else:
	        layerTFs = transferFunctions[1:]

	self.tFunctions = layerTFs
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s go through this. We input into the initialisation a parameter called &lt;code&gt;transferFunctions&lt;/code&gt; with a default value of &lt;code&gt;None&lt;/code&gt;. If the default is taken, i.e. the parameter is omitted, we set some defaults: for each layer, we use the &lt;code&gt;sigmoid&lt;/code&gt; function, unless it&amp;rsquo;s the output layer, where we use the &lt;code&gt;linear&lt;/code&gt; function. If a list of &lt;code&gt;transferFunctions&lt;/code&gt; is given, we first check that it&amp;rsquo;s a &amp;lsquo;legal&amp;rsquo; input. If the number of functions in the list is not the same as the number of layers (given by &lt;code&gt;numNodes&lt;/code&gt;), throw an error. Also, if the first function in the list is not &lt;code&gt;None&lt;/code&gt;, throw an error, because the first layer shouldn&amp;rsquo;t have an activation function (it is the input layer). If those two things are fine, go ahead and store the list of functions as &lt;code&gt;layerTFs&lt;/code&gt; without the first (element 0) one.&lt;/p&gt;

&lt;p&gt;We next need to replace all of our calls directly to &lt;code&gt;sigmoid&lt;/code&gt; and its derivative. These should now refer to the list of functions via an &lt;code&gt;index&lt;/code&gt; that depends on the number of the current layer. There are 3 instances of this in our NN: 1 in the forward pass where we call &lt;code&gt;sigmoid&lt;/code&gt; directly, and 2 in the &lt;code&gt;backProp&lt;/code&gt; method where we call the derivative at the output and hidden layers. So &lt;code&gt;sigmoid(layerInput)&lt;/code&gt;, for example, should become:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;self.tFunctions[index](layerInput)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Check the updated code &lt;a href=&#34;/docs/simpleNN-improvements.py&#34;&gt;here&lt;/a&gt; if that&amp;rsquo;s confusing.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s test this out! We&amp;rsquo;ll modify the call to initialising the NN by adding a list of functions like so:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;Input = np.array([[0,0],[1,1],[0,1],[1,0]])
Target = np.array([[0.0],[0.0],[1.0],[1.0]])
transferFunctions = [None, sigmoid, linear]
    
NN = backPropNN((2,2,1), transferFunctions)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Running the NN like this with the default learning rate and momentum should provide you with an immediate performance boost, simply because with the &lt;code&gt;linear&lt;/code&gt; function we&amp;rsquo;re now able to get closer to the target values, reducing the error.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;Iteration 0	Error: 1.550211
Iteration 2500	Error: 1.000000
Iteration 5000	Error: 0.999999
Iteration 7500	Error: 0.999999
Iteration 10000	Error: 0.999995
Iteration 12500	Error: 0.999969
Minimum error reached at iteration 14543
Input 	Output 		Target
[0 0]	 [ 0.0021009] 	[ 0.]
[1 1]	 [ 0.00081154] 	[ 0.]
[0 1]	 [ 0.9985881] 	[ 1.]
[1 0]	 [ 0.99877479] 	[ 1.]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Play around with the number of layers and different combinations of transfer functions as well as tweaking the learning rate and momentum. You&amp;rsquo;ll soon get a feel for how each changes the performance of the NN.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>A Simple Neural Network - With Numpy in Python</title>
      <link>/post/nn-in-python/</link>
      <pubDate>Wed, 15 Mar 2017 09:55:00 +0000</pubDate>
      
      <guid>/post/nn-in-python/</guid>
      <description>&lt;p&gt;Part 4 of our tutorial series on Simple Neural Networks. We&amp;rsquo;re ready to write our Python script! Having gone through the maths, vectorisation and activation functions, we&amp;rsquo;re now ready to put it all together and write it up. By the end of this tutorial, you will have a working NN in Python, using only numpy, which can be used to learn the output of logic gates (e.g. XOR)
&lt;/p&gt;

&lt;div id=&#34;toctop&#34;&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#intro&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#transferfunction&#34;&gt;Transfer Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#backpropclass&#34;&gt;Back Propagation Class&lt;/a&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#initialisation&#34;&gt;Initialisation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#forwardpass&#34;&gt;Forward Pass&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#backprop&#34;&gt;Back Propagation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#testing&#34;&gt;Testing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#iterating&#34;&gt;Iterating&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&#34;intro&#34;&gt; Introduction &lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve &lt;a href=&#34;/post/neuralnetwork&#34;&gt;ploughed through the maths&lt;/a&gt;, then &lt;a href=&#34;/post/nn-more-maths&#34;&gt;some more&lt;/a&gt;, now we&amp;rsquo;re finally here! This tutorial will run through the coding up of a simple neural network (NN) in Python. We&amp;rsquo;re not going to use any fancy packages (though they obviously have their advantages in tools, speed, efficiency&amp;hellip;) we&amp;rsquo;re only going to use numpy!&lt;/p&gt;

&lt;p&gt;By the end of this tutorial, we will have built an algorithm which will create a neural network with as many layers (and nodes) as we want. It will be trained by taking in multiple training examples and running the back propagation algorithm many times.&lt;/p&gt;

&lt;p&gt;Here are the things we&amp;rsquo;re going to need to code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The transfer functions&lt;/li&gt;
&lt;li&gt;The forward pass&lt;/li&gt;
&lt;li&gt;The back propagation algorithm&lt;/li&gt;
&lt;li&gt;The update function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To keep things nice and contained, the forward pass and back propagation algorithms should be coded into a class. We&amp;rsquo;re going to expect that we can build a NN by creating an instance of this class which has some internal functions (forward pass, delta calculation, back propagation, weight updates).&lt;/p&gt;

&lt;p&gt;First things first&amp;hellip; let&amp;rsquo;s import numpy:&lt;/p&gt;

&lt;div class=&#34;highlight&#34; style=&#34;background: #272822&#34;&gt;&lt;pre style=&#34;line-height: 125%&#34;&gt;&lt;span&gt;&lt;/span&gt;&lt;span style=&#34;color: #f92672&#34;&gt;import&lt;/span&gt; &lt;span style=&#34;color: #f8f8f2&#34;&gt;numpy&lt;/span&gt; &lt;span style=&#34;color: #f92672&#34;&gt;as&lt;/span&gt; &lt;span style=&#34;color: #f8f8f2&#34;&gt;np&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Now let&amp;rsquo;s go ahead and get the first bit done:&lt;/p&gt;

&lt;h2 id=&#34;transferfunction&#34;&gt; Transfer Function &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To begin with, we&amp;rsquo;ll focus on getting the network working with just one transfer function: the sigmoid function. As we discussed in a &lt;a href=&#34;/post/transfer-functions&#34;&gt;previous post&lt;/a&gt; this is very easy to code up because of its simple derivative:&lt;/p&gt;

&lt;div &gt;$$
\sigma\left(x_{i} \right) = \frac{1}{1 + e^{  - x_{i}  }} \ \ \ \
\sigma^{\prime}\left( x_{i} \right) = \sigma(x_{i}) \left( 1 -  \sigma(x_{i}) \right)
$$&lt;/div&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def sigmoid(x, Derivative=False):
	if not Derivative:
		return 1 / (1 + np.exp (-x))
	else:
		out = sigmoid(x)
		return out * (1 - out)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is a succinct expression which actually calls itself in order to get a value to use in its derivative. We&amp;rsquo;ve used numpy&amp;rsquo;s exponential function to create the sigmoid function and created an &lt;code&gt;out&lt;/code&gt; variable to hold its value in the derivative. Whenever we want to use this function, we can supply the parameter &lt;code&gt;True&lt;/code&gt; to get the derivative; we can omit this, or enter &lt;code&gt;False&lt;/code&gt;, to just get the output of the sigmoid. This is the same function I used to get the graphs in the &lt;a href=&#34;/post/transfer-functions&#34;&gt;post on transfer functions&lt;/a&gt;.&lt;/p&gt;
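
&lt;p&gt;As a quick sanity check, here&amp;rsquo;s a minimal sketch of the function in action (using the definition above):&lt;/p&gt;

```python
import numpy as np

def sigmoid(x, Derivative=False):
    if not Derivative:
        return 1 / (1 + np.exp(-x))
    else:
        out = sigmoid(x)
        return out * (1 - out)

print(sigmoid(0.0))        # 0.5 - the sigmoid is centred on zero
print(sigmoid(0.0, True))  # 0.25 - the derivative peaks at x = 0
```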

&lt;h2 id=&#34;backpropclass&#34;&gt; Back Propagation Class &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;m fairly new to building my own classes in Python, but for this tutorial, I really relied on the videos of &lt;a href=&#34;https://www.youtube.com/playlist?list=PLRyu4ecIE9tibdzuhJr94uQeKnOFkkbq6&#34;&gt;Ryan on YouTube&lt;/a&gt;. Some of his hacks were very useful so I&amp;rsquo;ve taken some of those on board, but I&amp;rsquo;ve made a lot of the variables more self-explanatory.&lt;/p&gt;

&lt;p&gt;First we&amp;rsquo;re going to get the skeleton of the class set up. This means that whenever we create a new instance of the &lt;code&gt;backPropNN&lt;/code&gt; class, it will be able to access all of the functions and variables defined within it.&lt;/p&gt;

&lt;p&gt;It looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;class backPropNN:
    &amp;quot;&amp;quot;&amp;quot;Class defining a NN using Back Propagation&amp;quot;&amp;quot;&amp;quot;
    
    # Class Members (internal variables that are accessed with backPropNN.member) 
    numLayers = 0
    shape = None
    weights = []
    
    # Class Methods (internal functions that can be called)
    
    def __init__(self):
        &amp;quot;&amp;quot;&amp;quot;Initialise the NN - setup the layers and initial weights&amp;quot;&amp;quot;&amp;quot;
        
    # Forward Pass method
    def FP(self):
    	&amp;quot;&amp;quot;&amp;quot;Get the input data and run it through the NN&amp;quot;&amp;quot;&amp;quot;
    	 
    # TrainEpoch method
    def backProp(self):
        &amp;quot;&amp;quot;&amp;quot;Get the error, deltas and back propagate to update the weights&amp;quot;&amp;quot;&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We&amp;rsquo;ve not added any detail to the functions (or methods) yet, but we know there needs to be an &lt;code&gt;__init__&lt;/code&gt; method for any class, plus we&amp;rsquo;re going to want to be able to do a forward pass and then back propagate the error.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve also added a few class members, variables which can be called from an instance of the &lt;code&gt;backPropNN&lt;/code&gt; class. &lt;code&gt;numLayers&lt;/code&gt; is just that, a count of the number of layers in the network, initialised to &lt;code&gt;0&lt;/code&gt;.  The &lt;code&gt;shape&lt;/code&gt; of the network will return the size of each layer of the network in an array and the &lt;code&gt;weights&lt;/code&gt; will return an array of the weights across the network.&lt;/p&gt;

&lt;h3 id=&#34;initialisation&#34;&gt; Initialisation &lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;re going to make the user supply an input variable which gives the size of the layers in the network i.e. the number of nodes in each layer: &lt;code&gt;numNodes&lt;/code&gt;. This will be an array which is the length of the number of layers (including the input and output layers) where each element is the number of nodes in that layer.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def __init__(self, numNodes):
	&amp;quot;&amp;quot;&amp;quot;Initialise the NN - setup the layers and initial weights&amp;quot;&amp;quot;&amp;quot;

	# Layer information
	self.numLayers = len(numNodes) - 1
	self.shape = numNodes
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We&amp;rsquo;ve told our network to ignore the input layer when counting the number of layers (common practice) and that the shape of the network should be returned as the input array &lt;code&gt;numNodes&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s also initialise the weights. We will take the approach of initialising all of the weights to small, random numbers. To keep the code succinct, we&amp;rsquo;ll use a neat function, &lt;code&gt;zip&lt;/code&gt;. &lt;code&gt;zip&lt;/code&gt; is a function which takes two vectors and pairs up the elements in corresponding locations (like a zip). For example:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;A = [1, 2, 3]
B = [4, 5, 6]

zip(A,B)
[(1,4), (2,5), (3,6)]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Why might this be useful? Well, when we talk about weights we&amp;rsquo;re talking about the connections between layers. Let&amp;rsquo;s say we have &lt;code&gt;numNodes=(2, 2, 1)&lt;/code&gt; i.e. a 2 layer network with 2 inputs, 1 output and 2 nodes in the hidden layer. Then we need to let the algorithm know that we expect two input nodes to send weights to 2 hidden nodes. Then 2 hidden nodes to send weights to 1 output node, or &lt;code&gt;[(2,2), (2,1)]&lt;/code&gt;. Note that overall we will have 4 weights from the input to the hidden layer, and 2 weights from the hidden to the output layer.&lt;/p&gt;

&lt;p&gt;What is our &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt; in the code above that will give us &lt;code&gt;[(2,2), (2,1)]&lt;/code&gt;? It&amp;rsquo;s this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;numNodes = (2,2,1)
A = numNodes[:-1]
B = numNodes[1:]

A
(2,2)
B
(2,1)
zip(A,B)
[(2,2), (2,1)]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Great! So each pair represents the nodes between which we need to initialise some weights. In fact, the shape of each pair &lt;code&gt;(2,2)&lt;/code&gt; is the clue to how many weights we are going to need between each layer e.g. between the input and hidden layers we are going to need &lt;code&gt;(2 x 2) = 4&lt;/code&gt; weights.&lt;/p&gt;

&lt;p&gt;So, &lt;code&gt;for&lt;/code&gt; each pair &lt;code&gt;in zip(A,B)&lt;/code&gt; (hint hint) we need to &lt;code&gt;append&lt;/code&gt; some weights to that empty weight list we initialised earlier.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Initialise the weight arrays
for (l1,l2) in zip(numNodes[:-1],numNodes[1:]):
    self.weights.append(np.random.normal(scale=0.1,size=(l2,l1+1)))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;self.weights&lt;/code&gt; as we&amp;rsquo;re appending to the class member initialised earlier. We&amp;rsquo;re using the numpy random number generator from a &lt;code&gt;normal&lt;/code&gt; distribution. The &lt;code&gt;scale&lt;/code&gt; sets the standard deviation of the distribution to 0.1 so that the weights start out as small numbers around zero, and &lt;code&gt;size&lt;/code&gt; asks for a matrix of results which is the size of the tuple &lt;code&gt;(l2,l1+1)&lt;/code&gt;. Huh, &lt;code&gt;+1&lt;/code&gt;? Don&amp;rsquo;t think we&amp;rsquo;re getting away without including the &lt;em&gt;bias&lt;/em&gt; term! We want a random starting point even for the weight connecting the bias node (&lt;code&gt;=1&lt;/code&gt;) to the next layer. Ok, but why this way and not &lt;code&gt;(l1+1,l2)&lt;/code&gt;? Well, we&amp;rsquo;re looking for &lt;code&gt;l2&lt;/code&gt; connections from each of the &lt;code&gt;l1+1&lt;/code&gt; nodes in the previous layer - think of it as (number of observations x number of features). We&amp;rsquo;re creating a matrix of weights which goes across the nodes and down the weights from each node, or as we&amp;rsquo;ve seen in our maths tutorial:&lt;/p&gt;

&lt;div&gt;$$
W_{ij} = \begin{pmatrix} w_{11} &amp; w_{21} &amp; w_{31} \\ w_{12} &amp;w_{22} &amp; w_{32} \end{pmatrix}, \ \ \ \

W_{jk} = \begin{pmatrix} w_{11} &amp; w_{21} &amp; w_{31} \end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;These are the weights between the first and second layers, and between the second and third layers respectively, with node 3 in each case being the bias node.&lt;/p&gt;
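
&lt;p&gt;To make the shapes concrete, here&amp;rsquo;s a sketch of the initialisation loop run on its own for our &lt;code&gt;numNodes=(2, 2, 1)&lt;/code&gt; example (the &lt;code&gt;weights&lt;/code&gt; list here is just a stand-in for &lt;code&gt;self.weights&lt;/code&gt;):&lt;/p&gt;

```python
import numpy as np

numNodes = (2, 2, 1)
weights = []
for (l1, l2) in zip(numNodes[:-1], numNodes[1:]):
    weights.append(np.random.normal(scale=0.1, size=(l2, l1 + 1)))

print(weights[0].shape)  # (2, 3): 2 hidden nodes, each fed by 2 inputs + 1 bias
print(weights[1].shape)  # (1, 3): 1 output node, fed by 2 hidden nodes + 1 bias
```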

&lt;p&gt;Before we move on, let&amp;rsquo;s also put in some placeholders in &lt;code&gt;__init__&lt;/code&gt; for the input and output values to each layer:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;self._layerInput = []
self._layerOutput = []
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;forwardpass&#34;&gt; Forward Pass &lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve now initialised our network enough to be able to focus on the forward pass (FP).&lt;/p&gt;

&lt;p&gt;Our &lt;code&gt;FP&lt;/code&gt; function needs to have the input data. It needs to know how many training examples it&amp;rsquo;s going to have to go through, and it will need to reassign the inputs and outputs at each layer, so lets clean those at the beginning:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def FP(self,input):

	numExamples = input.shape[0]

	# Clean away the values from the previous layer
	self._layerInput = []
	self._layerOutput = []
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So let&amp;rsquo;s propagate. We already have a matrix of (randomly initialised) weights. We just need to know what the input is to each of the layers. We&amp;rsquo;ll separate this into the first hidden layer, and subsequent hidden layers.&lt;/p&gt;

&lt;p&gt;For the first hidden layer we will write:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;layerInput = self.weights[0].dot(np.vstack([input.T, np.ones([1, numExamples])]))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s break this down:&lt;/p&gt;

&lt;p&gt;Our training example inputs need to match the weights that we&amp;rsquo;ve already created. We expect that our examples will come in rows of an array with columns acting as features, something like &lt;code&gt;[(0,0), (0,1),(1,1),(1,0)]&lt;/code&gt;. We can use numpy&amp;rsquo;s &lt;code&gt;vstack&lt;/code&gt; to put each of these examples one on top of the other.&lt;/p&gt;

&lt;p&gt;Each of the input examples is a matrix which will be multiplied by the weight matrix to get the input to the current layer:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{x_{J}} = \mathbf{W_{IJ} \vec{\mathcal{O}}_{I}}
$$&lt;/div&gt;

&lt;p&gt;where $\mathbf{x_{J}}$ are the inputs to the layer $J$ and  $\mathbf{\vec{\mathcal{O}}_{I}}$ is the output from the previous layer (the input examples in this case).&lt;/p&gt;

&lt;p&gt;So given a set of $n$ input examples we &lt;code&gt;vstack&lt;/code&gt; them so we just have &lt;code&gt;(n x numInputNodes)&lt;/code&gt;. We want to transpose this, &lt;code&gt;(numInputNodes x n)&lt;/code&gt; such that we can multiply by the weight matrix which is &lt;code&gt;(numOutputNodes x numInputNodes)&lt;/code&gt;. This gives an input to the layer which is &lt;code&gt;(numOutputNodes x n)&lt;/code&gt; as we expect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; we&amp;rsquo;re actually going to do the transposition first before doing the &lt;code&gt;vstack&lt;/code&gt; - this does exactly the same thing, but it also allows us to more easily add the bias nodes in to each input.&lt;/p&gt;

&lt;p&gt;Bias! Lets not forget this: we add a bias node which always has the value &lt;code&gt;1&lt;/code&gt; to each input (including the input layer). So our actual method is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Transpose the inputs &lt;code&gt;input.T&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Add a row of ones to the bottom (one bias node for each input) &lt;code&gt;[input.T, np.ones([1,numExamples])]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vstack&lt;/code&gt; this to compact the array &lt;code&gt;np.vstack(...)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Multiply with the weights connecting from the previous to the current layer &lt;code&gt;self.weights[0].dot(...)&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
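
&lt;p&gt;The four steps above can be sketched in isolation (the &lt;code&gt;input&lt;/code&gt; array here is just an illustrative set of logic-gate examples):&lt;/p&gt;

```python
import numpy as np

input = np.array([[0, 0], [0, 1], [1, 1], [1, 0]])  # 4 examples x 2 features
numExamples = input.shape[0]

# Transpose, then stack a row of ones underneath as the bias node
stacked = np.vstack([input.T, np.ones([1, numExamples])])
print(stacked.shape)  # (3, 4): (2 features + 1 bias row) x 4 examples
print(stacked[-1])    # [1. 1. 1. 1.] - the bias row
```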

&lt;p&gt;But what about the subsequent hidden layers? We&amp;rsquo;re not using the input examples in these layers, we are using the output from the previous layer &lt;code&gt;[self._layerOutput[-1]]&lt;/code&gt; (multiplied by the weights).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;for index in range(self.numLayers):
    # Get input to the layer
    if index == 0:
        layerInput = self.weights[0].dot(np.vstack([input.T, np.ones([1, numExamples])]))
    else:
        layerInput = self.weights[index].dot(np.vstack([self._layerOutput[-1], np.ones([1, numExamples])]))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Make sure to save this layer input, but also to now calculate and save the output of the current layer i.e.:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{ \vec{ \mathcal{O}}_{J}} = \sigma(\mathbf{x_{J}})
$$&lt;/div&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;self._layerInput.append(layerInput)
self._layerOutput.append(sigmoid(layerInput))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally, make sure that we&amp;rsquo;re returning the data from our output layer the same way that we got it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;return self._layerOutput[-1].T
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;backprop&#34;&gt;Back Propagation&lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve successfully sent the data from the input layer to the output layer using some initially randomised weights &lt;strong&gt;and&lt;/strong&gt; we&amp;rsquo;ve included the bias term (a kind of threshold on the activation functions). Our vectorised equations from the previous post will now come into play:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}

\mathbf{\vec{\delta}_{K}} &amp;= \sigma^{\prime}\left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) * \left( \mathbf{\vec{\mathcal{O}}_{K}} -  \mathbf{T_{K}}\right) \\[0.5em]

\mathbf{ \vec{ \delta }_{J}} &amp;= \sigma^{\prime} \left( \mathbf{ W_{IJ} \mathcal{O}_{I} } \right) * \mathbf{ W^{\intercal}_{JK}} \mathbf{ \vec{\delta}_{K}}

\end{align}
$$&lt;/div&gt;

&lt;div&gt;$$
\begin{align}

\mathbf{W_{JK}} + \Delta \mathbf{W_{JK}} &amp;\rightarrow \mathbf{W_{JK}}, \ \ \ \Delta \mathbf{W_{JK}} = -\eta \mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }_{J}} \\[0.5em]

\vec{\theta}  + \Delta \vec{\theta}  &amp;\rightarrow \vec{\theta}, \ \ \ \Delta \vec{\theta} = -\eta \mathbf{ \vec{ \delta }_{K}} 

\end{align}
$$&lt;/div&gt;

&lt;p&gt;With $*$ representing an elementwise multiplication between the matrices.&lt;/p&gt;

&lt;p&gt;First, lets initialise some variables and get the error on the output of the output layer. We assume that the target values have been formatted in the same way as the input values i.e. they are a row-vector per input example. In our forward propagation method, the outputs are stored as column-vectors, thus the targets have to be transposed. We will need to supply the input data, the target data and  $\eta$, the learning rate, which we will set at some small number for default. So we start back propagation by first initialising a placeholder for the deltas and getting the number of training examples before running them through the &lt;code&gt;FP&lt;/code&gt; method:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def backProp(self, input, target, trainingRate = 0.2):
    &amp;quot;&amp;quot;&amp;quot;Get the error, deltas and back propagate to update the weights&amp;quot;&amp;quot;&amp;quot;

    delta = []
    numExamples = input.shape[0]

    # Do the forward pass
    self.FP(input)

    # Error on the output layer (the last stored layer output)
    output_delta = self._layerOutput[-1] - target.T
    error = np.sum(output_delta**2)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We know from previous posts that the error is squared to get rid of the negatives. From this we compute the deltas for the output layer:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;delta.append(output_delta * sigmoid(self._layerInput[-1], True))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We now have the error but need to know what direction to alter the weights in, thus the gradient of the inputs to the layer needs to be known. So, we get the gradient of the activation function at the input to the layer and take the product with the error. Notice we&amp;rsquo;ve supplied &lt;code&gt;True&lt;/code&gt; to the sigmoid function to get its derivative.&lt;/p&gt;

&lt;p&gt;This is the delta for the output layer. So this calculation is only done when we&amp;rsquo;re considering the index at the end of the network. We should be careful that when telling the algorithm that this is the &amp;ldquo;last layer&amp;rdquo; we take account of the zero-indexing in Python: the last layer is at index &lt;code&gt;self.numLayers - 1&lt;/code&gt;, i.e. in a network with 2 layers, &lt;code&gt;layer[2]&lt;/code&gt; does not exist.&lt;/p&gt;

&lt;p&gt;We also need to get the deltas of the intermediate hidden layers. To do this, (according to our equations above) we have to &amp;lsquo;pull back&amp;rsquo; the delta from the output layer first. More accurately, for any hidden layer, we pull back the delta from the &lt;em&gt;next&lt;/em&gt; layer, which may well be another hidden layer. These deltas from the &lt;em&gt;next&lt;/em&gt; layer are multiplied by the weights from the &lt;em&gt;next&lt;/em&gt; layer &lt;code&gt;[index + 1]&lt;/code&gt;, before getting the product with the sigmoid derivative evaluated at the &lt;em&gt;current&lt;/em&gt; layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: this is &lt;em&gt;back&lt;/em&gt; propagation. We have to start at the end and work back to the beginning. We use the &lt;code&gt;reversed&lt;/code&gt; keyword in our loop to ensure that the algorithm considers the layers in reverse order.&lt;/p&gt;
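
&lt;p&gt;A quick illustration of what &lt;code&gt;reversed&lt;/code&gt; gives us, and of the zero-indexing:&lt;/p&gt;

```python
numLayers = 2
print(list(range(numLayers)))            # [0, 1] - forward-pass order
print(list(reversed(range(numLayers))))  # [1, 0] - back-propagation order
# With zero-indexing, the output layer is index numLayers - 1, here layer 1
```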

&lt;p&gt;Combining this into one method:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Calculate the deltas
for index in reversed(range(self.numLayers)):
    if index == self.numLayers - 1:
        # If the output layer, then compare to the target values
        output_delta = self._layerOutput[index] - target.T
        error = np.sum(output_delta**2)
        delta.append(output_delta * sigmoid(self._layerInput[index], True))
    else:
        # If a hidden layer, compare to the following layer&#39;s delta
        delta_pullback = self.weights[index + 1].T.dot(delta[-1])
        delta.append(delta_pullback[:-1,:] * sigmoid(self._layerInput[index], True))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Pick this piece of code apart. This is an important snippet as it calculates all of the deltas for all of the nodes in the network. Be sure that we understand:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;This is a &lt;code&gt;reversed&lt;/code&gt; loop because we want to deal with the last layer first&lt;/li&gt;
&lt;li&gt;The delta of the output layer is the residual between the output and target multiplied with the gradient (derivative) of the activation function &lt;em&gt;at the current layer&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;The delta of a hidden layer first needs the product of the &lt;em&gt;subsequent&lt;/em&gt; layer&amp;rsquo;s delta with the &lt;em&gt;subsequent&lt;/em&gt; layer&amp;rsquo;s weights. This is then multiplied with the gradient of the activation function evaluated at the &lt;em&gt;current&lt;/em&gt; layer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Double check that this matches up with the equations above too! We can double check the matrix multiplication. For the output layer:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;output_delta&lt;/code&gt; = (numOutputNodes x 1) - (1 x numOutputNodes).T = (numOutputNodes x 1)
&lt;code&gt;error&lt;/code&gt; = sum( (numOutputNodes x 1)**2 ) = a scalar
&lt;code&gt;delta&lt;/code&gt; = (numOutputNodes x 1) * sigmoid( (numOutputNodes x 1) ) = (numOutputNodes x 1)&lt;/p&gt;

&lt;p&gt;For the hidden layers (take the one previous to the output as example):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;delta_pullback&lt;/code&gt; = (numOutputNodes x numHiddenNodes).T.dot(numOutputNodes x 1) = (numHiddenNodes x 1)
&lt;code&gt;delta&lt;/code&gt; = (numHiddenNodes x 1) * sigmoid ( (numHiddenNodes x 1) ) = (numHiddenNodes x 1)&lt;/p&gt;
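
&lt;p&gt;Here&amp;rsquo;s a minimal sketch of that pullback for the &lt;code&gt;(2,2,1)&lt;/code&gt; network, using dummy random values for the deltas and layer inputs purely to check the shapes:&lt;/p&gt;

```python
import numpy as np

def sigmoid(x, Derivative=False):
    if not Derivative:
        return 1 / (1 + np.exp(-x))
    else:
        out = sigmoid(x)
        return out * (1 - out)

numHiddenNodes, numOutputNodes, n = 2, 1, 4
W_JK = np.random.normal(scale=0.1, size=(numOutputNodes, numHiddenNodes + 1))
delta_K = np.random.normal(size=(numOutputNodes, n))       # output-layer deltas
layerInput_J = np.random.normal(size=(numHiddenNodes, n))  # input to hidden layer

delta_pullback = W_JK.T.dot(delta_K)  # (numHiddenNodes + 1, n)
# Drop the bias row with [:-1, :] - no delta flows back into the bias node
delta_J = delta_pullback[:-1, :] * sigmoid(layerInput_J, True)
print(delta_J.shape)  # (2, 4)
```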

&lt;p&gt;Hurray! We have the delta at each node in our network. We can use them to update the weights for each layer in the network. Remember, to update the weights between layer $J$ and $K$ we need to use the output of layer $J$ and the deltas of layer $K$. This means we need to keep a track of the index of the layer we&amp;rsquo;re currently working on ($J$) and the index of the delta layer ($K$) - not forgetting about the zero-indexing in Python:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;for index in range(self.numLayers):
    delta_index = self.numLayers - 1 - index
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s first get the outputs from each layer:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    if index == 0:
        layerOutput = np.vstack([input.T, np.ones([1, numExamples])])
    else:
        layerOutput = np.vstack([self._layerOutput[index - 1], np.ones([1,self._layerOutput[index -1].shape[1]])])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The output of the input layer is just the input examples (which we&amp;rsquo;ve &lt;code&gt;vstack&lt;/code&gt;-ed again), and the output from the other layers we take from the calculation in the forward pass (making sure to add the bias term on the end).&lt;/p&gt;

&lt;p&gt;For the current &lt;code&gt;index&lt;/code&gt; (layer) let&amp;rsquo;s use this &lt;code&gt;layerOutput&lt;/code&gt; to get the change in weight. We will use a few neat tricks to make this succinct:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;	thisWeightDelta = np.sum(\
	    layerOutput[None,:,:].transpose(2,0,1) * delta[delta_index][None,:,:].transpose(2,1,0) \
	    , axis = 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Break it down. We&amp;rsquo;re looking for $\mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }_{J}} $ so it&amp;rsquo;s the delta at &lt;code&gt;delta_index&lt;/code&gt;, the next layer along.&lt;/p&gt;

&lt;p&gt;We want to be able to deal with all of the input training examples simultaneously. This requires a bit of fancy slicing and transposing of the matrices. Take a look: by calling &lt;code&gt;vstack&lt;/code&gt; we made all of the input data and bias terms live in the same matrix of a numpy array. When we slice this array with the &lt;code&gt;[None,:,:]&lt;/code&gt; argument, it tells Python to take all (&lt;code&gt;:&lt;/code&gt;) the data in the rows and columns and shift it to the 2nd and 3rd dimensions, inserting a new, empty first dimension (&lt;code&gt;None&lt;/code&gt; here is numpy&amp;rsquo;s &lt;code&gt;np.newaxis&lt;/code&gt;). We do this to create the three dimensions which we can now transpose into. Calling &lt;code&gt;transpose(2,0,1)&lt;/code&gt; instructs Python to move around the dimensions of the data (e.g. its rows&amp;hellip; or examples). This creates an array where each example now lives in its own plane. The same is done for the deltas of the subsequent layer, but being careful to transpose them in the opposite direction so that the matrix multiplication can occur. The &lt;code&gt;axis=0&lt;/code&gt; is supplied to &lt;code&gt;np.sum&lt;/code&gt; so that the per-example products are summed over the example dimension.&lt;/p&gt;

&lt;p&gt;This looks incredibly complicated. It can be broken down into a for-loop over the input examples, but this reduces the efficiency of the network. Taking advantage of the numpy array like this keeps our calculations fast. In reality, if you&amp;rsquo;re struggling with this particular part, just copy and paste it, forget about it and be happy with yourself for understanding the maths behind back propagation, even if this random bit of Python is perplexing.&lt;/p&gt;
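
&lt;p&gt;If it helps, here&amp;rsquo;s a sketch (with illustrative dummy arrays) showing that the fancy one-liner matches an explicit for-loop over the examples:&lt;/p&gt;

```python
import numpy as np

np.random.seed(0)
n = 4                                  # number of training examples
layerOutput = np.random.randn(3, n)    # (l1 + 1) x n, bias row included
layerDelta = np.random.randn(1, n)     # l2 x n deltas from the next layer

# The one-liner: give each example its own plane, multiply, sum over examples
fancy = np.sum(layerOutput[None, :, :].transpose(2, 0, 1) *
               layerDelta[None, :, :].transpose(2, 1, 0), axis=0)

# The same calculation as an explicit loop over examples: accumulate the
# outer product of each example's delta with its layer output
looped = np.zeros((1, 3))
for k in range(n):
    looped += np.outer(layerDelta[:, k], layerOutput[:, k])

print(np.allclose(fancy, looped))  # True
```

&lt;p&gt;For what it&amp;rsquo;s worth, the same result can also be written as &lt;code&gt;layerDelta.dot(layerOutput.T)&lt;/code&gt;.&lt;/p&gt;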

&lt;p&gt;Anyway. Let&amp;rsquo;s take this set of weight deltas and put back the $\eta$. We&amp;rsquo;ll call this the &lt;code&gt;trainingRate&lt;/code&gt; (more commonly known as the learning rate), matching the parameter we gave &lt;code&gt;backProp&lt;/code&gt; earlier. We&amp;rsquo;ll update the weights by making sure to include the &lt;code&gt;-&lt;/code&gt; from the $-\eta$.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;	weightDelta = trainingRate * thisWeightDelta
	self.weights[index] -= weightDelta
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;the &lt;code&gt;-=&lt;/code&gt; is Python slang for: take the current value and subtract the value of &lt;code&gt;weightDelta&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To finish up, we want our back propagation to return the current error in the network, so:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;return error
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;testing&#34;&gt; A Toy Example&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Believe it or not, that&amp;rsquo;s it! The fundamentals of forward and back propagation have now been implemented in Python. If you want to double check your code, have a look at my completed .py file &lt;a href=&#34;/docs/simpleNN.py&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s test it!&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;Input = np.array([[0,0],[1,1],[0,1],[1,0]])
Target = np.array([[0.0],[0.0],[1.0],[1.0]])

NN = backPropNN((2,2,1))

Error = NN.backProp(Input, Target)
Output = NN.FP(Input)

print(&#39;Input \tOutput \t\tTarget&#39;)
for i in range(Input.shape[0]):
    print(&#39;{0}\t {1} \t{2}&#39;.format(Input[i], Output[i], Target[i]))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will provide 4 input examples and the expected targets. We create an instance of the network called &lt;code&gt;NN&lt;/code&gt; with 2 layers (2 nodes in the hidden and 1 node in the output layer). We make &lt;code&gt;NN&lt;/code&gt; do &lt;code&gt;backProp&lt;/code&gt; with the input and target data and then get the output from the final layer by running our input through the network with &lt;code&gt;FP&lt;/code&gt;. The printout is self-explanatory. Give it a try!&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Input 	Output 		Target
[0 0]	 [ 0.51624448] 	[ 0.]
[1 1]	 [ 0.51688469] 	[ 0.]
[0 1]	 [ 0.51727559] 	[ 1.]
[1 0]	 [ 0.51585529] 	[ 1.]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can see that the network has taken our inputs, and we have some outputs too. They&amp;rsquo;re not great, and all seem to live around the same value. This is because we initialised the weights across the network to a similarly small random value. We need to repeat the &lt;code&gt;FP&lt;/code&gt; and &lt;code&gt;backProp&lt;/code&gt; process many times in order to keep updating the weights.&lt;/p&gt;

&lt;h2 id=&#34;iterating&#34;&gt; Iterating &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Iteration is very straightforward. We just tell our algorithm to repeat a maximum of &lt;code&gt;maxIterations&lt;/code&gt; times or until the &lt;code&gt;Error&lt;/code&gt; is below &lt;code&gt;minError&lt;/code&gt; (whichever comes first). As the weights are stored internally within &lt;code&gt;NN&lt;/code&gt;, every time we call the &lt;code&gt;backProp&lt;/code&gt; method it uses the latest, internally stored weights and doesn&amp;rsquo;t start again - the weights are only initialised once upon creation of &lt;code&gt;NN&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;maxIterations = 100000
minError = 1e-5

for i in range(maxIterations + 1):
    Error = NN.backProp(Input, Target)
    if i % 2500 == 0:
        print(&amp;quot;Iteration {0}\tError: {1:0.6f}&amp;quot;.format(i,Error))
    if Error &amp;lt;= minError:
        print(&amp;quot;Minimum error reached at iteration {0}&amp;quot;.format(i))
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here&amp;rsquo;s the end of my output from the first run:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Iteration 100000	Error: 0.000291
Input 	Output 		Target
[0 0]	 [ 0.00780385] 	[ 0.]
[1 1]	 [ 0.00992829] 	[ 0.]
[0 1]	 [ 0.99189799] 	[ 1.]
[1 0]	 [ 0.99189943] 	[ 1.]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Much better! The error is very small and the outputs are very close to the correct values. However, they&amp;rsquo;re not completely right. We can do better by implementing different activation functions, which we will do in the next tutorial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Please&lt;/strong&gt; let me know if anything is unclear, or there are mistakes. Let me know how you get on!&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>A Simple Neural Network - Vectorisation</title>
      <link>/post/nn-more-maths/</link>
      <pubDate>Mon, 13 Mar 2017 10:33:08 +0000</pubDate>
      
      <guid>/post/nn-more-maths/</guid>
      <description>&lt;p&gt;The third in our series of tutorials on Simple Neural Networks. This time, we&amp;rsquo;re looking a bit deeper into the maths, specifically focusing on vectorisation. This is an important step before we can translate our maths in a functioning script in Python.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;p&gt;So we&amp;rsquo;ve &lt;a href=&#34;/post/neuralnetwork&#34;&gt;been through the maths&lt;/a&gt; of a neural network (NN) using back propagation and taken a look at the &lt;a href=&#34;/post/transfer-functions&#34;&gt;different activation functions&lt;/a&gt; that we could implement. This post will translate the mathematics into Python which we can piece together at the end into a functioning NN!&lt;/p&gt;

&lt;h2 id=&#34;forwardprop&#34;&gt; Forward Propagation &lt;/h2&gt;

&lt;p&gt;Let&amp;rsquo;s remind ourselves of our notation from our 2 layer network in the &lt;a href=&#34;/post/neuralnetwork&#34;&gt;maths tutorial&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I is our input layer&lt;/li&gt;
&lt;li&gt;J is our hidden layer&lt;/li&gt;
&lt;li&gt;$w_{ij}$ is the weight connecting the $i^{\text{th}}$ node in $I$ to the $j^{\text{th}}$ node in $J$&lt;/li&gt;
&lt;li&gt;$x_{j}$ is the total input to the $j^{\text{th}}$ node in $J$&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, assuming that we have three features (nodes) in the input layer, the input to the first node in the hidden layer is given by:&lt;/p&gt;

&lt;div&gt;$$
x_{1} = \mathcal{O}_{1}^{I} w_{11} + \mathcal{O}_{2}^{I} w_{21} + \mathcal{O}_{3}^{I} w_{31}
$$&lt;/div&gt;

&lt;p&gt;Lets generalise this for any connected nodes in any layer: the input to node $j$ in layer $l$ is:&lt;/p&gt;

&lt;div&gt;$$
x_{j} = \mathcal{O}_{1}^{l-1} w_{1j} + \mathcal{O}_{2}^{l-1} w_{2j} + \mathcal{O}_{3}^{l-1} w_{3j}
$$&lt;/div&gt;

&lt;p&gt;But we need to be careful and remember to put in our &lt;em&gt;bias&lt;/em&gt; term $\theta$. In our maths tutorial, we said that the bias term was always equal to 1; now we can try to understand why.&lt;/p&gt;

&lt;p&gt;We could just add the bias term onto the end of the previous equation to get:&lt;/p&gt;

&lt;div&gt;$$
x_{j} = \mathcal{O}_{1}^{l-1} w_{1j} + \mathcal{O}_{2}^{l-1} w_{2j} + \mathcal{O}_{3}^{l-1} w_{3j} + \theta_{j}
$$&lt;/div&gt;

&lt;p&gt;If we think more carefully about this, what we are really saying is that &amp;ldquo;an extra node in the previous layer, which always outputs the value 1, is connected to the node $j$ in the current layer by some weight $w_{4j}$&amp;rdquo;, i.e. $1 \cdot w_{4j}$:&lt;/p&gt;

&lt;div&gt;$$
x_{j} = \mathcal{O}_{1}^{l-1} w_{1j} + \mathcal{O}_{2}^{l-1} w_{2j} + \mathcal{O}_{3}^{l-1} w_{3j} + 1 \cdot w_{4j}
$$&lt;/div&gt;

&lt;p&gt;By the magic of matrix multiplication, we should be able to convince ourselves that:&lt;/p&gt;

&lt;div&gt;$$
x_{j} = \begin{pmatrix} w_{1j} &amp;w_{2j} &amp;w_{3j} &amp;w_{4j} \end{pmatrix}
     \begin{pmatrix}    \mathcal{O}_{1}^{l-1} \\
                    \mathcal{O}_{2}^{l-1} \\
                    \mathcal{O}_{3}^{l-1} \\
                    1
        \end{pmatrix}

$$&lt;/div&gt;

&lt;p&gt;Now, let&amp;rsquo;s be a little more explicit and consider the input $x$ to the first two nodes of the layer $J$:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
x_{1} &amp;= \begin{pmatrix} w_{11} &amp;w_{21} &amp;w_{31} &amp;w_{41} \end{pmatrix}
     \begin{pmatrix}    \mathcal{O}_{1}^{l-1} \\
                    \mathcal{O}_{2}^{l-1} \\
                    \mathcal{O}_{3}^{l-1} \\
                    1
        \end{pmatrix}
\\[0.5em]
x_{2} &amp;= \begin{pmatrix} w_{12} &amp;w_{22} &amp;w_{32} &amp;w_{42} \end{pmatrix}
     \begin{pmatrix}    \mathcal{O}_{1}^{l-1} \\
                    \mathcal{O}_{2}^{l-1} \\
                    \mathcal{O}_{3}^{l-1} \\
                    1
        \end{pmatrix}
\end{align}
$$&lt;/div&gt;

&lt;p&gt;Note that the second matrix is the same in both input calculations, as it contains only the output values of the previous layer (including the bias term). This means (again by the magic of matrix multiplication) that we can construct a single vector containing the input values $x$ to the current layer:&lt;/p&gt;

&lt;div&gt; $$
\begin{pmatrix} x_{1} \\ x_{2} \end{pmatrix}
= \begin{pmatrix}   w_{11} &amp; w_{21} &amp; w_{31} &amp; w_{41} \\
                    w_{12} &amp; w_{22} &amp; w_{32} &amp; w_{42} 
                    \end{pmatrix}
     \begin{pmatrix}    \mathcal{O}_{1}^{l-1} \\
                    \mathcal{O}_{2}^{l-1} \\
                    \mathcal{O}_{3}^{l-1} \\
                    1
        \end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;This is an $\left(n \times (m+1) \right)$ matrix multiplied by an $\left((m+1) \times 1 \right)$ vector, where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$n$ is the number of nodes in the current layer $l$&lt;/li&gt;
&lt;li&gt;$m$ is the number of nodes in the previous layer $l-1$&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let&amp;rsquo;s generalise: the vector of inputs to the $n$ nodes in the current layer from the $m$ nodes in the previous layer is:&lt;/p&gt;

&lt;div&gt; $$
\begin{pmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{pmatrix}
= \begin{pmatrix}   w_{11} &amp; w_{21} &amp; \cdots &amp; w_{(m+1)1} \\
                    w_{12} &amp; w_{22} &amp; \cdots &amp; w_{(m+1)2} \\
                    \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\
                    w_{1n} &amp; w_{2n} &amp; \cdots &amp; w_{(m+1)n} \\
                    \end{pmatrix}
     \begin{pmatrix}    \mathcal{O}_{1}^{l-1} \\
                    \mathcal{O}_{2}^{l-1} \\
                    \mathcal{O}_{3}^{l-1} \\
                    1
        \end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;or:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{x_{J}} = \mathbf{W_{IJ}} \mathbf{\vec{\mathcal{O}}_{I}}
$$&lt;/div&gt;

&lt;p&gt;In this notation, the output from the current layer $J$ is easily written as:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{\vec{\mathcal{O}}_{J}} = \sigma \left( \mathbf{W_{IJ}} \mathbf{\vec{\mathcal{O}}_{I}} \right)
$$&lt;/div&gt;

&lt;p&gt;Where $\sigma$ is the activation or transfer function chosen for this layer which is applied elementwise to the product of the matrices.&lt;/p&gt;

&lt;p&gt;This notation allows us to very efficiently calculate the output of a layer, which reduces computation time. Additionally, we are now able to extend this efficiency by making our network consider &lt;strong&gt;all&lt;/strong&gt; of our input examples at once.&lt;/p&gt;
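As a quick sanity check, the single-layer forward pass can be sketched in NumPy. This assumes the sigmoid as the transfer function; the helper name `layer_forward` and the numbers are invented for illustration, not taken from the tutorial:

```python
import numpy as np

def layer_forward(W, O_prev):
    """Forward-propagate one layer: append the always-1 bias node to the
    previous layer's outputs, multiply by the weight matrix, then apply
    the sigmoid elementwise."""
    O_b = np.append(O_prev, 1.0)        # outputs of the previous layer, plus bias
    x = W @ O_b                         # vector of inputs x_j to the current layer
    return 1.0 / (1.0 + np.exp(-x))     # sigma applied elementwise

# 3 input nodes (+ bias) feeding 2 nodes in the current layer
W_IJ = np.array([[0.1, 0.2, 0.3, 0.05],
                 [0.4, 0.5, 0.6, 0.05]])
O_I = np.array([1.0, 0.5, -1.0])
O_J = layer_forward(W_IJ, O_I)          # one output per node in layer J
```

Each entry of `O_J` is exactly the $x_j$ from the matrix product above, pushed through $\sigma$.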

&lt;p&gt;Remember that our network requires training (many epochs of forward propagation followed by back propagation) and as such needs training data (preferably a lot of it!). Rather than consider each training example individually, we vectorise each example into a large matrix of inputs.&lt;/p&gt;

&lt;p&gt;Our weights $\mathbf{W_{IJ}}$ connecting layer $I$ to layer $J$ are the same no matter which input example we put into the network: this is fundamental, as we expect that the network would act the same way for similar inputs i.e. we expect the same neurons (nodes) to fire based on the similar features in the input.&lt;/p&gt;

&lt;p&gt;If two input examples gave the outputs $ \mathbf{\vec{\mathcal{O}}_{I_{1}}} $ and $ \mathbf{\vec{\mathcal{O}}_{I_{2}}} $ from the nodes in layer $I$ to a layer $J$, then the outputs from layer $J$, $\mathbf{\vec{\mathcal{O}}_{J_{1}}}$ and $\mathbf{\vec{\mathcal{O}}_{J_{2}}}$, can be written:&lt;/p&gt;

&lt;div&gt;$$
\begin{pmatrix}
    \mathbf{\vec{\mathcal{O}}_{J_{1}}} \\
    \mathbf{\vec{\mathcal{O}}_{J_{2}}}
\end{pmatrix}
=
\sigma \left(\mathbf{W_{IJ}}\begin{pmatrix}
        \mathbf{\vec{\mathcal{O}}_{I_{1}}} &amp;
        \mathbf{\vec{\mathcal{O}}_{I_{2}}}  
    \end{pmatrix}
    \right)
=
\sigma \left(\mathbf{W_{IJ}}\begin{pmatrix}
        \begin{bmatrix}\mathcal{O}_{I_{1}}^{1} \\ \vdots \\ \mathcal{O}_{I_{1}}^{m}
        \end{bmatrix}
        \begin{bmatrix}\mathcal{O}_{I_{2}}^{1} \\ \vdots \\ \mathcal{O}_{I_{2}}^{m}
        \end{bmatrix}   
    \end{pmatrix}
        \right)
=   \sigma \left(\begin{pmatrix} \mathbf{W_{IJ}}\begin{bmatrix}\mathcal{O}_{I_{1}}^{1} \\ \vdots \\ \mathcal{O}_{I_{1}}^{m}
        \end{bmatrix} &amp; 
    \mathbf{W_{IJ}}     \begin{bmatrix}\mathcal{O}_{I_{2}}^{1} \\ \vdots \\ \mathcal{O}_{I_{2}}^{m}
        \end{bmatrix}
    \end{pmatrix}
        \right)

$$&lt;/div&gt;

&lt;p&gt;for the $m$ nodes in the input layer. This may look hideous, but the point is that all of the training examples that are input to the network can be dealt with simultaneously, because each example becomes another column in the input matrix and a corresponding column in the output matrix.&lt;/p&gt;

&lt;div class=&#34;highlight_section&#34;&gt;

In summary, for forward propagation:

&lt;ul&gt;
&lt;li&gt; All $n$ training examples with $m$ features (input nodes) are put into column vectors to build the input matrix $\mathbf{I}$, taking care to add the bias term to the end of each.&lt;/li&gt;

&lt;li&gt; All weight vectors that connect the $m+1$ nodes in layer $I$ to the $n$ nodes in layer $J$ are put together in a weight matrix:&lt;/li&gt;

&lt;div&gt;$$
\mathbf{I} =    \left(
    \begin{bmatrix}
        \mathcal{O}_{I_{1}}^{1} \\ \vdots \\ \mathcal{O}_{I_{1}}^{m} \\ 1 \end{bmatrix}
    \begin{bmatrix}
        \mathcal{O}_{I_{2}}^{1} \\ \vdots \\ \mathcal{O}_{I_{2}}^{m} \\ 1
    \end{bmatrix}
        \begin{bmatrix}
    \cdots \\ \cdots \\ \ddots \\ \cdots
        \end{bmatrix}
    \begin{bmatrix}
        \mathcal{O}_{I_{n}}^{1} \\ \vdots \\ \mathcal{O}_{I_{n}}^{m} \\ 1

    \end{bmatrix}
    \right)

\ \ \ \ 


\mathbf{W_{IJ}} = 
\begin{pmatrix}     w_{11} &amp; w_{21} &amp; \cdots &amp; w_{(m+1)1} \\
                    w_{12} &amp; w_{22} &amp; \cdots &amp; w_{(m+1)2} \\
                    \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\
                    w_{1n} &amp; w_{2n} &amp; \cdots &amp; w_{(m+1)n} \\
                    \end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;&lt;li&gt; We perform $ \sigma \left( \mathbf{W_{IJ}} \mathbf{I} \right)$ to get the matrix $\mathbf{\vec{\mathcal{O}}_{J}}$, whose columns hold the outputs from each of the $n$ nodes in layer $J$ for every training example &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;&lt;/p&gt;
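The summary above can be sketched directly in NumPy. The sigmoid, the helper name `forward_batch`, and the random shapes are illustrative assumptions, not part of the tutorial:

```python
import numpy as np

def forward_batch(W, X):
    """Each column of X is one training example (features only); a row of
    ones is stacked underneath so every example carries the bias input."""
    I = np.vstack([X, np.ones((1, X.shape[1]))])   # shape (m+1, n_examples)
    return 1.0 / (1.0 + np.exp(-(W @ I)))          # shape (n_nodes, n_examples)

rng = np.random.default_rng(0)
W_IJ = rng.random((2, 4))    # 2 nodes in J; 3 features + bias in I
X = rng.random((3, 5))       # 5 training examples as columns
O_J = forward_batch(W_IJ, X)
```

Each column of `O_J` is the layer's output for the corresponding example, so one matrix product handles the whole training set.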

&lt;h2 id=&#34;backprop&#34;&gt; Back Propagation &lt;/h2&gt;

&lt;p&gt;To perform back propagation there are a couple of things that we need to vectorise. The first is the error on the weights when we compare the output of the network $\mathbf{\vec{\mathcal{O}}_{K}}$ with the known target values:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{T_{K}} = \begin{bmatrix} t_{1} \\ \vdots \\ t_{k} \end{bmatrix}
$$&lt;/div&gt;

&lt;p&gt;A reminder of the formulae:&lt;/p&gt;

&lt;div&gt;$$

    \delta_{k} = \mathcal{O}_{k}  \left( 1 - \mathcal{O}_{k}  \right)  \left( \mathcal{O}_{k} - t_{k} \right), 
    \ \ \ \
    \delta_{j} = \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right)   \sum_{k \in K} \delta_{k} W_{jk}

$$&lt;/div&gt;
    

&lt;p&gt;Where $\delta_{k}$ is the error on the weights to the output layer and $\delta_{j}$ is the error on the weights to the hidden layers. We also need to vectorise the update formulae for the weights and bias:&lt;/p&gt;

&lt;div&gt;$$
    W + \Delta W \rightarrow W, \ \ \ \
    \theta + \Delta\theta \rightarrow \theta
$$&lt;/div&gt;

&lt;h3 id=&#34;outputdeltas&#34;&gt;  Vectorising the Output Layer Deltas &lt;/h3&gt;

&lt;p&gt;Let&amp;rsquo;s look at the output layer delta: we need a subtraction between the outputs and the targets which is multiplied by the derivative of the transfer function (sigmoid). Well, the subtraction between two matrices is straightforward:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{\vec{\mathcal{O}}_{K}} -  \mathbf{T_{K}}
$$&lt;/div&gt;

&lt;p&gt;but we need to consider the derivative. Remember that the output of the final layer is:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{\vec{\mathcal{O}}_{K}}  = \sigma \left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}}  \right)
$$&lt;/div&gt;

&lt;p&gt;and the derivative can be written:&lt;/p&gt;

&lt;div&gt;$$
 \sigma ^{\prime} \left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}}  \right) =   \mathbf{\vec{\mathcal{O}}_{K}}\left( 1 - \mathbf{\vec{\mathcal{O}}_{K}}  \right) 
$$&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This is the derivative of the sigmoid as evaluated at each of the nodes in the layer $K$. It is acting &lt;em&gt;elementwise&lt;/em&gt; on the inputs to layer $K$. Thus it is a column vector with the same length as the number of nodes in layer $K$.&lt;/p&gt;

&lt;p&gt;Put the derivative and subtraction terms together and we get:&lt;/p&gt;

&lt;div class=&#34;highlight_section&#34;&gt;$$
\mathbf{\vec{\delta}_{K}} = \sigma^{\prime}\left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) * \left( \mathbf{\vec{\mathcal{O}}_{K}} -  \mathbf{T_{K}}\right)
$$&lt;/div&gt;

&lt;p&gt;Again, the derivatives are being multiplied elementwise with the results of the subtraction. Now we have a vector of deltas for the output layer $K$! Things aren&amp;rsquo;t so straightforward for the deltas in the hidden layers.&lt;/p&gt;
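In NumPy this elementwise recipe is one line: because $\sigma$ is the sigmoid, $\sigma^{\prime}$ evaluated at the layer's inputs can be rewritten in terms of the layer's own outputs as $\mathcal{O}_{K}(1-\mathcal{O}_{K})$. The helper name and the numbers below are invented for illustration:

```python
import numpy as np

def output_deltas(O_K, T_K):
    # sigma'(x_K) = O_K * (1 - O_K) for the sigmoid, multiplied
    # elementwise by the subtraction (O_K - T_K)
    return O_K * (1.0 - O_K) * (O_K - T_K)

O_K = np.array([0.8, 0.3])   # network outputs
T_K = np.array([1.0, 0.0])   # target values
delta_K = output_deltas(O_K, T_K)
```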

&lt;p&gt;Let&amp;rsquo;s visualise what we&amp;rsquo;ve seen:&lt;/p&gt;

&lt;div  id=&#34;fig1&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;NN Vectorisation&#34; src=&#34;/img/simpleNN/nn_vectors1.png&#34; width=&#34;30%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 1&lt;/font&gt;: NN showing the weights and outputs in vector form along with the target values for layer $K$
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;hiddendeltas&#34;&gt; Vectorising the Hidden Layer Deltas &lt;/h3&gt;

&lt;p&gt;We need to vectorise:&lt;/p&gt;

&lt;div&gt;$$
    \delta_{j} = \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right)   \sum_{k \in K} \delta_{k} W_{jk}
$$&lt;/div&gt;

&lt;p&gt;Let&amp;rsquo;s deal with the summation. We&amp;rsquo;re multiplying each of the deltas $\delta_{k}$ in the output layer (or, more generally, the subsequent layer, which could be another hidden layer) by the weight $w_{jk}$ that pulls them back to the node $j$ in the current layer, before adding the results. For the first node in the hidden layer:&lt;/p&gt;

&lt;div&gt;$$
\sum_{k \in K} \delta_{k} W_{jk} = \delta_{k}^{1}w_{11} + \delta_{k}^{2}w_{12} + \delta_{k}^{3}w_{13}

= \begin{pmatrix} w_{11} &amp; w_{12} &amp; w_{13} \end{pmatrix}  \begin{pmatrix} \delta_{k}^{1} \\ \delta_{k}^{2} \\ \delta_{k}^{3}\end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;Notice the weights? They pull the delta from each output-layer node back to the first node of the hidden layer. In forward propagation we considered weights going from multiple nodes out to a single node; here, a single node receives contributions back from multiple nodes.&lt;/p&gt;

&lt;p&gt;Combine this summation with the multiplication by the activation function derivative:&lt;/p&gt;

&lt;div&gt;$$
\delta_{j}^{1} = \sigma^{\prime} \left(  x_{j}^{1} \right)
\begin{pmatrix} w_{11} &amp; w_{12} &amp; w_{13} \end{pmatrix}  \begin{pmatrix} \delta_{k}^{1} \\ \delta_{k}^{2} \\ \delta_{k}^{3} \end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;remembering that the input to the $\text{1}^\text{st}$ node in the layer $J$ is:&lt;/p&gt;

&lt;div&gt;$$
x_{j}^{1} = \mathbf{W_{I1}}\mathbf{\vec{\mathcal{O}}_{I}}
$$&lt;/div&gt;

&lt;p&gt;What about the $\text{2}^\text{nd}$ node in the hidden layer?&lt;/p&gt;

&lt;div&gt;$$
\delta_{j}^{2} = \sigma^{\prime} \left(  x_{j}^{2} \right)
\begin{pmatrix} w_{21} &amp; w_{22} &amp; w_{23} \end{pmatrix}  \begin{pmatrix}  \delta_{k}^{1} \\ \delta_{k}^{2} \\ \delta_{k}^{3} \end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;This is looking familiar; based upon what we&amp;rsquo;ve done before, we can be confident in saying that:&lt;/p&gt;

&lt;div&gt;$$
\begin{pmatrix}
    \delta_{j}^{1} \\ \delta_{j}^{2}
\end{pmatrix}
 = 
 \begin{pmatrix}
     \sigma^{\prime} \left(  x_{j}^{1} \right) \\ \sigma^{\prime} \left(  x_{j}^{2} \right)
 \end{pmatrix}
 *
  \begin{pmatrix}
    w_{11} &amp; w_{12} &amp; w_{13} \\
    w_{21} &amp; w_{22} &amp; w_{23} 
 \end{pmatrix}
 
 \begin{pmatrix}\delta_{k}^{1} \\ \delta_{k}^{2} \\ \delta_{k}^{3}  \end{pmatrix}

$$&lt;/div&gt;

&lt;p&gt;We&amp;rsquo;ve seen a version of this weights matrix before when we did the forward propagation vectorisation. In this case though, look carefully - as we mentioned, the weights are not in the same places, in fact, the weight matrix has been &lt;em&gt;transposed&lt;/em&gt; from the one we used in forward propagation. This makes sense because we&amp;rsquo;re going backwards through the network now! This is useful because it means there is very little extra calculation needed here - the matrix we need is already available from the forward pass, but just needs transposing. We can call the weights in back propagation here $ \mathbf{ W_{KJ}} $ as we&amp;rsquo;re pulling the deltas from $K$ to $J$.&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
    \mathbf{W_{KJ}} &amp;=
    \begin{pmatrix}
    w_{11} &amp; w_{12} &amp; \cdots &amp; w_{1n} \\
    w_{21} &amp; w_{22} &amp; \cdots &amp; w_{2n}  \\
    \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\
    w_{(m+1)1} &amp; w_{(m+1)2} &amp; \cdots &amp; w_{(m+1)n}
    \end{pmatrix} , \ \ \
    
    \mathbf{W_{JK}} = 
    \begin{pmatrix}     w_{11} &amp; w_{21} &amp; \cdots &amp; w_{(m+1)1} \\
                    w_{12} &amp; w_{22} &amp; \cdots &amp; w_{(m+1)2} \\
                    \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\
                    w_{1n} &amp; w_{2n} &amp; \cdots &amp; w_{(m+1)n} \\
                    \end{pmatrix} \\[0.5em]
                        
\mathbf{W_{KJ}} &amp;= \mathbf{W^{\intercal}_{JK}}
\end{align}
$$&lt;/div&gt;
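A sketch of the hidden-layer deltas in NumPy, reusing the forward weight matrix via its transpose. This assumes a sigmoid layer (so $\sigma^{\prime}$ becomes $\mathcal{O}_{J}(1-\mathcal{O}_{J})$) and that the bias column has already been dropped from $\mathbf{W_{JK}}$, since no delta flows back to the constant bias node; the names and numbers are illustrative:

```python
import numpy as np

def hidden_deltas(O_J, W_JK, delta_K):
    # W_JK.T pulls each output-layer delta back to the hidden nodes;
    # O_J * (1 - O_J) is the sigmoid derivative evaluated elementwise
    return O_J * (1.0 - O_J) * (W_JK.T @ delta_K)

W_JK = np.array([[0.2, 0.4],    # 3 output nodes fed by 2 hidden nodes
                 [0.6, 0.1],
                 [0.3, 0.5]])
delta_K = np.array([0.1, -0.2, 0.05])
O_J = np.array([0.5, 0.5])
delta_J = hidden_deltas(O_J, W_JK, delta_K)
```

No extra matrix has to be built for the backward pass: `W_JK.T` is just a view of the forward weights.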

&lt;div class=&#34;highlight_section&#34;&gt;

And so, the vectorised equations for the output layer and hidden layer deltas are:

&lt;div&gt;$$
\begin{align}

\mathbf{\vec{\delta}_{K}} &amp;= \sigma^{\prime}\left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) * \left( \mathbf{\vec{\mathcal{O}}_{K}} -  \mathbf{T_{K}}\right) \\[0.5em]

\mathbf{ \vec{ \delta }_{J}} &amp;= \sigma^{\prime} \left( \mathbf{ W_{IJ} \mathcal{O}_{I} } \right) * \mathbf{ W^{\intercal}_{JK}} \mathbf{ \vec{\delta}_{K}} 
\end{align}

$$&lt;/div&gt;

&lt;p&gt;&lt;/div&gt;&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s visualise what we&amp;rsquo;ve seen:&lt;/p&gt;

&lt;div  id=&#34;fig2&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;NN Vectorisation 2&#34; src=&#34;/img/simpleNN/nn_vectors2.png&#34; width=&#34;20%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 2&lt;/font&gt;: The NN showing the delta vectors
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;updates&#34;&gt; Vectorising the Update Equations &lt;/h3&gt;

&lt;p&gt;Finally, now that we have the vectorised equations for the deltas (which required us to get the vectorised equations for the forward pass), we&amp;rsquo;re ready to get the update equations in vector form. Let&amp;rsquo;s recall the update equations:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
    \Delta W &amp;= -\eta \ \delta_{l} \ \mathcal{O}_{l-1} \\
    \Delta\theta &amp;= -\eta \ \delta_{l}
\end{align}
$$&lt;/div&gt;

&lt;p&gt;Ignoring the $-\eta$ for now, we need to get a vector form for $\delta_{l} \ \mathcal{O}_{l-1}$ in order to get the update to the weights. We have the matrix of weights:&lt;/p&gt;

&lt;div&gt;$$
    
\mathbf{W_{JK}} = 
\begin{pmatrix}     w_{11} &amp; w_{21}  &amp; w_{31} \\
                w_{12} &amp; w_{22}  &amp; w_{32} \\

                \end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;Suppose we are updating the weight $w_{21}$ in the matrix. We&amp;rsquo;re looking to find the product of the output from the second node in $J$ with the delta from the first node in $K$.&lt;/p&gt;

&lt;div&gt;$$
    \Delta w_{21} = \delta_{K}^{1} \mathcal{O}_{J}^{2} 
$$&lt;/div&gt;

&lt;p&gt;Considering this example, we can write the matrix for the weight updates as:&lt;/p&gt;

&lt;div&gt;$$
    
\Delta \mathbf{W_{JK}} = 
\begin{pmatrix}     \delta_{K}^{1} \mathcal{O}_{J}^{1} &amp; \delta_{K}^{1}  \mathcal{O}_{J}^{2}  &amp; \delta_{K}^{1} \mathcal{O}_{J}^{3}  \\
                \delta_{K}^{2} \mathcal{O}_{J}^{1} &amp; \delta_{K}^{2} \mathcal{O}_{J}^{2}  &amp; \delta_{K}^{2} \mathcal{O}_{J}^{3} 

                \end{pmatrix}
 = 

\begin{pmatrix}  \delta_{K}^{1} \\ \delta_{K}^{2}\end{pmatrix}

\begin{pmatrix}     \mathcal{O}_{J}^{1} &amp; \mathcal{O}_{J}^{2}&amp; \mathcal{O}_{J}^{3}

\end{pmatrix}

$$&lt;/div&gt;

&lt;p&gt;Generalising this into vector notation and including the &lt;em&gt;learning rate&lt;/em&gt; $\eta$, the update for the weights connecting layer $J$ to layer $K$ is:&lt;/p&gt;

&lt;div&gt;$$
    
\Delta \mathbf{W_{JK}} = -\eta \mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }^{\intercal}_{J}}

$$&lt;/div&gt;

&lt;p&gt;Similarly, we have the update to the bias term:&lt;/p&gt;

&lt;div&gt;$$
\Delta \vec{\theta} = -\eta \mathbf{ \vec{ \delta }_{K}} 
$$&lt;/div&gt;

&lt;p&gt;So the bias term is updated just by taking the deltas straight from the nodes in the subsequent layer, scaled by the negative learning rate.&lt;/p&gt;
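Both updates can be sketched in NumPy: `np.outer` builds the delta-times-output matrix, one entry $\delta_{K}^{k}\,\mathcal{O}_{J}^{j}$ per weight. The helper name and the numbers are invented for illustration:

```python
import numpy as np

def apply_updates(W_JK, theta, delta_K, O_J, eta=0.1):
    # outer product: row k, column j holds delta_K[k] * O_J[j]
    W_new = W_JK - eta * np.outer(delta_K, O_J)
    theta_new = theta - eta * delta_K   # bias moves by the deltas alone
    return W_new, theta_new

W_JK = np.zeros((2, 3))                 # 2 output nodes, 3 hidden outputs
theta = np.zeros(2)
delta_K = np.array([1.0, 2.0])
O_J = np.array([1.0, 0.5, 0.0])
W_new, theta_new = apply_updates(W_JK, theta, delta_K, O_J)
```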

&lt;div class=&#34;highlight_section&#34;&gt;

In summary, for back propagation, the equations we need in vector form are:

&lt;div&gt;$$
\begin{align}

\mathbf{\vec{\delta}_{K}} &amp;= \sigma^{\prime}\left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) * \left( \mathbf{\vec{\mathcal{O}}_{K}} -  \mathbf{T_{K}}\right) \\[0.5em]

\mathbf{ \vec{ \delta }_{J}} &amp;= \sigma^{\prime} \left( \mathbf{ W_{IJ} \mathcal{O}_{I} } \right) * \mathbf{ W^{\intercal}_{JK}} \mathbf{ \vec{\delta}_{K}}

\end{align}
$$&lt;/div&gt;

&lt;div&gt;$$
\begin{align}

\mathbf{W_{JK}} + \Delta \mathbf{W_{JK}} &amp;\rightarrow \mathbf{W_{JK}}, \ \ \ \Delta \mathbf{W_{JK}} = -\eta \mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }^{\intercal}_{J}} \\[0.5em]

\vec{\theta}  + \Delta \vec{\theta}  &amp;\rightarrow \vec{\theta}, \ \ \ \Delta \vec{\theta} = -\eta \mathbf{ \vec{ \delta }_{K}} 

\end{align}
$$&lt;/div&gt;

&lt;p&gt;With $*$ representing an elementwise multiplication between the matrices.&lt;/p&gt;

&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
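To see all of the vectorised pieces working together, here is a toy gradient-descent loop under the same assumptions: sigmoid everywhere, squared error, and the bias handled by an appended row of ones. The 2-2-1 architecture, random data, and averaging of the batch gradient are invented for this sketch, not prescribed by the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))
add_bias = lambda O: np.vstack([O, np.ones((1, O.shape[1]))])

# toy 2-2-1 network; each weight matrix includes a bias column
W_IJ = rng.standard_normal((2, 3))
W_JK = rng.standard_normal((1, 3))
X = rng.standard_normal((2, 8))                       # 8 examples as columns
T = (X.sum(axis=0, keepdims=True) > 0).astype(float)  # toy targets
eta, n = 0.5, X.shape[1]

def loss():
    O_K = sig(W_JK @ add_bias(sig(W_IJ @ add_bias(X))))
    return float(((O_K - T) ** 2).mean())

before = loss()
for _ in range(200):
    O_J = sig(W_IJ @ add_bias(X))                     # forward pass
    O_K = sig(W_JK @ add_bias(O_J))
    d_K = O_K * (1 - O_K) * (O_K - T)                 # output-layer deltas
    d_J = O_J * (1 - O_J) * (W_JK[:, :-1].T @ d_K)    # hidden-layer deltas
    W_JK -= eta * (d_K @ add_bias(O_J).T) / n         # vectorised updates
    W_IJ -= eta * (d_J @ add_bias(X).T) / n
after = loss()
```

The bias column of $\mathbf{W_{JK}}$ is dropped (`[:, :-1]`) when pulling the deltas back, since no error flows to the constant bias node.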

&lt;h2 id=&#34;nextsteps&#34;&gt; What&#39;s next? &lt;/h2&gt;

&lt;p&gt;Although this kind of mathematics can be tedious and sometimes hard to follow (and probably contains numerous notation mistakes&amp;hellip; please let me know if you find them!), it is necessary in order to write a quick, efficient NN. Our next step is to implement this setup in Python.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>A Simple Neural Network - Transfer Functions</title>
      <link>/post/transfer-functions/</link>
      <pubDate>Wed, 08 Mar 2017 10:43:07 +0000</pubDate>
      
      <guid>/post/transfer-functions/</guid>
      <description>&lt;p&gt;We&amp;rsquo;re going to write a little bit of Python in this tutorial on Simple Neural Networks (Part 2). It will focus on the different types of activation (or transfer) functions, their properties and how to write each of them (and their derivatives) in Python.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;p&gt;As promised in the previous post, we&amp;rsquo;ll take a look at some of the different activation functions that could be used in our nodes. Again &lt;strong&gt;please&lt;/strong&gt; let me know if there&amp;rsquo;s anything I&amp;rsquo;ve gotten totally wrong - I&amp;rsquo;m very much learning too.&lt;/p&gt;

&lt;div id=&#34;toctop&#34;&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#linear&#34;&gt;Linear Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#sigmoid&#34;&gt;Sigmoid Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tanh&#34;&gt;Hyperbolic Tangent Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#gaussian&#34;&gt;Gaussian Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#step&#34;&gt;Heaviside (step) Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#ramp&#34;&gt;Ramp Function&lt;/a&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#relu&#34;&gt;Rectified Linear Unit (ReLU)&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;linear&#34;&gt; Linear (Identity) Function &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&#34;what-does-it-look-like&#34;&gt;What does it look like?&lt;/h3&gt;

&lt;div  id=&#34;fig1&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/linear.png&#34; width=&#34;40%&#34;&gt;&lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/dlinear.png&#34; width=&#34;40%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 1&lt;/font&gt;: The linear function (left) and its derivative (right)
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;formulae&#34;&gt;Formulae&lt;/h3&gt;

&lt;div&gt;$$
f \left( x_{i} \right) = x_{i}, \ \ f^{\prime}\left( x_{i} \right) = 1
$$&lt;/div&gt;

&lt;h3 id=&#34;python-code&#34;&gt;Python Code&lt;/h3&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def linear(x, Derivative=False):
    if not Derivative:
        return x
    else:
        return np.ones_like(x)  # derivative is 1 everywhere; keeps the shape of x
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;why-is-it-used&#34;&gt;Why is it used?&lt;/h3&gt;

&lt;p&gt;If there&amp;rsquo;s a situation where we want a node to give its output without applying any thresholds, then the identity (or linear) function is the way to go.&lt;/p&gt;

&lt;p&gt;Hopefully you can see why it is used in the final output layer nodes as we only want these nodes to do the $ \text{input} \times \text{weight}$ operations before giving us its answer without any further modifications.&lt;/p&gt;

&lt;p&gt;&lt;font color=&#34;blue&#34;&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The linear function is not used in the hidden layers. We must use non-linear transfer functions in the hidden layer nodes, or else the network will only ever compute a linear function of its input (a composition of linear functions is itself linear).&lt;/p&gt;

&lt;p&gt;&lt;/font&gt;&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&#34;sigmoid&#34;&gt; The Sigmoid (or Fermi) Function &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&#34;what-does-it-look-like-1&#34;&gt;What does it look like?&lt;/h3&gt;

&lt;div  id=&#34;fig2&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/sigmoid.png&#34; width=&#34;40%&#34;&gt;&lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/dsigmoid.png&#34; width=&#34;40%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 2&lt;/font&gt;: The sigmoid function (left) and its derivative (right)
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;formulae-1&#34;&gt;Formulae&lt;/h3&gt;

&lt;div &gt;$$
f\left(x_{i} \right) = \frac{1}{1 + e^{  - x_{i}  }}, \ \
f^{\prime}\left( x_{i} \right) = f\left(x_{i}\right) \left( 1 -  f\left(x_{i}\right) \right)
$$&lt;/div&gt;

&lt;h3 id=&#34;python-code-1&#34;&gt;Python Code&lt;/h3&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def sigmoid(x,Derivative=False):
    if not Derivative:
        return 1 / (1 + np.exp(-x))
    else:
        out = sigmoid(x)
        return out * (1 - out)
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;why-is-it-used-1&#34;&gt;Why is it used?&lt;/h3&gt;

&lt;p&gt;This function maps the input to a value between 0 and 1 (but not equal to 0 or 1). This means the output from the node will be a high signal (if the input is positive) or a low one (if the input is negative). This function is often chosen as it is one of the easiest to hard-code in terms of its derivative. The simplicity of its derivative allows us to efficiently perform back propagation without using any fancy packages or approximations. The fact that this function is smooth, continuous (differentiable), monotonic and bounded means that back propagation will work well.&lt;/p&gt;

&lt;p&gt;The sigmoid&amp;rsquo;s natural threshold is 0.5, meaning that any input that maps to a value above 0.5 will be considered high (or 1) in binary terms.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&#34;tanh&#34;&gt; Hyperbolic Tangent Function ( $\tanh(x)$ ) &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&#34;what-does-it-look-like-2&#34;&gt;What does it look like?&lt;/h3&gt;

&lt;div  id=&#34;fig3&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/tanh.png&#34; width=&#34;40%&#34;&gt;&lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/dtanh.png&#34; width=&#34;40%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 3&lt;/font&gt;: The hyperbolic tangent function (left) and its derivative (right)
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;formulae-2&#34;&gt;Formulae&lt;/h3&gt;

&lt;div &gt;$$
f\left(x_{i} \right) = \tanh\left(x_{i}\right), \ \
f^{\prime}\left(x_{i} \right) = 1 - \tanh^{2}\left(x_{i}\right)
$$&lt;/div&gt;
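This section can get a helper in the same style as the other transfer functions, using the same `Derivative` flag convention as the sigmoid snippet; the function name `tanh_tf` is an assumption for this sketch:

```python
import numpy as np

def tanh_tf(x, Derivative=False):
    # same calling convention as the other transfer-function helpers
    if not Derivative:
        return np.tanh(x)
    else:
        return 1.0 - np.tanh(x) ** 2
```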

&lt;h3 id=&#34;why-is-it-used-2&#34;&gt;Why is it used?&lt;/h3&gt;

&lt;p&gt;This is a very similar function to the previous sigmoid function and has many of the same properties: even its derivative is straightforward to compute. However, this function allows us to map the input to any value between -1 and 1 (but not inclusive of those). In effect, this allows us to apply a penalty to the node (negative) rather than just have the node not fire at all. It also gives us a larger range of output to play with in the positive end of the scale, meaning finer adjustments can be made.&lt;/p&gt;

&lt;p&gt;This function has a natural threshold of 0, meaning that any input which maps to a value greater than 0 is considered high (or 1) in binary terms.&lt;/p&gt;

&lt;p&gt;Again, the fact that this function is smooth, continuous (differentiable), monotonic and bounded means that back propagation will work well. The subsequent functions don&amp;rsquo;t all have these properties, which makes them more difficult to use in back propagation (though it is done).
&lt;br&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&#34;what-s-the-difference-between-the-sigmoid-and-hyperbolic-tangent&#34;&gt;What&amp;rsquo;s the difference between the sigmoid and hyperbolic tangent?&lt;/h2&gt;

&lt;p&gt;They both achieve a similar mapping, are both continuous, smooth, monotonic and differentiable, but give out different values. For a sigmoid function, a large negative input generates an almost-zero output. This lack of output will affect all subsequent weights in the network, which may not be desirable - effectively stopping the next nodes from learning. In contrast, the $\tanh$ function gives outputs close to -1 for large negative inputs, maintaining the output of the node and allowing subsequent nodes to learn from it.&lt;/p&gt;
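A quick numerical check of this difference, for an arbitrary large negative input:

```python
import numpy as np

x = -5.0
sigmoid_out = 1.0 / (1.0 + np.exp(-x))  # close to 0: the node goes almost silent
tanh_out = np.tanh(x)                   # close to -1: still a strong (negative) signal
```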

&lt;hr /&gt;

&lt;h2 id=&#34;gaussian&#34;&gt; Gaussian Function &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&#34;what-does-it-look-like-3&#34;&gt;What does it look like?&lt;/h3&gt;

&lt;div  id=&#34;fig4&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/gaussian.png&#34; width=&#34;40%&#34;&gt;&lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/dgaussian.png&#34; width=&#34;40%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 4&lt;/font&gt;: The gaussian function (left) and its derivative (right)
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;formulae-3&#34;&gt;Formulae&lt;/h3&gt;

&lt;div &gt;$$
f\left( x_{i}\right ) = e^{ -x_{i}^{2}}, \ \
f^{\prime}\left( x_{i}\right ) = - 2x_{i} e^{ - x_{i}^{2}}
$$&lt;/div&gt;

&lt;h3 id=&#34;python-code-2&#34;&gt;Python Code&lt;/h3&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def gaussian(x, Derivative=False):
    if not Derivative:
        return np.exp(-x**2)
    else:
        return -2 * x * np.exp(-x**2)
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;why-is-it-used-3&#34;&gt;Why is it used?&lt;/h3&gt;

&lt;p&gt;The gaussian function is an even function, thus it gives the same output for positive and negative inputs of equal magnitude. It gives its maximal output when there is no input, and its output decreases with increasing distance from zero. We can perhaps imagine this function being used in a node where the input feature is less likely to contribute to the final result.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&#34;step&#34;&gt; Step (or Heaviside) Function &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&#34;what-does-it-look-like-4&#34;&gt;What does it look like?&lt;/h3&gt;

&lt;div  id=&#34;fig5&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/step.png&#34; width=&#34;40%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 5&lt;/font&gt;: The Heaviside function
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;formulae-4&#34;&gt;Formulae&lt;/h3&gt;

&lt;div&gt;$$
    f(x_{i})=
\begin{cases}
\begin{align}
    0  \ &amp;: \ x_{i} \leq T\\
    1 \ &amp;: \ x_{i} &gt; T\\
    \end{align}
\end{cases}
$$&lt;/div&gt;
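
&lt;h3 id=&#34;python-code-step&#34;&gt;Python Code&lt;/h3&gt;

&lt;p&gt;Following the pattern of the snippets for the other functions (with NumPy imported as &lt;code&gt;np&lt;/code&gt; as elsewhere; the threshold parameter &lt;code&gt;T&lt;/code&gt; and its default of 0 are our own choices), a sketch might be:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def step(x, T=0, Derivative=False):
    if not Derivative:
        # 1 where x exceeds the threshold T, 0 otherwise
        return (x &amp;gt; T).astype(float)
    else:
        # zero everywhere (undefined exactly at x = T)
        return np.zeros(x.shape)
&lt;/code&gt;&lt;/pre&gt;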

&lt;h3 id=&#34;why-is-it-used-4&#34;&gt;Why is it used?&lt;/h3&gt;

&lt;p&gt;Some cases call for a function which applies a hard threshold: the output is either exactly one value or exactly another. The other functions we&amp;rsquo;ve looked at have an intrinsically probabilistic output i.e. an output closer to 1 implies a greater probability of the node firing. The step function does away with this, opting for a definite high or low output depending on some threshold $T$ on the input.&lt;/p&gt;

&lt;p&gt;However, the step function is discontinuous and therefore non-differentiable at the threshold (its derivative is the Dirac delta function). Consequently, this function cannot be trained with back-propagation in practice.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&#34;ramp&#34;&gt; Ramp Function &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&#34;what-does-it-look-like-5&#34;&gt;What does it look like?&lt;/h3&gt;

&lt;div  id=&#34;fig6&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/ramp.png&#34; width=&#34;40%&#34;&gt;&lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/dramp.png&#34; width=&#34;40%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 6&lt;/font&gt;: The ramp function (left) and its derivative (right) with $T_{1}=-2$ and $T_{2}=3$.
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;formulae-5&#34;&gt;Formulae&lt;/h3&gt;

&lt;div&gt;$$
    f(x)= 
\begin{cases}
\begin{align}
    0 \ &amp;: \ x_{i} \leq T_{1}\\[0.5em]
    \frac{\left( x_{i} - T_{1} \right)}{\left( T_{2} - T_{1} \right)} \ &amp;: \ T_{1} \leq x_{i} \leq T_{2}\\[0.5em]
    1 \ &amp;: \ x_{i} &gt; T_{2}\\
    \end{align}
\end{cases}
$$&lt;/div&gt;

&lt;h3 id=&#34;python-code-3&#34;&gt;Python Code&lt;/h3&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def ramp(x, Derivative=False, T1=0, T2=None):
    if T2 is None:
        T2 = np.max(x)  # a default argument cannot reference x directly
    if not Derivative:
        out = (x - T1) / (T2 - T1)
        out[(x &amp;lt; T1)] = 0
        out[(x &amp;gt; T2)] = 1
        return out
    else:
        # gradient of the linear section is 1/(T2 - T1), zero elsewhere
        out = np.ones(x.shape) / (T2 - T1)
        out[((x &amp;lt; T1) | (x &amp;gt; T2))] = 0
        return out
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;why-is-it-used-5&#34;&gt;Why is it used?&lt;/h3&gt;

&lt;p&gt;The ramp function is a truncated version of the linear function. From its shape, it looks like a more definitive version of the sigmoid function: it maps a range of inputs to outputs over $(0 \ 1)$, but this time with definite cut-off points $T_{1}$ and $T_{2}$. This gives the function the ability to fire the node very definitively above a threshold while retaining some uncertainty in the lower region. It is uncommon to see a negative $T_{1}$ unless the ramp is distributed symmetrically about $0$.&lt;/p&gt;

&lt;h3 id=&#34;relu&#34;&gt; 6.1 Rectified Linear Unit (ReLU) &lt;/h3&gt;

&lt;p&gt;There is a popular special case of the ramp function, used in the powerful &lt;em&gt;convolutional neural network&lt;/em&gt; (CNN) architecture, called the &lt;em&gt;&lt;strong&gt;Re&lt;/strong&gt;ctified &lt;strong&gt;L&lt;/strong&gt;inear &lt;strong&gt;U&lt;/strong&gt;nit&lt;/em&gt; (ReLU). In a ReLU, $T_{1}=0$ and $T_{2}$ is the maximum of the input, giving a linear function with no negative values, as below:&lt;/p&gt;

&lt;div  id=&#34;fig7&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/relu.png&#34; width=&#34;40%&#34;&gt;&lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/drelu.png&#34; width=&#34;40%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 7&lt;/font&gt;: The Rectified Linear Unit (ReLU) (left) with its derivative (right).
        &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;and in Python:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def relu(x, Derivative=False):
    if not Derivative:
        return np.maximum(0, x)  # identity for positive x, zero otherwise
    else:
        out = np.ones(x.shape)   # gradient is 1 where x is positive
        out[(x &amp;lt; 0)] = 0      # and 0 where x is negative
        return out
&lt;/code&gt;&lt;/pre&gt;</description>
    </item>
    
    <item>
      <title>A Simple Neural Network - Mathematics</title>
      <link>/post/neuralnetwork/</link>
      <pubDate>Mon, 06 Mar 2017 17:04:53 +0000</pubDate>
      
      <guid>/post/neuralnetwork/</guid>
      <description>&lt;p&gt;This is the first part of a series of tutorials on Simple Neural Networks (NN). Tutorials on neural networks (NN) can be found all over the internet. Though many of them cover the same material, each is written (or recorded) slightly differently. This means that I always feel like I learn something new or get a better understanding of things with every tutorial I see. I&amp;rsquo;d like to make this tutorial as clear as I can, so sometimes the maths may be simplistic, but hopefully it&amp;rsquo;ll give you a good understanding of what&amp;rsquo;s going on.  &lt;strong&gt;Please&lt;/strong&gt; let me know if any of the notation is incorrect or there are any mistakes - either comment or use the contact page on the left.&lt;/p&gt;

&lt;div id=&#34;toctop&#34;&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#nnarchitecture&#34;&gt;Neural Network Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#transferFunction&#34;&gt;Transfer Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#feedforward&#34;&gt;Feed-forward&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#error&#34;&gt;Error&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#backPropagationGrads&#34;&gt;Back Propagation - the Gradients&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bias&#34;&gt;Bias&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#backPropagationAlgorithm&#34;&gt;Back Propagaton - the Algorithm&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;nnarchitecture&#34;&gt;1. Neural Network Architecture &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By now, you may well have come across diagrams which look very similar to the one below. It shows input nodes connected to output nodes via intermediate nodes in what is called a &amp;lsquo;hidden layer&amp;rsquo; - &amp;lsquo;hidden&amp;rsquo; because, when using a NN, only the input and output concern the user; the &amp;lsquo;under-the-hood&amp;rsquo; workings may not be interesting to them. In real, high-performing NNs there are usually more hidden layers.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; width=40% src=&#34;/img/simpleNN/simpleNN.png&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 1&lt;/font&gt;: A simple 2-layer NN with 2 features in the input layer, 3 nodes in the hidden layer and two nodes in the output layer.
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;When we train our network, the nodes in the hidden layer each perform a calculation using the values from the input nodes. The output of this is passed on to the nodes of the next layer. When the output hits the final layer, the &amp;lsquo;output layer&amp;rsquo;, the results are compared to the real, known outputs and some tweaking of the network is done to make the output more similar to the real results. This is done with an algorithm called &lt;em&gt;back propagation&lt;/em&gt;. Before we get there, lets take a closer look at these calculations being done by the nodes.&lt;/p&gt;

&lt;h2 id=&#34;transferFunction&#34;&gt;2. Transfer Function &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At each node in the hidden and output layers of the NN, an &lt;em&gt;activation&lt;/em&gt; or &lt;em&gt;transfer&lt;/em&gt; function is executed. This function takes in the outputs of the previous layer&amp;rsquo;s nodes, each multiplied by some &lt;em&gt;weight&lt;/em&gt;. These weights are the lines which connect the nodes in the diagram. The weights that come out of one node can all be different, so that node will &lt;em&gt;activate&lt;/em&gt; different neurons by different amounts. The transfer function can take many forms; we will first look at the &lt;em&gt;sigmoid&lt;/em&gt; transfer function as it seems traditional.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;The sigmoid function&#34; width=50% src=&#34;/img/simpleNN/sigmoid.png&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 2&lt;/font&gt;: The sigmoid function.
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;As you can see from the figure, the sigmoid function takes any real-valued input and maps it to a real number in the range $(0 \ 1)$ - i.e. between, but not equal to, 0 and 1. We can think of this almost like saying &amp;lsquo;if the value we have maps to an output near 1, this node fires, if it maps to an output near 0, the node does not fire&amp;rsquo;. The equation for this sigmoid function is:&lt;/p&gt;

&lt;div id=&#34;eqsigmoidFunction&#34;&gt;$$
\sigma ( x ) = \frac{1}{1 + e^{-x}}
$$&lt;/div&gt;

&lt;p&gt;We need the derivative of this transfer function so that we can perform back propagation later on. This is the process whereby the connections in the network are updated to tune the performance of the NN. We&amp;rsquo;ll talk about this in more detail later, but let&amp;rsquo;s find the derivative now.&lt;/p&gt;

&lt;div&gt;
$$
\begin{align*}
\frac{d}{dx}\sigma ( x ) &amp;= \frac{d}{dx} \left( 1 + e^{ -x }\right)^{-1}\\
&amp;=  -1 \times -e^{-x} \times \left(1 + e^{-x}\right)^{-2}= \frac{ e^{-x} }{ \left(1 + e^{-x}\right)^{2} } \\
&amp;= \frac{\left(1 + e^{-x}\right) - 1}{\left(1 + e^{-x}\right)^{2}} 
= \frac{\left(1 + e^{-x}\right) }{\left(1 + e^{-x}\right)^{2}} - \frac{1}{\left(1 + e^{-x}\right)^{2}} 
= \frac{1}{\left(1 + e^{-x}\right)} - \left( \frac{1}{\left(1 + e^{-x}\right)} \right)^{2} \\[0.5em]
&amp;= \sigma ( x ) - \sigma ( x ) ^ {2}
\end{align*}
$$&lt;/div&gt;

&lt;p&gt;Therefore, we can write the derivative of the sigmoid function as:&lt;/p&gt;

&lt;div id=&#34;eqdsigmoid&#34;&gt;$$
\sigma^{\prime}( x ) = \sigma (x ) \left( 1 - \sigma ( x ) \right)
$$&lt;/div&gt;

&lt;p&gt;The sigmoid function has the nice property that its derivative is very simple: a bonus when we want to hard-code this into our NN later on. Now that we have our activation or transfer function selected, what do we do with it?&lt;/p&gt;
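
&lt;p&gt;As a minimal sketch of hard-coding it (assuming NumPy; the function name and signature are our own), the sigmoid and its derivative might look like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def sigmoid(x, Derivative=False):
    s = 1 / (1 + np.exp(-x))  # sigma(x)
    if not Derivative:
        return s
    else:
        return s * (1 - s)    # sigma(x)(1 - sigma(x))
&lt;/code&gt;&lt;/pre&gt;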

&lt;h2 id=&#34;feedforward&#34;&gt;3. Feed-forward &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During a feed-forward pass, the network takes in the input values and gives us some output values. To see how this is done, let&amp;rsquo;s first consider a 2-layer neural network like the one in Figure 1. Here we are going to refer to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$i$ - the $i^{\text{th}}$ node of the input layer $I$&lt;/li&gt;
&lt;li&gt;$j$ - the $j^{\text{th}}$ node of the hidden layer $J$&lt;/li&gt;
&lt;li&gt;$k$ - the $k^{\text{th}}$ node of the output layer $K$&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The activation function at a node $j$ in the hidden layer takes the value:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
x_{j} &amp;= \xi_{1} w_{1j} + \xi_{2} w_{2j} \\[0.5em]
&amp;= \sum_{i \in I} \xi_{i} w_{i j}

\end{align}
$$&lt;/div&gt;

&lt;p&gt;where $\xi_{i}$ is the value of the $i^{\text{th}}$ input node and $w_{i j}$ is the weight of the connection between the $i^{\text{th}}$ input node and the $j^{\text{th}}$ hidden node. &lt;strong&gt;In short:&lt;/strong&gt; at each hidden-layer node, multiply each input value by the weight of the connection arriving at that node and add the results together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; the weights are initialised when the network is set up. Sometimes they are all set to 1, but often they&amp;rsquo;re set to some small random value.&lt;/p&gt;

&lt;p&gt;We apply the activation function on $x_{j}$ at the $j^{\text{th}}$ hidden node and get:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\mathcal{O}_{j} &amp;= \sigma(x_{j}) \\
&amp;= \sigma(  \xi_{1} w_{1j} + \xi_{2} w_{2j})
\end{align}
$$&lt;/div&gt;

&lt;p&gt;$\mathcal{O}_{j}$ is the output of the $j^{\text{th}}$ hidden node. This is calculated for each of the $j$ nodes in the hidden layer. The resulting outputs now become the input for the next layer in the network. In our case, this is the final output layer. So for each of the $k$ nodes in $K$:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\mathcal{O}_{k} &amp;= \sigma(x_{k}) \\
&amp;= \sigma \left( \sum_{j \in J}  \mathcal{O}_{j} w_{jk}  \right)
\end{align}
$$&lt;/div&gt;

&lt;p&gt;As we&amp;rsquo;ve reached the end of the network, this is also the end of the feed-forward pass. So how well did our network do at getting the correct result $\mathcal{O}_{k}$? As this is the training phase of our network, the true results are known and we can calculate the error.&lt;/p&gt;
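
&lt;p&gt;As a hedged sketch of the pass above (the input values and small random weights are illustrative assumptions), a feed-forward through the 2-3-2 network of Figure 1 might look like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

np.random.seed(0)                  # fixed seed so the example is repeatable
xi = np.array([0.5, 0.9])          # input features, one per node in I
W_ij = np.random.rand(2, 3) * 0.1  # weights between layers I and J
W_jk = np.random.rand(3, 2) * 0.1  # weights between layers J and K

O_j = sigmoid(xi @ W_ij)           # outputs of the 3 hidden nodes
O_k = sigmoid(O_j @ W_jk)          # outputs of the 2 output nodes
&lt;/code&gt;&lt;/pre&gt;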

&lt;h2 id=&#34;error&#34;&gt;4. Error &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We measure error at the end of each foward pass. This allows us to quantify how well our network has performed in getting the correct output. Let&amp;rsquo;s define $t_{k}$ as the expected or &lt;em&gt;target&lt;/em&gt; value of the $k^{\text{th}}$ node of the output layer $K$. Then the error $E$ on the entire output is:&lt;/p&gt;

&lt;div id=&#34;eqerror&#34;&gt;$$
\text{E} = \frac{1}{2} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right)^{2}
$$&lt;/div&gt;

&lt;p&gt;Don&amp;rsquo;t be put off by the seemingly random &lt;sup&gt;1&lt;/sup&gt;&amp;frasl;&lt;sub&gt;2&lt;/sub&gt; in front there; it&amp;rsquo;s been manufactured that way to make the upcoming maths easier. The rest of this should be easy enough: take the residual (the difference between the target and output values), square it to get rid of any negatives and sum this over all of the nodes in the output layer.&lt;/p&gt;
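
&lt;p&gt;In code this is a one-liner; the output and target values below are arbitrary examples:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

O_k = np.array([0.62, 0.58])  # example network outputs (arbitrary)
t_k = np.array([0.0, 1.0])    # example target values (arbitrary)

E = 0.5 * np.sum((O_k - t_k) ** 2)  # 0.2804
&lt;/code&gt;&lt;/pre&gt;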

&lt;p&gt;Good! Now how does this help us? Our aim here is to find a way to tune our network such that when we do a forward pass of the input data, the output is exactly what we know it should be. But we can&amp;rsquo;t change the input data, so there are only two other things we can change:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the weights going into the activation function&lt;/li&gt;
&lt;li&gt;the activation function itself&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We will indeed consider the second case in another post, but the magic of NN is all about the &lt;em&gt;weights&lt;/em&gt;. Getting each weight i.e. each connection between nodes, to be just the perfect value is what back propagation is all about. We will look at the back propagation algorithm in the next section, but let&amp;rsquo;s set it up by considering the following: how much of this error $E$ has come from each of the weights in the network?&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;re asking: what proportion of the error comes from each of the $W_{jk}$ connections between the nodes in layer $J$ and the output layer $K$? Or in mathematical terms:&lt;/p&gt;

&lt;div&gt;$$
\frac{\partial{\text{E}}}{\partial{W_{jk}}} =  \frac{\partial{}}{\partial{W_{jk}}}  \frac{1}{2} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right)^{2}
$$&lt;/div&gt;

&lt;p&gt;If you&amp;rsquo;re not concerned with working out the derivative, skip this highlighted section.&lt;/p&gt;

&lt;div class=&#34;highlight_section&#34;&gt;

To tackle this we can use the following bits of knowledge: the derivative of the sum is equal to the sum of the derivatives i.e. we can move the derivative term inside of the summation:

&lt;div&gt;$$ \frac{\partial{\text{E}}}{\partial{W_{jk}}} =  \frac{1}{2} \sum_{k \in K} \frac{\partial{}}{\partial{W_{jk}}} \left( \mathcal{O}_{k} - t_{k} \right)^{2}$$&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;the weight $w_{1k}$ does not affect connection $w_{2k}$ therefore the change in $W_{jk}$ with respect to any node other than the current $k$ is zero. Thus the summation goes away:&lt;/li&gt;
&lt;/ul&gt;

&lt;div&gt;$$ \frac{\partial{\text{E}}}{\partial{W_{jk}}} =  \frac{1}{2} \frac{\partial{}}{\partial{W_{jk}}}  \left( \mathcal{O}_{k} - t_{k} \right)^{2}$$&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;apply the power rule knowing that $t_{k}$ is a constant:&lt;/li&gt;
&lt;/ul&gt;

&lt;div&gt;$$ 
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{jk}}} &amp;=  \frac{1}{2} \times 2 \times \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{jk}}}  \left( \mathcal{O}_{k}\right) \\
 &amp;=  \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{jk}}}  \left( \mathcal{O}_{k}\right)
\end{align}
$$&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;the leftover derivative is the change in the output values with respect to the weights. Substituting $ \mathcal{O}_{k} = \sigma(x_{k}) $ and the sigmoid derivative $\sigma^{\prime}( x ) = \sigma (x ) \left( 1 - \sigma ( x ) \right)$:&lt;/li&gt;
&lt;/ul&gt;

&lt;div&gt;$$ 
\frac{\partial{\text{E}}}{\partial{W_{jk}}} =  \left( \mathcal{O}_{k} - t_{k} \right) \sigma (x_{k} ) \left( 1 - \sigma ( x_{k} ) \right) \frac{\partial{}}{\partial{W_{jk}}}  \left( x_{k}\right)
$$&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;the final derivative, the input value $x_{k}$ is just $\mathcal{O}_{j} W_{jk}$ i.e. output of the previous layer times the weight to this layer. So the change in  $\mathcal{O}_{j} w_{jk}$ with respect to $w_{jk}$ just gives us the output value of the previous layer $ \mathcal{O}_{j} $ and so the full derivative becomes:&lt;/li&gt;
&lt;/ul&gt;

&lt;div&gt;$$ 
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{jk}}}  &amp;=  \left( \mathcal{O}_{k} - t_{k} \right) \sigma (x_{k} ) \left( 1 - \sigma ( x_{k} ) \right) \frac{\partial{}}{\partial{W_{jk}}}  \left( \mathcal{O}_{j} W_{jk} \right) \\[0.5em]
&amp;=\left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k}  \left( 1 - \mathcal{O}_{k}  \right) \mathcal{O}_{j} 
\end{align}
$$&lt;/div&gt;

&lt;p&gt;We can replace the sigmoid function and its derivative with the output of the layer, since $\sigma(x_{k}) = \mathcal{O}_{k}$.&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;The derivative of the error function with respect to the weights is then:&lt;/p&gt;

&lt;div id=&#34;derror&#34;&gt;$$ 
\frac{\partial{\text{E}}}{\partial{W_{jk}}}  =\left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k}  \left( 1 - \mathcal{O}_{k}  \right) \mathcal{O}_{j}
$$&lt;/div&gt;

&lt;p&gt;We group the terms involving $k$ and define:&lt;/p&gt;

&lt;div&gt;$$
\delta_{k} = \mathcal{O}_{k}  \left( 1 - \mathcal{O}_{k}  \right)  \left( \mathcal{O}_{k} - t_{k} \right)
$$&lt;/div&gt;

&lt;p&gt;And therefore:&lt;/p&gt;

&lt;div id=&#34;derrorjk&#34;&gt;$$ 
\frac{\partial{\text{E}}}{\partial{W_{jk}}}  = \mathcal{O}_{j} \delta_{k} 
$$&lt;/div&gt;

&lt;p&gt;So we have an expression for the amount of error, called &amp;lsquo;delta&amp;rsquo; ($\delta_{k}$), on the weights from the nodes in $J$ to each node $k$ in $K$. But how does this help us to improve our network? We need to back propagate the error.&lt;/p&gt;

&lt;h2 id=&#34;backPropagationGrads&#34;&gt;5. Back Propagation - the gradients&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Back propagation takes the error function we found in the previous section, uses it to calculate the error on the current layer and updates the weights to that layer by some amount.&lt;/p&gt;

&lt;p&gt;So far we&amp;rsquo;ve only looked at the error on the output layer - what about the hidden layer? This also has an error, but the error here depends on the output layer&amp;rsquo;s error too (because this is where the difference between the target $t_{k}$ and output $\mathcal{O}_{k}$ can be calculated). Let&amp;rsquo;s have a look at the error on the weights of the hidden layer $W_{ij}$:&lt;/p&gt;

&lt;div&gt;$$ \frac{\partial{\text{E}}}{\partial{W_{ij}}} =  \frac{\partial{}}{\partial{W_{ij}}}  \frac{1}{2} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right)^{2}$$&lt;/div&gt;

&lt;p&gt;Now, unlike before, we cannot just drop the summation, as the derivative is not directly acting on a subscript $k$ in the summation. We should be careful to note that the output from every node in $J$ is actually connected to each of the nodes in $K$, so the summation must stay. But we can still use the same tricks as before: let&amp;rsquo;s move the derivative inside the summation (because the summation is finite) and apply the power rule again:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &amp;=  \frac{1}{2} \times 2 \times \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{ij}}} \mathcal{O}_{k} \\
&amp;= \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{ij}}} \mathcal{O}_{k}
 \end{align}
 $$&lt;/div&gt;
 

&lt;p&gt;Again, we substitute $\mathcal{O}_{k} = \sigma( x_{k})$ and its derivative and revert back to our output notation:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &amp;= \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{ij}}} (\sigma(x_{k}) )\\
&amp;= \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \sigma(x_{k}) \left( 1 - \sigma(x_{k}) \right) \frac{\partial{}}{\partial{W_{ij}}} (x_{k}) \\
&amp;= \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) \frac{\partial{}}{\partial{W_{ij}}} (x_{k})
 \end{align}
 $$&lt;/div&gt;
 

&lt;p&gt;This still looks familiar from the output layer derivative, but now we&amp;rsquo;re struggling with the derivative of the input to $k$ i.e. $x_{k}$ with respect to the weights from $I$ to $J$. Let&amp;rsquo;s use the chain rule to break apart this derivative in terms of the output from $J$:&lt;/p&gt;

&lt;div&gt; $$
\frac{\partial{ x_{k}}}{\partial{W_{ij}}} = \frac{\partial{ x_{k}}}{\partial{\mathcal{O}_{j}}}\frac{\partial{\mathcal{O}_{j}}}{\partial{W_{ij}}}
$$&lt;/div&gt;

&lt;p&gt;The change of the input to the $k^{\text{th}}$ node with respect to the output from the $j^{\text{th}}$ node is down to a product with the weights, therefore this derivative just becomes the weights $W_{jk}$. The final derivative has nothing to do with the subscript $k$ anymore, so we&amp;rsquo;re free to move this around - lets put it at the beginning:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &amp;= \frac{\partial{\mathcal{O}_{j}}}{\partial{W_{ij}}}  \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk}
 \end{align}
 $$&lt;/div&gt;
 

&lt;p&gt;Lets finish the derivatives, remembering that the output of the node $j$ is just $\mathcal{O}_{j} = \sigma(x_{j}) $ and we know the derivative of this function too:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &amp;= \frac{\partial{}}{\partial{W_{ij}}}\sigma(x_{j})  \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk} \\
&amp;= \sigma(x_{j}) \left( 1 - \sigma(x_{j}) \right)  \frac{\partial{x_{j} }}{\partial{W_{ij}}} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk} \\
&amp;= \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right)  \frac{\partial{x_{j} }}{\partial{W_{ij}}} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk}
 \end{align}
 $$&lt;/div&gt;
 

&lt;p&gt;The final derivative is straightforward too: the derivative of the input to $j$ with respect to the weights is just the previous layer&amp;rsquo;s output, which in our case is $\mathcal{O}_{i}$,&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &amp;= \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right)  \mathcal{O}_{i} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk}
 \end{align}
 $$&lt;/div&gt;
 

&lt;p&gt;Almost there! Recall that we defined $\delta_{k}$ earlier, lets sub that in:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &amp;= \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right)  \mathcal{O}_{i} \sum_{k \in K} \delta_{k} W_{jk}
 \end{align}
 $$&lt;/div&gt;
 

&lt;p&gt;To clean this up, we now define the &amp;lsquo;delta&amp;rsquo; for our hidden layer:&lt;/p&gt;

&lt;div&gt;$$
\delta_{j} = \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right)   \sum_{k \in K} \delta_{k} W_{jk}
$$&lt;/div&gt;

&lt;p&gt;Thus, the amount of error on each of the weights going into our hidden layer:&lt;/p&gt;

&lt;div id=&#34;derrorij&#34;&gt;$$ 
\frac{\partial{\text{E}}}{\partial{W_{ij}}}  = \mathcal{O}_{i} \delta_{j} 
$$&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; the reason for the name &lt;em&gt;back&lt;/em&gt; propagation is that we must calculate the errors at the far end of the network and work backwards to be able to calculate the weights at the front.&lt;/p&gt;

&lt;h2 id=&#34;bias&#34;&gt;6.  Bias &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lets remind ourselves what happens inside our hidden layer nodes:&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
    &lt;img title=&#34;Simple NN&#34;  width=50% src=&#34;/img/simpleNN/nodeInsideNoBias.png&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 3&lt;/font&gt;: The insides of a hidden layer node, $j$.
    &lt;/div&gt;
&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;Each feature $\xi_{i}$ from the input layer $I$ is multiplied by some weight $w_{ij}$&lt;/li&gt;
&lt;li&gt;These are added together to get $x_{j}$, the total weighted input from the nodes in $I$&lt;/li&gt;
&lt;li&gt;$x_{j}$ is passed through the activation, or transfer, function $\sigma(x_{j})$&lt;/li&gt;
&lt;li&gt;This gives the output $\mathcal{O}_{j}$ for each of the $j$ nodes in hidden layer $J$&lt;/li&gt;
&lt;li&gt;$\mathcal{O}_{j}$ from each of the $J$ nodes becomes $\xi_{j}$ for the next layer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When we talk about the &lt;em&gt;bias&lt;/em&gt; term in NN, we are talking about an additional parameter that is included in the summation of step 2 above. The bias term is usually denoted with the symbol $\theta$ (theta). Its function is to act as a threshold for the activation (transfer) function. It is given the value of 1 and is not connected to anything else. As such, any derivative of the node&amp;rsquo;s output with respect to the bias term would just give a constant, 1. This allows us to think of the bias term as an output from the node with the value of 1. It will be updated later during back propagation to change the threshold at which the node fires.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s update the equation for $x_{j}$:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
x_{j} &amp;= \xi_{1} w_{1j} + \xi_{2} w_{2j} + \theta_{j} \\[0.5em]
\sigma( x_{j} ) &amp;= \sigma \left( \sum_{i \in I} \left( \xi_{i} w_{ij} \right) + \theta_{j} \right)
\end{align}
$$&lt;/div&gt;

&lt;p&gt;and put it on the diagram:&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
    &lt;img title=&#34;Simple NN&#34;  width=50% src=&#34;/img/simpleNN/nodeInside.png&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 4&lt;/font&gt;: The insides of a hidden layer node, $j$, with the bias term.
    &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&#34;backPropagationAlgorithm&#34;&gt;7. Back Propagation - the algorithm&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we have all of the pieces! We&amp;rsquo;ve got the initial outputs after our feed-forward, we have the equations for the delta terms (the amount by which the error depends on the different weights) and we know we need to update our bias term too. So what does the algorithm look like?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Input the data into the network and feed-forward&lt;/li&gt;

&lt;li&gt;&lt;p&gt;For each of the &lt;em&gt;output&lt;/em&gt; nodes calculate:&lt;/p&gt;

&lt;div&gt;$$
\delta_{k} = \mathcal{O}_{k}  \left( 1 - \mathcal{O}_{k}  \right)  \left( \mathcal{O}_{k} - t_{k} \right)
$$&lt;/div&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;For each of the &lt;em&gt;hidden layer&lt;/em&gt; nodes calculate:&lt;/p&gt;

&lt;div&gt;$$
\delta_{j} = \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right)   \sum_{k \in K} \delta_{k} W_{jk}
$$&lt;/div&gt;
    &lt;/li&gt;

&lt;li&gt;&lt;p&gt;Calculate the changes that need to be made to the weights and bias terms:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\Delta W &amp;= -\eta \ \delta_{l} \ \mathcal{O}_{l-1} \\
\Delta\theta &amp;= -\eta \ \delta_{l}
\end{align}
$$&lt;/div&gt;
    &lt;/li&gt;

&lt;li&gt;&lt;p&gt;Update the weights and biases across the network:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
W + \Delta W &amp;\rightarrow W \\
\theta + \Delta\theta &amp;\rightarrow \theta
\end{align}
$$&lt;/div&gt;
    &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here, $\eta$ is the &lt;em&gt;learning rate&lt;/em&gt;: just a small number that limits the size of the deltas we compute - we don&amp;rsquo;t want the network jumping around everywhere. The $l$ subscript denotes the deltas and output for a given layer $l$. That is, we compute the delta for each of the nodes in a layer and vectorise them. Thus we can combine them with the output values of the previous layer and get our update $\Delta W$ for the weights of the current layer. Similarly with the bias term.&lt;/p&gt;

&lt;p&gt;This algorithm is looped over and over until the error between the output and the target values is below some set threshold. Depending on the size of the network i.e. the number of layers and number of nodes per layer, it can take a long time to complete one &amp;lsquo;epoch&amp;rsquo; or run through of this algorithm.&lt;/p&gt;
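
&lt;p&gt;As a hedged sketch of one pass of this algorithm for the 2-3-2 network of Figure 1 (the inputs, targets, learning rate and small random weights below are all illustrative assumptions, and the bias terms are initialised to zero):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

np.random.seed(0)                  # fixed seed so the example is repeatable
eta = 0.5                          # learning rate (arbitrary)
xi = np.array([0.5, 0.9])          # input features
t = np.array([0.0, 1.0])           # target outputs
W_ij = np.random.rand(2, 3) * 0.1  # weights from layer I to layer J
W_jk = np.random.rand(3, 2) * 0.1  # weights from layer J to layer K
theta_j = np.zeros(3)              # hidden-layer biases
theta_k = np.zeros(2)              # output-layer biases

# 1. feed-forward
O_j = sigmoid(xi @ W_ij + theta_j)
O_k = sigmoid(O_j @ W_jk + theta_k)

# 2. deltas for the output nodes
delta_k = O_k * (1 - O_k) * (O_k - t)

# 3. deltas for the hidden nodes
delta_j = O_j * (1 - O_j) * (W_jk @ delta_k)

# 4. changes to the weights and biases
dW_jk = -eta * np.outer(O_j, delta_k)
dW_ij = -eta * np.outer(xi, delta_j)

# 5. update the weights and biases across the network
W_jk += dW_jk
W_ij += dW_ij
theta_k += -eta * delta_k
theta_j += -eta * delta_j
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Running another feed-forward pass after the update should give a slightly smaller error $E$; looping this drives the error down over many epochs.&lt;/p&gt;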

&lt;p&gt;&lt;em&gt;Some of the ideas and notation in this tutorial comes from the good videos by &lt;a href=&#34;https://www.youtube.com/playlist?list=PL29C61214F2146796&#34; title=&#34; NN Videos&#34;&gt;Ryan Harris&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Web Design Wisdom</title>
      <link>/post/webdesign/</link>
      <pubDate>Sat, 04 Mar 2017 17:21:15 +0000</pubDate>
      
      <guid>/post/webdesign/</guid>
      <description>&lt;p&gt;So I&amp;rsquo;m quite a way into getting MLNotebook set up and I&amp;rsquo;ve been learning a hell of a lot about web design using Hugo (a static site generator). There are a few things around the internet that could be explained more clearly, or where more examples could be given, so hopefully that&amp;rsquo;s what I can do for you here!
&lt;/p&gt;

&lt;p&gt;I thought I&amp;rsquo;d give an overview of some of the wisdom I&amp;rsquo;ve gained from creating MLNotebook - my adventures in markdown&amp;hellip; and the rest!&lt;/p&gt;

&lt;h2 id=&#34;hugo&#34;&gt; Hugo &lt;/h2&gt;

&lt;h3 id=&#34;hugoSetup&#34;&gt; Setup &lt;/h3&gt;

&lt;p&gt;Hugo was relatively easy to set up, but I think some of the guides around could be a lot clearer, particularly when it comes to hosting on GitHub Pages. Firstly, make sure that you download Hugo &lt;a href=&#34;https://github.com/spf13/hugo/releases&#34; title=&#34;Hugo Github&#34;&gt;here&lt;/a&gt; and extract it to &lt;code&gt;/usr/local/bin&lt;/code&gt;. I renamed mine to &amp;ldquo;hugo&amp;rdquo;. Check whether it&amp;rsquo;s properly installed with the command:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ hugo version
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will print the version number. If it doesn&amp;rsquo;t, add &lt;code&gt;/usr/local/bin&lt;/code&gt; to your system path:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ PATH=$PATH:/usr/local/bin
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Creating a new site called &amp;ldquo;newsite&amp;rdquo; from scratch is the easy bit:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ hugo new site ./newsite
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;themeAndOverrides&#34;&gt;Theme and overrides &lt;/h3&gt;

&lt;p&gt;To get my theme to work, I simply cloned the repository (as shown &lt;a href=&#34;https://themes.gohugo.io/blackburn/&#34; title=&#34;Blackburn theme&#34;&gt;here&lt;/a&gt;) directly into &lt;code&gt;./newsite/themes/blackburn&lt;/code&gt;. Be sure to copy the &lt;code&gt;config.toml&lt;/code&gt; file to &lt;code&gt;./newsite&lt;/code&gt;. That&amp;rsquo;s all there is to it!&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ mkdir themes
$ cd themes
$ git clone https://github.com/yoshiharuyamashita/blackburn.git
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Customising this theme was really easy as it is mostly done in &lt;code&gt;config.toml&lt;/code&gt;. What I wish I&amp;rsquo;d known about Hugo straight off the bat is that the tree structure is important: anything in the &amp;ldquo;themes&amp;rdquo; folder is a fall-back for anything that &lt;strong&gt;isn&amp;rsquo;t&lt;/strong&gt; present in the root folder of the site. That means if you have your own template for a post in &lt;code&gt;./newsite/layouts/single.html&lt;/code&gt;, it will be used instead of the theme&amp;rsquo;s copy in &lt;code&gt;./newsite/themes/blackburn/layouts/single.html&lt;/code&gt;. Thus if you want to edit a layout, copy the theme&amp;rsquo;s version into your site&amp;rsquo;s layouts folder and edit it from there.&lt;/p&gt;
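&lt;p&gt;If it helps to see the fall-back rule in action, here is a tiny sketch using plain files (everything happens in a throwaway temp directory; the paths just mimic the layout above):&lt;/p&gt;

```shell
# Mimic Hugo's lookup order with dummy files: a template in the site's
# own layouts/ folder shadows the same file under themes/.
cd "$(mktemp -d)"
mkdir -p newsite/themes/blackburn/layouts newsite/layouts
echo "theme version" > newsite/themes/blackburn/layouts/single.html
echo "my version"    > newsite/layouts/single.html
# Hugo would pick the site's copy, not the theme's:
cat newsite/layouts/single.html
```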

&lt;p&gt;The index page is the same deal: just copy it to your site&amp;rsquo;s root and it will take precedence over the default theme&amp;rsquo;s one.&lt;/p&gt;

&lt;h3 id=&#34;partials&#34;&gt;Partials&lt;/h3&gt;

&lt;p&gt;The partials bit can be a little confusing if you&amp;rsquo;re not too familiar with how the site is put together. Effectively, the page you&amp;rsquo;re looking at right now is made up of lots of different parts (partials) that have been edited separately, put through a parser, turned into HTML and pasted together into a single HTML page. The head and footer don&amp;rsquo;t have much in them, but they are important for adding calls to Javascript as they are stitched into each and every page on the website. Don&amp;rsquo;t confuse the head.html and header.html files: the latter is the actual title/banner at the top of the homepage (it is another partial that is stitched into index.html).&lt;/p&gt;

&lt;h3 id=&#34;socialMediaButtons&#34;&gt;Social Media Buttons&lt;/h3&gt;

&lt;p&gt;I spent a while trying to figure out how to get my social media buttons to actually take the url of the page they were on and share that exact post. I tried a hosted service which gave me a script that pulled down the buttons from them and allowed me to edit them via their interface, but it wasn&amp;rsquo;t content-specific. To dynamically get the url and get some nice-looking icons, I actually used the site &lt;a href=&#34;https://simplesharingbuttons.com/&#34; title=&#34;Simple Sharing Buttons&#34;&gt;Simple Sharing Buttons&lt;/a&gt;, chose the sites I wanted, and they provided the icons along with the HTML. In comparison to other sites and methods, this seems to work the best (except for the reddit one, really).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-html&#34;&gt;&amp;lt;ul class=&amp;quot;share-buttons&amp;quot;&amp;gt;
  &amp;lt;li&amp;gt;&amp;lt;a href=&amp;quot;https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Fmlnotebook.github.io&amp;amp;t=&amp;quot; title=&amp;quot;Share on Facebook&amp;quot; target=&amp;quot;_blank&amp;quot; onclick=&amp;quot;window.open(&#39;https://www.facebook.com/sharer/sharer.php?u=&#39; + encodeURIComponent(document.URL) + &#39;&amp;amp;t=&#39; + encodeURIComponent(document.URL),&#39;&#39;,&#39;width=500,height=300&#39;); return false;&amp;quot;&amp;gt;&amp;lt;img alt=&amp;quot;Share on facebook&amp;quot; src=&amp;quot;/img/facebook.png&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;&amp;lt;a href=&amp;quot;https://twitter.com/intent/tweet?source=https%3A%2F%2Fmlnotebook.github.io&amp;amp;text=:%20https%3A%2F%2Fmlnotebook.github.io&amp;amp;via=mlnotebook&amp;quot; target=&amp;quot;_blank&amp;quot; title=&amp;quot;Tweet&amp;quot; onclick=&amp;quot;window.open(&#39;https://twitter.com/intent/tweet?text=&#39; + encodeURIComponent(document.title) + &#39;:%20&#39;  + encodeURIComponent(document.URL),&#39;&#39;,&#39;width=500,height=300&#39;); return false;&amp;quot;&amp;gt;&amp;lt;img alt=&amp;quot;Tweet&amp;quot; src=&amp;quot;/img/twitter.png&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;&amp;lt;a href=&amp;quot;http://www.reddit.com/submit?url=https%3A%2F%2Fmlnotebook.github.io&amp;amp;title=&amp;quot; target=&amp;quot;_blank&amp;quot; title=&amp;quot;Submit to Reddit&amp;quot; onclick=&amp;quot;window.open(&#39;http://www.reddit.com/submit?url=&#39; + encodeURIComponent(document.URL) + &#39;&amp;amp;title=&#39; +  encodeURIComponent(document.title),&#39;&#39;,&#39;width=500,height=300&#39;); return false;&amp;quot;&amp;gt;&amp;lt;img alt=&amp;quot;Submit to Reddit&amp;quot; src=&amp;quot;/img/reddit.png&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;&amp;lt;a href=&amp;quot;http://www.linkedin.com/shareArticle?mini=true&amp;amp;url=https%3A%2F%2Fmlnotebook.github.io&amp;amp;title=&amp;amp;summary=&amp;amp;source=https%3A%2F%2Fmlnotebook.github.io&amp;quot; target=&amp;quot;_blank&amp;quot; title=&amp;quot;Share on LinkedIn&amp;quot; onclick=&amp;quot;window.open(&#39;http://www.linkedin.com/shareArticle?mini=true&amp;amp;url=&#39; + encodeURIComponent(document.URL) + &#39;&amp;amp;title=&#39; +  encodeURIComponent(document.title),&#39;&#39;,&#39;width=500,height=300&#39;); return false;&amp;quot;&amp;gt;&amp;lt;img alt=&amp;quot;Share on LinkedIn&amp;quot; src=&amp;quot;/img/linkedin.png&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;githubPages&#34;&gt; Hosting on Personal Github Pages &lt;/h3&gt;

&lt;p&gt;Again, some of the tutorials out there aren&amp;rsquo;t great at properly explaining how to get your pages hosted on your &lt;strong&gt;personal&lt;/strong&gt; Github pages (i.e. &lt;code&gt;https://&amp;lt;your username&amp;gt;.github.io&lt;/code&gt;), rather than project ones, so I&amp;rsquo;ll try to give you another version here.&lt;/p&gt;

&lt;p&gt;Firstly, login to Github and create the repository &lt;code&gt;&amp;lt;your username&amp;gt;.github.io&lt;/code&gt;. This is important as the master branch will be used to locate your website at exactly &lt;code&gt;https://&amp;lt;your username&amp;gt;.github.io&lt;/code&gt;. Initialise it with the &lt;code&gt;README.md&lt;/code&gt;. Create a new branch called &lt;code&gt;hugo&lt;/code&gt; and initialise this with the &lt;code&gt;README.md&lt;/code&gt; too.&lt;/p&gt;

&lt;p&gt;In your &lt;code&gt;./newsite&lt;/code&gt; directory you&amp;rsquo;ll need to build the site, initialise the git respository and add the remote:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ hugo
$
$ git init
$ git remote add origin git@github.com:&amp;lt;username&amp;gt;/&amp;lt;username&amp;gt;.github.io.git
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you&amp;rsquo;re having trouble adding the remote because of &lt;em&gt;permissions&lt;/em&gt; it could be that you&amp;rsquo;re using a different Git account for your website than normal. Have a look at the &lt;code&gt;git config&lt;/code&gt; options to change the username/password. If that fails, it could be that you need to sort an &lt;code&gt;ssh&lt;/code&gt; key - instructions for that are on your account settings page.&lt;/p&gt;
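&lt;p&gt;For the common case where the website repo needs a different identity to your day-to-day one, a per-repository config is usually enough. A sketch with placeholder values (swap in your own username and email):&lt;/p&gt;

```shell
# Set name/email for this repository only, leaving the global
# git identity untouched (the values below are placeholders).
cd "$(mktemp -d)" && git init -q newsite && cd newsite
git config user.name "yourusername"
git config user.email "you@example.com"
git config user.name   # shows the per-repo name
```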

&lt;p&gt;From here, I managed to find and adapt two scripts from &lt;a href=&#34;https://hjdskes.github.io/blog/deploying-hugo-on-personal-gh-pages/&#34; title=&#34;hjdskes&#34;&gt;here&lt;/a&gt;. The first is &lt;code&gt;setup.sh&lt;/code&gt; (&lt;a href=&#34;/docs/setup.sh&#34; title=&#34;setup.sh&#34;&gt;download&lt;/a&gt;) and only needs to be executed once. It does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deletes the master branch (perfectly safe)&lt;/li&gt;
&lt;li&gt;Creates a new orphaned master branch&lt;/li&gt;
&lt;li&gt;Takes the &lt;code&gt;README.md&lt;/code&gt; from &lt;code&gt;hugo&lt;/code&gt; and makes an initial commit to &lt;code&gt;master&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Changes back to &lt;code&gt;hugo&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Removes the existing &lt;code&gt;./public&lt;/code&gt; folder&lt;/li&gt;
&lt;li&gt;Sets the &lt;code&gt;master&lt;/code&gt; branch as a subtree for the &lt;code&gt;./public&lt;/code&gt; folder&lt;/li&gt;
&lt;li&gt;Pulls the committed &lt;code&gt;master&lt;/code&gt; back into &lt;code&gt;./public&lt;/code&gt; to stop merge conflicts.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&#34;warn&#34;&gt;Make sure that you edit the `USERNAME` field in `setup.sh` before executing.&lt;/div&gt;

&lt;p&gt;After that, whenever you want to upload your site, just run the second script &lt;code&gt;deploy.sh&lt;/code&gt;, which I&amp;rsquo;ve altered slightly (&lt;a href=&#34;/docs/deploy.sh&#34; title=&#34;deploy.sh&#34;&gt;download&lt;/a&gt;) to take an optional argument which becomes your commit message: omitting the argument submits a default message.&lt;/p&gt;
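&lt;p&gt;The optional argument boils down to bash&amp;rsquo;s default-value expansion. A minimal sketch of the idea (the default message here is a placeholder, not the one the real script uses):&lt;/p&gt;

```shell
#!/bin/bash
# If a first argument is given it becomes the commit message;
# otherwise a default message is used (placeholder text).
msg="${1:-Rebuilding site}"
echo "$msg"
```

&lt;p&gt;So running the script with an argument commits with that message, while running it bare falls back to the default.&lt;/p&gt;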

&lt;p&gt;&lt;code&gt;deploy.sh&lt;/code&gt; commits and pushes all of your changes to the &lt;code&gt;hugo&lt;/code&gt; source branch before putting the &lt;code&gt;./public&lt;/code&gt; folder on &lt;code&gt;master&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&#34;warn&#34;&gt;Make sure that you edit the `USERNAME` field in `deploy.sh` before executing.&lt;/div&gt;

&lt;p&gt;And that&amp;rsquo;s it! If the website doesn&amp;rsquo;t load when you go to &lt;code&gt;https://&amp;lt;your username&amp;gt;.github.io&lt;/code&gt; you may need to hit &lt;code&gt;settings&lt;/code&gt; in your repo (top right of the menu bar), scroll down to &amp;ldquo;Github Pages&amp;rdquo; and select &lt;code&gt;master&lt;/code&gt; as your source.&lt;/p&gt;

&lt;h2 id=&#34;htmlCss&#34;&gt;HTML / CSS&lt;/h2&gt;

&lt;h3 id=&#34;contactForm&#34;&gt;Contact Form&lt;/h3&gt;

&lt;p&gt;The first part of the site I altered was the contact page. I added a contact form, which largely involves &lt;code&gt;html&lt;/code&gt; formatted with &lt;code&gt;css&lt;/code&gt;. The magic that makes it work comes from the free service called &lt;a href=&#34;https://formspree.io/&#34; title=&#34;Formspree&#34;&gt;Formspree&lt;/a&gt;. Essentially, the submit button sends the information to Formspree and they forward it on to me directly. It uses a hidden field to give the forwarded emails the same subject, which makes for easy filtering. It also provides a free &amp;ldquo;I&amp;rsquo;m not a robot&amp;rdquo; page after clicking submit.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-html&#34;&gt;&amp;lt;div id=&amp;quot;contactform&amp;quot; class=&amp;quot;center&amp;quot;&amp;gt;
&amp;lt;form action=&amp;quot;https://formspree.io/your@email.com&amp;quot; method=&amp;quot;POST&amp;quot; name=&amp;quot;sentMessage&amp;quot; id=&amp;quot;contactForm&amp;quot; novalidate&amp;gt;
	&amp;lt;input type=&amp;quot;text&amp;quot; name=&amp;quot;name&amp;quot; placeholder=&amp;quot;Name&amp;quot; id=&amp;quot;name&amp;quot; required data-validation-required-message=&amp;quot;Please enter your name.&amp;quot;&amp;gt;&amp;lt;br&amp;gt;
	&amp;lt;input type=&amp;quot;email&amp;quot; name=&amp;quot;_replyto&amp;quot; placeholder=&amp;quot;Email Address&amp;quot; id=&amp;quot;email&amp;quot; required data-validation-required-message=&amp;quot;Please enter your email address.&amp;quot; &amp;gt;&amp;lt;br&amp;gt;

	&amp;lt;input type=&amp;quot;hidden&amp;quot;  name=&amp;quot;_subject&amp;quot; value=&amp;quot;Message from MLNotebook&amp;quot;&amp;gt;
	&amp;lt;input type=&amp;quot;text&amp;quot; name=&amp;quot;_gotcha&amp;quot; style=&amp;quot;display:none&amp;quot; /&amp;gt;
	&amp;lt;textarea rows=&amp;quot;10&amp;quot; name=&amp;quot;message&amp;quot; class=&amp;quot;form-control&amp;quot; placeholder=&amp;quot;Message&amp;quot; id=&amp;quot;message&amp;quot; required data-validation-required-message=&amp;quot;Please enter a message.&amp;quot;&amp;gt;&amp;lt;/textarea&amp;gt;&amp;lt;br&amp;gt;
	&amp;lt;input type=&amp;quot;submit&amp;quot; value=&amp;quot;Send&amp;quot;&amp;gt;
&amp;lt;/form&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The formatting was a pain as I&amp;rsquo;d never used the box-sizing property before - this is what I found made the boxes all the same size with the same alignment. I added the vendor-prefixed versions for all browsers too.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-css&#34;&gt;
input[type=text], input[type=email], textarea {
	display: inline-block;
  	border: 1px solid transparent;
  	border-top: none;
  	border-bottom: 1px solid #DDD;
  	box-shadow: inset 0 1px 2px rgba(0,0,0,.39), 0 -1px 1px #FFF, 0 1px 0 #FFF;
	border-radius: 4px;
	margin: 2px 2px 2px 2px;
	resize:none;
	float: left;
	width: 100%;
}

textarea, input {
    -webkit-box-sizing: border-box;
    -moz-box-sizing: border-box;
    box-sizing: border-box;
}

input[type=submit] {
	width: 100%;
}

.center {
	margin: auto;
}

input {
	height:50px;
}

textarea {
	height: 200px;
	padding-left: 0px;
}

input::-webkit-input-placeholder, textarea::-webkit-input-placeholder {
   padding-left: 10px;
}
input::-moz-placeholder, textarea::-moz-placeholder {
   padding-left: 10px;
}
input:-ms-input-placeholder, textarea:-ms-input-placeholder {
   padding-left: 10px;
}
input:-moz-placeholder, textarea:-moz-placeholder {
   padding-left: 10px;
}
  
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;resizing&#34;&gt;Resizing for Small Screens&lt;/h3&gt;

&lt;p&gt;One of my final hurdles in getting the site set up was making the homepage a little more friendly than just showing the recent posts. So I decided to add my &lt;a href=&#34;https://twitter.com/mlnotebook&#34; title=&#34;@MLNotebook&#34;&gt;twitter&lt;/a&gt; feed to the side. Twitter provides an easy embed code for this, and I just put it into its own partial in &lt;code&gt;layouts/partials/twitterfeed.html&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;My problem here though was that when I viewed my site on my phone, or resized the web-browser on the computer, the content would shrink and be almost unreadable - I wanted the feed to move below the text if the screen was below a certain size. So I created the usual &lt;code&gt;div&lt;/code&gt; containers within my &lt;code&gt;index.html&lt;/code&gt; file and added the shortcode to include my &lt;code&gt;twitterfeed.html&lt;/code&gt; in the right-hand side.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-html&#34;&gt;&amp;lt;div id=&amp;quot;container&amp;quot; class=&amp;quot;center&amp;quot;&amp;gt;
	&amp;lt;div id=&amp;quot;left_content&amp;quot; class=&amp;quot;center&amp;quot;&amp;gt;
		&amp;lt;div class=&amp;quot;content&amp;quot;&amp;gt;
		  {{ range ( .Paginate (where .Data.Pages &amp;quot;Type&amp;quot; &amp;quot;post&amp;quot;)).Pages }}
		    {{ .Render &amp;quot;summary&amp;quot;}}
		  {{ end }}

		  {{ partial &amp;quot;pagination.html&amp;quot; . }}
		&amp;lt;/div&amp;gt;
	&amp;lt;/div&amp;gt;
	&amp;lt;div id=&amp;quot;right_content&amp;quot; class=&amp;quot;center&amp;quot;&amp;gt;
		&amp;lt;center&amp;gt;{{ partial &amp;quot;twitterfeed.html&amp;quot; . }}&amp;lt;/center&amp;gt;
	&amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I then used &lt;code&gt;css&lt;/code&gt; to give the &lt;code&gt;div&lt;/code&gt; containers their own properties for different screen sizes:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-css&#34;&gt;#container {
	position: relative;
	width:auto;

}

#right_content {
	float:left;
	overflow:hidden;
	display:block;
	padding-right:1%;

}

#left_content {
	float:left;
	width:80%;
	display:block;
	margin:auto;
	min-width: 600px;

}

pre &amp;gt; code {
	font-size:11pt;
}

@media screen and (max-width: 1000px) {

#left_content {
	width: 100%;
	}
	
	.content {
	max-width:100%;
	}



#right_content {
	width:100%;
}

pre &amp;gt; code {
	font-size:8pt;
}

}

&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that this allows the size of the font in the code-snippets to shrink when the screen size is small - I find that it reads more easily.&lt;/p&gt;

&lt;h2 id=&#34;syntaxHighlighting&#34;&gt;Syntax highlighting&lt;/h2&gt;

&lt;p&gt;So actually getting code into the website was trickier than I thought. The in-built markdown codeblocks seem to work just fine by adding code between backticks: &lt;code&gt;`&amp;lt;code here&amp;gt;`&lt;/code&gt;. Markdown doesn&amp;rsquo;t do syntax highlighting right out of the box though, so I&amp;rsquo;m using &lt;code&gt;highlight.js&lt;/code&gt;. My theme does come with a highlight shortcode option, but I found that I couldn&amp;rsquo;t customise it how I wanted - particularly, the font size was just too big. I tried everything, even adding extra &lt;code&gt;&amp;lt;pre&amp;gt; &amp;lt;/pre&amp;gt;&lt;/code&gt; tags around it and using &lt;code&gt;css&lt;/code&gt; to format them. In the end, I found that using &lt;code&gt;highlight.js&lt;/code&gt; was much simpler - I just loaded the script straight off their server and voila! The link just needed editing to select the theme I wanted, but I opted for the standard &lt;code&gt;monokai&lt;/code&gt; anyway. I placed this in my site&amp;rsquo;s &lt;code&gt;head&lt;/code&gt; partial.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-html&#34;&gt;&amp;lt;link rel=&amp;quot;stylesheet&amp;quot; href=&amp;quot;//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.9.0/styles/monokai.min.css&amp;quot;&amp;gt;
&amp;lt;script src=&amp;quot;//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.9.0/highlight.min.js&amp;quot;&amp;gt;&amp;lt;/script&amp;gt;
&amp;lt;script&amp;gt;hljs.initHighlightingOnLoad();&amp;lt;/script&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;mathsRendering&#34;&gt;Maths Rendering&lt;/h2&gt;

&lt;p&gt;As this is a site on machine learning, I&amp;rsquo;m going to need to be able to include some mathematics sometimes. I&amp;rsquo;m very familiar with $\rm\LaTeX$ and I&amp;rsquo;ve written up a lot of formulae already, so I looked into getting $\rm\LaTeX$ formatting into markdown/Hugo. A few math rendering engines are around, but not all are simple to implement. The best option I found was &lt;a href=&#34;https://www.mathjax.org/&#34; title=&#34;MathJax&#34;&gt;MathJax&lt;/a&gt;, which literally required me to add these few lines to my &lt;code&gt;head&lt;/code&gt; partial.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-html&#34;&gt;&amp;lt;script type=&amp;quot;text/javascript&amp;quot;
  src=&amp;quot;https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&amp;quot;&amp;gt;
&amp;lt;/script&amp;gt;

&amp;lt;script type=&amp;quot;text/x-mathjax-config&amp;quot;&amp;gt;
MathJax.Hub.Config({
  tex2jax: {
    inlineMath: [[&#39;$&#39;,&#39;$&#39;], [&#39;\\(&#39;,&#39;\\)&#39;]],
    displayMath: [[&#39;$$&#39;,&#39;$$&#39;], [&#39;\\[&#39;,&#39;\\]&#39;]],
    processEscapes: true,
    processEnvironments: true,
    skipTags: [&#39;script&#39;, &#39;noscript&#39;, &#39;style&#39;, &#39;textarea&#39;, &#39;pre&#39;],
    TeX: { equationNumbers: { autoNumber: &amp;quot;AMS&amp;quot; },
         extensions: [&amp;quot;AMSmath.js&amp;quot;, &amp;quot;AMSsymbols.js&amp;quot;] }
  }
});
&amp;lt;/script&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;From there, it allows me to put inline math into my websites such as $ c = \sqrt{a^{2} + b^{2}} $ by enclosing them in the normal \$ symbols like so: &lt;code&gt;\$ some math \$&lt;/code&gt;. MathJax also provides display-style input with enclosing &lt;code&gt;&amp;lt;div&amp;gt;\$\$ code \$\$&amp;lt;/div&amp;gt;&lt;/code&gt; e.g.:&lt;/p&gt;

&lt;div&gt;$$ c = \sqrt{a^{2} + b^{2}}  $$&lt;/div&gt;

&lt;p&gt;The formatting is done with some css:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-css&#34;&gt;code.has-jax {
	font: inherit;
	font-size: 100%;
	background: inherit;
	border: inherit;
	color: #515151;
}
&lt;/code&gt;&lt;/pre&gt;

</description>
    </item>
    
  </channel>
</rss>