Python on Machine Learning Notebook

A Simple Neural Network - Simple Performance Improvements

Fri, 17 Mar 2017 08:53:55 +0000

The 5th installment of our tutorial on implementing a neural network (NN) in Python. By the end of this tutorial, our NN should perform much more efficiently giving good results with fewer iterations. We will do this by implementing “momentum” into our network. We will also put in the other transfer functions for each layer.

Introduction

We’ve come so far! The initial maths was a bit of a slog, as was the vectorisation of that maths, but it was important to be able to implement our NN in Python, which we did in our previous post. So what now? Well, you may have noticed when running the NN as it stands that it isn’t overly quick. Depending on the randomly initialised weights, it may take the network the full number of maxIterations to converge, and then it may not converge at all! But there is something we can do about it. Let’s learn about, and implement, ‘momentum’.

Momentum

Background

Let’s revisit our equation for error in the NN:

$$ \text{E} = \frac{1}{2} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right)^{2} $$

This isn’t the only error function that could be used. In fact, there’s a whole field of study in NN about the best error or ‘optimisation’ function that should be used. This one tries to look at the sum of the squared-residuals between the outputs and the expected values at the end of each forward pass (the so-called $l_{2}$-norm). Others e.g. $l_{1}$-norm, look at minimising the sum of the absolute differences between the values themselves. There are more complex error (a.k.a. optimisation or cost) functions, for example those that look at the cross-entropy in the data. There may well be a post in the future about different cost-functions, but for now we will still focus on the equation above.

Now this function is described as a ‘convex’ function. This is an important property if we are to make our NN converge to the correct answer. Take a look at the two functions below:

Figure 1: A convex (left) and non-convex (right) cost function

Let’s say that our current error was represented by the green ball. Our NN will calculate the gradient of its cost function at this point, then look for the direction which is going to minimise the error, i.e. go down a slope. The NN will feed the result into the back-propagation algorithm, which will hopefully mean that on the next iteration the error will have decreased. For a convex function, this is very straightforward: the NN just needs to keep going in the direction it found on the first run. But look at the non-convex function: our current error (green ball) sits at a point where either direction will take it to a lower error, i.e. the function slopes downwards on both sides. If the error goes to the left, it will hit one of the possible minima of the function, but this will be a higher minimum (higher final error) than if the error had chosen the gradient to the right. Clearly the starting point for the error here has a big impact on the final result. Looking down at the 2D perspective (remembering that these are complex multi-dimensional functions), the non-convex case is clearly more ambiguous in terms of the location of the minimum and direction of descent. The convex function, however, nicely guides the error to the minimum with little care of the starting point.

Figure 2: Contours for a portion of the convex (left) and non-convex (right) cost function
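To make the point about starting positions concrete, here is a minimal sketch of plain gradient descent on a made-up one-dimensional non-convex function. The cost function, its gradient and the helper name descend are all assumptions chosen for illustration; they are not the cost surface from the figures.

```python
def cost(x):
    # Hypothetical non-convex cost with two minima of different depth
    return x**4 - 2 * x**2 + 0.5 * x

def gradient(x):
    # Derivative of the hypothetical cost above
    return 4 * x**3 - 4 * x + 0.5

def descend(start, eta=0.05, steps=500):
    """Plain gradient descent (no momentum) from a given starting point."""
    x = start
    for _ in range(steps):
        x -= eta * gradient(x)
    return x

# Two starting points either side of the bump between the minima
left = descend(-0.5)
right = descend(0.5)
print("start -0.5 -> x = {0:.3f}, cost = {1:.3f}".format(left, cost(left)))
print("start  0.5 -> x = {0:.3f}, cost = {1:.3f}".format(right, cost(right)))
```

The two runs settle in different minima with different final costs, even though the only thing that changed was where the descent started.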

So let’s focus on the convex case and explain what momentum is and why it works. I don’t think you’ll ever see a back propagation algorithm without momentum implemented in some way. In its simplest form, it modifies the weight-update equation:

$$ \mathbf{ \Delta W_{JK} = -\eta \vec{\delta}_{K} \vec{ \mathcal{O}_{J}}} $$

by adding an extra momentum term:

$$ \mathbf{ \Delta W_{JK}\left(t\right) = -\eta \vec{\delta}_{K} \vec{ \mathcal{O}_{J}}} + m \mathbf{\Delta W_{JK}\left(t-1\right)} $$

The weight delta (the update amount to the weights after BP) now relies on its previous value i.e. the weight delta now at iteration $t$ requires the value of itself from $t-1$. The $m$ or momentum term, like the learning rate $\eta$ is just a small number between 0 and 1. What effect does this have?

Using prior information about the network is beneficial as it stops the network firing wildly into the unknown. If it knows the previous weight update that led to the current error, it can keep the descent to the minimum roughly pointing in the same direction as it was before. The effect is that each iteration does not jump around as much as it otherwise would. In effect, the result is similar to that of the learning rate. We should be careful though: a large value for $m$ may cause the result to jump past the minimum and back again if combined with a large learning rate. We can think of momentum as changing the path taken to the optimum.
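As a rough illustration of the momentum update, the sketch below applies it to a single weight on a shallow quadratic cost and counts how many updates are needed to get close to the minimum. The cost $\frac{1}{2} a w^{2}$, the tolerance and the function name descend are assumptions made up for this example; only the update rule itself comes from the equation above.

```python
def descend(eta, m, a=0.1, start=1.0, tol=1e-3, max_steps=10000):
    """Count the updates needed to bring w within tol of the minimum of 0.5*a*w**2."""
    w = start
    previous_delta = 0.0
    for step in range(1, max_steps + 1):
        grad = a * w                                # gradient of the toy cost
        delta = -eta * grad + m * previous_delta    # momentum update rule
        w += delta
        previous_delta = delta                      # remember for the next step
        if abs(w) < tol:
            return step
    return max_steps

plain = descend(eta=0.2, m=0.0)
with_momentum = descend(eta=0.2, m=0.5)
print("Updates without momentum:", plain)
print("Updates with momentum:   ", with_momentum)
```

With the same learning rate, the run with momentum needs fewer updates on this toy problem. This is only a one-weight sketch; the behaviour on a real network depends on the shape of its cost surface.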

Momentum in Python

So, implementing momentum into our NN should be pretty easy. We will need to provide a momentum term to the backProp method of the NN and also create a new matrix in which to store the weight deltas from the current epoch for use in the subsequent one.

In the __init__ method of the NN, we need to initialise a list to hold the previous weight deltas and then give them some values - they’ll start with zeros:

def __init__(self, numNodes):
	"""Initialise the NN - setup the layers and initial weights"""

	# Layer info
	self.numLayers = len(numNodes) - 1
	self.shape = numNodes 

	# Input/Output data from last run
	self._layerInput = []
	self._layerOutput = []
	self._previousWeightDelta = []

	# Create the weight arrays
	for (l1,l2) in zip(numNodes[:-1],numNodes[1:]):
	    self.weights.append(np.random.normal(scale=0.1,size=(l2,l1+1))) 
	    self._previousWeightDelta.append(np.zeros((l2,l1+1)))

The only other part of the NN that needs to change is the definition of backProp: we add momentum to its inputs and update the weight equation. Finally, we make sure to save the current weight delta into _previousWeightDelta for use in the next iteration:

def backProp(self, input, target, trainingRate = 0.2, momentum=0.5):
	"""Get the error, deltas and back propagate to update the weights"""
	...
	weightDelta = trainingRate * thisWeightDelta + momentum * self._previousWeightDelta[index]

	self.weights[index] -= weightDelta

	self._previousWeightDelta[index] = weightDelta

Testing

Our default values for learning rate and momentum are 0.2 and 0.5 respectively. We can change either of these by including them in the call to backProp. This is the only change to the iteration process:

for i in range(maxIterations + 1):
    Error = NN.backProp(Input, Target, trainingRate=0.2, momentum=0.5)
    if i % 2500 == 0:
        print("Iteration {0}\tError: {1:0.6f}".format(i,Error))
    if Error <= minError:
        print("Minimum error reached at iteration {0}".format(i))
        break
        
Iteration 100000	Error: 0.000076
Input 	Output 		Target
[0 0]	 [ 0.00491572] 	[ 0.]
[1 1]	 [ 0.00421318] 	[ 0.]
[0 1]	 [ 0.99586268] 	[ 1.]
[1 0]	 [ 0.99586257] 	[ 1.]

Feel free to play around with these numbers; however, it is unlikely that much will change right now. I say this because there is a limit to how good the result can get when using only the sigmoid function as our activation function. If you go back and read the post on transfer functions you’ll see that it’s more common to use linear functions for the output layer. As it stands, the sigmoid function is unable to output exactly 1 or 0 because it is asymptotic at these values. Therefore, no matter what learning rate or momentum we use, the network will never be able to reach the target values exactly.
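A quick way to see the asymptotes is to evaluate the sigmoid at a few moderate inputs. This is a standalone sketch using the standard math module rather than the network’s own sigmoid, so the function below is an illustrative stand-in. (For very large inputs, floating-point rounding will eventually print 1.0, but mathematically the value never gets there.)

```python
import math

def sigmoid(x):
    # Scalar sigmoid, a stand-in for the network's numpy version
    return 1.0 / (1.0 + math.exp(-x))

# The outputs approach 0 and 1 but stay strictly between them
for x in (-10, -5, 0, 5, 10):
    print("sigmoid({0:>3}) = {1:.8f}".format(x, sigmoid(x)))
```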

This seems like a good time to implement the other transfer functions.

Transfer Functions

We’ve already gone through writing the transfer functions in Python in the transfer functions post. We’ll just put these under the sigmoid function we defined earlier. I’m going to use sigmoid, linear, gaussian and tanh here.
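For readers without the earlier post to hand, here is one way the extra functions might look. This is a sketch under assumptions: the names linear, gaussian and tanh and the Derivative flag are chosen to mirror the sigmoid signature used in this series, and the exact definitions in the transfer functions post may differ (the gaussian here is taken to be $e^{-x^{2}}$).

```python
import numpy as np

def linear(x, Derivative=False):
    # f(x) = x, so the gradient is 1 everywhere
    if not Derivative:
        return x
    return np.ones_like(x, dtype=float)

def gaussian(x, Derivative=False):
    # Assumed form: f(x) = exp(-x^2), f'(x) = -2x exp(-x^2)
    if not Derivative:
        return np.exp(-x**2)
    return -2 * x * np.exp(-x**2)

def tanh(x, Derivative=False):
    # f(x) = tanh(x), f'(x) = 1 - tanh(x)^2
    if not Derivative:
        return np.tanh(x)
    return 1.0 - np.tanh(x)**2

# Each function can be called the same way as sigmoid
x = np.array([-1.0, 0.0, 1.0])
for f in (linear, gaussian, tanh):
    print(f.__name__, f(x), f(x, True))
```

Keeping the same two-argument signature for every function means the network can swap one for another without any other change to the code.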

To modify the network, we need to assign each layer its own activation function, so let’s put that in the ‘layer information’ part of the __init__ method:

def __init__(self, numNodes, transferFunctions=None):
    """Initialise the Network"""

    # Layer information
    self.numLayers = len(numNodes) - 1
    self.shape = numNodes

    if transferFunctions is None:
        # Defaults: sigmoid for hidden layers, linear for the output layer
        layerTFs = []
        for i in range(self.numLayers):
            if i == self.numLayers - 1:
                layerTFs.append(linear)
            else:
                layerTFs.append(sigmoid)
    else:
        if len(numNodes) != len(transferFunctions):
            raise ValueError("Number of transfer functions must match the number of layers, including the input layer")
        elif transferFunctions[0] is not None:
            raise ValueError("The input layer doesn't need a transfer function: give it [None,...]")
        else:
            # Drop the None entry belonging to the input layer
            layerTFs = transferFunctions[1:]

    self.tFunctions = layerTFs

Let’s go through this. We pass the initialisation a parameter called transferFunctions with a default value of None. If the default is taken, or if the parameter is omitted, we set some defaults: for each layer, we use the sigmoid function, unless it’s the output layer, where we will use the linear function. If a list of transferFunctions is given, we first check that it’s a ‘legal’ input. If the number of functions in the list is not the same as the number of layers (given by numNodes) then throw an error. Also, if the first function in the list is not None, throw an error, because the first layer shouldn’t have an activation function (it is the input layer). If those two things are fine, go ahead and store the list of functions as layerTFs without the first one (element 0).

We next need to replace all of our calls directly to sigmoid and its derivative. These should now refer to the list of functions via an index that depends on the number of the current layer. There are 3 instances of this in our NN: 1 in the forward pass where we call sigmoid directly, and 2 in the backProp method where we call the derivative at the output and hidden layers. So sigmoid(layerInput), for example, should become:

self.tFunctions[index](layerInput)

Check the updated code here if that’s confusing.

Let’s test this out! We’ll modify the call to initialising the NN by adding a list of functions like so:

Input = np.array([[0,0],[1,1],[0,1],[1,0]])
Target = np.array([[0.0],[0.0],[1.0],[1.0]])
transferFunctions = [None, sigmoid, linear]
    
NN = backPropNN((2,2,1), transferFunctions)

Running the NN like this with the default learning rate and momentum should provide you with an immediate performance boost, simply because with the linear function we’re now able to get closer to the target values, reducing the error.

Iteration 0	Error: 1.550211
Iteration 2500	Error: 1.000000
Iteration 5000	Error: 0.999999
Iteration 7500	Error: 0.999999
Iteration 10000	Error: 0.999995
Iteration 12500	Error: 0.999969
Minimum error reached at iteration 14543
Input 	Output 		Target
[0 0]	 [ 0.0021009] 	[ 0.]
[1 1]	 [ 0.00081154] 	[ 0.]
[0 1]	 [ 0.9985881] 	[ 1.]
[1 0]	 [ 0.99877479] 	[ 1.]

Play around with the number of layers and different combinations of transfer functions as well as tweaking the learning rate and momentum. You’ll soon get a feel for how each changes the performance of the NN.

A Simple Neural Network - With Numpy in Python

Wed, 15 Mar 2017 09:55:00 +0000

Part 4 of our tutorial series on Simple Neural Networks. We’re ready to write our Python script! Having gone through the maths, vectorisation and activation functions, we’re now ready to put it all together and write it up. By the end of this tutorial, you will have a working NN in Python, using only numpy, which can be used to learn the output of logic gates (e.g. XOR).

Introduction

We’ve ploughed through the maths, then some more, now we’re finally here! This tutorial will run through the coding up of a simple neural network (NN) in Python. We’re not going to use any fancy packages (though they obviously have their advantages in tools, speed, efficiency…) we’re only going to use numpy!

By the end of this tutorial, we will have built an algorithm which will create a neural network with as many layers (and nodes) as we want. It will be trained by taking in multiple training examples and running the back propagation algorithm many times.

Here are the things we’re going to need to code:

The transfer functions
The forward pass
The back propagation algorithm
The update function

To keep things nice and contained, the forward pass and back propagation algorithms should be coded into a class. We’re going to expect that we can build a NN by creating an instance of this class which has some internal functions (forward pass, delta calculation, back propagation, weight updates).

First things first… let’s import numpy:

import numpy as np

Now let’s go ahead and get the first bit done:

Transfer Function

To begin with, we’ll focus on getting the network working with just one transfer function: the sigmoid function. As we discussed in a previous post this is very easy to code up because of its simple derivative:

$$ f\left(x_{i} \right) = \frac{1}{1 + e^{ - x_{i} }} \ \ \ \ f^{\prime}\left( x_{i} \right) = \sigma(x_{i}) \left( 1 - \sigma(x_{i}) \right) $$

def sigmoid(x, Derivative=False):
	if not Derivative:
		return 1 / (1 + np.exp (-x))
	else:
		out = sigmoid(x)
		return out * (1 - out)

This is a succinct expression which actually calls itself in order to get a value to use in its derivative. We’ve used numpy’s exponential function to create the sigmoid function and created an out variable to hold this in the derivative. Whenever we want to use this function, we can supply the parameter True to get the derivative. We can omit this, or enter False, to just get the output of the sigmoid. This is the same function I used to get the graphs in the post on transfer functions.
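If you want some reassurance that the derivative branch is right, a quick check is to compare it with a finite-difference estimate. The sketch below repeats the sigmoid from above so that it runs on its own; the step size h and the test points are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x, Derivative=False):
    if not Derivative:
        return 1 / (1 + np.exp(-x))
    else:
        out = sigmoid(x)
        return out * (1 - out)

x = np.array([-2.0, 0.0, 2.0])
h = 1e-6

# Central finite difference: (f(x+h) - f(x-h)) / 2h
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid(x, True)

print("analytic: ", analytic)
print("numerical:", numerical)
```

The two rows should agree to several decimal places, and the value at 0 should be 0.25, the steepest point of the sigmoid.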

Back Propagation Class

I’m fairly new to building my own classes in Python, but for this tutorial, I really relied on the videos of Ryan on YouTube. Some of his hacks were very useful so I’ve taken some of those on board, but I’ve made a lot of the variables more self-explanatory.

First we’re going to get the skeleton of the class setup. This means that whenever we create a new variable with the class of backPropNN, it will be able to access all of the functions and variables within itself.

It looks like this:

class backPropNN:
    """Class defining a NN using Back Propagation"""
    
    # Class Members (internal variables that are accessed with backPropNN.member) 
    numLayers = 0
    shape = None
    weights = []
    
    # Class Methods (internal functions that can be called)
    
    def __init__(self):
        """Initialise the NN - setup the layers and initial weights"""
        
    # Forward Pass method
    def FP(self):
    	"""Get the input data and run it through the NN"""
    	 
    # TrainEpoch method
    def backProp(self):
        """Get the error, deltas and back propagate to update the weights"""

We’ve not added any detail to the functions (or methods) yet, but we know we need an __init__ method to set the network up, plus we’re going to want to be able to do a forward pass and then back propagate the error.

We’ve also added a few class members, variables which can be called from an instance of the backPropNN class. numLayers is just that, a count of the number of layers in the network, initialised to 0. The shape of the network will return the size of each layer of the network in an array and the weights will return an array of the weights across the network.
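One property of class members that may be worth knowing about: a mutable member such as a list is shared by every instance of the class unless __init__ assigns a fresh one to self. The sketch below uses two made-up classes (SharedWeights and OwnWeights are hypothetical names, not part of the tutorial code) to show the difference. It only matters if more than one network is created in the same session.

```python
class SharedWeights:
    weights = []                      # class-level list, shared by all instances

    def __init__(self, value):
        self.weights.append(value)    # appends to the shared list


class OwnWeights:
    def __init__(self, value):
        self.weights = []             # fresh list belonging to this instance
        self.weights.append(value)


a, b = SharedWeights(1), SharedWeights(2)
c, d = OwnWeights(1), OwnWeights(2)

print(a.weights, b.weights)   # both names refer to the same list
print(c.weights, d.weights)   # each instance keeps its own list
```

If you plan to build several networks side by side, assigning self.weights = [] at the top of __init__ is one way to keep their weights separate.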

Initialisation

We’re going to make the user supply an input variable which is the size of the layers in the network i.e. the number of nodes in each layer: numNodes. This will be an array which is the length of the number of layers (including the input and output layers) where each element is the number of nodes in that layer.

def __init__(self, numNodes):
	"""Initialise the NN - setup the layers and initial weights"""

	# Layer information
	self.numLayers = len(numNodes) - 1
	self.shape = numNodes

We’ve told our network to ignore the input layer when counting the number of layers (common practice) and that the shape of the network should be returned as the input array numNodes.

Let’s also initialise the weights. We will take the approach of initialising all of the weights to small, random numbers. To keep the code succinct, we’ll use a neat function: zip. zip is a function which takes two sequences and pairs up the elements in corresponding locations (like a zip). For example:

A = [1, 2, 3]
B = [4, 5, 6]

list(zip(A,B))
[(1,4), (2,5), (3,6)]

Why might this be useful? Well, when we talk about weights we’re talking about the connections between layers. Let’s say we have numNodes=(2, 2, 1) i.e. a 2-layer network with 2 inputs, 1 output and 2 nodes in the hidden layer. Then we need to let the algorithm know that we expect 2 input nodes to send weights to 2 hidden nodes, then 2 hidden nodes to send weights to 1 output node, or [(2,2), (2,1)]. Note that overall we will have 4 weights from the input to the hidden layer, and 2 weights from the hidden to the output layer (not counting the bias weights).

What is our A and B in the code above that will give us [(2,2), (2,1)]? It’s this:

numNodes = (2,2,1)
A = numNodes[:-1]
B = numNodes[1:]

A
(2,2)
B
(2,1)
list(zip(A,B))
[(2,2), (2,1)]

Great! So each pair represents the nodes between which we need to initialise some weights. In fact, each pair, e.g. (2,2), is the clue to how many weights we are going to need between each layer e.g. between the input and hidden layers we are going to need (2 x 2) = 4 weights.

So for each pair in zip(A,B) (hint hint) we need to append some weights into that empty weight matrix we initialised earlier.

# Initialise the weight arrays
for (l1,l2) in zip(numNodes[:-1],numNodes[1:]):
    self.weights.append(np.random.normal(scale=0.1,size=(l2,l1+1)))

We use self.weights as we’re appending to the class member initialised earlier. We’re using the numpy random number generator to draw from a normal distribution. The scale is the standard deviation of that distribution, so the numbers are centred on 0 and mostly within a few tenths of it, and size says that we want a matrix of results with the shape of the tuple (l2,l1+1). Huh, +1? Don’t think we’re getting away without including the bias term! We want a random starting point even for the weight connecting the bias node (=1) to the next layer. Ok, but why this way and not (l1+1,l2)? Well, we’re looking for l2 connections from each of the l1+1 nodes in the previous layer - think of it as (number of observations x number of features). We’re creating a matrix of weights which goes across the nodes and down the weights from each node, or as we’ve seen in our maths tutorial:

$$ W_{ij} = \begin{pmatrix} w_{11} & w_{21} & w_{31} \\ w_{12} &w_{22} & w_{32} \end{pmatrix}, \ \ \ \ W_{jk} = \begin{pmatrix} w_{11} & w_{21} & w_{31} \end{pmatrix} $$

These are the weights between the first two layers and between the second two layers respectively, with node 3 being the bias node.
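To check that the shapes come out as in the matrices above, the weight loop can be run on its own outside the class. In this sketch a plain list called weights stands in for self.weights; everything else follows the snippet above.

```python
import numpy as np

numNodes = (2, 2, 1)
weights = []   # stand-in for self.weights

# One weight matrix per pair of neighbouring layers, with +1 for the bias node
for (l1, l2) in zip(numNodes[:-1], numNodes[1:]):
    weights.append(np.random.normal(scale=0.1, size=(l2, l1 + 1)))

for w in weights:
    print(w.shape)
```

For numNodes = (2, 2, 1) this gives a 2 x 3 matrix followed by a 1 x 3 matrix, matching $W_{ij}$ and $W_{jk}$.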

Before we move on, let’s also put in some placeholders in __init__ for the input and output values to each layer:

self._layerInput = []
self._layerOutput = []

Forward Pass

We’ve now initialised our network enough to be able to focus on the forward pass (FP).

Our FP function needs to have the input data. It needs to know how many training examples it’s going to have to go through, and it will need to reassign the inputs and outputs at each layer, so let’s clean those at the beginning:

def FP(self,input):

	numExamples = input.shape[0]

	# Clean away the values from the previous layer
	self._layerInput = []
	self._layerOutput = []

So let’s propagate. We already have a matrix of (randomly initialised) weights. We just need to know what the input is to each of the layers. We’ll separate this into the first hidden layer, and subsequent hidden layers.

For the first hidden layer we will write:

layerInput = self.weights[0].dot(np.vstack([input.T, np.ones([1, numExamples])]))

Let’s break this down:

Our training example inputs need to match the weights that we’ve already created. We expect that our examples will come in rows of an array with columns acting as features, something like [(0,0), (0,1),(1,1),(1,0)]. We can use numpy’s vstack to put each of these examples one on top of the other.

Each of the input examples is a vector which will be multiplied by the weight matrix to get the input to the current layer:

$$ \mathbf{x_{J}} = \mathbf{W_{IJ} \vec{\mathcal{O}}_{I}} $$

where $\mathbf{x_{J}}$ are the inputs to the layer $J$ and $\mathbf{\vec{\mathcal{O}}_{I}}$ is the output from the previous layer (the input examples in this case).

So given a set of $n$ input examples we vstack them so we just have (n x numInputNodes). We want to transpose this, (numInputNodes x n) such that we can multiply by the weight matrix which is (numOutputNodes x numInputNodes). This gives an input to the layer which is (numOutputNodes x n) as we expect.

Note we’re actually going to do the transposition first before doing the vstack - this does exactly the same thing, but it also allows us to more easily add the bias nodes in to each input.

Bias! Let’s not forget this: we add a bias node which always has the value 1 to each input (including the input layer). So our actual method is:

Transpose the inputs input.T
Add a row of ones to the bottom (one bias node for each input) [input.T, np.ones([1,numExamples])]
vstack this to compact the array np.vstack(...)
Multiply with the weights connecting from the previous to the current layer self.weights[0].dot(...)
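The steps above can be tried outside the class to watch the shapes change. In this sketch weights0 is a stand-in for self.weights[0] in a (2, 2, 1) network and withBias is a name introduced here for the intermediate array; neither appears in the class itself.

```python
import numpy as np

Input = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])   # (4 examples x 2 features)
numExamples = Input.shape[0]

# Stand-in for self.weights[0]: (2 hidden nodes x (2 inputs + 1 bias))
weights0 = np.random.normal(scale=0.1, size=(2, 3))

# Transpose, then stack a row of ones underneath for the bias node
withBias = np.vstack([Input.T, np.ones([1, numExamples])])
print(withBias.shape)     # (features + bias) x examples

# Multiply by the weights to get the input to the first hidden layer
layerInput = weights0.dot(withBias)
print(layerInput.shape)   # hidden nodes x examples
```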

But what about the subsequent hidden layers? We’re not using the input examples in these layers, we are using the output from the previous layer [self._layerOutput[-1]] (multiplied by the weights).

for index in range(self.numLayers):
    # Get input to the layer
    if index == 0:
        layerInput = self.weights[0].dot(np.vstack([input.T, np.ones([1, numExamples])]))
    else:
        layerInput = self.weights[index].dot(np.vstack([self._layerOutput[-1],np.ones([1,numExamples])]))

Make sure to save this input, but also to now calculate the output of the current layer i.e.:

$$ \mathbf{ \vec{ \mathcal{O}}_{J}} = \sigma(\mathbf{x_{J}}) $$

self._layerInput.append(layerInput)
self._layerOutput.append(sigmoid(layerInput))

Finally, make sure that we’re returning the data from our output layer the same way that we got it:

return self._layerOutput[-1].T

Back Propagation

We’ve successfully sent the data from the input layer to the output layer using some initially randomised weights and we’ve included the bias term (a kind of threshold on the activation functions). Our vectorised equations from the previous post will now come into play:

$$ \begin{align} \mathbf{\vec{\delta}_{K}} &= \sigma^{\prime}\left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) * \left( \mathbf{\vec{\mathcal{O}}_{K}} - \mathbf{T_{K}}\right) \\[0.5em] \mathbf{ \vec{ \delta }_{J}} &= \sigma^{\prime} \left( \mathbf{ W_{IJ} \mathcal{O}_{I} } \right) * \mathbf{ W^{\intercal}_{JK}} \mathbf{ \vec{\delta}_{K}} \end{align} $$

$$ \begin{align} \mathbf{W_{JK}} + \Delta \mathbf{W_{JK}} &\rightarrow \mathbf{W_{JK}}, \ \ \ \Delta \mathbf{W_{JK}} = -\eta \mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }_{J}} \\[0.5em] \vec{\theta} + \Delta \vec{\theta} &\rightarrow \vec{\theta}, \ \ \ \Delta \vec{\theta} = -\eta \mathbf{ \vec{ \delta }_{K}} \end{align} $$

With $*$ representing an elementwise multiplication between the matrices.

First, let’s initialise some variables and get the error on the output of the output layer. We assume that the target values have been formatted in the same way as the input values i.e. they are a row-vector per input example. In our forward propagation method, the outputs are stored as column-vectors, thus the targets have to be transposed. We will need to supply the input data, the target data and $\eta$, the learning rate, which we will set at some small number for default. So we start back propagation by first initialising a placeholder for the deltas and getting the number of training examples before running them through the FP method:

def backProp(self, input, target, trainingRate = 0.2):
    """Get the error, deltas and back propagate to update the weights"""

    delta = []
    numExamples = input.shape[0]

    # Do the forward pass
    self.FP(input)

    # Here index refers to the output layer, i.e. self.numLayers - 1
    output_delta = self._layerOutput[index] - target.T
    error = np.sum(output_delta**2)

We know from previous posts that the error is squared to get rid of the negatives. From this we compute the deltas for the output layer:

delta.append(output_delta * sigmoid(self._layerInput[index], True))

We now have the error but need to know what direction to alter the weights in, thus the gradient at the inputs to the layer needs to be known. So, we get the gradient of the activation function at the input to the layer and get the product with the error. Notice we’ve supplied True to the sigmoid function to get its derivative.

This is the delta for the output layer. So this calculation is only done when we’re considering the index at the end of the network. We should be careful that when telling the algorithm that this is the “last layer” we take account of the zero-indexing in Python i.e. the last layer is self.numLayers - 1 i.e. in a network with 2 layers, layer[2] does not exist.

We also need to get the deltas of the intermediate hidden layers. To do this, (according to our equations above) we have to ‘pull back’ the delta from the output layer first. More accurately, for any hidden layer, we pull back the delta from the next layer, which may well be another hidden layer. These deltas from the next layer are multiplied by the weights from the next layer [index + 1], before getting the product with the sigmoid derivative evaluated at the current layer.

Note: this is back propagation. We have to start at the end and work back to the beginning. We use the built-in reversed function in our loop to ensure that the algorithm considers the layers in reverse order.

Combining this into one method:

# Calculate the deltas
for index in reversed(range(self.numLayers)):
    if index == self.numLayers - 1:
        # If the output layer, then compare to the target values
        output_delta = self._layerOutput[index] - target.T
        error = np.sum(output_delta**2)
        delta.append(output_delta * sigmoid(self._layerInput[index], True))
    else:
        # If a hidden layer, compare to the following layer's delta
        delta_pullback = self.weights[index + 1].T.dot(delta[-1])
        delta.append(delta_pullback[:-1,:] * sigmoid(self._layerInput[index], True))

Pick this piece of code apart. This is an important snippet as it calculates all of the deltas for all of the nodes in the network. Be sure that we understand:

This is a reversed loop because we want to deal with the last layer first
The delta of the output layer is the residual between the output and target multiplied with the gradient (derivative) of the activation function at the current layer.
The delta of a hidden layer first needs the product of the subsequent layer’s delta with the subsequent layer’s weights. This is then multiplied with the gradient of the activation function evaluated at the current layer.
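To see these three points in action, here is a standalone sketch that computes the deltas for a (2, 2, 1) network outside the class. The variable names (W_IJ, W_JK, delta_K, delta_J and so on) follow the maths notation and are assumptions made for this example; the class stores the same quantities in its lists.

```python
import numpy as np

def sigmoid(x, Derivative=False):
    out = 1 / (1 + np.exp(-x))
    return out * (1 - out) if Derivative else out

rng = np.random.RandomState(0)   # fixed seed so the run is repeatable
Input = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
Target = np.array([[0.0], [0.0], [1.0], [1.0]])
n = Input.shape[0]

W_IJ = rng.normal(scale=0.1, size=(2, 3))   # input -> hidden (with bias column)
W_JK = rng.normal(scale=0.1, size=(1, 3))   # hidden -> output (with bias column)

# Forward pass, keeping the input to each layer
x_J = W_IJ.dot(np.vstack([Input.T, np.ones([1, n])]))
O_J = sigmoid(x_J)
x_K = W_JK.dot(np.vstack([O_J, np.ones([1, n])]))
O_K = sigmoid(x_K)

# Output layer: residual times the gradient at the output layer's input
delta_K = (O_K - Target.T) * sigmoid(x_K, True)

# Hidden layer: pull the delta back through the weights, drop the bias row,
# then multiply by the gradient at the hidden layer's input
pullback = W_JK.T.dot(delta_K)
delta_J = pullback[:-1, :] * sigmoid(x_J, True)

print(delta_K.shape, pullback.shape, delta_J.shape)
```

Each delta has one row per node in its layer and one column per training example.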

Double check that this matches up with the equations above too! We can double check the matrix multiplication. For the output layer:

output_delta = (numOutputNodes x 1) - (1 x numOutputNodes).T = (numOutputNodes x 1)
error = sum( (numOutputNodes x 1)**2 ) = a single number
delta = (numOutputNodes x 1) * sigmoid( (numOutputNodes x 1), True ) = (numOutputNodes x 1)

For the hidden layers (take the one previous to the output as example):

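delta_pullback = (numOutputNodes x (numHiddenNodes + 1)).T.dot(numOutputNodes x 1) = ((numHiddenNodes + 1) x 1)
delta = (numHiddenNodes x 1) * sigmoid( (numHiddenNodes x 1), True ) = (numHiddenNodes x 1)

The extra row in delta_pullback comes from the bias weights; the [:-1,:] slice in the code drops it before the multiplication.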

Hurray! We have the delta at each node in our network. We can use them to update the weights for each layer in the network. Remember, to update the weights between layer $J$ and $K$ we need to use the output of layer $J$ and the deltas of layer $K$. This means we need to keep a track of the index of the layer we’re currently working on ($J$) and the index of the delta layer ($K$) - not forgetting about the zero-indexing in Python:

for index in range(self.numLayers):
    delta_index = self.numLayers - 1 - index

Let’s first get the outputs from each layer:

    if index == 0:
        layerOutput = np.vstack([input.T, np.ones([1, numExamples])])
    else:
        layerOutput = np.vstack([self._layerOutput[index - 1], np.ones([1,self._layerOutput[index -1].shape[1]])])

The output of the input layer is just the input examples (which we’ve vstack-ed again) and the output from the other layers we take from the calculation in the forward pass (making sure to add the bias term on the end).

For the current index (layer) let’s use this layerOutput to get the change in weight. We will use a few neat tricks to make this succinct:

	thisWeightDelta = np.sum(\
	    layerOutput[None,:,:].transpose(2,0,1) * delta[delta_index][None,:,:].transpose(2,1,0) \
	    , axis = 0)

Break it down. We’re looking for $\mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }_{J}} $ so it’s the delta at delta_index, the next layer along.

We want to be able to deal with all of the input training examples simultaneously. This requires a bit of fancy slicing and transposing of the matrices. Take a look: by calling vstack we made all of the input data and bias terms live in the same numpy array. When we index this array with [None,:,:], it tells numpy to keep all (:) the data in the rows and columns, which become the 2nd and 3rd dimensions, and to add a new, empty first dimension (None). We do this to create the three dimensions which we can now transpose into. Calling transpose(2,0,1) instructs numpy to reorder the dimensions so that the examples come first. This creates an array where each example now lives in its own plane. The same is done for the deltas of the subsequent layer, but being careful to transpose them in the opposite direction so that the two arrays broadcast against each other, giving one matrix of delta-times-output products per example. Summing with axis=0 then adds these matrices up over all of the examples, leaving a single weight delta matrix for the layer.

This looks incredibly complicated. It can be broken down into a for-loop over the input examples, but this reduces the efficiency of the network. Taking advantage of the numpy array like this keeps our calculations fast. In reality, if you’re struggling with this particular part, just copy and paste it, forget about it and be happy with yourself for understanding the maths behind back propagation, even if this random bit of Python is perplexing.
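If it helps, the sketch below compares the broadcasting trick with the for-loop it replaces, and with a plain matrix product that gives the same numbers. The arrays are random stand-ins for layerOutput and delta[delta_index], and the names broadcast, looped and dotted are made up for this comparison.

```python
import numpy as np

rng = np.random.RandomState(1)
n = 4                                    # number of training examples
layerOutput = rng.normal(size=(3, n))    # (nodes in J + bias) x examples
delta = rng.normal(size=(2, n))          # nodes in K x examples

# The broadcasting version from the text
broadcast = np.sum(
    layerOutput[None, :, :].transpose(2, 0, 1) * delta[None, :, :].transpose(2, 1, 0),
    axis=0)

# The same thing as an explicit loop: one outer product per example, summed
looped = np.zeros((2, 3))
for i in range(n):
    looped += np.outer(delta[:, i], layerOutput[:, i])

# And as a single matrix product
dotted = delta.dot(layerOutput.T)

print(np.allclose(broadcast, looped), np.allclose(broadcast, dotted))
```

All three produce a matrix with the same shape as the weights between the two layers, which is what the update step needs.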

Anyway. Let’s take this set of weight deltas and put back the $\eta$. In the code this is the trainingRate argument; it’s called a lot of things, with learning rate being perhaps the most common. We’ll update the weights by making sure to include the - from the $-\eta$.

	weightDelta = trainingRate * thisWeightDelta
	self.weights[index] -= weightDelta

The -= is Python shorthand for: take the current value and subtract the value of weightDelta.

To finish up, we want our back propagation to return the current error in the network, so:

return error

A Toy Example

Believe it or not, that’s it! The fundamentals of forward and back propagation have now been implemented in Python. If you want to double check your code, have a look at my completed .py here

Let’s test it!

Input = np.array([[0,0],[1,1],[0,1],[1,0]])
Target = np.array([[0.0],[0.0],[1.0],[1.0]])

NN = backPropNN((2,2,1))

Error = NN.backProp(Input, Target)
Output = NN.FP(Input)

print('Input \tOutput \t\tTarget')
for i in range(Input.shape[0]):
    print('{0}\t {1} \t{2}'.format(Input[i], Output[i], Target[i]))

This will provide 4 input examples and the expected targets. We create an instance of the network called NN with 2 layers (2 nodes in the hidden and 1 node in the output layer). We make NN do backProp with the input and target data and then get the output from the final layer by running our input through the network with a FP. The printout is self-explanatory. Give it a try!

Input 	Output 		Target
[0 0]	 [ 0.51624448] 	[ 0.]
[1 1]	 [ 0.51688469] 	[ 0.]
[0 1]	 [ 0.51727559] 	[ 1.]
[1 0]	 [ 0.51585529] 	[ 1.]

We can see that the network has taken our inputs, and we have some outputs too. They’re not great, and all seem to live around the same value. This is because we initialised the weights across the network to a similarly small random value. We need to repeat the FP and backProp process many times in order to keep updating the weights.
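To see why small weights squash everything towards 0.5, here is a minimal sketch. It assumes a sigmoid transfer function and weights drawn uniformly from ±0.1; both are assumptions made for this illustration, and it uses a single layer rather than the full network.

```python
import numpy as np

def sigmoid(x):
    """Logistic transfer function (assumed here for illustration)."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
inputs = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)

# Hypothetical small random weights for a single 2-input, 1-output layer
small_weights = rng.uniform(-0.1, 0.1, size=(2, 1))

outputs = sigmoid(inputs @ small_weights)
print(outputs.ravel())
```

With weights this small, the weighted sums stay close to zero, and the sigmoid of a value near zero is close to 0.5 whatever the input was.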

Iterating

To contents

Iteration is very straightforward. We just tell our algorithm to repeat a maximum of maxIterations times or until the Error is below minError (whichever comes first). The weights are stored internally within NN, so every time we call the backProp method it uses the latest stored weights and doesn’t start again - the weights are only initialised once, upon creation of NN.

maxIterations = 100000
minError = 1e-5

for i in range(maxIterations + 1):
    Error = NN.backProp(Input, Target)
    if i % 2500 == 0:
        print("Iteration {0}\tError: {1:0.6f}".format(i,Error))
    if Error <= minError:
        print("Minimum error reached at iteration {0}".format(i))
        break

Here’s the end of my output from the first run:

Iteration 100000	Error: 0.000291
Input 	Output 		Target
[0 0]	 [ 0.00780385] 	[ 0.]
[1 1]	 [ 0.00992829] 	[ 0.]
[0 1]	 [ 0.99189799] 	[ 1.]
[1 0]	 [ 0.99189943] 	[ 1.]

Much better! The error is very small and the outputs are very close to the correct value. However, they’re not completely right. We can do better by implementing different activation functions, which we will do in the next tutorial.

Please let me know if anything is unclear, or there are mistakes. Let me know how you get on!

Surface Distance Function

Wed, 01 Mar 2017 19:27:27 +0000

Surface Distance measures are a good way of evaluating the accuracy of an image-segmentation if we already know the ground truth (GT). The problem is that there is no nicely packaged function in Python to do this directly. In this post, we’ll write a surface distance function in Python which uses numpy and scipy. It’ll help us to calculate Mean Surface Distance (MSD), Residual Mean Square Distance (RMS) and the Hausdorff Distance (HD).

Background

Recently, I have been doing a lot of segmentation evaluation - seeing how well a segmentation done by a machine compares with one that’s done manually, a ‘ground truth’ (GT). Traditionally, such verification is done by comparing the overlap between the two e.g. Dice Similarity Coefficient (DSC) [1]. There are a few different calculations that can be done (there’ll be a longer post on just that) and ‘surface distance’ calculations are one of them.

Method

For this calculation, we need to be able to find the outline of the segmentation and compare it to the outline of the GT. We can then take measurements of how far each segmentation pixel is from its corresponding pixel in the GT outline.

Let’s take a look at the maths. Surface distance metrics estimate the error between the outer surfaces $S$ and $S^{\prime}$ of the segmentations $X$ and $X^{\prime}$. The distance between a point $p$ on surface $S$ and the surface $S^{\prime}$ is given by the minimum of the Euclidean norm:

$$ d(p, S^{\prime}) = \min_{p^{\prime} \in S^{\prime}} \left|\left| p - p^{\prime} \right|\right|_{2} $$

Doing this for all pixels in the surface gives the total surface distance between $S$ and $S^{\prime}$, written $d(S, S^{\prime})$.
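To make the point-to-surface distance concrete, here is a brute-force sketch for a handful of points. The helper name point_to_surface_distance and the coordinates are invented for this example; it is not how the function later in this post does it.

```python
import numpy as np

def point_to_surface_distance(p, surface_points):
    """Smallest Euclidean distance from point p to any point in surface_points."""
    diffs = np.asarray(surface_points, dtype=float) - np.asarray(p, dtype=float)
    return np.sqrt((diffs ** 2).sum(axis=1)).min()

# Three made-up points standing in for the surface S'
S_prime = [(0, 0), (0, 3), (4, 0)]

# Distance from the point p = (3, 4) to the nearest point of S'
d = point_to_surface_distance((3, 4), S_prime)
print(d)
```

Here the three candidate distances are 5, $\sqrt{10}$ and $\sqrt{17}$, so the minimum is $\sqrt{10}$. Checking every pair like this gets slow for real images, which is why a distance transform is the better tool.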

Now I’ve seen MATLAB code that can do this, though often it’s not entirely accurate. Plus I wanted to do this calculation on-the-fly as part of my program which was written in Python. So I came up with this function:

import numpy as np
from scipy.ndimage import morphology

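def surfd(input1, input2, sampling=1, connectivity=1):
    # Make both inputs binary: any non-zero value becomes True
    input_1 = np.atleast_1d(input1.astype(bool))
    input_2 = np.atleast_1d(input2.astype(bool))

    # Kernel defining which neighbours count as connected
    conn = morphology.generate_binary_structure(input_1.ndim, connectivity)

    # Surface = segmentation minus its eroded version (XOR, as the arrays are boolean)
    S = input_1 ^ morphology.binary_erosion(input_1, conn)
    Sprime = input_2 ^ morphology.binary_erosion(input_2, conn)

    # Distance of every pixel to the nearest surface pixel
    dta = morphology.distance_transform_edt(~S, sampling)
    dtb = morphology.distance_transform_edt(~Sprime, sampling)

    # Surface distances in both directions, joined into one vector
    sds = np.concatenate([np.ravel(dta[Sprime != 0]), np.ravel(dtb[S != 0])])

    return sds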

Let’s go through it bit-by-bit. The function surfd is defined to take in four variables:

input1 - the segmentation that has been created. It can be a multi-class segmentation, but this function will make the image binary. We’ll talk about how to use this function on individual classes later.
input2 - the GT segmentation against which we wish to compare input1
sampling - the pixel resolution or pixel size. This is entered as an n-vector where n is equal to the number of dimensions in the segmentation i.e. 2D or 3D. The default value is 1 which means pixels (or rather voxels) are 1 x 1 x 1 mm in size.
connectivity - creates either a 2D (3 x 3) or 3D (3 x 3 x 3) matrix defining the neighbourhood around which the function looks for neighbouring pixels. Typically, this is defined as a six-neighbour kernel which is the default behaviour of this function.

First we’ll be making use of simple numpy operations, but we’ll also need the morphology module from scipy’s ndimage package. These are imported first. More information on this module can be found here

import numpy as np
from scipy.ndimage import morphology

The two inputs are checked for their size and made binary. Any non-zero value is made 1 (True).

    input_1 = np.atleast_1d(input1.astype(bool))
    input_2 = np.atleast_1d(input2.astype(bool))

We use the morphology.generate_binary_structure function, along with the number of dimensions of the segmentation, to create the kernel that will be used to detect the edges of the segmentations. This could be done just by hard-coding the kernel itself: [[0 0 0],[0 1 0],[0 0 0]; [0 1 0], [1 1 1], [0 1 0]; [0 0 0], [0 1 0], [0 0 0]]. This kernel ‘conn’ is supplied to the morphology.binary_erosion function which strips the outermost pixel from the edge of the segmentation. Subtracting this result from the segmentation itself leaves only the single-pixel-wide surface.

    conn = morphology.generate_binary_structure(input_1.ndim, connectivity)

    # XOR acts as the subtraction here because the arrays are boolean
    S = input_1 ^ morphology.binary_erosion(input_1, conn)
    Sprime = input_2 ^ morphology.binary_erosion(input_2, conn)
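Here is a small standalone sketch of the kernel and the erosion step on a toy 2D image. It calls the same functions through the top-level scipy.ndimage namespace, and the 7 x 7 image with a 5 x 5 square is made up for the example.

```python
import numpy as np
from scipy import ndimage

# 3D kernel with connectivity 1: the centre voxel plus its 6 face neighbours
conn3d = ndimage.generate_binary_structure(3, 1)
print(conn3d.sum())

# Toy 2D segmentation: a 5 x 5 square inside a 7 x 7 image
seg = np.zeros((7, 7), dtype=bool)
seg[1:6, 1:6] = True

# Erode with the 2D cross-shaped kernel, then keep only what was stripped off
conn2d = ndimage.generate_binary_structure(2, 1)
eroded = ndimage.binary_erosion(seg, conn2d)
border = seg ^ eroded

print(border.astype(int))
```

The erosion leaves the inner 3 x 3 block, so the border is the one-pixel-wide ring of 16 pixels around it.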

Next we again use the morphology module. This time we give the distance_transform_edt function our pixel-size (sampling) and also the inverted surface-image. The inversion is used such that the surface itself is given the value of zero i.e. any pixel at this location, will have zero surface-distance. The transform increases the value/error/penalty of the remaining pixels with increasing distance away from the surface.

Each pixel of the opposite segmentation-surface is then laid upon this ‘map’ of penalties and both results are concatenated into a vector which is as long as the combined number of pixels in the two surfaces. This vector of surface distances is returned. Note that this is technically the symmetric surface distance as we are not assuming that just doing this for one of the surfaces is enough. It may be that the distance between a pixel in A and in B is not the same as between the pixel in B and in A. i.e. $d(S, S^{\prime}) \neq d(S^{\prime}, S)$

    dta = morphology.distance_transform_edt(~S,sampling)
    dtb = morphology.distance_transform_edt(~Sprime,sampling)
    
    sds = np.concatenate([np.ravel(dta[Sprime!=0]), np.ravel(dtb[S!=0])])
        
    return sds
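To show what the distance transform hands back, here is a tiny sketch on a single row of pixels. The arrays and the pixel size of 1 x 2 mm are made up for illustration, and the function is called through the top-level scipy.ndimage namespace.

```python
import numpy as np
from scipy import ndimage

# A 1 x 5 'image' whose only surface pixel is the first one
surface = np.zeros((1, 5), dtype=bool)
surface[0, 0] = True

# Invert so the surface is zero, then measure the distance to it.
# Pixels are assumed to be 2 mm wide along the second axis.
penalty_map = ndimage.distance_transform_edt(~surface, sampling=[1.0, 2.0])
print(penalty_map)

# Lay another (made-up) surface on top of the map and read off its distances
other_surface = np.zeros((1, 5), dtype=bool)
other_surface[0, 3] = True
print(penalty_map[other_surface])
```

The map reads 0, 2, 4, 6, 8 mm along the row, so the other surface pixel, three pixels away, picks up a distance of 6 mm.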

How is it used?

The function example below takes two segmentations (which both have multiple classes). The sampling vector is a typical pixel-size from an MRI scan and the 1 indicates I’d like a 6 neighbour (cross-shaped) kernel for finding the edges.

    surface_distance = surfd(test_seg, GT_seg, [1.25, 1.25, 10],1)

By specifying the value of the voxel-label I’m interested in (assuming we’re talking about classes which are contiguous and not spread out), we can find the surface accuracy of that class.

    # Pass boolean masks of the chosen label so the image shape is preserved
    surface_distance = surfd(test_seg == 1, GT_seg == 1,
                             [1.25, 1.25, 10], 1)

What do the results mean?

The returned surface distances can be used to calculate:

Mean Surface Distance (MSD) - the mean of the vector is taken. This tells us how much, on average, the surface varies between the segmentation and the GT (in mm).

$$ \text{MSD} = \frac{1}{n_{S} + n_{S^{\prime}}} \left( \sum_{p = 1}^{n_{S}} d(p, S^{\prime}) + \sum_{p^{\prime}=1}^{n_{S^{\prime}}} d(p^{\prime}, S) \right) $$

Residual Mean Square Distance (RMS) - each distance in the vector is squared (which also removes any negative signs), the squares are averaged and then the square-root is taken. Measured in mm.

$$ \text{RMS} = \sqrt{\frac{1}{n_{S} + n_{S^{\prime}}} \left( \sum_{p = 1}^{n_{S}} d(p, S^{\prime})^{2} + \sum_{p^{\prime}=1}^{n_{S^{\prime}}} d(p^{\prime}, S)^{2} \right) }\ $$

Hausdorff Distance (HD) - the maximum of the vector, i.e. the largest distance from a point on one surface to the nearest point on the other. Also measured in mm. We calculate the symmetric Hausdorff distance as:

$$\text{HD} = \max \left[ d(S, S^{\prime}) , d(S^{\prime}, S) \right]$$

Or in Python:

    msd = surface_distance.mean()
    rms = np.sqrt((surface_distance**2).mean())
    hd  = surface_distance.max()
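Putting it all together, here is a self-contained sketch on two toy label images. The function surfd_sketch is a compact re-statement of the steps above written for this example (it uses boolean XOR for the border and the top-level scipy.ndimage namespace), and the two 12 x 12 images are made up.

```python
import numpy as np
from scipy import ndimage

def surfd_sketch(seg_a, seg_b, sampling=1, connectivity=1):
    """Symmetric surface distances between two binary masks (illustrative sketch)."""
    a = np.atleast_1d(np.asarray(seg_a).astype(bool))
    b = np.atleast_1d(np.asarray(seg_b).astype(bool))
    conn = ndimage.generate_binary_structure(a.ndim, connectivity)
    surf_a = a ^ ndimage.binary_erosion(a, conn)
    surf_b = b ^ ndimage.binary_erosion(b, conn)
    dt_a = ndimage.distance_transform_edt(~surf_a, sampling)
    dt_b = ndimage.distance_transform_edt(~surf_b, sampling)
    return np.concatenate([dt_a[surf_b], dt_b[surf_a]])

# Toy label images: the same 6 x 6 square of label 1, shifted by two columns
gt_seg = np.zeros((12, 12), dtype=int)
gt_seg[2:8, 2:8] = 1
test_seg = np.zeros((12, 12), dtype=int)
test_seg[2:8, 4:10] = 1

# Compare label 1 only, with 1 x 1 pixels
sds = surfd_sketch(test_seg == 1, gt_seg == 1, sampling=[1.0, 1.0])

msd = sds.mean()
rms = np.sqrt((sds ** 2).mean())
hd = sds.max()
print(msd, rms, hd)
```

For these two squares the HD comes out as 2 pixels (the size of the shift) and the MSD as 1 pixel, with the RMS sitting between the two because squaring weights the larger distances more heavily.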

The full function can be found here: surfaceDistanceFunction.py

References

[1] Dice, L. R. (1945). Measures of the Amount of Ecologic Association Between Species. Ecology, 26(3), 297–302. https://doi.org/10.2307/1932409