<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Machine Learning Notebook</title>
    <link>/index.xml</link>
    <description>Recent content on Machine Learning Notebook</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 04 Jan 2018 10:13:20 +0000</lastBuildDate>
    <atom:link href="/index.xml" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Data Augmentations for n-Dimensional Image Input to CNNs</title>
      <link>/post/dataaug/</link>
      <pubDate>Thu, 04 Jan 2018 10:13:20 +0000</pubDate>
      
      <guid>/post/dataaug/</guid>
      <description>&lt;p&gt;One of the greatest limiting factors for training effective deep learning frameworks is the availability, quality and organisation of the &lt;em&gt;training data&lt;/em&gt;. To perform well on classification tasks, we need to show our CNNs (and similar models) as many examples as we possibly can. However, this is not always possible, especially in situations where the training data is hard to collect, e.g. medical image data. In this post, we will learn how to apply &lt;em&gt;data augmentation&lt;/em&gt; strategies to n-dimensional images to get the most out of our limited number of examples.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;h2 id=&#34;intro&#34;&gt; Introduction &lt;/h2&gt;

&lt;p&gt;If we take any image, like our little Android below, and shift all of the data in the image to the right by a single pixel, you may struggle to see any difference visually. Numerically, however, this may as well be a completely different image! Imagine taking a stack of 10 of these images, each shifted by a single pixel compared to the previous one. Now consider the pixels at some arbitrary location, say [20, 25]. Focusing on that point, each image has a different colour there, a different average surrounding intensity and so on. A CNN takes these values into account when performing convolutions and deciding upon weights. If we supplied this set of 10 images to a CNN, we would effectively be teaching it to be invariant to these kinds of translations.&lt;/p&gt;
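&lt;p&gt;To see this numerically, here is a minimal sketch (a toy 5 x 5 ramp standing in for the Android image, not the post&amp;rsquo;s actual data): shifting every row one pixel to the right changes the value at most pixel positions even though the picture would look almost identical.&lt;/p&gt;

```python
import numpy as np

# Toy 5x5 "image" (a hypothetical stand-in for the Android picture)
img = np.arange(25, dtype=float).reshape(5, 5)

# Shift every row one pixel to the right, replicating the left edge
shifted = np.empty_like(img)
shifted[:, 1:] = img[:, :-1]
shifted[:, 0] = img[:, 0]

# Fraction of pixel positions whose value changed
print(np.mean(img != shifted))   # 0.8
```

Four out of every five pixels now hold a different value, which is why the network treats the shifted copy as genuinely new information.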

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34;  style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/android.jpg&#34; &gt;&lt;br&gt;
&lt;b&gt;Android&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/android1px.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Shifted 1 pixel right&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/android10px.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Shifted 10 pixels right&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Of course, translations are not the only way in which an image can change whilst still &lt;em&gt;visually&lt;/em&gt; being the same image. Consider rotating the image by a single degree, or 5 degrees: it&amp;rsquo;s still an Android. Training a CNN without including translated and rotated versions of the image may cause the CNN to &lt;strong&gt;overfit&lt;/strong&gt; and assume that all images of Androids have to be perfectly upright and centered.&lt;/p&gt;

&lt;p&gt;Providing deep learning frameworks with images that are translated, rotated, scaled, intensity-adjusted and flipped is what we mean when we talk about &lt;em&gt;data augmentation&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In this post we&amp;rsquo;ll look at how to apply these transformations to an image, even in 3D, and see how they affect the performance of a deep learning framework. We will use an image from &lt;em&gt;flickr&lt;/em&gt; user &lt;a href=&#34;https://www.flickr.com/photos/andy_emcee/6416366321&#34; title=&#34;Cat and Dog Image&#34;&gt;andy_emcee&lt;/a&gt; as an example of a 2D natural image. As this is an RGB (colour) image it has shape [512, 640, 3], one layer for each colour channel. We could take one layer to make this grayscale and truly 2D, but most images we deal with will be colour, so let&amp;rsquo;s leave it. For 3D we will use a 3D MRI scan.&lt;/p&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:49%; margin:auto;min-width:350px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34; height=300 src=&#34;/img/augmentation/naturalimg.jpg&#34;&gt;&lt;br&gt;
&lt;b&gt;RGB Image shape=[512, 640, 3]&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&#34;augs&#34;&gt; Augmentations &lt;/h2&gt;

&lt;p&gt;As usual, we are going to write our augmentation functions in python. We&amp;rsquo;ll just be using simple functions from &lt;code&gt;numpy&lt;/code&gt; and &lt;code&gt;scipy&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&#34;translate&#34;&gt; Translation &lt;/h3&gt;

&lt;p&gt;In our functions, &lt;code&gt;image&lt;/code&gt; is a 2D or 3D array. If it&amp;rsquo;s a 3D array, we need to be careful about specifying our translation directions in the argument called &lt;code&gt;offset&lt;/code&gt;. We don&amp;rsquo;t really want to move images in the &lt;code&gt;z&lt;/code&gt; direction for a couple of reasons. Firstly, if it&amp;rsquo;s a 2D image, the third dimension will be the colour channel; if we shift along this dimension we just push the colour channels out of place, corrupting the colours for shifts of &lt;code&gt;-2&lt;/code&gt; or &lt;code&gt;2&lt;/code&gt; and blanking the image entirely for anything larger. Secondly, in a full 3D image, the third dimension is often the smallest, e.g. in most medical scans. In our translation function below, the &lt;code&gt;offset&lt;/code&gt; is given as a length-2 array defining the shift in the &lt;code&gt;y&lt;/code&gt; and &lt;code&gt;x&lt;/code&gt; directions respectively (don&amp;rsquo;t forget index 0 is which horizontal row we&amp;rsquo;re at in python). We hard-code the z-direction shift to &lt;code&gt;0&lt;/code&gt;, but you&amp;rsquo;re welcome to change this if your use-case demands it. To ensure we get integer-pixel shifts, we enforce type &lt;code&gt;int&lt;/code&gt; too.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import scipy.ndimage

def translateit(image, offset, isseg=False):
    order = 0 if isseg else 5

    # shift only in y and x; the third (z / channel) axis stays fixed
    return scipy.ndimage.shift(image, (int(offset[0]), int(offset[1]), 0), order=order, mode=&#39;nearest&#39;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here we have also provided an option for what kind of interpolation to perform: &lt;code&gt;order = 0&lt;/code&gt; means to just use the nearest-neighbour pixel intensity, and &lt;code&gt;order = 5&lt;/code&gt; means to perform b-spline interpolation of order 5 (taking into account many pixels around the target). This is triggered with a Boolean argument to the &lt;code&gt;translateit&lt;/code&gt; function called &lt;code&gt;isseg&lt;/code&gt;, so named because when dealing with image-segmentations we want to keep their integer class numbers and not get a result which is a float with a value between two classes. This is not a problem with the actual image, as we want to retain as much visual smoothness as possible (though there is an argument that we&amp;rsquo;re introducing data which didn&amp;rsquo;t exist in the original image). Similarly, when we move our image, we leave a gap around the edges it has moved away from. We need a way to fill in this gap: by default &lt;code&gt;shift&lt;/code&gt; will use a constant value set to &lt;code&gt;0&lt;/code&gt;. This may not be helpful in some cases, so it&amp;rsquo;s best to set the &lt;code&gt;mode&lt;/code&gt; to &lt;code&gt;&#39;nearest&#39;&lt;/code&gt;, which takes the closest pixel-value and replicates it. It&amp;rsquo;s barely noticeable with small shifts but looks wrong at larger offsets, so we need to be careful and only apply small translations to our data.&lt;/p&gt;
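&lt;p&gt;A quick way to see the difference between the two fill modes is to shift a small ramp with &lt;code&gt;scipy.ndimage.shift&lt;/code&gt; directly (a sketch on a made-up array, assuming scipy is available):&lt;/p&gt;

```python
import numpy as np
from scipy.ndimage import shift

# 4x4 ramp: row 1 is [4, 5, 6, 7]
img = np.arange(16, dtype=float).reshape(4, 4)

# Shift two pixels right; the default fill is a constant 0.0
zero_fill = shift(img, (0, 2), order=0)

# mode='nearest' replicates the closest edge pixel into the gap
edge_fill = shift(img, (0, 2), order=0, mode='nearest')

print(zero_fill[1])   # [0. 0. 4. 5.]
print(edge_fill[1])   # [4. 4. 4. 5.]
```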

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34;  style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimg.jpg&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimgtrans5px.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Shifted 5 pixels right&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimgtrans25px.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Shifted 25 pixels right&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimg.png&#34; &gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrseg.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image and Segmentation&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimgtrans1.png&#34;&gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrsegtrans1.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Shifted [-3, 1] pixels&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimgtrans2.png&#34;&gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrsegtrans2.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Shifted [4, -5] pixels&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;scale&#34;&gt; Scaling &lt;/h3&gt;

&lt;p&gt;When scaling an image, i.e. zooming in and out, we want to increase or decrease the area our image takes up whilst keeping the image dimensions the same. We scale our image by a certain &lt;code&gt;factor&lt;/code&gt;: a &lt;code&gt;factor &amp;gt; 1.0&lt;/code&gt; means the image scales up, and &lt;code&gt;factor &amp;lt; 1.0&lt;/code&gt; scales the image down. Note that we should provide a factor for each dimension: if we want to keep the same number of layers or slices in our image, we should set the last value to &lt;code&gt;1.0&lt;/code&gt;. To determine the intensity of the resulting image at each pixel, we take the lattice (grid) on which each pixel sits and use it to perform &lt;em&gt;interpolation&lt;/em&gt; of the surrounding pixel intensities. &lt;code&gt;scipy&lt;/code&gt; provides a handy function for this called &lt;code&gt;zoom&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;The definition is probably more complex than one would think:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def scaleit(image, factor, isseg=False):
    order = 0 if isseg else 3

    height, width, depth = image.shape
    zheight             = int(np.round(factor * height))
    zwidth              = int(np.round(factor * width))
    zdepth              = depth

    if factor &amp;lt; 1.0:
        newimg  = np.zeros_like(image)
        row     = (height - zheight) // 2
        col     = (width - zwidth) // 2
        layer   = (depth - zdepth) // 2
        newimg[row:row+zheight, col:col+zwidth, layer:layer+zdepth] = interpolation.zoom(image, (float(factor), float(factor), 1.0), order=order, mode=&#39;nearest&#39;)[0:zheight, 0:zwidth, 0:zdepth]

        return newimg

    elif factor &amp;gt; 1.0:
        row     = (zheight - height) // 2
        col     = (zwidth - width) // 2
        layer   = (zdepth - depth) // 2

        newimg = interpolation.zoom(image[row:row+zheight, col:col+zwidth, layer:layer+zdepth], (float(factor), float(factor), 1.0), order=order, mode=&#39;nearest&#39;)  
        
        extrah = (newimg.shape[0] - height) // 2
        extraw = (newimg.shape[1] - width) // 2
        extrad = (newimg.shape[2] - depth) // 2
        newimg = newimg[extrah:extrah+height, extraw:extraw+width, extrad:extrad+depth]

        return newimg

    else:
        return image
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are three possibilities to consider: scaling up, scaling down, or no scaling. In each case, we want to return an array that is &lt;em&gt;equal in size&lt;/em&gt; to the input &lt;code&gt;image&lt;/code&gt;. For the scaling-down case, this involves making a blank image the same shape as the input and finding the corresponding box in the resulting scaled image. For scaling up, it&amp;rsquo;s unnecessary to perform the scaling on the whole image, just the portion that will be &amp;lsquo;zoomed&amp;rsquo;, so we pass only part of the array to the &lt;code&gt;zoom&lt;/code&gt; function. There may also be some error in the final shape due to rounding, so we trim the extra rows and columns before passing it back. When no scaling is done, we just return the original image.&lt;/p&gt;
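&lt;p&gt;The shape-preserving trick in the scale-up branch can be sketched in a few lines with &lt;code&gt;zoom&lt;/code&gt; on a random toy volume (the sizes here are arbitrary, not from the post): zoom in, then trim the extra border symmetrically so the output matches the input shape.&lt;/p&gt;

```python
import numpy as np
from scipy.ndimage import zoom

img = np.random.rand(40, 40, 3)

# Zoom in by 25% in-plane, leaving the channel axis untouched
big = zoom(img, (1.25, 1.25, 1.0), order=1, mode='nearest')

# Trim the extra border symmetrically, as scaleit does when factor is above 1.0
eh = (big.shape[0] - 40) // 2
ew = (big.shape[1] - 40) // 2
cropped = big[eh:eh + 40, ew:ew + 40, :]

print(big.shape, cropped.shape)   # (50, 50, 3) (40, 40, 3)
```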

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34;  style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimg.jpg&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimgscale075.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Scale-factor 0.75&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimgscale125.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Scale-factor 1.25&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimg.png&#34; &gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrseg.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image and Segmentation&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimgscale1.png&#34;&gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrsegscale1.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Scale-factor 1.07&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimgscale2.png&#34;&gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrsegscale2.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Scale-factor 0.95&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#39;resample&#39;&gt; Resampling &lt;/h3&gt;

&lt;p&gt;It may be the case that we want to change the dimensions of our image such that they fit nicely into the input of our CNN. For example, most images and photographs have one dimension larger than the other or may be of different resolutions. This may not be the case in our training set, but most CNNs prefer to have inputs that are square and of identical sizes. We can use the same &lt;code&gt;scipy&lt;/code&gt; function &lt;code&gt;interpolation.zoom&lt;/code&gt; to do this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def resampleit(image, dims, isseg=False):
    order = 0 if isseg else 5

    image = interpolation.zoom(image, np.array(dims)/np.array(image.shape, dtype=np.float32), order=order, mode=&#39;nearest&#39;)

    if image.shape[-1] == 3: #rgb image
        return image
    else:
        return image if isseg else (image-image.min())/(image.max()-image.min()) 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The key part here is that we&amp;rsquo;ve replaced the &lt;code&gt;factor&lt;/code&gt; argument with &lt;code&gt;dims&lt;/code&gt; of type &lt;code&gt;list&lt;/code&gt;. &lt;code&gt;dims&lt;/code&gt; should have length equal to the number of dimensions of our image i.e. 2 or 3. We are calculating the factor that each dimension needs to change by in order to change the image to the target &lt;code&gt;dims&lt;/code&gt;. We&amp;rsquo;ve forced the denominator of the scaling factor to be of type &lt;code&gt;float&lt;/code&gt; so that the resulting factor is also &lt;code&gt;float&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In this step, we are also changing the intensities of the image to use the full range from &lt;code&gt;0.0&lt;/code&gt; to &lt;code&gt;1.0&lt;/code&gt;. This ensures that all of our image intensities fall over the same range - one fewer thing for the network to be biased against. Again, note that we don&amp;rsquo;t want to do this for our segmentations, as the pixel &amp;lsquo;intensities&amp;rsquo; are actually labels. We could do this in a separate function, but I want this to happen to all of my images at this point. There&amp;rsquo;s no difference in the visual display of the images because they are automatically rescaled to use the full range of display colours.&lt;/p&gt;
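&lt;p&gt;The per-dimension factors and the rescaling to the [0.0, 1.0] range can be checked on a made-up volume (the sizes below are illustrative only):&lt;/p&gt;

```python
import numpy as np
from scipy.ndimage import zoom

vol = np.random.rand(25, 30, 4) * 1000.0   # arbitrary intensity range
dims = [56, 56, 4]

# One zoom factor per dimension, float division as in resampleit
factors = np.array(dims) / np.array(vol.shape, dtype=np.float32)
res = zoom(vol, factors, order=1, mode='nearest')

# Map intensities onto [0.0, 1.0] (skipped for segmentations)
res = (res - res.min()) / (res.max() - res.min())

print(res.shape, res.min(), res.max())
```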

&lt;h3 id=&#34;rotate&#34;&gt; Rotation &lt;/h3&gt;

&lt;p&gt;This function utilises another &lt;code&gt;scipy&lt;/code&gt; function called &lt;code&gt;rotate&lt;/code&gt;. It takes a &lt;code&gt;float&lt;/code&gt; for the &lt;code&gt;theta&lt;/code&gt; argument which specifies the number of degrees of the rotation (negative numbers rotate anti-clockwise). We want the returned image to be of the same shape as the input &lt;code&gt;image&lt;/code&gt;, so &lt;code&gt;reshape = False&lt;/code&gt; is used. Again we need to specify the &lt;code&gt;order&lt;/code&gt; of the interpolation on the new lattice. The rotate function handles 3D images by rotating each slice by the same &lt;code&gt;theta&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def rotateit(image, theta, isseg=False):
    order = 0 if isseg else 5
        
    return rotate(image, float(theta), reshape=False, order=order, mode=&#39;nearest&#39;)
&lt;/code&gt;&lt;/pre&gt;
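&lt;p&gt;A quick sanity check (on a random array, purely illustrative) that &lt;code&gt;reshape=False&lt;/code&gt; keeps the output the same shape as the input:&lt;/p&gt;

```python
import numpy as np
from scipy.ndimage import rotate

img = np.random.rand(40, 50, 3)

# reshape=False clips the rotated corners so the shape is unchanged
rot = rotate(img, 10.0, reshape=False, order=1, mode='nearest')

print(rot.shape)   # (40, 50, 3)
```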

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34;  style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimg.jpg&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimgrotate-10.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Theta = -10.0 &lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimgrotate10.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Theta = 10.0&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimg.png&#34; &gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrseg.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image and Segmentation&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimgrotate1.png&#34;&gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrsegrotate1.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Theta = 6.18&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimgrotate2.png&#34;&gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrsegrotate2.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Theta = -1.91&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;intensify&#34;&gt; Intensity Changes &lt;/h3&gt;

&lt;p&gt;The final augmentation we can perform is a scaling of the intensity of the pixels. This effectively brightens or dims the image by applying a blanket multiplication across all pixels. We specify the amount by a factor: &lt;code&gt;factor &amp;lt; 1.0&lt;/code&gt; will dim the image, and &lt;code&gt;factor &amp;gt; 1.0&lt;/code&gt; will brighten it. Note that we don&amp;rsquo;t want &lt;code&gt;factor = 0.0&lt;/code&gt; as this will blank the image.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def intensifyit(image, factor):

    return image*float(factor)
&lt;/code&gt;&lt;/pre&gt;
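&lt;p&gt;Because the change is a plain multiplication, its effect is easy to verify on a uniform toy image (the values below are made up):&lt;/p&gt;

```python
import numpy as np

img = np.full((2, 2), 100.0)

brighter = img * 1.2   # every pixel becomes 120.0
dimmer   = img * 0.8   # every pixel becomes 80.0
blank    = img * 0.0   # factor 0.0 wipes the image

print(brighter[0, 0], dimmer[0, 0], blank.max())
```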

&lt;h3 id=&#34;flip&#34;&gt; Flipping &lt;/h3&gt;

&lt;p&gt;One of the most common image augmentation procedures for natural images (dogs, cats, landscapes etc.) is flipping. The premise is that a dog is a dog no matter which way it&amp;rsquo;s facing, and it doesn&amp;rsquo;t matter whether a tree is on the right or the left of an image - it&amp;rsquo;s still a tree.&lt;/p&gt;

&lt;p&gt;We can do horizontal flipping (left-to-right) or vertical flipping (up-and-down). It may make sense to do only one of these (if we know that dogs don&amp;rsquo;t walk on their heads, for example). In this case, we can specify a &lt;code&gt;list&lt;/code&gt; of 2 boolean values: if each is &lt;code&gt;1&lt;/code&gt; then both flips are performed. We use the &lt;code&gt;numpy&lt;/code&gt; functions &lt;code&gt;fliplr&lt;/code&gt; and &lt;code&gt;flipud&lt;/code&gt; for these.&lt;/p&gt;

&lt;p&gt;As with resampling, the intensities are mapped onto the range of the display, so there won&amp;rsquo;t be a noticeable difference in the images. The maximum value for display is 255, so increasing intensities beyond this will just scale them back down.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def flipit(image, axes):
    
    if axes[0]:
        image = np.fliplr(image)
    if axes[1]:
        image = np.flipud(image)
    
    return image
&lt;/code&gt;&lt;/pre&gt;
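&lt;p&gt;On a tiny asymmetric array the two flips are easy to tell apart (a minimal sketch of what &lt;code&gt;flipit&lt;/code&gt; does internally):&lt;/p&gt;

```python
import numpy as np

img = np.array([[1, 2],
                [3, 4]])

lr = np.fliplr(img)   # columns reversed: [[2, 1], [4, 3]]
ud = np.flipud(img)   # rows reversed:    [[3, 4], [1, 2]]

print(lr.tolist(), ud.tolist())
```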

&lt;h3 id=&#34;cropping&#34;&gt; Cropping &lt;/h3&gt;

&lt;p&gt;This may be a very niche function, but it&amp;rsquo;s important in my case. Often in natural image processing, random crops are taken of the image in order to give patches - these patches often contain most of the image data, e.g. a 224 x 224 patch rather than the 299 x 299 image. This is just another way of showing the network a very similar but nonetheless different image. Central crops are also done. What&amp;rsquo;s different in my case is that I always want my segmentation to be fully visible in the image that I show to the network (I&amp;rsquo;m working with 3D cardiac MRI segmentations).&lt;/p&gt;

&lt;p&gt;So this function looks at the segmentation and creates a bounding box using the outermost pixels. We&amp;rsquo;re producing &amp;lsquo;square&amp;rsquo; crops with side-length equal to the width of the image (the shortest side not including the depth). In this case, the bounding box is created and, if necessary, the window is moved up and down the image to make sure the full segmentation is visible. It also makes sure that the output is always square in the case that the bounding box moves off the image array.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def cropit(image, seg=None, margin=5):

    fixedaxes = np.argmin(image.shape[:2])
    trimaxes  = 0 if fixedaxes == 1 else 1
    trim    = image.shape[fixedaxes]
    center  = image.shape[trimaxes] // 2

    # debug: print(image.shape, fixedaxes, trimaxes, trim, center)

    if seg is not None:

        hits = np.where(seg != 0)
        mins = np.min(hits, axis=1)
        maxs = np.max(hits, axis=1)

        if center - (trim // 2) &amp;gt; mins[0]:
            while center - (trim // 2) &amp;gt; mins[0]:
                center = center - 1
            center = center + margin

        if center + (trim // 2) &amp;lt; maxs[0]:
            while center + (trim // 2) &amp;lt; maxs[0]:
                center = center + 1
            center = center + margin
    
    top    = max(0, center - (trim //2))
    bottom = trim if top == 0 else center + (trim//2)

    if bottom &amp;gt; image.shape[trimaxes]:
        bottom = image.shape[trimaxes]
        top = image.shape[trimaxes] - trim
  
    if trimaxes == 0:
        image   = image[top: bottom, :, :]
    else:
        image   = image[:, top: bottom, :]

    if seg is not None:
        if trimaxes == 0:
            seg   = seg[top: bottom, :, :]
        else:
            seg   = seg[:, top: bottom, :]

        return image, seg
    else:
        return image
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that this function will work to square an image even when there is no segmentation given. We also have to be careful about which axes we take as the &amp;lsquo;fixed&amp;rsquo; length for the square and which one to trim.&lt;/p&gt;
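&lt;p&gt;The bounding box at the heart of &lt;code&gt;cropit&lt;/code&gt; comes from the coordinate extremes of the non-zero segmentation pixels. A minimal sketch on a hypothetical two-slice segmentation, taking per-dimension minima and maxima of the hit coordinates:&lt;/p&gt;

```python
import numpy as np

# Hypothetical 10x12x2 segmentation with a small labelled blob
seg = np.zeros((10, 12, 2))
seg[3:6, 4:7, :] = 1

hits = np.where(seg != 0)     # tuple of coordinate arrays (rows, cols, slices)
mins = np.min(hits, axis=1)   # outermost top-left corner
maxs = np.max(hits, axis=1)   # outermost bottom-right corner

print(mins.tolist(), maxs.tolist())   # [3, 4, 0] [5, 6, 1]
```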

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34;  style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimg.jpg&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;Natural Image Grayscale&#34; style=&#34;border: 2px solid black;&#34; height=300 src=&#34;/img/augmentation/naturalimgcrop.png&#34;&gt;&lt;br&gt;
&lt;b&gt; Cropped &lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimg.png&#34; &gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrseg.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Original Image and Segmentation&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;&#34;&gt;
&lt;img title=&#34;CMR Image&#34; height=300 src=&#34;/img/augmentation/cmrimgcrop.png&#34;&gt;
&lt;img title=&#34;CMR Segmentation&#34; height=300 src=&#34;/img/augmentation/cmrsegcrop.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Cropped&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&#34;application&#34;&gt; Application &lt;/h2&gt;

&lt;p&gt;We should be careful about how we apply our transformations. For example, if we apply multiple transformations to the same image, we need to make sure that we don&amp;rsquo;t apply &amp;lsquo;resampling&amp;rsquo; after &amp;lsquo;intensity changes&amp;rsquo;, because resampling will reset the range of the image, defeating the point of the intensification. However, as we will generally want our data to span the same range, wholesale intensity shifts are less often seen. We also want to make sure that we are not being overzealous with the augmentations - we need to set limits for our factors and other arguments.&lt;/p&gt;

&lt;p&gt;When I implement data augmentation, I put all of these transforms into one script which can be downloaded here: &lt;a href=&#34;/docs/transforms.py&#34; title=&#34;transforms.py&#34;&gt;&lt;code&gt;transforms.py&lt;/code&gt;&lt;/a&gt;. I then call the transforms that I want from another script.&lt;/p&gt;

&lt;p&gt;We create a set of cases, one for each transformation, which draw random (but controlled) parameters for our augmentations - remember, we don&amp;rsquo;t want anything too extreme. We don&amp;rsquo;t want to apply all of these transformations every time, so we also create an array of random length (the number of transformations) with randomly assigned elements (the transformations to apply).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;np.random.seed()
numTrans     = np.random.randint(1, 6, size=1) 
allowedTrans = [0, 1, 2, 3, 4]
whichTrans   = np.random.choice(allowedTrans, numTrans, replace=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We assign a new &lt;code&gt;random.seed&lt;/code&gt; every time to ensure that each pass is different to the last. There are 5 possible transformations so &lt;code&gt;numTrans&lt;/code&gt; is a single random integer between 1 and 5. We then take a &lt;code&gt;random.choice&lt;/code&gt; of the &lt;code&gt;allowedTrans&lt;/code&gt; up to &lt;code&gt;numTrans&lt;/code&gt;. We don&amp;rsquo;t want to apply the same transformation more than once, so &lt;code&gt;replace=False&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;After some trial and error, I&amp;rsquo;ve found that the following parameters are good:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rotations - &lt;code&gt;theta&lt;/code&gt; $ \in [-10.0, 10.0] $ degrees&lt;/li&gt;
&lt;li&gt;scaling - &lt;code&gt;factor&lt;/code&gt; $ \in [0.9, 1.1] $ i.e. 10% zoom-in or zoom-out&lt;/li&gt;
&lt;li&gt;intensity - &lt;code&gt;factor&lt;/code&gt; $ \in [0.8, 1.2] $ i.e. 20% increase or decrease&lt;/li&gt;
&lt;li&gt;translation - &lt;code&gt;offset&lt;/code&gt; $ \in [-5, 5] $ pixels&lt;/li&gt;
&lt;li&gt;margin - I tend to set at either 5 or 10 pixels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For an image called &lt;code&gt;thisim&lt;/code&gt; and segmentation called &lt;code&gt;thisseg&lt;/code&gt;, the cases I use are:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;if 0 in whichTrans:
    theta   = float(np.around(np.random.uniform(-10.0,10.0, size=1), 2))
    thisim  = rotateit(thisim, theta)
    thisseg = rotateit(thisseg, theta, isseg=True) if withseg else np.zeros_like(thisim)

if 1 in whichTrans:
    scalefactor  = float(np.around(np.random.uniform(0.9, 1.1, size=1), 2))
    thisim  = scaleit(thisim, scalefactor)
    thisseg = scaleit(thisseg, scalefactor, isseg=True) if withseg else np.zeros_like(thisim)

if 2 in whichTrans:
    factor  = float(np.around(np.random.uniform(0.8, 1.2, size=1), 2))
    thisim  = intensifyit(thisim, factor)
    #no intensity change on segmentation

if 3 in whichTrans:
    axes    = list(np.random.choice(2, 1, replace=True))
    thisim  = flipit(thisim, axes+[0])
    thisseg = flipit(thisseg, axes+[0]) if withseg else np.zeros_like(thisim)

if 4 in whichTrans:
    offset  = list(np.random.randint(-5, 5, size=2))
    thisim  = translateit(thisim, offset)
    thisseg = translateit(thisseg, offset, isseg=True) if withseg else np.zeros_like(thisim)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In each case, a random set of parameters is found and passed to the transform functions. The image and segmentation are passed separately to each one. In my case, I only choose to flip horizontally by randomly choosing 0 or 1 and appending &lt;code&gt;[0]&lt;/code&gt; such that the transform ignores the second axis. We&amp;rsquo;ve also added a boolean variable called &lt;code&gt;withseg&lt;/code&gt;. When &lt;code&gt;True&lt;/code&gt; the segmentation is augmented, otherwise a blank image is returned.&lt;/p&gt;
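&lt;p&gt;As a concrete illustration of the flip case, a hypothetical &lt;code&gt;flipit&lt;/code&gt; might look like this (a sketch, not the exact helper from &lt;code&gt;transforms.py&lt;/code&gt;):&lt;/p&gt;

```python
import numpy as np

def flipit(image, axes):
    # axes is a list of 0/1 flags for the first two dimensions;
    # appending [0], as in the post, disables the flip on the second axis
    if axes[0]:
        image = np.flipud(image)
    if axes[1]:
        image = np.fliplr(image)
    return image

im = np.arange(6).reshape(2, 3)
print(flipit(im, [1, 0]))  # rows reversed: [[3, 4, 5], [0, 1, 2]]
```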

&lt;p&gt;Finally, we crop the image to make it square before resampling it to the desired &lt;code&gt;dims&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;thisim, thisseg = cropit(thisim, thisseg)
thisim          = resampleit(thisim, dims)
thisseg         = resampleit(thisseg, dims, isseg=True) if withseg else np.zeros_like(thisim)
&lt;/code&gt;&lt;/pre&gt;
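&lt;p&gt;The cropping step can be sketched as a simple centre-crop (a hypothetical stand-in for &lt;code&gt;cropit&lt;/code&gt;, which additionally trims the &lt;code&gt;margin&lt;/code&gt; discussed earlier):&lt;/p&gt;

```python
import numpy as np

def crop_to_square(im):
    # Trim the longer of the two in-plane axes so the image becomes square
    h, w = im.shape[0], im.shape[1]
    side = min(h, w)
    top  = (h - side) // 2
    left = (w - side) // 2
    return im[top:top + side, left:left + side]

im = np.zeros((240, 200, 8))
print(crop_to_square(im).shape)  # (200, 200, 8)
```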

&lt;p&gt;Putting this together in a script makes testing the augmenter easier: you can download the script &lt;a href=&#34;/docs/augmenter.py&#34; title=&#34;augmenter.py&#34;&gt;here&lt;/a&gt;. Some things in the code to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The script takes one mandatory argument (image filename) and an optional segmentation filename&lt;/li&gt;
&lt;li&gt;There&amp;rsquo;s a bit of error checking: can the files be loaded? Is the image RGB or fully 3D (i.e. is the third dimension greater than 3)?&lt;/li&gt;
&lt;li&gt;We specify the final image dimensions, [224, 224, 8] in this case&lt;/li&gt;
&lt;li&gt;We also declare some default values for the parameters so that we can&amp;hellip;&lt;/li&gt;
&lt;li&gt;&amp;hellip;print out the applied transformations and their parameters at the end&lt;/li&gt;
&lt;li&gt;There&amp;rsquo;s a definition for a &lt;code&gt;plotit&lt;/code&gt; function that just creates a 2 x 2 grid where the top two images are the originals and the bottom two are the augmented versions.&lt;/li&gt;
&lt;li&gt;There&amp;rsquo;s a commented out part which is what I used to save the images created in this post&lt;/li&gt;
&lt;/ul&gt;
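&lt;p&gt;The argument handling described in the first bullet can be sketched like this (a hypothetical fragment; the downloadable script may differ in detail):&lt;/p&gt;

```python
def parse_args(argv):
    # One mandatory image filename, one optional segmentation filename
    if len(argv) not in (2, 3):
        raise SystemExit('usage: augmenter.py image [segmentation]')
    image   = argv[1]
    seg     = argv[2] if len(argv) == 3 else None
    withseg = seg is not None
    return image, seg, withseg

print(parse_args(['augmenter.py', 'im.nii.gz', 'seg.nii.gz']))
# ('im.nii.gz', 'seg.nii.gz', True)
```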

&lt;p&gt;In a live setting where we want to do data-augmentation on the fly, we would essentially call this script with the filenames or image arrays to augment and create as many augmentations of the images as we wish. We&amp;rsquo;ll take a look at this as an example in the next post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edit: 15/05/2018&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Added a &lt;code&gt;sliceshift&lt;/code&gt; function to &lt;code&gt;transforms.py&lt;/code&gt;. This takes in a 3D image and randomly shifts a &lt;code&gt;fraction&lt;/code&gt; of the slices using our &lt;code&gt;translateit&lt;/code&gt; function (which I&amp;rsquo;ve also updated slightly). This allows us to simulate motion in medical images.&lt;/li&gt;
&lt;/ul&gt;</description>
    </item>
    
    <item>
      <title>Modifying the Terminal Prompt for Sanity</title>
      <link>/post/ps1terminal/</link>
      <pubDate>Tue, 08 Aug 2017 10:05:14 +0000</pubDate>
      
      <guid>/post/ps1terminal/</guid>
<description>&lt;p&gt;If you&amp;rsquo;re working with more than one computer at a time, then you&amp;rsquo;re probably using some form of remote access framework - most likely &lt;code&gt;ssh&lt;/code&gt;. This is common in machine learning where our scripts are run on some other host with more capabilities. In this post we&amp;rsquo;ll look at how to modify the terminal prompt layout and colours to give us the information we need at a glance: the current user; whether they&amp;rsquo;re &lt;code&gt;root&lt;/code&gt;; what computer we&amp;rsquo;re working on; what folder we&amp;rsquo;re in; and the time that the last command was given.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;p&gt;When we &lt;code&gt;ssh&lt;/code&gt; into another computer, the terminal prompt will most likely change. Often it becomes colourless (usually all-white text) and the structure may change based on the initial setup. I&amp;rsquo;ve often issued commands to the wrong computer because of this so it would be useful if we were able to clearly see which computer we&amp;rsquo;re working on at a glance.&lt;/p&gt;

&lt;p&gt;Many users don&amp;rsquo;t know that they can edit their terminal prompt &lt;em&gt;without root privileges&lt;/em&gt; to give them better indications of their user, host and location. This is done by editing the &lt;code&gt;PS1&lt;/code&gt; variable in the &lt;code&gt;~/.bashrc&lt;/code&gt; file. &lt;code&gt;~/.bashrc&lt;/code&gt; (where &lt;code&gt;~&lt;/code&gt; is the shortcut to our &lt;code&gt;/home/&amp;lt;username&amp;gt;&lt;/code&gt; folder and &lt;code&gt;.&lt;/code&gt; indicates a hidden file) is a set of commands that runs every time a new terminal window is opened. It controls much of how the terminal window functions and also holds &lt;code&gt;alias&lt;/code&gt; shortcuts for longer commands. We edit it with an editor like &lt;code&gt;nano&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;nano ~/.bashrc
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first thing we will do is to make sure that whenever we are in a terminal window (&lt;code&gt;ssh&lt;/code&gt; or otherwise) as the current user, we are seeing colours in the terminal - this is useful for certain text editors as well as the prompt. Find the line that is currently commented out, and uncomment it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# uncomment for a colored prompt, if the terminal has the capability; turned
# off by default to not distract the user: the focus in a terminal window
# should be on the output of commands, not on the prompt
force_color_prompt=yes
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now for the prompt. In this file, we need to find the line where the PS1 format is defined. PS1 is the name for the terminal prompt. It should be a couple of blocks after the &lt;code&gt;force_color_prompt&lt;/code&gt; variable.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;if [ &amp;quot;$color_prompt&amp;quot; = yes ]; then
    PS1=&#39;\A [\[\e[0;36m\]\u\[\e[0m\]@\[\e[1;36m\]\h\[\e[0m\]:\w]\$ &#39;
else
    PS1=&#39;\u@\h:\w\$ &#39;
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here you&amp;rsquo;ll see the difference that the &lt;code&gt;force_color_prompt&lt;/code&gt; variable makes: there is a lot more formatting code in the &lt;code&gt;true&lt;/code&gt; part of this &lt;code&gt;if&lt;/code&gt; block that adds color. The above formatting creates the prompt below on one of my machines:&lt;/p&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:100%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34; width=700px src=&#34;/img/ps1/exampleuser.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Example terminal prompt for regular user account&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;I&amp;rsquo;ll identify the different components here, but you can find a list of all of the possible elements that can be included &lt;a href=&#34;https://ss64.com/bash/syntax-prompt.html&#34; title=&#34;PS1 Prompt Variables&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;\A&lt;/code&gt; - the current time in &lt;code&gt;hh:mm&lt;/code&gt; format&lt;/li&gt;
&lt;li&gt;&lt;code&gt;\u&lt;/code&gt; - the current user&lt;/li&gt;
&lt;li&gt;&lt;code&gt;\h&lt;/code&gt; - the current host&lt;/li&gt;
&lt;li&gt;&lt;code&gt;\w&lt;/code&gt; - the current working directory&lt;/li&gt;
&lt;li&gt;&lt;code&gt;\$&lt;/code&gt; - the $ character (if it&amp;rsquo;s not escaped, the shell reads this as if it&amp;rsquo;s trying to find a variable as in &lt;code&gt;$PATH&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any characters which are not escaped (i.e. not preceded by a backslash &amp;lsquo;&lt;code&gt;\&lt;/code&gt;&amp;rsquo;) are printed as they appear, e.g. &lt;code&gt;@&lt;/code&gt; and &lt;code&gt;:&lt;/code&gt;. Assigning the PS1 variable the value &amp;lsquo;&lt;code&gt;\A \u@\h:\w\$&lt;/code&gt;&amp;rsquo; we get &amp;lsquo;&lt;code&gt;time user@host:/directory$&lt;/code&gt;&amp;rsquo; like this:&lt;/p&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:100%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34; width=700px src=&#34;/img/ps1/plainexample.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Example terminal prompt with no formatting&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;In order to get colors in the prompt, we need to surround our variables, e.g. &amp;lsquo;&lt;code&gt;\A&lt;/code&gt;&amp;rsquo;, with some (very ugly) specific syntax. Where we want the color to start, we write &amp;lsquo;&lt;code&gt;\[\e[0;XXm\]&lt;/code&gt;&amp;rsquo; and where we want to finish the colour and return to normal, we write &amp;lsquo;&lt;code&gt;\[\e[0m\]&lt;/code&gt;&amp;rsquo;. The &amp;lsquo;XX&amp;rsquo; in the first term is a 2-digit code that refers to a color. For example, to make the username green, we change the PS1 variable to this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;PS1=&#39;\A \[\e[0;32m\]\u\[\e[0m\]@\h:\w\$ &#39;
&lt;/code&gt;&lt;/pre&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:100%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34; width=700px src=&#34;/img/ps1/greenuser.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Example terminal prompt with green username&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;A list of colors and their respective numbers can be found &lt;a href=&#34;https://unix.stackexchange.com/a/124408&#34; title=&#34;PS1 Prompt Colors&#34;&gt;here&lt;/a&gt;. I choose green if we&amp;rsquo;re logged in as a regular user (as in green for go) but I choose red if the user is &lt;code&gt;root&lt;/code&gt;. This means I can always see at a glance if I should be careful with the commands that I write.&lt;/p&gt;
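&lt;p&gt;To avoid typing the escape syntax by hand each time, you could wrap it in a small shell function (a hypothetical helper, not part of the original setup):&lt;/p&gt;

```shell
# Wrap a PS1 element in the colour syntax described above
ps1_color() {
    printf '\\[\\e[0;%sm\\]%s\\[\\e[0m\\]' "$1" "$2"
}

# Rebuild the green-username prompt from this post
PS1="\A $(ps1_color 32 '\u')@\h:\w\\$ "
printf '%s\n' "$PS1"   # \A \[\e[0;32m\]\u\[\e[0m\]@\h:\w\$
```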

&lt;p&gt;You&amp;rsquo;ll also notice that we can change the &lt;em&gt;style&lt;/em&gt; of the font along with the color. I find this useful for making the &lt;code&gt;host&lt;/code&gt; stand out by making it bold. This is done by changing the &lt;code&gt;0&lt;/code&gt; before the &amp;lsquo;XX&amp;rsquo; color code to &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;if [ &amp;quot;$color_prompt&amp;quot; = yes ]; then
    PS1=&#39;\A [\[\e[0;31m\]\u\[\e[0m\]@\[\e[1;36m\]\h\[\e[0m\]:\w]\$ &#39;
else
    PS1=&#39;${debian_chroot:+($debian_chroot)}\u@\h:\w\$ &#39;
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:100%; margin:auto;min-width:325px;&#34;&gt;
&lt;img title=&#34;Natural Image RGB&#34; width=700px src=&#34;/img/ps1/exampleroot.png&#34; &gt;&lt;br&gt;
&lt;b&gt;Example terminal prompt for a `root` user&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;For my full PS1 variable, I have the colours, the bold host and I also added some square brackets (not escaped!) to make it a little more visually pleasing. You can change the &lt;code&gt;~/.bashrc&lt;/code&gt; file for each user on each computer. So if you have a regular user account &lt;em&gt;and&lt;/em&gt; a root account on the same machine, you can create a different PS1 for both by editing their respective files. So feel free to change colours and formats as you wish!&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Generative Adversarial Network (GAN) in TensorFlow - Part 5</title>
      <link>/post/GAN5/</link>
      <pubDate>Tue, 25 Jul 2017 11:07:22 +0100</pubDate>
      
      <guid>/post/GAN5/</guid>
      <description>&lt;p&gt;This is the final part in our series on Generative Adversarial Networks (GAN). We will write our training script and look at how to run the GAN. We will also take a look at the results we get out. Can you tell the difference between the real and generated faces?&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;h2 id=&#34;introduction&#34;&gt; Introduction &lt;/h2&gt;

&lt;p&gt;In this series we started out with a &lt;a href=&#34;/post/GAN1&#34; title=&#34;GAN - Part 1&#34;&gt;background to GAN&lt;/a&gt; including some of the mathematics behind them. We then downloaded and processed our &lt;a href=&#34;/post/GAN2&#34; title=&#34;GAN - Part 2&#34;&gt;dataset&lt;/a&gt;. In the subsequent posts, we wrote some &lt;a href=&#34;/post/GAN3&#34; title=&#34;GAN - Part 3&#34;&gt;image helper functions&lt;/a&gt; before completing some &lt;a href=&#34;/post/GAN4&#34; title=&#34;GAN - Part 4&#34;&gt;data processing functions&lt;/a&gt; and the &lt;a href=&#34;/post/GAN4&#34; title=&#34;GAN - Part 4&#34;&gt;GAN Class itself&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this final post, we will create the training script and visualise some of the results we get out.&lt;/p&gt;

&lt;h2 id=&#34;script&#34;&gt; Training Script &lt;/h2&gt;

&lt;p&gt;The training script is here: &lt;a href=&#34;/docs/GAN/gantut_trainer.py&#34; title=&#34;gantut_trainer.py&#34;&gt;&lt;code&gt;gantut_trainer.py&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s only short, so there isn&amp;rsquo;t anything to fill in, but let&amp;rsquo;s take a look. We need to make sure we import the GAN &lt;code&gt;class&lt;/code&gt; from our completed &lt;code&gt;gantut_gan.py&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: If you&amp;rsquo;re using the files called &lt;code&gt;gantut_*_complete.py&lt;/code&gt; you&amp;rsquo;ll need to modify this line (add the &lt;code&gt;_complete&lt;/code&gt;). Otherwise, just make sure it&amp;rsquo;s looking for the correctly named file where your GAN class is written.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;#!/usr/bin/python

import os
import numpy as np
import tensorflow as tf

from gantut_gan import DCGAN
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &amp;lsquo;shebang&amp;rsquo; on the first line allows us to call this script from the terminal without typing &lt;code&gt;python&lt;/code&gt; first. This is a useful line if you&amp;rsquo;re going to run this network on a cluster of computers where you will probably need to create your own python (or conda) virtual environment first. This line can then be changed to point to the specific python installation that you want to use to run the script.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: I&amp;rsquo;ll add this note here. The network &lt;em&gt;will&lt;/em&gt; take a long time to train. If you have access to a cluster, I recommend using it.&lt;/p&gt;

&lt;p&gt;Next, we define the possible &amp;lsquo;flags&amp;rsquo; or attributes that we need the network to take:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;#DEFINE THE FLAGS FOR RUNNING SCRIPT FROM THE TERMINAL
# ARG1 = NAME OF THE FLAG
# ARG2 = DEFAULT VALUE
# ARG3 = DESCRIPTION
flags = tf.app.flags
flags.DEFINE_integer(&amp;quot;epoch&amp;quot;, 20, &amp;quot;Number of epochs to train [20]&amp;quot;)
flags.DEFINE_float(&amp;quot;learning_rate&amp;quot;, 0.0002, &amp;quot;Learning rate for adam optimiser [0.0002]&amp;quot;)
flags.DEFINE_float(&amp;quot;beta1&amp;quot;, 0.5, &amp;quot;Momentum term for adam optimiser [0.5]&amp;quot;)
flags.DEFINE_integer(&amp;quot;train_size&amp;quot;, np.inf, &amp;quot;The size of training images [np.inf]&amp;quot;)
flags.DEFINE_integer(&amp;quot;batch_size&amp;quot;, 64, &amp;quot;The batch-size (number of images to train at once) [64]&amp;quot;)
flags.DEFINE_integer(&amp;quot;image_size&amp;quot;, 64, &amp;quot;The size of the images [n x n] [64]&amp;quot;)
flags.DEFINE_string(&amp;quot;dataset&amp;quot;, &amp;quot;lfw-aligned-64&amp;quot;, &amp;quot;Dataset directory.&amp;quot;)
flags.DEFINE_string(&amp;quot;checkpoint_dir&amp;quot;, &amp;quot;checkpoint&amp;quot;, &amp;quot;Directory name to save the checkpoints [checkpoint]&amp;quot;)
flags.DEFINE_string(&amp;quot;sample_dir&amp;quot;, &amp;quot;samples&amp;quot;, &amp;quot;Directory name to save the image samples [samples]&amp;quot;)
FLAGS = flags.FLAGS
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here, we&amp;rsquo;re using the &lt;code&gt;tf.app.flags&lt;/code&gt; module (which is a wrapper for &lt;code&gt;argparse&lt;/code&gt;) to take the arguments that trail the script name in the terminal and turn them into variables we can use in the network. The format for each argument is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;flags.DEFINE_datatype(name, default_value, description)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Where &lt;code&gt;datatype&lt;/code&gt; is what is expected (an integer, float, string etc.), &lt;code&gt;name&lt;/code&gt; is what the resulting variable will be called, &lt;code&gt;default_value&lt;/code&gt; is&amp;hellip; the default value in case it&amp;rsquo;s not explicitly defined at runtime, and &lt;code&gt;description&lt;/code&gt; is a useful descriptor of what this argument does. We package all these variables into one (called &lt;code&gt;FLAGS&lt;/code&gt;) that can be called later to assign values.&lt;/p&gt;
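&lt;p&gt;For readers who want to see the underlying mechanism, the same name/default/description pattern can be sketched with plain &lt;code&gt;argparse&lt;/code&gt; (a hypothetical equivalent, not the code used in this series):&lt;/p&gt;

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--epoch', type=int, default=20,
                    help='Number of epochs to train [20]')
parser.add_argument('--learning_rate', type=float, default=0.0002,
                    help='Learning rate for adam optimiser [0.0002]')

# Parsing an empty list uses the defaults, mimicking a run with no flags
FLAGS = parser.parse_args([])
print(FLAGS.epoch, FLAGS.learning_rate)  # 20 0.0002
```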

&lt;p&gt;Notice that the &lt;code&gt;name&lt;/code&gt; here is the same as those we wrote in the &lt;code&gt;__init__&lt;/code&gt; method of our GAN &lt;code&gt;class&lt;/code&gt; because these will be used to initialise the GAN.&lt;/p&gt;

&lt;p&gt;Our network will need folders to output to and also to check whether there&amp;rsquo;s an existing checkpoint that can be loaded (rather than doing it all over again).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;#CREATE SOME FOLDERS FOR THE DATA
if not os.path.exists(FLAGS.checkpoint_dir):
    os.makedirs(FLAGS.checkpoint_dir)
if not os.path.exists(FLAGS.sample_dir):
    os.makedirs(FLAGS.sample_dir)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Even though we&amp;rsquo;ve just defined some variables for our network, there are plenty of others in the Graph that need some default value. TensorFlow has a handy function for that:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# GET ALL OF THE OPTIONS FOR TENSORFLOW RUNTIME 
config = tf.ConfigProto(intra_op_parallelism_threads=8)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;: I&amp;rsquo;ve included the &lt;code&gt;intra_op_parallelism_threads&lt;/code&gt; argument to &lt;code&gt;tf.ConfigProto&lt;/code&gt; because TensorFlow has the power to take over as many cores as it can see when it&amp;rsquo;s running. This may not be a problem if you&amp;rsquo;re not using your machine too much, but if you&amp;rsquo;re running on a cluster, TF will ignore the &amp;lsquo;requested&amp;rsquo; number of cpus/gpus and leech into other cores. Setting &lt;code&gt;intra_op_parallelism_threads&lt;/code&gt; to the correct number of threads stops this from happening.&lt;/p&gt;

&lt;p&gt;Finally, we initialise the TensorFlow session (with our &lt;code&gt;config&lt;/code&gt; above), initialise the GAN and pass the flags to the &lt;code&gt;.train&lt;/code&gt; method of the GAN &lt;code&gt;class&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;: It is good to initialise the session in this way with &lt;code&gt;with&lt;/code&gt; because it will be automatically closed when the GAN training is finished.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;with tf.Session(config=config) as sess:
    #INITIALISE THE GAN BY CREATING A NEW INSTANCE OF THE DCGAN CLASS
    dcgan = DCGAN(sess, image_size=FLAGS.image_size, batch_size=FLAGS.batch_size,
                  is_crop=False, checkpoint_dir=FLAGS.checkpoint_dir)

    #TRAIN THE GAN
    dcgan.train(FLAGS)
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;training&#34;&gt; Training &lt;/h2&gt;

&lt;p&gt;This is it! 5 posts later and we can train our GAN. From our terminal, we are going to call the training script &lt;code&gt;gantut_trainer.py&lt;/code&gt; and pass it a couple of arguments:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;~/GAN/gantut_trainer.py --dataset ~/GAN/aligned --epoch 20
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Of course, if you&amp;rsquo;ve put your aligned training set somewhere else, make sure that path goes into the &lt;code&gt;--dataset&lt;/code&gt; flag. The other flags can be left at their defaults because that&amp;rsquo;s how we&amp;rsquo;ve written our GAN &lt;code&gt;class&lt;/code&gt;. Now, 20 epochs will take a seriously long time (it took me nearly 4 days using 12 cores on a cluster).&lt;/p&gt;

&lt;p&gt;There will be 3 folders of output from the GAN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;logs&lt;/code&gt; - where the logs from the training will be saved. These can be viewed with TensorBoard&lt;/li&gt;
&lt;li&gt;&lt;code&gt;checkpoints&lt;/code&gt; - where the model itself is saved&lt;/li&gt;
&lt;li&gt;&lt;code&gt;samples&lt;/code&gt; - this is where the image array we created in &lt;code&gt;gantut_imgfuncs.py&lt;/code&gt; will be output to every so often.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&#34;logs&#34;&gt; Logs &lt;/h3&gt;

&lt;p&gt;Whilst the network is training (if you&amp;rsquo;re doing it locally) you can pull up tensorboard and watch how the training is progressing. From the terminal:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;tensorboard --logdir=&amp;quot;~/GAN/logs&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Follow the link it spits out and you&amp;rsquo;ll be presented with a lot of information about the network. You will find graphs of the loss-functions under &amp;lsquo;scalars&amp;rsquo;, some examples from the generator under &amp;lsquo;images&amp;rsquo; and the Graph itself is nicely represented under &amp;lsquo;graph&amp;rsquo;. &amp;lsquo;Histograms&amp;rsquo; show how the distributions are changing over time. We can see in these that our noise distribution $p_{z}$ is uniform (which is what we defined) and that the real and fake images take values around &lt;code&gt;1&lt;/code&gt; and &lt;code&gt;0&lt;/code&gt; at the discriminator, as we also described in &lt;a href=&#34;/post/GAN1&#34; title=&#34;GAN - Part 1&#34;&gt;part 1&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Noise (z) Distribution&#34; width=30% src=&#34;/img/CNN/hist_z_1.png&#34;&gt;
        &lt;img title=&#34;Real Image Discriminator Distribution&#34; width=30% src=&#34;/img/CNN/hist_d.png&#34;&gt;
        &lt;img title=&#34;Fake Image Discriminator Distribution&#34; width=30% src=&#34;/img/CNN/hist_d_.png&#34;&gt;
                        
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 1&lt;/font&gt;: The distributions of (Left to right) the noise vectors $z$ and the real and fake images at the discriminator.
    &lt;/div&gt;
&lt;/div&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;TensorFlow Graph&#34; width=100% src=&#34;/img/CNN/graph.png&#34;&gt;
                        
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 2&lt;/font&gt;: The TensorFlow Graph that we build using our GAN `class`.
    &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;results&#34;&gt; Results &lt;/h3&gt;

&lt;p&gt;Here it is, the output from our GAN (after 14 epochs in this case) showing how well the network has learned to create faces. It may take longer than expected to load as I&amp;rsquo;ve tried to preserve quality.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;GAN Faces&#34; width=30% src=&#34;/img/CNN/faces_gif.gif&#34;&gt;
                        
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 3&lt;/font&gt;: The output of our GAN at the end of each epoch ending at epoch 14. (created at gifmaker.me).
        
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;We can see that some of the faces are still not quite there yet, but there are a few that are unbelievably realistic. In fact, we can perform a kind of &amp;lsquo;Turing Test&amp;rsquo; on this data. The &lt;a href=&#34;https://en.wikipedia.org/wiki/Turing_test&#34; title=&#34;wiki:Turing Test&#34;&gt;Turing Test&lt;/a&gt;, put simply, says that if a user is unable to &lt;em&gt;reliably&lt;/em&gt; tell the difference between a computer and a human performing the same task, then the computer has passed the Turing Test.&lt;/p&gt;

&lt;p&gt;Have a go at the test below: study each face, decide if it is a real or fake image; then click on the image to reveal the true result. If you only guess 50% or less, then the computer has passed this simplistic Turing Test.&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;a href=&#34;/docs/GAN/turing_quiz.html&#34; target=&#34;_blank&#34;&gt;Click Here for the Turing Test&lt;/a&gt;&lt;br&gt;(opens in a new window)&lt;/center&gt;&lt;/p&gt;

&lt;h2 id=&#34;conclusion&#34;&gt; Conclusion &lt;/h2&gt;

&lt;p&gt;So it looks great, but what was the point? Well, remember back to &lt;a href=&#34;/post/GAN1&#34; title=&#34;GAN - Post 1&#34;&gt;part 1&lt;/a&gt; - GANs and other generative networks are used for &lt;em&gt;image completion&lt;/em&gt;. We can use the fact that our network has learned what a face should look like to &amp;lsquo;fill in&amp;rsquo; any missing bits. Let&amp;rsquo;s say someone has a large tattoo across their face: we can reconstruct what the skin would look like without it. Or maybe we have an amazing photo, with a beautiful background, but we&amp;rsquo;re not smiling: the GAN can reconstruct a smile. More advanced work can include learning what glasses are and putting them onto other faces.&lt;/p&gt;

&lt;p&gt;Again, for credit, this series is based on the main code by &lt;a href=&#34;https://github.com/carpedm20/DCGAN-tensorflow&#34; title=&#34;carpedm20/DCGAN-tensorflow&#34;&gt;carpedm20&lt;/a&gt; and inspired from the blog of &lt;a href=&#34;http://bamos.github.io/2016/08/09/deep-completion/#ml-heavy-generative-adversarial-net-gan-building-blocks&#34; title=&#34;bamos.github.io&#34;&gt;B. Amos&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;GANs are powerful networks, but work in a relatively simple way by trying to trick a discriminator by generating more and more realistic-looking images.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Generative Adversarial Network (GAN) in TensorFlow - Part 4</title>
      <link>/post/GAN4/</link>
      <pubDate>Mon, 17 Jul 2017 09:37:58 +0100</pubDate>
      
      <guid>/post/GAN4/</guid>
<description>&lt;p&gt;Now that we&amp;rsquo;re able to import images into our network, we really need to build the GAN itself. This tutorial will build the GAN &lt;code&gt;class&lt;/code&gt; including the methods needed to create the generator and discriminator. We&amp;rsquo;ll also be looking at some of the data functions needed to make this work.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: This table of contents does not follow the order of the post; it is grouped by the methods in the GAN &lt;code&gt;class&lt;/code&gt; and the functions in &lt;code&gt;gantut_imgfuncs.py&lt;/code&gt;.&lt;/p&gt;

&lt;div id=&#34;toctop&#34;&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#intro&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#gan&#34;&gt;The GAN&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#datasetfiles&#34;&gt;dataset_files()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#dcgan&#34;&gt;GAN Class&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#init&#34;&gt;__init__()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#discriminator&#34;&gt;discriminator()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#generator&#34;&gt;generator()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#buildmodel&#34;&gt;build_model()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#save&#34;&gt;save()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#load&#34;&gt;load()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#train&#34;&gt;train()&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#batchnorm&#34;&gt;Data Functions&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#batchnorm&#34;&gt;batch_norm()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conv2d&#34;&gt;conv2d()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#relu&#34;&gt;relu()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#linear&#34;&gt;linear()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conv2dtrans&#34;&gt;conv2d_transpose()&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;intro&#34;&gt; Introduction &lt;/h2&gt;

&lt;p&gt;In the last tutorial, we built the functions in &lt;code&gt;gantut_imgfuncs.py&lt;/code&gt; which allow us to import data into our networks. The completed file is &lt;a href=&#34;/docs/GAN/gantut_imgfuncs_complete.py&#34; title=&#34;gantut_imgfuncs_complete.py&#34;&gt;here&lt;/a&gt;. In this tutorial we will be working on the final two code skeletons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_gan.py&#34; title=&#34;gantut_gan.py&#34;&gt;&lt;code&gt;gantut_gan.py&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_datafuncs.py&#34; title=&#34;gantut_datafuncs.py&#34;&gt;&lt;code&gt;gantut_datafuncs.py&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, let&amp;rsquo;s take a look at the various parts of our GAN in the &lt;code&gt;gantut_gan.py&lt;/code&gt; file and see what they&amp;rsquo;re going to do.&lt;/p&gt;

&lt;h2 id=&#34;gan&#34;&gt; The GAN &lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;re going to import a number of modules for this file including those from our own &lt;code&gt;gantut_datafuncs.py&lt;/code&gt; and &lt;code&gt;gantut_imgfuncs.py&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;from __future__ import division
import os
import time
import math
import itertools
from glob import glob
import tensorflow as tf
import numpy as np
from six.moves import xrange

#IMPORT OUR IMAGE AND DATA FUNCTIONS
from gantut_datafuncs import *
from gantut_imgfuncs import *
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;datasetfiles&#34;&gt; dataset_files() &lt;/h3&gt;

&lt;p&gt;The initial part of this file is a little housekeeping - ensuring that we are only dealing with supported filetypes. I liked this way of doing things in &lt;a href=&#34;http://bamos.github.io/2016/08/09/deep-completion/#ml-heavy-generative-adversarial-net-gan-building-blocks&#34; title=&#34;B. Amos&#34;&gt;B. Amos&amp;rsquo;s blog&lt;/a&gt;. We define accepted file-extensions and then return a list of all of the possible files we can use for training purposes. The &lt;code&gt;itertools.chain.from_iterable&lt;/code&gt; function is useful for creating a single &lt;code&gt;list&lt;/code&gt; of all of the files found in the folders and subfolders of a particular &lt;code&gt;root&lt;/code&gt; that have an appropriate &lt;code&gt;ext&lt;/code&gt;. Notice that it doesn&amp;rsquo;t really matter what we call the images, so this will work for all datasets.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;SUPPORTED_EXTENSIONS = [&amp;quot;png&amp;quot;, &amp;quot;jpg&amp;quot;, &amp;quot;jpeg&amp;quot;]

&amp;quot;&amp;quot;&amp;quot; Returns the list of all SUPPORTED image files in the directory
&amp;quot;&amp;quot;&amp;quot;
def dataset_files(root):
    return list(itertools.chain.from_iterable(
    glob(os.path.join(root, &amp;quot;*.{}&amp;quot;.format(ext))) for ext in SUPPORTED_EXTENSIONS))
&lt;/code&gt;&lt;/pre&gt;
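&lt;p&gt;We can sanity-check this helper in a throwaway directory (a quick self-contained test, not part of the tutorial files):&lt;/p&gt;

```python
import itertools
import os
import tempfile
from glob import glob

SUPPORTED_EXTENSIONS = ['png', 'jpg', 'jpeg']

def dataset_files(root):
    return list(itertools.chain.from_iterable(
        glob(os.path.join(root, '*.{}'.format(ext))) for ext in SUPPORTED_EXTENSIONS))

d = tempfile.mkdtemp()
for name in ['a.png', 'b.jpg', 'notes.txt']:
    open(os.path.join(d, name), 'w').close()

print(sorted(os.path.basename(f) for f in dataset_files(d)))  # ['a.png', 'b.jpg']
```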

&lt;hr&gt;

&lt;h3 id=&#34;dcgan&#34;&gt; DCGAN() &lt;/h3&gt;

&lt;p&gt;This is where the hard work begins. We&amp;rsquo;re going to build the DCGAN &lt;code&gt;class&lt;/code&gt; (i.e. Deep Convolutional Generative Adversarial Network). The skeleton code already has the necessary method names for our model, let&amp;rsquo;s have a look at what we&amp;rsquo;ve got to create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;__init__&lt;/code&gt;:  &amp;emsp;to initialise the model and set parameters&lt;/li&gt;
&lt;li&gt;&lt;code&gt;build_model&lt;/code&gt;: &amp;emsp;creates the model (or &amp;lsquo;graph&amp;rsquo; in TensorFlow-speak) by calling&amp;hellip;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;generator&lt;/code&gt;: &amp;emsp;defines the generator network&lt;/li&gt;
&lt;li&gt;&lt;code&gt;discriminator&lt;/code&gt;: &amp;emsp;defines the discriminator network&lt;/li&gt;
&lt;li&gt;&lt;code&gt;train&lt;/code&gt;: &amp;emsp;is called to begin the training of the network with data&lt;/li&gt;
&lt;li&gt;&lt;code&gt;save&lt;/code&gt;: &amp;emsp;saves the TensorFlow checkpoints of the GAN&lt;/li&gt;
&lt;li&gt;&lt;code&gt;load&lt;/code&gt;: &amp;emsp;loads the TensorFlow checkpoints of the GAN&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We create an instance of our GAN class with &lt;code&gt;DCGAN(args)&lt;/code&gt; and are returned a DCGAN object with the above methods. Let&amp;rsquo;s code.&lt;/p&gt;
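&lt;p&gt;As a sketch, the skeleton we are about to fill in looks something like this (the method bodies here are placeholders, not the real implementation):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;class DCGAN(object):
    def __init__(self, sess, image_size=64, **kwargs):
        self.sess = sess              # the TensorFlow session
        self.image_size = image_size  # parameters are stored on self

    def build_model(self): pass                       # assembles the graph
    def generator(self, z): pass                      # G: z -&amp;gt; fake image
    def discriminator(self, image, reuse=False): pass # D: image -&amp;gt; real/fake
    def train(self, config): pass                     # runs the training loop
    def save(self, checkpoint_dir, step): pass        # checkpoints the weights
    def load(self, checkpoint_dir): pass              # restores a checkpoint
&lt;/code&gt;&lt;/pre&gt;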

&lt;h4 id=&#34;init&#34;&gt; __init__() &lt;/h4&gt;

&lt;p&gt;To initialise our GAN object, we need some initial parameters. It looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def __init__(self, sess, image_size=64, is_crop=False, batch_size=64, sample_size=64, z_dim=100,
             gf_dim=64, df_dim=64, gfc_dim=1024, dfc_dim=1024, c_dim=3, checkpoint_dir=None, lam=0.1):
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The parameters are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;sess&lt;/code&gt;: &amp;emsp; the TensorFlow session to run in&lt;/li&gt;
&lt;li&gt;&lt;code&gt;image_size&lt;/code&gt;: &amp;emsp; the width of the images, which should equal the height since we want square inputs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;is_crop&lt;/code&gt;: &amp;emsp; whether to crop the images or leave them as they are&lt;/li&gt;
&lt;li&gt;&lt;code&gt;batch_size&lt;/code&gt;: &amp;emsp; number of images to use in each run&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sample_size&lt;/code&gt;: &amp;emsp; number of z samples to take on each run, should be equal to batch_size&lt;/li&gt;
&lt;li&gt;&lt;code&gt;z_dim&lt;/code&gt;: &amp;emsp; dimension of the random input vector &lt;em&gt;z&lt;/em&gt; fed to the generator&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gf_dim&lt;/code&gt;: &amp;emsp; dimension of generator filters in first conv layer&lt;/li&gt;
&lt;li&gt;&lt;code&gt;df_dim&lt;/code&gt;: &amp;emsp; dimension of discriminator filters in first conv layer&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gfc_dim&lt;/code&gt;: &amp;emsp; dimension of generator units for fully-connected layer&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dfc_dim&lt;/code&gt;: &amp;emsp; dimension of discriminator units for fully-connected layer&lt;/li&gt;
&lt;li&gt;&lt;code&gt;c_dim&lt;/code&gt;: &amp;emsp; number of image channels (gray=1, RGB=3)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;checkpoint_dir&lt;/code&gt;: &amp;emsp; where to store the TensorFlow checkpoints&lt;/li&gt;
&lt;li&gt;&lt;code&gt;lam&lt;/code&gt;: &amp;emsp;small constant weight for the sum of contextual and perceptual loss&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the controllable parameters for the GAN. As this is the initialising function, we need to transfer these inputs to the &lt;code&gt;self&lt;/code&gt; of the class so they are accessible later on. We will also add two new lines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Let&amp;rsquo;s add a check that the &lt;code&gt;image_size&lt;/code&gt; is a power of 2 (to make the convolutions work cleanly). The &amp;lsquo;bit-wise-and&amp;rsquo; operator &lt;code&gt;&amp;amp;&lt;/code&gt; will do the job for us: it exploits the fact that a power of 2 has exactly one bit set to &lt;code&gt;1&lt;/code&gt; and all others set to &lt;code&gt;0&lt;/code&gt;. Let&amp;rsquo;s also check that the image is at least $[8 \times 8]$ so we don&amp;rsquo;t convolve too far:&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Get the &lt;code&gt;image_shape&lt;/code&gt; which is the width and height of the image along with the number of channels (gray or RGB).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;#image_size must be power of 2 and 8+
assert(image_size &amp;amp; (image_size - 1) == 0 and image_size &amp;gt;= 8)

self.sess = sess
self.is_crop = is_crop
self.batch_size = batch_size
self.image_size = image_size
self.sample_size = sample_size
self.image_shape = [image_size, image_size, c_dim]

self.z_dim = z_dim
self.gf_dim = gf_dim
self.df_dim = df_dim        
self.gfc_dim = gfc_dim
self.dfc_dim = dfc_dim

self.lam = lam
self.c_dim = c_dim
&lt;/code&gt;&lt;/pre&gt;
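&lt;p&gt;The bit-wise check can be tried in isolation. A power of 2 has exactly one bit set, so subtracting 1 flips every bit below it and the &lt;code&gt;&amp;amp;&lt;/code&gt; of the two is zero (a small sketch, not part of the tutorial code):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def is_valid_size(n):
    # powers of 2: 64 = 0b1000000, 63 = 0b0111111, so 64 &amp;amp; 63 == 0
    return (n &amp;amp; (n - 1)) == 0 and n &amp;gt;= 8

# 8, 64 and 256 pass; 48 (not a power of 2) and 4 (too small) fail
&lt;/code&gt;&lt;/pre&gt;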

&lt;p&gt;Later on, we will want to do &amp;lsquo;batch normalisation&amp;rsquo; on our data to make sure none of our images are extremely different from the others. We will need a batch-norm layer for each of the conv layers in our generator and discriminator. We will initialise the layers here, but define them in our &lt;code&gt;gantut_datafuncs.py&lt;/code&gt; file shortly.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;#batchnorm (from funcs.py)
self.d_bns = [batch_norm(name=&#39;d_bn{}&#39;.format(i,)) for i in range(4)]

log_size = int(math.log(image_size) / math.log(2))
self.g_bns = [batch_norm(name=&#39;g_bn{}&#39;.format(i,)) for i in range(log_size)]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This shows that we will be using 4 layers in our discriminator. But we will need more in our generator: our generator starts with a simple vector &lt;em&gt;z&lt;/em&gt; and needs to upscale to the size of &lt;code&gt;image_size&lt;/code&gt;. It does this by a factor of 2 in each layer, thus $\log(\mathrm{image \ size})/\log(2)$ is equal to the number of upsamplings to be done, i.e. $2^{\mathrm{num \ of \ layers}} = 64$ in our case. Also note that we&amp;rsquo;ve created these objects (layers) with an iterator so that each has the name &lt;code&gt;g_bn0&lt;/code&gt;, &lt;code&gt;g_bn1&lt;/code&gt; etc.&lt;/p&gt;
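&lt;p&gt;We can sanity-check that layer count: for a power-of-2 &lt;code&gt;image_size&lt;/code&gt;, the base-2 logarithm is exactly the number of doublings needed. A small sketch (the &lt;code&gt;round&lt;/code&gt; and &lt;code&gt;bit_length&lt;/code&gt; cross-check are asides of ours, not the tutorial&amp;rsquo;s code; &lt;code&gt;round&lt;/code&gt; guards against the float ratio coming out a hair under the true value):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import math

image_size = 64
log_size = round(math.log(image_size) / math.log(2))

# 64 = 2**6, so six doublings are needed: 1 -&amp;gt; 2 -&amp;gt; 4 -&amp;gt; 8 -&amp;gt; 16 -&amp;gt; 32 -&amp;gt; 64
# integer-exact cross-check for powers of 2: image_size.bit_length() - 1 == log_size
&lt;/code&gt;&lt;/pre&gt;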

&lt;p&gt;To finish &lt;code&gt;__init__()&lt;/code&gt; we set the checkpoint directory for TensorFlow saves, instruct the class to build the model and name it &amp;lsquo;DCGAN.model&amp;rsquo;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;self.checkpoint_dir = checkpoint_dir
self.build_model()

self.model_name=&amp;quot;DCGAN.model&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h4 id=&#34;batchnorm&#34;&gt; batch_norm() &lt;/h4&gt;

&lt;p&gt;This is the first of our &lt;code&gt;gantut_datafuncs.py&lt;/code&gt; functions.&lt;/p&gt;

&lt;p&gt;If some of our images are very different from the others then the network will not learn the features correctly. To avoid this, we add batch normalisation (as described in &lt;a href=&#34;http://arxiv.org/abs/1502.03167&#34; title=&#34;Batch Normalization: Sergey Ioffe, Christian Szegedy&#34;&gt;Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift - Ioffe &amp;amp; Szegedy (2015)&lt;/a&gt;). We effectively redistribute the intensities of the images around a common mean with a set variance.&lt;/p&gt;

&lt;p&gt;This is a &lt;code&gt;class&lt;/code&gt; that will be instantiated with set parameters when called. Then, the method will perform batch normalisation whenever the object is called on the set of images &lt;code&gt;x&lt;/code&gt;. We are using TensorFlow&amp;rsquo;s built-in &lt;a href=&#34;https://www.tensorflow.org/api_docs/python/tf/contrib/layers/batch_norm&#34; title=&#34;tf.contrib.layers.batch_norm&#34;&gt;tf.contrib.layers.batch_norm()&lt;/a&gt; layer for this, which implements the method from the paper above.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Parameters&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;epsilon&lt;/code&gt;:    &amp;lsquo;small float added to variance [of the input data] to avoid division by 0&amp;rsquo;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;momentum&lt;/code&gt;:   &amp;lsquo;decay value for the moving average, usually 0.999, 0.99, 0.9&amp;rsquo;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;x&lt;/code&gt;:      the set of input images to be normalised&lt;/li&gt;
&lt;li&gt;&lt;code&gt;train&lt;/code&gt;:  whether or not the network is in training mode [True or False]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A batch_norm &amp;lsquo;object&amp;rsquo; on instantiation&lt;/li&gt;
&lt;li&gt;A tensor representing the output of the batch_norm operation&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot;Batch normalisation function to standardise the input
Initialises an object with all of the batch norm properties
When called, performs batch norm on input &#39;x&#39;
&amp;quot;&amp;quot;&amp;quot;
class batch_norm(object):
    def __init__(self, epsilon=1e-5, momentum = 0.9, name=&amp;quot;batch_norm&amp;quot;):
        with tf.variable_scope(name):
            self.epsilon = epsilon
            self.momentum = momentum

            self.name = name

    def __call__(self, x, train):
        return tf.contrib.layers.batch_norm(x, decay=self.momentum, updates_collections=None, epsilon=self.epsilon,
                                            center=True, scale=True, is_training=train, scope=self.name)
&lt;/code&gt;&lt;/pre&gt;
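&lt;p&gt;Outside TensorFlow, the normalisation itself is easy to see. Here is a minimal NumPy sketch of what the layer does at its core (the scale &lt;code&gt;gamma&lt;/code&gt; and shift &lt;code&gt;beta&lt;/code&gt; are hypothetical stand-ins for the learned &lt;code&gt;scale&lt;/code&gt;/&lt;code&gt;center&lt;/code&gt; parameters):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def batch_norm_np(x, gamma=1.0, beta=0.0, epsilon=1e-5):
    # standardise over the batch, then apply the learned scale and shift
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + epsilon) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=10.0, size=(64, 4))  # a batch with large mean/variance
y = batch_norm_np(x)
# y now has (approximately) zero mean and unit variance per feature
&lt;/code&gt;&lt;/pre&gt;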

&lt;hr&gt;

&lt;h4 id=&#34;discriminator&#34;&gt; discriminator() &lt;/h4&gt;

&lt;p&gt;As the discriminator is a simple &lt;a href=&#34;/post/CNN1&#34; title=&#34;MLNotebook: Convolutional Neural Network&#34;&gt;convolutional neural network (CNN)&lt;/a&gt; this will not take many lines. We will have to create a couple of wrapper functions that will perform the actual convolutions, but let&amp;rsquo;s get the method written in &lt;code&gt;gantut_gan.py&lt;/code&gt; first.&lt;/p&gt;

&lt;p&gt;We want our discriminator to check a real &lt;code&gt;image&lt;/code&gt;, save the variables and then use the same variables to check a fake &lt;code&gt;image&lt;/code&gt;. This way, if the images are fake but fool the discriminator, we know we&amp;rsquo;re on the right track. Thus we use the variable &lt;code&gt;reuse&lt;/code&gt; when calling the &lt;code&gt;discriminator()&lt;/code&gt; method - we will set it to &lt;code&gt;True&lt;/code&gt; when we&amp;rsquo;re using the fake images.&lt;/p&gt;

&lt;p&gt;We add &lt;code&gt;tf.variable_scope()&lt;/code&gt; to our functions so that when we visualise our graph in TensorBoard we can recognise the various pieces of our GAN.&lt;/p&gt;

&lt;p&gt;Next are the definitions of the 4 layers of our discriminator. Each one takes in the images, the kernel (filter) dimensions and has a name to identify it later on. Notice that we also call our &lt;code&gt;d_bns&lt;/code&gt; objects, the batch-norm objects that were set up during instantiation of the GAN. These act on the result of the convolution before it is passed through the non-linear &lt;code&gt;lrelu&lt;/code&gt; function. The last layer is just a &lt;code&gt;linear&lt;/code&gt; layer that outputs the unbounded results from the network.&lt;/p&gt;

&lt;p&gt;As this is a classification task (real or fake) we finish by returning the probabilities in the range $[0, 1]$ by applying the sigmoid function. The full output is also returned.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def discriminator(self, image, reuse=False):
    with tf.variable_scope(&amp;quot;discriminator&amp;quot;) as scope:
        if reuse:
            scope.reuse_variables()

        h0 = lrelu(conv2d(image, self.df_dim, name=&#39;d_h0_conv&#39;))
        h1 = lrelu(self.d_bns[0](conv2d(h0, self.df_dim*2, name=&#39;d_h1_conv&#39;), self.is_training))
        h2 = lrelu(self.d_bns[1](conv2d(h1, self.df_dim*4, name=&#39;d_h2_conv&#39;), self.is_training))
        h3 = lrelu(self.d_bns[2](conv2d(h2, self.df_dim*8, name=&#39;d_h3_conv&#39;), self.is_training))
        h4 = linear(tf.reshape(h3, [-1, 8192]), 1, &#39;d_h4_lin&#39;)

        return tf.nn.sigmoid(h4), h4
&lt;/code&gt;&lt;/pre&gt;
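&lt;p&gt;Where does the magic number 8192 in the reshape come from? Each stride-2 convolution halves the feature map, so with our assumed $[64 \times 64]$ input and &lt;code&gt;df_dim=64&lt;/code&gt; we can trace the shapes:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;image_size, df_dim = 64, 64

size = image_size
for _ in range(4):           # four stride-2 conv layers
    size //= 2               # 64 -&amp;gt; 32 -&amp;gt; 16 -&amp;gt; 8 -&amp;gt; 4

depth = df_dim * 8           # h3 has df_dim*8 = 512 feature maps
flat = size * size * depth   # 4 * 4 * 512 = 8192, the reshape size above
&lt;/code&gt;&lt;/pre&gt;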

&lt;p&gt;This method calls a couple of functions that we haven&amp;rsquo;t defined yet: &lt;code&gt;conv2d&lt;/code&gt;, &lt;code&gt;lrelu&lt;/code&gt; and &lt;code&gt;linear&lt;/code&gt;, so let&amp;rsquo;s do those now.&lt;/p&gt;

&lt;hr&gt;

&lt;h4 id=&#34;conv2d&#34;&gt; conv2d() &lt;/h4&gt;

&lt;p&gt;This function we&amp;rsquo;ve seen before in our &lt;a href=&#34;/post/CNN1&#34; title=&#34;MLNotebook: Convolutional Neural Networks&#34;&gt;CNN&lt;/a&gt; tutorial. We&amp;rsquo;ve defined the weights &lt;code&gt;w&lt;/code&gt; for each kernel with shape &lt;code&gt;[k_h x k_w x number of input channels x number of kernels]&lt;/code&gt;, not forgetting that separate weights are learned for each input channel. We&amp;rsquo;ve initialised these weights by randomly sampling from a truncated normal distribution with standard deviation &lt;code&gt;stddev&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The convolution is done by TensorFlow&amp;rsquo;s &lt;a href=&#34;https://www.tensorflow.org/api_docs/python/tf/nn/conv2d&#34; title=&#34;tf.nn.conv2d&#34;&gt;tf.nn.conv2d&lt;/a&gt; function using the weights &lt;code&gt;w&lt;/code&gt; we&amp;rsquo;ve already defined. The padding option &lt;code&gt;SAME&lt;/code&gt; pads the input so that no border pixels are lost; the output size is then simply the input size divided by the stride (with the default stride of 2 here, each layer halves the feature map). Biases are added (one per kernel, initialised at a constant value) before the result is returned.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;input_&lt;/code&gt;:     the input images (full batch)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;output_dim&lt;/code&gt;: the number of kernels/filters to be learned&lt;/li&gt;
&lt;li&gt;&lt;code&gt;k_h&lt;/code&gt;, &lt;code&gt;k_w&lt;/code&gt;:   height and width of the kernels to be learned&lt;/li&gt;
&lt;li&gt;&lt;code&gt;d_h&lt;/code&gt;, &lt;code&gt;d_w&lt;/code&gt;:   stride of the kernel horizontally and vertically&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stddev&lt;/code&gt;:     standard deviation for the normal func in weight-initialiser&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the convolved images for each kernel&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot;Defines how to perform the convolution for the discriminator,
i.e. traditional conv rather than reverse conv for the generator
&amp;quot;&amp;quot;&amp;quot;
def conv2d(input_, output_dim, k_h=5, k_w=5, d_h=2, d_w=2, stddev=0.02, name=&amp;quot;conv2d&amp;quot;):
    with tf.variable_scope(name):
        w = tf.get_variable(&#39;w&#39;, [k_h, k_w, input_.get_shape()[-1], output_dim],
                            initializer=tf.truncated_normal_initializer(stddev=stddev))
        conv = tf.nn.conv2d(input_, w, strides=[1, d_h, d_w, 1], padding=&#39;SAME&#39;)

        biases = tf.get_variable(&#39;biases&#39;, [output_dim], initializer=tf.constant_initializer(0.0))
        # conv = tf.reshape(tf.nn.bias_add(conv, biases), conv.get_shape())
        conv = tf.nn.bias_add(conv, biases)

        return conv 
&lt;/code&gt;&lt;/pre&gt;
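&lt;p&gt;Under &lt;code&gt;SAME&lt;/code&gt; padding the output size depends only on the stride: TensorFlow pads so that $\mathrm{output} = \lceil \mathrm{input} / \mathrm{stride} \rceil$. A quick sketch of that rule (a side calculation of ours, not tutorial code):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import math

def same_output_size(input_size, stride):
    # TensorFlow&#39;s SAME padding rule: output = ceil(input / stride)
    return math.ceil(input_size / stride)

# with the default stride of 2, each conv layer halves the feature map:
# same_output_size(64, 2) == 32; with stride 1 the size is preserved
&lt;/code&gt;&lt;/pre&gt;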

&lt;hr&gt;

&lt;h4 id=&#34;relu&#34;&gt; lrelu() &lt;/h4&gt;

&lt;p&gt;The network needs to be able to learn complex functions, so we add some non-linearity to the output of our convolution layers. We&amp;rsquo;ve seen this before in our tutorial on &lt;a href=&#34;/post/transfer_functions&#34; title=&#34;Transfer Functions&#34;&gt;transfer functions&lt;/a&gt;. Here we use the leaky rectified linear unit (lReLU).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Parameters&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;leak&lt;/code&gt;:   the &amp;lsquo;leakiness&amp;rsquo; of the lrelu&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;x&lt;/code&gt;: some data with a wide range&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the transformed input data&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot;Neural nets need this non-linearity to build complex functions
&amp;quot;&amp;quot;&amp;quot;
def lrelu(x, leak=0.2, name=&amp;quot;lrelu&amp;quot;):
    with tf.variable_scope(name):
        f1 = 0.5 * (1 + leak)
        f2 = 0.5 * (1 - leak)
        return f1 * x + f2 * abs(x)
&lt;/code&gt;&lt;/pre&gt;
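&lt;p&gt;The &lt;code&gt;f1&lt;/code&gt;/&lt;code&gt;f2&lt;/code&gt; form is just an algebraic rewrite of the usual piecewise definition $\max(x, \mathrm{leak} \cdot x)$: for positive $x$ the two terms sum to $x$, and for negative $x$ they sum to $\mathrm{leak} \cdot x$. A quick NumPy check:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def lrelu_np(x, leak=0.2):
    f1 = 0.5 * (1 + leak)
    f2 = 0.5 * (1 - leak)
    return f1 * x + f2 * np.abs(x)

x = np.linspace(-3, 3, 13)
# identical to the piecewise form: np.maximum(x, leak * x)
&lt;/code&gt;&lt;/pre&gt;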

&lt;hr&gt;

&lt;h4 id=&#34;linear&#34;&gt; linear() &lt;/h4&gt;

&lt;p&gt;This linear layer takes the outputs from the convolution and does a linear transform using some randomly initialised weights. This does not have the same non-linear property as the &lt;code&gt;lrelu&lt;/code&gt; function because we will use this output to calculate probabilities for classification. We return the result of &lt;code&gt;input_ x matrix&lt;/code&gt; by default, but if we need the weights as well, we also output &lt;code&gt;matrix&lt;/code&gt; and &lt;code&gt;bias&lt;/code&gt; through the &lt;code&gt;if&lt;/code&gt; statement.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Parameters&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;stddev&lt;/code&gt;:     standard deviation for weight initialiser&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bias_start&lt;/code&gt;: for the bias initialiser (constant value)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;with_w&lt;/code&gt;:     return the weight matrix (and biases) as well as the output if True&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;input_&lt;/code&gt;:         input data (shape is used to define weight/bias matrices)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;output_size&lt;/code&gt;:    desired output size of the linear layer&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot;For the final layer of the discriminator network to get the
full detail (probabilities etc.) from the output
&amp;quot;&amp;quot;&amp;quot;
def linear(input_, output_size, scope=None, stddev=0.02, bias_start=0.0, with_w=False):
    shape = input_.get_shape().as_list()

    with tf.variable_scope(scope or &amp;quot;Linear&amp;quot;):
        matrix = tf.get_variable(&amp;quot;Matrix&amp;quot;, [shape[1], output_size], tf.float32,
                                 tf.random_normal_initializer(stddev=stddev))
        bias = tf.get_variable(&amp;quot;bias&amp;quot;, [output_size],
            initializer=tf.constant_initializer(bias_start))
        if with_w:
            return tf.matmul(input_, matrix) + bias, matrix, bias
        else:
            return tf.matmul(input_, matrix) + bias
&lt;/code&gt;&lt;/pre&gt;
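&lt;p&gt;In NumPy terms the layer is a single matrix multiply plus bias; a shape sketch using our assumed discriminator sizes (illustration only):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

batch_size, in_dim, out_dim = 64, 8192, 1   # as in the discriminator&#39;s final layer

rng = np.random.default_rng(0)
input_ = rng.standard_normal((batch_size, in_dim))
matrix = 0.02 * rng.standard_normal((in_dim, out_dim))  # stddev=0.02 initialiser
bias = np.zeros(out_dim)                                # bias_start=0.0

output = input_ @ matrix + bias   # shape (64, 1): one logit per image
&lt;/code&gt;&lt;/pre&gt;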

&lt;hr&gt;

&lt;h4 id=&#34;generator&#34;&gt; generator() &lt;/h4&gt;

&lt;p&gt;Finally! We&amp;rsquo;re going to write the code for the generative part of the GAN. This method takes a single input: the randomly-sampled vector $z$ from the well-known distribution $p_z$.&lt;/p&gt;

&lt;p&gt;Remember that the generator is effectively a reverse discriminator in that it is a CNN that works backwards. Thus we start with the &amp;lsquo;values&amp;rsquo; and must perform the linear transformation on them before feeding them through the other layers of the network. As we do not know the weights or biases yet in this network, we need to make sure we output these from the linear layer with &lt;code&gt;with_w=True&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This first hidden layer &lt;code&gt;hs[0]&lt;/code&gt; needs reshaping to be the small image-shaped array that we can send through the network to become the upscaled $[64 \times 64]$ image at the end. So we take the linearly-transformed z-values and reshape to $[4 \times 4 \times \mathrm{num\_kernels}]$. Don&amp;rsquo;t forget the &lt;code&gt;-1&lt;/code&gt; to do this for all images in the batch. As before, we must batch-norm the result and pass it through the non-linearity.&lt;/p&gt;

&lt;p&gt;The number of layers in this network has been calculated earlier (using the logarithm ratio of image size to downsampling factor). We can therefore do the next part of the generator in a loop.&lt;/p&gt;

&lt;p&gt;In each loop/layer we are going to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;give the layer a name&lt;/li&gt;
&lt;li&gt;perform the &lt;em&gt;inverse&lt;/em&gt; convolution&lt;/li&gt;
&lt;li&gt;apply non-linearity&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;1 and 3 are self-explanatory, but the inverse convolution function still needs to be written. This is the function that will take in the small square image and upsample it to a larger image using some weights that are being learnt. We start at layer &lt;code&gt;i=1&lt;/code&gt; where we want the image to go to &lt;code&gt;size=8&lt;/code&gt; from &lt;code&gt;size=4&lt;/code&gt; at layer &lt;code&gt;i=0&lt;/code&gt;. This will increase by a factor of 2 at each layer. As with a regular CNN we want to learn fewer kernels on the larger images, so we need to decrease the &lt;code&gt;depth_mul&lt;/code&gt; by a factor of 2 at each layer. Note that the &lt;code&gt;while&lt;/code&gt; loop will terminate when the size gets to the size of the input images &lt;code&gt;image_size&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The final layer is added, which takes the last output and does the inverse convolution to get the final fake image (that will be tested with the discriminator).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def generator(self, z):
    with tf.variable_scope(&amp;quot;generator&amp;quot;) as scope:
        self.z_, self.h0_w, self.h0_b = linear(z, self.gf_dim*8*4*4, &#39;g_h0_lin&#39;, with_w=True)

        hs = [None]
        hs[0] = tf.reshape(self.z_, [-1, 4, 4, self.gf_dim * 8])
        hs[0] = tf.nn.relu(self.g_bns[0](hs[0], self.is_training))

        i = 1           #iteration number
        depth_mul = 8   #depth decreases as spatial component increases
        size = 8        #size increases as depth decreases

        while size &amp;lt; self.image_size:
            hs.append(None)
            name = &#39;g_h{}&#39;.format(i)
            hs[i], _, _ = conv2d_transpose(hs[i-1], [self.batch_size, size, size, self.gf_dim*depth_mul],
                                           name=name, with_w=True)
            hs[i] = tf.nn.relu(self.g_bns[i](hs[i], self.is_training))

            i += 1
            depth_mul //= 2
            size *= 2

        hs.append(None)
        name = &#39;g_h{}&#39;.format(i)
        hs[i], _, _ = conv2d_transpose(hs[i-1], [self.batch_size, size, size, 3], name=name, with_w=True)

        return tf.nn.tanh(hs[i])
&lt;/code&gt;&lt;/pre&gt;
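&lt;p&gt;Tracing the loop for our assumed $[64 \times 64]$ output makes the schedule concrete: three passes through the loop, then the final layer produces the 3-channel image:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;image_size = 64
schedule = []   # (feature-map size, depth multiplier) for each loop pass

i, depth_mul, size = 1, 8, 8
while size &amp;lt; image_size:
    schedule.append((size, depth_mul))
    i += 1
    depth_mul //= 2
    size *= 2

# schedule == [(8, 8), (16, 4), (32, 2)]; the final layer then outputs 64x64x3
&lt;/code&gt;&lt;/pre&gt;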

&lt;hr&gt;

&lt;h4 id=&#34;conv2dtrans&#34;&gt; conv2d_transpose() &lt;/h4&gt;

&lt;p&gt;The inverse convolution function looks very similar to the forward convolution function. We&amp;rsquo;ve had to make sure that different versions of TensorFlow work here - in newer versions, the correct function is located at &lt;a href=&#34;https://www.tensorflow.org/api_docs/python/tf/nn/conv2d_transpose&#34; title=&#34;tf.nn.conv2d_transpose&#34;&gt;tf.nn.conv2d_transpose&lt;/a&gt; whereas in older ones we must use &lt;code&gt;tf.nn.deconv2d&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;input_&lt;/code&gt;:         a vector (of noise) with dim=batch_size x z_dim&lt;/li&gt;
&lt;li&gt;&lt;code&gt;output_shape&lt;/code&gt;:   the final shape of the generated image&lt;/li&gt;
&lt;li&gt;&lt;code&gt;k_h&lt;/code&gt;, &lt;code&gt;k_w&lt;/code&gt;:       the height and width of the kernels&lt;/li&gt;
&lt;li&gt;&lt;code&gt;d_h&lt;/code&gt;, &lt;code&gt;d_w&lt;/code&gt;:       the stride of the kernel horiz and vert.&lt;br /&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an image (upscaled from the initial data)&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot;Deconv isn&#39;t an accurate word, but is a handy shortener,
so we&#39;ll use that. This is for the generator that has to make
the image from some randomly sampled data
&amp;quot;&amp;quot;&amp;quot;
def conv2d_transpose(input_, output_shape, k_h=5, k_w=5, d_h=2, d_w=2, stddev=0.02,
                     name=&amp;quot;conv2d_transpose&amp;quot;, with_w=False):
    with tf.variable_scope(name):
        w = tf.get_variable(&#39;w&#39;, [k_h, k_w, output_shape[-1], input_.get_shape()[-1]],
                            initializer=tf.random_normal_initializer(stddev=stddev))

        try:
            deconv = tf.nn.conv2d_transpose(input_, w, output_shape=output_shape,
                                strides=[1, d_h, d_w, 1])

        # Support for versions of TensorFlow before 0.7.0
        except AttributeError:
            deconv = tf.nn.deconv2d(input_, w, output_shape=output_shape,
                                strides=[1, d_h, d_w, 1])

        biases = tf.get_variable(&#39;biases&#39;, [output_shape[-1]], initializer=tf.constant_initializer(0.0))
        # deconv = tf.reshape(tf.nn.bias_add(deconv, biases), deconv.get_shape())
        deconv = tf.nn.bias_add(deconv, biases)

        if with_w:
            return deconv, w, biases
        else:
            return deconv    
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h4 id=&#34;buildmodel&#34;&gt; build_model() &lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;build_model()&lt;/code&gt; method brings together the image data and the generator and discriminator methods. This is the &amp;lsquo;graph&amp;rsquo; for TensorFlow to follow. It contains some &lt;code&gt;tf.placeholder&lt;/code&gt; pieces which we must feed values into when we finally train the model.&lt;/p&gt;

&lt;p&gt;We will need to know whether the model is in training or inference mode throughout our code, so we have a placeholder for that variable. We also need a placeholder for the image data itself because there will be a different batch of data being injected at each epoch. These are our &lt;code&gt;real_images&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When we inject the &lt;code&gt;z&lt;/code&gt; vectors into the GAN (served by another placeholder) we will also produce some monitoring output for TensorBoard. By adding &lt;code&gt;tf.summary.histogram()&lt;/code&gt; we are able to keep track of how the different &lt;code&gt;z&lt;/code&gt; vectors look at each epoch.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    def build_model(self):
        self.is_training = tf.placeholder(tf.bool, name=&#39;is_training&#39;)
        self.images = tf.placeholder(
            tf.float32, [None] + self.image_shape, name=&#39;real_images&#39;)

        # the low-resolution factor was not among the __init__ parameters,
        # so define it here (8 gives an 8x8 lowres image for a 64x64 input)
        self.lowres = 8
        self.lowres_size = self.image_size // self.lowres

        self.lowres_images = tf.reduce_mean(tf.reshape(self.images,
            [self.batch_size, self.lowres_size, self.lowres,
             self.lowres_size, self.lowres, self.c_dim]), [2, 4])
        self.z = tf.placeholder(tf.float32, [None, self.z_dim], name=&#39;z&#39;)
        self.z_sum = tf.summary.histogram(&amp;quot;z&amp;quot;, self.z)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next, let&amp;rsquo;s tell the graph to take the injected &lt;code&gt;z&lt;/code&gt; vector and turn it into an image with our &lt;code&gt;generator&lt;/code&gt;. We&amp;rsquo;ll also produce a lowres version of this image. Now, put the &amp;lsquo;real_images&amp;rsquo; into the &lt;code&gt;discriminator&lt;/code&gt;, which gives back our probabilities and the final-layer data (the logits). We then &lt;code&gt;reuse&lt;/code&gt; the same discriminator parameters to test the fake image from the generator. Here we also output some histograms of the probabilities of the &amp;lsquo;real_image&amp;rsquo; and the fake image. We will also output the current fake image from the generator to TensorBoard.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;        self.G = self.generator(self.z)
        self.lowres_G = tf.reduce_mean(tf.reshape(self.G,
            [self.batch_size, self.lowres_size, self.lowres,
             self.lowres_size, self.lowres, self.c_dim]), [2, 4])
        self.D, self.D_logits = self.discriminator(self.images)

        self.D_, self.D_logits_ = self.discriminator(self.G, reuse=True)

        self.d_sum = tf.summary.histogram(&amp;quot;d&amp;quot;, self.D)
        self.d__sum = tf.summary.histogram(&amp;quot;d_&amp;quot;, self.D_)
        self.G_sum = tf.summary.image(&amp;quot;G&amp;quot;, self.G)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now for some of the calculations needed to update the network. Let&amp;rsquo;s find the &amp;lsquo;loss&amp;rsquo; on the current outputs. We will utilise a very efficient loss function here: &lt;a href=&#34;https://www.tensorflow.org/api_docs/python/tf/nn/sigmoid_cross_entropy_with_logits&#34; title=&#34;tf.nn.sigmoid_cross_entropy_with_logits&#34;&gt;tf.nn.sigmoid_cross_entropy_with_logits&lt;/a&gt;. We want to calculate a few things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;how well the discriminator did at letting &lt;em&gt;true&lt;/em&gt; images through (i.e. comparing &lt;code&gt;D&lt;/code&gt; to &lt;code&gt;1&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;how well the discriminator spotted the generator&amp;rsquo;s fakes (i.e. comparing &lt;code&gt;D_&lt;/code&gt; to &lt;code&gt;0&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;how well the generator fooled the discriminator into labelling its fakes as real (i.e. comparing &lt;code&gt;D_&lt;/code&gt; to &lt;code&gt;1&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We&amp;rsquo;ll add the discriminator losses up (1 + 2) and create a TensorBoard summary statistic (a &lt;code&gt;scalar&lt;/code&gt; value) for the discriminator and generator losses in this epoch. These are what we will optimise during training.&lt;/p&gt;

&lt;p&gt;To keep everything tidy, we&amp;rsquo;ll group the discriminator and generator variables into &lt;code&gt;d_vars&lt;/code&gt; and &lt;code&gt;g_vars&lt;/code&gt; respectively.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;        self.d_loss_real = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(logits=self.D_logits,
                                                    labels=tf.ones_like(self.D)))
        self.d_loss_fake = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(logits=self.D_logits_,
                                                    labels=tf.zeros_like(self.D_)))
        self.g_loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(logits=self.D_logits_,
                                                    labels=tf.ones_like(self.D_)))

        self.d_loss_real_sum = tf.summary.scalar(&amp;quot;d_loss_real&amp;quot;, self.d_loss_real)
        self.d_loss_fake_sum = tf.summary.scalar(&amp;quot;d_loss_fake&amp;quot;, self.d_loss_fake)

        self.d_loss = self.d_loss_real + self.d_loss_fake

        self.g_loss_sum = tf.summary.scalar(&amp;quot;g_loss&amp;quot;, self.g_loss)
        self.d_loss_sum = tf.summary.scalar(&amp;quot;d_loss&amp;quot;, self.d_loss)

        t_vars = tf.trainable_variables()

        self.d_vars = [var for var in t_vars if &#39;d_&#39; in var.name]
        self.g_vars = [var for var in t_vars if &#39;g_&#39; in var.name]
&lt;/code&gt;&lt;/pre&gt;
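&lt;p&gt;To make the loss concrete, here is what &lt;code&gt;sigmoid_cross_entropy_with_logits&lt;/code&gt; computes, written out in NumPy using the numerically stable form (a sketch for illustration, not the tutorial&amp;rsquo;s code):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def sigmoid_xent(logits, labels):
    # stable form of -(z*log(sigmoid(x)) + (1-z)*log(1-sigmoid(x)))
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

logits = np.array([-2.0, 0.0, 3.0])   # raw discriminator outputs
labels = np.ones_like(logits)         # &#39;1&#39; targets, as in d_loss_real
loss = sigmoid_xent(logits, labels).mean()
&lt;/code&gt;&lt;/pre&gt;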

&lt;p&gt;We don&amp;rsquo;t want to lose our progress, so let&amp;rsquo;s make sure we set up &lt;code&gt;tf.train.Saver()&lt;/code&gt;, keeping just the most recent set of variables each time.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;        self.saver = tf.train.Saver(max_to_keep=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h4 id=&#34;save&#34;&gt; save() &lt;/h4&gt;

&lt;p&gt;When we want to save a checkpoint (i.e. save all of the weights we&amp;rsquo;ve learned) we will call this function. It will check whether the output directory exists and, if not, create it. Then it will call the &lt;a href=&#34;https://www.tensorflow.org/api_docs/python/tf/train/Saver#save&#34; title=&#34;tf.train.Saver.save&#34;&gt;&lt;code&gt;tf.train.Saver.save()&lt;/code&gt;&lt;/a&gt; function, which takes in the current session &lt;code&gt;sess&lt;/code&gt;, the save directory and model name, and keeps track of the number of steps that have been taken.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    def save(self, checkpoint_dir, step):
        if not os.path.exists(checkpoint_dir):
            os.makedirs(checkpoint_dir)
            
        self.saver.save(self.sess, os.path.join(checkpoint_dir, self.model_name), global_step=step)
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h4 id=&#34;load&#34;&gt; load() &lt;/h4&gt;

&lt;p&gt;Equally, if we&amp;rsquo;ve already spent a long time learning weights, we don&amp;rsquo;t want to start from scratch every time we want to push the network further. This function will load the most recent checkpoint in the save directory. TensorFlow has built-in functions for finding the most recent checkpoint. If there is no checkpoint available, the function returns &lt;code&gt;False&lt;/code&gt; and the appropriate action is taken by the main method that called it.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    def load(self, checkpoint_dir):
        print(&amp;quot; [*] Reading checkpoints...&amp;quot;)
        
        ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
        if ckpt and ckpt.model_checkpoint_path:
            self.saver.restore(self.sess, ckpt.model_checkpoint_path)
            return True
        else:
            return False
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h4 id=&#34;train&#34;&gt; train() &lt;/h4&gt;

&lt;p&gt;The all-important &lt;code&gt;train()&lt;/code&gt; method. This is where the magic happens. When we call &lt;code&gt;DCGAN.train(config)&lt;/code&gt; the networks will begin their fight and train. We will discuss the &lt;code&gt;config&lt;/code&gt; argument later on, but succinctly: it&amp;rsquo;s a list of all hyperparameters TensorFlow will use in the network. Here&amp;rsquo;s how &lt;code&gt;train()&lt;/code&gt; works:&lt;/p&gt;

&lt;p&gt;First we give the trainer the data (using our &lt;code&gt;dataset_files&lt;/code&gt; function) and make sure that it&amp;rsquo;s randomly shuffled. We want to make sure that the images next to each other have nothing in common so that we can truly randomly sample them. There&amp;rsquo;s also a check here, &lt;code&gt;assert(len(data) &amp;gt; 0)&lt;/code&gt;, to make sure that we don&amp;rsquo;t pass in an empty directory&amp;hellip; that wouldn&amp;rsquo;t be useful to learn from.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def train(self, config):
	data = dataset_files(config.dataset)
	np.random.shuffle(data)
	assert(len(data) &amp;gt; 0)
&lt;/code&gt;&lt;/pre&gt;
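&lt;p&gt;As a minimal sketch of this shuffle-and-check step (using a hypothetical list of filenames in place of a real dataset directory):&lt;/p&gt;

```python
import numpy as np

# hypothetical stand-ins for the file paths dataset_files() would return
data = ["00001.jpg", "00002.jpg", "00003.jpg", "00004.jpg"]

np.random.shuffle(data)   # shuffles the list in place
assert len(data) > 0      # guard against an empty dataset directory

# every file is still present exactly once, just in a random order
print(sorted(data))
```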

&lt;p&gt;We&amp;rsquo;re going to use the adaptive non-convex optimization method &lt;a href=&#34;https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer&#34; title=&#34;tf.train.AdamOptimizer&#34;&gt;&lt;code&gt;tf.train.AdamOptimizer()&lt;/code&gt;&lt;/a&gt; from &lt;a href=&#34;https://arxiv.org/pdf/1412.6980.pdf&#34; title=&#34;Adam: A Method for Stochastic Optimization&#34;&gt;Kingma &lt;em&gt;et al&lt;/em&gt; (2014)&lt;/a&gt; to train our networks. Let&amp;rsquo;s set this up for the discriminator (&lt;code&gt;d_optim&lt;/code&gt;) and the generator (&lt;code&gt;g_optim&lt;/code&gt;).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;	d_optim = tf.train.AdamOptimizer(config.learning_rate, beta1=config.beta1).minimize(self.d_loss, var_list=self.d_vars)
	g_optim = tf.train.AdamOptimizer(config.learning_rate, beta1=config.beta1).minimize(self.g_loss, var_list=self.g_vars)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next we will initialize all variables in the network (depending on TensorFlow version) and generate some &lt;code&gt;tf.summary&lt;/code&gt; variables for TensorBoard which group together all of the summaries that we want to keep track of.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;	try:
	    tf.global_variables_initializer().run()
	except:
	    tf.initialize_all_variables().run()
	    
	self.g_sum = tf.summary.merge([self.z_sum, self.d__sum, self.G_sum, self.d_loss_fake_sum, self.g_loss_sum])
	self.d_sum = tf.summary.merge([self.z_sum, self.d_sum, self.d_loss_real_sum, self.d_loss_sum])
	self.writer = tf.summary.FileWriter(&amp;quot;./logs&amp;quot;, self.sess.graph)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So here&amp;rsquo;s the part where we sample the well-known distribution $p_z$ to get the noise vector $z$, using &lt;code&gt;np.random.uniform&lt;/code&gt;. Keep a look out for this when we&amp;rsquo;re watching the network in TensorBoard: we told the GAN &lt;code&gt;class&lt;/code&gt; to output the histogram of $z$ vectors sampled from $p_z$, so they should all approximate a uniform distribution.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;re also going to sample the input &lt;em&gt;real&lt;/em&gt; image files we shuffled earlier taking &lt;code&gt;sample_size&lt;/code&gt; images through to the training process. We will use these later on to assess the loss functions every now and again when we output some examples.&lt;/p&gt;

&lt;p&gt;We need to load in the data using the function &lt;code&gt;get_image()&lt;/code&gt; that we wrote into &lt;code&gt;gantut_imgfuncs.py&lt;/code&gt; during the &lt;a href=&#34;/post/GAN3&#34; title=&#34;MLNotebook: GAN3&#34;&gt;last tutorial&lt;/a&gt;. After loading the images, lets make sure that they&amp;rsquo;re all in one &lt;code&gt;np.array&lt;/code&gt; ready to be used.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;	sample_z = np.random.uniform(-1, 1, size=(self.sample_size, self.z_dim))

	sample_files = data[0:self.sample_size]
	sample = [get_image(sample_file, self.image_size, is_crop=self.is_crop) for sample_file in sample_files]
	sample_images = np.array(sample).astype(np.float32)
&lt;/code&gt;&lt;/pre&gt;
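&lt;p&gt;A quick check of what this sampling gives us (with hypothetical values standing in for &lt;code&gt;self.sample_size&lt;/code&gt; and &lt;code&gt;self.z_dim&lt;/code&gt;):&lt;/p&gt;

```python
import numpy as np

sample_size, z_dim = 64, 100   # hypothetical stand-ins for the class attributes

sample_z = np.random.uniform(-1, 1, size=(sample_size, z_dim))

# one z vector per sample, every entry drawn uniformly from [-1, 1]
assert sample_z.shape == (64, 100)
assert sample_z.min() >= -1.0 and sample_z.max() <= 1.0
```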

&lt;p&gt;Set the step counter and get the start time (it can be frustrating if we can&amp;rsquo;t see how long things are taking). We also want to be sure to load any previous checkpoint from TensorFlow before we start again from scratch.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;	counter = 1
	start_time = time.time()

	if self.load(self.checkpoint_dir):
	    print(&amp;quot;&amp;quot;&amp;quot; An existing model was found - delete the directory or specify a new one with --checkpoint_dir &amp;quot;&amp;quot;&amp;quot;)
	else:
	    print(&amp;quot;&amp;quot;&amp;quot; No model found - initializing a new one&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here&amp;rsquo;s where the actual training takes place. For each epoch that we&amp;rsquo;ve assigned in &lt;code&gt;config&lt;/code&gt;, we create two minibatches: a sampling of real images, and those generated from the $z$ vector. We then update the &lt;code&gt;discriminator&lt;/code&gt; network before updating the &lt;code&gt;generator&lt;/code&gt;. We also write these loss values to the TensorBoard summary. There are two things to notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;By calling &lt;code&gt;sess.run()&lt;/code&gt; with specified variables in the first (or &lt;code&gt;fetch&lt;/code&gt; attribute) we are able to keep the generator steady whilst updating the discriminator, and vice versa.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;The generator is updated twice. This is to make sure that the discriminator loss function does not just converge to zero very quickly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;	for epoch in xrange(config.epoch):
	    data = dataset_files(config.dataset)
	    batch_idxs = min(len(data), config.train_size) // self.batch_size
	    
	    for idx in xrange(0, batch_idxs):
		batch_files = data[idx*config.batch_size:(idx+1)*config.batch_size]
		batch = [get_image(batch_file, self.image_size, is_crop=self.is_crop) for batch_file in batch_files]
		batch_images = np.array(batch).astype(np.float32)
		
		batch_z = np.random.uniform(-1, 1, [config.batch_size, self.z_dim]).astype(np.float32)
		
		#update D network
		_, summary_str = self.sess.run([d_optim, self.d_sum],
		                               feed_dict={self.images: batch_images, self.z: batch_z, self.is_training: True})
		self.writer.add_summary(summary_str, counter)
		
		#update G network
		_, summary_str = self.sess.run([g_optim, self.g_sum],
		                               feed_dict={self.z: batch_z, self.is_training: True})
		self.writer.add_summary(summary_str, counter)
		
		#run g_optim twice to make sure that d_loss does not go to zero
		_, summary_str = self.sess.run([g_optim, self.g_sum],
		                               feed_dict={self.z: batch_z, self.is_training: True})
		self.writer.add_summary(summary_str, counter)

&lt;/code&gt;&lt;/pre&gt;
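&lt;p&gt;Note that the minibatch indexing above drops any remainder that doesn&amp;rsquo;t fill a whole batch. A small sketch with hypothetical numbers in place of &lt;code&gt;config.train_size&lt;/code&gt; and &lt;code&gt;self.batch_size&lt;/code&gt;:&lt;/p&gt;

```python
# 10 files with a batch size of 3 gives 3 full minibatches; the last file is dropped
data = list(range(10))          # stand-ins for shuffled file paths
batch_size, train_size = 3, 100

batch_idxs = min(len(data), train_size) // batch_size
assert batch_idxs == 3

batches = [data[idx * batch_size:(idx + 1) * batch_size] for idx in range(batch_idxs)]
assert batches == [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```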

&lt;p&gt;To get the errors needed for backpropagation, we evaluate &lt;code&gt;d_loss_fake&lt;/code&gt;, &lt;code&gt;d_loss_real&lt;/code&gt; and &lt;code&gt;g_loss&lt;/code&gt;. We run the $z$ vector through the graph to get the fake loss and the generator loss, and use the real &lt;code&gt;batch_images&lt;/code&gt; for the real loss.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;		errD_fake = self.d_loss_fake.eval({self.z: batch_z, self.is_training: False})
		errD_real = self.d_loss_real.eval({self.images: batch_images, self.is_training: False})
		errG = self.g_loss.eval({self.z: batch_z, self.is_training: False})
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s get some output to &lt;code&gt;stdout&lt;/code&gt; for the user. The current epoch and progress through the minibatches is output at each new minibatch. Every 100 minibatches we&amp;rsquo;re going to evaluate the current generator &lt;code&gt;self.G&lt;/code&gt; and calculate the loss against the small set of images we sampled earlier. We will output the result of the generator and use our &lt;code&gt;save_images()&lt;/code&gt; function to create that image array we worked on in the last tutorial.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;		counter += 1
		print(&amp;quot;Epoch [{:2d}] [{:4d}/{:4d}] time: {:4.4f}, d_loss: {:.8f}, g_loss: {:.8f}&amp;quot;.format(
		        epoch, idx, batch_idxs, time.time() - start_time, errD_fake + errD_real, errG))
		
		if np.mod(counter, 100) == 1:
		    samples, d_loss, g_loss = self.sess.run([self.G, self.d_loss, self.g_loss], 
		                                            feed_dict={self.z: sample_z, self.images: sample_images, self.is_training: False})
		    save_images(samples, [8,8], &#39;./samples/train_{:02d}-{:04d}.png&#39;.format(epoch, idx))
		    print(&amp;quot;[Sample] d_loss: {:.8f}, g_loss: {:.8f}&amp;quot;.format(d_loss, g_loss))
&lt;/code&gt;&lt;/pre&gt;
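&lt;p&gt;As an aside, the &lt;code&gt;np.mod(counter, 100) == 1&lt;/code&gt; condition fires whenever the counter is one more than a multiple of 100, so samples are written out roughly every 100 minibatches:&lt;/p&gt;

```python
import numpy as np

# which counter values trigger the sampling branch over 500 minibatches
fired = [c for c in range(1, 501) if np.mod(c, 100) == 1]
assert fired == [1, 101, 201, 301, 401]
```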

&lt;p&gt;Finally, we need to save the current weights from our networks.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;		if np.mod(counter, 500) == 2:
		    self.save(config.checkpoint_dir, counter)
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;conclusion&#34;&gt; Conclusion &lt;/h2&gt;

&lt;p&gt;That&amp;rsquo;s it! We&amp;rsquo;ve completed the &lt;code&gt;gantut_gan.py&lt;/code&gt; and &lt;code&gt;gantut_datafuncs.py&lt;/code&gt; files. Check out the completed files below:&lt;/p&gt;

&lt;p&gt;Completed versions of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_trainer.py&#34; title=&#34;gantut_trainer.py&#34;&gt;gantut_trainer.py&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_imgfuncs_complete.py&#34; title=&#34;gantut_imgfuncs_complete.py&#34;&gt;gantut_imgfuncs_complete.py&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_datafuncs_complete.py&#34; title=&#34;gantut_datafuncs_complete.py&#34;&gt;gantut_datafuncs_complete.py&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_gan_complete.py&#34; title=&#34;gantut_gan_complete.py&#34;&gt;gantut_gan_complete.py&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By following this tutorial series we should now have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A background in how GANs work&lt;/li&gt;
&lt;li&gt;Necessary data, fully pre-processed and ready to use&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;gantut_imgfuncs.py&lt;/code&gt; for loading data into the networks&lt;/li&gt;
&lt;li&gt;A GAN &lt;code&gt;class&lt;/code&gt; with the necessary methods in &lt;code&gt;gantut_gan.py&lt;/code&gt; and the &lt;code&gt;gantut_datafuncs.py&lt;/code&gt; we need to do the computations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the final part of the series, we will run this network and take a look at the outputs in TensorBoard.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Generative Adversarial Network (GAN) in TensorFlow - Part 3</title>
      <link>/post/GAN3/</link>
      <pubDate>Thu, 13 Jul 2017 09:16:32 +0100</pubDate>
      
      <guid>/post/GAN3/</guid>
      <description>&lt;p&gt;We&amp;rsquo;re ready to code! In &lt;a href=&#34;/post/GAN1&#34; title=&#34;GAN Tutorial - Part 1&#34;&gt;Part 1&lt;/a&gt; we looked at how GANs work and &lt;a href=&#34;/post/GAN2&#34; title=&#34;GAN Tutorial - Part 2&#34;&gt;Part 2&lt;/a&gt; showed how to get the data ready. In this part, we will begin creating the functions that handle the image data, including some pre-processing and data normalisation.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;div id=&#34;toctop&#34;&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#intro&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#imagefuncs&#34;&gt;Image Functions&lt;/a&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#importfuncs&#34;&gt;Importing Functions&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#imread&#34;&gt;imread()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#transform&#34;&gt;transform()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#centercrop&#34;&gt;center_crop()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#getimage&#34;&gt;get_image()&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#savingfuncs&#34;&gt;Saving Functions&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#invtransform&#34;&gt;inverse_transform&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#merge&#34;&gt;merge()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#imsave&#34;&gt;imsave()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#saveimages&#34;&gt;save_images()&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;intro&#34;&gt; Introduction &lt;/h2&gt; 

&lt;p&gt;In the &lt;a href=&#34;/post/GAN2&#34; title=&#34;GAN Tutorial - Part 2&#34;&gt;previous post&lt;/a&gt; we downloaded and pre-processed our training data. There were also links to the skeleton code we will be using in the remainder of the tutorial; here they are again:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_imgfuncs.py&#34; title=&#34;gantut_imgfuncs.py&#34;&gt;&lt;code&gt;gantut_imgfuncs.py&lt;/code&gt;&lt;/a&gt;: holds the image-related functions&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_datafuncs.py&#34; title=&#34;gantut_datafuncs.py&#34;&gt;&lt;code&gt;gantut_datafuncs.py&lt;/code&gt;&lt;/a&gt;: contains the data-related functions&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_gan.py&#34; title=&#34;gantut_gan.py&#34;&gt;&lt;code&gt;gantut_gan.py&lt;/code&gt;&lt;/a&gt;: is where we define the GAN &lt;code&gt;class&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_trainer.py&#34; title=&#34;gantut_trainer.py&#34;&gt;&lt;code&gt;gantut_trainer.py&lt;/code&gt;&lt;/a&gt;: is the script that we will call in order to train the GAN&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Again, the code is based on other sources, particularly the repository by &lt;a href=&#34;https://github.com/carpedm20/DCGAN-tensorflow&#34; title=&#34;carpedm20/DCGAN-tensorflow&#34;&gt;carpedm20&lt;/a&gt; and &lt;a href=&#34;http://bamos.github.io/2016/08/09/deep-completion/#ml-heavy-generative-adversarial-net-gan-building-blocks&#34; title=&#34;bamos.github.io&#34;&gt;B. Amos&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, if your folder structure looks something like this, we&amp;rsquo;re ready to go:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;~/GAN
  |- raw
    |-- 00001.jpg
    |-- ...
  |- aligned
    |-- 00001.jpg
    |-- ...
  |- gantut_imgfuncs.py
  |- gantut_datafuncs.py
  |- gantut_gan.py
  |- gantut_trainer.py
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;imagefuncs&#34;&gt; Image Functions &lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;re going to want to be able to read in a set of images. We will also want to be able to output some generated images. We will also add in a fail-safe cropping/transformation procedure in case we need to ensure we have the right input format. The skeleton code &lt;code&gt;gantut_imgfuncs.py&lt;/code&gt; contains the definition headers for these functions; we will fill them in as we go along.&lt;/p&gt;

&lt;h3 id=&#34;importfuncs&#34;&gt; Importing Functions &lt;/h3&gt;

&lt;p&gt;These are the functions needed to get the data from the hard-disk into our network. They are called like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;get_image&lt;/code&gt; which calls&lt;/li&gt;
&lt;li&gt;&lt;code&gt;imread&lt;/code&gt; and&lt;/li&gt;
&lt;li&gt;&lt;code&gt;transform&lt;/code&gt; which calls&lt;/li&gt;
&lt;li&gt;&lt;code&gt;center_crop&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h4 id=&#34;imread&#34;&gt; imread() &lt;/h4&gt;

&lt;p&gt;We are dealing with standard image files and our GAN will support &lt;code&gt;.jpg&lt;/code&gt;, &lt;code&gt;.jpeg&lt;/code&gt; and &lt;code&gt;.png&lt;/code&gt; as input. For these kinds of files, Python already has well-developed tools: specifically, we can use the &lt;a href=&#34;https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.misc.imread.html&#34; title=&#34;imread documentation&#34;&gt;&lt;code&gt;scipy.misc.imread&lt;/code&gt;&lt;/a&gt; function from the &lt;code&gt;scipy.misc&lt;/code&gt; library. This is a one-liner and is already written in the skeleton code.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;path&lt;/code&gt;: location of the image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the image&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; Reads in the image (part of get_image function)
&amp;quot;&amp;quot;&amp;quot;
def imread(path):
    return scipy.misc.imread(path, mode=&#39;RGB&#39;).astype(np.float)
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h4 id=&#34;transform&#34;&gt; transform() &lt;/h4&gt;
&lt;p&gt;This function we will have to write into the skeleton. We are including it to make sure that the image data are all of the same dimensions, so it needs to take in the image, the desired width (the output will be square) and whether to perform the cropping or not. We may have already cropped our images (as we have) because we&amp;rsquo;ve done some registration/alignment etc.&lt;/p&gt;

&lt;p&gt;We check whether we want to crop the image: if we do, we call the &lt;code&gt;center_crop&lt;/code&gt; function, otherwise we just take the &lt;code&gt;image&lt;/code&gt; as it is.&lt;/p&gt;

&lt;p&gt;Before returning our cropped (or uncropped) image, we perform normalisation. Currently the pixels have intensity values in the range $[0 \ 255]$ for each channel (red, green, blue). It is best not to have this kind of skew on our data, so we normalise our images to have intensity values in the range $[-1 \ 1]$ by dividing by half the maximum value (127.5) and subtracting 1, i.e. &lt;code&gt;image/127.5 - 1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We will define the cropping function next, but note that the returned image is simply a &lt;code&gt;numpy&lt;/code&gt; array.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;image&lt;/code&gt;:      the image data to be transformed&lt;/li&gt;
&lt;li&gt;&lt;code&gt;npx&lt;/code&gt;:        the size of the transformed image [&lt;code&gt;npx&lt;/code&gt; x &lt;code&gt;npx&lt;/code&gt;]&lt;/li&gt;
&lt;li&gt;&lt;code&gt;is_crop&lt;/code&gt;:    whether to perform cropping too [&lt;code&gt;True&lt;/code&gt; or &lt;code&gt;False&lt;/code&gt;]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the cropped, normalised image&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; Transforms the image by cropping and resizing and
normalises intensity values between -1 and 1
&amp;quot;&amp;quot;&amp;quot;
def transform(image, npx=64, is_crop=True):
    if is_crop:
        cropped_image = center_crop(image, npx)
    else:
        cropped_image = image
    return np.array(cropped_image)/127.5 - 1.
&lt;/code&gt;&lt;/pre&gt;
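&lt;p&gt;The normalisation is easy to sanity-check on a tiny array; this sketch also shows the inverse mapping we will need when saving images later:&lt;/p&gt;

```python
import numpy as np

# a tiny stand-in image with raw intensities in [0, 255]
image = np.array([[0.0, 127.5, 255.0]])

normalised = image / 127.5 - 1.0          # maps [0, 255] to [-1, 1]
assert np.allclose(normalised, [[-1.0, 0.0, 1.0]])

restored = (normalised + 1.0) * 127.5     # inverse mapping back to [0, 255]
assert np.allclose(restored, image)
```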

&lt;hr&gt;

&lt;h4 id=&#34;centercrop&#34;&gt; center_crop() &lt;/h4&gt;

&lt;p&gt;Let&amp;rsquo;s perform the cropping of the images (if requested). Usually we deal with square images, say $[64 \times 64]$. We can add a quick option to change that with short &lt;code&gt;if&lt;/code&gt; statements looking at the &lt;code&gt;crop_w&lt;/code&gt; argument to this function. We take the current height and width (&lt;code&gt;h&lt;/code&gt; and &lt;code&gt;w&lt;/code&gt;) from the &lt;code&gt;shape&lt;/code&gt; of the image &lt;code&gt;x&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To find the location of the centre of the image around which to take the square crop, we take half the result of &lt;code&gt;h - crop_h&lt;/code&gt; and &lt;code&gt;w - crop_w&lt;/code&gt;, making sure to round both to get a definite pixel value. However, it&amp;rsquo;s not guaranteed (depending on the image dimensions) that we will end up with a nice $[64 \times 64]$ image. Let&amp;rsquo;s fix that at the end.&lt;/p&gt;

&lt;p&gt;As before, &lt;code&gt;scipy&lt;/code&gt; has some efficient functions that we may as well use. &lt;a href=&#34;https://docs.scipy.org/doc/scipy/reference/generated/scipy.misc.imresize.html&#34; title=&#34;imresize documentation&#34;&gt;&lt;code&gt;scipy.misc.imresize&lt;/code&gt;&lt;/a&gt; takes in an image array and the desired size and outputs a resized image. We can give it our array, which may not be a nice square image due to the initial image dimensions, and &lt;code&gt;imresize&lt;/code&gt; will perform interpolation (bilinear by default) to make sure we get a nice square image at the end.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;x&lt;/code&gt;:      the input image&lt;/li&gt;
&lt;li&gt;&lt;code&gt;crop_h&lt;/code&gt;: the height of the crop region&lt;/li&gt;
&lt;li&gt;&lt;code&gt;crop_w&lt;/code&gt;: if None crop width = crop height&lt;/li&gt;
&lt;li&gt;&lt;code&gt;resize_w&lt;/code&gt;: the width of the resized image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the cropped image&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; Crops the input image at the centre pixel
&amp;quot;&amp;quot;&amp;quot;
def center_crop(x, crop_h, crop_w=None, resize_w=64):
    if crop_w is None:
        crop_w = crop_h
    h, w = x.shape[:2]
    j = int(round((h - crop_h)/2.))
    i = int(round((w - crop_w)/2.))
    return scipy.misc.imresize(x[j:j+crop_h, i:i+crop_w],
                               [resize_w, resize_w])
&lt;/code&gt;&lt;/pre&gt;
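&lt;p&gt;Here&amp;rsquo;s a sketch of just the centre-offset arithmetic on a hypothetical 5&amp;times;7 array (omitting the &lt;code&gt;imresize&lt;/code&gt; step):&lt;/p&gt;

```python
import numpy as np

x = np.arange(35).reshape(5, 7)   # hypothetical 5x7 single-channel image

crop_h = crop_w = 3
h, w = x.shape[:2]
j = int(round((h - crop_h) / 2.0))   # top edge of the central window
i = int(round((w - crop_w) / 2.0))   # left edge of the central window

crop = x[j:j + crop_h, i:i + crop_w]
assert (j, i) == (1, 2)
assert crop.shape == (3, 3)
assert crop[0, 0] == x[1, 2]   # the window is centred in the image
```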

&lt;hr&gt;

&lt;h4 id=&#34;getimage&#34;&gt; get_image() &lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;get_image&lt;/code&gt; function is a wrapper that will call the &lt;code&gt;imread&lt;/code&gt; and &lt;code&gt;transform&lt;/code&gt; functions. It is the function that we&amp;rsquo;ll call to get the data rather than doing two separate function calls in the main GAN &lt;code&gt;class&lt;/code&gt;. This is a one-liner and is already written in the skeleton code.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Parameters&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;is_crop&lt;/code&gt;:    whether to crop the image or not [True or False]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;image_path&lt;/code&gt;: location of the image&lt;/li&gt;
&lt;li&gt;&lt;code&gt;image_size&lt;/code&gt;: width (in pixels) of the output image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the cropped image&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; Loads the image and crops it to &#39;image_size&#39;
&amp;quot;&amp;quot;&amp;quot;
def get_image(image_path, image_size, is_crop=True):
    return transform(imread(image_path), image_size, is_crop)
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h3 id=&#34;savingfuncs&#34;&gt; Saving Functions &lt;/h3&gt;

&lt;p&gt;When we&amp;rsquo;re training our network, we will want to see some of the results. The previous functions all deal with getting images from storage &lt;em&gt;into&lt;/em&gt; the networks. We now want to take some images &lt;em&gt;out&lt;/em&gt;. The functions are called like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;save_images&lt;/code&gt; which calls&lt;/li&gt;
&lt;li&gt;&lt;code&gt;inverse_transform&lt;/code&gt; and&lt;/li&gt;
&lt;li&gt;&lt;code&gt;imsave&lt;/code&gt; which calls&lt;/li&gt;
&lt;li&gt;&lt;code&gt;merge&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h4 id=&#34;invtransform&#34;&gt; inverse_transform() &lt;/h4&gt;

&lt;p&gt;Firstly, let&amp;rsquo;s map the intensities back out of the normalised range: we&amp;rsquo;ll just go from $[-1 \ 1]$ to $[0 \ 1]$ here.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;images&lt;/code&gt;:     the image to be transformed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the transformed image&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; This turns the intensities back to a normal range
&amp;quot;&amp;quot;&amp;quot;
def inverse_transform(images):
    return (images+1.)/2.
&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h4 id=&#34;merge&#34;&gt; merge() &lt;/h4&gt;

&lt;p&gt;We will create an array of several example images from the network which we can output every now and again to see how things are progressing. We need some &lt;code&gt;images&lt;/code&gt; to go in and a &lt;code&gt;size&lt;/code&gt; which will say how many images in width and height the array should be.&lt;/p&gt;

&lt;p&gt;First get the height &lt;code&gt;h&lt;/code&gt; and width &lt;code&gt;w&lt;/code&gt; of the &lt;code&gt;images&lt;/code&gt; from their &lt;code&gt;shape&lt;/code&gt; (we assume they&amp;rsquo;re all the same size because we will have already used our previous functions to make this happen). &lt;strong&gt;Note&lt;/strong&gt; that &lt;code&gt;images&lt;/code&gt; is a collection of images where each &lt;code&gt;image&lt;/code&gt; has the same &lt;code&gt;h&lt;/code&gt; and &lt;code&gt;w&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We define &lt;code&gt;img&lt;/code&gt; to be the final image array and initialise it to all zeros. Notice that there is a &amp;lsquo;3&amp;rsquo; on the end to denote the number of channels as these are RGB images. This will still work for grayscale images.&lt;/p&gt;

&lt;p&gt;Next we will iterate through each &lt;code&gt;image&lt;/code&gt; in &lt;code&gt;images&lt;/code&gt; and put it into place. The &lt;code&gt;%&lt;/code&gt; operator is the modulo which returns the remainder of the division between two numbers. &lt;code&gt;//&lt;/code&gt; is the floor division operator which returns the integer result of division rounded down. So this will move along the top row of the array (remembering Python indexing starts at 0) and move down placing the image at each iteration.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;images&lt;/code&gt;:     the set of input images&lt;/li&gt;
&lt;li&gt;&lt;code&gt;size&lt;/code&gt;:       [height, width] of the array&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an array of images as a single image&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; Takes a set of &#39;images&#39; and creates an array from them.
&amp;quot;&amp;quot;&amp;quot; 
def merge(images, size):
    h, w = images.shape[1], images.shape[2]
    img = np.zeros((int(h * size[0]), int(w * size[1]), 3))
    for idx, image in enumerate(images):
        i = idx % size[1]
        j = idx // size[1]
        img[j*h:j*h+h, i*w:i*w+w, :] = image
        
    return img
&lt;/code&gt;&lt;/pre&gt;
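&lt;p&gt;To see the modulo/floor-division placement in action, here is the same loop run on four hypothetical 2&amp;times;2 RGB &amp;ldquo;images&amp;rdquo;, each filled with its own index:&lt;/p&gt;

```python
import numpy as np

# four 2x2 RGB images stacked along the first axis; image k is filled with the value k
images = np.arange(4).reshape(4, 1, 1, 1) * np.ones((4, 2, 2, 3))

size = [2, 2]   # a 2x2 grid of images
h, w = images.shape[1], images.shape[2]
img = np.zeros((h * size[0], w * size[1], 3))

for idx, image in enumerate(images):
    i = idx % size[1]    # column in the grid
    j = idx // size[1]   # row in the grid
    img[j*h:j*h+h, i*w:i*w+w, :] = image

assert img.shape == (4, 4, 3)
assert img[0, 0, 0] == 0.0   # image 0 lands in the top-left cell
assert img[3, 3, 0] == 3.0   # image 3 lands in the bottom-right cell
```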

&lt;hr&gt;

&lt;h4 id=&#34;imsave&#34;&gt; imsave() &lt;/h4&gt;

&lt;p&gt;Our image array &lt;code&gt;img&lt;/code&gt; now has intensity values in $[0 \ 1]$; let&amp;rsquo;s scale this to the proper image range $[0 \ 255]$ and cast to integer values before saving with &lt;a href=&#34;https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.misc.imsave.html&#34; title=&#34;imsave documentation&#34;&gt;&lt;code&gt;scipy.misc.imsave&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;images&lt;/code&gt;: the set of input images&lt;/li&gt;
&lt;li&gt;&lt;code&gt;size&lt;/code&gt;:   [height, width] of the array&lt;/li&gt;
&lt;li&gt;&lt;code&gt;path&lt;/code&gt;:   the save location&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Returns&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an image saved to disk&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; Takes a set of `images` and calls the merge function. Converts
the array to image data and saves to disk.
&amp;quot;&amp;quot;&amp;quot;
def imsave(images, size, path):
    img = merge(images, size)
    return scipy.misc.imsave(path, (255*img).astype(np.uint8))
&lt;/code&gt;&lt;/pre&gt;
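&lt;p&gt;The scale-and-cast step on its own (a sketch with a hypothetical merged array):&lt;/p&gt;

```python
import numpy as np

img = np.array([[0.0, 0.5, 1.0]])          # merged array with values in [0, 1]
as_bytes = (255 * img).astype(np.uint8)    # scale to [0, 255] and truncate to integers

assert as_bytes.tolist() == [[0, 127, 255]]   # 255 * 0.5 = 127.5 truncates to 127
```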

&lt;hr&gt;

&lt;h4 id=&#34;saveimages&#34;&gt; save_images() &lt;/h4&gt;

&lt;p&gt;Finally, let&amp;rsquo;s create the wrapper to pull this together:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inputs&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;images&lt;/code&gt;: the images to be saved&lt;/li&gt;
&lt;li&gt;&lt;code&gt;size&lt;/code&gt;: the size of the image array [height, width]&lt;/li&gt;
&lt;li&gt;&lt;code&gt;image_path&lt;/code&gt;: where the array is to be stored on disk&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;&amp;quot;&amp;quot;&amp;quot; Takes a set of images and saves them to disk. Redistributes
intensity values from [-1 1] back to [0 255]
&amp;quot;&amp;quot;&amp;quot;
def save_images(images, size, image_path):
    return imsave(inverse_transform(images), size, image_path)
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;conclusion&#34;&gt; Conclusion &lt;/h3&gt;

&lt;p&gt;In this post, we&amp;rsquo;ve dealt with all of the functions that are needed to import image data into our network, and also some that will create outputs so we can see what&amp;rsquo;s going on. We&amp;rsquo;ve made sure that we can import any image size and it will be dealt with correctly.&lt;/p&gt;

&lt;p&gt;Make sure that we&amp;rsquo;ve imported &lt;code&gt;scipy.misc&lt;/code&gt; and &lt;code&gt;numpy&lt;/code&gt; into this script:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np
import scipy.misc
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The complete script can be found &lt;a href=&#34;/docs/GAN/gantut_imgfuncs_complete.py&#34; title=&#34;gantut_imgfuncs_complete.py&#34;&gt;here&lt;/a&gt;. In the next post, we will be working on the GAN itself and building the &lt;code&gt;gantut_datafuncs.py&lt;/code&gt; functions as we go.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Generative Adversarial Network (GAN) in TensorFlow - Part 2</title>
      <link>/post/GAN2/</link>
      <pubDate>Wed, 12 Jul 2017 11:59:45 +0100</pubDate>
      
      <guid>/post/GAN2/</guid>
      <description>&lt;p&gt;This tutorial will provide the data that we will use when training our Generative Adversarial Networks. It will also give an overview of the structure of the code needed to create a GAN, and provide some skeleton code which we can work on in the next post. If you&amp;rsquo;re not up to speed on GANs, please do read the brief introduction in &lt;a href=&#34;/post/GAN1&#34; title=&#34;GAN Part 1 - Some Background and Mathematics&#34;&gt;Part 1&lt;/a&gt; of this series on Generative Adversarial Networks.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;h2 id=&#34;intro&#34;&gt; Introduction &lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;ve looked at &lt;a href=&#34;/post/GAN1&#34; title=&#34;GAN Part 1 - Some Background and Mathematics&#34;&gt;how a GAN works&lt;/a&gt;  and how it is trained, but how do we implement this in Python? There are several stages to this task:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create some initial functions that will read in our training data&lt;/li&gt;
&lt;li&gt;Create some functions that will perform the steps in the CNN&lt;/li&gt;
&lt;li&gt;Write a &lt;code&gt;class&lt;/code&gt; that will hold our GAN and all of its important methods&lt;/li&gt;
&lt;li&gt;Put these together in a script that we can run to train the GAN&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The way I&amp;rsquo;d like to go through this process (in the next post) is by taking the network piece by piece as it would be called by the program. I think this is important to help to understand the flow of the data through the network. The code that I&amp;rsquo;ve used for the basis of these tutorials is from &lt;a href=&#34;https://github.com/carpedm20/DCGAN-tensorflow&#34; title=&#34;carpedm20/DCGAN-tensorflow&#34;&gt;carpedm20&amp;rsquo;s DCGAN-tensorflow repository&lt;/a&gt;, with a lot of influence from other sources including &lt;a href=&#34;http://bamos.github.io/2016/08/09/deep-completion/#ml-heavy-generative-adversarial-net-gan-building-blocks&#34; title=&#34;bamos.github.io&#34;&gt;this blog from B. Amos&lt;/a&gt;. I&amp;rsquo;m hoping that by putting this together in several posts, and fleshing out the code, it will become clearer.&lt;/p&gt;

&lt;h2 id=&#34;skeletons&#34;&gt; Skeleton Code &lt;/h2&gt;

&lt;p&gt;We will structure our code into 4 separate &lt;code&gt;.py&lt;/code&gt; files. Each file represents one of the 4 stages set out above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_imgfuncs.py&#34; title=&#34;gantut_imgfuncs.py&#34;&gt;&lt;code&gt;gantut_imgfuncs.py&lt;/code&gt;&lt;/a&gt;: holds the image-related functions&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_datafuncs.py&#34; title=&#34;gantut_datafuncs.py&#34;&gt;&lt;code&gt;gantut_datafuncs.py&lt;/code&gt;&lt;/a&gt;: contains the data-related functions&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_gan.py&#34; title=&#34;gantut_gan.py&#34;&gt;&lt;code&gt;gantut_gan.py&lt;/code&gt;&lt;/a&gt;: is where we define the GAN &lt;code&gt;class&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;/docs/GAN/gantut_trainer.py&#34; title=&#34;gantut_trainer.py&#34;&gt;&lt;code&gt;gantut_trainer.py&lt;/code&gt;&lt;/a&gt;: is the script that we will call in order to train the GAN&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For our project, let&amp;rsquo;s use the working directory &lt;code&gt;~/GAN&lt;/code&gt;. Download these skeletons using the links above into &lt;code&gt;~/GAN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you look through each of these files, you will see that they contain only a comment for each function/class and the line defining each function/method. Each of these will have to be completed when we go through the next couple of posts. In the remainder of this post, we will take a look at the dataset that we will be using and prepare the images.&lt;/p&gt;

&lt;h2 id=&#34;dataset&#34;&gt; Dataset&lt;/h2&gt;

&lt;p&gt;We clearly need to have some training data to hand to be able to make this work. Several posts have used databases of faces or even the MNIST digit-classification dataset. In our tutorial, we will be using faces - I find this very interesting as it allows the computer to create photo-realistic images of people that don&amp;rsquo;t actually exist!&lt;/p&gt;

&lt;p&gt;To get the dataset prepared we need to download it, and then pre-process the images so that they will be small enough to use in our GAN.&lt;/p&gt;

&lt;h3 id=&#34;dataset-download&#34;&gt; Download &lt;/h3&gt;

&lt;p&gt;We are going to use the &lt;a href=&#34;http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html&#34; title=&#34;CelebA&#34;&gt;CelebA&lt;/a&gt; database. Here is a direct link to the Google Drive which stores the data: &lt;a href=&#34;https://drive.google.com/drive/folders/0B7EVK8r0v71pTUZsaXdaSnZBZzg&#34;&gt;https://drive.google.com/drive/folders/0B7EVK8r0v71pTUZsaXdaSnZBZzg&lt;/a&gt;. You will want to go to the &amp;ldquo;img&amp;rdquo; folder and download the &lt;a href=&#34;https://drive.google.com/open?id=0B7EVK8r0v71pZjFTYXZWM3FlRnM&#34; title=&#34;img_align_celeba.zip&#34;&gt;&amp;ldquo;img_align_celeba.zip&amp;rdquo;&lt;/a&gt; file. The direct download link should be:&lt;/p&gt;

&lt;div align=&#34;center&#34;&gt;
&lt;a href=&#34;https://drive.google.com/open?id=0B7EVK8r0v71pZjFTYXZWM3FlRnM&#34; title=&#34;img_align_celeba.zip&#34;&gt;img_align_celeba.zip (1.3GB)&lt;/a&gt;
&lt;/div&gt;

&lt;p&gt;Download and extract this folder into &lt;code&gt;~/GAN/raw_images&lt;/code&gt;: it contains 200,000+ examples of celebrity faces. Even though the &lt;code&gt;.zip&lt;/code&gt; has &amp;lsquo;align&amp;rsquo; in its name, we still need to resize the images and may therefore need to realign them too.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img src=&#34;http://mmlab.ie.cuhk.edu.hk/projects/celeba/overview.png&#34; width=&#34;75%&#34; title=&#34;CelebA Database&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 1&lt;/font&gt;: Examples from the CelebA Database. Source: &lt;a href=&#34;http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html&#34; alt=&#34;CelebA&#34;&gt;CelebA&lt;/a&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;dataset-process&#34;&gt; Processing &lt;/h3&gt;

&lt;p&gt;To process this volume of images, we need an automated method for resizing and cropping. We will use &lt;a href=&#34;http://cmusatyalab.github.io/openface/&#34; title=&#34;OpenFace&#34;&gt;OpenFace&lt;/a&gt; - specifically, a small tool that comes with it.&lt;/p&gt;

&lt;p&gt;Open a terminal, navigate to (or create) your working directory (we&amp;rsquo;ll use &lt;code&gt;~/GAN&lt;/code&gt;) and follow the instructions below to clone OpenFace and get the Python wrapping sorted:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;cd ~/GAN
git clone https://github.com/cmusatyalab/openface.git openface
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once cloning is complete, move into the &lt;code&gt;openface&lt;/code&gt; folder and install the requirements (handily they&amp;rsquo;re listed in &lt;code&gt;requirements.txt&lt;/code&gt;), like so:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;cd ./openface
sudo pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once installation is complete (make sure you use &lt;code&gt;sudo&lt;/code&gt; to get the permissions to install), we want to download the models that we can use with Python:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;./models/get-models.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This may take a short while. When it&amp;rsquo;s done, you may want to update Scipy, because &lt;code&gt;requirements.txt&lt;/code&gt; asks for an older version than the most recent one. Easily fixed:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;sudo pip install --upgrade scipy
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we have access to the Python tool that will do the aligning and cropping of our faces. This is an important step to ensure that all images going into the network are the same dimensions, but also so that the network can learn the faces well (there&amp;rsquo;s no point in having eyes at the bottom of an image, or a face that&amp;rsquo;s half out of the field of view).&lt;/p&gt;

&lt;p&gt;In our working directory &lt;code&gt;~/GAN&lt;/code&gt;, run the following:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;./openface/util/align-dlib.py ./raw_images align innerEyesAndBottomLip ./aligned --size 64
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will &lt;code&gt;align&lt;/code&gt; the images in &lt;code&gt;./raw_images&lt;/code&gt; by their &lt;code&gt;innerEyesAndBottomLip&lt;/code&gt; landmarks, crop them to &lt;code&gt;64&lt;/code&gt; x &lt;code&gt;64&lt;/code&gt; and put them in &lt;code&gt;./aligned&lt;/code&gt;. This will take a long time (there are 200,000+ images!).&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img src=&#34;/img/CNN/resized_celeba.png&#34; width=&#34;50%&#34; title=&#34;Cropped and Resized CelebA&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 2&lt;/font&gt;: Examples of aligned, cropped and resized images from the &lt;a href=&#34;http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html&#34; alt=&#34;CelebA&#34;&gt;CelebA&lt;/a&gt; database.
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;That&amp;rsquo;s it! Now we will have a good training set to use with our network. We also have the skeletons that we can build up to form our GAN. Our next post will look at the functions that will read-in the images for use with the GAN and begin to work on the GAN &lt;code&gt;class&lt;/code&gt;.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Generative Adversarial Network (GAN) in TensorFlow - Part 1</title>
      <link>/post/GAN1/</link>
      <pubDate>Tue, 11 Jul 2017 09:15:54 +0100</pubDate>
      
      <guid>/post/GAN1/</guid>
<description>&lt;p&gt;We&amp;rsquo;ve seen that CNNs can learn the content of an image for classification purposes, but what else can they do? This tutorial will look at the Generative Adversarial Network (GAN), which is able to learn from a set of images and create an entirely new &amp;lsquo;fake&amp;rsquo; image which isn&amp;rsquo;t in the training set. Why? By the end of this tutorial you&amp;rsquo;ll know why this might be done and how to do it.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;h2 id=&#34;intro&#34;&gt;  Introduction &lt;/h2&gt;

&lt;p&gt;Generative Adversarial Networks (GANs) were proposed by Ian Goodfellow &lt;em&gt;et al.&lt;/em&gt; at the annual Neural Information Processing Systems (NIPS) conference in 2014. The original paper &lt;a href=&#34;https://arxiv.org/pdf/1406.2661&#34; title=&#34;Generative Adversarial Nets 2014&#34;&gt;is available on arXiv&lt;/a&gt;, along with a later tutorial delivered by Goodfellow at NIPS in 2016 &lt;a href=&#34;https://arxiv.org/pdf/1701.00160&#34; title=&#34;NIPS 2016 Tutorial: Generative Adversarial Networks&#34;&gt;here&lt;/a&gt;. I&amp;rsquo;ve read both of these (and others) as well as taking a look at other tutorials, but sometimes things just weren&amp;rsquo;t clear enough for me. &lt;a href=&#34;http://bamos.github.io/2016/08/09/deep-completion/#ml-heavy-generative-adversarial-net-gan-building-blocks&#34; title=&#34;bamos.github.io&#34;&gt;This blog from B. Amos&lt;/a&gt; has been helpful in getting my thoughts organised for this series, and hopefully I can build on it a little and make things more concrete.&lt;/p&gt;

&lt;h3&gt;What&#39;s a GAN?&lt;/h3&gt;

&lt;p&gt;GANs are used in a number of ways, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;to generate new images based upon some training data. For our tutorial, we will train with a database of faces and ask the network to produce a new face.&lt;/li&gt;
&lt;li&gt;to do &amp;lsquo;inpainting&amp;rsquo; or &amp;lsquo;image completion&amp;rsquo;. This is where part of a scene may be missing and we wish to recover the full image. It could be that we want to remove parts of the image e.g. people, and fill-in the background.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are two components in a GAN which try to work against each other (hence the &amp;lsquo;adversarial&amp;rsquo; part).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Generator (&lt;em&gt;G&lt;/em&gt;) starts off by creating a very noisy image based upon some random input data. Its job is to try to come up with images that are as real as possible.&lt;/li&gt;
&lt;li&gt;The Discriminator (&lt;em&gt;D&lt;/em&gt;) is trying to determine whether an image is real or fake.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Though these two are the primary components of the network, we also need to write some functions for importing data and dealing with the training of this two-stage network. Part 1 of this tutorial will go through some background and mathematics, in Part 2 we will do some general housekeeping and get us prepared to write the main model of our network in Part 3.&lt;/p&gt;

&lt;h2 id=&#34;maths&#34;&gt; Background &lt;/h2&gt;

&lt;p&gt;There are a number of situations where you may want to use a GAN. A common task is for image completion or &amp;lsquo;in-painting&amp;rsquo;. This would be where we have an image and would like to remove some obstruction or imperfection by replacing it with the background. Maybe there&amp;rsquo;s a lovely holiday photo of beautiful scenery, but there are some people you don&amp;rsquo;t know spoiling the view. Figure 1 shows an example of the result of image completion using PhotoShop on such an image.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img src=&#34;https://farm5.staticflickr.com/4115/4756059924_e26ae12e46_b.jpg&#34; width=&#34;100%&#34; alt=&#34;Image Completion Example&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 1&lt;/font&gt;: Removal of unwanted parts of a scene with image completion. Source: &lt;a href=&#34;https://www.flickr.com/photos/littleredelf/4756059924/in/photostream/&#34; alt=&#34;littleredelf&#34;&gt;Flickr:littleredelf&lt;/a&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;We have a couple of options if we want to try and do this kind of image completion ourselves. Let&amp;rsquo;s say we draw around an area we want to change:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If we&amp;rsquo;ve never seen a beach or the sky before, we may just have to use the neighbouring pixels to inform our in-filling. If we&amp;rsquo;re feeling fancy, we would look a little further afield and use that information too (i.e. is there just sky around the area, or is there something else?).&lt;/li&gt;
&lt;li&gt;Or&amp;hellip; we could look at the image as a whole and try to see what would fit best. For this we would have to use our knowledge of similar scenes we&amp;rsquo;ve observed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the difference between using (1) contextual and (2) perceptual information. But before we look more heavily into this, let&amp;rsquo;s take a look at the idea behind a GAN.&lt;/p&gt;

&lt;h2 id=&#34;gan&#34;&gt; Generative Adversarial Networks &lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;ve said that there are two components in a GAN, the &lt;em&gt;generator&lt;/em&gt; and the &lt;em&gt;discriminator&lt;/em&gt;. Here, we&amp;rsquo;ll look more closely at what they do.&lt;/p&gt;

&lt;p&gt;Our purpose is to create images which are as realistic as possible. So much so, that they are able to fool not only humans, but the computer that has generated them. You will often see GANs being compared to money counterfeiting: our generator is trying to create fake money whilst our discriminator is trying to tell the difference between the real and fake bills. How does this work?&lt;/p&gt;

&lt;p&gt;Say we have an image $x$ which our discriminator $D$ is analysing. $D(x)$ gives a value near to 1 if the image looks normal or &amp;lsquo;natural&amp;rsquo;, and a value near to 0 if it thinks the image is fake - if it is very noisy, for example. The generator $G$ takes a vector $z$ that has been randomly sampled from a very simple, but well known, distribution e.g. a uniform or normal distribution. The image produced by $G(z)$ then helps to train $D$: we alternate showing the discriminator a real image (which will change its parameters towards giving a high output) and an image from $G$ (which will change $D$ towards giving a low output). At the same time, we want $G$ to be learning to produce more realistic images which are more likely to fool $D$. On these generated images, $G$ wants to &lt;em&gt;maximise&lt;/em&gt; the output of $D$ whilst $D$ is trying to &lt;em&gt;minimise&lt;/em&gt; the same thing. They are playing a &lt;a href=&#34;https://en.wikipedia.org/wiki/Minimax&#34; title=&#34;Wiki: minimax&#34;&gt;&amp;lsquo;minimax&amp;rsquo;&lt;/a&gt; game against each other, which is where we get the term &amp;lsquo;adversarial&amp;rsquo; training.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img src=&#34;/img/CNN/gan1.png&#34; width=&#34;100%&#34; alt=&#34;GAN&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 2&lt;/font&gt;: Generative Adversarial Network concept. Simple, known distribution $p_z$ from which the vector $z$ is drawn. Generator $G(z)$ generates an image. Discriminator tries to determine if image came from $G$ or from the true, unknown distribution $p_{data}$.
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Let&amp;rsquo;s keep going with the maths&amp;hellip;&lt;/p&gt;

&lt;p&gt;This kind of network has a lot of latent (hidden) variables that need to be found. But we can start from a strong position by using a distribution that we know very well like a uniform distribution.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;known&lt;/strong&gt; distribution we denote $p_z$. We will randomly draw a vector $z$ from $p_z$.&lt;/li&gt;
&lt;li&gt;We know that our data must have some distribution, but we do &lt;strong&gt;not&lt;/strong&gt; know what it is. We&amp;rsquo;ll call this $p_{data}$.&lt;/li&gt;
&lt;li&gt;Our generator will try to learn its own distribution $p_g$. Our goal is for $p_g = p_{data}$.&lt;/li&gt;
&lt;/ul&gt;
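&lt;p&gt;Sampling from the known distribution $p_z$ is trivial - that&amp;rsquo;s the whole point of choosing it. As a quick sketch (the 100-dimensional $z$ here is a common DCGAN convention, not something fixed by the theory):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# one minibatch of latent vectors drawn from the known distribution p_z,
# here a uniform distribution on [-1, 1]
batch_size, z_dim = 64, 100
z = rng.uniform(-1.0, 1.0, size=(batch_size, z_dim))
print(z.shape)   # (64, 100)
```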

&lt;p&gt;We have two networks to train, $D$ and $G$:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We want $D$ to &lt;em&gt;maximise&lt;/em&gt; $D(x)$ when $x$ is drawn from our true distribution $p_{data}$, and to &lt;em&gt;minimise&lt;/em&gt; $D(G(z))$ on generated samples i.e. &lt;em&gt;maximise&lt;/em&gt; $1 - D(G(z))$&lt;/li&gt;
&lt;li&gt;whilst $G$ tries to &lt;em&gt;maximise&lt;/em&gt; $D(G(z))$ i.e. &lt;em&gt;minimise&lt;/em&gt; $1 - D(G(z))$.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More formally:&lt;/p&gt;

&lt;div&gt;$$
\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x\sim p_{data}} \left[ \log D(x)  \right]+ \mathbb{E}_{z\sim p_{z}} \left[ \log \left( 1 - D(G(z)) \right) \right]

$$
&lt;/div&gt;

&lt;p&gt;where $\mathbb{E}$ is the expectation. The advantage of working with neural networks is that we can easily compute gradients and use backpropagation for training. This is because the generator and the discriminator are defined by multi-layer perceptrons (MLPs) with parameters $\theta_g$ and $\theta_d$ respectively.&lt;/p&gt;
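&lt;p&gt;In practice the two expectations are estimated by sample means over minibatches. A small &lt;code&gt;numpy&lt;/code&gt; sketch, with stand-in definitions of $D$ and $G$, shows a known result from the Goodfellow paper: at the game&amp;rsquo;s equilibrium, where the discriminator can do no better than $D(x) = 1/2$ everywhere, the value $V$ works out to $-\log 4$:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 100000)       # stand-in for samples from p_data
z = rng.uniform(-1.0, 1.0, 100000)     # z drawn from the known p_z

def D(images):
    # at equilibrium the discriminator can do no better than 0.5 everywhere
    return np.full(len(images), 0.5)

def G(zv):
    return zv  # placeholder generator

# Monte Carlo estimate of V(D, G): mean log D(x) plus mean log(1 - D(G(z)))
V = np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))
print(round(V, 3))   # -1.386, i.e. -log(4)
```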

&lt;p&gt;We will train the networks (the $G$ and the $D$) one at a time, fixing the weights of one whilst training the other. From the GAN paper by Goodfellow &lt;em&gt;et al&lt;/em&gt; we get the &lt;em&gt;pseudo&lt;/em&gt; code for this procedure:&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img src=&#34;/img/CNN/ganalgorithm.png&#34; width=&#34;100%&#34; alt=&#34;GAN&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 3&lt;/font&gt;: &lt;i&gt;pseudo&lt;/i&gt; code for GAN training. With $k=1$ this equates to training $D$ then $G$ one after the other. Adapted from &lt;a href=&#34;https://arxiv.org/pdf/1406.2661&#34; title=&#34;Goodfellow et al. 2014&#34;&gt;Goodfellow &lt;i&gt;et al.&lt;/i&gt; 2014&lt;/a&gt;.
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Notice that with $k=1$ we are training $D$ then $G$ one after the other. What is the training actually doing? Fig. 4 shows the distribution $p_g$ of the generator in green. Notice that with each training step, $p_g$ becomes more like the true distribution of the image data, $p_{data}$, in black. After each alternation, the error is backpropagated to update the weights of the network that is not being held fixed. The discriminator eventually reaches the point where it is no longer able to tell the difference between the true and fake images, outputting $D(x) = 1/2$ everywhere.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img src=&#34;/img/CNN/ganalgographs.png&#34; width=&#34;100%&#34; alt=&#34;GAN&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 4&lt;/font&gt;: Initially (a) the generator&#39;s and true data distributions (green and black) are not very similar. (b) the discriminator (blue) is updated with generator held constant. (c) Generator is updated with discriminator held constant, until (d) $p_g$ and $p_{data}$ are most alike. Adapted from &lt;a href=&#34;https://arxiv.org/pdf/1406.2661&#34; title=&#34;Goodfellow et al. 2014&#34;&gt;Goodfellow &lt;i&gt;et al.&lt;/i&gt; 2014&lt;/a&gt;.
    &lt;/div&gt;
&lt;/div&gt;
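&lt;p&gt;To make the alternating loop of Figure 3 concrete, here is a deliberately tiny 1-D toy in plain &lt;code&gt;numpy&lt;/code&gt; - this is &lt;em&gt;not&lt;/em&gt; the DCGAN we will build later in TensorFlow. The set-up, learning rates and the non-saturating generator update are my own choices for illustration: $p_{data}$ is a Gaussian around 4, $G(z) = \theta + z$, and $D$ is a logistic classifier, with all gradients written out by hand.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30.0, 30.0)))

# p_data is N(4, 1); G(z) = theta + z with z from p_z = N(0, 1); D(x) = sigmoid(w*x + b)
w, b, theta = 0.0, 0.0, 0.0
lr_d, lr_g, k, m = 0.05, 0.1, 5, 64   # k discriminator steps per generator step

for step in range(2000):
    for _ in range(k):  # inner loop of Fig. 3: gradient ASCENT on D's objective
        x_real = rng.normal(4.0, 1.0, m)            # minibatch from p_data
        x_fake = theta + rng.normal(0.0, 1.0, m)    # minibatch of G(z)
        d_real = sigmoid(w * x_real + b)
        d_fake = sigmoid(w * x_fake + b)
        # gradients of mean(log D(x)) + mean(log(1 - D(G(z)))) w.r.t. w and b
        w += lr_d * (np.mean((1.0 - d_real) * x_real) - np.mean(d_fake * x_fake))
        b += lr_d * (np.mean(1.0 - d_real) - np.mean(d_fake))
    # outer step of Fig. 3: update G to push D(G(z)) towards 1
    z = rng.normal(0.0, 1.0, m)
    d_fake = sigmoid(w * (theta + z) + b)
    theta += lr_g * np.mean(1.0 - d_fake) * w

print(round(theta, 1))
```

&lt;p&gt;With these settings the generator&amp;rsquo;s offset $\theta$ drifts from 0 towards 4, the mean of $p_{data}$, exactly the behaviour sketched in Fig. 4: once the two distributions coincide, the gradient driving $\theta$ shrinks along with the discriminator&amp;rsquo;s advantage.&lt;/p&gt;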

&lt;h2 id=&#34;nextsteps&#34;&gt; What&#39;s Next? &lt;/h2&gt;

&lt;p&gt;That really is it. The basics of a GAN are just a game between two networks, the generator $G$, which produces images from some latent variables $z$, and the discriminator $D$ which tries to detect the faked images.&lt;/p&gt;

&lt;p&gt;Implementing this in Python seems old-hat to many and there are many pre-built solutions available. The work in this tutorial series will mostly follow the base-code from &lt;a href=&#34;https://github.com/carpedm20/DCGAN-tensorflow&#34; title=&#34;carpedm20/DCGAN-tensorflow&#34;&gt;carpedm20&amp;rsquo;s DCGAN-tensorflow repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the next post, we&amp;rsquo;ll get ourselves organised, make sure we have some dependencies, create some files and get our training data sorted.&lt;/p&gt;

&lt;p&gt;As always, if there&amp;rsquo;s anything wrong or that doesn&amp;rsquo;t make sense, &lt;strong&gt;please&lt;/strong&gt; get in contact and let me know. A comment here is great.&lt;/p&gt;
    </item>
    
    <item>
      <title>Convolutional Neural Networks - TensorFlow (Basics)</title>
      <link>/post/tensorflow-basics/</link>
      <pubDate>Mon, 03 Jul 2017 09:44:24 +0100</pubDate>
      
      <guid>/post/tensorflow-basics/</guid>
<description>&lt;p&gt;We&amp;rsquo;ve looked at the principles behind how a CNN works, but how do we actually implement this in Python? This tutorial will look at the basic idea behind Google&amp;rsquo;s TensorFlow: an efficient way to build a CNN using purpose-built Python libraries.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;div style=&#34;text-align:center;&#34;&gt;&lt;img width=30% title=&#34;TensorFlow&#34; src=&#34;/img/CNN/TF_logo.png&#34;&gt;&lt;/div&gt;

&lt;h2 id=&#34;intro&#34;&gt;  Introduction &lt;/h2&gt;

&lt;p&gt;Building a CNN from scratch in Python is perfectly possible, but very memory intensive. It can also lead to very long pieces of code. Several libraries have been developed by the community to solve this problem by wrapping the most common parts of CNNs into special methods called from their own libraries. Theano, Keras and PyTorch are notable open-source libraries in use today. However, since TensorFlow was released and Google announced its machine-learning-specific hardware, the Tensor Processing Unit (TPU), TensorFlow has quickly become a much-used tool in the field. If an application being built today is intended for use on mobile devices, TensorFlow is the way to go, as the mobile TPU in the upcoming Google phones will be able to perform inference from machine-learning models in the user&amp;rsquo;s hand. Of course, being a relative newcomer, with updates still very much controlled by Google, TensorFlow may not have the huge body of support that has built up around Theano, say.&lt;/p&gt;

&lt;p&gt;Nevertheless, TensorFlow is powerful and quick to set up so long as you know how: read on to find out. Much of this tutorial is based around the documentation provided by Google, but gives a lot more information that may be useful to less experienced users.&lt;/p&gt;

&lt;h2 id=&#34;install&#34;&gt; Installation &lt;/h2&gt;

&lt;p&gt;TensorFlow is just another set of Python libraries, distributed by Google via the website: &lt;a href=&#34;https://www.tensorflow.org/install&#34; title=&#34;TensorFlow Installation&#34;&gt;https://www.tensorflow.org/install&lt;/a&gt;. There&amp;rsquo;s the option to install the version for use on GPUs, but that&amp;rsquo;s not necessary for this tutorial; we&amp;rsquo;ll be using the MNIST dataset, which is not too memory intensive.&lt;/p&gt;

&lt;p&gt;Go ahead and install the TensorFlow libraries. Even though they suggest using TF in a virtual environment, we will be coding up our CNN in a plain Python script, so don&amp;rsquo;t worry about that if you&amp;rsquo;re not comfortable with it.&lt;/p&gt;

&lt;p&gt;One of the most frustrating things you will find with TF is that much of the documentation on various websites is already out-of-date. Some of the commands have been re-written or renamed since those pages were written. Even some of Google&amp;rsquo;s own tutorials are now old and require tweaking. Currently, the code written here will work on recent versions, but may throw some &amp;lsquo;deprecation&amp;rsquo; warnings.&lt;/p&gt;

&lt;h2 id=&#34;structure&#34;&gt; TensorFlow Structure &lt;/h2&gt;

&lt;p&gt;The idea of &amp;lsquo;flow&amp;rsquo; is central to TF&amp;rsquo;s organisation. The actual CNN is written as a &amp;lsquo;graph&amp;rsquo;. A graph is simply a list of the different layers in your network, each with its own input and output. Whatever data we input at the top will &amp;lsquo;flow&amp;rsquo; through the graph and output some values. We deal with those values using TensorFlow too, which will automatically take care of updating any internal weights via whatever optimisation method and loss function we prefer.&lt;/p&gt;

&lt;p&gt;The graph is called by some initial functions in the script that create the classifier, run the training and output whatever evaluation metrics we like.&lt;/p&gt;

&lt;p&gt;Before writing any functions, let&amp;rsquo;s import the necessary modules and tell TF to limit its program logging:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np
import os
import tensorflow as tf
from tensorflow.contrib import learn
from tensorflow.contrib.learn.python.learn.estimators import model_fn as model_fn_lib


os.environ[&#39;TF_CPP_MIN_LOG_LEVEL&#39;] = &#39;3&#39;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We&amp;rsquo;ve included multiple TF lines to save on the typing later.&lt;/p&gt;

&lt;h3 id=&#34;graph&#34;&gt; The Graph &lt;/h3&gt;

&lt;p&gt;Let&amp;rsquo;s get straight to it and start to build our graph. We will keep it simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2 convolutional layers learning 16 filters (or kernels) of [3 x 3]&lt;/li&gt;
&lt;li&gt;2 max-pooling layers that halve the size of the image using a [2 x 2] kernel&lt;/li&gt;
&lt;li&gt;A fully connected layer at the end.&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;#Hyperparameters
numK = 16               #number of kernels in each conv layer
sizeConvK = 3           #size of the kernels in each conv layer [n x n]
sizePoolK = 2           #size of the kernels in each pool layer [m x m]
inputSize = 28          #size of the input image
numChannels = 1         #number of channels to the input image grayscale=1, RGB=3

def convNet(inputs, labels, mode):
    #reshape the input from a vector to a 2D image
    input_layer = tf.reshape(inputs, [-1, inputSize, inputSize, numChannels])   
    
    #perform convolution and pooling
    conv1 = doConv(input_layer) 
    pool1 = doPool(conv1)      
    
    conv2 = doConv(pool1)
    pool2 = doPool(conv2)

    #flatten the result back to a vector for the FC layer
    flatPool = tf.reshape(pool2, [-1, 7 * 7 * numK])    
    dense = tf.layers.dense(inputs=flatPool, units=1024, activation=tf.nn.relu)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So what&amp;rsquo;s going on here? First we&amp;rsquo;ve defined some parameters for the CNN such as kernel sizes, the height of the input image (assuming it&amp;rsquo;s square) and the number of channels for the image. The number of channels is &lt;code&gt;1&lt;/code&gt; both for black-and-white images, with intensity values of either 0 or 1, and for grayscale images with intensities in the range [0, 255]. Colour images have &lt;code&gt;3&lt;/code&gt; channels: Red, Green and Blue.&lt;/p&gt;

&lt;p&gt;You&amp;rsquo;ll notice that we&amp;rsquo;ve barely used TF so far: we use it to reshape the data. This is important: when we run our script, TF will take our raw data and turn it into its own data type, i.e. a &lt;code&gt;tensor&lt;/code&gt;. That means our normal &lt;code&gt;numpy&lt;/code&gt; operations won&amp;rsquo;t work on it, so we should use the built-in &lt;code&gt;tf.reshape&lt;/code&gt; function, which works in the same way as the one in numpy - it takes the input data and an output shape as arguments.&lt;/p&gt;

&lt;p&gt;But why are we reshaping at all? Well, the data that is input into the network will be in the form of vectors. The image will have been saved along with lots of other images as single lines of a larger file. This is the case with the MNIST dataset and is common in machine learning. So we need to put it back into image-form so that we can perform convolutions.&lt;/p&gt;

&lt;p&gt;&amp;ldquo;Where are those random 7s and the -1 from?&amp;rdquo;&amp;hellip; good question. In this example, we are going to be using the MNIST dataset, whose images are 28 x 28. If we put this through 2 pooling layers, the width will halve (14 x 14) and halve again (7 x 7). Thus the layer needs to know what to expect as input, which will be a 7 x 7 x &lt;code&gt;numK&lt;/code&gt; tensor: one 7 x 7 map for each kernel. Keep in mind that we will be running the network with more than one input image at a time, so in reality when we get to this stage there will be &lt;code&gt;n&lt;/code&gt; images here, each with 7 x 7 x &lt;code&gt;numK&lt;/code&gt; values associated with it. The -1 simply tells TensorFlow to take &lt;em&gt;all&lt;/em&gt; of these images and do the same to each. It&amp;rsquo;s shorthand for &amp;ldquo;do this for the whole batch&amp;rdquo;.&lt;/p&gt;
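&lt;p&gt;The arithmetic is easy to check by hand, and &lt;code&gt;numpy&lt;/code&gt;&amp;rsquo;s &lt;code&gt;reshape&lt;/code&gt; uses the same -1 convention, so we can see it in action without a TF graph:&lt;/p&gt;

```python
import numpy as np

size, numK = 28, 16
for _ in range(2):       # each 'SAME' conv keeps the size; each stride-2 pool halves it
    size = size // 2
print(size)              # 7

batch = np.zeros((5, size, size, numK))       # a batch of 5 images after the second pool
flat = batch.reshape(-1, size * size * numK)  # -1 means 'infer the batch dimension'
print(flat.shape)        # (5, 784)
```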

&lt;p&gt;There&amp;rsquo;s also a &lt;code&gt;tf.layers.dense&lt;/code&gt; method at the end here. This is one of TF&amp;rsquo;s in-built layer types that is very handy. We just tell it what to take as input, how many units we want it to have and what non-linearity we would prefer at the end. Instead of typing this all separately, it&amp;rsquo;s combined into a single line. Neat!&lt;/p&gt;

&lt;p&gt;But what about the &lt;code&gt;conv&lt;/code&gt; and &lt;code&gt;pool&lt;/code&gt; layers? Well, to keep the code nice and tidy, I like to write the convolution and pooling layers in separate functions. This means that if I want to add more &lt;code&gt;conv&lt;/code&gt; or &lt;code&gt;pool&lt;/code&gt; layers, I can just write them in underneath the current ones and the code will still look clean (not that the functions are very long). Here they are:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def doConv(inputs):
    convOut = tf.layers.conv2d(inputs=inputs, filters=numK, kernel_size=[sizeConvK, sizeConvK], \
    	padding=&amp;quot;SAME&amp;quot;, activation=tf.nn.relu)    
    return convOut
    
def doPool(inputs):
    poolOut = tf.layers.max_pooling2d(inputs=inputs, pool_size=[sizePoolK, sizePoolK], strides=2)
    return poolOut
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Again, both the &lt;code&gt;conv&lt;/code&gt; and &lt;code&gt;pool&lt;/code&gt; layers are simple one-liners. They both take in some input data and need to know the size of the kernel you want them to use (which we defined earlier on). The &lt;code&gt;conv&lt;/code&gt; layer needs to know how many &lt;code&gt;filters&lt;/code&gt; to learn too. Alongside this, we need to take care of any mis-match between the image size and the size of the kernels to ensure that we&amp;rsquo;re not changing the size of the image when we get the output. This is easily done in TF by setting the &lt;code&gt;padding&lt;/code&gt; attribute to &lt;code&gt;&amp;quot;SAME&amp;quot;&lt;/code&gt;. We&amp;rsquo;ve got our non-linearity at the end here too. We&amp;rsquo;ve hard-coded &lt;code&gt;strides=2&lt;/code&gt;, so the image will halve in size at each pooling layer.&lt;/p&gt;

&lt;p&gt;Now we have the main part of our network coded up. But it won&amp;rsquo;t do very much unless we ask TF to give us some outputs and compare them to some training data.&lt;/p&gt;

&lt;p&gt;As the MNIST data is used for image-classification problems, we&amp;rsquo;ll be trying to get the network to output the probability that a given image belongs to each specific class i.e. a digit 0-9. The MNIST labels are the plain digits 0-9 which, if fed to the network directly, would encourage it to output arbitrary decimal guesses like 0.143, 4.765 or 8.112. Instead, we want each class to have its own slot to which the network can assign a probability. We use the idea of &amp;lsquo;one-hot&amp;rsquo; labels for this. For example, class 3 becomes [0 0 0 1 0 0 0 0 0 0] and class 9 becomes [0 0 0 0 0 0 0 0 0 1]. This way we&amp;rsquo;re not asking the network to predict the number associated with each class but rather how likely the test image is to belong to each class.&lt;/p&gt;
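&lt;p&gt;As a quick illustration (a NumPy sketch for clarity, not TF code), here is the same one-hot conversion done by hand:&lt;/p&gt;

```python
import numpy as np

# NumPy sketch of one-hot encoding: each label becomes a row of zeros
# with a single 1 in the slot for its class.
def one_hot(labels, depth=10):
    out = np.zeros((len(labels), depth), dtype=np.int32)
    out[np.arange(len(labels)), labels] = 1
    return out

print(one_hot([3, 9]))
# [[0 0 0 1 0 0 0 0 0 0]
#  [0 0 0 0 0 0 0 0 0 1]]
```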

&lt;p&gt;TF has a very handy function for changing class labels into &amp;lsquo;one-hot&amp;rsquo; labels. Let&amp;rsquo;s continue coding our graph in the &lt;code&gt;convNet&lt;/code&gt; function.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;     #Get the output in the form of one-hot labels with x units
    logits = tf.layers.dense(inputs=dense, units=10) 
    
    loss = None
    train_op = None
    #At the end of the network, check how well we did     
    if mode != learn.ModeKeys.INFER:
        #create one-hot labels from the training-labels
        onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=10)
        #check how close the output is to the training-labels
        loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)
    
    #After checking the loss, use it to train the network weights   
    if mode == learn.ModeKeys.TRAIN:
        train_op = tf.contrib.layers.optimize_loss(loss=loss, global_step=tf.contrib.framework.get_global_step(), \
            learning_rate=learning_rate, optimizer=&amp;quot;SGD&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;logits&lt;/code&gt; here is the output of the network which corresponds to the 10 classes of the training labels. The next two sections check whether we should be training the weights right now, or checking how well we&amp;rsquo;ve done. First we check our progress: we use &lt;code&gt;tf.one_hot&lt;/code&gt; to create the one-hot labels from the numeric training labels given to the network in &lt;code&gt;labels&lt;/code&gt;. We&amp;rsquo;ve performed a &lt;code&gt;tf.cast&lt;/code&gt; operation to make sure that the data is of the correct type before doing the conversion.&lt;/p&gt;

&lt;p&gt;Our loss-function is an important part of a CNN (or any machine learning algorithm). There are many different loss functions already built into TensorFlow, from the simple &lt;code&gt;absolute_difference&lt;/code&gt; to more complex functions like our &lt;code&gt;softmax_cross_entropy&lt;/code&gt;. We won&amp;rsquo;t delve into how this is calculated; just know that we can pick any suitable loss function, and more advanced users can write their own. The loss function takes in the output of the network &lt;code&gt;logits&lt;/code&gt; and compares it to our &lt;code&gt;onehot_labels&lt;/code&gt;.&lt;/p&gt;
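&lt;p&gt;For the curious, here&amp;rsquo;s a hand-rolled NumPy sketch of what a softmax cross-entropy loss computes. This illustrates the idea only; it is not the TF implementation:&lt;/p&gt;

```python
import numpy as np

# Sketch of softmax cross-entropy: turn logits into probabilities,
# then penalise with -log(probability assigned to the true class).
def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stabilised
    return e / e.sum(axis=1, keepdims=True)

def softmax_cross_entropy(onehot_labels, logits):
    probs = softmax(logits)
    losses = -np.sum(onehot_labels * np.log(probs), axis=1)
    return losses.mean()   # average over the batch

# Uniform logits over 10 classes give a loss of ln(10), about 2.3 --
# the ballpark you'd expect from an untrained 10-class network.
print(softmax_cross_entropy(np.eye(10)[[3]], np.zeros((1, 10))))
```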

&lt;p&gt;When this is done, we ask TF to perform some updating or &amp;lsquo;optimisation&amp;rsquo; of the network based on the loss that we just calculated. The &lt;code&gt;train_op&lt;/code&gt; is the name the TF support documents give to the operation that updates the fundamentals of the network and its values. Our &lt;code&gt;train_op&lt;/code&gt; here is a simple loss-optimiser that tries to find the minimum loss for our data. As with all machine learning algorithms, the parameters of this optimiser are the subject of much research. Using a pre-built optimiser such as those included with TF will help ensure that your network performs efficiently and trains as quickly as possible. The &lt;code&gt;learning_rate&lt;/code&gt; can be set as a variable at the beginning of our script along with the other parameters. We tend to stick with &lt;code&gt;0.001&lt;/code&gt; to begin with and move in orders of magnitude if we need to e.g. &lt;code&gt;0.01&lt;/code&gt; or &lt;code&gt;0.0001&lt;/code&gt;. Just like the loss functions, there are a number of optimisers to choose from, and the more complex ones may take longer per step. For our purposes on the MNIST dataset, simple stochastic gradient descent (&lt;code&gt;SGD&lt;/code&gt;) will suffice.&lt;/p&gt;
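&lt;p&gt;Under the hood, a single SGD update is nothing mysterious. As a hedged sketch (plain NumPy, not what TF literally executes), each weight is nudged against its gradient, scaled by the learning rate:&lt;/p&gt;

```python
import numpy as np

# Minimal sketch of one SGD update step: w_new = w - learning_rate * grad
def sgd_step(weights, grads, learning_rate=0.001):
    return weights - learning_rate * grads

w = np.array([0.5, -0.2])
g = np.array([10.0, -10.0])   # pretend gradients of the loss w.r.t. w
print(sgd_step(w, g))         # -> [ 0.49 -0.19]
```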

&lt;p&gt;Notice that we are just giving TF some instructions: take my network, calculate the loss and do some optimisation based on that loss.&lt;/p&gt;

&lt;p&gt;We are going to want to show what the network has learned, so we output the current predictions by defining a dictionary of data: the predicted classes and the associated probabilities (found by taking the softmax of the logits tensor).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;predictions ={&amp;quot;classes&amp;quot;: tf.argmax(input=logits, axis=1), &amp;quot;probabilities&amp;quot;: tf.nn.softmax(logits, name=&amp;quot;softmax_tensor&amp;quot;)}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can finish off our graph by making sure it returns the data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;return model_fn_lib.ModelFnOps(mode=mode, predictions=predictions, loss=loss, train_op=train_op)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A &lt;code&gt;ModelFnOps&lt;/code&gt; object is returned that contains the current mode of the network (training or inference), the current predictions, the loss and the &lt;code&gt;train_op&lt;/code&gt; that we use to train the network.&lt;/p&gt;

&lt;h3 id=&#34;setup&#34;&gt;Setting up the Script&lt;/h3&gt;

&lt;p&gt;Now that the graph has been constructed, we need to call it and tell TF to do the training. First, let&amp;rsquo;s take a moment to load the data that we will be using. The MNIST dataset has its own loading method within TF (handy!). Let&amp;rsquo;s define the main body of our script:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def main(unused_argv):
    # Load training and eval data
    mnist = learn.datasets.load_dataset(&amp;quot;mnist&amp;quot;)
    train_data = mnist.train.images # Returns np.array
    train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
    eval_data = mnist.test.images # Returns np.array
    eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next, we create the classifier that will hold the network and all of its data. We have to tell it what our graph is called under &lt;code&gt;model_fn&lt;/code&gt; and where we would like our output stored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you use the &lt;code&gt;/tmp&lt;/code&gt; directory in Linux you will probably find that the model will no longer be there if you restart your computer. If you intend to reload and use your model later on, be sure to save it in a more convenient place.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    mnistClassifier = learn.Estimator(model_fn=convNet, model_dir=&amp;quot;/tmp/mln_MNIST&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We will want to get some information out of our network that tells us about the training performance. For example, we can create a dictionary that will hold the probabilities from the key that we named &amp;lsquo;softmax_tensor&amp;rsquo; in the graph. How often we save this information is controlled with the &lt;code&gt;every_n_iter&lt;/code&gt; attribute. We add this to the &lt;code&gt;tf.train.LoggingTensorHook&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    tensors2log = {&amp;quot;probabilities&amp;quot;: &amp;quot;softmax_tensor&amp;quot;}
    logging_hook = tf.train.LoggingTensorHook(tensors=tensors2log, every_n_iter=100)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally! Let&amp;rsquo;s get TF to actually train the network. We call the &lt;code&gt;.fit&lt;/code&gt; method of the classifier that we created earlier. We pass it the training data and the labels along with the batch size (i.e. how much of the training data we want to use in each iteration). Bear in mind that even though the MNIST images are very small, there are 60,000 of them, which may put a strain on your RAM. We also need to say what the maximum number of iterations we&amp;rsquo;d like TF to perform is and add that we want to &lt;code&gt;monitor&lt;/code&gt; the training by outputting the data we&amp;rsquo;ve requested in &lt;code&gt;logging_hook&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    mnistClassifier.fit(x=train_data, y=train_labels, batch_size=100, steps=1000, monitors=[logging_hook])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When the training is complete, we&amp;rsquo;d like TF to take some test-data and tell us how well the network performs. So we create a special metrics dictionary that TF will populate by calling the &lt;code&gt;.evaluate&lt;/code&gt; method of the classifier.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    metrics = {&amp;quot;accuracy&amp;quot;: learn.MetricSpec(metric_fn=tf.metrics.accuracy, prediction_key=&amp;quot;classes&amp;quot;)}
    
    eval_results = mnistClassifier.evaluate(x=eval_data, y=eval_labels, metrics=metrics)
    print(eval_results)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this case, we&amp;rsquo;ve chosen to find the accuracy of the classifier by using the &lt;code&gt;tf.metrics.accuracy&lt;/code&gt; value for the &lt;code&gt;metric_fn&lt;/code&gt;. We also need to tell the evaluator that it&amp;rsquo;s the &amp;lsquo;classes&amp;rsquo; key we&amp;rsquo;re looking at in the graph. This is then passed to the evaluator along with the test data.&lt;/p&gt;
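&lt;p&gt;The accuracy metric itself is simple to state. As a NumPy sketch (for illustration; TF computes this for us):&lt;/p&gt;

```python
import numpy as np

# Accuracy: the fraction of predicted classes (argmax over the logits)
# that match the true labels.
def accuracy(logits, labels):
    predicted = np.argmax(logits, axis=1)
    return np.mean(predicted == np.asarray(labels))

logits = np.array([[0.1, 2.0, 0.3],
                   [1.5, 0.2, 0.1],
                   [0.1, 0.2, 3.0]])
print(accuracy(logits, [1, 0, 1]))  # 2 of 3 predictions are correct
```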

&lt;h3 id=&#34;running&#34;&gt;Running the Network&lt;/h3&gt;

&lt;p&gt;Adding the final main function to the script and making sure we&amp;rsquo;ve done all the necessary includes, we can run the program. The full script can be found &lt;a href=&#34;/docs/tfCNNMNIST.py&#34; title=&#34;TFCNNMNIST.py&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the current configuration, running the network for 1,000 iterations gave me an output of:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;{&#39;loss&#39;: 1.9025836, &#39;global_step&#39;: 1000, &#39;accuracy&#39;: 0.64929998}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Definitely not a great accuracy for the MNIST dataset! We could just run this for longer and would likely see an increase in accuracy. Instead, let&amp;rsquo;s make some of the easy tweaks to our network that we&amp;rsquo;ve described before: dropout and batch normalisation.&lt;/p&gt;

&lt;p&gt;In our graph, we want to add:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    dense = tf.contrib.layers.batch_norm(dense, decay=0.99, is_training= mode==learn.ModeKeys.TRAIN)
    dense = tf.layers.dropout(inputs=dense, rate=keepProb, training = mode==learn.ModeKeys.TRAIN)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This layer &lt;a href=&#34;https://www.tensorflow.org/api_docs/python/tf/contrib/layers/batch_norm&#34; title=&#34;tf.contrib.layers.batch_norm&#34;&gt;has many different attributes&lt;/a&gt;. Its functionality is taken from &lt;a href=&#34;https://arxiv.org/abs/1502.03167&#34; title=&#34;Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift&#34;&gt;the paper by Ioffe and Szegedy (2015)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The dropout layer&amp;rsquo;s &lt;code&gt;keepProb&lt;/code&gt; is defined in the hyperparameter preamble to the script; it is another value that can be changed to improve the performance of the network. Both of these lines are in the final script &lt;a href=&#34;/docs/tfCNNMNIST.py&#34; title=&#34;tfCNNMNIST.py&#34;&gt;available here&lt;/a&gt;, just uncomment them.&lt;/p&gt;
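&lt;p&gt;To make the dropout mechanics concrete, here&amp;rsquo;s a NumPy sketch of &amp;lsquo;inverted&amp;rsquo; dropout. One caution worth knowing: &lt;code&gt;tf.layers.dropout&lt;/code&gt; interprets its &lt;code&gt;rate&lt;/code&gt; argument as the fraction of units to &lt;em&gt;drop&lt;/em&gt;, not to keep, so a variable holding a keep-probability would need to be passed as &lt;code&gt;1 - keepProb&lt;/code&gt;:&lt;/p&gt;

```python
import numpy as np

# NumPy sketch of inverted dropout. `rate` is the fraction DROPPED
# (the tf.layers.dropout convention); survivors are scaled up by
# 1/(1 - rate) so the expected activation stays the same.
def dropout(x, rate, rng):
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones(8)
print(dropout(x, 0.5, rng))   # roughly half the entries zeroed, the rest scaled to 2.0
```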

&lt;p&gt;If we re-run the script, it will automatically load the most recent state of the network (clever TensorFlow!) but&amp;hellip; it will fail because the checkpoint does not include the two new layers in its graph. So we must either delete our &lt;code&gt;/tmp/mln_MNIST&lt;/code&gt; folder, or give the classifier a new &lt;code&gt;model_dir&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Doing this and rerunning for the same 1,000 iterations, accuracy jumps from around 0.65 to 0.92, roughly a 40% relative increase:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;{&#39;loss&#39;: 0.29391664, &#39;global_step&#39;: 1000, &#39;accuracy&#39;: 0.91680002}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Simply changing the optimiser to use the &amp;ldquo;Adam&amp;rdquo; rather than &amp;ldquo;SGD&amp;rdquo; optimiser yields:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;{&#39;loss&#39;: 0.040745325, &#39;global_step&#39;: 1000, &#39;accuracy&#39;: 0.98500001}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And running for slightly longer (20,000 iterations):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;{&#39;loss&#39;: 0.046967514, &#39;global_step&#39;: 20000, &#39;accuracy&#39;: 0.99129999}
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;conclusion&#34;&gt; Conclusion &lt;/h2&gt;

&lt;p&gt;TensorFlow takes away the tedium of having to write out the full code for each individual layer and is able to perform optimisation and evaluation with minimal effort.&lt;/p&gt;

&lt;p&gt;If you look around online, you will see many methods for using TF that will get you similar results. I actually prefer methods that are a little more explicit. The tutorial on Google&amp;rsquo;s site, for example, leaves some room for us to include more logging features.&lt;/p&gt;

&lt;p&gt;In future posts, we will look more into logging and TensorBoard, but for now, happy coding!&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Convolutional Neural Networks - Basics</title>
      <link>/post/CNN1/</link>
      <pubDate>Fri, 07 Apr 2017 09:46:56 +0100</pubDate>
      
      <guid>/post/CNN1/</guid>
      <description>&lt;p&gt;This series will give some background to CNNs, their architecture, coding and tuning. In particular, this tutorial covers some of the background to CNNs and Deep Learning. We won&amp;rsquo;t go over any coding in this session, but that will come in the next one. What&amp;rsquo;s the big deal about CNNs? What do they look like? Why do they work? Find out in this tutorial.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;h2 id=&#34;intro&#34;&gt;  Introduction &lt;/h2&gt;

&lt;p&gt;A convolutional neural network (CNN) is very much related to the standard NN we&amp;rsquo;ve previously encountered. I found that when I searched for the link between the two, there seemed to be no natural progression from one to the other in terms of tutorials. It would seem that CNNs were developed in the late 1980s and then forgotten about due to the lack of processing power. In fact, it wasn&amp;rsquo;t until the advent of cheap, but powerful GPUs (graphics cards) that the research on CNNs and Deep Learning in general was given new life. Thus you&amp;rsquo;ll find an explosion of papers on CNNs in the last 3 or 4 years.&lt;/p&gt;

&lt;p&gt;Nonetheless, the research that has been churned out is &lt;em&gt;powerful&lt;/em&gt;. CNNs are used in so many applications now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Object recognition in images and videos (think image-search in Google, tagging friends faces in Facebook, adding filters in Snapchat and tracking movement in Kinect)&lt;/li&gt;
&lt;li&gt;Natural language processing (speech recognition in Google Assistant or Amazon&amp;rsquo;s Alexa)&lt;/li&gt;
&lt;li&gt;Playing games (the recent &lt;a href=&#34;https://en.wikipedia.org/wiki/AlphaGo&#34; title=&#34;AlphaGo on Wiki&#34;&gt;defeat of the world &amp;lsquo;Go&amp;rsquo; champion&lt;/a&gt; by DeepMind at Google)&lt;/li&gt;
&lt;li&gt;Medical innovation (from drug discovery to prediction of disease)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Despite the differences between these applications and the ever-increasing sophistication of CNNs, they all start out in the same way. Let&amp;rsquo;s take a look.&lt;/p&gt;

&lt;h2 id=&#34;deep&#34;&gt;  CNN or Deep Learning? &lt;/h2&gt;

&lt;p&gt;
The &#34;deep&#34; part of deep learning comes in a couple of places: the number of layers and the number of features. Firstly, as one may expect, there are usually more layers in a deep learning framework than in your average multi-layer perceptron or standard neural network. We have some architectures that are 150 layers deep. Secondly, each layer of a CNN will learn multiple &#39;features&#39; (multiple sets of weights) that connect it to the previous layer; so in this sense it&#39;s much deeper than a normal neural net too. In fact, some powerful neural networks, even CNNs, only consist of a few layers. So the &#39;deep&#39; in DL acknowledges that each layer of the network learns multiple features. More on this later.
&lt;/p&gt;&lt;p&gt;
Often you may see a conflation of CNNs with DL, but the concept of DL comes some time before CNNs were first introduced. Connecting multiple neural networks together, altering the directionality of their weights and stacking such machines all gave rise to the increasing power and popularity of DL.
&lt;/p&gt;&lt;p&gt;
We won&#39;t delve too deeply into history or mathematics in this tutorial, but if you want to know the timeline of DL in more detail, I&#39;d suggest the paper &#34;On the Origin of Deep Learning&#34; (Wang and Raj 2016) available &lt;a href=&#34;https://t.co/aAw4rEpZEt&#34; title=&#34;On the Origin of Deep Learning&#34;&gt;here&lt;/a&gt;. It&#39;s a lengthy read - 72 pages including references - but shows the logic between progressive steps in DL.
&lt;/p&gt;&lt;p&gt;
As with the study of neural networks, the inspiration for CNNs came from nature: specifically, the visual cortex. It drew upon the idea that the neurons in the visual cortex focus upon different sized patches of an image getting different levels of information in different layers. If a computer could be programmed to work in this way, it may be able to mimic the image-recognition power of the brain. So how can this be done?
&lt;/p&gt;

&lt;p&gt;A CNN takes as input an array, or image (2D or 3D, grayscale or colour) and tries to learn the relationship between this image and some target data e.g. a classification. By &amp;lsquo;learn&amp;rsquo; we are still talking about weights just like in a regular neural network. The difference in CNNs is that these weights connect small subsections of the input to each of the different neurons in the first layer. Fundamentally, there are multiple neurons in a single layer that each have their own weights to the same subsection of the input. These different sets of weights are called &amp;lsquo;kernels&amp;rsquo;.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s important at this stage to make sure we understand this weight or kernel business, because it&amp;rsquo;s the whole point of the &amp;lsquo;convolution&amp;rsquo; bit of the CNN.&lt;/p&gt;

&lt;h2 id=&#34;kernels&#34;&gt; Convolution and Kernels &lt;/h2&gt;

&lt;p&gt;Convolution is something that should be taught in schools along with addition and multiplication - it&amp;rsquo;s &lt;a href=&#34;https://en.wikipedia.org/wiki/Convolution&#34; title=&#34;Convolution on Wiki&#34;&gt;just another mathematical operation&lt;/a&gt;. Perhaps the reason it&amp;rsquo;s not is that it&amp;rsquo;s a little more difficult to visualise.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s say we have a pattern or a stamp that we want to repeat at regular intervals on a sheet of paper. A very convenient way to do this is to perform a convolution of the pattern with a regular grid on the paper. Think about hovering the stamp (or kernel) above the paper and moving it along a grid before pushing it into the page at each interval.&lt;/p&gt;

&lt;p&gt;This idea of wanting to repeat a pattern (kernel) across some domain comes up a lot in the realm of signal processing and computer vision. In fact, if you&amp;rsquo;ve ever used a graphics package such as Photoshop, Inkscape or GIMP, you&amp;rsquo;ll have seen many kernels before. The list of &amp;lsquo;filters&amp;rsquo; such as &amp;lsquo;blur&amp;rsquo;, &amp;lsquo;sharpen&amp;rsquo; and &amp;lsquo;edge-detection&amp;rsquo; are all done with a convolution of a kernel or filter with the image that you&amp;rsquo;re looking at.&lt;/p&gt;

&lt;p&gt;For example, let&amp;rsquo;s find the outline (edges) of the image &amp;lsquo;A&amp;rsquo;.&lt;/p&gt;

&lt;div style=&#34;text-align:center; display:inline-block; width:100%; margin:auto;&#34;&gt;
&lt;img title=&#34;Android&#34; src=&#34;/img/CNN/android.png&#34;&gt;&lt;br&gt;
&lt;b&gt;A&lt;/b&gt;
&lt;/div&gt;

&lt;p&gt;We can use a kernel, or set of weights, like the ones below.&lt;/p&gt;

&lt;div style=&#34;width:100%; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:49%; margin:auto;min-width:155px;&#34;&gt;
&lt;img title=&#34;Horizontal Filter&#34; height=150 src=&#34;/img/CNN/horizFilter.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Finds horizontals&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; min-width:150px;display:inline-block; width:49%;margin:auto;&#34;&gt;
&lt;img title=&#34;Vertical Filter&#34; height=150 src=&#34;/img/CNN/vertFilter.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Finds verticals&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;
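&lt;p&gt;Written out as arrays, the classic 3 x 3 Sobel pair (the kind pictured above) looks like this in NumPy; note that one kernel is simply the transpose of the other:&lt;/p&gt;

```python
import numpy as np

# The classic 3 x 3 Sobel kernels: one responds to horizontal edges,
# the other (its transpose) to vertical edges.
sobel_horizontal = np.array([[-1, -2, -1],
                             [ 0,  0,  0],
                             [ 1,  2,  1]])
sobel_vertical = sobel_horizontal.T

print(sobel_vertical)
# [[-1  0  1]
#  [-2  0  2]
#  [-1  0  1]]
```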

&lt;p&gt;A kernel is placed in the top-left corner of the image. The pixel values covered by the kernel are multiplied with the corresponding kernel values and the products are summed. The result is placed in the new image at the point corresponding to the centre of the kernel. An example of this first step is shown in the diagram below. This takes the vertical Sobel filter (used for edge-detection) and applies it to the pixels of the image.&lt;/p&gt;

&lt;div style=&#34;text-align:center; display:inline-block; width:100%;margin:auto;&#34;&gt;
&lt;img title=&#34;Conv Example&#34; height=&#34;350&#34; src=&#34;/img/CNN/convExample.png&#34;&gt;&lt;br&gt;
&lt;b&gt;A step in the Convolution Process.&lt;/b&gt;
&lt;/div&gt;

&lt;p&gt;The kernel is moved over by one pixel and this process is repeated until all of the possible locations in the image are filtered, as below, this time for the horizontal Sobel filter. Notice that there is a border of empty values around the convolved image. This is because the result of convolution is placed at the centre of the kernel. To deal with this, a process called &amp;lsquo;padding&amp;rsquo; or, more commonly, &amp;lsquo;zero-padding&amp;rsquo; is used. This simply means that a border of zeros is placed around the original image to make it a pixel wider all around. The convolution is then done as normal, and the result will now be an image that is of equal size to the original.&lt;/p&gt;

&lt;div style=&#34;width:100%;margin:auto; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:45%;min-width:455px;margin:auto;&#34;&gt;
&lt;img title=&#34;Sobel Conv Gif&#34; height=&#34;450&#34; src=&#34;/img/CNN/convSobel.gif&#34;&gt;&lt;br&gt;
&lt;b&gt;The kernel is moved over the image performing the convolution as it goes.&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; display:inline-block; width:45%;min-width:450px;margin:auto;&#34;&gt;
&lt;img title=&#34;Zero Padding Conv&#34; height=&#34;450&#34; src=&#34;/img/CNN/convZeros.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Zero-padding is used so that the resulting image doesn&#39;t shrink.&lt;/b&gt;
&lt;/div&gt;
&lt;/div&gt;
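&lt;p&gt;The sweep-and-sum process above can be written in a few lines of NumPy. This is a sketch for clarity, not an efficient implementation (strictly speaking it computes cross-correlation, which is what most CNN libraries call &amp;lsquo;convolution&amp;rsquo; anyway):&lt;/p&gt;

```python
import numpy as np

# Zero-padded 'convolution': sweep the kernel over every position,
# multiply element-wise with the pixels underneath and sum.
def conv2d_same(image, kernel):
    k = kernel.shape[0]            # assume a square, odd-sized kernel
    pad = k // 2
    padded = np.pad(image, pad)    # the border of zeros keeps the output size equal
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

image = np.ones((4, 4))
identity = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
print(conv2d_same(image, identity))   # the identity kernel leaves the image unchanged
```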

&lt;p&gt;Now that we have our convolved image, we can use a colourmap to visualise the result. Here, I&amp;rsquo;ve just normalised the values between 0 and 255 so that I can apply a grayscale visualisation:&lt;/p&gt;

&lt;div style=&#34;text-align:center; display:inline-block; width:100%;margin:auto;&#34;&gt;
&lt;img title=&#34;Conv Result&#34; height=&#34;150&#34;src=&#34;/img/CNN/convResult.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Result of the convolution&lt;/b&gt;
&lt;/div&gt;

&lt;p&gt;This dummy example could represent the very bottom-left edge of the Android&amp;rsquo;s head and doesn&amp;rsquo;t really look like it&amp;rsquo;s detected anything. To see the proper effect, we need to scale this up so that we&amp;rsquo;re not looking at individual pixels. Performing the horizontal and vertical Sobel filtering on the full 264 x 264 image gives:&lt;/p&gt;

&lt;div style=&#34;width:100%;margin:auto; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block; min-width:100px;margin:auto;&#34;&gt;
&lt;img title=&#34;Horizontal Sobel&#34; src=&#34;/img/CNN/horiz.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Horizontal Sobel&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; display:inline-block; margin:auto;min-width:100px&#34;&gt;
&lt;img title=&#34;Vertical Sobel&#34; src=&#34;/img/CNN/vert.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Vertical Sobel&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; display:inline-block;margin:auto;min-width:100px&#34;&gt;
&lt;img title=&#34;Full Sobel&#34; src=&#34;/img/CNN/both.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Combined Sobel&lt;/b&gt;
&lt;/div&gt;  
&lt;/div&gt;

&lt;p&gt;In the combined image, we&amp;rsquo;ve added together the results from both filters to capture the horizontal and the vertical edges at once.&lt;/p&gt;

&lt;h3 id=&#34;relationship&#34;&gt; How does this feed into CNNs? &lt;/h3&gt;

&lt;p&gt;Clearly, convolution is powerful in finding the features of an image &lt;strong&gt;if&lt;/strong&gt; we already know the right kernel to use. Kernel design is an artform and has been refined over the last few decades to do some pretty amazing things with images (just look at the huge list in your graphics software!). But the important question is, what if we don&amp;rsquo;t know the features we&amp;rsquo;re looking for? Or what if we &lt;strong&gt;do&lt;/strong&gt; know, but we don&amp;rsquo;t know what the kernel should look like?&lt;/p&gt;

&lt;p&gt;Well, first we should recognise that every pixel in an image is a &lt;strong&gt;feature&lt;/strong&gt; and that means it represents an &lt;strong&gt;input node&lt;/strong&gt;. The result from each convolution is placed into the next layer in a &lt;strong&gt;hidden node&lt;/strong&gt;. Each feature or pixel of the convolved image is a node in the hidden layer.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve already said that each of these numbers in the kernel is a weight, and that weight is the connection between the feature of the input image and the node of the hidden layer. The kernel is swept across the image, so there are as many hidden nodes as there are input nodes (slightly fewer, in fact, unless we add zero-padding to the input image). This means that the hidden layer is also 2D like the input image. Sometimes, instead of moving the kernel over one pixel at a time, the &lt;strong&gt;stride&lt;/strong&gt;, as it&amp;rsquo;s called, can be increased. This will result in fewer nodes, or fewer pixels, in the convolved image. Consider it like this:&lt;/p&gt;

&lt;div style=&#34;width:100%;margin:auto; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block;margin:auto;min-width:300px;&#34;&gt;
&lt;img title=&#34;Hidden Layer Nodes&#34; height=300 src=&#34;/img/CNN/hiddenLayer.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Hidden Layer Nodes in a CNN&lt;/b&gt;
&lt;/div&gt;  
&lt;div style=&#34;text-align:center; display:inline-block;margin:auto;min-width:300px&#34;&gt;
&lt;img title=&#34;Hidden Layer after Increased Stride&#34; height=225 src=&#34;/img/CNN/strideHidden.png&#34;&gt;&lt;br&gt;
&lt;b&gt;Increased stride means fewer hidden-layer nodes&lt;/b&gt;
&lt;/div&gt;  
&lt;/div&gt;
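&lt;p&gt;The number of hidden nodes per dimension follows the standard formula: for an input of size N, kernel size K, zero-padding P and stride S, the output size is (N - K + 2P)/S + 1. A tiny sketch with illustrative numbers:&lt;/p&gt;

```python
# Output size of a convolved dimension: (N - K + 2P) // S + 1
# for input size n, kernel size k, zero-padding p and stride s.
def conv_output_size(n, k, p=0, s=1):
    return (n - k + 2 * p) // s + 1

print(conv_output_size(28, 3, p=1, s=1))  # stride 1: 28, size preserved
print(conv_output_size(28, 3, p=1, s=2))  # stride 2: 14, fewer hidden nodes
```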

&lt;p&gt;These weights that connect to the nodes need to be learned in exactly the same way as in a regular neural network. The image is passed through these nodes (by being convolved with the weights a.k.a the kernel) and the result is compared to some output (the error of which is then backpropagated and optimised).&lt;/p&gt;

&lt;p&gt;In reality, it isn&amp;rsquo;t just the weights or the kernel for one 2D set of nodes that has to be learned, there is a whole array of nodes which all look at the same area of the image (sometimes, but possibly incorrectly, called the &lt;strong&gt;receptive field&lt;/strong&gt;*). Each of the nodes in this row (or &lt;strong&gt;fibre&lt;/strong&gt;) tries to learn different kernels (different weights) that will show up some different features of the image, like edges. So the hidden-layer may look something more like this:&lt;/p&gt;

&lt;p&gt;* &lt;em&gt;Note: we&amp;rsquo;ll talk more about the receptive field after looking at the pooling layer below&lt;/em&gt;&lt;/p&gt;

&lt;div style=&#34;width:100%;margin:auto; text-align:center;&#34;&gt;
&lt;div style=&#34;text-align:center; display:inline-block;margin:auto;min-width:100px&#34;&gt;
&lt;img title=&#34;Multiple Kernel Hidden Layer&#34; height=350 src=&#34;/img/CNN/deepConv.png&#34;&gt;&lt;br&gt;
&lt;b&gt;For a 2D image learning a set of kernels.&lt;/b&gt;
&lt;/div&gt;
&lt;div style=&#34;text-align:center; display:inline-block;margin:auto;min-width:100px&#34;&gt;
&lt;img title=&#34;3 Channel Image&#34; height=350 src=&#34;/img/CNN/deepConv3.png&#34;&gt;&lt;br&gt;
&lt;b&gt;For a 3 channel RGB image the kernel becomes 3D.&lt;/b&gt; 
&lt;/div&gt;
&lt;/div&gt;  

&lt;p&gt;Now &lt;strong&gt;this&lt;/strong&gt; is why deep learning is called &lt;strong&gt;deep&lt;/strong&gt; learning. Each hidden layer of the convolutional neural network is capable of learning a large number of kernels. The output from this hidden-layer is passed to more layers which are able to learn their own kernels based on the &lt;em&gt;convolved&lt;/em&gt; image output from this layer (after some pooling operation to reduce the size of the convolved output). This is what gives the CNN the ability to see the edges of an image and build them up into larger features.&lt;/p&gt;

&lt;h2 id=&#34;CNN Architecture&#34;&gt;  CNN Architecture &lt;/h2&gt;

&lt;p&gt;It is the &lt;em&gt;architecture&lt;/em&gt; of a CNN that gives it its power. A lot of papers that are published on CNNs tend to be about a new architecture i.e. the number and ordering of different layers and how many kernels are learnt. Though often it&amp;rsquo;s the clever tricks applied to older architectures that really give a network its power. Let&amp;rsquo;s take a look at the other layers in a CNN.&lt;/p&gt;

&lt;h2 id=&#39;layers&#39;&gt; Layers &lt;/h2&gt;

&lt;h3 id=&#34;input&#34;&gt;  Input Layer &lt;/h3&gt;

&lt;p&gt;The input image is placed into this layer. It can be a single-layer 2D image (grayscale), 2D 3-channel image (RGB colour) or 3D. The main difference between how the inputs are arranged comes in the formation of the expected kernel shapes. Kernels need to be learned that are the same depth as the input i.e. 5 x 5 x 3 for a 2D RGB image with dimensions of 5 x 5.&lt;/p&gt;

&lt;p&gt;Inputs to a CNN seem to work best when they&amp;rsquo;re of certain dimensions. This is because of the behaviour of the convolution. Depending on the &lt;em&gt;stride&lt;/em&gt; of the kernel and the subsequent &lt;em&gt;pooling layers&lt;/em&gt;, the outputs may become an &amp;ldquo;illegal&amp;rdquo; size including half-pixels. We&amp;rsquo;ll look at this in the &lt;em&gt;pooling layer&lt;/em&gt; section.&lt;/p&gt;

&lt;h3 id=&#34;convolution&#34;&gt;  Convolutional Layer &lt;/h3&gt;

&lt;p&gt;We&amp;rsquo;ve &lt;a href=&#34;#kernels&#34; title=&#34;Convolution and Kernels&#34;&gt;already looked at what the conv layer does&lt;/a&gt;. Just remember that it takes in an image e.g. [56 x 56 x 3] and, assuming a stride of 1 and zero-padding, will produce an output of [56 x 56 x 32] if 32 kernels are being learnt. It&amp;rsquo;s important to note that the order of these dimensions can be important during the implementation of a CNN in Python. This is because there&amp;rsquo;s a lot of matrix multiplication going on!&lt;/p&gt;

&lt;h3 id=&#34;nonlinear&#34;&gt; Non-linearity&lt;/h3&gt;

&lt;p&gt;The &amp;lsquo;non-linearity&amp;rsquo; here isn&amp;rsquo;t its own distinct layer of the CNN, but comes as part of the convolution layer as it is done on the output of the neurons (just like a normal NN). By this, we mean &amp;ldquo;don&amp;rsquo;t take the data forwards as it is (linearity); let&amp;rsquo;s do something to it (non-linearity) that will help us later on&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;In our neural network tutorials we looked at different &lt;a href=&#34;/post/transfer-functions&#34; title=&#34;Transfer Functions&#34;&gt;activation functions&lt;/a&gt;. These each provide a different mapping of the input to an output, either to [-1, 1], [0, 1] or some other domain, e.g. the Rectified Linear Unit thresholds the data at 0: max(0, x). The &lt;em&gt;ReLU&lt;/em&gt; is very popular as it doesn&amp;rsquo;t require any expensive computation and it&amp;rsquo;s been &lt;a href=&#34;http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf&#34; title=&#34;Krizhevsky et al 2012&#34;&gt;shown to speed up the convergence of stochastic gradient descent algorithms&lt;/a&gt;.&lt;/p&gt;
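&lt;p&gt;As a sketch, the ReLU is a one-liner in numpy, which is part of why it&amp;rsquo;s so cheap to compute:&lt;/p&gt;

```python
import numpy as np

def relu(x):
    # threshold at zero: max(0, x) element-wise
    return np.maximum(0, x)

relu(np.array([-2.0, -0.5, 0.0, 1.5]))  # -> array([0. , 0. , 0. , 1.5])
```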

&lt;h3 id=&#34;pool&#34;&gt;  Pooling Layer &lt;/h3&gt;

&lt;p&gt;The pooling layer is key to making sure that the subsequent layers of the CNN are able to pick up larger-scale detail than just edges and curves. It does this by merging pixel regions in the convolved image together (shrinking the image) before attempting to learn kernels on it. Effectively, this stage takes another kernel, say [2 x 2], and passes it over the entire image, just like in convolution. It is common to have the stride and kernel size equal, i.e. a [2 x 2] kernel has a stride of 2. This example will &lt;em&gt;halve&lt;/em&gt; the size of the convolved image. The number of feature-maps produced by the learned kernels will remain the same as &lt;strong&gt;pooling&lt;/strong&gt; is done on each one in turn. Thus the pooling layer returns an array with the same depth as the convolution layer. The figure below shows the principle.&lt;/p&gt;

&lt;div style=&#34;text-align:center; display:inline-block; width:100%;margin:auto;&#34;&gt;
&lt;img title=&#34;Pooling&#34; height=350 src=&#34;/img/CNN/poolfig.gif&#34;&gt;&lt;br&gt;
&lt;b&gt;Max-pooling: Pooling using a &#34;max&#34; filter with stride equal to the kernel size&lt;/b&gt;
&lt;/div&gt;  
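&lt;p&gt;A minimal numpy sketch of max-pooling with a [2 x 2] kernel and a stride of 2, using a reshape trick rather than an explicit sliding window:&lt;/p&gt;

```python
import numpy as np

def max_pool(img, k=2):
    """Max-pool a 2D array with a [k x k] kernel and stride k."""
    h, w = img.shape
    # crop so the dimensions divide evenly by the pool size
    img = img[:h - h % k, :w - w % k]
    # group pixels into [k x k] blocks, then take the max of each block
    return img.reshape(img.shape[0] // k, k, img.shape[1] // k, k).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
max_pool(x)  # -> [[ 5,  7], [13, 15]]: each output is the max of a 2x2 block
```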

&lt;h3 id=&#34;receptiveField&#34;&gt; A Note on the Receptive Field &lt;/h3&gt;

&lt;p&gt;This is quite an important, but sometimes neglected, concept. We said that the receptive field of a single neuron can be taken to mean the area of the image which it can &amp;lsquo;see&amp;rsquo;. Each neuron therefore has a different receptive field. While this is true, the full impact of it can only be understood when we see what happens after pooling.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s take an image of size [12 x 12] and a kernel size in the first conv layer of [3 x 3]. The output of the conv layer (assuming zero-padding and stride of 1) is going to be [12 x 12 x 10] if we&amp;rsquo;re learning 10 kernels. After pooling with a [3 x 3] kernel (stride 3), we get an output of [4 x 4 x 10]. This gets fed into the next conv layer. Suppose the kernel in the second conv layer is [2 x 2], would we say that the receptive field here is also [2 x 2]? Well, some people do but, actually, no it&amp;rsquo;s not. In fact, a neuron in this layer is not just seeing the [2 x 2] area of the &lt;em&gt;pooled&lt;/em&gt; image, it is actually seeing an [8 x 8] area of the &lt;em&gt;original&lt;/em&gt; image: each pooled pixel summarises a [3 x 3] block of convolved pixels, and each of those convolved pixels itself sees a [3 x 3] patch of the original image (overlapping with its neighbours, remembering we had a stride of 1 in the first layer). Continuing this through the rest of the network, it is possible to end up with a final layer with a receptive field equal to the size of the original image. Understanding this gives us the real insight into how the CNN works, building up the image as it goes.&lt;/p&gt;
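&lt;p&gt;One way to keep track of this is to accumulate the receptive field layer by layer. A rough sketch, assuming each layer is described only by its kernel size and stride (the function name is illustrative):&lt;/p&gt;

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, first layer first."""
    rf = 1      # receptive field of a single input pixel
    jump = 1    # distance, in input pixels, between adjacent outputs
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# two stacked [3 x 3] convs with stride 1: each output sees 5 x 5 of the input
receptive_field([(3, 1), (3, 1)])  # -> 5
```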

&lt;h3 id=&#34;dense&#34;&gt;  Fully-connected (Dense) Layer&lt;/h3&gt;

&lt;p&gt;So this layer took me a while to figure out, despite its simplicity. If I take all of the, say, [3 x 3 x 64] featuremaps of my final pooling layer, I have 3 x 3 x 64 = 576 different values to feed forwards. I need to make sure that my training labels match with the outputs from my output layer. We may only have 10 possibilities in our output layer (say the digits 0 - 9 in the classic MNIST number classification task). Thus we want the final numbers in our output layer to be [10,] and the layer before this to be [? x 10] where the ? represents the number of nodes in the layer before: the fully-connected (FC) layer. If there were only 1 node in this layer, it would have 576 weights attached to it - one for each of the values coming from the previous pooling layer. This is not very useful as it won&amp;rsquo;t allow us to learn any combinations of these low-level features. Increasing the number of neurons to say 1,000 will allow the FC layer to provide many different combinations of features and learn a more complex non-linear function that represents the feature space. The number of nodes in this layer can be whatever we want it to be and isn&amp;rsquo;t constrained by any previous dimensions - this is the thing that kept confusing me when I looked at other CNNs. Sometimes two FC layers are used together; this just increases the possibility of learning a complex function. FC layers are 1D vectors.&lt;/p&gt;
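&lt;p&gt;The shapes above can be sketched in numpy. The sizes are just the ones used in this example, and the weight matrices here are random stand-ins, not trained values:&lt;/p&gt;

```python
import numpy as np

pooled = np.random.rand(3, 3, 64)        # final pooling-layer output
flat = pooled.reshape(-1)                # flatten: 3 * 3 * 64 = 576 values
W_fc = np.random.randn(1000, flat.size)  # FC layer with 1,000 nodes
fc = np.maximum(0, W_fc @ flat)          # 1,000 activations (ReLU'd)
W_out = np.random.randn(10, 1000)        # output layer: 10 classes
scores = W_out @ fc                      # one score per class, shape (10,)
```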

&lt;p&gt;However, FC layers act as &amp;lsquo;black boxes&amp;rsquo; and are notoriously uninterpretable. They&amp;rsquo;re also prone to overfitting, so &lt;strong&gt;dropout&lt;/strong&gt; is often performed (discussed below).&lt;/p&gt;

&lt;h4 id = &#34;fcConv&#34;&gt; Fully-connected as a Convolutional Layer &lt;/h4&gt;

&lt;p&gt;If the idea above doesn&amp;rsquo;t help you, let&amp;rsquo;s remove the FC layer and replace it with another convolutional layer. This is very simple - take the output from the pooling layer as before and apply a convolution to it with a kernel that is the same size as a featuremap in the pooling layer. For this to be of use, the input to the conv should be down to around [5 x 5] or [3 x 3] by making sure there have been enough pooling layers in the network. What does this achieve? By convolving a [3 x 3] image with a [3 x 3] kernel we get a 1 pixel output. There is no striding, just one convolution per featuremap. So our output from this layer will be a [1 x k] vector where &lt;em&gt;k&lt;/em&gt; is the number of featuremaps. This is very similar to the FC layer, except that each output from the conv is created from an individual featuremap rather than being connected to all of the featuremaps.&lt;/p&gt;

&lt;p&gt;But, isn&amp;rsquo;t this more weights to learn? Yes, so it isn&amp;rsquo;t usually done this way. Instead, we perform either &lt;em&gt;global average pooling&lt;/em&gt; or &lt;em&gt;global max pooling&lt;/em&gt; where the &lt;em&gt;global&lt;/em&gt; refers to a whole single feature map (not the whole set of feature maps). So we&amp;rsquo;re taking the average of all points in each feature map and repeating this for every feature map to get the [1 x k] vector as before. Note that the number of channels (kernels/features) in the last conv layer has to be equal to the number of outputs we want, or else we have to include an FC layer to change the [1 x k] vector to what we need.&lt;/p&gt;
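&lt;p&gt;In numpy terms, global average (or max) pooling is just a reduction over the spatial axes of each feature map, leaving the [1 x k] vector described above:&lt;/p&gt;

```python
import numpy as np

fmaps = np.random.rand(3, 3, 8)   # k = 8 feature maps, each [3 x 3]

gap = fmaps.mean(axis=(0, 1))     # global average pooling -> shape (8,)
gmp = fmaps.max(axis=(0, 1))      # global max pooling     -> shape (8,)
```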

&lt;p&gt;This can be powerful as we have represented a very large receptive field by a single pixel and also removed some spatial information, which allows us to take into account translations of the input. We&amp;rsquo;re able to say, if the value of the output is high, that all of the featuremaps visible to this output have activated enough to represent a &amp;lsquo;cat&amp;rsquo; or whatever it is we are training our network to learn.&lt;/p&gt;

&lt;h3 id=&#34;dropout&#34;&gt; Dropout Layer &lt;/h3&gt;

&lt;p&gt;The previously mentioned fully-connected layer is connected to all weights in the previous layer - this can be a very large number. As such, an FC layer is prone to &lt;em&gt;overfitting&lt;/em&gt;, meaning that the network won&amp;rsquo;t generalise well to new data. There are a number of techniques that can be used to reduce overfitting, though the most commonly seen in CNNs is the dropout layer, proposed by Hinton. As the name suggests, this causes the network to &amp;lsquo;drop&amp;rsquo; some nodes on each iteration with a particular probability. The &lt;em&gt;drop probability&lt;/em&gt; is between 0 and 1, most commonly around 0.2-0.5 it seems; the &lt;em&gt;keep probability&lt;/em&gt; is simply 1 minus this - the probability that a particular node survives a training iteration. When back propagation occurs, the weights connected to the dropped nodes are not updated. They are re-added for the next iteration before another set is chosen for dropout.&lt;/p&gt;
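&lt;p&gt;A sketch of dropout during training. This is the common &amp;lsquo;inverted&amp;rsquo; variant, which rescales the surviving nodes so that nothing needs to change at test time:&lt;/p&gt;

```python
import numpy as np

def dropout(activations, drop_prob=0.5, rng=np.random):
    """Zero each node with probability drop_prob; rescale the survivors."""
    # mask of survivors: each node is kept with probability 1 - drop_prob
    keep = rng.rand(*activations.shape) >= drop_prob
    # divide by the keep probability so the expected activation is unchanged
    return activations * keep / (1.0 - drop_prob)
```

With a drop probability of 0.5, each surviving activation is doubled and roughly half are zeroed on any given iteration.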

&lt;h3 id=&#34;output&#34;&gt; Output Layer &lt;/h3&gt;

&lt;p&gt;Of course, depending on the purpose of your CNN, the output layer will be slightly different. In general, the output layer consists of a number of nodes which have a high value if they are &amp;lsquo;true&amp;rsquo; or activated. Consider a classification problem where a CNN is given a set of images containing cats, dogs and elephants. If we&amp;rsquo;re asking the CNN to learn what a cat, dog and elephant looks like, the output layer is going to be a set of three nodes, one for each &amp;lsquo;class&amp;rsquo; or animal. We&amp;rsquo;d expect that when the CNN finds an image of a cat, the value at the node representing &amp;lsquo;cat&amp;rsquo; is higher than the other two. This is the same idea as in a regular neural network. In fact, the FC layer and the output layer can be considered as a traditional NN where we also usually include a softmax activation function. Some output layers are probabilities and as such will sum to 1, whilst others will just achieve a value which could be a pixel intensity in the range 0-255. The output can also consist of a single node if we&amp;rsquo;re doing regression or deciding if an image belongs to a specific class or not e.g. diseased or healthy. Commonly, however, even binary classification is posed with 2 nodes in the output and trained with labels that are &amp;lsquo;one-hot&amp;rsquo; encoded i.e. [1,0] for class 0 and [0,1] for class 1.&lt;/p&gt;
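&lt;p&gt;For the probabilistic case, the softmax maps the raw node values onto probabilities that sum to 1. A minimal sketch, with made-up scores for the cat/dog/elephant example:&lt;/p&gt;

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw activations: cat, dog, elephant
probs = softmax(scores)             # sums to 1; the 'cat' node is highest
```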

&lt;h3 id=&#34;backProp&#34;&gt; A Note on Back Propagation &lt;/h3&gt;

&lt;p&gt;I&amp;rsquo;ve found it helpful to consider CNNs in reverse. It didn&amp;rsquo;t sit properly in my mind that the CNN first learns all different types of edges, curves etc. and then builds them up into large features e.g. a face. It came up in a discussion with a colleague that we could consider the CNN working in reverse, and in fact this is effectively what happens - back propagation updates the weights from the final layer &lt;em&gt;back&lt;/em&gt; towards the first. In fact, the error (or loss) minimisation occurs firstly at the final layer and as such, this is where the network is &amp;lsquo;seeing&amp;rsquo; the bigger picture. The gradient (updates to the weights) vanishes towards the input layer and is greatest at the output layer. We can effectively think that the CNN is learning &amp;ldquo;face - has eyes, nose, mouth&amp;rdquo; at the output layer, then &amp;ldquo;I don&amp;rsquo;t know what a face is, but here are some eyes, noses, mouths&amp;rdquo; in the previous one, then &amp;ldquo;What are eyes? I&amp;rsquo;m only seeing circles, some white bits and a black hole&amp;rdquo; followed by &amp;ldquo;woohoo! round things!&amp;rdquo; and initially by &amp;ldquo;I think that&amp;rsquo;s what a line looks like&amp;rdquo;. Possibly we could think of the CNN as being less sure about itself at the first layers and being more advanced at the end.&lt;/p&gt;

&lt;p&gt;CNNs can be used for segmentation, classification, regression and a whole manner of other processes. On the whole, they only differ by four things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;architecture (number and order of conv, pool and fc layers plus the size and number of the kernels)&lt;/li&gt;
&lt;li&gt;output (probabilistic etc.)&lt;/li&gt;
&lt;li&gt;training method (cost or loss function, regularisation and optimiser)&lt;/li&gt;
&lt;li&gt;hyperparameters (learning rate, regularisation weights, batch size, iterations&amp;hellip;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There may well be other posts which consider these kinds of things in more detail, but for now I hope you have some insight into how CNNs function. Now, let&amp;rsquo;s code it up&amp;hellip;&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>A Simple Neural Network - Simple Performance Improvements</title>
      <link>/post/nn-python-tweaks/</link>
      <pubDate>Fri, 17 Mar 2017 08:53:55 +0000</pubDate>
      
      <guid>/post/nn-python-tweaks/</guid>
      <description>&lt;p&gt;The 5th installment of our tutorial on implementing a neural network (NN) in Python. By the end of this tutorial, our NN should perform much more efficiently giving good results with fewer iterations. We will do this by implementing &amp;ldquo;momentum&amp;rdquo; into our network. We will also put in the other transfer functions for each layer.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;div id=&#34;toctop&#34;&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#intro&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#momentum&#34;&gt;Momentum&lt;/a&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#momentumbackground&#34;&gt;Background&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#momentumpython&#34;&gt;Momentum in Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#momentumtesting&#34;&gt;Testing&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#transferfunctions&#34;&gt;Transfer Functions&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;intro&#34;&gt; Introduction &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve come so far! The initial &lt;a href=&#34;/post/neuralnetwork&#34;&gt;maths&lt;/a&gt; was a bit of a slog, as was the &lt;a href=&#34;/post/nn-more-maths&#34;&gt;vectorisation&lt;/a&gt; of that maths, but it was important to be able to implement our NN in Python which we did in our &lt;a href=&#34;/post/nn-in-python&#34;&gt;previous post&lt;/a&gt;. So what now? Well, you may have noticed when running the NN as it stands that it isn&amp;rsquo;t overly quick: depending on the randomly initialised weights, it may take the network the full number of &lt;code&gt;maxIterations&lt;/code&gt; to converge, or it may not converge at all! But there is something we can do about it. Let&amp;rsquo;s learn about, and implement, &amp;lsquo;momentum&amp;rsquo;.&lt;/p&gt;

&lt;h2 id=&#34;momentum&#34;&gt; Momentum &lt;/h2&gt;

&lt;h3 id=&#34;momentumbackground&#34;&gt; Background &lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s revisit our equation for error in the NN:&lt;/p&gt;

&lt;div id=&#34;eqerror&#34;&gt;$$
\text{E} = \frac{1}{2} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right)^{2}
$$&lt;/div&gt;

&lt;p&gt;This isn&amp;rsquo;t the only error function that could be used. In fact, there&amp;rsquo;s a whole field of study in NNs about the best error or &amp;lsquo;optimisation&amp;rsquo; function that should be used. This one looks at the sum of the squared residuals between the outputs and the expected values at the end of each forward pass (the so-called $l_{2}$-norm). Others, e.g. the $l_{1}$-norm, look at minimising the sum of the absolute differences between the values themselves. There are more complex error (a.k.a. optimisation or cost) functions, for example those that look at the cross-entropy in the data. There may well be a post in the future about different cost functions, but for now we will still focus on the equation above.&lt;/p&gt;
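&lt;p&gt;For a concrete feel, here are the two norms computed on a made-up output/target pair:&lt;/p&gt;

```python
import numpy as np

outputs = np.array([0.8, 0.2, 0.6])
targets = np.array([1.0, 0.0, 1.0])

l2_cost = 0.5 * np.sum((outputs - targets) ** 2)  # the E defined above
l1_cost = np.sum(np.abs(outputs - targets))       # sum of absolute differences
```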

&lt;p&gt;Now this function is described as a &amp;lsquo;convex&amp;rsquo; function. This is an important property if we are to make our NN converge to the correct answer. Take a look at the two functions below:&lt;/p&gt;

&lt;div  id=&#34;fig1&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;convex&#34; src=&#34;/img/simpleNN/convex.png&#34; width=&#34;35%&#34; hspace=&#34;10px&#34;&gt;&lt;img title=&#34;non-convex&#34; src=&#34;/img/simpleNN/non-convex.png&#34; width=&#34;35%&#34; hspace=&#34;10px&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 1&lt;/font&gt;: A convex (left) and non-convex (right) cost function
        &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Let&amp;rsquo;s say that our current error is represented by the green ball. Our NN will calculate the gradient of its cost function at this point then look for the direction which is going to &lt;em&gt;minimise&lt;/em&gt; the error i.e. go down a slope. The NN will feed the result into the back-propagation algorithm which will hopefully mean that on the next iteration, the error will have decreased. For a &lt;em&gt;convex&lt;/em&gt; function, this is very straightforward: the NN just needs to keep going in the direction it found on the first run. But look at the &lt;em&gt;non-convex&lt;/em&gt; function: our current error (green ball) sits at a point where either direction will take it to a lower error i.e. the gradient decreases on both sides. If the error goes to the left, it will hit &lt;strong&gt;one&lt;/strong&gt; of the possible minima of the function, but this will be a higher minimum (higher final error) than if the error had followed the gradient to the right. Clearly the starting point for the error here has a big impact on the final result. Looking down at the 2D perspective (remembering that these are complex multi-dimensional functions), the non-convex case is clearly more ambiguous in terms of the location of the minimum and direction of descent. The convex function, however, nicely guides the error to the minimum with little care for the starting point.&lt;/p&gt;

&lt;div  id=&#34;fig2&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;convexcontour&#34; src=&#34;/img/simpleNN/convexcontourarrows.png&#34; width=&#34;35%&#34; hspace=&#34;10px&#34;&gt;&lt;img title=&#34;non-convexcontour&#34; src=&#34;/img/simpleNN/nonconvexcontourarrows.png&#34; width=&#34;35%&#34; hspace=&#34;10px&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 2&lt;/font&gt;: Contours for a portion of the convex (left) and non-convex (right) cost function
        &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;So let&amp;rsquo;s focus on the convex case and explain what &lt;em&gt;momentum&lt;/em&gt; is and why it works. I don&amp;rsquo;t think you&amp;rsquo;ll ever see a back propagation algorithm without momentum implemented in some way. In its simplest form, it modifies the weight-update equation:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{ \Delta W_{JK} = -\eta \vec{\delta}_{K} \vec{ \mathcal{O}_{J}}}
$$&lt;/div&gt;

&lt;p&gt;by adding an extra &lt;em&gt;momentum&lt;/em&gt; term:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{ \Delta W_{JK}\left(t\right) = -\eta \vec{\delta}_{K} \vec{ \mathcal{O}_{J}}} + m \mathbf{\Delta W_{JK}\left(t-1\right)}
$$&lt;/div&gt;

&lt;p&gt;The weight delta (the update amount to the weights after BP) now relies on its &lt;em&gt;previous&lt;/em&gt; value i.e. the weight delta at iteration $t$ requires the value of itself from $t-1$. The $m$ or momentum term, like the learning rate $\eta$, is just a small number between 0 and 1. What effect does this have?&lt;/p&gt;

&lt;p&gt;Using prior information about the network is beneficial as it stops the network firing wildly into the unknown. If it knows the previous weight updates that gave the current error, it can keep the descent to the minimum roughly pointing in the same direction as before. The effect is that each iteration does not jump around as much as it otherwise would. In effect, the result is similar to that of the learning rate. We should be careful though: a large value for $m$ combined with a large learning rate may cause the result to jump past the minimum and back again. We can think of momentum as changing the path taken to the optimum.&lt;/p&gt;

&lt;h3 id=&#34;momentumpython&#34;&gt; Momentum in Python &lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, implementing momentum into our NN should be pretty easy. We will need to provide a momentum term to the &lt;code&gt;backProp&lt;/code&gt; method of the NN and also create a new matrix in which to store the weight deltas from the current epoch for use in the subsequent one.&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;__init__&lt;/code&gt; method of the NN, we need to initialise the previous-weight matrices and then give them some values - they&amp;rsquo;ll start with zeros:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def __init__(self, numNodes):
	&amp;quot;&amp;quot;&amp;quot;Initialise the NN - setup the layers and initial weights&amp;quot;&amp;quot;&amp;quot;

	# Layer info
	self.numLayers = len(numNodes) - 1
	self.shape = numNodes 

	# Input/Output data from last run
	self._layerInput = []
	self._layerOutput = []
	self._previousWeightDelta = []

	# Create the weight arrays
	self.weights = []
	for (l1,l2) in zip(numNodes[:-1],numNodes[1:]):
	    self.weights.append(np.random.normal(scale=0.1,size=(l2,l1+1))) 
	    self._previousWeightDelta.append(np.zeros((l2,l1+1)))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The only other part of the NN that needs to change is the definition of &lt;code&gt;backProp&lt;/code&gt; adding momentum to the inputs, and updating the weight equation. Finally, we make sure to save the current weights into the previous-weight matrix:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def backProp(self, input, target, trainingRate = 0.2, momentum=0.5):
	&amp;quot;&amp;quot;&amp;quot;Get the error, deltas and back propagate to update the weights&amp;quot;&amp;quot;&amp;quot;
	...
	weightDelta = trainingRate * thisWeightDelta + momentum * self._previousWeightDelta[index]

	self.weights[index] -= weightDelta

	self._previousWeightDelta[index] = weightDelta
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;momentumtesting&#34;&gt; Testing &lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our default values for learning rate and momentum are 0.2 and 0.5 respectively. We can change either of these by including them in the call to &lt;code&gt;backProp&lt;/code&gt;. This is the only change to the iteration process:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;for i in range(maxIterations + 1):
    Error = NN.backProp(Input, Target, trainingRate=0.2, momentum=0.5)
    if i % 2500 == 0:
        print(&amp;quot;Iteration {0}\tError: {1:0.6f}&amp;quot;.format(i,Error))
    if Error &amp;lt;= minError:
        print(&amp;quot;Minimum error reached at iteration {0}&amp;quot;.format(i))
        break
        
Iteration 100000	Error: 0.000076
Input 	Output 		Target
[0 0]	 [ 0.00491572] 	[ 0.]
[1 1]	 [ 0.00421318] 	[ 0.]
[0 1]	 [ 0.99586268] 	[ 1.]
[1 0]	 [ 0.99586257] 	[ 1.]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Feel free to play around with these numbers; however, it is unlikely that much would change right now. I say this because there is only so good a result we can get when using only the sigmoid function as our activation function. If you go back and read the post on &lt;a href=&#34;/post/transfer-functions&#34;&gt;transfer functions&lt;/a&gt; you&amp;rsquo;ll see that it&amp;rsquo;s more common to use &lt;em&gt;linear&lt;/em&gt; functions for the output layer. As it stands, the sigmoid function is unable to output a 1 or a 0 because it is asymptotic at these values. Therefore, no matter what learning rate or momentum we use, the network will never be able to get the best output.&lt;/p&gt;

&lt;p&gt;This seems like a good time to implement the other transfer functions.&lt;/p&gt;

&lt;h3 id=&#34;transferfunctions&#34;&gt; Transfer Functions &lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve already gone through writing the transfer functions in Python in the &lt;a href=&#34;/post/transfer-functions&#34;&gt;transfer functions&lt;/a&gt; post. We&amp;rsquo;ll just put these under the sigmoid function we defined earlier. I&amp;rsquo;m going to use &lt;code&gt;sigmoid&lt;/code&gt;, &lt;code&gt;linear&lt;/code&gt;, &lt;code&gt;gaussian&lt;/code&gt; and &lt;code&gt;tanh&lt;/code&gt; here.&lt;/p&gt;

&lt;p&gt;To modify the network, we need to assign each layer its own activation function, so let&amp;rsquo;s put that in the &amp;lsquo;layer information&amp;rsquo; part of the &lt;code&gt;__init__&lt;/code&gt; method:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def __init__(self, numNodes, transferFunctions=None):
	&amp;quot;&amp;quot;&amp;quot;Initialise the Network&amp;quot;&amp;quot;&amp;quot;

	# Layer information
	self.numLayers = len(numNodes) - 1
	self.shape = numNodes

	if transferFunctions is None:
	    layerTFs = []
	    for i in range(self.numLayers):
	        if i == self.numLayers - 1:
	            layerTFs.append(linear)
	        else:
	            layerTFs.append(sigmoid)
	else:
	    if len(numNodes) != len(transferFunctions):
	        raise ValueError(&amp;quot;Number of transfer functions must match the number of layers: minus input layer&amp;quot;)
	    elif transferFunctions[0] is not None:
	        raise ValueError(&amp;quot;The input layer doesn&#39;t need a transfer function: give it [None,...]&amp;quot;)
	    else:
	        layerTFs = transferFunctions[1:]

	self.tFunctions = layerTFs
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s go through this. We input into the initialisation a parameter called &lt;code&gt;transferFunctions&lt;/code&gt; with a default value of &lt;code&gt;None&lt;/code&gt;. If the default is taken, i.e. the parameter is omitted, we set some defaults: for each layer, we use the &lt;code&gt;sigmoid&lt;/code&gt; function, unless it&amp;rsquo;s the output layer, where we use the &lt;code&gt;linear&lt;/code&gt; function. If a list of &lt;code&gt;transferFunctions&lt;/code&gt; is given, we first check that it&amp;rsquo;s a &amp;lsquo;legal&amp;rsquo; input. If the number of functions in the list is not the same as the number of layers (given by &lt;code&gt;numNodes&lt;/code&gt;), throw an error. Also, if the first function in the list is not &lt;code&gt;None&lt;/code&gt;, throw an error, because the first layer shouldn&amp;rsquo;t have an activation function (it is the input layer). If those two things are fine, go ahead and store the list of functions as &lt;code&gt;layerTFs&lt;/code&gt; without the first (element 0) one.&lt;/p&gt;

&lt;p&gt;We next need to replace all of our calls directly to &lt;code&gt;sigmoid&lt;/code&gt; and its derivative. These should now refer to the list of functions via an &lt;code&gt;index&lt;/code&gt; that depends on the number of the current layer. There are 3 instances of this in our NN: 1 in the forward pass where we call &lt;code&gt;sigmoid&lt;/code&gt; directly, and 2 in the &lt;code&gt;backProp&lt;/code&gt; method where we call the derivative at the output and hidden layers. So &lt;code&gt;sigmoid(layerInput)&lt;/code&gt;, for example, should become:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;self.tFunctions[index](layerInput)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Check the updated code &lt;a href=&#34;/docs/simpleNN-improvements.py&#34;&gt;here&lt;/a&gt; if that&amp;rsquo;s confusing.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s test this out! We&amp;rsquo;ll modify the call to initialising the NN by adding a list of functions like so:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;Input = np.array([[0,0],[1,1],[0,1],[1,0]])
Target = np.array([[0.0],[0.0],[1.0],[1.0]])
transferFunctions = [None, sigmoid, linear]
    
NN = backPropNN((2,2,1), transferFunctions)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Running the NN like this with the default learning rate and momentum should provide you with an immediate performance boost, simply because with the &lt;code&gt;linear&lt;/code&gt; function we&amp;rsquo;re now able to get closer to the target values, reducing the error.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;Iteration 0	Error: 1.550211
Iteration 2500	Error: 1.000000
Iteration 5000	Error: 0.999999
Iteration 7500	Error: 0.999999
Iteration 10000	Error: 0.999995
Iteration 12500	Error: 0.999969
Minimum error reached at iteration 14543
Input 	Output 		Target
[0 0]	 [ 0.0021009] 	[ 0.]
[1 1]	 [ 0.00081154] 	[ 0.]
[0 1]	 [ 0.9985881] 	[ 1.]
[1 0]	 [ 0.99877479] 	[ 1.]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Play around with the number of layers and different combinations of transfer functions as well as tweaking the learning rate and momentum. You&amp;rsquo;ll soon get a feel for how each changes the performance of the NN.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>A Simple Neural Network - With Numpy in Python</title>
      <link>/post/nn-in-python/</link>
      <pubDate>Wed, 15 Mar 2017 09:55:00 +0000</pubDate>
      
      <guid>/post/nn-in-python/</guid>
      <description>&lt;p&gt;Part 4 of our tutorial series on Simple Neural Networks. We&amp;rsquo;re ready to write our Python script! Having gone through the maths, vectorisation and activation functions, we&amp;rsquo;re now ready to put it all together and write it up. By the end of this tutorial, you will have a working NN in Python, using only numpy, which can be used to learn the output of logic gates (e.g. XOR)
&lt;/p&gt;

&lt;div id=&#34;toctop&#34;&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#intro&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#transferfunction&#34;&gt;Transfer Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#backpropclass&#34;&gt;Back Propagation Class&lt;/a&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#initialisation&#34;&gt;Initialisation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#forwardpass&#34;&gt;Forward Pass&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#backprop&#34;&gt;Back Propagation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#testing&#34;&gt;Testing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#iterating&#34;&gt;Iterating&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&#34;intro&#34;&gt; Introduction &lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve &lt;a href=&#34;/post/neuralnetwork&#34;&gt;ploughed through the maths&lt;/a&gt;, then &lt;a href=&#34;/post/nn-more-maths&#34;&gt;some more&lt;/a&gt;, now we&amp;rsquo;re finally here! This tutorial will run through the coding up of a simple neural network (NN) in Python. We&amp;rsquo;re not going to use any fancy packages (though they obviously have their advantages in tools, speed, efficiency&amp;hellip;) we&amp;rsquo;re only going to use numpy!&lt;/p&gt;

&lt;p&gt;By the end of this tutorial, we will have built an algorithm which will create a neural network with as many layers (and nodes) as we want. It will be trained by taking in multiple training examples and running the back propagation algorithm many times.&lt;/p&gt;

&lt;p&gt;Here are the things we&amp;rsquo;re going to need to code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The transfer functions&lt;/li&gt;
&lt;li&gt;The forward pass&lt;/li&gt;
&lt;li&gt;The back propagation algorithm&lt;/li&gt;
&lt;li&gt;The update function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To keep things nice and contained, the forward pass and back propagation algorithms should be coded into a class. We&amp;rsquo;re going to expect that we can build a NN by creating an instance of this class which has some internal functions (forward pass, delta calculation, back propagation, weight updates).&lt;/p&gt;

&lt;p&gt;First things first&amp;hellip; let&amp;rsquo;s import numpy:&lt;/p&gt;

&lt;div class=&#34;highlight&#34; style=&#34;background: #272822&#34;&gt;&lt;pre style=&#34;line-height: 125%&#34;&gt;&lt;span&gt;&lt;/span&gt;&lt;span style=&#34;color: #f92672&#34;&gt;import&lt;/span&gt; &lt;span style=&#34;color: #f8f8f2&#34;&gt;numpy&lt;/span&gt; &lt;span style=&#34;color: #f92672&#34;&gt;as&lt;/span&gt; &lt;span style=&#34;color: #f8f8f2&#34;&gt;np&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Now let&amp;rsquo;s go ahead and get the first bit done:&lt;/p&gt;

&lt;h2 id=&#34;transferfunction&#34;&gt; Transfer Function &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To begin with, we&amp;rsquo;ll focus on getting the network working with just one transfer function: the sigmoid function. As we discussed in a &lt;a href=&#34;/post/transfer-functions&#34;&gt;previous post&lt;/a&gt; this is very easy to code up because of its simple derivative:&lt;/p&gt;

&lt;div &gt;$$
\sigma\left(x_{i} \right) = \frac{1}{1 + e^{  - x_{i}  }} \ \ \ \
\sigma^{\prime}\left( x_{i} \right) = \sigma(x_{i}) \left( 1 -  \sigma(x_{i}) \right)
$$&lt;/div&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def sigmoid(x, Derivative=False):
	if not Derivative:
		return 1 / (1 + np.exp (-x))
	else:
		out = sigmoid(x)
		return out * (1 - out)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is a succinct expression which actually calls itself in order to get a value to use in its derivative. We&amp;rsquo;ve used numpy&amp;rsquo;s exponential function to create the sigmoid function and created an &lt;code&gt;out&lt;/code&gt; variable to hold its value in the derivative. Whenever we want to use this function, we can supply the parameter &lt;code&gt;True&lt;/code&gt; to get the derivative; we can omit this, or enter &lt;code&gt;False&lt;/code&gt;, to just get the output of the sigmoid. This is the same function I used to get the graphs in the &lt;a href=&#34;/post/transfer-functions&#34;&gt;post on transfer functions&lt;/a&gt;.&lt;/p&gt;
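
&lt;p&gt;As a quick sanity check, here&amp;rsquo;s a minimal sketch of the function in action (using the definition above):&lt;/p&gt;

```python
import numpy as np

def sigmoid(x, Derivative=False):
    if not Derivative:
        return 1 / (1 + np.exp(-x))
    else:
        out = sigmoid(x)
        return out * (1 - out)

print(sigmoid(0.0))        # 0.5 - the sigmoid is centred on zero
print(sigmoid(0.0, True))  # 0.25 - the derivative peaks at x = 0
```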

&lt;h2 id=&#34;backpropclass&#34;&gt; Back Propagation Class &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;m fairly new to building my own classes in Python, but for this tutorial, I really relied on the videos of &lt;a href=&#34;https://www.youtube.com/playlist?list=PLRyu4ecIE9tibdzuhJr94uQeKnOFkkbq6&#34;&gt;Ryan on YouTube&lt;/a&gt;. Some of his hacks were very useful so I&amp;rsquo;ve taken some of those on board, but I&amp;rsquo;ve made a lot of the variables more self-explanatory.&lt;/p&gt;

&lt;p&gt;First we&amp;rsquo;re going to get the skeleton of the class set up. This means that whenever we create a new instance of the &lt;code&gt;backPropNN&lt;/code&gt; class, it will be able to access all of the functions and variables defined within it.&lt;/p&gt;

&lt;p&gt;It looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;class backPropNN:
    &amp;quot;&amp;quot;&amp;quot;Class defining a NN using Back Propagation&amp;quot;&amp;quot;&amp;quot;
    
    # Class Members (internal variables that are accessed with backPropNN.member) 
    numLayers = 0
    shape = None
    weights = []
    
    # Class Methods (internal functions that can be called)
    
    def __init__(self):
        &amp;quot;&amp;quot;&amp;quot;Initialise the NN - setup the layers and initial weights&amp;quot;&amp;quot;&amp;quot;
        
    # Forward Pass method
    def FP(self):
    	&amp;quot;&amp;quot;&amp;quot;Get the input data and run it through the NN&amp;quot;&amp;quot;&amp;quot;
    	 
    # TrainEpoch method
    def backProp(self):
        &amp;quot;&amp;quot;&amp;quot;Get the error, deltas and back propagate to update the weights&amp;quot;&amp;quot;&amp;quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We&amp;rsquo;ve not added any detail to the functions (or methods) yet, but we know there needs to be an &lt;code&gt;__init__&lt;/code&gt; method for any class, plus we&amp;rsquo;re going to want to be able to do a forward pass and then back propagate the error.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve also added a few class members, variables which can be called from an instance of the &lt;code&gt;backPropNN&lt;/code&gt; class. &lt;code&gt;numLayers&lt;/code&gt; is just that, a count of the number of layers in the network, initialised to &lt;code&gt;0&lt;/code&gt;.  The &lt;code&gt;shape&lt;/code&gt; of the network will return the size of each layer of the network in an array and the &lt;code&gt;weights&lt;/code&gt; will return an array of the weights across the network.&lt;/p&gt;

&lt;h3 id=&#34;initialisation&#34;&gt; Initialisation &lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;re going to make the user supply an input variable which gives the size of the layers in the network i.e. the number of nodes in each layer: &lt;code&gt;numNodes&lt;/code&gt;. This will be an array which is the length of the number of layers (including the input and output layers) where each element is the number of nodes in that layer.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def __init__(self, numNodes):
	&amp;quot;&amp;quot;&amp;quot;Initialise the NN - setup the layers and initial weights&amp;quot;&amp;quot;&amp;quot;

	# Layer information
	self.numLayers = len(numNodes) - 1
	self.shape = numNodes
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We&amp;rsquo;ve told our network to ignore the input layer when counting the number of layers (common practice) and that the shape of the network should be returned as the input array &lt;code&gt;numNodes&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s also initialise the weights. We will take the approach of initialising all of the weights to small, random numbers. To keep the code succinct, we&amp;rsquo;ll use a neat function, &lt;code&gt;zip&lt;/code&gt;. &lt;code&gt;zip&lt;/code&gt; is a function which takes two vectors and pairs up the elements in corresponding locations (like a zip). For example:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;A = [1, 2, 3]
B = [4, 5, 6]

zip(A,B)
[(1,4), (2,5), (3,6)]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Why might this be useful? Well, when we talk about weights we&amp;rsquo;re talking about the connections between layers. Let&amp;rsquo;s say we have &lt;code&gt;numNodes=(2, 2, 1)&lt;/code&gt; i.e. a 2 layer network with 2 inputs, 1 output and 2 nodes in the hidden layer. Then we need to let the algorithm know that we expect two input nodes to send weights to 2 hidden nodes. Then 2 hidden nodes to send weights to 1 output node, or &lt;code&gt;[(2,2), (2,1)]&lt;/code&gt;. Note that overall we will have 4 weights from the input to the hidden layer, and 2 weights from the hidden to the output layer.&lt;/p&gt;

&lt;p&gt;What is our &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt; in the code above that will give us &lt;code&gt;[(2,2), (2,1)]&lt;/code&gt;? It&amp;rsquo;s this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;numNodes = (2,2,1)
A = numNodes[:-1]
B = numNodes[1:]

A
(2,2)
B
(2,1)
zip(A,B)
[(2,2), (2,1)]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Great! So each pair represents the nodes between which we need to initialise some weights. In fact, the shape of each pair &lt;code&gt;(2,2)&lt;/code&gt; is the clue to how many weights we are going to need between each layer e.g. between the input and hidden layers we are going to need &lt;code&gt;(2 x 2) = 4&lt;/code&gt; weights.&lt;/p&gt;

&lt;p&gt;So, &lt;code&gt;for&lt;/code&gt; each pair &lt;code&gt;in zip(A,B)&lt;/code&gt; (hint hint) we need to &lt;code&gt;append&lt;/code&gt; some weights to that empty weight list we initialised earlier.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Initialise the weight arrays
for (l1,l2) in zip(numNodes[:-1],numNodes[1:]):
    self.weights.append(np.random.normal(scale=0.1,size=(l2,l1+1)))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;self.weights&lt;/code&gt; as we&amp;rsquo;re appending to the class member initialised earlier. We&amp;rsquo;re using the numpy random number generator from a &lt;code&gt;normal&lt;/code&gt; distribution. The &lt;code&gt;scale&lt;/code&gt; sets the standard deviation of the distribution to 0.1 so that the weights start out as small numbers around zero, and &lt;code&gt;size&lt;/code&gt; asks for a matrix of results which is the size of the tuple &lt;code&gt;(l2,l1+1)&lt;/code&gt;. Huh, &lt;code&gt;+1&lt;/code&gt;? Don&amp;rsquo;t think we&amp;rsquo;re getting away without including the &lt;em&gt;bias&lt;/em&gt; term! We want a random starting point even for the weight connecting the bias node (&lt;code&gt;=1&lt;/code&gt;) to the next layer. Ok, but why this way and not &lt;code&gt;(l1+1,l2)&lt;/code&gt;? Well, we&amp;rsquo;re looking for &lt;code&gt;l2&lt;/code&gt; connections from each of the &lt;code&gt;l1+1&lt;/code&gt; nodes in the previous layer - think of it as (number of observations x number of features). We&amp;rsquo;re creating a matrix of weights which goes across the nodes and down the weights from each node, or as we&amp;rsquo;ve seen in our maths tutorial:&lt;/p&gt;

&lt;div&gt;$$
W_{ij} = \begin{pmatrix} w_{11} &amp; w_{21} &amp; w_{31} \\ w_{12} &amp;w_{22} &amp; w_{32} \end{pmatrix}, \ \ \ \

W_{jk} = \begin{pmatrix} w_{11} &amp; w_{21} &amp; w_{31} \end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;These are the weights between the first and second layers, and between the second and third layers respectively, with node 3 in each case being the bias node.&lt;/p&gt;
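
&lt;p&gt;To make the shapes concrete, here&amp;rsquo;s a sketch of the initialisation loop run on its own for our &lt;code&gt;numNodes=(2, 2, 1)&lt;/code&gt; example (the &lt;code&gt;weights&lt;/code&gt; list here is just a stand-in for &lt;code&gt;self.weights&lt;/code&gt;):&lt;/p&gt;

```python
import numpy as np

numNodes = (2, 2, 1)
weights = []
for (l1, l2) in zip(numNodes[:-1], numNodes[1:]):
    weights.append(np.random.normal(scale=0.1, size=(l2, l1 + 1)))

print(weights[0].shape)  # (2, 3): 2 hidden nodes, each fed by 2 inputs + 1 bias
print(weights[1].shape)  # (1, 3): 1 output node, fed by 2 hidden nodes + 1 bias
```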

&lt;p&gt;Before we move on, let&amp;rsquo;s also put in some placeholders in &lt;code&gt;__init__&lt;/code&gt; for the input and output values to each layer:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;self._layerInput = []
self._layerOutput = []
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;forwardpass&#34;&gt; Forward Pass &lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve now initialised our network enough to be able to focus on the forward pass (FP).&lt;/p&gt;

&lt;p&gt;Our &lt;code&gt;FP&lt;/code&gt; function needs to have the input data. It needs to know how many training examples it&amp;rsquo;s going to have to go through, and it will need to reassign the inputs and outputs at each layer, so lets clean those at the beginning:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def FP(self,input):

	numExamples = input.shape[0]

	# Clean away the values from the previous layer
	self._layerInput = []
	self._layerOutput = []
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So let&amp;rsquo;s propagate. We already have a matrix of (randomly initialised) weights. We just need to know what the input is to each of the layers. We&amp;rsquo;ll separate this into the first hidden layer, and subsequent hidden layers.&lt;/p&gt;

&lt;p&gt;For the first hidden layer we will write:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;layerInput = self.weights[0].dot(np.vstack([input.T, np.ones([1, numExamples])]))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s break this down:&lt;/p&gt;

&lt;p&gt;Our training example inputs need to match the weights that we&amp;rsquo;ve already created. We expect that our examples will come in rows of an array with columns acting as features, something like &lt;code&gt;[(0,0), (0,1),(1,1),(1,0)]&lt;/code&gt;. We can use numpy&amp;rsquo;s &lt;code&gt;vstack&lt;/code&gt; to put each of these examples one on top of the other.&lt;/p&gt;

&lt;p&gt;Each of the input examples is a matrix which will be multiplied by the weight matrix to get the input to the current layer:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{x_{J}} = \mathbf{W_{IJ} \vec{\mathcal{O}}_{I}}
$$&lt;/div&gt;

&lt;p&gt;where $\mathbf{x_{J}}$ are the inputs to the layer $J$ and  $\mathbf{\vec{\mathcal{O}}_{I}}$ is the output from the previous layer (the input examples in this case).&lt;/p&gt;

&lt;p&gt;So given a set of $n$ input examples we &lt;code&gt;vstack&lt;/code&gt; them so we just have &lt;code&gt;(n x numInputNodes)&lt;/code&gt;. We want to transpose this, &lt;code&gt;(numInputNodes x n)&lt;/code&gt; such that we can multiply by the weight matrix which is &lt;code&gt;(numOutputNodes x numInputNodes)&lt;/code&gt;. This gives an input to the layer which is &lt;code&gt;(numOutputNodes x n)&lt;/code&gt; as we expect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; we&amp;rsquo;re actually going to do the transposition first before doing the &lt;code&gt;vstack&lt;/code&gt; - this does exactly the same thing, but it also allows us to more easily add the bias nodes in to each input.&lt;/p&gt;

&lt;p&gt;Bias! Lets not forget this: we add a bias node which always has the value &lt;code&gt;1&lt;/code&gt; to each input (including the input layer). So our actual method is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Transpose the inputs &lt;code&gt;input.T&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Add a row of ones to the bottom (one bias node for each input) &lt;code&gt;[input.T, np.ones([1,numExamples])]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vstack&lt;/code&gt; this to compact the array &lt;code&gt;np.vstack(...)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Multiply with the weights connecting from the previous to the current layer &lt;code&gt;self.weights[0].dot(...)&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
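
&lt;p&gt;The four steps above can be sketched in isolation (the &lt;code&gt;input&lt;/code&gt; array here is just an illustrative set of logic-gate examples):&lt;/p&gt;

```python
import numpy as np

input = np.array([[0, 0], [0, 1], [1, 1], [1, 0]])  # 4 examples x 2 features
numExamples = input.shape[0]

# Transpose, then stack a row of ones underneath as the bias node
stacked = np.vstack([input.T, np.ones([1, numExamples])])
print(stacked.shape)  # (3, 4): (2 features + 1 bias row) x 4 examples
print(stacked[-1])    # [1. 1. 1. 1.] - the bias row
```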

&lt;p&gt;But what about the subsequent hidden layers? We&amp;rsquo;re not using the input examples in these layers, we are using the output from the previous layer &lt;code&gt;[self._layerOutput[-1]]&lt;/code&gt; (multiplied by the weights).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;for index in range(self.numLayers):
    # Get input to the layer
    if index == 0:
        layerInput = self.weights[0].dot(np.vstack([input.T, np.ones([1, numExamples])]))
    else:
        layerInput = self.weights[index].dot(np.vstack([self._layerOutput[-1], np.ones([1, numExamples])]))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Make sure to save this layer input, but also to now calculate and save the output of the current layer i.e.:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{ \vec{ \mathcal{O}}_{J}} = \sigma(\mathbf{x_{J}})
$$&lt;/div&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;self._layerInput.append(layerInput)
self._layerOutput.append(sigmoid(layerInput))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally, make sure that we&amp;rsquo;re returning the data from our output layer the same way that we got it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;return self._layerOutput[-1].T
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;backprop&#34;&gt;Back Propagation&lt;/h3&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve successfully sent the data from the input layer to the output layer using some initially randomised weights &lt;strong&gt;and&lt;/strong&gt; we&amp;rsquo;ve included the bias term (a kind of threshold on the activation functions). Our vectorised equations from the previous post will now come into play:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}

\mathbf{\vec{\delta}_{K}} &amp;= \sigma^{\prime}\left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) * \left( \mathbf{\vec{\mathcal{O}}_{K}} -  \mathbf{T_{K}}\right) \\[0.5em]

\mathbf{ \vec{ \delta }_{J}} &amp;= \sigma^{\prime} \left( \mathbf{ W_{IJ} \mathcal{O}_{I} } \right) * \mathbf{ W^{\intercal}_{JK}} \mathbf{ \vec{\delta}_{K}}

\end{align}
$$&lt;/div&gt;

&lt;div&gt;$$
\begin{align}

\mathbf{W_{JK}} + \Delta \mathbf{W_{JK}} &amp;\rightarrow \mathbf{W_{JK}}, \ \ \ \Delta \mathbf{W_{JK}} = -\eta \mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }_{J}} \\[0.5em]

\vec{\theta}  + \Delta \vec{\theta}  &amp;\rightarrow \vec{\theta}, \ \ \ \Delta \vec{\theta} = -\eta \mathbf{ \vec{ \delta }_{K}} 

\end{align}
$$&lt;/div&gt;

&lt;p&gt;With $*$ representing an elementwise multiplication between the matrices.&lt;/p&gt;

&lt;p&gt;First, lets initialise some variables and get the error on the output of the output layer. We assume that the target values have been formatted in the same way as the input values i.e. they are a row-vector per input example. In our forward propagation method, the outputs are stored as column-vectors, thus the targets have to be transposed. We will need to supply the input data, the target data and  $\eta$, the learning rate, which we will set at some small number for default. So we start back propagation by first initialising a placeholder for the deltas and getting the number of training examples before running them through the &lt;code&gt;FP&lt;/code&gt; method:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def backProp(self, input, target, trainingRate = 0.2):
    &amp;quot;&amp;quot;&amp;quot;Get the error, deltas and back propagate to update the weights&amp;quot;&amp;quot;&amp;quot;

    delta = []
    numExamples = input.shape[0]

    # Do the forward pass
    self.FP(input)

    # Error on the output layer (the last stored layer output)
    output_delta = self._layerOutput[-1] - target.T
    error = np.sum(output_delta**2)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We know from previous posts that the error is squared to get rid of the negatives. From this we compute the deltas for the output layer:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;delta.append(output_delta * sigmoid(self._layerInput[-1], True))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We now have the error but need to know what direction to alter the weights in, thus the gradient of the inputs to the layer needs to be known. So, we get the gradient of the activation function at the input to the layer and take the product with the error. Notice we&amp;rsquo;ve supplied &lt;code&gt;True&lt;/code&gt; to the sigmoid function to get its derivative.&lt;/p&gt;

&lt;p&gt;This is the delta for the output layer. So this calculation is only done when we&amp;rsquo;re considering the index at the end of the network. We should be careful that when telling the algorithm that this is the &amp;ldquo;last layer&amp;rdquo; we take account of the zero-indexing in Python: the last layer is at index &lt;code&gt;self.numLayers - 1&lt;/code&gt;, i.e. in a network with 2 layers, &lt;code&gt;layer[2]&lt;/code&gt; does not exist.&lt;/p&gt;

&lt;p&gt;We also need to get the deltas of the intermediate hidden layers. To do this, (according to our equations above) we have to &amp;lsquo;pull back&amp;rsquo; the delta from the output layer first. More accurately, for any hidden layer, we pull back the delta from the &lt;em&gt;next&lt;/em&gt; layer, which may well be another hidden layer. These deltas from the &lt;em&gt;next&lt;/em&gt; layer are multiplied by the weights from the &lt;em&gt;next&lt;/em&gt; layer &lt;code&gt;[index + 1]&lt;/code&gt;, before getting the product with the sigmoid derivative evaluated at the &lt;em&gt;current&lt;/em&gt; layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: this is &lt;em&gt;back&lt;/em&gt; propagation. We have to start at the end and work back to the beginning. We use the &lt;code&gt;reversed&lt;/code&gt; keyword in our loop to ensure that the algorithm considers the layers in reverse order.&lt;/p&gt;
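
&lt;p&gt;A quick illustration of what &lt;code&gt;reversed&lt;/code&gt; gives us, and of the zero-indexing:&lt;/p&gt;

```python
numLayers = 2
print(list(range(numLayers)))            # [0, 1] - forward-pass order
print(list(reversed(range(numLayers))))  # [1, 0] - back-propagation order
# With zero-indexing, the output layer is index numLayers - 1, here layer 1
```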

&lt;p&gt;Combining this into one method:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Calculate the deltas
for index in reversed(range(self.numLayers)):
    if index == self.numLayers - 1:
        # If the output layer, then compare to the target values
        output_delta = self._layerOutput[index] - target.T
        error = np.sum(output_delta**2)
        delta.append(output_delta * sigmoid(self._layerInput[index], True))
    else:
        # If a hidden layer, compare to the following layer&#39;s delta
        delta_pullback = self.weights[index + 1].T.dot(delta[-1])
        delta.append(delta_pullback[:-1,:] * sigmoid(self._layerInput[index], True))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Pick this piece of code apart. This is an important snippet as it calculates all of the deltas for all of the nodes in the network. Be sure that we understand:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;This is a &lt;code&gt;reversed&lt;/code&gt; loop because we want to deal with the last layer first&lt;/li&gt;
&lt;li&gt;The delta of the output layer is the residual between the output and target multiplied with the gradient (derivative) of the activation function &lt;em&gt;at the current layer&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;The delta of a hidden layer first needs the product of the &lt;em&gt;subsequent&lt;/em&gt; layer&amp;rsquo;s delta with the &lt;em&gt;subsequent&lt;/em&gt; layer&amp;rsquo;s weights. This is then multiplied with the gradient of the activation function evaluated at the &lt;em&gt;current&lt;/em&gt; layer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Double check that this matches up with the equations above too! We can double check the matrix multiplication. For the output layer:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;output_delta&lt;/code&gt; = (numOutputNodes x 1) - (1 x numOutputNodes).T = (numOutputNodes x 1)
&lt;code&gt;error&lt;/code&gt; = sum( (numOutputNodes x 1)**2 ) = a scalar
&lt;code&gt;delta&lt;/code&gt; = (numOutputNodes x 1) * sigmoid( (numOutputNodes x 1) ) = (numOutputNodes x 1)&lt;/p&gt;

&lt;p&gt;For the hidden layers (take the one previous to the output as example):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;delta_pullback&lt;/code&gt; = (numOutputNodes x numHiddenNodes).T.dot(numOutputNodes x 1) = (numHiddenNodes x 1)
&lt;code&gt;delta&lt;/code&gt; = (numHiddenNodes x 1) * sigmoid ( (numHiddenNodes x 1) ) = (numHiddenNodes x 1)&lt;/p&gt;
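
&lt;p&gt;Here&amp;rsquo;s a minimal sketch of that pullback for the &lt;code&gt;(2,2,1)&lt;/code&gt; network, using dummy random values for the deltas and layer inputs purely to check the shapes:&lt;/p&gt;

```python
import numpy as np

def sigmoid(x, Derivative=False):
    if not Derivative:
        return 1 / (1 + np.exp(-x))
    else:
        out = sigmoid(x)
        return out * (1 - out)

numHiddenNodes, numOutputNodes, n = 2, 1, 4
W_JK = np.random.normal(scale=0.1, size=(numOutputNodes, numHiddenNodes + 1))
delta_K = np.random.normal(size=(numOutputNodes, n))       # output-layer deltas
layerInput_J = np.random.normal(size=(numHiddenNodes, n))  # input to hidden layer

delta_pullback = W_JK.T.dot(delta_K)  # (numHiddenNodes + 1, n)
# Drop the bias row with [:-1, :] - no delta flows back into the bias node
delta_J = delta_pullback[:-1, :] * sigmoid(layerInput_J, True)
print(delta_J.shape)  # (2, 4)
```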

&lt;p&gt;Hurray! We have the delta at each node in our network. We can use them to update the weights for each layer in the network. Remember, to update the weights between layer $J$ and $K$ we need to use the output of layer $J$ and the deltas of layer $K$. This means we need to keep a track of the index of the layer we&amp;rsquo;re currently working on ($J$) and the index of the delta layer ($K$) - not forgetting about the zero-indexing in Python:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;for index in range(self.numLayers):
    delta_index = self.numLayers - 1 - index
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s first get the outputs from each layer:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;    if index == 0:
        layerOutput = np.vstack([input.T, np.ones([1, numExamples])])
    else:
        layerOutput = np.vstack([self._layerOutput[index - 1], np.ones([1,self._layerOutput[index -1].shape[1]])])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The output of the input layer is just the input examples (which we&amp;rsquo;ve &lt;code&gt;vstack&lt;/code&gt;-ed again), and the output from the other layers we take from the calculation in the forward pass (making sure to add the bias term on the end).&lt;/p&gt;

&lt;p&gt;For the current &lt;code&gt;index&lt;/code&gt; (layer) let&amp;rsquo;s use this &lt;code&gt;layerOutput&lt;/code&gt; to get the change in weight. We will use a few neat tricks to make this succinct:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;	thisWeightDelta = np.sum(\
	    layerOutput[None,:,:].transpose(2,0,1) * delta[delta_index][None,:,:].transpose(2,1,0) \
	    , axis = 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Break it down. We&amp;rsquo;re looking for $\mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }_{J}} $ so it&amp;rsquo;s the delta at &lt;code&gt;delta_index&lt;/code&gt;, the next layer along.&lt;/p&gt;

&lt;p&gt;We want to be able to deal with all of the input training examples simultaneously. This requires a bit of fancy slicing and transposing of the matrices. Take a look: by calling &lt;code&gt;vstack&lt;/code&gt; we made all of the input data and bias terms live in the same matrix of a numpy array. When we slice this array with the &lt;code&gt;[None,:,:]&lt;/code&gt; argument, it tells Python to take all (&lt;code&gt;:&lt;/code&gt;) the data in the rows and columns and shift it to the 2nd and 3rd dimensions, inserting a new, empty first dimension (&lt;code&gt;None&lt;/code&gt; here is numpy&amp;rsquo;s &lt;code&gt;np.newaxis&lt;/code&gt;). We do this to create the three dimensions which we can now transpose into. Calling &lt;code&gt;transpose(2,0,1)&lt;/code&gt; instructs Python to move around the dimensions of the data (e.g. its rows&amp;hellip; or examples). This creates an array where each example now lives in its own plane. The same is done for the deltas of the subsequent layer, but being careful to transpose them in the opposite direction so that the matrix multiplication can occur. The &lt;code&gt;axis=0&lt;/code&gt; is supplied to &lt;code&gt;np.sum&lt;/code&gt; so that the per-example products are summed over the example dimension.&lt;/p&gt;

&lt;p&gt;This looks incredibly complicated. It can be broken down into a for-loop over the input examples, but this reduces the efficiency of the network. Taking advantage of the numpy array like this keeps our calculations fast. In reality, if you&amp;rsquo;re struggling with this particular part, just copy and paste it, forget about it and be happy with yourself for understanding the maths behind back propagation, even if this random bit of Python is perplexing.&lt;/p&gt;
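
&lt;p&gt;If it helps, here&amp;rsquo;s a sketch (with illustrative dummy arrays) showing that the fancy one-liner matches an explicit for-loop over the examples:&lt;/p&gt;

```python
import numpy as np

np.random.seed(0)
n = 4                                  # number of training examples
layerOutput = np.random.randn(3, n)    # (l1 + 1) x n, bias row included
layerDelta = np.random.randn(1, n)     # l2 x n deltas from the next layer

# The one-liner: give each example its own plane, multiply, sum over examples
fancy = np.sum(layerOutput[None, :, :].transpose(2, 0, 1) *
               layerDelta[None, :, :].transpose(2, 1, 0), axis=0)

# The same calculation as an explicit loop over examples: accumulate the
# outer product of each example's delta with its layer output
looped = np.zeros((1, 3))
for k in range(n):
    looped += np.outer(layerDelta[:, k], layerOutput[:, k])

print(np.allclose(fancy, looped))  # True
```

&lt;p&gt;For what it&amp;rsquo;s worth, the same result can also be written as &lt;code&gt;layerDelta.dot(layerOutput.T)&lt;/code&gt;.&lt;/p&gt;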

&lt;p&gt;Anyway. Let&amp;rsquo;s take this set of weight deltas and put back the $\eta$. We&amp;rsquo;ll call this the &lt;code&gt;trainingRate&lt;/code&gt; (more commonly known as the learning rate), matching the parameter we gave &lt;code&gt;backProp&lt;/code&gt; earlier. We&amp;rsquo;ll update the weights by making sure to include the &lt;code&gt;-&lt;/code&gt; from the $-\eta$.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;	weightDelta = trainingRate * thisWeightDelta
	self.weights[index] -= weightDelta
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;the &lt;code&gt;-=&lt;/code&gt; is Python slang for: take the current value and subtract the value of &lt;code&gt;weightDelta&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To finish up, we want our back propagation to return the current error in the network, so:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;return error
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;testing&#34;&gt; A Toy Example&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Believe it or not, that&amp;rsquo;s it! The fundamentals of forward and back propagation have now been implemented in Python. If you want to double check your code, have a look at my completed .py file &lt;a href=&#34;/docs/simpleNN.py&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s test it!&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;Input = np.array([[0,0],[1,1],[0,1],[1,0]])
Target = np.array([[0.0],[0.0],[1.0],[1.0]])

NN = backPropNN((2,2,1))

Error = NN.backProp(Input, Target)
Output = NN.FP(Input)

print(&#39;Input \tOutput \t\tTarget&#39;)
for i in range(Input.shape[0]):
    print(&#39;{0}\t {1} \t{2}&#39;.format(Input[i], Output[i], Target[i]))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will provide 4 input examples and the expected targets. We create an instance of the network called &lt;code&gt;NN&lt;/code&gt; with 2 layers (2 nodes in the hidden and 1 node in the output layer). We make &lt;code&gt;NN&lt;/code&gt; do &lt;code&gt;backProp&lt;/code&gt; with the input and target data and then get the output from the final layer by running our input through the network with &lt;code&gt;FP&lt;/code&gt;. The printout is self-explanatory. Give it a try!&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Input 	Output 		Target
[0 0]	 [ 0.51624448] 	[ 0.]
[1 1]	 [ 0.51688469] 	[ 0.]
[0 1]	 [ 0.51727559] 	[ 1.]
[1 0]	 [ 0.51585529] 	[ 1.]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can see that the network has taken our inputs, and we have some outputs too. They&amp;rsquo;re not great, and all seem to live around the same value. This is because we initialised the weights across the network to a similarly small random value. We need to repeat the &lt;code&gt;FP&lt;/code&gt; and &lt;code&gt;backProp&lt;/code&gt; process many times in order to keep updating the weights.&lt;/p&gt;

&lt;h2 id=&#34;iterating&#34;&gt; Iterating &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Iteration is very straightforward. We just tell our algorithm to repeat a maximum of &lt;code&gt;maxIterations&lt;/code&gt; times or until the &lt;code&gt;Error&lt;/code&gt; is below &lt;code&gt;minError&lt;/code&gt; (whichever comes first). As the weights are stored internally within &lt;code&gt;NN&lt;/code&gt;, every time we call the &lt;code&gt;backProp&lt;/code&gt; method it uses the latest, internally stored weights and doesn&amp;rsquo;t start again - the weights are only initialised once upon creation of &lt;code&gt;NN&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;maxIterations = 100000
minError = 1e-5

for i in range(maxIterations + 1):
    Error = NN.backProp(Input, Target)
    if i % 2500 == 0:
        print(&amp;quot;Iteration {0}\tError: {1:0.6f}&amp;quot;.format(i,Error))
    if Error &amp;lt;= minError:
        print(&amp;quot;Minimum error reached at iteration {0}&amp;quot;.format(i))
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here&amp;rsquo;s the end of my output from the first run:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Iteration 100000	Error: 0.000291
Input 	Output 		Target
[0 0]	 [ 0.00780385] 	[ 0.]
[1 1]	 [ 0.00992829] 	[ 0.]
[0 1]	 [ 0.99189799] 	[ 1.]
[1 0]	 [ 0.99189943] 	[ 1.]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Much better! The error is very small and the outputs are very close to the correct values. However, they&amp;rsquo;re not completely right. We can do better by implementing different activation functions, which we will do in the next tutorial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Please&lt;/strong&gt; let me know if anything is unclear, or there are mistakes. Let me know how you get on!&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>A Simple Neural Network - Vectorisation</title>
      <link>/post/nn-more-maths/</link>
      <pubDate>Mon, 13 Mar 2017 10:33:08 +0000</pubDate>
      
      <guid>/post/nn-more-maths/</guid>
      <description>&lt;p&gt;The third in our series of tutorials on Simple Neural Networks. This time, we&amp;rsquo;re looking a bit deeper into the maths, specifically focusing on vectorisation. This is an important step before we can translate our maths in a functioning script in Python.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;p&gt;So we&amp;rsquo;ve &lt;a href=&#34;/post/neuralnetwork&#34;&gt;been through the maths&lt;/a&gt; of a neural network (NN) using back propagation and taken a look at the &lt;a href=&#34;/post/transfer-functions&#34;&gt;different activation functions&lt;/a&gt; that we could implement. This post will translate the mathematics into Python which we can piece together at the end into a functioning NN!&lt;/p&gt;

&lt;h2 id=&#34;forwardprop&#34;&gt; Forward Propagation &lt;/h2&gt;

&lt;p&gt;Let&amp;rsquo;s remind ourselves of our notation from our 2 layer network in the &lt;a href=&#34;/post/neuralnetwork&#34;&gt;maths tutorial&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I is our input layer&lt;/li&gt;
&lt;li&gt;J is our hidden layer&lt;/li&gt;
&lt;li&gt;$w_{ij}$ is the weight connecting the $i^{\text{th}}$ node in $I$ to the $j^{\text{th}}$ node in $J$&lt;/li&gt;
&lt;li&gt;$x_{j}$ is the total input to the $j^{\text{th}}$ node in $J$&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, assuming that we have three features (nodes) in the input layer, the input to the first node in the hidden layer is given by:&lt;/p&gt;

&lt;div&gt;$$
x_{1} = \mathcal{O}_{1}^{I} w_{11} + \mathcal{O}_{2}^{I} w_{21} + \mathcal{O}_{3}^{I} w_{31}
$$&lt;/div&gt;

&lt;p&gt;Lets generalise this for any connected nodes in any layer: the input to node $j$ in layer $l$ is:&lt;/p&gt;

&lt;div&gt;$$
x_{j} = \mathcal{O}_{1}^{l-1} w_{1j} + \mathcal{O}_{2}^{l-1} w_{2j} + \mathcal{O}_{3}^{l-1} w_{3j}
$$&lt;/div&gt;

&lt;p&gt;But we need to be careful and remember to put in our &lt;em&gt;bias&lt;/em&gt; term $\theta$. In our maths tutorial, we said that the bias term was always equal to 1; now we can try to understand why.&lt;/p&gt;

&lt;p&gt;We could just add the bias term onto the end of the previous equation to get:&lt;/p&gt;

&lt;div&gt;$$
x_{j} = \mathcal{O}_{1}^{l-1} w_{1j} + \mathcal{O}_{2}^{l-1} w_{2j} + \mathcal{O}_{3}^{l-1} w_{3j} + \theta_{j}
$$&lt;/div&gt;

&lt;p&gt;If we think more carefully about this, what we are really saying is that &amp;ldquo;an extra node in the previous layer, which always outputs the value 1, is connected to the node $j$ in the current layer by some weight $w_{4j}$&amp;rdquo;, i.e. $1 \cdot w_{4j}$:&lt;/p&gt;

&lt;div&gt;$$
x_{j} = \mathcal{O}_{1}^{l-1} w_{1j} + \mathcal{O}_{2}^{l-1} w_{2j} + \mathcal{O}_{3}^{l-1} w_{3j} + 1 \cdot w_{4j}
$$&lt;/div&gt;

&lt;p&gt;By the magic of matrix multiplication, we should be able to convince ourselves that:&lt;/p&gt;

&lt;div&gt;$$
x_{j} = \begin{pmatrix} w_{1j} &amp;w_{2j} &amp;w_{3j} &amp;w_{4j} \end{pmatrix}
     \begin{pmatrix}    \mathcal{O}_{1}^{l-1} \\
                    \mathcal{O}_{2}^{l-1} \\
                    \mathcal{O}_{3}^{l-1} \\
                    1
        \end{pmatrix}

$$&lt;/div&gt;

&lt;p&gt;Now, let&amp;rsquo;s be a little more explicit and consider the input $x$ to the first two nodes of the layer $J$:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
x_{1} &amp;= \begin{pmatrix} w_{11} &amp;w_{21} &amp;w_{31} &amp;w_{41} \end{pmatrix}
     \begin{pmatrix}    \mathcal{O}_{1}^{l-1} \\
                    \mathcal{O}_{2}^{l-1} \\
                    \mathcal{O}_{3}^{l-1} \\
                    1
        \end{pmatrix}
\\[0.5em]
x_{2} &amp;= \begin{pmatrix} w_{12} &amp;w_{22} &amp;w_{32} &amp;w_{42} \end{pmatrix}
     \begin{pmatrix}    \mathcal{O}_{1}^{l-1} \\
                    \mathcal{O}_{2}^{l-1} \\
                    \mathcal{O}_{3}^{l-1} \\
                    1
        \end{pmatrix}
\end{align}
$$&lt;/div&gt;

&lt;p&gt;Note that the second matrix is the same in both input calculations, as it contains only the output values of the previous layer (including the bias term). This means (again by the magic of matrix multiplication) that we can construct a single vector containing the input values $x$ to the current layer:&lt;/p&gt;

&lt;div&gt; $$
\begin{pmatrix} x_{1} \\ x_{2} \end{pmatrix}
= \begin{pmatrix}   w_{11} &amp; w_{21} &amp; w_{31} &amp; w_{41} \\
                    w_{12} &amp; w_{22} &amp; w_{32} &amp; w_{42} 
                    \end{pmatrix}
     \begin{pmatrix}    \mathcal{O}_{1}^{l-1} \\
                    \mathcal{O}_{2}^{l-1} \\
                    \mathcal{O}_{3}^{l-1} \\
                    1
        \end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;This is an $\left(n \times (m+1) \right)$ matrix multiplied by an $\left((m+1) \times 1 \right)$ vector, where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$n$ is the number of nodes in the current layer $l$&lt;/li&gt;
&lt;li&gt;$m$ is the number of nodes in the previous layer $l-1$&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let&amp;rsquo;s generalise: the vector of inputs to the $n$ nodes in the current layer from the $m$ nodes in the previous layer is:&lt;/p&gt;

&lt;div&gt; $$
\begin{pmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{pmatrix}
= \begin{pmatrix}   w_{11} &amp; w_{21} &amp; \cdots &amp; w_{(m+1)1} \\
                    w_{12} &amp; w_{22} &amp; \cdots &amp; w_{(m+1)2} \\
                    \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\
                    w_{1n} &amp; w_{2n} &amp; \cdots &amp; w_{(m+1)n} \\
                    \end{pmatrix}
     \begin{pmatrix}    \mathcal{O}_{1}^{l-1} \\
                    \mathcal{O}_{2}^{l-1} \\
                    \mathcal{O}_{3}^{l-1} \\
                    1
        \end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;or:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{x_{J}} = \mathbf{W_{IJ}} \mathbf{\vec{\mathcal{O}}_{I}}
$$&lt;/div&gt;

&lt;p&gt;In this notation, the output from the current layer $J$ is easily written as:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{\vec{\mathcal{O}}_{J}} = \sigma \left( \mathbf{W_{IJ}} \mathbf{\vec{\mathcal{O}}_{I}} \right)
$$&lt;/div&gt;

&lt;p&gt;Where $\sigma$ is the activation or transfer function chosen for this layer which is applied elementwise to the product of the matrices.&lt;/p&gt;

&lt;p&gt;This notation allows us to very efficiently calculate the output of a layer, which reduces computation time. Additionally, we are now able to extend this efficiency by making our network consider &lt;strong&gt;all&lt;/strong&gt; of our input examples at once.&lt;/p&gt;
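As a quick sanity check, the single-layer forward pass can be sketched in NumPy. This assumes the sigmoid as the transfer function; the helper name `layer_forward` and the numbers are invented for illustration, not taken from the tutorial:

```python
import numpy as np

def layer_forward(W, O_prev):
    """Forward-propagate one layer: append the always-1 bias node to the
    previous layer's outputs, multiply by the weight matrix, then apply
    the sigmoid elementwise."""
    O_b = np.append(O_prev, 1.0)        # outputs of the previous layer, plus bias
    x = W @ O_b                         # vector of inputs x_j to the current layer
    return 1.0 / (1.0 + np.exp(-x))     # sigma applied elementwise

# 3 input nodes (+ bias) feeding 2 nodes in the current layer
W_IJ = np.array([[0.1, 0.2, 0.3, 0.05],
                 [0.4, 0.5, 0.6, 0.05]])
O_I = np.array([1.0, 0.5, -1.0])
O_J = layer_forward(W_IJ, O_I)          # one output per node in layer J
```

Each entry of `O_J` is exactly the $x_j$ from the matrix product above, pushed through $\sigma$.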

&lt;p&gt;Remember that our network requires training (many epochs of forward propagation followed by back propagation) and as such needs training data (preferably a lot of it!). Rather than consider each training example individually, we vectorise each example into a large matrix of inputs.&lt;/p&gt;

&lt;p&gt;Our weights $\mathbf{W_{IJ}}$ connecting layer $I$ to layer $J$ are the same no matter which input example we put into the network: this is fundamental, as we expect that the network would act the same way for similar inputs i.e. we expect the same neurons (nodes) to fire based on the similar features in the input.&lt;/p&gt;

&lt;p&gt;If two input examples gave the outputs $ \mathbf{\vec{\mathcal{O}}_{I_{1}}} $ and $ \mathbf{\vec{\mathcal{O}}_{I_{2}}} $ from the nodes in layer $I$ to a layer $J$, then the outputs from layer $J$, $\mathbf{\vec{\mathcal{O}}_{J_{1}}}$ and $\mathbf{\vec{\mathcal{O}}_{J_{2}}}$, can be written:&lt;/p&gt;

&lt;div&gt;$$
\begin{pmatrix}
    \mathbf{\vec{\mathcal{O}}_{J_{1}}} \\
    \mathbf{\vec{\mathcal{O}}_{J_{2}}}
\end{pmatrix}
=
\sigma \left(\mathbf{W_{IJ}}\begin{pmatrix}
        \mathbf{\vec{\mathcal{O}}_{I_{1}}} &amp;
        \mathbf{\vec{\mathcal{O}}_{I_{2}}}  
    \end{pmatrix}
    \right)
=
\sigma \left(\mathbf{W_{IJ}}\begin{pmatrix}
        \begin{bmatrix}\mathcal{O}_{I_{1}}^{1} \\ \vdots \\ \mathcal{O}_{I_{1}}^{m}
        \end{bmatrix}
        \begin{bmatrix}\mathcal{O}_{I_{2}}^{1} \\ \vdots \\ \mathcal{O}_{I_{2}}^{m}
        \end{bmatrix}   
    \end{pmatrix}
        \right)
=   \sigma \left(\begin{pmatrix} \mathbf{W_{IJ}}\begin{bmatrix}\mathcal{O}_{I_{1}}^{1} \\ \vdots \\ \mathcal{O}_{I_{1}}^{m}
        \end{bmatrix} &amp; 
    \mathbf{W_{IJ}}     \begin{bmatrix}\mathcal{O}_{I_{2}}^{1} \\ \vdots \\ \mathcal{O}_{I_{2}}^{m}
        \end{bmatrix}
    \end{pmatrix}
        \right)

$$&lt;/div&gt;

&lt;p&gt;for the $m$ nodes in the input layer. This may look hideous, but the point is that all of the training examples that are input to the network can be dealt with simultaneously, because each example becomes another column in the input matrix and a corresponding column in the output matrix.&lt;/p&gt;

&lt;div class=&#34;highlight_section&#34;&gt;

In summary, for forward propagation:

&lt;ul&gt;
&lt;li&gt; All $n$ training examples with $m$ features (input nodes) are put into column vectors to build the input matrix $\mathbf{I}$, taking care to add the bias term to the end of each.&lt;/li&gt;

&lt;li&gt; All weight vectors that connect the $m+1$ nodes in layer $I$ to the $n$ nodes in layer $J$ are put together in a weight matrix:&lt;/li&gt;

&lt;div&gt;$$
\mathbf{I} =    \left(
    \begin{bmatrix}
        \mathcal{O}_{I_{1}}^{1} \\ \vdots \\ \mathcal{O}_{I_{1}}^{m} \\ 1 \end{bmatrix}
    \begin{bmatrix}
        \mathcal{O}_{I_{2}}^{1} \\ \vdots \\ \mathcal{O}_{I_{2}}^{m} \\ 1
    \end{bmatrix}
        \begin{bmatrix}
    \cdots \\ \cdots \\ \ddots \\ \cdots
        \end{bmatrix}
    \begin{bmatrix}
        \mathcal{O}_{I_{n}}^{1} \\ \vdots \\ \mathcal{O}_{I_{n}}^{m} \\ 1

    \end{bmatrix}
    \right)

\ \ \ \ 


\mathbf{W_{IJ}} = 
\begin{pmatrix}     w_{11} &amp; w_{21} &amp; \cdots &amp; w_{(m+1)1} \\
                    w_{12} &amp; w_{22} &amp; \cdots &amp; w_{(m+1)2} \\
                    \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\
                    w_{1n} &amp; w_{2n} &amp; \cdots &amp; w_{(m+1)n} \\
                    \end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;&lt;li&gt; We perform $ \sigma \left( \mathbf{W_{IJ}} \mathbf{I} \right)$ to get the matrix $\mathbf{\vec{\mathcal{O}}_{J}}$, whose columns hold the outputs from each of the $n$ nodes in layer $J$ for every training example &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;&lt;/p&gt;
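The summary above can be sketched directly in NumPy. The sigmoid, the helper name `forward_batch`, and the random shapes are illustrative assumptions, not part of the tutorial:

```python
import numpy as np

def forward_batch(W, X):
    """Each column of X is one training example (features only); a row of
    ones is stacked underneath so every example carries the bias input."""
    I = np.vstack([X, np.ones((1, X.shape[1]))])   # shape (m+1, n_examples)
    return 1.0 / (1.0 + np.exp(-(W @ I)))          # shape (n_nodes, n_examples)

rng = np.random.default_rng(0)
W_IJ = rng.random((2, 4))    # 2 nodes in J; 3 features + bias in I
X = rng.random((3, 5))       # 5 training examples as columns
O_J = forward_batch(W_IJ, X)
```

Each column of `O_J` is the layer's output for the corresponding example, so one matrix product handles the whole training set.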

&lt;h2 id=&#34;backprop&#34;&gt; Back Propagation &lt;/h2&gt;

&lt;p&gt;To perform back propagation there are a couple of things that we need to vectorise. The first is the error on the weights when we compare the output of the network $\mathbf{\vec{\mathcal{O}}_{K}}$ with the known target values:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{T_{K}} = \begin{bmatrix} t_{1} \\ \vdots \\ t_{k} \end{bmatrix}
$$&lt;/div&gt;

&lt;p&gt;A reminder of the formulae:&lt;/p&gt;

&lt;div&gt;$$

    \delta_{k} = \mathcal{O}_{k}  \left( 1 - \mathcal{O}_{k}  \right)  \left( \mathcal{O}_{k} - t_{k} \right), 
    \ \ \ \
    \delta_{j} = \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right)   \sum_{k \in K} \delta_{k} W_{jk}

$$&lt;/div&gt;
    

&lt;p&gt;Where $\delta_{k}$ is the error on the weights to the output layer and $\delta_{j}$ is the error on the weights to the hidden layers. We also need to vectorise the update formulae for the weights and bias:&lt;/p&gt;

&lt;div&gt;$$
    W + \Delta W \rightarrow W, \ \ \ \
    \theta + \Delta\theta \rightarrow \theta
$$&lt;/div&gt;

&lt;h3 id=&#34;outputdeltas&#34;&gt;  Vectorising the Output Layer Deltas &lt;/h3&gt;

&lt;p&gt;Let&amp;rsquo;s look at the output layer delta: we need a subtraction between the outputs and the targets which is multiplied by the derivative of the transfer function (sigmoid). Well, the subtraction between two matrices is straightforward:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{\vec{\mathcal{O}}_{K}} -  \mathbf{T_{K}}
$$&lt;/div&gt;

&lt;p&gt;but we need to consider the derivative. Remember that the output of the final layer is:&lt;/p&gt;

&lt;div&gt;$$
\mathbf{\vec{\mathcal{O}}_{K}}  = \sigma \left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}}  \right)
$$&lt;/div&gt;

&lt;p&gt;and the derivative can be written:&lt;/p&gt;

&lt;div&gt;$$
 \sigma ^{\prime} \left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}}  \right) =   \mathbf{\vec{\mathcal{O}}_{K}}\left( 1 - \mathbf{\vec{\mathcal{O}}_{K}}  \right) 
$$&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This is the derivative of the sigmoid as evaluated at each of the nodes in the layer $K$. It is acting &lt;em&gt;elementwise&lt;/em&gt; on the inputs to layer $K$. Thus it is a column vector with the same length as the number of nodes in layer $K$.&lt;/p&gt;

&lt;p&gt;Put the derivative and subtraction terms together and we get:&lt;/p&gt;

&lt;div class=&#34;highlight_section&#34;&gt;$$
\mathbf{\vec{\delta}_{K}} = \sigma^{\prime}\left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) * \left( \mathbf{\vec{\mathcal{O}}_{K}} -  \mathbf{T_{K}}\right)
$$&lt;/div&gt;

&lt;p&gt;Again, the derivatives are being multiplied elementwise with the results of the subtraction. Now we have a vector of deltas for the output layer $K$! Things aren&amp;rsquo;t so straightforward for the deltas in the hidden layers.&lt;/p&gt;
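In NumPy this elementwise recipe is one line: because $\sigma$ is the sigmoid, $\sigma^{\prime}$ evaluated at the layer's inputs can be rewritten in terms of the layer's own outputs as $\mathcal{O}_{K}(1-\mathcal{O}_{K})$. The helper name and the numbers below are invented for illustration:

```python
import numpy as np

def output_deltas(O_K, T_K):
    # sigma'(x_K) = O_K * (1 - O_K) for the sigmoid, multiplied
    # elementwise by the subtraction (O_K - T_K)
    return O_K * (1.0 - O_K) * (O_K - T_K)

O_K = np.array([0.8, 0.3])   # network outputs
T_K = np.array([1.0, 0.0])   # target values
delta_K = output_deltas(O_K, T_K)
```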

&lt;p&gt;Let&amp;rsquo;s visualise what we&amp;rsquo;ve seen:&lt;/p&gt;

&lt;div  id=&#34;fig1&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;NN Vectorisation&#34; src=&#34;/img/simpleNN/nn_vectors1.png&#34; width=&#34;30%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 1&lt;/font&gt;: NN showing the weights and outputs in vector form along with the target values for layer $K$
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;hiddendeltas&#34;&gt; Vectorising the Hidden Layer Deltas &lt;/h3&gt;

&lt;p&gt;We need to vectorise:&lt;/p&gt;

&lt;div&gt;$$
    \delta_{j} = \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right)   \sum_{k \in K} \delta_{k} W_{jk}
$$&lt;/div&gt;

&lt;p&gt;Let&amp;rsquo;s deal with the summation. We&amp;rsquo;re multiplying each of the deltas $\delta_{k}$ in the output layer (or, more generally, the subsequent layer, which could be another hidden layer) by the weight $w_{jk}$ that pulls them back to the node $j$ in the current layer, before adding the results. For the first node in the hidden layer:&lt;/p&gt;

&lt;div&gt;$$
\sum_{k \in K} \delta_{k} W_{jk} = \delta_{k}^{1}w_{11} + \delta_{k}^{2}w_{12} + \delta_{k}^{3}w_{13}

= \begin{pmatrix} w_{11} &amp; w_{12} &amp; w_{13} \end{pmatrix}  \begin{pmatrix} \delta_{k}^{1} \\ \delta_{k}^{2} \\ \delta_{k}^{3}\end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;Notice the weights? They pull the delta from each output-layer node back to the first node of the hidden layer. In forward propagation we considered weights going from multiple nodes out to a single node; here, a single node receives contributions back from multiple nodes.&lt;/p&gt;

&lt;p&gt;Combine this summation with the multiplication by the activation function derivative:&lt;/p&gt;

&lt;div&gt;$$
\delta_{j}^{1} = \sigma^{\prime} \left(  x_{j}^{1} \right)
\begin{pmatrix} w_{11} &amp; w_{12} &amp; w_{13} \end{pmatrix}  \begin{pmatrix} \delta_{k}^{1} \\ \delta_{k}^{2} \\ \delta_{k}^{3} \end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;remembering that the input to the $\text{1}^\text{st}$ node in the layer $J$ is:&lt;/p&gt;

&lt;div&gt;$$
x_{j}^{1} = \mathbf{W_{I1}}\mathbf{\vec{\mathcal{O}}_{I}}
$$&lt;/div&gt;

&lt;p&gt;What about the $\text{2}^\text{nd}$ node in the hidden layer?&lt;/p&gt;

&lt;div&gt;$$
\delta_{j}^{2} = \sigma^{\prime} \left(  x_{j}^{2} \right)
\begin{pmatrix} w_{21} &amp; w_{22} &amp; w_{23} \end{pmatrix}  \begin{pmatrix}  \delta_{k}^{1} \\ \delta_{k}^{2} \\ \delta_{k}^{3} \end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;This is looking familiar; based upon what we&amp;rsquo;ve done before, we can be confident in saying that:&lt;/p&gt;

&lt;div&gt;$$
\begin{pmatrix}
    \delta_{j}^{1} \\ \delta_{j}^{2}
\end{pmatrix}
 = 
 \begin{pmatrix}
     \sigma^{\prime} \left(  x_{j}^{1} \right) \\ \sigma^{\prime} \left(  x_{j}^{2} \right)
 \end{pmatrix}
 *
  \begin{pmatrix}
    w_{11} &amp; w_{12} &amp; w_{13} \\
    w_{21} &amp; w_{22} &amp; w_{23} 
 \end{pmatrix}
 
 \begin{pmatrix}\delta_{k}^{1} \\ \delta_{k}^{2} \\ \delta_{k}^{3}  \end{pmatrix}

$$&lt;/div&gt;

&lt;p&gt;We&amp;rsquo;ve seen a version of this weights matrix before when we did the forward propagation vectorisation. In this case though, look carefully - as we mentioned, the weights are not in the same places, in fact, the weight matrix has been &lt;em&gt;transposed&lt;/em&gt; from the one we used in forward propagation. This makes sense because we&amp;rsquo;re going backwards through the network now! This is useful because it means there is very little extra calculation needed here - the matrix we need is already available from the forward pass, but just needs transposing. We can call the weights in back propagation here $ \mathbf{ W_{KJ}} $ as we&amp;rsquo;re pulling the deltas from $K$ to $J$.&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
    \mathbf{W_{KJ}} &amp;=
    \begin{pmatrix}
    w_{11} &amp; w_{12} &amp; \cdots &amp; w_{1n} \\
    w_{21} &amp; w_{22} &amp; \cdots &amp; w_{2n}  \\
    \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\
    w_{(m+1)1} &amp; w_{(m+1)2} &amp; \cdots &amp; w_{(m+1)n}
    \end{pmatrix} , \ \ \
    
    \mathbf{W_{JK}} = 
    \begin{pmatrix}     w_{11} &amp; w_{21} &amp; \cdots &amp; w_{(m+1)1} \\
                    w_{12} &amp; w_{22} &amp; \cdots &amp; w_{(m+1)2} \\
                    \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\
                    w_{1n} &amp; w_{2n} &amp; \cdots &amp; w_{(m+1)n} \\
                    \end{pmatrix} \\[0.5em]
                        
\mathbf{W_{KJ}} &amp;= \mathbf{W^{\intercal}_{JK}}
\end{align}
$$&lt;/div&gt;
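A sketch of the hidden-layer deltas in NumPy, reusing the forward weight matrix via its transpose. This assumes a sigmoid layer (so $\sigma^{\prime}$ becomes $\mathcal{O}_{J}(1-\mathcal{O}_{J})$) and that the bias column has already been dropped from $\mathbf{W_{JK}}$, since no delta flows back to the constant bias node; the names and numbers are illustrative:

```python
import numpy as np

def hidden_deltas(O_J, W_JK, delta_K):
    # W_JK.T pulls each output-layer delta back to the hidden nodes;
    # O_J * (1 - O_J) is the sigmoid derivative evaluated elementwise
    return O_J * (1.0 - O_J) * (W_JK.T @ delta_K)

W_JK = np.array([[0.2, 0.4],    # 3 output nodes fed by 2 hidden nodes
                 [0.6, 0.1],
                 [0.3, 0.5]])
delta_K = np.array([0.1, -0.2, 0.05])
O_J = np.array([0.5, 0.5])
delta_J = hidden_deltas(O_J, W_JK, delta_K)
```

No extra matrix has to be built for the backward pass: `W_JK.T` is just a view of the forward weights.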

&lt;div class=&#34;highlight_section&#34;&gt;

And so, the vectorised equations for the output layer and hidden layer deltas are:

&lt;div&gt;$$
\begin{align}

\mathbf{\vec{\delta}_{K}} &amp;= \sigma^{\prime}\left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) * \left( \mathbf{\vec{\mathcal{O}}_{K}} -  \mathbf{T_{K}}\right) \\[0.5em]

\mathbf{ \vec{ \delta }_{J}} &amp;= \sigma^{\prime} \left( \mathbf{ W_{IJ} \mathcal{O}_{I} } \right) * \mathbf{ W^{\intercal}_{JK}} \mathbf{ \vec{\delta}_{K}} 
\end{align}

$$&lt;/div&gt;

&lt;p&gt;&lt;/div&gt;&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s visualise what we&amp;rsquo;ve seen:&lt;/p&gt;

&lt;div  id=&#34;fig2&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;NN Vectorisation 2&#34; src=&#34;/img/simpleNN/nn_vectors2.png&#34; width=&#34;20%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 2&lt;/font&gt;: The NN showing the delta vectors
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;updates&#34;&gt; Vectorising the Update Equations &lt;/h3&gt;

&lt;p&gt;Finally, now that we have the vectorised equations for the deltas (which required us to get the vectorised equations for the forward pass), we&amp;rsquo;re ready to get the update equations in vector form. Let&amp;rsquo;s recall the update equations:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
    \Delta W &amp;= -\eta \ \delta_{l} \ \mathcal{O}_{l-1} \\
    \Delta\theta &amp;= -\eta \ \delta_{l}
\end{align}
$$&lt;/div&gt;

&lt;p&gt;Ignoring the $-\eta$ for now, we need to get a vector form for $\delta_{l} \ \mathcal{O}_{l-1}$ in order to get the update to the weights. We have the matrix of weights:&lt;/p&gt;

&lt;div&gt;$$
    
\mathbf{W_{JK}} = 
\begin{pmatrix}     w_{11} &amp; w_{21}  &amp; w_{31} \\
                w_{12} &amp; w_{22}  &amp; w_{32} \\

                \end{pmatrix}
$$&lt;/div&gt;

&lt;p&gt;Suppose we are updating the weight $w_{21}$ in the matrix. We&amp;rsquo;re looking to find the product of the output from the second node in $J$ with the delta from the first node in $K$.&lt;/p&gt;

&lt;div&gt;$$
    \Delta w_{21} = \delta_{K}^{1} \mathcal{O}_{J}^{2} 
$$&lt;/div&gt;

&lt;p&gt;Considering this example, we can write the matrix for the weight updates as:&lt;/p&gt;

&lt;div&gt;$$
    
\Delta \mathbf{W_{JK}} = 
\begin{pmatrix}     \delta_{K}^{1} \mathcal{O}_{J}^{1} &amp; \delta_{K}^{1}  \mathcal{O}_{J}^{2}  &amp; \delta_{K}^{1} \mathcal{O}_{J}^{3}  \\
                \delta_{K}^{2} \mathcal{O}_{J}^{1} &amp; \delta_{K}^{2} \mathcal{O}_{J}^{2}  &amp; \delta_{K}^{2} \mathcal{O}_{J}^{3} 

                \end{pmatrix}
 = 

\begin{pmatrix}  \delta_{K}^{1} \\ \delta_{K}^{2}\end{pmatrix}

\begin{pmatrix}     \mathcal{O}_{J}^{1} &amp; \mathcal{O}_{J}^{2}&amp; \mathcal{O}_{J}^{3}

\end{pmatrix}

$$&lt;/div&gt;

&lt;p&gt;Generalising this into vector notation and including the &lt;em&gt;learning rate&lt;/em&gt; $\eta$, the update for the weights connecting layer $J$ to layer $K$ is:&lt;/p&gt;

&lt;div&gt;$$
    
\Delta \mathbf{W_{JK}} = -\eta \mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }^{\intercal}_{J}}

$$&lt;/div&gt;

&lt;p&gt;Similarly, we have the update to the bias term:&lt;/p&gt;

&lt;div&gt;$$
\Delta \vec{\theta} = -\eta \mathbf{ \vec{ \delta }_{K}} 
$$&lt;/div&gt;

&lt;p&gt;So the bias term is updated just by taking the deltas straight from the nodes in the subsequent layer, scaled by the negative learning rate.&lt;/p&gt;
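Both updates can be sketched in NumPy: `np.outer` builds the delta-times-output matrix, one entry $\delta_{K}^{k}\,\mathcal{O}_{J}^{j}$ per weight. The helper name and the numbers are invented for illustration:

```python
import numpy as np

def apply_updates(W_JK, theta, delta_K, O_J, eta=0.1):
    # outer product: row k, column j holds delta_K[k] * O_J[j]
    W_new = W_JK - eta * np.outer(delta_K, O_J)
    theta_new = theta - eta * delta_K   # bias moves by the deltas alone
    return W_new, theta_new

W_JK = np.zeros((2, 3))                 # 2 output nodes, 3 hidden outputs
theta = np.zeros(2)
delta_K = np.array([1.0, 2.0])
O_J = np.array([1.0, 0.5, 0.0])
W_new, theta_new = apply_updates(W_JK, theta, delta_K, O_J)
```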

&lt;div class=&#34;highlight_section&#34;&gt;

In summary, for back propagation, the equations we need in vector form are:

&lt;div&gt;$$
\begin{align}

\mathbf{\vec{\delta}_{K}} &amp;= \sigma^{\prime}\left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) * \left( \mathbf{\vec{\mathcal{O}}_{K}} -  \mathbf{T_{K}}\right) \\[0.5em]

\mathbf{ \vec{ \delta }_{J}} &amp;= \sigma^{\prime} \left( \mathbf{ W_{IJ} \mathcal{O}_{I} } \right) * \mathbf{ W^{\intercal}_{JK}} \mathbf{ \vec{\delta}_{K}}

\end{align}
$$&lt;/div&gt;

&lt;div&gt;$$
\begin{align}

\mathbf{W_{JK}} + \Delta \mathbf{W_{JK}} &amp;\rightarrow \mathbf{W_{JK}}, \ \ \ \Delta \mathbf{W_{JK}} = -\eta \mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }^{\intercal}_{J}} \\[0.5em]

\vec{\theta}  + \Delta \vec{\theta}  &amp;\rightarrow \vec{\theta}, \ \ \ \Delta \vec{\theta} = -\eta \mathbf{ \vec{ \delta }_{K}} 

\end{align}
$$&lt;/div&gt;

&lt;p&gt;With $*$ representing an elementwise multiplication between the matrices.&lt;/p&gt;

&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
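To see all of the vectorised pieces working together, here is a toy gradient-descent loop under the same assumptions: sigmoid everywhere, squared error, and the bias handled by an appended row of ones. The 2-2-1 architecture, random data, and averaging of the batch gradient are invented for this sketch, not prescribed by the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))
add_bias = lambda O: np.vstack([O, np.ones((1, O.shape[1]))])

# toy 2-2-1 network; each weight matrix includes a bias column
W_IJ = rng.standard_normal((2, 3))
W_JK = rng.standard_normal((1, 3))
X = rng.standard_normal((2, 8))                       # 8 examples as columns
T = (X.sum(axis=0, keepdims=True) > 0).astype(float)  # toy targets
eta, n = 0.5, X.shape[1]

def loss():
    O_K = sig(W_JK @ add_bias(sig(W_IJ @ add_bias(X))))
    return float(((O_K - T) ** 2).mean())

before = loss()
for _ in range(200):
    O_J = sig(W_IJ @ add_bias(X))                     # forward pass
    O_K = sig(W_JK @ add_bias(O_J))
    d_K = O_K * (1 - O_K) * (O_K - T)                 # output-layer deltas
    d_J = O_J * (1 - O_J) * (W_JK[:, :-1].T @ d_K)    # hidden-layer deltas
    W_JK -= eta * (d_K @ add_bias(O_J).T) / n         # vectorised updates
    W_IJ -= eta * (d_J @ add_bias(X).T) / n
after = loss()
```

The bias column of $\mathbf{W_{JK}}$ is dropped (`[:, :-1]`) when pulling the deltas back, since no error flows to the constant bias node.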

&lt;h2 id=&#34;nextsteps&#34;&gt; What&#39;s next? &lt;/h2&gt;

&lt;p&gt;Although this kind of mathematics can be tedious and sometimes hard to follow (and probably contains numerous notation mistakes&amp;hellip; please let me know if you find them!), it is necessary in order to write a quick, efficient NN. Our next step is to implement this setup in Python.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>A Simple Neural Network - Transfer Functions</title>
      <link>/post/transfer-functions/</link>
      <pubDate>Wed, 08 Mar 2017 10:43:07 +0000</pubDate>
      
      <guid>/post/transfer-functions/</guid>
      <description>&lt;p&gt;We&amp;rsquo;re going to write a little bit of Python in this tutorial on Simple Neural Networks (Part 2). It will focus on the different types of activation (or transfer) functions, their properties and how to write each of them (and their derivatives) in Python.&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;p&gt;As promised in the previous post, we&amp;rsquo;ll take a look at some of the different activation functions that could be used in our nodes. Again &lt;strong&gt;please&lt;/strong&gt; let me know if there&amp;rsquo;s anything I&amp;rsquo;ve gotten totally wrong - I&amp;rsquo;m very much learning too.&lt;/p&gt;

&lt;div id=&#34;toctop&#34;&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#linear&#34;&gt;Linear Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#sigmoid&#34;&gt;Sigmoid Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tanh&#34;&gt;Hyperbolic Tangent Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#gaussian&#34;&gt;Gaussian Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#step&#34;&gt;Heaviside (step) Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#ramp&#34;&gt;Ramp Function&lt;/a&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#relu&#34;&gt;Rectified Linear Unit (ReLU)&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;linear&#34;&gt; Linear (Identity) Function &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&#34;what-does-it-look-like&#34;&gt;What does it look like?&lt;/h3&gt;

&lt;div  id=&#34;fig1&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/linear.png&#34; width=&#34;40%&#34;&gt;&lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/dlinear.png&#34; width=&#34;40%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 1&lt;/font&gt;: The linear function (left) and its derivative (right)
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;formulae&#34;&gt;Formulae&lt;/h3&gt;

&lt;div&gt;$$
f \left( x_{i} \right) = x_{i}, \ \ f^{\prime}\left( x_{i} \right) = 1
$$&lt;/div&gt;

&lt;h3 id=&#34;python-code&#34;&gt;Python Code&lt;/h3&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def linear(x, Derivative=False):
    if not Derivative:
        return x
    else:
        return np.ones_like(x)  # derivative is 1 everywhere; keeps the shape of x
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;why-is-it-used&#34;&gt;Why is it used?&lt;/h3&gt;

&lt;p&gt;If there&amp;rsquo;s a situation where we want a node to give its output without applying any thresholds, then the identity (or linear) function is the way to go.&lt;/p&gt;

&lt;p&gt;Hopefully you can see why it is used in the final output layer nodes as we only want these nodes to do the $ \text{input} \times \text{weight}$ operations before giving us its answer without any further modifications.&lt;/p&gt;

&lt;p&gt;&lt;font color=&#34;blue&#34;&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The linear function is not used in the hidden layers. We must use non-linear transfer functions in the hidden layer nodes, or else the network will only ever compute a linear function of its input (a composition of linear functions is itself linear).&lt;/p&gt;

&lt;p&gt;&lt;/font&gt;&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&#34;sigmoid&#34;&gt; The Sigmoid (or Fermi) Function &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&#34;what-does-it-look-like-1&#34;&gt;What does it look like?&lt;/h3&gt;

&lt;div  id=&#34;fig2&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/sigmoid.png&#34; width=&#34;40%&#34;&gt;&lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/dsigmoid.png&#34; width=&#34;40%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 2&lt;/font&gt;: The sigmoid function (left) and its derivative (right)
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;formulae-1&#34;&gt;Formulae&lt;/h3&gt;

&lt;div &gt;$$
f\left(x_{i} \right) = \frac{1}{1 + e^{  - x_{i}  }}, \ \
f^{\prime}\left( x_{i} \right) = f\left(x_{i}\right) \left( 1 -  f\left(x_{i}\right) \right)
$$&lt;/div&gt;

&lt;h3 id=&#34;python-code-1&#34;&gt;Python Code&lt;/h3&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def sigmoid(x,Derivative=False):
    if not Derivative:
        return 1 / (1 + np.exp(-x))
    else:
        out = sigmoid(x)
        return out * (1 - out)
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;why-is-it-used-1&#34;&gt;Why is it used?&lt;/h3&gt;

&lt;p&gt;This function maps the input to a value between 0 and 1 (but not equal to 0 or 1). This means the output from the node will be a high signal (if the input is positive) or a low one (if the input is negative). This function is often chosen as it is one of the easiest to hard-code in terms of its derivative. The simplicity of its derivative allows us to efficiently perform back propagation without using any fancy packages or approximations. The fact that this function is smooth, continuous (differentiable), monotonic and bounded means that back propagation will work well.&lt;/p&gt;

&lt;p&gt;The sigmoid&amp;rsquo;s natural threshold is 0.5, meaning that any input that maps to a value above 0.5 will be considered high (or 1) in binary terms.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&#34;tanh&#34;&gt; Hyperbolic Tangent Function ( $\tanh(x)$ ) &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&#34;what-does-it-look-like-2&#34;&gt;What does it look like?&lt;/h3&gt;

&lt;div  id=&#34;fig3&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/tanh.png&#34; width=&#34;40%&#34;&gt;&lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/dtanh.png&#34; width=&#34;40%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 3&lt;/font&gt;: The hyperbolic tangent function (left) and its derivative (right)
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;formulae-2&#34;&gt;Formulae&lt;/h3&gt;

&lt;div &gt;$$
f\left(x_{i} \right) = \tanh\left(x_{i}\right), \ \
f^{\prime}\left(x_{i} \right) = 1 - \tanh^{2}\left(x_{i}\right)
$$&lt;/div&gt;
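This section can get a helper in the same style as the other transfer functions, using the same `Derivative` flag convention as the sigmoid snippet; the function name `tanh_tf` is an assumption for this sketch:

```python
import numpy as np

def tanh_tf(x, Derivative=False):
    # same calling convention as the other transfer-function helpers
    if not Derivative:
        return np.tanh(x)
    else:
        return 1.0 - np.tanh(x) ** 2
```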

&lt;h3 id=&#34;why-is-it-used-2&#34;&gt;Why is it used?&lt;/h3&gt;

&lt;p&gt;This is a very similar function to the previous sigmoid function and has many of the same properties: even its derivative is straightforward to compute. However, this function allows us to map the input to any value between -1 and 1 (but not inclusive of those). In effect, this allows us to apply a penalty to the node (negative) rather than just have the node not fire at all. It also gives us a larger range of output to play with in the positive end of the scale, meaning finer adjustments can be made.&lt;/p&gt;

&lt;p&gt;This function has a natural threshold of 0, meaning that any input which maps to a value greater than 0 is considered high (or 1) in binary terms.&lt;/p&gt;

&lt;p&gt;Again, the fact that this function is smooth, continuous (differentiable), monotonic and bounded means that back propagation will work well. The subsequent functions don&amp;rsquo;t all have these properties, which makes them more difficult to use in back propagation (though it is done).
&lt;br&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&#34;what-s-the-difference-between-the-sigmoid-and-hyperbolic-tangent&#34;&gt;What&amp;rsquo;s the difference between the sigmoid and hyperbolic tangent?&lt;/h2&gt;

&lt;p&gt;They both achieve a similar mapping, are both continuous, smooth, monotonic and differentiable, but give out different values. For a sigmoid function, a large negative input generates an almost-zero output. This lack of output will affect all subsequent weights in the network, which may not be desirable - effectively stopping the next nodes from learning. In contrast, the $\tanh$ function gives outputs close to -1 for large negative inputs, maintaining the output of the node and allowing subsequent nodes to learn from it.&lt;/p&gt;
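A quick numerical check of this difference, for an arbitrary large negative input:

```python
import numpy as np

x = -5.0
sigmoid_out = 1.0 / (1.0 + np.exp(-x))  # close to 0: the node goes almost silent
tanh_out = np.tanh(x)                   # close to -1: still a strong (negative) signal
```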

&lt;hr /&gt;

&lt;h2 id=&#34;gaussian&#34;&gt; Gaussian Function &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&#34;what-does-it-look-like-3&#34;&gt;What does it look like?&lt;/h3&gt;

&lt;div  id=&#34;fig4&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/gaussian.png&#34; width=&#34;40%&#34;&gt;&lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/dgaussian.png&#34; width=&#34;40%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 4&lt;/font&gt;: The gaussian function (left) and its derivative (right)
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;formulae-3&#34;&gt;Formulae&lt;/h3&gt;

&lt;div &gt;$$
f\left( x_{i}\right ) = e^{ -x_{i}^{2}}, \ \
f^{\prime}\left( x_{i}\right ) = - 2x_{i} e^{ - x_{i}^{2}}
$$&lt;/div&gt;

&lt;h3 id=&#34;python-code-2&#34;&gt;Python Code&lt;/h3&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def gaussian(x, Derivative=False):
    if not Derivative:
        return np.exp(-x**2)
    else:
        return -2 * x * np.exp(-x**2)
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;why-is-it-used-3&#34;&gt;Why is it used?&lt;/h3&gt;

&lt;p&gt;The gaussian function is an even function, thus it gives the same output for positive and negative inputs of equal magnitude. It gives its maximal output when there is no input, and its output decreases with increasing distance from zero. We can perhaps imagine this function being used in a node where the input feature is less likely to contribute to the final result.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&#34;step&#34;&gt; Step (or Heaviside) Function &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&#34;what-does-it-look-like-4&#34;&gt;What does it look like?&lt;/h3&gt;

&lt;div  id=&#34;fig5&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/step.png&#34; width=&#34;40%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 5&lt;/font&gt;: The Heaviside function
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;formulae-4&#34;&gt;Formulae&lt;/h3&gt;

&lt;div&gt;$$
    f(x_{i})=
\begin{cases}
\begin{align}
    0  \ &amp;: \ x_{i} \leq T\\
    1 \ &amp;: \ x_{i} &gt; T\\
    \end{align}
\end{cases}
$$&lt;/div&gt;
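
&lt;h3 id=&#34;python-code-step&#34;&gt;Python Code&lt;/h3&gt;

&lt;p&gt;Following the pattern of the snippets for the other functions (with NumPy imported as &lt;code&gt;np&lt;/code&gt; as elsewhere; the threshold parameter &lt;code&gt;T&lt;/code&gt; and its default of 0 are our own choices), a sketch might be:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def step(x, T=0, Derivative=False):
    if not Derivative:
        # 1 where x exceeds the threshold T, 0 otherwise
        return (x &amp;gt; T).astype(float)
    else:
        # zero everywhere (undefined exactly at x = T)
        return np.zeros(x.shape)
&lt;/code&gt;&lt;/pre&gt;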

&lt;h3 id=&#34;why-is-it-used-4&#34;&gt;Why is it used?&lt;/h3&gt;

&lt;p&gt;Some cases call for a function which applies a hard threshold: the output is either exactly one value or exactly another. The other functions we&amp;rsquo;ve looked at have an intrinsically probabilistic output i.e. an output closer to 1 implies a greater probability of the node firing. The step function does away with this, opting for a definite high or low output depending on some threshold $T$ on the input.&lt;/p&gt;

&lt;p&gt;However, the step function is discontinuous and therefore non-differentiable at the threshold (its derivative is the Dirac delta function). Consequently, this function cannot be trained with back-propagation in practice.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&#34;ramp&#34;&gt; Ramp Function &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&#34;what-does-it-look-like-5&#34;&gt;What does it look like?&lt;/h3&gt;

&lt;div  id=&#34;fig6&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/ramp.png&#34; width=&#34;40%&#34;&gt;&lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/dramp.png&#34; width=&#34;40%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 6&lt;/font&gt;: The ramp function (left) and its derivative (right) with $T_{1}=-2$ and $T_{2}=3$.
        &lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&#34;formulae-5&#34;&gt;Formulae&lt;/h3&gt;

&lt;div&gt;$$
    f(x)= 
\begin{cases}
\begin{align}
    0 \ &amp;: \ x_{i} \leq T_{1}\\[0.5em]
    \frac{\left( x_{i} - T_{1} \right)}{\left( T_{2} - T_{1} \right)} \ &amp;: \ T_{1} \leq x_{i} \leq T_{2}\\[0.5em]
    1 \ &amp;: \ x_{i} &gt; T_{2}\\
    \end{align}
\end{cases}
$$&lt;/div&gt;

&lt;h3 id=&#34;python-code-3&#34;&gt;Python Code&lt;/h3&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def ramp(x, Derivative=False, T1=0, T2=None):
    if T2 is None:
        T2 = np.max(x)  # a default argument cannot reference x directly
    if not Derivative:
        out = (x - T1) / (T2 - T1)
        out[(x &amp;lt; T1)] = 0
        out[(x &amp;gt; T2)] = 1
        return out
    else:
        # gradient of the linear section is 1/(T2 - T1), zero elsewhere
        out = np.ones(x.shape) / (T2 - T1)
        out[((x &amp;lt; T1) | (x &amp;gt; T2))] = 0
        return out
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;why-is-it-used-5&#34;&gt;Why is it used?&lt;/h3&gt;

&lt;p&gt;The ramp function is a truncated version of the linear function. From its shape, it looks like a more definitive version of the sigmoid function: it maps a range of inputs to outputs over $(0 \ 1)$, but this time with definite cut-off points $T_{1}$ and $T_{2}$. This gives the function the ability to fire the node very definitively above a threshold while retaining some uncertainty in the lower region. It is uncommon to see a negative $T_{1}$ unless the ramp is distributed symmetrically about $0$.&lt;/p&gt;

&lt;h3 id=&#34;relu&#34;&gt; 6.1 Rectified Linear Unit (ReLU) &lt;/h3&gt;

&lt;p&gt;There is a popular special case of the ramp function, used in the powerful &lt;em&gt;convolutional neural network&lt;/em&gt; (CNN) architecture, called the &lt;em&gt;&lt;strong&gt;Re&lt;/strong&gt;ctified &lt;strong&gt;L&lt;/strong&gt;inear &lt;strong&gt;U&lt;/strong&gt;nit&lt;/em&gt; (ReLU). In a ReLU, $T_{1}=0$ and $T_{2}$ is the maximum of the input, giving a linear function with no negative values, as below:&lt;/p&gt;

&lt;div  id=&#34;fig7&#34; class=&#34;figure_container&#34;&gt;
        &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/relu.png&#34; width=&#34;40%&#34;&gt;&lt;img title=&#34;Simple NN&#34; src=&#34;/img/transferFunctions/drelu.png&#34; width=&#34;40%&#34;&gt;
        &lt;/div&gt;
        &lt;div class=&#34;figure_caption&#34;&gt;
            &lt;font color=&#34;blue&#34;&gt;Figure 7&lt;/font&gt;: The Rectified Linear Unit (ReLU) (left) with its derivative (right).
        &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;and in Python:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;def relu(x, Derivative=False):
    if not Derivative:
        return np.maximum(0, x)  # identity for positive x, zero otherwise
    else:
        out = np.ones(x.shape)   # gradient is 1 where x is positive
        out[(x &amp;lt; 0)] = 0      # and 0 where x is negative
        return out
&lt;/code&gt;&lt;/pre&gt;</description>
    </item>
    
    <item>
      <title>A Simple Neural Network - Mathematics</title>
      <link>/post/neuralnetwork/</link>
      <pubDate>Mon, 06 Mar 2017 17:04:53 +0000</pubDate>
      
      <guid>/post/neuralnetwork/</guid>
      <description>&lt;p&gt;This is the first part of a series of tutorials on Simple Neural Networks (NN). Tutorials on neural networks (NN) can be found all over the internet. Though many of them cover the same material, each is written (or recorded) slightly differently. This means that I always feel like I learn something new or get a better understanding of things with every tutorial I see. I&amp;rsquo;d like to make this tutorial as clear as I can, so sometimes the maths may be simplistic, but hopefully it&amp;rsquo;ll give you a good understanding of what&amp;rsquo;s going on.  &lt;strong&gt;Please&lt;/strong&gt; let me know if any of the notation is incorrect or there are any mistakes - either comment or use the contact page on the left.&lt;/p&gt;

&lt;div id=&#34;toctop&#34;&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#nnarchitecture&#34;&gt;Neural Network Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#transferFunction&#34;&gt;Transfer Function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#feedforward&#34;&gt;Feed-forward&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#error&#34;&gt;Error&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#backPropagationGrads&#34;&gt;Back Propagation - the Gradients&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bias&#34;&gt;Bias&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#backPropagationAlgorithm&#34;&gt;Back Propagaton - the Algorithm&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&#34;nnarchitecture&#34;&gt;1. Neural Network Architecture &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By now, you may well have come across diagrams which look very similar to the one below. It shows input nodes connected to output nodes via intermediate nodes in what is called a &amp;lsquo;hidden layer&amp;rsquo; - &amp;lsquo;hidden&amp;rsquo; because, when using a NN, only the input and output concern the user; the &amp;lsquo;under-the-hood&amp;rsquo; workings may not be interesting to them. In real, high-performing NNs there are usually more hidden layers.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;Simple NN&#34; width=40% src=&#34;/img/simpleNN/simpleNN.png&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 1&lt;/font&gt;: A simple 2-layer NN with 2 features in the input layer, 3 nodes in the hidden layer and two nodes in the output layer.
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;When we train our network, the nodes in the hidden layer each perform a calculation using the values from the input nodes. The output of this is passed on to the nodes of the next layer. When the output hits the final layer, the &amp;lsquo;output layer&amp;rsquo;, the results are compared to the real, known outputs and some tweaking of the network is done to make the output more similar to the real results. This is done with an algorithm called &lt;em&gt;back propagation&lt;/em&gt;. Before we get there, lets take a closer look at these calculations being done by the nodes.&lt;/p&gt;

&lt;h2 id=&#34;transferFunction&#34;&gt;2. Transfer Function &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At each node in the hidden and output layers of the NN, an &lt;em&gt;activation&lt;/em&gt; or &lt;em&gt;transfer&lt;/em&gt; function is executed. This function takes in the outputs of the previous layer&amp;rsquo;s nodes, each multiplied by some &lt;em&gt;weight&lt;/em&gt;. These weights are the lines which connect the nodes in the diagram. The weights that come out of one node can all be different, so that node will &lt;em&gt;activate&lt;/em&gt; different neurons by different amounts. The transfer function can take many forms; we will first look at the &lt;em&gt;sigmoid&lt;/em&gt; transfer function as it seems traditional.&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
        &lt;img title=&#34;The sigmoid function&#34; width=50% src=&#34;/img/simpleNN/sigmoid.png&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 2&lt;/font&gt;: The sigmoid function.
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;As you can see from the figure, the sigmoid function takes any real-valued input and maps it to a real number in the range $(0 \ 1)$ - i.e. between, but not equal to, 0 and 1. We can think of this almost like saying &amp;lsquo;if the value we have maps to an output near 1, this node fires, if it maps to an output near 0, the node does not fire&amp;rsquo;. The equation for this sigmoid function is:&lt;/p&gt;

&lt;div id=&#34;eqsigmoidFunction&#34;&gt;$$
\sigma ( x ) = \frac{1}{1 + e^{-x}}
$$&lt;/div&gt;

&lt;p&gt;We need the derivative of this transfer function so that we can perform back propagation later on. This is the process whereby the connections in the network are updated to tune the performance of the NN. We&amp;rsquo;ll talk about this in more detail later, but let&amp;rsquo;s find the derivative now.&lt;/p&gt;

&lt;div&gt;
$$
\begin{align*}
\frac{d}{dx}\sigma ( x ) &amp;= \frac{d}{dx} \left( 1 + e^{ -x }\right)^{-1}\\
&amp;=  -1 \times -e^{-x} \times \left(1 + e^{-x}\right)^{-2}= \frac{ e^{-x} }{ \left(1 + e^{-x}\right)^{2} } \\
&amp;= \frac{\left(1 + e^{-x}\right) - 1}{\left(1 + e^{-x}\right)^{2}} 
= \frac{\left(1 + e^{-x}\right) }{\left(1 + e^{-x}\right)^{2}} - \frac{1}{\left(1 + e^{-x}\right)^{2}} 
= \frac{1}{\left(1 + e^{-x}\right)} - \left( \frac{1}{\left(1 + e^{-x}\right)} \right)^{2} \\[0.5em]
&amp;= \sigma ( x ) - \sigma ( x ) ^ {2}
\end{align*}
$$&lt;/div&gt;

&lt;p&gt;Therefore, we can write the derivative of the sigmoid function as:&lt;/p&gt;

&lt;div id=&#34;eqdsigmoid&#34;&gt;$$
\sigma^{\prime}( x ) = \sigma (x ) \left( 1 - \sigma ( x ) \right)
$$&lt;/div&gt;

&lt;p&gt;The sigmoid function has the nice property that its derivative is very simple: a bonus when we want to hard-code this into our NN later on. Now that we have our activation or transfer function selected, what do we do with it?&lt;/p&gt;
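
&lt;p&gt;As a minimal sketch of hard-coding it (assuming NumPy; the function name and signature are our own), the sigmoid and its derivative might look like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def sigmoid(x, Derivative=False):
    s = 1 / (1 + np.exp(-x))  # sigma(x)
    if not Derivative:
        return s
    else:
        return s * (1 - s)    # sigma(x)(1 - sigma(x))
&lt;/code&gt;&lt;/pre&gt;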

&lt;h2 id=&#34;feedforward&#34;&gt;3. Feed-forward &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During a feed-forward pass, the network takes in the input values and gives us some output values. To see how this is done, let&amp;rsquo;s first consider a 2-layer neural network like the one in Figure 1. Here we are going to refer to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$i$ - the $i^{\text{th}}$ node of the input layer $I$&lt;/li&gt;
&lt;li&gt;$j$ - the $j^{\text{th}}$ node of the hidden layer $J$&lt;/li&gt;
&lt;li&gt;$k$ - the $k^{\text{th}}$ node of the output layer $K$&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The activation function at a node $j$ in the hidden layer takes the value:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
x_{j} &amp;= \xi_{1} w_{1j} + \xi_{2} w_{2j} \\[0.5em]
&amp;= \sum_{i \in I} \xi_{i} w_{i j}

\end{align}
$$&lt;/div&gt;

&lt;p&gt;where $\xi_{i}$ is the value of the $i^{\text{th}}$ input node and $w_{i j}$ is the weight of the connection between the $i^{\text{th}}$ input node and the $j^{\text{th}}$ hidden node. &lt;strong&gt;In short:&lt;/strong&gt; at each hidden-layer node, multiply each input value by the weight of the connection arriving at that node and add the results together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; the weights are initialised when the network is set up. Sometimes they are all set to 1, but often they&amp;rsquo;re set to some small random value.&lt;/p&gt;

&lt;p&gt;We apply the activation function on $x_{j}$ at the $j^{\text{th}}$ hidden node and get:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\mathcal{O}_{j} &amp;= \sigma(x_{j}) \\
&amp;= \sigma(  \xi_{1} w_{1j} + \xi_{2} w_{2j})
\end{align}
$$&lt;/div&gt;

&lt;p&gt;$\mathcal{O}_{j}$ is the output of the $j^{\text{th}}$ hidden node. This is calculated for each of the $j$ nodes in the hidden layer. The resulting outputs now become the input for the next layer in the network. In our case, this is the final output layer. So for each of the $k$ nodes in $K$:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\mathcal{O}_{k} &amp;= \sigma(x_{k}) \\
&amp;= \sigma \left( \sum_{j \in J}  \mathcal{O}_{j} w_{jk}  \right)
\end{align}
$$&lt;/div&gt;

&lt;p&gt;As we&amp;rsquo;ve reached the end of the network, this is also the end of the feed-forward pass. So how well did our network do at getting the correct result $\mathcal{O}_{k}$? As this is the training phase of our network, the true results are known and we can calculate the error.&lt;/p&gt;
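
&lt;p&gt;As a hedged sketch of the pass above (the input values and small random weights are illustrative assumptions), a feed-forward through the 2-3-2 network of Figure 1 might look like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

np.random.seed(0)                  # fixed seed so the example is repeatable
xi = np.array([0.5, 0.9])          # input features, one per node in I
W_ij = np.random.rand(2, 3) * 0.1  # weights between layers I and J
W_jk = np.random.rand(3, 2) * 0.1  # weights between layers J and K

O_j = sigmoid(xi @ W_ij)           # outputs of the 3 hidden nodes
O_k = sigmoid(O_j @ W_jk)          # outputs of the 2 output nodes
&lt;/code&gt;&lt;/pre&gt;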

&lt;h2 id=&#34;error&#34;&gt;4. Error &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We measure error at the end of each foward pass. This allows us to quantify how well our network has performed in getting the correct output. Let&amp;rsquo;s define $t_{k}$ as the expected or &lt;em&gt;target&lt;/em&gt; value of the $k^{\text{th}}$ node of the output layer $K$. Then the error $E$ on the entire output is:&lt;/p&gt;

&lt;div id=&#34;eqerror&#34;&gt;$$
\text{E} = \frac{1}{2} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right)^{2}
$$&lt;/div&gt;

&lt;p&gt;Don&amp;rsquo;t be put off by the seemingly random &lt;sup&gt;1&lt;/sup&gt;&amp;frasl;&lt;sub&gt;2&lt;/sub&gt; in front there; it&amp;rsquo;s been manufactured that way to make the upcoming maths easier. The rest of this should be easy enough: take the residual (the difference between the target and output values), square it to get rid of any negatives and sum this over all of the nodes in the output layer.&lt;/p&gt;
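
&lt;p&gt;In code this is a one-liner; the output and target values below are arbitrary examples:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

O_k = np.array([0.62, 0.58])  # example network outputs (arbitrary)
t_k = np.array([0.0, 1.0])    # example target values (arbitrary)

E = 0.5 * np.sum((O_k - t_k) ** 2)  # 0.2804
&lt;/code&gt;&lt;/pre&gt;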

&lt;p&gt;Good! Now how does this help us? Our aim here is to find a way to tune our network such that when we do a forward pass of the input data, the output is exactly what we know it should be. But we can&amp;rsquo;t change the input data, so there are only two other things we can change:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the weights going into the activation function&lt;/li&gt;
&lt;li&gt;the activation function itself&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We will indeed consider the second case in another post, but the magic of NN is all about the &lt;em&gt;weights&lt;/em&gt;. Getting each weight i.e. each connection between nodes, to be just the perfect value is what back propagation is all about. We will look at the back propagation algorithm in the next section, but let&amp;rsquo;s set it up by considering the following: how much of this error $E$ has come from each of the weights in the network?&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;re asking: what proportion of the error comes from each of the $W_{jk}$ connections between the nodes in layer $J$ and the output layer $K$? Or in mathematical terms:&lt;/p&gt;

&lt;div&gt;$$
\frac{\partial{\text{E}}}{\partial{W_{jk}}} =  \frac{\partial{}}{\partial{W_{jk}}}  \frac{1}{2} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right)^{2}
$$&lt;/div&gt;

&lt;p&gt;If you&amp;rsquo;re not concerned with working out the derivative, skip this highlighted section.&lt;/p&gt;

&lt;div class=&#34;highlight_section&#34;&gt;

To tackle this we can use the following bits of knowledge: the derivative of the sum is equal to the sum of the derivatives i.e. we can move the derivative term inside of the summation:

&lt;div&gt;$$ \frac{\partial{\text{E}}}{\partial{W_{jk}}} =  \frac{1}{2} \sum_{k \in K} \frac{\partial{}}{\partial{W_{jk}}} \left( \mathcal{O}_{k} - t_{k} \right)^{2}$$&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;the weight $w_{1k}$ does not affect connection $w_{2k}$ therefore the change in $W_{jk}$ with respect to any node other than the current $k$ is zero. Thus the summation goes away:&lt;/li&gt;
&lt;/ul&gt;

&lt;div&gt;$$ \frac{\partial{\text{E}}}{\partial{W_{jk}}} =  \frac{1}{2} \frac{\partial{}}{\partial{W_{jk}}}  \left( \mathcal{O}_{k} - t_{k} \right)^{2}$$&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;apply the power rule knowing that $t_{k}$ is a constant:&lt;/li&gt;
&lt;/ul&gt;

&lt;div&gt;$$ 
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{jk}}} &amp;=  \frac{1}{2} \times 2 \times \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{jk}}}  \left( \mathcal{O}_{k}\right) \\
 &amp;=  \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{jk}}}  \left( \mathcal{O}_{k}\right)
\end{align}
$$&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;the leftover derivative is the change in the output values with respect to the weights. Substituting $ \mathcal{O}_{k} = \sigma(x_{k}) $ and the sigmoid derivative $\sigma^{\prime}( x ) = \sigma (x ) \left( 1 - \sigma ( x ) \right)$:&lt;/li&gt;
&lt;/ul&gt;

&lt;div&gt;$$ 
\frac{\partial{\text{E}}}{\partial{W_{jk}}} =  \left( \mathcal{O}_{k} - t_{k} \right) \sigma (x_{k} ) \left( 1 - \sigma ( x_{k} ) \right) \frac{\partial{}}{\partial{W_{jk}}}  \left( x_{k}\right)
$$&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;the final derivative, the input value $x_{k}$ is just $\mathcal{O}_{j} W_{jk}$ i.e. output of the previous layer times the weight to this layer. So the change in  $\mathcal{O}_{j} w_{jk}$ with respect to $w_{jk}$ just gives us the output value of the previous layer $ \mathcal{O}_{j} $ and so the full derivative becomes:&lt;/li&gt;
&lt;/ul&gt;

&lt;div&gt;$$ 
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{jk}}}  &amp;=  \left( \mathcal{O}_{k} - t_{k} \right) \sigma (x_{k} ) \left( 1 - \sigma ( x_{k} ) \right) \frac{\partial{}}{\partial{W_{jk}}}  \left( \mathcal{O}_{j} W_{jk} \right) \\[0.5em]
&amp;=\left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k}  \left( 1 - \mathcal{O}_{k}  \right) \mathcal{O}_{j} 
\end{align}
$$&lt;/div&gt;

&lt;p&gt;We can replace the sigmoid function and its derivative with the output of the layer, since $\sigma(x_{k}) = \mathcal{O}_{k}$.&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;The derivative of the error function with respect to the weights is then:&lt;/p&gt;

&lt;div id=&#34;derror&#34;&gt;$$ 
\frac{\partial{\text{E}}}{\partial{W_{jk}}}  =\left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k}  \left( 1 - \mathcal{O}_{k}  \right) \mathcal{O}_{j}
$$&lt;/div&gt;

&lt;p&gt;We group the terms involving $k$ and define:&lt;/p&gt;

&lt;div&gt;$$
\delta_{k} = \mathcal{O}_{k}  \left( 1 - \mathcal{O}_{k}  \right)  \left( \mathcal{O}_{k} - t_{k} \right)
$$&lt;/div&gt;

&lt;p&gt;And therefore:&lt;/p&gt;

&lt;div id=&#34;derrorjk&#34;&gt;$$ 
\frac{\partial{\text{E}}}{\partial{W_{jk}}}  = \mathcal{O}_{j} \delta_{k} 
$$&lt;/div&gt;

&lt;p&gt;So we have an expression for the amount of error, called &amp;lsquo;delta&amp;rsquo; ($\delta_{k}$), on the weights from the nodes in $J$ to each node $k$ in $K$. But how does this help us to improve our network? We need to back propagate the error.&lt;/p&gt;

&lt;h2 id=&#34;backPropagationGrads&#34;&gt;5. Back Propagation - the gradients&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Back propagation takes the error function we found in the previous section, uses it to calculate the error on the current layer and updates the weights to that layer by some amount.&lt;/p&gt;

&lt;p&gt;So far we&amp;rsquo;ve only looked at the error on the output layer - what about the hidden layer? This also has an error, but the error here depends on the output layer&amp;rsquo;s error too (because this is where the difference between the target $t_{k}$ and output $\mathcal{O}_{k}$ can be calculated). Let&amp;rsquo;s have a look at the error on the weights of the hidden layer $W_{ij}$:&lt;/p&gt;

&lt;div&gt;$$ \frac{\partial{\text{E}}}{\partial{W_{ij}}} =  \frac{\partial{}}{\partial{W_{ij}}}  \frac{1}{2} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right)^{2}$$&lt;/div&gt;

&lt;p&gt;Now, unlike before, we cannot just drop the summation, as the derivative is not directly acting on a subscript $k$ in the summation. We should be careful to note that the output from every node in $J$ is actually connected to each of the nodes in $K$, so the summation must stay. But we can still use the same tricks as before: let&amp;rsquo;s move the derivative inside the summation (because the summation is finite) and apply the power rule again:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &amp;=  \frac{1}{2} \times 2 \times \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{ij}}} \mathcal{O}_{k} \\
&amp;= \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{ij}}} \mathcal{O}_{k}
 \end{align}
 $$&lt;/div&gt;
 

&lt;p&gt;Again, we substitute $\mathcal{O}_{k} = \sigma( x_{k})$ and its derivative and revert back to our output notation:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &amp;= \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{ij}}} (\sigma(x_{k}) )\\
&amp;= \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \sigma(x_{k}) \left( 1 - \sigma(x_{k}) \right) \frac{\partial{}}{\partial{W_{ij}}} (x_{k}) \\
&amp;= \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) \frac{\partial{}}{\partial{W_{ij}}} (x_{k})
 \end{align}
 $$&lt;/div&gt;
 

&lt;p&gt;This still looks familiar from the output layer derivative, but now we&amp;rsquo;re struggling with the derivative of the input to $k$ i.e. $x_{k}$ with respect to the weights from $I$ to $J$. Let&amp;rsquo;s use the chain rule to break apart this derivative in terms of the output from $J$:&lt;/p&gt;

&lt;div&gt; $$
\frac{\partial{ x_{k}}}{\partial{W_{ij}}} = \frac{\partial{ x_{k}}}{\partial{\mathcal{O}_{j}}}\frac{\partial{\mathcal{O}_{j}}}{\partial{W_{ij}}}
$$&lt;/div&gt;

&lt;p&gt;The change of the input to the $k^{\text{th}}$ node with respect to the output from the $j^{\text{th}}$ node is down to a product with the weights, therefore this derivative just becomes the weights $W_{jk}$. The final derivative has nothing to do with the subscript $k$ anymore, so we&amp;rsquo;re free to move this around - lets put it at the beginning:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &amp;= \frac{\partial{\mathcal{O}_{j}}}{\partial{W_{ij}}}  \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk}
 \end{align}
 $$&lt;/div&gt;
 

&lt;p&gt;Lets finish the derivatives, remembering that the output of the node $j$ is just $\mathcal{O}_{j} = \sigma(x_{j}) $ and we know the derivative of this function too:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &amp;= \frac{\partial{}}{\partial{W_{ij}}}\sigma(x_{j})  \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk} \\
&amp;= \sigma(x_{j}) \left( 1 - \sigma(x_{j}) \right)  \frac{\partial{x_{j} }}{\partial{W_{ij}}} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk} \\
&amp;= \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right)  \frac{\partial{x_{j} }}{\partial{W_{ij}}} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk}
 \end{align}
 $$&lt;/div&gt;
 

&lt;p&gt;The final derivative is straightforward too: the derivative of the input to $j$ with respect to the weights is just the previous layer&amp;rsquo;s output, which in our case is $\mathcal{O}_{i}$,&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &amp;= \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right)  \mathcal{O}_{i} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk}
 \end{align}
 $$&lt;/div&gt;
 

&lt;p&gt;Almost there! Recall that we defined $\delta_{k}$ earlier, lets sub that in:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &amp;= \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right)  \mathcal{O}_{i} \sum_{k \in K} \delta_{k} W_{jk}
 \end{align}
 $$&lt;/div&gt;
 

&lt;p&gt;To clean this up, we now define the &amp;lsquo;delta&amp;rsquo; for our hidden layer:&lt;/p&gt;

&lt;div&gt;$$
\delta_{j} = \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right)   \sum_{k \in K} \delta_{k} W_{jk}
$$&lt;/div&gt;

&lt;p&gt;Thus, the amount of error on each of the weights going into our hidden layer:&lt;/p&gt;

&lt;div id=&#34;derrorij&#34;&gt;$$ 
\frac{\partial{\text{E}}}{\partial{W_{ij}}}  = \mathcal{O}_{i} \delta_{j} 
$$&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; the reason for the name &lt;em&gt;back&lt;/em&gt; propagation is that we must calculate the errors at the far end of the network and work backwards to be able to calculate the weights at the front.&lt;/p&gt;

&lt;h2 id=&#34;bias&#34;&gt;6.  Bias &lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lets remind ourselves what happens inside our hidden layer nodes:&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
    &lt;img title=&#34;Simple NN&#34;  width=50% src=&#34;/img/simpleNN/nodeInsideNoBias.png&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 3&lt;/font&gt;: The insides of a hidden layer node, $j$.
    &lt;/div&gt;
&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;Each feature $\xi_{i}$ from the input layer $I$ is multiplied by some weight $w_{ij}$&lt;/li&gt;
&lt;li&gt;These are added together to get $x_{j}$, the total weighted input from the nodes in $I$&lt;/li&gt;
&lt;li&gt;$x_{j}$ is passed through the activation, or transfer, function $\sigma(x_{j})$&lt;/li&gt;
&lt;li&gt;This gives the output $\mathcal{O}_{j}$ for each of the $j$ nodes in hidden layer $J$&lt;/li&gt;
&lt;li&gt;$\mathcal{O}_{j}$ from each of the $J$ nodes becomes $\xi_{j}$ for the next layer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When we talk about the &lt;em&gt;bias&lt;/em&gt; term in NN, we are talking about an additional parameter that is included in the summation of step 2 above. The bias term is usually denoted with the symbol $\theta$ (theta). Its function is to act as a threshold for the activation (transfer) function. It is given the value of 1 and is not connected to anything else. As such, any derivative of the node&amp;rsquo;s output with respect to the bias term would just give a constant, 1. This allows us to think of the bias term as an output from the node with the value of 1. It will be updated later during back propagation to change the threshold at which the node fires.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s update the equation for $x_{j}$:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
x_{j} &amp;= \xi_{1} w_{1j} + \xi_{2} w_{2j} + \theta_{j} \\[0.5em]
\sigma( x_{j} ) &amp;= \sigma \left( \sum_{i \in I} \left( \xi_{i} w_{ij} \right) + \theta_{j} \right)
\end{align}
$$&lt;/div&gt;

&lt;p&gt;and put it on the diagram:&lt;/p&gt;

&lt;div class=&#34;figure_container&#34;&gt;
    &lt;div class=&#34;figure_images&#34;&gt;
    &lt;img title=&#34;Simple NN&#34;  width=50% src=&#34;/img/simpleNN/nodeInside.png&#34;&gt;
    &lt;/div&gt;
    &lt;div class=&#34;figure_caption&#34;&gt;
        &lt;font color=&#34;blue&#34;&gt;Figure 4&lt;/font&gt;: The insides of a hidden layer node, $j$, with the bias term.
    &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&#34;backPropagationAlgorithm&#34;&gt;7. Back Propagation - the algorithm&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;#toctop&#34;&gt;To contents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we have all of the pieces! We&amp;rsquo;ve got the initial outputs after our feed-forward, we have the equations for the delta terms (the amount by which the error depends on the different weights) and we know we need to update our bias term too. So what does the algorithm look like?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Input the data into the network and feed-forward&lt;/li&gt;

&lt;li&gt;&lt;p&gt;For each of the &lt;em&gt;output&lt;/em&gt; nodes calculate:&lt;/p&gt;

&lt;div&gt;$$
\delta_{k} = \mathcal{O}_{k}  \left( 1 - \mathcal{O}_{k}  \right)  \left( \mathcal{O}_{k} - t_{k} \right)
$$&lt;/div&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;For each of the &lt;em&gt;hidden layer&lt;/em&gt; nodes calculate:&lt;/p&gt;

&lt;div&gt;$$
\delta_{j} = \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right)   \sum_{k \in K} \delta_{k} W_{jk}
$$&lt;/div&gt;
    &lt;/li&gt;

&lt;li&gt;&lt;p&gt;Calculate the changes that need to be made to the weights and bias terms:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
\Delta W &amp;= -\eta \ \delta_{l} \ \mathcal{O}_{l-1} \\
\Delta\theta &amp;= -\eta \ \delta_{l}
\end{align}
$$&lt;/div&gt;
    &lt;/li&gt;

&lt;li&gt;&lt;p&gt;Update the weights and biases across the network:&lt;/p&gt;

&lt;div&gt;$$
\begin{align}
W + \Delta W &amp;\rightarrow W \\
\theta + \Delta\theta &amp;\rightarrow \theta
\end{align}
$$&lt;/div&gt;
    &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here, $\eta$ is the &lt;em&gt;learning rate&lt;/em&gt;: just a small number that limits the size of the deltas we compute - we don&amp;rsquo;t want the network jumping around everywhere. The $l$ subscript denotes the deltas and output for a given layer $l$. That is, we compute the delta for each of the nodes in a layer and vectorise them. Thus we can combine them with the output values of the previous layer and get our update $\Delta W$ for the weights of the current layer. Similarly with the bias term.&lt;/p&gt;

&lt;p&gt;This algorithm is looped over and over until the error between the output and the target values is below some set threshold. Depending on the size of the network i.e. the number of layers and number of nodes per layer, it can take a long time to complete one &amp;lsquo;epoch&amp;rsquo; or run through of this algorithm.&lt;/p&gt;
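
&lt;p&gt;As a hedged sketch of one pass of this algorithm for the 2-3-2 network of Figure 1 (the inputs, targets, learning rate and small random weights below are all illustrative assumptions, and the bias terms are initialised to zero):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

np.random.seed(0)                  # fixed seed so the example is repeatable
eta = 0.5                          # learning rate (arbitrary)
xi = np.array([0.5, 0.9])          # input features
t = np.array([0.0, 1.0])           # target outputs
W_ij = np.random.rand(2, 3) * 0.1  # weights from layer I to layer J
W_jk = np.random.rand(3, 2) * 0.1  # weights from layer J to layer K
theta_j = np.zeros(3)              # hidden-layer biases
theta_k = np.zeros(2)              # output-layer biases

# 1. feed-forward
O_j = sigmoid(xi @ W_ij + theta_j)
O_k = sigmoid(O_j @ W_jk + theta_k)

# 2. deltas for the output nodes
delta_k = O_k * (1 - O_k) * (O_k - t)

# 3. deltas for the hidden nodes
delta_j = O_j * (1 - O_j) * (W_jk @ delta_k)

# 4. changes to the weights and biases
dW_jk = -eta * np.outer(O_j, delta_k)
dW_ij = -eta * np.outer(xi, delta_j)

# 5. update the weights and biases across the network
W_jk += dW_jk
W_ij += dW_ij
theta_k += -eta * delta_k
theta_j += -eta * delta_j
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Running another feed-forward pass after the update should give a slightly smaller error $E$; looping this drives the error down over many epochs.&lt;/p&gt;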

&lt;p&gt;&lt;em&gt;Some of the ideas and notation in this tutorial comes from the good videos by &lt;a href=&#34;https://www.youtube.com/playlist?list=PL29C61214F2146796&#34; title=&#34; NN Videos&#34;&gt;Ryan Harris&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Web Design Wisdom</title>
      <link>/post/webdesign/</link>
      <pubDate>Sat, 04 Mar 2017 17:21:15 +0000</pubDate>
      
      <guid>/post/webdesign/</guid>
      <description>&lt;p&gt;So I&amp;rsquo;m quite a way into getting MLNotebook set up and I&amp;rsquo;ve been learning a hell of a lot about web design using Hugo (a static site generator). There are a few things around the internet that could be explained more clearly, or where more examples could be given, so hopefully that&amp;rsquo;s what I can do for you here!
&lt;/p&gt;

&lt;p&gt;I thought I&amp;rsquo;d give an overview of some of the wisdom I&amp;rsquo;ve gained from creating MLNotebook - my adventures in markdown&amp;hellip; and the rest!&lt;/p&gt;

&lt;h2 id=&#34;hugo&#34;&gt; Hugo &lt;/h2&gt;

&lt;h3 id=&#34;hugoSetup&#34;&gt; Setup &lt;/h3&gt;

&lt;p&gt;Hugo was relatively easy to set up, but I think some of the guides around could be a lot clearer, particularly when it comes to hosting on GitHub Pages. Firstly, make sure that you download Hugo &lt;a href=&#34;https://github.com/spf13/hugo/releases&#34; title=&#34;Hugo Github&#34;&gt;here&lt;/a&gt; and extract it to &lt;code&gt;/usr/local/bin&lt;/code&gt;. I renamed mine to &amp;ldquo;hugo&amp;rdquo;. Check whether it&amp;rsquo;s properly installed with the command:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ hugo version
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will print the version number. If it doesn&amp;rsquo;t, add &lt;code&gt;/usr/local/bin&lt;/code&gt; to your system path:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ PATH=$PATH:/usr/local/bin
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Creating a new site called &amp;ldquo;newsite&amp;rdquo; from scratch is the easy bit:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ hugo new site ./newsite
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;themeAndOverrides&#34;&gt;Theme and overrides &lt;/h3&gt;

&lt;p&gt;To get my theme to work, I simply cloned the repository (as shown &lt;a href=&#34;https://themes.gohugo.io/blackburn/&#34; title=&#34;Blackburn theme&#34;&gt;here&lt;/a&gt;) directly into &lt;code&gt;./newsite/themes/blackburn&lt;/code&gt;. Be sure to copy the &lt;code&gt;config.toml&lt;/code&gt; file to &lt;code&gt;./newsite&lt;/code&gt;. That&amp;rsquo;s all there is to it!&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ mkdir themes
$ cd themes
$ git clone https://github.com/yoshiharuyamashita/blackburn.git
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Customising this theme was really easy as it is mostly done in &lt;code&gt;config.toml&lt;/code&gt;. What I wish I&amp;rsquo;d known about Hugo straight off the bat is that the tree structure is important: anything in the &amp;ldquo;themes&amp;rdquo; folder is a fall-back for anything that &lt;strong&gt;isn&amp;rsquo;t&lt;/strong&gt; present in the root folder of the site. That means if you have your own template for a post in &lt;code&gt;./newsite/layouts/single.html&lt;/code&gt;, it will be used instead of the theme&amp;rsquo;s copy in &lt;code&gt;./newsite/themes/blackburn/layouts/single.html&lt;/code&gt;. Thus if you want to edit a layout, copy the theme&amp;rsquo;s version into your site&amp;rsquo;s layouts folder and edit it from there.&lt;/p&gt;
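&lt;p&gt;If it helps to see the fall-back rule in action, here is a tiny sketch using plain files (everything happens in a throwaway temp directory; the paths just mimic the layout above):&lt;/p&gt;

```shell
# Mimic Hugo's lookup order with dummy files: a template in the site's
# own layouts/ folder shadows the same file under themes/.
cd "$(mktemp -d)"
mkdir -p newsite/themes/blackburn/layouts newsite/layouts
echo "theme version" > newsite/themes/blackburn/layouts/single.html
echo "my version"    > newsite/layouts/single.html
# Hugo would pick the site's copy, not the theme's:
cat newsite/layouts/single.html
```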

&lt;p&gt;The index page is the same deal: just copy it to your site&amp;rsquo;s root and it will take precedence over the default theme&amp;rsquo;s one.&lt;/p&gt;

&lt;h3 id=&#34;partials&#34;&gt;Partials&lt;/h3&gt;

&lt;p&gt;The partials bit can be a little confusing if you&amp;rsquo;re not too familiar with how the site is put together. Effectively, the page you&amp;rsquo;re looking at right now is made up of lots of different parts (partials) that have been edited separately, put through a parser, turned into HTML and pasted together into a single HTML page. The head and footer don&amp;rsquo;t have much in them, but they are important for adding calls to Javascript as they are stitched into each and every page on the website. Don&amp;rsquo;t confuse the head.html and header.html files: the latter is the actual title/banner at the top of the homepage (it is another partial that is stitched into index.html).&lt;/p&gt;

&lt;h3 id=&#34;socialMediaButtons&#34;&gt;Social Media Buttons&lt;/h3&gt;

&lt;p&gt;I spent a while trying to figure out how to get my social media buttons to actually take the url of the page they were on and share that exact post. I tried a hosted service which gave me a script that pulled down the buttons from them and allowed me to edit them via their interface, but it wasn&amp;rsquo;t content-specific. To dynamically get the url and get some nice-looking icons, I actually used the site &lt;a href=&#34;https://simplesharingbuttons.com/&#34; title=&#34;Simple Sharing Buttons&#34;&gt;Simple Sharing Buttons&lt;/a&gt;, chose the sites I wanted, and they provided the icons along with the HTML. In comparison to other sites and methods, this seems to work the best (except for the reddit one, really).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-html&#34;&gt;&amp;lt;ul class=&amp;quot;share-buttons&amp;quot;&amp;gt;
  &amp;lt;li&amp;gt;&amp;lt;a href=&amp;quot;https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Fmlnotebook.github.io&amp;amp;t=&amp;quot; title=&amp;quot;Share on Facebook&amp;quot; target=&amp;quot;_blank&amp;quot; onclick=&amp;quot;window.open(&#39;https://www.facebook.com/sharer/sharer.php?u=&#39; + encodeURIComponent(document.URL) + &#39;&amp;amp;t=&#39; + encodeURIComponent(document.URL),&#39;&#39;,&#39;width=500,height=300&#39;); return false;&amp;quot;&amp;gt;&amp;lt;img alt=&amp;quot;Share on facebook&amp;quot; src=&amp;quot;/img/facebook.png&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;&amp;lt;a href=&amp;quot;https://twitter.com/intent/tweet?source=https%3A%2F%2Fmlnotebook.github.io&amp;amp;text=:%20https%3A%2F%2Fmlnotebook.github.io&amp;amp;via=mlnotebook&amp;quot; target=&amp;quot;_blank&amp;quot; title=&amp;quot;Tweet&amp;quot; onclick=&amp;quot;window.open(&#39;https://twitter.com/intent/tweet?text=&#39; + encodeURIComponent(document.title) + &#39;:%20&#39;  + encodeURIComponent(document.URL),&#39;&#39;,&#39;width=500,height=300&#39;); return false;&amp;quot;&amp;gt;&amp;lt;img alt=&amp;quot;Tweet&amp;quot; src=&amp;quot;/img/twitter.png&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;&amp;lt;a href=&amp;quot;http://www.reddit.com/submit?url=https%3A%2F%2Fmlnotebook.github.io&amp;amp;title=&amp;quot; target=&amp;quot;_blank&amp;quot; title=&amp;quot;Submit to Reddit&amp;quot; onclick=&amp;quot;window.open(&#39;http://www.reddit.com/submit?url=&#39; + encodeURIComponent(document.URL) + &#39;&amp;amp;title=&#39; +  encodeURIComponent(document.title),&#39;&#39;,&#39;width=500,height=300&#39;); return false;&amp;quot;&amp;gt;&amp;lt;img alt=&amp;quot;Submit to Reddit&amp;quot; src=&amp;quot;/img/reddit.png&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;&amp;lt;a href=&amp;quot;http://www.linkedin.com/shareArticle?mini=true&amp;amp;url=https%3A%2F%2Fmlnotebook.github.io&amp;amp;title=&amp;amp;summary=&amp;amp;source=https%3A%2F%2Fmlnotebook.github.io&amp;quot; target=&amp;quot;_blank&amp;quot; title=&amp;quot;Share on LinkedIn&amp;quot; onclick=&amp;quot;window.open(&#39;http://www.linkedin.com/shareArticle?mini=true&amp;amp;url=&#39; + encodeURIComponent(document.URL) + &#39;&amp;amp;title=&#39; +  encodeURIComponent(document.title),&#39;&#39;,&#39;width=500,height=300&#39;); return false;&amp;quot;&amp;gt;&amp;lt;img alt=&amp;quot;Share on LinkedIn&amp;quot; src=&amp;quot;/img/linkedin.png&amp;quot;&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;githubPages&#34;&gt; Hosting on Personal Github Pages &lt;/h3&gt;

&lt;p&gt;Again, some of the tutorials out there aren&amp;rsquo;t great at properly explaining how to get your pages hosted on your &lt;strong&gt;personal&lt;/strong&gt; Github pages (i.e. &lt;code&gt;https://&amp;lt;your username&amp;gt;.github.io&lt;/code&gt;), rather than project ones, so I&amp;rsquo;ll try to give you another version here.&lt;/p&gt;

&lt;p&gt;Firstly, login to Github and create the repository &lt;code&gt;&amp;lt;your username&amp;gt;.github.io&lt;/code&gt;. This is important as the master branch will be used to locate your website at exactly &lt;code&gt;https://&amp;lt;your username&amp;gt;.github.io&lt;/code&gt;. Initialise it with the &lt;code&gt;README.md&lt;/code&gt;. Create a new branch called &lt;code&gt;hugo&lt;/code&gt; and initialise this with the &lt;code&gt;README.md&lt;/code&gt; too.&lt;/p&gt;

&lt;p&gt;In your &lt;code&gt;./newsite&lt;/code&gt; directory you&amp;rsquo;ll need to build the site, initialise the git respository and add the remote:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;$ hugo
$
$ git init
$ git remote add origin git@github.com:&amp;lt;username&amp;gt;/&amp;lt;username&amp;gt;.github.io.git
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you&amp;rsquo;re having trouble adding the remote because of &lt;em&gt;permissions&lt;/em&gt; it could be that you&amp;rsquo;re using a different Git account for your website than normal. Have a look at the &lt;code&gt;git config&lt;/code&gt; options to change the username/password. If that fails, it could be that you need to sort an &lt;code&gt;ssh&lt;/code&gt; key - instructions for that are on your account settings page.&lt;/p&gt;
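&lt;p&gt;For the common case where the website repo needs a different identity to your day-to-day one, a per-repository config is usually enough. A sketch with placeholder values (swap in your own username and email):&lt;/p&gt;

```shell
# Set name/email for this repository only, leaving the global
# git identity untouched (the values below are placeholders).
cd "$(mktemp -d)" && git init -q newsite && cd newsite
git config user.name "yourusername"
git config user.email "you@example.com"
git config user.name   # shows the per-repo name
```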

&lt;p&gt;From here, I managed to find and adapt two scripts from &lt;a href=&#34;https://hjdskes.github.io/blog/deploying-hugo-on-personal-gh-pages/&#34; title=&#34;hjdskes&#34;&gt;here&lt;/a&gt;. The first is &lt;code&gt;setup.sh&lt;/code&gt; (&lt;a href=&#34;/docs/setup.sh&#34; title=&#34;setup.sh&#34;&gt;download&lt;/a&gt;) and only needs to be executed once. It does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deletes the master branch (perfectly safe)&lt;/li&gt;
&lt;li&gt;Creates a new orphaned master branch&lt;/li&gt;
&lt;li&gt;Takes the &lt;code&gt;README.md&lt;/code&gt; from &lt;code&gt;hugo&lt;/code&gt; and makes an initial commit to &lt;code&gt;master&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Changes back to &lt;code&gt;hugo&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Removes the existing &lt;code&gt;./public&lt;/code&gt; folder&lt;/li&gt;
&lt;li&gt;Sets the &lt;code&gt;master&lt;/code&gt; branch as a subtree for the &lt;code&gt;./public&lt;/code&gt; folder&lt;/li&gt;
&lt;li&gt;Pulls the committed &lt;code&gt;master&lt;/code&gt; back into &lt;code&gt;./public&lt;/code&gt; to stop merge conflicts.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&#34;warn&#34;&gt;Make sure that you edit the `USERNAME` field in `setup.sh` before executing.&lt;/div&gt;

&lt;p&gt;After that, whenever you want to upload your site, just run the second script &lt;code&gt;deploy.sh&lt;/code&gt;, which I&amp;rsquo;ve altered slightly (&lt;a href=&#34;/docs/deploy.sh&#34; title=&#34;deploy.sh&#34;&gt;download&lt;/a&gt;) to take an optional argument which becomes your commit message: omitting the argument submits a default message.&lt;/p&gt;
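&lt;p&gt;The optional argument boils down to bash&amp;rsquo;s default-value expansion. A minimal sketch of the idea (the default message here is a placeholder, not the one the real script uses):&lt;/p&gt;

```shell
#!/bin/bash
# If a first argument is given it becomes the commit message;
# otherwise a default message is used (placeholder text).
msg="${1:-Rebuilding site}"
echo "$msg"
```

&lt;p&gt;So running the script with an argument commits with that message, while running it bare falls back to the default.&lt;/p&gt;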

&lt;p&gt;&lt;code&gt;deploy.sh&lt;/code&gt; commits and pushes all of your changes to the &lt;code&gt;hugo&lt;/code&gt; source branch before putting the &lt;code&gt;./public&lt;/code&gt; folder on &lt;code&gt;master&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&#34;warn&#34;&gt;Make sure that you edit the `USERNAME` field in `deploy.sh` before executing.&lt;/div&gt;

&lt;p&gt;And that&amp;rsquo;s it! If the website doesn&amp;rsquo;t load when you go to &lt;code&gt;https://&amp;lt;your username&amp;gt;.github.io&lt;/code&gt; you may need to hit &lt;code&gt;settings&lt;/code&gt; in your repo (top right of the menu bar), scroll down to &amp;ldquo;Github Pages&amp;rdquo; and select &lt;code&gt;master&lt;/code&gt; as your source.&lt;/p&gt;

&lt;h2 id=&#34;htmlCss&#34;&gt;HTML / CSS&lt;/h2&gt;

&lt;h3 id=&#34;contactForm&#34;&gt;Contact Form&lt;/h3&gt;

&lt;p&gt;The first part of the site I altered was the contact page. I added a contact form, which largely involves &lt;code&gt;html&lt;/code&gt; formatted with &lt;code&gt;css&lt;/code&gt;. The magic that makes it work comes from the free service called &lt;a href=&#34;https://formspree.io/&#34; title=&#34;Formspree&#34;&gt;Formspree&lt;/a&gt;. Essentially, the submit button sends the information to Formspree and they forward it on to me directly. It uses a hidden field to give the forwarded emails the same subject, which makes for easy filtering. It also provides a free &amp;ldquo;I&amp;rsquo;m not a robot&amp;rdquo; page after clicking submit.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-html&#34;&gt;&amp;lt;div id=&amp;quot;contactform&amp;quot; class=&amp;quot;center&amp;quot;&amp;gt;
&amp;lt;form action=&amp;quot;https://formspree.io/your@email.com&amp;quot; method=&amp;quot;POST&amp;quot; name=&amp;quot;sentMessage&amp;quot; id=&amp;quot;contactForm&amp;quot; novalidate&amp;gt;
	&amp;lt;input type=&amp;quot;text&amp;quot; name=&amp;quot;name&amp;quot; placeholder=&amp;quot;Name&amp;quot; id=&amp;quot;name&amp;quot; required data-validation-required-message=&amp;quot;Please enter your name.&amp;quot;&amp;gt;&amp;lt;br&amp;gt;
	&amp;lt;input type=&amp;quot;email&amp;quot; name=&amp;quot;_replyto&amp;quot; placeholder=&amp;quot;Email Address&amp;quot; id=&amp;quot;email&amp;quot; required data-validation-required-message=&amp;quot;Please enter your email address.&amp;quot; &amp;gt;&amp;lt;br&amp;gt;

	&amp;lt;input type=&amp;quot;hidden&amp;quot;  name=&amp;quot;_subject&amp;quot; value=&amp;quot;Message from MLNotebook&amp;quot;&amp;gt;
	&amp;lt;input type=&amp;quot;text&amp;quot; name=&amp;quot;_gotcha&amp;quot; style=&amp;quot;display:none&amp;quot; /&amp;gt;
	&amp;lt;textarea rows=&amp;quot;10&amp;quot; name=&amp;quot;message&amp;quot; class=&amp;quot;form-control&amp;quot; placeholder=&amp;quot;Message&amp;quot; id=&amp;quot;message&amp;quot; required data-validation-required-message=&amp;quot;Please enter a message.&amp;quot;&amp;gt;&amp;lt;/textarea&amp;gt;&amp;lt;br&amp;gt;
	&amp;lt;input type=&amp;quot;submit&amp;quot; value=&amp;quot;Send&amp;quot;&amp;gt;
&amp;lt;/form&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The formatting was a pain as I&amp;rsquo;d never used the box-sizing property before - this is what I found made the boxes all the same size with the same alignment. I added the vendor-prefixed versions for all browsers too.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-css&#34;&gt;
input[type=text], input[type=email], textarea {
	display: inline-block;
  	border: 1px solid transparent;
  	border-top: none;
  	border-bottom: 1px solid #DDD;
  	box-shadow: inset 0 1px 2px rgba(0,0,0,.39), 0 -1px 1px #FFF, 0 1px 0 #FFF;
	border-radius: 4px;
	margin: 2px 2px 2px 2px;
	resize:none;
	float: left;
	width: 100%;
}

textarea, input {
    -webkit-box-sizing: border-box;
    -moz-box-sizing: border-box;
    box-sizing: border-box;
}

input[type=submit] {
	width: 100%;
}

.center {
	margin: auto;
}

input {
	height:50px;
}

textarea {
	height: 200px;
	padding-left: 0px;
}

input::-webkit-input-placeholder, textarea::-webkit-input-placeholder {
   padding-left: 10px;
}
input::-moz-placeholder, textarea::-moz-placeholder {
   padding-left: 10px;
}
input:-ms-input-placeholder, textarea:-ms-input-placeholder {
   padding-left: 10px;
}
input:-moz-placeholder, textarea:-moz-placeholder {
   padding-left: 10px;
}
  
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;resizing&#34;&gt;Resizing for Small Screens&lt;/h3&gt;

&lt;p&gt;One of my final hurdles in getting the site set up was making the homepage a little more friendly than just showing the recent posts. So I decided to add my &lt;a href=&#34;https://twitter.com/mlnotebook&#34; title=&#34;@MLNotebook&#34;&gt;twitter&lt;/a&gt; feed to the side. Twitter provides an easy embed code for this, and I just put it into its own partial in &lt;code&gt;layouts/partials/twitterfeed.html&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;My problem here though was that when I viewed my site on my phone, or resized the web-browser on the computer, the content would shrink and be almost unreadable - I wanted the feed to move below the text if the screen was below a certain size. So I created the usual &lt;code&gt;div&lt;/code&gt; containers within my &lt;code&gt;index.html&lt;/code&gt; file and added the shortcode to include my &lt;code&gt;twitterfeed.html&lt;/code&gt; in the right-hand side.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-html&#34;&gt;&amp;lt;div id=&amp;quot;container&amp;quot; class=&amp;quot;center&amp;quot;&amp;gt;
	&amp;lt;div id=&amp;quot;left_content&amp;quot; class=&amp;quot;center&amp;quot;&amp;gt;
		&amp;lt;div class=&amp;quot;content&amp;quot;&amp;gt;
		  {{ range ( .Paginate (where .Data.Pages &amp;quot;Type&amp;quot; &amp;quot;post&amp;quot;)).Pages }}
		    {{ .Render &amp;quot;summary&amp;quot;}}
		  {{ end }}

		  {{ partial &amp;quot;pagination.html&amp;quot; . }}
		&amp;lt;/div&amp;gt;
	&amp;lt;/div&amp;gt;
	&amp;lt;div id=&amp;quot;right_content&amp;quot; class=&amp;quot;center&amp;quot;&amp;gt;
		&amp;lt;center&amp;gt;{{ partial &amp;quot;twitterfeed.html&amp;quot; . }}&amp;lt;/center&amp;gt;
	&amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I then used &lt;code&gt;css&lt;/code&gt; to give the &lt;code&gt;div&lt;/code&gt; containers their own properties for different screen sizes:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-css&#34;&gt;#container {
	position: relative;
	width:auto;

}

#right_content {
	float:left;
	overflow:hidden;
	display:block;
	padding-right:1%;

}

#left_content {
	float:left;
	width:80%;
	display:block;
	margin:auto;
	min-width: 600px;

}

pre &amp;gt; code {
	font-size:11pt;
}

@media screen and (max-width: 1000px) {

#left_content {
	width: 100%;
	}
	
	.content {
	max-width:100%;
	}



#right_content {
	width:100%;
}

pre &amp;gt; code {
	font-size:8pt;
}

}

&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that this allows the size of the font in the code-snippets to shrink when the screen size is small - I find that it reads more easily.&lt;/p&gt;

&lt;h2 id=&#34;syntaxHighlighting&#34;&gt;Syntax highlighting&lt;/h2&gt;

&lt;p&gt;So actually getting code into the website was trickier than I thought. The in-built markdown codeblocks seem to work just fine by adding code between backticks: &lt;code&gt;`&amp;lt;code here&amp;gt;`&lt;/code&gt;. Markdown doesn&amp;rsquo;t do syntax highlighting right out of the box though, so I&amp;rsquo;m using &lt;code&gt;highlight.js&lt;/code&gt;. My theme does come with a highlight shortcode option, but I found that I couldn&amp;rsquo;t customise it how I wanted - particularly, the font size was just too big. I tried everything, even adding extra &lt;code&gt;&amp;lt;pre&amp;gt; &amp;lt;/pre&amp;gt;&lt;/code&gt; tags around it and using &lt;code&gt;css&lt;/code&gt; to format them. In the end, I found that using &lt;code&gt;highlight.js&lt;/code&gt; was much simpler - I just loaded the script straight off their server and voila! The link just needed editing to select the theme I wanted, but I opted for the standard &lt;code&gt;monokai&lt;/code&gt; anyway. I placed this in my site&amp;rsquo;s &lt;code&gt;head&lt;/code&gt; partial.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-html&#34;&gt;&amp;lt;link rel=&amp;quot;stylesheet&amp;quot; href=&amp;quot;//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.9.0/styles/monokai.min.css&amp;quot;&amp;gt;
&amp;lt;script src=&amp;quot;//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.9.0/highlight.min.js&amp;quot;&amp;gt;&amp;lt;/script&amp;gt;
&amp;lt;script&amp;gt;hljs.initHighlightingOnLoad();&amp;lt;/script&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;mathsRendering&#34;&gt;Maths Rendering&lt;/h2&gt;

&lt;p&gt;As this is a site on machine learning, I&amp;rsquo;m going to need to be able to include some mathematics sometimes. I&amp;rsquo;m very familiar with $\rm\LaTeX$ and I&amp;rsquo;ve written up a lot of formulae already, so I looked into getting $\rm\LaTeX$ formatting into markdown/Hugo. A few math rendering engines are around, but not all are simple to implement. The best option I found was &lt;a href=&#34;https://www.mathjax.org/&#34; title=&#34;MathJax&#34;&gt;MathJax&lt;/a&gt;, which literally required me to add these few lines to my &lt;code&gt;head&lt;/code&gt; partial.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-html&#34;&gt;&amp;lt;script type=&amp;quot;text/javascript&amp;quot;
  src=&amp;quot;https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&amp;quot;&amp;gt;
&amp;lt;/script&amp;gt;

&amp;lt;script type=&amp;quot;text/x-mathjax-config&amp;quot;&amp;gt;
MathJax.Hub.Config({
  tex2jax: {
    inlineMath: [[&#39;$&#39;,&#39;$&#39;], [&#39;\\(&#39;,&#39;\\)&#39;]],
    displayMath: [[&#39;$$&#39;,&#39;$$&#39;], [&#39;\\[&#39;,&#39;\\]&#39;]],
    processEscapes: true,
    processEnvironments: true,
    skipTags: [&#39;script&#39;, &#39;noscript&#39;, &#39;style&#39;, &#39;textarea&#39;, &#39;pre&#39;],
    TeX: { equationNumbers: { autoNumber: &amp;quot;AMS&amp;quot; },
         extensions: [&amp;quot;AMSmath.js&amp;quot;, &amp;quot;AMSsymbols.js&amp;quot;] }
  }
});
&amp;lt;/script&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;From there, it allows me to put inline math into my websites such as $ c = \sqrt{a^{2} + b^{2}} $ by enclosing them in the normal \$ symbols like so: &lt;code&gt;\$ some math \$&lt;/code&gt;. MathJax also provides display-style input with enclosing &lt;code&gt;&amp;lt;div&amp;gt;\$\$ code \$\$&amp;lt;/div&amp;gt;&lt;/code&gt; e.g.:&lt;/p&gt;

&lt;div&gt;$$ c = \sqrt{a^{2} + b^{2}}  $$&lt;/div&gt;

&lt;p&gt;The formatting is done with some css:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-css&#34;&gt;code.has-jax {
	font: inherit;
	font-size: 100%;
	background: inherit;
	border: inherit;
	color: #515151;
}
&lt;/code&gt;&lt;/pre&gt;

</description>
    </item>
    
  </channel>
</rss>