<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://kharshit.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://kharshit.github.io/" rel="alternate" type="text/html" /><updated>2024-12-03T19:30:22+00:00</updated><id>https://kharshit.github.io/feed.xml</id><title type="html">Harshit Kumar</title><subtitle>Harshit Kumar - personal website and blog</subtitle><entry><title type="html">Matrix Multiplication in CUDA</title><link href="https://kharshit.github.io/blog/2024/06/07/matrix-multiplication-cuda" rel="alternate" type="text/html" title="Matrix Multiplication in CUDA" /><published>2024-06-07T00:00:00+00:00</published><updated>2024-06-07T00:00:00+00:00</updated><id>https://kharshit.github.io/blog/2024/06/07/matrix-multiplication-cuda</id><content type="html" xml:base="https://kharshit.github.io/blog/2024/06/07/matrix-multiplication-cuda"><![CDATA[<p>Matrix multiplication is at the heart of deep learning. In the evolving world of LLMs, fast and efficient matrix multiplications are paramount. Nvidia's CUDA lets you run these operations on the GPU, where they can be massively parallelized.</p>

<p>CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model developed by Nvidia. The CUDA programming model provides an abstraction of the GPU architecture, exposed to the programmer through an API.</p>

<p>In this blog post, we will explore how to implement matrix multiplication using CUDA. We will start with a naive implementation on the CPU and then demonstrate how to significantly speed up the process using CUDA.</p>

<h2 id="naive-c-implementation-on-cpu">Naive C++ Implementation on CPU</h2>

<p>Since on most hardware matrices are stored in row-major format, let’s define our 2D matrices as row-major 1D arrays.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="k">struct</span> <span class="nc">Matrix</span> 
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">height</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">width</span><span class="p">;</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">elements</span><span class="p">;</span> <span class="c1">// height x width</span>
    <span class="c1">// you can also use std::vector&lt;float&gt; elements for automatic memory management</span>
<span class="p">};</span></code></pre></figure>

<p>Matrix multiplication for computing each element of matrix <code class="language-plaintext highlighter-rouge">C</code> from matrices <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code> can be written as follows:</p>

\[C_{i,j} = \sum_{k=0}^{K-1} A_{i,k} \times B_{k,j}\]

<p>where <code class="language-plaintext highlighter-rouge">i</code> and <code class="language-plaintext highlighter-rouge">j</code> are the row and column indices of the resulting matrix <code class="language-plaintext highlighter-rouge">C</code> and <code class="language-plaintext highlighter-rouge">k</code> is the index used for the summation over the common dimension.</p>

<div style="text-align: center">
<figure>
<img src="/img/cuda_matmul_naive.png" style="display: block; margin: auto;  max-width: 55%;" />
<figcaption>Naive matmul (source: Nvidia CUDA docs)</figcaption>
</figure>
</div>

<p>Our naive matrix multiplication in C++ on the CPU is:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="kt">void</span> <span class="nf">matMulCPU</span><span class="p">(</span><span class="k">const</span> <span class="n">Matrix</span> <span class="o">&amp;</span><span class="n">A</span><span class="p">,</span> <span class="k">const</span> <span class="n">Matrix</span> <span class="o">&amp;</span><span class="n">B</span><span class="p">,</span> <span class="n">Matrix</span> <span class="o">&amp;</span><span class="n">C</span><span class="p">)</span> 
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">row</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">row</span> <span class="o">&lt;</span> <span class="n">A</span><span class="p">.</span><span class="n">height</span><span class="p">;</span> <span class="o">++</span><span class="n">row</span><span class="p">)</span> 
    <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">col</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">col</span> <span class="o">&lt;</span> <span class="n">B</span><span class="p">.</span><span class="n">width</span><span class="p">;</span> <span class="o">++</span><span class="n">col</span><span class="p">)</span> 
        <span class="p">{</span>
            <span class="kt">float</span> <span class="n">cValue</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
            <span class="c1">// C[i][j] = sum_k A[i][k] * B[k][j]</span>
            <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">k</span> <span class="o">&lt;</span> <span class="n">A</span><span class="p">.</span><span class="n">width</span><span class="p">;</span> <span class="o">++</span><span class="n">k</span><span class="p">)</span> 
                <span class="n">cValue</span> <span class="o">+=</span> <span class="n">A</span><span class="p">.</span><span class="n">elements</span><span class="p">[</span><span class="n">row</span> <span class="o">*</span> <span class="n">A</span><span class="p">.</span><span class="n">width</span> <span class="o">+</span> <span class="n">k</span><span class="p">]</span> <span class="o">*</span> <span class="n">B</span><span class="p">.</span><span class="n">elements</span><span class="p">[</span><span class="n">k</span> <span class="o">*</span> <span class="n">B</span><span class="p">.</span><span class="n">width</span> <span class="o">+</span> <span class="n">col</span><span class="p">];</span>
            <span class="n">C</span><span class="p">.</span><span class="n">elements</span><span class="p">[</span><span class="n">row</span> <span class="o">*</span> <span class="n">C</span><span class="p">.</span><span class="n">width</span> <span class="o">+</span> <span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="n">cValue</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>We can use the <code class="language-plaintext highlighter-rouge">main()</code> function below to call <code class="language-plaintext highlighter-rouge">matMulCPU()</code> and measure its performance.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="cp">#include &lt;chrono&gt;
#include &lt;cstdlib&gt;
#include &lt;iostream&gt;
</span>
<span class="c1">// Function to initialize a matrix with random values</span>
<span class="kt">void</span> <span class="nf">initializeMatrix</span><span class="p">(</span><span class="n">Matrix</span> <span class="o">&amp;</span><span class="n">mat</span><span class="p">)</span> 
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">mat</span><span class="p">.</span><span class="n">height</span> <span class="o">*</span> <span class="n">mat</span><span class="p">.</span><span class="n">width</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> 
        <span class="n">mat</span><span class="p">.</span><span class="n">elements</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="k">static_cast</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;</span><span class="p">(</span><span class="n">rand</span><span class="p">()</span> <span class="o">%</span> <span class="mi">100</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="n">main</span><span class="p">()</span> 
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">M</span> <span class="o">=</span> <span class="mi">1024</span><span class="p">;</span> <span class="c1">// Rows of A and C</span>
    <span class="kt">int</span> <span class="n">K</span> <span class="o">=</span> <span class="mi">768</span><span class="p">;</span> <span class="c1">// Columns of A and rows of B</span>
    <span class="kt">int</span> <span class="n">N</span> <span class="o">=</span> <span class="mi">1024</span><span class="p">;</span> <span class="c1">// Columns of B and C </span>

    <span class="c1">// Allocate matrices A, B, and C</span>
    <span class="n">Matrix</span> <span class="n">A</span> <span class="o">=</span> <span class="p">{</span><span class="n">M</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="k">new</span> <span class="kt">float</span><span class="p">[</span><span class="n">M</span> <span class="o">*</span> <span class="n">K</span><span class="p">]};</span> <span class="c1">// 1024x768</span>
    <span class="n">Matrix</span> <span class="n">B</span> <span class="o">=</span> <span class="p">{</span><span class="n">K</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="k">new</span> <span class="kt">float</span><span class="p">[</span><span class="n">K</span> <span class="o">*</span> <span class="n">N</span><span class="p">]};</span> <span class="c1">// 768x1024 </span>
    <span class="n">Matrix</span> <span class="n">C</span> <span class="o">=</span> <span class="p">{</span><span class="n">M</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="k">new</span> <span class="kt">float</span><span class="p">[</span><span class="n">M</span> <span class="o">*</span> <span class="n">N</span><span class="p">]};</span> <span class="c1">// 1024x1024</span>

    <span class="c1">// Initialize matrices A and B with random values</span>
    <span class="n">initializeMatrix</span><span class="p">(</span><span class="n">A</span><span class="p">);</span>
    <span class="n">initializeMatrix</span><span class="p">(</span><span class="n">B</span><span class="p">);</span>

    <span class="c1">// Measure the time taken for matrix multiplication on the CPU</span>
    <span class="k">auto</span> <span class="n">start</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">high_resolution_clock</span><span class="o">::</span><span class="n">now</span><span class="p">();</span>
    <span class="n">matMulCPU</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">C</span><span class="p">);</span>
    <span class="k">auto</span> <span class="n">stop</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">high_resolution_clock</span><span class="o">::</span><span class="n">now</span><span class="p">();</span>

    <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">duration</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;</span> <span class="n">duration</span> <span class="o">=</span> <span class="n">stop</span> <span class="o">-</span> <span class="n">start</span><span class="p">;</span>
    <span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"CPU matrix multiplication time: "</span> <span class="o">&lt;&lt;</span> <span class="n">duration</span><span class="p">.</span><span class="n">count</span><span class="p">()</span> <span class="o">*</span> <span class="mf">1000.0</span><span class="n">f</span> <span class="o">&lt;&lt;</span> <span class="s">" ms"</span> <span class="o">&lt;&lt;</span> <span class="n">endl</span><span class="p">;</span>

    <span class="c1">// Clean up memory</span>
    <span class="k">delete</span><span class="p">[]</span> <span class="n">A</span><span class="p">.</span><span class="n">elements</span><span class="p">;</span>
    <span class="k">delete</span><span class="p">[]</span> <span class="n">B</span><span class="p">.</span><span class="n">elements</span><span class="p">;</span>
    <span class="k">delete</span><span class="p">[]</span> <span class="n">C</span><span class="p">.</span><span class="n">elements</span><span class="p">;</span>

    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>

<h2 id="naive-cuda-kernel">Naive CUDA Kernel</h2>

<p>In CUDA, we define a kernel: a function (written in C++ here) that is executed on the GPU.</p>

<p>The CUDA programming model has a three-level hierarchy. Threads are the smallest unit of execution. Threads are grouped into thread blocks, and blocks are grouped into a grid. A kernel is written from the perspective of a single thread; thus, a kernel is executed as a grid of blocks of threads.</p>

<div style="text-align: center">
<figure>
<img src="/img/cuda_thread_grid.png" style="display: block; margin: auto;  max-width: 55%;" />
<figcaption>CUDA grid of thread blocks (source: Nvidia CUDA docs)</figcaption>
</figure>
</div>

<p>On a CPU, matrix multiplication is typically performed sequentially, where each element of the output matrix is computed one after another. This process can be slow for large matrices due to the limited number of CPU cores available for parallel execution. In contrast, the GPU excels at parallel processing. A CUDA kernel is executed by many threads running simultaneously, allowing for significant speedup in computations like matrix multiplication. The GPU’s architecture enables it to handle thousands of threads concurrently, making it well-suited for tasks with high levels of parallelism.</p>

<p>Let’s rewrite the above matrix multiplication code in CUDA. We use the <code class="language-plaintext highlighter-rouge">__global__</code> keyword to define a CUDA kernel. Here, we assign one thread to compute each element of the output matrix C, and many such threads run in parallel. Each thread reads one row of A and one column of B to compute one element of C.</p>

<p>Threads and blocks are indexed using the built-in 3D variables <code class="language-plaintext highlighter-rouge">threadIdx</code> and <code class="language-plaintext highlighter-rouge">blockIdx</code>, and <code class="language-plaintext highlighter-rouge">blockDim</code> gives the dimensions of the thread block. Their components are accessed with the dot operator, e.g. <code class="language-plaintext highlighter-rouge">threadIdx.x, threadIdx.y, and threadIdx.z</code>. Thus, for a 2D thread block, we can compute the row and column of each element of C from these values, as shown in the code below.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">__global__</span> <span class="kt">void</span> <span class="nf">matMulNaiveKernel</span><span class="p">(</span><span class="n">Matrix</span> <span class="n">A</span><span class="p">,</span> <span class="n">Matrix</span> <span class="n">B</span><span class="p">,</span> <span class="n">Matrix</span> <span class="n">C</span><span class="p">)</span> 
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">row</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">y</span> <span class="o">*</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">y</span> <span class="o">+</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">col</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span> <span class="o">*</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">x</span> <span class="o">+</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>

    <span class="c1">// Each thread accumulates one element of C by accumulating results into cValue</span>
    <span class="kt">float</span> <span class="n">cValue</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="c1">// C[i][j] = sum_k A[i][k] * B[k][j]</span>
    <span class="c1">// Iterates over common dimensions of A and B (k = A.width = B.height)</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">row</span> <span class="o">&lt;</span> <span class="n">A</span><span class="p">.</span><span class="n">height</span> <span class="o">&amp;&amp;</span> <span class="n">col</span> <span class="o">&lt;</span> <span class="n">B</span><span class="p">.</span><span class="n">width</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">k</span> <span class="o">&lt;</span> <span class="n">A</span><span class="p">.</span><span class="n">width</span><span class="p">;</span> <span class="o">++</span><span class="n">k</span><span class="p">)</span>
            <span class="n">cValue</span> <span class="o">+=</span> <span class="n">A</span><span class="p">.</span><span class="n">elements</span><span class="p">[</span><span class="n">row</span> <span class="o">*</span> <span class="n">A</span><span class="p">.</span><span class="n">width</span> <span class="o">+</span> <span class="n">k</span><span class="p">]</span> <span class="o">*</span> <span class="n">B</span><span class="p">.</span><span class="n">elements</span><span class="p">[</span><span class="n">k</span> <span class="o">*</span> <span class="n">B</span><span class="p">.</span><span class="n">width</span> <span class="o">+</span> <span class="n">col</span><span class="p">];</span>
        <span class="n">C</span><span class="p">.</span><span class="n">elements</span><span class="p">[</span><span class="n">row</span> <span class="o">*</span> <span class="n">C</span><span class="p">.</span><span class="n">width</span> <span class="o">+</span> <span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="n">cValue</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>We create 16x16 thread blocks (256 threads, 16 in each of the x and y directions) and define <code class="language-plaintext highlighter-rouge">(B.width/BLOCK_SIZE, A.height/BLOCK_SIZE)</code> blocks per grid, rounded up. The extra arithmetic below handles the last tile when the matrix dimensions aren’t perfectly divisible by the block size.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="cp">#define BLOCK_SIZE 16
</span><span class="n">dim3</span> <span class="nf">threadsPerBlock</span><span class="p">(</span><span class="n">BLOCK_SIZE</span><span class="p">,</span> <span class="n">BLOCK_SIZE</span><span class="p">);</span>
<span class="n">dim3</span> <span class="nf">blocksPerGrid</span><span class="p">((</span><span class="n">B</span><span class="p">.</span><span class="n">width</span> <span class="o">+</span> <span class="n">threadsPerBlock</span><span class="p">.</span><span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">threadsPerBlock</span><span class="p">.</span><span class="n">x</span><span class="p">,</span>
                <span class="p">(</span><span class="n">A</span><span class="p">.</span><span class="n">height</span> <span class="o">+</span> <span class="n">threadsPerBlock</span><span class="p">.</span><span class="n">y</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">threadsPerBlock</span><span class="p">.</span><span class="n">y</span><span class="p">);</span>
<span class="n">runKernel</span><span class="p">(</span><span class="n">matMulNaiveKernel</span><span class="p">,</span> <span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">C</span><span class="p">,</span> <span class="n">blocksPerGrid</span><span class="p">,</span> <span class="n">threadsPerBlock</span><span class="p">);</span></code></pre></figure>

<p>This kernel is launched with the device (GPU) matrices <code class="language-plaintext highlighter-rouge">A</code>, <code class="language-plaintext highlighter-rouge">B</code>, and <code class="language-plaintext highlighter-rouge">C</code> as follows:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">kernel</span><span class="o">&lt;&lt;&lt;</span><span class="n">blocksPerGrid</span><span class="p">,</span> <span class="n">threadsPerBlock</span><span class="o">&gt;&gt;&gt;</span><span class="p">(</span><span class="n">d_A</span><span class="p">,</span> <span class="n">d_B</span><span class="p">,</span> <span class="n">d_C</span><span class="p">);</span></code></pre></figure>

<p>This setup ensures that the CUDA kernel efficiently processes the entire matrix by dividing the workload among the available threads and blocks.</p>

<p>To execute a CUDA program:</p>

<ol>
  <li>Copy the input data from host (CPU) memory to device (GPU) memory. This is called a host-to-device (H2D) transfer.</li>
  <li>Run the CUDA kernel on the data.</li>
  <li>Copy the results from device memory back to host memory, also called a device-to-host (D2H) transfer.</li>
</ol>

<p>We pass our kernel to the <code class="language-plaintext highlighter-rouge">runKernel()</code> function along with the CPU matrices A, B, and C. It copies the inputs from the CPU to the GPU, runs the kernel, and copies the result from the GPU back into matrix C on the CPU.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="kt">void</span> <span class="nf">runKernel</span><span class="p">(</span><span class="kt">void</span><span class="p">(</span><span class="o">*</span><span class="n">kernel</span><span class="p">)(</span><span class="n">Matrix</span><span class="p">,</span> <span class="n">Matrix</span><span class="p">,</span> <span class="n">Matrix</span><span class="p">),</span>
               <span class="k">const</span> <span class="n">Matrix</span> <span class="o">&amp;</span><span class="n">A</span><span class="p">,</span> <span class="k">const</span> <span class="n">Matrix</span> <span class="o">&amp;</span><span class="n">B</span><span class="p">,</span> <span class="n">Matrix</span> <span class="o">&amp;</span><span class="n">C</span><span class="p">,</span>
               <span class="n">dim3</span> <span class="n">gridDim</span><span class="p">,</span> <span class="n">dim3</span> <span class="n">blockDim</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// Load matrices to device memory</span>
    <span class="n">Matrix</span> <span class="n">d_A</span><span class="p">,</span> <span class="n">d_B</span><span class="p">,</span> <span class="n">d_C</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">size_A</span> <span class="o">=</span> <span class="n">A</span><span class="p">.</span><span class="n">width</span> <span class="o">*</span> <span class="n">A</span><span class="p">.</span><span class="n">height</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">);</span>
    <span class="kt">size_t</span> <span class="n">size_B</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">width</span> <span class="o">*</span> <span class="n">B</span><span class="p">.</span><span class="n">height</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">);</span>
    <span class="kt">size_t</span> <span class="n">size_C</span> <span class="o">=</span> <span class="n">C</span><span class="p">.</span><span class="n">width</span> <span class="o">*</span> <span class="n">C</span><span class="p">.</span><span class="n">height</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">);</span>
    <span class="n">d_A</span><span class="p">.</span><span class="n">width</span> <span class="o">=</span> <span class="n">A</span><span class="p">.</span><span class="n">width</span><span class="p">;</span> <span class="n">d_A</span><span class="p">.</span><span class="n">height</span> <span class="o">=</span> <span class="n">A</span><span class="p">.</span><span class="n">height</span><span class="p">;</span>
    <span class="n">d_B</span><span class="p">.</span><span class="n">width</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">width</span><span class="p">;</span> <span class="n">d_B</span><span class="p">.</span><span class="n">height</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">height</span><span class="p">;</span>
    <span class="n">d_C</span><span class="p">.</span><span class="n">width</span> <span class="o">=</span> <span class="n">C</span><span class="p">.</span><span class="n">width</span><span class="p">;</span> <span class="n">d_C</span><span class="p">.</span><span class="n">height</span> <span class="o">=</span> <span class="n">C</span><span class="p">.</span><span class="n">height</span><span class="p">;</span>

    <span class="c1">// Allocate device memory</span>
    <span class="n">CUDA_CHECK_ERROR</span><span class="p">(</span><span class="n">cudaMalloc</span><span class="p">(</span><span class="o">&amp;</span><span class="n">d_A</span><span class="p">.</span><span class="n">elements</span><span class="p">,</span> <span class="n">size_A</span><span class="p">));</span>
    <span class="n">CUDA_CHECK_ERROR</span><span class="p">(</span><span class="n">cudaMalloc</span><span class="p">(</span><span class="o">&amp;</span><span class="n">d_B</span><span class="p">.</span><span class="n">elements</span><span class="p">,</span> <span class="n">size_B</span><span class="p">));</span>
    <span class="n">CUDA_CHECK_ERROR</span><span class="p">(</span><span class="n">cudaMalloc</span><span class="p">(</span><span class="o">&amp;</span><span class="n">d_C</span><span class="p">.</span><span class="n">elements</span><span class="p">,</span> <span class="n">size_C</span><span class="p">));</span>

    <span class="c1">// Copy A, B to device memory</span>
    <span class="n">CUDA_CHECK_ERROR</span><span class="p">(</span><span class="n">cudaMemcpy</span><span class="p">(</span><span class="n">d_A</span><span class="p">.</span><span class="n">elements</span><span class="p">,</span> <span class="n">A</span><span class="p">.</span><span class="n">elements</span><span class="p">,</span> <span class="n">size_A</span><span class="p">,</span> <span class="n">cudaMemcpyHostToDevice</span><span class="p">));</span>
    <span class="n">CUDA_CHECK_ERROR</span><span class="p">(</span><span class="n">cudaMemcpy</span><span class="p">(</span><span class="n">d_B</span><span class="p">.</span><span class="n">elements</span><span class="p">,</span> <span class="n">B</span><span class="p">.</span><span class="n">elements</span><span class="p">,</span> <span class="n">size_B</span><span class="p">,</span> <span class="n">cudaMemcpyHostToDevice</span><span class="p">));</span>

    <span class="k">auto</span> <span class="n">start</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">high_resolution_clock</span><span class="o">::</span><span class="n">now</span><span class="p">();</span>

    <span class="c1">// Launch kernel</span>
    <span class="n">kernel</span><span class="o">&lt;&lt;&lt;</span><span class="n">gridDim</span><span class="p">,</span> <span class="n">blockDim</span><span class="o">&gt;&gt;&gt;</span><span class="p">(</span><span class="n">d_A</span><span class="p">,</span> <span class="n">d_B</span><span class="p">,</span> <span class="n">d_C</span><span class="p">);</span>

    <span class="c1">// Synchronize device memory</span>
    <span class="n">CUDA_CHECK_ERROR</span><span class="p">(</span><span class="n">cudaDeviceSynchronize</span><span class="p">());</span>

    <span class="k">auto</span> <span class="n">end</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">high_resolution_clock</span><span class="o">::</span><span class="n">now</span><span class="p">();</span>
    <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">duration</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;</span> <span class="n">duration</span> <span class="o">=</span> <span class="n">end</span> <span class="o">-</span> <span class="n">start</span><span class="p">;</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Kernel execution time: "</span> <span class="o">&lt;&lt;</span> <span class="n">duration</span><span class="p">.</span><span class="n">count</span><span class="p">()</span> <span class="o">*</span> <span class="mf">1000.0</span><span class="n">f</span> <span class="o">&lt;&lt;</span> <span class="s">" ms"</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

    <span class="c1">// Copy C from device memory to host memory</span>
    <span class="n">CUDA_CHECK_ERROR</span><span class="p">(</span><span class="n">cudaMemcpy</span><span class="p">(</span><span class="n">C</span><span class="p">.</span><span class="n">elements</span><span class="p">,</span> <span class="n">d_C</span><span class="p">.</span><span class="n">elements</span><span class="p">,</span> <span class="n">size_C</span><span class="p">,</span> <span class="n">cudaMemcpyDeviceToHost</span><span class="p">));</span>

    <span class="c1">// Free device memory</span>
    <span class="n">CUDA_CHECK_ERROR</span><span class="p">(</span><span class="n">cudaFree</span><span class="p">(</span><span class="n">d_A</span><span class="p">.</span><span class="n">elements</span><span class="p">));</span>
    <span class="n">CUDA_CHECK_ERROR</span><span class="p">(</span><span class="n">cudaFree</span><span class="p">(</span><span class="n">d_B</span><span class="p">.</span><span class="n">elements</span><span class="p">));</span>
    <span class="n">CUDA_CHECK_ERROR</span><span class="p">(</span><span class="n">cudaFree</span><span class="p">(</span><span class="n">d_C</span><span class="p">.</span><span class="n">elements</span><span class="p">));</span>
<span class="p">}</span></code></pre></figure>

<p>We call the <code class="language-plaintext highlighter-rouge">runKernel()</code> function from the <code class="language-plaintext highlighter-rouge">main()</code> function defined above.</p>
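<p>Note that the <code class="language-plaintext highlighter-rouge">CUDA_CHECK_ERROR</code> macro used in <code class="language-plaintext highlighter-rouge">runKernel()</code> is not defined in the snippet above. A common definition looks like the following (this exact form is an assumption; any equivalent wrapper around <code class="language-plaintext highlighter-rouge">cudaError_t</code> works):</p>

```cuda
#include <cstdio>
#include <cstdlib>

// Wraps a CUDA runtime call and aborts with a readable message on failure
#define CUDA_CHECK_ERROR(call)                                        \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)
```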

<h2 id="cuda-shared-memory-kernel">CUDA Shared Memory Kernel</h2>

<p>The previous CUDA kernel reads its operands directly from global memory (DRAM), but we can optimize performance by leveraging the GPU’s faster on-chip shared memory. Shared memory has limited capacity, so we cannot load entire matrices at once. Instead, we divide the matrices into smaller sub-matrices, or tiles, that fit into shared memory.</p>

<div style="text-align: center">
<figure>
<img src="/img/cuda_matmul_sharedmem.png" style="display: block; margin: auto;  max-width: 55%;" />
<figcaption>Shared memory matmul (source: Nvidia CUDA docs)</figcaption>
</figure>
</div>

<p>Shared memory is allocated per thread block, allowing threads within the same block to communicate efficiently. Each thread block is responsible for computing one square sub-matrix \(C_{sub}\) of <code class="language-plaintext highlighter-rouge">C</code> by loading tiles of input matrices <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code> from global memory to shared memory. Each thread within the block computes a single element of \(C_{sub}\) by iterating over the corresponding elements in the shared memory tiles, accumulating the results of the products. Finally, each thread writes its computed value to the appropriate position in global memory.</p>
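<p>To build intuition before reading the CUDA kernel below, here is a CPU sketch of the same tiling scheme in Python/NumPy (purely illustrative, not the kernel itself): each <code class="language-plaintext highlighter-rouge">(i, j)</code> pair plays the role of one thread block, and each inner iteration corresponds to loading one pair of tiles into shared memory and accumulating their product.</p>

```python
import numpy as np

def tiled_matmul(A, B, tile=16):
    """CPU sketch of the shared-memory tiling scheme."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):              # each (i, j) pair ~ one thread block
        for j in range(0, N, tile):
            acc = np.zeros_like(C[i:i+tile, j:j+tile])
            for m in range(0, K, tile):      # loop over the tiles of A and B
                a = A[i:i+tile, m:m+tile]    # ~ tile copied into shared memory
                b = B[m:m+tile, j:j+tile]
                acc += a @ b                 # ~ per-thread accumulation
            C[i:i+tile, j:j+tile] = acc
    return C

# 100x100 matrices: the dimensions are not a multiple of 16, matching the
# edge-tile case the kernel handles by zero-padding (NumPy slicing simply
# yields smaller edge tiles here)
A = np.random.rand(100, 100).astype(np.float32)
B = np.random.rand(100, 100).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```

<p>The zero-padding branch in the CUDA kernel exists precisely because, unlike NumPy slicing, every thread in a 16x16 block loads a fixed-size tile element even when the tile hangs past the matrix boundary.</p>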

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="cp">#define TILE_SIZE 16
</span>
<span class="c1">// Kernel for matrix multiplication using tiling and shared memory</span>
<span class="n">__global__</span> <span class="kt">void</span> <span class="nf">matMulSharedMemoryKernel</span><span class="p">(</span><span class="n">Matrix</span> <span class="n">A</span><span class="p">,</span> <span class="n">Matrix</span> <span class="n">B</span><span class="p">,</span> <span class="n">Matrix</span> <span class="n">C</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// Shared memory for tiles of A and B</span>
    <span class="n">__shared__</span> <span class="kt">float</span> <span class="n">shared_A</span><span class="p">[</span><span class="n">TILE_SIZE</span><span class="p">][</span><span class="n">TILE_SIZE</span><span class="p">];</span>
    <span class="n">__shared__</span> <span class="kt">float</span> <span class="n">shared_B</span><span class="p">[</span><span class="n">TILE_SIZE</span><span class="p">][</span><span class="n">TILE_SIZE</span><span class="p">];</span>

    <span class="c1">// Calculate the global row and column index of the element</span>
    <span class="kt">int</span> <span class="n">globalRow</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">y</span> <span class="o">*</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">y</span> <span class="o">+</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">globalCol</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span> <span class="o">*</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">x</span> <span class="o">+</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>

    <span class="kt">float</span> <span class="n">Cvalue</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>

    <span class="c1">// Thread row and column within Csub</span>
    <span class="kt">int</span> <span class="n">row</span> <span class="o">=</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">col</span> <span class="o">=</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>

    <span class="c1">// Loop over the tiles of the input matrices</span>
    <span class="c1">// A.width/TILE_SIZE and B.height/TILE_SIZE; take care of the last tile</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">m</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">m</span> <span class="o">&lt;</span> <span class="p">(</span><span class="n">A</span><span class="p">.</span><span class="n">width</span> <span class="o">+</span> <span class="n">TILE_SIZE</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">TILE_SIZE</span><span class="p">;</span> <span class="o">++</span><span class="n">m</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="c1">// Load elements of A into shared memory</span>
        <span class="c1">// if shared memory defined using 1d array, we'd have used shared_A[row * TILE_SIZE + col]</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">row</span> <span class="o">&lt;</span> <span class="n">A</span><span class="p">.</span><span class="n">height</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">m</span> <span class="o">*</span> <span class="n">TILE_SIZE</span> <span class="o">+</span> <span class="n">col</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">A</span><span class="p">.</span><span class="n">width</span><span class="p">)</span> 
        <span class="p">{</span>
            <span class="n">shared_A</span><span class="p">[</span><span class="n">row</span><span class="p">][</span><span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="n">A</span><span class="p">.</span><span class="n">elements</span><span class="p">[</span><span class="n">globalRow</span> <span class="o">*</span> <span class="n">A</span><span class="p">.</span><span class="n">width</span> <span class="o">+</span> <span class="n">m</span> <span class="o">*</span> <span class="n">TILE_SIZE</span> <span class="o">+</span> <span class="n">col</span><span class="p">];</span>
        <span class="p">}</span> <span class="k">else</span> 
        <span class="p">{</span>
            <span class="c1">// When matrix dimensions are not exact multiples of the tile size,</span>
            <span class="c1">// some threads in the last blocks might access elements outside</span>
            <span class="c1">// the matrix boundaries. By setting out-of-bounds elements to zero,</span>
            <span class="c1">// we ensure that these threads do not contribute invalid values to final result.</span>
            <span class="c1">// e.g. Matrix A = [100x100] and TILE_SIZE = 16</span>
            <span class="n">shared_A</span><span class="p">[</span><span class="n">row</span><span class="p">][</span><span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="c1">// Load elements of B into shared memory</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">col</span> <span class="o">&lt;</span> <span class="n">B</span><span class="p">.</span><span class="n">width</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">m</span> <span class="o">*</span> <span class="n">TILE_SIZE</span> <span class="o">+</span> <span class="n">row</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">B</span><span class="p">.</span><span class="n">height</span><span class="p">)</span> 
        <span class="p">{</span>
            <span class="n">shared_B</span><span class="p">[</span><span class="n">row</span><span class="p">][</span><span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">elements</span><span class="p">[(</span><span class="n">m</span> <span class="o">*</span> <span class="n">TILE_SIZE</span> <span class="o">+</span> <span class="n">row</span><span class="p">)</span> <span class="o">*</span> <span class="n">B</span><span class="p">.</span><span class="n">width</span> <span class="o">+</span> <span class="n">globalCol</span><span class="p">];</span>
        <span class="p">}</span> <span class="k">else</span> 
        <span class="p">{</span>
            <span class="n">shared_B</span><span class="p">[</span><span class="n">row</span><span class="p">][</span><span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="c1">// Synchronize to ensure all threads have loaded their elements</span>
        <span class="n">__syncthreads</span><span class="p">();</span>

        <span class="c1">// Compute the partial result</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">k</span> <span class="o">&lt;</span> <span class="n">TILE_SIZE</span><span class="p">;</span> <span class="o">++</span><span class="n">k</span><span class="p">)</span>
            <span class="n">Cvalue</span> <span class="o">+=</span> <span class="n">shared_A</span><span class="p">[</span><span class="n">row</span><span class="p">][</span><span class="n">k</span><span class="p">]</span> <span class="o">*</span> <span class="n">shared_B</span><span class="p">[</span><span class="n">k</span><span class="p">][</span><span class="n">col</span><span class="p">];</span>

        <span class="c1">// Synchronize to ensure all threads have completed the computation</span>
        <span class="n">__syncthreads</span><span class="p">();</span>
    <span class="p">}</span>

    <span class="c1">// Write the result to global memory</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">globalRow</span> <span class="o">&lt;</span> <span class="n">C</span><span class="p">.</span><span class="n">height</span> <span class="o">&amp;&amp;</span> <span class="n">globalCol</span> <span class="o">&lt;</span> <span class="n">C</span><span class="p">.</span><span class="n">width</span><span class="p">)</span>
        <span class="n">C</span><span class="p">.</span><span class="n">elements</span><span class="p">[</span><span class="n">globalRow</span> <span class="o">*</span> <span class="n">C</span><span class="p">.</span><span class="n">width</span> <span class="o">+</span> <span class="n">globalCol</span><span class="p">]</span> <span class="o">=</span> <span class="n">Cvalue</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>

<p>We can call our kernel as follows:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">dim3</span> <span class="nf">blockDim</span><span class="p">(</span><span class="n">TILE_SIZE</span><span class="p">,</span> <span class="n">TILE_SIZE</span><span class="p">);</span>
<span class="n">dim3</span> <span class="nf">gridDim</span><span class="p">((</span><span class="n">C</span><span class="p">.</span><span class="n">width</span> <span class="o">+</span> <span class="n">TILE_SIZE</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">TILE_SIZE</span><span class="p">,</span> <span class="p">(</span><span class="n">C</span><span class="p">.</span><span class="n">height</span> <span class="o">+</span> <span class="n">TILE_SIZE</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">TILE_SIZE</span><span class="p">);</span>
<span class="n">runKernel</span><span class="p">(</span><span class="n">matMulSharedMemoryKernel</span><span class="p">,</span> <span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">C</span><span class="p">,</span> <span class="n">gridDim</span><span class="p">,</span> <span class="n">blockDim</span><span class="p">);</span></code></pre></figure>

<h2 id="cuda-matrix-multiplication-comparison">CUDA Matrix Multiplication Comparison</h2>

<p>The execution times of the above kernels, measured on a Tesla T4 GPU on Google Colab, are as follows.</p>

<table class="mbtablestyle">
  <thead>
    <tr>
      <th>Method</th>
      <th style="text-align: center">Execution Time (ms)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>C++ CPU matrix multiplication</td>
      <td style="text-align: center">8554.51</td>
    </tr>
    <tr>
      <td>Naive CUDA kernel</td>
      <td style="text-align: center">7.08397</td>
    </tr>
    <tr>
      <td>Shared memory CUDA kernel</td>
      <td style="text-align: center">4.42471</td>
    </tr>
  </tbody>
</table>

<p>CUDA parallelism improves on the CPU computation time by roughly three orders of magnitude, and the shared memory kernel achieves the fastest execution time.</p>
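<p>For concreteness, the speedups implied by the table can be computed directly from the timings:</p>

```python
# Timings (ms) from the table above
cpu, naive, shared = 8554.51, 7.08397, 4.42471

print(f"Naive kernel speedup over CPU:   {cpu / naive:.0f}x")
print(f"Shared-memory speedup over CPU:  {cpu / shared:.0f}x")
print(f"Shared-memory over naive kernel: {naive / shared:.2f}x")
```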

<p>The full code is available at <a href="https://github.com/kHarshit/cuda-programming">https://github.com/kHarshit/cuda-programming</a></p>

<h2 id="further-optimization">Further Optimization</h2>

<p>There are other ways to optimize the CUDA matrix multiplication kernel further, such as:</p>

<ol>
  <li><strong>Using Register Blocking:</strong> This technique involves utilizing the register file to hold smaller sub-blocks of the matrices, reducing the number of accesses to shared memory.</li>
  <li><strong>Loop Unrolling:</strong> By unrolling loops, you can decrease the overhead of loop control instructions and increase the efficiency of the computation.</li>
  <li><strong>Occupancy Optimization:</strong> Tuning the number of threads per block and the size of the blocks to achieve the highest possible occupancy on the GPU.</li>
  <li><strong>Prefetching:</strong> Loading data into shared memory or registers ahead of time to hide memory latency.</li>
  <li><strong>Asynchronous Memory Operations:</strong> Using CUDA streams and <code class="language-plaintext highlighter-rouge">cudaMemcpyAsync</code> to overlap computation and data transfer, further reducing idle times.</li>
  <li><strong>Low Precision:</strong> Using half-precision (FP16) or mixed-precision (FP16/FP32) arithmetic can improve performance on supported GPUs.</li>
</ol>
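<p>As a small illustration of the last point, here is a CPU-only NumPy sketch of the mixed-precision pattern: store operands in FP16 (which is what halves memory traffic), then upcast and accumulate in FP32 to keep rounding error small, the same pattern GPU tensor cores use. This is purely illustrative; it does not run on the GPU.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((256, 256)).astype(np.float32)
B = rng.random((256, 256)).astype(np.float32)

ref = A @ B  # full FP32 reference result

# Round the operands to half precision (halves memory traffic on a GPU)...
A16 = A.astype(np.float16)
B16 = B.astype(np.float16)

# ...then upcast and accumulate the products in FP32, as tensor cores do
mixed = A16.astype(np.float32) @ B16.astype(np.float32)

# The result stays close to the full-precision one despite FP16 storage
print("max abs error:", np.max(np.abs(mixed - ref)))
```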

<p>By combining these advanced optimization techniques with shared memory, you can achieve even greater performance gains for matrix multiplication on CUDA-enabled GPUs.</p>

<section>
	<link rel="stylesheet" href="/css/quiz.css" />
<div id="quiz">
  <h1 id="quiz-name"></h1>
  <div style="display: flex; align-items: center; justify-content: center">
    <button id="prev-question-button">Previous Question</button>
    <button id="next-question-button">Next Question</button>
    <button id="submit-button">Submit Answers</button>
  
  <div id="quiz-results" style="display: flex; align-items: center; justify-content: center">
    <p id="quiz-results-message"></p>
    <p id="quiz-results-score"></p>
    <!-- <button id="quiz-retry-button">Retry</button> -->
  </div>
  </div>

  <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
  <script>
	// Array of all the questions and choices to populate the questions. This might be saved in some JSON file or a database and we would have to read the data in.
	var all_questions = [{
	  question_string: "In the context of CUDA programming, what is a kernel?",
	  choices: {
	    correct: "A function executed on the GPU",
	    wrong: ["A small piece of hardware in the GPU", "A type of memory in the GPU", "A special type of thread"]
	  }
	}, {
	  question_string: "In our CUDA matrix multiplication, what does each thread compute?",
	  choices: {
	    correct: "A single element of the resulting matrix C",
	    wrong: ["A row of the resulting matrix C", "A column of the resulting matrix C", "The entire resulting matrix C"]
	  }
	}, {
	  question_string: "How are threads organized in CUDA?",
	  choices: {
	    correct: "Threads are organized into blocks, which are further organized into grids.",
	    wrong: ["Threads are organized directly into grids.", "Threads are organized into matrices.", "Threads are organized into lists."]
	  }
	}, {
	  question_string: "What does the cudaDeviceSynchronize() function do?",
	  choices: {
	    correct: "It blocks the CPU until all previous CUDA calls are complete.",
	    wrong: ["It allocates shared memory.", "It synchronizes threads within a block.", "It frees device memory."]
	  }
	}];
  </script>
  <script src="/css/quiz.js"></script>

</div>
	 
</section>

<p><strong>References</strong></p>
<ul>
  <li><a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capability">Nvidia CUDA Docs (also image source)</a></li>
  <li><a href="https://siboehm.com/articles/22/CUDA-MMM">Really good blog post on CUDA matrix multiplication</a></li>
</ul>]]></content><author><name></name></author><category term="CUDA" /><category term="Deep Learning" /><category term="Optimization" /><category term="Generative AI" /><category term="LLM" /><summary type="html"><![CDATA[Matrix multiplication is at the heart of deep learning. In this evolving world of LLMs, the need for fast and efficient matrix multiplications is paramount. Nvidia CUDA allows you to perform matrix operations on GPU in a faster way.]]></summary></entry><entry><title type="html">Retrieval Augmented Generation (RAG) Chatbot for 10Q Financial Reports</title><link href="https://kharshit.github.io/blog/2024/04/26/rag-financial-reports-llm" rel="alternate" type="text/html" title="Retrieval Augmented Generation (RAG) Chatbot for 10Q Financial Reports" /><published>2024-04-26T00:00:00+00:00</published><updated>2024-04-26T00:00:00+00:00</updated><id>https://kharshit.github.io/blog/2024/04/26/rag-financial-reports-llm</id><content type="html" xml:base="https://kharshit.github.io/blog/2024/04/26/rag-financial-reports-llm"><![CDATA[<p>While Large Language Models (LLMs) are revolutionary, they sometimes get it wrong—like citing varying figures for something as critical as Tesla’s total assets on a given date. In the accompanying figure, you can see ChatGPT4 giving different results when asked the same question multiple times. This problem is called LLM hallucinations. And that’s where Retrieval Augmented Generation (RAG) comes in. In this blog post, I’ll describe how to create a Chatbot for 10Q Financial Reports that leverages RAG.</p>

<div style="text-align: center">
<figure>
<img src="/img/llm_hallucination.png" style="display: block; margin: auto;  max-width: 100%;" />
<figcaption>LLM hallucination</figcaption>
</figure>
</div>

<h2 id="what-is-retrival-augmented-generation-rag">What is Retrival Augmented Generation (RAG)?</h2>

<p>It’s a framework that combines the strengths of information retrieval and generative language modeling to enhance the capabilities of machine learning systems, particularly in tasks that involve natural language understanding and generation. It involves two main components.</p>

<ol>
  <li><strong>Retrieval Component:</strong> responsible for accessing an external knowledge source, such as a database or a document collection, to retrieve relevant information based on the input query.</li>
  <li><strong>Generation Component:</strong> leverages LLMs to generate response based on the context provided by the retrieval component.</li>
</ol>
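<p>As a toy illustration of how these two components fit together (no real LLM or vector store here; the documents and the stand-in "generator" are invented for this sketch), retrieval can be as simple as scoring documents by word overlap with the query:</p>

```python
# Stand-in corpus; in the real system these are chunks of 10-Q filings
docs = [
    "The 10-Q filing reports Tesla's total assets for the quarter.",
    "The data center segment drove NVIDIA's revenue growth.",
]

def retrieve(query, corpus):
    """Retrieval component: return the document sharing the most words with the query."""
    q = set(query.lower().split())
    return max(corpus, key=lambda d: len(q & set(d.lower().split())))

def generate(query, context):
    """Stand-in for the generation component: the real system feeds
    query + retrieved context to an LLM through a prompt template."""
    return f"Q: {query}\nContext used: {context}"

context = retrieve("What are Tesla's total assets?", docs)
print(generate("What are Tesla's total assets?", context))
```

<p>The sections below replace each stand-in: embeddings and Chroma replace the word-overlap scoring, and an actual LLM replaces the templated answer.</p>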

<h2 id="building-rag-chatbot">Building RAG Chatbot</h2>

<h3 id="dataset">Dataset</h3>

<p>The dataset primarily consists of financial documents, specifically 10-Q and 10-K filings from major publicly traded companies, such as Tesla, NVIDIA, and Apple. These documents are obtained from the U.S. Securities and Exchange Commission’s (SEC) <a href="https://www.sec.gov/edgar/searchedgar/companysearch">EDGAR database</a>, which is a reliable source for such financial reports. Each 10-Q and 10-K filing within the dataset contains a comprehensive overview of a company’s financial performance.</p>

<div style="text-align: center">
<figure>
<img src="/img/tsla_10q.png" style="display: block; margin: auto;  max-width: 80%;" />
<figcaption>Tesla 10Q</figcaption>
</figure>
</div>

<h3 id="steps">Steps</h3>

<p>We need to follow these steps to build a RAG Chatbot.</p>

<ul>
  <li><strong>Problem statement:</strong> Given a PDF document and a query, retrieve the relevant details and information from the document as per the query, and synthesize this information to generate accurate answers.</li>
  <li><strong>Data Ingestion and Processing:</strong> Reading PDFs of financial reports and splitting them into chunks to handle long documents efficiently.</li>
  <li><strong>Retrieval-Augmented Generation (RAG):</strong> Combination of document retrieval with the generative capabilities of the chosen language models.</li>
  <li><strong>Large Language Models:</strong> Evaluation of various models, including GPT-3.5-turbo, LLama 2, Gemma 1.1, etc.</li>
  <li><strong>Conversation Chain and Prompt Design:</strong> Crafting of a prompt template designed for concise two-sentence financial summaries.</li>
  <li><strong>User interface:</strong> Designing Chatbot like user interface.</li>
</ul>

<p>First, we load the 10-Q PDF using PyPDFLoader.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">langchain.document_loaders</span> <span class="kn">import</span> <span class="n">PyPDFLoader</span>
<span class="c1"># create a loader
</span><span class="n">loader</span> <span class="o">=</span> <span class="n">PyPDFLoader</span><span class="p">(</span><span class="sa">r</span><span class="s">"data/tsla-20230930.pdf"</span><span class="p">)</span></code></pre></figure>

<p>We then split the data into chunks using a recursive character text splitter to handle large documents.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">langchain.text_splitter</span> <span class="kn">import</span> <span class="n">RecursiveCharacterTextSplitter</span>

<span class="n">text_splitter</span> <span class="o">=</span> <span class="n">RecursiveCharacterTextSplitter</span><span class="p">(</span><span class="n">chunk_size</span><span class="o">=</span><span class="mi">500</span><span class="p">,</span> <span class="n">chunk_overlap</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
<span class="n">all_splits</span> <span class="o">=</span> <span class="n">text_splitter</span><span class="p">.</span><span class="n">split_documents</span><span class="p">(</span><span class="n">data</span><span class="p">)</span></code></pre></figure>

<p>We now create the embeddings using a Sentence Transformers model via HuggingFace embeddings, and store the resulting vectors in the open-source Chroma vector database.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">langchain.embeddings</span> <span class="kn">import</span> <span class="n">HuggingFaceEmbeddings</span>
<span class="kn">from</span> <span class="nn">langchain.vectorstores</span> <span class="kn">import</span> <span class="n">Chroma</span>

<span class="n">model_name</span> <span class="o">=</span> <span class="s">"sentence-transformers/all-mpnet-base-v2"</span>
<span class="n">embeddings</span> <span class="o">=</span> <span class="n">HuggingFaceEmbeddings</span><span class="p">(</span><span class="n">model_name</span><span class="o">=</span><span class="n">model_name</span><span class="p">,</span> <span class="n">model_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"device"</span><span class="p">:</span> <span class="s">"cuda"</span><span class="p">})</span>

<span class="n">vectordb</span> <span class="o">=</span> <span class="n">Chroma</span><span class="p">.</span><span class="n">from_documents</span><span class="p">(</span><span class="n">documents</span><span class="o">=</span><span class="n">all_splits</span><span class="p">,</span> <span class="n">embedding</span><span class="o">=</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">persist_directory</span><span class="o">=</span><span class="s">"chroma_db"</span><span class="p">)</span></code></pre></figure>

<p>We use HuggingFace to load the Llama 2 model and create a HuggingFace pipeline. Since we’re going to use LangChain, we wrap the pipeline in LangChain’s <code class="language-plaintext highlighter-rouge">HuggingFacePipeline</code> to create a LangChain llm object, which we use for further processing.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">transformers</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">LlamaForCausalLM</span><span class="p">,</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">AutoConfig</span>
<span class="kn">from</span> <span class="nn">langchain.llms</span> <span class="kn">import</span> <span class="n">HuggingFacePipeline</span>

<span class="n">model_config</span> <span class="o">=</span> <span class="n">AutoConfig</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"meta-llama/Llama-2-7b-chat-hf"</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">LlamaForCausalLM</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"meta-llama/Llama-2-7b-chat-hf"</span><span class="p">,</span>
                                            <span class="n">trust_remote_code</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span> <span class="n">config</span> <span class="o">=</span> <span class="n">model_config</span><span class="p">,</span> <span class="n">device_map</span> <span class="o">=</span> <span class="s">'auto'</span><span class="p">)</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"meta-llama/Llama-2-7b-chat-hf"</span><span class="p">)</span>

<span class="c1"># Creating Pipeline
</span><span class="n">query_pipeline</span> <span class="o">=</span> <span class="n">transformers</span><span class="p">.</span><span class="n">pipeline</span><span class="p">(</span>
        <span class="s">"text-generation"</span><span class="p">,</span>
        <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
        <span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">,</span>
        <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float16</span><span class="p">,</span>
        <span class="n">device_map</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,)</span>
<span class="n">llm</span> <span class="o">=</span> <span class="n">HuggingFacePipeline</span><span class="p">(</span><span class="n">pipeline</span><span class="o">=</span><span class="n">query_pipeline</span><span class="p">)</span></code></pre></figure>

<p>If we want to use GPT models from OpenAI, we can directly use the <code class="language-plaintext highlighter-rouge">openai</code> API.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">openai</span>
<span class="kn">from</span> <span class="nn">langchain.chat_models</span> <span class="kn">import</span> <span class="n">ChatOpenAI</span>

<span class="c1"># Set your OpenAI API key
</span><span class="n">openai</span><span class="p">.</span><span class="n">api_key</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"OPENAI_API_KEY"</span><span class="p">)</span>

<span class="c1"># Define LLM
</span><span class="n">llm</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">model_name</span><span class="o">=</span><span class="s">"gpt-3.5-turbo"</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span></code></pre></figure>

<p>Next, we create a LangChain chain for our RAG system. We also pass a task-specific prompt to guide the LLM in question answering over financial reports.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">langchain.prompts</span> <span class="kn">import</span> <span class="n">ChatPromptTemplate</span>
<span class="kn">from</span> <span class="nn">langchain.schema.runnable</span> <span class="kn">import</span> <span class="n">RunnablePassthrough</span>
<span class="kn">from</span> <span class="nn">langchain.schema.output_parser</span> <span class="kn">import</span> <span class="n">StrOutputParser</span>

<span class="c1"># Define prompt template
</span><span class="n">template</span> <span class="o">=</span> <span class="s">"""You are an assistant for question-answering tasks for Retrieval Augmented Generation system for the financial reports such as 10Q and 10K.
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 
Use two sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:
"""</span>

<span class="n">prompt</span> <span class="o">=</span> <span class="n">ChatPromptTemplate</span><span class="p">.</span><span class="n">from_template</span><span class="p">(</span><span class="n">template</span><span class="p">)</span>
<span class="n">retriever</span> <span class="o">=</span> <span class="n">vectordb</span><span class="p">.</span><span class="n">as_retriever</span><span class="p">()</span>

<span class="c1"># Setup RAG pipeline
</span><span class="n">conversation_chain</span> <span class="o">=</span> <span class="p">(</span>
    <span class="p">{</span><span class="s">"context"</span><span class="p">:</span> <span class="n">retriever</span><span class="p">,</span>  <span class="s">"question"</span><span class="p">:</span> <span class="n">RunnablePassthrough</span><span class="p">()}</span> 
    <span class="o">|</span> <span class="n">prompt</span> 
    <span class="o">|</span> <span class="n">llm</span>
    <span class="o">|</span> <span class="n">StrOutputParser</span><span class="p">()</span> 
<span class="p">)</span></code></pre></figure>

<p>Finally, we invoke our conversation chain on user input.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">user_input</span> <span class="o">=</span> <span class="s">"What's the total assets of Tesla?"</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">conversation_chain</span><span class="p">.</span><span class="n">invoke</span><span class="p">(</span><span class="n">user_input</span><span class="p">)</span></code></pre></figure>

<p>We can integrate our code with a frontend, e.g. with Dash, to provide a chatbot-like interface.</p>

<div style="text-align: center">
<figure>
<img src="/img/rag_chatbot_10q.png" style="display: block; margin: auto;  max-width: 80%;" />
<figcaption>RAG Chatbot</figcaption>
</figure>
</div>

<p>The full code is available at <a href="https://github.com/kHarshit/Financial_Document_Summarization_through_RAG">https://github.com/kHarshit/Financial_Document_Summarization_through_RAG</a></p>]]></content><author><name></name></author><category term="LLM" /><category term="Generative AI" /><category term="Natural Language Processing" /><summary type="html"><![CDATA[While Large Language Models (LLMs) are revolutionary, they sometimes get it wrong, like citing varying figures for something as critical as Tesla’s total assets on a given date. In the accompanying figure, you can see ChatGPT-4 giving different results when asked the same question multiple times. This problem is called LLM hallucination. And that’s where Retrieval Augmented Generation (RAG) comes in. In this blog post, I’ll describe how to create a chatbot for 10-Q financial reports that leverages RAG.]]></summary></entry><entry><title type="html">PyTorch Basic Tutorial</title><link href="https://kharshit.github.io/blog/2021/12/03/pytorch-basics-tutorial" rel="alternate" type="text/html" title="PyTorch Basic Tutorial" /><published>2021-12-03T00:00:00+00:00</published><updated>2021-12-03T00:00:00+00:00</updated><id>https://kharshit.github.io/blog/2021/12/03/pytorch-basics-tutorial</id><content type="html" xml:base="https://kharshit.github.io/blog/2021/12/03/pytorch-basics-tutorial"><![CDATA[<p><strong>PyTorch libraries</strong></p>
<ul>
  <li>torchvision: for computer vision</li>
  <li>torchtext: for NLP</li>
  <li>torchaudio: for speech</li>
</ul>

<p><strong>PyTorch API (Python, C++, and CUDA)</strong></p>
<ul>
  <li>torch: core library</li>
  <li>torch.nn: for neural networks</li>
  <li>torch.nn.functional: defines functions</li>
  <li>torch.optim: for optimizers such as SGD</li>
  <li>C++
    <ul>
      <li>ATen: foundational tensor operation library</li>
      <li>torch.autograd: for automatic differentiation</li>
      <li>TorchScript: Python to C++</li>
    </ul>
  </li>
  <li>torch.onnx: for interoperability</li>
</ul>

<p><strong>Topics</strong></p>

<ul>
  <li><a href="#Immediate-Vs-Deferred-execution-modes">Immediate Vs Deferred execution modes</a></li>
  <li><a href="#Installation">Installation</a></li>
  <li><a href="#Tensors">Tensors</a></li>
  <li><a href="#Autograd">Autograd</a></li>
  <li><a href="#Data-loading-and-augmentation">Data loading and augmentation</a></li>
  <li><a href="#Designing-a-neural-network">Designing a neural network</a></li>
  <li><a href="#Transfer-Learning">Transfer Learning</a></li>
  <li><a href="#Training,-Validation,-and-Inference">Training, Validation, and Inference</a></li>
  <li><a href="#ONNX">ONNX</a></li>
  <li><a href="#Assignment">Assignment</a></li>
</ul>

<h1 id="immediate-vs-deferred-execution-modes">Immediate Vs Deferred execution modes</h1>

<p>PyTorch and TensorFlow 2 (by default) use immediate (eager) mode. It follows the “define by run” principle, i.e. the code is executed as you define it. Consider the simple example below in Python.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">a</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">b</span> <span class="o">=</span> <span class="mi">4</span>
<span class="n">c</span> <span class="o">=</span> <span class="p">(</span><span class="n">a</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">b</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">**</span> <span class="mf">0.5</span>
<span class="n">c</span>
<span class="c1"># 5.0</span></code></pre></figure>

<p>TensorFlow 1.0, on the other hand, uses deferred execution, i.e. you define a series of operations first and execute them later – most exceptions are raised when the function is called, not when it’s defined. In the example below, <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> are placeholders, and the equation isn’t executed instantly to get the value of <code class="language-plaintext highlighter-rouge">p</code> unlike in the immediate execution example above.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">p</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="p">(</span><span class="n">a</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">b</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">**</span> <span class="mf">0.5</span>
<span class="n">p</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># 2.23606797749979
</span><span class="n">p</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="c1"># 5.0</span></code></pre></figure>

<p>In a static graph (left side), the model gets compiled into a symbolic graph in which each node represents an individual operation, using placeholders for inputs and outputs. The graph is then evaluated numerically when numbers are plugged into the placeholders.</p>

<p>Dynamic graphs (right side) can change during successive forward passes. Different nodes can be invoked according to conditions on the outputs of the preceding nodes, for example, without a need for such conditions to be represented in the graph.</p>

<div style="text-align: center">
<figure>
<img src="/img/graph_static_dynamic.png" style="display: block; margin: auto;  max-width: 100%;" />
<figcaption>Source: Deep Learning with PyTorch book</figcaption>
</figure>
</div>

<h1 id="installation">Installation</h1>

<p>I recommend creating a conda environment first. Then, follow the steps on <a href="https://pytorch.org/get-started/locally/">PyTorch Getting Started</a>. By default, the PyTorch library contains CUDA code; however, if you’re running on CPU only, you can install a smaller CPU-only build.</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># create conda env</span>
conda create <span class="nt">-n</span> torchenv <span class="nv">python</span><span class="o">=</span>3.8
<span class="c"># activate env</span>
conda activate torchenv
<span class="c"># install pytorch and torchvision</span>
conda <span class="nb">install </span>pytorch torchvision <span class="nv">cudatoolkit</span><span class="o">=</span>10.1 <span class="nt">-c</span> pytorch</code></pre></figure>

<p>You can use <a href="https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py"><code class="language-plaintext highlighter-rouge">collect_env.py</code></a> script to test the installation.</p>

<p><em>Note:</em> This tutorial works fine on PyTorch 1.4, torchvision 0.5.</p>

<h1 id="tensors">Tensors</h1>

<p>You can create and train neural networks in numpy as well. However, you won’t be able to use the GPU, and you’ll have to write the backward pass of gradient descent, your layers, etc. yourself. Deep learning libraries like PyTorch solve these problems. In short,</p>

<blockquote>
  <p>PyTorch = numpy with GPU + DL stuff</p>
</blockquote>

<p>Note that in order to maintain reproducibility, you need to set both numpy and pytorch seeds.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">torch</span>

<span class="k">print</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">__version__</span><span class="p">)</span>

<span class="c1"># reproducibility:  https://pytorch.org/docs/stable/notes/randomness.html
</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">7</span><span class="p">)</span>
<span class="c1"># when using CUDA and running on the CuDNN backend
</span><span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">manual_seed_all</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">backends</span><span class="p">.</span><span class="n">cudnn</span><span class="p">.</span><span class="n">deterministic</span> <span class="o">=</span> <span class="bp">True</span>
<span class="n">torch</span><span class="p">.</span><span class="n">backends</span><span class="p">.</span><span class="n">cudnn</span><span class="p">.</span><span class="n">benchmark</span> <span class="o">=</span> <span class="bp">False</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="mf">1.4</span><span class="p">.</span><span class="mi">0</span></code></pre></figure>

<p>A tensor is a generalization of a matrix having a single datatype: a vector (1D tensor), a matrix (2D tensor), an array with three indices (3D tensor, e.g. RGB color images). In PyTorch, similar to numpy, every tensor has a data type and can reside either on CPU or on GPU. For example, a tensor holding 32-bit floating point numbers has data type <code class="language-plaintext highlighter-rouge">torch.float32</code> (<code class="language-plaintext highlighter-rouge">torch.float</code>). If the tensor is on CPU, it’ll be a <code class="language-plaintext highlighter-rouge">torch.FloatTensor</code>, and if on GPU, it’ll be a <code class="language-plaintext highlighter-rouge">torch.cuda.FloatTensor</code>. You can perform operations on these tensors similar to numpy arrays. In fact, PyTorch even uses the same naming conventions as numpy for many basic functions.</p>

<p>Read the complete list of types of tensors at <a href="https://pytorch.org/docs/stable/tensors.html">PyTorch Tensor docs</a>.</p>
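<p>As a quick sketch of these shared naming conventions and of dtype handling (the array shapes and values here are purely illustrative):</p>

```python
import numpy as np
import torch

# numpy and PyTorch share naming conventions for many basics
a_np = np.zeros((2, 3), dtype=np.float32)
a_t = torch.zeros(2, 3, dtype=torch.float32)
print(a_t.dtype)    # torch.float32
print(a_t.type())   # torch.FloatTensor (a CPU tensor)

# dtype conversions
b = a_t.long()             # 64-bit integers: torch.int64
c = a_t.to(torch.float64)  # torch.float64 (torch.double)
print(b.dtype, c.dtype)
```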

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># uninitialized tensor
</span><span class="k">print</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">empty</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="nb">bool</span><span class="p">))</span>

<span class="c1"># initialized tensor
# torch.zeros(2, 2)
# torch.ones(2, 2)
</span>
<span class="k">print</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>  <span class="c1"># from a uniform distribution
</span><span class="k">print</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>  <span class="c1"># from standard normal distribution</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">tensor</span><span class="p">([[</span><span class="bp">True</span><span class="p">,</span> <span class="bp">True</span><span class="p">],</span>
        <span class="p">[</span><span class="bp">True</span><span class="p">,</span> <span class="bp">True</span><span class="p">]])</span>
<span class="n">tensor</span><span class="p">([[</span><span class="mf">0.5349</span><span class="p">,</span> <span class="mf">0.1988</span><span class="p">],</span>
        <span class="p">[</span><span class="mf">0.6592</span><span class="p">,</span> <span class="mf">0.6569</span><span class="p">]])</span>
<span class="n">tensor</span><span class="p">([[</span> <span class="mf">0.9468</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.1143</span><span class="p">],</span>
        <span class="p">[</span> <span class="mf">1.6908</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.8948</span><span class="p">]])</span></code></pre></figure>

<p><code class="language-plaintext highlighter-rouge">torch.Tensor</code> is an alias for the default tensor type <code class="language-plaintext highlighter-rouge">torch.FloatTensor</code>.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># C, H, W
</span><span class="n">a</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">dtype</span><span class="p">,</span> <span class="n">a</span><span class="p">.</span><span class="nb">type</span><span class="p">(),</span> <span class="n">a</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1"># a.reshape()
</span><span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">56</span><span class="p">).</span><span class="n">shape</span><span class="p">)</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">torch</span><span class="p">.</span><span class="n">float32</span> <span class="n">torch</span><span class="p">.</span><span class="n">FloatTensor</span> <span class="n">torch</span><span class="p">.</span><span class="n">Size</span><span class="p">([</span><span class="mi">3</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">])</span>
<span class="n">torch</span><span class="p">.</span><span class="n">Size</span><span class="p">([</span><span class="mi">42</span><span class="p">,</span> <span class="mi">56</span><span class="p">])</span></code></pre></figure>

<p><strong>in-place operations</strong></p>

<p>In-place operations in PyTorch directly modify the tensor’s content without creating a new copy. The functions that have <code class="language-plaintext highlighter-rouge">_</code> after their names are in-place, e.g. <code class="language-plaintext highlighter-rouge">add_()</code> is in-place, while <code class="language-plaintext highlighter-rouge">add()</code> isn’t. Note that certain Python operations such as <code class="language-plaintext highlighter-rouge">a += b</code> are also in-place.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">a</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">]])</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">]])</span>
<span class="c1"># c = a + b  # normal operation
</span><span class="n">b</span><span class="p">.</span><span class="n">add_</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>  <span class="c1"># in-place operation
</span><span class="k">print</span><span class="p">(</span><span class="n">b</span><span class="p">)</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">tensor</span><span class="p">([[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span>
        <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">]])</span></code></pre></figure>

<p><strong>np array &lt;–&gt; tensor</strong></p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># tensor -&gt; np array
</span><span class="n">b</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="n">numpy</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">b</span><span class="p">))</span>
<span class="c1"># np array -&gt; tensor
</span><span class="n">b</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>  <span class="c1"># torch.from_numpy(b)
</span><span class="k">print</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">b</span><span class="p">))</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">&lt;</span><span class="k">class</span> <span class="err">'</span><span class="nc">numpy</span><span class="p">.</span><span class="n">ndarray</span><span class="s">'&gt;
&lt;class '</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="s">'&gt;</span></code></pre></figure>

<p><strong>CUDA and GPU</strong></p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># check if CUDA available
</span><span class="k">print</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">())</span>
<span class="c1"># check if tensor on GPU
</span><span class="k">print</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">is_cuda</span><span class="p">)</span>
<span class="c1"># move tensor to GPU
</span><span class="k">print</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">cuda</span><span class="p">())</span> <span class="c1"># defaults to gpu:0 # or .to('cuda')
# move tensor to CPU
</span><span class="k">print</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">cpu</span><span class="p">())</span> <span class="c1"># or .to('cpu')
# check tensor device
</span><span class="k">print</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">device</span><span class="p">)</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="bp">True</span>
<span class="bp">False</span>
<span class="n">tensor</span><span class="p">([[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span>
        <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">]],</span> <span class="n">device</span><span class="o">=</span><span class="s">'cuda:0'</span><span class="p">)</span>
<span class="n">tensor</span><span class="p">([[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span>
        <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">]])</span>
<span class="n">cpu</span></code></pre></figure>

<p>If you’ve multiple GPUs, you can specify one using <code class="language-plaintext highlighter-rouge">.to('cuda:&lt;n&gt;')</code>. Here, <code class="language-plaintext highlighter-rouge">n</code> (0, 1, 2, …) denotes the GPU number.</p>
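<p>A common pattern for writing device-agnostic code looks roughly like this (the variable name <code class="language-plaintext highlighter-rouge">device</code> is just a convention):</p>

```python
import torch

# pick GPU 0 if available, otherwise fall back to CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

t = torch.ones(2, 2).to(device)  # .to() is a no-op if already on device
print(t.device)

# the same pattern works for models: model.to(device)
```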

<h1 id="autograd">Autograd</h1>

<p>Autograd performs automatic differentiation: it calculates the gradients of the parameters (W, b) with respect to the loss L.</p>

<p>It does so by keeping track of operations performed on tensors, then going backwards through those operations, calculating gradients along the way. For this, you need to set <code class="language-plaintext highlighter-rouge">requires_grad = True</code> on a tensor.</p>

<div style="text-align: center">
<figure>
<img src="/img/autograd.png" style="display: block; margin: auto;  max-width: 100%;" />
<figcaption>Source: Deep Learning with PyTorch book</figcaption>
</figure>
</div>

<p>Consider the function <code class="language-plaintext highlighter-rouge">z</code>, the mean of the squared elements of <code class="language-plaintext highlighter-rouge">x</code>. In general its derivative is <code class="language-plaintext highlighter-rouge">2x/n</code>; for the 2x2 tensor below, n = 4, so the derivative w.r.t. <code class="language-plaintext highlighter-rouge">x</code> is <code class="language-plaintext highlighter-rouge">x/2</code>.</p>

\[\frac{\partial z}{\partial x_i} = \frac{\partial}{\partial x_i}\left[\frac{1}{n}\sum_j^n x_j^2\right] = \frac{2x_i}{n} = \frac{x_i}{2}\]

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span>
<span class="c1"># y.retain_grad()  # retain gradient
# each tensor has a .grad_fn attribute that references a Function that created it
</span><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'y.grad_fn: </span><span class="si">{</span><span class="n">y</span><span class="p">.</span><span class="n">grad_fn</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">y</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'x.grad: </span><span class="si">{</span><span class="n">x</span><span class="p">.</span><span class="n">grad</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="n">z</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'x.grad: </span><span class="si">{</span><span class="n">x</span><span class="p">.</span><span class="n">grad</span><span class="si">}</span><span class="se">\n\
</span><span class="s">x/2: </span><span class="si">{</span><span class="n">x</span><span class="o">/</span><span class="mi">2</span><span class="si">}</span><span class="se">\n\
</span><span class="s">y.grad: </span><span class="si">{</span><span class="n">y</span><span class="p">.</span><span class="n">grad</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>  <span class="c1"># dz/dy</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">y</span><span class="p">.</span><span class="n">grad_fn</span><span class="p">:</span> <span class="o">&lt;</span><span class="n">PowBackward0</span> <span class="nb">object</span> <span class="n">at</span> <span class="mh">0x7f47f618c048</span><span class="o">&gt;</span>
<span class="n">x</span><span class="p">.</span><span class="n">grad</span><span class="p">:</span> <span class="bp">None</span>
<span class="n">x</span><span class="p">.</span><span class="n">grad</span><span class="p">:</span> <span class="n">tensor</span><span class="p">([[</span><span class="o">-</span><span class="mf">0.0734</span><span class="p">,</span>  <span class="mf">0.3931</span><span class="p">],</span>
		<span class="p">[</span> <span class="mf">0.4734</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.5572</span><span class="p">]])</span>
<span class="n">x</span><span class="o">/</span><span class="mi">2</span><span class="p">:</span> <span class="n">tensor</span><span class="p">([[</span><span class="o">-</span><span class="mf">0.0734</span><span class="p">,</span>  <span class="mf">0.3931</span><span class="p">],</span>
		<span class="p">[</span> <span class="mf">0.4734</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.5572</span><span class="p">]],</span> <span class="n">grad_fn</span><span class="o">=&lt;</span><span class="n">DivBackward0</span><span class="o">&gt;</span><span class="p">)</span>
<span class="n">y</span><span class="p">.</span><span class="n">grad</span><span class="p">:</span> <span class="bp">None</span></code></pre></figure>

<p>Note that the derivative of <code class="language-plaintext highlighter-rouge">z</code> w.r.t. <code class="language-plaintext highlighter-rouge">y</code> is <code class="language-plaintext highlighter-rouge">None</code> since gradients are calculated <a href="https://stackoverflow.com/questions/48051434/computing-gradients-of-intermediate-nodes-in-pytorch/48054482#48054482">only for leaf variables</a> by default.</p>

<p>You could use <code class="language-plaintext highlighter-rouge">retain_grad()</code> to calculate the gradient of non-leaf variables. To reduce memory usage, all the intermediary results are deleted during the <code class="language-plaintext highlighter-rouge">.backward()</code> call as soon as they are no longer needed. Hence, if you try to call <code class="language-plaintext highlighter-rouge">.backward()</code> again, the intermediary results don’t exist and the backward pass cannot be performed; pass <code class="language-plaintext highlighter-rouge">retain_graph=True</code> so that the buffers are not freed.</p>
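<p>A self-contained sketch of <code class="language-plaintext highlighter-rouge">retain_grad()</code> and <code class="language-plaintext highlighter-rouge">retain_graph=True</code> (fresh tensors; the values are random and illustrative):</p>

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
y = x ** 2
y.retain_grad()                # keep the gradient of the non-leaf y
z = y.mean()

z.backward(retain_graph=True)  # keep buffers for another backward pass
print(y.grad)                  # dz/dy is 1/4 for each of the 4 elements
z.backward()                   # second call works; gradients accumulate
```

Note that gradients accumulate across the two calls: after both, <code class="language-plaintext highlighter-rouge">x.grad</code> holds twice the single-pass gradient.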

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">z</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">---------------------------------------------------------------------------</span>

<span class="nb">RuntimeError</span>                              <span class="n">Traceback</span> <span class="p">(</span><span class="n">most</span> <span class="n">recent</span> <span class="n">call</span> <span class="n">last</span><span class="p">)</span>

<span class="o">&lt;</span><span class="n">ipython</span><span class="o">-</span><span class="nb">input</span><span class="o">-</span><span class="mi">9</span><span class="o">-</span><span class="mi">40</span><span class="n">c0c9b0bbab</span><span class="o">&gt;</span> <span class="ow">in</span> <span class="o">&lt;</span><span class="n">module</span><span class="o">&gt;</span><span class="p">()</span>
<span class="o">----&gt;</span> <span class="mi">1</span> <span class="n">z</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>


<span class="o">/</span><span class="n">usr</span><span class="o">/</span><span class="n">local</span><span class="o">/</span><span class="n">lib</span><span class="o">/</span><span class="n">python3</span><span class="p">.</span><span class="mi">6</span><span class="o">/</span><span class="n">dist</span><span class="o">-</span><span class="n">packages</span><span class="o">/</span><span class="n">torch</span><span class="o">/</span><span class="n">tensor</span><span class="p">.</span><span class="n">py</span> <span class="ow">in</span> <span class="n">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">gradient</span><span class="p">,</span> <span class="n">retain_graph</span><span class="p">,</span> <span class="n">create_graph</span><span class="p">)</span>
	<span class="mi">193</span>                 <span class="n">products</span><span class="p">.</span> <span class="n">Defaults</span> <span class="n">to</span> <span class="sb">``</span><span class="bp">False</span><span class="sb">``</span><span class="p">.</span>
	<span class="mi">194</span>         <span class="s">"""
--&gt; 195         torch.autograd.backward(self, gradient, retain_graph, create_graph)
	196 
	197     def register_hook(self, hook):


/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
		97     Variable._execution_engine.run_backward(
		98         tensors, grad_tensors, retain_graph, create_graph,
---&gt; 99         allow_unreachable=True)  # allow_unreachable flag
	100 
	101 


RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.</span></code></pre></figure>

<p><em>Note:</em> Calling <code class="language-plaintext highlighter-rouge">.backward()</code> only works on scalar variables. When called on vector variables, an additional ‘gradient’ argument is required. In fact, <code class="language-plaintext highlighter-rouge">y.backward()</code> is equivalent to <code class="language-plaintext highlighter-rouge">y.backward(torch.tensor(1.))</code>. <code class="language-plaintext highlighter-rouge">torch.autograd</code> is an engine for computing vector-Jacobian product. Read <a href="https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#sphx-glr-beginner-blitz-autograd-tutorial-py">more</a>.</p>
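<p>For example, a minimal sketch of calling backward on a vector-valued output (values illustrative):</p>

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2          # y is a vector, so y.backward() alone would raise

# pass an explicit 'gradient' argument of the same shape as y
y.backward(torch.ones_like(y))
print(x.grad)       # dy/dx = 2x -> tensor([2., 4., 6.])
```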

<p>To stop a tensor from tracking history, either call <code class="language-plaintext highlighter-rouge">.detach()</code> to detach it from the computation history and prevent future computation from being tracked, or use the <code class="language-plaintext highlighter-rouge">with torch.no_grad():</code> context manager.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">requires_grad</span><span class="p">)</span>
<span class="k">print</span><span class="p">((</span><span class="n">x</span> <span class="o">**</span> <span class="mi">2</span><span class="p">).</span><span class="n">requires_grad</span><span class="p">)</span>

<span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
    <span class="k">print</span><span class="p">((</span><span class="n">x</span> <span class="o">**</span> <span class="mi">2</span><span class="p">).</span><span class="n">requires_grad</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">requires_grad</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">detach</span><span class="p">()</span>
<span class="c1"># best way to copy a tensor
# y = x.detach().clone()
</span><span class="k">print</span><span class="p">(</span><span class="n">y</span><span class="p">.</span><span class="n">requires_grad</span><span class="p">)</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="bp">True</span>
<span class="bp">True</span>
<span class="bp">False</span>
<span class="bp">True</span>
<span class="bp">False</span></code></pre></figure>

<hr />

<p>Now, we’re going to train a simple dog breed classifier.</p>

<h1 id="data-loading-and-augmentation">Data loading and augmentation</h1>

<p>The <a href="https://pytorch.org/docs/stable/data.html"><code class="language-plaintext highlighter-rouge">Dataset</code></a> class is an abstract class representing a dataset. There are two common ways to use it:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">ImageFolder</code> requires the dataset to be arranged in the format:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root/dog/xxx.png
root/dog/xxy.png
root/dog/[...]/xxz.png
root/cat/123.png
root/cat/nsdf3.png
root/cat/[...]/asd932_.png
root/classname/image.png
</code></pre></div>    </div>
  </li>
  <li>Custom Dataset: It must inherit from the <code class="language-plaintext highlighter-rouge">Dataset</code> class and override <code class="language-plaintext highlighter-rouge">__len__</code>, so that <code class="language-plaintext highlighter-rouge">len(dataset)</code> returns the size of the dataset, and <code class="language-plaintext highlighter-rouge">__getitem__</code>, so that <code class="language-plaintext highlighter-rouge">dataset[i]</code> returns the <code class="language-plaintext highlighter-rouge">i</code>th sample.</li>
</ol>
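<p>A minimal sketch of option 2 (the class and data here are toy examples, not part of the dog dataset):</p>

```python
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Toy dataset whose i-th sample is the pair (i, i**2)."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        # len(dataset) returns the size of the dataset
        return self.n

    def __getitem__(self, i):
        # dataset[i] returns the i-th sample
        return i, i ** 2

ds = SquaresDataset(5)
print(len(ds))  # 5
print(ds[3])    # (3, 9)
```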

<p>In this tutorial, we’re going to use <code class="language-plaintext highlighter-rouge">ImageFolder</code>.</p>

<p>The <code class="language-plaintext highlighter-rouge">DataLoader</code> takes a dataset (such as you would get from <code class="language-plaintext highlighter-rouge">ImageFolder</code>) and returns batches of images and the corresponding labels.</p>
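<p>For example, with random tensors standing in for real images (a sketch, not the dataset used below):</p>

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for ImageFolder: 10 fake RGB "images" with binary labels
images = torch.randn(10, 3, 224, 224)
labels = torch.randint(0, 2, (10,))
loader = DataLoader(TensorDataset(images, labels), batch_size=4, shuffle=True)

# Each iteration yields a batch of images and the corresponding labels
batch_images, batch_labels = next(iter(loader))
print(batch_images.shape)  # torch.Size([4, 3, 224, 224])
print(batch_labels.shape)  # torch.Size([4])
```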

<p>We’re also going to normalize our input data and apply data augmentation techniques. Note that we don’t apply data augmentation to the validation and test splits.</p>

<p>For normalization, the mean and standard deviation should be taken from the training dataset; however, in this case, we’re going to use <code class="language-plaintext highlighter-rouge">ImageNet</code>’s statistics (<a href="https://stackoverflow.com/a/57533806/6210807">why?</a>).</p>

\[\text{Normalized input[channel]} = \frac{\text{input[channel]} - \text{mean[channel]}}{\text{std[channel]}}\]
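<p>Plugging numbers into the formula, using the red channel’s ImageNet statistics (pixel values are already scaled to <code class="language-plaintext highlighter-rouge">[0, 1]</code> by <code class="language-plaintext highlighter-rouge">ToTensor()</code>):</p>

```python
# Red-channel ImageNet statistics, as used by transforms.Normalize below
mean, std = 0.485, 0.229

pixel = 1.0  # a fully saturated red pixel after ToTensor()
normalized = (pixel - mean) / std
print(round(normalized, 3))  # 2.249
```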

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">PIL.Image</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>

<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span>
<span class="kn">import</span> <span class="nn">torch.optim</span> <span class="k">as</span> <span class="n">optim</span>
<span class="kn">import</span> <span class="nn">torchvision</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">Dataset</span><span class="p">,</span> <span class="n">DataLoader</span>
<span class="kn">from</span> <span class="nn">torchvision</span> <span class="kn">import</span> <span class="n">datasets</span><span class="p">,</span> <span class="n">transforms</span>
<span class="kn">from</span> <span class="nn">torchvision.models</span> <span class="kn">import</span> <span class="n">resnet101</span>

<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span></code></pre></figure>

<p>Get the dog breed classification dataset from <a href="https://www.kaggle.com/c/dog-breed-identification">Kaggle</a>, <a href="http://vision.stanford.edu/aditya86/ImageNetDogs/">Stanford Dog Dataset</a>.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="err">!</span><span class="n">wget</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">s3</span><span class="o">-</span><span class="n">us</span><span class="o">-</span><span class="n">west</span><span class="o">-</span><span class="mf">1.</span><span class="n">amazonaws</span><span class="p">.</span><span class="n">com</span><span class="o">/</span><span class="n">udacity</span><span class="o">-</span><span class="n">aind</span><span class="o">/</span><span class="n">dog</span><span class="o">-</span><span class="n">project</span><span class="o">/</span><span class="n">dogImages</span><span class="p">.</span><span class="nb">zip</span>
<span class="err">!</span><span class="n">unzip</span> <span class="n">dogImages</span><span class="p">.</span><span class="nb">zip</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">data_dir</span> <span class="o">=</span> <span class="s">'dogImages'</span>
<span class="n">data_transforms</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">'train'</span><span class="p">:</span> <span class="n">transforms</span><span class="p">.</span><span class="n">Compose</span><span class="p">([</span>
        <span class="n">transforms</span><span class="p">.</span><span class="n">RandomRotation</span><span class="p">(</span><span class="mi">30</span><span class="p">),</span>
        <span class="n">transforms</span><span class="p">.</span><span class="n">RandomResizedCrop</span><span class="p">(</span><span class="mi">224</span><span class="p">),</span>
        <span class="n">transforms</span><span class="p">.</span><span class="n">RandomHorizontalFlip</span><span class="p">(),</span>
        <span class="n">transforms</span><span class="p">.</span><span class="n">ToTensor</span><span class="p">(),</span>
        <span class="n">transforms</span><span class="p">.</span><span class="n">Normalize</span><span class="p">([</span><span class="mf">0.485</span><span class="p">,</span> <span class="mf">0.456</span><span class="p">,</span> <span class="mf">0.406</span><span class="p">],</span>
                             <span class="p">[</span><span class="mf">0.229</span><span class="p">,</span> <span class="mf">0.224</span><span class="p">,</span> <span class="mf">0.225</span><span class="p">])</span>
    <span class="p">]),</span>
    <span class="s">'valid'</span><span class="p">:</span> <span class="n">transforms</span><span class="p">.</span><span class="n">Compose</span><span class="p">([</span>
        <span class="n">transforms</span><span class="p">.</span><span class="n">Resize</span><span class="p">(</span><span class="mi">256</span><span class="p">),</span>
        <span class="n">transforms</span><span class="p">.</span><span class="n">CenterCrop</span><span class="p">(</span><span class="mi">224</span><span class="p">),</span>
        <span class="n">transforms</span><span class="p">.</span><span class="n">ToTensor</span><span class="p">(),</span>
        <span class="n">transforms</span><span class="p">.</span><span class="n">Normalize</span><span class="p">([</span><span class="mf">0.485</span><span class="p">,</span> <span class="mf">0.456</span><span class="p">,</span> <span class="mf">0.406</span><span class="p">],</span>
                             <span class="p">[</span><span class="mf">0.229</span><span class="p">,</span> <span class="mf">0.224</span><span class="p">,</span> <span class="mf">0.225</span><span class="p">])</span>
    <span class="p">]),</span>
    <span class="s">'test'</span><span class="p">:</span> <span class="n">transforms</span><span class="p">.</span><span class="n">Compose</span><span class="p">([</span>
        <span class="n">transforms</span><span class="p">.</span><span class="n">Resize</span><span class="p">(</span><span class="mi">256</span><span class="p">),</span>
        <span class="n">transforms</span><span class="p">.</span><span class="n">CenterCrop</span><span class="p">(</span><span class="mi">224</span><span class="p">),</span>
        <span class="n">transforms</span><span class="p">.</span><span class="n">ToTensor</span><span class="p">(),</span>
        <span class="n">transforms</span><span class="p">.</span><span class="n">Normalize</span><span class="p">([</span><span class="mf">0.485</span><span class="p">,</span> <span class="mf">0.456</span><span class="p">,</span> <span class="mf">0.406</span><span class="p">],</span>
                             <span class="p">[</span><span class="mf">0.229</span><span class="p">,</span> <span class="mf">0.224</span><span class="p">,</span> <span class="mf">0.225</span><span class="p">])</span>
    <span class="p">]),</span>
<span class="p">}</span>

<span class="k">print</span><span class="p">(</span><span class="s">"Initializing Datasets and Dataloaders..."</span><span class="p">)</span>

<span class="n">image_datasets</span> <span class="o">=</span> <span class="p">{</span><span class="n">x</span><span class="p">:</span> <span class="n">datasets</span><span class="p">.</span><span class="n">ImageFolder</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="n">x</span><span class="p">),</span>
                                          <span class="n">data_transforms</span><span class="p">[</span><span class="n">x</span><span class="p">])</span>
                  <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'train'</span><span class="p">,</span> <span class="s">'valid'</span><span class="p">,</span> <span class="s">'test'</span><span class="p">]}</span>
<span class="c1"># image, label
</span>
<span class="n">loaders</span> <span class="o">=</span> <span class="p">{</span><span class="n">x</span><span class="p">:</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">image_datasets</span><span class="p">[</span><span class="n">x</span><span class="p">],</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
                                             <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
              <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'train'</span><span class="p">,</span> <span class="s">'valid'</span><span class="p">,</span> <span class="s">'test'</span><span class="p">]}</span>
<span class="n">dataset_sizes</span> <span class="o">=</span> <span class="p">{</span><span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">image_datasets</span><span class="p">[</span><span class="n">x</span><span class="p">])</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'train'</span><span class="p">,</span> <span class="s">'valid'</span><span class="p">,</span> <span class="s">'test'</span><span class="p">]}</span>
<span class="k">print</span><span class="p">(</span><span class="n">dataset_sizes</span><span class="p">)</span>

<span class="n">class_names</span> <span class="o">=</span> <span class="n">image_datasets</span><span class="p">[</span><span class="s">'train'</span><span class="p">].</span><span class="n">classes</span>
<span class="n">n_classes</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">class_names</span><span class="p">)</span>
<span class="n">n_classes</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">Initializing</span> <span class="n">Datasets</span> <span class="ow">and</span> <span class="n">Dataloaders</span><span class="p">...</span>
<span class="p">{</span><span class="s">'train'</span><span class="p">:</span> <span class="mi">6680</span><span class="p">,</span> <span class="s">'valid'</span><span class="p">:</span> <span class="mi">835</span><span class="p">,</span> <span class="s">'test'</span><span class="p">:</span> <span class="mi">836</span><span class="p">}</span>

<span class="mi">133</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">use_cuda</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">()</span>
<span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="s">"cuda:0"</span> <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">()</span> <span class="k">else</span> <span class="s">"cpu"</span><span class="p">)</span>
<span class="n">device</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">device</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="s">'cpu'</span><span class="p">)</span></code></pre></figure>

<h1 id="designing-a-neural-network">Designing a neural network</h1>

<p>There are two ways to implement layers and operations in PyTorch. A <code class="language-plaintext highlighter-rouge">torch.nn</code> module (a Python class) is an actual layer object that can be added to or connected with other layers and models. In contrast, <code class="language-plaintext highlighter-rouge">torch.nn.functional</code> (Python functions) provides stateless operations, not layers with learnable parameters such as weights and bias terms. The choice between <code class="language-plaintext highlighter-rouge">torch.nn</code> and <code class="language-plaintext highlighter-rouge">torch.nn.functional</code> is yours, but <code class="language-plaintext highlighter-rouge">torch.nn</code> is more convenient for layers that have learnable parameters, and it keeps the network definition clean.</p>
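<p>For a parameter-free operation such as ReLU, the two styles compute exactly the same thing; a quick sketch:</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 2.0])

relu_layer = nn.ReLU()  # a module: can be stored on a model or in nn.Sequential
# F.relu is the functional equivalent; both produce the same output
print(torch.equal(relu_layer(x), F.relu(x)))  # True
```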

<p><em>Note:</em> Always use <code class="language-plaintext highlighter-rouge">nn.Dropout()</code>, <a href="https://stackoverflow.com/questions/53419474/using-dropout-in-pytorch-nn-dropout-vs-f-dropout">not <code class="language-plaintext highlighter-rouge">F.dropout()</code></a>. Dropout is supposed to be used only in training mode, not in evaluation mode, <code class="language-plaintext highlighter-rouge">nn.Dropout()</code> takes care of that.</p>
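<p>A quick check of that behaviour (a standalone sketch): <code class="language-plaintext highlighter-rouge">nn.Dropout</code> zeroes units in training mode but becomes a no-op in evaluation mode:</p>

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()  # training mode: roughly half the units are zeroed
print(bool(drop(x).eq(0.0).any()))  # True (with overwhelming probability)

drop.eval()   # evaluation mode (set by model.eval()): dropout does nothing
print(torch.equal(drop(x), x))      # True
```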

<p>The spatial dimensions of a convolutional layer can be calculated as: <code class="language-plaintext highlighter-rouge">(W_in−F+2P)/S+1</code>, where <code class="language-plaintext highlighter-rouge">W_in</code> is input, <code class="language-plaintext highlighter-rouge">F</code> is filter size, <code class="language-plaintext highlighter-rouge">P</code> is padding, <code class="language-plaintext highlighter-rouge">S</code> is stride.</p>
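<p>The formula can be checked with a small helper (the function name is illustrative):</p>

```python
def conv_out_size(w_in, f, p=0, s=1):
    """Spatial output size of a conv/pool layer: (W_in - F + 2P) / S + 1."""
    return (w_in - f + 2 * p) // s + 1

# A 3x3 conv with padding 1, stride 1 preserves the spatial size ...
print(conv_out_size(224, f=3, p=1, s=1))  # 224
# ... while a 2x2 max-pool with stride 2 halves it, as in the network below
print(conv_out_size(224, f=2, p=0, s=2))  # 112
```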

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Net</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">(</span><span class="n">Net</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>
        <span class="c1"># input image: (3, 224, 224)  
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">conv1</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="n">in_channels</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">out_channels</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
        <span class="c1"># (16, 224, 224) --&gt; (16, 112, 112) (halved by max-pool)
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">conv2</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
        <span class="c1"># (32, 56, 56)
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">conv3</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">pool</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">MaxPool2d</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
        <span class="c1"># (64, 28, 28)
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="o">*</span><span class="mi">28</span><span class="o">*</span><span class="mi">28</span><span class="p">,</span> <span class="mi">512</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span> <span class="mi">256</span><span class="p">)</span>
        <span class="c1"># no of classes `n_classes`: 133
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="n">n_classes</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.25</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="c1">## forward pass
</span>        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">pool</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">conv1</span><span class="p">(</span><span class="n">x</span><span class="p">)))</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">pool</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">conv2</span><span class="p">(</span><span class="n">x</span><span class="p">)))</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">pool</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">conv3</span><span class="p">(</span><span class="n">x</span><span class="p">)))</span>
        <span class="c1"># flatten image input
</span>        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">64</span> <span class="o">*</span> <span class="mi">28</span> <span class="o">*</span> <span class="mi">28</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">x</span>

<span class="c1"># instantiate the CNN
</span><span class="n">model_scratch</span> <span class="o">=</span> <span class="n">Net</span><span class="p">()</span>

<span class="c1"># move tensors to GPU if CUDA is available
</span><span class="n">model_scratch</span> <span class="o">=</span> <span class="n">model_scratch</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">model_scratch</span><span class="p">)</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">Net</span><span class="p">(</span>
    <span class="p">(</span><span class="n">conv1</span><span class="p">):</span> <span class="n">Conv2d</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">stride</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">padding</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="p">(</span><span class="n">conv2</span><span class="p">):</span> <span class="n">Conv2d</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">stride</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">padding</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="p">(</span><span class="n">conv3</span><span class="p">):</span> <span class="n">Conv2d</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">stride</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">padding</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="p">(</span><span class="n">pool</span><span class="p">):</span> <span class="n">MaxPool2d</span><span class="p">(</span><span class="n">kernel_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">dilation</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">ceil_mode</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="p">(</span><span class="n">fc1</span><span class="p">):</span> <span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="mi">50176</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="p">(</span><span class="n">fc2</span><span class="p">):</span> <span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="p">(</span><span class="n">fc3</span><span class="p">):</span> <span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="mi">133</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="p">(</span><span class="n">dropout</span><span class="p">):</span> <span class="n">Dropout</span><span class="p">(</span><span class="n">p</span><span class="o">=</span><span class="mf">0.25</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="p">)</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># !pip install torchsummary
</span><span class="kn">from</span> <span class="nn">torchsummary</span> <span class="kn">import</span> <span class="n">summary</span>
<span class="n">summary</span><span class="p">(</span><span class="n">model_scratch</span><span class="p">,</span> <span class="n">input_size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">))</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">----------------------------------------------------------------</span>
        <span class="n">Layer</span> <span class="p">(</span><span class="nb">type</span><span class="p">)</span>               <span class="n">Output</span> <span class="n">Shape</span>         <span class="n">Param</span> <span class="c1">#
</span><span class="o">================================================================</span>
            <span class="n">Conv2d</span><span class="o">-</span><span class="mi">1</span>         <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">]</span>             <span class="mi">448</span>
         <span class="n">MaxPool2d</span><span class="o">-</span><span class="mi">2</span>         <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">16</span><span class="p">,</span> <span class="mi">112</span><span class="p">,</span> <span class="mi">112</span><span class="p">]</span>               <span class="mi">0</span>
            <span class="n">Conv2d</span><span class="o">-</span><span class="mi">3</span>         <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">112</span><span class="p">,</span> <span class="mi">112</span><span class="p">]</span>           <span class="mi">4</span><span class="p">,</span><span class="mi">640</span>
         <span class="n">MaxPool2d</span><span class="o">-</span><span class="mi">4</span>           <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">56</span><span class="p">,</span> <span class="mi">56</span><span class="p">]</span>               <span class="mi">0</span>
            <span class="n">Conv2d</span><span class="o">-</span><span class="mi">5</span>           <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="mi">56</span><span class="p">,</span> <span class="mi">56</span><span class="p">]</span>          <span class="mi">18</span><span class="p">,</span><span class="mi">496</span>
         <span class="n">MaxPool2d</span><span class="o">-</span><span class="mi">6</span>           <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">]</span>               <span class="mi">0</span>
           <span class="n">Dropout</span><span class="o">-</span><span class="mi">7</span>                <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">50176</span><span class="p">]</span>               <span class="mi">0</span>
            <span class="n">Linear</span><span class="o">-</span><span class="mi">8</span>                  <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">512</span><span class="p">]</span>      <span class="mi">25</span><span class="p">,</span><span class="mi">690</span><span class="p">,</span><span class="mi">624</span>
           <span class="n">Dropout</span><span class="o">-</span><span class="mi">9</span>                  <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">512</span><span class="p">]</span>               <span class="mi">0</span>
           <span class="n">Linear</span><span class="o">-</span><span class="mi">10</span>                  <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">256</span><span class="p">]</span>         <span class="mi">131</span><span class="p">,</span><span class="mi">328</span>
           <span class="n">Linear</span><span class="o">-</span><span class="mi">11</span>                  <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">133</span><span class="p">]</span>          <span class="mi">34</span><span class="p">,</span><span class="mi">181</span>
<span class="o">================================================================</span>
<span class="n">Total</span> <span class="n">params</span><span class="p">:</span> <span class="mi">25</span><span class="p">,</span><span class="mi">879</span><span class="p">,</span><span class="mi">717</span>
<span class="n">Trainable</span> <span class="n">params</span><span class="p">:</span> <span class="mi">25</span><span class="p">,</span><span class="mi">879</span><span class="p">,</span><span class="mi">717</span>
<span class="n">Non</span><span class="o">-</span><span class="n">trainable</span> <span class="n">params</span><span class="p">:</span> <span class="mi">0</span>
<span class="o">----------------------------------------------------------------</span>
<span class="n">Input</span> <span class="n">size</span> <span class="p">(</span><span class="n">MB</span><span class="p">):</span> <span class="mf">0.57</span>
<span class="n">Forward</span><span class="o">/</span><span class="n">backward</span> <span class="k">pass</span> <span class="n">size</span> <span class="p">(</span><span class="n">MB</span><span class="p">):</span> <span class="mf">13.79</span>
<span class="n">Params</span> <span class="n">size</span> <span class="p">(</span><span class="n">MB</span><span class="p">):</span> <span class="mf">98.72</span>
<span class="n">Estimated</span> <span class="n">Total</span> <span class="n">Size</span> <span class="p">(</span><span class="n">MB</span><span class="p">):</span> <span class="mf">113.09</span>
<span class="o">----------------------------------------------------------------</span></code></pre></figure>

<div style="text-align: center">
<figure>
<img src="/img/tensorboard_dogmodel.png" style="display: block; margin: auto;  max-width: 100%;" />
<figcaption>Model graph in Tensorboard</figcaption>
</figure>
</div>

<h1 id="transfer-learning">Transfer Learning</h1>

<p><a href="https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html">PyTorch transfer learning official tutorial</a></p>

<p>Instead of training the model we created from scratch, we’re going to fine-tune a pretrained model.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model_transfer</span> <span class="o">=</span> <span class="n">resnet101</span><span class="p">(</span><span class="n">pretrained</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">model_transfer</span><span class="p">)</span></code></pre></figure>

<p>The classifier part of the model is a single fully-connected layer <code class="language-plaintext highlighter-rouge">(fc): Linear(in_features=2048, out_features=1000, bias=True)</code>. This layer was trained for the 1000 classes of the ImageNet dataset, so it won’t work for our 133-class problem; we need to replace the classifier.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Freeze parameters so we don't backprop through them
</span><span class="k">for</span> <span class="n">param</span> <span class="ow">in</span> <span class="n">model_transfer</span><span class="p">.</span><span class="n">parameters</span><span class="p">():</span>
    <span class="n">param</span><span class="p">.</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">False</span>
    
<span class="n">num_ftrs</span> <span class="o">=</span> <span class="mi">2048</span> <span class="c1">#model_transfer.fc.in_features  # it's 2048, check fc layer of resnet
</span>
<span class="c1"># creating model using Sequential API
</span><span class="n">classifier</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">num_ftrs</span><span class="p">,</span> <span class="mi">512</span><span class="p">),</span>
                           <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
                           <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.2</span><span class="p">),</span>
                           <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span> <span class="mi">133</span><span class="p">))</span>
<span class="n">model_transfer</span><span class="p">.</span><span class="n">fc</span> <span class="o">=</span> <span class="n">classifier</span>

<span class="n">model_transfer</span> <span class="o">=</span> <span class="n">model_transfer</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">model_transfer</span><span class="p">)</span>
<span class="n">summary</span><span class="p">(</span><span class="n">model_transfer</span><span class="p">,</span> <span class="n">input_size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">))</span></code></pre></figure>

<h1 id="training-validation-and-inference">Training, Validation, and Inference</h1>

<p>Since it’s a classification problem, we’ll use the cross-entropy loss function.</p>

\[\text{Cross-entropy} = -\sum_{i=1}^n \sum_{j=1}^m y_{i,j}\log(p_{i,j})\]

<p>where, \(y_{i,j}\) denotes the true value i.e. 1 if sample <code class="language-plaintext highlighter-rouge">i</code> belongs to class <code class="language-plaintext highlighter-rouge">j</code> and 0 otherwise, and \(p_{i,j}\) denotes the probability predicted by your model of sample <code class="language-plaintext highlighter-rouge">i</code> belonging to class <code class="language-plaintext highlighter-rouge">j</code>.</p>

<p><code class="language-plaintext highlighter-rouge">nn.CrossEntropyLoss()</code> combines <code class="language-plaintext highlighter-rouge">nn.LogSoftmax()</code> (log(softmax(x))) and <code class="language-plaintext highlighter-rouge">nn.NLLLoss()</code> (negative log likelihood loss) in one single class. Therefore, the output from the network that is passed into <code class="language-plaintext highlighter-rouge">nn.CrossEntropyLoss</code> needs to be the raw output of the network (called logits), not the output of the softmax function.</p>

<p>Alternatively, it is convenient to build the model with a log-softmax output using <code class="language-plaintext highlighter-rouge">nn.LogSoftmax</code> (or <code class="language-plaintext highlighter-rouge">F.log_softmax</code>) and train it with the negative log likelihood loss, <code class="language-plaintext highlighter-rouge">nn.NLLLoss</code>; the actual probabilities can then be recovered by taking the exponential, <code class="language-plaintext highlighter-rouge">torch.exp(output)</code>. <a href="https://stackoverflow.com/a/65193236/6210807">Read more</a>.</p>
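<p>A quick sanity check of this equivalence (an illustrative sketch with arbitrary values, not code from the post): cross-entropy on raw logits matches NLL loss on log-softmax output, and both match the formula above averaged over the batch.</p>

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 133)          # batch of 4, 133 dog breeds
target = torch.tensor([3, 7, 0, 42])  # arbitrary class indices

ce = F.cross_entropy(logits, target)                    # expects raw logits
nll = F.nll_loss(F.log_softmax(logits, dim=1), target)  # expects log-probs
# the formula directly: -mean_i log p_{i, y_i}
manual = -F.log_softmax(logits, dim=1)[torch.arange(4), target].mean()

print(torch.allclose(ce, nll), torch.allclose(ce, manual))  # True True
```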

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">criterion</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">CrossEntropyLoss</span><span class="p">()</span> <span class="c1"># LogSoftmax + NLLLoss
# only train the classifier (fully-connected layers') parameters
</span><span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model_transfer</span><span class="p">.</span><span class="n">fc</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.001</span><span class="p">)</span></code></pre></figure>

<ul>
  <li>one epoch = one forward pass and one backward pass of all the training examples.</li>
  <li>batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you’ll need.</li>
  <li>number of iterations = number of passes, each pass using [batch size] examples. To be clear, one pass = one forward pass + one backward pass (the forward and backward passes are not counted as two separate passes).</li>
</ul>

<p>Example: if you have 1000 training examples, and your batch size is 4, then it will take 250 iterations to complete 1 epoch.</p>
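<p>The arithmetic above as a one-liner (values assumed from the example):</p>

```python
import math

# 1000 training examples with batch size 4 take 250 iterations per epoch;
# ceil handles a final partial batch when the sizes don't divide evenly.
n_examples, batch_size = 1000, 4
iterations_per_epoch = math.ceil(n_examples / batch_size)
print(iterations_per_epoch)  # 250
```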

<p><em>Note:</em> the weights are updated after each batch (i.e. each iteration), not after each epoch.</p>

<p>Calling <code class="language-plaintext highlighter-rouge">backward</code> causes gradients to accumulate at leaf nodes, so you need to zero them explicitly after each parameter update with <code class="language-plaintext highlighter-rouge">optimizer.zero_grad()</code>. We can exploit this behavior to <a href="https://stackoverflow.com/a/68479643/6210807">increase the effective batch size using gradient accumulation</a>.</p>
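<p>A minimal sketch of gradient accumulation, assuming a toy linear model and random data (the names <code class="language-plaintext highlighter-rouge">accum_steps</code> and <code class="language-plaintext highlighter-rouge">n_updates</code> are illustrative, not from the post): the optimizer steps only every <code class="language-plaintext highlighter-rouge">accum_steps</code> batches, for an effective batch size of batch size × accumulation steps.</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
accum_steps = 4   # effective batch size = 8 * 4 = 32
n_updates = 0

batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(8)]
for batch_idx, (data, target) in enumerate(batches):
    loss = criterion(model(data), target) / accum_steps  # scale so grads average
    loss.backward()                                      # gradients accumulate
    if (batch_idx + 1) % accum_steps == 0:
        optimizer.step()       # update using the accumulated gradients
        optimizer.zero_grad()  # reset gradients for the next window
        n_updates += 1

print(n_updates)  # 2 weight updates for 8 batches
```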

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">n_epochs</span><span class="p">,</span> <span class="n">loaders</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">criterion</span><span class="p">,</span> <span class="n">use_cuda</span><span class="p">,</span> <span class="n">save_path</span><span class="p">):</span>
    <span class="s">"""returns trained model"""</span>
    <span class="c1"># initialize tracker for minimum validation loss
</span>    <span class="n">valid_loss_min</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">Inf</span> 
    
    <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_epochs</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span>
        <span class="c1"># initialize variables to monitor training and validation loss
</span>        <span class="n">train_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="n">valid_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
        
        <span class="c1">###################
</span>        <span class="c1"># train the model #
</span>        <span class="c1">###################
</span>        <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">batch_idx</span><span class="p">,</span> <span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">target</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">loaders</span><span class="p">[</span><span class="s">'train'</span><span class="p">]):</span>
            <span class="c1"># move to GPU, if available
</span>            <span class="c1"># image, label
</span>            <span class="k">if</span> <span class="n">use_cuda</span><span class="p">:</span>
                <span class="n">data</span><span class="p">,</span> <span class="n">target</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">cuda</span><span class="p">(),</span> <span class="n">target</span><span class="p">.</span><span class="n">cuda</span><span class="p">()</span> <span class="c1"># .to(device)
</span>            <span class="c1"># zero the parameter gradients
</span>            <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
            <span class="c1"># forward pass: compute predicted outputs by passing inputs to the model
</span>            <span class="c1"># [N, C, H, W] -&gt; [32, 3, 224, 224]
</span>            <span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
            <span class="c1"># calculate the loss
</span>            <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span> <span class="n">target</span><span class="p">)</span>
            <span class="c1"># backward pass
</span>            <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
            <span class="c1"># optimization step (update the weights)
</span>            <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
            <span class="c1"># record the average training loss
</span>            <span class="c1"># train_loss += loss.item()*data.size(0)
</span>            <span class="c1"># if using above method then divide loss "outside this for-loop": 
</span>            <span class="c1"># using this (to get epoch loss): train_loss = train_loss/len(loaders['train'])
</span>            <span class="n">train_loss</span> <span class="o">+=</span> <span class="p">((</span><span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="n">batch_idx</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span> <span class="o">*</span> <span class="p">(</span><span class="n">loss</span><span class="p">.</span><span class="n">data</span> <span class="o">-</span> <span class="n">train_loss</span><span class="p">))</span>
            
        <span class="c1">######################    
</span>        <span class="c1"># validate the model #
</span>        <span class="c1">######################
</span>        <span class="c1"># set model to evaluation model (disables dropout etc)
</span>        <span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">batch_idx</span><span class="p">,</span> <span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">target</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">loaders</span><span class="p">[</span><span class="s">'valid'</span><span class="p">]):</span>
            <span class="k">if</span> <span class="n">use_cuda</span><span class="p">:</span>
                <span class="n">data</span><span class="p">,</span> <span class="n">target</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">cuda</span><span class="p">(),</span> <span class="n">target</span><span class="p">.</span><span class="n">cuda</span><span class="p">()</span>
            <span class="c1"># Turn off gradients for validation, saves memory and computations
</span>            <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
                <span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
                <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span> <span class="n">target</span><span class="p">)</span>
                <span class="n">valid_loss</span> <span class="o">+=</span> <span class="p">((</span><span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="n">batch_idx</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span> <span class="o">*</span> <span class="p">(</span><span class="n">loss</span><span class="p">.</span><span class="n">data</span> <span class="o">-</span> <span class="n">valid_loss</span><span class="p">))</span>
                
            
        <span class="c1"># print training/validation statistics 
</span>        <span class="k">print</span><span class="p">(</span><span class="s">'Epoch: {} </span><span class="se">\t</span><span class="s">Training Loss: {:.6f} </span><span class="se">\t</span><span class="s">Validation Loss: {:.6f}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span>
            <span class="n">epoch</span><span class="p">,</span> 
            <span class="n">train_loss</span><span class="p">,</span>
            <span class="n">valid_loss</span>
            <span class="p">))</span>
        
        <span class="c1">## serialization: save the model if validation loss has decreased
</span>        <span class="k">if</span> <span class="n">valid_loss</span> <span class="o">&lt;=</span> <span class="n">valid_loss_min</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Validation loss decreased (</span><span class="si">{</span><span class="n">valid_loss_min</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s"> --&gt; </span><span class="si">{</span><span class="n">valid_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">).  Saving model ...'</span><span class="p">)</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">state_dict</span><span class="p">(),</span> <span class="n">save_path</span><span class="p">)</span>
            <span class="n">valid_loss_min</span> <span class="o">=</span> <span class="n">valid_loss</span>
            
    <span class="k">return</span> <span class="n">model</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">ImageFile</span>
<span class="n">ImageFile</span><span class="p">.</span><span class="n">LOAD_TRUNCATED_IMAGES</span> <span class="o">=</span> <span class="bp">True</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># train the model
</span><span class="n">model_transfer</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="n">loaders</span><span class="p">,</span> <span class="n">model_transfer</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">criterion</span><span class="p">,</span> <span class="n">use_cuda</span><span class="p">,</span> <span class="s">'model_transfer.pt'</span><span class="p">)</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">Epoch</span><span class="p">:</span> <span class="mi">1</span> 	<span class="n">Training</span> <span class="n">Loss</span><span class="p">:</span> <span class="mf">2.871226</span> 	<span class="n">Validation</span> <span class="n">Loss</span><span class="p">:</span> <span class="mf">1.018821</span>
<span class="n">Validation</span> <span class="n">loss</span> <span class="n">decreased</span> <span class="p">(</span><span class="n">inf</span> <span class="o">--&gt;</span> <span class="mf">1.019</span><span class="p">).</span>  <span class="n">Saving</span> <span class="n">model</span> <span class="p">...</span>
<span class="n">Epoch</span><span class="p">:</span> <span class="mi">2</span> 	<span class="n">Training</span> <span class="n">Loss</span><span class="p">:</span> <span class="mf">1.468614</span> 	<span class="n">Validation</span> <span class="n">Loss</span><span class="p">:</span> <span class="mf">0.654094</span>
<span class="n">Validation</span> <span class="n">loss</span> <span class="n">decreased</span> <span class="p">(</span><span class="mf">1.019</span> <span class="o">--&gt;</span> <span class="mf">0.654</span><span class="p">).</span>  <span class="n">Saving</span> <span class="n">model</span> <span class="p">...</span>
<span class="n">Epoch</span><span class="p">:</span> <span class="mi">3</span> 	<span class="n">Training</span> <span class="n">Loss</span><span class="p">:</span> <span class="mf">1.249909</span> 	<span class="n">Validation</span> <span class="n">Loss</span><span class="p">:</span> <span class="mf">0.551980</span>
<span class="n">Validation</span> <span class="n">loss</span> <span class="n">decreased</span> <span class="p">(</span><span class="mf">0.654</span> <span class="o">--&gt;</span> <span class="mf">0.552</span><span class="p">).</span>  <span class="n">Saving</span> <span class="n">model</span> <span class="p">...</span>
<span class="n">Epoch</span><span class="p">:</span> <span class="mi">4</span> 	<span class="n">Training</span> <span class="n">Loss</span><span class="p">:</span> <span class="mf">1.162452</span> 	<span class="n">Validation</span> <span class="n">Loss</span><span class="p">:</span> <span class="mf">0.498752</span>
<span class="n">Validation</span> <span class="n">loss</span> <span class="n">decreased</span> <span class="p">(</span><span class="mf">0.552</span> <span class="o">--&gt;</span> <span class="mf">0.499</span><span class="p">).</span>  <span class="n">Saving</span> <span class="n">model</span> <span class="p">...</span>
<span class="n">Epoch</span><span class="p">:</span> <span class="mi">5</span> 	<span class="n">Training</span> <span class="n">Loss</span><span class="p">:</span> <span class="mf">1.122475</span> 	<span class="n">Validation</span> <span class="n">Loss</span><span class="p">:</span> <span class="mf">0.470465</span>
<span class="n">Validation</span> <span class="n">loss</span> <span class="n">decreased</span> <span class="p">(</span><span class="mf">0.499</span> <span class="o">--&gt;</span> <span class="mf">0.470</span><span class="p">).</span>  <span class="n">Saving</span> <span class="n">model</span> <span class="p">...</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># load the checkpoint with the lowest validation loss (uncomment the line below)
# model_transfer.load_state_dict(torch.load('model_transfer.pt'))</span></code></pre></figure>

<p><a href="https://stackoverflow.com/a/54747245/6210807"><code class="language-plaintext highlighter-rouge">parameters()</code> vs <code class="language-plaintext highlighter-rouge">state_dict</code></a></p>

<p>The <code class="language-plaintext highlighter-rouge">.parameters()</code> only gives the module parameters i.e. weights and biases, while <code class="language-plaintext highlighter-rouge">state_dict</code> returns a dictionary containing a whole state of the module.</p>
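<p>An illustrative comparison on a small module (not the post’s model): BatchNorm has only 2 learnable parameters, but its <code class="language-plaintext highlighter-rouge">state_dict</code> also carries non-parameter buffers such as the running statistics.</p>

```python
import torch.nn as nn

bn = nn.BatchNorm1d(8)
n_params = len(list(bn.parameters()))  # weight and bias only
keys = sorted(bn.state_dict().keys())  # parameters + buffers
print(n_params)  # 2
print(keys)      # ['bias', 'num_batches_tracked', 'running_mean', 'running_var', 'weight']
```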

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">param</span> <span class="ow">in</span> <span class="n">model_scratch</span><span class="p">.</span><span class="n">named_parameters</span><span class="p">():</span>
    <span class="k">if</span> <span class="n">param</span><span class="p">.</span><span class="n">requires_grad</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">name</span><span class="p">)</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">conv1</span><span class="p">.</span><span class="n">weight</span>
<span class="n">conv1</span><span class="p">.</span><span class="n">bias</span>
<span class="n">conv2</span><span class="p">.</span><span class="n">weight</span>
<span class="n">conv2</span><span class="p">.</span><span class="n">bias</span>
<span class="n">conv3</span><span class="p">.</span><span class="n">weight</span>
<span class="n">conv3</span><span class="p">.</span><span class="n">bias</span>
<span class="n">fc1</span><span class="p">.</span><span class="n">weight</span>
<span class="n">fc1</span><span class="p">.</span><span class="n">bias</span>
<span class="n">fc2</span><span class="p">.</span><span class="n">weight</span>
<span class="n">fc2</span><span class="p">.</span><span class="n">bias</span>
<span class="n">fc3</span><span class="p">.</span><span class="n">weight</span>
<span class="n">fc3</span><span class="p">.</span><span class="n">bias</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model_transfer</span><span class="p">.</span><span class="n">state_dict</span><span class="p">().</span><span class="n">keys</span><span class="p">()</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">odict_keys</span><span class="p">([</span><span class="s">'conv1.weight'</span><span class="p">,</span> <span class="s">'bn1.weight'</span><span class="p">,</span> <span class="s">'bn1.bias'</span><span class="p">,</span> <span class="s">'bn1.running_mean'</span><span class="p">,</span> <span class="s">'bn1.running_var'</span><span class="p">,</span> <span class="s">'bn1.num_batches_tracked'</span><span class="p">,</span> <span class="s">'layer1.0.conv1.weight'</span><span class="p">,</span> <span class="s">'layer1.0.bn1.weight'</span><span class="p">,</span> <span class="s">'layer1.0.bn1.bias'</span><span class="p">,</span> <span class="s">'layer1.0.bn1.running_mean'</span><span class="p">,</span> <span class="s">'layer1.0.bn1.running_var'</span><span class="p">,</span> <span class="s">'layer1.0.bn1.num_batches_tracked'</span><span class="p">,</span> <span class="s">'layer1.0.conv2.weight'</span><span class="p">,</span> <span class="s">'layer1.0.bn2.weight'</span><span class="p">,</span> <span class="s">'layer1.0.bn2.bias'</span><span class="p">,</span> <span class="s">'layer1.0.bn2.running_mean'</span><span class="p">,</span> <span class="s">'layer1.0.bn2.running_var'</span><span class="p">,</span> <span class="s">'layer1.0.bn2.num_batches_tracked'</span><span class="p">,</span> <span class="s">'layer1.0.conv3.weight'</span><span class="p">,</span> <span class="s">'layer1.0.bn3.weight'</span><span class="p">,</span> <span class="s">'layer1.0.bn3.bias'</span><span class="p">,</span> <span class="s">'layer1.0.bn3.running_mean'</span><span class="p">,</span> <span class="s">'layer1.0.bn3.running_var'</span><span class="p">,</span> <span class="s">'layer1.0.bn3.num_batches_tracked'</span><span class="p">,</span> <span class="s">'layer1.0.downsample.0.weight'</span><span class="p">,</span> <span class="s">'layer1.0.downsample.1.weight'</span><span class="p">,</span> <span 
class="s">'layer1.0.downsample.1.bias'</span><span class="p">,</span> <span class="s">'layer1.0.downsample.1.running_mean'</span><span class="p">,</span> <span class="s">'layer1.0.downsample.1.running_var'</span><span class="p">,</span> <span class="s">'layer1.0.downsample.1.num_batches_tracked'</span><span class="p">,</span> <span class="s">'layer1.1.conv1.weight'</span><span class="p">,</span> <span class="s">'layer1.1.bn1.weight'</span><span class="p">,</span> <span class="s">'layer1.1.bn1.bias'</span><span class="p">,</span> <span class="s">'layer1.1.bn1.running_mean'</span><span class="p">,</span> <span class="s">'layer1.1.bn1.running_var'</span><span class="p">,</span> <span class="s">'layer1.1.bn1.num_batches_tracked'</span><span class="p">,</span> <span class="s">'layer1.1.conv2.weight'</span><span class="p">,</span> <span class="s">'layer1.1.bn2.weight'</span><span class="p">,</span> <span class="s">'layer1.1.bn2.bias'</span><span class="p">,</span> <span class="s">'layer1.1.bn2.running_mean'</span><span class="p">,</span> <span class="s">'layer1.1.bn2.running_var'</span><span class="p">,</span> <span class="s">'layer1.1.bn2.num_batches_tracked'</span><span class="p">,</span> <span class="s">'layer1.1.conv3.weight'</span><span class="p">,</span> <span class="s">'layer1.1.bn3.weight'</span><span class="p">,</span> <span class="s">'layer1.1.bn3.bias'</span><span class="p">,</span> <span class="s">'layer1.1.bn3.running_mean'</span><span class="p">,</span> <span class="s">'layer1.1.bn3.running_var'</span><span class="p">,</span> <span class="s">'layer1.1.bn3.num_batches_tracked'</span><span class="p">,</span> <span class="p">...])</span></code></pre></figure>

<p><code class="language-plaintext highlighter-rouge">torch.nn</code> only supports mini-batches. For example, <code class="language-plaintext highlighter-rouge">nn.Conv2d</code> takes a 4D tensor of shape <strong>NCHW</strong> (nSamples x nChannels x Height x Width). If you have a single sample, just use <code class="language-plaintext highlighter-rouge">input.unsqueeze(0)</code> to add a fake batch dimension.</p>
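<p>A minimal sketch of the fake batch dimension:</p>

```python
import torch

img = torch.randn(3, 224, 224)  # a single [C, H, W] image
batch = img.unsqueeze(0)        # [N, C, H, W] = [1, 3, 224, 224]
print(tuple(batch.shape))  # (1, 3, 224, 224)
```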

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">class_names</span> <span class="o">=</span> <span class="p">[</span><span class="n">item</span><span class="p">[</span><span class="mi">4</span><span class="p">:].</span><span class="n">replace</span><span class="p">(</span><span class="s">"_"</span><span class="p">,</span> <span class="s">" "</span><span class="p">)</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">image_datasets</span><span class="p">[</span><span class="s">'train'</span><span class="p">].</span><span class="n">classes</span><span class="p">]</span>
<span class="n">loader_transform</span> <span class="o">=</span> <span class="n">data_transforms</span><span class="p">[</span><span class="s">'test'</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">predict_breed_transfer</span><span class="p">(</span><span class="n">img_path</span><span class="p">):</span>
    <span class="n">img</span> <span class="o">=</span> <span class="n">PIL</span><span class="p">.</span><span class="n">Image</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="n">img_path</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
    <span class="n">img</span> <span class="o">=</span> <span class="n">loader_transform</span><span class="p">(</span><span class="n">img</span><span class="p">).</span><span class="nb">float</span><span class="p">()</span>
    <span class="c1"># 3, 224, 224
</span>    <span class="n">img</span> <span class="o">=</span> <span class="n">img</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>  <span class="c1"># Add batch size for PyTorch: [N, C, H, W]: [1, 3, 224, 224]
</span>    <span class="n">model_transfer</span><span class="p">.</span><span class="n">cpu</span><span class="p">()</span>
    <span class="n">_</span><span class="p">,</span> <span class="n">preds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">model_transfer</span><span class="p">(</span><span class="n">img</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">class_names</span><span class="p">[</span><span class="n">preds</span><span class="p">]</span>

<span class="n">predict_breed_transfer</span><span class="p">(</span><span class="s">'dogImages/train/001.Affenpinscher/Affenpinscher_00001.jpg'</span><span class="p">)</span></code></pre></figure>

<p><img src="/img/dog_output_Affenpinscher.png" style="display: block; margin: auto;  max-width: 100%;" /></p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="s">'Affenpinscher'</span></code></pre></figure>

<h1 id="onnx">ONNX</h1>

<ul>
  <li><a href="https://onnx.ai/">ONNX</a> (Open Neural Network Exchange) is an open format for representing models, allowing interoperability between frameworks.</li>
  <li>It defines a common set of operators (opsets) and a file format, so a saved <code class="language-plaintext highlighter-rouge">.onnx</code> model file can be loaded and converted across various frameworks and runtimes.</li>
</ul>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="s">"cuda:0"</span> <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">()</span> <span class="k">else</span> <span class="s">"cpu"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Using'</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>

<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">1</span>  <span class="c1"># just take random number
</span><span class="n">dummy_input</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">)</span>

<span class="c1"># move model to gpu if available
</span><span class="n">model_transfer</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
<span class="c1"># set eval mode
</span><span class="n">model_transfer</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
<span class="c1"># move input to gpu if available
</span><span class="n">dummy_input</span> <span class="o">=</span> <span class="n">dummy_input</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
<span class="c1"># output using pytorch
</span><span class="n">torch_out</span> <span class="o">=</span> <span class="n">model_transfer</span><span class="p">(</span><span class="n">dummy_input</span><span class="p">)</span>
<span class="c1"># print('torch_out', torch_out)
</span><span class="k">print</span><span class="p">(</span><span class="s">'shape:'</span><span class="p">,</span> <span class="n">torch_out</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="c1"># export the model
</span><span class="n">torch</span><span class="p">.</span><span class="n">onnx</span><span class="p">.</span><span class="n">export</span><span class="p">(</span><span class="n">model_transfer</span><span class="p">,</span>             <span class="c1"># model being run
</span>                 <span class="n">dummy_input</span><span class="p">,</span>                 <span class="c1"># model input (or a tuple for multiple inputs)
</span>                 <span class="s">'resnet101.onnx'</span><span class="p">,</span>            <span class="c1"># where to save the model (can be a file or file-like object)
</span>                 <span class="n">input_names</span> <span class="o">=</span> <span class="p">[</span><span class="s">'input_1'</span><span class="p">],</span>   <span class="c1"># the model's input names
</span>                 <span class="n">output_names</span> <span class="o">=</span> <span class="p">[</span><span class="s">'output_1'</span><span class="p">],</span> <span class="c1"># the model's output names
</span>                 <span class="n">dynamic_axes</span><span class="o">=</span><span class="p">{</span><span class="s">'input_1'</span> <span class="p">:</span> <span class="p">{</span><span class="mi">0</span> <span class="p">:</span> <span class="s">'batch_size'</span><span class="p">},</span>   <span class="c1"># variable length axes
</span>                               <span class="s">'output_1'</span> <span class="p">:</span> <span class="p">{</span><span class="mi">0</span> <span class="p">:</span> <span class="s">'batch_size'</span><span class="p">}})</span>

<span class="k">print</span><span class="p">(</span><span class="s">'Model exported successfully!'</span><span class="p">)</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">Using</span> <span class="n">cuda</span><span class="p">:</span><span class="mi">0</span>
<span class="n">shape</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Size</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">])</span>
<span class="n">Model</span> <span class="n">exported</span> <span class="n">successfully</span><span class="err">!</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># !pip install onnx onnxruntime-gpu 
</span><span class="kn">import</span> <span class="nn">onnx</span><span class="p">,</span> <span class="n">onnxruntime</span>

<span class="n">model_name</span> <span class="o">=</span> <span class="s">'resnet101.onnx'</span>
<span class="n">onnx_model</span> <span class="o">=</span> <span class="n">onnx</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">model_name</span><span class="p">)</span>
<span class="n">onnx</span><span class="p">.</span><span class="n">checker</span><span class="p">.</span><span class="n">check_model</span><span class="p">(</span><span class="n">onnx_model</span><span class="p">)</span>

<span class="n">ort_session</span> <span class="o">=</span> <span class="n">onnxruntime</span><span class="p">.</span><span class="n">InferenceSession</span><span class="p">(</span><span class="n">model_name</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">to_numpy</span><span class="p">(</span><span class="n">tensor</span><span class="p">):</span>
      <span class="k">return</span> <span class="n">tensor</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">cpu</span><span class="p">().</span><span class="n">numpy</span><span class="p">()</span> <span class="k">if</span> <span class="n">tensor</span><span class="p">.</span><span class="n">requires_grad</span> <span class="k">else</span> <span class="n">tensor</span><span class="p">.</span><span class="n">cpu</span><span class="p">().</span><span class="n">numpy</span><span class="p">()</span>

<span class="c1"># compute ONNX Runtime output prediction
</span><span class="n">ort_inputs</span> <span class="o">=</span> <span class="p">{</span><span class="n">ort_session</span><span class="p">.</span><span class="n">get_inputs</span><span class="p">()[</span><span class="mi">0</span><span class="p">].</span><span class="n">name</span><span class="p">:</span> <span class="n">to_numpy</span><span class="p">(</span><span class="n">dummy_input</span><span class="p">)}</span>
<span class="n">ort_outs</span> <span class="o">=</span> <span class="n">ort_session</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">ort_inputs</span><span class="p">)</span>

<span class="c1"># compare ONNX Runtime and PyTorch results
</span><span class="k">print</span><span class="p">(</span><span class="s">'ort_outs[0]: '</span><span class="p">,</span> <span class="n">ort_outs</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">shape</span><span class="p">)</span>
<span class="n">np</span><span class="p">.</span><span class="n">testing</span><span class="p">.</span><span class="n">assert_allclose</span><span class="p">(</span><span class="n">to_numpy</span><span class="p">(</span><span class="n">torch_out</span><span class="p">),</span> <span class="n">ort_outs</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">rtol</span><span class="o">=</span><span class="mf">1e-03</span><span class="p">,</span> <span class="n">atol</span><span class="o">=</span><span class="mf">1e-05</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="s">"Exported model has been tested with ONNXRuntime, and the result looks good!"</span><span class="p">)</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">ort_outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]:</span>  <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span>
<span class="n">Exported</span> <span class="n">model</span> <span class="n">has</span> <span class="n">been</span> <span class="n">tested</span> <span class="k">with</span> <span class="n">ONNXRuntime</span><span class="p">,</span> <span class="ow">and</span> <span class="n">the</span> <span class="n">result</span> <span class="n">looks</span> <span class="n">good</span><span class="err">!</span></code></pre></figure>

<h1 id="assignment">Assignment</h1>

<h3 id="assignment-1">Assignment 1</h3>

<ol>
  <li>Calculate the second derivative of <code class="language-plaintext highlighter-rouge">x^2+x</code>.</li>
  <li>Create a custom layer that performs convolution followed by optional batch normalization.
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ConvWithBatchNorm(in_channels=3, out_channels=16, kernel_size=4, stride=2, padding=1, batch_norm=False)
</code></pre></div>    </div>
  </li>
  <li>Initialize the weights of a single linear layer from a uniform distribution.</li>
  <li>Calculate cross-entropy loss for the following:<br />
Note that <code class="language-plaintext highlighter-rouge">cross_entropy</code> or <code class="language-plaintext highlighter-rouge">nll_loss</code> in pytorch takes the <a href="https://stackoverflow.com/q/49390842/6210807">raw inputs, not probabilities</a> while calculating loss.<br />
(4a).
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>labels: [1, 0, 2]
logits = [2.5, -0.5, 0.1], [-1.1, 2.5, 0.0], [1.2, 2.2, 3.1]
</code></pre></div>    </div>
    <p>(4b).</p>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>labels: [1, 0, 1]
probabilities: [0.1, 0.9], [0.9, 0.1], [0.2, 0.8]
</code></pre></div>    </div>
  </li>
  <li>Fix the below code to create a model having multiple linear layers:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class MyModule(nn.Module):
 def __init__(self):
     super(MyModule, self).__init__()
     self.linears = []
     for i in range(5):
         self.linears.append(nn.Linear(10, 10))

 def forward(self, x):
     for i, l in enumerate(self.linears):
         x = self.linears[i // 2](x) + l(x)
     return x
model = MyModule()
print(model)
</code></pre></div>    </div>
  </li>
</ol>
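<p>Regarding the note in question 4: since <code class="language-plaintext highlighter-rouge">cross_entropy</code> applies log-softmax to its input internally, feeding it probabilities instead of raw logits silently produces a different loss. A minimal plain-Python sketch of the computation (illustrative numbers, not the assignment’s):</p>

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy from RAW logits (what PyTorch's cross_entropy expects):
    log-softmax followed by negative log-likelihood."""
    m = max(logits)  # subtract max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]
    return -log_probs[label]

logits = [2.0, 1.0, 0.1]
probs = [0.659, 0.242, 0.099]  # softmax(logits), rounded
print(cross_entropy(logits, 0))  # correct loss, from logits
print(cross_entropy(probs, 0))   # NOT the same value: log-softmax was applied twice
```

Passing probabilities effectively applies softmax twice, which is why the two printed losses differ.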

<h3 id="assignment-2">Assignment 2</h3>

<ol>
  <li>Use transfer learning to fine-tune a model on the following dataset and achieve a validation classification accuracy of at least 0.85 (or a validation loss of 0.25) during training. (Choose a pretrained model of your choice.)<br />
Dataset: <a href="https://s3.amazonaws.com/content.udacity-data.com/courses/nd188/flower_data.zip">Flower images</a> <a href="http://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html">[Read more here]</a>  <br />
Note: Don’t forget to normalize the data before training. You can also apply data augmentation, regularization, learning rate decay etc.</li>
</ol>

<section>
	<link rel="stylesheet" href="/css/quiz.css" />
<div id="quiz">
  <h1 id="quiz-name"></h1>
  <div style="display: flex; align-items: center; justify-content: center">
    <button id="prev-question-button">Previous Question</button>
    <button id="next-question-button">Next Question</button>
    <button id="submit-button">Submit Answers</button>
  
  <div id="quiz-results" style="display: flex; align-items: center; justify-content: center">
    <p id="quiz-results-message"></p>
    <p id="quiz-results-score"></p>
    <!-- <button id="quiz-retry-button">Retry</button> -->
  </div>
  </div>

  <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
  <script>
	// Array of all the questions and choices to populate the questions. This might be saved in some JSON file or a database and we would have to read the data in.
	var all_questions = [{
	  question_string: "What is the best way to copy the data of a tensor x to y?",
	  choices: {
	    correct: "y = x.detach().clone()",
	    wrong: ["y = torch.tensor(x)", "y = x.clone()", "y = x.detach()"]
	  }
	}, {
	  question_string: "Which data format does PyTorch use?",
	  choices: {
	    correct: "NCHW",
	    wrong: ["NHWC", "CHWN", "Both NCHW and NHWC"]
	  }
	}, {
	  question_string: "What are the default tensor data types of: torch.tensor([1, 2]), torch.tensor([1., 2.]), torch.randn(2, 2)?",
	  choices: {
	    correct: "torch.int64, torch.float32, torch.float32",
	    wrong: ["torch.int64, torch.float64, torch.float32", "torch.int64, torch.float64, torch.float64", "torch.float32, torch.float32, torch.float32"]
	  }
	}, {
	  question_string: "Which of the following is the incorrect way to get prediction probabilities from a classification model?",
	  choices: {
	    correct: "Add nn.Softmax() extra layer then use torch.exp(output) with nn.NLLLoss()",
	    wrong: ["Add nn.LogSoftmax() extra layer then use torch.exp(output) with nn.NLLLoss()", "Use F.softmax(output) with nn.CrossEntropy()", "All are correct"]
	  }
	}];
  </script>
  <script src="/css/quiz.js"></script>

</div>
	 
</section>

<hr />

<p><em>Special thanks to Udacity, where I started my PyTorch journey through PyTorch Scholarship and Deep Learning Nanodegree.</em></p>

<p>If you’re looking for more basic PyTorch projects, check <a href="https://github.com/kHarshit/udacity-nanodegree-projects/tree/master/DLND_deep_learning_nanodegree">kHarshit/udacity-nanodegree-projects</a>.</p>

<p><strong>Resources</strong></p>
<ul>
  <li><a href="https://pytorch.org/docs/">PyTorch Docs</a></li>
  <li><a href="https://pytorch.org/tutorials">PyTorch Tutorials</a></li>
  <li><a href="https://discuss.pytorch.org/">PyTorch Discuss</a></li>
  <li><a href="https://stackoverflow.com/questions/tagged/pytorch?tab=Votes">Stack Overflow</a></li>
  <li><a href="https://pytorch.org/assets/deep-learning/Deep-Learning-with-PyTorch.pdf">Deep Learning with PyTorch book</a></li>
</ul>]]></content><author><name></name></author><category term="Computer Vision" /><category term="Deep Learning" /><category term="PyTorch" /><summary type="html"><![CDATA[PyTorch libraries torchvision: for computer vision torchtext: for NLP torchaudio: for speech]]></summary></entry><entry><title type="html">Color and color spaces in Computer Vision</title><link href="https://kharshit.github.io/blog/2020/01/17/color-and-color-spaces-in-computer-vision" rel="alternate" type="text/html" title="Color and color spaces in Computer Vision" /><published>2020-01-17T00:00:00+00:00</published><updated>2020-01-17T00:00:00+00:00</updated><id>https://kharshit.github.io/blog/2020/01/17/color-and-color-spaces-in-computer-vision</id><content type="html" xml:base="https://kharshit.github.io/blog/2020/01/17/color-and-color-spaces-in-computer-vision"><![CDATA[<blockquote>
  <p>A picture is worth a million words.</p>
</blockquote>

<p><img src="/img/debashis-biswas-dyPFnxxUhYk-unsplash.jpg" style="display: block; margin: auto;  max-width: 100%;" />
Photo by <a href="https://unsplash.com/@debashismelts?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Debashis Biswas</a> on <a href="https://unsplash.com/s/photos/holi-color?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>

<p>The color we see is how our brain visually perceives the world. The color of an object is determined by the different wavelengths of light it reflects (and absorbs), which are affected by the object’s physical properties.</p>

<blockquote>
  <p>Color is a perception, not the physical property of an object … though it’s affected by the object’s properties.</p>
</blockquote>

<h2 id="color-space-vs-color-model">Color space Vs Color model</h2>

<p>In order to categorize and represent colors in computers, we use color models such as RGB that describe colors mathematically. A color space, on the other hand, is the organization of colors used to display or reproduce them in a medium such as a computer screen. It’s how real colors are mapped to the color model’s discrete values, e.g. sRGB and Adobe RGB are two different color spaces, both based on the RGB color model, i.e. RGB(16, 69, 201) may be displayed differently in sRGB and Adobe RGB. You can read more about it <a href="https://photo.stackexchange.com/questions/48984/what-is-the-difference-or-relation-between-a-color-model-and-a-color-space/48985">here</a>.</p>

<p>Note that these terms are often used interchangeably.</p>

<h2 id="characteristics-of-color">Characteristics of color</h2>

<p>The color can be characterized by the following properties:</p>

<ul>
  <li><strong>hue</strong>: the dominant color, name of the color itself e.g. red, yellow, green.</li>
  <li><strong>saturation or chroma</strong>: how pure is the color, the dominance of hue in color, purity, strength, intensity, intense vs dull.</li>
  <li><strong>brightness or value</strong>: how bright or illuminated the color is, black vs white, dark vs light.</li>
</ul>

<p><img src="/img/hue_s_v.jpg" style="display: block; margin: auto;  max-width: 100%;" /></p>

<h2 id="human-eye">Human eye</h2>

<p>The human eye responds differently to different wavelengths of light. In fact, it is trichromatic – it contains three different types of photo-receptors called cones that are sensitive to different wavelengths of light. These are S-cones (short-wavelength), M-cones (middle-wavelength), and L-cones (long-wavelength) historically considered more sensitive to blue, green, and red light respectively.</p>

<p>The below graph shows the cone cells’ response to varying wavelengths of light.</p>

<p style="text-align: center"><a href="https://commons.wikimedia.org/wiki/File:Cone-fundamentals-with-srgb-spectrum.svg#/media/File:Cone-fundamentals-with-srgb-spectrum.svg"><img src="https://upload.wikimedia.org/wikipedia/commons/0/04/Cone-fundamentals-with-srgb-spectrum.svg" alt="Cone-fundamentals-with-srgb-spectrum.svg" width="540" height="380" /></a><br />By <a href="//commons.wikimedia.org/wiki/User:BenRG" title="User:BenRG">BenRG</a> - <span class="int-own-work" lang="en">Own work</span>, Public Domain, <a href="https://commons.wikimedia.org/w/index.php?curid=7873848">Link</a></p>

<p>As the above figure shows, the peak response of the L cones lies in the greenish-yellow region, not red. Similarly, the S and M cones don’t correspond directly to blue and green. In fact, the responsiveness of the cones to different colors varies from person to person.</p>

<h2 id="rgb">RGB</h2>

<p>In the RGB color model, all colors are represented by adding combinations of the three primary colors: red, green, and blue. All three primaries at full intensity form white, represented by RGB(255, 255, 255), and at zero intensity give black, RGB(0, 0, 0).</p>

<p>Though the RGB model is convenient for representing colors, it differs from how the human eye perceives them.</p>

<p><img src="/img/rgb_cymk.png" style="display: block; margin: auto; width:70%; max-width: 100%;" /></p>

<h2 id="cymk">CMYK</h2>

<p>Unlike RGB, CMYK is a subtractive color model i.e. colors are represented by subtracting components from white, e.g. cyan is white minus red. Cyan, magenta, and yellow are the complements of red, green, and blue respectively. A fourth color, black (K), is added to yield CMYK for better reproduction of colors.</p>

<p>Conversion from RGB (normalized to [0, 1]) to CMY: C = 1 − R, M = 1 − G, Y = 1 − B.</p>
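<p>As a quick sketch of the formula above (plain Python, assuming 8-bit channel values and ignoring color profiles):</p>

```python
def rgb_to_cmy(r, g, b):
    """Convert 8-bit RGB to CMY: normalize each channel to [0, 1], then subtract from 1."""
    return tuple(round(1 - c / 255, 3) for c in (r, g, b))

print(rgb_to_cmy(255, 0, 0))  # pure red -> (0.0, 1.0, 1.0): no cyan, full magenta and yellow
```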

<h2 id="hsv-and-hsl">HSV and HSL</h2>

<p>HSV (Hue, Saturation, Value) and HSL (Hue, Saturation, Lightness), both developed by transforming the RGB color model, were designed to be more intuitive and interpretable. They are cylindrical representations of colors.</p>

<p>Hue, the color itself, ranges from 0 to 360 degrees, starting and ending with red. Saturation defines how pure the color is, i.e. the dominance of hue in the color, and ranges from 0 (no saturation) to 1 (full saturation). Value (in HSV) and Lightness (in HSL), both ranging from 0 (no light, black) at the bottom to 1 (white) at the top, indicate the illumination level. The two models differ in that full saturation is achieved at V=1 in HSV, while in HSL it’s achieved at L=0.5.</p>

<p><img src="/img/hsv_hsl.png" style="display: block; margin: auto; max-width: 100%;" /></p>
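<p>Python’s standard-library <code class="language-plaintext highlighter-rouge">colorsys</code> module can illustrate the RGB–HSV transform (it works on channels normalized to [0, 1] and returns hue as a fraction of the full 360-degree circle):</p>

```python
import colorsys

# Channels normalized to [0, 1]; hue is returned as a fraction of 360 degrees
h, s, v = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)   # pure red
print(h * 360, s, v)  # hue 0 (red), full saturation, full value

# The transform is invertible
print(colorsys.hsv_to_rgb(h, s, v))  # back to (1.0, 0.0, 0.0)
```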

<h2 id="delta-e">Delta E</h2>

<p><em>To be updated soon…</em></p>

<section>
	<link rel="stylesheet" href="/css/quiz.css" />
<div id="quiz">
  <h1 id="quiz-name"></h1>
  <div style="display: flex; align-items: center; justify-content: center">
    <button id="prev-question-button">Previous Question</button>
    <button id="next-question-button">Next Question</button>
    <button id="submit-button">Submit Answers</button>
  
  <div id="quiz-results" style="display: flex; align-items: center; justify-content: center">
    <p id="quiz-results-message"></p>
    <p id="quiz-results-score"></p>
    <!-- <button id="quiz-retry-button">Retry</button> -->
  </div>
  </div>

  <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
  <script>
	// Array of all the questions and choices to populate the questions. This might be saved in some JSON file or a database and we would have to read the data in.
	var all_questions = [{
	  question_string: "What measures the purity or intensity of the color?",
	  choices: {
	    correct: "Saturation",
	    wrong: ["Hue", "Brightness", "None of these"]
	  }
	}, {
	  question_string: "Which characteristic corresponds to the difference between black and white color?",
	  choices: {
	    correct: "Value",
	    wrong: ["Hue", "Saturation", "All of these"]
	  }
	}, {
	  question_string: "Which of the following are additive models of color?",
	  choices: {
	    correct: "RGB",
	    wrong: ["CMYK", "Both of these", "None of these"]
	  }
	}, {
	  question_string: 'Which color model is closer to how the human eye perceives color?',
	  choices: {
	    correct: "Delta E",
	    wrong: ["RGB", "CMYK", "HSV"]
	  }
	}];
  </script>
  <script src="/css/quiz.js"></script>

</div>
	 
</section>

<hr />

<p><strong>References &amp; Further Readings:</strong></p>

<ol>
  <li><a href="https://en.wikipedia.org/wiki/Color_space">Color space - Wikipedia</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Color_model">Color model - Wikipedia</a></li>
  <li><a href="https://www.dcc.fc.up.pt/~mcoimbra/lectures/MAPI_1415/CV_1415_T1.pdf">Fundamental concepts of processing and image analysis</a></li>
  <li><a href="http://sun.aei.polsl.pl/~mkawulok/stud/graph/instr.pdf">Introduction to computer vision</a></li>
</ol>]]></content><author><name></name></author><category term="Computer Vision" /><summary type="html"><![CDATA[A picture is worth a millions words.]]></summary></entry><entry><title type="html">Introduction to Panoptic Segmentation: A Tutorial</title><link href="https://kharshit.github.io/blog/2019/10/18/introduction-to-panoptic-segmentation-tutorial" rel="alternate" type="text/html" title="Introduction to Panoptic Segmentation: A Tutorial" /><published>2019-10-18T00:00:00+00:00</published><updated>2019-10-18T00:00:00+00:00</updated><id>https://kharshit.github.io/blog/2019/10/18/introduction-to-panoptic-segmentation-tutorial</id><content type="html" xml:base="https://kharshit.github.io/blog/2019/10/18/introduction-to-panoptic-segmentation-tutorial"><![CDATA[<p>In semantic segmentation, the goal is to classify each pixel into the given classes. In instance segmentation, we care about segmentation of the instances of objects separately. The panoptic segmentation combines semantic and instance segmentation such that all pixels are assigned a class label and all object instances are uniquely segmented.</p>

<p><em>Read about <a href="/blog/2019/08/09/quick-intro-to-semantic-segmentation">semantic segmentation</a>, and <a href="/blog/2019/08/23/quick-intro-to-instance-segmentation">instance segmentation</a></em>.</p>

<p><img src="/img/college_semantic.png" style="width: 304px; max-width: 100%" />
<img src="/img/college_instance.png" style="width: 304px; max-width: 100%" />
<img src="/img/college_panoptic.png" style="width: 304px; max-width: 100%" /></p>
<figcaption style="text-align: center;">Left: semantic segmentation, middle: instance segmentation, right: panoptic segmentation</figcaption>

<h2 id="introduction">Introduction</h2>

<p>The goal in panoptic segmentation is to perform a unified segmentation task. In order to do so, let’s first understand a few basic concepts.</p>

<p>A <em>thing</em> is a countable object such as a person or a car, i.e. a category with instance-level annotation. <em>Stuff</em> is an amorphous region of similar texture such as road or sky, i.e. a category without instance-level annotation. Things are studied in object detection and instance segmentation, while stuff is studied in semantic segmentation.</p>

<p>The label encoding of pixels in panoptic segmentation involves assigning each pixel of an image two labels – one for the semantic label, and the other for the instance id. Pixels with the same labels belong to the same class, and the instance id is ignored for stuff. Unlike instance segmentation, each pixel in panoptic segmentation belongs to at most one instance i.e. there are no overlapping instances.</p>

<p>For example, consider the following set of pixel values in a naive encoding manner:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>26000, 26001, 26002, 26003, 19, 18
</code></pre></div></div>

<p>Here, <code class="language-plaintext highlighter-rouge">pixel // 1000</code> gives the semantic label, and <code class="language-plaintext highlighter-rouge">pixel % 1000</code> gives the instance id. Thus, the pixels <code class="language-plaintext highlighter-rouge">26000, 26001, 26002, 26003</code> correspond to the same semantic class (26) but different instances, while the pixels <code class="language-plaintext highlighter-rouge">19</code> and <code class="language-plaintext highlighter-rouge">18</code> represent semantic labels belonging to non-instance stuff classes.</p>
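<p>This naive scheme can be decoded with two integer operations. A minimal sketch (here, values below the divisor are treated as stuff labels, whose instance id is irrelevant):</p>

```python
def decode(pixel, divisor=1000):
    """Split an encoded pixel value into (semantic label, instance id).

    Values below the divisor are stuff: the pixel value itself is the
    semantic label and there is no instance id (None here).
    """
    if pixel < divisor:
        return pixel, None
    return pixel // divisor, pixel % divisor

for p in [26000, 26001, 26002, 26003, 19, 18]:
    print(p, "->", decode(p))
```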

<p>In COCO, the panoptic annotations are stored in the following way:</p>

<blockquote>
  <p>Each annotation struct is a per-image annotation rather than a per-object annotation. Each per-image annotation has two parts: (1) a PNG that stores the class-agnostic image segmentation and (2) a JSON struct that stores the semantic information for each image segment.</p>
</blockquote>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">annotation</span><span class="p">{</span>
    <span class="s">"image_id"</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="s">"file_name"</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>      <span class="c1"># per-pixel segment ids are stored as a single PNG at annotation.file_name
</span>    <span class="s">"segments_info"</span><span class="p">:</span> <span class="p">[</span><span class="n">segment_info</span><span class="p">],</span>
<span class="p">}</span>

<span class="n">segment_info</span><span class="p">{</span>
    <span class="s">"id"</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>             <span class="c1"># unique segment id for each segment whether stuff or thing
</span>    <span class="s">"category_id"</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>    <span class="c1"># gives the semantic category
</span>    <span class="s">"area"</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="s">"bbox"</span><span class="p">:</span> <span class="p">[</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="n">width</span><span class="p">,</span><span class="n">height</span><span class="p">],</span>
    <span class="s">"iscrowd"</span><span class="p">:</span> <span class="mi">0</span> <span class="ow">or</span> <span class="mi">1</span><span class="p">,</span>     <span class="c1"># indicates whether segment encompasses a group of objects (relevant for thing categories only).
</span><span class="p">}</span>

<span class="n">categories</span><span class="p">[{</span>
    <span class="s">"id"</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
    <span class="s">"name"</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="s">"supercategory"</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="s">"isthing"</span><span class="p">:</span> <span class="mi">0</span> <span class="ow">or</span> <span class="mi">1</span><span class="p">,</span>     <span class="c1"># stuff or thing
</span>    <span class="s">"color"</span><span class="p">:</span> <span class="p">[</span><span class="n">R</span><span class="p">,</span><span class="n">G</span><span class="p">,</span><span class="n">B</span><span class="p">],</span>
<span class="p">}]</span></code></pre></figure>

<h2 id="datasets">Datasets</h2>

<p>The available panoptic segmentation datasets include <a href="http://cocodataset.org/#panoptic-2019">MS-COCO</a>, <a href="https://www.cityscapes-dataset.com/">Cityscapes</a>, <a href="https://research.mapillary.com/eccv18/#panoptic">Mapillary Vistas</a>, <a href="https://groups.csail.mit.edu/vision/datasets/ADE20K/">ADE20k</a>, and <a href="https://idd.insaan.iiit.ac.in/">Indian Driving Dataset</a>.</p>

<h2 id="evaluation">Evaluation</h2>

<p>In semantic segmentation, <code class="language-plaintext highlighter-rouge">IoU</code> and per-pixel accuracy are used as evaluation criteria. In instance segmentation, average precision over different <code class="language-plaintext highlighter-rouge">IoU</code> thresholds is used for evaluation. For panoptic segmentation, a combination of <code class="language-plaintext highlighter-rouge">IoU</code> and <code class="language-plaintext highlighter-rouge">AP</code> can be used, but this causes an asymmetry between classes with and without instance-level annotations. That is why a new metric that treats all the categories equally, called <strong>Panoptic Quality (<code class="language-plaintext highlighter-rouge">PQ</code>)</strong>, is used.</p>

<p><em>Read more about <a href="/blog/2019/09/20/evaluation-metrics-for-object-detection-and-segmentation">evaluation metrics</a>.</em></p>

<p>As in the calculation of <code class="language-plaintext highlighter-rouge">AP</code>, <code class="language-plaintext highlighter-rouge">PQ</code> is also first calculated independently for each class, then averaged over all classes. It involves two steps: matching and calculation.</p>

<p>Step 1 (matching): A predicted segment and a ground truth segment are considered matched if their <code class="language-plaintext highlighter-rouge">IoU &gt; 0.5</code>. This, together with the non-overlapping property of panoptic segments, guarantees a unique matching, i.e. there can be at most one predicted segment matched to each ground truth segment.</p>

<p><img src="/img/pq.png" style="display: block; margin: auto; max-width: 100%;" /></p>

<p>Step 2 (calculation): Mathematically, for ground truth segments <code class="language-plaintext highlighter-rouge">g</code> and predicted segments <code class="language-plaintext highlighter-rouge">p</code>, PQ is calculated as follows.</p>

\[\begin{align}
\mathrm{PQ} &amp;= \frac{\sum_{(p, g) \in T P} \operatorname{IoU}(p, g)}{|T P|+\frac{1}{2}|F P|+\frac{1}{2}|F N|}\\

&amp;= \underbrace{\frac{\sum_{(p, g) \in T P} \operatorname{IoU}(p, g)}{|T P|}}_{\text {segmentation quality (SQ) }} \times \underbrace{\frac{|T P|}{|T P|+\frac{1}{2}|F P|+\frac{1}{2}|F N|}}_{\text {recognition quality (RQ) }}
\end{align}\]

<p>Here, in the first equation, the numerator divided by <code class="language-plaintext highlighter-rouge">TP</code> is simply the average <code class="language-plaintext highlighter-rouge">IoU</code> of matched segments, and <code class="language-plaintext highlighter-rouge">FP</code> and <code class="language-plaintext highlighter-rouge">FN</code> are added to penalize the non-matched segments. As shown in the second equation, <code class="language-plaintext highlighter-rouge">PQ</code> can be divided into segmentation quality (<code class="language-plaintext highlighter-rouge">SQ</code>) and recognition quality (<code class="language-plaintext highlighter-rouge">RQ</code>). <code class="language-plaintext highlighter-rouge">SQ</code>, here, is the average <code class="language-plaintext highlighter-rouge">IoU</code> of matched segments, and <code class="language-plaintext highlighter-rouge">RQ</code> is the <code class="language-plaintext highlighter-rouge">F1</code> score.</p>
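<p>As a concrete sketch (a hypothetical helper, not from the paper), <code class="language-plaintext highlighter-rouge">PQ</code>, <code class="language-plaintext highlighter-rouge">SQ</code>, and <code class="language-plaintext highlighter-rouge">RQ</code> for a single class can be computed from boolean segment masks as follows:</p>

```python
import numpy as np

def panoptic_quality(pred_masks, gt_masks):
    """PQ, SQ, RQ for one class from lists of boolean masks.
    A pair matches when IoU > 0.5, which guarantees a unique matching."""
    matched_ious = []
    matched_preds = set()
    for g in gt_masks:
        for pi, p in enumerate(pred_masks):
            if pi in matched_preds:
                continue
            inter = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum()
            iou = inter / union if union else 0.0
            if iou > 0.5:
                matched_ious.append(iou)
                matched_preds.add(pi)
                break
    tp = len(matched_ious)
    fp = len(pred_masks) - tp
    fn = len(gt_masks) - tp
    sq = sum(matched_ious) / tp if tp else 0.0          # avg IoU of matches
    rq = tp / (tp + 0.5 * fp + 0.5 * fn) if (tp + fp + fn) else 0.0  # F1
    return sq * rq, sq, rq
```

<p>A perfect prediction gives <code class="language-plaintext highlighter-rouge">PQ = SQ = RQ = 1</code>; each unmatched prediction or ground truth segment pulls <code class="language-plaintext highlighter-rouge">RQ</code> (and hence <code class="language-plaintext highlighter-rouge">PQ</code>) down.</p>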

<h2 id="model">Model</h2>

<p>One of the ways to solve the problem of panoptic segmentation is to combine the predictions from semantic and instance segmentation models, e.g. <a href="/blog/2019/08/09/quick-intro-to-semantic-segmentation">Fully Convolutional Network (FCN)</a> and <a href="/blog/2019/08/23/quick-intro-to-instance-segmentation">Mask R-CNN</a>, to get panoptic predictions. In order to do so, the overlapping instance predictions first need to be converted to non-overlapping ones using an NMS-like (non-max suppression) procedure.</p>

<p><img src="/img/fpn_approach.png" style="display: block; margin: auto; max-width: 100%;" /></p>

<p>A better way is to use a unified <strong>Panoptic FPN</strong> (Feature Pyramid Network) framework. The idea is to use an FPN backbone for multi-level feature extraction, which is used for region-based instance segmentation as in Mask R-CNN, and to add a parallel dense-prediction branch on top of the same FPN features to perform semantic segmentation.</p>

<p><img src="/img/panoptic_fpn.png" style="display: block; margin: auto; max-width: 100%;" /></p>

<p>During training, the instance segmentation branch has three losses \(L_{cls}\) (classification loss), \(L_{bbox}\) (bounding-box loss), and \(L_{mask}\) (mask loss). The semantic segmentation branch has semantic loss, \(L_s\), computed as the per-pixel cross-entropy between the predicted and the ground truth labels.</p>

\[L = \lambda_i(L_{cls} + L_{bbox} + L_{mask}) + \lambda_s L_s\]

<p>The panoptic loss is thus a weighted combination of the instance and semantic losses, controlled by the two tuning parameters \(\lambda_i\) and \(\lambda_s\).</p>
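<p>In code, this combination is just a weighted sum; a tiny sketch (the lambda values below are illustrative, not the paper’s tuned settings):</p>

```python
def panoptic_loss(l_cls, l_bbox, l_mask, l_s, lambda_i=1.0, lambda_s=0.5):
    """Panoptic FPN total loss: instance losses weighted by lambda_i,
    semantic loss weighted by lambda_s (values here are illustrative)."""
    return lambda_i * (l_cls + l_bbox + l_mask) + lambda_s * l_s

# e.g. panoptic_loss(0.2, 0.1, 0.3, 0.4) == 1.0 * 0.6 + 0.5 * 0.4 == 0.8
```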

<h2 id="implementation">Implementation</h2>

<p>Facebook AI Research recently released <a href="https://github.com/facebookresearch/detectron2">Detectron2</a> written in PyTorch. In order to test panoptic segmentation using Mask R-CNN FPN, follow the below steps.</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># install pytorch (https://pytorch.org) and opencv</span>
pip <span class="nb">install </span>opencv-python
<span class="c"># install dependencies</span>
pip <span class="nb">install </span>cython<span class="p">;</span> pip <span class="nb">install</span> <span class="s1">'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'</span>
<span class="c"># install detectron2</span>
git clone https://github.com/facebookresearch/detectron2.git
<span class="nb">cd </span>detectron2
python setup.py build develop

<span class="c"># test on an image (using `MODEL.DEVICE cpu` for inference on CPU)</span>
python demo/demo.py <span class="nt">--config-file</span> configs/COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml <span class="nt">--input</span> ~/Pictures/image.jpg <span class="nt">--opts</span> MODEL.WEIGHTS detectron2://COCO-PanopticSegmentation/panoptic_fpn_R_50_3x/139514569/model_final_c10459.pkl MODEL.DEVICE cpu</code></pre></figure>

<p><img src="/img/panoptic_example.png" style="display: block; margin: auto; max-width: 100%;" /></p>

<p><strong>References &amp; Further Readings:</strong></p>
<ol>
  <li><a href="https://arxiv.org/pdf/1801.00868.pdf">Panoptic Segmentation paper</a></li>
  <li><a href="http://cocodataset.org/#format-data">Panoptic data format</a></li>
  <li><a href="https://arxiv.org/pdf/1901.02446.pdf">Panoptic FPN</a></li>
  <li><a href="https://www.dropbox.com/s/t6tg87t78pdq6v3/cvpr19_tutorial_alexander_kirillov.pdf?dl=0">Panoptic segmentation slides (also image source)</a></li>
</ol>]]></content><author><name></name></author><category term="Deep Learning" /><category term="Computer Vision" /><summary type="html"><![CDATA[In semantic segmentation, the goal is to classify each pixel into the given classes. In instance segmentation, we care about segmentation of the instances of objects separately. The panoptic segmentation combines semantic and instance segmentation such that all pixels are assigned a class label and all object instances are uniquely segmented.]]></summary></entry><entry><title type="html">Evaluation metrics for object detection and segmentation: mAP</title><link href="https://kharshit.github.io/blog/2019/09/20/evaluation-metrics-for-object-detection-and-segmentation" rel="alternate" type="text/html" title="Evaluation metrics for object detection and segmentation: mAP" /><published>2019-09-20T00:00:00+00:00</published><updated>2019-09-20T00:00:00+00:00</updated><id>https://kharshit.github.io/blog/2019/09/20/evaluation-metrics-for-object-detection-and-segmentation</id><content type="html" xml:base="https://kharshit.github.io/blog/2019/09/20/evaluation-metrics-for-object-detection-and-segmentation"><![CDATA[<p><em>Read about <a href="/blog/2019/08/09/quick-intro-to-semantic-segmentation">semantic segmentation</a>, and <a href="/blog/2019/08/23/quick-intro-to-instance-segmentation">instance segmentation</a></em>.</p>

<p>Different evaluation metrics are used for different datasets/competitions. The most common are the Pascal VOC metric and the MS COCO evaluation metric.</p>

<h2 id="iou-intersection-over-union">IoU (Intersection over Union)</h2>

<p>To decide whether a prediction is correct w.r.t. an object, <strong>IoU</strong> or <strong>Jaccard Index</strong> is used. It is defined as the area of intersection between the predicted bbox and the ground truth bbox, divided by the area of their union. A prediction is considered a True Positive if <code class="language-plaintext highlighter-rouge">IoU &gt; threshold</code>, and a False Positive if <code class="language-plaintext highlighter-rouge">IoU &lt; threshold</code>.</p>

<p><img src="/img/iou.png" style="display: block; margin: auto; width: 35%; max-width: 100%;" /></p>

<h2 id="precision-and-recall">Precision and Recall</h2>

<p>To understand mAP, let’s go through precision and recall first. <strong>Recall</strong> is the true positive rate, i.e. of all the actual positives, how many are true positive predictions. <strong>Precision</strong> is the positive predictive value, i.e. of all the positive predictions, how many are true positive predictions. Read more in <a href="/blog/2017/12/29/false-positives">evaluation metrics for classification</a>.</p>

\[\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} =  \frac{\text{TP}}{\text{# ground truths}}\]

\[\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{\text{TP}}{\text{# predictions}}\]
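<p>In code, these two formulas are one-liners; a small sketch computing both from raw counts:</p>

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# 8 TP, 2 FP, 4 FN -> precision 0.8, recall 8/12
```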

<h2 id="map-mean-average-precision">mAP (mean Average Precision)</h2>

<h3 id="pascal-voc">Pascal VOC</h3>

<p>In order to calculate mAP, first, you need to calculate AP per class.</p>

<p>Consider the below images containing ground truths (in green) and bbox predictions (in red) for a particular class.</p>

<p><img src="/img/map_bboxes.png" style="display: block; margin: auto; max-width: 100%;" /></p>

<p>The details of the bboxes are as follows:</p>

<p><img src="/img/map_gt.png" style="display: block; margin: auto; max-width: 100%;" /></p>

<p>In this example, a detection is considered TP if IoU &gt; 0.5, else FP. Now, sort the detections by confidence score. Note that if there is more than one detection for a single object, the detection having the highest IoU is considered TP and the rest FP, e.g. in image 2.</p>

<p><img src="/img/map_table.png" style="display: block; margin: auto; max-width: 100%;" /></p>

<blockquote>
  <p>In VOC metric, Recall is defined as  the  proportion of  all positive examples ranked  above a given rank. Precision is the proportion of all examples above that rank which are from the positive class.</p>
</blockquote>

<p>Thus, in the column Acc (accumulated) TP, write the total number of TP encountered from the top, and do the same for Acc FP. Now, calculate the precision and recall e.g. for P4, <code class="language-plaintext highlighter-rouge">Precision = 1/(1+0) = 1</code>, and <code class="language-plaintext highlighter-rouge">Recall = 1/3 = 0.33</code>.</p>
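<p>The accumulation step can be sketched as follows (the confidences and TP/FP flags below are a toy example, not the table above):</p>

```python
def pr_points(detections, n_ground_truths):
    """detections: (confidence, is_tp) pairs. Returns the (precision, recall)
    point at each rank after sorting by confidence, descending."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    points, acc_tp, acc_fp = [], 0, 0
    for _, is_tp in detections:
        acc_tp += int(is_tp)       # accumulated TP from the top
        acc_fp += int(not is_tp)   # accumulated FP from the top
        precision = acc_tp / (acc_tp + acc_fp)
        recall = acc_tp / n_ground_truths
        points.append((precision, recall))
    return points

pts = pr_points([(0.9, True), (0.8, False), (0.7, True)], n_ground_truths=3)
# first rank: precision 1/(1+0) = 1.0, recall 1/3
```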

<p>These precision and recall values are then plotted to get a PR (precision-recall) curve. The area under the PR curve is called <strong>Average Precision (AP)</strong>. The PR curve follows a kind of zig-zag pattern, as recall increases monotonically while precision decreases overall with sporadic rises.</p>

<p>The AP summarizes the shape of the precision-recall curve, and, in <strong>VOC 2007</strong>, it is defined as the mean of precision values at a set of 11 equally spaced recall levels [0,0.1,…,1] (0 to 1 at step size of 0.1), <em>not the AUC</em>.</p>

\[AP = \frac{1}{11} \sum_{r \in (0,0.1,...,1)}{p_{interp(r)}}\]

<p>The precision at each recall level r is interpolated by taking the maximum precision measured for a method for which the corresponding recall exceeds r.</p>

\[p_{interp}(r) = \max_{\tilde{r}:\tilde{r}\geq r}{p(\tilde{r})}\]

<p><img src="/img/interpolateAP.jpeg" style="display: block; margin: auto; width: 75%; max-width: 100%;" /></p>

<p>i.e. take the max precision value to the right at 11 equally spaced recall points [0: 0.1: 1], and take their mean to get AP.</p>
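<p>The 11-point rule can be sketched directly from the formula above, where <code class="language-plaintext highlighter-rouge">points</code> are the (precision, recall) pairs of the PR curve:</p>

```python
def voc2007_ap(points):
    """VOC 2007 11-point interpolated AP: mean of the max precision at
    recall >= r, for r in 0, 0.1, ..., 1."""
    ap = 0.0
    for t in [i / 10 for i in range(11)]:
        ps = [p for p, r in points if r >= t]
        ap += max(ps) if ps else 0.0   # interpolated precision at level t
    return ap / 11

# a curve with precision 1.0 up to recall 0.5 covers 6 of the 11 levels
```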

<p>However, from <strong>VOC 2010</strong>, the computation of AP changed.</p>

<blockquote>
  <p>Compute a version of the measured precision-recall curve with precision monotonically decreasing, by setting the precision for recall r to the maximum precision obtained for <em>any</em> recall \(\tilde{r}\geq r\). Then compute the AP as the area under this curve by numerical integration.</p>
</blockquote>

<p>i.e. given the PR curve in orange, calculate the max precision to the right for all the recall points thus getting a new curve in green. Now, take the AUC using integration under the green curve. It would be the AP. The only difference from VOC 2007 here is that we’re taking not just 11 but all the points into account.</p>
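<p>A sketch of the VOC 2010 all-points computation: build the monotonically decreasing precision envelope, then integrate it over recall:</p>

```python
def voc2010_ap(points):
    """points: (precision, recall) pairs sorted by increasing recall.
    Make precision monotonically decreasing, then take the area under it."""
    recalls = [0.0] + [r for _, r in points] + [1.0]
    precisions = [0.0] + [p for p, _ in points] + [0.0]
    # precision envelope: max precision at any recall to the right
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * precisions[i]
    return ap
```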

<p>Now that we have the AP per class (object category), the <strong>mean Average Precision (mAP)</strong> is the AP averaged over all the object categories.</p>

<p><img src="/img/map.png" style="display: block; margin: auto; max-width: 100%;" /></p>

<p>For the segmentation challenge in VOC, the <strong>segmentation accuracy</strong> (per-pixel accuracy calculated using IoU) is used as the evaluation criterion, which is defined as follows:</p>

\[\text{segmentation accuracy} = \frac{\text{TP}}{\text{TP + FP + FN}}\]

<h3 id="coco">COCO</h3>

<p>Usually, as in VOC, a prediction with IoU &gt; 0.5 is considered as True Positive prediction. It means that two predictions of IoU 0.6 and 0.9 would have equal weightage. Thus, a certain threshold introduces a bias in the evaluation metric. One way to solve this problem is to use a range of IoU threshold values, and calculate mAP for each IoU, and take their average to get the final mAP.</p>

<p><em>Note that COCO uses [0:.01:1] R=101 recall thresholds for evaluation.</em></p>

<p>In COCO evaluation, the IoU threshold ranges from 0.5 to 0.95 with a step size of 0.05 represented as AP@[.5:.05:.95].</p>

<p>The AP at fixed IoUs such as IoU=0.5 and IoU=0.75 is written as AP50 and AP75 respectively.</p>

<blockquote>
  <p>Unless otherwise specified, AP and AR are averaged over multiple Intersection over Union (IoU) values. Specifically we use 10 IoU thresholds of .50:.05:.95. This is a break from tradition, where AP is computed at a single IoU of .50 (which corresponds to our metric \(AP^{IoU=.50}\)). Averaging over IoUs rewards detectors with better localization.</p>
</blockquote>

\[mAP_{\text{COCO}} = \frac{mAP_{0.50} + mAP_{0.55} + ... + mAP_{0.95}}{10}\]

<blockquote>
  <p>AP is averaged over all categories. Traditionally, this is called “mean average precision” (mAP). We make no distinction between AP and mAP (and likewise AR and mAR) and assume the difference is clear from context.</p>
</blockquote>

<p><strong><em>Two minute additions:</em></strong> Usually, the averages are taken in a different order (the final result is the same), and in COCO, mAP is also referred to as AP, i.e.</p>

<ul>
  <li><em>Step 1:</em> For each class, calculate AP at different IoU thresholds and take their average to get the AP of that class.</li>
</ul>

\[\text{AP[class]} = \frac{1}{\text{#thresholds}} \sum_{\text{iou $\in$ thresholds}}{AP[class, iou]}\]

<p><img src="/img/ap.png" style="display: block; margin: auto; max-width: 100%;" /></p>

<ul>
  <li><em>Step 2:</em> Calculate the final AP by averaging the AP over different classes.</li>
</ul>

\[\text{AP} = \frac{1}{\text{#classes}} \sum_{\text{class $\in$ classes}}{AP[class]}\]
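<p>The two averaging steps above can be sketched as follows (the class names and AP values are made up for illustration):</p>

```python
def coco_map(ap, iou_thresholds):
    """ap: dict mapping (class_name, iou_threshold) -> AP value.
    Step 1: average over IoU thresholds per class.
    Step 2: average the per-class APs over all classes."""
    classes = sorted({c for c, _ in ap})
    per_class = {
        c: sum(ap[(c, t)] for t in iou_thresholds) / len(iou_thresholds)
        for c in classes
    }
    return sum(per_class.values()) / len(per_class)

thresholds = [0.5, 0.75]  # COCO actually uses .50:.05:.95 (10 thresholds)
ap = {("cat", 0.5): 0.8, ("cat", 0.75): 0.6,
      ("dog", 0.5): 0.4, ("dog", 0.75): 0.2}
# per-class APs: cat 0.7, dog 0.3 -> final AP 0.5
```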

<blockquote>
  <p>AP is in fact an <abbr title="classes">average</abbr>, <abbr title="IoU thresholds">average, </abbr><abbr title="precision at different recall levels">average</abbr> precision.</p>
</blockquote>

<p><img src="/img/coco_eval.png" style="display: block; margin: auto; max-width: 100%;" /></p>

<h2 id="conclusion">Conclusion</h2>

<ul>
  <li>PascalVOC2007 uses 11 Recall points on PR curve.</li>
  <li>PascalVOC2010–2012 uses (all points) Area Under Curve (AUC) on PR curve.</li>
  <li>MS COCO uses 101 Recall points on PR curve as well as different IoU thresholds.</li>
</ul>

<p><strong>References &amp; Further Readings:</strong></p>
<ol>
  <li><a href="http://cocodataset.org/#detection-eval">COCO evaluation metrics</a></li>
  <li><a href="http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.pdf">VOC2007 metrics</a></li>
  <li><a href="http://host.robots.ox.ac.uk/pascal/VOC/voc2010/devkit_doc_08-May-2010.pdf">VOC2012 metrics</a></li>
  <li><a href="https://github.com/rafaelpadilla/Object-Detection-Metrics">Object detection metrics</a></li>
  <li><a href="https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173">mAP (mean Average Precision) for Object Detection</a></li>
</ol>]]></content><author><name></name></author><category term="Deep Learning" /><category term="Computer Vision" /><summary type="html"><![CDATA[Read about semantic segmentation, and instance segmentation.]]></summary></entry><entry><title type="html">Quick intro to Instance segmentation: Mask R-CNN</title><link href="https://kharshit.github.io/blog/2019/08/23/quick-intro-to-instance-segmentation" rel="alternate" type="text/html" title="Quick intro to Instance segmentation: Mask R-CNN" /><published>2019-08-23T00:00:00+00:00</published><updated>2019-08-23T00:00:00+00:00</updated><id>https://kharshit.github.io/blog/2019/08/23/quick-intro-to-instance-segmentation</id><content type="html" xml:base="https://kharshit.github.io/blog/2019/08/23/quick-intro-to-instance-segmentation"><![CDATA[<p><em>This is the third post in the Quick intro series: <a href="/blog/2019/03/15/quick-intro-to-object-detection">object detection (I)</a>, <a href="/blog/2019/08/09/quick-intro-to-semantic-segmentation">semantic segmentation (II)</a></em>.</p>

<p><img src="/img/ibrahim-rifath-D0x1GOoiPzw-unsplash_inst_seg.jpg" style="display: block; margin: auto; width: 80%; max-width: 100%;" /></p>

<blockquote>
  <p>“Boxes are stupid anyway though, I’m probably a true believer in masks except I can’t get YOLO to learn them.”
— <cite>Joseph Redmon, YOLOv3</cite></p>
</blockquote>

<p>Instance segmentation combines <em>object detection</em>, where the goal is to classify individual objects and localize them using a bounding box, and <em>semantic segmentation</em>, where the goal is to classify each pixel into the given classes. In instance segmentation, we care about the detection and segmentation of the instances of objects separately.</p>

<p><img src="/img/segmentation.png" style="display: block; margin: auto; width: 90%; max-width: 100%;" /></p>

<h2 id="mask-r-cnn">Mask R-CNN</h2>

<p>Mask R-CNN is a state-of-the-art model for instance segmentation. It extends Faster R-CNN, the model used for object detection, by adding a parallel branch for predicting segmentation masks.</p>

<p><img src="/img/seg_mask_rcnn.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>Before getting into Mask R-CNN, let’s take a look at Faster R-CNN.</p>

<h2 id="faster-r-cnn">Faster R-CNN</h2>

<p>Faster R-CNN consists of two stages.</p>

<h3 id="stage-i">Stage I</h3>

<p>The <em>first stage</em> is a deep convolutional network with <strong>Region Proposal Network (RPN)</strong>, which proposes regions of interest (ROI) from the feature maps output by the convolutional neural network i.e.</p>

<p>The input image is fed into a CNN, often called <strong>backbone</strong>, which is usually a pretrained network such as ResNet101. The classification (fully connected) layers from the backbone network are removed so as to use it as a feature extractor. This also makes the network fully convolutional, thus it can take any input size image.</p>

<p><img src="/img/remove_fc_layers.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>The RPN uses a sliding window method to get <abbr title="boxes having high probability of containing object">relevant anchor boxes</abbr> <em>(the precalculated fixed sized bounding boxes having different sizes that are placed throughout the image that represent the approximate bbox predictions so as to save the time to search)</em> from the feature maps.</p>

<p>It then does a binary classification of whether the anchor contains an object or not (into classes <abbr title="foreground">fg</abbr> or <abbr title="background">bg</abbr>), and bounding box regression to refine the bounding boxes. An anchor is labeled positive (fg class) if it has the highest Intersection-over-Union (IoU) with a ground truth box, or if its IoU overlap with a ground truth box is greater than 0.7.</p>

<blockquote>
  <p>At each sliding window location, a number of proposals (max <code class="language-plaintext highlighter-rouge">k</code>) are predicted corresponding to anchor boxes. So the <code class="language-plaintext highlighter-rouge">reg</code> layer has <code class="language-plaintext highlighter-rouge">4k</code> outputs encoding the coordinates of <code class="language-plaintext highlighter-rouge">k</code> boxes, and the <code class="language-plaintext highlighter-rouge">cls</code> layer outputs <code class="language-plaintext highlighter-rouge">2k</code> scores that estimate probability of <em>object</em> or <em>not object</em> for each proposal.</p>
</blockquote>

<p><img src="/img/rpn.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<blockquote>
  <p>In Faster R-CNN, k=9 anchors representing 3 scales and 3 aspect ratios of anchor boxes are present at <em>each</em> sliding window position. Thus, for a convolutional feature map of a size <code class="language-plaintext highlighter-rouge">W×H</code> <em>(typically∼2,400)</em>, there are <code class="language-plaintext highlighter-rouge">WHk</code> anchors in total.</p>
</blockquote>

<p>Hence, at this stage, there are two losses i.e. bbox binary classification loss, \(L_{cls_1}\)  and bbox regression loss, \(L_{bbox_1}\).</p>

<p>The top <em>(positive)</em> anchors output by the RPN, called proposals or Region of Interest (RoI) are fed to the next stage.</p>

<p><img src="/img/faster_rcnn.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<h3 id="stage-ii">Stage II</h3>

<p>The <em>second stage</em> is essentially <strong>Fast R-CNN</strong>, which, using an RoI pooling layer, extracts feature maps from each RoI and performs classification and bounding box regression. The RoI pooling layer converts the section of the feature map corresponding to each <em>(variable sized)</em> RoI into a fixed size to be fed into a fully connected layer.</p>

<p>For example, say, for a 8x8 feature map, the RoI is 7x5 in the bottom left corner, and the RoI pooling layer outputs a fixed size 2x2 feature map. Then, the following operations would be performed:</p>

<ul>
  <li>Divide the RoI into 2x2.</li>
  <li>Perform max-pooling i.e. take maximum value from each section.</li>
</ul>

<p><img src="/img/roi_pooling.gif" style="display: block; margin: auto; width: 80%; max-width: 100%;" /></p>

<p>The fc layer further performs softmax classification of objects into classes (e.g. car, person, bg),  and the same bounding box regression to refine bounding boxes.</p>

<p>Thus, at the second stage as well, there are two losses i.e. object classification loss (into multiple classes), \(L_{cls_2}\), and bbox regression loss, \(L_{bbox_2}\).</p>

<h2 id="mask-prediction">Mask prediction</h2>

<p>Mask R-CNN has an identical first stage, and in the second stage, it also predicts a binary mask in addition to the class score and bbox. The mask branch takes the positive RoIs and predicts masks using a fully convolutional network (FCN).</p>

<p><img src="/img/mask_head.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>In simple terms, Mask R-CNN = Faster R-CNN + FCN</p>

<p>Finally, the loss function is</p>

\[L = L_{cls} + L_{bbox} + L_{mask}\]

<p>The \(L_{cls} (L_{cls_1} + L_{cls_2})\) is the classification loss, which tells how close the predictions are to the true class, and \(L_{bbox} (L_{bbox_1} + L_{bbox_2})\) is the bounding box loss, which tells how good the model is at localization, as discussed above. In addition, there is also \(L_{mask}\), loss for mask prediction, which is calculated by taking the binary cross-entropy between the predicted mask and the ground truth. This loss penalizes wrong per-pixel binary classifications (fg/bg w.r.t ground truth label).</p>

<blockquote>
  <p>Mask R-CNN encodes a binary mask per class for each of the RoIs, and the mask loss for a specific RoI is calculated based only on the mask corresponding to its true class, which prevents the mask loss from being affected by class predictions.</p>
</blockquote>

<blockquote>
  <p>The mask branch has a \(Km^2\)-dimensional output for each RoI, which encodes <code class="language-plaintext highlighter-rouge">K</code> binary masks of resolution <code class="language-plaintext highlighter-rouge">m×m</code>, one for each of the <code class="language-plaintext highlighter-rouge">K</code> classes. To this we apply a per-pixel sigmoid, and define \(L_{mask}\) as the average binary cross-entropy loss.</p>
</blockquote>
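<p>A sketch of the per-RoI mask loss with NumPy (a hypothetical helper, not the paper’s code): select the true-class mask, apply the per-pixel sigmoid, then average the binary cross-entropy:</p>

```python
import numpy as np

def mask_loss(pred_logits, gt_mask, true_class):
    """Average binary cross-entropy for one RoI.
    pred_logits: (K, m, m) mask logits, one m x m mask per class.
    Only the mask of the RoI's true class contributes to the loss."""
    logits = pred_logits[true_class]        # select the true-class mask
    probs = 1.0 / (1.0 + np.exp(-logits))   # per-pixel sigmoid
    eps = 1e-7                              # numerical stability
    bce = -(gt_mask * np.log(probs + eps)
            + (1 - gt_mask) * np.log(1 - probs + eps))
    return bce.mean()
```

<p>Confident, correct logits drive the loss toward zero; confident, wrong logits drive it up, and predictions for the other <code class="language-plaintext highlighter-rouge">K-1</code> classes are ignored entirely.</p>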

<p>In total, there are five losses as follows:</p>

<ul>
  <li>rpn_class_loss, \(L_{cls_1}\): RPN (bbox) anchor binary classifier loss</li>
  <li>rpn_bbox_loss, \(L_{bbox_1}\): RPN bbox regression loss</li>
  <li>fastrcnn_class_loss, \(L_{cls_2}\): loss for the classifier head of Mask R-CNN</li>
  <li>fastrcnn_bbox_loss, \(L_{bbox_2}\): loss for Mask R-CNN bounding box refinement</li>
  <li>maskrcnn_mask_loss, \(L_{mask}\): mask binary cross-entropy loss for the mask head</li>
</ul>

<h2 id="other-improvements">Other improvements</h2>

<h3 id="feature-pyramid-network">Feature Pyramid Network</h3>

<p>Mask R-CNN also utilizes a more effective backbone network architecture called <strong>Feature Pyramid Network (FPN)</strong> along with ResNet, which results in better performance in terms of both accuracy and speed.</p>

<blockquote>
  <p>Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature  pyramid  according  to  their  scale,  but  otherwise  the rest of the approach is similar to vanilla ResNet.</p>
</blockquote>

<p><img src="/img/fpn_0.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>In order to detect objects at different scales, various techniques have been proposed. One of them (c) utilizes the fact that a deep CNN builds a multi-scale representation of the feature maps. The features computed by the various layers of the CNN act as a feature pyramid. Here, you can use your model to detect objects at different levels of the pyramid, thus allowing it to detect objects across a large range of scales, e.g. the model can detect small objects at <code class="language-plaintext highlighter-rouge">conv3</code> as it has higher spatial resolution, allowing the model to extract better features for detection than at <code class="language-plaintext highlighter-rouge">conv5</code>, which has lower spatial resolution. An important thing to note here, though, is that the quality of features at <code class="language-plaintext highlighter-rouge">conv3</code> won’t be as good for classification as the features at <code class="language-plaintext highlighter-rouge">conv5</code>.</p>

<p><img src="/img/fpn_1.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>The above idea is fast as it utilizes the inherent working of CNN by using the features extracted at different conv layers for multi-scale detection, but compromises with the feature quality.</p>

<p>FPN uses the inherent multi-scale representation in the network as above, and solves the problem of weak features at later layers for multi-scale detection.</p>

<p><img src="/img/fpn_2.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>The forward pass of the CNN gives the feature maps at different conv layers i.e. builds the multi-level representation at different scales. In FPN, lateral connections are added at each level of the pyramid. The idea is to take top-down strong features (from <code class="language-plaintext highlighter-rouge">conv5</code>) and propagate them to the high resolution feature maps (to <code class="language-plaintext highlighter-rouge">conv3</code>) thus having strong features across all levels.</p>

<h3 id="roialign">RoiAlign</h3>

<p>As discussed above, RoIPool layer extracts small feature maps from each RoI.  The problem with RoIPool is quantization. If the RoI doesn’t perfectly align with the grid in feature map as shown, the quantization breaks pixel-to-pixel alignment. It isn’t much of a problem in object detection, but in case of predicting masks, which require finer spatial localization, it matters.</p>

<p><img src="/img/roi_quantization.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p><strong>RoIAlign</strong> is an improvement over the RoIPool operation. What RoIAlign does is smoothly transform features from the RoIs (which have different aspect sizes) into fixed size feature vectors without using <em>quantization</em>. It uses bilinear interpolation to do so. A grid of sampling points is used within each bin of the RoI; these points are used to interpolate the features from their nearest neighbors, as shown.</p>

<p><img src="/img/roialign.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>For example, in the above figure, you can’t apply the max-pooling directly due to the misalignment of RoI with the feature map grids, thus in case of RoIAlign, four points are sampled in each bin using bilinear interpolation from its nearest neighbors. Finally, the max value from these points is chosen to get the required 2x2 feature map.</p>
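<p>A sketch of a single RoIAlign bin with NumPy: four regularly spaced points are sampled with bilinear interpolation (no rounding) and then max-pooled. The sampling positions (0.25 and 0.75 of the bin extent) are one common choice, assumed here for illustration:</p>

```python
import numpy as np

def bilinear_sample(fmap, y, x):
    """Sample fmap at continuous location (y, x) via bilinear interpolation."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, fmap.shape[0] - 1), min(x0 + 1, fmap.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (fmap[y0, x0] * (1 - dy) * (1 - dx) + fmap[y0, x1] * (1 - dy) * dx
            + fmap[y1, x0] * dy * (1 - dx) + fmap[y1, x1] * dy * dx)

def roi_align_bin(fmap, y1, x1, y2, x2):
    """One RoIAlign bin: sample 4 regularly spaced points inside the bin
    (no quantization) and take their max."""
    ys = [y1 + (y2 - y1) * f for f in (0.25, 0.75)]
    xs = [x1 + (x2 - x1) * f for f in (0.25, 0.75)]
    return max(bilinear_sample(fmap, y, x) for y in ys for x in xs)
```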

<h2 id="implementation">Implementation</h2>

<p>The following Mask R-CNN implementation is from <a href="https://github.com/facebookresearch/maskrcnn-benchmark"><code class="language-plaintext highlighter-rouge">facebookresearch/maskrcnn-benchmark</code></a> in PyTorch.</p>

<p>Other famous implementations are:</p>

<ul>
  <li>matterport’s <a href="https://github.com/matterport/Mask_RCNN">Mask_RCNN</a> in Keras and Tensorflow</li>
  <li>open-mmlab’s <a href="https://github.com/open-mmlab/mmdetection">mmdetection</a> in PyTorch</li>
  <li>facebookresearch’s <a href="https://github.com/facebookresearch/Detectron">Detectron</a> in Caffe2, and <a href="https://github.com/facebookresearch/detectron2">Detectron2</a> in PyTorch</li>
</ul>

<p>First, install it as follows.</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># install dependencies</span>
pip <span class="nb">install </span>ninja yacs cython matplotlib tqdm opencv-python

<span class="c"># install COCO API</span>
git clone https://github.com/cocodataset/cocoapi.git
<span class="nb">cd </span>cocoapi/PythonAPI
python setup.py build_ext <span class="nb">install
cd</span> ../../

<span class="c"># install apex</span>
<span class="nb">rm</span> <span class="nt">-rf</span> apex
git clone https://github.com/NVIDIA/apex.git
<span class="nb">cd </span>apex
git pull
<span class="c"># if no GPU is available, try installing after removing the --cuda_ext and --cpp_ext flags</span>
python setup.py <span class="nb">install</span> <span class="nt">--cuda_ext</span> <span class="nt">--cpp_ext</span>
<span class="nb">cd</span> ../

<span class="c"># install maskrcnn-benchmark </span>
git clone https://github.com/facebookresearch/maskrcnn-benchmark.git
<span class="nb">cd </span>maskrcnn-benchmark
<span class="c"># the following will install the lib with symbolic links, so that you can modify</span>
<span class="c"># the files if you want and won't need to re-build it</span>
python setup.py build develop

<span class="c"># download predictor.py, which contains necessary utility functions</span>
wget https://raw.githubusercontent.com/facebookresearch/maskrcnn-benchmark/master/demo/predictor.py

<span class="c"># download configuration file</span>
wget https://raw.githubusercontent.com/facebookresearch/maskrcnn-benchmark/master/configs/caffe2/e2e_mask_rcnn_R_50_FPN_1x_caffe2.yaml</code></pre></figure>

<p>Here, for inference, we’ll use a Mask R-CNN model pretrained on the MS COCO dataset.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib.pylab</span> <span class="k">as</span> <span class="n">pylab</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">from</span> <span class="nn">io</span> <span class="kn">import</span> <span class="n">BytesIO</span>
<span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">from</span> <span class="nn">maskrcnn_benchmark.config</span> <span class="kn">import</span> <span class="n">cfg</span>
<span class="kn">from</span> <span class="nn">predictor</span> <span class="kn">import</span> <span class="n">COCODemo</span>

<span class="n">config_file</span> <span class="o">=</span> <span class="s">"e2e_mask_rcnn_R_50_FPN_1x_caffe2.yaml"</span>

<span class="c1"># update the config options with the config file
</span><span class="n">cfg</span><span class="p">.</span><span class="n">merge_from_file</span><span class="p">(</span><span class="n">config_file</span><span class="p">)</span>

<span class="c1"># a helper class `COCODemo`, which loads a model from the config file, and performs pre-processing, model prediction and post-processing for us
</span><span class="n">coco_demo</span> <span class="o">=</span> <span class="n">COCODemo</span><span class="p">(</span>
    <span class="n">cfg</span><span class="p">,</span>
    <span class="n">min_image_size</span><span class="o">=</span><span class="mi">800</span><span class="p">,</span>
    <span class="n">confidence_threshold</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">pil_image</span> <span class="o">=</span> <span class="n">Image</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">'cats.jpg'</span><span class="p">).</span><span class="n">convert</span><span class="p">(</span><span class="s">"RGB"</span><span class="p">)</span>
<span class="c1"># convert to BGR format
</span><span class="n">image</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">pil_image</span><span class="p">)[:,</span> <span class="p">:,</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]]</span>

<span class="c1"># compute predictions
</span><span class="n">predictions</span> <span class="o">=</span> <span class="n">coco_demo</span><span class="p">.</span><span class="n">run_on_opencv_image</span><span class="p">(</span><span class="n">image</span><span class="p">)</span>

<span class="c1"># plot
</span><span class="n">f</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">'input image'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">imshow</span><span class="p">(</span><span class="n">pil_image</span><span class="p">)</span>
<span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">'segmented output'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">imshow</span><span class="p">(</span><span class="n">predictions</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"segmented_output.png"</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">'tight'</span><span class="p">)</span></code></pre></figure>

<p><img src="/img/segmentation_cat_output_instance.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>Notice that, here, both the instances of cats are segmented separately, unlike <a href="/blog/2019/08/09/quick-intro-to-semantic-segmentation">semantic segmentation</a>.</p>

<h2 id="other-instance-segmentation-models">Other Instance segmentation models</h2>

<h3 id="ms-r-cnn-mask-scoring-r-cnn">MS R-CNN (Mask Scoring R-CNN)</h3>

<p>In Mask R-CNN, the instance classification score is used as the mask quality score. However, it’s possible that due to certain factors such as background clutter, occlusion, etc., the classification score is high but the mask quality (IoU between the predicted mask and ground truth) is low. MS R-CNN adds a network that learns the quality of the predicted mask. The mask score is re-evaluated by multiplying the predicted MaskIoU and the classification score.</p>

<blockquote>
  <p>Within the Mask R-CNN framework, we implement a MaskIoU prediction network named MaskIoU head. It takes both the output of the mask head and RoI feature as input, and is trained using a simple regression loss.</p>
</blockquote>

<p>i.e. MS R-CNN = Mask R-CNN + MaskIoU head module</p>

<h3 id="yolact-you-only-look-at-coefficients">YOLACT (You Only Look At CoefficienTs)</h3>

<p>YOLACT is currently the fastest instance segmentation method; it achieves real-time results, i.e. around 30 fps.</p>

<p><img src="/img/yolact.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>It breaks the instance segmentation process into two parallel parts: it generates a set of prototype masks while predicting per-instance mask coefficients. The prototypes are then linearly combined with the mask coefficients to produce the instance masks.</p>
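
<p>The assembly step can be sketched as follows (the shapes and variable names here are illustrative assumptions, not taken from the actual YOLACT implementation):</p>

```python
import numpy as np

# Illustrative sketch of YOLACT's mask assembly: k prototype masks are
# linearly combined per instance using that instance's k mask
# coefficients, followed by a sigmoid.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h, w, k, n = 138, 138, 32, 5           # prototype size, #prototypes, #instances
prototypes = np.random.randn(h, w, k)  # output of the prototype branch
coeffs = np.random.randn(n, k)         # per-instance mask coefficients

# linear combination: (h, w, k) @ (k, n) -> (h, w, n), one mask per instance
masks = sigmoid(prototypes @ coeffs.T)
print(masks.shape)  # (138, 138, 5)
```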

<p><strong>References &amp; Further Readings:</strong></p>
<ol>
  <li><a href="https://arxiv.org/abs/1703.06870">Mask R-CNN paper</a></li>
  <li><a href="https://arxiv.org/pdf/1506.01497.pdf">Faster R-CNN paper</a></li>
  <li><a href="https://arxiv.org/pdf/1612.03144.pdf">FPN paper</a></li>
  <li><a href="https://arxiv.org/pdf/1903.00241.pdf">MS R-CNN paper</a></li>
  <li><a href="https://arxiv.org/pdf/1904.02689.pdf">YOLACT paper</a></li>
  <li><a href="https://cseweb.ucsd.edu/classes/sp18/cse252C-a/CSE252C_20180509.pdf">Mask R-CNN presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma</a></li>
  <li><a href="https://youtu.be/jHv37mKAhV4">Tutorial: Deep Learning for Objects and Scenes - Part 1 - CVPR’17</a></li>
  <li><a href="http://cs231n.stanford.edu/">CS231n: Convolutional Neural Networks for Visual Recognition (image source)</a></li>
  <li><a href="http://lernapparat.de/static/artikel/pytorch-jit-android/thomas_viehmann.pytorch_jit_android_2018-12-11.pdf">Mask R-CNN image source</a></li>
  <li><a href="https://deepsense.ai/region-of-interest-pooling-explained/">RoIPool image source</a></li>
</ol>]]></content><author><name></name></author><category term="Deep Learning" /><category term="Computer Vision" /><summary type="html"><![CDATA[This is the third post in the Quick intro series: object detection (I), semantic segmentation (II).]]></summary></entry><entry><title type="html">Quick intro to semantic segmentation: FCN, U-Net and DeepLab</title><link href="https://kharshit.github.io/blog/2019/08/09/quick-intro-to-semantic-segmentation" rel="alternate" type="text/html" title="Quick intro to semantic segmentation: FCN, U-Net and DeepLab" /><published>2019-08-09T00:00:00+00:00</published><updated>2019-08-09T00:00:00+00:00</updated><id>https://kharshit.github.io/blog/2019/08/09/quick-intro-to-semantic-segmentation</id><content type="html" xml:base="https://kharshit.github.io/blog/2019/08/09/quick-intro-to-semantic-segmentation"><![CDATA[<p>Suppose you have an image containing cats, and you want to classify every pixel of the image as cat or background. This process is called semantic segmentation.</p>

<p>One of the ways to do so is to use a <strong>Fully Convolutional Network (FCN)</strong>, i.e. you stack a bunch of convolutional layers in an encoder-decoder fashion. The encoder downsamples the image using strided convolutions, giving a compressed feature representation of the image, and the decoder upsamples it using methods like transpose convolution to give the segmented output <em>(<a href="/blog/2019/02/15/autoencoder-downsampling-and-upsampling">Read more about downsampling and upsampling</a>)</em>.</p>

<p><img src="/img/segmentation_fcn.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>The fully connected (fc) layers of a convolutional neural network require a fixed-size input. Thus, if your model is trained on an image size of <code class="language-plaintext highlighter-rouge">224x224</code>, an input image of size <code class="language-plaintext highlighter-rouge">227x227</code> will throw an error. The solution, as adopted in FCN, is to <a href="/blog/2019/08/02/converting-fc-layers-to-conv-layers">replace fc layers with <code class="language-plaintext highlighter-rouge">1x1</code> conv layers</a>. Thus, FCN can perform semantic segmentation on images of any size.</p>

<p>In FCN, the <em>skip connections</em> from the earlier layers are also utilized to reconstruct accurate segmentation boundaries by recovering relevant features that would otherwise be lost during downsampling.</p>

<blockquote>
  <p>Semantic segmentation faces an inherent tension between semantics and location: global information resolves <em>what</em> while local information resolves <em>where</em>… Combining fine layers and coarse layers <em>(by using skip connections)</em> lets the model make local predictions that respect global structure.</p>
</blockquote>

<h2 id="u-net">U-Net</h2>

<p>U-Net builds upon the concept of the FCN. Its architecture, similar to the above encoder-decoder architecture, can be divided into three parts:</p>

<ul>
  <li>The <strong>contracting or downsampling path</strong> consists of 4 blocks where each block applies two <code class="language-plaintext highlighter-rouge">3x3</code> convolutions (<code class="language-plaintext highlighter-rouge">+</code>batch norm) followed by <code class="language-plaintext highlighter-rouge">2x2</code> max-pooling. The number of feature maps is doubled after each block as <code class="language-plaintext highlighter-rouge">64 -&gt; 128 -&gt; 256</code> and so on.</li>
  <li>The horizontal <strong>bottleneck</strong> consists of two <code class="language-plaintext highlighter-rouge">3x3</code> convolutions followed by a <code class="language-plaintext highlighter-rouge">2x2</code> up-convolution.</li>
  <li>The <strong>expanding or upsampling path</strong>, complementary to the contracting path, also consists of 4 blocks, where each block consists of two <code class="language-plaintext highlighter-rouge">3x3</code> convolutions followed by <code class="language-plaintext highlighter-rouge">2x2</code> upsampling (transpose convolution). The number of feature maps is halved after every block.</li>
</ul>
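
<p>A contracting-path block from the list above can be sketched in PyTorch as follows (the function name is ours, and <code class="language-plaintext highlighter-rouge">padding=1</code> keeps the spatial size, unlike the unpadded convolutions of the original paper):</p>

```python
import torch
import torch.nn as nn

# Sketch of one U-Net contracting-path block: two 3x3 convolutions
# (+ batch norm and ReLU) followed by 2x2 max-pooling, doubling the
# number of feature maps while halving the spatial resolution.
def down_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

x = torch.randn(1, 64, 128, 128)
y = down_block(64, 128)(x)
print(y.shape)  # torch.Size([1, 128, 64, 64])
```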

<p>Pretrained models such as resnet18 can be used as the contracting (left) part of the model.</p>

<p><img src="/img/segmentation_unet.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>U-Net also has skip connections in order to localize, as shown in white. The upsampled output is concatenated with the corresponding cropped <em>(cropped due to the loss of border pixels in every convolution)</em> feature maps from the contracting path <em>(the features learned during downsampling are used during upsampling)</em>.</p>

<p>Finally, the resultant output passes through a <code class="language-plaintext highlighter-rouge">1x1</code> conv layer to produce the segmented output, where the number of feature maps is equal to the number of segments desired.</p>

<h2 id="deeplab">DeepLab</h2>

<p>DeepLab is a state-of-the-art semantic segmentation model with an encoder-decoder architecture. The encoder, a pretrained CNN, extracts encoded feature maps of the input image, and the decoder reconstructs the output from this essential information using upsampling.</p>

<p>To understand the DeepLab architecture, let’s go through its fundamental building blocks one by one.</p>

<h3 id="spatial-pyramid-pooling">Spatial Pyramid Pooling</h3>

<p>In order to deal with different input image sizes, fc layers can be replaced by <code class="language-plaintext highlighter-rouge">1x1</code> conv layers as in the case of FCN. But we also want the model to be robust to input images of different sizes. The solution is to train the model on various scales of the input image to capture multi-scale contextual information.</p>

<p><img src="/img/segmentation_spp.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>Usually, a single pooling layer is used between the last conv layer and the fc layer. DeepLab instead utilizes multiple pooling layers, a technique called Spatial Pyramid Pooling (SPP), to deal with multi-scale images. SPP divides the feature maps from the last conv layer into a fixed number of spatial bins whose size is proportional to the image size. Each bin pools the feature maps at a different scale, as shown in the figure. The output of SPP is a fixed-size vector <code class="language-plaintext highlighter-rouge">FxB</code>, where <code class="language-plaintext highlighter-rouge">F</code> is the number of filters (feature maps) in the last conv layer, and <code class="language-plaintext highlighter-rouge">B</code> is the fixed number of bins. The different output vectors (<code class="language-plaintext highlighter-rouge">16x256-d, 4x256-d, 1x256-d</code>) are concatenated to form a fixed <code class="language-plaintext highlighter-rouge">(4x4+2x2+1)x256=5376</code> dimensional vector, which is fed into the fc layer.</p>
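
<p>SPP can be sketched with adaptive pooling, which chooses the pooling window so that each bin's size is proportional to the input (the function name and bin sizes are from the 4x4/2x2/1x1 example above):</p>

```python
import torch
import torch.nn.functional as F

# Sketch of spatial pyramid pooling with 4x4, 2x2 and 1x1 bins: adaptive
# max-pooling makes each bin's size proportional to the input, so the
# flattened output length is fixed regardless of feature-map resolution.
def spp(feature_map, bin_sizes=(4, 2, 1)):
    pooled = [F.adaptive_max_pool2d(feature_map, b).flatten(1)
              for b in bin_sizes]
    return torch.cat(pooled, dim=1)

for size in (13, 17):                     # two different input resolutions
    fm = torch.randn(1, 256, size, size)  # 256 feature maps from the last conv
    out = spp(fm)
    print(out.shape)  # torch.Size([1, 5376]) = (16 + 4 + 1) x 256 both times
```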

<p>A drawback of SPP is that it increases the computational complexity of the model; atrous convolution offers a way around this.</p>

<h3 id="dilated-or-atrous-convolutions">Dilated or atrous convolutions</h3>

<p>Unlike normal convolution, dilated or atrous convolution has one more parameter, the dilation or atrous rate <em>r</em>, which defines the spacing between the values in a kernel. A dilation rate of 1 corresponds to normal convolution. DeepLab uses atrous rates of 6, 12 and 18.</p>

<div style="text-align:center">
<img src="/img/segmentation_conv.gif" style="margin: auto; width: auto; max-width: 100%;" />
<img src="/img/segmentation_dilation_conv.gif" style="margin: auto; width: auto; max-width: 100%;" />
</div>

<p>The benefit of this type of convolution is that it enlarges the field of view of the filters to incorporate larger context without increasing the number of parameters.</p>
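
<p>In PyTorch, the atrous rate maps directly to the <code class="language-plaintext highlighter-rouge">dilation</code> argument of <code class="language-plaintext highlighter-rouge">nn.Conv2d</code>; a short sketch:</p>

```python
import torch
import torch.nn as nn

# Atrous convolution sketch: the same 3x3 kernel (9 weights per channel
# pair) covers a (2r+1)x(2r+1) window at dilation rate r. Setting padding
# equal to the rate keeps the output's spatial size unchanged.
x = torch.randn(1, 256, 32, 32)
for rate in (1, 6, 12, 18):  # the rates used by DeepLab (1 = normal conv)
    conv = nn.Conv2d(256, 256, kernel_size=3, dilation=rate, padding=rate)
    out = conv(x)
    # same output shape and same parameter count at every rate
    print(rate, tuple(out.shape), conv.weight.numel())
```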

<p>Deeplab uses atrous convolution with SPP called <strong>Atrous Spatial Pyramid Pooling (ASPP)</strong>. In DeepLabv3+, depthwise separable convolutions are applied to both ASPP and decoder modules.</p>

<h3 id="depthwise-separable-convolutions">Depthwise separable convolutions</h3>

<p>Suppose you have an input RGB image of size <code class="language-plaintext highlighter-rouge">12x12x3</code>. A normal convolution using a <code class="language-plaintext highlighter-rouge">5x5x3</code> filter, without padding and with stride of <code class="language-plaintext highlighter-rouge">1</code>, gives an output of size <code class="language-plaintext highlighter-rouge">8x8x1</code>. In order to increase the number of channels (e.g. to get an output of <code class="language-plaintext highlighter-rouge">8x8x256</code>), you’ll have to use <code class="language-plaintext highlighter-rouge">256</code> filters to create <code class="language-plaintext highlighter-rouge">256 8x8x1</code> outputs and stack them together to get the <code class="language-plaintext highlighter-rouge">8x8x256</code> output, i.e. <code class="language-plaintext highlighter-rouge">12x12x3 — (5x5x3x256) —&gt; 8x8x256</code>. This whole operation costs <code class="language-plaintext highlighter-rouge">256x5x5x3x8x8=1,228,800</code> multiplications.</p>

<p>The depthwise separable convolution dissolves the above into two steps:</p>

<ul>
  <li>In <strong>depthwise convolution</strong>, the convolution operation is performed separately for each channel using three <code class="language-plaintext highlighter-rouge">5x5x1</code> filters, stacking whose outputs gives an <code class="language-plaintext highlighter-rouge">8x8x3</code> image.</li>
  <li>The <strong>pointwise convolution</strong> is used to increase the depth, number of channels, by taking convolution of <code class="language-plaintext highlighter-rouge">256 1x1x3</code> filters with the <code class="language-plaintext highlighter-rouge">8x8x3</code> image, where each filter gives <code class="language-plaintext highlighter-rouge">8x8x1</code> image which are stacked together to get <code class="language-plaintext highlighter-rouge">8x8x256</code> desired output image.</li>
</ul>

<p>The process can be described as <code class="language-plaintext highlighter-rouge">12x12x3 — (5x5x1x1) —&gt; 8x8x3 — (1x1x3x256) —&gt; 8x8x256</code>. This whole operation takes <code class="language-plaintext highlighter-rouge">3x5x5x8x8 + 256x1x1x3x8x8 = 53,952</code> multiplications, far fewer than the 1,228,800 of the normal convolution.</p>
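
<p>The same two steps can be sketched in PyTorch, where a depthwise convolution is a grouped convolution with <code class="language-plaintext highlighter-rouge">groups</code> equal to the number of input channels:</p>

```python
import torch
import torch.nn as nn

# Sketch of the two-step factorization above: a depthwise 5x5 conv
# (groups = in_channels, one filter per channel) followed by a pointwise
# 1x1 conv that expands 3 channels to 256.
depthwise = nn.Conv2d(3, 3, kernel_size=5, groups=3, bias=False)
pointwise = nn.Conv2d(3, 256, kernel_size=1, bias=False)

x = torch.randn(1, 3, 12, 12)
print(pointwise(depthwise(x)).shape)  # torch.Size([1, 256, 8, 8])
print(depthwise.weight.numel())       # 3 x 5 x 5 = 75 weights
print(pointwise.weight.numel())       # 256 x 1 x 1 x 3 = 768 weights
# multiplications: (75 + 768) x 8 x 8 = 53,952, matching the text,
# vs 1,228,800 for the normal convolution
```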

<p><img src="/img/segmentation_deeplab.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>DeepLabv3+ uses xception (pointwise conv is followed by depthwise conv) as the feature extractor in the encoder portion. The depthwise separable convolutions are applied in place of max-pooling. The encoder uses output stride of 16, while in decoder, the encoded features by the encoder are first upsampled by 4, then concatenated with corresponding features from the encoder, then upsampled again to give output segmentation map.</p>

<p>Let’s test the DeepLabv3 model in PyTorch; it uses resnet101 as its backbone and is pretrained on the MS COCO dataset.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">torchvision</span> <span class="kn">import</span> <span class="n">transforms</span>
<span class="kn">import</span> <span class="nn">PIL.Image</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>

<span class="c1"># load deeplab
</span><span class="n">model</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">hub</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="s">'pytorch/vision'</span><span class="p">,</span> <span class="s">'deeplabv3_resnet101'</span><span class="p">,</span> <span class="n">pretrained</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>

<span class="c1"># load the input image and preprocess
</span><span class="n">input_image</span> <span class="o">=</span> <span class="n">PIL</span><span class="p">.</span><span class="n">Image</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">'image.png'</span><span class="p">)</span>
<span class="n">preprocess</span> <span class="o">=</span> <span class="n">transforms</span><span class="p">.</span><span class="n">Compose</span><span class="p">([</span>
    <span class="n">transforms</span><span class="p">.</span><span class="n">ToTensor</span><span class="p">(),</span>
    <span class="n">transforms</span><span class="p">.</span><span class="n">Normalize</span><span class="p">(</span><span class="n">mean</span><span class="o">=</span><span class="p">[</span><span class="mf">0.485</span><span class="p">,</span> <span class="mf">0.456</span><span class="p">,</span> <span class="mf">0.406</span><span class="p">],</span> <span class="n">std</span><span class="o">=</span><span class="p">[</span><span class="mf">0.229</span><span class="p">,</span> <span class="mf">0.224</span><span class="p">,</span> <span class="mf">0.225</span><span class="p">]),</span>
<span class="p">])</span>

<span class="n">input_tensor</span> <span class="o">=</span> <span class="n">preprocess</span><span class="p">(</span><span class="n">input_image</span><span class="p">)</span>
<span class="n">input_batch</span> <span class="o">=</span> <span class="n">input_tensor</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> 

<span class="c1"># move the input and model to GPU if available
</span><span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">():</span>
    <span class="n">input_batch</span> <span class="o">=</span> <span class="n">input_batch</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="s">'cuda'</span><span class="p">)</span>
    <span class="n">model</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="s">'cuda'</span><span class="p">)</span>

<span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
    <span class="n">output</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">input_batch</span><span class="p">)[</span><span class="s">'out'</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="n">output_predictions</span> <span class="o">=</span> <span class="n">output</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

<span class="c1"># create a color pallette, selecting a color for each class
</span><span class="n">palette</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">2</span> <span class="o">**</span> <span class="mi">25</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">15</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">21</span> <span class="o">-</span> <span class="mi">1</span><span class="p">])</span>
<span class="n">colors</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">as_tensor</span><span class="p">([</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">21</span><span class="p">)])[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="o">*</span> <span class="n">palette</span>
<span class="n">colors</span> <span class="o">=</span> <span class="p">(</span><span class="n">colors</span> <span class="o">%</span> <span class="mi">255</span><span class="p">).</span><span class="n">numpy</span><span class="p">().</span><span class="n">astype</span><span class="p">(</span><span class="s">"uint8"</span><span class="p">)</span>

<span class="c1"># plot the semantic segmentation predictions
</span><span class="n">r</span> <span class="o">=</span> <span class="n">PIL</span><span class="p">.</span><span class="n">Image</span><span class="p">.</span><span class="n">fromarray</span><span class="p">(</span><span class="n">output_predictions</span><span class="p">.</span><span class="n">byte</span><span class="p">().</span><span class="n">cpu</span><span class="p">().</span><span class="n">numpy</span><span class="p">()).</span><span class="n">resize</span><span class="p">(</span><span class="n">input_image</span><span class="p">.</span><span class="n">size</span><span class="p">)</span>
<span class="n">r</span><span class="p">.</span><span class="n">putpalette</span><span class="p">(</span><span class="n">colors</span><span class="p">)</span>

<span class="n">f</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">'input image'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">imshow</span><span class="p">(</span><span class="n">input_image</span><span class="p">)</span>
<span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">'segmented output'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">imshow</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"segmented_output.png"</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">'tight'</span><span class="p">)</span>
<span class="c1"># plt.show()</span></code></pre></figure>

<p><img src="/img/segmentation_cat_output.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p><strong>References:</strong></p>

<ol>
  <li><a href="https://arxiv.org/abs/1411.4038">Fully Convolutional Networks for Semantic Segmentation</a></li>
  <li><a href="https://arxiv.org/abs/1505.04597.pdf">U-Net: Convolutional Networks for BiomedicalImage Segmentation</a></li>
  <li><a href="https://github.com/vdumoulin/conv_arithmetic">Convolution arithmetic</a></li>
  <li><a href="https://arxiv.org/abs/1406.4729">Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition</a></li>
  <li><a href="https://github.com/tensorflow/models/tree/master/research/deeplab">DeepLab: Deep Labelling for Semantic Image Segmentation</a></li>
  <li><a href="https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728">A Basic Introduction to Separable Convolutions</a></li>
</ol>]]></content><author><name></name></author><category term="Deep Learning" /><category term="Computer Vision" /><summary type="html"><![CDATA[Suppose you have an image containing cats, and you want to classify every pixel of the image as cat or background. This process is called semantic segmentation.]]></summary></entry><entry><title type="html">Converting FC layers to CONV layers</title><link href="https://kharshit.github.io/blog/2019/08/02/converting-fc-layers-to-conv-layers" rel="alternate" type="text/html" title="Converting FC layers to CONV layers" /><published>2019-08-02T00:00:00+00:00</published><updated>2019-08-02T00:00:00+00:00</updated><id>https://kharshit.github.io/blog/2019/08/02/converting-fc-layers-to-conv-layers</id><content type="html" xml:base="https://kharshit.github.io/blog/2019/08/02/converting-fc-layers-to-conv-layers"><![CDATA[<blockquote>
  <p>It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters.</p>
</blockquote>

<p>Suppose the 7x7x512 activation volume output of a conv layer is fed into a 4096-sized fc layer. This fc layer can be replaced with a conv layer having 4096 filters (kernels) of size 7x7x512, where each filter gives a 1x1x1 output; concatenated, these give an output of 1x1x4096, which is equal to what we get from the fc layer.</p>
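
<p>This equivalence can be checked numerically; the sketch below uses a smaller output width (256 instead of 4096, to keep the example light), copying the fc weights into the conv kernels and comparing the outputs.</p>

```python
import torch
import torch.nn as nn

# Sketch of the fc -> conv equivalence described above, with 256 outputs
# instead of 4096 to keep the example light.
fc = nn.Linear(512 * 7 * 7, 256)
conv = nn.Conv2d(512, 256, kernel_size=7)

# reshape the fc weight matrix (256, 25088) into 256 kernels of size 7x7x512;
# the flatten order of (C, H, W) matches the conv kernel layout
conv.weight.data = fc.weight.data.view(256, 512, 7, 7)
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 7, 7)
out_fc = fc(x.flatten(1))        # shape (1, 256)
out_conv = conv(x).flatten(1)    # shape (1, 256), from a (1, 256, 1, 1) map
print(torch.allclose(out_fc, out_conv, atol=1e-4))  # True
```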

<p>As a general rule, replace a <code class="language-plaintext highlighter-rouge">K</code>-sized fc layer with a conv layer having <code class="language-plaintext highlighter-rouge">K</code> filters, each the same size as the input to the fc layer.<br />
For example, if a <code class="language-plaintext highlighter-rouge">conv1</code> layer outputs an <code class="language-plaintext highlighter-rouge">HxWxC</code> volume that is fed to a <code class="language-plaintext highlighter-rouge">K</code>-sized <code class="language-plaintext highlighter-rouge">fc</code> layer, then the <code class="language-plaintext highlighter-rouge">fc</code> layer can be replaced with a <code class="language-plaintext highlighter-rouge">conv2</code> layer having <code class="language-plaintext highlighter-rouge">K</code> filters of size <code class="language-plaintext highlighter-rouge">HxW</code> (each spanning all <code class="language-plaintext highlighter-rouge">C</code> input channels). In PyTorch, it’d be</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="n">in_channels</span><span class="p">,</span> <span class="n">out_channels</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="p">...)</span></code></pre></figure>

<p>Before:<br />
<em><code class="language-plaintext highlighter-rouge">nn.Conv2d(...)</code></em><br />
image dim: 7x7x512<br />
<em><code class="language-plaintext highlighter-rouge">nn.Linear(512 * 7 * 7, 4096)</code></em><br />
<em><code class="language-plaintext highlighter-rouge">nn.Linear(4096, 1000)</code></em></p>

<p>After:<br />
<em><code class="language-plaintext highlighter-rouge">nn.Conv2d(...)</code></em><br />
image dim: 7x7x512<br />
<em><code class="language-plaintext highlighter-rouge">nn.Conv2d(512, 4096, 7)</code></em><br />
image dim: 1x1x4096<br />
<em><code class="language-plaintext highlighter-rouge">nn.Conv2d(4096, 1000, 1)</code></em><br />
image dim: 1x1x1000</p>
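
<p>To make the equivalence concrete, here is a minimal PyTorch sketch (not from the original post; the variable names and tolerance are my own) that copies the parameters of the <code class="language-plaintext highlighter-rouge">nn.Linear(512 * 7 * 7, 4096)</code> layer into an <code class="language-plaintext highlighter-rouge">nn.Conv2d(512, 4096, 7)</code> layer and checks that both heads produce the same output:</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A 7x7x512 activation volume (batch of 1), as output by a conv backbone
x = torch.randn(1, 512, 7, 7)

# FC head: flatten the volume, then a 4096-unit linear layer
fc = nn.Linear(512 * 7 * 7, 4096)

# Equivalent conv layer: 4096 filters, each of size 7x7x512
conv = nn.Conv2d(512, 4096, kernel_size=7)

# Reuse the fc parameters: same weights, just reshaped into conv kernels
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(4096, 512, 7, 7))
    conv.bias.copy_(fc.bias)

out_fc = fc(x.flatten(1))        # shape: (1, 4096)
out_conv = conv(x).flatten(1)    # shape: (1, 4096), from a 1x1x4096 volume

print(torch.allclose(out_fc, out_conv, atol=1e-4))  # True
```

<p>The reshape works because <code class="language-plaintext highlighter-rouge">flatten(1)</code> lays the input out in channel, height, width order, which is exactly how <code class="language-plaintext highlighter-rouge">Conv2d</code> indexes its kernel weights.</p>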

<p>Using the above reasoning, you’d notice that all the further fc layers, <em>except the first one</em>, require <code class="language-plaintext highlighter-rouge">1x1</code> convolutions, as shown in the above example. This is because after the first converted conv layer, the feature maps are of size <code class="language-plaintext highlighter-rouge">1x1xC</code>, where <code class="language-plaintext highlighter-rouge">C</code> is the number of channels.</p>

<p><strong>References:</strong></p>
<ol>
  <li><a href="http://cs231n.github.io/convolutional-networks/#convert">CS231n</a></li>
</ol>]]></content><author><name></name></author><category term="Deep Learning" /><category term="Computer Vision" /><summary type="html"><![CDATA[It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters.]]></summary></entry><entry><title type="html">Two Years of Technical Fridays</title><link href="https://kharshit.github.io/blog/2019/07/19/two-years-of-technical-fridays" rel="alternate" type="text/html" title="Two Years of Technical Fridays" /><published>2019-07-19T00:00:00+00:00</published><updated>2019-07-19T00:00:00+00:00</updated><id>https://kharshit.github.io/blog/2019/07/19/two-years-of-technical-fridays</id><content type="html" xml:base="https://kharshit.github.io/blog/2019/07/19/two-years-of-technical-fridays"><![CDATA[<p><img src="/img/favicon_files/favicon-96x96.png" style="float:left; display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>It’s been two years since I started writing this blog (see <a href="/blog/2017/07/21/technical-fridays">Technical Fridays</a> and <a href="/blog/2018/07/20/a-year-of-fridays">A Year of Fridays</a>).</p>

<p>In the last year (July 20, 2018 - July 19, 2019), the site had 10,099 users from all over the world. That’s an incredible achievement. Thank you all :)</p>

<p><img src="/img/kHarshit.github.io_Analytics_world_18_19.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p>

<p>For the past few months, I’ve been working mainly in the field of <a href="/categories/#computer-vision">Computer Vision</a>, so I expect to write more blog posts related to it. Once again, thank you to all the readers, it has been an incredible journey so far, and I hope to continue writing on some of the amazing topics in the future.</p>

<p>Regards,<br />
Harshit</p>]]></content><author><name></name></author><category term="Personal" /><summary type="html"><![CDATA[]]></summary></entry></feed>