<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://atomicapple0.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://atomicapple0.github.io/" rel="alternate" type="text/html" /><updated>2025-12-31T00:54:51-08:00</updated><id>https://atomicapple0.github.io/feed.xml</id><title type="html">Brian E. Zhang</title><subtitle>personal description</subtitle><author><name>Brian E. Zhang</name><email>bez@modular.com</email></author><entry><title type="html">Notes on Optimizing Compilers</title><link href="https://atomicapple0.github.io/posts/2024/optimizing-compilers/" rel="alternate" type="text/html" title="Notes on Optimizing Compilers" /><published>2024-09-15T00:00:00-07:00</published><updated>2024-09-15T00:00:00-07:00</updated><id>https://atomicapple0.github.io/posts/2024/optimizing-compilers</id><content type="html" xml:base="https://atomicapple0.github.io/posts/2024/optimizing-compilers/"><![CDATA[<p>Notes from <a href="https://www.cs.cmu.edu/~15745/">15-745: Optimizing Compilers</a>.</p>

<h2 id="lecture-1-overview-of-optimizations">Lecture 1: Overview of Optimizations</h2>

<p>Three-address code</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A := B op C
A := unaryop B
A := B
GOTO s
IF A relop B GOTO s
CALL f
RETURN
</code></pre></div></div>
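<p>As a quick illustrative sketch (not from the lecture; the tuple-based expression encoding and temp naming are my own), lowering a nested expression into three-address code walks the tree bottom-up, materializing each operator into a fresh temp:</p>

```python
# Hypothetical sketch: lower a nested expression into three-address code.
# Expressions are either a variable name (str) or ("op", left, right) tuples.
temps = iter(f"t{i}" for i in range(1000))

def lower(expr, code):
    """Return the name holding expr's value, appending TAC to code."""
    if isinstance(expr, str):            # a variable: already named
        return expr
    op, lhs, rhs = expr
    a = lower(lhs, code)
    b = lower(rhs, code)
    t = next(temps)                      # emits the "A := B op C" form
    code.append(f"{t} := {a} {op} {b}")
    return t

code = []
lower(("+", ("*", "a", "b"), "c"), code)     # a*b + c
```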

<p>Basic block</p>
<ul>
  <li>sequence of 3-addr statements</li>
  <li>only first statement can be reached from outside the block (no branches into middle)</li>
  <li>all statements are executed consecutively if first one is executed (no branches out or halts except maybe at end of block)</li>
  <li>basic blocks are maximal: they cannot be extended without violating the properties above</li>
  <li>optimizations within a basic block are local optimizations</li>
</ul>

<p>Flow graph</p>
<ul>
  <li>Nodes: basic blocks</li>
  <li>Edges: \(b_i \rightarrow b_j\) if there is a branch from \(b_i\) to \(b_j\). Also works if \(b_j\) physically follows \(b_i\) which does not end in an unconditional goto.</li>
</ul>

<p>Optimization types:</p>
<ul>
  <li>local: within bb – across instrs</li>
  <li>global: within a flow graph – across bbs</li>
  <li>interprocedural analysis: within a program – across procedures (flow graphs)</li>
</ul>

<p>Local</p>
<ul>
  <li>common subexpression elimination</li>
  <li>constant folding or elimination</li>
  <li>dead code elimination</li>
</ul>

<p>Global</p>
<ul>
  <li>Global versions of local
    <ul>
      <li>global common subexpression elimination</li>
      <li>global constant propagation</li>
      <li>dead code elimination</li>
    </ul>
  </li>
  <li>Loop opts
    <ul>
      <li>reduce code to be executed in each iter</li>
      <li>code motion</li>
      <li>induction var elimination</li>
    </ul>
  </li>
  <li>Other control structures
    <ul>
      <li>Code hoisting: eliminate copies of identical code on parallel paths in flow graph to reduce code size</li>
    </ul>
  </li>
</ul>

<p>Induction Variable Elimination</p>
<ul>
  <li>Loop indices are induction var</li>
  <li>Linear functions of loop indices are also induction vars</li>
  <li>Analysis: detection of induction var</li>
  <li>opts
    <ul>
      <li>strength reduction: replace mult with add</li>
      <li>elim loop index: replace termination by tests on other induction vars</li>
    </ul>
  </li>
</ul>
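<p>A tiny sketch of strength reduction (illustrative code, not from the course): the multiply by the loop index is replaced with an additive update of a derived induction variable, and both versions compute the same values:</p>

```python
# Strength reduction sketch: i * 4 (a multiply per iteration) becomes
# an additive update t += 4 of a derived induction variable.
def addrs_mul(n):
    return [i * 4 for i in range(n)]

def addrs_reduced(n):
    out, t = [], 0
    for _ in range(n):
        out.append(t)
        t += 4          # add replaces multiply
    return out
```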

<p>Loop invariant code motion</p>
<ul>
  <li>a computation inside the loop produces the same result on every iteration</li>
  <li>move the computation outside of the loop</li>
</ul>
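<p>A minimal before/after sketch of loop-invariant code motion (my own example): <code class="language-plaintext highlighter-rouge">a * b</code> does not change across iterations, so it can be computed once before the loop:</p>

```python
# LICM sketch: k = a * b is invariant, so hoist it out of the loop.
def before(xs, a, b):
    out = []
    for x in xs:
        k = a * b          # recomputed every iteration
        out.append(x + k)
    return out

def after(xs, a, b):
    k = a * b              # hoisted: computed once
    return [x + k for x in xs]
```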

<p>Machine dependent optimizations</p>
<ul>
  <li>regalloc</li>
  <li>instr schd</li>
  <li>memory hierarchy optz</li>
</ul>

<h2 id="lecture-2-local-optimizations">Lecture 2: Local Optimizations</h2>
<p>Outline</p>
<ul>
  <li>bb/flow graphs</li>
  <li>abstr 1: dag</li>
  <li>abstr 2: value numbering</li>
  <li>phi in ssa</li>
</ul>

<p>Partitionining into BBs</p>
<ul>
  <li>identify the leader of each bb.
    <ul>
      <li>first instr</li>
      <li>target of a jump</li>
      <li>any instr after a jump</li>
    </ul>
  </li>
  <li>a bb starts at a leader &amp; ends at the instr immediately before the next leader (or at the last instr)</li>
</ul>
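<p>The three leader rules can be sketched directly (illustrative encoding, not from the lecture: instructions are (opcode, jump target) pairs and targets are instruction indices):</p>

```python
def basic_blocks(instrs):
    leaders = {0}                          # rule 1: first instruction
    for i, (op, tgt) in enumerate(instrs):
        if op in ("goto", "if-goto"):
            leaders.add(tgt)               # rule 2: target of a jump
            if i + 1 < len(instrs):
                leaders.add(i + 1)         # rule 3: instruction after a jump
    starts = sorted(leaders)
    return [list(range(s, e)) for s, e in zip(starts, starts[1:] + [len(instrs)])]

blocks = basic_blocks([
    ("add", None),      # 0: leader (first instr)
    ("if-goto", 3),     # 1: leader (jump target of instr 4)
    ("add", None),      # 2: leader (follows a jump)
    ("add", None),      # 3: leader (jump target)
    ("goto", 1),        # 4
])
```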

<p>Common subexpression elimination</p>
<ul>
  <li>array expressions</li>
  <li>field access in records</li>
  <li>access to parameters</li>
</ul>

<p>CSE cont:</p>
<ul>
  <li>consider Parse Tree for <code class="language-plaintext highlighter-rouge">a+a * (b-c) + (b-c) * d</code></li>
  <li>notice that subtrees <code class="language-plaintext highlighter-rouge">b-c</code> and <code class="language-plaintext highlighter-rouge">a</code> are duplicated</li>
  <li>turn parse tree to expression DAG</li>
</ul>
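<p>Hash-consing is one way to sketch the parse-tree-to-DAG conversion (my own illustration): interning every subtree in a table makes the two copies of <code class="language-plaintext highlighter-rouge">b-c</code> collapse into one node:</p>

```python
def dag(expr, nodes):
    """Intern expr (a str leaf or (op, l, r) tuple) into the node table."""
    if isinstance(expr, str):
        key = expr
    else:
        op, l, r = expr
        key = (op, dag(l, nodes), dag(r, nodes))
    return nodes.setdefault(key, key)   # repeated subtrees map to one node

nodes = {}
# a + a*(b-c) + (b-c)*d
tree = ("+", ("+", "a", ("*", "a", ("-", "b", "c"))), ("*", ("-", "b", "c"), "d"))
dag(tree, nodes)
op_nodes = [k for k in nodes if isinstance(k, tuple)]
```

The parse tree has 6 operator nodes; the DAG has only 5 because the shared <code class="language-plaintext highlighter-rouge">b-c</code> subtree is interned once.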

<p>DAG bad:</p>
<ul>
  <li>works for one statement but not so well across multiple statements</li>
</ul>

<p>Alternative: Value Numbering Scheme</p>
<ul>
  <li>var2value where each value has its own number</li>
  <li>common subexpression means same value number</li>
  <li><code class="language-plaintext highlighter-rouge">r1 + r2 =&gt; var2value(r1) + var2value(r2)</code></li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VAR2VAL : var -&gt; value number = {}
EXPR2VAL : (value, op, value) -&gt; value number = {}
VAL2TMP : value number -&gt; var holding that value = {}

for dst, src1, op, src2 in instrs:
  val1 = VAR2VAL[src1]
  val2 = VAR2VAL[src2]
  if (val1, op, val2) in EXPR2VAL:
    # common subexpression: reuse the var that already holds this value
    val = EXPR2VAL[val1, op, val2]
    print(f"{dst} = {VAL2TMP[val]}")
  else:
    val = newvalue()
    tmp = newtemp()
    EXPR2VAL[val1, op, val2] = val
    VAL2TMP[val] = tmp
    print(f"{tmp} = {src1} {op} {src2}")
    print(f"{dst} = {tmp}")
  VAR2VAL[dst] = val
</code></pre></div></div>

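<p>A runnable sketch of local value numbering (names and encodings are illustrative): variables map to value numbers, and an expression over already-seen value numbers reuses the temp that holds it:</p>

```python
# Local value numbering over (dst, src1, op, src2) tuples.
def lvn(instrs):
    var2val, expr2val, val2tmp, out = {}, {}, {}, []
    fresh = iter(range(1000))
    def valof(v):
        return var2val.setdefault(v, next(fresh))   # fresh number for new vars
    for dst, src1, op, src2 in instrs:
        key = (valof(src1), op, valof(src2))
        if key in expr2val:                         # common subexpression
            out.append(f"{dst} = {val2tmp[expr2val[key]]}")
            var2val[dst] = expr2val[key]
        else:
            val = next(fresh)
            tmp = f"t{val}"
            expr2val[key], val2tmp[val] = val, tmp
            out.append(f"{tmp} = {src1} {op} {src2}")
            out.append(f"{dst} = {tmp}")
            var2val[dst] = val
    return out

lvn([("x", "b", "+", "c"), ("y", "b", "+", "c")])
# the second b + c is recognized and y just copies t2
```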

<p>phi functions in SSA</p>
<ul>
  <li>appears in llvm</li>
  <li>copy prop
    <ul>
      <li>for a given use of X
        <ul>
          <li>are all reaching definitions of X:
            <ul>
              <li>copies from same variable: eg, X = Y</li>
              <li>where Y is not redefined since that copy?</li>
            </ul>
          </li>
          <li>if so, substitute X for Y</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Static single assignment: IR where every variable is assigned a value at most once in the program text</li>
  <li>Easy for basic block
    <ul>
      <li>For instr:
        <ul>
          <li>LHS: assign to fresh version of variable</li>
          <li>RHS: use most recent version of variable</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>What about joins in CFG?
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>c = 12
if (i) {
a = x + y
b = a + x
} else {
a = b + 2
c = y + 1
}
a = c + a
</code></pre></div>    </div>
    <p>to</p>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>c1 = 12
if (i) {
a1 = x + y
b1 = a1 + x
} else {
a2 = b + 2
c2 = y + 1
}
a4 = c? + a?
</code></pre></div>    </div>
  </li>
  <li>use notational convention: phi function
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a3 = phi(a1,a2)
c3 = phi(c1,c2)
b2 = phi(b1,b)
a4 = c3 + a3
</code></pre></div>    </div>
  </li>
</ul>
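<p>The per-basic-block renaming rule above can be sketched as follows (illustrative encoding; straight-line code only, so no phis are needed):</p>

```python
# SSA renaming in one basic block: each definition gets a fresh version,
# each use reads the most recent version. Live-in variables keep their name.
def ssa_rename(instrs):
    version, out = {}, []
    def use(v):
        return f"{v}{version[v]}" if v in version else v
    for dst, src1, op, src2 in instrs:
        rhs = f"{use(src1)} {op} {use(src2)}"      # RHS: most recent versions
        version[dst] = version.get(dst, 0) + 1     # LHS: fresh version
        out.append(f"{dst}{version[dst]} = {rhs}")
    return out
```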

<h2 id="lecture-3-overview-of-llvm-compiler">Lecture 3: Overview of LLVM Compiler</h2>
<p>LLVM Compiler Infrastructure</p>
<ul>
  <li>provides reusable components for building compilers</li>
  <li>reduces time/cost</li>
  <li>build static compilers, JITs, trace-based optimizers, etc</li>
</ul>

<p>LLVM Compiler Framework</p>
<ul>
  <li>e2e compilers using LLVM infra</li>
  <li>C and C++ are robust and aggressive</li>
  <li>Emit C code or native code</li>
</ul>

<p>Parts</p>
<ul>
  <li>LLVM Virtual Instruction Set</li>
  <li>…</li>
</ul>

<p>System:</p>
<ul>
  <li>front end turns C/C++/Java/… into LLVM IR. One front end is Clang</li>
  <li>LLVM optimizer operates on LLVM IR in series of passes</li>
  <li>Back end turns LLVM IR into machine code</li>
</ul>

<p>Analysis passes do not change code</p>

<p>LLVM instruction set</p>
<ul>
  <li>RISC-like three address code</li>
  <li>infinite virtual register set in SSA form</li>
  <li>Simple, low-level control flow constructs</li>
  <li>Load/store instructions with typed-pointers</li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for (int i = 0; i &lt; N; i++) {
  Sum(&amp;A[i], &amp;P);
}
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>loop:   ; preds = %bb0, %loop
  %i.1 = phi i32 [ 0, %bb0 ], [ %i.2, %loop ]
  %AiAddr = getelementptr float* %A, i32 %i.1
  call void @Sum(float* %AiAddr, %pair* %P)
  %i.2 = add i32 %i.1, 1
  %exitcond = icmp eq i32 %i.1, %N
  br i1 %exitcond, label %outloop, label %loop
</code></pre></div></div>

<ul>
  <li>explicit dataflow through SSA</li>
  <li>explicit cfg, even for exceptions</li>
  <li>explicit language independent type-information</li>
  <li>explicit typed pointer arithmetic
    <ul>
      <li>preserve array subscript and structure indexing</li>
    </ul>
  </li>
</ul>

<p>Type system:</p>
<ul>
  <li>Primitives: label, void, float, integer (including arbitrary bitwidth integers i1, i32, …)</li>
  <li>Derived: pointer, array, structure, function</li>
  <li>No high-level types</li>
</ul>

<p>Stack allocation is explicit in LLVM</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int caller() {
  int T;
  T = 4;
  ...
}

int %caller() {
  %T = alloca i32
  store i32 4, i32* %T
  ...
}
</code></pre></div></div>

<p>LLVM IR is almost all nested doubly-linked lists</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Function &amp;F = ...
for (Function::iterator I = F.begin(); I != F.end(); ++I) {
  BasicBlock &amp;BB = *I;
  ...
}
</code></pre></div></div>

<ul>
  <li>Modules contain functions and globals (unit of compilation/analysis/optimization)</li>
  <li>Functions contain basic blocks and arguments</li>
  <li>Basic blocks contain instructions</li>
  <li>Instructions contain opcodes and vector of operands</li>
</ul>

<p>Pass Manager</p>
<ul>
  <li>ImmutablePass: doesn’t do much</li>
  <li>LoopPass</li>
  <li>RegionPass: process single-entry, single-exit regions</li>
  <li>ModulePass</li>
  <li>CallGraphSCCPass: bottom up on the call graph</li>
  <li>FunctionPass</li>
  <li>BasicBlockPass: process basic blocks</li>
</ul>

<p>Tools:</p>
<ul>
  <li>llvm-as: .ll (text) to .bc (binary)</li>
  <li>llvm-dis: .bc to .ll</li>
  <li>llvm-link: link multiple .bc files</li>
  <li>llvm-prof: print profile output in human-readable form</li>
  <li>opt: run one or more optz on bc</li>
</ul>

<p>Aggregate tools:</p>
<ul>
  <li>bugpoint</li>
  <li>clang</li>
</ul>

<p>mostly use clang, opt, llvm-dis</p>

<h2 id="lecture-4-dataflow-analysis">Lecture 4: Dataflow Analysis</h2>

<p>Reaching definitions</p>
<ul>
  <li>every assignment is a definition</li>
  <li>a definition d reaches a point p if there exists a path from the point immediately following d to p such that d is not killed along that path</li>
</ul>
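<p>The standard iterative fixpoint for reaching definitions can be sketched as follows (illustrative encoding: blocks carry GEN/KILL sets of definition ids):</p>

```python
# Forward dataflow fixpoint: IN[b] = union of OUT over preds,
# OUT[b] = GEN[b] | (IN[b] - KILL[b]); iterate until nothing changes.
def reaching(blocks, preds, gen, kill):
    IN = {b: set() for b in blocks}
    OUT = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            IN[b] = set().union(*(OUT[p] for p in preds[b]))
            new = gen[b] | (IN[b] - kill[b])
            if new != OUT[b]:
                OUT[b], changed = new, True
    return IN, OUT
```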

<p>Live variable analysis:</p>
<ul>
  <li>a variable v is live at point p if the value of v is used along some path in the flow graph starting at p</li>
  <li>otws the variable is dead</li>
</ul>

<p>For each basic block, track uses and defs</p>
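<p>Live variables are the backward mirror of reaching definitions; a sketch with per-block USE/DEF sets (illustrative encoding):</p>

```python
# Backward dataflow fixpoint: OUT[b] = union of IN over succs,
# IN[b] = USE[b] | (OUT[b] - DEF[b]).
def liveness(blocks, succs, use, defs):
    IN = {b: set() for b in blocks}
    OUT = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in reversed(blocks):
            OUT[b] = set().union(*(IN[s] for s in succs[b]))
            new = use[b] | (OUT[b] - defs[b])
            if new != IN[b]:
                IN[b], changed = new, True
    return IN, OUT
```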

<p>reaching definitions vs live variables:</p>
<ul>
  <li>sets of definitions vs variables</li>
  <li>forward vs backward</li>
  <li>…</li>
</ul>

<p>Reaching:</p>
<ul>
  <li>consider <code class="language-plaintext highlighter-rouge">a = b + c</code> at line 10. at what previous statements did variable c acquire a value that can reach line 10?</li>
</ul>

<p>Live range:</p>
<ul>
  <li>starting from the definition of c, go to the next definition of c or the end of the scope in which c exists.</li>
</ul>

<h2 id="lecture-6-more-on-llvm-compiler">Lecture 6: More on LLVM Compiler</h2>
<ul>
  <li>almost everything is a subclass of llvm::Value</li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for (Function::iterator FI = func-&gt;begin(), FE = func-&gt;end(); FI != FE; ++FI) {
    for (BasicBlock::iterator BBI = FI-&gt;begin(), BBE = FI-&gt;end(); BBI != BBE; ++BBI) {
        Instruction *I = &amp;*BBI;
        if (CallInst *CI = dyn_cast&lt;CallInst&gt;(I)) {
            outs() &lt;&lt; "I'm a Call Instruction!\n";
        }
        if (UnaryInstruction * UI = dyn_cast&lt;UnaryInstruction&gt;(I)) {
            outs() &lt;&lt; "I'm a Unary Instruction!\n";
        }
        ...
    }
}
</code></pre></div></div>

<h2 id="lecture-8-ssa">Lecture 8: SSA</h2>
<ul>
  <li>only 1 assignment per variable</li>
  <li>definitions dominate uses
    <ul>
      <li>x strictly dominates w iff impossible to reach w without passing through x first</li>
    </ul>
  </li>
</ul>

<p>If x_i is used in x &lt;- phi(…, x_i, …) then BB(x_i) dominates the ith pred of BB(PHI)</p>
<ul>
  <li>if x is used in y &lt;- … x  then BB(x) dominates BB(y)</li>
</ul>

<h2 id="lecture-9-ssa-style-optz">Lecture 9: SSA style optz</h2>
<p>constant prop</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>W = all defs
while W:
    S = W.pop()
    if S has form "v &lt;- c" or "v &lt;- phi(c, c, ..., c)":
        delete S
        for each stmt U that uses v:
            replace v with c in U
            W.add(U)
</code></pre></div></div>
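<p>A minimal runnable version of the worklist above (illustrative encoding: a def is either an int constant or a list of operand variables; the phi-of-equal-constants case and re-folding are omitted):</p>

```python
# Constant propagation sketch: pop a def v <- c, delete it, and substitute
# c into every use of v. Only the plain-constant case is handled here.
def const_prop(defs):
    work = [v for v, d in defs.items() if isinstance(d, int)]
    while work:
        v = work.pop()
        c = defs.pop(v)                     # delete the statement v <- c
        for u, d in defs.items():
            if isinstance(d, list) and v in d:
                defs[u] = [c if x == v else x for x in d]
    return defs
```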

<h2 id="lecture-10-loop-motion">Lecture 10: loop motion</h2>
<ul>
  <li>use reaching defs and dominators</li>
</ul>]]></content><author><name>Brian E. Zhang</name><email>bez@modular.com</email></author><summary type="html"><![CDATA[Notes from 15-745: Optimizing Compilers.]]></summary></entry><entry><title type="html">Coffee Table</title><link href="https://atomicapple0.github.io/posts/2024/coffee-table/" rel="alternate" type="text/html" title="Coffee Table" /><published>2024-09-12T00:00:00-07:00</published><updated>2024-09-12T00:00:00-07:00</updated><id>https://atomicapple0.github.io/posts/2024/coffee-table</id><content type="html" xml:base="https://atomicapple0.github.io/posts/2024/coffee-table/"><![CDATA[<p>I recently registered for a woodworking class at a local adult community school located at a high school in south Florida. I had some minimal experience from taking a mini course at CMU but forgot pretty much everything. Nevertheless, Woodworking I was not offered at this time so I registered for Woodworking II. Most of the other students were fairly independent woodworkers and mainly registered for the course to have a woodworking space. As I didn’t know anything anymore, the very kind instructor recommended me to make a coffee table. This coffee table is based on Steve Ramsey’s “Sonoma Vineyard Coffee Table” guide from his Weekend Woodworker course.</p>

<h2 id="class-1">Class 1:</h2>
<p>Missed it bc not in town :(</p>

<h2 id="class-2">Class 2:</h2>
<p>Bought a bunch of wood from Home Depot. Got wood shavings all over the inside of the car. Note to self that a RAV4 just barely manages to fit 8ft boards. Since the pine common boards looked like they were rotting, I paid extra for the pine select ones.</p>

<p>Note to self: 1in x 4in boards (or 1x4 colloquially) are not 4in wide. They are 3.5in wide. This is because some width was lost when the wood was planed.</p>

<p>Materials:</p>
<ul>
  <li>8 of 1in x 4in x 8ft pine select boards</li>
  <li>1 of 2ft x 4ft x 1/2in plywood</li>
</ul>

<p>After lugging the wood back, I relearned how to use the miter saw:</p>
<ul>
  <li>set miter saw completely down. lock in place with pin</li>
  <li>press tape against saw to measure length. then lock the stop block in place</li>
  <li>crop 1/2 blade width off the end of the board</li>
  <li>cut a bunch of identical length pieces by measuring against stop block</li>
</ul>

<p>Now I have 16 of 1in x 4in x 15.25in boards. I glue 4 of them together along faces to form 4 table legs. For wood glue, use a brush to spread glue evenly. Then clamp overnight. If using a metal clamp, put a piece of wood between the clamp and the wood to prevent denting.</p>

<h2 id="class-3">Class 3:</h2>
<p>Got to use table saw once again. That thing is terrifying.</p>
<ul>
  <li>spin wheel thing to set blade height to be around .25in above wood</li>
  <li>lock wheel in place</li>
  <li>set fence to be good distance from blade</li>
  <li>for this cut, we just want to shave off a tiny bit of the misaligned glued edges</li>
  <li>use featherboard to press board against fence to stop kickback!!! this thing is incredible and makes it so that my fingers are much further from blade</li>
  <li>turn on the saw. once it stops flashing, pull the red stopper to start the blade spinning</li>
  <li>position yourself not directly behind the board in case it gets ejected backwards</li>
  <li>push board through with push stick. aim force downward, forward, and towards fence.</li>
  <li>make sure the table is long enough that the board doesn’t fall off</li>
</ul>

<p>Then I used the miter saw to plane the other two small faces of the board. Just cut a tiny bit off.</p>

<p>Then to square it, I came back to the table saw. I set the fence distance using the shorter length, then rotated the board to cut each side.</p>

<p>Afterwards I rotated the saw to 45deg to cut the wood into a trapezoidal prism.</p>

<p>Then I used the other table saw that had a dado blade that cuts away larger volumes of wood. This was used to form the tenons of each leg. Since this was doing crosscuts (instead of rip cuts like before), I installed a miter gauge which guided the wood. I like this as it keeps my fingers far from the blade again.</p>

<p>Note that to tighten the miter gauge, you have to use an allen key. I didn’t tighten it enough and the miter gauge sort of twisted on a cut, which caused some of the metal of the miter gauge to get cut off and propelled the shavings onto me. Funnily enough, it was good that this table saw was not SawStop compatible due to being old, as it would have braked and ruined the blade.</p>

<p>Also, oops, I cut some later pieces I shouldn’t have cut. It seems that preemptively cutting pieces before they are needed may not be a great idea, as the required dimensions may be slightly different than the measurements in the guide due to imprecise cuts earlier on in the project.</p>

<hr />

<h3 id="shared-memory-bank-conflicts">Shared Memory Bank Conflicts</h3>
<p>Each SM contains a small, low-latency memory pool accessible by all threads in a thread block, known as shared memory. Shared memory is divided into 32 equally-sized memory banks. Each bank can service one 32-bit word per clock cycle. If multiple threads in the same warp access different words in the same bank, the bank must serialize the requests. This is known as a bank conflict.</p>

<p>On Kepler, the size of each bank entry can be 32 or 64 bit.</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// 64-bit mode
bank index = (byte address / 8 bytes per bank) % 32 banks
</code></pre></div></div>

<p>To mitigate bank conflicts, we can use padding to ensure that each thread accesses a different bank.</p>
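<p>The effect of padding is easy to check numerically (a sketch; assumes 32 banks of 4-byte words): with a 32-wide tile, a warp walking down a column hits one bank 32 times, while a 33-wide (padded) tile spreads the same accesses across all 32 banks:</p>

```python
# Bank index of element (row, col) in a row-major shared-memory tile.
def bank(row, col, width):
    return (row * width + col) % 32

unpadded = {bank(r, 0, 32) for r in range(32)}   # column 0, no padding
padded   = {bank(r, 0, 33) for r in range(32)}   # tile padded by one column
```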

<p>You typically use <code class="language-plaintext highlighter-rouge">__syncthreads()</code> to ensure that all threads in the block have finished writing to shared memory before reading from it.</p>

<h3 id="global-memory">Global Memory</h3>
<p>We want our reads/writes to be:</p>
<ul>
  <li>aligned</li>
  <li>coalesced</li>
</ul>

<p>L1 cache line is 128-bytes.</p>

<p>Uncached loads that do not pass through the L1 cache are performed at the granularity of memory segments (32 bytes) rather than cache lines (128 bytes).</p>

<p>Writes do not get cached in L1. They can be cached in L2. Memory transactions are 32-byte granularity and can be one, two, or four segments at a time.</p>

<h3 id="fast-matrix-transpose">Fast Matrix Transpose</h3>

<p>Matrix transpose is hard because reading/writing to global memory column-wise results in a lot of uncoalesced memory accesses. This will hurt your global load/store efficiency.</p>

<p>Assume 2D matrix. Each threadblock is responsible for a 2D tile. Each thread is responsible for 1 element in tile.</p>

<ol>
  <li>Each thread reads from global memory (row-wise) and writes to shared memory (row-wise). Adjacent threads (based on <code class="language-plaintext highlighter-rouge">threadIdx.x</code>) read adjacent elements in the same row. These reads are likely coalesced and aligned (good). These writes to shared memory are also row-wise and do not result in bank conflicts.</li>
  <li>Use <code class="language-plaintext highlighter-rouge">__syncthreads()</code> to ensure all threads have finished writing to shared memory.</li>
  <li>Each thread reads from shared memory (column-wise) and writes to global memory (row-wise). Adjacent threads (based on <code class="language-plaintext highlighter-rouge">threadIdx.x</code>) read adjacent elements in the same column. These reads from smem may result in bank conflicts but we can fix this. These writes to global memory are row-wise so they are coalesced and aligned.</li>
</ol>

<p>Optimizations</p>
<ul>
  <li>unroll the grid. now each threadblock is responsible for <code class="language-plaintext highlighter-rouge">k</code> tiles. each thread is responsible for <code class="language-plaintext highlighter-rouge">k</code> elements in tile.</li>
  <li>pad the shared memory to avoid bank conflicts:
    <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// not padded</span>
<span class="n">__shared__</span> <span class="kt">float</span> <span class="n">tile</span><span class="p">[</span><span class="n">BDIM_Y</span><span class="p">][</span><span class="n">BDIM_X</span><span class="p">];</span>
<span class="c1">// padded</span>
<span class="n">__shared__</span> <span class="kt">float</span> <span class="n">tile</span><span class="p">[</span><span class="n">BDIM_Y</span><span class="p">][</span><span class="n">BDIM_X</span><span class="o">+</span><span class="mi">2</span><span class="p">];</span>
</code></pre></div>    </div>
  </li>
  <li>try various grid dimensions. more threadblocks can mean more device parallelism.</li>
</ul>

<h3 id="fast-reduce">Fast Reduce</h3>
<ul>
  <li>warp shuffle trick is insane. this forgoes need for block level sync like <code class="language-plaintext highlighter-rouge">__syncthreads()</code>. This is because threads in a warp can communicate with each other without needing to sync with other warps.
    <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__inline__ __device__ int warpReduce(int mySum) {
  mySum += __shfl_xor(mySum, 16);
  mySum += __shfl_xor(mySum, 8);
  mySum += __shfl_xor(mySum, 4);
  mySum += __shfl_xor(mySum, 2);
  mySum += __shfl_xor(mySum, 1);
  return mySum;
}
</code></pre></div>    </div>
  </li>
</ul>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__global__</span> <span class="kt">void</span> <span class="nf">reduceShfl</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">g_idata</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">g_odata</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// shared memory for each warp sum</span>
    <span class="n">__shared__</span> <span class="kt">int</span> <span class="n">smem</span><span class="p">[</span><span class="n">SMEMDIM</span><span class="p">];</span>
    
    <span class="c1">// boundary check</span>
    <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">idx</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span><span class="o">*</span><span class="n">blockDim</span><span class="p">.</span><span class="n">x</span> <span class="o">+</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">idx</span> <span class="o">&gt;=</span> <span class="n">n</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>

    <span class="c1">// read from global memory</span>
    <span class="kt">int</span> <span class="n">mySum</span> <span class="o">=</span> <span class="n">g_idata</span><span class="p">[</span><span class="n">idx</span><span class="p">];</span>

    <span class="c1">// calculate lane index and warp index</span>
    <span class="kt">int</span> <span class="n">laneIdx</span> <span class="o">=</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span> <span class="o">%</span> <span class="n">warpSize</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">warpIdx</span> <span class="o">=</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span> <span class="o">/</span> <span class="n">warpSize</span><span class="p">;</span>

    <span class="c1">// block-wide warp reduce</span>
    <span class="n">mySum</span> <span class="o">=</span> <span class="n">warpReduce</span><span class="p">(</span><span class="n">mySum</span><span class="p">);</span>

    <span class="c1">// save warp sum to shared memory</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">laneIdx</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="n">smem</span><span class="p">[</span><span class="n">warpIdx</span><span class="p">]</span> <span class="o">=</span> <span class="n">mySum</span><span class="p">;</span>

    <span class="c1">// block-wide sync</span>
    <span class="n">__syncthreads</span><span class="p">();</span>

    <span class="c1">// last warp reduce</span>
    <span class="n">mySum</span> <span class="o">=</span> <span class="p">(</span><span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">SMEMDIM</span><span class="p">)</span> <span class="o">?</span> <span class="n">smem</span><span class="p">[</span><span class="n">laneIdx</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">warpIdx</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="n">mySum</span> <span class="o">=</span> <span class="n">warpReduce</span><span class="p">(</span><span class="n">mySum</span><span class="p">);</span>

    <span class="c1">// write to global memory</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="n">g_odata</span><span class="p">[</span><span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">mySum</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name>Brian E. Zhang</name><email>bez@modular.com</email></author><summary type="html"><![CDATA[Notes from “Professional CUDA C Programming” by John Cheng, Max Grossman, and Ty McKercher. Since this book is old, information is only accurate for Fermi / Kepler generation devices.]]></summary></entry><entry><title type="html">C++ Notes</title><link href="https://atomicapple0.github.io/posts/2024/cpp-notes/" rel="alternate" type="text/html" title="C++ Notes" /><published>2024-09-03T00:00:00-07:00</published><updated>2024-09-03T00:00:00-07:00</updated><id>https://atomicapple0.github.io/posts/2024/cpp-notes</id><content type="html" xml:base="https://atomicapple0.github.io/posts/2024/cpp-notes/"><![CDATA[<p>I have largely evaded formally learning C++ by leveraging my knowledge of C to pattern match on existing C++ codebases, or using Rust in my own green-field projects. Nevertheless, it seems that knowing C++ is pretty important given that everyone still uses it and test on its concepts in interviews. I guess I cannot possibly continue on without knowing what a virtual method call is.</p>

<p>A previous company shipped me a copy of “A Tour of C++, Third Edition” by Bjarne Stroustrup. I’ll be dumping my notes here so I actually remember stuff.</p>

<hr />

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Vector</span> <span class="p">{</span>
<span class="nl">public:</span>
    <span class="n">Vector</span><span class="p">(</span><span class="kt">int</span> <span class="n">s</span><span class="p">)</span> <span class="o">:</span><span class="n">elem</span><span class="p">{</span><span class="k">new</span> <span class="kt">double</span><span class="p">[</span><span class="n">s</span><span class="p">]},</span> <span class="n">sz</span><span class="p">{</span><span class="n">s</span><span class="p">}</span> <span class="p">{}</span>
    <span class="kt">double</span><span class="o">&amp;</span> <span class="k">operator</span><span class="p">[](</span><span class="kt">int</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">elem</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span>
    <span class="kt">int</span> <span class="n">size</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="n">sz</span><span class="p">;</span> <span class="p">}</span>
<span class="nl">private:</span>
    <span class="kt">double</span><span class="o">*</span> <span class="n">elem</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">sz</span><span class="p">;</span>
<span class="p">};</span>

<span class="kt">int</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">Vector</span> <span class="n">v</span><span class="p">(</span><span class="mi">6</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
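<p>One thing to note: this minimal Vector allocates with new but never frees, so it leaks. A hedged sketch (my addition, not verbatim from the book) that adds a destructor and exercises operator[]:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cassert&gt;

class Vector {
public:
    Vector(int s) : elem{new double[s]}, sz{s} {}
    ~Vector() { delete[] elem; }                  // release the heap array
    double&amp; operator[](int i) { return elem[i]; }
    int size() { return sz; }
private:
    double* elem;
    int sz;
};

int main() {
    Vector v(6);
    v[0] = 3.14;               // operator[] returns a reference, so assignment works
    assert(v.size() == 6);
    assert(v[0] == 3.14);
}                              // ~Vector runs here, freeing elem
</code></pre></div></div>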
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">enum</span> <span class="k">class</span> <span class="nc">Color</span> <span class="p">{</span> <span class="n">red</span><span class="p">,</span> <span class="n">green</span><span class="p">,</span> <span class="n">blue</span> <span class="p">};</span>
</code></pre></div></div>
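<p>Scoped enumerators live inside the enum’s scope and don’t implicitly convert to int, unlike plain C enums. A quick usage sketch (my addition):</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cassert&gt;

enum class Color { red, green, blue };

int main() {
    Color c = Color::green;            // must qualify: plain `green` is not in scope
    // int i = c;                      // error: no implicit conversion to int
    int i = static_cast&lt;int&gt;(c);       // explicit conversion is fine
    assert(i == 1);                    // enumerators still number from 0 by default
}
</code></pre></div></div>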
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">export</span> <span class="n">module</span> <span class="n">vector_printer</span><span class="p">;</span>

<span class="n">import</span> <span class="n">std</span><span class="p">;</span>

<span class="k">export</span> <span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span> <span class="kt">void</span> <span class="nf">print</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;&amp;</span> <span class="n">v</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"{</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="k">const</span> <span class="n">T</span><span class="o">&amp;</span> <span class="n">val</span> <span class="o">:</span> <span class="n">v</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">" "</span> <span class="o">&lt;&lt;</span> <span class="n">val</span> <span class="o">&lt;&lt;</span> <span class="sc">'\n'</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"}</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
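<p>For the importing side, something like the following should work, though module support and build flags still vary a lot by compiler (and import std needs a very recent toolchain), so treat this as a sketch with a hypothetical file name:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// main.cpp — consumes the module above
import vector_printer;
import std;

int main() {
    std::vector&lt;int&gt; v = {4, 8, 15};
    print(v);          // the exported template, instantiated for int
}
</code></pre></div></div>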

<p>polymorphism</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>

<span class="k">class</span> <span class="nc">base</span> <span class="p">{</span>
<span class="nl">public:</span>
    <span class="k">virtual</span> <span class="kt">void</span> <span class="n">print</span><span class="p">()</span> <span class="p">{</span> <span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"print base class</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span> <span class="p">}</span>
    <span class="kt">void</span> <span class="n">show</span><span class="p">()</span> <span class="p">{</span> <span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"show base class</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span> <span class="p">}</span>
<span class="p">};</span>

<span class="k">class</span> <span class="nc">derived</span> <span class="o">:</span> <span class="k">public</span> <span class="n">base</span> <span class="p">{</span>
<span class="nl">public:</span>
    <span class="kt">void</span> <span class="n">print</span><span class="p">()</span> <span class="p">{</span> <span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"print derived class</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span> <span class="p">}</span>
    <span class="kt">void</span> <span class="n">show</span><span class="p">()</span> <span class="p">{</span> <span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"show derived class</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span> <span class="p">}</span>
<span class="p">};</span>

<span class="kt">int</span> <span class="n">main</span><span class="p">()</span>
<span class="p">{</span>
    <span class="n">base</span><span class="o">*</span> <span class="n">bptr</span><span class="p">;</span>
    <span class="n">derived</span> <span class="n">d</span><span class="p">;</span>
    <span class="n">bptr</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">d</span><span class="p">;</span>
 
    <span class="c1">// Virtual function, binded at runtime</span>
    <span class="n">bptr</span><span class="o">-&gt;</span><span class="n">print</span><span class="p">();</span>
 
    <span class="c1">// Non-virtual function, binded at compile time</span>
    <span class="n">bptr</span><span class="o">-&gt;</span><span class="n">show</span><span class="p">();</span>
 
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% Output
print derived class
show base class
</code></pre></div></div>
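<p>Since C++11 you can also mark an overriding method with override, which makes the compiler verify that a matching virtual function exists in the base class, so a typo becomes a compile error instead of a silently-hidden new method. A small sketch of my own (using strings so the result is easy to check):</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cassert&gt;
#include &lt;string&gt;

class base {
public:
    virtual std::string name() { return "base"; }
    virtual ~base() = default;    // virtual destructor: safe deletion through base*
};

class derived : public base {
public:
    std::string name() override { return "derived"; }   // compiler checks base::name is virtual
    // std::string nmae() override { return ""; }       // typo: would fail to compile
};

int main() {
    derived d;
    base* b = &amp;d;
    assert(b-&gt;name() == "derived");   // virtual dispatch picks the derived implementation
}
</code></pre></div></div>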

<p>Basically, when you declare a class’s method as virtual, calls through a base-class pointer or reference dispatch to the sub-class’s own implementation at runtime — virtual methods are the override-able ones. Non-virtual calls are resolved at compile time from the pointer’s static type, which is why show() above prints the base version even though bptr points at a derived object.</p>]]></content><author><name>Brian E. Zhang</name><email>bez@modular.com</email></author><summary type="html"><![CDATA[I have largely evaded formally learning C++ by leveraging my knowledge of C to pattern-match on existing C++ codebases, or using Rust in my own green-field projects. Nevertheless, it seems that knowing C++ is pretty important given that everyone still uses it and interviews test its concepts. I guess I cannot possibly continue on without knowing what a virtual method call is.]]></summary></entry></feed>