Brian E. Zhang — https://atomicapple0.github.io/feed.xml

Notes on Optimizing Compilers
2024-09-15 — https://atomicapple0.github.io/posts/2024/optimizing-compilers

Notes from 15-745: Optimizing Compilers.

Lecture 1: Overview of Optimizations

Three-address code

A := B op C
A := unaryop B
A := B
GOTO s
IF A relop B GOTO s
CALL f
RETURN

Basic block

  • sequence of 3-addr statements
  • only first statement can be reached from outside the block (no branches into middle)
  • all statements are executed consecutively if first one is executed (no branches out or halts except maybe at end of block)
  • are maximal
  • optimizations within a basic block are local optimizations

Flow graph

  • Nodes: basic blocks
  • Edges: \(b_i \rightarrow b_j\) if there is a branch from \(b_i\) to \(b_j\). Also works if \(b_j\) physically follows \(b_i\) which does not end in an unconditional goto.

Optimization types:

  • local: within bb – across instrs
  • global: within a flow graph – across bbs
  • interprocedural analysis: within a program – across procedures (flow graphs)

Local

  • common subexpression elimination
  • constant folding or elimination
  • dead code elimination

Global

  • Global versions of local
    • global common subexpression elimination
    • global constant propagation
    • dead code elimination
  • Loop opts
    • reduce code to be executed in each iter
    • code motion
    • induction var elimination
  • Other control structures
    • Code hoisting: eliminate copies of identical code on parallel paths in the flow graph to reduce code size

Induction Variable Elimination

  • Loop indices are induction variables
  • Linear functions of loop indices are also induction variables
  • Analysis: detection of induction variables
  • opts
    • strength reduction: replace mult with add
    • elim loop index: replace termination by tests on other induction vars
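
As a sketch of strength reduction (a hypothetical Python example, not from the lecture): the derived induction variable `addr = i * stride` is recomputed with a multiply each iteration; the optimized form bumps it with an add instead.

```python
# Before: each iteration computes the derived induction variable
# addr = i * stride with a multiply.
def sum_before(A, n, stride=4):
    total = 0
    for i in range(n):
        addr = i * stride      # multiply every iteration
        total += A[addr]
    return total

# After strength reduction: addr becomes its own induction variable,
# bumped by stride each iteration, so the multiply becomes an add.
def sum_after(A, n, stride=4):
    total = 0
    addr = 0
    for i in range(n):
        total += A[addr]
        addr += stride         # add instead of multiply
    return total
```

Both versions visit the same addresses; only the cost of computing them changes.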

Loop invariant code motion

  • computation is done within loop and result is same as long as we keep going around the loop
  • move computation outside of loop
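
A before/after sketch of loop-invariant code motion in Python (hypothetical example): `a * b` does not change across iterations, so it can be hoisted above the loop.

```python
# Before: a * b is loop-invariant but recomputed every iteration.
def before(xs, a, b):
    out = []
    for x in xs:
        limit = a * b          # invariant computation inside the loop
        out.append(x + limit)
    return out

# After: the invariant computation is moved outside the loop.
def after(xs, a, b):
    limit = a * b              # hoisted
    out = []
    for x in xs:
        out.append(x + limit)
    return out
```

(Hoisting is only safe when the moved computation cannot fault and its operands are not redefined inside the loop.)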

Machine dependent optimizations

  • regalloc
  • instr schd
  • memory hierarchy optz

Lecture 2: Local Optimizations

Outline

  • bb/flow graphs
  • abstr 1: dag
  • abstr 2: value numbering
  • phi in ssa

Partitioning into BBs

  • identify the leader of each bb:
    • first instr
    • target of a jump
    • any instr after a jump
  • a bb starts at a leader & ends at the instr immediately before the next leader, or at the last instr
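
The leader rules above can be sketched in Python (the instruction encoding is made up for illustration):

```python
# Instructions are dicts; jumps carry the index of their target.
def find_leaders(instrs):
    leaders = {0}  # rule 1: the first instruction is a leader
    for i, ins in enumerate(instrs):
        if ins["op"] in ("goto", "if"):
            leaders.add(ins["target"])   # rule 2: target of a jump
            if i + 1 < len(instrs):
                leaders.add(i + 1)       # rule 3: instruction after a jump
    return sorted(leaders)

def basic_blocks(instrs):
    ls = find_leaders(instrs)
    # each block runs from its leader up to (not including) the next leader
    return [instrs[a:b] for a, b in zip(ls, ls[1:] + [len(instrs)])]
```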

Common subexpression elimination

  • array expressions
  • field access in records
  • access to parameters

CSE cont:

  • consider the parse tree for a + a*(b-c) + (b-c)*d
  • notice that subtrees b-c and a are duplicated
  • turn parse tree to expression DAG

DAG bad:

  • worked for one statement but not so good for multiple statements

Alternative: Value Numbering Scheme

  • var2value where each value has its own number
  • common subexpression means same value number
  • r1 + r2 => var2value(r1) + var2value(r2)
VAR2VAL : var -> value number = {}
VAL2TMP : value number -> var/temp holding that value = {}
BIN_VALUES : (value, op, value) -> value number = {}

def value_of(var):
  # first use of a variable gets a fresh value number
  if var not in VAR2VAL:
    VAR2VAL[var] = newvalue()
    VAL2TMP[VAR2VAL[var]] = var
  return VAR2VAL[var]

for dst, src1, op, src2 in instrs:
  val1 = value_of(src1)
  val2 = value_of(src2)
  if (val1, op, val2) in BIN_VALUES:
    # common subexpression: reuse the temp holding this value
    val = BIN_VALUES[val1, op, val2]
    print(f"{dst} = {VAL2TMP[val]}")
  else:
    val = newvalue()
    tmp = newtemp()
    BIN_VALUES[val1, op, val2] = val
    VAL2TMP[val] = tmp
    print(f"{tmp} = {VAL2TMP[val1]} {op} {VAL2TMP[val2]}")
    print(f"{dst} = {tmp}")
  VAR2VAL[dst] = val

phi functions in SSA

  • appears in llvm
  • copy prop
    • for a given use of X
      • are all reaching definitions of X:
        • copies from same variable: eg, X = Y
        • where Y is not redefined since that copy?
      • if so, substitute X for Y
  • Static single assignment: an IR where every variable is assigned a value at most once in the program text
  • Easy for basic block
    • For instr:
      • LHS: assign to fresh version of variable
      • RHS: use most recent version of variable
  • What about joins in CFG?
    c = 12
    if (i) {
    a = x + y
    b = a + x
    } else {
    a = b + 2
    c = y + 1
    }
    a = c + a
    

    to

    c1 = 12
    if (i) {
    a1 = x + y
    b1 = a1 + x
    } else {
    a2 = b + 2
    c2 = y + 1
    }
    a4 = c? + a?
    
  • use notational convention: phi function
    a3 = phi(a1,a2)
    c3 = phi(c1,c2)
    b2 = phi(b1,b)
    a4 = c3 + a3
    

Lecture 3: Overview of LLVM Compiler

LLVM Compiler Infrastructure

  • provides reusable components for building compilers
  • reduces time/cost
  • build static compilers, JITs, trace-based optimizers, etc.

LLVM Compiler Framework

  • e2e compilers using LLVM infra
  • C and C++ are robust and aggressive
  • Emit C code or native code

Parts

  • LLVM Virtual Instruction Set

System:

  • front end turns C/C++/Java/… into LLVM IR. One front end is Clang
  • LLVM optimizer operates on LLVM IR in a series of passes
  • Back end turns LLVM IR into machine code

Analysis passes do not change code

LLVM instruction set

  • RISC-like three address code
  • infinite virtual register set in SSA form
  • Simple, low-level control flow constructs
  • Load/store instructions with typed-pointers
for (int i = 0; i < N; i++) {
  Sum(&A[i], &P);
}
loop:   ; preds = %bb0, %loop
  %i.1 = phi i32 [ 0, %bb0 ], [ %i.2, %loop ]
  %AiAddr = getelementptr float* %A, i32 %i.1
  call void @Sum(float* %AiAddr, %pair* %P)
  %i.2 = add i32 %i.1, 1
  %exitcond = icmp eq i32 %i.1, %N
  br i1 %exitcond, label %outloop, label %loop
  • explicit dataflow through SSA
  • explicit cfg, even for exceptions
  • explicit language independent type-information
  • explicit typed pointer arithmetic
    • preserve array subscript and structure indexing

Type system:

  • Primitives: label, void, float, integer (including arbitrary bitwidth integers i1, i32, …)
  • Derived: pointer, array, structure, function
  • No high-level types

Stack allocation is explicit in LLVM

int caller() {
  int T;
  T = 4;
  ...
}

int %caller() {
  %T = alloca i32
  store i32 4, i32* %T
  ...
}

LLVM IR is almost all nested doubly-linked lists

Function *F = ...;
for (Function::iterator I = F->begin(), E = F->end(); I != E; ++I) {
  BasicBlock &BB = *I;
  ...
}
  • Modules contain functions and globals (unit of compilation/analysis/optimization)
  • Functions contain basic blocks and arguments
  • Basic blocks contain instructions
  • Instructions contain opcodes and vector of operands

Pass Manager

  • ImmutablePass: doesn’t do much
  • LoopPass
  • RegionPass: process single-entry, single-exit regions
  • ModulePass
  • CallGraphSCCPass: bottom up on the call graph
  • FunctionPass
  • BasicBlockPass: process basic blocks

Tools:

  • llvm-as: .ll (text) to .bc (binary)
  • llvm-dis: .bc to .ll
  • llvm-link: link multiple .bc files
  • llvm-prof: print profile output in human-readable form
  • opt: run one or more optz on bc

Aggregate tools:

  • bugpoint
  • clang

mostly use clang, opt, llvm-dis

Lecture 4: Dataflow Analysis

Reaching definitions

  • every assignment is a definition
  • a definition d reaches a point p if there exists a path from the point immediately following d to p such that d is not killed along that path

Live variable analysis:

  • a variable v is live at point p if the value of v is used along some path in the flow graph starting at p
  • otherwise the variable is dead

For each basic block, track uses and defs

reaching definitions vs live variables:

  • sets of definitions vs variables
  • forward vs backward
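
The backward direction can be made concrete with a small iterative solver for live variables (hypothetical Python sketch; the block/CFG encoding is made up):

```python
# Iterate to a fixed point:
#   out[b] = union of in[s] over successors s of b
#   in[b]  = use[b] ∪ (out[b] − def[b])
def liveness(blocks, use, defs, succ):
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            out = set().union(*(live_in[s] for s in succ[b])) if succ[b] else set()
            inn = use[b] | (out - defs[b])
            if out != live_out[b] or inn != live_in[b]:
                live_in[b], live_out[b] = inn, out
                changed = True
    return live_in, live_out
```

A reaching-definitions solver has the same shape but runs forward, with gen/kill sets of definitions instead of use/def sets of variables.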

Reaching:

  • consider a = b + c. at what previous statements did variable c acquire a value that can reach this statement?

Live range:

  • starting from the definition of c, go to the next definition of the variable or the end of the scope in which c exists.

Lecture 6: More on LLVM Compiler

  • almost everything is a subclass of llvm::Value
for (Function::iterator FI = func->begin(), FE = func->end(); FI != FE; ++FI) {
    for (BasicBlock::iterator BBI = FI->begin(), BBE = FI->end(); BBI != BBE; ++BBI) {
        Instruction *I = &*BBI;
        if (CallInst *CI = dyn_cast<CallInst>(I)) {
            outs() << "I'm a Call Instruction!\n";
        }
        if (UnaryInstruction *UI = dyn_cast<UnaryInstruction>(I)) {
            outs() << "I'm a Unary Instruction!\n";
        }
        ...
    }
}

Lecture 8: SSA

  • only 1 assignment per variable
  • definitions dominate uses
    • x strictly dominates w iff impossible to reach w without passing through x first

If x_i is used in x <- phi(…, x_i, …) then BB(x_i) dominates the i-th predecessor of BB(phi)

  • if x is used in y <- … x then BB(x) dominates BB(y)
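
The strict-dominance facts above rest on a dominator computation; a minimal iterative sketch in Python (hypothetical encoding, not from the lecture):

```python
# dom[n] = {n} ∪ intersection of dom[p] over predecessors p of n,
# iterated to a fixed point starting from the entry node.
def dominators(nodes, preds, entry):
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == entry:
                continue
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom
```

For a diamond CFG A → {B, C} → D, this gives dom(D) = {A, D}: neither branch alone dominates the join.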

Lecture 9: SSA style optz

constant prop

W = all defs
while W:
    S = W.pop()
    # v <- c, or v <- phi(c, c, ..., c) with all arguments the same constant c
    if S has form "v <- c" or "v <- phi(c, c, ..., c)":
        delete S
        for each stmt U that uses v:
            replace v with c in U
            W.add(U)

Lecture 10: loop motion

  • use reaching defs and dominators
Coffee Table
2024-09-12 — https://atomicapple0.github.io/posts/2024/coffee-table

I recently registered for a woodworking class at a local adult community school located at a high school in south Florida. I had some minimal experience from taking a mini course at CMU but forgot pretty much everything. Woodworking I was not offered this term, so I registered for Woodworking II. Most of the other students were fairly independent woodworkers and mainly registered for the course to have a woodworking space. As I didn’t know anything anymore, the very kind instructor recommended that I make a coffee table. This coffee table is based on Steve Ramsey’s “Sonoma Vineyard Coffee Table” guide from his Weekend Woodworker course.

Class 1:

Missed it bc not in town :(

Class 2:

Bought a bunch of wood from Home Depot. Got wood shavings all over the inside of my car. Note to self: a RAV4 just barely manages to fit 8ft boards. Since the pine common boards looked like they were rotting, I paid extra for the pine select ones.

Note to self: 1in x 4in boards (or “1x4” colloquially) are not 4in wide. They are 3.5in wide, because some width was lost when the wood was planed.

Materials:

  • 8 of 1in x 4in x 8ft pine select boards
  • 1 of 2ft x 4ft x 1/2in plywood

After lugging the wood back, I relearned how to use the miter saw:

  • set miter saw completely down. lock in place with pin
  • press tape against saw to measure length. then lock the stop block in place
  • crop 1/2 blade width off the end of the board
  • cut a bunch of identical length pieces by measuring against stop block

Now I have 16 of 1in x 4in x 15.25in boards. I glued 4 of them together along their faces to form each of the 4 table legs. For wood glue, use a brush to spread the glue evenly. Then clamp overnight. If using a metal clamp, put a piece of wood between the clamp and the workpiece to prevent denting.

Class 3:

Got to use table saw once again. That thing is terrifying.

  • spin wheel thing to set blade height to be around .25in above wood
  • lock wheel in place
  • set fence to be good distance from blade
  • for this cut, we just want to shave off a tiny bit of the misaligned glued edges
  • use featherboard to press board against fence to stop kickback!!! this thing is incredible and makes it so that my fingers are much further from blade
  • turn on saw. once stop flashing, pull red stopper to begin spin
  • position urself not right behind board in case it gets ejected backwards
  • push board through with push stick. aim force downward, forward, and towards fence.
  • make sure the table is long enough that the board doesn’t fall off the end

Then I used the miter saw to trim the other two small faces of the board, just cutting a tiny bit off.

Then, to square it, I came back to the table saw. I set the fence distance by measuring against the shorter length, then rotated the board to start cutting.

Afterwards I rotated the saw blade to 45deg to cut the wood into a trapezoidal prism.

Then I used the other table saw, which had a dado blade that cuts away larger volumes of wood. This was used to form the tenons of each leg. Since this was doing crosscuts (instead of rip cuts like before), I installed a miter gauge which guided the wood. I like this as it keeps my fingers far from the blade again.

Note that to tighten the miter gauge, you have to use an Allen key. I didn’t tighten it enough and the miter gauge sort of twisted on a cut, which caused some of the metal of the miter gauge to get cut off and propelled the shavings onto me. Funnily enough, it was good that this table saw was too old to be SawStop-compatible, as it would have braked and ruined the blade.

Also, oops, I cut some later pieces I shouldn’t have cut. It seems that preemptively cutting pieces before they are needed may not be a great idea, as the required dimensions may end up slightly different from the measurements in the guide due to imprecise cuts earlier in the project.

Optimizing CUDA Kernels
2024-09-04 — https://atomicapple0.github.io/posts/2024/cuda-kernels

Notes from “Professional CUDA C Programming” by John Cheng, Max Grossman, and Ty McKercher. Since this book is old, the information is only accurate for Fermi / Kepler generation devices.


Shared Memory Bank Conflicts

Each SM contains a small, low-latency memory pool accessible by all threads in a threadblock, known as shared memory. Shared memory is divided into 32 equally-sized memory banks. Each bank can service one 32-bit word per clock cycle. If multiple threads in the same warp access different addresses in the same bank, the bank must serialize the requests. This is known as a bank conflict.

On Kepler, the size of each bank entry can be 32 or 64 bit.

# 64-bit mode
bank index = (byte address / 8 bytes per bank) % 32 banks

To mitigate bank conflicts, we can use padding to ensure that each thread accesses a different bank.
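
The effect of padding can be checked numerically (hypothetical Python sketch, assuming 32-bit bank mode with 4-byte banks for simplicity, rather than the 64-bit mode formula above):

```python
# Bank index of element (row, col) in a row-major tile of 4-byte words.
def bank(row, col, width, elem_bytes=4, banks=32):
    byte_addr = (row * width + col) * elem_bytes
    return (byte_addr // elem_bytes) % banks

# Column access to an unpadded 32-wide tile: all 32 rows hit bank 0.
unpadded = {bank(r, 0, 32) for r in range(32)}
# With one element of padding (width 33), the rows spread over all banks.
padded = {bank(r, 0, 33) for r in range(32)}
```

With `width = 32`, `unpadded` collapses to a single bank (a 32-way conflict); with `width = 33` every row lands in a distinct bank.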

You typically use __syncthreads() to ensure that all threads in the block have finished writing to shared memory before reading from it.

Global Memory

We want our reads/writes to be:

  • aligned
  • coalesced

L1 cache line is 128-bytes.

Uncached loads that do not pass through L1 cache are performed at the granularity of memory segments (32 bytes) and not cache lines (128 bytes).

Writes do not get cached in L1. They can be cached in L2. Memory transactions are 32-byte granularity and can be one, two, or four segments at a time.

Fast Matrix Transpose

Matrix transpose is hard because reading/writing to global memory column-wise results in a lot of uncoalesced memory accesses. This will hurt your global load/store efficiency.

Assume 2D matrix. Each threadblock is responsible for a 2D tile. Each thread is responsible for 1 element in tile.

  1. Each thread reads from global memory (row-wise) and writes to shared memory (row-wise). Adjacent threads (based on threadIdx.x) read adjacent elements in the same row. These reads are likely coalesced and aligned (good). These writes to shared memory are also row-wise and do not result in bank conflicts.
  2. Use __syncthreads() to ensure all threads have finished writing to shared memory.
  3. Each thread reads from shared memory (column-wise) and writes to global memory (row-wise). Adjacent threads (based on threadIdx.x) read adjacent elements in the same column. These reads from smem may result in bank conflicts but we can fix this. These writes to global memory are row-wise so they are coalesced and aligned.

Optimizations

  • unroll the grid. now each threadblock is responsible for k tiles. each thread is responsible for k elements in tile.
  • pad the shared memory to avoid bank conflicts:
    // not padded
    __shared__ float tile[BDIM_Y][BDIM_X];
    // padded
    __shared__ float tile[BDIM_Y][BDIM_X+2];
    
  • try various grid dimensions. more threadblocks can mean more device parallelism.

Fast Reduce

  • warp shuffle trick is insane. this forgoes the need for block-level sync like __syncthreads(). This is because threads in a warp can communicate with each other without needing to sync with other warps.
    __inline__ __device__ int warpReduce(int mySum) {
      mySum += __shfl_xor(mySum, 16);
      mySum += __shfl_xor(mySum, 8);
      mySum += __shfl_xor(mySum, 4);
      mySum += __shfl_xor(mySum, 2);
      mySum += __shfl_xor(mySum, 1);
      return mySum;
    }
    
__global__ void reduceShfl(int *g_idata, int *g_odata, unsigned int n) {
    // shared memory for each warp sum
    __shared__ int smem[SMEMDIM];
    
    // boundary check
    unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx >= n) return;

    // read from global memory
    int mySum = g_idata[idx];

    // calculate lane index and warp index
    int laneIdx = threadIdx.x % warpSize;
    int warpIdx = threadIdx.x / warpSize;

    // block-wide warp reduce
    mySum = warpReduce(mySum);

    // save warp sum to shared memory
    if (laneIdx == 0) smem[warpIdx] = mySum;

    // block-wide sync
    __syncthreads();

    // last warp reduce
    mySum = (threadIdx.x < SMEMDIM) ? smem[laneIdx] : 0;
    if (warpIdx == 0) mySum = warpReduce(mySum);

    // write to global memory
    if (threadIdx.x == 0) g_odata[blockIdx.x] = mySum;
}
C++ Notes
2024-09-03 — https://atomicapple0.github.io/posts/2024/cpp-notes

I have largely evaded formally learning C++ by leveraging my knowledge of C to pattern match on existing C++ codebases, or by using Rust in my own green-field projects. Nevertheless, knowing C++ seems pretty important given that everyone still uses it and interviewers test on its concepts. I guess I cannot possibly continue on without knowing what a virtual method call is.

A previous company shipped me a copy of “A Tour of C++, Third Edition” by Bjarne Stroustrup. I’ll be dumping my notes here so I actually remember stuff.


class Vector {
public:
    Vector(int s) :elem{new double[s]}, sz{s} {}
    double& operator[](int i) { return elem[i]; }
    int size() { return sz; }
private:
    double* elem;
    int sz;
};

int main() {
    Vector v(6);
}
enum class Color { red, green, blue };
export module vector_printer;

import std;

export template<typename T> void print(std::vector<T>& v)
{
    std::cout << "{\n";
    for (const T& val : v) {
        std::cout << " " << val << '\n';
    }
    std::cout << "}\n";
}

polymorphism

using namespace std;

class base {
public:
    virtual void print() { cout << "print base class\n"; }
    void show() { cout << "show base class\n"; }
};

class derived : public base {
public:
    void print() { cout << "print derived class\n"; }
    void show() { cout << "show derived class\n"; }
};

int main()
{
    base* bptr;
    derived d;
    bptr = &d;
 
    // Virtual function, bound at runtime
    bptr->print();
 
    // Non-virtual function, bound at compile time
    bptr->show();
 
    return 0;
}
Output:
print derived class
show base class

Basically when you declare a class’s method as virtual, you are ensuring that sub-classes use their own implementation. Aka, override-able functions.
