Brian E. Zhang — https://atomicapple0.github.io/feed.xml

Notes on Optimizing Compilers
2024-09-15 — https://atomicapple0.github.io/posts/2024/optimizing-compilers

Notes from 15-745: Optimizing Compilers.

Lecture 1: Overview of Optimizations

Three-address code

A := B op C
A := unaryop B
A := B
GOTO s
IF A relop B GOTO s
CALL f
RETURN

Basic block

  • sequence of 3-addr statements
  • only first statement can be reached from outside the block (no branches into middle)
  • all statements are executed consecutively if first one is executed (no branches out or halts except maybe at end of block)
  • are maximal
  • optimizations within a basic block are local optimizations

Flow graph

  • Nodes: basic blocks
  • Edges: \(b_i \rightarrow b_j\) if there is a branch from \(b_i\) to \(b_j\). Also works if \(b_j\) physically follows \(b_i\) which does not end in an unconditional goto.

Optimization types:

  • local: within bb – across instrs
  • global: within a flow graph – across bbs
  • interprocedural analysis: within a program – across procedures (flow graphs)

Local

  • common subexpression elimination
  • constant folding or elimination
  • dead code elimination

Global

  • Global versions of local
    • global common subexpression elimination
    • global constant propagation
    • dead code elimination
  • Loop opts
    • reduce code to be executed in each iter
    • code motion
    • induction var elimination
  • Other control structures
    • Code hoisting: eliminate copies of identical code on parallel paths in the flow graph to reduce code size

Induction Variable Elimination

  • Loop indices are induction variables
  • Linear functions of loop indices are also induction variables
  • Analysis: detection of induction variables
  • opts
    • strength reduction: replace mult with add
    • elim loop index: replace termination by tests on other induction vars
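
As a sketch of strength reduction (a hypothetical Python example, not from the lecture): the derived induction variable `addr = i * stride` is recomputed with a multiply each iteration; the optimized form bumps it with an add instead.

```python
# Before: each iteration computes the derived induction variable
# addr = i * stride with a multiply.
def sum_before(A, n, stride=4):
    total = 0
    for i in range(n):
        addr = i * stride      # multiply every iteration
        total += A[addr]
    return total

# After strength reduction: addr becomes its own induction variable,
# bumped by stride each iteration, so the multiply becomes an add.
def sum_after(A, n, stride=4):
    total = 0
    addr = 0
    for i in range(n):
        total += A[addr]
        addr += stride         # add instead of multiply
    return total
```

Both versions visit the same addresses; only the cost of computing them changes.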

Loop invariant code motion

  • computation is done within loop and result is same as long as we keep going around the loop
  • move computation outside of loop
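
A before/after sketch of loop-invariant code motion in Python (hypothetical example): `a * b` does not change across iterations, so it can be hoisted above the loop.

```python
# Before: a * b is loop-invariant but recomputed every iteration.
def before(xs, a, b):
    out = []
    for x in xs:
        limit = a * b          # invariant computation inside the loop
        out.append(x + limit)
    return out

# After: the invariant computation is moved outside the loop.
def after(xs, a, b):
    limit = a * b              # hoisted
    out = []
    for x in xs:
        out.append(x + limit)
    return out
```

(Hoisting is only safe when the moved computation cannot fault and its operands are not redefined inside the loop.)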

Machine dependent optimizations

  • regalloc
  • instr schd
  • memory hierarchy optz

Lecture 2: Local Optimizations

Outline

  • bb/flow graphs
  • abstr 1: dag
  • abstr 2: value numbering
  • phi in ssa

Partitioning into BBs

  • identify the leader of each bb:
    • first instr
    • target of a jump
    • any instr after a jump
  • a bb starts at a leader & ends at the instr immediately before the next leader, or at the last instr
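
The leader rules above can be sketched in Python (the instruction encoding is made up for illustration):

```python
# Instructions are dicts; jumps carry the index of their target.
def find_leaders(instrs):
    leaders = {0}  # rule 1: the first instruction is a leader
    for i, ins in enumerate(instrs):
        if ins["op"] in ("goto", "if"):
            leaders.add(ins["target"])   # rule 2: target of a jump
            if i + 1 < len(instrs):
                leaders.add(i + 1)       # rule 3: instruction after a jump
    return sorted(leaders)

def basic_blocks(instrs):
    ls = find_leaders(instrs)
    # each block runs from its leader up to (not including) the next leader
    return [instrs[a:b] for a, b in zip(ls, ls[1:] + [len(instrs)])]
```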

Common subexpression elimination

  • array expressions
  • field access in records
  • access to parameters

CSE cont:

  • consider the parse tree for a + a*(b-c) + (b-c)*d
  • notice that subtrees b-c and a are duplicated
  • turn parse tree to expression DAG

DAG bad:

  • worked for one statement but not so good for multiple statements

Alternative: Value Numbering Scheme

  • var2value where each value has its own number
  • common subexpression means same value number
  • r1 + r2 => var2value(r1) + var2value(r2)
VAR2VAL : var -> value number = {}
VAL2TMP : value number -> var/temp holding that value = {}
BIN_VALUES : (value, op, value) -> value number = {}

def value_of(var):
  # first use of a variable gets a fresh value number
  if var not in VAR2VAL:
    VAR2VAL[var] = newvalue()
    VAL2TMP[VAR2VAL[var]] = var
  return VAR2VAL[var]

for dst, src1, op, src2 in instrs:
  val1 = value_of(src1)
  val2 = value_of(src2)
  if (val1, op, val2) in BIN_VALUES:
    # common subexpression: reuse the temp holding this value
    val = BIN_VALUES[val1, op, val2]
    print(f"{dst} = {VAL2TMP[val]}")
  else:
    val = newvalue()
    tmp = newtemp()
    BIN_VALUES[val1, op, val2] = val
    VAL2TMP[val] = tmp
    print(f"{tmp} = {VAL2TMP[val1]} {op} {VAL2TMP[val2]}")
    print(f"{dst} = {tmp}")
  VAR2VAL[dst] = val

phi functions in SSA

  • appears in llvm
  • copy prop
    • for a given use of X
      • are all reaching definitions of X:
        • copies from same variable: eg, X = Y
        • where Y is not redefined since that copy?
      • if so, substitute X for Y
  • Static single assignment: an IR where every variable is assigned a value at most once in the program text
  • Easy for basic block
    • For instr:
      • LHS: assign to fresh version of variable
      • RHS: use most recent version of variable
  • What about joins in CFG?
    c = 12
    if (i) {
    a = x + y
    b = a + x
    } else {
    a = b + 2
    c = y + 1
    }
    a = c + a
    

    to

    c1 = 12
    if (i) {
    a1 = x + y
    b1 = a1 + x
    } else {
    a2 = b + 2
    c2 = y + 1
    }
    a4 = c? + a?
    
  • use notational convention: phi function
    a3 = phi(a1,a2)
    c3 = phi(c1,c2)
    b2 = phi(b1,b)
    a4 = c3 + a3
    

Lecture 3: Overview of LLVM Compiler

LLVM Compiler Infrastructure

  • provides reusable components for building compilers
  • reduces time/cost
  • build static compilers, JITs, trace-based optimizers, etc.

LLVM Compiler Framework

  • e2e compilers using LLVM infra
  • C and C++ are robust and aggressive
  • Emit C code or native code

Parts

  • LLVM Virtual Instruction Set

System:

  • front end turns C/C++/Java/… into LLVM IR. One front end is Clang
  • LLVM optimizer operates on LLVM IR in a series of passes
  • Back end turns LLVM IR into machine code

Analysis passes do not change code

LLVM instruction set

  • RISC-like three address code
  • infinite virtual register set in SSA form
  • Simple, low-level control flow constructs
  • Load/store instructions with typed-pointers
for (int i = 0; i < N; i++) {
  Sum(&A[i], &P);
}
loop:   ; preds = %bb0, %loop
  %i.1 = phi i32 [ 0, %bb0 ], [ %i.2, %loop ]
  %AiAddr = getelementptr float* %A, i32 %i.1
  call void @Sum(float* %AiAddr, %pair* %P)
  %i.2 = add i32 %i.1, 1
  %exitcond = icmp eq i32 %i.1, %N
  br i1 %exitcond, label %outloop, label %loop
  • explicit dataflow through SSA
  • explicit cfg, even for exceptions
  • explicit language independent type-information
  • explicit typed pointer arithmetic
    • preserve array subscript and structure indexing

Type system:

  • Primitives: label, void, float, integer (including arbitrary bitwidth integers i1, i32, …)
  • Derived: pointer, array, structure, function
  • No high-level types

Stack allocation is explicit in LLVM

int caller() {
  int T;
  T = 4;
  ...
}

int %caller() {
  %T = alloca i32
  store i32 4, i32* %T
  ...
}

LLVM IR is almost all nested doubly-linked lists

Function *F = ...;
for (Function::iterator I = F->begin(), E = F->end(); I != E; ++I) {
  BasicBlock &BB = *I;
  ...
}
  • Modules contain functions and globals (unit of compilation/analysis/optimization)
  • Functions contain basic blocks and arguments
  • Basic blocks contain instructions
  • Instructions contain opcodes and vector of operands

Pass Manager

  • ImmutablePass: doesn’t do much
  • LoopPass
  • RegionPass: process single-entry, single-exit regions
  • ModulePass
  • CallGraphSCCPass: bottom up on the call graph
  • FunctionPass
  • BasicBlockPass: process basic blocks

Tools:

  • llvm-as: .ll (text) to .bc (binary)
  • llvm-dis: .bc to .ll
  • llvm-link: link multiple .bc files
  • llvm-prof: print profile output in human-readable form
  • opt: run one or more optz on bc

Aggregate tools:

  • bugpoint
  • clang

mostly use clang, opt, llvm-dis

Lecture 4: Dataflow Analysis

Reaching definitions

  • every assignment is a definition
  • a definition d reaches a point p if there exists a path from the point immediately following d to p such that d is not killed along that path

Live variable analysis:

  • a variable v is live at point p if the value of v is used along some path in the flow graph starting at p
  • otherwise the variable is dead

For each basic block, track uses and defs

reaching definitions vs live variables:

  • sets of definitions vs variables
  • forward vs backward
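
The backward direction can be made concrete with a small iterative solver for live variables (hypothetical Python sketch; the block/CFG encoding is made up):

```python
# Iterate to a fixed point:
#   out[b] = union of in[s] over successors s of b
#   in[b]  = use[b] ∪ (out[b] − def[b])
def liveness(blocks, use, defs, succ):
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            out = set().union(*(live_in[s] for s in succ[b])) if succ[b] else set()
            inn = use[b] | (out - defs[b])
            if out != live_out[b] or inn != live_in[b]:
                live_in[b], live_out[b] = inn, out
                changed = True
    return live_in, live_out
```

A reaching-definitions solver has the same shape but runs forward, with gen/kill sets of definitions instead of use/def sets of variables.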

Reaching:

  • consider a = b + c. at what previous statements did variable c acquire a value that can reach this statement?

Live range:

  • starting from the definition of c, go to the next definition of the variable or the end of the scope in which c exists.

Lecture 6: More on LLVM Compiler

  • almost everything is a subclass of llvm::Value
for (Function::iterator FI = func->begin(), FE = func->end(); FI != FE; ++FI) {
    for (BasicBlock::iterator BBI = FI->begin(), BBE = FI->end(); BBI != BBE; ++BBI) {
        Instruction *I = &*BBI;
        if (CallInst *CI = dyn_cast<CallInst>(I)) {
            outs() << "I'm a Call Instruction!\n";
        }
        if (UnaryInstruction *UI = dyn_cast<UnaryInstruction>(I)) {
            outs() << "I'm a Unary Instruction!\n";
        }
        ...
    }
}

Lecture 8: SSA

  • only 1 assignment per variable
  • definitions dominate uses
    • x strictly dominates w iff impossible to reach w without passing through x first

If x_i is used in x <- phi(…, x_i, …) then BB(x_i) dominates the i-th predecessor of BB(phi)

  • if x is used in y <- … x then BB(x) dominates BB(y)
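
The strict-dominance facts above rest on a dominator computation; a minimal iterative sketch in Python (hypothetical encoding, not from the lecture):

```python
# dom[n] = {n} ∪ intersection of dom[p] over predecessors p of n,
# iterated to a fixed point starting from the entry node.
def dominators(nodes, preds, entry):
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == entry:
                continue
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom
```

For a diamond CFG A → {B, C} → D, this gives dom(D) = {A, D}: neither branch alone dominates the join.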

Lecture 9: SSA style optz

constant prop

W = all defs
while W:
    S = W.pop()
    # v <- c, or v <- phi(c, c, ..., c) with all arguments the same constant c
    if S has form "v <- c" or "v <- phi(c, c, ..., c)":
        delete S
        for each stmt U that uses v:
            replace v with c in U
            W.add(U)

Lecture 10: loop motion

  • use reaching defs and dominators
Coffee Table
2024-09-12 — https://atomicapple0.github.io/posts/2024/coffee-table

I recently registered for a woodworking class at a local adult community school located at a high school in south Florida. I had some minimal experience from taking a mini course at CMU but forgot pretty much everything. Woodworking I was not offered this term, so I registered for Woodworking II. Most of the other students were fairly independent woodworkers and mainly registered for the course to have a woodworking space. As I didn’t know anything anymore, the very kind instructor recommended that I make a coffee table. This coffee table is based on Steve Ramsey’s “Sonoma Vineyard Coffee Table” guide from his Weekend Woodworker course.

Class 1:

Missed it bc not in town :(

Class 2:

Bought a bunch of wood from Home Depot. Got wood shavings all over the inside of my car. Note to self: a RAV4 just barely manages to fit 8ft boards. Since the pine common boards looked like they were rotting, I paid extra for the pine select ones.

Note to self: 1in x 4in boards (or “1x4” colloquially) are not 4in wide. They are 3.5in wide, because some width was lost when the wood was planed.

Materials:

  • 8 of 1in x 4in x 8ft pine select boards
  • 1 of 2ft x 4ft x 1/2in plywood

After lugging the wood back, I relearned how to use the miter saw:

  • set miter saw completely down. lock in place with pin
  • press tape against saw to measure length. then lock the stop block in place
  • crop 1/2 blade width off the end of the board
  • cut a bunch of identical length pieces by measuring against stop block

Now I have 16 of 1in x 4in x 15.25in boards. I glued 4 of them together along their faces to form each of the 4 table legs. For wood glue, use a brush to spread the glue evenly. Then clamp overnight. If using a metal clamp, put a piece of wood between the clamp and the workpiece to prevent denting.

Class 3:

Got to use table saw once again. That thing is terrifying.

  • spin wheel thing to set blade height to be around .25in above wood
  • lock wheel in place
  • set fence to be good distance from blade
  • for this cut, we just want to shave off a tiny bit of the misaligned glued edges
  • use featherboard to press board against fence to stop kickback!!! this thing is incredible and makes it so that my fingers are much further from blade
  • turn on saw. once stop flashing, pull red stopper to begin spin
  • position urself not right behind board in case it gets ejected backwards
  • push board through with push stick. aim force downward, forward, and towards fence.
  • make sure the table is long enough that the board doesn’t fall off the end

Then I used the miter saw to trim the other two small faces of the board, just cutting a tiny bit off.

Then, to square it, I came back to the table saw. I set the fence distance by measuring against the shorter length, then rotated the board to start cutting.

Afterwards I rotated the saw blade to 45deg to cut the wood into a trapezoidal prism.

Then I used the other table saw, which had a dado blade that cuts away larger volumes of wood. This was used to form the tenons of each leg. Since this was doing crosscuts (instead of rip cuts like before), I installed a miter gauge which guided the wood. I like this as it keeps my fingers far from the blade again.

Note that to tighten the miter gauge, you have to use an Allen key. I didn’t tighten it enough and the miter gauge sort of twisted on a cut, which caused some of the metal of the miter gauge to get cut off and propelled the shavings onto me. Funnily enough, it was good that this table saw was too old to be SawStop-compatible, as it would have braked and ruined the blade.

Also, oops, I cut some later pieces I shouldn’t have cut. It seems that preemptively cutting pieces before they are needed may not be a great idea, as the required dimensions may end up slightly different from the measurements in the guide due to imprecise cuts earlier in the project.

Optimizing CUDA Kernels
2024-09-04 — https://atomicapple0.github.io/posts/2024/cuda-kernels

Notes from “Professional CUDA C Programming” by John Cheng, Max Grossman, and Ty McKercher. Since this book is old, the information is only accurate for Fermi / Kepler generation devices.


Shared Memory Bank Conflicts

Each SM contains a small, low-latency memory pool accessible by all threads in a threadblock, known as shared memory. Shared memory is divided into 32 equally-sized memory banks. Each bank can service one 32-bit word per clock cycle. If multiple threads in the same warp access different addresses in the same bank, the bank must serialize the requests. This is known as a bank conflict.

On Kepler, the size of each bank entry can be 32 or 64 bit.

# 64-bit mode
bank index = (byte address / 8 bytes per bank) % 32 banks

To mitigate bank conflicts, we can use padding to ensure that each thread accesses a different bank.
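
The effect of padding can be checked numerically (hypothetical Python sketch, assuming 32-bit bank mode with 4-byte banks for simplicity, rather than the 64-bit mode formula above):

```python
# Bank index of element (row, col) in a row-major tile of 4-byte words.
def bank(row, col, width, elem_bytes=4, banks=32):
    byte_addr = (row * width + col) * elem_bytes
    return (byte_addr // elem_bytes) % banks

# Column access to an unpadded 32-wide tile: all 32 rows hit bank 0.
unpadded = {bank(r, 0, 32) for r in range(32)}
# With one element of padding (width 33), the rows spread over all banks.
padded = {bank(r, 0, 33) for r in range(32)}
```

With `width = 32`, `unpadded` collapses to a single bank (a 32-way conflict); with `width = 33` every row lands in a distinct bank.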

You typically use __syncthreads() to ensure that all threads in the block have finished writing to shared memory before reading from it.

Global Memory

We want our reads/writes to be:

  • aligned
  • coalesced

L1 cache line is 128-bytes.

Uncached loads that do not pass through L1 cache are performed at the granularity of memory segments (32 bytes) and not cache lines (128 bytes).

Writes do not get cached in L1. They can be cached in L2. Memory transactions are 32-byte granularity and can be one, two, or four segments at a time.

Fast Matrix Transpose

Matrix transpose is hard because reading/writing to global memory column-wise results in a lot of uncoalesced memory accesses. This will hurt your global load/store efficiency.

Assume 2D matrix. Each threadblock is responsible for a 2D tile. Each thread is responsible for 1 element in tile.

  1. Each thread reads from global memory (row-wise) and writes to shared memory (row-wise). Adjacent threads (based on threadIdx.x) read adjacent elements in the same row. These reads are likely coalesced and aligned (good). These writes to shared memory are also row-wise and do not result in bank conflicts.
  2. Use __syncthreads() to ensure all threads have finished writing to shared memory.
  3. Each thread reads from shared memory (column-wise) and writes to global memory (row-wise). Adjacent threads (based on threadIdx.x) read adjacent elements in the same column. These reads from smem may result in bank conflicts but we can fix this. These writes to global memory are row-wise so they are coalesced and aligned.

Optimizations

  • unroll the grid. now each threadblock is responsible for k tiles. each thread is responsible for k elements in tile.
  • pad the shared memory to avoid bank conflicts:
    // not padded
    __shared__ float tile[BDIM_Y][BDIM_X];
    // padded
    __shared__ float tile[BDIM_Y][BDIM_X+2];
    
  • try various grid dimensions. more threadblocks can mean more device parallelism.

Fast Reduce

  • warp shuffle trick is insane. this forgoes the need for block-level sync like __syncthreads(). This is because threads in a warp can communicate with each other without needing to sync with other warps.
    __inline__ __device__ int warpReduce(int mySum) {
      mySum += __shfl_xor(mySum, 16);
      mySum += __shfl_xor(mySum, 8);
      mySum += __shfl_xor(mySum, 4);
      mySum += __shfl_xor(mySum, 2);
      mySum += __shfl_xor(mySum, 1);
      return mySum;
    }
    
__global__ void reduceShfl(int *g_idata, int *g_odata, unsigned int n) {
    // shared memory for each warp sum
    __shared__ int smem[SMEMDIM];
    
    // boundary check
    unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx >= n) return;

    // read from global memory
    int mySum = g_idata[idx];

    // calculate lane index and warp index
    int laneIdx = threadIdx.x % warpSize;
    int warpIdx = threadIdx.x / warpSize;

    // block-wide warp reduce
    mySum = warpReduce(mySum);

    // save warp sum to shared memory
    if (laneIdx == 0) smem[warpIdx] = mySum;

    // block-wide sync
    __syncthreads();

    // last warp reduce
    mySum = (threadIdx.x < SMEMDIM) ? smem[laneIdx] : 0;
    if (warpIdx == 0) mySum = warpReduce(mySum);

    // write to global memory
    if (threadIdx.x == 0) g_odata[blockIdx.x] = mySum;
}
C++ Notes
2024-09-03 — https://atomicapple0.github.io/posts/2024/cpp-notes

I have largely evaded formally learning C++ by leveraging my knowledge of C to pattern match on existing C++ codebases, or by using Rust in my own green-field projects. Nevertheless, knowing C++ seems pretty important given that everyone still uses it and interviewers test on its concepts. I guess I cannot possibly continue on without knowing what a virtual method call is.

A previous company shipped me a copy of “A Tour of C++, Third Edition” by Bjarne Stroustrup. I’ll be dumping my notes here so I actually remember stuff.


class Vector {
public:
    Vector(int s) :elem{new double[s]}, sz{s} {}
    double& operator[](int i) { return elem[i]; }
    int size() { return sz; }
private:
    double* elem;
    int sz;
};

int main() {
    Vector v(6);
}
enum class Color { red, green, blue };
export module vector_printer;

import std;

export template<typename T> void print(std::vector<T>& v)
{
    std::cout << "{\n";
    for (const T& val : v) {
        std::cout << " " << val << '\n';
    }
    std::cout << "}\n";
}

polymorphism

using namespace std;

class base {
public:
    virtual void print() { cout << "print base class\n"; }
    void show() { cout << "show base class\n"; }
};

class derived : public base {
public:
    void print() { cout << "print derived class\n"; }
    void show() { cout << "show derived class\n"; }
};

int main()
{
    base* bptr;
    derived d;
    bptr = &d;
 
    // Virtual function, bound at runtime
    bptr->print();
 
    // Non-virtual function, bound at compile time
    bptr->show();
 
    return 0;
}
Output:
print derived class
show base class

Basically when you declare a class’s method as virtual, you are ensuring that sub-classes use their own implementation. Aka, override-able functions.
