Three-address code
A := B op C
A := unaryop B
A := B
GOTO s
IF A relop B GOTO s
CALL f
RETURN
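As a toy illustration of lowering into these forms, here is a small Python sketch (the tuple representation and the name `to_tac` are my own, not part of the notes):

```python
from itertools import count

def to_tac(expr, fresh=None):
    """Lower a nested expression into A := B op C statements.
    expr is a leaf (variable/constant) or a tuple (op, left, right)."""
    fresh = fresh if fresh is not None else count(1)
    if not isinstance(expr, tuple):
        return expr, []                 # leaf needs no code
    op, left, right = expr
    lval, lcode = to_tac(left, fresh)
    rval, rcode = to_tac(right, fresh)
    t = f"t{next(fresh)}"               # fresh temporary for this result
    return t, lcode + rcode + [f"{t} := {lval} {op} {rval}"]

# a = b * (c + d)
result, code = to_tac(("*", "b", ("+", "c", "d")))
code.append(f"a := {result}")
print("\n".join(code))
# t1 := c + d
# t2 := b * t1
# a := t2
```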
Basic block
Flow graph
Optimization types:
Local
Global
Induction Variable Elimination
Loop invariant code motion
Machine dependent optimizations
Outline
Partitioning into BBs
Common subexpression elimination
CSE cont:
a + a * (b - c) + (b - c) * d
b-c and a are duplicated
DAG bad:
Alternative: Value Numbering Scheme
r1 + r2 => var2value(r1) + var2value(r2)
VAR2VALUE : var -> value = {}
BIN_VALUES : (value, op, value) -> var = {}
for dst, src1, op, src2 in instrs:
    val1 = VAR2VALUE[src1]
    val2 = VAR2VALUE[src2]
    if (val1, op, val2) in BIN_VALUES:
        print(f"{dst} = {BIN_VALUES[val1, op, val2]}")
        VAR2VALUE[dst] = BIN_VALUES[val1, op, val2]
    else:
        tmp = newtemp()
        BIN_VALUES[val1, op, val2] = tmp
        print(f"{tmp} = {val1} {op} {val2}")
        print(f"{dst} = {tmp}")
        VAR2VALUE[dst] = tmp
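A runnable version of this scheme (my own sketch; `newtemp` is realized with a counter, and instructions are `(dst, src1, op, src2)` tuples):

```python
from itertools import count

def value_number(instrs):
    """Local value numbering over (dst, src1, op, src2) instructions."""
    fresh = count()
    var2value = {}   # variable -> its current value (a canonical temp)
    bin_values = {}  # (value, op, value) -> canonical temp holding it
    out = []
    for dst, src1, op, src2 in instrs:
        val1 = var2value.get(src1, src1)
        val2 = var2value.get(src2, src2)
        key = (val1, op, val2)
        if key in bin_values:
            out.append(f"{dst} = {bin_values[key]}")   # reuse prior computation
        else:
            tmp = f"t{next(fresh)}"
            bin_values[key] = tmp
            out.append(f"{tmp} = {val1} {op} {val2}")
            out.append(f"{dst} = {tmp}")
        var2value[dst] = bin_values[key]
    return out

# b - c is computed twice; the second occurrence reuses t0
print(value_number([("x", "b", "-", "c"), ("y", "b", "-", "c")]))
# -> ['t0 = b - c', 'x = t0', 'y = t0']
```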
phi functions in SSA
c = 12
if (i) {
a = x + y
b = a + x
} else {
a = b + 2
c = y + 1
}
a = c + a
to
c1 = 12
if (i) {
a1 = x + y
b1 = a1 + x
} else {
a2 = b + 2
c2 = y + 1
}
a3 = phi(a1, a2)
c3 = phi(c1, c2)
b2 = phi(b1, b)
a4 = c3 + a3
LLVM Compiler Infrastructure
Parts
System:
Analysis passes do not change code
LLVM instruction set
for (int i = 0; i < N; i++) {
Sum(&A[i], &P);
}
loop: ; preds = %bb0, %loop
%i.1 = phi i32 [ 0, %bb0 ], [ %i.2, %loop ]
%AiAddr = getelementptr float* %A, i32 %i.1
call void @Sum(float* %AiAddr, %pair* %P)
%i.2 = add i32 %i.1, 1
%exitcond = icmp eq i32 %i.1, %N
br i1 %exitcond, label %outloop, label %loop
Type system:
Stack allocation is explicit in LLVM
int caller() {
int T;
T = 4;
...
}
int %caller() {
%T = alloca i32
store i32 4, i32* %T
...
}
LLVM IR is almost all nested doubly-linked lists
Function &F = ...
for (Function::iterator I = F.begin(); I != F.end(); ++I) {
  BasicBlock &BB = *I;
  ...
}
Pass Manager
Tools:
Aggregate tools:
mostly use clang, opt, llvm-dis
Reaching definitions
Live variable analysis:
For each basic block, track uses and defs
reaching definitions vs live variables:
Reaching:
Given a = b + c at line 10: at what previous statements did variable c acquire a value that can reach line 10?
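The per-block use/def tracking above can be sketched as a backward fixed-point iteration (names and the `(defs, uses)` representation are mine, not from the notes):

```python
def liveness(blocks, succ):
    """Backward fixed-point live-variable analysis.
    blocks: name -> list of (defs, uses) per instruction, in order.
    succ:   name -> list of successor block names."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, instrs in blocks.items():
            out = set()
            for s in succ[b]:              # out[b] = union of in[successors]
                out |= live_in[s]
            inn = set(out)
            for defs, uses in reversed(instrs):
                inn -= set(defs)           # definitions kill liveness
                inn |= set(uses)           # uses generate liveness
            if inn != live_in[b] or out != live_out[b]:
                live_in[b], live_out[b] = inn, out
                changed = True
    return live_in, live_out

# B1: a = b + c ; B2: d = a
blocks = {"B1": [(["a"], ["b", "c"])], "B2": [(["d"], ["a"])]}
succ = {"B1": ["B2"], "B2": []}
live_in, live_out = liveness(blocks, succ)
print(live_in["B1"], live_out["B1"])   # {'b', 'c'} {'a'}
```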
Live range:
for (Function::iterator FI = func->begin(), FE = func->end(); FI != FE; ++FI) {
  for (BasicBlock::iterator BBI = FI->begin(), BBE = FI->end(); BBI != BBE; ++BBI) {
    Instruction *I = &*BBI;
    if (CallInst *CI = dyn_cast<CallInst>(I)) {
      outs() << "I'm a Call Instruction!\n";
    }
    if (UnaryInstruction *UI = dyn_cast<UnaryInstruction>(I)) {
      outs() << "I'm a Unary Instruction!\n";
    }
    ...
  }
}
If x_i is used in x <- phi(…, x_i, …), then the definition of x_i dominates the i-th predecessor of the block containing the phi.
constant prop
W = all defs
while W:
    stmt S <- W.pop()
    if S has form "v <- c" or "v <- phi(c, c, ..., c)" (all the same constant c):
        delete S
        for each stmt U that uses v:
            replace v with c in U
            W.add(U)
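A runnable sketch of this worklist pass, assuming a simplified SSA encoding of my own: statements are `(dst, rhs)` pairs where `rhs` is an int constant, a `("phi", *args)` tuple, or an operator tuple like `("+", a, b)`:

```python
def const_prop(stmts):
    """Worklist constant propagation over SSA statements."""
    stmts = dict(stmts)          # SSA: exactly one definition per variable
    W = list(stmts)              # worklist of defined variables
    while W:
        v = W.pop()
        rhs = stmts.get(v)
        # phi(c, c, ..., c) of one repeated constant folds to that constant
        if (isinstance(rhs, tuple) and rhs[0] == "phi"
                and all(isinstance(a, int) for a in rhs[1:])
                and len(set(rhs[1:])) == 1):
            rhs = rhs[1]
        if isinstance(rhs, int):
            del stmts[v]         # delete "v <- c" ...
            for u, r in stmts.items():
                if isinstance(r, tuple) and v in r:
                    stmts[u] = tuple(rhs if x == v else x for x in r)
                    W.append(u)  # ... and revisit statements that used v
    return stmts

# x and y fold away; z keeps the (now constant) addition
print(const_prop([("x", 3), ("y", ("phi", "x", "x")), ("z", ("+", "x", "y"))]))
# -> {'z': ('+', 3, 3)}
```

Note that, like the pseudocode, this only propagates constants into uses; it does not fold `("+", 3, 3)` itself — that would be constant folding, a separate pass.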
Missed it bc not in town :(
Bought a bunch of wood from Home Depot. Got wood shavings all over the inside of the car. Note to self that a RAV4 just barely manages to fit 8ft boards. Since the pine common boards looked like they were rotting, I paid extra for the pine select ones.
Note to self: 1in x 4in (or 1x4 colloquially), are not 4in wide. They are 3.5in wide. This is because some width was lost when the wood was planed.
Materials:
After lugging the wood back, I relearned how to use the miter saw:
Now I have 16 of 1in x 4in x 15.25in boards. I glued 4 of them together along their faces to form 4 table legs. For wood glue, use a brush to spread the glue evenly. Then clamp overnight. If using a metal clamp, put a piece of scrap wood between the clamp and the workpiece to prevent denting.
Got to use table saw once again. That thing is terrifying.
Then I used the miter saw to plane the other two small faces of each board. Just cut a tiny bit off.
Then to square it, I came back to the table saw. I measured the fence distance using the shorter length. Then rotated the piece to start cutting.
Afterwards I rotated the saw to 45deg to cut the wood into a trapezoidal prism.
Then I used the other table saw that had a dado blade, which cuts away larger volumes of wood. This was used to form the tenons of each leg. Since this was doing crosscuts (instead of rip cuts like before), I installed a miter gauge which guided the wood. I like this as it keeps my fingers far from the blade.
Note that to tighten the miter gauge, you have to use an allen key. I didn't tighten it enough and the miter gauge sort of twisted on a cut, which caused some of the metal of the miter gauge to get cut off and propelled the shavings onto me. Funnily enough, it was good that this table saw was not SawStop compatible due to being old, as it would have braked and ruined the blade.
Also, oops, I cut some later pieces I shouldn't have cut. It seems that preemptively cutting pieces before they are needed may not be a great idea, as the required dimensions may be slightly different than the measurements in the guide due to imprecise cuts earlier on in the project.
Each SM contains a small, low-latency memory pool accessible by all threads in a threadblock, known as shared memory. Shared memory is divided into 32 equally-sized memory banks. Each bank can service one 32-bit word per clock cycle. If multiple threads in the same warp access the same bank, the bank must serialize the requests. This is known as a bank conflict.
On Kepler, the size of each bank entry can be 32 or 64 bit.
# 64-bit mode
bank index = (byte address / 8 bytes/bank) % 32 banks
To mitigate bank conflicts, we can use padding to ensure that each thread accesses a different bank.
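To see why padding helps, here is a quick Python check of which banks a warp hits when 32 threads each read one column element of a row-major float tile. It uses the 32-bit mode formula (bank = word address % 32) rather than the 64-bit one above, and the helper name `banks_hit` is my own:

```python
def banks_hit(width):
    """Banks touched when threads 0..31 each read tile[i][0] of a
    row-major float tile with the given row width (in 4-byte words)."""
    return {(i * width) % 32 for i in range(32)}

print(len(banks_hit(32)))   # 1  -> all 32 threads hit one bank (32-way conflict)
print(len(banks_hit(33)))   # 32 -> one column of padding spreads accesses over all banks
```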
You typically use __syncthreads() to ensure that all threads in the block have finished writing to shared memory before reading from it.
We want our reads/writes to be:
L1 cache line is 128-bytes.
Uncached loads that do not pass through the L1 cache are performed at the granularity of memory segments (32 bytes) and not cache lines (128 bytes).
Writes do not get cached in L1. They can be cached in L2. Memory transactions are 32-byte granularity and can be one, two, or four segments at a time.
Matrix transpose is hard because reading/writing to global memory column-wise results in a lot of uncoalesced memory accesses. This will hurt your global load/store efficiency.
Assume 2D matrix. Each threadblock is responsible for a 2D tile. Each thread is responsible for 1 element in tile.
Threads in a warp (consecutive threadIdx.x) read adjacent elements in the same row. These reads are likely coalesced and aligned (good). The writes to shared memory are also row-wise and do not result in bank conflicts.
Call __syncthreads() to ensure all threads have finished writing to shared memory.
Threads in a warp (consecutive threadIdx.x) then read adjacent elements in the same column. These reads from smem may result in bank conflicts, but we can fix this. The writes to global memory are row-wise, so they are coalesced and aligned.
Optimizations
Unroll: each block handles k tiles; each thread is responsible for k elements per tile.
// not padded
__shared__ float tile[BDIM_Y][BDIM_X];
// padded
__shared__ float tile[BDIM_Y][BDIM_X+2];
Warp shuffle instructions let threads in the same warp exchange register values directly, without going through shared memory or __syncthreads(). This is because threads in a warp can communicate with each other without needing to sync with other warps.
__inline__ __device__ int warpReduce(int mySum) {
    mySum += __shfl_xor(mySum, 16);
    mySum += __shfl_xor(mySum, 8);
    mySum += __shfl_xor(mySum, 4);
    mySum += __shfl_xor(mySum, 2);
    mySum += __shfl_xor(mySum, 1);
    return mySum;
}
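The xor-shuffle butterfly in warpReduce can be sanity-checked in plain Python by treating the warp as a list of 32 lane values (this is a simulation of the pattern, not CUDA):

```python
def warp_reduce_sim(lanes):
    """Simulate the __shfl_xor butterfly: at each step, every lane adds
    the value currently held by its partner lane (lane ^ offset)."""
    for offset in (16, 8, 4, 2, 1):
        lanes = [lanes[i] + lanes[i ^ offset] for i in range(len(lanes))]
    return lanes

result = warp_reduce_sim(list(range(32)))
print(result[0])   # 496 == sum(range(32)); every lane ends up holding the total
```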
__global__ void reduceShfl(int *g_idata, int *g_odata, unsigned int n) {
// shared memory for each warp sum
__shared__ int smem[SMEMDIM];
// boundary check
unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
if (idx >= n) return;
// read from global memory
int mySum = g_idata[idx];
// calculate lane index and warp index
int laneIdx = threadIdx.x % warpSize;
int warpIdx = threadIdx.x / warpSize;
// block-wide warp reduce
mySum = warpReduce(mySum);
// save warp sum to shared memory
if (laneIdx == 0) smem[warpIdx] = mySum;
// block-wide sync
__syncthreads();
// last warp reduce
mySum = (threadIdx.x < SMEMDIM) ? smem[laneIdx] : 0;
if (warpIdx == 0) mySum = warpReduce(mySum);
// write to global memory
if (threadIdx.x == 0) g_odata[blockIdx.x] = mySum;
}
A previous company shipped me a copy of “A Tour of C++, Third Edition” by Bjarne Stroustrup. I’ll be dumping my notes here so I actually remember stuff.
class Vector {
public:
Vector(int s) :elem{new double[s]}, sz{s} {}
double& operator[](int i) { return elem[i]; }
int size() { return sz; }
private:
double* elem;
int sz;
};
int main() {
Vector v(6);
}
enum class Color { red, green, blue };
export module vector_printer;
import std;
export template<typename T> void print(std::vector<T>& v)
{
std::cout << "{\n";
for (const T& val : v) {
std::cout << " " << val << '\n';
}
std::cout << "}\n";
}
polymorphism
using namespace std;
class base {
public:
virtual void print() { cout << "print base class\n"; }
void show() { cout << "show base class\n"; }
};
class derived : public base {
public:
void print() { cout << "print derived class\n"; }
void show() { cout << "show derived class\n"; }
};
int main()
{
base* bptr;
derived d;
bptr = &d;
// Virtual function, bound at runtime
bptr->print();
// Non-virtual function, bound at compile time
bptr->show();
return 0;
}
% Output
print derived class
show base class
Basically, when you declare a class's method as virtual, you are ensuring that sub-classes use their own implementation. In other words, overridable functions.