1. Introduction: The Memory Wall
Modern hardware performance is a balancing act between computation speed (how fast we can do math) and memory bandwidth (how fast we can move data). Understanding this balance is the foundation of all performance optimization.
In the early days of computing, processors and memory operated at similar speeds. A CPU could fetch data, perform calculations, and store results without waiting. But starting in the 1980s, processor speeds grew exponentially while memory speeds improved only linearly.
This growing disparity—known as the Memory Wall—means that for most real-world applications, the processor spends most of its time waiting for data rather than computing. By 2020, a typical CPU could perform thousands of floating-point operations in the time it takes to fetch a single cache line from main memory.
On modern hardware, data movement costs more than computation. Moving a 64-bit value from DRAM costs ~200 cycles, while a fused multiply-add (FMA) takes 4-5 cycles. You can do 40-50 math operations in the time it takes to fetch one number!
Real-World Impact
Consider this simple vector addition:
void vector_add(float* a, float* b, float* c, size_t n) {
for (size_t i = 0; i < n; i++) {
c[i] = a[i] + b[i]; // 1 FLOP, 12 bytes accessed
}
}
// Arithmetic Intensity = 1 / 12 = 0.083 FLOP/Byte
// On a 100 GB/s memory system: max 8.3 GFLOPS
// But CPU can do 100+ GFLOPS!
// → 92% of compute capacity WASTED
This kernel has an arithmetic intensity of 0.083 FLOP/Byte—far below the ~10 FLOP/Byte threshold needed to utilize modern CPU compute capabilities. The processor is memory-bound, throttled by how fast it can load and store data.
The Two Types of Bottlenecks
Every computation falls into one of two categories:
Memory-Bound
The processor waits for data. Performance is limited by memory bandwidth.
- Low arithmetic intensity
- Optimization: reduce data movement
- Examples: streaming, sparse ops
Compute-Bound
The processor is doing math. Performance is limited by compute throughput.
- High arithmetic intensity
- Optimization: use more compute
- Examples: dense GEMM, convolutions
The critical insight: the right optimization depends on which bound you're in. Vectorizing a memory-bound kernel is pointless. Improving memory access patterns for a compute-bound kernel is wasteful. You must diagnose before you optimize.
2. Diagnosing Performance Bottlenecks
Before writing a single line of optimization code, you must answer: is my kernel memory-bound or compute-bound? This section covers the tools and techniques for diagnosis.
The Roofline Model Recap
The Roofline Model provides a visual framework for understanding performance limits. The key metric is Arithmetic Intensity (AI), the ratio of work to data movement: AI = FLOPs / Bytes accessed.
The Roofline plot shows two ceilings:
- Memory bandwidth ceiling: Performance = AI × Bandwidth (diagonal line)
- Compute ceiling: Performance = Peak FLOPS (horizontal line)
Where they meet is the ridge point: the AI at which your code transitions from memory-bound to compute-bound (ridge AI = Peak FLOPS / Peak Bandwidth).
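The ridge-point arithmetic can be sketched directly. The numbers used below (500 GFLOPS peak, 50 GB/s bandwidth) are illustrative assumptions, not a measured system:

```cpp
#include <cassert>

// Ridge point: the arithmetic intensity (FLOP/Byte) at which a kernel
// stops being limited by bandwidth and starts being limited by compute.
constexpr double ridge_point(double peak_gflops, double peak_bw_gb_s) {
    return peak_gflops / peak_bw_gb_s;
}

// Attainable performance under the roofline: the lower of the two ceilings.
constexpr double roofline_gflops(double ai, double peak_gflops, double peak_bw_gb_s) {
    double bw_ceiling = ai * peak_bw_gb_s;  // diagonal (bandwidth) roof
    return bw_ceiling < peak_gflops ? bw_ceiling : peak_gflops;
}
```

For a hypothetical 500 GFLOPS / 50 GB/s CPU the ridge sits at 10 FLOP/Byte, so a kernel with AI = 0.5 tops out at 25 GFLOPS no matter how well it vectorizes.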
Profiling Tools
Theoretical analysis tells you what should happen. Profilers tell you what actually happens:
| Tool | Platform | Key Metrics | When to Use |
|---|---|---|---|
| `perf` | Linux CPU | IPC, cache misses, branch mispredicts | Quick CPU profiling |
| Intel VTune | Intel CPU | Roofline, memory hierarchy, threading | Deep CPU analysis |
| AMD uProf | AMD CPU | Similar to VTune for AMD | AMD systems |
| Nsight Compute | NVIDIA GPU | SM utilization, memory throughput, roofline | CUDA kernel optimization |
| Nsight Systems | NVIDIA GPU | Timeline, kernel launches, memory transfers | Application-level GPU analysis |
| `rocprof` | AMD GPU | HIP kernel metrics | ROCm optimization |
CPU Profiling with perf
# Basic statistics
perf stat ./my_program
# Key counters for memory-bound diagnosis
perf stat -e cycles,instructions,cache-references,cache-misses,\
L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses ./my_program
# Sample output:
# 12,345,678,901 cycles
# 3,456,789,012 instructions # IPC: 0.28 (LOW!)
# 456,789,012 cache-references
# 234,567,890 cache-misses # 51% miss rate (HIGH!)
#
# Diagnosis: Memory-bound (IPC < 1, high cache misses)
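The same rules of thumb can be applied mechanically to raw counter values. The 1.0 IPC and 30% miss-rate thresholds below are illustrative heuristics, not hard limits:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Classify a kernel from perf-style counters. IPC well below 1 combined
// with a high cache-miss rate points at memory stalls; sustained high IPC
// points at a compute-limited kernel.
std::string diagnose_from_counters(uint64_t cycles, uint64_t instructions,
                                   uint64_t cache_refs, uint64_t cache_misses) {
    double ipc = static_cast<double>(instructions) / static_cast<double>(cycles);
    double miss_rate = static_cast<double>(cache_misses) / static_cast<double>(cache_refs);
    if (ipc < 1.0 && miss_rate > 0.30) return "likely memory-bound";
    if (ipc >= 2.0) return "likely compute-bound";
    return "inconclusive: profile further";
}
```

Feeding in the sample counters above (IPC 0.28, 51% miss rate) classifies the run as memory-bound.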
GPU Profiling with Nsight Compute
# Profile a specific kernel
ncu --set full -o profile ./my_cuda_app
# Quick metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed,\
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed \
./my_cuda_app
# Key indicators:
# - SM Throughput high, DRAM low → Compute-bound
# - SM Throughput low, DRAM high → Memory-bound
# - Both low → Latency-bound or underutilized
Identifying Your Bound
Here's a systematic approach to diagnose your kernel:
def diagnose_kernel(
flops: int,
bytes_accessed: int,
observed_gflops: float,
peak_gflops: float,
peak_bw_gb_s: float
):
"""Diagnose whether a kernel is memory or compute bound."""
# Calculate arithmetic intensity
ai = flops / bytes_accessed
# Calculate ridge point
ridge = peak_gflops / peak_bw_gb_s
# Calculate theoretical max performance
if ai < ridge:
theoretical_max = ai * peak_bw_gb_s
bound = "Memory-bound"
else:
theoretical_max = peak_gflops
bound = "Compute-bound"
# Calculate efficiency
efficiency = observed_gflops / theoretical_max * 100
print(f"Arithmetic Intensity: {ai:.2f} FLOP/Byte")
print(f"Ridge Point: {ridge:.2f} FLOP/Byte")
print(f"Classification: {bound}")
print(f"Theoretical Max: {theoretical_max:.1f} GFLOPS")
print(f"Achieved: {observed_gflops:.1f} GFLOPS")
print(f"Efficiency: {efficiency:.1f}%")
if efficiency < 50:
print(f"\n⚠️ Low efficiency! Check for:")
if bound == "Memory-bound":
print(" - Cache misses")
print(" - Uncoalesced access (GPU)")
print(" - Poor data locality")
else:
print(" - No vectorization")
print(" - Low occupancy (GPU)")
print(" - Dependency stalls")
# Example: Matrix-vector multiply on CPU
# y = A @ x, where A is 4096×4096, x is 4096×1
diagnose_kernel(
flops=2 * 4096 * 4096, # 33.5M FLOPs
bytes_accessed=(4096*4096 + 4096 + 4096) * 4, # 67 MB
observed_gflops=8.5,
peak_gflops=500, # Modern CPU
peak_bw_gb_s=50 # DDR4
)
# Output:
# Arithmetic Intensity: 0.50 FLOP/Byte
# Ridge Point: 10.00 FLOP/Byte
# Classification: Memory-bound
# Theoretical Max: 25.0 GFLOPS
# Achieved: 8.5 GFLOPS
# Efficiency: 34.0%
#
# ⚠️ Low efficiency! Check for:
# - Cache misses
# - Uncoalesced access (GPU)
# - Poor data locality
If your kernel achieves less than 50% of its theoretical maximum (based on the roofline), there's likely a fixable inefficiency. Common culprits: cache misses, uncoalesced access, branch divergence, or lack of vectorization.
3. CPU Architecture Fundamentals
To optimize for CPUs, you need to understand how they're built. Modern CPUs are designed to hide memory latency and maximize throughput through several mechanisms:
Cache Hierarchy
CPUs use a hierarchy of progressively larger, slower caches to bridge the gap between processor speed and main memory:
(Figure: the cache hierarchy, from small, fast L1 through L2 and L3 to large, slow DRAM, with latency rising at each level.)
The key insight: cache misses are expensive. An L1 hit costs 4 cycles; a DRAM fetch costs 200+. Code that fits in cache runs 50× faster than code that doesn't. This is why locality—keeping frequently accessed data close together and re-accessing it while still cached—is the primary CPU optimization strategy.
// Example: Impact of stride on cache performance
const int N = 1024 * 1024; // 1M elements = 4 MB
float data[N];
// Sequential access: Perfect cache utilization
// Each cache line (64B) serves 16 floats
for (int i = 0; i < N; i++) {
sum += data[i]; // ~0.25 cycles/element (cache hit)
}
// Strided access: Cache line wasted
// Only 1 of 16 floats used per cache line fetch
for (int i = 0; i < N; i += 16) {
sum += data[i]; // ~4 cycles/element (cache miss every time)
}
// Random access: Worst case
for (int i = 0; i < N; i++) {
sum += data[random_indices[i]]; // ~50+ cycles/element (DRAM)
}
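To see these latencies on your own machine, a pointer-chasing microbenchmark works well: each load depends on the previous one, so out-of-order execution and the hardware prefetcher cannot hide the miss. This is a sketch; absolute numbers vary by CPU:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a random permutation that is a single full cycle (Sattolo's
// algorithm), then chase it. Average time per step approximates the
// load-to-use latency for a working set of n * sizeof(size_t) bytes.
double chase_latency_ns(size_t n, size_t steps) {
    std::vector<size_t> next(n);
    std::iota(next.begin(), next.end(), size_t{0});
    std::mt19937_64 rng(42);
    for (size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    size_t p = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t s = 0; s < steps; ++s) p = next[p];  // serial dependency chain
    auto t1 = std::chrono::steady_clock::now();
    volatile size_t sink = p;  // keep the chase from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}
```

Sweep `n` from a few KB to hundreds of MB and the latency curve steps up at each cache boundary.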
Pipeline & Out-of-Order Execution
Modern CPUs don't execute one instruction at a time. They use pipelining to overlap instruction execution and out-of-order execution to keep functional units busy:
Instructions Per Cycle (IPC) tells you how well your code utilizes the CPU pipeline. Modern CPUs can theoretically execute 4-8 instructions per cycle. If your IPC is below 1, you're likely stalled waiting for data: low IPC usually indicates a memory-bound workload or long dependency chains.
Branch Prediction
Branches (if/else, loops) are problematic for pipelines. The CPU must guess which path to take before it knows the condition result. A misprediction costs 15-20 cycles as the pipeline is flushed:
// Predictable branch: CPU learns the pattern
for (int i = 0; i < N; i++) {
if (i < N/2) { // Predictable: always true then always false
sum += data[i];
}
}
// Prediction accuracy: ~99%
// Mispredictions: ~2 (at the transition)
// Unpredictable branch: 50/50 random
for (int i = 0; i < N; i++) {
if (data[i] > 0.5) { // Random data: unpredictable!
sum += data[i];
}
}
// Prediction accuracy: ~50%
// Mispredictions: ~N/2
// Cost: N/2 × 15 cycles = 7.5N extra cycles!
// Branchless version using conditional move
for (int i = 0; i < N; i++) {
sum += (data[i] > 0.5) ? data[i] : 0; // May compile to CMOV
}
// Or manually branchless:
for (int i = 0; i < N; i++) {
float mask = (data[i] > 0.5f) ? 1.0f : 0.0f;
sum += data[i] * mask;
}
// No branch mispredictions!
| Pattern | Prediction Accuracy | Performance Impact |
|---|---|---|
| Always taken / never taken | ~99% | Negligible |
| Regular pattern (TTTNTTTN...) | ~95% | Minor |
| Data-dependent (random) | ~50% | Severe (15-20 cycles/mispredict) |
| Branchless code | N/A | No penalty |
SIMD Units
Single Instruction Multiple Data (SIMD) allows a CPU to perform the same operation on multiple data elements simultaneously. Modern x86 CPUs support:
#include <immintrin.h>
// Scalar: Process 1 element at a time
void add_scalar(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i]; // 1 element per iteration
}
}
// AVX2: Process 8 elements at a time
void add_avx2(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i += 8) {
// Load 8 floats from each array
__m256 va = _mm256_loadu_ps(&a[i]);
__m256 vb = _mm256_loadu_ps(&b[i]);
// Add all 8 pairs in a single instruction
__m256 vc = _mm256_add_ps(va, vb);
// Store 8 results
_mm256_storeu_ps(&c[i], vc);
}
}
// AVX-512: Process 16 elements at a time
void add_avx512(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i += 16) {
__m512 va = _mm512_loadu_ps(&a[i]);
__m512 vb = _mm512_loadu_ps(&b[i]);
__m512 vc = _mm512_add_ps(va, vb);
_mm512_storeu_ps(&c[i], vc);
}
}
// Note: Modern compilers can auto-vectorize with -O3 -march=native
// Check with: gcc -fopt-info-vec-optimized or clang -Rpass=loop-vectorize
SIMD helps compute-bound code by increasing arithmetic throughput. For memory-bound code (like the simple vector add above), SIMD provides limited benefit because you're already bottlenecked on memory bandwidth, not compute. SIMD shines when you do many operations on data already in cache.
| SIMD Extension | Width | float32 | float64 | Availability |
|---|---|---|---|---|
| SSE | 128-bit | 4 | 2 | All modern x86 |
| AVX | 256-bit | 8 | 4 | Sandy Bridge+ (2011) |
| AVX2 | 256-bit | 8 | 4 | Haswell+ (2013) |
| AVX-512 | 512-bit | 16 | 8 | Skylake-X+ (2017) |
| ARM NEON | 128-bit | 4 | 2 | All ARMv8 |
| ARM SVE/SVE2 | 128-2048 bit | 4-64 | 2-32 | ARMv8.2+ (scalable) |
4. CPU Memory Optimizations
For memory-bound CPU code, the goal is to minimize data movement and maximize cache hits. This section covers the essential techniques.
Data Layout: Array of Structures vs Structure of Arrays
How you organize data in memory has massive performance implications. The two primary layouts are:
Array of Structures (AoS)
Each object's fields are stored contiguously
struct Particle {
float x, y, z;
float vx, vy, vz;
};
Particle particles[N];
Memory: [x₀,y₀,z₀,vx₀,vy₀,vz₀,x₁,y₁,...]
Structure of Arrays (SoA)
Each field stored in its own array
struct Particles {
float x[N], y[N], z[N];
float vx[N], vy[N], vz[N];
};
Particles particles;
Memory: [x₀,x₁,x₂...][y₀,y₁,y₂...]...
// Array of Structures (AoS) - Object-oriented style
struct ParticleAoS {
float x, y, z;
float vx, vy, vz;
float mass;
};
ParticleAoS particles_aos[N];
// Processing - poor cache utilization when only accessing some fields
void update_positions_aos(ParticleAoS* p, float dt, int n) {
for (int i = 0; i < n; i++) {
p[i].x += p[i].vx * dt; // Loads 28 bytes per particle
p[i].y += p[i].vy * dt; // but only uses 24 bytes (x,y,z,vx,vy,vz)
p[i].z += p[i].vz * dt; // mass field wastes cache space
}
}
// Structure of Arrays (SoA) - Data-oriented style
struct ParticlesSoA {
float* x; float* y; float* z;
float* vx; float* vy; float* vz;
float* mass;
};
// Processing - excellent cache utilization + auto-vectorization
void update_positions_soa(ParticlesSoA& p, float dt, int n) {
// Each loop is perfectly vectorizable!
for (int i = 0; i < n; i++) p.x[i] += p.vx[i] * dt;
for (int i = 0; i < n; i++) p.y[i] += p.vy[i] * dt;
for (int i = 0; i < n; i++) p.z[i] += p.vz[i] * dt;
}
// Even better: fused loops with explicit SIMD
void update_positions_soa_avx(ParticlesSoA& p, float dt, int n) {
__m256 vdt = _mm256_set1_ps(dt);
for (int i = 0; i < n; i += 8) {
// Process 8 particles' x coordinates at once
__m256 vx = _mm256_loadu_ps(&p.x[i]);
__m256 vvx = _mm256_loadu_ps(&p.vx[i]);
vx = _mm256_fmadd_ps(vvx, vdt, vx); // x += vx * dt
_mm256_storeu_ps(&p.x[i], vx);
// Similar for y and z...
}
}
// Performance difference: 3-6× speedup from SoA + SIMD
| Aspect | AoS | SoA |
|---|---|---|
| Cache efficiency (field subset) | Poor - loads unused fields | Excellent - loads only needed data |
| SIMD vectorization | Requires gather/scatter | Sequential loads (native) |
| Random access by object | Good - single pointer dereference | Poor - multiple array accesses |
| Memory allocation | Simple (one array) | Complex (multiple arrays) |
| Best use case | Object-by-object processing | Field-by-field processing (HPC) |
Use SoA for batch processing of many objects, iterating over one field at a time (common in physics simulations and ML inference). Use AoS for object-oriented access, spatial data structures (trees, graphs), and when you need all fields of an object together.
Cache Blocking (Tiling)
Cache blocking partitions data to fit in cache, dramatically reducing memory traffic. The classic example is matrix multiplication:
// Naive matrix multiplication - O(N³) cache misses!
void matmul_naive(float* A, float* B, float* C, int N) {
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
float sum = 0;
for (int k = 0; k < N; k++) {
sum += A[i*N + k] * B[k*N + j]; // B access is column-wise!
} // Cache miss every iteration
C[i*N + j] = sum;
}
}
}
// For N=1024: ~1 billion cache misses
// Cache-blocked matrix multiplication - O(N³/BLOCK) cache misses
const int BLOCK = 64; // Chosen to fit in L1 cache
void matmul_blocked(float* A, float* B, float* C, int N) {
// Initialize C to zero
memset(C, 0, N * N * sizeof(float));
// Iterate over tiles
for (int i0 = 0; i0 < N; i0 += BLOCK) {
for (int j0 = 0; j0 < N; j0 += BLOCK) {
for (int k0 = 0; k0 < N; k0 += BLOCK) {
// Inner loops work on BLOCK×BLOCK tiles
// These tiles fit in L1 cache!
for (int i = i0; i < min(i0+BLOCK, N); i++) {
for (int j = j0; j < min(j0+BLOCK, N); j++) {
float sum = C[i*N + j];
for (int k = k0; k < min(k0+BLOCK, N); k++) {
sum += A[i*N + k] * B[k*N + j];
}
C[i*N + j] = sum;
}
}
}
}
}
}
// For N=1024, BLOCK=64: ~16 million cache misses (64× fewer!)
// Even better: blocked + SIMD + loop unrolling
void matmul_blocked_avx(float* A, float* B, float* C, int N) {
memset(C, 0, N * N * sizeof(float));
for (int i0 = 0; i0 < N; i0 += BLOCK) {
for (int j0 = 0; j0 < N; j0 += BLOCK) {
for (int k0 = 0; k0 < N; k0 += BLOCK) {
for (int i = i0; i < i0+BLOCK; i++) {
for (int k = k0; k < k0+BLOCK; k++) {
__m256 va = _mm256_set1_ps(A[i*N + k]);
for (int j = j0; j < j0+BLOCK; j += 8) {
__m256 vb = _mm256_loadu_ps(&B[k*N + j]);
__m256 vc = _mm256_loadu_ps(&C[i*N + j]);
vc = _mm256_fmadd_ps(va, vb, vc);
_mm256_storeu_ps(&C[i*N + j], vc);
}
}
}
}
}
}
}
// Additional 8× from SIMD → ~50× total speedup vs naive
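A reasonable starting point for BLOCK is the largest tile size where three float tiles (one each from A, B, and C) fit in L1. This sketch assumes a 32 KB L1 data cache; query the real size (e.g. via `sysconf` on Linux) before relying on it:

```cpp
#include <cassert>
#include <cstddef>

// Largest power-of-two tile size B such that three B×B float tiles
// fit within the given cache capacity.
constexpr size_t pick_block(size_t cache_bytes) {
    size_t b = 1;
    while (3 * (2 * b) * (2 * b) * sizeof(float) <= cache_bytes) {
        b *= 2;  // doubling still fits, so grow the tile
    }
    return b;
}
```

With a 32 KB L1 this yields 32. Larger blocks can still pay off when only two of the three tiles need to stay resident at once, which is one reason BLOCK = 64 is also a common choice in practice.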
(Figure: GEMM performance comparison of naive, blocked, blocked + AVX, and OpenBLAS implementations.)
Software Prefetching
Prefetching loads data into cache before it's needed, hiding memory latency. Modern CPUs do hardware prefetching for sequential access patterns, but software prefetching helps with irregular patterns:
#include <xmmintrin.h> // For _mm_prefetch
// Prefetch hints:
// _MM_HINT_T0 - Prefetch to all cache levels (L1, L2, L3)
// _MM_HINT_T1 - Prefetch to L2 and L3
// _MM_HINT_T2 - Prefetch to L3 only
// _MM_HINT_NTA - Non-temporal, bypass cache
// Without prefetch - stalls waiting for memory
void indirect_access_naive(float* data, int* indices, int n) {
float sum = 0;
for (int i = 0; i < n; i++) {
sum += data[indices[i]]; // Cache miss → 200 cycle stall
}
}
// With prefetch - overlap memory access with computation
void indirect_access_prefetch(float* data, int* indices, int n) {
const int PREFETCH_DISTANCE = 16; // Tune based on latency
float sum = 0;
for (int i = 0; i < n; i++) {
// Prefetch data for future iteration
if (i + PREFETCH_DISTANCE < n) {
_mm_prefetch((char*)&data[indices[i + PREFETCH_DISTANCE]], _MM_HINT_T0);
}
// Process current element (hopefully in cache now)
sum += data[indices[i]];
}
}
// Prefetch distance calculation:
// Distance = Memory_Latency / Time_Per_Iteration
// Example: 200 cycles latency / 10 cycles per iteration = 20 iterations ahead
// Linked list traversal with prefetch
struct Node { Node* next; int data; };
int sum_list_prefetch(Node* head) {
int sum = 0;
Node* curr = head;
Node* prefetch_ptr = head;
// Advance prefetch pointer ahead
for (int i = 0; i < 8 && prefetch_ptr; i++) {
prefetch_ptr = prefetch_ptr->next;
}
while (curr) {
// Prefetch 8 nodes ahead
if (prefetch_ptr) {
_mm_prefetch((char*)prefetch_ptr, _MM_HINT_T0);
prefetch_ptr = prefetch_ptr->next;
}
sum += curr->data;
curr = curr->next;
}
return sum;
}
Too early: Data evicted before use. Too late: Still waiting for memory. Too many: Pollutes cache, evicting useful data. Start with prefetch distance = memory_latency / loop_time, then tune with profiling. Modern CPUs have excellent hardware prefetchers—software prefetching mainly helps irregular access patterns.
Memory Alignment
Misaligned memory access can cause performance penalties or require multiple cache line fetches:
// Allocate aligned memory for SIMD operations
// C++11 aligned allocation
alignas(32) float data[1024]; // 32-byte aligned for AVX
// Dynamic allocation (C++17)
float* ptr = new (std::align_val_t(64)) float[1024]; // 64-byte for cache line
// POSIX aligned allocation
void* ptr;
posix_memalign(&ptr, 64, 1024 * sizeof(float));
// Aligned SIMD loads are faster than unaligned
void process_aligned(float* data, int n) {
// Assumes data is 32-byte aligned
for (int i = 0; i < n; i += 8) {
__m256 v = _mm256_load_ps(&data[i]); // Faster, requires alignment
// vs _mm256_loadu_ps for unaligned
}
}
// Padding to avoid false sharing in multithreaded code
struct alignas(64) PaddedCounter {
std::atomic<int> count;
char padding[60]; // Ensure each counter is on its own cache line
};
PaddedCounter counters[NUM_THREADS]; // No false sharing between threads
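Another C++17 option is `std::aligned_alloc`, which requires the size to be a multiple of the alignment. A small sketch with an explicit alignment check:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// True if pointer p lies on a `boundary`-byte boundary.
bool is_aligned(const void* p, size_t boundary) {
    return reinterpret_cast<uintptr_t>(p) % boundary == 0;
}

// Cache-line-aligned float buffer. The byte count is rounded up to a
// multiple of 64 because std::aligned_alloc requires size % alignment == 0.
float* alloc_cacheline_floats(size_t n) {
    size_t bytes = ((n * sizeof(float) + 63) / 64) * 64;
    return static_cast<float*>(std::aligned_alloc(64, bytes));
}
```

Free the result with `std::free`, not `delete[]`.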
5. CPU Compute Optimizations
For compute-bound CPU code, the goal is to maximize instruction throughput. This means using SIMD, increasing ILP, and keeping the pipeline full.
Vectorization Deep Dive
Modern compilers can auto-vectorize simple loops, but often need help:
// Compile with: g++ -O3 -march=native -fopt-info-vec-optimized
// ✅ Auto-vectorizes: Simple loop, no dependencies
void saxpy(float* y, float a, float* x, int n) {
for (int i = 0; i < n; i++) {
y[i] = a * x[i] + y[i];
}
}
// ❌ Won't vectorize: Potential aliasing
void copy_overlap(float* dst, float* src, int n) {
for (int i = 0; i < n; i++) {
dst[i] = src[i]; // dst and src might overlap!
}
}
// ✅ Fix: Use restrict to promise no aliasing
void copy_restrict(float* __restrict__ dst,
float* __restrict__ src, int n) {
for (int i = 0; i < n; i++) {
dst[i] = src[i]; // Compiler knows they don't overlap
}
}
// ❌ Won't vectorize: Function call in loop
void process_external(float* data, int n) {
for (int i = 0; i < n; i++) {
data[i] = external_func(data[i]); // Can't inline
}
}
// ❌ Won't vectorize: Data-dependent branch
void process_conditional(float* data, int n) {
for (int i = 0; i < n; i++) {
if (data[i] > 0) {
data[i] = sqrt(data[i]);
}
}
}
// ✅ Fix: Use masked operations
void process_masked(float* data, int n) {
for (int i = 0; i < n; i += 8) {
__m256 v = _mm256_loadu_ps(&data[i]);
__m256 zero = _mm256_setzero_ps();
__m256 mask = _mm256_cmp_ps(v, zero, _CMP_GT_OQ);
__m256 sqrtv = _mm256_sqrt_ps(v);
__m256 result = _mm256_blendv_ps(v, sqrtv, mask);
_mm256_storeu_ps(&data[i], result);
}
}
// Pragma hints (compiler-specific)
void process_pragma(float* data, int n, int offset) {
#pragma omp simd // OpenMP SIMD directive
for (int i = 0; i < n; i++) {
data[i] = data[i] * data[i];
}
#pragma GCC ivdep // GCC: promise there are no vector dependencies
for (int i = offset; i < n; i++) {
data[i] = data[i - offset]; // Legal only if offset >= vector width
}
}
| Vectorization Blocker | Fix |
|---|---|
| Pointer aliasing | __restrict__ keyword |
| Non-unit stride | Restructure data layout (SoA) |
| Data-dependent branches | Masked operations, branchless code |
| Loop-carried dependencies | Restructure algorithm, reduction clauses |
| Function calls | Inline, SIMD math libraries (SVML) |
| Unknown trip count | Use #pragma omp simd |
Loop Unrolling & ILP
Loop unrolling reduces loop overhead and exposes Instruction-Level Parallelism (ILP)—independent operations that can execute in parallel on superscalar CPUs:
// Single accumulator - limited by add latency (4 cycles)
float sum_naive(float* data, int n) {
float sum = 0;
for (int i = 0; i < n; i++) {
sum += data[i]; // Each add depends on previous
} // Throughput: 0.25 adds/cycle
return sum;
}
// Multiple accumulators - breaks dependency chain
float sum_unrolled(float* data, int n) {
float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
float sum4 = 0, sum5 = 0, sum6 = 0, sum7 = 0;
int i = 0;
for (; i < n - 7; i += 8) {
sum0 += data[i]; // All 8 additions are independent
sum1 += data[i + 1]; // CPU can issue 2/cycle
sum2 += data[i + 2]; // Throughput: 2 adds/cycle
sum3 += data[i + 3];
sum4 += data[i + 4];
sum5 += data[i + 5];
sum6 += data[i + 6];
sum7 += data[i + 7];
}
// Handle remainder
for (; i < n; i++) {
sum0 += data[i];
}
// Combine accumulators
return (sum0 + sum1) + (sum2 + sum3) + (sum4 + sum5) + (sum6 + sum7);
}
// SIMD unrolled - best of both worlds
float sum_simd_unrolled(float* data, int n) {
__m256 vsum0 = _mm256_setzero_ps();
__m256 vsum1 = _mm256_setzero_ps();
__m256 vsum2 = _mm256_setzero_ps();
__m256 vsum3 = _mm256_setzero_ps();
int i = 0;
for (; i < n - 31; i += 32) {
vsum0 = _mm256_add_ps(vsum0, _mm256_loadu_ps(&data[i]));
vsum1 = _mm256_add_ps(vsum1, _mm256_loadu_ps(&data[i + 8]));
vsum2 = _mm256_add_ps(vsum2, _mm256_loadu_ps(&data[i + 16]));
vsum3 = _mm256_add_ps(vsum3, _mm256_loadu_ps(&data[i + 24]));
}
// Combine vectors
vsum0 = _mm256_add_ps(_mm256_add_ps(vsum0, vsum1),
_mm256_add_ps(vsum2, vsum3));
// Horizontal sum (reduce 8 floats to 1)
__m128 lo = _mm256_castps256_ps128(vsum0);
__m128 hi = _mm256_extractf128_ps(vsum0, 1);
lo = _mm_add_ps(lo, hi);
lo = _mm_hadd_ps(lo, lo);
lo = _mm_hadd_ps(lo, lo);
float sum = _mm_cvtss_f32(lo);
// Handle remainder
for (; i < n; i++) {
sum += data[i];
}
return sum;
}
// Performance comparison (N = 1M floats):
// sum_naive: ~1 GB/s (memory bound, but add latency limited)
// sum_unrolled: ~8 GB/s (near bandwidth limit)
// sum_simd_unrolled: ~40 GB/s (exceeds DRAM, benefits from cache)
Optimal unroll factor = latency × throughput. For FP add: 4 cycles latency × 2/cycle throughput = 8 accumulators. For FMA: 4 cycles × 2/cycle = 8. This ensures enough independent operations to keep all execution units busy.
Fused Multiply-Add (FMA)
FMA computes a * b + c in a single instruction with higher precision (only one rounding) and twice the throughput:
#include <immintrin.h>
// Without FMA: 2 operations
float dot_naive(float* a, float* b, int n) {
float sum = 0;
for (int i = 0; i < n; i++) {
sum += a[i] * b[i]; // mul + add = 2 ops
}
return sum;
}
// With FMA: 1 operation, 2 FLOPs
float dot_fma(float* a, float* b, int n) {
__m256 vsum0 = _mm256_setzero_ps();
__m256 vsum1 = _mm256_setzero_ps();
for (int i = 0; i < n; i += 16) {
__m256 va0 = _mm256_loadu_ps(&a[i]);
__m256 vb0 = _mm256_loadu_ps(&b[i]);
__m256 va1 = _mm256_loadu_ps(&a[i + 8]);
__m256 vb1 = _mm256_loadu_ps(&b[i + 8]);
// FMA: vsum = va * vb + vsum
vsum0 = _mm256_fmadd_ps(va0, vb0, vsum0);
vsum1 = _mm256_fmadd_ps(va1, vb1, vsum1);
}
vsum0 = _mm256_add_ps(vsum0, vsum1);
// Horizontal sum...
__m128 lo = _mm256_castps256_ps128(vsum0);
__m128 hi = _mm256_extractf128_ps(vsum0, 1);
lo = _mm_add_ps(lo, hi);
lo = _mm_hadd_ps(lo, lo);
lo = _mm_hadd_ps(lo, lo);
return _mm_cvtss_f32(lo);
}
// FMA variants:
// _mm256_fmadd_ps(a, b, c) → a*b + c
// _mm256_fmsub_ps(a, b, c) → a*b - c
// _mm256_fnmadd_ps(a, b, c) → -(a*b) + c = c - a*b
// _mm256_fnmsub_ps(a, b, c) → -(a*b) - c
// Enable FMA in compiler for auto-vectorization:
// gcc/clang: -mfma or -march=haswell (or newer)
// MSVC: /arch:AVX2
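The single-rounding property is easy to demonstrate with `std::fma` from `<cmath>`. The `volatile` below stops the compiler from contracting the separate multiply and subtract into an FMA on its own:

```cpp
#include <cassert>
#include <cmath>

// a*a - 1 computed with two roundings: the 2^-60 tail of a*a is lost
// when the product is rounded to a double.
double sq_minus_one_naive(double a) {
    volatile double prod = a * a;  // force the product to be rounded and stored
    return prod - 1.0;
}

// Same expression with one rounding: fma keeps the full product internally.
double sq_minus_one_fma(double a) {
    return std::fma(a, a, -1.0);
}
```

For a = 1 + 2⁻³⁰, the naive version returns exactly 2⁻²⁹ while the FMA version also recovers the 2⁻⁶⁰ term.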
Branchless Programming
Replacing branches with arithmetic operations eliminates branch misprediction penalties:
// Branchy: Branch misprediction on random data
int abs_branchy(int x) {
if (x < 0) return -x;
return x;
}
// Branchless: No prediction needed
int abs_branchless(int x) {
int mask = x >> 31; // All 1s if negative, all 0s if positive
return (x ^ mask) - mask;
}
// Min/Max
int min_branchy(int a, int b) {
return (a < b) ? a : b; // May branch or use CMOV
}
int min_branchless(int a, int b) {
int diff = a - b;
int mask = diff >> 31; // mask = -1 if a < b, else 0
return b + (diff & mask);
}
// Clamp
float clamp_branchy(float x, float lo, float hi) {
if (x < lo) return lo;
if (x > hi) return hi;
return x;
}
float clamp_branchless(float x, float lo, float hi) {
return fmaxf(fminf(x, hi), lo); // Compiles to branch-free min/max instructions
}
// Conditional increment
int count = 0;
// Branchy
for (int i = 0; i < n; i++) {
if (data[i] > threshold) count++; // Mispredicts ~50% on random
}
// Branchless
for (int i = 0; i < n; i++) {
count += (data[i] > threshold); // Bool converts to 0 or 1
}
// SIMD branchless comparison
int count_simd(float* data, float threshold, int n) {
__m256 vthresh = _mm256_set1_ps(threshold);
__m256i vcount = _mm256_setzero_si256();
__m256i vone = _mm256_set1_epi32(1);
for (int i = 0; i < n; i += 8) {
__m256 v = _mm256_loadu_ps(&data[i]);
__m256 mask = _mm256_cmp_ps(v, vthresh, _CMP_GT_OQ);
__m256i imask = _mm256_castps_si256(mask);
vcount = _mm256_sub_epi32(vcount, imask); // -(-1) = +1
}
// Horizontal sum of 8 ints
int result[8];
_mm256_storeu_si256((__m256i*)result, vcount);
return result[0] + result[1] + result[2] + result[3] +
result[4] + result[5] + result[6] + result[7];
}
7. CPU Parallelization
Modern CPUs have multiple cores, and utilizing all of them is essential for achieving peak throughput. However, parallel programming introduces new challenges: synchronization overhead, load imbalance, and cache coherency issues.
OpenMP Basics
OpenMP is the most common approach to add parallelism to existing C/C++ code. It uses compiler directives (pragmas) to parallelize loops and regions.
| Directive | Purpose | Use Case |
|---|---|---|
#pragma omp parallel for |
Parallelize a for loop | Independent iterations |
#pragma omp parallel for reduction(+:sum) |
Parallel reduction | Accumulating values |
#pragma omp parallel sections |
Different code blocks in parallel | Task parallelism |
#pragma omp simd |
Hint for SIMD vectorization | Force vectorization |
#pragma omp task |
Create asynchronous task | Dynamic workloads |
// Basic parallel for - each iteration runs on different thread
#pragma omp parallel for
for (int i = 0; i < N; i++) {
result[i] = expensive_computation(data[i]);
}
// Reduction - thread-safe accumulation
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += data[i]; // Each thread has private sum, combined at end
}
// Nested parallelism with collapse
#pragma omp parallel for collapse(2)
for (int i = 0; i < M; i++) {
for (int j = 0; j < N; j++) {
matrix[i][j] = compute(i, j);
}
}
// Schedule types for load balancing
#pragma omp parallel for schedule(static) // Equal chunks, best for uniform work
#pragma omp parallel for schedule(dynamic) // Grab work as available, best for variable work
#pragma omp parallel for schedule(guided) // Decreasing chunk sizes, good balance
static: Best for uniform workloads (each iteration takes same time). Lowest overhead.
dynamic: Best for irregular workloads (some iterations much longer). Higher overhead but better load balance.
guided: Compromise - starts with large chunks, shrinks over time.
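A minimal example of where `schedule(dynamic)` earns its overhead: iteration cost grows with `i`, so equal static chunks would leave the last thread with most of the work. The pragma degrades to a harmless no-op when compiled without `-fopenmp`:

```cpp
#include <cassert>
#include <cstdint>

// Triangular workload: iteration i does O(i) work. Dynamic scheduling
// hands out small chunks on demand so threads finish at about the same time.
uint64_t triangular_work(int n) {
    uint64_t total = 0;
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; i++) {
        uint64_t local = 0;
        for (int j = 0; j <= i; j++) local += static_cast<uint64_t>(j);
        total += local;
    }
    return total;
}
```

The `reduction(+:total)` clause gives each thread a private accumulator and combines them at the end, so the result is identical serial or parallel.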
False Sharing
False sharing is the silent killer of parallel performance. It occurs when threads on different cores write to different variables that happen to share the same cache line (typically 64 bytes).
// BAD: False sharing - counters share cache line
struct BadCounters {
int64_t counter[8]; // 64 bytes total, fits in ONE cache line!
};
BadCounters bad;
#pragma omp parallel for
for (int t = 0; t < 8; t++) {
for (int i = 0; i < 1000000; i++) {
bad.counter[t]++; // All threads thrash the same cache line!
}
}
// Time: ~500ms (horrible!)
// GOOD: Pad to separate cache lines
struct PaddedCounter {
int64_t counter;
char padding[56]; // Pad to 64 bytes (cache line size)
};
struct GoodCounters {
PaddedCounter counter[8]; // Each counter on its own cache line
};
GoodCounters good;
#pragma omp parallel for
for (int t = 0; t < 8; t++) {
for (int i = 0; i < 1000000; i++) {
good.counter[t].counter++; // No contention!
}
}
// Time: ~5ms (100× faster!)
// C++17: Use alignas for cleaner padding
struct alignas(64) CacheLineCounter {
int64_t value;
};
Use perf c2c on Linux to detect cache-line contention. Look for high "HITM" (Hit Modified) counts. Intel VTune also reports false sharing directly under "Memory Access" analysis.
NUMA Awareness
NUMA (Non-Uniform Memory Access) architectures have multiple memory controllers. Memory access is fast when accessing local memory but slow when crossing NUMA boundaries.
#include <numa.h> // link with -lnuma
// Check NUMA topology
int num_nodes = numa_num_configured_nodes();
printf("NUMA nodes: %d\n", num_nodes);
// Allocate memory on specific NUMA node
void* local_mem = numa_alloc_onnode(size, 0); // Node 0
void* interleaved = numa_alloc_interleaved(size); // Spread across nodes
// Bind thread to NUMA node
numa_run_on_node(0); // Run this thread on node 0
// First-touch policy: memory allocated where first accessed
float* data = (float*)malloc(N * sizeof(float));
// Initialize in parallel - each thread touches its portion
#pragma omp parallel for
for (int i = 0; i < N; i++) {
data[i] = 0.0f; // Memory pages allocated on thread's local node
}
// Later computation accesses local memory
#pragma omp parallel for
for (int i = 0; i < N; i++) {
data[i] = compute(data[i]); // Fast local access!
}
Thread Pools & Task Scheduling
Creating threads is expensive (~10-100μs). For fine-grained parallelism, use thread pools that reuse worker threads.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
class ThreadPool {
private:
std::vector<std::thread> workers;
std::queue<std::function<void()>> tasks;
std::mutex queue_mutex;
std::condition_variable condition;
bool stop = false;
public:
ThreadPool(size_t num_threads) {
for (size_t i = 0; i < num_threads; i++) {
workers.emplace_back([this] {
while (true) {
std::function<void()> task;
{
std::unique_lock<std::mutex> lock(queue_mutex);
condition.wait(lock, [this] {
return stop || !tasks.empty();
});
if (stop && tasks.empty()) return;
task = std::move(tasks.front());
tasks.pop();
}
task(); // Execute task
}
});
}
}
template<class F>
void enqueue(F&& f) {
{
std::unique_lock<std::mutex> lock(queue_mutex);
tasks.emplace(std::forward<F>(f));
}
condition.notify_one();
}
~ThreadPool() {
{ std::unique_lock<std::mutex> lock(queue_mutex); stop = true; }
condition.notify_all();
for (auto& w : workers) w.join();
}
};
// Usage
ThreadPool pool(std::thread::hardware_concurrency());
for (int i = 0; i < 10000; i++) {
pool.enqueue([i] {
process_item(i); // No thread creation overhead!
});
}
| Parallelization Technique | Helps Memory-Bound? | Helps Compute-Bound? | Best For |
|---|---|---|---|
| OpenMP parallel for | Sometimes (more bandwidth) | Yes (more cores) | Data parallelism |
| False sharing fix | Yes (cache efficiency) | Yes (reduces stalls) | Shared counters/accumulators |
| NUMA awareness | Yes (local bandwidth) | Somewhat | Large memory workloads |
| Thread pools | No | Yes (reduces overhead) | Many small tasks |
8. GPU Architecture Fundamentals
GPUs are fundamentally different from CPUs. While CPUs optimize for latency (fast single-thread execution), GPUs optimize for throughput (maximum total work per second) by running thousands of threads simultaneously.
Streaming Multiprocessors (SMs)
A GPU is composed of many Streaming Multiprocessors (SMs). Each SM is like a mini processor with its own:
- CUDA Cores: Execute integer and floating-point operations
- Tensor Cores: Specialized matrix multiply-accumulate units
- Shared Memory: Fast on-chip scratchpad (64-164 KB)
- L1 Cache: Transparent cache (combined with shared memory)
- Warp Schedulers: Issue instructions to execution units
- Register File: Very large (256 KB) to support many threads
| GPU | SMs | FP32 Cores | Tensor Cores | Shared Mem/SM | HBM BW |
|---|---|---|---|---|---|
| RTX 3090 | 82 | 10,496 | 328 | 128 KB | 936 GB/s |
| A100 | 108 | 6,912 | 432 | 192 KB | 2,039 GB/s |
| H100 | 132 | 16,896 | 528 | 228 KB | 3,350 GB/s |
| RTX 4090 | 128 | 16,384 | 512 | 128 KB | 1,008 GB/s |
Warps & SIMT Execution
GPU threads are grouped into warps of 32 threads. All threads in a warp execute the same instruction at the same time (SIMT = Single Instruction, Multiple Threads). This is both the source of GPU power and the cause of performance pitfalls.
// BAD: Heavy divergence within warp
__global__ void divergent_kernel(float* data, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
// Each thread in warp may take different path!
if (data[idx] > 0.5f) {
data[idx] = expensive_pathA(data[idx]); // Some threads
} else {
data[idx] = expensive_pathB(data[idx]); // Other threads
}
}
// BETTER: Sort data so adjacent threads take same path
// Or restructure to minimize branch impact
// GOOD: Branch on warp-uniform condition
__global__ void uniform_kernel(float* data, int n, bool usePathA) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
// All threads in warp take same path (uniform)
if (usePathA) {
data[idx] = expensive_pathA(data[idx]);
} else {
data[idx] = expensive_pathB(data[idx]);
}
}
GPU Memory Hierarchy
GPUs have a complex memory hierarchy. Using the right level is critical for performance:
| Memory Type | Scope | Latency | Bandwidth | Size (A100) | Use Case |
|---|---|---|---|---|---|
| Registers | Thread | 0 cycles | ~20 TB/s | 256 KB/SM | Local variables |
| Shared Memory | Block | ~25 cycles | ~19 TB/s | 164 KB/SM | Inter-thread communication |
| L1 Cache | SM | ~30 cycles | ~12 TB/s | 192 KB/SM | Automatic caching |
| L2 Cache | Device | ~200 cycles | ~4 TB/s | 40 MB | Cross-SM sharing |
| Global (HBM) | Device | ~400 cycles | 2 TB/s | 80 GB | Large datasets |
| Host (PCIe) | System | ~10,000 cycles | 32 GB/s | System RAM | CPU↔GPU transfer |
Occupancy & Latency Hiding
Occupancy is the ratio of active warps to the maximum possible warps per SM. High occupancy helps hide memory latency by allowing the SM to switch to other warps while waiting for memory.
// Query theoretical occupancy
#include <cuda_runtime.h>
#include <stdio.h>
__global__ void myKernel(float* data) {
// Kernel code here
}
void analyzeOccupancy() {
int blockSize = 256;
int minGridSize;
int maxBlockSize;
// Let CUDA suggest optimal block size
cudaOccupancyMaxPotentialBlockSize(
&minGridSize,
&maxBlockSize,
myKernel,
0, // Dynamic shared mem
0 // No block size limit
);
printf("Suggested block size: %d\n", maxBlockSize);
// Calculate occupancy for chosen block size
int numBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
myKernel,
blockSize,
0 // Shared mem
);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int maxWarps = prop.maxThreadsPerMultiProcessor / 32;
int activeWarps = numBlocks * (blockSize / 32);
float occupancy = (float)activeWarps / maxWarps;
printf("Occupancy: %.1f%% (%d/%d warps)\n",
occupancy * 100, activeWarps, maxWarps);
}
// Rule of thumb:
// - 50%+ occupancy is usually good
// - Below 25% often indicates a problem
// - 100% not always needed (depends on memory access)
Higher occupancy doesn't always mean better performance! A kernel with 50% occupancy but perfect memory coalescing can outperform 100% occupancy with scattered access. Use nsys and ncu to measure actual performance, not just occupancy.
9. GPU Memory Optimizations
GPU memory bandwidth is massive (2+ TB/s on modern GPUs), but achieving that peak requires careful attention to memory access patterns. Poor patterns can reduce effective bandwidth by 10-32×.
Memory Coalescing
Memory coalescing is the most important GPU memory optimization. When threads in a warp access consecutive memory addresses, the hardware combines them into a single, efficient transaction. When accesses are scattered, each thread requires a separate transaction.
// BAD: Column-major access in row-major storage (strided)
__global__ void bad_transpose(float* out, float* in, int N) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
// Adjacent threads read from different rows (stride = N)
out[col * N + row] = in[row * N + col]; // Strided write!
}
// GOOD: Use shared memory to fix coalescing
__global__ void good_transpose(float* out, float* in, int N) {
__shared__ float tile[32][33]; // +1 to avoid bank conflicts
int x = blockIdx.x * 32 + threadIdx.x;
int y = blockIdx.y * 32 + threadIdx.y;
// Coalesced read from global memory
tile[threadIdx.y][threadIdx.x] = in[y * N + x];
__syncthreads();
// Transpose indices for coalesced write
x = blockIdx.y * 32 + threadIdx.x;
y = blockIdx.x * 32 + threadIdx.y;
// Coalesced write to global memory
out[y * N + x] = tile[threadIdx.x][threadIdx.y];
}
// AoS vs SoA on GPU - same principle as CPU!
struct ParticleAoS { float x, y, z, mass; }; // BAD for GPU
struct ParticlesSoA { // GOOD for GPU
float* x;
float* y;
float* z;
float* mass;
};
Shared Memory Usage
Shared memory is on-chip SRAM shared by all threads in a block. It offers roughly an order of magnitude more bandwidth than global memory at a fraction of the latency, and it's perfect for:
- Inter-thread communication within a block
- Caching frequently accessed data
- Reorganizing memory access patterns (like transpose above)
- Reduction operations
// Static shared memory allocation
__global__ void reduction_kernel(float* input, float* output, int N) {
__shared__ float sdata[256]; // 1KB per block
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Load from global to shared (coalesced)
sdata[tid] = (idx < N) ? input[idx] : 0.0f;
__syncthreads(); // CRITICAL: wait for all loads
// Parallel reduction in shared memory
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads(); // Sync after each step
}
// Write result (only thread 0)
if (tid == 0) output[blockIdx.x] = sdata[0];
}
// Dynamic shared memory (size set at launch)
__global__ void dynamic_smem_kernel(float* data) {
extern __shared__ float smem[]; // Size unknown at compile time
smem[threadIdx.x] = data[threadIdx.x];
__syncthreads();
// Use smem...
}
// Launch with dynamic shared memory
int smem_size = 256 * sizeof(float);
dynamic_smem_kernel<<<1, 256, smem_size>>>(data);
Forgetting __syncthreads() after shared memory writes causes race conditions. Threads may read stale data. Always sync after writing to shared memory before reading!
Bank Conflicts
Shared memory is divided into 32 banks. If multiple threads in a warp access different addresses in the same bank, the accesses are serialized (bank conflict). This can reduce shared memory performance by up to 32×.
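The bank mapping is easy to reason about on the host. The sketch below (plain C++, an illustration rather than device code) counts how many distinct banks the 32 lanes of a warp touch for a given element stride: 32 means conflict-free, 1 means a full 32-way conflict. It also shows why padding a 32-wide tile to 33 columns restores conflict-free column access.

```cpp
#include <set>

// Shared memory is split into 32 banks, 4 bytes wide: for float data,
// bank = word_index % 32. Count the distinct banks hit when lane i of a
// warp accesses element i * stride.
int distinct_banks(int stride_in_floats) {
    std::set<int> banks;
    for (int lane = 0; lane < 32; lane++) {
        banks.insert((lane * stride_in_floats) % 32);
    }
    return static_cast<int>(banks.size());
}

// stride 1  -> 32 distinct banks: conflict-free (row access)
// stride 2  -> 16 distinct banks: 2-way conflict
// stride 32 ->  1 distinct bank : 32-way conflict (column access of tile[32][32])
// stride 33 -> 32 distinct banks: conflict-free (why tiles are padded to [32][33])
```

Any stride that is odd (coprime with 32) maps the 32 lanes onto 32 distinct banks, which is exactly what the +1 padding trick exploits.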
Texture & Constant Memory
GPUs have specialized memory types optimized for specific access patterns:
| Memory Type | Size | Access Pattern | Benefits | Use Case |
|---|---|---|---|---|
| Constant Memory | 64 KB | All threads read same address | Broadcast to warp (1 cycle) | Kernel parameters, lookup tables |
| Texture Memory | Cache-backed | 2D spatial locality | Free interpolation, clamping | Image processing, irregular access |
10. GPU Compute Optimizations
Once memory access is optimized, we can focus on maximizing computational throughput through kernel design optimizations.
Kernel Fusion
Kernel fusion combines multiple operations into a single kernel to reduce memory traffic. Each separate kernel must read its inputs from global memory and write its outputs back; fusing keeps intermediate values in registers instead.
// BAD: Separate kernels for each operation
__global__ void square(float* y, float* x, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) y[i] = x[i] * x[i];
}
__global__ void add_one(float* z, float* y, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) z[i] = y[i] + 1.0f;
}
__global__ void relu(float* out, float* z, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) out[i] = fmaxf(z[i], 0.0f);
}
// Launch 3 kernels, 3x memory traffic
square<<<(n + 255) / 256, 256>>>(y, x, n);
add_one<<<(n + 255) / 256, 256>>>(z, y, n);
relu<<<(n + 255) / 256, 256>>>(out, z, n);
// GOOD: Single fused kernel
__global__ void fused_square_add_relu(float* out, float* x, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
float val = x[i]; // Read once
val = val * val; // Square (register)
val = val + 1.0f; // Add (register)
out[i] = fmaxf(val, 0.0f); // ReLU + Write once
}
}
// Single launch, 1x memory traffic, 2-3x faster!
fused_square_add_relu<<<(n + 255) / 256, 256>>>(out, x, n);
Modern frameworks like PyTorch 2.0 (torch.compile) and JAX automatically fuse operations. NVIDIA's cuDNN fuses common patterns like Conv+BN+ReLU. Custom Triton kernels give you explicit control over fusion.
Mixed Precision
Mixed precision uses lower-precision formats (FP16, BF16, INT8) for most computations while keeping critical values in FP32. This provides:
- 2× memory bandwidth: FP16 is half the size of FP32
- 2-8× compute throughput: Tensor Cores accelerate FP16/BF16/INT8
- 2× memory capacity: Fit larger models in GPU memory
import torch
from torch.cuda.amp import autocast, GradScaler
# Automatic Mixed Precision (AMP) training
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler() # Handles gradient scaling for FP16
for data, target in dataloader:
optimizer.zero_grad()
# Forward pass in FP16 (where safe)
with autocast(dtype=torch.float16): # or torch.bfloat16
output = model(data)
loss = criterion(output, target)
# Backward pass (scaled gradients)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
# What autocast does automatically:
# - MatMuls, Convs → FP16 (Tensor Core accelerated)
# - Reductions, Softmax → FP32 (numerical stability)
# - Loss computation → FP32
# BF16 on Ampere+ (no scaler needed!)
with autocast(dtype=torch.bfloat16):
output = model(data) # Stable without gradient scaling
Thread Coarsening
Thread coarsening makes each thread do more work. Instead of one thread per element, have each thread process multiple elements. This reduces launch overhead and can improve instruction-level parallelism.
// Fine-grained: 1 thread per element
__global__ void fine_kernel(float* out, float* in, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
out[i] = compute(in[i]);
}
}
// Launch: <<<(n + 255) / 256, 256>>>
// Coarsened: 1 thread per 4 elements
#define COARSEN 4
__global__ void coarse_kernel(float* out, float* in, int n) {
int base = (blockIdx.x * blockDim.x + threadIdx.x) * COARSEN;
// Process 4 elements per thread
#pragma unroll
for (int j = 0; j < COARSEN; j++) {
int i = base + j;
if (i < n) {
out[i] = compute(in[i]);
}
}
}
// Launch: <<<(n + 256 * COARSEN - 1) / (256 * COARSEN), 256>>> (4× fewer blocks!)
// Benefits of coarsening:
// - Fewer blocks → less launch overhead
// - Unrolled loop → better ILP
// - Amortize index calculation
// - Can enable register reuse
// When to coarsen:
// - Simple kernels that aren't occupancy-limited
// - Compute-bound kernels
// - When block count is very high
Reducing Warp Divergence
Warp divergence (covered in Section 8) can be mitigated through several techniques:
// BAD: Thread-dependent branching causes divergence
__global__ void divergent(float* out, float* in, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
if (in[i] > 0) {
out[i] = expensive_func_A(in[i]); // Some threads
} else {
out[i] = expensive_func_B(in[i]); // Other threads
}
}
}
// BETTER: Predication for simple operations
__global__ void predicated(float* out, float* in, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
float val = in[i];
// Both paths computed, result selected (no branch)
out[i] = (val > 0) ? val * val : -val;
}
}
// BETTER: Sort data to cluster similar values
// If in[0..999] > 0 and in[1000..1999] <= 0,
// different warps take different paths (no intra-warp divergence)
// BETTER: Warp-level voting for uniform branches
__global__ void warp_vote(float* out, float* in, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
bool cond = in[i] > 0;
// Check if ALL threads in warp agree
if (__all_sync(0xffffffff, cond)) {
// Uniform path - no divergence
out[i] = expensive_func_A(in[i]);
} else if (__all_sync(0xffffffff, !cond)) {
out[i] = expensive_func_B(in[i]);
} else {
// Mixed - must diverge, use simpler fallback
out[i] = simple_func(in[i]);
}
}
}
| GPU Optimization | Helps Memory-Bound? | Helps Compute-Bound? | When to Apply |
|---|---|---|---|
| Memory Coalescing | Yes (critical!) | Somewhat | Always check first |
| Shared Memory | Yes | Yes (reuse) | Data reuse within block |
| Bank Conflict Fix | Yes | No | When using shared memory |
| Kernel Fusion | Yes | Somewhat | Multiple consecutive ops |
| Mixed Precision | Yes (2×) | Yes (2-16×) | Almost always (DL) |
| Thread Coarsening | No | Yes | Simple, high-occupancy kernels |
| Reduce Divergence | No | Yes | Data-dependent branches |
11. GPU Execution Optimizations
Beyond kernel-level optimizations, how you launch and orchestrate GPU work significantly impacts performance. CUDA provides mechanisms to overlap computation with data transfer and reduce launch overhead.
CUDA Streams
A CUDA stream is a sequence of operations that execute in order on the GPU. Operations in different streams can execute concurrently, enabling overlap of:
- Host-to-Device (H2D) memory transfers
- Kernel execution
- Device-to-Host (D2H) memory transfers
// Create streams
cudaStream_t stream[2];
for (int i = 0; i < 2; i++) {
cudaStreamCreate(&stream[i]);
}
// Divide work into chunks for overlap
int chunk_size = N / 2;
for (int i = 0; i < 2; i++) {
int offset = i * chunk_size;
// Async H2D copy (non-blocking)
cudaMemcpyAsync(d_in + offset, h_in + offset,
chunk_size * sizeof(float),
cudaMemcpyHostToDevice, stream[i]);
// Launch kernel in stream
myKernel<<<(chunk_size + 255) / 256, 256, 0, stream[i]>>>(d_out + offset, d_in + offset, chunk_size);
// Async D2H copy
cudaMemcpyAsync(h_out + offset, d_out + offset,
chunk_size * sizeof(float),
cudaMemcpyDeviceToHost, stream[i]);
}
// Wait for all streams
cudaDeviceSynchronize();
// Cleanup
for (int i = 0; i < 2; i++) {
cudaStreamDestroy(stream[i]);
}
// IMPORTANT: Host memory must be pinned for async transfers!
float* h_pinned;
cudaMallocHost(&h_pinned, N * sizeof(float)); // Pinned memory
// Regular malloc() will cause cudaMemcpyAsync to be synchronous!
Async memory transfers (cudaMemcpyAsync) only work with pinned (page-locked) host memory allocated with cudaMallocHost. Regular malloc memory forces synchronous transfers!
CUDA Graphs
CUDA Graphs capture a sequence of GPU operations (kernels, memory copies) into a graph that can be launched with a single API call. This dramatically reduces CPU overhead for repetitive workloads.
// Method 1: Stream capture (easiest)
cudaGraph_t graph;
cudaGraphExec_t graphExec;
// Capture operations into a graph
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
{
// These operations are recorded, not executed
cudaMemcpyAsync(d_in, h_in, size, cudaMemcpyHostToDevice, stream);
kernel1<<<grid, block, 0, stream>>>(d_tmp, d_in);
kernel2<<<grid, block, 0, stream>>>(d_out, d_tmp);
cudaMemcpyAsync(h_out, d_out, size, cudaMemcpyDeviceToHost, stream);
}
cudaStreamEndCapture(stream, &graph);
// Instantiate executable graph (one-time cost)
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
// Execute graph many times (very fast!)
for (int iter = 0; iter < 10000; iter++) {
cudaGraphLaunch(graphExec, stream); // ~5μs per launch!
}
cudaGraphExecDestroy(graphExec);
cudaGraphDestroy(graph);
// PyTorch CUDA Graphs
// ===================
// g = torch.cuda.CUDAGraph()
// with torch.cuda.graph(g):
// output = model(input) # Captured
//
// # Replay graph (fast!)
// for _ in range(1000):
// g.replay()
Async Memory Operations
Modern GPUs (Ampere and later) support asynchronous copies from global to shared memory that bypass the register file, allowing the compute units to keep working while the data is in flight.
// Async copy from global to shared memory (Ampere+)
#include <cooperative_groups.h>
#include <cuda/pipeline>
__global__ void async_copy_kernel(float* global_data) {
__shared__ float smem[256];
// Cooperative group for async operations
auto block = cooperative_groups::this_thread_block();
// Create pipeline for async copies
__shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 1> pipe_state;
auto pipe = cuda::make_pipeline(block, &pipe_state);
// Initiate async copy (returns immediately)
pipe.producer_acquire();
cuda::memcpy_async(
smem + threadIdx.x, // Destination (shared)
global_data + threadIdx.x, // Source (global)
sizeof(float),
pipe
);
pipe.producer_commit();
// Do other work while copy is in flight!
float local_compute = expensive_calculation();
// Wait for copy to complete
pipe.consumer_wait();
// Now use smem data
float result = smem[threadIdx.x] + local_compute;
}
12. Case Studies
Let's apply the optimization techniques we've learned to real-world problems.
Case Study 1: Matrix Multiplication
Matrix multiplication (GEMM) is the most important operation in deep learning. Let's trace the optimization journey from naive to highly optimized:
// Tiled matrix multiplication with shared memory
#define TILE_SIZE 32
__global__ void matmul_tiled(float* C, float* A, float* B, int N) {
__shared__ float As[TILE_SIZE][TILE_SIZE];
__shared__ float Bs[TILE_SIZE][TILE_SIZE + 1]; // +1 avoids bank conflicts
int row = blockIdx.y * TILE_SIZE + threadIdx.y;
int col = blockIdx.x * TILE_SIZE + threadIdx.x;
float sum = 0.0f;
// Loop over tiles
for (int t = 0; t < N / TILE_SIZE; t++) {
// Collaborative load: each thread loads one element
As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
__syncthreads();
// Compute partial dot product from shared memory
#pragma unroll
for (int k = 0; k < TILE_SIZE; k++) {
sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
}
__syncthreads();
}
C[row * N + col] = sum;
}
// Why this is faster:
// - Each element loaded once from global, reused TILE_SIZE times
// - Arithmetic intensity: O(TILE_SIZE) ops per byte loaded
// - Coalesced global loads, fast shared memory access
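The reuse argument can be made quantitative: in each tile step a thread loads one element of A and one of B (8 bytes of global traffic) and then performs TILE_SIZE multiply-adds (2 × TILE_SIZE FLOPs) out of shared memory. A small host-side check of the resulting arithmetic intensity (illustrative, not part of the kernel):

```cpp
// Arithmetic intensity of the tiled kernel, in FLOP per byte of global traffic.
// Per tile step and per thread: 2 float loads, TILE_SIZE fused multiply-adds.
double tiled_matmul_intensity(int tile_size) {
    double flops = 2.0 * tile_size;      // TILE_SIZE FMAs = 2 * TILE_SIZE FLOPs
    double bytes = 2.0 * sizeof(float);  // one element of A + one of B
    return flops / bytes;
}

// Naive kernel (tile_size = 1): 2 FLOPs per 8 bytes = 0.25 FLOP/byte.
// Tiled with TILE_SIZE = 32:   64 FLOPs per 8 bytes = 8 FLOP/byte, a 32x jump.
```

That 32× jump in intensity is what moves the kernel off the memory-bound region of the roofline, which is why cuBLAS-class GEMMs tile even more aggressively (registers on top of shared memory).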
Case Study 2: Flash Attention
Flash Attention is a landmark optimization that makes Transformer attention memory-efficient by fusing operations and using tiled computation to avoid materializing the full N×N attention matrix.
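The core trick is the online softmax: scores are processed tile by tile while a running maximum and running sum keep the result exact, so the full N-length score row never has to exist at once. A host-side C++ sketch for a single query row (an illustration of the idea, not the CUDA kernel; scores are precomputed here for clarity):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Online-softmax attention for one query row, processing the N scores in
// tiles. The running max (m) and running sum (l) are rescaled as each new
// tile arrives, so the result equals full softmax(scores) @ V exactly.
std::vector<float> tiled_attention_row(const std::vector<float>& scores,
                                       const std::vector<std::vector<float>>& V,
                                       int tile) {
    int N = static_cast<int>(scores.size());
    int d = static_cast<int>(V[0].size());
    float m = -INFINITY, l = 0.0f;
    std::vector<float> acc(d, 0.0f);          // running weighted sum of V rows
    for (int start = 0; start < N; start += tile) {
        int end = std::min(start + tile, N);
        float m_new = m;
        for (int j = start; j < end; j++) m_new = std::max(m_new, scores[j]);
        float rescale = std::exp(m - m_new);  // correct old terms for new max
        l *= rescale;
        for (int k = 0; k < d; k++) acc[k] *= rescale;
        for (int j = start; j < end; j++) {
            float p = std::exp(scores[j] - m_new);
            l += p;
            for (int k = 0; k < d; k++) acc[k] += p * V[j][k];
        }
        m = m_new;
    }
    for (int k = 0; k < d; k++) acc[k] /= l;  // normalize once at the end
    return acc;
}
```

In the real kernel each tile of K and V is staged in shared memory and the rescaling happens in registers, so global memory traffic is O(N·d) instead of O(N²).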
Case Study 3: LLM Inference Optimization
Large Language Model inference faces unique challenges: the autoregressive decode phase generates one token at a time, making it heavily memory-bound.
| Optimization | Technique | Speedup | Trade-off |
|---|---|---|---|
| KV Cache | Cache attention keys/values | ~N× (sequence length) | Memory usage grows with context |
| Continuous Batching | Dynamically batch requests | 2-4× throughput | Latency variance |
| INT8 Quantization | 8-bit weights | 2× memory, ~1.8× speed | < 1% accuracy loss |
| INT4/AWQ | 4-bit weights | 4× memory, ~3× speed | ~1-2% accuracy loss |
| PagedAttention (vLLM) | Virtual memory for KV cache | ~24× throughput | Implementation complexity |
| Flash Attention | Fused, tiled attention | 2-4× attention speed | None (pure win) |
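To see why decode is memory-bound, it helps to put numbers on the KV cache. Its footprint per sequence is 2 (K and V) × layers × KV heads × head dimension × sequence length × bytes per element; the model shape below is an illustrative 7B-class configuration, not taken from the text:

```cpp
#include <cstdint>

// KV-cache footprint for one sequence:
// 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_elem.
uint64_t kv_cache_bytes(uint64_t layers, uint64_t kv_heads, uint64_t head_dim,
                        uint64_t seq_len, uint64_t bytes_per_elem) {
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem;
}

// Example: 32 layers, 32 KV heads, head_dim 128, 4096-token context, FP16:
// 2 * 32 * 32 * 128 * 4096 * 2 = 2 GiB per sequence -- and every generated
// token must stream this cache through the memory system once.
```

Halving bytes per element (quantization) or the number of KV heads (grouped-query attention) halves both the footprint and the per-token memory traffic, which is why these techniques translate directly into decode speedups.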
13. Conclusion
Performance optimization is fundamentally about understanding where your code spends time and what limits it. The key insights from this guide:
Key Takeaways
- Profile first — Never optimize blind. Use perf, VTune, nsys, ncu.
- Identify the bottleneck — Memory-bound or compute-bound? Use roofline model.
- Check your algorithm — no amount of tuning makes O(n²) beat O(n log n) at scale.
- Use optimized libraries — cuBLAS, MKL, Flash Attention before custom code.
- Optimize data movement — Most code is memory-bound; reduce traffic.
- Match data layout to access — SoA for vectorization, coalescing for GPU.
- Exploit parallelism — SIMD on the CPU, multi-threading across cores, thousands of threads on the GPU.
- Fuse operations — Keep data in fast memory (registers, cache, shared).
- Use lower precision — FP16/BF16/INT8 when accuracy permits.
- Profile again — Verify improvement, watch for new bottlenecks.
Further Reading
- Roofline Model: Previous post in this series
- CUDA Programming Guide: NVIDIA Documentation
- Flash Attention Paper: Dao et al., 2022
- What Every Programmer Should Know About Memory: Drepper, 2007
- Triton Tutorials: Next post in this series
This comprehensive guide covered CPU and GPU optimization from fundamentals to advanced techniques. The journey from understanding bottlenecks to writing optimized code requires practice—start profiling your own code and apply these principles. Happy optimizing!