1. Introduction: The Memory Wall

Modern hardware performance is a balancing act between computation speed (how fast we can do math) and memory bandwidth (how fast we can move data). Understanding this balance is the foundation of all performance optimization.

In the early days of computing, processors and memory operated at similar speeds. A CPU could fetch data, perform calculations, and store results without waiting. But starting in the 1980s, processor speeds began growing exponentially while memory speeds improved only linearly:

[Figure] The Memory Wall: CPU vs Memory Performance Gap — relative performance (log scale) vs. year, 1980-2020, with markers for caches becoming common, the multi-core era, and GPU computing.
CPU compute performance has grown ~10,000× since 1980, while DRAM bandwidth grew only ~100×

This growing disparity—known as the Memory Wall—means that for most real-world applications, the processor spends most of its time waiting for data rather than computing. By 2020, a typical CPU can perform thousands of floating-point operations in the time it takes to fetch a single cache line from main memory.

The Hard Truth

On modern hardware, data movement costs more than computation. Moving a 64-bit value from DRAM costs ~200 cycles, while a fused multiply-add (FMA) takes 4-5 cycles. You can do 40-50 math operations in the time it takes to fetch one number!
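The arithmetic behind that claim is easy to check; a quick Python sketch using the approximate cycle counts quoted above:

```python
# Back-of-envelope check using the approximate cycle counts quoted above
DRAM_FETCH_CYCLES = 200   # ~cycles to pull one value from DRAM
FMA_LATENCY_CYCLES = 4.5  # midpoint of the 4-5 cycle FMA figure

ops_per_fetch = DRAM_FETCH_CYCLES / FMA_LATENCY_CYCLES
print(f"~{ops_per_fetch:.0f} FMAs fit in one DRAM fetch")  # ~44, i.e. 40-50
```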

Real-World Impact

Consider this simple vector addition:

C++ naive_add.cpp
void vector_add(float* a, float* b, float* c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];  // 1 FLOP, 12 bytes accessed
    }
}
// Arithmetic Intensity = 1 / 12 = 0.083 FLOP/Byte
// On a 100 GB/s memory system: max 8.3 GFLOPS
// But CPU can do 100+ GFLOPS!
// → 92% of compute capacity WASTED

This kernel has an arithmetic intensity of 0.083 FLOP/Byte—far below the ~10 FLOP/Byte threshold needed to utilize modern CPU compute capabilities. The processor is memory-bound, throttled by how fast it can load and store data.

The Two Types of Bottlenecks

Relative to the roofline, every computation falls into one of two categories:

Memory-Bound

The processor waits for data. Performance is limited by memory bandwidth.

  • Low arithmetic intensity
  • Optimization: reduce data movement
  • Examples: streaming, sparse ops

Compute-Bound

The processor is doing math. Performance is limited by compute throughput.

  • High arithmetic intensity
  • Optimization: use more compute
  • Examples: dense GEMM, convolutions

The critical insight: the right optimization depends on which bound you're in. Vectorizing a memory-bound kernel yields little; tuning memory access patterns in a compute-bound kernel is wasted effort. Diagnose before you optimize.

2. Diagnosing Performance Bottlenecks

Before writing a single line of optimization code, you must answer: is my kernel memory-bound or compute-bound? This section covers the tools and techniques for diagnosis.

The Roofline Model Recap

The Roofline Model provides a visual framework for understanding performance limits. The key metric is Arithmetic Intensity (AI):

$$\text{AI} = \frac{\text{FLOPs performed}}{\text{Bytes accessed from memory}}$$

The roofline plot shows two ceilings: a slanted memory ceiling (attainable GFLOPS = AI × peak bandwidth) and a flat compute ceiling (peak GFLOPS).

Where they meet is the ridge point—the AI at which your code transitions from memory-bound to compute-bound:

$$\text{Ridge Point} = \frac{\text{Peak FLOPS}}{\text{Peak Bandwidth}}$$
[Figure] The Roofline Model: performance (GFLOP/s) vs. arithmetic intensity (FLOP/Byte), with Vector Add under the memory-bound slope and Dense GEMM under the compute-bound ceiling.
Operations below and left of the ridge are memory-bound; above and right are compute-bound
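The two ceilings and the ridge point fold into a single attainable-performance function; a minimal sketch, plugging in the vector-add and GEMM figures used in this document:

```python
def attainable_gflops(ai, peak_gflops, peak_bw_gb_s):
    """Roofline: performance is capped by the lower of the two ceilings."""
    return min(peak_gflops, ai * peak_bw_gb_s)

# Vector add (AI = 0.083) on a 100 GB/s, 100-GFLOPS machine: memory-bound
print(f"{attainable_gflops(0.083, 100, 100):.1f} GFLOPS")
# Dense GEMM (AI ~ 50) on the same machine: compute-bound
print(f"{attainable_gflops(50, 100, 100):.1f} GFLOPS")
```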

Profiling Tools

Theoretical analysis tells you what should happen. Profilers tell you what actually happens:

Tool           | Platform   | Key Metrics                                 | When to Use
perf           | Linux CPU  | IPC, cache misses, branch mispredicts       | Quick CPU profiling
Intel VTune    | Intel CPU  | Roofline, memory hierarchy, threading       | Deep CPU analysis
AMD uProf      | AMD CPU    | Similar to VTune, for AMD                   | AMD systems
Nsight Compute | NVIDIA GPU | SM utilization, memory throughput, roofline | CUDA kernel optimization
Nsight Systems | NVIDIA GPU | Timeline, kernel launches, memory transfers | Application-level GPU analysis
rocprof        | AMD GPU    | HIP kernel metrics                          | ROCm optimization

CPU Profiling with perf

Bash perf_profiling.sh
# Basic statistics
perf stat ./my_program

# Key counters for memory-bound diagnosis
perf stat -e cycles,instructions,cache-references,cache-misses,\
          L1-dcache-loads,L1-dcache-load-misses,\
          LLC-loads,LLC-load-misses ./my_program

# Sample output:
#  12,345,678,901  cycles
#   3,456,789,012  instructions              # IPC: 0.28 (LOW!)
#     456,789,012  cache-references
#     234,567,890  cache-misses              # 51% miss rate (HIGH!)
#
# Diagnosis: Memory-bound (IPC < 1, high cache misses)
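The diagnosis in the sample output can be reproduced directly from the raw counters; a small helper (the 20% miss-rate cutoff is an illustrative threshold, not a hard rule):

```python
def perf_summary(cycles, instructions, cache_refs, cache_misses):
    """Derive IPC and cache-miss rate from raw perf counter values."""
    ipc = instructions / cycles
    miss_rate = cache_misses / cache_refs
    # Illustrative cutoff: low IPC plus a high miss rate points at memory
    bound = "likely memory-bound" if ipc < 1 and miss_rate > 0.2 else "check further"
    return ipc, miss_rate, bound

# Counter values from the sample output above
ipc, miss_rate, bound = perf_summary(
    cycles=12_345_678_901, instructions=3_456_789_012,
    cache_refs=456_789_012, cache_misses=234_567_890)
print(f"IPC {ipc:.2f}, miss rate {miss_rate:.0%} → {bound}")
```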

GPU Profiling with Nsight Compute

Bash ncu_profiling.sh
# Profile a specific kernel
ncu --set full -o profile ./my_cuda_app

# Quick metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
              dram__throughput.avg.pct_of_peak_sustained_elapsed,\
              gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed \
    ./my_cuda_app

# Key indicators:
# - SM Throughput high, DRAM low → Compute-bound
# - SM Throughput low, DRAM high → Memory-bound
# - Both low → Latency-bound or underutilized

Identifying Your Bound

Here's a systematic approach to diagnose your kernel:

Performance Diagnosis Flowchart
1. Calculate AI and compare it to the ridge point.
2. AI < ridge → Memory-Bound. Profile shows: low IPC (< 1), high cache misses, DRAM bandwidth near peak. Optimize: cache blocking, prefetching, data layout (SoA), kernel fusion, compression.
3. AI > ridge → Compute-Bound. Profile shows: high IPC (> 1), low cache misses, SM/ALU near saturation. Optimize: vectorization (SIMD/Tensor), loop unrolling, ILP, mixed precision.
4. Neither ceiling reached → latency-bound or underutilized. Check for: thread synchronization, branch divergence, low occupancy, dependency chains.
Systematic approach: calculate AI, verify with profiler, then apply targeted optimizations
Python diagnose_kernel.py
def diagnose_kernel(
    flops: int,
    bytes_accessed: int,
    observed_gflops: float,
    peak_gflops: float,
    peak_bw_gb_s: float
):
    """Diagnose whether a kernel is memory or compute bound."""
    
    # Calculate arithmetic intensity
    ai = flops / bytes_accessed
    
    # Calculate ridge point
    ridge = peak_gflops / peak_bw_gb_s
    
    # Calculate theoretical max performance
    if ai < ridge:
        theoretical_max = ai * peak_bw_gb_s
        bound = "Memory-bound"
    else:
        theoretical_max = peak_gflops
        bound = "Compute-bound"
    
    # Calculate efficiency
    efficiency = observed_gflops / theoretical_max * 100
    
    print(f"Arithmetic Intensity: {ai:.2f} FLOP/Byte")
    print(f"Ridge Point: {ridge:.2f} FLOP/Byte")
    print(f"Classification: {bound}")
    print(f"Theoretical Max: {theoretical_max:.1f} GFLOPS")
    print(f"Achieved: {observed_gflops:.1f} GFLOPS")
    print(f"Efficiency: {efficiency:.1f}%")
    
    if efficiency < 50:
        print(f"\n⚠️  Low efficiency! Check for:")
        if bound == "Memory-bound":
            print("  - Cache misses")
            print("  - Uncoalesced access (GPU)")
            print("  - Poor data locality")
        else:
            print("  - No vectorization")
            print("  - Low occupancy (GPU)")
            print("  - Dependency stalls")

# Example: Matrix-vector multiply on CPU
# y = A @ x, where A is 4096×4096, x is 4096×1
diagnose_kernel(
    flops=2 * 4096 * 4096,              # 33.5M FLOPs
    bytes_accessed=(4096*4096 + 4096 + 4096) * 4,  # 67 MB
    observed_gflops=8.5,
    peak_gflops=500,                   # Modern CPU
    peak_bw_gb_s=50                    # DDR4
)

# Output:
# Arithmetic Intensity: 0.50 FLOP/Byte
# Ridge Point: 10.00 FLOP/Byte
# Classification: Memory-bound
# Theoretical Max: 25.0 GFLOPS
# Achieved: 8.5 GFLOPS
# Efficiency: 34.0%
#
# ⚠️  Low efficiency! Check for:
#   - Cache misses
#   - Uncoalesced access (GPU)
#   - Poor data locality
The 50% Rule

If your kernel achieves less than 50% of its theoretical maximum (based on the roofline), there's likely a fixable inefficiency. Common culprits: cache misses, uncoalesced access, branch divergence, or lack of vectorization.

3. CPU Architecture Fundamentals

To optimize for CPUs, you need to understand how they're built. Modern CPUs are designed to hide memory latency and maximize throughput through several mechanisms:

Cache Hierarchy

CPUs use a hierarchy of progressively larger, slower caches to bridge the gap between processor speed and main memory:

[Figure] CPU Cache Hierarchy (Typical Modern Desktop)
  • CPU Core: ~4 GHz, 4-wide issue
  • L1 Cache (per core): 32-64 KB | 4-5 cycles (~1 ns) | ~1 TB/s
  • L2 Cache (per core): 256 KB - 1 MB | 12-15 cycles (~4 ns) | ~500 GB/s
  • L3 Cache (shared): 8-64 MB | 40-50 cycles (~15 ns) | ~200 GB/s
  • Main Memory (DRAM): 16-128 GB | 150-200 cycles (~60 ns) | 50-100 GB/s
Each cache level is ~10× larger but ~3-4× slower than the previous

  • L1 Cache: 64 KB per core
  • L1 Latency: 4-5 cycles (~1 ns)
  • L3 Cache: 32 MB shared
  • DRAM Latency: ~200 cycles (~60 ns)

The key insight: cache misses are expensive. An L1 hit costs 4 cycles; a DRAM fetch costs 200+. Code that fits in cache runs 50× faster than code that doesn't. This is why locality—keeping frequently accessed data close together and re-accessing it while still cached—is the primary CPU optimization strategy.
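One way to quantify "cache misses are expensive" is average memory access time (AMAT), using the per-level latencies above; the hit rates below are illustrative assumptions:

```python
def amat(hit_rates, latencies, dram_cycles):
    """Average memory access time in cycles, walking down the hierarchy.

    hit_rates[i]: fraction of accesses reaching level i that hit there.
    """
    time, reach = 0.0, 1.0  # reach = fraction of accesses reaching this level
    for hit, lat in zip(hit_rates, latencies):
        time += reach * hit * lat
        reach *= 1.0 - hit
    return time + reach * dram_cycles

# L1/L2/L3 latencies roughly as in the figure above; hit rates are assumptions
cache_friendly = amat([0.95, 0.80, 0.90], [4, 14, 45], 200)
cache_hostile  = amat([0.50, 0.50, 0.50], [4, 14, 45], 200)
print(f"{cache_friendly:.1f} vs {cache_hostile:.1f} cycles per access")
```

With these assumptions the cache-friendly code averages ~5 cycles per access versus ~36 for the cache-hostile one—a ~7× gap from locality alone.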

C++ cache_impact.cpp
// Example: Impact of stride on cache performance
const int N = 1024 * 1024;  // 1M elements = 4 MB
float data[N];
int random_indices[N];      // Precomputed shuffled indices in [0, N)
float sum = 0;

// Sequential access: Perfect cache utilization
// Each cache line (64B) serves 16 floats
for (int i = 0; i < N; i++) {
    sum += data[i];  // ~0.25 cycles/element (cache hit)
}

// Strided access: Cache line wasted
// Only 1 of 16 floats used per cache line fetch
for (int i = 0; i < N; i += 16) {
    sum += data[i];  // ~4 cycles/element (cache miss every time)
}

// Random access: Worst case
for (int i = 0; i < N; i++) {
    sum += data[random_indices[i]];  // ~50+ cycles/element (DRAM)
}

Pipeline & Out-of-Order Execution

Modern CPUs don't execute one instruction at a time. They use pipelining to overlap instruction execution and out-of-order execution to keep functional units busy:

[Figure] CPU Pipeline & Out-of-Order Execution. Pipeline stages: Fetch → Decode → Rename → Schedule → Execute → Retire. OoO example: given program order (1) r1 = load [x], (2) r2 = r1 + 1, (3) r3 = load [y], the scheduler issues load (3) early while (2) waits for r1. Key pipeline metrics: issue width 4-8 instructions/cycle, reorder buffer 200-300 entries, load buffer ~100 outstanding loads; IPC < 1 suggests memory/latency bound, IPC > 2 good utilization.
Out-of-order execution overlaps independent operations to hide latency
IPC: The Key Metric

Instructions Per Cycle (IPC) tells you how well your code utilizes the CPU pipeline. Modern CPUs can theoretically execute 4-8 instructions per cycle. If your IPC is below 1, you're likely stalled waiting for data. Low IPC usually indicates memory-boundness or long dependency chains.

Branch Prediction

Branches (if/else, loops) are problematic for pipelines. The CPU must guess which path to take before it knows the condition result. A misprediction costs 15-20 cycles as the pipeline is flushed:

C++ branch_impact.cpp
// Predictable branch: CPU learns the pattern
for (int i = 0; i < N; i++) {
    if (i < N/2) {  // Predictable: always true then always false
        sum += data[i];
    }
}
// Prediction accuracy: ~99%
// Mispredictions: ~2 (at the transition)

// Unpredictable branch: 50/50 random
for (int i = 0; i < N; i++) {
    if (data[i] > 0.5) {  // Random data: unpredictable!
        sum += data[i];
    }
}
// Prediction accuracy: ~50%
// Mispredictions: ~N/2
// Cost: N/2 × 15 cycles = 7.5N extra cycles!

// Branchless version using conditional move
for (int i = 0; i < N; i++) {
    sum += (data[i] > 0.5) ? data[i] : 0;  // May compile to CMOV
}
// Or manually branchless:
for (int i = 0; i < N; i++) {
    float mask = (data[i] > 0.5f) ? 1.0f : 0.0f;
    sum += data[i] * mask;
}
// No branch mispredictions!
Pattern                       | Prediction Accuracy | Performance Impact
Always taken / never taken    | ~99%                | Negligible
Regular pattern (TTTNTTTN...) | ~95%                | Minor
Data-dependent (random)       | ~50%                | Severe (15-20 cycles/mispredict)
Branchless code               | N/A                 | No penalty
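The "severe" row is easy to put in cycle terms; a quick model with a 17-cycle penalty (midpoint of the 15-20 range above):

```python
def mispredict_overhead_per_element(accuracy, penalty_cycles=17):
    """Average pipeline-flush cycles paid per loop iteration."""
    return (1.0 - accuracy) * penalty_cycles

print(f"{mispredict_overhead_per_element(0.99):.2f} cycles/element")  # predictable
print(f"{mispredict_overhead_per_element(0.50):.2f} cycles/element")  # random data
```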

SIMD Units

Single Instruction Multiple Data (SIMD) allows a CPU to perform the same operation on multiple data elements simultaneously. Modern x86 CPUs support:

[Figure] SIMD Vector Widths Evolution: SSE (128-bit, 4 × float32) → AVX/AVX2 (256-bit, 8 × float32) → AVX-512 (512-bit, 16 × float32) — up to 16× theoretical speedup per instruction.
Each generation doubles the number of elements processed per instruction
C++ simd_example.cpp
#include <immintrin.h>

// Scalar: Process 1 element at a time
void add_scalar(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];  // 1 element per iteration
    }
}

// AVX2: Process 8 elements at a time
void add_avx2(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i += 8) {
        // Load 8 floats from each array
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        
        // Add all 8 pairs in a single instruction
        __m256 vc = _mm256_add_ps(va, vb);
        
        // Store 8 results
        _mm256_storeu_ps(&c[i], vc);
    }
}

// AVX-512: Process 16 elements at a time
void add_avx512(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(&a[i]);
        __m512 vb = _mm512_loadu_ps(&b[i]);
        __m512 vc = _mm512_add_ps(va, vb);
        _mm512_storeu_ps(&c[i], vc);
    }
}

// Note: Modern compilers can auto-vectorize with -O3 -march=native
// Check with: gcc -fopt-info-vec-optimized or clang -Rpass=loop-vectorize
SIMD Reality Check

SIMD helps compute-bound code by increasing arithmetic throughput. For memory-bound code (like the simple vector add above), SIMD provides limited benefit because you're already bottlenecked on memory bandwidth, not compute. SIMD shines when you do many operations on data already in cache.

SIMD Extension | Width        | float32 | float64 | Availability
SSE            | 128-bit      | 4       | 2       | All modern x86
AVX            | 256-bit      | 8       | 4       | Sandy Bridge+ (2011)
AVX2           | 256-bit      | 8       | 4       | Haswell+ (2013)
AVX-512        | 512-bit      | 16      | 8       | Skylake-X+ (2017)
ARM NEON       | 128-bit      | 4       | 2       | All ARMv8
ARM SVE/SVE2   | 128-2048 bit | 4-64    | 2-32    | ARMv8.2+ (scalable)
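The reality check above can be made concrete with a two-term time model; the scalar FLOP rate and bandwidth below are illustrative assumptions:

```python
def kernel_time(n_flops, n_bytes, scalar_gflops, simd_lanes, bw_gb_s):
    """Time ≈ max(compute time, memory time), assuming perfect overlap."""
    compute_s = n_flops / (scalar_gflops * 1e9 * simd_lanes)
    memory_s = n_bytes / (bw_gb_s * 1e9)
    return max(compute_s, memory_s)

# Vector add over 1M floats: 1 FLOP and 12 bytes per element (AI = 0.083)
scalar = kernel_time(1e6, 12e6, 4, 1, 50)   # no SIMD
avx2   = kernel_time(1e6, 12e6, 4, 8, 50)   # 8 lanes
print(f"SIMD speedup on memory-bound add: {scalar / avx2:.2f}x")
```

Eight lanes buy only a few percent here because the memory term dominates; the same model predicts near-8× gains once the working set sits in cache (raise bw_gb_s).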

4. CPU Memory Optimizations

For memory-bound CPU code, the goal is to minimize data movement and maximize cache hits. This section covers the essential techniques.

Data Layout: Array of Structures vs Structure of Arrays

How you organize data in memory has massive performance implications. The two primary layouts are:

Array of Structures (AoS)

Each object's fields are stored contiguously

struct Particle {
  float x, y, z;
  float vx, vy, vz;
};
Particle particles[N];

Memory: [x₀,y₀,z₀,vx₀,vy₀,vz₀,x₁,y₁,...]

Structure of Arrays (SoA)

Each field stored in its own array

struct Particles {
  float x[N], y[N], z[N];
  float vx[N], vy[N], vz[N];
};
Particles particles;

Memory: [x₀,x₁,x₂...][y₀,y₁,y₂...]...

[Figure] AoS vs SoA Memory Access Patterns. AoS: when accessing only x, just 1/6 of each cache line is used (16.7% utilization) and SIMD requires gather operations (~10× slower than sequential loads). SoA: when accessing only x, entire cache lines are used (100% utilization) and sequential SIMD loads run at maximum throughput. Rule: use SoA when accessing one field across many objects (common in HPC); use AoS when accessing all fields of one object (better for random access patterns).
SoA enables full cache line utilization and efficient SIMD vectorization
C++ aos_vs_soa.cpp
#include <immintrin.h>      // AVX intrinsics
constexpr int N = 1 << 20;  // Example element count

// Array of Structures (AoS) - Object-oriented style
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
    float mass;
};
ParticleAoS particles_aos[N];

// Processing - poor cache utilization when only accessing some fields
void update_positions_aos(ParticleAoS* p, float dt, int n) {
    for (int i = 0; i < n; i++) {
        p[i].x += p[i].vx * dt;  // Loads all 28 bytes per particle
        p[i].y += p[i].vy * dt;  // but uses only 24 (x,y,z,vx,vy,vz);
        p[i].z += p[i].vz * dt;  // the mass field wastes cache space
    }
}

// Structure of Arrays (SoA) - Data-oriented style
struct ParticlesSoA {
    float* x;  float* y;  float* z;
    float* vx; float* vy; float* vz;
    float* mass;
};

// Processing - excellent cache utilization + auto-vectorization
void update_positions_soa(ParticlesSoA& p, float dt, int n) {
    // Each loop is perfectly vectorizable!
    for (int i = 0; i < n; i++) p.x[i] += p.vx[i] * dt;
    for (int i = 0; i < n; i++) p.y[i] += p.vy[i] * dt;
    for (int i = 0; i < n; i++) p.z[i] += p.vz[i] * dt;
}

// Even better: fused loops with explicit SIMD
void update_positions_soa_avx(ParticlesSoA& p, float dt, int n) {
    __m256 vdt = _mm256_set1_ps(dt);
    
    for (int i = 0; i < n; i += 8) {
        // Process 8 particles' x coordinates at once
        __m256 vx = _mm256_loadu_ps(&p.x[i]);
        __m256 vvx = _mm256_loadu_ps(&p.vx[i]);
        vx = _mm256_fmadd_ps(vvx, vdt, vx);  // x += vx * dt
        _mm256_storeu_ps(&p.x[i], vx);
        
        // Similar for y and z...
    }
}

// Performance difference: 3-6× speedup from SoA + SIMD
Aspect                          | AoS                               | SoA
Cache efficiency (field subset) | Poor - loads unused fields        | Excellent - loads only needed data
SIMD vectorization              | Requires gather/scatter           | Sequential loads (native)
Random access by object         | Good - single pointer dereference | Poor - multiple array accesses
Memory allocation               | Simple (one array)                | Complex (multiple arrays)
Best use case                   | Object-by-object processing       | Field-by-field processing (HPC)
When to Use Each

SoA: Batch processing many objects, iterating through one field (common in physics simulations, ML inference). AoS: Object-oriented access, spatial data structures (trees, graphs), when you need all fields of an object together.
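The utilization claims are simple ratios; a sketch using the 28-byte particle from the listing above:

```python
def fetched_bytes_used(struct_bytes, used_bytes):
    """Fraction of each fetched struct the loop actually touches (AoS)."""
    return used_bytes / struct_bytes

# Position update touches x,y,z,vx,vy,vz = 24 of 28 bytes per particle
print(f"AoS, position update: {fetched_bytes_used(28, 24):.0%} used")
# A loop reading only x touches 4 of 28 bytes
print(f"AoS, x only:          {fetched_bytes_used(28, 4):.0%} used")
# SoA streams each field contiguously, so every fetched byte is used: 100%
```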

Cache Blocking (Tiling)

Cache blocking partitions data to fit in cache, dramatically reducing memory traffic. The classic example is matrix multiplication:

[Figure] Naive vs Cache-Blocked Matrix Multiplication. Naive: O(N³) cache misses — column access to B breaks prefetching, so each B[k][j] access is a cache miss. Blocked: O(N³/B) cache misses — tiles fit in L1/L2 cache and each tile is reused B times, giving B× fewer loads. Optimal tile size for a 32 KB L1 cache with float32: 3 tiles × B² × 4 bytes ≤ 32 KB → B ≤ √(32768/12) ≈ 52, so use B = 32 or 64 (power of 2 for alignment).
Cache blocking ensures data reuse before eviction, reducing memory traffic by a factor of the block size
C++ cache_blocking.cpp
#include <algorithm>    // std::min
#include <cstring>      // memset
#include <immintrin.h>  // AVX intrinsics
using std::min;

// Naive matrix multiplication - O(N³) cache misses!
void matmul_naive(float* A, float* B, float* C, int N) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0;
            for (int k = 0; k < N; k++) {
                sum += A[i*N + k] * B[k*N + j];  // B access is column-wise!
            }                                     // Cache miss every iteration
            C[i*N + j] = sum;
        }
    }
}
// For N=1024: ~1 billion cache misses

// Cache-blocked matrix multiplication - O(N³/BLOCK) cache misses
const int BLOCK = 64;  // Tiles sized to stay resident in L1/L2 cache

void matmul_blocked(float* A, float* B, float* C, int N) {
    // Initialize C to zero
    memset(C, 0, N * N * sizeof(float));
    
    // Iterate over tiles
    for (int i0 = 0; i0 < N; i0 += BLOCK) {
        for (int j0 = 0; j0 < N; j0 += BLOCK) {
            for (int k0 = 0; k0 < N; k0 += BLOCK) {
                
                // Inner loops work on BLOCK×BLOCK tiles
                // These tiles fit in L1 cache!
                for (int i = i0; i < min(i0+BLOCK, N); i++) {
                    for (int j = j0; j < min(j0+BLOCK, N); j++) {
                        float sum = C[i*N + j];
                        for (int k = k0; k < min(k0+BLOCK, N); k++) {
                            sum += A[i*N + k] * B[k*N + j];
                        }
                        C[i*N + j] = sum;
                    }
                }
            }
        }
    }
}
// For N=1024, BLOCK=64: ~16 million cache misses (64× fewer!)

// Even better: blocked + SIMD + loop unrolling
void matmul_blocked_avx(float* A, float* B, float* C, int N) {
    memset(C, 0, N * N * sizeof(float));
    
    for (int i0 = 0; i0 < N; i0 += BLOCK) {
        for (int j0 = 0; j0 < N; j0 += BLOCK) {
            for (int k0 = 0; k0 < N; k0 += BLOCK) {
                for (int i = i0; i < i0+BLOCK; i++) {
                    for (int k = k0; k < k0+BLOCK; k++) {
                        __m256 va = _mm256_set1_ps(A[i*N + k]);
                        for (int j = j0; j < j0+BLOCK; j += 8) {
                            __m256 vb = _mm256_loadu_ps(&B[k*N + j]);
                            __m256 vc = _mm256_loadu_ps(&C[i*N + j]);
                            vc = _mm256_fmadd_ps(va, vb, vc);
                            _mm256_storeu_ps(&C[i*N + j], vc);
                        }
                    }
                }
            }
        }
    }
}
// Additional ~4× from SIMD → ~40× total speedup vs naive

  • Naive GEMM: ~5 GFLOPS (N=1024)
  • Blocked: ~50 GFLOPS (~10× faster)
  • Blocked + AVX: ~200 GFLOPS (~40× faster)
  • OpenBLAS: ~400 GFLOPS (near optimal)
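The tile-size bound and the miss-count claims above follow from two short calculations (cache and matrix sizes as in the example):

```python
import math

# Largest B such that three B×B float32 tiles fit in a 32 KB L1 cache
L1_BYTES = 32 * 1024
b_max = math.isqrt(L1_BYTES // (3 * 4))
print(f"B ≤ {b_max} → use 32, or 64 if targeting L2")

# Cache-line-level traffic model for N×N matmul: naive misses on every
# B[k][j] access; blocking reuses each loaded tile element B times
N, B = 1024, 64
naive_misses = N**3
blocked_misses = N**3 // B
print(f"~{naive_misses/1e9:.1f}B misses → ~{blocked_misses/1e6:.0f}M ({B}× fewer)")
```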

Software Prefetching

Prefetching loads data into cache before it's needed, hiding memory latency. Modern CPUs do hardware prefetching for sequential access patterns, but software prefetching helps with irregular patterns:

C++ prefetching.cpp
#include <xmmintrin.h>  // For _mm_prefetch

// Prefetch hints:
// _MM_HINT_T0 - Prefetch to all cache levels (L1, L2, L3)
// _MM_HINT_T1 - Prefetch to L2 and L3
// _MM_HINT_T2 - Prefetch to L3 only
// _MM_HINT_NTA - Non-temporal, bypass cache

// Without prefetch - stalls waiting for memory
float indirect_access_naive(float* data, int* indices, int n) {
    float sum = 0;
    for (int i = 0; i < n; i++) {
        sum += data[indices[i]];  // Cache miss → 200 cycle stall
    }
    return sum;
}

// With prefetch - overlap memory access with computation
float indirect_access_prefetch(float* data, int* indices, int n) {
    const int PREFETCH_DISTANCE = 16;  // Tune based on latency
    float sum = 0;
    
    for (int i = 0; i < n; i++) {
        // Prefetch data for a future iteration
        if (i + PREFETCH_DISTANCE < n) {
            _mm_prefetch((char*)&data[indices[i + PREFETCH_DISTANCE]], _MM_HINT_T0);
        }
        
        // Process current element (hopefully in cache by now)
        sum += data[indices[i]];
    }
    return sum;
}

// Prefetch distance calculation:
// Distance = Memory_Latency / Time_Per_Iteration
// Example: 200 cycles latency / 10 cycles per iteration = 20 iterations ahead

// Linked list traversal with prefetch
struct Node { Node* next; int data; };

int sum_list_prefetch(Node* head) {
    int sum = 0;
    Node* curr = head;
    Node* prefetch_ptr = head;
    
    // Advance prefetch pointer ahead
    for (int i = 0; i < 8 && prefetch_ptr; i++) {
        prefetch_ptr = prefetch_ptr->next;
    }
    
    while (curr) {
        // Prefetch 8 nodes ahead
        if (prefetch_ptr) {
            _mm_prefetch((char*)prefetch_ptr, _MM_HINT_T0);
            prefetch_ptr = prefetch_ptr->next;
        }
        
        sum += curr->data;
        curr = curr->next;
    }
    return sum;
}
Prefetch Pitfalls

Too early: Data evicted before use. Too late: Still waiting for memory. Too many: Pollutes cache, evicting useful data. Start with prefetch distance = memory_latency / loop_time, then tune with profiling. Modern CPUs have excellent hardware prefetchers—software prefetching mainly helps irregular access patterns.
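The starting-point formula from the note above, as a sketch:

```python
def prefetch_distance(mem_latency_cycles, cycles_per_iteration):
    """Iterations to prefetch ahead so data arrives just in time."""
    return max(1, round(mem_latency_cycles / cycles_per_iteration))

print(prefetch_distance(200, 10))  # 20 iterations ahead
print(prefetch_distance(200, 25))  # 8 — a heavier loop body needs less lead
```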

Memory Alignment

Misaligned memory access can cause performance penalties or require multiple cache line fetches:

C++ alignment.cpp
#include <atomic>       // std::atomic
#include <cstdlib>      // posix_memalign
#include <immintrin.h>  // _mm256_load_ps
#include <new>          // std::align_val_t

// Allocate aligned memory for SIMD operations

// C++11 aligned allocation
alignas(32) float data[1024];  // 32-byte aligned for AVX

// Dynamic allocation (C++17)
float* ptr = new (std::align_val_t(64)) float[1024];  // 64-byte for cache line

// POSIX aligned allocation
void* raw = nullptr;
posix_memalign(&raw, 64, 1024 * sizeof(float));

// Aligned SIMD loads are faster than unaligned
void process_aligned(float* data, int n) {
    // Assumes data is 32-byte aligned
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(&data[i]);   // Faster, requires alignment
        // vs _mm256_loadu_ps for unaligned
    }
}

// Padding to avoid false sharing in multithreaded code
struct alignas(64) PaddedCounter {
    std::atomic<int> count;
    char padding[60];  // Ensure each counter is on its own cache line
};

constexpr int NUM_THREADS = 8;        // Example thread count
PaddedCounter counters[NUM_THREADS];  // No false sharing between threads

5. CPU Compute Optimizations

For compute-bound CPU code, the goal is to maximize instruction throughput. This means using SIMD, increasing ILP, and keeping the pipeline full.

Vectorization Deep Dive

Modern compilers can auto-vectorize simple loops, but often need help:

C++ auto_vectorization.cpp
// Compile with: g++ -O3 -march=native -fopt-info-vec-optimized
#include <cmath>        // sqrt
#include <immintrin.h>  // AVX intrinsics

// ✅ Auto-vectorizes: Simple loop, no dependencies
void saxpy(float* y, float a, float* x, int n) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

// ❌ Won't vectorize: Potential aliasing
void copy_overlap(float* dst, float* src, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i];  // dst and src might overlap!
    }
}

// ✅ Fix: Use restrict to promise no aliasing
void copy_restrict(float* __restrict__ dst, 
                   float* __restrict__ src, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i];  // Compiler knows they don't overlap
    }
}

// ❌ Won't vectorize: Function call in loop
void process_external(float* data, int n) {
    for (int i = 0; i < n; i++) {
        data[i] = external_func(data[i]);  // Can't inline
    }
}

// ❌ Won't vectorize: Data-dependent branch
void process_conditional(float* data, int n) {
    for (int i = 0; i < n; i++) {
        if (data[i] > 0) {
            data[i] = sqrt(data[i]);
        }
    }
}

// ✅ Fix: Use masked operations
void process_masked(float* data, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(&data[i]);
        __m256 zero = _mm256_setzero_ps();
        __m256 mask = _mm256_cmp_ps(v, zero, _CMP_GT_OQ);
        __m256 sqrtv = _mm256_sqrt_ps(v);
        __m256 result = _mm256_blendv_ps(v, sqrtv, mask);
        _mm256_storeu_ps(&data[i], result);
    }
}

// Pragma hints (compiler-specific)
void process_pragma(float* data, int n, int offset) {
    #pragma omp simd  // OpenMP SIMD directive
    for (int i = 0; i < n; i++) {
        data[i] = data[i] * data[i];
    }
    
    #pragma GCC ivdep  // GCC: ignore vector dependencies
    for (int i = 0; i < n; i++) {
        data[i] = data[i - offset];
    }
}
Vectorization Blocker     | Fix
Pointer aliasing          | __restrict__ keyword
Non-unit stride           | Restructure data layout (SoA)
Data-dependent branches   | Masked operations, branchless code
Loop-carried dependencies | Restructure algorithm, reduction clauses
Function calls            | Inline, SIMD math libraries (SVML)
Unknown trip count        | Use #pragma omp simd

Loop Unrolling & ILP

Loop unrolling reduces loop overhead and exposes Instruction-Level Parallelism (ILP)—independent operations that can execute in parallel on superscalar CPUs:

[Figure] Loop Unrolling and ILP. Original loop (sum += data[i]): each add waits for the previous one — latency N × 4 cycles. Unrolled with four accumulators (sum0..sum3): all four adds execute in parallel — latency N/4 × 4 cycles, 4× faster. Why multiple accumulators work: FP add latency is 4 cycles and throughput is 2/cycle, so unroll factor = latency × throughput = 8; eight independent streams keep both FP add units busy for maximum throughput.
Multiple accumulators break the dependency chain, enabling full pipeline utilization
C++ loop_unrolling.cpp
#include <immintrin.h>  // AVX intrinsics

// Single accumulator - limited by add latency (4 cycles)
float sum_naive(float* data, int n) {
    float sum = 0;
    for (int i = 0; i < n; i++) {
        sum += data[i];  // Each add depends on previous
    }                    // Throughput: 0.25 adds/cycle
    return sum;
}

// Multiple accumulators - breaks dependency chain
float sum_unrolled(float* data, int n) {
    float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    float sum4 = 0, sum5 = 0, sum6 = 0, sum7 = 0;
    
    int i = 0;
    for (; i < n - 7; i += 8) {
        sum0 += data[i];      // All 8 additions are independent
        sum1 += data[i + 1];  // CPU can issue 2/cycle
        sum2 += data[i + 2];  // Throughput: 2 adds/cycle
        sum3 += data[i + 3];
        sum4 += data[i + 4];
        sum5 += data[i + 5];
        sum6 += data[i + 6];
        sum7 += data[i + 7];
    }
    
    // Handle remainder
    for (; i < n; i++) {
        sum0 += data[i];
    }
    
    // Combine accumulators
    return (sum0 + sum1) + (sum2 + sum3) + (sum4 + sum5) + (sum6 + sum7);
}

// SIMD unrolled - best of both worlds
float sum_simd_unrolled(float* data, int n) {
    __m256 vsum0 = _mm256_setzero_ps();
    __m256 vsum1 = _mm256_setzero_ps();
    __m256 vsum2 = _mm256_setzero_ps();
    __m256 vsum3 = _mm256_setzero_ps();
    
    int i = 0;
    for (; i < n - 31; i += 32) {
        vsum0 = _mm256_add_ps(vsum0, _mm256_loadu_ps(&data[i]));
        vsum1 = _mm256_add_ps(vsum1, _mm256_loadu_ps(&data[i + 8]));
        vsum2 = _mm256_add_ps(vsum2, _mm256_loadu_ps(&data[i + 16]));
        vsum3 = _mm256_add_ps(vsum3, _mm256_loadu_ps(&data[i + 24]));
    }
    
    // Combine vectors
    vsum0 = _mm256_add_ps(_mm256_add_ps(vsum0, vsum1),
                          _mm256_add_ps(vsum2, vsum3));
    
    // Horizontal sum (reduce 8 floats to 1)
    __m128 lo = _mm256_castps256_ps128(vsum0);
    __m128 hi = _mm256_extractf128_ps(vsum0, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    
    float sum = _mm_cvtss_f32(lo);
    
    // Handle remainder
    for (; i < n; i++) {
        sum += data[i];
    }
    
    return sum;
}

// Performance comparison (N = 1M floats):
// sum_naive:         ~1 GB/s  (memory bound, but add latency limited)
// sum_unrolled:      ~8 GB/s  (near bandwidth limit)
// sum_simd_unrolled: ~40 GB/s (exceeds DRAM, benefits from cache)
Unroll Factor Formula

Optimal unroll factor = latency × throughput. For FP add: 4 cycles latency × 2/cycle throughput = 8 accumulators. For FMA: 4 cycles × 2/cycle = 8. This ensures enough independent operations to keep all execution units busy.
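The formula and its effect on a reduction can be sketched directly (cycle figures as in the note above):

```python
def unroll_factor(latency_cycles, throughput_per_cycle):
    """Independent chains needed to hide latency: latency × throughput."""
    return latency_cycles * throughput_per_cycle

def reduction_cycles(n, latency, throughput, chains):
    """Approximate cycles to sum n elements with `chains` accumulators."""
    latency_bound = (n / chains) * latency   # each chain is serial
    issue_bound = n / throughput             # can't beat the issue rate
    return max(latency_bound, issue_bound)

k = unroll_factor(4, 2)                      # FP add: 4-cycle latency, 2/cycle
print(k)                                     # 8
print(reduction_cycles(1024, 4, 2, 1))       # 4096.0 — latency-bound
print(reduction_cycles(1024, 4, 2, k))       # 512.0 — throughput-bound, 8× faster
```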

Fused Multiply-Add (FMA)

FMA computes a * b + c in a single instruction with higher precision (only one rounding) and twice the throughput:

C++ fma_examples.cpp
#include <immintrin.h>

// Without FMA: 2 operations
float dot_naive(float* a, float* b, int n) {
    float sum = 0;
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];  // mul + add = 2 ops
    }
    return sum;
}

// With FMA: 1 operation, 2 FLOPs
float dot_fma(float* a, float* b, int n) {
    __m256 vsum0 = _mm256_setzero_ps();
    __m256 vsum1 = _mm256_setzero_ps();
    
    for (int i = 0; i < n; i += 16) {
        __m256 va0 = _mm256_loadu_ps(&a[i]);
        __m256 vb0 = _mm256_loadu_ps(&b[i]);
        __m256 va1 = _mm256_loadu_ps(&a[i + 8]);
        __m256 vb1 = _mm256_loadu_ps(&b[i + 8]);
        
        // FMA: vsum = va * vb + vsum
        vsum0 = _mm256_fmadd_ps(va0, vb0, vsum0);
        vsum1 = _mm256_fmadd_ps(va1, vb1, vsum1);
    }
    
    vsum0 = _mm256_add_ps(vsum0, vsum1);
    
    // Horizontal sum...
    __m128 lo = _mm256_castps256_ps128(vsum0);
    __m128 hi = _mm256_extractf128_ps(vsum0, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}

// FMA variants:
// _mm256_fmadd_ps(a, b, c)  → a*b + c
// _mm256_fmsub_ps(a, b, c)  → a*b - c
// _mm256_fnmadd_ps(a, b, c) → -(a*b) + c = c - a*b
// _mm256_fnmsub_ps(a, b, c) → -(a*b) - c

// Enable FMA in compiler for auto-vectorization:
// gcc/clang: -mfma or -march=haswell (or newer)
// MSVC: /arch:AVX2

Branchless Programming

Replacing branches with arithmetic operations eliminates branch misprediction penalties:

C++ branchless.cpp
#include <math.h>  // fminf/fmaxf for the float clamp below

// Branchy: Branch misprediction on random data
int abs_branchy(int x) {
    if (x < 0) return -x;
    return x;
}

// Branchless: No prediction needed
int abs_branchless(int x) {
    int mask = x >> 31;  // All 1s if negative, all 0s if positive
    return (x ^ mask) - mask;
}

// Min/Max
int min_branchy(int a, int b) {
    return (a < b) ? a : b;  // May branch or use CMOV
}

int min_branchless(int a, int b) {
    int diff = a - b;
    int mask = diff >> 31;  // mask = -1 if a < b, else 0
    return b + (diff & mask);
}

// Clamp
float clamp_branchy(float x, float lo, float hi) {
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

float clamp_branchless(float x, float lo, float hi) {
    return fmaxf(fminf(x, hi), lo);  // Compiles to minss/maxss, no branch
}

// Conditional increment
int count = 0;

// Branchy
for (int i = 0; i < n; i++) {
    if (data[i] > threshold) count++;  // Mispredicts ~50% on random
}

// Branchless
for (int i = 0; i < n; i++) {
    count += (data[i] > threshold);  // Bool converts to 0 or 1
}

// SIMD branchless comparison
int count_simd(float* data, float threshold, int n) {
    __m256 vthresh = _mm256_set1_ps(threshold);
    __m256i vcount = _mm256_setzero_si256();
    __m256i vone = _mm256_set1_epi32(1);
    
    for (int i = 0; i < n; i += 8) {  // assumes n % 8 == 0
        __m256 v = _mm256_loadu_ps(&data[i]);
        __m256 mask = _mm256_cmp_ps(v, vthresh, _CMP_GT_OQ);
        __m256i imask = _mm256_castps_si256(mask);
        vcount = _mm256_sub_epi32(vcount, imask);  // -(-1) = +1
    }
    
    // Horizontal sum of 8 ints
    int result[8];
    _mm256_storeu_si256((__m256i*)result, vcount);
    return result[0] + result[1] + result[2] + result[3] +
           result[4] + result[5] + result[6] + result[7];
}

7. CPU Parallelization

Modern CPUs have multiple cores, and utilizing all of them is essential for achieving peak throughput. However, parallel programming introduces new challenges: synchronization overhead, load imbalance, and cache coherency issues.

CPU Parallelization: Scaling Challenges
[Figure: speedup vs. thread count (1-32) for ideal scaling, 90% parallel, 75% parallel, and scaling with overhead. Amdahl's Law: S = 1 / ((1 − P) + P/N), where P is the parallel fraction and N the number of threads.]
Parallel speedup is limited by serial portions and can degrade with excessive overhead

OpenMP Basics

OpenMP is the most common way to add parallelism to existing C/C++ code. It uses compiler directives (pragmas) to parallelize loops and regions.

Directive Purpose Use Case
#pragma omp parallel for Parallelize a for loop Independent iterations
#pragma omp parallel for reduction(+:sum) Parallel reduction Accumulating values
#pragma omp parallel sections Different code blocks in parallel Task parallelism
#pragma omp simd Request SIMD vectorization Vectorizable inner loops
#pragma omp task Create asynchronous task Dynamic workloads
C++ openmp_basics.cpp
// Basic parallel for - each iteration runs on different thread
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    result[i] = expensive_computation(data[i]);
}

// Reduction - thread-safe accumulation
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
    sum += data[i];  // Each thread has private sum, combined at end
}

// Nested parallelism with collapse
#pragma omp parallel for collapse(2)
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        matrix[i][j] = compute(i, j);
    }
}

// Schedule types for load balancing
#pragma omp parallel for schedule(static)   // Equal chunks, best for uniform work
#pragma omp parallel for schedule(dynamic)  // Grab work as available, best for variable work
#pragma omp parallel for schedule(guided)   // Decreasing chunk sizes, good balance
Schedule Selection

static: Best for uniform workloads (each iteration takes same time). Lowest overhead.
dynamic: Best for irregular workloads (some iterations much longer). Higher overhead but better load balance.
guided: Compromise - starts with large chunks, shrinks over time.

False Sharing

False sharing is the silent killer of parallel performance. It occurs when threads on different cores write to different variables that happen to share the same cache line (typically 64 bytes).

False Sharing: The Hidden Performance Killer
[Figure: Core 0 writes counter[0] while Core 1 writes counter[1]; both live on one shared 64-byte cache line, so the line ping-pongs between the cores' L1 caches at ~100 cycles per invalidation. Different variables, same cache line: can be 10-100× slower.]
Even though threads access different variables, they fight over the same cache line
C++ false_sharing.cpp
// BAD: False sharing - counters share cache line
struct BadCounters {
    int64_t counter[8];  // 64 bytes total, fits in ONE cache line!
};

BadCounters bad;
#pragma omp parallel for
for (int t = 0; t < 8; t++) {
    for (int i = 0; i < 1000000; i++) {
        bad.counter[t]++;  // All threads thrash the same cache line!
    }
}
// Time: ~500ms (horrible!)

// GOOD: Pad to separate cache lines
struct PaddedCounter {
    int64_t counter;
    char padding[56];  // Pad to 64 bytes (cache line size)
};

struct GoodCounters {
    PaddedCounter counter[8];  // Each counter on its own cache line
};

GoodCounters good;
#pragma omp parallel for
for (int t = 0; t < 8; t++) {
    for (int i = 0; i < 1000000; i++) {
        good.counter[t].counter++;  // No contention!
    }
}
// Time: ~5ms (100× faster!)

// C++17: Use alignas for cleaner padding
struct alignas(64) CacheLineCounter {
    int64_t value;
};
Detecting False Sharing

Use perf c2c on Linux to detect cache-line contention. Look for high "HITM" (Hit Modified) counts. Intel VTune also reports false sharing directly under "Memory Access" analysis.

NUMA Awareness

NUMA (Non-Uniform Memory Access) architectures have multiple memory controllers. Memory access is fast when accessing local memory but slow when crossing NUMA boundaries.

NUMA Architecture: Local vs Remote Memory
[Figure: two NUMA nodes, each with four cores and local memory (~80 ns latency, 100+ GB/s bandwidth), connected by a QPI/UPI link (~150 ns, ~50 GB/s); remote access is roughly 2× slower than local.]
Accessing remote NUMA memory can halve bandwidth and double latency
C++ numa_aware.cpp
#include <numa.h>   // libnuma; link with -lnuma
#include <cstdio>

// Check NUMA topology
int num_nodes = numa_num_configured_nodes();
printf("NUMA nodes: %d\n", num_nodes);

// Allocate memory on specific NUMA node
void* local_mem = numa_alloc_onnode(size, 0);  // Node 0
void* interleaved = numa_alloc_interleaved(size); // Spread across nodes

// Bind thread to NUMA node
numa_run_on_node(0);  // Run this thread on node 0

// First-touch policy: memory allocated where first accessed
float* data = (float*)malloc(N * sizeof(float));

// Initialize in parallel - each thread touches its portion
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    data[i] = 0.0f;  // Memory pages allocated on thread's local node
}

// Later computation accesses local memory
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    data[i] = compute(data[i]);  // Fast local access!
}

Thread Pools & Task Scheduling

Creating threads is expensive (~10-100μs). For fine-grained parallelism, use thread pools that reuse worker threads.

C++ thread_pool.cpp
#include <vector>
#include <queue>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <functional>

class ThreadPool {
private:
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex queue_mutex;
    std::condition_variable condition;
    bool stop = false;

public:
    ThreadPool(size_t num_threads) {
        for (size_t i = 0; i < num_threads; i++) {
            workers.emplace_back([this] {
                while (true) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(queue_mutex);
                        condition.wait(lock, [this] {
                            return stop || !tasks.empty();
                        });
                        if (stop && tasks.empty()) return;
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();  // Execute task
                }
            });
        }
    }

    template<class F>
    void enqueue(F&& f) {
        {
            std::unique_lock<std::mutex> lock(queue_mutex);
            tasks.emplace(std::forward<F>(f));
        }
        condition.notify_one();
    }

    ~ThreadPool() {
        { std::unique_lock<std::mutex> lock(queue_mutex); stop = true; }
        condition.notify_all();
        for (auto& w : workers) w.join();
    }
};

// Usage
ThreadPool pool(std::thread::hardware_concurrency());

for (int i = 0; i < 10000; i++) {
    pool.enqueue([i] {
        process_item(i);  // No thread creation overhead!
    });
}
Parallelization Technique Helps Memory-Bound? Helps Compute-Bound? Best For
OpenMP parallel for Sometimes (more bandwidth) Yes (more cores) Data parallelism
False sharing fix Yes (cache efficiency) Yes (reduces stalls) Shared counters/accumulators
NUMA awareness Yes (local bandwidth) Somewhat Large memory workloads
Thread pools No Yes (reduces overhead) Many small tasks

8. GPU Architecture Fundamentals

GPUs are fundamentally different from CPUs. While CPUs optimize for latency (fast single-thread execution), GPUs optimize for throughput (maximum total work per second) by running thousands of threads simultaneously.

CPU vs GPU Architecture Philosophy
[Figure: a latency-optimized CPU has a few complex out-of-order cores with branch prediction, big caches and 4-5 GHz clocks, a large shared L3 (32+ MB), and ~100 GB/s serving 4-16 threads; a throughput-optimized GPU has 108 SMs × 64 cores = 6,912 cores, a 40 MB L2, and 2+ TB/s of HBM3 bandwidth feeding 10,000+ threads.]
CPUs: few fast cores. GPUs: many slow cores. Different problems, different solutions.

Streaming Multiprocessors (SMs)

A GPU is composed of many Streaming Multiprocessors (SMs). Each SM is like a mini processor with its own warp schedulers, execution units (FP32/INT32, FP64, Tensor Cores, load/store), register file, and shared memory/L1 cache:

Streaming Multiprocessor (SM) Architecture (A100)
[Figure: an A100 SM contains four warp schedulers with dispatch units, 64 FP32/INT32 cores, 4 Tensor Cores, 32 FP64 units, 32 load/store units, a 256 KB register file (65,536 × 32-bit registers), 192 KB of shared memory/L1, and texture units.]
Each SM can execute 2048 threads (64 warps), switching between them to hide latency
GPU SMs FP32 Cores Tensor Cores Shared Mem/SM HBM BW
RTX 3090 82 10,496 328 128 KB 936 GB/s
A100 108 6,912 432 192 KB 2,039 GB/s
H100 132 16,896 528 228 KB 3,350 GB/s
RTX 4090 128 16,384 512 128 KB 1,008 GB/s

Warps & SIMT Execution

GPU threads are grouped into warps of 32 threads. All threads in a warp execute the same instruction at the same time (SIMT = Single Instruction, Multiple Threads). This is both the source of GPU power and the cause of performance pitfalls.

Warp Execution: SIMT Model
[Figure: one warp is 32 threads executing in lockstep. With uniform control flow, all 32 lanes stay active at full throughput. With divergence (e.g. branching on threadIdx.x % 2), the two paths run serially, each with half the lanes masked, roughly doubling execution time before the warp reconverges.]
Divergent branches within a warp execute sequentially, halving or worse throughput
CUDA warp_divergence.cu
// BAD: Heavy divergence within warp
__global__ void divergent_kernel(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    
    // Each thread in warp may take different path!
    if (data[idx] > 0.5f) {
        data[idx] = expensive_pathA(data[idx]);  // Some threads
    } else {
        data[idx] = expensive_pathB(data[idx]);  // Other threads
    }
}

// BETTER: Sort data so adjacent threads take same path
// Or restructure to minimize branch impact

// GOOD: Branch on warp-uniform condition
__global__ void uniform_kernel(float* data, int n, bool usePathA) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    
    // All threads in warp take same path (uniform)
    if (usePathA) {
        data[idx] = expensive_pathA(data[idx]);
    } else {
        data[idx] = expensive_pathB(data[idx]);
    }
}

GPU Memory Hierarchy

GPUs have a complex memory hierarchy. Using the right level is critical for performance:

GPU Memory Hierarchy
[Figure: the GPU memory pyramid, from registers (~0 cycles, ~20 TB/s, thread-private) through shared memory (~20-30 cycles, block-shared) and L1/texture cache (~30-80 cycles), to L2 (~200 cycles, ~4 TB/s) and global HBM (~300-500 cycles, 2-3 TB/s): smaller and faster at the top, larger and slower at the bottom.]
Optimal GPU code keeps data in faster memory tiers as long as possible
Memory Type Scope Latency Bandwidth Size (A100) Use Case
Registers Thread 0 cycles ~20 TB/s 256 KB/SM Local variables
Shared Memory Block ~25 cycles ~19 TB/s 164 KB/SM Inter-thread communication
L1 Cache SM ~30 cycles ~12 TB/s 192 KB/SM Automatic caching
L2 Cache Device ~200 cycles ~4 TB/s 40 MB Cross-SM sharing
Global (HBM) Device ~400 cycles 2 TB/s 80 GB Large datasets
Host (PCIe) System ~10,000 cycles 32 GB/s System RAM CPU↔GPU transfer

Occupancy & Latency Hiding

Occupancy is the ratio of active warps to the maximum possible warps per SM. High occupancy helps hide memory latency by allowing the SM to switch to other warps while waiting for memory.

Latency Hiding Through Occupancy
[Figure: at 25% occupancy (16 warps) the SM sits idle 60%+ of the time waiting on memory; at 100% (64 warps) ready warps overlap the stalls and the SM stays busy. Occupancy is limited by registers per thread, shared memory per block, and block size (which should be a multiple of the 32-thread warp).]
High occupancy keeps the SM busy while warps wait for memory
CUDA occupancy_analysis.cu
// Query theoretical occupancy
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float* data) {
    // Kernel code here
}

void analyzeOccupancy() {
    int blockSize = 256;
    int minGridSize;
    int maxBlockSize;
    
    // Let CUDA suggest optimal block size
    cudaOccupancyMaxPotentialBlockSize(
        &minGridSize,
        &maxBlockSize,
        myKernel,
        0,     // Dynamic shared mem
        0      // No block size limit
    );
    printf("Suggested block size: %d\n", maxBlockSize);
    
    // Calculate occupancy for chosen block size
    int numBlocks;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks,
        myKernel,
        blockSize,
        0  // Shared mem
    );
    
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    
    int maxWarps = prop.maxThreadsPerMultiProcessor / 32;
    int activeWarps = numBlocks * (blockSize / 32);
    float occupancy = (float)activeWarps / maxWarps;
    
    printf("Occupancy: %.1f%% (%d/%d warps)\n",
           occupancy * 100, activeWarps, maxWarps);
}

// Rule of thumb:
//   - 50%+ occupancy is usually good
//   - Below 25% often indicates a problem
//   - 100% not always needed (depends on memory access)
Occupancy vs Performance

Higher occupancy doesn't always mean better performance! A kernel with 50% occupancy but perfect memory coalescing can outperform 100% occupancy with scattered access. Use nsys and ncu to measure actual performance, not just occupancy.

9. GPU Memory Optimizations

GPU memory bandwidth is massive (2+ TB/s on modern GPUs), but achieving that peak requires careful attention to memory access patterns. Poor patterns can reduce effective bandwidth by 10-32×.

Memory Coalescing

Memory coalescing is the most important GPU memory optimization. When threads in a warp access consecutive memory addresses, the hardware combines them into a single, efficient transaction. When accesses are scattered, each thread requires a separate transaction.

Memory Coalescing: The Key to GPU Bandwidth
[Figure: with coalesced access (int idx = threadIdx.x) the 32 threads of a warp hit one 128-byte aligned segment in a single transaction at full bandwidth (~2 TB/s); with strided access (int idx = threadIdx.x * STRIDE) the warp needs 32 transactions for the same 128 useful bytes, ~3% efficiency (~60 GB/s). Coalescing rules: adjacent threads to adjacent addresses, unit stride preferred, 32/128-byte alignment helps.]
Coalesced access achieves 30-50× higher effective bandwidth than scattered access
CUDA coalescing_patterns.cu
// BAD: Column-major access in row-major storage (strided)
__global__ void bad_transpose(float* out, float* in, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Adjacent threads read from different rows (stride = N)
    out[col * N + row] = in[row * N + col];  // Strided write!
}

// GOOD: Use shared memory to fix coalescing
__global__ void good_transpose(float* out, float* in, int N) {
    __shared__ float tile[32][33];  // +1 to avoid bank conflicts
    
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    
    // Coalesced read from global memory
    tile[threadIdx.y][threadIdx.x] = in[y * N + x];
    
    __syncthreads();
    
    // Transpose indices for coalesced write
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    
    // Coalesced write to global memory
    out[y * N + x] = tile[threadIdx.x][threadIdx.y];
}

// AoS vs SoA on GPU - same principle as CPU!
struct ParticleAoS { float x, y, z, mass; };  // BAD for GPU

struct ParticlesSoA {  // GOOD for GPU
    float* x;
    float* y;
    float* z;
    float* mass;
};

Shared Memory Usage

Shared memory is on-chip SRAM shared by all threads in a block. It's ~100× faster than global memory and perfect for data that multiple threads reuse: tiles in matrix operations, staging buffers for transposes, and partial results in reductions.

Shared Memory: On-Chip Scratchpad
[Figure: each block on an SM gets its own private shared-memory partition; there is no cross-block sharing. Properties: 48-164 KB per SM, ~25-cycle latency (vs 400+ for global memory), 32 banks of 4 bytes each, configurable split with L1, block-scoped, and it consumes SM resources, so heavy use lowers occupancy.]
Each block gets its own private shared memory partition
CUDA shared_memory.cu
// Static shared memory allocation
__global__ void reduction_kernel(float* input, float* output, int N) {
    __shared__ float sdata[256];  // 1KB per block
    
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Load from global to shared (coalesced)
    sdata[tid] = (idx < N) ? input[idx] : 0.0f;
    __syncthreads();  // CRITICAL: wait for all loads
    
    // Parallel reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();  // Sync after each step
    }
    
    // Write result (only thread 0)
    if (tid == 0) output[blockIdx.x] = sdata[0];
}

// Dynamic shared memory (size set at launch)
__global__ void dynamic_smem_kernel(float* data) {
    extern __shared__ float smem[];  // Size unknown at compile time
    
    smem[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    // Use smem...
}

// Launch with dynamic shared memory
int smem_size = 256 * sizeof(float);
dynamic_smem_kernel<<<num_blocks, 256, smem_size>>>(data);
__syncthreads() is Critical!

Forgetting __syncthreads() after shared memory writes causes race conditions. Threads may read stale data. Always sync after writing to shared memory before reading!

Bank Conflicts

Shared memory is divided into 32 banks. If multiple threads in a warp access different addresses in the same bank, the accesses are serialized (bank conflict). This can reduce shared memory performance by up to 32×.

Shared Memory Bank Conflicts
[Figure: shared memory is 32 interleaved 4-byte banks; word index i maps to bank i % 32. When each thread of a warp hits a different bank, all 32 accesses complete in one cycle; two threads hitting different addresses in the same bank (smem[0] and smem[32]) serialize to 2 cycles, and 32 distinct addresses in one bank serialize to 32 cycles (reads of the very same address are instead broadcast). Common pitfall: column access through float smem[32][32] puts every thread in the same bank; padding to float smem[32][33] removes the conflict.]
Add +1 padding to array dimensions to avoid stride-32 bank conflicts

Texture & Constant Memory

GPUs have specialized memory types optimized for specific access patterns:

Memory Type Size Access Pattern Benefits Use Case
Constant Memory 64 KB All threads read same address Broadcast to warp (1 cycle) Kernel parameters, lookup tables
Texture Memory Cache-backed 2D spatial locality Free interpolation, clamping Image processing, irregular access

10. GPU Compute Optimizations

Once memory access is optimized, we can focus on maximizing computational throughput through kernel design optimizations.

Kernel Fusion

Kernel fusion combines multiple operations into a single kernel to reduce memory traffic. Each kernel launch requires reading inputs from global memory and writing outputs back—fusing eliminates intermediate reads/writes.

Kernel Fusion: Eliminating Memory Round-Trips
[Figure: running Y = X² and Z = Y + 1 as separate kernels writes the intermediate Y back to global memory and reads it again (3× traffic: read X, write Y, read Y, write Z); the fused kernel keeps Y in registers and touches global memory only to read X and write the final Z (1× traffic).]
Fusing kernels can provide 2-3× speedup by keeping intermediates in registers
CUDA kernel_fusion.cu
// BAD: Separate kernels for each operation
__global__ void square(float* y, float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * x[i];
}
__global__ void add_one(float* z, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = y[i] + 1.0f;
}
__global__ void relu(float* out, float* z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(z[i], 0.0f);
}

// Launch 3 kernels, 3x memory traffic
square<<<blocks, threads>>>(y, x, n);
add_one<<<blocks, threads>>>(z, y, n);
relu<<<blocks, threads>>>(out, z, n);

// GOOD: Single fused kernel
__global__ void fused_square_add_relu(float* out, float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float val = x[i];           // Read once
        val = val * val;            // Square (register)
        val = val + 1.0f;           // Add (register)
        out[i] = fmaxf(val, 0.0f); // ReLU + Write once
    }
}

// Single launch, 1x memory traffic, 2-3x faster!
fused_square_add_relu<<<blocks, threads>>>(out, x, n);
Fusion in Deep Learning

Modern frameworks like PyTorch 2.0 (torch.compile) and JAX automatically fuse operations. NVIDIA's cuDNN fuses common patterns like Conv+BN+ReLU. Custom Triton kernels give you explicit control over fusion.

Mixed Precision

Mixed precision uses lower-precision formats (FP16, BF16, INT8) for most computations while keeping critical values in FP32. This provides 2-16× higher arithmetic throughput on Tensor Cores, halved memory traffic and footprint, and, with care, accuracy close to full FP32.

Numerical Precision Formats for GPU Computing
Bit layouts: FP32 (32 bits) = 1 sign, 8 exponent, 23 mantissa. FP16 (16 bits) = 1 sign, 5 exponent, 10 mantissa. BF16 (16 bits, "brain float") = 1 sign, 8 exponent, 7 mantissa.
Format Range Precision Relative Speed
FP32 ±3.4×10³⁸ ~7 digits 1×
FP16 ±65,504 ~3 digits 2-8×
BF16 ±3.4×10³⁸ ~2 digits 2-8×
INT8 −128 to 127 integer 4-16×
BF16 combines FP32 range with FP16 speed, making it the best choice for training. Tensor Cores: 312 TFLOPS FP16 vs 19 TFLOPS FP32 on A100, ~16× faster for matrix operations.
BF16 is preferred for training (same range as FP32), FP16/INT8 for inference
Python (PyTorch) mixed_precision.py
import torch
from torch.cuda.amp import autocast, GradScaler

# Automatic Mixed Precision (AMP) training
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()  # Handles gradient scaling for FP16

for data, target in dataloader:
    optimizer.zero_grad()
    
    # Forward pass in FP16 (where safe)
    with autocast(dtype=torch.float16):  # or torch.bfloat16
        output = model(data)
        loss = criterion(output, target)
    
    # Backward pass (scaled gradients)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# What autocast does automatically:
# - MatMuls, Convs → FP16 (Tensor Core accelerated)
# - Reductions, Softmax → FP32 (numerical stability)
# - Loss computation → FP32

# BF16 on Ampere+ (no scaler needed!)
with autocast(dtype=torch.bfloat16):
    output = model(data)  # Stable without gradient scaling

Thread Coarsening

Thread coarsening makes each thread do more work. Instead of one thread per element, have each thread process multiple elements. This reduces launch overhead and can improve instruction-level parallelism.

CUDA thread_coarsening.cu
// Fine-grained: 1 thread per element
__global__ void fine_kernel(float* out, float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = compute(in[i]);
    }
}
// Launch: <<<(n + 255) / 256, 256>>>

// Coarsened: 1 thread per 4 elements
#define COARSEN 4
__global__ void coarse_kernel(float* out, float* in, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * COARSEN;
    
    // Process 4 elements per thread
    #pragma unroll
    for (int j = 0; j < COARSEN; j++) {
        int i = base + j;
        if (i < n) {
            out[i] = compute(in[i]);
        }
    }
}
// Launch: <<<(n + 256 * COARSEN - 1) / (256 * COARSEN), 256>>> (4× fewer blocks!)

// Benefits of coarsening:
//   - Fewer blocks → less launch overhead
//   - Unrolled loop → better ILP
//   - Amortize index calculation
//   - Can enable register reuse

// When to coarsen:
//   - Simple kernels with low occupancy limiters
//   - Compute-bound kernels
//   - When block count is very high

Reducing Warp Divergence

Warp divergence (covered in Section 8) can be mitigated through several techniques:

CUDA divergence_mitigation.cu
// BAD: Thread-dependent branching causes divergence
__global__ void divergent(float* out, float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (in[i] > 0) {
            out[i] = expensive_func_A(in[i]);  // Some threads
        } else {
            out[i] = expensive_func_B(in[i]);  // Other threads
        }
    }
}

// BETTER: Predication for simple operations
__global__ void predicated(float* out, float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float val = in[i];
        // Both paths computed, result selected (no branch)
        out[i] = (val > 0) ? val * val : -val;
    }
}

// BETTER: Sort data to cluster similar values
// If in[0..999] > 0 and in[1000..1999] <= 0,
// different warps take different paths (no intra-warp divergence)

// BETTER: Warp-level voting for uniform branches
__global__ void warp_vote(float* out, float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        bool cond = in[i] > 0;
        
        // Check if ALL threads in warp agree
        if (__all_sync(0xffffffff, cond)) {
            // Uniform path - no divergence
            out[i] = expensive_func_A(in[i]);
        } else if (__all_sync(0xffffffff, !cond)) {
            out[i] = expensive_func_B(in[i]);
        } else {
            // Mixed - must diverge, use simpler fallback
            out[i] = simple_func(in[i]);
        }
    }
}
GPU Optimization Helps Memory-Bound? Helps Compute-Bound? When to Apply
Memory Coalescing Yes (critical!) Somewhat Always check first
Shared Memory Yes Yes (reuse) Data reuse within block
Bank Conflict Fix Yes No When using shared memory
Kernel Fusion Yes Somewhat Multiple consecutive ops
Mixed Precision Yes (2×) Yes (2-16×) Almost always (DL)
Thread Coarsening No Yes Simple, high-occupancy kernels
Reduce Divergence No Yes Data-dependent branches

11. GPU Execution Optimizations

Beyond kernel-level optimizations, how you launch and orchestrate GPU work significantly impacts performance. CUDA provides mechanisms to overlap computation with data transfer and reduce launch overhead.

CUDA Streams

A CUDA stream is a sequence of operations that execute in order on the GPU. Operations in different streams can execute concurrently, enabling overlap of host-device transfers with kernel execution, and of independent kernels with one another.

CUDA Streams: Overlapping Compute and Transfer
[Figure: in the default stream, the H2D copy, kernel, and D2H copy for two batches run back-to-back in 9 time units; with two streams, copies overlap with compute and kernel A overlaps with kernel B, finishing in 5 units, ~1.8× faster.]
Multiple streams enable concurrent execution of independent operations
CUDA cuda_streams.cu
// Create streams
cudaStream_t stream[2];
for (int i = 0; i < 2; i++) {
    cudaStreamCreate(&stream[i]);
}

// Divide work into chunks for overlap
int chunk_size = N / 2;

for (int i = 0; i < 2; i++) {
    int offset = i * chunk_size;
    
    // Async H2D copy (non-blocking)
    cudaMemcpyAsync(d_in + offset, h_in + offset, 
                    chunk_size * sizeof(float),
                    cudaMemcpyHostToDevice, stream[i]);
    
    // Launch kernel in stream
    myKernel<<<blocks, threads, 0, stream[i]>>>(d_out + offset, d_in + offset, chunk_size);
    
    // Async D2H copy
    cudaMemcpyAsync(h_out + offset, d_out + offset,
                    chunk_size * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[i]);
}

// Wait for all streams
cudaDeviceSynchronize();

// Cleanup
for (int i = 0; i < 2; i++) {
    cudaStreamDestroy(stream[i]);
}

// IMPORTANT: Host memory must be pinned for async transfers!
float* h_pinned;
cudaMallocHost(&h_pinned, N * sizeof(float));  // Pinned memory
// Regular malloc() will cause cudaMemcpyAsync to be synchronous!
Pinned Memory Required!

Async memory transfers (cudaMemcpyAsync) only work with pinned (page-locked) host memory allocated with cudaMallocHost. Regular malloc memory forces synchronous transfers!

CUDA Graphs

CUDA Graphs capture a sequence of GPU operations (kernels, memory copies) into a graph that can be launched with a single API call. This dramatically reduces CPU overhead for repetitive workloads.

CUDA Graphs: Reducing Launch Overhead
[Figure: launching kernels one by one costs ~5-10 μs of CPU time each, so 1,000 kernels accumulate 5-10 ms of launch overhead; a CUDA graph launches the whole DAG (H2D, K1, K2, K3, D2H, with dependencies tracked automatically and independent nodes run in parallel) for roughly one launch's worth of overhead.]
CUDA Graphs are ideal for inference loops where the same operations repeat thousands of times
CUDA cuda_graphs.cu
// Method 1: Stream capture (easiest)
cudaGraph_t graph;
cudaGraphExec_t graphExec;

// Capture operations into a graph
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
{
    // These operations are recorded, not executed
    cudaMemcpyAsync(d_in, h_in, size, cudaMemcpyHostToDevice, stream);
    kernel1<<<grid, block, 0, stream>>>(d_tmp, d_in);
    kernel2<<<grid, block, 0, stream>>>(d_out, d_tmp);
    cudaMemcpyAsync(h_out, d_out, size, cudaMemcpyDeviceToHost, stream);
}
cudaStreamEndCapture(stream, &graph);

// Instantiate executable graph (one-time cost)
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// Execute graph many times (very fast!)
for (int iter = 0; iter < 10000; iter++) {
    cudaGraphLaunch(graphExec, stream);  // ~5μs per launch!
}

cudaGraphExecDestroy(graphExec);
cudaGraphDestroy(graph);

// PyTorch CUDA Graphs
// ===================
// g = torch.cuda.CUDAGraph()
// with torch.cuda.graph(g):
//     output = model(input)  # Captured; input must be a static tensor reused across replays
// 
// # Replay graph (fast!)
// for _ in range(1000):
//     g.replay()

Async Memory Operations

Modern GPUs (Ampere and newer) support asynchronous copies from global to shared memory that bypass the register file, letting threads keep computing while data is in flight.

CUDA async_copy.cu
// Async copy from global to shared memory (Ampere+)
#include <cooperative_groups.h>
#include <cuda/pipeline>

__global__ void async_copy_kernel(float* global_data) {
    __shared__ float smem[256];
    
    // Cooperative group for async operations
    auto block = cooperative_groups::this_thread_block();
    
    // Create pipeline for async copies
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 1> pipe_state;
    auto pipe = cuda::make_pipeline(block, &pipe_state);
    
    // Initiate async copy (returns immediately)
    pipe.producer_acquire();
    cuda::memcpy_async(
        smem + threadIdx.x,           // Destination (shared)
        global_data + threadIdx.x,    // Source (global)
        sizeof(float),
        pipe
    );
    pipe.producer_commit();
    
    // Do other work while copy is in flight!
    float local_compute = expensive_calculation();  // placeholder for independent work
    
    // Wait for copy to complete
    pipe.consumer_wait();
    
    // Now use smem data
    float result = smem[threadIdx.x] + local_compute;
}

12. Case Studies

Let's apply the optimization techniques we've learned to real-world problems.

Case Study 1: Matrix Multiplication

Matrix multiplication (GEMM) is the most important operation in deep learning. Let's trace the optimization journey from naive to highly optimized:

Matrix Multiplication Optimization Journey
[Figure: Optimization ladder on an A100: naive three nested loops (~100 GFLOPS) → shared-memory tiling (~1 TFLOPS) → coalescing and bank-conflict elimination (~4 TFLOPS) → float4 vectorization with double buffering (~10 TFLOPS) → Tensor Cores via WMMA/MMA (~150 TFLOPS). Key insight: cuBLAS already implements all of these optimizations and more; use cuBLAS/cuDNN for standard ops and write custom kernels only for novel operations.]
1500× improvement from naive to Tensor Core implementation
CUDA tiled_matmul.cu
// Tiled matrix multiplication with shared memory
#define TILE_SIZE 32

__global__ void matmul_tiled(float* C, float* A, float* B, int N) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE + 1];  // +1 avoids bank conflicts
    
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    
    float sum = 0.0f;
    
    // Loop over tiles (assumes N is a multiple of TILE_SIZE)
    for (int t = 0; t < N / TILE_SIZE; t++) {
        // Collaborative load: each thread loads one element
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        
        __syncthreads();
        
        // Compute partial dot product from shared memory
        #pragma unroll
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        
        __syncthreads();
    }
    
    C[row * N + col] = sum;
}

// Why this is faster:
// - Each element loaded once from global, reused TILE_SIZE times
// - Arithmetic intensity: O(TILE_SIZE) ops per byte loaded
// - Coalesced global loads, fast shared memory access

Case Study 2: Flash Attention

Flash Attention is a landmark optimization that makes Transformer attention memory-efficient by fusing operations and using tiled computation to avoid materializing the full N×N attention matrix.

Flash Attention: Tiled Attention Without Materializing Full Matrix
[Figure: Standard attention stores the full N×N score matrix in HBM (O(N²) memory; the figure cites 64GB of attention buffers at sequence length 4096) and makes three round trips to slow HBM. Flash Attention computes tiled QKᵀ → online softmax → ·V entirely in SRAM, never materializing the N×N matrix: O(N) memory, a single HBM round trip, 2-4× faster, and it fits any sequence length. Key techniques: kernel fusion (QKᵀ + softmax + ·V in one kernel, no intermediate HBM writes), online softmax (computed incrementally by tracking a running max and sum), and tiling (Q, K, V processed in blocks that fit in shared memory).]
Flash Attention achieves a 2-4× speedup because attention is memory-bound: reducing HBM traffic directly reduces runtime

Case Study 3: LLM Inference Optimization

Large Language Model inference faces unique challenges: the autoregressive decode phase generates one token at a time, making it heavily memory-bound.

LLM Inference: Prefill vs Decode Phases
[Figure: The prefill phase processes the entire prompt at once (effective batch of 1000+ tokens), has high arithmetic intensity, and is compute-bound. The decode phase generates one token at a time per sequence, loading the entire model for each token, and is memory-bound. Decode optimizations: KV cache (reuse K,V from past tokens), batching (process multiple sequences together), quantization (INT8/INT4 weights), and speculative decoding (draft model plus parallel verification).]
LLM decode is memory-bound: optimizations focus on reducing memory traffic
Optimization           | Technique                     | Speedup                | Trade-off
-----------------------|-------------------------------|------------------------|--------------------------------
KV Cache               | Cache attention keys/values   | ~N× (sequence length)  | Memory usage grows with context
Continuous Batching    | Dynamically batch requests    | 2-4× throughput        | Latency variance
INT8 Quantization      | 8-bit weights                 | 2× memory, ~1.8× speed | <1% accuracy loss
INT4/AWQ               | 4-bit weights                 | 4× memory, ~3× speed   | ~1-2% accuracy loss
PagedAttention (vLLM)  | Virtual memory for KV cache   | ~24× throughput        | Implementation complexity
Flash Attention        | Fused, tiled attention        | 2-4× attention speed   | None (pure win)

13. Conclusion

Performance optimization is fundamentally about understanding where your code spends time and what limits it. The key insights from this guide:

Performance Optimization Decision Framework
[Figure: Profile first, then classify the bottleneck. Memory-bound (arithmetic intensity below the ridge point): coalescing (GPU), cache blocking/tiling, SoA data layout, prefetching, kernel fusion, mixed precision (FP16), reduced data movement. Compute-bound (AI above the ridge point): SIMD/vectorization, Tensor Cores, loop unrolling/ILP, reduced divergence, algorithmic improvements, better parallelization, occupancy tuning. Always consider: is the algorithm optimal (O(n) vs O(n²))? Are you using the right libraries (cuBLAS, etc.)? Can work run asynchronously (streams, graphs)? Profile again after each change.]
Always profile first, identify the bottleneck, apply targeted optimizations, and profile again

Key Takeaways

The Performance Optimization Checklist
  1. Profile first — Never optimize blind. Use perf, VTune, nsys, ncu.
  2. Identify the bottleneck — Memory-bound or compute-bound? Use roofline model.
  3. Check your algorithm — No amount of micro-optimization makes O(n²) beat O(n log n) at scale.
  4. Use optimized libraries — cuBLAS, MKL, Flash Attention before custom code.
  5. Optimize data movement — Most code is memory-bound; reduce traffic.
  6. Match data layout to access — SoA for vectorization, coalescing for GPU.
  7. Exploit parallelism — SIMD, multi-threading, GPU thousands of threads.
  8. Fuse operations — Keep data in fast memory (registers, cache, shared).
  9. Use lower precision — FP16/BF16/INT8 when accuracy permits.
  10. Profile again — Verify improvement, watch for new bottlenecks.

Further Reading

Series Complete!

This comprehensive guide covered CPU and GPU optimization from fundamentals to advanced techniques. The journey from understanding bottlenecks to writing optimized code requires practice—start profiling your own code and apply these principles. Happy optimizing!