1. Introduction: The Memory Wall

Modern hardware performance is a balancing act between computation speed (how fast we can do math) and memory bandwidth (how fast we can move data). Understanding this balance is the foundation of all performance optimization.

In the early days of computing, processors and memory operated at similar speeds. A CPU could fetch data, perform calculations, and store results without waiting. But starting in the 1980s, processor speeds began growing exponentially while memory speeds improved only linearly:

[Figure] The Memory Wall: CPU vs Memory Performance Gap — relative performance (log scale) vs. year, 1980-2020, with markers for caches becoming common, the multi-core era, and GPU computing.
CPU compute performance has grown ~10,000× since 1980, while DRAM bandwidth grew only ~100×

This growing disparity—known as the Memory Wall—means that for most real-world applications, the processor spends most of its time waiting for data rather than computing. By 2020, a typical CPU can perform thousands of floating-point operations in the time it takes to fetch a single cache line from main memory.

The Hard Truth

On modern hardware, data movement costs more than computation. Moving a 64-bit value from DRAM costs ~200 cycles, while a fused multiply-add (FMA) takes 4-5 cycles. You can do 40-50 math operations in the time it takes to fetch one number!
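The arithmetic behind that claim is easy to check; a quick Python sketch using the approximate cycle counts quoted above:

```python
# Back-of-envelope check using the approximate cycle counts quoted above
DRAM_FETCH_CYCLES = 200   # ~cycles to pull one value from DRAM
FMA_LATENCY_CYCLES = 4.5  # midpoint of the 4-5 cycle FMA figure

ops_per_fetch = DRAM_FETCH_CYCLES / FMA_LATENCY_CYCLES
print(f"~{ops_per_fetch:.0f} FMAs fit in one DRAM fetch")  # ~44, i.e. 40-50
```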

Real-World Impact

Consider this simple vector addition:

C++ naive_add.cpp
void vector_add(float* a, float* b, float* c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];  // 1 FLOP, 12 bytes accessed
    }
}
// Arithmetic Intensity = 1 / 12 = 0.083 FLOP/Byte
// On a 100 GB/s memory system: max 8.3 GFLOPS
// But CPU can do 100+ GFLOPS!
// → 92% of compute capacity WASTED

This kernel has an arithmetic intensity of 0.083 FLOP/Byte—far below the ~10 FLOP/Byte threshold needed to utilize modern CPU compute capabilities. The processor is memory-bound, throttled by how fast it can load and store data.

The Two Types of Bottlenecks

Relative to the roofline, every computation falls into one of two categories:

Memory-Bound

The processor waits for data. Performance is limited by memory bandwidth.

  • Low arithmetic intensity
  • Optimization: reduce data movement
  • Examples: streaming, sparse ops

Compute-Bound

The processor is doing math. Performance is limited by compute throughput.

  • High arithmetic intensity
  • Optimization: use more compute
  • Examples: dense GEMM, convolutions

The critical insight: the right optimization depends on which bound you're in. Vectorizing a memory-bound kernel yields little; tuning memory access patterns in a compute-bound kernel is wasted effort. Diagnose before you optimize.

2. Diagnosing Performance Bottlenecks

Before writing a single line of optimization code, you must answer: is my kernel memory-bound or compute-bound? This section covers the tools and techniques for diagnosis.

The Roofline Model Recap

The Roofline Model provides a visual framework for understanding performance limits. The key metric is Arithmetic Intensity (AI):

$$\text{AI} = \frac{\text{FLOPs performed}}{\text{Bytes accessed from memory}}$$

The roofline plot shows two ceilings: a slanted memory ceiling (attainable GFLOPS = AI × peak bandwidth) and a flat compute ceiling (peak GFLOPS).

Where they meet is the ridge point—the AI at which your code transitions from memory-bound to compute-bound:

$$\text{Ridge Point} = \frac{\text{Peak FLOPS}}{\text{Peak Bandwidth}}$$
[Figure] The Roofline Model: performance (GFLOP/s) vs. arithmetic intensity (FLOP/Byte), with Vector Add under the memory-bound slope and Dense GEMM under the compute-bound ceiling.
Operations below and left of the ridge are memory-bound; above and right are compute-bound
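The two ceilings and the ridge point fold into a single attainable-performance function; a minimal sketch, plugging in the vector-add and GEMM figures used in this document:

```python
def attainable_gflops(ai, peak_gflops, peak_bw_gb_s):
    """Roofline: performance is capped by the lower of the two ceilings."""
    return min(peak_gflops, ai * peak_bw_gb_s)

# Vector add (AI = 0.083) on a 100 GB/s, 100-GFLOPS machine: memory-bound
print(f"{attainable_gflops(0.083, 100, 100):.1f} GFLOPS")
# Dense GEMM (AI ~ 50) on the same machine: compute-bound
print(f"{attainable_gflops(50, 100, 100):.1f} GFLOPS")
```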

Profiling Tools

Theoretical analysis tells you what should happen. Profilers tell you what actually happens:

Tool           | Platform   | Key Metrics                                 | When to Use
perf           | Linux CPU  | IPC, cache misses, branch mispredicts       | Quick CPU profiling
Intel VTune    | Intel CPU  | Roofline, memory hierarchy, threading       | Deep CPU analysis
AMD uProf      | AMD CPU    | Similar to VTune, for AMD                   | AMD systems
Nsight Compute | NVIDIA GPU | SM utilization, memory throughput, roofline | CUDA kernel optimization
Nsight Systems | NVIDIA GPU | Timeline, kernel launches, memory transfers | Application-level GPU analysis
rocprof        | AMD GPU    | HIP kernel metrics                          | ROCm optimization

CPU Profiling with perf

Bash perf_profiling.sh
# Basic statistics
perf stat ./my_program

# Key counters for memory-bound diagnosis
perf stat -e cycles,instructions,cache-references,cache-misses,\
          L1-dcache-loads,L1-dcache-load-misses,\
          LLC-loads,LLC-load-misses ./my_program

# Sample output:
#  12,345,678,901  cycles
#   3,456,789,012  instructions              # IPC: 0.28 (LOW!)
#     456,789,012  cache-references
#     234,567,890  cache-misses              # 51% miss rate (HIGH!)
#
# Diagnosis: Memory-bound (IPC < 1, high cache misses)
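The diagnosis in the sample output can be reproduced directly from the raw counters; a small helper (the 20% miss-rate cutoff is an illustrative threshold, not a hard rule):

```python
def perf_summary(cycles, instructions, cache_refs, cache_misses):
    """Derive IPC and cache-miss rate from raw perf counter values."""
    ipc = instructions / cycles
    miss_rate = cache_misses / cache_refs
    # Illustrative cutoff: low IPC plus a high miss rate points at memory
    bound = "likely memory-bound" if ipc < 1 and miss_rate > 0.2 else "check further"
    return ipc, miss_rate, bound

# Counter values from the sample output above
ipc, miss_rate, bound = perf_summary(
    cycles=12_345_678_901, instructions=3_456_789_012,
    cache_refs=456_789_012, cache_misses=234_567_890)
print(f"IPC {ipc:.2f}, miss rate {miss_rate:.0%} → {bound}")
```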

GPU Profiling with Nsight Compute

Bash ncu_profiling.sh
# Profile a specific kernel
ncu --set full -o profile ./my_cuda_app

# Quick metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
              dram__throughput.avg.pct_of_peak_sustained_elapsed,\
              gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed \
    ./my_cuda_app

# Key indicators:
# - SM Throughput high, DRAM low → Compute-bound
# - SM Throughput low, DRAM high → Memory-bound
# - Both low → Latency-bound or underutilized

Identifying Your Bound

Here's a systematic approach to diagnose your kernel:

Performance Diagnosis Flowchart
1. Calculate AI and compare it to the ridge point.
2. AI < ridge → Memory-Bound. Profile shows: low IPC (< 1), high cache misses, DRAM bandwidth near peak. Optimize: cache blocking, prefetching, data layout (SoA), kernel fusion, compression.
3. AI > ridge → Compute-Bound. Profile shows: high IPC (> 1), low cache misses, SM/ALU near saturation. Optimize: vectorization (SIMD/Tensor), loop unrolling, ILP, mixed precision.
4. Neither ceiling reached → latency-bound or underutilized. Check for: thread synchronization, branch divergence, low occupancy, dependency chains.
Systematic approach: calculate AI, verify with profiler, then apply targeted optimizations
Python diagnose_kernel.py
def diagnose_kernel(
    flops: int,
    bytes_accessed: int,
    observed_gflops: float,
    peak_gflops: float,
    peak_bw_gb_s: float
):
    """Diagnose whether a kernel is memory or compute bound."""
    
    # Calculate arithmetic intensity
    ai = flops / bytes_accessed
    
    # Calculate ridge point
    ridge = peak_gflops / peak_bw_gb_s
    
    # Calculate theoretical max performance
    if ai < ridge:
        theoretical_max = ai * peak_bw_gb_s
        bound = "Memory-bound"
    else:
        theoretical_max = peak_gflops
        bound = "Compute-bound"
    
    # Calculate efficiency
    efficiency = observed_gflops / theoretical_max * 100
    
    print(f"Arithmetic Intensity: {ai:.2f} FLOP/Byte")
    print(f"Ridge Point: {ridge:.2f} FLOP/Byte")
    print(f"Classification: {bound}")
    print(f"Theoretical Max: {theoretical_max:.1f} GFLOPS")
    print(f"Achieved: {observed_gflops:.1f} GFLOPS")
    print(f"Efficiency: {efficiency:.1f}%")
    
    if efficiency < 50:
        print(f"\n⚠️  Low efficiency! Check for:")
        if bound == "Memory-bound":
            print("  - Cache misses")
            print("  - Uncoalesced access (GPU)")
            print("  - Poor data locality")
        else:
            print("  - No vectorization")
            print("  - Low occupancy (GPU)")
            print("  - Dependency stalls")

# Example: Matrix-vector multiply on CPU
# y = A @ x, where A is 4096×4096, x is 4096×1
diagnose_kernel(
    flops=2 * 4096 * 4096,              # 33.5M FLOPs
    bytes_accessed=(4096*4096 + 4096 + 4096) * 4,  # 67 MB
    observed_gflops=8.5,
    peak_gflops=500,                   # Modern CPU
    peak_bw_gb_s=50                    # DDR4
)

# Output:
# Arithmetic Intensity: 0.50 FLOP/Byte
# Ridge Point: 10.00 FLOP/Byte
# Classification: Memory-bound
# Theoretical Max: 25.0 GFLOPS
# Achieved: 8.5 GFLOPS
# Efficiency: 34.0%
#
# ⚠️  Low efficiency! Check for:
#   - Cache misses
#   - Uncoalesced access (GPU)
#   - Poor data locality
The 50% Rule

If your kernel achieves less than 50% of its theoretical maximum (based on the roofline), there's likely a fixable inefficiency. Common culprits: cache misses, uncoalesced access, branch divergence, or lack of vectorization.

3. CPU Architecture Fundamentals

To optimize for CPUs, you need to understand how they're built. Modern CPUs are designed to hide memory latency and maximize throughput through several mechanisms:

Cache Hierarchy

CPUs use a hierarchy of progressively larger, slower caches to bridge the gap between processor speed and main memory:

[Figure] CPU Cache Hierarchy (Typical Modern Desktop)
  • CPU Core: ~4 GHz, 4-wide issue
  • L1 Cache (per core): 32-64 KB | 4-5 cycles (~1 ns) | ~1 TB/s
  • L2 Cache (per core): 256 KB - 1 MB | 12-15 cycles (~4 ns) | ~500 GB/s
  • L3 Cache (shared): 8-64 MB | 40-50 cycles (~15 ns) | ~200 GB/s
  • Main Memory (DRAM): 16-128 GB | 150-200 cycles (~60 ns) | 50-100 GB/s
Each cache level is ~10× larger but ~3-4× slower than the previous

  • L1 Cache: 64 KB per core
  • L1 Latency: 4-5 cycles (~1 ns)
  • L3 Cache: 32 MB shared
  • DRAM Latency: ~200 cycles (~60 ns)

The key insight: cache misses are expensive. An L1 hit costs 4 cycles; a DRAM fetch costs 200+. Code that fits in cache runs 50× faster than code that doesn't. This is why locality—keeping frequently accessed data close together and re-accessing it while still cached—is the primary CPU optimization strategy.
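One way to quantify "cache misses are expensive" is average memory access time (AMAT), using the per-level latencies above; the hit rates below are illustrative assumptions:

```python
def amat(hit_rates, latencies, dram_cycles):
    """Average memory access time in cycles, walking down the hierarchy.

    hit_rates[i]: fraction of accesses reaching level i that hit there.
    """
    time, reach = 0.0, 1.0  # reach = fraction of accesses reaching this level
    for hit, lat in zip(hit_rates, latencies):
        time += reach * hit * lat
        reach *= 1.0 - hit
    return time + reach * dram_cycles

# L1/L2/L3 latencies roughly as in the figure above; hit rates are assumptions
cache_friendly = amat([0.95, 0.80, 0.90], [4, 14, 45], 200)
cache_hostile  = amat([0.50, 0.50, 0.50], [4, 14, 45], 200)
print(f"{cache_friendly:.1f} vs {cache_hostile:.1f} cycles per access")
```

With these assumptions the cache-friendly code averages ~5 cycles per access versus ~36 for the cache-hostile one—a ~7× gap from locality alone.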

C++ cache_impact.cpp
// Example: Impact of stride on cache performance
const int N = 1024 * 1024;  // 1M elements = 4 MB
float data[N];
int random_indices[N];      // Precomputed shuffled indices in [0, N)
float sum = 0;

// Sequential access: Perfect cache utilization
// Each cache line (64B) serves 16 floats
for (int i = 0; i < N; i++) {
    sum += data[i];  // ~0.25 cycles/element (cache hit)
}

// Strided access: Cache line wasted
// Only 1 of 16 floats used per cache line fetch
for (int i = 0; i < N; i += 16) {
    sum += data[i];  // ~4 cycles/element (cache miss every time)
}

// Random access: Worst case
for (int i = 0; i < N; i++) {
    sum += data[random_indices[i]];  // ~50+ cycles/element (DRAM)
}

Pipeline & Out-of-Order Execution

Modern CPUs don't execute one instruction at a time. They use pipelining to overlap instruction execution and out-of-order execution to keep functional units busy:

[Figure] CPU Pipeline & Out-of-Order Execution. Pipeline stages: Fetch → Decode → Rename → Schedule → Execute → Retire. OoO example: given program order (1) r1 = load [x], (2) r2 = r1 + 1, (3) r3 = load [y], the scheduler issues load (3) early while (2) waits for r1. Key pipeline metrics: issue width 4-8 instructions/cycle, reorder buffer 200-300 entries, load buffer ~100 outstanding loads; IPC < 1 suggests memory/latency bound, IPC > 2 good utilization.
Out-of-order execution overlaps independent operations to hide latency
IPC: The Key Metric

Instructions Per Cycle (IPC) tells you how well your code utilizes the CPU pipeline. Modern CPUs can theoretically execute 4-8 instructions per cycle. If your IPC is below 1, you're likely stalled waiting for data. Low IPC usually indicates memory-boundness or long dependency chains.

Branch Prediction

Branches (if/else, loops) are problematic for pipelines. The CPU must guess which path to take before it knows the condition result. A misprediction costs 15-20 cycles as the pipeline is flushed:

C++ branch_impact.cpp
// Predictable branch: CPU learns the pattern
for (int i = 0; i < N; i++) {
    if (i < N/2) {  // Predictable: always true then always false
        sum += data[i];
    }
}
// Prediction accuracy: ~99%
// Mispredictions: ~2 (at the transition)

// Unpredictable branch: 50/50 random
for (int i = 0; i < N; i++) {
    if (data[i] > 0.5) {  // Random data: unpredictable!
        sum += data[i];
    }
}
// Prediction accuracy: ~50%
// Mispredictions: ~N/2
// Cost: N/2 × 15 cycles = 7.5N extra cycles!

// Branchless version using conditional move
for (int i = 0; i < N; i++) {
    sum += (data[i] > 0.5) ? data[i] : 0;  // May compile to CMOV
}
// Or manually branchless:
for (int i = 0; i < N; i++) {
    float mask = (data[i] > 0.5f) ? 1.0f : 0.0f;
    sum += data[i] * mask;
}
// No branch mispredictions!
Pattern                       | Prediction Accuracy | Performance Impact
Always taken / never taken    | ~99%                | Negligible
Regular pattern (TTTNTTTN...) | ~95%                | Minor
Data-dependent (random)       | ~50%                | Severe (15-20 cycles/mispredict)
Branchless code               | N/A                 | No penalty
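The "severe" row is easy to put in cycle terms; a quick model with a 17-cycle penalty (midpoint of the 15-20 range above):

```python
def mispredict_overhead_per_element(accuracy, penalty_cycles=17):
    """Average pipeline-flush cycles paid per loop iteration."""
    return (1.0 - accuracy) * penalty_cycles

print(f"{mispredict_overhead_per_element(0.99):.2f} cycles/element")  # predictable
print(f"{mispredict_overhead_per_element(0.50):.2f} cycles/element")  # random data
```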

SIMD Units

Single Instruction Multiple Data (SIMD) allows a CPU to perform the same operation on multiple data elements simultaneously. Modern x86 CPUs support:

[Figure] SIMD Vector Widths Evolution: SSE (128-bit, 4 × float32) → AVX/AVX2 (256-bit, 8 × float32) → AVX-512 (512-bit, 16 × float32) — up to 16× theoretical speedup per instruction.
Each generation doubles the number of elements processed per instruction
C++ simd_example.cpp
#include <immintrin.h>

// Scalar: Process 1 element at a time
void add_scalar(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];  // 1 element per iteration
    }
}

// AVX2: Process 8 elements at a time
void add_avx2(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i += 8) {
        // Load 8 floats from each array
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        
        // Add all 8 pairs in a single instruction
        __m256 vc = _mm256_add_ps(va, vb);
        
        // Store 8 results
        _mm256_storeu_ps(&c[i], vc);
    }
}

// AVX-512: Process 16 elements at a time
void add_avx512(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(&a[i]);
        __m512 vb = _mm512_loadu_ps(&b[i]);
        __m512 vc = _mm512_add_ps(va, vb);
        _mm512_storeu_ps(&c[i], vc);
    }
}

// Note: Modern compilers can auto-vectorize with -O3 -march=native
// Check with: gcc -fopt-info-vec-optimized or clang -Rpass=loop-vectorize
SIMD Reality Check

SIMD helps compute-bound code by increasing arithmetic throughput. For memory-bound code (like the simple vector add above), SIMD provides limited benefit because you're already bottlenecked on memory bandwidth, not compute. SIMD shines when you do many operations on data already in cache.

SIMD Extension | Width        | float32 | float64 | Availability
SSE            | 128-bit      | 4       | 2       | All modern x86
AVX            | 256-bit      | 8       | 4       | Sandy Bridge+ (2011)
AVX2           | 256-bit      | 8       | 4       | Haswell+ (2013)
AVX-512        | 512-bit      | 16      | 8       | Skylake-X+ (2017)
ARM NEON       | 128-bit      | 4       | 2       | All ARMv8
ARM SVE/SVE2   | 128-2048 bit | 4-64    | 2-32    | ARMv8.2+ (scalable)
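The reality check above can be made concrete with a two-term time model; the scalar FLOP rate and bandwidth below are illustrative assumptions:

```python
def kernel_time(n_flops, n_bytes, scalar_gflops, simd_lanes, bw_gb_s):
    """Time ≈ max(compute time, memory time), assuming perfect overlap."""
    compute_s = n_flops / (scalar_gflops * 1e9 * simd_lanes)
    memory_s = n_bytes / (bw_gb_s * 1e9)
    return max(compute_s, memory_s)

# Vector add over 1M floats: 1 FLOP and 12 bytes per element (AI = 0.083)
scalar = kernel_time(1e6, 12e6, 4, 1, 50)   # no SIMD
avx2   = kernel_time(1e6, 12e6, 4, 8, 50)   # 8 lanes
print(f"SIMD speedup on memory-bound add: {scalar / avx2:.2f}x")
```

Eight lanes buy only a few percent here because the memory term dominates; the same model predicts near-8× gains once the working set sits in cache (raise bw_gb_s).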

4. CPU Memory Optimizations

For memory-bound CPU code, the goal is to minimize data movement and maximize cache hits. This section covers the essential techniques.

Data Layout: Array of Structures vs Structure of Arrays

How you organize data in memory has massive performance implications. The two primary layouts are:

Array of Structures (AoS)

Each object's fields are stored contiguously

struct Particle {
  float x, y, z;
  float vx, vy, vz;
};
Particle particles[N];

Memory: [x₀,y₀,z₀,vx₀,vy₀,vz₀,x₁,y₁,...]

Structure of Arrays (SoA)

Each field stored in its own array

struct Particles {
  float x[N], y[N], z[N];
  float vx[N], vy[N], vz[N];
};
Particles particles;

Memory: [x₀,x₁,x₂...][y₀,y₁,y₂...]...

[Figure] AoS vs SoA Memory Access Patterns. AoS: when accessing only x, just 1/6 of each cache line is used (16.7% utilization) and SIMD requires gather operations (~10× slower than sequential loads). SoA: when accessing only x, entire cache lines are used (100% utilization) and sequential SIMD loads run at maximum throughput. Rule: use SoA when accessing one field across many objects (common in HPC); use AoS when accessing all fields of one object (better for random access patterns).
SoA enables full cache line utilization and efficient SIMD vectorization
C++ aos_vs_soa.cpp
#include <immintrin.h>      // AVX intrinsics
constexpr int N = 1 << 20;  // Example element count

// Array of Structures (AoS) - Object-oriented style
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
    float mass;
};
ParticleAoS particles_aos[N];

// Processing - poor cache utilization when only accessing some fields
void update_positions_aos(ParticleAoS* p, float dt, int n) {
    for (int i = 0; i < n; i++) {
        p[i].x += p[i].vx * dt;  // Loads all 28 bytes per particle
        p[i].y += p[i].vy * dt;  // but uses only 24 (x,y,z,vx,vy,vz);
        p[i].z += p[i].vz * dt;  // the mass field wastes cache space
    }
}

// Structure of Arrays (SoA) - Data-oriented style
struct ParticlesSoA {
    float* x;  float* y;  float* z;
    float* vx; float* vy; float* vz;
    float* mass;
};

// Processing - excellent cache utilization + auto-vectorization
void update_positions_soa(ParticlesSoA& p, float dt, int n) {
    // Each loop is perfectly vectorizable!
    for (int i = 0; i < n; i++) p.x[i] += p.vx[i] * dt;
    for (int i = 0; i < n; i++) p.y[i] += p.vy[i] * dt;
    for (int i = 0; i < n; i++) p.z[i] += p.vz[i] * dt;
}

// Even better: fused loops with explicit SIMD
void update_positions_soa_avx(ParticlesSoA& p, float dt, int n) {
    __m256 vdt = _mm256_set1_ps(dt);
    
    for (int i = 0; i < n; i += 8) {
        // Process 8 particles' x coordinates at once
        __m256 vx = _mm256_loadu_ps(&p.x[i]);
        __m256 vvx = _mm256_loadu_ps(&p.vx[i]);
        vx = _mm256_fmadd_ps(vvx, vdt, vx);  // x += vx * dt
        _mm256_storeu_ps(&p.x[i], vx);
        
        // Similar for y and z...
    }
}

// Performance difference: 3-6× speedup from SoA + SIMD
Aspect                          | AoS                               | SoA
Cache efficiency (field subset) | Poor - loads unused fields        | Excellent - loads only needed data
SIMD vectorization              | Requires gather/scatter           | Sequential loads (native)
Random access by object         | Good - single pointer dereference | Poor - multiple array accesses
Memory allocation               | Simple (one array)                | Complex (multiple arrays)
Best use case                   | Object-by-object processing       | Field-by-field processing (HPC)
When to Use Each

SoA: Batch processing many objects, iterating through one field (common in physics simulations, ML inference). AoS: Object-oriented access, spatial data structures (trees, graphs), when you need all fields of an object together.
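The utilization claims are simple ratios; a sketch using the 28-byte particle from the listing above:

```python
def fetched_bytes_used(struct_bytes, used_bytes):
    """Fraction of each fetched struct the loop actually touches (AoS)."""
    return used_bytes / struct_bytes

# Position update touches x,y,z,vx,vy,vz = 24 of 28 bytes per particle
print(f"AoS, position update: {fetched_bytes_used(28, 24):.0%} used")
# A loop reading only x touches 4 of 28 bytes
print(f"AoS, x only:          {fetched_bytes_used(28, 4):.0%} used")
# SoA streams each field contiguously, so every fetched byte is used: 100%
```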

Cache Blocking (Tiling)

Cache blocking partitions data to fit in cache, dramatically reducing memory traffic. The classic example is matrix multiplication:

[Figure] Naive vs Cache-Blocked Matrix Multiplication. Naive: O(N³) cache misses — column access to B breaks prefetching, so each B[k][j] access is a cache miss. Blocked: O(N³/B) cache misses — tiles fit in L1/L2 cache and each tile is reused B times, giving B× fewer loads. Optimal tile size for a 32 KB L1 cache with float32: 3 tiles × B² × 4 bytes ≤ 32 KB → B ≤ √(32768/12) ≈ 52, so use B = 32 or 64 (power of 2 for alignment).
Cache blocking ensures data reuse before eviction, reducing memory traffic by a factor of the block size
C++ cache_blocking.cpp
#include <algorithm>    // std::min
#include <cstring>      // memset
#include <immintrin.h>  // AVX intrinsics
using std::min;

// Naive matrix multiplication - O(N³) cache misses!
void matmul_naive(float* A, float* B, float* C, int N) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0;
            for (int k = 0; k < N; k++) {
                sum += A[i*N + k] * B[k*N + j];  // B access is column-wise!
            }                                     // Cache miss every iteration
            C[i*N + j] = sum;
        }
    }
}
// For N=1024: ~1 billion cache misses

// Cache-blocked matrix multiplication - O(N³/BLOCK) cache misses
const int BLOCK = 64;  // Tiles sized to stay resident in L1/L2 cache

void matmul_blocked(float* A, float* B, float* C, int N) {
    // Initialize C to zero
    memset(C, 0, N * N * sizeof(float));
    
    // Iterate over tiles
    for (int i0 = 0; i0 < N; i0 += BLOCK) {
        for (int j0 = 0; j0 < N; j0 += BLOCK) {
            for (int k0 = 0; k0 < N; k0 += BLOCK) {
                
                // Inner loops work on BLOCK×BLOCK tiles
                // These tiles fit in L1 cache!
                for (int i = i0; i < min(i0+BLOCK, N); i++) {
                    for (int j = j0; j < min(j0+BLOCK, N); j++) {
                        float sum = C[i*N + j];
                        for (int k = k0; k < min(k0+BLOCK, N); k++) {
                            sum += A[i*N + k] * B[k*N + j];
                        }
                        C[i*N + j] = sum;
                    }
                }
            }
        }
    }
}
// For N=1024, BLOCK=64: ~16 million cache misses (64× fewer!)

// Even better: blocked + SIMD + loop unrolling
void matmul_blocked_avx(float* A, float* B, float* C, int N) {
    memset(C, 0, N * N * sizeof(float));
    
    for (int i0 = 0; i0 < N; i0 += BLOCK) {
        for (int j0 = 0; j0 < N; j0 += BLOCK) {
            for (int k0 = 0; k0 < N; k0 += BLOCK) {
                for (int i = i0; i < i0+BLOCK; i++) {
                    for (int k = k0; k < k0+BLOCK; k++) {
                        __m256 va = _mm256_set1_ps(A[i*N + k]);
                        for (int j = j0; j < j0+BLOCK; j += 8) {
                            __m256 vb = _mm256_loadu_ps(&B[k*N + j]);
                            __m256 vc = _mm256_loadu_ps(&C[i*N + j]);
                            vc = _mm256_fmadd_ps(va, vb, vc);
                            _mm256_storeu_ps(&C[i*N + j], vc);
                        }
                    }
                }
            }
        }
    }
}
// Additional ~4× from SIMD → ~40× total speedup vs naive

  • Naive GEMM: ~5 GFLOPS (N=1024)
  • Blocked: ~50 GFLOPS (~10× faster)
  • Blocked + AVX: ~200 GFLOPS (~40× faster)
  • OpenBLAS: ~400 GFLOPS (near optimal)
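The tile-size bound and the miss-count claims above follow from two short calculations (cache and matrix sizes as in the example):

```python
import math

# Largest B such that three B×B float32 tiles fit in a 32 KB L1 cache
L1_BYTES = 32 * 1024
b_max = math.isqrt(L1_BYTES // (3 * 4))
print(f"B ≤ {b_max} → use 32, or 64 if targeting L2")

# Cache-line-level traffic model for N×N matmul: naive misses on every
# B[k][j] access; blocking reuses each loaded tile element B times
N, B = 1024, 64
naive_misses = N**3
blocked_misses = N**3 // B
print(f"~{naive_misses/1e9:.1f}B misses → ~{blocked_misses/1e6:.0f}M ({B}× fewer)")
```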

Software Prefetching

Prefetching loads data into cache before it's needed, hiding memory latency. Modern CPUs do hardware prefetching for sequential access patterns, but software prefetching helps with irregular patterns:

C++ prefetching.cpp
#include <xmmintrin.h>  // For _mm_prefetch

// Prefetch hints:
// _MM_HINT_T0 - Prefetch to all cache levels (L1, L2, L3)
// _MM_HINT_T1 - Prefetch to L2 and L3
// _MM_HINT_T2 - Prefetch to L3 only
// _MM_HINT_NTA - Non-temporal, bypass cache

// Without prefetch - stalls waiting for memory
float indirect_access_naive(float* data, int* indices, int n) {
    float sum = 0;
    for (int i = 0; i < n; i++) {
        sum += data[indices[i]];  // Cache miss → 200 cycle stall
    }
    return sum;
}

// With prefetch - overlap memory access with computation
float indirect_access_prefetch(float* data, int* indices, int n) {
    const int PREFETCH_DISTANCE = 16;  // Tune based on latency
    float sum = 0;
    
    for (int i = 0; i < n; i++) {
        // Prefetch data for a future iteration
        if (i + PREFETCH_DISTANCE < n) {
            _mm_prefetch((char*)&data[indices[i + PREFETCH_DISTANCE]], _MM_HINT_T0);
        }
        
        // Process current element (hopefully in cache by now)
        sum += data[indices[i]];
    }
    return sum;
}

// Prefetch distance calculation:
// Distance = Memory_Latency / Time_Per_Iteration
// Example: 200 cycles latency / 10 cycles per iteration = 20 iterations ahead

// Linked list traversal with prefetch
struct Node { Node* next; int data; };

int sum_list_prefetch(Node* head) {
    int sum = 0;
    Node* curr = head;
    Node* prefetch_ptr = head;
    
    // Advance prefetch pointer ahead
    for (int i = 0; i < 8 && prefetch_ptr; i++) {
        prefetch_ptr = prefetch_ptr->next;
    }
    
    while (curr) {
        // Prefetch 8 nodes ahead
        if (prefetch_ptr) {
            _mm_prefetch((char*)prefetch_ptr, _MM_HINT_T0);
            prefetch_ptr = prefetch_ptr->next;
        }
        
        sum += curr->data;
        curr = curr->next;
    }
    return sum;
}
Prefetch Pitfalls

Too early: Data evicted before use. Too late: Still waiting for memory. Too many: Pollutes cache, evicting useful data. Start with prefetch distance = memory_latency / loop_time, then tune with profiling. Modern CPUs have excellent hardware prefetchers—software prefetching mainly helps irregular access patterns.
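The starting-point formula from the note above, as a sketch:

```python
def prefetch_distance(mem_latency_cycles, cycles_per_iteration):
    """Iterations to prefetch ahead so data arrives just in time."""
    return max(1, round(mem_latency_cycles / cycles_per_iteration))

print(prefetch_distance(200, 10))  # 20 iterations ahead
print(prefetch_distance(200, 25))  # 8 — a heavier loop body needs less lead
```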

Memory Alignment

Misaligned memory access can cause performance penalties or require multiple cache line fetches:

C++ alignment.cpp
#include <atomic>       // std::atomic
#include <cstdlib>      // posix_memalign
#include <immintrin.h>  // _mm256_load_ps
#include <new>          // std::align_val_t

// Allocate aligned memory for SIMD operations

// C++11 aligned allocation
alignas(32) float data[1024];  // 32-byte aligned for AVX

// Dynamic allocation (C++17)
float* ptr = new (std::align_val_t(64)) float[1024];  // 64-byte for cache line

// POSIX aligned allocation
void* raw = nullptr;
posix_memalign(&raw, 64, 1024 * sizeof(float));

// Aligned SIMD loads are faster than unaligned
void process_aligned(float* data, int n) {
    // Assumes data is 32-byte aligned
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(&data[i]);   // Faster, requires alignment
        // vs _mm256_loadu_ps for unaligned
    }
}

// Padding to avoid false sharing in multithreaded code
struct alignas(64) PaddedCounter {
    std::atomic<int> count;
    char padding[60];  // Ensure each counter is on its own cache line
};

constexpr int NUM_THREADS = 8;        // Example thread count
PaddedCounter counters[NUM_THREADS];  // No false sharing between threads

5. CPU Compute Optimizations

For compute-bound CPU code, the goal is to maximize instruction throughput. This means using SIMD, increasing ILP, and keeping the pipeline full.

Vectorization Deep Dive

Modern compilers can auto-vectorize simple loops, but often need help:

C++ auto_vectorization.cpp
// Compile with: g++ -O3 -march=native -fopt-info-vec-optimized
#include <cmath>        // sqrt
#include <immintrin.h>  // AVX intrinsics

// ✅ Auto-vectorizes: Simple loop, no dependencies
void saxpy(float* y, float a, float* x, int n) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

// ❌ Won't vectorize: Potential aliasing
void copy_overlap(float* dst, float* src, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i];  // dst and src might overlap!
    }
}

// ✅ Fix: Use restrict to promise no aliasing
void copy_restrict(float* __restrict__ dst, 
                   float* __restrict__ src, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i];  // Compiler knows they don't overlap
    }
}

// ❌ Won't vectorize: Function call in loop
void process_external(float* data, int n) {
    for (int i = 0; i < n; i++) {
        data[i] = external_func(data[i]);  // Can't inline
    }
}

// ❌ Won't vectorize: Data-dependent branch
void process_conditional(float* data, int n) {
    for (int i = 0; i < n; i++) {
        if (data[i] > 0) {
            data[i] = sqrt(data[i]);
        }
    }
}

// ✅ Fix: Use masked operations
void process_masked(float* data, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(&data[i]);
        __m256 zero = _mm256_setzero_ps();
        __m256 mask = _mm256_cmp_ps(v, zero, _CMP_GT_OQ);
        __m256 sqrtv = _mm256_sqrt_ps(v);
        __m256 result = _mm256_blendv_ps(v, sqrtv, mask);
        _mm256_storeu_ps(&data[i], result);
    }
}

// Pragma hints (compiler-specific)
void process_pragma(float* data, int n, int offset) {
    #pragma omp simd  // OpenMP SIMD directive
    for (int i = 0; i < n; i++) {
        data[i] = data[i] * data[i];
    }
    
    #pragma GCC ivdep  // GCC: ignore vector dependencies
    for (int i = 0; i < n; i++) {
        data[i] = data[i - offset];
    }
}
Vectorization Blocker     | Fix
Pointer aliasing          | __restrict__ keyword
Non-unit stride           | Restructure data layout (SoA)
Data-dependent branches   | Masked operations, branchless code
Loop-carried dependencies | Restructure algorithm, reduction clauses
Function calls            | Inline, SIMD math libraries (SVML)
Unknown trip count        | Use #pragma omp simd

Loop Unrolling & ILP

Loop unrolling reduces loop overhead and exposes Instruction-Level Parallelism (ILP)—independent operations that can execute in parallel on superscalar CPUs:

[Figure] Loop Unrolling and ILP. Original loop (sum += data[i]): each add waits for the previous one — latency N × 4 cycles. Unrolled with four accumulators (sum0..sum3): all four adds execute in parallel — latency N/4 × 4 cycles, 4× faster. Why multiple accumulators work: FP add latency is 4 cycles and throughput is 2/cycle, so unroll factor = latency × throughput = 8; eight independent streams keep both FP add units busy for maximum throughput.
Multiple accumulators break the dependency chain, enabling full pipeline utilization
C++ loop_unrolling.cpp
#include <immintrin.h>  // AVX intrinsics

// Single accumulator - limited by add latency (4 cycles)
float sum_naive(float* data, int n) {
    float sum = 0;
    for (int i = 0; i < n; i++) {
        sum += data[i];  // Each add depends on previous
    }                    // Throughput: 0.25 adds/cycle
    return sum;
}

// Multiple accumulators - breaks dependency chain
float sum_unrolled(float* data, int n) {
    float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    float sum4 = 0, sum5 = 0, sum6 = 0, sum7 = 0;
    
    int i = 0;
    for (; i < n - 7; i += 8) {
        sum0 += data[i];      // All 8 additions are independent
        sum1 += data[i + 1];  // CPU can issue 2/cycle
        sum2 += data[i + 2];  // Throughput: 2 adds/cycle
        sum3 += data[i + 3];
        sum4 += data[i + 4];
        sum5 += data[i + 5];
        sum6 += data[i + 6];
        sum7 += data[i + 7];
    }
    
    // Handle remainder
    for (; i < n; i++) {
        sum0 += data[i];
    }
    
    // Combine accumulators
    return (sum0 + sum1) + (sum2 + sum3) + (sum4 + sum5) + (sum6 + sum7);
}

// SIMD unrolled - best of both worlds
float sum_simd_unrolled(float* data, int n) {
    __m256 vsum0 = _mm256_setzero_ps();
    __m256 vsum1 = _mm256_setzero_ps();
    __m256 vsum2 = _mm256_setzero_ps();
    __m256 vsum3 = _mm256_setzero_ps();
    
    int i = 0;
    for (; i < n - 31; i += 32) {
        vsum0 = _mm256_add_ps(vsum0, _mm256_loadu_ps(&data[i]));
        vsum1 = _mm256_add_ps(vsum1, _mm256_loadu_ps(&data[i + 8]));
        vsum2 = _mm256_add_ps(vsum2, _mm256_loadu_ps(&data[i + 16]));
        vsum3 = _mm256_add_ps(vsum3, _mm256_loadu_ps(&data[i + 24]));
    }
    
    // Combine vectors
    vsum0 = _mm256_add_ps(_mm256_add_ps(vsum0, vsum1),
                          _mm256_add_ps(vsum2, vsum3));
    
    // Horizontal sum (reduce 8 floats to 1)
    __m128 lo = _mm256_castps256_ps128(vsum0);
    __m128 hi = _mm256_extractf128_ps(vsum0, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    
    float sum = _mm_cvtss_f32(lo);
    
    // Handle remainder
    for (; i < n; i++) {
        sum += data[i];
    }
    
    return sum;
}

// Performance comparison (N = 1M floats):
// sum_naive:         ~1 GB/s  (memory bound, but add latency limited)
// sum_unrolled:      ~8 GB/s  (near bandwidth limit)
// sum_simd_unrolled: ~40 GB/s (exceeds DRAM, benefits from cache)
Unroll Factor Formula

Optimal unroll factor = latency × throughput. For FP add: 4 cycles latency × 2/cycle throughput = 8 accumulators. For FMA: 4 cycles × 2/cycle = 8. This ensures enough independent operations to keep all execution units busy.
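The formula and its effect on a reduction can be sketched directly (cycle figures as in the note above):

```python
def unroll_factor(latency_cycles, throughput_per_cycle):
    """Independent chains needed to hide latency: latency × throughput."""
    return latency_cycles * throughput_per_cycle

def reduction_cycles(n, latency, throughput, chains):
    """Approximate cycles to sum n elements with `chains` accumulators."""
    latency_bound = (n / chains) * latency   # each chain is serial
    issue_bound = n / throughput             # can't beat the issue rate
    return max(latency_bound, issue_bound)

k = unroll_factor(4, 2)                      # FP add: 4-cycle latency, 2/cycle
print(k)                                     # 8
print(reduction_cycles(1024, 4, 2, 1))       # 4096.0 — latency-bound
print(reduction_cycles(1024, 4, 2, k))       # 512.0 — throughput-bound, 8× faster
```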

Fused Multiply-Add (FMA)

FMA computes a * b + c in a single instruction with higher precision (only one rounding) and twice the throughput:

C++ fma_examples.cpp
#include <immintrin.h>

// Without FMA: 2 operations
float dot_naive(float* a, float* b, int n) {
    float sum = 0;
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];  // mul + add = 2 ops
    }
    return sum;
}

// With FMA: 1 operation, 2 FLOPs
float dot_fma(float* a, float* b, int n) {
    __m256 vsum0 = _mm256_setzero_ps();
    __m256 vsum1 = _mm256_setzero_ps();
    
    for (int i = 0; i < n; i += 16) {
        __m256 va0 = _mm256_loadu_ps(&a[i]);
        __m256 vb0 = _mm256_loadu_ps(&b[i]);
        __m256 va1 = _mm256_loadu_ps(&a[i + 8]);
        __m256 vb1 = _mm256_loadu_ps(&b[i + 8]);
        
        // FMA: vsum = va * vb + vsum
        vsum0 = _mm256_fmadd_ps(va0, vb0, vsum0);
        vsum1 = _mm256_fmadd_ps(va1, vb1, vsum1);
    }
    
    vsum0 = _mm256_add_ps(vsum0, vsum1);
    
    // Horizontal sum...
    __m128 lo = _mm256_castps256_ps128(vsum0);
    __m128 hi = _mm256_extractf128_ps(vsum0, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}

// FMA variants:
// _mm256_fmadd_ps(a, b, c)  → a*b + c
// _mm256_fmsub_ps(a, b, c)  → a*b - c
// _mm256_fnmadd_ps(a, b, c) → -(a*b) + c = c - a*b
// _mm256_fnmsub_ps(a, b, c) → -(a*b) - c

// Enable FMA in compiler for auto-vectorization:
// gcc/clang: -mfma or -march=haswell (or newer)
// MSVC: /arch:AVX2

Branchless Programming

Replacing branches with arithmetic operations eliminates branch misprediction penalties:

C++ branchless.cpp
#include <math.h>  // fminf/fmaxf for the float clamp below

// Branchy: Branch misprediction on random data
int abs_branchy(int x) {
    if (x < 0) return -x;
    return x;
}

// Branchless: No prediction needed
int abs_branchless(int x) {
    int mask = x >> 31;  // All 1s if negative, all 0s if positive
    return (x ^ mask) - mask;
}

// Min/Max
int min_branchy(int a, int b) {
    return (a < b) ? a : b;  // May branch or use CMOV
}

int min_branchless(int a, int b) {
    int diff = a - b;
    int mask = diff >> 31;  // mask = -1 if a < b, else 0
    return b + (diff & mask);
}

// Clamp
float clamp_branchy(float x, float lo, float hi) {
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

float clamp_branchless(float x, float lo, float hi) {
    return fmaxf(fminf(x, hi), lo);  // Compiles to minss/maxss, no branch
}

// Conditional increment
int count = 0;

// Branchy
for (int i = 0; i < n; i++) {
    if (data[i] > threshold) count++;  // Mispredicts ~50% on random
}

// Branchless
for (int i = 0; i < n; i++) {
    count += (data[i] > threshold);  // Bool converts to 0 or 1
}

// SIMD branchless comparison
int count_simd(float* data, float threshold, int n) {
    __m256 vthresh = _mm256_set1_ps(threshold);
    __m256i vcount = _mm256_setzero_si256();
    __m256i vone = _mm256_set1_epi32(1);
    
    for (int i = 0; i < n; i += 8) {  // assumes n % 8 == 0
        __m256 v = _mm256_loadu_ps(&data[i]);
        __m256 mask = _mm256_cmp_ps(v, vthresh, _CMP_GT_OQ);
        __m256i imask = _mm256_castps_si256(mask);
        vcount = _mm256_sub_epi32(vcount, imask);  // -(-1) = +1
    }
    
    // Horizontal sum of 8 ints
    int result[8];
    _mm256_storeu_si256((__m256i*)result, vcount);
    return result[0] + result[1] + result[2] + result[3] +
           result[4] + result[5] + result[6] + result[7];
}

7. CPU Parallelization

Modern CPUs have multiple cores, and utilizing all of them is essential for achieving peak throughput. However, parallel programming introduces new challenges: synchronization overhead, load imbalance, and cache coherency issues.

CPU Parallelization: Scaling Challenges
[Figure: speedup vs. thread count (1-32) for ideal scaling, 90% parallel, 75% parallel, and scaling with overhead. Amdahl's Law: S = 1 / ((1 − P) + P/N), where P is the parallel fraction and N the number of threads.]
Parallel speedup is limited by serial portions and can degrade with excessive overhead

OpenMP Basics

OpenMP is the most common way to add parallelism to existing C/C++ code. It uses compiler directives (pragmas) to parallelize loops and regions.

Directive Purpose Use Case
#pragma omp parallel for Parallelize a for loop Independent iterations
#pragma omp parallel for reduction(+:sum) Parallel reduction Accumulating values
#pragma omp parallel sections Different code blocks in parallel Task parallelism
#pragma omp simd Request SIMD vectorization Vectorizable inner loops
#pragma omp task Create asynchronous task Dynamic workloads
C++ openmp_basics.cpp
// Basic parallel for - each iteration runs on different thread
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    result[i] = expensive_computation(data[i]);
}

// Reduction - thread-safe accumulation
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
    sum += data[i];  // Each thread has private sum, combined at end
}

// Nested parallelism with collapse
#pragma omp parallel for collapse(2)
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        matrix[i][j] = compute(i, j);
    }
}

// Schedule types for load balancing
#pragma omp parallel for schedule(static)   // Equal chunks, best for uniform work
#pragma omp parallel for schedule(dynamic)  // Grab work as available, best for variable work
#pragma omp parallel for schedule(guided)   // Decreasing chunk sizes, good balance
Schedule Selection

static: Best for uniform workloads (each iteration takes same time). Lowest overhead.
dynamic: Best for irregular workloads (some iterations much longer). Higher overhead but better load balance.
guided: Compromise - starts with large chunks, shrinks over time.

False Sharing

False sharing is the silent killer of parallel performance. It occurs when threads on different cores write to different variables that happen to share the same cache line (typically 64 bytes).

False Sharing: The Hidden Performance Killer
[Figure: Core 0 writes counter[0] while Core 1 writes counter[1]; both live on one shared 64-byte cache line, so the line ping-pongs between the cores' L1 caches at ~100 cycles per invalidation. Different variables, same cache line: can be 10-100× slower.]
Even though threads access different variables, they fight over the same cache line
C++ false_sharing.cpp
// BAD: False sharing - counters share cache line
struct BadCounters {
    int64_t counter[8];  // 64 bytes total, fits in ONE cache line!
};

BadCounters bad;
#pragma omp parallel for
for (int t = 0; t < 8; t++) {
    for (int i = 0; i < 1000000; i++) {
        bad.counter[t]++;  // All threads thrash the same cache line!
    }
}
// Time: ~500ms (horrible!)

// GOOD: Pad to separate cache lines
struct PaddedCounter {
    int64_t counter;
    char padding[56];  // Pad to 64 bytes (cache line size)
};

struct GoodCounters {
    PaddedCounter counter[8];  // Each counter on its own cache line
};

GoodCounters good;
#pragma omp parallel for
for (int t = 0; t < 8; t++) {
    for (int i = 0; i < 1000000; i++) {
        good.counter[t].counter++;  // No contention!
    }
}
// Time: ~5ms (100× faster!)

// C++17: Use alignas for cleaner padding
struct alignas(64) CacheLineCounter {
    int64_t value;
};
Detecting False Sharing

Use perf c2c on Linux to detect cache-line contention. Look for high "HITM" (Hit Modified) counts. Intel VTune also reports false sharing directly under "Memory Access" analysis.

NUMA Awareness

NUMA (Non-Uniform Memory Access) architectures have multiple memory controllers. Memory access is fast when accessing local memory but slow when crossing NUMA boundaries.

NUMA Architecture: Local vs Remote Memory
[Figure: two NUMA nodes, each with four cores and local memory (~80 ns latency, 100+ GB/s bandwidth), connected by a QPI/UPI link (~150 ns, ~50 GB/s); remote access is roughly 2× slower than local.]
Accessing remote NUMA memory can halve bandwidth and double latency
C++ numa_aware.cpp
#include <numa.h>   // libnuma; link with -lnuma
#include <cstdio>

// Check NUMA topology
int num_nodes = numa_num_configured_nodes();
printf("NUMA nodes: %d\n", num_nodes);

// Allocate memory on specific NUMA node
void* local_mem = numa_alloc_onnode(size, 0);  // Node 0
void* interleaved = numa_alloc_interleaved(size); // Spread across nodes

// Bind thread to NUMA node
numa_run_on_node(0);  // Run this thread on node 0

// First-touch policy: memory allocated where first accessed
float* data = (float*)malloc(N * sizeof(float));

// Initialize in parallel - each thread touches its portion
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    data[i] = 0.0f;  // Memory pages allocated on thread's local node
}

// Later computation accesses local memory
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    data[i] = compute(data[i]);  // Fast local access!
}

Thread Pools & Task Scheduling

Creating threads is expensive (~10-100μs). For fine-grained parallelism, use thread pools that reuse worker threads.

C++ thread_pool.cpp
#include <vector>
#include <queue>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <functional>

class ThreadPool {
private:
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex queue_mutex;
    std::condition_variable condition;
    bool stop = false;

public:
    ThreadPool(size_t num_threads) {
        for (size_t i = 0; i < num_threads; i++) {
            workers.emplace_back([this] {
                while (true) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(queue_mutex);
                        condition.wait(lock, [this] {
                            return stop || !tasks.empty();
                        });
                        if (stop && tasks.empty()) return;
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();  // Execute task
                }
            });
        }
    }

    template<class F>
    void enqueue(F&& f) {
        {
            std::unique_lock<std::mutex> lock(queue_mutex);
            tasks.emplace(std::forward<F>(f));
        }
        condition.notify_one();
    }

    ~ThreadPool() {
        { std::unique_lock<std::mutex> lock(queue_mutex); stop = true; }
        condition.notify_all();
        for (auto& w : workers) w.join();
    }
};

// Usage
ThreadPool pool(std::thread::hardware_concurrency());

for (int i = 0; i < 10000; i++) {
    pool.enqueue([i] {
        process_item(i);  // No thread creation overhead!
    });
}
Parallelization Technique Helps Memory-Bound? Helps Compute-Bound? Best For
OpenMP parallel for Sometimes (more bandwidth) Yes (more cores) Data parallelism
False sharing fix Yes (cache efficiency) Yes (reduces stalls) Shared counters/accumulators
NUMA awareness Yes (local bandwidth) Somewhat Large memory workloads
Thread pools No Yes (reduces overhead) Many small tasks

8. GPU Architecture Fundamentals

GPUs are fundamentally different from CPUs. While CPUs optimize for latency (fast single-thread execution), GPUs optimize for throughput (maximum total work per second) by running thousands of threads simultaneously.

CPU vs GPU Architecture Philosophy
[Figure: a latency-optimized CPU has a few complex out-of-order cores with branch prediction, big caches and 4-5 GHz clocks, a large shared L3 (32+ MB), and ~100 GB/s serving 4-16 threads; a throughput-optimized GPU has 108 SMs × 64 cores = 6,912 cores, a 40 MB L2, and 2+ TB/s of HBM3 bandwidth feeding 10,000+ threads.]
CPUs: few fast cores. GPUs: many slow cores. Different problems, different solutions.

Streaming Multiprocessors (SMs)

A GPU is composed of many Streaming Multiprocessors (SMs). Each SM is like a mini processor with its own warp schedulers, execution units (FP32/INT32, FP64, Tensor Cores, load/store), register file, and shared memory/L1 cache:

Streaming Multiprocessor (SM) Architecture (A100)
[Figure: an A100 SM contains four warp schedulers with dispatch units, 64 FP32/INT32 cores, 4 Tensor Cores, 32 FP64 units, 32 load/store units, a 256 KB register file (65,536 × 32-bit registers), 192 KB of shared memory/L1, and texture units.]
Each SM can execute 2048 threads (64 warps), switching between them to hide latency
GPU SMs FP32 Cores Tensor Cores Shared Mem/SM HBM BW
RTX 3090 82 10,496 328 128 KB 936 GB/s
A100 108 6,912 432 192 KB 2,039 GB/s
H100 132 16,896 528 228 KB 3,350 GB/s
RTX 4090 128 16,384 512 128 KB 1,008 GB/s

Warps & SIMT Execution

GPU threads are grouped into warps of 32 threads. All threads in a warp execute the same instruction at the same time (SIMT = Single Instruction, Multiple Threads). This is both the source of GPU power and the cause of performance pitfalls.

Warp Execution: SIMT Model
[Figure: one warp is 32 threads executing in lockstep. With uniform control flow, all 32 lanes stay active at full throughput. With divergence (e.g. branching on threadIdx.x % 2), the two paths run serially, each with half the lanes masked, roughly doubling execution time before the warp reconverges.]
Divergent branches within a warp execute sequentially, halving or worse throughput
CUDA warp_divergence.cu
// BAD: Heavy divergence within warp
__global__ void divergent_kernel(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    
    // Each thread in warp may take different path!
    if (data[idx] > 0.5f) {
        data[idx] = expensive_pathA(data[idx]);  // Some threads
    } else {
        data[idx] = expensive_pathB(data[idx]);  // Other threads
    }
}

// BETTER: Sort data so adjacent threads take same path
// Or restructure to minimize branch impact

// GOOD: Branch on warp-uniform condition
__global__ void uniform_kernel(float* data, int n, bool usePathA) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    
    // All threads in warp take same path (uniform)
    if (usePathA) {
        data[idx] = expensive_pathA(data[idx]);
    } else {
        data[idx] = expensive_pathB(data[idx]);
    }
}

GPU Memory Hierarchy

GPUs have a complex memory hierarchy. Using the right level is critical for performance:

GPU Memory Hierarchy
[Figure: the GPU memory pyramid, from registers (~0 cycles, ~20 TB/s, thread-private) through shared memory (~20-30 cycles, block-shared) and L1/texture cache (~30-80 cycles), to L2 (~200 cycles, ~4 TB/s) and global HBM (~300-500 cycles, 2-3 TB/s): smaller and faster at the top, larger and slower at the bottom.]
Optimal GPU code keeps data in faster memory tiers as long as possible
Memory Type Scope Latency Bandwidth Size (A100) Use Case
Registers Thread 0 cycles ~20 TB/s 256 KB/SM Local variables
Shared Memory Block ~25 cycles ~19 TB/s 164 KB/SM Inter-thread communication
L1 Cache SM ~30 cycles ~12 TB/s 192 KB/SM Automatic caching
L2 Cache Device ~200 cycles ~4 TB/s 40 MB Cross-SM sharing
Global (HBM) Device ~400 cycles 2 TB/s 80 GB Large datasets
Host (PCIe) System ~10,000 cycles 32 GB/s System RAM CPU↔GPU transfer

Occupancy & Latency Hiding

Occupancy is the ratio of active warps to the maximum possible warps per SM. High occupancy helps hide memory latency by allowing the SM to switch to other warps while waiting for memory.

Latency Hiding Through Occupancy
[Figure: at 25% occupancy (16 warps) the SM sits idle 60%+ of the time waiting on memory; at 100% (64 warps) ready warps overlap the stalls and the SM stays busy. Occupancy is limited by registers per thread, shared memory per block, and block size (which should be a multiple of the 32-thread warp).]
High occupancy keeps the SM busy while warps wait for memory
CUDA occupancy_analysis.cu
// Query theoretical occupancy
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float* data) {
    // Kernel code here
}

void analyzeOccupancy() {
    int blockSize = 256;
    int minGridSize;
    int maxBlockSize;
    
    // Let CUDA suggest optimal block size
    cudaOccupancyMaxPotentialBlockSize(
        &minGridSize,
        &maxBlockSize,
        myKernel,
        0,     // Dynamic shared mem
        0      // No block size limit
    );
    printf("Suggested block size: %d\n", maxBlockSize);
    
    // Calculate occupancy for chosen block size
    int numBlocks;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks,
        myKernel,
        blockSize,
        0  // Shared mem
    );
    
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    
    int maxWarps = prop.maxThreadsPerMultiProcessor / 32;
    int activeWarps = numBlocks * (blockSize / 32);
    float occupancy = (float)activeWarps / maxWarps;
    
    printf("Occupancy: %.1f%% (%d/%d warps)\n",
           occupancy * 100, activeWarps, maxWarps);
}

// Rule of thumb:
//   - 50%+ occupancy is usually good
//   - Below 25% often indicates a problem
//   - 100% not always needed (depends on memory access)
Occupancy vs Performance

Higher occupancy doesn't always mean better performance! A kernel with 50% occupancy but perfect memory coalescing can outperform 100% occupancy with scattered access. Use nsys and ncu to measure actual performance, not just occupancy.

9. GPU Memory Optimizations

GPU memory bandwidth is massive (2+ TB/s on modern GPUs), but achieving that peak requires careful attention to memory access patterns. Poor patterns can reduce effective bandwidth by 10-32×.

Memory Coalescing

Memory coalescing is the most important GPU memory optimization. When threads in a warp access consecutive memory addresses, the hardware combines them into a single, efficient transaction. When accesses are scattered, each thread requires a separate transaction.

Memory Coalescing: The Key to GPU Bandwidth
[Figure: with coalesced access (int idx = threadIdx.x) the 32 threads of a warp hit one 128-byte aligned segment in a single transaction at full bandwidth (~2 TB/s); with strided access (int idx = threadIdx.x * STRIDE) the warp needs 32 transactions for the same 128 useful bytes, ~3% efficiency (~60 GB/s). Coalescing rules: adjacent threads to adjacent addresses, unit stride preferred, 32/128-byte alignment helps.]
Coalesced access achieves 30-50× higher effective bandwidth than scattered access
CUDA coalescing_patterns.cu
// BAD: Column-major access in row-major storage (strided)
__global__ void bad_transpose(float* out, float* in, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Adjacent threads read from different rows (stride = N)
    out[col * N + row] = in[row * N + col];  // Strided write!
}

// GOOD: Use shared memory to fix coalescing
__global__ void good_transpose(float* out, float* in, int N) {
    __shared__ float tile[32][33];  // +1 to avoid bank conflicts
    
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    
    // Coalesced read from global memory
    tile[threadIdx.y][threadIdx.x] = in[y * N + x];
    
    __syncthreads();
    
    // Transpose indices for coalesced write
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    
    // Coalesced write to global memory
    out[y * N + x] = tile[threadIdx.x][threadIdx.y];
}

// AoS vs SoA on GPU - same principle as CPU!
struct ParticleAoS { float x, y, z, mass; };  // BAD for GPU

struct ParticlesSoA {  // GOOD for GPU
    float* x;
    float* y;
    float* z;
    float* mass;
};

Shared Memory Usage

Shared memory is on-chip SRAM shared by all threads in a block. It's ~100× faster than global memory and perfect for data that multiple threads reuse: tiles in matrix operations, staging buffers for transposes, and partial results in reductions.

Shared Memory: On-Chip Scratchpad
[Figure: each block on an SM gets its own private shared-memory partition; there is no cross-block sharing. Properties: 48-164 KB per SM, ~25-cycle latency (vs 400+ for global memory), 32 banks of 4 bytes each, configurable split with L1, block-scoped, and it consumes SM resources, so heavy use lowers occupancy.]
Each block gets its own private shared memory partition
CUDA shared_memory.cu
// Static shared memory allocation
__global__ void reduction_kernel(float* input, float* output, int N) {
    __shared__ float sdata[256];  // 1KB per block
    
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Load from global to shared (coalesced)
    sdata[tid] = (idx < N) ? input[idx] : 0.0f;
    __syncthreads();  // CRITICAL: wait for all loads
    
    // Parallel reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();  // Sync after each step
    }
    
    // Write result (only thread 0)
    if (tid == 0) output[blockIdx.x] = sdata[0];
}

// Dynamic shared memory (size set at launch)
__global__ void dynamic_smem_kernel(float* data) {
    extern __shared__ float smem[];  // Size unknown at compile time
    
    smem[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    // Use smem...
}

// Launch with dynamic shared memory
int smem_size = 256 * sizeof(float);
dynamic_smem_kernel<<<num_blocks, 256, smem_size>>>(data);
__syncthreads() is Critical!

Forgetting __syncthreads() after shared memory writes causes race conditions. Threads may read stale data. Always sync after writing to shared memory before reading!

Bank Conflicts

Shared memory is divided into 32 banks. If multiple threads in a warp access different addresses in the same bank, the accesses are serialized (bank conflict). This can reduce shared memory performance by up to 32×.

Shared Memory Bank Conflicts
[Figure: shared memory is 32 interleaved 4-byte banks; word index i maps to bank i % 32. When each thread of a warp hits a different bank, all 32 accesses complete in one cycle; two threads hitting different addresses in the same bank (smem[0] and smem[32]) serialize to 2 cycles, and 32 distinct addresses in one bank serialize to 32 cycles (reads of the very same address are instead broadcast). Common pitfall: column access through float smem[32][32] puts every thread in the same bank; padding to float smem[32][33] removes the conflict.]
Add +1 padding to array dimensions to avoid stride-32 bank conflicts

Texture & Constant Memory

GPUs have specialized memory types optimized for specific access patterns:

Memory Type Size Access Pattern Benefits Use Case
Constant Memory 64 KB All threads read same address Broadcast to warp (1 cycle) Kernel parameters, lookup tables
Texture Memory Cache-backed 2D spatial locality Free interpolation, clamping Image processing, irregular access

10. GPU Compute Optimizations

Once memory access is optimized, we can focus on maximizing computational throughput through kernel design optimizations.

Kernel Fusion

Kernel fusion combines multiple operations into a single kernel to reduce memory traffic. Each kernel launch requires reading inputs from global memory and writing outputs back—fusing eliminates intermediate reads/writes.

Kernel Fusion: Eliminating Memory Round-Trips
[Figure: running Y = X² and Z = Y + 1 as separate kernels writes the intermediate Y back to global memory and reads it again (3× traffic: read X, write Y, read Y, write Z); the fused kernel keeps Y in registers and touches global memory only to read X and write the final Z (1× traffic).]
Fusing kernels can provide 2-3× speedup by keeping intermediates in registers
CUDA kernel_fusion.cu
// BAD: Separate kernels for each operation
__global__ void square(float* y, float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * x[i];
}
__global__ void add_one(float* z, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = y[i] + 1.0f;
}
__global__ void relu(float* out, float* z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(z[i], 0.0f);
}

// Launch 3 kernels, 3x memory traffic
square<<<blocks, threads>>>(y, x, n);
add_one<<<blocks, threads>>>(z, y, n);
relu<<<blocks, threads>>>(out, z, n);

// GOOD: Single fused kernel
__global__ void fused_square_add_relu(float* out, float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float val = x[i];           // Read once
        val = val * val;            // Square (register)
        val = val + 1.0f;           // Add (register)
        out[i] = fmaxf(val, 0.0f); // ReLU + Write once
    }
}

// Single launch, 1x memory traffic, 2-3x faster!
fused_square_add_relu<<<blocks, threads>>>(out, x, n);
Fusion in Deep Learning

Modern frameworks like PyTorch 2.0 (torch.compile) and JAX automatically fuse operations. NVIDIA's cuDNN fuses common patterns like Conv+BN+ReLU. Custom Triton kernels give you explicit control over fusion.

Mixed Precision

Mixed precision uses lower-precision formats (FP16, BF16, INT8) for most computations while keeping critical values in FP32. This provides 2-16× higher arithmetic throughput on Tensor Cores, halved memory traffic and footprint, and, with care, accuracy close to full FP32.

Numerical Precision Formats for GPU Computing
Bit layouts: FP32 (32 bits) = 1 sign, 8 exponent, 23 mantissa. FP16 (16 bits) = 1 sign, 5 exponent, 10 mantissa. BF16 (16 bits, "brain float") = 1 sign, 8 exponent, 7 mantissa.
Format Range Precision Relative Speed
FP32 ±3.4×10³⁸ ~7 digits 1×
FP16 ±65,504 ~3 digits 2-8×
BF16 ±3.4×10³⁸ ~2 digits 2-8×
INT8 −128 to 127 integer 4-16×
BF16 combines FP32 range with FP16 speed, making it the best choice for training. Tensor Cores: 312 TFLOPS FP16 vs 19 TFLOPS FP32 on A100, ~16× faster for matrix operations.
BF16 is preferred for training (same range as FP32), FP16/INT8 for inference
Python (PyTorch) mixed_precision.py
import torch
from torch.cuda.amp import autocast, GradScaler

# Automatic Mixed Precision (AMP) training
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()  # Handles gradient scaling for FP16

for data, target in dataloader:
    optimizer.zero_grad()
    
    # Forward pass in FP16 (where safe)
    with autocast(dtype=torch.float16):  # or torch.bfloat16
        output = model(data)
        loss = criterion(output, target)
    
    # Backward pass (scaled gradients)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# What autocast does automatically:
# - MatMuls, Convs → FP16 (Tensor Core accelerated)
# - Reductions, Softmax → FP32 (numerical stability)
# - Loss computation → FP32

# BF16 on Ampere+ (no scaler needed!)
with autocast(dtype=torch.bfloat16):
    output = model(data)  # Stable without gradient scaling

Thread Coarsening

Thread coarsening makes each thread do more work. Instead of one thread per element, have each thread process multiple elements. This reduces launch overhead and can improve instruction-level parallelism.

CUDA thread_coarsening.cu
// Fine-grained: 1 thread per element
__global__ void fine_kernel(float* out, float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = compute(in[i]);
    }
}
// Launch: <<<(n + 255) / 256, 256>>>

// Coarsened: 1 thread per 4 elements
#define COARSEN 4
__global__ void coarse_kernel(float* out, float* in, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * COARSEN;
    
    // Process 4 elements per thread
    #pragma unroll
    for (int j = 0; j < COARSEN; j++) {
        int i = base + j;
        if (i < n) {
            out[i] = compute(in[i]);
        }
    }
}
// Launch: <<<(n + 256 * COARSEN - 1) / (256 * COARSEN), 256>>> (4× fewer blocks!)

// Benefits of coarsening:
//   - Fewer blocks → less launch overhead
//   - Unrolled loop → better ILP
//   - Amortize index calculation
//   - Can enable register reuse

// When to coarsen:
//   - Simple kernels with low occupancy limiters
//   - Compute-bound kernels
//   - When block count is very high

Reducing Warp Divergence

Warp divergence (covered in Section 8) can be mitigated through several techniques:

CUDA divergence_mitigation.cu
// BAD: Thread-dependent branching causes divergence
__global__ void divergent(float* out, float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (in[i] > 0) {
            out[i] = expensive_func_A(in[i]);  // Some threads
        } else {
            out[i] = expensive_func_B(in[i]);  // Other threads
        }
    }
}

// BETTER: Predication for simple operations
__global__ void predicated(float* out, float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float val = in[i];
        // Both paths computed, result selected (no branch)
        out[i] = (val > 0) ? val * val : -val;
    }
}

// BETTER: Sort data to cluster similar values
// If in[0..999] > 0 and in[1000..1999] <= 0,
// different warps take different paths (no intra-warp divergence)

// BETTER: Warp-level voting for uniform branches
__global__ void warp_vote(float* out, float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        bool cond = in[i] > 0;
        
        // Check if ALL threads in warp agree
        if (__all_sync(0xffffffff, cond)) {
            // Uniform path - no divergence
            out[i] = expensive_func_A(in[i]);
        } else if (__all_sync(0xffffffff, !cond)) {
            out[i] = expensive_func_B(in[i]);
        } else {
            // Mixed - must diverge, use simpler fallback
            out[i] = simple_func(in[i]);
        }
    }
}
GPU Optimization Helps Memory-Bound? Helps Compute-Bound? When to Apply
Memory Coalescing Yes (critical!) Somewhat Always check first
Shared Memory Yes Yes (reuse) Data reuse within block
Bank Conflict Fix Yes No When using shared memory
Kernel Fusion Yes Somewhat Multiple consecutive ops
Mixed Precision Yes (2×) Yes (2-16×) Almost always (DL)
Thread Coarsening No Yes Simple, high-occupancy kernels
Reduce Divergence No Yes Data-dependent branches

11. GPU Execution Optimizations

Beyond kernel-level optimizations, how you launch and orchestrate GPU work significantly impacts performance. CUDA provides mechanisms to overlap computation with data transfer and reduce launch overhead.

CUDA Streams

A CUDA stream is a sequence of operations that execute in order on the GPU. Operations in different streams can execute concurrently, enabling overlap of host-device transfers with kernel execution, and of independent kernels with one another.

CUDA Streams: Overlapping Compute and Transfer
[Figure: in the default stream, the H2D copy, kernel, and D2H copy for two batches run back-to-back in 9 time units; with two streams, copies overlap with compute and kernel A overlaps with kernel B, finishing in 5 units, ~1.8× faster.]
Multiple streams enable concurrent execution of independent operations
CUDA cuda_streams.cu
// Create streams
cudaStream_t stream[2];
for (int i = 0; i < 2; i++) {
    cudaStreamCreate(&stream[i]);
}

// Divide work into chunks for overlap
int chunk_size = N / 2;

for (int i = 0; i < 2; i++) {
    int offset = i * chunk_size;
    
    // Async H2D copy (non-blocking)
    cudaMemcpyAsync(d_in + offset, h_in + offset, 
                    chunk_size * sizeof(float),
                    cudaMemcpyHostToDevice, stream[i]);
    
    // Launch kernel in stream
    myKernel<<<blocks, threads, 0, stream[i]>>>(d_out + offset, d_in + offset, chunk_size);
    
    // Async D2H copy
    cudaMemcpyAsync(h_out + offset, d_out + offset,
                    chunk_size * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[i]);
}

// Wait for all streams
cudaDeviceSynchronize();

// Cleanup
for (int i = 0; i < 2; i++) {
    cudaStreamDestroy(stream[i]);
}

// IMPORTANT: Host memory must be pinned for async transfers!
float* h_pinned;
cudaMallocHost(&h_pinned, N * sizeof(float));  // Pinned memory
// Regular malloc() will cause cudaMemcpyAsync to be synchronous!
Pinned Memory Required!

Async memory transfers (cudaMemcpyAsync) only work with pinned (page-locked) host memory allocated with cudaMallocHost. Regular malloc memory forces synchronous transfers!

CUDA Graphs

CUDA Graphs capture a sequence of GPU operations (kernels, memory copies) into a graph that can be launched with a single API call. This dramatically reduces CPU overhead for repetitive workloads.

CUDA Graphs: Reducing Launch Overhead
[Figure: launching kernels one by one costs ~5-10 μs of CPU time each, so 1,000 kernels accumulate 5-10 ms of launch overhead; a CUDA graph launches the whole DAG (H2D, K1, K2, K3, D2H, with dependencies tracked automatically and independent nodes run in parallel) for roughly one launch's worth of overhead.]
CUDA Graphs are ideal for inference loops where the same operations repeat thousands of times
CUDA cuda_graphs.cu
// Method 1: Stream capture (easiest)
cudaGraph_t graph;
cudaGraphExec_t graphExec;

// Capture operations into a graph
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
{
    // These operations are recorded, not executed
    cudaMemcpyAsync(d_in, h_in, size, cudaMemcpyHostToDevice, stream);
    kernel1<<<grid, block, 0, stream>>>(d_tmp, d_in);
    kernel2<<<grid, block, 0, stream>>>(d_out, d_tmp);
    cudaMemcpyAsync(h_out, d_out, size, cudaMemcpyDeviceToHost, stream);
}
cudaStreamEndCapture(stream, &graph);

// Instantiate executable graph (one-time cost)
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// Execute graph many times (very fast!)
for (int iter = 0; iter < 10000; iter++) {
    cudaGraphLaunch(graphExec, stream);  // ~5μs per launch!
}

cudaGraphExecDestroy(graphExec);
cudaGraphDestroy(graph);

// PyTorch CUDA Graphs
// ===================
// g = torch.cuda.CUDAGraph()
// with torch.cuda.graph(g):
//     output = model(input)  # Captured; input must be a static tensor reused across replays
// 
// # Replay graph (fast!)
// for _ in range(1000):
//     g.replay()

Async Memory Operations

Modern GPUs (Ampere and newer) support asynchronous copies from global to shared memory that bypass the register file, letting threads keep computing while data is in flight.

CUDA async_copy.cu
// Async copy from global to shared memory (Ampere+)
#include <cooperative_groups.h>
#include <cuda/pipeline>

__global__ void async_copy_kernel(float* global_data) {
    __shared__ float smem[256];
    
    // Cooperative group for async operations
    auto block = cooperative_groups::this_thread_block();
    
    // Create pipeline for async copies
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 1> pipe_state;
    auto pipe = cuda::make_pipeline(block, &pipe_state);
    
    // Initiate async copy (returns immediately)
    pipe.producer_acquire();
    cuda::memcpy_async(
        smem + threadIdx.x,           // Destination (shared)
        global_data + threadIdx.x,    // Source (global)
        sizeof(float),
        pipe
    );
    pipe.producer_commit();
    
    // Do other work while copy is in flight!
    float local_compute = expensive_calculation();  // placeholder for independent work
    
    // Wait for copy to complete
    pipe.consumer_wait();
    
    // Now use smem data
    float result = smem[threadIdx.x] + local_compute;
}

12. Case Studies

Let's apply the optimization techniques we've learned to real-world problems.

Case Study 1: Matrix Multiplication

Matrix multiplication (GEMM) is the most important operation in deep learning. Let's trace the optimization journey from naive to highly optimized:

Matrix Multiplication Optimization Journey
[Figure: Optimization ladder on an A100: naive three nested loops (~100 GFLOPS) → shared-memory tiling (~1 TFLOPS) → coalescing and bank-conflict elimination (~4 TFLOPS) → float4 vectorization with double buffering (~10 TFLOPS) → Tensor Cores via WMMA/MMA (~150 TFLOPS). Key insight: cuBLAS already implements all of these optimizations and more; use cuBLAS/cuDNN for standard ops and write custom kernels only for novel operations.]
1500× improvement from naive to Tensor Core implementation
CUDA tiled_matmul.cu
// Tiled matrix multiplication with shared memory
#define TILE_SIZE 32

__global__ void matmul_tiled(float* C, float* A, float* B, int N) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE + 1];  // +1 avoids bank conflicts
    
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    
    float sum = 0.0f;
    
    // Loop over tiles (assumes N is a multiple of TILE_SIZE)
    for (int t = 0; t < N / TILE_SIZE; t++) {
        // Collaborative load: each thread loads one element
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        
        __syncthreads();
        
        // Compute partial dot product from shared memory
        #pragma unroll
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        
        __syncthreads();
    }
    
    C[row * N + col] = sum;
}

// Why this is faster:
// - Each element loaded once from global, reused TILE_SIZE times
// - Arithmetic intensity: O(TILE_SIZE) ops per byte loaded
// - Coalesced global loads, fast shared memory access

Case Study 2: Flash Attention

Flash Attention is a landmark optimization that makes Transformer attention memory-efficient by fusing operations and using tiled computation to avoid materializing the full N×N attention matrix.

Flash Attention: Tiled Attention Without Materializing Full Matrix
[Figure: Standard attention stores the full N×N score matrix in HBM (O(N²) memory; the figure cites 64GB of attention buffers at sequence length 4096) and makes three round trips to slow HBM. Flash Attention computes tiled QKᵀ → online softmax → ·V entirely in SRAM, never materializing the N×N matrix: O(N) memory, a single HBM round trip, 2-4× faster, and it fits any sequence length. Key techniques: kernel fusion (QKᵀ + softmax + ·V in one kernel, no intermediate HBM writes), online softmax (computed incrementally by tracking a running max and sum), and tiling (Q, K, V processed in blocks that fit in shared memory).]
Flash Attention achieves a 2-4× speedup because attention is memory-bound: reducing HBM traffic directly reduces runtime

Case Study 3: LLM Inference Optimization

Large Language Model inference faces unique challenges: the autoregressive decode phase generates one token at a time, making it heavily memory-bound.

LLM Inference: Prefill vs Decode Phases
[Figure: The prefill phase processes the entire prompt at once (effective batch of 1000+ tokens), has high arithmetic intensity, and is compute-bound. The decode phase generates one token at a time per sequence, loading the entire model for each token, and is memory-bound. Decode optimizations: KV cache (reuse K,V from past tokens), batching (process multiple sequences together), quantization (INT8/INT4 weights), and speculative decoding (draft model plus parallel verification).]
LLM decode is memory-bound: optimizations focus on reducing memory traffic
Optimization           | Technique                     | Speedup                | Trade-off
-----------------------|-------------------------------|------------------------|--------------------------------
KV Cache               | Cache attention keys/values   | ~N× (sequence length)  | Memory usage grows with context
Continuous Batching    | Dynamically batch requests    | 2-4× throughput        | Latency variance
INT8 Quantization      | 8-bit weights                 | 2× memory, ~1.8× speed | <1% accuracy loss
INT4/AWQ               | 4-bit weights                 | 4× memory, ~3× speed   | ~1-2% accuracy loss
PagedAttention (vLLM)  | Virtual memory for KV cache   | ~24× throughput        | Implementation complexity
Flash Attention        | Fused, tiled attention        | 2-4× attention speed   | None (pure win)

13. Conclusion

Performance optimization is fundamentally about understanding where your code spends time and what limits it. The key insights from this guide:

Performance Optimization Decision Framework
[Figure: Profile first, then classify the bottleneck. Memory-bound (arithmetic intensity below the ridge point): coalescing (GPU), cache blocking/tiling, SoA data layout, prefetching, kernel fusion, mixed precision (FP16), reduced data movement. Compute-bound (AI above the ridge point): SIMD/vectorization, Tensor Cores, loop unrolling/ILP, reduced divergence, algorithmic improvements, better parallelization, occupancy tuning. Always consider: is the algorithm optimal (O(n) vs O(n²))? Are you using the right libraries (cuBLAS, etc.)? Can work run asynchronously (streams, graphs)? Profile again after each change.]
Always profile first, identify the bottleneck, apply targeted optimizations, and profile again

Key Takeaways

The Performance Optimization Checklist
  1. Profile first — Never optimize blind. Use perf, VTune, nsys, ncu.
  2. Identify the bottleneck — Memory-bound or compute-bound? Use roofline model.
  3. Check your algorithm — No amount of micro-optimization makes O(n²) beat O(n log n) at scale.
  4. Use optimized libraries — cuBLAS, MKL, Flash Attention before custom code.
  5. Optimize data movement — Most code is memory-bound; reduce traffic.
  6. Match data layout to access — SoA for vectorization, coalescing for GPU.
  7. Exploit parallelism — SIMD, multi-threading, GPU thousands of threads.
  8. Fuse operations — Keep data in fast memory (registers, cache, shared).
  9. Use lower precision — FP16/BF16/INT8 when accuracy permits.
  10. Profile again — Verify improvement, watch for new bottlenecks.

Further Reading

Series Complete!

This comprehensive guide covered CPU and GPU optimization from fundamentals to advanced techniques. The journey from understanding bottlenecks to writing optimized code requires practice—start profiling your own code and apply these principles. Happy optimizing!