Blog & Tutorials

Technical deep-dives into HPC, GPU programming, distributed machine learning, and optimization algorithms.

GPU Programming · Mar 6, 2026 · 15 min

Understanding CUDA Memory Hierarchy: A Practical Guide

Deep dive into global, shared, and register memory with practical examples, diagrams, and performance benchmarks.

CUDA · Memory · Performance

Distributed ML · Mar 6, 2026 · 12 min

PyTorch DDP vs FSDP: When to Use Which

Comprehensive comparison of distributed training strategies with practical code examples and decision guide.

PyTorch · DDP · FSDP

Optimization · Mar 6, 2026 · 18 min

ADMM for Distributed Machine Learning: From Theory to Practice

Step-by-step implementation of ADMM-based distributed optimization with convergence guarantees.

ADMM · Convex · Distributed

HPC Infrastructure · Mar 6, 2026 · 25 min

AI Cluster Architectures: A Deep Dive

Understanding GPU cluster hierarchies, NVLink, InfiniBand, and network topologies for large-scale AI training.

NVLink · InfiniBand · Topology

HPC · Mar 6, 2026 · 20 min

Getting Started with Slurm: Submit Your First Multi-GPU Job

Practical guide to Slurm job scheduling with examples from CSC's Puhti and Mahti supercomputers.

Slurm · CSC · Job Scheduling

GPU Programming · Mar 6, 2026 · 25 min

Nsight Systems & Compute: Profiling Your CUDA Kernels

Learn to identify bottlenecks and optimize GPU kernels using NVIDIA's profiling tools.

Nsight · Profiling · Optimization

Inference · Mar 6, 2026 · 30 min

Efficient Inference: A Complete Guide to Pruning and Quantization

Deep dive into model compression: unstructured, structured, N:M sparsity, INT8/FP16 quantization, and hardware acceleration.

Pruning · Quantization · Tensor Cores

GPU Programming · Mar 6, 2026 · 30 min

Writing Custom CUDA Kernels with Triton

Master GPU programming with OpenAI Triton: from vector addition to fused attention kernels with Python-like syntax.

Triton · CUDA · Kernel Fusion

Performance Engineering · Mar 9, 2026 · 60 min

CPU & GPU Performance Optimization: A Comprehensive Guide

Master high-performance computing: from understanding the memory wall to implementing vectorization, kernel fusion, and cache-aware algorithms.

Performance · Optimization · CPU/GPU

Distributed Systems · Mar 9, 2026 · 75 min

Distributed Training: A Complete Guide from Collective Operations to 3D Parallelism

Master distributed deep learning: collective operations (ring, tree, hierarchical), α-β performance model, communication roofline, and parallelism strategies (DP, TP, PP, FSDP, MoE).

AllReduce · FSDP · 3D Parallelism

Distributed Systems · Mar 9, 2026 · 55 min

Communication-Efficient Distributed Training

Comprehensive guide to reducing communication overhead: gradient compression, quantization, sparsification, local SGD, and system-level optimizations.

Compression · Quantization · Local SGD

Performance Analysis · Mar 7, 2026 · 45 min

The Roofline Model: A Complete Guide to Performance Analysis

Master performance modeling: arithmetic intensity, peak compute/bandwidth calculation, BLAS and deep learning operation analysis.

Roofline · Performance · Memory Bandwidth

More Content Coming Soon!

Subscribe to get notified when new tutorials and deep dives are published, and check out the HPC4AI YouTube channel for video content.