Understanding CUDA Memory Hierarchy: A Practical Guide
Deep dive into global, shared, and register memory with practical examples, diagrams, and performance benchmarks.
PyTorch DDP vs FSDP: When to Use Which
Comprehensive comparison of distributed training strategies with practical code examples and a decision guide.
ADMM for Distributed Machine Learning: From Theory to Practice
Step-by-step implementation of ADMM-based distributed optimization with convergence guarantees.
AI Cluster Architectures: A Deep Dive
Understanding GPU cluster hierarchies, NVLink, InfiniBand, and network topologies for large-scale AI training.
Getting Started with Slurm: Submit Your First Multi-GPU Job
Practical guide to Slurm job scheduling with examples from CSC's Puhti and Mahti supercomputers.
Nsight Systems & Compute: Profiling Your CUDA Kernels
Learn to identify bottlenecks and optimize GPU kernels using NVIDIA's profiling tools.
Efficient Inference: A Complete Guide to Pruning and Quantization
Deep dive into model compression: unstructured and structured pruning, N:M sparsity, INT8/FP16 quantization, and hardware acceleration.
Writing Custom CUDA Kernels with Triton
Master GPU programming with OpenAI Triton: from vector addition to fused attention kernels with Python-like syntax.
CPU & GPU Performance Optimization: A Comprehensive Guide
Master high-performance computing: from understanding the memory wall to implementing vectorization, kernel fusion, and cache-aware algorithms.
Distributed Training: A Complete Guide from Collective Operations to 3D Parallelism
Master distributed deep learning: collective operations (ring, tree, hierarchical), α-β performance model, communication roofline, and parallelism strategies (DP, TP, PP, FSDP, MoE).
Communication-Efficient Distributed Training
Comprehensive guide to reducing communication overhead: gradient compression, quantization, sparsification, local SGD, and system-level optimizations.
The Roofline Model: A Complete Guide to Performance Analysis
Master performance modeling: arithmetic intensity, peak compute and bandwidth calculation, and analysis of BLAS and deep learning operations.