Blog & Tutorials

GPU Programming Mar 6, 2026 15 min

Understanding CUDA Memory Hierarchy: A Practical Guide

Deep dive into global, shared, and register memory with practical examples, diagrams, and performance benchmarks.

CUDA Memory Performance

Distributed ML Mar 6, 2026 12 min

PyTorch DDP vs FSDP: When to Use Which

Comprehensive comparison of distributed training strategies, with practical code examples and a decision guide.

PyTorch DDP FSDP

Optimization Mar 6, 2026 18 min

ADMM for Distributed Machine Learning: From Theory to Practice

Step-by-step implementation of ADMM-based distributed optimization with convergence guarantees.

ADMM Convex Distributed

HPC Infrastructure Mar 6, 2026 25 min

AI Cluster Architectures: A Deep Dive

Understanding GPU cluster hierarchies, NVLink, InfiniBand, and network topologies for large-scale AI training.

NVLink InfiniBand Topology

HPC Mar 6, 2026 20 min

Getting Started with Slurm: Submit Your First Multi-GPU Job

Practical guide to Slurm job scheduling with examples from CSC's Puhti and Mahti supercomputers.

Slurm CSC Job Scheduling

GPU Programming Mar 6, 2026 25 min

Nsight Systems & Compute: Profiling Your CUDA Kernels

Learn to identify bottlenecks and optimize GPU kernels using NVIDIA's profiling tools.

Nsight Profiling Optimization

Inference Mar 6, 2026 30 min

Efficient Inference: A Complete Guide to Pruning and Quantization

Deep dive into model compression: unstructured and structured pruning, N:M sparsity, INT8/FP16 quantization, and hardware acceleration.

Pruning Quantization Tensor Cores

GPU Programming Mar 6, 2026 30 min

Writing Custom CUDA Kernels with Triton

Master GPU programming with OpenAI Triton: from vector addition to fused attention kernels with Python-like syntax.

Triton CUDA Kernel Fusion

Performance Engineering Mar 9, 2026 60 min

CPU & GPU Performance Optimization: A Comprehensive Guide

Master high-performance computing: from understanding the memory wall to implementing vectorization, kernel fusion, and cache-aware algorithms.

Performance Optimization CPU/GPU

Distributed Systems Mar 9, 2026 75 min

Distributed Training: A Complete Guide from Collective Operations to 3D Parallelism

Master distributed deep learning: collective operations and their algorithms (ring, tree, hierarchical), the α-β performance model, the communication roofline, and parallelism strategies (DP, TP, PP, FSDP, MoE).

AllReduce FSDP 3D Parallelism

Distributed Systems Mar 9, 2026 55 min

Communication-Efficient Distributed Training

Comprehensive guide to reducing communication overhead: gradient compression, quantization, sparsification, local SGD, and system-level optimizations.

Compression Quantization Local SGD

Performance Analysis Mar 7, 2026 45 min

The Roofline Model: A Complete Guide to Performance Analysis

Master performance modeling: arithmetic intensity, peak compute and bandwidth calculation, and analysis of BLAS and deep learning operations.

Roofline Performance Memory Bandwidth