Course objective
This course introduces the foundations and practices of training modern Large Language Models (LLMs) at scale. You will learn how deep learning models are trained across multiple GPUs, nodes, and clusters—and why distributed training is essential for today’s largest AI systems.
We will cover:
- Core techniques for distributed training
- Modern frameworks and scaling strategies
- Practical implementations with real-world toolchains
- Theoretical underpinnings of large-scale learning
- Inference and applications
As LLMs grow in complexity and impact, understanding how they are built and deployed has become essential for researchers and engineers. This series bridges engineering and theory.
Organization
- Enrollment limited to 60 students (no external auditors)
- 7 sessions: 2h lecture + 2h hands-on lab
- Final session: grading
- Projects and HWs are done in groups of 2
Bring your own laptop for the lab sessions, and make sure you can install the nightly version of PyTorch with GPU support before the first lab.
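To confirm your install before the first lab, a minimal check like the following can help; it is a sketch, not an official verification procedure, and simply reports whether PyTorch is importable and whether it can see a CUDA GPU.

```python
# Minimal pre-lab sanity check (sketch). Assumes you have already installed
# the PyTorch nightly build per the official instructions.
def pytorch_gpu_status():
    """Return a one-line summary of the local PyTorch / GPU setup."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        return f"torch {torch.__version__}, {torch.cuda.device_count()} GPU(s)"
    return f"torch {torch.__version__}, no GPU visible"

if __name__ == "__main__":
    print(pytorch_gpu_status())
```

If the script reports no visible GPU, fix your driver/CUDA setup before the lab rather than during it.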
Lectures & Labs
| # | Topic | Date | Labs |
|---|---|---|---|
| 1 | Getting Started on Distributed LLM Training | 22/01/26 | Tiny Scaling Laws with nanoGPT |
| 2 | Systems for ML | 29/01/26 | Porting nanoGPT to torchtitan |
| 3 | Multi-GPU Parallelization Techniques | 05/02/26 | Pipeline Parallelism Simulator |
| 4 | Communication-Efficient Distributed Optimization | 12/02/26 | Collaborative Training with TorchFT |
| 5 | Post-Training | 26/02/26 | Evaluation and SFT with TorchTune |
| 6 | Serving LLMs at Scale | 05/03/26 | Serving with vLLM |
| 7 | Agentic AI | 12/03/26 | LLM agents |
| 8 | Grading | 26/03/26 | - |
Credits
Lab setup: install & verify pytorch / torchtitan / torchft / netbird / torchtune / vllm
Use a clean virtual environment, then follow the steps below.
- pytorch (labs 1–6): install the PyTorch nightly build using the official instructions.
- torchtitan (labs 2, 4): install from source using the official instructions. Quick test:
  `NGPU=1 CONFIG_FILE="./torchtitan/models/nanogpt/train_configs/debug_model.toml" ./run_train.sh`
- torchft (lab 4): install via pip or from source using the official instructions. To test, run the lighthouse and two replicas following the usage guide.
- netbird (lab 4): install using the official instructions. NetBird is a VPN and may require `sudo`; if you cannot use `sudo`, try the Docker setup. Quick test: confirm that `netbird up` works. In practice, we will use `netbird up --setup-key $KEY`.
- torchtune (lab 5): install from source in dev mode using the official instructions. Quick test: confirm that the instructions here work.
- vllm (lab 6): install vLLM via the official instructions (prefer the GPU build over the CPU build). Quick test: verify that `vllm --help` works.
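The per-tool checks above can be bundled into one script. The sketch below assumes the import names match the package names in the list (netbird is a CLI tool, so it is checked via `PATH` rather than import); adapt as needed.

```python
# One-shot lab-setup check (sketch): reports which lab dependencies are
# missing. Package names are assumptions based on the setup list above.
import importlib.util
import shutil

PACKAGES = ["torch", "torchtitan", "torchft", "torchtune", "vllm"]

def missing_packages():
    """Return the subset of PACKAGES that cannot be imported."""
    return [p for p in PACKAGES if importlib.util.find_spec(p) is None]

def netbird_on_path():
    """True if the `netbird` CLI is found on PATH."""
    return shutil.which("netbird") is not None

if __name__ == "__main__":
    print("missing packages:", missing_packages() or "none")
    print("netbird on PATH:", netbird_on_path())
```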
Grading & Deadlines
Project: choose one paper (from NeurIPS, MLSys, or a similar venue) related to the lecture topics and get approval before you start. See Grade and Groups.
| Item | Release | Due | Links |
|---|---|---|---|
| Group constitution | - | 05/02/26 | Instruction |
| HW1 | 05/02/26 | 19/02/26 | Statement |
| HW2 | 12/02/26 | 26/02/26 | Statement |
| Project proposal | 05/02/26 | 26/02/26 | Guidelines |
| Final report / poster | - | 26/03/26 | - |