Course objective
This course introduces the foundations and practices of training modern Large Language Models (LLMs) at scale. You will learn how deep learning models are trained across multiple GPUs, nodes, and clusters—and why distributed training is essential for today’s largest AI systems.
We will cover:
- Core techniques for distributed training
- Modern frameworks and scaling strategies
- Practical implementations with real-world toolchains
- Theoretical underpinnings of large-scale learning
- Inference and applications
As LLMs grow in complexity and impact, understanding how they are built and deployed has become essential for researchers and engineers. This series bridges engineering and theory.
Organization
- Enrollment limited to 60 students (no external auditors)
- 7 sessions: 2h lecture + 2h hands-on lab
- Final session: grading
- Projects and HWs are done in groups of 2
Bring your own laptop for the lab sessions, and make sure you can install the nightly version of PyTorch with GPU support before the first lab.
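To confirm your install before the first lab, a minimal check like the following can help; it is a sketch, not an official verification procedure, and simply reports whether PyTorch is importable and whether it can see a CUDA GPU.

```python
# Minimal pre-lab sanity check (sketch). Assumes you have already installed
# the PyTorch nightly build per the official instructions.
def pytorch_gpu_status():
    """Return a one-line summary of the local PyTorch / GPU setup."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        return f"torch {torch.__version__}, {torch.cuda.device_count()} GPU(s)"
    return f"torch {torch.__version__}, no GPU visible"

if __name__ == "__main__":
    print(pytorch_gpu_status())
```

If the script reports no visible GPU, fix your driver/CUDA setup before the lab rather than during it.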
Lectures & Labs
| # | Topic | Date | Labs |
|---|---|---|---|
| 1 | Getting Started on Distributed LLM Training | 22/01/26 | Tiny Scaling Laws with nanoGPT |
| 2 | Systems for ML | 29/01/26 | Porting nanoGPT to torchtitan |
| 3 | Multi-GPU Parallelization Techniques | 05/02/26 | Pipeline Parallelism Simulator |
| 4 | Communication-Efficient Distributed Optimization | 12/02/26 | Collaborative Training with TorchFT |
| 5 | Post-Training | 26/02/26 | Evaluation and SFT with TorchTune |
| 6 | Serving LLMs at Scale | 05/03/26 | Serving with vLLM |
| 7 | Agentic AI | 12/03/26 | LLM agents |
| 8 | Grading | 26/03/26 | - |
Credits
Lab setup: install & verify pytorch / torchtitan / torchft / netbird / torchtune / vllm
Use a clean virtual environment, then follow the steps below.
- pytorch (labs 1–6): install the PyTorch nightly build using the official instructions.
- torchtitan (labs 2, 4): install from source using the official instructions. Quick test:
  `NGPU=1 CONFIG_FILE="./torchtitan/models/nanogpt/train_configs/debug_model.toml" ./run_train.sh`
- torchft (lab 4): install via pip or from source using the official instructions. To test, run the lighthouse and two replicas following the usage guide.
- netbird (lab 4): install using the official instructions. NetBird is a VPN and may require `sudo`; if you cannot use `sudo`, try the Docker setup. Quick test: confirm that `netbird up` works. In practice, we will use `netbird up --setup-key $KEY`.
- torchtune (lab 5): install from source in dev mode using the official instructions. Quick test: confirm that the instructions here work.
- vllm (lab 6): install vLLM via the official instructions (prefer the GPU build over the CPU build). Quick test: verify that `vllm --help` works.
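The per-tool checks above can be bundled into one script. The sketch below assumes the import names match the package names in the list (netbird is a CLI tool, so it is checked via `PATH` rather than import); adapt as needed.

```python
# One-shot lab-setup check (sketch): reports which lab dependencies are
# missing. Package names are assumptions based on the setup list above.
import importlib.util
import shutil

PACKAGES = ["torch", "torchtitan", "torchft", "torchtune", "vllm"]

def missing_packages():
    """Return the subset of PACKAGES that cannot be imported."""
    return [p for p in PACKAGES if importlib.util.find_spec(p) is None]

def netbird_on_path():
    """True if the `netbird` CLI is found on PATH."""
    return shutil.which("netbird") is not None

if __name__ == "__main__":
    print("missing packages:", missing_packages() or "none")
    print("netbird on PATH:", netbird_on_path())
```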
Grading & Deadlines
Project: choose one paper (from NeurIPS, MLSys, or a similar venue) related to the lecture topics and get approval before you start. See Grade and Groups.
| Item | Release | Due | Links |
|---|---|---|---|
| Group constitution | - | 05/02/26 | Instruction |
| HW1 | 05/02/26 | 19/02/26 | Statement |
| HW2 | 12/02/26 | 26/02/26 | Statement |
| Project proposal | 05/02/26 | 26/02/26 | Guidelines |
| Final report / poster | - | 26/03/26 | - |