curtail-llm

Technical report: Distributed LLM Pretraining During Renewable Curtailment Windows: A Feasibility Study

This prototype trains a 561M-parameter nanochat d20 transformer across geographically distributed GPU clusters. Training is scheduled only during periods of local renewable energy curtailment, when electricity is both clean and (depending on procurement contracts) cheap.

  • Federated training is coordinated via Flower.
  • Nodes are elastically added/removed using Exalsius and a custom Flower Kubernetes operator.
  • Energy system dynamics are simulated by Vessim, with curtailment periods derived from real-world marginal carbon intensity traces provided by WattTime.
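This README does not spell out how curtailment periods are derived from the WattTime traces. As a rough illustration only, the sketch below flags contiguous intervals in which a marginal carbon intensity series stays at or below a threshold. The function name, the threshold value, and the sample trace are hypothetical and not taken from the project.

```python
from typing import List, Tuple


def curtailment_windows(trace: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return half-open index ranges [start, end) where trace <= threshold."""
    windows = []
    start = None
    for i, value in enumerate(trace):
        if value <= threshold and start is None:
            start = i                      # a low-intensity window opens
        elif value > threshold and start is not None:
            windows.append((start, i))     # the window closes at index i
            start = None
    if start is not None:                  # trace ended inside a window
        windows.append((start, len(trace)))
    return windows


# Hypothetical trace: one sample per time step, arbitrary units.
trace = [420.0, 380.0, 0.0, 0.0, 15.0, 300.0, 0.0, 0.0]
print(curtailment_windows(trace, threshold=50.0))  # [(2, 5), (6, 8)]
```

A site would be eligible to train only inside the returned windows.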

Figure 1 | Sites train only during curtailment windows (green), when renewable generation exceeds demand. If multiple sites are curtailed simultaneously, they train locally in parallel and periodically average model states. In our experiment, curtailment-aware scheduling preserves training quality while reducing operational emissions to 5-12% of single-site baselines.
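To make "periodically average model states" concrete, here is a minimal sketch of an element-wise average over the states of all currently active sites. It assumes uniform weights and represents a model state as a plain dict of float lists; the actual aggregation used in the project (for example any weighting by site or by number of local steps) may differ.

```python
from typing import Dict, List

State = Dict[str, List[float]]


def average_states(states: List[State]) -> State:
    """Element-wise mean of parameter lists across sites (uniform weights)."""
    if not states:
        raise ValueError("need at least one model state")
    averaged = {}
    for name in states[0]:
        # Group the i-th element of this parameter from every site.
        columns = zip(*(state[name] for state in states))
        averaged[name] = [sum(col) / len(states) for col in columns]
    return averaged


# Two hypothetical sites that trained in parallel during overlapping windows.
site_a = {"w": [1.0, 2.0], "b": [0.0]}
site_b = {"w": [3.0, 4.0], "b": [1.0]}
print(average_states([site_a, site_b]))  # {'w': [2.0, 3.0], 'b': [0.5]}
```

With a single active site the average is simply that site's own state, which matches the case where only one site is curtailed.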

Setup

This section explains how to run the system on a single machine (in our case with 8xA100 GPUs) using the SubprocessProvisioner, which simulates multiple clients as separate processes. To replicate the distributed deployment with the ExalsiusProvisioner, please refer to the Exalsius documentation and the exalsius/flower-operator repository, or reach out to us!

Installation (all nodes)

After cloning the repository, install the dependencies from the repository root:

uv venv
uv sync
source .venv/bin/activate

Data & Tokenizer Setup (all nodes)

The nanochat tokenizer uses a Rust BPE implementation. To install Rust / Cargo and build the tokenizer extension, run:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"
uv run maturin develop --release --manifest-path rustbpe/Cargo.toml

Next, prepare the tokenizer:

export NANOCHAT_BASE_DIR=/workspace/cache/nanochat/
python -m nanochat.dataset -n 240  # Download dataset for tokenizer training (~22GB of FineWeb-EDU data)
python -m scripts.tok_train --max_chars=4000000000 --vocab_size=65536  # Train BPE tokenizer on ~4B characters, takes about 3min
python -c "from nanochat.tokenizer import get_tokenizer; print(f'Vocab size: {get_tokenizer().get_vocab_size()}')"

Redis Setup (head node only)

The system uses Redis for coordination between server and clients, so Redis must be reachable from all nodes. Install and start it with:

sudo apt-get update
sudo apt-get install redis-server
redis-server --daemonize yes
redis-cli ping
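Since `redis-cli ping` only confirms that Redis answers on the head node, a reachability check from the other nodes can be useful. The sketch below sends an inline `PING` over a plain TCP socket using only the Python standard library. The helper name `redis_ping` is ours, and the check assumes the default port 6379 and no authentication.

```python
import socket


def redis_ping(host: str, port: int = 6379, timeout: float = 2.0) -> bool:
    """Send an inline PING and report whether the server answers +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
    return reply.startswith(b"+PONG")


# Replace 127.0.0.1 with the head node's address when checking from a worker.
print(redis_ping("127.0.0.1"))
```

If this returns False from a remote node while `redis-cli ping` succeeds locally, the cause is typically Redis bind or protected-mode settings, or a firewall between the nodes.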

Deployment

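Deploy the components in the order below, each in its own tmux session. The example commands use 127.0.0.1, so they assume the SuperLink and all SuperNodes share one machine; across multiple physical nodes, point --superlink at the SuperLink host instead: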

  1. Make sure Redis is running:
redis-server --daemonize yes
redis-cli ping
  2. Start the energy simulation (Vessim):
tmux new -A -s vessim

cd /workspace/pilot
source .venv/bin/activate

python energy_simulation.py
  3. Start Flower SuperLink (coordinator):
tmux new -A -s superlink

cd /workspace/pilot
source .venv/bin/activate
export WANDB_API_KEY=<your-key>

flower-superlink --insecure
  4. Start one SuperNode (worker) per client, each with its own set of GPUs:
# Client 0
tmux new -A -s client_0

cd /workspace/pilot
source .venv/bin/activate
export NANOCHAT_BASE_DIR=/workspace/cache/nanochat/

CUDA_VISIBLE_DEVICES=0,1 flower-supernode --insecure \
  --superlink 127.0.0.1:9092 \
  --clientappio-api-address 127.0.0.1:9094 \
  --node-config 'name="client_0" partition-id=0'
# Client 1
tmux new -A -s client_1

cd /workspace/pilot
source .venv/bin/activate
export NANOCHAT_BASE_DIR=/workspace/cache/nanochat/

CUDA_VISIBLE_DEVICES=4,5,6,7 flower-supernode --insecure \
  --superlink 127.0.0.1:9092 \
  --clientappio-api-address 127.0.0.1:9095 \
  --node-config 'name="client_1" partition-id=1'
  5. Run the Flower app:
cd /workspace/pilot
source .venv/bin/activate
flwr run . local-deployment --stream
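The two SuperNode commands above differ only in the GPU set, the client name, the partition id, and the ClientAppIo port. If you want to start more clients, a small generator like the following sketch may help. It assumes the pattern visible in the two examples (port 9094 plus the client index, `partition-id` equal to the client index); the helper `supernode_command` is hypothetical and not part of the repository.

```python
import shlex
from typing import List


def supernode_command(index: int, gpus: List[int],
                      superlink: str = "127.0.0.1:9092",
                      base_port: int = 9094) -> str:
    """Build the shell command that starts one SuperNode on the given GPUs."""
    node_config = f'name="client_{index}" partition-id={index}'
    args = [
        "flower-supernode", "--insecure",
        "--superlink", superlink,
        "--clientappio-api-address", f"127.0.0.1:{base_port + index}",
        "--node-config", node_config,
    ]
    visible = ",".join(str(gpu) for gpu in gpus)
    # shlex.quote wraps the node config in single quotes for the shell.
    return f"CUDA_VISIBLE_DEVICES={visible} " + " ".join(shlex.quote(a) for a in args)


# Reproduces the two example clients from this README.
for i, gpus in enumerate([[0, 1], [4, 5, 6, 7]]):
    print(supernode_command(i, gpus))
```

Each printed line can be pasted into its own tmux session after activating the virtual environment and exporting NANOCHAT_BASE_DIR.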

You can override config values from the command line:

flwr run . local-deployment --run-config "lr=0.0005" --stream
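When launching several runs with different settings (for example from a sweep script), the override string can be assembled programmatically. The sketch below assumes that Flower accepts several space-separated `key=value` pairs in one `--run-config` string with TOML-style values (quoted strings, lowercase booleans); please verify this against the Flower version you use. Only `lr` is a config key taken from this README; any other key would be hypothetical.

```python
from typing import Dict, List, Union

Value = Union[bool, int, float, str]


def format_value(value: Value) -> str:
    """Render a Python value in TOML-style syntax."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, str):
        return f'"{value}"'
    return str(value)


def flwr_run_args(overrides: Dict[str, Value],
                  federation: str = "local-deployment") -> List[str]:
    """Build an argument list suitable for subprocess.run (no shell quoting needed)."""
    pairs = " ".join(f"{key}={format_value(val)}" for key, val in overrides.items())
    return ["flwr", "run", ".", federation, "--run-config", pairs, "--stream"]


print(flwr_run_args({"lr": 0.0005}))
# ['flwr', 'run', '.', 'local-deployment', '--run-config', 'lr=0.0005', '--stream']
```

Passing the list to `subprocess.run(args, check=True)` from the repository root would be equivalent to typing the command above.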

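Vanilla nanochat Baseline

For comparison, the unmodified nanochat training script can be run on a single site without federation:
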
tmux new -A -s baseline

cd /workspace/pilot
source .venv/bin/activate
export NANOCHAT_BASE_DIR=/workspace/cache/nanochat/

CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --standalone --nproc_per_node=4 \
  -m scripts.base_train -- --depth 20 --target_param_data_ratio 20 \
  --device_batch_size 8 --run baseline
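As a back-of-the-envelope check of what this baseline run involves, the sketch below estimates the token budget. It assumes that `--target_param_data_ratio` means training tokens per model parameter, uses the rounded 561M parameter count from the introduction, and assumes a total batch of 524,288 tokens per optimizer step. The batch value is not stated in this README, so please check it against your nanochat configuration.

```python
from typing import Tuple


def training_budget(num_params: int, tokens_per_param: int,
                    tokens_per_step: int) -> Tuple[int, int]:
    """Return (target training tokens, number of full optimizer steps)."""
    target_tokens = num_params * tokens_per_param
    num_steps = target_tokens // tokens_per_step
    return target_tokens, num_steps


# 561M parameters (d20), ratio 20, assumed 524,288 tokens per step.
tokens, steps = training_budget(561_000_000, 20, 524_288)
print(f"{tokens / 1e9:.2f}B tokens, {steps} steps")  # 11.22B tokens, 21400 steps
```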

Cite as

Wiesner, Philipp, Soeren Becker, Brett Cornick, Dominik Scheinert, Alexander Acker, and Odej Kao. 2026. Distributed LLM Pretraining During Renewable Curtailment Windows: A Feasibility Study. Technical Report arXiv:2602.22760. Exalsius.

@techreport{wiesner2026curtailllm,
  title = {Distributed LLM Pretraining During Renewable Curtailment Windows: A Feasibility Study},
  author = {Wiesner, Philipp and Becker, Soeren and Cornick, Brett and Scheinert, Dominik and Acker, Alexander and Kao, Odej},
  institution = {Exalsius},
  number = {arXiv:2602.22760},
  year = {2026}
}
