STILL WORK IN PROGRESS
This repository contains files that enable the usage of DDP on a cluster managed with SLURM.
Your workflow:
- Integrate PyTorch DDP usage into your `train.py` (or similar) by following `example.py`, which is a slightly adapted example from pytorch/examples, and the online docs.
- Edit `distributed_data_parallel_slurm_run.bash` to call your script and not `example.py`.
- Edit `distributed_data_parallel_slurm_setup.sbatch` to adapt the SLURM launch parameters:
  - `--nodes=1`: Number of nodes.
  - `--ntasks-per-node=X`: Number of tasks per node. Each task will have one GPU, which gives `nodes*X` total GPUs.
  - `--gres=gpu:X,VRAM:12G`: `X` should be the same as `ntasks-per-node`. Request the amount of VRAM per GPU that you need.
  - `--cpus-per-task=1`: Number of CPU cores per task. Usually a number larger than 1 is better, maybe 3 to 6.
  - `--mem=4G`: RAM per node. Set it to `10*X` GB if one task needs `10G` of RAM.
  - `--time=00:01:00`: Maximal runtime of the job. The job is killed once this limit is reached.
  - `--output`: File to save the output to. See the `slurm-log` function defined below.
- Submit your job to the SLURM queue with `sbatch distributed_data_parallel_slurm_setup.sbatch`. Make sure that the correct Python interpreter is in the path, e.g. by calling `conda activate my_env` beforehand.
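
How the launch parameters depend on each other can be sketched as a small shell helper. This is only an illustration: the function name `slurm_flags` and its arguments are hypothetical and not part of this repository, and the log path assumes the `$HOME/slurm/logs` directory that `slurm-log` reads from. It also relies on `sbatch` command-line options taking precedence over the `#SBATCH` directives inside the script.

```shell
# Hypothetical helper (not part of this repository): derive mutually
# consistent sbatch flags from the number of GPUs per node.
# Usage: slurm_flags NODES GPUS_PER_NODE MEM_PER_TASK_GB VRAM
slurm_flags() {
    local nodes=$1
    local gpus_per_node=$2
    local mem_per_task_gb=$3
    local vram=$4

    echo "--nodes=${nodes}"
    # One task per GPU, so tasks per node and GPUs per node are the same number.
    echo "--ntasks-per-node=${gpus_per_node}"
    echo "--gres=gpu:${gpus_per_node},VRAM:${vram}"
    # --mem is per node: RAM of one task times the tasks on that node.
    echo "--mem=$(( mem_per_task_gb * gpus_per_node ))G"
    # %j is replaced by the job ID when SLURM creates the file.
    echo "--output=${HOME}/slurm/logs/slurm-%j.out"
    # Informational only, so it goes to stderr and is not parsed as a flag.
    echo "total GPUs: $(( nodes * gpus_per_node ))" >&2
}
```

For example, `sbatch $(slurm_flags 1 4 10 12G) distributed_data_parallel_slurm_setup.sbatch` would request one node with four GPUs and 40G of RAM, assuming `$HOME` contains no spaces.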
SLURM pitfalls:
- `sbatch` executes your script once resources are available and will use the filesystem as it is at that point. If you call `sbatch` and then edit your code files before the job starts, you will run the edited code. This cannot be easily circumvented. The Python interpreter reads each module into RAM when it is imported, which usually happens at startup, so you can generally edit your code files once the job is running.
- All machines use a shared network filesystem, which means that if you change your data while a job is running, the job will use the changed data.
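
One partial workaround for the first pitfall is to submit from a frozen copy of the code, so that later edits to the working copy do not reach the queued job. The following is a minimal sketch, not something this repository provides: the function name `snapshot_code` and the snapshot directory are hypothetical, and it assumes that the job runs from the directory you submit from and that your scripts refer to code files by relative paths.

```shell
# Hypothetical helper (not part of this repository): copy a code directory
# into a fresh snapshot directory and print the snapshot's path.
# The snapshot root must lie outside the source directory.
# Usage: snapshot_code SOURCE_DIR SNAPSHOT_ROOT
snapshot_code() {
    local src=$1
    local dest_root=$2
    local dest

    mkdir -p "$dest_root" || return 1
    # mktemp creates a uniquely named directory for this submission.
    dest=$(mktemp -d "$dest_root/snapshot-XXXXXX") || return 1
    # "src/." copies the directory contents, including hidden files.
    cp -R "$src"/. "$dest"/ || return 1
    echo "$dest"
}

# Possible usage (assumes sbatch is available on the login node):
#   cd "$(snapshot_code "$PWD" "$HOME/slurm/snapshots")" \
#       && sbatch distributed_data_parallel_slurm_setup.sbatch
```

This does not help with data files, which are still read from the shared filesystem at run time.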
SLURM-related functions:
These functions are what I use, but they might not be optimal for you. If you use `slurm-log`, you must call `mkdir -p $HOME/slurm/logs` once to create the output directory. Calling `slurm-log` without arguments runs `tail -f` on the output file with the newest timestamp. Calling `slurm-log JOB_ID` runs `tail -f` on the output of the job with the ID `JOB_ID`. The function `msq` is just a wrapper around `squeue` with more compact output. You have to put your username into it.
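function slurm-log() {
    if [ "$1" = "" ]
    then
        # No job ID given: pick the most recently modified log file.
        unset -v latest_slurm_log
        for file in "$HOME/slurm/logs/slurm-"*.out;
        do
            # In bash, -nt is also true if the right-hand file does not exist,
            # which covers the first iteration.
            if [ "$file" -nt "$latest_slurm_log" ]
            then
                latest_slurm_log=$file
            fi
        done
        echo "tail $latest_slurm_log"
        tail -f -n+1 --retry --follow=name "$latest_slurm_log"
    else
        # Job ID given: follow that job's log, retrying until the file appears.
        tail -f -n+1 --retry --follow=name "$HOME/slurm/logs/slurm-$1.out"
    fi
}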
# TODO: Insert your SLURM user here
function msq () {
    squeue -u koestlel --format="%.7i %50j %.2t %.10M %.6D %R"
}
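
If you would rather not hard-code the username, a variant along the following lines might work. It is only a sketch: the name `msq_user` is hypothetical, and it assumes that the `$USER` environment variable holds your SLURM username, which may not be true on every cluster.

```shell
# Hypothetical variant of msq (not part of this repository): lists the jobs
# of the user given as first argument, or of $USER if no argument is given.
msq_user() {
    squeue -u "${1:-$USER}" --format="%.7i %50j %.2t %.10M %.6D %R"
}
```

Calling `msq_user` shows your own jobs, and `msq_user some_colleague` shows the jobs of another user.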