Contact: If you have any questions regarding the setup or usage of the Neuronic cluster, please contact David Yin via email or through Messenger. Please make sure to send me a message on Messenger first before sending a friend request on Facebook.
This manual is a guide for using the Neuronic cluster in Zhuang's group at Princeton University.
Neuronic is a SLURM cluster funded by the Dean for Research and the School of Engineering and Applied Science (SEAS) for members of SEAS who need a high-performance computing (HPC) environment with GPUs for their research. The Neuronic cluster is composed of 33 identical nodes, each with 8 NVIDIA L40 GPUs.
To connect to Neuronic, you should have either a Princeton account or a Research Computer User (RCU) account (for external collaborators).
If you have not been approved for access to Neuronic but your research requires it, please contact Zhuang for help.
First, you need to connect to the cluster. See the Connect to the Cluster section for detailed instructions.
Below we show a simple example of submitting a SLURM job to train a neural network on the MNIST dataset using PyTorch Distributed Data Parallel (DDP).
We first clone this repo and create a conda environment called torch-env to install relevant packages.
git clone https://github.com/davidyyd/Neuronic-Manual.git
cd Neuronic-Manual
module purge
module load anaconda3/2024.02
source ~/.bashrc
conda create -n torch-env python=3.12
conda activate torch-env
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
Download the MNIST dataset:
python download_data.py
Finally, use the sbatch command to submit the job:
sbatch job.slurm
You can check the log file my_first_job.out in the directory where you ran the command. The model should achieve 98% accuracy on the test set in two epochs.
For more details on the above SLURM script, check out here
Once you have been granted access to Neuronic, you can connect to it using an SSH client.
Note: Your account must be configured with two-factor authentication before you can connect to the cluster. Follow this link to set up two-factor authentication. It is also recommended to be on the campus network when accessing the cluster. If you are off campus, use the GlobalProtect VPN to connect to the campus network; see the detailed instructions here to set it up. This is not needed when your computer is directly connected to the campus WiFi (eduroam).
We recommend using the remote explorer in Cursor / VSCode to establish an SSH connection with the cluster. For Cursor, you can claim a free membership here. Open the SSH configuration under the remote explorer.
Copy and paste the following into the SSH configuration. Replace <YourNetID> with the username for the account (everything before @). For example, if your account is [email protected], then <YourNetID> should be yy8435.
Host neuronic
HostName neuronic.cs.princeton.edu
User <YourNetID>
Open the Command Palette with Ctrl+Shift+P (or Cmd+Shift+P on Mac). Type >Remote-SSH: Connect to Host and press Enter.
Then type neuronic in the Command Palette and press Enter.
It will ask you to type in the password for the account. Check out here to see how to avoid typing in the password every time.
Then complete the two-factor authentication step.
Check the SSH connection output to select the two-factor login method. Here 1 is for Duo Push, 2 is for Phone Call, and 3 is for SMS Passcode.
Finally click the Open Folder button.
It will automatically set the path to the home directory, which should be /u/<YourNetID>.
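As a side note, repeated password and Duo prompts for every new connection can be avoided with SSH connection multiplexing, a standard OpenSSH feature (this is generic SSH configuration, not Neuronic-specific, and the socket path below is just a conventional choice). Extend the same SSH config entry:

```
Host neuronic
    HostName neuronic.cs.princeton.edu
    User <YourNetID>
    # Reuse one authenticated connection for all subsequent sessions
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 8h
```

Create the socket directory once with mkdir -p ~/.ssh/sockets. As long as the master connection persists, new terminals and editor windows attach to it without re-authenticating.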
Since Neuronic is a Linux system, knowing basic Linux commands is very important. For an introduction to Linux navigation, see the Intro to Linux Command Line workshop materials.
After logging in, you will land on a login node. The login node is only for lightweight tasks like file management, code editing, and job submission. For computational work, you must use compute nodes (see Compute Node section). Running computationally heavy tasks on the login node will block the normal traffic to it so that other users cannot connect to the cluster.
Each Neuronic cluster node is a Lenovo ThinkSystem SR670 V2 containing:
- 2 x Intel Xeon Gold 5320 26-core CPUs (104 cores total)
- 16 x 32 GB DDR4 3200MHz RDIMMs (512 GB total)
- 8 x NVIDIA L40 GPUs (46 GB memory each)
- 3.5TB of SSD for local scratch
- One 10Gbps Ethernet uplink (Note: NOT Infiniband)
This section explains how to submit jobs to compute nodes on Neuronic, especially for GPU workloads. All computational work must be performed on compute nodes, not login nodes, and jobs are managed using the SLURM scheduler.
SLURM (Simple Linux Utility for Resource Management) is the job scheduler used on Neuronic. It manages resource allocation and job scheduling for all users.
Useful SLURM commands (type these in the terminal to check cluster and job status):
sinfo # Show all partition and node information
sinfo -N -p <Partition> # Show node status of a given partition
squeue # Show running and pending jobs
squeue --me # Show only your jobs
scontrol show job <JobID> # Show detailed job information
scancel <JobID> # Cancel a job
sacct -j <JobID> # Show job accounting information
A typical SLURM batch script should specify:
#SBATCH --job-name=...      # Name for your job
#SBATCH --nodes=...         # Number of nodes to allocate
#SBATCH --ntasks=...        # Number of tasks (processes)
#SBATCH --gres=gpu:...      # Number of GPUs per node
#SBATCH --cpus-per-task=... # Number of CPU cores per task
#SBATCH --mem=...           # Memory per node (e.g., 16G)
#SBATCH --time=...          # Time limit (hh:mm:ss)
#SBATCH --output=...        # Output file for logs
- The commands to run your job (e.g., python my_script.py)
Example (minimal):
#!/bin/bash
#SBATCH --job-name=test_job # Job name
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1 # Number of tasks
#SBATCH --gres=gpu:4 # Number of GPUs per node
#SBATCH --cpus-per-task=4 # CPU cores per task
#SBATCH --mem=16G # Memory per node
#SBATCH --time=01:00:00 # Time limit (1 hour)
#SBATCH --output=output_%j.log # Output file (%j = job ID)
python my_script.py # Your job command
Check out the Quick Launch section for a simple example of submitting a SLURM job to train a neural network on the MNIST dataset using PyTorch Distributed Data Parallel (DDP).
Interactive sessions are useful, for example, to test code, run short scripts, or debug interactively. To start an interactive session on a GPU node, use:
salloc --nodes=1 --ntasks=1 --time=60:00 --cpus-per-task=8 --mem=32G --gres=gpu:4
Once the session is granted, you will be logged into a compute node with GPU access in an interactive shell. You can run a Python script directly without the sbatch command (e.g., our MNIST classification example):
module purge
module load anaconda3/2024.02
conda activate torch-env
python -m torch.distributed.run --nproc_per_node=4 mnist_classify_ddp.py --epochs 2
To exit the interactive session, simply type:
exit
On Neuronic, the priority of your job only matters when there is competition. If a job is eligible and there are free nodes that match the requested constraints, it starts immediately regardless of its priority.
Specifically, a job's priority score is calculated as a weighted sum of normalized factors:

Priority = 500 × Age + 10000 × FairShare + 500 × JobSize + 500 × Partition

where each factor lies in [0, 1] and the weights are Neuronic's configured values (shown by sprio -w below). The higher the priority of a job, the earlier it is in the queue.
sprio -w — show priority weights
$ sprio -w
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE    JOBSIZE  PARTITION
        Weights                               1        500      10000        500        500
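Putting these weights together, the priority sum can be sketched in Python. The weights below are the ones reported by sprio -w above; the factor values in the example are hypothetical, and each factor is assumed to be normalized to [0, 1]:

```python
# Weights reported by `sprio -w` on Neuronic.
WEIGHTS = {"age": 500, "fairshare": 10000, "jobsize": 500, "partition": 500}

def priority(factors):
    """Weighted sum of normalized priority factors (each in [0, 1])."""
    return sum(round(WEIGHTS[name] * value) for name, value in factors.items())

# Hypothetical job: pending for half of PriorityMaxAge, low fairshare,
# small size, submitted to the `all` partition.
score = priority({"age": 0.5, "fairshare": 0.05, "jobsize": 0.1, "partition": 0.1})
print(score)  # -> 850
```

The following sections break down how each of the four factors is computed.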
Definition: How long your job has been eligible to run (i.e., ready to start, just waiting for resources). It grows linearly while the job is pending and eligible.
Normalization: SLURM caps/normalizes Age by PriorityMaxAge (12 hours on Neuronic); once a job's age ≥ that cap, its Age factor = 1.0 (maxed).
Show SLURM config values for PriorityMaxAge & PriorityWeightAge
$ scontrol show config | egrep '^PriorityMaxAge|^PriorityWeightAge'
PriorityMaxAge = 12:00:00
PriorityWeightAge = 500
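Given the two config values above, the Age factor can be sketched as a linear ramp with a cap (a sketch of the documented behavior, not SLURM's actual code):

```python
def age_factor(pending_hours, max_age_hours=12.0):
    """Linear growth while pending, capped at 1.0 once age >= PriorityMaxAge."""
    return min(pending_hours / max_age_hours, 1.0)

print(age_factor(6))   # halfway to the cap -> 0.5
print(age_factor(48))  # past the 12-hour cap -> 1.0
```

With PriorityWeightAge = 500, a job pending for 6 hours therefore contributes 500 × 0.5 = 250 priority points from age alone.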
Definition: A 0–1 score SLURM computes for the user@account association your job is charged to. Higher = you’ve used less than your entitled share recently.
How it’s computed (conceptually):
- SLURM maintains effective usage that decays exponentially with the cluster’s half-life (PriorityDecayHalfLife; on Neuronic it’s 14 days).
- Usage is tracked per TRES minutes (e.g., GPU-minutes, CPU-minutes).
- The scheduler turns that into a FairShare factor ∈ [0,1] for your user@account. That single scalar is what goes into the priority math.
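The exponential decay can be illustrated as follows (a sketch: SLURM applies the decay continuously to accumulated TRES minutes, but the half-life arithmetic is the same):

```python
def decayed_usage(tres_minutes, days_ago, half_life_days=14.0):
    """Effective usage of work done `days_ago` days ago under PriorityDecayHalfLife."""
    return tres_minutes * 0.5 ** (days_ago / half_life_days)

print(decayed_usage(1000, 14))  # one half-life later -> 500.0
print(decayed_usage(1000, 28))  # two half-lives later -> 250.0
```

In other words, heavy usage from a month ago weighs only about a quarter as much as the same usage today.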
Show SLURM Fairshare settings: PriorityDecayHalfLife, PriorityUsageResetPeriod, PriorityWeightFairShare
$ scontrol show config | egrep '^PriorityDecayHalfLife|^PriorityWeightFairShare|^PriorityUsageResetPeriod'
PriorityDecayHalfLife = 14-00:00:00
PriorityUsageResetPeriod= NONE
PriorityWeightFairShare = 10000
Each user has their own fairshare, which is affected by the following factors:
- NormShares: Your entitled share value (uniformly distributed across all users in Neuronic).
- EffectvUsage: Your recent, decay-weighted share of actual usage.
From there, the LevelFS = NormShares / EffectvUsage is computed — SLURM ranks by this.
LevelFS > 1 means under-using (good), ≈ 1 means on-share, and < 1 means over-using (bad).
SLURM orders accounts by LevelFS and maps that ordering (via the fair-tree algorithm) to a FairShare ∈ [0,1] per user@account. That final scalar is the FairShare term your job uses in the priority sum.
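The ratio itself is simple to compute. Using the values for user yy8435 from the sshare output below (NormShares 0.004831, EffectvUsage 0.000213; the displayed values are rounded, so the result differs slightly from the LevelFS column sshare prints):

```python
def level_fs(norm_shares, effective_usage):
    """LevelFS = NormShares / EffectvUsage; > 1 means under-using your share."""
    return norm_shares / effective_usage

# Rounded values for user yy8435 from the sshare output below.
print(level_fs(0.004831, 0.000213))  # roughly 22.7: well under the entitled share
```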
Show Fair-Tree metrics for account seas
$ sshare -l -u yy8435
Account User RawShares NormShares RawUsage NormUsage EffectvUsage FairShare LevelFS
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ----------
root 0.000000 6181868410096 1.000000
seas 1 0.500000 6181868410096 1.000000 1.000000 0.500000
seas yy8435 1 0.004831 1316226032 0.000213 0.000213 0.269231 22.689187
Important: On Neuronic, not fully using your requested resources can also hurt your fairshare; this includes CPU cores, RAM, GPU utilization, and GPU VRAM. A useful command is jobstats <JobID>, which you should run often to monitor the resource usage of a specific job. You can also ssh into a compute node from the login node while your job is running there to check the resource usage.
Definition: An integer factor that increases with the size of your job request. On Neuronic, a bigger job ⇒ a bigger JobSize factor because PriorityFavorSmall=no.
- The factor is determined by the size of the request in CPUs. On Neuronic it's effectively the number of CPUs requested (GPUs don't count here).
- Each CPU adds ~0.078 to the JobSize factor, on top of a constant offset of 8. That is:
- 1 CPU = 8 jobsize
- 52 CPUs = 12 jobsize
- 104 CPUs = 16 jobsize
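Taking the numbers above at face value, the mapping from CPUs to JobSize points can be sketched as follows (an approximation inferred from these examples, not SLURM's exact internal formula):

```python
import math

def jobsize_points(cpus):
    """Approximate JobSize points: a constant offset of 8 plus ~0.078 per CPU."""
    return math.floor(8 + 0.078 * cpus)

print(jobsize_points(1), jobsize_points(52), jobsize_points(104))  # -> 8 12 16
```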
Show the JobSize component for an example job
$ sprio -l -j 2264734
JOBID PARTITION USER ACCOUNT PRIORITY SITE AGE ASSOC FAIRSHARE JOBSIZE PARTITION
2264734 all mc5063 seas 290 0 180 0 48 13 50
Definition: The type of partition your job is submitted to.
There are two kinds of partitions (both spanning all nodes): all and interactive. all has a PriorityJobFactor of 100, while interactive has a PriorityJobFactor of 1000.
The partition component is computed by normalizing the partition's PriorityJobFactor by the largest factor across partitions and multiplying by the partition weight (500): a job in all contributes 500 × (100/1000) = 50 points, while a job in interactive contributes 500 × (1000/1000) = 500 points. This essentially prioritizes an interactive job over normally submitted jobs in the all partition.
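Assuming SLURM's usual normalization (each partition's PriorityJobFactor divided by the largest factor across partitions, times PriorityWeightPartition = 500), the contributions work out as below; this reproduces the 50 partition points visible in the sprio example above:

```python
def partition_points(job_factor, weight=500, max_factor=1000):
    """Partition contribution: weight * (PriorityJobFactor / largest PriorityJobFactor)."""
    return round(weight * job_factor / max_factor)

print(partition_points(100))   # `all` partition -> 50
print(partition_points(1000))  # `interactive` partition -> 500
```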
Show SLURM config values for PriorityJobFactor & PriorityTier
$ scontrol show partition all | egrep -i 'PartitionName|PriorityTier'
PartitionName=all
PriorityJobFactor=100 PriorityTier=100 RootOnly=NO ReqResv=NO OverSubscribe=NO
In general, waiting time, job usage, and past usage together determine whether your job can start.
There are three types of storage space on the cluster:
- Home space (/u/<YourNetID> or ~): This is the home directory for each user. Since it has a quota of only 16 GB, it should be used to store code only.
- Project space (/n/fs/vision-mix/<YourNetID>): This is the shared project directory for each user in our lab. It has a total limit of 26 TB across all users. You can use it to store your conda environments, model checkpoints, and other large files. Please be considerate with your usage, as this space is shared by everyone. Each user's directory needs to be created manually; to request a new one, please contact David Yin by email or Messenger.
- Scratch space (/scratch/<YourNetID>): This is the local scratch directory on each node (i.e., not accessible from other nodes). It has a limit of 3.5 TB shared across all users on that node. You can use it to store temporary files, such as the pip install cache and the Hugging Face cache. Note that this space is not backed up and is routinely purged, so you should not store any important files here.
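Because the pip and Hugging Face caches can quickly exhaust the 16 GB home quota, it is common to point them at scratch instead. PIP_CACHE_DIR and HF_HOME are the standard environment variables these tools honor; the directory names themselves are just a choice (add the exports to your ~/.bashrc to make them permanent):

```shell
# Redirect package/model caches from the 16 GB home quota to local scratch.
export PIP_CACHE_DIR=/scratch/$USER/pip-cache
export HF_HOME=/scratch/$USER/hf-cache
# Create the directories (ignore errors if /scratch does not exist on this machine).
mkdir -p "$PIP_CACHE_DIR" "$HF_HOME" 2>/dev/null || true
```

Remember that scratch is per-node and routinely purged, so only caches (re-downloadable data) belong there.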
Here we list some useful commands for managing your job submissions and account usage.
squeue shows the status of all your submitted jobs. You can check the status of your own jobs by:
squeue -u <YourNetID>
It will print out the status of all your submitted jobs, including the job ID, partition, name, user, status, time, number of nodes, and node list:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2251339 all classification yy8435 PD 0:00 4 (Priority)
2251237 all gpt2 yy8435 R 1-04:11:33 2 neu[329-330]
It is also possible to check the status of all jobs in the cluster by removing the -u <YourNetID> option:
squeue
scancel is used to cancel a job in the cluster. You can cancel a job by its job ID:
scancel <JobID>
It is also possible to cancel all your jobs at once by specifying your NetID:
scancel -u <YourNetID>To see the current status of all the nodes in the cluster, we build a helper script check_all_nodes.sh (in the root directory of this repo):
bash check_all_nodes.shIt will print out the number of free CPUs, CPU memory usage, and the number of free GPUs for each of 32 nodes in Neuronic. You can use this command to determine the best resources for your jobs.
neu301 FreeCPUs= 40/104 FreeMem=375.0GiB/503.0GiB FreeGPUs=1/8
...
neu332 FreeCPUs=  4/104 FreeMem= 23.0GiB/503.0GiB FreeGPUs=4/8
gpudash is a tool to check the GPU status across all nodes.
gpudash
It will print out the GPU utilization for each GPU in the cluster during the last hour:
NEURONIC-GPU UTILIZATION (Mon Jul 28)
3:10 AM 3:20 AM 3:30 AM 3:40 AM 3:50 AM 4:00 AM 4:10 AM
neu301 0 yy8435:100 yy8435:99 yy8435:100 yy8435:100 yy8435:98 yy8435:100 yy8435:100
1 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:99 yy8435:100
2 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:98 yy8435:100 yy8435:100
3 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100
4 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100
5 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:99 yy8435:100 yy8435:100
6 yy8435:99 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100
7 yy8435:99 yy8435:100 yy8435:100 yy8435:100 yy8435:99 yy8435:100 yy8435:100
...
neu332 0 yy8435:100 yy8435:99 yy8435:100 yy8435:100 yy8435:98 yy8435:100 yy8435:100
1 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:99 yy8435:100
2 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:98 yy8435:100 yy8435:100
3 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100
4 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100
5 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:99 yy8435:100 yy8435:100
6 yy8435:99 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100 yy8435:100
7 yy8435:99 yy8435:100 yy8435:100 yy8435:100 yy8435:99 yy8435:100 yy8435:100
3:10 AM 3:20 AM 3:30 AM 3:40 AM 3:50 AM 4:00 AM 4:10 AM
sreport reports the CPU and GPU hours of your account. You need to specify the start and end dates (in the format YYYY-MM-DD) for the report.
sreport -t Hours -T CPU,gres/gpu cluster AccountUtilizationByUser Users=<YourNetID> Start=<StartDate> End=<EndDate>
It will print out the report as follows:
Usage reported in TRES Hours
--------------------------------------------------------------------------------
Cluster Account Login Proper Name TRES Name Used
--------- --------------- --------- --------------- -------------- --------
neuronic seas yy8435 Yida Yin cpu 8722
neuronic seas yy8435 Yida Yin gres/gpu 1034
sshare gives the priority of your account.
sshare -u <YourNetID>
It will print out something like this:
Account User RawShares NormShares RawUsage EffectvUsage FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root 0.000000 6744067132527 1.000000
seas 1 0.500000 6744067132527 1.000000
seas yy8435 1 0.004762 37444021933 0.005552 0.109005Only the last row is related to your account. These values are less informative than those provided by sreport. For more information on each component of the priority, please check out Neuronic SLURM Queue.
It takes a long time to queue a single-node job with 8 GPUs. What should I do?
It is generally faster to queue a job with two nodes, each with 4 GPUs, than one node with 8 GPUs. The performance difference between these two options is minimal. You might also find our helper script check_all_nodes.sh useful to check the current status of all the nodes in the cluster and determine the best resources for your job.
Why is my job's logging very slow, and why do I have to wait a long time for the output?
On Neuronic, logging is not real-time by default; output is flushed at a fixed time interval. You can add export PYTHONUNBUFFERED=1 to your job script to enable real-time logging. It is also possible to use flush=True in print statements, but be careful with this if your job prints thousands of times per second.
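As an illustration of the flush=True option (written to an in-memory stream here so the example is self-contained; in a job script you would simply pass flush=True to ordinary print calls):

```python
import io

def log_line(msg, stream):
    """Write one log line and flush immediately so it appears in real time."""
    print(msg, file=stream, flush=True)
    return stream

buf = log_line("epoch 1 | loss 0.512", io.StringIO())
print(buf.getvalue(), end="")  # -> epoch 1 | loss 0.512
```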
Can I download datasets to the cluster?
Yes, it is generally fine to download datasets to the cluster (such as ImageNet, COCO, or datasets from Hugging Face). But please monitor the storage space while you are downloading. Also, do not perform any web scraping (i.e., sending thousands of requests to different websites at the same time), as this might overwhelm the network and crash the cluster's DNS server.
How do I install packages that require compiling or intensive computation (e.g., FlashAttention) on the cluster?
You should not install those packages on the login node. Instead, it is better to request a compute node and install the packages there. Otherwise, you might block the normal traffic to the login node so that other users cannot connect to the cluster.
Official websites:
- Neuronic Website: https://clusters.cs.princeton.edu
Taiming's guide on Della cluster:
More Resources:
- SLURM: https://researchcomputing.princeton.edu/support/knowledge-base/slurm
- PyTorch: https://researchcomputing.princeton.edu/support/knowledge-base/pytorch
- Huggingface: https://researchcomputing.princeton.edu/support/knowledge-base/hugging-face
- VSCode: https://researchcomputing.princeton.edu/support/knowledge-base/vs-code
- Sharing Data: https://researchcomputing.princeton.edu/support/knowledge-base/sharing-data
Resource Allocation Guidelines: Please request the maximum CPU and GPU resources that your workload can effectively utilize. If your application cannot fully utilize the requested resources, consider requesting a lower configuration (e.g., smaller VRAM GPU, fewer CPU cores, or less memory). Jobs that underutilize their allocated resources may receive lower priority in the queue.
If you encounter any system issues on Neuronic, you can also contact [email protected] for help. However, please first check with me if we have encountered the same issue before.
The section on the SLURM queue is adapted from Taiming Lu's guide on Della cluster. The classification example is borrowed from here. Their repository includes more examples on using PyTorch Lightning, FSDP, and TensorFlow on the cluster.








