Commit ff8227e: Update PrincetonUTutorial.md
1 parent 4572ef2

1 file changed (docs/PrincetonUTutorial.md): 58 additions, 18 deletions

@@ -1,30 +1,57 @@
## Tutorials

### Login to TigerGPU

First, log in to the TigerGPU cluster headnode via ssh:
```
ssh -XC <yourusername>@tigergpu.princeton.edu
```
### Sample usage on TigerGPU

Next, check out the source code from GitHub:
```
git clone https://github.com/PPPLDeepLearning/plasma-python
cd plasma-python
```

After that, create an isolated Anaconda environment and load the CUDA drivers:
```
#cd plasma-python
module load anaconda3/4.4.0
conda create --name my_env --file requirements-travis.txt
source activate my_env

export OMPI_MCA_btl="tcp,self,sm"
module load cudatoolkit/8.0
module load cudnn/cuda-8.0/6.0
module load openmpi/cuda-8.0/intel-17.0/2.1.0/64
module load intel/17.0/64/17.0.4.196
```

and install the `plasma-python` package:

```bash
#source activate my_env
python setup.py install
```

where `my_env` should contain the Python packages listed in the `requirements-travis.txt` file.

#### Common issue

A common issue is a mismatch between the Intel compiler on your `PATH` and the one used by the loaded modules. With the modules loaded as above, you should see something like this:
```
$ which mpicc
/usr/local/openmpi/cuda-8.0/2.1.0/intel170/x86_64/bin/mpicc
```

If you `source activate` the Anaconda environment after loading `openmpi`, you would pick up the MPI that ships with Anaconda instead, which is not what you want and could lead to errors.
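This PATH check can be scripted. A minimal sketch, assuming a POSIX shell; the `classify_mpicc` helper and the path patterns are illustrative, not part of the repository:

```bash
# Illustrative helper: report where the mpicc found on PATH comes from.
classify_mpicc() {
  case "$1" in
    *conda*) echo "anaconda" ;;  # shadowed by the conda env: fix your load order
    "")      echo "missing"  ;;  # no MPI compiler wrapper found at all
    *)       echo "system"   ;;  # e.g. /usr/local/openmpi/.../bin/mpicc
  esac
}

classify_mpicc "$(command -v mpicc || true)"
```

If this prints `anaconda`, activate the environment first and load the `openmpi` module afterwards, as in the ordering shown earlier.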
#### Location of the data on Tigress

The JET and D3D datasets, containing multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions, are located on the `/tigress/FRNN` filesystem on Princeton University clusters.
For convenience, create the following symbolic links:

```bash
# ... (earlier ln -s commands elided in this diff view)
ln -s /tigress/FRNN/signal_data signal_data
cd examples/
python guarantee_preprocessed.py
```
This will preprocess the data and save it in `/tigress/<netid>/processed_shots`, `/tigress/<netid>/processed_shotlists` and `/tigress/<netid>/normalization`.

You only have to run preprocessing once for each dataset. The dataset is specified in the config file `examples/conf.yaml`:
```yaml
paths:
    data: jet_data_0D
```
It takes about 20 minutes to preprocess in parallel and can normally be done on the cluster headnode.
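To confirm the preprocessing outputs landed where expected, a quick sketch; the `check_preprocessed` helper is hypothetical, and the directory names are the three listed above:

```bash
# Hypothetical sanity check: report which preprocessing output
# directories exist under a base path such as /tigress/<netid>.
check_preprocessed() {
  base="$1"
  for d in processed_shots processed_shotlists normalization; do
    if [ -d "$base/$d" ]; then
      echo "found: $d"
    else
      echo "missing: $d"
    fi
  done
}

check_preprocessed "/tigress/$USER"
```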

#### Training and inference

Use the Slurm scheduler to perform batch or interactive analysis on the TigerGPU cluster.

##### Batch analysis

For batch analysis, make sure to allocate 1 MPI process per GPU. Save the following to a `slurm.cmd` file (or make changes to the existing `examples/slurm.cmd`):

```bash
#!/bin/bash
# ... (#SBATCH directives such as the node count are elided in this diff view)
#SBATCH --ntasks-per-socket=2
#SBATCH --gres=gpu:4
#SBATCH -c 4
#SBATCH --mem-per-cpu=0

module load anaconda3/4.4.0
source activate my_env
export OMPI_MCA_btl="tcp,self,sm"
module load cudatoolkit/8.0
module load cudnn/cuda-8.0/6.0
module load openmpi/cuda-8.0/intel-17.0/2.1.0/64
module load intel/17.0/64/17.0.4.196

srun python mpi_learn.py
```
where `X` is the number of nodes for distributed training.

Submit the job with:
```bash
# ... (the sbatch command is elided in this diff view)
```

Optionally, add an email notification option in Slurm about the job completion.
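The email notification mentioned above can be requested with standard Slurm mail directives, for example (the address is a placeholder):

```bash
#SBATCH --mail-type=end,fail
#SBATCH --mail-user=<netid>@princeton.edu
```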

##### Interactive analysis

The interactive option is preferred for **debugging** or running in a **notebook**; for all other cases, batch is preferred.
The workflow is to request an interactive session:

```bash
salloc -N [X] --ntasks-per-node=4 --ntasks-per-socket=2 --gres=gpu:4 -c 4 --mem-per-cpu=0 -t 0-6:00
```
where the number of GPUs is `X * 4`.

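As a concrete illustration of the node-to-GPU arithmetic (the `salloc_cmd` helper is purely illustrative; TigerGPU nodes have 4 GPUs each):

```bash
# Illustrative: build the salloc line for X nodes; with 4 GPUs per
# node, X nodes give 4*X GPUs and 4 MPI tasks per node.
salloc_cmd() {
  X="$1"
  echo "salloc -N $X --ntasks-per-node=4 --ntasks-per-socket=2 --gres=gpu:4 -c 4 --mem-per-cpu=0 -t 0-6:00"
}

salloc_cmd 2   # 2 nodes, 8 GPUs total
```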
@@ -104,7 +142,7 @@ Currently, FRNN is capable of working with JET and D3D data as well as cross-machine regimes

```yaml
paths:
    ...
    data: 'jet_data_0D'
```
Use `d3d_data` for D3D signals, and `jet_to_d3d_data` or `d3d_to_jet_data` for the cross-machine regime.

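Before launching a long job, the chosen dataset name can be sanity-checked against the values this tutorial lists; a sketch with a hypothetical `check_dataset` helper:

```bash
# Hypothetical validator for the conf.yaml `data` setting; the accepted
# names are the ones given in this tutorial.
check_dataset() {
  case "$1" in
    jet_data_0D|d3d_data|jet_to_d3d_data|d3d_to_jet_data) echo "valid" ;;
    *) echo "invalid" ;;
  esac
}

check_dataset "jet_data_0D"
```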
@@ -116,6 +154,8 @@ paths:
```
If left empty (`[]`), all valid signals defined on a machine will be used. Only use this if you need a custom set.

Other parameters configured in `conf.yaml` include the batch size, learning rate, neural network topology, and special conditions for hyperparameter sweeps.

### Current signals and notations

Signal name | Description
