## Tutorials

### Sample usage on TigerGPU

First, create an isolated Anaconda environment and load the CUDA drivers:

```bash
module load anaconda3
module load cudatoolkit/8.0 cudnn/cuda-8.0/6.0 openmpi/cuda-8.0/intel-17.0/2.1.0/64 intel/17.0/64/17.0.2.174
module load intel/17.0/64/17.0.4.196 intel-mkl/2017.3/4/64
conda create --name my_env --file requirements-travis.txt
source activate my_env
```

Then install the plasma-python package:

```bash
#source activate my_env  # if not already activated
git clone https://github.com/PPPLDeepLearning/plasma-python
cd plasma-python
python setup.py install
```

Here, `my_env` should contain the Python packages listed in the `requirements-travis.txt` file.

#### Location of the data on Tigress

The JET and D3D datasets contain multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions. They are located on the `/tigress` filesystem on Princeton University clusters.
For convenience, create the following symbolic links:

```bash
cd /tigress/<netid>
ln -s /tigress/FRNN/shot_lists shot_lists
ln -s /tigress/FRNN/signal_data signal_data
```

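The link layout above can also be sketched programmatically. The following is a hypothetical helper for illustration only (not part of plasma-python; the NetID `jdoe` is made up):

```python
import os

def frnn_data_links(netid):
    """Build the (link, target) pairs for the symbolic links described above."""
    base = os.path.join("/tigress", netid)
    return [(os.path.join(base, name), os.path.join("/tigress/FRNN", name))
            for name in ("shot_lists", "signal_data")]

# Print the links that `ln -s` would create for a given user.
for link, target in frnn_data_links("jdoe"):
    print(link, "->", target)
```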
#### Preprocessing

```bash
cd examples/
python guarantee_preprocessed.py
```

This will preprocess the data and save it in `/tigress/<netid>/processed_shots` and `/tigress/<netid>/normalization`.

43+
44+
45+
#### Training and inference
46+
47+
Use Slurm scheduler to perform batch or interactive analysis on Tiger cluster.
48+
49+
##### Batch analysis
50+
51+
For batch analysis, make sure to allocate 1 process per GPU:
52+
53+
```bash
54+
#!/bin/bash
55+
#SBATCH -t 01:30:00
56+
#SBATCH -N X
57+
#SBATCH --ntasks-per-node=4
58+
#SBATCH --ntasks-per-socket=2
59+
#SBATCH --gres=gpu:4
60+
#SBATCH -c 4
61+
62+
module load anaconda3
63+
source activate my_env
64+
module load cudatoolkit/8.0 cudnn/cuda-8.0/6.0 openmpi/cuda-8.0/intel-17.0/2.1.0/64 intel/17.0/64/17.0.2.174
65+
module load intel/17.0/64/17.0.4.196 intel-mkl/2017.3/4/64
66+
srun python mpi_learn.py
67+
68+
```
where `X` is the number of nodes for distributed training.

Submit the job with:

```bash
#cd examples
sbatch slurm.cmd
```

and monitor its completion via:

```bash
squeue -u <netid>
```

Optionally, add an email notification option to the Slurm script to be notified of job completion.

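For example, the notification can be requested with the standard Slurm mail directives (the address below is a placeholder to replace with your own):

```shell
#SBATCH --mail-type=begin,end,fail
#SBATCH --mail-user=<netid>@princeton.edu
```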
##### Interactive analysis

The interactive option is preferred for debugging or running in a notebook; in all other cases, batch mode is preferred.
The workflow is to request an interactive session:

```bash
salloc -N [X] --ntasks-per-node=4 --ntasks-per-socket=2 --gres=gpu:4 -t 0-6:00
```

where the total number of GPUs is `X * 4`.

Then launch the application from the command line:

```bash
mpirun -npernode 4 python examples/mpi_learn.py
```

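The allocation arithmetic above can be captured in a small sketch (illustrative only; the helper name is made up, and the defaults match the 4-GPU Tiger nodes used in this tutorial):

```python
def allocation(nodes, tasks_per_node=4, gpus_per_node=4):
    """Total MPI ranks and GPUs for the salloc request above (one rank per GPU)."""
    return {"ranks": nodes * tasks_per_node, "gpus": nodes * gpus_per_node}

# A 2-node request gives 8 ranks driving 8 GPUs.
print(allocation(2))  # {'ranks': 8, 'gpus': 8}
```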
98+
99+
### Understanding the data
100+
101+
All the configuration parameters are summarised in `examples/conf.yaml`. Highlighting the important ones to control the data.
102+
Currently, FRNN is capable of working with JET and D3D data as well as cross-machine regime. The switch is done in the configuration file:
103+
104+
```yaml
105+
paths:
106+
...
107+
data: 'jet_data'
108+
```
109+
use `d3d_data` for D3D signals, use `jet_to_d3d_data` ir `d3d_to_jet_data` for cross-machine regime.
110+
111+
By default, FRNN will select, preprocess and normalize all valid signals available. To chose only specific signals use:
112+
```yaml
113+
paths:
114+
...
115+
specific_signals: [q95,ip]
116+
```
117+
if left empty `[]` will use all valid signals defined on a machine. Only use if need a custom set.
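The selection rule can be illustrated with a short sketch. This is plain Python mimicking the configuration behaviour, not the actual FRNN loader, and the signal list is a made-up subset:

```python
def select_signals(specific_signals, valid_signals):
    """Mimic conf.yaml semantics: an empty list means 'use all valid signals'."""
    if not specific_signals:
        return list(valid_signals)
    return [s for s in valid_signals if s in specific_signals]

valid = ["q95", "ip", "li", "lm", "dens"]
print(select_signals(["q95", "ip"], valid))  # ['q95', 'ip']
print(select_signals([], valid))             # all five signals
```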
### Current signals and notations

Signal name | Description
--- | ---
q95 | q95 safety factor
ip | plasma current
li | internal inductance
lm | locked mode amplitude
dens | plasma density
energy | stored energy
pin | input power (beam for D3D)
pradtot | radiated power
pradcore | radiated power, core
pradedge | radiated power, edge
pechin | ECH input power (not always on)
betan | normalized beta
energydt | stored energy time derivative
torquein | input beam torque
tmamp1 | tearing mode amplitude (rotating 2/1)
tmamp2 | tearing mode amplitude (rotating 3/2)
tmfreq1 | tearing mode frequency (rotating 2/1)
tmfreq2 | tearing mode frequency (rotating 3/2)
ipdirect | plasma current direction

### Visualizing learning

A regular FRNN run will produce several outputs and callbacks.

#### TensorBoard visualization

Currently supports graph visualization, histograms of weights, activations and biases, and scalar variable summaries of losses and accuracies.

The summaries are written in real time to `/tigress/<netid>/Graph`. On macOS, you can set up an `sshfs` mount of the `/tigress` filesystem and view those summaries in your browser.

For Mac, you can follow the instructions here:
https://github.com/osxfuse/osxfuse/wiki/SSHFS

then do something like:

```bash
sshfs -o allow_other,defer_permissions [email protected]:/tigress/netid/ /mnt/<destination folder name on your laptop>/
```

Launch TensorBoard locally:

```bash
python -m tensorflow.tensorboard --logdir /mnt/<destination folder name on your laptop>/Graph
```

You should see something like:

![alt text](https://github.com/PPPLDeepLearning/plasma-python/blob/master/docs/tb.png)

#### Learning curves and ROC per epoch

Besides TensorBoard summaries, you can produce ROC curves for validation and test data, as well as visualizations of shots:

```bash
cd examples/
python performance_analysis.py
```

This takes the file produced by training the neural network as input, and produces several `.png` files with plots as output.

In addition, you can check the scalar variable summaries for training loss, validation loss and validation ROC logged at `/tigress/<netid>/csv_logs` (each run will produce a new log file with a timestamp in its name).

Sample analysis code can be found in `examples/notebooks`. For instance:

182+
183+
```python
184+
import pandas as pd
185+
import numpy as np
186+
from bokeh.plotting import figure, show, output_file, save
187+
188+
data = pd.read_csv("/mnt/<destination folder name on your laptop>/csv_logs/<name of the log file>.csv")
189+
190+
from bokeh.io import output_notebook
191+
output_notebook()
192+
193+
from bokeh.models import Range1d
194+
#optionally set the plotting range
195+
#left, right, bottom, top = -0.1, 31, 0.005, 1.51
196+
197+
p = figure(title="Learning curve", y_axis_label="Training loss", x_axis_label='Epoch number') #,y_axis_type="log")
198+
#p.set(x_range=Range1d(left, right), y_range=Range1d(bottom, top))
199+
200+
p.line(data['epoch'].values, data['train_loss'].values, legend="Test description",
201+
line_color="tomato", line_dash="dotdash", line_width=2)
202+
p.legend.location = "top_right"
203+
show(p, notebook_handle=True)
204+
```
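Beyond plotting, the same CSV log can be summarised with the standard library alone, e.g. finding the epoch with the lowest training loss. The column names are assumed to match the plot above, and the inline data is a made-up stand-in for a real `csv_logs` file:

```python
import csv
import io

# Made-up excerpt standing in for a csv_logs file.
log = io.StringIO("epoch,train_loss,val_loss\n1,0.90,0.85\n2,0.55,0.60\n3,0.48,0.62\n")
rows = list(csv.DictReader(log))

# Pick the row with the smallest training loss.
best = min(rows, key=lambda r: float(r["train_loss"]))
print(best["epoch"], best["train_loss"])  # 3 0.48
```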

### Learning curve summaries per mini-batch

To extract per-mini-batch summaries, use the output produced by FRNN logged to standard out (in the case of batch jobs, it will all be contained in the Slurm output file). Refer to the following notebook to perform the analysis of the learning curve at the mini-batch level:
https://github.com/PPPLDeepLearning/plasma-python/blob/master/examples/notebooks/FRNN_scaling.ipynb