|
| 1 | +## Tutorials |
| 2 | + |
| 3 | +### Sample usage on Tigergpu |
| 4 | + |
| 5 | +First, create an isolated Anaconda environment and load CUDA drivers: |
| 6 | +``` |
| 7 | +module load anaconda3 |
| 8 | +module load cudatoolkit/8.0 cudnn/cuda-8.0/6.0 openmpi/cuda-8.0/intel-17.0/2.1.0/64 intel/17.0/64/17.0.2.174 |
| 9 | +module load intel/17.0/64/17.0.4.196 intel-mkl/2017.3/4/64 |
| 10 | +conda create --name my_env --file requirements-travis.txt |
| 11 | +source activate my_env |
| 12 | +``` |
| 13 | + |
| 14 | +Then install the plasma-python package: |
| 15 | + |
| 16 | +```bash |
| 17 | +#source activate my_env |
| 18 | +git clone https://github.com/PPPLDeepLearning/plasma-python |
| 19 | +cd plasma-python |
| 20 | +python setup.py install |
| 21 | +``` |
| 22 | + |
| 23 | +Here, `my_env` should contain the Python packages listed in `requirements-travis.txt`. |
| 24 | + |
| 25 | +#### Location of the data on Tigress |
| 26 | + |
| 27 | +The JET and D3D datasets contain multi-modal time series of sensor measurements leading up to deleterious events called plasma disruptions. They are located on the `/tigress` filesystem on the Princeton University clusters. |
| 28 | +For convenience, create the following symbolic links: |
| 29 | + |
| 30 | +```bash |
| 31 | +cd /tigress/<netid> |
| 32 | +ln -s /tigress/FRNN/shot_lists shot_lists |
| 33 | +ln -s /tigress/FRNN/signal_data signal_data |
| 34 | +``` |
| 35 | + |
| 36 | +#### Preprocessing |
| 37 | + |
| 38 | +```bash |
| 39 | +cd examples/ |
| 40 | +python guarantee_preprocessed.py |
| 41 | +``` |
| 42 | +This will preprocess the data and save it in `/tigress/<netid>/processed_shots` and `/tigress/<netid>/normalization`. |
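
One way to confirm that preprocessing finished is to check that both output directories exist. The following is a minimal sketch only: the helper name `missing_outputs` is hypothetical and not part of plasma-python, and the base directory is whatever `/tigress/<netid>` resolves to for your account.

```python
import os


def missing_outputs(base_dir, expected=("processed_shots", "normalization")):
    """Return the expected output directories that are absent under base_dir.

    Hypothetical helper for illustration; not part of plasma-python.
    """
    return [name for name in expected
            if not os.path.isdir(os.path.join(base_dir, name))]
```

An empty list means both directories are present; any names returned are the ones still missing.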
| 43 | + |
| 44 | + |
| 45 | +#### Training and inference |
| 46 | + |
| 47 | +Use the Slurm scheduler to perform batch or interactive analysis on the Tiger cluster. |
| 48 | + |
| 49 | +##### Batch analysis |
| 50 | + |
| 51 | +For batch analysis, make sure to allocate 1 process per GPU: |
| 52 | + |
| 53 | +```bash |
| 54 | +#!/bin/bash |
| 55 | +#SBATCH -t 01:30:00 |
| 56 | +#SBATCH -N X |
| 57 | +#SBATCH --ntasks-per-node=4 |
| 58 | +#SBATCH --ntasks-per-socket=2 |
| 59 | +#SBATCH --gres=gpu:4 |
| 60 | +#SBATCH -c 4 |
| 61 | + |
| 62 | +module load anaconda3 |
| 63 | +source activate my_env |
| 64 | +module load cudatoolkit/8.0 cudnn/cuda-8.0/6.0 openmpi/cuda-8.0/intel-17.0/2.1.0/64 intel/17.0/64/17.0.2.174 |
| 65 | +module load intel/17.0/64/17.0.4.196 intel-mkl/2017.3/4/64 |
| 66 | +srun python mpi_learn.py |
| 67 | + |
| 68 | +``` |
| 69 | +where X is the number of nodes for distributed training. |
| 70 | + |
| 71 | +Submit the job with: |
| 72 | +```bash |
| 73 | +#cd examples |
| 74 | +sbatch slurm.cmd |
| 75 | +``` |
| 76 | + |
| 77 | +And monitor its completion via: |
| 78 | +```bash |
| 79 | +squeue -u <netid> |
| 80 | +``` |
| 81 | +Optionally, add an email notification option to the Slurm script to be notified when the job completes. |
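
If you submit jobs with different node counts, it can help to generate the batch script rather than edit it by hand. The sketch below is an assumption-laden illustration, not part of plasma-python: `render_slurm_script` is a hypothetical helper, the `#SBATCH` values and module lines are copied from the batch script above, and the email lines use Slurm's standard `--mail-type` and `--mail-user` options.

```python
def render_slurm_script(nodes, mail_user=None, walltime="01:30:00"):
    """Build the text of a batch script like the one shown above.

    Hypothetical helper for illustration; not part of plasma-python.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH -t {walltime}",
        f"#SBATCH -N {nodes}",
        "#SBATCH --ntasks-per-node=4",  # one process per GPU
        "#SBATCH --ntasks-per-socket=2",
        "#SBATCH --gres=gpu:4",
        "#SBATCH -c 4",
    ]
    if mail_user:
        # Standard Slurm options: send an email when the job ends.
        lines += ["#SBATCH --mail-type=END",
                  f"#SBATCH --mail-user={mail_user}"]
    lines += [
        "",
        "module load anaconda3",
        "source activate my_env",
        "module load cudatoolkit/8.0 cudnn/cuda-8.0/6.0 "
        "openmpi/cuda-8.0/intel-17.0/2.1.0/64 intel/17.0/64/17.0.2.174",
        "module load intel/17.0/64/17.0.4.196 intel-mkl/2017.3/4/64",
        "srun python mpi_learn.py",
        "",
    ]
    return "\n".join(lines)
```

For example, writing the result of `render_slurm_script(2, mail_user="user@example.com")` to `slurm.cmd` gives a two-node job (8 GPUs in total) with a completion email; the address here is a placeholder.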
| 82 | + |
| 83 | +##### Interactive analysis |
| 84 | + |
| 85 | +The interactive option is preferred for debugging or running in a notebook; in all other cases, batch is preferred. |
| 86 | +The workflow is to request an interactive session: |
| 87 | + |
| 88 | +```bash |
| 89 | +salloc -N [X] --ntasks-per-node=4 --ntasks-per-socket=2 --gres=gpu:4 -t 0-6:00 |
| 90 | +``` |
| 91 | +where the number of GPUs is X * 4. |
| 92 | + |
| 93 | +Then launch the application from the command line: |
| 94 | + |
| 95 | +```bash |
| 96 | +mpirun -npernode 4 python examples/mpi_learn.py |
| 97 | +``` |
| 98 | + |
| 99 | +### Understanding the data |
| 100 | + |
| 101 | +All the configuration parameters are summarised in `examples/conf.yaml`. The most important ones for controlling the data are highlighted below. |
| 102 | +Currently, FRNN can work with JET and D3D data, as well as in a cross-machine regime. The switch is made in the configuration file: |
| 103 | + |
| 104 | +```yaml |
| 105 | +paths: |
| 106 | + ... |
| 107 | + data: 'jet_data' |
| 108 | +``` |
| 109 | +Use `d3d_data` for D3D signals, and `jet_to_d3d_data` or `d3d_to_jet_data` for the cross-machine regime. |
| 110 | + |
| 111 | +By default, FRNN will select, preprocess and normalize all valid signals available. To choose only specific signals, use: |
| 112 | +```yaml |
| 113 | +paths: |
| 114 | + ... |
| 115 | + specific_signals: [q95,ip] |
| 116 | +``` |
| 117 | +If left empty (`[]`), FRNN will use all valid signals defined on a machine. Set this only if you need a custom set. |
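
The behaviour of the two `paths` keys described above can be sketched in a few lines of Python. This is an illustration of the documented semantics under stated assumptions, not FRNN's actual parsing code: `resolve_paths` and `VALID_DATA` are hypothetical names, the config is passed in as an already-parsed dictionary, and only the four `data` values named in this section are treated as valid (the real code may accept others).

```python
# Values of `data` mentioned in this section (assumed to be the full set).
VALID_DATA = {"jet_data", "d3d_data", "jet_to_d3d_data", "d3d_to_jet_data"}


def resolve_paths(paths, all_valid_signals):
    """Validate `data` and resolve `specific_signals` to a concrete list.

    Hypothetical helper for illustration; not part of plasma-python.
    """
    data = paths["data"]
    if data not in VALID_DATA:
        raise ValueError(f"unknown data setting: {data!r}")

    requested = paths.get("specific_signals") or []
    if not requested:
        # An empty list means: use every valid signal on the machine.
        return data, list(all_valid_signals)

    unknown = [s for s in requested if s not in all_valid_signals]
    if unknown:
        raise ValueError(f"unknown signals: {unknown}")
    return data, list(requested)
```

With `{"data": "jet_data", "specific_signals": ["q95", "ip"]}` the helper returns only those two signals; with an empty list it falls back to everything in `all_valid_signals`.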
| 118 | + |
| 119 | +### Current signals and notations |
| 120 | + |
| 121 | +Signal name | Description |
| 122 | +--- | --- |
| 123 | +q95 | q95 safety factor |
| 124 | +ip | plasma current |
| 125 | +li | internal inductance |
| 126 | +lm | Locked mode amplitude |
| 127 | +dens | Plasma density |
| 128 | +energy | stored energy |
| 129 | +pin | Input Power (beam for d3d) |
| 130 | +pradtot | Radiated Power |
| 131 | +pradcore | Radiated Power Core |
| 132 | +pradedge | Radiated Power Edge |
| 133 | +pechin | ECH input power, not always on |
| 135 | +betan | Normalized Beta |
| 136 | +energydt | stored energy time derivative |
| 137 | +torquein | Input Beam Torque |
| 138 | +tmamp1 | Tearing Mode amplitude (rotating 2/1) |
| 139 | +tmamp2 | Tearing Mode amplitude (rotating 3/2) |
| 140 | +tmfreq1 | Tearing Mode frequency (rotating 2/1) |
| 141 | +tmfreq2 | Tearing Mode frequency (rotating 3/2) |
| 142 | +ipdirect | plasma current direction |
| 143 | + |
| 144 | +### Visualizing learning |
| 145 | + |
| 146 | +A regular FRNN run will produce several outputs and callbacks. |
| 147 | + |
| 148 | +#### TensorBoard visualization |
| 149 | + |
| 150 | +The TensorBoard integration currently supports graph visualization, histograms of weights, activations and biases, and scalar variable summaries of losses and accuracies. |
| 151 | + |
| 152 | +The summaries are written in real time to `/tigress/<netid>/Graph`. On macOS, you can set up an `sshfs` mount of the `/tigress` filesystem and view those summaries in your browser. |
| 153 | + |
| 154 | +To set up `sshfs` on macOS, follow the instructions here: |
| 155 | +https://github.com/osxfuse/osxfuse/wiki/SSHFS |
| 156 | + |
| 157 | +Then run something like: |
| 158 | +``` |
| 159 | +sshfs -o allow_other,defer_permissions <netid>@<cluster hostname>:/tigress/<netid>/ /mnt/<destination folder name on your laptop>/ |
| 160 | +``` |
| 161 | + |
| 162 | +Launch TensorBoard locally: |
| 163 | +``` |
| 164 | +python -m tensorflow.tensorboard --logdir /mnt/<destination folder name on your laptop>/Graph |
| 165 | +``` |
| 166 | +You should see something like: |
| 167 | + |
| 168 | + |
| 169 | + |
| 170 | +#### Learning curves and ROC per epoch |
| 171 | + |
| 172 | +Besides the TensorBoard summaries, you can produce ROC curves for validation and test data as well as visualizations of shots: |
| 173 | +``` |
| 174 | +cd examples/ |
| 175 | +python performance_analysis.py |
| 176 | +``` |
| 177 | +This takes the file produced by training the neural network as input and writes several `.png` files with plots as output. |
| 178 | + |
| 179 | +In addition, you can check the scalar variable summaries for training loss, validation loss and validation ROC logged at `/tigress/<netid>/csv_logs` (each run produces a new log file with a timestamp in its name). |
| 180 | + |
| 181 | +Sample analysis code can be found in `examples/notebooks`. For instance: |
| 182 | + |
| 183 | +```python |
| 184 | +import pandas as pd |
| 185 | +import numpy as np |
| 186 | +from bokeh.plotting import figure, show, output_file, save |
| 187 | +
|
| 188 | +data = pd.read_csv("/mnt/<destination folder name on your laptop>/csv_logs/<name of the log file>.csv") |
| 189 | +
|
| 190 | +from bokeh.io import output_notebook |
| 191 | +output_notebook() |
| 192 | +
|
| 193 | +from bokeh.models import Range1d |
| 194 | +#optionally set the plotting range |
| 195 | +#left, right, bottom, top = -0.1, 31, 0.005, 1.51 |
| 196 | +
|
| 197 | +p = figure(title="Learning curve", y_axis_label="Training loss", x_axis_label='Epoch number') #,y_axis_type="log") |
| 198 | +#p.set(x_range=Range1d(left, right), y_range=Range1d(bottom, top)) |
| 199 | +
|
| 200 | +p.line(data['epoch'].values, data['train_loss'].values, legend="Test description", |
| 201 | + line_color="tomato", line_dash="dotdash", line_width=2) |
| 202 | +p.legend.location = "top_right" |
| 203 | +show(p, notebook_handle=True) |
| 204 | +``` |
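
Because each run writes a new timestamped log file, it can be handy to pick up the most recent one automatically before plotting. The following is a minimal standard-library sketch under a few assumptions: `latest_log` and `best_epoch` are hypothetical helpers (not part of plasma-python), log files are assumed to end in `.csv`, "most recent" is judged by file modification time, and the column names `epoch` and `train_loss` are the ones used in the plotting example above.

```python
import csv
import glob
import os


def latest_log(log_dir):
    """Return the most recently modified .csv file in log_dir, or None."""
    paths = glob.glob(os.path.join(log_dir, "*.csv"))
    return max(paths, key=os.path.getmtime) if paths else None


def best_epoch(csv_path, column="train_loss"):
    """Return (epoch, value) for the row with the smallest value in column."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    best = min(rows, key=lambda r: float(r[column]))
    # Epochs may be stored as "3" or "3.0"; accept both.
    return int(float(best["epoch"])), float(best[column])
```

The path returned by `latest_log` can be passed straight to `pd.read_csv` in the example above, and `best_epoch` gives a quick numeric summary without opening a notebook.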
| 205 | + |
| 206 | +#### Learning curve summaries per mini-batch |
| 207 | + |
| 208 | +To extract per mini-batch summaries, use the output that FRNN logs to standard output (for batch jobs, it is all contained in the Slurm output file). Refer to the following notebook to analyze the learning curve at the mini-batch level: |
| 209 | +https://github.com/PPPLDeepLearning/plasma-python/blob/master/examples/notebooks/FRNN_scaling.ipynb |