Commit a0f0516: Manually add OLCF-AMD documentation to master branch (`docs/OLCF-AMD.md`, +296 lines)

# OLCF Spock Tutorial
*Last updated 2021-8-19*

*This document is built off of the excellent how-to guide created for [Princeton's TigerGPU](https://github.com/Techercise/plasma-python/blob/master/docs/PrincetonUTutorial.md)*

## Building the package

### Login to Spock

First, log in to the Spock head node via ssh:
```
ssh -X <yourusername>@spock.olcf.ornl.gov
```
Note that `-X` is optional; it is only necessary if you plan to perform remote visualization, e.g. viewing the output `.png` files from the below [section](#Learning-curves-and-ROC-per-epoch). Trusted X11 forwarding can be requested with `-Y` instead of `-X` and may prevent timeouts, but it disables X11 SECURITY extension controls.

### Sample installation on Spock

#### Check out the Code Repository
Next, check out the source code from GitHub:
```
git clone https://github.com/PPPLDeepLearning/plasma-python
cd plasma-python
```

#### Install Miniconda
At the time of writing, Anaconda and Miniconda are not installed on Spock, so one of them must be downloaded manually. In their system documentation, AMD recommends Miniconda.

To install Miniconda, download the Linux installer [here](https://docs.conda.io/en/latest/miniconda.html#linux-installers) and follow the installation instructions for Miniconda on [this page](https://conda.io/projects/conda/en/latest/user-guide/install/linux.html).

Once Miniconda is installed, create a conda environment:
```
conda create -n your_env_name python=3.8 -y
```

Then, activate the environment:
```
conda activate your_env_name
```

Ensure the following packages are installed in your conda environment:
```
pyyaml          # pip install pyyaml
pathos          # pip install pathos
hyperopt        # pip install hyperopt
matplotlib      # pip install matplotlib
keras           # pip install keras
tensorflow-rocm # pip install tensorflow-rocm
```

#### Modules
To load the correct modules with ease, creating a profile is recommended:
```
vim frnn_spock.profile
```

Write the following to the profile:
```
module load rocm
module load cray-python
module load gcc
module load craype-accel-amd-gfx908
module load cray-mpich/8.1.7
module use /sw/aaims/spock/modulefiles
module load tensorflow

# These must be set before running if you want to use the Cray GPU-aware MPI.
# If running on only 1 GPU, there is no need to uncomment these lines.

# export MPIR_CVAR_GPU_EAGER_DEVICE_MEM=0
# export MPICH_GPU_SUPPORT_ENABLED=1
# export HIPCC_COMPILE_FLAGS_APPEND="$HIPCC_COMPILE_FLAGS_APPEND -I${MPICH_DIR}/include -L${MPICH_DIR}/lib -lmpi -L/opt/cray/pe/mpich/8.1.7/gtl/lib -lmpi_gtl_hsa"

export MPICC="$(which mpicc)"
```
Then, before building or running, load the profile in your shell with `source frnn_spock.profile`.

As of the latest update of this document (Summer 2021), the above modules correspond to the following versions on the Spock system, given by `module list` (note that this list also includes the default system modules):
```
Currently Loaded Modules:
  1) craype/2.7.8                            10) cray-pmi-lib/6.0.12
  2) craype-x86-rome                         11) DefApps/default
  3) libfabric/1.11.0.4.75                   12) PrgEnv-cray/8.1.0
  4) craype-network-ofi                      13) cray-python/3.8.5.1
  5) cray-dsmml/0.1.5                        14) gcc/10.3.0
  6) perftools-base/21.05.0                  15) craype-accel-amd-gfx908
  7) xpmem/2.2.40-2.1_2.28__g3cf3325.shasta  16) cray-mpich/8.1.7
  8) cray-libsci/21.06.1.1                   17) rocm/4.1.0
  9) cray-pmi/6.0.12                         18) tensorflow/2.3.6
```

#### Build mpi4py
If you want to run on multiple GPUs, mpi4py is needed. At the time of writing, a manual installation of mpi4py is required on the Spock system. To install mpi4py, do the following:
```
# Ensure your conda environment is activated:
conda activate your_env_name

# Download mpi4py to your home directory
cd ~
curl -O -L https://bitbucket.org/mpi4py/mpi4py/downloads/mpi4py-3.0.3.tar.gz

# Untar the file
tar -xzvf mpi4py-3.0.3.tar.gz

cd mpi4py-3.0.3

# Edit the mpi.cfg file
vim mpi.cfg
```

Include the following segment in the mpi.cfg file:
```
[craympi]
mpi_dir = /opt/cray/pe/mpich/8.1.4/ofi/crayclang/9.1
mpicc = cc
mpicxx = CC
include_dirs = /opt/cray/pe/mpich/8.1.4/ofi/crayclang/9.1/include
libraries = mpi
library_dirs = /opt/cray/pe/mpich/8.1.4/ofi/crayclang/9.1/
```

Build and install mpi4py:
```
python setup.py build --mpi=craympi
python setup.py install
```

Next, install the `plasma-python` package:

```bash
#conda activate your_env_name
#cd ~/plasma-python
python setup.py install
```

## Understanding and preparing the input data
### Location of the data on Spock

**Currently, no public data exists on Spock, but we leave this section in here for the user to understand the input data.**

The JET and D3D datasets contain multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions. The datasets are located in the `/tigress/FRNN` project directory of the [GPFS](https://www.ibm.com/support/knowledgecenter/en/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.product.doc/doc/bi_gpfs_overview.html) filesystem on Princeton University clusters.

For convenience, create the following symbolic links:
```bash
cd /tigress/<netid>
ln -s /tigress/FRNN/shot_lists shot_lists
ln -s /tigress/FRNN/signal_data signal_data
```

### Configuring the dataset
All the configuration parameters are summarised in `examples/conf.yaml`. In this section, we highlight the important ones used to control the input data.

Currently, FRNN is capable of working with JET and D3D data as well as the cross-machine regime. The switch is done in the configuration file:
```yaml
paths:
    ...
    data: 'jet_0D'
```

Older yaml files kept for archival purposes will denote this dataset as follows:
```yaml
paths:
    ...
    data: 'jet_data_0D'
```
Use `d3d_data` for D3D signals, and use `jet_to_d3d_data` or `d3d_to_jet_data` for the cross-machine regime.

By default, FRNN will select, preprocess, and normalize all valid signals available in the above dataset. To choose only specific signals, use:
```yaml
paths:
    ...
    specific_signals: [q95,ip]
```
If left empty (`[]`), all valid signals defined on the machine will be used. Only set this variable if you need a custom set of signals.
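The selection rule just described can be sketched in a few lines of Python; this is an illustrative sketch only, where the helper name and the signal list are hypothetical, not FRNN's actual API:

```python
# Illustrative sketch of the specific_signals rule; the helper and the
# signal list below are hypothetical, not part of the FRNN codebase.
ALL_VALID_SIGNALS = ["q95", "ip", "li", "lm", "dens", "energy"]  # example subset

def select_signals(specific_signals):
    """An empty list means "use every valid signal defined on the machine"."""
    return list(specific_signals) if specific_signals else list(ALL_VALID_SIGNALS)

print(select_signals([]))             # all valid signals
print(select_signals(["q95", "ip"]))  # the custom subset from conf.yaml above
```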
166+
167+
Other parameters configured in the `conf.yaml` include batch size, learning rate, neural network topology and special conditions foir hyperparameter sweeps.
168+
169+
### Preprocessing the input data
170+
***Preprocessing the input data is currently not required on Spock as the data that is available is already preprocessed.***
171+
172+
```bash
173+
cd examples/
174+
python guarantee_preprocessed.py
175+
```
176+
This will preprocess the data and save rescaled copies of the signals in `/tigress/<netid>/processed_shots`, `/tigress/<netid>/processed_shotlists` and `/tigress/<netid>/normalization`
177+
178+
Preprocessing must be performed only once per each dataset. For example, consider the following dataset specified in the config file `examples/conf.yaml`:
179+
```yaml
180+
paths:
181+
data: jet_0D
182+
```
183+
Preprocessing this dataset takes about 20 minutes to preprocess in parallel and can normally be done on the cluster headnode.
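Since preprocessing is needed only once per dataset, a job script can cheaply check for the output directories before re-running it. A minimal sketch, assuming only the directory names listed above (the base path here is a temporary stand-in for `/tigress/<netid>`):

```python
import tempfile
from pathlib import Path

# Directories that preprocessing populates (see above). The base path is a
# temporary stand-in for the real output location, e.g. /tigress/<netid>.
base = Path(tempfile.mkdtemp())
expected = ["processed_shots", "processed_shotlists", "normalization"]

def preprocessing_done(base_dir, subdirs):
    """True only when every expected output directory already exists."""
    return all((Path(base_dir) / d).is_dir() for d in subdirs)

print(preprocessing_done(base, expected))  # False: nothing created yet

# Simulate a completed preprocessing run by creating the directories.
for d in expected:
    (base / d).mkdir()

print(preprocessing_done(base, expected))  # True: safe to skip preprocessing
```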

### Current signals and notations

Signal name | Description
--- | ---
q95 | q95 safety factor
ip | plasma current
li | internal inductance
lm | Locked mode amplitude
dens | Plasma density
energy | stored energy
pin | Input Power (beam for d3d)
pradtot | Radiated Power
pradcore | Radiated Power Core
pradedge | Radiated Power Edge
pechin | ECH input power, not always on
betan | Normalized Beta
energydt | stored energy time derivative
torquein | Input Beam Torque
tmamp1 | Tearing Mode amplitude (rotating 2/1)
tmamp2 | Tearing Mode amplitude (rotating 3/2)
tmfreq1 | Tearing Mode frequency (rotating 2/1)
tmfreq2 | Tearing Mode frequency (rotating 3/2)
ipdirect | plasma current direction

## Training and inference

Use the Slurm job scheduler to perform batch or interactive analysis on the Spock system.

### Batch job

A sample batch job script for 1 GPU is provided in the examples directory and is called `spock_1GPU_slurm.cmd`. It can be run using: `sbatch spock_1GPU_slurm.cmd`
Note that the project/account (`-A`) and partition (`-p`) arguments will need to reflect your project and assigned partition.

Some batch job tips:
* For non-interactive batch analysis, make sure to allocate exactly 1 MPI process per GPU. If `X` is the number of nodes used for distributed training, the total number of GPUs (and MPI processes) is `X * 4`, which guarantees 1 MPI process per GPU regardless of the value of `X`.
* Update the `num_gpus` value in `conf.yaml` to correspond to the total number of GPUs specified for your Slurm allocation.
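The bookkeeping in the tips above can be made concrete with a small sketch; the helper is illustrative only, and the 4-GPUs-per-node figure comes from the `X * 4` rule above:

```python
# Sketch of the allocation rule above: 1 MPI process per GPU,
# 4 GPUs per Spock node, so X nodes -> 4 * X processes.
GPUS_PER_NODE = 4

def slurm_layout(num_nodes):
    """Return (total_gpus, mpi_processes, num_gpus value for conf.yaml)."""
    total_gpus = num_nodes * GPUS_PER_NODE
    mpi_processes = total_gpus  # exactly one MPI rank per GPU
    return total_gpus, mpi_processes, total_gpus

print(slurm_layout(2))  # 2 nodes -> (8, 8, 8)
```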

Monitor the job's completion via:
```bash
squeue --me
```
Optionally, add an email notification option in the Slurm configuration about the job completion:
```
#SBATCH --mail-user=<userid>@email.com
#SBATCH --mail-type=ALL
```

### Interactive job

An interactive session is preferred for **debugging** or running in a **notebook**; for all other cases, batch jobs are preferred.
The workflow is to request an interactive session for a 1 GPU interactive job:

```bash
salloc -t 02:00:00 -A <project_id> -N 1 --gres=gpu:1 --exclusive -p <partition> --ntasks-per-socket=1 --ntasks-per-node=1
```

[//]: # (Note, the modules might not/are not inherited from the shell that spawns the interactive Slurm session. Need to reload anaconda module, activate environment, and reload other compiler/library modules)

Ensure the above modules are still loaded and reactivate your conda environment.
Then, launch the application from the command line:

```bash
python mpi_learn.py
```

## Visualizing learning

A regular FRNN run will produce several outputs and callbacks.

## Custom visualization
You can visualize the accuracy of the trained FRNN model using the custom Python scripts and notebooks included in the repository.

### Learning curves, example shots, and ROC per epoch

You can produce the ROC curves for validation and test data, as well as visualizations of shots, by using:
```
cd examples/
python performance_analysis.py
```
The `performance_analysis.py` script takes the file produced as a result of training the neural network as input, and produces several `.png` files with plots as output.

In addition, you can check the scalar variable summaries for training loss, validation loss, and validation ROC logged at `/outputdir/<userid>/csv_logs` (each run will produce a new log file with a timestamp in its name).
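These CSV logs are plain files, so they can also be inspected with the standard library alone. A minimal sketch: the `epoch` and `train_loss` column names match the LearningCurves notebook snippet, while the `val_loss` column and the sample rows are fabricated for illustration:

```python
import csv
import io

# Fabricated sample of a csv_logs file; a real log comes from an FRNN run,
# and its exact column set may differ.
sample = """epoch,train_loss,val_loss
1,0.90,0.95
2,0.55,0.70
3,0.40,0.65
"""

# Find the epoch with the lowest validation loss.
rows = list(csv.DictReader(io.StringIO(sample)))
best = min(rows, key=lambda r: float(r["val_loss"]))
print(best["epoch"], best["val_loss"])  # -> 3 0.65
```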

Sample notebooks for analyzing the files in this directory can be found in `examples/notebooks/`. For instance, the [LearningCurves.ipynb](https://github.com/PPPLDeepLearning/plasma-python/blob/master/examples/notebooks/LearningCurves.ipynb) notebook contains a variation on the following code snippet:
```python
import pandas as pd
import numpy as np
from bokeh.plotting import figure, show, output_file, save

data = pd.read_csv("<destination folder name on your laptop>/csv_logs/<name of the log file>.csv")

from bokeh.io import output_notebook
output_notebook()

from bokeh.models import Range1d
# optionally set the plotting range
# left, right, bottom, top = -0.1, 31, 0.005, 1.51

p = figure(title="Learning curve", y_axis_label="Training loss", x_axis_label="Epoch number")  # ,y_axis_type="log")
# p.set(x_range=Range1d(left, right), y_range=Range1d(bottom, top))

p.line(data['epoch'].values, data['train_loss'].values, legend_label="Test description",
       line_color="tomato", line_dash="dotdash", line_width=2)
p.legend.location = "top_right"
show(p, notebook_handle=True)
```
The resulting plot should match the `train_loss` plot in the Scalars tab of the TensorBoard summary.

#### Learning curve summaries per mini-batch

To extract per-mini-batch summaries, we need a finer granularity of checkpoint data than what is logged to the per-epoch lines of the `csv_logs/` files. We must directly use the output produced by FRNN and logged to the standard output stream. In the case of non-interactive Slurm batch jobs, it will all be contained in the Slurm output file, e.g. `slurm-3842170.out`. Refer to the following notebook to perform the analysis of learning curves on a mini-batch level: [FRNN_scaling.ipynb](https://github.com/PPPLDeepLearning/plasma-python/blob/master/examples/notebooks/FRNN_scaling.ipynb)
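As a starting point for that kind of analysis, per-batch loss lines can be pulled out of a Slurm output file with the standard library. A minimal sketch, where the log-line format is a made-up example, not FRNN's actual stdout format:

```python
import re

# Fabricated example of per-batch lines in a Slurm output file; the real
# FRNN stdout format may differ, so adjust the pattern accordingly.
stdout = """\
[epoch 1] batch 1 loss: 0.912
[epoch 1] batch 2 loss: 0.874
[epoch 2] batch 1 loss: 0.701
"""

pattern = re.compile(r"\[epoch (\d+)\] batch (\d+) loss: ([0-9.]+)")
losses = [(int(e), int(b), float(l)) for e, b, l in pattern.findall(stdout)]
print(losses)  # [(1, 1, 0.912), (1, 2, 0.874), (2, 1, 0.701)]
```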
