Commit 31fafa3

Drop Rick Zamora's ALCF notes into docs/
1 parent a99c01a commit 31fafa3

2 files changed

+369
-3
lines changed

2 files changed

+369
-3
lines changed

docs/ALCF.md

Lines changed: 366 additions & 0 deletions
# ALCF Theta `plasma-python` FRNN Notes

**Author: Rick Zamora ([email protected])**

This document is intended to act as a tutorial for running the [plasma-python](https://github.com/PPPLDeepLearning/plasma-python) implementation of the Fusion recurrent neural network (FRNN) on the ALCF Theta supercomputer (Cray XC40; Intel KNL processors). The steps followed in these notes are based on the Princeton [Tiger-GPU tutorial](https://github.com/PPPLDeepLearning/plasma-python/blob/master/docs/PrincetonUTutorial.md#location-of-the-data-on-tigress), hosted within the main GitHub repository for the project.
## Environment Setup

Choose a *root* directory for FRNN-related installations on Theta:

```
export FRNN_ROOT=<desired-root-directory>
cd $FRNN_ROOT
```

*Personal Note: Using `FRNN_ROOT=/home/zamora/ESP`*

Create a simple directory structure to allow experimental *builds* of the `plasma-python` code/library:

```
mkdir build
mkdir build/miniconda-3.6-4.5.4
cd build/miniconda-3.6-4.5.4
```
### Custom Miniconda Environment Setup

Copy the miniconda installation script to the working directory, and run it:

```
cp /lus/theta-fs0/projects/fusiondl_aesp/FRNN/rzamora/scripts/install_miniconda-3.6-4.5.4.sh .
./install_miniconda-3.6-4.5.4.sh
```

The `install_miniconda-3.6-4.5.4.sh` script will install `miniconda-4.5.4` (using `Python-3.6`), as well as `Tensorflow-1.12.0` and `Keras-2.2.4`.

Update your environment variables to use miniconda:

```
export PATH=${FRNN_ROOT}/build/miniconda-3.6-4.5.4/miniconda3/4.5.4/bin:$PATH
export PYTHONPATH=${FRNN_ROOT}/build/miniconda-3.6-4.5.4/miniconda3/4.5.4/lib/python3.6/site-packages/:$PYTHONPATH
```

Note that the previous lines (as well as the definition of `FRNN_ROOT`) can be appended to your `$HOME/.bashrc` file if you want to use this environment on Theta by default.
## Installing `plasma-python`

Here, we assume the installation is within the custom miniconda environment installed in the previous steps. We also assume the following commands have already been executed:

```
export FRNN_ROOT=<desired-root-directory>
export PATH=${FRNN_ROOT}/build/miniconda-3.6-4.5.4/miniconda3/4.5.4/bin:$PATH
export PYTHONPATH=${FRNN_ROOT}/build/miniconda-3.6-4.5.4/miniconda3/4.5.4/lib/python3.6/site-packages/:$PYTHONPATH
```

*Personal Note: Using `export FRNN_ROOT=/lus/theta-fs0/projects/fusiondl_aesp/zamora/FRNN_project`*

If the environment is set up correctly, installation of `plasma-python` is straightforward:

```
cd ${FRNN_ROOT}/build/miniconda-3.6-4.5.4
git clone https://github.com/PPPLDeepLearning/plasma-python.git
cd plasma-python
python setup.py build
python setup.py install
```
## Data Access

Sample data and metadata are available in `/lus/theta-fs0/projects/FRNN/tigress/alexeys/signal_data` and `/lus/theta-fs0/projects/FRNN/tigress/alexeys/shot_lists`, respectively. It is recommended that users create their own symbolic links to these directories. I recommend that you do this within a directory called `/lus/theta-fs0/projects/fusiondl_aesp/<your-alcf-username>/`. For example:

```
ln -s /lus/theta-fs0/projects/fusiondl_aesp/FRNN/tigress/alexeys/shot_lists  /lus/theta-fs0/projects/fusiondl_aesp/<your-alcf-username>/shot_lists
ln -s /lus/theta-fs0/projects/fusiondl_aesp/FRNN/tigress/alexeys/signal_data  /lus/theta-fs0/projects/fusiondl_aesp/<your-alcf-username>/signal_data
```

For the examples included in `plasma-python`, there is a configuration file that specifies the root directory of the raw data. Change the `fs_path: '/tigress'` line in `examples/conf.yaml` to the following:

```
fs_path: '/lus/theta-fs0/projects/fusiondl_aesp'
```

It's also a good idea to change `num_gpus: 4` to `num_gpus: 1`. I am also using the `jet_data_0D` dataset:

```
paths:
    data: jet_data_0D
```
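Taken together, the configuration edits above amount to something like the following sketch. This is an assumed layout, not a complete file: only the changed values are shown, and each key should stay in whatever section it already occupies in your copy of `examples/conf.yaml`.

```yaml
# Sketch of the examples/conf.yaml edits described above (assumed layout;
# only the changed values are shown).
fs_path: '/lus/theta-fs0/projects/fusiondl_aesp'  # was '/tigress'
num_gpus: 1                                       # was 4
paths:
    data: jet_data_0D
```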
### Data Preprocessing

#### The SLOW Way (On Theta)

Theta is KNL-based, and is **not** the best resource for processing many text files in Python. However, the preprocessing step *can* be performed there using the following steps (although the job may need to be resubmitted many times to get through the whole dataset within the 60-minute debug-queue limit):

```
cd ${FRNN_ROOT}/build/miniconda-3.6-4.5.4/plasma-python/examples
cp /lus/theta-fs0/projects/fusiondl_aesp/FRNN/rzamora/scripts/submit_guarantee_preprocessed.sh .
```

Modify the paths defined in `submit_guarantee_preprocessed.sh` to match your environment.

Note that the preprocessing module uses Pathos multiprocessing (not MPI/mpi4py). Therefore, the script will see every compute core (all 256 per node) as an available resource. Since the Lustre file system is unlikely to perform well with 256 processes (on the same node) opening/closing/creating files at once, it may improve performance to make a slight change to line 85 of `plasma/preprocessor/preprocess.py`:

```
line 85: use_cores = min( <desired-maximum-process-count>, max(1,mp.cpu_count()-2) )
```
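As a concrete sketch of that change, capping the pool at 64 workers would look like the following. The value 64 is an assumed cap for illustration, not a measured optimum for Lustre:

```python
# Hypothetical version of line 85 in plasma/preprocessor/preprocess.py:
# cap the worker count so that all 256 KNL cores do not hit the Lustre
# file system at once, while still leaving two cores free for the OS.
import multiprocessing as mp

DESIRED_MAX_PROCS = 64  # assumed cap; tune for your file system

use_cores = min(DESIRED_MAX_PROCS, max(1, mp.cpu_count() - 2))
print(use_cores)
```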
After optionally re-building and installing `plasma-python` with this change, submit the preprocessing job:

```
qsub submit_guarantee_preprocessed.sh
```
#### The FAST Way (On Cooley)

You will find it much less painful to preprocess the data on Cooley, because its Haswell processors are much better suited for this task. Log onto the ALCF Cooley machine:

```
ssh <alcf-username>@cooley.alcf.anl.gov
```

Copy my `cooley_preprocess` example directory to whatever directory you choose to work in:

```
cp -r /lus/theta-fs0/projects/fusiondl_aesp/FRNN/rzamora/scripts/cooley_preprocess .
cd cooley_preprocess
```

This directory has a Singularity image with everything you need to run your code on Cooley. Assuming you have created symbolic links to the `shot_lists` and `signal_data` directories in `/lus/theta-fs0/projects/fusiondl_aesp/<your-alcf-username>/`, you can just submit the included `COBALT` script (to specify the data you want to process, just modify the included `conf.yaml` file):

```
qsub submit.sh
```

For me, this finishes in less than 10 minutes, and creates 5523 `.npz` files in the `/lus/theta-fs0/projects/fusiondl_aesp/<your-alcf-username>/processed_shots/` directory. The output file of the COBALT submission ends with the following message:

```
5522/5523Finished Preprocessing 5523 files in 406.94421911239624 seconds
Omitted 5523 shots of 5523 total.
0/0 disruptive shots
WARNING: All shots were omitted, please ensure raw data is complete and available at /lus/theta-fs0/projects/fusiondl_aesp/zamora/signal_data/.
4327 1196
```

Note that every shot was omitted here, which motivates the investigation below.
# Notes on Revisiting Preprocessing

## Preprocessing Information

To understand what might be going wrong with the preprocessing step, let's investigate what the code is actually doing.

**Step 1:** Call `guarantee_preprocessed( conf )`, which is defined in `plasma/preprocessor/preprocess.py`. This function first initializes a `Preprocessor()` object (whose class definition is in the same file), and then checks whether the preprocessing was already done (by looking for a file). The preprocessor object is called `pp`.

**Step 2:** Assuming preprocessing is needed, we call `pp.clean_shot_lists()`, which loops through each file in the `shot_lists` directory and calls `self.clean_shot_list()` (not plural) for each text-file item. I do not believe this function is doing anything when I run it, because all the shot-list files have already been "cleaned." Cleaning a shot-list file just means the data is corrected to have two columns, and the file is renamed (to have "clear" in the name).
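As a rough sketch of what that cleaning amounts to (the helper below and its file names are hypothetical stand-ins; the real logic is `clean_shot_list()` in `plasma/preprocessor/preprocess.py`):

```python
# Hypothetical sketch of shot-list "cleaning": keep the first two columns
# of each row and write the result to a file renamed with a "clear" suffix.
import os
import tempfile

def clean_shot_list(path):
    rows = []
    with open(path) as f:
        for line in f:
            cols = line.split()
            if len(cols) >= 2:
                rows.append((cols[0], cols[1]))
    root, ext = os.path.splitext(path)
    out_path = root + "_clear" + ext  # assumed naming convention
    with open(out_path, "w") as f:
        for shot, label in rows:
            f.write("{} {}\n".format(shot, label))
    return out_path

# Tiny demonstration with a throwaway file:
tmpdir = tempfile.mkdtemp()
raw = os.path.join(tmpdir, "shots.txt")
with open(raw, "w") as f:
    f.write("81234 1 extra\n81235 0\n")
print(clean_shot_list(raw))  # path ending in "shots_clear.txt"
```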
**Step 3:** We call `pp.preprocess_all()`, which parses some of the config file, and ultimately calls `self.preprocess_from_files(shot_files_all, use_shots)` (where I believe `shot_files_all` is the output directory, and `use_shots` is the number of shots to use).

**Step 4:** The `preprocess_from_files()` function does the actual preprocessing. It creates a multiprocessing pool and maps the worker processes onto the `self.preprocess_single_file` function (note that the code for the `ShotList` class is in `plasma/primitives/shots.py`, and the preprocessing code is still in `plasma/preprocessor/preprocess.py`).
**Important:** It looks like the code uses the path definitions in `data/shot_lists/signals.py` to define the location/path of signal data. I believe that some of the signal data is missing, which is causing every "shot" to be labeled as incomplete (and consequently thrown out).

### Possible Issues

From the preprocessing output, it is clear that the *Signal Radiated Power Core* data was not downloaded correctly. According to the `data/shot_lists/signals.py` file, the data *should* be in `/lus/theta-fs0/projects/fusiondl_aesp/<alcf-user-name>/signal_data/jet/ppf/bolo/kb5h/channel14`. However, the only subdirectory of `~/jet/ppf/` is `~/jet/ppf/efit`.

Another possible issue is that the `data/shot_lists/signals.py` file specifies the **name** of the directory containing the *Radiated Power* data incorrectly (*I THINK*). Instead of the following line:

`pradtot = Signal("Radiated Power",['jpf/db/b5r-ptot>out'],[jet])`

We might need this:

`pradtot = Signal("Radiated Power",['jpf/db/b5r-ptot\>out'],[jet])`

The issue has to do with the `>` character in the directory name (without the proper `\` escape character, Python may be looking in the wrong path). **NOTE: I need to confirm that there is actually an issue with the way the code is using the string.**
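A quick way to see what is at stake (a sanity check only, not a verdict on which spelling the data layout needs): in Python, `\>` is not a recognized escape sequence, so the backslash survives as a literal character, and the two spellings name different paths.

```python
# In a Python string literal, '\>' is not a valid escape sequence, so the
# backslash is kept as a literal character -- the two spellings therefore
# refer to two different directory names on disk.
unescaped = 'jpf/db/b5r-ptot>out'
escaped = 'jpf/db/b5r-ptot\>out'

print(len(unescaped))   # 19
print(len(escaped))     # 20 (one extra character: the literal backslash)
print('\\' in escaped)  # True
```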
## Singularity/Docker Notes

Recall that the data preprocessing step was PAINFULLY slow on Theta, and so I decided to use Cooley. To simplify the process of using Cooley, I created a Docker image with the necessary environment. **Personal Note:** I performed this work on my local machine (Mac) in `/Users/rzamora/container-recipes`.

In order to use a Docker image within a Singularity container (required on ALCF machines), it is useful to build the image on your local machine and push it to [Docker Hub](https://hub.docker.com/):

**Step 1:** Install Docker if you don't have it. [Docker-Mac](https://www.docker.com/docker-mac) works well for Mac.

**Step 2:** Build a Docker image using the recipe discussed below.

```
export IMAGENAME="test_image"
export RECIPENAME="Docker.centos7-cuda-tf1.12.0"
docker build -t $IMAGENAME -f $RECIPENAME .
```

You can check that the image is functional by starting an interactive shell session and checking that the necessary Python modules are available. For example (using `-it` for an interactive session):

```
docker run --rm -it -v $PWD:/tmp -w /tmp $IMAGENAME:latest bash
# python -c "import keras; import plasma; print(plasma.__file__)"
```

Note that the `plasma-python` source code will be located in `/root/plasma-python/` for the recipe described below.
**Step 3:** Push the image to [Docker Hub](https://hub.docker.com/).

Using your Docker Hub username:

```
docker login --username=<username>
```

Then, "tag" the image using the `IMAGE ID` value displayed by `docker image ls`:

```
docker tag <IMAGE-ID> <username>/<image-name>:<label>
```

Here, `<label>` is something like "latest". Finally, push the image to Docker Hub:

```
docker push <username>/<image-name>
```
### Docker Recipe

The actual content of the Docker recipe is mostly borrowed from an example on [GitHub](https://github.com/scieule/golden-heart/blob/master/Dockerfile):

```
FROM nvidia/cuda:9.1-cudnn7-devel-centos7

# Setup environment:
SHELL ["/bin/bash", "-c"]
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ENV LC_ALL en_US.UTF-8
ENV CUDA_DEVICE_ORDER PCI_BUS_ID
ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/usr/local/cuda/extras/CUPTI/lib64

RUN yum update -y

RUN yum groupinstall -y "Development tools"

RUN yum install -y wget \
    unzip \
    screen tmux \
    ruby \
    vim \
    bc \
    man \
    ncurses-devel \
    zlib-devel \
    curl-devel \
    openssl-devel \
    which

RUN yum install -y qt5*devel gtk2-devel

RUN yum install -y blas-devel \
    lapack-devel \
    atlas-devel \
    gcc-gfortran \
    tbb-devel \
    eigen3-devel \
    jasper-devel \
    libpng-devel \
    libtiff-devel \
    openexr-devel \
    libwebp-devel \
    libv4l-devel \
    libdc1394-devel \
    libv4l-devel \
    gstreamer-plugins-base-devel

# C/C++ CMake Python
RUN yum install -y centos-release-scl && \
    yum install -y devtoolset-7-gcc* \
    devtoolset-7-valgrind \
    devtoolset-7-gdb \
    devtoolset-7-elfutils \
    clang \
    llvm-toolset-7 \
    llvm-toolset-7-cmake \
    rh-python36-python-devel \
    rh-python36-python-pip \
    rh-git29-git \
    devtoolset-7-make

RUN echo "source scl_source enable devtoolset-7" >> /etc/bashrc
RUN echo "source scl_source enable llvm-toolset-7" >> /etc/bashrc
RUN echo "source scl_source enable rh-python36" >> /etc/bashrc
RUN echo "source scl_source enable rh-git29" >> /etc/bashrc

# Python libs & jupyter
RUN source /etc/bashrc; pip3 install --upgrade pip
RUN source /etc/bashrc; pip3 install numpy scipy matplotlib pandas \
    tensorflow-gpu keras h5py tables \
    scikit-image scikit-learn Pillow opencv-python \
    jsonschema jinja2 tornado pyzmq ipython jupyter notebook

# Install MPICH
RUN cd /root && wget -q http://www.mpich.org/static/downloads/3.2.1/mpich-3.2.1.tar.gz \
    && tar xf mpich-3.2.1.tar.gz \
    && rm mpich-3.2.1.tar.gz \
    && cd mpich-3.2.1 \
    && source /etc/bashrc; ./configure --prefix=/usr/local/mpich/install --disable-wrapper-rpath \
    && make -j 4 install \
    && cd .. \
    && rm -rf mpich-3.2.1

ENV PATH ${PATH}:/usr/local/mpich/install/bin
ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/usr/local/mpich/install/lib
RUN env | sort

# Install plasma-python (https://github.com/PPPLDeepLearning/plasma-python)
# For 'pip'-based install: pip --no-cache-dir --disable-pip-version-check install -i https://testpypi.python.org/pypi plasma
RUN cd /root && git clone https://github.com/PPPLDeepLearning/plasma-python \
    && cd plasma-python \
    && source /etc/bashrc; python setup.py install \
    && cd ..

# nccl2
RUN cd /root && git clone https://github.com/NVIDIA/nccl.git \
    && cd nccl \
    && make -j src.build \
    && make pkg.redhat.build \
    && rpm -i build/pkg/rpm/x86_64/libnccl*

# pip-install mpi4py
RUN source /etc/bashrc; pip3 install mpi4py

RUN yum install -y libffi libffi-devel

RUN source /etc/bashrc; pip3 install tensorflow

# Workaround to build horovod without needing cuda libraries available:
# temporarily add stub drivers to ld.so.cache
RUN ldconfig /usr/local/cuda/lib64/stubs \
    && source /etc/bashrc; HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_NCCL_HOME=/nccl/build/ pip3 --no-cache-dir install horovod \
    && ldconfig

ENV NCCL_P2P_DISABLE 1
```
### Converting Docker to Singularity

I needed to build a Singularity image for Cooley, and used Vagrant:

```
cd ~/vm-singularity/
vagrant up
vagrant ssh
sudo singularity build centos7-cuda-tf1.12.0-plasma.simg docker://rjzamora/centos7-cuda-tf1.12.0.dimg:latest
```

examples/conf.yaml

Lines changed: 3 additions & 3 deletions

```
@@ -10,18 +10,18 @@ paths:
   signal_prepath: '/signal_data/' # /signal_data/jet/
   shot_list_dir: '/shot_lists/'
   tensorboard_save_path: '/Graph/'
-  data: d3d_data_0D # 'd3d_to_jet_data' # 'd3d_to_jet_data' # 'jet_to_d3d_data' # jet_data
+  data: d3d_data_0D
   # if specific_signals: [] left empty, it will use all valid signals defined on a machine. Only use if need a custom set
   specific_signals: [] # ['q95','li','ip','betan','energy','lm','pradcore','pradedge','pradtot','pin','torquein','tmamp1','tmamp2','tmfreq1','tmfreq2','pechin','energydt','ipdirect','etemp_profile','edens_profile']
   executable: "mpi_learn.py"
   shallow_executable: "learn.py"

 data:
-  bleed_in: 0 # how many shots from the test sit to use in training?
+  bleed_in: 0 # how many shots from the test set to use in training?
   bleed_in_repeat_fac: 1 # how many times to repeat shots in training and validation?
   bleed_in_remove_from_test: True
   bleed_in_equalize_sets: False
-  # TODO(KGF): make next parameter use 'none' instead of None
+  # TODO(KGF): make next parameter use 'none' instead of None for consistency
   signal_to_augment: None # 'plasma current' # or None
   augmentation_mode: 'none'
   augment_during_training: False
```
