This lab should be executed inside nvidia-docker using the `hopobcn/gtc2018-fwi:us` image.
In this lab you will make modifications to FWI, a real-world oil & gas mini-application.
You will compile and execute the newly enhanced code in each step.
Along the way, solutions are provided, so you can check your work, or take a peek if you get lost.
After logging into the machine you should see a folder called FWI containing each step and solution as a different branch:

```
~# git branch
* gtc2018-step1
  gtc2018-step1-sol
  gtc2018-step3
  gtc2018-step3-sol
  gtc2018-step4
  gtc2018-step4-sol
  gtc2018-step5
  gtc2018-step5-sol
  ompss-openacc
```

Use `git diff <branch-name>` to compare branches, and `git stash && git checkout <branch-name>` to discard non-committed changes and switch to another step (branch).
Test environment: dual-socket Intel(R) Xeon(R) E5-2630 v3 (Haswell) @ 2.40 GHz with 8 cores per socket (16 cores total, HT off), and an NVIDIA Tesla K80.
To save time, this step has been moved to Appendix 0
To compare against our GPU executions we will use the serial and OpenMP executions:

- Sequential execution with 1 core:

```
Number of frequencies 1
Number of shots 1
Number of gradient iterations 1
Number of test iterations 1
Output directory path: results
FWI Program finished in 1059.629833 seconds
```

- OpenMP execution with 16 threads (1 per core):

```
Number of frequencies 1
Number of shots 1
Number of gradient iterations 1
Number of test iterations 1
Output directory path: results
FWI Program finished in 120.587904 seconds
```

In this initial step, you have to express the parallelism available around the most expensive loops in the application. We assume you already have some OpenACC knowledge.
- `git checkout gtc2018-step1`
- Add `#pragma acc kernels` pragmas to the `compute_component_vcell_TL:166` and `compute_component_scell_TR:624` functions (in the `src/fwi_propagator.c` file)
- Compile `fwi`:

```
mkdir build && cd build
cmake -DCMAKE_C_COMPILER=pgcc -DUSE_OPENACC=ON ..
```

- Inspect the compiler output. Try to identify whether the kernels were successfully generated or whether you see messages like:

```
Loop carried dependence due to exposed use of [array-name] prevents parallelization.
```

- Fix the previous issue by adding `#pragma acc loop independent` to each loop nest
- Compile and run `fwi`:

```
$ make irun
```
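As a reference for the kind of change this step asks for, here is a minimal sketch of a triple loop nest annotated with `kernels` plus `loop independent`. This is not FWI's actual code: the array layout and the pointwise update are simplified stand-ins. Compiled without OpenACC, the pragmas are ignored and the loop runs serially on the host.

```c
#include <stddef.h>

/* Simplified stand-in for a vcell-style update: each output element
 * depends only on its own index, so every loop level is safe to mark
 * `independent`. Without the hint, the compiler may refuse to
 * parallelize because it cannot prove the pointers don't alias. */
void vcell_update(float *v, const float *s, int dimz, int dimx, int dimy)
{
    #pragma acc kernels
    #pragma acc loop independent
    for (int y = 0; y < dimy; y++)
    {
        #pragma acc loop independent
        for (int x = 0; x < dimx; x++)
        {
            #pragma acc loop independent
            for (int z = 0; z < dimz; z++)
            {
                const size_t i = (size_t)z + (size_t)x * dimz
                               + (size_t)y * dimz * dimx;
                v[i] += 0.5f * s[i];  /* pointwise: no loop-carried dependence */
            }
        }
    }
}
```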
In Step 2 we will profile the application to find possible weaknesses and optimization opportunities.
We could use the NVIDIA Visual Profiler for a graphical assessment, or pgprof/nvprof for a command-line view.
For simplicity, in this lab we are going to use nvprof.
- `git checkout gtc2018-step2`
- Profile `fwi` with `nvprof --dependency-analysis bin/fwi fwi_schedule.txt` or NVVP
- Observe:
  - Which kernels take the most time in the critical path?
  - Is memory traffic due to CUDA Unified Memory an issue?
Unified Memory (UM) can be very inefficient on older GPU generations, and an experienced programmer with detailed knowledge of the application's algorithm can outperform it with explicit data movement.
As we saw in Step 2, FWI doesn't particularly suffer from excessive UM traffic.
Nevertheless, as an example, in this step we'll migrate from UM to explicit data movement using OpenACC directives.
- `git checkout gtc2018-step3`
- Annotate, using OpenACC directives, which arrays will be used on the GPU. In the `alloc_memory_shot` function (`src/fwi_kernel.c`), after all allocations, complete the OpenACC section:

```
coeff_t cc = *c;
#pragma acc enter data create(cc)
#pragma acc enter data create(cc.c11[:cellsInVolume])
#pragma acc enter data create(cc.c12[:cellsInVolume])
#pragma acc enter data create(/* COMPLETEME */)
#pragma acc enter data create(/* COMPLETEME */)
...
// continue with c13,c14,c15,c16, c22,c23,c24,c25,c26, c33,c34,c35,c36, c44,c45,c46, c55,c56, c66
```

- In the `free_memory_shot` function, before all deallocations (`free`), we first have to deallocate the GPU memory:

```
#pragma acc wait
#pragma acc exit data delete(c->c11)
#pragma acc exit data delete(c->c12)
#pragma acc exit data delete(/* COMPLETEME */)
#pragma acc exit data delete(/* COMPLETEME */)
...
#pragma acc exit data delete(c)
...
```
- In the `compute_component_vcell_TL:166` and `compute_component_scell_TR:625` functions, specify that all the GPU arrays are already on the GPU with the `present` clause
- Compile and run `fwi`:

```
$ make irun
```
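The whole data-management pattern of this step can be sketched in one self-contained example. The struct below is a hypothetical two-coefficient reduction of FWI's `coeff_t` (the real one carries 21 arrays): `enter data create` after allocation, `present` on the kernel, and `wait` plus `exit data delete` before freeing. Compiled without OpenACC, the pragmas are ignored and everything runs on the host.

```c
#include <stdlib.h>

/* Hypothetical reduced coeff_t; the real struct has c11..c66. */
typedef struct { float *c11; float *c12; } coeff_t;

void alloc_coeffs(coeff_t *c, int n)
{
    c->c11 = (float *) calloc(n, sizeof(float));
    c->c12 = (float *) calloc(n, sizeof(float));
    coeff_t cc = *c;                     /* local copy, as in alloc_memory_shot */
    #pragma acc enter data create(cc)
    #pragma acc enter data create(cc.c11[:n])
    #pragma acc enter data create(cc.c12[:n])
}

void free_coeffs(coeff_t *c)
{
    #pragma acc wait                     /* make sure no kernel still uses them */
    #pragma acc exit data delete(c->c11)
    #pragma acc exit data delete(c->c12)
    #pragma acc exit data delete(c)
    free(c->c11);
    free(c->c12);
}

/* Kernels now assert residency with `present` instead of relying on UM. */
void scale_c11(coeff_t *c, int n, float factor)
{
    float *c11 = c->c11;
    #pragma acc kernels present(c11[:n])
    #pragma acc loop independent
    for (int i = 0; i < n; i++)
        c11[i] *= factor;
}
```

With `present`, the runtime aborts with an error if the array was never mapped, which catches missing `enter data` clauses early instead of silently falling back to UM.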
OpenACC follows the same semantics as CUDA regarding streams. By default, kernels and memory copies are executed in the default stream which imposes serialization between kernels and memory transfers.
Since we know that the scell functions don't have dependences between each other (and likewise for the vcell functions), we can launch all the scell/vcell kernels on different streams.
OpenACC provides the `async` clause, which takes a nonnegative scalar integer as a parameter.
In `include/fwi/fwi_propagator.h:78` we have already prepared an enum to use as stream identifiers:
```
typedef enum {TR, TL, BR, BL} phase_t;
```

- `git checkout gtc2018-step4`
- Add the `async` clause to all OpenACC kernels (`src/fwi_propagator.c`). All scell & vcell functions provide a parameter named `phase`, which can be passed to the OpenACC clause: `async(phase)`
- In `velocity_propagator` and `stress_propagator`, pass the stream identifier from the `phase_t` enum (`TR`, `TL`, `BR` or `BL`) as the last parameter.
- Compile and run `fwi`:

```
$ make irun
```
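A minimal sketch of this pattern, using a toy axpy on four independent quarter-domains rather than FWI's kernels: the phase value doubles as the async queue id, and a single `wait` joins all queues. Without OpenACC the queues collapse to serial host execution, so the result is the same either way.

```c
/* Mirrors the lab's phase_t; each value doubles as an async queue id. */
typedef enum { TR = 0, TL, BR, BL } phase_t;

/* One independent update per phase, launched on its own async queue.
 * Kernels on different queues may overlap on the GPU. */
void axpy_phase(float *y, const float *x, float a, int n, phase_t phase)
{
    #pragma acc kernels async(phase)
    #pragma acc loop independent
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

void run_all_phases(float *y, const float *x, float a, int n)
{
    const int q = n / 4;   /* assume n divisible by 4 for this sketch */
    axpy_phase(y + 0 * q, x + 0 * q, a, q, TR);
    axpy_phase(y + 1 * q, x + 1 * q, a, q, TL);
    axpy_phase(y + 2 * q, x + 2 * q, a, q, BR);
    axpy_phase(y + 3 * q, x + 3 * q, a, q, BL);
    #pragma acc wait       /* join all queues before the host reads y */
}
```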
FWI is a great candidate to take advantage of shuffle intrinsics and shared memory.
OpenACC provides the directive cache to exploit shared memory, but it lacks a way to exploit CUDA intra-warp intrinsics.
In this step we provide a set of highly optimized kernels in src/fwi_propagator.cu file.
Your task consists of adding the necessary glue code to obtain:
- The GPU device pointers managed by the OpenACC runtime
- The CUDA stream allocated by OpenACC
- `git checkout gtc2018-step5`
- Add `#pragma acc host_data use_device` directives to forward the device pointers allocated by OpenACC to our CUDA kernels
- Pass the current stream to the CUDA kernel (with `acc_get_cuda_stream`)
- Compile and run `fwi`:

```
$ make irun
```
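The glue code can be sketched as below. The wrapper name `scell_TR_cuda` is made up for illustration (the real shuffle/shared-memory kernels live in `src/fwi_propagator.cu`), and here the "CUDA kernel" is a plain host loop so the sketch stays self-contained; only the `host_data` and `acc_get_cuda_stream` wiring is the point.

```c
#include <stddef.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Host stand-in for a CUDA kernel wrapper (hypothetical name). The real
 * wrapper would launch a __global__ kernel on the given stream. */
void scell_TR_cuda(float *v, int n, void *stream)
{
    (void) stream;
    for (int i = 0; i < n; i++)
        v[i] *= 2.0f;
}

/* Step-5 glue: host_data hands the *device* addresses managed by the
 * OpenACC runtime to the CUDA side, and acc_get_cuda_stream maps the
 * async queue id (our phase) to the underlying cudaStream_t. */
void launch_scell_TR(float *v, int n, int phase)
{
#ifdef _OPENACC
    void *stream = acc_get_cuda_stream(phase);
    #pragma acc host_data use_device(v)
    scell_TR_cuda(v, n, stream);
#else
    scell_TR_cuda(v, n, NULL);          /* host fallback for this sketch */
#endif
}
```

Passing the stream obtained from `acc_get_cuda_stream(phase)` keeps the hand-written CUDA kernels on the same queues as the OpenACC kernels from step 4, so the per-phase overlap is preserved across the interop boundary.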

