This small project contains an implementation of the Gauss-Seidel linear equation solver, parallelized on a GPU using NVIDIA CUDA. The Gauss-Seidel method is an iterative method for solving systems of linear equations: the variable values are updated repeatedly until the change between iterations falls below a given threshold.
The goal of this project was to achieve a faster resolution time than the sequential version of the code, which is also provided in the repository.
In order to parallelize the calculations, the data dependency constraints between cells need to be dropped. Relaxing these constraints only means that the solver takes more iterations to reach the predefined threshold. This is acceptable: the speed-up achieved by parallelizing the calculations far outweighs the cost of the extra iterations.
1. Decomposition: as the CUDA architecture favours fine-grained parallelization, the computational unit is defined as all the iterations a CUDA thread needs to perform in order to bring its cell's value delta below the threshold.
2. Assignation: the assignment is straightforward: each matrix cell is paired with a CUDA thread. However, since CUDA threads are grouped in blocks of configurable size, how many threads per block should we define? (This determines the required number of blocks for a given matrix size.)
The formula to apply is the following one:

```
int blocks_num = ceil((n-2) * (n-2) / (float) threads_per_block);
```

(Note the cast: without it, the integer division truncates before `ceil` is applied.)

3. Orchestration: first of all, the initial matrix needs to be copied from the host (CPU) to the device (GPU) so that the computations can begin.
The orchestration between the different CUDA threads is minimal: each thread is given a cell to transform, and no interaction between threads is needed. There is one special case to consider, though: since threads are launched in whole blocks, some threads of the last block may be left with no cell to compute.
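A kernel sketch of that special case (hypothetical names, not the repository's code): each thread maps its linear index to a cell, and a surplus thread in the last block returns before touching memory.

```cuda
// Each thread handles one interior cell; threads past the last cell exit.
__global__ void update_cells(const float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int cells = (n - 2) * (n - 2);        // interior cells only
    if (idx >= cells) return;             // last-block surplus: no cell to compute
    int row = idx / (n - 2) + 1;          // map linear index to (row, col),
    int col = idx % (n - 2) + 1;          // skipping the boundary
    out[row * n + col] = 0.25f * (in[(row - 1) * n + col] +
                                  in[(row + 1) * n + col] +
                                  in[row * n + col - 1] +
                                  in[row * n + col + 1]);
}
```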
Finally, the computed results are copied from device memory (GPU) to host memory (CPU).
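The whole copy-compute-copy cycle can be sketched on the host side as follows (hypothetical names, error checks omitted for brevity; the kernel is assumed to be defined elsewhere):

```cuda
#include <cuda_runtime.h>

__global__ void update_cells(const float *in, float *out, int n);  // assumed kernel

void solve_on_gpu(float *host_grid, int n, int threads_per_block, int sweeps) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, host_grid, bytes, cudaMemcpyHostToDevice);   // host -> device
    cudaMemcpy(d_out, host_grid, bytes, cudaMemcpyHostToDevice);  // keep boundaries
    int cells = (n - 2) * (n - 2);
    int blocks = (cells + threads_per_block - 1) / threads_per_block;
    for (int i = 0; i < sweeps; ++i) {
        update_cells<<<blocks, threads_per_block>>>(d_in, d_out, n);
        float *tmp = d_in; d_in = d_out; d_out = tmp;             // double-buffer swap
    }
    cudaMemcpy(host_grid, d_in, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_in);
    cudaFree(d_out);
}
```

Double-buffering (swapping input and output between sweeps) keeps each sweep reading only the previous sweep's values, which is what the relaxed dependency model requires.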
4. Mapping: all CUDA threads are executed in the same device (GPU). No multi-GPU support is provided.
The repository contains:
- Sequential code version (gc_seq.cu).
- CUDA code version (gs_cuda.cu).
- Bash script: for executing with multiple configurations (exec_script.sh).
- Makefile: used to compile and clean the project.
First of all, the NVCC compiler needs to be installed. Then, compile with:

```
$ make
```

Once the code is compiled, the binary receives 2 arguments:
- Matrix size: usually a power of 2 (64, 128, 256, 1024...).
- Threads per block: ideally a multiple of 32 (32, 64, 128...).
The execution command will be as follows:
```
$ ./gs_cuda <matrix_size> <threads_per_block>
```

Example:

```
$ ./gs_cuda 1024 64
```

As the whole goal of this project was to speed up the resolution of the linear equation system, it is important to state the obtained acceleration. The speed-up values have been computed using Amdahl's law:
This project was developed by:
