This project implements and analyzes a parallel bottom-up mergesort using OpenMP.
Goals:

- Implement a sequential baseline for mergesort.
- Implement a parallel iterative mergesort that uses the `OmpLoop` helper.
- Measure speedup for different input sizes and thread counts.
- Visualize results using the plots in `mergesort/plots/`.
The parallel implementation uses:

- Iterative bottom-up mergesort (run widths 1, 2, 4, 8, ...).
- Parallelism across independent merges when there are many merges in a pass.
- Parallelism inside a single merge, using a merge-path style partitioning, when there are only a few large merges.
- Coarsening of loop iterations to reduce OpenMP scheduling overhead.
Requirements:

- A C and C++ compiler with OpenMP support (for example `gcc` and `g++`)
- `make` and `bash`
- For plotting: either `gnuplot`, or Python 3 with `matplotlib` and `pandas`
Common benchmark parameters are defined in `params.sh`:

```sh
THREADS="1 4 16 64"
MERGESORT_NS="10000 1000000 100000000 1000000000"
```

These control the problem sizes and thread counts used by the benchmark scripts.
From the project root:

```sh
make libgen.a
```

Sub-makefiles will auto-build `libgen.a` if it is missing.
```sh
cd sequential
make          # builds mergesort_seq
make bench    # or: bash ./queue.sh
```

This writes one timing file per problem size:

```
sequential/result/mergesort_seq_<N>
```
Each file contains the runtime in seconds.
Example from this project:

```
mergesort_seq_1000000    -> 0.0937723
mergesort_seq_100000000  -> 11.9921
mergesort_seq_1000000000 -> 161.663
```
```sh
cd ../mergesort
make          # builds mergesort with OpenMP
make test     # functional and timing format tests
make bench    # or: bash ./queue.sh
```

This produces files of the form:

```
mergesort/result/mergesort_<N>_<T>
```
Example:

```
mergesort_100000000_1  -> 11.2595
mergesort_100000000_4  -> 5.51965
mergesort_100000000_16 -> 4.38349
mergesort_100000000_64 -> 4.20281
```
You must run the sequential and parallel benchmarks first.
```sh
cd mergesort
make plot     # invokes plot.sh
```

This creates PDF plots:

- `plots/mergesort_speedup_n.pdf`
- `plots/mergesort_speedup_thread.pdf`

If `gnuplot` is not available, `make plot` falls back to:

```sh
python3 plot_py.py
```

This produces:

- `plots/mergesort_speedup_n.pdf` and `.png`
- `plots/mergesort_speedup_thread.pdf` and `.png`
- `plots/mergesort_speedup_table.csv`
For the full plot set and a combined submission PDF, run:

```sh
cd mergesort
python3 plots.py
```

This writes:

- `plots/mergesort_speedup_n.{pdf,png}`
- `plots/mergesort_speedup_thread.{pdf,png}`
- `plots/mergesort_time_vs_threads.{pdf,png}`
- `plots/mergesort_efficiency_vs_threads.{pdf,png}`
- `submission_mergesort.pdf`
Once you have run plot_py.py or plots.py, the following images will render directly in this README.
File: mergesort/plots/mergesort_speedup_n.png
This plot shows, for each fixed N, the speedup

\[ \text{speedup}(N, T) = \frac{T_{\text{seq}}(N)}{T_{\text{par}}(N, T)} \]

as a function of the thread count T. For example, for N = 100,000,000 and T = 16, the timings above give a speedup of 11.9921 / 4.38349 ≈ 2.74.
File: mergesort/plots/mergesort_speedup_thread.png
This plot shows, for each fixed thread count T, how the speedup changes as N grows. The x axis is logarithmic.
File: mergesort/plots/mergesort_time_vs_threads.png
This plot shows the absolute parallel runtime as a function of thread count for each N.
File: mergesort/plots/mergesort_efficiency_vs_threads.png
Parallel efficiency is defined as:

\[ \text{efficiency}(N, T) = \frac{\text{speedup}(N, T)}{T} \]

This plot shows how effectively additional threads are used for different N. For example, a speedup of 2.74 on 16 threads corresponds to an efficiency of 2.74 / 16 ≈ 0.17.
The parallel mergesort in `mergesort.cpp` is iterative and uses two buffers, `src` and `dst`.

- **Bottom-up passes.**
  For `width` in 1, 2, 4, 8, ...:
  - Process blocks of size `2 * width` in `src`.
  - Treat the left and right halves of each block as sorted runs.
  - Merge them into `dst`.
  - Swap `src` and `dst` after each pass.
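The pass structure above can be sketched as a plain serial bottom-up mergesort. The function and buffer names here are illustrative (the project's actual code adds the parallel dispatch discussed next), and `int` data is assumed:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Serial sketch of the bottom-up pass structure: for each width, merge
// adjacent sorted runs from src into dst, then swap the buffers.
void bottom_up_mergesort(std::vector<int>& a) {
    const int n = static_cast<int>(a.size());
    std::vector<int> scratch(n);
    int* src = a.data();
    int* dst = scratch.data();
    for (int width = 1; width < n; width *= 2) {
        // Each block of size 2 * width in src holds two sorted runs.
        for (int lo = 0; lo < n; lo += 2 * width) {
            const int mid = std::min(lo + width, n);
            const int hi  = std::min(lo + 2 * width, n);
            std::merge(src + lo, src + mid, src + mid, src + hi, dst + lo);
        }
        std::swap(src, dst);  // merged output becomes the next pass's input
    }
    if (src != a.data())      // result may have ended up in the scratch buffer
        std::copy(src, src + n, a.data());
}
```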
- **Early passes: serial.**
  When `width` is smaller than `SERIAL_PASS_THRESHOLD`, merging is done serially. This avoids OpenMP overhead when the runs are tiny.
- **Middle passes: parallel across merges.**
  For passes with many independent merges, the implementation parallelizes across merges:
  - Compute `merges_this_pass`.
  - Choose a `group` size to coarsen several merges into one iteration.
  - Use `OmpLoop::parfor` to process the groups in parallel.
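A pass of this kind might look like the following sketch, with a plain OpenMP pragma standing in for the project's `OmpLoop` wrapper; `parallel_merge_pass` and its parameters are hypothetical names, not the project's actual code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One "parallel across merges" pass: each loop iteration handles `group`
// consecutive merges, coarsening the work given to the OpenMP runtime.
void parallel_merge_pass(const std::vector<int>& src, std::vector<int>& dst,
                         int width, int group) {
    const int n = static_cast<int>(src.size());
    const int merges_this_pass = (n + 2 * width - 1) / (2 * width);
    const int groups = (merges_this_pass + group - 1) / group;
    #pragma omp parallel for schedule(dynamic)
    for (int g = 0; g < groups; ++g) {
        const int first = g * group;
        const int last = std::min(first + group, merges_this_pass);
        for (int m = first; m < last; ++m) {
            // Merge the two sorted runs of the m-th block into dst.
            const int lo  = m * 2 * width;
            const int mid = std::min(lo + width, n);
            const int hi  = std::min(lo + 2 * width, n);
            std::merge(src.begin() + lo, src.begin() + mid,
                       src.begin() + mid, src.begin() + hi,
                       dst.begin() + lo);
        }
    }
}
```

Each merge writes a disjoint slice of `dst`, so the iterations are independent and safe to run in parallel.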
- **Late passes: parallel inside a merge.**
  When there are only a few large merges, the code parallelizes within each merge using `merge_runs_parallel` and `mergepath_partition`:
  - View the merge of arrays `A` and `B` as an output range of length `lenOut`.
  - For each output index `K`, find `(ia, ib)` such that `ia + ib = K` and the merge-path prefix condition holds.
  - Split the output range into tiles. Each tile computes its local `(ia, ib)` boundaries and merges its portion independently.
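The `(ia, ib)` search can be done with a binary search along the merge path. The sketch below uses a hypothetical `mergepath_split` as a stand-in for the project's `mergepath_partition`; it finds a split where the first `ia` elements of `A` plus the first `ib` of `B` form the first `k` elements of the merged output:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Binary search for the merge-path split at output index k:
// returns (ia, ib) with ia + ib == k such that every element of
// A[0..ia) and B[0..ib) belongs in the first k merged outputs.
std::pair<int, int> mergepath_split(const std::vector<int>& A,
                                    const std::vector<int>& B, int k) {
    int lo = std::max(0, k - static_cast<int>(B.size()));
    int hi = std::min(k, static_cast<int>(A.size()));
    while (lo < hi) {
        const int ia = lo + (hi - lo) / 2;
        const int ib = k - ia;
        if (A[ia] < B[ib - 1])  // A[ia] still belongs in the prefix
            lo = ia + 1;
        else
            hi = ia;
    }
    return {lo, k - lo};
}
```

Each tile then calls this at its start and end output indices and merges its `A`/`B` sub-ranges independently.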
- **OmpLoop abstraction.**
  `omploop.hpp` wraps the OpenMP parallel for in a simple interface:

  ```cpp
  OmpLoop loop;
  loop.setNbThread(nbthreads);
  loop.setGranularity(1);
  loop.parfor(beg, end, increment, [&](int i) {
      // work at i
  });
  ```
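For illustration, a minimal wrapper with this interface could look as follows. This is only a guess at the spirit of `omploop.hpp` based on the usage shown; the real header almost certainly differs:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical minimal OmpLoop-style wrapper around an OpenMP parallel for.
class OmpLoop {
    int nbthreads = 1;
    int granularity = 1;
public:
    void setNbThread(int n) { nbthreads = n; }
    void setGranularity(int g) { granularity = g; }

    // Runs body(i) for i = beg, beg + increment, ... while i < end.
    // Each scheduled chunk covers `granularity` logical iterations,
    // which is the coarsening that reduces scheduling overhead.
    template <typename F>
    void parfor(int beg, int end, int increment, F body) {
        const int step = increment * granularity;
        #pragma omp parallel for num_threads(nbthreads) schedule(dynamic)
        for (int i = beg; i < end; i += step) {
            const int stop = std::min(i + step, end);
            for (int j = i; j < stop; j += increment)
                body(j);
        }
    }
};
```

Without `-fopenmp` the pragma is ignored and the loop simply runs serially, which keeps the interface usable in both builds.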
The numeric data behind the plots is in:

```
mergesort/plots/mergesort_speedup_table.csv
```

Columns:

- `N` (problem size)
- `threads`
- `seq_time`
- `par_time`
- `speedup = seq_time / par_time`
- `efficiency = speedup / threads`
From the provided data, the best observed speedup per N is:
| N | Threads with best speedup | Best speedup (approximate) |
|---|---|---|
| 10,000 | 4 | 1.42 x |
| 1,000,000 | 16 | 2.05 x |
| 100,000,000 | 64 | 2.85 x |
| 1,000,000,000 | 64 | 3.96 x |
- For very small `N`, overhead dominates and speedups are modest, sometimes worse than sequential at high thread counts.
- For moderate `N`, a speedup around 2x is achievable with 4 to 16 threads.
- For large and very large `N`, speedups approach about 3 to 4x with 64 threads.

Scaling is sub-linear because mergesort is memory-bandwidth bound: each pass streams two input arrays and one output array, and DRAM bandwidth becomes the main bottleneck.

Parallel efficiency shows that small thread counts are used most efficiently. At very high thread counts, efficiency is low because the algorithm cannot make full use of all cores when limited by memory bandwidth and non-parallelizable work.



