The post N-Queens project from over 10 years ago appeared first on StreamHPC.
We love it when junior applicants have a personal project to show, even if it’s unfinished. As it can be scary to share such an unfinished project, I’ll go first.
Everybody who starts in GPGPU has this moment where they feel great about the progress and speedup, but then suddenly get totally overwhelmed by the endless paths to more optimization. And of course 90% of the potential optimizations don’t work well – it takes many years of experience (and mentors in the team) to excel at it. This was also a main reason why I like GPGPU so much: it remains difficult for a long time, and it never bores. My personal project where I had this overwhelmed+underwhelmed feeling was N-Queens – until then I could solve the problems in front of me.
I worked on this backtracking problem as a personal fun project in the early days of the company (2011?), and decided to blog about it in 2016. But before publishing I thought the story was not ready to be shared, as I had changed the way I coded, learned so many more optimization techniques, and (like many programmers) thought the code needed a full rewrite. Meanwhile I had to focus much more on building the company, and my colleagues got better at GPGPU coding than me – this didn’t change in the years after, and I’m now the dumbest coder in the room.
Today I decided to just share what I wrote down in 2011 and 2016, and for now focus on fixing the text and links. As the code was written in Aparapi and not pure OpenCL, it would take some good effort to make it available – I decided not to do that, to prevent postponing it even further. Luckily somebody on this same planet had about the same approaches as I had (plus more), and actually finished the implementation – scroll down to the end, if you don’t care about approaches and just want the code.
Note that when I worked on the problem, I used an AMD Radeon GPU and OpenCL. Tools from AMD were hardly there, so you might find a remark that did not age well.
What do 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200, 73712, 365596, 2279184, 14772512, 95815104, 666090624, 4968057848, 39029188884, 314666222712, 2691008701644, 24233937684440, 227514171973736, 2207893435808352 and 22317699616364044 have to do with each other? They are the solution counts of the N-Queens problem for board sizes 1 through 26. Even if you are not interested in GPGPU (OpenCL, CUDA), this article should give you some background on this interesting puzzle.
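To make the puzzle concrete, here is a plain depth-first backtracker (an illustrative Python sketch, names are mine) that reproduces the first entries of the sequence above:

```python
def count_solutions(n):
    """Count N-Queens solutions by plain backtracking: place one queen
    per row, and on each new row test the candidate column against all
    queens already placed (same column or same diagonal = attack)."""
    def place(row, queens):
        if row == n:                      # all rows filled: one solution
            return 1
        total = 0
        for col in range(n):
            if all(col != c and abs(col - c) != row - r
                   for r, c in enumerate(queens)):
                total += place(row + 1, queens + [col])
        return total
    return place(0, [])

print([count_solutions(n) for n in range(1, 9)])  # [1, 0, 0, 2, 10, 4, 40, 92]
```

This is of course far too slow for N=17 and up, which is exactly why the optimizations below matter.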
An existing N-Queens implementation in OpenCL solved N=17 in 2.89 seconds on my GPU, while Nvidia hardware took about half that. I knew it did not use the full potential of the GPU, because background bitcoin-mining only dropped to 55% instead of 0%.
I only had to find those optimizations by redoing the port from another angle.
This article was written while I programmed (as a journal), so you see which questions I asked myself on the way to the solution. I hope this also gives some insight into how I work, and into the hard part of the job: most of the energy goes into preparations that yield no direct result.
The first approach tries to create a parallel version of an existing implementation; the second focuses on methods better suited to GPUs. The kernels are not fully optimized, so I hope you can come up with a faster version – I give some hints where further optimization could help.
The problem is described on Wikipedia as: placing N chess queens on an N×N chessboard so that no two queens attack each other. It is an exhaustive search problem: walking a tree and testing whether each candidate holds as a solution. The “nice” thing is that the number of solutions increases irregularly with increasing N (see “Counting solutions” on the Wikipedia page). There are many possible optimizations, all very clever, but harder to port to the GPU.
The forums spoke about multiple fast implementations, of which I researched three.
The first one was software by Jeff Somers, who described his software as: “My program is a heavily-optimized C program, yet it still takes over a week for an 800 MHz PC to calculate the number of solutions for a 21 x 21 board”. In other words: perfect for a 1050 MHz GPU to do it in less time.
I found other solutions by Kenjiro Taura, but his webpage is offline now. He was very focused on exploring different parallel/distributed methods, and his code could make use of all cores of an 8-core i7. It was somewhat tricky to get going, but – in case you find his code – the following worked:
gcc nq.c -lpthread
OMP_NUM_THREADS=8 ./a.out p 17 2
Results came in after 28.17 seconds, so only a small speed-up compared to Somers’ algorithm on a single core. Another one advertised as fast was made by Mr. Takaken, but I could not compile it with GCC at all – so I left that one.
But if you look around, there are many implementations to learn from:
Note that the above list is from 2016! Also, Dr. Dobb’s does not even let you go to different pages – you need to edit the URL to get to another page.
I based my code on the one from Somers, as it fully focuses on the algorithm, and that is always the first problem to solve. As a bonus, I could focus on my learning goal: porting a difficult puzzle to the GPU. There were more reasons, but only if you try yourself will you find them.
What Mr. Somers discovered already: the more algorithmic steps can be skipped, the less work needs to be done, and the faster we can make it.
This page describes in short some differences and shows where a lot of speed-up can be found. Checking can also be optimized: when going deeper into the search tree, the previous checks don’t need to be done again (step-wise forward-check).

The first row has N possibilities: a queen on each position. For example N=4 gives 1000, 0100, 0010 and 0001. If it were a rook problem and we did not remember the previous row, then row 2 would have N-1 possibilities, etc., so the whole board would have N*(N-1)*(N-1)*… = N*(N-1)^(N-1) possibilities – plot it logarithmically to see how fast it grows. Of course, we can easily remember the previous rows to find the N! solutions, but we cannot do the same for the diagonals that queens require. We come back to the rook search later when discussing the second approach, as it can be very usable.

For queens, the first row has N possibilities to put a queen on; the second row has N-2 (when the first queen is at the border) or N-3 (otherwise). The third row has between N-6 (only when N>8) and N-4 possibilities; there are N-3 possibilities when the first two queens shadow positions – cover a position more than once. Still, it is not easy to find reusable search paths: finding the shadowed positions is as hard as checking validity, as far as I have found. This shadowing makes it a harder problem, and solutions to it can be used for other search problems too.

As the results can be mirrored, we only need to check half of them and double the count – with an exception for the middle column of odd-sized boards. You see that Somers coded two runs: one for one half of the board and one for the middle column (when the board size is odd) – in my port I combined them. When doing the middle column, he restricts the search from the second row on, so he can double the results anyway and speed up the search.
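The mirror trick, combined with Somers’ bitmask representation of columns and diagonals, can be sketched in a few lines. This is not Somers’ actual C code – it is an illustrative Python sketch of the same ideas, with names of my own choosing:

```python
def _complete(cols, ld, rd, all_ones):
    """Count ways to finish the board. cols/ld/rd are bitmasks of the
    columns and the left/right diagonals attacked on the current row."""
    if cols == all_ones:              # a queen in every column: board full
        return 1
    count = 0
    free = all_ones & ~(cols | ld | rd)
    while free:
        bit = free & -free            # lowest free square in this row
        free -= bit
        # shift the diagonal masks when moving down one row
        count += _complete(cols | bit,
                           ((ld | bit) << 1) & all_ones,
                           (rd | bit) >> 1,
                           all_ones)
    return count

def count_queens(n):
    """Somers-style symmetry: place the first queen only in the left half
    of the first row and double the count; for odd N the middle column
    is counted once separately."""
    all_ones = (1 << n) - 1
    first = lambda c: _complete(1 << c, ((1 << c) << 1) & all_ones,
                                (1 << c) >> 1, all_ones)
    total = 2 * sum(first(c) for c in range(n // 2))
    if n % 2:                         # middle column of an odd-sized board
        total += first(n // 2)
    return total
```

The bitmask form is also what makes a GPU port attractive: the per-row state is three integers instead of a board.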
If you read the various optimization guides by AMD and NVIDIA, you notice the focus of GPU optimization is on memory transfers – and this is the tricky part of N-Queens. Making a parallel version of the code was not the real problem; working around the limitations of GPUs really was. The first approach focuses on a parallel version of Somers’ code; the second takes the GPU limitations much more into account.
Because of dependencies on the previous run it is actually impossible to completely unroll, so we generate the possible starting points for the first x rows and then start from there. We need a series, so we can call a kernel by id. The problem is that it is a depth-first algorithm, and we need breadth-first – at least for the first rows. I therefore made a variation on the code that combines the odd and even part. This version fails immediately, but optimizing it is for later – I hope that by pumping up the number of threads the scheduler of the GPU helps me out. I made a different version that runs the separate threads (in serial) by only giving a row where to start. Unrolling and clustering would increase the speed, but we first try without.
The least I want is for it to be faster than 2.89 seconds. The OpenCL version gave results under 9 seconds on an 8-core Intel i7, so less than 4 times slower than a GPU using the other code base – that sounds hopeful. The threaded CPU version (Java) did around 30 seconds, so we do have some performance issues here. But… running OpenCL on the GPU also gave 30 seconds, which is not good at all!
I could tune the software by eliminating the first breadth-first search through better initialization, or by looking into using vectors to exploit AVX. But since that would not improve the GPU times much, I’ll leave these optimizations out of this approach. I learned that Somers worked a lot on memory efficiency for a single-threaded solution, and I found its limitations for my cause. By going wide before deep, we effectively created a parallel version which runs very nicely on CPUs (which can handle completely asynchronous workers), but GPUs need more order and structure – time for approach 2.
Irregular code doesn’t work really well on GPU-style devices, as you can read in this extensive report about H.264. There are more approaches, but they all have in common that they are not usable when there are fewer parallel processors. We now start with three focus points:
We can use permutations of half-correct solutions: we make sure that horizontally and vertically there is only one queen per line, so we only need to check the diagonals for illegal sharing. It would be faster to permute the diagonals somehow, but I did not see a possibility. Checking the diagonals should be done in parallel too. We can use shift and sum, which we need to do twice – once per diagonal direction. To visualize this shift, put 4 queens in the middle 4×4 of a chessboard; leave the first, shift the second 1 position to the left, the third 2 positions, the fourth 3.
The number of permutations is N!, and walking them looks a bit like bubble sort visually. A starting position with two queens on one diagonal fails immediately, so later we will improve the algorithm to skip the obvious ones – same for unnecessary checks of the diagonals.
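The permutation idea can be sketched as follows (illustrative Python, my own names): a permutation already guarantees one queen per row and per column, so only the two diagonal directions remain to be checked.

```python
from itertools import permutations

def diagonals_ok(cols):
    """cols[r] is the column of the queen in row r. The queen sits on
    diagonals r+cols[r] and r-cols[r]; a valid board uses each diagonal
    id at most once, so both id sets must have full size."""
    n = len(cols)
    return (len({r + c for r, c in enumerate(cols)}) == n and
            len({r - c for r, c in enumerate(cols)}) == n)

def count_by_permutation(n):
    """Brute-force the N! permutations; only diagonals need checking."""
    return sum(diagonals_ok(p) for p in permutations(range(n)))
```

N! grows much faster than the number of valid boards, which is why skipping obviously-bad permutations matters so much.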
The simple method needs a double loop: i (0 <= i < N) and, since we then of course need to keep track of some history, a way to store it. Before going on with how to divide the problem: AMD prefers workgroup sizes which are a multiple of 64 (per dimension) – called a wavefront. NVIDIA works well with multiples of 32 (a warp), so it has an advantage when the workgroups are smaller. A lot of the time this explains the difference between NVIDIA and AMD when an algorithm was first tested on NVIDIA. NVIDIA has an advantage here, as we cannot easily fill a 64-wide wavefront. How do we make sure the whole wavefront/warp is filled? It is clear we need loops, as we cannot launch N! kernels: for larger N that number is not even representable in a uint64. You may try.
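Filling wavefronts boils down to rounding the launch size up. A minimal sketch (an illustrative helper of my own, not from the original code):

```python
def padded_global_size(work_items, wavefront=64):
    """Round the number of work-items up to a multiple of the wavefront
    (64 on AMD) or warp (32 on NVIDIA), so the hardware always runs
    whole wavefronts; the surplus threads simply exit early in the
    kernel after an id check."""
    return ((work_items + wavefront - 1) // wavefront) * wavefront

print(padded_global_size(100))      # 128 on AMD
print(padded_global_size(100, 32))  # 128 on NVIDIA
```

The early-exit threads are pure overhead, which is why the smaller warp gives NVIDIA an edge for irregular problems like this one.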
We have the same problem as in the previous approach: we can only launch from a certain row. Two rows have too few starting positions, so we start with three or four, as we did in approach 1. At http://wordaligned.org/articles/next-permutation a lot is explained about permutations. We have one problem: many of the permutations are invalid from the start.
Writing down which rows can be next to each other, it becomes very clear that there are no solutions for N = 2 or 3. N = 4 gives two solutions: 2,4,1,3 and 3,1,4,2. For 5 and 6 we need to check the diagonals for rows further apart, but it is clear you need to find whole, unique paths where all nodes are visited only once to find possible solutions. If we put it in a truth table, you get a wide diagonal of 2 or 3 with false and the rest true (symmetrical). This problem is best known as a relative of the travelling salesman problem: the Hamiltonian path. We don’t care about distances (set to 1) or directions, but we do very much care about getting all the separate solutions, as we need to test them for correctness.

Creating such a graph is easy, so let us focus on walking it. If all these paths can be found in fewer operations than using permutations AND it can be done in parallel, we have taken a big step forward. Somers’ solution walks paths very orderly and checks the diagonals efficiently, and the permutations only needed to update+check the diagonals of the two switched rows, so we need to see if check-while-walking-the-tree could also be possible. If the board size is bigger than 6, you see an increase in the number of solutions. The tree (figure) shows that when starting in the edge, you see only one escape; the rest has only one or two solutions. This tells something about the influence inside and outside 6×6; for us larger board sizes are important, and the complexity of such boards makes it hard to find theoretical shortcuts.
We keep track of two arrays – one per diagonal direction – holding the id of the diagonal each queen is on (there are 2*N-1 diagonals per direction). For the two switched rows the ids shift by +n and -n, after which they are compared to the other rows. Starting from a position that is valid, only the switched rows need to be checked.
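The swap check can be sketched like this (hypothetical helper names; per-row lists stand in for the diagonal-id arrays):

```python
def swap_is_valid(cols, diag1, diag2, i, j):
    """cols is a permutation (one queen per row and column);
    diag1[r] = r + cols[r] and diag2[r] = r - cols[r] are the
    diagonal-id arrays. Swapping rows i and j only changes those two
    entries, so instead of re-checking the whole board we only test
    whether the two updated rows still use unique diagonal ids."""
    cols[i], cols[j] = cols[j], cols[i]
    for r in (i, j):                     # recompute ids of the swapped rows
        diag1[r] = r + cols[r]
        diag2[r] = r - cols[r]
    return all(diag1.count(diag1[r]) == 1 and diag2.count(diag2[r]) == 1
               for r in (i, j))
```

Starting from the valid N=4 board cols = [1, 3, 0, 2], swapping rows 0 and 1 puts two queens on one diagonal, and this check catches it by looking at just those two rows.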
So, that was a lot of theory! Where’s the action? Where’s the code? Unfortunately I have no code that has all the optimizations implemented or is in shareable condition. I do hope that the above gives you some insight into the puzzle, to speed up existing implementations.
Also unfortunate: still today you can find “new” implementations that are slower than my initial version, even though that code runs on modern hardware. But I assume you want to see what all these optimizations result in.
Luckily I found an implementation on Github by Ole and Tim Pöschl, who had the same starting points as I had and looked into (several of) the same approaches. Do check out their code – “FAF” stands for “fast and fun” of course.
The post The Fastest Payroll System Of The World appeared first on StreamHPC.
At StreamHPC we do several very different types of projects, but this project was very, very different. For one, it was nowhere close to scientific simulation or media processing. Our client, Intersoft Solutions, asked us to speed up thousands of payroll calculations on a GPU.
They wanted to solve a simple problem, avoiding slow conversations with HR of large companies:
Yes, I can answer your questions.
For that I need to do a test-run.
Please come back tomorrow.
The calculation of 1600 payslips took one hour. This means 10,000 employees would take over 6 hours. Potential customers appreciated the clear advantages of Intersoft’s solution, but said that they were looking for a faster solution in the first place.
Using our accelerated compute engine, a run with 3300 employees (anonymised, real data) now only takes 20 seconds, including loading and writing all data to the database – a speedup of about 250 times. Calculations for 100k employees finish in under 2 minutes – the above HR department would have liked that.
The code consisted of several hundred megabytes of Delphi code built up over the years, which made the project a lot less standard to do.
The interesting part: payrolling needs large flexibility to cope with continuously changing laws and rulings within a country. This is a main reason most payroll companies do not work internationally. Intersoft’s software can be configured to work in any branch or country.
Code that offers both correctness and flexibility is less easy to port to the GPU, which expects very straightforward instructions – we needed to be very creative to convert the internal scripting language to OpenCL.
After about 2 weeks the test input was processed correctly in seconds. After that the client worked with us to implement redesigns of the database loading and storing, so the OpenCL calculations could be fed data fast enough.
The code now runs correctly on AMD and NVIDIA GPUs, and (for backwards compatibility) on CPUs. We have not tested on Intel Arc. Nor did we benchmark it further, as this type of software has too many interdependencies to define a good benchmark case. Hence we keep it at mentioning only that ~250x speedup.
Most software we write is in CUDA and HIP. OpenCL is C-based and less convenient, so why did we choose this language? The reason is the built-in compiler: every time the script or configuration changes, OpenCL can be generated for it and compiled on the spot. Even just-in-time optimizations, based on the provided data, can be applied. With CUDA, HIP or SYCL this would have taken much more effort.
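The generate-then-compile idea can be illustrated with a toy sketch. This is not Intersoft’s actual generator – the rule format, names and the 37% tax rate are all made up – but it shows the shape of the approach: turn configuration into OpenCL source that evaluates every employee in parallel.

```python
def generate_kernel(rules):
    """Toy code generator: turn (output, expression) rule pairs from a
    hypothetical payroll configuration into an OpenCL kernel body.
    Each work-item computes one employee's payslip."""
    body = "\n".join(f"    {out} = {expr};" for out, expr in rules)
    return f"""__kernel void payroll(__global const float *gross,
                      __global float *net) {{
    size_t i = get_global_id(0);
    float salary = gross[i];
    float tax, net_pay;
{body}
    net[i] = net_pay;
}}"""

# Hypothetical rules; the generated string would be handed to
# clCreateProgramWithSource + clBuildProgram at run time.
rules = [("tax", "salary * 0.37f"), ("net_pay", "salary - tax")]
src = generate_kernel(rules)
```

Because the kernel is rebuilt whenever the configuration changes, constants from the rules can be baked straight into the compiled code – the just-in-time optimization mentioned above.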
You might say that OpenCL kernels can be reverse-engineered more easily than C++-based languages. But in this project the real IP is in the OpenCL code generation and in the software to visually change how payroll calculations are done, so reverse engineering does not make much sense: the offender would need to reverse engineer the output again with every change.
Intersoft is a provider of Payrolling Systems that can be adapted to any country and branch, with a specialization in the cleaning industry. Other software is created for Time & Task Tracking, Staff Planning & Scheduling, Forecasting & Workload Monitoring, Customer Surveys & Analysis and Incentive Programs.
Intersoft says that their customers are attracted by the combination of flexibility and correctness of the solution – unique in its market. Now performance is added to that list, so companies with over 10,000 employees can now be served.
Where do you see software that is unexpectedly slow, that is making it difficult to provide a good service?
The post How to get full CMake support for AMD HIP SDK on Windows – including patches appeared first on StreamHPC.
Written by Máté Ferenc Nagy-Egri and Gergely Mészáros
Disclaimer: if you’ve stumbled across this page in search of fixing up the ROCm SDK’s CMake HIP language support on Windows and care only about the fix, please skip to the end of this post to download the patches. If you wish to learn some things about ROCm and CMake, join us for a ride.
The recent release of AMD’s ROCm SDK on Windows brings a long-awaited rejuvenation of developer tooling for offload APIs. Undoubtedly its most anticipated feature is a HIP-capable compiler. The runtime component amdhip64.dll has been shipping with AMD Software: Adrenalin Edition for multiple years now, and with some trickery one could consume the HIP host-side API by taking the API headers from GitHub (or a Linux ROCm install) and creating an export lib from the driver DLL. Feeding device code compiled offline to HIP’s Module API was attainable, yet cumbersome. The anticipation is driven by the single-source compilation model HIP borrowed from CUDA. That is finally available* now!
[*]: That is, if you are using Visual Studio and MSBuild, or legacy HIP compilation atop CMake CXX language support.

We cannot fathom how an ecosystem that has existed on Linux for 7 years and has relied primarily on CMake as its build system could ship with unfinished CMake support. Starting from 3.21, CMake has native support for the HIP language. We understand that the Windows release wanted to make Visual Studio support a priority (driven by the HIP VS Extension), but showing the back seat to the myriad of existing projects on Linux is a missed opportunity at best. StreamHPC being the authors of multiple larger libraries – rocRAND, rocPRIM, rocThrust and hipCUB – we took great care while developing rocm-examples that our and others’ existing workflows do not break.
Using HIP in CMake is summarized well on the Using CMake page at AMD’s doc site, so we won’t repeat all that’s there.

The main point is that CMake only ships half of the HIP language support: the part defining the HIP-specific variables, properties etc. How that translates into toolchain invocations lives in separate scripts shipping with the toolchains themselves. This second part is what’s half-baked in the initial Windows release of ROCm (documented under the ROCm on Windows section).
Join us for the fix in CMake and the thought process behind it to enable developing HIP on Windows using CMake HIP language support.
We will be riding the waves of CMake on the command-line. For this, minimally install the following SW:
We clone the rocm-examples repo and try to configure a build using defaults for everything.
PS C:\Users\User\rocm-examples> cmake -S . -B build
-- Building for: Visual Studio 17 2022
...
CMake Error at C:/Program Files/Kitware/CMake/3.27.1/share/cmake-3.27/Modules/CMakeDetermineHIPCompiler.cmake:10 (message):
  HIP language not currently supported by "Visual Studio 17 2022" generator
Looking at the relevant part of CMakeDetermineHIPCompiler.cmake:10, we can see that this is a hardcoded error inside CMake. The default generator of CMake on Windows is the latest Visual Studio found on the system. HIP support in MSBuild (VS’s build system) is driven by the “AMD HIP Toolchain” extension. There are two blockers here:
If the inability to use the Visual Studio generators bothers you, consider upvoting this and this issue on AMDâs and Kitwareâs trackers.
Moving on to using Ninja (non Multi-Config for brevity, however we advise defaulting to Ninja Multi-Config in most cases):
PS C:\Users\User\rocm-examples> cmake -G Ninja -S . -B build
-- The CXX compiler identification is MSVC 19.36.32537.0
...
-- The HIP compiler identification is unknown
CMake Error at C:/Program Files/Kitware/CMake/3.27.1/share/cmake-3.27/Modules/CMakeDetermineHIPCompiler.cmake:104 (message):
  Failed to find ROCm root directory.
“HIP compiler identification is unknown”, “Failed to find ROCm root directory”… so after installing the ROCm SDK, CMake won’t find it in its default location? Well, skimming through CMakeDetermineHIPCompiler.cmake, we soon find that right after the previous hardcoded error, if CMAKE_HIP_COMPILER is not set, CMake enters a guessing game of where it may be, potentially guided by the HIPCXX environment variable. Unless a compiler is found, the last desperate attempt is to consult the CMAKE_HIP_COMPILER_ROCM_ROOT variable, which it will use as a guide to finding the SDK-side script driving the language support.
Environment variables by nature diffuse: they leak into every context and are observable at every level of a stack. This is why we love using them at times when we wish to push some information to the bottom of a call stack or a toolchain without using global variables or extra function arguments. The danger of env vars is exactly the same: they leak into contexts we may not have intended, and they change behaviour at a global scale. If we know the purpose and there’s a convenient way to restrict the effect to some specific context, we should prefer that method.
We chose the most up-front method and set CMAKE_HIP_COMPILER on the command-line.
PS C:\Users\User\rocm-examples> cmake -G Ninja -D CMAKE_HIP_COMPILER="${env:HIP_PATH}bin\clang++.exe" -S . -B build
-- The HIP compiler identification is Clang 17.0.0 with GNU-like command-line
CMake Error at C:/Program Files/Kitware/CMake/3.27.1/share/cmake-3.27/Modules/CMakeDetermineHIPCompiler.cmake:143 (message):
The ROCm root directory:
C:/Program Files/AMD/ROCm/5.5
does not contain the HIP runtime CMake package, expected at one of:
C:/Program Files/AMD/ROCm/5.5/lib/cmake/hip-lang/hip-lang-config.cmake
C:/Program Files/AMD/ROCm/5.5/lib64/cmake/hip-lang/hip-lang-config.cmake
Notice how we used the HIP_PATH env var, created by the installer, in the value of CMAKE_HIP_COMPILER (note it has a trailing backslash). This way, should we update our toolchain, we’ll always get the latest compiler. This time around CMake argues that it can’t find hip-lang-config.cmake. This is the file that maps the CMake properties to toolchain invocations. Unfortunately AMD managed to omit it from the SDK, even though the folder and the other files in that folder are there.
PS C:\Users\User\rocm-examples> Get-ChildItem ${env:HIP_PATH}lib\cmake\hip-lang
Directory: C:\Program Files\AMD\ROCm\5.5\lib\cmake\hip-lang
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 7/21/2023 11:17 AM 922 hip-lang-targets-release.cmake
-a---- 7/21/2023 11:17 AM 3577 hip-lang-targets.cmake
This file usually comes from the hipamd component (AMD’s implementation of the HIP interface). If we don’t want to write it manually, we can just take the version from a Linux installation and tweak it slightly.
Got hip-lang-config.cmake and hip-lang-targets.cmake. Next try:
PS C:\Users\User\rocm-examples> cmake -G Ninja -D CMAKE_HIP_COMPILER="${env:HIP_PATH}bin\clang++.exe" -S . -B build
-- The HIP compiler identification is Clang 17.0.0 with GNU-like command-line
CMake Error at build/CMakeFiles/3.27.1/CMakeHIPCompiler.cmake:1 (set):
Syntax error in cmake code at
C:/Users/User/rocm-examples/build/CMakeFiles/3.27.1/CMakeHIPCompiler.cmake:1
when parsing string
C:\Program Files\AMD\ROCm\5.5\bin\clang++.exe
Invalid character escape '\P'.
CMake doesn’t have a very sophisticated scripting language and its type system is fairly weak, so it’s easy to get into trouble on Windows with backslashes in paths being interpreted as escape characters. To fix this issue, specify what type of cache variable to define on the command line using -D VAR_NAME:VAR_TYPE=VAR_VALUE. Next try:
PS C:\Users\User\rocm-examples> cmake -G Ninja -D CMAKE_HIP_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -S . -B build
-- The HIP compiler identification is Clang 17.0.0 with GNU-like command-line
CMake Error at C:/Program Files/Kitware/CMake/3.27.1/share/cmake-3.27/Modules/Platform/Windows-Clang.cmake:170 (message):
The current configuration mixes Clang and MSVC or some other CL compatible
compiler tool. This is not supported. Use either clang or MSVC as both C,
C++ and/or HIP compilers.
This error is a quality-of-life improvement introduced in CMake 3.25, although CMake HIP language support only needs 3.21. The reason behind it is that CMake decides globally whether to use MSVC-like or GNU-like command-line syntax. Mixing and matching the two will result in compiler or linker errors. If a project uses both CXX and HIP language support, they must use the same style of command line.
So we specify the C++ compiler to be ROCmCC as well and clean configure (toolchain selection is cached and one cannot change the compiler after it has been found):
PS C:\Users\User\rocm-examples> cmake -G Ninja -D CMAKE_CXX_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -D CMAKE_HIP_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -S . -B build
-- The CXX compiler identification is Clang 17.0.0 with GNU-like command-line
...
-- The HIP compiler identification is Clang 17.0.0 with GNU-like command-line
CMake Error at C:/Program Files/Kitware/CMake/3.27.1/share/cmake-3.27/Modules/CMakeFindDependencyMacro.cmake:76 (find_package):
By not providing "FindAMDDeviceLibs.cmake" in CMAKE_MODULE_PATH this
project has asked CMake to find a package configuration file provided by
"AMDDeviceLibs", but CMake did not find one.
Could not find a package configuration file provided by "AMDDeviceLibs"
with any of the following names:
AMDDeviceLibsConfig.cmake
amddevicelibs-config.cmake
Add the installation prefix of "AMDDeviceLibs" to CMAKE_PREFIX_PATH or set
"AMDDeviceLibs_DIR" to a directory containing one of the above files. If
"AMDDeviceLibs" provides a separate development package or SDK, be sure it
has been installed.
Some of the package config files are missing from the installation, as requested from hip-lang-config.cmake:87 (find_dependency). For the uninitiated: find_dependency is essentially find_package, but meant to be used inside package config files that are themselves loaded by find_package. This file is indeed missing; we borrow it from Linux as well.
Our next attempt gives the exact same error, and that is because of one of the most subtle package search quirks of CMake. If you consult the Config Mode Search Procedure of find_package, you’ll find:

Even though AMD’s APT repo doesn’t add the install to the PATH, testing is often done in rocm-terminal, where it is on the PATH. /bin is taken off the end of PATH entries, hence /opt/rocm/ ends up on the find_package search prefix list. The Windows installer (correctly) doesn’t add the installed bin folder to the PATH, so hip-lang-config.cmake needs HINTS added to its find_dependency invocations to help it find its own components. The CXX piggyback script, hip-config.cmake, ever so casually sidesteps this issue by not exposing the problematic components on Windows as a “hotfix”. The undeployed hip-lang-config.cmake has not yet received this treatment.
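The kind of patch we applied can be sketched like this – an illustrative fragment, not AMD’s verbatim file; the helper variable name is ours, and the dependency names come from the errors above:

```cmake
# Sketch: inside hip-lang-config.cmake, derive the SDK root from the config
# file's own location (it sits at <ROOT>/lib/cmake/hip-lang) rather than
# relying on PATH-based search prefixes.
get_filename_component(_hip_lang_root "${CMAKE_CURRENT_LIST_DIR}/../../.." ABSOLUTE)

include(CMakeFindDependencyMacro)
find_dependency(AMDDeviceLibs HINTS "${_hip_lang_root}/lib/cmake")
find_dependency(amd_comgr     HINTS "${_hip_lang_root}/lib/cmake")
```

With the HINTS anchored to the file itself, the package finds its siblings no matter where the SDK was installed.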
After we’ve added the HINTS, we are within arm’s reach of our goal. Next we get:
PS C:\Users\User\rocm-examples> cmake -G Ninja -D CMAKE_CXX_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -D CMAKE_HIP_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -S . -B build
-- The HIP compiler identification is Clang 17.0.0 with GNU-like command-line
CMake Error at C:/Program Files/AMD/ROCm/5.5/lib/cmake/amd_comgr/amd_comgr-targets.cmake:75 (message):
The imported target "amd_comgr" references the file
"C:/Program Files/AMD/ROCm/5.5/lib/libamd_comgr.so.2.5.50500"
but this file does not exist. Possible reasons include:
* The file was deleted, renamed, or moved to another location.
* An install or uninstall procedure did not complete successfully.
* The installation package was faulty and contained
"C:/Program Files/AMD/ROCm/5.5/lib/cmake/amd_comgr/amd_comgr-targets.cmake"
but not all the files it references.
Of course, if we borrow files from Linux, they’ll refer to Linux shared objects. Patching up amd_comgr-config.cmake, which we also borrowed from Linux, we’ll get:
PS C:\Users\User\rocm-examples> cmake -G Ninja -D CMAKE_CXX_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -D CMAKE_HIP_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -S . -B build
-- The HIP compiler identification is Clang 17.0.0 with GNU-like command-line
CMake Error at C:/Program Files/AMD/ROCm/5.5/lib/cmake/hip-lang/hip-lang-targets.cmake:82 (message):
The imported target "hip-lang::amdhip64" references the file
"C:/constructicon/builds/gfx/two/23.10/drivers/compute/install/native/Release/x64/hip/lib/amdhip64.lib"
but this file does not exist. Possible reasons include:
* The file was deleted, renamed, or moved to another location.
* An install or uninstall procedure did not complete successfully.
* The installation package was faulty and contained
"C:/Program Files/AMD/ROCm/5.5/lib/cmake/hip-lang/hip-lang-targets.cmake"
but not all the files it references.
Oooh, “C:/constructicon/builds/gfx/two/23.10/…” looks very much like a hardcoded build server path. Of course, nobody hardcoded this; it’s CMake generating this file. A seasoned CMake developer at this point inspects all the scar tissue covering their body, their mind’s eye replays a few episodes of scripting nightmares long past, and from the depths of recollection emerges a passage from the install() command’s DESTINATION documentation.

Following our scar-tissue-induced lead, we zoom in on the installation of hip-lang-targets, and while the targets are installed in proper GNUInstallDirs locations, CONFIG_LANG_PACKAGE_INSTALL_DIR is suspiciously non-standard. Tracing its source value through a layer of indirection, it is indeed prefixed with CMAKE_INSTALL_PREFIX, making it an absolute path and thus rendering the package non-relocatable. This happens to work on Linux, where packages on build servers likely target /opt/rocm, just like the actual DEB package install paths: the hardcoded package path happens to coincide between the build server and the user’s machine.
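For reference, the relocatable idiom CMake itself generates into export files computes the prefix from the location of the loaded script instead of baking in an absolute install prefix – roughly:

```cmake
# Compute the install prefix relative to this file's own location
# (the file sits at <PREFIX>/lib/cmake/hip-lang/hip-lang-targets.cmake,
# so strip three directory levels plus the file name).
get_filename_component(_IMPORT_PREFIX "${CMAKE_CURRENT_LIST_FILE}" PATH)
get_filename_component(_IMPORT_PREFIX "${_IMPORT_PREFIX}" PATH)
get_filename_component(_IMPORT_PREFIX "${_IMPORT_PREFIX}" PATH)
get_filename_component(_IMPORT_PREFIX "${_IMPORT_PREFIX}" PATH)
# _IMPORT_PREFIX now points at the install root, wherever the SDK was copied.
```

Imported target locations expressed in terms of _IMPORT_PREFIX survive relocation; absolute paths from the build server do not.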
Fixing that up to be a path relative to the config file itself, we get the following:
PS C:\Users\User\rocm-examples> cmake -G Ninja -D CMAKE_CXX_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -D CMAKE_HIP_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -S . -B build
-- The HIP compiler identification is Clang 17.0.0 with GNU-like command-line
CMake Error at C:/Program Files/AMD/ROCm/5.5/lib/cmake/hip-lang/hip-lang-targets.cmake:88 (message):
The imported target "hip-lang::amdhip64" references the file
"C:/Program Files/AMD/ROCm/5.5/bin/amdhip64.dll"
but this file does not exist. Possible reasons include:
* The file was deleted, renamed, or moved to another location.
* An install or uninstall procedure did not complete successfully.
* The installation package was faulty and contained
"C:/Program Files/AMD/ROCm/5.5/lib/cmake/hip-lang/hip-lang-targets.cmake"
but not all the files it references.
The amdhip64.dll doesn't ship with the SDK but with Radeon Software. It's part of the driver, hence it's found in C:\Windows\System32.
(Note that recording the DLL as an IMPORTED_LOCATION property on hip-lang::amdhip64 introduces a requirement that the build machine has a Radeon driver installed. In most cases this shouldn't be an issue, but omitting it would render CMake oblivious to the DLL's existence. A proper fix from AMD will have to consider how to expose the DLL.)
Fixing that up, configuration finally passes. We can now build our code using ninja:
PS C:\Users\User\rocm-examples> cmake --build build --target hip_saxpy
[1/2] Building HIP object HIP-Basic/saxpy/CMakeFiles/hip_saxpy.dir/main.hip.obj
[2/2] Linking HIP executable HIP-Basic\saxpy\hip_saxpy.exe
If the samples complain about some device binary missing from the executable, that means the wrong device binaries were embedded into the executable. The target property HIP_ARCHITECTURES controls which binary flavours are embedded into a target. Typically one sets it using the variable CMAKE_HIP_ARCHITECTURES, which the docs suggest setting explicitly; otherwise the implementation chooses some defaults. CMakeDetermineHIPCompiler.cmake tries to invoke rocm_agent_enumerator, an extremely lightweight utility that just lists ROCm devices, and parses the gfxXYZ entries from its output. This utility is missing from Windows for good reasons.
The runtime behind ROCm is ROCclr (ROCm Common Language Runtime), which on Linux is implemented by ROCr, turning API calls into amdkfd** (AMD's kernel-mode driver) calls. ROCclr exists on Windows too, but instead of amdkfd it's implemented by PAL, the Platform Abstraction Layer. As a consequence, none of rocm_agent_enumerator's existing device-search methods would work on Windows, so even if the utility were present, it wouldn't function. The rocminfo utility's place is taken by hipInfo.exe.
To find out which device binary formats we need to embed into our HIP executables, we have to query the devices in our system, e.g.:
PS C:\Users\User\rocm-examples> & "$env:HIP_PATH\bin\hipInfo.exe" | Select-String gfx
gcnArchName: gfx1032
gcnArchName: gfx1035
So by specifying these graphics IP names as the value of the HIP_ARCHITECTURES target property, all our compiled code should function correctly.
PS C:\Users\User\rocm-examples> cmake -G Ninja -S . -B build -DCMAKE_CXX_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -DCMAKE_HIP_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -DCMAKE_HIP_ARCHITECTURES="gfx1032;gfx1035"
Running the sample codes should look something like this:
PS C:\Users\User\rocm-examples\build> .\HIP-Basic\saxpy\hip_saxpy.exe
Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
Do note that CMake's default build configuration is Debug. Performance should improve by also adding -D CMAKE_BUILD_TYPE=Release to the configuration command line.
[**] Fun fact: the name comes from the earliest incarnations of HSA, the Heterogeneous System Architecture. This standard was coined shortly after AMD acquired ATI to obtain their vast range of graphics IP. The first APUs were born, AMD's marketing slogan at the time was "The future is Fusion", and the SW stack implementing these integration features was called FSA, Fusion System Architecture. Because other vendors, primarily ARM, were interested in this design, it was spun off into a separate consortium, but naturally ARM disliked the fact that AMD brand naming was engraved into the now-standard technology. Thus FSA was renamed to HSA. Much of AMD's implementation, however, already used Fusion in the naming of SW components, and some of those remain to this day. The driver name amdkfd expands to AMD Kernel Fusion Driver, but really it set out to implement what is ultimately called HSA today.
Capturing the above in a series of patch files, one may patch the AMD HIP SDK for Windows to enable CMake HIP language support. Download the ZIP file with the patches from here, extract it anywhere on your disk, and from the folder containing the patches, using Git on a command line with Administrator privileges, patch your installation:
Get-ChildItem .\00*.patch | ForEach-Object { git apply --unsafe-paths "--directory=${env:HIP_PATH}." --ignore-whitespace --whitespace=nowarn $_.FullName }
The post How to get full CMake support for AMD HIP SDK on Windows – including patches appeared first on StreamHPC.
The post Improving FinanceBench for GPUs Part II – low hanging fruit appeared first on StreamHPC.
Following the initial work done in porting the CUDA code to HIP (follow article link here), significant progress was made in tackling the low-hanging fruit in the kernels and in tackling potential structural problems outside of the kernels.
Additionally, since the last article, we've been in touch with the authors of the original repository. They've even invited us to update their repository too; for now it will be on our repository only. We also learnt that the group's lead, professor John Cavazos, passed away 2 years ago. We hope he would have liked that his work has been revived.
Link to the paper is here: https://dl.acm.org/doi/10.1145/2458523.2458536
Scott Grauer-Gray, William Killian, Robert Searles, and John Cavazos. 2013. Accelerating financial applications on the GPU. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). Association for Computing Machinery, New York, NY, USA, 127–136. DOI: https://doi.org/10.1145/2458523.2458536
We could have chosen to rewrite the algorithms from scratch, but first we need to understand the algorithms better. Also, with the existing GPU code we can quickly assess what the problems of the algorithm are, and see if we can get to high performance without too much effort. In this blog we show these steps.
As a refresher, besides porting the CUDA code to HIP, some restructuring of the code and build system was also done. Such improvements are a standard phase in all projects we do, to make sure we spend the minimum time on building, testing and benchmarking.
The original code only measured the compute times. In the new benchmarks, compute times and transfer times are measured separately.
Note: for the new benchmarks we used more recent AMD and Nvidia drivers (ROCm 3.7 and CUDA 10.2).
QMC (Sobol) Monte-Carlo method (Equity Option Example)
Below are the results of the original code ported to HIP without any initial optimisations.
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 215.311 | | 437.140 | | 877.425 | |
| Titan V | 24.712 | 25.29 | 542.859 | 543.930 | 1595.674 | 1597.240 |
| GTX980 | 75.832 | 76.852 | 1754.586 | 1755.851 | 5120.555 | 5127.255 |
| Vega 20 | 11.07 | 12.161 | 19.408 | 22.578 | 36.112 | 40.575 |
| MI25 (Vega 10) | 12.91 | 13.964 | 25.247 | 26.662 | 49.551 | 51.983 |
| s9300 (Fiji) | 42.909 | 42.839 | 85.739 | 89.463 | 169.858 | 174.248 |
The Monte-Carlo algorithm was observed to be compute-bound, which made it easy to identify the low-hanging fruit in the kernel.
After tackling it, a big speed-up can be observed on the Nvidia GPUs, and a considerable speed-up on the AMD GPUs as well.
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 215.311 | | 437.140 | | 877.425 | |
| Titan V | 9.467 | 9.831 | 17.638 | 18.466 | 34.281 | 35.765 |
| GTX980 | 22.578 | 23.223 | 44.923 | 46.013 | 90.108 | 100.584 |
| Vega 20 | 4.456 | 5.003 | 8.570 | 10.117 | 17.003 | 19.626 |
| MI25 (Vega 10) | 9.118 | 9.663 | 17.995 | 18.918 | 35.760 | 42.621 |
| s9300 (Fiji) | 17.802 | 18.477 | 35.086 | 36.795 | 69.746 | 73.015 |


Black-Scholes-Merton Process with Analytic European Option engine
Below are the results of the original code ported to HIP without any initial optimisations.
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.095 | 5.181 | 0.407 | 25.111 | 0.792 | 49.890 |
| GTX980 | 2.214 | 7.294 | 10.662 | 34.534 | 19.946 | 68.626 |
| Vega 20 | 0.201 | 7.105 | 0.894 | 33.693 | 1.596 | 70.732 |
| MI25 (Vega 10) | 1.090 | 7.657 | 3.453 | 41.343 | 6.000 | 74.387 |
| s9300 (Fiji) | 1.048 | 11.524 | 4.635 | 53.246 | 9.170 | 129.262 |
The Black-Scholes algorithm was observed to be memory-bound, so there were both some low-hanging fruit and structural problems to tackle.
The first step was to tackle the low-hanging fruit in the kernel. A decent speed-up in the compute times could be observed on most GPUs (except for the Titan V).
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.084 | 5.092 | 0.367 | 24.549 | 0.722 | 49.956 |
| GTX980 | 1.458 | 6.281 | 6.402 | 30.584 | 12.790 | 61.965 |
| Vega 20 | 0.114 | 6.995 | 0.397 | 33.571 | 0.775 | 69.858 |
| MI25 (Vega 10) | 0.517 | 6.929 | 1.836 | 30.490 | 3.021 | 64.961 |
| s9300 (Fiji) | 0.423 | 10.621 | 1.931 | 48.417 | 3.733 | 81.898 |
With the algorithm being memory-bound, the next step was to tackle the structural problems.
The results can be found below, where transfer times on all GPUs improved.
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.068 | 3.937 | 0.286 | 18.258 | 0.565 | 36.479 |
| GTX980 | 1.290 | 5.096 | 6.387 | 24.337 | 12.758 | 48.578 |
| Vega 20 | 0.121 | 5.067 | 0.447 | 25.809 | 0.827 | 47.541 |
| MI25 (Vega 10) | 0.506 | 4.861 | 1.841 | 23.580 | 3.115 | 53.006 |
| s9300 (Fiji) | 0.444 | 7.416 | 2.002 | 36.859 | 3.922 | 64.056 |



Securities repurchase agreement
Below are the results of the original code ported to HIP without any initial optimisations.
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 186.928 | | 369.718 | | 732.446 | |
| Titan V | 19.678 | 32.241 | 35.727 | 60.951 | 70.673 | 120.402 |
| GTX980 | 387.308 | 402.682 | 767.263 | 793.159 | 1520.351 | 1578.572 |
| Vega 20 | 14.771 | 37.174 | 28.595 | 69.743 | 56.699 | 131.652 |
| MI25 (Vega 10) | 46.461 | 71.191 | 91.742 | 143.673 | 182.137 | 277.597 |
| s9300 (Fiji) | 77.615 | 107.822 | 153.334 | 217.205 | 306.206 | 418.602 |
The Repo algorithm was observed to be compute-bound, but it also relies on purely double-precision operations. There was no obvious low-hanging fruit in the kernel, and the structure of the data was found to be rather complex (a mixture of Struct-of-Arrays and Array-of-Structs that are intertwined). Additionally, there are far too many transfer calls for different inputs and outputs, so saturating the transfers with multiple non-blocking streams isn't effective. Also, the current state of the CUDA/HIP implementation works best on GPUs that have good double-precision performance.
There are improvements possible, but these need a larger effort.
Fixed-rate bond valuation with flat forward curve
Below are the results of the original code ported to HIP without any initial optimisations.
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 241.248 | | 482.187 | | 952.058 | |
| Titan V | 31.918 | 49.618 | 61.209 | 97.502 | 123.750 | 195.225 |
| GTX980 | 746.728 | 1117.349 | 1494.679 | 2233.761 | 2976.876 | 4470.009 |
| Vega 20 | 40.112 | 66.460 | 77.123 | 127.623 | 152.657 | 250.067 |
| MI25 (Vega 10) | 141.908 | 215.855 | 278.618 | 425.969 | 553.423 | 844.268 |
| s9300 (Fiji) | 229.011 | 340.212 | 451.539 | 699.059 | 891.284 | 1361.436 |
The Bonds algorithm was observed to be even more compute-bound than the Repo algorithm, and it also relies on purely double-precision operations. The same problems were observed as with the Repo algorithm: no low-hanging fruit could be easily identified, and the structure of the data is complex. That said, unlike the Repo algorithm, there aren't as many transfers of inputs/outputs, making it possible to use multiple streams.
The results can be found below, where 2 streams are used to transfer all the data.
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 241.248 | | 482.187 | | 952.058 | |
| Titan V | 31.918 | 45.988 | 61.209 | 89.198 | 123.750 | 178.688 |
| GTX980 | 746.728 | 770.180 | 1494.679 | 1527.538 | 2976.876 | 3043.102 |
| Vega 20 | 40.112 | 59.216 | 77.123 | 113.009 | 152.657 | 216.965 |
| MI25 (Vega 10) | 141.908 | 156.164 | 278.618 | 310.922 | 553.423 | 637.924 |
| s9300 (Fiji) | 229.011 | 256.373 | 451.539 | 493.679 | 891.284 | 981.604 |


The improvements described above produced good results, as improvements across the algorithms (except Repo) could be observed. In combination with newer drivers from AMD and Nvidia, general improvements can also be observed when compared to the results obtained in the previous article.
That said, there is currently a bug in AMD's current drivers where data transfers are slower; we will update this blog with the results once this is fixed in a future driver release.
What's next? Looking for the high-hanging fruit in both the CPU and GPU implementations of the algorithms. This is the next step in achieving better performance, as we've hit the limit of optimising the current implementations.
Milestones we have planned:
Finance algorithms on GPUs are often not really optimised for performance, which is quite unexpected. A company that can react in minutes instead of a day is more competitive, and especially in finance this is crucial.
As you have seen, we achieved quite some speed-up with a relatively small investment. When we design code from scratch, we can move in the right direction more quickly. The difficulty with financial software is that it often needs a holistic approach.
We can help your organization:
Feel free to contact us for inquiries, but also about sponsoring algorithms for the benchmark.
The post The Art of Benchmarking appeared first on StreamHPC.
This article focuses on the foundations of solid benchmarking, so it helps you to decide which discussions to have with your team. It is not the full book.
There will be multiple blog posts coming in this series, which will be linked at the end of the post when published.

Even when the answer depends on many variables, answers can be given. These answers are best described as 'insights', and this blog is about exactly that.
First the commercial message, so we can focus on the main subject. As benchmark-design is not always obvious, we help customers to set up a system that plugs into a continuous integration system and gives continuous insights. More about that in an upcoming blog.
We see benchmarking as providing insights, in contrast with producing a stopwatch number. Going back to F1: being second in the race means the team probably wants to know these:
This translates quite well to software. While we speak about “doing benchmarks”, the better name would be “getting insights” – benchmarking is just the tool to get these. Most used are:
Profiling is what many developers do, with different levels of expertise:
So it boils down to understanding what types of problems there are, how well one understands the mechanics behind tools like GDB, and experience with such tools.
Developers, by training and mindset, already think of automation by default. Still, automating the supporting work is not always a logical step. Think of enjoying building algorithms that solve labyrinths versus writing code that measures the efficiency of those algorithms – for various reasons the second part is more often done manually.
This is a part of the art of problem solving, of which we briefly discussed one aspect recently. How do you make something that is neither under-engineered nor over-engineered? The first steps of problem solving are all about clearly defining the problem, while the later steps make sure a possible solution keeps the original problem in mind.
Problems are easily found by measuring the summed time that profiling and benchmarking take as part of the whole project. Compare that to the priorities of the original problems that need to be solved.
Deciding that benchmarking is important for a project is half the work. By having the right discussions using the points in this blog article, you can easily get to good benchmark foundations.
Benchmarking is obvious initially, but quickly becomes a specialization. We have thought out how to keep the learning curve less steep for longer, which we will share later.
Want to know more? Watch this blog for follow-ups. Want to know more right now? Then get in contact.
The post Birthday present! Free 1-day Online GPGPU crash course: CUDA / HIP / OpenCL appeared first on StreamHPC.
Now that Corona (and the fear of it) is spreading, we had to rethink how to celebrate 10 years. So while there were different plans, we simply had to adapt to the market and world dynamics.

Getting your mind tuned from CPU-focused to GPU-focused is half the work. This is the first goal of the day.
The second goal is hands-on practice. We'll provide templates for CUDA/HIP and OpenCL, where you focus on making the core compute part.
The third goal is learning from a Q&A. Here you can tap into 10 years of our experience with HPC and GPUs, as we have done everything from the strangest to the most ordinary projects.
The training is very basic, but there will be a lot of aha!-moments that will help in making better decisions for adding GPU support to your project.
This is the agenda:
In between there will be short lab sessions on your own laptop.
More information can be found in the description of the full training. Especially if you want a full in-company training, that page tells much more.
Copy the template below and send it to [email protected]. A personal text is appreciated, of course.
Hi Stream Team,
Congrats with 10 years. …
Kind regards,
We always contact you within 5 days, but we try for the same day.
The exact way of organising depends on the number of interested people and their time zones. As the training is not designed as a MOOC, we limit the number of participants – first come, first served. We'll share updates on Twitter, on LinkedIn and in updates to this blog post.
Looking forward to virtually meet you!
Vincent Hindriksen and the Stream Team
The post Problem solving tactic: making black boxes smaller appeared first on StreamHPC.
Assumption is the mother of all mistakes
Eugene Lewis Fordsworthe
A colleague put "Assumptions are the mother of all fuckups" on the wall, because we should be assuming that we assume. The problem is that we want to have full control and make faster decisions, and assumptions conveniently fill in all those scary unknowns.
So if you use the words "solving a problem", you visualise the problem disappearing. Other descriptions, like "patching a problem" (which is often more accurate), are seen as bad, incomplete and such. Therefore I prefer describing problems as black boxes that only get smaller or bigger – they don't become more or less transparent, and they never disappear. This is both visual and more accurate.
Donald Rumsfeld handled assumptions by putting the knowns into four categories (and how I translate them):
| Known knowns (facts) | Known unknowns (unanswered questions) |
| Unknown knowns (assumptions) | Unknown unknowns (missing questions) |
Here the second row describes what is inside the black box – it is an illusion that you will ever know everything. Rumsfeld probably based this on a classic quote:
The more you know the more you know you don’t know
Aristotle
Let's start with reverse engineering code, before going to porting. That's often seen as a very special task, where nobody knows what the code is actually doing. Documentation never written, sources lost, people left – it's one huge black box. This sounds like a horror to many managers, but the skills needed to do it are actually required for porting code from the CPU to the GPU. One of those skills is black-box thinking.
The reason that even the world's clearest code contains black boxes is that developers have assumptions about anything they don't fully master. And we can all assume that nobody can master everything, as that would require predicting the future. So when making a flow diagram of the code, you see the data flow from black box to black box, where each box has a list of observations around the known knowns and known unknowns.
Some black boxes are a bit larger than others, as the team thinks the chances are higher that something there is not fully under control, but cannot specify what – something with unexpected input, whatever that could be. In such a case we simply decide to keep the box large. There is also a list of known unknowns (assumptions of the original developer) we haven't tested yet, but we know from experience that these happen. We assume the list is incomplete, and when we get feedback from benchmarks, tests or the customer, we already have ideas where to find a solution.
If we had only a list of problems (known knowns), we would have finished the project earlier, but it would not be as good – simply because the first question is automatically "what did we assume wrong?".
Just describe a problem as a network of black boxes. The rest will follow through. It's as simple as that!

If you want to learn more, I’m not the last to say: come work with us. We love to work with curious people. Check out our project success stories and jobs to learn more.
The post Improving FinanceBench appeared first on StreamHPC.
It's a benchmark developed at the University of Delaware, aimed at those who work with financial code, to see how certain code paths can be targeted for accelerators. It utilizes the original QuantLib software framework and samples to port four existing applications for quantitative finance. It contains codes for the Black-Scholes, Monte-Carlo, Bonds, and Repo financial applications, which can be run on the CPU and GPU.
The problem is that it has not been maintained for 5 years, while there were good improvement opportunities. Even though the paper was already written, we think it can still be of good use within computational finance. As we were seeking a way to make a demo for the financial industry that is not behind an NDA, this looked like the perfect starting point. We have emailed all the authors of the library, but unfortunately did not get any reply. As the code is provided under a permissive license, we could luckily go forward.
The first version of the code will be released on Github early next month. Below we discuss some design choices and preliminary results.
Of course the first step was selecting good algorithms that both needed a lot of compute and were representative of the finance industry. We could not have done this as well as the research team did – this was the main reason to choose the project.
The work done for making the code, as described by the team itself:
The original QuantLib samples were written in C++. QuantLib is a C++ library. Unfortunately, languages like OpenCL, CUDA, and OpenACC cannot directly operate on C++ data structures, and virtual function calls are not possible. Because of this problem, all of the existing code had to be "flattened" to C code. We used a debugger to step through the code paths of each application and see what lines of QuantLib code are executed for each application, and manually flattened all of the QuantLib code.
This is typical work when porting CPU-code to the GPU. Complex C++ code can be very hard to bend in directions it was not intended to bend to. Simplification is then the first thing to do, so it can be split in manageable parts. As this can be time-consuming, we were happy it was already done, though it is often easier to also have the original simplified CPU-code for reference.
Our focus differs from that of a research group, which logically results in code changes. We're now in phase 1.
As it's difficult to focus on multiple languages while improving performance and project quality, we focused on OpenMP and CUDA first. Then we ported the project to HIP and made sure that the translation from CUDA to HIP could be done quickly. This way we could make sure CPUs and the fastest GPUs could be benchmarked, leaving OpenCL and OpenACC out for now. We have no intention of keeping HMPP in, and have chosen to introduce SYCL to prepare for Intel Xe. We are more interested in benchmarking different types of algorithms on all kinds of hardware than in comparing programming languages.
The project has also been cleaned up; CMake was introduced and google-benchmark was added, to make it easier for us to work on. We did not look into QuantLib for improvements or read papers on the latest advancements – the goal was really just to get started.
We picked a few broadly available AMD and Nvidia GPUs, and chose a dual-socket Xeon (40 cores in total) for the CPU benchmarks. The times INCLUDE transfer times for the GPUs. The original benchmark unfortunately showed compute times only, so we might get some Nvidia Kepler GPUs back into a server to re-benchmark these.
QMC (Sobol) Monte-Carlo method (Equity Option Example)
Monte Carlo is seen often when HPC is applied in the finance domain. A good part of that is the easy interpretation and straightforward implementation, making it easy to explain to HPC developers while showing the performance advantage to quants. It returns a distribution of future prices of assets by doing thousands to millions of simulations.
The below results were from the code as provided, with a direct port to HIP to include AMD GPUs. As you can see, the Titan V and GTX980 numbers don’t look good.
| Size | 262144 | 524288 | 1048576 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 215.311 | 437.140 | 877.425 |
| Titan V | 25.162 | 544.829 | 1599.428 |
| GTX980 | 76.456 | 1753.816 | 5120.598 |
| Vega 20 | 15.286 | 30.140 | 62.110 |
| MI25 (Vega 10) | 13.694 | 26.733 | 52.971 |
| s9300 (Fiji) | 25.853 | 51.403 | 98.484 |
Here are the results after fixing the obvious problems and low-hanging fruit. This benefited the Nvidia GPUs a lot, but also the AMD GPUs. There was no low-hanging fruit in the OpenMP code, so no speed-up there.
| Size | 262144 | 524288 | 1048576 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 215.311 | 437.14 | 877.425 |
| Titan V | 9.885 | 18.501 | 35.714 |
| GTX980 | 23.427 | 45.81 | 91.721 |
| Vega 20 | 10.864 | 21.467 | 42.995 |
| MI25 (Vega 10) | 10.857 | 21.147 | 41.214 |
| s9300 (Fiji) | 20.806 | 41.633 | 80.465 |
Black-Scholes-Merton Process with Analytic European Option engine
Black-Scholes is used for estimating the variation over time of financial instruments such as stocks, using the implied volatility of the underlying asset to derive the price of a call option. Again, it is compute-intensive.
The performance of the original code looked good at first sight, but the transfers took 95% of the time. That’s for the next phase.
| Size | 1048576 | 5242880 | 10485760 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | 22.194 | 43.994 |
| Titan V | 7.959 | 38.852 | 77.443 |
| GTX980 | 10.051 | 48.038 | 95.568 |
| Vega 20 | 5.907 | 26.468 | 53.342 |
| MI25 (Vega 10) | 7.827 | 36.947 | 71.499 |
| s9300 (Fiji) | 9.642 | 37.07 | 80.118 |
On some projects it's better to focus on the largest bottleneck – for this project we chose to go through the project in a structured way. It is sometimes difficult to explain that improvements only get "activated" very late in the project – luckily "experience" is often accepted as the explanation.
So the applied fixes had a positive effect, but are hardly noticeable yet.
| Size | 1048576 | 5242880 | 10485760 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | 22.194 | 43.994 |
| Titan V | 7.841 | 37.957 | 77.382 |
| GTX980 | 9.12 | 44.473 | 89.847 |
| Vega 20 | 5.87 | 27.663 | 51.828 |
| MI25 (Vega 10) | 7.037 | 31.737 | 64.898 |
| s9300 (Fiji) | 7.295 | 32.866 | 79.558 |
Fixed-rate bond valuation with flat forward curve
Only ported to HIP. We did not do any optimisations yet, as we had stability problems with two AMD GPUs to focus on. With the current code, Vega 20 is faster than the Titan V.
| Size | 262144 | 524288 | 1048576 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 186.928 | 369.718 | 732.446 |
| Titan V | 38.328 | 72.444 | 141.664 |
| GTX980 | 404.359 | 796.748 | 1599.416 |
| Vega 20 | 36.399 | 67.299 | 128.133 |
Securities repurchase agreement
Only ported to HIP. The code for this benchmark still needs some attention. You can see that the GTX980 is far too slow in comparison.
| Size | 262144 | 524288 | 1048576 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 241.248 | 482.187 | 952.058 |
| Titan V | 46.172 | 88.865 | 177.188 |
| GTX980 | 761.382 | 1518.756 | 3030.603 |
| Vega 20 | 63.937 | 122.484 | 241.917 |
As you can see, this is really work in progress. Why show it already? Because it shows how a project actually progresses. Cleaning up the code is always done in every project, to avoid delays later on. Adding good tests and benchmarks is another foundational step. Most time has gone into these preparations, and limited time into the improvements.
Milestones we have planned for now:
We intend to release a milestone every 6 to 8 weeks. You can get notified by following us on Twitter or LinkedIn.
Feel free to contact us with any question.
The post Updated: OpenCL and CUDA programming training – now online appeared first on StreamHPC.
As it has been very busy here, we have not done public trainings for a long time. This year we're going to train future GPU developers again – online. For now it's one date, but we'll add more dates to this blog post later on.
If you need to learn solid GPU programming, this is the training you should attend. The concepts can be applied to other GPU-languages too, which makes it a good investment for any probable future where GPUs exist.
This is a public training, which means there are attendees from various companies. If you prefer not to be in a public class, get in contact to learn more about our in-company trainings.
It includes:
Trainings will be done by employees of Stream HPC, who all have a lot of experience with applying the techniques you are going to learn.
Most trainings have around 40% lectures, 50% lab-sessions and 10% discussions.
Important:
This is close to our standard OpenCL crash course. We start with the basic concepts, write our first OpenCL program, discuss GPU architectures and the difficulties of GPU-programming, compare OpenCL to CUDA and C++, and end with writing simple code that runs on a laptop (CPU or GPU).
During the day, we will increase the level of requirements and touch all important aspects of OpenCL-programming.
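As a taste of that first program: the classic starter kernel is a vector addition, where every work-item computes one element of the result. Below is a plain-Python model of that data-parallel mapping — no OpenCL involved, just the mental model behind the kernel body `c[i] = a[i] + b[i]`:

```python
# Mental model of a vector-add kernel: each "work-item" i independently
# computes one output element, like get_global_id(0) does in OpenCL C.
def vector_add(a, b):
    n = len(a)
    c = [0.0] * n
    # On a GPU these loop iterations run as thousands of parallel work-items;
    # they are independent, which is exactly what makes the problem GPU-friendly.
    for i in range(n):
        c[i] = a[i] + b[i]
    return c

print(vector_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [11.0, 22.0, 33.0]
```

In the lab sessions this same computation is expressed as an OpenCL kernel plus host code that sets up buffers and launches the work-items.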
As we have several GPU-servers available for development, we can provide you with login credentials, a git account and a short how-to for using the extra GPUs from NVidia and AMD, and optionally Intel. This way you can try different graphics cards to find out which optimisations work and which don't.
Optimisations we discuss during this day:
During the days we use various lab-sessions to support the explained theory. We’ll discuss most of the following:
There are various tools you need to understand to get code that runs well on GPUs. This day you will learn to use various vendor-provided and open source tools to help you analyse your code.
The tools are discussed and used along the following four subjects:
We discuss tools last on purpose: you need to understand the concepts before having a tool solve problems for you.
Often these are different per training, as these are defined by the attendees of the training. Subjects that have been discussed more frequently:
For the final project you will need to use everything you've learnt and try to write the fastest code in the class.
Lab-session:
We will send a questionnaire to understand the needs of each trainee. For larger groups, we also have a separate phone call with the representative.
Attendees need to bring their own laptops for the lab sessions. The only requirement is for the laptops to be equipped with an OpenCL capable CPU or GPU and to have OpenCL drivers correctly installed. A complete list of OpenCL compliant devices can be found here. Regarding software, laptops need to have the following installed:
We prefer to focus on the core of the training, and therefore ask that a certain set of skills is already present. This avoids other attendees having to wait.
Attendees are required to have intermediate programming experience and good C/C++ knowledge. This means that you should at least be able to write an application in C/C++ from scratch, debug it with GDB at a high level, and be very (!) comfortable working with pointers. If this is not the case, do contact us so we can provide pre-training material.
The costs are €2500. Additional days for personal training and consultancy are excluded – please ask us for a quote.
If you’re interested, do fill in the pre-training questionnaire already.
Initiating the reservation can be done by email or phone.
Email: [email protected]
Phone: +31 854865760
The post Join us at the Dutch eScience Symposium 2019 in Amsterdam appeared first on StreamHPC.
Interested? Read on!
The Netherlands eScience Center will host its 6th annual National eScience Symposium on 21 November 2019 at the Johan Cruijff ArenA, Amsterdam. The theme of this year's edition is "Digital Challenges in Open Science", focusing on the ways that innovative digital technologies can be used to advance Open Science. The Dutch National eScience Symposium is a one-day event where academic researchers, industry representatives and policy-makers from various disciplines come together to discuss and share views on the latest developments in eScience.
According to the European Commission, "Open Science represents a new approach to the scientific process based on cooperative work and new ways of diffusing knowledge by using digital technologies and new collaborative tools. The idea captures a systemic change to the way science and research have been carried out for the last fifty years: shifting from the standard practices of publishing research results in scientific publications towards sharing and using all available knowledge at an earlier stage in the research process." The 2017 Coalition Agreement of the Dutch cabinet states that very soon "Open Science […] will be the norm in scientific research".
The systematic change of the practice of science is made possible only by a close collaboration between practitioners, policy-makers and the designers and developers of new digital solutions and tools. In particular, the role of research software and support is essential. The 2019 National eScience Symposium is an opportunity for anyone working in academia, industry, research funding, research publishing, and research policy to engage in fruitful discussions with advanced technology experts. It is the place for knowledge exchange on these topics, and to help shape and define the future of Open Science.
The symposium will feature thematic sessions showcasing world-class research and digital developments around Open Science. Topics include digital challenges in open science, the role of open science in space and earth research, and the European Open Science Cloud. In addition, the symposium will showcase several demos of cutting-edge research software developed by the eScience Center and its partners, as well as the announcement of the Young eScientist Award 2019.
Each year young and ambitious researchers can participate in the Young eScientist Award competition. The award aims to stimulate a young scientist demonstrating excellence in eScience: the development or application of digital technology to address scientific challenges. The Netherlands eScience Center awards €50,000 worth of expertise to the winner. At the symposium the Young eScientist 2019 will be announced.
Read last year's winning research idea by Esther Bron.
09:00 – 09:45: Registration & Coffee
09:45 – 10:00: Welcome by director Netherlands eScience Center
10:00 – 10:30: Keynote Speech by Prof. Frank Miedema, Vice Rector for Research at Utrecht University
10:30 – 11:00: Coffee & networking – incl. information stands partners
11:00 – 12:30: Sessions on various Open Science topics, including keynote, lightning talks and discussions
12:30 – 13:30: Lunch & networking – incl. information stands partners
13:30 – 15:00: Sessions on various Open Science topics, including keynote, lightning talks and discussions
15:00 – 15:15: Young eScientist Award 2019
15:15 – 15:30: Closing + Young eScience Demo
15:45 – 18:00: Demo exhibition, networking, and drinks
Johan Cruijff ArenA
ArenA Boulevard 1
www.amsterdamarena.nl
Tickets are currently €35 and can be ordered from https://www.esciencesymposium2019.nl/page/606790