The post N-Queens project from over 10 years ago appeared first on StreamHPC.
We love it when junior applicants have a personal project to show, even if it’s unfinished. As it can be scary to share such an unfinished project, I’ll go first.
Everybody who starts in GPGPU has this moment where they feel great about the progress and speedup, but then suddenly get totally overwhelmed by the endless paths to more optimization. And of course 90% of the potential optimizations don’t work well – it takes many years of experience (and mentors in the team) to excel at it. This was also a main reason why I like GPGPU so much: it remains difficult for a long time, and it never bores. My personal project where I had this overwhelmed+underwhelmed feeling was N-Queens – until then I could solve the problems in front of me.
I worked on this backtracking problem as a personal fun project in the early days of the company (2011?), and decided to blog about it in 2016. But before publishing I thought the story was not ready to be shared, as I had changed the way I coded, learned so many more optimization techniques, and (like many programmers) thought the code needed a full rewrite. Meanwhile I had to focus much more on building the company, and my colleagues got better at GPGPU coding than me – this didn’t change in the years after, and I’m now the dumbest coder in the room.
Today I decided to just share what I wrote down in 2011 and 2016, and for now focus on fixing the text and links. As the code was written in Aparapi and not pure OpenCL, it would take some good effort to make it available – I decided not to do that, to prevent postponing it even further. Luckily somebody on this same planet had about the same approaches as I had (plus more), and actually finished the implementation – scroll down to the end, if you don’t care about approaches and just want the code.
Note that when I worked on the problem, I used an AMD Radeon GPU and OpenCL. Tools from AMD were hardly there, so you might find a remark that did not age well.
What do 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200, 73712, 365596, 2279184, 14772512, 95815104, 666090624, 4968057848, 39029188884, 314666222712, 2691008701644, 24233937684440, 227514171973736, 2207893435808352 and 22317699616364044 have to do with each other? They are the solution counts of the N-Queens problem for board sizes 1 through 26. Even if you are not interested in GPGPU (OpenCL, CUDA), this article should give you some background on this interesting puzzle.
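To make the puzzle concrete, here is a plain depth-first backtracker (an illustrative Python sketch, names are mine) that reproduces the first entries of the sequence above:

```python
def count_solutions(n):
    """Count N-Queens solutions by plain backtracking: place one queen
    per row, and on each new row test the candidate column against all
    queens already placed (same column or same diagonal = attack)."""
    def place(row, queens):
        if row == n:                      # all rows filled: one solution
            return 1
        total = 0
        for col in range(n):
            if all(col != c and abs(col - c) != row - r
                   for r, c in enumerate(queens)):
                total += place(row + 1, queens + [col])
        return total
    return place(0, [])

print([count_solutions(n) for n in range(1, 9)])  # [1, 0, 0, 2, 10, 4, 40, 92]
```

This is of course far too slow for N=17 and up, which is exactly why the optimizations below matter.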
An existing N-Queens implementation in OpenCL solved N=17 in 2.89 seconds on my GPU, while Nvidia hardware took about half that. I knew it did not use the full potential of the GPU, because background bitcoin-mining only dropped to 55% instead of 0%.
I only had to find those optimizations by redoing the port from another angle.
This article was written while I programmed (as a journal), so you see which questions I asked myself on the way to the solution. I hope this also gives some insight into how I work, and into the hard part of the job: most of the energy goes into preparations that yield no direct result.
The first approach tries to create a parallel version of an existing implementation; the second focuses on methods better suited to GPUs. The kernels are not fully optimized, so I hope you can come up with a faster version – I give some hints where further optimization could help.
The problem is described on Wikipedia as: placing N chess queens on an N×N chessboard so that no two queens attack each other. It is an exhaustive search problem: walking a tree and testing whether each candidate holds as a solution. The “nice” thing is that the number of solutions increases irregularly with increasing N (see “Counting solutions” on the Wikipedia page). There are many possible optimizations, all very clever, but harder to port to the GPU.
The forums spoke about multiple fast implementations, of which I researched three.
The first one was software by Jeff Somers, who described his software as: “My program is a heavily-optimized C program, yet it still takes over a week for an 800 MHz PC to calculate the number of solutions for a 21 x 21 board”. In other words: perfect for a 1050 MHz GPU to do it in less time.
I found other solutions by Kenjiro Taura, but his webpage is offline now. He was very focused on exploring different parallel/distributed methods, and his code could make use of all cores of an 8-core i7. It was somewhat tricky to get going, but – in case you find his code – the following worked:
gcc nq.c -lpthread
OMP_NUM_THREADS=8 ./a.out p 17 2
Results came in after 28.17 seconds, so only a small speed-up compared to Somers’ algorithm on a single core. Another one advertised as fast was made by Mr. Takaken, but I could not compile it with GCC at all – so I left that one.
But if you look around, there are many implementations to learn from:
Note that the above list is from 2016! Also, Dr. Dobb’s does not even let you go to different pages – you need to edit the URL to get to another page.
I based my code on the one from Somers, as it fully focuses on the algorithm, and that is always the first problem to solve. As a bonus, I could focus on my learning goal: porting a difficult puzzle to the GPU. There were more reasons, but only if you try yourself will you find them.
What Mr. Somers discovered already: the more algorithmic steps can be skipped, the less work needs to be done, and the faster we can make it.
This page describes in short some differences and shows where a lot of speed-up can be found. Checking can also be optimized: when going deeper into the search tree, the previous checks don’t need to be done again (step-wise forward-check).

The first row has N possibilities: a queen on each position. For example N=4 gives 1000, 0100, 0010 and 0001. If it were a rook problem and we did not remember the previous row, then row 2 would have N-1 possibilities, etc., so the whole board would have N*(N-1)*(N-1)*… = N*(N-1)^(N-1) possibilities – plot it logarithmically to see how fast it grows. Of course, we can easily remember the previous rows to find the N! solutions, but we cannot do the same for the diagonals that queens require. We come back to the rook search later when discussing the second approach, as it can be very usable.

For queens, the first row has N possibilities to put a queen on; the second row has N-2 (when the first queen is at the border) or N-3 (otherwise). The third row has between N-6 (only when N>8) and N-4 possibilities; there are N-3 possibilities when the first two queens shadow positions – cover a position more than once. Still, it is not easy to find reusable search paths: finding the shadowed positions is as hard as checking validity, as far as I have found. This shadowing makes it a harder problem, and solutions to it can be used for other search problems too.

As the results can be mirrored, we only need to check half of them and double the count – with an exception for the middle column of odd-sized boards. You see that Somers coded two runs: one for one half of the board and one for the middle column (when the board size is odd) – in my port I combined them. When doing the middle column, he restricts the search from the second row on, so he can double the results anyway and speed up the search.
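The mirror trick, combined with Somers’ bitmask representation of columns and diagonals, can be sketched in a few lines. This is not Somers’ actual C code – it is an illustrative Python sketch of the same ideas, with names of my own choosing:

```python
def _complete(cols, ld, rd, all_ones):
    """Count ways to finish the board. cols/ld/rd are bitmasks of the
    columns and the left/right diagonals attacked on the current row."""
    if cols == all_ones:              # a queen in every column: board full
        return 1
    count = 0
    free = all_ones & ~(cols | ld | rd)
    while free:
        bit = free & -free            # lowest free square in this row
        free -= bit
        # shift the diagonal masks when moving down one row
        count += _complete(cols | bit,
                           ((ld | bit) << 1) & all_ones,
                           (rd | bit) >> 1,
                           all_ones)
    return count

def count_queens(n):
    """Somers-style symmetry: place the first queen only in the left half
    of the first row and double the count; for odd N the middle column
    is counted once separately."""
    all_ones = (1 << n) - 1
    first = lambda c: _complete(1 << c, ((1 << c) << 1) & all_ones,
                                (1 << c) >> 1, all_ones)
    total = 2 * sum(first(c) for c in range(n // 2))
    if n % 2:                         # middle column of an odd-sized board
        total += first(n // 2)
    return total
```

The bitmask form is also what makes a GPU port attractive: the per-row state is three integers instead of a board.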
If you read the various optimization guides by AMD and NVIDIA, you notice the focus of GPU optimization is on memory transfers – and this is the tricky part of N-Queens. Making a parallel version of the code was not the real problem; working around the limitations of GPUs really was. The first approach focuses on a parallel version of Somers’ code; the second takes the GPU limitations much more into account.
Because of dependencies on the previous run it is actually impossible to completely unroll, so we generate the possible starting points for the first x rows and then start from there. We need a series, so we can call a kernel by id. The problem is that it is a depth-first algorithm, and we need breadth-first – at least for the first rows. I therefore made a variation on the code that combines the odd and even part. This version fails immediately, but optimizing it is for later – I hope that by pumping up the number of threads the scheduler of the GPU helps me out. I made a different version that runs the separate threads (in serial) by only giving a row where to start. Unrolling and clustering would increase the speed, but we first try without.
The least I want is for it to be faster than 2.89 seconds. The OpenCL version gave results under 9 seconds on an 8-core Intel i7, so less than 4 times slower than a GPU using the other code base – that sounds hopeful. The threaded CPU version (Java) did around 30 seconds, so we do have some performance issues here. But… running OpenCL on the GPU also gave 30 seconds, which is not good at all!
I could tune the software by eliminating the first breadth-first search through better initialization, or by looking into using vectors to exploit AVX. But since that would not improve the GPU times much, I’ll leave these optimizations out of this approach. I learned that Somers worked a lot on memory efficiency for a single-threaded solution, and I found its limitations for my cause. By going wide before deep, we effectively created a parallel version which runs very nicely on CPUs (which can handle completely asynchronous workers), but GPUs need more order and structure – time for approach 2.
Irregular code doesn’t work really well on GPU-style devices, as you can read in this extensive report about H.264. There are more approaches, but they all have in common that they are not usable when there are fewer parallel processors. We now start with three focus points:
We can use permutations of half-correct solutions: we make sure that horizontally and vertically there is only one queen per line, so we only need to check the diagonals for illegal sharing. It would be faster to permute the diagonals somehow, but I did not see a possibility. Checking the diagonals should be done in parallel too. We can use shift and sum, which we need to do twice – once per diagonal direction. To visualize this shift, put 4 queens in the middle 4×4 of a chessboard; leave the first, shift the second 1 position to the left, the third 2 positions, the fourth 3.
The number of permutations is N!, and walking them looks a bit like bubble sort visually. A starting position with two queens on one diagonal fails immediately, so later we will improve the algorithm to skip the obvious ones – same for unnecessary checks of the diagonals.
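The permutation idea can be sketched as follows (illustrative Python, my own names): a permutation already guarantees one queen per row and per column, so only the two diagonal directions remain to be checked.

```python
from itertools import permutations

def diagonals_ok(cols):
    """cols[r] is the column of the queen in row r. The queen sits on
    diagonals r+cols[r] and r-cols[r]; a valid board uses each diagonal
    id at most once, so both id sets must have full size."""
    n = len(cols)
    return (len({r + c for r, c in enumerate(cols)}) == n and
            len({r - c for r, c in enumerate(cols)}) == n)

def count_by_permutation(n):
    """Brute-force the N! permutations; only diagonals need checking."""
    return sum(diagonals_ok(p) for p in permutations(range(n)))
```

N! grows much faster than the number of valid boards, which is why skipping obviously-bad permutations matters so much.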
The simple method needs a double loop: i (0 <= i < N) and, since we then of course need to keep track of some history, a way to store it. Before going on with how to divide the problem: AMD prefers workgroup sizes which are a multiple of 64 (per dimension) – called a wavefront. NVIDIA works well with multiples of 32 (a warp), so it has an advantage when the workgroups are smaller. A lot of the time this explains the difference between NVIDIA and AMD when an algorithm was first tested on NVIDIA. NVIDIA has an advantage here, as we cannot easily fill a 64-wide wavefront. How do we make sure the whole wavefront/warp is filled? It is clear we need loops, as we cannot launch N! kernels: for larger N that number is not even representable in a uint64. You may try.
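Filling wavefronts boils down to rounding the launch size up. A minimal sketch (an illustrative helper of my own, not from the original code):

```python
def padded_global_size(work_items, wavefront=64):
    """Round the number of work-items up to a multiple of the wavefront
    (64 on AMD) or warp (32 on NVIDIA), so the hardware always runs
    whole wavefronts; the surplus threads simply exit early in the
    kernel after an id check."""
    return ((work_items + wavefront - 1) // wavefront) * wavefront

print(padded_global_size(100))      # 128 on AMD
print(padded_global_size(100, 32))  # 128 on NVIDIA
```

The early-exit threads are pure overhead, which is why the smaller warp gives NVIDIA an edge for irregular problems like this one.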
We have the same problem as in the previous approach: we can only launch from a certain row. Two rows have too few starting positions, so we start with three or four, as we did in approach 1. At http://wordaligned.org/articles/next-permutation a lot is explained about permutations. We have one problem: many of the permutations are invalid from the start.
Writing down which rows can be next to each other, it becomes very clear that there are no solutions for N = 2 or 3. N = 4 gives two solutions: 2,4,1,3 and 3,1,4,2. For 5 and 6 we need to check the diagonals for rows further apart, but it is clear you need to find whole, unique paths where all nodes are visited only once to find possible solutions. If we put it in a truth table, you get a wide diagonal of 2 or 3 with false and the rest true (symmetrical). This problem is best known as a relative of the travelling salesman problem: the Hamiltonian path. We don’t care about distances (set to 1) or directions, but we do very much care about getting all the separate solutions, as we need to test them for correctness.

Creating such a graph is easy, so let us focus on walking it. If all these paths can be found in fewer operations than using permutations AND it can be done in parallel, we have taken a big step forward. Somers’ solution walks paths very orderly and checks the diagonals efficiently, and the permutations only needed to update+check the diagonals of the two switched rows, so we need to see if check-while-walking-the-tree could also be possible. If the board size is bigger than 6, you see an increase in the number of solutions. The tree (figure) shows that when starting in the edge, you see only one escape; the rest has only one or two solutions. This tells something about the influence inside and outside 6×6; for us larger board sizes are important, and the complexity of such boards makes it hard to find theoretical shortcuts.
We keep track of two arrays – one per diagonal direction – holding the id of the diagonal each queen is on (there are 2*N-1 diagonals per direction). For the two switched rows the ids shift by +n and -n, after which they are compared to the other rows. Starting from a position that is valid, only the switched rows need to be checked.
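The swap check can be sketched like this (hypothetical helper names; per-row lists stand in for the diagonal-id arrays):

```python
def swap_is_valid(cols, diag1, diag2, i, j):
    """cols is a permutation (one queen per row and column);
    diag1[r] = r + cols[r] and diag2[r] = r - cols[r] are the
    diagonal-id arrays. Swapping rows i and j only changes those two
    entries, so instead of re-checking the whole board we only test
    whether the two updated rows still use unique diagonal ids."""
    cols[i], cols[j] = cols[j], cols[i]
    for r in (i, j):                     # recompute ids of the swapped rows
        diag1[r] = r + cols[r]
        diag2[r] = r - cols[r]
    return all(diag1.count(diag1[r]) == 1 and diag2.count(diag2[r]) == 1
               for r in (i, j))
```

Starting from the valid N=4 board cols = [1, 3, 0, 2], swapping rows 0 and 1 puts two queens on one diagonal, and this check catches it by looking at just those two rows.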
So, that was a lot of theory! Where’s the action? Where’s the code? Unfortunately I have no code that has all the optimizations implemented or is in shareable condition. I do hope that the above gives you some insight into the puzzle, to speed up existing implementations.
Also unfortunate: still today you can find “new” implementations that are slower than my initial version, even though that code runs on modern hardware. But I assume you want to see what all these optimizations result in.
Luckily I found an implementation on Github by Ole and Tim Pöschl, who had the same starting points as I had and looked into (several of) the same approaches. Do check out their code – “FAF” stands for “fast and fun” of course.
The post The Fastest Payroll System Of The World appeared first on StreamHPC.
At StreamHPC we do several very different types of projects, but this project was very, very different. For one, it was nowhere close to scientific simulation or media processing. Our client, Intersoft Solutions, asked us to speed up thousands of payroll calculations on a GPU.
They wanted to solve a simple problem, avoiding slow conversations with HR of large companies:
Yes, I can answer your questions.
For that I need to do a test-run.
Please come back tomorrow.
The calculation of 1600 payslips took one hour. This means 10,000 employees would take over 6 hours. Potential customers appreciated the clear advantages of Intersoft’s solution, but said that they were looking for a faster solution in the first place.
Using our accelerated compute engine, a run with 3300 employees (anonymised, real data) now only takes 20 seconds, including loading and writing all data to the database – a speedup of about 250 times. Calculations for 100k employees finish in under 2 minutes – the above HR department would have liked that.
The code consisted of several hundred megabytes of Delphi code built up over the years, which made the project a lot less standard to do.
The interesting part: payrolling needs large flexibility to cope with continuously changing laws and rulings within a country. This is a main reason most payroll companies do not work internationally. Intersoft’s software can be configured to work in any branch or country.
Code that offers both correctness and flexibility is less easy to port to the GPU, which expects very straightforward instructions – we needed to be very creative to convert the internal scripting language to OpenCL.
After about 2 weeks the test input was processed correctly in seconds. After that the client worked with us to implement redesigns of the database loading and storing, so the OpenCL calculations could be fed data fast enough.
The code now runs correctly on AMD and NVIDIA GPUs, and (for backwards compatibility) on CPUs. We have not tested on Intel Arc. Nor did we benchmark it further, as this type of software has too many interdependencies to define a good benchmark case. Hence we keep it at mentioning only that ~250x speedup.
Most software we write is in CUDA and HIP. OpenCL is C-based and less convenient, so why did we choose this language? The reason is the built-in compiler: every time the script or configuration changes, OpenCL can be generated for it and compiled on the spot. Even just-in-time optimizations, based on the provided data, can be applied. With CUDA, HIP or SYCL this would have taken much more effort.
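The generate-then-compile idea can be illustrated with a toy sketch. This is not Intersoft’s actual generator – the rule format, names and the 37% tax rate are all made up – but it shows the shape of the approach: turn configuration into OpenCL source that evaluates every employee in parallel.

```python
def generate_kernel(rules):
    """Toy code generator: turn (output, expression) rule pairs from a
    hypothetical payroll configuration into an OpenCL kernel body.
    Each work-item computes one employee's payslip."""
    body = "\n".join(f"    {out} = {expr};" for out, expr in rules)
    return f"""__kernel void payroll(__global const float *gross,
                      __global float *net) {{
    size_t i = get_global_id(0);
    float salary = gross[i];
    float tax, net_pay;
{body}
    net[i] = net_pay;
}}"""

# Hypothetical rules; the generated string would be handed to
# clCreateProgramWithSource + clBuildProgram at run time.
rules = [("tax", "salary * 0.37f"), ("net_pay", "salary - tax")]
src = generate_kernel(rules)
```

Because the kernel is rebuilt whenever the configuration changes, constants from the rules can be baked straight into the compiled code – the just-in-time optimization mentioned above.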
You might say that OpenCL kernels can be reverse-engineered more easily than C++-based languages. But in this project the real IP is in the OpenCL code generation and in the software to visually change how payroll calculations are done, so reverse engineering does not make much sense: the offender would need to reverse engineer the output again with every change.
Intersoft is a provider of Payrolling Systems that can be adapted to any country and branch, with a specialization in the cleaning industry. Other software is created for Time & Task Tracking, Staff Planning & Scheduling, Forecasting & Workload Monitoring, Customer Surveys & Analysis and Incentive Programs.
Intersoft says that their customers are attracted by the combination of flexibility and correctness of the solution – unique in its market. Now performance is added to that list, so companies with over 10,000 employees can now be served.
Where do you see software that is unexpectedly slow, that is making it difficult to provide a good service?
The post How to get full CMake support for AMD HIP SDK on Windows – including patches appeared first on StreamHPC.
Written by Máté Ferenc Nagy-Egri and Gergely Mészáros
Disclaimer: if you’ve stumbled across this page in search of fixing up the ROCm SDK’s CMake HIP language support on Windows and care only about the fix, please skip to the end of this post to download the patches. If you wish to learn some things about ROCm and CMake, join us for a ride.
The recent release of AMD’s ROCm SDK on Windows brings a long-awaited rejuvenation of developer tooling for offload APIs. Undoubtedly its most anticipated feature is a HIP-capable compiler. The runtime component amdhip64.dll has been shipping with AMD Software: Adrenalin Edition for multiple years now, and with some trickery one could consume the HIP host-side API by taking the API headers from GitHub (or a Linux ROCm install) and creating an export lib from the driver DLL. Feeding device code compiled offline to HIP’s Module API was attainable, yet cumbersome. The anticipation is driven by the single-source compilation model HIP borrowed from CUDA. That is finally available* now!
[*]: That is, if you are using Visual Studio and MSBuild, or legacy HIP compilation atop CMake CXX language support.

We cannot fathom how an ecosystem that has existed on Linux for 7 years and has relied primarily on CMake as its build system could ship with unfinished CMake support. Starting from 3.21, CMake has native support for the HIP language. We understand that the Windows release wanted to make Visual Studio support a priority (driven by the HIP VS Extension), but showing the back seat to the myriad of existing projects on Linux is a missed opportunity at best. StreamHPC being the authors of multiple larger libraries – rocRAND, rocPRIM, rocThrust and hipCUB – we took great care while developing rocm-examples that our and others’ existing workflows do not break.
Using HIP in CMake is summarized well on the Using CMake page at AMD’s doc site, so we won’t repeat all that’s there.

The main point is that CMake only ships half of the HIP language support: the part defining the HIP-specific variables, properties etc. How that translates into toolchain invocations lives in separate scripts shipping with the toolchains themselves. This second part is what’s half-baked in the initial Windows release of ROCm (documented under the ROCm on Windows section).
Join us for the fix in CMake and the thought process behind it to enable developing HIP on Windows using CMake HIP language support.
We will be riding the waves of CMake on the command-line. For this, minimally install the following SW:
We clone the rocm-examples repo and try to configure a build using defaults for everything.
PS C:\Users\User\rocm-examples> cmake -S . -B build
-- Building for: Visual Studio 17 2022
...
CMake Error at C:/Program Files/Kitware/CMake/3.27.1/share/cmake-3.27/Modules/CMakeDetermineHIPCompiler.cmake:10 (message):
  HIP language not currently supported by "Visual Studio 17 2022" generator
Looking at the relevant part of CMakeDetermineHIPCompiler.cmake:10, we can see that this is a hardcoded error inside CMake. The default generator of CMake on Windows is the latest Visual Studio found on the system. HIP support in MSBuild (VS’s build system) is driven by the “AMD HIP Toolchain” extension. There are two blockers here:
If the inability to use the Visual Studio generators bothers you, consider upvoting this and this issue on AMDâs and Kitwareâs trackers.
Moving on to using Ninja (non Multi-Config for brevity, however we advise defaulting to Ninja Multi-Config in most cases):
PS C:\Users\User\rocm-examples> cmake -G Ninja -S . -B build
-- The CXX compiler identification is MSVC 19.36.32537.0
...
-- The HIP compiler identification is unknown
CMake Error at C:/Program Files/Kitware/CMake/3.27.1/share/cmake-3.27/Modules/CMakeDetermineHIPCompiler.cmake:104 (message):
  Failed to find ROCm root directory.
“HIP compiler identification is unknown”, “Failed to find ROCm root directory”… so after installing the ROCm SDK, CMake won’t find it in its default location? Well, skimming through CMakeDetermineHIPCompiler.cmake, we soon find that right after the previous hardcoded error, if CMAKE_HIP_COMPILER is not set, CMake enters a guessing game of where it may be, potentially guided by the HIPCXX environment variable. Unless a compiler is found, the last desperate attempt is to consult the CMAKE_HIP_COMPILER_ROCM_ROOT variable, which it will use as a guide to finding the SDK-side script driving the language support.
Environment variables by nature diffuse: they leak into every context and are observable at every level of a stack. This is why we love using them at times when we wish to push some information to the bottom of a call stack or a toolchain without using global variables or extra function arguments. The danger of env vars is exactly the same: they leak into contexts we may not have intended, and they change behaviour at a global scale. If we know the purpose and there’s a convenient way to restrict the effect to some specific context, we should prefer that method.
We chose the most up-front method and set CMAKE_HIP_COMPILER on the command-line.
PS C:\Users\User\rocm-examples> cmake -G Ninja -D CMAKE_HIP_COMPILER="${env:HIP_PATH}bin\clang++.exe" -S . -B build
-- The HIP compiler identification is Clang 17.0.0 with GNU-like command-line
CMake Error at C:/Program Files/Kitware/CMake/3.27.1/share/cmake-3.27/Modules/CMakeDetermineHIPCompiler.cmake:143 (message):
The ROCm root directory:
C:/Program Files/AMD/ROCm/5.5
does not contain the HIP runtime CMake package, expected at one of:
C:/Program Files/AMD/ROCm/5.5/lib/cmake/hip-lang/hip-lang-config.cmake
C:/Program Files/AMD/ROCm/5.5/lib64/cmake/hip-lang/hip-lang-config.cmake
Notice how we used the HIP_PATH env var, created by the installer, in the value of CMAKE_HIP_COMPILER (note it has a trailing backslash). This way, should we update our toolchain, we’ll always get the latest compiler. This time around CMake argues that it can’t find hip-lang-config.cmake. This is the file that maps the CMake properties to toolchain invocations. Unfortunately AMD managed to omit it from the SDK, even though the folder and the other files in that folder are there.
PS C:\Users\User\rocm-examples> Get-ChildItem ${env:HIP_PATH}lib\cmake\hip-lang
Directory: C:\Program Files\AMD\ROCm\5.5\lib\cmake\hip-lang
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 7/21/2023 11:17 AM 922 hip-lang-targets-release.cmake
-a---- 7/21/2023 11:17 AM 3577 hip-lang-targets.cmake
This file usually comes from the hipamd component (AMD’s implementation of the HIP interface). If we don’t want to write it manually, we can just take the version from a Linux installation and tweak it slightly.
Got hip-lang-config.cmake and hip-lang-targets.cmake. Next try:
PS C:\Users\User\rocm-examples> cmake -G Ninja -D CMAKE_HIP_COMPILER="${env:HIP_PATH}bin\clang++.exe" -S . -B build
-- The HIP compiler identification is Clang 17.0.0 with GNU-like command-line
CMake Error at build/CMakeFiles/3.27.1/CMakeHIPCompiler.cmake:1 (set):
Syntax error in cmake code at
C:/Users/User/rocm-examples/build/CMakeFiles/3.27.1/CMakeHIPCompiler.cmake:1
when parsing string
C:\Program Files\AMD\ROCm\5.5\bin\clang++.exe
Invalid character escape '\P'.
CMake doesn’t have a very sophisticated scripting language and its type system is fairly weak, so it’s easy to get into trouble on Windows with backslashes in paths being interpreted as escape characters. To fix this issue, specify what type of cache variable to define on the command line using -D VAR_NAME:VAR_TYPE=VAR_VALUE. Next try:
PS C:\Users\User\rocm-examples> cmake -G Ninja -D CMAKE_HIP_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -S . -B build
-- The HIP compiler identification is Clang 17.0.0 with GNU-like command-line
CMake Error at C:/Program Files/Kitware/CMake/3.27.1/share/cmake-3.27/Modules/Platform/Windows-Clang.cmake:170 (message):
The current configuration mixes Clang and MSVC or some other CL compatible
compiler tool. This is not supported. Use either clang or MSVC as both C,
C++ and/or HIP compilers.
This error is a quality-of-life improvement introduced in CMake 3.25, although CMake HIP language support only needs 3.21. The reason behind it is that CMake decides globally whether to use MSVC-like or GNU-like command-line syntax. Mixing and matching the two will result in compiler or linker errors. If a project uses both CXX and HIP language support, they must use the same style of command line.
So we specify the C++ compiler to be ROCmCC as well and clean configure (toolchain selection is cached and one cannot change the compiler after it has been found):
PS C:\Users\User\rocm-examples> cmake -G Ninja -D CMAKE_CXX_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -D CMAKE_HIP_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -S . -B build
-- The CXX compiler identification is Clang 17.0.0 with GNU-like command-line
...
-- The HIP compiler identification is Clang 17.0.0 with GNU-like command-line
CMake Error at C:/Program Files/Kitware/CMake/3.27.1/share/cmake-3.27/Modules/CMakeFindDependencyMacro.cmake:76 (find_package):
By not providing "FindAMDDeviceLibs.cmake" in CMAKE_MODULE_PATH this
project has asked CMake to find a package configuration file provided by
"AMDDeviceLibs", but CMake did not find one.
Could not find a package configuration file provided by "AMDDeviceLibs"
with any of the following names:
AMDDeviceLibsConfig.cmake
amddevicelibs-config.cmake
Add the installation prefix of "AMDDeviceLibs" to CMAKE_PREFIX_PATH or set
"AMDDeviceLibs_DIR" to a directory containing one of the above files. If
"AMDDeviceLibs" provides a separate development package or SDK, be sure it
has been installed.
Some of the package config files are missing from the installation, as requested from hip-lang-config.cmake:87 (find_dependency). For the uninitiated: find_dependency is essentially find_package, but meant to be used inside package config files that are themselves loaded by find_package. This file is indeed missing; we borrow it from Linux as well.
Our next attempt gives the exact same error, and that is because of one of the most subtle package search quirks of CMake. If you consult the Config Mode Search Procedure of find_package, you’ll find:

Even though AMD’s APT repo doesn’t add the install to the PATH, testing is often done in rocm-terminal, where it is on the PATH. /bin is taken off the end of PATH entries, hence /opt/rocm/ ends up on the find_package search prefix list. The Windows installer (correctly) doesn’t add the installed bin folder to the PATH, so hip-lang-config.cmake needs HINTS added to its find_dependency invocations to help it find its own components. The CXX piggyback script, hip-config.cmake, ever so casually sidesteps this issue by not exposing the problematic components on Windows as a “hotfix”. The undeployed hip-lang-config.cmake has not yet received this treatment.
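The kind of patch we applied can be sketched like this – an illustrative fragment, not AMD’s verbatim file; the helper variable name is ours, and the dependency names come from the errors above:

```cmake
# Sketch: inside hip-lang-config.cmake, derive the SDK root from the config
# file's own location (it sits at <ROOT>/lib/cmake/hip-lang) rather than
# relying on PATH-based search prefixes.
get_filename_component(_hip_lang_root "${CMAKE_CURRENT_LIST_DIR}/../../.." ABSOLUTE)

include(CMakeFindDependencyMacro)
find_dependency(AMDDeviceLibs HINTS "${_hip_lang_root}/lib/cmake")
find_dependency(amd_comgr     HINTS "${_hip_lang_root}/lib/cmake")
```

With the HINTS anchored to the file itself, the package finds its siblings no matter where the SDK was installed.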
After we’ve added the HINTS, we are within arm’s reach of our goal. Next we get:
PS C:\Users\User\rocm-examples> cmake -G Ninja -D CMAKE_CXX_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -D CMAKE_HIP_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -S . -B build
-- The HIP compiler identification is Clang 17.0.0 with GNU-like command-line
CMake Error at C:/Program Files/AMD/ROCm/5.5/lib/cmake/amd_comgr/amd_comgr-targets.cmake:75 (message):
The imported target "amd_comgr" references the file
"C:/Program Files/AMD/ROCm/5.5/lib/libamd_comgr.so.2.5.50500"
but this file does not exist. Possible reasons include:
* The file was deleted, renamed, or moved to another location.
* An install or uninstall procedure did not complete successfully.
* The installation package was faulty and contained
"C:/Program Files/AMD/ROCm/5.5/lib/cmake/amd_comgr/amd_comgr-targets.cmake"
but not all the files it references.
Of course, if we borrow files from Linux, they’ll refer to Linux shared objects. Patching up amd_comgr-config.cmake, which we also borrowed from Linux, we’ll get:
PS C:\Users\User\rocm-examples> cmake -G Ninja -D CMAKE_CXX_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -D CMAKE_HIP_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -S . -B build
-- The HIP compiler identification is Clang 17.0.0 with GNU-like command-line
CMake Error at C:/Program Files/AMD/ROCm/5.5/lib/cmake/hip-lang/hip-lang-targets.cmake:82 (message):
The imported target "hip-lang::amdhip64" references the file
"C:/constructicon/builds/gfx/two/23.10/drivers/compute/install/native/Release/x64/hip/lib/amdhip64.lib"
but this file does not exist. Possible reasons include:
* The file was deleted, renamed, or moved to another location.
* An install or uninstall procedure did not complete successfully.
* The installation package was faulty and contained
"C:/Program Files/AMD/ROCm/5.5/lib/cmake/hip-lang/hip-lang-targets.cmake"
but not all the files it references.
Oooh, “C:/constructicon/builds/gfx/two/23.10/…” looks very much like a hardcoded build server path. Of course, nobody hardcoded this; it’s CMake generating this file. A seasoned CMake developer at this point inspects all the scar tissue covering their body, their mind’s eye replays a few episodes of scripting nightmares long past, and from the depths of recollection emerges a passage from the install() command’s DESTINATION documentation.

Following our scar-tissue-induced lead, we zoom in on the installation of hip-lang-targets, and while the targets are installed in proper GNUInstallDirs locations, CONFIG_LANG_PACKAGE_INSTALL_DIR is suspiciously non-standard. Tracing its source value through a layer of indirection, it is indeed prefixed with CMAKE_INSTALL_PREFIX, making it an absolute path and thus rendering the package non-relocatable. This happens to work on Linux, where packages on build servers likely target /opt/rocm, just like the actual DEB package install paths: the hardcoded package path happens to coincide between the build server and the user’s machine.
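For reference, the relocatable idiom CMake itself generates into export files computes the prefix from the location of the loaded script instead of baking in an absolute install prefix – roughly:

```cmake
# Compute the install prefix relative to this file's own location
# (the file sits at <PREFIX>/lib/cmake/hip-lang/hip-lang-targets.cmake,
# so strip three directory levels plus the file name).
get_filename_component(_IMPORT_PREFIX "${CMAKE_CURRENT_LIST_FILE}" PATH)
get_filename_component(_IMPORT_PREFIX "${_IMPORT_PREFIX}" PATH)
get_filename_component(_IMPORT_PREFIX "${_IMPORT_PREFIX}" PATH)
get_filename_component(_IMPORT_PREFIX "${_IMPORT_PREFIX}" PATH)
# _IMPORT_PREFIX now points at the install root, wherever the SDK was copied.
```

Imported target locations expressed in terms of _IMPORT_PREFIX survive relocation; absolute paths from the build server do not.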
Fixing that up to be a path relative to the config file itself, we get the following:
PS C:\Users\User\rocm-examples> cmake -G Ninja -D CMAKE_CXX_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -D CMAKE_HIP_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -S . -B build
-- The HIP compiler identification is Clang 17.0.0 with GNU-like command-line
CMake Error at C:/Program Files/AMD/ROCm/5.5/lib/cmake/hip-lang/hip-lang-targets.cmake:88 (message):
The imported target "hip-lang::amdhip64" references the file
"C:/Program Files/AMD/ROCm/5.5/bin/amdhip64.dll"
but this file does not exist. Possible reasons include:
* The file was deleted, renamed, or moved to another location.
* An install or uninstall procedure did not complete successfully.
* The installation package was faulty and contained
"C:/Program Files/AMD/ROCm/5.5/lib/cmake/hip-lang/hip-lang-targets.cmake"
but not all the files it references.
The amdhip64.dll doesn't ship with the SDK but with Radeon Software. It's part of the driver, hence it's found in C:\Windows\System32.
(Note that recording the DLL as an IMPORTED_LOCATION property on hip-lang::amdhip64 introduces a requirement that the build machine has a Radeon driver installed. In most cases this shouldn't be an issue, but omitting it would render CMake oblivious to the DLL's existence. A proper fix from AMD will have to consider how to expose the DLL.)
Fixing that up, configuration finally passes. We can now build our code using ninja:
PS C:\Users\User\rocm-examples> cmake --build build --target hip_saxpy
[1/2] Building HIP object HIP-Basic/saxpy/CMakeFiles/hip_saxpy.dir/main.hip.obj
[2/2] Linking HIP executable HIP-Basic\saxpy\hip_saxpy.exe
If the samples complain about some device binary missing from the executable, that means the wrong device binaries were embedded into the executable. The target property HIP_ARCHITECTURES controls which binary flavours are embedded into a target. Typically one sets it using the variable CMAKE_HIP_ARCHITECTURES, which the docs suggest setting explicitly; otherwise the implementation chooses some defaults. CMakeDetermineHIPCompiler.cmake tries to invoke rocm_agent_enumerator, an extremely lightweight utility that just lists ROCm devices, and parses the gfxXYZ entries from its output. This utility is missing from Windows for good reasons.
The runtime behind ROCm is ROCclr (ROCm Common Language Runtime), which on Linux is implemented by ROCr, turning API calls into amdkfd** (AMD's kernel-mode driver) calls. ROCclr exists on Windows too, but instead of amdkfd it's implemented by PAL, the Platform Abstraction Layer. As a consequence, none of rocm_agent_enumerator's existing device-search methods would work on Windows, so even if the utility were present, it wouldn't function. The rocminfo utility's place is taken by hipInfo.exe.
To find out which device binary formats we need to embed into our HIP executables, we have to query the devices in our system, e.g.:
PS C:\Users\User\rocm-examples> & "$env:HIP_PATH\bin\hipInfo.exe" | Select-String gfx
gcnArchName: gfx1032
gcnArchName: gfx1035
So by specifying these graphics IP names as the value of the HIP_ARCHITECTURES target property, all our compiled code should function correctly.
PS C:\Users\User\rocm-examples> cmake -G Ninja -S . -B build -DCMAKE_CXX_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -DCMAKE_HIP_COMPILER:FILEPATH="${env:HIP_PATH}bin\clang++.exe" -DCMAKE_HIP_ARCHITECTURES="gfx1032;gfx1035"
Running the sample codes should look something like this:
PS C:\Users\User\rocm-examples\build> .\HIP-Basic\saxpy\hip_saxpy.exe
Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
Do note that CMake's default build configuration is Debug. Performance should improve by also adding -D CMAKE_BUILD_TYPE=Release to the configuration command line.
[**] Fun fact: the name comes from the earliest incarnations of HSA, the Heterogeneous System Architecture. This standard was coined shortly after AMD acquired ATI to obtain their vast range of graphics IP. The first APUs were born, AMD's marketing slogan at the time was "The future is Fusion", and the SW stack implementing these integration features was called FSA, Fusion System Architecture. Because other vendors, primarily ARM, were interested in this design, it was spun off into a separate consortium, but naturally ARM disliked the fact that AMD brand naming was engraved into the now-standard technology. Thus FSA was renamed to HSA. Much of AMD's implementation, however, already used Fusion in the naming of SW components, and some of those remain to this day. The driver name amdkfd expands to AMD Kernel Fusion Driver, but really it set out to implement what is ultimately called HSA today.
Capturing the above in a series of patch files, one may patch the AMD HIP SDK for Windows to enable CMake HIP language support. Download the ZIP file with the patches from here, extract it anywhere on your disk, and from the folder containing the patches, using Git on a command line with Administrator privileges, patch your installation:
Get-ChildItem .\00*.patch | ForEach-Object { git apply --unsafe-paths "--directory=${env:HIP_PATH}." --ignore-whitespace --whitespace=nowarn $_.FullName }
The post How to get full CMake support for AMD HIP SDK on Windows – including patches appeared first on StreamHPC.
The post Improving FinanceBench for GPUs Part II – low hanging fruit appeared first on StreamHPC.
Following the initial work done in porting the CUDA code to HIP (follow article link here), significant progress was made in tackling the low-hanging fruit in the kernels and in tackling potential structural problems outside of the kernels.
Additionally, since the last article, we've been in touch with the authors of the original repository. They've even invited us to update their repository too; for now it will be on our repository only. We also learnt that the group's lead, professor John Cavazos, passed away 2 years ago. We hope he would have liked that his work has been revived.
Link to the paper is here: https://dl.acm.org/doi/10.1145/2458523.2458536
Scott Grauer-Gray, William Killian, Robert Searles, and John Cavazos. 2013. Accelerating financial applications on the GPU. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). Association for Computing Machinery, New York, NY, USA, 127–136. DOI: https://doi.org/10.1145/2458523.2458536
We could have chosen to rewrite the algorithms from scratch, but first we need to understand the algorithms better. Also, with the existing GPU code we can quickly assess what the problems of the algorithm are, and see if we can get to high performance without too much effort. In this blog we show these steps.
As a refresher, besides porting the CUDA code to HIP, some restructuring of the code and build system was also done. Such improvements are a standard phase in all projects we do, to make sure we spend the minimum time on building, testing and benchmarking.
The original code only measured the compute times. In the new benchmarks, compute times and transfer times are measured separately.
Note: for the new benchmarks we used more recent AMD and Nvidia drivers (ROCm 3.7 and CUDA 10.2).
QMC (Sobol) Monte-Carlo method (Equity Option Example)
Below are the results of the original code ported to HIP without any initial optimisations.
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 215.311 | | 437.140 | | 877.425 | |
| Titan V | 24.712 | 25.29 | 542.859 | 543.930 | 1595.674 | 1597.240 |
| GTX980 | 75.832 | 76.852 | 1754.586 | 1755.851 | 5120.555 | 5127.255 |
| Vega 20 | 11.07 | 12.161 | 19.408 | 22.578 | 36.112 | 40.575 |
| MI25 (Vega 10) | 12.91 | 13.964 | 25.247 | 26.662 | 49.551 | 51.983 |
| s9300 (Fiji) | 42.909 | 42.839 | 85.739 | 89.463 | 169.858 | 174.248 |
The Monte-Carlo algorithm was observed to be compute-bound, which made it easy to identify the low-hanging fruit in the kernel.
After tackling it, a big speed-up can be observed on the Nvidia GPUs, and a considerable speed-up on the AMD GPUs as well.
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 215.311 | | 437.140 | | 877.425 | |
| Titan V | 9.467 | 9.831 | 17.638 | 18.466 | 34.281 | 35.765 |
| GTX980 | 22.578 | 23.223 | 44.923 | 46.013 | 90.108 | 100.584 |
| Vega 20 | 4.456 | 5.003 | 8.570 | 10.117 | 17.003 | 19.626 |
| MI25 (Vega 10) | 9.118 | 9.663 | 17.995 | 18.918 | 35.760 | 42.621 |
| s9300 (Fiji) | 17.802 | 18.477 | 35.086 | 36.795 | 69.746 | 73.015 |


Black-Scholes-Merton Process with Analytic European Option engine
Below are the results of the original code ported to HIP without any initial optimisations.
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.095 | 5.181 | 0.407 | 25.111 | 0.792 | 49.890 |
| GTX980 | 2.214 | 7.294 | 10.662 | 34.534 | 19.946 | 68.626 |
| Vega 20 | 0.201 | 7.105 | 0.894 | 33.693 | 1.596 | 70.732 |
| MI25 (Vega 10) | 1.090 | 7.657 | 3.453 | 41.343 | 6.000 | 74.387 |
| s9300 (Fiji) | 1.048 | 11.524 | 4.635 | 53.246 | 9.170 | 129.262 |
The Black-Scholes algorithm was observed to be memory-bound, so there were both some low-hanging fruit and structural problems to tackle.
The first step was to tackle the low-hanging fruit in the kernel. A decent speed-up in the compute times could be observed on most GPUs (except for the Titan V).
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.084 | 5.092 | 0.367 | 24.549 | 0.722 | 49.956 |
| GTX980 | 1.458 | 6.281 | 6.402 | 30.584 | 12.790 | 61.965 |
| Vega 20 | 0.114 | 6.995 | 0.397 | 33.571 | 0.775 | 69.858 |
| MI25 (Vega 10) | 0.517 | 6.929 | 1.836 | 30.490 | 3.021 | 64.961 |
| s9300 (Fiji) | 0.423 | 10.621 | 1.931 | 48.417 | 3.733 | 81.898 |
With the algorithm being memory-bound, the next step was to tackle the structural problems.
The results can be found below, where transfer times on all GPUs improved.
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | | 22.194 | | 43.994 | |
| Titan V | 0.068 | 3.937 | 0.286 | 18.258 | 0.565 | 36.479 |
| GTX980 | 1.290 | 5.096 | 6.387 | 24.337 | 12.758 | 48.578 |
| Vega 20 | 0.121 | 5.067 | 0.447 | 25.809 | 0.827 | 47.541 |
| MI25 (Vega 10) | 0.506 | 4.861 | 1.841 | 23.580 | 3.115 | 53.006 |
| s9300 (Fiji) | 0.444 | 7.416 | 2.002 | 36.859 | 3.922 | 64.056 |



Securities repurchase agreement
Below are the results of the original code ported to HIP without any initial optimisations.
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 186.928 | | 369.718 | | 732.446 | |
| Titan V | 19.678 | 32.241 | 35.727 | 60.951 | 70.673 | 120.402 |
| GTX980 | 387.308 | 402.682 | 767.263 | 793.159 | 1520.351 | 1578.572 |
| Vega 20 | 14.771 | 37.174 | 28.595 | 69.743 | 56.699 | 131.652 |
| MI25 (Vega 10) | 46.461 | 71.191 | 91.742 | 143.673 | 182.137 | 277.597 |
| s9300 (Fiji) | 77.615 | 107.822 | 153.334 | 217.205 | 306.206 | 418.602 |
The Repo algorithm was observed to be compute-bound, but it also relies on purely double-precision operations. There was no obvious low-hanging fruit in the kernel, and the structure of the data was found to be rather complex (a mixture of Struct-of-Arrays and Array-of-Structs that are intertwined). Additionally, there are far too many transfer calls for different inputs and outputs, so saturating the transfers with multiple non-blocking streams isn't effective. Also, the current state of the CUDA/HIP implementation works best on GPUs that have good double-precision performance.
There are improvements possible, but these need a larger effort.
Fixed-rate bond valuation with flat forward curve
Below are the results of the original code ported to HIP without any initial optimisations.
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 241.248 | | 482.187 | | 952.058 | |
| Titan V | 31.918 | 49.618 | 61.209 | 97.502 | 123.750 | 195.225 |
| GTX980 | 746.728 | 1117.349 | 1494.679 | 2233.761 | 2976.876 | 4470.009 |
| Vega 20 | 40.112 | 66.460 | 77.123 | 127.623 | 152.657 | 250.067 |
| MI25 (Vega 10) | 141.908 | 215.855 | 278.618 | 425.969 | 553.423 | 844.268 |
| s9300 (Fiji) | 229.011 | 340.212 | 451.539 | 699.059 | 891.284 | 1361.436 |
The Bonds algorithm was observed to be even more compute-bound than the Repo algorithm, and it also relies on purely double-precision operations. The same problems were observed as with the Repo algorithm: no low-hanging fruit could be easily identified, and the structure of the data is complex. That said, unlike the Repo algorithm, there aren't as many transfers of inputs/outputs, making it possible to use multiple streams.
The results can be found below, where 2 streams are used to transfer all the data.
| Size | 262144 | | 524288 | | 1048576 | |
| | Compute | + Transfers | Compute | + Transfers | Compute | + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 241.248 | | 482.187 | | 952.058 | |
| Titan V | 31.918 | 45.988 | 61.209 | 89.198 | 123.750 | 178.688 |
| GTX980 | 746.728 | 770.180 | 1494.679 | 1527.538 | 2976.876 | 3043.102 |
| Vega 20 | 40.112 | 59.216 | 77.123 | 113.009 | 152.657 | 216.965 |
| MI25 (Vega 10) | 141.908 | 156.164 | 278.618 | 310.922 | 553.423 | 637.924 |
| s9300 (Fiji) | 229.011 | 256.373 | 451.539 | 493.679 | 891.284 | 981.604 |


The improvements described above produced good results, as improvements across the algorithms (except Repo) could be observed. In combination with newer drivers from AMD and Nvidia, general improvements can also be observed when compared to the results obtained in the previous article.
That said, there is currently a bug in AMD's current drivers where data transfers are slower; we will update this blog with the results once this is fixed in a future driver release.
What's next? Looking for the high-hanging fruit in both the CPU and GPU implementations of the algorithms. This is the next step in achieving better performance, as we've hit the limit of optimising the current implementations.
Milestones we have planned:
Finance algorithms on GPUs are often not really optimised for performance, which is quite unexpected. A company that can react in minutes instead of a day is more competitive, and especially in finance this is crucial.
As you have seen, we achieved quite some speed-up with a relatively small investment. When we design code from scratch, we can move in the right direction more quickly. The difficulty with financial software is that it often needs a holistic approach.
We can help your organization:
Feel free to contact us for inquiries, but also about sponsoring algorithms for the benchmark.
The post The Art of Benchmarking appeared first on StreamHPC.
This article focuses on the foundations of solid benchmarking, so it helps you to decide which discussions to have with your team. It is not the full book.
There will be multiple blog posts coming in this series, which will be linked at the end of the post when published.

Even when the answer depends on many variables, answers can be given. These answers are best described as 'insights', and this blog is about exactly that.
First the commercial message, so we can focus on the main subject. As benchmark-design is not always obvious, we help customers to set up a system that plugs into a continuous integration system and gives continuous insights. More about that in an upcoming blog.
We see benchmarking as providing insights, in contrast with producing a stopwatch number. Going back to F1: being second in the race means the team probably wants to know these:
This translates quite well to software. While we speak about “doing benchmarks”, the better name would be “getting insights” – benchmarking is just the tool to get these. Most used are:
Profiling is what many developers do, with different levels of expertise:
So it boils down to understanding what types of problems there are, how well one understands the mechanics behind tools like GDB, and experience with such tools.
Developers, by training and mindset, already think of automation by default. Still, automating the supporting work is not always a logical step. Think of enjoying building algorithms that solve labyrinths versus writing code that measures the efficiency of those algorithms – for various reasons the second part is more often done manually.
This is a part of the art of problem solving, of which we briefly discussed one aspect recently. How do you make something that is neither under-engineered nor over-engineered? The first steps of problem solving are all about clearly defining the problem, while the later steps make sure a possible solution keeps the original problem in mind.
Problems are easily found by measuring the summed time that profiling and benchmarking take as part of the whole project. Compare that to the priorities of the original problems that need to be solved.
Deciding that benchmarking is important for a project is half the work. By having the right discussions using the points in this blog article, you can easily get to good benchmark foundations.
Benchmarking is obvious initially, but quickly becomes a specialization. We have thought out how to keep the learning curve less steep for longer, which we will share later.
Want to know more? Watch this blog for follow-ups. Want to know more right now? Then get in contact.
The post Birthday present! Free 1-day Online GPGPU crash course: CUDA / HIP / OpenCL appeared first on StreamHPC.
Now that Corona (and the fear of it) is spreading, we had to rethink how to celebrate 10 years. So while there were different plans, we simply had to adapt to the market and world dynamics.

Getting your mind tuned from CPU-focused to GPU-focused is half the work. This is the first goal of the day.
The second goal is hands-on practice. We'll provide templates for CUDA/HIP and OpenCL, where you focus on making the core compute part.
The third goal is learning from a Q&A. Here you can tap into 10 years of our experience with HPC and GPUs, as we have done everything from the strangest to the most ordinary projects.
The training is very basic, but there will be a lot of aha!-moments that will help in making better decisions for adding GPU support to your project.
This is the agenda:
In between there will be short lab sessions on your own laptop.
More information can be found in the description of the full training. Especially if you want a full in-company training, that page tells much more.
Copy the template below and send it to [email protected]. A personal text is appreciated, of course.
Hi Stream Team,
Congrats with 10 years. …
Kind regards,
We always contact you within 5 days, but we try for the same day.
The exact way of organising depends on the number of interested people and their time zones. As the training is not designed as a MOOC, we limit the number of participants – first come, first served. We'll share updates on Twitter, on LinkedIn and in updates to this blog post.
Looking forward to virtually meet you!
Vincent Hindriksen and the Stream Team
The post Problem solving tactic: making black boxes smaller appeared first on StreamHPC.
Assumption is the mother of all mistakes
Eugene Lewis Fordsworthe
A colleague put "Assumptions are the mother of all fuckups" on the wall, because we should be assuming that we assume. The problem is that we want to have full control and make faster decisions, and assumptions conveniently fill in all those scary unknowns.
So if you use the words "solving a problem", you visualise the problem disappearing. Other descriptions, like "patching a problem" (which is often more accurate), are seen as bad, incomplete and such. Therefore I prefer describing problems as black boxes that only get smaller or bigger – they don't become more or less transparent, and they never disappear. This is both visual and more accurate.
Donald Rumsfeld handled assumptions by putting the knowns into four categories (and how I translate them):
| Known knowns (facts) | Known unknowns (unanswered questions) |
| Unknown knowns (assumptions) | Unknown unknowns (missing questions) |
Here the second row describes what is inside the black box – it is an illusion that you will ever know everything. Rumsfeld probably based this on a classic quote:
The more you know the more you know you don’t know
Aristotle
Let's start with reverse engineering code, before going to porting. That's often seen as a very special task, where nobody knows what the code is actually doing. Documentation never written, sources lost, people left – it's one huge black box. This sounds like a horror to many managers, but the skills needed to do it are actually required for porting code from the CPU to the GPU. One of those skills is black-box thinking.
The reason that even the world's clearest code contains black boxes is that developers have assumptions about anything they don't fully master. And we can all assume that nobody can master everything, as that would require predicting the future. So when making a flow diagram of the code, you see the data flow from black box to black box, where each box has a list of observations around the known knowns and known unknowns.
Some black boxes are a bit larger than others, as the team thinks the chances are higher that something there is not fully under control, but cannot specify what – something with unexpected input, whatever that could be. In such a case we simply decide to keep the box large. There is also a list of known unknowns (assumptions of the original developer) we haven't tested yet, but we know from experience that these happen. We assume the list is incomplete, and when we get feedback from benchmarks, tests or the customer, we already have ideas where to find a solution.
If we had only a list of problems (known knowns), we would have finished the project earlier, but it would not be as good – simply because the first question is automatically "what did we assume wrong?".
Just describe a problem as a network of black boxes. The rest will follow through. It's as simple as that!

If you want to learn more, I’m not the last to say: come work with us. We love to work with curious people. Check out our project success stories and jobs to learn more.
The post Improving FinanceBench appeared first on StreamHPC.
It's a benchmark developed at the University of Delaware, aimed at those who work with financial code, to see how certain code paths can be targeted for accelerators. It utilizes the original QuantLib software framework and samples to port four existing applications for quantitative finance. It contains codes for the Black-Scholes, Monte-Carlo, Bonds, and Repo financial applications, which can be run on the CPU and GPU.
The problem is that it has not been maintained for 5 years, while there were good improvement opportunities. Even though the paper was already written, we think it can still be of good use within computational finance. As we were seeking a way to make a demo for the financial industry that is not behind an NDA, this looked like the perfect starting point. We have emailed all the authors of the library, but unfortunately did not get any reply. As the code is provided under a permissive license, we could luckily go forward.
The first version of the code will be released on Github early next month. Below we discuss some design choices and preliminary results.
Of course the first step was selecting good algorithms that both needed a lot of compute and were representative of the finance industry. We could not have done this as well as the research team did – this was the main reason to choose the project.
The work done for making the code, as described by the team itself:
The original QuantLib samples were written in C++. QuantLib is a C++ library. Unfortunately, languages like OpenCL, CUDA, and OpenACC cannot directly operate on C++ data structures, and virtual function calls are not possible. Because of this problem, all of the existing code had to be "flattened" to C code. We used a debugger to step through the code paths of each application and see what lines of QuantLib code are executed for each application, and manually flattened all of the QuantLib code.
This is typical work when porting CPU-code to the GPU. Complex C++ code can be very hard to bend in directions it was not intended to bend to. Simplification is then the first thing to do, so it can be split in manageable parts. As this can be time-consuming, we were happy it was already done, though it is often easier to also have the original simplified CPU-code for reference.
Our focus differs from that of a research group, which logically results in code changes. We're now in phase 1.
As it's difficult to focus on multiple languages while improving performance and project quality, we focused on OpenMP and CUDA first. Then we ported the project to HIP and made sure that the translation from CUDA to HIP could be done quickly. This way we could make sure CPUs and the fastest GPUs could be benchmarked, leaving OpenCL and OpenACC out for now. We have no intention of keeping HMPP in, and have chosen to introduce SYCL to prepare for Intel Xe. We are more interested in benchmarking different types of algorithms on all kinds of hardware than in comparing programming languages.
The project has also been cleaned up; CMake was introduced and google-benchmark was added, to make it easier for us to work on. We did not look into QuantLib for improvements or read papers on the latest advancements – the goal was really just to get started.
We picked a few broadly available AMD and Nvidia GPUs, and chose a dual-socket Xeon (40 cores in total) for the CPU benchmarks. The times INCLUDE transfer times for the GPUs. The original benchmark unfortunately showed compute times only, so we might get some Nvidia Kepler GPUs back into a server to re-benchmark these.
QMC (Sobol) Monte-Carlo method (Equity Option Example)
Monte Carlo is seen often when HPC is applied in the finance domain. A good part of that is the easy interpretation and straightforward implementation, making it easy to explain to HPC developers while showing the performance advantage to quants. It returns a distribution of future prices of assets by doing thousands to millions of simulations.
The below results were from the code as provided, with a direct port to HIP to include AMD GPUs. As you can see, the Titan V and GTX980 numbers don’t look good.
| Size | 262144 | 524288 | 1048576 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 215.311 | 437.140 | 877.425 |
| Titan V | 25.162 | 544.829 | 1599.428 |
| GTX980 | 76.456 | 1753.816 | 5120.598 |
| Vega 20 | 15.286 | 30.140 | 62.110 |
| MI25 (Vega 10) | 13.694 | 26.733 | 52.971 |
| s9300 (Fiji) | 25.853 | 51.403 | 98.484 |
Here are the results after fixing the obvious problems and low-hanging fruit. This benefited the Nvidia GPUs a lot, but also the AMD GPUs. There was no low-hanging fruit in the OpenMP code, so no speed-up there.
| Size | 262144 | 524288 | 1048576 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 215.311 | 437.14 | 877.425 |
| Titan V | 9.885 | 18.501 | 35.714 |
| GTX980 | 23.427 | 45.81 | 91.721 |
| Vega 20 | 10.864 | 21.467 | 42.995 |
| MI25 (Vega 10) | 10.857 | 21.147 | 41.214 |
| s9300 (Fiji) | 20.806 | 41.633 | 80.465 |
Black-Scholes-Merton Process with Analytic European Option engine
Black-Scholes is used for estimating the variation over time of financial instruments such as stocks, using the implied volatility of the underlying asset to derive the price of a call option. Again, it is compute-intensive.
The performance of the original code looked good at first sight, but the transfers took 95% of the time. That’s for the next phase.
| Size | 1048576 | 5242880 | 10485760 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | 22.194 | 43.994 |
| Titan V | 7.959 | 38.852 | 77.443 |
| GTX980 | 10.051 | 48.038 | 95.568 |
| Vega 20 | 5.907 | 26.468 | 53.342 |
| MI25 (Vega 10) | 7.827 | 36.947 | 71.499 |
| s9300 (Fiji) | 9.642 | 37.07 | 80.118 |
On some projects it's better to focus on the largest bottleneck – for this project we chose to go through the project in a structured way. It is sometimes difficult to explain that improvements only get "activated" very late in the project – luckily "experience" is often accepted as the explanation.
So the applied fixes had a positive effect, but are hardly noticeable yet.
| Size | 1048576 | 5242880 | 10485760 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | 22.194 | 43.994 |
| Titan V | 7.841 | 37.957 | 77.382 |
| GTX980 | 9.12 | 44.473 | 89.847 |
| Vega 20 | 5.87 | 27.663 | 51.828 |
| MI25 (Vega 10) | 7.037 | 31.737 | 64.898 |
| s9300 (Fiji) | 7.295 | 32.866 | 79.558 |
Fixed-rate bond valuation with flat forward curve
Only ported to HIP. We did not do any optimisations yet, as we had stability problems with two AMD GPUs to focus on. With the current code, Vega 20 is faster than the Titan V.
| Size | 262144 | 524288 | 1048576 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 186.928 | 369.718 | 732.446 |
| Titan V | 38.328 | 72.444 | 141.664 |
| GTX980 | 404.359 | 796.748 | 1599.416 |
| Vega 20 | 36.399 | 67.299 | 128.133 |
Securities repurchase agreement
Only ported to HIP. The code for this benchmark still needs some attention. You can see that the GTX980 is far too slow in comparison.
| Size | 262144 | 524288 | 1048576 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 241.248 | 482.187 | 952.058 |
| Titan V | 46.172 | 88.865 | 177.188 |
| GTX980 | 761.382 | 1518.756 | 3030.603 |
| Vega 20 | 63.937 | 122.484 | 241.917 |
As you can see, this is really work in progress. Why show it already? Because it shows how a project actually progresses. Cleaning up the code is always done in every project, to avoid delays later on. Adding good tests and benchmarks is another foundational step. Most time has gone into these preparations, and limited time into the improvements.
Milestones we have planned for now:
We intend to release a milestone every 6 to 8 weeks. You can get notified by following us on Twitter or LinkedIn.
Feel free to contact us with any question.
The post Updated: OpenCL and CUDA programming training – now online appeared first on StreamHPC.
As it has been very busy here, we have not done public trainings for a long time. This year we're going to train future GPU developers again – online. For now it's one date, but we'll add more dates to this blog post later on.
If you need to learn solid GPU programming, this is the training you should attend. The concepts can be applied to other GPU-languages too, which makes it a good investment for any probable future where GPUs exist.
This is a public training, which means there are attendees from various companies. If you prefer not to be in a public class, get in contact to learn more about our in-company trainings.
It includes:
Trainings will be done by employees of Stream HPC, who all have a lot of experience with applying the techniques you are going to learn.
Most trainings have around 40% lectures, 50% lab-sessions and 10% discussions.
Important:
This is close to our standard OpenCL crash course. We start with the basic concepts, write our first OpenCL program, discuss GPU architectures and the difficulties of GPU-programming, compare OpenCL to CUDA and C++, and end with writing simple code that runs on a laptop (CPU or GPU).
During the day, we will increase the level of requirements and touch all important aspects of OpenCL-programming.
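As a taste of that first program: the classic starter kernel is a vector addition, where every work-item computes one element of the result. Below is a plain-Python model of that data-parallel mapping — no OpenCL involved, just the mental model behind the kernel body `c[i] = a[i] + b[i]`:

```python
# Mental model of a vector-add kernel: each "work-item" i independently
# computes one output element, like get_global_id(0) does in OpenCL C.
def vector_add(a, b):
    n = len(a)
    c = [0.0] * n
    # On a GPU these loop iterations run as thousands of parallel work-items;
    # they are independent, which is exactly what makes the problem GPU-friendly.
    for i in range(n):
        c[i] = a[i] + b[i]
    return c

print(vector_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [11.0, 22.0, 33.0]
```

In the lab sessions this same computation is expressed as an OpenCL kernel plus host code that sets up buffers and launches the work-items.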
As we have several GPU-servers available for development, we can provide you with login credentials, a git account and a short how-to for using the extra GPUs from NVidia and AMD, and optionally Intel. This way you can try different graphics cards to find out which optimisations work and which don't.
Optimisations we discuss during this day:
During the days we use various lab-sessions to support the explained theory. We’ll discuss most of the following:
There are various tools you need to understand to get code that runs well on GPUs. This day you will learn to use various vendor-provided and open source tools to help you analyse your code.
The tools are discussed and used along the following four subjects:
We discuss tools last on purpose: you need to understand the concepts before having a tool solve problems for you.
Often these are different per training, as these are defined by the attendees of the training. Subjects that have been discussed more frequently:
For the final project you will need to use everything you've learnt and try to write the fastest code in the class.
Lab-session:
We will send a questionnaire to understand the needs of each trainee. For larger groups, we also have a separate phone call with the representative.
Attendees need to bring their own laptops for the lab sessions. The only requirement is for the laptops to be equipped with an OpenCL capable CPU or GPU and to have OpenCL drivers correctly installed. A complete list of OpenCL compliant devices can be found here. Regarding software, laptops need to have the following installed:
We prefer to focus on the core of the training, and therefore ask that a certain set of skills is already present. This avoids other attendees having to wait.
Attendees are required to have intermediate programming experience and good C/C++ knowledge. This means that you should at least be able to write an application in C/C++ from scratch, debug it with GDB at a high level, and be very (!) comfortable working with pointers. If this is not the case, do contact us so we can provide pre-training material.
The costs are €2500. Additional days for personal training and consultancy are excluded – please ask us for a quote.
If you’re interested, do fill in the pre-training questionnaire already.
Initiating the reservation can be done by email or phone.
Email: [email protected]
Phone: +31 854865760
The post Join us at the Dutch eScience Symposium 2019 in Amsterdam appeared first on StreamHPC.
Interested? Read on!
The Netherlands eScience Center will host its 6th annual National eScience Symposium on 21 November 2019 at the Johan Cruijff ArenA, Amsterdam. The theme of this year's edition is "Digital Challenges in Open Science", focusing on the ways that innovative digital technologies can be used to advance Open Science. The Dutch National eScience Symposium is a one-day event where academic researchers, industry representatives and policy-makers from various disciplines come together to discuss and share views on the latest developments in eScience.
According to the European Commission, "Open Science represents a new approach to the scientific process based on cooperative work and new ways of diffusing knowledge by using digital technologies and new collaborative tools. The idea captures a systemic change to the way science and research have been carried out for the last fifty years: shifting from the standard practices of publishing research results in scientific publications towards sharing and using all available knowledge at an earlier stage in the research process." The 2017 Coalition Agreement of the Dutch cabinet states that very soon "Open Science […] will be the norm in scientific research".
The systematic change of the practice of science is made possible only by a close collaboration between practitioners, policy-makers and the designers and developers of new digital solutions and tools. In particular, the role of research software and support is essential. The 2019 National eScience Symposium is an opportunity for anyone working in academia, industry, research funding, research publishing, and research policy to engage in fruitful discussions with advanced technology experts. It is the place for knowledge exchange on these topics, and to help shape and define the future of Open Science.
The symposium will feature thematic sessions showcasing world-class research and digital developments around Open Science. Topics include digital challenges in open science, the role of open science in space and earth research, and the European Open Science Cloud. In addition, the symposium will showcase several demos of cutting-edge research software developed by the eScience Center and its partners, as well as the announcement of the Young eScientist Award 2019.
Each year young and ambitious researchers can participate in the Young eScientist Award competition. The award aims to stimulate a young scientist demonstrating excellence in eScience: the development or application of digital technology to address scientific challenges. The Netherlands eScience Center awards €50,000 worth of expertise to the winner. At the symposium the Young eScientist 2019 will be announced.
Read last year's winning research idea by Esther Bron.
09:00 – 09:45: Registration & Coffee
09:45 – 10:00: Welcome by director Netherlands eScience Center
10:00 – 10:30: Keynote Speech by Prof. Frank Miedema, Vice Rector for Research at Utrecht University
10:30 – 11:00: Coffee & networking – incl. information stands partners
11:00 – 12:30: Sessions on various Open Science topics, including keynote, lightning talks and discussions
12:30 – 13:30: Lunch & networking – incl. information stands partners
13:30 – 15:00: Sessions on various Open Science topics, including keynote, lightning talks and discussions
15:00 – 15:15: Young eScientist Award 2019
15:15 – 15:30: Closing + Young eScience Demo
15:45 – 18:00: Demo exhibition, networking, and drinks
Johan Cruijff ArenA
ArenA Boulevard 1
www.amsterdamarena.nl
Tickets are currently €35 and can be ordered from https://www.esciencesymposium2019.nl/page/606790