parallel processing – Hackaday

PentaPico: A Pi Pico Cluster For Image Convolution

John Elliot V — Tue, 20 May 2025 08:00:56 +0000

Here’s something fun. Our hacker [Willow Cunningham] has sent us a copy of their homework. This is their final project for the “ECE 574: Cluster Computing” course at the University of Maine, Orono.

It was enjoyable going through the process of having a good look at everything in this project. The project is a “cluster” of 5x Raspberry Pi Pico microcontrollers — with one head node as the leader and four compute nodes that work on tasks. The software for both types of node is written in C. The head node is connected to a workstation via USB 1.1 allowing the system to be controlled with a Python script.

The cluster is configured to process an embarrassingly parallel image convolution. The input image is copied into the head node via USB which then divvies it up and distributes it to n compute nodes via I²C, one node at a time. Results are given for n = {1,2,4} compute nodes.
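To make the "divvies it up" step concrete, here is a minimal C sketch of one way a head node could split an image by rows. Everything in it is an assumption for illustration: the `row_slice` type, the `slice_for_node` function, and the choice of row-wise slicing with overlapping border ("halo") rows are hypothetical and not taken from [Willow]'s code.

```c
#include <stddef.h>

/* Hypothetical description of one compute node's share of the image. */
typedef struct {
    size_t first_row;   /* first output row this node computes */
    size_t row_count;   /* number of output rows it computes */
    size_t send_first;  /* first input row to transmit (halo included) */
    size_t send_count;  /* number of input rows to transmit */
} row_slice;

/* Split `height` rows as evenly as possible across `nodes` workers
 * (assumes nodes > 0 and height >= nodes). `radius` is the convolution
 * kernel radius (1 for a 3x3 kernel): each node also needs that many
 * rows above and below its slice, clamped to the image edges. */
row_slice slice_for_node(size_t height, size_t nodes, size_t node, size_t radius)
{
    size_t base = height / nodes;
    size_t extra = height % nodes;   /* the first `extra` nodes get one more row */
    row_slice s;

    s.first_row = node * base + (node < extra ? node : extra);
    s.row_count = base + (node < extra ? 1 : 0);

    size_t lo = s.first_row > radius ? s.first_row - radius : 0;
    size_t hi = s.first_row + s.row_count + radius;
    if (hi > height)
        hi = height;

    s.send_first = lo;
    s.send_count = hi - lo;
    return s;
}
```

For a 10-row image, four nodes, and a 3x3 kernel (radius 1), node 0 would compute rows 0 to 2 and be sent rows 0 to 3. The halo rows are extra traffic on top of the image itself, which is worth keeping in mind when the link between nodes is the slow part.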

It turns out that the work of distributing the data dwarfs the compute by three orders of magnitude. The result is that the whole system gets slower the more nodes we add. But we’re not going to hold that against anyone. This was a fascinating investigation and we were impressed by [Willow]’s technical chops. This was a complicated project with diverse hardware and software challenges, and they’ve done a great job making it all work, in the best scientific tradition.

It was fun reading their journal, in which they chronicled their progress and frustrations during the project. Their final report, in IEEE format, was created using LaTeX and Overleaf; at only six pages, it is an easy and interesting read.

For anyone interested in cluster tech, be sure to check out the 256-core RISC-V megacluster and a RISC-V supercluster for very low cost.

Import GPU: Python Programming with CUDA

Bryan Cockfield — Wed, 26 Feb 2025 03:00:30 +0000

Every few years or so, a development in computing results in a sea change and a need for specialized workers to take advantage of the new technology. Whether that’s COBOL in the 60s and 70s, HTML in the 90s, or SQL in the past decade or so, there’s always something new to learn in the computing world. The introduction of graphics processing units (GPUs) for general-purpose computing is perhaps the most important recent development for computing, and if you want to develop some new Python skills to take advantage of the modern technology take a look at this introduction to CUDA which allows developers to use Nvidia GPUs for general-purpose computing.

Of course CUDA is a proprietary platform and requires one of Nvidia’s supported graphics cards to run, but assuming that barrier to entry is met it’s not too much more effort to use it for non-graphics tasks. The guide takes a closer look at the open-source library PyTorch which allows a Python developer to quickly get up-to-speed with the features of CUDA that make it so appealing to researchers and developers in artificial intelligence, machine learning, big data, and other frontiers in computer science. The guide describes how threads are created, how they travel along within the GPU and work together with other threads, how memory can be managed both on the CPU and GPU, creating CUDA kernels, and managing everything else involved largely through the lens of Python.

Getting started with something like this is almost a requirement to stay relevant in the fast-paced realm of computer science, as machine learning has taken center stage with almost everything related to computers these days. It’s worth noting that, strictly speaking, an Nvidia GPU is not required for GPU programming like this; AMD has a GPU computing platform called ROCm, but despite being open-source it still trails Nvidia in adoption rates and arguably in performance as well. Some other learning tools for GPU programming we’ve seen in the past include this puzzle-based tool which illustrates some of the specific problems GPUs excel at.

Turing Pi 2: The Low Power Cluster

Jonathan Bennett — Thu, 16 Jun 2022 17:00:43 +0000

We’re not in the habit of recommending Kickstarter projects here at Hackaday, but when prototype hardware shows up on our desk, we just can’t help but play with it and write it up for the readers. And that is exactly where we find ourselves with the Turing Pi 2. You may be familiar with the original Turing Pi, the carrier board that runs seven Raspberry Pi Compute boards at once. That one supports the Compute versions 1 and 3, but a new design was clearly needed for the Compute Module 4. Not content with just supporting the CM4, the developers at Turing Machines have designed a 4-slot carrier board based on the NVIDIA Jetson pinout. The entire line of Jetson devices is supported, and a simple adapter makes the CM4 work. There’s even a brand new module planned around the RK3588, which should be quite impressive.

One of the design decisions of the TP2 is to use the mini-ITX form-factor and 24-pin ATX power connection, giving us the option to install the TP2 in a small computer case. There’s even a custom rack-mountable case being planned by the folks over at My Electronics. So if you want 4 or 8 Raspberry Pis in a rack mount, this one’s for you.

@jp_bennett you mean something like this except in 2U, and full mini-ITX support? Relax, only thing you need is some patience… pic.twitter.com/vQcVCwmgDc

— MyElectronics.nl (@MyElectronicsNL) June 11, 2022

The Appeal — And the Risks

“Wait, wait”, I hear you say, “There’s plenty of ways to rack-mount Raspberry Pis!” Certainly. The form factor options are handy, but the real magic is the rest of the board. Individually controlled power supply for all four boards from a single ATX power supply makes for a very clean solution. Need to reboot a hung Pi remotely? There’s the Baseboard Management Controller (BMC) that will do full power control over the network. That’s the real killer feature: the BMC is going to run Open Source firmware, and will power some very clever functions. Want UART to troubleshoot a boot problem? It’s available from all four nodes on the BMC. Need to push a new image to a CM4? The BMC will include image flashing functions. Built into the board is a Gigabit network switch linking the Pis, the BMC, and two external Ethernet ports, all supporting VLANs.

On the other hand, not much of the BMC wizardry is actually implemented yet on the review units. This is the project’s biggest promise and the place it could go awry. Putting together a stable firmware with all the bells and whistles in the three months before scheduled ship date may be a bit optimistic. I’m expecting a working firmware, with updates to refine the experience in the months following launch.

Then there’s the expanded IO. The board comes with a pair of Mini PCIe ports, 4 USB3 ports, and a pair of SATA ports. This works via the PCIe lanes exposed by the various compute modules. Nodes 1 and 2 are connected to the mini PCIe ports, node 3 to the SATA, and node 4 to the USB3 ports. On top of that, a switchable USB2 port can be dynamically assigned to any of the existing nodes. Oh, and there’s an HDMI output from node 1, so even more options, like running a Pi CM4 8GB as a desktop machine. A late option added to the Kickstarter bolts four NVMe ports to the bottom of the board, one per slot, though not every compute module has the PCIe lanes to support it.

Now keep in mind that I’m testing a pre-production unit (more on that later), and not all of the above is actually working yet. Quite a few changes are slated for the production boards vs my unit, and the BMC firmware on this board is absolutely minimal. There are also the supply-chain issues we’ve continued to cover here on Hackaday, but the TP2 has the advantage of being designed during the shortage, so it should be able to avoid using hard-to-source parts.

Use-Case

Now let’s talk about what this *doesn’t* do. This may seem obvious, but the Turing Pi 2 doesn’t give you a single ARM machine with 16+ processing cores. There isn’t enough magic onboard to make the devices act like a unified multi-processor computer. I’m not sure there’s enough magic anywhere to really pull that off. However, what you do get is four easily-managed machines that are perfect for running light-weight services or Docker images.

Looking for a platform for learning Docker and Kubernetes? Or a place to host Gitlab, Nextcloud, and a file server? Maybe you want to run Nginx as a front-end proxy, with several devices running services behind it? The Homelab-in-a-box nature of the TP2 makes it a useful choice for all of the above. And even though you can’t reasonably do all the above on a single Raspberry Pi, a programmable cluster of 4 of them does the job quite nicely. The VLAN support means that you can add virtual NICs to your nodes, and create an internal network. With the two physical Ethernet ports, you could even use your TP2 as your primary router, on top of everything else it can do.

Real-World Testing

So what’s the actual state of the project? I have my pre-production board currently booting a Raspberry Pi CM4, a Pine64 SOQuartz module, an NVIDIA Jetson Nano, and the Jetson TX2 NX. The Jetson Xavier NX had a quirk requiring a minor board modification, but runs like a champ once that was done. There are the normal warts of a pre-production board, like extra dip switches all over the place, and a few quirks, like Ethernet only coming up at 100M for some devices. These are known issues, and a good example of why you do a test run of rev 0 boards. The final product should have all the kinks worked out.

I’ve been monitoring power draw, and the most I’ve managed to pull is a mere 30 watts of power. This suggests a real-world use case: an off-grid compute cluster. The mini-PCIe ports should allow for an LTE modem (or you can use Starlink if you’re *way* off grid). Add a couple cameras and install the Zoneminder docker images, and you have a low-power video monitoring solution. Add an RTL-SDR dongle, and the rtl_433 software listening to a solar-powered weather station, and you can track the weather at your remote location, too. Just for fun, I ran a Janus docker image on one of the Raspberry Pi CM4s on my TP2. Janus is the WebRTC server we’ve integrated into Zoneminder, and I was able to live stream 12 security cameras at 1080p, only using around 25% of the available processor power, or a load of 1 on a four core Pi. It’s a testament to how lightweight Janus is, but also a great example of something useful you could do with a TP2.

What’s Next

The Kickstarter is over, with better than two million dollars raised, but don’t sweat it, because you will soon be able to purchase a Turing Pi 2. Ordering will be handled through the Turing Pi website itself; stay tuned for the details. It will be a few months until the final revision of the board is finished and shipped, hopefully with some killer firmware and everything working exactly as advertised. Then finally there’s the alluring RK1 compute board, with up to 32 GB of RAM and eight cores of Arm goodness from the RK3588. That’s a little further out, and may be a second Kickstarter campaign. I asked about mainline support for the RK1, and was told that this is a primary goal, but they’re not exactly sure on the timing. There is quite a bit of excitement around this particular chip, so look forward to the community working together to get all the needed bits in place for mainline support.

There may be an unexpected consequence of the Turing Pi 2 and RK1 using the NVIDIA Jetson SO-DIMM connector. Imagine a handheld device built on the Antmicro open source Jetson Baseboard that works with multiple compute modules. I mentioned the Pine64 SOQuartz: That’s not an officially supported board in the TP2, but because Pine64 built it to the CM4 specifications, it clicks right into the adapter card and works like a champ. There’s an interesting possibility that one or two of these compute module interfaces will gain enough critical mass that they get widely used in devices. And if anyone wondered, using the TP2 CM4 adapter doesn’t magically allow booting a CM4 in a Jetson Nano carrier board. Yes, we checked.

So is the Turing Pi 2 for you? Maybe. If you don’t mind juggling multiple single-board computers, and the mess of cabling required, then maybe not. But if the ability to slot four SBCs in a single mini-ITX case, with a BMC that makes life way easier sounds like a breath of fresh air, then give it a look. The real test will be when the finished product ships, and what shape the support is in. I’m cautiously optimistic that it won’t be terribly late, and that it will have working OSS firmware. I’m looking forward to getting my hands on the final product. Now if you’ll excuse me, I think I need to go set up an automated system for building aarch64 docker images.

Parallel Processing Was Never Quite Done Like This

Jenny List — Sat, 29 Jun 2019 08:00:00 +0000

Parallel processing is an idea that will be familiar to most readers. Few of you will be reading this on a device with only one processor core, and quite a few of you will have experimented with clusters of Raspberry Pi or similar SBCs. Instead of one processor doing tasks sequentially, the idea goes, take a bunch of processors and hand out the tasks to be done simultaneously.

It’s a fair bet, though, that few of you will have designed and constructed your own parallel processing architecture. [BB] sends us a link which, though it’s an old one, is interesting enough to bring you today: [Michael] created a massively parallel array of Parallax Propeller microcontrollers back in 2008, and he did so on a breadboard.

The Parallax Propeller is an 8-core RISC microcontroller from the company that had found success in the 1990s with the BASIC Stamp, the PIC-based board that was all the rage before Arduino came into the world. In the last decade it was seen as an extremely exciting prospect, but high price and arcane development tools compared to a new generation of low-cost and easy to code competitors meant that it never quite caught on and remains today something of an intriguing oddity. So today’s value in this project lies not in something that you should run out and do yourselves, but instead in what the work tells us about the nuts and bolts of parallel processing architecture. It involves more than simply hooking up a load of chips and hoping for the best, and we gain some insight into the different strategies involved.

The Propeller certainly wasn’t the first attempt at a massively parallel microcontroller, and we doubt it will be the last. We’re certainly seeing microcontrollers with more than one core becoming more mainstream even in our community, but even with those how many of you have made use of the second core in your dual-core ESP32? Is a multicore microcontroller a solution searching for a problem, or will somebody one day crack it and the world will never be the same again? As always, the comments are below.

CUDA is Like Owning a Supercomputer

Al Williams — Mon, 19 Mar 2018 17:00:58 +0000

The word supercomputer gets thrown around quite a bit. The original Cray-1, for example, operated at about 150 MIPS and had about eight megabytes of memory. A modern Intel i7 CPU can hit almost 250,000 MIPS and is unlikely to have less than eight gigabytes of memory, and probably has quite a bit more. Sure, MIPS isn’t a great performance number, but clearly, a top-end PC is way more powerful than the old Cray. The problem is, it’s never enough.

Today’s computers have to process huge numbers of pixels, video data, audio data, neural networks, and long key encryption. Because of this, video cards have become what in the old days would have been called vector processors. That is, they are optimized to do operations on multiple data items in parallel. There are a few standards for using the video card processing for computation and today I’m going to show you how simple it is to use CUDA — the NVIDIA proprietary library for this task. You can also use OpenCL which works with many different kinds of hardware, but I’ll show you that it is a bit more verbose.

Dessert First

One of the things that’s great about being an adult is you are allowed to eat dessert first if you want to. In that spirit, I’m going to show you two bits of code that will demonstrate just how simple using CUDA can be. First, here’s a piece of code known as a “kernel” that will run on the GPU.

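__global__
void scale(unsigned int n, float *x, float *y)
{
  int i = threadIdx.x;   // index of this thread within its block
  x[i]=x[i]*y[i];        // n is unused here; see the note below about checking i < n
}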

There are a few things to note:

The __global__ tag indicates this function can run on the GPU
The set up of the variable “i” gives you the current vector element
This example assumes there is one thread block of the right size; if not, the setup for i would be slightly more complicated and you’d need to make sure i < n before doing the calculation

So how do you call this kernel? Simple:

scale<<<1,1024>>>(1024,x,y);

Naturally, the devil is in the details, but it really is that simple. The kernel, in this case, multiplies each element in x by the corresponding element in y and leaves the result in x. The example will process 1024 data items using one block of threads, and the block contains 1024 threads.

You’ll also want to wait for the threads to finish at some point. One way to do that is to call cudaDeviceSynchronize().

By the way, I’m using C because I like it, but you can use other languages too. For example, the video from NVidia, below, shows how they do the same thing with Python.

Grids, Blocks, and More

The details are a bit uglier, of course, especially if you want to maximize performance. CUDA abstracts the video hardware from you. That’s a good thing because you don’t have to adapt your problem to specific video adapters. If you really want to know the details of the GPU you are using, you can query it via the API or use the deviceQuery example that comes with the developer’s kit (more on that shortly).

For example, here’s a portion of the output of deviceQuery for my setup:

CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1060 3GB"
CUDA Driver Version / Runtime Version 9.1 / 9.1
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 3013 MBytes (3158900736 bytes)
( 9) Multiprocessors, (128) CUDA Cores/MP: 1152 CUDA Cores
GPU Max Clock rate: 1772 MHz (1.77 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 1572864 bytes
. . .
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes

Some of this is hard to figure out until you learn more, but the key items are there are nine multiprocessors, each with 128 cores. The clock is about 1.8 GHz and there’s a lot of memory. The other important parameter is that a block can have up to 1024 threads.

So what’s a thread? And a block? Simply put, a thread runs a kernel. Threads form blocks that can be one, two, or three dimensional. All the threads in one block run on one multiprocessor, although not necessarily simultaneously. Blocks are put together into grids, which can also have one, two, or three dimensions.

So remember the line above that said scale<<<1,1024>>>? That runs the scale kernel with a grid containing one block and the block has 1024 threads in it. Confused? It will get clearer as you try using it, but the idea is to group threads that can share resources and run them in parallel for better performance. CUDA makes what you ask for work on the hardware you have up to some limits (like the 1024 threads per block, in this case).
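The 1024-thread limit matters as soon as a data set outgrows one block. As a rough sketch of the usual arithmetic, here is a plain C helper that picks a grid and block size giving one thread per item. The `launch_shape` type and `one_thread_per_item` function are hypothetical names for illustration, not part of CUDA; the limit is passed in as a parameter (1024 in the deviceQuery output above).

```c
/* Hypothetical pair of launch parameters, as in kernel<<<blocks, threads>>>(...). */
typedef struct {
    unsigned blocks;    /* number of blocks in a one-dimensional grid */
    unsigned threads;   /* number of threads in each block */
} launch_shape;

/* Choose a shape with at least n threads in total, never exceeding
 * max_threads_per_block threads in any one block. */
launch_shape one_thread_per_item(unsigned n, unsigned max_threads_per_block)
{
    launch_shape s;

    s.threads = n < max_threads_per_block ? n : max_threads_per_block;
    if (s.threads == 0)
        s.threads = 1;  /* avoid dividing by zero when n or the limit is 0 */

    /* Round up without overflowing: one extra block for any remainder. */
    s.blocks = n / s.threads + (n % s.threads != 0);
    return s;           /* blocks == 0 means there is nothing to launch */
}
```

With a limit of 1024, 1024 items fit in one block, while 3000 items need three blocks of 1024 threads. That is 3072 threads, so a kernel launched this way has to compute a global index from its block and thread indices and skip any index at or beyond n.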

Grid Stride Loop

One of the things we can do, then, is make our kernels smarter. The simple example kernel I showed you earlier processed exactly one data item per thread. If you have enough threads to handle your data set, then that’s fine. Usually, that’s not the case, though. You probably have a very large dataset and you need to do the processing in chunks.

Let’s look at a dumb but illustrative example. Suppose I have ten data items to process. This is dumb because using the GPU for ten items is probably not effective due to the overhead of setting things up. But bear with me.

Since I have a lot of multiprocessors, it is no problem to ask CUDA to do one block that contains ten threads. However, you could also ask for two blocks of five. In fact, you could ask for one block of 100 and it will dutifully create 100 threads. Your kernel would need to ignore all of them that would cause you to access data out of bounds. CUDA is smart, but it isn’t that smart.
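If you want to see that bounds check in action without a GPU, here is a small plain C sketch that emulates the launch on the CPU by running the "threads" one after another. The `idx1` type and the `scale_thread` and `scale_launch` functions are stand-ins invented for this illustration; in real CUDA the indices come from the built-in blockIdx, blockDim, and threadIdx variables and the threads run in parallel.

```c
/* CPU stand-in for one component of CUDA's built-in index variables. */
typedef struct { unsigned x; } idx1;

/* The kernel body for a single thread, with the i < n guard added. */
static void scale_thread(idx1 blockIdx, idx1 blockDim, idx1 threadIdx,
                         unsigned n, float *x, const float *y)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;  /* global index */
    if (i < n)                  /* surplus threads simply do nothing */
        x[i] = x[i] * y[i];
}

/* Emulates scale<<<blocks, threads>>>(n, x, y), one thread at a time. */
void scale_launch(unsigned blocks, unsigned threads,
                  unsigned n, float *x, const float *y)
{
    idx1 dim = { threads };
    for (unsigned b = 0; b < blocks; b++) {
        for (unsigned t = 0; t < threads; t++) {
            idx1 bi = { b }, ti = { t };
            scale_thread(bi, dim, ti, n, x, y);
        }
    }
}
```

Calling scale_launch(1, 100, 10, x, y) runs 100 emulated threads, but only the first ten touch the arrays. Two blocks of five give the same result, because the global index combines the block number and the thread number.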

The real power, however, is when you specify fewer threads than you have items. This will require a grid with more than one block and a properly written kernel can compute multiple values.

Consider this kernel, which uses what is known as a grid stride loop:

__global__
void scale(unsigned int n, float *x, float *y)
{
 unsigned int i, base=blockIdx.x*blockDim.x+threadIdx.x, incr=blockDim.x*gridDim.x;
 for (i=base;i<n;i+=incr) // note that i>=n is discarded
   x[i]=x[i]*y[i];
}

This does the same calculations but in a loop. The base variable is the index of the first data item to process. The incr variable holds how far away the next item is. If your grid has at least as many threads as there are data items, this will degenerate to a single execution per thread. For example, if n is 10 and we have one block of ten threads, then each thread will get a unique base (from 0 to 9) and an increment of ten. Since adding ten to any of the base numbers will reach or exceed n, the loop will only execute once in each thread.

However, suppose we ask for one block of five threads. Then thread 0 will get a base of zero and an increment of five. That means it will compute items 0 and 5. Thread 1 will get a base of one with the same increment so it will compute 1 and 6.
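To check that bookkeeping by hand, the following plain C sketch walks the same base and increment arithmetic on the CPU and records which thread would handle each item. The function names are hypothetical and exist only for this illustration; the index math mirrors the grid stride kernel above.

```c
/* For each of n items, record the global thread id that would process it
 * when a grid stride kernel is launched as <<<blocks, threads>>>. */
void grid_stride_owners(unsigned blocks, unsigned threads,
                        unsigned n, unsigned *owner)
{
    unsigned incr = threads * blocks;            /* blockDim.x * gridDim.x */
    for (unsigned b = 0; b < blocks; b++) {      /* plays the role of blockIdx.x */
        for (unsigned t = 0; t < threads; t++) { /* plays the role of threadIdx.x */
            unsigned base = b * threads + t;
            for (unsigned i = base; i < n; i += incr)
                owner[i] = base;
        }
    }
}

/* Number of loop iterations one thread executes (total_threads must be > 0). */
unsigned grid_stride_iterations(unsigned total_threads, unsigned n, unsigned thread)
{
    if (thread >= n)
        return 0;                   /* this thread's base is already out of range */
    return (n - thread - 1) / total_threads + 1;
}
```

With one block of five threads and ten items, items 0 and 5 are recorded against thread 0 and items 1 and 6 against thread 1, and each thread runs the loop body twice. With ten threads each runs it once, and with 100 threads most never enter the loop at all.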

Of course, you could also ask for a block size of one and ten blocks which would have each thread in its own block. Depending on what you are doing, all of these cases have different performance ramifications. To better understand that, I’ve written a simple example program you can experiment with.

Software and Setup

Assuming you have an NVidia graphics card, the first thing you have to do is install the CUDA libraries. You might have a version in your Linux repository but skip that. It is probably as old as dirt. You can also install for Windows (see video, below) or Mac. Once you have that set up, you might want to build the examples, especially the deviceQuery one to make sure everything works and examine your particular hardware.

You have to run the CUDA source files, which by convention have a .cu extension, through nvcc instead of your system C compiler. This lets CUDA interpret the special things like the angle brackets around a kernel invocation.

An Example

I’ve posted a very simple example on GitHub. You can use it to do some tests on both CPU and GPU processing. The code creates some memory regions and initializes them. It also optionally does the calculation using conventional CPU code. Then it also uses one of two kernels to do the same math on the GPU. One kernel is what you would use for benchmarking or normal use. The other one has some debugging output that will help you see what’s happening but will not be good for execution timing.

Normally, you will pick CPU or GPU, but if you do both, the program will compare the results to see if there are any errors. It can optionally also dump a few words out of the arrays so you can see that something happened. I didn’t do a lot of error checking, so that’s handy for debugging because you’ll see the results aren’t what you expect if an error occurred.
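For readers who want to roll their own check, here is a minimal sketch of what a CPU reference and a result comparison could look like in plain C. It is not the code from the GitHub example; `scale_cpu` and `count_mismatches` are hypothetical names, and the tolerance parameter is an assumption added so that small floating point differences need not count as errors.

```c
#include <stddef.h>

/* CPU reference for the same math as the scale kernel: x[i] = x[i] * y[i]. */
void scale_cpu(size_t n, float *x, const float *y)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] * y[i];
}

/* Count the elements where two result arrays differ by more than tol.
 * Written as !(d <= tol) so that a NaN in either array counts as a mismatch. */
size_t count_mismatches(size_t n, const float *a, const float *b, float tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        float d = a[i] - b[i];
        if (d < 0.0f)
            d = -d;
        if (!(d <= tol))
            bad++;
    }
    return bad;
}
```

The idea is to run the reference on a copy of the input, copy the GPU result back to the host, and treat any nonzero count as a sign that the kernel or the launch parameters are wrong.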

Here’s the help text from the program:

So to do the tests to show how blocks and grids work with ten items, for example, try these commands:

./gocuda g p d bs=10 nb=1 10
./gocuda g p d bs=5 nb=1 10

To generate large datasets, you can make n negative and it will take it as a power of two. For example, -4 will create 16 samples.
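As an illustration of that convention only (this is not the program's actual parsing code), a helper that turns the command line value into an item count might look like the sketch below. The `item_count` name and the choice to return 0 for an exponent that does not fit are assumptions.

```c
#include <limits.h>
#include <stddef.h>

/* Interpret n the way the text describes: a non-negative n is the item count,
 * a negative n means 2 to the power |n|. Returns 0 if the power of two
 * cannot be represented in a size_t. */
size_t item_count(long n)
{
    if (n >= 0)
        return (size_t)n;

    unsigned long exponent = 0UL - (unsigned long)n;  /* |n| without signed overflow */
    if (exponent >= sizeof(size_t) * CHAR_BIT)
        return 0;

    return (size_t)1 << exponent;
}
```

So item_count(-4) is 16, matching the example, and item_count(-20) is 1048576 items, a more reasonable size for timing runs.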

Is it Faster?

Although it isn’t super scientific, you can use any method (like time on Linux) to time the execution of the program when using GPU or CPU. You might be surprised that the GPU code doesn’t execute much faster than the CPU and, in fact, it is often slower. That’s because our kernel is pretty simple and modern CPUs have their own tricks for doing processing on arrays. You’ll have to venture into more complex kernels to see much benefit. Keep in mind there is some overhead to set up all the memory transfers, depending on your hardware.

You can also use nvprof — included with the CUDA software — to get a lot of detailed information about things running on the GPU. Try putting nvprof in front of the two example gocuda lines above. You’ll see a report that shows how much time was spent copying memory, calling APIs, and executing your kernel. You’ll probably get better results if you leave off the “p” and “d” options, too.

For example, on my machine, using one block with ten threads took 176.11 microseconds. By using one block with five threads, that time went down to 160 microseconds. Not much, but it shows how doing more work in one thread cuts the thread setup overhead which can add up when you are doing a lot more data processing.

OpenCL

OpenCL has a lot of the same objectives as CUDA, but it works differently. Some of this is necessary since it handles many more devices (including non-NVidia hardware). I won’t comment much on the complexity, but I will note that you can find a simple example on GitHub, and I think you’ll agree that if you don’t know either system, the CUDA example is a lot easier to understand.

Next Steps

There’s lots more to learn, but that’s enough for one sitting. You might skim the documentation to get some ideas. You can compile just in time if your code is more dynamic, and there are plenty of ways to organize memory and threads. The real challenge is getting the best performance by sharing memory and optimizing thread usage. It is somewhat like chess. You can learn the moves, but becoming a good player takes more than that.

Don’t have NVidia hardware? You can even do CUDA in the cloud now. You can check out the video for NVidia’s setup instructions.

Just remember when you create a program that processes a few megabytes of image or sound data, that you are controlling a supercomputer that would have made [Seymour Cray’s] mouth water back in 1976.

Neural Nets in the Browser: Why Not?

Al Williams — Fri, 04 Aug 2017 15:30:55 +0000

We keep seeing more and more TensorFlow neural network projects. We also keep seeing more and more things running in the browser. You don’t have to be Mr. Spock to see this one coming. TensorFire runs neural networks in the browser and claims that WebGL allows it to run as quickly as it would on the user’s desktop computer. The main page is a demo that stylizes images, but if you want more detail you’ll probably want to visit the project page, instead. You might also enjoy the video from one of the creators, [Kevin Kwok], below.

TensorFire has two parts: a low-level language for writing massively parallel WebGL shaders that operate on 4D tensors and a high-level library for importing models from Keras or TensorFlow. The authors claim it will work on any GPU and–in some cases–will be actually faster than running native TensorFlow.

This is a logical progression of using WebGL to do browser-based parallel processing, which we’ve covered before. The work has been done by a group of recent MIT graduates who applied for (and received) an AI Grant for their work. We wonder if some enterprising Hackaday readers might not get some similar financing (be aware, you have to apply by the end of August).

If you have been itching to learn more about TensorFlow, we’ve covered it in depth. If you want the bare-bones example, we’ve looked at that, too.

Thanks [Patrick] for the tip.

1000 CPUs on a Chip

Al Williams — Mon, 20 Jun 2016 23:01:57 +0000

Often, CPUs that work together operate as SIMD (Single Instruction Multiple Data) or MISD (Multiple Instruction Single Data) machines, two of the categories in Flynn’s taxonomy. For example, your video card probably has the ability to apply a single operation (an instruction) to lots of pixels simultaneously (multiple data). Researchers at the University of California–Davis recently constructed a single chip with 1,000 independently programmable processors onboard, which puts it in the MIMD (Multiple Instruction Multiple Data) category instead. The device is energy efficient and can compute up to 1.78 trillion instructions per second.

The KiloCore chip (not to be confused with the 2006 Rapport chip of the same name) has 621 million transistors and uses special techniques to be energy efficient, an important design feature when dealing with so many CPUs. Each processor operates at 1.78 GHz or less and can shut itself down when not needed. The team reports that even when computing 115 billion instructions per second, the device only consumes about 700 milliwatts.
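As a rough sanity check on those figures, assume each processor completes at most one instruction per clock cycle (an assumption; it is not stated above). The peak rate and the energy per instruction at the quoted 700 milliwatt operating point then work out as:

```latex
\begin{align*}
R_{\text{peak}} &= 1000 \times 1.78 \times 10^{9}\ \text{instructions/s}
                 = 1.78 \times 10^{12}\ \text{instructions/s} \\
E_{\text{instr}} &\approx \frac{0.7\ \text{W}}{115 \times 10^{9}\ \text{instructions/s}}
                 \approx 6.1 \times 10^{-12}\ \text{J}
                 \approx 6.1\ \text{pJ per instruction}
\end{align*}
```

The first line matches the 1.78 trillion instructions per second quoted for the whole chip, so the headline number is simply every processor running flat out at its maximum clock.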

Unlike some multicore designs that use a shared memory area to communicate between processors, the KiloCore allows processors to directly communicate. If you are just a diehard Arduino user, maybe you could scale up this design. Or, if you want to make use of the unused power in your video card under Linux, you can always try to bring KGPU up to date.