image processing – Hackaday
https://hackaday.com

Make Your Bookshelf Clickable
Thu, 15 Feb 2024 12:00:00 +0000
https://hackaday.com/2024/02/15/make-your-bookshelf-clickable/

We’ll confess that we have a fondness for real books and plenty of them. So does [James], and he decided he needed a way to take a picture of his bookshelves and make each book clickable to find more information. This is one of those things that sounds fairly simple until you decide to do it. You can try an example of the results and then go back and read about the journey it took to get there.

There are several subtasks involved. First, you want to identify each book’s envelope, the region its spine occupies in the photo. It wouldn’t do to click on the Joy of Cooking and get information about Remembrance of Things Past.

The next challenge is reading the title of the book. This can be tricky. Fonts differ. The book could be upside down. Some titles go across the spine, but most run vertically. The remainder of the task is fairly easy. If you know the region and the title, you can easily find a link (for Google Books, in this case) and build an SVG overlay that maps the area for each book to the right link.

The optical character recognition is done with GPT-4. The prompt used is straightforward:

Read the text on the book spine. Only say the book cover title and author if you can find them. Say the book that is most prominent. Return the format [title] [author], with no punctuation.

With that information, a Google API will look up the book for you, and the rest is straightforward. You can grab the code on GitHub. We wonder how this method of OCR for difficult text would compare to more conventional methods. After all, OCR isn’t a hard problem. The complex problem is making it work well.
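The lookup-and-overlay part can be surprisingly compact. Here’s a rough sketch of that step, assuming the spine has already been located and its text recognized: it queries the public Google Books volumes endpoint and emits one clickable SVG rectangle. The query string and box coordinates are made-up examples, not [James]’s actual code.

```python
# Rough sketch of the lookup-and-overlay step. Query and coordinates are
# made-up examples, not [James]'s actual code.
import requests
from xml.sax.saxutils import escape

def book_link(query: str) -> str | None:
    # Ask the public Google Books volumes endpoint for the best match.
    resp = requests.get(
        "https://www.googleapis.com/books/v1/volumes",
        params={"q": query, "maxResults": 1},
        timeout=10,
    )
    items = resp.json().get("items", [])
    return items[0]["volumeInfo"].get("infoLink") if items else None

def svg_region(box: tuple[int, int, int, int], url: str) -> str:
    # One invisible, clickable rectangle over the book's spine.
    x, y, w, h = box
    return (f'<a xlink:href="{escape(url)}">'
            f'<rect x="{x}" y="{y}" width="{w}" height="{h}" fill-opacity="0"/></a>')

url = book_link("Joy of Cooking Rombauer")
if url:
    print(svg_region((120, 40, 55, 600), url))
```

One such rectangle per detected spine, wrapped in an SVG sized to the shelf photo, gives you the clickable bookshelf.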

Cracking the Spotify Code
Tue, 07 Dec 2021 09:00:00 +0000
https://hackaday.com/2021/12/07/cracking-the-spotify-code/

If you’ve used Spotify, you might have noticed a handy little code that it can generate that looks like a series of bars of different heights. If you’re like [Peter Boone], such an encoding will pique your curiosity, and you might set out to figure out how they work.

Spotify offers a little picture that, when scanned, opens almost anything searchable with Spotify. Several bars are centered on the Spotify logo, each taking one of eight different heights, storing information in octal. Many visual encoding schemes encode some URI (Uniform Resource Identifier) that provides a unique identifier for that specific song, album, or artist when decoded. Since many URIs on Spotify are pretty long (one example being spotify:show:3NRV0mhZa8xeRT0EyLPaIp, which clocks in at 218 bits), some mechanism is needed to compress the URIs down to something more manageable. Enter the media reference, a short sequence encoding a specific URI, generally under 40 bits. The reference is just a lookup in a database that Spotify maintains, so it requires a network connection to resolve. The actual encoding scheme from media reference to the values in the bars is quite complex, involving CRC, convolution, and puncturing. The CRC allows the program to check for correct decoding, and the convolution lets the program tolerate a small number of read errors while still producing an accurate result. Puncturing is just removing bits to reduce the number that must be encoded, relying on the convolutional code to fill in the holes.
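As a taste of the decoding, here’s a minimal sketch of just the very first step, turning bar heights into their three-bit octal values. The heights below are made-up example values; the CRC check, convolutional decoding, and de-puncturing that recover a real media reference are handled by [Peter]’s package.

```python
# Each bar height (0-7) is one octal digit, i.e. three bits. Example values
# only; the rest of the decode chain lives in [Peter]'s package.
def heights_to_bits(heights: list[int]) -> str:
    assert all(0 <= h <= 7 for h in heights), "each height must be an octal digit"
    return "".join(format(h, "03b") for h in heights)

print(heights_to_bits([0, 3, 7, 2, 5, 1, 6, 4]))  # -> 000011111010101001110100
```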

[Peter] explains it all helpfully and understandably in his write-up. The creator of the Spotify codes stopped by in the comments to offer some valuable pointers, including pointing out that there is a second mode where the lines aren’t centered, allowing it to store double the bits. [Peter] has a Python package on GitHub with all the code you need to start decoding. Maybe you can incorporate a Spotify code scanner into your custom Spotify-playing mini computer.

What Exactly is a Gaussian Blur?
Wed, 21 Jul 2021 14:00:33 +0000
https://hackaday.com/2021/07/21/what-exactly-is-a-gaussian-blur/

Blurring is a commonly used visual effect when digitally editing photos and videos. One of the most common blurs used in these fields is the Gaussian blur. You may have used this tool thousands of times without ever giving it greater thought. After all, it does a nice job and does indeed make things blurrier.

Of course, we often like to dig deeper here at Hackaday, so here’s our crash course on what’s going on when you run a Gaussian blur operation.

It’s Math! It’s All Math.

A 2D Gaussian distribution shown in a 3D plot. Note the higher values towards the center, and growing smaller towards the outside in a bell curve shape.

Digital images are really just lots of numbers, so we can work with them mathematically. Each pixel that makes up a typical digital color image has three values: its intensity in red, green, and blue. Of course, greyscale images consist of just a single value per pixel, representing its intensity on a scale from black to white, with greys in between.

Regardless of the image, whether color or greyscale, the basic principle of a Gaussian blur remains the same. Each pixel in the image we wish to blur is considered independently, and its value changed depending on its own value, and those of its surroundings, based on a filter matrix called a kernel.

The kernel consists of a rectangular array of numbers that follow a Gaussian distribution, AKA a normal distribution, or a bell curve.

This diagram shows the manner in which each pixel is processed. For a 3×3 kernel, the pixel of interest and all directly surrounding pixels are sampled. The kernel is then used to generate a new output pixel value based on a weighted average of the sampled pixels based on the Gaussian distribution.

Our rectangular kernel consists of values that are higher in the middle and drop off towards the outer edges of the square array, like the height of a bell curve in two dimensions. The kernel corresponds to the number of pixels we consider when blurring each individual pixel. Larger kernels spread the blur around a wider region, as each pixel is modified by more of its surrounding pixels.

For each pixel to be subject to the blur operation, a rectangular section equal to the size of the kernel is taken around the pixel of interest itself. These surrounding pixel values are used to calculate a weighted average for the original pixel’s new value based on the Gaussian distribution in the kernel itself.

A 5×5 Gaussian kernel. Note the external factor, which ensures that the values all add up to 1. This avoids adding any intensity to the image; the operation solely averages the pixels without otherwise changing their brightness.
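If you want to experiment, a kernel like that is only a few lines of NumPy. Here’s a quick sketch that builds an N×N Gaussian kernel and normalizes it so the weights sum to 1; the size and sigma are arbitrary choices for illustration.

```python
# Build an N x N Gaussian kernel and normalize it so the weights sum to 1,
# like the external factor in the figure above.
import numpy as np

def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> np.ndarray:
    ax = np.arange(size) - (size - 1) / 2        # coordinates centered on zero
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return kernel / kernel.sum()                 # normalize: weights sum to 1

print(gaussian_kernel(5, 1.0).round(3))
```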

Thanks to the distribution, the central pixel’s original value has the highest weight, so it doesn’t obliterate the image entirely. Immediately neighboring pixels have the next highest influence on the new pixel, and so on. This local averaging smoothes out the pixel values, and that’s the blur.

Edge cases are straightforward too. Where an edge pixel is sampled, the otherwise non-existent surrounding pixels are either given the same value as their nearest neighbor, or given a value matching their mirror-opposite pixel in the sampled area.

The same calculation is run for each pixel in the original image to be blurred, with the final output image made up of the pixel values calculated through the process. For grayscale images, it’s that simple. Color images can be done the same way, with the blur calculated separately for the red, green, and blue values of each pixel. Alternatively, you can specify the pixel values in some other color space and smooth them there.
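Putting the pieces together, here’s a sketch that applies the kernel from the earlier snippet to each channel of an 8-bit RGB image using SciPy. The mode argument covers the two edge strategies mentioned above: 'nearest' repeats the edge pixel, 'mirror' reflects it. The image array itself is assumed to be loaded elsewhere.

```python
# Blur an 8-bit RGB image by convolving each channel with a Gaussian kernel.
# mode='nearest' repeats the edge pixel, mode='mirror' reflects it.
import numpy as np
from scipy.ndimage import convolve

def gaussian_blur(image: np.ndarray, kernel: np.ndarray,
                  mode: str = "nearest") -> np.ndarray:
    out = np.empty_like(image, dtype=np.float64)
    for c in range(image.shape[2]):              # red, green, blue in turn
        out[..., c] = convolve(image[..., c].astype(np.float64), kernel, mode=mode)
    return np.clip(out, 0, 255).astype(np.uint8)
```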

Here we see an original image, and a version filtered with a Gaussian blur of kernel size three and kernel size ten. Note the increased blur as the kernel size increases. More pixels incorporated in the averaging results in more smoothing.

Of course, larger images require more calculations to deal with the greater number of pixels, and larger kernel sizes sample more surrounding pixels for each pixel of interest, and can thus take much longer to calculate. However, on modern computers, even blurring high-resolution images with huge kernel sizes can be done in the blink of an eye. Typically, however, it’s uncommon to use a kernel size larger than around 50 or so as things are usually already pretty blurry by that point.

The Gaussian blur is a great example of simple mathematics put to a powerful use in image processing. Now you know how it works on a fundamental level!

 

Putting Perseverance Rover’s View Into Satellite View Context
Tue, 30 Mar 2021 08:00:00 +0000
https://hackaday.com/2021/03/30/putting-perseverance-rovers-view-into-satellite-view-context/

It’s always fun to look over aerial and satellite maps of places we know, seeing a perspective different from our usual ground-level view. We lose that context when it’s a place we don’t know by heart. Such as, say, Mars. So [Matthew Earl] sought to give Perseverance rover’s landing video some context by projecting it onto orbital imagery from ESA’s Mars Express. The resulting video (embedded below the break) is a fun watch alongside the technical writeup Reprojecting the Perseverance landing footage onto satellite imagery.

Some telemetry of rover position and orientation was transmitted live during the landing process, with the rest recorded and downloaded later. Surprisingly, none of that information was used for this project, which was based entirely on video pixels. This makes the results even more impressive and the techniques more widely applicable to other projects. The foundational piece is SIFT (Scale Invariant Feature Transform), one of many tools in the OpenCV toolbox. SIFT found correlations between Perseverance’s video frames and Mars Express orbital imagery, feeding into a processing pipeline written in Python, with the results rendered in Blender.
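For the curious, the feature-matching step looks roughly like this in OpenCV: detect SIFT keypoints in a video frame and in the orbital image, match them with a ratio test, and fit a homography with RANSAC. The file names are placeholders, and the actual reprojection and Blender rendering live in [Matthew]’s pipeline.

```python
# Rough sketch of SIFT matching between a landing-video frame and an orbital
# image. File names are placeholders, not [Matthew]'s actual pipeline.
import cv2
import numpy as np

frame = cv2.imread("perseverance_frame.png", cv2.IMREAD_GRAYSCALE)
orbital = cv2.imread("mars_express_tile.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(frame, None)
kp2, des2 = sift.detectAndCompute(orbital, None)

# Lowe's ratio test to keep only confident matches.
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# Homography mapping frame coordinates onto the orbital image.
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print(H)
```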

While many elements of this project sound enticing for applications in robot vision, there are a few challenges touched upon in the “Final Touches” section of the writeup. The falling heatshield interfered with automated tracking, implying this process will need help to properly understand dynamically changing environments. Furthermore, it does not seem to run fast enough for a robot’s real-time needs. But at first glance, these problems are not fundamental. They merely await some motivated people to tackle them in the future.

This process bears some superficial similarities to projection mapping, which is a category of projects we’ve featured on these pages. Except everything is reversed (camera instead of video projector, etc.) making the math an entirely different can of worms. But if projection mapping sounds more to your interest, here is a starting point.

[via Dr. Tanya Harrison @TanyaOfMars]

Read Your Movies as Automatically Generated Comic Books
Mon, 22 Mar 2021 15:05:00 +0000
https://hackaday.com/2021/03/22/read-your-movies-as-automatically-generated-comic-books/

sample of automatically generated comics

A research paper from Dalian University of Technology in China and City University of Hong Kong (direct PDF link) outlines a system that automatically generates comic books from videos. But how can an algorithm boil down video scenes to appropriately reflect the gravity of the scene in a still image? This impressive feat is accomplished by saving two still images per second, then segmenting the frames into scenes through region-of-interest analysis and importance ranking.

movie to comic book pipeline diagram

For its next trick, speech for each scene is processed by combining subtitle information with the audio track of the video. The audio is analyzed for emotion to determine the appropriate speech bubble type and size of the subtitle text. Frames are even analyzed to establish which person is speaking for proper placement of the bubbles. It can then create layouts of the keyframes, determining panel sizes for each page based on the region-of-interest analysis.

The process is completed by stylizing the keyframes with flat color through quantization, for that classic cel shading look, and then populating the layouts with each frame and word balloon.
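The flat-color look in particular is easy to play with at home. Here’s a toy sketch that quantizes a keyframe down to a small palette with Pillow for a cel-shaded feel; the file names are placeholders and the paper’s actual quantization method is more sophisticated.

```python
# Toy version of the flat-color step: quantize a keyframe down to a small
# palette. Pillow's default median-cut quantizer stands in for the paper's
# method; file names are placeholders.
from PIL import Image

frame = Image.open("keyframe.png").convert("RGB")
flat = frame.quantize(colors=8).convert("RGB")   # 8 flat colors
flat.save("keyframe_flat.png")
```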

The team conducted a study with 40 users, pitting their results against previous techniques that require more human intervention, and still besting them in every measure. Like any great superhero, the team still sees room for improvement. In the future, they would like to improve the accuracy of keyframe selection and propose using a neural network to do so.

Thanks to [Qes] for the tip!

Colorizing Images With The Help Of AI
Tue, 19 Nov 2019 03:00:48 +0000
https://hackaday.com/2019/11/18/colorizing-images-with-the-help-of-ai/

The world was never black and white – we simply lacked the technology to capture it in full color. Many have experimented with techniques to take black and white images and colorize them. [Adrian Rosebrock] decided to put an AI on the job, with impressive results.

The method involves training a Convolutional Neural Network (CNN) on a large batch of photos, which have been converted to the Lab colorspace. In this colorspace, images are made up of 3 channels – lightness, a (red-green), and b (blue-yellow). This colorspace is used as it better corresponds to the nature of the human visual system than RGB. The model is then trained such that when given a lightness channel as an input, it can predict the likely a & b channels. These can then be recombined into a colorized image, and converted back to RGB for human consumption.
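The colorspace plumbing around the network is straightforward with OpenCV. Here’s a minimal sketch of the split-predict-merge step; the interesting part, the trained CNN, isn’t reproduced here, so fake_colorizer() just returns neutral a/b channels (128 in OpenCV’s 8-bit Lab) to keep the example runnable. The file names are placeholders.

```python
# Split-predict-merge plumbing in OpenCV's Lab representation. fake_colorizer()
# stands in for the trained CNN, so this runs but produces a grey result.
import cv2
import numpy as np

def fake_colorizer(lightness: np.ndarray) -> np.ndarray:
    # A real network would predict a and b from the lightness channel.
    return np.full(lightness.shape + (2,), 128, dtype=np.uint8)

bgr = cv2.imread("old_photo.png")                    # placeholder file name
lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
L, _, _ = cv2.split(lab)

ab = fake_colorizer(L)
colorized_lab = cv2.merge([L, ab[..., 0], ab[..., 1]])
colorized_bgr = cv2.cvtColor(colorized_lab, cv2.COLOR_LAB2BGR)
cv2.imwrite("colorized.png", colorized_bgr)
```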

It’s a technique capable of doing a decent job on a wide variety of material. Things such as grass, countryside, and ocean are particularly well dealt with; however, more complicated scenes can suffer from some aberration. Regardless, it’s a useful technique, and far less tedious than manual methods.

CNNs are doing other great things too, from naming tomatoes to helping out with home automation. Video after the break.


Take Pictures Around a Corner
Sun, 25 Aug 2019 11:00:57 +0000
https://hackaday.com/2019/08/25/take-pictures-around-a-corner/

One of the core lessons any physics student will come to realize is that the more you know about physics, the less intuitive it seems. Take the nature of light, for example. Is it a wave? A particle? Both? Neither? Whatever the answer to the question, scientists are at least able to exploit some of its characteristics, like its ability to bend and bounce off of obstacles. This camera, for example, is able to image a room without a direct line-of-sight as a result.

The process works by pointing a camera through an opening in the room and then strobing a laser at the exposed wall. The laser light bounces off of the wall, into the room, off of the objects on the hidden side of the room, and then back to the camera. This concept isn’t new, but the interesting thing this group has done is lift the curtain on the image processing underpinnings. Before, the process required a research team and often the backing of a university, but this project shows off the technique using just a few lines of code.

This project’s page documents everything extensively, including all of the algorithms used for reconstructing an image of the room. And by the way, it’s not a simple 2D image, but a 3D model that the camera can capture. So there should be some good information for anyone working in the 3D modeling world as well.
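To give a flavor of what those few lines can look like, here’s a toy backprojection sketch of the general idea: for every voxel in the hidden volume, sum the photon counts recorded at the time bin matching the round-trip flight time from the wall point to that voxel and back. The array shapes, bin width, and names are assumptions, and the project’s own algorithms are considerably more involved.

```python
# Toy backprojection sketch: accumulate counts at the time bin matching the
# round-trip distance wall point -> voxel -> wall point. Shapes, bin width,
# and names are assumptions, not the project's actual reconstruction.
import numpy as np

C = 3e8                  # speed of light, m/s
BIN_SECONDS = 16e-12     # assumed width of one time bin

def backproject(histograms: np.ndarray, wall_points: np.ndarray,
                voxels: np.ndarray) -> np.ndarray:
    """histograms: (n_points, n_bins) time-resolved counts per wall point,
    wall_points: (n_points, 3) positions, voxels: (n_voxels, 3) positions."""
    volume = np.zeros(len(voxels))
    for hist, p in zip(histograms, wall_points):
        dist = np.linalg.norm(voxels - p, axis=1)              # one-way distance
        bins = np.round(2 * dist / C / BIN_SECONDS).astype(int)
        valid = bins < hist.shape[0]
        volume[valid] += hist[bins[valid]]
    return volume
```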

Thanks to [Chris] for the tip!
