unicode – Hackaday

Nic Barker Explains ASCII, Unicode, and UTF-8

John Elliot V — Fri, 23 Jan 2026 03:00:51 +0000

Over on YouTube [Nic Barker] gives us: UTF-8, Explained Simply.

If you’re gonna be a hacker eventually you’re gonna have to write software to process and generate text data. And when you deal with text data, in this day and age, there are really only two main things you need to know: 7-bit ASCII and UTF-8. In this video [Nic] explains 7-bit ASCII and Unicode, and then explains UTF-8 and how it relates to Unicode and ASCII. [Nic] goes into detail about some of the clever features of Unicode and UTF-8 such as self-synchronization, single-byte ASCII, multi-byte codepoints, leading bytes, continuation bytes, and grapheme clusters.
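To make the leading-byte and continuation-byte idea concrete, here is a minimal Python sketch that labels every byte in the UTF-8 encoding of a string. The function name `classify_utf8_bytes` is our own invention for illustration and does not come from the video.

```python
def classify_utf8_bytes(text: str) -> list[tuple[str, str]]:
    """Label each byte of the UTF-8 encoding of `text` by its bit pattern."""
    labels = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            kind = "ascii"          # 0xxxxxxx: plain 7-bit ASCII, one byte
        elif byte >> 6 == 0b10:
            kind = "continuation"   # 10xxxxxx: never starts a character
        elif byte >> 5 == 0b110:
            kind = "lead-2"         # 110xxxxx: starts a 2-byte sequence
        elif byte >> 4 == 0b1110:
            kind = "lead-3"         # 1110xxxx: starts a 3-byte sequence
        else:
            kind = "lead-4"         # 11110xxx: starts a 4-byte sequence
        labels.append((f"{byte:08b}", kind))
    return labels


for bits, kind in classify_utf8_bytes("A\u00e9\u20ac"):
    print(bits, kind)
```

Because a continuation byte can never be mistaken for the start of a character, a decoder dropped into the middle of a stream only has to skip bytes labelled `continuation` to find the next character boundary. That is the self-synchronization property.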

[Nic] mentions UTF-16, but UTF-16 turned out to be a really bad idea. UTF-16 combines all of the disadvantages of UTF-8 with all of the disadvantages of UTF-32. In UTF-16 there are things known as “surrogate pairs”, which means a single Unicode codepoint might require two UTF-16 “characters” to describe it. The Byte Order Marks (BOMs) introduced with UTF-16 also proved to be problematic, particularly if you cat files together: you can end up with stray BOM indicators randomly embedded in your new file. They say that null was a billion dollar mistake; well, UTF-16 was the other billion dollar mistake.
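Both complaints are easy to reproduce with nothing but Python's built-in codecs. This is a small sketch; the variable names are ours, and the "files" are just byte strings standing in for what `cat` would glue together.

```python
import codecs

# One code point outside the Basic Multilingual Plane...
emoji = "\U0001F600"
utf16 = emoji.encode("utf-16-be")
# ...needs two 16-bit code units: a high and a low surrogate.
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
print([hex(u) for u in units])

# Two little-endian UTF-16 "files", each starting with its own BOM.
part = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
joined = part + part  # the moral equivalent of `cat a.txt b.txt`

# The decoder consumes only the first BOM; the second one survives
# as a stray U+FEFF in the middle of the text.
decoded = joined.decode("utf-16")
print(repr(decoded))
```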

tl;dr: don’t use UTF-16, but do use 7-bit ASCII and UTF-8.

Oh, and as we’re here, and talking about Unicode, did you know that you can support The Unicode Consortium with Unicode Adopt-a-Character? You send money to sponsor a character and they put your name up in lights! Win, win! (We noticed while doing the research for this post that Jeroen Frijters of IKVM fame has sponsored #, a nod to C#.)

If you’re interested in learning more about Unicode check out Understanding And Using Unicode and Building Up Unicode Characters One Bit At A Time.

Send Images to Your Terminal With Rich Pixels

Donald Papp — Sat, 13 Sep 2025 20:00:00 +0000

[darrenburns]’ Rich Pixels is a library for sending colorful images to a terminal. Give it an image, and it’ll dump it to your terminal in full color. While it also supports ASCII art, the cool part is how it makes it so easy to display an arbitrary image — a pixel-art rendition of it, anyway — in a terminal window.

How it does this is by cleverly representing two lines of pixels in the source image with a single terminal row of characters. Each vertical pixel pair is represented by a single Unicode ▄ (U+2584 “lower half block”) character. The trick is to set the background color of the half-block to the upper pixel’s RGB value, and the foreground color of the half-block to the lower pixel’s RGB. By doing this, a single half block character represents two vertically-stacked pixels. The only gotcha is that Rich Pixels doesn’t resize the source image; if one’s source image is 600 pixels wide, one’s terminal is going to receive 600 U+2584 characters per line to render the Rich Pixels version.
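The half-block trick can be sketched in a few lines of Python using raw 24-bit ANSI color escapes. To be clear, this is not Rich Pixels' actual code (that library builds on Rich's own rendering); the function `render_half_blocks` and the plain list-of-tuples image format are assumptions made for illustration, and the output only looks right in a terminal with truecolor support.

```python
def render_half_blocks(pixels):
    """Render rows of (r, g, b) tuples, two pixel rows per terminal row.

    A trailing odd row is dropped to keep the sketch short.
    """
    lines = []
    for y in range(0, len(pixels) - 1, 2):
        upper, lower = pixels[y], pixels[y + 1]
        cells = []
        for (ur, ug, ub), (lr, lg, lb) in zip(upper, lower):
            cells.append(
                f"\x1b[48;2;{ur};{ug};{ub}m"   # background = upper pixel
                f"\x1b[38;2;{lr};{lg};{lb}m"   # foreground = lower pixel
                "\u2584"                       # lower half block
            )
        lines.append("".join(cells) + "\x1b[0m")  # reset colors at end of row
    return "\n".join(lines)


# A 2x2 test image: red over blue, green over white.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
print(render_half_blocks(image))
```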

[Simon Willison] took things a step further and made show_image.py, which works the same except it resizes the source image to fit one’s terminal first. This makes it much more flexible and intuitive.

The code is here on [Simon]’s tools GitHub, a repository for software tools he finds useful, like the Incomplete JSON Pretty Printer.

Why Names Break Systems

Ian Bos — Wed, 06 Aug 2025 02:00:23 +0000

Web systems are designed to be simple and reliable. Designing for the everyday person is the goal, but if you don’t consider the odd man out, they may encounter some problems. This is the everyday life for some people with names that often have unconsidered features, such as apostrophes or spaces. This is the life of [Luke O’Sullivan], who even had to fly under a different name than his legal one.

[O’Sullivan] is far from a rare surname, but presents an interesting challenge for many computer systems. Systems from the era of penny pinching every bit relied on ASCII. ASCII only included 128 characters, which included a very small set of special characters. Some systems didn’t even include some of these characters to reduce loading times. Throw on the security features put in place to prevent injection attacks, and you have a very unfriendly field for many uncommon names.

Unicode is a newer standard with over 150,000 characters, allowing for nearly any character. However, many older systems are far from easy or cheap to convert to the new standard. This leaves many people to have to adapt to the software rather than the software adapting to the user. While this is simply poor design in general, [O’Sullivan] makes sure to point out how demeaning this can be for many people. Imagine being told that your name isn’t important enough to be included, or told that it’s “invalid”.

One excuse that gets thrown about is the aforementioned injection attacks that can be used against these systems. These can cause systems to crash or even change settings; however, it’s not just older systems that get affected. For modern-day prompt injection, check out how AI models can get affected!
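For what it's worth, an apostrophe only becomes dangerous when user input is pasted directly into a query string. Here is a small sketch using Python's built-in sqlite3 module; the `passengers` table is made up for the example and has nothing to do with any system [O’Sullivan] actually ran into.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (name TEXT)")

name = "Luke O'Sullivan"

# Naive string building: the apostrophe ends the SQL literal early.
try:
    conn.execute(f"INSERT INTO passengers (name) VALUES ('{name}')")
    naive_ok = True
except sqlite3.OperationalError:
    naive_ok = False

# Parameterized query: the driver passes the name as data, not as SQL.
conn.execute("INSERT INTO passengers (name) VALUES (?)", (name,))
stored = conn.execute("SELECT name FROM passengers").fetchone()[0]

print(naive_ok, stored)
```

With the placeholder version there is no reason to reject the name at all, which suggests the "security" excuse is more about legacy code than necessity.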

Thanks to Ken Fallon for the tip!

This Week in Security: Unicode Strikes Again, Trust No One (Redditor), and More

Jonathan Bennett — Fri, 14 Jun 2024 14:00:50 +0000

There’s a popular Sysadmin meme that system problems are “always DNS”. In the realm of security, it seems like “it’s always Unicode”. And it’s not hard to see why. Unicode is the attempt to represent all of Earth’s languages with a single character set, and that means there’s a lot of very similar characters. The two broad issues are that human users can’t always see the difference between similar characters, and that libraries and applications sometimes automatically convert exotic Unicode characters into more traditional text.

This week we see the resurrection of an ancient vulnerability in PHP-CGI, that allows injecting command line switches when a web server launches an instance of PHP-CGI. The solution was to block some characters in specific places in query strings, like a query string starting with a dash.

The bypass is due to a Windows feature, “Best-Fit”, an automatic down-convert from certain Unicode characters. This feature works on a per-locale basis, which means that not every system language behaves the same. The exact bypass that has been found is the conversion of a soft hyphen, which doesn’t get blocked by PHP, into a regular hyphen, which can trigger the command injection. This quirk only happens when the Windows locale is set to Chinese or Japanese. Combined with the relative rarity of running PHP-CGI, and PHP on Windows, this is a pretty narrow problem. The XAMPP install does use this arrangement, so those installs are vulnerable, again if the locale is set to one of these specific languages. The other thing to keep in mind is that the Unicode character set is huge, and it’s very likely that there are other special characters in other locales that behave similarly.
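The shape of the bug is a check that runs before the conversion. The sketch below is purely illustrative: `BEST_FIT` is a hypothetical one-entry stand-in for the locale-specific Windows table, and `looks_like_switch` is our own simplification of the kind of filter described above, not PHP's actual code.

```python
SOFT_HYPHEN = "\u00ad"

# Hypothetical stand-in for a locale's Best-Fit table (one entry only).
BEST_FIT = {SOFT_HYPHEN: "-"}


def looks_like_switch(arg: str) -> bool:
    """The filter: reject arguments that start with an ASCII hyphen."""
    return arg.startswith("-")


def best_fit(arg: str) -> str:
    """The down-conversion that happens after the filter has run."""
    return "".join(BEST_FIT.get(ch, ch) for ch in arg)


query = SOFT_HYPHEN + "some-switch"

print(looks_like_switch(query))            # what the filter sees
print(looks_like_switch(best_fit(query)))  # what the program receives
```

Any validation done on the pre-conversion string says nothing about the post-conversion string, which is why other locales may well hide similar surprises.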

Downloader Beware

The ComfyUI project is a flowchart interface for doing AI image generation workflows. It’s an easy way to build complicated generation pipelines, and the community has stepped up to build custom plugins and nodes for generation. The thing is, it’s not always the best idea to download and run code from strangers on the Internet, as a group of ComfyUI users found out the hard way this week. The ComfyUI_LLMVISION node from u/AppleBotzz was malicious.

The node references a malicious Python package that grabs browser data and sends it all to a Discord or Pastebin. It appears that some additional malware gets installed, for continuing access to infected systems. It’s a rough way to learn.

PyTorch Scores a Dubious 10.0

CVE-2024-5480 is a PyTorch flaw that allows PyTorch worker nodes to trigger arbitrary eval() calls on the master node. No authentication is required to add a PyTorch worker, so this is technically an unauthorized RCE, earning the CVSS of 10.0. Practically speaking it’s not that dire of a problem, as your PyTorch cluster shouldn’t be on the Internet to start with, and there’s no authentication as a design choice. It’s not clear that the PyTorch developers consider this a legitimate security vulnerability at all. It may or may not be fixed with version 2.3.

Next Level Smishing

My least favorite term in infosec has to be “smishing”, a frankenword for SMS phishing. Cell phone carriers around the world are working hard to block spam messages, making smishing an impossible task. And that’s why it’s particularly interesting to hear about a bypass that a pair of criminals were using in London. The technical details are light, but the police reported a “homemade mobile antenna”, “illegitimate telephone mast”, and “text message blaster” as part of the seized kit. The initial report sounds like it may be a sort of reverse stingray, where messages are skipping the regular cellular infrastructure and are getting sent directly to nearby cell phones. Hopefully more information will be forthcoming soon.

Zyxel’s NsaRescueAngel

The programmers at Zyxel apparently have a sense of humor, given the naming used for this mis-feature. Zyxel NAS units have a bit of magic code that writes a password for the new user, NsaRescueAngel, to the shadow password file. The SSH daemon is restarted, and upnp is fired off to request port forwarding from the outside world. One of the script names, possibly from a previous iteration, was open_back_door.sh, which seems to be sort of lampshading the whole thing.

It’s presumably intended to be a great troubleshooting tool, when a customer is stuck and needs help, to be able to visit a web url to enable remote access for a Zyxel tech. The problem is that the Zyxel NAS already has an authentication bypass flaw, and while it’s been patched, it wasn’t patched very well, making this whole scheme accessible without authentication, just by slapping /favicon.ico onto the url. The additional problems have been fixed in a more recent update.

Russian Secure Phablet?

A Twitter thread tells the story of a Russian secure device, left behind on the back of a bus in England. That’s an interesting premise. But the thread continues, that ‘conveniently the owner also left a briefcase with design notes, architecture, documentation, implementation, marketing material and internal Zoom demos about “trusted” devices too!’ OK, now this has to either be a fanfic, or a fell-off-the-back-of-a-truck story. There’s some convincing looking screenshots, and even rom dumps. What’s going on here?

Nobody knew how the devices worked, conveniently the owner also left a briefcase with design notes, architecture, documentation, implementation, marketing material and internal Zoom demos about "trusted" devices too! We'd all have been lost without those. https://t.co/LN7cTybxOV pic.twitter.com/j5OCHprSie

— hackerfantastic.x (@hackerfantastic) June 11, 2024

The most likely explanation is that somebody got their hands on a trove of data on these devices, and wanted to dump it online with a silly story. But fair warning, don’t trust any of the shared files. Who knows what’s actually in there. Taking a look at something untrusted like this is an art in itself, best done with isolated VMs and burner machines, maybe a Linux install you don’t mind wiping?

Bits and Bytes

Buskill just published their 8th warrant canary, a cryptographically signed statement attesting that they have not been served any secret warrants or national security letters that would undermine the trustworthiness of the Buskill project or code. In addition to a good cryptographic signature, this canary includes a handful of latest news headlines in the signed material, proving it is actually a recently generated document.

[Aethlios] has published Reset Tolkien, an open source tool for finding and attacking a very specific sort of weakness in time based tokens. The targeted flaw is a token generated from an improper randomness source, like the current time. If the pattern can be found, a “sandwich attack” can narrow down the possible reset codes by requesting a reset code for a controlled account, requesting one for the target account, and then once again for the controlled account. The target code must come between the two known codes.
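A toy version of the sandwich attack looks something like the following. Everything here is hypothetical: `make_token` is a deliberately weak generator we made up (a truncated hash of a millisecond timestamp), and none of this reflects how Reset Tolkien itself is implemented.

```python
import hashlib


def make_token(timestamp_ms: int) -> str:
    """Hypothetical weak generator: the only 'secret' is the clock."""
    return hashlib.sha256(str(timestamp_ms).encode()).hexdigest()[:16]


def recover_timestamp(token: str, around_ms: int, window_ms: int = 5000) -> int:
    """Brute-force the timestamp behind one of the attacker's own tokens."""
    for t in range(around_ms - window_ms, around_ms + window_ms + 1):
        if make_token(t) == token:
            return t
    raise ValueError("token does not match the guessed generator")


def sandwich_candidates(first_own: str, second_own: str, around_ms: int) -> list[str]:
    """Every token that could have been issued between the two known ones."""
    start = recover_timestamp(first_own, around_ms)
    end = recover_timestamp(second_own, around_ms)
    return [make_token(t) for t in range(start + 1, end)]


# Simulated server: attacker reset, victim reset, attacker reset again.
t0 = 1_700_000_000_000
own_before = make_token(t0)
victim = make_token(t0 + 7)
own_after = make_token(t0 + 15)

candidates = sandwich_candidates(own_before, own_after, around_ms=t0 + 3)
print(len(candidates), victim in candidates)
```

The tighter the attacker can bracket the victim's request, the shorter the candidate list gets.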

And finally, TPM security is hard. This time, the Trusted Platform Module can be reset by reclaiming the GPIO pins connected to it, and simulating a reboot by pulling the reset pin. This results in the TPM possibly talking to an application when it thinks it is talking to the CPU doing boot decryption. In short, it can result in compromised keys. Thanks to [char] from Discord for sending this one in!

Building Up Unicode Characters One Bit at a Time

Dan Maloney — Thu, 07 Sep 2023 11:00:44 +0000

The range of characters that can be represented by Unicode is truly bewildering. If there’s a symbol that was ever used to represent a sound or a concept anywhere in the world, chances are pretty good that you can find it somewhere in Unicode. But can many of us recall the proper keyboard calisthenics needed to call forth a particular character at will? Probably not, which is where this Unicode binary input terminal may offer some relief.

“Surely they can’t be suggesting that entering Unicode characters as a sequence of bytes using toggle switches is somehow easier than looking up the numpad shortcut?” we hear you cry. No, but we suspect that’s hardly [Stephen Holdaway]’s intention with this build. Rather, it seems geared specifically at making the process of keying in Unicode harder, but cooler; after all, it was originally his intention to enter this in last year’s Odd Inputs and Peculiar Peripherals contest. [Stephen] didn’t feel it was quite ready at the time, but now we’ve got a chance to give this project a once-over.

The idea is simple: a bank of eight toggle switches (with LEDs, of course) is used to compose the desired UTF-8 character, which is made up of one to four bytes. Each byte is added to a buffer with a separate “shift/clear” momentary toggle, and eventually sent out over USB with a flick of the “send” toggle. [Stephen] thoughtfully included a tiny LCD screen to keep track of the character being composed, so you know what you’re sending down the line. Behind the handsome brushed aluminum panel, a Pi Pico runs the show, drawing glyphs from an SD card containing 200 MB of TrueType font files.
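If you don't have a panel of toggle switches handy, the same workflow can be mimicked in Python. This sketch is ours and says nothing about [Stephen]'s firmware: each string of eight bits stands in for one setting of the switches, and `compose_from_switches` is a made-up name.

```python
def compose_from_switches(switch_bytes: list[str]) -> str:
    """Shift each 8-bit switch pattern into a buffer, then 'send' it."""
    buffer = bytearray()
    for bits in switch_bytes:
        buffer.append(int(bits, 2))   # one flick of the "shift" toggle
    return buffer.decode("utf-8")     # raises UnicodeDecodeError if malformed


# Euro sign, U+20AC: one 3-byte lead followed by two continuation bytes.
print(compose_from_switches(["11100010", "10000010", "10101100"]))
```

Keying in a lone continuation byte, or a lead byte with too few followers, fails to decode. That is a decent hands-on demonstration of how strict the encoding is.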

At the end of the day, it’s tempting to look at this as an attractive but essentially useless project. We beg to differ, though — there’s a lot to learn about Unicode, and [Stephen] certainly knocked that off his bucket list with this build. There’s also something wonderfully tactile about this interface, and we’d imagine that composing each codepoint is pretty illustrative of how UTF-8 is organized. Sounds like an all-around win to us.

Understanding and Using Unicode

Chris Lott — Sat, 29 Jul 2023 02:00:45 +0000

Unicode logo from Wikimedia, with the text “is hard” added

Computer engineer [Marco Cilloni] realized a lot of developers today still have trouble dealing with Unicode in their programs, especially in the C/C++ world. He wrote an excellent guide called “Unicode is harder than you think” that summarizes many of the issues surrounding Unicode and its encodings. He first presents a brief history of Unicode and how it came about, so you can understand the reasons for the frustrating edge cases you’re bound to encounter.

There have been a variety of Unicode encoding methods over the years, but modern programs dealing with strings will probably be using UTF-8 encoding — and you should too. This multibyte encoding scheme has the convenient property of not changing the original character values when dealing with 7-bit ASCII text. We were surprised to read that there is actually an EBCDIC version of UTF still officially on the books today:

UTF-EBCDIC, a variable-width encoding that uses 1-byte characters designed for IBM’s EBCDIC systems (note: I think it’s safe to argue that using EBCDIC in 2023 edges very close to being a felony)

[Marco] goes into detail about different problems found when dealing with Unicode strings. When C was being developed, ASCII itself had just been finalized in the form we know today, so it treats characters as single-byte numbers. With multi-byte, variable-width character strings, the usual functions like strlen fall apart.
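The strlen problem is easy to see from Python, where the byte count and the code point count are both one call away. This is only an illustration of the mismatch, not an excerpt from [Marco]'s essay.

```python
text = "na\u00efve caf\u00e9"      # "naïve café", using precomposed characters

code_points = len(text)                 # what a human would probably count
utf8_bytes = len(text.encode("utf-8"))  # what C's strlen would report

print(code_points, utf8_bytes)
```

The two accented letters each take two bytes in UTF-8, so a byte-counting function overshoots by two here, and by more for text further away from ASCII.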

Unicode’s combining characters also cause problems when it comes to comparison and collation of text. Some characters can be built up from a base character plus one or more combining marks, yet also exist as a single precomposed code point. There are also ligatures that combine multiple characters into a single code point. Suddenly it isn’t so clear what character equality even means: Unicode defines two kinds of equivalence, canonical and compatibility.
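Python's standard unicodedata module exposes both kinds of equivalence through its normalization forms, which makes for a quick experiment. This is a sketch of ours, not an example taken from the essay.

```python
import unicodedata

precomposed = "\u00e9"    # é as a single code point
combined = "e\u0301"      # e followed by COMBINING ACUTE ACCENT

# They render the same, but a naive comparison says they differ.
print(precomposed == combined)

# Canonical equivalence: NFC composes the pair into the single code point.
nfc_equal = unicodedata.normalize("NFC", combined) == precomposed

# Compatibility equivalence: the "fi" ligature survives NFC,
# but NFKC folds it into two ordinary letters.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(nfc_equal, nfc_ligature == ligature, nfkc_ligature)
```

Normalizing both sides to the same form before comparing is the usual fix; which form to pick depends on whether a ligature should count as equal to its spelled-out letters.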

These are but a sampling of the issues [Marco] discusses. The most important takeaway is that “Unicode handling is always best left to a library”. If your language or compiler of choice doesn’t have one, the Unicode Consortium provides a reference library called ICU (International Components for Unicode).

If this topic interests you, do check out his essay linked above. And if you want to get your hands dirty with Unicode glyphs, check out [Roman Czyborra]’s tools here, which are simple command line tools that let you easily experiment using ASCII art. [Roman] founded the open-sourced GNU Unicode Font project back in the 1990s, Unifoundry. Our own [Maya Posch] wrote a great article on the history of Unicode in 2021.

Punycodes Explained

Matthew Carlson — Wed, 18 Jan 2023 15:10:59 +0000

When you’re restricted to ASCII, how can you represent more complex things like emojis or non-Latin characters? One answer is Punycode, which is a way to represent Unicode characters in ASCII. However, while you could technically encode the raw bits of Unicode into characters, like Base64, there’s a snag. The Domain Name System (DNS) generally requires that hostnames are case-insensitive, so whether you type in HACKADAY.com, HackADay.com, or just hackaday.com, it all goes to the same place.

[A. Costello] at the University of California, Berkeley proposed the idea of Punycode in RFC 3492 in March 2003. It outlines a simple algorithm where all regular ASCII characters are pulled out and stuck on one side with a separator in between, in this case, a hyphen. Then the Unicode characters are encoded and stuck on the end of the string.

First, the numeric codepoint and position in the string are multiplied together. Then the number is encoded as a Base-36 (a-z and 0-9) variable-length integer. For example, a greeting and the Greek for thanks, “Hey, ευχαριστώ” becomes “Hey, -mxahn5algcq2”. Similarly, the beautiful city of München becomes mnchen-3ya.

As you might notice in the Greek example, there is nothing to help the decoder know which base-36 characters belong to which original Unicode symbol. Thanks to the variable-length integers, each significant digit is recognizable, as there are thresholds for what numbers can be encoded. A finite-state machine comes to the rescue. The RFC gives some exemplary pseudocode that outlines the algorithm. It’s pretty clever, utilizing a bias that rolls as the decoding goes along. As it is always increasing, it is a monotonic function with some clever properties.
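You don't have to implement the state machine yourself to play with it, since Python ships a `punycode` codec in its standard library. A minimal round trip using the München example from above:

```python
city = "m\u00fcnchen"                  # "münchen"

encoded = city.encode("punycode")      # ASCII letters first, then the delta
decoded = encoded.decode("punycode")

print(encoded, decoded)
```

The plain ASCII letters come out in front of the hyphen, and everything after it is the variable-length encoding of where the ü goes and which code point it is.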

Of course, to prevent regular URLs from being interpreted as punycodes, URLs have a special little prefix xn-- to let the browser know that it’s a code. This includes all Unicode characters, so emojis are also valid. So why can’t you go to xn--mnchen-3ya.de? If you type it into your browser or click the link, you might see your browser transform that confusing letter soup into a beautiful URL (not all browsers do this). The biggest problem is Unicode itself.
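The prefix handling is also available from Python's standard library through the `idna` codec, which applies Punycode label by label and adds `xn--` only where needed. Note that this built-in codec implements the older IDNA 2003 rules, so treat it as a quick experiment and not a reference for what registries accept today.

```python
hostname = "m\u00fcnchen.de"           # "münchen.de"

ascii_form = hostname.encode("idna")   # only the non-ASCII label is rewritten
round_trip = ascii_form.decode("idna")

print(ascii_form, round_trip)
```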

While Unicode offers incredible support for making the hundreds of languages used around the web every day possible and, dare we say, even somewhat straightforward, there are some warts. Cyrillic, zero-width letters and other Unicode oddities allow those with more nefarious intentions to set up a domain that, when rendered, displays as a well-known website. The SSL certificates are valid, and everything else checks out. Cyrillic includes characters that visually look identical to their Latin counterparts but are represented differently. The opportunities for hackers and phishing attempts are too great, and so far, punycodes haven’t been allowed on most domains.

For example, can you tell the difference between these two domains?

hackaday.com

hаckаday.com

Some browsers will render the hover text as the Punycode, and some will keep it as its UTF-8 equivalent. The “a” (U+0061) has been replaced by the Cyrillic “a” (U+0430), which most computers render with the exact same character.
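When your eyes can't tell the difference, asking Python for the character names will. In this sketch we build a lookalike ourselves by swapping every Latin "a" for the Cyrillic one, so it may not match the exact string above character for character.

```python
import unicodedata

real = "hackaday.com"
spoof = real.replace("a", "\u0430")    # U+0430 CYRILLIC SMALL LETTER A

# List every non-ASCII character hiding in the lookalike.
suspects = [(ch, unicodedata.name(ch)) for ch in spoof if ord(ch) > 127]

# The ASCII form that actually goes to DNS gives the game away.
spoof_ascii = spoof.encode("idna")

print(real == spoof)
print(suspects)
print(spoof_ascii)
```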

This is an IDN homograph attack, where the attacker relies on a user clicking a link that they can’t tell apart from the real one. In 2001, two security researchers published a paper on the subject, registering “microsoft.com” with Cyrillic characters as a proof of concept. In response, top-level domains were recommended to only accept Unicode characters containing Latin characters and characters from languages used in that country. As a result, many of the common US-based top-level domains don’t accept Unicode domain names at all. At least the non-displayable characters are specifically banned by ICANN, which avoids a large can of worms, but having visually identical but bit-wise different characters out there leads to confusion.

However, mitigations to these types of attacks are slowly being rolled out. As a first layer of protection, Firefox and Chromium-based browsers only show the non-Punycode version if all the characters are from the same language. Some browsers convert all Unicode URLs to Punycode. Other techniques use optical character recognition (OCR) to determine whether a URL can be interpreted differently. Outside the browser, links sent by text message or in emails might not have the same smarts, and you won’t know until you’ve opened them in your browser. And by then, it’s too late.
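A very crude version of the "same language" check can be put together from the standard library. Python has no direct script lookup, so this sketch leans on the first word of each character's Unicode name as a stand-in for its script. That is an assumption that holds for Latin and Cyrillic letters but is nowhere near what real browsers implement.

```python
import unicodedata


def scripts_in_label(label: str) -> set[str]:
    """Rough script detection: first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


def is_mixed_script(label: str) -> bool:
    return len(scripts_in_label(label)) > 1


honest = "hackaday"
lookalike = "h\u0430ck\u0430day"       # two Cyrillic letters mixed in

print(scripts_in_label(honest), is_mixed_script(honest))
print(scripts_in_label(lookalike), is_mixed_script(lookalike))
```

A label written entirely in Cyrillic lookalikes would sail straight through a check like this, which is one reason browsers layer several heuristics on top of each other.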

Challenges aside, will Punycodes get their time in the sun? Will Hackaday ever get .com? Who knows. But in the meantime, we can enjoy a clever solution proposed in 2003 to the thorny problem of domain name internationalization that we still haven’t quite solved.