UTF-8 Is Beautiful

It’s likely that many Hackaday readers will be aware of UTF-8, the mechanism for incorporating diverse alphabets and other characters such as 💩 emojis. It takes the long-established 7-bit ASCII character set and extends it into multiple bytes to represent many thousands of characters. How it does this may well be beyond that basic grasp, and [Vishnu] is here with a primer that’s both fascinating and easy to read.

UTF-8 extends ASCII from codes which fit in a single byte, to codes which can be up to four bytes long. The key lies in the first few bits of each byte, which specify how many bytes each character has, and then that it is a data byte. Since 7-bit ASCII codes always have a 0 in their most significant bit when mapped onto an 8-bit byte, compatibility with ASCII is ensured by the first 128 characters always beginning with a zero bit. It’s simple, elegant, and for any of who had to deal with character set hell in the days before it came along, magic.

We’ve talked surprisingly little about the internals of UTF-8 in the past, but it’s worthy of note that this is our second piece ever to use the poop emoji, after our coverage of the billionth GitHub repository.

Emoji bales: Tony Hisgett, CC BY 2.0.

Building Up Unicode Characters One Bit At A Time

The range of characters that can be represented by Unicode is truly bewildering. If there’s a symbol that was ever used to represent a sound or a concept anywhere in the world, chances are pretty good that you can find it somewhere in Unicode. But can many of us recall the proper keyboard calisthenics needed to call forth a particular character at will? Probably not, which is where this Unicode binary input terminal may offer some relief.

“Surely they can’t be suggesting that entering Unicode characters as a sequence of bytes using toggle switches is somehow easier than looking up the numpad shortcut?” we hear you cry. No, but we suspect that’s hardly [Stephen Holdaway]’s intention with this build. Rather, it seems geared specifically at making the process of keying in Unicode harder, but cooler; after all, it was originally his intention to enter this in last year’s Odd Inputs and Peculiar Peripherals contest. [Stephen] didn’t feel it was quite ready at the time, but now we’ve got a chance to give this project a once-over.

The idea is simple: a bank of eight toggle switches (with LEDs, of course) is used to compose the desired UTF-8 character, which is made up of one to four bytes. Each byte is added to a buffer with a separate “shift/clear” momentary toggle, and eventually sent out over USB with a flick of the “send” toggle. [Stephen] thoughtfully included a tiny LCD screen to keep track of the character being composed, so you know what you’re sending down the line. Behind the handsome brushed aluminum panel, a Pi Pico runs the show, drawing glyphs from an SD card containing 200 MB of True Type Font files.

At the end of the day, it’s tempting to look at this as an attractive but essentially useless project. We beg to differ, though — there’s a lot to learn about Unicode, and [Stephen] certainly knocked that off his bucket list with this build. There’s also something wonderfully tactile about this interface, and we’d imagine that composing each codepoint is pretty illustrative of how UTF-8 is organized. Sounds like an all-around win to us.