Nic Barker Explains ASCII, Unicode, And UTF-8

January 22, 2026 by John Elliot V 15 Comments

Over on YouTube [Nic Barker] gives us: UTF-8, Explained Simply.

If you’re gonna be a hacker eventually you’re gonna have to write software to process and generate text data. And when you deal with text data, in this day and age, there are really only two main things you need to know: 7-bit ASCII and UTF-8. In this video [Nic] explains 7-bit ASCII and Unicode, and then explains UTF-8 and how it relates to Unicode and ASCII. [Nic] goes into detail about some of the clever features of Unicode and UTF-8 such as self-synchronization, single-byte ASCII, multi-byte codepoints, leading bytes, continuation bytes, and grapheme clusters.

[Nic] mentions about UTF-16, but UTF-16 turned out to be a really bad idea. UTF-16 combines all of the disadvantages of UTF-8 with all of the disadvantages of UTF-32. In UTF-16 there are things known as “surrogate pairs”, which means a single Unicode codepoint might require two UTF-16 “characters” to describe it. Also the Byte Order Marks (BOM) introduced with UTF-16 proved to be problematic. Particularly if you cat files together you can end up with stray BOM indicators randomly embedded in your new file. They say that null was a billion dollar mistake, well, UTF-16 was the other billion dollar mistake.

tl;dr: don’t use UTF-16, but do use 7-bit ASCII and UTF-8.

Oh, and as we’re here, and talking about Unicode, did you know that you can support The Unicode Consortium with Unicode Adopt-a-Character? You send money to sponsor a character and they put your name up in lights! Win, win! (We noticed while doing the research for this post that Jeroen Frijters of IKVM fame has sponsored #, a nod to C#.)

If you’re interested in learning more about Unicode check out Understanding And Using Unicode and Building Up Unicode Characters One Bit At A Time.

Continue reading “Nic Barker Explains ASCII, Unicode, And UTF-8” →

UTF-8 Is Beautiful

September 14, 2025 by Jenny List 30 Comments

It’s likely that many Hackaday readers will be aware of UTF-8, the mechanism for incorporating diverse alphabets and other characters such as 💩 emojis. It takes the long-established 7-bit ASCII character set and extends it into multiple bytes to represent many thousands of characters. How it does this may well be beyond that basic grasp, and [Vishnu] is here with a primer that’s both fascinating and easy to read.

UTF-8 extends ASCII from codes which fit in a single byte, to codes which can be up to four bytes long. The key lies in the first few bits of each byte, which specify how many bytes each character has, and then that it is a data byte. Since 7-bit ASCII codes always have a 0 in their most significant bit when mapped onto an 8-bit byte, compatibility with ASCII is ensured by the first 128 characters always beginning with a zero bit. It’s simple, elegant, and for any of who had to deal with character set hell in the days before it came along, magic.

We’ve talked surprisingly little about the internals of UTF-8 in the past, but it’s worthy of note that this is our second piece ever to use the poop emoji, after our coverage of the billionth GitHub repository.

Emoji bales: Tony Hisgett, CC BY 2.0.

Building Up Unicode Characters One Bit At A Time

September 7, 2023 by Dan Maloney 33 Comments

The range of characters that can be represented by Unicode is truly bewildering. If there’s a symbol that was ever used to represent a sound or a concept anywhere in the world, chances are pretty good that you can find it somewhere in Unicode. But can many of us recall the proper keyboard calisthenics needed to call forth a particular character at will? Probably not, which is where this Unicode binary input terminal may offer some relief.

“Surely they can’t be suggesting that entering Unicode characters as a sequence of bytes using toggle switches is somehow easier than looking up the numpad shortcut?” we hear you cry. No, but we suspect that’s hardly [Stephen Holdaway]’s intention with this build. Rather, it seems geared specifically at making the process of keying in Unicode harder, but cooler; after all, it was originally his intention to enter this in last year’s Odd Inputs and Peculiar Peripherals contest. [Stephen] didn’t feel it was quite ready at the time, but now we’ve got a chance to give this project a once-over.

The idea is simple: a bank of eight toggle switches (with LEDs, of course) is used to compose the desired UTF-8 character, which is made up of one to four bytes. Each byte is added to a buffer with a separate “shift/clear” momentary toggle, and eventually sent out over USB with a flick of the “send” toggle. [Stephen] thoughtfully included a tiny LCD screen to keep track of the character being composed, so you know what you’re sending down the line. Behind the handsome brushed aluminum panel, a Pi Pico runs the show, drawing glyphs from an SD card containing 200 MB of True Type Font files.

At the end of the day, it’s tempting to look at this as an attractive but essentially useless project. We beg to differ, though — there’s a lot to learn about Unicode, and [Stephen] certainly knocked that off his bucket list with this build. There’s also something wonderfully tactile about this interface, and we’d imagine that composing each codepoint is pretty illustrative of how UTF-8 is organized. Sounds like an all-around win to us.

Hackaday

UTF-8

3 Articles

Nic Barker Explains ASCII, Unicode, And UTF-8

UTF-8 Is Beautiful

Building Up Unicode Characters One Bit At A Time

Search

Never miss a hack

If you missed it

New Artemis Plan Returns To Apollo Playbook

Back To Basics: Hacking On Key Matrixes

Accidental Climate Engineering With Disintegrating Satellites

Reflections On Ten Years With The Wrencher

The Curse Of The Everything Device

Our Columns

Choice, Control, And Interruption

Hackaday Podcast Episode 360: Cool Rubber Bands, Science-y Stuff, And The Whys Of Office Supplies

This Week In Security: Getting Back Up To Speed

Keebin’ With Kristina: The One With The Beginner’s Guide To Split Keyboards

SpyTech: The Underwater Wire Tap

Search

Never miss a hack

Subscribe

If you missed it

Our Columns