Nic Barker Explains ASCII, Unicode, And UTF-8

Over on YouTube [Nic Barker] gives us: UTF-8, Explained Simply.

If you’re gonna be a hacker, eventually you’re gonna have to write software to process and generate text data. And when you deal with text data, in this day and age, there are really only two main things you need to know: 7-bit ASCII and UTF-8. In this video [Nic] explains 7-bit ASCII and Unicode, and then explains UTF-8 and how it relates to both. [Nic] goes into detail about some of the clever features of Unicode and UTF-8, such as self-synchronization, single-byte ASCII, multi-byte codepoints, leading bytes, continuation bytes, and grapheme clusters.
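To make the leading-byte and continuation-byte idea concrete, here is a minimal Python sketch. It is not taken from the video: the helper names (`encode_utf8`, `is_continuation`, `resync`) are our own, and in real code you would just call `str.encode("utf-8")`. It hand-encodes one codepoint and shows how self-synchronization lets you find the start of a character from any byte offset.

```python
def encode_utf8(cp: int) -> bytes:
    """Hand-rolled UTF-8 encoder for one codepoint (illustration only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        # 0xxxxxxx: plain 7-bit ASCII is a single byte
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def is_continuation(b: int) -> bool:
    """Continuation bytes always look like 10xxxxxx."""
    return (b & 0xC0) == 0x80


def resync(data: bytes, i: int) -> int:
    """Back up from offset i to the first byte of the codepoint containing it."""
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i


if __name__ == "__main__":
    data = "a€b".encode("utf-8")          # 'a', three bytes for '€', 'b'
    print(encode_utf8(0x20AC).hex(" "))   # the three bytes of '€'
    print(resync(data, 3))                # offset 3 is inside '€', which starts at 1
```

Because a continuation byte can never be mistaken for a leading byte or an ASCII byte, `resync` needs to look back at most three bytes in valid UTF-8.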

[Nic] mentions UTF-16, but UTF-16 turned out to be a really bad idea. UTF-16 combines all of the disadvantages of UTF-8 with all of the disadvantages of UTF-32. In UTF-16 there are things known as “surrogate pairs”, which means a single Unicode codepoint might require two 16-bit UTF-16 code units (“characters”) to describe it. Also, the Byte Order Marks (BOMs) introduced with UTF-16 proved to be problematic, particularly if you cat files together: you can end up with stray BOMs randomly embedded in your new file. They say that null was a billion dollar mistake; well, UTF-16 was the other billion dollar mistake.
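Both complaints are easy to reproduce. The Python sketch below is our own illustration, not something from the video, and `to_surrogates` is a hypothetical helper name. It splits a codepoint above U+FFFF into its surrogate pair, then concatenates two UTF-16 byte strings the way cat would and shows the second BOM surviving as a stray U+FEFF in the decoded text.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a codepoint above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint does not need a surrogate pair")
    v = cp - 0x10000                 # 20 bits remain
    high = 0xD800 | (v >> 10)        # top 10 bits
    low = 0xDC00 | (v & 0x3FF)       # bottom 10 bits
    return high, low


# Python's "utf-16" codec prepends a BOM to everything it encodes.
part1 = "hello ".encode("utf-16")
part2 = "world".encode("utf-16")

# Byte-wise concatenation, like `cat part1 part2`.
joined = part1 + part2

# Only the first BOM is consumed as a byte-order hint; the second one
# is decoded as an ordinary U+FEFF character in the middle of the text.
joined_text = joined.decode("utf-16")

if __name__ == "__main__":
    high, low = to_surrogates(0x1F600)
    print(hex(high), hex(low))
    print(repr(joined_text))
```

The same concatenation with BOM-less UTF-8 just works, which is a large part of why the tl;dr below reads the way it does.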

tl;dr: don’t use UTF-16, but do use 7-bit ASCII and UTF-8.

Oh, and as we’re here, and talking about Unicode, did you know that you can support The Unicode Consortium with Unicode Adopt-a-Character? You send money to sponsor a character and they put your name up in lights! Win, win! (We noticed while doing the research for this post that Jeroen Frijters of IKVM fame has sponsored #, a nod to C#.)

If you’re interested in learning more about Unicode, check out Understanding And Using Unicode and Building Up Unicode Characters One Bit At A Time.

2 thoughts on “Nic Barker Explains ASCII, Unicode, And UTF-8”

  1. On a related note, I’m reading “The Chinese Computer: A Global History of the Information Age” (2024) by Thomas S. Mullaney, who also wrote “The Chinese Typewriter: A History”. In the early days of PCs with 7-bit ASCII, getting Chinese characters was a huge problem to overcome. Much of the work was also done in Japan, of course, but the PC industry in China was very much shaped by the challenge. I should probably watch the video before commenting here, but the subject is fascinating and Mullaney is the only English writer I know who has fully explored the history.
