Over on YouTube [Nic Barker] gives us: UTF-8, Explained Simply.
If you’re gonna be a hacker, eventually you’re gonna have to write software to process and generate text data. And when you deal with text data in this day and age, there are really only two main things you need to know: 7-bit ASCII and UTF-8. In this video [Nic] explains 7-bit ASCII and Unicode, and then explains UTF-8 and how it relates to Unicode and ASCII. [Nic] goes into detail about some of the clever features of Unicode and UTF-8, such as self-synchronization, single-byte ASCII, multi-byte codepoints, leading bytes, continuation bytes, and grapheme clusters.
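The multi-byte scheme is easy to see for yourself. Here’s a quick Python sketch (ours, not from the video) using the built-in codecs:

```python
# UTF-8's byte structure: ASCII stays a single byte, higher codepoints
# get a leading byte whose high bits give the sequence length, followed
# by continuation bytes that all start with the bits 10 -- which is
# what makes the stream self-synchronizing.
for ch in ("A", "é", "€", "𝄞"):
    data = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in data)
    print(f"U+{ord(ch):04X} {ch!r}: {len(data)} byte(s) -> {bits}")
# U+0041 'A': 1 byte(s) -> 01000001
# U+00E9 'é': 2 byte(s) -> 11000011 10101001
# U+20AC '€': 3 byte(s) -> 11100010 10000010 10101100
# U+1D11E '𝄞': 4 byte(s) -> 11110000 10011101 10000100 10011110
```

Note how every continuation byte starts with `10`, so a decoder dropped into the middle of a stream can always find the start of the next character.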
[Nic] mentions UTF-16, but UTF-16 turned out to be a really bad idea. UTF-16 combines all of the disadvantages of UTF-8 with all of the disadvantages of UTF-32. In UTF-16 there are things known as “surrogate pairs”, which means a single Unicode codepoint might require two 16-bit UTF-16 code units to describe it. Also the Byte Order Mark (BOM) introduced with UTF-16 proved to be problematic: if you cat files together you can end up with stray BOMs randomly embedded in your new file. They say that null was a billion dollar mistake; well, UTF-16 was the other billion dollar mistake.
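If you’re curious what a surrogate pair actually looks like, here’s a quick sketch (our example, not from the video) using an emoji above U+FFFF:

```python
# 😀 is U+1F600, which doesn't fit in 16 bits, so UTF-16 splits it
# into a high surrogate (D800-DBFF) and a low surrogate (DC00-DFFF).
units = "😀".encode("utf-16-be")
pair = [int.from_bytes(units[i:i + 2], "big") for i in (0, 2)]
print([f"{u:04X}" for u in pair])  # ['D83D', 'DE00']

# The pair carries the codepoint with an offset of 0x10000: ten bits
# in each surrogate, recombined like so.
cp = 0x10000 + ((pair[0] - 0xD800) << 10) + (pair[1] - 0xDC00)
assert cp == 0x1F600
```

So any code that assumes one UTF-16 unit equals one character quietly breaks the moment an emoji shows up.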
tl;dr: don’t use UTF-16, but do use 7-bit ASCII and UTF-8.
Oh, and as we’re here, and talking about Unicode, did you know that you can support The Unicode Consortium with Unicode Adopt-a-Character? You send money to sponsor a character and they put your name up in lights! Win, win! (We noticed while doing the research for this post that Jeroen Frijters of IKVM fame has sponsored #, a nod to C#.)
If you’re interested in learning more about Unicode check out Understanding And Using Unicode and Building Up Unicode Characters One Bit At A Time.

Thankyouthankyouthankyou. I’ve always seatofthepants’ed it with mostly Asian languages and this is precisely something that should fit well in my head.
pleased to hear it! thanks!
On a related note, I’m reading “The Chinese Computer: A Global History of the Information Age” (2024) by Thomas S. Mullaney, who also wrote “The Chinese Typewriter: A History”. In the early days of PCs with 7-bit ASCII, getting Chinese characters was a huge problem to overcome; much of the work was also done in Japan, of course, but the PC industry in China was very much shaped by the challenge. I should probably watch the video before commenting here, but the subject is fascinating and Mullaney is the only English writer I know who has fully explored the history.
Rendering Chinese has come a long way since Telegraph Code. Thanks for the book tips.
utf is ugly and not comfortable, please make other format
I agree that UTF-8 is ugly. Particularly the vestigial ASCII control characters remain latent in it, which is kind of gross. But it is eminently practical and it has a lot of clever features. And at this point we’re stuck with it.
I skimmed the title too fast, seeing “Barker Code”, hoping to learn more. Alas.
Barker code is a fascinating coding scheme relevant to communications, radar, sonar & ultrasound imaging, lidar, and others. https://en.wikipedia.org/wiki/Barker_code
thanks for this link!
And all we need is simple ASCII. And then someone had to complicate it…
Well we had that top bit spare and UTF-8 has put it to tremendous use!
It depends on the definition of “us”. Most of the world population might beg to differ.
Does anyone know a good approach/utility to “convert” UTF-8 to ASCII, preferably in XSLT 2.0? I tried iconv with the //TRANSLIT option before transformation but the business is not really happy with the result.
Well in UTF-8 everything which can be ASCII already is ASCII! As for converting the rest that’s not going to be practical. How do you convert 日本語 to ASCII?
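For what it’s worth, a common (and unavoidably lossy) approach outside XSLT is NFKD decomposition followed by dropping whatever still isn’t ASCII. A Python sketch, which also illustrates the point above — accented Latin folds nicely, but CJK simply vanishes:

```python
import unicodedata

def ascii_fold(s: str) -> str:
    # NFKD splits é into e plus a combining accent; encoding with
    # errors="ignore" then silently drops every remaining non-ASCII
    # codepoint. Lossy by design.
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

print(ascii_fold("café déjà vu"))  # cafe deja vu
print(ascii_fold("日本語 test"))   # ' test' -- the CJK is simply gone
```

This is roughly what iconv’s //TRANSLIT attempts, minus the substitution tables, which is why the results disappoint for non-Latin scripts.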
“how do you convert…”: You don’t. Just use 7-bit ASCII…. Keep it simple. Use the simple codes as the universal language for communication across the globe. I know, I know, the barn door is already open, that horse is gone, and we’re stuck with the ‘new’ way…. Would have been nice though, and unifying….
I don’t get the part about SMTP and 16-bit encodings. It sounds like ASCII-only e-mails were the reason why 16-bit encodings didn’t get traction and we ended up with UTF-8. Millions of people were using their non-ASCII writing systems with 7-bit transport just fine. Some could use quoted-printable if the non-ASCII characters were a minority. Some could use encodings made specifically for 7-bit transport (e.g. ISO-2022-JP). And the content could always be base64-encoded.
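Those 7-bit transports are all still a stdlib one-liner away; a quick Python sketch (assuming UTF-8 source text):

```python
import base64
import quopri

# Quoted-printable keeps mostly-ASCII text readable over a 7-bit channel.
text = "naïve café".encode("utf-8")
print(quopri.encodestring(text).decode("ascii"))  # na=C3=AFve caf=C3=A9

# ISO-2022-JP escapes in and out of JIS so Japanese fits in 7 bits.
jp = "日本語".encode("iso2022_jp")
assert all(b < 0x80 for b in jp)

# And base64 makes arbitrary bytes 7-bit-safe, at the cost of readability.
print(base64.b64encode(text).decode("ascii"))
```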