While it may not look like much, the image above is a piece of the original email where [Ken Thompson] described what would become the implementation of UTF-8. At the dawn of the computer age in America, when we were still using teletype machines, encoding the English language was all we worried about. Programmers standardized on the ASCII character set, but there was no room for all of the characters used in other languages. To enable real-time worldwide communication, we needed something better. There were many proposals, but the one submitted by [Ken Thompson] and [Rob ‘Commander’ Pike] was the one accepted, quite possibly because of what a beautiful hack it is.
[Tom Scott] did an excellent job of describing UTF-8. Why he chose to explain it in the middle of a busy cafe is beyond us, but his enthusiasm was definitely up to the task. In the video (which is embedded after the break) he quickly shows the simplicity and genius of ASCII. He then explains the challenge of supporting so many character sets, and why UTF-8 made so much sense.
We considered making this a Retrotechtacular, but the consensus is that understanding how UTF-8 came about is useful for modern hackers and coders. If you’re interested in learning more, there are tons of links in this Reddit post, including a link to the original email.
I was expecting something amazingly clever, but this seems rather straightforward.
I think that’s why it’s amazingly clever (i.e., because it is quite straightforward). (c:
But it’s also quite straightforward to come up with the idea. The hardest part is coming up with a list of requirements. Once you have those, the implementation follows naturally, especially if you’re familiar with Huffman coding.
easy to say in hindsight my good man.
+1. (c:
He might have done it in a busy cafe to match the setting where UTF-8 first came to mind, according to a file linked from the Wikipedia page: “I very clearly remember Ken writing on the placemat and wished we had kept it!” (http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt)
It looks like a hotel lobby. Probably at a convention.
Looks like a textbook Huffman-style encoding: a variable-length code with header bits marking the length. Not the first time it has been used. http://en.wikipedia.org/wiki/Huffman_coding#Compression
Superb! Great explanation. Thanks :-)
Actually, I think Dijkstra’s algorithm (which is how a GPS figures out the shortest path) is the most elegant hack you can describe on the back of a napkin.
The algorithm and how it works: http://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
The story of how he invented it in 20 min while shopping with his fiancee: http://dl.acm.org/citation.cfm?doid=1787234.1787249
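For the curious, here is a back-of-the-napkin sketch of Dijkstra’s algorithm in Python (standard library only; the toy road graph and names are just made up for the example):

```python
import heapq

def dijkstra(graph, start):
    """Return shortest distances from start to every reachable node."""
    dist = {start: 0}
    heap = [(0, start)]                   # (distance so far, node)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float('inf')):
            continue                      # stale entry, already improved
        for neighbor, weight in graph.get(node, []):
            new_d = d + weight
            if new_d < dist.get(neighbor, float('inf')):
                dist[neighbor] = new_d
                heapq.heappush(heap, (new_d, neighbor))
    return dist

roads = {'A': [('B', 4), ('C', 1)], 'C': [('B', 2)], 'B': [('D', 5)]}
print(dijkstra(roads, 'A'))  # -> {'A': 0, 'B': 3, 'C': 1, 'D': 8}
```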
Poor guy. At least he gets to do something productive and make everyone’s life
a bit easier.
For men, shopping is a sport. i.e. get to the cashier in the shortest time possible.
For women, it is a long classic Quest collecting meaningless objects on the way
while trying to stretch it out to escape from reality.
You don’t read real books, I guess.
How many fedoras do you own?
How many women have you shopped with?
Well. I would say that subnetting is in the running for most elegant, as in it is so simple it takes a month to understand how it works, then it is easy forever.
Gorgeous explanation, thank you. Now if only programming languages could all agree on how to process unicode. Natural language processing in Python is infinitely more difficult with unicode, in spite of how elegant a solution it is for information transmission.
As John Pinner said at PyCon UK last week, that’s because Guido was too good to Americans: Python 1.x/2.x was basically built with ASCII users in mind, and tried to make their life very easy *in the common case*. Unicode support was tacked on later, thanks to the generous efforts of a few French guys.
Things should get better with Python 3, once all “six”/legacy hacks are shed from 3rd-party libraries. It might take a while.
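As a tiny illustration of the str/bytes split Python 3 settled on (the example strings here are just examples): text is a sequence of code points, and the UTF-8 bytes only appear when you explicitly encode.

```python
text = "naïve café"          # str: a sequence of Unicode code points
data = text.encode('utf-8')  # bytes: the UTF-8 encoding, explicit and lossless
assert len(text) == 10       # code points
assert len(data) == 12       # bytes: 'ï' and 'é' take two bytes each in UTF-8
assert data.decode('utf-8') == text
```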
And still, Microsoft managed to spoil the elegance of UTF-8 by strong-arming the committees into allowing a Byte Order Mark at the beginning of UTF-8 files, killing one of its most elegant properties (a plain ASCII file is byte-for-byte the same content in UTF-8).
BIG FAIL.
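A quick Python sketch of the property being mourned here, assuming nothing beyond the standard library: plain ASCII bytes already are valid UTF-8, and the optional BOM is exactly what breaks that.

```python
import codecs

# Pure ASCII bytes are already valid UTF-8, byte for byte.
ascii_text = b"hello world"
assert ascii_text.decode('utf-8') == ascii_text.decode('ascii')

# The optional UTF-8 BOM prepends EF BB BF, so the bytes are no longer
# plain ASCII, and naive tools see a stray U+FEFF at the start.
bom_text = codecs.BOM_UTF8 + ascii_text
print(bom_text[:3].hex())               # -> efbbbf
print(repr(bom_text.decode('utf-8')))   # -> '\ufeffhello world'
```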
Starting the alphabet at 65 in the ASCII table was brilliant because if you look at it in binary it makes sense? That’s absurd! I’ve been programming since I was a kid, back before the days of DOS, and I always had to look those damn character tables up because I could never remember where the letters started. And knowing this trick in binary would have done me little good, because who the hell ever inputs or examines text in binary?
It isn’t exactly difficult to convert decimal to binary.
Difficult no. Tedious yes.
Take a decimal number, OR it with $30 (hex) and you get an ASCII digit. OR it with $40 and you get the alphabet, with the first letter after zero being “A” ($41). $60 gets you the lower-case alphabet.
Simple ORs and ANDs make conversion and alphabetizing simple.
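A minimal Python sketch of those OR/AND tricks, for anyone who wants to poke at them (values shown in hex to match the comment):

```python
# OR a small integer with 0x30 to get the ASCII digit character.
assert chr(5 | 0x30) == '5'          # 0x05 | 0x30 = 0x35 = '5'

# OR with 0x40 to get upper-case letters: 1 -> 'A', 2 -> 'B', ...
assert chr(1 | 0x40) == 'A'          # 0x41
assert chr(2 | 0x40) == 'B'          # 0x42

# OR with 0x60 to get the lower-case alphabet.
assert chr(1 | 0x60) == 'a'          # 0x61

# Case conversion is a single bit: set or clear 0x20.
assert chr(ord('A') | 0x20) == 'a'   # force lower case
assert chr(ord('a') & ~0x20) == 'A'  # force upper case
```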
ASCII was standardized in 1963. If you think what kind of hardware was available back then, I think you can appreciate why they decided to use every bit of bitwise magic they could.
It wasn’t meant to be simple for humans. Who examines text in binary? Computers! Also teleprinters and god knows what other half-mechanical abominations whirring away with enough torque and speed to take your finger off.
If you look at ASCII, the 16s and 32s bits high means numbers. The 64s bit means letters. 64+32 means lower-case letters. With the 64s bit high, A B C is 1 2 3.
How do you convert a digit from 0-9 into its ASCII rep? Add 48. Or OR it with 48. That’s pretty simple, now, isn’t it?
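A rough Python sketch of classifying a byte purely from those bits (it deliberately ignores the punctuation that shares the same ranges):

```python
def classify(byte):
    if byte & 0x40:                       # 64s bit set -> letters
        return 'lower' if byte & 0x20 else 'upper'
    if (byte & 0x30) == 0x30:             # 16s and 32s bits set -> digits
        return 'digit'
    return 'other'

assert classify(ord('A')) == 'upper'
assert classify(ord('z')) == 'lower'
assert classify(ord('7')) == 'digit'
assert ord('3') == 3 + 48                 # digit -> ASCII: just add (or OR) 48
```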
The whole thing’s pretty clever, dividing the symbols up into groups that are easy to tell apart just by checking one bit. That one bit powers an electromagnet that moves a lever or the carriage or some bit of heavy metal, which means an ASCII-powered typewriter is nice and simple to make. So it’s cheaper, so more people can have one. And it breaks down less often. And when that happens, it’s easier to fix.
I actually figured out this aspect of ASCII for myself just idly thinking one day, that’s when the genius of it struck me.
In the case of stuff like this, standards that go back a way, you have to bear in mind what they were invented for. Mechanical typewriters instead of screens and keyboards, mainframe computers with 4K of memory that cost ludicrous amounts, and reams of bank records on paper cards or magnetic tape, brought in in huge batches at the end of the day to keep everybody’s records balanced. It’s evidence of how well the designers did their jobs that we still use it all now. Them were the real men’s days!
Though myself I grew up with a collection of tiny plastic 8-bitters as various xmas presents, and I wouldn’t give them up either.
Why is there a need for “continuation bits”? That seems very wasteful to me. There are lots of ways to avoid 0x00. Just mark certain values as reserved so they don’t map to any character, i.e. 0xc300 is not mapped to anything. You waste only 1 out of 256 values instead of 1 out of 4 for the subsequent bytes.
How do you distinguish a 0x36 ASCII byte followed by a null end byte?
In addition to the problem Gdogg brought up, there’s also the requirement of being able to “hop in” mid-stream. If the first byte you get starts with 10, you immediately know you’re in the middle of a character and can start skipping forward/backward, depending on what you want. When you hit a byte that starts with a 0, you know it’s a 1-byte plain ASCII character. When it starts with 11, you know it’s a start byte.
Compare that with having a map of reserved values…
The reason for the continuation bits is to make resynchronization easier. Without the continuation bits if you’re plopped into the middle of a string you don’t know whether the byte you’re reading is a single ASCII character or the middle of a multibyte character. By adding the continuation bits resynchronizing is as easy as “move forward until you find a byte that doesn’t have 10 as the first two bits.”
Keeping continuation bytes distinct from starting bytes has nothing to do with avoiding NUL bytes. The UTF-8 encoding of U+0000 is a single NUL byte. The purpose is to make the encoding self-synchronizing (meaning it doesn’t need to be parsed from the beginning to be accurately decoded). No representation of a single character can contain the encoding of another character, and one must be able to determine whether a byte is the first byte of a character’s code-unit sequence or in the middle of it without backing up.
Yes, you could fit 4 times more of the BMP into 2-byte encodings instead of 3-byte ones, but most of it would still be encoded with 3 bytes, including all of unified CJK and the U+2xxx punctuation and symbol codepoints. Of course there’s also a bunch of non-BMP codepoints which are represented with 4 bytes in UTF-8 that could be encoded in 3 bytes with this modification, but that’s of even less concern given that virtually no codepoints outside the BMP are used on a regular basis.
And the price you’d pay for being able to represent 0.5% of the UCS codepoints in 2 bytes instead of 3 is that every single UTF-8 string you work with would have to be parsed from beginning to end, and corruption of a single byte could cause an indeterminable number of characters following the corrupted one to also be incorrectly parsed.
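To make the resynchronization rule concrete, here is a small Python sketch (the function name is just for illustration): continuation bytes always match 10xxxxxx, so from any offset you can skip forward to the next character boundary.

```python
def resync(buf, pos):
    """Advance pos to the start of the next UTF-8 code-unit sequence."""
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:  # 10xxxxxx?
        pos += 1                                          # still mid-character
    return pos

data = "héllo, 世界".encode('utf-8')
start = resync(data, 2)               # dropped into the middle of 'é'
print(data[start:].decode('utf-8'))   # everything after the boundary decodes cleanly
```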
The C language had been in wide use for, what, 20 years at the point when UTF-8 was designed? And suddenly you decide that 0x00 can appear anywhere in a string? I can see how that would work out…
Really cool video. The cameraman-in-a-rocking-chair effect did not help though, so distracting.
He really should have gone to the bathroom before the interview started…