Understanding And Using Unicode

Computer engineer [Marco Cilloni] realized that a lot of developers today still have trouble dealing with Unicode in their programs, especially in the C/C++ world. He wrote an excellent guide called “Unicode is harder than you think” that summarizes many of the issues surrounding Unicode and its encodings. He first presents a brief history of Unicode and how it came about, so you can understand the reasons for the frustrating edge cases you’re bound to encounter.

There have been a variety of Unicode encoding methods over the years, but modern programs dealing with strings will probably be using UTF-8 encoding — and you should too. This multibyte encoding scheme has the convenient property of not changing the original character values when dealing with 7-bit ASCII text. We were surprised to read that there is actually an EBCDIC version of UTF still officially on the books today:

UTF-EBCDIC, a variable-width encoding that uses 1-byte characters designed for IBM’s EBCDIC systems (note: I think it’s safe to argue that using EBCDIC in 2023 edges very close to being a felony)
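
Coming back to UTF-8’s ASCII compatibility, here is a minimal C sketch (the strings, byte values, and the dump_bytes helper are just illustrative) that dumps the bytes of a pure-ASCII string and of one containing a non-ASCII character. The 7-bit characters keep their ASCII values, while ‘é’ becomes the two-byte sequence C3 A9:

    #include <stdio.h>
    #include <string.h>

    /* Print every byte of a NUL-terminated string in hex. */
    static void dump_bytes(const char *label, const char *s)
    {
        printf("%-6s:", label);
        for (size_t i = 0; i < strlen(s); i++)
            printf(" %02X", (unsigned char)s[i]);
        printf("\n");
    }

    int main(void)
    {
        /* Pure 7-bit ASCII: the UTF-8 bytes are identical to the ASCII codes. */
        dump_bytes("Hack", "Hack");              /* 48 61 63 6B */

        /* 'e' with acute accent (U+00E9), written as explicit UTF-8 bytes so
           the demo does not depend on the compiler's source/execution charset. */
        dump_bytes("cafe", "caf\xC3\xA9");       /* 63 61 66 C3 A9 */
        return 0;
    }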

[Marco] goes into detail about the different problems you run into when dealing with Unicode strings. When C was being developed, ASCII itself had just been finalized in the form we know today, so the language treats characters as single-byte numbers. With multi-byte, variable-width character strings, the usual functions like strlen fall apart, as the quick demo below shows.
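
Assuming the string is UTF-8 (the escaped bytes below are the standard UTF-8 encodings of ‘ü’ and ‘ß’), strlen counts code units, i.e. bytes, and even counting code points already takes a hand-rolled loop like the illustrative utf8_codepoints helper here, let alone counting user-perceived characters:

    #include <stdio.h>
    #include <string.h>

    /* Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
       Note this still counts code points, not user-perceived characters. */
    static size_t utf8_codepoints(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }

    int main(void)
    {
        const char *word = "gr\xC3\xBC\xC3\x9F";   /* "grüß" encoded in UTF-8 */
        printf("strlen      : %zu bytes\n", strlen(word));     /* 6 */
        printf("code points : %zu\n", utf8_codepoints(word));  /* 4 */
        return 0;
    }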

Unicode’s combining characters also cause problems when it comes to comparison and collation of text. Some characters can be built from a base character plus combining marks, yet also have a precomposed code point of their own. There are also ligatures that combine multiple characters into a single code point. Suddenly it isn’t so clear what character equality even means — Unicode defines two kinds of equivalence, canonical and compatibility.
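
As a concrete example (the byte values are the standard UTF-8 encodings of U+00E4 and of ‘a’ followed by U+0308 COMBINING DIAERESIS), the two spellings of ‘ä’ below are canonically equivalent, yet a naive byte comparison treats them as different strings; resolving that is exactly what normalization (NFC/NFD) is for:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Precomposed: U+00E4 LATIN SMALL LETTER A WITH DIAERESIS */
        const char *precomposed = "\xC3\xA4";
        /* Decomposed: 'a' (U+0061) followed by U+0308 COMBINING DIAERESIS */
        const char *decomposed  = "a\xCC\x88";

        /* Both render as "ä", but byte-wise they are different strings. */
        printf("strcmp  : %s\n",
               strcmp(precomposed, decomposed) ? "different" : "equal");
        printf("lengths : %zu vs %zu bytes\n",
               strlen(precomposed), strlen(decomposed));   /* 2 vs 3 */
        return 0;
    }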

These are but a sampling of the issues [Marco] discusses. The most important takeaway is that “Unicode handling is always best left to a library”. If your language or compiler of choice doesn’t have one, the Unicode Consortium provides a reference implementation called ICU (International Components for Unicode).
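
For a taste of what “leave it to a library” looks like in practice, here is a rough sketch using ICU4C’s normalization API (unorm2.h and ustring.h); the function names and signatures are as we recall them, so treat this as an outline and check the ICU documentation before relying on it. It converts both spellings of ‘ä’ to UTF-16, normalizes them to NFC, and only then compares:

    /* Build with something like: cc icu_demo.c $(pkg-config --cflags --libs icu-uc) */
    #include <stdio.h>
    #include <unicode/ustring.h>   /* u_strFromUTF8, u_strcmp */
    #include <unicode/unorm2.h>    /* unorm2_getNFCInstance, unorm2_normalize */

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);

        UChar a[16], b[16], na[16], nb[16];
        u_strFromUTF8(a, 16, NULL, "\xC3\xA4", -1, &status);   /* precomposed ä */
        u_strFromUTF8(b, 16, NULL, "a\xCC\x88", -1, &status);  /* a + U+0308    */
        unorm2_normalize(nfc, a, -1, na, 16, &status);
        unorm2_normalize(nfc, b, -1, nb, 16, &status);

        if (U_FAILURE(status)) {
            fprintf(stderr, "ICU error: %s\n", u_errorName(status));
            return 1;
        }
        printf("raw compare        : %s\n", u_strcmp(a, b)   ? "different" : "equal");
        printf("normalized compare : %s\n", u_strcmp(na, nb) ? "different" : "equal");
        return 0;
    }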

If this topic interests you, do check out his essay linked above. And if you want to get your hands dirty with Unicode glyphs, check out [Roman Czyborra]’s tools here, which are simple command-line tools that let you easily experiment using ASCII art. [Roman] founded the open-source GNU Unifont project back in the 1990s, now hosted at Unifoundry. Our own [Maya Posch] wrote a great article on the history of Unicode in 2021.

43 thoughts on “Understanding And Using Unicode”

  1. Ligatures are a thing of typography. What you mean are combining characters, which can be joined into a (pre)composed character. Like the diaeresis and the a, resulting in an ä.

    Glyphs are again a thing of typography. You mean codepoints, or graphemes.

    Unicode has a tricky terminology. Characters are not defined in Unicode, even though the standard sometimes speaks of abstract characters.

    You have code units (bytes/UInt8 in UTF-8 and UInt16 in UTF-16), then codepoints (which are just integer numbers, no matter how they are encoded), and then you have graphemes (which are somewhat similar to what a person would think of as a character).
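
    To put numbers on that three-level distinction, here is a small C11 sketch (the escape sequences are just one way to spell ‘e’ + U+0301 COMBINING ACUTE ACCENT): the same single grapheme is 3 UTF-8 code units, 2 UTF-16 code units, and 2 codepoints, and counting graphemes takes a Unicode library.

        #include <stdio.h>
        #include <string.h>
        #include <uchar.h>   /* char16_t, char32_t (C11) */

        int main(void)
        {
            /* One grapheme: 'e' followed by U+0301 COMBINING ACUTE ACCENT. */
            const char     utf8[]  = "e\xCC\x81";   /* UTF-8 bytes            */
            const char16_t utf16[] = u"e\u0301";    /* UTF-16 code units      */
            const char32_t utf32[] = U"e\u0301";    /* one unit per codepoint */

            printf("UTF-8 code units : %zu\n", strlen(utf8));                       /* 3 */
            printf("UTF-16 code units: %zu\n", sizeof utf16 / sizeof utf16[0] - 1); /* 2 */
            printf("codepoints       : %zu\n", sizeof utf32 / sizeof utf32[0] - 1); /* 2 */
            printf("graphemes        : 1 (needs a Unicode library to count)\n");
            return 0;
        }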

  2. Using “Unicode Confusables” (look it up, it’s fun!) in Python3 code is how I kept myself employed the last few years. Apparently “coding ninjas” from yet another bootcamp are not able to think outside the box enough to recognize all the tricks I’ve put in over the years.

    1. In the old days you could hack backspaces into your source with a hex editor. The listing on screen didn’t show the actual code. Was fun. A variable name with backspaces, then another variable name that read the same. Good editors would copy the name with the embedded control characters.

  3. One thing not to do with Unicode or any other special character encoding is to use that method’s encoding for any character in the Extended ASCII set. Every character* required for the languages that use the “English” letters is in Extended ASCII. I found that out when using Palm OS PDAs for reading ebooks. Since Palm OS doesn’t do Unicode, I had to do a find and replace on the text to replace any odd characters with their Extended ASCII versions before converting the book to TealDoc or Mobipocket format.

    *All except one little-used character in Norwegian is in Extended ASCII. Even left and right single and double quotes, upside-down punctuation for Spanish, French’s double angle brackets, and more.

    For some reason the worst offenders were simple punctuation characters. WHY would anyone, in an English book, use “HTML friendly” or UTF-8 codes for question marks, exclamation points, colons, etc.? It was a PITA to load up a book on my Handspring Visor and find ALL the punctuation replaced with ? or simply removed.

    So someone whipped up a little find-and-replace program to my specifications. (He wanted something to do while learning the software he was using.) It has a substitution file, which is just a plain text file: line 1 holds the text to search for, which will be replaced by whatever text is on line 2; lines 3 and 4, 5 and 6, etc. form further pairs. But it has a bug: there is a limit on the maximum number of find-and-substitution pairs, and if you exceed it, the program totally fouls up the text file it’s working on. I made a substitution file with all the UTF-8 HTML codes and their Extended ASCII equivalents and it was too much for the program, so I made a much shorter substitution file with only the ones I frequently encountered in ebook HTML files.

    With the bug fixed so it could handle a substitution file of any length, it would work as a one-time-pad-style encrypter/decrypter. Have a set of substitution files to be used in order, and the recipient has a set of substitution files with the find-and-replace line pairs inverted. Two or more substitution files could be run on the same text file (plain text, HTML, source code, any file that’s written in ordinary text) to apply multiple cipher levels, then reversed in the opposite order.

    I have the program and source code for Windows. It’s written in Microsoft C-something 2005.

    I’ll send them to anyone who wants to have a go at debugging it, if they’ll send me back a debugged version of the program. Could be ported to another programming language. I wouldn’t care as long as it runs in Windows and doesn’t have the bug.
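
    (Not the commenter’s program, obviously, but for anyone curious, here is a minimal sketch of the substitution-file idea described above in portable C: odd lines of the file are search strings, even lines are their replacements, and the pairs are kept in a growable array so there is no fixed limit to run into. File names and usage are hypothetical, and error checking is kept minimal.)

        /* Usage (hypothetical): ./subst pairs.txt < input.txt > output.txt */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        struct pair { char *from, *to; };

        /* Read one line (up to 1023 bytes), strip the newline, return a copy. */
        static char *read_line(FILE *f)
        {
            char buf[1024];
            if (!fgets(buf, sizeof buf, f))
                return NULL;
            buf[strcspn(buf, "\r\n")] = '\0';
            return strdup(buf);
        }

        /* Return a new string with every occurrence of `from` replaced by `to`. */
        static char *replace_all(const char *line, const char *from, const char *to)
        {
            size_t flen = strlen(from), tlen = strlen(to);
            size_t cap = strlen(line) + 1, len = 0;
            char *out = malloc(cap);
            const char *p = line, *hit;

            while (flen && (hit = strstr(p, from)) != NULL) {
                size_t keep = (size_t)(hit - p);
                if (len + keep + tlen + 1 > cap)
                    out = realloc(out, cap = (len + keep + tlen + 1) * 2);
                memcpy(out + len, p, keep);  len += keep;
                memcpy(out + len, to, tlen); len += tlen;
                p = hit + flen;
            }
            if (len + strlen(p) + 1 > cap)
                out = realloc(out, len + strlen(p) + 1);
            strcpy(out + len, p);
            return out;
        }

        int main(int argc, char **argv)
        {
            if (argc != 2) {
                fprintf(stderr, "usage: %s substitution-file < input > output\n", argv[0]);
                return 1;
            }
            FILE *sf = fopen(argv[1], "r");
            if (!sf) { perror(argv[1]); return 1; }

            /* Load find/replace pairs; the array grows, so any count works. */
            struct pair *pairs = NULL;
            size_t npairs = 0;
            char *from, *to;
            while ((from = read_line(sf)) && (to = read_line(sf))) {
                pairs = realloc(pairs, (npairs + 1) * sizeof *pairs);
                pairs[npairs].from = from;
                pairs[npairs].to = to;
                npairs++;
            }
            fclose(sf);

            /* Apply every pair, in order, to each line of the input. */
            char *line;
            while ((line = read_line(stdin)) != NULL) {
                for (size_t i = 0; i < npairs; i++) {
                    char *next = replace_all(line, pairs[i].from, pairs[i].to);
                    free(line);
                    line = next;
                }
                puts(line);
                free(line);
            }
            return 0;
        }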

  4. I would love to see an analysis of what fraction of software bugs are unicode-related.

    Anecdotally it seems to be a shocking number, and I could honestly argue that teaching ASCII to every human on Earth would’ve been easier than teaching every human language on Earth to computers. We’re simply more adaptable than they are, as evidenced by the bizarre variety in our languages.

  5. This is amusing because I was just dealing with this issue! Bottom line: if you use UTF-32 then you can use all your old string functions by merely changing the typename from ‘char’ to ‘char32_t’.
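
    For what it’s worth, in plain C the char-based string functions won’t accept char32_t pointers directly, so in practice that means writing (trivial) 32-bit counterparts. A minimal sketch, assuming C11’s <uchar.h> and U"" literals, with a hand-rolled strlen32 as the illustrative helper:

        #include <stdio.h>
        #include <uchar.h>   /* char32_t (C11) */

        /* strlen analogue for NUL-terminated char32_t strings: one element per
           codepoint, since UTF-32 is fixed width (still not graphemes, though). */
        static size_t strlen32(const char32_t *s)
        {
            size_t n = 0;
            while (s[n])
                n++;
            return n;
        }

        int main(void)
        {
            const char32_t *word = U"gr\u00FC\u00DF";            /* "grüß" */
            printf("length: %zu codepoints\n", strlen32(word));  /* 4 */
            return 0;
        }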

  6. I’ve been wondering if it would’ve been better to do a fixed-size 16-bit character and just give the finger to the thousands of Chinese symbols. It would be enough for most alphabets used today and plenty of symbols, and it would still be a good compromise on memory usage versus UTF-32.

    1. The 8th bit was used for parity, I believe.
      Some extended ASCII character sets like Code Page 437/850 (PC DOS) or PETSCII or SharpSCII use the 8th bit for more characters instead (AFAIK).

      Same goes for Baudot code, btw. The code itself is 5-bit, but start/stop bits must be taken into consideration, making it 7 bits.

      1. Yes, the 8th was commonly used for parity in teleprinter service. The Baudot code is more formally the ITA2 code. Start and stop bits as well as data bits are a way of adapting teleprinter signal timing to computer codes. Start and stop are not part of the code. Either code could be used in synchronous transmission where there are no start and stop bits. What the start and stop bits are actually trying to simulate is the timing of a teleprinter signal. Teleprinter signals use “mark” and “space”. These are signals of equal length. “Mark” refers to a high signal and “space” refers to a low signal. Their length is determined by the speed set on the teleprinter equipment. To begin a character transmission the line is brought low for one mark/space time. The next mark/space time the equipment interprets the state as the first level of the code. This continues until 5 or 8 levels are sent–depending upon the equipment. After the last level is sent the line is held high for at least 1.47 mark/space times. This gives the receiving equipment time to interpret and print the character. On a mechanical teleprinter this time is the time of the longest-duration operation on the teleprinter–a carriage return. It really isn’t a stop signal. It doesn’t signal anything. The teleprinter knows the character is done because it has counted 5 or 8 levels.

    1. I feel the same.
      Smileys originally used text characters to express feelings.
      Now it’s the other way round. Emoticons try to take the place of text characters.

      It’s like a downgrade. We may end up talking with symbols instead of real language.
      So we could end up with symbolic languages like those common in Japanese or Chinese.

      Which is exactly the reverse of how the Western world has evolved.
      We use a pretty simple alphabet to form words, and it successfully conveys sounds, too.

      Paradoxically, the sheer mass of different emoticons makes it harder and harder to make the appropriate selection.

      It’s like with Pokémons. We used to have 150 (151) characters. Now they’re in the thousands. Speaking of, where’s the Pikachu Emoji?!? 🤔

    2. Personally, I think it would have been better if emoticons had been encoded as HTML entities (&happy). Or alternatively, the way they’re encoded in forum software on the web (:happy:). That way, Unicode would have been kept tidy.

  7. Functions such as strlen do not fall apart. They still work (if you are dealing with null-terminated strings) and are still useful, although they do not measure the number of characters. However, I think that measuring the number of bytes is usually more useful anyway.

    One character set will not be suitable for everything. Unicode especially is pretty bad, but any other character set or encoding (including TRON code) that tries to be suitable for everything will not be, either.

    Also, using ICU cannot avoid some of the bugs and security problems; some problems are inherent to Unicode.

    So, I mostly just use ASCII (in some cases it is better to be restricted to ASCII, or even to a smaller subset of it), or sometimes other character sets such as the PC character set. If necessary, code page numbers can be specified. (My opinion is that ASCII is better than EBCDIC.)

    And, when a large character set is needed, I might use TRON code. (I wanted some free bitmap fonts for TRON code, but have been unable to find any. I could try to make my own, but could not do it entirely by myself. I could start by recoding fonts for JIS X 0212 and JIS X 0213, I suppose.) (Like with Unicode, there are multiple possible encodings, including HTML entities (“&T code”), and I have written a program to convert between some of these. Furthermore, the valid ranges of codepoints in TRON-32 and UTF-32 do not overlap at all, so an API that uses it (such as Glulx) can be designed to use TRON without causing interference and without needing new functions, if the implementation supports that.)

    So, my “Free Hero Mesh” software does not use and will not use Unicode. It uses code page numbers of 8-bit characters and can also use TRON code (although the latter implementation is incomplete, and font sizes other than 8×8 are not implemented yet even for 8-bit characters).

  8. Unicode is a mess.
    A codepage per country is the way to go, plus a file format allowing multiple codepages at the same time.
    And please, no stupid emoji and no typographic tricks…

  9. I always enjoy the sneering parochialism that hits the comments any time computing in anything other than the commenters’ native language(s) needs to be supported.

    1. Haha, there’s some truth within, I suppose. 😅
      There even was a European extension to US ASCII, I vaguely remember…

      I for one always did prefer CP437, though, despite being German, which isn’t natural per se.

      Normally, we used to use CP850 back then (roughly by the time DOS 5 was out), but the American CP437 (the basic ROM font found in x86 PCs) already covered our umlauts, which was good enough for me.

      CP437, on DOS, also was compatible with international ASCII art and classics such as original Norton Commander.
      And since it was in ROM, it didn’t need loading a display driver, thus saving a few KBs of conventional memory.

      By contrast, CP850, which we Western Europeans (the old cold war terminology still in use in the 90s) were supposed to use, had messed up the old symbol characters used to draw GUI (or TUI) elements.

      It wasn’t until Norton Commander 4 or 5 that the issue was fixed or worked around. However, not everyone liked the new software. I preferred the original, English Norton Commander. And reading international text files found on shareware CDs and mailboxes.
