Understanding And Using Unicode

Computer engineer [Marco Cilloni] realized that a lot of developers today still have trouble dealing with Unicode in their programs, especially in the C/C++ world. He wrote an excellent guide called “Unicode is harder than you think” that summarizes many of the issues surrounding Unicode and its encodings. He first presents a brief history of Unicode and how it came about, so you can understand the reasons for the frustrating edge cases you’re bound to encounter.

There have been a variety of Unicode encoding methods over the years, but modern programs dealing with strings will probably be using UTF-8 encoding — and you should too. This multibyte encoding scheme has the convenient property of not changing the original character values when dealing with 7-bit ASCII text. We were surprised to read that there is actually an EBCDIC version of UTF still officially on the books today:

UTF-EBCDIC, a variable-width encoding that uses 1-byte characters designed for IBM’s EBCDIC systems (note: I think it’s safe to argue that using EBCDIC in 2023 edges very close to being a felony)
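
Coming back to UTF-8’s ASCII compatibility, here is a minimal C sketch (the strings, byte values, and the dump_bytes helper are just illustrative) that dumps the bytes of a pure-ASCII string and of one containing a non-ASCII character. The 7-bit characters keep their ASCII values, while ‘é’ becomes the two-byte sequence C3 A9:

    #include <stdio.h>
    #include <string.h>

    /* Print every byte of a NUL-terminated string in hex. */
    static void dump_bytes(const char *label, const char *s)
    {
        printf("%-6s:", label);
        for (size_t i = 0; i < strlen(s); i++)
            printf(" %02X", (unsigned char)s[i]);
        printf("\n");
    }

    int main(void)
    {
        /* Pure 7-bit ASCII: the UTF-8 bytes are identical to the ASCII codes. */
        dump_bytes("Hack", "Hack");              /* 48 61 63 6B */

        /* 'e' with acute accent (U+00E9), written as explicit UTF-8 bytes so
           the demo does not depend on the compiler's source/execution charset. */
        dump_bytes("cafe", "caf\xC3\xA9");       /* 63 61 66 C3 A9 */
        return 0;
    }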

[Marco] goes into detail about the different problems you run into when dealing with Unicode strings. When C was being developed, ASCII itself had just been finalized in the form we know today, so the language treats characters as single-byte numbers. With multi-byte, variable-width character strings, the usual functions like strlen fall apart, as the quick demo below shows.
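
Assuming the string is UTF-8 (the escaped bytes below are the standard UTF-8 encodings of ‘ü’ and ‘ß’), strlen counts code units, i.e. bytes, and even counting code points already takes a hand-rolled loop like the illustrative utf8_codepoints helper here, let alone counting user-perceived characters:

    #include <stdio.h>
    #include <string.h>

    /* Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
       Note this still counts code points, not user-perceived characters. */
    static size_t utf8_codepoints(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }

    int main(void)
    {
        const char *word = "gr\xC3\xBC\xC3\x9F";   /* "grüß" encoded in UTF-8 */
        printf("strlen      : %zu bytes\n", strlen(word));     /* 6 */
        printf("code points : %zu\n", utf8_codepoints(word));  /* 4 */
        return 0;
    }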

Unicode’s combining characters also cause problems when it comes to comparison and collation of text. Some characters can be built from a base character plus combining marks, yet also have a precomposed code point of their own. There are also ligatures that combine multiple characters into a single code point. Suddenly it isn’t so clear what character equality even means — Unicode defines two kinds of equivalence, canonical and compatibility.
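
As a concrete example (the byte values are the standard UTF-8 encodings of U+00E4 and of ‘a’ followed by U+0308 COMBINING DIAERESIS), the two spellings of ‘ä’ below are canonically equivalent, yet a naive byte comparison treats them as different strings; resolving that is exactly what normalization (NFC/NFD) is for:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Precomposed: U+00E4 LATIN SMALL LETTER A WITH DIAERESIS */
        const char *precomposed = "\xC3\xA4";
        /* Decomposed: 'a' (U+0061) followed by U+0308 COMBINING DIAERESIS */
        const char *decomposed  = "a\xCC\x88";

        /* Both render as "ä", but byte-wise they are different strings. */
        printf("strcmp  : %s\n",
               strcmp(precomposed, decomposed) ? "different" : "equal");
        printf("lengths : %zu vs %zu bytes\n",
               strlen(precomposed), strlen(decomposed));   /* 2 vs 3 */
        return 0;
    }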

These are but a sampling of the issues [Marco] discusses. The most important takeaway is that “Unicode handling is always best left to a library”. If your language or compiler of choice doesn’t have one, the Unicode Consortium provides a reference implementation called ICU (International Components for Unicode).
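
For a taste of what “leave it to a library” looks like in practice, here is a rough sketch using ICU4C’s normalization API (unorm2.h and ustring.h); the function names and signatures are as we recall them, so treat this as an outline and check the ICU documentation before relying on it. It converts both spellings of ‘ä’ to UTF-16, normalizes them to NFC, and only then compares:

    /* Build with something like: cc icu_demo.c $(pkg-config --cflags --libs icu-uc) */
    #include <stdio.h>
    #include <unicode/ustring.h>   /* u_strFromUTF8, u_strcmp */
    #include <unicode/unorm2.h>    /* unorm2_getNFCInstance, unorm2_normalize */

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);

        UChar a[16], b[16], na[16], nb[16];
        u_strFromUTF8(a, 16, NULL, "\xC3\xA4", -1, &status);   /* precomposed ä */
        u_strFromUTF8(b, 16, NULL, "a\xCC\x88", -1, &status);  /* a + U+0308    */
        unorm2_normalize(nfc, a, -1, na, 16, &status);
        unorm2_normalize(nfc, b, -1, nb, 16, &status);

        if (U_FAILURE(status)) {
            fprintf(stderr, "ICU error: %s\n", u_errorName(status));
            return 1;
        }
        printf("raw compare        : %s\n", u_strcmp(a, b)   ? "different" : "equal");
        printf("normalized compare : %s\n", u_strcmp(na, nb) ? "different" : "equal");
        return 0;
    }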

If this topic interests you, do check out his essay linked above. And if you want to get your hands dirty with Unicode glyphs, check out [Roman Czyborra]’s tools here, which are simple command-line tools that let you easily experiment using ASCII art. [Roman] founded the open-source GNU Unifont project back in the 1990s, now hosted at Unifoundry. Our own [Maya Posch] wrote a great article on the history of Unicode in 2021.

43 thoughts on “Understanding And Using Unicode”

  1. Ligatures are a thing of typography. What you mean are combining characters, which can be joined into a (pre)composed character. Like the diaeresis and the a, resulting in an ä.

    Glyphs are again a thing of typography. You mean codepoints, or graphemes.

    Unicode has a tricky terminology. Characters are not defined in Unicode, even though the standard sometimes speaks of abstract characters.

    You have code units (bytes/UInt8 in UTF-8 and UInt16 in UTF-16), then codepoints (which are just integer numbers, no matter how they are encoded), and then you have graphemes (which are somewhat similar to what a person would think of as a character).
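
    To put numbers on that three-level distinction, here is a small C11 sketch (the escape sequences are just one way to spell ‘e’ + U+0301 COMBINING ACUTE ACCENT): the same single grapheme is 3 UTF-8 code units, 2 UTF-16 code units, and 2 codepoints, and counting graphemes takes a Unicode library.

        #include <stdio.h>
        #include <string.h>
        #include <uchar.h>   /* char16_t, char32_t (C11) */

        int main(void)
        {
            /* One grapheme: 'e' followed by U+0301 COMBINING ACUTE ACCENT. */
            const char     utf8[]  = "e\xCC\x81";   /* UTF-8 bytes            */
            const char16_t utf16[] = u"e\u0301";    /* UTF-16 code units      */
            const char32_t utf32[] = U"e\u0301";    /* one unit per codepoint */

            printf("UTF-8 code units : %zu\n", strlen(utf8));                       /* 3 */
            printf("UTF-16 code units: %zu\n", sizeof utf16 / sizeof utf16[0] - 1); /* 2 */
            printf("codepoints       : %zu\n", sizeof utf32 / sizeof utf32[0] - 1); /* 2 */
            printf("graphemes        : 1 (needs a Unicode library to count)\n");
            return 0;
        }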

  2. Using “Unicode Confusables” (look it up, it’s fun!) in Python3 code is how I kept myself employed the last few years. Apparently “coding ninjas” from yet another bootcamp are not able to think outside the box enough to recognize all the tricks I’ve put in over the years.

    1. In the old days you could hack backspaces into your source with a hex editor. The listing on screen didn’t show the actual code. Was fun. A variable name with backspaces, then another variable name that read the same. Good editors would copy the name with the embedded control characters.

  3. One thing not to do with Unicode or any other special character encoding is to use that method’s encoding for any character in the Extended ASCII set. Every character* required for the languages that use the “English” letters is in Extended ASCII. I found that out when using Palm OS PDAs for reading ebooks. Since Palm OS doesn’t do Unicode, I had to do a find and replace on the text to replace any odd characters with their Extended ASCII versions before converting the book to TealDoc or Mobipocket format.

    *All except one little-used character in Norwegian is in Extended ASCII. Even left and right single and double quotes, upside-down punctuation for Spanish, French’s double angle brackets, and more.

    For some reason the worst offenders were simple punctuation characters. WHY would anyone, in an English book, use “HTML friendly” or UTF-8 codes for question marks, exclamation points, colons, etc.? It was a PITA to load up a book on my Handspring Visor and find ALL the punctuation replaced with ? or simply removed.

    So someone whipped up a little find-and-replace program to my specifications. (He wanted something to do while learning the software he was using.) It has a substitution file, which is just a plain text file: line 1 holds the text to search for, which will be replaced by whatever text is on line 2; lines 3 and 4, 5 and 6, etc. form further pairs. But it has a bug: there is a limit on the maximum number of find-and-substitution pairs, and if you exceed it, the program totally fouls up the text file it’s working on. I made a substitution file with all the UTF-8 HTML codes and their Extended ASCII equivalents and it was too much for the program, so I made a much shorter substitution file with only the ones I frequently encountered in ebook HTML files.

    With the bug fixed so it could handle a substitution file of any length, it would work as a one-time-pad-style encrypter/decrypter. Have a set of substitution files to be used in order, and the recipient has a set of substitution files with the find-and-replace line pairs inverted. Two or more substitution files could be run on the same text file (plain text, HTML, source code, any file that’s written in ordinary text) to apply multiple cipher levels, then reversed in the opposite order.

    I have the program and source code for Windows. It’s written in Microsoft C-something 2005.

    I’ll send them to anyone who wants to have a go at debugging it, if they’ll send me back a debugged version of the program. Could be ported to another programming language. I wouldn’t care as long as it runs in Windows and doesn’t have the bug.
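
    (Not the commenter’s program, obviously, but for anyone curious, here is a minimal sketch of the substitution-file idea described above in portable C: odd lines of the file are search strings, even lines are their replacements, and the pairs are kept in a growable array so there is no fixed limit to run into. File names and usage are hypothetical, and error checking is kept minimal.)

        /* Usage (hypothetical): ./subst pairs.txt < input.txt > output.txt */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        struct pair { char *from, *to; };

        /* Read one line (up to 1023 bytes), strip the newline, return a copy. */
        static char *read_line(FILE *f)
        {
            char buf[1024];
            if (!fgets(buf, sizeof buf, f))
                return NULL;
            buf[strcspn(buf, "\r\n")] = '\0';
            return strdup(buf);
        }

        /* Return a new string with every occurrence of `from` replaced by `to`. */
        static char *replace_all(const char *line, const char *from, const char *to)
        {
            size_t flen = strlen(from), tlen = strlen(to);
            size_t cap = strlen(line) + 1, len = 0;
            char *out = malloc(cap);
            const char *p = line, *hit;

            while (flen && (hit = strstr(p, from)) != NULL) {
                size_t keep = (size_t)(hit - p);
                if (len + keep + tlen + 1 > cap)
                    out = realloc(out, cap = (len + keep + tlen + 1) * 2);
                memcpy(out + len, p, keep);  len += keep;
                memcpy(out + len, to, tlen); len += tlen;
                p = hit + flen;
            }
            if (len + strlen(p) + 1 > cap)
                out = realloc(out, len + strlen(p) + 1);
            strcpy(out + len, p);
            return out;
        }

        int main(int argc, char **argv)
        {
            if (argc != 2) {
                fprintf(stderr, "usage: %s substitution-file < input > output\n", argv[0]);
                return 1;
            }
            FILE *sf = fopen(argv[1], "r");
            if (!sf) { perror(argv[1]); return 1; }

            /* Load find/replace pairs; the array grows, so any count works. */
            struct pair *pairs = NULL;
            size_t npairs = 0;
            char *from, *to;
            while ((from = read_line(sf)) && (to = read_line(sf))) {
                pairs = realloc(pairs, (npairs + 1) * sizeof *pairs);
                pairs[npairs].from = from;
                pairs[npairs].to = to;
                npairs++;
            }
            fclose(sf);

            /* Apply every pair, in order, to each line of the input. */
            char *line;
            while ((line = read_line(stdin)) != NULL) {
                for (size_t i = 0; i < npairs; i++) {
                    char *next = replace_all(line, pairs[i].from, pairs[i].to);
                    free(line);
                    line = next;
                }
                puts(line);
                free(line);
            }
            return 0;
        }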

  4. I would love to see an analysis of what fraction of software bugs are unicode-related.

    Anecdotally it seems to be a shocking number, and I could honestly argue that teaching ASCII to every human on Earth would’ve been easier than teaching every human language on Earth to computers. We’re simply more adaptable than they are, as evidenced by the bizarre variety in our languages.

  5. This is amusing because I was just dealing with this issue! Bottom line: if you use UTF-32 then you can use all your old string functions by merely changing the typename from ‘char’ to ‘char32_t’.
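
    For what it’s worth, in plain C the char-based string functions won’t accept char32_t pointers directly, so in practice that means writing (trivial) 32-bit counterparts. A minimal sketch, assuming C11’s <uchar.h> and U"" literals, with a hand-rolled strlen32 as the illustrative helper:

        #include <stdio.h>
        #include <uchar.h>   /* char32_t (C11) */

        /* strlen analogue for NUL-terminated char32_t strings: one element per
           codepoint, since UTF-32 is fixed width (still not graphemes, though). */
        static size_t strlen32(const char32_t *s)
        {
            size_t n = 0;
            while (s[n])
                n++;
            return n;
        }

        int main(void)
        {
            const char32_t *word = U"gr\u00FC\u00DF";            /* "grüß" */
            printf("length: %zu codepoints\n", strlen32(word));  /* 4 */
            return 0;
        }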

  6. I’ve been wondering if it would’ve been better to do a fixed-size 16-bit character and just give the finger to the thousands of Chinese symbols. It would be enough for most alphabets used today and plenty of symbols, and it would still be a good compromise on memory usage versus UTF-32.

    1. The 8th bit was used for parity, I believe.
      Some extended ASCII character sets like Code Page 437/850 (PC DOS) or PETSCII or SharpSCII use the 8th bit for more characters instead (AFAIK).

      Same goes for Baudot code, btw. The code itself is 5-bit, but start/stop bits must be taken into consideration, making it 7 bits.

      1. Yes, the 8th was commonly used for parity in teleprinter service. The Baudot code is more formally the ITA2 code. Start and stop bits as well as data bits are a way of adapting teleprinter signal timing to computer codes. Start and stop are not part of the code. Either code could be used in synchronous transmission where there are no start and stop bits. What the start and stop bits are actually trying to simulate is the timing of a teleprinter signal. Teleprinter signals use “mark” and “space”. These are signals of equal length. “Mark” refers to a high signal and “space” refers to a low signal. Their length is determined by the speed set on the teleprinter equipment. To begin a character transmission the line is brought low for one mark/space time. The next mark/space time the equipment interprets the state as the first level of the code. This continues until 5 or 8 levels are sent–depending upon the equipment. After the last level is sent the line is held high for at least 1.47 mark/space times. This gives the receiving equipment time to interpret and print the character. On a mechanical teleprinter this time is the time of the longest-duration operation on the teleprinter–a carriage return. It really isn’t a stop signal. It doesn’t signal anything. The teleprinter knows the character is done because it has counted 5 or 8 levels.

    1. I feel the same.
      Smileys originally used text characters to express feelings.
      Now it’s the other way round. Emoticons try to take the place of text characters.

      It’s like a downgrade. We may end up talking with symbols instead of real language.
      So we could end up with symbolic languages like those common in Japanese or Chinese.

      Which is exactly the reverse of how the Western world has evolved.
      We use a pretty simple alphabet to form words, and it successfully conveys sounds, too.

      Paradoxically, the sheer mass of different emoticons makes it harder and harder to make the appropriate selection.

      It’s like with Pokémons. We used to have 150 (151) characters. Now they’re in the thousands. Speaking of, where’s the Pikachu Emoji?!? 🤔

    2. Personally, I think it would have been better if emoticons had been encoded as HTML entities (&happy). Or alternatively, the way they’re encoded in forum software on the web (:happy:). That way, Unicode would have been kept tidy.

  7. Functions such as strlen do not fall apart. They still work (if you are dealing with null-terminated strings) and are still useful, although they do not measure the number of characters. However, I think that measuring the number of bytes is usually more useful anyway.

    One character set will not be suitable for everything. Unicode especially is pretty bad, but any other character set or encoding (including TRON code) that tries to be suitable for everything will not be, either.

    Also, using ICU cannot avoid some of the bugs and security problems; some problems are inherent to Unicode.

    So, I mostly just use ASCII (in some cases it is better to be restricted to ASCII, or even to a smaller subset of it), or sometimes other character sets such as the PC character set. If necessary, code page numbers can be specified. (My opinion is that ASCII is better than EBCDIC.)

    And, when a large character set is needed, I might use TRON code. (I wanted some free bitmap fonts for TRON code, but have been unable to find any. I could try to make my own, but could not do it entirely by myself. I could start by recoding fonts for JIS X 0212 and JIS X 0213, I suppose.) (Like with Unicode, there are multiple possible encodings, including HTML entities (“&T code”), and I have written a program to convert between some of these. Furthermore, the valid ranges of codepoints in TRON-32 and UTF-32 do not overlap at all, so an API that uses it (such as Glulx) can be designed to use TRON without causing interference and without needing new functions, if the implementation supports that.)

    So, my “Free Hero Mesh” software does not use and will not use Unicode. It uses code page numbers of 8-bit characters and can also use TRON code (although the latter implementation is incomplete, and font sizes other than 8×8 are not implemented yet even for 8-bit characters).

  8. Unicode is a mess.
    A codepage per country is the way to go, plus a file format allowing multiple codepages at the same time.
    And please, no stupid emoji and no typographic tricks…

  9. I always enjoy the sneering parochialism that hits the comments any time computing in anything other than the commenters’ native language(s) needs to be supported.

    1. Haha, there’s some truth within, I suppose. 😅
      There even was a European extension to US ASCII, I vaguely remember…

      I for one always did prefer CP437, though, despite being German, which isn’t natural per se.

      Normally, we used to use CP850 back then (roughly by the time DOS 5 was out), but the American CP437 (the basic ROM font found in x86 PCs) already covered our umlauts, which was good enough for me.

      CP437, on DOS, also was compatible with international ASCII art and classics such as original Norton Commander.
      And since it was in ROM, it didn’t need loading a display driver, thus saving a few KBs of conventional memory.

      By contrast, CP850, which we Western Europeans (the old cold war terminology still in use in the 90s) were supposed to use, had messed up the old symbol characters used to draw GUI (or TUI) elements.

      It wasn’t until Norton Commander 4 or 5 that the issue was fixed or worked around. However, not everyone liked the new software. I preferred the original, English Norton Commander. And reading international text files found on shareware CDs and mailboxes.
