Computer engineer [Marco Cilloni] realized a lot of developers today still have trouble dealing with Unicode in their programs, especially in the C/C++ world. He wrote an excellent guide that summarizes many of the issues surrounding Unicode and its encoding called “Unicode is harder than you think“. He first presents a brief history of Unicode and how it came about, so you can understand the reasons for the frustrating edge cases you’re bound to encounter.
There have been a variety of Unicode encoding methods over the years, but modern programs dealing with strings will probably be using UTF-8 encoding — and you should too. This multibyte encoding scheme has the convenient property of not changing the original character values when dealing with 7-bit ASCII text. We were surprised to read that there is actually an EBCDIC version of UTF still officially on the books today:
UTF-EBCDIC, a variable-width encoding that uses 1-byte characters designed for IBM’s EBCDIC systems (note: I think it’s safe to argue that using EBCDIC in 2023 edges very close to being a felony)