When you’re restricted to ASCII, how can you represent more complex things like emojis or non-Latin characters? One answer is Punycode, which is a way to represent Unicode characters in ASCII. You could technically encode the raw bits of Unicode into ASCII characters with something like Base64, but there’s a snag: Base64 gives upper- and lowercase letters different meanings. The Domain Name System (DNS) treats hostnames as case-insensitive, so whether you type in HACKADAY.com, HackADay.com, or just hackaday.com, it all goes to the same place.
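To see why case matters, here is a minimal Python sketch using only the standard library. It Base64-encodes a name and then lowercases the result, which is what a case-insensitive system is free to do. The variable names are ours, purely for illustration.

```python
import base64

# UTF-8 bytes of the name we want to squeeze into ASCII.
original = "München".encode("utf-8")

# Base64 uses both cases: 'T' and 't' stand for different 6-bit values.
encoded = base64.b64encode(original).decode("ascii")

# A case-insensitive system may hand the name back in lowercase.
lowered = encoded.lower()

# Decoding the lowercased text no longer yields the original bytes.
damaged = base64.b64decode(lowered)

print(encoded, "->", lowered)
print("survived lowercasing:", damaged == original)
```

Any encoding that has to survive DNS needs an alphabet where case carries no information, which is the constraint Punycode is built around.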
[A. Costello] at the University of California, Berkeley proposed the idea of Punycode in RFC 3492 in March 2003. It outlines a simple algorithm where all regular ASCII characters are pulled out and placed at the front of the string, followed by a separator, in this case a hyphen. Then the non-ASCII Unicode characters are encoded and appended after the separator.
First, each character’s numeric code point and its position in the string are combined into a single number (roughly, the code point is multiplied by the string length and the position is added). Then the number is encoded as a Base-36 (a-z and 0-9) variable-length integer. For example, a greeting and the Greek for thanks, “Hey, ευχαριστώ” becomes “Hey, -mxahn5algcq2”. Similarly, the beautiful city of München, lowercased as hostnames are, becomes mnchen-3ya.
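For the curious, the encoding step can be sketched in a few dozen lines of Python. This follows our reading of the encoder pseudocode and constants in RFC 3492, leaves out the overflow checks the RFC asks for, and uses names of our own choosing, so treat it as an illustration rather than a reference implementation.

```python
# Parameters as given in RFC 3492 for Punycode.
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"

def adapt(delta, numpoints, first):
    """Bias adaptation: tunes digit thresholds to the size of recent deltas."""
    delta = delta // DAMP if first else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)

def encode(text):
    basic = [c for c in text if ord(c) < 128]
    out = basic[:]
    if basic:
        out.append("-")  # separator between the ASCII part and the deltas
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    h = b = len(basic)   # h counts the code points handled so far
    while h < len(text):
        # Jump to the next-smallest code point that still needs encoding.
        m = min(ord(c) for c in text if ord(c) >= n)
        delta += (m - n) * (h + 1)
        n = m
        for c in text:
            cp = ord(c)
            if cp < n:
                delta += 1  # skip a position that is already filled
            elif cp == n:
                # Emit delta as a variable-length base-36 integer.
                q, k = delta, BASE
                while True:
                    t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
                    if q < t:
                        break
                    out.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                out.append(DIGITS[q])
                bias = adapt(delta, h + 1, h == b)
                delta = 0
                h += 1
        delta += 1
        n += 1
    return "".join(out)

print(encode("münchen"))  # mnchen-3ya
```

Each delta carries both the jump in code point and the insertion position, which is why a single short number like `3ya` is enough to put the ü back where it belongs.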