Most readers will have at least some passing familiarity with the terms ‘Unicode’ and ‘UTF-8’, but what is really behind them? At their core they refer to character encoding schemes, also known as character sets. This is a concept which dates back to far beyond the era of electronic computers, to the dawn of the optical telegraph and its predecessors. As far back as the 18th century there was a need to transmit information rapidly across large distances, which was accomplished using so-called telegraph codes. These encoded information using optical, electrical and other means.
During the hundreds of years since the invention of the first telegraph code, there was no real effort to establish international standardization of such encoding schemes, with even the first decades of the era of teleprinters and home computers bringing little change there. Even as EBCDIC (IBM’s 8-bit character encoding demonstrated in the punch card above) and finally ASCII made some headway, the need to encode a growing collection of different characters without having to spend ridiculous amounts of storage on this was held back by a lack of elegant solutions.
Development of Unicode began during the late 1980s, when the increasing exchange of digital information across the world made the need for a single, universal encoding system more urgent than before. These days Unicode allows us to use one encoding scheme not only for everything from basic English text to Traditional Chinese, Vietnamese, and even Mayan, but also for small pictographs called ‘emoji’, from Japanese ‘e’ (絵) and ‘moji’ (文字), literally ‘picture character’.
Continue reading “Unicode: On Building The One Character Set To Rule Them All”
Unicode, the wonderful extension to ASCII that gives us gems like “✈”, “⌨”, and “☕”, has had some unexpected security ramifications. The most common problems with Unicode are visual security issues, like character confusion between letters. For example, the English “M” (U+004D) is indistinguishable from the Cyrillic “М” (U+041C). Can you tell the difference between IBM.com and IBМ.com?
This bug, discovered by [John Gracey], turns the common problem on its head. Properly referred to as a case mapping collision, it’s the story of different Unicode characters getting mapped to the same upper or lowercase equivalent.
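One way to see what the eye misses is to compare code points rather than glyphs. The snippet below is only a minimal sketch, not any browser's or registrar's actual check: the helper names `codePoints` and `mixesLatinAndCyrillic` are made up for this example, and it assumes a JavaScript engine that supports Unicode property escapes in regular expressions.

```javascript
// Hypothetical helpers, for illustration only.

// List the code points of a string as "U+XXXX" labels.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Flag strings that contain both Latin and Cyrillic letters,
// a common sign of a lookalike substitution.
function mixesLatinAndCyrillic(text) {
  return /\p{Script=Latin}/u.test(text) && /\p{Script=Cyrillic}/u.test(text);
}

const latinName = "IBM.com";         // all Latin letters
const spoofedName = "IB\u041C.com";  // third letter is Cyrillic М (U+041C)

console.log(codePoints("M").join(" "));          // U+004D
console.log(codePoints("\u041C").join(" "));     // U+041C
console.log(latinName === spoofedName);          // false
console.log(mixesLatinAndCyrillic(latinName));   // false
console.log(mixesLatinAndCyrillic(spoofedName)); // true
```

A mixed-script test like this only catches names that combine alphabets; a name written entirely in lookalike letters from one alphabet would pass it.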
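// 'ß' has no single-character uppercase form, so it uppercases to 'SS'
'ß'.toUpperCase() === 'SS'.toUpperCase() // true
// Note the Turkish dotless i, which uppercases to a plain 'I'
'John@Gıthub.com'.toUpperCase() === 'John@Github.com'.toUpperCase() // true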
GitHub stores all email addresses in their lowercase form. When a user requested a password reset, GitHub’s logic worked like this: Take the email address that requested a password reset, convert it to lower case, and look up the account that uses the converted email address. That by itself wouldn’t be a problem, but the reset was then sent to the email address that was requested, not the one on file. In retrospect, this is an obvious flaw, but without the presence of Unicode and the possibility of a case mapping collision, it would be a perfectly safe practice.
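The flaw and its fix can be sketched in a few lines. This is not GitHub's code: the `accounts` map, the example addresses, and both function names are hypothetical, and the Kelvin sign (U+212A), which lowercases to a plain "k", merely stands in for whichever colliding character an attacker might pick.

```javascript
// Hypothetical account store keyed by lowercased email (illustration only).
const accounts = new Map([["mike@example.com", { id: 1 }]]);

// Flawed: looks up by the lowercased address, but mails the address as typed.
function resetRecipientFlawed(requestedEmail) {
  const key = requestedEmail.toLowerCase();
  return accounts.has(key) ? requestedEmail : null;
}

// Safer: after the lookup, only ever mail the address stored on file.
function resetRecipientSafe(requestedEmail) {
  const key = requestedEmail.toLowerCase();
  return accounts.has(key) ? key : null;
}

// U+212A KELVIN SIGN looks like "K" and lowercases to an ordinary "k".
const attackerEmail = "mi\u212Ae@example.com";

console.log(resetRecipientFlawed(attackerEmail) === attackerEmail); // true
console.log(resetRecipientSafe(attackerEmail)); // mike@example.com
```

The point of the second function is that user input is used only to find the account; the destination of the reset mail comes from stored data.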
This flaw seems to have been fixed quite some time ago, but was only recently disclosed. It’s also a novel problem affecting Unicode that we haven’t covered. Interestingly, my research has turned up an almost identical problem at Spotify, back in 2013.
Continue reading “This Week In Security: Unicode, Truecrypt, And NPM Vulnerabilities”
Well, think again. At least if you are using Chrome or Firefox. Don’t believe us? Well, check out Apple’s new website then, at https://www.apple.com . Notice anything? If you are not using an affected browser you will just see a strange URL after opening the webpage; otherwise it looks pretty legit. This is a page that demonstrates a type of Unicode vulnerability in how the browser interprets and shows the URL to the user. Notice the valid HTTPS. Of course the domain is not from Apple, it is actually the domain: “https://www.xn--80ak6aa92e.com/”. If you open the page, you can see the actual URL by right-clicking and selecting view-source.
So what’s going on? This type of phishing attack, known as an IDN homograph attack, relies on the fact that the browser, in this case Chrome or Firefox, interprets the “xn--” prefix in a URL as an ASCII-compatible encoding prefix. The encoding is called Punycode and it’s a way to represent Unicode using only the ASCII characters used in Internet host names. Imagine a sort of Base64 for domains. This allows for domains with international characters to be registered, for example, the domain “xn--s7y.co” is equivalent to “短.co”, as [Xudong Zheng] explains in his blog.
Different alphabets have different glyphs that work in these kinds of attacks. Take the Cyrillic alphabet: it contains 11 lowercase glyphs that are identical or nearly identical to Latin counterparts. This class of attacks, where an attacker replaces one letter with its counterpart, is widely known and is usually mitigated by the browser:
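The mapping between “短.co” and “xn--s7y.co” can be observed with the standard `URL` class, which converts internationalized host names to their ASCII form when parsing. The following is a rough sketch that assumes a runtime with the global `URL` class (Node.js or a browser); `hasPunycodeLabel` is a hypothetical helper written for this example, not part of any browser's defenses.

```javascript
// The URL parser converts an internationalized host name
// to its ASCII-compatible "xn--" form.
const asciiHost = new URL("https://短.co/").hostname;
console.log(asciiHost); // xn--s7y.co

// Hypothetical helper: report whether any label of the host is Punycode.
function hasPunycodeLabel(urlString) {
  return new URL(urlString).hostname
    .split(".")
    .some((label) => label.startsWith("xn--"));
}

console.log(hasPunycodeLabel("https://www.xn--80ak6aa92e.com/")); // true
console.log(hasPunycodeLabel("https://www.example.com/"));        // false
```

Flagging every Punycode label is far too blunt for real use, since legitimate internationalized domains use the same prefix, but it shows where the encoded form can be inspected.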
Continue reading “You Think You Can’t Be Phished?”
You may be a hardcore keyboard aficionado whose buckled-spring switches will be pried from your cold dead hands, but there is a new model on the street that relegates your blank-key Das Keyboard or your trusty IBM Model M to the toy chest.
The new challenger comes from Reddit user [duckythescientist], who has created a minimalist three-key binary keyboard. It features a 0 key, a 1 key, a return key, and nothing else. Characters are entered as ASCII or Unicode, and the device emulates either a QWERTY or Dvorak keyboard layout to the host computer’s USB interface. It couldn’t be a simpler layout to learn, though we’d concede that not everyone has the entire binary Unicode table memorised.
The keys are mounted in a custom 3D printed case, and the electronics come from the creator’s own “tinydev” board based on an ATtiny85. All the code is available in a GitHub repository, and there is a very short video of its Unicode ability below the break.
Continue reading “Binary Keyboard Is The Purest Form Of Input Device”
While it may not look like much, the image above is a piece of the original email where [Ken Thompson] described what would become the implementation of UTF-8. At the dawn of the computer age in America, when we were still using teletype machines, encoding the English language was all we worried about. Programmers standardized on the ASCII character set, but there was no room for all of the characters used in other languages. To enable real-time worldwide communication, we needed something better. There were many proposals, but the one submitted by [Ken Thompson] and [Rob ‘Commander’ Pike] was the one accepted, quite possibly because of what a beautiful hack it is.
[Tom Scott] did an excellent job of describing UTF-8. Why he chose to explain it in the middle of a busy cafe is beyond us, but his enthusiasm was definitely up to the task. In the video (which is embedded after the break) he quickly shows the simplicity and genius of ASCII. He then explains the challenge of supporting so many character sets, and why UTF-8 made so much sense.
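For readers who prefer code to cafe explanations, the standard UTF-8 bit layout fits in a short function. This sketch is not taken from the video or the original email; `utf8Encode` and `toHex` are illustrative names, and the function skips validation such as rejecting surrogates or out-of-range values. It assumes a runtime with the global `TextEncoder` for the cross-check.

```javascript
// Encode one Unicode code point into UTF-8 bytes by hand (no validation).
function utf8Encode(codePoint) {
  if (codePoint < 0x80) {
    return [codePoint];                       // 0xxxxxxx: plain ASCII
  }
  if (codePoint < 0x800) {
    return [
      0xc0 | (codePoint >> 6),                // 110xxxxx
      0x80 | (codePoint & 0x3f),              // 10xxxxxx
    ];
  }
  if (codePoint < 0x10000) {
    return [
      0xe0 | (codePoint >> 12),               // 1110xxxx
      0x80 | ((codePoint >> 6) & 0x3f),       // 10xxxxxx
      0x80 | (codePoint & 0x3f),              // 10xxxxxx
    ];
  }
  return [
    0xf0 | (codePoint >> 18),                 // 11110xxx
    0x80 | ((codePoint >> 12) & 0x3f),        // 10xxxxxx
    0x80 | ((codePoint >> 6) & 0x3f),         // 10xxxxxx
    0x80 | (codePoint & 0x3f),                // 10xxxxxx
  ];
}

// Format a byte array as space-separated hex.
const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");

console.log(toHex(utf8Encode(0x41)));   // 41
console.log(toHex(utf8Encode(0x2708))); // e2 9c 88

// Cross-check against the built-in encoder.
console.log(toHex(Array.from(new TextEncoder().encode("\u2708")))); // e2 9c 88
```

Because code points below 0x80 pass through unchanged, any ASCII file is already valid UTF-8, which is a large part of why the scheme was so easy to adopt.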
We considered making this a Retrotechtacular, but the consensus is that understanding how UTF-8 came about is useful for modern hackers and coders. If you’re interested in learning more, there are tons of links in this Reddit post, including a link to the original email.
Continue reading “UTF-8 – “The Most Elegant Hack””
Last week at Black Hat DC, [Moxie Marlinspike] presented a novel way to hijack SSL. You can read about it in this Forbes article, but we highly recommend you watch the video. sslstrip can rewrite all https links as http, but it goes far beyond that. Using Unicode characters that look similar to / and ? it can construct URLs with a valid certificate and then redirect the user to the original site after stealing their credentials. The attack can be very difficult for even above-average users to notice. This attack requires access to the client’s network, but [Moxie] successfully ran it on a Tor exit node.