This Week In Security: Unicode Strikes Again, Trust No One (Redditor), And More

There’s a popular sysadmin meme that system problems are “always DNS”. In the realm of security, it seems like “it’s always Unicode”. And it’s not hard to see why. Unicode is the attempt to represent all of Earth’s languages with a single character set, and that means there are a lot of very similar characters. The two broad issues are that human users can’t always see the difference between similar characters, and that libraries and applications sometimes automatically convert exotic Unicode characters into more traditional text.

This week we see the resurrection of an ancient vulnerability in PHP-CGI, one that allows an attacker to inject command line switches when a web server launches an instance of PHP-CGI. The original fix was to block some characters in specific places in query strings, such as a query string starting with a dash.

The bypass is due to a Windows feature, “Best-Fit”, an automatic down-convert from certain Unicode characters. This feature works on a per-locale basis, which means that not every system language behaves the same. The exact bypass that has been found is the conversion of a soft hyphen, which doesn’t get blocked by PHP, into a regular hyphen, which can trigger the command injection. This quirk only happens when the Windows locale is set to Chinese or Japanese. Combined with the relative rarity of running PHP-CGI, and PHP on Windows, this is a pretty narrow problem. The XAMPP install does use this arrangement, so those installs are vulnerable, again if the locale is set to one of these specific languages. The other thing to keep in mind is that the Unicode character set is huge, and it’s very likely that there are other special characters in other locales that behave similarly.
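To make the shape of the bypass concrete, here is a minimal Python sketch. It is a toy model and not PHP's actual code: `naive_is_safe` is a hypothetical stand-in for the "no leading dash" check, and `BEST_FIT` is an assumed one-entry table that imitates the Best-Fit conversion of a soft hyphen (U+00AD) into a regular hyphen on an affected locale.

```python
# Illustrative only: a toy model of the filter bypass, not PHP's real logic.

SOFT_HYPHEN = "\u00ad"  # looks harmless to a check for a literal "-"

# Hypothetical stand-in for the Windows Best-Fit table on an affected locale.
BEST_FIT = {SOFT_HYPHEN: "-"}


def naive_is_safe(query: str) -> bool:
    """Model of a filter that only rejects a literal leading ASCII dash."""
    return not query.startswith("-")


def best_fit(text: str) -> str:
    """Model of the down-conversion applied after the filter has run."""
    return "".join(BEST_FIT.get(ch, ch) for ch in text)


query = SOFT_HYPHEN + "s"            # soft hyphen followed by a switch letter
passed_filter = naive_is_safe(query)  # the filter sees no leading dash
converted = best_fit(query)           # the conversion turns it into a real dash
print(passed_filter, converted.startswith("-"))
```

The point of the sketch is the ordering: the check runs on the original text, while the command line is built from the converted text, so validating after conversion (or rejecting non-ASCII input outright) would close this particular gap.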

Downloader Beware

The ComfyUI project is a flowchart interface for doing AI image generation workflows. It’s an easy way to build complicated generation pipelines, and the community has stepped up to build custom plugins and nodes for generation. The thing is, it’s not always the best idea to download and run code from strangers on the Internet, as a group of ComfyUI users found out the hard way this week. The ComfyUI_LLMVISION node from u/AppleBotzz was malicious.

The node references a malicious Python package that grabs browser data and sends it all to a Discord or Pastebin. It appears that some additional malware gets installed, for continuing access to infected systems. It’s a rough way to learn.

Building Up Unicode Characters One Bit At A Time

The range of characters that can be represented by Unicode is truly bewildering. If there’s a symbol that was ever used to represent a sound or a concept anywhere in the world, chances are pretty good that you can find it somewhere in Unicode. But can many of us recall the proper keyboard calisthenics needed to call forth a particular character at will? Probably not, which is where this Unicode binary input terminal may offer some relief.

“Surely they can’t be suggesting that entering Unicode characters as a sequence of bytes using toggle switches is somehow easier than looking up the numpad shortcut?” we hear you cry. No, but we suspect that’s hardly [Stephen Holdaway]’s intention with this build. Rather, it seems geared specifically at making the process of keying in Unicode harder, but cooler; after all, it was originally his intention to enter this in last year’s Odd Inputs and Peculiar Peripherals contest. [Stephen] didn’t feel it was quite ready at the time, but now we’ve got a chance to give this project a once-over.

The idea is simple: a bank of eight toggle switches (with LEDs, of course) is used to compose the desired UTF-8 character, which is made up of one to four bytes. Each byte is added to a buffer with a separate “shift/clear” momentary toggle, and eventually sent out over USB with a flick of the “send” toggle. [Stephen] thoughtfully included a tiny LCD screen to keep track of the character being composed, so you know what you’re sending down the line. Behind the handsome brushed aluminum panel, a Pi Pico runs the show, drawing glyphs from an SD card containing 200 MB of TrueType font files.
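As a rough illustration of what keying in a character this way involves, the Python sketch below composes the euro sign from three bytes of switch positions. It is not [Stephen]'s firmware; the names `switch_bytes` and `bits_to_byte` are made up for the example, and the switches are assumed to be read most significant bit first.

```python
# Each inner list is one byte as set on eight toggle switches, MSB first.
switch_bytes = [
    [1, 1, 1, 0, 0, 0, 1, 0],  # 0xE2: lead byte, 1110xxxx starts a 3-byte sequence
    [1, 0, 0, 0, 0, 0, 1, 0],  # 0x82: continuation byte, 10xxxxxx
    [1, 0, 1, 0, 1, 1, 0, 0],  # 0xAC: continuation byte, 10xxxxxx
]


def bits_to_byte(bits):
    """Fold eight switch positions into a single byte value."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


# "Shift" each byte into the buffer, then "send" by decoding the whole buffer.
buffer = bytes(bits_to_byte(bits) for bits in switch_bytes)
char = buffer.decode("utf-8")
print(buffer.hex(), f"U+{ord(char):04X}")
```

The leading bits of the first byte announce how many bytes follow, which is the part of UTF-8 that a bank of toggle switches makes hard to ignore.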

At the end of the day, it’s tempting to look at this as an attractive but essentially useless project. We beg to differ, though — there’s a lot to learn about Unicode, and [Stephen] certainly knocked that off his bucket list with this build. There’s also something wonderfully tactile about this interface, and we’d imagine that composing each codepoint is pretty illustrative of how UTF-8 is organized. Sounds like an all-around win to us.

Understanding And Using Unicode

Computer engineer [Marco Cilloni] realized a lot of developers today still have trouble dealing with Unicode in their programs, especially in the C/C++ world. He wrote an excellent guide called “Unicode is harder than you think”, which summarizes many of the issues surrounding Unicode and its encodings. He first presents a brief history of Unicode and how it came about, so you can understand the reasons for the frustrating edge cases you’re bound to encounter.

There have been a variety of Unicode encoding methods over the years, but modern programs dealing with strings will probably be using UTF-8 encoding — and you should too. This multibyte encoding scheme has the convenient property of not changing the original character values when dealing with 7-bit ASCII text. We were surprised to read that there is actually an EBCDIC version of UTF still officially on the books today:

UTF-EBCDIC, a variable-width encoding that uses 1-byte characters designed for IBM’s EBCDIC systems (note: I think it’s safe to argue that using EBCDIC in 2023 edges very close to being a felony)
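The ASCII-compatibility property of UTF-8 mentioned above is easy to check for yourself. The short Python sketch below is only an illustration with arbitrarily chosen sample characters: it confirms that pure ASCII text encodes to identical bytes either way, and shows how the byte count grows for characters further up the code space.

```python
# Pure 7-bit ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_text = "Hello, world"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print("ASCII unchanged:", same_bytes)

# Sample characters (arbitrary picks) needing 1, 2, 3 and 4 bytes in UTF-8.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```

This is also why code that assumes one byte per character keeps working on English text and then breaks on the first accented letter.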

Punycodes Explained

When you’re restricted to ASCII, how can you represent more complex things like emojis or non-Latin characters? One answer is Punycode, which is a way to represent Unicode characters in ASCII. While you could technically encode the raw bits of Unicode into ASCII characters with something like Base64, there’s a snag: Base64 distinguishes upper and lower case, and the Domain Name System (DNS) generally requires that hostnames are case-insensitive, so whether you type in HACKADAY.com, HackADay.com, or just hackaday.com, it all goes to the same place.

[A. Costello] at the University of California, Berkeley proposed the idea of Punycode in RFC 3492 in March 2003. It outlines a simple algorithm where all regular ASCII characters are pulled out and stuck on one side with a separator in between, in this case, a hyphen. Then the Unicode characters are encoded and stuck on the end of the string.

First, each non-ASCII character’s numeric codepoint and its position in the string are combined into a single number. Then the number is encoded as a Base-36 (a-z and 0-9) variable-length integer. For example, a greeting and the Greek for thanks, “Hey, ευχαριστώ” becomes “Hey, -mxahn5algcq2”. Similarly, the beautiful city of München, lowercased as a hostname would be, becomes mnchen-3ya.
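If you want to experiment with the encoding yourself, Python's standard library ships a `punycode` codec and an `idna` codec that adds the `xn--` prefix used in real hostnames. The sketch below just calls those codecs, so it shows the results rather than the algorithm's internals.

```python
# Python's built-in "punycode" codec implements RFC 3492.
city = "m\u00fcnchen"                 # "münchen", already lowercased
encoded = city.encode("punycode")
print(encoded)                        # b'mnchen-3ya'

# The "idna" codec wraps Punycode with the "xn--" prefix used in DNS labels.
label = city.encode("idna")
print(label)                          # b'xn--mnchen-3ya'

# Decoding restores the original string, including mixed-script text.
greeting = "Hey, \u03b5\u03c5\u03c7\u03b1\u03c1\u03b9\u03c3\u03c4\u03ce"
round_trip = greeting.encode("punycode").decode("punycode")
print(round_trip == greeting)
```

Note that the ASCII characters come through untouched ahead of the hyphen, which is what keeps a Punycode label readable at a glance.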

Custom Macro Pad Helps Deliver Winning Formulas

For those of us with science and engineering backgrounds, opening the character map or memorizing the Unicode shortcuts for various symbols is a tedious but familiar part of writing reports or presentations. [Magne Lauritzen] thought there had to be a better way and developed the Mathboard.

With more than 80 “of the most commonly used mathematical operators” and the entire Greek alphabet, the Mathboard could prove very useful to a wide number of disciplines. Hardware-wise, the Mathboard is a 4×4 macro pad, but the special sauce is in the firmware’s key set implementation. While the most straightforward approach would be to pick 16 or 32 symbols for the board, [Magne] felt that didn’t do the wide range of Unicode symbols justice. By implementing a system of columns and layers, he was able to get 6+ symbols per key, giving a much greater breadth of symbols than just 16 keys and a shift layer. The symbols with a dot next to them unlock variants of that symbol by double or triple-tapping the key, for instance the lowercase or capital form of a Greek letter.
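One way to picture the layer and multi-tap idea is as a lookup table keyed by layer and key, with a list of variants selected by tap count. The Python sketch below is purely hypothetical: the `KEYMAP` contents, the key names, and the `resolve` helper are invented for illustration and are not taken from the Mathboard firmware.

```python
# Hypothetical keymap: (layer, key) -> symbol variants, chosen by tap count.
KEYMAP = {
    (0, "K1"): ["\N{GREEK SMALL LETTER ALPHA}", "\N{GREEK CAPITAL LETTER ALPHA}"],
    (0, "K2"): ["\N{N-ARY SUMMATION}"],
    (1, "K1"): ["\N{INTEGRAL}", "\N{DOUBLE INTEGRAL}", "\N{TRIPLE INTEGRAL}"],
}


def resolve(layer, key, taps):
    """Return the symbol for a key press; taps is assumed to be 1 or more."""
    variants = KEYMAP[(layer, key)]
    # Extra taps beyond the number of variants stay on the last variant.
    index = min(taps, len(variants)) - 1
    return variants[index]


single = resolve(0, "K1", 1)   # single tap: lowercase alpha
double = resolve(0, "K1", 2)   # double tap: capital alpha
triple = resolve(1, "K1", 3)   # triple tap on layer 1: triple integral
print(single, double, triple)
```

With 16 keys, a handful of layers, and up to three variants per entry, a table like this comfortably covers far more symbols than a plain shift layer would.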

The Mathboard currently works in Microsoft Office’s equation editor and as a plain-text Unicode board. [Magne] is currently working on LaTeX support and hopes to add Open Office support in the future. This device was an honorable mention in our Odd Inputs and Peculiar Peripherals Contest. If you’d like to see another interesting math-themed board, check out the one on the MCM/70 from 1974.

Coffee With Kernighan

There was an interesting tidbit buried in a Computerphile video released last week (below the break), featuring professors [David Brailsford] and [Brian Kernighan] having a chat over coffee. Among other topics, they discuss the history and current state of various text processing tools. We learn that [Kernighan] has taken on a summer project of updating the AWK text processing language to handle UTF-8 text, an omission he admits is embarrassing in this day and age. He is also working on a second edition of The AWK Programming Language book, which hasn’t been updated since being first released in 1988.

[Brian Kernighan] is a legend in the world of Unix and computing, having worked at Bell Labs during the 70s, where Unix and C were developed. Among the many accomplishments in his career, he is well known as the co-author with [Dennis Ritchie] of The C Programming Language, first published in 1978 and still being used decades later, as well as for AWK mentioned above and major updates to troff. More recently, he co-authored The Go Programming Language book in 2015.

If an updated UTF-8-capable AWK interests you, keep an eye on the AWK GitHub repository, where [Kernighan] anticipates an update once he wraps his head around git a little better. We’re happy to see [Brian] so active at 80 years old. If you want to learn more about those early days at Bell Labs, we reviewed [Kernighan]’s very interesting UNIX: A History and a Memoir a couple of years ago.

Can You Identify This Mystery Unicode Glyph?

For anyone old enough to have worked with the hell of multiple incompatible character sets, Unicode has been a liberation; a true One Character Set To Contain Them All. We have so many Unicode characters to play with that there’s a fascinating pursuit in itself in probing at the obscure corners of what can be rendered on screen as a Unicode glyph. With so many disparate character sets having been brought together to make the Unicode standard there are plenty of unusual characters to choose from, and it’s one of them that [Jonathan Chan] has examined in detail.

U+237C ⍼, or the right angle with downwards zigzag arrow, is a mysterious Unicode symbol with no known use and an unknown origin. XKCD featured it as a spoof “Larry Potter”, but as [Jonathan]’s analysis shows, it’s proving impossible to narrow down where it came from. Mystical cult symbol? Or perhaps fiscal growth in an economy in which time runs downwards? Either way, with its lineage traced back to the early 1990s and still no answer to the question, it appears that there may be a story behind it.

Hackaday readers never cease to amaze us with the breadth of their knowledge, ingenuity, and experience, so we think it’s not impossible that among you there may be people who will turn and pull a dusty computer manual from the shelf to give us the story behind this elusive glyph. We’d love to hear in the comments below.

Meanwhile if Unicode sparks your interest, we’ve given it a close look in the past.

Thanks [Jonty] for the tip.