People have gotten much savvier about computer security in the last decade or so. Most people know that sending a document with sensitive information in it is a no-no, so many people try to redact documents with varying levels of success. A common strategy is to replace text with a black box, but you sometimes see sophisticated users pixelate part of an image or document they want to keep private. If you do this for text, be careful. It is possible to unredact pixelated images through software.
It appears that the algorithm is pretty straightforward. It simply guesses letters, pixelates them, and matches the result. You do have to estimate the size of the pixelation, but that’s usually not very hard to do. The code is built using TypeScript and while the process does require a little manual preparation, there’s nothing that seems very difficult or that couldn’t be automated if you were sufficiently motivated.
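To make the guess, pixelate, and compare loop concrete, here is a minimal Python sketch. It is not the tool's TypeScript code: the tiny 3x5 bitmap font, the function names (`render`, `pixelate`, `mismatch`, `recover`), and the assumption that the pixelation grid lines up with the glyph cells are all illustrative simplifications. Real screenshots add anti-aliased fonts and unknown offsets, which this sketch ignores.

```python
# Hypothetical 3x5 bitmap font; a real attack would render the document's actual font.
FONT = {
    "A": [".#.", "#.#", "###", "#.#", "#.#"],
    "B": ["##.", "#.#", "##.", "#.#", "##."],
    "C": [".##", "#..", "#..", "#..", ".##"],
    "D": ["##.", "#.#", "#.#", "#.#", "##."],
    "E": ["###", "#..", "##.", "#..", "###"],
}


def render(text):
    """Draw text as rows of 0/1 pixels: 4 columns per glyph (3 + spacer), 6 rows (5 + blank)."""
    rows = [[] for _ in range(6)]
    for ch in text:
        glyph = FONT[ch] + ["..."]
        for y, line in enumerate(glyph):
            rows[y].extend([1 if p == "#" else 0 for p in line] + [0])
    return rows


def pixelate(img, block):
    """Reduce the image to one mean-brightness value per block x block tile."""
    out = []
    for y in range(0, len(img), block):
        out_row = []
        for x in range(0, len(img[0]), block):
            tile = [v for r in img[y:y + block] for v in r[x:x + block]]
            out_row.append(sum(tile) / len(tile))
        out.append(out_row)
    return out


def mismatch(candidate, target):
    """Sum of squared differences over the block columns the candidate covers."""
    return sum((c - t) ** 2
               for cand_row, targ_row in zip(candidate, target)
               for c, t in zip(cand_row, targ_row))


def recover(target, length, block, alphabet="ABCDE"):
    """Guess one character at a time, keeping whichever pixelates closest to the target."""
    guess = ""
    for _ in range(length):
        best = min(alphabet,
                   key=lambda c: mismatch(pixelate(render(guess + c), block), target))
        guess += best
    return guess


secret = "DECADE"
target = pixelate(render(secret), block=2)      # what the "redacted" image would contain
print(recover(target, len(secret), block=2))    # prints DECADE
```

The guessed block size (`block=2` here) stands in for the pixelation size you would have to estimate from the image.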
You don’t see it as often as you used to, but there have been a slew of legal and government scandals where someone redacted a document by putting a black box over a PDF, so the text was hidden when printed but still present in the file. Older word processors often didn’t really delete text, either, if you knew how to look at the files. The Facebook valuation comes to mind. Not to mention that the National Legal and Policy Center was stung with poor redaction techniques.
63 thoughts on “Pixelating Text Not A Good Idea”
If an approach this simple works, throwing some AI at it means pixelating is double plus non-good for redaction.
Also, cue CSI meme: “Enlarge”
You mean “Enhance”
I thought that was from Star Trek?
Super Troopers is the most popular current reference.
Just print the damn thing!
That’s double plus UN good. Into the memory hole with it all. The past never had been altered. And Eurasia was at war with Eastasia. Eurasia had always been at war with Eastasia.
Russian bot with bad AI
It’s a reference to the book 1894 but okay
Lol “1894” you mean 1984? Or is this from the multiverse?
Freedom is the freedom to say that two plus two make four. If that is granted, all else follows.
It’s such a fragile approach. Even the author says “the whole thing really depends on being able to correctly replicate the redacted characters.” which means anything can go wrong.
So what? An exploit like this only needs to happen once, for the right target, under the right circumstances.
In any realistic scenario, where you’re looking at a photo of some blurred characters, it’s a cold day in hell before you even guess the font right.
Except that presumably there is unredacted text to go along with the redacted text
This. You can get all the info you want from the other text, and maybe there is some kind of adjustable threshold to set how close to the original the output has to be (for antialiasing, slightly different x/y position, whatever)
Then you have to guess which blur algorithm was used.
Yes, this. Also, even if you change the font for redacted text, this is one computer using brute force. You could be smarter on the brute force with a rainbow table equivalent (except more effective because this wouldn’t be all or nothing; someone could get feedback on a partial answer). You could throw multiple computers at the problem. You could train a neural net and invest upfront time on learning for fast inference on multiple fonts. You could use a language faster than TypeScript and greedily guess the first letter on a font-by-font basis and back out if you didn’t get high alignment. This PoC is incredibly damning of this blurring practice
It’s not that hard to guess the font metrics. And you don’t need a complete reveal, just enough for a human to be able to guess the words.
Unscrambling passwords does not have to be robust to be a threat. The scrambling part is the one that must not be fragile.
Most people, even in this day and age, tend to use easily recognizable words. Even with a couple of errors in the unredacted output, a human could probably guess the correct one.
But you probably don’t even need all the characters to ascertain the message and context, though.
this is why I always convert my text to cuneiform before pixelating.
𒀱 would like that.
To be fair, there are many ways to “pixelate” text. Quantizing the pixels into larger pixels is just one approach. One can also first scramble the original pixels a bit before pixelating.
Or the easiest solution: change the characters to different characters first. Since the recipient shouldn’t know the original text, they have no reason to really care about the pixelated text.
Even using black bars to censor text isn’t without flaws. One still knows the length of the word/phrase used, and that alone leaks some information. (Though this is starting to get into TEMPEST territory.)
Surely the easiest and least computationally intensive method is to put a plain black box over them?
As Alexander said, the length of a text can give away the content, especially where there’s a small set of likely options.
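The length leak is easy to sketch. Assuming a monospaced font with a known character width (both assumptions, as are the function name and the made-up candidate list below), the width of a black bar alone can narrow down a short list of likely values:

```python
def candidates_matching_bar(candidates, bar_width_px, char_width_px, tolerance_px=2.0):
    """Keep candidates whose rendered width (monospace assumption) fits the bar."""
    return [text for text in candidates
            if abs(len(text) * char_width_px - bar_width_px) <= tolerance_px]


# Hypothetical example: four possible names, a 49 px bar, 7 px per character.
names = ["Alice Smith", "Bob Lee", "Christopher Jones", "Dana Kim"]
print(candidates_matching_bar(names, bar_width_px=49, char_width_px=7))  # ['Bob Lee']
```

With a proportional font you would sum per-character widths instead of multiplying by a constant, but the idea is the same.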
Sometimes, even the knowledge of the message existing can be enough to leak vital information.
One reason why some more security-oriented organizations send “empty” documents internally.
One example is an embassy sending a briefcase whose content is in itself pointless. By sending it, one can greatly blur when actual data is being sent. If transactions only happen when something important happens, then it is easy to conclude that whatever is happening is important enough to converse about. It is easier to hide these patterns if one always sends something at regular intervals.
The same applies for trivial things like web traffic between servers in an organization. (And here the timing of transactions can leak even more data compared to sneakernetting documents.)
But information security is a field where the question of “can it be decrypted?” shouldn’t be asked, but rather, “how long until they/someone find(s) out?” is the more important question, that hopefully has the answer “long enough.”
See https://www.mattblaze.org/blog/neinnines for example (or, better, the referenced source: Strzok’s _Compromised_)
Isn’t that a theory behind numbers stations? That they were always broadcasting garbage but when necessary you could transmit encrypted messages that naively resembled the garbage
Some number stations are very easy to know when they send actual messages, and when they send garbage.
Since some just play music, or noise when “offline”, and have someone reading out numbers or even words when “online”. Making a very stark contrast between the two modes of operation.
However, most people who are intended to listen to a number station to receive information will do so at a prespecified time. So the broadcasts tend not to be reactive, and one can’t glean as much from the timing of a message.
Though, we don’t really know how many of the “messages” are actual messages. I wouldn’t be surprised if some are just garbage to fill the void.
Numbers stations are not sending an encoded message directly. The people expecting the messages know the schedule and also know the key. The numbers could be referring to pages/chapters/words on a line from a specific book that they know to use, etc. There is not likely a directly encoded message in those numbers.
Best method: put your text, cover the sensitive data with a text box containing things like “F* you I won’t give my f*ing password” and pixelate that; at least it will make the hacker burn some time…
It would be a much better demo if the text said “No more secrets”
Does it work with porn?
Read the OP’s writeup, and you’ll find out!
There was something a couple years ago that could supposedly unblur a face. The pixelization would change as the face moved, and it could guess smaller and smaller features. Can’t seem to find anything about it anymore.
Why bother pixelating stuff anymore? just flat out put a black bar on top and call it a day.
This brings to mind a more fun way to redact text. Paste other pixelated text over it!
It could be false information, made to look correct but actually harmful to the competitor/enemy/opponent who thinks they are being sneaky by decoding it. It could be nonsense text. Or it could even just be insults.
So much more fun than a plain black box.
Hehehe, I like it, and you can make it easier or harder to unscramble depending on just how ‘important’ the document seems to be – so they will spend all those hours of computer time to find ‘Erm, reading others mail is rude buddy, please bugger off…’ just because that document seemed like it was worth it…
Really got to ask the question these days though of why redact at all? – if it’s digital, just having a character you use to represent a cut is enough; it doesn’t need to be in printed page format that gives all these hints as to the content just by its length… And if you really are putting it down on paper, or the layout matters for some reason, yet you actually need to redact it in some way, you have to ask why. Can’t you find a better method to transfer the more public elements separately from the sensitive…
Yep, I can just imagine the NSA guy saying “What the hell does ‘Be sure to drink your Ovaltine’ mean!?”
Beat me to it!
back in the day, my university had a burroughs mainframe and a bunch of printing terminals. when you logged in, the password was echoed. to obscure this, the mainframe would overprint with a sequence of xes and asterisks and other characters that put a lot of ink on the paper. some of the poor students hanging around in the terminal room were pretty good at seeing through the overprinting to extract the password.
This is why you take a PDF, screen shot it, use MS Paint (or equivalent) to overlay black boxes, then ‘print’ the image back to a PDF document.
You beat me to it.
It’s pretty hard to extract text from a black box in a JPEG.
Screen shot, edit image, print out image, scan image and send scan.
For the love of cod DON’T. Especially not if you’re simply being overly paranoid about trivial content. I’ve been making a number of public records requests recently and some of the local governments have peculiar sensitivities and definitions of “personal” information, despite the documents being completely legible if I was able to view them in person. Instead I get charged a quarter for the paper they use to print them out on some splotchy printer, and then receive a 2-bit scan of the result as a multi-page TIFF. They’ve complied with the letter of the request, but not the spirit, as the result is completely unreadable.
If I were being paid by a 3-letter org I would delete and rewrite these sensitive messages prior to pixelation and public dissemination. You know they already decrypt far more difficult puzzles. False messages would be well planted in such a fashion….
Seriously, just blot it out. Pixellate when you want to give others a hint.
This isn’t terribly new as something similar was presented over a year ago: https://hackaday.com/2020/12/18/this-week-in-security-solarwinds-and-fireeye-wordpress-ddos-and-enhance/
I once got a whole pile of “redacted” Word documents related to a bid we were working on for a company that was way too into its “proprietary” data.
It turned out that the genius lawyers at the sending end had just generated all their copious black boxes by selecting the text and changing the background color to black.
After I discovered that it took all of two seconds to unmask each “double-extra-super-secret” document.
pixelate.. fine. but definitely salt the pixels with random numbers.
I thought the same, like new, small QR codes as characters, with a high-security private and public key to encode/decode the text with a zero-knowledge proof method or something. That would be too heavy and intense, but if someone really wants that low level of security on printed or digital documents, they will have the resources to do so.
or even have the text pixelization use a random lorem ipsum section for the pixels (or other “random” word fillers, or even replace the letters with random numbers, possibly even have the pixelization choose a random font for the text to be pixelized).
even with inserting random numbers in the pixelization you could still potentially un-pixelate the text and filter out the numbers.
replacing the text with random noise will mostly guarantee that someone can’t extract data from a redacted section.
and you could have the algorithm used to pixelate the text require 2 factor authentication to unredact the text (i.e. pull from an encrypted unredacted file) otherwise the document when you try to unredact the text will just show the random noise.
Since we are ALL familiar with the pixelated picture of Abe Lincoln, and how easily it is recognized, it’s amazing that anyone, anywhere, would ever think that this would be a robust way to obscure text. Human neural nets are quite good at extracting data at extremely low signal-to-noise ratios.
Ask this guy how well pixellation worked:
They untwirled his swirled pixels, tracked him down, and put him in jail.
Sometimes you’ve got to be glad that people don’t understand the technology they are using.
I’ll see your single pixelation un-pixelater. Now I’ll pixelate my pixelated text, then pixelate that pixelated text.
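For what it’s worth, a second pass of plain block averaging on the same grid changes nothing, as this small sketch suggests. It assumes integer grayscale pixels and the same block size and alignment on both passes; `pixelate_blocks` is a hypothetical helper, not any particular editor’s filter.

```python
def pixelate_blocks(img, block):
    """Replace each block x block tile of an integer grayscale image by its floor mean."""
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(0, height, block):
        for x in range(0, width, block):
            cells = [(r, c)
                     for r in range(y, min(y + block, height))
                     for c in range(x, min(x + block, width))]
            mean = sum(img[r][c] for r, c in cells) // len(cells)
            for r, c in cells:
                out[r][c] = mean
    return out


img = [
    [0, 255, 10, 20],
    [255, 0, 30, 40],
    [5, 5, 200, 100],
    [5, 5, 100, 200],
]
once = pixelate_blocks(img, 2)
twice = pixelate_blocks(once, 2)
print(once == twice)  # True: every tile is already uniform after the first pass
```

A different block size or a shifted grid on the second pass would change the image, but it only discards more detail; it adds no secret an attacker would have to guess beyond the pixelation parameters.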
Just don’t write things down you’d need to redact.
That’s why I always use Gaussian blur with a large radius when redacting something
You might be surprised what can be recovered from even extremely blurred text. Check out last year’s Helsinki Deblur Challenge. At the highest difficulty levels you would hardly guess that the picture actually contains text if you weren’t told.
Good thing I use the Wingdings font for all my important communication… I pixelate that…
(Ok I don’t do this…but it would throw a wrench at hackers (after pixelation) and the people you intend to read it both)
This is why you should just write all personal information in Wingdings and then blur it. No one’s guessing that shit without brute force.