Pixelating Text Not A Good Idea

People have gotten much savvier about computer security in the last decade or so. Most people know that sending a document with sensitive information in it is a no-no, so many people try to redact documents with varying levels of success. A common strategy is to replace text with a black box, but you sometimes see sophisticated users pixelate part of an image or document they want to keep private. If you do this for text, be careful. It is possible to unredact pixelated images through software.

The algorithm appears to be pretty straightforward. It simply guesses letters, pixelates them, and compares the result against the redacted image. You do have to estimate the size of the pixelation, but that’s usually not very hard to do. The code is written in TypeScript, and while the process does require a little manual preparation, there’s nothing that seems very difficult or that couldn’t be automated if you were sufficiently motivated.
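To make the guess, pixelate, and compare idea concrete, here is a minimal sketch in Python. It is not the tool's actual TypeScript code. The 3x5 bitmap font, the four-letter alphabet, the block-averaging pixelation, and the function names are all assumptions made up for illustration, and it presumes the attacker already knows the font, the text length, and the block size.

```python
from itertools import product

# Hypothetical 3x5 bitmap font: just enough glyphs for a toy demo.
FONT = {
    "a": ["010", "101", "111", "101", "101"],
    "b": ["110", "101", "110", "101", "110"],
    "c": ["011", "100", "100", "100", "011"],
    "d": ["110", "101", "101", "101", "110"],
}


def render(text):
    """Render text as rows of 0/1 pixels, with one blank column after each glyph."""
    rows = []
    for y in range(5):
        row = []
        for ch in text:
            row.extend(int(p) for p in FONT[ch][y])
            row.append(0)
        rows.append(row)
    return rows


def pixelate(image, block):
    """Replace each block x block tile with the average of its pixels."""
    height, width = len(image), len(image[0])
    out = []
    for top in range(0, height, block):
        out_row = []
        for left in range(0, width, block):
            tile = [
                image[y][x]
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
            ]
            out_row.append(sum(tile) / len(tile))
        out.append(out_row)
    return out


def distance(a, b):
    """Sum of absolute differences between two pixelated images of equal size."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def unredact(target, length, block):
    """Try every string of the given length and keep the closest pixelated match."""
    candidates = ("".join(chars) for chars in product(FONT, repeat=length))
    return min(
        candidates,
        key=lambda guess: distance(pixelate(render(guess), block), target),
    )


secret = "dab"
redacted = pixelate(render(secret), block=2)
print(unredact(redacted, length=len(secret), block=2))
```

Exhaustive search only works here because the alphabet and the string are tiny. A practical attack would also have to match the font, size, and offset of the original rendering, and would likely guess one character at a time rather than enumerating every string.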

You don’t see it as often as you used to, but there has been a slew of legal and government scandals where someone redacted a PDF by putting a black box over the text, so it was hidden when printed but still present in the document. Older word processors often didn’t really delete text, either, if you knew how to look at the files. The Facebook valuation comes to mind. Not to mention that the National Legal and Policy Center was stung by poor redaction techniques.

63 thoughts on “Pixelating Text Not A Good Idea”

    1. That’s double plus UN good. Into the memory hole with it all. The past never had been altered. And Eurasia was at war with Eastasia. Eurasia had always been at war with Eastasia.

  1. It’s such a fragile approach. Even the author says “the whole thing really depends on being able to correctly replicate the redacted characters.” which means anything can go wrong.

          1. This. You can get all the info you want from the other text, and maybe there is some kind of adjustable threshold to set how close to the original the output has to be (for antialiasing, slightly different x/y position, whatever)

          2. Yes, this. Also, even if you change the font for redacted text, this is one computer using brute force. You could be smarter on the brute force with a rainbow table equivalent (except more effective because this wouldn’t be all or nothing; someone could get feedback on a partial answer). You could throw multiple computers at the problem. You could train a neural net and invest upfront time on learning for fast inference on multiple fonts. You could use a language faster than TypeScript and greedily guess the first letter on a font-by-font basis and back out if you didn’t get high alignment. This PoC is incredibly damning of this blurring practice

  2. To be fair, there are many ways to “pixelate” text. Quantizing the pixels into larger pixels is just one approach. One can also first scramble the original pixels a bit, before pixelating it.

    Or the easiest solution, one changes the characters to different characters first. Since the recipient shouldn’t know the original text, they have no reason to really care about the pixelated text.

    Even using black bars to censor text isn’t without flaws. One still knows the length of the word/phrase used. And that alone leaks some information. (this is though starting to go into tempest.)

        1. Sometimes, even the knowledge of the message existing can be enough to leak vital information.

          One reason why some more security oriented organizations send “empty” documents internally.

          One example is an embassy just sending a briefcase with content that in itself is pointless. But by sending it one can greatly blur when actual data is being sent. Since if transactions only happen when things of importance happen, then it is easy to conclude that whatever is happening is important enough to converse about. It is easier to hide these patterns if one always sends something at certain intervals.

          The same applies for trivial things like web traffic between servers in an organization. (and here timing of transactions can leak even more data compared to sneakernetting documents)

          But information security is a field where the question of “can it be decrypted?” shouldn’t be asked, but rather, “how long until they/someone find(s) out?” is the more important question, that hopefully has the answer “long enough.”

          1. Isn’t that a theory behind numbers stations? That they were always broadcasting garbage but when necessary you could transmit encrypted messages that naively resembled the garbage

          2. Kyle
            Some number stations are very easy to know when they send actual messages, and when they send garbage.

            Since some just play music, or noise when “offline”, and have someone reading out numbers or even words when “online”. Making a very stark contrast between the two modes of operation.

            However, most people that are intended to listen to a number station to receive information will do so at a prespecified time. So they tend to not be reactive, so one can’t glean as much from the timing of a message.

            Though, we don’t really know how many of the “messages” are actual messages. I wouldn’t be surprised if some are just garbage to fill the void.

          3. Numbers stations are not sending an encoded message directly. The people who are expecting the messages know the schedule and also know the key. The numbers could be referring to pages/chapters/word on a line from a specific book that they know to use, etc. There is not likely a directly encoded message in those numbers.

      1. Best method, put your text, cover the sensitive data with a text box containing things like “F* you I won’t give my f*ing password” and pixelate that, at least it will make the hacker burn some time…

    1. There was something a couple years ago that could supposedly unblur a face. The pixelization would change as the face moved, and could guess smaller and smaller features. Can’t seem to find anything about it anymore

  3. This brings to mind a more fun way to redact text. Paste other pixelated text over it!

    It could be false information, made to look correct but actually harmful to the competitor/enemy/opponent who thinks they are being sneaky by decoding it. It could be nonsense text. Or it could even just be insults.

    So much more fun than a plain black box.

    1. Hehehe, I like it, and you can make it easier or harder to unscramble depending on just how ‘important’ the document seems to be – so they will spend all those hours of computer time to find ‘Erm, reading others mail is rude buddy, please bugger off…’ just because that document seemed like it was worth it…

      Really got to ask the question these days though of why redact at all? – if it’s digital just having a character you use to represent a cut is enough, it doesn’t need to be in printed page format that gives all these hints as to the content just by its length… And if you really are putting it down on paper, or the layout matters for some reason yet you actually need to redact it in some way you have to ask why? Can’t you find a better method to transfer the more public elements separately from the sensitive…

  4. back in the day, my university had a burroughs mainframe and a bunch of printing terminals. when you logged in, the password was echoed. to obscure this, the mainframe would overprint with a sequence of xes and asterisks and other characters that put a lot of ink on the paper. some of the poor students hanging around in the terminal room were pretty good at seeing through the overprinting to extract the password.

      1. For the love of cod DON’T. Especially not if you’re simply being overly paranoid about trivial content. I’ve been making a number of public records requests recently and some of the local governments have peculiar sensitivities and definitions of “personal” information, despite the documents being completely legible if I was able to view them in person. Instead I get charged a quarter for the paper they use to print them out on some splotchy printer, and then receive a 2-bit scan of the result as a multi-page TIFF. They’ve complied with the letter of the request, but not the spirit, as the result is completely unreadable.

        1. If I were being paid by a 3-letter org I would delete and rewrite these sensitive messages prior to pixelation and public dissemination. You know they already decrypt far more difficult puzzles. False messages would be well planted in such a fashion….

  5. I once got a whole pile of “redacted” Word documents related to a bid we were working on for a company that was way too into its “proprietary” data.

    It turned out that the genius lawyers at the sending end had just generated all their copious black boxes by selecting the text and changing the background color to black.

    After I discovered that it took all of two seconds to unmask each “double-extra-super-secret” document.

    1. I thought the same, like new and small QR codes as characters and a high security private and public key to encode/decode the text with a zero knowledge proof method or something. That would be too heavy and intense, but if someone really wants that low level of security on printed or digital documents, they will have the resources to do so

    2. or even have the text pixelisation use a random lorem ipsum section (or other “random” word fillers, or even replace the letters with random numbers, possibly even have the pixelization choose a random font to use for the text to be pixelized) for the pixels.

      even with inserting random numbers in the pixelization you could still potentially un-pixelate the text and filter out the numbers.
      replacing the text with random noise will mostly guarantee that someone can’t extract data from a redacted section.

      and you could have the algorithm used to pixelate the text require 2 factor authentication to unredact the text (i.e. pull from an encrypted unredacted file) otherwise the document when you try to unredact the text will just show the random noise.

  6. Since we are ALL familiar with the pixelated picture of Abe Lincoln, and how easily it is recognized, it’s amazing that anyone, anywhere, would ever think that this would be a robust way to obscure text. Human neural nets are quite good at extracting data at extremely low signal-to-noise ratios.

    1. You might be surprised what can be recovered from even extremely blurred text. Check out last year’s Helsinki Deblur Challenge. At the highest difficulty levels you would hardly guess that the picture actually contains text if you weren’t told.

  7. Good thing I use the Wingdings font for all my important communication… I pixelate that…

    (Ok I don’t do this…but it would throw a wrench at hackers (after pixelation) and the people you intend to read it both)
