AI Can Now Compress Text

There are many claims in the air about the capabilities of AI systems, as the technology continues to ascend the dizzy heights of the hype cycle. Some of them are true, others stretch definitions a little, while yet more cross the line into the definitely bogus. [J] has one that is backed up by real code though, a compression scheme for text using an AI, and while there may be limitations in its approach, it demonstrates an interesting feature of large language models.

The compression works by assuming that for a sufficiently large model, it’s likely that many source texts will exist somewhere in the training. Using llama.cpp it’s possible to extract the tokenization information of a piece of text contained in its training data and store that as the compressed output. The decompressor can then use that tokenization data as a series of keys to reassemble the original from its training. We’re not AI experts but we are guessing that a source text which has little in common with any training text would fare badly, and we expect that the same model would have to be used on both compression and decompression. It remains a worthy technique though, and no doubt because it has AI pixie dust, somewhere there’s a hype-blinded venture capitalist who would pay millions for it. What a world we live in!

Oddly this isn’t the first time we’ve looked at AI text compression.

42 thoughts on “AI Can Now Compress Text

  1. so big surprise, you can compress text more than using plain usual compression using AI, assuming you have some Gb free for a model that most likely needs to exist on both end? am I missing something there? because I see like no benefit in that…

    1. This is my takeaway as well. Having the model available to the receiving party is technically like sending the content ahead of time; a rudimentary dictionary lookup table is a rough approximation. And yes, the model is gigabytes in size which dwarfs the content size of a lot of use cases.

      What I think is novel here is the ratio of compression possible. The author muses about using this for images which might be a more compelling use.

    2. Yes, see the Hutter Prize.
      There’s some mathematicians who study information theory who believe that compressing data is the same thing as understanding it.
      So compressing english text (as in the hutter prize) is the most difficult ai challenge. Whatever program can loselessly compress wikipedia best, is the one that understands it the best.

      The OP though doesn’t have a good compression ratio by the Hutter Prize metrics.

      1. It’s actually quite easy.
        When compressing the text, you run it through the decompression algo and see if the input and output match before sending it to the destination.

        1. what he/she meant is that you cannot possibly guarantee ahead of time that any text will be compressed losslessly. You can test on big samples, but you cannot tell that a particular input out of the infinite that are possible is not going to trigger a bad behavior. It’s like proving there are no blue squirrels: you can easily tell for each individual you meet, but you cannot generalize.

          1. That’s one interpretation. Another is “having received text so compressed, you can’t know whether it’s the original text”. If the sender was lazy and didn’t do that check, there’s no way to tell. And if the compressed text is corrupted, it might decompress to sensible but wrong text.

          2. Or, they could just include a hash/checksum of the original (pre compression) text in the final message.

            Also, it’s much more likely that the algorithm would be set up to always give you the right result but have edge cases that don’t compress well (making them bigger than the original).

  2. It’s cool to see people experimenting, but this is a terrible way to compress something with a ready-made LLM. The problem was solved ages ago, and newer developments like LLMs don’t really change the fundamental aspects of compression.

    If you do it right, you take a predictor (in this case an ready-made LLM) and slap arithmetic coding (or something equivalent) on the outputs. Voila, you have near-optimal compression, and you can even use the probability for every token, not just the chosen one.

    1. Yes, its a bit like assigning every word in the Oxford English dictionary an integer, then converting each word in text to an integer. There has always been a correlation between compression size and memory size. Not sure really this adds a lot since you will need to access the LLM to decompress, unlike things like zip where it is self contained.

      1. That would probably work in English, but for example Finnish language would generate about 10x more integers, and correspondingly larger storage size, because of so many grammatical cases:

        home = koti
        at home = kotona
        to home = kotiin
        from home = kodista

        etc, with a total of 15 different cases.

        1. “You can give every word in a dictionary an index. That’s about 18 bits English.”

          so say 20 bits for Finnish

          so say it’ll be 24 bits all the Chinese characters. (about 3,500 standard characters)

        2. The LLM doesn’t work using entire words. It uses shorter pieces, such as syllables. Most likely “kot” would be one entry, since it appears in 75% of those words.

          Since the case endings are also just syllables, and Finnish lacks prepositions like “to”, “from”, it would likely not use any more information to encode.

  3. Train the model on github and on a sample of feeding it a ton of files to compress with all the various open source compression algorithms. (open source because step a)

  4. This feels like storing your music and videos and the like as hashes which you recover by looking them up on the Internet and then streaming them. It works really well until someone stops hosting the last copy of the hashed file.

  5. “Would it practical to train a model for the purpose of compression?”
    I don’t know if it is practical, but it’s certainly possible and not difficult considering how LLMs work.
    LLMs work by basically predicting words based on statistics. So I think you could make something like this:
    -ignoring punctuation and capitalization for simplicity
    -You can give every word in a dictionary an index. That’s about 18 bits English. You might use a variable width so common words take less bits than less common words. Sorting words by frequency instead of by alphabet.
    -you can also spell out words not found in the dictionary, but for simplicity sake let’s leave this out. Only words found in the dictionary can be used
    -The LLM guesses a top 2^N-1 of the possible next word in the sentence (it should be very good at this). for instance a top 3 for N=2
    -you use N-bits to index the next word, you use a value of all ones to indicate the word is not in the list of guesses and then you use the 18-bit index (or fewer bits in case it is a common word)
    -so in case of a correct guess the word takes up 2 bits, and in case of an incorrect guess it takes up 20 bits
    -compression and decompression, of course, have to be done with the same model
    -The dictionary and the AI model will of course take up a lot of space, but depending on the amount of text being compressed that can save a lot of data.
    -compression and decompression would take a lot of time, but if you split the text in blocks you can do compression and decompression in parallel instead of sequentially and then it would be less slow. But you still need to run the model for every single word.

    Correct me if I’m wrong

  6. It’s about the same what many backup applications do, chunks of data get a hash, that hash is stored and compared to other chunks and just a reference to that hash is stored for the identical duplicate chunks. Unique data is stored.

    Ie if you backup multiple Windows PCs, you can imagine that most of the Windows system files and applications are the same so these only have to be stored once. This can save a lot of storage space on a backup server that stores 100s of client PC backups for example.

    The difference is AI processing power and dictionary size (LLM) is way way higher. But it can compress potentially more (and save storage space) if the data to be compressed is a much larger than the LLM.

    As with many things cloud, the risk of your dictionary disappearing or corruption is real (company goes bankrupt, gets hacked, is tampered with etc), and you probably don’t want to base your backup strategy on that.

    1. It isn’t. Technically, it’s just a very complicated way to build your huffman tree. It might compress some thing better then others, just like different huffman trees.

      Add the overhead of needing to know what tree you used (the LLM training set) and all your compression is quickly lost.

  7. If you’re going to use a database anyway, you can just scan a load of books and compress any book in that library to its ISBN, without all of the unpredictability that AI entails.

  8. LLMs provide a probabilistic model of language and this allows you to compress language using arithmetic encoding. The average number of bits per token will be the expected value of -log(probability of the next token)/log(2), which is precisely the quantity that the training of LLMs is trying to minimize.

    None of this is new or surprising. It has been known for a long time that compression and understanding are mathematically equivalent. Look up Marcus Hutter and the Hutter prize.

    If you don’t understand AI or compression, educate yourself on these subjects before you write about them.

    1. Let’s see…

      >”predicting which characters are most likely to occur next in a text sequence requires vast real-world knowledge. ”

      True, but this still doesn’t solve the Chinese Room Problem: the application of formal rules on information doesn’t equal intelligence. A plain answering machine can fake its way through the Turing Test by being complex enough to exhaust the tester.

      >”Models like ChatGPT are not eligible for the Hutter Prize”

      Yep, because of special pleading. That which breaks the argument is excluded from the prize by applying arbitrary limitations.

      It may be that AI and text compression are equal and lead to the shortest possible code, but at the same time there is also a shortest possible non-intelligent version of the code and when you find that, you can’t call it intelligent based on being the shortest you’ve found yet, because you don’t know whether there’s still a shorter one nor whether that is actually intelligent – that would be just assuming the premise.

      1. More to the point: an intelligent agent may solve the problem in an optimal way by the application of formal rules, if such a solution is the optimal solution, but the reverse doesn’t apply: the existence of such a solution doesn’t prove that intelligence equals the application of formal rules (i.e. an optimal computer program).

      2. What seems obvious to me is that limiting the prize to the compression of a particular text will always have a more optimal “dumb” solution particular to that piece of text. That is the limit of the Turing Test being applied to the case, and the program only needs to beat the test. The Turing Test has always been flawed as a test of intelligence in this respect.

        My thesis is that for any limited test, there exists a non-intelligent algorithm that performs better than an intelligent agent would, since a general intelligence – even a limited one – needs to contain mechanisms to deal with more than just the particulars of that test. Therefore it cannot be the best or most efficient solution to that particular problem and therefore the Prize leads you down the wrong rabbit hole.

        More generally, the solution to the test isn’t intelligence – it’s just the answer generated by an intelligent or non-intelligent agent in response to being probed with a question. The way in which you find solutions is intelligence, not the solutions themselves. Finding the optimal compression algorithm doesn’t give you an intelligent agent – it’s YOU that is acting as the intelligent agent in coming up with the algorithms to find the optimal solution.

  9. huffman still is better and best optimal than ai
    dictionary is nice if we have real BIG text and very SPECIFIC text to compress.

    sorry, today arithmetic compression is still better

  10. Actually, this is exactly the danger that generative AI faces regarding use of copyright works. Stable Diffusion and most LLMs are indistinguishable from lossy compression at worst or simply inefficient compression at best. If you can prompt a generative AI to emit an unlicensed work currently under copyright, then the model violates copyright law and since the training data would have been intentionally gathered (and perhaps classified) the company/individual who made the AI will be wide open for legal action under copyright law.

Leave a Reply

Please be kind and respectful to help make the comments section excellent. (Comment Policy)

This site uses Akismet to reduce spam. Learn how your comment data is processed.