Text Compression Gets Weirdly Efficient With LLMs

It used to be that memory and storage were such precious and limited resources that handling nontrivial amounts of text was a serious problem. Text compression was a highly practical application of computing power.

Today it might be a solved problem, but that doesn’t mean it doesn’t attract new or unusual solutions. [Fabrice Bellard] released ts_zip, which uses Large Language Models (LLMs) to attain text compression ratios higher than any other tool can offer.

LLMs are the technology behind natural language AIs, and applying them in this way turns out to be effective. The tradeoff? Unlike with typical compression tools, the lossless decompression part isn’t exactly guaranteed when an LLM is involved. Lossy compression methods are in fact quite useful: JPEG, for example, discards data that isn’t readily perceived by humans to make a smaller file, but that approach isn’t usually applied to text. If you absolutely require lossless compression, [Fabrice] has that covered with NNCP, a neural-network powered lossless data compressor.

Do neural networks and LLMs sound far too serious and complicated for your text compression needs? As long as you don’t mind a mild amount of definitely noticeable data loss, check out [Greg Kennedy]’s Lossy Text Compression which simply, brilliantly, and amusingly uses a thesaurus instead of some fancy algorithms. Yep, it just swaps longer words for shorter ones. Perhaps not the best solution for every need, but between that and [Fabrice]’s brilliant work we’re confident there’s something for everyone who craves some novelty with their text compression.

[Photo by Matthew Henry from Burst]

65 thoughts on “Text Compression Gets Weirdly Efficient With LLMs”

  1. The trade-off here is in the size of the compressor/decompressor and the speed it works at. You need up to 8.5 GB of GPU memory (for both compression and decompression) and it runs at between 7 and 128 KB/sec. I’m not sure of the circumstances where the extra cost outweighs the reduction in file size.

    1. Even if you could do this on a 486 with 8 MB of RAM, what kind of text can handle lossy compression?
      You can’t compress anything for work, study, or most communications… so what can you really do with lossy compression? Maybe 0.00001% of non-business-related stuff. This shouldn’t even be a thing; a standard Bloom filter can provide the same lossy compression ratios with orderS of magnitude lower requirements, and that’s an old idea that isn’t used precisely because LOSSY text compression is useless (“technology”). It’s essentially making 2 fps GIFs from 1080p60 with ML and praising it for novelty… it is just s….d

      1. While lossy text compression is useless for storage (plain text takes orders of magnitude less space than audio or video so it isn’t even worth the energy spent), it might be useful for simulating common neurodegenerative diseases and in developing AI-based compensating devices. Age-related deficiencies will place a heavy load on our society in the coming decades.

    2. I actually think the use case is other LLMs.

      Chatbots tend to have very limited memories for special instructions.

      There’s a hardware and speed tradeoff but the result is LLMs that can have longer conversations without forgetting what they’ve been told and have more thorough special instructions (such as replicating the speaking style and biographical information of historical or fictional characters they’ve been asked to emulate).

      The net result is, say, a Plato who can know the complete writings of Plato and every significant piece of writing about Plato.

      1. This… it’s like singing a song we all know but not using any consonants, just vowels… it WON’T sound right… but it WILL be communicatively transmitted, understood, and acted on with full recognition. To humans that’s novelty. AIs do not create novel solutions to GET YOU to say they are novel ideas… they find efficiency and don’t notice (novelty at first occurrence) or complain that it’s different (recurring “novelty”) like humans do.

    3. No…
      The tradeoff is LOSSY TEXT COMPRESSION.

      The actual use of the material needs to be taken into consideration when choosing a compression method.

      You know how you could save a bunch of space storing your security camera footage? Why not store it as a text prompt and generate the video back later?

      Human wearing shirt walks toward camera, crouches down, then walks left.

  2. What form does the ‘loss’ take? In images it’s typically loss of definition. But given LLMs’ proclivities, this might just hallucinate the expanded text. Not really what you want.

      1. But obviously you did not understand that it is on purpose now, so it is OK. And most importantly, needing an RTX 4090 just so you can open VIM is a loooong established trend already. (s) 🤣

    1. What if it drops common articles (a, an, the)?
      Or, in certain cases, dropped some vowels or consonants?
      The original content would probably not be lost.
      And if it dropped repeated phrases, such as those pronounced by a certain Vice-president, the message might even be clearer!
      B^)

  3. This is asking for trouble.

    LLMs achieve compression by compressing the text down based on the concepts it talks about. Turns out there are only so many topics humans communicate about in their daily life. The “lossiness” isn’t in terms of slightly different pixel values or wrong characters, but in changing the phrasing or leaving out details.

    It’s not hard to have a sentence like “Ane Berry saw the thief at the scene of the crime.”
    Turn into “Ane Berry was seen at the scene of the crime.”

    You see this frequently with LLM chatbots misinterpreting various aspects from the chat’s history.

  4. So now the only part missing would be a LLM-based text “diff” mechanism. You first apply the lossy compression, then generate a diff from the original and store both. I’m just waving hands here, but I expect that LLM could be adapted to produce an efficient representation of such differences, since it can (or at least pretends to) take language semantics into account.
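
    Roughly the shape I’m hand-waving at, as a Python sketch – lossy_shrink() is just a made-up stand-in for the LLM pass, and the difflib delta is nowhere near an “efficient representation” (it carries the original lines verbatim), but it shows the summary-plus-diff round trip coming back lossless:

    import difflib

    def lossy_shrink(text: str) -> str:
        # Hypothetical stand-in for the LLM's lossy pass; here it just drops
        # words of three letters or fewer so the example runs on its own.
        return "\n".join(
            " ".join(w for w in line.split() if len(w) > 3)
            for line in text.splitlines()
        )

    def pack(original: str):
        lossy = lossy_shrink(original)
        # The delta records how to get from the lossy text back to the original.
        delta = list(difflib.ndiff(lossy.splitlines(), original.splitlines()))
        return lossy, delta

    def unpack(delta) -> str:
        # difflib.restore(delta, 2) rebuilds the second sequence - the original - exactly.
        return "\n".join(difflib.restore(delta, 2))

    text = "Ane Berry saw the thief at the scene of the crime.\nThe thief ran off."
    lossy, delta = pack(text)
    assert unpack(delta) == text   # lossless, as long as you keep the diff around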

    1. I thought that’s what modern compression already does?

      But I think that idea can work with a twist. Keep the library constant and ship it with the compression tool, and just send the compressed file without a library, or with one that is way smaller and doesn’t include the static one. I bet you could get the most used long words into a list not larger than 65535 words for every language. Though, the problem starts when you have a text that uses multiple languages.

      By the way, that was something I thought of for my retro RPG project: have a list of up to 127 of the most used longer words ready and use escape codes to insert them into texts. Why only 127? Because the seventh bit is used if one wants the first letter of the word in upper case.
      So “Sword” becomes \x1b\xff\x81, where the 0xff denotes that a compressed word follows and the 0x81 means word 1 with an upper-case first letter. Two bytes are two bytes if you only have 65535 of them. ;)
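
      Something like this, in Python for the sake of an example (the two-byte prefix and the tiny word list are just placeholders):

      ESCAPE = b"\x1b\xff"                       # marks "a compressed word follows"
      WORDS = ["", "sword", "shield", "potion"]  # index 0 unused; up to 127 entries

      def encode(text: str) -> bytes:
          out = bytearray()
          for word in text.split(" "):
              idx = WORDS.index(word.lower()) if word.lower() in WORDS else 0
              if 0 < idx <= 127:
                  if word[0].isupper():
                      idx |= 0x80               # high bit: capitalise the first letter
                  out += ESCAPE + bytes([idx])
              else:
                  out += word.encode("latin-1")  # plain bytes, retro style
              out += b" "
          return bytes(out[:-1])

      def decode(data: bytes) -> str:
          out, i = [], 0
          while i < len(data):
              if data[i:i + 2] == ESCAPE:
                  idx = data[i + 2]
                  word = WORDS[idx & 0x7F]
                  out.append(word.capitalize() if idx & 0x80 else word)
                  i += 3
              else:
                  out.append(chr(data[i]))
                  i += 1
          return "".join(out)

      assert decode(encode("Take the Sword and the shield")) == "Take the Sword and the shield"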

  5. What a terrible idea – lossy text compression. But unlike lossy image or audio compression, this will produce copies that mean something different from the original. I guess Mr. Bellard has never played “telephone”.

  6. “Lossy compression methods are in fact quite useful.”

    Not for text they aren’t. I honestly don’t see what the point was in even attempting this, let alone posting the results on a web page as if it was interesting or useful.

    Why not just run the text through a process that truncates every word to one character and deletes the spaces and punctuation then applies lossless compression to what’s left? That’d achieve lossy compression about as well.

    1. It would be interesting to see an inverse to the latter-mentioned lossy text compression, perhaps calling it the Tribbiani.
      I wonder how well it would do just to order the whole dictionary in order of common use, number all the words in base 36 and then use 7zip on the resulting list of numbers.
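
      A quick and dirty Python version of that, with lzma standing in for 7zip (case and punctuation handling waved away):

      import lzma
      from collections import Counter

      DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

      def to_base36(n: int) -> str:
          s = ""
          while True:
              n, r = divmod(n, 36)
              s = DIGITS[r] + s
              if n == 0:
                  return s

      def pack(text: str) -> bytes:
          words = text.lower().split()
          ranked = [w for w, _ in Counter(words).most_common()]   # most common word gets index 0
          body = " ".join(to_base36(ranked.index(w)) for w in words)
          # NB: this sketch stores the ranked list inside the archive; a fixed
          # dictionary shipped with the tool would be closer to the idea above.
          return lzma.compress(("\n".join(ranked) + "\x00" + body).encode())

      def unpack(blob: bytes) -> str:
          ranked_part, body = lzma.decompress(blob).decode().split("\x00")
          ranked = ranked_part.split("\n")
          return " ".join(ranked[int(tok, 36)] for tok in body.split())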

  7. Amazing that he never even addresses the condition the decompressed data is in.

    If it doesn’t decompress into a useful state it isn’t compression, now is it? It’s just 100% loss.

  8. all I can say about this article and the comments is AAAAAAAAAAAAAAAAAGH

    The article is fundamentally incorrect – ts_zip is not lossy. So all the comments griping about how lossy text compression is stupid may be right, but they’re also irrelevant. I understand that the linked page doesn’t explain the details of how it works either, but can we at least do some research before crapping all over something that’s actually quite clever?

    The comment by [Andrzej] that suggests how to “fix” lossy models is insightful to a degree at least – it more or less describes how this *actually* works. ts_zip uses the LLM as a predictive model and further encodes the difference between what the LLM predicts and the actual text, so the original text can be perfectly reconstructed.

    The key difference between ts_zip and NNCP is not lossy encoding, it’s that NNCP uses its own trained models based on transformer_xl and LSTM, while ts_zip is designed to use *any* large language model to compress data, but also requires the same one to decompress it.
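
    For anyone who wants to see the shape of that idea in code, here’s a toy Python illustration: a tiny adaptive character model stands in for the LLM, the encoder stores only the rank of each actual symbol in the model’s prediction, and zlib stands in for a real arithmetic coder. None of this is ts_zip’s code, it just shows predict-then-encode-the-difference being perfectly reversible:

    import zlib

    def ranking(model, prev):
        counts = model.get(prev, {})
        # Bytes seen after `prev`, most frequent first, then the rest of the byte range.
        seen = sorted(counts, key=lambda b: -counts[b])
        return seen + [b for b in range(256) if b not in counts]

    def compress(data: bytes) -> bytes:
        model, prev, ranks = {}, 0, bytearray()
        for b in data:
            order = ranking(model, prev)
            ranks.append(order.index(b))        # good predictions -> small ranks
            model.setdefault(prev, {}).setdefault(b, 0)
            model[prev][b] += 1
            prev = b
        return zlib.compress(bytes(ranks), 9)   # stand-in for an arithmetic coder

    def decompress(blob: bytes) -> bytes:
        model, prev, out = {}, 0, bytearray()
        for r in zlib.decompress(blob):
            order = ranking(model, prev)
            b = order[r]                        # same model, same ranking, same byte
            out.append(b)
            model.setdefault(prev, {}).setdefault(b, 0)
            model[prev][b] += 1
            prev = b
        return bytes(out)

    text = b"the theremin theory thereof " * 20
    assert decompress(compress(text)) == text   # perfectly reconstructed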

  9. BASIC interpreters of the 70s/80s did it better.
    They used a token format to reduce size (GW-BASIC used it, too).

    Or, let’s take the ZX81 interpreter…
    It stored commands like PRINT or INPUT in a shortened format.

    1. BASIC interpreters used a very small dictionary of keywords, so it was easy to save space by tokenising it. Tokenising natural language, with a huge number of words, doesn’t give such easy savings.

      That’s where other compression techniques, such as Huffman, LZ, etc can come in. (Those are lossless examples, but this new approach is a lossy one, given the LLM.)

      1. Now if you analysed some commonly used languages, like English, German, French or Chinese, and made a list of the 65535 most used words that are longer than four bytes, that would make a nice tokenization list. No clue about the compression ratio though, and whether it is worth it outside some special use cases…

  10. Good article HaD, but how can you write about Fabrice Bellard without mentioning he’s also the creator of FFmpeg, QEMU, TinyCC, and more (or perhaps all your readers know this already?). This guy deserves a Nobel prize or at least a Turing award.

    1. Yes! This was my concern when I read the article, that this was a sort of condensation rather than compression, with the inevitable side effect that only the principal points of the text are retained, with any subtlety and secondary meanings discarded as irrelevant. If that is the case, it could be fun to run poetry through it. Where by “fun” I mean painful.

  11. this sort of innovation is coming for everything. it’s scary to me because once you get going, the ‘enhance’ function is inevitable. and the resulting image will look so good, you will be forgiven for forgetting that it made it up and you aren’t actually seeing anything that the original camera saw. and that’s in photos, where we know how to think about it…but in text! can you imagine an ‘enhance’ function for text? to zoom in or out, elucidate or summarize a bit of prose? having played with chatgpt a bit, it no longer seems far out.

    anyways, anecdote…i was, for no good reason, trying to compress a large collection of historical chess games. and i came up with the idea of enumerating the possible moves at each position, so instead of 16 pieces * 8*8 coordinates (1024 combinations per move, 10 bits), i could encode typically on the order of a dozen (3.5 bits). i forget the typical number of bytes per game, but i came up with what i thought was a very impressively terse representation. i was so pleased with myself i started wondering how close to the ideal representation i was. and i realized, what if i packaged it with a chess engine, and ordered the possible moves by probability assessed by the chess engine! and with chess engines these days being not that different from LLMs… and that’s about when i lost interest in the project :)

    just like compressing text by encoding the autocomplete choice (left / middle / right, 1.5 bits) instead of the word.
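
    For the curious, the move-index trick looks something like this with the python-chess library (just the enumeration part – ordering the list by an engine’s evaluation instead of alphabetically is where it starts to resemble the LLM approach):

    import chess

    def encode_game(uci_moves):
        board, indices = chess.Board(), []
        for uci in uci_moves:
            # Canonical ordering of the legal moves in this position.
            legal = sorted(board.legal_moves, key=lambda m: m.uci())
            move = chess.Move.from_uci(uci)
            indices.append(legal.index(move))   # usually a few dozen options, not 1024
            board.push(move)
        return indices

    def decode_game(indices):
        board, moves = chess.Board(), []
        for i in indices:
            legal = sorted(board.legal_moves, key=lambda m: m.uci())
            board.push(legal[i])
            moves.append(legal[i].uci())
        return moves

    game = ["e2e4", "e7e5", "g1f3", "b8c6"]
    assert decode_game(encode_game(game)) == game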

    1. Don’t forget photographs have *never* been a factual representation of reality. There has been artifice and manipulation at every stage for as long as it has existed as a technology.

      I had a private joke with a colleague a couple of decades ago that is becoming possible far too quickly for my liking – a text engine that had two modes: “desemantication”, which would distill a large block of text down to its core ideas, and “resemantication”, which would do the reverse. We would be on the lookout for anything that was “resemanticated bullshit”… i.e. your general political speech or online diatribe that was clearly a tiny bit of garbage inflated into a full press release.
