Text Compression Gets Weirdly Efficient With LLMs

August 27, 2023

It used to be that memory and storage space were so precious and so limited of a resource that handling nontrivial amounts of text was a serious problem. Text compression was a highly practical application of computing power.

Today it might be a solved problem, but that doesn’t mean it doesn’t attract new or unusual solutions. [Fabrice Bellard] released ts_zip which uses Large Language Models (LLM) to attain text compression ratios higher than any other tool can offer.

LLMs are the technology behind natural language AIs, and applying them in this way seems effective. The tradeoff? Unlike typical compression tools, the lossless decompression part isn’t exactly guaranteed when an LLM is involved. Lossy compression methods are in fact quite useful. JPEG compression, for example, is a good example of discarding data that isn’t readily perceived by humans to make a smaller file, but that isn’t usually applied to text. If you absolutely require lossless compression, [Fabrice] has that covered with NNCP, a neural-network powered lossless data compressor.

Do neural networks and LLMs sound far too serious and complicated for your text compression needs? As long as you don’t mind a mild amount of definitely noticeable data loss, check out [Greg Kennedy]’s Lossy Text Compression which simply, brilliantly, and amusingly uses a thesaurus instead of some fancy algorithms. Yep, it just swaps longer words for shorter ones. Perhaps not the best solution for every need, but between that and [Fabrice]’s brilliant work we’re confident there’s something for everyone who craves some novelty with their text compression.

[Photo by Matthew Henry from Burst]

66 thoughts on “Text Compression Gets Weirdly Efficient With LLMs”

YGDES says:

August 27, 2023 at 10:10 am

Sacré Fabrice ! Après les images, maintenant le texte ! Il a encore cassé Internet :-P

Report comment

Reply
1. Lapin1 says:
  
  August 29, 2023 at 12:24 am
  
  Et aussi qemu!
  
  Report comment
  
  Reply
Sweeney says:

August 27, 2023 at 10:41 am

The trade off here is in the size of the compressor/decompressor and the speed it works at. You need up to 8.5 GB of GPU memory (for both compression and decompression) and it runs at between 7 and 128KB/sec. I’m not sure of the circumstances where the extra cost outweighs the reduction in file size.

Report comment

Reply
1. Oskar says:
  
  August 27, 2023 at 2:54 pm
  
  Even if you can do this in 486 with 8MB ram, what kind of text can handle lossy ?
  You can not compress anything for work, study, most of communications… like what can you really do with lossy compression ? 0.00001 % of non bussiness related stuff. This thing should not be even a thing, just standard bloom filter can provide same lossy compression ratios as this, with orderS of magnitude lower requirements and it is old idea not used exactly because LOSSY text compression is useless (“technology”). It is essentially making 2fps GIFs from 1080p60 with ML and praise it for novelty…. it is just s….d
  
  Report comment
  
  Reply
  1. hartl says:
    
    August 28, 2023 at 11:53 am
    
    While lossy text compression is useless for storage (plain text takes orders of magnitude less space than audio or video so it isn’t even worth the energy spent), it might be useful for simulating common neurodegenerative diseases and in developing AI-based compensating devices. Age-related deficiencies will place a heavy load on our society in the coming decades.
    
    Report comment
    
    Reply
  2. Jad9 says:
    
    August 28, 2023 at 2:53 pm
    
    We could compress your comment
    
    Report comment
    
    Reply
    1. Filippo says:
      
      August 28, 2023 at 4:57 pm
      
      *mic drop*
      
      Report comment
      
      Reply
  3. GDR! says:
    
    April 1, 2025 at 12:31 am
    
    Computing is not only about meeting someone’s business requirements. It can be just for fun and exploration.
    
    Report comment
    
    Reply
2. Patrick Gerard says:
  
  August 28, 2023 at 8:16 am
  
  I actually think the use case is other LLMs.
  
  Chatbots tend to have very limited memories for special instructions.
  
  There’s a hardware and speed tradeoff but the result is LLMs that can have longer conversations without forgetting what they’ve been told and have more thorough special instructions (such as replicating the speaking style and biographical information of historical or fictional characters they’ve been asked to emulate).
  
  The net result is, say, a Plato who can know the complete writings of Plato and every significant piece of writing about Plato.
  
  Report comment
  
  Reply
  1. liam says:
    
    August 29, 2023 at 9:50 pm
    
    this…its like singing a song we all know but not using any consonants just vowels …it WONT sound right …but it WILL be communicatively transmitted understood and acted on with full recognition…to humans that’s novelty…ai’s do not create novel solutions to GET YOU to say they are novel ideas…they find efficiency and don’t notice (novelty at first occurrence) or complain that its different (recurring “novelty”) like humans do
    
    Report comment
    
    Reply
3. Ian says:
  
  August 31, 2023 at 8:14 am
  
  No…
  The tradeoff is LOSSY TEXT COMPRESSION.
  
  The actual use of the material needs to be taken into consideration when choosing a compression mehod.
  
  You know how you could save a bunch of space storing your security camera footage? Why not store it as a text prompt and generate the video back later?
  
  Human wearing shirt walks toward camera, crouches down, then walks left.
  
  Report comment
  
  Reply
k-ww says:

August 27, 2023 at 10:55 am

Is it me, but could this be used to disguise plagiarized text – letting it use the thesaurused output to keep AI algorithms for detected the original text?

Report comment

Reply
1. SayWhat? says:
  
  August 27, 2023 at 11:18 am
  
  The funky recovered unnatural language could tip off reader.
  
  Report comment
  
  Reply
  1. Davidp says:
    
    August 28, 2023 at 2:48 am
    
    I’m sure there are “simplify my text to grade level XX” utilities, though. I’ll bet a compression/decompression cycle would pretty good at laundering text.
    
    Report comment
    
    Reply
adrian says:

August 27, 2023 at 10:58 am

What form does the ‘loss’ take ? In images it’s typically loss of definition. But given LLMs proclivities, this might just hallucinate the expanded text. Not really what you want.

Report comment

Reply
1. toybuilder says:
  
  August 27, 2023 at 1:04 pm
  
  Have we not learned from the Xerox digital copier compression fiasco?
  https://news.ycombinator.com/item?id=29223815
  
  Report comment
  
  Reply
  1. Oskar says:
    
    August 27, 2023 at 2:58 pm
    
    But obviously you did not understood that it is on purpose now, so it is ok. And most importantly needing RTX4090 so you can open VIM is a loooong established trend already. (s) 🤣
    
    Report comment
    
    Reply
    1. charles says:
      
      August 27, 2023 at 11:47 pm
      
      We all know that *opening* VIM is not the problem.
      
      Report comment
      
      Reply
      1. Ian says:
        
        August 31, 2023 at 8:19 am
        
        Yeah…
        The realm problem wih vim is when someone posts a beginner Linux tutorial for something that includes any steps using vim.
        
        Report comment
2. The Commenter Formerly Known As Ren says:
  
  August 27, 2023 at 1:33 pm
  
  What if it drops common articles (a, an, the)?
  Or, in certain cases, dropped some vowels or consonants?
  The original content would probably be not lost.
  And if it dropped repeated phrases, such as those pronounced by a certain Vice-president, the message might even be clearer!
  B^)
  
  Report comment
  
  Reply
  1. Jon H says:
    
    August 27, 2023 at 1:54 pm
    
    It’d more likely retain the most common strings of characters and drop the less common ones.
    
    Report comment
    
    Reply
3. hartl says:
  
  August 27, 2023 at 1:55 pm
  
  “What form does the ‘loss’ take ?”
  It reduces Schiller and Goethe to newspaper apprentices.
  
  Report comment
  
  Reply
4. Isaac says:
  
  August 27, 2023 at 8:35 pm
  
  If it really does substitute shorter words for longer ones (as I understand the process), that would be a “loss of definition” for sure.
  
  Report comment
  
  Reply
M says:

August 27, 2023 at 11:26 am

This is asking for trouble.

LLMs achieve compression by compressing down the text based on the concepts it talks about. turns out there are only so many topics humans communicate about in their daily life. The “lossyness” isn’t in terms of slightly different pixel values or wrong characters, but in changing the phrasing or leaving out details.

It’s not hard to have a sentence like “Ane Berry saw the thief at the scene of the crime.”
Turn into “Ane Berry was seen at the scene of the crime.”

You see this frequently with LLM chatbots misinterpreting various aspects from the chat’s history.

Report comment

Reply
1. The Commenter Formerly Known As Ren says:
  
  August 27, 2023 at 1:35 pm
  
  Well, trusting chatbots for accuracy is the real problem.
  
  Report comment
  
  Reply
Thinkerer says:

August 27, 2023 at 11:27 am

The generation loss of a recursive LLM system should prove to be very interesting. Maybe we’ll finally get to those monkeys on typewriters…

Report comment

Reply
1. hartl says:
  
  August 27, 2023 at 1:59 pm
  
  I see the “bad translator” meme appearing again…
  
  Report comment
  
  Reply
Andrzej says:

August 27, 2023 at 11:52 am

So now the only part missing would be a LLM-based text “diff” mechanism. You first apply the lossy compression, then generate a diff from the original and store both. I’m just waving hands here, but I expect that LLM could be adapted to produce an efficient representation of such differences, since it can (or at least pretends to) take language semantics into account.

Report comment

Reply
yet another bruce says:

August 27, 2023 at 12:18 pm

Hutter Prize winner?

Report comment

Reply
1. Jon H says:
  
  August 27, 2023 at 1:55 pm
  
  More like Ig Nobel Prize territory.
  
  Report comment
  
  Reply
words for computers says:

August 27, 2023 at 1:20 pm

What about giving every word and it’s derivative a number and use a lookup table?

Report comment

Reply
1. Bastet says:
  
  August 28, 2023 at 12:44 am
  
  I thought thats what modern compression already does?
  
  But i think that idea can work with a twist. Have the library constant and come with the compression tool and just send the compressed file without a library, or one that is way smaller and doesn’t include the static one. I bet you could get a list of most used long words into a list not larger than 65535 words for every language. Tough, problem starts when you have a text that uses multiple languages.
  
  By the way, that was something i thought of for my retro RPG project, have a list of up to 127 elements of the most used longer words ready and use escape codes to insert them into texts. Why only 127? Because the seventh bit is used if one wants the first letter of the word in upper case.
  So “Sword” becomes \x1b\xff\x81, where the 0xff denotes that a compressed word follows and that we use word 1 with upper case first letter. Two bytes are two bytes if you only have 65535 of them. ;)
  
  Report comment
  
  Reply
paulvdh says:

August 27, 2023 at 1:50 pm

So, [Greg Kennedy] brought us one step closer to new speak by eliminating long words from the dictionary.

Report comment

Reply
1. The Commenter Formerly Known As Ren says:
  
  August 27, 2023 at 2:09 pm
  
  Double Plus Good!
  
  Report comment
  
  Reply
  1. The Commenter Formerly Known As Ren says:
    
    August 27, 2023 at 2:10 pm
    
    Or maybe it could use xkcd’s 1000 word library.
    
    Report comment
    
    Reply
    1. -jeffB says:
      
      August 27, 2023 at 9:34 pm
      
      You will not go to space today.
      
      Report comment
      
      Reply
BrightBlueJim says:

August 27, 2023 at 1:50 pm

What a terrible idea – lossy text compression. But unlike lossy image or audio compression, this will produce copies that mean something different from the original. I guess Mr. Ballard has never played “telephone”.

Report comment

Reply
Jon H says:

August 27, 2023 at 1:52 pm

“Lossy compression methods are in fact quite useful.”

Not for text they aren’t. I honestly don’t see what the point was in even attempting this, let alone posting the results on a web page as if it was interesting or useful.

Why not just run the text through a process that truncates every word to one character and deletes the spaces and punctuation then applies lossless compression to what’s left? That’d achieve lossy compression about as well.

Report comment

Reply
1. CRJEEA says:
  
  August 28, 2023 at 2:39 am
  
  It would be interesting to see an inverse to the latter mentioned, Lossy text compression, perhaps calling it the Tribbiani.
  I wonder how well it would do just to order the whole dictionary in order of common use, number all the words in base 36 and then use 7zip on the resulting list of numbers.
  
  Report comment
  
  Reply
Jon H says:

August 27, 2023 at 1:59 pm

Amazing that he never even addresses the condition the decompressed data is in.

If it doesn’t decompress into a useful state it isn’t compression, now is it. It’s just 100% loss.

Report comment

Reply
Andrew says:

August 27, 2023 at 2:32 pm

So it’s basically paraphrasing.

Report comment

Reply
1. Jon H says:
  
  August 27, 2023 at 2:46 pm
  
  Paraphrasing keeps the same meaning. This doesn’t care about meaning.
  
  Report comment
  
  Reply
  1. Andrew says:
    
    August 27, 2023 at 3:42 pm
    
    Well, it depends on the skill of who is paraphrasing.
    
    Report comment
    
    Reply
Hobo Lobo says:

August 27, 2023 at 3:04 pm

all I can say about this article and the comments is AAAAAAAAAAAAAAAAAGH

the article is fundamentally incorrect – ts_zip is not lossy. So all the comments griping about how lossy text compression is stupid may be right, but they’re also irrelevant. I understand that the linked page doesn’t explain the details of how it works either, but can we at least do some research before crapping all over something that’s actually quite clever?

The comment by [Andrzej] that suggests how to “fix” lossy models is insightful to a degree at least – it more or less describes how this *actually* works. ts_zip uses the LLM as a predictive model and further encodes the difference between what the LLM predicts and the actual text, so the original text can be perfectly reconstructed.

The key difference between ts_zip and NNCP is not lossy encoding, it’s that NNCP uses its own trained models based on transformer_xl and LSTM, while ts_zip is designed to use *any* large language model to compress data, but also requires the same one to decompress it.

Report comment

Reply
1. paulvdh says:
  
  August 27, 2023 at 4:46 pm
  
  (Almost?) Nobody in the comments here cares about that LLM thing. That other Lossy Greg thing is much more fun.
  
  Report comment
  
  Reply
2. FloofyKitteh says:
  
  August 27, 2023 at 9:11 pm
  
  It’s amazing that people that are Deeply Concerned with accuracy in text seem to be so alarmingly illiterate.
  
  Report comment
  
  Reply
  1. pelrun says:
    
    August 29, 2023 at 1:26 am
    
    You’d be forgiven for thinking communication was more than just the error-free transmission of text.
    
    Dammit Jim, I’m a human being, not a compiler!
    
    Report comment
    
    Reply
Luke Davis says:

August 27, 2023 at 4:48 pm

Plot twist: the compression algorithm is just Kevin from The Office.

Report comment

Reply
Joshua says:

August 27, 2023 at 6:44 pm

BASIC interpreters of the 70s/80s did it better.
They used Token format to reduce size (GW-BASIC used it, too).

Or, let’s take the ZX81 interpreter..
It stored commands like PRINT or INPUT in a shortened format.

Report comment

Reply
1. Kaz says:
  
  August 28, 2023 at 12:06 am
  
  BASIC interpreters used a very small dictionary of keywords, so it was easy to save space by tokenising it. Tokenising natural language, with a huge number of words, doesn’t give such easy savings.
  
  That’s where other compression techniques, such as Huffman, LZ, etc can come in. (Those are lossless examples, but this new approach is a lossy one, given the LLM.)
  
  Report comment
  
  Reply
  1. Kaz says:
    
    August 28, 2023 at 12:09 am
    
    Clarification: the LLM bit is lossy, but the partner algorithm predicts and corrects those mistakes, for an overall lossless algorithm. Or so the idea goes…
    
    Report comment
    
    Reply
  2. Bastet says:
    
    August 28, 2023 at 12:51 am
    
    Now if you did analyse some commonly used languages, like English, German, French or Chinese and make a list of 65535 most used words that are longer than four bytes, that would make a nice tokenization list. No clue about the compression ratio tough and if it is worth it outside some special use cases…
    
    Report comment
    
    Reply
NQ says:

August 27, 2023 at 8:00 pm

Interesting that the antique book photo is featuring 1 Chronicles 11.

Report comment

Reply
Fabrice Fan says:

August 27, 2023 at 9:20 pm

Good article HaD, but how can you write about Fabrice Bellard without mentioning he’s also the creator of FFmpeg, QEMU, TinyCC, and more (or perhaps all your readers know this already?). This guy deserves a Nobel prize or at least a Turing award.

Report comment

Reply
ziggurat29 says:

August 28, 2023 at 6:52 am

can Cliff Notes be considered lossy compression of text?

Report comment

Reply
1. BrightBlueJim says:
  
  August 28, 2023 at 9:18 am
  
  Yes! This was my concern when I read the article, that this was a sort of condensation rather than compression, with the inevitable side effect that only the principal points of the text are retained, with any subtlety and secondary meanings discarded as irrelevant. If that is the case, it could be fun to run poetry through it. Where by “fun” I mean painful.
  
  Report comment
  
  Reply
noapparentfunction says:

August 28, 2023 at 9:09 am

Why waste time say lot word when few word do trick?

Report comment

Reply
1. BrightBlueJim says:
  
  August 28, 2023 at 9:21 am
  
  Not many word when few work.
  
  Report comment
  
  Reply
  1. BrightBlueJim says:
    
    August 28, 2023 at 9:23 am
    
    Few word
    
    Report comment
    
    Reply
    1. Bastet says:
      
      August 28, 2023 at 9:23 am
      
      Few
      
      Report comment
      
      Reply
      1. Blue Footed Booby says:
        
        August 28, 2023 at 10:34 am
        
        ⠀
        
        Report comment
      2. Hobo Lobo says:
        
        August 28, 2023 at 10:36 am
        
        1
        
        Report comment
Greg A says:

August 28, 2023 at 11:48 am

this sort of innovation is coming for everything. it’s scary to me because once you get going, the ‘enhance’ function is inevitable. and the resulting image will look so good, you will be forgiven for forgetting that it made it up and you aren’t actually seeing anything that the original camera saw. and that’s in photos, where we know how to think about it…but in text! can you imagine an ‘enhance’ function for text? to zoom in or out, elucidate or summarize a bit of prose? having played with chatgpt a bit, it no longer seems far out.

anyways, anecdote…i was, for no good reason, trying to compress a large collection of historical chess games. and i came up with the idea of enumerating the possible moves at each position, so instead of 16 pieces * 8*8 coordinates (1024 combinations per move, 10 bits), i could encode typically on the order of a dozen (3.5 bits). i forget the typical number of bytes per game, but i came up with what i thought was a very impressively terse representation. i was so pleased with myself i started wondering how close to the ideal representation i was. and i realized, what if i packaged it with a chess engine, and ordered the possible moves by probability assessed by the chess engine! and with chess engines these days being not that different from LLMs… and that’s about when i lost interest in the project :)

just like compressing text by encoding the autocomplete choice (left / middle / right, 1.5 bits) instead of the word.

Report comment

Reply
1. pelrun says:
  
  August 29, 2023 at 1:19 am
  
  Don’t forget photographs have *never* been a factual representation of reality. There has been artifice and manipulation at every stage for as long as it has existed as a technology.
  
  I had a private joke with a colleague a couple of decades ago that is becoming possible far too quickly for my liking – a text engine that had two modes: “desemantication” which would distill a large block of text down to it’s core ideas, and “resemantication” that would do the reverse. We would be on the lookout for anything that was “resemanticated bullshit”… i.e. your general political speech or online diatribe that was clearly a tiny bit of garbage inflated into a full press release.
  
  Report comment
  
  Reply
aquahoodch says:

August 29, 2023 at 12:47 am

A famous author said he got paid by the word so why use meteoplos when you can just say city and get paid the same.

Report comment

Reply
Steven Clark says:

August 29, 2023 at 12:49 pm

What part of an effective ‘markov chain’ being useful for text compression is at all surprising?

Report comment

Reply

Hackaday

Text Compression Gets Weirdly Efficient With LLMs

66 thoughts on “Text Compression Gets Weirdly Efficient With LLMs”

Leave a ReplyCancel reply

Search

Never miss a hack

If you missed it

Dearest C++, Let Me Count The Ways I Love/Hate Thee

Personal Reflections On Immutable Linux

Crunching The News For Fun And Little Profit

The End Of The Hackintosh Is Upon Us

The Hackaday Summer Reading List: No AI Involvement, Guaranteed

Our Columns

Trickle Down: When Doing Something Silly Actually Makes Sense

Hackaday Podcast Episode 328: Benchies, Beanies, And Back To The Future

This Week In Security: Bitchat, CitrixBleed Part 2, Opossum, And TSAs

Ask Hackaday: Are You Wearing 3D Printed Shoes?

FLOSS Weekly Episode 840: End-of-10; Not Just Some Guy In A Van

66 thoughts on “Text Compression Gets Weirdly Efficient With LLMs”

Leave a ReplyCancel reply

Search

Never miss a hack

Subscribe

If you missed it

Our Columns