It used to be that memory and storage space were such precious and limited resources that handling nontrivial amounts of text was a serious problem. Text compression was a highly practical application of computing power.
Today it might be a solved problem, but that doesn’t mean it doesn’t attract new or unusual solutions. [Fabrice Bellard] released ts_zip, which uses Large Language Models (LLMs) to attain text compression ratios higher than any other tool can offer.
LLMs are the technology behind natural language AIs, and applying them in this way seems effective. The tradeoff? Unlike typical compression tools, lossless decompression isn’t exactly guaranteed when an LLM is involved. Lossy compression methods are in fact quite useful. JPEG, for example, discards data that isn’t readily perceived by humans to make a smaller file, but that approach isn’t usually applied to text. If you absolutely require lossless compression, [Fabrice] has that covered with NNCP, a neural-network powered lossless data compressor.
Do neural networks and LLMs sound far too serious and complicated for your text compression needs? As long as you don’t mind a mild amount of definitely noticeable data loss, check out [Greg Kennedy]’s Lossy Text Compression, which simply, brilliantly, and amusingly uses a thesaurus instead of some fancy algorithms. Yep, it just swaps longer words for shorter ones. Perhaps not the best solution for every need, but between that and [Fabrice]’s impressive work we’re confident there’s something for everyone who craves some novelty with their text compression.
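The approach is about as simple as it sounds. As a rough illustration (not [Greg]’s actual code, and with a toy synonym table standing in for a real thesaurus), the whole trick fits in a few lines of Python:

# Toy sketch of thesaurus-based lossy compression: swap each word for a
# shorter synonym when one exists. The synonym table here is invented for
# illustration; the real tool uses a full thesaurus.
THESAURUS = {
    "purchase": "buy",
    "automobile": "car",
    "approximately": "about",
    "utilize": "use",
}

def lossy_compress(text):
    out = []
    for word in text.split():
        swap = THESAURUS.get(word.lower(), word)
        out.append(swap if len(swap) < len(word) else word)
    return " ".join(out)

print(lossy_compress("We will purchase the automobile at approximately noon"))
# -> "We will buy the car at about noon"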
[Photo by Matthew Henry from Burst]
That Fabrice! After images, now text! He’s broken the Internet again :-P
And qemu too!
The trade-off here is in the size of the compressor/decompressor and the speed it works at. You need up to 8.5 GB of GPU memory (for both compression and decompression) and it runs at between 7 and 128 KB/s. I’m not sure of the circumstances where the reduction in file size outweighs the extra cost.
Even if you could do this on a 486 with 8 MB of RAM, what kind of text can tolerate being lossy?
You can’t compress anything for work, study, or most communications… what can you really do with lossy compression? Maybe 0.00001% of non-business stuff. This shouldn’t even be a thing; a standard Bloom filter can provide the same lossy compression ratios as this, with orders of magnitude lower requirements, and it’s an old idea not used exactly because LOSSY text compression is a useless “technology”. It is essentially making 2 fps GIFs from 1080p60 with ML and praising it for novelty…. it is just s….d
While lossy text compression is useless for storage (plain text takes orders of magnitude less space than audio or video, so it isn’t even worth the energy spent), it might be useful for simulating common neurodegenerative diseases and for developing AI-based compensating devices. Age-related deficiencies will place a heavy load on our society in the coming decades.
We could compress your comment
*mic drop*
I actually think the use case is other LLMs.
Chatbots tend to have very limited memories for special instructions.
There’s a hardware and speed tradeoff but the result is LLMs that can have longer conversations without forgetting what they’ve been told and have more thorough special instructions (such as replicating the speaking style and biographical information of historical or fictional characters they’ve been asked to emulate).
The net result is, say, a Plato who can know the complete writings of Plato and every significant piece of writing about Plato.
This… it’s like singing a song we all know but not using any consonants, just vowels… it WON’T sound right… but it WILL be transmitted, understood, and acted on with full recognition… to humans that’s novelty. AIs don’t create novel solutions to GET YOU to say they are novel ideas… they find efficiency and don’t notice (novelty at first occurrence) or complain that it’s different (recurring “novelty”) like humans do.
No…
The tradeoff is LOSSY TEXT COMPRESSION.
The actual use of the material needs to be taken into consideration when choosing a compression method.
You know how you could save a bunch of space storing your security camera footage? Why not store it as a text prompt and generate the video back later?
Human wearing shirt walks toward camera, crouches down, then walks left.
Is it just me, or could this be used to disguise plagiarized text – using the thesaurused output to keep AI algorithms from detecting the original text?
The funky, unnatural recovered language could tip off the reader.
I’m sure there are “simplify my text to grade level XX” utilities, though. I’ll bet a compression/decompression cycle would be pretty good at laundering text.
What form does the ‘loss’ take? In images it’s typically loss of definition. But given LLMs’ proclivities, this might just hallucinate the expanded text. Not really what you want.
Have we not learned from the Xerox digital copier compression fiasco?
https://news.ycombinator.com/item?id=29223815
But obviously you did not understand that it is on purpose now, so it is OK. And most importantly, needing an RTX 4090 just so you can open VIM is a loooong established trend already. (s) 🤣
We all know that *opening* VIM is not the problem.
Yeah…
The real problem with vim is when someone posts a beginner Linux tutorial for something that includes any steps using vim.
What if it drops common articles (a, an, the)?
Or, in certain cases, drops some vowels or consonants?
The original content would probably not be lost.
And if it dropped repeated phrases, such as those pronounced by a certain Vice-president, the message might even be clearer!
B^)
It’d more likely retain the most common strings of characters and drop the less common ones.
“What form does the ‘loss’ take ?”
It reduces Schiller and Goethe to newspaper apprentices.
If it really does substitute shorter words for longer ones (as I understand the process), that would be a “loss of definition” for sure.
This is asking for trouble.
LLMs achieve compression by boiling the text down to the concepts it talks about. It turns out there are only so many topics humans communicate about in their daily life. The “lossiness” isn’t in terms of slightly different pixel values or wrong characters, but in changed phrasing or left-out details.
It’s not hard to have a sentence like “Ane Berry saw the thief at the scene of the crime.”
turn into “Ane Berry was seen at the scene of the crime.”
You see this frequently with LLM chatbots misinterpreting various aspects from the chat’s history.
Well, trusting chatbots for accuracy is the real problem.
The generation loss of a recursive LLM system should prove to be very interesting. Maybe we’ll finally get to those monkeys on typewriters…
I see the “bad translator” meme appearing again…
So now the only part missing would be an LLM-based text “diff” mechanism. You first apply the lossy compression, then generate a diff from the original and store both. I’m just waving hands here, but I expect that an LLM could be adapted to produce an efficient representation of such differences, since it can (or at least pretends to) take language semantics into account.
Hutter Prize winner?
More like Ig Nobel Prize territory.
What about giving every word and its derivatives a number and using a lookup table?
I thought that’s what modern compression already does?
But I think that idea can work with a twist. Keep the library constant and ship it with the compression tool, then just send the compressed file without a library, or with one that is way smaller and doesn’t include the static one. I bet you could get the most-used long words into a list no larger than 65535 words for every language. Though the problem starts when you have a text that uses multiple languages.
By the way, that was something I thought of for my retro RPG project: keep a list of up to 127 of the most-used longer words ready and use escape codes to insert them into texts. Why only 127? Because the top bit is used if one wants the first letter of the word in upper case.
So “Sword” becomes \x1b\xff\x81, where the 0xff denotes that a compressed word follows and that we use word 1 with upper case first letter. Two bytes are two bytes if you only have 65535 of them. ;)
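In Python the scheme above might look something like this (a quick sketch, with a made-up word table):

# Escape-code word table: 0x1b 0xff marks a dictionary word, the low 7 bits
# pick the word (1..127), and the top bit asks for an upper-case first letter.
ESC, MARK = b"\x1b", b"\xff"
WORDS = ["sword", "shield", "potion"]          # up to 127 entries, word 1 first

def encode_word(word):
    idx = WORDS.index(word.lower()) + 1        # word numbers start at 1
    if word[0].isupper():
        idx |= 0x80                            # top bit = capitalize
    return ESC + MARK + bytes([idx])

def decode_word(token):
    idx = token[2]
    word = WORDS[(idx & 0x7F) - 1]
    return word.capitalize() if idx & 0x80 else word

print(encode_word("Sword"))                    # b'\x1b\xff\x81'
print(decode_word(b"\x1b\xff\x81"))            # Sword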
So, [Greg Kennedy] brought us one step closer to Newspeak by eliminating long words from the dictionary.
Double Plus Good!
Or maybe it could use xkcd’s 1000 word library.
You will not go to space today.
What a terrible idea – lossy text compression. But unlike lossy image or audio compression, this will produce copies that mean something different from the original. I guess Mr. Bellard has never played “telephone”.
“Lossy compression methods are in fact quite useful.”
Not for text they aren’t. I honestly don’t see what the point was in even attempting this, let alone posting the results on a web page as if it was interesting or useful.
Why not just run the text through a process that truncates every word to one character and deletes the spaces and punctuation then applies lossless compression to what’s left? That’d achieve lossy compression about as well.
It would be interesting to see an inverse of the latter-mentioned lossy text compression; perhaps call it the Tribbiani.
I wonder how well it would do to just sort the whole dictionary by frequency of use, number all the words in base 36, and then use 7zip on the resulting list of numbers.
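A quick way to test that (sketch only, with zlib standing in for 7zip, a hypothetical corpus.txt as input, and the ranking built from the text itself rather than a full dictionary):

# Replace each word with its base-36 frequency rank, then compare how well a
# generic compressor does on the raw text versus the renumbered version.
import zlib
from collections import Counter

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n):
    out = ""
    while True:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
        if n == 0:
            return out

words = open("corpus.txt").read().lower().split()      # hypothetical input file
ranked = [w for w, _ in Counter(words).most_common()]   # most common word = rank 0
index = {w: to_base36(i) for i, w in enumerate(ranked)}

raw = " ".join(words).encode()
renumbered = " ".join(index[w] for w in words).encode()
print(len(zlib.compress(raw)), len(zlib.compress(renumbered)))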
Amazing that he never even addresses the condition the decompressed data is in.
If it doesn’t decompress into a useful state, it isn’t compression, now is it? It’s just 100% loss.
So it’s basically paraphrasing.
Paraphrasing keeps the same meaning. This doesn’t care about meaning.
Well, it depends on the skill of who is paraphrasing.
all I can say about this article and the comments is AAAAAAAAAAAAAAAAAGH
the article is fundamentally incorrect – ts_zip is not lossy. So all the comments griping about how lossy text compression is stupid may be right, but they’re also irrelevant. I understand that the linked page doesn’t explain the details of how it works either, but can we at least do some research before crapping all over something that’s actually quite clever?
The comment by [Andrzej] that suggests how to “fix” lossy models is insightful to a degree at least – it more or less describes how this *actually* works. ts_zip uses the LLM as a predictive model and further encodes the difference between what the LLM predicts and the actual text, so the original text can be perfectly reconstructed.
The key difference between ts_zip and NNCP is not lossy encoding, it’s that NNCP uses its own trained models based on transformer_xl and LSTM, while ts_zip is designed to use *any* large language model to compress data, but also requires the same one to decompress it.
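As a toy illustration of that principle (a character-frequency model standing in for the LLM, and plain rank numbers standing in for ts_zip’s actual arithmetic coder):

# The model ranks candidate next-symbols; we store only the rank of the symbol
# that actually occurred, so the better the model predicts, the smaller the
# numbers (and the cheaper they are to entropy-code). Decoding reruns the same
# model, which is why the same model is needed on both ends.
from collections import Counter

def ranked_guesses(history, alphabet):
    counts = Counter(history)
    return sorted(alphabet, key=lambda s: (-counts[s], s))

def encode(text):
    alphabet = sorted(set(text))
    return alphabet, [ranked_guesses(text[:i], alphabet).index(ch)
                      for i, ch in enumerate(text)]

def decode(alphabet, ranks):
    out = ""
    for r in ranks:
        out += ranked_guesses(out, alphabet)[r]
    return out

alphabet, ranks = encode("abracadabra")
print(ranks)                                      # mostly small numbers
assert decode(alphabet, ranks) == "abracadabra"   # perfectly reconstructed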
(Almost?) Nobody in the comments here cares about that LLM thing. That other Lossy Greg thing is much more fun.
It’s amazing that people that are Deeply Concerned with accuracy in text seem to be so alarmingly illiterate.
You’d be forgiven for thinking communication was more than just the error-free transmission of text.
Dammit Jim, I’m a human being, not a compiler!
Plot twist: the compression algorithm is just Kevin from The Office.
BASIC interpreters of the 70s/80s did it better.
They used a token format to reduce size (GW-BASIC used it, too).
Or take the ZX81 interpreter:
It stored commands like PRINT or INPUT in a shortened format.
BASIC interpreters used a very small dictionary of keywords, so it was easy to save space by tokenising it. Tokenising natural language, with a huge number of words, doesn’t give such easy savings.
That’s where other compression techniques, such as Huffman, LZ, etc., can come in. (Those are lossless examples, but this new approach is a lossy one, given the LLM.)
Clarification: the LLM bit is lossy, but the partner algorithm predicts and corrects those mistakes, for an overall lossless algorithm. Or so the idea goes…
Now if you analysed some commonly used languages, like English, German, French, or Chinese, and made a list of the 65535 most-used words that are longer than four bytes, that would make a nice tokenization list. No clue about the compression ratio though, or whether it is worth it outside some special use cases…
Interesting that the antique book photo features 1 Chronicles 11.
Good article, HaD, but how can you write about Fabrice Bellard without mentioning he’s also the creator of FFmpeg, QEMU, TinyCC, and more (or perhaps all your readers know this already)? This guy deserves a Nobel prize or at least a Turing award.
can Cliff Notes be considered lossy compression of text?
Yes! This was my concern when I read the article, that this was a sort of condensation rather than compression, with the inevitable side effect that only the principal points of the text are retained, with any subtlety and secondary meanings discarded as irrelevant. If that is the case, it could be fun to run poetry through it. Where by “fun” I mean painful.
Why waste time say lot word when few word do trick?
Not many word when few work.
Few word
Few
⠀
1
this sort of innovation is coming for everything. it’s scary to me because once you get going, the ‘enhance’ function is inevitable. and the resulting image will look so good, you will be forgiven for forgetting that it made it up and you aren’t actually seeing anything that the original camera saw. and that’s in photos, where we know how to think about it…but in text! can you imagine an ‘enhance’ function for text? to zoom in or out, elucidate or summarize a bit of prose? having played with chatgpt a bit, it no longer seems far out.
anyways, anecdote…i was, for no good reason, trying to compress a large collection of historical chess games. and i came up with the idea of enumerating the possible moves at each position, so instead of 16 pieces * 8*8 coordinates (1024 combinations per move, 10 bits), i could encode typically on the order of a dozen (3.5 bits). i forget the typical number of bytes per game, but i came up with what i thought was a very impressively terse representation. i was so pleased with myself i started wondering how close to the ideal representation i was. and i realized, what if i packaged it with a chess engine, and ordered the possible moves by probability assessed by the chess engine! and with chess engines these days being not that different from LLMs… and that’s about when i lost interest in the project :)
just like compressing text by encoding the autocomplete choice (left / middle / right, 1.5 bits) instead of the word.
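The move-index part of that idea, as a rough sketch assuming the python-chess library (and without the engine-ordered probabilities):

# Each move is stored as its index into the sorted list of legal moves from the
# current position, instead of a full piece/square encoding. An entropy coder
# (or engine-based move ordering) would shrink the indices further.
import chess

def encode_game(moves_san):
    board = chess.Board()
    indices = []
    for san in moves_san:
        legal = sorted(board.legal_moves, key=lambda m: m.uci())
        move = board.parse_san(san)
        indices.append(legal.index(move))
        board.push(move)
    return indices

def decode_game(indices):
    board = chess.Board()
    for i in indices:
        legal = sorted(board.legal_moves, key=lambda m: m.uci())
        board.push(legal[i])
    return board

print(encode_game(["e4", "e5", "Nf3", "Nc6"]))   # a handful of small indices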
Don’t forget photographs have *never* been a factual representation of reality. There has been artifice and manipulation at every stage for as long as photography has existed as a technology.
I had a private joke with a colleague a couple of decades ago that is becoming possible far too quickly for my liking – a text engine that had two modes: “desemantication”, which would distill a large block of text down to its core ideas, and “resemantication”, which would do the reverse. We would be on the lookout for anything that was “resemanticated bullshit”… i.e. your general political speech or online diatribe that was clearly a tiny bit of garbage inflated into a full press release.
A famous author said he got paid by the word, so why use metropolis when you can just say city and get paid the same?
What part of an effective ‘Markov chain’ being useful for text compression is at all surprising?