2023-08-27
Memory and storage space used to be so precious and so limited that handling nontrivial amounts of text was a serious problem. Text compression was a highly practical application of computing power.
Today it might be a solved problem, but that doesn’t mean it can’t attract new or unusual solutions. [Fabrice Bellard] released ts_zip, which uses Large Language Models (LLMs) to attain text compression ratios higher than other tools offer.
LLMs are the technology behind natural language AIs, and applying them in this way seems effective. The tradeoff? Unlike typical compression tools, the lossless decompression part isn’t exactly guaranteed when an LLM is involved. Lossy compression methods are in fact quite useful. JPEG, for example, makes a smaller file by discarding data that isn’t readily perceived by humans, but that approach isn’t usually applied to text. If you absolutely require lossless compression, [Fabrice] has that covered with NNCP, a neural-network powered lossless data compressor.
Do neural networks and LLMs sound far too serious and complicated for your text compression needs? As long as you don’t mind a mild amount of definitely noticeable data loss, check out [Greg Kennedy]’s Lossy Text Compression which simply, brilliantly, and amusingly uses a thesaurus instead of some fancy algorithms. Yep, it just swaps longer words for shorter ones. Perhaps not the best solution for every need, but between that and [Fabrice]’s brilliant work we’re confident there’s something for everyone who craves some novelty with their text compression.
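To make the thesaurus idea concrete, here is a minimal Python sketch of swapping longer words for shorter ones. It is not [Greg Kennedy]’s code: the tiny `SHORTER` word list and the `lossy_compress` function are hypothetical stand-ins, and a real tool would load a full thesaurus.

```python
import re

# Hypothetical toy thesaurus (an assumption for illustration only);
# a real tool would load a complete synonym list.
SHORTER = {
    "utilize": "use",
    "approximately": "about",
    "purchase": "buy",
    "demonstrate": "show",
    "assistance": "help",
}


def lossy_compress(text, thesaurus=SHORTER):
    """Replace each word with a shorter synonym when one is known."""

    def swap(match):
        word = match.group(0)
        short = thesaurus.get(word.lower())
        # Keep the original word if no synonym exists or nothing is saved.
        if short is None or len(short) >= len(word):
            return word
        # Preserve a leading capital so sentence starts still look right.
        return short.capitalize() if word[0].isupper() else short

    # Only runs of ASCII letters are treated as words; punctuation,
    # digits, and whitespace pass through untouched.
    return re.sub(r"[A-Za-z]+", swap, text)


if __name__ == "__main__":
    print(lossy_compress("Please utilize approximately ten bytes."))
```

The output is shorter and still readable, but the original wording cannot be recovered from it, which is exactly the kind of loss described above.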
[Photo by Matthew Henry from Burst]
8 thoughts on “Text Compression Gets Weirdly Efficient With LLMs”
That Fabrice! After images, now text! He’s broken the Internet again :-P
The tradeoff here is in the size of the compressor/decompressor and the speed it works at. You need up to 8.5 GB of GPU memory (for both compression and decompression) and it runs at between 7 and 128 KB/sec. I’m not sure of the circumstances where the reduction in file size justifies the extra cost.
Is it just me, or could this be used to disguise plagiarized text, using the thesaurused output to keep AI algorithms from detecting the original text?
The funky, unnatural recovered language could tip off the reader.
What form does the ‘loss’ take? In images it’s typically loss of definition. But given LLMs’ proclivities, this might just hallucinate the expanded text. Not really what you want.
This is asking for trouble.
LLMs achieve compression by condensing the text based on the concepts it talks about. It turns out there are only so many topics humans communicate about in their daily lives. The “lossiness” isn’t in terms of slightly different pixel values or wrong characters, but in changed phrasing or omitted details.
It’s not hard to have a sentence like “Ane Berry saw the thief at the scene of the crime.”
turn into “Ane Berry was seen at the scene of the crime.”
You see this frequently with LLM chatbots misinterpreting various aspects of the chat history.
The generation loss of a recursive LLM system should prove to be very interesting. Maybe we’ll finally get to those monkeys on typewriters…
So now the only part missing would be an LLM-based text “diff” mechanism. You first apply the lossy compression, then generate a diff against the original and store both. I’m just waving hands here, but I expect that an LLM could be adapted to produce an efficient representation of such differences, since it can (or at least pretends to) take language semantics into account.
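The "lossy output plus diff" idea from the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library's `difflib` at the character level rather than anything LLM-based, and `make_patch` and `apply_patch` are hypothetical helper names. The patch records only the spans where the original differs from the lossy text, so storing the lossy text and the patch together is enough to rebuild the original exactly.

```python
import difflib


def make_patch(lossy, original):
    """Record only the spans where `original` differs from `lossy`."""
    matcher = difflib.SequenceMatcher(a=lossy, b=original, autojunk=False)
    patch = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Replace lossy[i1:i2] with the corresponding original text.
            patch.append((i1, i2, original[j1:j2]))
    return patch


def apply_patch(lossy, patch):
    """Rebuild the original text from the lossy text and its patch."""
    pieces = []
    pos = 0
    for i1, i2, replacement in patch:
        pieces.append(lossy[pos:i1])  # unchanged run copied from lossy text
        pieces.append(replacement)    # corrected span from the original
        pos = i2
    pieces.append(lossy[pos:])        # trailing unchanged run
    return "".join(pieces)


if __name__ == "__main__":
    original = "Ane Berry saw the thief at the scene of the crime."
    lossy = "Ane Berry was seen at the scene of the crime."
    patch = make_patch(lossy, original)
    print(patch)
    print(apply_patch(lossy, patch) == original)
```

Whether this saves space overall depends on how small the patch is compared with what the lossy step saved; a semantics-aware diff of the kind the comment imagines would aim to shrink that patch further.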
