Fitting A Spell Checker Into 64 KB

March 26, 2025 by Bryan Cockfield 16 Comments

By some estimates, the English language contains over a million unique words. This is perhaps overly generous, but even conservative estimates generally put the number at over a hundred thousand. Regardless of where the exact number falls between those two extremes, it’s certainly many more words than could fit in the 64 kB of memory allocated to the spell checking program on some of the first Unix machines. This article by [Abhinav Upadhyay] takes a deep dive on how the early Unix engineers accomplished the feat despite the extreme limitations of the computers they were working with.

Perhaps the most obvious way to build a spell checker is by simply looking up each word in a dictionary. With modern hardware this wouldn’t be too hard, but disks in the ’70s were extremely slow and expensive. To move the dictionary into memory it was first whittled down to around 25,000 words by various methods, including using an algorithm to remove all affixes, and then using a Bloom filter to perform the lookups. The team found that this wasn’t a big enough dictionary size, and had to change strategies to expand the number of words the spell checker could check. Hash compression was used at first, followed by hash differences and then a special compression method which achieved an almost theoretically perfect compression.

Although most computers that run spell checkers today have much more memory as well as disks which are orders of magnitude larger and faster, a lot of the innovation made by this early Unix team is still relevant for showing how various compression algorithms can be used on data in general. Large language models, for one example, are proving to be the new frontier for text-based data compression.

Your Text Needs More JPEG

March 19, 2024 by Bryan Cockfield 22 Comments

We’ve all been victims of bad memes on the Internet, but they’re not all just bad jokes gone wrong. Some are simply bad as a result of being copies-of-copies, as each reposter adds another layer of compression to an already lossy image format like JPEG. Compression can certainly be a benefit in areas like images and videos, but [Michal] had a bit of a fever dream imagining this process applied to text. Rather than let the idea escape, he built the Lossifizer to add JPEG-like compression to text.

JPEG compression uses a system similar to the fast Fourier transform (FFT) called the discrete cosine transform (DCT) to reduce the amount of data in an image by essentially removing some frequency information. The data lost is often not noticeable to the human eye, at least until it gets out of hand. [Michal]’s system performs the same transform on text instead, with a slider to control the “amount of JPEG” in the output text. The code for this script uses a “perceptual” character map, clustering similarly-looking and similarly-sounding characters next to each other, resembling “leet speak” from days of yore, although at high enough compression this quickly gets out of hand.

One of the quirks that [Michal] discovered is that certain AI chat bots have a much less difficult time interpreting this JPEG-ified text than a human probably would have, which provides a bit of insight into how some of these algorithms might be functioning under the hood. For some more insight into how JPEG actually works on images, we posted about a deep dive into the image format a while back.

G-code Goes Binary With Proposed New Format

November 28, 2023 by Donald Papp 66 Comments

G-code is effective, easily edited, and nearly ubiquitous when it comes to anything CNC. The format has many strengths, but space efficiency isn’t one of them. In fact, when it comes to 3D printing in particular file sizes can get awfully large. Partly to address this, Prusa have proposed a new .bgcode binary G-code format. You can read the specification of the new (and optional) format here.

The newest version of PrusaSlicer has support for .bgcode, and a utility to convert ASCII G-code to binary (and back) is in the File menu. Want to code an interface of your own? The libbgcode repository provides everything needed to flip .gcode to .bgcode (with a huge file size savings in the process) and vice versa in a way that preserves all aspects of the data. Need to hand-edit a binary G-code file? Convert it to ASCII G-code, make your changes, then flip it right back.

Prusa are not the only ones to notice that the space inefficiency of the G-code file format is not ideal in all situations. Heatshrink and MeatPack are two other solutions in this space with their own strong points. Handily, the command-line tool in libgcode can optionally apply Heatshrink compression or MeatPack encoding in the conversion process.

In a way, G-code is the assembly language of 3D printers. G-code files are normally created when slicing software processes a 3D model, but there are some interesting tricks to be done when G-code is created directly.

Text Compression Gets Weirdly Efficient With LLMs

August 27, 2023 by Donald Papp 66 Comments

It used to be that memory and storage space were so precious and so limited of a resource that handling nontrivial amounts of text was a serious problem. Text compression was a highly practical application of computing power.

Today it might be a solved problem, but that doesn’t mean it doesn’t attract new or unusual solutions. [Fabrice Bellard] released ts_zip which uses Large Language Models (LLM) to attain text compression ratios higher than any other tool can offer.

LLMs are the technology behind natural language AIs, and applying them in this way seems effective. The tradeoff? Unlike typical compression tools, the lossless decompression part isn’t exactly guaranteed when an LLM is involved. Lossy compression methods are in fact quite useful. JPEG compression, for example, is a good example of discarding data that isn’t readily perceived by humans to make a smaller file, but that isn’t usually applied to text. If you absolutely require lossless compression, [Fabrice] has that covered with NNCP, a neural-network powered lossless data compressor.

Do neural networks and LLMs sound far too serious and complicated for your text compression needs? As long as you don’t mind a mild amount of definitely noticeable data loss, check out [Greg Kennedy]’s Lossy Text Compression which simply, brilliantly, and amusingly uses a thesaurus instead of some fancy algorithms. Yep, it just swaps longer words for shorter ones. Perhaps not the best solution for every need, but between that and [Fabrice]’s brilliant work we’re confident there’s something for everyone who craves some novelty with their text compression.

[Photo by Matthew Henry from Burst]

Explore FFmpeg From The Comfort Of Your Browser

August 27, 2023 by Tom Nardi 24 Comments

If you’re looking to manipulate video, FFmpeg is one of the most powerful tools out there. But with this power comes a considerable degree of complexity, and a learning curve that looks suspiciously like a brick wall. To try and make this incredible tool a bit less obtuse, [Sam Lavigne] has developed a web interface that lets you play around with FFmpeg’s vast collection of audio and video filters.

To try out a filter, you just need to select one from the window on the left and it will pop up in the central workspace. Here, the input, output, and any enabled filters will show up as boxes that can be virtually “wired” together. Selecting a filter will populate its options on the right hand side, with sliders and input boxes that allow you to play around with their parameters. When you want to see the final result, just click “Render Preview” and wait a bit.

If there was any downside, it seems like ~~whatever box the site is running on~~ the overhead of running in the browser doesn’t provide it a lot of horsepower. Even with the relatively low resolution of the demo videos available, the console output at the top of the page shows FFmpeg sometimes flirts with a processing speed measured in single-digit frames per second. Still, for a filter playground, it gets the job done. Perhaps the best part of the whole tool is that you can then copy your properly formatted command right out of the browser window and into your terminal so you can put it to work on your local files.

FFmpeg is one of those programs you should really be familiar with because it often proves useful in unexpected ways. The ability to manipulate audio and video with just a few keystrokes can really come in handy, and we’ve seen this open-source tool used for everything from compressing podcasts onto floppy disks to overlaying real-time environmental data onto a video stream.

Never Too Rich Or Thin: Compress Sqlite 80%

August 1, 2022 by Al Williams 11 Comments

We are big fans of using SQLite for anything of even moderate complexity where you might otherwise use a file. The advantages are numerous, but sometimes you want to be lean on file storage. [Phiresky] has a great answer to that: the sqlite-zstd extension offers transparent row-level compression for SQLite.

There are other options, of course, but as the post mentions, each of these have some drawbacks. However, by compressing each row of a table, you can retain random access without some of the drawbacks of other methods.

Continue reading “Never Too Rich Or Thin: Compress Sqlite 80%” →

Squeezing A Wordle Clone Onto The Game Boy

March 1, 2022 by Jenny List 9 Comments

The popular word game Wordle is both an addictive brain teaser for some and a perpetual social media annoyance for others. Its runaway success has spawned a host of clones, among them one created for the Nintendo Game Boy with a reduced vocabulary. [Alexander Pruss] took on the challenge of improving it by fitting the entire 12972-strong 5-letter-word vocabulary as well as the 2315-word answer list into a 32K cartridge along with the code. The challenge in compression on a platform of such meager resources is to devise an algorithm which does not require more computing power or memory than the device has at its disposal. His solution is both elegant and easy to understand.

Starting by dividing the words into lists by first letter such that he can ignore the letter, he can reduce each word to 20 bits as four 5-bit letters. The clever part comes when he organises the words alphabetically, meaning that the 20-bit numbers representing each word are in numerical order.

Thus instead of storing the full number he could store the difference between it and its predecessor. With a few extra tweaks he was able to get the full list down to an impressive 20186 bytes, but was still faced with not enough space. Turning to the Wordle code he found that a library function call could be switched to an alternative with a much more efficient footprint, resulting in a new ROM with all words in place and ready to play.

Of course our community have applied their minds to Wordle and we’ve featured more than one hack based upon it. Mostly they have involved automated solving, so this retro gaming version breaks new ground.

Header image: Sammlung der Medien und Wissenschaft, CC BY 4.0.