Making The Smallest And Dumbest LLM With Extreme Quantization

Turns out that training on Twitch quotes doesn't make an LLM a math genius. (Credit: Codeically, YouTube)

The reason why large language models are called ‘large’ is not how smart they are, but their sheer size in bytes. With billions of parameters at four bytes each, they pose a serious challenge not just in terms of disk space, but also RAM, specifically the RAM on your video card (VRAM). Reducing this immense size, as is done routinely for the smaller pretrained models available for local use, involves quantization. This process is explained and demonstrated by [Codeically], who takes it to its logical extreme: shrinking what would be a GB-sized model down to a mere 63 MB by reducing the bits per parameter.
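The size argument is simple arithmetic: parameter count times bits per parameter. A quick back-of-the-envelope sketch (the parameter count below assumes the ~124M-parameter small variant of GPT-2; the article does not state which variant was used):

```python
# Back-of-the-envelope model size at different bit widths.
# The 124M figure is an assumption (GPT-2 "small"); adjust for other models.
def model_size_mb(n_params: int, bits_per_param: int) -> float:
    """Storage in megabytes for n_params weights at bits_per_param each."""
    return n_params * bits_per_param / 8 / 1e6

n = 124_000_000                    # GPT-2 small, assumed
print(model_size_mb(n, 32))       # fp32: 496.0 MB
print(model_size_mb(n, 4))        # int4: 62.0 MB
```

Note how a 4-bit version of a ~124M-parameter model lands right around the 63 MB figure from the video.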

While you can offload a model, i.e. keep only part of it in VRAM and the rest in system RAM, this massively impacts performance. An alternative is to use fewer bits per weight, which is what quantization does: typically reducing 16-bit floating point weights to 8-bit integers, halving memory usage. Going much lower than this is generally deemed inadvisable.
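The basic round trip can be sketched with a minimal symmetric, per-tensor scheme: pick one scale factor, map the floats onto int8, and multiply back out at inference time. This is an illustration of the general idea, not the specific method used in the video:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: floats -> int8 plus one scale."""
    scale = np.abs(w).max() / 127.0          # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 values and the scale."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)   # close to w, off by a small rounding error
```

Real implementations quantize per channel or per block rather than per tensor, precisely because a single outlier weight would otherwise blow up the rounding error for everything else.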

Using GPT-2 as the base, [Codeically] trained it on a pile of internet quotes, storing each parameter as a very anemic 4-bit integer. After an initial attempt in which manually zeroing weights made the output too garbled, a second attempt without the zeroing produced somewhat usable output before flying off the rails. Yet it did this with a 63 MB model at 78 tokens per second on just the CPU, demonstrating that you can create a pocket-sized chatbot to spout nonsense without splurging on expensive hardware.
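Getting from 8-bit down to 4-bit storage is mostly a packing exercise, since no native 4-bit type exists: two values in the range [-8, 7] share each byte. A hypothetical sketch of that packing (not [Codeically]'s actual code):

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack int values in [-8, 7] two per byte, low nibble first."""
    u = (q.astype(np.int16) + 8).astype(np.uint8)   # shift to [0, 15]
    if len(u) % 2:
        u = np.append(u, np.uint8(8))               # pad with the encoding of 0
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4; the caller must trim any padding element."""
    low = (packed & 0x0F).astype(np.int16) - 8
    high = ((packed >> 4) & 0x0F).astype(np.int16) - 8
    out = np.empty(len(packed) * 2, dtype=np.int16)
    out[0::2], out[1::2] = low, high
    return out

q = np.array([-8, 7, 0, 3], dtype=np.int16)
packed = pack_int4(q)          # 4 weights now occupy 2 bytes
restored = unpack_int4(packed)
```

Each byte holds two weights, so the packed array is half the length of the int8 version, which is where the 4-bit model's size advantage comes from.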

15 thoughts on “Making The Smallest And Dumbest LLM With Extreme Quantization”

  1. Espressif has really knocked this out of the park for microcontrollers. You do have to reduce the instruction set and limit it – but I have an LLM capable of voice-only interaction that can answer local questions and uses a websocket to talk to a bigger LLM when necessary, running on an ESP32-S3. Check out the XiaoZhi project on GitHub.

      1. I suppose, perhaps a bit closer to a Cyberpunk 2077 character who went too far with modifications.

        I like your website a lot :). Perhaps consider adding https so it would not throw insecure access warnings in modern browsers, it has interesting projects.

  2. What a useless video. Didn’t show anything at all of what he’s speaking about. He spent maybe a couple of weeks working on that, another one making the video, and I lost 5 minutes of my time watching something that could’ve been read in 20 seconds, or a whole week reading the details that are nowhere to be found.

  3. “The reason why large language models are called ‘large’ is not because of how smart they are, but as a factor of their sheer size in bytes.”

    … Has anyone ever confused the concept of size with intelligence? Stating the obvious is quite a peculiar way to start the article.

  4. I used a script from Predator for the learning model of one of my first self-built chatbots well over 20 years ago lol. It was a foul-mouthed beast but had partial awareness by the end of our journey.


