Making The Smallest And Dumbest LLM With Extreme Quantization

October 23, 2025 by Maya Posch 15 Comments

Turns out that training on Twitch quotes doesn't make an LLM a math genius. (Credit: Codeically, YouTube) — Turns out that training on Twitch quotes doesn’t make an LLM a math genius. (Credit: Codeically, YouTube)

The reason why large language models are called ‘large’ is not because of how smart they are, but as a factor of their sheer size in bytes. At billions of parameters at four bytes each, they pose a serious challenge when it comes to not just their size on disk, but also in RAM, specifically the RAM of your videocard (VRAM). Reducing this immense size, as is done routinely for the smaller pretrained models which one can download for local use, involves quantization. This process is explained and demonstrated by [Codeically], who takes it to its logical extreme: reducing what could be a GB-sized model down to a mere 63 MB by reducing the bits per parameter.

While you can offload a model, i.e. keep only part of it in VRAM and the rest in system RAM, this massively impacts performance. An alternative is to use fewer bits per weight in the model, called ‘compression’, which typically involves reducing 16-bit floating point to 8-bit, reducing memory usage by about 75%. Going lower than this is generally deemed unadvisable.

Using GPT-2 as the base, it was trained with a pile of internet quotes, creating parameters with a very anemic 4-bit integer size. After initially manually zeroing the weights made the output too garbled, the second attempt without the zeroing did somewhat produce usable output before flying off the rails. Yet it did this with a 63 MB model at 78 tokens a second on just the CPU, demonstrating that you can create a pocket-sized chatbot to spout nonsense even without splurging on expensive hardware.

Continue reading “Making The Smallest And Dumbest LLM With Extreme Quantization” →

Microsoft’s New Agentic Web Protocol Stumbles With Path Traversal Exploit

August 7, 2025 by Maya Posch 11 Comments

If the term ‘NLWeb’ first brought to mind an image of a Dutch internet service provider, you’re probably not alone. What it actually is – or tries to become – is Microsoft’s vision of a parallel internet protocol using which website owners and application developers can integrate whatever LLM-based chatbot they desire. Unfortunately for Microsoft, the NLWeb protocol just suffered its first major security flaw.

The flaw is an absolute doozy, involving a basic path traversal vulnerability that allows an attacker to use appropriately formatted URLs to traverse the filesystem of the remote, LLM-hosting, system to extract keys and other sensitive information. Although Microsoft patched it already, no CVE was assigned, while raising the question of just how many more elementary bugs like this may be lurking in the protocol and associated software.

As for why a website or application owner might be interested in NLWeb, the marketing pitch appears to be as an alternative to integrating a local search function. This way any website or app can have their own ChatGPT-style search functionality that is theoretically restricted to just their website, instead of chatbot-loving customers going to the ChatGPT or equivalent site to ask their questions there.

Even aside from the the strong ‘solution in search of a problem’ vibe, it’s worrying that right from the outset it seems to introduce pretty serious security issues that suggest a lack of real testing, never mind a strong ignorance of the fact that a lack of user input sanitization is the primary cause for widely exploited CVEs. Unknown is whether GitHub Copilot was used to write the affected codebase.

Why GitHub Copilot Isn’t Your Coding Partner

July 4, 2025 by Maya Posch 79 Comments

These days ‘AI’ is everywhere, including in software development. Coming hot on the heels of approaches like eXtreme Programming and Pair Programming, there’s now a new kind of pair programming in town in the form of an LLM that’s been digesting millions of lines of code. Purportedly designed to help developers program faster and more efficiently, these ‘AI programming assistants’ have primarily led to heated debate and some interesting studies.

In the case of [Jj], their undiluted feelings towards programming assistants like GitHub Copilot burn as brightly as the fire of a thousand Suns, and not a happy kind of fire.

Whether it’s Copilot or ChatGPT or some other chatbot that may or may not be integrated into your IDE, the frustration with what often feels like StackOverflow-powered-autocomplete is something that many of us can likely sympathize with. Although [Jj] lists a few positives of using an LLM trained on codebases and documentation, their overall view is that using Copilot degrades a programmer, mostly because of how it takes critical thinking skills out of the loop.

Regardless of whether you agree with [Jj] or not, the research so far on using LLMs with software development and other tasks strongly suggests that they’re not a net positive for one’s mental faculties. It’s also important to note that at the end of the day it’s still you, the fleshy bag of mostly salty water, who has to justify the code during code review and when something catches on fire in production. Your ‘copilot’ meanwhile gets off easy.

USB Stick Hides Large Language Model

February 17, 2025 by Bryan Cockfield 27 Comments

Large language models (LLMs) are all the rage in the generative AI world these days, with the truly large ones like GPT, LLaMA, and others using tens or even hundreds of billions of parameters to churn out their text-based responses. These typically require glacier-melting amounts of computing hardware, but the “large” in “large language models” doesn’t really need to be that big for there to be a functional, useful model. LLMs designed for limited hardware or consumer-grade PCs are available now as well, but [Binh] wanted something even smaller and more portable, so he put an LLM on a USB stick.

This USB stick isn’t just a jump drive with a bit of memory on it, though. Inside the custom 3D printed case is a Raspberry Pi Zero W running llama.cpp, a lightweight, high-performance version of LLaMA. Getting it on this Pi wasn’t straightforward at all, though, as the latest version of llama.cpp is meant for ARMv8 and this particular Pi was running the ARMv6 instruction set. That meant that [Binh] needed to change the source code to remove the optimizations for the more modern ARM machines, but with a week’s worth of effort spent on it he finally got the model on the older Raspberry Pi.

Getting the model to run was just one part of this project. The rest of the build was ensuring that the LLM could run on any computer without drivers and be relatively simple to use. By setting up the USB device as a composite device which presents a filesystem to the host computer, all a user has to do to interact with the LLM is to create an empty text file with a filename, and the LLM will automatically fill the file with generated text. While it’s not blindingly fast, [Binh] believes this is the first plug-and-play USB-based LLM, and we’d have to agree. It’s not the least powerful computer to ever run an LLM, though. That honor goes to this project which is able to cram one on an ESP32.

Continue reading “USB Stick Hides Large Language Model” →

Examining The Vulnerability Of Large Language Models To Data-Poisoning

February 3, 2025 by Maya Posch 23 Comments

Large language models (LLMs) are wholly dependent on the quality of the input data with which these models are trained. While suggestions that people eat rocks are funny to you and me, in the case of LLMs intended to help out medical professionals, any false claims or statements dripping out of such an LLM can have dire consequences, ranging from incorrect diagnoses to much worse. In a recent study published in Nature Medicine by [Daniel Alexander Alber] et al. the ease with which this data poisoning can occur is demonstrated.

According to their findings, only 0.001% of training tokens have to be replaced with medical misinformation to order to create models that are likely to produce medically erroneous statement. Most concerning is that such a corrupted model isn’t readily discovered using standard medical LLM benchmarks. There are filters for erroneous content, but these tend to be limited in scope due to the overhead. Post-training adjustments can be made, as can the addition of RAG, but none of this helps with the confident bull excrement due to corruption.

The mitigation approach that the researchers developed cross-references LLM output against biomedical knowledge graphs, to reduce the LLM mostly for generating natural language. In this approach LLM outputs are matched against the graphs and if LLM ‘facts’ cannot be verified, it’s marked as potential misinformation. In a test with 1,000 random passages detected issues with a claimed effectiveness of 91.9%.

Naturally, this does not guarantee that misinformation does not make it past these knowledge graphs, and largely leaves the original problem with LLMs in place, namely that their outputs can never be fully trusted. This study also makes it abundantly clear how easy it is to corrupt an LLM via the input training data, as well as underlining the broader problem that AI is making mistakes that we don’t expect.

New Open Source DeepSeek V3 Language Model Making Waves

January 27, 2025 by Maya Posch 82 Comments

In the world of large language models (LLMs) there tend to be relatively few upsets ever since OpenAI barged onto the scene with its transformer-based GPT models a few years ago, yet now it seems that Chinese company DeepSeek has upended the status quo. Its new DeepSeek-V3 model is not only open source, it also claims to have been trained for only a fraction of the effort required by competing models, while performing significantly better.

The full training of DeepSeek-V3’s 671B parameters is claimed to have only taken 2.788 M hours on NVidia H800 (Hopper-based) GPUs, which is almost a factor of ten less than others. Naturally this has the LLM industry somewhat up in a mild panic, but for those who are not investors in LLM companies or NVidia can partake in this new OSS model that has been released under the MIT license, along with the DeepSeek-R1 reasoning model.

Both of these models can be run locally, using both AMD and NVidia GPUs, as well as using the online APIs. If these models do indeed perform as efficiently as claimed, they stand to massively reduce the hardware and power required to not only train but also query LLMs.

Trap Naughty Web Crawlers In Digestive Juices With Nepenthes

January 23, 2025 by Maya Posch 45 Comments

In the olden days of the WWW you could just put a robots.txt file in the root of your website and crawling bots from search engines and kin would (generally) respect the rules in it. These days, however, we have especially web crawlers from large language model (LLM) companies happily ignoring such signs on the lawn before proceeding to hover up every scrap of content on websites. Naturally this makes a lot of people very angry, but what can you do about it? The answer by [Aaron B] is Nepenthes, described on the project page as a ‘tar pit for catching web crawlers’.

More commonly known as ‘pitcher plants’, nepenthes is a genus of carnivorous plants that use a fluid-filled cup to trap insects and small critters unfortunate enough to slip & slide down into it. In the case of this Lua-based project the idea is roughly the same. Configured as a trap behind a web server (e.g. /nepenthes), any web crawler that accesses it will be presented with an endless number of (randomly generated) pages with many URLs to follow. Page generating is deliberately quite slow to not soak up significant CPU time, while still giving the LLM scrapers plenty of random nonsense to chew on.

Considering that these web crawlers deemed adhering to the friendly sign on the lawn beneath them, the least we can do in response, is to hasten model collapse by feeding these LLM scrapers whatever rolls out of a simple (optionally Markov-based) text generator.