TurboQuant: Reducing LLM Memory Usage With Vector Quantization

Large language models (LLMs) aren’t actually giant computer brains. Instead, they are massive vector spaces in which the probabilities of tokens occurring in a specific order is encoded. Billions of parameters, times N bits per parameter, equals N-billion bits of storage required for a full model. Since increasing the number of parameters makes the models appear smarter, correspondingly the size of these models and their associated caches has been increasing rapidly.

Vector quantization (VQ) is a method that can compress the vectors calculated during inference to take up less space without significant loss of data. Google’s recently published pre-print paper on TurboQuant covers an LLM-oriented VQ algorithm that’s claimed to provide up to a 6x compression level with no negative impact on inference times.

The tokens aren’t directly encoded in the vector space, but their associated key value is, which along with the single token per inference process creates the need for a key-value (KV) cache, the size of which scales with the size of the model. Thus by compressing the KV cache using VQ, it will reduce its size and correspondingly speed up look-ups due to the smaller size in memory. One catch here is that VQ is due to the nature of quantization some accuracy will be lost. The trick here is thus to apply VQ in such a way that it does not affect this accuracy in a noticeable manner.

Other aspects that had to be taken into account by the TurboQuant algorithm was fast computation to keep up with real-time requirements, along with compatibility with so-called ‘AI accelerator’ hardware.

Continue reading “TurboQuant: Reducing LLM Memory Usage With Vector Quantization”

Despite Penalties, Lawyers Can’t Stop Using AI

Despite a few high-profile cases in recent years with lawyers getting caught using LLM-generated documents and facing disciplinary action due to this, it would seem that this is not deterring many other lawyers from following them off this particular cliff, per reporting from NPR.

We reported back in the innocent days of 2023 about the amusing case of Robert Mata v. Avianca, Inc. In this case, the plaintiff’s lawyer decided to have ChatGPT ‘assist’ with the legal filing, which ended up being filled with non-existent cases being cited, despite the chatbot’s assurance that these were all real cases. Now it would seem that this blind trust in cases cited by LLM chatbots is becoming the rule, rather than the exception.

Last year a record number of lawyers fell into the same trap, with many lawyers getting fined thousands of dollars for confabulated case citations. According to a researcher at the business school HEC Paris, who is keeping a worldwide tally, the count so far is 1,200, of which 800 originate from US courts.

Unsurprisingly, penalties are also increasing in severity, with monetary penalties passing the $100,000 and some courts demanding that any use of ‘AI’ be declared up-front. Whether or not the popularity of LLM chatbots among US lawyers is simply due to the massive caseload that digging through cases in Common Law legal systems entails has not yet been addressed, but that undesirable shortcuts are being taken is undeniable.

Remember that it’s easy to point and laugh, but the next case could involve the lawyer handling your delicate situation.

A graph showing the poisoning success rate of 7B and 13B parameter models

It Only Takes A Handful Of Samples To Poison Any Size LLM, Anthropic Finds

It stands to reason that if you have access to an LLM’s training data, you can influence what’s coming out the other end of the inscrutable AI’s network. The obvious guess is that you’d need some percentage of the overall input, though exactly how much that was — 2%, 1%, or less — was an active research question. New research by Anthropic, the UK AI Security Institute, and the Alan Turing Institute shows it is actually a lot easier to poison the well than that.

We’re talking parts-per-million of poison for large models, because the researchers found that with just 250 carefully-crafted poison pills, they could compromise the output of any size LLM. Now, when we say poison the model, we’re not talking about a total hijacking, at least in this study. The specific backdoor under investigation was getting the model to produce total gibberish.

Continue reading “It Only Takes A Handful Of Samples To Poison Any Size LLM, Anthropic Finds”

Making The Smallest And Dumbest LLM With Extreme Quantization

Turns out that training on Twitch quotes doesn't make an LLM a math genius. (Credit: Codeically, YouTube)
Turns out that training on Twitch quotes doesn’t make an LLM a math genius. (Credit: Codeically, YouTube)

The reason why large language models are called ‘large’ is not because of how smart they are, but as a factor of their sheer size in bytes. At billions of parameters at four bytes each, they pose a serious challenge when it comes to not just their size on disk, but also in RAM, specifically the RAM of your videocard (VRAM). Reducing this immense size, as is done routinely for the smaller pretrained models which one can download for local use, involves quantization. This process is explained and demonstrated by [Codeically], who takes it to its logical extreme: reducing what could be a GB-sized model down to a mere 63 MB by reducing the bits per parameter.

While you can offload a model, i.e. keep only part of it in VRAM and the rest in system RAM, this massively impacts performance. An alternative is to use fewer bits per weight in the model, called ‘compression’, which typically involves reducing 16-bit floating point to 8-bit, reducing memory usage by about 75%. Going lower than this is generally deemed unadvisable.

Using GPT-2 as the base, it was trained with a pile of internet quotes, creating parameters with a very anemic 4-bit integer size. After initially manually zeroing the weights made the output too garbled, the second attempt without the zeroing did somewhat produce usable output before flying off the rails. Yet it did this with a 63 MB model at 78 tokens a second on just the CPU, demonstrating that you can create a pocket-sized chatbot to spout nonsense even without splurging on expensive hardware.

Continue reading “Making The Smallest And Dumbest LLM With Extreme Quantization”

Microsoft’s New Agentic Web Protocol Stumbles With Path Traversal Exploit

If the term ‘NLWeb’ first brought to mind an image of a Dutch internet service provider, you’re probably not alone. What it actually is – or tries to become – is Microsoft’s vision of a parallel internet protocol using which website owners and application developers can integrate whatever LLM-based chatbot they desire. Unfortunately for Microsoft, the NLWeb protocol just suffered its first major security flaw.

The flaw is an absolute doozy, involving a basic path traversal vulnerability that allows an attacker to use appropriately formatted URLs to traverse the filesystem of the remote, LLM-hosting, system to extract keys and other sensitive information. Although Microsoft patched it already, no CVE was assigned, while raising the question of just how many more elementary bugs like this may be lurking in the protocol and associated software.

As for why a website or application owner might be interested in NLWeb, the marketing pitch appears to be as an alternative to integrating a local search function. This way any website or app can have their own ChatGPT-style search functionality that is theoretically restricted to just their website, instead of chatbot-loving customers going to the ChatGPT or equivalent site to ask their questions there.

Even aside from the the strong ‘solution in search of a problem’ vibe, it’s worrying that right from the outset it seems to introduce pretty serious security issues that suggest a lack of real testing, never mind a strong ignorance of the fact that a lack of user input sanitization is the primary cause for widely exploited CVEs. Unknown is whether GitHub Copilot was used to write the affected codebase.

Why GitHub Copilot Isn’t Your Coding Partner

These days ‘AI’ is everywhere, including in software development. Coming hot on the heels of approaches like eXtreme Programming and Pair Programming, there’s now a new kind of pair programming in town in the form of an LLM that’s been digesting millions of lines of code. Purportedly designed to help developers program faster and more efficiently, these ‘AI programming assistants’ have primarily led to heated debate and some interesting studies.

In the case of [Jj], their undiluted feelings towards programming assistants like GitHub Copilot burn as brightly as the fire of a thousand Suns, and not a happy kind of fire.

Whether it’s Copilot or ChatGPT or some other chatbot that may or may not be integrated into your IDE, the frustration with what often feels like StackOverflow-powered-autocomplete is something that many of us can likely sympathize with. Although [Jj] lists a few positives of using an LLM trained on codebases and documentation, their overall view is that using Copilot degrades a programmer, mostly because of how it takes critical thinking skills out of the loop.

Regardless of whether you agree with [Jj] or not, the research so far on using LLMs with software development and other tasks strongly suggests that they’re not a net positive for one’s mental faculties. It’s also important to note that at the end of the day it’s still you, the fleshy bag of mostly salty water, who has to justify the code during code review and when something catches on fire in production. Your ‘copilot’ meanwhile gets off easy.

USB Stick Hides Large Language Model

Large language models (LLMs) are all the rage in the generative AI world these days, with the truly large ones like GPT, LLaMA, and others using tens or even hundreds of billions of parameters to churn out their text-based responses. These typically require glacier-melting amounts of computing hardware, but the “large” in “large language models” doesn’t really need to be that big for there to be a functional, useful model. LLMs designed for limited hardware or consumer-grade PCs are available now as well, but [Binh] wanted something even smaller and more portable, so he put an LLM on a USB stick.

This USB stick isn’t just a jump drive with a bit of memory on it, though. Inside the custom 3D printed case is a Raspberry Pi Zero W running llama.cpp, a lightweight, high-performance version of LLaMA. Getting it on this Pi wasn’t straightforward at all, though, as the latest version of llama.cpp is meant for ARMv8 and this particular Pi was running the ARMv6 instruction set. That meant that [Binh] needed to change the source code to remove the optimizations for the more modern ARM machines, but with a week’s worth of effort spent on it he finally got the model on the older Raspberry Pi.

Getting the model to run was just one part of this project. The rest of the build was ensuring that the LLM could run on any computer without drivers and be relatively simple to use. By setting up the USB device as a composite device which presents a filesystem to the host computer, all a user has to do to interact with the LLM is to create an empty text file with a filename, and the LLM will automatically fill the file with generated text. While it’s not blindingly fast, [Binh] believes this is the first plug-and-play USB-based LLM, and we’d have to agree. It’s not the least powerful computer to ever run an LLM, though. That honor goes to this project which is able to cram one on an ESP32.

Continue reading “USB Stick Hides Large Language Model”