Getting A Proprietary-Bus GPU Onto PCIe Enables Cheaper Local LLMs, For Now

If you’ve been thinking of getting into self-hosting generative AI, but don’t have a big budget for hardware, you might want to check out [Hardware Haven]’s latest video on an unusually cheap GPU option — but you’ll have to do so quickly, before the market realizes the chance for arbitrage and prices rise accordingly.

He’s gotten hold of a 16 GB NVIDIA V100 card for only about a hundred bucks, mostly because it’s not easy to plug in, sitting on an SXM2 socket rather than the PCIe bus. SXM is a mezzanine socket found in datacenter servers, and not something you’re likely to get on your motherboard. Another hundred got him an adapter board to fit this enterprise GPU on a consumer motherboard. That’s still a lot less than the PCIe version of the same card, which will likely set you back a thousand or more unless you get very lucky on eBay.

It’s not the newest card, dating from 2017, but that doesn’t mean it can’t run the latest open models. After 3D printing a fan shroud for the thing so it didn’t cook itself, adding very slightly to the build cost, [Hardware Haven] set to work seeing what it could do. Going head-to-head against an RTX 3060 12 GB, the older V100 delivered more tokens per second at slightly better efficiency, though with much higher idle power.
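If you want to run that kind of head-to-head on your own hardware, the measurement itself is simple: time a fixed generation and divide by the number of tokens produced. Here’s a minimal sketch using the llama-cpp-python bindings; the model file name and prompt are placeholders rather than anything from the video.

```python
# Rough tokens-per-second measurement sketch using llama-cpp-python.
# The GGUF file name and the prompt are placeholders, not from the video.
import time
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_gpu_layers=-1)  # offload every layer to the GPU

start = time.perf_counter()
result = llm("Explain PCIe bifurcation in one paragraph.", max_tokens=256)
elapsed = time.perf_counter() - start

tokens = result["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f} s -> {tokens / elapsed:.1f} tok/s")
```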

Still, it’s nice to see a cheap way to get into local AI, even if it might not still be cheap by the time you read this. Once you have the hardware, you might want some easy software options so you don’t have to spend all day on setup. Of course you only need a hefty GPU to run larger models — you can get into hosting your own AI on a Raspberry Pi, if you’re patient.

Continue reading “Getting A Proprietary-Bus GPU Onto PCIe Enables Cheaper Local LLMs, For Now”

An LLM From “Scratch”

Reading a book about bowling is not the same as actually bowling. If that resonates with you and you want to learn more about large language models, check out the LLM From Scratch project. The hands-on workshop lets you use a Mac, Linux, or Windows PC running Python and common libraries like numpy and torch to build your own bare-bones LLM.

The project takes inspiration from nanoGPT but scales it down so you can train the model in around an hour on a typical computer. It will use an Apple or NVIDIA GPU, if available.
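The “Apple or NVIDIA GPU, if available” part usually comes down to a few lines of device selection in PyTorch. This is just the common idiom, not code lifted from the project:

```python
# Typical PyTorch device selection: prefer CUDA (NVIDIA), then MPS (Apple
# Silicon), and fall back to the CPU. A generic sketch, not taken from the
# LLM From Scratch repository.
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")        # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = torch.device("mps")         # Apple Silicon GPU
else:
    device = torch.device("cpu")

model = torch.nn.Linear(512, 512).to(device)  # any model moves over the same way
print(f"Training on {device}")
```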

Continue reading “An LLM From “Scratch””

AI On Every Machine: The LLM You Probably Didn’t Want

It’s been the story of the last week or so, if you follow the kind of news channels a Hackaday scribe does: Google has quietly installed an LLM as part of the Chrome browser. Reports vary as to exactly when this happened, not least because of confusion with the cloud-based Gemini features also present in the browser, but Chrome users seem to be noticing its effects through slower performance and heavy disk access. Given that Chrome is by far the most popular web browser, this means that billions of users will have downloaded the four-gigabyte Gemini Nano model, and now have an LLM they didn’t know about. It provides advanced auto-correct and other text-suggestion features that the online version of Gemini would presumably be overburdened by, and since it’s exposed through a set of in-browser APIs we expect it will find its way into a lot of websites, online applications, and plugins.

It’s caused a bit of a fuss in some circles, and, we think, with some justification. When billions of computers unwittingly install an extremely energy-intensive software component, the effect on global power consumption will be significant, with a consequent uptick in the carbon footprint of computing. Nor is it a phenomenon restricted to Chrome: Siri, for example, has used a local LLM on Apple devices for a while now. We’ve seen rumblings of discontent and talk of getting European climate regulators involved, but perhaps instead it’s time to have a conversation about local AI models. The key question is not whether they are a good thing to have, but when and how they operate.

While many of us are sick to death of AI slop and have not been lured into AI psychosis by an over-reinforcing chatbot, the fact remains that LLMs can do some useful things, that they’re here to stay whether we like it or not, and that having one under your control on your own computer doesn’t have to be a bad thing. Install Llama.cpp and a model on your machine, and you’ve got an LLM of your very own: your usage data isn’t going to be sold, and your content isn’t going to reinforce the finest plagiarism device the world has ever seen.
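For the curious, here’s roughly what that looks like in practice through the llama-cpp-python bindings around Llama.cpp. The model file name is a placeholder for whichever GGUF model you download, and nothing in this exchange leaves your machine:

```python
# A minimal "LLM of your very own" sketch via the llama-cpp-python bindings.
# The GGUF file name is a placeholder for whatever model you choose to download;
# everything here runs, and stays, on your own machine.
from llama_cpp import Llama

llm = Llama(model_path="some-instruct-model.gguf", n_ctx=4096)

reply = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a terse local assistant."},
        {"role": "user", "content": "Summarise what a GGUF file is."},
    ],
    max_tokens=200,
)
print(reply["choices"][0]["message"]["content"])
```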

Opt-In and Opt-Out

The concerning development with the Chrome LLM is that not only has it been installed without the user’s consent, it runs without their consent too, and they can’t use it for anything except what Google Chrome wants it to be used for. Unlike the Llama.cpp setup mentioned above, it’s not under their control; instead it’s a compute-hungry monster ultimately controlled by Google. The prospect of a future in which multiple pieces of everyday software install their own similarly out-of-control multi-gigabyte CPU-munchers is a concerning one. Anyone who remembers Microsoft’s Clippy grabbing all the resources on a 1990s desktop while its stuttering animation played out will know where this is going.

If local LLMs are an inevitability, what’s needed is a way to make them like any other application, one that the user chooses and installs themselves. Such an LLM could make its services available to applications such as a web browser if the user allows it to, but not run unless asked. It’s fairly obvious that installing Llama.cpp or similar is beyond many users, but it shouldn’t lie beyond the bounds of possibility to package something like it as an application they can install.

We know that the previous paragraph is pie-in-the-sky wishful thinking, and that, as the person in your family who knows computers, you’ll spend your next few Christmases wrestling with six different LLMs running on some elderly family member’s PC. But perhaps in Clippy lies the answer. If consumers can learn to associate built-in AI features with their computer grinding to a halt, just as they did with an office assistant thirty years ago, then perhaps they’ll demand change. We can hope.

The Math You Need To Start Understanding LLMs

Once you peel back the hype and mysticism, large language models (LLMs) are a fascinating application of statistical models, effectively what you get when you dial a basic auto-complete model up to eleven. Analyzing a mind-boggling amount of text and producing meaningful auto-completion results takes quite a bit of math, and a recent three-part article series by [Giles] goes through the basics of inference, that is, the prediction step performed with an already-trained model.

The text is encoded in the LLM’s vector space as token IDs, each token being a text fragment that has some probability of following another, such as ‘desk’ being a plausible continuation of ‘cat’, as in the above photo by [Giles]. During inference the model produces a vector of scores, one per vocabulary entry, from which the next token ID is chosen, and over successive steps a sentence is pieced together. These so-called logits are detailed in the first article in the series, with the second article focusing on vocabulary space and embeddings, as well as the matrix operations used for inference.
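Concretely, a single decoding step is just a softmax over those logits followed by picking (or sampling) a token ID. Here’s a toy sketch with a made-up six-token vocabulary; real models do the same thing over tens of thousands of tokens:

```python
# Toy sketch of one decoding step: logits over a tiny, made-up vocabulary are
# softmaxed into probabilities and the next token ID is sampled from them.
import numpy as np

vocab = ["the", " cat", " sat", " on", " a", " desk"]   # hypothetical token strings
logits = np.array([1.2, 3.5, 0.3, 2.1, 0.9, 2.8])       # raw scores from the model

probs = np.exp(logits - logits.max())
probs /= probs.sum()                                     # softmax

next_id = np.random.choice(len(vocab), p=probs)
print(f"picked token {next_id}: {vocab[next_id]!r} (p={probs[next_id]:.2f})")
```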

Finally, the third article puts all of this together and looks at transformers, the crucial part of the GPT (generative pre-trained transformer) architecture. Of note is the attention mechanism, which takes GPTs beyond merely being glorified auto-complete systems by adding pattern matching. Here we can see how the statistical model of the LLM generates a rather plausible output, at which point the human has to ask themselves to what extent it is actually correct.
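At its core, that attention mechanism is a handful of matrix operations: every token’s query vector is scored against every other token’s key vector, and the results weight a mix of value vectors. A bare-bones sketch with random matrices standing in for the learned projections:

```python
# Bare-bones scaled dot-product attention in NumPy, the mechanism that lets a
# GPT weigh earlier tokens when predicting the next one. The matrices here are
# random stand-ins; a real model learns the Q, K and V projections.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # token-vs-token relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted mix of the value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                                 # four tokens, eight-dim embeddings
Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))
print(attention(Q, K, V).shape)                         # (4, 8): one context-aware vector per token
```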

Of course, a lot more goes into making LLMs and GPTs performant, such as the key-value caches that massively speed up inference.
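The idea behind a key-value cache is simply to keep the key and value vectors already computed for earlier tokens, so each decoding step only has to project the newest token instead of the whole sequence. A rough sketch, again with random matrices in place of learned weights:

```python
# Rough key-value cache sketch: rather than re-projecting the whole sequence at
# every step, only the newest token's key and value are computed and appended.
# The projection matrices are random stand-ins for learned weights.
import numpy as np

rng = np.random.default_rng(1)
d_model = 8
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

k_cache, v_cache = [], []

def step(new_token_embedding):
    # Only the new token gets projected; everything older comes from the cache.
    k_cache.append(new_token_embedding @ W_k)
    v_cache.append(new_token_embedding @ W_v)
    return np.stack(k_cache), np.stack(v_cache)   # full K and V, ready for attention

for _ in range(5):                                 # five decoding steps
    K, V = step(rng.normal(size=d_model))
print(K.shape, V.shape)                            # (5, 8) (5, 8) without recomputing old rows
```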

Why Model Collapse In LLMs Is Inevitable With Self-Learning

There is a persistent belief in the ‘AI’ community that large language models (LLMs) have the ability to learn and self-improve by tweaking the weights in their vector space. Although there’s scant evidence that tweaking a probability vector space is anything like the learning process in biological brains, we nevertheless get sold the idea that artificial general intelligence (AGI) is just around the corner if we do just enough tweaking.

Instead of emergent superintelligence, the most likely outcome is what is called model collapse, with a recent paper by [Hector Zenil] going over the details of why self-training/learning in LLMs and similar systems is a fool’s errand. For those who just want the brief summary with all the memes, [Metin] wrote a blog post covering the basics.

In the end, an LLM, like a diffusion model (DM), is a statistical model of its input data, from which a statistically likely output can be generated (inferred) in response to an input query. It follows intuitively that if that output is fed back in to adjust the model, the model will over time converge on a kind of statistical singularity rather than some ‘AI singularity’ event. This is also why these models need to be constantly trained on external, human-generated data in order to prevent such a collapse.

In the paper, [Hector] constructs a mathematical model to demonstrate that an LLM, DM, or similar statistical model undergoes degenerative dynamics whenever that external input is reduced. Although the paper suggests a mechanism to counter the entropy decay within the model, the ultimate point is that a statistical model cannot improve itself without continuous external anchoring.
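The feedback loop at the heart of the argument can be illustrated with a deliberately crude toy: fit a simple distribution, sample from the fit, refit on those samples, and repeat with no fresh external data. The estimated spread tends to drain away generation after generation, a miniature version of the loss of diversity described above. This is only an illustration of the general phenomenon, not the paper’s formalism:

```python
# Crude model-collapse toy: each "generation" fits a Gaussian to samples drawn
# from the previous generation's fit, with no fresh human-generated data. The
# estimated spread typically drifts toward zero. An illustration only, not the
# formal model from the paper.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=10)     # the original "external" data

for generation in range(41):
    mu, sigma = data.mean(), data.std()            # "train" this generation's model
    if generation % 10 == 0:
        print(f"generation {generation:2d}: sigma = {sigma:.3f}")
    data = rng.normal(mu, sigma, size=10)          # the next generation sees only model output
```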

The idea of LLMs being intelligent in any sense has been a contentious one, with the concept of language models being equated with ‘AI’ dating back to the 20th century, including as fun home computer projects. Much of the problem probably lies in humans projecting intelligent behavior onto these statistical models, turning LLMs into ‘counterfeit humans’, not helped by how closely the generated text can resemble something written by a human, even when it’s completely confabulated.

Thanks to [deshipu] for the tip.

Trying Pair Programming With An LLM Chatbot

When it comes to software developers, there are a few distinct types. There’s the extroverted, chatty type, always out there sharing the latest and newest libraries and projects with everyone, and very much into bouncing ideas off others, regardless of whether those others have any idea what they’re talking about. Then there is the introverted loner, who prefers to tackle programming challenges by bouncing things around inside their own mind and going on long walks to mull things over before committing to anything significant.

This leads to interesting scenarios when it comes to management-enforced ‘optimization’ strategies, like Pair Programming. This approach involves two developers sharing the same computer and keyboard, theoretically doubling the effective output by some kind of metric, but realistically often leading to at least one side feeling pretty miserable and disconnected unless you put two of the chatty types together.

As a certified introverted loner developer, I find that the idea of using an LLM chatbot as a coding assistant naturally triggers unpleasant flashbacks to hours of forced, awkward pair ‘programming’. However, maybe using an LLM chatbot could be more pleasant, because you can skip the whole awkward socializing bit. To give it a fair shake, I put together a little experiment to see whether an LLM-based coding assistant is something that I could come to appreciate, unlike pair programming.

Continue reading “Trying Pair Programming With An LLM Chatbot”

How Anthropic’s Model Context Protocol Allows For Easy Remote Execution

As part of the effort to push Large Language Model (LLM) ‘AI’ into more and more places, Anthropic’s Model Context Protocol (MCP) has been adopted as the standard for connecting LLMs with various external tools and systems in a client-server model. A slight oversight in the architecture of this protocol is that remote command execution (RCE) of arbitrary commands is effectively an essential part of its design, as covered in a recent article by [OX Security].

The flaw is laid out in a detailed breakdown article, and it applies to all implementations regardless of programming language. Essentially, the StdioServerParameters passed to the remote server to create a new local instance on that server can contain any command and arguments, which are then executed in a server-side shell.
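To make that concrete, here’s a minimal sketch using the official MCP Python SDK’s StdioServerParameters. The “untrusted” string is a hypothetical stand-in for attacker-controlled input; the point is that whatever lands in command and args gets launched as a real process by whichever machine spawns the server.

```python
# Minimal sketch of the problem with the MCP Python SDK's StdioServerParameters:
# whatever ends up in `command` and `args` is started as a real process by the
# machine that spawns the server. The "untrusted" string is a hypothetical
# stand-in for attacker-controlled input.
import asyncio
from mcp import StdioServerParameters
from mcp.client.stdio import stdio_client

untrusted = "touch /tmp/pwned && sleep 5"   # imagine this arrived over the network

params = StdioServerParameters(
    command="sh",                           # launched verbatim on the host
    args=["-c", untrusted],                 # the attacker-chosen payload rides along
)

async def spawn():
    # stdio_client() starts the subprocess described by params; by the time the
    # context manager is entered, the payload above has already executed.
    async with stdio_client(params):
        pass

asyncio.run(spawn())
```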

Continue reading “How Anthropic’s Model Context Protocol Allows For Easy Remote Execution”