Why Model Collapse In LLMs Is Inevitable With Self-Learning

There is a persistent belief in the ‘AI’ community that large language models (LLMs) have the ability to learn and self-improve by tweaking the weights in their vector space. Although there’s scant evidence that tweaking a probability vector space is anything like the learning process in biological brains, we nevertheless get sold the idea that artificial general intelligence (AGI) is just around the corner if we do just enough tweaking.

Instead of emergent superintelligence, the most likely outcome is what is called model collapse, with a recent paper by [Hector Zenil] going over the details of why self-training/learning in LLMs and similar systems is a fool’s errand. For those who just want the brief summary with all the memes, [Metin] wrote a blog post covering the basics.

In the end, both an LLM and a diffusion model (DM) are statistical models of their input data, from which a statistically likely output can be generated (inferred) in response to an input query. It follows intuitively that if said output is fed back in to adjust the model, the model will over time converge on a kind of statistical singularity rather than some ‘AI singularity’ event. This is also why these models need to be constantly trained with external, human-generated data in order to prevent such a collapse.

In the paper, [Hector] constructs a mathematical model to demonstrate that an LLM, DM, or similar statistical model undergoes degenerative dynamics whenever said external input is reduced. Although the paper suggests a mechanism to counter the entropy decay within the model, the ultimate point is that a statistical model cannot improve itself without continuous external anchoring.
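The degenerative feedback loop is easy to demonstrate with a toy sketch (ours, not the paper’s actual model): fit a simple Gaussian to some data, sample new “training data” from the fit, refit, and repeat. With no fresh external data mixed in, the fitted distribution steadily loses variance and collapses toward a point.

```python
import random
import statistics

def fit_and_resample(data, n):
    """Fit a Gaussian to the data, then generate n fresh samples from the fit."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(50)]  # the "human-generated" data

for generation in range(300):
    # Each generation trains only on the previous generation's output.
    data = fit_and_resample(data, 50)

# The spread of the surviving "model" is a small fraction of the original 1.0.
print(statistics.pstdev(data))
```

Mixing even a slice of the original data back in each generation, i.e. the external anchoring the paper talks about, is what keeps the spread from decaying.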

The idea of LLMs being at all intelligent in any sense has been a contentious one, with the concept of language models being equated with ‘AI’ dating back to the 20th century, including as fun home computer projects. Much of the problem probably lies in humans projecting intelligent behavior onto these statistical models, turning LLMs into ‘counterfeit humans’, not helped by how closely generated text can resemble something written by a human, even if completely confabulated.

Thanks to [deshipu] for the tip.

Reservoir Sampling, Or How To Sample Sets Of Unknown Size

Selecting a random sample from a set is simple. But what about selecting a fair random sample from a set of unknown or indeterminate size? That’s where reservoir sampling comes in, and [Sam Rose] has a beautifully-illustrated, interactive guide to how reservoir sampling works. As far as methods go, it’s as elegant as it is simple, and particularly suited to fairly sampling dynamic datasets like sipping from a firehose of log events.

While reservoir sampling is simple in principle, it’s not entirely intuitive to everyone. That’s what makes [Sam]’s interactive essay so helpful; he first articulates the problem before presenting the solution in a way that makes it almost self-evident.

[Sam] uses an imaginary deck of cards to illustrate the problem. If one is being dealt cards one at a time from a deck of unknown size (there could be ten cards, or a million), how can one choose a single card in a way that gives each an equal chance of having been selected? Without collecting them all first?

In a nutshell, the solution is to make a decision every time a new card arrives: hold onto the current card, or replace it with the new one. Each new card is given a 1/n chance of becoming held, where n is the number of cards we’ve seen so far. That’s all it takes. No matter when the dealer stops dealing, each card that has been seen will have had an equal chance of ending up the one selected.
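In code, the single-item version of the algorithm fits in a few lines (a minimal sketch in Python; [Sam]’s essay also covers generalizing to holding k items):

```python
import random

def reservoir_sample(stream):
    """Return one item chosen uniformly at random from a stream of
    unknown length, holding only a single item at any moment."""
    chosen = None
    for n, item in enumerate(stream, start=1):
        # The n-th item replaces the held one with probability 1/n.
        if random.random() < 1.0 / n:
            chosen = item
    return chosen
```

The first card is always held (probability 1/1), the second replaces it half the time, and so on; a short induction shows that once n cards have been seen, each has exactly a 1/n chance of being the one held.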

There are a few variations which [Sam] also covers, and practical ways of applying it to log collection, so check it out for yourself.

If [Sam]’s knack for illustrating concepts in an interactive way is your jam, we have one more to point out. Our own Al Williams wrote a piece on Turing machines; the original “universal machine” was a theoretical device with a read/write head and an infinite paper tape. A wonderful companion to that article is [Sam]’s piece illustrating exactly how such a Turing machine would work, in an interactive way.

Intuitive Explanation Of Arithmetic, Geometric & Harmonic Mean

The simple definition of a mean is that of a numeric quantity which represents the center of a collection of numbers. Here the trick lies in defining the exact type of numeric collection, as beyond the arithmetic mean (AM for short, the sum of all values divided by their number) there are many more, with the other two classical Pythagorean means being the geometric mean (GM) and harmonic mean (HM).

The question that many start off with is what the GM and HM are and why you’d want to use them, which is why [W.D.] wrote a blog post on that topic that they figure should be more intuitive than digging through search results or consulting the Wikipedia entries.

Compared to the AM, the GM uses the product of the values rather than their sum, which makes it a good fit for e.g. a data set of percentage changes. One thing that [W.D.] argues for is using logarithms to grasp the GM, as this makes it more intuitive and closer to taking the AM. Finally, the HM is useful for something like the average speed across multiple trips over the same distance, and is perhaps the easiest to grasp.
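All three means are short enough to sketch directly (plain Python; the example numbers are made up):

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # exp of the AM of the logs: the log trick [W.D.] suggests
    # turns the GM into a disguised arithmetic mean.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

# Growth factors of +10%, +50%, -20%: the GM is the equivalent
# constant per-period growth factor.
growth = [1.10, 1.50, 0.80]

# Out at 60 km/h, back at 30 km/h over the same distance: the HM
# gives the true average speed of 40 km/h, not the AM's 45.
speeds = [60.0, 30.0]
```

For any set of positive values, the three always order themselves AM ≥ GM ≥ HM, which is a handy sanity check on whichever one you compute.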

Ultimately, the Pythagorean means and their non-Pythagorean brethren are useful for things like data analysis and statistics, where picking the right mean can reveal interesting patterns in the data, much like how in other cases a measure like the median makes far more sense.

No matter what approach works for you to make these concepts ‘click’, they’re all very useful to comprehend, as much of everyday life revolves around them, including concepts like ‘mean time to failure’ for parts.


Top image: Cycles of sunspots for the last 400 years as an example data set to apply statistical interpretations to. (Credit: Robert A. Rohde, CC BY-SA 3.0)

A Second OctoPrint Plugin Has Been Falsifying Stats

The ongoing story of bogus analytical data being submitted to the public OctoPrint usage statistics has taken a surprising turn with the news that a second plugin was being artificially pushed up the charts. At least this time, the developer of the plugin has admitted to doing the deed personally.

Just to recap, last week OctoPrint creator [Gina Häußge] found that somebody had been generating fictitious OctoPrint usage stats since 2022 in an effort to make the OctoEverywhere plugin appear to be more popular than it actually was. It was a clever attempt, and if it wasn’t for the fact that the fake data was reporting itself to be from a significantly out of date build of OctoPrint, there’s no telling how long it would have continued. When the developers of the plugin were confronted, they claimed it was an overzealous user operating under their own initiative, and denied any knowledge that the stats were being manipulated in their favor.

Presumably it was around this time that Obico creator [Kenneth Jiang] started sweating bullets. It turns out he’d been doing the same thing, for just about as long. When [Gina] contacted him about the suspicious data she was seeing regarding his plugin, he owned up to falsifying the data and published what strikes us as a fairly contrite apology on the Obico blog. While this doesn’t absolve him of making a very poor decision, we respect that he didn’t try to shift the blame elsewhere.

That said, there’s at least one part of his version of events that doesn’t quite pass the sniff test for us. According to [Kenneth], he first wrote the script that generated the fake data back in 2022 because he suspected (correctly, it turns out) that the developers of OctoEverywhere were doing something similar. But after that, he says he didn’t realize the script was still running until [Gina] confronted him about it.

Now admittedly, we’re not professional programmers here at Hackaday. But we’ve written enough code to be suspicious when somebody claims a script they whipped up on a lark was able to run unattended for two years and never once crashed or otherwise bailed out. We won’t even begin to speculate where said script could have been running since 2022 without anyone noticing…

But we won’t dwell on the minutiae here. [Gina] has once again purged the garbage data from the OctoPrint stats, and hopefully things are finally starting to reflect reality. We know she was already angry about the earlier attempts to manipulate the stats, so she’s got to be seething right about now. But as we said before, these unfortunate incidents are ultimately just bumps in the road. We don’t need any stat tracker to know that the community as a whole greatly appreciates the incredible work she’s put into OctoPrint.

The Guinness Brewery Invented One Of Science’s Most Important Statistical Tools

The Guinness brewery has a long history of innovation, but did you know that it was the birthplace of the t-test? A t-test is usually what underpins a declaration of results being “statistically significant”. Scientific American has a fascinating article all about how the Guinness brewery (and one experimental brewer in particular) brought it into being, with ramifications far beyond that of brewing better beer.

William Sealy Gosset (aka ‘Student’), self-trained statistician. [Source: user Wujaszek, Wikipedia]
Head brewer William Sealy Gosset developed the technique in the early 1900s as a way to more effectively monitor and control the quality of stout beer. At Guinness, Gosset and other brilliant researchers measured everything they could in their quest to optimize and refine large-scale brewing, but there was a repeated problem. Time and again, existing techniques of analysis were simply not applicable to their gathered data, because sample sizes were too small to work with.

While the concept of statistical significance was not new at the time, Gosset’s key contribution was finding a way to effectively and economically interpret data in the face of small sample sizes. That contribution was the t-test: a practical and logical approach to dealing with uncertainty.

As mentioned, t-testing had ramifications and applications far beyond that of brewing beer. The basic question of whether to consider one population of results significantly different from another population of results is one that underlies nearly all purposeful scientific inquiry. (If you’re unclear on how exactly the t-test is applied and how it is meaningful, the article in the first link walks through some excellent and practical examples.)
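For the curious, the statistic itself is simple to compute. Here’s a minimal sketch of Student’s two-sample t with pooled variance (the equal-variance setting Gosset worked in), using made-up barley yield numbers:

```python
import math
import statistics

def two_sample_t(a, b):
    """Student's two-sample t statistic with pooled variance.
    Returns the statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    t = (statistics.fmean(a) - statistics.fmean(b)) / se
    return t, na + nb - 2

# Hypothetical yields from two small batches: exactly the
# small-sample regime Gosset faced at the brewery.
batch_a = [14.2, 13.9, 14.5, 14.1, 14.4]
batch_b = [13.1, 13.4, 12.9, 13.3, 13.0]
t, dof = two_sample_t(batch_a, batch_b)
```

Here t works out to roughly 7.6 on 8 degrees of freedom; in practice the statistic is compared against Student’s t distribution (not eyeballed) to get the p-value that “statistically significant” claims rest on.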

Dublin’s Guinness brewery has a rich heritage of innovation so maybe spare them a thought the next time you indulge in statistical inquiry, or in a modern “nitro brew” style beverage. But if you prefer to keep things ultra-classic, there’s always beer from 1574, Dublin castle-style.

Full Self-Driving, On A Budget

Self-driving is currently the Holy Grail in the automotive world, with a number of companies racing to build general-purpose autonomous vehicles that can get from point A to point B with no user input. While no one has brought one to market yet, at least one company has promised this feature and had customers pay for it, while continually moving the goalposts for delivery due to how challenging this problem has turned out to be. But it doesn’t need to be that hard or expensive to solve, at least in some situations.

The situation in question is driving on a single stretch of highway, and the system only handles steering, not the accelerator or brake pedals. The highway is driven normally, with a webcam capturing images of the route and an Arduino capturing data about the steering angle. The idea is that with enough training, the Arduino could eventually steer the car. But first some math needs to happen on the training data: since the steering wheel spends almost all of its time not turning the car, the data has to be rebalanced so that actual steering events aren’t treated as mere statistical anomalies. After training, the system does a surprisingly good job at “driving” based on this data, and does it on a budget not much larger than a laptop, microcontroller, and webcam.
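The rebalancing step might look something like this (a hypothetical sketch: the function name, threshold, and keep fraction are ours, not the project’s):

```python
import random

def rebalance(samples, straight_threshold=2.0, keep_fraction=0.1):
    """Keep every real steering event, but only a random fraction of the
    near-zero-angle frames that dominate highway driving, so that turns
    aren't drowned out as statistical anomalies in the training data."""
    kept = []
    for angle, frame in samples:
        if abs(angle) > straight_threshold or random.random() < keep_fraction:
            kept.append((angle, frame))
    return kept
```

On a data set that is, say, 95% straight-ahead frames, this kind of downsampling shifts the balance enough that a model has to actually learn when to turn rather than always predicting “straight.”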

Admittedly, this project was a proof-of-concept to investigate machine learning, neural networks, and other statistical algorithms used in these sorts of systems, and doesn’t actually drive any cars on any roadways. Even the creator says he wouldn’t trust it himself, but that he was pleasantly surprised by the results of such a simple system. It could also be expanded out to handle brake and accelerator pedals with separate neural networks as well. It’s not our first budget-friendly self-driving system, either. This one makes it happen with the enormous computing resources of a single Android smartphone.

Continue reading “Full Self-Driving, On A Budget”

Putting A Cheap Laser Rangefinder Through Its Paces

Sometimes a gizmo seems too cheap to be true. You know there’s just no way it’ll work as advertised — but sometimes it’s fun to find out. Thankfully, if that gadget happens to be a MILESEEY PF210 Hunting Laser Rangefinder, [Phil] has got you covered. He recently got his hands on one (for less than 100 euros, which is wild for a laser rangefinder) and decided to see just how useful it actually was.

The instrument in question measures distances via the time-of-flight method: it bounces a laser pulse off of some distant (or not-so-distant) object and measures how long the pulse takes to return. Knowing the speed of light, it can calculate the distance the pulse has traveled.
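The arithmetic behind the measurement is a one-liner (sketched here in Python; the factor of two accounts for the round trip out and back):

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_distance(round_trip_seconds):
    """Distance to the target from a round-trip time-of-flight measurement."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0

# An 800 m target returns the pulse in about 5.3 microseconds, so the
# electronics have to resolve timing differences on the order of
# nanoseconds to achieve meter-level repeatability.
```

That timing requirement is what makes meter-level precision at 800 m genuinely impressive for a sub-100-euro gadget.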

As it turns out, it worked surprisingly well. [Phil] decided to focus his analysis on accuracy and precision, arguably the most important features you’d look for while purchasing such an instrument. We won’t get into the statistical nitty-gritty here, but suffice it to say that [Phil] did his homework. To evaluate the instrument’s precision, he took ten measurements against each of ten different targets of various ranges between 2.9 m and 800 m. He found that it was incredibly precise (almost perfectly repeatable) at low distances, and still pretty darn good way out at 800 m (±1 m repeatability).

To test the accuracy, he took a series of measurements and compared them against their known values (pretty straightforward, right?). He found that the instrument was accurate to within a maximum of 3% (but was usually even better than that).

While this may not be groundbreaking science, it’s really nice to be reminded that sometimes a cheap instrument will do the job, and we love that there are dedicated folks like [Phil] out there who are willing to put the time in to prove it.