Intuitive Explanation Of Arithmetic, Geometric & Harmonic Mean

The simple definition of a mean is a numeric quantity that represents the center of a collection of numbers. The trick lies in deciding what kind of center suits a particular collection, as beyond the arithmetic mean (AM for short, the sum of all values divided by their count) there are many more, with the other two classical Pythagorean means being the geometric mean (GM) and the harmonic mean (HM).

The question that many start off with is what the GM and HM are and why you’d want to use them, which is why [W.D.] wrote a blog post on the topic, one that they figure should be somewhat more intuitive than digging through search results or consulting the Wikipedia entries.

Compared to the AM, the GM uses the product of the values rather than their sum, which makes it a good fit for data like percentage changes. One thing that [W.D.] argues for is using logarithms to grasp the GM, since the GM is just the exponential of the AM of the logarithms, which makes the connection between the two means obvious. Finally, the HM is useful for something like the average speed across multiple trips over the same distance, and is perhaps the easiest to grasp.
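To make the relationships concrete, here's a quick Python sketch (ours, not from [W.D.]'s post) computing all three means for a handful of made-up growth factors, including the logarithm trick for the GM:

```python
import math

values = [1.10, 0.85, 1.30]  # e.g. year-over-year growth factors

# Arithmetic mean: the sum divided by the count
am = sum(values) / len(values)

# Geometric mean: the nth root of the product, or equivalently the
# exponential of the arithmetic mean of the logarithms
gm = math.exp(sum(math.log(v) for v in values) / len(values))

# Harmonic mean: the reciprocal of the arithmetic mean of the reciprocals
hm = len(values) / sum(1 / v for v in values)

print(f"AM={am:.4f}  GM={gm:.4f}  HM={hm:.4f}")  # AM >= GM >= HM always holds
```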

Ultimately, the Pythagorean means and their non-Pythagorean brethren are useful for things like data analysis and statistics, where using the right mean can reveal interesting features in the data, much like how a different measure of center, such as the median, can sometimes make a lot more sense. The latter mostly comes up in the hazy field of statistics, obviously.

No matter what approach works for you to make these concepts ‘click’, they’re all very useful things to comprehend, as much of everyday life revolves around them, including concepts like ‘mean time to failure’ for parts.


Top image: Cycles of sunspots for the last 400 years as an example data set to apply statistical interpretations to. (Credit: Robert A. Rohde, CC BY-SA 3.0)

A Second OctoPrint Plugin Has Been Falsifying Stats

The ongoing story of bogus analytical data being submitted to the public OctoPrint usage statistics has taken a surprising turn with the news that a second plugin was being artificially pushed up the charts. At least this time, the developer of the plugin has admitted to doing the deed personally.

Just to recap, last week OctoPrint creator [Gina Häußge] found that somebody had been generating fictitious OctoPrint usage stats since 2022 in an effort to make the OctoEverywhere plugin appear more popular than it actually was. It was a clever attempt, and if it weren’t for the fact that the fake data reported itself as coming from a significantly outdated build of OctoPrint, there’s no telling how long it would have continued. When the developers of the plugin were confronted, they claimed it was an overzealous user operating on their own initiative, and denied any knowledge that the stats were being manipulated in their favor.

Presumably it was around this time that Obico creator [Kenneth Jiang] started sweating bullets. It turns out he’d been doing the same thing, for just about as long. When [Gina] contacted him about the suspicious data she was seeing regarding his plugin, he owned up to falsifying the data and published what strikes us as a fairly contrite apology on the Obico blog. While this doesn’t absolve him of making a very poor decision, we respect that he didn’t try to shift the blame elsewhere.

That said, there’s at least one part of his version of events that doesn’t quite pass the sniff test for us. According to [Kenneth], he first wrote the script that generated the fake data back in 2022 because he suspected (correctly, it turns out) that the developers of OctoEverywhere were doing something similar. But after that, he says he didn’t realize the script was still running until [Gina] confronted him about it.

Now admittedly, we’re not professional programmers here at Hackaday. But we’ve written enough code to be suspicious when somebody claims a script they whipped up on a lark was able to run unattended for two years and never once crashed or otherwise bailed out. We won’t even begin to speculate where said script could have been running since 2022 without anyone noticing…

But we won’t dwell on the minutiae here. [Gina] has once again purged the garbage data from the OctoPrint stats, and hopefully things are finally starting to reflect reality. We know she was already angry about the earlier attempts to manipulate the stats, so she’s got to be seething right about now. But as we said before, these unfortunate incidents are ultimately just bumps in the road. We don’t need any stat tracker to know that the community as a whole greatly appreciates the incredible work she’s put into OctoPrint.

The Guinness Brewery Invented One Of Science’s Most Important Statistical Tools

The Guinness brewery has a long history of innovation, but did you know that it was the birthplace of the t-test? A t-test is usually what underpins a declaration of results being “statistically significant”. Scientific American has a fascinating article all about how the Guinness brewery (and one experimental brewer in particular) brought it into being, with ramifications far beyond that of brewing better beer.

William Sealy Gosset (aka ‘Student’), self-trained statistician. [source: user Wujaszek, Wikipedia]

Head brewer William Sealy Gosset developed the technique in the early 1900s as a way to more effectively monitor and control the quality of stout beer. At Guinness, Gosset and other brilliant researchers measured everything they could in their quest to optimize and refine large-scale brewing, but there was a repeated problem. Time and again, existing techniques of analysis were simply not applicable to their gathered data, because sample sizes were too small to work with.

While the concept of statistical significance was not new at the time, Gosset’s key contribution was finding a way to effectively and economically interpret data in the face of small sample sizes. That contribution was the t-test: a practical and logical approach to dealing with uncertainty.

As mentioned, t-testing had ramifications and applications far beyond that of brewing beer. The basic question of whether to consider one population of results significantly different from another population of results is one that underlies nearly all purposeful scientific inquiry. (If you’re unclear on how exactly the t-test is applied and how it is meaningful, the article in the first link walks through some excellent and practical examples.)
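For the curious, here's a minimal sketch of a two-sample t-test in Python using SciPy, with invented small-sample data rather than anything from the article; small n is exactly the situation Gosset's t-distribution was built for:

```python
from scipy import stats

# Hypothetical small-sample data: extract yields from two barley strains.
strain_a = [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89]
strain_b = [4.17, 3.05, 5.18, 4.01, 6.11, 4.10, 5.17, 3.57]

# Two-sample t-test: is the difference between the means significant,
# or plausibly just noise from two small samples of the same population?
t_stat, p_value = stats.ttest_ind(strain_a, strain_b)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A p-value below the usual 0.05 threshold would suggest the strains differ.
```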

Dublin’s Guinness brewery has a rich heritage of innovation, so maybe spare them a thought the next time you indulge in statistical inquiry, or in a modern “nitro brew” style beverage. But if you prefer to keep things ultra-classic, there’s always beer from 1574, Dublin castle-style.

Full Self-Driving, On A Budget

Self-driving is currently the Holy Grail of the automotive world, with a number of companies racing to build general-purpose autonomous vehicles that can get from point A to point B with no user input. While no one has brought one to market yet, at least one company has promised this feature, and had customers pay for it, while continually moving the goalposts for delivery because of how challenging the problem has turned out to be. But it doesn’t need to be that hard or expensive to solve, at least in some situations.

The situation in question is driving on a single stretch of highway, and the build focuses only on steering, so it doesn’t handle the accelerator or brake pedals. The highway is driven normally while a webcam takes images of the route and an Arduino captures data about the steering angle. The idea here is that with enough training, the Arduino could eventually steer the car. But first some math needs to happen on the training data: since the steering wheel spends most of the drive not actually turning the car, the data has to be rebalanced so that genuine steering events aren’t written off as statistical anomalies. After the training, the system does a surprisingly good job at “driving” based on this data, and does it on a budget not much larger than the cost of a laptop, a microcontroller, and a webcam.
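The project's actual code isn't reproduced here, but a rebalancing step of the kind described might look something like this sketch, with made-up steering angles and an assumed "straight ahead" threshold:

```python
import random

# Hypothetical training samples: (frame_id, steering_angle_in_degrees).
# On a highway, the vast majority of frames are near zero, so a naive
# model just learns "go straight"; downsampling those frames rebalances it.
random.seed(0)
samples = [(i, random.gauss(0, 1) if random.random() < 0.95 else random.gauss(0, 15))
           for i in range(10_000)]

STRAIGHT_THRESHOLD = 2.0      # degrees; assumed cutoff for "not really steering"
KEEP_STRAIGHT_FRACTION = 0.1  # keep only 10% of the straight-ahead frames

balanced = [(i, angle) for i, angle in samples
            if abs(angle) > STRAIGHT_THRESHOLD
            or random.random() < KEEP_STRAIGHT_FRACTION]

print(f"{len(samples)} samples -> {len(balanced)} after rebalancing")
```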

Admittedly, this project was a proof of concept to investigate machine learning, neural networks, and other statistical algorithms used in these sorts of systems, and it doesn’t actually drive any cars on any roadways. Even the creator says he wouldn’t trust it himself, but that he was pleasantly surprised by the results of such a simple system. It could also be expanded to handle the brake and accelerator pedals with separate neural networks. It’s not our first budget-friendly self-driving system, either; that one made it happen with the enormous computing resources of a single Android smartphone.


Putting A Cheap Laser Rangefinder Through Its Paces

Sometimes a gizmo seems too cheap to be true. You know there’s just no way it’ll work as advertised — but sometimes it’s fun to find out. Thankfully, if that gadget happens to be a MILESEEY PF210 Hunting Laser Rangefinder, [Phil] has got you covered. He recently got his hands on one (for less than 100 euros, which is wild for a laser rangefinder) and decided to see just how useful it actually was.

The instrument in question measures distances via the time-of-flight method; it bounces a laser pulse off of some distant (or not-so-distant) object and measures how long the pulse takes to return. Using the speed of light, it can calculate how far the pulse has traveled, and half of that round trip is the range to the target.
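The arithmetic is about as simple as it gets; a quick sketch with a made-up round-trip time:

```python
C = 299_792_458  # speed of light in m/s

def distance_from_round_trip(t_seconds: float) -> float:
    """Range to the target from a round-trip time-of-flight measurement.
    The pulse travels out and back, hence the division by two."""
    return C * t_seconds / 2

# A target ~800 m away returns the pulse after roughly 5.34 microseconds.
print(f"{distance_from_round_trip(5.34e-6):.1f} m")  # ~800 m
```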

As it turns out, it worked surprisingly well. [Phil] decided to focus his analysis on accuracy and precision, arguably the most important qualities you’d look for in such an instrument. We won’t get into the statistical nitty-gritty here, but suffice it to say that [Phil] did his homework. To evaluate the instrument’s precision, he took ten measurements against each of ten different targets at ranges between 2.9 m and 800 m. He found that it was incredibly precise (almost perfectly repeatable) at short distances, and still pretty darn good way out at 800 m (±1 m repeatability).

To test the accuracy, he took a series of measurements and compared them against their known values (pretty straightforward, right?). He found that the instrument was accurate to within 3% at worst, and usually did even better than that.
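[Phil]'s raw numbers aren't reproduced here, but both metrics take only a couple of lines of Python; this sketch uses invented readings against a known 800 m target:

```python
import statistics

# Ten invented measurements of a target at a known 800 m range.
readings = [799.0, 800.0, 800.0, 801.0, 800.0, 799.0, 800.0, 800.0, 801.0, 800.0]
KNOWN_RANGE = 800.0  # metres

# Precision: how repeatable the readings are, regardless of the true value.
spread = statistics.stdev(readings)

# Accuracy: how far the average reading sits from the known value.
error_pct = abs(statistics.mean(readings) - KNOWN_RANGE) / KNOWN_RANGE * 100

print(f"repeatability ~ +/-{spread:.1f} m, mean error {error_pct:.2f}%")
```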

While this may not be groundbreaking science, it’s really nice to be reminded that sometimes a cheap instrument will do the job, and we love that there are dedicated folks like [Phil] out there who are willing to put the time in to prove it.

Using Statistics Instead Of Sensors

Statistics often gets a bad rap in mathematics circles for being less than concrete at best, and downright misleading at worst. While these sentiments might ring true for things like political polling, they hide the fact that statistical methods can be put to good use in engineering systems with fantastic results. [Mark Smith], for example, has been working on an espresso machine that can pull the perfect shot, and turned to one of the tools in the statistics toolbox to solve a problem rather than adding yet another sensor to his already complex coffee-brewing machine.

To make espresso, hot water is forced at high pressure through finely ground coffee. [Mark] found that his espresso machine was often pouring too much or too little coffee, and to improve the machine’s accuracy in this area he turned to the linear regression statistic R², also known as the coefficient of determination. R² measures how much of the variation in a data set is explained by a fitted model; by using an algorithm tuned to this value, a computer can more easily tell when the coffee begins pouring out of the portafilter and into the espresso cup based on the pressure and water flow in the machine itself, rather than on some other input such as the weight of the cup.
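[Mark]'s implementation isn't shown here, but the gist can be sketched: fit a line to a sliding window of flow readings and watch R² jump once a steady pour begins. Everything below, from the synthetic data to the window size and threshold, is an assumption for illustration:

```python
import numpy as np

def r_squared(y: np.ndarray) -> float:
    """Coefficient of determination of a linear fit over equally spaced samples."""
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)
    ss_res = np.sum((y - (slope * x + intercept)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic flow readings: noisy and flat before the pour, ramping steadily after.
rng = np.random.default_rng(1)
flow = np.concatenate([
    rng.normal(0.0, 0.05, 50),                               # pre-pour sensor noise
    np.linspace(0.0, 5.0, 50) + rng.normal(0.0, 0.05, 50),   # steady pour
])

WINDOW, THRESHOLD = 10, 0.8  # assumed window size and R-squared cutoff
for i in range(len(flow) - WINDOW):
    if r_squared(flow[i:i + WINDOW]) > THRESHOLD:
        print(f"Pour detected around sample {i}")
        break
```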

We have seen in the past how seriously [Mark] takes his coffee-making, and this is another step in a series of improvements he has made to his equipment. In this iteration, he has additionally produced a simulation in JupyterLab to better assist him in modeling the system and making even more accurate predictions. It’s quite a bit more effort than adding a sensor, but since his espresso machine already packs plenty of computing power, it’s not too big a leap for him to make.

Hackers, Fingerprints, Laptops, And Stickers

A discussion ensued about our crazy hacker ways the other night. I jokingly suggested that with as many stickers as we each had on our trusty companion machines, they might literally be as unique as a fingerprint. Cut straight to nerds talking too much math.

First off, you could wonder about the chances of two random hackers having the same sticker on their laptops. Say, for argument’s sake, that globally there are 2,000 stickers per year that are cool enough to put on a laptop. (None of us will see them all.) If a laptop lasts five years, that’s a pool of 10,000 stickers to draw from. If you’ve only got one sticker per laptop, that’s a one-in-10,000 chance of a match: pretty slim odds, even when the laptops are of the same vintage.

Real hackers have 20-50 stickers per laptop — call it 30 on average, at least in our sample of “real hackers”. Here, the Birthday Paradox kicks in and helps us out. Each additional sticker provides another shot at matching, and an extra shot at being matched. So while you and I are unlikely to share a birthday, in a room of 42 people it’s over 90% likely that two of them will. With eight of us in the room, that’s 240 stickers in play: each one could match any of the 210 stickers on the other seven laptops, and halving that to avoid double counting gives 240 × 210 / 2 = 25,200 possible pairs. (9999 / 10000) ^ 25200 works out to about an eight percent chance of no match at all, so a better than 90% chance that we’d have at least one matching sticker.
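If you want to check the back-of-the-envelope math yourself (assuming 30 stickers per laptop drawn uniformly from the 10,000-sticker pool):

```python
POOL, LAPTOPS, PER_LAPTOP = 10_000, 8, 30
stickers = LAPTOPS * PER_LAPTOP                  # 240 stickers in the room

# Each cross-laptop pair misses with probability 9999/10000; count the pairs:
# every sticker against the 210 stickers on the other seven laptops, halved
# to avoid double counting.
cross_pairs = stickers * (LAPTOPS - 1) * PER_LAPTOP // 2   # 25,200 pairs
p_no_match = (1 - 1 / POOL) ** cross_pairs

print(f"P(no match)           = {p_no_match:.3f}")       # about 0.08
print(f"P(at least one match) = {1 - p_no_match:.3f}")   # better than 90%
```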

But that doesn’t answer the original question: are our be-stickered laptops unique, like fingerprints or snowflakes? There, you have to match each and every sticker on the laptop — a virtually impossible task, and while there were eight of us in the room, that’s just not enough to get any real juice from the Birthday Paradox. (1/10,000) ^ 30 is a number with 120 in the negative exponent; the odds against a full match dwarf the number of atoms in the universe (a mere 10^80 or so), much less the number of hackers in a room, whether you take things to the eighth power or not.

I hear you mumbling “network effects”. We’ve all gone to the same conferences, and we have similar taste in stickers, and maybe we even trade with each other. Think six degrees of separation type stuff. Indeed, this was true in our room. A few of us had the same stickers because we gave them to each other. We had a lot more matches than you’d expect, even though we were all unique.

So while the math for these network effects is over my head, I think it says something deeper about our trusty boxen, their stickers, and their hackers. Each sticker comes with a memory, and our collected memories make us unique, like our laptops. But matching stickers are more than pure Birthday Paradox coincidences; they represent the shared history of friends.

Wear your laptop stickers with pride!