Hackaday Links: July 21, 2024


When monitors around the world display a “Blue Screen of Death” and you know it’s probably your fault, it’s got to be a terrible, horrible, no good, very bad day at work. That’s likely the situation inside CrowdStrike this weekend, as engineers at the cybersecurity provider struggle to recover from an update rollout that went very, very badly indeed. The rollout, which affected enterprise-level Windows 10 and 11 hosts running their flagship Falcon Sensor product, resulted in machines going into a boot loop or just dropping into restore mode, leaving hapless millions to stare at the dreaded BSOD screen on everything from POS terminals to transit ticketing systems.

Tales of woe from the fallout from what’s being called “the largest IT outage in history” are pouring in, including this very bewildered game developer who, while stranded at an airport, had plenty of time to ponder why CrowdStrike broke the cardinal rule of software development by rolling a change to production on a Friday. The good news is that there’s a workaround, but the bad news is that someone has to access each borked machine and manually delete a file to fix it. Current estimates place the number of affected machines at 8.5 million, so that’s a lot of legwork. There’s plenty of time after the fix is rolled out for a full accounting of the impact, including the search for the guilty and persecution of the innocent, but for now, let’s spare a moment’s pity for the devs who must be sweating things out this weekend.

Back in 2011, Craig Fugate of the Federal Emergency Management Agency said of disaster response in the southern US, “If you get there and the Waffle House is closed? That’s really bad.” Thus was born the “Waffle House Index,” an informal measure of a natural disaster’s impact based on whether individual restaurants in the chain that prides itself on always being open are actually up and running. With over 1,900 locations in 25 states, you’d think it would cover just about any emergency, but desperate Texans eschewed the index during the recent extensive power outages in the Houston area caused by Hurricane Beryl by inventing the “Whataburger Index.” We haven’t had the pleasure of this particular delicacy, but it seems Texans can’t get enough of the hamburger chain, enough so that their online app’s location map provides a pretty granular view of a wide swathe of Texas. Plus, the chain thoughtfully color-codes each location’s marker by whether it’s currently open or closed, making it a quick and easy way to check where the power is on or off — at least during regular business hours. Hats off to the enterprising Texans who figured this out, and here’s hoping that life has returned to normal for everyone by now.

While we’re generally not fans of Apple products, which seem overpriced and far too tightly controlled for our liking, we’ve been pretty impressed by some of the results people have reported using their Apple AirTags to recover lost or stolen items — this recent discovery of a cache of stolen tools (fourth item) comes to mind. Results such as that require a “me too” response from the Android side of the market, resulting in the Find My Device network that, perhaps unsurprisingly, doesn’t appear to work very well. The test was pretty much what you’d expect — drop an Android-compatible tag in the mail along with an AirTag and track their journey. The Android tag only reported in a couple of times, while the AirTag provided a comprehensive track of the parcel’s journey through the USPS. Our first thought is that this speaks mostly to the power of being first to market, allowing Apple to have a more completely built-out infrastructure. But this may say more about the previously mentioned flexibility of Android compared to Apple; we know we noped the hell out of participating in Find My Device as soon as it rolled out on our Android phone. Seems like a lot of Android users feel the same way.

And finally, while we haven’t checked out comments on this week’s podcast, we’re pretty sure we’re getting raked over the coals for betraying our ignorance of and lack of appreciation for the finer points of soccer, or football. Whatever you call it, we just don’t get it, but we do understand and agree with our own Lewin Day’s argument that instrument-enhanced officiating isn’t making the game any better. Our argument is that in any sport, the officials are like a third team, one that’s adversarial to both of the competing teams, hopefully equally so, and that giving them super-human abilities isn’t fair to the un-enhanced players on the field/pitch/court/ice. So it was with considerable dismay that we learned that Major League Baseball is experimenting with automatic umpires to call balls and strikes behind the plate. While you may not care about baseball, you have to appreciate the ability of an umpire to stand directly in the line of fire of someone who can hurl a ball fast enough to hit a strike zone about the size of a pizza box in less than 500 milliseconds. Being able to determine if the ball ended up in or out of that box is pretty amazing, not to mention all the other things an umpire has to do to make sure the game is played by the rules. They’re not perfect, of course, and neither are the players, and half the fun of watching sports for us is witnessing the very human contest of wills and skills of everyone involved. It seems like a bad idea to take the humans out of that particular loop.

20 thoughts on “Hackaday Links: July 21, 2024”

  1. The Whataburger App’s mapping function _was_ useful for people heading across town to verify that their destination probably did or didn’t have power. It did not, sadly, tell you how ridiculously overwhelmed open locations would be with people who needed to eat and couldn’t cook.

    For those outside the area, Whataburger is roughly similar in quality and fan devotion to In-N-Out. Both sell high quality fast food made with good fresh ingredients using traditional but distinctive recipes. It’s the other _good burger joint_ and worth checking out if you’re in an area that has them, especially since there is very very little overlap between the chains.

  2. Auromatic updates are what made me switch from Windows for Ubuntu. Now that Canonical are pushing auto-updating snaps so hard I have moved to Debian (and not missed anything!) I use btrfs on the system partition so that I can take a snapshot before (manually) updating, so I can revert to a previous version if necessary. It’s not rocket science.

    I am responsible for only a handful of machines. It beggars belief that system specifiers for large institutions allow automatic updates at all, let alone from third party organisations, and don’t have a test and gradual deployment scheme for any changes to the systems that their organisations rely on, and don’t have a mechanism for reversing a bad update.

    It was a mistake on CrowdStrike’s part – accidents happen, although one wonders about their testing regime – but pure incompetence on the part of many system specifiers.

    1. It’s worse than that. CrowdStrike had their own update mechanism, bypassing Microsoft’s process which would have caught it. CrowdStrike then pushed out an update on a Friday that seemingly wasn’t tested at all, because the most basic test would have revealed that the update was completely faulty. Finally, CrowdStrike classified their driver as a boot driver, thus Windows is not allowed to continue booting normally without it (faulty ordinary drivers are disabled by Windows and allow the boot process to continue).

      You allow CrowdStrike to perform automatic updates once you sign up for their service. Don’t like it? Don’t purchase their product. An antivirus without automatic updates is useless but CrowdStrike isn’t the only vendor.

      1. I’m not sure how “microsoft’s process would have caught it”; they’ve also released plenty of updates that have caused havoc themselves, though not to this scale. No process is perfect, and no stack of processes is perfect either. Sometimes a sniper falls through the holes.

        1. Most severe accidents occur as a result of a combination of regrettable circumstances. My point was that it shouldn’t just be Microsoft and CrowdStrike under the spotlight here, but the people who specify the systems for large organisations, whoever they are, who have put together such fragile systems. Bad updates, viruses and hackers (in the bad sense of the word) have been around for decades now, yet apparently it is widespread not to be able to regress the systems to a working set of software, and to allow changes to the systems ad hoc and untested.

          1. Exactly! As I once said to someone last century who caused havoc when they went directly from dev to prod, “testing on the qa platform would have saved your reputation AND your job.”

    2. “Auromatic updates are what made me switch from Windows for Ubuntu.”

      Ubuntu can update my aura???
      Where’s the man page for that?
      B^)

    3. Accidents do happen, but the fact is they had no mechanism in place to error out gracefully instead of throwing a BSOD.

      Also you want staggered rollout. Half your machines die, you stop the update on the other half. Just good basic 101 practices here. Sadly nobody ever heard of these guys and now their name is mud.

  3. Tile was first to market, not AirTag.

    Apple as ever comes a bit later, a bit more expensive, but learns from others’ mistakes and makes it just work.

    And with proper privacy so that no one, from the people whose devices are reporting AirTag locations through to Apple themselves, can discover your tags’ locations, so no reason to opt out.

  4. Sadly, I’m not really surprised by the automatic umpires. I’m actually more surprised it hasn’t already been implemented in some form. While the fans aren’t always rabid, the players certainly seem to be some of the most avid rules-lawyers in all of sports.

  5. Actually, CrowdStrike broke TWO laws of DevOps: they deployed on a Friday (self-evident), and they *broke rollback* (the crashing having to be fixed by physically going to every individual system and manually doing the fix, rather than being able to remotely or automatically remove the borked file).

  6. Instrument-enhanced officiating is the same as traffic cameras automatically sending speeding tickets.

    Do you want to live in a free world where you need to risk trust in people, or a secure world where every little aspect of society is automated and regulated for you?
