Patch, Or Your Solid State Drives Roll Over And Die

Expiration dates for computer drives? That’s what a line of HP solid-state drives is facing as the variable for their uptime counter runs out. When it does, the drive “expires” and, well, no more data storage for you!

There is a series of stages in the evolution of software developers as they master their art, and one of them comes in understanding that while they may have a handle on the abstracted world presented by their development environment, they perhaps haven’t considered the moments when the real computer behind it intrudes. Think of the first time you saw an SQL injection attack on a website, for example, or the moment you realised that a variable type is linked to the physical constraints of the memory reserved for it. So people who write software surround themselves with an armoury of things to watch out for as they code, and thus endeavour to produce software less likely to break. Firmly in that arena is the size of the variables you use, and what happens when that limit is reached.

Your Drive Is Good For About 3 Years And 9 Months

Sometimes, though, even developers who should know better get it wrong, and this week brought an unfortunate example for the enterprise wing of the hardware giant HP. Their manufacturer has notified them that certain models of solid-state drives supplied in enterprise storage systems contain an unfortunate bug, in which they stop working after 32,768 hours of uptime. That’s a familiar number to anyone working with base-2 arithmetic, and it hints at a 16-bit signed integer being used to log the hours of uptime: such a variable tops out at 32,767, so on the 32,768th hour the value rolls over and becomes negative. Rather than the drive believing itself to be in a renewed flush of youth, it instead stops working.

Egg on the faces of the storage company then, and an urgently-released patch. We suspect that if you own a stack of these drives you will already know about the issue and be nervously pacing the racks of your data centre.

Have you ever considered what will happen when this rolls over? Bruce W. Stracener [Public domain]
This does raise a question as to how such an issue could manifest itself in 2019. We can forgive developers in the 1960s or 1970s using limited-size variables to store incrementing numbers because there was little experience of rollover bugs and the hardware of their day was often severely constrained. But as we approach the third decade of the 21st century we should have both the experience and the hardware to avoid the trap.

It’s hardly as though there hasn’t been a series of widely publicised rollovers, such as the Year 2000 so-called “Millennium bug”, which have entered our culture to the extent that they’ve been parodied on The Simpsons and in countless other places. We’ve had jokes about the number of McDonald’s burgers sold rolling over, and on a more serious note we’ve seen space probes crash, and as an industry we’ve got an eye towards the UNIX time rollover in 2038. For this still to be a thing today, where have we gone wrong?

How Should We Be Finding Our Firmware Developers?

It’s a question we have to ask ourselves, then: does the effect of Moore’s Law breed complacency? When all the computing devices you code for have effectively limitless resources, do you lose track of the constraints of the hardware?

This is written by a Hackaday scribe whose formative computing experience came with very limited resources: a first machine that was an 8-bit home computer with only 1k of memory. With that in hand, or perhaps with the more modern equivalent experience of coding for one of the smaller microcontrollers, developing with a full awareness of the machine behind the code becomes second nature. When a variable requires two bytes, you know it requires two bytes, because you’ve had to make sure that there is a two-byte space in memory for it. By comparison, it’s easy when declaring an integer variable in a modern IDE for a high-spec machine to forget that if its real-world effect is to reserve two bytes, it can only count up to 32,767 of whatever it is you are counting.

Maybe this will never be a problem that completely goes away. After all, each successive generation must learn about it the hard way, and the old-hands will nod sagely while another satellite crashes or an enterprise server fails. Meanwhile, as always, patch early and patch often.

Header image: Phrontis [CC BY-SA 3.0].

117 thoughts on “Patch, Or Your Solid State Drives Roll Over And Die”

      1. Because some coding teams forbid the use of unsigned integers on the theory that it will greatly reduce bugs.

        I’m more of the opinion that it’s just reducing the types of bugs and not the quantity as someone who isn’t looking out for unsigned overflows will not have the experience to look out for signed overflows either.

        1. For any variable that’s never ever supposed to go negative, use an unsigned integer. And while you’re at it, code a trap and handler for the event that something attempts to write a negative value to the variable. That way the code won’t crash (or won’t just blow up and die) and you’ll get whatever notice you programmed to tell you there’s a problem with whatever part of your code is writing to the variable.

An infamous case of “should have used an unsigned integer” is the spice amount indicator in the Dune 2 game. The internal code used an unsigned value, so when a level was completed the score was correct. But the score display in the game could roll over and start showing weird stuff instead of numbers – because when it rolled over, the code for the score counter would go to the next sections of the texture file that had the art for the score numbers.

          1. Another example of similar logic is the .NET Environment.TickCount property, which is an int, rather than a uint.
            Why would the tick count ever need to be negative? The value must be cast to a uint to reflect reality.

          2. You’ll be surprised how many games behave strangely if variables are set to “invalid” values, for example by hacking the save file with a hex editor. (Best to do that with old/”retro”/”classic” games since they’re unlikely to have a file integrity check.) Not too long ago, I had quite a bit of fun with Zeliard and a hex editor.

    1. Always assume incompetence before conspiracy.

      Without knowing anything more about it, my guess is the development process included a step where copypasta coding plus Dilbert management metrics yielded paychecks for the week, and a problem for someone else to deal with years down the line.

      1. Ehh. This isn’t a hard and fast rule. Especially with a company like HP. There are examples of past behavior. Also understand that real grifters (that do absolutely exist in shockingly large numbers) take this rule into consideration for cover. They’re always gonna make it seem like it could be ignorance or blundering. Plausible deniability.

        That said, could very possibly be just a goof. But don’t let ’em all off just like that.

        1. I worked for many companies. And found that the only reason that people blame their own (previous) company is that they THINK that other companies are better because they never WORKED in those other companies…

Incompetence is everywhere. Please tell me ONE single product on which you worked where you didn’t have to learn anything new… I am sure you cannot name one product for which you didn’t have to learn something new.

          So by your own definition, doesn’t it mean that you were just as incompetent as the others at HP?

          Incompetence is the precursor to learning. So if you work in a high-tech company at the forefront of technology, and think that you are competent, it can only mean that you must be learning nothing. And that can only mean that you never do anything challenging and new in your company.

          So I guess that you are the guy of whom nobody knows what he does? Not being fired, only because someone forgot to assign you a manager and nobody knows that you exist?

Where there’s new technology, there’s learning, and where there’s learning, there is incompetence. You cannot burn your former colleagues at HP for learning as they go, without burning yourself as well.

          1. Just because you do not know everything, that does not make you incompetent. RetepV appears to be creating her own definition of incompetence, attributing it to another, and then using that mangled definition to defend the poor outcome that is the subject of this article.

I try not to attribute to malevolence things which are adequately explained by laziness or a lack of thought.

      There are many examples of inappropriately using int for values better handled with uint.

      For example, the .NET Environment.TickCount property is internally implemented as a uint, but is exposed and reported as an int.

      Microsoft even goes into considerable detail about how the property transitions from Int32.MaxValue to Int32.MinValue, and how to deal with this, without acknowledging that the correct (unsigned) interpretation is available via a simple (uint) cast.
      https://docs.microsoft.com/en-us/dotnet/api/system.environment.tickcount?view=netframework-4.8

      The core issue is that once an API mistake is made, it must be preserved forever, because changing the behavior is an even worse sin.

        1. code that intentionally borks the drive would be grounds for a massive class-action lawsuit, probably the end of HP. There is such a thing as “data recovery” and all drive manufacturers explicitly endorse it.

          1. I’ve actually encountered solid state drives that get to a point (usually when it runs out of wear leveling blocks) where the controller deems the drive end of life, and it reverts to a read-only state so that you can get your data off the drive before it degrades further.

          2. Intel’s consumer level SSDs have a write number counter and when it hits a pre-determined value the drive bricks itself, can’t even read it. (Unless they’ve stopped doing this the past couple of years.) Intel’s enterprise/business SSDs also have a maximum writes counter but when it’s hit the data can still be read, but the drive slows write speed to a crawl to inspire the IT department to get everything copied off.

            I don’t care if it’d take 50 years of use to hit such a counter. I’d never buy a device that’s designed to make my data on it completely inaccessible just because some counter hits a certain value.

Of course a drive that bricks itself on purpose should be cause for a damage claim lawsuit. If it slows writing and allows only normal reading to retrieve/back up your data before it dies completely, I find this not so bad.

      1. Naw, I think Joe is just a legit pretty picture-drawer. I always admire the illustrations here. Playful and fun, but still with a great mechanical gestalt sort of like an engineer’s draftwork in some ways.

  1. Go ahead, take back your forgiveness. Maybe you can get a shovel, dig up the dead guy and have the corpse apologize to you. Maybe you can climb into your trusty time machine and make sure he never gets hired in the first place. Do you have enough charge in your flux capacitor for such a jump? What is your plan here?

  2. That’s some primo irony right there. They didn’t want to use up an extra 16 bits, on their device designed to store literally billions of bits, and as a result none of the bits can be used.

    1. Something along these lines seems like the most likely explanation. It was probably a port of some pre-existing code that had been working fine on other drives. If that is the case, then it makes it more likely that there was never a conversation or review question about how many bits were being used, or what would happen if it rolled over.

      Drive firmware is many thousands of lines, and can be quite sophisticated. Doubtful much energy was spent thinking about the simple power-on-hours variable. And now customers are paying for it.

      1. People still not heeding the lesson of Ariane 5. Test ALL the software, in a full system integration test, BEFORE launch! Be it a rocket or product launch, spend the time and money on a FULL test, or you may end up paying far more later.

While this is a ridiculous bug for various reasons, testing can be problematic on complex systems when the inputs can’t be defined properly or require a large load to test properly. I work on services that could contain a large amount of data for a given item, and a lot of items over time. I don’t have the proper tools to measure the resources over a given time. I can find lots of bugs (I’m good at breaking things), but I can’t tell you how well something will work in production if you can’t give me numbers to expect. And mock data is not real data.

          Sorry for the rant.

If you want a great case of signed int money counters gone horribly wrong, ever play the old shareware game Scorched Earth? There was a “Free Market” setting that would cause the price of ordnance to change between rounds depending on how much was purchased the last time. However, they failed to include a rollover check. So you could buy up a ton of your favorite weapon one round, and the next round the price would roll over and they’d pay you to buy the weapon!

Had issues with controllers on both my and my buddy’s hard drives from Windows 7 failing and killing the drives in Windows 10 (just to give him some music). One a Seagate and one a Western Digital. Poisoned the controller. That’s why they were selling on EvilBay. The combo Momentus drives me up the wall trying to get it to do any tricks. Better off with a full-blown SSD. Bought him a new Toshiba drive, same USB 3.0 that I have, and a link to a 3.0 hub. He had no 3.0 computers yet.

But I like this news. We have seen these uptime issues in switches, Cisco and Nortel, in the past. Have not seen it in a while though. And problems like these are stupid; they knew about it, yet failed to fix it.

  4. Probably some junior developer who didn’t know the size of an “int” varies by CPU architecture, and assumed it would be good for 2 billion hours because that’s the size of an int on their desktop.

    Or maybe it was someone who has been yelled at for “wasting bytes” because they previously used a long when the code reviewer thought it only needed a short.

    Or maybe someone named a variable “counter” when they should have named it “hours_of_life_remaining”.

    If nothing else, this pushes hard for a robust code review process. And undoubtedly nobody wrote unit tests for this code, either.

    I think their development team needs a new, strong leader to instill some modern best practices. They need effective code reviews, automated unit testing, and a way to flash update their drives.

    1. Or they did have unit tests, but those tests took under 3.5 years to run.
      And when one developer asked “shouldn’t that be an unsigned long?”, he was answered “no, for it passes unit tests and therefore is fine”.

What’s the bet they allocated an “int” instead of an “int32_t”?

    I’ve seen systems where “int” is “int8_t” (AVR with -mint8).
    I’ve seen systems where “int” is “int16_t” (16-bit x86, TI TMS320LF2406A)
    I’ve seen systems where “int” is “int32_t” (most 32-bit systems)

    What’s the size of “int”? There is no consistent answer, its use therefore should be considered harmful. Time to read up on “stdint.h” and leave the old legacy types behind.

    1. I think you are probably right – and why I only ever use uint32_t etc in all my embedded code. You need to know what you can fit!

      The other problem is that people don’t seem to realise how long their hardware can be on. Sure, you might think your battery powered device can’t be on for 10 years, but the next release might have the option to be plugged in, and that assumption in the code may come back to bite you.

      Anything that I write – that counts – is done so it can sensibly handle rollover no matter what – it’s normally only one line of code, or structuring the if statement differently.

This was generally true, outside of unconforming implementations like the above AVR, until x86-64 systems came out. The lack of an integer type between short and long in C forced most systems to use 32 bits for ints, ruining the equivalence.

This is why on all the C projects (big or small) I ever worked on (work or home), there was always an include file ‘portab.h’ (one for each architecture) which typedef’d M_INT8, M_UINT8, M_INT16, M_UINT16, M_INT32, M_UINT32, M_INT64, … M_FLOAT32, … M_ADDR, etc., so that the application(s) could easily be recompiled for different platforms without modification and still work the same. Works/worked very well. The reason for the M_ was to make it ‘unique’; some compilers had an INT64 already defined, for example.

  6. Crucial did the same thing with their M4 SSDs–after 5184 hours of power on time, the drive would become unrecognizable even in BIOS. They did offer a firmware fix, but never notified customers of the bug. So…if you crossed the 5184 hour threshold and your drive is no longer recognizable, there’s no way to upgrade the FW.

    I have one such drive with data on it, but Crucial refuses to help–saying that it’s outta warranty, stop bugging us. Needless to say, they’ve lost a customer.

    1. I was affected by this with my very first SSD (Crucial M4). I had just applied new thermal paste to my GPU, so I figured I must have fried it. I went through every possible step to determine the issue. Hours wasted because of some crappy code that killed the drive after so many hours of use.

I love how you can all be critical of programmers in the 60s and 70s for being so shortsighted. You have gigabyte laptops with terabyte drives. When I started programming in the mid-60s, the largest mainframes filled a room but had maybe 256 kilobytes of memory. Data was entered on 80-byte paper cards. Everywhere, space was at a premium. Yes, in this day and age where space is not so restricted, more forethought should have gone into the design.

    1. – Like football for many – easy for them to yell at the coaches or players ‘you idiot, what were you thinking!’, while they couldn’t do a tenth of the job the person does on a day-to-day basis. Maybe armchair developers are a new thing.

Seems quite possible the frequently changed yet persistent power-on hours value was stored in an EEPROM (or EEPROM space in a microcontroller). EEPROM space is often extremely limited; a Microchip 24AA01, for example, is 1K bits. Unmanaged flash is inappropriate for values that change often, as it quickly wears out the flash, and RAM is not persistent. If you only have 128 bytes of persistent storage, every byte could easily matter, so the 16-bit value may have been a deliberate choice. Had they counted 2-hour (or 4-hour) increments in an unsigned value, that would have been good for almost 15 years (30 years for the 4-hour increment), likely longer than the drive life. Would anybody care if the power-on hours reported as 30000 when it was actually 30001?

I must have missed the comment about the usage of SSD drives in the 60s and 70s… :)
If the drive in question took platter packs (let’s say from a Data General Nova system) then your comment might apply, but it does not. On the other hand, if HP is reusing code in their SSDs that was generated in the 60s and 70s, then they have other problems.

My first computer (in the early 80s) was a hand-built Z80 with 128K of memory; the first 32K was fixed and the other 32K could be bank-switched. It was a modified version of Steve Ciarcia’s “Build Your Own Z80 Computer”. The only reason I had that much memory was that my Dad worked for National Semiconductor and had friends…

Well, I started on a computer with 256 bytes of memory, and I still wouldn’t have written code then that would crash the computer if a counter rolled over. Sure, I might have only used one byte, or a nibble (used quite often), or even a bit, for a variable, but then I’d make sure that if it overflowed, something half-sensible happened, not bricking the whole thing.
So it is either lazy, stupid programming, or planned obsolescence.

HP’s enterprise hardware group is infamous for stuff that leads me to believe this wasn’t entirely an accident. They’re one of the only manufacturers that makes you pay a license for parity RAID. RAID5/6 requires a license. Think about that. Their iLO is also crippled if you don’t pay a license for that as well.

The real tell is that HP firmware updates can’t be had without a support contract. This is a way to force people to get a support contract. “Oh well, shucks, we introduced this firmware bug, which we fixed soon after you bought your machine. If you’d been on our support contract, it would’ve been patched…”

    1. And that’s why when Carly Fiorina (then-HP CEO) purchased Compaq, the vast majority of data center managers switched to Dell. Be it a company or a country, choose your leaders wisely. History shows all too often that handing over control to a nitwit is a one-way ticket to losing it all. A sad ending for a once-great company.

  9. The biggest question is why _any_ value of the uptime counter can result in an otherwise sound drive appearing to fail? That to me seems like the key issue. Even if the drive has been running for 2^32 hours, if the flash (improbably) hadn’t stopped retaining data and the drive still functioned it seems like a bad idea to inject a made-up failure that isn’t legitimately triggered by an actual underlying failure.

SSDs will eventually hit the point where few enough blocks still work that the drive can no longer function reliably, but that point should be based on actual failures. _Maybe_ they’re doing something clever like using the hour counter to refresh blocks that are nearing the flash’s underlying retention time since the last write/erase cycle, and the drive panics if a block has been written in the future? Still a silly bug to make it to the field!

    1. The saddest bit is that uptime hours is a poor metric for wear-and-tear on SSDs. For spinning media past the left-hand (early failure) part of the bathtub curve it is a statistically useful metric but doesn’t predict failure of individual drives in a meaningful way. For SSDs it’s just silly though.

The thing that gets me is why the drive would croak if the running time indicator loops around. How many cars croaked in the old days when the odometer rolled over at 100K miles? I wonder, should a Tesla last that long, whether it will have a similar bug…

Speaking of Tesla… the Model S (until the last couple of years) used a flash chip that was large enough back in 2012, but as software features have been added, the space has become rather tight. What’s caused a problem is that some numpty thought it’d be great to leave the underlying Linux system’s logging enabled and dump it onto the flash chip where the operating system and car software live. The car software’s log also goes onto that chip. That’s the log Tesla service uses to diagnose problems with the car’s systems.

      As features have been added to the software, the space for the logs has become smaller and smaller. Since the car software is mostly static, the chip’s wear leveling has had less and less space to spread out the wear caused by the logging.

      *POP* *FZZZT* the chip dies, your Tesla dies. You need a whole new central computer module – or send it off to one of the aftermarket shops that will replace the chip and install the latest software for your Model S – with the Linux logging disabled to stretch out the life of the chip.

      The current Model S computer has a larger flash chip and IIRC the car software log goes to a removable SD card, but (also IIRC) the Linux log was still left enabled, and going to the main flash chip. D’oh!

      1. Like everything in life, if an engineer built it, the bean counters cut the numbers so they could save .05 cents per unit. Intel shipped an “Entry Level” motherboard. Maxed out at 2 gig ram. Who figured out that would be good?
        I suspect the same mentality exists in SSD units shipped.
        Or, you could simply say, buy more product as soon as it stops working.

On the scale of “bare metal – high-level scripting language in the cloud”, the Hackaday crowd might be closer to the hardware side, but let me assure you there are coders out there who really don’t consider any hardware limitations at all, and computer science programs where the notion of things actually having to run on some real hardware, instead of being abstract math in limitless space, is barely a side note if even mentioned (and that’s actually a good thing). The bad thing this leads to is the wrong people doing the close-to-hardware programming. Get embedded systems programming from someone who can actually do it, and get business logic and scientific projects from the people on the other end of the scale.

True embedded (firmware) engineers are rare nowadays; most kids graduate with CS degrees and don’t have a clue what they are doing, and there aren’t enough good listeners to learn what they need from the senior guys before they retire. Stack on top of that the fact that there are a lot of H-1B workers going into software engineering, and upwards of 90% of Indian graduates cannot write compilable code (this is well documented), with similar conditions among the other groups that tech countries are importing, and you will start to understand why these high-level tools are necessary. You can’t fix stupid, and you can’t mentor a group when only a small portion of that group has a true passion for what they are doing, and the rest just want a job.

Just remembered: I had a Quantum 40GB drive I ran for a decade. I got it used for cheap and it was throwing SMART errors; on running SMART tools against it, all I was seeing was that the power-on hours were negative, and one of the other attributes was screwy, a number that made no sense – forget which one it was, something that normally had a low range. Anyway, external tests run on the drive suggested it was working fine, and it ran until a couple of years ago when I stopped using that machine. So I think stuff like this might happen semi-frequently; it just doesn’t always brick the drive. (Unless you consider SMART warnings bricked.)

  14. Being a little self serving, but. Maybe we should be hiring CS grads instead of or in addition to EE’s for firmware. A good CS grad of the low-level persuasion with C++ experience is going to have run into this stuff before, They’re told to look for it pretty constantly.

  15. I believe it’s the size of the field for the SMART data, which is a protocol to query drive health related state. Rather than being any sneaky programmer’s obsolete attempt to save bits. Bits do matter somewhat on these ASIC microcontrollers as they often have a small amount of RAM and an even smaller amount of high-endurance NVRAM.

Planned obsolescence was taught to me in 8th grade, so I subscribe to the theory that HP has done this intentionally, with the intent to part people from their money while not disclosing the lifespan at the time of purchase. Yes, that would constitute fraud.
However, it doesn’t matter whether it is incompetence, negligence or intentional at this point.
This is HP’s problem; they created it, not me, so it is now their obligation to make it right, not mine.
When an auto manufacturer, say, makes a car that they discover will blow up and kill your family in 3 years, they don’t send you a bag of parts with instructions on how to fix it yourself, do they? Yet this is in effect what HP offers.
HP needs to replace every single one of these drives, with an apology letter included, so people can at least have the opportunity to clone their drives.
Let’s say I accept HP’s inadequate offer of a “patch” that I have to flash myself. What if something goes wrong, what if I make a mistake, what if I end up losing all my data, what then? Then HP blames you (and maybe it is your fault), says go BUY a new drive, and oh yeah, too bad about all your data.
No, HP must replace drives or be class-action sued out of existence.

  17. I want to point out the article is actually incorrect, and I’ve seen this same mistake repeated quite a bit in the tech press. HP is NOT the company that has these failing drives. It’s HPE. The “E” is important. Hewlett Packard spun off “Hewlett Packard Enterprise” 5 years ago. They are not the same corporation. They don’t even have the same stock ticker.

    No HP systems are affected, only certain HPE servers using a particular rebranded third party SSD. Course, if you’re one of those clients with that particular combination…

    Please check your sources next time!

Business-level HP systems vs consumer-grade plastic boxes can’t be compared.
And for the record, no manufacturer builds computers.
They actually sketch them out on a tablet and issue an R.F.P.
Asian contractors agree to build to the RFP for a price, usually the lowest price submitted.
The drives are built by others, as are the MoBo, DVD (if any), video chip, RAM, and USB connections.
Bean counters forced the designers to attach most accessories to the USB channel.
This is simply a cost-reduction maneuver.
Consumer-grade systems are absolute minimum-spec machines.
Therein lies the rub.
While businesses often will pay for support, the consumer does not.
So a minimum system without support gets a “bad rap”.
Add to that the bean counters chopping specs down, and it’s a wonder any of them run.

Sounds like planned obsolescence to me; having used Dell, HP and Fujitsu in the data centres I’ve worked in for years, HP stuff always seemed to have a faster turnaround. Sounds like a certain French car company, who build ECUs with an expiry date, where stuff starts to fail, requiring you to buy a new ECU, as things like the wipers or headlights are not actually broken. Easy fix: every year disconnect the battery, and when you plug back in and the car asks the date, just go with what it displays, which is its incept date. As long as the year does not roll over, the ECU is too dumb to notice. Got told that by a factory mechanic, and have first-hand experience.
