Patch, Or Your Solid State Drives Roll Over And Die

December 10, 2019

Expiration dates for computer drives? That’s what a line of HP solid-state drives are facing as the variable for their uptime counter is running out. When it does, the drive “expires” and, well, no more data storage for you!

There are a series of stages in the evolution of a software developer as they master their art, and one of those stages comes in understanding that while they may have a handle on the abstracted world presented by their development environment they perhaps haven’t considered the moments in which the real computer that lives behind it intrudes. Think of the first time you saw an SQL injection attack on a website, for example, or the moment you realised that a variable type is linked to the physical constraints of the number of memory locations it has reserved for it. So people who write software surround themselves with an armoury of things they watch out for as they code, and thus endeavour to produce software less likely to break. Firmly in that arena is the size of the variables you use and what will happen when that limit is reached.

Your Drive Is Good For About 3 Years And 9 Months

Sometimes though even developers that should know better get it wrong, and this week has brought an unfortunate example for the enterprise wing of the hardware giant HP. Their manufacturer has notified them that certain models of solid-state disk drives supplied in enterprise storage systems contain an unfortunate bug, in which they stop working after 32,768 hours of uptime. That’s a familiar number to anyone working with base-2 numbers and hints at a 16-bit signed integer in use to log the hours of uptime. When it rolls over the value will then be negative and, rather than the drive believing itself to be in a renewed flush of youth, it will instead stop working.

Egg on the faces of the storage company then, and an urgently-released patch. We suspect that if you own a stack of these drives you will already know about the issue and be nervously pacing the racks of your data centre.

Have you ever considered what will happen when this rolls over? Bruce W. Stracener [Public domain]

This does raise a question as to how such an issue could manifest itself in 2019. We can forgive developers in the 1960s or 1970s using limited-size variables to store incrementing numbers because there was little experience of rollover bugs and the hardware of their day was often severely constrained. But as we approach the third decade of the 21st century we should have both the experience and the hardware to avoid the trap.

It’s hardly as though there have not been a series of widely publicised rollovers such as the Year 2000 so-called “Millennium bug” which have entered our culture to the extent that they’ve been parodied on the Simpsons and in countless other places. We’ve had jokes about the number of McDonald’s burgers sold rolling over, and on a more serious note we’ve seen space probes crash and as an industry we’ve got an eye towards the UNIX time rollover in 2038. For this still to be a thing today, where have we gone wrong?

How Should We Be Finding Our Firmware Developers?

It’s a question we have to ask ourselves then, does the effect of Moore’s Law breed complacency? When all the computing devices for which you code have effectively limitless resources, do you lose track of the constraints of the hardware?

This is written from a formative computing experience with very limited resources as a Hackaday scribe whose first machine was an 8-bit home computer with only 1k of memory. With that in hand, or perhaps as a more modern equivalent the experience of coding for one of the smaller microcontrollers, developing with a full awareness of the machine behind the code becomes second nature. When a variable requires two bytes, you know it requires two bytes, because you’ve had to make sure that there is a two byte space in memory for it. By comparison, it’s easy when declaring an integer variable in a modern IDE for a high-spec machine to forget that its real-world effect is to reserve two bytes, and thus it can only count up to 32,768 of whatever it is you are counting.

Maybe this will never be a problem that completely goes away. After all, each successive generation must learn about it the hard way, and the old-hands will nod sagely while another satellite crashes or an enterprise server fails. Meanwhile, as always, patch early and patch often.

Header image: Phrontis [CC BY-SA 3.0].

117 thoughts on “Patch, Or Your Solid State Drives Roll Over And Die”

Midnight says:

December 10, 2019 at 7:20 am

This sounds like a typical planned obsolescence strategy imho

https://en.m.wikipedia.org/wiki/Planned_obsolescence

Report comment

Reply
1. 8bitwiz says:
  
  December 10, 2019 at 7:45 am
  
  If they had used an unsigned short, the lifetime would be 7 1/2 years, a perfect obsolescence timer for enterprise storage! Why does uptime need to be a signed integer anyhow?
  
  Report comment
  
  Reply
  1. SGOrava says:
    
    December 10, 2019 at 11:18 am
    
    So they could sell limited ultra version of the product.
    
    Report comment
    
    Reply
    1. Ren says:
      
      December 11, 2019 at 6:51 am
      
      B^)
      
      Report comment
      
      Reply
  2. Jeff says:
    
    December 10, 2019 at 11:51 am
    
    Why have customers come back in 7-1/2 years when you could have them come back in half that time?
    
    Sounds like they just got greedy.
    
    Report comment
    
    Reply
  3. Stephane Hockenhull says:
    
    December 10, 2019 at 2:41 pm
    
    Because some coding teams forbid the use of unsigned integers on the theory that it will greatly reduce bugs.
    
    I’m more of the opinion that it’s just reducing the types of bugs and not the quantity as someone who isn’t looking out for unsigned overflows will not have the experience to look out for signed overflows either.
    
    Report comment
    
    Reply
    1. Gregg Eshelman says:
      
      December 10, 2019 at 3:02 pm
      
      For any variable that’s never ever supposed to go negative, use an unsigned integer. And while you’re at it, code a trap and handler for the event that something attempts to write a negative value to the variable. That way the code won’t crash (or won’t just blow up and die) and you’ll get whatever notice you programmed to tell you there’s a problem with whatever part of your code is writing to the variable.
      
      An infamous case of “Should have used unsigned integer.” is the spice amount indicator in the Dune 2 game. The internal code used unsigned so when a level was completed the score was correct. But the score display in the game could roll over and start showing weird stuff instead of numbers – because when it rolled over the code for the score counter would go to the next sections of the texture file that had the art for the score numbers.
      
      Report comment
      
      Reply
      1. Sean Rhinehart says:
        
        December 10, 2019 at 5:02 pm
        
        Another example of similar logic is the .NET Environment.TickCount property, which is an int, rather than a uint.
        Why would the tick count ever need to be negative? The value must be cast to a uint to reflect reality.
        
        Report comment
      2. Somun says:
        
        December 10, 2019 at 5:30 pm
        
        If you are using an unsigned integer how can you write a negative value?
        
        Report comment
      3. NiHaoMike says:
        
        December 10, 2019 at 7:21 pm
        
        You’ll be surprised how many games behave strangely if variables are set to “invalid” values, for example by hacking the save file with a hex editor. (Best to do that with old/”retro”/”classic” games since they’re unlikely to have a file integrity check.) Not too long ago, I had quite a bit of fun with Zeliard and a hex editor.
        
        Report comment
    2. Jul13 says:
      
      December 11, 2019 at 7:40 am
      
      Is that the case here?
      
      Or a wild guess you made?
      
      Report comment
      
      Reply
  4. Martin says:
    
    December 11, 2019 at 3:00 am
    
    That’s the basic mistake. A negative uptime is physically impossible until we develop time travel.
    
    Report comment
    
    Reply
2. Thinkerer says:
  
  December 10, 2019 at 8:09 am
  
  Always assume incompetence before conspiracy.
  
  Without knowing anything more about it, my guess is the development process included a step where copypasta coding plus Dilbert management metrics yielded paychecks for the week, and a problem for someone else to deal with years down the line.
  
  Report comment
  
  Reply
  1. Jerry says:
    
    December 10, 2019 at 8:24 am
    
    Well said +1
    
    Report comment
    
    Reply
  2. qwert says:
    
    December 10, 2019 at 11:04 am
    
    Ehh. This isn’t a hard and fast rule. Especially with a company like HP. There are examples of past behavior. Also understand that real grifters (that do absolutely exist in shockingly large numbers) take this rule into consideration for cover. They’re always gonna make it seem like it could be ignorance or blundering. Plausible deniability.
    
    That said, could very possibly be just a goof. But don’t let ’em all off just like that.
    
    Report comment
    
    Reply
  3. Whatname? says:
    
    December 10, 2019 at 11:39 am
    
    Sorry but I worked with HP for more than a decade and with them i assume conspiracy AND incompetence on every product.
    And even on professional oriented products.
    
    Report comment
    
    Reply
    1. RetepV says:
      
      December 11, 2019 at 9:12 am
      
      I worked for many companies. And found that the only reason that people blame their own (previous) company is that they THINK that other companies are better because they never WORKED in those other companies…
      
      Incompetence is everywhere. Please tell me ONE single product on which you worked, where you didn’t have to learn anything new… I am sure you cannot name one product for which you had to learn something new.
      
      So by your own definition, doesn’t it mean that you were just as incompetent as the others at HP?
      
      Incompetence is the precursor to learning. So if you work in a high-tech company at the forefront of technology, and think that you are competent, it can only mean that you must be learning nothing. And that can only mean that you never do anything challenging and new in your company.
      
      So I guess that you are the guy of whom nobody knows what he does? Not being fired, only because someone forgot to assign you a manager and nobody knows that you exist?
      
      Where there’s new technology, there’s learning, and where there’s learning, there is incompetence. You cannot burn your former colleagues at HP for learning as they go, without burning yourself as well. :W
      
      Report comment
      
      Reply
      1. Mod says:
        
        December 12, 2019 at 12:32 pm
        
        No HP is evil so is GM moving jobs to mexico if there was only some common trait they both shared oh oh…never mind.
        
        Report comment
      2. onetruegod says:
        
        December 13, 2019 at 9:31 pm
        
        Just because you do not know everything, that does not make you incompetent. RetepV appears to be creating her own definition of incompetence, attributing it to another, and then using that mangled definition to defend the poor outcome that is the subject of this article.
        
        Report comment
  4. Jul13 says:
    
    December 11, 2019 at 7:42 am
    
    That is a shockingly, incredibly bad idea, especially when you’re dealing with known sociopaths.
    
    Report comment
    
    Reply
3. seanrhinehart says:
  
  December 10, 2019 at 8:55 pm
  
  I try not to to attribute to malevolence things which are adequately explained by laziness or a lack of thought.
  
  There are many examples of inappropriately using int for values better handled with uint.
  
  For example, the .NET Environment.TickCount property is internally implemented as a uint, but is exposed and reported as an int.
  
  Microsoft even goes into considerable detail about how the property transitions from Int32.MaxValue to Int32.MinValue, and how to deal with this, without acknowledging that the correct (unsigned) interpretation is available via a simple (uint) cast.
  https://docs.microsoft.com/en-us/dotnet/api/system.environment.tickcount?view=netframework-4.8
  
  The core issue is that once an API mistake is made, it must be preserved forever, because changing the behavior is an even worse sin.
  
  Report comment
  
  Reply
  1. Jul13 says:
    
    December 11, 2019 at 7:43 am
    
    And you TOTALLY aren’t part of HPs multi million dollar propaganda department.
    
    Report comment
    
    Reply
4. Jul13 says:
  
  December 11, 2019 at 7:40 am
  
  HP is seen as a good brand name by an aging population. If they use that to trick customers soon they may not get a chance
  
  Report comment
  
  Reply
5. Jerry says:
  
  December 14, 2019 at 5:43 am
  
  As I Recall, the normal laptop battery arrives with a limited number of recharge cycles programmed in.
  Depending on Mfg, 250 to 500 charges.
  Outraged yet?
  
  Report comment
  
  Reply
Chris Soukup says:

December 10, 2019 at 7:32 am

it’s a typical 16bit signed integer problem – but how can a runover bug cause a complete death of the disk? I’m so eager seeing the code passage where this happens.

Report comment

Reply
1. N says:
  
  December 10, 2019 at 8:00 am
  
  how about a media check on power-up, current timestamp is less than the saved one on the disk, so the firmware thinks the media is borked.
  
  Report comment
  
  Reply
  1. djsmiley2k says:
    
    December 10, 2019 at 8:45 am
    
    How about the Smart specifically writes -1 to the clock for time, on error, to show it’s gone badly wrong?
    
    Report comment
    
    Reply
    1. N says:
      
      December 10, 2019 at 9:28 am
      
      code that intentionally borks the drive would be grounds for a massive class-action lawsuit, probably the end of HP. There is such a thing as “data recovery” and all drive manufacturers explicitly endorse it.
      
      Report comment
      
      Reply
      1. Andrew says:
        
        December 10, 2019 at 11:37 am
        
        I’ve actually encountered solid state drives that get to a point (usually when it runs out of wear leveling blocks) where the controller deems the drive end of life, and it reverts to a read-only state so that you can get your data off the drive before it degrades further.
        
        Report comment
      2. Gregg Eshelman says:
        
        December 10, 2019 at 3:07 pm
        
        Intel’s consumer level SSDs have a write number counter and when it hits a pre-determined value the drive bricks itself, can’t even read it. (Unless they’ve stopped doing this the past couple of years.) Intel’s enterprise/business SSDs also have a maximum writes counter but when it’s hit the data can still be read, but the drive slows write speed to a crawl to inspire the IT department to get everything copied off.
        
        I don’t care if it’d take 50 years of use to hit such a counter. I’d never buy a device that’s designed to make my data on it completely inaccessible just because some counter hits a certain value.
        
        Report comment
      3. Cuthbert says:
        
        December 10, 2019 at 3:25 pm
        
        Do you have any references with regard to your Intel SSD claim Greg?
        
        Report comment
      4. Martin says:
        
        December 11, 2019 at 3:07 am
        
        Of course a drive that bricks itself on purpose should be cause for a damage claim lawsuit. If it slows writing and allows only normal reading to retrieve/back up your data before it dies completely, I find this not so bad,.
        
        Report comment
      5. pigster6 says:
        
        December 11, 2019 at 6:44 am
        
        Not intel i think – but multiple SSD drives fail hard when exhausted – https://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead/
        
        Report comment
Cynde Moya says:

December 10, 2019 at 7:35 am

Who made the isometric drawing of the disk drive? That is very nice work.

Report comment

Reply
1. Marty says:
  
  December 10, 2019 at 7:46 am
  
  Might be a vector trace of a real laptop HDD?
  
  Report comment
  
  Reply
  1. qwert says:
    
    December 10, 2019 at 11:07 am
    
    Naw, I think Joe is just a legit pretty picture-drawer. I always admire the illustrations here. Playful and fun, but still with a great mechanical gestalt sort of like an engineer’s draftwork in some ways.
    
    Report comment
    
    Reply
2. Mike Szczys says:
  
  December 10, 2019 at 8:12 am
  
  That’s the work of Joe Kim, Art Director here at Hackaday.
  
  Report comment
  
  Reply
  1. Ronny says:
    
    December 10, 2019 at 11:27 am
    
    It is harder to make a solid-state drive look as interesting!
    
    Report comment
    
    Reply
John Wilson says:

December 10, 2019 at 7:38 am

I always use the infinity variable type when coding.

Report comment

Reply
1. Jon Penn says:
  
  December 10, 2019 at 11:12 pm
  
  You’re not the only one. Lisp, Python, Perl, Haskell and Ruby all have options to use infinite-precision arithmetic for all numerical operations and variables (source: https://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic).
  
  Report comment
  
  Reply
Mike B says:

December 10, 2019 at 7:43 am

That graphic is a HDD.

Report comment

Reply
N says:

December 10, 2019 at 7:53 am

Go ahead, take back your forgiveness. Maybe you can get a shovel, dig up the dead guy and have the corpse apologize to you. Maybe you can climb into your trusty time machine and make sure he never gets hired in the first place. Do you have enough charge in your flux capacitor for such a jump? What is your plan here?

Report comment

Reply
kdev says:

December 10, 2019 at 8:11 am

That’s some primo irony right there. They didn’t want to use up an extra 16 bits, on their device designed to store literally billions of bits, and as a result none of the bits can be used.

Report comment

Reply
e says:

December 10, 2019 at 8:23 am

Programmers were overflowing words prior to the 1960s when it suited them…

https://en.wikipedia.org/wiki/The_Story_of_Mel

Report comment

Reply
1. e says:
  
  December 10, 2019 at 8:24 am
  
  http://www.cs.utah.edu/~elb/folklore/mel.html
  
  Report comment
  
  Reply
  1. Elliot Williams says:
    
    December 11, 2019 at 1:06 am
    
    Easier to read: http://foldoc.org/The%20Story%20of%20Mel
    
    Report comment
    
    Reply
cBaer says:

December 10, 2019 at 8:25 am

It can be that they had to use some obscure compiler where ‘int’ is 16 bit even when it is a 32 bit device. Happend to me.

Report comment

Reply
1. Ronny says:
  
  December 10, 2019 at 11:33 am
  
  Something along these lines seems like the most likely explanation. It was probably a port of some pre-existing code that had been working fine on other drives. If that is the case, then it makes it more likely that there was never a conversation or review question about how many bits were being used, or what would happen if it rolled over.
  
  Drive firmware is many thousands of lines, and can be quite sophisticated. Doubtful much energy was spent thinking about the simple power-on-hours variable. And now customers are paying for it.
  
  Report comment
  
  Reply
  1. Gregg Eshelman says:
    
    December 10, 2019 at 3:10 pm
    
    People still not heeding the lesson of Ariane 5. Test ALL the software, in a full system integration test, BEFORE launch! Be it a rocket or product launch, spend the time and money on a FULL test, or you may end up paying far more later.
    
    Report comment
    
    Reply
    1. montrough says:
      
      December 10, 2019 at 7:54 pm
      
      While this is a ridiculous bug for various reasons. Testing can be problematic on complex systems when the inputs can’t be defined properly or require a large load to properly test. I work on services that could contain a large amount of data for a given item and a lot of items over time. I don’t have the proper tools to measure the resources over a given time. I can find lots of bugs (I’m good at breaking things) but I can’t tell you how well something will work in production if you can’t give me numbers to expect. And mock data is not real data.
      
      Sorry for the rant.
      
      Report comment
      
      Reply
      1. Ren says:
        
        December 11, 2019 at 6:58 am
        
        Yes, to fully test software/hardware one needs to put at least 1000 3 year olds at the keyboard/mouse of the computer and tell them to find the really cool stuff!
        
        Report comment
Matt Cramer says:

December 10, 2019 at 8:32 am

Now… why did they use a signed integer? Were they expecting the hard drive to need to record negative uptime?

Report comment

Reply
1. Ren says:
  
  December 10, 2019 at 9:40 am
  
  As Doc Brown said, “Marty, you’re not thinking 4th dimensionally!”
  
  Report comment
  
  Reply
2. ligoten says:
  
  December 10, 2019 at 1:20 pm
  
  I say this every time I see a video game use a signed int for the money counter. Especially when the game gets plenty of other updates over its life cycle.
  
  Report comment
  
  Reply
  1. Gregg Eshelman says:
    
    December 10, 2019 at 3:11 pm
    
    Ever play Dune 2? Signed int spice counter, but only for the display. The amount was correct after finishing a level.
    
    Report comment
    
    Reply
  2. Martin says:
    
    December 11, 2019 at 3:10 am
    
    An amount Money for sure can get negative – normally then called “debt” and often printed in “red numbers”. :-)
    
    Report comment
    
    Reply
  3. Matt Cramer says:
    
    December 16, 2019 at 10:51 am
    
    If you want a great case of signed int money counters gone horribly wrong, ever play the old shareware game Scorched Earth? There was a “Free Market” setting that would cause the price of ordinance change in between rounds depending on how much was purchased the last time. However, they failed to include a rollover check. So you could buy up a ton of your favorite weapon one round, then the next round the price would roll over and they’d pay you to buy the weapon!
    
    Report comment
    
    Reply
3. Justin says:
  
  December 11, 2019 at 8:58 pm
  
  Google notes actually suffers from this issue if you set a reminder past this date it is reset.
  
  Report comment
  
  Reply
Brett The Brat says:

December 10, 2019 at 8:51 am

Had issues with controllers on both mine and my buddies hard drives from windows 7 failing and killing the drives in windows 10 (just to give him some music). One a Seagate and one a Western Digital. Poisoned the controller. Thats why they were selling on EvilBay. Combo Momentus drives me up the wall trying to get it to do any tricks. Better off with Full blown SSD. Bought him a new Toshiba Drive same USB 3.0 that I have, and a link to a 3.0 hub. He had no 3.0 computers yet.

But I like this news. We have seen these uptime issues in switches, Cisco and Nortel in the past. Have not seen it in awhile though. And problems like these are stupid and they knew about it, yet failed to fix it.

Report comment

Reply
Mr Hard Drive says:

December 10, 2019 at 9:31 am

This has already caused havoc in Sweden, where the medical service has had harddrives die on them, infact whole RAID storage units died, at a given time.

Look into it

Report comment

Reply
targetdrone says:

December 10, 2019 at 10:53 am

Probably some junior developer who didn’t know the size of an “int” varies by CPU architecture, and assumed it would be good for 2 billion hours because that’s the size of an int on their desktop.

Or maybe it was someone who has been yelled at for “wasting bytes” because they previously used a long when the code reviewer thought it only needed a short.

Or maybe someone named a variable “counter” when they should have named it “hours_of_life_remaining”.

If nothing else, this pushes hard for a robust code review process. And undoubtedly nobody wrote unit tests for this code, either.

I think their development team needs a new, strong leader to instill some modern best practices. They need effective code reviews, automated unit testing, and a way to flash update their drives.

Report comment

Reply
1. Dan says:
  
  December 10, 2019 at 1:52 pm
  
  Or they did have unit tests, but those tests took under 3.5 years to run.
  And when one developer asked “shouldn’t that be an unsigned long?”, he was answered “no, for it passes unit tests and therefore is fine”.
  
  Report comment
  
  Reply
2. Elliot Williams says:
  
  December 11, 2019 at 1:22 am
  
  +1 for “hours_of_life_remaining”. As a non-pro programmer, I’ll just take your word for all the rest. :)
  
  Report comment
  
  Reply
3. mac012345 says:
  
  December 11, 2019 at 2:24 am
  
  Code reviews don’t catch bad or lack of specs, that’s be basis of sane software development (sorry agile guys!).
  
  Report comment
  
  Reply
ejonesss says:

December 10, 2019 at 11:34 am

what about the samsung evo 840 drives? are they effected?

Report comment

Reply
Stuart Longland says:

December 10, 2019 at 11:46 am

What’s the bet they allocated an “int” instead of a “int32_t”?

I’ve seen systems where “int” is “int8_t” (AVR with -mint8).
I’ve seen systems where “int” is “int16_t” (16-bit x86, TI TMS320LF2406A)
I’ve seen systems where “int” is “int32_t” (most 32-bit systems)

What’s the size of “int”? There is no consistent answer, its use therefore should be considered harmful. Time to read up on “stdint.h” and leave the old legacy types behind.

Report comment

Reply
1. ian 42 says:
  
  December 10, 2019 at 2:58 pm
  
  I think you are probably right – and why I only ever use uint32_t etc in all my embedded code. You need to know what you can fit!
  
  The other problem is that people don’t seem to realise how long their hardware can be on. Sure, you might think your battery powered device can’t be on for 10 years, but the next release might have the option to be plugged in, and that assumption in the code may come back to bite you.
  
  Anything that I write – that counts – is done so it can sensibly handle rollover no matter what – it’s normally only one line of code, or structuring the if statement differently.
  
  Report comment
  
  Reply
2. Steve Herzog says:
  
  December 11, 2019 at 9:30 am
  
  I’ve always assumed an int or uint was the size of a CPU register on the target processor. (often, but not always)
  
  Report comment
  
  Reply
  1. Michael says:
    
    December 11, 2019 at 4:22 pm
    
    This was generally true, outside of unconforming implementations like the above AVR, until x86-64 systems came out. The lack of an integer type between short and int in C forced most systems to use 32-bits for ints, ruining the equivalance.
    
    Report comment
    
    Reply
3. rclark says:
  
  December 11, 2019 at 7:36 pm
  
  This is why on all the C projects (big or small) I ever worked on (work or home), there was always an include ‘portab.h’ file (one for each architecture) created which type def’ed M_INT8, M_UINT8, M_INT16, M_UINT16, M_INT32, M_UINT32, M_INT64, … M_FLOAT32, … M_ADDR, and etc. so that the application(s) could easily be recompiled for different platforms without modification and still work the same. Works/worked very well. The reason for the M_ was to make it ‘unique’. Some compilers had a INT64 already defined for example.
  
  Report comment
  
  Reply
SKSSF says:

December 10, 2019 at 12:21 pm

Crucial did the same thing with their M4 SSDs–after 5184 hours of power on time, the drive would become unrecognizable even in BIOS. They did offer a firmware fix, but never notified customers of the bug. So…if you crossed the 5184 hour threshold and your drive is no longer recognizable, there’s no way to upgrade the FW.

I have one such drive with data on it, but Crucial refuses to help–saying that it’s outta warranty, stop bugging us. Needless to say, they’ve lost a customer.

Report comment

Reply
1. Joel B says:
  
  December 10, 2019 at 1:34 pm
  
  That sounds rather ominous – I’ve got two Crucial M4s at home and haven’t had that happen yet. Where’d you hear that reported?
  
  Report comment
  
  Reply
  1. SKSSF says:
    
    December 10, 2019 at 1:37 pm
    
    here’s the most comprehensive post i’ve seen about it:
    https://kb.stonegroup.co.uk/index.php?View=entry&EntryID=426
    
    Report comment
    
    Reply
2. NSFW says:
  
  December 10, 2019 at 2:36 pm
  
  Thanks for the heads-up. They just lost another customer right here.
  
  Report comment
  
  Reply
3. Mad Murdock says:
  
  December 11, 2019 at 7:29 am
  
  I was affected by this with my very first SSD (Crucial M4). I had just applied new thermal paste to my GPU, so I figured I must have fried it. I went through every possible step to determine the issue. Hours wasted because of some crappy code that killed the drive after so many hours of use.
  
  Report comment
  
  Reply
4. Jul13 says:
  
  December 11, 2019 at 11:25 am
  
  They’re ok with that. They’ll run that brand forna while and rebrand once everyone has caught on.
  
  Wow they totally took advantage of you and you just let it happen.
  
  Report comment
  
  Reply
James haddock says:

December 10, 2019 at 12:44 pm

I love how you can all be critical of programmers in the 60s and 70s being so shortsighted. You have gigabyte laptops with terabyte drives. When I started programming in the mid 60s, the largest mainframes filled a room, but had maybe 256 kilobytes of memory. Data was entered on 80 byte paper cards. Everywhere space was at a premium. Yes, in the day and age were space is not so restricted, more forethought should have gone into the design.

Report comment

Reply
1. my2c says:
  
  December 10, 2019 at 7:42 pm
  
  – Like football for many – easy for them to yell at the coaches or players ‘you idiot, what were you thinking!’, while they couldn’t do a tenth of the job the person does on a day-to-day basis. Maybe armchair developers are a new thing.
  
  Report comment
  
  Reply
2. NFM says:
  
  December 10, 2019 at 7:45 pm
  
  And in 30 years, they’ll say the same thing about us for not taking into account their chip-scale exabyte arrays. :D
  
  Report comment
  
  Reply
3. Jan Bottorff says:
  
  December 11, 2019 at 8:08 am
  
  Seems quite possible the frequently changed yet peristent power on hours value was stored in an eeprom (or eeprom space in a microcontroller). Eeprom space is often extreemly limited, like a Microchip 24AA01 is 1K bits. Unmanaged flash is inappropriate for values that change often, as it quickly wears out the flash. RAM is not persistent. If you only have 128 bytes of persistent storage, every byte could easily matter so the 16-bit value may have been a deliberate choice. Had they counted 2 (or 4 hour) increments in an unsigned value that would have been good for almost 15 years (30 years for 4 hour increment), likely longer than the drive life. Would anybody care if the power on hours reported as 30000 when it actually was 30001?
  
  Report comment
  
  Reply
4. GB Clark says:
  
  December 15, 2019 at 8:53 pm
  
  I must have missed the comment about the usage of SSD drives in the 60s and 70s…:)
  If the drive in question took platter packs(let’s say from a Data General Nova system) then your comment might apply, but it does not. One the other hand, if HP is reusing code that was generated in the 60s and 70s in their SSDs then they have other problems.
  
  My first computer (in the early 80’s) was a hand built Z80 with 128K of memory, the first 32K was fixed and the other 32K could be bank switched;It was a modified version of Steve Ciarcia’s “Build your own Z80 computer”. The only reason I had that much memory was that my Dad worked for National Semiconductor and had friends…
  
  Report comment
  
  Reply
5. ian 42 says:
  
  December 16, 2019 at 10:37 pm
  
  well, I started on a computer with 256 bytes of memory – and I still wouldn’t have write code then that would crash the computer if a counter rolled over. Sure, I might have only used one byte, or a nibble (used quite often), or even a bit, for a variable – but then I’d make sure that if it overflowed something half sensible happened – not bricking the whole thing.
  So it is either lazy stupid programming, or planed obsolescences.
  
  Report comment
  
  Reply
sdfdsfdfdsf says:

December 10, 2019 at 1:11 pm

HP’s enterprise hardware group is infamous for stuff that leads me to believe this wasn’t entirely an accident. They’re one of the only manufacturers that makes you pay a license for parity RAID. RAID5/6, requires a license. Think about that. Their ILOm is also crippled if you don’t pay a license for that as well.

The real teller is that HP firmware updates can’t be had without a support contract. This is a way to force people to get a support contract. “Oh well, shucks, we introduced this firmware bug, which we fixed soon after you bought your machine. If you’d been on our support contract, it would’ve been patched…”

Report comment

Reply
1. Speed Daemon says:
  
  December 10, 2019 at 8:00 pm
  
  And that’s why when Carly Fiorina (then-HP CEO) purchased Compaq, the vast majority of data center managers switched to Dell. Be it a company or a country, choose your leaders wisely. History shows all too often that handing over control to a nitwit is a one-way ticket to losing it all. A sad ending for a once-great company.
  
  Report comment
  
  Reply
BruceJ says:

December 10, 2019 at 1:36 pm

When the McDonalds sign rolls over, it’ll still be correct: over one billion sold.,,just not perfectly accurate for how many OVER one billion…

Report comment

Reply
1. Ren says:
  
  December 11, 2019 at 7:03 am
  
  When the sign rolls over, it will say “Over 00 Billion Sold”
  
  Report comment
  
  Reply
drenehtsral says:

December 10, 2019 at 1:39 pm

The biggest question is why _any_ value of the uptime counter can result in an otherwise sound drive appearing to fail? That to me seems like the key issue. Even if the drive has been running for 2^32 hours, if the flash (improbably) hadn’t stopped retaining data and the drive still functioned it seems like a bad idea to inject a made-up failure that isn’t legitimately triggered by an actual underlying failure.

SSDs will eventually hit the point where few enough blocks still work that the drive can no longer function reliably but that point should be based on actual failures. _Maybe_ they’re doing something clever like using the hour counter to refresh blocks that are nearing the flash’s underlying retention time since the last write/erase cycle and the drive panics if a block has been written in the future? still a silly bug to make ot to the field!

Report comment

Reply
1. drenehtsral says:
  
  December 10, 2019 at 1:47 pm
  
  The saddest bit is that uptime hours is a poor metric for wear-and-tear on SSDs. For spinning media past the left-hand (early failure) part of the bathtub curve it is a statistically useful metric but doesn’t predict failure of individual drives in a meaningful way. For SSDs it’s just silly though.
  
  Report comment
  
  Reply
2. Dan says:
  
  December 10, 2019 at 1:58 pm
  
  It’s perhaps used for something like deciding when to do wear levelling or self tests or sleep or somesuch, and the negative number is throwing it into a loop.
  
  Report comment
  
  Reply
reg says:

December 10, 2019 at 2:46 pm

The thing that gets me is why would the drive croak if the running time indicator loops around? How many cars croaked in the old days when the odometer rolled over at 100K miles? I wonder should a tesla last that long if it will have a similar bug…

Report comment

Reply
1. Gregg Eshelman says:
  
  December 10, 2019 at 3:23 pm
  
  Speaking of Tesla… The Model S (until the last couple of years) used a flash chip that was large enough back in 2012 but as software features have been added, the space has become rather tight. What’s caused a problem is some numpty though it’d be great to leave the underlying Linux system’s logging enabled, and dump it onto the flash chip where the operating system and car software lives. The car software’s log also goes onto that chip. That’s the log Tesla service uses to diagnose problems with the car’s systems.
  
  As features have been added to the software, the space for the logs has become smaller and smaller. Since the car software is mostly static, the chip’s wear leveling has had less and less space to spread out the wear caused by the logging.
  
  *POP* *FZZZT* the chip dies, your Tesla dies. You need a whole new central computer module – or send it off to one of the aftermarket shops that will replace the chip and install the latest software for your Model S – with the Linux logging disabled to stretch out the life of the chip.
  
  The current Model S computer has a larger flash chip and IIRC the car software log goes to a removable SD card, but (also IIRC) the Linux log was still left enabled, and going to the main flash chip. D’oh!
  
  Report comment
  
  Reply
  1. Jerry says:
    
    December 10, 2019 at 4:42 pm
    
    Like everything in life, if an engineer built it, the bean counters cut the numbers so they could save .05 cents per unit. Intel shipped an “Entry Level” motherboard. Maxed out at 2 gig ram. Who figured out that would be good?
    I suspect the same mentality exists in SSD units shipped.
    Or, you could simply say, buy more product as soon as it stops working.
    
    Report comment
    
    Reply
LordNothing says:

December 10, 2019 at 9:28 pm

when in doubt
print(sizeof(int));

Report comment

Reply
profumple says:

December 10, 2019 at 10:32 pm

Two bytes can only count to 32768? My 8bit brain having trouble with that.

Report comment

Reply
1. seanrhinehart says:
  
  December 10, 2019 at 10:59 pm
  
  The problem is in the interpretation of bit 15, which has a place value of +32768 for a uint16 or -32768 for an int16 (as a sign bit)
  
  Report comment
  
  Reply
2. Lennart says:
  
  December 10, 2019 at 11:25 pm
  
  It can’t, the max is 32767 then it rolls over to -32768.
  
  Report comment
  
  Reply
  1. profumple says:
    
    December 11, 2019 at 12:37 am
    
    snicker. no way!
    
    Report comment
    
    Reply
PM says:

December 11, 2019 at 4:46 am

On the scale of “bare metal – high level scripting language in the cloud” Hackaday crowd might be closer to the hardware side, but let me assure you there are coders out there who really don’t consider any hardware limitations at all, and computer science programs where the notion of things actually having to run on some real hardware instead of being abstract math in limitless space is barely a side note if even mentioned(and that’s actually a good thing). The bad thing that this leads to is wrong people doing the close-to-hardware-programming. Get embedded systems programming from someone who can actually do it, and get business logic programming and scientific projects from the people on the other end of the scale.

Report comment

Reply
phyvyn says:

December 11, 2019 at 4:49 am

You missed the point that “planned obsolescence” or human error on this scale could be our end in a very unexpected way there are probably examples out there already

Report comment

Reply
NATO says:

December 11, 2019 at 5:08 am

True embedded(firmware) engineers are rare nowadays; Most kids graduate with CS degrees and don’t have a clue what they are doing and there aren’t enough good listeners to learn what they need to from the senior guys before they retire. Stack on top of that he fact that there are a lot of H1B going into software engineering, and upwards of 90% of indian graduates cannot write compile-able code(this is well documented), with similar conditions among the other groups that tech countries are importing, and you will start to understand why these high level tools are necessary – You can’t fix stupid, and you can’t mentor a group when only a small portion of that group actually have a true passion for what they are doing, and the rest just want a job.

Report comment

Reply
quarterturn says:

December 11, 2019 at 5:19 am

Same company who released the remake of the HP 15-C calculator with a bug which broke the PSE function.

Report comment

Reply
RW ver 0.0.1 says:

December 11, 2019 at 5:38 am

Just remembered, I had a quantum 40GB drive I ran for a decade, I got it used for cheap and it was throwing SMART errors, on running SMART tools against it, all I was seeing was the power on hours was negative, and one of the other attributes was screwy, a number that made no sense, forget which one it was, something that had normally had a low range. Anyway, external tests run on the drive suggested it was working fine, and it ran until a couple of years ago when I stopped using that machine. So I think stuff like this might happen semi-frequently, it just doesn’t always brick the drive. (Unless you consider SMART warnings bricked)

Report comment

Reply
Jul13 says:

December 11, 2019 at 7:38 am

“Unfortunate bug” made by a sociopath named HP, which is going to earn the sociopath millions of dollars.

Sure. That’s not incredibly naive at all.

Report comment

Reply
Steven Clark says:

December 11, 2019 at 7:55 am

Being a little self serving, but. Maybe we should be hiring CS grads instead of or in addition to EE’s for firmware. A good CS grad of the low-level persuasion with C++ experience is going to have run into this stuff before, They’re told to look for it pretty constantly.

Report comment

Reply
jonmayo says:

December 11, 2019 at 9:23 am

I believe it’s the size of the field for the SMART data, which is a protocol to query drive health related state. Rather than being any sneaky programmer’s obsolete attempt to save bits. Bits do matter somewhat on these ASIC microcontrollers as they often have a small amount of RAM and an even smaller amount of high-endurance NVRAM.

Report comment

Reply
pete says:

December 11, 2019 at 5:53 pm

The picture in this article is not of a solid state drive.

Report comment

Reply
1. Elliot Williams says:
  
  December 12, 2019 at 12:20 am
  
  This is true.
  
  Report comment
  
  Reply
francis steffan says:

December 11, 2019 at 6:42 pm

Planned obsolescence was taught to me in 8th grade so I subscribe to the theory that HP has done this intentionally with the intent to part people with their money and not disclosing at the time of purchase the lifespan. Yes, that would constitute fraud.
However, It doesn’t matter whether it is incompetence, negligence or intentional at this point.
This is HP’s problem, they created it not me so it is now their obligation to make it right, not mine.
When an auto manufacturer say makes a car that they discover will blow up and kill your family in 3 years they don’t send you a bag of parts with instructions on how to fix it yourself, do they? Yet this is in effect what HP offers.
HP needs to replace every single one of these drives included with an apology letter so people can at least have the opportunity to clone their drives.
Lets say i accept HP’s inadequate offer of a “patch” that I have to flash myself, what if something goes wrong, what if I make a mistake, what if I end up losing all my data, what then? Then HP blames you, and maybe it is your fault, and says go BUY a new drive, and oh yeah too bad about all your data.
No, HP must replace drives or be class action sued out of existence.

Report comment

Reply
sxsshadowxs says:

December 12, 2019 at 12:12 am

“…certain models of solid-state disk drives supplied…”

solid-state disk drives
Must be like at police cycle car.
:)

Report comment

Reply
bighorsecart@gmail.com says:

December 12, 2019 at 8:45 am

Hmm I have a 4 year old SSD that begs to differ. Fake news.

Report comment

Reply
1. sxsshadowxs says:
  
  December 12, 2019 at 11:39 am
  
  Huh?
  
  Report comment
  
  Reply
Jack says:

December 13, 2019 at 5:45 am

I want to point out the article is actually incorrect, and I’ve seen this same mistake repeated quite a bit in the tech press. HP is NOT the company that has these failing drives. It’s HPE. The “E” is important. Hewlett Packard spun off “Hewlett Packard Enterprise” 5 years ago. They are not the same corporation. They don’t even have the same stock ticker.

No HP systems are affected, only certain HPE servers using a particular rebranded third party SSD. Course, if you’re one of those clients with that particular combination…

Please check your sources next time!

Report comment

Reply
1. Jerry says:
  
  December 13, 2019 at 7:35 pm
  
  Business level HP systems Vs Consumer grade plastic boxes can’t be compared.
  And for the record, no Mfg builds computers.
  They actually sketch them out on a tablet, and issue a R. F..P.
  Asian contractors agree to build the RFP for a price, usually the lowest price submitted.
  The drives are built by others, as is the MoBo, DVD (if any) video chip, ram, and USB connections.
  Bean counters forced the designers to install most accessories to the USB channel
  This is simply a cost reduction maneuver.
  Consumer grade systems are absolute minimum spec machines.
  There in is the rub.
  While business often will pay for support, the consumer does not.
  So a minimum system without support gets a “Bad Rap”.
  Add to the bean counter chopping specs down, it’s a wonder any of them run.
  
  Report comment
  
  Reply
Paul Beckett says:

December 13, 2019 at 6:51 pm

Having owned a number of HP devices and used them in the workplace I am certain that obsolence is accelerated in HP products.

Report comment

Reply
big F says:

December 14, 2019 at 1:55 am

Sounds like planned obsolesance to me, used Dell, HP and Fujitsu in the data centres i work in for years HP stuff always seem to have a faster turn round. Sounds like a certain French car company, who build ECUs with an expiry date, where stuff starts to fail, requiring you to buy a new ECU, as the things like wipers or headlights are not actually broken. Easy fix, every year disconnect the battery and the plug back in when the car asks the date just go with what it displays, which is its incept date. As long as the year does not roll over the ECU is to dumb to notice. Got told that by a factory mechanic, and have first hand experience.

Report comment

Reply
1. Jerry says:
  
  December 14, 2019 at 4:59 am
  
  That used to work with computers..
  No more, as the OS grabs the time from the internet, and if time stamp does not match, no internet.
  
  Report comment
  
  Reply
  1. big F says:
    
    December 14, 2019 at 5:41 am
    
    Yes I remember setting the cmos time a month back to allow me to use some software while I waited for the new licence to arrive in the post. How thing have changed.
    
    Report comment
    
    Reply