Fail Of The Week (in 1996): The 7 Billion Dollar Overflow

June 30, 2016

The year was 1996, and the European Space Agency was poised for commercial supremacy in space. Their new Ariane 5 rocket could launch two three-ton satellites into space. It had more power than anything that had come before.

The rocket rose up towards the heavens on a pillar of flame, carrying four very expensive and very uninsured satellites. Thirty-seven seconds later it self-destructed. Seven billion dollars of RUD rained down on the local beaches near the Guiana Space Centre in ~~Southern~~ South America. A video of the failed launch is after the break.

The cause of all this was a single improper type cast in a bit of code that wasn’t even supposed to run during the actual launch. Talk about a fail.

There were two bits of code: one that measured the sideways velocity, and one that used it in the guidance system. The measurement side used a 64-bit floating-point variable, but the guidance side used a 16-bit signed integer. The code was borrowed from an earlier, slower rocket whose velocity would never grow large enough to exceed that 16 bits. The Ariane 5, however, could be described with a Daft Punk song, and quickly overflowed this value.
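
To make the failure mode concrete, here is a minimal Python sketch of what happens when a large value is squeezed into 16 bits. It is not the flight code (which was written in Ada); the function names and the sample value of 40000.0 are made up for illustration. It contrasts a conversion that silently wraps around with one that refuses out-of-range values and one that clamps them.

```python
# Illustrative only: hypothetical helpers, not the Ariane flight software.
INT16_MIN, INT16_MAX = -32768, 32767


def unchecked_to_int16(value: float) -> int:
    """Truncate and keep only the low 16 bits, so large values wrap around."""
    raw = int(value) & 0xFFFF
    return raw - 0x10000 if raw >= 0x8000 else raw


def checked_to_int16(value: float) -> int:
    """Raise instead of converting a value that does not fit in 16 bits."""
    truncated = int(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return truncated


def saturating_to_int16(value: float) -> int:
    """Clamp out-of-range values to the nearest representable limit."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


if __name__ == "__main__":
    reading = 40000.0  # assumed sample value, larger than 32767
    print(unchecked_to_int16(reading))   # wraps to a negative number
    print(saturating_to_int16(reading))  # clamps to 32767
    try:
        checked_to_int16(reading)
    except OverflowError as err:
        print("rejected:", err)
```

Which of the three behaviours is right depends on the system. The point is only that the choice has to be made deliberately for every value that can outgrow its container.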

The code that caused the overflow was actually a bit of pre-launch software that aligned the rocket. It was supposed to be turned off before the rocket fired, but because launches were delayed so often, the engineers made it time out 40 seconds into the launch so they didn’t have to keep restarting it.
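
As a rough sketch of why a timer-based shutdown keeps pre-launch code alive in flight, the Python below compares two shutdown policies for a hypothetical alignment task. The class, its method names, and the assumption that the 40 seconds are counted from liftoff are illustrative choices, not details of the real system.

```python
from dataclasses import dataclass


@dataclass
class AlignmentTask:
    """Hypothetical pre-launch task; the 40 s default follows the article."""

    timeout_s: float = 40.0

    def runs_with_timer(self, seconds_since_liftoff: float) -> bool:
        # Timer policy: keep running until the timeout expires, even in flight.
        return seconds_since_liftoff < self.timeout_s

    def runs_with_event(self, lifted_off: bool) -> bool:
        # Event policy: stop as soon as liftoff has been detected.
        return not lifted_off


if __name__ == "__main__":
    task = AlignmentTask()
    # 37 seconds into the flight the timer policy is still executing the task.
    print(task.runs_with_timer(37.0))
    # The event policy would have stopped it at liftoff.
    print(task.runs_with_event(lifted_off=True))
```

Under the timer policy the task is still running at the 37-second mark when the rocket was lost, whereas an event-driven shutdown would have stopped it at liftoff, at the cost of a full restart after every aborted countdown.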

The ESA never placed blame on a single contractor. The programmers had made assumptions. The engineers had made reasonable shortcuts to make their job easier. It had all made it through inspections, approvals, and finally the launch event.

They certainly learned from the event; the Ariane 5 rocket has flown 82 out of 86 missions successfully since then. It has at least five more launches contracted before it is retired in 2023 for the Ariane 6 rocket being developed now. This event also changed the way critical software and redundant systems were tested, bringing the dangers of code failure to the attention of the public for the first time.

If you want to read more, there is a great discussion on Reddit which tipped us off to this fail, a quite thorough Wikipedia article, and the original article that ran in the New York Times is mirrored here.

Fail of the Week is a Hackaday column which celebrates failure as a learning tool. Help keep the fun rolling by writing about your own failures and sending us a link to the story — or sending in links to fail write ups you find in your Internet travels.

50 thoughts on “Fail Of The Week (in 1996): The 7 Billion Dollar Overflow”

me says:

June 30, 2016 at 10:27 am

I’m not trying to be pedantic; I’m actually curious. In other parts of the world, is South America known as Southern America, or is that a mistake?

1. LH says:
  
  June 30, 2016 at 10:50 am
  
  I’m pretty sure it’s a mistake, although I waited for someone else to point it out (thanks, lol).
  
  I checked a few Wikipedia pages and it appears that most languages favor their respective versions of “South America”, although a few apparently have a “Southern America” alternative.
  
  e.g.
  South America – English, French, German
  Both, but South America equivalent preferred – Spanish, Italian
  Both, but Southern America equivalent preferred – Hebrew
  
  1. Whatnot says:
    
    July 2, 2016 at 6:33 pm
    
    So how would you distinguish between the south in america and south-america if you use southern america?
    
    1. Leonard says:
      
      July 5, 2016 at 12:51 pm
      
      Southern USA.
      
    2. phreaknik says:
      
      July 6, 2016 at 7:48 am
      
      “America” is not a country… I think you mean to say “United States of America”
      
2. Mike Szczys says:
  
  June 30, 2016 at 11:29 am
  
  Fixed, thanks.
  
3. Laser Girl says:
  
  July 1, 2016 at 1:50 am
  
  We are, as a group, surprisingly more pedantic than others.
  
  https://xkcd.com/386/
  
  1. notarealemail says:
    
    July 1, 2016 at 2:07 am
    
    This was fun.
    
    http://usvsth3m.com/post/69688215802/control-your-inner-pedant
    
Bill says:

June 30, 2016 at 10:28 am

One of the early Mariner missions flew into the ocean due to a sign error in the servo loop code that was supposed to keep it on course.

1. Jonny says:
  
  June 30, 2016 at 10:59 am
  
  Hey, the name was Mariner after all — just taking its own name a bit too seriously!
  
  1. When did writer = engineer? says:
    
    July 5, 2016 at 4:23 am
    
    I wonder which SUB routine it was? xD
    
Leithoa says:

June 30, 2016 at 10:46 am

It’s a typo

chango says:

June 30, 2016 at 10:51 am

Yesterday I listened to the recent Embedded.fm podcast (http://embedded.fm/episodes/158) about Ada for embedded ARM. It sounds like this is the sort of bug that Ada is designed to prevent.

1. Ray Moore says:
  
  June 30, 2016 at 10:54 am
  
  No programming language can prevent lazy programmers and engineers.
  
  1. russdill says:
    
    June 30, 2016 at 7:27 pm
    
    You are aware that the software in question was written in Ada, right?
    
bwmetz says:

June 30, 2016 at 11:33 am

Don’t forget the Mars probe crash due to ft vs m being used by different teams.

1. Chris says:
  
  June 30, 2016 at 3:49 pm
  
  That was just the part that the media latched onto. The full report into the failure of that mission focused on a lack of communication between the teams working on the project.
  
  1. Whatnot says:
    
    July 2, 2016 at 6:37 pm
    
    It would be a silly and probably costly thing to have the teams on such a project discuss each and every detail when there is no apparent reason to. And in regards to ISO vs Imperial, you just agree at the start of the project (or universally for the entire agency/company) and then no discussion is needed in that regard.
    
    1. noblea149 says:
      
      July 4, 2016 at 4:20 am
      
      That’s why the key to good communication is knowing what information to share and to who.
      
    2. mitch says:
      
      July 4, 2016 at 6:18 pm
      
      Metric vs. USCS (“imperial”) wasn’t really the source of the problem.
      
      Let’s say that the whole rocket was programmed using the metric system. Great. Will we measure velocity using km/hr or m/s? Obviously these are both “metric” units, but they’re not interchangeable.
      
      Even if we decide to ignore hours, because of those silly Babylonians and their base-60 system, what happens when one programming team uses km/s, and another uses m/s? Again, they’re both “metric” units, but they’re not interchangeable.
      
2. Whatnot says:
  
  July 2, 2016 at 6:43 pm
  
  There’s also the use of the wrong voltage relays that caused disaster, the Hubble telescope chipped probe fail, as well as the shadow vibration stupidity (military knew, did not tell NASA due to secrecy), the shuttle rings that leaked due to temperature issues, and so many more, sometimes costing lives.
  But this one about the Ariane is new to me, and an ESA one for a change.
  
jaap says:

June 30, 2016 at 12:36 pm

The overflow should have been easy to catch if they tested the exact same code as the code used in the rocket. Testing things that are very similar is called development, not testing.

1. Sweeney says:
  
  July 1, 2016 at 12:12 am
  
  The problem was that the exact same code worked fine on earlier hardware. On the newer hardware with higher performance it failed. That’s something that has been seen repeatedly throughout history.
  
2. Galane says:
  
  July 2, 2016 at 9:02 pm
  
  The problem was the pre-launch stability system from the 4 was installed in the 5, but wasn’t used at all. Yet despite not being used by the flight control, it was activated and left running.
  
  In the 4, the system timed out and shut down before the rocket began its gravity turn. The faster 5 rocket began its gravity turn before the unused bit of kit timed out. With the turn, it spewed junk data into the rest of the system, crashing the backup first, then the primary.
  
  Massive fail because the people responsible for testing decided NOT to do a full-up ground simulation of all the hardware together. They left out the unused hardware because its output wasn’t going to be used, and because it had never had a problem in any launch of the previous model.
  
  But even lazier and problematic was using a *timer* to shut it down instead of an event trigger. All equipment not going to be used on a rocket once it lifts off the pad should be shut down *by the event of lifting off the pad*. Logical, simple, no chance of time based anomalies causing the rocket to spin out of control.
  
  In other words, they set up a race condition with no checks to ensure that things ended in the proper sequence. At least the Ariane 5 didn’t kill people, like the Therac 25 did.
  
  Or how about *just not installing* the equipment that wasn’t going to be used? How much did the people paying for that launch get ripped off for the useless hardware that destroyed their satellites? If Mercedes installed a diamond encrusted clock behind a carpet panel in the trunk, and charged you for it but didn’t tell you it was there, would you not be POed once you found out it was there, especially if it was connected to the rest of the electronics in such a way that it caused weird problems under certain operating conditions?
  
  Sanity checks! Keep asking things like “Is this part needed?” “Has what this does been added to another component or block of code?” “Have we done a FULL test with ALL the code and hardware in operational configuration?” “No? DO IT! We don’t want this exploding or even just very embarrassingly doing nothing because it goes #@%! due to some preventable conflict.”
  
Ed says:

June 30, 2016 at 12:41 pm

Is it bad the first Daft Punk song that came to mind was Instant Crush, lol?

1. heltonbiker says:
  
  June 30, 2016 at 12:54 pm
  
  More of an instant _crash_…
  
2. andres says:
  
  June 30, 2016 at 2:14 pm
  
  it’s perfect
  
3. Mechanicus says:
  
  June 30, 2016 at 2:18 pm
  
  What came to my mind https://www.youtube.com/watch?v=h5EofwRzit0 needless to say I did not get the reference.
  
ajeandet says:

June 30, 2016 at 12:47 pm

It was 4 satellites.
https://en.wikipedia.org/wiki/Cluster_(spacecraft)

Peter says:

June 30, 2016 at 2:01 pm

In a similar vein, the Genesis probe was a delicate device that was supposed to collect particles from the solar wind. So delicate in fact that a regular parachute assisted landing was considered too rough– the plan was to snatch the probe using a helicopter while descending under a parachute. Well, it turns out the accelerometer that was to sense when the probe was entering the atmosphere was installed upside down. As you can imagine, the probe was waiting for an acceleration event (opposite of the expected deceleration) that never happened. It never opened its chute and slammed down onto the desert floor– BAM!

https://en.wikipedia.org/wiki/Genesis_(spacecraft)

1. notarealemail says:
  
  July 1, 2016 at 12:25 am
  
  That was news to me. They really kept the ‘design flaw’ quiet.
  They skipped a pre-test that would have caught the screw-up.
  
  Also, I just learned about nickel hydrogen batteries. So, awesome link [Peter]!
  
  1. Blue Footed Booby says:
    
    July 1, 2016 at 5:38 am
    
    From what I recall (lol, not reading the wiki page) the mistake was caught and fixed, then UNfixed by the guy who makes sure everything matches the designs because the fixer didn’t do the proper paperwork (or perhaps it didn’t occur to him/her that the mistake was in design).
    
2. Galane says:
  
  July 2, 2016 at 9:08 pm
  
  A machinist told me about a bulkhead fitting he’d made for a deep submersible ROV. Turned out he’d inverted the blueprint and made it a mirror image. Turned out OK because the guys cutting the holes in the bulkhead also inverted their drawing.
  
  Thus the mirror image fitting fit perfectly. The hoses and wires routed through the fitting just had to be slightly adjusted in length.
  
  They all decided to re-do the drawings to match their screwups, replacing the ‘incorrect’ originals in the files.
  
dhavenith says:

June 30, 2016 at 3:31 pm

I read the esa report sometime at the start of my programming career and always found Section 2.1 to be a great example of a root cause analysis of a software-intensive system. The report as a whole is very readable.

pff says:

June 30, 2016 at 5:02 pm

yes it was an expensive failure and a good lesson, which is probably why i have heard this story so many times in every single class even vaguely related to software testing, along with the other standard example of the therac-25.
i know which one i would have preferred to be responsible for.
if it was anything like EU projects then the real culprit was probably pointless bureaucracy.

1. ludwig says:
  
  June 30, 2016 at 9:05 pm
  
  or a lack of use of formal methods of verification
  
Thomas Barth says:

June 30, 2016 at 9:34 pm

The snippet that caused the error:
P_M_DERIVE(T_ALG.E_BH) := UC_16S_EN_16NS (TDB.T_ENTIER_16S ((1.0/C_M_LSB_BH) * G_M_INFO_DERIVE(T_ALG.E_BH)))

1. Elliot Williams says:
  
  July 1, 2016 at 2:08 am
  
  Worst variable/function names evar!
  
  1. Whatnot says:
    
    July 2, 2016 at 6:45 pm
    
    Probably not if you know what it refers to.
    
    1. Rachel Elsey says:
      
      July 12, 2016 at 10:52 pm
      
      … and French
      
jpa says:

July 1, 2016 at 1:01 am

Where is the 7 billion dollar figure from? Wikipedia lists the current price as 165-220 million dollars per launch. Has it come down in price so much?

1. AKA the A says:
  
  July 1, 2016 at 1:46 am
  
  The 4 satellites that blew up as well probably weren’t free…
  
  1. jpa says:
    
    July 1, 2016 at 4:51 am
    
    Considering it was the first test flight, they probably didn’t put satellites that cost 10x the price of the launch on it.
    
  2. Jeroen says:
    
    July 1, 2016 at 5:09 am
    
    But they didn’t cost 7 billion dollars either.
    
2. Jeroen says:
  
  July 1, 2016 at 5:09 am
  
  Yeah, I was wondering about that too. It is probably based on the total cost of developing the launcher (this was the first/test launch, after all), but even then it is not a 7 billion dollar mistake, since the program was not cancelled.
  
3. noblea149 says:
  
  July 4, 2016 at 4:41 am
  
  Agreed, this 7 Billion figure does sound over exaggerated. I would argue that the only financial loss would be of the payload, the man hours required to launch the rocket and any damage done by the falling debris.
  
  The best estimation I can find simply says “over US$370m” from link below:
  https://en.wikipedia.org/wiki/Cluster_(spacecraft)#cite_note-2
  
onebiozz says:

July 1, 2016 at 2:30 pm

Rubber ducky debugging has stopped 90% of potential overflows in my software … I highly recommend it

David (@djmips) says:

July 2, 2016 at 3:50 pm

“bringing the dangers of code failure to the attention of the public for the first time.” – by some definition of public we had already had this brought to our attention with the Therac-25

Nomen luni says:

July 3, 2016 at 4:48 am

Heard a similar story of a prototype racing car using a computerised gear shift. The thing threw a drife shaft and crashed. Repaired it and same thing again. The issue was tracked to the use of legacy code from an earlier, less powerful model using an inadequate register length for the car’s (now higher) top speed and resulting in the car believing it was back near zero after it cleared 128mph (or whatever the number was). Of course, shifting into 1st at this speed was not the right decision.

1. Nomen luni says:
  
  July 3, 2016 at 4:50 am
  
  …and of course it should have used a drive shaft rather than a drife shaft.
  