The year was 1996, and the European Space Agency was poised for commercial supremacy in space. Its new Ariane 5 rocket could launch two three-ton satellites into space. It had more power than anything that had come before.
The rocket rose up towards the heavens on a pillar of flame, carrying four very expensive and very uninsured satellites. Thirty-seven seconds later it self-destructed. Seven billion dollars of RUD rained down on the local beaches near the Guiana Space Centre in South America. A video of the failed launch is after the break.
The cause of all this was a single improper type cast in a bit of code that wasn’t even supposed to run during the actual launch. Talk about a fail.
There were two bits of code: one that measured the sideways velocity, and one that used it in the guidance system. The measurement side used a 64-bit variable, but the guidance side used a 16-bit variable. The code was borrowed from an earlier, slower rocket whose velocity would never grow large enough to exceed what 16 bits can hold. The Ariane 5, however, could be described with a Daft Punk song, and quickly overflowed this value.
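To make the narrowing problem concrete, here is a minimal Python sketch. It is not the flight software (which was written in Ada); the function names and the sample values are hypothetical and chosen only for illustration. It contrasts a conversion that silently keeps the low 16 bits with one that refuses any value outside the signed 16-bit range.

```python
# Illustrative sketch only: names and sample values are assumptions,
# not taken from the Ariane 5 software.

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_wrapping(value: float) -> int:
    """Truncate toward zero, then keep only the low 16 bits (two's complement)."""
    raw = int(value) & 0xFFFF
    # Reinterpret the 16-bit pattern as a signed number.
    return raw - 0x10000 if raw >= 0x8000 else raw


def to_int16_checked(value: float) -> int:
    """Truncate toward zero, but raise if the result does not fit in 16 signed bits."""
    n = int(value)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a signed 16-bit integer")
    return n


if __name__ == "__main__":
    slow_rocket = 1234.5    # hypothetical reading that fits comfortably
    fast_rocket = 40000.0   # hypothetical reading beyond the 16-bit limit

    print(to_int16_wrapping(slow_rocket))   # 1234
    print(to_int16_wrapping(fast_rocket))   # -25536: a large value turns negative
    try:
        to_int16_checked(fast_rocket)
    except OverflowError as exc:
        print("rejected:", exc)
```

Neither behaviour is safe on its own: a silent wrap feeds nonsense downstream, and a raised error still has to be handled by something that knows what to do with it.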
The code that caused the overflow was actually a bit of pre-launch software that aligned the rocket. It was supposed to be turned off before the rocket fired, but since the launch got delayed so often, the engineers made it time out 40 seconds into the launch so they didn’t have to keep restarting it.
The ESA never placed blame on a single contractor. The programmers had made assumptions. The engineers had made reasonable shortcuts to make their job easier. It had all made it through inspections, approvals, and finally the launch event.
They certainly learned from the event; the Ariane 5 rocket has flown 82 out of 86 missions successfully since then. It has at least five more launches contracted before it is retired in 2023 for the Ariane 6 rocket being developed now. This event also changed the way critical software and redundant systems were tested, bringing the dangers of code failure to the attention of the public for the first time.
If you want to read more, there is a great discussion on Reddit which tipped us off to this fail, a quite thorough Wikipedia article, and the original article that ran in the New York Times is mirrored here.
Fail of the Week is a Hackaday column which celebrates failure as a learning tool. Help keep the fun rolling by writing about your own failures and sending us a link to the story — or sending in links to fail write ups you find in your Internet travels.
50 thoughts on “Fail Of The Week (in 1996): The 7 Billion Dollar Overflow”
I’m not trying to be pedantic; I’m actually curious. In other parts of the world, is South America known as Southern America, or is that a mistake?
I’m pretty sure it’s a mistake, although I waited for someone else to point it out (thanks, lol).
I checked a few Wikipedia pages and it appears that most languages favor their respective versions of “South America”, although a few apparently have a “Southern America” alternative.
South America – English, French, German
Both, but South America equivalent preferred – Spanish, Italian
Both, but Southern America equivalent preferred – Hebrew
So how would you distinguish between the south in america and south-america if you use southern america?
“America” is not a country… I think you mean to say “United States of America”
We are, as a group, surprisingly more pedantic than others.
This was fun.
One of the early Mariner missions flew into the ocean due to a sign error in the servo loop code that was supposed to keep it on course.
Hey, the name was Mariner after all — just taking its own name a bit too seriously!
I wonder which SUB routine it was? xD
It’s a typo
Yesterday I listened to the recent Embedded.fm podcast (http://embedded.fm/episodes/158) about Ada for embedded ARM. It sounds like this is the sort of bug that Ada is designed to prevent.
No programming language can prevent lazy programmers and engineers.
You are aware that the software in question was written in Ada, right?
Don’t forget the Mars probe crash due to ft vs m being used by different teams.
That was just the part that the media latched onto. The full report into the failure of that mission focused on a lack of communication between the teams working on the project.
It would be a silly and probably costly thing to have the teams on such a project discuss each and every detail when there is no apparent reason to. And in regards to ISO vs Imperial, you just agree at the start of the project (or universally for the entire agency/company) and then no discussion is needed in that regard.
That’s why the key to good communication is knowing what information to share and to who.
Metric vs. USCS (“imperial”) wasn’t really the source of the problem.
Let’s say that the whole rocket was programmed using the metric system. Great. Will we measure velocity using km/hr or m/s? Obviously these are both “metric” units, but they’re not interchangeable.
Even if we decide to ignore hours, because of those silly Babylonians and their base-60 system, what happens when one programming team uses km/s, and another uses m/s? Again, they’re both “metric” units, but they’re not interchangeable.
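One way to keep km/h, km/s, and m/s from being mixed up is to make the unit part of the type instead of a team convention. A rough Python sketch follows; the `Velocity` class and its constructor names are hypothetical, invented here for illustration. Every value is stored in m/s internally, and a bare number cannot be added to a velocity by accident.

```python
# Hypothetical sketch: a tiny unit-carrying type. Not from any real flight code.
from dataclasses import dataclass


@dataclass(frozen=True)
class Velocity:
    m_per_s: float  # single internal representation: metres per second

    @classmethod
    def from_m_per_s(cls, v: float) -> "Velocity":
        return cls(float(v))

    @classmethod
    def from_km_per_s(cls, v: float) -> "Velocity":
        return cls(float(v) * 1000.0)

    @classmethod
    def from_km_per_h(cls, v: float) -> "Velocity":
        return cls(float(v) * 1000.0 / 3600.0)

    def __add__(self, other: object) -> "Velocity":
        # Only Velocity + Velocity is allowed; bare numbers have no unit.
        if not isinstance(other, Velocity):
            return NotImplemented
        return Velocity(self.m_per_s + other.m_per_s)


if __name__ == "__main__":
    team_a = Velocity.from_km_per_s(2.0)    # one team thinks in km/s
    team_b = Velocity.from_m_per_s(500.0)   # another thinks in m/s
    print((team_a + team_b).m_per_s)        # 2500.0
    try:
        team_a + 500.0                      # a unitless number is rejected
    except TypeError as exc:
        print("rejected:", exc)
```

The caller has to say which unit a number is in at the moment it enters the system, which is exactly the information that gets lost when two teams simply exchange floats.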
There’s also the use of the wrong voltage relays that caused disaster, the Hubble telescope chipped probe fail, as well as the shadow vibration stupidity (military knew, did not tell NASA due to secrecy), the shuttle rings that leaked due to temperature issues, and so many more, sometimes costing lives.
But this one about the Ariane is new to me, and an ESA one for a change.
The overflow should have been easy to catch if they tested the exact same code as the code used in the rocket. Testing things that are very similar is called development, not testing.
The problem was that the exact same code worked fine on earlier hardware. On the newer hardware with higher performance it failed. That’s something that has been seen repeatedly throughout history.
The problem was the pre-launch stability system from the 4 was installed in the 5, but wasn’t used at all. Yet despite not being used by the flight control, it was activated and left running.
In the 4, the system timed out and shut down before the rocket began its gravity turn. The faster 5 rocket began its gravity turn before the unused bit of kit timed out. With the turn, it spewed junk data into the rest of the system, crashing the backup first, then the primary.
Massive fail because the people responsible for testing decided NOT to do a full-up ground simulation of all the hardware together. They left out the unused hardware because its output wasn’t going to be used, and because it had never had a problem in any launch of the previous model.
But even lazier and problematic was using a *timer* to shut it down instead of an event trigger. All equipment not going to be used on a rocket once it lifts off the pad should be shut down *by the event of lifting off the pad*. Logical, simple, no chance of time based anomalies causing the rocket to spin out of control.
In other words, they set up a race condition with no checks to ensure that things ended in the proper sequence. At least the Ariane 5 didn’t kill people, like the Therac 25 did.
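The timer-versus-event point can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a reconstruction of the real system: the function names are hypothetical, the 40-second timeout and the 37-second failure are the figures quoted in the article, and "overflow time" simply stands for the moment the measured value leaves the range the old code expected.

```python
# Toy model: all names are hypothetical; times are seconds after liftoff.
from typing import Callable, Optional


def timer_policy(timeout_s: int) -> Callable[[int], bool]:
    """The task stays active until a fixed number of seconds after liftoff."""
    return lambda t: t < timeout_s


def liftoff_event_policy() -> Callable[[int], bool]:
    """The task is shut down by the liftoff event itself, so it never runs in flight."""
    return lambda t: False


def first_fault(active: Callable[[int], bool], overflow_time_s: float,
                duration_s: int = 60) -> Optional[int]:
    """Return the first second at which the task is still running while the
    value is out of range, or None if that never happens."""
    for t in range(duration_s + 1):
        if active(t) and t >= overflow_time_s:
            return t
    return None


# Slower rocket: the value never leaves the expected range during the window.
old_rocket = first_fault(timer_policy(40), overflow_time_s=float("inf"))
# Faster rocket: the value goes out of range before the 40 s timer expires.
new_rocket_timer = first_fault(timer_policy(40), overflow_time_s=37)
# Same rocket, but the task is stopped by the liftoff event.
new_rocket_event = first_fault(liftoff_event_policy(), overflow_time_s=37)

if __name__ == "__main__":
    print(old_rocket, new_rocket_timer, new_rocket_event)  # None 37 None
```

The timer only works while the out-of-range moment happens to fall after the timeout; the event-driven shutdown does not depend on how fast the vehicle is.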
Or how about *just not installing* the equipment that wasn’t going to be used? How much did the people paying for that launch get ripped off for the useless hardware that destroyed their satellites? If Mercedes installed a diamond encrusted clock behind a carpet panel in the trunk, and charged you for it but didn’t tell you it was there, would you not be POed once you found out it was there, especially if it was connected to the rest of the electronics in such a way that it caused weird problems under certain operating conditions?
Sanity checks! Keep asking things like “Is this part needed?” “Has what this does been added to another component or block of code?” “Have we done a FULL test with ALL the code and hardware in operational configuration?” “No? DO IT! We don’t want this exploding or even just very embarrassingly doing nothing because it goes #@%! due to some preventable conflict.”
Is it bad the first Daft Punk song that came to mind was Instant Crush, lol?
More of an instant _crash_…
What came to my mind https://www.youtube.com/watch?v=h5EofwRzit0 needless to say I did not get the reference.
It was 4 satellites.
In a similar vein, the Genesis probe was a delicate device that was supposed to collect particles from the solar wind. So delicate in fact that a regular parachute assisted landing was considered too rough– the plan was to snatch the probe using a helicopter while descending under a parachute. Well, it turns out the accelerometer that was to sense when the probe was entering the atmosphere was installed upside down. As you can imagine, the probe was waiting for an acceleration event (opposite of the expected deceleration) that never happened. It never opened its chute and slammed down onto the desert floor– BAM!
That was news to me. They really kept the ‘design flaw’ quiet.
They skipped a pre-test that would have caught the screw-up.
Also, I just learned about nickel hydrogen batteries. So, awesome link [Peter]!
From what I recall (lol, not reading the wiki page) the mistake was caught and fixed, then UNfixed by the guy who makes sure everything matches the designs because the fixer didn’t do the proper paperwork (or perhaps it didn’t occur to him/her that the mistake was in design).
A machinist told me about a bulkhead fitting he’d made for a deep submersible ROV. Turned out he’d inverted the blueprint and made it a mirror image. Turned out OK because the guys cutting the holes in the bulkhead also inverted their drawing.
Thus the mirror image fitting fit perfectly. The hoses and wires routed through the fitting just had to be slightly adjusted in length.
They all decided to re-do the drawings to match their screwups, replacing the ‘incorrect’ originals in the files.
I read the esa report sometime at the start of my programming career and always found Section 2.1 to be a great example of a root cause analysis of a software-intensive system. The report as a whole is very readable.
yes it was an expensive failure and a good lesson, which is probably why i have heard this story so many times in every single class even vaguely related to software testing, along with the other standard example of the therac-25.
i know which one i would have preferred to be responsible for.
if it was anything like EU projects then the real culprit was probably pointless bureaucracy.
or a lack of use of formal methods of verification
The snippet that caused the error:
P_M_DERIVE(T_ALG.E_BH) := UC_16S_EN_16NS (TDB.T_ENTIER_16S ((1.0/C_M_LSB_BH) * G_M_INFO_DERIVE(T_ALG.E_BH)))
Worst variable/function names evar!
Probably not if you know what it refers to.
… and French
Where is the 7 billion dollar figure from? Wikipedia lists the current price as 165-220 million dollars per launch. Has it come down in price so much?
The 4 satellites that blew up as well probably weren’t free…
Considering it was the first test flight, they probably didn’t put satellites that cost 10x the price of the launch on it.
But they didn’t cost 7 billion dollars either.
Yeah, I was wondering about that too. It is probably based on the total cost of developing the launcher (this was the first/test launch, after all), but even then it is not a 7 billion dollar mistake, since the program was not cancelled.
Agreed, this 7 Billion figure does sound over exaggerated. I would argue that the only financial loss would be of the payload, the man hours required to launch the rocket and any damage done by the falling debris.
The best estimation I can find simply says “over US$370m” from link below:
Rubber ducky debugging has stopped 90% of potential overflows in my software … I highly recommend it
“bringing the dangers of code failure to the attention of the public for the first time.” – by some definition of public we had already had this brought to our attention with the Therac-25
Heard a similar story of a prototype racing car using a computerised gear shift. The thing threw a drife shaft and crashed. Repaired it and same thing again. The issue was tracked to the use of legacy code from an earlier, less powerful model using an inadequate register length for the car’s (now higher) top speed and resulting in the car believing it was back near zero after it cleared 128mph (or whatever the number was). Of course, shifting into 1st at this speed was not the right decision.
…and of course it should have used a drive shaft rather than a drife shaft.
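The gear-shift story above has the same shape as the Ariane bug and can be sketched in Python. Everything here is an assumption made for illustration: the 7-bit register width is picked only so that the wrap lands at 128 (the commenter was not sure of the number), and the gear thresholds and function names are hypothetical.

```python
# Hypothetical sketch: register width and gear thresholds are invented.

REGISTER_BITS = 7                       # assumption: wraps at 128
MASK = (1 << REGISTER_BITS) - 1         # 127, the largest storable speed

GEAR_UPPER_LIMITS_MPH = [30, 60, 90, 120]   # made-up shift points for gears 1-4


def stored_speed_wrapping(speed_mph: int) -> int:
    """Keep only the low bits, as a too-narrow register would."""
    return speed_mph & MASK


def stored_speed_saturating(speed_mph: int) -> int:
    """Clamp to the register's range instead of wrapping."""
    return max(0, min(speed_mph, MASK))


def select_gear(speed_mph: int) -> int:
    """Pick the lowest gear whose upper limit is above the given speed."""
    for gear, limit in enumerate(GEAR_UPPER_LIMITS_MPH, start=1):
        if speed_mph < limit:
            return gear
    return len(GEAR_UPPER_LIMITS_MPH) + 1   # top gear


if __name__ == "__main__":
    actual = 130
    print(select_gear(stored_speed_wrapping(actual)))    # 1: thinks it is doing 2 mph
    print(select_gear(stored_speed_saturating(actual)))  # 5: stays in top gear
```

Saturating is not a full fix, since the stored speed is still wrong above the limit, but the wrong value at least stays on the right side of every threshold.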