It isn’t really a book, but Richard Feynman’s Appendix to the Challenger Disaster Report is still definitely something you should read. It’s not particularly long, but it’s educational and relevant not just as an example of critical thinking in action, but as a reminder not to fool oneself, either individually or at an organizational level. Sadly, while much was learned from the events leading up to and surrounding the Challenger disaster, over thirty years later many of us can still find a lot of the same things to relate to in our own professional lives. There isn’t a single magic solution, because these problems are subtle and often masquerade as normal.
Feynman and the Challenger Disaster
Richard Feynman (1918–1988) was a Nobel Prize-winning physicist and one of the best-known scientists of his time. In 1986 he somewhat reluctantly agreed to join the Rogers Commission, whose task was to investigate the Challenger disaster. The space shuttle Challenger had exploded a little more than a minute after launch, killing everyone on board. The commission’s job was to find out what had gone wrong, how it had happened, and how to keep it from happening again.
Feynman, who was recovering from cancer-related surgery at the time, was initially reluctant to join the commission for simple reasons: he didn’t want to go anywhere near Washington, and he didn’t want anything at all to do with government. As for the shuttle itself, he would read about it going up and coming down, but it bothered him a little that he never saw the results of any of the experiments carried out aboard it published in a scientific journal, so he hadn’t been paying much attention to it. Ultimately he did join the commission, and the process changed him. The shuttle and its related systems were feats of engineering at (and sometimes beyond) the very limits of the technology of the time. He hadn’t fully appreciated the enormous number of people working on the shuttle, or the sheer scale of their dedicated effort. He came to see what a terrible blow the accident was, and became greatly determined to do what he could to help.
It came out that O-rings in one of the joints of a solid rocket booster had failed, but that was only the proximate cause of the disaster. The real problem was a culture of NASA management relaxing criteria and accepting more and more errors while, as Feynman put it, “engineers are screaming from below HELP! and This is a RED ALERT!” This demonstrated an almost incredible lack of communication between management and working engineers, which was itself a clue to what was really wrong at an organizational level.
How Did This Happen?
This situation didn’t happen all at once; it grew over time. NASA was filled with dedicated people, but at an organizational level it had developed a culture of gradually decreasing strictness when it came to certifications for flight readiness. A common argument for accepting a flight risk was that the same risk had been flown before without failure, and that fact was taken as evidence that it was safe to accept again. As a result, obvious weaknesses and problems were accepted repeatedly. This was not limited to the O-rings in the solid rocket boosters that caused the catastrophic failure of Challenger. A slow shift toward lower standards was evident time and time again in other areas as well; safety criteria were being subtly altered with apparently logical arguments for doing so. What was happening was that NASA was fooling itself.
Fooling Oneself Still Happens Today
Much has been learned from the Challenger disaster and similar cases, but over 30 years later people and organizations still struggle with the same basic issues and end up with an environment of bad decision-making. There’s no easy solution, but at least it’s possible to understand more about what to look out for. There isn’t always a particular broken part to blame or replace, and there isn’t always someone specifically at fault. The things that go wrong can be subtle and numerous, and the environment they create may actually seem normal, in an oh-well-what-can-you-do kind of way.
Feynman often asserted that you must never fool yourself. Once you have succeeded in not fooling yourself, it’s easier to not fool others. Feynman observed several ways in which NASA had grown to fool itself, and a common thread was lack of communication.
For example, management confidently estimated the chance of shuttle failure at 1 in 100,000, whereas engineering estimated it closer to 1 in 100 or 1 in 200, and the engineers were fervently crossing their fingers for every flight.
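The scale of that disagreement is easier to feel when compounded over a whole flight program. A minimal sketch (illustrative only: it assumes independent flights with a fixed per-flight risk, and uses 135, the Shuttle program’s eventual launch count, purely as a horizon):

```python
# Probability of at least one loss over a series of flights, assuming
# independent flights with a constant per-flight failure risk.
def p_at_least_one_failure(per_flight_risk, flights):
    return 1 - (1 - per_flight_risk) ** flights

flights = 135  # the Shuttle program's eventual launch count, used as a horizon

# Management's estimate: 1 in 100,000 per flight -> roughly 0.1% overall
print(p_at_least_one_failure(1 / 100_000, flights))

# Engineers' estimate: 1 in 100 per flight -> roughly 74% overall
print(p_at_least_one_failure(1 / 100, flights))
```

Under management’s number a loss over the whole program is a fluke; under the engineers’ number it is close to an expectation.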
Feynman’s experiences on the commission led him to think hard about how this situation actually happened. It struck him that there was a fair bit of fishiness associated with the “big cheeses” at NASA. Every time the commission spoke to higher-level managers, they kept saying they didn’t know anything about the problems below them. Assuming the higher-ups weren’t lying, there had to be some reason why problems at lower levels weren’t making it up to them. Feynman suspected that it came down to bad communication, due to management and engineering fundamentally having different priorities and goals. When the engineers at the bottom say things like “No, no! We can’t do that, because it would mean such-and-such!” and the higher-ups don’t want to hear such talk, pretty soon attitudes start to change and you get an environment that suppresses bad news. Feynman described this process in “What Do You Care What Other People Think?”:
Maybe they don’t say explicitly “Don’t tell me,” but they discourage communication, which amounts to the same thing. It’s not a question of what has been written down, or who should tell what to whom; it’s a question of whether, when you do tell someone about some problem, they’re delighted to hear about it and they say “Tell me more,” and “Have you tried such-and-such?” or whether they instead say “Well, see what you can do about it” — which is a completely different atmosphere. If you try once or twice to communicate and get pushed back, pretty soon you decide “To hell with it.”
That was Feynman’s theory: because promises being made at the top are inconsistent with reality at the bottom, communications got slowed up and ultimately jammed; that’s how it’s possible the higher-ups at NASA actually didn’t know about problems below them. I can’t help but think of all the modern-day situations where technical staff are left to figure out how to deliver on unrealistic promises sales or executives have made, and somehow get through the mess only to have to do it all over again the next week, and I wonder if Feynman wasn’t right on the money.
Two things are certain: people and organizations still fool themselves in similar ways today, and lack of communication is always a factor. But fooling others always starts with fooling oneself, and when it comes to that Feynman had clear advice: “The first principle is that you must not fool yourself — and you are the easiest person to fool. After you’ve not fooled yourself, it’s easy not to fool others. You just have to be honest in a conventional way after that.”
Note: There is a proper book related to this article. “What Do You Care What Other People Think?” by Richard Feynman devotes its second half to Feynman’s experience working on the Rogers Commission.
84 thoughts on “Books You Should Read: Feynman’s Appendix To The Challenger Disaster Report”
“Feynman, who had undergone cancer-related surgery at the time, was initially reluctant to join the commission for simple reasons: he didn’t want to go anywhere near Washington and didn’t want anything at all to do with government”
Taxpayers every April 15 can relate.
“Much has been learned from the Challenger disaster and similar cases, but over 30 years later people and organizations still struggle with the same basic issues and end up with an environment of bad decision-making. There’s no easy solution, but at least it’s possible to understand more about what to look out for. ”
Hmmm, if only AI was capable enough we could take the fleshy-bits out of the role.
“Taxpayers every April 15 can relate.”
I don’t think it was like that with Feynman. His writings, particularly in letters of the era and in “What do you care…” above, suggest more that he enjoyed science so much, he didn’t want to burden it with politics and bureaucracy and get burned — as did Oppenheimer and Von Neumann, both of whom he was very much in awe of (as scientists), and both of whose politics became tremendously burdensome to their ability to do real science in their later years. Along the same lines, he almost turned down the Nobel Prize many years later — the pomp and circumstance, the media, but mostly because he didn’t want his colleagues or students or anybody else to see him as a Nobel laureate first and a scientist second.
In other words, I don’t think he held the government in any more disdain than he did any large organization / bureaucracy. He disdained them all fairly uniformly. I think he simply wanted nothing to do with politics.
In fact, in Surely You’re Joking, he said he often insisted on pseudonyms when delivering speeches at other colleges. He wanted to focus on the physics in his talk to people who cared about that. He didn’t want an audience of people who wanted to have heard a lecture by a Nobel Laureate.
Great book. Great author.
That’s the recommendation I was looking for; you just convinced me to buy that book :)
I’ve become friends with Ralph Leighton, who says people are forgetting who Richard Feynman is. Feynman’s 100th birthday will be May 2018 — yet Google has never done a Doodle of Feynman. I’m hoping to run away with Ralph next year to Tuva and leave an impression of Feynman there for his birthday! Happy birthday, my hero — you were the guiding rudder that led me to an adventurous life!
Cynthia P — is that you? Glad to hear you’ve finally planned The Big Trip!
One thing that I thought was extremely relatable with regards to the shuttle work was this:
The software development side of things for the shuttle was top-notch. Rigorous testing, excellent development environment, the whole works. But the software needed to be precisely written to match payload details. Every time the payload was fiddled with or changed, the software needed to be updated to match precisely (which necessitated expensive testing).
The relatable part for many of us was that management was thinking “Why don’t we cut back on all the testing? It always passes anyway.” But the testing is part of a rigorous development environment that ensures good results! Feynman points out that the correct way to reduce costs is not to remove testing, but to reduce the number of times the payload gets fiddled with or changed, triggering the expensive re-testing! The more things change, the more they stay the same…
I work in manufacturing and this attitude is true here as well.
Um… no. Speaking as a former contractor on the Shuttle (first in ascent trajectory development, then working on the Shuttle’s robot arm), the software for the Shuttle did not need to be re-written for every different payload. Why would it? Mass is mass and that’s all that counted for most purposes. Sure, there was development work on payload deployment, particularly if the payload had odd requirements, but even that wasn’t a re-write of the underlying software. It was just a different sequence of movements (all of which had to be planned, tested and simulated, with contingencies planned for every possible point of failure).
The Shuttle’s basic operating system was indeed a top-notch software development project (it took a whole committee to OK a one-line change to the code). It didn’t change for every launch, however, because the process for making even small changes to the software was far too rigorous for that.
Feynman’s book is exactly right. The problem was cultural, not software or engineering.
It also brings to mind another not unrelated fail at NASA. The Hubble Space Telescope Mirror. https://www.techworld.com.au/article/420036/what_went_wrong_hubble_space_telescope_what_managers_can_learn_from_it_/?pp=2
Wow, what a story. How could such a critical task (checking the mirror) be assigned to someone who obviously had no clue about how to do it? Scratching the distance standard?
Same problem, different day, Space Shuttle Columbia…
Came here to say this.
Happy to hear that I’m not alone in seeing the parallels.
Sadly, Columbia was proof that NASA learned nothing from Challenger and was incapable of learning.
“you must not fool yourself” Something people are doing when they think they can go to Mars without issue. I’ve seen a lot of people proposing fixes for the many things that will kill you, but not a single complete solution. As much as I would love to see a Mars base before I die, I don’t think I will. With current tech you’ll be dead or near to death before you get there.
On the other hand, the average human lifespan increases exponentially, and soon it will break the threshold of “more than 1 year per year increase”, so maybe you have more time than you think.
I’d love to hear more on this subject of increasing life span. I hear often that we are but don’t see the data to back it up.
All I see is more people living longer simply because there are more people.
Only about 88% of humans who have ever been born, have died, so there’s that statistical 12% chance of being immortal.
0% of the members of my new sect have died, come and join!
Since the first thing that breathed air as a human, to now. Though I misremembered the figures a bit; I should have remembered that approximately 100 billion humans have ever lived, with around 7 billion alive now, so that’s 93% ever died, 7% haven’t.
I was pretty sure all of them(us) died except for the tiny proportion that still happens to be alive right now.
You got it, 7% are alive right now, ergo based on statistical data, you only have a 93% chance of dying.
You do know that the term “squaring of the survival curve” has nothing to do with exponents, right?
The average might be increasing, but there’s a hard limit before 120. We’re just increasing the number of people who get closer to that limit.
Thanks for this link, the Wikipedia link gives a good overview and synopsis that I found very interesting.
more elaboration on the book:
Continue reading —> please
Genuine question, and apologies for derailing the discussion a bit: We frequently see this asked, and I understand this to be a request that the author place a page break in the content when posting, but why? Is it a mobile thing or some browser-specific thing?
I come in the front page, using Firefox on a laptop, and I see the list of articles with a picture and few-line start, and a “…read more” link on every article (including this one). What are people seeing differently that’s objectionable?
I can fill in a little bit about what looks different and how. At Hackaday, the front page has short segments of more or less consistent length with a thumbnail image and “…read more” for each. But if you click on “Blog” you get same content but a slightly different view (what wordpress calls blogview.) Each post has a header image, but some posts appear in their entirety and some appear with a “click to read more” break to get the rest.
At Hackaday, if there is an embedded video (or image gallery, etc.) it is always behind a “click to read more” break, so that content doesn’t get loaded at all unless the reader clicks through. If an article is mostly text it may appear in its entirety with nothing stuck behind a break, but this happens only in the Blog view; on the front page each post or article is always a thumbnail, a short segment, and a “…read more” break. (Animated gifs are sometimes given exceptions if they are kept small and judged an important visual, but it’s an ongoing process figuring out what’s best.)
Ah. Yes, I see. I never used the blogview mode before. Many thanks for the explanation.
Not fool yourself or others? Then what are all those managers going to do?
I think I’m going to make a very detailed list of all of the bandaids I’ve had to put on the machines at work.
Management probably thinks the CNC equipment will last through the year.
Feynman is an amazing person. His insights from this experience are invaluable to all of us, and highlighting them here makes this one of the most valuable articles I’ve seen here in a while.
One thing to note is that the cause of the Challenger disaster is *not* what is popularly believed and stated in the media–most people think the problem was caused by a combination of the O-ring and the cold temperatures of that launch day. That is incorrect–inspection of the O-rings after launch on much warmer days had shown degradation of the O-rings as well, and the shuttle had been launched in cold conditions before, at least once with no damage to the O-rings. The problem was actually a known fault in the engineering design being ignored because “it’s always worked, so it’s not the problem they’re making it out to be. Besides, we have a schedule we have to keep, and fixing it will take too long and be too expensive!” If you’re curious, the fault has to do with the joint design between segments of the booster having a direct path between the O-ring and the combustion chamber. Wouldn’t normally be a problem except, you know, O-ring material doesn’t much like the heat.
Here’s a great presentation that Mike Mullane does about the problems Feynman described.
Over the course of reading comments at Hackaday, I have come to feel that this is what the whole DIY community should view. Again, from viewing the comments, I believe most will not think it relevant to their activity, and that those who do see a relevance may believe it will never happen to them. Not that I’m saying the DIY community is under the same sort of pressure, just that the DIY community in general isn’t conditioned to be aware of potential hazards, no matter how often they have seen Norm’s safety announcement.
Thanks for this video. I hadn’t seen it until now. I think I like what he says about Normalized Deviance.
According to Mike, deviance might initially occur under a high-pressure situation. That is, in the face of mounting pressure to put the shuttle in the air 24 times a year, rationales were developed for ignoring certain safety criteria that had originally been developed free of such pressure. Mike says that’s a natural human tendency. And the immediate consequence of ignoring the safety criteria was nothing but a successful launch. And because humans tend to weigh their own personal experience over statistics (e.g., gambling in a casino), the bar of tolerable damage to O-rings was subsequently lowered. Mike didn’t directly say this, but I think it’s implied that management or the culture at NASA was responsible for this bar-lowering, because they (the managers) experienced the pressure, not the engineers who developed the rocket boosters. It was the latter who continued to write memos and voice their concerns to NASA, which, as we know, went unheeded.
Immediate reality check (feedback) which management is open-looped from.
Problem is us, and I can’t get the screwdriver and soldering iron into such tiny places to fix it on even just one. Beyond economical repair.
I will never forget January 28, 1986. Never.
My 6th grade school teacher came into the classroom just after we returned from lunch, stood at the front of the class and announced, “At 11:38 this morning, the space shuttle Challenger lifted off from Cape Canaveral, Florida.” As we were in New Hampshire, we had all been avidly following the details of the flight, particularly anything about Christa McAuliffe who lived less than an hour’s drive from us. She was our home-town hero, so to speak. Naturally, the classroom full of 6th graders, myself included, erupted into cheering and high-fiving with this seemingly wonderful news. My teacher waited patiently for the room to become quiet again, and then somberly announced, “Seventy-three seconds later, the space shuttle Challenger exploded, killing all seven astronauts.” This time you could have heard a pin drop as two dozen 6th graders stood in stunned silence with our mouths hanging open.
I was in middle school about 50 miles east of Cape Canaveral. Standing outside in the PE area watching the whole thing. It was a rough day in central Florida.
I was part of the engineering staff at a local TV station in 1986, and in the repair shop we were watching the launch directly off of a satellite feed.
If you take away one thing from this, remember Feynman’s last sentence:
” For a successful technology, reality must take precedence over
public relations, for nature cannot be fooled.”
The worst part is, they probably didn’t even die in the explosion, but only after an agonising trip back down.
Sadly, you are correct. There was evidence found inside the shuttle that indicated that at least one or two of the astronauts was conscious after the explosion. I can’t even begin to imagine the terror that they must have felt. Very sad indeed.
Caught this interview last year with Bob Ebeling, one of the engineers who tried to stop the launch. It’s tragically sad that not only did his (and other engineers’) warnings go unheeded, but he then blamed himself for much of the rest of his life.
It’s nice that he feels bad about it, but did he explain to management (or even understand for himself) the problem was the field joint rotation after ignition of the SRBs? That NPR article only talks about cold O-rings. The correlation between O-ring performance and temperature was not that strong.
This kind of thinking is nowhere more evident than in modern software security. The number of times I’ve had management brush off a potential vulnerability because “features”.
How many of the recent major breaches happened and the engineers knew all along? How many IoT devices with no or laughable security have been released with the engineering staff aware?
Didn’t BA lay off IT staff, then have a major system crash? But they claimed the two were not connected.
ever has it always been, ever will it always be.
How many? My money is on all of them.
As long as managers are more important than engineers, things like this will happen.
As they say, sht always floats to the top. The exact people that shouldn’t be there.
Hey, here’s an idea, let’s put the 110 IQ MBAs over the top of the 130 IQ eggheads…
I thought having an IQ higher than 79 automatically disqualified one from being able to earn an MBA…
I think that was something they tried at Harvard for while, then decided the candidates should be minimally able to button their own shirts/suits and avoid drooling on the dictaphone, which were still rather expensive at the time.
Which is also of course the standard that keeps 160+ IQ folks out of gainful commercial employment and corralled up in ivory towers.
Are you sure they would not have found it easier to use their hands?
It’s an old joke, but has age sullied its subtle humours, blunted its rapier wit? …. Yes.
Nah, it isn’t one group of people vs. another group of people. It’s just part of what I have come to call the American mythologies. The one in play with the shuttle program is that it’s reasonable to expect that investors have the right to receive ever-growing returns on their investment. Sure, NASA isn’t a for-profit venture, but all the contracting companies are. The shuttle program provided employment for a lot of people, and the investors can’t make money if those employees aren’t working.
In the heyday of the dot-com boom, companies got big heads about how they could do the impossible, repeatedly. The expected thing was that the right team would always go 110%, the ducks would always line up obediently, and we’d always pull it off.
I well remember one project where we developers said initially that the requirements probably couldn’t be met in the desired time. We repeated that warning when progress reports were made, yet we all got reamed when the deadline was actually missed, despite everyone’s best efforts.
Sometimes no means no.
I was working for Hughes Aircraft when it happened. A few years before there had been an announcement in Hughes News, the company newspaper, that they were looking for someone to fly on the shuttle and do an experiment, any employee could apply. I think the experiment was about the effects of rotation in zero-G on the fuel in fuel tanks. Being a new engineer with only a few years’ experience I figured I had little chance of being selected and did not apply. It would probably be an Engineering Manager from Space and Comm with a lot of experience. When they announced who had been selected it was Gregory Jarvis, an Engineering Manager from Space and Comm with a lot of experience. I remember thinking, he was one of us and if I had more experience that could be me.
I followed his progress. He was scheduled to fly in December 1985 but got bumped to the next flight by the first politician in space, a senator. January 27’s weather report had said it would be too cold to fly the shuttle so I was not paying any attention to the news. I found out when a manager called us together to tell us that “there had been an explosion of the shuttle on lift-off but they didn’t know what had happened to the crew, it could be like that show Lost In Space where the explosion pushes them further into space”. He was fooling himself.
Notably Feynman’s report appears as an addition to the Rogers report, not part of it.
Many members of the commission did not want to be associated with what they saw as something that was overly critical of NASA’s hierarchy and wanted the report to contain the facts only.
Even though many of them shared Feynman’s sentiments they had jobs in the government and military they did not want to jeopardise, or their future prospects. Feynman was the only truly independent commission member and threatened that he would remove his name from the report ( which would have caused negative publicity) if he was not allowed to have his say.
A compromise was reached: Feynman’s report would be allowed as an appendix to the Rogers report, so that it was clear that it came from him alone.
Just looked this up – there were 135 launched Shuttle missions. Two resulted in loss of vehicle and loss of human life. *1-in-67.5 failure rate* over the life of the program. Even the 1-in-100 estimate from the engineers was 30% overoptimistic.
Assuming the same failure rate for SLS, and their current plans for 2 launches a year, it should go 33 years before catastrophic failure!
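The arithmetic in the comment above is easy to check (a quick sketch; the SLS extrapolation is of course naive):

```python
# Observed Shuttle record: 2 vehicle losses in 135 launched missions.
losses, flights = 2, 135

flights_per_loss = flights / losses   # 67.5, i.e. a 1-in-67.5 failure rate
print(flights_per_loss)               # 67.5

# The engineers' 1-in-100 estimate vs. the observed 1-in-67.5:
# 67.5 is about 32.5% below 100 -- the "30% overoptimistic" above.
print(1 - flights_per_loss / 100)     # 0.325

# Naive extrapolation at 2 launches per year:
print(flights_per_loss / 2, "years per expected loss")  # 33.75
```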
Even if the risk were triple, quadruple, or worse… they will still have an excess of qualified and competent volunteers that will be turned down. The good of the many… And it would not bother me if it resulted in joining John Doohan.
If you build it, they will come.
ackk! James Doohan.
“Much has been learned from the Challenger disaster and similar cases, but over 30 years later people and organizations still struggle with the same basic issues and end up with an environment of bad decision-making. ”
Which is pretty much exactly why the Columbia shuttle disaster happened in 2003. So how much did NASA learn? I read the Feynman report on Challenger just after the Columbia incident, and I remember thinking that they could have recycled the Feynman document with just a few changes to describe what happened.
Sadly Feynman wasn’t around in 2003 to comment on Columbia but I think he would have agreed. I had the good fortune to meet his daughter recently but couldn’t manage to bring up that subject gracefully. The same person who made that introduction also got me permission to tag along on a tour of SpaceX, which seems to have a rather different dynamic between management and engineering, to the extent that there is actually a division between the two. Free frozen yogurt too!
Rockets are so damn cool!
Oh please… that’s not a problem that “existed” in one specific organization – it’s a problem that continues to exist absolutely everywhere all the time. Management NEVER wants to hear about any problems, because then they’d have to make a call that even the engineers aren’t able to make, which therefore could only reasonably ever be “stop and cancel everything”. Engineers are happy to point out issues they see exactly because they don’t have to make that call – or conversely, they are not the ones responsible for actually getting something done. Please understand, I’m not saying any of this trying to “fault” either category. I’m just attempting to illustrate that an engineer might be perfectly fine with a “we can’t do this” call that management can’t possibly accept without solid evidence of the consequences of going ahead nonetheless – which the engineer doesn’t have or cannot reliably predict.
Yes, going ahead with the launch in out-of-spec weather was definitely an… unwise idea. But the thing is, whenever one of us techies pop up with “hey, there’s a problem here”, we can hardly ever give a definitive answer to the question that invariably follows from the management: “Will it be an issue? How big?”. Well, we don’t know. Nobody does. “Will the shuttle blow up if we launch anyway?” Well, it might. Or it might not. It’s a risk. “How large?” “How the hell would I know?!?” And if this would be the Only Issue Ever (or if it was unmistakably clear to all involved that this one is specifically a huge issue) then I think most sane people would take a step back and go the prudent route even if it comes at a cost (and it always does). But issues are Legion, always, and most of the time most people involved cannot possibly predict exactly how large of an issue they are going to actually be, while stuff still needs to get done at the end of the day or we can just all pack up and go home once and for all.
So we just make some sort of judgement call instead, knowing full well it is not an actual exact prediction of what will actually happen. But most of the time things ultimately still turn out alright – and most of the time we never get to find out which of our calls were near misses and which were never a serious problem. How was anyone supposed to predict that pieces of foam insulation could, under the right conditions, put a hole in a wing? It was so counter-intuitive nobody took the idea seriously until they actually tried and saw the consequences. Yes, it could have been investigated _in advance_ – if only anybody had a clue it _might_ turn out to be a major issue. Except nobody did, and it was just one thing among millions. What else would you thoroughly investigate before attempting a launch? The effects of potential lightning strikes on the launch pad? On the rocket in flight? Bird hits? Dust on seals? Dead bugs in junction boxes? Where do you stop…? There’s no end to the list of issues that _might_ be a problem – so do you just investigate the ones that are _likely_ to be, and go ahead ignoring everything else that _might_ be a problem (nobody can tell exactly how big) or do you just give up and never get anywhere…? Hindsight is 20/20 you know…
Yeah, in the end, the Challenger disaster was a very unfortunate consequence of this process running too far into the red. But to think that the underlying phenomenon is just a defective cultural issue that can simply be identified and set aside is full-on lunacy. It’s an organic part of the process of getting anything at all done while taking estimated risks of unknown real magnitude, and the best you can hope for is to have functioning processes in place that try to at least limit the risks within reasonable limits. In the Challenger case, those were obviously somewhat clogged and miscalibrated – but no manager ever will want to hear a warning about things that invariably can’t be quantified…
>”How was anyone supposed to predict that pieces of foam insulation could, under the right conditions, put a hole in a wing? ”
They didn’t need to, because it did happen on occasion. In previous flights, the pieces of insulation did damage the tiles, but in non-critical locations. Sometimes entire tiles had been missing, but never on the critical leading edges of the wings.
NASA knew the risk, but chose not to do anything about it. They knew that the foam pieces could put dents and holes in the thermal tiles that were never specified to take any hits in the first place, but since it didn’t seem to cause any trouble it was simply accepted and the design flaw glossed over.
It’s like watching a trail of gasoline leaking from your tank and complaining about the smell of petrol, but since you’ve already driven a thousand miles without incident, you just decide that the constant drip of fuel is a feature and not a flaw.
>”How was anyone supposed to predict that pieces of foam insulation could, under the right conditions, put a hole in a wing? ”
If anyone should know that the kinetic energy equation includes a velocity squared component, it would be a rocket scientist.
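The point about velocity squared is easy to put rough numbers on. A small sketch, using approximate figures often cited publicly for the Columbia foam strike (a piece of roughly 0.75 kg hitting at roughly 240 m/s relative velocity – rough order-of-magnitude assumptions here, not precise NASA data):

```python
# Why foam strikes matter: kinetic energy scales with velocity SQUARED.
# Mass and velocities below are rough illustrative estimates only.

def kinetic_energy(mass_kg: float, velocity_ms: float) -> float:
    """KE = 1/2 * m * v^2, in joules."""
    return 0.5 * mass_kg * velocity_ms ** 2

foam_mass = 0.75  # kg, rough estimate of the foam piece

# A slow tumble vs. being hit by the ascent airstream:
# 10x the velocity means 100x the energy.
for v in (24, 240):
    print(f"v = {v:4d} m/s -> KE = {kinetic_energy(foam_mass, v):8.0f} J")
```

At 240 m/s the “lightweight” foam carries on the order of 20 kJ – comparable to a small car bumper impact – which is why intuition about soft, light foam being harmless fails so badly.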
The root cause of the Challenger failure was a bad SRB field joint design which, contrary to intentions, actually rotated to increase the gap between SRB segments upon ignition of the SRB.
Asking O-rings to reliably seal a suddenly expanded gap like this was not a wise choice. Remember that there are enormous forces on these field joints. It’s not just the pressure of the SRB burning, but also the shock of the STS RS-25 SSMEs ramping up to full thrust which torques the entire stack.
A couple more drawings.
The real fix was to add a capture feature to prevent joint rotation from expanding the gap.
This problem was well known but ignored by NASA until they suffered a loss of vehicle and crew.
1972 – contract awarded to Morton Thiokol to design the Solid Rocket Boosters (SRBs)
– the design is based on a modified Titan III rocket, with significant design changes
– one of the changes was an o-ring seal along the rocket body. The joint was made longer, and a second ring added to provide a redundant seal.
1977-78 – An engineer discovers during tests that under pressure the joints rotated significantly, causing the secondary o-ring to become ineffective. This was a result of elongating the joint to hold the secondary o-ring. Morton Thiokol management did not recognize the problem.
1980 – The joint is classified on the CIL (Critical Item List) as 1R, indicating that failure would be catastrophic, but there is a redundant o-ring to act as a backup in the event of failure. This was only one of 700 items listed as criticality 1.
1981 – the shuttle begins orbital testing
1982 – the space shuttle is declared operational
– After a few flights, problems with the o-rings were noted, as were other items. The normal procedure was to assign a problem tracking number, and examine the causes. This was not done for the o-ring problem. Eventually the problem was recognized and the rating was changed to 1 on the CIL. It was shown that despite NASA’s reclassification, the system was still listed as 1R in the Morton Thiokol paperwork, as well as a number of other documents. Also, Morton Thiokol disagreed with the criticality change, and went to a referee procedure.
1984 – The erosion of the o-rings has become a significant concern, and review procedures are requested for the packing of the o-ring joint with the asbestos-filled putty that prevents heating of the rings. Morton Thiokol responds with a letter suggesting that the higher pressures used in testing the joints were resulting in channels in the putty, and increased erosion of the o-rings. Statistics from before and after the change in testing pressure seemed to confirm this. Morton Thiokol recommends continuing the tests to ensure sealing despite the problems, and begins investigating the effects of the testing on the putty.
Jan 1985 – A launch of a space shuttle at the coldest temperatures to date leads to the greatest failure of the o-rings to date. The o-rings will deform under pressure to seal the gap, but this is hindered when they are colder, and the material stiffer.
Jan-April 1985 – Continued flights and investigations show continued problems with the o-rings, and a relationship to launch temperature. Morton Thiokol acknowledges the problem, and the effects of temperature, but concludes that the second o-ring will ensure safety.
April 1985 – the primary o-ring does not seal, and the secondary ring carries the pressure, with some blowby (i.e., the backup was starting to fail). As a result a committee concludes that the shuttle must only be operated in an acceptable flight envelope for the o-ring seal. This report is received by Morton Thiokol, but does not seem to be properly distributed. The problem was also not properly reported within NASA to upper management.
July 1985 – A Morton Thiokol engineer recommends that a team be set up to study the o-ring seal problem, citing a potential disaster.
August 1985 – Morton Thiokol and NASA managers brief NASA headquarters on the o-ring problems, with a recommendation to continue flights, but step up investigations. A Morton Thiokol task force is set up.
October 1985 – The head of the Thiokol task force complains to management about lack of cooperation and support.
December 1985 – One Thiokol engineer suggests stopping shipments of SRBs until the problem is fixed. Thiokol writes a memo to NASA suggesting that the problem tracking of the o-rings be discontinued. This led to an erroneous listing of the problem as closed, meaning that it would not be considered critical during launch.
Jan 1986 – The space shuttle Challenger is prepared to launch on Jan. 22. Originally it had been scheduled for July 1985, then postponed 3 times and scrubbed once. It was rescheduled again to the 23rd, then the 25th, 27th, and 28th, as a result of weather, equipment, scheduling, and other problems.
Jan. 27th, 1986 – The shuttle begins preparation for launch the next day, despite predicted temperatures below freezing (26°F) at launch time. Thiokol engineers express concerns over the low temperatures, and suggest that NASA managers be notified (this was not done). A minimum launch temperature of 53°F had been suggested to NASA. There was no technical opinion supporting the launch at this point. The NASA representative discussing the launch objected to the Thiokol engineers’ opinions, and accused them of changing their positions. Upper management became involved with the process, and “convinced” the technical staff to withdraw their objections to the launch. Management at Thiokol gave the go-ahead to launch under pressure from NASA officials (this was the critical decision).
– the shuttle is wheeled out to the launch pad. Rain has frozen on the launch pad, and may have gotten into the SRB joints and frozen there also.
Jan., 28th, 1986 – The shuttle director gives the OK to launch, without having been informed of the Thiokol concerns. The temperature is 36°F.
11:39 am – The engines are ignited, and a puff of black smoke can be seen blowing from the right SRB. As the shuttle rises the gas can be seen blowing by the o-rings. The vibrations experienced in the first 30 seconds of flight are the worst encountered to date.
11:40 am – A flame jet from the SRB starts to cut into the liquid fuel engine tank, and a support strut.
11:40:15 am – The strut gives way, and the SRB’s pointed nose cone pierces the liquid fuel tank. The resulting explosion totally destroys the shuttle and crew.
11:40:50 am – The SRBs are destroyed by the range safety officer.
Thanks, interesting read. It looks almost like a natural law: the more people get involved, the more likely the system is to slide into a disorganized state.
I have more than once read about “rotation” of the joints. The way it was explained to me in what I read (and without this explanation I would have been very confused because I’d think of it all wrong!) is that under the great pressure from inside the rocket, the walls of the rocket bulge outwards a bit. And because the rocket booster sections are thicker at the joints, the walls bulge out more than the joints themselves. You see this in the first drawing of pngai‘s comment. As a result of this bulging, the joint bends a little and “opens up” a bit where the o-rings are (the second drawing).
I’m grateful for that plain explanation I read because without it I’d be thinking the sections were “rotating” as in somehow twisting, and would have been very confused!
Same here, it took me a while to understand what “rotation” of the SRB joints meant, but once you do, what happened to Challenger seems inevitable. And it is clear that the capture feature added as a fix is the right solution.
Maybe go with a mono-shell for the rocket.
Trying to build the SRBs in one piece would have raised a lot of other issues such as how do you pour such a large engine without voids or cracks? There’s also the transportation problem.
The Space Shuttle presented NASA with a unique lift-off environment, which magnified the adverse effects of dynamic overshoot on the system. The Shuttle assembly is asymmetric. The thrust line of the main engines (SSMEs), as they ignited before lift-off, is 9m (32ft) from the centre of the launch-vehicle structure, exerting a tremendous torque. The bolts holding the Shuttle on the pad withstand enormous forces, which are transmitted from the Orbiter through its support struts to the external tank (ET) and through the tank mountings to the SRBs.
In the 1970s, the Shuttle was designed to lift-off with SRB ignition at T+3.832s, at 90% SSME thrust, with the base bending force at the bottom of the aft SRB segments estimated at 350 million in lb – NASA’s chosen unit of bending – (39.5 million Nm), as the vehicle lurched laterally – a phenomenon predicted, but not understood fully, caused by the effect of the angular positioning of the main engines. The Shuttle system was therefore designed within an allowable limit of 347 million in lb.
It was discovered that the horizontal movement of the Shuttle immediately after lift-off (the momentum of the lateral lurch) could cause a collision with the launch pad. The torque build-up would not simply have disappeared at launch; it would have been relieved by pushing the Shuttle sideways like a released spring, striking pad structures. To avoid this, the vehicle could have been tilted on the pad, or lifted off at a lower thrust.
In 1980, however, a lift-off delay of 2.5s was introduced instead, to give the vehicle time to rebound from the lateral movement. The base bending moment at this lift-off point at T+6.332s – 100% thrust rating – would actually be reduced to 150 million in lb. NASA failed to realise, however, that, to reach the new bending moment, the system had to go through a new peak bending-moment load of 580 million in lb – dynamic overshoot – for which the Shuttle was not designed (see diagram). It exceeded even NASA’s design-plus safety margins.
The Shuttle lurches horizontally at main-engine start and, as thrust builds up, springs back to its original position for lift-off at 100% thrust. Because the thrust ramps up rapidly, the spring-back overshoots – this is the dynamic overshoot. If there were no dynamic overshoot, the craft would simply stay in the lurched position and lift off at 100% thrust from there. For it to lift off at 100% thrust in its original, near-vertical position, the structure must have sprung back through loads exceeding the static case.
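The overshoot described above is the textbook behaviour of a spring-mass system hit with a suddenly applied force: the peak deflection is twice the static deflection. A minimal single-degree-of-freedom sketch (with arbitrary force, stiffness, and mass values – illustrative only, not Shuttle structural data) shows the factor of two:

```python
import math

# Undamped spring-mass system under a step force F:
#   x(t) = (F/k) * (1 - cos(w*t)),  w = sqrt(k/m)
# which peaks at 2*F/k - twice the static deflection F/k.
# All numbers are arbitrary illustration values.

def step_response_peak(force: float, stiffness: float, mass: float,
                       steps: int = 10000) -> float:
    """Numerically find the peak deflection over one natural period."""
    w = math.sqrt(stiffness / mass)       # natural frequency (rad/s)
    period = 2 * math.pi / w
    dt = period / steps
    peak = 0.0
    for i in range(steps + 1):
        t = i * dt
        x = (force / stiffness) * (1 - math.cos(w * t))
        peak = max(peak, x)
    return peak

static_deflection = 1.0 / 10.0            # F/k with F=1, k=10
peak = step_response_peak(1.0, 10.0, 2.0)
print(peak / static_deflection)           # -> ~2.0, the overshoot factor
```

A suddenly applied load briefly stressing the structure to roughly double the steady-state figure is exactly the shape of the 580 vs. 150 million in lb discrepancy described above, though the real vehicle is of course far more complex than a single spring.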
F=MA is pretty basic. I’ve never understood how any engineer could have ignored the A of a shuttle during ascent. There was clear cause to spend a few seconds on a simple calculation, but all it got was jaw flapping.
The O-ring problem was even discussed in meeting, and sounds like interoffice politics ruled the decision. Inappropriate.
I would still take the ride if offered, but would give every engineer a calculator at my own expense.
The download link for “Appendix to the Challenger Disaster Report” mentioned in the first sentence of the article doesn’t work for me. Am I doing something wrong (like maybe having been born in the wrong place, for instance)?
Aha – the download link works through the Tor browser, so presumably science.ksc.nasa.gov is too snooty to respond to http requests from South Africa. Shame on you.
What concerns me even more, from my experience in USDA-regulated and especially DHHS/DEA/DOJ-regulated industry, is treating the “Certificate of Analysis” (CofA) as if it were the actual specification testing for each batch or lot received, rather than actually performing the specific identification, qualitative, and quantitative testing (foreign matter and impurities included). I gave my NIR team positions over to making sure there was at least a confirmation testing group dedicated to verifying the CofAs, when I would much rather have had more testing and process analytical technology (PAT) controls for real-time, more monkey-proof automated testing within the batch/lot manufacturing processes, with confirmation equipment PQs still in place of course. The new Directors and higher-ups didn’t want the PATs. They seemed to want to defraud the customers, and even the shareholders and maybe the stakeholders.
I used to own and read “What Do You Care What Other People Think?” I can’t stand the later “to hell with it” attitude. That was a trait I encountered when I was younger and didn’t want to emulate. I do a more… step back really slowly and take a look at it later.
What I found is that Human Resources can become poisonous, even deadly, in some communities. Like when I applied at the NSA: HR was not as smart as I expected, same with Stryker. More so, I was deterred from ever applying again due to their incompetence… especially after dealing with the back-stabbing cult HR at the Perigo Company, which is literally connected to Satanists and malfeasance at the least, never mind the local community law enforcement. U.S. culture has become, from my observations, an opioid- and cocaine-brain-damaged, on-meth-or-alcohol-or-something drugged, or maybe sexually deviant culture of lies, deceit, delusional grandiosity, and tangents, even down to the literal interpretation of word definitions in some parts.