Just a few days after Christmas last year, AirAsia Flight 8501, traveling to Singapore, tragically plummeted into the sea. Indonesia has completed its investigation of the crash and just released the final report. Media coverage, especially in Asia, is extensive. The stories are headlined by pilot error but, as technologists, there are lessons to be learned deeper in the report.
The Airbus A320 is a fly-by-wire aircraft, meaning there are no mechanical linkages between the pilots and the control surfaces. Everything is electronic, and most of a flight is under automatic control. Unfortunately, this also means pilots don’t spend much time actually flying the plane, possibly less than a minute, according to one report.
Here’s the scenario laid out by the Indonesian report: A rudder travel limit computer system alarmed four times. The pilots cleared the alarms following normal procedures. After the fifth alarm, the plane rolled beyond 45 degrees, climbed rapidly, stalled, and fell.
The media headlines focus on the latter steps in the failure chain, in part because the pilots were never trained to handle the type of upset that occurred. It wasn’t just AirAsia that omitted this training on the A320; all airlines did, because Airbus, the aircraft manufacturer, did not expect the aircraft ever to experience such an extreme upset. Note that France, as the home country of Airbus, participated in the investigation.
As technologists we need to look deeper. The technical root cause was cracked solder joints on circuit boards in the rudder travel limiter system, which limits the amount of rudder movement at high speeds. A key point is that this same system failed 23 times in 2014; the cracked joints were judged minor damage and never repaired.
As in many accidents, the failure chain is a cascade of human failures to respond properly to a technical fault. Little mentioned in most reports is how the pilots attempted to clear the fifth rudder fault. They followed normal procedures for the first four faults, but the fifth time they pulled and reset a circuit breaker while in flight. Somehow that disconnected the autothrust and autopilot, and they were never restored. This put the pilots solely in control of the plane through the fly-by-wire system.
Tragic Sequence of Events
To summarize, here are the three key failures:
- Bad solder joint,
- Cycling the circuit breaker,
- Inadequate recovery training.
We’ll ignore the mistake of not properly troubleshooting the board. That is a human failure but also a larger policy issue for AirAsia and not directly technical.
Bad solder joints occur despite our best efforts to prevent them in manufacturing. Diagnosing an intermittent joint failure can be a nightmare, so we can sympathize with the aircraft maintainers. How should we handle intermittent failures in critical or important systems? Clearly the system was checking its own integrity, because it kept issuing warnings throughout 2014. Is it feasible for a system to refuse to function once a certain number of failures have occurred? I’d suggest that after six faults it could move to a heightened alert, like refusing to boot when powered on in a safe environment (i.e., parked on the ground). Basically the system says, “I know I’m bad, now fix me.”
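As a sketch of that “I know I’m bad, now fix me” policy, here’s a minimal Python version. The names (`FaultLog`, `RudderLimiter`) and the threshold of six faults are my own illustrative assumptions, not anything from the A320 or the accident report:

```python
# Sketch of a self-lockout policy for an intermittently failing subsystem.
# FaultLog and RudderLimiter are hypothetical names, not avionics terms.

FAULT_THRESHOLD = 6  # assumed: after this many faults, refuse to start on the ground


class FaultLog:
    """Stand-in for a persistent counter of integrity-check failures."""

    def __init__(self):
        self.count = 0

    def record_fault(self):
        self.count += 1


class RudderLimiter:
    def __init__(self, fault_log):
        self.fault_log = fault_log

    def power_on_self_test(self, on_ground):
        # In flight, keep running in a degraded mode rather than drop out;
        # on the ground, a repeat offender refuses to boot until serviced.
        if on_ground and self.fault_log.count >= FAULT_THRESHOLD:
            return "REFUSE_TO_BOOT: maintenance required"
        return "OK"


log = FaultLog()
for _ in range(23):          # the real unit faulted 23 times in 2014
    log.record_fault()

limiter = RudderLimiter(log)
print(limiter.power_on_self_test(on_ground=True))   # refuses to boot on the ground
print(limiter.power_on_self_test(on_ground=False))  # still runs in the air
```

The key design choice is that the lockout only triggers in the safe environment, so the fault history forces maintenance without ever taking the system away mid-flight.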
Why did the pilots mess with the circuit breaker? One report says the pilot had seen a maintenance worker cycle a circuit breaker to clear a fault. That’s fine on the ground but not in the air. Why would a pilot try this, especially since there are advisories telling pilots not to reset circuit breakers unless the system is flight critical? The rudder limiter is a safety feature, but not flight critical, so why not just leave it off?
People in general get overly comfortable with technology because it is everywhere. There are all kinds of jokes about non-technical relatives doing something crazy to a computer because the same action once fixed something else.
Unfortunately, this often means people don’t know what they don’t know. In this case, the pilots appeared not to know that cycling the breaker would disrupt other systems. Yes, it sounds strange, and I can’t explain why cycling one breaker would have that effect. If true, it points to a systemic problem that should be addressed. In our own work, we need to make sure that failures in one part of a system do not upset critical parts elsewhere.
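To make that isolation principle concrete, here’s a minimal Python sketch: a fault in a non-critical subsystem is caught and counted so the critical control loop keeps running. Every name and the simulated fault are hypothetical, just to illustrate the idea:

```python
# Minimal sketch of fault isolation: a non-critical subsystem's failure
# is contained so it cannot stop the critical loop. All names illustrative.

def noncritical_limiter_update(state):
    # Simulate an intermittent hardware fault in a non-critical subsystem.
    raise RuntimeError("intermittent solder-joint fault")


def critical_control_step(state):
    # The part that must never stop running.
    state["steps"] += 1


def control_loop(iterations):
    state = {"steps": 0, "faults": 0}
    for _ in range(iterations):
        try:
            noncritical_limiter_update(state)
        except Exception:
            state["faults"] += 1  # degrade gracefully; do not propagate
        critical_control_step(state)
    return state


print(control_loop(5))  # critical steps complete despite repeated faults
```

The point is the boundary: the non-critical code runs inside a containment wall, so its failure degrades the system instead of cascading into the parts that keep the plane, or your project, alive.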
The pilots weren’t trained to handle the flight upset because even Airbus, the aircraft manufacturer, did not expect the aircraft to ever experience such an extreme upset. I guess since Murphy isn’t French they don’t expect his effects to occur there. This assumption probably derived from the aircraft being fly-by-wire: the expectation was that the automation would not let the aircraft become upset to this degree. But the automatic flight systems had been disrupted by the cycling of the circuit breaker.
Failures in complex systems take a lot of effort to track down. In this situation we see how three separate actions led to the failure, with a fourth, the maintenance failure, contributing greatly. This points out that the total failure might have been avoided at multiple points:
- If the solder joints had not failed,
- If the pilots had not cycled the circuit breaker,
- If the pilots had restored the automatic flight computers,
- If the pilots had reacted properly after the upset.
Even as hackers we need to keep in mind when and how failures can occur. We’ve written articles on electronic door locks created by hackers. How do you get in if the power goes off, or a bad solder joint fails after a few hundred door openings and closings? Hopefully a key will override the electronics. Fortunately, most of the hacks we see are not critical, and their failures would not be life threatening. Let’s keep it that way.