Apollo 13's "Failure is not an option", and how non-engineers misinterpret it
Failure is not an option!
It might surprise you to learn that this quote isn't real - it feels legendary, but Gene Kranz never said it during the mission. It was written for the film.
Aerospace engineering isn't something most people get to experience firsthand, so it makes sense that a line "spiced up" for the movie would be broadly accepted as reality.
When you create a program (or release a new capability), it's tempting to get excited and ship it the moment you feel it's "done" - but that urge to skip rigorous testing is a symptom of how young IT/Computer Science is compared to other engineering disciplines.
In more traditional engineering disciplines, testing is a key aspect of design and deployment. Everything is tested for safety: concrete is thoroughly tested before it goes into bridges and structures, and most pickup trucks are tested to their listed tow capacity.
This isn't a perfect, ideal world, however. Bridges still fail, and in the case of tow ratings, manufacturers didn't follow the SAE J2807 standard until forced (Toyota: 2011, General Motors: 2015, Ford: 2015, Dodge: 2015).
Industry-wide changes take time
Here's why: it's expensive to re-tool in the physical world. NASA simply didn't have the option of iterating after launch, so it compensated by accounting for as many potential scenarios as possible ahead of time, at enormous cost. That's what "Failure is not an option" was intended to reflect: everything was tested and planned in advance, and the mission systems didn't try anything truly new.
Engineering is the practice of taking learned experiences and codifying them, ensuring that the same mistake doesn't happen twice. The safety codes and engineering artifacts we use in the physical world are "written in blood" - many structural engineering practices were learned through loss of life, which is why they're so important.
I don't think anybody has died because an email didn't get through, but even if the stakes are lower, the same practices are much easier to execute in IT and should therefore be followed. IT is a relatively young, engineering-adjacent discipline, and the standards for performance are still relatively low, albeit always increasing.
Here's a rough estimate of each engineering discipline's age:
- Chemical Engineering (~1800s AD)
- Civil Engineering (BC, formalized in the 1700s AD)
- Electrical Engineering (1700s AD, formalized in the 1800s AD)
- Mechanical Engineering (BC, formalized in the 1800s AD)
- Software Engineering (1960s AD)
More recent engineering disciplines fit in these families, and one could argue (correctly) that while they are younger, they benefit from the preceding disciplines and the broader body of knowledge. This is particularly true in the field of aerospace.
Systems Engineering practitioners have collected many of these practices in the SEBoK (Systems Engineering Body of Knowledge), which essentially forms a "starter kit" of practices and protocols for integrating new technologies and disciplines and developing new solutions. The SEBoK is an excellent (albeit overwhelming) place to find methods for continuous improvement, either as a team or individually.
Don't fear failure, understand it
Across all of these disciplines, we see a common pattern around failure: the natural reaction to failure is to avoid it. Humans don't want to be associated with failure, and a successful engineer has to override that reflex.
Rather than harping on past failures, I'd like to provide an example of good failure analysis - my concern is that controversy around a recent failure could get in the way of the idea I want to convey, even if sidestepping it deviates somewhat from the practice of failure analysis.
Washington State DOT's analysis of the Tacoma Narrows bridge failure is an example of well-executed failure analysis.
In this case, the deck was strong but far too flexible, and its solid plate girders caught the wind; aeroelastic flutter tore the span apart. "Common sense" would tell us that if a bridge is extremely strong, it won't have any issues standing up to high winds - strength turned out not to be the issue at all.
Applying failure analysis to IT
It's important that we learn from these shortcomings and integrate solutions into future designs. Typically, this is where "system integration" comes into play - as a product is validated for release, all known tests are applied to it to ensure that failures don't recur. The NASA engineers supporting Apollo 13 didn't try anything new on the mission system (Apollo 13 itself). NASA tested all solutions thoroughly with the ground crew, astronauts, and QA engineers before rollout was ever considered an option.
The Apollo program was extremely expensive compared to most of our IT budgets, but what we're testing is almost always software, which is cheap to exercise. Failure analysis is comparatively trivial with software debugging and mature unit testing, and eventually we're going to have to perform at the standards held by traditional engineering disciplines.
Example - a maintenance window backfired
We've all been here before - let's say that spanning tree did something unexpected during a maintenance window and caused unplanned downtime.
The first and most effective aspect of failure analysis (at least for our careers) is to provide a compelling narrative. We need to invert the reflexive human reaction to failure and encourage curiosity over punitive behavior. Writing a complete and compelling narrative both ensures that people react more positively to the incident and provides confidence that due diligence will be done to keep it from happening again.
Sure, STP will misbehave again in some way, but physical materials have common patterns and quirks too. We didn't stop using aluminum because it isn't as strong as steel or as good a conductor as copper; instead we learned its strengths and weaknesses and applied it judiciously. In this case, we need to prove that we'll apply spanning tree more judiciously as well.
Second, gather all possible data from the time of the outage. Don't try to filter it yet, and don't dawdle - logs and counters age out quickly. Anything that can record system state is valuable here (telemetry in particular), which is why automatic gathering pays off.
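To make "gather everything, filter later" concrete, here's a minimal sketch of what automated collection could look like - it just shells out to ssh and dumps raw command output per device into a timestamped folder. The hostnames, CLI commands, and directory layout are assumptions for illustration, not a prescription for your environment.

```python
"""Rough sketch: snapshot everything we can around an outage window.

Assumptions (hypothetical, adjust for your environment):
- the switches are reachable over SSH with key-based auth
- the device CLI supports the "show ..." commands listed below
"""
import subprocess
from datetime import datetime, timezone
from pathlib import Path

SWITCHES = ["dist-sw-01", "dist-sw-02", "access-sw-17"]  # hypothetical inventory
COMMANDS = [
    "show spanning-tree detail",        # STP state and topology change counters
    "show logging",                     # device logs around the event
    "show interfaces counters errors",  # physical-layer clues
]

def snapshot(outdir: Path = Path("outage-snapshots")) -> Path:
    """Collect raw command output from every device; filter later, not now."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = outdir / stamp
    target.mkdir(parents=True, exist_ok=True)
    for host in SWITCHES:
        for cmd in COMMANDS:
            result = subprocess.run(
                ["ssh", host, cmd],
                capture_output=True, text=True, timeout=60,
            )
            fname = f"{host}__{cmd.replace(' ', '_')}.txt"
            (target / fname).write_text(result.stdout + result.stderr)
    return target

if __name__ == "__main__":
    print(f"Snapshot written to {snapshot()}")
```

Run something like this the moment things look wrong (or on a schedule), and you'll have raw material to analyze instead of fading memories.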
Third, find ways to detect the precursors and the failure itself. This part should be automated and attached to your CI pipelines going forward; "set it and forget it" is the best approach. As this practice evolves, the suite of checks develops considerable mass, and manually executing every failure-analysis test after every change quickly becomes tedious and slow.
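As a rough illustration, here's a pytest-style sketch that reads fresh spanning-tree captures and fails the pipeline if the root bridge moved or topology-change counters spiked - two plausible precursors for the scenario above. The expected root MAC, the threshold, the capture directory, and the Cisco-style output format it parses are all assumptions for the example.

```python
"""Rough sketch: encode a failure's precursors as regression tests (pytest style).

Assumptions (all hypothetical): a CI job drops fresh text captures of
"show spanning-tree detail" into SNAPSHOT_DIR before this runs, the intended
root bridge MAC is known ahead of time, and the output is Cisco-style.
"""
import re
from pathlib import Path

SNAPSHOT_DIR = Path("stp-snapshots")  # wherever the collection step writes captures
EXPECTED_ROOT_MAC = "0011.2233.4455"  # hypothetical intended root bridge
MAX_TOPOLOGY_CHANGES = 5              # tolerance before a human should look

def parse_root_mac(text: str) -> str | None:
    """Pull the root bridge MAC out of 'show spanning-tree' style output."""
    match = re.search(r"Root ID\s+Priority\s+\d+\s+Address\s+([0-9a-f.]+)", text)
    return match.group(1) if match else None

def parse_topology_changes(text: str) -> int:
    """Sum the 'Number of topology changes' counters found in a capture."""
    return sum(int(n) for n in re.findall(r"Number of topology changes\s+(\d+)", text))

def test_root_bridge_has_not_moved():
    for capture in SNAPSHOT_DIR.glob("*.txt"):
        assert parse_root_mac(capture.read_text()) == EXPECTED_ROOT_MAC, capture.name

def test_topology_changes_below_threshold():
    for capture in SNAPSHOT_DIR.glob("*.txt"):
        assert parse_topology_changes(capture.read_text()) <= MAX_TOPOLOGY_CHANGES, capture.name
```

Wire a check like this into CI after every change window and it runs itself - nobody has to remember to go looking for the last outage's fingerprints.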
Why?
The pressure to follow this pattern is only going to grow in the future. The previous decade's reliability standards were hilariously low compared to the quality of technology and service today - just look at the standards people hold us to. Instead of fearing this trend, let's analyze it and find ways to improve. It'll give us a competitive edge in the future.
As with Apollo 13, our greatest failures drive our greatest successes.