These days, it's fairly common to use redundancy to build systems that are more resilient than the individual parts they are made from. In the software world, it's a common practice to build a system that runs on multiple hosts, and that can continue operating even if one or more of those hosts fail. Many of us have probably seen the formula that looks like this:
Where P is the failure probability between 0 and 1, and n is the number of redundant components, or more specifically the number of components a system could lose before a failure would occur. What's important is that the relationship is exponential, and we love exponents when they act in our favor. This means that small increases in redundancy will bring large reductions in failure probability. Or put another way, small increases in cost will bring disproportionally large increases in reliability. Looking at this mathematical model, it's easy to arrive at the conclusion that planes should have as many engines as possible. Especially if you are 14 years old. Unfortunately, the reality is far more nuanced.
Where P is the failure probability between 0 and 1, and n is the number of redundant components, or more specifically the number of components a system could lose before a failure would occur. What's important is that the relationship is exponential, and we love exponents when they act in our favor. This means that small increases in redundancy will bring large reductions in failure probability. Or put another way, small increases in cost will bring disproportionally large increases in reliability. Looking at this mathematical model, it's easy to arrive at the conclusion that planes should have as many engines as possible. Especially if you are 14 years old. Unfortunately, the reality is far more nuanced.
The complication comes from the fact that there are different failure modes engineers working on complex systems need to account for. Two of the most important failure modes in redundant systems are correlated failures and cascading failures. Correlated failures are perhaps the easier of the two to understand. As most math textbooks tend to warn, probabilities compound only if the individual events are completely independent, or uncorrelated. In plain language, if there are failure modes that can impact multiple redundant components (for example, servers sharing a single power feed, or a single fuel pump feeding multiple engines), then we lose the benefit of exponentiation. And our mathematical model begins looking more like this:
For systems that have many correlated failure modes, increased redundancy (and cost) no longer increase reliability!
For systems that have many correlated failure modes, increased redundancy (and cost) no longer increase reliability!
The second type of failures that should concern engineers building redundant systems is cascading failures. In a cascading failure, a fault in a single redundant component causes other components to also fail. A concrete example of a cascading failure could be an explosion in one engine damaging other critical components of the plane (Qantas Flight 32 was a recent example of that, where luckily the crew still managed to safely land the badly damaged plane). Or a replicated database system, where issues in the replication protocol could cause a failure in one host to propagate to the others. In mathematics, cascading failures can be modeled by complement probability. In other words, our mathematical model looks more like this:
Conclusion
It goes without saying that this model is just a toy representation of the real world. All models are wrong, some are useful. But what does this mean for building resilient systems through redundancy? For starters, it means that simply adding redundancy into the system does not immediately make it more resilient. Unless the engineers put concerted effort into turning as many correlated and cascading failure modes into uncorrelated ones, the benefit of redundancy on resilience can be greatly diminished, or worse yet, turned negative. And the planes with four engines are not necessarily safer than the ones with two.
Notable mention
It's hard to have a conversation about redundancy and resilience without talking about MTBF and MTTR (mean time between failures and mean time to repair). The simplistic mathematical model above only holds if the operators replace failing components much faster than subsequent components fail. The math behind that can itself be equally fascinating, especially when dealing with systems where state is involved.