Wednesday, January 24, 2024

The mathematics of redundancy

Boeing 747
I was at an airport during a recent business trip, watching planes take off and land when I saw the iconic Boeing 747 take off. This reminded me how growing up, 747 was a favorite. There is just something special about a double-decker jumbo jet capable of flying long range routes. But the one factor that endeared me to 747 above all else were its four engines. Perhaps as a foreshadowing of my future interests, it just felt inherently more reliable to fly on a plane with that level of redundancy. After all, even if two of its engines were to fail, the rest would make it possible to safely land the plane.

These days, it's fairly common to use redundancy to build systems that are more resilient than the individual parts they are made from. In the software world, it's a common practice to build a system that runs on multiple hosts, and that can continue operating even if one or more of those hosts fail. Many of us have probably seen the formula that looks like this: 

Where P is the failure probability between 0 and 1, and n is the number of redundant components, or more specifically the number of components a system could lose before a failure would occur. What's important is that the relationship is exponential, and we love exponents when they act in our favor. This means that small increases in redundancy will bring large reductions in failure probability. Or put another way, small increases in cost will bring disproportionally large increases in reliability. Looking at this mathematical model, it's easy to arrive at the conclusion that planes should have as many engines as possible. Especially if you are 14 years old. Unfortunately, the reality is far more nuanced.

The complication comes from the fact that there are different failure modes engineers working on complex systems need to account for. Two of the most important failure modes in redundant systems are correlated failures and cascading failures. Correlated failures are perhaps the easier of the two to understand. As most math textbooks tend to warn, probabilities compound only if the individual events are completely independent, or uncorrelated. In plain language, if there are failure modes that can impact multiple redundant components (for example, servers sharing a single power feed, or a single fuel pump feeding multiple engines), then we lose the benefit of exponentiation. And our mathematical model begins looking more like this:


For systems that have many correlated failure modes, increased redundancy (and cost) no longer increase reliability! 

The second type of failures that should concern engineers building redundant systems is cascading failures. In a cascading failure, a fault in a single redundant component causes other components to also fail. A concrete example of a cascading failure could be an explosion in one engine damaging other critical components of the plane (Qantas Flight 32 was a recent example of that, where luckily the crew still managed to safely land the badly damaged plane). Or a replicated database system, where issues in the replication protocol could cause a failure in one host to propagate to the others. In mathematics, cascading failures can be modeled by complement probability. In other words, our mathematical model looks more like this:


The important observation about cascading failures is that the exponent no longer works in our in our favor! In systems where most failures cascade, adding more redundancy is actually counter-productive. This is an easy intuition to test - if most engine failures risked bringing the whole plane down, then we would want as few engines on our planes as possible.

Conclusion

It goes without saying that this model is just a toy representation of the real world. All models are wrong, some are useful. But what does this mean for building resilient systems through redundancy? For starters, it means that simply adding redundancy into the system does not immediately make it more resilient. Unless the engineers put concerted effort into turning as many correlated and cascading failure modes into uncorrelated ones, the benefit of redundancy on resilience can be greatly diminished, or worse yet, turned negative. And the planes with four engines are not necessarily safer than the ones with two. 

Notable mention

It's hard to have a conversation about redundancy and resilience without talking about MTBF and MTTR (mean time between failures and mean time to repair). The simplistic mathematical model above only holds if the operators replace failing components much faster than subsequent components fail. The math behind that can itself be equally fascinating, especially when dealing with systems where state is involved.