Wednesday, January 24, 2024

The mathematics of redundancy

Boeing 747
I was at an airport during a recent business trip, watching planes take off and land, when I saw the iconic Boeing 747 take off. It reminded me how, growing up, the 747 was a favorite of mine. There is just something special about a double-decker jumbo jet capable of flying long-range routes. But the one factor that endeared the 747 to me above all else was its four engines. Perhaps as a foreshadowing of my future interests, it just felt inherently more reliable to fly on a plane with that level of redundancy. After all, even if two of its engines were to fail, the rest would make it possible to safely land the plane.

These days, it's fairly common to use redundancy to build systems that are more resilient than the individual parts they are made from. In the software world, it's common practice to build a system that runs on multiple hosts and can continue operating even if one or more of those hosts fail. Many of us have probably seen the formula that looks like this:

P_system = P^n

Where P is the failure probability of an individual component (between 0 and 1), and n is the number of redundant components, or more specifically the number of components the system could lose before a failure would occur. What's important is that the relationship is exponential, and we love exponents when they act in our favor. This means that small increases in redundancy bring large reductions in failure probability. Or put another way, small increases in cost bring disproportionately large increases in reliability. Looking at this mathematical model, it's easy to arrive at the conclusion that planes should have as many engines as possible. Especially if you are 14 years old. Unfortunately, the reality is far more nuanced.
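
To make the exponent concrete, here is a minimal Rust sketch (the 1% per-component failure probability is purely illustrative):

/// Failure probability of a system that only fails once all `n` redundant
/// components have failed, assuming each component fails independently
/// with probability `p`.
fn system_failure(p: f64, n: u32) -> f64 {
    p.powi(n as i32)
}

fn main() {
    let p = 0.01; // illustrative per-component failure probability
    for n in 1..=4 {
        // n = 1: 1e-2, n = 2: 1e-4, n = 3: 1e-6, n = 4: 1e-8
        println!("n = {}: {:.0e}", n, system_failure(p, n));
    }
}

Each additional component buys roughly a hundredfold reduction in failure probability for a roughly linear increase in cost.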

The complication comes from the fact that there are different failure modes that engineers working on complex systems need to account for. Two of the most important failure modes in redundant systems are correlated failures and cascading failures. Correlated failures are perhaps the easier of the two to understand. As most math textbooks tend to warn, probabilities compound only if the individual events are completely independent, or uncorrelated. In plain language, if there are failure modes that can impact multiple redundant components (for example, servers sharing a single power feed, or a single fuel pump feeding multiple engines), then we lose the benefit of exponentiation. And our mathematical model begins looking more like this:

P_system ≈ P

For systems that have many correlated failure modes, increased redundancy (and cost) no longer increases reliability!

The second type of failure that should concern engineers building redundant systems is the cascading failure. In a cascading failure, a fault in a single redundant component causes other components to also fail. A concrete example of a cascading failure could be an explosion in one engine damaging other critical components of the plane (Qantas Flight 32 was an example of that, where luckily the crew still managed to safely land the badly damaged plane). Or a replicated database system, where issues in the replication protocol could cause a failure in one host to propagate to the others. In mathematics, cascading failures can be modeled with the complement probability. In other words, our mathematical model looks more like this:

P_system = 1 - (1 - P)^n

The important observation about cascading failures is that the exponent no longer works in our favor! In systems where most failures cascade, adding more redundancy is actually counterproductive. This is an easy intuition to test - if most engine failures risked bringing the whole plane down, then we would want as few engines on our planes as possible.
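
Extending the earlier sketch, the complement flips the exponent against us; again, the 1% per-component figure is purely illustrative:

/// Fully cascading failures: any single component failure takes the whole
/// system down, so adding components only adds ways to fail.
fn cascading_failure(p: f64, n: u32) -> f64 {
    1.0 - (1.0 - p).powi(n as i32)
}

fn main() {
    let p = 0.01; // illustrative per-component failure probability
    for n in [1, 2, 4, 8] {
        // n = 1: ~1.0e-2, n = 2: ~2.0e-2, n = 4: ~3.9e-2, n = 8: ~7.7e-2
        println!("n = {}: {:.1e}", n, cascading_failure(p, n));
    }
}

Doubling the number of components roughly doubles the chance that one of them takes the whole system down with it.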

Conclusion

It goes without saying that this model is just a toy representation of the real world. All models are wrong, some are useful. But what does this mean for building resilient systems through redundancy? For starters, it means that simply adding redundancy to a system does not immediately make it more resilient. Unless the engineers put concerted effort into turning as many correlated and cascading failure modes as possible into uncorrelated ones, the benefit of redundancy on resilience can be greatly diminished, or worse yet, turned negative. And planes with four engines are not necessarily safer than the ones with two.

Notable mention

It's hard to have a conversation about redundancy and resilience without talking about MTBF and MTTR (mean time between failures and mean time to repair). The simplistic mathematical model above only holds if the operators replace failing components much faster than subsequent components fail. The math behind that can itself be equally fascinating, especially when dealing with systems where state is involved.
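
As a point of reference, the standard steady-state availability expression captures that race between failure and repair; redundancy only keeps paying off while MTTR stays much smaller than MTBF:

Availability = MTBF / (MTBF + MTTR)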

Saturday, February 4, 2023

Batching: Efficiency under load

In this post, I wanted to make a quick observation about batching as an underused technique in distributed systems. Systems folks have long used batching as an effective way to increase the throughput of a system. It works by amortizing the overhead of an expensive action (such as device IO or a syscall) over multiple operations. Yet, outside of offline and other non-realtime systems, batching is rarely used within distributed services.

The common reasoning is that while batching can increase throughput, it also increases latency. But that doesn't need to be true. A great example of having our cake and eating it too can be found in software-defined networking systems. Throughput and latency are critically important for such services, so they are written in systems languages like C, C++, or Rust, rely on lock-free data structures, and tend to use a variety of kernel-bypass techniques to avoid the overhead of syscalls.

A common way such services are implemented is by polling the network device driver queue in an infinite loop (usually with the aid of a framework like DPDK) and then performing their business function on individual packets as they arrive. Most of the time the system is lightly loaded, and the queue depth never exceeds one. However, if the load on the system increases, the arrival rate of new packets can temporarily exceed the rate at which worker threads can handle them. When that happens, the bounded queue builds up a short backlog of packets waiting to be processed. This dynamic isn't that different from how web-servers that make up most of our distributed systems operate.

This is where networking services tend to take advantage of batching. Instead of dequeuing individual packets, the polling thread will try to dequeue multiple packets and submit them for processing as a single batch. Individual stages of the application are then written in a way that takes advantage of such batching, amortizing expensive lookups and operations over multiple packets. Most of the time, the system is lightly loaded, and each batch contains a single packet. However, if the system begins building up a backlog, the batch sizes will increase, improving both latency and efficiency.
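
As a rough illustration of that opportunistic dequeue (not tied to any particular framework; the MAX_BATCH constant and packet type are made up, and a std::sync::mpsc channel stands in for the device queue that something like DPDK's rte_eth_rx_burst would drain):

use std::sync::mpsc::Receiver;

const MAX_BATCH: usize = 32; // illustrative cap on batch size

/// Dequeues up to MAX_BATCH packets in one call. Under light load this
/// returns a single packet, adding no latency; under backlog it returns a
/// larger batch, amortizing per-batch overhead across many packets.
fn next_batch(queue: &Receiver<Vec<u8>>) -> Vec<Vec<u8>> {
    let mut batch = Vec::with_capacity(MAX_BATCH);

    // Wait for the first packet so the loop doesn't spin while idle.
    if let Ok(packet) = queue.recv() {
        batch.push(packet);
    }

    // Opportunistically grab whatever else is already waiting.
    while batch.len() < MAX_BATCH {
        match queue.try_recv() {
            Ok(packet) => batch.push(packet),
            Err(_) => break,
        }
    }

    batch
}

The downstream stages can then amortize their expensive lookups over the whole batch instead of paying the overhead once per packet.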

This creates a dynamic where the system becomes more efficient as the load increases, and that's such a desirable property! I believe that the same type of opportunistic batching could benefit our distributed systems as well, by improving peak efficiency without sacrificing latency.

Wednesday, December 21, 2022

Performance and efficiency

The topic of software performance and efficiency has been making the rounds this month, especially around engineers not being able to influence their leadership to invest in performance. For many engineers, performance work tends to be some of the most fun and satisfying engineering projects. If you are like me, you love seeing some metric or graph show a step-function improvement - my last code commit this year was one such effort, and it felt great seeing the results:


A time series graph showing a step function increase around November of 2022.


However, judging by the replies to John Carmack's post, many feel they are unable to influence their leadership to give such projects appropriate priority. Upwards influence is a complex subject, and cultural norms at individual companies will have a big impact on how (or if) engineers get to influence their roadmaps. That said, one tool that has been useful for me in prioritizing performance work is tying it to specific business outcomes. And when it comes to performance and efficiency work, there are multiple distinct ways it can impact the business, and I think it's worth being explicit about those.

Performance is a feature

We frequently see this in infrastructure services like virtualized compute, storage, and networking. Databases and various storage and network appliances are another example. The performance of these products is frequently one of the dimensions customers evaluate, and that tends to shape product roadmaps. This is also where we see benchmarks that fuel the performance arms race, and never-ending debates about how representative those benchmarks are of the true customer experience. In my experience, teams working in these domains have the least trouble prioritizing performance-related work.

Better performance drives down costs

We tend to see this a lot in distributed systems. By improving the performance of a distributed system's main unit-of-work code path, we can increase the throughput of any single host. That in turn decreases the count (or size) of hosts needed to handle the aggregate work the distributed system faces, driving down costs. One obvious caveat is that many distributed systems have a minimum footprint needed to meet their redundancy and availability goals. A lightly loaded distributed system that operates within that minimum footprint is unlikely to see any cost savings from performance optimizations, at least not without additional and often non-trivial investments that would allow the service to "scale to zero."
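
As a back-of-the-envelope sketch (all numbers here are made up), both the cost leverage and the minimum-footprint caveat fall out of a simple fleet-sizing calculation:

/// Hosts needed to serve `aggregate_rps` at `per_host_rps`, never dropping
/// below the minimum footprint required for redundancy and availability.
fn hosts_needed(aggregate_rps: f64, per_host_rps: f64, min_footprint: u32) -> u32 {
    let required = (aggregate_rps / per_host_rps).ceil() as u32;
    required.max(min_footprint)
}

fn main() {
    // A 20% per-host throughput win shrinks a busy fleet...
    println!("{}", hosts_needed(100_000.0, 1_000.0, 6)); // 100 hosts
    println!("{}", hosts_needed(100_000.0, 1_200.0, 6)); // 84 hosts
    // ...but does nothing for a fleet already at its minimum footprint.
    println!("{}", hosts_needed(5_000.0, 1_200.0, 6)); // 6 hosts either way
}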

Better performance improves customer experience

This category of performance projects improves the customer experience, typically by improving the latency or quality of some process. Perhaps the best known example is website load times, but any other process where humans (or machines!) wait for something to happen is a good candidate. Yet another example from my youth is video games - the same hardware would run some games much smoother and at higher resolution than others. This was typically the case for any game John Carmack had a hand in!

The biggest challenge in advocating for this category of performance improvement projects is quantifying the impact. Over the past decades, multiple studies have demonstrated a strong correlation between fast page load times and positive business outcomes. However, gaining that conviction required a number of time- and effort-consuming experiments, an investment that may be impractical for most teams. And even then, at some point the returns likely diminish. Will a video game sell more copies if it runs at 90 fps rather than 60?

Besides time consuming A/B studies, one of the best ways to advocate for these kinds of performance improvement projects, in my experience, is to deeply understand how the customers use your product. Then you can bring concrete customer anecdotes to make the case.

Better performance reduces availability risks

These types of performance projects tend to come up anytime we have performance modes in our system, that is, behavior where some units of work are less performant than others. An example could be an O(N²) algorithm that sometimes faces large Ns. Such inefficiencies create a situation where a shift in the composition of the various modes can tip the system over the edge, creating a DoS vector. Throttling and load shedding the more expensive units of work is one common approach; however, improving their performance to be on par with the rest can be even more effective.

One way to identify and drive down these kinds of performance modality issues is by plotting tail and median latencies on the same graph, or perhaps even using fancier graph types like histogram heatmaps. Seeing multiple distinct modes in the graph is a telltale sign that your system has such an intrinsic risk.
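
As a rough sketch (the latency samples are made up), even a crude percentile calculation makes that kind of bimodality jump out:

/// Returns the given percentile (0.0..=100.0) from a non-empty set of
/// latency samples, using nearest-rank on the sorted data.
fn percentile(samples: &mut [f64], pct: f64) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let idx = ((pct / 100.0) * (samples.len() - 1) as f64).round() as usize;
    samples[idx]
}

fn main() {
    // Mostly a fast path around 5 ms, plus a rare, much slower mode.
    let mut latencies = vec![4.0, 5.0, 5.0, 6.0, 5.0, 4.0, 250.0, 5.0, 6.0, 240.0];
    let p50 = percentile(&mut latencies, 50.0);
    let p99 = percentile(&mut latencies, 99.0);
    // A p99 sitting orders of magnitude above p50 hints at a second,
    // much more expensive mode hiding in the workload.
    println!("p50 = {p50} ms, p99 = {p99} ms");
}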

Performance has security impact

Security-sensitive domains are unique in that the lack of performance variability matters more than raw performance. This is because any variability can expose potential side channels to an attacker. For example, if an encryption API takes different lengths of time depending on the content being encrypted, then a determined adversary can use that timing information as a side channel to infer information about the plaintext being encrypted. Luckily, very few teams are writing code with that kind of blast radius. The amount of expertise and consideration that goes into getting such code right is truly mind-boggling!
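
To make the timing concern concrete, here is a minimal sketch of the classic mitigation applied to a related problem, comparing two secret byte strings: the loop always touches every byte, so its running time does not depend on where the first mismatch occurs. This is only an illustration; production code should rely on a vetted implementation (such as the subtle crate in Rust), because compilers can optimize away naive attempts at constant-time behavior.

/// Constant-time equality check: unlike an early-return comparison, this
/// does not leak how many leading bytes of the two inputs match.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

fn main() {
    assert!(constant_time_eq(b"secret-tag", b"secret-tag"));
    assert!(!constant_time_eq(b"secret-tag", b"secret-taX"));
}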

Conclusion

Like all classifications, this one is imperfect. For starters, a project can fall into multiple categories. For example, improving P50 request latency on a web service is likely to both improve the customer experience and reduce service costs. But like they say, "All models are wrong, some are useful", and hopefully this model will be useful in helping folks tie performance work to desirable business outcomes.

Sunday, April 10, 2022

Feature Flags: Are they considered harmful?

In a recent operational review, a colleague brought up how past feedback from me had influenced his team's avoidance of feature flags. If we have ever worked together, then you too have probably heard me grumble about the downsides of feature flags. So should feature flags be avoided at all costs? My opinions on this topic are nuanced, so I thought it would be worth writing them down. But first, what do we even have in mind when we talk about feature flags? When I think about feature flags, I tend to think about two distinct use cases.

Coordinated enablement

When developing features within a complex codebase, it's common that the entire feature cannot be shipped in a single commit. Without using feature flags, one approach is to use a dedicated branch for feature development, where engineers working on a feature can accumulate incremental commits. Then, once the entire feature is code complete and tested, it can be merged into the mainline branch and deployed to production. This works well, but requires continuous merging of the feature and mainline branches to avoid a painful merge at the end. I don't know about others, but merging code is one of my least favorite software development activities.

Feature flags present another option. As the team is working on a new feature, code paths related to that feature can be placed behind a gate. For example, consider a service that vends answers to commonly asked questions. We may have an idea for a cool new feature, where the service auto-detects the language in which the question was asked and vends the answer in the same language. We could use a feature flag to gate the feature like this:
/// Returns a thoughtful answer to the question provided in 'question'
/// argument.
fn answer_question(question: String) -> String {
    let language;
    if feature_enabled(ENABLE_AUTO_DETECT_LANGUAGE) {
        language = detect_language(&question);
    } else {
        language = ENGLISH;
    }

    lookup_answer(question, language)
}

The feature flag can remain turned off while the rest of the implementation is being worked on. Perhaps we still need to populate the database with non-English answers, or perhaps the detect_language() function is not yet working well enough for us to feel comfortable releasing the feature to customers. Gating new code paths behind a feature flag allows us to incrementally commit and deploy new code without accumulating a big and risky change set. We can even enable the feature flag in our staging environment, allowing engineers to test the new feature. Then, once all the parts of the feature are ready and well tested, we can flip the feature flag from false to true, enabling it for customers.

Incremental enablement

Another use case for feature flags is fine-grained control over how a feature gets rolled out to customers. Building on the example from the previous section, we may think that language auto-detection will be a huge success and customers will love it. But what if something goes wrong? Even the most rigorous testing will not cover all the possible ways customers can ask their questions. Language detection could fail in unexpected ways, or worse yet, regress the service for the millions of customers happily asking their questions in English today. One way to increase our confidence is to make the feature gate customer-aware by passing account_id into the feature_enabled() function:
/// Returns a thoughtful answer to the question provided in 'question'
/// argument. May answer in your own language if that feature is
/// enabled on your account.
fn answer_question(question: String, account_id: String) -> String {
    let language;
    if feature_enabled(ENABLE_AUTO_DETECT_LANGUAGE, &account_id) {
        language = detect_language(&question);
    } else {
        language = ENGLISH;
    }

    lookup_answer(question, language)
}

Then, assuming the feature_enabled() function can keep track of enablement status by account, we can incrementally ramp up traffic to the new code paths. Perhaps initially we will turn it on for our own developer accounts, then 1% of customers, then 5%, and so on. At each step, we will watch for errors and customer feedback, only dialing the feature up if we receive a positive signal. And if something goes wrong, we can quickly ramp things down to 0%, effectively rolling back the change. Delivering major features this way can help reduce the blast radius. If things go wrong, we will have only impacted a small percentage of customers!
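
As a sketch of what such an implementation might look like (the FlagConfig type, its fields, and the hashing scheme are all made up for illustration), feature_enabled() could deterministically bucket each account and compare the bucket against a dial percentage, with a deny list layered on top:

use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical per-flag configuration: a dial percentage plus an explicit
/// deny list for accounts where the feature is causing trouble.
struct FlagConfig {
    dial_percent: u64, // 0..=100
    denied_accounts: HashSet<String>,
}

/// Deterministically maps an account to a bucket in 0..100, so the same
/// account stays on the same side of the dial as it ramps up. (A real
/// implementation would want a hash that is stable across processes and
/// releases, which DefaultHasher does not guarantee.)
fn bucket_for(account_id: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    account_id.hash(&mut hasher);
    hasher.finish() % 100
}

fn feature_enabled(config: &FlagConfig, account_id: &str) -> bool {
    if config.denied_accounts.contains(account_id) {
        return false;
    }
    bucket_for(account_id) < config.dial_percent
}

The denied_accounts set also gives us the per-customer escape hatch described below.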

This approach has secondary benefits too. What if the feature causes problems for one of our customers? Perhaps they submit English questions that have French words in them. The system works for them today, because they get their answers in English. But with our auto-detection feature, they begin getting French answers.
Cela peut ĂȘtre frustrant si vous ne parlez pas français!

Because we are a customer-obsessed service, we would want to set their experience right as soon as possible. Having per-customer control over the feature allows us to temporarily deny-list their account while we work on a proper fix, which may take some time. Without this capability, we would need to roll back the new feature for everybody!

Ok, so what's the problem?

Despite the opening paragraph, everything I've said so far makes feature flags sound useful. So why all the pessimism? The challenge with feature flags comes in two forms: permutations and delivery mechanisms. Here's what I mean by permutations. In addition to auto-detecting the language, our popular question answering service is likely working on other features. Perhaps we want to submit a Mechanical Turk request whenever we see a question that doesn't match an answer in our database. Using a feature flag, our example could look like this:

/// Returns a thoughtful answer to the question provided in 'question'
/// argument. May answer in your own language if that feature is
/// enabled on your account. If we don't have an answer to the question,
/// this function may submit a Mechanical Turk request to populate an
/// answer into its internal database.
fn answer_question(question: String, account_id: String) -> Option<String> {
    let language;
    if feature_enabled(ENABLE_AUTO_DETECT_LANGUAGE, &account_id) {
        language = detect_language(&question);
    } else {
        language = ENGLISH;
    }

    let answer = lookup_answer(&question, language);
    if answer.is_none() {
        if feature_enabled(ENABLE_MECHANICAL_TURK, &account_id) {
            submit_mturk_request(&question, language);
        }
    }

    answer
}
By adding another feature flag, we doubled the number of distinct code paths we need to maintain and test! We now need to worry about our service working correctly with all four possible permutations of language detection and Mechanical Turk enablement. And this complexity grows exponentially with the number of feature flags - with just five feature flags we have 32 possible permutations! And while it may seem improbable that enabling language auto-detection in our toy example would have any impact on the Mechanical Turk integration, real systems tend to be much more complex and harder to reason about. A small behavior change in one part of the code base can have unintended consequences in another.

Building automation that tests all the possible permutations can help, but even then we have to accept that systems often behave differently under real production workloads. Faced with an operational issue, we will need to reason about the interplay of the various feature flags. And then we still need to contend with the cognitive overhead of all the if feature_enabled(..) { } blocks in our codebase. For these reasons, it's important to have a closed-loop process that bounds the number of feature flags active at any point in time, continuously cleaning up older feature flags. But just like cleaning your room, it's an action where the costs are real and the benefits are abstract, and therefore a hard behavior to keep up!

The second pitfall is the mechanism by which the service learns about changes to its feature flags. It's common to use a dynamic datastore, such as a database, to store and propagate feature flag changes. But that always seemed backwards to me! A change in the feature flag is effectively a change in the behavior of the service - new code paths are getting activated for the first time. And every service already has a mechanism by which new code gets delivered: its deployment pipeline. This pipeline is where teams tend to apply all of their deployment safety learnings, such as the ones described in this Amazon Builders' Library article. So why invent a secondary deployment mechanism, one that will need to reimplement all the safety guardrails of the service's deployment pipeline?

One argument I've heard is that having a dedicated "fast path" for feature flag toggles makes it easy to quickly rollout or rollback new features. But even that seems off to me. Most services deploy new code more frequently than they toggle feature flags, so if deployments are slow and painful, wouldn't it be beneficial to spend the energy fixing that, rather than inventing a deployment side channel? One that still has to support all the safety properties of the service's deployment pipeline, because feature flags are effectively new code path deployments.

Conclusion

To summarize, I think feature flags can be a powerful mechanism by which a service can deliver new capabilities. They can be especially effective in helping us control the blast radius in case of unforeseen problems with new features. But feature flags also come with operational and cognitive costs, so it's important to have a mechanism that bounds the number of feature flags within the code base at any point in time.

It's also important to acknowledge that feature flag toggles are no different than any other code change. Just like code or configuration changes, they change the behavior of your service. And if your service already has a deployment pipeline by which it safely delivers new code to production, consider using the same mechanism for deploying feature flag toggles.

Thursday, June 18, 2020

Hello

Well, hello there. This is a sample post to see how things look.