Joe Magerramov's blog
Sunday, November 9, 2025
Switching from Synchronous to Asynchronous Mode of Coding
Sunday, October 19, 2025
The New Calculus of AI-based Coding
Driving at 200mph
Here's where it gets interesting. A typical software team, even an experienced one, doesn't get things right all the time. Even with good testing and engineering practices, bugs occasionally make it through. We've all heard the phrase "testing in production." That reality is the main reason I've always believed that focusing on testing alone is not enough, and that investing in reducing blast radius and time to recovery is just as important.
AI-assisted code is no different: it may contain bugs even when thoroughly reviewed by a human, and I suspect the probabilities are not significantly different. However, when teams ship commits at 10x the rate, the overall math changes. What used to be a production-impacting bug once or twice a year can become a weekly occurrence. Even if most bugs get caught in integration or testing environments, they will still impact the shared code base, requiring investigation and slowing the rest of the team down. Once again, this is not just hyperbole - our team sees signs that these are the challenges that pop up with a step-function increase in throughput.
I am increasingly convinced that for agentic development to increase engineering velocity by an order of magnitude, we need to decrease the probability of problematic commits by an order of magnitude too - and likely by even more than that, since at high velocities individual commits begin interacting with each other in unexpected ways.
In other words, driving at 200mph, you need a lot of downforce to keep the car on the track!
The Cost-Benefit Rebalance
One of the best ways to reduce the chance of bugs is to improve testing. I'm an airplane geek, and I have always admired the testing ideas used by airplane manufacturers: from early simulations, to component testing, to wind tunnel testing, to testing to the breaking point, and ultimately test flights of fully assembled aircraft. Even flight simulators play a role in improving the overall safety of the industry. Some of these ideas have been tried in the software industry, but they are far from ubiquitous.
As an example, I've always liked "wind tunnel" style tests that exercise the fully assembled system in a controlled environment. To achieve that, one pattern I've used is implementing high-fidelity "fake" versions of external dependencies that can be run locally. If you do that, you can then write build-time tests that run locally and verify the end-to-end behavior of the whole system. You can even inject unexpected behaviors and failures into the fake dependencies to test how the system handles them. Such tests are easy to write and execute because they run locally, and they are great at catching those sneaky bugs in the seams between components.
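To make the pattern concrete, here's a minimal sketch (the FakeObjectStore, UploadService, and their put/get/upload interfaces are invented for illustration, not taken from any real codebase): an in-memory fake that supports failure injection, plus a build-time test that exercises a real retry path against it.

```python
class FakeObjectStore:
    """In-memory stand-in for a real object store; used only in local tests."""

    def __init__(self):
        self._objects = {}
        self._fail_next_put = False

    def inject_put_failure(self):
        # Lets a test simulate a transient dependency failure.
        self._fail_next_put = True

    def put(self, key, value):
        if self._fail_next_put:
            self._fail_next_put = False
            raise ConnectionError("injected failure: object store unavailable")
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]


class UploadService:
    """Toy stand-in for the service under test (illustration only)."""

    def __init__(self, store, max_attempts=2):
        self._store = store
        self._max_attempts = max_attempts

    def upload(self, key, data):
        for attempt in range(self._max_attempts):
            try:
                self._store.put(key, data)
                return
            except ConnectionError:
                if attempt == self._max_attempts - 1:
                    raise


def test_upload_survives_store_blip():
    # "Wind tunnel" style test: assemble the service against the fake
    # dependency, inject a failure, and verify end-to-end behavior locally.
    store = FakeObjectStore()
    service = UploadService(store=store)
    store.inject_put_failure()
    service.upload("report.csv", b"col1,col2")  # first attempt fails, retry succeeds
    assert store.get("report.csv") == b"col1,col2"


if __name__ == "__main__":
    test_upload_survives_store_blip()
    print("ok")
```

The interesting part is that the test runs entirely on a developer's machine, yet still exercises the seam between the service's retry logic and its dependency.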
Unfortunately, faking all the external dependencies isn't always easy for a service of even moderate complexity. And even if you do, you now own keeping the fakes up to date with the real dependencies as they evolve. For those reasons, in my experience most teams don't write such tests.
I think we are seeing early signs that agentic coding can change the calculus here. AI agents are great at spitting out large volumes of code, especially when the desired behavior is well known and there's little ambiguity. Ideas that were sound in principle but too expensive to implement and maintain have just had their costs drop by an order of magnitude. I really love riding such shifts in the industry, because they open the doors to new approaches that weren't practical in the past.
Our project (with the help of an AI agent) maintains fake implementations of external dependencies like authentication, storage, chain replication, and the inference engine, to be used in tests. We then wrote a test harness that uses those fakes to spin up our entire distributed system, including all the microservices, on developers' machines. Build-time tests then spin up our canaries against that fully assembled stack, verifying that the system as a whole works.
I'm really bullish on this approach catching a category of bugs that, in the past, could only be caught once the change was committed and made it to the test environment. A few years ago, ideas like these would have been met with resistance as nice-to-have but too expensive. This time around, it took just a few days to implement for a relatively complex system.
Driving Fast Requires a Tighter Feedback Loop
Agentic coding changes the deployment dynamic. In the amount of time it takes to build, package, and test one set of commits, another dozen might be waiting to go out. By the time a change set is ready to deploy to production, it may contain 100 or more commits. And if one of those commits contains a problem, the deployment needs to be rolled back, grinding the pipeline to a halt. In the meantime, even more changes accumulate, adding to the chaos and the risk.
I'm a Formula 1 fan, and this reminds me of how an accident on the track can cause a yellow flag to be raised. Normally, the cars zoom around the track at immense speeds and accelerations. But if an accident occurs, the race marshals raise a yellow flag, which requires all the cars to slow down behind the pace car. An exciting race turns into a leisurely drive around the track until the debris is cleaned up and the track is safe again. To minimize such slowdowns, race organizers go to great lengths to prepare for all types of accidents and make sure they can clean up the track and restart the race in minutes.
Just like whole-system local tests help tighten the feedback loop for catching certain bugs, we may need to think similarly about how we implement our CICD pipelines. When teams are moving at the speed of dozens of commits per hour, problematic changes will need to be identified, isolated, and reverted in minutes instead of hours or days. That means that a typical build and test infrastructure will need to become an order of magnitude faster than it is today. Just like online video games become unplayable when there is high lag between a player's inputs and the game's reaction, it's really hard to move 10x faster if every commit still requires a lengthy delay before you see the feedback.
The communication bottleneck
I enjoy observing well-run operations. If you've ever peeked behind the curtain of a busy restaurant, at first sight it may look like chaos. But if you take a second to notice the details, you'll see that everyone is constantly coordinating with each other. Chefs, cooks, wait staff, bussers, and managers pass information back and forth in a continuous stream. By staying in constant sync, a well-run restaurant manages to serve its patrons even during peak times, without sacrificing quality or latency.

I believe that achieving a similar increase in velocity for a software team puts new constraints on how teams communicate. When your throughput increases by an order of magnitude, you're not just writing more code - you're making more decisions. Should we use this caching strategy or that one? How should we handle this edge case? What's the right abstraction here? At normal velocity, a team might make one or two of these decisions per week. At 10x velocity, they are making multiple each day.
The challenge is that many of these decisions impact what others are working on. Engineer A decides to refactor the authentication flow, which affects the API that Engineer B is about to extend. These aren't just implementation details - they're architectural choices that ripple through the codebase.
I find that traditional coordination mechanisms introduce too much latency here. Waiting for a Slack response or scheduling a quick sync for later in the day means either creating a bottleneck - the decision blocks progress - or risking going down the wrong path before realizing the conflict. At high throughput, the cost of coordination can dominate!
One approach is to eliminate coordination - if everybody works on independent components, they are unlikely to need to coordinate. But I find that ideal impractical in most real-world systems. So another alternative is to significantly decrease the cost of coordination. Our team sits on the same floor, and I think that's been critical to our velocity. When someone needs to make a decision that might impact others, they can walk over and hash it out in minutes in front of a whiteboard. We align on the approach, discuss trade-offs in real time, and both engineers get back to work. The decision gets made quickly, correctly, and without creating a pile-up of blocked work.
I recognize this doesn't solve the problem for distributed teams—that remains an open challenge.
The Path Forward
I'm really excited about the potential of agentic development. I think it can not only improve the efficiency of software development, but also let us tackle problems that were previously too niche or expensive to solve. The gains are real - our team's 10x throughput increase isn't theoretical, it's measurable.
But here's the critical part: these gains won't materialize if we simply bolt AI agents onto our existing development practices. Like adding a turbocharger to a car with narrow tires and old brakes, the result won't be faster lap times - it will be crashes. At 10x code velocity, our current approaches to testing, deployment, and team coordination become the limiting factors. The bottleneck just moves.
This means we need to fundamentally rethink how we approach building software. CICD pipelines designed for 10 commits per day will buckle under 100. Testing strategies that were "good enough" at normal velocity will let too many bugs through at high velocity. Communication patterns that worked fine before will create constant pile-ups of blocked work.
The good news is that we already have great ideas for comprehensive testing, rapid deployment, and efficient coordination - ideas that have shown promise but haven't seen wide adoption because they were too expensive to implement and maintain. What's changed is that agentic development itself can dramatically lower those costs. The same AI agents that are increasing our code throughput can also help us build the infrastructure needed to sustain that throughput.
This is the real opportunity: not just writing more code faster, but using AI to make previously impractical engineering practices practical. The teams that succeed with agentic development will be the ones who recognize that the entire software development lifecycle needs to evolve in concert.
Monday, June 16, 2025
The Nuanced Reality of Throttling: It's Not Just About Preventing Abuse
If you work with multi-tenant systems, you are probably familiar with the concept of throttling or admission control. The idea is pretty simple and is rooted in the common human desire for fairness: when using a shared system or resource, no customer should be able to consume "too much" of that resource and negatively impact other customers. What constitutes "too much" can vary a lot and will usually depend on technical, product, business, and even social factors.
Yet when most engineering teams think about throttling, the first thing that comes to mind is often protecting from bad actors who may intentionally or accidentally try to knock the system down. It's a clean, morally satisfying mental model. Bad actors perform unreasonable actions, so we put up guardrails to protect everyone else. Justice served, system protected, everyone goes home happy.
But here's the thing - protecting against bad actors is just a small fraction of the throttling story. The reality of throttling is far more nuanced, and frankly, more interesting than the "prevent abuse" story we often tell ourselves.
Two Sides of the Same Coin
I want to start with a distinction that influences how I think about throttling. When we implement an admission control mechanism like throttling, we're typically optimizing for one of two parties: the customer or the system operator. And these two scenarios are fundamentally different beasts.
Quotas and Limits for the Customer's Benefit
This is the "helpful" throttling. Think about a scenario where a developer accidentally writes a runaway script that starts making thousands of paid API calls per second to your service. Without throttling, they might wake up to a bill that they do not love. In this case, throttling is essentially a safety net - it prevents their own code from causing financial harm.
Similarly, consumption limits can be a mechanism to steer customers towards more efficient patterns. For example, by preventing the customer from making thousands of "describe resource" API calls we could steer them towards a more efficient "generate report" API. This could become a win-win situation: the customer gets the data they need more easily, and the system operator gets to improve the efficiency of their system.
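A common way to implement these customer-protecting quotas is a per-customer token bucket. Here's a minimal sketch (the rates, names, and the 429 convention in the comment are invented for illustration): each customer gets a budget that refills at a steady rate, so a runaway script quickly drains its own budget without surprising anyone else.

```python
import time


class TokenBucket:
    """Per-customer quota: refill_rate requests/second, with some burst capacity."""

    def __init__(self, refill_rate, burst):
        self.refill_rate = refill_rate
        self.capacity = burst
        self.tokens = burst
        self.last_refill = time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would return a "slow down" (e.g. 429) to the client


# One bucket per customer: a runaway script drains its own bucket
# without affecting anyone else's budget or the system as a whole.
buckets = {}

def admit(customer_id):
    bucket = buckets.setdefault(customer_id, TokenBucket(refill_rate=10, burst=50))
    return bucket.try_acquire()
```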
Load Shedding for the System's Benefit
Now here's where things get nuanced. Sometimes you implement throttling not to help the customer, but to protect your system from legitimate traffic that just happens to be inconveniently timed. Maybe one of your customers is dealing with their own traffic surge - perhaps they just got featured on the front page of Reddit, or their marketing campaign went viral.
In this scenario, you're potentially hurting a customer who's doing absolutely nothing wrong. They're not trying to abuse your system; they're just experiencing success in their own business. But if you let their traffic through, it might overload the system and impact all your other customers. Now, technically we could argue that this type of throttling also helps the customer - nobody wins when the system is overloaded and suffers a congestion collapse. However, the point is that the customer isn't going to thank you for throttling them here!
I find it helpful to think of these as different concepts entirely. The first is quotas or limits - helping customers avoid surprises or use your system more efficiently. The second is load shedding - protecting your system from legitimate but inconvenient demand.
The Uncomfortable Truth About Load Shedding
This distinction matters because it forces us to confront an uncomfortable reality: sometimes we're actively hurting our customers to protect our system. The "preventing abuse" mental model breaks down completely here, and we need a more honest framework.
A healthier way to think about load shedding is that we want to protect our system in a way that causes the least amount of harm to our customers. It's not about good guys and bad guys anymore - it's about making difficult trade-offs when resources are constrained.
This reframing changes how we approach the problem. Instead of thinking "how do we stop bad actors," we start thinking "how do we gracefully degrade when we hit capacity limits while minimizing customer impact?"
The Scaling Dance
Here's where throttling gets really interesting. Load shedding doesn't have to be a permanent punishment. If you're dealing with legitimate traffic spikes, throttling can be a temporary protective measure while you scale up your system to handle the demand.
Think of a restaurant during an unexpectedly busy dinner rush. If they are short-staffed, a restaurant may choose to keep some tables empty and turn away customers to make sure the customers who do get in still have a pleasant experience. Then, once additional staff arrive, they may open additional tables and begin accepting walk-ins again.
In practice, this means your load-shedding system should be closely integrated with your auto-scaling infrastructure. When you start load shedding, that should trigger scaling decisions. The goal is to make load shedding temporary - a protective measure that buys you time to add capacity.
However, you also want to be careful to avoid problems like runaway scaling, where the system scales up to unreasonable sizes because load shedding never stops, or oscillations, where the system wastes resources by continuously scaling up and down for lack of hysteresis. In both of these scenarios, placing velocity controls on scaling decisions can be a reasonable mechanism.
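Here's a minimal sketch of that integration (the thresholds, cooldown, step size, and the scale_up callback are all invented for illustration): sustained load shedding triggers a scale-up, while a cooldown and a hard ceiling act as velocity controls against oscillation and runaway growth.

```python
import time


class ShedDrivenScaler:
    """Scale up when load shedding persists, with velocity controls against
    runaway growth and oscillation."""

    def __init__(self, scale_up, max_fleet_size, cooldown_seconds=300, step=2):
        self.scale_up = scale_up              # callback into the real autoscaler
        self.max_fleet_size = max_fleet_size  # hard ceiling: no runaway scaling
        self.cooldown = cooldown_seconds      # minimum time between scale-ups
        self.step = step                      # max instances added per decision
        self.last_scale_at = 0.0

    def on_metrics(self, shed_rate, fleet_size):
        # shed_rate: fraction of requests rejected by load shedding recently.
        if shed_rate < 0.01:
            return  # not shedding meaningfully; nothing to do
        now = time.monotonic()
        if now - self.last_scale_at < self.cooldown:
            return  # velocity control: let the previous scale-up take effect first
        target = min(fleet_size + self.step, self.max_fleet_size)
        if target > fleet_size:
            self.scale_up(target)
            self.last_scale_at = now
```

Scaling back down is deliberately left out of the sketch; in practice it would follow its own, slower velocity controls.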
Beyond Static Limits
Many load shedding systems I've encountered use static limits. "Customer A gets 100 requests per minute, Customer B gets 100 requests per minute, everyone gets 100 requests per minute." It's simple, it's fair, and it's probably insufficient.
It's a simple system to implement and to explain, but static limits assume that every customer has the same needs from your system, regardless of their scale. In reality, your customers span a wide spectrum of use cases. Some are weekend hobbyists making a few API calls. Others are large companies whose entire business depends on your service.
Static limits also assume that the customers of your system act in an uncorrelated fashion. If multiple customers hit their limits at the same time, the system could still get overloaded. There are lots of real-world reasons such correlated behavior could occur. Perhaps all these customers are different teams within the same company, and the whole company is seeing a big increase in workload. Or perhaps they are using the same client software that contains the same bug. Or, my personal favorite, perhaps they've all configured their systems to perform some intensive action on a cron at midnight, because humans love round numbers!
An interesting alternative is capacity-based throttling. Instead of hard limits, you admit new requests as long as your system has capacity. Think of it like a highway onramp - when the traffic on the highway is flowing freely the onramp lets new cars in without any constraints. But as soon as congestion builds up, the traffic lights on the onramp activate and begin metering new cars.
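A minimal sketch of the onramp idea (using the in-flight request count as the capacity signal, purely for illustration): requests are admitted freely while there's headroom, and metering only kicks in as the system approaches its limit.

```python
import threading


class CapacityBasedAdmission:
    """Admit requests while in-flight work stays below the capacity estimate,
    like an onramp that only meters cars once the highway is congested."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight  # capacity estimate for this fleet
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_admit(self):
        with self.lock:
            if self.in_flight < self.max_in_flight:
                self.in_flight += 1
                return True
            return False  # congestion: start metering (shed or queue the request)

    def release(self):
        # Called when a request finishes, freeing capacity for the next one.
        with self.lock:
            self.in_flight -= 1
```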
The Top Talker Dilemma
But what happens when you hit capacity limits? The naive approach is to shed load indiscriminately, but that's almost as bad as experiencing an overload. Almost, because you are at least avoiding congestion collapse, so many requests will still go through. However, indiscriminate load shedding will make most of your customers see some failures - from their point of view, the system is experiencing an outage.
A different option might be to shed load from your top talkers first. They're using the most resources, so cutting them off gives you the biggest bang for your buck in terms of freeing up capacity. The problem is that your top talkers are often your biggest customers. Cutting them off first is like a retailer turning away their top spending customers. Not exactly a winning business strategy.
One approach I think can work well is to shed load from "new" top talkers - customers whose traffic has recently spiked above their normal patterns. This gives you the capacity relief you need while protecting established usage patterns. The assumption is that sudden spikes are more likely to be temporary or problematic, while established high usage represents legitimate business needs.
One way you could implement this behavior is by starting with low static throttling limits, but then automatically increasing those limits whenever a customer reaches them, as long as the system has capacity. In the happy state, no customer experiences load shedding and the throttling limits keep increasing to meet new demand. However, if the system is at capacity, new increases are temporarily halted and customers who need an increase may get throttled, until the system is scaled up and new headroom is created.
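Here's a minimal sketch of that scheme (the starting limit, growth factor, and the has_headroom signal are invented for illustration): every customer starts small, and hitting the limit raises it automatically as long as the system has spare capacity.

```python
class AdaptiveLimits:
    """Per-customer limits that grow with demand while the system has headroom."""

    def __init__(self, has_headroom, initial_limit=100, growth_factor=2.0):
        self.has_headroom = has_headroom   # callable: is the fleet below capacity?
        self.initial_limit = initial_limit # requests per minute to start with
        self.growth_factor = growth_factor
        self.limits = {}                   # customer_id -> current limit

    def check(self, customer_id, current_rate):
        limit = self.limits.setdefault(customer_id, self.initial_limit)
        if current_rate < limit:
            return True  # well under the limit: admit
        if self.has_headroom():
            # Happy state: the customer hit their limit, but the system has
            # capacity, so raise the limit instead of throttling them.
            self.limits[customer_id] = limit * self.growth_factor
            return True
        # At capacity: halt increases, so "new" top talkers get throttled
        # until scaling creates fresh headroom.
        return False
```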
A Different Mental Model
I think the key insight here is that throttling is not primarily about preventing abuse - it's about resource allocation under constraints. Sometimes those constraints are financial (protecting customers from runaway bills), sometimes they're technical (preventing system overload), and sometimes they're business-related (product tier differentiation).
When we frame throttling as resource allocation rather than abuse prevention, we start asking better questions:
- How do we allocate limited resources fairly?
- How do we balance individual customer needs against system stability?
- How do we minimize harm when we have to make difficult trade-offs?
- How do we use throttling as a signal to guide scaling decisions?
These are more nuanced questions than "how do we stop bad actors," and they lead to more sophisticated solutions.
The Path Forward
None of this is to say that traditional abuse prevention doesn't matter. There are definitely bad actors out there trying to overwhelm systems, and throttling is one tool in your arsenal to deal with them. But I think we do ourselves a disservice when we reduce all throttling to abuse prevention.
The reality is that throttling is a complex, multi-faceted tool that touches on resource allocation, system reliability, product design, and business strategy. The sooner we embrace that complexity, the better solutions we'll build.
In my experience, the most effective throttling systems are those that:
- Clearly distinguish between customer protection and system protection use cases
- Integrate closely with auto-scaling infrastructure
- Use capacity-based limits rather than static ones where possible
- Prioritize established usage patterns over new spikes
- Treat throttling as a resource allocation problem, not just an abuse prevention one
The next time you're designing a throttling system, I'd encourage you to think beyond the "prevent abuse" narrative. Ask yourself: who is this throttling protecting, and what are the trade-offs involved? The answers might surprise you, and they'll almost certainly lead to a better system.
Saturday, March 1, 2025
The Trouble with Leader Elections (in distributed systems)
Blast radius
Liveness vs split-leader tension
Liveness vs faux-leader tension
So what should we do?
Localized leaders
In the United States, a common debate is how much power should be wielded by individual states and how much should be in the hands of the central federal government. It's a nuanced trade-off with many strong opinions on both sides. Luckily, in distributed systems it's almost always better to have smaller blast radii. Instead of having a single leader that operates on the entire distributed system, we could have smaller sub-leaders that each operate on a portion of it. This helps reduce the blast radius of failures, and it also reduces the amount of work each leader needs to perform, making it easier to maintain liveness in the system.
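As a minimal sketch of the idea (the partition count and hashing scheme are invented for illustration): each sub-leader is responsible only for the resources that hash into its partition, so a stalled or buggy leader affects a fraction of the system rather than all of it.

```python
import hashlib

NUM_PARTITIONS = 16  # illustration only; pick based on the desired blast radius


def partition_for(key: str) -> int:
    """Map a resource (customer, shard, table, ...) to a partition."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


class PartitionLeader:
    """A sub-leader that performs housekeeping only for its own partition."""

    def __init__(self, partition_id: int):
        self.partition_id = partition_id

    def owned_resources(self, all_resources):
        # A failure of this leader impacts roughly 1/NUM_PARTITIONS of the
        # resources, not the entire system.
        return [r for r in all_resources if partition_for(r) == self.partition_id]
```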
Idempotent co-leaders
Different architectures
- Using a queue (like SQS) to enqueue housekeeping items as they arise and then processing those using a small fleet of subscribers (a minimal sketch of this option follows the list).
- Using capabilities of the platform to perform housekeeping tasks (e.g. using AutoScaling Groups to replace unhealthy hosts or S3 lifecycles to delete expired objects).
- Using event driven approaches (e.g. using a Lambda to trigger an action when S3 object changes, instead of centrally recomputing all files in the bucket).
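For the queue-based option above, here's that minimal sketch (the queue URL is a placeholder and error handling is elided): a small fleet of identical workers drains housekeeping items from SQS, so no single elected leader has to do all the work.

```python
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/housekeeping"  # placeholder


def handle(item_body):
    """Process one housekeeping item (placeholder for the real work)."""
    print("housekeeping:", item_body)


def worker_loop():
    sqs = boto3.client("sqs")
    while True:
        # Long-poll for work; any worker in the fleet can pick up any item.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,
        )
        for msg in resp.get("Messages", []):
            handle(msg["Body"])
            # Delete only after successful processing, so failed items are retried.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```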
Wednesday, January 24, 2024
The mathematics of redundancy
Where P is the failure probability between 0 and 1, and n is the number of redundant components, or more specifically the number of components a system could lose before a failure would occur. What's important is that the relationship is exponential, and we love exponents when they act in our favor. This means that small increases in redundancy will bring large reductions in failure probability. Or put another way, small increases in cost will bring disproportionally large increases in reliability. Looking at this mathematical model, it's easy to arrive at the conclusion that planes should have as many engines as possible. Especially if you are 14 years old. Unfortunately, the reality is far more nuanced.
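To make the exponent concrete, assuming component failures are independent and using the definition of n above (so the system fails only when n + 1 components are down at once):

$$P_{\text{system}} \approx P^{\,n+1}$$

With P = 0.01, one redundant component (n = 1) gives roughly a 1 in 10,000 chance of system failure, and a second one (n = 2) takes it to roughly 1 in 1,000,000 - each extra unit of redundancy buys another factor of 100.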
For systems that have many correlated failure modes, increased redundancy (and cost) no longer increases reliability!
Conclusion
Notable mention
Saturday, February 4, 2023
Batching: Efficiency under load
Wednesday, December 21, 2022
Performance and efficiency
The topic of software performance and efficiency has been making the rounds this month, especially around engineers not being able to influence their leadership to invest in performance. For many engineers, performance work tends to be some of the most fun and satisfying engineering projects. If you are like me, you love seeing some metric or graph show a step-function improvement - my last code commit this year was one such effort, and it felt great seeing the results.