Sunday, April 10, 2022

Feature Flags: Are they considered harmful?

In a recent operational review, a colleague brought up how past feedback from me has influenced his team's avoidance of feature flags. If we ever worked together, then you too probably heard me grumble about the downsides of feature flags. So should feature flags be avoided at all costs? My opinions on this topic are nuanced, so I thought it would be worth writing them down. But first, what do we even have in mind when we talk about feature flags? When I think about feature flags, I tend to think about two distinct use-cases. 

Coordinated enablement

When developing features within a complex codebase, it's common that the entire feature cannot be shipped in a single commit. Without using feature flags, one approach is to use a dedicated branch for feature development, where engineers working on a feature can accumulate incremental commits. Then, once the entire feature is code complete and tested, it can be merged into the mainline branch and deployed to production. This works well, but requires continuous merging of the feature and mainline branches to avoid a painful merge at the end. I don't know about others, but merging code is one of my least favorite software development activities.

Feature flags present another option. As the team is working on a new feature, code paths related to that feature can be placed behind a gate. For example, consider a service that vends answers to commonly asked questions. We may have an idea for a cool new feature, where the service auto-detects the language in which the question was asked and vends the answer in the same language. We could use a feature flag to gate the feature like this:
/// Returns a thoughtful answer to the question provided in 'question'
/// argument.
fn answer_question(question: String) -> String {
let language;
if feature_enabled(ENABLE_AUTO_DETECT_LANGUAGE) {
language = detect_language(&question);
} else {
language = ENGLISH;
}

lookup_answer(question, language)
}

The feature flag can remain turned off while the rest of the implementation is being worked on. Perhaps we still need to populate the database with non-English answers, or perhaps detect_language() function is still not working as well as we want before we feel comfortable releasing the feature to customers. Gating new code paths behind a feature flag allows us to incrementally commit and deploy new code, without accumulating a big and risky change set. We can even enable the feature flag in our staging environment, allowing engineers to test the new feature. Then, once all the parts of the feature are ready and well tested, we can flip the feature flag from false to true, enabling it for customers.

Incremental enablement

Another use-case for feature flags is fine-grained control over how a feature gets rolled out to customers. Building up on the example from the previous section, we may think that language auto-detection will be a huge success and customers will love it. But what if something goes wrong. Even the most rigorous testing will not cover all the possible ways customers can ask their questions. Language detection could fail in unexpected ways, or worse yet regress the service for millions of customers happily asking their questions in English today. One way to increase our confidence is to make feature gate customer aware by passing account_id into feature_enabled() function:
/// Returns a thoughtful answer to the question provided in 'question'
/// argument. May answer in your own language if that feature is
/// enabled on your account.
fn answer_question(question: String, account_id: String) -> String {
let language;
if feature_enabled(ENABLE_AUTO_DETECT_LANGUAGE, &account_id) {
language = detect_language(&question);
} else {
language = ENGLISH;
}

lookup_answer(question, language)
}

Then, assuming the feature_enabled() function can keep track of enablement status by account, we can incrementally ramp up traffic to the new code paths. Perhaps, initially we will turn it on for our own developer accounts, then 1% of the customers, then 5%, and so on. At each step, we will watch for errors and customer feedback, only dialing the feature up if we receive a positive signal. And if something goes wrong, we can quickly ramp things down to 0%, effectively rolling back the change. Delivering major features this way can help reduce the blast radius. If things go wrong, we would've only impacted a small percentage of customers!

This approach has secondary benefits too. What if the feature causes problems for one of our customers. Perhaps they submitEnglish questions that have French words in them. The system works for them today, because they get their answers in English. But with our auto-detection feature, they begin getting French answers. 
Cela peut ĂȘtre frustrant si vous ne parlez pas français!

Because we are a customer obsessed service, we would want to right their experience as soon as possible. Having per customer control over the feature allows us to temporarily deny list their account, while we work on a proper fix, which may take some time. Without this capability, we would need to rollback the new feature for everybody!

Ok, so what's the problem?

Despite the opening paragraph, everything I've said so far makes feature flags sound useful. So why all the pessimism? The challenge with feature flags comes in two forms: permutations and delivery mechanisms. Here's what I mean by permutations. In addition to auto detecting the language, our popular question answering service is likely working on other features. Perhaps, we want to submit a Mechanical Turk request whoever we see a question that doesn't match an answer in our database. Using a feature flag, our example could look like this:

/// Returns a thoughtful answer to the question provided in 'question'
/// argument. May answer in your own language if that feature is
/// enabled on your account. If we don't have an answer to the question,
/// this function may submit a Mechanical Turk request to populate an
/// answer into its internal database.
fn answer_question(question: String, account_id: String) -> Option<String> {
let language;
if feature_enabled(ENABLE_AUTO_DETECT_LANGUAGE, &account_id) {
language = detect_language(&question);
} else {
language = ENGLISH;
}

let answer = answer(question, language);
if answer.is_none() {
if feature_enabled(ENABLE_MECHANICAL_TURK, &account_id) {
submit_mturk_request(&question, language);
}
}

answer
}
By adding another feature flag, we doubled the number of distinct code paths we need to maintain and test! We now need to worry about our service working correctly with all four possible permutations of language detection and Mechanical Turk enablement. And this complexity grows exponentially with the number of feature flags - just with five feature flags we have 32 possible permutations! And while it may seem improbable that enabling language auto detection in our toy example would have any impact on Mechanical Turk integration, real systems tend to be much more complex and harder to reason about. A small behavior change in one part of the code base can have unintended consequences on another.

Building automation that tests all the possible permutations can help, but even then we have to accept that systems often behave differently under real production workloads. Faced with an operational issue, we will need to reason about interplay of various feature flags. And then we still need to contend with the cognitive overhead of all the if feature_enabled(..) { } blocks in our codebase. For these reasons, it's important to have a closed loop process which bounds the number of feature flags that are active at any point in time, continuously cleaning up older feature flags. But just like cleaning your room, it's an action where the costs are real and the benefits are abstract, and therefore a hard behavior to keep up!

The second pitfall is the mechanism by which the service learns about changes to its feature flags. It's common to use a dynamic datastore, such as a database, to store and propagate feature flag changes. But that always seemed backwards to me! A change in the feature flag is effectively a change in the behavior of the service - new code paths are getting activated for the first time. And every service already has a mechanism by which new code gets delivered: its deployment pipeline. This pipeline is where teams tend to apply all of their deployment safety learnings, such as the ones described in this Amazon Builders’ Library article. So why invent a secondary deployment mechanism, one that will need to reimplement all the safety guardrails of the service’s deployment pipeline?

One argument I've heard is that having a dedicated "fast path" for feature flag toggles makes it easy to quickly rollout or rollback new features. But even that seems off to me. Most services deploy new code more frequently than they toggle feature flags, so if deployments are slow and painful, wouldn't it be beneficial to spend the energy fixing that, rather than inventing a deployment side channel? One that still has to support all the safety properties of the service's deployment pipeline, because feature flags are effectively new code path deployments.

Conclusion

To summarize, I think feature flags can be a powerful mechanism by which a service can deliver new capabilities. They can be especially effective in helping us control the blast radius in case of unforeseen problems with new features. But feature flags also come with an operational and cognitive costs, so it's important to have a mechanism that bounds the number of feature flags within the code base at any point in time. 

It's also important to acknowledge that feature flag toggles are no different that any other code changes. Just like code or configuration changes, they change the behavior of your service. And if your service already has a deployment pipeline by which it safely delivers new code to production, consider using the same mechanism for deploying feature flag toggles.