Wednesday, December 21, 2022
Performance and efficiency
The topic of software performance and efficiency has been making the rounds this month, especially around engineers not being able to influence their leadership to invest in performance. For many engineers, performance work tends to be some of the most fun and satisfying engineering projects. If you are like me, you love seeing some metric or graph show a step function improvement - my last code commit this year was one such effort, and it felt great seeing the results.
However, judging by the replies on John Carmack's post, many feel they are unable to influence their leadership to give such projects appropriate priority. Upwards influence is a complex subject, and cultural norms at individual companies will have a big impact on how (or if) engineers get to influence their roadmaps. However, one tool that has been useful for me in prioritizing performance work is tying it to specific business outcomes. And when it comes to performance and efficiency work, there are multiple distinct ways it can impact the business, and I think it's worth being explicit about those.
Performance as a product feature
We frequently see performance treated as a product feature in infrastructure services like virtualized compute, storage, and networking. Databases and various storage and network appliances are another example. Performance of these products is frequently one of the dimensions customers evaluate, and that tends to shape product roadmaps. This is also where we see benchmarks that fuel a performance arms race, and never-ending debates about how representative those benchmarks are of the true customer experience. In my experience, teams working in these domains have the least trouble prioritizing performance-related work.
Performance as cost savings
We tend to see this in distributed systems a lot. By improving the performance of a distributed system's main unit-of-work code path, we can increase the throughput of any single host. That in turn decreases the count (or size) of hosts needed to handle the aggregate work the system faces, driving down costs. One obvious caveat is that many distributed systems have a minimum footprint needed to meet their redundancy and availability goals. A lightly loaded distributed system that already operates within that minimum footprint is unlikely to see any cost savings from performance optimizations, at least not without additional and often non-trivial investments that would allow the service to "scale to zero."
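To make the arithmetic concrete, here is a back-of-the-envelope sketch of how a per-host throughput win translates into fleet size, including the minimum-footprint caveat. The numbers and the hosts_needed() helper are made up for illustration:

/// Hosts needed to serve an aggregate load, never dropping below the
/// fleet's redundancy minimum. Illustrative numbers only.
fn hosts_needed(aggregate_tps: f64, per_host_tps: f64, min_footprint: u64) -> u64 {
    (aggregate_tps / per_host_tps).ceil().max(min_footprint as f64) as u64
}

fn main() {
    // A 25% per-host throughput improvement shrinks a busy fleet from 100 to 80 hosts...
    assert_eq!(hosts_needed(100_000.0, 1_000.0, 6), 100);
    assert_eq!(hosts_needed(100_000.0, 1_250.0, 6), 80);
    // ...but a lightly loaded fleet pinned at its redundancy minimum sees no savings.
    assert_eq!(hosts_needed(3_000.0, 1_000.0, 6), 6);
    assert_eq!(hosts_needed(3_000.0, 1_250.0, 6), 6);
}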
Performance as customer experience
This category of performance projects improves the customer experience, typically by improving the latency or quality of some process. Perhaps the best known example is website load times, but any other process where humans (or machines!) wait for something to happen is a good candidate. Another example, from my youth, is video games - the same hardware would run some games much smoother and at higher resolution than others. This was typically the case for any game John Carmack had a hand in!
The biggest challenge in advocating for this category of performance improvement projects is quantifying the impact. Over the past decades, multiple studies have demonstrated a strong correlation between fast page load times and positive business outcomes. However, gaining that conviction required a number of time- and effort-consuming experiments, an investment that may be impractical for most teams. And even then, at some point the returns likely diminish. Will a video game sell more copies if it runs at 90 fps rather than 60?
Besides time-consuming A/B studies, one of the best ways to advocate for these kinds of performance improvement projects, in my experience, is to deeply understand how customers use your product. Then you can bring concrete customer anecdotes to make the case.
Performance as an availability risk
These types of performance projects tend to come up anytime we have performance modes in our system, that is, behavior where some units of work are less performant than others. An example could be an O(N²) algorithm that occasionally faces large values of N. Such inefficiencies create a situation where a shift in the composition of the various modes can tip the system over the edge, creating a DoS vector. Throttling and load shedding the more expensive units of work is one common approach, but improving their performance to be on par with the rest can be even more effective, as in the sketch below.
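As a toy illustration of removing such a mode (hypothetical code, not from the original post), consider a handler that checks a batch for duplicates. The quadratic version is fine for small batches but blows up for the occasional large one, while the linear version keeps every unit of work on a similar cost curve:

use std::collections::HashSet;

/// O(N^2): cheap for small batches, but latency explodes for the occasional
/// large one, creating a distinct slow mode in the system.
fn has_duplicates_quadratic(items: &[String]) -> bool {
    for i in 0..items.len() {
        for j in (i + 1)..items.len() {
            if items[i] == items[j] {
                return true;
            }
        }
    }
    false
}

/// O(N): keeps the cost of every unit of work on a similar curve, removing
/// the slow mode instead of throttling it.
fn has_duplicates_linear(items: &[String]) -> bool {
    let mut seen = HashSet::new();
    items.iter().any(|item| !seen.insert(item))
}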
One way to identify and drive down these kinds of performance modality issues is by plotting tail and median latencies on the same graph, or perhaps even using fancier graph types like histogram heatmaps. Seeing multiple distinct modes in the graph is a telltale sign that your system has such an intrinsic risk.
Performance and security
Security sensitive domains are unique because the absence of performance variability matters more than raw performance. This is because any variability can expose potential side channels to an attacker. For example, if an encryption API takes different amounts of time depending on the content being encrypted, then a determined adversary can use that timing information as a side channel to infer information about the plaintext being encrypted. Luckily, very few teams are writing code with that kind of blast radius. The amount of expertise and consideration that goes into getting such code right is truly mind-boggling!
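As a small illustration of the kind of code this refers to (a sketch of the idea only, not vetted cryptographic code), here is a comparison routine whose running time depends only on the input length, not on where the first mismatch occurs:

/// Compares two byte slices in time that depends only on their length, not on
/// where the first mismatch occurs, so timing reveals nothing about the data.
/// Illustrative only; production code should use a vetted constant-time
/// library, since compilers can undo naive attempts like this one.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        // OR the XOR of every byte pair so we never exit the loop early.
        diff |= x ^ y;
    }
    diff == 0
}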
Like all classifications, this one is imperfect. For starters, a project can fall into multiple categories. For example, improving P50 request latency on a web service is likely to both improve the customer experience and reduce service costs. But as they say, "All models are wrong, but some are useful", and hopefully this model will be useful in helping folks tie performance work to desirable business outcomes.
Sunday, April 10, 2022
Feature Flags: Are they considered harmful?
In a recent operational review, a colleague brought up how past feedback from me has influenced his team's avoidance of feature flags. If we ever worked together, then you too probably heard me grumble about the downsides of feature flags. So should feature flags be avoided at all costs? My opinions on this topic are nuanced, so I thought it would be worth writing them down. But first, what do we even have in mind when we talk about feature flags? When I think about feature flags, I tend to think about two distinct use-cases.
Coordinated enablement
When developing features within a complex codebase, it's common that the entire feature cannot be shipped in a single commit. Without using feature flags, one approach is to use a dedicated branch for feature development, where engineers working on a feature can accumulate incremental commits. Then, once the entire feature is code complete and tested, it can be merged into the mainline branch and deployed to production. This works well, but requires continuous merging of the feature and mainline branches to avoid a painful merge at the end. I don't know about others, but merging code is one of my least favorite software development activities.
Feature flags present another option. As the team is working on a new feature, code paths related to that feature can be placed behind a gate. For example, consider a service that vends answers to commonly asked questions. We may have an idea for a cool new feature, where the service auto-detects the language in which the question was asked and vends the answer in the same language. We could use a feature flag to gate the feature like this:
/// Returns a thoughtful answer to the question provided in 'question'
/// argument.
fn answer_question(question: String) -> String {
    let language;
    if feature_enabled(ENABLE_AUTO_DETECT_LANGUAGE) {
        language = detect_language(&question);
    } else {
        language = ENGLISH;
    }
    lookup_answer(question, language)
}
The feature flag can remain turned off while the rest of the implementation is being worked on. Perhaps we still need to populate the database with non-English answers, or perhaps the detect_language() function is not yet working as well as we would like, so we don't feel comfortable releasing the feature to customers. Gating new code paths behind a feature flag allows us to incrementally commit and deploy new code, without accumulating a big and risky change set. We can even enable the feature flag in our staging environment, allowing engineers to test the new feature. Then, once all the parts of the feature are ready and well tested, we can flip the feature flag from false to true, enabling it for customers.
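The post doesn't show how feature_enabled() is implemented. For the coordinated enablement use-case, a minimal hypothetical version could simply bake the flag state into the binary, so flipping it is an ordinary code change:

#[derive(Clone, Copy)]
enum Feature {
    AutoDetectLanguage,
}

const ENABLE_AUTO_DETECT_LANGUAGE: Feature = Feature::AutoDetectLanguage;

/// Hypothetical implementation: enablement state is baked into the binary,
/// so flipping a flag from false to true is a one-line code change that
/// rides the service's regular deployment pipeline.
fn feature_enabled(feature: Feature) -> bool {
    match feature {
        Feature::AutoDetectLanguage => false, // flip to true once the feature is ready
    }
}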
Incremental enablement
Another use-case for feature flags is fine-grained control over how a feature gets rolled out to customers. Building on the example from the previous section, we may think that language auto-detection will be a huge success and customers will love it. But what if something goes wrong? Even the most rigorous testing will not cover all the possible ways customers can ask their questions. Language detection could fail in unexpected ways, or, worse yet, regress the service for the millions of customers happily asking their questions in English today. One way to increase our confidence is to make the feature gate customer-aware by passing an account_id into the feature_enabled() function:
/// Returns a thoughtful answer to the question provided in 'question'
/// argument. May answer in your own language if that feature is
/// enabled on your account.
fn answer_question(question: String, account_id: String) -> String {
    let language;
    if feature_enabled(ENABLE_AUTO_DETECT_LANGUAGE, &account_id) {
        language = detect_language(&question);
    } else {
        language = ENGLISH;
    }
    lookup_answer(question, language)
}
Then, assuming the feature_enabled() function can keep track of enablement status by account, we can incrementally ramp up traffic to the new code paths. Perhaps initially we will turn it on for our own developer accounts, then for 1% of customers, then 5%, and so on. At each step, we will watch for errors and customer feedback, only dialing the feature up if we receive a positive signal. And if something goes wrong, we can quickly ramp things down to 0%, effectively rolling back the change. Delivering major features this way can help reduce the blast radius: if things go wrong, we will have only impacted a small percentage of customers!
This approach has secondary benefits too. What if the feature causes problems for one of our customers? Perhaps they submit English questions that have French words in them. The system works for them today, because they get their answers in English. But with our auto-detection feature, they begin getting French answers. Cela peut ĂȘtre frustrant si vous ne parlez pas français! (That can be frustrating if you don't speak French!)
Because we are a customer-obsessed service, we would want to set their experience right as soon as possible. Having per-customer control over the feature allows us to temporarily deny-list their account while we work on a proper fix, which may take some time. Without this capability, we would need to roll back the new feature for everybody!
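Here is one way the customer-aware feature_enabled() might work under the hood. This is a hypothetical sketch (the FlagState type, its fields, and the bucketing scheme are all assumptions, not from the original service): deny-listed accounts are always off, and everyone else is assigned to a stable bucket by hashing their account id, so a given customer stays enabled as the rollout percentage ramps up.

use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical per-flag rollout state: a ramp-up percentage plus a set of
/// accounts for which the feature has been explicitly disabled.
struct FlagState {
    rollout_percent: u64, // 0..=100
    denied_accounts: HashSet<String>,
}

/// Decides per-account enablement. In a real fleet you would want a hash
/// function that is stable across hosts and releases; DefaultHasher is used
/// here only to keep the sketch self-contained.
fn feature_enabled(state: &FlagState, account_id: &str) -> bool {
    if state.denied_accounts.contains(account_id) {
        return false;
    }
    let mut hasher = DefaultHasher::new();
    account_id.hash(&mut hasher);
    hasher.finish() % 100 < state.rollout_percent
}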
Ok, so what's the problem?
Despite the opening paragraph, everything I've said so far makes feature flags sound useful. So why all the pessimism? The challenge with feature flags comes in two forms: permutations and delivery mechanisms. Here's what I mean by permutations. In addition to auto-detecting the language, our popular question answering service is likely working on other features. Perhaps we want to submit a Mechanical Turk request whenever we see a question that doesn't match an answer in our database. Using a feature flag, our example could look like this:
/// Returns a thoughtful answer to the question provided in 'question'
/// argument. May answer in your own language if that feature is
/// enabled on your account. If we don't have an answer to the question,
/// this function may submit a Mechanical Turk request to populate an
/// answer into its internal database.
fn answer_question(question: String, account_id: String) -> Option<String> {
    let language;
    if feature_enabled(ENABLE_AUTO_DETECT_LANGUAGE, &account_id) {
        language = detect_language(&question);
    } else {
        language = ENGLISH;
    }
    // Borrow the question here so it can still be referenced below when
    // submitting the Mechanical Turk request.
    let answer = lookup_answer(&question, language);
    if answer.is_none() {
        if feature_enabled(ENABLE_MECHANICAL_TURK, &account_id) {
            submit_mturk_request(&question, language);
        }
    }
    answer
}
By adding another feature flag, we doubled the number of distinct code paths we need to maintain and test! We now need to worry about our service working correctly with all four possible permutations of language detection and Mechanical Turk enablement. And this complexity grows exponentially with the number of feature flags - just with five feature flags we have 32 possible permutations! And while it may seem improbable that enabling language auto detection in our toy example would have any impact on Mechanical Turk integration, real systems tend to be much more complex and harder to reason about. A small behavior change in one part of the code base can have unintended consequences on another.
Building automation that tests all the possible permutations can help, but even then we have to accept that systems often behave differently under real production workloads. Faced with an operational issue, we will need to reason about the interplay of the various feature flags. And then we still need to contend with the cognitive overhead of all the if feature_enabled(..) { } blocks in our codebase. For these reasons, it's important to have a closed-loop process that bounds the number of feature flags active at any point in time, continuously cleaning up older feature flags. But just like cleaning your room, it's an activity where the costs are real and the benefits are abstract, and therefore a hard behavior to keep up!
The second pitfall is the mechanism by which the service learns about changes to its feature flags. It's common to use a dynamic datastore, such as a database, to store and propagate feature flag changes. But that has always seemed backwards to me! A change in a feature flag is effectively a change in the behavior of the service - new code paths are getting activated for the first time. And every service already has a mechanism by which new code gets delivered: its deployment pipeline. This pipeline is where teams tend to apply all of their deployment safety learnings, such as the ones described in this Amazon Builders' Library article. So why invent a secondary deployment mechanism, one that will need to reimplement all the safety guardrails of the service's deployment pipeline?
One argument I've heard is that having a dedicated "fast path" for feature flag toggles makes it easy to quickly roll out or roll back new features. But even that seems off to me. Most services deploy new code more frequently than they toggle feature flags, so if deployments are slow and painful, wouldn't it be beneficial to spend the energy fixing that, rather than inventing a deployment side channel? One that still has to support all the safety properties of the service's deployment pipeline, because feature flag toggles are effectively new code path deployments.
Conclusion
To summarize, I think feature flags can be a powerful mechanism by which a service can deliver new capabilities. They can be especially effective in helping us control the blast radius in case of unforeseen problems with new features. But feature flags also come with operational and cognitive costs, so it's important to have a mechanism that bounds the number of feature flags in the codebase at any point in time.
It's also important to acknowledge that feature flag toggles are no different than any other code change. Just like code or configuration changes, they change the behavior of your service. And if your service already has a deployment pipeline by which it safely delivers new code to production, consider using the same mechanism for deploying feature flag toggles.