Monday, June 16, 2025

The Nuanced Reality of Throttling: It's Not Just About Preventing Abuse

If you work with multi-tenant systems, you are probably familiar with the concept of throttling or admission control. The idea is pretty simple and is rooted in the common human desire for fairness: when using a shared system or resource, no customer should be able to consume "too much" of that resource and negatively impact other customers. What constitutes "too much" can vary a lot and will usually depend on technical, product, business, and even social factors.

Yet when most engineering teams think about throttling, the first thing that comes to mind is often protecting from bad actors who may intentionally or accidentally try to knock the system down. It's a clean, morally satisfying mental model. Bad actors perform unreasonable actions, so we put up guardrails to protect everyone else. Justice served, system protected, everyone goes home happy.

But here's the thing - protecting against bad actors is just a small fraction of the throttling story. The reality of throttling is far more nuanced, and frankly, more interesting than the "prevent abuse" story we often tell ourselves.

Two Sides of the Same Coin

I want to start with a distinction that influences how I think about throttling. When we implement an admission control mechanism like throttling, we're typically optimizing for one of two parties: the customer or the system operator. And these two scenarios are fundamentally different beasts.

Quotas and Limits for the Customer's Benefit

This is the "helpful" throttling. Think about a scenario where a developer accidentally writes a runaway script that starts making thousands of paid API calls per second to your service. Without throttling, they might wake up to a bill that they do not love. In this case, throttling is essentially a safety net - it prevents their own code from causing financial harm.

Similarly, consumption limits can be a mechanism to steer customers towards more efficient patterns. For example, by preventing the customer from making thousands of "describe resource" API calls we could steer them towards a more efficient "generate report" API. This could become a win-win situation: the customer gets the data they need more easily, and the system operator gets to improve the efficiency of their system.
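
As a rough sketch of what such a quota could look like, here is a minimal per-customer daily call counter in Python. The quota value, the function name, and the idea of rejecting calls once the quota is spent are all illustrative assumptions rather than a recommendation of specific numbers.

```python
import time
from collections import defaultdict

# Hypothetical daily quota, purely for illustration.
DAILY_CALL_QUOTA = 10_000

_calls_today = defaultdict(int)   # customer_id -> calls made in the current window
_day_started = time.time()

def admit_describe_call(customer_id: str) -> bool:
    """Admit a 'describe resource' call unless the customer's daily quota is spent."""
    global _day_started
    if time.time() - _day_started >= 86_400:   # reset the window once a day
        _calls_today.clear()
        _day_started = time.time()
    if _calls_today[customer_id] >= DAILY_CALL_QUOTA:
        # Rejecting here both caps the customer's bill and nudges them
        # toward a cheaper bulk "generate report" style API.
        return False
    _calls_today[customer_id] += 1
    return True
```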

Load Shedding for the System's Benefit

Now here's where things get nuanced. Sometimes you implement throttling not to help the customer, but to protect your system from legitimate traffic that just happens to be inconveniently timed. Maybe one of your customers is dealing with their own traffic surge - perhaps they just got featured on the front page of Reddit, or their marketing campaign went viral.

In this scenario, you're potentially hurting a customer who's doing absolutely nothing wrong. They're not trying to abuse your system; they're just experiencing success in their own business. But if you let their traffic through, it might overload the system and impact all your other customers. Now, technically we could argue that this type of throttling also helps the customer - nobody wins when the system is overloaded and suffers a congestion collapse. However, the point is that the customer isn't going to thank you for throttling them here!

I find it helpful to think of these as different concepts entirely. The first is quotas or limits - helping customers avoid surprises or use your system more efficiently. The second is load shedding - protecting your system from legitimate but inconvenient demand. 

The Uncomfortable Truth About Load Shedding

This distinction matters because it forces us to confront an uncomfortable reality: sometimes we're actively hurting our customers to protect our system. The "preventing abuse" mental model breaks down completely here, and we need a more honest framework.

A healthier way to think about load shedding is that we want to protect our system in a way that causes the least amount of harm to our customers. It's not about good guys and bad guys anymore - it's about making difficult trade-offs when resources are constrained.

This reframing changes how we approach the problem. Instead of thinking "how do we stop bad actors," we start thinking "how do we gracefully degrade when we hit capacity limits while minimizing customer impact?"

The Scaling Dance

Here's where throttling gets really interesting. Load shedding doesn't have to be a permanent punishment. If you're dealing with legitimate traffic spikes, throttling can be a temporary protective measure while you scale up your system to handle the demand.

Think of a restaurant during an unexpectedly busy dinner rush. If it is short-staffed, the restaurant may choose to keep some tables empty and turn away customers to make sure the customers who do get seated still have a pleasant experience. Then, once additional staff arrive, it can open those tables back up and begin accepting walk-ins again.

In practice, this means your load-shedding system should be closely integrated with your auto-scaling infrastructure. When you start load shedding, that should trigger scaling decisions. The goal is to make load shedding temporary - a protective measure that buys you time to add capacity.
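
As a sketch of that integration, the snippet below treats the observed shed rate as a scaling signal. The autoscaler object, its request_scale_up method, and the one-percent threshold are stand-ins for whatever hooks your platform actually exposes.

```python
def on_metrics_tick(requests_total: int, requests_shed: int, autoscaler) -> None:
    """Treat load shedding as a scaling signal (illustrative sketch).

    `autoscaler` is an assumed interface standing in for your real scaling hook.
    """
    if requests_total == 0:
        return
    shed_rate = requests_shed / requests_total
    # Any sustained shedding means we are out of headroom, so ask for more capacity.
    if shed_rate > 0.01:   # threshold is an assumption; tune per system
        autoscaler.request_scale_up(reason="load_shedding_active")
```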

However, you also want to be careful to avoid problems like runaway scaling, where the system scales up to unreasonable sizes because load shedding never stops, or oscillation, where the system wastes resources by repeatedly scaling up and down because the scale-up and scale-down triggers are too close together. In both of these scenarios, placing velocity controls on scaling decisions can be a reasonable safeguard.
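
Here is one shape such velocity controls might take, assuming a simple utilization-driven scaler. The instance ceiling, the per-hour cap on scaling events, and the gap between the scale-up and scale-down thresholds are illustrative numbers you would tune for your own system.

```python
import time

# Guardrails on scaling decisions; all numbers are illustrative assumptions.
MAX_INSTANCES = 200            # hard ceiling to stop runaway scale-up
MAX_SCALE_EVENTS_PER_HOUR = 4  # velocity control on how quickly we react
SCALE_UP_UTILIZATION = 0.80    # scale up above this...
SCALE_DOWN_UTILIZATION = 0.40  # ...but only scale down well below it, to avoid flapping

_recent_scale_events = []

def _allowed_to_scale() -> bool:
    """Velocity control: cap how many scaling decisions we make per hour."""
    now = time.time()
    _recent_scale_events[:] = [t for t in _recent_scale_events if now - t < 3600]
    if len(_recent_scale_events) >= MAX_SCALE_EVENTS_PER_HOUR:
        return False
    _recent_scale_events.append(now)
    return True

def desired_change(current_instances: int, utilization: float) -> int:
    """Return +1, -1, or 0 instances, respecting the ceiling and the velocity cap."""
    if utilization > SCALE_UP_UTILIZATION and current_instances < MAX_INSTANCES:
        return 1 if _allowed_to_scale() else 0
    if utilization < SCALE_DOWN_UTILIZATION and current_instances > 1:
        return -1 if _allowed_to_scale() else 0
    return 0
```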

Beyond Static Limits

Many load shedding systems I've encountered use static limits. "Customer A gets 100 requests per minute, Customer B gets 100 requests per minute, everyone gets 100 requests per minute." It's simple, it's fair, and it's probably insufficient.
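
For concreteness, a static limit like this is often little more than a fixed-window counter keyed by customer ID; the sketch below assumes a 100-requests-per-minute limit and a hypothetical admit function.

```python
import time
from collections import defaultdict

REQUESTS_PER_MINUTE = 100   # the same static limit for every customer

_windows = defaultdict(lambda: (0.0, 0))   # customer_id -> (window_start, count)

def admit(customer_id: str) -> bool:
    """Fixed-window counter: simple, easy to explain, and identical for everyone."""
    now = time.time()
    window_start, count = _windows[customer_id]
    if now - window_start >= 60:
        window_start, count = now, 0   # start a fresh one-minute window
    if count >= REQUESTS_PER_MINUTE:
        return False
    _windows[customer_id] = (window_start, count + 1)
    return True
```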

Static limits are simple to implement and to explain, but they assume that every customer has the same needs from your system, regardless of their scale. In reality, your customers span a wide spectrum of use cases. Some are weekend hobbyists making a few API calls. Others are large companies whose entire business depends on your service.

Static limits also assume that the customers of your system act in an uncorrelated fashion. If multiple customers hit their limit at the same time, the system could still get overloaded. There are lots of real-world reasons such correlated behavior could occur. Perhaps all of these customers are different teams within the same company, which is seeing a big increase in workload. Or perhaps they are using the same client software that contains the same bug. Or, my personal favorite, perhaps they've all configured their systems to perform some intensive action on a cron at midnight, because humans love round numbers!

An interesting alternative is capacity-based throttling. Instead of hard limits, you admit new requests as long as your system has capacity. Think of it like a highway onramp - when the traffic on the highway is flowing freely the onramp lets new cars in without any constraints. But as soon as congestion builds up, the traffic lights on the onramp activate and begin metering new cars.
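
A minimal version of that metering behavior might look like the sketch below, which uses the number of requests in flight as a rough proxy for capacity. The MAX_IN_FLIGHT value and the try_admit/release interface are assumptions; in practice you might meter on CPU, queue depth, or downstream latency instead.

```python
import threading

MAX_IN_FLIGHT = 500   # assumed capacity; in practice derived from load testing

_in_flight = 0
_lock = threading.Lock()

def try_admit() -> bool:
    """Onramp-style metering: let requests in freely until the highway is full."""
    global _in_flight
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            return False   # congestion is building up: start metering
        _in_flight += 1
        return True

def release() -> None:
    """Call when a request completes to free its slot."""
    global _in_flight
    with _lock:
        _in_flight -= 1
```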

The Top Talker Dilemma

But what happens when you hit capacity limits? The naive approach is to shed load indiscriminately, but that's almost as bad as experiencing an overload. Almost, because you are at least avoiding congestion collapse, so some requests will still go through. However, indiscriminate load shedding makes most of your customers see failures - from their point of view, the system is experiencing an outage.

A different option might be to shed load from your top talkers first. They're using the most resources, so cutting them off gives you the biggest bang for your buck in terms of freeing up capacity. The problem is that your top talkers are often your biggest customers. Cutting them off first is like a retailer turning away their top-spending customers. Not exactly a winning business strategy.

One approach I think can work well is to shed load from "new" top talkers - customers whose traffic has recently spiked above their normal patterns. This gives you the capacity relief you need while protecting established usage patterns. The assumption is that sudden spikes are more likely to be temporary or problematic, while established high usage represents legitimate business needs.
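
One way to spot a "new" top talker is to compare a customer's current request rate against a slow-moving baseline of their established usage, and only shed when the system is actually out of capacity. The sketch below uses an exponentially weighted average; the spike factor, the smoothing constant, and the protected-rate floor are all assumed values you would tune.

```python
from collections import defaultdict

SPIKE_FACTOR = 3.0        # "new top talker" = traffic well above baseline (assumed)
BASELINE_ALPHA = 0.05     # slow-moving average so the baseline reflects established usage
MIN_PROTECTED_RATE = 10.0 # requests/sec nobody gets shed below, even at capacity (assumed)

_baselines = defaultdict(float)   # customer_id -> long-term requests/sec estimate

def update_baseline(customer_id: str, observed_rate: float) -> None:
    """Slowly fold observed traffic into the customer's established baseline."""
    prev = _baselines[customer_id]
    _baselines[customer_id] = (1 - BASELINE_ALPHA) * prev + BASELINE_ALPHA * observed_rate

def should_shed(customer_id: str, observed_rate: float, system_at_capacity: bool) -> bool:
    """Shed only when out of capacity, and only from customers whose traffic
    has recently spiked far above their normal pattern."""
    if not system_at_capacity:
        return False
    threshold = max(SPIKE_FACTOR * _baselines[customer_id], MIN_PROTECTED_RATE)
    return observed_rate > threshold
```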

One way you could implement this behavior is by starting with low static throttling limits, but then automatically increasing those limits whenever a customer reaches them, as long as the system has capacity. In the happy state, no customer experiences load shedding and the throttling limits are increased to meet new demand. However, if the system is at capacity, new increases are temporarily halted and customers who need an increase may get throttled until the system is scaled up and new headroom is created.
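
A bare-bones sketch of that policy might look like the following, where on_limit_reached is a hypothetical hook invoked when a customer bumps into their current limit; the initial limit and the growth factor are illustrative.

```python
from collections import defaultdict

INITIAL_LIMIT = 100      # requests/min; deliberately low starting point (assumed)
INCREASE_FACTOR = 1.5    # how much headroom to grant on each automatic bump (assumed)

_limits = defaultdict(lambda: INITIAL_LIMIT)

def on_limit_reached(customer_id: str, system_has_capacity: bool) -> bool:
    """Called when a customer hits their current limit.

    Returns True if the request should still be admitted (limit was raised),
    False if the customer should be throttled until capacity is added.
    """
    if system_has_capacity:
        # Happy path: grow the limit to meet new demand instead of throttling.
        _limits[customer_id] = int(_limits[customer_id] * INCREASE_FACTOR)
        return True
    # At capacity: pause increases; this customer is throttled until we scale up.
    return False
```

Paired with the scaling integration described earlier, hitting the "no capacity" branch is also a natural signal to scale up.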

A Different Mental Model

I think the key insight here is that throttling is not primarily about preventing abuse - it's about resource allocation under constraints. Sometimes those constraints are financial (protecting customers from runaway bills), sometimes they're technical (preventing system overload), and sometimes they're business-related (product tier differentiation).

When we frame throttling as resource allocation rather than abuse prevention, we start asking better questions:

  • How do we allocate limited resources fairly?
  • How do we balance individual customer needs against system stability?
  • How do we minimize harm when we have to make difficult trade-offs?
  • How do we use throttling as a signal to guide scaling decisions?

These are more nuanced questions than "how do we stop bad actors," and they lead to more sophisticated solutions.

The Path Forward

None of this is to say that traditional abuse prevention doesn't matter. There are definitely bad actors out there trying to overwhelm systems, and throttling is one tool in your arsenal to deal with them. But I think we do ourselves a disservice when we reduce all throttling to abuse prevention.

The reality is that throttling is a complex, multi-faceted tool that touches on resource allocation, system reliability, product design, and business strategy. The sooner we embrace that complexity, the better solutions we'll build.

In my experience, the most effective throttling systems are those that:

  1. Clearly distinguish between customer protection and system protection use cases
  2. Integrate closely with auto-scaling infrastructure
  3. Use capacity-based limits rather than static ones where possible
  4. Prioritize established usage patterns over new spikes
  5. Treat throttling as a resource allocation problem, not just an abuse prevention one

The next time you're designing a throttling system, I'd encourage you to think beyond the "prevent abuse" narrative. Ask yourself: who is this throttling protecting, and what are the trade-offs involved? The answers might surprise you, and they'll almost certainly lead to a better system.