Joe Magerramov's blog
Sunday, November 9, 2025
Switching from Synchronous to Asynchronous Mode of Coding
Sunday, October 19, 2025
The New Calculus of AI-based Coding
Driving at 200mph
Here's where it gets interesting. A typical software team, even an experienced one, doesn't get things right all the time. Even with good testing and engineering practices, bugs occasionally make it through. We've all heard the phrase "testing in production." That reality is the main reason I've always believed that focusing on testing alone is not enough, and that investing in reducing blast radius and time to recovery is just as important.
AI-assisted code is no different: it may contain bugs even when thoroughly reviewed by a human, and I suspect the probabilities are not significantly different. However, when teams ship commits at 10x the rate, the overall math changes. What used to be a production-impacting bug once or twice a year can become a weekly occurrence. Even if most bugs get caught in integration or testing environments, they will still impact the shared code base, requiring investigation and slowing the rest of the team down. Once again, this is not just hyperbole - our team sees signs that these are the challenges that pop up with a step-function increase in throughput.
I am increasingly convinced that for agentic development to increase engineering velocity by an order of magnitude, we need to decrease the probability of problematic commits by an order of magnitude too - and likely by even more than that, since at high velocities individual commits begin interacting with each other in unexpected ways.
In other words, driving at 200mph, you need a lot of downforce to keep the car on the track!
The Cost-Benefit Rebalance
One of the best ways to reduce the chance of bugs is to improve testing. I'm an airplane geek, and I have always admired the testing ideas used by airplane manufacturers: from early simulations, to component testing, to wind tunnel testing, to testing to the breaking point, and ultimately test flights of fully assembled aircraft. Even flight simulators play a role in improving the overall safety of the industry. Some of these ideas have been tried in the software industry, but they are far from ubiquitous.
As an example, I've always liked "wind tunnel" style tests that exercise the fully assembled system in a controlled environment. To achieve that, one pattern I've used is implementing high-fidelity "fake" versions of external dependencies that can be run locally. If you do that, you can then write build-time tests that run locally and verify the end-to-end behavior of the whole system. You can even inject unexpected behaviors and failures into the fake dependencies to test how the system handles them. Such tests are easy to write and execute because they run locally, and they are great at catching those sneaky bugs in the seams between components.
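To make the pattern concrete, here's a minimal sketch (the FakeObjectStore, UploadService, and their put/get/upload interfaces are invented for illustration, not taken from any real codebase): an in-memory fake that supports failure injection, plus a build-time test that exercises a real retry path against it.

```python
class FakeObjectStore:
    """In-memory stand-in for a real object store; used only in local tests."""

    def __init__(self):
        self._objects = {}
        self._fail_next_put = False

    def inject_put_failure(self):
        # Lets a test simulate a transient dependency failure.
        self._fail_next_put = True

    def put(self, key, value):
        if self._fail_next_put:
            self._fail_next_put = False
            raise ConnectionError("injected failure: object store unavailable")
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]


class UploadService:
    """Toy stand-in for the service under test (illustration only)."""

    def __init__(self, store, max_attempts=2):
        self._store = store
        self._max_attempts = max_attempts

    def upload(self, key, data):
        for attempt in range(self._max_attempts):
            try:
                self._store.put(key, data)
                return
            except ConnectionError:
                if attempt == self._max_attempts - 1:
                    raise


def test_upload_survives_store_blip():
    # "Wind tunnel" style test: assemble the service against the fake
    # dependency, inject a failure, and verify end-to-end behavior locally.
    store = FakeObjectStore()
    service = UploadService(store=store)
    store.inject_put_failure()
    service.upload("report.csv", b"col1,col2")  # first attempt fails, retry succeeds
    assert store.get("report.csv") == b"col1,col2"


if __name__ == "__main__":
    test_upload_survives_store_blip()
    print("ok")
```

The interesting part is that the test runs entirely on a developer's machine, yet still exercises the seam between the service's retry logic and its dependency.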
Unfortunately, faking all the external dependencies isn't always easy for a service of even moderate complexity. And even if you do, you now own keeping the fakes up to date with the real dependencies as they evolve. For those reasons, in my experience most teams don't write such tests.
I think we are seeing early signs that agentic coding can change the calculus here. AI agents are great at spitting out large volumes of code, especially when the desired behavior is well known and there's little ambiguity. Ideas that were sound in principle but too expensive to implement and maintain have just had their costs drop by an order of magnitude. I really love riding such shifts in the industry, because they open the doors to new approaches that weren't practical in the past.
Our project (with the help of an AI agent) maintains fake implementations of external dependencies like authentication, storage, chain replication, and the inference engine, to be used in tests. We then wrote a test harness that uses those fakes to spin up our entire distributed system, including all the microservices, on developers' machines. Build-time tests then spin up our canaries against that fully assembled stack, verifying that the system as a whole works.
I'm really bullish on this approach catching a category of bugs that, in the past, could only be caught once the change was committed and made it to the test environment. A few years ago, ideas like these would have been met with resistance as nice-to-have but too expensive. This time around, it took just a few days to implement for a relatively complex system.
Driving Fast Requires a Tighter Feedback Loop
Agentic coding changes the deployment dynamic. In the amount of time it takes to build, package, and test one set of commits, another dozen might be waiting to go out. By the time a change set is ready to deploy to production, it may contain 100 or more commits. And if one of those commits contains a problem, the deployment needs to be rolled back, grinding the pipeline to a halt. In the meantime, even more changes accumulate, adding to the chaos and the risk.
I'm a Formula 1 fan, and this reminds me of how an accident on the track can cause a yellow flag to be raised. Normally, the cars zoom around the track at immense speeds and accelerations. But if an accident occurs, the race marshals raise a yellow flag, which requires all the cars to slow down behind the pace car. An exciting race turns into a leisurely drive around the track until the debris is cleaned up and the track is safe again. To minimize such slowdowns, race organizers go to great lengths to prepare for all types of accidents and make sure they can clean up the track and restart the race in minutes.
Just like whole-system local tests help tighten the feedback loop for catching certain bugs, we may need to think similarly about how we implement our CICD pipelines. When teams are moving at the speed of dozens of commits per hour, problematic changes will need to be identified, isolated, and reverted in minutes instead of hours or days. That means that a typical build and test infrastructure will need to become an order of magnitude faster than it is today. Just like online video games become unplayable when there is high lag between a player's inputs and the game's reaction, it's really hard to move 10x faster if every commit still requires a lengthy delay before you see the feedback.
The communication bottleneck
I enjoy observing well-run operations. If you've ever peeked behind the curtain of a busy restaurant, at first sight it may look like chaos. But if you take a second to notice the details, you'll see that everyone is constantly coordinating with each other. Chefs, cooks, wait staff, bussers, and managers pass information back and forth in a continuous stream. By staying in constant sync, a well-run restaurant manages to serve its patrons even during peak times, without sacrificing quality or latency.

I believe that achieving a similar increase in velocity for a software team puts new constraints on how teams communicate. When your throughput increases by an order of magnitude, you're not just writing more code - you're making more decisions. Should we use this caching strategy or that one? How should we handle this edge case? What's the right abstraction here? At normal velocity, a team might make one or two of these decisions per week. At 10x velocity, they are making multiple each day.
The challenge is that many of these decisions impact what others are working on. Engineer A decides to refactor the authentication flow, which affects the API that Engineer B is about to extend. These aren't just implementation details - they're architectural choices that ripple through the codebase.
I find that traditional coordination mechanisms introduce too much latency here. Waiting for a Slack response or scheduling a quick sync for later in the day means either creating a bottleneck - the decision blocks progress - or risking going down the wrong path before realizing the conflict. At high throughput, the cost of coordination can dominate!
One approach is to eliminate coordination - if everybody works on independent components, they are unlikely to need to coordinate. But I find that ideal impractical in most real-world systems. So another alternative is to significantly decrease the cost of coordination. Our team sits on the same floor, and I think that's been critical to our velocity. When someone needs to make a decision that might impact others, they can walk over and hash it out in minutes in front of a whiteboard. We align on the approach, discuss trade-offs in real time, and both engineers get back to work. The decision gets made quickly, correctly, and without creating a pile-up of blocked work.
I recognize this doesn't solve the problem for distributed teams—that remains an open challenge.
The Path Forward
I'm really excited about the potential of agentic development. I think it can not only improve the efficiency of software development, but also let us tackle problems that were previously too niche or expensive to solve. The gains are real - our team's 10x throughput increase isn't theoretical, it's measurable.
But here's the critical part: these gains won't materialize if we simply bolt AI agents onto our existing development practices. Like adding a turbocharger to a car with narrow tires and old brakes, the result won't be faster lap times - it will be crashes. At 10x code velocity, our current approaches to testing, deployment, and team coordination become the limiting factors. The bottleneck just moves.
This means we need to fundamentally rethink how we approach building software. CICD pipelines designed for 10 commits per day will buckle under 100. Testing strategies that were "good enough" at normal velocity will let too many bugs through at high velocity. Communication patterns that worked fine before will create constant pile-ups of blocked work.
The good news is that we already have great ideas for comprehensive testing, rapid deployment, and efficient coordination - ideas that have shown promise but haven't seen wide adoption because they were too expensive to implement and maintain. What's changed is that agentic development itself can dramatically lower those costs. The same AI agents that are increasing our code throughput can also help us build the infrastructure needed to sustain that throughput.
This is the real opportunity: not just writing more code faster, but using AI to make previously impractical engineering practices practical. The teams that succeed with agentic development will be the ones who recognize that the entire software development lifecycle needs to evolve in concert.
Monday, June 16, 2025
The Nuanced Reality of Throttling: It's Not Just About Preventing Abuse
If you work with multi-tenant systems, you are probably familiar with the concept of throttling or admission control. The idea is pretty simple and is rooted in the common human desire for fairness: when using a shared system or resource, no customer should be able to consume "too much" of that resource and negatively impact other customers. What constitutes "too much" can vary a lot and will usually depend on technical, product, business, and even social factors.
Yet when most engineering teams think about throttling, the first thing that comes to mind is often protecting from bad actors who may intentionally or accidentally try to knock the system down. It's a clean, morally satisfying mental model. Bad actors perform unreasonable actions, so we put up guardrails to protect everyone else. Justice served, system protected, everyone goes home happy.
But here's the thing - protecting against bad actors is just a small fraction of the throttling story. The reality of throttling is far more nuanced, and frankly, more interesting than the "prevent abuse" story we often tell ourselves.
Two Sides of the Same Coin
I want to start with a distinction that influences how I think about throttling. When we implement an admission control mechanism like throttling, we're typically optimizing for one of two parties: the customer or the system operator. And these two scenarios are fundamentally different beasts.
Quotas and Limits for the Customer's Benefit
This is the "helpful" throttling. Think about a scenario where a developer accidentally writes a runaway script that starts making thousands of paid API calls per second to your service. Without throttling, they might wake up to a bill that they do not love. In this case, throttling is essentially a safety net - it prevents their own code from causing financial harm.
Similarly, consumption limits can be a mechanism to steer customers towards more efficient patterns. For example, by preventing the customer from making thousands of "describe resource" API calls we could steer them towards a more efficient "generate report" API. This could become a win-win situation: the customer gets the data they need more easily, and the system operator gets to improve the efficiency of their system.
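A common way to implement these customer-protecting quotas is a per-customer token bucket. Here's a minimal sketch (the rates, names, and the 429 convention in the comment are invented for illustration): each customer gets a budget that refills at a steady rate, so a runaway script quickly drains its own budget without surprising anyone else.

```python
import time


class TokenBucket:
    """Per-customer quota: refill_rate requests/second, with some burst capacity."""

    def __init__(self, refill_rate, burst):
        self.refill_rate = refill_rate
        self.capacity = burst
        self.tokens = burst
        self.last_refill = time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would return a "slow down" (e.g. 429) to the client


# One bucket per customer: a runaway script drains its own bucket
# without affecting anyone else's budget or the system as a whole.
buckets = {}

def admit(customer_id):
    bucket = buckets.setdefault(customer_id, TokenBucket(refill_rate=10, burst=50))
    return bucket.try_acquire()
```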
Load Shedding for the System's Benefit
Now here's where things get nuanced. Sometimes you implement throttling not to help the customer, but to protect your system from legitimate traffic that just happens to be inconveniently timed. Maybe one of your customers is dealing with their own traffic surge - perhaps they just got featured on the front page of Reddit, or their marketing campaign went viral.
In this scenario, you're potentially hurting a customer who's doing absolutely nothing wrong. They're not trying to abuse your system; they're just experiencing success in their own business. But if you let their traffic through, it might overload the system and impact all your other customers. Now, technically we could argue that this type of throttling also helps the customer - nobody wins when the system is overloaded and suffers a congestion collapse. However, the point is that the customer isn't going to thank you for throttling them here!
I find it helpful to think of these as different concepts entirely. The first is quotas or limits - helping customers avoid surprises or use your system more efficiently. The second is load shedding - protecting your system from legitimate but inconvenient demand.
The Uncomfortable Truth About Load Shedding
This distinction matters because it forces us to confront an uncomfortable reality: sometimes we're actively hurting our customers to protect our system. The "preventing abuse" mental model breaks down completely here, and we need a more honest framework.
A healthier way to think about load shedding is that we want to protect our system in a way that causes the least amount of harm to our customers. It's not about good guys and bad guys anymore - it's about making difficult trade-offs when resources are constrained.
This reframing changes how we approach the problem. Instead of thinking "how do we stop bad actors," we start thinking "how do we gracefully degrade when we hit capacity limits while minimizing customer impact?"
The Scaling Dance
Here's where throttling gets really interesting. Load shedding doesn't have to be a permanent punishment. If you're dealing with legitimate traffic spikes, throttling can be a temporary protective measure while you scale up your system to handle the demand.
Think of a restaurant during an unexpectedly busy dinner rush. If they are short-staffed, a restaurant may choose to keep some tables empty and turn away customers to make sure the customers who do get in still have a pleasant experience. Then, once additional staff arrive, they may open additional tables and begin accepting walk-ins again.
In practice, this means your load-shedding system should be closely integrated with your auto-scaling infrastructure. When you start load shedding, that should trigger scaling decisions. The goal is to make load shedding temporary - a protective measure that buys you time to add capacity.
However, you also want to be careful to avoid problems like runaway scaling, where the system scales up to unreasonable sizes because load shedding never stops, or oscillations, where the system wastes resources by continuously scaling up and down for lack of hysteresis. In both of these scenarios, placing velocity controls on scaling decisions can be a reasonable mechanism.
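Here's a minimal sketch of that integration (the thresholds, cooldown, step size, and the scale_up callback are all invented for illustration): sustained load shedding triggers a scale-up, while a cooldown and a hard ceiling act as velocity controls against oscillation and runaway growth.

```python
import time


class ShedDrivenScaler:
    """Scale up when load shedding persists, with velocity controls against
    runaway growth and oscillation."""

    def __init__(self, scale_up, max_fleet_size, cooldown_seconds=300, step=2):
        self.scale_up = scale_up              # callback into the real autoscaler
        self.max_fleet_size = max_fleet_size  # hard ceiling: no runaway scaling
        self.cooldown = cooldown_seconds      # minimum time between scale-ups
        self.step = step                      # max instances added per decision
        self.last_scale_at = 0.0

    def on_metrics(self, shed_rate, fleet_size):
        # shed_rate: fraction of requests rejected by load shedding recently.
        if shed_rate < 0.01:
            return  # not shedding meaningfully; nothing to do
        now = time.monotonic()
        if now - self.last_scale_at < self.cooldown:
            return  # velocity control: let the previous scale-up take effect first
        target = min(fleet_size + self.step, self.max_fleet_size)
        if target > fleet_size:
            self.scale_up(target)
            self.last_scale_at = now
```

Scaling back down is deliberately left out of the sketch; in practice it would follow its own, slower velocity controls.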
Beyond Static Limits
Many load shedding systems I've encountered use static limits. "Customer A gets 100 requests per minute, Customer B gets 100 requests per minute, everyone gets 100 requests per minute." It's simple, it's fair, and it's probably insufficient.
It's a simple system to implement and to explain, but static limits assume that every customer has the same needs from your system, regardless of their scale. In reality, your customers span a wide spectrum of use cases. Some are weekend hobbyists making a few API calls. Others are large companies whose entire business depends on your service.
Static limits also assume that the customers of your system act in an uncorrelated fashion. If multiple customers hit their limits at the same time, the system could still get overloaded. There are lots of real-world reasons such correlated behavior could occur. Perhaps all these customers are different teams within the same company, and the whole company is seeing a big increase in workload. Or perhaps they are using the same client software that contains the same bug. Or, my personal favorite, perhaps they've all configured their systems to perform some intensive action on a cron at midnight, because humans love round numbers!
An interesting alternative is capacity-based throttling. Instead of hard limits, you admit new requests as long as your system has capacity. Think of it like a highway onramp - when the traffic on the highway is flowing freely the onramp lets new cars in without any constraints. But as soon as congestion builds up, the traffic lights on the onramp activate and begin metering new cars.
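A minimal sketch of the onramp idea (using the in-flight request count as the capacity signal, purely for illustration): requests are admitted freely while there's headroom, and metering only kicks in as the system approaches its limit.

```python
import threading


class CapacityBasedAdmission:
    """Admit requests while in-flight work stays below the capacity estimate,
    like an onramp that only meters cars once the highway is congested."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight  # capacity estimate for this fleet
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_admit(self):
        with self.lock:
            if self.in_flight < self.max_in_flight:
                self.in_flight += 1
                return True
            return False  # congestion: start metering (shed or queue the request)

    def release(self):
        # Called when a request finishes, freeing capacity for the next one.
        with self.lock:
            self.in_flight -= 1
```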
The Top Talker Dilemma
But what happens when you hit capacity limits? The naive approach is to shed load indiscriminately, but that's almost as bad as experiencing an overload. Almost, because you are at least avoiding congestion collapse, so many requests will still go through. However, indiscriminate load shedding will make most of your customers see some failures - from their point of view, the system is experiencing an outage.
A different option might be to shed load from your top talkers first. They're using the most resources, so cutting them off gives you the biggest bang for your buck in terms of freeing up capacity. The problem is that your top talkers are often your biggest customers. Cutting them off first is like a retailer turning away their top spending customers. Not exactly a winning business strategy.
One approach I think can work well is to shed load from "new" top talkers - customers whose traffic has recently spiked above their normal patterns. This gives you the capacity relief you need while protecting established usage patterns. The assumption is that sudden spikes are more likely to be temporary or problematic, while established high usage represents legitimate business needs.
One way you could implement this behavior is by starting with low static throttling limits, but then automatically increasing those limits whenever a customer reaches them, as long as the system has capacity. In the happy state, no customer experiences load shedding and the throttling limits keep increasing to meet new demand. However, if the system is at capacity, new increases are temporarily halted and customers who need an increase may get throttled, until the system is scaled up and new headroom is created.
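Here's a minimal sketch of that scheme (the starting limit, growth factor, and the has_headroom signal are invented for illustration): every customer starts small, and hitting the limit raises it automatically as long as the system has spare capacity.

```python
class AdaptiveLimits:
    """Per-customer limits that grow with demand while the system has headroom."""

    def __init__(self, has_headroom, initial_limit=100, growth_factor=2.0):
        self.has_headroom = has_headroom   # callable: is the fleet below capacity?
        self.initial_limit = initial_limit # requests per minute to start with
        self.growth_factor = growth_factor
        self.limits = {}                   # customer_id -> current limit

    def check(self, customer_id, current_rate):
        limit = self.limits.setdefault(customer_id, self.initial_limit)
        if current_rate < limit:
            return True  # well under the limit: admit
        if self.has_headroom():
            # Happy state: the customer hit their limit, but the system has
            # capacity, so raise the limit instead of throttling them.
            self.limits[customer_id] = limit * self.growth_factor
            return True
        # At capacity: halt increases, so "new" top talkers get throttled
        # until scaling creates fresh headroom.
        return False
```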
A Different Mental Model
I think the key insight here is that throttling is not primarily about preventing abuse - it's about resource allocation under constraints. Sometimes those constraints are financial (protecting customers from runaway bills), sometimes they're technical (preventing system overload), and sometimes they're business-related (product tier differentiation).
When we frame throttling as resource allocation rather than abuse prevention, we start asking better questions:
- How do we allocate limited resources fairly?
- How do we balance individual customer needs against system stability?
- How do we minimize harm when we have to make difficult trade-offs?
- How do we use throttling as a signal to guide scaling decisions?
These are more nuanced questions than "how do we stop bad actors," and they lead to more sophisticated solutions.
The Path Forward
None of this is to say that traditional abuse prevention doesn't matter. There are definitely bad actors out there trying to overwhelm systems, and throttling is one tool in your arsenal to deal with them. But I think we do ourselves a disservice when we reduce all throttling to abuse prevention.
The reality is that throttling is a complex, multi-faceted tool that touches on resource allocation, system reliability, product design, and business strategy. The sooner we embrace that complexity, the better solutions we'll build.
In my experience, the most effective throttling systems are those that:
- Clearly distinguish between customer protection and system protection use cases
- Integrate closely with auto-scaling infrastructure
- Use capacity-based limits rather than static ones where possible
- Prioritize established usage patterns over new spikes
- Treat throttling as a resource allocation problem, not just an abuse prevention one
The next time you're designing a throttling system, I'd encourage you to think beyond the "prevent abuse" narrative. Ask yourself: who is this throttling protecting, and what are the trade-offs involved? The answers might surprise you, and they'll almost certainly lead to a better system.
Saturday, March 1, 2025
The Trouble with Leader Elections (in distributed systems)
Blast radius
Liveness vs split-leader tension
Liveness vs faux-leader tension
So what should we do?
Localized leaders
In the United States, a common debate is how much power should be wielded by individual states and how much should be in the hands of the central federal government. It's a nuanced trade-off with many strong opinions on both sides. Luckily, in distributed systems it's almost always better to have smaller blast radii. Instead of having a single leader that operates on the entire distributed system, we could have smaller sub-leaders that each operate on a portion of it. This helps reduce the blast radius of failures, and it also reduces the amount of work each leader needs to perform, making it easier to maintain liveness in the system.
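As a minimal sketch of the idea (the partition count and hashing scheme are invented for illustration): each sub-leader is responsible only for the resources that hash into its partition, so a stalled or buggy leader affects a fraction of the system rather than all of it.

```python
import hashlib

NUM_PARTITIONS = 16  # illustration only; pick based on the desired blast radius


def partition_for(key: str) -> int:
    """Map a resource (customer, shard, table, ...) to a partition."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


class PartitionLeader:
    """A sub-leader that performs housekeeping only for its own partition."""

    def __init__(self, partition_id: int):
        self.partition_id = partition_id

    def owned_resources(self, all_resources):
        # A failure of this leader impacts roughly 1/NUM_PARTITIONS of the
        # resources, not the entire system.
        return [r for r in all_resources if partition_for(r) == self.partition_id]
```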
Idempotent co-leaders
Different architectures
- Using a queue (like SQS) to enqueue housekeeping items as they arise and then processing those using a small fleet of subscribers (a minimal sketch of this option follows the list).
- Using capabilities of the platform to perform housekeeping tasks (e.g. using AutoScaling Groups to replace unhealthy hosts or S3 lifecycles to delete expired objects).
- Using event driven approaches (e.g. using a Lambda to trigger an action when S3 object changes, instead of centrally recomputing all files in the bucket).
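For the queue-based option above, here's that minimal sketch (the queue URL is a placeholder and error handling is elided): a small fleet of identical workers drains housekeeping items from SQS, so no single elected leader has to do all the work.

```python
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/housekeeping"  # placeholder


def handle(item_body):
    """Process one housekeeping item (placeholder for the real work)."""
    print("housekeeping:", item_body)


def worker_loop():
    sqs = boto3.client("sqs")
    while True:
        # Long-poll for work; any worker in the fleet can pick up any item.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,
        )
        for msg in resp.get("Messages", []):
            handle(msg["Body"])
            # Delete only after successful processing, so failed items are retried.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```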
Wednesday, January 24, 2024
The mathematics of redundancy
Where P is the failure probability between 0 and 1, and n is the number of redundant components, or more specifically the number of components a system could lose before a failure would occur. What's important is that the relationship is exponential, and we love exponents when they act in our favor. This means that small increases in redundancy will bring large reductions in failure probability. Or put another way, small increases in cost will bring disproportionally large increases in reliability. Looking at this mathematical model, it's easy to arrive at the conclusion that planes should have as many engines as possible. Especially if you are 14 years old. Unfortunately, the reality is far more nuanced.
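To make the exponent concrete, assuming component failures are independent and using the definition of n above (so the system fails only when n + 1 components are down at once):

$$P_{\text{system}} \approx P^{\,n+1}$$

With P = 0.01, one redundant component (n = 1) gives roughly a 1 in 10,000 chance of system failure, and a second one (n = 2) takes it to roughly 1 in 1,000,000 - each extra unit of redundancy buys another factor of 100.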
For systems that have many correlated failure modes, increased redundancy (and cost) no longer increases reliability!
Conclusion
Notable mention
Saturday, February 4, 2023
Batching: Efficiency under load
Wednesday, December 21, 2022
Performance and efficiency
The topic of software performance and efficiency has been making the rounds this month, especially around engineers not being able to influence their leadership to invest in performance. For many engineers, performance work tends to be some of the most fun and satisfying engineering projects. If you are like me, you love seeing some metric or graph show a step-function improvement - my last code commit this year was one such effort, and it felt great seeing the results.