Saturday, March 1, 2025

The Trouble with Leader Elections (in distributed systems)

When working on distributed systems, it's not uncommon to need a piece of logic that executes across the entire system. Some examples include a task that deletes tombstoned records, a task that periodically updates system-wide configuration, or a piece of code that continuously scans the entire fleet and replaces unhealthy hosts. In my experience, most distributed systems have at least one such component.

The simplest way to implement such logic is to execute it on a single host - non-distributed algorithms tend to be simpler and more efficient to implement. However, because we still want redundancy, a leader election algorithm is commonly used to elect one of a small number of hosts to act as the leader and execute the task. In a nutshell, most widely used leader election algorithms have a concept of leases (or timed locks) at their core. Each host continuously attempts to acquire a lease from a data store with appropriate atomicity and consistency guarantees. This could be a data store like DynamoDB or a Paxos-based locking client. If the lease is successfully acquired, the host can begin executing the task, continuously renewing the lease as it goes. If the host dies, the lease will eventually expire and another host will win the race and take over as the leader. This Amazon Builders Library article goes into a lot of helpful detail on implementing leader election: Leader election in distributed systems.
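
To make this concrete, here is a minimal sketch of what lease-based election can look like on top of DynamoDB conditional writes (via boto3). The table name, attribute names, lease duration and the housekeeping task are illustrative assumptions, not a prescribed schema:

```python
import time
import uuid

import boto3
from botocore.exceptions import ClientError

LEASE_DURATION_SECONDS = 30
HOST_ID = str(uuid.uuid4())  # identifies this host to the lease table

# Hypothetical lease table: partition key "lease_id", plus "lease_owner"
# and "expires_at" attributes.
table = boto3.resource("dynamodb").Table("leases")


def try_acquire_lease(lease_id: str) -> bool:
    """Atomically claim the lease if it is free or has already expired."""
    now = int(time.time())
    try:
        table.put_item(
            Item={
                "lease_id": lease_id,
                "lease_owner": HOST_ID,
                "expires_at": now + LEASE_DURATION_SECONDS,
            },
            # The conditional write is what makes this safe: when several
            # hosts race for an expired lease, only one put succeeds.
            ConditionExpression="attribute_not_exists(lease_id) OR expires_at < :now",
            ExpressionAttributeValues={":now": now},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else holds a live lease
        raise


def run_housekeeping_task() -> None:
    """Placeholder for the system-wide task the leader is supposed to run."""


if __name__ == "__main__":
    while True:
        if try_acquire_lease("housekeeping"):
            # We won the election: do the work, renewing the lease as we
            # go (renewal is sketched later in the post).
            run_housekeeping_task()
        time.sleep(LEASE_DURATION_SECONDS // 3)
```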


On the surface, leader election is a simple approach that fills a need for many distributed systems. So why the dislike? In my experience, leader-election-based systems have a few downsides:

Blast radius

Just like in the real world, leader election elevates a single participant into a position of immense power. If the leader is good, the system chugs along. If the leader makes bad decisions, it can very quickly impact the entire system. I find this concern most visible during deployments. In most systems, deployments tend to be the leading cause of failures. No matter how well you test the system, the real moment of truth hits when the new code and configuration meet the diversity and complexity of the real production environment. Most distributed systems deal with that by rolling new changes out slowly across the fleet, perhaps starting with a single host and waiting for signs of success before incrementally rolling out the changes to the rest of the fleet. Deployments to a leader-election-based system, on the other hand, tend to go from 0% to 100% as soon as the changes hit the currently elected leader. Depending on the task performed by the leader, any bug that sneaks through testing can be catastrophic for the health of the overall system.



Liveness vs split-leader tension

One of the most important decisions in a leader-elected system is the length of the lease. If the lease expires while the task is still executing, we run the risk of another host grabbing the lease and starting its own execution. All of a sudden, a task that was designed to run one instance at a time can begin running concurrently on multiple hosts. Unless the system has been designed and is continuously tested to support that, this can lead to subtle concurrency bugs that impact the entire system.

On the other hand, if the lease is long enough to cover even the most pessimistic task execution time, then after an actual leader failure the system will need to wait that long before a new leader acquires the lease. This can lead to stalling and liveness problems.

We can attempt to work around this tension by using short-lived leases and having the leader continuously extend the lease while the task is running. That can quickly get tricky. For example, a blocking IO call or a stop-the-world garbage collection pause can push us past the lease expiration time without an opportunity to extend it. Or an attempt to extend can fail, and we will need to decide whether to abort the task and deal with partial progress, or risk multiple tasks executing concurrently.
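
As a rough illustration of how fiddly that gets, here is a sketch of a renewal loop that reuses the hypothetical lease table, HOST_ID and LEASE_DURATION_SECONDS from the earlier sketch: the task is broken into small steps, the lease is renewed between steps, and the work is aborted when we get too close to expiry. The safety margin and the decision to abort on renewal failure are assumptions; note that a long pause between the check and the step can still push us past expiry, so the check narrows the race rather than closing it:

```python
import time

from botocore.exceptions import ClientError

SAFETY_MARGIN_SECONDS = 10  # illustrative: stop doing work this close to expiry


class LeaseLost(Exception):
    """Raised when we can no longer be sure we are the only leader."""


def renew_lease(lease_id: str) -> int:
    """Extend the lease if this host is still the recorded owner."""
    new_expiry = int(time.time()) + LEASE_DURATION_SECONDS
    table.update_item(
        Key={"lease_id": lease_id},
        UpdateExpression="SET expires_at = :exp",
        ConditionExpression="lease_owner = :me",
        ExpressionAttributeValues={":exp": new_expiry, ":me": HOST_ID},
    )
    return new_expiry


def run_with_lease(lease_id: str, steps) -> None:
    """Run a task broken into short steps, renewing the lease in between."""
    expires_at = renew_lease(lease_id)
    for step in steps:
        if time.time() > expires_at - SAFETY_MARGIN_SECONDS:
            # Abort with partial progress rather than risk a second
            # leader running the same task concurrently.
            raise LeaseLost("too close to lease expiry")
        step()
        try:
            expires_at = renew_lease(lease_id)
        except ClientError:
            # Renewal failed or we lost ownership: the same dilemma as in
            # the text; here we abort and accept the partial progress.
            raise LeaseLost("failed to renew the lease")
```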

Liveness vs faux-leader tension

I find that leader-election-based systems are also vulnerable to ambiguous gray failure scenarios. The leader may be happily acquiring and renewing the lease, but it may have trouble executing the task. Perhaps a downstream dependency is timing out, or the task is terminating abnormally. When that happens, the leader has to make a really tough decision. It could continue renewing the lease, in the hope that the task will eventually succeed. But that runs the risk of getting stuck in faux-leader mode, where a problem with the host (e.g. a broken volume) prevents it from making progress while it happily renews the lease.

Alternatively, the leader could decide to give up and release the lease, in the hope that a different host will have better luck with the task. But that runs a different risk. If the task is failing due to outside factors (perhaps a dependency is down), no leader will be willing to hold the lease. That can cause the system to enter a state of leader churn, which can lead to various types of cascading failures.
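
One possible middle ground is to tolerate a few consecutive failures before voluntarily releasing the lease and backing off with jitter, so a genuinely broken host eventually steps aside without every host churning in lockstep. Here's a rough sketch of such a policy; the thresholds are illustrative assumptions, and the acquire/renew/release/run_task_once callables are hypothetical hooks into whatever lease client and task the system actually uses:

```python
import random
import time

MAX_CONSECUTIVE_FAILURES = 3  # illustrative threshold
BACKOFF_SECONDS = 60          # illustrative cool-off after giving up
TASK_INTERVAL_SECONDS = 30    # illustrative pause between task runs


def lead(acquire, renew, release, run_task_once):
    while True:
        if not acquire():
            time.sleep(TASK_INTERVAL_SECONDS)
            continue
        failures = 0
        while failures < MAX_CONSECUTIVE_FAILURES:
            renew()
            try:
                run_task_once()
                failures = 0
            except Exception:
                failures += 1
            time.sleep(TASK_INTERVAL_SECONDS)
        # Several failures in a row: assume this host might be the
        # problem, release the lease and stand back for a while. The
        # jitter makes it less likely that every host churns in lockstep
        # when the real cause is a shared dependency being down.
        release()
        time.sleep(BACKOFF_SECONDS + random.uniform(0, BACKOFF_SECONDS))
```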

So what should we do?

I think it's always worth starting by deciding if these problems matter to your system. There are many systems where these problems are not deal breakers, and leader election can be the simplest solution. For example, a component that sends out an operational report once a day can likely ignore these tensions. In the worst case, operators may occasionally receive duplicate e-mails or a report may get delayed by a few hours. 

However, for many real-world distributed systems these tensions matter and need to be considered. When dealing with those kinds of systems, I prefer to steer away from approaches that require trading off between multiple desirable properties. I like systems that have both liveness and predictability. I also like systems where no single fallible component has the power to impact the entire system. So what are some improvements or alternatives?

Localized leaders

In the United States, a common debate is how much power should be wielded by individual states and how much should be in the hands of the central federal government. It's a nuanced trade-off with many strong opinions on both sides. Luckily, in distributed systems it's almost always better to have smaller blast radii. Instead of having a single leader that operates on the entire distributed system, we could have smaller sub-leaders that each operate on a portion of our distributed system. This can help reduce the blast radius of failures, as well as reduce the amount of work each leader needs to perform, making it easier to maintain liveness in the system.
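
As a sketch of what that might look like, reusing the hypothetical try_acquire_lease helper from the earlier sketch: the work is split into a fixed number of partitions, each partition has its own lease, and a host only does the work for the partitions it actually won. The partition count and naming scheme are assumptions for illustration:

```python
NUM_PARTITIONS = 16  # illustrative; how the fleet or key space is split is system-specific


def lead_local_partitions(run_task_for_partition) -> list[int]:
    """Try to lead each partition independently; run the task only for
    the partitions whose lease this host won."""
    owned = []
    for partition in range(NUM_PARTITIONS):
        if try_acquire_lease(f"housekeeping-{partition}"):
            owned.append(partition)
    for partition in owned:
        # A bug or a bad deployment on this host now affects at most
        # the partitions it leads, not the whole system.
        run_task_for_partition(partition)
    return owned
```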


Idempotent co-leaders

In the real world, deputizing multiple leaders can lead to chaos and indecision. In distributed systems, if we can make the task that is being executed idempotent and concurrency-safe, we can skip leader election altogether. Each participating host can execute the task at the same time. If some hosts fail, the system will still make progress as long as there's at least one healthy host. This approach introduces extra load into the system, but on the other hand that load tends to be consistent and predictable. Such systems don't suffer from surprising mode shifts during failures; if anything, the load tends to decrease when things fail. I find that to be a desirable property for most systems!
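
For example, the tombstone-deletion task from the beginning of the post can be made safe to run everywhere at once by expressing the delete as a conditional write; a sketch, with the table and attribute names again being illustrative assumptions:

```python
import time

import boto3
from botocore.exceptions import ClientError

records = boto3.resource("dynamodb").Table("records")  # hypothetical table


def delete_if_still_tombstoned(record_id: str) -> None:
    """Idempotent and concurrency-safe: any number of hosts can attempt
    this delete, and losing the race is a harmless no-op."""
    try:
        records.delete_item(
            Key={"record_id": record_id},
            ConditionExpression="tombstoned = :true",
            ExpressionAttributeValues={":true": True},
        )
    except ClientError as e:
        # A failed condition just means another host got there first, or
        # the record was resurrected; anything else is a real error.
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise


def cleanup_loop(find_tombstoned_ids, interval_seconds: int = 300) -> None:
    """Every host runs this loop; find_tombstoned_ids is a hypothetical
    scan or query over the table."""
    while True:
        for record_id in find_tombstoned_ids():
            delete_if_still_tombstoned(record_id)
        time.sleep(interval_seconds)
```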

Different architectures

Sometimes, it's possible to make small tweaks to our distributed system to avoid needing centralized logic altogether. Some alternatives that I've run into:
  • Using a queue (like SQS) to enqueue housekeeping items as they arise and then processing those using a small fleet of subscribers (sketched below, after this list).
  • Using capabilities of the platform to perform housekeeping tasks (e.g. using AutoScaling Groups to replace unhealthy hosts or S3 object expirations to delete expired objects).
  • Using event driven approaches (e.g. using a Lambda to trigger an action when S3 object changes, instead of centrally recomputing all files in the bucket).
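
As a sketch of the first bullet, housekeeping items are pushed onto an SQS queue as they arise and drained by a small fleet of workers, with no leader involved. SQS's visibility timeout keeps an in-flight item with one worker at a time, and its at-least-once delivery means handlers should still be idempotent. The queue URL and the handler are assumptions for illustration:

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/housekeeping"  # placeholder


def enqueue(item: dict) -> None:
    """Called by whichever component notices that housekeeping is needed."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(item))


def worker_loop(handle_housekeeping_item) -> None:
    """Run on each worker host; handle_housekeeping_item is a hypothetical handler."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            handle_housekeeping_item(json.loads(msg["Body"]))
            # Delete only after successful handling; otherwise the item
            # becomes visible again and another worker retries it.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```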