Inside the race condition that broke AWS’s triple-redundant system
The bug created a race condition where stale records triggered cleanup while updates were still happening.
The massive AWS outage earlier this month was caused by a race condition in a triple-redundant system designed to ensure it never fails. It failed anyway.
Last week, AWS published a detailed explanation of what triggered the massive outage in its us-east-1 cloud region. I finally sat down to review it, and here's what happened.
The big crash
The problem revolved around DynamoDB, a proprietary, fully managed cloud database service designed to be simple to use, fast, and highly scalable. It was also built for resilience.
Like most modern web services, DynamoDB sits behind a very large fleet of load balancers and spreads traffic across them through DNS: clients resolve the service's endpoint and connect to whichever IP addresses come back. Keeping that working means maintaining "hundreds of thousands" of DNS records.
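You can see the DNS side of this from the outside. The short Python snippet below (it needs network access) resolves DynamoDB's public us-east-1 endpoint and prints the addresses returned; the exact answers vary from query to query, which is how the load gets spread around.

```python
import socket

# Resolve DynamoDB's regional endpoint and print the IP addresses returned.
# Large services answer with several addresses, and different clients get
# different answers over time, spreading traffic across the fleet.
addrs = socket.getaddrinfo("dynamodb.us-east-1.amazonaws.com", 443,
                           proto=socket.IPPROTO_TCP)
for family, _type, _proto, _canon, sockaddr in addrs:
    print(sockaddr[0])
```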
Maintaining those DNS records is a top priority, which is why three distinct "DNS Enactor" services, one in each of three Availability Zones, were responsible for keeping DynamoDB's IP records correctly updated.
Everything began when one DNS Enactor started lagging that fateful day. That alone shouldn't have been a problem, since the two other DNS Enactors were there to pick up the slack.
The problem: a software bug meant that by the time another Enactor looked at the lagging Enactor's records, it judged them stale and triggered an automatic cleanup, even as the lagging Enactor was still applying its update. The cleanup and the late update raced each other, crucial records were erroneously deleted, and the system was left in an inconsistent state.
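To make the failure mode concrete, here's a minimal Python sketch of that pattern: a delayed writer applying an old "plan" while another actor's cleanup deletes whatever it considers stale. Everything here (the names, the plan/record structure, the timing) is my own illustration rather than AWS's actual implementation, and the delays are staged so the bad interleaving always happens.

```python
import threading
import time

# Hypothetical model of the failure: two "enactors" apply numbered DNS plans
# to a shared record table, and a cleanup step deletes whatever it considers
# stale. Names and structure are illustrative only.

records = {}             # endpoint -> (plan_id, ip_list)
lock = threading.Lock()  # protects the dict, but not the overall protocol

def apply_plan(enactor, plan_id, ips, delay=0.0):
    """Apply a DNS plan; `delay` simulates an enactor that has fallen behind."""
    time.sleep(delay)
    with lock:
        records["dynamodb.example-region"] = (plan_id, ips)
        print(f"{enactor}: applied plan {plan_id}")

def cleanup(enactor, newest_plan_id):
    """Delete the record if it belongs to a plan older than the newest one."""
    with lock:
        current = records.get("dynamodb.example-region")
        if current and current[0] < newest_plan_id:
            del records["dynamodb.example-region"]
            print(f"{enactor}: cleaned up stale plan {current[0]}")

# Enactor A lags and applies old plan 1 only after enactor B has already
# applied newer plan 2. B's cleanup then judges the record stale and deletes
# it, leaving no record at all: the inconsistent state the outage hinged on.
slow = threading.Thread(target=apply_plan, args=("enactor-A", 1, ["10.0.0.1"], 0.2))
fast = threading.Thread(target=apply_plan, args=("enactor-B", 2, ["10.0.0.2"]))
slow.start()
fast.start()
fast.join()
slow.join()
cleanup("enactor-B", newest_plan_id=2)
print("final records:", records)  # prints: final records: {}
```

The sketch forces the interleaving; in production the same outcome depended on unlucky timing between independent processes, which is what makes race conditions like this so hard to catch in testing.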
With DynamoDB's DNS-based load balancing broken, services that depend on the database could no longer reach it and promptly failed.
But why was Asia affected?
So why were many users in Asia affected by a failure in a US region? The answer: dependencies.
As I wrote earlier, even though modern web services are built with multiple safeguards, chances are high that they rely on capabilities available in only one location.
One example is authentication, which uses very few resources and is easier to maintain at a single cloud location. Another could be API calls to third-party providers. A food delivery app, for instance, could rely on API calls to mapping, traffic, and payment providers.
Some of those API providers might in turn rely on another service that was impacted. What you end up with is a vast, invisible web of dependencies. What made things worse is that DynamoDB is a basic cloud building block that is very widely used.
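As a toy illustration of that web (the service names and the dependency graph below are entirely made up), a few lines of Python show how a single failing building block takes down everything that transitively depends on it:

```python
# Hypothetical dependency graph: which services each service needs to work.
deps = {
    "food-delivery-app": ["auth", "maps-api", "payments-api"],
    "maps-api": ["dynamodb"],
    "payments-api": ["dynamodb"],
    "auth": [],
    "dynamodb": [],
}
down = {"dynamodb"}  # the one failing building block

def is_up(service):
    """A service is up only if it, and everything it depends on, is up."""
    if service in down:
        return False
    return all(is_up(dep) for dep in deps[service])

for svc in deps:
    print(f"{svc}: {'UP' if is_up(svc) else 'DOWN'}")
```

Here the food delivery app never touches DynamoDB directly, yet it still goes down because two of its providers do.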
Moving ahead
For now, AWS says it has disabled DNS Enactor automation globally. It will fix the race condition before re-enabling it. And it is conducting a review across all AWS services.
What are your thoughts about this?