US AWS outage knocks thousands of websites offline

Including for users in Asia


A major outage at cloud giant AWS in the us-east-1 region knocked hundreds of websites and apps offline for hours, including Perplexity, Canva, and Slack.

Can data centres fail? And why were so many services down even for users in Asia, thousands of miles away from Virginia? Understanding what happened requires looking beyond the data centre itself.

Engineered for resilience

We don't actually know what happened yet. What we do know is that AWS is now on top of matters, and the situation is slowly normalising.

Could it have started from a data centre outage? Maybe, but it's worth noting that hyperscalers, including AWS, design their data centres for exceedingly high levels of uptime.

In 2016, I attended a presentation that outlined how obsessively AWS builds resilience into its data centres - modifying equipment firmware to prioritise uptime over the risk of equipment damage.

For what it's worth, the commercial switchgear in question costs over three quarters of a million dollars per unit. Modifying its firmware is not something a typical data centre operator would do. I wrote about it on DCD here.

Interdependencies

But why were so many sites down? Surely the likes of Duolingo, Zoom, or Coinbase don't run all their systems in the us-east-1 region for users in Asia?

In one word: dependencies. While modern services are doubtless load balanced globally, the chances are high that they rely on core capabilities only available at one location.

One example is authentication, which uses very few resources and is easier to maintain at one cloud location. Another could be API calls to third-party providers. A food delivery app, for instance, could rely on API calls to mapping, traffic, and payment providers. If these go down, the service stalls too.

And because some of these API providers might in turn be reliant on another service that was impacted, what you have is a vast, invisible web of interdependencies.
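To make that web of interdependencies concrete, here is a minimal Python sketch. Everything in it is hypothetical: the service names, the dependency map, and the `impacted_by` helper are made up for illustration and do not describe how any real service is wired. It simply walks the map to find everything that depends, directly or indirectly, on a failed component.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it calls directly.
DEPENDS_ON = {
    "food_delivery_app": ["auth", "maps_api", "payments_api"],
    "maps_api": ["geocoder"],
    "payments_api": ["auth"],
    "auth": ["us-east-1"],
    "geocoder": ["us-east-1"],
    "static_site": ["cdn"],
    "cdn": [],
    "us-east-1": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: for each dependency, record who relies on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outwards from the failed component.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


if __name__ == "__main__":
    print(sorted(impacted_by("us-east-1", DEPENDS_ON)))
```

In this toy map, the food delivery app never calls us-east-1 itself, yet it still ends up in the impacted set through its authentication, mapping, and payment dependencies. The static site, which only relies on a CDN, is untouched.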

Recovery headaches

Finally, recovery is more than simply flipping a toggle. First, an analysis must be done to identify the problem so that it can be fixed at the source. Fixing the problem rarely brings immediate relief, however.

Often, the avalanche of queued requests can overwhelm services that would ordinarily have coped. This means everything must be carefully managed and monitored until full recovery, which is what is happening now.
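One common way clients avoid adding to that avalanche is to retry with exponential backoff and random jitter, so that retries spread out over time instead of arriving all at once. The Python sketch below illustrates the general technique only. The function names, the 0.5-second base, and the 30-second cap are my own assumptions, and this is not a description of what AWS or any affected service actually does.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter delay in seconds for a 0-indexed retry attempt."""
    # The ceiling doubles with each attempt but never exceeds the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Pick a random point below the ceiling so clients do not retry in lockstep.
    return rng.uniform(0, ceiling)


def call_with_retries(operation, max_attempts=5, sleep=time.sleep, rng=random):
    """Call `operation`, retrying on failure with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, rng=rng))
```

A caller would wrap a flaky request, for example `call_with_retries(lambda: fetch_order_status())`, where `fetch_order_status` is a hypothetical function that raises on failure. The randomness matters as much as the doubling: without it, every client that failed at the same moment would also retry at the same moment.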

Were you affected by the AWS outage?