Breaking: AWS DNS Error Triggers Major Global Outage


Here’s one way to start a Monday morning: Early on Oct. 20, dozens of major websites and apps went dark during a multihour outage at AWS.

The outage originated in AWS’s us-east-1 region, which is based in Northern Virginia, an area often referred to as “Data Center Alley.” Though the disruption began there, it affected thousands of users worldwide because the region hosts some of AWS’s largest workloads.

Map of AWS data centers in Virginia: the outage stemmed from the us-east-1 region in Northern Virginia. Source: Dgtl Infra

Around 3:11 a.m. ET, AWS began posting update logs noting that its Northern Virginia region was showing “increased error rates.”

The logs then traced the issue to DNS resolution errors affecting DynamoDB, one of AWS’s core database services.

DNS is one of those dependencies that can take everything else down when it breaks. The servers themselves were fine, but a glitch between the internal DNS resolvers and DynamoDB meant clients could not locate the service, so their connections failed.
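To make the "servers fine, names unresolvable" distinction concrete, here is a minimal Python sketch of what a client sees when a lookup fails. It is an illustration only, not a description of AWS's internal resolvers: the helper name `resolve_addresses` and its injectable `resolver` parameter are assumptions made for this example, while `socket.getaddrinfo` and `socket.gaierror` are standard-library names.

```python
import socket


def resolve_addresses(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the sorted unique IP addresses for hostname, or [] if DNS fails.

    `resolver` is a hypothetical injection point so the lookup can be
    swapped out in tests; by default it is the standard library resolver.
    """
    try:
        infos = resolver(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # The name could not be resolved. The server behind it may be
        # perfectly healthy, but the client has no address to connect to.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve_addresses("dynamodb.us-east-1.amazonaws.com")` during normal operation should return one or more addresses. An empty list means the client never got as far as contacting a server, which is roughly the situation the update logs describe.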

The issue spread quickly to platforms including Snapchat, Fortnite, Duolingo, Canva, Wordle, Lloyds, Slack, Monday.com, Bank of Scotland, HMRC, Zoom, Barclays, Vodafone, and Docker Hub.

The New York Times also reported that “kiosks appeared to not work and apps were down” at LaGuardia Airport, specifically for Delta and United Airlines.

By 6:35 a.m. ET, most of the issue had been mitigated. But some API and connectivity issues are still being resolved at the time of publishing, so users may still see plenty of apps and platforms down for the time being. As of noon ET, more than 7,000 users were still reporting downed sites, according to DownDetector.

AWS plans to publish a Post-Event Summary with a full breakdown of what happened.

The Bigger Issue

Though the outage occurred in the early morning hours, plenty of developers who use AWS were unhappy, taking to Reddit to share their thoughts.

On Reddit’s r/devops, engineers began reporting issues around 4 to 5 a.m. ET: “Just got woken up to multiple pages. No services are loading in east-1, can’t see any of my resources.”

Another summed it up in the industry’s favorite understatement: “It’s always DNS.”

That much is true — this isn’t AWS’s first DNS-mishap rodeo: Several outages since 2019 have also come from internal DNS errors that shut down multiple services.

George Foley, Technical Advisor at ESET Ireland, a subsidiary of global software company ESET, warned of the risks of dependency.

“Even if your own website or app isn’t hosted on AWS, there’s a good chance something you use from your CRM to your payment processor is,” said Foley. “Outages like this highlight the importance of having resilience plans in place, including backups and alternative routes for essential data and services.”

As another commenter on an r/aws thread put it: “This is a crazy outage lol everything is down it feels like. Makes you wonder how crazy things would be if there was a more major outage…”


Not even the hyperscalers are immune to downtime. So, for everyone working in the cloud, perhaps this outage is a friendly reminder that “always on” is never a guarantee, especially when everything depends on a single point of failure.
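For developers wondering what an "alternative route" might look like in practice, the idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a recommended architecture: `first_reachable` and the `probe` callback are hypothetical names invented here, and real failover also requires the data itself to be available in the fallback region.

```python
from typing import Callable, Iterable, Optional


def first_reachable(
    endpoints: Iterable[str], probe: Callable[[str], bool]
) -> Optional[str]:
    """Return the first endpoint the probe reports healthy, else None.

    `probe` is a hypothetical health check supplied by the caller. It may
    return False or raise OSError (which covers DNS and connection errors
    from the socket module) to signal that an endpoint is unusable.
    """
    for endpoint in endpoints:
        try:
            if probe(endpoint):
                return endpoint
        except OSError:
            # Lookup or connection failed; fall through to the next option.
            continue
    return None
```

A caller might pass an ordered list such as `["dynamodb.us-east-1.amazonaws.com", "dynamodb.us-west-2.amazonaws.com"]` along with its own health check. The ordering encodes preference, and a `None` result means every route is down, at which point the application has to degrade gracefully rather than retry forever.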