Thoughts On Microsoft’s Azure Outage Post-Mortem

Last Updated 6 years ago

Last week, Azure suffered a day-long outage. One of the services involved was Visual Studio Team Services (aka Azure DevOps), and that team just published their outage postmortem.

The postmortem is FANTASTIC: open, honest (at least it reads that way), and goes into enough technical detail to satisfy a wide variety of readers from managers to technical implementers.

This section explains a lot about their HA/DR strategy:

Why didn’t VSTS services fail over to another region? We never want to lose any customer data. A key part of our data protection strategy is to store data in two regions using Azure SQL DB Point-in-time Restore (PITR) backups and Azure Geo-redundant Storage (GRS). This enables us to replicate data within the same geography while respecting data sovereignty. Only Azure Storage can decide to fail over GRS storage accounts. If Azure Storage had failed over during this outage and there was data loss, we would still have waited on recovery to avoid data loss.

To rephrase, in the event of losing a region, the plan was to restore from backups. That’s absolutely fair, and it’s probably the same disaster recovery plan your company has, dear reader. Don’t get all high-and-mighty on me now – I like that plan just fine for disasters, and it’s the same thing we designed for our Faux PaaS project.

But I want to draw your attention to what their plan didn’t include: synchronous Availability Groups across data centers.

Cross-data-center synchronous AGs are something that work great in theory, but usually fall down in practice. Your applications just don’t want to wait until a write is committed across two different data centers. I’ll let Microsoft explain why:

However, the reality of cross-region synchronous replication is messy. For example, the region paired with South Central US is US North Central. Even at the speed of light, it takes time for the data to reach the other data center and for the original data center to receive the response. The round-trip latency is added to every write. This adds approximately 70ms for each round trip between South Central US and US North Central. For some of our key services, that’s too long. Machines slow down and networks have problems for any number of reasons. Since every write only succeeds when two different sets of services in two different regions can successfully commit the data and respond, there is twice the opportunity for slowdowns and failures. As a result, either availability suffers (halted while waiting for the secondary write to commit) or the system must fall back to asynchronous replication.

That’s Microsoft talking.

Microsoft can’t get cross-region synchronous replication to work for them in a way that makes them happy.
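
To put rough numbers on that quote, here is a back-of-the-envelope sketch in Python. Only the 70ms round trip comes from the postmortem. The 2ms local commit time and the 99.9% per-region availability are made-up inputs for illustration, and the function names are hypothetical. It also assumes a single session that waits for each commit before sending the next write, and that the two regions fail independently, which real outages often do not.

```python
# Back-of-the-envelope sketch. Only ROUND_TRIP_MS comes from the postmortem;
# every other number below is an illustrative assumption.

ROUND_TRIP_MS = 70.0  # cross-region round trip quoted by Microsoft


def sync_commit_ms(local_commit_ms: float, round_trip_ms: float = ROUND_TRIP_MS) -> float:
    """Commit latency when each write also waits for the remote region to acknowledge."""
    return local_commit_ms + round_trip_ms


def serial_writes_per_second(commit_ms: float) -> float:
    """Upper bound on writes per second for ONE session that waits for each commit."""
    return 1000.0 / commit_ms


def both_regions_available(primary: float, secondary: float) -> float:
    """Chance a synchronous write can commit, assuming independent regional failures."""
    return primary * secondary


if __name__ == "__main__":
    local_ms = 2.0  # hypothetical local commit latency
    remote_ms = sync_commit_ms(local_ms)

    print(f"local-only commit:  {local_ms:.0f} ms, "
          f"{serial_writes_per_second(local_ms):.1f} serial writes/sec")
    print(f"synchronous commit: {remote_ms:.0f} ms, "
          f"{serial_writes_per_second(remote_ms):.1f} serial writes/sec")

    # Hypothetical 99.9% availability per region
    print(f"write availability when both regions must respond: "
          f"{both_regions_available(0.999, 0.999):.4%}")
```

With those assumed inputs, a single chatty session drops from hundreds of serial writes per second to roughly 14, and the chance that a write can commit is lower than either region's availability alone. Concurrent sessions soften the throughput hit, but each individual write still pays the full round trip.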

Before you design a DR plan aiming for zero data loss using synchronous AG replication, make sure you build a solid proof of concept, and load test it with production-quality workloads. Make sure your end users will accept the latency slowdowns – or if they won’t, make sure they sign off on the RPO and RTO involved with a single-data-center solution. The time to learn these numbers isn’t when the hurricane is approaching, or when you’re writing a postmortem about your own apps.

6 Comments

Robert Sterbal
September 11, 2018 9:39 am

Thank you for a clear tradeoff analysis that is written to communicate with the business user.

Albert Einsten
September 11, 2018 10:24 am

Well, Microsoft just needs to create a way to send data faster than the speed of light.

Steve Jones
September 11, 2018 1:40 pm

Interesting. I didn’t read it that way. I went back and I think that the issue with 70ms and sync writes is with Azure storage replication, not AGs. It seems that a couple of sections above, as they talk about handling failure, the writing moves from a blend of database and storage to just storage.

Might be me, but that’s how I see it.

- Brent Ozar
  September 11, 2018 1:55 pm
  
  Right – they’re not even using AGs. (AGs would likely be worse.)
  
- Brent Ozar
  September 11, 2018 1:57 pm
  
  Also, to be clear – the file sync stuff is for the binary files they get from customers, not the data.
  
devops
May 7, 2019 6:51 am

The article is very useful and gives lots of knowledge.
