AWS outage and lessons learnt for building fault-tolerant systems in the cloud
Kuliza is one of the early adopters of AWS infrastructure (we have used it since early 2008). We run hundreds of production machines on it with heavy scale-up and scale-out architectures, use it for almost every purpose from development boxes to staging servers, and recommend AWS to most of our customers for managing their infrastructure. The recent AWS outage affected some of our major and critical production deployments; you can read the AWS service disruption post-mortem to understand the root cause of the outage.
Thanks to our infrastructure team for effectively managing the crisis and immediately moving the affected systems to a different AWS region. I thought it would be good to carry the lessons and experience forward to reduce our risk exposure in future cases like this. It is also time for us to look at different strategies for increasing the availability and dependability of applications deployed in AWS, and to focus on a “Design for Failure” architecture that can handle prolonged outages.
The key learning for our team from the recent crisis is: “Application availability is the responsibility of the developers, not that of AWS or the cloud provider.” It is in our hands to make sure production systems are fault-tolerant and can withstand critical outages in the cloud at different levels, such as availability zones and regions, and to have a crisis plan for handling a complete failure of AWS infrastructure.
What do you need for handling a production crisis? – It is as simple as bringing your systems back online with the latest data from just before the outage and making them ready for use by customers. That means the production data of your applications and databases must be available for restoration as quickly as possible, so that you can create new systems in or outside AWS.
Focus on Backup Systems – One of the reasons the recent AWS outage affected so many users was that their backup systems (including ours) were completely dependent on AWS infrastructure. We run EC2 for compute, EBS for persistent storage and RDS for databases, and our backup servers for production data also sit on EBS, so we are completely dependent on EBS. If a piece of AWS infrastructure like EBS fails, such a backup is as good as no backup, and you are completely at the mercy of Amazon to resolve the failure. Our first focus post-outage was to make sure our backup systems are replicated across different AWS regions (so that we can survive the outage even if one data center goes down) and to push one copy of each backup to S3 in addition to the EBS-based backup servers. But what if the entire AWS cloud goes down, including S3 and EBS across all regions? I will talk about that below when describing our internal thinking on handling failures at the cloud level.
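As a rough illustration of the "more than one copy" idea, here is a minimal Python sketch that fans a backup archive out to several destinations and verifies each copy by checksum. It is not our actual tooling: the destinations are plain directories standing in for a second-region backup server, an S3 bucket and an off-AWS store, and the names `replicate_backup` and `sha256_of` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_backup(archive: Path, destinations: List[Path]) -> Dict[str, bool]:
    """Copy `archive` into every destination directory and verify each copy.

    Returns a mapping of destination -> True if the copy exists and its
    checksum matches the source. One failing destination does not stop
    the others, since the point is to never depend on a single target.
    """
    expected = sha256_of(archive)
    results: Dict[str, bool] = {}
    for dest in destinations:
        try:
            dest.mkdir(parents=True, exist_ok=True)
            copied = Path(shutil.copy2(archive, dest / archive.name))
            results[str(dest)] = sha256_of(copied) == expected
        except OSError:
            # Destination unreachable or not writable: record it and move on.
            results[str(dest)] = False
    return results
```

In a real setup each directory would be replaced by an upload to the corresponding storage service, and a backup run would only count as successful once at least two independent destinations report `True`.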
Must plan for Database Failovers – Why were Quora and Reddit in read-only mode for almost a day? Many AWS customers use RDS (relational databases such as MySQL), which internally uses EBS to store databases and the related log files. If EBS is down, your database is at risk unless Multi-AZ is enabled on your RDS instances, and in the recent outage even Multi-AZ failover did not work as expected. This is what happened with our Oracle database servers. We are now setting up replication servers in different regions and pushing database backups to S3 to avoid complete dependency on EBS for persistent data storage and backups. Even this is not foolproof in the case of a complete AWS outage, so database backups must also go, encrypted, to infrastructure outside AWS.
Irrespective of the SLAs and reliability guarantees offered by cloud vendors, we have to start designing fail-safe systems, because our users and customers do not care whether a failure is AWS's problem or someone else's. It is our responsibility to architect production systems so that failures are isolated at the availability zone and region level, and so that we can handle a complete failure of AWS services in the worst case.
Building “Design for Failure” systems requires an understanding of the hardware stack, deployment architecture, replication and backup mechanisms, production applications and customer use cases, with a focus on increasing availability and reliability.
The following sections outline our internal thinking on handling failures at different levels in AWS, and of AWS as a whole, to avoid critical business failures for our customers.
- Region Level – We are planning to divide our production systems into groups so that we can deploy them in multiple AWS regions (US-EAST as R1, US-WEST as R2, EU as R3 and so on). This is a good strategy if you run multiple instances of the same application that need to scale. Each region will have a set of production instances, at least one database server and one backup server. The DB server in R1 (US-EAST) will host a replica of the R2 DB server and vice versa, so that we have database failover systems in two different regions. Likewise, the backup server in R1 will mirror the backup server in R2 and vice versa. All backup copies will also be pushed to S3, so we will have backups in three different systems across different regions and different storage infrastructure (EBS and S3).
- Availability Zone – For applications that use load balancing or auto-scaling, we will deploy instances in multiple availability zones within the region instead of a single one, to increase reliability and availability. We want to make it a practice to deploy each of our critical cloud applications in at least two availability zones in a region and to scale out as load on the systems grows. We will also enable Multi-AZ on all our current RDS instances (MySQL servers) where we are not already using it.
- Application Level – Backup copies of production applications (pre- and post-deployment) should be kept on multiple backup servers in different regions, along with one copy outside AWS cloud infrastructure. We also plan to build UAT machines in different regions (the UAT system for R1 applications will be in R2 and vice versa), so that we have a replica of the production system for critical customer applications in case of regional failures. Where we use EBS for persistent storage, we will take incremental snapshots of EBS volumes based on business criticality and push them to S3. We will be leveraging the open-source tools Amanda or Bacula for this purpose.
- Deployment Stack and VM Images – What if the entire EC2 infrastructure or S3 goes down? How would we handle such a failure, and what happens to our machine images? One of our first steps is to keep a copy of our custom AMIs and VMs, with the production stack configured and ready to use, outside AWS infrastructure. This will allow us to take our VMs and deploy them on another provider's infrastructure in the case of a 100% AWS infrastructure failure. We are really looking forward to open cloud standards that provide interoperability across providers, which is critical for cloud users.
- Cloud Level Failures – What if the entire AWS cloud goes down across all regions? To handle such cases, our backup systems should keep a copy of production data (databases, application data and so on) outside AWS infrastructure. Our current thinking is to use Rackspace Cloud, Google Storage or Microsoft Azure Blob Storage to hold a copy of the backups for all critical production applications outside AWS.
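The levels above can be sanity-checked on paper before anything is deployed. The following Python sketch models where each backup copy lives and asks which copies survive a given failure. Everything in it is an illustrative assumption rather than our real inventory: the `BackupCopy` and `Failure` types, the `PLAN` list, the choice of a single S3 copy in US-EAST and the off-AWS "external" provider.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BackupCopy:
    provider: str  # e.g. "aws" or "external"
    region: str    # e.g. "us-east"
    storage: str   # e.g. "ebs", "s3", "blob"


@dataclass(frozen=True)
class Failure:
    """A failure scope; fields left as None match any value."""
    provider: Optional[str] = None
    region: Optional[str] = None
    storage: Optional[str] = None

    def takes_out(self, copy: BackupCopy) -> bool:
        # A copy is lost when it matches every field the failure specifies.
        pairs = (
            (self.provider, copy.provider),
            (self.region, copy.region),
            (self.storage, copy.storage),
        )
        return all(wanted is None or wanted == actual for wanted, actual in pairs)


def surviving(copies: List[BackupCopy], failure: Failure) -> List[BackupCopy]:
    """Return the backup copies that are still reachable after `failure`."""
    return [c for c in copies if not failure.takes_out(c)]


# Hypothetical plan: two mirrored backup servers, one S3 copy, one off-AWS copy.
PLAN = [
    BackupCopy("aws", "us-east", "ebs"),      # backup server in R1
    BackupCopy("aws", "us-west", "ebs"),      # mirrored backup server in R2
    BackupCopy("aws", "us-east", "s3"),       # S3 copy
    BackupCopy("external", "other", "blob"),  # copy outside AWS
]

SCENARIOS = {
    "US-EAST region down": Failure(provider="aws", region="us-east"),
    "EBS down in all regions": Failure(provider="aws", storage="ebs"),
    "all of AWS down": Failure(provider="aws"),
}

for name, failure in SCENARIOS.items():
    print(name, "->", len(surviving(PLAN, failure)), "copies left")
```

With this particular plan every scenario leaves at least one copy, and only the off-AWS copy survives a complete AWS failure, which is the argument for keeping one.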
Designing fault-tolerant systems in the cloud requires new approaches and increases operational costs in some cases, but it is essential to plan for it for all our critical applications so that we are not at the mercy of a cloud provider.
This failure has taught us to plan for and test failures as part of our infrastructure management process and operations. Our team will run random audits and failure scenarios to verify that our systems architecture and design can withstand failures in AWS, and to continuously test the robustness of our business continuity plans.
In short, to handle failures in the cloud: plan, review and test your assumptions and systems design. We look forward to working with customers to build robust cloud infrastructure, and would love to hear your learnings and strategies for building fail-safe systems in the cloud.
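One simple form such an audit could take is a script that walks through every single-zone and single-region failure and lists the applications that would go completely offline. The Python sketch below assumes a hand-maintained map of application placements; the application names, the zone names and the `audit` function are all hypothetical.

```python
from itertools import chain
from typing import Dict, List, Tuple

# Hypothetical inventory: application -> list of (region, availability zone).
DEPLOYMENTS: Dict[str, List[Tuple[str, str]]] = {
    "shop-web": [("us-east", "us-east-1a"), ("us-east", "us-east-1b")],
    "reports": [("us-east", "us-east-1a")],
    "billing": [("us-east", "us-east-1a"), ("us-west", "us-west-1a")],
}


def audit(deployments: Dict[str, List[Tuple[str, str]]]) -> Dict[str, List[str]]:
    """Return {scenario: apps fully down} for each single-zone and
    single-region failure that takes at least one application offline."""
    placements = set(chain.from_iterable(deployments.values()))
    zones = sorted({("zone", zone) for _, zone in placements})
    regions = sorted({("region", region) for region, _ in placements})

    findings: Dict[str, List[str]] = {}
    for kind, name in zones + regions:
        idx = 1 if kind == "zone" else 0  # which tuple field to compare
        down = sorted(
            app
            for app, places in deployments.items()
            if all(place[idx] == name for place in places)
        )
        if down:
            findings[f"{kind}:{name}"] = down
    return findings


for scenario, apps in audit(DEPLOYMENTS).items():
    print(scenario, "takes down", ", ".join(apps))
```

A script like this only checks placement on paper. It complements, and does not replace, actually shutting instances down in a drill to see whether failover works.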

Why a low traffic website like Kuliza requires 100s of production servers is the basic question you need to first answer!
Oh, these 100s of production servers are not for us but for our clients. We provide monitoring, hosting and infrastructure services under our ZaCloud offerings.
Check them out here: http://kuliza.com/services/za-cloud/