💻Common Interview Question Candidates Get Wrong: Disaster Recovery (DR) for your AWS application

Hello Reader,

In today’s post, let’s look at another common cloud interview question, comparing a correct but average answer with a great answer that gets you hired. This question is even more relevant now, after this week's AWS outage!

Question - How did you do Disaster Recovery (DR) for your AWS application?

Common but average answer - I will replicate it to another region.

The interviewer wants to hear how a DR strategy is chosen and what the different strategies are. As an SA, you will be responsible for talking to the app team and coming up with an appropriate DR strategy.

A great answer is - There are different DR options to choose from depending on RTO (Recovery Time Objective, how long the application can stay down) and RPO (Recovery Point Objective, how much data loss is acceptable). The available DR strategies, ordered from highest to lowest RTO/RPO (and lowest to highest cost), are:

Backup and Restore
Pilot Light
Warm Standby
Multi-site Active/Active
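
To make the RTO/RPO trade-off concrete, here is a minimal sketch of how you might pick the cheapest strategy that still meets the app team's targets. The minute values attached to each strategy are illustrative assumptions, not AWS-published figures, and `choose_strategy` is a hypothetical helper name. Real numbers come from testing your own recovery runbooks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    achievable_rto_minutes: float  # assumed rough recovery time
    achievable_rpo_minutes: float  # assumed rough data-loss window


# Ordered from lowest cost (highest RTO/RPO) to highest cost (lowest RTO/RPO).
# All minute values are illustrative assumptions for this sketch.
STRATEGIES = [
    Strategy("Backup and Restore", 24 * 60, 24 * 60),
    Strategy("Pilot Light", 60, 10),
    Strategy("Warm Standby", 10, 1),
    Strategy("Multi-site Active/Active", 1, 1),
]


def choose_strategy(target_rto_minutes: float, target_rpo_minutes: float) -> Strategy:
    """Return the cheapest strategy whose assumed RTO and RPO meet the targets."""
    for strategy in STRATEGIES:
        if (strategy.achievable_rto_minutes <= target_rto_minutes
                and strategy.achievable_rpo_minutes <= target_rpo_minutes):
            return strategy
    # Nothing meets the targets: fall back to the most capable option.
    return STRATEGIES[-1]


if __name__ == "__main__":
    # An app that can tolerate 2 hours of downtime and 30 minutes of data loss.
    print(choose_strategy(120, 30).name)
```

The point to make in the interview is the same one the loop encodes: start from the business targets, then walk up the cost ladder only as far as you need to.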

Then explain one of the DR strategies in detail, preferably Multi-site Active/Active, because it’s used in most critical prod applications. Architecture below:

The most critical part of DR is the database. In this case, we are using DynamoDB Global Tables for active-active mode. If you are using a SQL database like Aurora, keep in mind that Aurora Global Database is active-passive, but the new Aurora DSQL is active-active.
The application stack runs on EC2 with an Auto Scaling Group. You run a minimum of two EC2 instances in each region to keep it highly available.
Load balancers are a regional service, so we use one load balancer in each region, distributing traffic within that region.
Route 53 sends traffic to one of the two load balancers based on geolocation and latency.
RPO and RTO are minimal in this architecture because data is constantly being replicated, and EC2 instances are up and running with a minimum count of two in both regions. In some cases, teams raise the desired count to keep more EC2 instances running in the second region for an even lower RTO.
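
As a rough illustration of the Route 53 piece, the sketch below builds the change batch for two latency-based alias records, one per regional load balancer. It covers latency routing only, not geolocation. The domain name, the regions, the load balancer DNS names, and the hosted zone IDs are all placeholder assumptions, and the helper names are hypothetical. The code only builds the request payload and does not call AWS.

```python
def latency_alias_change(record_name, region, alb_dns_name, alb_hosted_zone_id):
    """Build one UPSERT change for a latency-based alias record pointing at an ALB."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": record_name,
            "Type": "A",
            # SetIdentifier must be unique among records sharing this name and type.
            "SetIdentifier": f"{region}-alb",
            # Region turns this into a latency-based routing record.
            "Region": region,
            "AliasTarget": {
                "HostedZoneId": alb_hosted_zone_id,
                "DNSName": alb_dns_name,
                # Lets Route 53 skip this region when its targets are unhealthy.
                "EvaluateTargetHealth": True,
            },
        },
    }


def build_change_batch(record_name, endpoints):
    """Combine one change per regional endpoint into a single change batch."""
    return {
        "Comment": "Latency-based routing across two regions",
        "Changes": [latency_alias_change(record_name, **ep) for ep in endpoints],
    }


# Placeholder values (assumptions): substitute your real ALB details.
ENDPOINTS = [
    {"region": "us-east-1",
     "alb_dns_name": "my-alb-east.example.elb.amazonaws.com",
     "alb_hosted_zone_id": "ZEXAMPLEEAST"},
    {"region": "us-west-2",
     "alb_dns_name": "my-alb-west.example.elb.amazonaws.com",
     "alb_hosted_zone_id": "ZEXAMPLEWEST"},
]

if __name__ == "__main__":
    batch = build_change_batch("app.example.com.", ENDPOINTS)
    print(len(batch["Changes"]), "record changes prepared")
```

With boto3, a payload shaped like this would be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, along with the `HostedZoneId` of your own domain.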

💡Other things to keep in mind for real-world projects

Establishing and implementing a DR strategy BEFORE the disaster happens is critical. If you had a DR strategy like the one above in place before this week's AWS outage, you'd have been okay. But if you tried to set it up while us-east-1 was down, it would have been too late, and an app running solely in us-east-1 would have been down.
There are other auxiliary components that you need to think about. For example, if you are using an S3 bucket, you need to set up cross-region replication for it. If you are using Cognito for AuthN/Z, you need to implement DR yourself, using export/import and so on.
However, the web and app tier, database, and load balancers are the components most commonly asked about in interviews because that's what application teams handle in an enterprise. Hence, don't go crazy thinking about DR for every component in interviews.
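
For the S3 point above, here is a minimal sketch of what a cross-region replication configuration could look like. The IAM role ARN and bucket ARN are placeholder assumptions, and `build_replication_config` is a hypothetical helper. The code only builds the configuration and does not call AWS.

```python
def build_replication_config(role_arn, destination_bucket_arn, rule_id="dr-replicate-all"):
    """Build a replication configuration that copies every object to the DR bucket."""
    return {
        # IAM role that S3 assumes to replicate objects on your behalf.
        "Role": role_arn,
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                # An empty prefix matches every object in the source bucket.
                "Filter": {"Prefix": ""},
                # Deletes in the source are not propagated in this sketch.
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder ARNs (assumptions): substitute your own role and DR bucket.
    config = build_replication_config(
        role_arn="arn:aws:iam::123456789012:role/example-replication-role",
        destination_bucket_arn="arn:aws:s3:::example-dr-bucket-us-west-2",
    )
    print(config["Rules"][0]["Destination"]["Bucket"])
```

With boto3, a configuration like this would go into the `ReplicationConfiguration` argument of the S3 client's `put_bucket_replication` call. Keep in mind that versioning has to be enabled on both the source and destination buckets before replication can be turned on.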

If you get this question in your interview, make sure to knock it out of the park!

If you have found this newsletter helpful, and want to support me 🙏:

Check out my bestselling courses on AWS, System Design, Kubernetes, DevOps, and more: Max discounted links

AWS SA Bootcamp with Live Classes, Mock Interviews, Hands-On, Resume Improvement and more: https://www.sabootcamp.com/

Keep learning and keep rocking 🚀,

Raj

Fast Track To Cloud
