💻 Common Interview Question Candidates Get Wrong: Disaster Recovery (DR) for your AWS application


Hello Reader,

In today’s post, let’s look at another common cloud interview question, comparing a correct but average answer with a great answer that gets you hired. This question is even more relevant now, after this week's AWS outage!

Question - How did you do Disaster Recovery (DR) for your AWS application?

Common but average answer - I will replicate it to another region.

What the interviewer is looking for is what the different DR strategies are and how you choose between them. As a Solutions Architect (SA), you will be responsible for talking to the app team and coming up with an appropriate DR strategy.

A great answer is - There are different DR options to choose from depending on RTO (Recovery Time Objective) and RPO (Recovery Point Objective). RTO is how long the application can be down, and RPO is how much data loss, measured in time, is acceptable. The available DR strategies, ordered from highest to lowest RTO/RPO (and lowest to highest cost), are:

  • Backup and Restore
  • Pilot Light
  • Warm Standby
  • Multi-site Active/Active
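To make the trade-off concrete, here is a minimal Python sketch of how the choice could be framed: walk the strategies from cheapest to most expensive and stop at the first one that meets the app team's targets. The `Strategy` class, the `pick_strategy` helper, and every RTO/RPO number below are illustrative assumptions for this sketch, not official AWS figures; real numbers depend on your workload and how you build each strategy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # assumed typical time to recover
    rpo_minutes: float  # assumed typical window of data loss


# Ordered cheapest first, matching the list above.
# All numbers are hypothetical placeholders, not AWS guarantees.
STRATEGIES = [
    Strategy("Backup and Restore", rto_minutes=24 * 60, rpo_minutes=24 * 60),
    Strategy("Pilot Light", rto_minutes=60, rpo_minutes=15),
    Strategy("Warm Standby", rto_minutes=10, rpo_minutes=1),
    Strategy("Multi-site Active/Active", rto_minutes=1, rpo_minutes=1),
]


def pick_strategy(target_rto_minutes: float, target_rpo_minutes: float) -> Strategy:
    """Return the cheapest strategy whose assumed RTO and RPO meet both targets."""
    for strategy in STRATEGIES:
        if (strategy.rto_minutes <= target_rto_minutes
                and strategy.rpo_minutes <= target_rpo_minutes):
            return strategy
    # Nothing meets the targets: fall back to the strongest (most expensive) option.
    return STRATEGIES[-1]


# Example: the app team can tolerate 30 minutes of downtime and 5 minutes of data loss.
print(pick_strategy(30, 5).name)  # Warm Standby
```

The point to make in the interview is the same as in the code: you start from the business targets and pay only for as much DR as those targets require.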

Then explain one of the DR strategies in detail, preferably Multi-site Active/Active, because it’s used in most critical prod applications. The architecture works as follows:

  • The most critical part of DR is the database. In this case, we are using DynamoDB Global Tables for active-active mode. If you are using a SQL database like Aurora, keep in mind that Aurora Global Database is active-passive (a single primary region takes the writes), but the new Aurora DSQL is active-active.
  • The application stack runs on EC2 with an Auto Scaling Group. You run a minimum of two EC2 instances in each region to keep it highly available.
  • Load balancers are a regional service, hence we use one load balancer in each region, distributing the traffic within that region.
  • Route 53 sends traffic to one of the two load balancers using geolocation or latency-based routing.
  • RPO/RTO is minimal in this architecture because data is constantly being replicated, and EC2 instances are up and running with a minimum count of two in both regions. In some cases, teams set the desired count higher to keep more EC2 instances running in the second region for an even lower RTO.
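The routing behavior described in the bullets above can be sketched as a toy model in Python. This is not the Route 53 API; `route_request`, the latency values, and the health flags are hypothetical and only illustrate the idea that traffic goes to the lowest-latency region that is still healthy.

```python
from typing import Dict, Optional


def route_request(latency_ms: Dict[str, float],
                  healthy: Dict[str, bool]) -> Optional[str]:
    """Return the healthy region with the lowest latency, or None if all are down."""
    # Keep only regions that are reporting healthy.
    candidates = {region: ms for region, ms in latency_ms.items()
                  if healthy.get(region, False)}
    if not candidates:
        return None
    # Pick the region whose measured latency is smallest.
    return min(candidates, key=candidates.get)


# Hypothetical latencies seen by a user on the US east coast.
latency = {"us-east-1": 20.0, "us-west-2": 70.0}

# Normal day: both regions healthy, the user is served from the closer region.
print(route_request(latency, {"us-east-1": True, "us-west-2": True}))   # us-east-1

# us-east-1 outage: traffic shifts to the region that is already running.
print(route_request(latency, {"us-east-1": False, "us-west-2": True}))  # us-west-2
```

Because the second region already has data replicated and EC2 instances running, the shift needs no rebuild step, which is why RTO stays low in this design.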

💡 Other things to keep in mind for real-world projects

  • Establishing and implementing a DR strategy BEFORE the disaster happens is critical. If you had a DR strategy like the one above in place before this week's AWS outage, you would have been okay. But if you tried to set it up after us-east-1 went down, it was already too late, and an app running solely in us-east-1 was down.
  • There are other auxiliary components that you need to think about. For example, if you are using an S3 bucket, you need to set up cross-region replication for it. If you are using Cognito for AuthN/AuthZ, you need to build DR yourself using export and import, etc.
  • However, the web and app tiers, database, and load balancers are the components most commonly asked about in interviews, because that's what application teams handle in an enterprise. Hence, don't go crazy thinking about DR for every component in interviews.

If you get this question in your interview, make sure to knock it out of the park!

If you have found this newsletter helpful, and want to support me 🙏:

Check out my bestselling courses on AWS, System Design, Kubernetes, DevOps, and more: Max discounted links

AWS SA Bootcamp with Live Classes, Mock Interviews, Hands-On, Resume Improvement and more: https://www.sabootcamp.com/

Keep learning and keep rocking 🚀,

Raj

Fast Track To Cloud

Free Cloud Interview Guide to crush your next interview. Plus, real-world answers for cloud interviews, and system design from a top AWS Solutions Architect.
