💻 Common Interview Question Candidates Get Wrong: Disaster Recovery (DR) for your AWS application


Hello Reader,

In today’s post, let’s look at another common cloud interview question: the correct but average answer, and the great answer that gets you hired. This question is even more relevant now, after this week's AWS outage!

Question - How did you do Disaster Recovery (DR) for your AWS application?

Common but average answer - I will replicate it to another region

What the interviewer wants to hear is which DR strategies exist and how you choose between them. As an SA, you will be responsible for talking to the app team and coming up with an appropriate DR strategy.

A great answer is - There are different DR options to choose from depending on RTO (Recovery Time Objective) and RPO (Recovery Point Objective). The available DR strategies ordered by highest to lowest RTO/RPO (and lowest to highest cost) are:

  • Backup and Restore
  • Pilot Light
  • Warm Standby
  • Multi-site Active/Active
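To make the trade-off concrete, here is a minimal Python sketch of how that choice could be reasoned about: walk the list from cheapest to most expensive and stop at the first strategy that meets the required RTO and RPO. The `pick_dr_strategy` function and every number in it are illustrative assumptions, not AWS guidance. Real values come from talking to the app team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # ASSUMPTION: rough recovery time this strategy can meet
    rpo_minutes: float  # ASSUMPTION: rough data-loss window this strategy can meet


# Ordered from cheapest (highest RTO/RPO) to most expensive (lowest RTO/RPO).
# The numbers are made up for illustration only.
STRATEGIES = [
    Strategy("Backup and Restore", rto_minutes=24 * 60, rpo_minutes=24 * 60),
    Strategy("Pilot Light", rto_minutes=60, rpo_minutes=15),
    Strategy("Warm Standby", rto_minutes=10, rpo_minutes=1),
    Strategy("Multi-site Active/Active", rto_minutes=0, rpo_minutes=0),
]


def pick_dr_strategy(required_rto_minutes: float, required_rpo_minutes: float) -> str:
    """Return the cheapest strategy that satisfies both requirements."""
    for strategy in STRATEGIES:
        meets_rto = strategy.rto_minutes <= required_rto_minutes
        meets_rpo = strategy.rpo_minutes <= required_rpo_minutes
        if meets_rto and meets_rpo:
            return strategy.name
    # Nothing cheaper fits, so fall back to the most aggressive option.
    return STRATEGIES[-1].name


if __name__ == "__main__":
    # An app that can tolerate 2 hours of downtime and 30 minutes of data loss.
    print(pick_dr_strategy(required_rto_minutes=120, required_rpo_minutes=30))
```

The point to make in the interview is the ordering, not the exact thresholds: you pay more as the tolerated downtime and data loss shrink.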

Then explain one of the DR strategies in detail, preferably Multi-site Active/Active, because it’s used in the most critical prod applications. The architecture has these parts:

  • The most critical part of DR is the database. In this case, we are using DynamoDB global tables for active-active mode. If you are using a SQL database like Aurora, keep in mind that Aurora Global Database is active-passive, but the new Aurora DSQL is active-active.
  • The application stack runs on EC2 with an Auto Scaling group. Run a minimum of two EC2 instances in each region to keep it highly available.
  • Load balancers are a regional service, so we use one load balancer in each region, distributing traffic within that region.
  • Route 53 sends traffic to one of the two load balancers using a geolocation or latency-based routing policy.
  • RPO and RTO are minimal in this architecture because data is constantly being replicated, and EC2 instances are up and running with a minimum count of two in both regions. In some cases, teams set the desired count higher to keep more EC2 instances running in the second region for a lower RTO.
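As a rough illustration of the Route 53 piece, the sketch below builds the change batch for two latency-based alias records, one per regional load balancer. It assumes latency-based routing (geolocation would use a different field), and the domain, load balancer DNS names, and hosted zone IDs are all hypothetical placeholders. The code only builds the request payload; the commented-out boto3 call at the bottom is where it would be applied.

```python
def latency_alias_record(domain, region, alb_dns_name, alb_hosted_zone_id):
    """Build one UPSERT change for a latency-based alias record."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": domain,
            "Type": "A",
            # SetIdentifier must be unique among records sharing a name and type.
            "SetIdentifier": f"{region}-alb",
            # Region is what makes this a latency-based record.
            "Region": region,
            "AliasTarget": {
                "HostedZoneId": alb_hosted_zone_id,
                "DNSName": alb_dns_name,
                # Stop routing to a region whose load balancer is unhealthy.
                "EvaluateTargetHealth": True,
            },
        },
    }


def build_change_batch(domain, load_balancers):
    """Build a Route 53 ChangeBatch with one record per regional load balancer."""
    return {
        "Comment": "Active/active routing across regions",
        "Changes": [latency_alias_record(domain, **lb) for lb in load_balancers],
    }


if __name__ == "__main__":
    # HYPOTHETICAL values: replace with your own domain and load balancers.
    batch = build_change_batch(
        "app.example.com",
        [
            {"region": "us-east-1",
             "alb_dns_name": "my-alb-east.example.com",
             "alb_hosted_zone_id": "ZEXAMPLEEAST"},
            {"region": "us-west-2",
             "alb_dns_name": "my-alb-west.example.com",
             "alb_hosted_zone_id": "ZEXAMPLEWEST"},
        ],
    )
    print(batch)
    # To apply it (requires AWS credentials and a real hosted zone):
    # import boto3
    # boto3.client("route53").change_resource_record_sets(
    #     HostedZoneId="ZEXAMPLEZONE", ChangeBatch=batch)
```

Setting `EvaluateTargetHealth` to true is what lets Route 53 shift users away from a failing region without anyone editing DNS during the incident.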

💡 Other things to keep in mind for real-world projects

  • Establishing and implementing a DR strategy BEFORE the disaster happens is critical. If you had a DR strategy like the one above in place before this week's AWS outage, you would have been okay. But if you tried to set it up while us-east-1 was down, it would have been too late, and an app running solely in us-east-1 would have been down.
  • There are other auxiliary components that you need to think about. For example, if you are using an S3 bucket, you need to set up cross-region replication for it. If you are using Cognito for AuthN/Z, you need to implement DR yourself using export/import, etc.
  • However, the web and app tier, database, and load balancers are the components most commonly asked about in interviews, because that's what application teams handle in an enterprise. Hence, don't go crazy thinking about DR for each part in interviews.
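For the S3 point above, here is a minimal sketch of what a cross-region replication rule could look like as a Python payload. The bucket name, role ARN, and the `build_replication_config` helper are hypothetical placeholders. S3 replication also requires versioning to be enabled on both the source and destination buckets.

```python
def build_replication_config(role_arn, destination_bucket_arn):
    """Build an S3 replication configuration that copies all objects."""
    return {
        # IAM role that S3 assumes to replicate objects on your behalf.
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-everything-to-dr-region",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # an empty filter matches every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


if __name__ == "__main__":
    # HYPOTHETICAL values: replace with your own role and DR bucket.
    config = build_replication_config(
        role_arn="arn:aws:iam::123456789012:role/s3-replication-role",
        destination_bucket_arn="arn:aws:s3:::my-app-assets-dr",
    )
    print(config)
    # To apply it (requires AWS credentials and versioned buckets):
    # import boto3
    # boto3.client("s3").put_bucket_replication(
    #     Bucket="my-app-assets", ReplicationConfiguration=config)
```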

If you get this question in your interview, make sure to knock it out of the park!

If you have found this newsletter helpful, and want to support me 🙏:

Check out my bestselling courses on AWS, System Design, Kubernetes, DevOps, and more: Max discounted links

AWS SA Bootcamp with Live Classes, Mock Interviews, Hands-On, Resume Improvement and more: https://www.sabootcamp.com/

Keep learning and keep rocking 🚀,

Raj

Fast Track To Cloud

Free Cloud Interview Guide to crush your next interview. Plus, real-world answers for cloud interviews, and system design from a top AWS Solutions Architect.
