Will MCP Eliminate SRE Jobs?


Hello Reader,

Another day, another MCP (Model Context Protocol) tool. But this one is special. Today we are going to go over the newly released EKS MCP server. This is the official MCP server for Amazon EKS, AWS's managed Kubernetes service, and it is released and maintained by AWS. This one will rule them all! In today's edition, we are going to go over what it is, why it is a game changer, how you can use it to get job interviews and demand more money, and whether it will eliminate SRE jobs.

There are three ways to manage Kubernetes:

Traditional Way (Manual & Google-Driven)

  • You search Google for the steps and implement them manually.
  • When something breaks, you Google the error and try again.
  • It’s a trial-and-error loop that can take hours or days.

High cognitive load: you have to remember kubectl commands, YAML syntax, and so on.

With Large Language Models (LLMs)

  • You can ask an LLM to generate Terraform or YAML for your cluster.
  • You can paste in error messages, and the LLM suggests fixes.
  • It is still manual: you have to copy-paste commands and verify the results.
  • LLMs are trained on past data, so they may not know about the latest Kubernetes versions.
  • Tools like K8sGPT improve on this but are still limited to troubleshooting.
  • This speeds up work, but tasks can still take hours.

With LLM + MCP (Autonomous Agentic AI)

An MCP-powered agent (like EKS MCP) can:

  • Create EKS clusters
  • Write and modify YAML, Helm charts, ingress configs
  • Read terminal output
  • Take real-time actions autonomously
  • Remember context from earlier steps, so there is no need to re-explain
  • Work from natural language: just describe what you want

You no longer need to memorize kubectl commands.

Execution time drops to minutes, not hours.

This is the future of cloud operations! Let's understand what it is.

EKS MCP Server

This is the only official EKS MCP server created and supported by AWS. This MCP server:

  • Can create EKS clusters
  • Can write and deploy YAML files
  • Most importantly, draws on AWS's own knowledge of troubleshooting EKS clusters, which makes it really good at diagnosing EKS issues
  • Is plug and play, like any MCP server, so you can easily add other MCP servers alongside it and extend your workflow. For example, you can add the AWS Cost Analysis MCP server to the same MCP host and ask about the costs in your AWS account!
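
To make the plug and play point concrete, here is a minimal sketch of what registering two servers with one MCP host could look like. Several MCP hosts read a JSON file with an "mcpServers" section, and this snippet just builds that JSON. The package names, the use of uvx as the launcher, and the exact config layout are my assumptions, so check the AWS and MCP host documentation before copying it.

```python
import json


def build_host_config(servers):
    """Return a host config dict with one entry per MCP server."""
    return {
        "mcpServers": {
            name: {"command": "uvx", "args": [package]}
            for name, package in servers.items()
        }
    }


# Assumed package names and launcher (uvx); verify them against the
# official AWS documentation before using this config.
SERVERS = {
    "eks": "awslabs.eks-mcp-server@latest",
    "cost-analysis": "awslabs.cost-analysis-mcp-server@latest",
}

config = build_host_config(SERVERS)
print(json.dumps(config, indent=2))
```

Adding a third server is just one more entry in SERVERS, which is the whole appeal: the host stays the same and the toolbox grows.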

Now the big question - will this replace SRE jobs?

✅ The Bar Will Go Up - Especially for Entry-Level Roles

  • Before: Entry-level SREs were given time - “Take a couple of weeks to spin up EKS and write Terraform and Helm charts.”
  • Now: The expectation will be - “Use these tools. Get it done faster.”
  • Managers will expect speed, not just effort.

💰 Cost Won’t Stop This

  • LLM costs are a non-issue
  • Example: I used MCP + an LLM for hours, and the cost was about 75 cents.
  • That’s cheaper than any employee's time, even for experimentation.
  • Plus, model costs are dropping rapidly over time.

🔒 But Security Will Be a Challenge

  • MCP tools can execute commands directly.
  • Some are sourced from random GitHub repos, so you have to be careful when installing them. There can be multiple tools with the same name!
  • Enterprises will not adopt MCP fully until security is solved.
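
One cheap safeguard against the same-name problem is to list the tools each installed server exposes and flag any name that shows up more than once. The sketch below shows the idea in plain Python. The server and tool names are made up for illustration, and how you actually collect the tool list depends on your MCP host.

```python
from collections import defaultdict


def find_name_collisions(tools_by_server):
    """Map each tool name offered by more than one server to those servers."""
    owners = defaultdict(list)
    for server, tools in tools_by_server.items():
        for tool in set(tools):  # ignore duplicates within a single server
            owners[tool].append(server)
    return {
        tool: sorted(servers)
        for tool, servers in owners.items()
        if len(servers) > 1
    }


# Hypothetical inventory: these server and tool names are illustrative only.
inventory = {
    "official-eks": ["create_cluster", "get_pod_logs"],
    "random-github-fork": ["get_pod_logs", "delete_everything"],
}

collisions = find_name_collisions(inventory)
print(collisions)  # {'get_pod_logs': ['official-eks', 'random-github-fork']}
```

A collision does not prove anything is malicious, but it tells you exactly which tools to inspect before you let an agent call them.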

🧠 Experience Still Matters

  • Often, these MCP tools run commands or take actions that are unnecessary or incorrect. You still need to validate each command and approve or reject it.
  • You have more context on the application, plus domain knowledge, which has its own advantages.
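
The approve-or-reject step can be pictured as a small gate in front of every command the agent proposes. This is only a sketch of the idea, not how any particular MCP host implements it: the review function, the read-only verb list, and the sample commands are all my own illustrative choices.

```python
# kubectl subcommands treated as read-only in this sketch (an assumption;
# note that even read-only commands can expose sensitive data).
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}


def review(command, approve):
    """Return True if the command may run.

    Read-only kubectl commands pass automatically; everything else is
    handed to the approve callback, which stands in for a human reviewer.
    """
    parts = command.split()
    if len(parts) >= 2 and parts[0] == "kubectl" and parts[1] in READ_ONLY_VERBS:
        return True
    return bool(approve(command))


proposed = ["kubectl get pods -n web", "kubectl delete namespace web"]

# Stand-in reviewer that rejects everything it is asked about.
decisions = {cmd: review(cmd, approve=lambda c: False) for cmd in proposed}
for cmd, allowed in decisions.items():
    print("RUN " if allowed else "SKIP", cmd)
```

In real use, the approve callback is you reading the command, and that is exactly where your application context and domain knowledge pay off.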

And as always with technology, if you adopt this, you will be fine. Some of you ask: if these tools get the work done so much faster, won't that reduce the number of jobs? That's not how it goes. As these tools free up your time, you will be assigned other work that customers want. This is an emerging technology, and with any emerging technology, new jobs and new functions come up. The same thing happened during the internet revolution and the dot-com era, when we moved from brick and mortar to the internet, and before that with the move from horses to cars, industrialization, and so on.

And if you adopt this and showcase it on your resume and GitHub, you can demand more money and attract recruiters, which gets you interviews faster.

If you want to see a demo of this official AWS MCP server in action, please check out the detailed video.

If you have found this newsletter helpful, and want to support me 🙏:

Check out my bestselling courses on AWS, System Design, Kubernetes, DevOps, and more: Max discounted links

AWS SA Bootcamp with Live Classes, Mock Interviews, Hands-On, Resume Improvement and more: https://www.sabootcamp.com/

Keep learning and keep rocking 🚀,

Raj

Fast Track To Cloud
