Kubernetes Platform Team - What, Why, How?


Hello Reader,

If the most-hyped technology of the year award goes to Gen AI this year, it went to Platform Engineering last year. Kubernetes is complex enough, and many of us asked: why a platform team? And more importantly, why should we care? Well, for starters, it's a hot topic in interviews, and there are a lot of platform jobs out there. As you all know, I don't believe in idle theory-crafting; my goal is to teach you things that help you succeed in interviews and real jobs. So, let's start from the beginning.

In the early days

The fundamental container workflow, from local machine to cloud, is as follows (the commands behind these steps are sketched after the list):

  1. The developer writes code and an associated Dockerfile to containerize that code on her local machine
  2. She uses the “docker build” command to create the container image on her local machine. At this point, the container image is saved on the local machine
  3. The developer uses the “docker run” command to run the container image and test the code running inside the container. She can repeat steps 1-3 until the testing meets the requirements
  4. Next, the developer runs the “docker push” command to push the container image from the local machine to a container registry, such as Docker Hub or Amazon ECR
  5. Finally, using the “kubectl apply” command, a YAML manifest that references the container image URL in Amazon ECR is deployed into the running Kubernetes cluster
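
To make these steps concrete, here is a minimal sketch of the commands involved. The image name "my-app", the port, and the ECR registry URL are placeholders I made up for illustration, not values from this article:

    # Step 2: build the container image locally (Dockerfile in the current directory)
    docker build -t my-app:v1 .

    # Step 3: run the container locally and test it (port 8080 is just an example)
    docker run -p 8080:8080 my-app:v1

    # Step 4: tag the image for the registry and push it
    # (assumes you have already authenticated to the registry, e.g. via "aws ecr get-login-password")
    docker tag my-app:v1 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:v1
    docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:v1

    # Step 5: deploy the manifest that references the pushed image into the cluster
    kubectl apply -f deployment.yaml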

Kubernetes cluster considerations

There are some considerations when it comes to the Kubernetes cluster:

  • Kubernetes version needs to be upgraded at a certain cadence (this has NOTHING to do with AWS; new versions come from the CNCF Kubernetes maintainers)
    • New versions are often not backward-compatible
    • Addons and applications can break with the new version
    • For the above reasons, a Kubernetes upgrade can be painful and requires planning and testing
  • Many applications can run in the same Kubernetes cluster in different namespaces, so separation needs to be enforced in this multi-tenant setup (see the sketch after this list)
  • After the K8s cluster is created, many add-ons need to be installed (and maintained), such as Istio, Karpenter, ADOT, Fluent Bit, etc.
  • Some companies use Kubernetes to run databases and other ancillary systems, which brings in additional management of stateful components and DR
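
To illustrate the multi-tenancy point, here is a minimal sketch of one way to enforce separation between tenants: a ResourceQuota per namespace. The namespace name and the limits are hypothetical:

    # Hypothetical ResourceQuota: caps how much one tenant namespace can consume,
    # so a single team cannot starve the other tenants in the shared cluster.
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: team-a-quota
      namespace: team-a          # hypothetical tenant namespace
    spec:
      hard:
        requests.cpu: "10"
        requests.memory: 20Gi
        pods: "50"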

Autonomy vs. Standardization

As a developer, you want the freedom to create whatever AWS resources you want. I was a developer once, and I didn't give any thought to cost or other best practices. After all, I am the developer, and the world shall bow to me! Developers want autonomy.

On the other hand, organizations need to enforce some standards so that developers can't just provision a Kubernetes cluster with public endpoints, install an add-on they are not supposed to, or UPGRADE the multi-tenant cluster without getting approval and testing the impact on other tenants.

Even though shift-left, i.e., developers doing more, is good, let's be honest—managing a Kubernetes cluster is A LOT and adds a lot to the developer's plate! For these reasons, Platform Teams were born!

Enter Platform Team

Platform teams take over the responsibility of creating and managing the cluster. With the platform team in the picture, the flow looks like this:

Step 1: The developer team asks the platform team to provision the appropriate AWS resources. In this example, we are using Amazon EKS for the application, but the concept can be extended to any other AWS service. This request for AWS resources is typically made via a ticketing system.

Step 2: The platform team receives the request.

Step 3: The platform team uses Infrastructure as Code (IaC), such as Terraform, CDK, etc., to provision the requested AWS resources and shares the credentials/access details with the developer team.
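
As an illustration, here is a minimal Terraform sketch of what provisioning an EKS cluster might look like. The cluster name is made up, and the IAM role and subnet variables are assumed to be defined elsewhere in the platform team's IaC repo; this is a sketch, not a production-ready module:

    # A minimal sketch, assuming the IAM role and network variables exist elsewhere.
    resource "aws_eks_cluster" "app_cluster" {
      name     = "team-a-cluster"                 # hypothetical cluster name
      role_arn = aws_iam_role.eks_cluster.arn     # assumed to be defined elsewhere

      vpc_config {
        subnet_ids              = var.private_subnet_ids   # assumed variable (private subnets)
        endpoint_public_access  = false                    # guardrail: no public endpoint
        endpoint_private_access = true
      }
    }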

Step 4: The developer team kicks off the CICD process. We are using a containerized workflow to illustrate the flow. Developers check in the code, Dockerfile, and manifest YAMLs to an application repository. CI tools (e.g., Jenkins, GitHub Actions) kick off, build the container image, and save the image in a container registry such as Amazon ECR.
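
For example, a GitHub Actions workflow for this CI step might look roughly like the sketch below. The registry URL, region, and commit-SHA tag are placeholders, and it assumes AWS credentials are already configured for the runner:

    # Hypothetical CI workflow: build the image and push it to Amazon ECR.
    name: build-and-push
    on:
      push:
        branches: [main]
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Log in to Amazon ECR
            run: |
              aws ecr get-login-password --region us-east-1 \
                | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
          - name: Build and push the image
            run: |
              docker build -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:${GITHUB_SHA} .
              docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:${GITHUB_SHA}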

Step 5: CD tools (e.g. Jenkins, Spinnaker) update the deployment manifest files with the tag of the container image.

Step 6: CD tools execute the command to deploy the manifest files into the cluster, which, in turn, deploys the newly built container into the Amazon EKS cluster.
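
To make steps 5 and 6 concrete, here is a minimal sketch of a deployment manifest whose image tag the CD tool would update before applying it. All names, the namespace, and the image URL are placeholders:

    # Hypothetical Deployment manifest; the CD tool rewrites the image tag for each new build.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
      namespace: team-a
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
            - name: my-app
              image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:v2   # tag updated by CD
              ports:
                - containerPort: 8080

The CD tool would then run something like “kubectl apply -f deployment.yaml” against the EKS cluster, which is the command referred to in step 6.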

In addition to the above, platform teams implement standards, such as creating OPA/Kyverno policies to ensure developers can't deploy non-standard applications. They also coordinate with the application teams and handle upgrades. Platform teams also help with cost optimization, troubleshooting, implementing GitOps, developing CICD strategy, etc. They enable developers to focus on the business needs without worrying about the management of the cluster. It's like reverse DevOps ;)
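
As an example of such a guardrail, here is a minimal sketch of a Kyverno policy that rejects images using the mutable "latest" tag. The policy name and message are made up; a real platform team would maintain a whole library of policies like this:

    # Hypothetical Kyverno ClusterPolicy: reject Pods whose images use the "latest" tag.
    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: disallow-latest-tag
    spec:
      validationFailureAction: Enforce
      rules:
        - name: require-pinned-image-tag
          match:
            any:
              - resources:
                  kinds:
                    - Pod
          validate:
            message: "Images must not use the 'latest' tag."
            pattern:
              spec:
                containers:
                  - image: "!*:latest"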

Conclusion

The platform team takes care of the infrastructure (with guardrails appropriate for the organization), and the developer team uses that infrastructure to deploy their applications. The platform team also handles the upgrades and maintenance of the infrastructure to reduce the burden on the developer team.

Now you know why platform teams are so important and how they were born! If this question comes up in your interview, make sure to knock it out of the park!

If you have found this newsletter helpful, and want to support me 🙏:

Check out my bestselling courses on AWS, System Design, Kubernetes, DevOps, and more: Max discounted links

AWS SA Bootcamp with Live Classes, Mock Interviews, Hands-On, Resume Improvement, and more (waitlist for next cohort): https://www.sabootcamp.com/

Keep learning and keep rocking 🚀,

Raj
