💰 Gen AI Cost Question Everyone Gets Wrong in Interviews


Hello Reader,

GenAI is expensive. Most teams find out how expensive after the bill arrives. The overspend is not random. It comes from the same mistakes made across almost every GenAI project, and most of them are easy to fix once you know where to look.

This is a popular interview topic. But when asked "How will you cost-optimize a Gen AI workflow and application?", some of the average answers I hear are:

  • I will optimize prompts
  • I will use cheaper models
  • I will reduce usage

Why are they average?

  • They are vague and show no architect-level thinking. For example, you can't simply switch to cheaper models if that sacrifices quality
  • They are only basic techniques; there are intermediate and advanced techniques you must know

Understand and memorize five to seven of the techniques below to delight the interviewer and to optimize cost in real-world projects.

The quick wins that most people skip

The first mistake is using the wrong model for the task. Opus is powerful but it is overkill for summarizing a document, formatting output, or running a basic chatbot. Haiku handles those tasks at a fraction of the cost. Sonnet covers most coding work and daily tasks.

Reserve Opus for deep architectural planning and complex parallel processing with heavy business logic. Matching the model to the task is the single fastest way to reduce spend without touching your architecture. Notice that I am giving concrete examples, not just saying "use cheaper models".
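Here is a minimal sketch of what matching the model to the task could look like in code. The task categories, the tier names used as values, and the route_model helper are illustrative assumptions, not part of any SDK; map each tier to whatever model IDs your provider exposes.

```python
# Hypothetical task-to-tier table; the categories mirror the examples in the text.
MODEL_TIERS = {
    "summarize": "haiku",
    "format": "haiku",
    "basic_chat": "haiku",
    "coding": "sonnet",
    "daily_task": "sonnet",
    "architecture_planning": "opus",
    "complex_business_logic": "opus",
}

# Assumption: unknown task types fall back to the mid tier.
DEFAULT_TIER = "sonnet"


def route_model(task_type: str) -> str:
    """Return the model tier for a task type, or the default tier if unknown."""
    return MODEL_TIERS.get(task_type, DEFAULT_TIER)


if __name__ == "__main__":
    for task in ("summarize", "coding", "architecture_planning", "something_else"):
        print(f"{task} -> {route_model(task)}")
```

In an interview, the table itself is the talking point: it shows you decided per task, not per project.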

The second mistake is letting context grow unchecked. Token cost grows with every message because the full prior conversation is resent to the model on each invocation. When you switch topics, type /clear to reset the context instead of carrying the old thread into the new one. It sounds trivial. At scale it is not.
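A rough way to see why this matters is the arithmetic below. It is a simplified model, assuming every turn adds the same number of tokens and the whole history is resent each time; real conversations vary, and the 500-token figure is made up for illustration.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent when every turn resends all prior turns.

    Turn k sends k * tokens_per_turn tokens, so the total is
    tokens_per_turn * (1 + 2 + ... + turns).
    """
    return tokens_per_turn * turns * (turns + 1) // 2


if __name__ == "__main__":
    one_long_chat = cumulative_input_tokens(turns=20, tokens_per_turn=500)
    # Same 20 turns, but the context is cleared after turn 10.
    cleared_halfway = 2 * cumulative_input_tokens(turns=10, tokens_per_turn=500)
    print(one_long_chat)    # 105000
    print(cleared_halfway)  # 55000
```

Under these assumptions, clearing once at the halfway point cuts input tokens by almost half, because the cost grows with the square of the conversation length.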

MCP servers are the hidden cost most people do not see. Every connected MCP server loads all its tool definitions into context on every message. If you type a two-word prompt and wonder why the token count is already high, MCP overhead is the reason.

Use skills instead. Skills only load their full definition when the prompt matches what the skill does. CLIs are even leaner. Large language models are already trained on AWS CLI commands. Running aws s3 ls to list buckets does not need an MCP server. It needs one line.

CLI plus skills will replace MCP as the default pattern for cost-conscious teams.

The CLAUDE.md or AGENTS.md file is underused. It injects your tech stack, coding conventions, architectural decisions, and project structure into every session automatically. You stop repeating yourself. The model stops exploring wrong paths. Back-and-forth conversation tokens drop significantly. If you are using GPT-based coding agents, the equivalent is AGENTS.md. Same concept, different filename.

Batching related tasks into a single prompt also matters more than people realize. Asking the model to summarize a file, then extract issues, then suggest fixes across three separate messages means the model re-reads the full prior context three times. One prompt with all three instructions reads the context once and plans better because it sees the full goal upfront.
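The batching idea can be sketched as a small prompt builder. The build_batched_prompt helper and the document tags are hypothetical names chosen for this example; the point is only that the document appears once and all instructions travel with it.

```python
def build_batched_prompt(document: str, instructions: list[str]) -> str:
    """Combine several related instructions into one prompt over one copy of the document."""
    steps = "\n".join(f"{i}. {text}" for i, text in enumerate(instructions, start=1))
    return (
        "Complete all of the following steps in order, using the document below.\n"
        f"{steps}\n\n"
        "<document>\n"
        f"{document}\n"
        "</document>"
    )


if __name__ == "__main__":
    prompt = build_batched_prompt(
        "FILE CONTENTS GO HERE",
        ["Summarize the file.", "Extract the issues.", "Suggest fixes."],
    )
    print(prompt)
```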

The intermediate layer most teams ignore

Memory is where a significant amount of waste hides. Without memory, the model rediscovers the same context every session. You re-describe preferences, decisions, and project background every time you start fresh.

With memory, the model retains summaries, preferences, and prior decisions so you pick up where you left off. Combine CLAUDE.md with memory and the token savings compound across every session.

Agent teams sound impressive and they are genuinely expensive. A single agent uses one unit of tokens. A sub-agent uses three to five times that because each sub-agent maintains its own separate context and cannot see what other sub-agents are doing. An agent team with full message bus and inter-agent communication multiplies that further.

Before adding sub-agents or agent teams, ask whether the task actually requires that architecture. Most tasks do not.

Vector database selection is a cost decision that gets made once and rarely revisited. OpenSearch Serverless has low latency but charges for reserved capacity even when idle. Aurora PostgreSQL with pgvector is a strong middle ground if you are already running a SQL database. Amazon S3 Vectors is the most cost-effective option for batch processing and cost-sensitive production RAG workloads, with slightly higher latency that is often within acceptable SLA ranges. Test your specific workload before committing.

RAG document hygiene is also a real cost driver. Nightly ingestion pipelines that have been running for a year accumulate embeddings for documents nobody queries anymore. Old documents still cost money in storage. They also slow retrieval and introduce irrelevant chunks into context, which degrades output quality and pushes teams toward re-ranking solutions that add more cost. Cleaning up stale documents is free and the impact is immediate.
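A cleanup job can start as something this small. The sketch assumes you already log the last time each document was retrieved; the last_queried mapping, the find_stale_documents helper, and the 180-day cutoff are all hypothetical. Deleting the matching embeddings from your vector store is left out because that call depends on the store you use.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def find_stale_documents(
    last_queried: dict[str, datetime],
    max_age_days: int,
    now: Optional[datetime] = None,
) -> list[str]:
    """Return IDs of documents not retrieved within the last max_age_days, sorted."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return sorted(doc_id for doc_id, seen in last_queried.items() if seen < cutoff)


if __name__ == "__main__":
    now = datetime(2025, 1, 1, tzinfo=timezone.utc)
    usage_log = {
        "pricing-2023.pdf": now - timedelta(days=400),
        "runbook.md": now - timedelta(days=10),
    }
    print(find_stale_documents(usage_log, max_age_days=180, now=now))
```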

The advanced techniques that impress interviewers

Semantic caching sits in front of the agent and catches queries with the same underlying intent even when the wording is different. A user asking "how do I reset my password" and another asking "I forgot my password, what do I do" are different strings with identical intent. Traditional caching misses the second query. Semantic caching catches it and returns the cached answer without hitting the agent or the model. Redis with LangChain is the standard implementation. At scale this saves a significant portion of inference cost.
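To make the mechanism concrete, here is an in-memory sketch of a semantic cache. It is not the Redis and LangChain setup mentioned above; the SemanticCache class, the 0.9 threshold, and especially toy_embed are assumptions for illustration. toy_embed is a keyword stand-in so the example runs without a model; in practice you would pass a real embedding function.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Return a cached answer when a new query is close enough to a stored one."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9) -> None:
        self._embed = embed
        self._threshold = threshold
        self._entries = []  # list of (query_vector, answer) pairs

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self._embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        query_vec = self._embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._threshold else None


def toy_embed(text: str) -> Vector:
    """Stand-in for a real embedding model: one dimension per known topic keyword."""
    lowered = text.lower()
    return [float("password" in lowered), float("invoice" in lowered)]


if __name__ == "__main__":
    cache = SemanticCache(embed=toy_embed)
    cache.put("how do I reset my password", "Use the reset link on the login page.")
    print(cache.get("I forgot my password, what do I do"))  # cache hit
    print(cache.get("where is my invoice"))                 # None, goes to the model
```

The threshold is the design choice to defend: set it too low and users get answers to questions they did not ask, set it too high and the cache rarely hits.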

The Karpathy method, also called the LLM wiki, is a living markdown file of lessons learned, patterns, and decisions maintained as one-liners. It feeds structured context to the model instead of requiring it to rediscover the same information through conversation. No re-explanation of past failures.

The result is compact, structured context instead of verbose conversation replay. Think of it as CLAUDE.md with accumulated project intelligence built in.

Model distillation is the technique with the highest ceiling. Use a large model like Opus to label or categorize a large dataset, then train a smaller model like Haiku on that output. The result is a task-specific model that can reach 90%-plus of the quality at 10 to 50 times lower cost, and it runs significantly faster. Amazon Bedrock supports model distillation natively. For anyone building a proof of concept and showcasing it on LinkedIn and GitHub, this is the kind of work that gets recruiter attention.
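The first half of that workflow, collecting the large model's outputs as training data, can be sketched as follows. Everything here is an assumption for illustration: fake_teacher stands in for a call to the large model, and the prompt/completion JSONL layout is a generic shape, not the format any specific service requires. Check your platform's documentation for the exact schema before training.

```python
import json
from typing import Callable, Iterable


def build_distillation_records(
    inputs: Iterable[str], teacher: Callable[[str], str]
) -> list[dict[str, str]]:
    """Label each input with the teacher model and pair it with its output."""
    return [{"prompt": text, "completion": teacher(text)} for text in inputs]


def to_jsonl(records: list[dict[str, str]]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


def fake_teacher(text: str) -> str:
    """Stand-in for the large model: a trivial keyword-based ticket classifier."""
    return "billing" if "invoice" in text.lower() else "general"


if __name__ == "__main__":
    tickets = ["My invoice is wrong", "How do I change my avatar?"]
    print(to_jsonl(build_distillation_records(tickets, fake_teacher)))
```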

Standard cloud cost practices apply to GenAI workloads the same way they apply to everything else: enterprise discounts, reserved capacity on Bedrock, spot instances for EC2 inference, right-sizing, scaling inference endpoints to zero when idle, and cost allocation tags by team and project. Bedrock batch inference runs 50% cheaper than on-demand for non-real-time tasks. Prompt caching reduces cost by up to 90% and is different from semantic caching: it discounts repeated prompt content that still goes to the model, while semantic caching skips the model call entirely. Bedrock guardrails block malicious and irrelevant inputs before they reach the model, which saves wasted inference on inputs that would have produced no useful output anyway.
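The two percentages above are easier to discuss with numbers attached. This is a back-of-the-envelope sketch: the per-1K-token prices and token volumes are made up, the 90% figure is treated as a discount on cached input tokens only, and cache-write charges are ignored. Use your provider's current price list for real estimates.

```python
BATCH_DISCOUNT = 0.50       # from the text: batch runs 50% cheaper than on-demand
CACHE_READ_DISCOUNT = 0.90  # from the text: up to 90%; assumed to apply to cached input


def on_demand_cost(
    input_tokens: float, output_tokens: float,
    input_price_per_1k: float, output_price_per_1k: float,
) -> float:
    """Cost of a workload at on-demand prices."""
    return (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k


def batch_cost(
    input_tokens: float, output_tokens: float,
    input_price_per_1k: float, output_price_per_1k: float,
) -> float:
    """Cost of the same workload run through batch inference."""
    base = on_demand_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k)
    return base * (1 - BATCH_DISCOUNT)


def cached_cost(
    input_tokens: float, output_tokens: float,
    input_price_per_1k: float, output_price_per_1k: float,
    cached_fraction: float,
) -> float:
    """Cost when cached_fraction of the input tokens are served from the prompt cache."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    effective_input = input_tokens * (1 - cached_fraction * CACHE_READ_DISCOUNT)
    return on_demand_cost(effective_input, output_tokens, input_price_per_1k, output_price_per_1k)


if __name__ == "__main__":
    args = (1_000_000, 200_000, 0.003, 0.015)  # hypothetical volumes and prices
    print(round(on_demand_cost(*args), 2))
    print(round(batch_cost(*args), 2))
    print(round(cached_cost(*args, cached_fraction=0.8), 2))
```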

Why this matters for your career

This question is coming up in almost every SA interview right now. Interviewers are not looking for one or two tips. They are looking for candidates who can walk through a tiered approach, explain the trade-offs, and connect the techniques to real architectural decisions.

If you can speak to five to seven of these techniques with enough depth to explain when and why to use each one, you will stand out in the interview and deliver better results on the job.

Keep learning and keep rocking 🚀,

Raj

P.S. If you have found this newsletter helpful, and want to support me 🙏:

Check out my bestselling courses on AWS, System Design, Kubernetes, DevOps, and more: Max discounted links

Check out my YouTube channel for Cloud Gen AI tutorial and interview prep videos: Here

AWS SA Bootcamp with Live Classes, Mock Interviews, Hands-On, Resume Improvement and more: https://www.sabootcamp.com/​
