Your AI Agent Has No Memory. Here Is How to Fix That.


Hello Reader,

Most AI agents built today have a fundamental flaw. They forget everything the moment a session ends. You tell the agent your preferences, your constraints, your context. You close the tab. You come back. It has no idea who you are.

This is not a bug. It is the default state of every LLM and agent. They are stateless by design. And if you are building agents or going into SA interviews, understanding how memory works at a system design level is now a baseline expectation.

Why loading everything into context is not the answer

The obvious fix is to save every conversation and load it all back into the context window when the session resumes. This works at small scale and breaks at production scale for three reasons.

Context windows degrade before they fill up. A model with a 200,000 token context window starts producing lower quality output around 70 to 80 percent capacity. You hit the quality ceiling long before you hit the technical limit.

Context windows also treat every token equally. Your name gets the same weight as a throwaway comment from three weeks ago. There is no sense of importance or relevance. Everything competes for the same attention.

Cost scales linearly with context size. The more you load, the more tokens the model consumes on every invocation. At scale this becomes a significant and avoidable expense.

The answer is not to load everything. The answer is to load the right things.

The four types of memory every agent needs

The CoALA framework from 2023 defines four memory types that map directly to how human memory works.

Working memory is what the agent is actively thinking about right now. This is the current conversation context. It is short-term and lives in the context window.

Procedural memory is muscle memory. The agent knows how to take notes, when to escalate, how to format a response, without being told every time. This gets loaded into the system prompt.

Semantic memory is general knowledge and facts accumulated over time. User preferences, team communication norms, known constraints. High importance semantic memory gets loaded into a profile block in the system prompt. Lower importance items stay in long-term storage and get retrieved on demand.

Episodic memory is autobiographical. Specific experiences from the past. The last time this user asked about this topic, the answer was X. The last time this design was proposed, the VP rejected it because of budget. Episodic memory gets loaded into the user prompt area at session start.

An agent with all four types working together does not need to rediscover context every session. It picks up where it left off.

How short-term memory becomes long-term memory

The extraction process is where most implementations fall short. A separate LLM process periodically reviews the working memory and extracts what matters. It identifies preferences, decisions, patterns, and past experiences and writes them into long-term storage.

The working memory is not wiped immediately after a session ends, which gives this process time to run.

This extraction job can run after each session, on a cron schedule every hour or two, or during idle time when agents are not actively being used. The idle time approach is the most popular right now.

Claude recently released what they are calling dream mode. It is a nightly batch job that processes working memory during off-peak hours. If you are coming from a mainframe background, it is literally a nightly batch job with a better name.

The storage problem most architects overlook

Long-term memory needs to live somewhere. Real-world agents also need to get data from other data sources. SQL databases for structured data. Vector databases for semantic and episodic memory. Graph databases for relationship data. NoSQL for flexible schema data. Semantic cache for repeated query patterns.

This is a real architectural trade-off that comes up in SA interviews. The question is not just what storage type to use. It is how to manage the complexity of multiple storage types at scale and what the cost and latency implications are of each approach.

Study based on the role and interview

For a general Solutions Architect interview, you need to know the four memory types, how they relate to each other, how extraction works, and what the system design looks like end to end. That is enough to answer any standard GenAI architecture question.

For a GenAI specialist SA role, you need to go deeper. Is episodic memory more important than semantic memory for a given use case? Should they be stored in the same database or separate databases? Which memory type mathematically contributes more value to output quality? What are the latency trade-offs between retrieval approaches?

For an infrastructure SA role, the focus shifts to provisioning and resilience. How do you provision the vector database that holds long-term memory? What does failover look like? What is the disaster recovery strategy?

For an AI engineer role, you need LLMOps knowledge on top of all of this. How do you detect when extracted memories are stale or incorrect? How do you version and roll back memory state?

Study to the role. There is no end to how deep this topic goes. The candidates who win are the ones who know exactly how deep to go for the job they are interviewing for.

The bigger picture

GenAI context switches fast. New models, new tools, new frameworks every week. The candidates who keep up are not studying individual tool features. They are studying connected concepts. Memory, RAG, MCP, agents, how they connect to each other, what the real-world implications are for security, cost, and scale.

That depth of understanding is what makes an answer stand out in a debrief. And it is what makes an architect valuable on the job when the tool landscape shifts again next quarter.

Here is a video going in more depth on this topic 👇

video preview​

Keep learning and keep rocking 🚀,

Raj

P.S. If you have found this newsletter helpful, and want to support me 🙏:

Checkout my bestselling courses on AWS, System Design, Kubernetes, DevOps, and more: Max discounted links​

Checkout my YouTube channel for Cloud Gen AI tutorial and interview prep videos: Here​

AWS SA Bootcamp with Live Classes, Mock Interviews, Hands-On, Resume Improvement and more: https://app.cloudwithraj.com/​

Fast Track To Cloud

Free Cloud Interview Guide to crush your next interview. Plus, real-world answers for cloud interviews, and system design from a top AWS Solutions Architect.

Read more from Fast Track To Cloud

Hello Reader, GenAI is expensive. Most teams find out how expensive after the bill arrives. The overspend is not random. It comes from the same mistakes made across almost every GenAI project, and most of them are easy to fix once you know where to look. This is a popular interview topic. But when asked "How will you cost optimize Gen AI workflow and application?", some of the average answers I hear is: I will optimize prompts I will use cheaper models I will reduce usage Why are they...

Hello Reader, Cloud With Raj is expanding, and looking to hire 4th fulltime position: Customer Success Manager Job Description Title: Customer Success Manager Responsibilities: Being the point of contact and managing 80-100 clients for the 12-week duration of the program. Host 1-on-1 calls with the clients every 2 weeks. Proactively guide clients through activation points to get results. Our customers’ success is our biggest success. Suggest necessary changes to the client fulfillment process...

Hello Reader, A new job type is getting very popular, called FDE or Forward Deployed Engineer. I analyzed multiple FDE job roles at OpenAI and Google. In this newsletter, we are going to go over what qualities you need to become an FDE, how to crack the interview, and how to excel at the job. OpenAI FDE Role Snippet Let's start with a specific job requirement from FDE posting: Own delivery across multiple deployments from first prototype to stable production. Scope work, sequence delivery,...