Agent Memory Explained Simply


Hello Reader,

Have you ever repeated yourself to an AI and thought, “Didn’t we already talk about this?” That frustration isn’t your fault. It’s how GenAI systems work by default. To overcome this, we need to implement memory. Now, there is a lot of confusion around this - do we need different types of memory, does this make RAG obsolete, and how does this even work? Let's learn all of it in today's edition.

Agents Are Stateless

By default, agents are stateless. Previously, we combated this by adding everything from the current session to the context each time. However, this was unsustainable because:

  • The context window keeps growing with every turn, increasing cost
  • The LLM has to reprocess EVERYTHING over and over again. Imagine you are discussing designing an app, and after 1,000 lines of back and forth you arrive at conclusion A. At every subsequent prompt, the LLM goes through the same 1,000 lines and derives conclusion A all over again. The ideal state is that it goes directly to conclusion A and then processes line 1,001 (the new input) - see the sketch after this list
  • Processing gets slower and slower
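To make that concrete, here is a minimal sketch (in Python, with a stand-in call_llm function in place of a real model API - the names here are illustrative, not any specific library) of the naive approach: every turn, the entire history is resent and reprocessed.

```python
# Naive stateless chat loop: the whole history is resent on every call.
# call_llm is a stand-in for whatever model API you actually use.

def call_llm(messages: list[dict]) -> str:
    return f"(model reply after reading {len(messages)} messages)"  # placeholder

history: list[dict] = []

def chat(user_input: str) -> str:
    history.append({"role": "user", "content": user_input})
    # The ENTIRE history goes to the model each time, so turn N+1
    # pays to reprocess turns 1..N all over again.
    reply = call_llm(history)
    history.append({"role": "assistant", "content": reply})
    return reply

chat("Let's design an app")   # 1 message sent
chat("Add a login screen")    # 3 messages sent
chat("Now add payments")      # 5 messages sent - it only grows
```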

Hence the concept of memory was born. Let's take a look at that.

Memory - Short Vs Long

There are two kinds of memory - short term and long term. This is very similar to human memory. Short term memory is related to the current session.

This short term memory generally lives on the same hardware stack the LLM is running on. Since most LLMs these days run on GPUs, the short term memory sits in Video RAM (VRAM). What do we know about RAM?

  • RAM, or Random Access Memory, is fast and ephemeral (a fancy word for temporary)
  • As with anything in computer hardware - if it's fast, it's expensive (sounds like a sports car, doesn't it?)
  • Because it's expensive, it doesn't offer unlimited storage. If you have a graphics card in your personal computer, you'll notice the Video RAM is much smaller than your hard disk
  • But it serves an important function: it holds the current session's conversation in key-value format. Recall the earlier scenario where the LLM had to reprocess the same 1,000 lines each time - with memory, that's no longer the case. Because the conclusion/summary is saved in this memory, the LLM can just pull the summary and process only the new input (see the sketch after this list)
  • Fun fact: if you use Cline with VS Code, which shows the context size on each LLM call, you'll see the context window shrink on subsequent prompts (previously it would only grow). It shrinks because the info was saved to memory!
  • This approach is cheaper and faster than reprocessing a large number of tokens every time
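Here is a rough sketch of that idea: a short term memory that keeps a running summary plus only the latest turns, so the model no longer rereads the whole transcript. (This models the behavior at the application level; the actual KV cache lives inside the GPU/serving stack, and a real agent would use an LLM call to fold old turns into the summary.)

```python
# Sketch of short term memory as a rolling summary + recent turns.
# In a real agent, an LLM call would summarize old turns; here we just
# truncate them to keep the example tiny.

from dataclasses import dataclass, field

@dataclass
class ShortTermMemory:
    summary: str = ""                                   # compressed gist of older turns
    recent: list[dict] = field(default_factory=list)    # last few raw messages
    max_recent: int = 6

    def add(self, role: str, content: str) -> None:
        self.recent.append({"role": role, "content": content})
        while len(self.recent) > self.max_recent:
            old = self.recent.pop(0)
            self.summary += f" {old['role']} said: {old['content'][:80]}."

    def as_context(self) -> list[dict]:
        # The model sees a short summary + recent turns,
        # not the full 1,000-line history.
        return ([{"role": "system", "content": "So far:" + self.summary}]
                + self.recent)
```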

Now, short term memory is great for the current session. But what if you close your session and come back later? It would be terribly inconvenient if you had to repeat yourself. Short term memory is ephemeral, so how can we persist the info? This is where long term memory comes into play!

Periodically, certain info is extracted from short term memory and saved to long term memory. Long term memory lives in a vector store, which typically sits on a hard disk. Hence it's durable and cheap, but a little slower than short term memory. That's okay, because once the info is retrieved and used in another session, it lands in short term memory again for faster access.
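As a toy illustration, here is what the write and read paths of long term memory look like. A plain Python list plus a fake embedding function stand in for a real embedding model and a vector store such as Pinecone, Chroma, or OpenSearch.

```python
# Toy long term memory: embed a fact, store it durably, search by similarity.
# embed() is a fake embedding; swap in a real embedding model + vector DB.

import math

long_term_store: list[tuple[list[float], str]] = []

def embed(text: str) -> list[float]:
    vec = [0.0] * 16
    for i, ch in enumerate(text.lower()):
        vec[i % 16] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def save_to_long_term(fact: str) -> None:
    long_term_store.append((embed(fact), fact))

def search_long_term(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    scored = sorted(
        ((sum(a * b for a, b in zip(q, e)), text) for e, text in long_term_store),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

save_to_long_term("User wants to vacation in NYC between Jan 25th and 30th")
save_to_long_term("User prefers indoor activities on cold days")
print(search_long_term("What did we decide about the NYC trip?", k=1))
```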

Now, even though long term memory is cheaper compared to short term memory, you don't want to fill it up with ALL the info. Hence only the items below are extracted from short term memory:

  • User preferences - e.g. the user likes indoor activities on cold days
  • Semantics - raw data with facts, e.g. the user wants to vacation in NYC between Jan 25th and 30th
  • Summary - e.g. the user and agent discussed the January vacation plan and decided on activities
  • This is why you sometimes wonder, "Didn't I say this to the LLM in a previous session?" If that convo doesn't fall into these three categories, it never made it to long term memory

How is Memory Extraction Done?

This is actually simple! Think of a process that can extract specific things from a wall of text - an LLM!

An LLM with a prompt runs periodically and extracts the three categories. You can customize this: based on the agent type, you can extract other types of info and save them into long term memory.
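As a rough sketch (the prompt wording, the call_llm stand-in, and the save_fn hook are my own illustrations, not a specific library's API), the extraction pass could look like this:

```python
# Periodic extraction pass: prompt an LLM to pull out preferences,
# semantic facts, and a summary, then persist each item to long term memory.

import json

EXTRACTION_PROMPT = """From the conversation below, extract JSON with keys:
  "preferences": user likes/dislikes,
  "semantics":   concrete facts (names, dates, decisions),
  "summary":     a one-paragraph recap of what was discussed and decided.

Conversation:
{transcript}
"""

def extract_memories(transcript: str, call_llm, save_fn) -> dict:
    raw = call_llm(EXTRACTION_PROMPT.format(transcript=transcript))
    memories = json.loads(raw)  # assumes the model returned valid JSON
    for category, items in memories.items():
        for item in (items if isinstance(items, list) else [items]):
            save_fn(f"{category}: {item}")  # e.g. save_to_long_term from earlier
    return memories
```

Now, the question is how the agent gets this info back out of memory based on the user's query. Let's find out.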

  • User sends a prompt to the agent
  • The agent searches the vector store (long term memory) with the prompt and gets related info
  • The agent adds this info to the context before sending it to the LLM
  • The LLM generates an answer from the original prompt + added context and sends it to the user
  • What does this remind you of? RAG!
    • Agents actually use RAG to get info from memory!
    • RAG is not obsolete - it's alive and well!
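Stitched together, the retrieval step might look like this, reusing the toy search_long_term from the earlier sketch and a call_llm stand-in for your model API:

```python
# Retrieval-augmented answer: search long term memory with the user's prompt,
# add what comes back to the context, and let the LLM answer from both.

def answer_with_memory(user_prompt: str, call_llm, search_fn=search_long_term) -> str:
    related = search_fn(user_prompt, k=3)                   # 1. search the vector store
    context = "Relevant memories:\n" + "\n".join(related)   # 2. add them to the context
    return call_llm(f"{context}\n\nUser: {user_prompt}")    # 3. answer from prompt + context
```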

Implementation

You'll separate yourself from the pack if you can talk about the implementation!

I am a tad biased toward AWS (for the new readers - I was a Principal SA at AWS, where I spent 6.5 years before leaving to build my own startup). I am showing the implementation with AWS, but the major components are open source:

  • Implement the agent with AWS Strands Agents (open source), or LangChain/LangGraph/CrewAI (open source), etc.
  • Consume the LLM from Amazon Bedrock or from any other model provider
  • You can manage memory yourself - short term will still be VRAM, and long term will be a vector store such as Pinecone or an open source option. Or you can run both the open source agent and the memory on Amazon Bedrock AgentCore, where AWS manages and scales the memory for you in a pay-as-you-go model
  • AgentCore also comes with observability baked in, which makes it easier to troubleshoot and optimize agents (a minimal Bedrock call is sketched below)
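For the Bedrock piece, a minimal sketch using boto3's Converse API might look like the following. The model ID and region are placeholder choices - use whatever your account has access to - and memory itself would be either your own code from the earlier sketches or managed for you by AgentCore.

```python
# Minimal Bedrock call via the Converse API. Retrieved memories are simply
# prepended to the prompt; model ID and region are placeholder choices.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def call_bedrock(prompt: str) -> str:
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

# Example: plug it into the earlier retrieval sketch
# answer_with_memory("Where am I vacationing in Jan?", call_bedrock)
```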

For how deep you should go on this in interviews, and for an explanation with a use case, check out my detailed video on this topic:

video preview

Hope this helped you understand GenAI memory and gave you the answer: RAG is still alive and kicking! Till next time!

If you have found this newsletter helpful, and want to support me 🙏:

Check out my bestselling courses on AWS, System Design, Kubernetes, DevOps, and more: Max discounted links

AWS SA Bootcamp with Live Classes, Mock Interviews, Hands-On, Resume Improvement and more: https://www.sabootcamp.com/

Keep learning and keep rocking 🚀,

Raj

