Agent Memory Explained Simply


Hello Reader,

Have you ever repeated yourself to an AI and thought, “Didn’t we already talk about this?” That frustration isn’t your fault. It’s how GenAI systems work by default. To overcome this, we need to implement memory. Now, there is a lot of confusion around this - do we need different types of memory, does this make RAG obsolete, and how does it even work? Let's learn all of it in today's edition.

Agents Are Stateless

By default, agents are stateless. Previously, we combated this by adding everything from the current session to the context on each call (a quick sketch of this naive approach follows the list). However, this was unsustainable because:

  • The context window grows with every turn, increasing cost
  • The LLM has to reprocess EVERYTHING over and over again. Imagine you are discussing designing an app, and after 1000 lines of back and forth you arrive at conclusion A. At every subsequent prompt, the LLM goes through the same 1000 lines and re-derives conclusion A. The ideal state is that it goes directly to conclusion A and then processes the 1001st line (the new input)
  • Processing gets slower as the conversation grows
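Here is a minimal sketch of that naive approach. `call_llm` is a hypothetical placeholder for any chat-completion API; the point is only that the full history travels on every single call.

```python
# The "no memory" approach: every turn re-sends the ENTIRE conversation,
# so tokens processed (and cost) grow with the length of the session.

def call_llm(messages: list[dict]) -> str:
    # Placeholder: imagine this hits your model provider.
    return f"(model reply after reading {len(messages)} messages)"

history: list[dict] = []

def chat(user_input: str) -> str:
    history.append({"role": "user", "content": user_input})
    # The model re-reads ALL previous turns on every call.
    reply = call_llm(history)
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("Let's design an app"))
print(chat("Add a login page"))  # resends every earlier turn again
```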

Hence the concept of memory was born. Let's take a look at that.

Memory - Short Vs Long

There are two kinds of memory - short term and long term. This is very similar to human memory. Short term memory is related to the current session.

This short term memory generally lives on the same hardware stack the LLM is running on. Since most LLMs these days run on GPUs, the short term memory will be the video RAM (VRAM). What do we know about RAM?

  • RAM, or Random Access Memory, is fast and ephemeral (a fancy word for temporary)
  • As with anything in computer hardware - if it's fast, it's expensive (sounds like a sports car, doesn't it?)
  • Because it's expensive, it doesn't have unlimited storage. If you have a graphics card in your personal computer, you'll notice the video RAM is much smaller than your hard disk
  • But it serves an important function: it holds the current session's conversation in key-value format. Recall the earlier scenario where the LLM has to reprocess the same 1000 lines each time - with memory, that's not the case. Because the conclusion/summary is saved in this memory, the LLM can just grab the summary and process the new input from the context (see the sketch after this list)
  • Fun fact: if you use Cline with VS Code, which shows the context size on each LLM call, you'll see the context window shrink on subsequent prompts (previously it would only grow). It shrinks because the info was saved in memory!
  • This approach is cheaper than processing a large number of tokens, and it's faster
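Here is a conceptual sketch of that idea - keep a running summary plus only the newest turns, instead of resending 1000 lines. `summarize` is a hypothetical placeholder for an LLM-written summary; real agent frameworks handle this for you.

```python
# Short term (session) memory as a rolling summary plus the latest turns.

def summarize(text: str) -> str:
    # Placeholder: imagine an LLM writing a short summary of `text`.
    return text[:200]

class SessionMemory:
    def __init__(self, keep_last: int = 4):
        self.summary = ""            # "conclusion A" lives here, not 1000 lines
        self.recent: list[str] = []  # only the newest turns stay verbatim
        self.keep_last = keep_last

    def add(self, turn: str) -> None:
        self.recent.append(turn)
        if len(self.recent) > self.keep_last:
            # Fold older turns into the summary instead of resending them.
            overflow = self.recent[:-self.keep_last]
            self.summary = summarize(self.summary + " " + " ".join(overflow))
            self.recent = self.recent[-self.keep_last:]

    def context(self) -> str:
        # What actually goes to the LLM: the summary plus the newest turns only.
        return f"Summary so far: {self.summary}\nRecent turns: {' | '.join(self.recent)}"
```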

Now, short term memory is great for the current session. But what if you close your session and come back later? It'd be terribly inconvenient if you had to repeat yourself. Short term memory is ephemeral, so how can we persist the info? This is where long term memory comes into play!

Periodically, certain info from short term memory is extracted and saved to long term memory. This long term memory is saved in a vector store, which typically lives on a hard disk. Hence it's durable and cheap, but a little slower than short term memory. That's okay, because once the info is retrieved in another session and used, it's saved in short term memory again for faster access.
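A sketch of long term memory as a tiny on-disk "vector store" is below. The `embed` function is a toy stand-in for a real embedding model, and the JSONL file stands in for a real vector store.

```python
# Long term memory: embed extracted info and persist it on disk,
# so a later session can search for it.
import json
import math
import pathlib

def embed(text: str) -> list[float]:
    # Toy embedding for illustration only; use a real embedding model in practice.
    vec = [0.0] * 16
    for i, ch in enumerate(text.lower()):
        vec[i % 16] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

STORE = pathlib.Path("long_term_memory.jsonl")  # on disk: durable, cheap, slower

def save_memory(text: str, kind: str) -> None:
    record = {"kind": kind, "text": text, "vector": embed(text)}
    with STORE.open("a") as f:
        f.write(json.dumps(record) + "\n")

save_memory("User likes indoor activities on cold days", kind="preference")
```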

Now, even though long term memory is cheaper compared to short term memory, you don't want to fill it up with ALL the info. Hence, only the things below are extracted from short term memory (a sketch of what these records might look like follows the list):

  • User preferences - e.g., "I like indoor activities on cold days"
  • Semantics - raw data with facts, e.g., "User wants to vacation in NYC between Jan 25th and 30th"
  • Summary - e.g., "The user and agent discussed the January vacation plan and decided on activities"
  • This is why you sometimes wonder, "Didn't I say this to the LLM in a previous session?" If that conversation doesn't fall into these three categories, it never made it to long term memory
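As a hypothetical illustration, the stored records might look something like this:

```python
# Example records for the only three categories that reach long term memory.
extracted_records = [
    {"kind": "preference", "text": "Likes indoor activities on cold days"},
    {"kind": "semantic",   "text": "Wants to vacation in NYC between Jan 25th and 30th"},
    {"kind": "summary",    "text": "User and agent discussed the January vacation plan and decided on activities"},
]
```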

How is Memory Extraction Done?

This is actually simple! Think about a process that can extract certain things from a wall of text - an LLM!

An LLM with an extraction prompt runs periodically and extracts the three categories. You can customize this: based on the agent type, you can extract other types of info and save them into long term memory (a sketch of such an extraction job is below). After that, the question is: how does the agent get the info from memory based on the user query? Let's find out.
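A sketch of that periodic extraction job, assuming the hypothetical `call_llm` placeholder again; the returned records would then be written to the vector store (as in `save_memory` above).

```python
# Periodically ask an LLM to pull the three categories out of the transcript.
import json

EXTRACTION_PROMPT = """From the conversation below, extract:
1. user preferences, 2. facts (semantics), 3. a short summary.
Return JSON with keys "preferences", "facts", "summary".

Conversation:
{transcript}
"""

def call_llm(prompt: str) -> str:
    # Placeholder: imagine a real model call here.
    return '{"preferences": [], "facts": [], "summary": ""}'

def extract_memories(transcript: str) -> dict:
    raw = call_llm(EXTRACTION_PROMPT.format(transcript=transcript))
    return json.loads(raw)  # then write each item to the vector store
```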

  • The user sends a prompt to the agent
  • The agent searches the vector store (long term memory) with the prompt and gets related info
  • That info is added to the context before sending it to the LLM
  • The LLM generates an answer based on the original prompt + added context and sends it to the user (a sketch of this flow follows the list)
  • What does this remind you of? RAG!
    • Agents actually use RAG to get info from memory!
    • RAG is not obsolete - it's alive and well!
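A sketch of that retrieval flow, reusing the toy `embed` and the string-prompt `call_llm` placeholders from the earlier sketches: embed the prompt, find similar memories in the on-disk store, add them to the context, then call the model.

```python
# Retrieval-augmented generation over the long term memory store.
import json
import pathlib

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # vectors are already normalized

def search_memory(query_vec: list[float], k: int = 3) -> list[str]:
    store = pathlib.Path("long_term_memory.jsonl")
    if not store.exists():
        return []
    scored = []
    for line in store.read_text().splitlines():
        rec = json.loads(line)
        scored.append((cosine(query_vec, rec["vector"]), rec["text"]))
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

def answer(prompt: str) -> str:
    memories = search_memory(embed(prompt))                  # search long term memory
    context = "Relevant memories:\n" + "\n".join(memories)   # add to the context
    return call_llm(context + "\n\nUser: " + prompt)         # generate the answer
```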

Implementation

You'll separate yourself from the pack if you can talk about the implementation!

I am a tad biased toward AWS (for new readers - I was a Principal SA at AWS, where I spent 6.5 years before leaving to build my own startup). I am showing the implementation with AWS, but the major components are open source:

  • Implement the agent with AWS Strands (open source), or LangChain/LangGraph/CrewAI (open source), etc.
  • Consume the LLM from Amazon Bedrock or from any model provider (a minimal sketch follows this list)
  • You can manage memory yourself - short term will still be VRAM, long term will be Pinecone or an open source vector store. Or you can run both the open source agents and the memory on Amazon Bedrock AgentCore. AWS manages and scales the memory for you, in a pay-as-you-go model.
  • AgentCore also comes with observability baked in, which makes it easier to troubleshoot and optimize agents
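As a minimal sketch of the "consume the LLM from Amazon Bedrock" piece, here is a call through boto3's Converse API. The model ID and region are placeholders, and the memory lookup shown earlier would run first to build `context`; the agent framework and AgentCore memory would wrap around this.

```python
# Call a Bedrock model with memory-derived context prepended to the prompt.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask_bedrock(prompt: str, context: str = "") -> str:
    response = client.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
        messages=[{
            "role": "user",
            "content": [{"text": f"{context}\n\n{prompt}".strip()}],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]
```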

For how deep you should go on this in interviews, and for an explanation with a use case, check out my detailed video on this topic:

video preview

Hope this helped you understand GenAI memory, and gave you the answer that RAG is still alive and kicking! Till next time!

If you have found this newsletter helpful, and want to support me 🙏:

Check out my bestselling courses on AWS, System Design, Kubernetes, DevOps, and more: Max discounted links

AWS SA Bootcamp with Live Classes, Mock Interviews, Hands-On, Resume Improvement and more: https://www.sabootcamp.com/

Keep learning and keep rocking 🚀,

Raj

