Agent Memory Explained Simply


Hello Reader,

Have you ever repeated yourself to an AI and thought, “Didn’t we already talk about this?” That frustration isn’t your fault. It’s how GenAI systems work by default. To overcome this, we need to implement memory. Now, there is a lot of confusion around this - do we need different types of memory, does this make RAG obsolete, and how does this even work? Let's learn all of it in today's edition.

Agents Are Stateless

By default, agents are stateless. Previously, we combated this by adding everything from the current session to the context each time. However, this was unsustainable because:

  • The context window keeps growing with every turn, increasing cost
  • The LLM has to reprocess EVERYTHING over and over again. Imagine you are discussing designing an app, and after 1,000 lines of back and forth you arrive at conclusion A. At every subsequent prompt, the LLM goes through the same 1,000 lines and derives conclusion A all over again. The ideal state is that it goes directly to conclusion A and then processes line 1,001 (the new input) - see the sketch after this list
  • Processing gets slower and slower
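To make that concrete, here is a minimal sketch (in Python, with a stand-in call_llm function in place of a real model API - the names here are illustrative, not any specific library) of the naive approach: every turn, the entire history is resent and reprocessed.

```python
# Naive stateless chat loop: the whole history is resent on every call.
# call_llm is a stand-in for whatever model API you actually use.

def call_llm(messages: list[dict]) -> str:
    return f"(model reply after reading {len(messages)} messages)"  # placeholder

history: list[dict] = []

def chat(user_input: str) -> str:
    history.append({"role": "user", "content": user_input})
    # The ENTIRE history goes to the model each time, so turn N+1
    # pays to reprocess turns 1..N all over again.
    reply = call_llm(history)
    history.append({"role": "assistant", "content": reply})
    return reply

chat("Let's design an app")   # 1 message sent
chat("Add a login screen")    # 3 messages sent
chat("Now add payments")      # 5 messages sent - it only grows
```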

Hence the concept of memory was born. Let's take a look at that.

Memory - Short Vs Long

There are two kinds of memory - short term and long term. This is very similar to human memory. Short term memory is related to the current session.

This short term memory generally lives on the same hardware stack the LLM is running on. Since most LLMs these days run on GPUs, the short term memory sits in Video RAM (VRAM). What do we know about RAM?

  • RAM, or Random Access Memory, is fast and ephemeral (a fancy word for temporary)
  • As with anything in computer hardware - if it's fast, it's expensive (sounds like a sports car, doesn't it?)
  • Because it's expensive, it doesn't offer unlimited storage. If you have a graphics card in your personal computer, you'll notice the Video RAM is much smaller than your hard disk
  • But it serves an important function: it holds the current session's conversation in key-value format. Recall the earlier scenario where the LLM had to reprocess the same 1,000 lines each time - with memory, that's no longer the case. Because the conclusion/summary is saved in this memory, the LLM can just pull the summary and process only the new input (see the sketch after this list)
  • Fun fact: if you use Cline with VS Code, which shows the context size on each LLM call, you'll see the context window shrink on subsequent prompts (previously it would only grow). It shrinks because the info was saved to memory!
  • This approach is cheaper and faster than reprocessing a large number of tokens every time
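Here is a rough sketch of that idea: a short term memory that keeps a running summary plus only the latest turns, so the model no longer rereads the whole transcript. (This models the behavior at the application level; the actual KV cache lives inside the GPU/serving stack, and a real agent would use an LLM call to fold old turns into the summary.)

```python
# Sketch of short term memory as a rolling summary + recent turns.
# In a real agent, an LLM call would summarize old turns; here we just
# truncate them to keep the example tiny.

from dataclasses import dataclass, field

@dataclass
class ShortTermMemory:
    summary: str = ""                                   # compressed gist of older turns
    recent: list[dict] = field(default_factory=list)    # last few raw messages
    max_recent: int = 6

    def add(self, role: str, content: str) -> None:
        self.recent.append({"role": role, "content": content})
        while len(self.recent) > self.max_recent:
            old = self.recent.pop(0)
            self.summary += f" {old['role']} said: {old['content'][:80]}."

    def as_context(self) -> list[dict]:
        # The model sees a short summary + recent turns,
        # not the full 1,000-line history.
        return ([{"role": "system", "content": "So far:" + self.summary}]
                + self.recent)
```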

Now, short term memory is great for the current session. But what if you close your session and come back later? It would be terribly inconvenient if you had to repeat yourself. Short term memory is ephemeral, so how can we persist the info? This is where long term memory comes into play!

Periodically, certain info is extracted from short term memory and saved to long term memory. Long term memory lives in a vector store, which typically sits on a hard disk. Hence it's durable and cheap, but a little slower than short term memory. That's okay, because once the info is retrieved and used in another session, it lands in short term memory again for faster access.
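As a toy illustration, here is what the write and read paths of long term memory look like. A plain Python list plus a fake embedding function stand in for a real embedding model and a vector store such as Pinecone, Chroma, or OpenSearch.

```python
# Toy long term memory: embed a fact, store it durably, search by similarity.
# embed() is a fake embedding; swap in a real embedding model + vector DB.

import math

long_term_store: list[tuple[list[float], str]] = []

def embed(text: str) -> list[float]:
    vec = [0.0] * 16
    for i, ch in enumerate(text.lower()):
        vec[i % 16] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def save_to_long_term(fact: str) -> None:
    long_term_store.append((embed(fact), fact))

def search_long_term(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    scored = sorted(
        ((sum(a * b for a, b in zip(q, e)), text) for e, text in long_term_store),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

save_to_long_term("User wants to vacation in NYC between Jan 25th and 30th")
save_to_long_term("User prefers indoor activities on cold days")
print(search_long_term("What did we decide about the NYC trip?", k=1))
```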

Now, even though long term memory is cheaper compared to short term memory, you don't want to fill it up with ALL the info. Hence only the items below are extracted from short term memory:

  • User preferences - e.g. the user likes indoor activities on cold days
  • Semantics - raw data with facts, e.g. the user wants to vacation in NYC between Jan 25th and 30th
  • Summary - e.g. the user and agent discussed the January vacation plan and decided on activities
  • This is why you sometimes wonder, "Didn't I say this to the LLM in a previous session?" If that convo doesn't fall into these three categories, it never made it to long term memory

How is Memory Extraction Done?

This is actually simple! Think of a process that can extract specific things from a wall of text - an LLM!

An LLM with a prompt runs periodically and extracts the three categories. You can customize this: based on the agent type, you can extract other types of info and save them into long term memory.
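As a rough sketch (the prompt wording, the call_llm stand-in, and the save_fn hook are my own illustrations, not a specific library's API), the extraction pass could look like this:

```python
# Periodic extraction pass: prompt an LLM to pull out preferences,
# semantic facts, and a summary, then persist each item to long term memory.

import json

EXTRACTION_PROMPT = """From the conversation below, extract JSON with keys:
  "preferences": user likes/dislikes,
  "semantics":   concrete facts (names, dates, decisions),
  "summary":     a one-paragraph recap of what was discussed and decided.

Conversation:
{transcript}
"""

def extract_memories(transcript: str, call_llm, save_fn) -> dict:
    raw = call_llm(EXTRACTION_PROMPT.format(transcript=transcript))
    memories = json.loads(raw)  # assumes the model returned valid JSON
    for category, items in memories.items():
        for item in (items if isinstance(items, list) else [items]):
            save_fn(f"{category}: {item}")  # e.g. save_to_long_term from earlier
    return memories
```

Now, the question is how the agent gets this info back out of memory based on the user's query. Let's find out.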

  • User sends a prompt to the agent
  • The agent searches the vector store (long term memory) with the prompt and gets related info
  • The agent adds this info to the context before sending it to the LLM
  • The LLM generates an answer from the original prompt + added context and sends it to the user
  • What does this remind you of? RAG!
    • Agents actually use RAG to get info from memory!
    • RAG is not obsolete - it's alive and well!
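Stitched together, the retrieval step might look like this, reusing the toy search_long_term from the earlier sketch and a call_llm stand-in for your model API:

```python
# Retrieval-augmented answer: search long term memory with the user's prompt,
# add what comes back to the context, and let the LLM answer from both.

def answer_with_memory(user_prompt: str, call_llm, search_fn=search_long_term) -> str:
    related = search_fn(user_prompt, k=3)                   # 1. search the vector store
    context = "Relevant memories:\n" + "\n".join(related)   # 2. add them to the context
    return call_llm(f"{context}\n\nUser: {user_prompt}")    # 3. answer from prompt + context
```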

Implementation

You'll separate yourself from the pack if you can talk about the implementation!

I am a tad biased toward AWS (for the new readers - I was a Principal SA at AWS, where I spent 6.5 years before leaving to build my own startup). I am showing the implementation with AWS, but the major components are open source:

  • Implement the agent with AWS Strands Agents (open source), or LangChain/LangGraph/CrewAI (open source), etc.
  • Consume the LLM from Amazon Bedrock or from any other model provider
  • You can manage memory yourself - short term will still be VRAM, and long term will be a vector store such as Pinecone or an open source option. Or you can run both the open source agent and the memory on Amazon Bedrock AgentCore, where AWS manages and scales the memory for you in a pay-as-you-go model
  • AgentCore also comes with observability baked in, which makes it easier to troubleshoot and optimize agents (a minimal Bedrock call is sketched below)
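For the Bedrock piece, a minimal sketch using boto3's Converse API might look like the following. The model ID and region are placeholder choices - use whatever your account has access to - and memory itself would be either your own code from the earlier sketches or managed for you by AgentCore.

```python
# Minimal Bedrock call via the Converse API. Retrieved memories are simply
# prepended to the prompt; model ID and region are placeholder choices.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def call_bedrock(prompt: str) -> str:
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

# Example: plug it into the earlier retrieval sketch
# answer_with_memory("Where am I vacationing in Jan?", call_bedrock)
```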

For how deep you should go on this in interviews, and for an explanation with a use case, check out my detailed video on this topic:

video preview

Hope this helped you understand GenAI memory and gave you the answer: RAG is still alive and kicking! Till next time!

If you have found this newsletter helpful, and want to support me 🙏:

Check out my bestselling courses on AWS, System Design, Kubernetes, DevOps, and more: Max discounted links

AWS SA Bootcamp with Live Classes, Mock Interviews, Hands-On, Resume Improvement and more: https://www.sabootcamp.com/

Keep learning and keep rocking 🚀,

Raj

