How Context Management Works — Why AI Forgets and How to Fix It

2026-03-24

An AI model has no memory between conversations. Every session starts from zero. Within a conversation, the model can only see what fits in its context window — a fixed-size buffer of tokens (word fragments, roughly three-quarters of an English word each). Claude's context window is up to 200K tokens. GPT-4 supports 128K. Smaller models may have 8K or 32K.

When the context fills up, information is lost. The model forgets what happened at the start of the conversation. Tool results from early steps disappear. Instructions from the beginning are gone.

This is the central engineering challenge of building AI-powered software. Context management is the discipline of making AI effective despite finite memory.

[Diagram: a 200K-token context window. Early entries — the system prompt, messages 1–2, and tool results 1–2 — are marked "Lost" as later steps (tool result 47, message 48, tool call 48) fill the window. Early context is gone; only recent context is preserved. The agent forgot the first 46 steps, and the instructions from the start are gone. This is why context management matters.]

Why Does This Matter?

A developer asks an AI agent to refactor a large codebase. The agent reads 50 files, makes changes to 30, runs tests, fixes failures. Each file read, each tool call, each test result consumes context. After 100 steps, the early file contents are gone. The agent doesn't remember the first files it read.

Without context management, the agent:

  • Forgets decisions made earlier in the session
  • Re-reads files it already read (wasting time and tokens)
  • Loses track of the overall plan
  • Makes inconsistent changes because it can't see the full picture

With context management, the agent:

  • Summarizes completed work to preserve key decisions
  • Retrieves relevant files on demand instead of holding everything in context
  • Maintains a persistent plan that survives context limits
  • Stays consistent because critical information is preserved

The Strategies

1. Summarization

Replace detailed history with compressed summaries. After the agent reads a file and makes changes, replace the full file contents in context with a one-line summary: "Modified src/auth.rs — added token refresh logic, 3 functions changed."

Tradeoff: summaries lose detail. If the agent needs to re-examine the file later, it must read it again. But the summary preserves the decision (what was done and why) without the full content.
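A minimal sketch of this strategy, assuming the history is a list of dicts; the `role`/`tool`/`content` fields and the first-line heuristic are illustrative, not any specific provider's API:

```python
# Sketch: collapse tool results older than the last few turns into
# one-line summaries, keeping recent results in full detail.

KEEP_RECENT = 2  # how many of the most recent tool results stay verbatim

def summarize(msg):
    """Collapse a tool result to one line that preserves the decision."""
    first_line = msg["content"].splitlines()[0]
    return {"role": "tool", "content": f"[summary] {msg['tool']}: {first_line}"}

def compress_history(messages):
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    old = set(tool_indices[:-KEEP_RECENT]) if len(tool_indices) > KEEP_RECENT else set()
    return [summarize(m) if i in old else m for i, m in enumerate(messages)]
```

In a real agent, `summarize` would ask the model itself for a summary ("Modified src/auth.rs — added token refresh logic") rather than truncate to a first line; the shape of the compression is the same.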

2. Retrieval-Augmented Generation (RAG)

Don't put everything in context upfront. Store information externally and retrieve what's relevant when needed.

The pattern:

  1. Store documents in a search index (inverted index for keywords, vector index for meaning)
  2. When the model needs information, search the index with the current query
  3. Insert only the relevant results into the context
  4. The model responds with up-to-date, relevant information
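The store-and-retrieve steps above can be sketched with a keyword inverted index. The documents and scoring here are toy examples; production systems add BM25-style ranking, stemming, and usually a vector index alongside:

```python
from collections import defaultdict

def build_index(docs):
    # Inverted index: word -> set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def retrieve(index, docs, query, k=2):
    # Score each document by how many query words it contains.
    scores = defaultdict(int)
    for word in query.lower().split():
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [docs[d] for d in ranked]

docs = {
    "auth.md": "token refresh logic and session expiry rules",
    "deploy.md": "deployment pipeline and rollback steps",
}
index = build_index(docs)
# Only the matching chunk is inserted into the model's context.
context = retrieve(index, docs, "how does token refresh work")
```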

RAG is how AI chatbots answer questions about documentation, codebases, and private data. The model doesn't memorize everything — it searches and retrieves on demand.

Tradeoff: retrieval quality depends on search quality. If the search returns irrelevant results, the model gets bad context. If it misses relevant results, the model doesn't know what it doesn't know.

3. Persistent State

Store key information outside the context window in a persistent store:

  • Session state — what the agent has done, what's left to do, key decisions made
  • Memory — facts the agent should remember across conversations (user preferences, project structure, past decisions)
  • Plans — multi-step task plans that survive context compression

The agent reads from persistent state at the start of each turn and writes back after each action. The context window holds the current step. Persistent state holds the big picture.

Tradeoff: the agent must explicitly read and write state, adding tool calls. And the state must be designed carefully — too much is noise, too little loses critical information.
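A minimal read-modify-write sketch of one turn, assuming state lives in a local JSON file with a plan/done/decisions schema. Both the path and the schema are assumptions; real agents tailor the state shape to the task:

```python
import json
import pathlib

STATE_PATH = pathlib.Path("agent_state.json")

def load_state():
    # Read the big picture at the start of a turn; start fresh if absent.
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())
    return {"plan": [], "done": [], "decisions": []}

def save_state(state):
    # Write back after acting, so the next turn (or session) can continue.
    STATE_PATH.write_text(json.dumps(state, indent=2))

state = load_state()
state["plan"] = state["plan"] or ["read auth module", "add token refresh", "run tests"]
step = state["plan"].pop(0)           # the context window holds this step
state["done"].append(step)            # persistent state holds the big picture
state["decisions"].append("kept existing session table schema")
save_state(state)
```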

4. Context Window Optimization

Make the context window itself more efficient:

  • Tool result truncation — if a tool returns 10,000 lines, only include the first 100 in context. Store the rest externally.
  • System prompt compression — static instructions that don't change can be shortened after the first few turns.
  • Selective history — keep recent messages in full detail, summarize older ones.
  • Deduplication — if the same file appears in context multiple times, keep only the latest version.
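Tool result truncation, the first item above, can be sketched as follows; the 100-line cutoff and the spill path are illustrative choices, not fixed conventions:

```python
def truncate_result(text, max_lines=100, spill_path="tool_output_full.txt"):
    """Keep the head of a tool result in context; store the rest externally."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    with open(spill_path, "w") as f:
        f.write(text)  # full output stays retrievable on demand
    head = "\n".join(lines[:max_lines])
    return f"{head}\n... [{len(lines) - max_lines} more lines stored in {spill_path}]"
```

The trailing marker matters: it tells the model the output was cut and where to look, so it can choose to fetch the rest instead of assuming it saw everything.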

5. Multi-Session Architecture

For tasks that span days or weeks, no single context window is sufficient. The architecture:

  1. Session 1: Agent works on part of the task, writes a summary and plan to persistent state
  2. Session 2: Agent reads the summary, continues from where it left off
  3. Session N: Agent reads the accumulated state, completes the final steps

Each session has a fresh context window. Persistent state bridges the gap. The quality of the summary determines whether the agent can effectively continue.
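A sketch of the handoff between sessions, assuming a hypothetical `handoff.json` note; the note's shape and the seeding message are illustrative:

```python
import json

def end_session(summary, remaining_steps, path="handoff.json"):
    # Session N ends: write what was done and what's left.
    with open(path, "w") as f:
        json.dump({"summary": summary, "remaining": remaining_steps}, f)

def start_session(path="handoff.json"):
    # Session N+1 starts: fold the note into a fresh context window.
    with open(path) as f:
        note = json.load(f)
    return [{"role": "system",
             "content": f"Previous sessions: {note['summary']} "
                        f"Remaining: {', '.join(note['remaining'])}"}]
```

For example, `end_session("Refactored auth; 12 of 30 files done.", ["migrate tests", "update docs"])` followed by `start_session()` yields a single seed message carrying the accumulated state.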

RAG in Detail

RAG is the most common context management strategy for knowledge-heavy applications. The pipeline:

  1. Index — split documents into chunks, compute embeddings, store in a vector database
  2. Query — when the user asks a question, embed the query and search for similar chunks
  3. Augment — insert the retrieved chunks into the model's context as additional information
  4. Generate — the model responds using both the question and the retrieved context
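The query and augment steps can be sketched with toy 3-dimensional embeddings. Real pipelines get vectors from an embedding model and store them in a vector database; every vector and chunk below is invented for illustration:

```python
import math

# Toy index: chunk text -> precomputed embedding (normally from a model).
chunks = {
    "auth tokens expire after 24 hours":      [0.9, 0.1, 0.0],
    "the deploy pipeline runs on every push": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    # Cosine similarity: how aligned two embedding vectors are.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def augment(query_vec, k=1):
    # Rank chunks by similarity; the top-k are inserted into the prompt.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]), reverse=True)
    return "\n".join(ranked[:k])

# Pretend [0.8, 0.2, 0.1] is the embedding of "when do tokens expire?"
prompt_context = augment([0.8, 0.2, 0.1])
```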

This connects directly to search engineering: the retrieve step of RAG is classic search infrastructure, with the same indexing, ranking, and recall concerns.

The Fundamental Tradeoff

More context = better decisions but higher cost and latency. Less context = faster and cheaper but more mistakes.
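Back-of-the-envelope arithmetic makes the tradeoff concrete. The per-token price below is a placeholder assumption; check your provider's current pricing:

```python
# Placeholder price: $3 per million input tokens (varies by provider/model).
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000

def prompt_cost(tokens, steps):
    """Cost of resending `tokens` of context on each of `steps` agent turns."""
    return tokens * steps * PRICE_PER_INPUT_TOKEN

full = prompt_cost(180_000, 100)  # near-full window on every step
lean = prompt_cost(20_000, 100)   # aggressively managed context
```

Under these assumptions, cutting per-step context from 180K to 20K tokens cuts the 100-step run from $54 to $6, with a proportional reduction in latency per step.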

The art of context management is finding the minimum context the model needs to make the right decision at each step. Not everything — just the right things.

What's Changing

Context windows are growing. From 4K (GPT-3) to 128K (GPT-4) to 200K (Claude) to 1M+ (experimental). As windows grow, the urgency of context management decreases for simple tasks.

But for complex, multi-session, multi-file tasks — the kind of work 8Vast is built for — context management remains essential. A million tokens is still not enough to hold an entire codebase, all its documentation, all its history, and all the decisions made over weeks of work.

The solution is always the same: persistent state + retrieval + summarization. The context window is the working memory. Persistent state is the long-term memory. Search is how you find what you need when you need it.

Next Steps

This completes the AI Engineering learning path. You now understand why models forget, and the five strategies that work around it: summarization, retrieval-augmented generation, persistent state, context window optimization, and multi-session architecture.

For deeper understanding of the retrieval infrastructure: Search Engineering.
