How Context Management Works — Why AI Forgets and How to Fix It

2026-03-24

An AI model has no memory between conversations. Every session starts from zero. Within a conversation, the model can only see what fits in its context window — a fixed-size buffer of tokens (word fragments, roughly three-quarters of an English word each). Claude's context window is up to 200K tokens. GPT-4 supports 128K. Smaller models may have 8K or 32K.

When the context fills up, information is lost. The model forgets what happened at the start of the conversation. Tool results from early steps disappear. Instructions from the beginning are gone.

This is the central engineering challenge of building AI-powered software. Context management is the discipline of making AI effective despite finite memory.

[Diagram: a 200K-token context window. Early entries — the system prompt, messages 1–2, and tool results 1–2 — are marked "Lost" as later steps (tool result 47, message 48, tool call 48) fill the window. Early context is gone; only recent context is preserved. The agent forgot the first 46 steps, and the instructions from the start are gone. This is why context management matters.]

Why Does This Matter?

A developer asks an AI agent to refactor a large codebase. The agent reads 50 files, makes changes to 30, runs tests, fixes failures. Each file read, each tool call, each test result consumes context. After 100 steps, the early file contents are gone. The agent doesn't remember the first files it read.

Without context management, the agent:

  • Forgets decisions made earlier in the session
  • Re-reads files it already read (wasting time and tokens)
  • Loses track of the overall plan
  • Makes inconsistent changes because it can't see the full picture

With context management, the agent:

  • Summarizes completed work to preserve key decisions
  • Retrieves relevant files on demand instead of holding everything in context
  • Maintains a persistent plan that survives context limits
  • Stays consistent because critical information is preserved

The Strategies

1. Summarization

Replace detailed history with compressed summaries. After the agent reads a file and makes changes, replace the full file contents in context with a one-line summary: "Modified src/auth.rs — added token refresh logic, 3 functions changed."

Tradeoff: summaries lose detail. If the agent needs to re-examine the file later, it must read it again. But the summary preserves the decision (what was done and why) without the full content.
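A minimal sketch of this strategy, assuming the history is a list of dicts; the `role`/`tool`/`content` fields and the first-line heuristic are illustrative, not any specific provider's API:

```python
# Sketch: collapse tool results older than the last few turns into
# one-line summaries, keeping recent results in full detail.

KEEP_RECENT = 2  # how many of the most recent tool results stay verbatim

def summarize(msg):
    """Collapse a tool result to one line that preserves the decision."""
    first_line = msg["content"].splitlines()[0]
    return {"role": "tool", "content": f"[summary] {msg['tool']}: {first_line}"}

def compress_history(messages):
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    old = set(tool_indices[:-KEEP_RECENT]) if len(tool_indices) > KEEP_RECENT else set()
    return [summarize(m) if i in old else m for i, m in enumerate(messages)]
```

In a real agent, `summarize` would ask the model itself for a summary ("Modified src/auth.rs — added token refresh logic") rather than truncate to a first line; the shape of the compression is the same.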

2. Retrieval-Augmented Generation (RAG)

Don't put everything in context upfront. Store information externally and retrieve what's relevant when needed.

The pattern:

  1. Store documents in a search index (inverted index for keywords, vector index for meaning)
  2. When the model needs information, search the index with the current query
  3. Insert only the relevant results into the context
  4. The model responds with up-to-date, relevant information
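The store-and-retrieve steps above can be sketched with a keyword inverted index. The documents and scoring here are toy examples; production systems add BM25-style ranking, stemming, and usually a vector index alongside:

```python
from collections import defaultdict

def build_index(docs):
    # Inverted index: word -> set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def retrieve(index, docs, query, k=2):
    # Score each document by how many query words it contains.
    scores = defaultdict(int)
    for word in query.lower().split():
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [docs[d] for d in ranked]

docs = {
    "auth.md": "token refresh logic and session expiry rules",
    "deploy.md": "deployment pipeline and rollback steps",
}
index = build_index(docs)
# Only the matching chunk is inserted into the model's context.
context = retrieve(index, docs, "how does token refresh work")
```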

RAG is how AI chatbots answer questions about documentation, codebases, and private data. The model doesn't memorize everything — it searches and retrieves on demand.

Tradeoff: retrieval quality depends on search quality. If the search returns irrelevant results, the model gets bad context. If it misses relevant results, the model doesn't know what it doesn't know.

3. Persistent State

Store key information outside the context window in a persistent store:

  • Session state — what the agent has done, what's left to do, key decisions made
  • Memory — facts the agent should remember across conversations (user preferences, project structure, past decisions)
  • Plans — multi-step task plans that survive context compression

The agent reads from persistent state at the start of each turn and writes back after each action. The context window holds the current step. Persistent state holds the big picture.

Tradeoff: the agent must explicitly read and write state, adding tool calls. And the state must be designed carefully — too much is noise, too little loses critical information.
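A minimal read-modify-write sketch of one turn, assuming state lives in a local JSON file with a plan/done/decisions schema. Both the path and the schema are assumptions; real agents tailor the state shape to the task:

```python
import json
import pathlib

STATE_PATH = pathlib.Path("agent_state.json")

def load_state():
    # Read the big picture at the start of a turn; start fresh if absent.
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())
    return {"plan": [], "done": [], "decisions": []}

def save_state(state):
    # Write back after acting, so the next turn (or session) can continue.
    STATE_PATH.write_text(json.dumps(state, indent=2))

state = load_state()
state["plan"] = state["plan"] or ["read auth module", "add token refresh", "run tests"]
step = state["plan"].pop(0)           # the context window holds this step
state["done"].append(step)            # persistent state holds the big picture
state["decisions"].append("kept existing session table schema")
save_state(state)
```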

4. Context Window Optimization

Make the context window itself more efficient:

  • Tool result truncation — if a tool returns 10,000 lines, only include the first 100 in context. Store the rest externally.
  • System prompt compression — static instructions that don't change can be shortened after the first few turns.
  • Selective history — keep recent messages in full detail, summarize older ones.
  • Deduplication — if the same file appears in context multiple times, keep only the latest version.
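Tool result truncation, the first item above, can be sketched as follows; the 100-line cutoff and the spill path are illustrative choices, not fixed conventions:

```python
def truncate_result(text, max_lines=100, spill_path="tool_output_full.txt"):
    """Keep the head of a tool result in context; store the rest externally."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    with open(spill_path, "w") as f:
        f.write(text)  # full output stays retrievable on demand
    head = "\n".join(lines[:max_lines])
    return f"{head}\n... [{len(lines) - max_lines} more lines stored in {spill_path}]"
```

The trailing marker matters: it tells the model the output was cut and where to look, so it can choose to fetch the rest instead of assuming it saw everything.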

5. Multi-Session Architecture

For tasks that span days or weeks, no single context window is sufficient. The architecture:

  1. Session 1: Agent works on part of the task, writes a summary and plan to persistent state
  2. Session 2: Agent reads the summary, continues from where it left off
  3. Session N: Agent reads the accumulated state, completes the final steps

Each session has a fresh context window. Persistent state bridges the gap. The quality of the summary determines whether the agent can effectively continue.
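A sketch of the handoff between sessions, assuming a hypothetical `handoff.json` note; the note's shape and the seeding message are illustrative:

```python
import json

def end_session(summary, remaining_steps, path="handoff.json"):
    # Session N ends: write what was done and what's left.
    with open(path, "w") as f:
        json.dump({"summary": summary, "remaining": remaining_steps}, f)

def start_session(path="handoff.json"):
    # Session N+1 starts: fold the note into a fresh context window.
    with open(path) as f:
        note = json.load(f)
    return [{"role": "system",
             "content": f"Previous sessions: {note['summary']} "
                        f"Remaining: {', '.join(note['remaining'])}"}]
```

For example, `end_session("Refactored auth; 12 of 30 files done.", ["migrate tests", "update docs"])` followed by `start_session()` yields a single seed message carrying the accumulated state.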

RAG in Detail

RAG is the most common context management strategy for knowledge-heavy applications. The pipeline:

  1. Index — split documents into chunks, compute embeddings, store in a vector database
  2. Query — when the user asks a question, embed the query and search for similar chunks
  3. Augment — insert the retrieved chunks into the model's context as additional information
  4. Generate — the model responds using both the question and the retrieved context
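The query and augment steps can be sketched with toy 3-dimensional embeddings. Real pipelines get vectors from an embedding model and store them in a vector database; every vector and chunk below is invented for illustration:

```python
import math

# Toy index: chunk text -> precomputed embedding (normally from a model).
chunks = {
    "auth tokens expire after 24 hours":      [0.9, 0.1, 0.0],
    "the deploy pipeline runs on every push": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    # Cosine similarity: how aligned two embedding vectors are.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def augment(query_vec, k=1):
    # Rank chunks by similarity; the top-k are inserted into the prompt.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]), reverse=True)
    return "\n".join(ranked[:k])

# Pretend [0.8, 0.2, 0.1] is the embedding of "when do tokens expire?"
prompt_context = augment([0.8, 0.2, 0.1])
```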

This connects directly to search engineering: the retrieve step of RAG is classic search infrastructure, with the same indexing, ranking, and recall concerns.

The Fundamental Tradeoff

More context = better decisions but higher cost and latency. Less context = faster and cheaper but more mistakes.
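Back-of-the-envelope arithmetic makes the tradeoff concrete. The per-token price below is a placeholder assumption; check your provider's current pricing:

```python
# Placeholder price: $3 per million input tokens (varies by provider/model).
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000

def prompt_cost(tokens, steps):
    """Cost of resending `tokens` of context on each of `steps` agent turns."""
    return tokens * steps * PRICE_PER_INPUT_TOKEN

full = prompt_cost(180_000, 100)  # near-full window on every step
lean = prompt_cost(20_000, 100)   # aggressively managed context
```

Under these assumptions, cutting per-step context from 180K to 20K tokens cuts the 100-step run from $54 to $6, with a proportional reduction in latency per step.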

The art of context management is finding the minimum context the model needs to make the right decision at each step. Not everything — just the right things.

What's Changing

Context windows are growing. From 4K (GPT-3) to 128K (GPT-4) to 200K (Claude) to 1M+ (experimental). As windows grow, the urgency of context management decreases for simple tasks.

But for complex, multi-session, multi-file tasks — the kind of work 8Vast is built for — context management remains essential. A million tokens is still not enough to hold an entire codebase, all its documentation, all its history, and all the decisions made over weeks of work.

The solution is always the same: persistent state + retrieval + summarization. The context window is the working memory. Persistent state is the long-term memory. Search is how you find what you need when you need it.

Next Steps

This completes the AI Engineering learning path. You now understand why models forget, and the five strategies that work around it: summarization, retrieval-augmented generation, persistent state, context window optimization, and multi-session architecture.

For deeper understanding of the retrieval infrastructure: Search Engineering.
