
How Context Management Works — Why AI Forgets and How to Fix It
An AI model has no memory between conversations. Every session starts from zero. Within a conversation, the model can only see what fits in its context window — a fixed-size buffer of tokens (sub-word units; a token is roughly three-quarters of an English word). Claude's context window is up to 200K tokens. GPT-4 supports 128K. Smaller models may have 8K or 32K.
When the context fills up, information is lost. The model forgets what happened at the start of the conversation. Tool results from early steps disappear. Instructions from the beginning are gone.
This is the central engineering challenge of building AI-powered software. Context management is the discipline of making AI effective despite finite memory.
Why Does This Matter?
A developer asks an AI agent to refactor a large codebase. The agent reads 50 files, makes changes to 30, runs tests, fixes failures. Each file read, each tool call, each test result consumes context. After 100 steps, the early file contents are gone. The agent doesn't remember the first files it read.
Without context management, the agent:
- Forgets decisions made earlier in the session
- Re-reads files it already read (wasting time and tokens)
- Loses track of the overall plan
- Makes inconsistent changes because it can't see the full picture
With context management, the agent:
- Summarizes completed work to preserve key decisions
- Retrieves relevant files on demand instead of holding everything in context
- Maintains a persistent plan that survives context limits
- Stays consistent because critical information is preserved
The Strategies
1. Summarization
Replace detailed history with compressed summaries. After the agent reads a file and makes changes, replace the full file contents in context with a one-line summary: "Modified src/auth.rs — added token refresh logic, 3 functions changed."
Tradeoff: summaries lose detail. If the agent needs to re-examine the file later, it must read it again. But the summary preserves the decision (what was done and why) without the full content.
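A minimal sketch of this idea, assuming a simple message-dict history format (the `summary` field and the 200-character threshold are illustrative choices, not part of any specific framework):

```python
# Compress older tool results into one-line summaries while keeping
# recent messages verbatim. The message format here is hypothetical.

def summarize_entry(entry: dict) -> dict:
    """Replace a bulky tool result with a one-line summary."""
    summary = f"[summarized] {entry['tool']}: {entry.get('summary', 'done')}"
    return {"role": "tool", "tool": entry["tool"], "content": summary}

def compress_history(history: list[dict], keep_recent: int = 3) -> list[dict]:
    """Keep the last few entries in full; summarize older bulky tool results."""
    older, recent = history[:-keep_recent], history[-keep_recent:]
    compressed = [
        summarize_entry(e) if e["role"] == "tool" and len(e["content"]) > 200 else e
        for e in older
    ]
    return compressed + recent
```

The decision ("what was done and why") survives in the summary line even after the full file contents are dropped.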
2. Retrieval-Augmented Generation (RAG)
Don't put everything in context upfront. Store information externally and retrieve what's relevant when needed.
The pattern:
- Store documents in a search index (inverted index for keywords, vector index for meaning)
- When the model needs information, search the index with the current query
- Insert only the relevant results into the context
- The model responds with up-to-date, relevant information
RAG is how AI chatbots answer questions about documentation, codebases, and private data. The model doesn't memorize everything — it searches and retrieves on demand.
Tradeoff: retrieval quality depends on search quality. If the search returns irrelevant results, the model gets bad context. If it misses relevant results, the model doesn't know what it doesn't know.
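The pattern above can be sketched with a toy keyword scorer standing in for a real search index (a production system would use BM25 or embeddings; the function names here are illustrative):

```python
# Retrieve-then-augment: search an external store, insert only the
# relevant hits into the prompt. Word-overlap scoring is a stand-in
# for a real ranking function.

def score(query: str, doc: str) -> int:
    """Count how many query words appear in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return up to k docs that match the query at all, best first."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return [d for d in ranked[:k] if score(query, d) > 0]

def build_prompt(query: str, docs: list[str]) -> str:
    """Augment: only retrieved chunks enter the model's context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Note that the irrelevant documents never reach the model, which is the whole point: context is spent only on what the query needs.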
3. Persistent State
Store key information outside the context window in a persistent store:
- Session state — what the agent has done, what's left to do, key decisions made
- Memory — facts the agent should remember across conversations (user preferences, project structure, past decisions)
- Plans — multi-step task plans that survive context compression
The agent reads from persistent state at the start of each turn and writes back after each action. The context window holds the current step. Persistent state holds the big picture.
Tradeoff: the agent must explicitly read and write state, adding tool calls. And the state must be designed carefully — too much is noise, too little loses critical information.
4. Context Window Optimization
Make the context window itself more efficient:
- Tool result truncation — if a tool returns 10,000 lines, only include the first 100 in context. Store the rest externally.
- System prompt compression — static instructions that don't change can be shortened after the first few turns.
- Selective history — keep recent messages in full detail, summarize older ones.
- Deduplication — if the same file appears in context multiple times, keep only the latest version.
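The first of these, tool result truncation, can be sketched as follows (the in-memory `store` dict stands in for a real blob store; the marker format is an assumption):

```python
# Keep the head of a large tool result in context; spill the full
# output to an external store so it stays retrievable on demand.

store: dict[str, str] = {}

def truncate_result(tool_call_id: str, output: str, keep_lines: int = 100) -> str:
    """Return the first N lines plus a pointer to the stored remainder."""
    lines = output.splitlines()
    if len(lines) <= keep_lines:
        return output
    store[tool_call_id] = output  # full output retrievable by id later
    head = "\n".join(lines[:keep_lines])
    return f"{head}\n[... {len(lines) - keep_lines} lines truncated; id={tool_call_id}]"
```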
5. Multi-Session Architecture
For tasks that span days or weeks, no single context window is sufficient. The architecture:
- Session 1: Agent works on part of the task, writes a summary and plan to persistent state
- Session 2: Agent reads the summary, continues from where it left off
- Session N: Agent reads the accumulated state, completes the final steps
Each session has a fresh context window. Persistent state bridges the gap. The quality of the summary determines whether the agent can effectively continue.
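The handoff can be sketched in a few lines, with an in-memory dict standing in for durable storage (the function names are illustrative):

```python
# Multi-session handoff: each session ends by appending a summary and
# the next begins by reading the accumulated state, not a blank context.

sessions: dict[str, str] = {}

def end_session(task_id: str, summary: str) -> None:
    """Session N writes what it accomplished and what remains."""
    prior = sessions.get(task_id, "")
    sessions[task_id] = (prior + "\n" + summary).strip()

def start_session(task_id: str) -> str:
    """Session N+1 starts from the accumulated summary."""
    return sessions.get(task_id, "No prior work on this task.")
```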
RAG in Detail
RAG is the most common context management strategy for knowledge-heavy applications. The pipeline:
- Index — split documents into chunks, compute embeddings, store in a vector database
- Query — when the user asks a question, embed the query and search for similar chunks
- Augment — insert the retrieved chunks into the model's context as additional information
- Generate — the model responds using both the question and the retrieved context
This connects directly to the search engineering section:
- Indexing handles step 1
- BM25 or semantic search handles step 2
- Hybrid search combines both for better results
- Evaluation measures whether the retrieval is good enough
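The four pipeline steps can be made concrete with a toy end-to-end sketch. Real systems use learned embeddings and a vector database; here a bag-of-words vector and cosine similarity stand in so each step is visible:

```python
# Toy RAG pipeline: index -> query -> augment -> generate.
# The "embedding" is a word-count vector, for illustration only.

import math
from collections import Counter

def embed(text: str) -> Counter:
    """A bag-of-words stand-in for a learned embedding."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rag_answer(question: str, chunks: list[str], model=None) -> str:
    index = [(chunk, embed(chunk)) for chunk in chunks]        # 1. index
    q = embed(question)                                        # 2. query
    best = max(index, key=lambda pair: cosine(q, pair[1]))[0]  #    search
    prompt = f"Context: {best}\nQuestion: {question}"          # 3. augment
    return model(prompt) if model else prompt                  # 4. generate
```

Passing a real `model` callable completes the generate step; without one, the function returns the augmented prompt so the retrieval behavior can be inspected directly.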
The Fundamental Tradeoff
More context = better decisions but higher cost and latency. Less context = faster and cheaper but more mistakes.
The art of context management is finding the minimum context the model needs to make the right decision at each step. Not everything — just the right things.
What's Changing
Context windows are growing. From 4K (GPT-3) to 128K (GPT-4) to 200K (Claude) to 1M+ (experimental). As windows grow, the urgency of context management decreases for simple tasks.
But for complex, multi-session, multi-file tasks — the kind of work 8Vast is built for — context management remains essential. A million tokens is still not enough to hold an entire codebase, all its documentation, all its history, and all the decisions made over weeks of work.
The solution is always the same: persistent state + retrieval + summarization. The context window is the working memory. Persistent state is the long-term memory. Search is how you find what you need when you need it.
Next Steps
This completes the AI Engineering learning path. You now understand:
- How MCP Works — the protocol connecting AI to tools
- How MCP Servers, Clients, Tools, Resources, and Transports work
- How AI Agents Work — the reasoning loop
- How Context Management Works — the memory problem and its solutions
For deeper understanding of the retrieval infrastructure: Search Engineering.