What is RAG

RAG (Retrieval-Augmented Generation) is a technique where an AI system searches external data sources, retrieves relevant results, and injects them into the model's context window before generating a response. Instead of relying solely on its training data, the model reasons over fresh, specific information.

How it works

RAG has three steps:

  1. Query — the user's question (or a reformulated version) is used as a search query
  2. Retrieve — a search system finds relevant documents from an external source (a database, a codebase, a knowledge base, an API)
  3. Generate — the retrieved documents are inserted into the prompt, and the model generates an answer grounded in that specific data
For example:

  User: "What is our refund policy?"

  → Search knowledge base for "refund policy"
  → Retrieve: refund-policy.md (last updated March 2026)
  → Inject document into context
  → Model answers based on the actual policy, not training data
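The three steps above can be sketched in a few lines. This is a minimal illustration, not a real implementation: the knowledge base is a toy in-memory dictionary, retrieval is naive shared-word scoring standing in for a real search system, and the prompt string format is an assumption.

```python
import re

# Toy in-memory knowledge base; real systems would query a database,
# codebase, or API instead. Document names and contents are made up.
KNOWLEDGE_BASE = {
    "refund-policy.md": "Our refund policy: refunds are issued within 30 days of purchase.",
    "shipping.md": "Orders ship within 2 business days.",
}

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, top_k: int = 1) -> list[str]:
    """Step 2: rank documents by shared-word count with the query
    (a crude stand-in for a real keyword or semantic retriever)."""
    q = tokenize(query)
    scored = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda kv: len(q & tokenize(kv[1])),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]

def build_prompt(query: str) -> str:
    """Step 3 setup: inject the retrieved documents into the prompt
    ahead of the user's question, so the model answers grounded in them."""
    context = "\n\n".join(
        f"[{name}]\n{KNOWLEDGE_BASE[name]}" for name in retrieve(query)
    )
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What is our refund policy?"))
```

The prompt that reaches the model contains the actual policy text, so the answer is grounded in current data rather than whatever the model memorized in training.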

The search step can use BM25 (keyword matching), semantic search (meaning matching), or hybrid search (both combined). Hybrid search generally performs best because it catches both exact-term matches and conceptual matches that either method alone would miss.
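One common way to combine the two retrievers is Reciprocal Rank Fusion (RRF): each document is scored by where it lands in each ranked list, so documents that both retrievers rank highly rise to the top. The two input rankings below are illustrative placeholders, not output from real BM25 or embedding search.

```python
# Hybrid search via Reciprocal Rank Fusion: fuse a keyword ranking
# and a semantic ranking into one result list.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each doc by sum of 1/(k + rank) over the rankings it
    appears in; k=60 is a conventional smoothing constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked results from two retrievers over the same corpus.
keyword_hits  = ["refund-policy.md", "terms.md", "faq.md"]    # exact-term matches
semantic_hits = ["faq.md", "refund-policy.md", "returns.md"]  # meaning matches

print(rrf([keyword_hits, semantic_hits]))
```

Here refund-policy.md wins because it scores well in both lists, even though neither retriever alone put it first and second; that mutual reinforcement is the point of hybrid search.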

Why it matters

AI models have two fundamental limitations: their training data is frozen in time, and they cannot access private data they were never trained on. RAG solves both problems. A RAG system can answer questions about your company's internal docs, yesterday's logs, or code written this morning — because it retrieves the actual data at query time.

RAG is also cheaper and more practical than fine-tuning. Fine-tuning requires retraining the model whenever data changes. RAG just updates the search index. New documents become available instantly.
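The "instantly available" claim can be made concrete with a toy inverted index: adding a document is one index update, and the document is retrievable on the very next query, with no retraining step. The index structure and document names here are illustrative.

```python
import re
from collections import defaultdict

# Toy inverted index: maps each word to the set of documents containing it.
index: dict[str, set[str]] = defaultdict(set)
docs: dict[str, str] = {}

def add_document(name: str, text: str) -> None:
    """Indexing a new document is a cheap, incremental update."""
    docs[name] = text
    for word in re.findall(r"\w+", text.lower()):
        index[word].add(name)

def search(term: str) -> set[str]:
    return index[term.lower()]

add_document("policy-2025.md", "Refunds within 14 days.")
print(search("refunds"))   # only the old policy is found

add_document("policy-2026.md", "Refunds within 30 days.")
print(search("refunds"))   # the new policy is searchable immediately
```

Contrast this with fine-tuning, where folding the new policy into the model itself would mean preparing training data and rerunning a training job before the change takes effect.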

MCP resources provide a structured way to implement RAG — servers expose searchable data, and the client injects results into context.

See How Context Management Works for how RAG fits into the broader context management strategy.