
How to Evaluate Search Quality — Measuring What Matters
You changed the tokenizer. You switched embedding models. You added query processing steps. Is search better now?
Without measurement, you're guessing. Search quality metrics turn "feels better" into numbers. They tell you exactly which queries improved, which got worse, and by how much.
What Do You Need to Evaluate?
Evaluation requires three things:
- A set of queries — representative queries that users actually run. Not 3 queries. At least 20-50 for meaningful results.
- Relevance judgments — for each query, which documents are relevant? This is the ground truth. Usually created by humans who read the documents and decide.
- A metric — a formula that compares the system's ranked results against the relevance judgments and produces a score.
The queries and judgments together form an evaluation set (sometimes called a test collection or gold standard). Building a good evaluation set is the hardest and most important part — a metric is only as good as the judgments it's measured against.
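As a concrete sketch, an evaluation set can be as simple as a mapping from queries to graded judgments. The query strings and document IDs below are purely illustrative, not from any real collection:

```python
# A minimal evaluation set: each query maps to human relevance judgments
# (0 = not relevant, 1 = somewhat relevant, 2 = highly relevant).
# All names here are made up for illustration.
evaluation_set = {
    "how to parse json in python": {"doc_12": 2, "doc_45": 1, "doc_7": 0},
    "retry failed http request":   {"doc_3": 2, "doc_19": 1},
}

for query, judgments in evaluation_set.items():
    relevant = [doc for doc, grade in judgments.items() if grade > 0]
    print(f"{query!r}: {len(relevant)} relevant docs")
```

Whatever format you choose, keep judgments separate from any particular system's output: the same evaluation set should score every variant of your search pipeline.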
Precision and Recall
The two foundational metrics:
Precision — of the documents the system returned, how many were actually relevant?
Precision = relevant documents returned / total documents returned
If the system returns 10 documents and 7 are relevant, precision is 0.7 (70%).
Recall — of all the relevant documents that exist, how many did the system find?
Recall = relevant documents returned / total relevant documents
If there are 20 relevant documents in the corpus and the system found 7 of them, recall is 0.35 (35%).
Precision and recall are in tension. Returning more results increases recall (you find more relevant documents) but decreases precision (you also return more irrelevant ones). Returning fewer results increases precision but decreases recall.
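Both formulas reduce to a few lines of code. This sketch reproduces the two examples above (document IDs are invented for illustration):

```python
def precision(returned, relevant):
    """Fraction of returned documents that are relevant."""
    if not returned:
        return 0.0
    return len(set(returned) & set(relevant)) / len(returned)

def recall(returned, relevant):
    """Fraction of all relevant documents that were returned."""
    if not relevant:
        return 0.0
    return len(set(returned) & set(relevant)) / len(relevant)

# The examples from the text: the system returns 10 documents, 7 of
# which are relevant; 20 relevant documents exist in the corpus.
returned = [f"d{i}" for i in range(10)]                                # 10 returned
relevant = [f"d{i}" for i in range(7)] + [f"r{i}" for i in range(13)]  # 20 relevant

print(precision(returned, relevant))  # 0.7
print(recall(returned, relevant))     # 0.35
```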
Precision@K and Recall@K
In practice, users look at the top K results — the first page. Metrics at a fixed cutoff are more useful than metrics over all results.
Precision@5 — of the top 5 results, how many are relevant?
Recall@5 — of all relevant documents, how many appear in the top 5?
For a query with 10 relevant documents where the system returns:
| Position | Document | Relevant? |
|---|---|---|
| 1 | D3 | Yes |
| 2 | D7 | Yes |
| 3 | D12 | No |
| 4 | D1 | Yes |
| 5 | D9 | No |
- Precision@5 = 3/5 = 0.6
- Recall@5 = 3/10 = 0.3
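The cutoff versions just truncate the ranked list before counting. This sketch reproduces the worked example; note that recall@K needs the total number of relevant documents in the corpus, not just those that were retrieved:

```python
def precision_at_k(ranked, relevant, k):
    """Of the top k results, what fraction is relevant?"""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k, total_relevant):
    """Of all relevant documents in the corpus, what fraction is in the top k?"""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / total_relevant

# The ranked list from the table: D3, D7, D1 are relevant,
# and 10 relevant documents exist for this query overall.
ranked = ["D3", "D7", "D12", "D1", "D9"]
relevant = {"D3", "D7", "D1"}

print(precision_at_k(ranked, relevant, 5))                  # 0.6
print(recall_at_k(ranked, relevant, 5, total_relevant=10))  # 0.3
```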
MRR — Mean Reciprocal Rank
MRR measures how far down the user has to look to find the first relevant result. It's the metric for "did the answer appear near the top?"
For each query, the reciprocal rank is 1 / position of the first relevant result (a query with no relevant result in the list scores 0). MRR is the average across all queries.
| Query | First relevant at position | Reciprocal rank |
|---|---|---|
| Q1 | 1 | 1.0 |
| Q2 | 3 | 0.333 |
| Q3 | 1 | 1.0 |
| Q4 | 2 | 0.5 |
MRR = (1.0 + 0.333 + 1.0 + 0.5) / 4 = 0.708
MRR is intuitive: 1.0 means the right answer is always first. 0.5 means it's typically second. An MRR of 0.708 means the user usually finds what they need in the first one or two results.
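The computation is a short loop per query. This sketch reproduces the four-query table above; the document IDs are placeholders chosen so that the first relevant result lands at positions 1, 3, 1, and 2:

```python
def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result; 0 if none is relevant."""
    for pos, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / pos
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_results, relevant_set) pairs, one per query."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)

runs = [
    (["a", "x", "y"], {"a"}),  # Q1: first relevant at position 1
    (["x", "y", "b"], {"b"}),  # Q2: position 3
    (["c", "x", "y"], {"c"}),  # Q3: position 1
    (["x", "d", "y"], {"d"}),  # Q4: position 2
]
print(round(mean_reciprocal_rank(runs), 3))  # 0.708
```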
The BM25 vs Semantic Search benchmark used MRR as the primary metric: BM25 achieved 0.727, Nomic embedding achieved 0.754, hybrid search achieved 0.795.
nDCG — Normalized Discounted Cumulative Gain
MRR only cares about the first relevant result. nDCG cares about all relevant results and where they appear.
The idea: a relevant result at position 1 is more valuable than a relevant result at position 10. nDCG discounts the value of results by their position using a logarithmic function.
For a ranked list where each result has a relevance score (e.g., 0 = not relevant, 1 = somewhat relevant, 2 = highly relevant):
DCG (Discounted Cumulative Gain):
DCG = rel_1/log2(2) + rel_2/log2(3) + ... + rel_k/log2(k+1)
Each position i is discounted by log2(i + 1), so the first position is undiscounted (log2(2) = 1).
Ideal DCG — the DCG if results were perfectly ranked (all relevant documents first, in order of relevance).
nDCG = DCG / Ideal DCG — normalized to [0, 1].
Worked example. True relevance scores for 5 results:
| Position | Relevance | Discount (1/log2(pos+1)) | Discounted gain |
|---|---|---|---|
| 1 | 2 | 1.000 | 2.000 |
| 2 | 0 | 0.631 | 0.000 |
| 3 | 1 | 0.500 | 0.500 |
| 4 | 2 | 0.431 | 0.862 |
| 5 | 0 | 0.387 | 0.000 |
DCG = 2.000 + 0.000 + 0.500 + 0.862 + 0.000 = 3.362
Sorting the relevance scores in descending order gives the ideal ranking [2, 2, 1, 0, 0]: Ideal DCG = 2.000 + 1.262 + 0.500 + 0.000 + 0.000 = 3.762
nDCG@5 = 3.362 / 3.762 = 0.894
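The worked example can be checked with a few lines of code. This sketch computes DCG exactly rather than from the table's per-row rounded gains, so it prints 3.361 where the table's sum gives 3.362; the final nDCG agrees either way:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: rel_i / log2(i + 1) for 1-indexed positions."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideally sorted ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# The worked example: relevance scores at positions 1..5.
scores = [2, 0, 1, 2, 0]
print(round(dcg(scores), 3))   # 3.361 (table sums rounded gains to 3.362)
print(round(ndcg(scores), 3))  # 0.894
```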
nDCG is the standard metric for search engines that care about the full ranking, not just the first result. It's what benchmarks like TREC and BEIR use.
Which Metric Should You Use?
| Metric | Best for | Measures |
|---|---|---|
| Precision@K | Filtering results | How clean are the top results? |
| Recall@K | Comprehensive retrieval | Did we find everything relevant? |
| MRR | Single-answer queries | How quickly does the user find the answer? |
| nDCG@K | Ranked lists with graded relevance | How good is the overall ranking? |
For code search: MRR is often the right choice. The developer wants to find the function definition or the right file — one correct result near the top.
For document search: nDCG is better. Multiple documents may be relevant at different levels, and the ranking order matters.
For hybrid search evaluation: use both Recall@K (did we find documents that BM25 or semantic search alone would miss?) and MRR (is the best result near the top?).
How Do You Build a Good Evaluation Set?
- Sample real queries — from search logs, user interviews, or common tasks. Synthetic queries rarely capture real search behavior.
- Judge relevance honestly — have humans (ideally multiple) label each query-document pair. Use graded relevance (0/1/2) rather than binary if you want nDCG.
- Include hard queries — queries where the system currently fails are more informative than queries where it already works.
- Keep it stable — don't change the evaluation set every time you tune the system. A stable set lets you track progress over time.
- Report confidence — with 20 queries, metric differences of less than 0.05 are probably noise. With 100+ queries, you can detect smaller improvements.
Next Steps
This lesson closes the search engineering learning path. You now understand:
- How BM25 Works — keyword matching and ranking
- How Semantic Search Works — meaning-based retrieval
- How Hybrid Search Works — combining both with RRF
- How Indexing Works — the data structures behind search
- How Query Processing Works — what happens before the index is hit
- How Code Search Works — structure-aware search for source code
- How to Evaluate Search Quality — measuring whether any of it works
For real-world benchmark data, see BM25 vs Semantic Search — We Benchmarked 6 Models and the Semantic Search Benchmark.