How to Evaluate Search Quality — Measuring What Matters

2026-03-22

You changed the tokenizer. You switched embedding models. You added query processing steps. Is search better now?

Without measurement, you're guessing. Search quality metrics turn "feels better" into numbers. They tell you exactly which queries improved, which got worse, and by how much.

What Do You Need to Evaluate?

Evaluation requires three things:

  1. A set of queries — representative queries that users actually run. Three queries are not enough; you need at least 20-50 for meaningful results.
  2. Relevance judgments — for each query, which documents are relevant? This is the ground truth. Usually created by humans who read the documents and decide.
  3. A metric — a formula that compares the system's ranked results against the relevance judgments and produces a score.

The queries and judgments together form an evaluation set (sometimes called a test collection or gold standard). Building a good evaluation set is the hardest and most important part — a metric is only as good as the judgments it's measured against.

Precision and Recall

The two foundational metrics:

Precision — of the documents the system returned, how many were actually relevant?

Precision = relevant documents returned / total documents returned

If the system returns 10 documents and 7 are relevant, precision is 0.7 (70%).

Recall — of all the relevant documents that exist, how many did the system find?

Recall = relevant documents returned / total relevant documents

If there are 20 relevant documents in the corpus and the system found 7 of them, recall is 0.35 (35%).

Precision and recall are in tension. Returning more results increases recall (you find more relevant documents) but decreases precision (you also return more irrelevant ones). Returning fewer results increases precision but decreases recall.
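The two formulas above can be sketched in a few lines of Python (the helper name and document IDs here are illustrative, not from any particular library):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: list of document IDs the system returned, in rank order.
    relevant:  set of document IDs judged relevant (the ground truth).
    """
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The example from the text: the system returns 10 documents, 7 of
# them relevant, and 20 relevant documents exist in the corpus.
retrieved = [f"D{i}" for i in range(10)]           # D0..D9 returned
relevant = {f"D{i}" for i in range(7)} | {f"X{i}" for i in range(13)}
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.7 0.35
```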

Precision@K and Recall@K

In practice, users look at the top K results — the first page. Metrics at a fixed cutoff are more useful than metrics over all results.

Precision@5 — of the top 5 results, how many are relevant?

Recall@5 — of all relevant documents, how many appear in the top 5?

For a query with 10 relevant documents where the system returns:

| Position | Document | Relevant? |
|----------|----------|-----------|
| 1        | D3       | Yes       |
| 2        | D7       | Yes       |
| 3        | D12      | No        |
| 4        | D1       | Yes       |
| 5        | D9       | No        |
  • Precision@5 = 3/5 = 0.6
  • Recall@5 = 3/10 = 0.3
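The cutoff metrics are the same computation restricted to the top K results. A minimal sketch reproducing the example above (helper names are my own):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

# The table above: D3, D7, D1 relevant in the top 5; 10 relevant
# documents exist for this query in total.
retrieved = ["D3", "D7", "D12", "D1", "D9"]
relevant = {"D3", "D7", "D1"} | {f"R{i}" for i in range(7)}  # 10 total
print(precision_at_k(retrieved, relevant, 5))  # 0.6
print(recall_at_k(retrieved, relevant, 5))     # 0.3
```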

MRR — Mean Reciprocal Rank

MRR measures how far down the user has to look to find the first relevant result. It's the metric for "did the answer appear near the top?"

For each query, the reciprocal rank is 1 / position of first relevant result. MRR is the average across all queries.

| Query | First relevant at position | Reciprocal rank |
|-------|----------------------------|-----------------|
| Q1    | 1                          | 1.0             |
| Q2    | 3                          | 0.333           |
| Q3    | 1                          | 1.0             |
| Q4    | 2                          | 0.5             |

MRR = (1.0 + 0.333 + 1.0 + 0.5) / 4 = 0.708

MRR is intuitive: 1.0 means the right answer is always first. 0.5 means it's typically second. An MRR of 0.708 means the user usually finds what they need in the first one or two results.
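MRR is short enough to compute by hand, but here is a sketch that reproduces the table above (the helper name is my own; `None` marks a query with no relevant result returned):

```python
def mean_reciprocal_rank(first_relevant_positions):
    """first_relevant_positions: 1-based rank of the first relevant
    result for each query, or None if no relevant result was returned
    (which contributes 0 to the average)."""
    rr = [1.0 / pos if pos else 0.0 for pos in first_relevant_positions]
    return sum(rr) / len(rr)

# The four queries from the table: first relevant at 1, 3, 1, 2.
print(round(mean_reciprocal_rank([1, 3, 1, 2]), 3))  # 0.708
```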

The BM25 vs Semantic Search benchmark used MRR as the primary metric: BM25 achieved 0.727, Nomic embedding achieved 0.754, hybrid search achieved 0.795.

nDCG — Normalized Discounted Cumulative Gain

MRR only cares about the first relevant result. nDCG cares about all relevant results and where they appear.

The idea: a relevant result at position 1 is more valuable than a relevant result at position 10. nDCG discounts the value of results by their position using a logarithmic function.

For a ranked list where each result has a relevance score (e.g., 0 = not relevant, 1 = somewhat relevant, 2 = highly relevant):

DCG (Discounted Cumulative Gain):

DCG = rel_1/log2(2) + rel_2/log2(3) + rel_3/log2(4) + ... + rel_k/log2(k+1)

(Since log2(2) = 1, the first term is just rel_1.)

Ideal DCG — the DCG if results were perfectly ranked (all relevant documents first, in order of relevance).

nDCG = DCG / Ideal DCG — normalized to [0, 1].

Worked example. True relevance scores for 5 results:

| Position | Relevance | Discount (1/log2(pos+1)) | Discounted gain |
|----------|-----------|--------------------------|-----------------|
| 1        | 2         | 1.000                    | 2.000           |
| 2        | 0         | 0.631                    | 0.000           |
| 3        | 1         | 0.500                    | 0.500           |
| 4        | 2         | 0.431                    | 0.862           |
| 5        | 0         | 0.387                    | 0.000           |

DCG = 2.000 + 0.000 + 0.500 + 0.862 + 0.000 = 3.362

If the ideal order were [2, 2, 1, 0, 0]: Ideal DCG = 2.000 + 1.262 + 0.500 + 0.000 + 0.000 = 3.762

nDCG@5 = 3.362 / 3.762 = 0.894
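The worked example can be checked in a few lines of Python (helper names are my own; the ideal ranking is obtained by sorting the relevance scores in descending order):

```python
import math

def dcg(relevances):
    """DCG with the 1/log2(position + 1) discount used above.
    relevances[i] is the graded relevance of the result at rank i+1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """nDCG: DCG of the observed ranking over DCG of the ideal one."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# The worked example: observed relevances [2, 0, 1, 2, 0].
# (Exact DCG is ~3.361; the 3.362 above comes from rounded per-row
# discounts. The normalized score matches to three places.)
print(round(ndcg([2, 0, 1, 2, 0]), 3))  # 0.894
```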

nDCG is the standard metric for search engines that care about the full ranking, not just the first result. It's what benchmarks like TREC and BEIR use.

Which Metric Should You Use?

| Metric      | Best for                           | Measures                                    |
|-------------|------------------------------------|---------------------------------------------|
| Precision@K | Filtering results                  | How clean are the top results?              |
| Recall@K    | Comprehensive retrieval            | Did we find everything relevant?            |
| MRR         | Single-answer queries              | How quickly does the user find the answer?  |
| nDCG@K      | Ranked lists with graded relevance | How good is the overall ranking?            |

For code search: MRR is often the right choice. The developer wants to find the function definition or the right file — one correct result near the top.

For document search: nDCG is better. Multiple documents may be relevant at different levels, and the ranking order matters.

For hybrid search evaluation: use both Recall@K (did we find documents that BM25 or semantic search alone would miss?) and MRR (is the best result near the top?).

How Do You Build a Good Evaluation Set?

  1. Sample real queries — from search logs, user interviews, or common tasks. Synthetic queries rarely capture real search behavior.
  2. Judge relevance honestly — have humans (ideally multiple) label each query-document pair. Use graded relevance (0/1/2) rather than binary if you want nDCG.
  3. Include hard queries — queries where the system currently fails are more informative than queries where it already works.
  4. Keep it stable — don't change the evaluation set every time you tune the system. A stable set lets you track progress over time.
  5. Report confidence — with 20 queries, metric differences of less than 0.05 are probably noise. With 100+ queries, you can detect smaller improvements.
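One simple way to put a number on "probably noise" is a paired bootstrap over per-query scores. A sketch under my own assumptions (the helper is hypothetical, not from this lesson; it resamples queries with replacement and reports a 95% confidence interval on the mean metric difference between two systems):

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=10_000, seed=0):
    """95% CI on the mean per-query metric difference (A minus B).

    scores_a, scores_b: per-query metric values (e.g. reciprocal
    ranks) for the same queries under systems A and B.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]  # resample queries
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]
```

If the interval straddles zero, the measured difference is not distinguishable from noise at this evaluation-set size.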

Next Steps

This lesson closes the search engineering learning path. You now understand:

  • Precision and recall, and the tension between them
  • Precision@K and Recall@K for evaluating the first page of results
  • MRR for single-answer queries and nDCG for graded, full-ranking evaluation
  • How to build and maintain a stable evaluation set
For real-world benchmark data, see BM25 vs Semantic Search — We Benchmarked 6 Models and the Semantic Search Benchmark.