How to Evaluate Search Quality — Measuring What Matters

2026-03-22

You changed the tokenizer. You switched embedding models. You added query processing steps. Is search better now?

Without measurement, you're guessing. Search quality metrics turn "feels better" into numbers. They tell you exactly which queries improved, which got worse, and by how much.

What Do You Need to Evaluate?

Evaluation requires three things:

  1. A set of queries — representative queries that users actually run. Three queries are not enough; you need at least 20-50 for meaningful results.
  2. Relevance judgments — for each query, which documents are relevant? This is the ground truth. Usually created by humans who read the documents and decide.
  3. A metric — a formula that compares the system's ranked results against the relevance judgments and produces a score.

The queries and judgments together form an evaluation set (sometimes called a test collection or gold standard). Building a good evaluation set is the hardest and most important part — a metric is only as good as the judgments it's measured against.

Precision and Recall

The two foundational metrics:

Precision — of the documents the system returned, how many were actually relevant?

Precision = relevant documents returned / total documents returned

If the system returns 10 documents and 7 are relevant, precision is 0.7 (70%).

Recall — of all the relevant documents that exist, how many did the system find?

Recall = relevant documents returned / total relevant documents

If there are 20 relevant documents in the corpus and the system found 7 of them, recall is 0.35 (35%).

Precision and recall are in tension. Returning more results increases recall (you find more relevant documents) but decreases precision (you also return more irrelevant ones). Returning fewer results increases precision but decreases recall.
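The two formulas above can be sketched in a few lines of Python (the helper name and document IDs here are illustrative, not from any particular library):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: list of document IDs the system returned, in rank order.
    relevant:  set of document IDs judged relevant (the ground truth).
    """
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The example from the text: the system returns 10 documents, 7 of
# them relevant, and 20 relevant documents exist in the corpus.
retrieved = [f"D{i}" for i in range(10)]           # D0..D9 returned
relevant = {f"D{i}" for i in range(7)} | {f"X{i}" for i in range(13)}
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.7 0.35
```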

Precision@K and Recall@K

In practice, users look at the top K results — the first page. Metrics at a fixed cutoff are more useful than metrics over all results.

Precision@5 — of the top 5 results, how many are relevant?

Recall@5 — of all relevant documents, how many appear in the top 5?

For a query with 10 relevant documents where the system returns:

| Position | Document | Relevant? |
|----------|----------|-----------|
| 1        | D3       | Yes       |
| 2        | D7       | Yes       |
| 3        | D12      | No        |
| 4        | D1       | Yes       |
| 5        | D9       | No        |
  • Precision@5 = 3/5 = 0.6
  • Recall@5 = 3/10 = 0.3
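The cutoff metrics are the same computation restricted to the top K results. A minimal sketch reproducing the example above (helper names are my own):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

# The table above: D3, D7, D1 relevant in the top 5; 10 relevant
# documents exist for this query in total.
retrieved = ["D3", "D7", "D12", "D1", "D9"]
relevant = {"D3", "D7", "D1"} | {f"R{i}" for i in range(7)}  # 10 total
print(precision_at_k(retrieved, relevant, 5))  # 0.6
print(recall_at_k(retrieved, relevant, 5))     # 0.3
```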

MRR — Mean Reciprocal Rank

MRR measures how far down the user has to look to find the first relevant result. It's the metric for "did the answer appear near the top?"

For each query, the reciprocal rank is 1 / position of first relevant result. MRR is the average across all queries.

| Query | First relevant at position | Reciprocal rank |
|-------|----------------------------|-----------------|
| Q1    | 1                          | 1.0             |
| Q2    | 3                          | 0.333           |
| Q3    | 1                          | 1.0             |
| Q4    | 2                          | 0.5             |

MRR = (1.0 + 0.333 + 1.0 + 0.5) / 4 = 0.708

MRR is intuitive: 1.0 means the right answer is always first. 0.5 means it's typically second. An MRR of 0.708 means the user usually finds what they need in the first one or two results.
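MRR is short enough to compute by hand, but here is a sketch that reproduces the table above (the helper name is my own; `None` marks a query with no relevant result returned):

```python
def mean_reciprocal_rank(first_relevant_positions):
    """first_relevant_positions: 1-based rank of the first relevant
    result for each query, or None if no relevant result was returned
    (which contributes 0 to the average)."""
    rr = [1.0 / pos if pos else 0.0 for pos in first_relevant_positions]
    return sum(rr) / len(rr)

# The four queries from the table: first relevant at 1, 3, 1, 2.
print(round(mean_reciprocal_rank([1, 3, 1, 2]), 3))  # 0.708
```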

The BM25 vs Semantic Search benchmark used MRR as the primary metric: BM25 achieved 0.727, Nomic embedding achieved 0.754, hybrid search achieved 0.795.

nDCG — Normalized Discounted Cumulative Gain

MRR only cares about the first relevant result. nDCG cares about all relevant results and where they appear.

The idea: a relevant result at position 1 is more valuable than a relevant result at position 10. nDCG discounts the value of results by their position using a logarithmic function.

For a ranked list where each result has a relevance score (e.g., 0 = not relevant, 1 = somewhat relevant, 2 = highly relevant):

DCG (Discounted Cumulative Gain):

DCG = rel_1/log2(2) + rel_2/log2(3) + rel_3/log2(4) + ... + rel_k/log2(k+1)

(Since log2(2) = 1, the first term is just rel_1.)

Ideal DCG — the DCG if results were perfectly ranked (all relevant documents first, in order of relevance).

nDCG = DCG / Ideal DCG — normalized to [0, 1].

Worked example. True relevance scores for 5 results:

| Position | Relevance | Discount (1/log2(pos+1)) | Discounted gain |
|----------|-----------|--------------------------|-----------------|
| 1        | 2         | 1.000                    | 2.000           |
| 2        | 0         | 0.631                    | 0.000           |
| 3        | 1         | 0.500                    | 0.500           |
| 4        | 2         | 0.431                    | 0.862           |
| 5        | 0         | 0.387                    | 0.000           |

DCG = 2.000 + 0.000 + 0.500 + 0.862 + 0.000 = 3.362

If the ideal order were [2, 2, 1, 0, 0]: Ideal DCG = 2.000 + 1.262 + 0.500 + 0.000 + 0.000 = 3.762

nDCG@5 = 3.362 / 3.762 = 0.894
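The worked example can be checked in a few lines of Python (helper names are my own; the ideal ranking is obtained by sorting the relevance scores in descending order):

```python
import math

def dcg(relevances):
    """DCG with the 1/log2(position + 1) discount used above.
    relevances[i] is the graded relevance of the result at rank i+1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """nDCG: DCG of the observed ranking over DCG of the ideal one."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# The worked example: observed relevances [2, 0, 1, 2, 0].
# (Exact DCG is ~3.361; the 3.362 above comes from rounded per-row
# discounts. The normalized score matches to three places.)
print(round(ndcg([2, 0, 1, 2, 0]), 3))  # 0.894
```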

nDCG is the standard metric for search engines that care about the full ranking, not just the first result. It's what benchmarks like TREC and BEIR use.

Which Metric Should You Use?

| Metric      | Best for                           | Measures                                    |
|-------------|------------------------------------|---------------------------------------------|
| Precision@K | Filtering results                  | How clean are the top results?              |
| Recall@K    | Comprehensive retrieval            | Did we find everything relevant?            |
| MRR         | Single-answer queries              | How quickly does the user find the answer?  |
| nDCG@K      | Ranked lists with graded relevance | How good is the overall ranking?            |

For code search: MRR is often the right choice. The developer wants to find the function definition or the right file — one correct result near the top.

For document search: nDCG is better. Multiple documents may be relevant at different levels, and the ranking order matters.

For hybrid search evaluation: use both Recall@K (did we find documents that BM25 or semantic search alone would miss?) and MRR (is the best result near the top?).

How Do You Build a Good Evaluation Set?

  1. Sample real queries — from search logs, user interviews, or common tasks. Synthetic queries rarely capture real search behavior.
  2. Judge relevance honestly — have humans (ideally multiple) label each query-document pair. Use graded relevance (0/1/2) rather than binary if you want nDCG.
  3. Include hard queries — queries where the system currently fails are more informative than queries where it already works.
  4. Keep it stable — don't change the evaluation set every time you tune the system. A stable set lets you track progress over time.
  5. Report confidence — with 20 queries, metric differences of less than 0.05 are probably noise. With 100+ queries, you can detect smaller improvements.
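One simple way to put a number on "probably noise" is a paired bootstrap over per-query scores. A sketch under my own assumptions (the helper is hypothetical, not from this lesson; it resamples queries with replacement and reports a 95% confidence interval on the mean metric difference between two systems):

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=10_000, seed=0):
    """95% CI on the mean per-query metric difference (A minus B).

    scores_a, scores_b: per-query metric values (e.g. reciprocal
    ranks) for the same queries under systems A and B.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]  # resample queries
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]
```

If the interval straddles zero, the measured difference is not distinguishable from noise at this evaluation-set size.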

Next Steps

This lesson closes the search engineering learning path. You now understand:

  • Precision and recall, and the tension between them
  • Precision@K and Recall@K for evaluating the first page of results
  • MRR for single-answer queries and nDCG for graded, full-ranking evaluation
  • How to build and maintain a stable evaluation set
For real-world benchmark data, see BM25 vs Semantic Search — We Benchmarked 6 Models and the Semantic Search Benchmark.