
How Query Processing Works — From Text to Results
When you type "connection timeout" into a search engine, the raw text doesn't go directly to the index. It passes through a pipeline that transforms, normalizes, and sometimes expands the query before any lookup happens. The same pipeline was applied to every document during indexing. If the query pipeline doesn't match the index pipeline, terms won't match — even when they should.
Why Does Query Processing Matter?
The query "Running connections" contains two terms. But the index might have stored "run" (stemmed) and "connection" (singular). Without processing the query through the same pipeline, the query term "Running" won't match the index term "run", and the search returns nothing.
Query processing ensures the query speaks the same language as the index. It's the reason searching for "connections" finds documents containing "connection", and "RUNNING" matches "running".
What Are the Steps?
1. Tokenization
Tokenization splits the query string into individual terms. This sounds simple but involves real decisions:
- "connection_timeout" → ["connection", "timeout"] (split on underscore)
- "ConnectionTimeout" → ["connection", "timeout"] (split on camelCase)
- "192.168.1.1" → ["192.168.1.1"] (keep as a single token? or split on dots?)
- "don't" → ["don't"] or ["don", "t"] or ["dont"]?
- "C++" → ["c++"] (plus signs matter here) or ["c"]?
Every tokenizer makes different choices. The critical rule: the query tokenizer must match the index tokenizer. If the index treated ConnectionTimeout as a single token but the query splits it, nothing matches.
For code search, tokenization is especially tricky — identifiers like getHTTPResponse need code-aware splitting. That's covered in How Code Search Works.
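One possible set of choices can be sketched in a few lines. This is a minimal illustration, not any particular engine's tokenizer: it splits on camelCase boundaries and non-alphanumeric characters, and lowercases, which also means it makes the lossy calls flagged above (IP addresses split on dots, "don't" splits, "C++" loses its plus signs).

```python
import re

def tokenize(text):
    # Insert a space at lowercase/digit -> uppercase boundaries ("ConnectionTimeout" -> "Connection Timeout")
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Split on anything non-alphanumeric, drop empties, lowercase (step 2 folded in)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

tokenize("connection_timeout")  # ["connection", "timeout"]
tokenize("ConnectionTimeout")   # ["connection", "timeout"]
tokenize("192.168.1.1")         # ["192", "168", "1", "1"] -- splits on dots
tokenize("C++")                 # ["c"] -- plus signs are lost
```

Whatever choices you make here, the same `tokenize` function must run at index time and at query time.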
2. Lowercasing
Convert all terms to lowercase: "TCP" → "tcp", "Connection" → "connection". Nearly all search systems do this by default. Case-sensitive search is an explicit opt-in.
3. Stop Word Removal
Stop words are common terms that appear in almost every document: "the", "is", "at", "which", "and". They have very low IDF (they appear everywhere, so they don't distinguish between documents) and add noise to results.
The query "what is the TCP handshake" becomes ["tcp", "handshake"] after stop word removal. The words "what", "is", "the" appear in most documents and don't help narrow the search.
But stop word removal has tradeoffs:
- "to be or not to be" → every word is a stop word. Removing all of them leaves nothing.
- "The Who" → removing "the" loses the meaning entirely.
- "let it be" → same problem.
Modern search engines handle this more carefully than blanket removal. BM25's IDF naturally downweights common terms, so aggressive stop word removal is less necessary than it was for simpler algorithms like TF-IDF.
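Blanket removal is a simple set filter. This sketch uses a small hypothetical stop word list; real lists run to a few hundred entries. Note how it reproduces the failure mode above: an all-stop-word query comes back empty.

```python
# Hypothetical stop word list -- real systems ship per-language lists
STOP_WORDS = {"what", "is", "the", "a", "an", "to", "be", "or", "not", "and", "at", "which", "it", "let"}

def remove_stop_words(terms):
    return [t for t in terms if t not in STOP_WORDS]

remove_stop_words(["what", "is", "the", "tcp", "handshake"])  # ["tcp", "handshake"]
remove_stop_words("to be or not to be".split())               # [] -- nothing left to search
```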
4. Stemming and Lemmatization
Stemming reduces words to their root form:
| Original | Stemmed |
|---|---|
| running | run |
| connections | connect |
| authentication | authent |
| configured | configur |
The most common stemmer is Porter's algorithm (1980), which applies a series of suffix-stripping rules. It's fast but aggressive — "university" and "universal" both stem to "univers", which isn't always desirable.
Lemmatization is a more precise alternative: it uses a dictionary to find the proper base form. "running" → "run", "better" → "good". More accurate but slower, and requires language-specific dictionaries.
For code search, stemming is usually disabled — "config", "configure", "configuration", and "configured" are often different concepts in a codebase.
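To make suffix-stripping concrete, here is a toy stemmer, far simpler than Porter's multi-pass rules. It applies one suffix rule per word and undoubles a consonant exposed by stripping -ing/-ed ("runn" → "run"). Unlike Porter, it leaves "connections" at "connection" rather than reducing it to "connect".

```python
def naive_stem(word):
    # Toy single-pass stemmer -- illustrative only, not Porter (1980)
    word = word.lower()
    for suffix, repl in [("sses", "ss"), ("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: len(word) - len(suffix)] + repl
            # Undouble a trailing consonant left by -ing/-ed: "running" -> "runn" -> "run"
            if suffix in ("ing", "ed") and len(word) > 3 and word[-1] == word[-2] and word[-1] not in "aeiouls":
                word = word[:-1]
            break
    return word

naive_stem("running")     # "run"
naive_stem("configured")  # "configur" -- matches the table above
naive_stem("parties")     # "party"
```

Real stemmers apply several rule passes with measure conditions, which is how Porter gets from "connections" all the way to "connect".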
5. Query Expansion
Sometimes the user's query doesn't contain the exact terms in the relevant documents. Query expansion adds related terms:
Synonym expansion: "error" expands to "error OR fault OR failure". This is the keyword-based approach to the vocabulary mismatch problem.
Wildcard/prefix expansion: "connect*" expands to "connect OR connection OR connected OR connecting". Useful for autocomplete-style search.
Feedback expansion: after finding initial results, extract common terms from the top results and add them to the query. This is called pseudo-relevance feedback — the assumption is that top results are relevant and contain useful terms the user didn't think to include.
Query expansion is powerful but dangerous — adding the wrong terms dilutes the search and introduces noise. Most production systems use it conservatively or not at all, relying on semantic search to handle the vocabulary mismatch problem instead.
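Synonym expansion is the simplest of the three to sketch. The synonym table below is hypothetical; production systems curate these mappings carefully, precisely because a bad entry dilutes every query that touches it.

```python
# Hypothetical curated synonym table: term -> additional terms to OR in
SYNONYMS = {"error": ["fault", "failure"]}

def expand(terms):
    out = []
    for term in terms:
        out.append(term)
        out.extend(SYNONYMS.get(term, []))  # no entry -> term passes through alone
    return out

expand(["connection", "error"])  # ["connection", "error", "fault", "failure"]
```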
How Does This Differ for Semantic Search?
The pipeline above is for keyword search (BM25). Semantic search has a different query pipeline:
- Take the raw query text — minimal preprocessing.
- Encode it — run it through the same embedding model used at index time.
- Get a vector — the model outputs a vector representing the query's meaning.
- Search the vector index — find the nearest neighbors.
The embedding model handles tokenization, normalization, and "understanding" internally. You don't stem, remove stop words, or expand queries — the model's training already captured those relationships. "connection timeout" and "TCP socket timed out" produce similar vectors because the model learned they mean the same thing.
This is why semantic search handles vocabulary mismatch naturally — the processing is in the model, not in a hand-coded pipeline.
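Steps 2-4 can be sketched with toy vectors. The real system would call an embedding model to produce the query vector and use an approximate nearest neighbor index; this sketch substitutes hand-written 2-d vectors and exact brute-force cosine similarity to show the shape of the lookup.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest(query_vec, index, k=2):
    # Brute-force scan; real systems use an ANN index (e.g. HNSW) instead
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

# Toy vector index: in practice these come from the embedding model at index time
index = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
nearest([1.0, 0.0], index, k=1)  # [("d1", 1.0)] -- d1 points the same direction
```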
What Happens After Processing?
Once the query is processed:
For BM25: the processed terms are looked up in the inverted index. Each term's postings list is retrieved, the lists are intersected or unioned (depending on whether it's an AND or OR query), and BM25 scores are computed for each matching document.
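The AND-query intersection can be sketched as a merge over two sorted postings lists, the same two-pointer walk used in merge sort. The toy index below is hypothetical; scoring is omitted to keep the focus on the lookup.

```python
# Toy inverted index: term -> sorted postings list of document IDs
index = {"tcp": [1, 3, 7], "handshake": [3, 5, 7, 9]}

def intersect(a, b):
    # Two-pointer merge: advance whichever list is behind, keep matches
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

intersect(index["tcp"], index["handshake"])  # [3, 7] -- docs containing both terms
```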
For semantic search: the query vector is compared against the vector index using approximate nearest neighbor search. The top-k closest vectors are returned with their cosine similarity scores.
For hybrid search: both pipelines run in parallel, and the two ranked lists are merged with Reciprocal Rank Fusion (RRF).
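RRF scores each document by summing 1/(k + rank) over every ranked list it appears in, so documents ranked well by both pipelines rise to the top. A minimal sketch, using the conventional constant k = 60:

```python
def rrf_merge(rankings, k=60):
    # rankings: list of ranked doc-ID lists (e.g. [bm25_results, vector_results])
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "d2" wins: ranked 2nd and 1st, beating "d1" (1st and 3rd)
rrf_merge([["d1", "d2", "d3"], ["d2", "d3", "d1"]])  # ["d2", "d1", "d3"]
```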
Next Steps
- How Code Search Works — how query processing changes when the corpus is source code instead of natural language.
- How Indexing Works — the index structures this pipeline feeds into.
- How to Evaluate Search Quality — measuring whether the processing pipeline is producing good results.