How Indexing Works — Building the Data Structures Behind Search

2026-03-22

When you type a query into a search engine, it doesn't scan every document looking for matches. A linear scan is tolerable on a thousand documents but takes minutes or hours on a billion. Instead, the search engine builds an index — a data structure optimized for fast lookup — before any query arrives.

There are two fundamentally different kinds of index, matching the two kinds of search:

  • Inverted indexes for keyword search (BM25) — map every word to the documents that contain it.
  • Vector indexes for semantic search — map every document to a point in high-dimensional space and find nearby points.

How Does an Inverted Index Work?

An inverted index reverses the relationship between documents and words. Instead of "document → words it contains," an inverted index stores "word → documents that contain it."

Consider three documents:

Doc  Content
D1   "TCP uses sequence numbers"
D2   "UDP has no sequence numbers"
D3   "TCP and UDP are transport protocols"

The inverted index for these documents:

Term       Postings list
tcp        D1, D3
uses       D1
sequence   D1, D2
numbers    D1, D2
udp        D2, D3
has        D2
no         D2
transport  D3
protocols  D3
and        D3
are        D3

Searching for "sequence numbers" means: look up "sequence" → {D1, D2}, look up "numbers" → {D1, D2}, intersect → {D1, D2}. Two lookups and a set intersection. No scanning.
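
The lookup-and-intersect step can be sketched with an in-memory dictionary of sets — a toy version of the table above; real engines read compressed postings from disk:

```python
# Toy inverted index for the three documents above.
index = {
    "tcp": {"D1", "D3"},
    "uses": {"D1"},
    "sequence": {"D1", "D2"},
    "numbers": {"D1", "D2"},
    "udp": {"D2", "D3"},
}

def search(query):
    """Look up each query term and intersect the postings sets."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(sorted(search("sequence numbers")))  # ['D1', 'D2']
```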

Each postings list also stores metadata needed for ranking:

  • Term frequency — how many times the term appears in each document (needed for BM25).
  • Position — where in the document the term appears (needed for phrase queries and proximity scoring).
  • Document length — total terms in the document (needed for length normalization).

How Is an Inverted Index Built?

Building an inverted index is a pipeline:

  1. Read the document — load the raw text.
  2. Tokenize — split text into terms. "TCP uses sequence numbers" → ["tcp", "uses", "sequence", "numbers"]. This step lowercases, removes punctuation, and may apply stemming (reducing "running" to "run").
  3. Build postings — for each term, record the document ID, position, and frequency.
  4. Sort and merge — collect all postings by term, sort them, and write the final index.
  5. Store metadata — document lengths, total document count, average document length (all needed by BM25).
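
A minimal sketch of steps 2 through 5, assuming a simple regex tokenizer and no stemming:

```python
import re
from collections import defaultdict

def tokenize(text):
    """Step 2: lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Steps 3-5: per-term postings with positions, plus the
    document-length metadata that BM25 needs."""
    postings = defaultdict(dict)   # term -> {doc_id: [positions]}
    doc_lengths = {}
    for doc_id, text in docs.items():
        terms = tokenize(text)
        doc_lengths[doc_id] = len(terms)
        for pos, term in enumerate(terms):
            postings[term].setdefault(doc_id, []).append(pos)
    avg_len = sum(doc_lengths.values()) / len(doc_lengths)
    return postings, doc_lengths, avg_len

docs = {
    "D1": "TCP uses sequence numbers",
    "D2": "UDP has no sequence numbers",
    "D3": "TCP and UDP are transport protocols",
}
postings, doc_lengths, avg_len = build_index(docs)
print(postings["sequence"])  # {'D1': [2], 'D2': [3]}
```

Term frequency falls out for free: it is just the length of each positions list.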

For large corpora, this is done in segments: build small in-memory indexes, flush them to disk, then merge segments into a final index. This is how Lucene (the engine behind Elasticsearch), Tantivy (the Rust equivalent), and PostgreSQL's full-text search all work.

The result is a set of files on disk — typically sorted term dictionaries with compressed postings lists. Lookup is O(log n) in the vocabulary size (a binary search or trie lookup for the term, then a sequential scan of the postings list).
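
The term lookup itself can be illustrated with the standard library's bisect over a sorted in-memory vocabulary; the offsets here are hypothetical stand-ins for on-disk postings-list positions:

```python
import bisect

# Sorted term dictionary, paired with hypothetical postings offsets.
terms = ["and", "are", "has", "no", "numbers", "protocols",
         "sequence", "tcp", "transport", "udp", "uses"]
offsets = list(range(len(terms)))  # stand-in for file offsets

def lookup(term):
    """O(log n) binary search in the vocabulary; None if absent."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return offsets[i]
    return None

print(lookup("sequence"))  # 6
```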

How Does a Vector Index Work?

Semantic search represents every document as a vector — a point in high-dimensional space (typically 384 or 768 dimensions). Searching means finding the vectors closest to the query vector.

The naive approach — compute the distance from the query to every vector in the index — is O(n). At 1 million documents, that's 1 million distance calculations per query. At 100 million, it's unusable.
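
The brute-force version is easy to write, which is exactly why it tempts at small scale; here toy 3-dimensional vectors stand in for real embeddings:

```python
import math

def brute_force_knn(query, vectors, k=2):
    """O(n) exact search: score every vector, keep the k closest."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    scored = sorted(vectors.items(), key=lambda kv: dist(query, kv[1]))
    return [doc_id for doc_id, _ in scored[:k]]

# Toy 3-dimensional "embeddings" (real systems use 384+ dimensions).
vectors = {
    "D1": [0.9, 0.1, 0.0],
    "D2": [0.8, 0.2, 0.1],
    "D3": [0.1, 0.9, 0.4],
}
print(brute_force_knn([1.0, 0.0, 0.0], vectors))  # ['D1', 'D2']
```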

Approximate nearest neighbor (ANN) algorithms trade a small amount of accuracy for massive speed improvements. The most widely used is HNSW (Hierarchical Navigable Small World).

How Does HNSW Work?

HNSW builds a layered graph where each layer is progressively sparser. Think of it like a city with an express train, a local train, and walking:

  • Top layer — very few nodes, widely spaced. The express train. You jump to the approximate neighborhood quickly.
  • Middle layers — more nodes, closer together. The local train. You narrow down the region.
  • Bottom layer — all nodes. Walking. You find the exact nearest neighbors.

Searching starts at the top layer, greedily moves to the closest node, then drops to the next layer and repeats. Each layer narrows the search region until the bottom layer produces the final results.
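
The greedy descent on one layer can be sketched in a few lines. This toy version uses 1-D "vectors" and a single layer; real HNSW repeats the loop per layer (using the result as the next layer's entry point) and tracks a beam of candidates rather than a single node:

```python
# A toy proximity graph over 1-D points: node -> (value, neighbors).
graph = {
    "a": (0.0, ["b"]),
    "b": (2.0, ["a", "c"]),
    "c": (5.0, ["b", "d"]),
    "d": (9.0, ["c"]),
}

def greedy_search(query, entry):
    """Hop to whichever neighbor is closer to the query;
    stop when no neighbor improves on the current node."""
    current = entry
    while True:
        cur_dist = abs(graph[current][0] - query)
        nxt = min(graph[current][1], key=lambda n: abs(graph[n][0] - query))
        if abs(graph[nxt][0] - query) >= cur_dist:
            return current
        current = nxt

print(greedy_search(8.0, "a"))  # 'd'
```

Stopping at a local minimum is why the result is approximate: the beam width (ef in HNSW implementations) controls how many candidates are kept to reduce that risk.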

The key properties:

  • Build time: O(n log n) — each new vector is inserted by searching the graph for its neighbors and adding edges.
  • Query time: O(log n) — much faster than brute-force O(n).
  • Accuracy: typically 95-99% recall (meaning it finds 95-99 of the true 100 nearest neighbors).
  • Memory: the graph structure adds overhead — roughly 1.5-2x the raw vector storage.

Other ANN algorithms exist — IVF (Inverted File Index) clusters vectors and only searches relevant clusters, Product Quantization compresses vectors for memory efficiency — but HNSW dominates in practice because it offers the best balance of speed, accuracy, and simplicity.

How Does Hybrid Indexing Work?

Hybrid search needs both indexes. When a document is added:

  1. Tokenize the text → add to the inverted index (for BM25).
  2. Embed the text → compute the vector → add to the vector index (for semantic search).

Both indexes must stay synchronized. If a document is updated or deleted, both indexes need to reflect the change. In practice, most systems handle this by re-indexing the document into both structures atomically.
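
A sketch of that two-step ingestion; the fake_embed placeholder stands in for the neural model, and the stale-entry cleanup strategy is an illustrative assumption:

```python
from collections import defaultdict

inverted = defaultdict(set)   # term -> doc ids
vectors = {}                  # doc id -> embedding

def fake_embed(text):
    """Placeholder for the neural embedding model (hypothetical)."""
    return [float(len(text)), float(text.count(" "))]

def index_document(doc_id, text):
    """Add (or re-add) a document to both indexes together."""
    # Remove any stale postings first so updates stay consistent.
    for postings in inverted.values():
        postings.discard(doc_id)
    for term in text.lower().split():
        inverted[term].add(doc_id)
    vectors[doc_id] = fake_embed(text)

index_document("D1", "TCP uses sequence numbers")
index_document("D1", "TCP uses checksums")  # update hits both indexes
print(sorted(t for t, p in inverted.items() if "D1" in p))
# ['checksums', 'tcp', 'uses']
```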

The embedding step is the expensive one. Running a neural network to produce a vector takes 5-50ms per chunk on a CPU, or 0.5-5ms on a GPU. For a corpus of 100,000 chunks, initial indexing takes minutes on a GPU or hours on a CPU. The inverted index builds in seconds.

What Is Chunking and Why Does It Matter?

Documents are rarely indexed whole. A 10-page PDF or a 500-line source file is too long to represent as a single vector — the meaning gets diluted. Instead, documents are split into chunks.

Common chunking strategies:

  • Fixed-size — split every N tokens (e.g., 256 tokens). Simple but can cut sentences in half.
  • Paragraph — split on paragraph boundaries. Respects natural structure.
  • Semantic — split where the topic changes (detected by embedding similarity between adjacent sections). Most accurate but slowest.
  • Code-aware — split on function or class boundaries using a parser like tree-sitter. Essential for code search.

Chunk size is a tradeoff: smaller chunks are more precise (the vector represents a specific idea) but lose context. Larger chunks retain context but dilute the signal. Most systems use 200-500 tokens per chunk.
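
A fixed-size chunker with overlap — the simplest of the strategies above — might look like this; the size and overlap values are illustrative:

```python
def chunk_fixed(tokens, size=256, overlap=32):
    """Fixed-size chunking with overlap, so an idea cut at one
    chunk boundary still appears whole in the neighboring chunk."""
    step = size - overlap
    end = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, end, step)]

tokens = [f"t{i}" for i in range(600)]
chunks = chunk_fixed(tokens)
print(len(chunks), len(chunks[0]))  # 3 256
```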

BM25 doesn't require chunking — the inverted index handles documents of any length. But for hybrid search, chunking the same way for both indexes keeps the results comparable when merging with RRF.
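
RRF itself is rank-based, so it needs only the two ordered result lists; k=60 is a commonly used default constant, and the example rankings here are hypothetical:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score each doc by 1/(k + rank),
    summed over every ranking it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["D2", "D1", "D3"]    # keyword ranking (hypothetical)
vector_hits = ["D1", "D3", "D2"]  # semantic ranking (hypothetical)
print(rrf([bm25_hits, vector_hits]))  # ['D1', 'D2', 'D3']
```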

Next Steps