How Code Search Works — Searching Structure, Not Just Text

2026-03-22

Code is not prose. Searching code with tools designed for natural language — plain BM25 or grep — misses the structure that makes code meaningful. A function name, a type definition, and a comment are all text, but they mean fundamentally different things.

Code search treats source code as structured data. It parses the code into an AST (Abstract Syntax Tree), understands what kind of thing each piece of text is (function, variable, type, string literal, comment), and uses that structure to return more precise results.

Why Is Code Search Different from Text Search?

Consider searching for "connect" in a codebase:

| Match | What it is | Relevance |
|---|---|---|
| fn connect(addr: &str) | Function definition | High — this is the implementation |
| let conn = connect("localhost") | Function call | High — this is usage |
| // TODO: connect to the database | Comment | Low — a note, not code |
| "Failed to connect" | String literal | Low — an error message |
| connection_pool | Part of an identifier | Maybe — depends on intent |

Text search treats all five matches equally. Code search can distinguish between them because it understands the code's structure.
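As a rough illustration of that distinction, the sketch below uses Python's standard tokenize module as a stand-in for a real code-search parser. The sample source and the classify_matches helper are hypothetical; the point is only that a tokenizer already separates identifiers, comments, and string literals, which a plain substring search cannot do.

```python
import io
import tokenize

# Hypothetical sample: a Python rendering of the matches in the table above.
SOURCE = '''\
def connect(addr):
    # TODO: connect to the database
    conn = connect("localhost")
    connection_pool = [conn]
    raise RuntimeError("Failed to connect")
'''

def classify_matches(source, needle):
    """Yield (line, kind, text) for every token whose text contains `needle`."""
    kinds = {
        tokenize.NAME: "identifier",
        tokenize.COMMENT: "comment",
        tokenize.STRING: "string literal",
    }
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        kind = kinds.get(tok.type)
        if kind is None or needle not in tok.string:
            continue
        # A longer identifier that merely contains the needle is a weaker match.
        if kind == "identifier" and tok.string != needle:
            kind = "part of identifier"
        yield tok.start[0], kind, tok.string

for line, kind, text in classify_matches(SOURCE, "connect"):
    print(f"line {line}: {kind:<18} {text}")
```

A tokenizer alone cannot tell a definition from a call; that needs the parent-child information a syntax tree provides.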

Other differences:

Identifiers are compound words. getHTTPResponseCode is one token to text search but four concepts to a developer: get, HTTP, Response, Code. Code search needs to split identifiers on camelCase, snake_case, and other conventions.
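A minimal sketch of identifier splitting follows. The regular expression is one possible rule set, not a standard: it treats a run of capitals as an acronym unless the last capital starts a new word, and it treats underscores and hyphens as separators. The name split_identifier is an assumption made for illustration.

```python
import re

# Assumed splitting rules, in priority order:
#   1. a run of capitals not followed by a lowercase letter (acronym: "HTTP")
#   2. a capitalized word ("Response")
#   3. a lowercase run ("get")
#   4. a digit run ("8")
# Underscores and hyphens match none of these, so they act as separators.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|[0-9]+")

def split_identifier(name):
    """Split a compound identifier into lowercase word parts."""
    return [part.lower() for part in _WORD.findall(name)]

for name in ("getHTTPResponseCode", "get_http_response", "MAX_RETRY_COUNT"):
    print(name, "->", split_identifier(name))
```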

Syntax matters. def connect in Python, fn connect in Rust, function connect in JavaScript, void connect() in C — these all define a function named connect. Text search doesn't know that. Code search does.

Scope matters. A variable named result in function A is unrelated to result in function B. Text search can't distinguish them. Structure-aware search can.
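To make the scope point concrete, here is a small sketch that again uses Python's built-in ast module as a stand-in parser. The sample functions and the assignments_by_scope helper are hypothetical. It reports which function each assignment to a name belongs to, something a line-oriented search has no way to express.

```python
import ast

SOURCE = '''\
def fetch():
    result = 1
    return result

def parse():
    result = 2
    return result
'''

def assignments_by_scope(source, name):
    """Return (enclosing function, line) for each assignment to `name`.

    Simplification: an assignment inside a nested function is reported
    once for every function that encloses it.
    """
    found = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            if (isinstance(node, ast.Name) and node.id == name
                    and isinstance(node.ctx, ast.Store)):
                found.append((func.name, node.lineno))
    return found

print(assignments_by_scope(SOURCE, "result"))
```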

How Does Tree-sitter Enable Code Search?

Tree-sitter is an incremental parsing library that builds a concrete syntax tree for source code in any supported language (40+ languages). It's fast enough to parse on every keystroke and robust enough to handle incomplete or syntactically invalid code.

For code search, tree-sitter provides:

  1. Node types — every piece of code is classified: function_definition, identifier, string_literal, comment, type_identifier, parameter, call_expression, etc.
  2. Parent-child relationships — you can determine that connect is a function name (it's an identifier child of a function_definition) vs. a function call (it's an identifier child of a call_expression).
  3. A shared query language — tree-sitter's S-expression queries use the same pattern syntax for every grammar, so "find all function definitions whose name contains X" has the same shape in each language even though node type names differ between grammars.
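The parent-child idea from the list above can be sketched without installing tree-sitter by using Python's built-in ast module as a stand-in. This is an analogy, not tree-sitter's API: Python's ast stores a function's name as a string attribute on the FunctionDef node, while a call target is a child Name node of a Call node. The roles_of helper is hypothetical.

```python
import ast

SOURCE = '''\
def connect(addr):
    return addr

conn = connect("localhost")
'''

def roles_of(source, name):
    """Classify each occurrence of `name` by the node it hangs off."""
    roles = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            # The name belongs to a function definition node.
            roles.append(("definition", node.lineno))
        elif (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == name):
            # The name is the callee child of a call node.
            roles.append(("call", node.lineno))
    return sorted(roles, key=lambda role: role[1])

print(roles_of(SOURCE, "connect"))
```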

A tree-sitter query to find function definitions:

(function_definition
  name: (identifier) @function.name)

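This matches def connect(): in Python, where the grammar names the node function_definition. Node type names are specific to each grammar: tree-sitter-rust calls the equivalent node function_item and tree-sitter-javascript calls it function_declaration, so fn connect() and function connect() each need the same pattern with the node type swapped. A multi-language index typically keeps one such query per grammar and maps the captures to a common symbol kind.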

How Is a Code Search Index Built?

Building a code search index extends the standard indexing pipeline:

  1. Parse — run tree-sitter on each file to produce an AST.
  2. Extract symbols — walk the AST and extract function definitions, type definitions, imports, and other structural elements. Each symbol gets metadata: name, kind (function/type/variable/constant), file path, line range, language.
  3. Tokenize identifiers — split compound identifiers: getHTTPResponseCode → ["get", "http", "response", "code"]. This uses language-aware rules (camelCase, snake_case, SCREAMING_CASE, kebab-case).
  4. Build the inverted index — index both the original identifier and its split components. A search for "response" finds getHTTPResponseCode.
  5. Store structural metadata — which file, which function, what kind of symbol. This enables filtering: "show me only function definitions" or "only in Rust files."
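The steps above can be sketched end to end in a few dozen lines. This is a toy under stated assumptions: Python's ast module stands in for tree-sitter, only Python files are handled, and Symbol, extract_symbols, build_index, and search are hypothetical names rather than any library's API.

```python
import ast
import re
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Symbol:
    name: str
    kind: str          # "function" or "type"
    path: str
    start_line: int
    end_line: int
    language: str = "python"

def extract_symbols(path, source):
    """Steps 1-2: parse the file and collect definitions with metadata."""
    kinds = {ast.FunctionDef: "function", ast.AsyncFunctionDef: "function",
             ast.ClassDef: "type"}
    symbols = []
    for node in ast.walk(ast.parse(source)):
        kind = kinds.get(type(node))
        if kind:
            symbols.append(Symbol(node.name, kind, path,
                                  node.lineno, node.end_lineno))
    return symbols

def split_identifier(name):
    """Step 3: a compact splitter standing in for language-aware rules."""
    pattern = r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|[0-9]+"
    return [part.lower() for part in re.findall(pattern, name)]

def build_index(files):
    """Steps 4-5: map the full name and each part to symbols with metadata."""
    index = defaultdict(set)
    for path, source in files.items():
        for sym in extract_symbols(path, source):
            index[sym.name.lower()].add(sym)
            for part in split_identifier(sym.name):
                index[part].add(sym)
    return index

def search(index, term, kind=None):
    """Look up a term, optionally keeping only one kind of symbol."""
    hits = index.get(term.lower(), set())
    return sorted((s for s in hits if kind is None or s.kind == kind),
                  key=lambda s: (s.path, s.start_line))

FILES = {
    "net/http.py": (
        "def getHTTPResponseCode(resp):\n"
        "    return resp.status\n"
        "\n"
        "class ResponseCache:\n"
        "    pass\n"
    ),
}
index = build_index(FILES)
for sym in search(index, "response"):
    print(sym.kind, sym.name, f"{sym.path}:{sym.start_line}-{sym.end_line}")
```

Passing kind="function" to search narrows the same lookup to function definitions, which is the filtering that the stored metadata makes possible.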

For semantic code search, each code chunk is embedded into a vector. The chunks are structural — function bodies, class definitions, or file sections — not arbitrary fixed-size splits. Code-aware chunking produces better embeddings because each chunk is a coherent unit of logic.
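A minimal sketch of structural chunking, assuming Python source and the built-in ast module in place of tree-sitter. The chunk_by_structure helper is hypothetical, and the embedding step is deliberately left out: each chunk's text would be handed to whatever embedding model the system uses.

```python
import ast

SOURCE = '''\
import socket

def connect(addr):
    sock = socket.create_connection(addr)
    return sock

class ConnectionPool:
    def acquire(self):
        return None
'''

def chunk_by_structure(source):
    """Return one chunk per top-level function or class.

    Simplification: decorators and module-level statements are skipped.
    """
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "start": node.lineno,
                "end": node.end_lineno,
                # Line numbers are 1-based and inclusive.
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks

for chunk in chunk_by_structure(SOURCE):
    print(chunk["name"], chunk["start"], chunk["end"])
```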

What Makes a Good Code Search Query?

Code search queries tend to be shorter and more specific than document search queries:

| Query type | Example | What it needs |
|---|---|---|
| Exact symbol | ConnectionPool | Direct identifier lookup |
| Concept | "database connection pooling" | Semantic understanding |
| Pattern | fn.*connect.*Result | Regex over code |
| Structural | "functions that return Result" | AST-aware query |
| Cross-reference | "all callers of connect()" | Call graph analysis |

The first three can be handled by a well-built search index. Structural queries need AST access at query time. Cross-reference queries need a full code intelligence system (like rust-analyzer or gopls) — they're beyond search.

How Does Code Search Differ by Language?

Different languages create different challenges:

Dynamic languages (Python, JavaScript) — fewer type annotations, more runtime patterns. getattr(obj, method_name) calls a method whose name is a runtime string. Code search can't follow dynamic dispatch.
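The getattr case can be demonstrated with a short sketch. It uses Python's built-in ast module as the static analyzer, and the Client class and static_call_sites helper are hypothetical. The direct call is found; the getattr call is not, because the method name exists only as a runtime string.

```python
import ast

SOURCE = '''\
class Client:
    def connect(self):
        return "connected"

client = Client()
client.connect()
method_name = "connect"
getattr(client, method_name)()
'''

def static_call_sites(source, method):
    """Return line numbers of calls whose callee is literally named `method`."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_method_call = isinstance(func, ast.Attribute) and func.attr == method
        is_plain_call = isinstance(func, ast.Name) and func.id == method
        if is_method_call or is_plain_call:
            lines.append(node.lineno)
    return sorted(lines)

# Only the direct call on line 6 is reported; the getattr call on line 8 is missed.
print(static_call_sites(SOURCE, "connect"))
```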

Statically typed languages (Rust, Go, TypeScript) — type annotations are searchable. You can search for "functions that accept &str and return Result" in Rust. Rich type information improves both keyword and semantic search.
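A signature search of this kind can be sketched with Python type hints and the built-in ast module, standing in for a Rust grammar. The sample functions, the type names Result, Connection, and Config, and the find_by_signature helper are all hypothetical. The parser never resolves the names, so they do not need to exist.

```python
import ast

SOURCE = '''\
def connect(addr: str) -> Result: ...
def close(conn: Connection) -> None: ...
def parse(text: str) -> Config: ...
def retry(addr: str, attempts: int) -> Result: ...
'''

def find_by_signature(source, param_type, return_type):
    """Return names of top-level functions that take `param_type` and return `return_type`."""
    matches = []
    for node in ast.parse(source).body:
        if not isinstance(node, ast.FunctionDef) or node.returns is None:
            continue
        # Compare annotations by their source text, e.g. "str" or "Result".
        params = [ast.unparse(arg.annotation)
                  for arg in node.args.args if arg.annotation is not None]
        if param_type in params and ast.unparse(node.returns) == return_type:
            matches.append(node.name)
    return matches

print(find_by_signature(SOURCE, "str", "Result"))
```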

Multi-language repositories — a React frontend calling a Go backend through a gRPC API. Searching for an API endpoint needs to cross language boundaries. Tree-sitter handles the parsing side because each file is parsed independently with its own grammar, so symbols from every language can share one index.

Where Does Grep Fall Short?

grep -r "connect" . is fast and available everywhere. For simple searches, it works. But:

  • No ranking — grep returns matches in file order, not relevance order. The function definition is buried among hundreds of usages and comments.
  • No identifier awareness — a substring match on connect also hits disconnect and reconnect. grep -w restricts matches to whole words, but it still cannot split an identifier such as getHTTPResponseCode into its parts.
  • No structural filtering — you can't ask grep for "only function definitions."
  • No fuzzy matching — grep "getHTTPResp" won't find get_http_response. Code search with identifier splitting and prefix matching can find it.
  • No semantic understanding — searching for "database connection" won't find a function called open_db_handle. Semantic search will, because the embedding model understands the relationship.
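The ranking gap in the list above can be illustrated with a small sketch. The weights are invented for the example and are not tuned values; a real system would combine structural signals like these with text relevance scores. Match, KIND_WEIGHT, and rank are hypothetical names.

```python
from dataclasses import dataclass

# Assumed weights: definitions outrank calls, which outrank comments and strings.
KIND_WEIGHT = {
    "definition": 3.0,
    "call": 2.0,
    "identifier part": 1.0,
    "comment": 0.5,
    "string": 0.5,
}

@dataclass
class Match:
    path: str
    line: int
    kind: str
    text: str

def rank(matches):
    """Order matches by structural weight, then by file position."""
    return sorted(matches,
                  key=lambda m: (-KIND_WEIGHT.get(m.kind, 0.0), m.path, m.line))

# Matches in the file order that grep would report them.
found = [
    Match("app.rs", 3, "comment", "// TODO: connect to the database"),
    Match("app.rs", 9, "call", 'let conn = connect("localhost")'),
    Match("app.rs", 14, "string", '"Failed to connect"'),
    Match("net.rs", 21, "definition", "fn connect(addr: &str)"),
]

for match in rank(found):
    print(f"{match.kind:<11} {match.path}:{match.line}  {match.text}")
```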

Grep is a line-level text tool. Code search is a structure-aware retrieval system. They solve different problems.

Next Steps