
How Git Internals Work — Packfiles, Reflog, and Garbage Collection
The first four lessons covered what Git stores and how branches and merges work. This lesson covers how Git stores it efficiently — the machinery beneath the plumbing commands. Understanding packfiles, the index, the reflog, and garbage collection explains why .git directories are small, why you can recover from almost any mistake, and when objects actually get deleted.
Loose Objects vs Packfiles
When Git first creates an object (blob, tree, commit, or annotated tag), it writes a loose object: a single zlib-compressed file under .git/objects/, named by its SHA-1 hash (the first two hex digits become a subdirectory, the remaining 38 the filename). Each object is an independent file.
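As a rough illustration of how a loose object gets its name, the sketch below hashes a header of the form "<type> <size>" plus a NUL byte followed by the content, then compresses the same bytes with zlib. The function names are made up for this example; only the header layout and the two-character directory split mirror what Git does.

```python
import hashlib
import zlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Return the SHA-1 ID Git assigns to an object with this type and content."""
    header = f"{obj_type} {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Loose objects sit in a directory named after the first two hex digits."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def encode_loose_object(obj_type: str, payload: bytes) -> bytes:
    """Return the zlib-compressed bytes that make up the loose object file."""
    header = f"{obj_type} {len(payload)}".encode() + b"\x00"
    return zlib.compress(header + payload)


if __name__ == "__main__":
    blob_id = git_object_id("blob", b"hello world\n")
    print(blob_id)
    print(loose_object_path(blob_id))
```

The ID printed here should match what git hash-object reports for a file with the same bytes, since both hash the identical header-plus-content sequence.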
For a repository with thousands of commits, this means thousands of small files. That is inefficient: filesystem overhead per file, no delta compression between similar objects, and slow network transfer.
Git solves this with packfiles. A packfile (stored in .git/objects/pack/) bundles many objects into a single file with an accompanying index. Inside the packfile, Git uses delta compression: similar objects are stored as a base object plus a binary diff against it. If a 10 KB file changes by 50 bytes between commits, the pack holds one version in full and the other as a delta of roughly 50 bytes. Git usually keeps the newer version whole and stores the older one as the delta, because recent versions are read most often.
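The idea of "base object plus binary diff" can be sketched with copy and insert instructions. This is a toy model, not Git's actual delta encoding: it uses Python's difflib to find matching regions, and the names make_delta and apply_delta are invented for the example.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as 'copy a range of base' and 'insert literal bytes' steps."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, b_start, b_end, t_start, t_end in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", b_start, b_end - b_start))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[t_start:t_end]))
        # "delete" needs no instruction: those base bytes are simply never copied
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target by replaying the instructions against the base."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


if __name__ == "__main__":
    base = b"".join(b"line %d\n" % i for i in range(200))
    target = base.replace(b"line 100\n", b"line one hundred\n")
    delta = make_delta(base, target)
    literal = sum(len(op[1]) for op in delta if op[0] == "insert")
    print(len(target), "bytes rebuilt from", literal, "literal bytes plus copies")
```

Only the literal bytes and a handful of small copy instructions need to be stored, which is why a small edit to a large file costs so little space in a pack.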
Packfiles are created during git gc (garbage collection), git push, and git fetch. When you push to a remote, Git sends a packfile containing only the objects the remote does not have — delta-compressed and efficient.
The pack index (.idx file) is a sorted lookup table that maps SHA-1 hashes to byte offsets within the packfile. Finding an object is a binary search on the index followed by a seek in the packfile, so a lookup costs O(log n) comparisons for a pack of n objects.
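A minimal sketch of that lookup, assuming the index is nothing more than a sorted list of IDs with a parallel list of offsets. The class ToyPackIndex and the sample offsets are hypothetical; the real .idx format is binary and also carries a fan-out table keyed on the first byte of the hash to narrow the search range.

```python
from bisect import bisect_left
from typing import Optional


class ToyPackIndex:
    """Sorted object IDs plus the byte offset of each object in the packfile."""

    def __init__(self, entries):
        ordered = sorted(entries)  # sort by object ID so binary search works
        self.ids = [object_id for object_id, _ in ordered]
        self.offsets = [offset for _, offset in ordered]

    def find_offset(self, object_id: str) -> Optional[int]:
        """Binary-search the sorted IDs; return the pack offset or None."""
        pos = bisect_left(self.ids, object_id)
        if pos < len(self.ids) and self.ids[pos] == object_id:
            return self.offsets[pos]
        return None


if __name__ == "__main__":
    index = ToyPackIndex([
        ("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391", 987),
        ("3b18e512dba79e4c8300dd08aeb37f8e728b8dad", 340),
    ])
    print(index.find_offset("3b18e512dba79e4c8300dd08aeb37f8e728b8dad"))
```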
The Index (Staging Area)
The index (or staging area) is the file .git/index. It sits between the working tree and the object database. When you run git add, Git computes the blob hash for the file and records it in the index. When you run git commit, Git creates a tree from the current index state and creates a commit pointing to that tree.
The index tracks three things for every file:
- Path — the file's relative path in the repository.
- Blob hash — the SHA-1 of the file's current staged contents.
- Metadata — timestamps, size, and inode for change detection.
The index enables Git's fast status checks. git status compares:
- HEAD tree vs index: files staged for commit (green in git status).
- Index vs working tree: files modified but not staged (red in git status).
By caching file metadata, Git can skip hashing files that have not changed since the last git status. It checks the modification timestamp and file size first — only if they differ does it compute the hash.
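The two comparisons can be sketched with plain dictionaries that map each path to a blob hash. This is a simplified model under the assumption that all three states are already available as path-to-hash maps; the function classify_changes and the short placeholder hashes are invented for illustration.

```python
def classify_changes(head_tree: dict, index: dict, working_tree: dict) -> dict:
    """Compare HEAD vs index and index vs working tree, as git status does."""
    # Staged: the index disagrees with HEAD (added, changed, or removed paths)
    staged = sorted(
        path for path in set(head_tree) | set(index)
        if head_tree.get(path) != index.get(path)
    )
    # Not staged: the working tree disagrees with the index (changed or deleted)
    unstaged = sorted(
        path for path in index
        if working_tree.get(path) != index[path]
    )
    # Untracked: present on disk but unknown to the index
    untracked = sorted(path for path in working_tree if path not in index)
    return {"staged": staged, "unstaged": unstaged, "untracked": untracked}


if __name__ == "__main__":
    head = {"a.txt": "h1", "b.txt": "h2"}
    index = {"a.txt": "h1", "b.txt": "h3", "c.txt": "h4"}
    work = {"a.txt": "h9", "b.txt": "h3", "c.txt": "h4", "d.txt": "h5"}
    print(classify_changes(head, index, work))
```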
Reflog — The Safety Net
The reflog is a local log of every change to HEAD and branch refs. Every commit, checkout, rebase, reset, merge, and pull is recorded:
git reflog
a1b2c3 HEAD@{0}: commit: Add feature
f7e9a2 HEAD@{1}: checkout: moving from main to feature
8899aa HEAD@{2}: commit: Fix bug
The reflog is your safety net. If you accidentally reset, rebase, or delete a branch, the commits still exist in the object database — they are just unreachable from any ref. The reflog lets you find them.
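# Recover from a bad reset: move the branch back to the reflog entry recorded before the mistake
git reset --hard HEAD@{2}
# Find the last commit of a deleted branch. The matching entry shows where HEAD landed after
# the checkout; the older entry printed just below it is usually the deleted branch's last commit.
git reflog | grep -A 1 "checkout: moving from deleted-branch"
git checkout -b recovered <hash>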
Reflog entries expire after 90 days (for reachable commits) or 30 days (for unreachable commits). This is configurable via gc.reflogExpire and gc.reflogExpireUnreachable.
The reflog is local only — it is not pushed to remotes. Each developer's reflog reflects their own operations.
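On disk, the HEAD reflog is a text file (.git/logs/HEAD) with one line per update, oldest first: the old object ID, the new object ID, the identity and timestamp, then a tab and the message. The sketch below parses such lines and uses the old ID of a "moving from" checkout entry to recover a branch's last commit. The helper names and the sample identity are hypothetical.

```python
from typing import List, NamedTuple, Optional


class ReflogEntry(NamedTuple):
    old_id: str
    new_id: str
    committer: str
    timestamp: int
    message: str


def parse_reflog_line(line: str) -> ReflogEntry:
    """Split one reflog line into its fields."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old_id, new_id, rest = meta.split(" ", 2)
    committer, timestamp, _timezone = rest.rsplit(" ", 2)
    return ReflogEntry(old_id, new_id, committer, int(timestamp), message)


def last_commit_on(entries: List[ReflogEntry], branch: str) -> Optional[str]:
    """Return where HEAD pointed just before the most recent checkout away from branch."""
    prefix = f"checkout: moving from {branch} to "
    for entry in reversed(entries):  # the file is oldest-first, so scan backwards
        if entry.message.startswith(prefix):
            return entry.old_id
    return None


if __name__ == "__main__":
    who = "Jane Doe <jane@example.com>"  # placeholder identity
    a, b = "a" * 40, "b" * 40
    lines = [
        f"{a} {b} {who} 1700000200 +0000\tcommit: Work on topic",
        f"{b} {a} {who} 1700000300 +0000\tcheckout: moving from topic to main",
    ]
    entries = [parse_reflog_line(line) for line in lines]
    print(last_commit_on(entries, "topic"))
```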
Garbage Collection
Git never modifies objects, and most operations create new objects rather than removing old ones. Over time, unreachable objects accumulate — orphaned commits from rebases, old blobs from amended commits, deleted branches.
git gc (garbage collection) performs three operations:
- Pack loose objects — combine individual object files into packfiles with delta compression.
- Prune unreachable objects — delete objects not reachable from any ref or reflog entry. An object is reachable if you can follow parent pointers and tree pointers from any ref to reach it.
- Pack refs — combine individual ref files into a single packed-refs file.
Several everyday commands (for example commit, merge, and fetch) run git gc --auto when they finish. It does real work only when the number of loose objects exceeds gc.auto (default 6700). You can run git gc manually, but it is rarely necessary.
The reachability walk is the core algorithm: starting from every ref (branches, tags, stash, reflog entries), Git follows parent pointers (commits), tree pointers (trees → blobs), and marks every reachable object. Anything not marked is garbage.
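A minimal sketch of that mark phase, assuming the object database is modeled as a dictionary from object ID to the IDs it points at (parents and tree for a commit, entries for a tree, nothing for a blob). The short IDs and the function names are made up for the example.

```python
def reachable_objects(objects: dict, roots) -> set:
    """Mark every object that can be reached by following pointers from the roots."""
    marked = set()
    stack = list(roots)
    while stack:
        object_id = stack.pop()
        if object_id in marked or object_id not in objects:
            continue
        marked.add(object_id)
        stack.extend(objects[object_id])  # parents, trees, and blobs it references
    return marked


def garbage(objects: dict, roots) -> set:
    """Everything in the database that the walk did not mark."""
    return set(objects) - reachable_objects(objects, roots)


if __name__ == "__main__":
    objects = {
        "commit2": ["commit1", "tree2"],
        "commit1": ["tree1"],
        "tree2": ["blobA", "blobB"],
        "tree1": ["blobA"],
        "blobA": [],
        "blobB": [],
        "orphan": ["tree3"],  # e.g. a commit left behind by a rebase
        "tree3": ["blobC"],
        "blobC": [],
    }
    print(sorted(garbage(objects, roots=["commit2"])))
```

Adding "orphan" to the roots, as a live reflog entry would, keeps it and everything beneath it out of the garbage set.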
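This is why reflog entries have an expiry: once a reflog entry expires, the objects it referenced may become unreachable and eligible for pruning. The 30-day default for unreachable entries gives you a month to notice a mistake, and git gc additionally spares unreachable objects newer than gc.pruneExpire (two weeks by default).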
How Git Knows a File Changed
When you modify a file and run git status, Git needs to determine what changed. The algorithm:
- Read the index entry for the file (cached stat data: mtime, size, inode).
- stat() the working tree file.
- If mtime and size match the index, assume the file is unchanged (fast path).
- If they differ, read the file, compute its SHA-1, and compare to the index's blob hash.
- If the hash differs, the file is modified.
The stat cache makes git status fast even in large repositories: most files have not changed, and comparing metadata avoids reading file contents. This is why git status on a 100,000-file repository typically completes in a fraction of a second rather than the time it would take to re-read and hash every file.
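The steps above can be sketched as follows. This is a simplified model that caches only mtime, size, and blob hash; the IndexEntry type and helper names are invented here, and the real index records more fields.

```python
import hashlib
import os
from typing import NamedTuple


class IndexEntry(NamedTuple):
    mtime_ns: int
    size: int
    blob_id: str


def blob_id_of(path: str) -> str:
    """Hash file contents the way Git hashes a blob."""
    with open(path, "rb") as handle:
        data = handle.read()
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


def make_entry(path: str) -> IndexEntry:
    """Record the stat data and blob hash, as staging a file would."""
    info = os.stat(path)
    return IndexEntry(info.st_mtime_ns, info.st_size, blob_id_of(path))


def is_modified(path: str, entry: IndexEntry) -> bool:
    """Return True if the file's content differs from the cached entry."""
    info = os.stat(path)
    if info.st_mtime_ns == entry.mtime_ns and info.st_size == entry.size:
        return False  # fast path: metadata matches, skip reading the file
    # slow path: metadata changed, so only the content hash can decide
    return blob_id_of(path) != entry.blob_id
```

A file whose timestamp changed but whose bytes did not (for example after touch) takes the slow path once and is still reported as unmodified.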
The .git Directory
Everything Git needs is in .git/:
.git/
  HEAD          → current branch reference
  config        → repository-level settings
  index         → staging area (binary)
  objects/      → the object database
    pack/       → packfiles and indexes
    info/       → auxiliary data
  refs/         → branch and tag pointers
    heads/      → local branches
    tags/       → tag refs
    remotes/    → remote-tracking branches
  logs/         → reflog entries
  hooks/        → commit/push hook scripts
Deleting .git/ destroys the repository. The working tree files remain, but all history, branches, and configuration are gone. This is why .git/ is the essential thing to back up: any committed state of the working tree can be reconstructed from it. Changes that were never staged or committed exist only in the working tree.
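To make the layout concrete, the sketch below follows HEAD to the commit it names by reading these files directly. It is an illustrative reader under simplifying assumptions (no worktrees, no nested symbolic refs), and resolve_head is a hypothetical helper, not part of any Git library.

```python
import os


def resolve_head(git_dir: str):
    """Return (ref_name, commit_id) for HEAD; ref_name is None when HEAD is detached."""
    with open(os.path.join(git_dir, "HEAD")) as handle:
        head = handle.read().strip()
    if not head.startswith("ref: "):
        return None, head  # detached HEAD stores a commit ID directly

    ref_name = head[len("ref: "):]
    loose_ref = os.path.join(git_dir, *ref_name.split("/"))
    if os.path.exists(loose_ref):
        with open(loose_ref) as handle:
            return ref_name, handle.read().strip()

    # After refs are packed, the branch lives in packed-refs instead
    packed = os.path.join(git_dir, "packed-refs")
    if os.path.exists(packed):
        with open(packed) as handle:
            for line in handle:
                if line.startswith(("#", "^")):
                    continue  # header comment or peeled-tag line
                commit_id, _, name = line.strip().partition(" ")
                if name == ref_name:
                    return ref_name, commit_id
    return ref_name, None  # branch exists in name only (no commits yet)


if __name__ == "__main__":
    if os.path.isdir(".git"):
        print(resolve_head(".git"))
```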
Next Steps
- How Git Objects Work — the blob/tree/commit model that the internals serve.
- How File Systems Work — the layer beneath Git's object storage.
- How Storage Engines Work — similar concepts (pages, caching, write-ahead logging) in database engines.