B-Tree Internals
Why B-Trees Exist
A B-Tree is not about sorting. It is about minimizing page reads under large datasets. The core objective is simple: keep the tree shallow so that any lookup touches as few pages as possible.
If your index has height 3, a point lookup typically means:
- 1 root page read
- 1 internal page read
- 1 leaf page read
That is usually 3 page touches instead of scanning thousands of table pages.
Tree Structure: Root, Internal, Leaf
A B-Tree index is made of pages:
- Root page: entry point.
- Internal pages: routing nodes with key boundaries.
- Leaf pages: actual key + row pointer pairs.
Leaf pages are usually linked together to support efficient range scans.
What a Leaf Page Actually Stores
A leaf page contains ordered keys and references to table rows. In heap-based systems, that reference is typically a physical row location. That means:
- Index lookup → table page lookup (random IO risk).
- If rows move (updates), index entries may need updates.
Page Splits: The Hidden Cost of Growth
When a leaf page becomes full and a new key must be inserted, the engine performs a page split. The page is divided into two, and the parent page is updated.
This causes:
- Extra writes (write amplification).
- WAL/binlog traffic.
- Potential lock contention.
-- Example pattern causing frequent splits
INSERT INTO orders (id, created_at, user_id)
VALUES (nextval('orders_seq'), NOW(), 42);
Monotonically increasing keys can concentrate writes on the right-most leaf page.
Fillfactor and Free Space Strategy
Many engines allow reserving free space inside index pages. Lower fillfactor means:
- More space for future inserts.
- Fewer page splits.
- More total pages (larger index).
-- Postgres example ALTER INDEX idx_orders_user SET (fillfactor = 80); REINDEX INDEX idx_orders_user;
Tradeoff: larger index vs fewer split operations.
Range Scans and Locality
B-Trees shine at range queries because leaf pages are ordered and linked.
SELECT * FROM orders WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01' ORDER BY created_at;
If the index matches the ORDER BY, the engine can walk leaf pages sequentially, which is cache-friendly and disk-friendly.
Random vs Sequential IO Reality
An index lookup is efficient only if:
- The index fits in memory, or
- The working set has locality.
If the index is large and lookups are random, each row may cause a separate table page read. That is when an index can degrade performance compared to a sequential scan.
Composite Indexes and Leftmost Prefix
In composite indexes, order matters.
CREATE INDEX idx_orders_user_date ON orders (user_id, created_at);
This index supports:
- WHERE user_id = ?
- WHERE user_id = ? AND created_at > ?
It does NOT efficiently support:
- WHERE created_at = ? (without user_id)
Index-Only Scans
If all required columns are in the index, the engine may avoid touching the table.
CREATE INDEX idx_orders_cover ON orders (user_id, created_at) INCLUDE (total_amount);
This reduces table page reads significantly for read-heavy workloads.
Failure Modes in Production
- Rightmost page contention: hot insert workloads block on a single leaf page.
- Excessive page splits: fillfactor too high under heavy insert churn.
- Index bloat: updates/deletes create dead entries.
- Random IO storms: index lookups cause scattered table reads.
- Wrong composite order: index exists but does not serve real queries.
- Over-indexing: write performance collapses under many indexes.
Operational Checklist
- Measure index size vs table size; track growth rate.
- Monitor page splits and write amplification under insert load.
- Align composite index order with real WHERE + ORDER BY patterns.
- Avoid indexing low-selectivity columns alone.
- Evaluate covering indexes for read-heavy endpoints.
- Review rightmost leaf contention for monotonic keys.
- Benchmark with production-like data volume.
- Track p95/p99 latency after index changes.
- Audit unused indexes periodically.
- Understand the cost of every index before adding it.
Summary
B-Trees are efficient because they minimize page depth and support ordered traversal. But every index is a write cost multiplier. In production, index design is not about “making queries fast.” It is about balancing read amplification, write amplification, and locality under real workload patterns.