B-Tree Internals

Node splits, fillfactor, page splits, and what actually happens when indexes grow.

On this page

Why B-Trees Exist

A B-Tree is not about sorting. It is about minimizing page reads under large datasets. The core objective is simple: keep the tree shallow so that any lookup touches as few pages as possible.

If your index has height 3, a point lookup typically means:

1 root page read
1 internal page read
1 leaf page read

That is usually 3 page touches instead of scanning thousands of table pages.

Tree Structure: Root, Internal, Leaf

A B-Tree index is made of pages:

Root page: entry point.
Internal pages: routing nodes with key boundaries.
Leaf pages: actual key + row pointer pairs.

Leaf pages are usually linked together to support efficient range scans.

What a Leaf Page Actually Stores

A leaf page contains ordered keys and references to table rows. In heap-based systems, that reference is typically a physical row location. That means:

Index lookup → table page lookup (random IO risk).
If rows move (updates), index entries may need updates.

Page Splits: The Hidden Cost of Growth

When a leaf page becomes full and a new key must be inserted, the engine performs a page split. The page is divided into two, and the parent page is updated.

This causes:

Extra writes (write amplification).
WAL/binlog traffic.
Potential lock contention.

-- Example pattern causing frequent splits
INSERT INTO orders (id, created_at, user_id)
VALUES (nextval('orders_seq'), NOW(), 42);

Monotonically increasing keys can concentrate writes on the right-most leaf page.

Fillfactor and Free Space Strategy

Many engines allow reserving free space inside index pages. Lower fillfactor means:

More space for future inserts.
Fewer page splits.
More total pages (larger index).

-- Postgres example
ALTER INDEX idx_orders_user SET (fillfactor = 80);
REINDEX INDEX idx_orders_user;

Tradeoff: larger index vs fewer split operations.

Range Scans and Locality

B-Trees shine at range queries because leaf pages are ordered and linked.

SELECT *
FROM orders
WHERE created_at >= '2026-01-01'
  AND created_at < '2026-02-01'
ORDER BY created_at;

If the index matches the ORDER BY, the engine can walk leaf pages sequentially, which is cache-friendly and disk-friendly.

Random vs Sequential IO Reality

An index lookup is efficient only if:

The index fits in memory, or
The working set has locality.

If the index is large and lookups are random, each row may cause a separate table page read. That is when an index can degrade performance compared to a sequential scan.

Composite Indexes and Leftmost Prefix

In composite indexes, order matters.

CREATE INDEX idx_orders_user_date
ON orders (user_id, created_at);

This index supports:

WHERE user_id = ?
WHERE user_id = ? AND created_at > ?

It does NOT efficiently support:

WHERE created_at = ? (without user_id)

Index-Only Scans

If all required columns are in the index, the engine may avoid touching the table.

CREATE INDEX idx_orders_cover
ON orders (user_id, created_at)
INCLUDE (total_amount);

This reduces table page reads significantly for read-heavy workloads.

Failure Modes in Production

Rightmost page contention: hot insert workloads block on a single leaf page.
Excessive page splits: fillfactor too high under heavy insert churn.
Index bloat: updates/deletes create dead entries.
Random IO storms: index lookups cause scattered table reads.
Wrong composite order: index exists but does not serve real queries.
Over-indexing: write performance collapses under many indexes.

Operational Checklist

Measure index size vs table size; track growth rate.
Monitor page splits and write amplification under insert load.
Align composite index order with real WHERE + ORDER BY patterns.
Avoid indexing low-selectivity columns alone.
Evaluate covering indexes for read-heavy endpoints.
Review rightmost leaf contention for monotonic keys.
Benchmark with production-like data volume.
Track p95/p99 latency after index changes.
Audit unused indexes periodically.
Understand the cost of every index before adding it.

Summary

B-Trees are efficient because they minimize page depth and support ordered traversal. But every index is a write cost multiplier. In production, index design is not about “making queries fast.” It is about balancing read amplification, write amplification, and locality under real workload patterns.

← DB Page Layouts

Hash, GIN, and GiST Indexes →