DATABASE-ADVANCED Contents

B-Tree Internals

Node splits, fillfactor, page splits, and what actually happens when indexes grow.

On this page

Why B-Trees Exist

A B-Tree is not about sorting. It is about minimizing page reads under large datasets. The core objective is simple: keep the tree shallow so that any lookup touches as few pages as possible.

If your index has height 3, a point lookup typically means:

  • 1 root page read
  • 1 internal page read
  • 1 leaf page read

That is usually 3 page touches instead of scanning thousands of table pages.

Tree Structure: Root, Internal, Leaf

A B-Tree index is made of pages:

  • Root page: entry point.
  • Internal pages: routing nodes with key boundaries.
  • Leaf pages: actual key + row pointer pairs.

Leaf pages are usually linked together to support efficient range scans.

What a Leaf Page Actually Stores

A leaf page contains ordered keys and references to table rows. In heap-based systems, that reference is typically a physical row location. That means:

  • Index lookup → table page lookup (random IO risk).
  • If rows move (updates), index entries may need updates.

Page Splits: The Hidden Cost of Growth

When a leaf page becomes full and a new key must be inserted, the engine performs a page split. The page is divided into two, and the parent page is updated.

This causes:

  • Extra writes (write amplification).
  • WAL/binlog traffic.
  • Potential lock contention.
-- Example pattern causing frequent splits
INSERT INTO orders (id, created_at, user_id)
VALUES (nextval('orders_seq'), NOW(), 42);

Monotonically increasing keys can concentrate writes on the right-most leaf page.

Fillfactor and Free Space Strategy

Many engines allow reserving free space inside index pages. Lower fillfactor means:

  • More space for future inserts.
  • Fewer page splits.
  • More total pages (larger index).
-- Postgres example
ALTER INDEX idx_orders_user SET (fillfactor = 80);
REINDEX INDEX idx_orders_user;

Tradeoff: larger index vs fewer split operations.

Range Scans and Locality

B-Trees shine at range queries because leaf pages are ordered and linked.

SELECT *
FROM orders
WHERE created_at >= '2026-01-01'
  AND created_at < '2026-02-01'
ORDER BY created_at;

If the index matches the ORDER BY, the engine can walk leaf pages sequentially, which is cache-friendly and disk-friendly.

Random vs Sequential IO Reality

An index lookup is efficient only if:

  • The index fits in memory, or
  • The working set has locality.

If the index is large and lookups are random, each row may cause a separate table page read. That is when an index can degrade performance compared to a sequential scan.

Composite Indexes and Leftmost Prefix

In composite indexes, order matters.

CREATE INDEX idx_orders_user_date
ON orders (user_id, created_at);

This index supports:

  • WHERE user_id = ?
  • WHERE user_id = ? AND created_at > ?

It does NOT efficiently support:

  • WHERE created_at = ? (without user_id)

Index-Only Scans

If all required columns are in the index, the engine may avoid touching the table.

CREATE INDEX idx_orders_cover
ON orders (user_id, created_at)
INCLUDE (total_amount);

This reduces table page reads significantly for read-heavy workloads.

Failure Modes in Production

  • Rightmost page contention: hot insert workloads block on a single leaf page.
  • Excessive page splits: fillfactor too high under heavy insert churn.
  • Index bloat: updates/deletes create dead entries.
  • Random IO storms: index lookups cause scattered table reads.
  • Wrong composite order: index exists but does not serve real queries.
  • Over-indexing: write performance collapses under many indexes.

Operational Checklist

  • Measure index size vs table size; track growth rate.
  • Monitor page splits and write amplification under insert load.
  • Align composite index order with real WHERE + ORDER BY patterns.
  • Avoid indexing low-selectivity columns alone.
  • Evaluate covering indexes for read-heavy endpoints.
  • Review rightmost leaf contention for monotonic keys.
  • Benchmark with production-like data volume.
  • Track p95/p99 latency after index changes.
  • Audit unused indexes periodically.
  • Understand the cost of every index before adding it.

Summary

B-Trees are efficient because they minimize page depth and support ordered traversal. But every index is a write cost multiplier. In production, index design is not about “making queries fast.” It is about balancing read amplification, write amplification, and locality under real workload patterns.