DB Page Layouts

How pages store tuples/rows, visibility metadata, and why physical layout shapes performance.

On this page

What a “Page” Really Is

Databases do not read and write “rows” to disk. They read and write fixed-size blocks called pages (also called blocks). Every query you run becomes a sequence of page reads (from cache or disk) and page writes (often buffered, logged, and flushed later).

Once you think in pages, many production mysteries become explainable: why a query reads far more data than the number of rows returned, why random I/O is expensive, why indexes sometimes make things worse, and why a single long transaction can hurt the entire system.

Heap Storage vs Clustered Storage

Most engines store table data in a heap layout: rows are placed wherever there is free space. This is great for fast inserts but can lead to scattered reads. Some systems support clustered layouts where the physical order on disk follows a key (often the primary key). Clustered storage can be great for range queries but has different write costs.

Think in tradeoffs:

Heap: fast inserts, unpredictable locality, depends heavily on cache.
Clustered: better locality for ordered access, potentially more page splits and write amplification.

Inside a Page: Headers, Slots, Tuples

A page is not a bag of bytes; it has structure: a header, a free-space region, and one or more row versions (tuples). Many engines use a slot directory (an array of pointers) so rows can move within a page without changing the row identifier.

This matters because updates often create new row versions. The old version may remain until vacuum/cleanup can reclaim it. Your “one logical row” may occupy multiple physical tuples over time.

Row Identifiers and Why They Matter

Engines typically identify rows by something like (page_id, slot_id). In MVCC systems (like Postgres), an UPDATE creates a new row version, potentially on the same page or a different page. That means:

Indexes may need to point to new locations (unless special optimizations apply).
Old versions remain until cleanup, consuming space and hurting cache density.
Random I/O can increase over time if locality degrades.

Free Space and Insert Behavior

Inserts need free space. The engine tracks which pages have room for new tuples. If free-space tracking is inaccurate or the workload causes frequent page splits, insert performance can degrade and fragmentation increases.

Practical consequence: two tables with the same number of rows can have very different physical sizes and I/O profiles depending on update churn and vacuum health.

Why “Rows Returned” Lies About Work Done

A query returning 20 rows might still read thousands of pages due to:

Low selectivity predicates (many candidates per page).
Poor locality (rows scattered across pages).
Outdated statistics causing a bad plan (e.g., wrong join order).
Index access degenerating into random page fetches.

Always ask: “How many pages did we touch?” not “How many rows did we return?”

Measuring Page Work with EXPLAIN (Practical)

In production you should confirm whether you are CPU-bound, cache-bound, or disk-bound. Start with EXPLAIN, then move to buffer and I/O visibility where available.

-- Postgres: plan shape and timing
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 20;

What to look for:

Buffers: high shared read means many pages came from disk; high shared hit means cache is working.
Loops: repeated inner scans can multiply page reads.
Sorts: may spill to disk if memory is insufficient.

Failure Modes You Will See in Real Systems

Table bloat: update/delete churn leaves dead tuples; pages become sparse; cache efficiency collapses.
Index bloat: index pages grow with dead entries; random I/O rises; plans regress.
Hot pages: certain pages become contention points (e.g., monotonically increasing keys or heavy insert hotspots).
Random I/O explosions: an index lookup becomes many scattered page reads under poor locality.
Long transactions: prevent cleanup; dead tuples accumulate; latency and disk usage climb.
Misleading benchmarks: synthetic tests fit in cache; production does not.

Operational Checklist

Verify whether key queries are cache-hit dominated or disk-read dominated using buffers/IO metrics.
Track table size growth vs row growth; divergence often indicates bloat.
Identify top tables by update/delete churn; those are your bloat candidates.
Confirm your indexes match access patterns (filter + order + limit) to reduce page reads.
Watch for monotonic insert hotspots; consider strategies to reduce contention.
Keep statistics current (ANALYZE) so the planner can estimate page work correctly.
Budget maintenance (vacuum/optimize) as a first-class production activity.
Measure tail latency (p95/p99); page-level issues often show up there first.
Run periodic “worst query” reviews focusing on pages touched, not just runtime.
Do restore drills and capacity planning assuming your working set will exceed memory.

Summary

Once you think in pages, you can reason about performance and failures with much less guesswork. Everything else in this track builds on this: B-Trees, planner choices, MVCC, vacuum pressure, and replication lag are all page-level stories.

B-Tree Internals →