DATABASE-ADVANCED Contents

Query Optimization Checklist

A production checklist: reproduce, measure, isolate, fix, and validate under load.

On this page

Production Query Optimization Is a Process, Not a Trick

Query tuning in production is not about “add an index” or “rewrite SQL.” It is a disciplined process: reproduce, measure, isolate, change one variable, validate, and observe impact on SLOs.

This checklist is designed to be used during incidents and during planned performance work.

Step 0: Define the Goal and Constraints

Before touching anything, answer:

  • What is the target latency (p95/p99)?
  • What is the acceptable write overhead?
  • Is correctness allowed to weaken (usually no)?
  • What is the rollout/rollback plan?

Optimization without constraints creates outages.

Step 1: Capture the Exact Query and Parameters

Most performance debugging fails because the query is not the same as production. Capture:

  • Full SQL text
  • Actual parameter values
  • Execution frequency (QPS)
  • Concurrency level

Step 2: Measure With EXPLAIN ANALYZE

Do not optimize blind. Capture a plan with actual execution.

EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;

Record:

  • Plan shape
  • Estimated vs actual rows
  • Loops multipliers
  • Buffer hits vs reads
  • Spill indicators (sort/hash to disk)

Step 3: Classify the Bottleneck

Most slow queries fall into one of these buckets:

  • Misestimation: wrong rows → wrong plan
  • Loop amplification: nested loop multipliers
  • IO volume: too many pages touched
  • Memory spill: sorts/hashes overflow
  • Locking: waits dominate actual time

Pick one primary bottleneck first. Do not shotgun changes.

Step 4: Fix Estimation First (If Wrong)

If estimated vs actual rows differ greatly, fix estimation before changing SQL.

  • Refresh statistics (ANALYZE)
  • Increase stats granularity for skewed columns
  • Remove implicit casts
  • Improve predicate clarity
-- Postgres example
ANALYZE orders;

Step 5: Reduce Pages Touched

Pages touched is often the real cost. Typical levers:

  • Composite index matching WHERE + ORDER BY
  • Covering index to avoid heap/table reads
  • Partial index for hot subset
  • Rewrite query to narrow early (push down filters)

Step 6: Control Join Explosion

If joins are the issue:

  • Ensure join keys are indexed appropriately
  • Reduce rows before join (filter early)
  • Validate join order makes sense for real distributions
  • Consider pre-aggregation on large sides

Step 7: Eliminate Memory Spills

If EXPLAIN shows external sorts or hash spills:

  • Reduce row volume upstream
  • Ensure LIMIT is pushed down where possible
  • Increase memory settings carefully (engine-specific)
  • Optimize indexes to avoid sort

Be careful: increasing memory globally can reduce overall concurrency capacity.

Step 8: Validate With Before/After Evidence

After any change, you must produce evidence:

  • Before plan + timing
  • After plan + timing
  • Row estimate improvement (if relevant)
  • Page reads reduction

Optimization without before/after is guesswork.

Step 9: Consider System Effects (Not Just One Query)

In production, a faster query can still be a bad change if it causes:

  • Higher write amplification (more indexes)
  • Replication lag (more WAL/binlog)
  • Cache pressure (larger indexes)
  • Maintenance cost increase

Always evaluate global impact.

Step 10: Rollout Safely

Safe rollout patterns:

  • Deploy index concurrently/online when possible
  • Roll out gradually (canary) if you can control routing
  • Have rollback steps ready (drop index, revert query)

Failure Modes in Production

  • Fixes the query, breaks writes: extra index kills write throughput.
  • Fixes average, worsens tail: cache pressure increases p99.
  • Replica lag incident: index build or new writes amplify WAL/binlog.
  • Plan flips later: stats drift causes regression after days/weeks.
  • Wrong environment: staging data too small, results do not match production.
  • Locking surprise: index build blocks writes unexpectedly.

Production Optimization Checklist

  • Capture the exact SQL and representative parameter values.
  • Measure with EXPLAIN ANALYZE and record plan output.
  • Compare estimated vs actual rows; fix misestimation first.
  • Identify loop multipliers; avoid nested loop explosions.
  • Reduce pages touched; optimize IO pattern.
  • Eliminate memory spills; confirm no external sorts/hashes.
  • Validate index usage; avoid implicit casts and type mismatches.
  • Check global impact: write amplification, cache pressure, replication lag.
  • Roll out changes safely with online/concurrent operations when possible.
  • Monitor p95/p99 latency, error rates, and DB resource metrics after rollout.
  • Document the change with before/after evidence for future regressions.
  • Schedule revalidation after upgrades and major data distribution shifts.

Summary

Production query optimization is a method. Measure first, classify the bottleneck, fix estimation, reduce pages touched, control join explosions, and validate with evidence. Then roll out safely and watch system-wide effects. This is how you tune without causing incidents.