Query Optimization Checklist

A production checklist: reproduce, measure, isolate, fix, and validate under load.

On this page

Production Query Optimization Is a Process, Not a Trick

Query tuning in production is not about “add an index” or “rewrite SQL.” It is a disciplined process: reproduce, measure, isolate, change one variable, validate, and observe impact on SLOs.

This checklist is designed to be used during incidents and during planned performance work.

Step 0: Define the Goal and Constraints

Before touching anything, answer:

What is the target latency (p95/p99)?
What is the acceptable write overhead?
Is correctness allowed to weaken (usually no)?
What is the rollout/rollback plan?

Optimization without constraints creates outages.

Step 1: Capture the Exact Query and Parameters

Most performance debugging fails because the query is not the same as production. Capture:

Full SQL text
Actual parameter values
Execution frequency (QPS)
Concurrency level

Step 2: Measure With EXPLAIN ANALYZE

Do not optimize blind. Capture a plan with actual execution.

EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;

Record:

Plan shape
Estimated vs actual rows
Loops multipliers
Buffer hits vs reads
Spill indicators (sort/hash to disk)

Step 3: Classify the Bottleneck

Most slow queries fall into one of these buckets:

Misestimation: wrong rows → wrong plan
Loop amplification: nested loop multipliers
IO volume: too many pages touched
Memory spill: sorts/hashes overflow
Locking: waits dominate actual time

Pick one primary bottleneck first. Do not shotgun changes.

Step 4: Fix Estimation First (If Wrong)

If estimated vs actual rows differ greatly, fix estimation before changing SQL.

Refresh statistics (ANALYZE)
Increase stats granularity for skewed columns
Remove implicit casts
Improve predicate clarity

-- Postgres example
ANALYZE orders;

Step 5: Reduce Pages Touched

Pages touched is often the real cost. Typical levers:

Composite index matching WHERE + ORDER BY
Covering index to avoid heap/table reads
Partial index for hot subset
Rewrite query to narrow early (push down filters)

Step 6: Control Join Explosion

If joins are the issue:

Ensure join keys are indexed appropriately
Reduce rows before join (filter early)
Validate join order makes sense for real distributions
Consider pre-aggregation on large sides

Step 7: Eliminate Memory Spills

If EXPLAIN shows external sorts or hash spills:

Reduce row volume upstream
Ensure LIMIT is pushed down where possible
Increase memory settings carefully (engine-specific)
Optimize indexes to avoid sort

Be careful: increasing memory globally can reduce overall concurrency capacity.

Step 8: Validate With Before/After Evidence

After any change, you must produce evidence:

Before plan + timing
After plan + timing
Row estimate improvement (if relevant)
Page reads reduction

Optimization without before/after is guesswork.

Step 9: Consider System Effects (Not Just One Query)

In production, a faster query can still be a bad change if it causes:

Higher write amplification (more indexes)
Replication lag (more WAL/binlog)
Cache pressure (larger indexes)
Maintenance cost increase

Always evaluate global impact.

Step 10: Rollout Safely

Safe rollout patterns:

Deploy index concurrently/online when possible
Roll out gradually (canary) if you can control routing
Have rollback steps ready (drop index, revert query)

Failure Modes in Production

Fixes the query, breaks writes: extra index kills write throughput.
Fixes average, worsens tail: cache pressure increases p99.
Replica lag incident: index build or new writes amplify WAL/binlog.
Plan flips later: stats drift causes regression after days/weeks.
Wrong environment: staging data too small, results do not match production.
Locking surprise: index build blocks writes unexpectedly.

Production Optimization Checklist

Capture the exact SQL and representative parameter values.
Measure with EXPLAIN ANALYZE and record plan output.
Compare estimated vs actual rows; fix misestimation first.
Identify loop multipliers; avoid nested loop explosions.
Reduce pages touched; optimize IO pattern.
Eliminate memory spills; confirm no external sorts/hashes.
Validate index usage; avoid implicit casts and type mismatches.
Check global impact: write amplification, cache pressure, replication lag.
Roll out changes safely with online/concurrent operations when possible.
Monitor p95/p99 latency, error rates, and DB resource metrics after rollout.
Document the change with before/after evidence for future regressions.
Schedule revalidation after upgrades and major data distribution shifts.

Summary

Production query optimization is a method. Measure first, classify the bottleneck, fix estimation, reduce pages touched, control join explosions, and validate with evidence. Then roll out safely and watch system-wide effects. This is how you tune without causing incidents.

← Plan Regressions