Query Optimization Checklist
Production Query Optimization Is a Process, Not a Trick
Query tuning in production is not about “add an index” or “rewrite SQL.” It is a disciplined process: reproduce, measure, isolate, change one variable, validate, and observe impact on SLOs.
This checklist is designed to be used during incidents and during planned performance work.
Step 0: Define the Goal and Constraints
Before touching anything, answer:
- What is the target latency (p95/p99)?
- What is the acceptable write overhead?
- Is correctness allowed to weaken (usually no)?
- What is the rollout/rollback plan?
Optimization without constraints creates outages.
Step 1: Capture the Exact Query and Parameters
Most performance debugging fails because the query is not the same as production. Capture:
- Full SQL text
- Actual parameter values
- Execution frequency (QPS)
- Concurrency level
Step 2: Measure With EXPLAIN ANALYZE
Do not optimize blind. Capture a plan with actual execution.
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
Record:
- Plan shape
- Estimated vs actual rows
- Loops multipliers
- Buffer hits vs reads
- Spill indicators (sort/hash to disk)
Step 3: Classify the Bottleneck
Most slow queries fall into one of these buckets:
- Misestimation: wrong rows → wrong plan
- Loop amplification: nested loop multipliers
- IO volume: too many pages touched
- Memory spill: sorts/hashes overflow
- Locking: waits dominate actual time
Pick one primary bottleneck first. Do not shotgun changes.
Step 4: Fix Estimation First (If Wrong)
If estimated vs actual rows differ greatly, fix estimation before changing SQL.
- Refresh statistics (ANALYZE)
- Increase stats granularity for skewed columns
- Remove implicit casts
- Improve predicate clarity
-- Postgres example ANALYZE orders;
Step 5: Reduce Pages Touched
Pages touched is often the real cost. Typical levers:
- Composite index matching WHERE + ORDER BY
- Covering index to avoid heap/table reads
- Partial index for hot subset
- Rewrite query to narrow early (push down filters)
Step 6: Control Join Explosion
If joins are the issue:
- Ensure join keys are indexed appropriately
- Reduce rows before join (filter early)
- Validate join order makes sense for real distributions
- Consider pre-aggregation on large sides
Step 7: Eliminate Memory Spills
If EXPLAIN shows external sorts or hash spills:
- Reduce row volume upstream
- Ensure LIMIT is pushed down where possible
- Increase memory settings carefully (engine-specific)
- Optimize indexes to avoid sort
Be careful: increasing memory globally can reduce overall concurrency capacity.
Step 8: Validate With Before/After Evidence
After any change, you must produce evidence:
- Before plan + timing
- After plan + timing
- Row estimate improvement (if relevant)
- Page reads reduction
Optimization without before/after is guesswork.
Step 9: Consider System Effects (Not Just One Query)
In production, a faster query can still be a bad change if it causes:
- Higher write amplification (more indexes)
- Replication lag (more WAL/binlog)
- Cache pressure (larger indexes)
- Maintenance cost increase
Always evaluate global impact.
Step 10: Rollout Safely
Safe rollout patterns:
- Deploy index concurrently/online when possible
- Roll out gradually (canary) if you can control routing
- Have rollback steps ready (drop index, revert query)
Failure Modes in Production
- Fixes the query, breaks writes: extra index kills write throughput.
- Fixes average, worsens tail: cache pressure increases p99.
- Replica lag incident: index build or new writes amplify WAL/binlog.
- Plan flips later: stats drift causes regression after days/weeks.
- Wrong environment: staging data too small, results do not match production.
- Locking surprise: index build blocks writes unexpectedly.
Production Optimization Checklist
- Capture the exact SQL and representative parameter values.
- Measure with EXPLAIN ANALYZE and record plan output.
- Compare estimated vs actual rows; fix misestimation first.
- Identify loop multipliers; avoid nested loop explosions.
- Reduce pages touched; optimize IO pattern.
- Eliminate memory spills; confirm no external sorts/hashes.
- Validate index usage; avoid implicit casts and type mismatches.
- Check global impact: write amplification, cache pressure, replication lag.
- Roll out changes safely with online/concurrent operations when possible.
- Monitor p95/p99 latency, error rates, and DB resource metrics after rollout.
- Document the change with before/after evidence for future regressions.
- Schedule revalidation after upgrades and major data distribution shifts.
Summary
Production query optimization is a method. Measure first, classify the bottleneck, fix estimation, reduce pages touched, control join explosions, and validate with evidence. Then roll out safely and watch system-wide effects. This is how you tune without causing incidents.