Rollbacks vs Forward-Fix (When Each Wins)
On this page
Production incident
A new release introduces a bug. The team tries to roll back, but the migration dropped a column. Old version crashes. Now rollback is impossible. You must forward-fix under pressure. Another incident: rollback would have been safe, but it took 30 minutes because image tags were not immutable and the pipeline was slow. A rollback that takes an hour is not rollback, it is a second incident.
Symptoms
- Rollback fails due to schema incompatibility.
- Rollback is slow and manual; people hesitate and waste time.
- Forward-fix is rushed and risky because rollback was not viable.
Root causes
- Destructive migrations in the same release.
- No expand/contract discipline.
- No immutable artifact versioning.
- No rehearsed rollback procedure.
Decision rule: rollback vs forward-fix
- Rollback when: bug is clearly introduced by the release, data compatibility is safe, and rollback is fast (< minutes).
- Forward-fix when: schema/data has changed incompatibly, or the bug requires a quick patch without reverting state.
- Mitigate first: reduce blast radius (feature flags off, rate limits, disable endpoint) while you decide.
Anti-pattern
- “We’ll just roll back” without verifying schema compatibility.
- Rolling back repeatedly, flapping versions, causing cache storms and unstable behavior.
- Forward-fixing without canary, causing a second regression.
Correct pattern
- Make rollback possible: immutable versions, blue/green traffic switch, non-destructive migrations.
- Have feature flags to disable risky paths without deploy.
- Rehearse rollback quarterly; treat it like a fire drill.
Operational notes
- Monitoring: rollback success rate, time-to-rollback, and incident recurrence after rollback.
- Rollout: canary releases reduce the need for rollbacks.
- Postmortem: every rollback failure is a design failure. Fix the process and the deployment model.
Checklist
- Rollback is a traffic switch, not a rebuild.
- Artifacts are immutable and versioned.
- Migrations are compatible with rollback (expand/contract).
- Feature flags exist for high-risk behaviors.
- Rollback is rehearsed and documented.