Go Production Checklist

A production checklist for Go services covering reliability, security, performance, observability, deployments, and data safety. Use this as a pre-launch and post-incident standard.

Go Production Checklist: What Must Be True Before Go Live

This checklist is designed for real production environments: containerized deployments, rolling updates, external dependencies, and incident-driven operations. It is intentionally pragmatic. If you cannot confidently check an item, it is a risk. The goal is not perfection, but predictable behavior under load, failure, and change.

Use this checklist for:

  • Pre-launch readiness review
  • Deployment safety verification
  • Post-incident hardening
  • New service template standards

1. Service Basics and Contracts

  • Service name, version, and environment are exposed in logs and health summary
  • Public API contract documented and backward compatibility rules defined
  • Input validation strict: unknown fields rejected, body size limited, content type validated
  • Error responses are consistent and do not leak internal details
  • Idempotency rules documented for write endpoints that may be retried

2. Configuration and Secrets

  • All configuration is externalized via env vars or mounted config files
  • Config validated at startup and service fails fast on invalid config
  • Secrets are not embedded in images or logs
  • Secret rotation strategy defined and tested
  • Default settings are safe and conservative

3. Timeouts, Retries, and Backpressure

  • HTTP server read, read-header, write, and idle timeouts set (http.Server applies none by default)
  • All outbound HTTP calls have timeouts and context propagation
  • Retries are limited and use exponential backoff with jitter
  • Retry only idempotent operations or use idempotency keys
  • Backpressure exists: concurrency limits for heavy endpoints or queues
  • Circuit breaker or failure containment strategy exists for unstable dependencies

4. Database Safety

  • All SQL queries are parameterized and safe from injection
  • database/sql pool configured: SetMaxOpenConns, SetMaxIdleConns, SetConnMaxLifetime
  • Context used for all DB calls and timeouts enforced
  • Rows are always closed and rows.Err() is checked after iteration
  • Transactions are short and boundaries are explicit
  • Deadlock and serialization failure retry strategy defined
  • Least privilege DB user used, no admin permissions in app runtime
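
Several of the items above fit in one short database/sql sketch: pool limits, a per-query timeout, a parameterized query, and the rows.Close plus rows.Err pattern. The pool numbers, the users table, and the function names are hypothetical, and the $1 placeholder assumes a PostgreSQL driver.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// poolConfig holds illustrative limits. Size MaxOpen against the database's
// connection budget divided by the number of service replicas.
type poolConfig struct {
	MaxOpen     int
	MaxIdle     int
	MaxLifetime time.Duration
	MaxIdleTime time.Duration
}

// defaultPoolConfig returns placeholder values, not recommendations.
func defaultPoolConfig() poolConfig {
	return poolConfig{
		MaxOpen:     20,
		MaxIdle:     10,
		MaxLifetime: 30 * time.Minute,
		MaxIdleTime: 5 * time.Minute,
	}
}

// apply configures the pool on an already opened *sql.DB.
func (c poolConfig) apply(db *sql.DB) {
	db.SetMaxOpenConns(c.MaxOpen)
	db.SetMaxIdleConns(c.MaxIdle)
	db.SetConnMaxLifetime(c.MaxLifetime)
	db.SetConnMaxIdleTime(c.MaxIdleTime)
}

// activeUserNames queries a hypothetical users(name, active) table.
func activeUserNames(ctx context.Context, db *sql.DB, limit int) ([]string, error) {
	// Bound the query even if the caller's context has no deadline.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	// The value is passed as a parameter, never concatenated into the SQL.
	// $1 is PostgreSQL syntax; other drivers may use ?.
	rows, err := db.QueryContext(ctx,
		`SELECT name FROM users WHERE active = true ORDER BY name LIMIT $1`, limit)
	if err != nil {
		return nil, fmt.Errorf("query users: %w", err)
	}
	defer rows.Close() // returns the connection to the pool

	var names []string
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			return nil, fmt.Errorf("scan user: %w", err)
		}
		names = append(names, name)
	}
	// rows.Next returning false can mean an error, not only end of data.
	if err := rows.Err(); err != nil {
		return nil, fmt.Errorf("iterate users: %w", err)
	}
	return names, nil
}
```

After sql.Open, the service would call defaultPoolConfig().apply(db) once. Forgetting rows.Close on an early return is a common source of the connection leaks that the quick gate at the end of this checklist asks you to rule out.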

5. Migrations and Schema Evolution

  • Migrations are backward compatible for rolling deploys
  • No destructive schema changes during mixed version rollout
  • Backfills are batched and do not lock tables for long periods
  • Migration execution strategy is defined and controlled
  • Rollback plan exists and is practiced

6. HTTP and API Hardening

  • Request body size limited
  • Rate limiting strategy exists for public endpoints
  • CORS configured intentionally where applicable
  • Secure headers set when serving web traffic
  • Authentication and authorization rules enforced consistently
  • Logging does not include tokens, cookies, or credentials

7. Observability: Logs, Metrics, Traces

  • Structured logs with consistent fields
  • request_id present in every request log
  • Metrics exposed on /metrics and scraped successfully
  • RED metrics per route: rate, errors, duration
  • Dependency metrics: DB latency, external API latency, cache hit rate
  • Saturation metrics: db pool wait, queue depth, goroutines, heap, GC
  • Dashboards exist for service overview and dependency health
  • Alerts are tied to user impact and SLOs where possible
  • Tracing propagation exists for critical flows when needed

8. Performance and Resource Control

  • pprof available on protected admin port for incident debugging
  • Known hot paths benchmarked and allocation metrics tracked
  • Payload sizes controlled via pagination and field selection
  • Concurrency limits align with DB pool and downstream capacity
  • No unbounded buffering of large requests or responses
  • GC and heap behavior monitored for regressions
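
One way to approach the pprof and concurrency-limit items is sketched below: profiling handlers live on a separate admin server, and a small semaphore sheds load instead of queueing without bound. The names newAdminServer, limiter, and limitConcurrency are hypothetical, and binding the admin server to a loopback or otherwise protected address is an assumption about the deployment.

```go
package main

import (
	"net/http"
	"net/http/pprof"
	"time"
)

// newAdminServer serves pprof on its own mux. Note that importing
// net/http/pprof also registers handlers on http.DefaultServeMux, so the
// public server should use its own mux rather than the default one.
func newAdminServer(addr string) *http.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return &http.Server{
		Addr:              addr,
		Handler:           mux,
		ReadHeaderTimeout: 5 * time.Second,
		// WriteTimeout is left unset on purpose: CPU profiles and traces
		// stream for many seconds and would be cut off by a short value.
	}
}

// limiter is a counting semaphore built on a buffered channel.
type limiter chan struct{}

// newLimiter allows at most n concurrent holders.
func newLimiter(n int) limiter { return make(limiter, n) }

// tryAcquire takes a slot without blocking and reports whether it succeeded.
func (l limiter) tryAcquire() bool {
	select {
	case l <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot taken by a successful tryAcquire.
func (l limiter) release() { <-l }

// limitConcurrency rejects requests with 503 when all slots are busy.
func limitConcurrency(l limiter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.tryAcquire() {
			w.Header().Set("Retry-After", "1")
			http.Error(w, "server busy", http.StatusServiceUnavailable)
			return
		}
		defer l.release()
		next.ServeHTTP(w, r)
	})
}
```

The admin server might listen on an address such as 127.0.0.1:6060 and be reached through port forwarding during an incident. For a heavy endpoint that holds a database connection for its whole duration, choosing a limiter size at or below the pool's maximum open connections keeps requests from piling up waiting on the pool.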

9. Graceful Shutdown and Health Endpoints

  • SIGTERM handled
  • Graceful shutdown drains in-flight requests
  • Readiness fails immediately on shutdown start
  • Liveness endpoint is cheap and dependency free
  • Readiness reflects ability to serve traffic and uses timeouts
  • Startup probe strategy exists for slow initialization

10. Deployment Safety

  • Rolling update strategy preserves capacity
  • Canary release path exists for risky changes
  • Connection storms against dependencies during rollouts controlled via pool limits and rollout pacing
  • Deploy is observable with dashboard annotations and version tags
  • Rollback path documented and tested

11. Container and Runtime Security

  • Image built with multi-stage build
  • Runtime image minimal and includes needed CA certs
  • Runs as non-root when possible
  • Ports and network exposure minimized
  • File system permissions and volumes defined intentionally
  • Security updates and base image refresh policy defined

12. Operations and Incident Readiness

  • Runbook exists for common incidents: high latency, db pool saturation, dependency outage
  • Known failure modes documented
  • On-call response steps defined: triage starts with metrics, then moves to profiles
  • Post-incident review process exists and produces concrete follow-ups
  • Chaos or failure injection tests planned for critical dependencies

13. Data Protection

  • Backups exist and restore drills performed
  • PII handling rules defined and enforced
  • Audit logging exists for sensitive operations if required
  • Data retention policy defined and applied

Pre-Launch Quick Gate

If you only have time for a minimal gate before go live, confirm these:

  • Readiness and graceful shutdown proven in a rollout test
  • DB pool configured and monitored, no connection leaks
  • Request validation strict and body size limited
  • Metrics: per route latency and error rates visible
  • Alerts: user impact alerts active
  • Migration strategy safe for rolling deploys

Final Perspective

A Go service becomes production ready when its failure behavior is predictable and observable. This checklist reduces surprise. Use it as a living standard: every incident should add one improvement, one metric, or one guardrail. Over time, production becomes boring. That is the goal.