Running SQL in Production: My Pragmatic Approach to Safe, Zero-Downtime Migrations
Concrete patterns I use to deploy SQL changes safely: plan, test, monitor, and rollback with confidence.
I've broken production databases more times than I'd like to admit. After a few outages and a handful of panicked rollbacks at 2 AM, I developed a simple, repeatable playbook for deploying SQL changes with minimal risk: plan, test, monitor, and rollback with confidence. This isn't academic—it's the set of patterns I use daily on Postgres for tables that see millions of rows and live traffic.
Below I walk through the concrete steps, commands, and scripts I actually run. If you care about zero-downtime and being able to sleep after a deploy, read on.
Why this matters
DDL can be deceptively dangerous. A naive ALTER TABLE or CREATE INDEX can lock a table, trigger long-running rewrites, or blow up I/O. For a small app that’s annoying; for something making money in production, it’s a disaster.
My guiding principles:
- Break changes into small, reversible steps.
- Never deploy a blocking rewrite during peak traffic.
- Test on production-like data and monitor progress.
- Use feature flags and shadow writes where possible.
1) Plan: understand the impact before touching prod
Start with data, not code.
- Estimate table size and row counts: psql -c "SELECT relname, pg_total_relation_size(relid) FROM pg_catalog.pg_statio_user_tables ORDER BY 2 DESC LIMIT 10;"
- Check average row size (this scans the table; use TABLESAMPLE on very large ones): psql -c "SELECT avg(pg_column_size(t)) AS avg_row_bytes, count(*) FROM my_table t;"
- Look for indexes and foreign keys that will affect writes and deletes: psql -c "\d+ my_table"
Ask: Will ALTER TABLE rewrite the table? In Postgres, adding a column with a non-null default rewrites the whole table on versions before 11 (and a volatile default still forces a rewrite on newer ones). Safer: add a nullable column, backfill, then set NOT NULL.
Estimate time/cost of backfill by sampling:
- Use EXPLAIN for any heavy UPDATEs
- Run a small batch and measure rows/sec, then project
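A minimal probe for that, run on a copy (the SET expression here is just a stand-in for the real backfill logic):
\timing on
BEGIN;
UPDATE my_table SET new_col = md5(id::text) WHERE id BETWEEN 1 AND 5000;  -- stand-in expression
ROLLBACK;  -- keep the measurement, discard the writes
If 5,000 rows take two seconds, that's 2,500 rows/sec, so a 40M-row backfill projects to roughly 4.5 hours of active work before any throttling pauses.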
Document a rollback plan before you start: feature flag toggles, reversing scripts, and when to restore from backup.
2) Test: run it on representative data
Staging with a tiny dataset doesn’t catch locking and IO behavior. I either:
- Use a recent production snapshot (preferred), or
- Copy a sampled slice (e.g., a 1% dump that preserves the schema, indexes, and data distribution) and scale the timings conservatively.
Run EXPLAIN (and ANALYZE where you can afford it) on the UPDATEs and SELECTs you'll touch. Note that Postgres can't EXPLAIN a CREATE INDEX; to estimate an index build, run CREATE INDEX CONCURRENTLY against a production-sized copy and measure the wall time (it can't run inside a transaction, so time it with \timing or from the shell).
If you can, run the operation against a copy of production during off-peak hours and measure wall time and I/O. That gives realistic expectations.
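On that copy, I also like wrapping a representative UPDATE in a throwaway transaction to see buffer hits versus reads (again, the SET expression is a stand-in):
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE my_table SET new_col = md5(id::text) WHERE id BETWEEN 1 AND 5000;
ROLLBACK;  -- keeps the timing and buffer stats, discards the writes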
3) Deploy patterns I use for zero-downtime changes
Below are the specific, repeatable patterns I rely on.
Add a column safely (no table rewrite)
- Avoid adding a NOT NULL column with a DEFAULT in one step; on older Postgres that forces a full table rewrite.
- Steps:
- ALTER TABLE my_table ADD COLUMN new_col TEXT;
- Backfill in batches (see script below).
- ALTER TABLE my_table ALTER COLUMN new_col SET NOT NULL;
- ALTER TABLE my_table ALTER COLUMN new_col SET DEFAULT '...'; (optional)
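Spelled out as SQL, with one refinement worth hedging: on Postgres 12 and later, SET NOT NULL skips its full-table scan if a validated CHECK constraint already proves the column is non-null, so the slow validation can happen under a much lighter lock first (constraint name and default value are illustrative):
ALTER TABLE my_table ADD COLUMN new_col TEXT;  -- metadata-only, no rewrite
-- ... run the batch backfill below ...
ALTER TABLE my_table ADD CONSTRAINT new_col_not_null CHECK (new_col IS NOT NULL) NOT VALID;
ALTER TABLE my_table VALIDATE CONSTRAINT new_col_not_null;  -- scans the table, but doesn't block reads or writes
ALTER TABLE my_table ALTER COLUMN new_col SET NOT NULL;     -- Postgres 12+ uses the constraint and skips the scan
ALTER TABLE my_table DROP CONSTRAINT new_col_not_null;
ALTER TABLE my_table ALTER COLUMN new_col SET DEFAULT 'n/a';  -- optional; default value is a placeholder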
Batch backfill pattern (psql + bash)
This is the snippet I run for large tables. Adjust CHUNK_SIZE to match your measured throughput.
#!/usr/bin/env bash
set -euo pipefail
CHUNK_SIZE=5000
PG_CONN="dbname=app user=app_user"
while true; do
  # Boundary id: the (CHUNK_SIZE+1)-th row that still needs backfilling.
  last_id=$(psql "$PG_CONN" -Atc "SELECT id FROM my_table WHERE new_col IS NULL ORDER BY id LIMIT 1 OFFSET $CHUNK_SIZE;")
  if [ -z "$last_id" ]; then
    # Fewer than CHUNK_SIZE rows left; Postgres has no UPDATE ... LIMIT, so finish them all at once.
    psql "$PG_CONN" -c "UPDATE my_table SET new_col = compute_value(...) WHERE new_col IS NULL;"
    break
  fi
  # One short transaction per chunk (psql -c runs each statement in its own transaction).
  psql "$PG_CONN" -c "UPDATE my_table SET new_col = compute_value(...) WHERE id <= $last_id AND new_col IS NULL;"
  echo "Backfilled up to id $last_id"
  sleep 0.5  # breathing room for vacuum and replicas
done
Notes:
- Use small transactions to avoid WAL bloat and long locks.
- Keep CHUNK_SIZE conservative; raise it only when the system has headroom.
- Walk the table in primary-key order so each chunk stays a cheap index range scan instead of degenerating into bitmap scans over scattered rows.
Create indexes without locking readers/writers
- Use CREATE INDEX CONCURRENTLY. It builds the index without blocking reads or writes, but:
- Can't be inside a transaction.
- Is slower (two table scans plus waits on open transactions) and can fail, leaving an INVALID index you must drop and re-create.
- Monitor progress on Postgres 12+ with: psql -c "SELECT * FROM pg_stat_progress_create_index;"
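For example (index and column names are placeholders):
CREATE INDEX CONCURRENTLY idx_my_table_col ON my_table (col);
-- If the build errors out it leaves an INVALID index behind; find and drop it before retrying:
SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid;
DROP INDEX CONCURRENTLY idx_my_table_col;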
If you need to rebuild heavy indexes or reclaim table bloat, consider pg_repack. It rewrites the table online and can be a lifesaver for reclaiming space.
Zero-downtime schema swap (for risky changes)
When a column rename or complex rewrite is required, I do a shadow-copy approach:
- CREATE TABLE my_table_new AS SELECT ... FROM my_table WITH NO DATA;
- Add triggers on the old table first so writes are mirrored into the new one, then run a background process that copies the existing rows over in batches.
- When synced, switch application reads to new table (feature flag/DB view), then drop old.
This pattern is heavier but entirely reversible until you flip the flag.
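A rough sketch of the write-sync trigger (assuming my_table has an id primary key, my_table_new has a unique index on id, and col_a/col_b stand in for the real columns):
CREATE OR REPLACE FUNCTION mirror_to_my_table_new() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'DELETE' THEN
    DELETE FROM my_table_new WHERE id = OLD.id;
    RETURN OLD;
  END IF;
  INSERT INTO my_table_new (id, col_a, col_b)
  VALUES (NEW.id, NEW.col_a, NEW.col_b)
  ON CONFLICT (id) DO UPDATE SET col_a = EXCLUDED.col_a, col_b = EXCLUDED.col_b;  -- needs the unique index on id
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mirror_my_table
AFTER INSERT OR UPDATE OR DELETE ON my_table
FOR EACH ROW EXECUTE FUNCTION mirror_to_my_table_new();  -- EXECUTE PROCEDURE on Postgres 10 and older
Keep the trigger in place until well after the switch; drop it together with the old table.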
Shadow write pattern (code-level)
- Start writing to both old and new columns/tables in code behind a feature flag.
- Verify reads from new copy in a small percentage of requests.
- Once confidence is high, promote new copy to primary and stop dual writing.
I use simple feature flags (unfancy boolean flags in Redis or a tiny internal toggler). Feature flags are the most underrated migration tool.
4) Monitor: watch the right signals
Real-time monitoring is essential. I keep these handy in my deployment checklist.
psql quick checks
- Long-running queries: SELECT pid, now() - query_start AS duration, state, query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY duration DESC LIMIT 10;
- Locks (who is blocking whom): SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query, blocker.pid AS blocker_pid, blocker.query AS blocker_query FROM pg_stat_activity blocked JOIN pg_locks bl ON bl.pid = blocked.pid JOIN pg_locks br ON bl.locktype = br.locktype AND bl.database IS NOT DISTINCT FROM br.database AND bl.relation IS NOT DISTINCT FROM br.relation AND bl.page IS NOT DISTINCT FROM br.page AND bl.tuple IS NOT DISTINCT FROM br.tuple AND bl.virtualxid IS NOT DISTINCT FROM br.virtualxid AND bl.transactionid IS NOT DISTINCT FROM br.transactionid AND bl.classid IS NOT DISTINCT FROM br.classid AND bl.objid IS NOT DISTINCT FROM br.objid AND bl.objsubid IS NOT DISTINCT FROM br.objsubid JOIN pg_stat_activity blocker ON br.pid = blocker.pid WHERE blocked.pid <> blocker.pid AND NOT bl.granted AND br.granted;
- Index creation progress (Postgres 12+): SELECT * FROM pg_stat_progress_create_index;
System metrics
- I/O wait (iowait), CPU, WAL generation.
- My preference: Grafana + Prometheus with node_exporter and postgres_exporter. I watch WAL volume spikes and disk queue depth.
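For an ad-hoc WAL-generation probe in psql on the primary (the 60-second window is arbitrary):
SELECT pg_current_wal_lsn() AS start_lsn \gset
SELECT pg_sleep(60);
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'start_lsn')) AS wal_written_last_minute;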
Alert thresholds
- If a background UPDATE drops below about 10% of its expected throughput, I pause and investigate.
- If replication lag exceeds a few seconds (for HA setups), I pause aggressive operations (see the lag query below).
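For that lag check, something like this on the primary works (Postgres 10+):
SELECT application_name, state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_size,
       replay_lag
FROM pg_stat_replication;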
If something looks wrong, toggle the feature flag to stop reading from the new schema or kill the background worker.
5) Rollback: prepare for and execute reversals
Rollback is mostly about forward-thinking.
Make destructive operations reversible:
- Avoid DROP COLUMN in the same deploy as a migration that populates a replacement.
- Keep old columns for at least one release cycle after a swap.
- Maintain a reverse migration script when possible.
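For the add-column pattern earlier, the reverse script is short (names carried over from that example; run it only after reads are routed back to the old path):
DROP INDEX CONCURRENTLY IF EXISTS idx_my_table_col;
ALTER TABLE my_table ALTER COLUMN new_col DROP NOT NULL;
ALTER TABLE my_table DROP COLUMN IF EXISTS new_col;  -- metadata-only; the data is simply hidden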
Rollback steps I follow:
- Toggle feature flag to route traffic back to the old path.
- Stop any background writers that are populating the new schema.
- If a backfill caused trouble, stop it, and consider resuming with smaller chunks or off-peak hours.
- As a last resort, restore from backup (I keep nightly base backups plus WAL archives for PITR). Restores are slow; use them only for catastrophic failures.
Always practice rollback in staging. Know how long restoring a backup takes and factor that into your runbook.
Tools I use (practical)
- psql (obviously) — direct and reliable.
- Flyway / Sqitch — for migration versioning; I keep big rewrites out-of-band as scripts.
- pg_repack — to reclaim bloat online.
- pg_stat_statements, pg_stat_progress_create_index — built-in helpers.
- Prometheus + Grafana — monitoring and alerting.
- Small feature-flag library (I roll my own simple one, toggles in Redis).
Example: a real-life change I did last month
Context: I needed a new JSONB column derived from an expensive join on a 40M row table. Naive ALTER + UPDATE would rewrite 40M rows and block writes.
What I did:
- Added nullable jsonb column.
- Implemented a small background service that queried rows in chunks of 2k, computed the JSONB payload, and updated them with short transactions. The service auto-throttled based on replication lag and I/O.
- Created an index on a frequently queried JSONB key using CREATE INDEX CONCURRENTLY.
- Gradually rolled out read-side code behind a flag. I routed 5% of reads to the new column, validated results, increased to 100%, then removed the old join.
- After two weeks of stable ops, I set NOT NULL and removed the old columns in a follow-up maintenance window.
Result: zero downtime, no customer-facing errors, and total wall time across the background job ~12 hours (mostly idle time while throttled).
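The throttle in that background service boiled down to a check like this before each chunk (a sketch, not the actual code; the 64 MB threshold is arbitrary):
PG_CONN="dbname=app user=app_user"   # same connection string as the backfill script
MAX_LAG_BYTES=$((64 * 1024 * 1024))
while true; do
  lag=$(psql "$PG_CONN" -Atc "SELECT COALESCE(max(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0)::bigint FROM pg_stat_replication;")
  [ "${lag:-0}" -lt "$MAX_LAG_BYTES" ] && break   # replicas caught up: process the next chunk
  echo "replication lag ${lag} bytes; pausing backfill"
  sleep 5
done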
Conclusion / Takeaways
- Plan: measure table size and estimate rewrite cost before touching prod.
- Test: run operations on production-like data; measure throughput.
- Deploy in small, reversible steps: nullable columns, batch backfills, CREATE INDEX CONCURRENTLY, shadow writes, and feature flags.
- Monitor the right things: pg_stat_activity, pg_locks, pg_stat_progress_create_index, and system I/O.
- Rollback by toggling flags, stopping background jobs, and—only if necessary—restoring from backups.
If you remember nothing else: never do large, blocking DDL in one shot. Break it into small moves, and you'll be able to roll forward or back without losing sleep.
If you want, I can share my batch backfill library (a tiny Rust/Go script I use) or a checklist you can paste into your deploy docs. Follow me on X @fullybootstrap for small ops posts like this.