
We Deployed to Production on a Friday — Here's What Happened


This actually happened to a team I worked with. The deploy looked clean, the tests passed, and everyone went home for the weekend. By Saturday morning, the on-call engineer’s phone was on fire.

It was 4:47 PM on a Friday. The PR had been approved. Tests were green. “It’s a small change,” someone said. Famous last words.

The change

A database migration that added a new column to the users table. Simple ALTER TABLE ADD COLUMN. We’d done it a hundred times.
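
The change looked something like this — sketched here as an Alembic-style migration, since the tool itself isn't the point; the column name and default are made up for illustration:

```python
# Hypothetical reconstruction of the migration. Only "add one column with a
# default to the users table" is real; the names are invented.
from alembic import op
import sqlalchemy as sa


def upgrade():
    op.add_column(
        "users",
        sa.Column("marketing_opt_in", sa.Boolean(),
                  server_default=sa.text("false"), nullable=False),
    )


def downgrade():
    op.drop_column("users", "marketing_opt_in")
```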

What went wrong

5:02 PM — Deploy goes out

The migration runs. Takes 3 seconds in staging. In production, with 2.3 million rows, it takes 47 seconds. During those 47 seconds, the table is locked.
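
That lock is an ACCESS EXCLUSIVE lock, so nothing can read from users while the rewrite runs. One guard we didn't have at the time is a lock_timeout / statement_timeout on the migration session, so a slow DDL fails fast instead of holding the site hostage. A sketch, with placeholder connection details and timeouts:

```python
# A sketch of a guard we didn't have then: cap how long a migration may wait
# for, or hold, its lock. The DSN, timeouts, and column name are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=app user=deploy")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("SET lock_timeout = '2s'")        # give up if the lock can't be acquired quickly
    cur.execute("SET statement_timeout = '10s'")  # abort a rewrite that drags on
    cur.execute("ALTER TABLE users ADD COLUMN marketing_opt_in boolean DEFAULT false")
```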

5:02 PM — Alerts fire

Every API endpoint that touches the users table starts timing out. That’s… most of them. Health checks fail. The load balancer starts returning 502s.

5:03 PM — Cascade begins

The frontend retries failed requests. Each retry hits the locked table. Connection pool fills up. Now even endpoints that don’t touch users can’t get a database connection.
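
The pool part is easy to underestimate, so here's a toy model of it — a plain queue standing in for the real pool. Once every slot is held by a request stuck on the locked table, a request that never touches users can't get a connection either:

```python
# Toy model of the cascade (not our real pool). All slots get held by
# requests stuck on the locked users table; unrelated requests then starve.
import queue

POOL_SIZE = 20
pool = queue.Queue(maxsize=POOL_SIZE)
for i in range(POOL_SIZE):
    pool.put(f"conn-{i}")

# Requests blocked on the locked table hold their connections for the full
# 47 seconds; frontend retries pile up behind them.
stuck = [pool.get() for _ in range(POOL_SIZE)]

# An unrelated endpoint asks for a connection and is rejected immediately,
# which is what our pool did instead of queueing with a timeout.
try:
    pool.get_nowait()
except queue.Empty:
    print("no connections left -- unrelated endpoint fails too")
```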

5:04 PM — Full outage

The site is down. 100% error rate. PagerDuty is screaming. The Slack channel has 47 unread messages.

The scramble

5:05 PM — “Who deployed?” (It was all of us. We approved it.)

5:06 PM — Check if the migration is still running. It finished at 5:03. The table is unlocked. But the connection pool is exhausted and not recovering.

5:08 PM — Restart the application servers. Connections reset. Health checks pass. Traffic starts flowing.

5:12 PM — Site is back. Total downtime: 10 minutes.
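
For reference, one way to check whether a migration is still running and holding its lock is to look at pg_stat_activity. A sketch of that check (the DSN is a placeholder):

```python
# Sketch of the "is the migration still running?" check via pg_stat_activity.
import psycopg2

conn = psycopg2.connect("dbname=app user=oncall")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT pid, now() - query_start AS runtime, state, query
        FROM pg_stat_activity
        WHERE query ILIKE '%alter table%'
          AND pid <> pg_backend_pid()
        """
    )
    for pid, runtime, state, query in cur.fetchall():
        print(pid, runtime, state, query)
```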

Why it actually happened

The migration itself wasn’t the problem. The real issues:

  1. No migration testing at production scale. Staging had 10,000 rows. Production had 2.3 million. A 3-second migration became a 47-second lock.

  2. No connection pool recovery. When all connections were blocked, the pool didn’t gracefully queue — it just rejected everything. And when the lock released, the pool didn’t recover without a restart.

  3. Retry storms. The frontend retried every failed request 3 times with no backoff (sketched after this list). 1,000 failed requests became 3,000, then 9,000.

  4. Friday at 5 PM. Half the team had already mentally checked out. The person who knew the database best was driving home.
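
To make the retry-storm point concrete, the client-side behavior amounted to something like this — illustrative code, not our actual frontend, and in Python rather than JavaScript:

```python
# Sketch of the failure mode: retry immediately, three times, no backoff.
# Every failure multiplies load on a database that can't answer anyone.
import requests


def fetch_user(user_id: str) -> dict:
    last_error = None
    for _ in range(1 + 3):  # original attempt plus 3 immediate retries
        try:
            resp = requests.get(f"https://api.example.com/users/{user_id}", timeout=5)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:
            last_error = err  # no sleep, no jitter: the retry storm
    raise last_error
```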

What we changed

Immediate fixes

  • Online migrations only. Since PostgreSQL 11, ALTER TABLE ADD COLUMN with a constant default no longer rewrites the table, so the lock is held only for a quick metadata change. We were on PG 14 but using a migration tool that still generated the old, table-rewriting pattern.
  • Connection pool limits. Added a queue with a 5-second timeout instead of immediate rejection.
  • Exponential backoff on retries. Frontend now waits 1s, 2s, 4s instead of hammering immediately.
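
The retry change, roughly. Names here are illustrative, and the jitter is an extra touch on top of the plain 1s/2s/4s waits:

```python
# Sketch of retry with exponential backoff and a little jitter, so clients
# don't all retry in lockstep against a struggling database.
import random
import time

import requests


def fetch_with_backoff(url: str, attempts: int = 4) -> requests.Response:
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=5)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == attempts - 1:
                raise
            # 1s, 2s, 4s ... plus jitter
            time.sleep(2 ** attempt + random.uniform(0, 0.5))
```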

Process changes

  • No deploys after 3 PM on Fridays. Actually, no deploys after 3 PM any day unless it’s a hotfix.
  • Migration dry runs. Every migration gets tested against a production-size dataset before deploying (sketched after this list).
  • Deploy buddy system. The person deploying must have someone else online who can help if things go wrong.
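
The dry-run step is the one that would have caught this outage. A sketch of the idea, with illustrative table and column names: load a scratch table to roughly production scale, then time the real migration statement against it.

```python
# Sketch of a migration dry run at production scale. Names and row count
# are placeholders; the point is to time the DDL before production sees it.
import time

import psycopg2

ROWS = 2_300_000  # roughly production scale

conn = psycopg2.connect("dbname=migration_dryrun user=dev")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE TABLE users_dryrun (id bigserial PRIMARY KEY, email text)")
    cur.execute(
        "INSERT INTO users_dryrun (email) "
        "SELECT 'user' || g || '@example.com' FROM generate_series(1, %s) g",
        (ROWS,),
    )
    start = time.monotonic()
    cur.execute("ALTER TABLE users_dryrun ADD COLUMN marketing_opt_in boolean DEFAULT false")
    print(f"migration took {time.monotonic() - start:.1f}s at {ROWS} rows")
```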

The lesson

The migration was fine. The system around it wasn’t. Every outage I’ve seen has the same pattern: a small change exposes a weakness that was always there. The Friday deploy didn’t cause the outage — the missing connection pool recovery, the aggressive retries, and the untested migration path did.

Test your failure modes, not just your happy paths.

Related: AI App Deployment Checklist · Deleted Production Database Postmortem