This is the nightmare scenario every developer fears. Someone ran a migration script against production instead of staging. The database was gone in 3 seconds. Recovery took 11 hours.
This is the story every developer fears. Someone ran DROP TABLE on production. That someone was me.
How it happened
I had two terminal tabs open. One connected to the staging database. One connected to production. They looked identical โ same prompt, same tool, same everything.
I was cleaning up test data in staging:
DROP TABLE temp_migration_backup;
Except I ran it in the wrong tab.
The table I dropped wasnโt temp_migration_backup on staging. It was temp_migration_backup on production. This table was a backup of our users table from a migration two weeks ago. โItโs just a backup table, no big deal,โ I thought.
Then I realized: our application had a fallback query that read from this backup table when the main table was unavailable. A feature added during the migration that nobody removed.
The cascade
2:14 PM โ Table dropped. No immediate impact.
2:15 PM โ A background job tries to read from the backup table. Fails. The error handlerโฆ retries. And retries. And retries.
2:16 PM โ The retry loop consumes all database connections. The main application canโt connect to the database.
2:17 PM โ Full outage.
The recovery
Option 1: Restore from backup
Our automated backups run at 3 AM. Itโs 2:17 PM. Weโd lose 11 hours of data. Not acceptable.
Option 2: Point-in-time recovery
We had WAL (Write-Ahead Log) archiving enabled. This lets you restore to any point in time.
# Stop the application
# Restore to 2:13 PM (one minute before the DROP)
pg_restore --target-time="2026-03-25 14:13:00" ...
2:34 PM โ Database restored to 2:13 PM state. Zero data loss.
2:35 PM โ Remove the fallback query that caused the cascade.
2:38 PM โ Application restarted. Everything back to normal.
Total downtime: 21 minutes. Data lost: zero.
Why it really happened
It wasnโt just โwrong terminal tab.โ The real failures:
1. No visual distinction between environments
Both terminals looked the same. The prompt was psql> in both. No color coding, no hostname display, nothing to indicate โYOU ARE ON PRODUCTION.โ
Fix:
-- In production psql config
\set PROMPT1 '๐ด PRODUCTION %n@%M:%> %/# '
-- In staging
\set PROMPT1 '๐ข STAGING %n@%M:%> %/# '
2. No access controls
Every developer had direct DROP TABLE permissions on production. Why?
Fix: Production database access through a read-only role by default. Destructive operations require a separate role with MFA and are logged.
3. Dead code in production
The fallback query was added during a migration and never removed. Dead code isnโt harmless โ itโs a landmine.
Fix: Every migration-related code change gets a follow-up ticket to remove temporary code within 2 weeks.
4. No connection pool limits on background jobs
The retry loop consumed all connections because there was no limit on how many connections background jobs could use.
Fix: Background jobs get a separate connection pool capped at 20% of total connections.
What saved us
WAL archiving. Without point-in-time recovery, weโd have lost 11 hours of data. The $50/month for WAL storage paid for itself in one incident.
The checklist we use now
Before running any destructive command:
- Check the terminal prompt โ what environment am I in?
- Run
SELECT current_database(), inet_server_addr();โ confirm the host - Start a transaction:
BEGIN;โ so you canROLLBACK;if somethingโs wrong - Run the command
- Verify the result
COMMIT;only when youโre sure
Itโs paranoid. Itโs also why we havenโt had another incident.
Related postmortems: AWS Bill Spike ยท We Deployed on a Friday ยท A Single Regex Caused 100% CPU. See also: PostgreSQL cheat sheet