๐Ÿ“š Learning Hub
ยท 3 min read

The Day We Accidentally Deleted the Production Database


This is the nightmare scenario every developer fears. Someone ran a migration script against production instead of staging. The database was gone in 3 seconds. Recovery took 11 hours.

This is the story every developer fears. Someone ran DROP TABLE on production. That someone was me.

How it happened

I had two terminal tabs open. One connected to the staging database. One connected to production. They looked identical โ€” same prompt, same tool, same everything.

I was cleaning up test data in staging:

DROP TABLE temp_migration_backup;

Except I ran it in the wrong tab.

The table I dropped wasnโ€™t temp_migration_backup on staging. It was temp_migration_backup on production. This table was a backup of our users table from a migration two weeks ago. โ€œItโ€™s just a backup table, no big deal,โ€ I thought.

Then I realized: our application had a fallback query that read from this backup table when the main table was unavailable. A feature added during the migration that nobody removed.

The cascade

2:14 PM โ€” Table dropped. No immediate impact.

2:15 PM โ€” A background job tries to read from the backup table. Fails. The error handlerโ€ฆ retries. And retries. And retries.

2:16 PM โ€” The retry loop consumes all database connections. The main application canโ€™t connect to the database.

2:17 PM โ€” Full outage.

The recovery

Option 1: Restore from backup

Our automated backups run at 3 AM. Itโ€™s 2:17 PM. Weโ€™d lose 11 hours of data. Not acceptable.

Option 2: Point-in-time recovery

We had WAL (Write-Ahead Log) archiving enabled. This lets you restore to any point in time.

# Stop the application
# Restore to 2:13 PM (one minute before the DROP)
pg_restore --target-time="2026-03-25 14:13:00" ...

2:34 PM โ€” Database restored to 2:13 PM state. Zero data loss.

2:35 PM โ€” Remove the fallback query that caused the cascade.

2:38 PM โ€” Application restarted. Everything back to normal.

Total downtime: 21 minutes. Data lost: zero.

Why it really happened

It wasnโ€™t just โ€œwrong terminal tab.โ€ The real failures:

1. No visual distinction between environments

Both terminals looked the same. The prompt was psql> in both. No color coding, no hostname display, nothing to indicate โ€œYOU ARE ON PRODUCTION.โ€

Fix:

-- In production psql config
\set PROMPT1 '๐Ÿ”ด PRODUCTION %n@%M:%> %/# '

-- In staging
\set PROMPT1 '๐ŸŸข STAGING %n@%M:%> %/# '

2. No access controls

Every developer had direct DROP TABLE permissions on production. Why?

Fix: Production database access through a read-only role by default. Destructive operations require a separate role with MFA and are logged.

3. Dead code in production

The fallback query was added during a migration and never removed. Dead code isnโ€™t harmless โ€” itโ€™s a landmine.

Fix: Every migration-related code change gets a follow-up ticket to remove temporary code within 2 weeks.

4. No connection pool limits on background jobs

The retry loop consumed all connections because there was no limit on how many connections background jobs could use.

Fix: Background jobs get a separate connection pool capped at 20% of total connections.

What saved us

WAL archiving. Without point-in-time recovery, weโ€™d have lost 11 hours of data. The $50/month for WAL storage paid for itself in one incident.

The checklist we use now

Before running any destructive command:

  1. Check the terminal prompt โ€” what environment am I in?
  2. Run SELECT current_database(), inet_server_addr(); โ€” confirm the host
  3. Start a transaction: BEGIN; โ€” so you can ROLLBACK; if somethingโ€™s wrong
  4. Run the command
  5. Verify the result
  6. COMMIT; only when youโ€™re sure

Itโ€™s paranoid. Itโ€™s also why we havenโ€™t had another incident.

Related postmortems: AWS Bill Spike ยท We Deployed on a Friday ยท A Single Regex Caused 100% CPU. See also: PostgreSQL cheat sheet