May 4, 2026 · 3 min read

The Day We Accidentally Deleted the Production Database

This is the nightmare scenario every developer fears. Someone ran a migration script against production instead of staging. The database was gone in 3 seconds. Recovery took 11 hours.

This is the story every developer fears. Someone ran DROP TABLE on production. That someone was me.

How it happened

I had two terminal tabs open. One connected to the staging database. One connected to production. They looked identical — same prompt, same tool, same everything.

I was cleaning up test data in staging:

DROP TABLE temp_migration_backup;

Except I ran it in the wrong tab.

The table I dropped wasn’t temp_migration_backup on staging. It was temp_migration_backup on production. This table was a backup of our users table from a migration two weeks ago. “It’s just a backup table, no big deal,” I thought.

Then I realized: our application had a fallback query that read from this backup table when the main table was unavailable. A feature added during the migration that nobody removed.

The cascade

2:14 PM — Table dropped. No immediate impact.

2:15 PM — A background job tries to read from the backup table. Fails. The error handler… retries. And retries. And retries.

2:16 PM — The retry loop consumes all database connections. The main application can’t connect to the database.

2:17 PM — Full outage.

The recovery

Option 1: Restore from backup

Our automated backups run at 3 AM. It’s 2:17 PM. We’d lose 11 hours of data. Not acceptable.

Option 2: Point-in-time recovery

We had WAL (Write-Ahead Log) archiving enabled. This lets you restore to any point in time.

# Stop the application
# Restore to 2:13 PM (one minute before the DROP)
pg_restore --target-time="2026-03-25 14:13:00" ...

2:34 PM — Database restored to 2:13 PM state. Zero data loss.

2:35 PM — Remove the fallback query that caused the cascade.

2:38 PM — Application restarted. Everything back to normal.

Total downtime: 21 minutes. Data lost: zero.

Why it really happened

It wasn’t just “wrong terminal tab.” The real failures:

1. No visual distinction between environments

Both terminals looked the same. The prompt was psql> in both. No color coding, no hostname display, nothing to indicate “YOU ARE ON PRODUCTION.”

Fix:

-- In production psql config
\set PROMPT1 '🔴 PRODUCTION %n@%M:%> %/# '

-- In staging
\set PROMPT1 '🟢 STAGING %n@%M:%> %/# '

2. No access controls

Every developer had direct DROP TABLE permissions on production. Why?

Fix: Production database access through a read-only role by default. Destructive operations require a separate role with MFA and are logged.

3. Dead code in production

The fallback query was added during a migration and never removed. Dead code isn’t harmless — it’s a landmine.

Fix: Every migration-related code change gets a follow-up ticket to remove temporary code within 2 weeks.

4. No connection pool limits on background jobs

The retry loop consumed all connections because there was no limit on how many connections background jobs could use.

Fix: Background jobs get a separate connection pool capped at 20% of total connections.

What saved us

WAL archiving. Without point-in-time recovery, we’d have lost 11 hours of data. The $50/month for WAL storage paid for itself in one incident.

The checklist we use now

Before running any destructive command:

Check the terminal prompt — what environment am I in?
Run SELECT current_database(), inet_server_addr(); — confirm the host
Start a transaction: BEGIN; — so you can ROLLBACK; if something’s wrong
Run the command
Verify the result
COMMIT; only when you’re sure

It’s paranoid. It’s also why we haven’t had another incident.

Related postmortems: AWS Bill Spike · We Deployed on a Friday · A Single Regex Caused 100% CPU. See also: PostgreSQL cheat sheet

The Day We Accidentally Deleted the Production Database

How it happened

The cascade

The recovery

Option 1: Restore from backup

Option 2: Point-in-time recovery

Why it really happened

1. No visual distinction between environments

2. No access controls

3. Dead code in production

4. No connection pool limits on background jobs

What saved us

The checklist we use now

📬 AI Dev Weekly

You might also like

We Deployed to Production on a Friday — Here's What Happened

How Load Balancers Actually Work — Algorithms, Health Checks, and Sticky Sessions

How Database Indexes Actually Work — B-Trees, Hash Indexes, and When to Use Them

How Docker Containers Actually Work — Namespaces, Cgroups, and Layers