One missing database index took down an entire API for 45 minutes during peak traffic. The query had been running fine for months โ until the table hit 2 million rows. Hereโs the full story.
The API had been getting slower for weeks. Nobody noticed until it fell off a cliff.
The symptoms
Monday morning, 9:14 AM. Support tickets start coming in: โThe dashboard takes 30 seconds to load.โ Then: โThe dashboard doesnโt load at all.โ
API response times:
- Last week: 200ms average
- This morning: 12,000ms average
- By 9:30 AM: timeouts everywhere
The investigation
First instinct: โDid someone deploy something?โ
No deploys since Friday. Nothing changed in the code.
Second instinct: โIs the database down?โ
PostgreSQL is running. CPU at 98%. Thatโs the problem โ but why?
The query from hell
SELECT * FROM orders
WHERE user_id = $1
AND status = 'active'
ORDER BY created_at DESC
LIMIT 20;
This query runs on every dashboard load. It was fast when we had 50,000 orders. We now have 4.2 million. Without an index on (user_id, status, created_at), PostgreSQL was doing a sequential scan โ reading every single row in the table to find 20 results.
Why it got worse suddenly
Over the weekend, a batch job imported 800,000 historical orders. That pushed the table past the tipping point where PostgreSQLโs query planner switched from an index scan (using the user_id index) to a sequential scan (it estimated the seq scan would be faster with the new data distribution โ it was wrong).
The fix
Immediate (9:45 AM)
CREATE INDEX CONCURRENTLY idx_orders_user_status_date
ON orders (user_id, status, created_at DESC);
CONCURRENTLY so it doesnโt lock the table. Takes 4 minutes on 4.2 million rows. API response times drop from 12,000ms to 15ms immediately.
The 3-hour gap
Why did it take from 9:14 to 9:45? Because we spent 30 minutes looking at the wrong things:
- Checked for deployments (none)
- Checked server resources (CPU high, but thatโs a symptom)
- Checked for network issues (none)
- Restarted the application (didnโt help โ the database was the bottleneck)
- Finally ran
pg_stat_activityand saw 200 copies of the same query running
What we should have done first
-- Show currently running queries, sorted by duration
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;
This would have shown us the problem in 10 seconds.
Prevention measures
1. Query performance monitoring
Added pg_stat_statements extension to track slow queries:
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
2. Alerting on query duration
Alert when any query averages over 500ms. We would have caught this days before the outage.
3. Index review process
Every PR that adds a new query or changes a WHERE clause gets an EXPLAIN ANALYZE review on production-size data.
4. Load testing with realistic data
Our test database had 50,000 rows. Production had 4.2 million. We now seed test databases with production-scale data.
The real lesson
The query was fine. The data grew. The index didnโt exist. Nobody was watching.
Performance doesnโt degrade gracefully โ it works fine until it doesnโt. Monitor your slow queries before they become your outage.
Related: PostgreSQL Cheat Sheet