One missing database index took down an entire API for 45 minutes during peak traffic. The query had been running fine for months — until the table passed 4 million rows. Here’s the full story.
The API had been getting slower for weeks. Nobody noticed until it fell off a cliff.
The symptoms
Monday morning, 9:14 AM. Support tickets start coming in: “The dashboard takes 30 seconds to load.” Then: “The dashboard doesn’t load at all.”
API response times:
- Last week: 200ms average
- This morning: 12,000ms average
- By 9:30 AM: timeouts everywhere
The investigation
First instinct: “Did someone deploy something?”
No deploys since Friday. Nothing changed in the code.
Second instinct: “Is the database down?”
PostgreSQL is running. CPU at 98%. That’s the problem — but why?
The query from hell
SELECT * FROM orders
WHERE user_id = $1
AND status = 'active'
ORDER BY created_at DESC
LIMIT 20;
This query runs on every dashboard load. It was fast when we had 50,000 orders. We now have 4.2 million. Without an index on (user_id, status, created_at), PostgreSQL was doing a sequential scan — reading every single row in the table to find 20 results.
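You can see this for yourself with EXPLAIN. A sketch of what the diagnosis looks like — the user id is a placeholder and the plan shown in the comments is illustrative, not the real output:

```sql
-- Ask PostgreSQL how it plans to execute the dashboard query
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE user_id = 12345          -- placeholder user id
  AND status = 'active'
ORDER BY created_at DESC
LIMIT 20;

-- Before the fix, the plan looked roughly like:
--   Limit
--     -> Sort  (Sort Key: created_at DESC)
--          -> Seq Scan on orders
--               Filter: (user_id = 12345 AND status = 'active')
-- A Seq Scan on a 4.2-million-row table is the red flag.
```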
Why it got worse suddenly
Over the weekend, a batch job imported 800,000 historical orders. That pushed the table past the tipping point where PostgreSQL’s query planner switched from an index scan (using the user_id index) to a sequential scan (it estimated the seq scan would be faster with the new data distribution — it was wrong).
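One thing that would have helped right after the import: refreshing the planner’s statistics. Row estimates come from table statistics, which go stale after a bulk load until autovacuum catches up. A hedged sketch, assuming the batch job can run this as its final step:

```sql
-- After any bulk insert, refresh statistics so the planner
-- sees the new data distribution instead of pre-import estimates
ANALYZE orders;
```

This wouldn’t replace the missing index, but accurate estimates make the planner’s index-vs-seq-scan choice less likely to flip unexpectedly after a load.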
The fix
Immediate (9:45 AM)
CREATE INDEX CONCURRENTLY idx_orders_user_status_date
ON orders (user_id, status, created_at DESC);
CONCURRENTLY so it doesn’t take a lock that blocks writes (note it can’t run inside a transaction block). Takes 4 minutes on 4.2 million rows. API response times drop from 12,000ms to 15ms immediately.
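To confirm the new index is actually being used, re-run the plan check. A sketch — again, the placeholder id and the commented plan are illustrative:

```sql
EXPLAIN
SELECT * FROM orders
WHERE user_id = 12345          -- placeholder user id
  AND status = 'active'
ORDER BY created_at DESC
LIMIT 20;

-- With the index in place, the plan should show roughly:
--   Limit
--     -> Index Scan using idx_orders_user_status_date on orders
-- No Sort node either: the index already returns rows in created_at DESC order.
```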
The 30-minute gap
Why did the fix take from 9:14 to 9:45? Because we spent 30 minutes looking at the wrong things:
- Checked for deployments (none)
- Checked server resources (CPU high, but that’s a symptom)
- Checked for network issues (none)
- Restarted the application (didn’t help — the database was the bottleneck)
- Finally queried pg_stat_activity and saw 200 copies of the same query running
What we should have done first
-- Show currently running queries, sorted by duration
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;
This would have shown us the problem in 10 seconds.
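Once you’ve found the runaway queries, you can also clear the backlog while the real fix is in flight. A hedged sketch — the LIKE pattern matching our query shape is an assumption; pg_cancel_backend interrupts just the query, which is gentler than pg_terminate_backend killing the whole connection:

```sql
-- Cancel every active copy of the offending query
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query LIKE 'SELECT * FROM orders%'
  AND pid <> pg_backend_pid();   -- don't cancel our own session
```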
Prevention measures
1. Query performance monitoring
Added the pg_stat_statements extension to track slow queries:
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
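Note that pg_stat_statements has to be loaded at server start before CREATE EXTENSION will work. A sketch of the setup, assuming you can edit postgresql.conf and restart:

```sql
-- In postgresql.conf (requires a server restart):
--   shared_preload_libraries = 'pg_stat_statements'

-- Then, once per database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```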
2. Alerting on query duration
Alert when any query averages over 500ms. We would have caught this days before the outage.
3. Index review process
Every PR that adds a new query or changes a WHERE clause gets an EXPLAIN ANALYZE review on production-size data.
4. Load testing with realistic data
Our test database had 50,000 rows. Production had 4.2 million. We now seed test databases with production-scale data.
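For seeding, generate_series gets a test database to production scale quickly. A minimal sketch with made-up values, assuming a simplified orders schema (real seeds should mirror production’s column set and value skew):

```sql
-- Insert ~4 million synthetic orders (columns simplified for illustration)
INSERT INTO orders (user_id, status, created_at)
SELECT
    (random() * 100000)::int,                                  -- spread across ~100k users
    (ARRAY['active', 'shipped', 'cancelled'])[1 + (random() * 2)::int],
    now() - (random() * interval '365 days')                   -- spread over the past year
FROM generate_series(1, 4000000);

-- Refresh statistics so test-database query plans resemble production's
ANALYZE orders;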
The real lesson
The query was fine. The data grew. The index didn’t exist. Nobody was watching.
Performance doesn’t degrade gracefully — it works fine until it doesn’t. Monitor your slow queries before they become your outage.
Related: PostgreSQL Cheat Sheet