
How a Missing Database Index Took Down Our API for 45 Minutes


One missing database index took down an entire API for 45 minutes during peak traffic. The query had been running fine for months, until the table hit 4.2 million rows. Here’s the full story.

The API had been getting slower for weeks. Nobody noticed until it fell off a cliff.

The symptoms

Monday morning, 9:14 AM. Support tickets start coming in: “The dashboard takes 30 seconds to load.” Then: “The dashboard doesn’t load at all.”

API response times:

  • Last week: 200ms average
  • This morning: 12,000ms average
  • By 9:30 AM: timeouts everywhere

The investigation

First instinct: “Did someone deploy something?”

No deploys since Friday. Nothing changed in the code.

Second instinct: “Is the database down?”

PostgreSQL is running. CPU at 98%. That’s the problem — but why?

The query from hell

SELECT * FROM orders
WHERE user_id = $1
AND status = 'active'
ORDER BY created_at DESC
LIMIT 20;

This query runs on every dashboard load. It was fast when we had 50,000 orders. We now have 4.2 million. Without an index on (user_id, status, created_at), PostgreSQL was doing a sequential scan — reading every single row in the table to find 20 results.

Why it got worse suddenly

Over the weekend, a batch job imported 800,000 historical orders. That pushed the table past the tipping point where PostgreSQL’s query planner switched from an index scan (using the user_id index) to a sequential scan (it estimated the seq scan would be faster with the new data distribution — it was wrong).
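A quick way to confirm a plan flip like this is to run EXPLAIN ANALYZE on the query itself (the user_id value here is a placeholder):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE user_id = 12345
  AND status = 'active'
ORDER BY created_at DESC
LIMIT 20;

A plan line reading “Seq Scan on orders” instead of “Index Scan using …” is the smoking gun, and the BUFFERS option shows how many pages each execution touches.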

The fix

Immediate (9:45 AM)

CREATE INDEX CONCURRENTLY idx_orders_user_status_date
ON orders (user_id, status, created_at DESC);

CONCURRENTLY so the build doesn’t block reads or writes on the table. It takes 4 minutes on 4.2 million rows. API response times drop from 12,000ms to 15ms immediately.
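One caveat worth knowing: if a CONCURRENTLY build fails partway (a deadlock, a duplicate key), PostgreSQL leaves behind an INVALID index that has to be dropped and rebuilt. A quick sanity check after the build:

SELECT indexrelid::regclass AS index_name, indisvalid
FROM pg_index
WHERE indexrelid = 'idx_orders_user_status_date'::regclass;

indisvalid should be t; if it is f, drop the index and retry the CREATE INDEX CONCURRENTLY.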

The 31-minute gap

Why did it take from 9:14 to 9:45? Because we spent 30 minutes looking at the wrong things:

  • Checked for deployments (none)
  • Checked server resources (CPU high, but that’s a symptom)
  • Checked for network issues (none)
  • Restarted the application (didn’t help — the database was the bottleneck)
  • Finally queried pg_stat_activity and saw 200 copies of the same query running

What we should have done first

-- Show currently running queries, sorted by duration
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;

This would have shown us the problem in 10 seconds.
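Once the pileup is visible, the stuck queries can also be cancelled so the CPU recovers while the real fix goes in. A sketch (the 30-second threshold and the orders filter are assumptions, not from the incident):

SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '30 seconds'
  AND query LIKE '%FROM orders%';

pg_cancel_backend sends a polite cancel request; pg_terminate_backend kills the connection outright if a query ignores the cancel.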

Prevention measures

1. Query performance monitoring

Added pg_stat_statements extension to track slow queries:

SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
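For reference, pg_stat_statements is not on by default; it has to be preloaded and then enabled per database (the exact config path and restart procedure depend on your setup):

-- postgresql.conf (requires a server restart)
shared_preload_libraries = 'pg_stat_statements'

-- then, in the target database
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Note that mean_exec_time is the column name on PostgreSQL 13 and later; on older versions it is mean_time.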

2. Alerting on query duration

Alert when any query averages over 500ms. We would have caught this days before the outage.
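The alert can be driven by the same view; a minimal check using the 500ms threshold above (the calls floor is an assumption, added to skip one-off queries):

SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE mean_exec_time > 500
  AND calls > 100
ORDER BY mean_exec_time DESC;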

3. Index review process

Every PR that adds a new query or changes a WHERE clause gets an EXPLAIN ANALYZE review on production-size data.

4. Load testing with realistic data

Our test database had 50,000 rows. Production had 4.2 million. We now seed test databases with production-scale data.
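Seeding at production scale is cheap with generate_series; a sketch, assuming the column names from the query above and a made-up status mix:

INSERT INTO orders (user_id, status, created_at)
SELECT (random() * 100000)::int,              -- spread across ~100k users
       CASE WHEN random() < 0.2 THEN 'active'
            ELSE 'completed' END,             -- assumed 20% active
       now() - random() * interval '365 days' -- orders over the past year
FROM generate_series(1, 4200000);             -- match the 4.2M production rows

Run ANALYZE orders afterwards so the planner’s statistics reflect the seeded distribution.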

The real lesson

The query was fine. The data grew. The index didn’t exist. Nobody was watching.

Performance doesn’t degrade gracefully — it works fine until it doesn’t. Monitor your slow queries before they become your outage.

Related: PostgreSQL Cheat Sheet