It was 2 AM when my phone buzzed. Our monitoring system was screaming — response times had spiked to 30+ seconds. Users were complaining on Twitter. And I had absolutely no idea what was happening.
The Night Everything Slowed Down
I pulled up my laptop, logged into the server, and my heart sank. The database CPU was maxed at 100%. Not gradually climbing — just suddenly pinned. Something was hammering our database, and it wasn’t letting go.
We had about 200 active users at that hour, so this wasn’t normal traffic. Something was wrong with our code.
The Culprit: A Simple Query
I pulled the slow query log and found it. A query that looked innocent enough:
SELECT * FROM orders WHERE user_id = ? AND status = 'pending' ORDER BY created_at DESC
That query was taking 45 seconds. For a table with 2 million rows.
Here’s what happened: A new feature had shipped that allowed users to filter their order history. Somewhere in the code, a developer added a query that loaded all orders for a user without proper indexing on the composite key.
Why It Crashed
Let me break down what was happening:
- No index on
(user_id, status, created_at) - The query was doing a full table scan
- It was running inside a loop — one query per user in a batch job
- Each query locked rows, causing a cascade of waiting connections
We had 500 batch jobs running simultaneously, each firing this unindexed query. Within minutes, we had 500 blocked connections, and the database just… stopped.
How We Fixed It (Fast)
First, we killed the batch jobs immediately. That stopped the bleeding.
Then we added an index:
CREATE INDEX idx_orders_user_status_date ON orders(user_id, status, created_at);
Query time dropped from 45 seconds to 12 milliseconds. Twelve. Milliseconds.
What I Learned
1. Indexes aren’t optional. If you’re querying by multiple columns, composite indexes matter.
2. Never run queries in a loop. Batch operations should fetch data in bulk, not fire individual queries.
3. Monitor slow queries in production. We didn’t have alerts set up for slow queries — now we do.
4. Load test new features. This query worked fine in staging with 100 rows. Production had 2 million.
The worst part? The fix took 10 minutes. The panic took 3 hours. A proper code review and load test would have caught this before it hit production.
That was the night I became obsessive about database performance. Haven’t slept well since then, honestly.
