It was 2:14 AM when my phone started buzzing. Not the normal notification buzz — the panic buzz. Our main API was returning 500 errors across the board. I stumbled out of bed, opened my laptop, and saw it: our PostgreSQL database had ground to a complete halt.
What Was Happening
The dashboard showed CPU at 100%, connections maxed out at 200, and queries timing out left and right. Our app was effectively dead. I ssh’d into the server and ran SELECT * FROM pg_stat_activity; — and there it was. One query was holding a lock on the orders table for over 3 minutes, blocking everything else.
The Culprit
Turns out, a scheduled job had kicked off at midnight to update inventory counts. The developer who wrote it thought “I’ll just add an index to make it faster” — but they ran CREATE INDEX CONCURRENTLY without checking if one already existed. The job failed halfway, left a partial index, and the next night it tried again, this time locking the entire table while it waited.
What went wrong:
- No locking strategy: The migration script didn’t check for existing indexes
- No timeout: The job had no query timeout, so it sat there waiting forever
- No alerting on locks: We didn’t even have metrics for table lock waits
How We Fixed It (At 3AM)
I killed the blocking query, cancelled the stuck job, and added a simple check to the migration script: “IF NOT EXISTS”. We also added a 30-second timeout to all background jobs and set up alerting for any query holding a lock longer than 10 seconds.
What I Learned
Never run schema changes on a production database without checking what already exists.
That night cost us 45 minutes of downtime and about 200 failed orders. But the real cost was the sleep I lost and the lessons I gained:
- Always use IF NOT EXISTS for any schema changes
- Timeouts are your friend — set them everywhere
- Monitor lock waits — this should be in your default dashboard
- Test migrations in staging first — yes, even “simple” index additions
Now whenever I write a migration, I double-check: “What happens if this runs twice? What happens if it gets stuck?” It saved us more than once after that night.
