The Time Our Database Locked Up at 2AM — What I Learned

It was 2:14 AM when my phone started buzzing. Not the normal notification buzz — the panic buzz. Our main API was returning 500 errors across the board. I stumbled out of bed, opened my laptop, and saw it: our PostgreSQL database had ground to a complete halt.

What Was Happening

The dashboard showed CPU at 100%, connections maxed out at 200, and queries timing out left and right. Our app was effectively dead. I ssh’d into the server and ran SELECT * FROM pg_stat_activity; — and there it was. One query was holding a lock on the orders table for over 3 minutes, blocking everything else.

The Culprit

Turns out, a scheduled job had kicked off at midnight to update inventory counts. The developer who wrote it thought “I’ll just add an index to make it faster” — but they ran CREATE INDEX CONCURRENTLY without checking if one already existed. The job failed halfway, left a partial index, and the next night it tried again, this time locking the entire table while it waited.

What went wrong:

No locking strategy: The migration script didn’t check for existing indexes
No timeout: The job had no query timeout, so it sat there waiting forever
No alerting on locks: We didn’t even have metrics for table lock waits

How We Fixed It (At 3AM)

I killed the blocking query, cancelled the stuck job, and added a simple check to the migration script: “IF NOT EXISTS”. We also added a 30-second timeout to all background jobs and set up alerting for any query holding a lock longer than 10 seconds.

What I Learned

Never run schema changes on a production database without checking what already exists.

That night cost us 45 minutes of downtime and about 200 failed orders. But the real cost was the sleep I lost and the lessons I gained:

Always use IF NOT EXISTS for any schema changes
Timeouts are your friend — set them everywhere
Monitor lock waits — this should be in your default dashboard
Test migrations in staging first — yes, even “simple” index additions

Now whenever I write a migration, I double-check: “What happens if this runs twice? What happens if it gets stuck?” It saved us more than once after that night.