Skip to content

Navid's Blog

Ideas, Experiments, and Lessons Learned

Menu
Menu

The Time Our Database Locked Up at 2AM — What I Learned

Posted on March 28, 2026 by Navid

It was 2:14 AM when my phone started buzzing. Not the normal notification buzz — the panic buzz. Our main API was returning 500 errors across the board. I stumbled out of bed, opened my laptop, and saw it: our PostgreSQL database had ground to a complete halt.

What Was Happening

The dashboard showed CPU at 100%, connections maxed out at 200, and queries timing out left and right. Our app was effectively dead. I ssh’d into the server and ran SELECT * FROM pg_stat_activity; — and there it was. One query was holding a lock on the orders table for over 3 minutes, blocking everything else.

The Culprit

Turns out, a scheduled job had kicked off at midnight to update inventory counts. The developer who wrote it thought “I’ll just add an index to make it faster” — but they ran CREATE INDEX CONCURRENTLY without checking if one already existed. The job failed halfway, left a partial index, and the next night it tried again, this time locking the entire table while it waited.

What went wrong:

  • No locking strategy: The migration script didn’t check for existing indexes
  • No timeout: The job had no query timeout, so it sat there waiting forever
  • No alerting on locks: We didn’t even have metrics for table lock waits

How We Fixed It (At 3AM)

I killed the blocking query, cancelled the stuck job, and added a simple check to the migration script: “IF NOT EXISTS”. We also added a 30-second timeout to all background jobs and set up alerting for any query holding a lock longer than 10 seconds.

What I Learned

Never run schema changes on a production database without checking what already exists.

That night cost us 45 minutes of downtime and about 200 failed orders. But the real cost was the sleep I lost and the lessons I gained:

  1. Always use IF NOT EXISTS for any schema changes
  2. Timeouts are your friend — set them everywhere
  3. Monitor lock waits — this should be in your default dashboard
  4. Test migrations in staging first — yes, even “simple” index additions

Now whenever I write a migration, I double-check: “What happens if this runs twice? What happens if it gets stuck?” It saved us more than once after that night.

Categories

  • AI Experiments
  • Coding
  • Debugging Stories
  • Hot Takes
  • Ideas
  • Lessons Learned
  • Project Management
  • Uncategorized
  • Vibe Coding

Recent Posts

  • How I Handled My First Production Outage (And What I Learned)
  • I Finally Fixed Our Slow Database Queries — Here’s What Actually Worked
  • I Finally Fixed Our Slow Database Queries — Here’s What Actually Worked
  • Why I Stopped Using Microservices for Small Projects
  • I Gave AI Full Access to Our Production Database. Here’s What Happened