Skip to content

Navid's Blog

Ideas, Experiments, and Lessons Learned

Menu
Menu

The PostgreSQL Lock That Crashed Our Production at 2 AM

Posted on March 12, 2026 by Navid

It was 2 AM when my phone rang. Not the kind of ring you ignore. The production database was frozen. Everything was frozen. And I was the on-call engineer.

What Happened

We had a simple migration running — adding a new column to a table with 2 million rows. Standard stuff. Or so I thought.

The migration locked the table. Since our app was still running and hitting that table constantly, requests started piling up. Memory spiked. The API started timing out. Then the whole system went down.

What I Learned

Here’s the thing they don’t teach you in tutorials: PostgreSQL migrations on large tables are dangerous. A simple ALTER TABLE can lock your entire table for minutes or even hours.

The fix was straightforward once I identified it:

  • Kill the blocking migration
  • Use pg_locks to see what’s holding what
  • Run the migration with CONCURRENTLY flag

But the real lesson was about process. We should have:

  1. Tested the migration on a production-size dataset first
  2. Scheduled it during low-traffic hours (we did, but it still went wrong)
  3. Had a rollback plan

The Fix We Implemented

After that night, I wrote a simple checklist for all future migrations:

1. Does this touch a table with >100k rows?
2. Will this lock the table?
3. Do we need to use CONCURRENTLY or a different approach?
4. Is there a way to do this in smaller batches?

We also set up a simple monitoring query that alerts us when any query holds a lock for more than 5 seconds. Saved us twice since.

Bottom Line

Database locks seem like a boring topic until your entire app goes down because of one. Now I treat every migration on a large table like it could be the thing that wakes me up at 2 AM.

It has happened once more since then. I was ready.

Categories

  • AI Experiments
  • Coding
  • Debugging Stories
  • Hot Takes
  • Ideas
  • Lessons Learned
  • Project Management
  • Uncategorized
  • Vibe Coding

Recent Posts

  • How I Handled My First Production Outage (And What I Learned)
  • I Finally Fixed Our Slow Database Queries — Here’s What Actually Worked
  • I Finally Fixed Our Slow Database Queries — Here’s What Actually Worked
  • Why I Stopped Using Microservices for Small Projects
  • I Gave AI Full Access to Our Production Database. Here’s What Happened