It was 2 AM when my phone rang. Not the kind of ring you ignore. The production database was frozen. Everything was frozen. And I was the on-call engineer.
What Happened
We had a simple migration running — adding a new column to a table with 2 million rows. Standard stuff. Or so I thought.
The migration locked the table. Since our app was still running and hitting that table constantly, requests started piling up. Memory spiked. The API started timing out. Then the whole system went down.
What I Learned
Here’s the thing they don’t teach you in tutorials: PostgreSQL migrations on large tables are dangerous. Even a simple ALTER TABLE needs an ACCESS EXCLUSIVE lock, and while the statement itself can be fast, it has to wait for every in-flight query on that table to finish first. While it waits, every new query queues up behind it, so the table is effectively frozen for minutes or even hours.
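To make that concrete, here is the shape of the migration that bit us (the table and column names here are invented for illustration):

```sql
-- Looks harmless, but ADD COLUMN takes an ACCESS EXCLUSIVE lock on the
-- whole table, both while it runs and while it waits to be granted.
ALTER TABLE orders ADD COLUMN notes text;

-- One common safeguard: cap how long the DDL may wait for its lock, so a
-- blocked migration fails fast instead of stalling all traffic behind it.
SET lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN notes text;
```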
The fix was straightforward once I identified it:
- Use pg_locks to see what’s holding what
- Kill the blocking migration
- Re-run the migration with CONCURRENTLY (in PostgreSQL this applies to the index build: CREATE INDEX CONCURRENTLY works without the long exclusive lock)
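For the record, the detective-work query looks roughly like this (a sketch using pg_blocking_pids(), available since PostgreSQL 9.6):

```sql
-- Pair each blocked session with the session(s) blocking it.
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));

-- Once you know the culprit, terminate it (substitute the real pid):
SELECT pg_terminate_backend(12345);
```

And the safe re-run for the index part of a migration, assuming an index like this hypothetical one was involved:

```sql
-- CONCURRENTLY builds the index without holding the long exclusive lock.
-- Note: it can't run inside a transaction block, and the build is slower.
CREATE INDEX CONCURRENTLY idx_orders_notes ON orders (notes);
```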
But the real lesson was about process. We should have:
- Tested the migration on a production-size dataset first
- Scheduled it during low-traffic hours (we did, but it still went wrong)
- Had a rollback plan
The Fix We Implemented
After that night, I wrote a simple checklist for all future migrations:
1. Does this touch a table with >100k rows?
2. Will this lock the table?
3. Do we need to use CONCURRENTLY or a different approach?
4. Is there a way to do this in smaller batches? (see the sketch below)
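For point 4, a batched backfill keeps each statement's locks short-lived. A minimal sketch (the table, column, and batch size are hypothetical):

```sql
-- Backfill in small chunks instead of one giant UPDATE, so row locks are
-- held only briefly. Repeat from the migration script until the statement
-- reports 0 rows updated.
UPDATE orders
SET    notes = ''
WHERE  id IN (
    SELECT id
    FROM   orders
    WHERE  notes IS NULL
    ORDER  BY id
    LIMIT  10000
);
```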
We also set up a simple monitoring query that alerts us when any query holds a lock for more than 5 seconds. Saved us twice since.
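Ours is wired into our alerting, but the core of the query is roughly this (a sketch; the 5-second threshold matches our alert):

```sql
-- Find sessions whose transaction has held a relation lock for >5 seconds.
SELECT a.pid,
       now() - a.xact_start AS lock_held_for,
       l.relation::regclass AS locked_table,
       a.query
FROM   pg_locks l
JOIN   pg_stat_activity a ON a.pid = l.pid
WHERE  l.granted
  AND  l.locktype = 'relation'
  AND  now() - a.xact_start > interval '5 seconds';
```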
Bottom Line
Database locks seem like a boring topic until your entire app goes down because of one. Now I treat every migration on a large table like it could be the thing that wakes me up at 2 AM.
It has happened once more since then. I was ready.
