It was 2 AM when my phone rang. Not the kind of ring you ignore. The production database was frozen. Everything was frozen. And I was the on-call engineer.
What Happened
We had a simple migration running — adding a new column to a table with 2 million rows. Standard stuff. Or so I thought.
The migration locked the table. Since our app was still running and hitting that table constantly, requests started piling up. Memory spiked. The API started timing out. Then the whole system went down.
What I Learned
Here’s the thing they don’t teach you in tutorials: PostgreSQL migrations on large tables are dangerous. Even a simple ALTER TABLE needs an ACCESS EXCLUSIVE lock, and while the statement itself can be fast, it has to wait for every in-flight query on that table to finish first. While it waits, every new query queues up behind it, so the table is effectively frozen for minutes or even hours.
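To make that concrete, here is the shape of the migration that bit us (the table and column names here are invented for illustration):

```sql
-- Looks harmless, but ADD COLUMN takes an ACCESS EXCLUSIVE lock on the
-- whole table, both while it runs and while it waits to be granted.
ALTER TABLE orders ADD COLUMN notes text;

-- One common safeguard: cap how long the DDL may wait for its lock, so a
-- blocked migration fails fast instead of stalling all traffic behind it.
SET lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN notes text;
```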
The fix was straightforward once I identified it:
- Use pg_locks to see what’s holding what
- Kill the blocking migration
- Re-run the migration with CONCURRENTLY (in PostgreSQL this applies to the index build: CREATE INDEX CONCURRENTLY works without the long exclusive lock)
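For the record, the detective-work query looks roughly like this (a sketch using pg_blocking_pids(), available since PostgreSQL 9.6):

```sql
-- Pair each blocked session with the session(s) blocking it.
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));

-- Once you know the culprit, terminate it (substitute the real pid):
SELECT pg_terminate_backend(12345);
```

And the safe re-run for the index part of a migration, assuming an index like this hypothetical one was involved:

```sql
-- CONCURRENTLY builds the index without holding the long exclusive lock.
-- Note: it can't run inside a transaction block, and the build is slower.
CREATE INDEX CONCURRENTLY idx_orders_notes ON orders (notes);
```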
But the real lesson was about process. We should have:
- Tested the migration on a production-size dataset first
- Scheduled it during low-traffic hours (we did, but it still went wrong)
- Had a rollback plan
The Fix We Implemented
After that night, I wrote a simple checklist for all future migrations:
1. Does this touch a table with >100k rows?
2. Will this lock the table?
3. Do we need to use CONCURRENTLY or a different approach?
4. Is there a way to do this in smaller batches? (see the sketch below)
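For point 4, a batched backfill keeps each statement's locks short-lived. A minimal sketch (the table, column, and batch size are hypothetical):

```sql
-- Backfill in small chunks instead of one giant UPDATE, so row locks are
-- held only briefly. Repeat from the migration script until the statement
-- reports 0 rows updated.
UPDATE orders
SET    notes = ''
WHERE  id IN (
    SELECT id
    FROM   orders
    WHERE  notes IS NULL
    ORDER  BY id
    LIMIT  10000
);
```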
We also set up a simple monitoring query that alerts us when any query holds a lock for more than 5 seconds. Saved us twice since.
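Ours is wired into our alerting, but the core of the query is roughly this (a sketch; the 5-second threshold matches our alert):

```sql
-- Find sessions whose transaction has held a relation lock for >5 seconds.
SELECT a.pid,
       now() - a.xact_start AS lock_held_for,
       l.relation::regclass AS locked_table,
       a.query
FROM   pg_locks l
JOIN   pg_stat_activity a ON a.pid = l.pid
WHERE  l.granted
  AND  l.locktype = 'relation'
  AND  now() - a.xact_start > interval '5 seconds';
```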
Bottom Line
Database locks seem like a boring topic until your entire app goes down because of one. Now I treat every migration on a large table like it could be the thing that wakes me up at 2 AM.
It has happened once more since then. I was ready.
