It was 2 AM when my phone rang. Not the kind of call you want to get when you’re three drinks in at a friend’s birthday dinner.
“The site is down,” our ops guy said. “Everything’s timing out.”
What Was Happening
I jumped on my laptop at the restaurant table while my friends looked at me like I’d lost it. Our PostgreSQL database was under heavy load. Connections were stacking up. The app was basically dead.
First thought: maybe we’re under attack? DDoS? Nah, traffic looked normal. Second thought: did someone deploy something bad? We hadn’t deployed in two days.
Finding the Culprit
I SSH’d into the server, checked the slow query log, and there it was — a query running for 8 seconds. Then another. Then another. All hitting the same endpoint.
The endpoint was fetching customer orders. Simple stuff. But under the hood, our ORM was doing this:
for order in orders:
    customer = Customer.query.get(order.customer_id)  # one query per order
    products = Product.get_for_order(order.id)        # another query per order
    # ...plus more queries for each order
We had 50 orders on a page. At two or three queries per order, on top of the initial fetch, that's 150+ queries for a single page load. In production, with real data and real load, that tanked us.
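The painful part is that nothing about the code looks expensive; you only see the damage if you count the queries. Here's a rough sketch of a counter using SQLAlchemy's engine event hooks (the event API is real SQLAlchemy; db.engine and the page function in the usage comment are stand-ins for whatever your app actually has):

from sqlalchemy import event

def count_queries(engine, fn):
    """Run fn() and return how many SQL statements it issued."""
    statements = []

    def before_cursor_execute(conn, cursor, statement,
                              parameters, context, executemany):
        statements.append(statement)

    event.listen(engine, "before_cursor_execute", before_cursor_execute)
    try:
        fn()
    finally:
        event.remove(engine, "before_cursor_execute", before_cursor_execute)
    return len(statements)

# Usage (illustrative names):
# count_queries(db.engine, lambda: load_orders_page(page=1))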
The Fix
We fixed it with eager loading — grabbing everything in one or two queries instead of hundreds of tiny ones:
from sqlalchemy.orm import joinedload

orders = Order.query.options(
    joinedload(Order.customer),
    joinedload(Order.products)
).all()
Response time dropped from 8 seconds to under 200ms.
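One nuance worth knowing if you reach for this yourself: joinedload folds everything into a single JOIN, and on a collection relationship like Order.products that can multiply the rows the database sends back. SQLAlchemy's selectinload, which loads the collection with one extra SELECT ... WHERE ... IN (...) query instead, is often a better fit there. A sketch under the same models as above:

from sqlalchemy.orm import joinedload, selectinload

orders = Order.query.options(
    joinedload(Order.customer),     # many-to-one: a JOIN is cheap here
    selectinload(Order.products)    # collection: loaded with one extra IN query
).all()

Either way you end up with a constant number of queries instead of one (or three) per order, which is the whole point.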
What This Taught Me
- Staging doesn’t catch this. Our staging data had 10 orders per customer; production had 500. The N+1 pattern was in the code all along, but it only hurt at scale.
- Query count matters. It’s not just about slow queries — it’s about how many queries you’re making (see the test sketch after this list).
- Always test with realistic data. If your staging is empty, you’re lying to yourself.
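These days I try to pin that lesson down in CI. A minimal sketch of a query-budget test, reusing the count_queries helper from earlier (fetch_orders_page and the budget of 5 are made up for illustration):

def test_orders_page_query_budget():
    # Fails loudly in CI if someone reintroduces an N+1,
    # instead of paging someone at 2 AM.
    n = count_queries(db.engine, lambda: fetch_orders_page(page=1))
    assert n <= 5, f"orders page issued {n} queries; expected a constant handful"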
Now whenever I write a data-fetching function, I think about how many queries it’s going to run. That one 2 AM call changed how I code forever.
