It was 2 AM when my phone started buzzing. Production server down. Database connections maxed out. Great.
What Happened
We had a simple feature — fetch user data from an external API and save it to our database. Sounds easy, right?
The code looked something like this:
```python
for user in users:
    external_data = fetch_from_api(user.id)
    save_to_db(external_data)
```
Looks fine. But we had 50,000 users. The loop ran sequentially, each iteration waiting for the external API to respond. Some requests took 5 seconds. Some took 30.
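A quick back-of-envelope calculation with the numbers above shows why this loop was doomed even before the connection problem:

```python
# Back-of-envelope: 50,000 sequential calls at 5-30 s each
# (figures from the incident above; purely illustrative)
users = 50_000
best_case = users * 5 / 3600    # hours if every call takes 5 s
worst_case = users * 30 / 3600  # hours if every call takes 30 s
print(f"{best_case:.0f} to {worst_case:.0f} hours")  # → 69 to 417 hours
```

Even in the best case, the loop needs almost three days of wall-clock time.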
The Real Problem
Here’s what I missed: every time we called the external API, we were creating a new HTTP connection and never closing it properly. On top of that, our database connection pool was configured to allow 100 connections, but our app was trying to open way more than that.
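To make the failure mode concrete, here is a toy model of a bounded pool (purely illustrative; our real pool was managed by the database driver). Acquiring without releasing exhausts it exactly the way our loop did:

```python
class ConnectionPool:
    """Toy bounded pool; a real driver's pool behaves similarly."""
    def __init__(self, max_size):
        self.max_size = max_size
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.max_size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1

    def release(self):
        self.in_use -= 1

pool = ConnectionPool(max_size=100)
try:
    # The leaky pattern: every iteration acquires, nothing ever releases.
    for _ in range(50_000):
        pool.acquire()
except RuntimeError as exc:
    print(exc)  # fails on call 101, with all 100 connections stuck open
```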
Within minutes, we hit the connection limit. New requests started queuing up. The server ran out of memory. Everything stopped.
What I Learned
- Set connection pool limits — Don’t just use defaults. Know your limits and respect them.
- Use context managers — Always use `with` statements or try/finally to close connections. Every single time.
- Add timeouts — External API calls should have timeouts. Waiting forever is not a strategy.
- Monitor connections — Set up alerts that fire when connection pool usage crosses 80% of capacity, well before you hit the hard limit.
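The context-manager point deserves emphasis. A minimal sketch, using a stand-in Connection class, of why `with` plus try/finally guarantees cleanup:

```python
from contextlib import contextmanager

class Connection:
    """Stand-in for a real HTTP or DB connection."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

@contextmanager
def open_connection():
    conn = Connection()
    try:
        yield conn
    finally:
        conn.close()  # runs even if the body raises

# The connection is closed whether the body succeeds...
with open_connection() as conn:
    pass
print(conn.closed)  # → True

# ...or blows up halfway through.
try:
    with open_connection() as conn2:
        raise ValueError("API call failed")
except ValueError:
    pass
print(conn2.closed)  # → True
```

Either way, the socket goes back where it belongs instead of accumulating until the pool is full.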
The Fix
We added connection pooling for the HTTP calls (using requests.Session), set a reasonable pool size, added timeouts, and wrapped everything in proper context managers. The whole thing took about 30 minutes to fix once we understood the problem.
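A condensed sketch of what the fix looked like. The endpoint and pool sizes here are illustrative, not our exact configuration:

```python
import requests
from requests.adapters import HTTPAdapter

def make_session(pool_size=50):
    """One Session reuses TCP connections instead of opening a new one per call.
    Pool sizes are illustrative."""
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=10, pool_maxsize=pool_size)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

def fetch_from_api(session, user_id, timeout=(3, 30)):
    # timeout=(connect, read): a slow external API can no longer hang us forever
    resp = session.get(
        f"https://api.example.com/users/{user_id}",  # hypothetical endpoint
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()

# Usage sketch: one pooled session for the whole loop, closed by the with-block.
# with make_session() as session:
#     for user in users:
#         save_to_db(fetch_from_api(session, user.id))
```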
The bug taught me something important: it’s not about writing code that works. It’s about writing code that fails gracefully when things go wrong. Because in production, things always go wrong.
