It was 2 AM when my phone rang. Our production server was down for the third time that week. I pulled up the logs, checked the metrics, and saw the same pattern: memory usage climbing steadily until the OOM killer stepped in and terminated our Node.js process.
The Symptoms
Let me paint the picture:
- Restarted fine — after each restart, everything worked for about 4-6 hours
- Memory only grew — never decreased, even under low traffic
- No obvious errors — the app ran smoothly until it suddenly didn’t
This was classic memory leak behavior. But where?
First Suspect: Event Listeners
I started with the obvious and checked whether we were adding event listeners without ever removing them. I found a few spots, but nothing significant enough to cause a leak of this size.
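If you want to run the same check, one quick way is to watch listenerCount on a long-lived emitter before and after a batch of work. This is a minimal sketch, not our actual code: bus, the 'update' event, and handleRequest are made-up names for illustration.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical long-lived emitter standing in for any shared object in the app.
const bus = new EventEmitter();

// Hypothetical request handler: subscribes and hands back its own cleanup.
function handleRequest() {
  const onUpdate = () => {};
  bus.on('update', onUpdate);
  // Returning a cleanup function makes it harder to forget the removal.
  return () => bus.off('update', onUpdate);
}

const cleanups = [];
for (let i = 0; i < 5; i++) cleanups.push(handleRequest());

const countBefore = bus.listenerCount('update'); // one listener per request
cleanups.forEach((cleanup) => cleanup());
const countAfter = bus.listenerCount('update'); // back to zero if cleanup works
```

If countAfter keeps climbing across requests instead of returning to its baseline, some code path is skipping its cleanup.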
Then I remembered something. We had recently added a new feature for real-time updates that used Server-Sent Events (SSE). Every time a client connected, we stored their connection in a shared array. When they disconnected, we removed them.
Or so we thought.
The Bug
Here’s what happened:
// Our disconnect handler (connections was a plain array)
sseConnection.on('close', () => {
  // We were doing this: find the connection by identity and splice it out
  const index = connections.indexOf(sseConnection);
  if (index > -1) connections.splice(index, 1);
});
Looks fine, right? But there’s a problem. The entry only goes away if the ‘close’ event fires and the lookup finds the connection. If the client had already closed the connection, ‘close’ might fire twice. Or worse: if the removal failed for any reason, we’d keep accumulating connections.
More importantly, we discovered that some SSE connections never triggered the ‘close’ event at all because of network issues. So they stayed in our array forever.
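A removal handler that tolerates duplicate events could look roughly like this. It is a sketch under assumptions, not the code we shipped: register is a hypothetical helper, IDs come from a simple counter, and a bare EventEmitter stands in for a real SSE connection. It leans on the fact that Map.prototype.delete does nothing for a key that is already gone.

```javascript
const { EventEmitter } = require('node:events');

// Connections keyed by id, so removal never depends on an array position.
const connections = new Map();
let nextId = 0;

// Hypothetical helper: start tracking a connection and wire up its cleanup.
function register(conn) {
  const id = ++nextId;
  connections.set(id, conn);

  // Safe to call any number of times: deleting a missing key is a no-op.
  const remove = () => connections.delete(id);

  conn.on('close', remove);
  conn.on('error', remove);
  return id;
}

// Simulated connection that reports 'close' twice.
const demoConn = new EventEmitter();
register(demoConn);
demoConn.emit('close');
demoConn.emit('close');
const remainingAfterDemo = connections.size;
```

This only covers duplicate events. A connection that never emits anything still needs a timeout or heartbeat to be noticed.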
How We Fixed It
We switched to a Map keyed by connection ID (a WeakMap cannot be iterated, so the periodic sweep needs a regular Map) and added a heartbeat mechanism:
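// Better approach with cleanup
const connections = new Map(); // connection id -> connection

// Heartbeat: every 30 seconds, close and drop connections flagged as dead.
// isDead is assumed to be set elsewhere, e.g. when a heartbeat write fails.
setInterval(() => {
  connections.forEach((conn, id) => {
    if (conn.isDead) {
      conn.close();
      connections.delete(id); // deleting during forEach is safe for a Map
    }
  });
}, 30000);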
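The snippet above depends on an isDead flag without showing where it gets set. One plausible way, sketched here under assumptions, is to write an SSE comment line (a line starting with a colon, which clients ignore) to each connection on every tick and flag the ones where the write throws or the stream is already destroyed. track and sweep are hypothetical names, and res is assumed to be anything with write(), end(), and a destroyed property, such as an http.ServerResponse.

```javascript
// Map of connection id -> { res, isDead, close }
const connections = new Map();

// Hypothetical helper: wrap a response-like object and start tracking it.
function track(id, res) {
  connections.set(id, {
    res,
    isDead: false,
    close() {
      if (!res.destroyed) res.end();
    },
  });
}

// One heartbeat tick: drop connections known to be dead, ping the rest.
function sweep() {
  for (const [id, conn] of connections) {
    if (conn.isDead || conn.res.destroyed) {
      conn.close();
      connections.delete(id);
      continue;
    }
    try {
      conn.res.write(': ping\n\n'); // SSE comment line used as a keep-alive
    } catch (err) {
      conn.isDead = true; // removed on the next tick
    }
  }
}

// In a server this runs on a timer; unref() keeps it from blocking process exit.
const heartbeat = setInterval(sweep, 30000);
heartbeat.unref();
```

A half-open TCP connection may keep accepting writes for a while before the socket errors out, so detection can lag by a few intervals. Treat the interval as a tuning knob rather than a guarantee.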
We also wrapped the connection removal logic in proper error handling and set up monitoring to track the connection count over time.
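The monitoring side can be as small as a periodic snapshot of the connection count and process memory. The following is a rough sketch rather than our real setup: the 500 MB limit, the one-minute interval, and the snapshot name are placeholders, and console output stands in for whatever metrics or alerting system you use.

```javascript
// Stand-in for the shared connection Map from the fix above.
const connections = new Map();

// Hypothetical alert threshold; tune it to your own baseline.
const RSS_LIMIT_BYTES = 500 * 1024 * 1024;

// Collect the numbers worth graphing over time.
function snapshot() {
  const { rss, heapUsed } = process.memoryUsage();
  return {
    connections: connections.size,
    rssMb: Math.round(rss / (1024 * 1024)),
    heapUsedMb: Math.round(heapUsed / (1024 * 1024)),
    overLimit: rss > RSS_LIMIT_BYTES,
  };
}

// Report once a minute; unref() keeps the timer from blocking process exit.
const statsTimer = setInterval(() => {
  const stats = snapshot();
  if (stats.overLimit) {
    console.warn('memory above threshold', stats);
  } else {
    console.log('stats', stats);
  }
}, 60000);
statsTimer.unref();
```

A connection count that only ever goes up is a much earlier warning sign than the memory graph, because it shows the leak hours before the process gets killed.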
What I Learned
- Always assume connections can fail silently — network issues happen, and your cleanup code might never run
- Monitor your memory — we now have alerts when memory grows beyond a threshold
- Weak references exist for a reason — they’re not just academic
- Reproduce locally first — I could have caught this with a simple load test
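To show how little it takes to reproduce this kind of leak locally, here is a small simulation of the original array-based handler. It is an illustration under assumptions, not our actual load test: simulate and its parameters are made up, bare EventEmitters stand in for SSE connections, and every tenth connection is assumed to drop without ever emitting 'close'.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical reproduction: open `total` fake connections, where every
// `silentEvery`-th one disappears without emitting 'close'.
function simulate(total, silentEvery) {
  const connections = [];
  for (let i = 0; i < total; i++) {
    const conn = new EventEmitter();
    connections.push(conn);

    // Same removal logic as the original handler.
    conn.on('close', () => {
      const index = connections.indexOf(conn);
      if (index > -1) connections.splice(index, 1);
    });

    // Well-behaved clients close cleanly; the silent ones never do.
    if (i % silentEvery !== 0) conn.emit('close');
  }
  return connections.length; // whatever is left has leaked
}

const leaked = simulate(1000, 10);
console.log(`leaked connections: ${leaked}`);
```

Running this against the cleanup code before and after a fix gives a quick signal of whether silent disconnects are still piling up.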
Aftermath
After deploying the fix, our server has been stable for 3 months. Memory stays flat around 200MB even under load. No more 2 AM calls.
The lesson? It’s often the simplest code — like removing items from an array — that hides the nastiest bugs. Always test your connection cleanup code under failure conditions, not just the happy path.
