How I Found the Memory Leak That Was Killing Our Production Server

It was 2 AM when my phone rang. Our production server was down for the third time that week. I pulled up the logs, checked the metrics, and saw the same pattern: memory usage climbing steadily until the OOM killer stepped in and terminated our Node.js process.

The Symptoms

Let me paint the picture:

Restarted fine — after each restart, everything worked for about 4-6 hours
Memory only grew — never decreased, even under low traffic
No obvious errors — the app ran smoothly until it suddenly didn’t

This was classic memory leak behavior. But where?

First Suspect: Event Listeners

I started with the obvious and checked whether we were adding event listeners without ever removing them. I found a few spots, but nothing big enough to explain a leak of this size.
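A minimal sketch of that kind of listener audit, assuming a plain Node.js EventEmitter. The helper name findCrowdedEvents and the limit of 10 are made up for illustration; eventNames() and listenerCount() are standard EventEmitter methods.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical helper: list events that have more listeners than `limit`.
function findCrowdedEvents(emitter, limit = 10) {
  return emitter
    .eventNames()
    .map((name) => ({ name, count: emitter.listenerCount(name) }))
    .filter((entry) => entry.count > limit);
}

// Demo: simulate a handler that is added 25 times and never removed.
const bus = new EventEmitter();
bus.setMaxListeners(0); // 0 = unlimited, silences the built-in warning for this demo
for (let i = 0; i < 25; i++) {
  bus.on('update', () => {});
}
bus.on('shutdown', () => {});

console.log(findCrowdedEvents(bus)); // [ { name: 'update', count: 25 } ]
```

Running something like this against long-lived emitters at intervals shows whether any event keeps collecting handlers.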

Then I remembered something. We had recently added a new feature for real-time updates that used Server-Sent Events (SSE). Every time a client connected, we stored their connection in an array. When they disconnected, we removed them.

Or so we thought.

The Bug

Here’s what happened:

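// Our original disconnect handler (connections was a plain array)
sseConnection.on('close', () => {
  // Removal only happened here, so it depended on 'close' actually firing
  const index = connections.indexOf(sseConnection);
  if (index > -1) connections.splice(index, 1);
});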

Looks fine, right? But there’s a problem. If the connection already closed from the client side, the ‘close’ event might fire twice. Or worse — if the removal failed for any reason, we’d keep accumulating connections.

More importantly, we discovered that some SSE connections never triggered the close event at all because of network issues. So they stayed in our connections array forever.
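To make the failure mode concrete, here is a small simulation using EventEmitter objects as stand-ins for real connections. The track helper and the numeric IDs are hypothetical. It shows that keying by ID makes removal safe to call twice, and that a connection which never emits anything still stays tracked, so event handlers alone are not enough.

```javascript
const { EventEmitter } = require('node:events');

const connections = new Map(); // id -> connection
let nextId = 0;

// Hypothetical helper: register a connection and remove it on 'close' or 'error'.
function track(conn) {
  const id = ++nextId;
  connections.set(id, conn);
  const remove = () => connections.delete(id); // harmless if already removed
  conn.once('close', remove);
  conn.once('error', remove);
  return id;
}

// Three fake connections standing in for SSE clients.
const a = new EventEmitter();
const b = new EventEmitter();
const c = new EventEmitter();
track(a);
track(b);
track(c);

a.emit('close');
a.emit('close'); // a duplicate 'close' is a no-op
b.emit('error', new Error('socket hang up'));
// c vanishes silently: no 'close', no 'error'

console.log(connections.size); // 1 (the silent connection is still tracked)
```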

How We Fixed It

We switched to a Map keyed by connection ID and added a heartbeat mechanism:

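// Better approach with cleanup: a Map keyed by connection ID
const connections = new Map();

// Sweep every 30 seconds and drop connections flagged as dead.
// isDead is assumed to be set elsewhere, e.g. when a heartbeat write fails.
setInterval(() => {
  connections.forEach((conn, id) => {
    if (conn.isDead) {
      conn.close();
      connections.delete(id); // deleting during Map.forEach is safe
    }
  });
}, 30000);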

We also added proper error handling around the connection removal logic and added monitoring to track the connection count over time.
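For a fuller picture, here is one way a heartbeat pass could decide that a connection is dead. This is a sketch under assumptions, not our production code: the registry shape ({ res, lastOk }), the sweep function, and the 60-second silence limit are all hypothetical. It assumes res behaves like Node's http.ServerResponse (destroyed, writableEnded, write(), end()), and the demo uses fake objects so it runs without a server.

```javascript
// Hypothetical registry: id -> { res, lastOk }
const connections = new Map();

// One heartbeat pass: drop dead or silent connections, ping the rest.
function sweep(registry, now = Date.now(), maxSilenceMs = 60_000) {
  let removed = 0;
  for (const [id, conn] of registry) {
    const gone = conn.res.destroyed || conn.res.writableEnded;
    const stale = now - conn.lastOk > maxSilenceMs;
    if (gone || stale) {
      if (!gone) conn.res.end();
      registry.delete(id); // deleting while iterating a Map is safe
      removed++;
      continue;
    }
    // An SSE comment line: clients ignore it, but writing gives the OS
    // a chance to notice a dead peer and destroy the socket.
    conn.res.write(': ping\n\n', (err) => {
      if (!err) conn.lastOk = Date.now();
    });
  }
  return removed;
}

// Fake response object for the demo (stands in for http.ServerResponse).
function fakeRes(overrides = {}) {
  return {
    destroyed: false,
    writableEnded: false,
    writes: [],
    write(chunk, cb) { this.writes.push(chunk); if (cb) cb(); return true; },
    end() { this.writableEnded = true; },
    ...overrides,
  };
}

const now = 1_000_000;
connections.set('alive', { res: fakeRes(), lastOk: now - 1_000 });
connections.set('destroyed', { res: fakeRes({ destroyed: true }), lastOk: now - 1_000 });
connections.set('silent', { res: fakeRes(), lastOk: now - 120_000 });

const removed = sweep(connections, now);
console.log(removed, [...connections.keys()]); // 2 [ 'alive' ]
```

Scheduling it with setInterval(() => sweep(connections), 30000) would match the 30-second interval above, and logging connections.size after each pass gives the connection count to monitor.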

What I Learned

Always assume connections can fail silently — network issues happen, and your cleanup code might never run
Monitor your memory — we now have alerts when memory grows beyond a threshold
Weak references exist for a reason: they’re not just academic, although a WeakMap can’t be iterated, so a periodic sweep like ours still needs a regular Map
Reproduce locally first — I could have caught this with a simple load test
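As an illustration of the monitoring point, here is a minimal sketch of a memory check built on process.memoryUsage(), which is a standard Node.js API. The helper names checkMemory and onlyGrows, the 512 MB limit, and the sample numbers are invented for the example; a real setup would feed these values into whatever alerting you already use.

```javascript
const MB = 1024 * 1024;

// Hypothetical helper: summarize memory usage and flag when RSS exceeds a limit.
// `usage` defaults to live numbers; pass a fake object to exercise the logic.
function checkMemory(limitMb, usage = process.memoryUsage()) {
  const rssMb = Math.round(usage.rss / MB);
  const heapUsedMb = Math.round(usage.heapUsed / MB);
  return { rssMb, heapUsedMb, overLimit: rssMb > limitMb };
}

// Hypothetical helper: true when every sample is larger than the one before,
// which is the "memory only grew" pattern described earlier.
function onlyGrows(samples) {
  return samples.length > 1 && samples.every((v, i) => i === 0 || v > samples[i - 1]);
}

console.log(checkMemory(512)); // live numbers for the current process
console.log(onlyGrows([210, 260, 330, 415])); // true
console.log(onlyGrows([210, 260, 240, 250])); // false
```

Sampling during a local load test and checking onlyGrows on the results is a cheap way to spot a leak before it reaches production.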

Aftermath

After deploying the fix, our server has been stable for 3 months. Memory stays flat around 200MB even under load. No more 2 AM calls.

The lesson? It’s often the simplest code — like removing items from an array — that hides the nastiest bugs. Always test your connection cleanup code under failure conditions, not just the happy path.