Memory Usage and disparity between RSS and Heap size

1553 views
Skip to first unread message

Steve Freegard

unread,
Nov 22, 2013, 6:37:04 AM11/22/13
to nod...@googlegroups.com
Hi,

I'm trying to debug and understand an issue on several of my Haraka servers after a recent update.

I'm seeing large amounts of memory being used and never returned back to the operating system causing the machine to swap and eventually the Linux OOM killer shooting the processes.

I thought that there must be a memory leak somewhere - but I've used node-heapdump and taken multiple snapshots and compared them and whilst I'm seeing a small growth I'm not seeing anything major.    However the heapdump reports the size at ~97Mb whereas the RSS of the process at the time of the dump was 1.1Gb which is what I really don't understand at all.

Would such a disparity between heap and RSS indicate that an external C++ module that is being used is leaking memory??  If so, what is the best way of finding out which?

I've tried the mtrace module - but it caused the process to completely hang after only a few minutes of running so I had to kill it.   The output that I got from it is attached - I'm not really sure it tells me anything though.

Any suggestions as to how to work out what is going on here and how to fix it would be gratefully received.

Kind regards,
Steve.
mtrace.txt

Ben Noordhuis

unread,
Nov 22, 2013, 11:25:49 AM11/22/13
to nod...@googlegroups.com
Tracking down memory leaks in garbage-collected environments is one of
the great unsolved problems of our generation. Okay, maybe not really
but there is no one-size-fits-all solution. Here are some things you
can try:

* The mtrace log suggests you're using node.js v0.8. Consider upgrading.

* Use Valgrind. Drawback: it's really, really slow and it uses a lot
of memory itself.

* SystemTap. Write a small stap script that collects malloc/free/etc.
call graphs and aggregates them into histograms. Strong disparities
between calls to malloc() and free() generally show up quickly that
way.

Apropos that heap snapshot, did you look at the self size or the
retained size? Buffer objects tend to have small self sizes (usually
on the order of a few hundred bytes) but can have massive retained
sizes. Note that Buffers in snapshots taken with node.js <= v0.8.11
don't report the true retained size.

Steve Freegard

unread,
Nov 22, 2013, 12:27:19 PM11/22/13
to nod...@googlegroups.com, Ben Noordhuis
Hi Ben,

I managed to find the issue in Haraka - we were calling
fs.unwatchFile/fs.watchFile instead of consulting our configuration
cache which we should have been doing.

However - this has pointed us to a rather large memory leak in the
fs.unwatchFile and fs.watchFile functions. I've tested it on both 0.8
and 0.10. This is on 0.10:

[root@mta41 ~]# node --expose_gc watchfile_leak.js
{ rss: 8478720, heapTotal: 6163968, heapUsed: 1918824 }
{ rss: 827813888, heapTotal: 13387520, heapUsed: 4990520 }

Testcase is attached.

Kind regards,
Steve.
watchfile_leak_testcase.js

Timothy J Fontaine

unread,
Nov 22, 2013, 2:58:36 PM11/22/13
to nod...@googlegroups.com
For what it's worth, I just went through an arduous process of tracking down a memory leak for walmart, you can read about it http://www.joyent.com/blog/walmart-node-js-memory-leak you could use the techniques described there to help narrow down your issues. There are other pieces of information available at http://www.joyent.com/developers/node/debug#mdb

I will try and do a quick analysis for you, but if you have issues you can hit me and others up on irc #node.js on freenode


--
--
Job Board: http://jobs.nodejs.org/
Posting guidelines: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google
Groups "nodejs" group.
To post to this group, send email to nod...@googlegroups.com
To unsubscribe from this group, send email to
nodejs+un...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/nodejs?hl=en?hl=en

---
You received this message because you are subscribed to the Google Groups "nodejs" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nodejs+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Timothy J Fontaine

unread,
Nov 22, 2013, 4:14:52 PM11/22/13
to nod...@googlegroups.com

so you're creating a bunch of stat watchers, those require gc to be cleaned up, but they're still too young by the time you run gc outside the loop, if instead you move the gc into the loop or call it multiple times after the loop is done it can effectively clean them up.

Matt

unread,
Nov 22, 2013, 4:53:20 PM11/22/13
to nod...@googlegroups.com

On Fri, Nov 22, 2013 at 4:14 PM, Timothy J Fontaine <tjfon...@gmail.com> wrote:
so you're creating a bunch of stat watchers, those require gc to be cleaned up, but they're still too young by the time you run gc outside the loop, if instead you move the gc into the loop or call it multiple times after the loop is done it can effectively clean them up.

Doesn't necessarily explain the problem in Haraka though (albeit that was with the deprecated watchFile/unwatchFile API), which was running long term and leaking. Not saying it couldn't be some other cause in Haraka though, and we've solved that issue now, but I think this could probably use some more thorough testing.

Matt.

Steve Freegard

unread,
Nov 22, 2013, 5:00:18 PM11/22/13
to nod...@googlegroups.com
On 22/11/13 21:14, Timothy J Fontaine wrote:
>
> so you're creating a bunch of stat watchers, those require gc to be
> cleaned up, but they're still too young by the time you run gc outside
> the loop, if instead you move the gc into the loop or call it multiple
> times after the loop is done it can effectively clean them up.
>
>

Ok - I tried that and moved the gc() call to the end of the loop and for
good measure added a couple outside the loop before outputting the
memory usage. I had to reduce the number of loops to 10,000 so that it
didn't take forever to complete; but it still shows a leak:

[root@mta41 ~]# node --expose_gc watchfile_leak.js
{ rss: 8478720, heapTotal: 6163968, heapUsed: 1919048 }
{ rss: 17838080, heapTotal: 20844032, heapUsed: 1928208 }


I also switched from fs.watchFile to fs.watch and I see a leak with that
too:

[root@mta41 ~]# node --expose_gc watchfile_leak.js
{ rss: 8478720, heapTotal: 6163968, heapUsed: 1918824 }
{ rss: 18067456, heapTotal: 18647040, heapUsed: 8692304 }

Code for this:

var fs = require('fs');
gc();
console.log(process.memoryUsage());
var name = '/etc/passwd';
for (var i=0; i<10000; i++) {
var foo = fs.watch(name, function () {});
foo.close();
gc();
}
gc();
console.log(process.memoryUsage());

This is on Linux 2.6.18 x86_64 running Node 0.10.22.

Kind regards,
Steve.

Ben Noordhuis

unread,
Nov 22, 2013, 7:40:53 PM11/22/13
to nod...@googlegroups.com
That doesn't necessarily imply a memory leak, the garbage collector
may just be growing the heap. What happens when you set
--max_old_space_size=128?

Steve Freegard

unread,
Nov 22, 2013, 7:54:18 PM11/22/13
to nod...@googlegroups.com
Hi Ben,

On 23/11/13 00:40, Ben Noordhuis wrote:
> That doesn't necessarily imply a memory leak, the garbage collector
> may just be growing the heap. What happens when you set
> --max_old_space_size=128?

Same result (this is with fs.watch):

[root@mta41 ~]# node --expose_gc --max_old_space_size=128 watchfile_leak.js
{ rss: 8499200, heapTotal: 6163968, heapUsed: 1918832 }
{ rss: 18079744, heapTotal: 18647040, heapUsed: 8688392 }

I tried it again with an 8M old space size with the same result:

[root@mta41 ~]# node --expose_gc --max_old_space_size=8 watchfile_leak.js
{ rss: 8491008, heapTotal: 6163968, heapUsed: 1918832 }
{ rss: 18071552, heapTotal: 18647040, heapUsed: 8688392 }

Regards,
Steve.

Ben Noordhuis

unread,
Nov 22, 2013, 8:02:23 PM11/22/13
to nod...@googlegroups.com
Right. I see that your test case creates 1M file watchers in a loop.
Those won't be released right away and calling gc() won't change that.
Disposing a file watcher takes two event loop ticks or more but in
your test case, everything is done in a single tick.

If you smear out the loop iterations with setImmediate(), you should
see more consistent memory usage.

Steve Freegard

unread,
Nov 22, 2013, 8:31:14 PM11/22/13
to nod...@googlegroups.com
On 23/11/13 01:02, Ben Noordhuis wrote:
> Right. I see that your test case creates 1M file watchers in a loop.
> Those won't be released right away and calling gc() won't change that.
> Disposing a file watcher takes two event loop ticks or more but in
> your test case, everything is done in a single tick. If you smear out
> the loop iterations with setImmediate(), you should see more
> consistent memory usage.

Ok - I tried that. I also added a gc() call after the loop, followed by
a setTimeout() of 60 seconds with a further call to gc() and then set an
process.on('exit') handler to output process.memoryUsage() so I would
only get the output once all of the watchers were closed.

I'm still seeing memory leaking:

using fs.watch:

[root@mta41 ~]# node --expose_gc --max_old_space_size=8 watchfile_leak.js
{ rss: 8491008, heapTotal: 6163968, heapUsed: 1919544 }
{ rss: 13017088, heapTotal: 13487360, heapUsed: 1966256 }

using fs.watchFile:

[root@mta41 ~]# node --expose_gc --max_old_space_size=8 watchfile_leak.js
{ rss: 8495104, heapTotal: 6163968, heapUsed: 1919632 }
{ rss: 13275136, heapTotal: 21875968, heapUsed: 1974984 }

This is with 10,000 watchers being added and immediately removed, so the
process would not exit until the watchers have been terminated.

Based on what we saw in Haraka - there is absolutely no doubt in my mind
that there is something amiss here. We were incorrectly calling
fs.unwatchFile() followed by fs.watchFile() thousands of times per
connection and the memory usage grew massively over a short space of
time. The moment I changed the code to avoid the calls to
fs.watchFile/unwatchFile we went from having an RSS of > 1Gb in about an
hour to around 100Mb.

Cheers,
Steve.

Ben Noordhuis

unread,
Nov 23, 2013, 10:22:02 AM11/23/13
to nod...@googlegroups.com
I can't really reproduce that. Below is what IMO the test case
reasonably should look like and that has RSS hovering around 60 MB
with --max_old_space_size=64. Tested with HEAD of v0.10 and master on
a x86_64 FC20 system.

var fs = require('fs');
var n = 0;
f();
function f() {
if (++n % 1e4 === 0) stats();
fs.unwatchFile('test');
fs.watchFile('test', function() {});
fs.unwatchFile('test');
if (n < 1e6) setImmediate(f);
}
function stats() {
var m = process.memoryUsage();
m.rss >>>= 10, m.heapUsed >>>= 10, m.heapTotal >>>= 10;
console.log(m);
}

Steve Freegard

unread,
Nov 25, 2013, 12:39:00 PM11/25/13
to nod...@googlegroups.com
On 23/11/13 15:22, Ben Noordhuis wrote:
> I can't really reproduce that. Below is what IMO the test case
> reasonably should look like and that has RSS hovering around 60 MB
> with --max_old_space_size=64. Tested with HEAD of v0.10 and master on
> a x86_64 FC20 system.
>
>

I tried this on both 0.8 (with process.nextTick instead of setImmediate)
and 0.10 and it does not show a leak or excessive memory usage.

However - this would have been quite different to what was happening in
Haraka with the bug that we had there; we would have been calling
fs.unwatchFile(name), fs.watchFile(name) per line of data or per buffer
which could have been potentially thousands of watchers per second in
the same stack. I was also running 0.8 on all of the machines that were
exhibiting this problem.

What I think happened here is that the fs.watch/unwatchFile functions
saturated the thread pool to the point that all the threads were busy
and it couldn't keep up which caused the RSS to grow with all of the
in-flight requests.

Regards,
Steve.

Ben Noordhuis

unread,
Nov 25, 2013, 3:55:43 PM11/25/13
to nod...@googlegroups.com
Right, that sounds plausible. The thread pool can be a bottleneck and
I'm afraid there is currently no real fix for that. In v0.10, you can
scale it up with the UV_THREADPOOL_SIZE environment variable (clamped
to 1-128, default 4.)

Making it scale up and down automatically is the subject of ongoing
research but it's surprisingly hard to get right. Not in the sense of
implementing it, that's trivial, but doing so without introducing
performance regressions however is not. :-/
Reply all
Reply to author
Forward
0 new messages