As some of you may have experienced, god has a slow memory leak. Since
my informal techniques have so far been fruitless, it's time to bust
out the scientific method. I will use this thread to document the
process and hopefully we can solve this problem.
Let's begin. I ask "Why is god leaking?"
Looking at /proc/<pid>/status for a god process (ruby 1.8.5
(2006-12-25 patchlevel 12) [x86_64-linux]) that is obviously leaking,
I see the following:
The stack size is small and the data size is large, which means that
the leak is on the heap.
Ruby can leak to the heap in two ways. By not garbage collecting from
the Ruby heap or by leaking memory to the C heap. I always assume it
is my own code that is broken, so my hypothesis is that I am somehow
causing Ruby heap objects to stick around and not be GC'ed. To test
this hypothesis I will use the excellent BleakHouse from Evan Weaver.
It produces a dump of the Ruby heap on demand and can then analyze the
frame dumps to calculate how many objects are never collected.
The test is against god 0.7.3 on ruby 1.8.5 (2006-12-25 patchlevel 12)
[i686-darwin8.9.1]. The config file contains a single watch with
interval = 0 (meaning it will run the condition continuously) that
runs a process_running condition. Heap snapshots are taken at the
beginning and end of the driver loop.
After several seconds, god holds steady at around 9150kb for ten
minutes or so. Then it starts leaking. When I finally stop god about
20 minutes later, it is up to around 9500kb. This test run produces a
20GB! dump file. An overnight analysis gives me this:
-----------------------------------
88733 full frames. Removing 15 frames from each end of the run to
account for
startup overhead and GC lag.
5220649 total births, 5220403 total deaths, 611 uncollected objects.
Tags sorted by persistent uncollected objects. These objects did not
exist at
startup, were instantiated by the associated tags, and were never
garbage
collected:
end-driver leaked (over 44350 requests):
178 String
146 Array
111 Time
4 MatchData
3 God::DriverEvent
1 IO
1 Float
1 God::System::Process
1 File
1 Process::Status
begin-driver leaked (over 44355 requests):
29 String
3 Array
1 NoMethodError
-----------------------------------
The analysis shows that no unexpected Ruby heap leaks are occurring.
God is set to hold on to the last 100 log lines (which are stored as
[text, time]) accounting for 100 Array objects, 100 String objects,
and 100 Time objects. The remaining objects represent the working set
that god uses during a normal driver cycle and are within my
expectations.
I conclude that under this setup, uncollectable objects do not account
for the memory leak. This leaves C heap leaks as the likely culprit.
My next experiment will formalize this hypothesis.
I know that the memory leak isn't caused by leaking Ruby objects and I
know that a basic looping Ruby program does not leak memory. There
must, then, be some part of the code that causes Ruby to leak memory
internally. Since the memory leak seems to be related to how
frequently the driver loop executes, I hypothesize that most or all of
the memory leak is contained within the driver loop handlers.
Experiment B
In this experiment I will disable the handler part of the driver loop,
instead rescheduling the condition immediately. The code for this test
can be found at
For these memory leak tests, I will consider god NOT leaking if it
does not vary significantly from the startup memory usage after 10,000
seconds. Here are the test results:
This is running under ruby 1.8.5 (2006-12-25 patchlevel 12) [i686-
darwin8.9.1].
I conclude that the memory leak is caused by code within the driver
handlers. Some Ruby call or calls must be leaking memory within that
code. Isolating the offending code should hopefully allow me to work
around it.
Over the 17 hour test, memory increased by 1.1 MB. This represents
perhaps 1,000,000 times through the driver loop (the test used 100% of
both cores on my MBP).
I'm not sure what to make of this data right now. It appears to be
leaking, albeit very very slowly. I'm also not sure why initial memory
usage is higher. More investigation needs to go into 1.8.6 behavior,
but for the moment I will concentrate on 1.8.5 as it shows no memory
leak for this test which should make it easier to isolate the leaking
code.
Rubinius is almost the point where it can run god. I will keep an eye
on their progress and once a test run can be done against Rubinius I
will do an experiment under that environment.
Experiment B showed that god does not leak memory under 1.8.5 if the
main processing logic is removed. This is for a single Watch. It's
possible that thread interactions will cause the leak to manifest. I
hypothesize that configuring 10 watches (one driver thread per watch)
will also NOT leak. I will perform a 10,000 second experiment to test
this.
So it looks like a slow leak. Memory usage after 10,000 seconds (2.78
hours) was 188kb higher than initially. While I'd say this is a real
leak, it would only count as a tiny portion of the problem and so I
will shelve it for later inspection. If the main leak is not caused by
threaded interactions at this level, then it is time to delve into the
handler code and see if I can isolate the memory leaking code.
This experiment will act as the control, showing the memory leak in
god 0.7.2 (at full speed). A 10,000 second test with one Watch will be
performed if my machine can hold on that long.
Intuition tells me that the logging subsystem is a likely candidate
for memory leaks. There's a lot of data being pushed around in there.
I hypothesize that this will at least reduce the memory leak.
Experiment E
I will disable all logging from within the handlers.
Further testing shows that it's the condition.base_name part of the
line that is responsible. Here's the code:
def base_name
self.class.name.split('::').last
end
Amazingly, this seemingly benign method rings a very familiar bell.
Another developer at Powerset had identified a piece of code in his
app that looked very much like this and resulted in a HUGE memory leak
as well. The odd but functional fix that we identified was to create a
local scope by declaring a local variable. Here's the diff that I will
try for this test run:
def base_name
+ x = 1
self.class.name.split('::').last
end
I can't tell you how glad I am to have finally solved this problem.
The fun isn't over yet, though. I'll now confirm the fix for 10
watches and on 1.8.6. Stay tuned!
Holy crap, It would interesting to figure out how the F defining a local variable prevents the leak... It almost makes me more frustrated that this is the fix :)
> Further testing shows that it's the condition.base_name part of the > line that is responsible. Here's the code:
> def base_name > self.class.name.split('::').last > end
> Amazingly, this seemingly benign method rings a very familiar bell. > Another developer at Powerset had identified a piece of code in his > app that looked very much like this and resulted in a HUGE memory leak > as well. The odd but functional fix that we identified was to create a > local scope by declaring a local variable. Here's the diff that I will > try for this test run:
> def base_name > + x = 1 > self.class.name.split('::').last > end
> I can't tell you how glad I am to have finally solved this problem. > The fun isn't over yet, though. I'll now confirm the fix for 10 > watches and on 1.8.6. Stay tuned!
> Further testing shows that it's the condition.base_name part of the > line that is responsible. Here's the code:
> def base_name > self.class.name.split('::').last > end
Well done for isolating this; a triumph of empirical legwork.
> Amazingly, this seemingly benign method rings a very familiar bell. > Another developer at Powerset had identified a piece of code in his > app that looked very much like this and resulted in a HUGE memory leak > as well. The odd but functional fix that we identified was to create a > local scope by declaring a local variable. Here's the diff that I will > try for this test run:
> def base_name > + x = 1 > self.class.name.split('::').last > end
Odd indeed. This looks like a bug in Ruby. Would Ruby Talk be the best place to raise this in the hope of solving the problem at source?
> I can't tell you how glad I am to have finally solved this problem. > The fun isn't over yet, though. I'll now confirm the fix for 10 > watches and on 1.8.6. Stay tuned!
Well done!
Here's a test case that demonstrates what is surely a bug. If you want, you can post on ruby-talk or someplace more efficient to get some attention and a fix. Make sure you post the test case.
module Play class Toy def peek_at_proc puts `ps u -p #{Process.pid}` end
def base_name #x = 1 # <<<<<< REMOVE COMMENT TO FIX MEM LEAK "a,b".split(',') end