Google Groups Home
Help | Sign in
God's memory leak - a scientific treatment
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  Messages 1 - 25 of 33 - Collapse all   Newer >
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
Tom Werner  
View profile
 More options Feb 13, 2:02 pm
From: Tom Werner <mojo...@gmail.com>
Date: Wed, 13 Feb 2008 11:02:40 -0800 (PST)
Local: Wed, Feb 13 2008 2:02 pm
Subject: God's memory leak - a scientific treatment
As some of you may have experienced, god has a slow memory leak. Since
my informal techniques have so far been fruitless, it's time to bust
out the scientific method. I will use this thread to document the
process and hopefully we can solve this problem.

Let's begin. I ask "Why is god leaking?"

Looking at /proc/<pid>/status for a god process (ruby 1.8.5
(2006-12-25 patchlevel 12) [x86_64-linux]) that is obviously leaking,
I see the following:

VmPeak:   297144 kB
VmSize:   297132 kB
VmLck:         0 kB
VmHWM:    206344 kB
VmRSS:    206336 kB
VmData:   234948 kB
VmStk:       228 kB
VmExe:       772 kB
VmLib:      5000 kB
VmPTE:       584 kB

The stack size is small and the data size is large, which means that
the leak is on the heap.

Ruby can leak to the heap in two ways. By not garbage collecting from
the Ruby heap or by leaking memory to the C heap. I always assume it
is my own code that is broken, so my hypothesis is that I am somehow
causing Ruby heap objects to stick around and not be GC'ed. To test
this hypothesis I will use the excellent BleakHouse from Evan Weaver.
It produces a dump of the Ruby heap on demand and can then analyze the
frame dumps to calculate how many objects are never collected.

The test is against god 0.7.3 on ruby 1.8.5 (2006-12-25 patchlevel 12)
[i686-darwin8.9.1]. The config file contains a single watch with
interval = 0 (meaning it will run the condition continuously) that
runs a process_running condition. Heap snapshots are taken at the
beginning and end of the driver loop.

After several seconds, god holds steady at around 9150kb for ten
minutes or so. Then it starts leaking. When I finally stop god about
20 minutes later, it is up to around 9500kb. This test run produces a
20GB! dump file. An overnight analysis gives me this:

-----------------------------------

88733 full frames. Removing 15 frames from each end of the run to
account for
startup overhead and GC lag.

5220649 total births, 5220403 total deaths, 611 uncollected objects.

Tags sorted by persistent uncollected objects. These objects did not
exist at
startup, were instantiated by the associated tags, and were never
garbage
collected:
  end-driver leaked (over 44350 requests):
    178 String
    146 Array
    111 Time
    4 MatchData
    3 God::DriverEvent
    1 IO
    1 Float
    1 God::System::Process
    1 File
    1 Process::Status
  begin-driver leaked (over 44355 requests):
    29 String
    3 Array
    1 NoMethodError

-----------------------------------

The analysis shows that no unexpected Ruby heap leaks are occurring.
God is set to hold on to the last 100 log lines (which are stored as
[text, time]) accounting for 100 Array objects, 100 String objects,
and 100 Time objects. The remaining objects represent the working set
that god uses during a normal driver cycle and are within my
expectations.

I conclude that under this setup, uncollectable objects do not account
for the memory leak. This leaves C heap leaks as the likely culprit.
My next experiment will formalize this hypothesis.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tom Werner  
View profile
 More options Feb 13, 8:00 pm
From: Tom Werner <mojo...@gmail.com>
Date: Wed, 13 Feb 2008 17:00:39 -0800 (PST)
Local: Wed, Feb 13 2008 8:00 pm
Subject: Re: God's memory leak - a scientific treatment
I know that the memory leak isn't caused by leaking Ruby objects and I
know that a basic looping Ruby program does not leak memory. There
must, then, be some part of the code that causes Ruby to leak memory
internally. Since the memory leak seems to be related to how
frequently the driver loop executes, I hypothesize that most or all of
the memory leak is contained within the driver loop handlers.

Experiment B

In this experiment I will disable the handler part of the driver loop,
instead rescheduling the condition immediately. The code for this test
can be found at

https://github.com/mojombo/god/tree/f59621c270c5d804ff362c152823946d6...

The specific change is hilighted at

https://github.com/mojombo/god/tree/f59621c270c5d804ff362c152823946d6...

For these memory leak tests, I will consider god NOT leaking if it
does not vary significantly from the startup memory usage after 10,000
seconds. Here are the test results:

memory in kb (second)
---------------------------------
7604 (1)
7604 (2)
7604 (3)
7604 (4)
7604 (5)
...
7588 (9996)
7588 (9997)
7588 (9998)
7588 (9999)
7588 (10000)

This is running under ruby 1.8.5 (2006-12-25 patchlevel 12) [i686-
darwin8.9.1].

I conclude that the memory leak is caused by code within the driver
handlers. Some Ruby call or calls must be leaking memory within that
code. Isolating the offending code should hopefully allow me to work
around it.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tom Werner  
View profile
 More options Feb 14, 1:49 pm
From: Tom Werner <mojo...@gmail.com>
Date: Thu, 14 Feb 2008 10:49:15 -0800 (PST)
Local: Thurs, Feb 14 2008 1:49 pm
Subject: Re: God's memory leak - a scientific treatment
As an addendum to Experiment B, I did the same test with the 1.8.6
BleakHouse Ruby.

$ ruby-bleak-house -v
ruby 1.8.6 (2007-09-23 patchlevel 110) [i686-darwin8.11.1]

$ ruby-bleak-house -I/p/libdata/ruby/1.8 bin/god -c test/configs/
stress/stress.god -D --no-events

Here are the results:

14356 (1)
14356 (2)
14356 (3)
14356 (4)
14356 (5)
...
14528 (10000)
...
14720 (20000)
...
14892 (30000)
...
15064 (40000)
...
15244 (50000)
...
15448 (62402)
15448 (62403)
15448 (62404)
15448 (62405)
15448 (62406)

Over the 17 hour test, memory increased by 1.1 MB. This represents
perhaps 1,000,000 times through the driver loop (the test used 100% of
both cores on my MBP).

I'm not sure what to make of this data right now. It appears to be
leaking, albeit very very slowly. I'm also not sure why initial memory
usage is higher. More investigation needs to go into 1.8.6 behavior,
but for the moment I will concentrate on 1.8.5 as it shows no memory
leak for this test which should make it easier to isolate the leaking
code.

Rubinius is almost the point where it can run god. I will keep an eye
on their progress and once a test run can be done against Rubinius I
will do an experiment under that environment.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tom Werner  
View profile
 More options Feb 14, 5:21 pm
From: Tom Werner <mojo...@gmail.com>
Date: Thu, 14 Feb 2008 14:21:12 -0800 (PST)
Local: Thurs, Feb 14 2008 5:21 pm
Subject: Re: God's memory leak - a scientific treatment
Experiment B showed that god does not leak memory under 1.8.5 if the
main processing logic is removed. This is for a single Watch. It's
possible that thread interactions will cause the leak to manifest. I
hypothesize that configuring 10 watches (one driver thread per watch)
will also NOT leak. I will perform a 10,000 second experiment to test
this.

Experiment C

http://github.com/mojombo/god/tree/ea49aacb71b6d90fa0c65e19254299b060...

7896 (0)
7896 (1)
7896 (2)
7896 (3)
7896 (4)
...
7940 (2500)
...
7988 (5000)
...
8032 (7500)
...
8084 (9996)
8084 (9997)
8084 (9998)
8084 (9999)
8084 (10000)

So it looks like a slow leak. Memory usage after 10,000 seconds (2.78
hours) was 188kb higher than initially. While I'd say this is a real
leak, it would only count as a tiny portion of the problem and so I
will shelve it for later inspection. If the main leak is not caused by
threaded interactions at this level, then it is time to delve into the
handler code and see if I can isolate the memory leaking code.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tom Werner  
View profile
 More options Feb 15, 1:22 pm
From: Tom Werner <mojo...@gmail.com>
Date: Fri, 15 Feb 2008 10:22:59 -0800 (PST)
Local: Fri, Feb 15 2008 1:22 pm
Subject: Re: God's memory leak - a scientific treatment
It occurs to me that I have not run a control test under the decided
upon setup.

$ ruby -v
ruby 1.8.5 (2006-12-25 patchlevel 12) [i686-darwin8.9.1]

Experiment D

http://github.com/mojombo/god/tree/b91e3172af83c5a937a97ddbab88972922...

This experiment will act as the control, showing the memory leak in
god 0.7.2 (at full speed). A 10,000 second test with one Watch will be
performed if my machine can hold on that long.

8088 (0)
8136 (1)
8188 (2)
8240 (3)
8288 (4)
...
54208 (1000)
...
98304 (2000)
...
142992 (3000)
...
186936 (4000)
...
229920 (5000)
...
272516 (6000)
...
314444 (7000)
...
356028 (8000)
...
396920 (9000)
...
437708 (9996)
437752 (9997)
437788 (9998)
437832 (9999)
437868 (10000)

So there you have it. 437 MB memory usage after 10,000 seconds.
Amazing.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tom Werner  
View profile
 More options Feb 15, 4:27 pm
From: Tom Werner <mojo...@gmail.com>
Date: Fri, 15 Feb 2008 13:27:06 -0800 (PST)
Local: Fri, Feb 15 2008 4:27 pm
Subject: Re: God's memory leak - a scientific treatment
Intuition tells me that the logging subsystem is a likely candidate
for memory leaks. There's a lot of data being pushed around in there.
I hypothesize that this will at least reduce the memory leak.

Experiment E

I will disable all logging from within the handlers.

http://github.com/mojombo/god/commit/e4c85a3221fc1a40add0b73abad9403b...

Here are the results

7404 (0)
7404 (1)
7404 (2)
7396 (3)
7396 (4)
...
7384 (9996)
7384 (9997)
7384 (9998)
7384 (9999)
7384 (10000)

Wow, this is great news! No leaks with logging removed. Slowly the
pincers close...


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tom Werner  
View profile
(3 users)  More options Feb 15, 7:53 pm
From: Tom Werner <mojo...@gmail.com>
Date: Fri, 15 Feb 2008 16:53:34 -0800 (PST)
Local: Fri, Feb 15 2008 7:53 pm
Subject: Re: God's memory leak - a scientific treatment
So far I've shown that the leak is in the log_line method of task.rb.

Experiment F:

I will play around with the enabled parts of log_line and see if I can
isolate a single line in the method that is causing the leak.

...

After enabling all the code, I've found that this line, when removed,
prevents the leak (at least for short runs):

debug_message = watch.name + ' ' + condition.base_name + " [#{result}]
" + self.dest_desc(metric, condition)

...

Further testing shows that it's the condition.base_name part of the
line that is responsible. Here's the code:

    def base_name
      self.class.name.split('::').last
    end

Amazingly, this seemingly benign method rings a very familiar bell.
Another developer at Powerset had identified a piece of code in his
app that looked very much like this and resulted in a HUGE memory leak
as well. The odd but functional fix that we identified was to create a
local scope by declaring a local variable. Here's the diff that I will
try for this test run:

     def base_name
+      x = 1
       self.class.name.split('::').last
     end

And, drumroll please:

7444 (0)
7444 (1)
7444 (2)
7444 (3)
7444 (4)
...
7432 (9996)
7432 (9997)
7432 (9998)
7432 (9999)
7432 (10000)

All I can say is SCIENCE!

I can't tell you how glad I am to have finally solved this problem.
The fun isn't over yet, though. I'll now confirm the fix for 10
watches and on 1.8.6. Stay tuned!


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Chris Van Pelt  
View profile
 More options Feb 15, 10:11 pm
From: "Chris Van Pelt" <vanp...@gmail.com>
Date: Fri, 15 Feb 2008 19:11:30 -0800
Local: Fri, Feb 15 2008 10:11 pm
Subject: Re: God's memory leak - a scientific treatment

Holy crap, It would interesting to figure out how the F defining a local
variable prevents the leak...  It almost makes me more frustrated that this
is the fix :)


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Andrew Stewart  
View profile
 More options Feb 16, 7:34 am
From: Andrew Stewart <b...@airbladesoftware.com>
Date: Sat, 16 Feb 2008 12:34:03 +0000
Local: Sat, Feb 16 2008 7:34 am
Subject: Re: God's memory leak - a scientific treatment

On 16 Feb 2008, at 00:53, Tom Werner wrote:

> Further testing shows that it's the condition.base_name part of the
> line that is responsible. Here's the code:

>     def base_name
>       self.class.name.split('::').last
>     end

Well done for isolating this; a triumph of empirical legwork.

> Amazingly, this seemingly benign method rings a very familiar bell.
> Another developer at Powerset had identified a piece of code in his
> app that looked very much like this and resulted in a HUGE memory leak
> as well. The odd but functional fix that we identified was to create a
> local scope by declaring a local variable. Here's the diff that I will
> try for this test run:

>      def base_name
> +      x = 1
>        self.class.name.split('::').last
>      end

Odd indeed.  This looks like a bug in Ruby.  Would Ruby Talk be the  
best place to raise this in the hope of solving the problem at source?

Regards,
Andy Stewart

-------
http://airbladesoftware.com


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bob Hutchison  
View profile
(1 user)  More options Feb 16, 11:40 am
From: Bob Hutchison <hutch-li...@recursive.ca>
Date: Sat, 16 Feb 2008 11:40:04 -0500
Local: Sat, Feb 16 2008 11:40 am
Subject: Re: God's memory leak - a scientific treatment
Hi,

On 15-Feb-08, at 7:53 PM, Tom Werner wrote:

> All I can say is SCIENCE!

> I can't tell you how glad I am to have finally solved this problem.
> The fun isn't over yet, though. I'll now confirm the fix for 10
> watches and on 1.8.6. Stay tuned!

Well done!

Here's a test case that demonstrates what is surely a bug. If you  
want, you can post on ruby-talk or someplace more efficient to get  
some attention and a fix. Make sure you post the test case.

module Play
  class Toy
    def peek_at_proc
      puts `ps u -p #{Process.pid}`
    end

    def base_name
      #x = 1 # <<<<<< REMOVE COMMENT TO FIX MEM LEAK
      "a,b".split(',')
    end

    def test
      peek_at_proc

      1000000.times do
        base_name
      end

      peek_at_proc
    end
  end
end

Play::Toy.new.test


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Kevin Clark  
View profile
 More options Feb 16, 3:36 pm
From: "Kevin Clark" <kevin.cl...@gmail.com>
Date: Sat, 16 Feb 2008 12:36:39 -0800
Local: Sat, Feb 16 2008 3:36 pm
Subject: Re: God's memory leak - a scientific treatment
I was actually able to reduce the leak further when we found it last month:

class Bar
  def self.class_name
    name.split(/::/)
  end
end

loop { Bar.class_name }

Leaks rather a lot.

--
Kevin Clark
http://glu.ttono.us


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Phil Murray  
View profile
 More options Feb 17, 4:59 am
From: Phil Murray <pmur...@open2view.com>
Date: Sun, 17 Feb 2008 01:59:05 -0800 (PST)
Local: Sun, Feb 17 2008 4:59 am
Subject: Re: God's memory leak - a scientific treatment
Seems you're not the only ones to discover it

See here: http://opensynapse.net/post/24570619

On Feb 17, 9:36 am, "Kevin Clark" <kevin.cl...@gmail.com> wrote:


    Reply to author    Forward