Memory leak in HM (and CC)


Andrea Campi

May 30, 2012, 6:01:05 PM
to vcap...@cloudfoundry.org
Hi all,

earlier this week I spent some time figuring out the how, when and why
of the growing size of both HM and CC in my staging environment.

Background: I have a one-server staging setup, with vcap updated to
master as of last week. I haven't seen any relevant commit in the past
few days.
The RSS of both processes keeps growing with no upper limit; I am
currently restarting both HM and CC a couple of times a week.

I will spare you the details of everything I tried to narrow it down
(unless you want to hear them), but in the end it boils down to:
App.find_by_id(app_id)
Or, more verbosely:

+ @logger.info("analyze_app: #{droplet_entry[:state]} #{APP_STABLE_STATES.include?(droplet_entry[:state])} #{quiescent} (#{`ps -o rss= -p #{Process.pid}`.to_i})")
+ (1..10).each do
+   App.find_by_id(app_id)
+ end
+ @logger.info("analyze_app: #{droplet_entry[:state]} #{APP_STABLE_STATES.include?(droplet_entry[:state])} #{quiescent} (#{`ps -o rss= -p #{Process.pid}`.to_i})")
+ GC.start
+ @logger.info("analyze_app: #{droplet_entry[:state]} #{APP_STABLE_STATES.include?(droplet_entry[:state])} #{quiescent} (#{`ps -o rss= -p #{Process.pid}`.to_i})")

I let this run for ~5 minutes to let things settle before looking at
the samples.
Even correcting for the fact that the GC may not actually run
immediately, it looks like App.find_by_id is leaking about 4KB each
time.

This is with Rails 3.0.5; I also tried upgrading to 3.0.12 and saw no change.
I'm running these components with Ruby 1.9.2-p290.


I have a few questions:
* is this a known issue, or should I dig in more and provide more info?
* any known workaround?
* are CC v2 and HM v2 moving fast enough that I should ignore this and wait?
* am I completely off base?

Regards,
Andrea

ramonskie

May 31, 2012, 11:23:03 AM
to vcap...@cloudfoundry.org
I encountered a memory problem with the CC a few months ago and tried
almost everything.
I eventually did a clean install with Chef and the problem was solved.
My guess is that something went wrong with a previous upgrade.
 
On Thursday, May 31, 2012 at 00:01:05 UTC+2, Andrea Campi wrote the following:

张 磊

May 31, 2012, 9:44:45 PM
to vcap-dev
Any details about how you solved that problem?

Jesse Zhang

May 31, 2012, 9:44:24 PM
to vcap...@cloudfoundry.org
Hi Andrea,
  It looks normal to me until you share some detailed numbers. Do you have any monitoring data or charts showing how terrible the leak is? GC should kick in at some point.

Cheers,
Jesse

Andrea Campi

Jun 6, 2012, 2:24:22 AM
to vcap...@cloudfoundry.org
On Fri, Jun 1, 2012 at 3:44 AM, Jesse Zhang <je...@rbcon.com> wrote:
> Hi Andrea,
>   It looks normal to me until you share some detailed numbers. Do you have
> any monitoring data or charts showing how terrible the leak is? GC should
> kick in at some point.

Please find attached the log file for a night; it shows steady growth
from 56MB (at start) to over 320MB.
I didn't plot it, but it looks pretty linear to me: it took about 3.5h
to go from 100MB to 200MB, and again 3.5h from 200MB to 300MB.


Look for entries like:
[2012-05-29 04:29:56.142936] hm - pid=16234 tid=70e9 fid=6532 INFO -- Analyzed 6 running and 0 down apps in 9.3ms (200024)

The number at the end is the RSS of the process in KB, produced with
`ps -o rss= -p #{Process.pid}`.to_i.
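
For reference, that is really all the instrumentation amounts to; a tiny
helper like this (hypothetical name, purely for illustration) does the
same thing:

def rss_kb
  # Resident set size of the current process in KB, as reported by ps.
  `ps -o rss= -p #{Process.pid}`.to_i
end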


This log was produced with HM f7f7354c290285c477715986d1b6f399b7e01c1a
(after removing vendored gems, before the Rails upgrade), running with
Ruby 1.9.2-p290.
I am now running with HM from the split repo
(https://github.com/cloudfoundry/cloud_controller) and I'm seeing
pretty much the same; I'll report back with the results.


Let me know if you can suggest ideas for more debugging.
Andrea



Next I will try with the smallest script that shows the problem, something like:

while true do
  @logger.info `ps -o rss= -p #{Process.pid}`
  (1..10).each do
    App.find_by_id(app_id)
  end
  GC.start
  sleep 5
end

From my earlier tests, this should leak big time.
(Attachment: health_manager.log.gz)

Jesse Zhang

Jun 6, 2012, 4:04:57 AM
to vcap...@cloudfoundry.org
If you wanted to reproduce a leak, why would you call GC.start?

Jesse

Andrea Campi

Jun 6, 2012, 4:09:25 AM
to vcap...@cloudfoundry.org
On Wed, Jun 6, 2012 at 10:04 AM, Jesse Zhang <je...@rbcon.com> wrote:
> If you wanted to reproduce a leak, why would you call GC.start?

Just to verify I wasn't just observing a slow GC; it doesn't matter
over the course of hours.
I tried both ways anyway; in the log that I sent, it wasn't there, nor
is it now.

Andrea

Jesse Zhang

Jun 6, 2012, 4:19:28 AM
to vcap...@cloudfoundry.org
Hi Andrea,
  What wasn't where? It would be interesting if you could track down where the leak comes from. FYI I have a health manager running for 2 weeks and the RSS is only 60M.

Jesse

Andrea Campi

Jun 6, 2012, 4:36:13 AM
to vcap...@cloudfoundry.org
On Wed, Jun 6, 2012 at 10:19 AM, Jesse Zhang <je...@rbcon.com> wrote:
> Hi Andrea,
>   What wasn't where?

The call to GC.start. This is the only diff:

- @logger.info("Analyzed #{@analysis[:stats][:running]} running and #{@analysis[:stats][:down]} down apps in #{elapsed_time_in_ms(@analysis[:start])}")
+ @logger.info("Analyzed #{@analysis[:stats][:running]} running and #{@analysis[:stats][:down]} down apps in #{elapsed_time_in_ms(@analysis[:start])} (#{`ps -o rss= -p #{Process.pid}`.to_i})")

> It would be interesting if you could track down where
> the leak comes from. FYI I have a health manager running for 2 weeks and the
> RSS is only 60M.

See my original post:

+ @logger.info("analyze_app: #{droplet_entry[:state]} #{APP_STABLE_STATES.include?(droplet_entry[:state])} #{quiescent} (#{`ps -o rss= -p #{Process.pid}`.to_i})")
+ (1..10).each do
+   App.find_by_id(app_id)
+ end
+ @logger.info("analyze_app: #{droplet_entry[:state]} #{APP_STABLE_STATES.include?(droplet_entry[:state])} #{quiescent} (#{`ps -o rss= -p #{Process.pid}`.to_i})")
+ GC.start
+ @logger.info("analyze_app: #{droplet_entry[:state]} #{APP_STABLE_STATES.include?(droplet_entry[:state])} #{quiescent} (#{`ps -o rss= -p #{Process.pid}`.to_i})")

I can see the RSS go up across that and never go down.
Of course that's not definitive in any way, and I would be very very
surprised if the problem was there.

I will keep investigating, although with HM v2 coming up I'm not
willing to spend a huge amount of time.

> It would be interesting if you could track down where
> the leak comes from. FYI I have a health manager running for 2 weeks and the
> RSS is only 60M.

Yeah, that's the thing--in production I have an older version of
everything (very old, circa January) and it's been stable for months.

Andrea

Jesse Zhang

Jun 6, 2012, 5:06:31 AM
to vcap...@cloudfoundry.org
FYI, my 2-week-old HM is running very new code (from just before the Rails vulnerability fix).

Jesse

Andrea Campi

Jun 6, 2012, 5:51:10 AM
to vcap...@cloudfoundry.org
On Wed, Jun 6, 2012 at 11:06 AM, Jesse Zhang <je...@rbcon.com> wrote:
> FYI my 2-week-old HM is using very new code (before the rails vulnerability
> fix)

Even stranger: we must be running the same code then.

I'm wondering what might be different; maybe some gems? I think some
of them are not pinned to a specific version, right?
I just had a look and the HM seems to be set up in a, uhm, peculiar way.
That is, in health_manager/lib/health_manager:

def self.root
  @root ||= Pathname.new(File.expand_path('../../../cloud_controller', __FILE__))
end

So at the end of the day the HM is totally dependent on the CC and
transitively picks up the same dependencies, which are not specified
in the Gemfile for HM. That is kind of confusing, and makes it hard to
see which versions of each gem it is using.
For instance, I wasn't quite sure why it was forcing me to use pg
0.10.1 (quite old): that's because the CC is pinned there.
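
If anyone wants to double-check which versions actually end up loaded,
a quick sketch you can paste into a console inside the HM environment:

# Print every gem the current process has loaded, with its version.
Gem.loaded_specs.sort.each do |name, spec|
  puts "#{name} #{spec.version}"
end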


Just for kicks, I now tried running the HM with bundle exec (and
changed the Gemfile accordingly).
Is that something worth doing? Should I submit it?

Andrea Campi

Jun 14, 2012, 3:33:20 AM
to vcap...@cloudfoundry.org
I have made some progress investigating this, and now I have the HM
running stable: it hasn't increased its RSS at all in the last 12
hours.

I tried many, many things, so it's kind of hard to pinpoint; I've
seen a lot of crazy stuff while doing that, but nothing I could
reproduce and consider a bug in CF, more likely something with my
setup.

But that said, what finally solved the problem for me was:

* giving HM its own Gemfile;
* running bundle --standalone;
* changing the rc script to run it with bundle exec.
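
Concretely, the setup now looks roughly like this; the gem names and
versions below are placeholders only, the real pins are whatever the
code actually needs:

# health_manager/Gemfile (hypothetical contents)
source 'http://rubygems.org'

gem 'rails', '3.0.12'
gem 'nats'
gem 'pg', '0.10.1'

# Then:
#   bundle install --standalone   # vendors the gems and generates bundle/bundler/setup.rb
# and the rc script starts the HM with:
#   bundle exec <whatever command it ran before>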

The funny thing is that I had already tried making sure I only had the
exact gems that were needed, but that didn't help much.
I can't really explain it, which makes me really suspicious.
This is a test VM with all services and DEA on a single machine. What
I'll try next is to set up a new VM from scratch with just CC and HM,
and see what happens.

Andrea

Albert

Jun 14, 2012, 1:38:44 PM
to vcap-dev
Hello Jesse and Andrea -

I can confirm there is definitely a leak. Our last sync point with
cloudfoundry/vcap is f7f7354c290285c477715986d1b6f399b7e01c1a.

In our current deployment, HM can quickly balloon to over 8G. It
maxed out memory on the box and triggered the OOM killer, causing
general mayhem. It seems to be leaking memory at a very high rate in
our deployment.

Like Andrea, we were running a very old build (circa January) and had
no problems. Updating to latest (pre-repo split) is the first time
we're seeing this kind of memory use.

We haven't yet tried Andrea's Gemfile solution, but I would guess that
the problem is not specific to HM. We actually run HM on a different
box from CC, and are seeing CC leaking memory as well. It does so at a
much, much slower rate, however; it's currently hovering at 1-2G
resident. Given the high memory profile on CC in addition to HM, it
would seem the leak is in some call that CC also makes, but that HM
makes much more frequently.

I added a periodic_timer to check the RSS of Process.pid every 3s
(monit restarts HM once it goes over 256M).
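
The timer itself is nothing fancy; a sketch of it (not our exact code,
assuming an EventMachine-based process like HM):

require 'eventmachine'
require 'logger'

logger = Logger.new($stdout)

EM.run do
  # Log the resident set size (in KB) of this process every 3 seconds.
  EM.add_periodic_timer(3) do
    rss_kb = `ps -o rss= -p #{Process.pid}`.to_i
    logger.info("pid=#{Process.pid} | MEM:#{rss_kb}")
  end
end

Here is what that kind of polling shows: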

Jun 14 12:11:31 | pid=13930 | MEM:430516
Jun 14 12:11:35 | pid=13930 | MEM:450012
Jun 14 12:11:38 | pid=13930 | MEM:455308
Jun 14 12:11:41 | pid=13930 | MEM:455864
Jun 14 12:11:44 | pid=13930 | MEM:468984
Jun 14 12:11:47 | pid=13930 | MEM:481332
Jun 14 12:11:51 | pid=13930 | MEM:481696
Jun 14 12:12:00 | pid=15558 | MEM:87104
Jun 14 12:12:03 | pid=15558 | MEM:89824
Jun 14 12:12:06 | pid=15558 | MEM:90344
Jun 14 12:12:09 | pid=15558 | MEM:92232
Jun 14 12:12:13 | pid=15558 | MEM:92436
Jun 14 12:12:16 | pid=15558 | MEM:92444
Jun 14 12:12:19 | pid=15558 | MEM:92656
Jun 14 12:12:22 | pid=15558 | MEM:93160
Jun 14 12:12:25 | pid=15558 | MEM:90840
Jun 14 12:12:28 | pid=15558 | MEM:91712
Jun 14 12:12:31 | pid=15558 | MEM:91712
Jun 14 12:12:34 | pid=15558 | MEM:92560
Jun 14 12:12:37 | pid=15558 | MEM:92708
Jun 14 12:12:40 | pid=15558 | MEM:124248
Jun 14 12:12:43 | pid=15558 | MEM:124248
Jun 14 12:12:46 | pid=15558 | MEM:127540
Jun 14 12:12:49 | pid=15558 | MEM:145416
Jun 14 12:12:53 | pid=15558 | MEM:149556
Jun 14 12:12:56 | pid=15558 | MEM:149656
Jun 14 12:13:02 | pid=15558 | MEM:192392


As best I can tell, the big steps in memory correspond to the
analyze_all_apps timer.

- Albert



On Jun 14, 12:33 am, Andrea Campi <andrea.ca...@zephirworks.com>
wrote:

Albert

Jun 15, 2012, 1:54:38 AM
to vcap-dev
Turns out the issue is here:

after_initialize :set_defaults
def set_defaults
-  self.metadata ||= {}
+  @metadata = {} if new_record?
end

I isolated it to this attribute via this test:

fields = 'id,runtime'
5.times { App.select(fields).find_each { |_| }; puts `ps -o rss= -p #{Process.pid}`.to_i }

fields = 'id,runtime,metadata'
5.times { App.select(fields).find_each { |_| }; puts `ps -o rss= -p #{Process.pid}`.to_i }

# kaboom



@andrea are you running with the mysql2 adapter or the postgres one?

Pieter Noordhuis

Jun 19, 2012, 1:45:03 PM
to vcap...@cloudfoundry.org
Hi Albert,

Jesse and I spent some time looking at this, but aren't able to
reproduce the issue. We've filled up a SQLite database with 100k app
records. Running #select with and without the metadata field does show
a bump in RSS, but nothing out of the ordinary. The #find_each will
create a lot of objects, but nothing that can't be reused so RSS
converges and doesn't keep increasing.

The fix you posted isn't functionally equivalent to the current
approach. The #metadata method is not an attr_accessor, but a
serialized field. This means you can't just set an instance variable
and be done (AFAIK).
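
For context, the relevant model code is shaped roughly like this (a
sketch, not the exact CC source):

require 'active_record'

class App < ActiveRecord::Base
  # metadata is stored as serialized YAML in a text column, so reads and
  # writes go through ActiveRecord's serialization machinery...
  serialize :metadata, Hash

  after_initialize :set_defaults

  def set_defaults
    # ...which is why assigning the @metadata instance variable directly
    # is not the same thing as self.metadata ||= {}.
    self.metadata ||= {}
  end
end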

Could you include the output of GC.stat and GC.count in your periodic
polls? Maybe that gives us a definitive answer to the question of
whether the Ruby heap is leaking or not.
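
Something along these lines in the periodic poll would do (a sketch;
the GC.stat call is guarded because it may not exist on older 1.9.x
builds):

def gc_snapshot
  snap = { rss_kb: `ps -o rss= -p #{Process.pid}`.to_i, gc_count: GC.count }
  snap[:gc_stat] = GC.stat if GC.respond_to?(:stat)
  snap
end

puts gc_snapshot.inspect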

Thanks,
Pieter

Jesse Zhang

Jun 19, 2012, 3:18:00 PM
to vcap...@cloudfoundry.org
Hi Albert,
  Can you provide me with the exact Gemfile and Gemfile lock with which you are seeing a leak? Detailed info about your deployment env (mysql daemon / client lib version) will also help.

Cheers,
Jesse

Albert

Jun 25, 2012, 1:54:55 PM
to vcap-dev
Hey guys,

Sorry for the delay. While compiling the info you guys requested, I
noticed that Ruby 1.9.2 lacks GC.stat (new in 1.9.3). Interestingly
enough, in a small VM there is no leak on 1.9.3.

But it was my understanding that vcap runs on 1.9.2 these days. Has
that changed?

I threw together our Gemfiles, locks and versions into this gist
repo:
https://gist.github.com/2989855

For now, we're running with the patch and our system has stabilized.



On Jun 19, 12:18 pm, Jesse Zhang <je...@rbcon.com> wrote:
> Hi Albert,
>   Can you provide me with the exact Gemfile and Gemfile lock with which you
> are seeing a leak? Detailed info about your deployment env (mysql daemon /
> client lib version) will also help.
>
> Cheers,
> Jesse

Albert

Sep 7, 2012, 12:12:11 PM
to vcap...@cloudfoundry.org
Just to wrap up this discussion and help anyone else who might be seeing what we saw.

We eventually narrowed the leak down to a known issue with the built-in YAML library: it turns out there's a leak in the psych library (with an H) on 1.9.2.

A bit of memory leaks every time App#set_defaults runs and loads (typically empty) serialized YAML from the DB. This happens in both CC and HM, although significantly more frequently in HM.

We resolved the situation by installing syck (with a K). Bundler passively picks it up and loads it early on, thereby bypassing the built-in engine, and the leak disappears. It might be better to explicitly bundle an updated psych version (one without the leak).
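
If you want to make the engine choice explicit instead of relying on
gem load order, something like this early in boot should work on 1.9.x,
where YAML::ENGINE is the switch (a sketch, not what we deployed):

require 'yaml'

if YAML.const_defined?(:ENGINE)
  # Select the syck engine explicitly; on 1.9.x both syck and psych ship with Ruby.
  YAML::ENGINE.yamler = 'syck'
  puts YAML::ENGINE.yamler   # => "syck"
end

Going the other way, pinning a fixed psych version in the Gemfile so
bundler loads it instead of the built-in one should have the same
effect.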

Bob Nugmanov

Sep 7, 2012, 3:24:55 PM
to vcap...@cloudfoundry.org
Thank you Albert.

Bob.