memory leak tips...


Henrik Lindberg

Oct 13, 2014, 6:57:54 PM
to puppe...@googlegroups.com
Hi,
As you may know, a memory leak was found in 3.7 (PUP-3345), and it seems
like we have found the cause of the problem. (YAY !!!)

In order to find the leak, I came up with some (rough) tools to help
detect leakage. Below are some tips if you want to use them. But first,
the cause.

Basically, the cause was a "faulty" cache implementation that made
several incorrect assumptions. So, here are some tips on what
not to do.

Do not hold on to things in class variables (e.g. @@my_cache) unless the
cache only contains things that are part of the loaded Ruby code and share
the same lifecycle as the class. Otherwise, you must have something
that evicts the cache content at some sort of transaction boundary. In
the case found, this did not happen: for each environment, the cache added
a reference to a resource type instance, and since resource types are
reloaded for each environment, the cache kept on growing.
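A minimal sketch of the difference, in Ruby. All class names here (ResourceType, Environment, LeakyTypeCache, EvictingTypeCache) are hypothetical stand-ins, not Puppet's actual code, and "a different environment is now in use" is assumed as the transaction boundary:

```ruby
# Hypothetical stand-in for a resource type; it points back at its environment,
# so anything that retains the type also retains the whole environment.
class ResourceType
  attr_reader :name, :environment

  def initialize(name, environment)
    @name = name
    @environment = environment
  end
end

# Hypothetical stand-in for an environment: each one loads fresh type instances.
class Environment
  attr_reader :types

  def initialize
    @types = { 'file' => ResourceType.new('file', self) }
  end
end

# Leaky: the class variable outlives every environment and is never evicted.
class LeakyTypeCache
  @@cache = {}

  def self.remember(type)
    @@cache[type] = true
  end

  def self.size
    @@cache.size
  end
end

# Bounded: the cache is evicted at a transaction boundary, here assumed to be
# "a different environment is now in use".
class EvictingTypeCache
  @@cache = {}
  @@environment = nil

  def self.remember(environment, type)
    unless @@environment.equal?(environment)
      @@cache.clear # drop everything bound to the previous environment
      @@environment = environment
    end
    @@cache[type] = true
  end

  def self.size
    @@cache.size
  end
end

10.times do
  env = Environment.new
  LeakyTypeCache.remember(env.types['file'])
  EvictingTypeCache.remember(env, env.types['file'])
end

puts LeakyTypeCache.size    # prints 10: one entry (and one retained environment) per run
puts EvictingTypeCache.size # prints 1: only the current environment's entry
```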

I would go so far as to say: almost never use the class level for
regular programming - create instances instead. That forces you to think
about the lifecycle - when is it created, when do the things it holds
on to get freed, etc.
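For comparison, a sketch of the instance-based alternative (Compilation and load_type are hypothetical names): the cache is an instance variable, so it is created with its owner and becomes garbage together with it, with no class-level state left to clean up.

```ruby
# Hypothetical owner object whose lifecycle is explicit: one per compilation.
class Compilation
  def initialize(environment_name)
    @environment_name = environment_name
    @type_cache = {} # created with this instance, unreachable once it is
  end

  def find_type(name)
    @type_cache[name] ||= load_type(name)
  end

  def cached_count
    @type_cache.size
  end

  private

  # Stand-in for an expensive lookup.
  def load_type(name)
    "#{@environment_name}::#{name.capitalize}"
  end
end

compilation = Compilation.new('production')
compilation.find_type('file')
compilation.find_type('file') # second call is served from the cache
puts compilation.cached_count # prints 1
compilation = nil # nothing else references the cache, so the GC may reclaim both
```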

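When using an object as a hash key, that object typically must implement
both a hash method and an eql? method (these are what Ruby's Hash uses to
compare keys); otherwise equal-looking keys that are different instances
each get their own entry, and you will very likely end up with an
ever-growing set of entries in the hash.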
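A small demonstration (the IdentityKey and ValueKey classes are hypothetical): without hash and eql?, Ruby compares keys by object identity, so every new instance becomes a new entry even when it represents the same thing.

```ruby
# Without hash/eql?, Ruby compares Hash keys by object identity.
class IdentityKey
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

# With hash/eql?, keys that describe the same thing collapse into one entry.
class ValueKey
  attr_reader :name

  def initialize(name)
    @name = name
  end

  def eql?(other)
    other.is_a?(ValueKey) && name == other.name
  end
  alias_method :==, :eql?

  # Must agree with eql?: equal keys must return equal hash values.
  def hash
    [self.class, name].hash
  end
end

identity_cache = {}
value_cache = {}
5.times do
  identity_cache[IdentityKey.new('file')] = :type # a new entry every time
  value_cache[ValueKey.new('file')] = :type       # the same entry every time
end

puts identity_cache.size # prints 5
puts value_cache.size    # prints 1
```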

If you are tempted to use Ruby's WeakRef support, then give up
immediately, since it is horribly slow on Ruby 1.8 and does not work
correctly on Ruby 1.9 (it seems to be based on object ids, which can get
recycled). If it worked, however, a WeakRef would otherwise be ideal for
cache implementations, since it does not by itself keep the object alive;
the object stays available only while something else also references it.
(There is still plenty of opportunity to write a cache that is incorrect,
though.)

Before you implement a cache, measure whether it is an actual speed
improvement! The overhead of a cache may eat the performance gain, or
it may even make things worse!
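A minimal way to do that measurement, sketched with a hypothetical compute/cached_compute pair. The numbers depend entirely on the real workload; for a computation this cheap, the cache may well be no faster than recomputing.

```ruby
# Times a block with a monotonic clock (unaffected by wall-clock changes).
def measure
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Hypothetical workload: cheap enough that a cache may not pay for itself.
def compute(n)
  n * 2 + 1
end

CACHE = {}

def cached_compute(n)
  CACHE[n] ||= compute(n)
end

inputs = Array.new(200_000) { |i| i % 1_000 }

uncached = measure { inputs.each { |n| compute(n) } }
cached   = measure { inputs.each { |n| cached_compute(n) } }

printf("uncached: %.4fs  cached: %.4fs  entries: %d\n", uncached, cached, CACHE.size)
```

Note that the cached version also pays in memory (one entry per distinct input), which is part of the cost to weigh.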

Avoid binding lots of objects in the cache. Bind an identifier / name if
possible. You may think you are keeping track of a Banana, but attached
to that you may have a Gorilla, and it needs its jungle...
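A sketch of the difference (Banana and Jungle are hypothetical stand-ins for a resource type and the environment it can reach): caching the object retains everything reachable from it, while caching its name retains one short String, and the name can be resolved again against whatever is current.

```ruby
require 'set'

# Hypothetical stand-ins: a Jungle (a large environment) and a Banana
# (a resource type) that holds a reference back to its Jungle.
class Jungle
  attr_reader :bananas

  def initialize
    @trees = Array.new(10_000) { |i| "tree #{i}" } # stands in for "everything else"
    @bananas = {}
  end
end

class Banana
  attr_reader :name, :jungle

  def initialize(name, jungle)
    @name = name
    @jungle = jungle
    jungle.bananas[name] = self
  end
end

cached_bananas = Set.new # caching objects: every Banana keeps its Jungle reachable
cached_names = Set.new   # caching identifiers: a String binds nothing else

3.times do
  jungle = Jungle.new
  banana = Banana.new('cavendish', jungle)
  cached_bananas << banana    # ends up retaining 3 Jungles
  cached_names << banana.name # ends up retaining one String
end

# Later, resolve the cached name against whichever Jungle is current.
current = Jungle.new
Banana.new('cavendish', current)
banana = current.bananas[cached_names.first]
```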

The "Tools"
===========
A new benchmark called "catalog_memory" was added to the code base.
It is the same benchmark as "empty_catalog" (it contains a single "hello
world" notice in each catalog), but "catalog_memory" is instrumented
to dump information about memory usage.

To run this, you must be using Ruby 2.1.0. Then (if running from source) do:

bundle exec rake benchmark:catalog_memory

This will print some stats about the first and last run (it does 10
runs). It then computes the set of objects in memory that were not bound
at the start, and it outputs two data files: "heap.json", with
information about all live objects in memory, and "diff.json", with
information about the diff between the start and end of the run.
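The post does not show how the benchmark writes these files. As a rough, hedged sketch of the same idea, assuming the objspace extension available from Ruby 2.1 on (ObjectSpace.trace_object_allocations_start and ObjectSpace.dump_all, which writes one JSON object per line): take a snapshot before and after the runs and keep the records whose addresses were not live at the start. The heap_snapshot helper and the retained array are hypothetical.

```ruby
require 'objspace'
require 'json'
require 'set'
require 'tempfile'

# Returns one parsed Hash per record of a full heap dump (one JSON object per line).
def heap_snapshot
  GC.start
  Tempfile.create('heap') do |io|
    ObjectSpace.dump_all(output: io)
    io.flush
    io.rewind
    # scrub: string contents are dumped as raw bytes and may not be valid UTF-8
    io.each_line.map { |line| JSON.parse(line.scrub) }
  end
end

start_addresses = heap_snapshot.map { |record| record['address'] }.compact.to_set

# From here on, new objects are dumped with the "file" and "line" that allocated them.
ObjectSpace.trace_object_allocations_start
retained = [] # stand-in for a leak: grows on every run and is never cleared
10.times { retained << Object.new }
after = heap_snapshot
ObjectSpace.trace_object_allocations_stop

# Approximation of "not bound at the start": an address that was freed and
# reused during the run would be missed.
diff = after.select do |record|
  record['address'] && !start_addresses.include?(record['address'])
end
puts "live records: #{after.size}, new since start: #{diff.size}"
```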

It also outputs a list of source locations and methods being called
where the allocations of the "leaked" objects were made. This list is
typically not very helpful unless the leak is trivial.
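Continuing in the same hedged spirit: grouping the "leaked" records by allocation site gives a list like the one described, assuming each record carries "file", "line" and "method" keys (as objspace dumps do for objects allocated while allocation tracing is on). The allocation_sites helper and the sample records are hypothetical.

```ruby
# Counts records per allocation site ("file:line method"), most frequent first.
def allocation_sites(records, limit = 10)
  records
    .select { |record| record['file'] } # skip objects with no recorded site
    .group_by { |record| "#{record['file']}:#{record['line']} #{record['method']}" }
    .map { |site, group| [site, group.size] }
    .sort_by { |_site, count| -count }
    .first(limit)
end

# Hypothetical sample records in the dump's shape.
leaked = Array.new(3) { { 'file' => 'lib/cache.rb', 'line' => 12, 'method' => 'remember' } } +
         [{ 'file' => 'lib/type.rb', 'line' => 40, 'method' => 'initialize' },
          { 'type' => 'STRING' }] # allocated before tracing started: no site

allocation_sites(leaked).each { |site, count| puts "#{count}\t#{site}" }
```

As the post says, this only points at where objects were created, not at what keeps them alive, so it mostly helps with trivial leaks.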

Once you are at this point, there is a rake task called "memwalk" that
reads the two files "heap.json" and "diff.json" and produces a Graphviz
.dot file that can be rendered. The result is a graph of all objects in
memory and how they bind each other. (There is more to say about this...)
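The memwalk task itself is not shown in this thread. A minimal sketch of its last step, turning heap records into Graphviz source with one node per object and one edge per reference, might look like this (to_dot and the sample records are hypothetical):

```ruby
# Builds Graphviz source: one node per record (labelled with its type and
# address) and one edge from each record to every object it references.
def to_dot(records)
  lines = ['digraph memwalk {']
  records.each do |record|
    address = record['address']
    next unless address # e.g. ROOT records have no address of their own

    lines << %(  "#{address}" [label="#{record['type']}\\n#{address}"];)
    Array(record['references']).each do |ref|
      lines << %(  "#{address}" -> "#{ref}";)
    end
  end
  lines << '}'
  lines.join("\n") + "\n"
end

records = [
  { 'address' => '0x1', 'type' => 'OBJECT', 'references' => ['0x2'] },
  { 'address' => '0x2', 'type' => 'OBJECT', 'references' => [] }
]
puts to_dot(records) # save as a .dot file and render it with dot
```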

You run this task with:

bundle exec rake memwalk

Then you produce the graph with the command:

dot -Tsvg -omemwalk.svg memwalk.dot

You now have a "memwalk.svg" file that you can open in Chrome. Nice
features are that you can search the graph (like searching on any web
page), and you can zoom and pan.

The graph has a bubble per object, and it shows its address in hex.
Arrows point from each object to the objects it binds.

All arrays, hashes, and leaf data objects are pruned from the graph.
Arrays and hashes are skipped over; instead, the graph shows the object
that ultimately holds on to the structure (without the intervening nested
structures). This makes the graph readable (and small enough to
process and view).
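The pruning described above can be sketched as a walk that looks through containers: from each "interesting" object, follow references, descend into any Array or Hash, drop leaf data, and emit an edge only when a non-container object is reached. The pruned_edges helper is hypothetical, and which types count as leaf data is an assumption of this sketch.

```ruby
require 'set'

CONTAINER_TYPES = %w[ARRAY HASH].freeze
# Which types count as "leaf data" is an assumption of this sketch.
LEAF_TYPES = %w[STRING SYMBOL FLOAT BIGNUM REGEXP].freeze

# Returns [from, to] address pairs between non-container, non-leaf objects,
# looking through any chain of intervening Arrays and Hashes.
def pruned_edges(records)
  by_address = records.each_with_object({}) { |r, h| h[r['address']] = r if r['address'] }
  edges = Set.new

  records.each do |record|
    next unless record['address']
    next if CONTAINER_TYPES.include?(record['type']) || LEAF_TYPES.include?(record['type'])

    pending = Array(record['references']).dup
    seen = Set.new
    until pending.empty?
      address = pending.pop
      next unless seen.add?(address) # guards against cycles between containers

      target = by_address[address]
      next if target.nil? || LEAF_TYPES.include?(target['type'])

      if CONTAINER_TYPES.include?(target['type'])
        pending.concat(Array(target['references'])) # skip over the container
      else
        edges << [record['address'], address]
      end
    end
  end
  edges.to_a
end

records = [
  { 'address' => '0xa', 'type' => 'OBJECT', 'references' => ['0xarr'] },
  { 'address' => '0xarr', 'type' => 'ARRAY', 'references' => %w[0xh 0xs] },
  { 'address' => '0xh', 'type' => 'HASH', 'references' => ['0xb'] },
  { 'address' => '0xs', 'type' => 'STRING' },
  { 'address' => '0xb', 'type' => 'OBJECT' }
]
p pruned_edges(records) # the Array and Hash are looked through; the String is dropped
```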

The memwalk command prints out some information about what it rendered
(counts). If you see something like tens of thousands of objects, then
the leak is massive and you may not be able to process it (nor be able
to read and navigate the huge graph).

To find a leak, browse the resulting graph, and find clusters that are
not supposed to be there. In the current case, there were 10
Puppet::Node::Environment objects, and there was only supposed to be one.

Then copy the address of one of the objects that are not supposed to be
there in order to do a walk of only it and the objects that keep it
alive. Say 7f9afa20ba38.

Then run memwalk again, now for this object (you need to quote the
argument now):

bundle exec rake 'memwalk[7f9afa20ba38]'

This creates a file called memwalk-7f9afa20ba38.dot that you can now
render using the dot command.

View that and look at how it is bound. You may find that it is
indirectly bound, and you may need to repeat this with what now appears
to be a root holding on to a cluster of objects.
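Walking from one object to "the objects that keep it alive" means following references backwards. A hedged sketch of that walk over the same kind of records (retainers is a hypothetical helper; treating ROOT records, which have no address of their own, as named roots is an assumption about the dump format):

```ruby
require 'set'

# Returns the target address plus every object that references it, directly
# or transitively, up to max_depth steps back: the set that keeps it alive.
def retainers(records, target, max_depth = 20)
  referrers = Hash.new { |hash, key| hash[key] = [] }
  records.each do |record|
    from = record['address'] || "ROOT:#{record['root']}"
    Array(record['references']).each { |ref| referrers[ref] << from }
  end

  found = Set[target]
  frontier = [target]
  max_depth.times do
    break if frontier.empty?

    frontier = frontier.flat_map { |address| referrers[address] }
                       .reject { |address| found.include?(address) }
                       .uniq
    found.merge(frontier)
  end
  found
end

records = [
  { 'type' => 'ROOT', 'root' => 'vm', 'references' => ['0xholder'] },
  { 'address' => '0xholder', 'type' => 'OBJECT', 'references' => ['0xenv'] },
  { 'address' => '0xenv', 'type' => 'OBJECT', 'references' => [] },
  { 'address' => '0xother', 'type' => 'OBJECT', 'references' => [] }
]
p retainers(records, '0xenv').to_a
```

An object that shows up as a retainer but has no retainers of its own inside the walked set is a candidate "root holding on to a cluster", which is where the next walk would start.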

Once you have got this far, you know the class(es) involved. You may also
want to figure out where it was allocated, and you can do that by using
grep on heap.json - say:

grep 7f9afa20ba38 heap.json

which will print out the information about this allocation (among other
things, it shows the file and line where it was allocated, a list of the
objects it references, and the address (in hex) of its class object).

This allows you to manually grep / walk the heap to find more details.
(Or continue hacking on the memwalk rake script to make it do what you want.)
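If plain grep becomes unwieldy, a small Ruby helper can do the same lookup while ignoring lines that merely reference the address. This is a sketch assuming heap.json holds one JSON object per line with a "0x"-prefixed "address" field, as objspace dumps do; find_record is a hypothetical name.

```ruby
require 'json'

# Returns the parsed record whose own "address" matches (with or without
# the 0x prefix), or nil. Lines that only reference the address are skipped.
def find_record(lines, address)
  wanted = address.start_with?('0x') ? address : "0x#{address}"
  lines.each do |line|
    next unless line.include?(wanted) # cheap pre-filter, like grep
    record = JSON.parse(line.scrub)
    return record if record['address'] == wanted
  end
  nil
end

# With a real dump: find_record(File.foreach('heap.json'), '7f9afa20ba38')
lines = [
  '{"address":"0x2", "type":"OBJECT", "references":["0x1"]}',
  '{"address":"0x1", "type":"OBJECT", "class":"0x9", "file":"a.rb", "line":3}'
]
record = find_record(lines, '1')
puts "#{record['file']}:#{record['line']} class=#{record['class']}"
```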

Hope the above is of help to someone having to track down a memory leak
in the future...

Regards
- henrik


--

Visit my Blog "Puppet on the Edge"
http://puppet-on-the-edge.blogspot.se/

Ben Ford

Oct 14, 2014, 11:36:26 AM
to puppe...@googlegroups.com
I have to admit that this email made me feel a little bit dumb. Could you provide a TL;DR summary that at least provides a little context for this? Is this something that people writing types, functions, hiera backends, or report processors need to concern themselves with?

--
You received this message because you are subscribed to the Google Groups "Puppet Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to puppet-dev+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-dev/m1hld0%24mfc%241%40ger.gmane.org.
For more options, visit https://groups.google.com/d/optout.



--
Ben Ford | Training Solutions Engineer 
Puppet Labs, Inc.
926 NW 13th Ave, Suite #210
Portland, OR 97209

509.592.7291
ben....@puppetlabs.com

Henrik Lindberg

Oct 14, 2014, 11:57:32 AM
to puppe...@googlegroups.com
On 2014-14-10 17:36, Ben Ford wrote:
> I have to admit that this email made me feel a little bit dumb. Could
> you provide a TL;DR summary that at least provides a little context for
> this? Is this something that people writing types, functions, hiera
> backends, or report processors need to concern themselves with?
>

Sorry about that - this mostly matters if you are contributing to Puppet
itself, or if you find that your Ruby implementation (of whatever)
leaks memory and you need to find the cause.

If you are following the well-beaten path when writing types and
providers etc., you need not worry. If you are writing caching in any
form, you are potentially causing memory leaks, and the tips here apply.

Does that help?

- henrik


Ben Ford

Oct 14, 2014, 1:54:11 PM
to puppe...@googlegroups.com
On Tue, Oct 14, 2014 at 8:57 AM, Henrik Lindberg <henrik....@cloudsmith.com> wrote:
> Does that help?


Thanks for the clarifications!

Jeff McCune

Oct 14, 2014, 3:17:47 PM
to puppe...@googlegroups.com
On Tue, Oct 14, 2014 at 11:57 AM, Henrik Lindberg <henrik....@cloudsmith.com> wrote:
> If you are writing caching in any form you are potentially causing memory leaks and the tips here apply.

So, if I understand correctly, the way caching was implemented, by using a full resource type object as the Hash key in a class variable, is what caused the leak? Specifically, the resource type would never be cleaned up by the garbage collector because it's a hash key in a class variable, correct?


-Jeff

Henrik Lindberg

Oct 14, 2014, 8:58:48 PM
to puppe...@googlegroups.com
On 2014-14-10 21:17, Jeff McCune wrote:
Yes, two problems:

1. The object used as the key does not provide hash and equality methods,
and thus the cache would grow because of a lack of cache hits unless the
same instance was used as a key again (which it isn't in a new instance
of an environment).
Had equality and hash been correctly implemented on the type, we would
possibly see a slower leak, since some environments would not introduce
new keys. (We would have other issues, though.)

2. The class-level variable (@@nondeprecating_type) holds on to every
type seen and is not cleared when an environment expires. When an
environment expires and the cache still holds on to a type, a very large
set of objects stays bound in memory, since there are circular references
from that type to almost everything loaded in the environment
(via known_resource_types). Hence my analogy of "you think you are
holding on to a banana, but it comes with a Gorilla and its Jungle".

- henrik



Jeff McCune

Oct 14, 2014, 9:08:07 PM
to puppe...@googlegroups.com
On Tue, Oct 14, 2014 at 8:58 PM, Henrik Lindberg <henrik....@cloudsmith.com> wrote:
> Yes, two problems:

Great, thanks for taking the time to explain.  This is really valuable information.

-Jeff