puppetdb memory use

jdt

Jan 6, 2015, 2:35:04 AM
to puppet...@googlegroups.com
Hello,

I recently upgraded to PuppetDB 2.2.0 from 1.6.3.
When I was running PuppetDB 1.6.3 with Puppet 3.4.3 I could do with 256 MB of heap space for PuppetDB. After the upgrade I first increased it to 512 MB and then to 768 MB, but I kept getting Out of Memory errors, so now I've increased it to 1 GB.
When looking at the dashboard it seems like memory is slowly increasing.
I'm running Puppet v3.6.2 with PuppetDB 2.2.0 on CentOS 6.x with Ruby 1.8.7. I have about 850 active puppet agents.

Has anyone seen the same behaviour? How can I find out what, if anything, is wrong?

Best Regards,
Jo

kaustubh chaudhari

Jan 6, 2015, 5:26:40 AM
to puppet...@googlegroups.com
Did you upgrade OpenJDK to the latest version?

I did the same upgrade; for me things are running smoothly.

-Kaustubh

jdt

Jan 6, 2015, 7:19:56 AM
to puppet...@googlegroups.com
I'm running java-1.7.0-openjdk-1.7.0.65-2.5.1.2.el6_5, which might not be the very latest. I'm on CentOS 6.5.

The heap space usage is pretty close to 1 GB when I look at it via the PuppetDB dashboard. I'm managing over 100,000 resources.

Jo


On Tuesday, January 6, 2015 at 11:26:40 AM UTC+1, kaustubh chaudhari wrote:

kaustubh chaudhari

Jan 6, 2015, 8:14:24 AM
to puppet...@googlegroups.com
Hey,

I noticed that you mentioned you are managing 850 nodes and are using 1024 MB for the Java heap size. As per the recommendation this seems to be OK.

As per the documentation you need to allocate 128 MB plus 1 MB per node, i.e. 128 + 850 = 978 MB as the minimum requirement. Scaling also depends on how many catalog runs you have, etc.

Performance might become an issue if you have too few CPU cores.

So the memory usage close to 1024 MB that you are seeing is OK; my recommendation is to increase the heap further to stay away from OOM.
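
For reference, here is a minimal sketch of how the heap is usually raised on an EL package install; the file path and service name are assumptions based on the CentOS packaging, so check your own layout:

# /etc/sysconfig/puppetdb -- the heap is set through JAVA_ARGS
JAVA_ARGS="-Xmx1024m"
# restart to pick up the change:
service puppetdb restart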

-HTH
Kaustubh

kaustubh chaudhari

Jan 6, 2015, 8:15:30 AM
to puppet...@googlegroups.com

jdt

Jan 13, 2015, 4:12:36 AM
to puppet...@googlegroups.com
Hello,

In the meantime I've added RAM and extended the heap to 2 GB, but I'm still getting crashes of PuppetDB.
Last time it was the kernel OOM killer that killed the Java process, as I saw in /var/log/messages:
kernel: Out of memory: Kill process 10146 (java) score 158 or sacrifice child
kernel: Killed process 10146, UID 498, (java) total-vm:3649108kB, anon-rss:1821580kB, file-rss:72kB

Does this mean I need to add additional heap space and/or RAM?
I'm looking at the dashboard and it's as if the heap slowly increases. Right after startup it's 300 or 400 MB; after a day or so I'm over 1 GB ...

I have no clue how to find out what exactly is wrong. When I was running PuppetDB 1.6 I could do with 256 MB of heap space.
Does anyone have an idea how to investigate this?

Best Regards,
Jo


On Tuesday, January 6, 2015 at 2:15:30 PM UTC+1, kaustubh chaudhari wrote:

Ken Barber

Jan 13, 2015, 9:39:10 AM
to Puppet Users
> in the meantime I've added RAM and extended the heap to 2GB. But still I'm
> getting crashes of PuppetDB.
> Last time it was the kernel OOM that killed the java process as I saw in
> /var/log/messages
> kernel: Out of memory: Kill process 10146 (java) score 158 or sacrifice
> child
> kernel: Killed process 10146, UID 498, (java) total-vm:3649108kB,
> anon-rss:1821580kB, file-rss:72kB

This kind of crash is usually more to do with the tuning of your Linux
instance. The OOM killer formula is somewhat tricky, but as a general
rule it takes into account the amount of RAM + swap (which most people
don't expect). So if your swap is zero, or very low in relation to
your memory, you may find the OOM killer is killing processes before
your RAM fills up.

The thing you want to research is overcommit_ratio, or your swap
allocation. There are lots of articles online about this.

As a general rule, if you're running with low swap you need the
overcommit_ratio to be higher. By default it's ordinarily set to 50%
of the total virtual memory, I think, so if a process tries to
allocate memory and you've exceeded 50% of your overall RAM + swap,
the OOM killer will kick in.

Here's an example from one of my instances, so you can see how to analyze this:

root@puppetdb1:~# free
             total       used       free     shared    buffers     cached
Mem:       2054120     975464    1078656          0     169772     219876
-/+ buffers/cache:      585816    1468304
Swap:       892924          0     892924
root@puppetdb1:~# cat /proc/sys/vm/overcommit_ratio
50
root@puppetdb1:~#

So in my case, the total virtual memory available is 2.8 GB (2 GB +
800 MB swap), and if a process tries to allocate more than 50% of it
(1.4 GB), the OOM killer might kick in.

I'm obviously trivialising the whole story for brevity; the OOM killer
has a few little quirks that might affect all of this (and quite a few
independent tunables), but more often than not, if you think you have
enough RAM but the OOM killer is still killing your processes, it's
somewhere between your swap and overcommit_ratio. I've seen this a bit
in virtualised environments, and in places that try to launch instances
with zero swap, as an example.
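
If you want to check the relevant tunables yourself, something like this is a reasonable starting point (these are the standard Linux sysctls; the value in the last line is only an example, not a recommendation):

cat /proc/sys/vm/overcommit_memory   # 0 = heuristic overcommit (the default), 2 = strict accounting
cat /proc/sys/vm/overcommit_ratio    # only consulted when overcommit_memory is 2
free -m                              # total RAM and swap, in MB
sysctl -w vm.overcommit_ratio=80     # runtime change only; persist it via /etc/sysctl.conf if it helps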

> Does this mean I need to add additional heap space and/or RAM?
> I'm looking at the dashboard and it's as if the heap slowly increases. Right
> after startup it's 3 of 400 MB, after a day or 1 I'm over 1 GB ....

If it's not crashing in the JVM any more, I'd focus on the tuning issue
above. If the Java instance is still crashing you can try increasing
the heap to see where the stable point is. If it starts to get stupid
and you still don't know why, it's possible to analyze the heap dumps
that get left behind for any clues, but about 90% of the time I find
it's a large catalog causing it, so I'd focus there first.
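
If you do end up digging into the heap itself, here is a rough sketch (the tools and flags are standard OpenJDK 7 ones, but the pgrep match and the dump path are just examples for your setup):

PID=$(pgrep -f puppetdb)            # assumes a single PuppetDB java process
jmap -histo:live "$PID" | head -30  # object histogram after a forced GC; huge byte[]/char[] counts hint at bloated commands
# To get a dump written automatically on OOM, add something like this to JAVA_ARGS
# and open the resulting .hprof in Eclipse MAT or jhat:
#   -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/puppetdb/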

Heap memory can naturally fluctuate over time ... and yes, sometimes it
can increase, but usually it should be garbage collected and go up and
down, so it depends on what you are monitoring exactly, and where you
got the number, I guess.

> I have no clue in how to find out what exactly is wrong. When I was running
> PuppetDB 1.6 I could do with 256MB heap space.
> Does anyone have an idea how to investigate what's wrong?

Well, it shouldn't have gotten any worse in the later versions afaik.
What version are you running now?

So here is the common memory-bloat story as I know it.

During ingestion of new content, PuppetDB has to hold the new
catalogs/facts/reports in memory for a short period of time as it
decodes the JSON and stores it in the internal queue ... at that point
there are really two copies floating around: one is JSON, the other is
the internal Clojure data structure. After that there are command
listener threads that process these 'commands', storing them in the
database. Sometimes a very large catalog can cause a problem with
memory bloat, and if you happen to receive more than one at a time, it
can be much worse. At the same time, a very large factset or report
can also cause issues, for example if you are storing a lot of
information in facts.

But more often than not, it's to do with a large catalog, and while
most catalogs are sane there are a few cases that bloat them. A
combinatorial edge problem can often cause this, so doing something
like:

File<||>->Package<||>

(whereby we are trying to create a relationship between all file &
package resources, as an example here)

can cause many-to-many edges to be created, thus bloating the catalog
size. This is because that kind of graph will have many edges
reflected in its catalog. So trying to locate a large catalog might be
useful. Not to mention, such a catalog would cause slower compilation
times on the master as well :-). Be mindful that we can receive any
catalog at any time without throttling (while we have free threads),
so if we get a few at a time, it could cause memory bloat. We also
have N backend command processors ordinarily listening to the internal
queue, and if all N are processing large catalogs, that can cause
bloat and a potential crash as well. The formula for N is usually half
the number of CPU cores/hyperthreads you have available.
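
The number of command processors is tunable if you want to bound that concurrency; as a sketch (the section and setting names are from the PuppetDB configuration docs as I remember them, so verify against your version):

# /etc/puppetdb/conf.d/config.ini (default path for the EL packages)
[command-processing]
threads = 2   # defaults to roughly half your CPU cores; lowering it caps concurrent catalog processing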

Ordinarily the approach is:

* Find out if you have large catalogs (a query sketch follows below),
and see if there is something going on in the manifest to explain it
* Increase your RAM until it stabilises with your load and expected
concurrent catalog sizes
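
To get a feel for which nodes have unusually large catalogs, something like the following can work against the query API; the endpoint paths, the "certname" field and the default port 8080 are assumptions based on the PuppetDB 2.x v4 API, so adjust as needed:

for node in $(curl -s 'http://localhost:8080/v4/nodes' | python -c 'import json,sys; print "\n".join(n["certname"] for n in json.load(sys.stdin))'); do
  size=$(curl -s "http://localhost:8080/v4/catalogs/$node" | wc -c)
  echo "$size $node"
done | sort -rn | head    # largest catalogs (by raw JSON size) first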

We hope to fix this in the future. For the HTTP reception we could
reduce this potential by streaming inbound catalogs onto our internal
queue, instead of having to hold the entire thing in RAM and decode
it, but that may require us to switch out our internal ActiveMQ, since
it doesn't support streaming afaik. Not to mention, it makes it hard
to validate inbound commands :-). On the query side we already stream
results; it's just that during the POST to PuppetDB we have little
ability to stream today. Another solution might be to use a smaller
format like msgpack: even though we support gzipped commands via HTTP,
they get unpacked in memory, whereas a msgpack/CBOR packet would
consume less RAM while processing. This only reduces the problem
however; with a big enough catalog even a more compact serialization
will still cause issues (not to mention we still decode it in RAM
today, so that copy still consumes the same RAM).

Another solution we have been looking into is to fail early if we
receive a large catalog, saving the whole instance from crashing.
Alternatively we could break the catalog up into multiple pieces, but
that requires a more complicated process, so it is less viable.

ken.