Urgent Help Required | Puppet run is dead slow


Harish Kothuri

Dec 12, 2016, 11:34:35 AM
to Puppet Users
Hi,

I have a Puppet master v3.8.7 with 300+ nodes, and everything ran fine until last week.

All the agents have been running very slowly since last week, and sometimes they complain that the puppet master is under heavy load.


Here are my master's configuration details:
Puppet : 3.8.7
CentOS: 6.7 Final
RAM : 32GB

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz
stepping        : 2
microcode       : 26
cpu MHz         : 3457.999
cache size      : 12288 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes hypervisor lahf_lm ida arat epb dtherm
bogomips        : 6915.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 1 (identical to processor 0, except core id 1 / apicid 1)

Error Details:

Error 1:
Failed to generate additional resources using 'eval_generate': Error 503 on SERVER: <h1>This website is under heavy load</h1><p>We're sorry, too many people are accessing this website at the same time. We're working on this problem. Please try again later.</p>

Error 2:
Could not retrieve catalog from remote server: Error 400 on SERVER: Failed to submit 'replace facts' command for node.domain.com to PuppetDB at puppetmaster.domain.com:8081: Connection refused - connect(2)

I'm also attaching the puppet agent log with --debug enabled.

Kindly help.



puppetlog.log

Dirk Heinrichs

Dec 12, 2016, 11:44:10 AM
to puppet...@googlegroups.com
On 12.12.2016 at 17:34, Harish Kothuri wrote:

I have a Puppet master v3.8.7 with 300+ nodes, and everything ran fine until last week.

All the agents have been running very slowly since last week, and sometimes they complain that the puppet master is under heavy load.

Two possible reasons come to my mind:
  1. Memory leak, causing the machine to start swapping
  2. Thundering herd problem: too many agents accessing the server at the same time, perhaps also trying to download something large from the built-in fileserver.
If 1: Try restarting Puppet server processes.
If 2: Try adding some splay to the agent configuration:
    splay = true
    splaylimit = 2m
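
For reference, those settings belong in the agent's puppet.conf (path assumed to be /etc/puppet/puppet.conf on a 3.x agent); a minimal sketch:

```ini
# /etc/puppet/puppet.conf (agent side) -- sketch; path and values are examples
[agent]
  # Wait a random interval (up to splaylimit) before each run,
  # so 300+ agents don't all hit the master at the same moment.
  splay = true
  splaylimit = 2m
```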

HTH...

    Dirk
--
Dirk Heinrichs | Senior Systems Engineer, Delivery Pipeline
Tel: +49 2226 159666 (Ansage) 1149
Email: dirk.he...@recommind.com
Skype: dirk.heinrichs.recommind

Recommind GmbH, Von-Liebig-Straße 1, 53359 Rheinbach


Harish Kothuri

Dec 12, 2016, 12:16:48 PM
to Puppet Users, dirk.he...@recommind.com
Thanks a lot for your quick reply.

1. I'm not sure it's a memory leak, because we have already restarted the services and rebooted the master machine. (We have also increased RAM from 16GB to 32GB and gone from 1 core to 2 cores.)
2. Splay time is also set to 180 seconds.

Any other pointers would be appreciated.

-Harish

Kevin Corcoran

Dec 12, 2016, 12:34:42 PM
to puppet...@googlegroups.com, harish...@gmail.com
On Mon, Dec 12, 2016 at 9:16 AM, Harish Kothuri <harish...@gmail.com> wrote:
Thanks a lot for your quick reply.

1. I'm not sure it's a memory leak, because we have already restarted the services and rebooted the master machine. (We have also increased RAM from 16GB to 32GB and gone from 1 core to 2 cores.)


300+ nodes and 1 core (or even 2) sounds like a bad time.  With so few cores, I wouldn't expect the RAM increase to improve things at all.  I suggest increasing CPUs to 4 or 8.

Matthaus Owens

Dec 12, 2016, 12:53:26 PM
to Puppet Users
That "Connection refused" error indicates that PuppetDB may be in the process of restarting (or isn't handling requests for some other reason). It would be good to know what is happening to PuppetDB. Is there anything in PuppetDB's logs that indicates what is going on?
 


Harish Kothuri

Dec 12, 2016, 1:03:57 PM
to Puppet Users, harish...@gmail.com
Thanks Kevin,

I don't see even the existing 2 cores being utilized 100%, so I'm not sure increasing CPU will help. (Attached the CPU usage graph of the server.)
CPUUSAGE.png

Harish Kothuri

Dec 12, 2016, 1:09:50 PM
to Puppet Users
I see the following errors popping up from PuppetDB. There are quite a few of these, and I would also like to know which machines they are coming from.

2016-12-12 10:08:16,040 ERROR [c.p.p.command] [afe425be-8413-49a8-8086-79ebaa00927a] [store report] Retrying after attempt 13, due to: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "reports_pkey"
org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "reports_pkey"

Rob Nelson

Dec 12, 2016, 4:30:37 PM
to puppet...@googlegroups.com
Harish,

I don't have any direct insight into your performance woes, but have you investigated what changes were made in the past 7 days, around the time your issues began? In my experience, such issues are rarely spontaneous; they are usually the result of a deliberate change somewhere in the system.

--
Rob Nelson

Wyatt Alt

Dec 12, 2016, 7:41:41 PM
to puppet...@googlegroups.com



On 12/12/2016 10:09 AM, Harish Kothuri wrote:
I see the following errors popping up from PuppetDB. There are quite a few of these, and I would also like to know which machines they are coming from.

2016-12-12 10:08:16,040 ERROR [c.p.p.command] [afe425be-8413-49a8-8086-79ebaa00927a] [store report] Retrying after attempt 13, due to: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "reports_pkey"
org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "reports_pkey"

This error occurs when a report is submitted to PuppetDB twice: the first submission is stored, and the second raises this error. The conflict is on the hash of the report, which incorporates certname and creation time, so these errors essentially only arise when the same report is submitted twice.
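
As a toy illustration of why a double submission collides (the real hash PuppetDB computes covers more of the report than this; certname and start time here are just stand-ins):

```python
import hashlib

# Toy stand-in for PuppetDB's report hash: the same report submitted
# twice produces the same key, so the second INSERT violates the
# reports_pkey unique constraint.
def report_key(certname: str, start_time: str) -> str:
    return hashlib.sha1(f"{certname}|{start_time}".encode()).hexdigest()

stored = set()

def store_report(certname, start_time):
    key = report_key(certname, start_time)
    if key in stored:
        raise ValueError('duplicate key value violates unique constraint "reports_pkey"')
    stored.add(key)

store_report("node.domain.com", "2016-12-12T10:08:16Z")      # first copy: stored
try:
    store_report("node.domain.com", "2016-12-12T10:08:16Z")  # second copy: conflict
except ValueError as e:
    print("conflict:", e)
```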

We've seen these caused by puppetdb imports on nonempty databases, or sometimes sporadically due to a bug in shutdown coordination between PuppetDB and ActiveMQ (these have gotten much better since your version; in fact, AMQ is gone in recent releases). It's not typically a cause for concern and is, in my opinion, unlikely to be the root of your issue, though it could be a symptom of PuppetDB restarts, as Matthaus said.

At this point if I were you I'd look in the PuppetDB logs to try and correlate a restart with the connection failure you mentioned. In particular, keep an eye out for places where a startup is logged but the preceding shutdown is not, which generally indicates that an OOM occurred. Depending on how your service is configured you may have better ways of detecting this, such as a puppetdb-daemon.log file in the puppetdb log directory.
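
A rough sketch of that check, scanning log lines for a startup that wasn't preceded by a clean shutdown; the marker strings below are assumptions, so adjust them to whatever your PuppetDB version actually logs:

```python
# Flag PuppetDB startups not preceded by a clean shutdown, which
# usually means the previous process died (e.g. was OOM-killed).
START_MARKER = "PuppetDB version"   # logged at startup (assumed)
STOP_MARKER = "Shutdown complete"   # logged at clean shutdown (assumed)

def unclean_restarts(lines):
    suspicious = []
    clean = True  # treat the very first startup as fine
    for line in lines:
        if START_MARKER in line:
            if not clean:
                suspicious.append(line)
            clean = False
        elif STOP_MARKER in line:
            clean = True
    return suspicious

log = [
    "2016-12-10 01:00:00 INFO  PuppetDB version 2.3.8",
    "2016-12-11 02:00:00 INFO  Shutdown complete",
    "2016-12-11 02:00:05 INFO  PuppetDB version 2.3.8",
    "2016-12-12 10:07:50 INFO  PuppetDB version 2.3.8",  # no shutdown before it
]
print(unclean_restarts(log))
```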

As for knowing which machines the errors originated from, the messages that prompt the failures will become available in your dead letter office after 16 retries, and will contain the certnames. You can read more about that here: https://docs.puppet.com/puppetdb/latest/maintain_and_tune.html#clean-up-the-dead-letter-office
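
Once messages land there, a small script can pull the certnames out; this sketch assumes each discarded message is a JSON command whose payload carries a "certname" key, and the DLO path in the example call is also an assumption:

```python
# Sketch: collect certnames from discarded command files in the
# dead letter office. Non-JSON files (metadata, attempt logs) are skipped.
import json
from pathlib import Path

def certnames_in_dlo(dlo_dir):
    names = set()
    for path in Path(dlo_dir).rglob("*"):
        if not path.is_file():
            continue
        try:
            command = json.loads(path.read_text())
        except (ValueError, UnicodeDecodeError):
            continue  # not a JSON command payload
        payload = command.get("payload", {})
        if isinstance(payload, dict) and "certname" in payload:
            names.add(payload["certname"])
    return sorted(names)

# Example (path is an assumption -- check your puppetdb config):
# certnames_in_dlo("/var/lib/puppetdb/mq/discarded")
```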

Wyatt

Harish Kothuri

Dec 13, 2016, 11:57:41 PM
to Puppet Users
Thank you all. Will take a look at all the errors and update.

Ramin K

Dec 14, 2016, 12:30:19 AM
to puppet...@googlegroups.com
On 12/12/2016 8:34 AM, Harish Kothuri wrote:
>
> Also, attaching the puppet agent log with --debug enabled.
>
> Kindly help.

What we really need is the Puppet master log, since your problems are on that end.

Error 1 is Passenger complaining that there are no free processes to hand the request off to, and that you've filled Passenger's queue as well.

Based on error 2, I would guess PuppetDB is blocking Puppet from serving requests.

Regarding general tuning of 3.x Puppet masters, see the following link, though this really shouldn't be a problem with a few hundred nodes unless they all check in at the same time.

https://ask.puppet.com/question/13433/how-should-i-tune-passenger-to-run-puppet/
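
For what it's worth, the knobs that usually matter on a Passenger-based 3.x master look something like this in the Apache vhost (the values are illustrative, not recommendations; size the pool to your actual core count):

```apache
# Apache vhost fragment for a Passenger-based Puppet master -- sketch only
PassengerHighPerformance on
PassengerMaxPoolSize 4          # max concurrent master processes; scale with cores
PassengerPoolIdleTime 600       # keep workers warm between agent runs
PassengerMaxRequests 1000       # recycle workers to contain memory growth
```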

Ramin