PuppetDB intermittent crashing


Jonathan Gazeley

Apr 28, 2014, 7:07:19 AM4/28/14
to puppet...@googlegroups.com
Hi folks,

I'm using Puppet with PuppetDB running on the same machine, but with
the Postgresql database on an external server. Several times a day,
PuppetDB seems to crash with errors like the one below. Nodes are then
unable to check in, although Puppet restarts its own PuppetDB service
on its 30-minute runs, so the problem sort of fixes itself after a
while.

I don't really know much about the internals of PuppetDB, so can anyone
shed any light on the possible cause of these crashes, and what I can do
to mitigate them?

Thanks,
Jonathan

# puppetdb.log

2014-04-28 09:46:39,535 ERROR [clojure-agent-send-off-pool-15] [http.resources] Error streaming response
org.postgresql.util.PSQLException: This connection has been closed.
	at org.postgresql.jdbc2.AbstractJdbc2Connection.checkClosed(AbstractJdbc2Connection.java:822)
	at org.postgresql.jdbc2.AbstractJdbc2Connection.setAutoCommit(AbstractJdbc2Connection.java:769)
	at com.jolbox.bonecp.ConnectionHandle.setAutoCommit(ConnectionHandle.java:1063)
	at clojure.java.jdbc.internal$transaction_STAR_.invoke(internal.clj:222)
	at com.puppetlabs.jdbc$with_transacted_connection_fn$fn__2278.invoke(jdbc.clj:228)
	at clojure.java.jdbc.internal$with_connection_STAR_.invoke(internal.clj:186)
	at com.puppetlabs.jdbc$with_transacted_connection_fn.invoke(jdbc.clj:225)
	at com.puppetlabs.puppetdb.http.resources$produce_body$fn__7017$fn__7020.invoke(resources.clj:36)
	at ring.util.io$piped_input_stream$fn__2512.invoke(io.clj:22)
	at clojure.core$binding_conveyor_fn$fn__4107.invoke(core.clj:1836)
	at clojure.lang.AFn.call(AFn.java:18)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)

Ken Barber

Apr 28, 2014, 8:39:16 AM4/28/14
to Puppet Users
> I'm using Puppet with PuppetDB running on the same machine, but with the
> Postgresql database on an external server. Several times a day, PuppetDB
> seems to crash with errors like the one below. Nodes are then unable to
> check in, although Puppet will restart its own PuppetDB service on its
> 30-minute runs so the problem sort of fixes itself after a bit.

Are you doing this to work around this issue, or for other reasons? It
just seems a bit odd to do during normal operation.

> I don't really know much about the internals of PuppetDB, so can anyone shed
> any light on the possible cause of these crashes, and what I can do to
> mitigate them?

I'd be interested to see what is happening in your postgresql.log when
this occurs. Connections must be closing for some specific reason, and
I'd hope that the postgresql instance can shed some light on this.

You're not trying to connect to the database through a firewall,
load-balancer or some other device, are you? Even when the device is
"in the way" but meant to be passive, I've still seen issues
(especially with badly configured F5s or Checkpoint firewalls).

> # puppetdb.log
>
> 2014-04-28 09:46:39,535 ERROR [clojure-agent-send-off-pool-15]
> [http.resources] Error streaming response
> org.postgresql.util.PSQLException: This connection has been closed.
> at org.postgresql.jdbc2.AbstractJdbc2Connection.checkClosed(AbstractJdbc2Connection.java:822)
> at org.postgresql.jdbc2.AbstractJdbc2Connection.setAutoCommit(AbstractJdbc2Connection.java:769)
> at com.jolbox.bonecp.ConnectionHandle.setAutoCommit(ConnectionHandle.java:1063)
> at clojure.java.jdbc.internal$transaction_STAR_.invoke(internal.clj:222)
> at com.puppetlabs.jdbc$with_transacted_connection_fn$fn__2278.invoke(jdbc.clj:228)
> at clojure.java.jdbc.internal$with_connection_STAR_.invoke(internal.clj:186)
> at com.puppetlabs.jdbc$with_transacted_connection_fn.invoke(jdbc.clj:225)
> at com.puppetlabs.puppetdb.http.resources$produce_body$fn__7017$fn__7020.invoke(resources.clj:36)
> at ring.util.io$piped_input_stream$fn__2512.invoke(io.clj:22)
> at clojure.core$binding_conveyor_fn$fn__4107.invoke(core.clj:1836)
> at clojure.lang.AFn.call(AFn.java:18)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)

There isn't much detail from the JDBC postgresql client, really; it
looks like a connection we retrieved from the pool was already closed.
This doesn't seem to have happened during an actual transaction,
because I can see that the attempt to disable autoCommit triggered the
exception, which is something we normally do very early in a
transaction.

This is symptomatic of database connections being timed out, often by
network devices or some other timeout policy.

There is a setting here where we can adjust the keep-alive interval:
http://docs.puppetlabs.com/puppetdb/master/configure.html#conn-keep-alive.
We default it to 45 minutes in 1.6.3 (you are running the latest
PuppetDB, right?), but some firewalls and devices can have a shorter
TTL, like 30 minutes; I've even seen as little as 5 minutes in
aggressively configured devices. For kicks you might want to set this
relatively low for a period of time to see if you get any improvement.
This might not be the actual problem, however; there are other
timeouts that could cause this, but it's the most obscure fault to
track down, so it's probably worth testing early.
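For reference, that setting lives in the [database] section of
database.ini. A minimal sketch (the hostname and credentials below are
placeholders, and 1 minute is only an aggressive test value, not a
production recommendation):

```ini
# /etc/puppetdb/conf.d/database.ini (fragment)
[database]
classname   = org.postgresql.Driver
subprotocol = postgresql
subname     = //pg.example.com:5432/puppetdb
username    = puppetdb
password    = changeme

# Interval, in minutes, at which idle pooled connections are kept alive.
# PuppetDB 1.6.3 defaults to 45; set it low only while diagnosing timeouts.
conn-keep-alive = 1
```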

Again I'd be curious to see what the Postgresql server says about
this. We might need to bump up the logging though to get anything we
can use.

Is this the only exception thrown, by the way? Anything else? Can you
show us the rest of the log _around_ the exception?

Can you show us the results of a 'show all' on the database as well?
Like I've done here:

[ken@kb puppetdb]# psql puppetdb
psql (9.3.4)
Type "help" for help.

puppetdb=# show all;
                name                 |  setting   |                           description
-------------------------------------+------------+------------------------------------------------------------------
 allow_system_table_mods             | off        | Allows modifications of the structure of system tables.
 application_name                    | psql       | Sets the application name to be reported in statistics and logs.
 archive_command                     | (disabled) | Sets the shell command that will be called to archive a WAL file.
 archive_mode                        | off        | Allows archiving of WAL files using archive_command.
 archive_timeout                     | 0          | Forces a switch to the next xlog file if a new file has not been started within N seconds.
 array_nulls                         | on         | Enable input of NULL elements in arrays.
 authentication_timeout              | 1min       | Sets the maximum allowed time to complete client authentication.
...
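If the full `show all` output is unwieldy, a narrower query against the
standard pg_settings catalog covers the parts most relevant here (just
a sketch, run from the same psql session):

```sql
-- List only settings whose names mention timeouts or keepalives
SELECT name, setting, unit
FROM pg_settings
WHERE name LIKE '%timeout%' OR name LIKE '%keepalive%'
ORDER BY name;
```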

Also, if you can provide everything in your
/etc/puppetdb/conf.d/database.ini (obviously hide your private details
first), that might be helpful.

ken.

Jonathan Gazeley

Apr 28, 2014, 9:22:08 AM4/28/14
to puppet...@googlegroups.com
On 28/04/14 13:39, Ken Barber wrote:
>> I'm using Puppet with PuppetDB running on the same machine, but with the
>> Postgresql database on an external server. Several times a day, PuppetDB
>> seems to crash with errors like the one below. Nodes are then unable to
>> check in, although Puppet will restart its own PuppetDB service on its
>> 30-minute runs so the problem sort of fixes itself after a bit.
> Are you doing this to workaround this issue or for other reasons? Just
> seems a bit weird to do during normal operation.
No, this is just

service { 'puppetdb': ensure => running }

so when Puppet sees that PuppetDB has stopped, it restarts it.
Restarting by hand also works.
>
>> I don't really know much about the internals of PuppetDB, so can anyone shed
>> any light on the possible cause of these crashes, and what I can do to
>> mitigate them?
> I'd be interested to see what is happening your postgresql.log when
> this occurs. Connections must be closing for some specific reason, and
> I'd hope that the postgresql instance can shed some light on this.
I don't run the Postgresql server so access to the logs might take some
time (the guy who used to run it has now left...)
>
> You're not trying to connect to the database through a firewall,
> load-balancer or some other device are you? Even if the device is "in
> the way" but meant to be passive, I've still seen issues (especially
> with F5's or checkpoint firewalls that were badly configured).

The Puppetmaster and the Postgresql server are on different subnets
but are not firewalled from each other. No crazy load-balancing here
(yet). However, access between the two servers happens several times a
minute, as there are so many nodes, so a timeout due to inactivity
seems very unlikely.
>
> There is a setting here where we can adjust the keep alive interval:
> http://docs.puppetlabs.com/puppetdb/master/configure.html#conn-keep-alive.
> We default it to 45 minutes for 1.6.3 (you are running the latest
> PuppetDB right?) ... but some firewalls and devices can have a shorter
> TTL, like 30 minutes ... I've even seen as little as 5 minutes in
> aggressively configured devices. For kicks you might want to set this
> relatively low for a period of time to see if you get any
> improvements. This might not be the actual problem however, there are
> other timeouts that might cause this but its the most obscure fault to
> track down so probably worth testing early.
Thanks, good to know. We are using a very basic database.ini (attached)
and we are indeed using the default keepalive of 45 minutes.
>
> Again I'd be curious to see what the Postgresql server says about
> this. We might need to bump up the logging though to get anything we
> can use.
>
> Is this the only exception thrown btw? Anything else? Can you show us
> the rest of the log _around_ the exception?
Attached today's puppetdb.log. There are several instances of the same
error but other than that, everything seems normal.
>
> Can you show us the results of a 'show all' on the database as well?
> Like I've done here:
>
Attached.
>
> Also if you can provide everything in your
> /etc/puppetdb/conf.d/database.ini (obviously hide your private details
> first) that might be helpful.

Attached.
>
> ken.
>

database.ini
puppetdb.log
showall.log

Ken Barber

Apr 28, 2014, 10:33:21 AM4/28/14
to Puppet Users
>>> I don't really know much about the internals of PuppetDB, so can anyone
>>> shed
>>> any light on the possible cause of these crashes, and what I can do to
>>> mitigate them?
>>
>> I'd be interested to see what is happening your postgresql.log when
>> this occurs. Connections must be closing for some specific reason, and
>> I'd hope that the postgresql instance can shed some light on this.
>
> I don't run the Postgresql server so access to the logs might take some time
> (the guy who used to run it has now left...)

That is disconcerting. I think we need to see those, however, since
the client in this case is blind to the issue: all it sees is a
connection that has been closed underneath it for some reason.

>> You're not trying to connect to the database through a firewall,
>> load-balancer or some other device are you? Even if the device is "in
>> the way" but meant to be passive, I've still seen issues (especially
>> with F5's or checkpoint firewalls that were badly configured).
>
>
> The Puppetmaster and the Postgresql server are on different subnets but are
> not firewalled from each other. No crazy load-balancing here (yet).

Well, still, I don't know your setup, and unless you run the devices
in the path you can never be entirely sure.

Also, while I believe the connection pooling software should be
round-robining all its available connections, there is a possibility
that an older stale connection was retrieved. I guess the fact that
this is restarting every 30 minutes makes that unlikely.

Regardless, I'd recommend adjusting the keepalive setting to something
aggressive like 1 minute and seeing what happens. The very fact that
they live on different subnets means traffic is passing through a
router of some kind, and even routers can do strange things to
packets.

At the very least it will rule out connection timeouts.

>> Is this the only exception thrown btw? Anything else? Can you show us
>> the rest of the log _around_ the exception?
>
> Attached today's puppetdb.log. There are several instances of the same error
> but other than that, everything seems normal.

Yeah, nothing obviously new to learn from this :-(. At least not yet anyway.

We need to see the other end and what it is doing to connections. This
means checking the postgresql logs and perhaps bumping the log levels
to see if it's the PG end dropping the connection (and maybe we can
get a reason ...).
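As a sketch, these are the standard postgresql.conf parameters I'd
bump for this (the file location varies by distribution; reload the
config afterwards, e.g. with SELECT pg_reload_conf();):

```ini
# postgresql.conf (fragment)
log_connections    = on      # log every successful connection
log_disconnections = on      # log session end, including duration
log_min_messages   = info    # raise to debug1 temporarily if nothing shows up
log_line_prefix    = '%m [%p] %u@%d '   # timestamp, pid, user@database
```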

Another idea is to remove the network from the equation completely and
try running a local PG instance on the puppetdb server for a period of
time. This would prove or disprove quite a lot, and is probably a good
next step if the PG logging surfaces no new information.

After that I would usually drop to doing a proper network trace with
tcpdump and its ilk, on both the pg and puppetdb ends, to see which
end is causing these connections to drop. This is pretty hard-core
stuff, but it would certainly determine what is _actually_ happening
between the two systems.
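Something along these lines on each end would do it (the interface
name and PG hostname below are placeholders, and 5432 assumes the
default PostgreSQL port):

```shell
# On the PuppetDB host: capture all traffic to/from the PG server on port 5432
tcpdump -i eth0 -s 0 -w puppetdb-pg.pcap host pg.example.com and port 5432

# Afterwards, look for which side sends the FIN/RST that kills the connection
tcpdump -nn -r puppetdb-pg.pcap 'tcp[tcpflags] & (tcp-fin|tcp-rst) != 0'
```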

ken.