PuppetDB queue problem


ak0ska

Feb 28, 2013, 4:14:16 AM
to puppet...@googlegroups.com
Hello,

We upgraded PuppetDB from 1.0.2 to 1.1.1 last week, and recently we noticed that the queue is not being processed. We don't know whether this problem is directly linked to the upgrade. There are now 25,000 messages waiting in the queue. We also see errors in the Postgres log:

ERROR: insert or update on table "certname_catalogs" violates foreign key constraint "certname_catalogs_catalog_fkey"
DETAIL: Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in table "catalogs".
STATEMENT: INSERT INTO certname_catalogs (certname,catalog,timestamp) VALUES ($1,$2,$3) RETURNING *

Can this be somehow related to the KahaDB leak thread? (https://groups.google.com/d/topic/puppet-users/gWACi4a0BZs/discussion )

Any thoughts? Would flushing the queue solve the problem?

 

ak0ska

Feb 28, 2013, 4:38:42 AM
to puppet...@googlegroups.com
We flushed the queue and also restarted Postgres. Didn't help.

ak0ska

Feb 28, 2013, 7:09:15 AM
to puppet...@googlegroups.com
Or maybe it did. The queue size went up to around 900, and after a while it normalized and started slowly decreasing. Right now it's empty.
There have been 2 insert errors in the Postgres log since its restart.

The problem was perhaps on our end, but it would be good to hear some theories, or suggestions on where else to look in a similar case.

ak0ska

Feb 28, 2013, 7:54:13 AM
to puppet...@googlegroups.com
The queue jumped back to ~1,600, and according to the PuppetDB logs it hasn't processed a single entry for about 20 minutes, but the log shows a lot of slow queries (see example below). The Postgres logs show no errors.

And now it started processing the queue again.


Slow queries: http://pastebin.com/2B4azRib

Ken Barber

Feb 28, 2013, 8:28:24 AM
to Puppet Users
If you clear the queue and roll back to the original version, does the
problem disappear? If you're having processing problems at the latest
version that's what I would do, as I presume we're talking production
here, right?

> Can this be somehow related to the KahaDB leak thread?

No - it doesn't sound like it.

> ERROR: insert or update on table "certname_catalogs" violates foreign key constraint "certname_catalogs_catalog_fkey"
> DETAIL: Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in table "catalogs".
> STATEMENT: INSERT INTO certname_catalogs (certname,catalog,timestamp) VALUES ($1,$2,$3) RETURNING *

That is a bit of a concern. Are you receiving a lot of these? Is it constant?

> Use of unversioned APIs is deprecated; please use /v1/resources

These are appearing in your slow query log - are you sure you're
running puppetdb-terminus 1.1.1 as well? Did you restart your master
after upgrading the terminus?

ken.

ak0ska

Feb 28, 2013, 9:21:30 AM
to puppet...@googlegroups.com
Hi, thanks for trying to help! :)


If you clear the queue and roll back to the original version, does the
problem disappear? If you're having processing problems at the latest
version that's what I would do, as I presume we're talking production
here, right?


Yes, this is production. We would like to avoid reverting, but maybe we will have to.

> ERROR: insert or update on table "certname_catalogs" violates foreign key constraint "certname_catalogs_catalog_fkey"
> DETAIL: Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in table "catalogs".
> STATEMENT: INSERT INTO certname_catalogs (certname,catalog,timestamp) VALUES ($1,$2,$3) RETURNING *

That is a bit of a concern. Are you receiving a lot of these? Is it constant?

Before we flushed the queue, quite a lot. Since we flushed it, only 4 times.

 
> Use of unversioned APIs is deprecated; please use /v1/resources

These are appearing in your slow query log - are you sure you're
running puppetdb-terminus 1.1.1 as well? Did you restart your master
after upgrading the terminus?


The terminus version is correct. However, we were not entirely sure that the masters were properly restarted, so we restarted them again now; let's see if this changes anything.


But the hanging is happening again. I checked which queries are running on the database server. Some queries have been running for ~90 minutes, and there are some idle threads. There are two UPDATE commands that are waiting. I assume that while those two don't get executed, the queue processing is halted.
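For reference, this is roughly how I looked at it (just a sketch; on our pre-9.2 Postgres the relevant pg_stat_activity columns are procpid, waiting and current_query, and I'm assuming the database is called puppetdb):

# run as the postgres user
psql -d puppetdb -c "SELECT procpid, now() - query_start AS runtime, waiting, current_query
                     FROM pg_stat_activity
                     ORDER BY query_start;"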


Ken Barber

Feb 28, 2013, 9:34:26 AM
to Puppet Users
> Hi, thanks for trying to help! :)

I'm @ken_barber on IRC btw, if that is easier.

>> That is a bit of a concern, are you receiving a lot of these? Is this
>> constant?
>
> Before we flushed the queue, quite a lot. Since we flushed it, only 4 times.
>
>
>>
>> > Use of unversioned APIs is deprecated; please use /v1/resources
>>
>> These are appearing in your slow query log - are you sure you're
>> running puppetdb-terminus 1.1.1 as well? Did you restart your master
>> after upgrading the terminus?
>>
>
> The terminus version is correct, however we were not entirely sure that the
> masters were properly restarted, so we did it now again, let's see if this
> changes anything.

Okay. Did you clear the ActiveMQ queues after doing this? I usually
just move the old KahaDB directory out of the way when I do this.
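Roughly this, as a sketch - paths assume the default /var/lib/puppetdb layout and a packaged init script called 'puppetdb', so adjust to your install:

service puppetdb stop
# keep the old queue around rather than deleting it, in case we need it later
mv /var/lib/puppetdb/mq/localhost/KahaDB /var/lib/puppetdb/mq/localhost/KahaDB.old
service puppetdb start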

> But the hanging is happening again. I checked what queries run on the
> database server. Some queries are running for ~ 90 mins, and there are some
> idle threads. There are two update commands that are waiting. I assume while
> those 2 don't get executed, the queue processing is halted.

Okay, so those old queries that are taking 90 minutes - what are they?

Have you tried shutting down PuppetDB, clearing the queue, running a
full vacuum on the database, then starting it all up again?
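Something like this, as a rough sketch - assuming the database is called 'puppetdb', the service script is 'puppetdb', and you run the vacuum as the postgres user:

service puppetdb stop
# (clear the queue while it's stopped - e.g. move the KahaDB directory aside as above)
vacuumdb --full --analyze --verbose puppetdb   # as the postgres user; takes exclusive locks
service puppetdb start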

What version of PostgreSQL are we talking here? What distro and
version specifically? How big is the database itself?

How many nodes and how frequently are they checking in? A screenshot
of your puppetdb dashboard might give us a good snapshot of the size
of your environment as well.

Has any tuning been performed on your postgresql server? A copy of
your postgresql.conf would be interesting.

We can do this all privately if you are concerned about sharing
information publicly as well - up to you. Like I said, I'm also on
Freenode IRC as @ken_barber if this is better for you.

ken.

ak0ska

Feb 28, 2013, 10:30:31 AM
to puppet...@googlegroups.com

I'm @ken_barber on irc btw if that is easier.
 
Can't use IRC here, sorry. :(
 
Okay. Did you clear the ActiveMQ queues after doing this? I usually
just move the old KahaDB directory out of the way when I do this.

I hadn't thought about that myself, but it makes sense, so I just flushed the queue again while the puppetdb service was stopped. Since this last restart it seems smooth, but it's only been up for a few minutes.


 
Okay, so those old queries that are taking 90 minutes - what are they?

SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL
18 instances were running, oldest was ~90 mins at the time

SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c  WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL)
12 instances were running, oldest was ~90 mins at the time


 
Have you tried shutting down the PuppetDB, clearing the queue -
running full vacuum on the database then starting it all up again?

Not yet. According to the PostgreSQL documentation, a full vacuum is not really recommended. Also, it might take a while, and PuppetDB would be unavailable for the duration.


 
What version of PostgreSQL are we talking here? What distro and
version specifically? How big is the database itself?

8.4.13 on a RHEL6-based distribution. The database size might be an indicator that something is not OK. It currently takes up 512 GB, and it grew 100 GB over a week. We made a database dump before the PuppetDB upgrade and it was ~1 GB uncompressed. The full vacuum might solve this, but my guess is that it would run quite long.


 
How many nodes and how frequently are they checking in? A screenshot
of your puppetdb dashboard might give us a good snapshot of the size
of your environment as well.

We have ~2,500 nodes; the run intervals for the two major environments are 60 minutes for 1,593 machines and 30 minutes for 496. The splay limit is 900 seconds for both.
Two screenshots from today: http://imgur.com/kG5Sth5,bLFqGAX#0 and http://imgur.com/kG5Sth5,bLFqGAX#1


 
Has any tuning been performed on your postgresql server? A copy of
your postgresql.conf would be interesting.

No tuning that I'm aware of. It's the same default postgresql.conf as deployed by the puppetlabs-postgres module: http://pastebin.com/njvGm4eu
 

We can do this all privately if you are concerned about sharing
information publicly as well - up to you. Like I said, I'm also on
Freenode IRC as @ken_barber if this is better for you.

ken.


So far I don't feel like I shared anything sensitive, so this might be useful for others as well.

Thanks for the help!
 

Ken Barber

Feb 28, 2013, 10:55:09 AM
to Puppet Users
>> Okay. Did you clear the ActiveMQ queues after doing this? I usually
>> just move the old KahaDB directory out of the way when I do this.
>
>
> I hadn't thought about that myself, but it makes sense, so I just flushed the
> queue again while the puppetdb service was stopped. Since this last restart it
> seems smooth, but it's only been up for a few minutes.

You mean, you've only been watching it for a few minutes, and so far
so good - or it crashed? Sorry - just want to be clear :-).

>> Okay, so were those old queries that are taking 90 minutes, what are they?
>>
> SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr,
> certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND
> c.deactivated IS NULL
> 18 instances were running, oldest was ~90 mins at the time
>
> SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources
> cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND
> cc.certname=c.name AND c.deactivated IS NULL)
> 12 instances were running, oldest was ~90 mins at the time

>> Have you tried shutting down the PuppetDB, clearing the queue -
>> running full vacuum on the database then starting it all up again?
>
>
> Not yet, according to the Postgresql documentation full vacuum is not really
> recommended. Also it might take a while, and PuppetDB would be unavailable
> for the duration.

It's not recommended as a recurring maintenance task, this is true,
but if your DB has had major changes due to a schema upgrade, or if
vacuum hasn't run in a while, it's generally okay and can provide some
benefit. But yeah, judging by the size of your DB this will take a
long time. I guess my point is, every recommendation has a caveat or
back-story.

>> What version of PostgreSQL are we talking here? What distro and
>> version specifically? How big is the database itself?
>
>
> 8.4.13 on a RHEL6 based distribution. The database size might be an
> indicator that something is not ok. It currently takes up 512Gbs, and it
> grew 100Gbs over a week. We made a database dump before the PuppetDB upgrade
> and it was ~1Gb uncompressed. The full vacuum might solve this, but my guess
> that it would run quite long.

Wow - it should be more like 5 GB or less for your size.

It sounds like you've got some major fragmentation issues; your
indexes may need rebuilding, and yeah - a vacuum will probably help,
plus an index rebuild. But this will require an outage, I would say.

When was the last time it was vacuumed? Try the SQL:

select relname,last_vacuum, last_autovacuum, last_analyze,
last_autoanalyze from pg_stat_user_tables;

And give us the output. I think autovacuum should be on by default for
PostgreSQL 8.4 on Red Hat, but I can't recall.
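A quick way to check what the server itself thinks - just a sketch, run as the postgres user against the puppetdb database:

psql -d puppetdb -c "SHOW autovacuum;"
psql -d puppetdb -c "SELECT name, setting FROM pg_settings WHERE name LIKE 'autovacuum%';"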

On another note ... to be honest I can't be precise about why the
upgrade failed. I'm pretty sure you were running old revisions of the
terminus when you upgraded, but that shouldn't cause this kind of
anguish. The errors you were receiving about constraints:

Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not
present in table "catalogs".

Should not be occurring at all, which is all very suspect - alas, I
have no clue yet as to why. Have they stopped now that we have cleared
the queue and restarted?

>> How many nodes and how frequently are they checking in? A screenshot
>> of your puppetdb dashboard might give us a good snapshot of the size
>> of your environment as well.
>
>
> We have ~2500 nodes, run interval for the two major environments: 60mins for
> 1593 machines and 30 mins for 496. Splay limit is 900sec for both.
> Two screenshots from today: http://imgur.com/kG5Sth5,bLFqGAX#0 and
> http://imgur.com/kG5Sth5,bLFqGAX#1

Okay.

>> Has any tuning been performed on your postgresql server? A copy of
>> your postgresql.conf would be interesting.
>
>
> No tuning that I'm aware of. The same default postgresql.conf as deployed by
> the puppetlabs-postgres module. It's here: http://pastebin.com/njvGm4eu

What is the size of your database? CPU/cores ... and RAM on the box?

Does your puppetdb service live on the same node as the database? I'm
guessing this to be true, as your postgresql.conf is listening only on
'localhost' ... what is the memory consumption of your apps? The
output of 'free' would probably be a good start.

ken.

Ken Barber

Feb 28, 2013, 12:09:35 PM
to Puppet Users
FYI, I just upgraded only the PuppetDB part to 1.1.1; using the old
1.0.2 terminus I get no errors:

2013-02-28 17:06:27,711 WARN [qtp1478462104-39] [http.server] Use of
unversioned APIs is deprecated; please use /v1/commands
2013-02-28 17:06:28,284 INFO [command-proc-44] [puppetdb.command]
[7a4a1f70-1e85-4256-8f95-21008b6c1199] [replace facts] puppetdb2.vm
2013-02-28 17:06:30,724 WARN [qtp1478462104-32] [http.server] Use of
unversioned APIs is deprecated; please use /v1/commands
2013-02-28 17:06:32,693 INFO [command-proc-44] [puppetdb.command]
[a87001c5-f064-4a2d-9f9e-de903f62f824] [replace catalog] puppetdb2.vm

If at all possible - I wouldn't mind a full copy of your puppetdb.log
... to dig a bit deeper. And I know I told you to clear the KahaDB
queue (I always make this mistake) but I don't suppose you kept an old
copy of it?

ken.

ak0ska

Mar 1, 2013, 4:09:56 AM
to puppet...@googlegroups.com
 
You mean, you've only been watching it for a few minutes, and so far
so good - or it crashed? Sorry - just want to be clear :-).

I was watching it for a few minutes and it seemed good. However, the queue grew to 4,000 items overnight. Also, we have more of the constraint violation errors now.


 
Its not recommended as a re-occurring maintenance task this is true,
but if your DB has had major changes due to a schema upgrade, or if
vacuum hasn't ran in a while its generally okay and can provide some
benefit. But yeah, judging by the size of your DB this will take a
long time. I guess my point is, every recommendation has a caveat or
back-story.

We decided to schedule a full vacuum for this weekend.  


 
Wow - it should be more like 5 GB or less for your size.

It sounds like you've got some major fragmentation issues, your
indexes may need rebuilding and yeah - a vacuum will probably help -
and index rebuild. But this will require an outage I would say.

When was the last time it was vacuumed? Try the SQL:

select relname,last_vacuum, last_autovacuum, last_analyze,
last_autoanalyze from pg_stat_user_tables;

And give us the output. I think by default autovacuum should be on for
Postgresql 8.4 on Redhat but I can't recall.

         relname         | last_vacuum |        last_autovacuum        | last_analyze |       last_autoanalyze       
-------------------------+-------------+-------------------------------+--------------+-------------------------------
 edges                   |             | 2013-03-01 09:49:04.659005+01 |              | 2013-03-01 08:57:59.479092+01
 reports                 |             |                               |              |
 resource_events         |             |                               |              |
 schema_migrations       |             |                               |              |
 certnames               |             | 2013-03-01 09:20:50.378484+01 |              | 2013-03-01 09:19:49.22173+01
 certname_catalogs       |             | 2013-03-01 09:07:54.251874+01 |              | 2013-03-01 09:20:50.548025+01
 catalog_resources       |             | 2013-01-29 23:17:04.224172+01 |              | 2013-01-30 08:47:38.371796+01
 catalogs                |             | 2013-03-01 08:20:48.148931+01 |              | 2013-03-01 09:19:48.749645+01
 certname_facts_metadata |             | 2013-03-01 09:20:51.318913+01 |              | 2013-03-01 09:19:50.021701+01
 certname_facts          |             | 2013-03-01 09:19:47.655727+01 |              | 2013-03-01 09:10:53.688119+01
 resource_params         |             | 2013-02-28 15:21:02.192264+01 |              | 2013-02-28 13:13:59.806642+01
(11 rows)

We actually did manually vacuum the database before the upgrade, when we saw the difference between the dump and database size. Strange that it doesn't show up in that query. But it was probably too little, too late anyway.
 
On another note ... to be honest I can't be precise about why the
upgrade failed, I'm pretty sure you were running old revisions of the
terminus when you upgraded but that shouldn't cause this kind of anguish.
The errors you were receiving about constraints:

Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not
present in table "catalogs".

Should not be occurring at all, which is all very suspect - alas I
have no clue yet as to why. Have they stopped now we have cleared the
queue and restarted?

As I said above, we have new errors of this kind.


 
What is the size of your database? CPU/cores ... and RAM on the box?

4 cores @ 2.27 GHz, 12 GB RAM



Does your puppetdb service live on the same node as the database? I'm
guessing this to be true, as your postgresql.conf is listening only on
'localhost' ... what is the memory consumption of your apps? The
output of 'free' would probably be a good start.

Yes, it's the same machine. For now.

"free" output:
                        total       used       free     shared    buffers     cached
Mem:      12319548   12150396     169152          0       7388    9504356
-/+ buffers/cache:    2638652    9680896
Swap:      6160376      85212    6075164

ak0ska

Mar 1, 2013, 4:19:34 AM
to puppet...@googlegroups.com


On Thursday, February 28, 2013 6:09:35 PM UTC+1, Ken Barber wrote:
FYI, I just upgraded only the PuppetDB part to 1.1.1, using the old
1.0.2 terminus I get no errors:

2013-02-28 17:06:27,711 WARN  [qtp1478462104-39] [http.server] Use of
unversioned APIs is deprecated; please use /v1/commands
2013-02-28 17:06:28,284 INFO  [command-proc-44] [puppetdb.command]
[7a4a1f70-1e85-4256-8f95-21008b6c1199] [replace facts] puppetdb2.vm
2013-02-28 17:06:30,724 WARN  [qtp1478462104-32] [http.server] Use of
unversioned APIs is deprecated; please use /v1/commands
2013-02-28 17:06:32,693 INFO  [command-proc-44] [puppetdb.command]
[a87001c5-f064-4a2d-9f9e-de903f62f824] [replace catalog] puppetdb2.vm


As I was saying, we're not entirely sure that this behaviour was caused by the upgrade. We only noticed it one week after the upgrade (so unfortunately the Postgres logs from before the update were rotated - that's the default setting), and it might be a coincidence. Then again, maybe not. :)
 
If at all possible - I wouldn't mind a full copy of your puppetdb.log
... to dig a bit deeper. And I know I told you to clear the KahaDB
queue (I always make this mistake) but I don't suppose you kept an old
copy of it?
 
We kept the KahaDB backup both times we flushed the queue. The sizes in /var/lib/puppetdb/mq/localhost/ are as follows:

1.3G    ./KahaDB
637M    ./KahaDB.old2
1.1G    ./scheduler
6.2G    ./KahaDB.old

The big one is the 25,000-message queue.

I would happily send the logs over, but this is something I have to discuss with my colleagues.

Again, thank you for helping! :)

ak0ska

Ken Barber

Mar 1, 2013, 9:25:46 AM
to Puppet Users
So I've been pondering this issue of yours, and I keep coming back to
that error in my mind:

ERROR: insert or update on table "certname_catalogs" violates foreign
key constraint "certname_catalogs_catalog_fkey"

Regardless of the other issues, a 512 GB DB - yes, it's big, but so what?
That shouldn't cause an integrity violation. It could be a big hint to
something else, but I'm not entirely sure this is the cause.

This error - can we try to track down the corresponding PuppetDB error
message that goes with it, in the puppetdb.log or perhaps in the dead
letter queue for ActiveMQ? If PostgreSQL is throwing the error, then
there must be a corresponding exception in puppetdb.log.

It's just that when I look at the code ... it doesn't make sense that
the insert fails here - "Key
(catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in
table "catalogs"" ... but I can't yet see a fault in the logic inside
the code.

Still, it's going to be hard to replicate without the data on my side.

As far as your issue ... you have a couple of avenues:

* Drop the existing database and allow it to be recreated from scratch -
run puppet in noop mode several times across your entire infrastructure
to repopulate it. This has the advantage of being an effective vacuum at
the same time :-). (See the sketch after this list.)
* Roll back the PuppetDB version and the database. This will at least
get you back to a known working state, but will still leave you in a
place where you need to recover.
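For the first option, the mechanics are roughly this - a sketch only,
assuming the database and the role PuppetDB connects as are both called
'puppetdb' and the packaged init script is 'puppetdb'; adjust to your setup:

service puppetdb stop
# the next three commands run as the postgres user; keep the broken state for diagnosis
pg_dump -Fc puppetdb > /var/tmp/puppetdb-broken.dump
dropdb puppetdb
createdb -E UTF8 -O puppetdb puppetdb
service puppetdb start   # PuppetDB should recreate its schema on startup
# then repopulate, e.g. on each node (or just let the normal runs do it):
puppet agent --test --noop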

In either case you would want to take a backup of the existing database
for diagnostic purposes. But I really feel like there is something
structurally wrong here - perhaps the migration scripts have failed
during the database upgrade?

But I'm guessing you can't stay broken for very long - so I would take
a snapshot of the current broken state so we can take a look and try
to replicate it, and make moves on getting back to a working state as
suggested above.

I'll need:

* PuppetDB logs
* Backup of your current DB
* Backup of your old DB
* The broken KahaDB queues

ken.

Ken Barber

Mar 1, 2013, 9:36:20 AM
to Puppet Users
Oh - and a copy of the current dead letter queue would be nice; it's
normally stored in:

/var/lib/puppetdb/mq/discarded/*

This should also contain the full exceptions for the failed SQL as I
mentioned earlier, so perhaps glancing into those now and letting me
know what the prevalent failure is would be handy.
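Something quick along these lines would be enough - just a sketch,
assuming the discarded entries are plain text:

ls -lt /var/lib/puppetdb/mq/discarded/ | head
# rough tally of which exceptions show up most often
grep -rh "Exception" /var/lib/puppetdb/mq/discarded/ | sort | uniq -c | sort -rn | head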

I can organise a secure space on a Puppetlabs support storage area to
upload this data if you are willing. Just contact me privately to
organise it.

ken.

ak0ska

Mar 1, 2013, 11:32:19 AM
to puppet...@googlegroups.com


On Friday, March 1, 2013 3:36:20 PM UTC+1, Ken Barber wrote:
Oh - and a copy of the current dead letter queue would be nice, its
normally stored in:

/var/lib/puppetdb/mq/discarded/*

I will back it up.

 
This should also contain the full exceptions for the failed SQL as I
mentioned earlier, so perhaps a glance into those now and letting me
know what the prevalent failure is would be handy.

Here's one of them.

Attempt 1 @ 2013-02-18T17:04:30.512Z

java.lang.IllegalArgumentException: Edge '{:relationship :contains, :target {:title nil, :type nil}, :source {:title "Castor::Installer", :type "Class"}}' refers to resource '{:title nil, :type nil}', which doesn't exist in the catalog.
com.puppetlabs.puppetdb.catalog$validate_edges.invoke(catalog.clj:205)
clojure.core$comp$fn__4036.invoke(core.clj:2286)
com.puppetlabs.puppetdb.catalog$eval1498$fn__1499.invoke(catalog.clj:311)
clojure.lang.MultiFn.invoke(MultiFn.java:167)
com.puppetlabs.puppetdb.command$replace_catalog_STAR_$fn__2696.invoke(command.clj:308)
com.puppetlabs.puppetdb.command$replace_catalog_STAR_.invoke(command.clj:308)
com.puppetlabs.puppetdb.command$eval2716$fn__2718.invoke(command.clj:329)
clojure.lang.MultiFn.invoke(MultiFn.java:167)
com.puppetlabs.puppetdb.command$produce_message_handler$fn__2838.invoke(command.clj:566)
com.puppetlabs.puppetdb.command$wrap_with_discard$fn__2789$fn__2792.invoke(command.clj:474)
com.puppetlabs.puppetdb.command.proxy$java.lang.Object$Callable$f8c5758f.call(Unknown Source)
com.yammer.metrics.core.Timer.time(Timer.java:91)
com.puppetlabs.puppetdb.command$wrap_with_discard$fn__2789.invoke(command.clj:473)
com.puppetlabs.puppetdb.command$wrap_with_exception_handling$fn__2774$fn__2775.invoke(command.clj:427)
com.puppetlabs.puppetdb.command.proxy$java.lang.Object$Callable$f8c5758f.call(Unknown Source)
com.yammer.metrics.core.Timer.time(Timer.java:91)
com.puppetlabs.puppetdb.command$wrap_with_exception_handling$fn__2774.invoke(command.clj:426)
com.puppetlabs.puppetdb.command$wrap_with_command_parser$fn__2784.invoke(command.clj:449)
com.puppetlabs.puppetdb.command$wrap_with_meter$fn__2765.invoke(command.clj:387)
com.puppetlabs.puppetdb.command$wrap_with_thread_name$fn__2797.invoke(command.clj:489)
clamq.jms$jms_consumer$fn__2452.invoke(jms.clj:38)
clamq.jms.proxy$java.lang.Object$MessageListener$ce893c05.onMessage(Unknown Source)
org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:560)
org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:498)
org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:467)
org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:325)
org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:263)
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1058)
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1050)
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:947)
java.lang.Thread.run(Thread.java:679)

Attempt null @ 2013-02-18T17:04:30.512Z

java.lang.IllegalArgumentException: Edge '{:relationship :contains, :target {:title nil, :type nil}, :source {:title "Castor::Installer", :type "Class"}}' refers to resource '{:title nil, :type nil}', which doesn't exist in the catalog.
com.puppetlabs.puppetdb.catalog$validate_edges.invoke(catalog.clj:205)
clojure.core$comp$fn__4036.invoke(core.clj:2286)
com.puppetlabs.puppetdb.catalog$eval1498$fn__1499.invoke(catalog.clj:311)
clojure.lang.MultiFn.invoke(MultiFn.java:167)
com.puppetlabs.puppetdb.command$replace_catalog_STAR_$fn__2696.invoke(command.clj:308)
com.puppetlabs.puppetdb.command$replace_catalog_STAR_.invoke(command.clj:308)
com.puppetlabs.puppetdb.command$eval2716$fn__2718.invoke(command.clj:329)
clojure.lang.MultiFn.invoke(MultiFn.java:167)
com.puppetlabs.puppetdb.command$produce_message_handler$fn__2838.invoke(command.clj:566)
com.puppetlabs.puppetdb.command$wrap_with_discard$fn__2789$fn__2792.invoke(command.clj:474)
com.puppetlabs.puppetdb.command.proxy$java.lang.Object$Callable$f8c5758f.call(Unknown Source)
com.yammer.metrics.core.Timer.time(Timer.java:91)
com.puppetlabs.puppetdb.command$wrap_with_discard$fn__2789.invoke(command.clj:473)
com.puppetlabs.puppetdb.command$wrap_with_exception_handling$fn__2774$fn__2775.invoke(command.clj:427)
com.puppetlabs.puppetdb.command.proxy$java.lang.Object$Callable$f8c5758f.call(Unknown Source)
com.yammer.metrics.core.Timer.time(Timer.java:91)
com.puppetlabs.puppetdb.command$wrap_with_exception_handling$fn__2774.invoke(command.clj:426)
com.puppetlabs.puppetdb.command$wrap_with_command_parser$fn__2784.invoke(command.clj:449)
com.puppetlabs.puppetdb.command$wrap_with_meter$fn__2765.invoke(command.clj:387)
com.puppetlabs.puppetdb.command$wrap_with_thread_name$fn__2797.invoke(command.clj:489)
clamq.jms$jms_consumer$fn__2452.invoke(jms.clj:38)
clamq.jms.proxy$java.lang.Object$MessageListener$ce893c05.onMessage(Unknown Source)
org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:560)
org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:498)
org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:467)
org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:325)
org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:263)
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1058)
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1050)
org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:947)
java.lang.Thread.run(Thread.java:679)

 
I can organise a secure space on a Puppetlabs support storage area to
upload this data if you are willing. Just contact me privately to
organise it.

I am willing, but this is not up to me. :) I will keep all the backups you asked for, for now.



> So I've been pondering this issue of yours, and I keep coming back to
> that error in my mind:
>
>  ERROR: insert or update on table "certname_catalogs" violates foreign
> key constraint "certname_catalogs_catalog_fkey"
>
> Regardless of the other issues, 512 GB db - yes its big, but so what?
> That shouldn't cause an integrity violation. It could be a big hint to
> something else, but I'm not entirely sure this is the cause.
>
> This error - can we try to track down the corresponding PuppetDB error
> message that goes with it in the puppetdb.log or in perhaps the dead
> letter queue for activemq? If postgresql is throwing the error - then
> there must be a corresponding exception in puppetdb.log.
>
> Its just that when I look at the code ... it doesn't make sense that
> the insert fails here - "Key
> (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in
> table "catalogs". ... but I can't yet see a fault in the logic inside
> the code yet.
>
> Still its going to be hard to replicate without the data on my side.
>
> As far as your issue ... you have a couple of avenues:
>
> * Drop the existing database, allow it to be recreated from scratch -
> run noop a serveral times across your entire infrastructure to
> repopulate it. Has the advantage of being an effective vacuum at the
> same time :-).

This could be the next step :) if the database maintenance doesn't help.



> * Rollback the PuppetDB version and the database. This will at least
> get you back to a known working state, but will still leave you in a
> place where you need to recover.

> Either case - you would want to take a backup of the existing database
> for diagnostic purposes. But I really feel like there is something
> structurally wrong here - perhaps the migration scripts have failed
> during the database upgrade?

I made a backup today, to have a fresh one before we start the database maintenance. The "structurally wrong" theory might not be so far-fetched, since we didn't upgrade from an official 1.0.2 release. My colleague got a patched version (I don't know the details, and can't ask now, as he's on holiday), because at that time the official release was having an issue with binary files (http://projects.puppetlabs.com/issues/17216). One consequence was that value 8 was missing from the schema_migrations table, even though the matching modifications were already in place in the database (8 rename-fact-column in /src/com/puppetlabs/puppetdb/scf/migrate.clj). Because of that, upgrades initially failed when PuppetDB tried to rename non-existing columns. The schema looked like it actually matched the migrations up to number 8, so just adding that value seemed to be OK. This was tested at a much smaller scale, where the upgrade seemed fine; we never saw this error there. Perhaps we overlooked something.
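(For reference, checking which migrations the database thinks were applied is just a matter of looking at that table - a quick sketch, assuming the database is called puppetdb:)

# run as the postgres user
psql -d puppetdb -c "SELECT * FROM schema_migrations ORDER BY 1;"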

If this is the case, then your suggested solution, to recreate the database from scratch, should solve the problem. This will be our next step if the database maintenance doesn't help.

 
> But I'm guessing you can't stay broken for very long - so I would take
> a snapshot of the current broken state so we can take a look and try
> to replicate it, and make moves on getting back to a working state as
> suggested above.
>
> I'll need:
>
> * PuppetDB logs
> * Backup of your current DB
> * Backup of your old DB
> * The broken KahaDB queues
>
> ken.
>

I have all these backups, but other people have to authorize sharing them.

Have a nice weekend! :)

ak0ska

Ken Barber

Mar 1, 2013, 11:47:06 AM
to Puppet Users
> I made a backup today, to have a fresh one before we start the database
> maintenance. The structurally wrong might not be so far fetched, since we
> didn't upgrade from an official 1.0.2 release. My colleague got a patched
> version (don't know the details, and can't ask now, as he's on holiday),
> because at that time the official release was having an issue with binary
> files (http://projects.puppetlabs.com/issues/17216). One consequence was,
> that value 8 was missing from the schema_migrations table, even though the
> matching modifications were already in place in the database ( 8
> rename-fact-column in /src/com/puppetlabs/puppetdb/scf/migrate.clj ).
> Because of that, upgrades failed initially, when PuppetDB tried to rename
> non-existing columns. The schema looked like it was actually matching the
> migrations up to number 8, so just adding that value seemed to be ok. This
> was tested in a lot smaller scale, but there the upgrade seemed fine, never
> saw this error. Perhaps we overlooked something.
>
> If this is the case, then your suggested solution, to recreate the database
> from scratch, should solve the problem. This will be up our next step, if
> database maintenance doesn't help.

Well, I don't think a vacuum will help you - I imagine something is
wrong with the schema right now, or some data migration failed during
the upgrade, especially if you had migration issues from your custom
PuppetDB. Of course I can't prove this with so little knowledge - but
it certainly does raise a red flag.

The only other option is to compare the schemas with a known good one,
and compare the database data to try and find the fault. This however
is prone to error and might be time-consuming. Recreating the database
sounds like a far more reliable option with a lot more guarantees
around it.

ken.

ak0ska

Mar 1, 2013, 12:00:33 PM
to puppet...@googlegroups.com
Yes, maybe not. The next step will be to recreate it from scratch.

Ken Barber

Mar 4, 2013, 4:55:37 PM
to Puppet Users
Any progress today?

Mike Tedesco

Mar 4, 2013, 6:51:16 PM
to puppet...@googlegroups.com, Puppet Users
Are PuppetDB and Postgres on the same server?
How many nodes does your environment have?

I had a similar issue and it was linked to I/O. Can you look at that?

Mike { Thanks => always }

ak0ska

Mar 5, 2013, 2:57:09 AM
to puppet...@googlegroups.com
A little update on our story.

The full vacuum was running for the whole weekend, so we haven't yet had time to rebuild the indexes, because that would require more downtime and we're not sure how long it would take. The size of the database didn't drop that much; it's now ~370 GB.

We already see some improvements. The queue doesn't get insanely large; in fact, the highest peak we've seen since Monday morning was about 20, but mostly it's between 0 and 3. That's good. :) One strange benefit is that we now see "Resources" and "Resource duplication" values on the dashboard; those were previously just question marks, as you can see in my previous screenshots.

However, we still get the constraint violations, steadily 2 per hour. They appear as "Retried" commands on the dashboard. But if the queue reaches size 0, does that mean these commands get written to the database eventually? The violations seem to happen shortly after PuppetDB starts garbage collection.

You can see it here: http://pastebin.com/B6VR67LW

We are still thinking about wiping the database and bringing up a new one, as you suggested, but that cannot be done before next weekend. Perhaps we should rebuild the indexes first and see if that further improves the situation.
It's only been 1 day since the restart, but so far it looks a lot better than before.

Cheers,

ak0ska

ak0ska

Mar 5, 2013, 3:00:29 AM
to puppet...@googlegroups.com
Hey Mike,

Thanks for the suggestions, but we already checked the I/O rates, and they seemed fine. And yes, PuppetDB and Postgres are on the same machine for now, but we plan to change that sometime in the future.

Cheers,

ak0ska

Mike Tedesco

Mar 5, 2013, 9:23:38 AM
to puppet...@googlegroups.com, puppet...@googlegroups.com
What are the I/O stats? Can I just peek at them?

Mike { Thanks => always }

Ken Barber

Mar 5, 2013, 3:21:22 PM
to Puppet Users
> Vacuum full was running for the whole weekend, so we didn't yet have time to
> rebuild indexes, because that would require more downtime, and we're not
> sure how long it would take. The size of the database didn't drop that much,
> it's now ~370Gb.

Wow. That's still way too large for the number of nodes. I imagine
re-indexing might help; we can check first. Can you show how big
each of your relations is? A query like this might help:

SELECT nspname || '.' || relname AS "relation",
pg_size_pretty(pg_relation_size(C.oid)) AS "size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(C.oid) DESC
LIMIT 20;

Also - what are the row counts for each of your tables btw?

If you run this for each table it will get you a row count:

select count(*) from table_name;
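Or, to save some typing, the stats tables will give you an approximate
count for every table in one go (a sketch; n_live_tup is an estimate,
not an exact count):

# run as the postgres user, against the puppetdb database
psql -d puppetdb -c "SELECT relname, n_live_tup, n_dead_tup
                     FROM pg_stat_user_tables
                     ORDER BY n_live_tup DESC;"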

I'd be curious to see if any of these are higher than they should be
based on your node count - maybe some of the space is just large
tables?

> We already see some improvements. The queue doesn't get insanely large, in
> fact the highest peak we saw since Monday morning was about 20, but mostly
> it's between 0 and 3. That's good. :) One strange benefit is that we now see
> "Resources" and "Resource duplication" values on the Dashboard, those were
> previously just question marks, as you can see on my previous screenshots.

Perhaps the queries were just timing out on the dashboard?

> However, we still get the constraint violations, steadily 2 violations per
> hour. They appear as "Retried" commands on the dashboard. But if the queue
> reaches the size 0, then does this mean these commands get written to the
> database eventually? The violations seem to happen shortly after puppetdb
> starts garbage collection.
>
> You can see it here: http://pastebin.com/B6VR67LW

This should be evident in your puppetdb.log if you trace the UUID of a
command. If commands 'fail' completely, they end up in the DLQ located
here:

/var/lib/puppetdb/mq/discarded/

ken.

ak0ska

Mar 6, 2013, 4:23:53 AM
to puppet...@googlegroups.com
@Mike

iostat -nx

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
md0               0.00     0.00   85.55  405.31  9226.53  3234.60    25.39     0.00    0.00   0.00   0.00

@Ken


Wow. Thats still way too large for the amount of nodes. I imagine
re-indexing might help, we can check first. Can you display how big
each of your relations are? A query like this might help:

SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_relation_size(C.oid)) AS "size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  ORDER BY pg_relation_size(C.oid) DESC
  LIMIT 20;

Indexes seem bloated.

                relation                 |  size  
-----------------------------------------+---------
 public.idx_catalog_resources_tags_gin   | 117 GB
 public.idx_catalog_resources_tags       | 96 GB
 public.idx_catalog_resources_resource   | 39 GB
 public.idx_catalog_resources_catalog    | 39 GB
 public.idx_catalog_resources_type_title | 34 GB
 public.catalog_resources_pkey           | 32 GB
 public.idx_catalog_resources_type       | 16 GB
 public.catalog_resources                | 9454 MB
 public.edges_pkey                       | 2460 MB
 public.edges                            | 875 MB
 public.idx_certname_facts_name          | 447 MB
 public.certname_facts_pkey              | 77 MB
 public.idx_certname_facts_certname      | 66 MB
 public.resource_params                  | 66 MB
 public.idx_resources_params_name        | 60 MB
 public.idx_resources_params_resource    | 50 MB
 public.resource_params_pkey             | 43 MB
 public.certname_facts                   | 41 MB
 pg_toast.pg_toast_16463                 | 34 MB
 pg_toast.pg_toast_16463_index           | 2360 kB

 
Also - what are the row counts for each of your tables btw?

If you run this for each table it will get you a row count:

select count(*) from table_name;

I'd be curious to see if any of these are higher than what they should
be based on your node count, maybe some of the space is just large
tables perhaps?

Row counts:

certname_facts                 336426
catalogs                            2963
resource_events                0
reports                              0
certnames                         2825
certname_catalogs              2810
certname_facts_metadata    2764
catalog_resources               1881888
resource_params                348806
edges                                3454907
schema_migrations            9

Perhaps the queries were just timing out on the dashboard? 

Probably


This should be evident in your puppetdb.log if you trace the UUID of a
command. If commands 'fail' completely, they end up in the DLQ located
here:

/var/lib/puppetdb/mq/discarded/

Just looked at that directory, no entry with a recent date, so I guess they go through eventually.

Ken Barber

Mar 6, 2013, 6:23:13 PM
to Puppet Users
> Indexes seem bloated.

Totally agree, you should organise re-indexes starting from the biggest.
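Roughly like this, table by table - just a sketch, and note that REINDEX
takes exclusive locks so it needs an outage window (the obsolete tags
index can simply be dropped instead, see below):

# run as the postgres user; REINDEX TABLE rebuilds every index on that table
psql -d puppetdb -c "DROP INDEX idx_catalog_resources_tags;"   # obsolete as of 1.1.0, see below
psql -d puppetdb -c "REINDEX TABLE catalog_resources;"
psql -d puppetdb -c "REINDEX TABLE edges;"
psql -d puppetdb -c "REINDEX TABLE certname_facts;"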

> relation | size
> -----------------------------------------+---------
> public.idx_catalog_resources_tags_gin | 117 GB
> public.idx_catalog_resources_tags | 96 GB
> public.idx_catalog_resources_resource | 39 GB
> public.idx_catalog_resources_catalog | 39 GB
> public.idx_catalog_resources_type_title | 34 GB
> public.catalog_resources_pkey | 32 GB
> public.idx_catalog_resources_type | 16 GB
> public.catalog_resources | 9454 MB
> public.edges_pkey | 2460 MB
> public.edges | 875 MB
> public.idx_certname_facts_name | 447 MB
> public.certname_facts_pkey | 77 MB
> public.idx_certname_facts_certname | 66 MB
> public.resource_params | 66 MB
> public.idx_resources_params_name | 60 MB
> public.idx_resources_params_resource | 50 MB
> public.resource_params_pkey | 43 MB
> public.certname_facts | 41 MB
> pg_toast.pg_toast_16463 | 34 MB
> pg_toast.pg_toast_16463_index | 2360 kB

So the index 'idx_catalog_resources_tags' was removed in 1.1.0 I
think, so that is no longer needed.

This points back to making sure your schema matches exactly what a
known good 1.1.1 has, as things have obviously been missed.

Try this tool out:

http://www.apgdiff.com/diff_online.php

I've got a pg_dump of a good 1.1.1 here:

https://gist.github.com/kbarber/5104169

All you need to do is perform a pg_dump of your schema:

pg_dump --schema-only puppetdb

And use that online tool to compare your schema with the one I've
provided. I would obviously double check its accuracy before doing
anything drastic, but this should help to get your schema inline with
what it should be. If you are paranoid about any part of the diff,
paste the results in a gist and we can discuss it.
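The mechanical steps are roughly this - a sketch only; the file names
are just examples, and adjust the apgdiff jar name to whatever version
you download:

# dump only the schema of the live database (run as the postgres user)
pg_dump --schema-only puppetdb > my-schema.sql
# save the known-good 1.1.1 schema from the gist above as good-schema.sql, then:
java -jar apgdiff.jar my-schema.sql good-schema.sql > schema-fixes.sql
# review schema-fixes.sql by hand before applying anything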

> Row counts:
>
> certname_facts 336426
> catalogs 2963
> resource_events 0
> reports 0
> certnames 2825
> certname_catalogs 2810
> certname_facts_metadata 2764
> catalog_resources 1881888
> resource_params 348806
> edges 3454907
> schema_migrations 9

None of this seems out of the ordinary.

>> This should be evident in your puppetdb.log if you trace the uuid of a
>> command. If commands 'fail' completely, they end up in the DQL located
>> here:
>>
>> /var/lib/puppetdb/mq/discarded/
>
>
> Just looked at that directory, no entry with a recent date, so I guess they
> go through eventually.

Okay, well - let's at least get your schema back in line with what it
should be, get your re-indexes done, and we can come back around to
this problem if it still exists.

ken.

ak0ska

Mar 7, 2013, 8:04:42 AM
to puppet...@googlegroups.com


On Thursday, March 7, 2013 12:23:13 AM UTC+1, Ken Barber wrote:


So the index 'idx_catalog_resources_tags' was removed in 1.1.0 I
think, so that is no longer needed.

This points back to making sure you schema matches exactly what a
known good 1.1.1 has, as things have been missed obviously.

Try this tool out:

http://www.apgdiff.com/diff_online.php

I've got a pg_dump of a good 1.1.1 here:

https://gist.github.com/kbarber/5104169

All you need to do is perform a pg_dump of your schema:

pg_dump --schema-only puppetdb

And use that online tool to compare your schema with the one I've
provided. I would obviously double check its accuracy before doing
anything drastic, but this should help to get your schema inline with
what it should be. If you are paranoid about any part of the diff,
paste the results in a gist and we can discuss it.

The only difference seems to be that one index, nothing else.
 

Okay, well - lets at least get your schema back inline with what it
should be, get your re-indexes done and we can come back round to this
problem if it still exists.

ken.

We will rebuild the indexes during the weekend, and see how it performs afterwards.





ak0ska

Mar 11, 2013, 4:50:28 AM
to puppet...@googlegroups.com
After dropping the obsolete index and rebuilding the others, the database is now ~30 GB. We still get the constraint violation errors when garbage collection starts.
Also the "Resources" and "Resource duplication" values on the dashboard are still question marks, so those queries probably time out.

Although these errors don't have a big impact on performance (at least for now), could they be a sign of something not being quite OK? Shall we still consider rebuilding the database?

Ken Barber

Mar 11, 2013, 2:04:05 PM
to Puppet Users
> After dropping the obsolete index, and rebuilding the others, the database
> is now ~ 30 GB. We still get the constraint violation errors when garbage
> collection starts.

Okay - can you please send me the puppetdb.log entry that shows the
exception? Including surrounding messages?

> Also the "Resources" and "Resource duplication" values on the dashboard are
> still question marks, so those queries probably time out.

Again, this should have a corresponding exception in puppetdb.log, or
at least a slow query being logged. Also - can you pull up the
JavaScript console and see what errors are produced in relation to
that?

> Although these errors don't have a big impact on performance (at least for
> now), could they be a sign of something not being quite ok? Shall we still
> consider rebuilding the database?

Not sure, I need more data.

ken.

ak0ska

Mar 12, 2013, 8:38:55 AM
to puppet...@googlegroups.com
I think my previous comment just got lost.

So, I cut three occurrences of this error from the database log and the corresponding part from the puppetdb log. I removed the hostnames, I hope it's still sensible: http://pastebin.com/yvyBDWQE
The unversioned API warnings are not from the masters. They're from an older version of the puppetdbquery library, and the puppetdb module's puppetdb_conn_validator function.

Apart from that, there are 2 kinds of slow queries that appear in the logs:

2013-03-12 11:54:37,149 WARN  [qtp1992135396-3384] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 452.2331 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 11:54:42,035 WARN  [qtp1992135396-3391] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 452.0072 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 11:55:43,146 WARN  [qtp1992135396-3392] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 441.2548 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c  WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:
2013-03-12 12:00:56,492 WARN  [qtp1992135396-3402] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 433.4503 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c  WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:
2013-03-12 12:03:30,933 WARN  [qtp1992135396-3392] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 467.7872 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:04:18,225 WARN  [qtp1992135396-3396] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 472.9386 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:54:49,042 WARN  [qtp1992135396-3463] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 464.9721 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:57:58,791 WARN  [qtp1992135396-3482] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 460.9083 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:57:58,803 WARN  [qtp1992135396-3490] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 460.7699 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c  WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:


I didn't see any JavaScript errors on the dashboard; the queries are just slow. I ran the resource duplication query manually on the PuppetDB machine:

time curl -X GET -H 'Accept: application/json' http://[puppetdbip]:8080/v2/metrics/mbean/com.puppetlabs.puppetdb.query.population:type=default,name=pct-resource-dupes -v
* About to connect() to [puppetdbip] port 8080 (#0)
*   Trying [puppetdbip]... connected
* Connected to [puppetdbip] ([puppetdbip]) port 8080 (#0)
> GET /v2/metrics/mbean/com.puppetlabs.puppetdb.query.population:type=default,name=pct-resource-dupes HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.6.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2
> Host: [puppetdbip]:8080
> Accept: application/json
>
< HTTP/1.1 200 OK
< Date: Tue, 12 Mar 2013 11:55:34 GMT
< Content-Type: application/json;charset=ISO-8859-1
< Content-Length: 34
< Server: Jetty(7.x.y-SNAPSHOT)
<
{
  "Value" : 0.9691941482891232
* Connection #0 to host [puppetdbip] left intact
* Closing connection #0
}
real    16m57.604s
user    0m0.012s
sys    0m0.025s

Deepak Giridharagopal

Mar 12, 2013, 11:28:01 AM
to puppet...@googlegroups.com
On Tue, Mar 12, 2013 at 6:38 AM, ak0ska <akos....@gmail.com> wrote:
I think my previous comment just got lost.

So, I cut three occurrences of this error from the database log and the corresponding part from the puppetdb log. I removed the hostnames, I hope it's still sensible: http://pastebin.com/yvyBDWQE
The unversioned api warnings are not from the masters. They're from an older version of the puppetdbquery library, and the puppetdb module's puppetdb_conn_validator function.

Apart from that, there are 2 kinds of slow queries that appear in the logs:

2013-03-12 11:54:37,149 WARN  [qtp1992135396-3384] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 452.2331 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 11:54:42,035 WARN  [qtp1992135396-3391] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 452.0072 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 11:55:43,146 WARN  [qtp1992135396-3392] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 441.2548 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c  WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:
2013-03-12 12:00:56,492 WARN  [qtp1992135396-3402] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 433.4503 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c  WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:
2013-03-12 12:03:30,933 WARN  [qtp1992135396-3392] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 467.7872 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:04:18,225 WARN  [qtp1992135396-3396] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 472.9386 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:54:49,042 WARN  [qtp1992135396-3463] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 464.9721 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:57:58,791 WARN  [qtp1992135396-3482] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 460.9083 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:57:58,803 WARN  [qtp1992135396-3490] [jdbc.internal] Query slower than 10s threshold:  actual execution time: 460.7699 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c  WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:



Can you fire up psql, point it at your puppetdb database, and run "EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL;"? That will profile the query, and we can see where the problem is.

Also worth examining is the GC query (modified to be non-destructive):

EXPLAIN ANALYZE SELECT * FROM resource_params WHERE NOT EXISTS (SELECT * FROM catalog_resources cr WHERE cr.resource=resource_params.resource);

And the resource duplication query:

EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c  WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r;

And finally, can you give us a dump of your database settings? You can do this from within psql by doing "SHOW ALL;"
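For example (just a sketch; this assumes your database is named "puppetdb" and that you can connect as a role allowed to read the settings - adjust the connection options for your environment):

# dump all server settings to a file we can look at
psql -U puppetdb -d puppetdb -c "SHOW ALL;" > puppetdb_settings.txt

Running "SHOW ALL;" from an interactive psql session works just as well.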

 

ak0ska

unread,
Mar 13, 2013, 5:00:52 AM3/13/13
to puppet...@googlegroups.com
Hello Deepak,

Here are the queries you asked for:



Can you fire up psql, point it at your puppetdb database, and run "EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL;"? That will profile the query, and we can see where the problem is.


                                                                      QUERY PLAN                                                                      
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3132186.09..3132186.10 rows=1 width=0) (actual time=348426.375..348426.376 rows=1 loops=1)
   ->  Hash Join  (cost=247.37..3096532.80 rows=14261314 width=0) (actual time=24059.151..348255.933 rows=1604253 loops=1)
         Hash Cond: ((cr.catalog)::text = (cc.catalog)::text)
         ->  Seq Scan on catalog_resources cr  (cost=0.00..2871450.86 rows=16444286 width=41) (actual time=24053.542..347250.967 rows=2107982 loops=1)
         ->  Hash  (cost=216.42..216.42 rows=2476 width=41) (actual time=5.555..5.555 rows=2476 loops=1)
               ->  Hash Join  (cost=94.84..216.42 rows=2476 width=41) (actual time=1.850..4.555 rows=2476 loops=1)
                     Hash Cond: (cc.certname = c.name)
                     ->  Seq Scan on certname_catalogs cc  (cost=0.00..82.55 rows=2855 width=62) (actual time=0.010..0.941 rows=2855 loops=1)
                     ->  Hash  (cost=63.71..63.71 rows=2490 width=21) (actual time=1.818..1.818 rows=2490 loops=1)
                           ->  Seq Scan on certnames c  (cost=0.00..63.71 rows=2490 width=21) (actual time=0.013..0.823 rows=2490 loops=1)
                                 Filter: (deactivated IS NULL)
 Total runtime: 348446.364 ms
(12 rows)

 


Also worth examining is the GC query (modified to be non-destructive):

EXPLAIN ANALYZE SELECT * FROM resource_params WHERE NOT EXISTS (SELECT * FROM catalog_resources cr WHERE cr.resource=resource_params.resource);


                                                                               QUERY PLAN                                                                              
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=0.00..1710681.18 rows=360492 width=126) (actual time=1967.182..1967.182 rows=0 loops=1)
   ->  Seq Scan on resource_params  (cost=0.00..13249.23 rows=389623 width=126) (actual time=0.013..76.403 rows=389693 loops=1)
   ->  Index Scan using idx_catalog_resources_resource on catalog_resources cr  (cost=0.00..21429.14 rows=5291 width=41) (actual time=0.004..0.004 rows=1 loops=389693)
         Index Cond: ((cr.resource)::text = (resource_params.resource)::text)
 Total runtime: 1967.253 ms
(5 rows)



And the resource duplication query:

EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c  WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r;

                                                                         QUERY PLAN                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3129997.71..3129997.72 rows=1 width=0) (actual time=474369.420..474369.420 rows=1 loops=1)
   ->  HashAggregate  (cost=3129927.78..3129958.86 rows=3108 width=41) (actual time=474351.730..474365.144 rows=49466 loops=1)
         ->  Hash Join  (cost=248.48..3094726.38 rows=14080561 width=41) (actual time=21536.467..473678.479 rows=1604237 loops=1)
               Hash Cond: ((cr.catalog)::text = (cc.catalog)::text)
               ->  Seq Scan on catalog_resources cr  (cost=0.00..2871450.86 rows=16444286 width=82) (actual time=21531.355..472692.303 rows=1934212 loops=1)
               ->  Hash  (cost=217.36..217.36 rows=2490 width=41) (actual time=5.071..5.071 rows=2476 loops=1)
                     ->  Hash Join  (cost=94.84..217.36 rows=2490 width=41) (actual time=1.700..4.048 rows=2476 loops=1)
                           Hash Cond: (cc.certname = c.name)
                           ->  Seq Scan on certname_catalogs cc  (cost=0.00..83.08 rows=2908 width=62) (actual time=0.010..0.603 rows=2855 loops=1)
                           ->  Hash  (cost=63.71..63.71 rows=2490 width=21) (actual time=1.670..1.670 rows=2490 loops=1)
                                 ->  Seq Scan on certnames c  (cost=0.00..63.71 rows=2490 width=21) (actual time=0.014..0.785 rows=2490 loops=1)
                                       Filter: (deactivated IS NULL)
 Total runtime: 474370.129 ms
(13 rows)

 
And finally, can you give us a dump of your database settings? You can do this from within psql by doing "SHOW ALL;"

http://pastebin.com/raw.php?i=HfhqfVa4

Thank you for your help!

Cheers,

ak0ska
 

Ken Barber

unread,
Mar 14, 2013, 3:05:59 AM3/14/13
to Puppet Users
Hi ak0ska,

So I've been spending the last two days trying all kinds of things to
replicate your constraint violation problem, and I'm still getting
nowhere with it. I've been speaking to all kinds of smart people and
we believe it's some sort of lock and/or transaction-mode problem, but
none of my testing can replicate it.

My latest test was something like:

https://gist.github.com/kbarber/5157836

But with all that preparation and load simulation I still couldn't
replicate it :-(.

I think I need more data from you ... in this case, a statement log.
Now a statement log can be pretty heavy, so capturing just the data
surrounding a GC run should be enough - however, I anticipate it might
reduce performance somewhat, so if you're apprehensive about doing it
in production I fully understand. I'm just running out of ideas, and
without more data I may have trouble coming to a conclusion.

Anyway, if at all possible, the instructions for enabling this are as follows (a sample snippet of the resulting config is shown after the list):

* Go to your postgresql.conf file (On RHEL 6 this is usually located
in /var/lib/pgsql/data)
* Add/modify the entries:
log_statement = 'all'
log_line_prefix = '%m PID[%p] SID[%c] VTID[%v] TID[%x] '
* Restart postgresql
* Logs should go to: /var/lib/pgsql/data/pg_log
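To make that concrete, here is a rough sketch of the relevant postgresql.conf lines and the commands around them (the paths and service name assume a stock RHEL 6 install, and the exact log file name depends on your log_filename setting - adjust to taste):

# /var/lib/pgsql/data/postgresql.conf
log_statement = 'all'
log_line_prefix = '%m PID[%p] SID[%c] VTID[%v] TID[%x] '

# restart postgresql so the change takes effect
service postgresql restart

# statement logs will land in /var/lib/pgsql/data/pg_log/
tail -f /var/lib/pgsql/data/pg_log/postgresql-*.log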

The closer you can enable this to _just before_ a GC run, the better,
as it reduces the amount of time the setting needs to be on. Once a GC
run has happened and a subsequent constraint violation has been
logged, you can turn it off again. Don't worry about PuppetDB during
this window: it retries the database, so if you can take the small hit
on queries, the outage should be quick while you switch the setting
off and on (be mindful that catalog/fact/report submissions will get
queued, and queries will of course fail during the short DB restart).

Now, the data itself will obviously contain private items, so if you
prefer to send it to me privately that's fine (contact me off this
list if so). Otherwise, it might need some scrubbing.

The concern I have with your issue - even though it doesn't stop a
catalog from being submitted - is that you're seeing a 'block' when
those two catalogs get submitted, which makes me fear that the data
doesn't get populated until _after_ the GC is complete. This is a
concern because we don't want data to be delayed like that. If that
weren't the case we could happily let it fail and retry - ultimately
the data isn't lost - but we should be able to process data within
reasonable time limits without you having to worry about database GC
slowing things down.

Now, in regards to your performance, we are going to keep looking into
the results you gave us. Looking at your settings, Deepak and I think
you probably need to increase at least work_mem to 16MB and
maintenance_work_mem to 32MB (the latter is used during vacuum), the
main item here really being work_mem. These are reasonable changes
we've seen other users apply with success; on their own they probably
won't solve your issue, but it would be good to get you running with
higher values for these - see the sketch below.
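Something along these lines in postgresql.conf, for instance (a sketch only - the right values depend on your RAM and connection count, so treat these numbers as the starting point described above rather than a tuned recommendation):

# postgresql.conf - memory settings discussed above
work_mem = 16MB                # per-sort/per-hash memory, the main item here
maintenance_work_mem = 32MB    # used by VACUUM and index builds

# a reload is enough for these two settings (no restart needed)
service postgresql reload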

I'm also very curious to understand the disk IO on your systems. What
is the nature of the disks where your database is located? Are they
SCSI/SATA/FC or otherwise? What filesystem type are you using? Can you
give me any other details you might find interesting?

What are the results of the command:

iostat -k 5

A few samples of the run would be useful to see, so we can understand
the average disk IO going on here.
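For instance, something like this gives us a minute's worth of samples to look at (assuming the sysstat package is installed; 12 samples at 5-second intervals):

# 12 samples, 5 seconds apart, in kB/s, saved for later
iostat -k 5 12 > iostat_sample.txt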

FYI, to give you an idea of what we _should_ be seeing for, say, just
one of your queries, here is the same query on a CentOS 5.8 box
(actually running just PostgreSQL 8.1):

puppetdb=# EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r;
                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=91792.54..91792.55 rows=1 width=0) (actual time=2611.872..2611.872 rows=1 loops=1)
   ->  HashAggregate  (cost=91737.14..91761.76 rows=2462 width=41) (actual time=2606.507..2609.489 rows=5478 loops=1)
         ->  Hash Join  (cost=184.36..90322.32 rows=565926 width=41) (actual time=7.060..2072.000 rows=867847 loops=1)
               Hash Cond: ((cr.catalog)::text = (cc.catalog)::text)
               ->  Seq Scan on catalog_resources cr  (cost=0.00..80139.47 rows=867847 width=82) (actual time=0.021..1119.260 rows=867847 loops=1)
               ->  Hash  (cost=159.17..159.17 rows=2015 width=41) (actual time=7.032..7.032 rows=2195 loops=1)
                     ->  Hash Join  (cost=62.34..159.17 rows=2015 width=41) (actual time=2.438..5.830 rows=2195 loops=1)
                           Hash Cond: (cc.certname = c.name)
                           ->  Seq Scan on certname_catalogs cc  (cost=0.00..65.79 rows=2179 width=50) (actual time=0.005..0.908 rows=2195 loops=1)
                           ->  Hash  (cost=37.15..37.15 rows=2015 width=9) (actual time=2.427..2.427 rows=2195 loops=1)
                                 ->  Seq Scan on certnames c  (cost=0.00..37.15 rows=2015 width=9) (actual time=0.008..1.088 rows=2195 loops=1)
                                       Filter: (deactivated IS NULL)
 Total runtime: 2611.935 ms
(13 rows)

As you can see, 2 seconds on my box versus 475 seconds on yours - the
difference is just too large to be considered a normal discrepancy
:-). Be mindful, this is on data that represents 2000 nodes ... so
it's not like my tables are smaller than yours (at least not
drastically). However, I do notice your number of resources is much
higher, so I'm going to try to replicate that.

For now, I'm going to keep pondering your performance, as I believe
that affects you the most - so let us get back to you with more
updates. In the meantime, if you can give me the details I've
requested above, that would be appreciated.

ken.

ak0ska

unread,
Mar 14, 2013, 6:36:28 AM3/14/13
to puppet...@googlegroups.com
Hello Ken,

I really appreciate you guys looking into this problem, and I'm happy to provide the data you ask for. However, I feel I should ask: do you think this problem is worth your effort, if rebuilding the database might solve the issue?

Cheers,

ak0ska

Ken Barber

unread,
Mar 14, 2013, 5:12:41 PM3/14/13
to Puppet Users
So I have this sinking feeling that all of your problems (including
the constraint side-effect) are related to general performance issues
on your database, for some reason we are yet to determine. This could
be related to IO contention, it could be a bad index (although you've
rebuilt them all, right?), or something else. What continues to bother
me is the fact that you have only managed to reduce your DB size to
30 GB; this is far larger than I expect, and it screams 'problem' to
me. It should be less than 5 GB, at least based on my rough
calculations from other DBs I've seen in the wild.
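If you want to see where that space is actually going, something like this (a sketch - run it from psql against the puppetdb database) lists the largest relations, including their indexes and TOAST data:

-- top 10 relations by total on-disk size
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;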

It could very well be that something still isn't sufficiently
organised (index, tablespace data) for optimal performance. Recreating
the database is by all means a very fast way to rebuild everything -
it's like heaving an axe, though. Having said that, it is fast and
would eliminate a lot of back and forth between us trying to solve
this problem.

A dump, drop, recreate and restore is another way - not 100% the same
thing, but close to it; a sketch of that follows.
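Roughly, that would look something like this (just a sketch - it assumes the database and role are both named "puppetdb" and that you stop the PuppetDB service first; adjust names, users and paths for your setup):

# take a compressed custom-format dump
pg_dump -Fc -U puppetdb puppetdb > puppetdb.dump

# drop and recreate the database, then restore into it
dropdb -U postgres puppetdb
createdb -U postgres -O puppetdb puppetdb
pg_restore -U puppetdb -d puppetdb puppetdb.dump

Keep the dump (and ideally a copy of the original data directory) around, so we don't lose the evidence if we want to dig further later.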

It's entirely up to you - if you want to try either option then go
ahead. It would be a fast way of achieving a lot of small steps in one
go. If it works, we may have lost some clarity on the issue, however,
so it would be good to keep the original DB backup. If it doesn't
work, at least we have ruled out a lot of possibilities in one quick
action. So yeah, I wouldn't be upset if you take this route - there is
a reason I suggested it earlier, and that is that it performs a lot of
actions in one single step ... I still think it is a viable action,
but since you were keen on trying other things I was happy to go along
:-). At the end of the day, your time and your risk assessment are up
to you. If it were me, yeah, I would try it, especially considering
the amount of time already spent with very little final result.

Thanks for being patient with me on this, btw ... :-).

ken.

Ken Barber

unread,
Mar 15, 2013, 1:00:08 AM3/15/13
to Puppet Users
Hi ak0ska,

FWIW - with the help of some of my colleagues, we've managed to
replicate your constraint issue in a lab-style environment now:

https://gist.github.com/kbarber/5157836

Which is a start. It requires a unique precondition to replicate,
however, and I've been unable to reproduce it in any way with PuppetDB
directly, so the statement log will still be helpful if you can supply
it.
ken.

Ken Barber

unread,
Mar 21, 2013, 6:19:52 AM3/21/13
to Puppet Users
Hi ak0ska,

How are things going? Anything to report?

ken.