ERROR: insert or update on table "certname_catalogs" violates foreign key constraint "certname_catalogs_catalog_fkey"
DETAIL: Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in table "catalogs".
STATEMENT: INSERT INTO certname_catalogs (certname,catalog,timestamp) VALUES ($1,$2,$3) RETURNING *
If you clear the queue and roll back to the original version, does the
problem disappear? If you're having processing problems at the latest
version that's what I would do, as I presume we're talking production
here, right?
> ERROR: insert or update on table "certname_catalogs" violates foreign key constraint "certname_catalogs_catalog_fkey"
> DETAIL: Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in table "catalogs".
> STATEMENT: INSERT INTO certname_catalogs (certname,catalog,timestamp) VALUES ($1,$2,$3) RETURNING *
That is a bit of a concern - are you receiving a lot of these? Is it constant?
> Use of unversioned APIs is deprecated; please use /v1/resources
These are appearing in your slow query log - are you sure you're
running puppetdb-terminus 1.1.1 as well? Did you restart your master
after upgrading the terminus?
I'm @ken_barber on IRC btw, if that is easier.
Okay. Did you clear the ActiveMQ queues after doing this? I usually
just move the old KahaDB directory out of the way when I do this.
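Something like this is what I mean - the paths below are the usual defaults for a package install, so adjust if your layout differs:

service puppetdb stop
# park the old ActiveMQ data; PuppetDB recreates an empty queue on startup
mv /var/lib/puppetdb/mq/localhost/KahaDB /var/lib/puppetdb/mq/localhost/KahaDB.old
service puppetdb start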
Okay, so were those old queries? The ones that are taking 90 minutes - what are they?
Have you tried shutting down PuppetDB, clearing the queue, running a
full vacuum on the database and then starting it all up again?
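The vacuum step would be something along these lines - I'm assuming here that the database is called puppetdb and that you can run psql as the postgres user:

sudo -u postgres psql puppetdb -c 'VACUUM FULL VERBOSE ANALYZE;'

VERBOSE is optional, it just gives you progress output so you know it hasn't stalled.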
What version of PostgreSQL are we talking here? What distro and
version specifically? How big is the database itself?
How many nodes and how frequently are they checking in? A screenshot
of your puppetdb dashboard might give us a good snapshot of the size
of your environment as well.
Has any tuning been performed on your postgresql server? A copy of
your postgresql.conf would be interesting.
We can do this all privately if you are concerned about sharing
information publicly as well - up to you. Like I said, I'm also on
Freenode IRC as @ken_barber if this is better for you.
ken.
You mean, you've only been watching it for a few minutes, and so far
so good - or it crashed? Sorry - just want to be clear :-).
It's not recommended as a recurring maintenance task, this is true,
but if your DB has had major changes due to a schema upgrade, or if
vacuum hasn't run in a while, it's generally okay and can provide some
benefit. But yeah, judging by the size of your DB this will take a
long time. I guess my point is, every recommendation has a caveat or
back-story.
Wow - it should be more like 5 GB or less for your size.
It sounds like you've got some major fragmentation issues; your
indexes may need rebuilding, and yeah - a vacuum plus an index rebuild
will probably help. But this will require an outage, I would say.
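The index rebuild itself could be as simple as the following - again assuming the database is named puppetdb, and bearing in mind it takes locks, so it belongs inside that outage window:

REINDEX DATABASE puppetdb;
-- or one table at a time if you want to stage it, e.g.:
REINDEX TABLE catalog_resources;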
When was the last time it was vacuumed? Try the SQL:

select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
from pg_stat_user_tables;
And give us the output. I think autovacuum should be on by default for
PostgreSQL 8.4 on RedHat, but I can't recall.
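A quick way to check on your side, from a psql prompt on that box:

SHOW autovacuum;
-- autovacuum also needs the stats collector, so this should be 'on' too:
SHOW track_counts;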
On another note ... to be honest I can't be precise about why the
upgrade failed. I'm pretty sure you were running old revisions of the
terminus when you upgraded, but that shouldn't cause this kind of anguish.
The errors you were receiving about constraints:
Key (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not
present in table "catalogs".
should not be occurring at all, which is all very suspect - alas, I
have no clue yet as to why. Have they stopped now that we have cleared
the queue and restarted?
What is the size of your database? CPU/cores ... and RAM on the box?
Does your puppetdb service live on the same node as the database? I'm
guessing this to be true, as your postgresql.conf is listening only on
'localhost' ... what is the memory consumption of your apps? The
output of 'free' would probably be a good start.
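For example, something like this would give a decent first picture (the ps line is just a rough way to see the JVM footprint, assuming PuppetDB is the only java process on the box):

free -m
ps -o pid,rss,cmd -C java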
FYI, I just upgraded only the PuppetDB part to 1.1.1; using the old
1.0.2 terminus I get no errors:
2013-02-28 17:06:27,711 WARN  [qtp1478462104-39] [http.server] Use of unversioned APIs is deprecated; please use /v1/commands
2013-02-28 17:06:28,284 INFO  [command-proc-44] [puppetdb.command] [7a4a1f70-1e85-4256-8f95-21008b6c1199] [replace facts] puppetdb2.vm
2013-02-28 17:06:30,724 WARN  [qtp1478462104-32] [http.server] Use of unversioned APIs is deprecated; please use /v1/commands
2013-02-28 17:06:32,693 INFO  [command-proc-44] [puppetdb.command] [a87001c5-f064-4a2d-9f9e-de903f62f824] [replace catalog] puppetdb2.vm
If at all possible - I wouldn't mind a full copy of your puppetdb.log
... to dig a bit deeper. And I know I told you to clear the KahaDB
queue (I always make this mistake) but I don't suppose you kept an old
copy of it?
Oh - and a copy of the current dead letter queue would be nice, it's
normally stored in:
/var/lib/puppetdb/mq/discarded/*
This should also contain the full exceptions for the failed SQL as I
mentioned earlier, so perhaps a glance into those now and letting me
know what the prevalent failure is would be handy.
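Something along these lines would give us that summary - this assumes each discarded command sits in its own file under that directory, which is how it normally looks:

ls -lt /var/lib/puppetdb/mq/discarded/
# pull out the exception/error lines and count the most common ones
grep -rh -e Exception -e ERROR /var/lib/puppetdb/mq/discarded/ | sort | uniq -c | sort -rn | head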
I can organise a secure space on a Puppetlabs support storage area to
upload this data if you are willing. Just contact me privately to
organise it.
> So I've been pondering this issue of yours, and I keep coming back to
> that error in my mind:
>
>   ERROR: insert or update on table "certname_catalogs" violates foreign
> key constraint "certname_catalogs_catalog_fkey"
>
> Regardless of the other issues, 512 GB db - yes it's big, but so what?
> That shouldn't cause an integrity violation. It could be a big hint to
> something else, but I'm not entirely sure this is the cause.
>
> This error - can we try to track down the corresponding PuppetDB error
> message that goes with it in the puppetdb.log or in perhaps the dead
> letter queue for activemq? If postgresql is throwing the error - then
> there must be a corresponding exception in puppetdb.log.
>
> It's just that when I look at the code ... it doesn't make sense that
> the insert fails here - "Key
> (catalog)=(d1c89bbef78a55edcf560c432d965cfe1263059c) is not present in
> table "catalogs". ... but I can't yet see a fault in the logic inside
> the code.
>
> Still, it's going to be hard to replicate without the data on my side.
>
> As far as your issue ... you have a couple of avenues:
>
> * Drop the existing database, allow it to be recreated from scratch -
> run noop several times across your entire infrastructure to
> repopulate it. Has the advantage of being an effective vacuum at the
> same time :-).
> * Rollback the PuppetDB version and the database. This will at least
> get you back to a known working state, but will still leave you in a
> place where you need to recover.
> In either case - you would want to take a backup of the existing database
> for diagnostic purposes. But I really feel like there is something
> structurally wrong here - perhaps the migration scripts have failed
> during the database upgrade?
> But I'm guessing you can't stay broken for very long - so I would take
> a snapshot of the current broken state so we can take a look and try
> to replicate it, and make moves on getting back to a working state as
> suggested above.
>
> I'll need:
>
> * PuppetDB logs
> * Backup of your current DB
> * Backup of your old DB
> * The broken KahaDB queues
>
> ken.
>
Wow. That's still way too large for the amount of nodes. I imagine
re-indexing might help, but we can check first. Can you display how big
each of your relations are? A query like this might help:
SELECT nspname || '.' || relname AS "relation",
       pg_size_pretty(pg_relation_size(C.oid)) AS "size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
 WHERE nspname NOT IN ('pg_catalog', 'information_schema')
 ORDER BY pg_relation_size(C.oid) DESC
 LIMIT 20;
Also - what are the row counts for each of your tables btw?
If you run this for each table it will get you a row count:
select count(*) from table_name;
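If running that per table is tedious, the statistics views will give you approximate live/dead row counts for everything in one go - they're only estimates, but good enough to spot a table that's wildly oversized:

select relname, n_live_tup, n_dead_tup
from pg_stat_user_tables
order by n_live_tup desc;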
I'd be curious to see if any of these are higher than they should be
based on your node count - maybe some of the space is just a few large
tables?
Perhaps the queries were just timing out on the dashboard?
This should be evident in your puppetdb.log if you trace the UUID of a
command. If commands 'fail' completely, they end up in the DLQ located
here:
/var/lib/puppetdb/mq/discarded/
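To trace a command, just grep for its UUID - for example, using one of the UUIDs from the earlier log snippet (the log path here assumes a package install):

grep '7a4a1f70-1e85-4256-8f95-21008b6c1199' /var/log/puppetdb/puppetdb.log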
So the index 'idx_catalog_resources_tags' was removed in 1.1.0 I
think, so that is no longer needed.
This points back to making sure your schema matches exactly what a
known good 1.1.1 has, as things have obviously been missed.
Try this tool out:
http://www.apgdiff.com/diff_online.php
I've got a pg_dump of a good 1.1.1 here:
https://gist.github.com/kbarber/5104169
All you need to do is perform a pg_dump of your schema:
pg_dump --schema-only puppetdb
And use that online tool to compare your schema with the one I've
provided. I would obviously double-check its accuracy before doing
anything drastic, but this should help to get your schema in line with
what it should be. If you are paranoid about any part of the diff,
paste the results in a gist and we can discuss it.
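If you'd rather not paste your schema into a web form, the same comparison can be run locally with the apgdiff jar (the filenames here are just placeholders - good_111_schema.sql would be the dump from the gist above):

pg_dump --schema-only puppetdb > my_schema.sql
java -jar apgdiff.jar my_schema.sql good_111_schema.sql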
Okay, well - let's at least get your schema back in line with what it
should be, get your re-indexing done, and we can come back round to this
problem if it still exists.
ken.
2013-03-12 11:54:37,149 WARN [qtp1992135396-3384] [jdbc.internal] Query slower than 10s threshold: actual execution time: 452.2331 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 11:54:42,035 WARN [qtp1992135396-3391] [jdbc.internal] Query slower than 10s threshold: actual execution time: 452.0072 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 11:55:43,146 WARN [qtp1992135396-3392] [jdbc.internal] Query slower than 10s threshold: actual execution time: 441.2548 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:
2013-03-12 12:00:56,492 WARN [qtp1992135396-3402] [jdbc.internal] Query slower than 10s threshold: actual execution time: 433.4503 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:
2013-03-12 12:03:30,933 WARN [qtp1992135396-3392] [jdbc.internal] Query slower than 10s threshold: actual execution time: 467.7872 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:04:18,225 WARN [qtp1992135396-3396] [jdbc.internal] Query slower than 10s threshold: actual execution time: 472.9386 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:54:49,042 WARN [qtp1992135396-3463] [jdbc.internal] Query slower than 10s threshold: actual execution time: 464.9721 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:57:58,791 WARN [qtp1992135396-3482] [jdbc.internal] Query slower than 10s threshold: actual execution time: 460.9083 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:57:58,803 WARN [qtp1992135396-3490] [jdbc.internal] Query slower than 10s threshold: actual execution time: 460.7699 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:
time curl -X GET -H 'Accept: application/json' http://[puppetdbip]:8080/v2/metrics/mbean/com.puppetlabs.puppetdb.query.population:type=default,name=pct-resource-dupes -v
* About to connect() to [puppetdbip] port 8080 (#0)
*   Trying [puppetdbip]... connected
* Connected to [puppetdbip] ([puppetdbip]) port 8080 (#0)
> GET /v2/metrics/mbean/com.puppetlabs.puppetdb.query.population:type=default,name=pct-resource-dupes HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.6.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2
> Host: [puppetdbip]:8080
> Accept: application/json
>
< HTTP/1.1 200 OK
< Date: Tue, 12 Mar 2013 11:55:34 GMT
< Content-Type: application/json;charset=ISO-8859-1
< Content-Length: 34
< Server: Jetty(7.x.y-SNAPSHOT)
<
{
 "Value" : 0.9691941482891232
* Connection #0 to host [puppetdbip] left intact
* Closing connection #0
}
real    16m57.604s
user    0m0.012s
sys     0m0.025s
I think my previous comment just got lost.
So, I cut three occurrences of this error from the database log and the corresponding part from the puppetdb log. I removed the hostnames, I hope it's still sensible: http://pastebin.com/yvyBDWQE
The unversioned api warnings are not from the masters. They're from an older version of the puppetdbquery library, and the puppetdb module's puppetdb_conn_validator function.
Apart from that, there are 2 kinds of slow queries that appear in the logs:
2013-03-12 11:54:37,149 WARN [qtp1992135396-3384] [jdbc.internal] Query slower than 10s threshold: actual execution time: 452.2331 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 11:54:42,035 WARN [qtp1992135396-3391] [jdbc.internal] Query slower than 10s threshold: actual execution time: 452.0072 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 11:55:43,146 WARN [qtp1992135396-3392] [jdbc.internal] Query slower than 10s threshold: actual execution time: 441.2548 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:
2013-03-12 12:00:56,492 WARN [qtp1992135396-3402] [jdbc.internal] Query slower than 10s threshold: actual execution time: 433.4503 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:
2013-03-12 12:03:30,933 WARN [qtp1992135396-3392] [jdbc.internal] Query slower than 10s threshold: actual execution time: 467.7872 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:04:18,225 WARN [qtp1992135396-3396] [jdbc.internal] Query slower than 10s threshold: actual execution time: 472.9386 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:54:49,042 WARN [qtp1992135396-3463] [jdbc.internal] Query slower than 10s threshold: actual execution time: 464.9721 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:57:58,791 WARN [qtp1992135396-3482] [jdbc.internal] Query slower than 10s threshold: actual execution time: 460.9083 seconds; Query: SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL; Query Params:
2013-03-12 12:57:58,803 WARN [qtp1992135396-3490] [jdbc.internal] Query slower than 10s threshold: actual execution time: 460.7699 seconds; Query: SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r; Query Params:
--
Can you fire up psql, point it at your puppetdb database, and run "EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM certname_catalogs cc, catalog_resources cr, certnames c WHERE cc.catalog=cr.catalog AND c.name=cc.certname AND c.deactivated IS NULL;"? That will profile the query, and we can see where the problem is.
Also worth examining is the GC query (modified to be non-destructive):
EXPLAIN ANALYZE SELECT * FROM resource_params WHERE NOT EXISTS (SELECT * FROM catalog_resources cr WHERE cr.resource=resource_params.resource);
And the resource duplication query:
EXPLAIN ANALYZE SELECT COUNT(*) AS c FROM (SELECT DISTINCT resource FROM catalog_resources cr, certname_catalogs cc, certnames c WHERE cr.catalog=cc.catalog AND cc.certname=c.name AND c.deactivated IS NULL) r;
And finally, can you give us a dump of your database settings? You can do this from within psql by doing "SHOW ALL;"