PuppetDB 1.6.3 Garbage collection fails abruptly


Andreas Paul

Jan 2, 2015, 9:44:59 AM
to puppet...@googlegroups.com
Hi there,

The PuppetDB garbage collection fails, sometimes after 5 minutes and sometimes after a few hours, with the following error message:

2014-12-29 12:12:57,312 INFO  [pool-3-thread-1] [cli.services] Starting database garbage collection
2014-12-29 16:31:39,106 ERROR [pool-3-thread-1] [cli.services] Error during garbage collection


2015-01-02 08:03:18,464 INFO  [pool-3-thread-3] [cli.services] Starting database garbage collection
[...]
2015-01-02 08:08:38,255 ERROR [pool-3-thread-3] [cli.services] Error during garbage collection
java.sql.BatchUpdateException: Batch entry 0 DELETE FROM resource_params_cache WHERE NOT EXISTS (SELECT * FROM catalog_resources cr WHERE cr.resource=resource_params_cache.resource) was aborted.  Call getNextException to see the cause.
        at org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2746)
        at org.postgresql.core.v3.QueryExecutorImpl$1.handleError(QueryExecutorImpl.java:457)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1887)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893)
[...]
2015-01-02 08:08:38,256 INFO  [pool-3-thread-3] [cli.services] Starting sweep of stale nodes (threshold: 35 days)
2015-01-02 08:08:38,500 INFO  [pool-3-thread-3] [cli.services] Finished sweep of stale nodes (threshold: 35 days)



During this time no commands can be processed by any PuppetDB server, so the queue starts to grow very quickly.


Here is some information about the PuppetDB data sets involved in the garbage collection query:

# EXPLAIN output for the SELECT form of the GC delete
explain select * FROM resource_params_cache WHERE NOT EXISTS (SELECT * FROM catalog_resources cr WHERE cr.resource=resource_params_cache.resource);
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=0.00..2089716.08 rows=19861941 width=314)
   ->  Seq Scan on resource_params_cache  (cost=0.00..1267703.93 rows=19876193 width=314)
   ->  Index Only Scan using idx_catalog_resources_resource on catalog_resources cr  (cost=0.00..11.15 rows=270 width=41)
         Index Cond: (resource = (resource_params_cache.resource)::text)

Tables:
 Schema |          Name           | Type  |    Owner    |  Size   | Description 
--------+-------------------------+-------+-------------+---------+-------------
 public | catalog_resources       | table | puppet_prod | 1696 MB | 
 public | catalogs                | table | puppet_prod | 11 MB   | 
 public | certname_facts          | table | puppet_prod | 588 MB  | 
 public | certname_facts_metadata | table | puppet_prod | 6128 kB | 
 public | certnames               | table | puppet_prod | 14 MB   | 
 public | edges                   | table | puppet_prod | 1227 MB | 
 public | latest_reports          | table | puppet_prod | 9296 kB | 
 public | reports                 | table | puppet_prod | 942 MB  | 
 public | resource_events         | table | puppet_prod | 11 GB   | 
 public | resource_params         | table | puppet_prod | 98 GB   | 
 public | resource_params_cache   | table | puppet_prod | 91 GB   | 
 public | schema_migrations       | table | puppet_prod | 40 kB   |

# Indexes
 Schema |                     Name                     | Type  |    Owner    |          Table          |  Size   | Description 
--------+----------------------------------------------+-------+-------------+-------------------------+---------+-------------
 public | catalog_resources_pkey                       | index | puppet_prod | catalog_resources       | 536 MB  | 
 public | catalogs_certname_key                        | index | puppet_prod | catalogs                | 4680 kB | 
 public | catalogs_hash_key                            | index | puppet_prod | catalogs                | 6632 kB | 
 public | catalogs_pkey                                | index | puppet_prod | catalogs                | 2152 kB | 
 public | certname_facts_metadata_certname_key         | index | puppet_prod | certname_facts_metadata | 5240 kB | 
 public | certname_facts_metadata_pkey                 | index | puppet_prod | certname_facts_metadata | 6304 kB | 
 public | certname_facts_pkey                          | index | puppet_prod | certname_facts          | 505 MB  | 
 public | certnames_pkey                               | index | puppet_prod | certnames               | 5120 kB | 
 public | constraint_resource_events_unique            | index | puppet_prod | resource_events         | 8067 MB | 
 public | edges_certname_source_target_type_unique_key | index | puppet_prod | edges                   | 2365 MB | 
 public | idx_catalog_resources_exported_true          | index | puppet_prod | catalog_resources       | 23 MB   | 
 public | idx_catalog_resources_resource               | index | puppet_prod | catalog_resources       | 1197 MB | 
 public | idx_catalog_resources_type                   | index | puppet_prod | catalog_resources       | 1473 MB | 
 public | idx_catalog_resources_type_title             | index | puppet_prod | catalog_resources       | 1473 MB | 
 public | idx_catalogs_transaction_uuid                | index | puppet_prod | catalogs                | 14 MB   | 
 public | idx_certname_facts_certname                  | index | puppet_prod | certname_facts          | 511 MB  | 
 public | idx_certname_facts_name                      | index | puppet_prod | certname_facts          | 877 MB  | 
 public | idx_latest_reports_report                    | index | puppet_prod | latest_reports          | 11 MB   | 
 public | idx_reports_certname                         | index | puppet_prod | reports                 | 543 MB  | 
 public | idx_reports_end_time                         | index | puppet_prod | reports                 | 180 MB  | 
 public | idx_reports_transaction_uuid                 | index | puppet_prod | reports                 | 210 MB  | 
 public | idx_resource_events_containing_class         | index | puppet_prod | resource_events         | 1150 MB | 
 public | idx_resource_events_property                 | index | puppet_prod | resource_events         | 1150 MB | 
 public | idx_resource_events_report                   | index | puppet_prod | resource_events         | 4089 MB | 
 public | idx_resource_events_resource_type            | index | puppet_prod | resource_events         | 1162 MB | 
 public | idx_resource_events_resource_type_title      | index | puppet_prod | resource_events         | 3461 MB | 
 public | idx_resource_events_status                   | index | puppet_prod | resource_events         | 1150 MB | 
 public | idx_resource_events_timestamp                | index | puppet_prod | resource_events         | 1259 MB | 
 public | idx_resources_params_name                    | index | puppet_prod | resource_params         | 5397 MB | 
 public | idx_resources_params_resource                | index | puppet_prod | resource_params         | 12 GB   | 
 public | latest_reports_pkey                          | index | puppet_prod | latest_reports          | 5152 kB | 
 public | reports_pkey                                 | index | puppet_prod | reports                 | 594 MB  | 
 public | resource_params_cache_pkey                   | index | puppet_prod | resource_params_cache   | 2057 MB | 
 public | resource_params_pkey                         | index | puppet_prod | resource_params         | 13 GB   | 
 public | schema_migrations_pkey                       | index | puppet_prod | schema_migrations       | 16 kB   | 

# Statistics for resource_params_cache
table_len          | 8756715520
tuple_count        | 24245975
tuple_len          | 8300471564
tuple_percent      | 94.79
dead_tuple_count   | 348
dead_tuple_len     | 134886
dead_tuple_percent | 0
free_space         | 219676684
free_percent       | 2.51
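
(For reference, statistics in this form can be produced by the pgstattuple extension; that this is what was used here is an assumption.)

# Assumes the pgstattuple extension is installed (CREATE EXTENSION pgstattuple;)
# and that the database is named puppetdb. pgstattuple scans the whole table,
# so this can take a while on a table of this size.
psql -d puppetdb -c "SELECT * FROM pgstattuple('resource_params_cache');"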


select count(*) from catalog_resources;
  count
---------
 3839182

select count(*) from resource_params_cache;
  count   
----------
 24246308





We will upgrade when our Puppet 3 servers are ready, but as it looks right now we're stuck with operating Puppet 2.x for some time to come, so we'd like to fix this problem right now.

Thanks in advance!

Wyatt Alt

Jan 2, 2015, 11:32:36 AM
to puppet...@googlegroups.com, Andreas Paul
Hey Andreas,

Can you give some more information about how this cropped up?

- Have you recently upgraded PuppetDB, or changed any other part of your
Puppet/PuppetDB installation?
- Was PuppetDB running fine before this started occurring, or are you
still getting it set up?
- What version of PuppetDB are you running?
- What version of Puppet are you running?
- What version of Postgres are you running?
- Can you post the full stacktrace, perhaps in a gist?
- What is the operating system running PuppetDB?
- When the new commands are failing, they should be showing up in
/var/lib/puppetdb/mq/discarded. If so, can you find one of the failed
commands submitted after this error and gist that as well?

That should help get the ball rolling -- to my knowledge we haven't seen
this before.

Thanks,
Wyatt

Andreas Paul

Jan 2, 2015, 12:19:06 PM
to puppet...@googlegroups.com, xor...@gmail.com
Hi Wyatt,


- Have you recently upgraded PuppetDB, or changed any other part of your Puppet/PuppetDB installation?
No, we haven't changed our Puppet client/server or PuppetDB version recently.


- Was PuppetDB running fine before this started occurring, or are you still getting it set up?
PuppetDB is still running fine except when it tries to compact the database with the garbage collection. After the exception the queue gets emptied in a matter of seconds and the commands get processed as usual. We have been running this setup for over a year now, and it runs fine except for this GC issue.


- What version of PuppetDB are you running?
1.6.3


- What version of Puppet are you running?
2.7.25


- What version of Postgres are you running?
PostgreSQL 9.2.9


- Can you post the full stacktrace, perhaps in a gist?


- What is the operating system running PuppetDB?
Debian Wheezy


- When the new commands are failing, they should be showing up in /var/lib/puppetdb/mq/discarded. If so, can you find one of the failed
commands submitted after this error and gist that as well?
We have some failing commands, but I don't think they correlate with this GC problem. I can send you one of the commands to your email address, as they could contain sensitive information that I don't want to post to GitHub. I'm xorpaul on Freenode in #puppet, just send me a message.



Our DB admin thinks that the SELECT query over this large table is too much for the application, but we don't know why the GC sometimes aborts after 5 minutes and sometimes only after hours of trying.
He also saw that PuppetDB tries to start a new transaction within a transaction without first closing the previous one, but this could be coincidental to our problem.

Thanks!

Andreas Paul

Jan 7, 2015, 12:07:18 PM
to puppet...@googlegroups.com, xor...@gmail.com
Hey,

our database admin solved this problem by executing the following SQL query in a 3-second sleep loop over multiple days:

DELETE FROM resource_params_cache
WHERE resource_params_cache.resource IN
  (SELECT resource_params_cache.resource
   FROM resource_params_cache
   WHERE NOT EXISTS (SELECT * FROM catalog_resources cr
                     WHERE cr.resource = resource_params_cache.resource)
   LIMIT 2000)
RETURNING resource_params_cache;
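
A minimal sketch of such a loop, assuming shell access on the database host and a database named puppetdb (the database name and connection details are assumptions; the 2000-row batch size matches the query above):

# Run the batched delete every 3 seconds until no orphaned rows remain.
# Database name is an assumption; adjust batch size and interval to your load.
# psql prints "DELETE n" after each batch, so progress is visible; stop the
# loop (Ctrl-C) once it keeps reporting "DELETE 0".
while true; do
  psql -d puppetdb -c "
    DELETE FROM resource_params_cache
    WHERE resource IN (
      SELECT resource
      FROM resource_params_cache rpc
      WHERE NOT EXISTS (
        SELECT 1 FROM catalog_resources cr
        WHERE cr.resource = rpc.resource)
      LIMIT 2000);"
  sleep 3
done

Keeping each batch small and sleeping between batches keeps the delete transactions short, so regular PuppetDB command processing is not blocked for long.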


After this cleanup the table size shrank by 93%, the freed space was returned to the OS with VACUUM FULL (a quick sketch of that step follows the log below), and the database GC finished in 3 minutes:

2015-01-07 16:09:34,190 INFO  [pool-3-thread-1] [cli.services] Starting database garbage collection
2015-01-07 16:12:21,296 INFO  [pool-3-thread-1] [cli.services] Finished database garbage collection
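
For completeness, a sketch of the space-reclaim step, again assuming the database is named puppetdb; note that VACUUM FULL takes an ACCESS EXCLUSIVE lock on the table while it rewrites it, so writes touching resource_params_cache will block for the duration:

# Rewrite resource_params_cache on disk and return the freed space to the OS.
# Holds an ACCESS EXCLUSIVE lock on the table for the duration of the rewrite.
psql -d puppetdb -c "VACUUM FULL resource_params_cache;"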