Jira (PDB-4444) PuppetDB never finishes migrating resource_events


Luke Bigum (JIRA)

Jun 26, 2019, 5:59:03 AM
to puppe...@googlegroups.com
Luke Bigum created an issue
 
PuppetDB / Bug PDB-4444
PuppetDB never finishes migrating resource_events
Issue Type: Bug
Assignee: Unassigned
Components: PuppetDB
Created: 2019/06/26 2:58 AM
Priority: Normal
Reporter: Luke Bigum

In our environment, a PuppetDB upgrade never completes: any schema migration that touches `resource_events` takes too long (over 12 hours). The PuppetDB JVM either crashes with an OutOfMemoryError, or I give up, kill it, truncate `resource_events`, and start it again.

This is the migration query that is running:

INSERT INTO resource_events_transform
  ( new_value, corrective_change, property, file, report_id, event_hash,
    old_value, containing_class, certname_id, line, resource_type, status,
    resource_title, timestamp, containment_path, message )
VALUES ( $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16 )

I'm not sure there's any way to solve this... `resource_events` is by far our largest table, usually around 3-5 million rows. I've already disabled report processing on our Dev infrastructure to limit the number of reports stored.

Any suggestions, or should I make it standard practice to truncate this table before every package upgrade?
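
For what it's worth, the workaround I've been falling back on is a manual truncate with the PuppetDB service stopped, roughly like this; it's a blunt instrument, and it throws away the per-resource event detail for every stored report:

-- run with puppetdb stopped, before the package upgrade; report rows
-- are kept, but all of their resource event detail is lost
TRUNCATE TABLE resource_events;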


Charlie Sharpsteen (JIRA)

Jun 26, 2019, 10:59:03 AM
to puppe...@googlegroups.com
Charlie Sharpsteen commented on Bug PDB-4444
 
Re: PuppetDB never finishes migrating resource_events

Which version of PuppetDB are you starting with, and which version are you upgrading to? That will let us know which migrations are being run on the resource_events table.

Luke Bigum (JIRA)

Jun 28, 2019, 5:11:03 AM
to puppe...@googlegroups.com
Luke Bigum commented on Bug PDB-4444

In March, puppetdb-5.2.2-1.el6 -> puppetdb-6.3.0-1.el6, and then a few days ago puppetdb-6.3.0-1.el6 -> puppetdb-6.3.3-1.el6.

Robert Roland (JIRA)

Aug 13, 2019, 2:58:03 PM
to puppe...@googlegroups.com
Robert Roland commented on Bug PDB-4444

Luke Bigum - can we get some details on this instance of a migration issue?

How long did you let it run? How many rows are in your table? Do you have custom JVM GC settings for PuppetDB? How much RAM and how many CPU cores does your PuppetDB instance have? What sort of bandwidth do you have between the PostgreSQL server and PuppetDB?

Our testing of this migration included an instance with approximately 5 million rows in resource_events and the migration took 45 minutes.
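
If it helps, the row count and on-disk size of the table can be pulled straight from psql with something along these lines (pg_size_pretty and pg_total_relation_size are standard PostgreSQL functions):

-- total rows and total on-disk size (including indexes and TOAST)
select count(*) from resource_events;
select pg_size_pretty(pg_total_relation_size('resource_events'));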

Luke Bigum (JIRA)

Aug 14, 2019, 4:51:03 AM
to puppe...@googlegroups.com
Luke Bigum commented on Bug PDB-4444

For the JVM, we only change the Max Heap:

JAVA_ARGS="-Xmx6g"

The current row count (which is higher than I was expecting):

puppetdb=> select count(*) from resource_events;
 count 
----------
 42973699
(1 row)

The machine itself is a 12-core KVM instance with 40 GB of RAM in total. The PostgreSQL instance is co-located on the same VM, which also runs a Puppet Server instance (but we don't have automatic Puppet runs enabled, so that Puppet Server is mostly idle).

Some other relevant config:

 

# How often (in minutes) to compact the database
# gc-interval = 60
gc-interval = 60
# Number of seconds before any SQL query is considered 'slow'; offending
# queries will not be interrupted, but will be logged at the WARN log level.
log-slow-statements = 10
syntax_pgs = true
node-ttl = 0s
node-purge-ttl = 1d
report-ttl = 14d
conn-max-age = 60
conn-keep-alive = 45
conn-lifetime = 0

 

We don't purge any nodes based on last check-in time, but I do try to purge reports. We can't turn on node-ttl because our Puppet runs happen on a fixed schedule; if we turned on automatic purging, we might expire an active node simply because it hasn't had its scheduled run yet, and that would affect our monitoring, which is derived from PuppetDB. When a machine is decommissioned it is deactivated in PuppetDB manually.

I thought purging reports would be enough to keep `resource_events` small, but it may not be doing what I think it's doing.
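
One quick check, sketched below, would be to see whether anything older than our 14 day report-ttl is still sitting in the table; a non-zero count would mean the report purge isn't cleaning up event rows the way I assumed (the timestamp column here is the event timestamp from the migration above):

-- events older than the configured report-ttl of 14d
select count(*) from resource_events where "timestamp" < now() - interval '14 days';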
