Jira (PDB-4072) HA out-of-sync after restart

Change By:	Dylan Ratcliffe
CS Priority:	Needs Priority

This message was sent by Atlassian JIRA (v7.7.1#77002-sha1:e75ca93)

Dylan Ratcliffe (JIRA)

Sep 9, 2018, 11:39:04 PM

to puppe...@googlegroups.com

Dylan Ratcliffe created an issue

PuppetDB /

Issue Type:	Bug
Affects Versions:	PDB 5.2.4
Assignee:	Unassigned
Components:	HA
Created:	2018/09/09 8:38 PM
Environment:	PE 2018.1.3 HA with compile masters. The following things are out of the ordinary for this environment: ~5500 nodes; many hundreds of events per run (~250); some resource titles over 4000 characters long.
Priority:	Major
Reporter:	Dylan Ratcliffe

When running in an HA setup I'm seeing strange issues after restarting PuppetDB on either the Primary or Secondary master. Both masters will run fine for many hours/days with no sync errors. However, after a restart they will determine that they are a long way out-of-sync and re-sync themselves. For example, after both machines had been up for 4 days, restarting the primary caused it to re-sync 10,000 reports. This causes prolonged downtime, as the PuppetDB service remains in "starting" mode and cannot process requests.

Possibly there is some difference between the way PuppetDB performs the initial sync and the way it performs subsequent syncs?

Rob Browning (JIRA)

Sep 10, 2018, 11:27:02 AM

to puppe...@googlegroups.com

Rob Browning updated an issue

PuppetDB /

Change By:	Rob Browning
Team:	PuppetDB

Neil Binney (JIRA)

Sep 11, 2018, 3:17:03 AM

to puppe...@googlegroups.com

Neil Binney updated an issue

PuppetDB /

Change By:	Neil Binney
CS Priority:	Needs Priority → Reviewed

Neil Binney (JIRA)

Sep 11, 2018, 3:21:02 AM

to puppe...@googlegroups.com

Neil Binney updated an issue

PuppetDB /

CS Triage

CS Priority: Major

CS Frequency: 1 1-5%

CS Severity: 3 Serious

CS Business Value: 4 $$$$

CS Impact: Impacting a major customer. PuppetDB resync on startup is noted in the docs, but it should not need to resync so much data following a DB restart. Such outages will impact HA and customers' DR strategy.

Austin Blatt (JIRA)

Oct 2, 2018, 11:06:05 AM

to puppe...@googlegroups.com

Austin Blatt assigned an issue to Austin Blatt

PuppetDB /

Change By:	Austin Blatt
Assignee:	Austin Blatt

Austin Blatt (JIRA)

Oct 4, 2018, 11:30:05 AM

to puppe...@googlegroups.com

Austin Blatt commented on

OK, I'm going to investigate the possibility that the massive resource titles are somehow affecting sync. The other ticket to follow is PDB-3742; I don't think it should affect this case, but you never know.

Austin Blatt (JIRA)

Oct 5, 2018, 10:15:04 AM

to puppe...@googlegroups.com

Austin Blatt commented on

Dylan Ratcliffe, can you get the output of pg_controldata for both the primary and secondary Postgres instances? All you need to do is give it the path to the Postgres data directory. On my computer, using a sandboxed Postgres for development, it looks like this: pg_controldata ~/sandbox/pdb/tmp_pg/pg/data/
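
Once both outputs are collected, comparing them field by field can be easier than reading them side by side. The following is a minimal Python sketch and not part of the original thread. It assumes the usual "Label: value" line layout of pg_controldata output; the helper names parse_controldata and diff_controldata are hypothetical, and the sample values are made up for illustration.

```python
def parse_controldata(text):
    """Parse pg_controldata output ("Label:   value" lines) into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, since values (e.g. timestamps)
        # may contain colons themselves.
        label, sep, value = line.partition(":")
        if sep:
            fields[label.strip()] = value.strip()
    return fields


def diff_controldata(primary, secondary):
    """Return {label: (primary_value, secondary_value)} for fields that differ."""
    labels = sorted(set(primary) | set(secondary))
    return {
        label: (primary.get(label), secondary.get(label))
        for label in labels
        if primary.get(label) != secondary.get(label)
    }


# Hypothetical, abbreviated sample output; real output has many more fields.
PRIMARY_SAMPLE = """\
pg_control version number:            960
Database cluster state:               in production
Latest checkpoint's TimeLineID:       1
"""
SECONDARY_SAMPLE = """\
pg_control version number:            960
Database cluster state:               in production
Latest checkpoint's TimeLineID:       2
"""

if __name__ == "__main__":
    differences = diff_controldata(
        parse_controldata(PRIMARY_SAMPLE), parse_controldata(SECONDARY_SAMPLE)
    )
    for label, (p, s) in differences.items():
        print(f"{label}: primary={p!r} secondary={s!r}")
```

In practice the two sample strings would be replaced by the captured output from each host, for example read from files saved after running pg_controldata on the primary and on the secondary.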

Dylan Ratcliffe (JIRA)

Oct 8, 2018, 2:06:04 AM

to puppe...@googlegroups.com

Dylan Ratcliffe commented on

Austin Blatt, attached via a private Google Drive link. I have also attached logs from the primary after a required reboot (whole server), which might give some more info.

Dylan Ratcliffe (JIRA)

Oct 15, 2018, 9:42:02 PM

to puppe...@googlegroups.com

Dylan Ratcliffe commented on

Austin Blatt, no need for alternatives; we can wait. The only impact is increased startup time when the Primary needs to restart PuppetDB for whatever reason. The workaround is just to not restart PuppetDB in the middle of the day, which is easy.

Dylan Ratcliffe (JIRA)

Oct 23, 2018, 10:09:02 PM

to puppe...@googlegroups.com

Dylan Ratcliffe commented on

Austin Blatt, we just had to restart PuppetDB in order to increase report-ttl and we encountered this issue again. This time the primary was restarted and took about 40 min to come back into sync due to being out by thousands of reports (~8000) and hundreds of everything else. I also saw the sync complete and PuppetDB come up fully; then it ran sync again and said it needed to sync ~900 factsets, synced maybe 100-150 (judging by logs, very rough estimate), then stopped, saying sync completed successfully. This repeated about 5 or 6 times, once every 2 min when sync was triggered. In the meantime the secondary synced a few thousand reports from the primary, which should have nothing that the secondary does not, as the secondary was up the whole time.

Is this behaviour consistent with your hypothesis? Seems the initial huge sync is consistent, but I can't see why after the first sync completes, it would need to transfer anything further, especially 5 or 6 times, in both directions.

Austin Blatt (JIRA)

Oct 23, 2018, 10:18:05 PM

to puppe...@googlegroups.com

Austin Blatt commented on

No, my hypothesis does not explain the smaller syncs you are seeing after startup. For that reason, I made PDB-4158 to track the work that would alleviate the startup issue I found, but we can leave this ticket open until we resolve all the pieces.

It's possible that PDB-3742, which we are also working on right now, could explain the smaller syncs every time. If a command is enqueued on both PDBs, but only processed on one before the sync happens, PDB will attempt to re-sync that command to the server that has not yet processed it. The result is that one system will have 2 identical commands enqueued, but the duplicate will be ignored when it is processed.
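
The scenario described in this comment can be illustrated with a toy model. The Python sketch below is not PuppetDB's sync code; it only mimics the behaviour as described here, under the assumption that sync compares processed data only. The Node class, the sync function, and the "report-1" identifier are all hypothetical.

```python
from collections import deque


class Node:
    """Toy stand-in for one PuppetDB instance: a command queue plus a store."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()  # commands received but not yet processed
        self.store = set()    # IDs of commands already processed
        self.ignored = 0      # duplicates dropped at processing time

    def enqueue(self, command_id):
        self.queue.append(command_id)

    def process_all(self):
        while self.queue:
            command_id = self.queue.popleft()
            if command_id in self.store:
                self.ignored += 1  # duplicate: already stored, so skip it
            else:
                self.store.add(command_id)


def sync(source, target):
    """Re-enqueue on target everything source has stored that target lacks.

    Only processed data is compared (an assumption of this model), so a
    command still waiting in the target's queue looks "missing" and is
    transferred again.
    """
    missing = source.store - target.store
    for command_id in sorted(missing):
        target.enqueue(command_id)
    return len(missing)


primary, secondary = Node("primary"), Node("secondary")
for node in (primary, secondary):
    node.enqueue("report-1")          # same command delivered to both
primary.process_all()                 # only the primary has processed it
transferred = sync(primary, secondary)
queued_copies = len(secondary.queue)  # original plus re-synced duplicate
secondary.process_all()
print(transferred, queued_copies, secondary.ignored)  # prints: 1 2 1
```

In this model the sync transfers one command that the secondary already had queued, the secondary briefly holds two identical commands, and the second copy is dropped at processing time, leaving both stores identical.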

Dylan Ratcliffe (JIRA)

Oct 23, 2018, 10:45:04 PM

to puppe...@googlegroups.com

Dylan Ratcliffe commented on

Yeah, that's certainly possible. It would be good if the monitoring could come up while PuppetDB is in "starting" mode; that way we could get a lot more detail. Is there a technical reason why that doesn't happen? At the moment we get zero metrics until PuppetDB is fully started, which makes identifying these kinds of issues harder.

Austin Blatt (JIRA)

Oct 23, 2018, 11:59:03 PM

to puppe...@googlegroups.com

Austin Blatt commented on