Jira (PDB-4072) HA out-of-sync after restart

Dylan Ratcliffe (JIRA)

Sep 9, 2018, 11:39:03 PM
to puppe...@googlegroups.com
Dylan Ratcliffe updated an issue
 
PuppetDB / Bug PDB-4072
HA out-of-sync after restart
Change By: Dylan Ratcliffe
CS Priority: Needs Priority
 
This message was sent by Atlassian JIRA (v7.7.1#77002-sha1:e75ca93)

Dylan Ratcliffe (JIRA)

Sep 9, 2018, 11:39:04 PM
to puppe...@googlegroups.com
Dylan Ratcliffe created an issue
Issue Type: Bug
Affects Versions: PDB 5.2.4
Assignee: Unassigned
Components: HA
Created: 2018/09/09 8:38 PM
Environment:

PE 2018.1.3 HA with compile masters. The following things are out of the ordinary for this environment:

  • ~5500 Nodes
  • Hundreds of events per run (~250)
  • Some resource titles over 4000 characters long
Priority: Major
Reporter: Dylan Ratcliffe

When running in an HA setup, I'm seeing strange issues after restarting PuppetDB on either the primary or secondary master. Both masters will run fine for many hours or days with no sync errors. However, after a restart they determine that they are a long way out of sync and re-sync themselves. For example, after both machines had been up for 4 days, restarting the primary caused it to re-sync 10,000 reports. This causes prolonged downtime, as the PuppetDB service remains in "starting" mode and cannot process requests.

Possibly there is some difference between the way PuppetDB does an initial sync and the way it does subsequent syncs?

Rob Browning (JIRA)

Sep 10, 2018, 11:27:02 AM
to puppe...@googlegroups.com
Rob Browning updated an issue
Change By: Rob Browning
Team: PuppetDB

Neil Binney (JIRA)

Sep 11, 2018, 3:17:03 AM
to puppe...@googlegroups.com
Neil Binney updated an issue
Change By: Neil Binney
CS Priority: Needs Priority -> Reviewed

Neil Binney (JIRA)

Sep 11, 2018, 3:21:02 AM
to puppe...@googlegroups.com
Neil Binney updated an issue

CS Triage

CS Priority: Major

CS Frequency: 1 (1-5%)

CS Severity: 3 (Serious)

CS Business Value: 4 ($$$$)

CS Impact: Impacting a major customer. A PuppetDB resync on startup is noted in the docs, but it should not need to resync so much data following a DB restart. Such outages will impact HA and the customer's DR strategy.

Austin Blatt (JIRA)

Oct 2, 2018, 11:06:05 AM
to puppe...@googlegroups.com
Austin Blatt assigned an issue to Austin Blatt
Change By: Austin Blatt
Assignee: Austin Blatt

Austin Blatt (JIRA)

Oct 4, 2018, 11:30:05 AM
to puppe...@googlegroups.com
Austin Blatt commented on Bug PDB-4072
 
Re: HA out-of-sync after restart

OK, I'm going to investigate the possibility that the massive resource titles are somehow affecting sync. The other ticket to follow is PDB-3742. I don't think it should affect this case, but you never know.

Austin Blatt (JIRA)

Oct 5, 2018, 10:15:04 AM
to puppe...@googlegroups.com
Austin Blatt commented on Bug PDB-4072

Dylan Ratcliffe, can you get the output of pg_controldata for both the primary and secondary Postgres instances? All you need to do is give it the path to the Postgres data directory. On my computer, using a sandboxed Postgres for development, it looks like this: pg_controldata ~/sandbox/pdb/tmp_pg/pg/data/
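
For anyone collecting the same information, here is a minimal sketch of one way to capture the pg_controldata output on each host and compare the two results side by side. It assumes pg_controldata is on the PATH and that its output is the usual "Label: value" text. The function names (run_controldata, parse_controldata, diff_controldata) are hypothetical helpers written for this illustration, and the thread does not say which fields were of interest, so the sketch simply reports every field that differs.

```python
import subprocess


def run_controldata(data_dir):
    """Run pg_controldata against a Postgres data directory and return its text output."""
    result = subprocess.run(
        ["pg_controldata", data_dir],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_controldata(text):
    """Turn 'Label:   value' lines into a dict, splitting on the first colon only.

    Splitting on the first colon keeps timestamp values (which contain
    colons themselves) intact.
    """
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep:
            fields[label.strip()] = value.strip()
    return fields


def diff_controldata(primary, secondary):
    """Return {label: (primary_value, secondary_value)} for every field that differs."""
    labels = sorted(set(primary) | set(secondary))
    return {
        label: (primary.get(label), secondary.get(label))
        for label in labels
        if primary.get(label) != secondary.get(label)
    }
```

Since the two Postgres instances live on different hosts, the text would normally be captured on each machine with run_controldata (or by running the command by hand and saving the output), then both captures passed through parse_controldata and diff_controldata in one place.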

Dylan Ratcliffe (JIRA)

Oct 8, 2018, 2:06:04 AM
to puppe...@googlegroups.com

Austin Blatt, the output is attached via a private Google Drive link. I have also attached logs from the primary after a required reboot (of the whole server), which might give some more info.

Dylan Ratcliffe (JIRA)

Oct 15, 2018, 9:42:02 PM
to puppe...@googlegroups.com

Austin Blatt, no need for alternatives; we can wait. The only impact is increased startup time when the primary needs to restart PuppetDB for whatever reason. The workaround is just to not restart PuppetDB in the middle of the day, which is easy.

Dylan Ratcliffe (JIRA)

Oct 23, 2018, 10:09:02 PM
to puppe...@googlegroups.com

Austin Blatt, I just had to restart PuppetDB in order to increase report-ttl, and we encountered this issue again. This time the primary was restarted and took about 40 minutes to come back into sync because it was out by thousands of reports (~8000) and hundreds of everything else. I also saw the sync complete and PuppetDB come up fully, then run sync again and say it needed to sync ~900 factsets. It syncs maybe 100-150 (judging by the logs, a very rough estimate), then stops, saying the sync completed successfully. This repeats about 5 or 6 times, every 2 minutes when sync is triggered. In the meantime the secondary syncs a few thousand reports from the primary, which should have nothing that the secondary does not, as the secondary was up the whole time.

Is this behaviour consistent with your hypothesis? The initial huge sync seems consistent, but I can't see why, after the first sync completes, it would need to transfer anything further, especially 5 or 6 times and in both directions.

Austin Blatt (JIRA)

Oct 23, 2018, 10:18:05 PM
to puppe...@googlegroups.com
Austin Blatt commented on Bug PDB-4072

No, my hypothesis does not explain the smaller syncs you are seeing after startup. For that reason, I made PDB-4158 to track the work that would alleviate the startup issue I found, but we can leave this ticket open until we resolve all the pieces.

It's possible that PDB-3742, which we are also working on right now, could explain the smaller syncs every time. If a command is enqueued on both PDBs but only processed on one before the sync happens, PDB will attempt to re-sync that command to the server that has not yet processed it. The result is that one system will have 2 identical commands enqueued, but the duplicate will be ignored when it is processed.
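
The scenario described above can be illustrated with a toy model. This is only a sketch of the general idea (sync compares what has been processed, not what is still queued, and processing is idempotent); it is not PuppetDB's actual implementation. ToyNode, command_id, and sync are hypothetical names, and identifying a command by a SHA-256 hash of its contents is an assumption made for the example.

```python
import hashlib
import json
from collections import deque


def command_id(command):
    """Stable content hash used in this toy model to recognise identical commands."""
    payload = json.dumps(command, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class ToyNode:
    """Hypothetical stand-in for one PuppetDB instance: a command queue plus stored data."""

    def __init__(self):
        self.queue = deque()
        self.stored = {}

    def enqueue(self, command):
        self.queue.append(command)

    def process_all(self):
        """Drain the queue and return how many commands were ignored as duplicates."""
        ignored = 0
        while self.queue:
            command = self.queue.popleft()
            key = command_id(command)
            if key in self.stored:
                ignored += 1  # already processed once, so the duplicate is a no-op
            else:
                self.stored[key] = command
        return ignored


def sync(source, target):
    """Re-enqueue on target everything source has stored that target has not stored yet.

    Only processed (stored) data is compared, so a command still waiting in
    target's queue looks "missing" and is transferred a second time.
    """
    missing = [cmd for key, cmd in source.stored.items() if key not in target.stored]
    for cmd in missing:
        target.enqueue(cmd)
    return len(missing)
```

If the same report is enqueued on two ToyNode instances, one of them processes it, and sync runs before the other has drained its queue, then sync transfers one command, the slower node ends up with two identical commands queued, and its process_all call ignores one of them. A second sync afterwards transfers nothing.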

Dylan Ratcliffe (JIRA)

Oct 23, 2018, 10:45:04 PM
to puppe...@googlegroups.com

Yeah, that's certainly possible. It would be good if the monitoring could come up while PuppetDB is in "starting" mode; that way we could get a lot more detail. Is there a technical reason why that doesn't happen? At the moment we get zero metrics until PuppetDB is fully started, which makes identifying these kinds of issues harder.

Austin Blatt (JIRA)

Oct 23, 2018, 11:59:03 PM
to puppe...@googlegroups.com
Austin Blatt commented on Bug PDB-4072

That's true, and making PuppetDB a little easier to work with during startup is something we have talked about working on in the near future. I can think of a few difficulties with tracking the initial sync, because it occurs somewhat differently than other syncs. Would you mind making a ticket that mentions which metrics information/endpoints are most useful for debugging errors like this? Then I can look into what sort of work that might require, or whether the metrics endpoint might report incorrect data during startup.

Dylan Ratcliffe (JIRA)

Oct 24, 2018, 9:23:03 AM
to puppe...@googlegroups.com

Adam Bottchen (JIRA)

Jan 23, 2019, 5:00:09 PM
to puppe...@googlegroups.com
Adam Bottchen updated an issue
Change By: Adam Bottchen
CS Priority: Reviewed -> Major

Maheswaran Shanmugam (JIRA)

Jun 13, 2019, 4:06:04 AM
to puppe...@googlegroups.com