Jira (PDB-5351) Reports table partitions are garbage collected too soon

Stel Abrego (Jira)

Oct 28, 2021, 7:15:02 PM
to puppe...@googlegroups.com
Stel Abrego created an issue
 
PuppetDB / Bug PDB-5351
Reports table partitions are garbage collected too soon
Issue Type: Bug
Assignee: Unassigned
Created: 2021/10/28 4:14 PM
Priority: Normal
Reporter: Stel Abrego

There is a bug in our reports table partition garbage collection: any `reports-ttl` shorter than 24 hours and 5 minutes (assuming garbage collection runs around 00:05) causes the previous day's table partition to be garbage collected.

This is due to how the `reports-ttl` expiration DateTime and the reports table partition DateTime are constructed and compared. A partition's DateTime is always 00:00 of its day, so every report in the partition is newer than the DateTime that represents it. See puppetlabs.puppetdb.scf.storage/prune-daily-partitions.
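
The comparison can be illustrated with a small Python sketch. This is not the PuppetDB code (which is Clojure); `partition_is_expired` and its parameters are hypothetical names, and the 00:05 garbage collection time is taken from the example above.

```python
from datetime import datetime, timedelta


def partition_is_expired(partition_day: datetime, now: datetime, ttl: timedelta) -> bool:
    """Mimic the comparison described in this ticket (hypothetical helper).

    The partition is represented by midnight at the start of its day and is
    dropped when that instant is before the expiration instant (now - ttl).
    """
    partition_start = partition_day.replace(hour=0, minute=0, second=0, microsecond=0)
    expiration = now - ttl
    return partition_start < expiration


gc_time = datetime(2021, 10, 28, 0, 5)   # assumed GC run at 00:05
yesterday = datetime(2021, 10, 27)       # partition holding yesterday's reports

# TTL of 1 day: expiration is 2021-10-27 00:05, which is after the partition's
# 00:00 stamp, so the whole partition is dropped although most of its reports
# are less than a day old.
dropped_with_1d = partition_is_expired(yesterday, gc_time, timedelta(days=1))

# TTL of 24h05m: expiration is exactly 2021-10-27 00:00, so the partition is kept.
dropped_with_24h05m = partition_is_expired(
    yesterday, gc_time, timedelta(hours=24, minutes=5)
)

print(dropped_with_1d, dropped_with_24h05m)
```

Under these assumptions the 1-day TTL drops yesterday's partition, while a TTL of 24 hours and 5 minutes keeps it.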

Solution A:
We could change the table partitions' DateTime to be a day later to more accurately reflect an "expiration" date for the partition.

Solution B:
An alternative fix would be to "floor" the expiration DateTime that gets derived from the `reports-ttl` value. This way, when yesterday's partition is considered for garbage collection with a `reports-ttl` of "1d", both DateTimes are identical. This avoids garbage collection because a partition only gets garbage collected if the partition DateTime is before the `reports-ttl` expiration DateTime. See puppetlabs.puppetdb.cli.services/sweep-reports!
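
Both proposed fixes can be sketched in Python for comparison. This is only an illustration of the two ideas, not the actual PuppetDB change; the helper names `midnight`, `expired_solution_a`, and `expired_solution_b` are hypothetical, and the expiration is assumed to be floored to the start of its day.

```python
from datetime import datetime, timedelta


def midnight(dt: datetime) -> datetime:
    """Truncate a datetime to 00:00 of the same day."""
    return dt.replace(hour=0, minute=0, second=0, microsecond=0)


def expired_solution_a(partition_day: datetime, now: datetime, ttl: timedelta) -> bool:
    # Solution A: stamp the partition with the END of its day (the next
    # midnight), so it is judged by the newest report it could contain.
    partition_end = midnight(partition_day) + timedelta(days=1)
    return partition_end < now - ttl


def expired_solution_b(partition_day: datetime, now: datetime, ttl: timedelta) -> bool:
    # Solution B: keep the partition stamp at 00:00 but floor the expiration
    # to midnight before comparing.
    return midnight(partition_day) < midnight(now - ttl)


gc_time = datetime(2021, 10, 28, 0, 5)   # assumed GC run at 00:05
one_day = timedelta(days=1)
yesterday = datetime(2021, 10, 27)
two_days_ago = datetime(2021, 10, 26)

results = {
    "a_yesterday": expired_solution_a(yesterday, gc_time, one_day),
    "b_yesterday": expired_solution_b(yesterday, gc_time, one_day),
    "a_two_days_ago": expired_solution_a(two_days_ago, gc_time, one_day),
    "b_two_days_ago": expired_solution_b(two_days_ago, gc_time, one_day),
}
print(results)
```

In this sketch both variants keep yesterday's partition with a `reports-ttl` of "1d" and still drop the partition from two days earlier.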

Zendesk: https://puppetlabs.zendesk.com/agent/tickets/46288

Initial Zendesk Support Message:

Hi,
 
I believe we've identified a bug in the handling of the puppetdb report_ttl setting.
 
We have been using a report_ttl of `1d` for quite some time, and noticed, especially after upgrading to 2019 (but this possibly existed before), that virtually all reports would be deleted during the first sweep after midnight. Often we found the Puppet Console reporting tens of thousands of nodes with no reports.
 
What seems to be happening is that, if the GC runs at 00:10, only reports that came in between 00:00 and 00:10 will be retained. This means any agents that have NOT run Puppet in the past 10 minutes will have 'no reports' and will show on the status page as not having checked in.
 
Nodes that HAVE checked in between 00:00 and 00:10 will show as having checked in, but only have the 1 report for the day (for example).
 
I'm guessing that the code is rounding `1d` unexpectedly; it seems to be rounding down to the most recent calendar day, which is just "today".
 
If I change `1d` to `24h` it converts back to `1d` for the purposes of the sweep.
 
I am now experimenting with `25h` to see if reports are retained for 1 full day, and will report back, but I wanted to get the ball rolling on this ticket.
 
This is uniquely visible in our environment as we retain reports for 1d, and have a standard runInterval of 1h. This means if GC runs at 00:05 and we check the console immediately thereafter, almost no agents have checked in. If we check the console at 00:30, statistically speaking it is likely that 50% of our nodes have 'no reports'.
 
https://github.com/puppetlabs/puppetdb/blob/ad13f09bed2f9462ec97b3dc738a055bb8716c4e/src/puppetlabs/puppetdb/cli/services.clj#L206-L242

This message was sent by Atlassian Jira (v8.13.2#813002-sha1:c495a97)

Stel Abrego (Jira)

Oct 28, 2021, 7:22:03 PM
to puppe...@googlegroups.com

Austin Boyd (Jira)

Oct 28, 2021, 8:43:01 PM
to puppe...@googlegroups.com

Austin Boyd (Jira)

Oct 28, 2021, 8:43:02 PM
to puppe...@googlegroups.com
Austin Boyd updated an issue
Change By: Austin Boyd
Zendesk Ticket Count: 1
Zendesk Ticket IDs: 46288

Austin Blatt (Jira)

Oct 29, 2021, 1:26:02 PM
to puppe...@googlegroups.com
Austin Blatt updated an issue
Change By: Austin Blatt
Sprint: HA 2021-11-03
Team: HA

Austin Blatt (Jira)

Nov 3, 2021, 12:40:05 PM
to puppe...@googlegroups.com
Austin Blatt updated an issue
Change By: Austin Blatt
Fix Version/s: PDB 7.8.0
Fix Version/s: PDB 6.20.0
Release Notes: Bug Fix
Release Notes Summary: Reports were gc'ed by report-ttl - 1 day, now they are not.