Jira (PDB-5599) Remove catalog resources pg

Austin Blatt (Jira)

unread,

Feb 16, 2023, 11:49:02 AM2/16/23

to puppe...@googlegroups.com

Austin Blatt created an issue

PuppetDB /

PDB-5599

Remove catalog resources pg_trgm index on file path

Issue Type:	Improvement
Assignee:	Unassigned
Created:	2023/02/16 8:48 AM
Priority:	Normal
Reporter:	Austin Blatt

CD4PE is no longer using this index (it was added for their queries)

Add Comment

This message was sent by Atlassian Jira (v8.20.11#820011-sha1:0629dd8)

Austin Blatt (Jira)

unread,

May 25, 2023, 12:10:02 PM5/25/23

to puppe...@googlegroups.com

Austin Blatt commented on

PDB-5599

Re: Remove catalog resources pg_trgm index on file path

This is an extraordinarily expensive index to maintain.

From support in Slack regarding a customer:

Their catalog_resources_file_trgm index is 58 GB in size, which is over 33% of the total PuppetDB database size and it's been used 0 times

pg_stat_user_indexes: https://www.postgresql.org/docs/11/monitoring-stats.html#PG-STAT-ALL-INDEXES-VIEW
relid | indexrelid | schemaname | relname | indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
---------{}+{}--------{}{}--------{}{}-----------------------{}{}--------------------------------------------------------{}{}----------{}{}----------{}{}-------------
17540 | 18381 | public | catalog_resources | catalog_resources_file_trgm | 0 | 0 | 0

14:56
5571717074944 disk blocks have been read from that index though, so it seems like autovaccuum and autoanalyze have spent 5.5 TB in just read I/O to keep it maintained. Probably more in write.
15:08
Well, hopefully 5.5 TB if postgres means 1 kB file system blocks by "disk blocks". If it's counting 8 kB Postgres pages, then that's 44 TB in read
15:28
Postgres slack confirms it's counting the 8 kB database pages

Add Comment

This message was sent by Atlassian Jira (v8.20.21#820021-sha1:38274c8)

Austin Blatt (Jira)

unread,

May 25, 2023, 12:13:02 PM5/25/23

to puppe...@googlegroups.com

Austin Blatt updated an issue

PuppetDB /

PDB-5599

Remove catalog resources pg_trgm index on file path

Change By:	Austin Blatt
Team:	Skeletor

Add Comment

Charlie Sharpsteen (Jira)

unread,

May 25, 2023, 1:40:02 PM5/25/23

to puppe...@googlegroups.com

Charlie Sharpsteen commented on

PDB-5599

Re: Remove catalog resources pg_trgm index on file path

This index was originally added in ~~PDB-4640~~ to support CD4PE Impact Analysis. Impact analysis needs to look at a git diff with modifications to a file name like mod_name/manifests/foo.pp and determine all nodes in PuppetDB with catalogs that might have resources defined in the modified file.

This used to be done via a regex match against the file field of the resources entity and ~~PDB-4640~~ added a GIN tri-gram index to help speed those matches up. However, the regex match solution eventually proved inadequate even with the index and CD4PE switched (CDPE-4869) to re-constructing the resource type from the modified files (e.g. mod_name/manifests/foo.pp is either a Mod_name::Foo defined type or a Class['Mod_name::Foo']. Regex matching is still used in the very rare edge case of modifications to a site manifest.

That leaves the catalog_resources_file_trgm index behind as unused in almost all PE installations. This type of index is extremely expensive for the database to maintain, resulting in several negative impacts that get worse as the size of a PE installation scales up:

On row updates, regular Postgres B-Tree indexes impose a tax of one additional write to store a copy of the indexed value. GIN indexes require multiple additional writes. In the case of a GIN tri-gram, the indexed value is split into groupings of three characters (tri-grams) and an index update is written for each grouping. For a long value like a file path, this can translate into dozens or hundreds of extra writes per row update.

The write amplification results in a lot of disk space being used. The catalog_resources_file_trgm index appears frequently in the top consumers of disk space for a PuppetDB database — the previous comment gives an example of over one third of the total PuppetDB database size being bound up in this index.

Even when not used by queries, GIN indexes have to be updated with each new row and maintained by autovacuum and autoanalyze. Due to their amplified writes and large on-disk size, this results in significant amounts of additional read and write I/O. In the case of autovacuum and autoanalyze, Postgres has a global "budget" for I/O performed as part of maintenance that controlled by the autovacuum_vacuum_cost_limit setting. A maintenance process that is working on catalog_resources_file_trgm can easily burn this entire budget — which forces all other maintenance workers to sleep for a period of time until the budget refills. This means the the existance of a large, unused index increases the time required to maintain every other table and index in the database server and results in the entire dataset being organized less efficiently.

So, to summarize: GIN indexes are massively expensive, to the point of requiring upkeep that is disproportionate to the utility they provide when covering large sets of long strings. Often, re-factoring application logic to reduce the use of pattern matching is truly the cheaper option.

catalog_resources_file_trgm is a specific index discussed in this ticket, but the fact_paths_path_trgm also exhibits the same large size coupled with low or no usage in PE installations.

Add Comment

Charlie Sharpsteen (Jira)

unread,

May 25, 2023, 1:46:02 PM5/25/23

to puppe...@googlegroups.com

Charlie Sharpsteen commented on

PDB-5599

Re: Remove catalog resources pg_trgm index on file path

As noted in PE-34880, any operation that causes mass updates to agent catalogs will create a peak load that is then amplified by this GIN index and can bring database performance to its knees.

Add Comment

Charmaine Pritchett (Jira)

unread,

May 25, 2023, 1:53:02 PM5/25/23

to puppe...@googlegroups.com

Charmaine Pritchett updated an issue

PuppetDB /

PDB-5599

Remove catalog resources pg_trgm index on file path

Change By:	Charmaine Pritchett
Zendesk Ticket Count:	1 2
Zendesk Ticket IDs:	50015, 52086

Add Comment

Charmaine Pritchett (Jira)

unread,

May 25, 2023, 1:53:02 PM5/25/23

to puppe...@googlegroups.com

Charmaine Pritchett updated an issue

PuppetDB /

PDB-5599

Remove catalog resources pg_trgm index on file path

Change By:	Charmaine Pritchett
Zendesk Ticket Count:	1

Zendesk Ticket IDs:	52086

Add Comment

Charmaine Pritchett (Jira)

unread,

May 25, 2023, 1:53:03 PM5/25/23

to puppe...@googlegroups.com

Charmaine Pritchett updated an issue

PuppetDB /

PDB-5599

Remove catalog resources pg_trgm index on file path

Change By:	Charmaine Pritchett
Labels:	jira_escalated

Add Comment

Charlie Sharpsteen (Jira)

unread,

May 25, 2023, 4:33:03 PM5/25/23

to puppe...@googlegroups.com

Charlie Sharpsteen commented on

PDB-5599

Re: Remove catalog resources pg_trgm index on file path

And to give a concrete example of the degree to which tri-gram indexes amplify writes, from the Postgres docs:

Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string. For example, the set of trigrams in the string “cat” is “ c”, “ ca”, “cat”, and “at ”.

https://www.postgresql.org/docs/14/pgtrgm.html

A file named the_quick_brown_fox would result in 19 tri-grams that need to be written out as part of the row update: