Jira (PDB-5599) Remove catalog resources pg_trgm index on file path

1 view
Skip to first unread message

Austin Blatt (Jira)

unread,
Feb 16, 2023, 11:49:02 AM2/16/23
to puppe...@googlegroups.com
Austin Blatt created an issue
 
PuppetDB / Improvement PDB-5599
Remove catalog resources pg_trgm index on file path
Issue Type: Improvement Improvement
Assignee: Unassigned
Created: 2023/02/16 8:48 AM
Priority: Normal Normal
Reporter: Austin Blatt

CD4PE is no longer using this index (it was added for their queries)

Add Comment Add Comment
 
This message was sent by Atlassian Jira (v8.20.11#820011-sha1:0629dd8)
Atlassian logo

Austin Blatt (Jira)

unread,
May 25, 2023, 12:10:02 PM5/25/23
to puppe...@googlegroups.com
Austin Blatt commented on Improvement PDB-5599
 
Re: Remove catalog resources pg_trgm index on file path

This is an extraordinarily expensive index to maintain.

From support in Slack regarding a customer:

Their catalog_resources_file_trgm index is 58 GB in size, which is over 33% of the total PuppetDB database size and it's been used 0 times

pg_stat_user_indexes: https://www.postgresql.org/docs/11/monitoring-stats.html#PG-STAT-ALL-INDEXES-VIEW
relid | indexrelid | schemaname | relname | indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
---------{}+{}--------{}{}--------{}{}-----------------------{}{}--------------------------------------------------------{}{}----------{}{}----------{}{}-------------
17540 | 18381 | public | catalog_resources | catalog_resources_file_trgm | 0 | 0 | 0

14:56
5571717074944 disk blocks have been read from that index though, so it seems like autovaccuum and autoanalyze have spent 5.5 TB in just read I/O to keep it maintained. Probably more in write.
15:08
Well, hopefully 5.5 TB if postgres means 1 kB file system blocks by "disk blocks". If it's counting 8 kB Postgres pages, then that's 44 TB in read
15:28
Postgres slack confirms it's counting the 8 kB database pages

This message was sent by Atlassian Jira (v8.20.21#820021-sha1:38274c8)
Atlassian logo

Austin Blatt (Jira)

unread,
May 25, 2023, 12:13:02 PM5/25/23
to puppe...@googlegroups.com

Charlie Sharpsteen (Jira)

unread,
May 25, 2023, 1:40:02 PM5/25/23
to puppe...@googlegroups.com
Charlie Sharpsteen commented on Improvement PDB-5599
 
Re: Remove catalog resources pg_trgm index on file path

This index was originally added in PDB-4640 to support CD4PE Impact Analysis. Impact analysis needs to look at a git diff with modifications to a file name like mod_name/manifests/foo.pp and determine all nodes in PuppetDB with catalogs that might have resources defined in the modified file.

This used to be done via a regex match against the file field of the resources entity and PDB-4640 added a GIN tri-gram index to help speed those matches up. However, the regex match solution eventually proved inadequate even with the index and CD4PE switched (CDPE-4869) to re-constructing the resource type from the modified files (e.g. mod_name/manifests/foo.pp is either a Mod_name::Foo defined type or a Class['Mod_name::Foo']. Regex matching is still used in the very rare edge case of modifications to a site manifest.

That leaves the catalog_resources_file_trgm index behind as unused in almost all PE installations. This type of index is extremely expensive for the database to maintain, resulting in several negative impacts that get worse as the size of a PE installation scales up:

  • On row updates, regular Postgres B-Tree indexes impose a tax of one additional write to store a copy of the indexed value. GIN indexes require multiple additional writes. In the case of a GIN tri-gram, the indexed value is split into groupings of three characters (tri-grams) and an index update is written for each grouping. For a long value like a file path, this can translate into dozens or hundreds of extra writes per row update.
  • The write amplification results in a lot of disk space being used. The catalog_resources_file_trgm index appears frequently in the top consumers of disk space for a PuppetDB database — the previous comment gives an example of over one third of the total PuppetDB database size being bound up in this index.
  • Even when not used by queries, GIN indexes have to be updated with each new row and maintained by autovacuum and autoanalyze. Due to their amplified writes and large on-disk size, this results in significant amounts of additional read and write I/O. In the case of autovacuum and autoanalyze, Postgres has a global "budget" for I/O performed as part of maintenance that controlled by the autovacuum_vacuum_cost_limit setting. A maintenance process that is working on catalog_resources_file_trgm can easily burn this entire budget — which forces all other maintenance workers to sleep for a period of time until the budget refills. This means the the existance of a large, unused index increases the time required to maintain every other table and index in the database server and results in the entire dataset being organized less efficiently.

So, to summarize: GIN indexes are massively expensive, to the point of requiring upkeep that is disproportionate to the utility they provide when covering large sets of long strings. Often, re-factoring application logic to reduce the use of pattern matching is truly the cheaper option.

catalog_resources_file_trgm is a specific index discussed in this ticket, but the fact_paths_path_trgm also exhibits the same large size coupled with low or no usage in PE installations.

Charlie Sharpsteen (Jira)

unread,
May 25, 2023, 1:46:02 PM5/25/23
to puppe...@googlegroups.com

As noted in PE-34880, any operation that causes mass updates to agent catalogs will create a peak load that is then amplified by this GIN index and can bring database performance to its knees.

Charmaine Pritchett (Jira)

unread,
May 25, 2023, 1:53:02 PM5/25/23
to puppe...@googlegroups.com
Charmaine Pritchett updated an issue
 
Change By: Charmaine Pritchett
Zendesk Ticket Count: 1 2
Zendesk Ticket IDs: 50015, 52086

Charmaine Pritchett (Jira)

unread,
May 25, 2023, 1:53:02 PM5/25/23
to puppe...@googlegroups.com
Charmaine Pritchett updated an issue
Change By: Charmaine Pritchett
Zendesk Ticket Count: 1
Zendesk Ticket IDs: 52086

Charmaine Pritchett (Jira)

unread,
May 25, 2023, 1:53:03 PM5/25/23
to puppe...@googlegroups.com

Charlie Sharpsteen (Jira)

unread,
May 25, 2023, 4:33:03 PM5/25/23
to puppe...@googlegroups.com
 
Re: Remove catalog resources pg_trgm index on file path

And to give a concrete example of the degree to which tri-gram indexes amplify writes, from the Postgres docs:

Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string. For example, the set of trigrams in the string “cat” is “ c”, “ ca”, “cat”, and “at ”.

https://www.postgresql.org/docs/14/pgtrgm.html

A file named the_quick_brown_fox would result in 19 tri-grams that need to be written out as part of the row update:

  t
 th
the
he_
e_q
_qu
qui
uic
ick
ck_
k_b
_br
bro
row
own
wn_
n_f
_fo
fox
ox

Cas Donoghue (Jira)

unread,
Jun 1, 2023, 2:36:01 PM6/1/23
to puppe...@googlegroups.com

Cas Donoghue (Jira)

unread,
Jun 1, 2023, 2:48:02 PM6/1/23
to puppe...@googlegroups.com
Cas Donoghue updated an issue
Change By: Cas Donoghue
Fix Version/s: PDB 8.0.1

Cas Donoghue (Jira)

unread,
Jun 15, 2023, 4:45:02 PM6/15/23
to puppe...@googlegroups.com
Cas Donoghue commented on Improvement PDB-5599
 
Re: Remove catalog resources pg_trgm index on file path

We will only target the STS branch with this change. 

Cas Donoghue (Jira)

unread,
Jun 21, 2023, 2:11:02 PM6/21/23
to puppe...@googlegroups.com
Cas Donoghue updated an issue
Change By: Cas Donoghue
Sprint: Skeletor 07/05/2023
Reply all
Reply to author
Forward
0 new messages