This index was originally added in PDB-4640 to support CD4PE Impact Analysis. Impact analysis needs to look at a git diff with modifications to a file name like mod_name/manifests/foo.pp and determine all nodes in PuppetDB with catalogs that might have resources defined in the modified file. This used to be done via a regex match against the file field of the resources entity and PDB-4640 added a GIN tri-gram index to help speed those matches up. However, the regex match solution eventually proved inadequate even with the index and CD4PE switched (CDPE-4869) to re-constructing the resource type from the modified files (e.g. mod_name/manifests/foo.pp is either a Mod_name::Foo defined type or a Class['Mod_name::Foo']. Regex matching is still used in the very rare edge case of modifications to a site manifest. That leaves the catalog_resources_file_trgm index behind as unused in almost all PE installations. This type of index is extremely expensive for the database to maintain, resulting in several negative impacts that get worse as the size of a PE installation scales up:
- On row updates, regular Postgres B-Tree indexes impose a tax of one additional write to store a copy of the indexed value. GIN indexes require multiple additional writes. In the case of a GIN tri-gram, the indexed value is split into groupings of three characters (tri-grams) and an index update is written for each grouping. For a long value like a file path, this can translate into dozens or hundreds of extra writes per row update.
- The write amplification results in a lot of disk space being used. The catalog_resources_file_trgm index appears frequently in the top consumers of disk space for a PuppetDB database — the previous comment gives an example of over one third of the total PuppetDB database size being bound up in this index.
- Even when not used by queries, GIN indexes have to be updated with each new row and maintained by autovacuum and autoanalyze. Due to their amplified writes and large on-disk size, this results in significant amounts of additional read and write I/O. In the case of autovacuum and autoanalyze, Postgres has a global "budget" for I/O performed as part of maintenance that controlled by the autovacuum_vacuum_cost_limit setting. A maintenance process that is working on catalog_resources_file_trgm can easily burn this entire budget — which forces all other maintenance workers to sleep for a period of time until the budget refills. This means the the existance of a large, unused index increases the time required to maintain every other table and index in the database server and results in the entire dataset being organized less efficiently.
So, to summarize: GIN indexes are massively expensive, to the point of requiring upkeep that is disproportionate to the utility they provide when covering large sets of long strings. Often, re-factoring application logic to reduce the use of pattern matching is truly the cheaper option. catalog_resources_file_trgm is a specific index discussed in this ticket, but the fact_paths_path_trgm also exhibits the same large size coupled with low or no usage in PE installations. |