We have tried updating an environment with a few thousand hosts and 4441439 entries in the catalog_resources Postgres table from PuppetDB 5.1.5 to 6.11.3 and have seen a large regression in resource query performance.
A query like {code} /pdb/query/v4/resources?query=["and",["=","type","Prometheus::Scrape_job"],["=","exported",true],["not",["=","certname","mprom003.server.lan"]],["=",["parameter","job_name"],"jobname"]] {code}
is executed many times and now times out because of the slow execution.
Slow query log: {code} SELECT: ... duration: 4027.270 ms execute <unnamed>/C_3: WITH inactive_nodes AS (SELECT certname FROM certnames WHERE (deactivated IS NOT NULL AND deactivated > '2020-08-13T13:23:03.928Z') OR (expired IS NOT NULL and expired > '2020-08-13T13:23:03.928Z')), not_active_nodes AS (SELECT certname FROM certnames WHERE (deactivated IS NOT NULL OR expired IS NOT NULL)) SELECT encode(resources.resource::bytea, 'hex') AS resource, c.certname AS certname, tags AS tags, exported AS exported, line AS line, title AS title, type AS type, e.environment AS environment, file AS file, rpc.parameters AS parameters FROM catalog_resources resources INNER JOIN certnames ON resources.certname_id = certnames.id INNER JOIN catalogs c ON c.certname = certnames.certname LEFT JOIN environments e ON c.environment_id = e.id LEFT JOIN resource_params_cache rpc ON rpc.resource = resources.resource WHERE (((type = $1) AND (exported = $2) AND NOT (c.certname = $3) AND ((encode(resources.resource::bytea, 'hex')) in (SELECT res_param_resource FROM ( (SELECT encode(resource::bytea, 'hex') AS res_param_resource FROM resource_params WHERE ((name = $4) AND (value = $5))) ) sub))) AND NOT ((c.certname) in (SELECT certname FROM ( (SELECT not_active_nodes.certname AS certname FROM not_active_nodes) ) sub))) parameters: $1 = 'Prometheus::Scrape_job', $2 = 't', $3 = 'mprom003.server.lan', $4 = 'job_name', $5 = '"jobname"' {code}
Here is the EXPLAIN of the slow query:
{code}
{code}
The database has been freshly initialized from scratch, no migration was performed.
Yes, we have rolled back the update to the old version.
What is interesting is that the query against the 5.1 database returns a result while the same query against the 6.11 database does not. The data set should be the same, as the same hosts have run against both databases.
The EXPLAIN ANALYZE of the same query in the old database:
{code}
-> Index Scan using catalog_resources_exported_idx on catalog_resources resources (cost=0.28..763.18 rows=9 width=589) (actual time=0.020..6.019 rows=3947 loops=1)
     Filter: (type = 'Prometheus::Scrape_job'::text)
-> Index Scan using certnames_transform_pkey on certnames (cost=0.28..8.30 rows=1 width=34) (actual time=0.002..0.002 rows=1 loops=3947)
     Filter: ((deactivated IS NOT NULL) OR (expired IS NOT NULL))
     Rows Removed by Filter: 2662
-> Index Scan using resource_params_hash_expr_idx on resource_params (cost=0.56..281.64 rows=1 width=21) (actual time=0.013..0.013 rows=0 loops=3846)
     Index Cond: (encode(resource, 'hex'::text) = encode(resources.resource, 'hex'::text))
     Filter: ((name = 'job_name'::text) AND (value = '"jobname"'::text))
     Rows Removed by Filter: 3
-> Index Scan using environments_pkey on environments e (cost=0.14..0.16 rows=1 width=46) (actual time=0.003..0.003 rows=1 loops=8)
     Index Cond: (id = c.environment_id)
-> Index Scan using resource_params_cache_pkey on resource_params_cache rpc (cost=0.42..8.44 rows=1 width=728) (actual time=0.008..0.008 rows=1 loops=8)
{code}
We have verified that two queries that return the same result in both databases still show the different performance characteristics and have the same EXPLAIN ANALYZE results. So the observation above that the slow query returned no result is probably unrelated to the issue.
It appears that running an ANALYZE on the resource_params table helped Postgres generate a good plan with default settings. The query is now down to 90 ms.
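For reference, the fix was just the plain statement (standard Postgres syntax, no PuppetDB-specific steps involved):
{code}
-- refresh planner statistics for the table used by the nested subquery
ANALYZE resource_params;
{code}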
So this is probably not a regression, just poor statistics in the database (due to it being initialized from scratch) causing the Postgres planner to create a bad plan.
It appears the performance of the query regresses again after some time, and we have to run ANALYZE manually again.
Can this query be improved to be less fragile with respect to the table statistics? Two nested subqueries on such large tables seem quite problematic.
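One possible direction, purely as an untested sketch (identifiers taken from the slow query above), would be to match resource_params on the raw resource hash with EXISTS instead of comparing encode() output through a nested IN subquery:
{code}
-- untested sketch: replace the nested IN over encode(...) with an EXISTS
-- on the raw hash, so the planner can use a plain equality join
... WHERE type = 'Prometheus::Scrape_job'
  AND exported = true
  AND c.certname <> 'mprom003.server.lan'
  AND EXISTS (SELECT 1
              FROM resource_params rp
              WHERE rp.resource = resources.resource
                AND rp.name = 'job_name'
                AND rp.value = '"jobname"')
{code}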
Though it is probably also worth a Postgres bug report. It is odd that the planner attempts a double index scan even though it estimates that one of the indexes returns only 40 rows and the other 5000.
As the issue keeps recurring, we have performed some more tests, running the following before each query:
{code}
VACUUM ANALYZE resource_params;
SELECT attname, n_distinct, avg_width, correlation,
       array_to_string(most_common_vals, E'\n'),
       array_to_string(most_common_freqs, E'\n')
FROM pg_stats
WHERE tablename = 'resource_params' AND attname = 'name';
{code}
The query seems to become slow when the sampled frequency of the job_name value in the name column of resource_params drops below 0.0012. There is no correlation with the n_distinct or correlation values. Other values in the frequency table should be irrelevant, as only name and value are used, and value has a high distinct count.
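As a convenience (not part of the test procedure above), the sampled frequency of just the job_name entry can be pulled out of pg_stats like this, assuming a Postgres version with the two-argument unnest (9.4+):
{code}
-- extract only the most_common_freqs entry belonging to 'job_name'
SELECT s.freq
FROM pg_stats,
     unnest(most_common_vals::text::text[], most_common_freqs) AS s(val, freq)
WHERE tablename = 'resource_params'
  AND attname = 'name'
  AND s.val = 'job_name';
{code}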
A workaround is probably to increase the statistics sampling of the table to reduce the probability that the job_name frequency is underestimated, or to disable the autovacuum that was added in the new version (and which is the reason this appeared as a regression) once good statistics have been created.
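Both workaround options can be sketched as follows; the target value 1000 is a guess, not a tuned number:
{code}
-- option 1: raise the per-column statistics target (default is 100) and
-- re-analyze, so rare values like 'job_name' are sampled more reliably
ALTER TABLE resource_params ALTER COLUMN name SET STATISTICS 1000;
ANALYZE resource_params;

-- option 2: turn off autovacuum (and thus autoanalyze) for this table
-- after good statistics have been created
ALTER TABLE resource_params SET (autovacuum_enabled = false);
{code}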