Jira (PDB-4874) regression on resource query performance on upgrade to 6.11.3 from 5.1


Julian Taylor (Jira)

Aug 27, 2020, 10:13:03 AM
to puppe...@googlegroups.com
Julian Taylor updated an issue
 
PuppetDB / Bug PDB-4874
regression on resource query performance on upgrade to 6.11.3 from 5.1
Change By: Julian Taylor
Hi,

We have tried upgrading an environment with a few thousand hosts and 4441439 entries in the catalog_resources Postgres table from PuppetDB 5.1.5 to 6.11.3, and have seen a large regression in resource query performance.

A query like
{code}
/pdb/query/v4/resources?query=["and",["=","type","Prometheus::Scrape_job"],["=","exported",true],["not",["=","certname","mprom003.server.lan"]],["=",["parameter","job_name"],"jobname"]]
{code}

which used to take 70 ms now takes 4 seconds.
The query is generated by the Prometheus scrape config:
https://github.com/voxpupuli/puppet-prometheus/blob/3ae9b926c8697938398ff2ff66c2dcda539a6603/manifests/config.pp#L234

and is executed many times, and now times out due to the slow query.

Here is the EXPLAIN ANALYZE of the slow query:
{code}
Nested Loop Left Join  (cost=591.68..1122.65 rows=1 width=1357) (actual time=4098.976..4098.976 rows=0 loops=1)
   ->  Nested Loop Left Join  (cost=591.26..1114.20 rows=1 width=636) (actual time=4098.976..4098.976 rows=0 loops=1)
         ->  Nested Loop Semi Join  (cost=591.10..1113.99 rows=1 width=612) (actual time=4098.975..4098.975 rows=0 loops=1)
               ->  Hash Join  (cost=453.88..690.22 rows=3 width=612) (actual time=62.459..70.658 rows=3810 loops=1)
                     Hash Cond: (c.certname = certnames.certname)
                     ->  Seq Scan on catalogs c  (cost=62.27..293.66 rows=1312 width=34) (actual time=0.933..4.710 rows=2625 loops=1)
                           Filter: ((certname <> 'mprom003.server.lan'::text) AND (NOT (hashed SubPlan 1)))
                           Rows Removed by Filter: 1
                           SubPlan 1
                             ->  Seq Scan on certnames certnames_1  (cost=0.00..62.27 rows=1 width=26) (actual time=0.896..0.896 rows=0 loops=1)
                                   Filter: ((deactivated IS NOT NULL) OR (expired IS NOT NULL))
                                   Rows Removed by Filter: 2627
                     ->  Hash  (cost=391.53..391.53 rows=6 width=604) (actual time=61.494..61.494 rows=3811 loops=1)
                           Buckets: 4096 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 2182kB
                           ->  Nested Loop  (cost=318.02..391.53 rows=6 width=604) (actual time=4.377..56.959 rows=3811 loops=1)
                                 ->  Bitmap Heap Scan on catalog_resources resources  (cost=317.74..341.74 rows=6 width=586) (actual time=4.367..46.896 rows=3811 loops=1)
                                       Recheck Cond: (exported AND (type = 'Prometheus::Scrape_job'::text))
                                       Heap Blocks: exact=3801
                                       ->  BitmapAnd  (cost=317.74..317.74 rows=6 width=0) (actual time=3.755..3.755 rows=0 loops=1)
                                             ->  Bitmap Index Scan on catalog_resources_exported_idx  (cost=0.00..80.54 rows=3263 width=0) (actual time=1.306..1.306 rows=3840 loops=1)
                                             ->  Bitmap Index Scan on catalog_resources_type_idx  (cost=0.00..236.95 rows=8602 width=0) (actual time=2.222..2.222 rows=7947 loops=1)
                                                   Index Cond: (type = 'Prometheus::Scrape_job'::text)
                                 ->  Index Scan using certnames_transform_pkey on certnames  (cost=0.28..8.30 rows=1 width=34) (actual time=0.002..0.002 rows=1 loops=3811)
                                       Index Cond: (id = resources.certname_id)
               ->  Bitmap Heap Scan on resource_params  (cost=137.23..141.25 rows=1 width=21) (actual time=1.056..1.056 rows=0 loops=3810)
                     Recheck Cond: ((encode(resource, 'hex'::text) = encode(resources.resource, 'hex'::text)) AND (name = 'job_name'::text))
                     Filter: (value = 'jobname'::text)
                     Rows Removed by Filter: 1
                     Heap Blocks: exact=3810
                     ->  BitmapAnd  (cost=137.23..137.23 rows=1 width=0) (actual time=1.046..1.046 rows=0 loops=3810)
                           ->  Bitmap Index Scan on resource_params_hash_expr_idx  (cost=0.00..4.86 rows=40 width=0) (actual time=0.021..0.021 rows=3 loops=3810)
                                 Index Cond: (encode(resource, 'hex'::text) = encode(resources.resource, 'hex'::text))
                           ->  Bitmap Index Scan on idx_resources_params_name  (cost=0.00..132.12 rows=5291 width=0) (actual time=0.988..0.988 rows=7479 loops=3810)
                                 Index Cond: (name = 'job_name'::text)
         ->  Index Scan using environments_pkey on environments e  (cost=0.15..0.21 rows=1 width=40) (never executed)
               Index Cond: (id = c.environment_id)
   ->  Index Scan using resource_params_cache_pkey on resource_params_cache rpc  (cost=0.42..8.44 rows=1 width=731) (never executed)
         Index Cond: (resource = resources.resource)
Planning Time: 6.063 ms
Execution Time: 4099.327 ms
(40 rows)
{code}


The database has been freshly initialized from scratch, no migration was performed.
This message was sent by Atlassian Jira (v8.5.2#805002-sha1:a66f935)

Julian Taylor (Jira)

Aug 27, 2020, 10:13:04 AM
to puppe...@googlegroups.com
Julian Taylor created an issue
Issue Type: Bug
Affects Versions: PDB 6.11.3
Assignee: Unassigned
Components: PuppetDB
Created: 2020/08/27 7:12 AM
Priority: Normal
Reporter: Julian Taylor


Julian Taylor (Jira)

Aug 27, 2020, 10:15:03 AM
to puppe...@googlegroups.com
Julian Taylor updated an issue
Change By: Julian Taylor

Slow query log:
{code}
SELECT: ... duration: 4027.270 ms  execute <unnamed>/C_3: WITH inactive_nodes AS (SELECT certname FROM certnames WHERE (deactivated IS NOT NULL AND deactivated > '2020-08-13T13:23:03.928Z') OR (expired IS NOT NULL and expired > '2020-08-13T13:23:03.928Z')), not_active_nodes AS (SELECT certname FROM certnames WHERE (deactivated IS NOT NULL OR expired IS NOT NULL)) SELECT encode(resources.resource::bytea, 'hex') AS resource, c.certname AS certname, tags AS tags, exported AS exported, line AS line, title AS title, type AS type, e.environment AS environment, file AS file, rpc.parameters AS parameters FROM catalog_resources resources INNER JOIN certnames ON resources.certname_id = certnames.id INNER JOIN catalogs c ON c.certname = certnames.certname LEFT JOIN environments e ON c.environment_id = e.id LEFT JOIN resource_params_cache rpc ON rpc.resource = resources.resource WHERE (((type = $1) AND (exported = $2) AND NOT (c.certname = $3) AND ((encode(resources.resource::bytea, 'hex')) in (SELECT res_param_resource FROM  ( (SELECT encode(resource::bytea, 'hex') AS res_param_resource FROM resource_params WHERE ((name = $4) AND (value = $5))) )  sub))) AND NOT ((c.certname) in (SELECT certname FROM  ( (SELECT not_active_nodes.certname AS certname FROM not_active_nodes) )  sub)))
parameters: $1 = 'Prometheus::Scrape_job', $2 = 't', $3 = 'mprom003.server.lan', $4 = 'job_name', $5 = '"jobname"'
{code}


Julian Taylor (Jira)

Aug 27, 2020, 10:17:03 AM
to puppe...@googlegroups.com

The database is PostgreSQL 12.

Austin Blatt (Jira)

Aug 27, 2020, 12:10:04 PM
to puppe...@googlegroups.com
Austin Blatt commented on Bug PDB-4874
 
Re: regression on resource query performance on upgrade to 6.11.3 from 5.1

Julian Taylor, are you able to provide an EXPLAIN ANALYZE of the query from the 5.1.5 Postgres, or is that gone at this point?

Julian Taylor (Jira)

Aug 28, 2020, 3:18:03 AM
to puppe...@googlegroups.com
Julian Taylor commented on Bug PDB-4874

Yes, we have rolled back the update to the old version.

What is interesting is that the query against the 5.1 database returns a result, while the query against the 6.11 database does not. The data should be the same; the same hosts have run against both databases.

The EXPLAIN ANALYZE of the same query against the old database:
{code}
 Nested Loop Left Join  (cost=750.28..3002.76 rows=1 width=1363) (actual time=15.440..73.436 rows=8 loops=1)
   ->  Nested Loop Left Join  (cost=749.85..2994.32 rows=1 width=645) (actual time=15.421..73.357 rows=8 loops=1)
         ->  Nested Loop Semi Join  (cost=749.71..2994.15 rows=1 width=615) (actual time=15.403..73.323 rows=8 loops=1)
               ->  Hash Join  (cost=749.15..1586.53 rows=4 width=615) (actual time=4.170..21.330 rows=3846 loops=1)
                     Hash Cond: (certnames.certname = c.certname)
                     ->  Nested Loop  (cost=0.56..837.86 rows=9 width=607) (actual time=0.034..15.100 rows=3947 loops=1)
                           ->  Index Scan using catalog_resources_exported_idx on catalog_resources resources  (cost=0.28..763.18 rows=9 width=589) (actual time=0.020..6.019 rows=3947 loops=1)
                                 Filter: (type = 'Prometheus::Scrape_job'::text)
                           ->  Index Scan using certnames_transform_pkey on certnames  (cost=0.28..8.30 rows=1 width=34) (actual time=0.002..0.002 rows=1 loops=3947)
                                 Index Cond: (id = resources.certname_id)
                     ->  Hash  (cost=731.62..731.62 rows=1358 width=34) (actual time=4.120..4.120 rows=2659 loops=1)
                           Buckets: 4096 (originally 2048)  Batches: 1 (originally 1)  Memory Usage: 214kB
                           ->  Seq Scan on catalogs c  (cost=391.88..731.62 rows=1358 width=34) (actual time=1.121..3.160 rows=2659 loops=1)
                                 Filter: ((certname <> 'mprom003.server.lan'::text) AND (NOT (hashed SubPlan 1)))
                                 Rows Removed by Filter: 57
                                 SubPlan 1
                                   ->  Seq Scan on certnames certnames_1  (cost=0.00..391.71 rows=68 width=26) (actual time=0.012..1.081 rows=61 loops=1)
                                         Filter: ((deactivated IS NOT NULL) OR (expired IS NOT NULL))
                                         Rows Removed by Filter: 2662
               ->  Index Scan using resource_params_hash_expr_idx on resource_params  (cost=0.56..281.64 rows=1 width=21) (actual time=0.013..0.013 rows=0 loops=3846)
                     Index Cond: (encode(resource, 'hex'::text) = encode(resources.resource, 'hex'::text))
                     Filter: ((name = 'job_name'::text) AND (value = '"jobname"'::text))
                     Rows Removed by Filter: 3
         ->  Index Scan using environments_pkey on environments e  (cost=0.14..0.16 rows=1 width=46) (actual time=0.003..0.003 rows=1 loops=8)
               Index Cond: (id = c.environment_id)
   ->  Index Scan using resource_params_cache_pkey on resource_params_cache rpc  (cost=0.42..8.44 rows=1 width=728) (actual time=0.008..0.008 rows=1 loops=8)
         Index Cond: (resource = resources.resource)
 Planning Time: 2.213 ms
 Execution Time: 73.535 ms
{code}

Julian Taylor (Jira)

Aug 28, 2020, 4:47:04 AM
to puppe...@googlegroups.com
Julian Taylor commented on Bug PDB-4874

We have verified that two queries which return the same result in both databases still show the different performance characteristics and produce the same EXPLAIN ANALYZE results. So the statement above that the slow query returned no result is probably unrelated to the issue.

Julian Taylor (Jira)

Aug 28, 2020, 7:20:02 AM
to puppe...@googlegroups.com
Julian Taylor commented on Bug PDB-4874

Setting

SET enable_nestloop = off; 

does help a bit: it reduces the query time from 3000-4000 ms to 250 ms.
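
For anyone reproducing this, a session-scoped sketch of the workaround, assuming direct psql access to the PuppetDB database (this changes the planner setting only for the current session, not globally):

{code}
-- Disable nested-loop joins for this session only; the global
-- planner configuration stays untouched.
SET enable_nestloop = off;

-- Re-run the slow resources query here to compare timings, e.g.:
-- EXPLAIN (ANALYZE, BUFFERS) SELECT ... ;

-- Restore the default afterwards.
RESET enable_nestloop;
{code}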

Julian Taylor (Jira)

Aug 28, 2020, 7:52:03 AM
to puppe...@googlegroups.com
Julian Taylor commented on Bug PDB-4874

It appears that running ANALYZE on the resource_params table helped Postgres generate a good plan with default settings. The query is now down to 90 ms.

So this is probably not a regression, just poor statistics in the database (due to it being initialized from scratch) causing the Postgres planner to create a bad plan.
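
A minimal sketch of that fix, assuming direct SQL access to the PuppetDB database (the pg_stat_user_tables check only confirms when statistics were last refreshed):

{code}
-- Recompute planner statistics for the table behind the bad estimate.
ANALYZE resource_params;

-- Confirm when the table was last analyzed, manually or by autovacuum.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'resource_params';
{code}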

Julian Taylor (Jira)

Sep 10, 2020, 9:19:05 AM
to puppe...@googlegroups.com
Julian Taylor commented on Bug PDB-4874

It appears the performance of the query regresses again after some time, and we have to run ANALYZE manually again.

Can this query be improved to be less fragile with respect to the table statistics? Two nested subqueries on such large tables seem quite problematic.

It is probably also worth a Postgres bug report: it is odd that the planner attempts a double index scan despite estimating that one of the indexes returns only 40 rows and the other 5000 rows.

Julian Taylor (Jira)

Oct 14, 2020, 4:44:03 AM
to puppe...@googlegroups.com
Julian Taylor commented on Bug PDB-4874

As the issue keeps recurring, we have performed some more tests, running the following before each query:
VACUUM ANALYZE resource_params;
SELECT attname, n_distinct, avg_width, correlation, array_to_string(most_common_vals, E'\n'), array_to_string(most_common_freqs, E'\n') FROM pg_stats WHERE tablename = 'resource_params' AND attname = 'name';

The query performance seems to become slow when the frequency of the job_name value in the name column of resource_params drops below 0.0012. There is no correlation with the n_distinct or correlation values. Other values in the frequency table should be irrelevant, as only name and value are used in the query, and value has a high distinct count.

A workaround is probably to increase the statistics sampling of the table, to reduce the probability that the job_name frequency is underestimated, or to disable the autovacuum that was added in the new version (the reason this appeared as a regression at all) after good statistics have been created.
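
Both workarounds can be sketched as follows, assuming direct SQL access; the statistics target of 1000 is an illustrative value, not a tested recommendation:

{code}
-- Workaround 1: sample the name column more aggressively so that rare
-- values such as 'job_name' get a more accurate frequency estimate
-- (the default statistics target is 100).
ALTER TABLE resource_params ALTER COLUMN name SET STATISTICS 1000;
ANALYZE resource_params;

-- Workaround 2: disable autovacuum (and hence autoanalyze) for this
-- table only, so known-good statistics are not overwritten. Note this
-- also stops automatic VACUUM, so bloat must then be managed manually.
ALTER TABLE resource_params SET (autovacuum_enabled = false);
{code}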

Margaret Lee (Jira)

Apr 29, 2021, 6:28:02 PM
to puppe...@googlegroups.com
Margaret Lee updated an issue
 
Change By: Margaret Lee
Epic Link: PE-31891