Autoanalyze and partitioned tables


Chris Hajas

Jun 12, 2023, 9:05:58 PM
to gpdb...@greenplum.org
Hi developers,

With the introduction of autoanalyze in GP7, tables are now analyzed without users explicitly needing to run analyze or analyzedb. However, autoanalyze does not populate root statistics. Even in current versions of Postgres, root statistics aren't populated by autoanalyze; populating them is left to the user via an explicit analyze. There are a few discussions on this, but there hasn't been consensus on how this should be done: https://www.postgresql.org/message-id/flat/c3162511-0e80-039e-765c-1d02d0fd0bab%40enterprisedb.com and https://www.postgresql.org/message-id/flat/CAKkQ508_PwVgwJyBY%3D0Lmkz90j8CmWNPUxgHvCUwGhMrouz6UA%40mail.gmail.com.

Partitioned tables are more widely used in GPDB, and without root stats, queries on partitioned tables will generate poor plans with both the planner and Orca. So, we'd like to figure out a path forward. GPDB does support merging statistics, which means we don't need to resample (see https://github.com/greenplum-db/gpdb/pull/4897 for more details on this). In most cases, merging stats should be a good approach. Merging stats is done through HyperLogLog and supports most data types, specifically those that are comparable. Currently, if a user explicitly analyzes a leaf partition, we will then merge statistics to the root partition IF all other leaf partitions have stats. That is, all leaf partitions must first be analyzed.
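
To make the current behavior concrete, here's a rough sketch at the SQL level (table and partition names are made up for illustration, and I'm assuming GP7's declarative partitioning syntax):

    CREATE TABLE sales (id int, sale_date date) PARTITION BY RANGE (sale_date);
    CREATE TABLE sales_jan PARTITION OF sales
        FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
    CREATE TABLE sales_feb PARTITION OF sales
        FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');

    INSERT INTO sales SELECT i, date '2023-01-15' FROM generate_series(1, 100000) i;
    INSERT INTO sales SELECT i, date '2023-02-15' FROM generate_series(1, 100000) i;

    ANALYZE sales_jan;  -- samples this leaf only; no root merge yet, since sales_feb has no stats
    ANALYZE sales_feb;  -- samples this leaf; all leaves now have stats, so leaf stats are merged up to the root
    SELECT tablename, attname FROM pg_stats WHERE tablename = 'sales';  -- root stats now present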

I don't think the implementation here is too difficult, but I'd like to reach a consensus on the approach first. I was thinking that during autoanalyze, after analyzing a leaf partition, we would also merge stats to the root partition if all other leaves have been analyzed, similar to how we currently run analyze.

There are a couple potential issues here:

1) If data is inserted into multiple partitions, autoanalyze would end up running the merge statistics command repeatedly, once after each individual partition is analyzed. While merging stats is cheap, it's not free.

2) Merging stats takes a ShareUpdateExclusiveLock on the root table in order to update stats. Is that a problem? If many partitions are updated simultaneously and each analyze then tries to update root stats, I could see this slowing down analyze progress.

3) Some datatypes are not mergeable. What should we do in these cases? Re-sample? Merge all columns except that column? I would lean toward merging everything except that column.

4) Extended stats aren't mergeable. What should we do in these cases? Re-sample? Ignore? I would lean toward ignoring.

5) For multi-level partitions, we would merge stats up to the root partition. Intermediate partitions would have no stats, similar to how it's done in 6X. Is this acceptable?

Thoughts?

Thanks,
Chris

Soumyadeep Chakraborty

Jun 12, 2023, 10:06:10 PM
to Chris Hajas, gpdb...@greenplum.org
Hey Chris,

I added some thoughts on some of the items.

On Mon, Jun 12, 2023 at 6:05 PM 'Chris Hajas' via Greenplum Developers
<gpdb...@greenplum.org> wrote:
> I don't think the implementation here is too difficult, but I'd like to reach a consensus on the approach first. I was thinking that during autoanalyze, after analyzing a leaf partition, we would also merge stats to the root partition if all other leaves have been analyzed, similar to how we currently run analyze.
>
> There are a couple potential issues here:
>
> 1) If data is inserted into multiple partitions, autoanalyze would end up running the merge statistics command repeatedly, once after each individual partition is analyzed. While merging stats is cheap, it's not free.

>
> 2) Merging stats takes a ShareUpdateExclusiveLock on the root table in order to update stats. Is that a problem? If many partitions are updated simultaneously and each analyze then tries to update root stats, I could see this slowing down analyze progress.

You mean if we have:
P
P1 P2 P3

and if we load P1, P2 and P3 then autoanalyze will run on P1, P2 and P3, and
merge_leaf_stats() will be called thrice? If so, then yes, the autovac workers
can contend with one another on the SUELock on P. Contention will be higher if
we have lots of partitions.

Also, won't the leaves themselves be locked while merge_leaf_stats() is in
play? (if yes, what kind of lock?)

Possible approach: What if we didn't merge stats in the regular workflow
(by detecting if we are doing auto-analyze and bailing)? For relkind = 'p', we
could then call merge instead of recursive analyze. We could do this whenever
reltuples for the partitioned table exceeds the scale factor percentage?
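
Roughly, the trigger condition could look something like this (just a sketch: it approximates the root's activity by summing the leaves' pg_stat counters, since the partitioned table itself doesn't accumulate them today, and reuses the standard autovacuum analyze-threshold formula; single-level partitioning assumed):

    SELECT i.inhparent::regclass AS root,
           sum(s.n_mod_since_analyze) AS changes_since_analyze,
           current_setting('autovacuum_analyze_threshold')::float8
             + current_setting('autovacuum_analyze_scale_factor')::float8
               * sum(c.reltuples) AS analyze_threshold
    FROM pg_inherits i
    JOIN pg_class c ON c.oid = i.inhrelid
    JOIN pg_stat_all_tables s ON s.relid = i.inhrelid
    GROUP BY i.inhparent;
    -- merge stats for a root whenever changes_since_analyze > analyze_threshold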

Regards,
Soumyadeep (VMware)

Ashwin Agrawal

Jun 13, 2023, 1:47:06 AM
to Soumyadeep Chakraborty, Chris Hajas, gpdb...@greenplum.org
On Tue, Jun 13, 2023 at 7:36 AM Soumyadeep Chakraborty <soumyad...@gmail.com> wrote:
Possible approach: What if we didn't merge stats in the regular workflow
(by detecting if we are doing auto-analyze and bailing)? For relkind = 'p', we
could then call merge instead of recursive analyze. We could do this whenever
reltuples for the partitioned table exceeds the scale factor percentage?

I am not sure if the two limitations mentioned in this old thread on the topic have been lifted (as it relates to the question in the linked thread).

--
Ashwin Agrawal (VMware)

Chris Hajas

Jun 13, 2023, 7:46:25 PM
to Soumyadeep Chakraborty, gpdb...@greenplum.org
Correct. I'm not sure how we could get around the SUELock contention on the root. In analyzedb, we explicitly don't merge statistics until all leaf partitions are analyzed. We don't have that luxury here, since we don't know when another leaf may get tuples inserted. For example, the worst case scenario would be inserting the minimum number of tuples to trigger autoanalyze into each leaf partition. That's likely not a common use case, but something we should be aware of.

The leaves aren't locked during merge_leaf_stats(). The root has an access share lock, which will prevent detaching/attaching partitions and DDL operations.

For the root partition approach you mentioned, I think that would limit the number of analyzes, which would be beneficial. One concern there is that we wouldn't trigger as often as we should, especially for GPDB use cases. For example, if a user adds a new partition each week, that likely won't have enough data to trigger analyze, but users will likely be querying that new data and will want it reflected in the stats. This may be a tradeoff depending on whether we're optimizing for more updates and possible contention vs. fewer updates but leaving out new data. There's also the attach partition use case we'll need to consider (should this trigger an autoanalyze?)
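
To put very rough numbers on that concern (all figures made up): with the default autovacuum_analyze_threshold of 50 and autovacuum_analyze_scale_factor of 0.1, a root that already holds a billion rows would need on the order of 100 million changed tuples before a root-level trigger fires, so a new weekly partition of a few million rows would never get the root stats refreshed:

    SELECT 50 + 0.1 * 1000000000 AS root_analyze_threshold;  -- 100000050 changed tuples needed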

Let me run some tests to see exactly how long the merge step takes. I know it's proportional to the number of partitions and columns.


From: Soumyadeep Chakraborty <soumyad...@gmail.com>
Sent: Monday, June 12, 2023 7:05 PM
To: Chris Hajas <cha...@vmware.com>
Cc: gpdb...@greenplum.org <gpdb...@greenplum.org>
Subject: Re: Autoanalyze and partitioned tables
 

Chris Hajas

Jun 16, 2023, 1:27:10 PM
to Soumyadeep Chakraborty, Chris Hajas, gpdb...@greenplum.org
I did some tests on my system. I can see that we get into the situation where we sequentially analyze each leaf. The numbers below are only for merging, so the amount of data doesn't matter. Currently, when explicitly analyzing a leaf partition, merging is done after each partition is analyzed.

100 partitions, 2 columns: 30ms
1000 partitions, 2 columns: 130ms

100 partitions, 100 columns: 600ms
1000 partitions, 100 columns: 6300ms

In the 1000 partition, 100 column case it takes ~6s for each merge, which by itself isn't too bad but does add up, especially if there are many partitions.

Another approach is to do something similar to analyzedb and not merge after each leaf in autoanalyze, which I think is somewhat similar to what Deep suggested. Analyzedb retrieves the list of relations; for each leaf partition, it adds the leaf's root to a set and analyzes the leaf, but does not merge. At the end of analyzedb, it then runs `analyze_rootpartition` on the root partitions in that set, which only does the merge on each root.

Autoanalyze has similar logic: it retrieves the list of relids that need to be analyzed. If a relid is a leaf, we can get its root relid, add it to a set, and then run only the merge step on those root partitions.
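
In SQL terms, the end state would look like what analyzedb effectively does today (reusing the hypothetical sales table from my earlier sketch, and assuming the merge-only ANALYZE ROOTPARTITION syntax is still available in GP7; autoanalyze would do the equivalent internally rather than issuing these statements):

    ANALYZE sales_jan;            -- sample the leaves that saw changes
    ANALYZE sales_feb;
    ANALYZE ROOTPARTITION sales;  -- a single merge-only pass over the root at the end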



From: 'Chris Hajas' via Greenplum Developers <gpdb...@greenplum.org>
Sent: Tuesday, June 13, 2023 4:46 PM
To: Soumyadeep Chakraborty <soumyad...@gmail.com>

Chris Hajas

Jun 16, 2023, 6:34:02 PM
to Soumyadeep Chakraborty, gpdb...@greenplum.org
I talked with Soumyadeep offline a bit. There are 2 approaches we discussed:

1) For each leaf, add its root partition to a set. Perform the sampling of all tables, and once all sampling has finished, merge statistics for each root table in the set if all leaves of that root have been analyzed (sampled). Ignore unsupported/extended columns. This has the advantage of minimizing merging and is fairly simple to implement.

For this approach, we would also want to handle the case when a partition is attached/detached/exchanged/dropped. Consider an ETL process that inserts data into a table; that table is analyzed and then attached to a partitioned table. In this proposal, the stats of the root wouldn't be updated, so we would either want to automatically analyze, or, even better, add the root to the autoanalyze queue to be analyzed (a rough SQL sketch of this flow follows below).

2) The 2nd approach is simpler. During the autoanalyze loop, determine the root partitions. If the root stats haven't been merged within some user-configurable time (maybe an hour, or 12h), run the merge step. This could merge stats unnecessarily, and doing it too often would cause unnecessary processing (since merging stats is CPU intensive).
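
A rough SQL sketch of the ETL/attach flow from 1), reusing the hypothetical sales table from earlier in the thread (all names made up):

    CREATE TABLE sales_mar (LIKE sales INCLUDING DEFAULTS);
    INSERT INTO sales_mar
        SELECT i, date '2023-03-01' + (i % 28) FROM generate_series(1, 100000) i;
    ANALYZE sales_mar;  -- the staging table now has fresh stats of its own
    ALTER TABLE sales ATTACH PARTITION sales_mar
        FOR VALUES FROM ('2023-03-01') TO ('2023-04-01');
    -- the root's merged stats don't reflect March at all; the idea is to either
    -- merge on attach or queue the root for autoanalyze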

- Chris

From: Chris Hajas <cha...@vmware.com>
Sent: Friday, June 16, 2023 10:26 AM
To: Soumyadeep Chakraborty <soumyad...@gmail.com>; Chris Hajas <cha...@vmware.com>

Ashwin Agrawal

Jun 21, 2023, 7:36:05 AM
to Chris Hajas, Soumyadeep Chakraborty, gpdb...@greenplum.org
On Sat, Jun 17, 2023 at 4:04 AM 'Chris Hajas' via Greenplum Developers <gpdb...@greenplum.org> wrote:
I talked with Soumyadeep offline a bit. There are 2 approaches we discussed:

1) For each leaf, add its root partition to a set. Perform the sampling of all tables, and once all sampling has finished, merge statistics for each root table in the set if all leaves of that root have been analyzed (sampled). Ignore unsupported/extended columns. This has the advantage of minimizing merging and is fairly simple to implement.

This approach reads well, though it is a bit fuzzy to me (given I don't have full clarity on how autoanalyze schedules tables for analysis). If the user performs inserts at different times into different child partitions, it would behave similarly to what's happening currently, right? Within one cycle of autoanalyze, if multiple child partitions belonging to the same root are scheduled, then it has the advantage of avoiding repeated merging, as per my understanding.

Plus, for the case mentioned in the email I referenced earlier, where inserting via the root table only updates pg_stat for the root table on the coordinator: in that case, will autoanalyze process each and every child table and merge stats once?

For this approach, we would also want to handle the case when a partition is attached/detached/exchanged/dropped. Consider an ETL process that inserts data into a table; that table is analyzed and then attached to a partitioned table. In this proposal, the stats of the root wouldn't be updated, so we would either want to automatically analyze, or, even better, add the root to the autoanalyze queue to be analyzed.

2) The 2nd approach is simpler. During the autoanalyze loop, determine the root partitions. If the root stats haven't been merged within some user-configurable time (maybe an hour, or 12h), run the merge step. This could merge stats unnecessarily, and doing it too often would cause unnecessary processing (since merging stats is CPU intensive).

For this, will we have to introduce bookkeeping of when the merge was performed, or is that information already tracked and handy to use if implementing this approach? A time-based threshold to trigger the merge seems to add unpredictability to the system, where plans can change based on it, which makes it a little trickier compared to merging based on when a child partition was analyzed and saw a significant change of data.

--
Ashwin Agrawal (VMware)

Chris Hajas

Jun 22, 2023, 6:39:34 PM
to Ashwin Agrawal, Soumyadeep Chakraborty, gpdb...@greenplum.org
  1. Correct, if users perform inserts at different times in an adversarial way, then autoanalyze would continuously merge. However, if there are multiple child partitions queued for a single autoanalyze iteration, we won't merge multiple times. I think this is acceptable, since we want the root partition stats to be up to date. We'll still never explicitly track inserts into the root partition, so the entire tree will never be sampled. I don't think we should sample the entire tree in autoanalyze.
  2. For the 2nd approach we'd have to add that bookkeeping. I agree that it'd add unpredictability since it's pretty arbitrary when it's being merged. That's definitely a downside with that approach. I'm leaning away from this one for now.
Thanks,
Chris


From: Ashwin Agrawal <ashwi...@gmail.com>
Sent: Wednesday, June 21, 2023 4:35 AM
To: Chris Hajas <cha...@vmware.com>
Cc: Soumyadeep Chakraborty <soumyad...@gmail.com>; gpdb...@greenplum.org <gpdb...@greenplum.org>

Subject: Re: Autoanalyze and partitioned tables
 