Plan to remove auto_stats code from Greenplum Master/7.

Junfeng Yang

unread,

Oct 28, 2020, 10:39:31 PM10/28/20

to gpdb...@greenplum.org

Hi guys,

Currently we enabled autovaccum to run the auto-ANALYZE across databases on Master branch with these two PR:

https://github.com/greenplum-db/gpdb/pull/10515

https://github.com/greenplum-db/gpdb/pull/10988

Maybe now it's the time to remove the auto_stats logic from Greenplum.

The simple compration between auto_stats and auto-ANALYZE.

	auto_stats	auto-ANALYZE
Targets	Relations except views and root/mid-layer partition table.	Same with auto_stats. But it will support execute on root/mid-layer partition tables once pg enable them. There's a thread to do this. https://www.postgresql.org/message-id/CAKkQ508_PwVgwJyBY=0Lmkz90j8CmWNPUx...@mail.gmail.com
Execution time	Could be execute in the same transaction for evey statement that may change the tables' stats. But the default settings is to skip analyze when there's statistics for the relation.	For each database, autovacuum is triggered every autovacuum_naptime.
Behavior	With in the same transaction, execute ANALYZE for the relation. Try to expand analyze on parent partition table in current transaction if current table is leaf partition. This may bring dead lock issue which discussed in: https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/gqordopb6Gg/m/bMFTA2-MCQAJ	Executed in seperate process, only execute ANALYZE on the relations that reach the autovacuum_analyze_threshold. Never expand ANALYZE on parent partition tables. Neve ANALYZE on root/mid-layer partition table for now, but will be supported, see the link above.
Others	With default auto_stats settings, user still need to schedule ANALYZE on all tables since it only execute ANALYZE if the table is never analyzed before.	User need to schedule ANALYZE on root/mid-layer partition table for now.

For the performance.

Test loading the data into several tables on gcp with auto_stats or with auto-ANALYZE enabled.

The gcp cluster contains 1 master host and 4 segment hosts.

For each segment hosts, it contains 8 primary segments and 8 mirror segments.

Each host has the same configuration (24 vCPUs, 48 GB memory)

The total data generated for loading is 960GB.

Since I don't find any existing benchmark to evaluate the backgroud write impact for Greenplum,

so I decide to modify the TPC-DS's data loading script. And this is not accurate.

Overall, when enable auto-ANALYZE, the data loading will have around 8% decline. But note here the

loading is single process, if we run loading concurrently, ANALYZE through auto_stats should also have impact

for other processes.

Also the auto_stats and auto-ANALYZE are all using default settings.

Here are the detail results:

with auto_stats enabled

with auto-ANALYZE enabled

table_name tuples seconds

call_center 42 0.366

catalog_page 30000 0.334

catalog_returns 143996756 94.94226

catalog_sales 1439980416 850.850487

customer 12000000 4.466

customer_address 6000000 2.2282

customer_demographics 1920800 0.356

date_dim 73049 0.402

household_demographics 7200 0.091

income_band 20 0.078

inventory 783000000 36.36603

item 300000 1.1804

promotion 1500 0.121

reason 65 0.323

ship_mode 20 0.077

store 1002 0.137

store_returns 287999764 153.153594

store_sales 2879987999 1123.112331

time_dim 86400 0.373

warehouse 20 0.094

web_page 3000 0.106

web_returns 71997522 28.28437

web_sales 720000376 418.418055

web_site 54 0.136

table_name tuples seconds

call_center 42 0.157

catalog_page 30000 0.168

catalog_returns 143996756 111.11141

catalog_sales 1439980416 930.930319

customer 12000000 4.4676

customer_address 6000000 2.2321

customer_demographics 1920800 0.385

date_dim 73049 0.388

household_demographics 7200 0.092

income_band 20 0.074

inventory 783000000 38.38896

item 300000 1.1964

promotion 1500 0.14

reason 65 0.075

ship_mode 20 0.328

store 1002 0.167

store_returns 287999764 187.187212

store_sales 2879987999 1208.120863

time_dim 86400 0.364

warehouse 20 0.104

web_page 3000 0.125

web_returns 71997522 30.30341

web_sales 720000376 434.434357

web_site 54 0.145

I think replace the auto_stats with auto-ANALYZE is workable.

To move forward, the next steps are:

Enable auto-ANALYZE in our concourse pipeline test.
Remove the auto_stats related code and set auto-ANALYZE enable as default.

Do you have any concerns or any suggestions?

Regards,

Junfeng

Ashwin Agrawal

unread,

Oct 29, 2020, 8:21:46 PM10/29/20

to Junfeng Yang, gpdb...@greenplum.org

On Wed, Oct 28, 2020 at 7:39 PM Junfeng Yang <yje...@vmware.com> wrote:

Hi guys,

Currently we enabled autovaccum to run the auto-ANALYZE across databases on Master branch with these two PR:

https://github.com/greenplum-db/gpdb/pull/10515

https://github.com/greenplum-db/gpdb/pull/10988

Maybe now it's the time to remove the auto_stats logic from Greenplum.

Please can you elaborate on:

What's the code/functional complexity of keeping both for GPDB7?
Will there be maintenance overhead (like tests, incorporating with newer features, etc..) with keeping both the implementations?
How much approximate lines of code removal will it be?

What's our plan for gp_autostats_mode_in_functions GUC?
gp_autostats_mode has 3 choices OFF, ON_CHANGE, ON_NO_STATS. With newer implementation we only have OFF and ON_CHANGE. Plus, will lose session level capability to control things or selectively ignore tables from auto_stats. I don't know why default is ON_NO_STATS compared to ON_CHANGE. Did we look into history for a reason? Or validate if this GUC is mostly flipped by most users to ON_CHANGE?
Are we going to leverage `autovacuum_enabled` reloption to provide flexibility for ignoring tables for auto-analyze?
Are temporary tables included or excluded from auto-analyze?

My current thinking is if it's not complex to keep both the

implementations (testing and other maintenance overhead is not high to
keep the auto-analyze), we should not remove that piece of
functionality before receiving feedback on how the newer functionality
is behaving in the field. Given its already GUC controlled can be
easily turned off/on as required and feedback received. Before getting
too destructive. At least till the beta release phase we can keep it
and based on beta feedback maybe decide. I feel the two
functionalities are fairly independent and don't interfere with each
other. Actually, auto-stats can kind of help auto-analyze to do the
job and reduce deferred load :-)

We should definitely enable auto-analyze by default.

Maybe make auto-stats default NONE instead of ON_NO_STATS.

--

Ashwin Agrawal (VMware)

Junfeng Yang

unread,

Nov 2, 2020, 12:51:27 AM11/2/20

to Ashwin Agrawal, gpdb...@greenplum.org

Hi Ashwin,

What's the code/functional complexity of keeping both for GPDB7?

The logic of auto_stats is quite simple, just put the function at the place when needed and call `ExecVacuum`. There shouldn't be any complexity for keeping both. And when enable auto-ANALYZE, the auto_stats will be skipped.

Will there be maintenance overhead (like tests, incorporating with newer features, etc..) with keeping both the implementations?

We do have some tests. The overhead should not be so large.

How much approximate lines of code removal will it be?

Around 1000 lines.

What's our plan for gp_autostats_mode_in_functions GUC?

We don't need to consider functions separately for auto-ANALYZE. The auto-ANALYZE is triggered based on pgstat table stats which counted through the relations' access method.
For each relation, we can control whether enable auto-ANALYZE by setting it's storage parameter.

gp_autostats_mode has 3 choices OFF, ON_CHANGE, ON_NO_STATS. With newer implementation we only have OFF and ON_CHANGE. Plus, will lose session level capability to control things or selectively ignore tables from auto_stats. I don't know why default is ON_NO_STATS compared to ON_CHANGE. Did we look into history for a reason? Or validate if this GUC is mostly flipped by most users to ON_CHANGE?

From my point of view, ON_CHANGE will add lot of overhead when we doing a simple one row insert/update/delete since it'll trigger ANALYZE all the time(And for the auto-ANALYZE, the ANALYZE only triggered when the tuple change reach the threshold). And from the current code, we still have 3 choices. Where is the newer implementation?

Are we going to leverage `autovacuum_enabled` reloption to provide flexibility for ignoring tables for auto-analyze?

Yes, this part has no different from upstream.

Are temporary tables included or excluded from auto-analyze?

auto-ANALYZE never execute ANALYZE on temporary tables.

And I think you concern make sense. The reason I want to remove it is that I think if auto-ANALYZE works well, so if auto_stats never used, it will be dead code.

And auto_stats add overhead for each statement and have some issues.

May cause one phase commit to two phase. For example insert only to a single segment, but trigger auto-ANALYZE.
The dead lock issue mentioned before.

Wait until beta release for feedback is a good solution I think.

So we could enable the auto-ANALYZE as default at first to find potential issues.

Regards,

Junfeng

Ashwin Agrawal

unread,

Nov 2, 2020, 2:09:38 PM11/2/20

to Junfeng Yang, gpdb...@greenplum.org

On Sun, Nov 1, 2020 at 9:51 PM Junfeng Yang <yje...@vmware.com> wrote:

What's the code/functional complexity of keeping both for GPDB7?

The logic of auto_stats is quite simple, just put the function at the place when needed and call `ExecVacuum`. There shouldn't be any complexity for keeping both. And when enable auto-ANALYZE, the auto_stats will be skipped.

That's good to know. I am still missing why auto-stats will be skipped if auto-analyze is enabled? It will purely depend on (independently) the value of gp_autostats_mode GUC still, right?

Will there be maintenance overhead (like tests, incorporating with newer features, etc..) with keeping both the implementations?

We do have some tests. The overhead should not be so large.

Okay that's awesome.

How much approximate lines of code removal will it be?

Around 1000 lines.

Okay.

What's our plan for gp_autostats_mode_in_functions GUC?

We don't need to consider functions separately for auto-ANALYZE. The auto-ANALYZE is triggered based on pgstat table stats which counted through the relations' access method.
For each relation, we can control whether enable auto-ANALYZE by setting it's storage parameter.

I understand auto-analyze is triggered irrespective. Though GPDB provided this GUC separate from gp_autostats_mode means there have been usecases where users wished to have stats computed while the function is executing and not in delayed fashion.

So, in that case, auto-analyzing it in a delayed fashion I don't is helpful. Hence, need to understand the impact of removing this feature as well along with auto-stats whenever we do it in future.

gp_autostats_mode has 3 choices OFF, ON_CHANGE, ON_NO_STATS. With newer implementation we only have OFF and ON_CHANGE. Plus, will lose session level capability to control things or selectively ignore tables from auto_stats. I don't know why default is ON_NO_STATS compared to ON_CHANGE. Did we look into history for a reason? Or validate if this GUC is mostly flipped by most users to ON_CHANGE?

From my point of view, ON_CHANGE will add lot of overhead when we doing a simple one row insert/update/delete since it'll trigger ANALYZE all the time(And for the auto-ANALYZE, the ANALYZE only triggered when the tuple change reach the threshold). And from the current code, we still have 3 choices. Where is the newer implementation?

Well, auto-stats ON_CHANGE doesn't invoke analyze on every change either.

GUC gp_autostats_on_change_threshold controls the threshold for the same similar to auto-analyze. So, it's threshold based as well. I don't think currently it works for singleton inserts or updates.

By newer implementation I meant auto-analyze.

Are we going to leverage `autovacuum_enabled` reloption to provide flexibility for ignoring tables for auto-analyze?

Yes, this part has no different from upstream.

Are temporary tables included or excluded from auto-analyze?

auto-ANALYZE never execute ANALYZE on temporary tables.

So, maybe for temp tables auto-stats is still required. Though one can always argue users should add analyze if required for the temp tables. But again that would be regression from current behavior.

And I think you concern make sense. The reason I want to remove it is that I think if auto-ANALYZE works well, so if auto_stats never used, it will be dead code.

And auto_stats add overhead for each statement and have some issues.

May cause one phase commit to two phase. For example insert only to a single segment, but trigger auto-ANALYZE.
The dead lock issue mentioned before.

Wait until beta release for feedback is a good solution I think.

So we could enable the auto-ANALYZE as default at first to find potential issues.

I highly appreciate the thought about not leaving behind dead code and functionality. Seems we are in agreement that it's not dead yet and useful for certain situations.

Plus, good to let customers slowly migrate / get accustomed to newer functionality and then once really dead we can rip it out.

Thank You for driving this important feature forward.

--

Ashwin Agrawal (VMware)

Junfeng Yang

unread,

Nov 3, 2020, 3:32:26 AM11/3/20

to Ashwin Agrawal, gpdb...@greenplum.org

Hi Ashwin,

That's good to know. I am still missing why auto-stats will be skipped if auto-analyze is enabled? It will purely depend on (independently) the value of gp_autostats_mode GUC still, right?

I don't see any need to keep both ANALYZE auto trigger method in same time. Once `AutoVacuumingActive()` returns true, there's no need to run the auto_stats. Since it'll trigger the auto-ANALYZE.

I understand auto-analyze is triggered irrespective. Though GPDB provided this GUC separate from gp_autostats_mode means there have been usecases where users wished to have stats computed while the function is executing and not in delayed fashion.

So, in that case, auto-analyzing it in a delayed fashion I don't is helpful. Hence, need to understand the impact of removing this feature as well along with auto-stats whenever we do it in future.

After I dig some history issues. It seems the auto_stats is not works for functions at first. So to resolve this, bring in another GUC and add auto_stats into functions.c. But not sure why we bring the new GUC.

It actually could make the gp_autostats_mode works for functions too. Cause from the issue, user expects the gp_autostats_mode works for functions.

GUC gp_autostats_on_change_threshold controls the threshold for the same similar to auto-analyze.

We should note here the `gp_autostats_on_change_threshold ` only get evaulated in single statement. If we insert 1 tuple million times which never reach the threshold, auto_stats is skipped. And still need user to do

the manual ANALYZE. But auto-ANALYZE do count these changes.

So, maybe for temp tables auto-stats is still required. Though one can always argue users should add analyze if required for the temp tables. But again that would be regression from current behavior.

Yes, I agree. But when we dealing the upstream merge, we discard lots of GPDB own behaviros, and we see these issue reports occasionally. So what's the judgment criteria to help us make decision?

Let's keep them both for now, and this is the best solution at this stage I think.

And thanks a lot Ashwin! Your comments drive me think futher.

Regards,

Junfeng

Junfeng Yang

unread,

Nov 26, 2020, 9:02:11 PM11/26/20

to Ashwin Agrawal, gpdb...@greenplum.org

Hi Ashwin,

Since there still some concerns for enable auto-ANALYZE as default. So let's just enable auto-ANALYZE in pipeline test.

And user could enable auto-ANALYZE if they want by setting `set autovacuum=on` on coordinator.

So for the PR https://github.com/greenplum-db/gpdb/pull/11142, I'll only enable auto-ANALYZE in pipeline test,

and fix related test cases.

Does this sounds good for you?

Regards,

Junfeng

发件人: Junfeng Yang <yje...@vmware.com>
发送时间: 2020年11月3日 16:32
收件人: Ashwin Agrawal <ashwi...@gmail.com>
抄送: gpdb...@greenplum.org <gpdb...@greenplum.org>
主题: 回复: Plan to remove auto_stats code from Greenplum Master/7.

Ashwin Agrawal

unread,

Nov 30, 2020, 7:14:23 PM11/30/20

to Junfeng Yang, gpdb...@greenplum.org

On Thu, Nov 26, 2020 at 6:02 PM Junfeng Yang <yje...@vmware.com> wrote:

Hi Ashwin,

Since there still some concerns for enable auto-ANALYZE as default. So let's just enable auto-ANALYZE in pipeline test.

And user could enable auto-ANALYZE if they want by setting `set autovacuum=on` on coordinator.

So for the PR https://github.com/greenplum-db/gpdb/pull/11142, I'll only enable auto-ANALYZE in pipeline test,

and fix related test cases.

Does this sounds good for you?

No, let's not go back on this. You have done excellent work. Leaving it to users to enable is not good. Let's continue discussing how to bake the feature and deliver it fully enabled by default.

I don't think we have too many things (concerns) to be discussed in the PR. Due to vacations and other stuff, I wasn't able to reply to PR comments. Will try to do it this week. Also, I would highly appreciate, if more members of the team can jump in, share thoughts and help to get this important feature to proper completion.

--

Ashwin Agrawal (VMware)

Luis Macedo

unread,

Dec 1, 2020, 9:18:50 AM12/1/20

to Greenplum Developers, Ashwin Agrawal, gpdb...@greenplum.org, Junfeng Yang, Louis Mugnano

Guys, there is an important fact that we should consider.

When running an ETL job with many steps using work tables, the users expect auto-stats to kick in and collect stats before the work table is used on the next step. If we move to a async process we will loose that characteristic witch might make many ETL processes perform much worse due to not having any stats collected.

How the new auto-analyze will impact that scenario? Is my understanding correct?

Rgds,

Luis Macedo

Ashwin Agrawal

unread,

Dec 1, 2020, 5:26:11 PM12/1/20

to Luis Macedo, gpdb...@greenplum.org, Junfeng Yang, Louis Mugnano

On Tue, Dec 1, 2020 at 6:18 AM Luis Macedo <lma...@pivotal.io> wrote:

Guys, there is an important fact that we should consider.

When running an ETL job with many steps using work tables, the users expect auto-stats to kick in and collect stats before the work table is used on the next step. If we move to a async process we will loose that characteristic witch might make many ETL processes perform much worse due to not having any stats collected.

How the new auto-analyze will impact that scenario? Is my understanding correct?

Thanks Luis for the participation and comment. This (gp_autostats_mode should be touched or not) is the main aspect we have been debating for this feature. My past proposal aligns with what you mentioned. Auto-analyze should be an additive feature which helps for use cases we don't cover currently but shouldn't take away anything (or only some minimal stuff if really required too).

So, just to put it in a simplified way, consider for OLAP (1 million tuple inserted in one shot) current auto-stats helps and for OLTP (1 million singleton inserts are made) auto-analyze will help.

The only argument where it confuses things based on PR discussion so far with Junfeng:

> If auto-ANALYZE enabled, it should use auto-ANALYZE as Postgres does, there no need to turn off gp_autostats_mode . And shouldn't make then both work for below reason:

> A cluster with autovacuum=on, and create a table with autovacuum_enabled = false.
> As a user, I would expect the table will always have empty stats unless I manually run ANALYZE.
> But if we still run autostats with autovacuum=on. The table will have stats.
> So we should not run autostats with autovacuum=on.

First I don't think that's so much of usecase to worry about. Secondly with autovacuum_enabled = false, only aspect user wishes to convey is don't trigger autovacuum in background and spend cycle on this table. I don't think it has any aspect to convey, don't want stats for this table, as why wouldn't anyone not want it. So, this usecase I won't worry at all about.

Aspect I worry about, will ANALYZE get triggered multiple times for the table if both auto-stats are ON and auto-analyze is ON? Is there something in the system which prevents auto-analyze from kicking in if auto-stats (or manual ANALYZE) already analyzed the table? If multiple ANALYZE don't kick-in and are taken care of, I don't see a reason to turn off auto-stats.

Any other concerns we need to take care?

--

Ashwin Agrawal (VMware)

Luis Filipe de Macedo

unread,

Dec 2, 2020, 8:16:49 AM12/2/20

to Ashwin Agrawal, gpdb...@greenplum.org, Junfeng Yang, Louis Mugnano (Pivotal)

Ashwin,

You explanation made it more clear to me. My original understanding was that auto-stats would be deprecated in favor of auto-vacuum which is not true.

The scenario that I described will not happen as we will still be collecting some stats on bulk load operations AND improving/automating stats collection on mini-batch/streaming scenarios.

Thanks,

Luis F R Macedo

Advisory Data Engineer & Business Development for Latam

VMware Tanzu Data

Call Me @ +55 11 97616-6438

Take care of the customers and the rest takes care of itself

Hubert Zhang

unread,

Dec 8, 2020, 7:04:05 AM12/8/20

to Luis Filipe de Macedo, Ashwin Agrawal, gpdb...@greenplum.org, Junfeng Yang, Louis Mugnano (Pivotal)

Hi Luis and Ashwin,

On Tue, Dec 2, 2020 at 6:18 AM Ashwin Agrawal wrote:

On Tue, Dec 1, 2020 at 6:18 AM Luis Macedo <lma...@pivotal.io> wrote:

Guys, there is an important fact that we should consider.

When running an ETL job with many steps using work tables, the users expect auto-stats to kick in and collect stats before the work table is used on the next step. If we move to a async process we will loose that characteristic witch might make many ETL processes perform much worse due to not having any stats collected.

How the new auto-analyze will impact that scenario? Is my understanding correct?

Thanks Luis for the participation and comment. This (gp_autostats_mode should be touched or not) is the main aspect we have been debating for this feature. My past proposal aligns with what you mentioned. Auto-analyze should be an additive feature which helps for use cases we don't cover currently but shouldn't take away anything (or only some minimal stuff if really required too).

So, just to put it in a simplified way, consider for OLAP (1 million tuple inserted in one shot) current auto-stats helps and for OLTP (1 million singleton inserts are made) auto-analyze will help.

Since auto-analyze is a feature from PostgreSQL, I think it's better to keep the auto-analyze behavior the same as upstream. auto-stats may help in case "1 million tuples inserted in one shot" as Ashwin mentioned and I agree there are some customer side ETL/ELT scripts which depends on auto-stats feature. But this feature is Greenplum specific and has many issues, e.g. for the subsequent insert, auto_stats will not be triggered in ON_NO_STATS mode. I don't think all the customers are aware of it. It's safer to run analyze manually in these ETL scripts, which is also the best practice in PostgreSQL as well.

For a long term, I prefer Greenplum keep the same behavior and practice as PostgreSQL related to the topic of `analyze`. Auto-stats should be disabled and thus user should run analyze manually if they want the statistics immediately just like upstream. Meanwhile, when auto-analyze is disabled, we should keep the current behavior, which keeps the auto-stats feature.

So how about setting auto-analyze as disabled by default? When user want to enable auto-analyze, they should also follow the practice to add manual analyze in their scripts if necessary.

Hubert

From: Luis Filipe de Macedo <mac...@vmware.com>
Sent: Wednesday, December 2, 2020 9:16 PM
To: Ashwin Agrawal <ashwi...@gmail.com>
Cc: gpdb...@greenplum.org <gpdb...@greenplum.org>; Junfeng Yang <yje...@vmware.com>; Louis Mugnano (Pivotal) <lmug...@pivotal.io>
Subject: Re: Plan to remove auto_stats code from Greenplum Master/7.

Luis Filipe de Macedo

unread,

Dec 8, 2020, 8:32:10 AM12/8/20

to Hubert Zhang, Ashwin Agrawal, gpdb...@greenplum.org, Junfeng Yang, Louis Mugnano (Pivotal)

Hubert,

Thanks for your thoughts. We could think of integrating the two features and slowly migrating to the auto-analyze (PG). We could (maybe) deprecate the ON_CHANGE in favor of the auto-analyze (PG) and (maybe) change auto-stats (GP) to force an auto-analyze(PG) synchronously on ON_NO_STATS (not sure about the viability).

We can also add deprecation notices on the docs and keep the feature for 2 major versions before removing it.

I also prefer to run analyze on the ETL but now this ship has sailed and many customers are already depending on the auto-stats.

But your proposal to leave auto-analyze(PG) off by default sounds reasonable as it will ease in the users.

The topic of autonomous database should be considered by engineering in high priority IMHO. The easier to use GPDB is, the more users GPDB will have.

PS. All my comments are based on my view as GPDB user. I have very little visibility of the code itself.

Rgds,

Ashwin Agrawal

unread,

Dec 8, 2020, 12:48:09 PM12/8/20

to Hubert Zhang, Luis Filipe de Macedo, gpdb...@greenplum.org, Junfeng Yang, Louis Mugnano (Pivotal)

On Tue, Dec 8, 2020 at 4:04 AM Hubert Zhang <zhu...@vmware.com> wrote:

Hi Luis and Ashwin,

On Tue, Dec 2, 2020 at 6:18 AM Ashwin Agrawal wrote:

On Tue, Dec 1, 2020 at 6:18 AM Luis Macedo <lma...@pivotal.io> wrote:

Guys, there is an important fact that we should consider.

When running an ETL job with many steps using work tables, the users expect auto-stats to kick in and collect stats before the work table is used on the next step. If we move to a async process we will loose that characteristic witch might make many ETL processes perform much worse due to not having any stats collected.

How the new auto-analyze will impact that scenario? Is my understanding correct?

Thanks Luis for the participation and comment. This (gp_autostats_mode should be touched or not) is the main aspect we have been debating for this feature. My past proposal aligns with what you mentioned. Auto-analyze should be an additive feature which helps for use cases we don't cover currently but shouldn't take away anything (or only some minimal stuff if really required too).

So, just to put it in a simplified way, consider for OLAP (1 million tuple inserted in one shot) current auto-stats helps and for OLTP (1 million singleton inserts are made) auto-analyze will help.

Since auto-analyze is a feature from PostgreSQL, I think it's better to keep the auto-analyze behavior the same as upstream. auto-stats may help in case "1 million tuples inserted in one shot" as Ashwin mentioned and I agree there are some customer side ETL/ELT scripts which depends on auto-stats feature. But this feature is Greenplum specific and has many issues, e.g. for the subsequent insert, auto_stats will not be triggered in ON_NO_STATS mode. I don't think all the customers are aware of it. It's safer to run analyze manually in these ETL scripts, which is also the best practice in PostgreSQL as well.

It's easier said than done. Easily able to migrate customers to newer versions without requiring application modification is a strongly desired aspect and as much as possible we should retain it unless really can't. So many scripts and work-flows would be relying on this behavior we can't just take it away without giving chance to adopt newer behavior.

Subsequent insert, auto_stats will not be triggered in ON_NO_STATS mode and users might not be aware of it, agree. But how does that merit take it away from users who are aware of it and changed it to ON_CHANGE and relying on sync stats (and not async stats) behavior.

For a long term, I prefer Greenplum keep the same behavior and practice as PostgreSQL related to the topic of `analyze`. Auto-stats should be disabled and thus user should run analyze manually if they want the statistics immediately just like upstream. Meanwhile, when auto-analyze is disabled, we should keep the current behavior, which keeps the auto-stats feature.

I agree and always vote to keep Greenplum behavior in sync with upstream. Though we need to be cognizant of current GPDB users. Like for partitioning also we would like to only use upstream syntax, but for some upcoming major versions need to provide room to make changes to scripts and work-flow to adopt new behavior before forcing it to new.

So how about setting auto-analyze as disabled by default? When user want to enable auto-analyze, they should also follow the practice to add manual analyze in their scripts if necessary.

Shipping new features disabled by default sounds like a half baked feature. If we feel the feature is useful we should have it enabled by default. I am missing to understand what benefit it provides by default disabling the feature.

I would really like to hear *technical* concerns for the problems with keeping both the features enabled. If there are serious issues or complications that would help to decide. If none then why turn one off.

--

Ashwin Agrawal (VMware)

Luis Filipe de Macedo

unread,

Dec 8, 2020, 1:22:49 PM12/8/20

to Ashwin Agrawal, Hubert Zhang, gpdb...@greenplum.org, Junfeng Yang, Louis Mugnano (Pivotal)

Ashwin,

My main concern of keeping both features on would be waste of resources. Let’s say I bulk loaded a tables which was analyzed by auto-stats. Then comes auto-analyze few moments after and analyze the table again. And worse its spending CPU cycles on an operation that is not easily visible by the DBA.

Another concern is the same concern we had when it was discussed to turn auto-vaccum on. There will be resource consumption that the DBA will not be able to fully control and might hit some SLAs.

When looking at OLTP, usually tables are not that big, but on the OLAP use case, we might have monster tables that could end up been a hit on the system.

Rgds,

Luis F R Macedo

Advisory Data Engineer & Business Development for Latam

VMware Tanzu Data

Call Me @ +55 11 97616-6438

Take care of the customers and the rest takes care of itself

De: Ashwin Agrawal <ashwi...@gmail.com>
Data: terça-feira, 8 de dezembro de 2020 14:48
Para: Hubert Zhang <zhu...@vmware.com>
Cc: Luis Filipe de Macedo <mac...@vmware.com>, gpdb...@greenplum.org <gpdb...@greenplum.org>, Junfeng Yang <yje...@vmware.com>, Louis Mugnano (Pivotal) <lmug...@pivotal.io>
Assunto: Re: Plan to remove auto_stats code from Greenplum Master/7.

Junfeng Yang

unread,

Dec 8, 2020, 9:12:40 PM12/8/20

to Luis Filipe de Macedo, Ashwin Agrawal, Hubert Zhang, gpdb...@greenplum.org, Louis Mugnano (Pivotal)

Hi guys,

Thanks for all you feedbacks.

I want to clarify that the auto-analyze only runs on the tables that have changes(since pervious analyze) exceeds a computed threshold.

The reason that we enable auto-analyze to replace autostats is because autostats has some issues as I mentioned at the beginning.

And autostats can not guarantee us an accurate stats without user manually execute ANALYZE periodically.

As we discussed, auto-analyze may also bring us some side-effects. So if we combine autostats and auto-analyze in same time. We just double the issues.

Combine them both didn't resolve the issues in autostats which we plan to resolve at beginning.

Although I talk with the contributor who open the first enable auto-analyze PR, they already deploy the auto-ANALYZE on their cluster and works fine.

If we still concern about the single ANALYZE modification in user's ETL tools, let's just document the auto-ANALYZE feature and let user decide whether

to enable it. So if user choose to enable it, they need to aware the difference and make changes base on it.

Regards,

Junfeng

发件人: Luis Filipe de Macedo <mac...@vmware.com>
发送时间: 2020年12月9日 2:22
收件人: Ashwin Agrawal <ashwi...@gmail.com>; Hubert Zhang <zhu...@vmware.com>
抄送: gpdb...@greenplum.org <gpdb...@greenplum.org>; Junfeng Yang <yje...@vmware.com>; Louis Mugnano (Pivotal) <lmug...@pivotal.io>
主题: Re: Plan to remove auto_stats code from Greenplum Master/7.

Ashwin Agrawal

unread,

Dec 15, 2020, 9:27:19 PM12/15/20

to Junfeng Yang, Luis Filipe de Macedo, Hubert Zhang, gpdb...@greenplum.org, Louis Mugnano (Pivotal)

We seeked lot more input from the field and discussed offline with various stake-holders to help bake this feature. Posting summary and decisions we have reached based on those discussions and inputs.

Auto-analyze and auto-stats will be treated as separate features. Capability to enable and disable each will be separately handled. Enabling one won't disable the other. Main reason being these two features work at different granularities. Auto-stats can be customized at table, session, database, user levels whereas auto-analyze is system level property. Modifying the value of one GUC (autovacuum) should not affect the value of another GUC (gp_autostats_mode) in the system.

Documentation would provide recommendations to not have both features enabled at same time. It's understood ideally enabling both the features at same time doesn't pose any risk to the system. Some test coverage will be added to test both features enabled scenarios (unless some major hurdle encountered to add the test).

Auto-analyze will be default enabled for GPDB7.

Auto-stats (gp_autostats_mode) feature with current default value only runs for NO_STATS case and is not very well used in the field. Hence, will be default disabled for GPDB7 release. With deprecation warning for possible removal of feature by GPDB8.