Plan to remove auto_stats code from Greenplum Master/7.

170 views
Skip to first unread message

Junfeng Yang

unread,
Oct 28, 2020, 10:39:31 PM10/28/20
to gpdb...@greenplum.org
Hi guys,

Currently we enabled autovaccum to run the auto-ANALYZE across databases on Master branch with these two PR:

Maybe now it's the time to remove the auto_stats logic from Greenplum.

The simple compration between auto_stats and auto-ANALYZE.

auto_stats auto-ANALYZE
Targets Relations except views and root/mid-layer partition table. Same with auto_stats. But it will support execute on root/mid-layer partition tables once pg enable them.
There's a thread to do this.
https://www.postgresql.org/message-id/CAKkQ508_PwVgwJyBY=0Lmkz90j8CmWNPUx...@mail.gmail.com
Execution time Could be execute in the same transaction for evey statement that may change the tables' stats. But the default settings is to skip analyze when there's statistics for the relation. For each database, autovacuum is triggered every autovacuum_naptime.
Behavior With in the same transaction, execute ANALYZE for the relation. Try to expand analyze on parent partition table in current transaction if current table is leaf partition.
This may bring dead lock issue which discussed in: https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/gqordopb6Gg/m/bMFTA2-MCQAJ
Executed in seperate process, only execute ANALYZE on the relations that reach the autovacuum_analyze_threshold. Never expand ANALYZE on parent partition tables.
Neve ANALYZE on root/mid-layer partition table for now, but will be supported, see the link above.
Others With default auto_stats settings, user still need to schedule ANALYZE on all tables since it only execute ANALYZE if the table is never analyzed before. User need to schedule ANALYZE on root/mid-layer partition table for now.

For the performance.
Test loading the data into several tables on gcp with auto_stats or with auto-ANALYZE enabled.

The gcp cluster contains 1 master host and 4 segment hosts.
For each segment hosts, it contains 8 primary segments and 8 mirror segments.
Each host has the same configuration (24 vCPUs, 48 GB memory)

The total data generated for loading is 960GB.
Since I don't find any existing benchmark to evaluate the backgroud write impact for Greenplum,
so I decide to modify the TPC-DS's data loading script. And this is not accurate.
Overall, when enable auto-ANALYZE, the data loading will have around 8% decline. But note here the
loading is single process, if we run loading concurrently, ANALYZE through auto_stats should also have impact
for other processes.

Also the auto_stats and auto-ANALYZE are all using default settings.
Here are the detail results:

with auto_stats enabled
with auto-ANALYZE enabled
table_name      tuples  seconds
call_center     42      0.366
catalog_page    30000   0.334
catalog_returns 143996756       94.94226
catalog_sales   1439980416      850.850487
customer        12000000        4.466
customer_address        6000000 2.2282
customer_demographics   1920800 0.356
date_dim        73049   0.402
household_demographics  7200    0.091
income_band     20      0.078
inventory       783000000       36.36603
item    300000  1.1804
promotion       1500    0.121
reason  65      0.323
ship_mode       20      0.077
store   1002    0.137
store_returns   287999764       153.153594
store_sales     2879987999      1123.112331
time_dim        86400   0.373
warehouse       20      0.094
web_page        3000    0.106
web_returns     71997522        28.28437
web_sales       720000376       418.418055
web_site        54      0.136
table_name      tuples  seconds
call_center     42      0.157
catalog_page    30000   0.168
catalog_returns 143996756       111.11141
catalog_sales   1439980416      930.930319
customer        12000000        4.4676
customer_address        6000000 2.2321
customer_demographics   1920800 0.385
date_dim        73049   0.388
household_demographics  7200    0.092
income_band     20      0.074
inventory       783000000       38.38896
item    300000  1.1964
promotion       1500    0.14
reason  65      0.075
ship_mode       20      0.328
store   1002    0.167
store_returns   287999764       187.187212
store_sales     2879987999      1208.120863
time_dim        86400   0.364
warehouse       20      0.104
web_page        3000    0.125
web_returns     71997522        30.30341
web_sales       720000376       434.434357
web_site        54      0.145

I think replace the auto_stats with auto-ANALYZE is workable.
To move forward, the next steps are:
  • Enable auto-ANALYZE in our concourse pipeline test.
  • Remove the auto_stats related code and set auto-ANALYZE enable as default.
Do you have any concerns or any suggestions?

Regards,
Junfeng

Ashwin Agrawal

unread,
Oct 29, 2020, 8:21:46 PM10/29/20
to Junfeng Yang, gpdb...@greenplum.org
On Wed, Oct 28, 2020 at 7:39 PM Junfeng Yang <yje...@vmware.com> wrote:
Hi guys,

Currently we enabled autovaccum to run the auto-ANALYZE across databases on Master branch with these two PR:

Maybe now it's the time to remove the auto_stats logic from Greenplum.

Please can you elaborate on:
  • What's the code/functional complexity of keeping both for GPDB7?
  • Will there be maintenance overhead (like tests, incorporating with newer features, etc..) with keeping both the implementations?
  • How much approximate lines of code removal will it be?
  • What's our plan for gp_autostats_mode_in_functions GUC?
  • gp_autostats_mode has 3 choices OFF, ON_CHANGE, ON_NO_STATS. With newer implementation we only have OFF and ON_CHANGE. Plus, will lose session level capability to control things or selectively ignore tables from auto_stats. I don't know why default is ON_NO_STATS compared to ON_CHANGE. Did we look into history for a reason? Or validate if this GUC is mostly flipped by most users to ON_CHANGE?
  • Are we going to leverage `autovacuum_enabled` reloption to provide flexibility for ignoring tables for auto-analyze?
  • Are temporary tables included or excluded from auto-analyze?
My current thinking is if it's not complex to keep both the
implementations (testing and other maintenance overhead is not high to
keep the auto-analyze), we should not remove that piece of
functionality before receiving feedback on how the newer functionality
is behaving in the field. Given its already GUC controlled can be
easily turned off/on as required and feedback received. Before getting
too destructive. At least till the beta release phase we can keep it
and based on beta feedback maybe decide. I feel the two
functionalities are fairly independent and don't interfere with each
other. Actually, auto-stats can kind of help auto-analyze to do the
job and reduce deferred load :-)

We should definitely enable auto-analyze by default.
Maybe make auto-stats default NONE instead of ON_NO_STATS.

--
Ashwin Agrawal (VMware)

Junfeng Yang

unread,
Nov 2, 2020, 12:51:27 AM11/2/20
to Ashwin Agrawal, gpdb...@greenplum.org
Hi Ashwin,

    • What's the code/functional complexity of keeping both for GPDB7?
      • The logic of auto_stats is quite simple, just put the function at the place when needed and call `ExecVacuum`. There shouldn't be any complexity for keeping both. And when enable auto-ANALYZE, the auto_stats will be skipped.
    • Will there be maintenance overhead (like tests, incorporating with newer features, etc..) with keeping both the implementations?
      • We do have some tests. The overhead should not be so large.
    • How much approximate lines of code removal will it be?
      • Around 1000 lines.
    • What's our plan for gp_autostats_mode_in_functions GUC?
      • We don't need to consider functions separately for auto-ANALYZE. The auto-ANALYZE is triggered based on pgstat table stats which counted through the relations' access method.
      • For each relation, we can control whether enable auto-ANALYZE by setting it's storage parameter.
    • gp_autostats_mode has 3 choices OFF, ON_CHANGE, ON_NO_STATS. With newer implementation we only have OFF and ON_CHANGE. Plus, will lose session level capability to control things or selectively ignore tables from auto_stats. I don't know why default is ON_NO_STATS compared to ON_CHANGE. Did we look into history for a reason? Or validate if this GUC is mostly flipped by most users to ON_CHANGE?
      • From my point of view, ON_CHANGE will add lot of overhead when we doing a simple one row insert/update/delete since it'll trigger ANALYZE all the time(And for the auto-ANALYZE, the ANALYZE only triggered when the tuple change reach the threshold). And from the current code, we still have 3 choices. Where is the newer implementation?
    • Are we going to leverage `autovacuum_enabled` reloption to provide flexibility for ignoring tables for auto-analyze?
      • Yes, this part has no different from upstream.
    • Are temporary tables included or excluded from auto-analyze?
      • auto-ANALYZE never execute ANALYZE on temporary tables.

    And I think you concern make sense. The reason I want to remove it is that I think if auto-ANALYZE works well, so if auto_stats never used, it will be dead code.
    And auto_stats add overhead for each statement and have some issues.
    1. May cause one phase commit to two phase. For example insert only to a single segment, but trigger auto-ANALYZE.
    2. The dead lock issue mentioned before.
    Wait until beta release for feedback is a good solution I think.
    So we could enable the auto-ANALYZE as default at first to find potential issues.

    Regards,
    Junfeng

    Ashwin Agrawal

    unread,
    Nov 2, 2020, 2:09:38 PM11/2/20
    to Junfeng Yang, gpdb...@greenplum.org
    On Sun, Nov 1, 2020 at 9:51 PM Junfeng Yang <yje...@vmware.com> wrote:
    • What's the code/functional complexity of keeping both for GPDB7?
      • The logic of auto_stats is quite simple, just put the function at the place when needed and call `ExecVacuum`. There shouldn't be any complexity for keeping both. And when enable auto-ANALYZE, the auto_stats will be skipped.
    That's good to know. I am still missing why auto-stats will be skipped if auto-analyze is enabled? It will purely depend on (independently) the value of gp_autostats_mode GUC still, right?

    • Will there be maintenance overhead (like tests, incorporating with newer features, etc..) with keeping both the implementations?
      • We do have some tests. The overhead should not be so large.
    Okay that's awesome. 
    • How much approximate lines of code removal will it be?
      • Around 1000 lines.
    Okay. 
    • What's our plan for gp_autostats_mode_in_functions GUC?
      • We don't need to consider functions separately for auto-ANALYZE. The auto-ANALYZE is triggered based on pgstat table stats which counted through the relations' access method.
      • For each relation, we can control whether enable auto-ANALYZE by setting it's storage parameter.
    I understand auto-analyze is triggered irrespective. Though GPDB provided this GUC separate from gp_autostats_mode means there have been usecases where users wished to have stats computed while the function is executing and not in delayed fashion.
    So, in that case, auto-analyzing it in a delayed fashion I don't is helpful. Hence, need to understand the impact of removing this feature as well along with auto-stats whenever we do it in future.
     
    • gp_autostats_mode has 3 choices OFF, ON_CHANGE, ON_NO_STATS. With newer implementation we only have OFF and ON_CHANGE. Plus, will lose session level capability to control things or selectively ignore tables from auto_stats. I don't know why default is ON_NO_STATS compared to ON_CHANGE. Did we look into history for a reason? Or validate if this GUC is mostly flipped by most users to ON_CHANGE?
      • From my point of view, ON_CHANGE will add lot of overhead when we doing a simple one row insert/update/delete since it'll trigger ANALYZE all the time(And for the auto-ANALYZE, the ANALYZE only triggered when the tuple change reach the threshold). And from the current code, we still have 3 choices. Where is the newer implementation?
    Well, auto-stats ON_CHANGE doesn't invoke analyze on every change either.
    GUC  gp_autostats_on_change_threshold controls the threshold for the same similar to auto-analyze. So, it's threshold based as well. I don't think currently it works for singleton inserts or updates.

    By newer implementation I meant auto-analyze.

    • Are we going to leverage `autovacuum_enabled` reloption to provide flexibility for ignoring tables for auto-analyze?
      • Yes, this part has no different from upstream.
    • Are temporary tables included or excluded from auto-analyze?
      • auto-ANALYZE never execute ANALYZE on temporary tables.
    So, maybe for temp tables auto-stats is still required. Though one can always argue users should add analyze if required for the temp tables. But again that would be regression from current behavior.

    And I think you concern make sense. The reason I want to remove it is that I think if auto-ANALYZE works well, so if auto_stats never used, it will be dead code.
    And auto_stats add overhead for each statement and have some issues.
    1. May cause one phase commit to two phase. For example insert only to a single segment, but trigger auto-ANALYZE.
    2. The dead lock issue mentioned before.
    Wait until beta release for feedback is a good solution I think.
    So we could enable the auto-ANALYZE as default at first to find potential issues.

    I highly appreciate the thought about not leaving behind dead code and functionality. Seems we are in agreement that it's not dead yet and useful for certain situations.
    Plus, good to let customers slowly migrate / get accustomed to newer functionality and then once really dead we can rip it out.

    Thank You for driving this important feature forward.


    --
    Ashwin Agrawal (VMware)

    Junfeng Yang

    unread,
    Nov 3, 2020, 3:32:26 AM11/3/20
    to Ashwin Agrawal, gpdb...@greenplum.org
    Hi Ashwin,

    That's good to know. I am still missing why auto-stats will be skipped if auto-analyze is enabled? It will purely depend on (independently) the value of gp_autostats_mode GUC still, right?
    I don't see any need to keep both ANALYZE auto trigger method in same time. Once `AutoVacuumingActive()` returns true, there's no need to run the auto_stats. Since it'll trigger the auto-ANALYZE.

    I understand auto-analyze is triggered irrespective. Though GPDB provided this GUC separate from gp_autostats_mode means there have been usecases where users wished to have stats computed while the function is executing and not in delayed fashion.
    So, in that case, auto-analyzing it in a delayed fashion I don't is helpful. Hence, need to understand the impact of removing this feature as well along with auto-stats whenever we do it in future.
    After I dig some history issues. It seems the auto_stats is not works for functions at first. So to resolve this, bring in another GUC and add auto_stats into functions.c. But not sure why we bring the new GUC.
    It actually could make the gp_autostats_mode works for functions too. Cause from the issue, user expects the gp_autostats_mode  works for functions.

    GUC  gp_autostats_on_change_threshold controls the threshold for the same similar to auto-analyze.
    We should note here the `gp_autostats_on_change_threshold ` only get evaulated in single statement. If we insert 1 tuple million times which never reach the threshold, auto_stats is skipped. And still need user to do
    the manual ANALYZE. But auto-ANALYZE do count these changes.

    So, maybe for temp tables auto-stats is still required. Though one can always argue users should add analyze if required for the temp tables. But again that would be regression from current behavior.
    Yes, I agree. But when we dealing the upstream merge, we discard lots of GPDB own behaviros, and we see these issue reports occasionally. So what's the judgment criteria to help us make decision?

    Let's keep them both for now, and this is the best solution at this stage I think.
    And thanks a lot Ashwin! Your comments drive me think futher.

    Regards,
    Junfeng

    Junfeng Yang

    unread,
    Nov 26, 2020, 9:02:11 PM11/26/20
    to Ashwin Agrawal, gpdb...@greenplum.org
    Hi Ashwin,

    Since there still some concerns for enable auto-ANALYZE as default. So let's just enable auto-ANALYZE in pipeline test.
    And user could enable auto-ANALYZE if they want by setting `set autovacuum=on` on coordinator.

    So for the PR https://github.com/greenplum-db/gpdb/pull/11142, I'll only enable auto-ANALYZE in pipeline test,
    and fix related test cases.
    Does this sounds good for you?

    Regards,
    Junfeng

    发件人: Junfeng Yang <yje...@vmware.com>
    发送时间: 2020年11月3日 16:32
    收件人: Ashwin Agrawal <ashwi...@gmail.com>
    抄送: gpdb...@greenplum.org <gpdb...@greenplum.org>
    主题: 回复: Plan to remove auto_stats code from Greenplum Master/7.
     

    Ashwin Agrawal

    unread,
    Nov 30, 2020, 7:14:23 PM11/30/20
    to Junfeng Yang, gpdb...@greenplum.org
    On Thu, Nov 26, 2020 at 6:02 PM Junfeng Yang <yje...@vmware.com> wrote:
    Hi Ashwin,

    Since there still some concerns for enable auto-ANALYZE as default. So let's just enable auto-ANALYZE in pipeline test.
    And user could enable auto-ANALYZE if they want by setting `set autovacuum=on` on coordinator.

    So for the PR https://github.com/greenplum-db/gpdb/pull/11142, I'll only enable auto-ANALYZE in pipeline test,
    and fix related test cases.
    Does this sounds good for you?

    No, let's not go back on this. You have done excellent work. Leaving it to users to enable is not good. Let's continue discussing how to bake the feature and deliver it fully enabled by default.

    I don't think we have too many things (concerns) to be discussed in the PR. Due to vacations and other stuff, I wasn't able to reply to PR comments. Will try to do it this week. Also, I would highly appreciate, if more members of the team can jump in, share thoughts and help to get this important feature to proper completion.

    --
    Ashwin Agrawal (VMware)

    Luis Macedo

    unread,
    Dec 1, 2020, 9:18:50 AM12/1/20
    to Greenplum Developers, Ashwin Agrawal, gpdb...@greenplum.org, Junfeng Yang, Louis Mugnano
    Guys, there is an important fact that we should consider. 

    When running an ETL job with many steps using work tables, the users expect auto-stats to kick in and collect stats before the work table is used on the next step. If we move to a async process we will loose that characteristic witch might make many ETL processes perform much worse due to not having any stats collected.

    How the new auto-analyze will impact that scenario? Is my understanding correct?


    Rgds,

    Luis Macedo

    Ashwin Agrawal

    unread,
    Dec 1, 2020, 5:26:11 PM12/1/20
    to Luis Macedo, gpdb...@greenplum.org, Junfeng Yang, Louis Mugnano
    On Tue, Dec 1, 2020 at 6:18 AM Luis Macedo <lma...@pivotal.io> wrote:
    Guys, there is an important fact that we should consider. 

    When running an ETL job with many steps using work tables, the users expect auto-stats to kick in and collect stats before the work table is used on the next step. If we move to a async process we will loose that characteristic witch might make many ETL processes perform much worse due to not having any stats collected.

    How the new auto-analyze will impact that scenario? Is my understanding correct?

    Thanks Luis for the participation and comment. This (gp_autostats_mode should be touched or not) is the main aspect we have been debating for this feature. My past proposal aligns with what you mentioned. Auto-analyze should be an additive feature which helps for use cases we don't cover currently but shouldn't take away anything (or only some minimal stuff if really required too).

    So, just to put it in a simplified way, consider for OLAP (1 million tuple inserted in one shot) current auto-stats helps and for OLTP (1 million singleton inserts are made) auto-analyze will help.

    The only argument where it confuses things based on PR discussion so far with Junfeng:
    > If auto-ANALYZE enabled, it should use auto-ANALYZE as Postgres does, there no need to turn off gp_autostats_mode . And shouldn't make then both work for below reason:

    > A cluster with autovacuum=on, and create a table with autovacuum_enabled = false.
    > As a user, I would expect the table will always have empty stats unless I manually run ANALYZE.
    > But if we still run autostats with autovacuum=on. The table will have stats.
    > So we should not run autostats with autovacuum=on.

    First I don't think that's so much of usecase to worry about. Secondly with autovacuum_enabled = false, only aspect user wishes to convey is don't trigger autovacuum in background and spend cycle on this table. I don't think it has any aspect to convey, don't want stats for this table, as why wouldn't anyone not want it. So, this usecase I won't worry at all about.

    Aspect I worry about, will ANALYZE get triggered multiple times for the table if both auto-stats are ON and auto-analyze is ON? Is there something in the system which prevents auto-analyze from kicking in if auto-stats (or manual ANALYZE) already analyzed the table? If multiple ANALYZE don't kick-in and are taken care of, I don't see a reason to turn off auto-stats.

    Any other concerns we need to take care?

    -- 
    Ashwin Agrawal (VMware)

    Luis Filipe de Macedo

    unread,
    Dec 2, 2020, 8:16:49 AM12/2/20
    to Ashwin Agrawal, gpdb...@greenplum.org, Junfeng Yang, Louis Mugnano (Pivotal)

    Ashwin,

     

    You explanation made it more clear to me. My original understanding was that auto-stats would be deprecated in favor of auto-vacuum which is not true.

     

    The scenario that I described will not happen as we will still be collecting some stats on bulk load operations AND improving/automating stats collection on mini-batch/streaming scenarios.

     

     

    Thanks,

     

    Luis F R Macedo

    Advisory Data Engineer & Business Development for Latam

    VMware Tanzu Data

    Call Me @ +55 11 97616-6438

    Take care of the customers and the rest takes care of itself

    Hubert Zhang

    unread,
    Dec 8, 2020, 7:04:05 AM12/8/20
    to Luis Filipe de Macedo, Ashwin Agrawal, gpdb...@greenplum.org, Junfeng Yang, Louis Mugnano (Pivotal)
    Hi Luis and Ashwin,

    On Tue, Dec 2, 2020 at 6:18 AM  Ashwin Agrawal wrote:

    On Tue, Dec 1, 2020 at 6:18 AM Luis Macedo <lma...@pivotal.io> wrote:
    Guys, there is an important fact that we should consider. 

    When running an ETL job with many steps using work tables, the users expect auto-stats to kick in and collect stats before the work table is used on the next step. If we move to a async process we will loose that characteristic witch might make many ETL processes perform much worse due to not having any stats collected.

    How the new auto-analyze will impact that scenario? Is my understanding correct?

    Thanks Luis for the participation and comment. This (gp_autostats_mode should be touched or not) is the main aspect we have been debating for this feature. My past proposal aligns with what you mentioned. Auto-analyze should be an additive feature which helps for use cases we don't cover currently but shouldn't take away anything (or only some minimal stuff if really required too).

    So, just to put it in a simplified way, consider for OLAP (1 million tuple inserted in one shot) current auto-stats helps and for OLTP (1 million singleton inserts are made) auto-analyze will help.

    Since auto-analyze is a feature from PostgreSQL, I think it's better to keep the auto-analyze behavior the same as upstream. auto-stats may help in case "1 million tuples inserted in one shot" as Ashwin mentioned and I agree there are some customer side ETL/ELT scripts which depends on auto-stats feature. But this feature is Greenplum specific and has many issues, e.g. for the subsequent insert, auto_stats will not be triggered in ON_NO_STATS mode. I don't think all the customers are aware of it. It's safer to run analyze manually in these ETL scripts, which is also the best practice in PostgreSQL as well.

    For a long term, I prefer Greenplum keep the same behavior and practice as PostgreSQL related to the topic of `analyze`. Auto-stats should be disabled and thus user should run analyze manually if they want the statistics immediately just like upstream. Meanwhile, when auto-analyze is disabled, we should keep the current behavior, which keeps the auto-stats feature.

    So how about setting auto-analyze as disabled by default? When user want to enable auto-analyze, they should also follow the practice to add manual analyze in their scripts if necessary. 


    Hubert

    From: Luis Filipe de Macedo <mac...@vmware.com>
    Sent: Wednesday, December 2, 2020 9:16 PM
    To: Ashwin Agrawal <ashwi...@gmail.com>
    Cc: gpdb...@greenplum.org <gpdb...@greenplum.org>; Junfeng Yang <yje...@vmware.com>; Louis Mugnano (Pivotal) <lmug...@pivotal.io>
    Subject: Re: Plan to remove auto_stats code from Greenplum Master/7.
     

    Luis Filipe de Macedo

    unread,
    Dec 8, 2020, 8:32:10 AM12/8/20
    to Hubert Zhang, Ashwin Agrawal, gpdb...@greenplum.org, Junfeng Yang, Louis Mugnano (Pivotal)

    Hubert,

     

    Thanks for your thoughts. We could think of integrating the two features and slowly migrating to the auto-analyze (PG).  We could (maybe) deprecate the ON_CHANGE in favor of the auto-analyze (PG) and (maybe) change auto-stats (GP) to force an auto-analyze(PG) synchronously on ON_NO_STATS (not sure about the viability).

     

    We can also add deprecation notices on the docs and keep the feature for 2 major versions before removing it.

     

    I also prefer to run analyze on the ETL but now this ship has sailed and many customers are already depending on the auto-stats.

     

    But your proposal to leave auto-analyze(PG) off by default sounds reasonable as it will ease in the users.

     

    The topic of autonomous database should be considered by engineering in high priority IMHO. The easier to use GPDB is, the more users GPDB will have.

     

     

    PS. All my comments are based on my view as GPDB user. I have very little visibility of the code itself.

     

     

    Rgds,

    Ashwin Agrawal

    unread,
    Dec 8, 2020, 12:48:09 PM12/8/20
    to Hubert Zhang, Luis Filipe de Macedo, gpdb...@greenplum.org, Junfeng Yang, Louis Mugnano (Pivotal)
    On Tue, Dec 8, 2020 at 4:04 AM Hubert Zhang <zhu...@vmware.com> wrote:
    Hi Luis and Ashwin,

    On Tue, Dec 2, 2020 at 6:18 AM  Ashwin Agrawal wrote:

    On Tue, Dec 1, 2020 at 6:18 AM Luis Macedo <lma...@pivotal.io> wrote:
    Guys, there is an important fact that we should consider. 

    When running an ETL job with many steps using work tables, the users expect auto-stats to kick in and collect stats before the work table is used on the next step. If we move to a async process we will loose that characteristic witch might make many ETL processes perform much worse due to not having any stats collected.

    How the new auto-analyze will impact that scenario? Is my understanding correct?

    Thanks Luis for the participation and comment. This (gp_autostats_mode should be touched or not) is the main aspect we have been debating for this feature. My past proposal aligns with what you mentioned. Auto-analyze should be an additive feature which helps for use cases we don't cover currently but shouldn't take away anything (or only some minimal stuff if really required too).

    So, just to put it in a simplified way, consider for OLAP (1 million tuple inserted in one shot) current auto-stats helps and for OLTP (1 million singleton inserts are made) auto-analyze will help.

    Since auto-analyze is a feature from PostgreSQL, I think it's better to keep the auto-analyze behavior the same as upstream. auto-stats may help in case "1 million tuples inserted in one shot" as Ashwin mentioned and I agree there are some customer side ETL/ELT scripts which depends on auto-stats feature. But this feature is Greenplum specific and has many issues, e.g. for the subsequent insert, auto_stats will not be triggered in ON_NO_STATS mode. I don't think all the customers are aware of it. It's safer to run analyze manually in these ETL scripts, which is also the best practice in PostgreSQL as well.

    It's easier said than done. Easily able to migrate customers to newer versions without requiring application modification is a strongly desired aspect and as much as possible we should retain it unless really can't. So many scripts and work-flows would be relying on this behavior we can't just take it away without giving chance to adopt newer behavior.

    Subsequent insert, auto_stats will not be triggered in ON_NO_STATS mode and users might not be aware of it, agree. But how does that merit take it away from users who are aware of it and changed it to ON_CHANGE and relying on sync stats (and not async stats) behavior.

    For a long term, I prefer Greenplum keep the same behavior and practice as PostgreSQL related to the topic of `analyze`. Auto-stats should be disabled and thus user should run analyze manually if they want the statistics immediately just like upstream. Meanwhile, when auto-analyze is disabled, we should keep the current behavior, which keeps the auto-stats feature.

    I agree and always vote to keep Greenplum behavior in sync with upstream. Though we need to be cognizant of current GPDB users. Like for partitioning also we would like to only use upstream syntax, but for some upcoming major versions need to provide room to make changes to scripts and work-flow to adopt new behavior before forcing it to new.

    So how about setting auto-analyze as disabled by default? When user want to enable auto-analyze, they should also follow the practice to add manual analyze in their scripts if necessary.

    Shipping new features disabled by default sounds like a half baked feature. If we feel the feature is useful we should have it enabled by default. I am missing to understand what benefit it provides by default disabling the feature.

    I would really like to hear *technical* concerns for the problems with keeping both the features enabled. If there are serious issues or complications that would help to decide. If none then why turn one off.

    --
    Ashwin Agrawal (VMware)

    Luis Filipe de Macedo

    unread,
    Dec 8, 2020, 1:22:49 PM12/8/20
    to Ashwin Agrawal, Hubert Zhang, gpdb...@greenplum.org, Junfeng Yang, Louis Mugnano (Pivotal)

    Ashwin,

     

    My main concern of keeping both features on would be waste of resources. Let’s say I bulk loaded a tables which was analyzed by auto-stats. Then comes auto-analyze few moments after and analyze the table again. And worse its spending CPU cycles on an operation that is not easily visible by the DBA.

     

    Another concern is the same concern we had when it was discussed to turn auto-vaccum on. There will be resource consumption that the DBA will not be able to fully control and might hit some SLAs.

     

    When looking at OLTP, usually tables are not that big, but on the OLAP use case, we might have monster tables that could end up been a hit on the system.

     

     

    Rgds,

     

    Luis F R Macedo

    Advisory Data Engineer & Business Development for Latam

    VMware Tanzu Data

    Call Me @ +55 11 97616-6438

    Take care of the customers and the rest takes care of itself

     

     

    De: Ashwin Agrawal <ashwi...@gmail.com>
    Data: terça-feira, 8 de dezembro de 2020 14:48
    Para: Hubert Zhang <zhu...@vmware.com>
    Cc: Luis Filipe de Macedo <mac...@vmware.com>, gpdb...@greenplum.org <gpdb...@greenplum.org>, Junfeng Yang <yje...@vmware.com>, Louis Mugnano (Pivotal) <lmug...@pivotal.io>
    Assunto: Re: Plan to remove auto_stats code from Greenplum Master/7.

    Junfeng Yang

    unread,
    Dec 8, 2020, 9:12:40 PM12/8/20
    to Luis Filipe de Macedo, Ashwin Agrawal, Hubert Zhang, gpdb...@greenplum.org, Louis Mugnano (Pivotal)
    Hi guys,

    Thanks for all you feedbacks.

    I want to clarify that the auto-analyze only runs on the tables that have changes(since pervious analyze) exceeds a computed threshold.

    The reason that we enable auto-analyze to replace autostats is because autostats has some issues as I mentioned at the beginning.
    And autostats can not guarantee us an accurate stats without user manually execute ANALYZE periodically.

    As we discussed, auto-analyze may also bring us some side-effects. So if we combine autostats and auto-analyze in same time. We just double the issues.
    Combine them both didn't resolve the issues in autostats which we plan to resolve at beginning.
    Although I talk with the contributor who open the first enable auto-analyze PR, they already deploy the auto-ANALYZE on their cluster and works fine.

    If we still concern about the single ANALYZE modification in user's ETL tools, let's just document the auto-ANALYZE feature and let user decide whether
    to enable it. So if user choose to enable it, they need to aware the difference and make changes base on it.

    Regards,
    Junfeng

    发件人: Luis Filipe de Macedo <mac...@vmware.com>
    发送时间: 2020年12月9日 2:22
    收件人: Ashwin Agrawal <ashwi...@gmail.com>; Hubert Zhang <zhu...@vmware.com>
    抄送: gpdb...@greenplum.org <gpdb...@greenplum.org>; Junfeng Yang <yje...@vmware.com>; Louis Mugnano (Pivotal) <lmug...@pivotal.io>
    主题: Re: Plan to remove auto_stats code from Greenplum Master/7.
     

    Ashwin Agrawal

    unread,
    Dec 15, 2020, 9:27:19 PM12/15/20
    to Junfeng Yang, Luis Filipe de Macedo, Hubert Zhang, gpdb...@greenplum.org, Louis Mugnano (Pivotal)
    We seeked lot more input from the field and discussed offline with various stake-holders to help bake this feature. Posting summary and decisions we have reached based on those discussions and inputs.

    Auto-analyze and auto-stats will be treated as separate features. Capability to enable and disable each will be separately handled. Enabling one won't disable the other. Main reason being these two features work at different granularities. Auto-stats can be customized at table, session, database, user levels whereas auto-analyze is system level property. Modifying the value of one GUC (autovacuum) should not affect the value of another GUC (gp_autostats_mode) in the system.

    Documentation would provide recommendations to not have both features enabled at same time. It's understood ideally enabling both the features at same time doesn't pose any risk to the system. Some test coverage will be added to test both features enabled scenarios (unless some major hurdle encountered to add the test).

    Auto-analyze will be default enabled for GPDB7.

    Auto-stats (gp_autostats_mode) feature with current default value only runs for NO_STATS case and is not very well used in the field. Hence, will be default disabled for GPDB7 release. With deprecation warning for possible removal of feature by GPDB8.

    -- 
    Ashwin Agrawal (VMware)
    Reply all
    Reply to author
    Forward
    0 new messages