auto_stats vs. auto-ANALYZE

Targets
- auto_stats: All relations except views and root/mid-level partition tables.
- auto-ANALYZE: Same as auto_stats, but it will also support root/mid-level partition tables once PostgreSQL enables that; there is an upstream thread for it: https://www.postgresql.org/message-id/CAKkQ508_PwVgwJyBY=0Lmkz90j8CmWNPUx...@mail.gmail.com

Execution time
- auto_stats: Can run in the same transaction for every statement that may change a table's stats, but the default setting skips ANALYZE when the relation already has statistics.
- auto-ANALYZE: For each database, autovacuum is triggered every autovacuum_naptime.

Behavior
- auto_stats: Executes ANALYZE for the relation within the same transaction. If the current table is a leaf partition, it also tries to extend ANALYZE to the parent partition table in the current transaction, which can cause the deadlock issue discussed in: https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/gqordopb6Gg/m/bMFTA2-MCQAJ
- auto-ANALYZE: Runs in a separate process and only executes ANALYZE on relations that reach autovacuum_analyze_threshold. It never extends ANALYZE to parent partition tables and, for now, never ANALYZEs root/mid-level partition tables (support is planned, see the link above).

Others
- auto_stats: With the default settings, users still need to schedule ANALYZE on all tables, since auto_stats only runs ANALYZE when a table has never been analyzed before.
- auto-ANALYZE: Users need to schedule ANALYZE on root/mid-level partition tables for now.
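For context, auto-ANALYZE fires when a table's accumulated changes exceed a threshold derived from the standard autovacuum GUCs. A minimal sketch for inspecting them (the values in the comments are the stock PostgreSQL defaults, shown only as an assumption for illustration):

    -- a table is auto-ANALYZEd roughly when the number of tuples inserted/updated/deleted
    -- since the last ANALYZE exceeds:
    --   autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples
    SHOW autovacuum;                        -- must be on for auto-ANALYZE to run at all
    SHOW autovacuum_naptime;                -- default 1min: how often each database is checked
    SHOW autovacuum_analyze_threshold;      -- default 50 changed tuples
    SHOW autovacuum_analyze_scale_factor;   -- default 0.1 (10% of the table)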
With auto_stats enabled:
table_name tuples seconds
call_center 42 0.366
catalog_page 30000 0.334
catalog_returns 143996756 94.94226
catalog_sales 1439980416 850.850487
customer 12000000 4.466
customer_address 6000000 2.2282
customer_demographics 1920800 0.356
date_dim 73049 0.402
household_demographics 7200 0.091
income_band 20 0.078
inventory 783000000 36.36603
item 300000 1.1804
promotion 1500 0.121
reason 65 0.323
ship_mode 20 0.077
store 1002 0.137
store_returns 287999764 153.153594
store_sales 2879987999 1123.112331
time_dim 86400 0.373
warehouse 20 0.094
web_page 3000 0.106
web_returns 71997522 28.28437
web_sales 720000376 418.418055
web_site 54 0.136

With auto-ANALYZE enabled:
table_name tuples seconds
call_center 42 0.157
catalog_page 30000 0.168
catalog_returns 143996756 111.11141
catalog_sales 1439980416 930.930319
customer 12000000 4.4676
customer_address 6000000 2.2321
customer_demographics 1920800 0.385
date_dim 73049 0.388
household_demographics 7200 0.092
income_band 20 0.074
inventory 783000000 38.38896
item 300000 1.1964
promotion 1500 0.14
reason 65 0.075
ship_mode 20 0.328
store 1002 0.167
store_returns 287999764 187.187212
store_sales 2879987999 1208.120863
time_dim 86400 0.364
warehouse 20 0.104
web_page 3000 0.125
web_returns 71997522 30.30341
web_sales 720000376 434.434357
web_site 54 0.145
Hi guys,
Currently we have enabled autovacuum to run auto-ANALYZE across databases on the master branch with these two PRs:
Maybe now is the time to remove the auto_stats logic from Greenplum.
What's the code/functional complexity of keeping both for GPDB7?
Will there be maintenance overhead (like tests, integrating with newer features, etc.) with keeping both implementations?
Approximately how many lines of code would be removed?
What's our plan for gp_autostats_mode_in_functions GUC?
gp_autostats_mode has 3 choices: OFF, ON_CHANGE, and ON_NO_STATS. With the newer implementation we only have OFF and ON_CHANGE. Plus, we will lose the session-level capability to control things or selectively exclude tables from auto_stats. I don't know why the default is ON_NO_STATS rather than ON_CHANGE. Did we look into the history for a reason, or validate whether most users flip this GUC to ON_CHANGE?
Are we going to leverage `autovacuum_enabled` reloption to provide flexibility for ignoring tables for auto-analyze?
Are temporary tables included or excluded from auto-analyze?
What's the code/functional complexity of keeping both for GPDB7?
- The logic of auto_stats is quite simple: the function is simply called at the places where it's needed and it invokes `ExecVacuum`. There shouldn't be much complexity in keeping both, and when auto-ANALYZE is enabled, auto_stats is skipped.
Will there be maintenance overhead (like tests, integrating with newer features, etc.) with keeping both implementations?
- We do have some tests; the overhead should not be very large.
Approximately how many lines of code would be removed?
- Around 1000 lines.
What's our plan for gp_autostats_mode_in_functions GUC?
- We don't need to consider functions separately for auto-ANALYZE. Auto-ANALYZE is triggered based on the pgstat table stats, which are counted through the relations' access methods.
- For each relation, we can control whether auto-ANALYZE is enabled by setting its storage parameter.
gp_autostats_mode has 3 choices: OFF, ON_CHANGE, and ON_NO_STATS. With the newer implementation we only have OFF and ON_CHANGE. Plus, we will lose the session-level capability to control things or selectively exclude tables from auto_stats. I don't know why the default is ON_NO_STATS rather than ON_CHANGE. Did we look into the history for a reason, or validate whether most users flip this GUC to ON_CHANGE?
- From my point of view, ON_CHANGE adds a lot of overhead when we do a simple one-row insert/update/delete, since it triggers ANALYZE every time (whereas auto-ANALYZE only triggers ANALYZE when the number of changed tuples reaches the threshold). Also, from the current code we still have 3 choices; where is the newer implementation?
Are we going to leverage `autovacuum_enabled` reloption to provide flexibility for ignoring tables for auto-analyze?
- Yes, this part is no different from upstream (see the sketch after this Q&A list).
Are temporary tables included or excluded from auto-analyze?
- auto-ANALYZE never runs ANALYZE on temporary tables.
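For reference, a small sketch of the session-level and per-table knobs discussed above (the table name is hypothetical and the threshold value is only an example):

    -- auto_stats: session-level control via GUCs
    SET gp_autostats_mode = on_change;              -- the default per this thread is ON_NO_STATS
    SET gp_autostats_on_change_threshold = 100000;  -- only ANALYZE when at least this many rows changed

    -- auto-ANALYZE: per-table opt-out via the reloption, same as upstream autovacuum
    ALTER TABLE my_table SET (autovacuum_enabled = false);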
And I think your concern makes sense. The reason I want to remove it is that if auto-ANALYZE works well, auto_stats will never be used and will become dead code. auto_stats also adds overhead for each statement and has some issues:
- It may turn a one-phase commit into a two-phase commit, for example when an insert touches only a single segment but still triggers the in-transaction ANALYZE.
- The deadlock issue mentioned before.
Waiting until the beta release for feedback is a good solution, I think. That way we could enable auto-ANALYZE by default at first to find potential issues.
That's good to know. I am still missing why auto-stats will be skipped if auto-analyze is enabled. Won't it still depend purely (and independently) on the value of the gp_autostats_mode GUC?
I understand auto-analyze is triggered regardless. But the fact that GPDB provides this GUC separately from gp_autostats_mode means there have been use cases where users wished to have stats computed while the function is executing, not in a delayed fashion. So in that case I don't think auto-analyzing in a delayed fashion is helpful. Hence we need to understand the impact of removing this feature as well, along with auto-stats, whenever we do that in the future.
The GUC gp_autostats_on_change_threshold controls the threshold for this, similar to auto-analyze.
So maybe auto-stats is still required for temp tables. One can always argue that users should add ANALYZE for temp tables if required, but again that would be a regression from the current behavior.
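As a concrete illustration of the temp-table point: autovacuum cannot process another session's temporary tables, so without auto-stats a manual ANALYZE in the session is the only way they get statistics. A sketch with hypothetical table names:

    -- a work table created inside an ETL session; background auto-ANALYZE will never touch it
    CREATE TEMP TABLE stage_orders AS
        SELECT * FROM orders WHERE order_date = current_date;
    ANALYZE stage_orders;  -- must be run explicitly (or by auto_stats) before the next ETL step uses it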
Hi Ashwin,
Since there are still some concerns about enabling auto-ANALYZE by default, let's just enable auto-ANALYZE in the pipeline tests. Users can still enable auto-ANALYZE if they want by setting `autovacuum=on` on the coordinator.
So for the PR https://github.com/greenplum-db/gpdb/pull/11142, I'll only enable auto-ANALYZE in the pipeline tests and fix the related test cases. Does this sound good to you?
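Once enabled, one way to confirm that auto-ANALYZE is actually running is to check the standard statistics view (a sketch; the columns are the upstream ones):

    -- last_autoanalyze is set by the background auto-ANALYZE,
    -- last_analyze by a manual (or auto_stats) ANALYZE
    SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
    FROM pg_stat_user_tables
    ORDER BY relname;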
Guys, there is an important fact that we should consider. When running an ETL job with many steps using work tables, users expect auto-stats to kick in and collect stats before a work table is used in the next step. If we move to an async process we will lose that characteristic, which might make many ETL processes perform much worse due to not having any stats collected. How will the new auto-analyze impact that scenario? Is my understanding correct?
gp_autostats_mode. And we shouldn't make them both work, for the reason below:
> A cluster has autovacuum=on, and a table is created with autovacuum_enabled = false.
> As a user, I would expect the table to always have empty stats unless I manually run ANALYZE.
> But if we still run autostats with autovacuum=on, the table will have stats.
> So we should not run autostats with autovacuum=on.
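A hypothetical example of the expectation described in the quote above (table name is made up):

    -- the cluster has autovacuum = on; the table opts out of autovacuum/auto-ANALYZE
    CREATE TABLE t_no_auto (a int) WITH (autovacuum_enabled = false);
    INSERT INTO t_no_auto SELECT generate_series(1, 1000);
    -- expectation: no statistics for this table until a manual ANALYZE;
    -- if auto_stats (ON_NO_STATS) also ran here, the INSERT itself would already
    -- have populated the stats, breaking that expectation
    ANALYZE t_no_auto;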
Ashwin,
Your explanation made it clearer to me. My original understanding was that auto-stats would be deprecated in favor of auto-vacuum, which is not true.
The scenario that I described will not happen, as we will still be collecting some stats on bulk-load operations AND improving/automating stats collection in mini-batch/streaming scenarios.
Thanks,
Luis F R Macedo
Advisory Data Engineer & Business Development for Latam
VMware Tanzu Data
Call Me @ +55 11 97616-6438
Take care of the customers and the rest takes care of itself
On Tue, Dec 2, 2020 at 6:18 AM Ashwin Agrawal wrote:
On Tue, Dec 1, 2020 at 6:18 AM Luis Macedo <lma...@pivotal.io> wrote:
Guys, there is an important fact that we should consider.
When running an ETL job with many steps using work tables, users expect auto-stats to kick in and collect stats before a work table is used in the next step. If we move to an async process we will lose that characteristic, which might make many ETL processes perform much worse due to not having any stats collected.
How will the new auto-analyze impact that scenario? Is my understanding correct?
Thanks Luis for the participation and comment. This (whether gp_autostats_mode should be touched or not) is the main aspect we have been debating for this feature. My past proposal aligns with what you mentioned: auto-analyze should be an additive feature which helps for use cases we don't cover currently but shouldn't take anything away (or only some minimal stuff if really required to).
So, just to put it in a simplified way: for OLAP (1 million tuples inserted in one shot) the current auto-stats helps, and for OLTP (1 million singleton inserts) auto-analyze will help.
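To make that split concrete, a purely illustrative sketch (hypothetical table, default thresholds assumed):

    -- hypothetical fact table
    CREATE TABLE facts (id int, val text);

    -- OLAP-style bulk load: one statement changes many rows, so the in-transaction
    -- auto_stats ANALYZE covers it immediately
    INSERT INTO facts SELECT g, 'x' FROM generate_series(1, 1000000) g;

    -- OLTP-style trickle: with ON_NO_STATS, auto_stats only fires on the first insert,
    -- but the background auto-ANALYZE catches up once enough changes accumulate
    INSERT INTO facts VALUES (1, 'x');   -- repeated many times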
Hubert,
Thanks for your thoughts. We could think of integrating the two features and slowly migrating to auto-analyze (PG). We could (maybe) deprecate ON_CHANGE in favor of auto-analyze (PG) and (maybe) change auto-stats (GP) to force a synchronous auto-analyze (PG) on ON_NO_STATS (not sure about the viability).
We can also add deprecation notices to the docs and keep the feature for two major versions before removing it.
I also prefer to run analyze in the ETL, but that ship has sailed and many customers already depend on auto-stats.
But your proposal to leave auto-analyze (PG) off by default sounds reasonable, as it will ease users in.
The topic of an autonomous database should be treated by engineering as a high priority, IMHO. The easier GPDB is to use, the more users GPDB will have.
PS: All my comments are based on my view as a GPDB user. I have very little visibility into the code itself.
Rgds,
Hi Luis and Ashwin,
On Tue, Dec 2, 2020 at 6:18 AM Ashwin Agrawal wrote:
On Tue, Dec 1, 2020 at 6:18 AM Luis Macedo <lma...@pivotal.io> wrote:
Guys, there is an important fact that we should consider.
When running an ETL job with many steps using work tables, users expect auto-stats to kick in and collect stats before a work table is used in the next step. If we move to an async process we will lose that characteristic, which might make many ETL processes perform much worse due to not having any stats collected.
How will the new auto-analyze impact that scenario? Is my understanding correct?
Thanks Luis for the participation and comment. This (whether gp_autostats_mode should be touched or not) is the main aspect we have been debating for this feature. My past proposal aligns with what you mentioned: auto-analyze should be an additive feature which helps for use cases we don't cover currently but shouldn't take anything away (or only some minimal stuff if really required to).
So, just to put it in a simplified way: for OLAP (1 million tuples inserted in one shot) the current auto-stats helps, and for OLTP (1 million singleton inserts) auto-analyze will help.
Since auto-analyze is a feature from PostgreSQL, I think it's better to keep the auto-analyze behavior the same as upstream. auto-stats may help in the "1 million tuples inserted in one shot" case Ashwin mentioned, and I agree there are some customer-side ETL/ELT scripts which depend on the auto-stats feature. But this feature is Greenplum-specific and has many issues, e.g. for subsequent inserts, auto_stats will not be triggered in ON_NO_STATS mode. I don't think all customers are aware of that. It's safer to run ANALYZE manually in these ETL scripts, which is also the best practice in PostgreSQL.
In the long term, I would prefer that Greenplum keep the same behavior and practices as PostgreSQL on the topic of ANALYZE: auto-stats would be disabled, and users would run ANALYZE manually if they want the statistics immediately, just like upstream. Meanwhile, when auto-analyze is disabled, we should keep the current behavior, i.e. keep the auto-stats feature.
So how about setting auto-analyze to disabled by default? When users want to enable auto-analyze, they should also follow the practice of adding manual ANALYZE to their scripts where necessary.
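A hypothetical illustration of the ON_NO_STATS pitfall mentioned above (table and row counts are made up):

    SET gp_autostats_mode = on_no_stats;                     -- the default per this thread
    CREATE TABLE work_t (a int);
    INSERT INTO work_t SELECT generate_series(1, 10);
    -- no statistics existed, so auto_stats ran ANALYZE after this tiny insert
    INSERT INTO work_t SELECT generate_series(1, 10000000);
    -- statistics already exist, so auto_stats does nothing; the stats now badly undercount the table
    ANALYZE work_t;  -- the safer practice: analyze explicitly in the ETL script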
Ashwin,
My main concern with keeping both features on would be the waste of resources. Let's say I bulk-load a table, which gets analyzed by auto-stats. Then auto-analyze comes along a few moments later and analyzes the table again. Worse, it is spending CPU cycles on an operation that is not easily visible to the DBA.
Another concern is the same one we had when turning auto-vacuum on was discussed: there will be resource consumption that the DBA will not be able to fully control, which might hit some SLAs.
When looking at OLTP, tables are usually not that big, but in the OLAP use case we might have monster tables that could end up being a hit on the system.
Rgds,
Luis F R Macedo
Advisory Data Engineer & Business Development for Latam
VMware Tanzu Data
Call Me @ +55 11 97616-6438
Take care of the customers and the rest takes care of itself
From: Ashwin Agrawal <ashwi...@gmail.com>
Date: Tuesday, December 8, 2020, 14:48
To: Hubert Zhang <zhu...@vmware.com>
Cc: Luis Filipe de Macedo <mac...@vmware.com>, gpdb...@greenplum.org <gpdb...@greenplum.org>, Junfeng Yang <yje...@vmware.com>, Louis Mugnano (Pivotal) <lmug...@pivotal.io>
Subject: Re: Plan to remove auto_stats code from Greenplum Master/7