Soumyadeep Chakraborty
Aug 30, 2023, 9:00:39 PM
to Greenplum Developers
Hello,
In this post I'm going to examine the prevalence of work_mem (and its cousin
planner_work_mem) in our main branch (and the soon-to-be-released GPDB 7X).
In general, our long-term goal is to eradicate work_mem from the codebase. We
have done so in many places, chiefly in the executor, by replacing it with
operatorMemKB. Using operatorMemKB ensures that we can allocate a budget for
the entire query/statement (called query_mem), something that is not at all
possible with work_mem.
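To illustrate the idea (a minimal, hypothetical sketch: the helper name and
the even split are mine, not the actual GPDB policy code, which weights
operators by type):

/*
 * Hypothetical sketch only: with a statement-wide budget (query_mem),
 * each memory-intensive operator receives a slice of it, so total
 * consumption is bounded. With work_mem, by contrast, every such
 * operator independently assumes it may use the full amount.
 */
static uint64
operator_mem_kb(uint64 query_mem_kb, int n_mem_intensive_ops)
{
    if (n_mem_intensive_ops < 1)
        n_mem_intensive_ops = 1;
    return query_mem_kb / n_mem_intensive_ops;
}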
TL;DR:
However, there are quite a few places in the code where there is rampant use
of work_mem and planner_work_mem. These include the planner, executor, PL
languages, UDFs and index AMs. The iceberg is not submerged; it is floating in
the air.
My conclusions:
There is a long way to go, as you will see below, before we can eradicate the
usage of work_mem entirely. We should un-deprecate it and provide clear
guidance to users on how it can prove beneficial. I have attached some simple
tests that have yielded very positive results from setting this GUC.
Also, as we absorb extensions such as pgvector (which relies on work_mem to
size its tuplesort), we can't eliminate it: such extensions will always rely
on work_mem.
We have to think very carefully about how we want to proceed. We will have to
replace work_mem gradually in all facets of the code. At the same time, we
would have to provide a GUC to make it work as it does now, both for backwards
compatibility and for extensibility. (We don't want to start forking
extensions, do we?)
Usages:
(1) Planner:
* Finalize hash agg: add_second_stage_hash_agg_path() - whether we even consider
a finalize hashagg depends on hashentrysize * dNumGroups < work_mem * 1024L
and hashentrysize * ctx->dNumGroupsTotal <= work_mem * 1024L
* Bitmap heap scan / bitmap index scan costing: compute_bitmap_pages(): This
directly influences the cost of performing bitmap heap/index scans. It has a
dramatic effect on estimating tuples_fetched from the BitmapHeapScan node,
since it has a direct bearing on lossy_pages, and having lots of lossy_pages
imposes a major costing penalty (tuples_fetched ≈ reltuples/numsegments). See
the sketch at the end of this section.
To see the level of impact, just look at the differences in run times and the
number of TPC-H queries that select a bitmap heap/index scan in the appendix,
using BRIN and B-tree indexes (see attached)!
* Costing for global ops: ALL of these use planner_work_mem, via
global_work_mem() (also sketched at the end of this section), to override
work_mem:
cost_sort()
cost_material()
cost_shareinputscan()
initial_cost_hashjoin()
subpath_is_hashable()
subplan_is_hashable()
* Costing for CTEScan and WorkTableScan rescans: cost_rescan(): work_mem decides
if we are spilling or not and adjusts costs accordingly.
* Hash-join costing: final_cost_hashjoin(): disable_cost penalty for buckets
holding inner MCV is imposed when the bucket size exceeds work_mem.
* Merge-join costing: final_cost_mergejoin(): We decide whether to materialize
the inner side by checking if the inner side will spill (based on work_mem).
* Creation of unique paths (create_unique_path() and create_unique_rowid_path())
We decide if we can hash based on work_mem:
hashentrysize * pathnode->path.rows > work_mem * 1024L
* Agg planning:
Whether we choose an agg path at all depends on
hashaggtablesize < work_mem * 1024L: add_paths_to_grouping_rel()
Whether we choose a hashagg depends on
hashaggtablesize < work_mem * 1024L: create_partial_grouping_paths()
* Grouping sets planning: Whether we generate Grouping sets paths at all depends
on whether the hashagg table fits into work_mem: consider_groupingsets_paths():
hashsize > work_mem * 1024L.
* Setops planning: Whether we should use hash for setops: choose_hashed_setop()
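Regarding the compute_bitmap_pages() item above, here is roughly how work_mem
drives lossification (paraphrased from costsize.c; treat it as a sketch, not
the verbatim source):

/*
 * Paraphrased from compute_bitmap_pages(): work_mem caps the number of
 * exact TID-bitmap entries via tbm_calculate_entries(); any heap pages
 * beyond that are estimated to become lossy, which in turn inflates
 * tuples_fetched.
 */
static double
estimated_lossy_pages(double heap_pages)
{
    long    maxentries = tbm_calculate_entries(work_mem * 1024L);

    return Max(0, heap_pages - maxentries / 2);
}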
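And for the global ops item, global_work_mem() boils down to substituting
planner_work_mem (in bytes) for work_mem; approximately (not the verbatim GPDB
source, and I'm assuming it takes the PlannerInfo like other costing helpers):

/*
 * Approximation of the GPDB helper: costing of "global" operations
 * uses planner_work_mem in place of work_mem.
 */
static long
global_work_mem_approx(PlannerInfo *root)
{
    return planner_work_mem * 1024L;
}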
(2) Executor:
* [Possibly very high impact due to generality] Sizing per-tuple memory
contexts: CreateWorkExprContext(): We decide the maxBlockSize of
ecxt_per_tuple_memory based on work_mem (see the sketch at the end of this
section). This memory context is used by all executor nodes!
* Comment that shows that we fall back to using work_mem for certain
"statements" that don't go through resource queues:

if (ps->plan->operatorMemKB == 0)
{
    /**
     * There are some statements that do not go through the resource
     * queue and these plans don't get decorated with the operatorMemKB.
     * Someday, we should fix resource queues.
     */
    result = work_mem;
}
* Agg execution: BuildTupleHashTableExt(): We decide the number of buckets in
the hash table used for grouping based on work_mem (see the second sketch at
the end of this section).
* index_initbitmap(): a generic, index-AM-agnostic routine called from
MultiExecBitmapIndexScan().
* [Low impact to GPDB] Parallel hash join: ExecParallelHashIncreaseNumBatches()
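For the CreateWorkExprContext() item above, the sizing logic is roughly this
(paraphrasing execUtils.c; the constants are the allocset defaults):

/*
 * Paraphrased from CreateWorkExprContext(): halve the allocator's
 * maxBlockSize until it is no more than 1/16 of work_mem, with a floor
 * at the default initial block size.
 */
static Size
work_expr_context_max_block_size(void)
{
    Size    maxBlockSize = ALLOCSET_DEFAULT_MAXSIZE;

    while (16 * maxBlockSize > work_mem * 1024L)
        maxBlockSize >>= 1;

    if (maxBlockSize < ALLOCSET_DEFAULT_INITSIZE)
        maxBlockSize = ALLOCSET_DEFAULT_INITSIZE;

    return maxBlockSize;
}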
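And for the BuildTupleHashTableExt() item, the bucket count the caller passes
in is clamped along these lines (paraphrasing the upstream
hash_choose_num_buckets() heuristic; GPDB's exact version may differ):

/*
 * Paraphrase of the upstream bucket-count heuristic: clamp the
 * requested bucket count so the table fits within the memory budget
 * derived from work_mem.
 */
static long
choose_num_buckets(double hashentrysize, long ngroups, Size memory)
{
    long    max_nbuckets = memory / hashentrysize;
    long    nbuckets = ngroups;

    if (nbuckets > max_nbuckets)
        nbuckets = max_nbuckets;

    return Max(nbuckets, 1);
}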
(3) Index AMs:
* amgetbitmap routines - used for bitmap index scans. work_mem is used here to
size tidbitmaps, which has a huge bearing on tidbitmap lossification, which in
turn adversely affects BitmapHeapScan runtime (see the sketch below).
* GIN insert cleanup: ginInsertCleanup(): Decides if it's time to flush memory
to disk (accum.allocatedMemory >= workMemory * 1024L).
* GIN match tid bitmap is sized with respect to work_mem: collectMatchBitmap().
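The tidbitmap sizing mentioned above follows the familiar upstream pattern
(e.g., in nodeBitmapIndexscan.c); once the bitmap exceeds this byte budget,
tbm_lossify() starts degrading exact per-page entries into lossy chunk
entries:

/*
 * Typical tidbitmap sizing (as in upstream nodeBitmapIndexscan.c): the
 * bitmap gets a work_mem byte budget; NULL means no shared DSA area.
 */
TIDBitmap  *tbm = tbm_create(work_mem * 1024L, NULL);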
(4) Tuplestores: Rampant use of work_mem is present to size the maxKBytes of
tuplestores throughout the code (see the sketch after this list).
* Fmgr cache tuplestore
* Upstream UDFs relying on tuplestores:
pg_timezone_names (datetime.c)
dblink UDFs (dblink.c)
Event triggers (event_trigger.c)
pageinspect functions (e.g. brin_page_items())
extension support UDFs (extension.c)
Foreign table options support UDF (deflist_to_tuplestore())
Dir admin functions (genfile.c)
show_all_file_settings UDF (guc.c)
pg_hba_file_rules UDF
pg_stat_get_subscription UDF
pg_get_backend_memory_contexts UDF
pg_tablespace_databases UDF
pg_show_replication_origin_status UDF
pg_config UDF
pg_stat_statements UDF
pgrowlocks UDF
pg_stat_get* UDFs
pg_cursor UDF used by pg_cursors view
pg_prepared_statement UDF for pg_prepared_statements view
pg_get_replication_slots UDF
tablefuncs.c UDFs
amcheck UDFs for btree (verify_nbtree.c)
pg_stat_get_wal_senders UDF
pg_stop_backup_v2 UDF
* jsonfuncs.c tuplestore maxBytes
* libpqwalreceiver uses work_mem to size a tuplestore's maxKBytes! See
libpqrcv_processTuples()
* Logical replication: Getting slot changes relies on a tuplestore sized by
work_mem: pg_logical_slot_get_changes_guts()
* FunctionScans: ExecMakeTableFunctionResult(): Tuplestore for materializing the
results of a SRF has its maxKBytes set based on work_mem.
* CTEScan: Size of tuplestore for holding results returned from CTE query:
ExecInitCteScan()
* TableFuncScan tuplestore to contain results of TableFuncScan: tfuncFetchRows()
* RecursiveUnion tuplestore working_table: ExecInitRecursiveUnion()
tuplestore intermediate_table: ExecInitRecursiveUnion()/ExecRecursiveUnion()
* pl/pgsql UDFs have a tuplestore in their PLpgSQL_execstate!
exec_init_tuple_store()
* Holdable portals have a tuplestore! PortalCreateHoldStore()
* Triggers:
FDW tuple stores: GetCurrentFDWTuplestore()
TransitionCapture tuple stores: MakeTransitionCaptureState()
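The recurring pattern behind all of the above is simply this (tuplestore.c
API; the flag values vary by call site):

/*
 * maxKBytes comes straight from work_mem, so every such tuplestore
 * assumes it owns the full work_mem budget.
 */
Tuplestorestate *tupstore;

tupstore = tuplestore_begin_heap(true,      /* randomAccess */
                                 false,     /* interXact */
                                 work_mem); /* maxKBytes */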
(5) Tuplesort: allowedMem is decided by work_mem. See the tuplesort_beginXXX
routines (and the sketch after this list).
* btree internal sort routines (nbtsort.c)
* Ordered set aggs execution sort state: ordered_set_startup()
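Again the same pattern; for example, ordered_set_startup() passes work_mem
straight through as the sort's memory budget (signature as of the PG 12/13
era; it varies across versions):

/*
 * work_mem is passed through as allowedMem; see orderedsetaggs.c for
 * the surrounding context.
 */
osastate->sortstate = tuplesort_begin_heap(qstate->tupdesc,
                                           qstate->numSortCols,
                                           qstate->sortColIdx,
                                           qstate->sortOperators,
                                           qstate->sortCollations,
                                           qstate->sortNullsFirsts,
                                           work_mem,
                                           NULL,    /* no coordinator */
                                           qstate->rescan_needed);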
Regards,
Soumyadeep (VMware)