work_mem for GPDB7 and beyond

Soumyadeep Chakraborty

Aug 30, 2023, 9:00:39 PM
to Greenplum Developers
Hello,

In this post I'm going to examine the prevalence of work_mem (and its cousin
planner_work_mem) in our main branch (and the soon-to-be-released GPDB 7X).

In general, our long-term goal is to eradicate work_mem from the codebase. We
have done so in many places, chiefly in the executor, by replacing it with
operatorMemKB. Using operatorMemKB ensures that we can allocate a budget for
the entire query/statement (called query_mem), something that is not at all
possible with work_mem.
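
To make the contrast concrete, here is a rough, purely illustrative sketch (all
numbers invented; this is not our resource manager code) of the difference
between giving every memory-intensive operator work_mem and carving
operatorMemKB out of a single query_mem budget:

#include <stdio.h>

/*
 * Hypothetical illustration only: with plain work_mem, every memory-hungry
 * operator in a plan gets the same per-operator allowance, so the total
 * scales with plan shape.  With query_mem, the statement gets one budget and
 * each operator receives a slice (operatorMemKB), so the total is bounded.
 */
int
main(void)
{
    int     work_mem_kb = 32 * 1024;        /* 32MB per operator (example) */
    int     query_mem_kb = 128 * 1024;      /* 128MB per statement (example) */
    int     n_mem_intensive_ops = 10;       /* sorts, hashes, materializes... */

    long    total_with_work_mem = (long) work_mem_kb * n_mem_intensive_ops;
    long    per_op_with_query_mem = query_mem_kb / n_mem_intensive_ops;

    printf("work_mem model:  %d ops x %d kB = %ld kB total (grows with the plan)\n",
           n_mem_intensive_ops, work_mem_kb, total_with_work_mem);
    printf("query_mem model: %d kB total, ~%ld kB per operator (operatorMemKB)\n",
           query_mem_kb, per_op_with_query_mem);
    return 0;
}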

TL;DR:
However, there are quite a few places in the code where work_mem and
planner_work_mem are still used rampantly. These include the planner, executor,
PL languages, UDFs and index AMs. The iceberg is not submerged; it is floating
in the air.

My conclusions:

There is a long way to go, as you will see below, before we can eradicate the
usage of work_mem entirely. We should un-deprecate it and provide clear
guidance to users on how it can prove beneficial. I have attached some simple
tests whose results show a very positive impact from setting this GUC.

Also, as we absorb extensions such as pgvector (which relies on work_mem to
size its tuplesort), they will always rely on work_mem, so we can't eliminate
it.

We have to think very carefully about how we want to proceed. We will have to
replace work_mem gradually in all facets of the code. At the same time, we
would have to provide a GUC to make it behave as it does now, both for
backwards compatibility and for extensibility (we don't want to start forking
extensions, do we?)

Usages:

(1) Planner:

* Finalize hash agg: add_second_stage_hash_agg_path() - whether we even consider
a finalize hashagg depends on hashentrysize * dNumGroups < work_mem * 1024L
and hashentrysize * ctx->dNumGroupsTotal <= work_mem * 1024L
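
The same "does the hash table fit in work_mem" gate recurs almost verbatim in
create_unique_path(), add_paths_to_grouping_rel(), create_partial_grouping_paths(),
consider_groupingsets_paths() and choose_hashed_setop() below. A standalone
sketch of the arithmetic (entry size and group count are invented for
illustration):

#include <stdio.h>
#include <stdbool.h>

/*
 * Illustrative sketch of the recurring planner gate
 *   hashentrysize * dNumGroups < work_mem * 1024L
 * The entry size and group count are made up, to show how a larger work_mem
 * flips the decision in favor of hash-based paths.
 */
static bool
hash_table_fits(double hashentrysize, double dNumGroups, int work_mem_kb)
{
    return hashentrysize * dNumGroups < work_mem_kb * 1024L;
}

int
main(void)
{
    double  hashentrysize = 64.0;   /* bytes per hash entry (example) */
    double  dNumGroups = 2e6;       /* estimated number of groups (example) */

    printf("work_mem=32MB  -> consider hash agg: %s\n",
           hash_table_fits(hashentrysize, dNumGroups, 32 * 1024) ? "yes" : "no");
    printf("work_mem=256MB -> consider hash agg: %s\n",
           hash_table_fits(hashentrysize, dNumGroups, 256 * 1024) ? "yes" : "no");
    return 0;
}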

* Bitmap heap scan / bitmap index scan costing: compute_bitmap_pages(): This
directly influences the cost of performing bitmap heap/index scans. This has
a dramatic effect on estimating tuples_fetched from the BitmapHeapScan node, as
it has a direct bearing on lossy_pages and having lots of lossy_pages imposes a
major costing penalty (tuples_fetched ≈ reltuples/numsegments).

To see the level of impact, just look at the differences in run times and the
number of TPC-H queries that select a bitmap heap/index scan in the appendix,
using BRIN and B-tree indexes (see attached)!
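
As a rough standalone illustration of why (the ~70 bytes per exact entry and
the page count are invented, not the real tbm_calculate_entries() math): the
planner works out how many exact per-page bitmap entries fit in work_mem and
treats the remaining heap pages as lossy, and it is the lossy pages that blow
up the cost and row estimates:

#include <stdio.h>

#define Max(a, b) ((a) > (b) ? (a) : (b))

/*
 * Rough illustration only: how many exact TID-bitmap entries fit in work_mem,
 * with the remaining heap pages treated as lossy.  The bytes-per-entry value
 * and page count are invented approximations.
 */
int
main(void)
{
    double  heap_pages = 1e6;           /* pages the bitmap scan would touch */
    double  bytes_per_entry = 70.0;     /* rough size of one exact bitmap entry */
    int     work_mem_kb[] = {4 * 1024, 64 * 1024, 1024 * 1024};

    for (int i = 0; i < 3; i++)
    {
        double  maxentries = work_mem_kb[i] * 1024.0 / bytes_per_entry;
        double  lossy_pages = Max(0, heap_pages - maxentries / 2);
        double  exact_pages = heap_pages - lossy_pages;

        printf("work_mem=%dMB: exact=%.0f lossy=%.0f pages\n",
               work_mem_kb[i] / 1024, exact_pages, lossy_pages);
    }
    return 0;
}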

* Costing for global ops - ALL of these use planner_work_mem, via
global_work_mem(), to override work_mem:

cost_sort()
cost_material()
cost_shareinputscan()
initial_cost_hashjoin()
subpath_is_hashable()
subplan_is_hashable()

* Costing for CTEScan and WorkTableScan rescans: cost_rescan(): work_mem decides
if we are spilling or not and adjusts costs accordingly.
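
For reference, a simplified standalone rendering of that spill check (row
count and width are invented; cost constants are the planner defaults): once
the materialized result exceeds work_mem, each rescan is charged page I/O on
top of the per-tuple CPU cost:

#include <math.h>
#include <stdio.h>

/*
 * Simplified rendering of the CteScan/WorkTableScan rescan spill check: the
 * materialized result is re-read from a tuplestore, and if it does not fit
 * in work_mem the rescan is also charged sequential page I/O.
 */
int
main(void)
{
    double  rows = 1e6;                 /* estimated tuples in the CTE result */
    double  width = 100.0;              /* estimated bytes per tuple */
    double  cpu_tuple_cost = 0.01;      /* planner default */
    double  seq_page_cost = 1.0;        /* planner default */
    double  blcksz = 8192.0;
    int     work_mem_kb = 4 * 1024;     /* 4MB */

    double  nbytes = rows * width;      /* crude relation_byte_size() stand-in */
    double  run_cost = cpu_tuple_cost * rows;

    if (nbytes > (double) work_mem_kb * 1024.0)
        run_cost += seq_page_cost * ceil(nbytes / blcksz);

    printf("estimated rescan cost: %.1f\n", run_cost);
    return 0;
}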

* Hash-join costing: final_cost_hashjoin(): a disable_cost penalty is imposed
when the bucket holding the inner relation's MCV would exceed work_mem.

* Merge-join costing: final_cost_mergejoin(): We decide whether to materialize
the inner side by checking whether it will spill (based on work_mem).

* Creation of unique paths (create_unique_path() and create_unique_rowid_path())
We decide if we can hash based on work_mem:
hashentrysize * pathnode->path.rows > work_mem * 1024L

* Agg planning:

Whether we choose a hashagg path depends on
hashaggtablesize < work_mem * 1024L: add_paths_to_grouping_rel()

Whether we choose a hashagg for partial grouping depends on
hashaggtablesize < work_mem * 1024L: create_partial_grouping_paths()

* Grouping sets planning: Whether we generate Grouping sets paths at all depends
on whether the hashagg table fits into work_mem: consider_groupingsets_paths():
hashsize > work_mem * 1024L.

* Setops planning: Whether we should use hash for setops: choose_hashed_setop()

* Costing for

(2) Executor:

* [Possibly very high impact due to generality] Sizing per-tuple memory
contexts: CreateWorkExprContext(): We decide the maxBlockSize of
ecxt_per_tuple_memory based on work_mem. This memory context is used by all
executor nodes!
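
A simplified standalone sketch of that sizing logic, modeled on the upstream
PostgreSQL code (the constants mirror the allocset defaults): maxBlockSize
starts at 8MB and is halved until 16 blocks fit within work_mem, with an 8kB
floor:

#include <stdio.h>

/*
 * Simplified sketch of how CreateWorkExprContext() derives maxBlockSize from
 * work_mem, using the allocset default block sizes.
 */
int
main(void)
{
    long    default_max_block = 8 * 1024 * 1024;    /* ALLOCSET_DEFAULT_MAXSIZE */
    long    default_init_block = 8 * 1024;          /* ALLOCSET_DEFAULT_INITSIZE */
    long    work_mem_kb_values[] = {64, 4 * 1024, 1024 * 1024};

    for (int i = 0; i < 3; i++)
    {
        long    work_mem_kb = work_mem_kb_values[i];
        long    maxBlockSize = default_max_block;

        /* halve until 16 blocks fit in work_mem, then clamp to the floor */
        while (16 * maxBlockSize > work_mem_kb * 1024L)
            maxBlockSize >>= 1;
        if (maxBlockSize < default_init_block)
            maxBlockSize = default_init_block;

        printf("work_mem=%ldkB -> per-tuple context maxBlockSize=%ldkB\n",
               work_mem_kb, maxBlockSize / 1024);
    }
    return 0;
}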

* Comment that shows that we fall back to using work_mem for certain
"statements" that don't go through resource queues:

if (ps->plan->operatorMemKB == 0)
{
    /**
     * There are some statements that do not go through the resource queue
     * and these plans dont get decorated with the operatorMemKB. Someday,
     * we should fix resource queues.
     */
    result = work_mem;
}

* Agg execution: BuildTupleHashTableExt(): We decide the number of buckets in
the hash table used for grouping based on work_mem.


* index_initbitmap() - generic index AM agnostic routine called from
MultiExecBitmapIndexScan().

* [Low impact to GPDB] Parallel hash join: ExecParallelHashIncreaseNumBatches()

(3) Index AMs:

* amgetbitmap routines - used for bitmap index scan. work_mem is used here to
size tidbitmaps -> Huge bearing on tidbitmap lossification, which adversely
affects BitmapHeapScan runtime.

* GIN insert cleanup: ginInsertCleanup(): Decides if it's time to flush memory
to disk (accum.allocatedMemory >= workMemory * 1024L)

* GIN match tid bitmap is sized with respect to work_mem: collectMatchBitmap().

(4) Tuplestores: work_mem is used rampantly throughout the code to size the
maxKBytes of tuplestores (a sketch of the typical calling convention follows
at the end of this list).

* Fmgr cache tuplestore

* Upstream UDFs relying on tuplestores:
pg_timezone_names (datetime.c)
dblink UDFs (dblink.c)
Event triggers (event_trigger.c)
pageinspect functions (e.g. brin_page_items())
extension support UDFs (extension.c)
Foreign table options support UDF (deflist_to_tuplestore())
Dir admin functions (genfile.c)
show_all_file_settings UDF (guc.c)
pg_hba_file_rules UDF
pg_stat_get_subscription UDF
pg_get_backend_memory_contexts UDF
pg_tablespace_databases UDF
pg_show_replication_origin_status UDF
pg_config UDF
pg_stat_statements UDF
pgrowlocks UDF
pg_stat_get* UDFs
pg_cursor UDF used by pg_cursors view
pg_prepared_statement UDF for pg_prepared_statements view
pg_get_replication_slots UDF
tablefuncs.c UDFs
amcheck UDFs for btree (verify_nbtree.c)
pg_stat_get_wal_senders UDF
pg_stop_backup_v2 UDF

* jsonfuncs.c tuplestore maxBytes

* libpqwalreceiver uses work_mem to size tuplestore maxKBytes! See
libpqrcv_processTuples()

* Logical replication: Getting slot changes relies on a tuplestore sized by
work_mem: pg_logical_slot_get_changes_guts()

* FunctionScans: ExecMakeTableFunctionResult(): Tuplestore for materializing the
results of a SRF has its maxKBytes set based on work_mem.

* CTEScan: Size of tuplestore for holding results returned from CTE query:
ExecInitCteScan()

* TableFuncScan tuplestore to contain results of TableFuncScan: tfuncFetchRows()

* RecursiveUnion:
  tuplestore working_table: ExecInitRecursiveUnion()
  tuplestore intermediate_table: ExecInitRecursiveUnion()/ExecRecursiveUnion()

* pl/pgsql UDFs have a tuplestore in their PLpgSQL_execstate!
exec_init_tuple_store()

* Holdable portals have a tuplestore! PortalCreateHoldStore()

* Triggers:
  FDW tuple stores: GetCurrentFDWTuplestore()
  TransitionCapture tuple stores: MakeTransitionCaptureState()
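
Across all of the tuplestore users above, the sizing pattern is essentially
the same: work_mem is passed as the maxKBytes argument of
tuplestore_begin_heap(). As an illustration, here is a hypothetical
materialize-mode SRF (not from the tree, written against the upstream APIs,
and omitting the usual allowedModes checks) that follows that convention:

#include "postgres.h"

#include "catalog/pg_type.h"
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
#include "utils/tuplestore.h"

PG_MODULE_MAGIC;

/*
 * Hypothetical materialize-mode SRF illustrating the ubiquitous sizing
 * pattern: the result tuplestore gets maxKBytes = work_mem, so the result
 * spills to disk once work_mem is exceeded.
 */
PG_FUNCTION_INFO_V1(demo_work_mem_srf);

Datum
demo_work_mem_srf(PG_FUNCTION_ARGS)
{
    ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
    TupleDesc   tupdesc;
    Tuplestorestate *tupstore;
    MemoryContext oldcontext;
    Datum       values[1] = {Int32GetDatum(42)};
    bool        nulls[1] = {false};

    /* build the result in per-query memory, as materialize-mode SRFs do */
    oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);

    tupdesc = CreateTemplateTupleDesc(1);
    TupleDescInitEntry(tupdesc, (AttrNumber) 1, "val", INT4OID, -1, 0);

    /* the pattern in question: tuplestore capped at work_mem kilobytes */
    tupstore = tuplestore_begin_heap(true, false, work_mem);

    rsinfo->returnMode = SFRM_Materialize;
    rsinfo->setResult = tupstore;
    rsinfo->setDesc = tupdesc;

    MemoryContextSwitchTo(oldcontext);

    /* emit a single demo row */
    tuplestore_putvalues(tupstore, tupdesc, values, nulls);

    return (Datum) 0;
}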

(5) Tuplesorts: allowedMem is decided by work_mem; see the tuplesort_beginXXX
routines and the sketch after the list below.

* btree internal sort routines (nbtsort.c)

* Ordered set aggs execution sort state: ordered_set_startup()
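
The calling convention is the same across these call sites: the memory budget
(work_mem, or a related GUC such as maintenance_work_mem in index builds) is
passed as the workMem argument of the tuplesort_begin_* routines. A
hypothetical helper (not from the tree), written against the upstream
PostgreSQL 12 signature of tuplesort_begin_heap(), just to show the pattern:

#include "postgres.h"

#include "miscadmin.h"
#include "utils/tuplesort.h"

/*
 * Hypothetical helper: begin a single-key heap tuplesort whose allowedMem is
 * work_mem kilobytes, the way in-core callers such as ordered_set_startup()
 * do.  The sort spills to tapes once that budget is exceeded.
 */
static Tuplesortstate *
begin_single_key_sort(TupleDesc tupdesc, Oid sortOperator, Oid collation)
{
    AttrNumber  attNums[1] = {1};
    Oid         sortOperators[1] = {sortOperator};
    Oid         collations[1] = {collation};
    bool        nullsFirst[1] = {false};

    /* allowedMem for this sort is work_mem kilobytes */
    return tuplesort_begin_heap(tupdesc, 1, attNums,
                                sortOperators, collations, nullsFirst,
                                work_mem, NULL, false);
}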

Regards,
Soumyadeep (VMware)
work_mem_brin_results