On resource management in Greenplum


Zhenghua Lyu

Nov 14, 2022, 9:23:52 AM
to gpdb...@greenplum.org
Hi,
    This is a summary post on resource management in Greenplum. If you have any questions, please discuss them in this thread.

    Short version:
  • we are going to remove resource queue in Greenplum 7: https://github.com/greenplum-db/gpdb/pull/14466
  • we are going to remove resource group's memory control model in Greenplum 7
  • we are going to implement a simple memory control model for resource group in Greenplum 7
  • we are going to support both cgroup v1 and cgroup v2 in Greenplum 7

    Long version:

    Greenplum 5 and Greenplum 6 contain two modules to do resource management:
  • Resource Queue
  • Resource Group
    For later discussion, we use the term management unit to refer to either a resource queue or a resource group.
    
    Some common features of resource management are:
  • concurrency control: the maximum number of sessions (transactions) a management unit can run at the same time
  • memory control: the max memory a query can use in the cluster
  • CPU: how much CPU resource can be used for a job
  • IO control
  • policy to bypass management: based on role, based on SQL (like plan shape, plan costs...)
    The remainder of this post discusses several core problems and design decisions in Greenplum.
    
    ## Concurrency
   There are three important decisions for concurrency limits:
  1. wait or quit: when the management unit is full, should newly arriving sessions simply be rejected, or put into some wait queue?
  2. where to put the check code?
  3. if we choose to wait, which mechanism should implement the process wait?

     The following compares these choices between resource queue and resource group:

     Resource Queue:
  • wait or quit: wait
  • where the check code is: at the start of the executor, after getting the plan — ResLockPortal() ← PortalStart()
  • mechanism to implement the wait: Extended Database Object Lock; high level, hacks the code a lot

     Resource Group:
  • wait or quit: wait
  • where the check code is: at the very beginning of the transaction start — AssignResGroupOnMaster() ← StartTransaction()
  • mechanism to implement the wait: WaitLatch(); very low level, near the OS

     Digression: locking stages in Greenplum. A SQL statement progresses as below; the words in parentheses are processing steps:
     SQL → (parse) → Syntax Tree → (semantic analyze) → Query Tree → (Optimize) → Plan → (Execute).
     Greenplum acquires locks (like table locks) during (semantic analyze) and (Optimize) and releases them at the end of the transaction.
   
    The logic behind the design of resource queue:

    There is a feature request to bypass based on plan shape or plan cost

     ==> we have to put the check code after getting the plan

         ==> we execute the check code while holding locks (table locks etc.)

            ==> the check code might put the current session into a wait

                 ==> a session may therefore block while holding some locks

                     ==> there is a possibility of deadlock

  
     
     The possibility of deadlock forces resource queue to implement the wait by extending the object lock, because only then
     can the Postgres deadlock detector break the deadlock when it happens.

    The painful part is that the extended object lock (the resource queue lock) contains a lot of hacky code that is hard to prove correct.
    We have fixed several such issues (some led to PANIC), and debugging them was painful.

    Due to the drawbacks of resource queue, resource group decided to get rid of the hacky extended object lock and chose an
    OS-level wait mechanism to implement the wait. The design logic of resource group's concurrency limit is as follows:
     resource queue's hacking of the object lock is painful

     ==> we need to choose a safe way: Latch (near the OS level)

         ==> the latch cannot be handled by the Postgres deadlock detector

            ==> we have to put the check code at the very beginning of the transaction (at that point, no locks should be held)

    
     The drawback is that it is not easy to bypass based on plan shape or cost (the SQL has not been processed yet at that point). The best we
     can do is to un-assign from the resource group after planning, because un-assigning never leads to waiting.
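
     To make the latch-based wait concrete, here is a minimal sketch of the idea. All names in it (ResGroupSlotAcquire, nRunning, ...) are illustrative and heavily simplified, not the actual Greenplum implementation:

```c
/*
 * Minimal sketch of a latch-based concurrency check at transaction start.
 * Illustrative only: the type, function, and field names are made up for
 * this sketch and are not the real Greenplum code.
 */
#include "postgres.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "storage/lwlock.h"

typedef struct MyResGroup
{
	int		concurrency;	/* configured CONCURRENCY limit */
	int		nRunning;		/* transactions currently holding a slot */
} MyResGroup;

static void
ResGroupSlotAcquire(MyResGroup *group, LWLock *lock)
{
	for (;;)
	{
		LWLockAcquire(lock, LW_EXCLUSIVE);
		if (group->nRunning < group->concurrency)
		{
			group->nRunning++;			/* got a slot, run the transaction */
			LWLockRelease(lock);
			return;
		}
		LWLockRelease(lock);

		/*
		 * No free slot: sleep on this backend's latch.  A backend that
		 * releases a slot wakes one waiter with SetLatch().  Because this
		 * runs at the very beginning of the transaction, no table locks are
		 * held yet, so the wait cannot participate in a lock deadlock.
		 */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_POSTMASTER_DEATH,
						 -1L, 0);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
}
```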


    ## Memory
    Digression: the concept of VmemTracker. This is a Greenplum-specific feature. The goal is to protect the database from the OS's OOM killer.
     The design is to keep a memory accounting book on each segment; if that segment's total used vmem grows beyond a threshold, the database
     cancels sessions itself. In short:
  • gp_vmem_protect_limit is the segment's vmem limit
  • VmemTracker only tracks memory-context allocations (not direct malloc; fortunately, direct malloc is rare in GPDB)
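
     A rough sketch of the accounting idea (the names, units, and granularity below are assumptions for illustration, not the real VmemTracker code):

```c
/*
 * Rough sketch of per-segment vmem accounting.  Illustrative only: in the
 * real implementation the counter lives in shared memory and the limit
 * comes from gp_vmem_protect_limit.
 */
#include <stdbool.h>
#include <stdint.h>

static int64_t segment_vmem_reserved;                           /* bytes reserved on this segment */
static int64_t segment_vmem_limit = 8192LL * 1024 * 1024;       /* e.g. 8GB */

/* Called from the memory-context allocator before each chunk is allocated. */
static bool
vmem_reserve(int64_t bytes)
{
	if (segment_vmem_reserved + bytes > segment_vmem_limit)
		return false;		/* caller raises an "out of vmem" error / cancels sessions */
	segment_vmem_reserved += bytes;
	return true;
}

/* Called when memory-context memory is freed. */
static void
vmem_release(int64_t bytes)
{
	segment_vmem_reserved -= bytes;
}
```
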
     Digression: the concept of query_mem. The PlannedStmt struct has a field query_mem, which measures how much
     memory this SQL statement may use on a single segment. Based on this value, Greenplum walks the plan tree to assign operator
     memory to each operator (just another name for a plan node here):
  • memory-intensive operators (plan nodes) like Sort and Hash get more memory (calculated by some algorithm);
  • non-memory-intensive operators like SeqScan get a fixed amount set by a GUC, currently 100KB by default.
    So query_mem is a value that only approximately measures how much memory this SQL statement can use on a single segment:
  • plan size is not considered: Greenplum dispatches the same whole plan to each QE
  • memory palloc-ed for run-time structures is not considered
  • 100KB is too rough: think of a simple SeqScan with very many filters; those structures consume considerable memory
    Anyway, we can roughly treat query_mem as the limit of memory a query can use on a segment (not accurate, but not bad).
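
     To make the operator-memory assignment concrete, here is a rough sketch of the idea; the even split among memory-intensive operators is a simplification for illustration, not Greenplum's actual algorithm:

```c
/*
 * Rough sketch: divide query_mem among plan nodes.  Non-memory-intensive
 * operators each get a small fixed amount (the 100KB GUC default mentioned
 * above); whatever remains is split among memory-intensive operators
 * (Sort, Hash, ...).  Illustrative only.
 */
#include <stdint.h>

#define NON_INTENSIVE_OP_MEM_KB 100

static void
assign_operator_mem(uint64_t query_mem_kb,
                    int n_intensive_ops, int n_other_ops,
                    uint64_t *intensive_op_mem_kb)
{
	uint64_t fixed_kb = (uint64_t) n_other_ops * NON_INTENSIVE_OP_MEM_KB;
	uint64_t remaining = (query_mem_kb > fixed_kb) ? query_mem_kb - fixed_kb : 0;

	/*
	 * Example: query_mem = 128MB (131072KB), 3 memory-intensive operators,
	 * 10 scans: remaining = 131072 - 1000 = 130072KB, each intensive
	 * operator gets about 43357KB.
	 */
	*intensive_op_mem_kb = (n_intensive_ops > 0) ? remaining / n_intensive_ops : 0;
}
```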

     

Memory limit:
  • Resource Queue: a simple configuration to control the query_mem of each SQL statement.
  • Resource Group: a very complex model to do precise control; hard to understand. Please refer to the documentation for details.

   ## CPU
   This part is based on the Linux cgroup CPU API. Please refer to the documentation. We are enhancing the database API; more is coming soon.
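
   For reference, the underlying OS interface looks roughly like the following under cgroup v2 (cgroup v1 uses cpu.shares and cpu.cfs_quota_us/cpu.cfs_period_us instead); the cgroup path and group name are assumptions for illustration, not GPDB's actual layout:

```c
/* Illustration of the cgroup v2 CPU interface: cap a group at 2 CPUs.
 * The path "/sys/fs/cgroup/gpdb/mygroup" is a hypothetical layout. */
#include <stdio.h>

int
main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/gpdb/mygroup/cpu.max", "w");
	if (f == NULL)
		return 1;
	/* cpu.max format: "<quota> <period>" in microseconds.
	 * 200000 100000 = at most 200ms of CPU time per 100ms period (2 CPUs). */
	fprintf(f, "200000 100000\n");
	return fclose(f) == 0 ? 0 : 1;
}
```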

   ## IO
   Linux cgroup provides the ability to control IOPS. The OS API needs a parameter that identifies the device, which makes it hard for an MPP database like Greenplum
    to build an abstraction layer for the database user. We are still trying to find a good way to design IO control in Greenplum.
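
    To illustrate why the device parameter is awkward: the cgroup v2 io.max interface is written per block device, identified by major:minor numbers, so the database would need to know which device each data directory lives on. The path and the 8:0 device below are made up for illustration:

```c
/* Illustration of cgroup v2 io.max: each limit names a specific block
 * device by major:minor (8:0 here is hypothetical), which is exactly the
 * per-device detail that is hard to abstract away in an MPP cluster. */
#include <stdio.h>

int
main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/gpdb/mygroup/io.max", "w");
	if (f == NULL)
		return 1;
	/* Limit device 8:0 to 1000 read IOPS and 1000 write IOPS. */
	fprintf(f, "8:0 riops=1000 wiops=1000\n");
	return fclose(f) == 0 ? 0 : 1;
}
```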



Best,
Zhenghua Lyu
  

Shine Zhang

Nov 15, 2022, 10:29:37 AM
to Greenplum Developers, zlyu
Thanks ZhengHua.

A few clarification questions.
- What's the relationship between the GUC statement_mem and the query_mem in the code?
- You mentioned supporting both cgroup v1 and v2. What's the complication of supporting both? Where is the control that decides whether GPDB is using v1 or v2?
- Any hint of what the new resource group table `gp_toolkit.gp_resgroup_config` will look like?

A few more scenarios I am looking at for the new resource group:
- You mentioned that concurrency can control how many queries can go through, and we also have a way to bypass the resource group. If a query bypasses the resource group, which resource group does this bypassed query take resources from? Does that mean the bypassed query takes resources from some `default_group`? In other words, how do we account for the resources (memory, CPU, IO, etc.) used by bypassed queries?
- In the simplified memory model, is there any way to understand the relationship between the actual physical memory usage and the memory limit specified by the resource group? In other words, if a query has a given statement_mem computed as 10GB, but it actually used just 5GB (over-estimated) or 15GB (under-estimated), where can I check that information? My goal is to have enough feedback so that I can adjust the resource group settings to reduce OOMs or increase memory utilization efficiency. One more related point is accounting for the OS `file read cache`, since it has a critical performance impact on the resource group. When we limit the memory in the new resource group, do we also limit the `file read cache` for that resource group?
- One more thing on waiting in the resource group. Do we wait at the level of the statement, the transaction, or the connection session? Here is the scenario: I have 1000 users, and my concurrency is 50; the idea is that I only allow 50 concurrent queries to run. If concurrency is controlled at the connection level, the first 50 users can come in and then do nothing, blocking the remaining 950 users from using the cluster. If concurrency control is at the transaction level, then after the first 50 transactions start (say with BEGIN) and the users neither commit/abort those transactions nor do anything else, the cluster will likewise do nothing. Ideally, concurrency control is at the `statement` level. If it is at the `statement` level, how do we make sure there is no `deadlock` scenario introduced by the resource group concurrency control? E.g. two statements in one transaction: the second statement cannot run due to the concurrency limit, but the first statement has already executed and holds locks on shared resources.

Let's discuss more.

Thanks,
Shine

zlyu

Nov 20, 2022, 6:49:39 PM
to Greenplum Developers, Shine Zhang, zlyu
Hi, 


> What's the relationship between the GUC statement_mem and the query_mem in the code?

Let's talk about GPDB 6's behavior.
Under resource queue mode:
  • if the resource queue is bypassed (e.g. for a superuser), then query_mem just falls back to statement_mem
  • or if the current resource queue does not configure a memory limit, then query_mem just falls back to statement_mem
Under resource group mode:
  • if the GUC memory_spill_ratio is 0, then query_mem just falls back to statement_mem
  • else, if the group's spill ratio is configured to be 0, then query_mem just falls back to statement_mem
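
A small sketch of these fallback rules (the parameter names are made up for illustration; they are not the actual fields or functions in the source tree):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch of the GPDB 6 fallback rules described above.
 * Values are in kilobytes; all names are hypothetical.
 */
static uint64_t
choose_query_mem(bool resource_queue_mode,
                 bool queue_bypassed, bool queue_has_mem_limit,
                 int memory_spill_ratio, int group_spill_ratio,
                 uint64_t statement_mem_kb, uint64_t derived_query_mem_kb)
{
	if (resource_queue_mode)
	{
		/* bypassed (e.g. superuser) or no memory limit configured on the queue */
		if (queue_bypassed || !queue_has_mem_limit)
			return statement_mem_kb;
	}
	else
	{
		/* GUC memory_spill_ratio is 0, or the group's spill ratio is 0 */
		if (memory_spill_ratio == 0 || group_spill_ratio == 0)
			return statement_mem_kb;
	}
	/* otherwise query_mem is derived from the queue's / group's memory settings */
	return derived_query_mem_kb;
}
```
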
> You mentioned supporting both cgroup v1 and v2. What's the complication of supporting both? Where is the control to decide whether GPDB is using v1 or v2?

Zhenglong has opened a PR to build an abstraction layer for resgroup's operations: https://github.com/greenplum-db/gpdb/pull/14343
There is a GUC to control which cgroup version is used.

> Any hint of what the new resource group table `gp_toolkit.gp_resgroup_config` will look like?

Will update more details later.

zlyu

Dec 22, 2022, 1:06:29 AM
to Greenplum Developers, zlyu, Shine Zhang
Hi, 

More on memory management for utilities (by Xuejing Zhao and me).

A PR has been opened to remove the current resgroup memory model: https://github.com/greenplum-db/gpdb/pull/14562
After it gets in, we'll start on the new memory control model.

- For queries, the new model will control memory usage using query_mem.
- For utilities:
   * create index: this creates a tuple store whose memory is controlled by `maintenance_work_mem`; previously, resgroup and resqueue did not touch it, but in the new model we will also try to set this value
   * analyze: this samples data and returns it in an array; no spill is supported, so there are no direct memory control parameters and resource control cannot work for it directly; just let vmem_protect_limit handle it
   * others: just let vmem_protect_limit handle them
