This is a summary post on resource management in Greenplum. If you have any questions, please discuss them in this thread. Greenplum provides two resource management mechanisms:
- Resource Queue
- Resource Group
In the discussion below, we use the term "management unit" to refer to either a resource queue or a resource group.
Some common features of resource management are:
- concurrency control: the maximum number of sessions (transactions) a management unit can run at the same time
- memory control: the maximum memory a query can use in the cluster
- CPU control: how much CPU a job is allowed to use
- IO control
- bypass policy: exempting certain work from management, based on role or on the SQL itself (plan shape, plan cost, ...)
The remainder of this post discusses several core problems and design decisions in Greenplum.
## Concurrency
There are three important decisions for a concurrency limit:
- wait or quit: when the management unit is full, what to do with newly arriving sessions — simply reject them, or put them in a wait queue
- where to put the check code
- if we choose to wait, which mechanism to use to make the process wait
The following table compares these choices between resource group and resource queue:
|  | resource queue | resource group |
| --- | --- | --- |
| wait or quit | wait | wait |
| where is the check code | At the start of the executor, after getting the plan: ResLockPortal() ← PortalStart() | At the very beginning of transaction start: AssignResGroupOnMaster() ← StartTransaction() |
| mechanism to implement the wait | Extended database object lock: high level, hacks the code a lot | WaitLatch(): very low level, near the OS |
Digression: locking stages in Greenplum. A SQL statement's processing proceeds as below; the words in parentheses are procedure names:

SQL → (parse) → Syntax Tree → (semantic analyze) → Query Tree → (optimize) → Plan → (execute)

Greenplum acquires locks (like table locks) during (semantic analyze) and (optimize) and releases them at the end of the transaction.
The logic behind the design of resource queue:

Feature request: bypass based on plan shape or plan cost
==> we have to put the check code after getting the plan
==> the check code runs while holding locks (table locks, etc.)
==> the check code might put the current session to wait
==> a session may therefore hang while holding locks
==> there is a possibility of deadlock
The possibility of deadlock forces resource queue to implement the wait by extending the object lock, because only then can the Postgres deadlock detector break a deadlock when it happens. The painful part is that the extended object lock (the resource queue lock) contains a lot of hacky code and is hard to prove correct; we have fixed several such issues (some led to PANIC), and the debugging was very painful. A minimal sketch of the idea follows.
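The sketch below is illustrative only (not the actual Greenplum code; the locktag layout here is an assumption): represent the queue as a lock-manager object and acquire it via LockAcquire(), so a blocked session sleeps inside the normal lock wait path, where the deadlock detector can see it.

```c
#include "postgres.h"
#include "storage/lock.h"

/*
 * Sketch: model a resource queue "slot" as a heavyweight lock.  Real
 * resource queues count multiple slots under the lock; this simplified
 * version admits one session at a time.  Because the wait happens inside
 * LockAcquire(), the Postgres deadlock detector treats it like any other
 * lock wait and can abort a session to break a cycle.
 */
static void
ResQueueAcquireSlot(Oid queueId)
{
    LOCKTAG tag;

    /* Tag the lock with the queue's OID (illustrative field layout). */
    memset(&tag, 0, sizeof(tag));
    tag.locktag_field1 = queueId;
    tag.locktag_type = LOCKTAG_OBJECT;          /* stand-in locktag type */
    tag.locktag_lockmethodid = DEFAULT_LOCKMETHOD;

    /*
     * dontWait = false: if no slot is free, sleep in the lock manager's
     * wait queue, where deadlock detection applies.
     */
    (void) LockAcquire(&tag, ExclusiveLock, false, false);
}
```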
Due to the drawbacks of resource queue, resource group decided to get rid of the hacky extended object lock and chose an OS-level wait mechanism instead. The logic behind the design of resource group's concurrency limit is as follows:
Resource queue's hacky object lock is painful
==> we need to choose a safe way: Latch (near the OS level)
==> the latch cannot be handled by the Postgres deadlock detector
==> we have to put the check code at the very beginning of a transaction (at that point, no locks should be held)
The drawback is that it is not easy to bypass based on plan shape or cost (the SQL has not been processed yet). The best we can do is to un-assign the session from the resource group after planning, because un-assigning never leads to waiting. A minimal sketch of the latch-based wait follows.
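The shared-state names below are assumptions; this is not the actual resource group code. The shape of the mechanism: check for a free slot under an LWLock, and if none is free, sleep on the process latch until a finishing session wakes us up.

```c
#include "postgres.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
#include "storage/lwlock.h"

/* Assumed shared state for one resource group (illustrative). */
typedef struct ResGroupShared
{
    int nRunning;       /* sessions currently admitted */
    int concurrency;    /* configured limit */
} ResGroupShared;

/*
 * Sketch: wait for a free concurrency slot using WaitLatch().  The
 * waiting process holds no heavyweight locks here (we are at the very
 * beginning of the transaction), so it is acceptable that the deadlock
 * detector cannot see this wait.
 */
static void
ResGroupWaitForSlot(ResGroupShared *group, LWLock *groupLock)
{
    for (;;)
    {
        LWLockAcquire(groupLock, LW_EXCLUSIVE);
        if (group->nRunning < group->concurrency)
        {
            group->nRunning++;          /* admitted */
            LWLockRelease(groupLock);
            return;
        }
        LWLockRelease(groupLock);

        /* Sleep until a finishing session sets our latch, or time out. */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         1000L /* ms */,
                         WAIT_EVENT_PG_SLEEP);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();         /* allow query cancel here */
    }
}
```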
## Memory
Digression: the concept of VmemTracker. This is a Greenplum-specific feature. The goal is to protect the database from being hit by the OS's OOM killer. The design is: keep a memory account book on each segment, and if the segment's total used vmem grows larger than some threshold, let the database itself cancel sessions. In short:
- gp_vmem_protect_limit is the vmem limit of a segment
- VmemTracker only tracks memory-context allocations (no direct malloc; fortunately, direct malloc is rare in GPDB)

A simplified sketch of this bookkeeping follows.
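The names below are illustrative, not the real VmemTracker code: every memory-context allocation reserves bytes against a per-segment counter in shared memory, and the reservation errors out once the limit is exceeded, so the database cancels the session before the OS OOM killer steps in.

```c
#include "postgres.h"
#include "port/atomics.h"

/* Per-segment counter in shared memory (illustrative). */
static pg_atomic_uint64 *segmentVmemReserved;

/* Stand-in for gp_vmem_protect_limit, converted to bytes. */
static uint64 vmemProtectLimitBytes;

/*
 * Sketch: called from the memory-context allocator before each chunk is
 * handed out.  Direct malloc() never passes through here, which is why
 * VmemTracker only accounts for memory-context memory.
 */
static void
VmemReserve(uint64 bytes)
{
    uint64 newTotal = pg_atomic_add_fetch_u64(segmentVmemReserved, bytes);

    if (newTotal > vmemProtectLimitBytes)
    {
        /* Undo the reservation and cancel this query. */
        pg_atomic_sub_fetch_u64(segmentVmemReserved, bytes);
        ereport(ERROR,
                (errcode(ERRCODE_OUT_OF_MEMORY),
                 errmsg("segment vmem protect limit exceeded")));
    }
}
```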
Digression: the concept of query_mem. The PlannedStmt struct has a field query_mem, a value that measures how much memory this SQL statement can use on a single segment. Based on this value, Greenplum walks the plan tree to assign operator memory to each operator (just another name for a plan node here):
- memory-intensive operators (plan nodes), like Sort and Hash, get more memory (calculated by some algorithm);
- non-memory-intensive operators (plan nodes), like SeqScan, get a fixed amount set by a GUC, currently 100KB by default.
So query_mem only approximately measures how much memory a SQL statement can use on a single segment:
- plan size is not considered: Greenplum dispatches the same whole plan to each QE
- memory palloc-ed for run-time structures is not considered
- 100KB is too rough: think of a simple SeqScan with many, many filters; those structs consume considerable memory

Anyway, we can roughly treat query_mem as the limit on the memory a query can use on a segment (not accurate, but not bad). A sketch of the operator-memory assignment follows.
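This is a minimal sketch of the assignment scheme described above (hypothetical plan-node type; the real Greenplum algorithm is more involved): give every non-memory-intensive operator the fixed 100KB, then split the remainder of query_mem evenly among the memory-intensive operators.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, simplified plan node. */
typedef struct PlanNode
{
    bool     memIntensive;      /* e.g. Sort, Hash */
    uint64_t operatorMemKB;     /* output of the assignment */
    struct PlanNode *left;
    struct PlanNode *right;
} PlanNode;

#define NON_INTENSIVE_MEM_KB 100    /* the "100KB" GUC default */

static int
countIntensive(PlanNode *node)
{
    if (node == NULL)
        return 0;
    return (node->memIntensive ? 1 : 0) +
           countIntensive(node->left) + countIntensive(node->right);
}

static int
countNodes(PlanNode *node)
{
    if (node == NULL)
        return 0;
    return 1 + countNodes(node->left) + countNodes(node->right);
}

static void
assign(PlanNode *node, uint64_t perIntensiveKB)
{
    if (node == NULL)
        return;
    node->operatorMemKB = node->memIntensive ? perIntensiveKB
                                             : NON_INTENSIVE_MEM_KB;
    assign(node->left, perIntensiveKB);
    assign(node->right, perIntensiveKB);
}

/* Split query_mem (KB, per segment) across the operators of a plan. */
void
AssignOperatorMemory(PlanNode *root, uint64_t queryMemKB)
{
    int      nIntensive = countIntensive(root);
    int      nOther     = countNodes(root) - nIntensive;
    uint64_t fixedKB    = (uint64_t) nOther * NON_INTENSIVE_MEM_KB;
    uint64_t leftKB     = queryMemKB > fixedKB ? queryMemKB - fixedKB : 0;

    assign(root, nIntensive > 0 ? leftKB / nIntensive : 0);
}
```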
|  | Resource Queue | Resource Group |
| --- | --- | --- |
| Memory Limit | A simple configuration to control the query_mem of each SQL | A very complex model for precise control, hard to understand; please refer to the docs for details |
## CPU
This part is based on the Linux cgroup CPU API; please refer to the docs. We are enhancing the database-facing API, coming soon. A rough illustration of the underlying cgroup writes follows.
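The sketch assumes a cgroup v1 layout and a hypothetical cgroup path (not Greenplum's actual file layout): capping a group's CPU means writing the CFS bandwidth files under the cpu controller.

```c
#include <stdio.h>

/*
 * Sketch: cap a cgroup (v1) with the CFS bandwidth controller.
 * quota/period = 200000/100000 means 2 CPU cores.  The cgroup path
 * below is an assumption for illustration.
 */
static int
set_cpu_limit(const char *cgroup, long quota_us, long period_us)
{
    char path[512];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/cpu/%s/cpu.cfs_period_us", cgroup);
    if ((f = fopen(path, "w")) == NULL)
        return -1;
    fprintf(f, "%ld", period_us);
    fclose(f);

    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/cpu/%s/cpu.cfs_quota_us", cgroup);
    if ((f = fopen(path, "w")) == NULL)
        return -1;
    fprintf(f, "%ld", quota_us);
    fclose(f);
    return 0;
}

int
main(void)
{
    /* e.g. limit hypothetical group "gpdb/rg_sales" to 2 cores */
    return set_cpu_limit("gpdb/rg_sales", 200000, 100000);
}
```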
## IO
Linux cgroup provides the ability to control IOPS, but the OS API needs a parameter identifying the device. This makes it hard for an MPP database like Greenplum to build an abstraction layer for database users. We are still trying to find a good way to design IO control in Greenplum. The sketch below shows why the device parameter is awkward.
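For concreteness, this is what the device-specific part looks like with the cgroup v1 blkio throttler (device 8:0 and the cgroup path are assumptions): every limit must name a major:minor device, which is exactly the parameter that is hard to abstract away across an MPP cluster's segment hosts.

```c
#include <stdio.h>

/*
 * Sketch: throttle read IOPS for one block device in a cgroup (v1).
 * The blkio.throttle.read_iops_device file takes "major:minor iops";
 * 8:0 (typically /dev/sda) and the cgroup path are assumptions.
 */
static int
set_read_iops(const char *cgroup, int major, int minor, long iops)
{
    char path[512];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/blkio/%s/blkio.throttle.read_iops_device",
             cgroup);
    if ((f = fopen(path, "w")) == NULL)
        return -1;
    fprintf(f, "%d:%d %ld", major, minor, iops);
    fclose(f);
    return 0;
}

int
main(void)
{
    /* limit hypothetical group "gpdb/rg_sales" to 1000 read IOPS on 8:0 */
    return set_read_iops("gpdb/rg_sales", 8, 0, 1000);
}
```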
Best,
Zhenghua Lyu