IO limit in resgroup


Rong Du

Jan 3, 2023, 10:28:49 PM
to gpdb...@greenplum.org

Motivation

When the system is under heavy workload and processes start to compete for IO bandwidth, important queries may be processed very slowly. If there were a way to limit the IO bandwidth of unimportant queries and leave the remaining bandwidth to the important ones, we could avoid this situation.

Cgroup’s IO control

V1

Using the blkio controller in cgroup v1 is straightforward. Here is an example that limits the write bps for a specific block device:

  • mount blkio controller:
> mount -t cgroup -oblkio none /opt/blkio
  • test write bps without cgroup limit:
> dd if=/dev/zero of=/opt/zerofile bs=4k count=1024
1024+0 records in  
1024+0 records out  
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.00466194 s, 900 MB/s
  • configure bps for /dev/sda(major 8, minor 0):
> ls -al /dev/sda
brw-rw---- 1 root disk 8, 0 Aug 17 04:14 /dev/sda

> echo "8:0  1048576" > /opt/blkio/blkio.throttle.write_bps_device
  • add current bash process to this cgroup:
> echo $$ > /opt/blkio/cgroup.procs
  • test write bps with the cgroup limit (1 MB/s):
> dd if=/dev/zero of=/opt/zerofile bs=4k count=1024 oflag=direct
1024+0 records in  
1024+0 records out  
4194304 bytes (4.2 MB, 4.0 MiB) copied, 3.90652 s, 1.1 MB/s

To use blkio, you must specify which block device needs to be limited, and only reads and writes issued with O_DIRECT can be throttled.
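As a quick check of the O_DIRECT caveat, a buffered write from the same cgroup is not throttled, because the data only lands in the page cache (a sketch reusing the 1 MB/s limit configured above):

> dd if=/dev/zero of=/opt/zerofile bs=4k count=1024
# finishes almost instantly; the blkio.throttle.write_bps_device limit does not apply to buffered writes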

blkio in cgroup v1 also supports hierarchies, but this is not recommended. For more information, see: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/blkio-controller.html#hierarchical-cgroups

V2

The io controller works differently from the blkio controller of v1. For example, to limit write bps with the io controller:

  • enable the io controller for child nodes of the root:
> echo "+io" > /opt/root/cgroup.subtree_control
  • create a leaf node for the test:
> mkdir /opt/root/io
  • put the current shell process into the new cgroup:
> echo $$ > /opt/root/io/cgroup.procs
  • configure the limit:
> echo "8:0 wbps=1048576" > /opt/root/io/io.max
  • test:
> dd if=/dev/zero of=/opt/zerofile bs=4k count=1024 oflag=direct
1024+0 records in  
1024+0 records out  
4194304 bytes (4.2 MB, 4.0 MiB) copied, 3.90907 s, 1.1 MB/s

Writeback

If the disk is under heavy workload and we want to limit the IO speed of new processes, we can create a cgroup, set a wbps limit with the io controller, and put those processes into the cgroup. However, as the tests above show, when we write to disk without the direct flag, the data first goes into the page cache; a different process in another cgroup then flushes the page cache to disk, and our limit is lost there.

Cgroup v2 solves this problem: it uses the memory controller together with the io controller to throttle the writeback speed. For example, we create a cgroup, limit its wbps to 1 MB/s, and then use dd to test the speed.
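For the writeback throttling below to work, both the memory and the io controller have to be enabled for the test cgroup; a minimal sketch of that setup (the paths follow the earlier v2 example and are assumptions), after which the buffered dd below runs from inside the new cgroup:

> echo "+io +memory" > /opt/root/cgroup.subtree_control
> mkdir /opt/root/wb
> echo "8:0 wbps=1048576" > /opt/root/wb/io.max
> echo $$ > /opt/root/wb/cgroup.procs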

> dd if=/dev/zero of=/opt/zerofile bs=60M count=1
1+0 records in
1+0 records out
62914560 bytes (63 MB, 60 MiB) copied, 0.0425542 s, 1.5 GB/s

The dd command completed quickly because it only wrote the data to the page cache. But let's look at the actual IO speed:

> iostat -p sda -d 1
# update information every one second
Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sda               2.00         0.00      1024.00         0.00          0       1024          0

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sda               2.00         0.00      1024.00         0.00          0       1024          0

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sda               2.00         0.00      1040.00         0.00          0       1040          0

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sda               2.00         0.00      1044.00         0.00          0       1044          0

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sda               4.00         0.00      1076.00         0.00          0       1076          0

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sda               2.00         0.00      1036.00         0.00          0       1036          0

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sda               2.00         0.00      1044.00         0.00          0       1044          0

As you can see, the actual write speed to disk really is limited to about 1 MB/s!

Summary

GPDB can use the cgroup IO limits to throttle the IO speed of queries, but only with cgroup v2, because GPDB does buffered IO through the page cache. So on systems that only enable cgroup v1, IO control is unavailable.

Greenplum IO experiment

Test for Heap

create table

create table test_heap(a int,b int);

copy data(around 5GB)

COPY test_heap FROM '/home/mt/c';

result(using iotop)
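The per-process numbers below (and in the AO test) were captured with iotop in batch mode, roughly like this (exact flags are an assumption):

> iotop -obt -d 1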

The read requests come entirely from the QD:

03:13:06    4185 be/4 mt         19.51 M/s    0.00 B/s ?unavailable?  postgres:  5432, mt demo [local] con7 cmd14 COPY

The write requests come from QE and walwriter:

03:13:07    4196 be/4 mt          0.00 B/s   79.58 M/s ?unavailable?  postgres:  6000, mt demo 10.117.190.180(52630) con7 seg0 cmd14 MPPEXEC UTILITY
03:13:07    4195 be/4 mt          0.00 B/s   79.51 M/s ?unavailable?  postgres:  6001, mt demo 10.117.190.180(36408) con7 seg1 cmd14 MPPEXEC UTILITY
03:13:07    4040 be/4 mt          0.00 B/s    6.23 M/s ?unavailable?  postgres:  6001, walwriter
03:13:07    4034 be/4 mt          0.00 B/s    6.20 M/s ?unavailable?  postgres:  6000, walwriter

Test for AO

create table

create table test_ao(a int, b int) with (appendonly=true);

copy data(around 5GB)

COPY test_ao FROM '/home/mt/c';

result(using iotop)

The read requests come from QD and QE:

03:18:09    4196 be/4 mt        156.01 K/s   18.51 M/s ?unavailable?  postgres:  6000, mt demo 10.117.190.180(52630) con7 seg0 cmd15 MPPEXEC UTILITY
03:18:09    4195 be/4 mt        156.01 K/s   17.19 M/s ?unavailable?  postgres:  6001, mt demo 10.117.190.180(36408) con7 seg1 cmd15 MPPEXEC UTILITY
03:18:09    4185 be/4 mt         12.90 M/s    0.00 B/s ?unavailable?  postgres:  5432, mt demo [local] con7 cmd15 COPY

But around 99.9% of the read requests come from the QD.

The write requests come from QE and walwriter:

03:18:09    4196 be/4 mt        156.01 K/s   18.51 M/s ?unavailable?  postgres:  6000, mt demo 10.117.190.180(52630) con7 seg0 cmd15 MPPEXEC UTILITY
03:18:09    4195 be/4 mt        156.01 K/s   17.19 M/s ?unavailable?  postgres:  6001, mt demo 10.117.190.180(36408) con7 seg1 cmd15 MPPEXEC UTILITY
03:18:09    4040 be/4 mt          0.00 B/s    2.44 M/s ?unavailable?  postgres:  6001, walwriter
03:18:09    4034 be/4 mt          0.00 B/s  676.05 K/s ?unavailable?  postgres:  6000, walwriter

Summary

From the results of the above tests, we can see that most IO requests are issued by the QD and QE processes, so we can limit IO even per query (using cgroup v2).

Resgroup IO control implementation

User API

create resource group rg with (io_read_limit=1024, io_write_limit=-1, io_read_iops=100, io_write_iops=100);

io_read_limit: limits the read speed (MB/s) from disk for every process in the resource group

io_write_limit: limits the write speed (MB/s) to disk for every process in the resource group

io_read_iops: limits the read IOPS from disk for every process in the resource group

io_write_iops: limits the write IOPS to disk for every process in the resource group
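For illustration, on each segment host these settings would end up as one line per block device in the group's io.max file; a rough sketch of what would be written for the group above (the cgroup path, the <group oid> placeholder and the MB-to-bytes conversion are assumptions about the eventual implementation):

# io_read_limit=1024 -> rbps=1073741824, io_write_limit=-1  -> wbps=max
# io_read_iops=100   -> riops=100,       io_write_iops=100  -> wiops=100
> echo "8:0 rbps=1073741824 wbps=max riops=100 wiops=100" > /sys/fs/cgroup/gpdb/<group oid>/io.max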

Implementation

As shown above, the device id is needed when configuring the io controller. GPDB is a cluster application, and we cannot assume every host has the same disk configuration, so we need an easy way to configure device ids.

A query may involve many tables, and the data files of those tables may be spread across many disks. The cgroup io controller is configured by writing each block device id, together with the limit for that device, to the configuration file. For example, in cgroup v2, writing 8:0 wbps=102400 rbps=102400 to io.max (the IO limit configuration file) means that when processes in this cgroup read from or write to the 8:0 disk (you can see block device ids with lsblk), the upper limit for the write and read speed is 102400 bytes per second.

So the biggest problem is how to find the block device ids and write them to the cgroup io configuration. There are a couple of options:

  • When a query is executed, GPDB finds the disks that the query will use. This is difficult and inefficient.
  • Find all block device ids and write them to the cgroup configuration file when the resource group is created or initialized.

The second solution is clearly better: it is simpler logically and easier to implement than the first. There are many ways to find the device ids of all block devices, and we can even detect hotplug events, using udev, a facility in the Linux kernel (since Linux 2.6). One simple approach:
Use a udev rule to monitor block devices: when udev finds a block device, it writes the device id to a specific file, for example /tmp/gpdb-diskinfo. GPDB then uses the Linux filesystem monitoring API (inotify) to watch /tmp/gpdb-diskinfo and pick up the disk info. GPDB can do this in a bgworker, which writes the device ids of the disks to the cgroup configuration as they arrive.
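A rough sketch of such a udev rule (the rules-file name and the output format are assumptions, not part of the proposal):

# /etc/udev/rules.d/99-gpdb-diskinfo.rules
ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", RUN+="/bin/sh -c 'echo %M:%m >> /tmp/gpdb-diskinfo'"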

Examples

CREATE RESOURCE GROUP rgroup1 WITH (CONCURRENCY=5, IO_READ_LIMIT=1024, IO_WRITE_LIMIT=-1, IO_READ_IOPS=100, IO_WRITE_IOPS=100);
ALTER ROLE bill RESOURCE GROUP rgroup1;

Then all queries run by role bill will have their IO speed and IOPS limited.


Authored By: Rong Du, Zhenghua Lyu

Haolin Wang

Jan 4, 2023, 4:34:04 AM
to Greenplum Developers, Rong Du
> If there were a way to limit the IO bandwidth of unimportant queries and leave the remaining bandwidth to the important ones, we could avoid this situation.
So it is intended to limit I/O based on queries, but the following cgroup configuration shows that the controller limits I/O per group based on the device, right? For example, if there are different
queries on a single segment which issue I/O in the same data directory, and thus on the same device, how can we limit the I/O of those individual queries on this single segment?

> V2 uses the memory controller together with the io controller to throttle the writeback speed.
The behavior seems to be that the I/O speed is limited as seen from iostat, but the write finishes much faster from the application's point of view (did you also check the iotop output? Is the I/O speed similar to dd's?).
It seems most of the data was in the OS page cache, right? If so, the page cache could run short and still impact the critical queries. Do we have any approach to control this properly?

> When a query is executed, GPDB finds the disks that the query will use. This is difficult and inefficient.
I feel this option is more straightforward for GP, if I understand correctly. For example, we should be able to know the datadir where the query happens; given the datadir, we can get the
device id directly with the udevadm command:
# udevadm info -d /home/gpadmin/workspace/gpdb/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0
8:17
#

Soumyadeep Chakraborty

Jan 4, 2023, 6:08:28 PM
to Rong Du, gpdb...@greenplum.org
Hello,

> Summary
> From the results of the above tests, we can see that most IO requests are
> issued by the QD and QE processes, so we can limit IO even per query
> (using cgroup v2).

For completeness, apart from QD and QE backends there are other
processes that perform significant IO:

1. High write activity can be common for the checkpointer, especially if
there are a lot of dirty buffers since the last checkpoint.

2. Another worthwhile mention is the walreceiver for mirrors (which writes
WAL into the pg_wal directory).

3. The startup process on mirrors which performs WAL replay can issue a
lot of reads (reading the WAL and data files) and writes (for data files).

Do we have to keep these in mind when we design IO limiting? The goal
should be that the above processes should get an appropriate share of
the I/O bandwidth. Throttling checkpoints or WAL streaming can have bad
consequences such as WAL buildup, write amplification etc. How can we
ensure that these processes will remain unaffected?

> User API
> create resource group rg with (io_read_limit=1024, io_write_limit=-1, io_read_iops=100, io_write_iops=100);

First up, I think we need to address how tablespaces will interact with
this feature. The tablespace feature allows for storage heterogeneity -
As an example one may have a faster disk for frequently accessed tables
and a much slower disk for cold data. A global read/write limit, as
specified above, may lead to under/over-utilization of device bandwidth
for a given tablespace.

We also have to consider the special notion of temp_tablespaces, which
is where spill files go.

So I think we should approach the problem as "for a given tablespace
what is the read/write limit that we want to assign?"

Now, if a user uses the general syntax above, we could assign the same
limit values for all existing tablespaces (including the default). When
a new tablespace is added, we could also inherit the settings from that
of the default tablespace. We could also make it possible to alter a
tablespace's IO setting with ALTER TABLESPACE or even something like:
ALTER RESOURCE GROUP FOR TABLESPACE.

All of this involves a new catalog table and possible changes to existing
catalogs, which is a bit of a hurdle given how simple
pg_resgroupcapability is today. Off the top of my head, we could have:

pg_resgroup_io_setting(oid, tablespaceid, resgroupid, io_read_limit, ...)

and then opt not to store anything in pg_resgroupcapability for IO
related settings, as a one-off case.

Of course, we could keep things simple and not consider tablespaces. It
depends on how many users use tablespaces to get heterogeneous storage
(as opposed to tablespaces simply being used for).
Environments such as GBB definitely make use of tablespaces, and usage
of temp_tablespaces has been observed.

Thoughts?

> The second solution is clearly better: it is simpler logically and easier to
> implement than the first. There are many ways to find the device ids of all
> block devices, and we can even detect hotplug events, using udev, a facility
> in the Linux kernel (since Linux 2.6). One simple approach:
> Use a udev rule to monitor block devices: when udev finds a block device, it
> writes the device id to a specific file, for example /tmp/gpdb-diskinfo.
> GPDB then uses the Linux filesystem monitoring API (inotify) to watch
> /tmp/gpdb-diskinfo and pick up the disk info. GPDB can do this in a
> bgworker, which writes the device ids of the disks to the cgroup
> configuration as they arrive.

Yes the second solution is better as this activity needs to happen only
once. Doing it per query/command on the QEs will not only add per-query
overhead, it will add complexity. For eg. if we have a query that joins
5 tables, with each table in its own tablespace, then we will have to
call udevadm (alternatively just the stat syscall or its PG wrapper for
efficiency) to find the device id for each and every tablespace, as
each tablespace could be backed by a different disk.

However, I do dislike the idea of introducing a continuously running
background worker for device attach events, since they will rarely
happen. First, let's examine when we actually even need to rewrite the
cgroup io config:

1. If someone changes the device backing a datadir. This is an offline
activity. So, we would need to rewrite the config on server restart.

We could, at startup, launch a background worker which can call stat(),
figure out the device (for default and non-default tablespaces), read
the setting from the catalog and then rewrite the config in the cgroup
(a one-line sketch of this lookup follows after item 2). The worker can
then immediately die.

2. If someone adds a new tablespace backed by a new device. This is an
online activity.

We can similarly read the catalog and update the cgroup file, as part of
the CREATE TABLESPACE workflow. If we want to support tablespace specific
IO config, then we can adopt some of the strategies discussed above.
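For reference, the lookup itself is cheap; a sketch of resolving the device id for a data directory or tablespace location (the path is the demo datadir from Haolin's udevadm example upthread; from C, stat()'s st_dev carries the same information):

DATADIR=/home/gpadmin/workspace/gpdb/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0
lsblk -no MAJ:MIN "$(df --output=source "$DATADIR" | tail -n 1)"   # prints 8:17 in that demo env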

Some other questions I had:

1. Why are we not considering the io.latency [1] method? A quick skim
made me feel that it is better, since it is "work conserving". With our
current approach, if the cgroup limits don't add up to the full
available device bandwidth, will it lead to under-utilization? Will
io.latency help prevent that?

Maybe it can also be useful to ensure that critical processes like the
checkpointer etc are not throttled (by setting io.latency super low)?
Can that be achieved in some other fashion?

Can io.latency be used in conjunction with io.max, instead of as an
alternative?

The other advantage of having io.latency configured, is that one can
monitor the avg latency for a cgroup, which is a fantastic metric for
observability [2].

2. Monitoring:
Are we planning to introduce views to monitor the IO performance of the
resource groups?


[1] https://www.kernel.org/doc/html/v5.10/admin-guide/cgroup-v2.html#io-latency
[2] https://facebookmicrosites.github.io/cgroup2/docs/io-controller.html#interface

Regards,
Soumyadeep (VMware)

Soumyadeep Chakraborty

Jan 4, 2023, 6:20:34 PM
to Rong Du, gpdb...@greenplum.org
> Of course, we could keep things simple and not consider tablespaces. It
> depends on how many users use tablespaces to get heterogeneous storage
> (as opposed to tablespaces simply being used for).

To correct myself, I meant "not consider tablespace specific IO config"
by "not consider tablespaces". We have to support tablespaces regardless,
i.e. we have to write the tablespace's device config to the cgroup. Now,
whether we write a specific one or a common one for all depends on our
decision.

Also "(as opposed to tablespaces simply being used for)"
->
"(as opposed to tablespaces simply being used for disk space management)"

Regards,
Soumyadeep (VMware)

Ashwin Agrawal

Jan 4, 2023, 6:29:40 PM
to Soumyadeep Chakraborty, Rong Du, gpdb...@greenplum.org
Tablespaces are mostly used (at least in Greenplum) for heterogeneous
storage and rarely for disk space management. Disk space management is
mostly done via adding more storage to existing tablespaces or via
gpexpand adding more nodes. So, supporting to configure different IO
config profiles based on tablespace has to be considered as a goal
while designing this feature.

--
Ashwin Agrawal (VMware)

Ashwin Agrawal

Jan 4, 2023, 7:25:29 PM
to Rong Du, gpdb...@greenplum.org
I am having a hard time understanding this aspect. Please can you
elaborate more on how the kernel accomplishes it? Once data gets to
the pagecache it becomes part of the global pool (some pages could be
written by a process with a wbps of 1MB/s and some pages could be
written to the pagecache under a 2MB/s config). In the context of
Greenplum (and PostgreSQL), backends may write to shared-buffers and
the pagecache. It will be the checkpointer process that finally
flushes the pages to disk, so I am guessing in that case the
checkpointer's write limit would be enforced for IO; the per-query
write limit will not come into play, as the backend is not performing
any write IO.

Kind of on the lines of writeback, I am thinking, doesn't IO control
hurt in WAL flush scenarios? During XLogFlush() by backend A (short
single row I/U/D running transaction), it has to flush the WAL upto
certain point. The WAL buffers required to be flushed upto that point
could be from backends X, Y, Z (bulk load transactions which won't
call WAL flush before A). If we consider backend A has a very low
write limit configured then it would slow-down WAL flush as backend A
is not only flushing its own WAL buffers but also of other backends.
(Similar situation can happen for shared-buffers too as well where a
read-only backend needs to evict some other backends buffer)


--
Ashwin Agrawal (VMware)

Rong Du

Mar 29, 2023, 7:45:01 AM
to gpdb...@greenplum.org
Hello, after a long time, we have updates on this topic:

After discussion, IO limiting clearly has value in GPDB, for example to limit the IO bandwidth usage of queries run under certain roles.

Some details:
1. Using the cgroup block io controller is easy: just create a directory under a specific path and then write your limits to the corresponding files.
2. With several experiments (single host, multiple hosts and across disks), we confirmed that QEs/QDs issue most of the IO requests when heavy queries are running.
3. The cgroup block io controller needs a limit written for every disk you want to control; for simplicity and efficiency, the current proposal just writes all disks of the host into the limit configuration.
4. A resource group interface, such as `CREATE RESOURCE GROUP test WITH (IO_WRITE_HARD_LIMIT=1000)`, and its implementation.

But there are also some points that need more thought:
1. The current proposal only limits QEs/QDs; when mirrors and standbys issue many IO requests, will the whole system be affected?
2. When the disks are idle, queries are still limited to the configured bandwidth, which looks like a waste of disk bandwidth.
3. Is there a way to give queries a priority instead of a hard limit?
4. How about using io.latency?

For more detailed information, check the attachment.

cgroup block io.pdf

Rong Du

Apr 10, 2023, 4:41:04 AM
to gpdb...@greenplum.org
Hello, after some thinking, there are some new updates:
1. Mirrors and standbys are located on remote hosts; master/primary to standby/mirror communication uses network IO, and we only limit the disk IO of QEs/QDs on the local host, so we do not need to consider mirrors and standbys.

2. Actually, in customer usage, one role typically uses just one tablespace, and one role can only bind to one resource group (that is, one cgroup). So the resource group to tablespace mapping is always 1:1, and configuring the IO max limit per tablespace is complex and seems unnecessary.

3. We can use io.latency as a "priority", but we need to adjust the resource group hierarchy; the following tests are simple experiments with io.latency.
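For reference, io.latency is set per device on each cgroup; the configs below were applied roughly like this (the 8:0 device and the cgroup mount path are assumptions):

```
echo "8:0 target=1"  > /sys/fs/cgroup/test_cg/low/io.latency
echo "8:0 target=5"  > /sys/fs/cgroup/test_cg/mid/io.latency
echo "8:0 target=10" > /sys/fs/cgroup/test_cg/high/io.latency
```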

Testing on io.latency:

1. Test 1

config:
```
test_cg
    \-> low (not set io.latency)
    \-> mid (not set io.latency)
    \-> high (not set io.latency)
    \-> none (not set io.latency)
```

command (run simultaneously in every cgroup):
`dd if=/dev/zero of=/data/zerofile bs=1M count=10000`

result:
```
test_cg
    \-> low (613 MB/s)
    \-> mid (674 MB/s)
    \-> high (659 MB/s)
    \-> none (634 MB/s)
```


2. Test 2

config:
```
test_cg
    \-> low (io.latency=1)
    \-> mid (io.latency=5)
    \-> high (io.latency=10)
    \-> none (not set io.latency)
```

command (run simultaneously in every cgroup):
`dd if=/dev/zero of=/data/zerofile bs=1M count=10000`

result:
```
test_cg
    \-> low (1.4 GB/s)
    \-> mid (724 MB/s)
    \-> high (700 MB/s)
    \-> none (771 MB/s)
```

3. Test 3

config:
```
test_cg
    \-> low (io.latency=1)
    \-> mid (io.latency=1)
    \-> high (io.latency=1)
    \-> none (not set io.latency)
```

command (run simultaneously in every cgroup):
`dd if=/dev/zero of=/data/zerofile bs=1M count=10000`

result:
```
test_cg
    \-> low (902 MB/s)
    \-> mid (862 MB/s)
    \-> high (824 MB/s)
    \-> none (525 MB/s)
```

problem:
The group without a latency limit (the `none` group above) actually seems to get delayed.


4. Test 4

config:
```
test_cg
    \-> low (io.latency=1)
    \-> mid (io.latency=1)
    \-> high (io.latency=1)
    
test_none
    \-> none (not set io.latency)
```

command (run simultaneously in every cgroup):
`dd if=/dev/zero of=/data/zerofile bs=1M count=10000`

result:
```
test_cg
    \-> low (614 MB/s)
    \-> mid (681 MB/s)
    \-> high (666 MB/s)
    
test_none
    \-> none (647 MB/s)
```


5. Test 5

config:
```
test_cg
    \-> low (io.latency=1)
    \-> mid (io.latency=5)
    \-> high (io.latency=10)
    
test_none
    \-> none (not set io.latency)
```

command (run simultaneously in every cgroup):
`dd if=/dev/zero of=/data/zerofile bs=1M count=10000`

result:
```
test_cg
    \-> low (875 MB/s)
    \-> mid (535 MB/s)
    \-> high (523 MB/s)
    
test_none
    \-> none (931 MB/s)
```

So we need to put the QE/QD groups into a separate subtree under the `gpdb` cgroup, like:
```
gpdb
    \-> system
    \-> admin
    \-> QE/QD groups
        \-> user created group 1
        \-> user created group 2
```
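A sketch of creating that layout under cgroup v2 (the mount path and the directory names are assumptions):

```
cd /sys/fs/cgroup/gpdb
mkdir -p system admin qd_qe
echo "+io" > cgroup.subtree_control        # children may use the io controller
echo "+io" > qd_qe/cgroup.subtree_control
mkdir -p qd_qe/rgroup1 qd_qe/rgroup2       # user-created resource groups
```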



Soumyadeep Chakraborty

Apr 11, 2023, 2:25:05 PM
to Rong Du, gpdb...@greenplum.org
Hi Rong,

> Hello, after some thinking, there are some new updates:
> 1. Mirrors and standbys are located on remote hosts; master/primary to standby/mirror communication uses network IO, and we only limit the disk IO of QEs/QDs on the local host, so we do not need to consider mirrors and standbys.

GP clusters typically have mirrors and primaries on the same host. Mirrors can
consume significant disk IO due to WAL replay, especially in clusters with large
WAL volume, where the mirror is continuously replaying WAL all throughout the
day or during critical production jobs/queries. So, we have to consider mirrors
into our design as well.

Since WAL replay is a single process activity, we may reserve lower bandwidth
compared to a primary on the same host, but we will have to reserve something.

The amount reserved will depend on the segment density on the host, and can be
further controlled with a GUC (for which we have to come up with an intelligent
default). Reserving too little can have big ramifications on the cluster.
Mirrors falling behind is a commonly observed problem and we have to avoid it.

Also, like primaries, mirrors run checkpoints too, which can be similarly IO
intensive.


> 2. Actually, in customer usage, one role typically uses just one tablespace, and one role can only bind to one resource group (that is, one cgroup). So the resource group to tablespace mapping is always 1:1, and configuring the IO max limit per tablespace is complex and seems unnecessary.

How many users have we surveyed to reach this conclusion? I don't think this is
universally true. Even if it is, IMO we don't want such a restriction in the
product.

How do we tackle queries that span tablespaces? For instance, if we have a
temp_tablespace (spill files go here), a given query will be using more than one
tablespace. The user running the query will be spanning multiple tablespaces.

Also, multiple users (roles) may share the same tablespace. temp_tablespace is
one such example. For eg consider users with role 'manager' and 'employee'
accessing the same database 'staff_records' located in the same tablespace where
based on access rules, 'managers' can access certain tables that 'employee' can't.

Regards,
Soumyadeep (VMware)

Ashwin Agrawal

Apr 11, 2023, 9:41:12 PM
to Soumyadeep Chakraborty, Rong Du, gpdb...@greenplum.org
On Tue, Apr 11, 2023 at 11:25 AM Soumyadeep Chakraborty <soumyad...@gmail.com> wrote:
> 2. Actually, in customer usage, one role typically uses just one tablespace, and one role can only bind to one resource group (that is, one cgroup). So the resource group to tablespace mapping is always 1:1, and configuring the IO max limit per tablespace is complex and seems unnecessary.

How many users have we surveyed to reach this conclusion? I don't think this is
universally true. Even if it is, IMO we don't want such a restriction in the
product.

No tablespace usage I have encountered for Greenplum so far has been set up based on role as described. It is mostly based on hot vs cold data (not on users or roles). For example, the biggest feature highlight, heterogeneous partitioned tables, can place partitions across tablespaces, so a single user and a single query will access data across tablespaces.

--
Ashwin Agrawal (VMware)

Haolin Wang

Apr 11, 2023, 9:43:40 PM
to Soumyadeep Chakraborty, Rong Du, gpdb...@greenplum.org


On Apr 12, 2023, at 02:24, Soumyadeep Chakraborty <soumyad...@gmail.com> wrote:

Hi Rong,

> Hello, after some thinking, there are some new updates:
> 1. Mirrors and standbys are located on remote hosts; master/primary to standby/mirror communication uses network IO, and we only limit the disk IO of QEs/QDs on the local host, so we do not need to consider mirrors and standbys.

GP clusters typically have mirrors and primaries on the same host. Mirrors can
consume significant disk IO due to WAL replay, especially in clusters with large
WAL volume, where the mirror is continuously replaying WAL all throughout the
day or during critical production jobs/queries. So, we have to factor mirrors
into our design as well.

I think a typical setup in production has the primary on one host and its paired mirror on another host, so that we can guarantee an HA cluster configuration.
I think you probably meant that primaries and non-paired mirrors are located on the same host, right?
So I think we can't control the I/O of a primary's paired mirror on a remote host when we configure an I/O group for that primary on the local host.

As for the I/O of non-paired mirrors located on the same host, IIUC, it may be treated as another background I/O workload, like the system/admin I/O groups, and assigned to a different I/O group.


Since WAL replay is a single process activity, we may reserve lower bandwidth
compared to a primary on the same host, but we will have to reserve something.

The amount reserved will depend on the segment density on the host, and can be
further controlled with a GUC (for which we have to come up with an intelligent
default). Reserving too little can have big ramifications on the cluster.
Mirrors falling behind is a commonly observed problem and we have to avoid it.

Yes, that is the tricky part: if we lower the replay I/O bandwidth for a mirror to yield to a non-paired primary on the same host, it may impact the paired primary writing on the peer host.
So how much bandwidth should be reserved? I think it is hard to evaluate, hence letting the system allocate it fairly might be better (to avoid unpredictable effects)?


Also, like primaries, mirrors run checkpoints too, which can be similarly IO
intensive.

> 2. Actually, in customer usage, one role typically uses just one tablespace, and one role can only bind to one resource group (that is, one cgroup). So the resource group to tablespace mapping is always 1:1, and configuring the IO max limit per tablespace is complex and seems unnecessary.

How many users have we surveyed to reach this conclusion? I don't think this is
universally true. Even if it is, IMO we don't want such a restriction in the
product.

How do we tackle queries that span tablespaces? For instance, if we have a
temp_tablespace (spill files go here), a given query will be using more than one
tablespace. The user running the query will be spanning multiple tablespaces.

Also, multiple users (roles) may share the same tablespace. temp_tablespace is
one such example. For eg consider users with role 'manager' and 'employee'
accessing the same database 'staff_records' located in the same tablespace where
based on access rules, 'managers' can access certain tables that 'employee' can't.

Regards,
Soumyadeep (VMware)

Haolin Wang

Apr 11, 2023, 10:02:37 PM
to Ashwin Agrawal, Soumyadeep Chakraborty, Rong Du, gpdb...@greenplum.org


On Apr 12, 2023, at 09:40, Ashwin Agrawal <ashwi...@gmail.com> wrote:


Just want to clarify item 2:

The picture in my mind of mapping tablespace and I/O group is:

A kind of I/O workload  --(1:1 mapping to)-->  an I/O group —(1:1 mapping to) —> a datadir —(1:1 mapping to) —> a tablespace

So an I/O group should map 1:1 to a tablespace, right?
I was told previously that an I/O group maps 1:m to tablespaces (m > 1); if that's the case, I feel things get too complicated to configure and understand.


-- 
Ashwin Agrawal (VMware)


Haolin Wang

Apr 12, 2023, 3:02:49 AM
to Haolin Wang, Ashwin Agrawal, Soumyadeep Chakraborty, Rong Du, gpdb...@greenplum.org

On Apr 12, 2023, at 10:02, 'Haolin Wang' via Greenplum Developers <gpdb...@greenplum.org> wrote:



On Apr 12, 2023, at 09:40, Ashwin Agrawal <ashwi...@gmail.com> wrote:

On Tue, Apr 11, 2023 at 11:25 AM Soumyadeep Chakraborty <soumyad...@gmail.com> wrote:
> 2. Actually, in customer usage, one role typically uses just one tablespace, and one role can only bind to one resource group (that is, one cgroup). So the resource group to tablespace mapping is always 1:1, and configuring the IO max limit per tablespace is complex and seems unnecessary.

How many users have we surveyed to reach this conclusion? I don't think this is
universally true. Even if it is, IMO we don't want such a restriction in the
product.

No tablespace usage I have encountered for Greenplum so far has been set up based on role as described. It is mostly based on hot vs cold data (not on users or roles). For example, the biggest feature highlight, heterogeneous partitioned tables, can place partitions across tablespaces, so a single user and a single query will access data across tablespaces.

Just want to clarify item 2:

The picture in my mind of mapping tablespace and I/O group is:

A kind of I/O workload  --(1:1 mapping to)-->  an I/O group —(1:1 mapping to) —> a datadir —(1:1 mapping to) —> a tablespace

So an I/O group should map 1:1 to a tablespace, right?
I was told previously that an I/O group maps 1:m to tablespaces (m > 1); if that's the case, I feel things get too complicated to configure and understand.

Please ignore the above; I misunderstood the I/O group usage. An I/O group does not control only one kind of I/O workload; one I/O group can actually control multiple kinds of I/O workloads.
So we do have one I/O group mapping to multiple data paths or devices, corresponding to multiple tablespaces. So we need to consider and support the 1:m (I/O group : tablespaces) case.

Soumyadeep Chakraborty

Apr 12, 2023, 2:57:35 PM
to Haolin Wang, Rong Du, gpdb...@greenplum.org
Hey Haolin,

On Tue, Apr 11, 2023 at 6:43 PM Haolin Wang <wha...@vmware.com> wrote:

> I think a typical setup in production has the primary on one host and its paired mirror on another host, so that we can guarantee an HA cluster configuration.
> I think you probably meant that primaries and non-paired mirrors are located on the same host, right?
> So I think we can't control the I/O of a primary's paired mirror on a remote host when we configure an I/O group for that primary on the local host.
>
> As for the I/O of non-paired mirrors located on the same host, IIUC, it may be treated as another background I/O workload, like the system/admin I/O groups, and assigned to a different I/O group.

Yes, I meant that for each primary in the cluster, its mirror is on a
different host.
In this configuration, primaries and mirrors (of other content IDs)
will be on the
same host.

> As for the I/O of non-paired mirrors located on the same host, IIUC, it may be treated as another background I/O workload, like the system/admin I/O groups, and assigned to a different I/O group.

Yes, that is one approach to group it into the sys/admin group with
other background
processes (like checkpointer etc). However, we haven't yet reached
consensus on whether
those processes merit IO allocation.

Regards,
Soumyadeep (VMware)

Haolin Wang

Apr 12, 2023, 8:25:42 PM
to Soumyadeep Chakraborty, Rong Du, gpdb...@greenplum.org


> On Apr 13, 2023, at 02:56, Soumyadeep Chakraborty <soumyad...@gmail.com> wrote:
If we all agree that this is a configuration problem which can be deferred to the test phase to address, then we can move forward with the current design.

Off topic: for configuration, I personally feel it should be based on what we can guarantee; in this case, we should keep the mindset that we are
doing limitation, rather than allocation.

Zhenghua Lyu

Apr 17, 2023, 4:41:01 AM
to gpdb...@greenplum.org, Rong Du, Soumyadeep Chakraborty
Let me summarize and clarify the tablespace idea proposed by @Soumyadeep Chakraborty:

  • each role binds to a resgroup;
  • each resgroup binds to a cgroup directory on a host
  • each tablespace (on a segment) binds to a directory path, and this directory path belongs to a dev
And the user API will be:

     create resource group rg with (io_limit=content0,10MB:content1:5MB);
     alter resource group rg set io_limit=content0,10MB:content1:15MB;

The APIs are to set a matrix of config like below:
  
                           ts1 ts2 ts3
               grp1
               grp2
               grp3 

Each entry in the above matrix either has a default value (no control) or a user-specified value.
For example, Config<grp1, ts1> = 10MB means:
  • given grp1, its group id = 16450, we can know its cgroup path is /sys/fs/cgroup/16450
  • given ts1, we can find its dev
  • then we can write the value 10MB together with the dev id to that cgroup path's IO config file (as sketched below)
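In other words, applying Config<grp1, ts1> = 10MB boils down to something like the following (the tablespace location is a placeholder; whether the value maps to rbps, wbps or both depends on the final API):

     DEV=$(lsblk -no MAJ:MIN "$(df --output=source /data/tblspc/ts1 | tail -n 1)")
     echo "$DEV rbps=10485760 wbps=10485760" > /sys/fs/cgroup/16450/io.max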
The core problem is:
     within a single segment, different tablespaces' dir paths must belong to different devs.
     
If the above condition is violated, does resgroup need to throw an error or a warning?

The benefit of the above proposal is that it is clearer than writing all devs into the cgroup's IO file.

--------------------------------------------------------

For mirrors, this is another big design problem. Currently the procs of mirrors do not belong to any resgroup.
Right now we have 3 groups in the catalog:
  • default (QDs, QEs)
  • admin (QDs, QEs)
  • system (other procs of the postmasters and their child procs)
What about:
  • renaming the above system group to primary_pm_group
  • adding a new group: mirror_pm_group
So we can put all procs of the mirror postmasters into mirror_pm_group.
And use DDL to tune the parameters of that group.






Haolin Wang

Apr 17, 2023, 5:13:35 AM
to Zhenghua Lyu, gpdb...@greenplum.org, Rong Du, Soumyadeep Chakraborty
On Apr 17, 2023, at 16:40, 'Zhenghua Lyu' via Greenplum Developers <gpdb...@greenplum.org> wrote:

Let me summarize and clarify the tablespace idea proposed by @Soumyadeep Chakraborty:

  • each role binds to a resgroup;
  • each resgroup binds to a cgroup directory on a host
  • each tablespace (on a segment) binds to a directory path, and this directory path belongs to a dev
And the user API will be:

     create resource group rg with (io_limit=content0,10MB:content1:5MB);
     alter resource group rg set io_limit=content0,10MB:content1:15MB;

The APIs are to set a matrix of config like below:
  
                           ts1 ts2 ts3
               grp1
               grp2
               grp3 

Each entry in the above matrix either has a default value (no control) or a user-specified value.
For example, Config<grp1, ts1> = 10MB means:
  • given grp1, its group id = 16450, we can know its cgroup path is /sys/fs/cgroup/16450
  • given ts1, we can find its dev
  • then we can write the value 10MB together with the dev id to that cgroup path's IO config file
The core problem is:
     within a single segment, different tablespaces' dir paths must belong to different devs.
     
If the above condition is violated, does resgroup need to throw an error or a warning?
Since specifying the same device multiple times in the same group leads to an undefined result, we should error out.

For io.max, if the same key is specified multiple times, the outcome is undefined.