When the system is under heavy load and processes compete for IO bandwidth, important queries may be processed very slowly. If we could limit the IO bandwidth of unimportant queries and leave the remaining bandwidth to the important ones, we could avoid this situation.
Using the blkio controller in cgroup v1 is straightforward. Here is an example that limits write bps for a specific block device:
> mount -t cgroup -oblkio none /opt/blkio
> dd if=/dev/zero of=/opt/zerofile bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.00466194 s, 900 MB/s
Check the device id of /dev/sda (major 8, minor 0):
> ls -al /dev/sda
brw-rw---- 1 root disk 8, 0 Aug 17 04:14 /dev/sda
> echo "8:0 1048576" > /opt/blkio/blkio.throttle.write_bps_device
> echo $$ > /opt/blkio/cgroup.procs
> dd if=/dev/zero of=/opt/zerofile bs=4k count=1024 oflag=direct
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 3.90652 s, 1.1 MB/s
To use blkio, you must specify which block device to limit, and only reads/writes issued with O_DIRECT can be throttled.
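For illustration, here is a minimal sketch of issuing an O_DIRECT write from Python (the file path and names are arbitrary; mmap is used only to obtain an aligned buffer, and some filesystems, e.g. tmpfs, reject O_DIRECT, which the sketch tolerates):

```python
import mmap
import os
import tempfile

# blkio (cgroup v1) only throttles IO that bypasses the page cache,
# i.e. files opened with O_DIRECT. O_DIRECT requires buffers and sizes
# aligned to the block size, hence the mmap-allocated (page-aligned)
# buffer below.
buf = mmap.mmap(-1, 4096)          # anonymous, page-aligned, 4 KiB
buf.write(b"x" * 4096)
path = os.path.join(tempfile.gettempdir(), "odirect-demo.bin")
try:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
    os.write(fd, buf)              # this write is throttleable by blkio
    os.close(fd)
except OSError:
    pass                           # filesystem does not support O_DIRECT
finally:
    if os.path.exists(path):
        os.unlink(path)
```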
Blkio in cgroup v1 also supports hierarchies, but they are not recommended; for more information, see: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/blkio-controller.html#hierarchical-cgroups
The io controller in cgroup v2 differs from blkio in v1. For example, to limit write bps with the io controller:
> echo "+io" > /opt/root/cgroup.subtree_control
> mkdir /opt/root/io
> echo $$ > /opt/root/io/cgroup.procs
> echo "8:0 wbps=1048576" > /opt/root/io/io.max
> dd if=/dev/zero of=/opt/zerofile bs=4k count=1024 oflag=direct
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 3.90907 s, 1.1 MB/s
If the disk is under heavy load and we want to limit the IO speed of new processes, we can create a cgroup, set its wbps limit in the io controller, and then put the processes into this cgroup. But as the test above shows, when we write to disk without the direct flag, the data is first written to the page cache, and later another process, in a different cgroup, flushes the page cache to disk. Our limit is lost at that point.
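The buffered-write behavior described here can be sketched in a few lines (a minimal illustration using a temporary file; the comments about cgroup v1 flusher threads summarize the text above):

```python
import os
import tempfile

# Buffered write: os.write() returns as soon as the data lands in the
# page cache; the kernel's writeback (flusher) threads push it to disk
# later. Under cgroup v1 those flusher threads run outside our cgroup,
# which is why the blkio limit is bypassed for non-O_DIRECT writes.
fd, path = tempfile.mkstemp()
data = b"x" * 4096
written = os.write(fd, data)   # completes at memory speed
print(written)                 # 4096 -- "written", but still only in cache

# Only an explicit fsync() forces the dirty pages to the block device
# right now, in the context of this process (and this cgroup).
os.fsync(fd)
os.close(fd)
os.unlink(path)
```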
In cgroup v2 we can solve this problem: v2 uses the memory controller and the io controller together to limit writeback speed. For example, we create a cgroup, limit its wbps to 1MB/s, and then use dd to test the speed:
> dd if=/dev/zero of=/opt/zerofile bs=60M count=1
1+0 records in
1+0 records out
62914560 bytes (63 MB, 60 MiB) copied, 0.0425542 s, 1.5 GB/s
The dd command completed quickly because it only wrote the data to the page cache. But let's look at the actual IO speed:
> iostat -p sda -d 1
# update information every one second
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sda 2.00 0.00 1024.00 0.00 0 1024 0
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sda 2.00 0.00 1024.00 0.00 0 1024 0
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sda 2.00 0.00 1040.00 0.00 0 1040 0
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sda 2.00 0.00 1044.00 0.00 0 1044 0
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sda 4.00 0.00 1076.00 0.00 0 1076 0
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sda 2.00 0.00 1036.00 0.00 0 1036 0
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sda 2.00 0.00 1044.00 0.00 0 1044 0
As you can see, the wbps really is 1MB/s.
GPDB can use the cgroup io limit to throttle the IO speed of queries, but only with cgroup v2, because GPDB writes through the page cache. So on systems running GPDB with only cgroup v1 enabled, IO control is unavailable.
create table test_heap(a int,b int);
COPY test_heap FROM '/home/mt/c';
Monitoring with iotop, the read requests come entirely from the QD:
03:13:06 4185 be/4 mt 19.51 M/s 0.00 B/s ?unavailable? postgres: 5432, mt demo [local] con7 cmd14 COPY
The write requests come from the QEs and the walwriters:
03:13:07 4196 be/4 mt 0.00 B/s 79.58 M/s ?unavailable? postgres: 6000, mt demo 10.117.190.180(52630) con7 seg0 cmd14 MPPEXEC UTILITY
03:13:07 4195 be/4 mt 0.00 B/s 79.51 M/s ?unavailable? postgres: 6001, mt demo 10.117.190.180(36408) con7 seg1 cmd14 MPPEXEC UTILITY
03:13:07 4040 be/4 mt 0.00 B/s 6.23 M/s ?unavailable? postgres: 6001, walwriter
03:13:07 4034 be/4 mt 0.00 B/s 6.20 M/s ?unavailable? postgres: 6000, walwriter
create table test_ao(a int, b int) with (appendonly=true);
COPY test_ao FROM '/home/mt/c';
The read requests come from the QD and the QEs:
03:18:09 4196 be/4 mt 156.01 K/s 18.51 M/s ?unavailable? postgres: 6000, mt demo 10.117.190.180(52630) con7 seg0 cmd15 MPPEXEC UTILITY
03:18:09 4195 be/4 mt 156.01 K/s 17.19 M/s ?unavailable? postgres: 6001, mt demo 10.117.190.180(36408) con7 seg1 cmd15 MPPEXEC UTILITY
03:18:09 4185 be/4 mt 12.90 M/s 0.00 B/s ?unavailable? postgres: 5432, mt demo [local] con7 cmd15 COPY
But around 99.9% of the read requests come from the QD.
The write requests come from the QEs and the walwriters:
03:18:09 4196 be/4 mt 156.01 K/s 18.51 M/s ?unavailable? postgres: 6000, mt demo 10.117.190.180(52630) con7 seg0 cmd15 MPPEXEC UTILITY
03:18:09 4195 be/4 mt 156.01 K/s 17.19 M/s ?unavailable? postgres: 6001, mt demo 10.117.190.180(36408) con7 seg1 cmd15 MPPEXEC UTILITY
03:18:09 4040 be/4 mt 0.00 B/s 2.44 M/s ?unavailable? postgres: 6001, walwriter
03:18:09 4034 be/4 mt 0.00 B/s 676.05 K/s ?unavailable? postgres: 6000, walwriter
From the results of the tests above, we can see that most IO requests are issued by the QD and QE processes, so we can limit IO even per query (using cgroup v2).
create resource group rg with (io_read_limit=1024, io_write_limit=-1, io_read_iops=100, io_write_iops=100);
- io_read_limit: limit the read speed (MB/s) from disk for every process in the resource group
- io_write_limit: limit the write speed (MB/s) from disk for every process in the resource group
- io_read_iops: limit the read IOPS from disk for every process in the resource group
- io_write_iops: limit the write IOPS from disk for every process in the resource group
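As a sketch of how such options could translate into a cgroup v2 io.max entry (the function name and the MB-based units are assumptions for illustration, not the actual GPDB implementation):

```python
def io_max_line(dev, rbps_mb=-1, wbps_mb=-1, riops=-1, wiops=-1):
    """Render hypothetical resource-group IO options as one cgroup v2
    io.max entry. -1 means 'no limit', which cgroup v2 spells 'max'."""
    def bps(mb):
        return "max" if mb == -1 else str(mb * 1024 * 1024)
    def iops(n):
        return "max" if n == -1 else str(n)
    return (f"{dev} rbps={bps(rbps_mb)} wbps={bps(wbps_mb)} "
            f"riops={iops(riops)} wiops={iops(wiops)}")

# io_read_limit=1024, io_write_limit=-1, io_read_iops=100, io_write_iops=100
print(io_max_line("8:0", rbps_mb=1024, riops=100, wiops=100))
# -> 8:0 rbps=1073741824 wbps=max riops=100 wiops=100
```

Real code would write this line into the group's io.max file, which requires root.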
As you can see, the device id is needed when configuring the io controller. GPDB is a cluster application, and we cannot assume every host has the same disk configuration, so we need a way to configure device ids easily.
Every query may involve many tables, and the data files of those tables may be distributed across many disks. The cgroup io controller is configured by writing each block device's id and its limit to the configuration file. For example, in cgroup v2, writing 8:0 wbps=102400 rbps=102400 to io.max (the configuration file for io limits) means that when processes in this cgroup access the 8:0 disk (you can see block device ids with lsblk), the upper limits of write speed and read speed are 102400 bytes per second.
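One way to look up the id programmatically is stat(2): st_dev encodes the major:minor of the filesystem's backing device. A small sketch (note that for a file on a partition this yields the partition's id, e.g. 8:1, not necessarily the whole disk's 8:0):

```python
import os

def device_id(path):
    """Return the 'major:minor' device id of the filesystem holding
    `path` -- the same id format cgroup's io.max expects (cf. lsblk)."""
    st = os.stat(path)
    return f"{os.major(st.st_dev)}:{os.minor(st.st_dev)}"

print(device_id("/"))   # e.g. "8:0" when / is on /dev/sda
```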
So the biggest problem is how to find the block device ids and write them into the cgroup io configuration. There are several possible solutions for writing this information to the cgroup io configuration file:
Apparently, the second solution is better: its logic and implementation are simpler than the first one. There are many ways to find the device ids of all block devices, and we can even detect hotplug events, simply by using udev, the device manager shipped with Linux (since kernel 2.6). Here is a simple approach:
Use a simple udev rule to monitor block devices: when udev finds a block device, it writes the device's id to a specific file, for example /tmp/gpdb-diskinfo. GPDB then uses the Linux filesystem monitoring API (inotify) to pick up the disk info by watching /tmp/gpdb-diskinfo. GPDB can use a bgworker to do this and write the device ids of the disks to the cgroup configuration at the same time.
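Such a rule might look like the following sketch (the rules-file name is hypothetical; /tmp/gpdb-diskinfo is the file mentioned above):

```
# e.g. /etc/udev/rules.d/99-gpdb-diskinfo.rules  (hypothetical name)
# On any block-device add/remove, record its MAJOR:MINOR in the file
# that the GPDB bgworker watches via inotify.
ACTION=="add|remove", SUBSYSTEM=="block", \
  RUN+="/bin/sh -c 'echo $env{MAJOR}:$env{MINOR} >> /tmp/gpdb-diskinfo'"
```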
CREATE RESOURCE GROUP rgroup1 WITH (CONCURRENCY=5, IO_READ_LIMIT=1024, IO_WRITE_LIMIT=-1, IO_READ_IOPS=100, IO_WRITE_IOPS=100);
ALTER ROLE bill RESOURCE GROUP rgroup1;
Then all queries run by role bill will have their IO speed and IOPS limited.
Authored By: Rong Du, Zhenghua Lyu
On Apr 12, 2023, at 02:24, Soumyadeep Chakraborty <soumyad...@gmail.com> wrote:
Hi Rong,
> Hello, after some thinking, there are some new updates:
> 1. Mirrors and standbys are located remotely; Master/Primary to Standby/mirror communication uses network IO, and we only limit the disk IO of QE/QD on the local host, so we do not need to consider mirrors and standbys.
GP clusters typically have mirrors and primaries on the same host. Mirrors can
consume significant disk IO due to WAL replay, especially in clusters with large
WAL volume, where the mirror is continuously replaying WAL all throughout the
day or during critical production jobs/queries. So, we have to consider mirrors
into our design as well.
Since WAL replay is a single process activity, we may reserve lower bandwidth
compared to a primary on the same host, but we will have to reserve something.
The amount reserved will depend on the segment density on the host, and can be
further controlled with a GUC (for which we have to come up with an intelligent
default). Reserving too little can have big ramifications on the cluster.
Mirrors falling behind is a commonly observed problem and we have to avoid it.
Also, like primaries, mirrors run checkpoints too, which can be similarly IO
intensive.
> 2. Actually, in customers usage, one role just use one tablespace, and one role can only bind to one resource group( that is one cgroup). So resource group to tablespace mapping is always 1:1, and configure io max limitation using tablespace is complex and seems not need.
How many users have we surveyed to reach this conclusion? I don't think this is
universally true. Even if it is, IMO we don't want such a restriction in the
product.
How do we tackle queries that span tablespaces? For instance, if we have a
temp_tablespace (spill files go here), a given query will be using more than one
tablespace. The user running the query will be spanning multiple tablespaces.
Also, multiple users (roles) may share the same tablespace. temp_tablespace is
one such example. For eg consider users with role 'manager' and 'employee'
accessing the same database 'staff_records' located in the same tablespace where
based on access rules, 'managers' can access certain tables that 'employee' can't.
Regards,
Soumyadeep (VMware)
On Apr 12, 2023, at 09:40, Ashwin Agrawal <ashwi...@gmail.com> wrote:
--
Ashwin Agrawal (VMware)
On Apr 12, 2023, at 10:02, 'Haolin Wang' via Greenplum Developers <gpdb...@greenplum.org> wrote:
On Apr 12, 2023, at 09:40, Ashwin Agrawal <ashwi...@gmail.com> wrote:
On Tue, Apr 11, 2023 at 11:25 AM Soumyadeep Chakraborty <soumyad...@gmail.com> wrote:
> 2. Actually, in customers usage, one role just use one tablespace, and one role can only bind to one resource group( that is one cgroup). So resource group to tablespace mapping is always 1:1, and configure io max limitation using tablespace is complex and seems not need.
How many users have we surveyed to reach this conclusion? I don't think this is
universally true. Even if it is, IMO we don't want such a restriction in the
product.
No usage of tablespaces that I have encountered for Greenplum so far has been set up as described, based on role. It's mostly based on hot vs cold data aspects (not on users or roles). For example, with the big feature highlight, heterogeneous partitioned tables, a table can span tablespaces, so a single user and a single query will access data across tablespaces.
Just want to clarify item 2:
The picture in my mind of mapping tablespace and I/O group is:
A kind of I/O workload --(1:1 mapping)--> an I/O group --(1:1 mapping)--> a datadir --(1:1 mapping)--> a tablespace
So an I/O group should map 1:1 to a tablespace, right? I was told previously that an I/O group maps 1:m to tablespaces (m > 1); if that's the case, I feel things get too complicated to configure and understand.
On Apr 17, 2023, at 16:40, 'Zhenghua Lyu' via Greenplum Developers <gpdb...@greenplum.org> wrote:
Let me summarize and clarify the idea of tablespace proposed by @Soumyadeep Chakraborty:
- each role binds to a resgroup;
- each resgroup binds to a cgroup directory on a host
- each tablespace (on a segment) binds to a directory path, and this directory path belongs to a device
And the user API will be:
create resource group rg with (io_limit='content0:10MB,content1:5MB');
alter resource group rg set io_limit='content0:10MB,content1:15MB';
The APIs set a matrix of config values like below:

       ts1   ts2   ts3
grp1
grp2
grp3
Each entry in the above matrix either has a default value (no control) or a user-specified value. For example, if we have Config<grp1, ts1> = 10MB, it means:
- given grp1, its group id = 16450, we can know its cgroup path is /sys/fs/cgroup/16450
- given ts1, we can find its dev
- then we can write the value 10MB with the dev id to the cgroup path's IO config file
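Putting these steps together, resolving one matrix entry could look roughly like the following sketch (assumptions for illustration: the cgroup mount point, the function name, and returning the plan instead of performing the privileged write):

```python
import os

def plan_io_config(group_id, ts_path, limit_mb,
                   cgroup_root="/sys/fs/cgroup"):
    """Resolve Config<group, tablespace> = limit into the cgroup file
    to write and the line to write into it. Real code would write
    `line` into `target`, which needs root."""
    st = os.stat(ts_path)
    dev = f"{os.major(st.st_dev)}:{os.minor(st.st_dev)}"
    nbytes = limit_mb * 1024 * 1024
    line = f"{dev} rbps={nbytes} wbps={nbytes}"
    target = os.path.join(cgroup_root, str(group_id), "io.max")
    return target, line

# grp1 with group id 16450, tablespace dir "/", limit 10MB
print(plan_io_config(16450, "/", 10))
```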
The core problem is: within a single segment, different tablespaces' directory paths must belong to different devices. If this condition is violated, should resgroup throw an error or a warning?