gp_vmem_protect_limit and resource group based resource management

Aleksey Kashin

Jun 28, 2021, 5:32:47 AM

to Greenplum Users

Hi,

I'm running into VACUUM being canceled on tables because of a high VMEM usage error, for instance:
VACUUM: Canceling query because of high VMEM usage. current group id is 6438, group memory usage 363 MB, group shared memory quota is 210 MB, slot memory quota is 15 MB, global freechunks memory is 15 MB, global safe memory threshold is 15 MB (runaway_cleaner.c:197) (seg89 X.X.X.X:6017 pid=3429316) (runaway_cleaner.c:197)

Here are some explanations of high VMEM usage errors that I found:
https://community.pivotal.io/s/question/0D50e0000586mDdCAI/error-canceling-query-because-of-high-vmem-usage?language=en_US
https://community.pivotal.io/s/article/Query-Failing-with-ERROR-Canceling-query-because-of-high-VMEM-usage?language=en_US
but in the documentation I saw: "Note: The gp_vmem_protect_limit server configuration parameter is enforced only when resource queue-based resource management is active"

My question is: am I right that the error during VACUUM does not depend on gp_vmem_protect_limit, and that I need to review and tune the admin_group resource group parameters to fix it? Or can I adjust some parameters inside the session while the vacuum runs?

Greenplum 6.16.1 OSS, Resource group based resource management.

Thanks!

Joe Manning

Jun 28, 2021, 6:05:42 AM

to pvtl-cont-aleksey.kashin, Greenplum Users

Hi, Aleksey.

You are correct. The gp_vmem_protect_limit is not causing the error.

You seem to be using resource groups. The admin_group (group id = 6438) does not have enough memory assigned, and there is very little shared memory available.

So before running the vacuum, you will need to increase the amount of memory available to the admin_group.

You can set the limits back after the vacuum has completed.

Even reducing the amount of RAM allocated to other groups, so that the Global Shared Memory is increased, may be sufficient.

See https://gpdb.docs.pivotal.io/6-16/admin_guide/workload_mgmt_resgroups.html#topic8339717 for more details on the resource groups.
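
As a rough illustration of "raise the limit, vacuum, then set it back", here is a minimal Python sketch. It assumes a DB-API cursor `cur` on an autocommit connection (as in a psycopg2 script) and a role that is allowed to alter resource groups. The helper names and the values 20 and 10 are hypothetical; only the `ALTER RESOURCE GROUP ... SET ...` statement itself is Greenplum syntax.

```python
from contextlib import contextmanager

# Attributes this sketch is willing to change; a subset of what
# ALTER RESOURCE GROUP accepts in Greenplum 6.
ALLOWED_ATTRIBUTES = {"CONCURRENCY", "MEMORY_LIMIT", "MEMORY_SHARED_QUOTA"}


def alter_group_sql(group, attribute, value):
    """Build an ALTER RESOURCE GROUP statement for an integer attribute."""
    attribute = attribute.upper()
    if attribute not in ALLOWED_ATTRIBUTES:
        raise ValueError("unsupported attribute: {}".format(attribute))
    # Hypothetical safety check: only plain names such as admin_group.
    if not group.replace("_", "").isalnum():
        raise ValueError("unexpected group name: {}".format(group))
    return "ALTER RESOURCE GROUP {} SET {} {}".format(group, attribute, int(value))


@contextmanager
def temporary_group_setting(cur, group, attribute, during, after):
    """Apply `during` for the duration of the block, then set `after`."""
    cur.execute(alter_group_sql(group, attribute, during))
    try:
        yield
    finally:
        # Runs even if the vacuum loop raises, so the limit is set back.
        cur.execute(alter_group_sql(group, attribute, after))
```

Usage would look like `with temporary_group_setting(cur, "admin_group", "MEMORY_LIMIT", 20, 10): ...` wrapped around the vacuum loop, where 20 and 10 stand in for whatever percentages fit the cluster.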

Regards,

joe.


Aleksey Kashin

Jun 29, 2021, 5:50:54 AM

to Joe Manning, Greenplum Users

Hi, Joe.

Thanks for answering.

One more small question. If I understand correctly, memory is allocated to the session at the start and is not recalculated while the session is alive.

For example, this simple code gets all tables in the database and runs VACUUM on each one (written in Python using psycopg2):

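import psycopg2

conn = psycopg2.connect(..)
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()
cur.execute ..  # query returning the table names (elided in the original)
res = cur.fetchall()
for row in res:
    # each fetched row is a tuple; the table name is its first column
    cur.execute('VACUUM "{}"'.format(row[0]))
conn.close()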

At some point VACUUM starts failing with the high VMEM usage error and keeps failing for every table until the end of the session. So do I need to close the connection after vacuuming each table and open a new one for the next?

Thanks!

On Mon, Jun 28, 2021 at 13:05, Joe Manning <mann...@vmware.com> wrote:


--

Best regards,
Aleksey Kashin

Joe Manning

Jun 29, 2021, 9:11:18 AM

to pvtl-cont-aleksey.kashin, Greenplum Users

Hi.

That does not sound correct. It will reuse the RAM assigned to the process/connection.

How many tables are in the list? And how many tables get vacuumed before it starts to fail?

I would think that the group has very little memory allocated to it. From the original message:

current group id is 6438, group memory usage 363 MB, group shared memory quota is 210 MB, slot memory quota is 15 MB, global freechunks memory is 15 MB, global safe memory threshold is 15 MB

363 MB of RAM seems very small. If you connect and disconnect for each table/vacuum, it will introduce quite a lot of overhead.

I would think the best option is to increase the RAM allocated while the VACUUM is being done.

Aleksey Kashin

Jun 29, 2021, 11:46:18 AM

to Joe Manning, Greenplum Users

Hi,

There are about 10k tables in the list. About 2k tables were vacuumed before it started to fail; it worked fine for about 4 hours.

I also noticed that the group memory usage, group shared memory quota, slot memory quota and global freechunks values are the same in all error messages. I can't tell whether that is strange or not.

# grep -ic 'Canceling query because of high VMEM usage. current group id is 6438, group memory usage 363 MB, group shared memory quota is 210 MB, slot memory quota is 15 MB, global freechunks memory is 15 MB, global safe memory threshold is 15 MB' vacuum.log
7673
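
To compare the numbers across log lines rather than grep for one exact string, the message can be parsed into fields. The sketch below is only an illustration: the field names are made up here, the patterns are written against the message text quoted in this thread, and it assumes each log message sits on a single line.

```python
import re
from collections import Counter

# One pattern per number in the message; names are hypothetical labels.
PATTERNS = {
    "group_id": r"current group id is (\d+)",
    "group_memory_usage_mb": r"group memory usage (\d+) MB",
    "group_shared_quota_mb": r"group shared memory quota is (\d+) MB",
    "slot_quota_mb": r"slot memory quota is (\d+) MB",
    "global_freechunks_mb": r"global freechunks memory is (\d+) MB",
    "global_safe_threshold_mb": r"global safe memory threshold is (\d+) MB",
}
# Matches the "(seg89 X.X.X.X:6017 pid=...)" suffix seen on the master side.
SEGMENT = re.compile(r"\((seg\d+)\s")


def parse_vmem_error(line):
    """Return a dict of the reported values, or None for unrelated lines."""
    if "Canceling query because of high VMEM usage" not in line:
        return None
    parsed = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, line)
        parsed[name] = int(match.group(1)) if match else None
    seg = SEGMENT.search(line)
    parsed["segment"] = seg.group(1) if seg else None
    return parsed


def count_by_segment(lines):
    """Count high-VMEM errors per reporting segment."""
    counts = Counter()
    for line in lines:
        parsed = parse_vmem_error(line)
        if parsed is not None:
            counts[parsed["segment"]] += 1
    return counts
```

Feeding `count_by_segment(open("vacuum.log"))` the same file used above would show whether the errors cluster on a few segments.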

On Tue, Jun 29, 2021 at 16:11, Joe Manning <mann...@vmware.com> wrote:

Aleksey Kashin

Jun 30, 2021, 5:54:30 AM

to Joe Manning, Greenplum Users

Hi,

Ok. Thanks for your help.

I changed concurrency for admin_group from 20 to 10 yesterday, but the errors still occur and vacuum still fails after about the same number of tables.
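
A small arithmetic sketch of why this change may not have helped. It assumes, following the Greenplum 6 resource group documentation, that the per-slot quota is the group's non-shared memory divided evenly by CONCURRENCY. The slot quotas are the ones reported in this thread: 15 MB in the earlier errors and 30 MB in the segment log quoted later in this message.

```python
def group_fixed_memory_mb(slot_quota_mb, concurrency):
    """Non-shared group memory implied by a per-slot quota."""
    return slot_quota_mb * concurrency


def slot_quota_mb(fixed_memory_mb, concurrency):
    """Per-slot quota when the fixed memory is split evenly across slots."""
    return fixed_memory_mb / concurrency


fixed_before = group_fixed_memory_mb(15, 20)  # 300 MB with concurrency 20
slot_after = slot_quota_mb(fixed_before, 10)  # 30.0 MB with concurrency 10
print(fixed_before, slot_after)
```

Under that assumption, halving the concurrency doubles each slot's share but leaves the group's total memory unchanged, which would be consistent with the failures continuing.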

Another oddity I found: the errors come from only one segment host, from all of the primary segments on it.

The segment logs contain a lot of errors (several tens of gigabytes), for example:

2021-06-30 06:12:07.055665 MSK,"gpadmin","prod",p2423745,th-557321216,"X.X.X.X","21056",2021-06-29 23:00:10 MSK,0,con67470,cmd2566,seg80,,dx365764,,sx1,"ERROR","XX000","Canceling query because of high VMEM usage. current group id is 6438, group memory usage 381 MB, group shared memory quota is 210 MB, slot memory quota is 30 MB, global freechunks memory is 12 MB, global safe memory threshold is 15 MB (runaway_cleaner.c:197)",,,,,,"VACUUM ""schema"".""table""",0,,"runaway_cleaner.c",197,"Stack trace:
1 0x5635c3ac86d1 postgres errstart + 0x251
2 0x5635c3af8aeb postgres RunawayCleaner_StartCleanup + 0x1fb
3 0x5635c36b3476 postgres heap_getnext + 0x306
4 0x5635c394ba04 postgres <symbol not found> + 0xc394ba04
5 0x5635c394f4d9 postgres pgstat_vacuum_stat + 0x1b9
6 0x5635c38334b2 postgres vacuum + 0x5d2
7 0x5635c39c19c4 postgres standard_ProcessUtility + 0x6e4
8 0x5635c39be38e postgres <symbol not found> + 0xc39be38e
9 0x5635c39bf225 postgres <symbol not found> + 0xc39bf225
10 0x5635c39c0058 postgres PortalRun + 0x1e8
11 0x5635c39ba603 postgres <symbol not found> + 0xc39ba603
12 0x5635c39bdbb8 postgres PostgresMain + 0x1f18
13 0x5635c3674176 postgres <symbol not found> + 0xc3674176
14 0x5635c3956ecd postgres PostmasterMain + 0x11cd
15 0x5635c3675af8 postgres main + 0x498
16 0x7f0edb4abbf7 libc.so.6 __libc_start_main + 0xe7
17 0x5635c3681b2a postgres _start + 0x2a
"

<cut>

2021-06-30 06:12:07.057973 MSK,"gpadmin","prod",p2423745,th-557321216,"X.X.X.X","21056",2021-06-29 23:00:10 MSK,0,con67470,cmd2566,seg80,,,,sx1,"LOG","00000","context: 1, 3072, 1992, 3072, 0, CacheMemoryContext/pg_toast_373630_index
",,,,,,,0,,,,
2021-06-30 06:12:07.057999 MSK,"gpadmin","prod",p2423745,th-557321216,"X.X.X.X","21056",2021-06-29 23:00:10 MSK,0,con67470,cmd2566,seg80,,,,sx1,"LOG","00000","context: 1, 3072, 1992, 3072, 0, CacheMemoryContext/pg_aovisimap_373558_index
",,,,,,,0,,,,

<cut>

I checked the sysctl parameters and can't find any difference from the other segment nodes. The system logs show no errors about hardware or software problems.

I'll try to open an issue on GitHub.

On Wed, Jun 30, 2021 at 11:40, Joe Manning <mann...@vmware.com> wrote:

Hi.

Ultimately, it needs more RAM in the resource group to complete as it is at the moment.

There may be some small memory leak, but generally it will reuse the memory allocated.

Would probably need a support ticket to be opened to get this investigated.

If you cannot change the resource group config, then getting the list of tables and closing and reopening the connection after every 500 or 1000 vacuumed tables, to allow it to clean up memory, may help you.

Opening and closing the connection for every table would be extreme and cause a lot of overhead.
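
The reconnect-every-N-tables idea could be sketched like this. It is an illustration only: `connect` is a hypothetical zero-argument callable that returns a new DB-API connection (for example a small wrapper around `psycopg2.connect` with the real parameters), and `tables` is a list of table names fetched beforehand.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def vacuum_in_batches(connect, tables, batch_size=500):
    """Vacuum `tables`, opening a fresh connection for each batch.

    Returns a list of (table, error message) pairs for failed tables.
    """
    failed = []
    for batch in chunked(tables, batch_size):
        conn = connect()
        try:
            conn.autocommit = True  # VACUUM cannot run inside a transaction
            cur = conn.cursor()
            for table in batch:
                try:
                    # Same simple quoting as the script earlier in the thread.
                    cur.execute('VACUUM "{}"'.format(table))
                except Exception as exc:
                    failed.append((table, str(exc)))
        finally:
            # Closing ends the session before the next batch starts.
            conn.close()
    return failed
```

With about 10k tables and `batch_size=500` this opens roughly 20 connections instead of 10k, which keeps the reconnect overhead small while still ending the session periodically.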

Regards,

joe.

From: Aleksey Kashin <aleksey...@gmail.com>
Date: Tuesday 29 June 2021 at 17:08
To: Joe Manning <mann...@vmware.com>
Subject: Re: [gpdb-users] gp_vmem_protect_limit and resource group based resource management

Here is a stack trace from the master log. It may simply be what is expected for this error.

2021-06-29 05:15:03.650820 MSK,"gpadmin","prod",p1101423,th1201042624,"[local]",,2021-06-28 23:00:01 MSK,0,con49069,cmd3445,seg-1,,dx306718,,sx1,"ERROR","XX000","Canceling query because of high VMEM usage. current group id is 6438, group memory usage 363 MB, group shared memory quota is 210 MB, slot memory quota is 15 MB, global freechunks memory is 15 MB, global safe memory threshold is 15 MB (runaway_cleaner.c:197) (seg73 X.X.X.X:6001 pid=709512) (runaway_cleaner.c:197)",,,,,,"VACUUM ""schema"".""tablename""",0,,"runaway_cleaner.c",197,"Stack trace:
1 0x559fb45266d1 postgres errstart + 0x251
2 0x559fb45a3efd postgres cdbdisp_get_PQerror + 0xbd
3 0x559fb45a406d postgres cdbdisp_dumpDispatchResult + 0x3d
4 0x559fb45a4150 postgres cdbdisp_dumpDispatchResults + 0x30
5 0x559fb45a1718 postgres cdbdisp_getDispatchResults + 0x88
6 0x559fb45a5430 postgres <symbol not found> + 0xb45a5430
7 0x559fb4290a5e postgres <symbol not found> + 0xb4290a5e
8 0x559fb4291267 postgres vacuum + 0x387
9 0x559fb441f9c4 postgres standard_ProcessUtility + 0x6e4
10 0x559fb441c38e postgres <symbol not found> + 0xb441c38e
11 0x559fb441d225 postgres <symbol not found> + 0xb441d225
12 0x559fb441e058 postgres PortalRun + 0x1e8
13 0x559fb4417c11 postgres <symbol not found> + 0xb4417c11
14 0x559fb441b965 postgres PostgresMain + 0x1cc5
15 0x559fb40d2176 postgres <symbol not found> + 0xb40d2176
16 0x559fb43b4ecd postgres PostmasterMain + 0x11cd
17 0x559fb40d3af8 postgres main + 0x498
18 0x7f3c44193bf7 libc.so.6 __libc_start_main + 0xe7
19 0x559fb40dfb2a postgres _start + 0x2a
"

On Tue, Jun 29, 2021 at 18:46, Aleksey Kashin <aleksey...@gmail.com> wrote:
