Analyzing DML state for AO/CO tables


Soumyadeep Chakraborty

Oct 26, 2022, 8:16:43 PM
to Greenplum Developers
Hello!

This email presents an analysis of the backend-local hash table mechanism that is
employed by AO/CO tables for INSERT/UPDATE/DELETE (I/U/D) and also VACUUM/CLUSTER.
Primarily, we want to examine this FIXME:

/*
* GPDB_12_MERGE_FIXME: This is a temporary state of things. A locally stored
* state is needed currently because there is no viable place to store this
* information outside of the table access method. Ideally the caller should be
* responsible for initializing a state and passing it over using the table
* access method api.
*
...

This email focuses on a performance and memory analysis to check whether the hash
table mechanism that the FIXME refers to is viable.

(1) Performance analysis:

* Currently the "AppendOnly DML State Context", used for both AO and CO tables, is
allocated under the ExecutorState memory context. This context houses the hash
table for DML operations. We need a hash table as there can be multiple
insert/delete descriptors present (think inserts into multiple partition
leaves in the same command).

Since this is attached to ExecutorState, it is cleaned up at command end (not at
transaction end). So, we have to create and destroy the context (and hash
table) for every command we execute. Along with the hash table, the
insert/delete descriptors are also wiped at command end and must be recreated.
While this is perfectly fine for bulk data loading, it will have an impact on
smaller point DMLs.
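
To make the mechanism concrete, here is a minimal sketch of how such backend-local
state can be arranged: a dedicated context parented to ExecutorState, housing a
dynahash keyed by relation OID. The names and the entry layout are illustrative
only, not the actual GPDB symbols.

#include "postgres.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"

/* Illustrative entry: one per relation touched by the current command. */
typedef struct AODMLStateEntry
{
    Oid         relid;          /* hash key */
    void       *insertDesc;     /* AppendOnlyInsertDesc / AOCSInsertDesc */
    void       *deleteDesc;     /* AppendOnlyDeleteDesc / AOCSDeleteDesc */
} AODMLStateEntry;

static MemoryContext aoDMLContext = NULL;
static HTAB *aoDMLStates = NULL;

/* Called once per command, with CurrentMemoryContext == ExecutorState. */
static void
init_ao_dml_state(void)
{
    HASHCTL     ctl;

    /* Dies with ExecutorState, i.e. at command end. */
    aoDMLContext = AllocSetContextCreate(CurrentMemoryContext,
                                         "AppendOnly DML State Context",
                                         ALLOCSET_SMALL_SIZES);

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(AODMLStateEntry);
    ctl.hcxt = aoDMLContext;

    aoDMLStates = hash_create("AppendOnly DML states", 16 /* initial size */,
                              &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
}

Per-relation descriptors are then looked up or created lazily with
hash_search(HASH_ENTER) as tuples get routed to partition leaves.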

This is VERY SIMILAR to what we have in 6X. In 6X, the descs are also allocated
in the ExecutorState memory context. The only difference is that there is no
hash table (the descs are attached to ResultRelInfos).

Running perf on a simple loop-point-insert workload like:

CREATE TABLE aosimple(i int) WITH (appendonly=true);
DO
$do$
BEGIN
    FOR i IN 1..10000 LOOP
        INSERT INTO aosimple values(1);
    END LOOP;
END
$do$;
-- Takes 15s (on 6X and 7X on a dev box); takes 3s for a heap table

* I found that the hash table re-creation wasn't significant at all in terms of
the overall profile (0.7%). What was significant in the profile were
GetFileSegInfo() and choose_segno_internal() (5.49% and 5.24% respectively).
Unfortunately, these cannot be reused, as new appendonly metadata snapshots need
to be established between commands (unless we are in repeatable read, which is
not a frequent enough use case to warrant reusing the descs across commands
in the same transaction).

(2) Memory analysis:

* Setup:
create table testao(i int, j int) using ao_row distributed by (i)
partition by range(j) (start(1) end(1001) every(1));

* Black box analysis:

When we perform insert into testao select 1, j from generate_series(1, 1000)j; and
analyze the memory growth with top, the impact is the SAME between 6X and 7X.

7X grows VIRT from ~380M -> ~490M whereas 6X grows from ~383M -> ~494M.
If you repeat this experiment 6 times with 6 different partition tables, the
memory eventually stops growing and reaches a steady state of ~500M.

* Low level analysis:

We use a small patch to analyze the memory footprint (see attached).
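
The patch is attached rather than inlined; as a rough idea of its shape (an
assumption on my part, not the actual patch contents), the instrumentation boils
down to dumping the DML state context's statistics around descriptor allocation
and at context delete:

#include "postgres.h"
#include "utils/memutils.h"
#include "utils/rel.h"

/* Hypothetical instrumentation -- not the attached examine_mem.patch itself. */
static void
dump_dml_context_stats(Relation relation, MemoryContext dmlContext)
{
    elog(LOG, "printing stats before allocating for oid = %u",
         RelationGetRelid(relation));

    /* Writes the per-context "Grand total: ..." lines to the server log. */
    MemoryContextStats(dmlContext);
}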

If we insert 10 rows (routed to 10 partition leaves), we see that the
"AppendOnly DML State Context" can grow up to 963712 bytes (~1MB):

insert into testao select 1, j from generate_series(1, 10)j;

Scrubbed log:

LOG: statement: insert into testao2 select 1, j from generate_series(1, 10)j;
LOG: free list index of context = 1
LOG: Attempting to reuse context from free list
LOG: printing stats before allocating for oid = 741241
LOG: Grand total: 9216 bytes in 2 blocks; 3280 free (0 chunks); 5936 used
LOG: free list index of context = 1 (seg1 192.168.0.148:7003 pid=2185301)
LOG: Attempting to reuse context from free list (seg1 192.168.0.148:7003 pid=2185301)
LOG: Reusing context from free list (seg1 192.168.0.148:7003 pid=2185301)
LOG: printing stats before allocating for oid = 741241 (seg1 192.168.0.148:7003 pid=2185301)
LOG: Grand total: 9216 bytes in 2 blocks; 3280 free (0 chunks); 5936 used (seg1 192.168.0.148:7003 pid=2185301)
LOG: printing stats before allocating for oid = 741244 (seg1 192.168.0.148:7003 pid=2185301)
LOG: Grand total: 9216 bytes in 2 blocks; 1208 free (0 chunks); 8008 used (seg1 192.168.0.148:7003 pid=2185301)
LOG: printing stats before allocating for oid = 741251 (seg1 192.168.0.148:7003 pid=2185301)
LOG: Grand total: 118912 bytes in 7 blocks; 5144 free (12 chunks); 113768 used (seg1 192.168.0.148:7003 pid=2185301)
LOG: printing stats before allocating for oid = 741258 (seg1 192.168.0.148:7003 pid=2185301)
...
LOG: Grand total: 963712 bytes in 26 blocks; 4704 free (15 chunks); 959008 used (seg1 192.168.0.148:7003 pid=2185301)
LOG: printing stats during context delete. peak usage = 1024640 (seg1 192.168.0.148:7003 pid=2185301)
LOG: Grand total: 1066240 bytes in 28 blocks; 39264 free (146 chunks); 1026976 used (seg1 192.168.0.148:7003 pid=2185301)
LOG: printing stats during context delete. peak usage = 0
LOG: Grand total: 1024 bytes in 1 blocks; 728 free (0 chunks); 296 used

Ten insert descriptors are used in this case. We see that the usage and the peak
usage of this context are both ~1MB.

For 100 rows routed to 100 partitions, 100 insert descriptors will be created
and the memory usage will grow linearly to be ~10MB with a peak of ~10MB.

Similarly, for 1000 rows routed to 1000 partitions, the memory usage would be
~100MB with a peak of ~100MB.

For CO tables, the consumption is higher: ~130MB for 1000 partitions.

* Allocation distribution:

The largest fraction of the allocations is for BTScanOpaqueData for sysscans
(of size 32920), which are all malloced/freed directly; they are not pulled from
the free lists and are not sent back to the free lists.
LOG: Request size 32920 exceeded chunk limit 1024

There are smaller allocations as well, like the following, which don't really amount to much:
LOG: Request size 296 DID NOT exceed chunk limit 1024 (seg 1 ..)

So ALLOCSET_SMALL_SIZES really is fine for this memory context.
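
For reference, ALLOCSET_SMALL_SIZES expands to the following (quoting the
upstream PostgreSQL 12 definitions that GPDB 7 is merged with; I'm assuming
GPDB's memutils.h matches), and the chunk-limit behaviour in the log line above
is the standard aset.c one:

/* src/include/utils/memutils.h */
#define ALLOCSET_SMALL_MINSIZE   0
#define ALLOCSET_SMALL_INITSIZE  (1 * 1024)
#define ALLOCSET_SMALL_MAXSIZE   (8 * 1024)
#define ALLOCSET_SMALL_SIZES \
    ALLOCSET_SMALL_MINSIZE, ALLOCSET_SMALL_INITSIZE, ALLOCSET_SMALL_MAXSIZE

/*
 * In aset.c, any request larger than the set's allocChunkLimit (1024 bytes
 * here, per the log line above) gets its own dedicated block straight from
 * malloc and is returned to malloc on pfree, so it never goes through the
 * context's freelists.  That is why the 32920-byte BTScanOpaqueData
 * allocations don't inflate the freelists.
 */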

(3) Conclusions:

* From a performance standpoint, we are on par with 6X and the performance is
acceptable. For point queries we were, and will remain, slower than heap, which
is due to how AO/CO tables are designed.

* From a memory standpoint, we are also on par with 6X and the memory consumption
is acceptable. There is not much to do here.

* Way forward: Clean up the code, introduce table AM APIs for initializing/tearing
down the DML states, and get rid of the FIXME. (A rough sketch of what such hooks
could look like follows.)
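
For illustration, building on the hash table sketch earlier in this mail, the
init/teardown hooks could look roughly like this (names and shapes are purely
illustrative, not necessarily what will be committed):

#include "postgres.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/rel.h"

/* Hypothetical per-command setup hook, invoked by the caller (executor,
 * VACUUM, CLUSTER) through a new table AM API. */
static void
appendonly_dml_init(Relation relation)
{
    if (aoDMLStates == NULL)
        init_ao_dml_state();    /* from the sketch earlier in this mail */

    /* Insert/delete descriptors are then created lazily on first use. */
}

/* Hypothetical per-command teardown hook. */
static void
appendonly_dml_finish(Relation relation)
{
    /* Flush and close this relation's descriptors, then drop its entry. */
    hash_search(aoDMLStates, &relation->rd_id, HASH_REMOVE, NULL);

    /* Once the last relation is done, drop the context (and hash table). */
    if (hash_get_num_entries(aoDMLStates) == 0)
    {
        MemoryContextDelete(aoDMLContext);
        aoDMLContext = NULL;
        aoDMLStates = NULL;
    }
}

With explicit hooks like these, the caller owns the lifecycle of the state,
which is exactly what the FIXME asks for.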

Regards,
Soumyadeep (VMware)
Attachment: examine_mem.patch

Ashwin Agrawal

Oct 27, 2022, 12:49:26 PM
to Soumyadeep Chakraborty, Greenplum Developers
Thanks so much Deep for the in-depth analysis of this part of the code. It gives peace of mind that this is something that doesn't need attention for GPDB7 GA.

And I am assuming that, given the hash table is in process-local memory, there is no constraint on its size and it will grow as required.

--
Ashwin Agrawal (VMware)

Soumyadeep Chakraborty

Oct 27, 2022, 1:12:32 PM
to Ashwin Agrawal, Greenplum Developers
> Thanks so much Deep for the in-depth analysis of this part of the code. It gives peace of mind that this is something that doesn't need attention for GPDB7 GA.

It has been most interesting!

> And I am assuming that, given the hash table is in process-local memory, there is no constraint on its size and it will grow as required.

Yeah, it's process-local and grows dynamically. We don't have to worry about it
shrinking either, as it is destroyed at command end.

Regards,
Soumyadeep (VMware)

Soumyadeep Chakraborty

Nov 3, 2022, 4:06:46 PM
to Ashwin Agrawal, Greenplum Developers
Table AM changes and minor refactoring have been committed as part of:
https://github.com/greenplum-db/gpdb/pull/14359

Regards,
Soumyadeep (VMware)