BRIN: pages_per_range


Soumyadeep Chakraborty

Jun 16, 2023, 3:44:03 PM
to Greenplum Developers
Hello,

This is to discuss what the default value for pages_per_range should be
for BRIN indexes in GPDB.

First, let's focus our attention on heap tables. AO/CO tables are a
different ball game - let's discuss that later in the thread.

Upstream Postgres has set this to 128, which means that a BRIN range
covers 128 heap blocks. This translates to (128 * 8K) = 1M of disk
space, assuming the default PG block size of 8K. This is exactly the
granularity of a "storage index" region in Oracle Exadata, which is
what BRIN was modelled after [8].

There is not a lot of literature on BRIN, and few explanations of why
this default was chosen.

For GPDB, we have to keep in mind that the default block size is 32K.

The original authors did discuss pages_per_range. These were the
following schools of thought:

1. Align the default value with disk readahead [1]. The idea is based on
the observation that we won't save on IO if pages_per_range < RA, as we
would read #pages = RA anyway.

Considering device defaults this is usually:
(256 * 512 byte sectors) = 128K. This maps to pages_per_range = 16 for
Postgres (and 4 for GPDB).

[1] also mentions that it is common wisdom to increase RA settings from
their defaults. This is indeed true. [2] claims that it should be set
between 4096 (4096 * 512 = 2M) and 16384 (16384 * 512 = 8M) on modern
hardware. GPDB's documentation also advises a value of 16384 [3]. That
would map to a pages_per_range = 1024 for PG and 256 for GPDB.

Compared to SSDs, the gain from readahead will be more pronounced for
higher-latency storage such as HDDs, network storage or even
virtualized storage. For example, Amazon recommends a setting of 1M for
their hot and cold HDD EBS instances [4].

We can see that the PG default of 128 pages aligns with a RA setting of
2048 (2048 * 512 = 1M).
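
For reference, the RA -> pages_per_range mappings above can be worked out
with a quick query (the RA values are just the ones discussed; 8K and 32K
are the PG and GPDB block sizes):

select ra_sectors,
       ra_sectors * 512 / 8192  as ppr_pg_8k,
       ra_sectors * 512 / 32768 as ppr_gpdb_32k
from (values (256), (2048), (4096), (16384)) as t(ra_sectors);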

2. The second school of thought looks at the problem from the angle of
saving CPU cycles by setting pages_per_range to the maximum granularity
[5], i.e. pages_per_range = 1. Even if we end up doing IO = RA >
pages_per_range, we may save a lot of CPU, since there are fewer blocks
to process in the BitmapHeapScan that accompanies the BRIN index scan
(BitmapIndexScan).

Now, after a certain point cpu_run_cost can start to outweigh the IO
cost. If we read N pages, with a tuple density of d, then:

disk_io_cost ~= N * seq_page_cost = N
cpu_run_cost = cpu_per_tuple * tuples_fetched
= 0.0125 * (N * d)
[from cost_bitmap_heap_scan()]
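
To get a feel for where the crossover happens: with seq_page_cost = 1, the
per-page CPU cost 0.0125 * d exceeds the per-page IO cost once d > 80
tuples/page. A quick illustration (the densities here are just examples,
including the per-block densities that come up later in this thread):

select d as tuples_per_page,
       1.0        as disk_io_cost_per_page,
       0.0125 * d as cpu_run_cost_per_page
from (values (80), (248), (909)) as t(d);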

With a workload where we select the first page of every 4 pages w/ our
predicate, ppr=1 wins purely due to the tuples eliminated and CPU
savings over ppr=4 (see results attached). Please do note that this type
of workload might not be a regular BRIN workload, as the index column
has super-low correlation (0.0016), as opposed to the desired value of 1.

[7] seems to corroborate this opinion - setting ppr=32, down from 128,
helps reduce the bitmap heap scan's execution time in that plan shape,
where the bitmap heap scans have lots of loops.

However, going down to 8 increases the bitmap index scan's execution time
ever so slightly, and that cost seems to dominate (due to the loops). So,
if there is an argument against higher granularity, this is it. But do
note that not many tuples were processed there; things might have been
different with more tuples.

3. The author's original design goal was to bring down the index size as
much as possible [6].

However, do we really need to worry about the index size? A low index
size is desirable, especially when shared_buffers defaults to 125M, but
how much does it matter in practice?

Considering a single integer index:

Number of revmap entries = Nrev = Ntup = Nranges = S / Ppr
where S is the size of the relation in heap blocks,
where Ntup is the number of index tuples,
where Ppr is the number of pages per range

Number of revmap pages = Prev = Nrev / Nrevmax,
where Nrevmax = maximum number of index tids in revmap page
= REVMAP_PAGE_MAXITEMS = 5454. So, Prev = Nrev / 5454 = S / (5454 * Ppr)

Number of data pages (Pdata) depends on the index key types. For a single
int index key, we can fit upto 1636 items (i.e. ranges) on a data page.
So, Pdata = Nranges / 1636 = (S / Ppr) / 1636

Keeping these in mind, we end up with:

 size_of_rel | ppr | prev | pdata | index_size | index_size_ratio (%)
-------------+-----+------+-------+------------+----------------------
 1G          |   1 |    7 |    21 | 0.88M      | 0.09
 1G          | 128 |    1 |     1 | 0.06M      | 0.01
 128G        |   1 |  770 |  2564 | 104.2M     | 0.08
 128G        | 128 |    7 |    21 | 0.86M      | 0.01
 1T          |   1 | 6144 | 20511 | 832M       | 0.08
 1T          | 128 |   48 |   161 | 6.5M       | 0.01

The index size is so insignificant, it's almost always < 1% of the data
size.
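
For anyone who wants to reproduce these figures, here is a sketch of the
arithmetic (32K blocks, 5454 revmap items per page and 1636 ranges per data
page, per the derivation above; rounding may differ slightly from the table):

select rel_gb, ppr,
       ceil(s / (5454.0 * ppr)) as prev,
       ceil(s / (1636.0 * ppr)) as pdata,
       round((ceil(s / (5454.0 * ppr)) + ceil(s / (1636.0 * ppr))) * 32 / 1024.0, 2)
           as index_size_mb
from (select rel_gb, ppr, rel_gb * 1024 * 1024 / 32.0 as s
        from (values (1), (128), (1024)) r(rel_gb),
             (values (1), (128)) p(ppr)) t
order by rel_gb, ppr;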

The other thing to keep in mind is that even for Ppr = 1, where we have
many more index pages, an index scan would never request that many pages
into the buffer cache simultaneously - it only ever touches the metapage,
a revmap page and a data page, i.e. at most 3 pages at a time. The same
applies to summarization etc.

Parting thoughts:

* I will share more numbers in this thread, and an analysis for AO/CO
tables.

* pages_per_range is very workload sensitive. Ultimately, the choice is
between scaling the value down to 32 for GPDB (accounting for BLCKSZ),
aligning to default readahead (ppr=4) or being aggressive and setting it to 1.

* Readahead might not be a big factor, given more modern hardware trends.

Regards,
Soumyadeep (VMware)

[1] https://www.postgresql.org/message-id/20140623170751.GA5032%40eldon.alvh.no-ip.org
[2] PostgreSQL 10 High Performance - General Linux filesystem tuning chapter
[3] https://docs.vmware.com/en/VMware-Tanzu-Greenplum/7/greenplum-database/GUID-best_practices-sysconfig.html#io-configuration
[4] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSPerformance.html
[5] https://www.postgresql.org/message-id/53A87BAF.5020908%40vmware.com
[6] https://www.postgresql.org/message-id/20130617201248.GE3537%40eldon.alvh.no-ip.org
[7] https://www.postgresql.org/message-id/flat/CACM-Oyc-rOkzP6kV1POY%2BfiZcRQ%2Bcqd5iWqR61zN6nUVRVt3sw%40mail.gmail.com#e08107a3909fd427d3768ee0c5c69766
[8] https://docs.oracle.com/en/engineered-systems/exadata-database-machine/sagug/exadata-storage-server-software-introduction.html#GUID-11230200-2220-4FA3-9EFF-1792F992FBA5
brin_results.txt

Soumyadeep Chakraborty

Jun 16, 2023, 9:33:56 PM
to Greenplum Developers
Attaching the correct version of brin_results.txt.
brin_results.txt

Ashwin Agrawal

Jun 21, 2023, 6:49:55 AM
to Soumyadeep Chakraborty, Greenplum Developers
Thank you Deep for the detailed analysis and points. Given there is no clear winner or obvious downside
on either side of the settings, I would recommend we go with PPR = 32 for heap, i.e. the upstream default
adjusted for our default BLCKSZ setting. We just need a reasonable default, and it seems we have one
based on all the analysis you provided. Finally, users do have control to modify the value as they see
fit on a per-use-case basis.
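
For illustration, that default would be equivalent to what one gets today with an explicit setting such as
(table/column names hypothetical):

CREATE INDEX ON my_heap_table USING brin (my_col) WITH (pages_per_range = 32);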

--
Ashwin Agrawal (VMware)

Soumyadeep Chakraborty

Jun 26, 2023, 8:26:58 PM
to Ashwin Agrawal, Greenplum Developers
Hi Ashwin,

Thanks for the input. My thoughts are similar. I'm collecting some TPC-H
numbers and will share them soon to supplement our discussion.

About AO/CO tables:

AO/CO tables are different from heap tables in the sense that they are
composed of logical heap blocks, where each logical heap block holds
AO_MAX_TUPLES_PER_HEAP_BLOCK (32768) tuples. This assumes the table was
bulk loaded; smaller inserts can cause holes and under-utilized logical
heap blocks.

Now consider the following TPC-H table:

psql postgres -c "\d lineitem"
Table "public.lineitem"
Column | Type | Collation | Nullable | Default
-----------------+-----------------------+-----------+----------+---------
l_orderkey | bigint | | |
l_partkey | integer | | |
l_suppkey | integer | | |
l_linenumber | integer | | |
l_quantity | numeric(15,2) | | |
l_extendedprice | numeric(15,2) | | |
l_discount | numeric(15,2) | | |
l_tax | numeric(15,2) | | |
l_returnflag | character(1) | | |
l_linestatus | character(1) | | |
l_shipdate | date | | |
l_commitdate | date | | |
l_receiptdate | date | | |
l_shipinstruct | character(25) | | |
l_shipmode | character(10) | | |
l_comment | character varying(44) | | |
Distributed by: (l_orderkey)

Now, if we were to load this table in bulk, and if it were a heap table,
at most 248 tuples would fit in one heap block. I bulk loaded w/:

insert into lineitem
select 1,1,1,1,1,1,1,1,'a','b','1992-01-28','1992-01-01','1992-01-17',
'hello','shipped', null from generate_series(1, 1000000);

Check w/:
select right(split_part(ctid::text, ',', 1), -1) as blknum,
count(*) from lineitem group by 1;

If the same table were AO/CO, a "heap block" would contain 32768/248 ~=
132 times more tuples!

So for this table:
ppr of 132 for heap ~= ppr of 1 for AO/CO.

This means:
disk_io_cost ~= N * seq_page_cost = N = 132
cpu_run_cost = cpu_per_tuple * tuples_fetched
= 0.0125 * (N * d) = 0.0125 * (132 * 32768) = 54067

We can clearly see that the CPU run cost will dominate. One may argue
that the disk_io_cost doesn't take into account the fact that we are
dealing with logical heap blocks - i.e. we should be scaling
disk_io_cost by a reasonable factor, taking into account table type,
schema, compression settings etc. However, that is a problem for another
day.

If we consider an extreme example: with a single int column heap table,
we would fit 909 tuples into a block. So the numbers would be:
ppr of 36 for heap ~= ppr of 1 for AO/CO.

disk_io_cost ~= N * seq_page_cost = N = 36
cpu_run_cost = cpu_per_tuple * tuples_fetched
= 0.0125 * (N * d) = 0.0125 * (36 * 32768) = 14745

Given this massive gap, I think the safest thing to do is to have the
default: ppr=1 for AO/CO tables.
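
For what it's worth, until a default change lands this can already be set
per index; a sketch (table and column choices are just illustrative):

create table lineitem_ao (like lineitem) using ao_column with (compresstype=zstd)
distributed by (l_orderkey);
create index on lineitem_ao using brin (l_shipdate) with (pages_per_range = 1);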

I will be sharing TPC-H numbers for AO/CO tables as well in the near future.

Regards,
Soumyadeep (VMware)

Haolin Wang

Jun 27, 2023, 8:55:37 AM
to Soumyadeep Chakraborty, Ashwin Agrawal, Greenplum Developers


> On Jun 27, 2023, at 08:26, Soumyadeep Chakraborty <soumyad...@gmail.com> wrote:
Thanks for the detailed analysis!

Vote for ppr = 1 by default for AO/CO too. The idea (or assumption) is that the random index I/Os can be
ignored, since the BRIN index on AO/CO is usually small enough to be easily loaded into memory, so the
few random index reads are negligible. Then ppr=1 has the fewest table reads (and tuples to be processed)
regardless of OS pagecache readahead.
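
For example, the footprint can be sanity-checked with something like the following (names are placeholders):

select pg_size_pretty(pg_relation_size('my_ao_brin_idx')) as brin_index_size,
       pg_size_pretty(pg_table_size('my_ao_table'))       as table_size;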

-Haolin


Ashwin Agrawal

Jun 27, 2023, 1:56:11 PM
to Haolin Wang, Soumyadeep Chakraborty, Greenplum Developers
+1 from my side as well for PPR=1 for AO and CO tables. Even though a single physical AO or CO block may
contain tuples for many logical heap blocks (and hence be read into memory in one read), summarizing at the
lowest possible granularity of 32768 tuples makes perfect sense for AO/CO tables, to avoid unnecessarily
processing large numbers of tuples.



For quick fun, I tried this table and query with different PPR values...

create table t_list as select generate_series(1,100000000) id;
drop table if exists table_ao_brin;
create table table_ao_brin (id int, descr text, created_at timestamp) using ao_row with (compresstype=zstd);
insert into table_ao_brin select id, 'test ' || id, '2012-01-01 0:00'::timestamptz + id * interval '2 sec' from t_list;

DROP INDEX table_ao_brin_idx_created_at;
CREATE INDEX table_ao_brin_idx_created_at ON table_ao_brin USING brin(created_at);
\timing
SELECT date_trunc('day', created_at), min(id), max(id) FROM table_ao_brin
WHERE created_at BETWEEN '2017-02-01'::timestamp AND '2017-02-28 11:59:59'::timestamp
GROUP BY 1 ORDER BY 1;
\timing

PPR
1   Time: 702.645 ms
10  Time: 737.637 ms
100 Time: 1777.91 ms (00:01.778)
128 Time: 1950.09 ms (00:01.950)
200 Time: 3072.57 ms (00:03.073)
300 Time: 4393.03 ms (00:04.393)


--
Ashwin Agrawal (VMware)