Hello,
This is to discuss what the default value for pages_per_range should be
for BRIN indexes in GPDB.
First, let's focus our attention on heap tables. AO/CO tables are a
different ball game - let's discuss that later in the thread.
Upstream Postgres has set this to 128, which means that a BRIN range
would cover 128 heap blocks. This translates to (128 * 8K) = 1M of disk
space, assuming the default PG block size of 8K. This is exactly
equivalent to the granularity of a "storage index" region in Oracle
Exadata, which is what BRIN was modelled after [8].
There is not a lot of literature on BRIN out there, and there are few
explanations as to why this default was chosen.
For GPDB, we have to keep in mind that the default block size is 32K.
The original authors did discuss pages_per_range. The following schools
of thought emerged:
1. Align the default value with disk readahead [1]. The idea is based on
the observation that we won't save on IO if pages_per_range is smaller
than the readahead (RA) window, as we would read RA worth of pages anyway.
Considering device defaults, this is usually (256 * 512-byte sectors) =
128K, which maps to pages_per_range = 16 for Postgres (and 4 for GPDB).
[1] also mentions that it is common wisdom to increase RA settings from
their defaults. This is indeed true. [2] claims that it should be set
between 4096 (4096 * 512 = 2M) and 16384 (16384 * 512 = 8M) on modern
hardware. GPDB's documentation also advises a value of 16384 [3]. That
would map to pages_per_range = 1024 for PG and 256 for GPDB.
Compared to SSDs, the gain from readahead will be more pronounced for
higher-latency storage such as HDDs, network storage, or even
virtualized storage. For example, Amazon recommends a readahead setting
of 1M for their throughput optimized and cold HDD EBS volumes [4].
We can see that the PG default of 128 pages aligns with an RA setting of
2048 (2048 * 512 = 1M); the sketch below works through these conversions.
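Here is a minimal Python sketch of the conversion arithmetic above (my own
illustration, not part of any attached results; it assumes RA is expressed
in 512-byte sectors, as reported by e.g. blockdev --getra):

# Number of database blocks covered by one readahead window, i.e. the
# pages_per_range value that matches a given RA setting.
def ra_to_ppr(ra_sectors, blcksz):
    ra_bytes = ra_sectors * 512
    return max(1, ra_bytes // blcksz)

for ra in (256, 2048, 4096, 16384):
    print(ra, ra_to_ppr(ra, 8192), ra_to_ppr(ra, 32768))
# RA=256   (128K) -> ppr 16   (PG, 8K blocks) / 4   (GPDB, 32K blocks)
# RA=2048  (1M)   -> ppr 128                  / 32
# RA=4096  (2M)   -> ppr 256                  / 64
# RA=16384 (8M)   -> ppr 1024                 / 256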
2. The second school of thought looks at the problem from the angle of
saving CPU cycles, by setting pages_per_range to the maximum granularity
[5], i.e. pages_per_range = 1. Even if we still end up reading RA worth
of pages (IO = RA > pages_per_range), we might save a lot on CPU, as we
have fewer blocks to process in the BitmapHeapScan that accompanies the
BRIN index scan (BitmapIndexScan).
Now, after a certain point, cpu_run_cost can start to outweigh the IO
cost. If we read N pages with a tuple density of d tuples per page, then:
disk_io_cost ~= N * seq_page_cost = N
cpu_run_cost  = cpu_per_tuple * tuples_fetched
              = 0.0125 * (N * d)
[from cost_bitmap_heap_scan()]
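To make that concrete, here is a minimal sketch (my own arithmetic, not
taken from the attached results) that plugs numbers into the two
simplified terms above:

# Simplified cost terms, assuming seq_page_cost = 1 and
# cpu_per_tuple ~= 0.0125 as quoted above.
def disk_io_cost(n_pages, seq_page_cost=1.0):
    return n_pages * seq_page_cost

def cpu_run_cost(n_pages, tuples_per_page, cpu_per_tuple=0.0125):
    return cpu_per_tuple * n_pages * tuples_per_page

# Both terms scale linearly with N, so under these simplifications the
# crossover depends only on tuple density: the CPU term overtakes the IO
# term once d > 1 / 0.0125 = 80 tuples per page.
for d in (40, 80, 160, 320):
    print(d, disk_io_cost(1000), cpu_run_cost(1000, d))
# d=40  -> io 1000.0, cpu  500.0
# d=80  -> io 1000.0, cpu 1000.0
# d=160 -> io 1000.0, cpu 2000.0
# d=320 -> io 1000.0, cpu 4000.0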
With a workload where our predicate selects the first page of every 4
pages, ppr=1 wins over ppr=4 purely due to the tuples eliminated and the
CPU savings (see results attached). Please do note that this type of
workload might not be a typical BRIN workload, as the index column has
super-low correlation (0.0016), as opposed to the desired value of 1.
[7] seems to corroborate this opinion - lowering ppr from 128 to 32
helps reduce the bitmap heap scan's execution time in a plan shape where
the bitmap heap scan runs for a large number of loops.
However, going down to 8 increases the bitmap index scan's execution time
ever so slightly, and that cost seems to dominate (due to the loops). So,
if there is an argument against going with higher granularity, this is it.
But do note that not many tuples were processed in that test; things
might have been different with more tuples.
3. The author's original design goal was to bring down the index size as
much as possible [6].
However, do we really need to worry about the index size? A low index
size is certainly desirable, especially with shared_buffers defaulting to
125M, but consider the numbers below.
Considering a single-integer index key:
Number of revmap entries = Nrev = Ntup = Nranges = S / Ppr,
  where S    = size of the relation in heap blocks,
        Ntup = number of index tuples,
        Ppr  = number of pages per range.
Number of revmap pages = Prev = Nrev / Nrevmax,
  where Nrevmax = maximum number of index TIDs in a revmap page
                = REVMAP_PAGE_MAXITEMS = 5454.
So, Prev = Nrev / 5454 = S / (5454 * Ppr).
The number of data pages (Pdata) depends on the index key type. For a
single int index key, we can fit up to 1636 items (i.e. ranges) on a data
page. So, Pdata = Nranges / 1636 = (S / Ppr) / 1636.
Keeping these in mind, we end up with:
 size_of_rel | ppr | prev | pdata | index_size | index_size_ratio (%)
-------------+-----+------+-------+------------+----------------------
 1G          |   1 |    7 |    21 | 0.88M      | 0.09
 1G          | 128 |    1 |     1 | 0.06M      | 0.01
 128G        |   1 |  770 |  2564 | 104.2M     | 0.08
 128G        | 128 |    7 |    21 | 0.86M      | 0.01
 1T          |   1 | 6144 | 20511 | 832M       | 0.08
 1T          | 128 |   48 |   161 | 6.5M       | 0.01
The index size is so insignificant that it's almost always < 1% of the
data size.
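For reference, here is a minimal Python sketch (my own) that reproduces
the table above from the formulas, assuming GPDB's 32K block size (which
appears to be what the quoted constants of 5454 and 1636 correspond to);
rounding and metapage handling may cause small differences from the table:

import math

BLCKSZ = 32 * 1024               # GPDB default block size
REVMAP_PAGE_MAXITEMS = 5454      # index TIDs per revmap page
RANGES_PER_DATA_PAGE = 1636      # single-int ranges per data page

def brin_index_size(rel_size_bytes, pages_per_range):
    s = rel_size_bytes // BLCKSZ                       # S: heap blocks
    nranges = math.ceil(s / pages_per_range)           # Nranges = Ntup = Nrev
    prev = math.ceil(nranges / REVMAP_PAGE_MAXITEMS)   # Prev: revmap pages
    pdata = math.ceil(nranges / RANGES_PER_DATA_PAGE)  # Pdata: data pages
    return prev, pdata, (1 + prev + pdata) * BLCKSZ    # metapage + revmap + data

for label, rel_bytes in (("1G", 1024**3), ("128G", 128 * 1024**3), ("1T", 1024**4)):
    for ppr in (1, 128):
        prev, pdata, isz = brin_index_size(rel_bytes, ppr)
        ratio = 100.0 * isz / rel_bytes
        print(f"{label:>4} ppr={ppr:<3} prev={prev:>5} pdata={pdata:>5} "
              f"size={isz / 1024**2:8.2f}M ratio={ratio:.2f}%")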
The other thing to keep in mind is that even with Ppr = 1, where we have
many more index pages, an index scan would never request that many pages
into the buffer cache simultaneously - it only ever touches the metapage,
a revmap page and a data page, i.e. at most 3 pages at a time. The same
applies to summarization etc.
Parting thoughts:
* I will share more numbers in this thread, and an analysis for AO/CO
tables.
* pages_per_range is very workload-sensitive. Ultimately, the choice is
between scaling the Postgres default down to 32 for GPDB (accounting for
the 4x BLCKSZ), aligning with the default readahead (ppr=4), or being
aggressive and setting it to 1.
* Readahead might not be a big factor, given more modern hardware trends.
Regards,
Soumyadeep (VMware)
[1] https://www.postgresql.org/message-id/20140623170751.GA5032%40eldon.alvh.no-ip.org
[2] PostgreSQL 10 High Performance - General Linux filesystem tuning chapter
[3] https://docs.vmware.com/en/VMware-Tanzu-Greenplum/7/greenplum-database/GUID-best_practices-sysconfig.html#io-configuration
[4] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSPerformance.html
[5] https://www.postgresql.org/message-id/53A87BAF.5020908%40vmware.com
[6] https://www.postgresql.org/message-id/20130617201248.GE3537%40eldon.alvh.no-ip.org
[7] https://www.postgresql.org/message-id/flat/CACM-Oyc-rOkzP6kV1POY%2BfiZcRQ%2Bcqd5iWqR61zN6nUVRVt3sw%40mail.gmail.com#e08107a3909fd427d3768ee0c5c69766
[8] https://docs.oracle.com/en/engineered-systems/exadata-database-machine/sagug/exadata-storage-server-software-introduction.html#GUID-11230200-2220-4FA3-9EFF-1792F992FBA5