
slow bitmap heap scans on pg 9.2


Steve Singer
Apr 10, 2013, 9:49:55 AM
I'm encountering an issue where PG 9.2.4 (we also see this with 9.2.3)
is picking a plan involving a bitmap heap scan that turns out to be much
slower than a nested-loop plan using indexes.

The planner picks the hashjoin plan by default (see attached files)

Bitmap Heap Scan on public.table_b_2 b  (cost=172635.99..9800225.75 rows=8435754 width=10) (actual time=9132.194..1785196.352 rows=9749680 loops=1)
  Recheck Cond: ((b.organization_id = 3) AND (b.year = 2013) AND (b.month = 3))
  Rows Removed by Index Recheck: 313195667
  Filter: (b.product_id = 2)

That is the part that seems to be causing the problem (or at least taking most of the time, other than the final aggregation).

If I set enable_hashjoin=false and enable_mergejoin=false I get the nested-loop join plan.
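(For anyone trying to reproduce this comparison, a minimal sketch follows. The real query is not shown in the thread, so the statement below is a hypothetical reconstruction using only the join condition and predicates visible in the attached plans.)

    BEGIN;
    SET LOCAL enable_hashjoin = off;     -- session/transaction-local, nothing persists
    SET LOCAL enable_mergejoin = off;
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*)                      -- stand-in select list
    FROM table_a_2 a
    JOIN table_b_2 b ON b.a_id = a.id    -- join condition taken from the nested-loop plan
    WHERE b.organization_id = 3
      AND b.year = 2013
      AND b.month = 3
      AND b.product_id = 2;
    ROLLBACK;                            -- SET LOCAL only lasts for the transaction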

table_b is 137 GB, plus indexes that are each around 43 GB.
table_a is 20 GB.

random_page_cost = 2.0
effective_cache_size = 3500MB
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.005
cpu_operator_cost = 0.0025
work_mem = 64MB
shared_buffers = 300MB (for this output, I've also had it at 2GB)

If I bump cpu_tuple_cost to the 10-20 range it will pick the nested-loop join for some date ranges but not all. A cpu_tuple_cost of 20 doesn't sound like a sane value.

This database used to run 8.3, where it picked the nested-loop join. We used pg_upgrade to migrate to 9.2.

Any ideas why the bitmap heap scan is much slower than the planner expects?

Steve
Attachments: hashjoin.txt, nestedloop.txt

k...@rice.edu
Apr 10, 2013, 9:56:57 AM
Hi Steve,

The one thing that stands out to me is that you are working with 200GB of
data on a machine with 4-8GB of ram and you have the random_page_cost set
to 2.0. That is almost completely uncached and I would expect a value of
10 or more to be closer to reality.

Regards,
Ken


--
Sent via pgsql-performance mailing list (pgsql-pe...@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Steve Singer
Apr 10, 2013, 11:56:32 AM
Setting random_page_cost to 15 makes the planner choose the nested-loop plan (at least for the date range I tried).

I thought that the point of effective cache size was to tell the planner how likely it is for a random page to be in cache. With 200GB of data for this query and an effective cache size of 3.5 GB I would have expected that to be accounted for.

k...@rice.edu
Apr 10, 2013, 1:49:52 PM
On Wed, Apr 10, 2013 at 11:56:32AM -0400, Steve Singer wrote:
> On 13-04-10 09:56 AM, k...@rice.edu wrote:
> >On Wed, Apr 10, 2013 at 09:49:55AM -0400, Steve Singer wrote:
>
> >
> >Hi Steve,
> >
> >The one thing that stands out to me is that you are working with 200GB of
> >data on a machine with 4-8GB of ram and you have the random_page_cost set
> >to 2.0. That is almost completely uncached and I would expect a value of
> >10 or more to be closer to reality.
>
> Setting random_page_cost to 15 makes the planner choose the
> nested-loop plan (at least for the date range I tried).
>
> I thought that the point of effective cache size was to tell the
> planner how likely it is for a random page to be in cache. With
> 200GB of data for this query and an effective cache size of 3.5 GB I
> would have expected that to be accounted for.
>
For random_page_cost to be that low, the database would need to be
mostly cached. 3.5GB is almost 100X too small to do that unless your
query exhibits a large amount of locality of reference. Values for
random_page_cost between 10 and 20 are very reasonable for disk-bound
I/O scenarios.
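(A session-level experiment along these lines is one way to try such values before changing postgresql.conf; the specific value below is only a starting point taken from the suggestion above, not a recommendation from the thread.)

    SET random_page_cost = 10;   -- affects only the current session
    -- re-run EXPLAIN (ANALYZE, BUFFERS) on the problem query and see
    -- whether the planner now prefers the nested-loop plan
    RESET random_page_cost;      -- back to the configured value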

Jeff Janes
Apr 10, 2013, 2:06:16 PM
On Wed, Apr 10, 2013 at 6:49 AM, Steve Singer <ssi...@ca.afilias.info> wrote:
> I'm encountering an issue where PG 9.2.4 (we also see this with 9.2.3) is picking a plan involving a bitmap heap scan that turns out to be much slower than a nested-loop plan using indexes.
>
> The planner picks the hashjoin plan by default (see attached files)
>
> Bitmap Heap Scan on public.table_b_2 b  (cost=172635.99..9800225.75 rows=8435754 width=10) (actual time=9132.194..1785196.352 rows=9749680 loops=1)
>   Recheck Cond: ((b.organization_id = 3) AND (b.year = 2013) AND (b.month = 3))
>   Rows Removed by Index Recheck: 313195667
>   Filter: (b.product_id = 2)

I think the index recheck means your bitmap is overflowing (i.e. needing more space than work_mem) and so keeping only the pages which have at least one match, which means all rows in those pages need to be rechecked.  How many rows does the table have?  You might be essentially doing a seq scan, but with the additional overhead of the bitmap machinery.  Could you do "explain (analyze,buffers)", preferably with track_io_timing set to on?
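(In practice that request looks roughly like the following; the query is the same hypothetical stand-in as above, not the real one, and in 9.2 enabling track_io_timing per session requires superuser.)

    SET track_io_timing = on;          -- per-session; superuser-only setting
    SHOW work_mem;                     -- relevant because a lossy bitmap means work_mem was exceeded
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*)                    -- stand-in query; the real one is not shown in the thread
    FROM table_a_2 a
    JOIN table_b_2 b ON b.a_id = a.id
    WHERE b.organization_id = 3
      AND b.year = 2013
      AND b.month = 3
      AND b.product_id = 2;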

 Cheers,

Jeff

Jeff Janes
Apr 10, 2013, 2:15:34 PM
On Wed, Apr 10, 2013 at 8:56 AM, Steve Singer <ssi...@ca.afilias.info> wrote:
> On 13-04-10 09:56 AM, k...@rice.edu wrote:
>> On Wed, Apr 10, 2013 at 09:49:55AM -0400, Steve Singer wrote:
>>
>> Hi Steve,
>>
>> The one thing that stands out to me is that you are working with 200GB of
>> data on a machine with 4-8GB of ram and you have the random_page_cost set
>> to 2.0. That is almost completely uncached and I would expect a value of
>> 10 or more to be closer to reality.
>
> Setting random_page_cost to 15 makes the planner choose the nested-loop plan (at least for the date range I tried).
>
> I thought that the point of effective cache size was to tell the planner how likely it is for a random page to be in cache.


e_c_s tells it how likely it is to still be in cache the second (and subsequent) time the page is visited during the *same query*.  It doesn't tell it how likely it is to be in cache the first time it is needed in a given query.  (Also, e_c_s is irrelevant for bitmap scans, as they inherently hit every block only once)


Also, it doesn't tell it how expensive it is to bring a page into cache when it is needed. That is what random_page_cost is for. If you tell it that those fetches are going to be cheap, then it doesn't matter so much how many of them it is going to have to do.

Cheers,

Jeff

Steve Singer
Apr 10, 2013, 7:54:09 PM
table_b has 1,530,710,469 rows

Attached is the output with track_io_timing and buffers.
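(Given the lossy-bitmap hypothesis above, one way to test whether work_mem is the limiting factor is a session-level experiment like this sketch; 256MB is an arbitrary figure and assumes the machine has memory to spare.)

    SET work_mem = '256MB';   -- session-only; arbitrary larger value for the test
    -- re-run the EXPLAIN (ANALYZE, BUFFERS) from above and check whether
    -- "Rows Removed by Index Recheck" drops to (or near) zero
    RESET work_mem;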

> Cheers,
>
> Jeff

Attachment: hashjoin_buffers.txt

Steve Singer
Apr 11, 2013, 10:20:11 AM
I've done some more testing with random_page_cost=20.

This gives me the nested-loop plan for the various date ranges I've tried.

However, table_a_2 and table_b_2 are actually partitions. This query only needs to look at a single partition. When I run this same query against a different partition (a smaller one, but still bigger than cache) it picks a hash-join plan involving a seq scan of table_b but no bitmap index scan. On this partition the hash-join plans tend to take 15 minutes, versus 2 minutes when I disable hash-join plans. Bumping random_page_cost higher doesn't fix this.

I think the reason it is picking the hash-join-based plans is this:

Index Scan using table_b_1_ptid_orgid_ym_unq on table_b_1 b  (cost=0.00..503.86 rows=1 width=10) (actual time=0.016..0.017 rows=1 loops=414249)
  Index Cond: ((a.id = a_id) AND (organization_id = 2) AND (year = 2013) AND (month = 3))
  Filter: (product_id = 1)

I think we are overestimating the cost of the index scans in the inner loop. This seems similar to what was discussed a few months ago:
http://www.postgresql.org/message-id/092a01cdd230$ff6143c0$fe23cb40$@foo.me.uk
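(One hedged way to sanity-check that estimate is to look at the planner statistics for the columns behind that index. This is only a sketch; the column names are taken from the plan above, and nothing here is drawn from the linked discussion.)

    SELECT attname, n_distinct, correlation
    FROM pg_stats
    WHERE schemaname = 'public'
      AND tablename  = 'table_b_1'
      AND attname IN ('a_id', 'organization_id', 'year', 'month');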

This version of PG should have 3e9960e9d935e7e applied. I am trying to
get the database copied to a machine where I can easily switch PG
versions and test this against something prior to that commit and also
against a 9.3 build.

Steve


>> Cheers,
>>
>> Jeff

Attachment: nestedloop_1.txt