Recheck condition

Josh Harrison

unread,

Nov 28, 2007, 1:18:56 PM11/28/07

to

Hi,
Sorry if my question is silly.
When I use explain analyze command I come across 'recheck condition' in some places.
I googled for this but didn't get any solid answers.

What is recheck condition and when does the query planner choose this?
Thanks
josh

Martijn van Oosterhout

unread,

Nov 28, 2007, 1:24:59 PM11/28/07

to

On Wed, Nov 28, 2007 at 01:18:56PM -0500, Josh Harrison wrote:
> Hi,
> Sorry if my question is silly.
> When I use explain analyze command I come across 'recheck condition' in some
> places.
> I googled for this but didn't get any solid answers.

Some indexes are inexact, i.e. they may sometimes return tuples that
don't actually match the index condition. This also happens with bitmap
scans, because it'll return anything in the bitmap which will probably
be more than what you asked for. The recheck just means that the
planner retests the index condition on the result to make sure you only
get the rows you wanted.

Have a nice day,
--
Martijn van Oosterhout <kle...@svana.org> http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
> -- John F Kennedy

signature.asc

Josh Harrison

unread,

Nov 28, 2007, 2:20:11 PM11/28/07

to

>Some indexes are inexact, i.e. they may sometimes return tuples that
>don't actually match the index condition.

What causes an index to be inexact. When you create an index and vacuum it regularly, it is suppose to be correct....right??

>This also happens with bitmap
>scans, because it'll return anything in the bitmap which will probably
>be more than what you asked for. The recheck just means that the
>planner retests the index condition on the result to make sure you only
>get the rows you wanted

So does recheck condition affect the performance of the queries since it basically rechecks the condition?
Also does it goes to the heap to retest ?

For example for this query
explain analyze select count(*) from foo where foo_id=1 I get the following plan

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1949.84..1949.85 rows=1 width=0) (actual time=7.996..7.996 rows=1 loops=1)
   -> Bitmap Heap Scan on foo (cost=277.45..1924.94 rows=9959 width=0) (actual time=1.903..5.270 rows=10020 loops=1)
         Recheck Cond: (foo_id = 1::numeric)
         -> Bitmap Index Scan on foo_pk (cost=0.00..274.96 rows=9959 width=0) (actual time=1.864..1.864 rows=10020 loops=1)
               Index Cond: (foo_id = 1::numeric)
Total runtime: 8.062 ms

Can you please explain to me with respect to this example what is happening here? This is a small table but for big tables the performance is not very good. Does recheck condition brings down the query performance?

Thanks
josh

Martijn van Oosterhout

unread,

Nov 28, 2007, 3:03:27 PM11/28/07

to

On Wed, Nov 28, 2007 at 02:20:11PM -0500, Josh Harrison wrote:
> >Some indexes are inexact, i.e. they may sometimes return tuples that
> >don't actually match the index condition.
>
> What causes an index to be inexact. When you create an index and vacuum it
> regularly, it is suppose to be correct....right??

The nature of the beast. For example, if you create an index on large
integer arrays it doesn't store the actual array in the index, but a
hashed version thereof. When we scan the index because of this hashing
it might match other arrays that shouldn't be. Hence the recheck.

Similarly for geometry indexes. The index only stores bounding boxes
and an intersection test might hit the bounding box but not match the
actual query.

> So does recheck condition affect the performance of the queries since it
> basically rechecks the condition?
> Also does it goes to the heap to retest ?

At the time of the recheck the data is already in memory. So no, it
doesn't go back to the heap.

> For example for this query
> explain analyze select count(*) from foo where foo_id=1 I get the following
> plan

It isn't the recheck that's costing it, it's probably just that you're
matching a lot of rows. A bitmap scan classically needs a recheck
because if a lot of rows need to be stored it might remember only
blocks 2044-2060. It then needs to recheck each row as it comes through
to make sure it really matches the conditions.

This query is 8ms, I imagine when it takes a long time it's matching
lots of rows?

signature.asc

Martijn van Oosterhout

unread,

Nov 29, 2007, 5:55:20 AM11/29/07

to

Please always CC the list so other people can respond.

On Wed, Nov 28, 2007 at 10:21:39PM -0500, Josh Harrison wrote:
> > It isn't the recheck that's costing it, it's probably just that you're
> > matching a lot of rows. A bitmap scan classically needs a recheck
> > because if a lot of rows need to be stored it might remember only
> > blocks 2044-2060. It then needs to recheck each row as it comes through
> > to make sure it really matches the conditions.
> >

> What is this number 2044-2060? Is this a fixed number in postgres?

Ofcourse not. Have you read the documentation on explain yet?
http://www.postgresql.org/docs/8.2/static/using-explain.html

The point is that the bitmap may have an inexact representation of the
tuples that match. If your scan indicates you'll match 10 million
entries and you only want to use 16KB for your bitmap, obviously you
can't store all the locations exactly.

> For example if I have a table Person with 3 fields (name,city_id,age). And
> the table contains 1000 rows. The table has 2 indexes city_id and age
> If I have a query :
> SELECT * FROM PERSON WHERE city_id=5 AND AGE=30

The answer is "it depends". Postgres has a cost based planner, it will
estimate the costs of each different way of getting the result and use
the cheapest. The factors that are important is how many rows each
condition will match.

Given your table is only 8MB, the system may decide that it's all in
memory and just do a scan.

Or it maybe see that city_id is almost unique and use that index and
check the matches for the second condition. Or vice-versa.

Or maybe it will scan both indexes, calculate the intersection and then
looks up the matches in the heap (with a recheck).

> In other words, Will this query cause 1000 random heap access or 10 random
> heap access ?

I don't know, run it and see.

signature.asc

Josh Harrison

unread,

Nov 29, 2007, 8:01:13 AM11/29/07

to

> For example if I have a table Person with 3 fields (name,city_id,age). And
> the table contains 1000 rows. The table has 2 indexes city_id and age
> If I have a query :
> SELECT * FROM PERSON WHERE city_id=5 AND AGE=30

The answer is "it depends". Postgres has a cost based planner, it will
estimate the costs of each different way of getting the result and use
the cheapest. The factors that are important is how many rows each
condition will match.

Given your table is only 8MB, the system may decide that it's all in
memory and just do a scan.

Or it maybe see that city_id is almost unique and use that index and
check the matches for the second condition. Or vice-versa.

Or maybe it will scan both indexes, calculate the intersection and then
looks up the matches in the heap (with a recheck).

Okay....So If I have a query like the above and the query plan shows a 'recheck condition' and bitmap scan, then does that mean it scans the indexes first to get the intermediate results and goto the heap only for the final data?

Thanks

jo

Alvaro Herrera

unread,

Nov 29, 2007, 8:15:07 AM11/29/07

to

Josh Harrison escribió:

> >
> > > For example if I have a table Person with 3 fields (name,city_id,age).
> > And
> > > the table contains 1000 rows. The table has 2 indexes city_id and age
> > > If I have a query :
> > > SELECT * FROM PERSON WHERE city_id=5 AND AGE=30
>

> Okay....So If I have a query like the above and the query plan shows a
> 'recheck condition' and bitmap scan, then does that mean it scans the
> indexes first to get the intermediate results and goto the heap only for the
> final data?

Yes.

If the table actually contains 1000 rows, the most likely outcome is
that the bitmaps would not be lossy and therefore no rechecking is
needed at all. (Tuple bitmaps become lossy only if they have to store a
lot of tuples, in which case they forget the idea of storing each tuple,
and instead "compress" the representation to storing only the page
numbers where matching tuples are to be found).

Note however, that even if the bitmaps are not lossy, the visit to the
heap is still required, because the need to check for visibility.

--
Alvaro Herrera Valdivia, Chile ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees." (E. Dijkstra)

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Josh Harrison

unread,

Nov 30, 2007, 7:51:31 AM11/30/07

to

On Nov 29, 2007 8:15 AM, Alvaro Herrera <alvh...@alvh.no-ip.org> wrote:

Josh Harrison escribió:

> >
> > > For example if I have a table Person with 3 fields (name,city_id,age).
> > And
> > > the table contains 1000 rows. The table has 2 indexes city_id and age
> > > If I have a query :
> > > SELECT * FROM PERSON WHERE city_id=5 AND AGE=30
>

> Okay....So If I have a query like the above and the query plan shows a
> 'recheck condition' and bitmap scan, then does that mean it scans the
> indexes first to get the intermediate results and goto the heap only for the
> final data?

Yes.

If the table actually contains 1000 rows, the most likely outcome is
that the bitmaps would not be lossy and therefore no rechecking is
needed at all. (Tuple bitmaps become lossy only if they have to store a
lot of tuples, in which case they forget the idea of storing each tuple,
and instead "compress" the representation to storing only the page
numbers where matching tuples are to be found).

Note however, that even if the bitmaps are not lossy, the visit to the
heap is still required, because the need to check for visibility.

Thanks...

I have 1 more question in the same line...

Query1
SELECT person_id FROM person WHERE (column1=1 AND column2='62')
INTERSECT
SELECT person_id FROM person WHERE (column1=1 AND column2='189')

There is an index created as person_idx(column1,column2)

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SetOp Intersect (cost=1750719.48..1769378.35 rows=373177 width=4) (actual time=42913.626..47247.650 rows=6352 loops=1)
   -> Sort (cost=1750719.48..1760048.92 rows=3731774 width=4) (actual time=42913.537..45838.472 rows=3756726 loops=1)
         Sort Key: "*SELECT* 1".patient_id
         Sort method: external merge Disk:73432kB
         -> Append (cost= 17886.42..1209431.67 rows=3731774 width=4) (actual time=1474.995..32215.493 rows=3756726 loops=1)
               -> Subquery Scan "*SELECT* 1" (cost= 17886.42..496377.90 rows=381993 width=4) (actual time=1474.993..4936.240 rows=327498 loops=1)
                     -> Bitmap Heap Scan on person   (cost=17886.42..492557.97 rows=381993 width=4) (actual time= 1474.990..4735.972 rows=327498 loops=1)
                           Recheck Cond: ((column1 = 1) AND ((column2)::text = '62'::text))
                           -> Bitmap Index Scan on person_idx (cost= 0.00..17790.92 rows=381993 width=0) (actual time=1469.508..1469.508 rows=327498 loops=1)
                                 Index Cond: ((column1 = 1) AND ((column2)::text = '62'::text))
               -> Subquery Scan "*SELECT* 2" (cost=156754.24..713053.77 rows=3349781 width=4) (actual time=4142.577..25518.305 rows=3429228 loops=1)
                     -> Bitmap Heap Scan on person   (cost= 156754.24..679555.96 rows=3349781 width=4) (actual time=4142.573..23493.596 rows=3429228 loops=1)
                           Recheck Cond: ((column1 = 1) AND ((column2)::text = '189'::text))
                           -> Bitmap Index Scan on person_idx (cost=0.00..155916.80 rows=3349781 width=0) (actual time=4136.948..4136.948 rows=3429228 loops=1)
                                 Index Cond: ((column1 = 1) AND ((column2)::text = '189'::text))
Total runtime: 47250.501 ms


Question:
In this query Intersection is used. How does postgres handle this? The steps in the above query are
1.find all tuples that match column1=1 AND column2='62'
2. find all tuples that match column1=1 AND column2='189'
3. Find the intersection of the above 2
Does it go to the heap even to get the intermediate results (1 & 2) ?
or
Does it do the first 2 steps using index and go to the heap for the final data?

Also what does Sort method: external merge Disk:73432kB mean? Should I have to modify this to make this query run faster? Postgres takes 4 times slower than Oracle to return this query. Is there a way to make this faster?

Thanks
jo

Alvaro Herrera

unread,

Nov 30, 2007, 7:55:42 AM11/30/07

to

Josh Harrison escribió:

> Thanks...
> I have 1 more question in the same line...
>

> *Query1*

> SELECT person_id FROM person WHERE (column1=1 AND column2='62')
> INTERSECT
> SELECT person_id FROM person WHERE (column1=1 AND column2='189')

Hmm, I think INTERSECT (and EXCEPT) is pretty stupid in Postgres in
general. Maybe INTERSECT ALL could be a bit faster, because it can
avoid the sort steps. Make sure you eliminate duplicates if they are a
concern.

--
Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere." (Lamar Owen)

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Josh Harrison

unread,

Nov 30, 2007, 8:21:18 AM11/30/07

to

On Nov 30, 2007 7:55 AM, Alvaro Herrera <alvh...@alvh.no-ip.org> wrote:

Josh Harrison escribió:

> Thanks...
> I have 1 more question in the same line...
>
> *Query1*
> SELECT person_id FROM person WHERE (column1=1 AND column2='62')
> INTERSECT
> SELECT person_id FROM person WHERE (column1=1 AND column2='189')

Hmm, I think INTERSECT (and EXCEPT) is pretty stupid in Postgres in
general. Maybe INTERSECT ALL could be a bit faster, because it can
avoid the sort steps. Make sure you eliminate duplicates if they are a
concern.

I get the same plan(see below) with 'sort' for 'intersect all' operation too. Why is intersect not an effecient way? Is there any other way this query/index can be written/created so that I can get the intersect results in an efficient way?
Thanks
jo

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SetOp Intersect All (cost=1750719.48..1769378.35 rows=373177 width=4) (actual time=41065.459..45469.038 rows=128562 loops=1)
   -> Sort (cost=1750719.48..1760048.92 rows=3731774 width=4) (actual time=41065.375..44027.342 rows=3756726 loops=1)

Sort Key: "*SELECT* 1".patient_id

         Sort Method: external merge Disk: 73432kB
         -> Append (cost=17886.42..1209431.67 rows=3731774 width=4) (actual time=1445.675..30171.066 rows=3756726 loops=1)
               -> Subquery Scan "*SELECT* 1" (cost=17886.42..496377.90 rows=381993 width=4) (actual time=1445.674..8223.061 rows=327498 loops=1)
                     -> Bitmap Heap Scan on person   (cost=17886.42..492557.97 rows=381993 width=4) (actual time= 1445.670..8021.006 rows=327498 loops=1)

Recheck Cond: ((column1 = 1) AND ((column2)::text = '62'::text))

-> Bitmap Index Scan on person_idx (cost= 0.00..17790.92 rows=381993 width=0) (actual time=1440.189..1440.189 rows=327498 loops=1)

Index Cond: ((column1 = 1) AND ((column2)::text = '62'::text))

-> Subquery Scan "*SELECT* 2" (cost=156754.24..713053.77 rows=3349781 width=4) (actual time=4183.977..20195.276 rows=3429228 loops=1)
-> Bitmap Heap Scan on person (cost= 156754.24..679555.96 rows=3349781 width=4) (actual time=4183.973..18191.919 rows=3429228 loops=1)

Recheck Cond: ((column1 = 1) AND ((column2)::text = '189'::text))

-> Bitmap Index Scan on person_idx (cost=0.00..155916.80 rows=3349781 width=0) (actual time=4178.644..4178.644 rows=3429228 loops=1)

Index Cond: ((column1 = 1) AND ((column2)::text = '189'::text))

Total runtime: 45504.425 ms

Martijn van Oosterhout

unread,

Nov 30, 2007, 8:28:55 AM11/30/07

to

On Fri, Nov 30, 2007 at 08:21:18AM -0500, Josh Harrison wrote:
> > > *Query1*
> > > SELECT person_id FROM person WHERE (column1=1 AND column2='62')
> > > INTERSECT
> > > SELECT person_id FROM person WHERE (column1=1 AND column2='189')

> I get the same plan(see below) with 'sort' for 'intersect all' operation

> too. Why is intersect not an effecient way? Is there any other way this
> query/index can be written/created so that I can get the intersect results
> in an efficient way?

Set operations are rather inefficient. To find the intersection of two
arbitrary sets you need to sort them and compare. A query like you
write would be better expressed as a join, something like:

SELECT a.person_id
FROM (SELECT person_id FROM person WHERE (column1=1 AND column2='62') a,
(SELECT person_id FROM person WHERE (column1=1 AND column2='189') b
WHERE a.person_id = b.person_id;

or perhaps:

SELECT a.person_id
FROM person a, person b
WHERE a.column1=1 AND a.column2='62'
AND b.column1=1 AND b.column2='189'
AND a.person_id = b.person_id;

Which will probably generate a merge join...

signature.asc

Josh Harrison

unread,

Nov 30, 2007, 9:37:32 AM11/30/07

to

> > > *Query1*
> > > SELECT person_id FROM person WHERE (column1=1 AND column2='62')
> > > INTERSECT
> > > SELECT person_id FROM person WHERE (column1=1 AND column2='189')

> I get the same plan(see below) with 'sort' for 'intersect all' operation
> too. Why is intersect not an effecient way? Is there any other way this
> query/index can be written/created so that I can get the intersect results
> in an efficient way?

Set operations are rather inefficient. To find the intersection of two
arbitrary sets you need to sort them and compare. A query like you
write would be better expressed as a join, something like:

SELECT a.person_id
FROM (SELECT person_id FROM person WHERE (column1=1 AND column2='62') a,
(SELECT person_id FROM person WHERE (column1=1 AND column2='189') b
WHERE a.person_id = b.person_id;

or perhaps:

SELECT a.person_id
FROM person a, person b
WHERE a.column1=1 AND a.column2='62 '
AND b.column1=1 AND b.column2='189'
AND a.person_id = b.person_id;

Which will probably generate a merge join...

Thanks. But this query seems to be more expensive than using intersect operator.
This is the explain analyse plan for this query. It took 5 1/2 minutes to generate this. I also tried to disable the mergejoin and in that case it uses hash join and still takes more than 3 minutes (intersect took only 40 sec)

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=1610648.92..10280119.99 rows=577856095 width=4) (actual time=30562.630..264534.677 rows=225145385 loops=1)
   Merge Cond: (a.patient_id = b.patient_id )
   -> Sort (cost=527974.81..528929.79 rows=381993 width=4) (actual time=3755.361..3845.134 rows=213435 loops=1)
         Sort Key: a.patient_id
         Sort Method: quicksort Memory: 15868kB
         -> Bitmap Heap Scan on clinical_variable2 a (cost= 17886.42..492557.97 rows=381993 width=4) (actual time=315.753..3410.366 rows=327498 loops=1)
               Recheck Cond: ((top_parent_service_sys_id = 1) AND ((top_parent_service_code)::text = '62'::text))
               -> Bitmap Index Scan on clinical_variable_idx_topserv (cost=0.00..17790.92 rows=381993 width=0) (actual time=277.185..277.185 rows=327498 loops=1)
                     Index Cond: ((top_parent_service_sys_id = 1) AND ((top_parent_service_code)::text = '62'::text))
   -> Materialize (cost=1082674.11..1124546.38 rows=3349781 width=4) (actual time=26807.248..99885.620 rows=225148250 loops=1)
         -> Sort (cost=1082674.11..1091048.57 rows=3349781 width=4) (actual time=26807.238..30343.870 rows=3429228 loops=1)
               Sort Key: b.patient_id
               Sort Method: external merge Disk: 53552kB
               -> Bitmap Heap Scan on clinical_variable2 b (cost= 156754.24..679555.96 rows=3349781 width=4) (actual time=2744.126..20106.160 rows=3429228 loops=1)
                     Recheck Cond: ((top_parent_service_sys_id = 1) AND ((top_parent_service_code)::text = '189'::text))
                     -> Bitmap Index Scan on clinical_variable_idx_topserv (cost=0.00..155916.80 rows=3349781 width=0) (actual time=2686.456..2686.456 rows=3429228 loops=1)
                           Index Cond: ((top_parent_service_sys_id = 1) AND ((top_parent_service_code)::text = '189'::text))
Total runtime: 324646.035 ms

18 record(s) selected [Fetch MetaData: 0/ms] [Fetch Data: 0/ms] a: 0/ms]

Is there any other way you can think of to solve this problem. May be creating the indexes in a different way or something?

Thanks
jo

Gregory Stark

unread,

Nov 30, 2007, 10:59:59 AM11/30/07

to

"Martijn van Oosterhout" <kle...@svana.org> writes:

> On Fri, Nov 30, 2007 at 08:21:18AM -0500, Josh Harrison wrote:
>> > > *Query1*
>> > > SELECT person_id FROM person WHERE (column1=1 AND column2='62')
>> > > INTERSECT
>> > > SELECT person_id FROM person WHERE (column1=1 AND column2='189')
>
>> I get the same plan(see below) with 'sort' for 'intersect all' operation
>> too. Why is intersect not an effecient way? Is there any other way this
>> query/index can be written/created so that I can get the intersect results
>> in an efficient way?
>
> Set operations are rather inefficient. To find the intersection of two
> arbitrary sets you need to sort them and compare.

I think all the set operations are implemented this way. It's actually a
pretty clever plan if you're processing two large lists without indexes but,
it would be nice to support a fuller set of plans like we do for other kinds
of queries. For INTERSECT star-schema joins might actually be best.

> A query like you write would be better expressed as a join, something like:
>
> SELECT a.person_id
> FROM (SELECT person_id FROM person WHERE (column1=1 AND column2='62') a,
> (SELECT person_id FROM person WHERE (column1=1 AND column2='189') b
> WHERE a.person_id = b.person_id;
>
> or perhaps:
>
> SELECT a.person_id
> FROM person a, person b
> WHERE a.column1=1 AND a.column2='62'
> AND b.column1=1 AND b.column2='189'
> AND a.person_id = b.person_id;

Or using an IN or EXISTS query:

SELECT person_id
FROM person
WHERE column1=1
AND column2='62'
AND person_id IN (
SELECT person_id
FROM person
WHERE column1=1
AND column2='189'
)

or

SELECT person_id
FROM person AS parent
WHERE column1=1
AND column2='62'
AND EXISTS (
SELECT 1
FROM person
WHERE parent.person_id = person_id
AND column1=1
AND column2='189'
)

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Josh Harrison

unread,

Nov 30, 2007, 11:27:24 AM11/30/07

to

Or using an IN or EXISTS query:

SELECT person_id
FROM person
WHERE column1=1
AND column2='62'
AND person_id IN (
SELECT person_id
FROM person
WHERE column1=1
AND column2='189'
)

or

SELECT person_id
FROM person AS parent
WHERE column1=1
AND column2='62'
AND EXISTS (
SELECT 1
FROM person
WHERE parent.person_id = person_id
AND column1=1
AND column2='189'
)

Thanks for your reply
The query with IN gave this plan and took 1m19sec to give the result which is slightly more than the intersect query(40 sec). The other query with exists takes way long time for results. All these queries does a heap scan for intermediate results...right? Is there a way to get them not to use the heap for intermediate result and go to heap only for final data? This will drastically improve the performance but Im not sure if postgres can do that? Will creating the index in a different way and/or rewriting the query in a different way achieve this result?

Thanks
jo

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=705823.44..1182434.64 rows=43631 width=4) (actual time=26443.675..52055.698 rows=140464 loops=1)
   Hash Cond: (public.person.patient_id = public.person.patient_id)
   -> Bitmap Heap Scan on person (cost=17886.42..492557.97 rows=381993 width=4) (actual time=442.934..25779.601 rows=327498 loops=1)

Recheck Cond: ((column1 = 1) AND ((column2)::text = '62'::text))

-> Bitmap Index Scan on person_idx (cost= 0.00..17790.92 rows=381993 width=0) (actual time=403.869..403.869 rows=327498 loops=1)

Index Cond: ((column1 = 1) AND ((column2)::text = '62'::text))

   -> Hash (cost=687933.35..687933.35 rows=294 width=4) (actual time=26000.635..26000.635 rows=6568 loops=1)
         -> HashAggregate (cost= 687930.41..687933.35 rows=294 width=4) (actual time=25992.971..25996.471 rows=6568 loops=1)
               -> Bitmap Heap Scan on person (cost=156754.24..679555.96 rows=3349781 width=4) (actual time=3202.251..23974.389 rows=3429228 loops=1)

Recheck Cond: ((column1 = 1) AND ((column2)::text = '189'::text))

-> Bitmap Index Scan on person_idx (cost=0.00..155916.80 rows=3349781 width=0) (actual time=3145.912..3145.912 rows=3429228 loops=1)

Index Cond: ((column1 = 1) AND ((column2)::text = '189'::text))

Total runtime: 52094.598 ms

Martijn van Oosterhout

unread,

Nov 30, 2007, 11:56:29 AM11/30/07

to

On Fri, Nov 30, 2007 at 11:27:24AM -0500, Josh Harrison wrote:
> Thanks for your reply

> Is there a way to get them not to use the
> heap for intermediate result and go to heap only for final data? This will
> drastically improve the performance but Im not sure if postgres can do that?
> Will creating the index in a different way and/or rewriting the query in a
> different way achieve this result?

I'm trying to imagine what it would take to avoid the heap access after
the index scan I don't think it's possible. It would require that the
bitmaps generated by the bitmap scan have the person_id attached and
then have the bitmap AND operation only happen if the person IDs match.
No such machinary currently exists.

You do have somewhat of a worst case scenario, intersecting 300k
records with 3million records. I'm not sure if there is a good way to
handle that. The only things I can think of is to add the person_id to
the index also so that you can avoid the sort (not sure if first or
last is more helpful). Or perhaps clustering on that index to reduce
disk access...

signature.asc

Gregory Stark

unread,

Dec 1, 2007, 5:13:42 AM12/1/07

to

"Martijn van Oosterhout" <kle...@svana.org> writes:

> On Fri, Nov 30, 2007 at 11:27:24AM -0500, Josh Harrison wrote:
>> Thanks for your reply
>> Is there a way to get them not to use the
>> heap for intermediate result and go to heap only for final data? This will
>> drastically improve the performance but Im not sure if postgres can do that?
>> Will creating the index in a different way and/or rewriting the query in a
>> different way achieve this result?
>
> I'm trying to imagine what it would take to avoid the heap access after
> the index scan I don't think it's possible. It would require that the
> bitmaps generated by the bitmap scan have the person_id attached and
> then have the bitmap AND operation only happen if the person IDs match.
> No such machinary currently exists.

I think you're describing a star schema join. This is a common checklist item
for data warehousing databases.

The classic data warehouse has a table like "person" which has the info you're
looking for, and dozens of tables with person_id and possibly some associated
data. In some cases those tables don't even have any other data, the mere
existence of the person_id in that table is enough.

So a typical query could look like something like:

select *
from person
where person_id in (select person_id from people_who_used_service_in_the_past)
and person_id in (select person_id from people_with_big_balances)
and person_id in (select person_id from people_...)
and person_id not in (select person_id from people_who_unsubscribed)
and person_id not in (select person_id from people_who_we_mailed_last_week)

The best plan for this is to gather up the person_ids in a kind of bitmap scan
with a bitmap of ids. And once the bitmap is done scan an index on person for
just the matching records. Postgres doesn't support anything like this (yet:).

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

Ask me about EnterpriseDB's PostGIS support!

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majo...@postgresql.org so that your
message can get through to the mailing list cleanly