Some indexes are inexact, i.e. they may sometimes return tuples that
don't actually match the index condition. This also happens with bitmap
scans, because it'll return anything in the bitmap which will probably
be more than what you asked for. The recheck just means that the
planner retests the index condition on the result to make sure you only
get the rows you wanted.
Have a nice day,
--
Martijn van Oosterhout <kle...@svana.org> http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
> -- John F Kennedy
The nature of the beast. For example, if you create an index on large
integer arrays it doesn't store the actual array in the index, but a
hashed version thereof. When we scan the index because of this hashing
it might match other arrays that shouldn't be. Hence the recheck.
Similarly for geometry indexes. The index only stores bounding boxes
and an intersection test might hit the bounding box but not match the
actual query.
> So does recheck condition affect the performance of the queries since it
> basically rechecks the condition?
> Also does it goes to the heap to retest ?
At the time of the recheck the data is already in memory. So no, it
doesn't go back to the heap.
> For example for this query
> explain analyze select count(*) from foo where foo_id=1 I get the following
> plan
It isn't the recheck that's costing it, it's probably just that you're
matching a lot of rows. A bitmap scan classically needs a recheck
because if a lot of rows need to be stored it might remember only
blocks 2044-2060. It then needs to recheck each row as it comes through
to make sure it really matches the conditions.
This query is 8ms, I imagine when it takes a long time it's matching
lots of rows?
On Wed, Nov 28, 2007 at 10:21:39PM -0500, Josh Harrison wrote:
> > It isn't the recheck that's costing it, it's probably just that you're
> > matching a lot of rows. A bitmap scan classically needs a recheck
> > because if a lot of rows need to be stored it might remember only
> > blocks 2044-2060. It then needs to recheck each row as it comes through
> > to make sure it really matches the conditions.
> >
> What is this number 2044-2060? Is this a fixed number in postgres?
Ofcourse not. Have you read the documentation on explain yet?
http://www.postgresql.org/docs/8.2/static/using-explain.html
The point is that the bitmap may have an inexact representation of the
tuples that match. If your scan indicates you'll match 10 million
entries and you only want to use 16KB for your bitmap, obviously you
can't store all the locations exactly.
> For example if I have a table Person with 3 fields (name,city_id,age). And
> the table contains 1000 rows. The table has 2 indexes city_id and age
> If I have a query :
> SELECT * FROM PERSON WHERE city_id=5 AND AGE=30
The answer is "it depends". Postgres has a cost based planner, it will
estimate the costs of each different way of getting the result and use
the cheapest. The factors that are important is how many rows each
condition will match.
Given your table is only 8MB, the system may decide that it's all in
memory and just do a scan.
Or it maybe see that city_id is almost unique and use that index and
check the matches for the second condition. Or vice-versa.
Or maybe it will scan both indexes, calculate the intersection and then
looks up the matches in the heap (with a recheck).
> In other words, Will this query cause 1000 random heap access or 10 random
> heap access ?
I don't know, run it and see.
The answer is "it depends". Postgres has a cost based planner, it will
> For example if I have a table Person with 3 fields (name,city_id,age). And
> the table contains 1000 rows. The table has 2 indexes city_id and age
> If I have a query :
> SELECT * FROM PERSON WHERE city_id=5 AND AGE=30
estimate the costs of each different way of getting the result and use
the cheapest. The factors that are important is how many rows each
condition will match.
Given your table is only 8MB, the system may decide that it's all in
memory and just do a scan.
Or it maybe see that city_id is almost unique and use that index and
check the matches for the second condition. Or vice-versa.
Or maybe it will scan both indexes, calculate the intersection and then
looks up the matches in the heap (with a recheck).
Thanks
Yes.
If the table actually contains 1000 rows, the most likely outcome is
that the bitmaps would not be lossy and therefore no rechecking is
needed at all. (Tuple bitmaps become lossy only if they have to store a
lot of tuples, in which case they forget the idea of storing each tuple,
and instead "compress" the representation to storing only the page
numbers where matching tuples are to be found).
Note however, that even if the bitmaps are not lossy, the visit to the
heap is still required, because the need to check for visibility.
--
Alvaro Herrera Valdivia, Chile ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees." (E. Dijkstra)
---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
Josh Harrison escribió:
> >
> > > For example if I have a table Person with 3 fields (name,city_id,age).
> > And
> > > the table contains 1000 rows. The table has 2 indexes city_id and age
> > > If I have a query :
> > > SELECT * FROM PERSON WHERE city_id=5 AND AGE=30
>> Okay....So If I have a query like the above and the query plan shows aYes.
> 'recheck condition' and bitmap scan, then does that mean it scans the
> indexes first to get the intermediate results and goto the heap only for the
> final data?
If the table actually contains 1000 rows, the most likely outcome is
that the bitmaps would not be lossy and therefore no rechecking is
needed at all. (Tuple bitmaps become lossy only if they have to store a
lot of tuples, in which case they forget the idea of storing each tuple,
and instead "compress" the representation to storing only the page
numbers where matching tuples are to be found).
Note however, that even if the bitmaps are not lossy, the visit to the
heap is still required, because the need to check for visibility.
> Thanks...
> I have 1 more question in the same line...
>
> *Query1*
> SELECT person_id FROM person WHERE (column1=1 AND column2='62')
> INTERSECT
> SELECT person_id FROM person WHERE (column1=1 AND column2='189')
Hmm, I think INTERSECT (and EXCEPT) is pretty stupid in Postgres in
general. Maybe INTERSECT ALL could be a bit faster, because it can
avoid the sort steps. Make sure you eliminate duplicates if they are a
concern.
--
Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere." (Lamar Owen)
---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings
Josh Harrison escribió:Hmm, I think INTERSECT (and EXCEPT) is pretty stupid in Postgres in
> Thanks...
> I have 1 more question in the same line...
>
> *Query1*
> SELECT person_id FROM person WHERE (column1=1 AND column2='62')
> INTERSECT
> SELECT person_id FROM person WHERE (column1=1 AND column2='189')
general. Maybe INTERSECT ALL could be a bit faster, because it can
avoid the sort steps. Make sure you eliminate duplicates if they are a
concern.
> I get the same plan(see below) with 'sort' for 'intersect all' operation
> too. Why is intersect not an effecient way? Is there any other way this
> query/index can be written/created so that I can get the intersect results
> in an efficient way?
Set operations are rather inefficient. To find the intersection of two
arbitrary sets you need to sort them and compare. A query like you
write would be better expressed as a join, something like:
SELECT a.person_id
FROM (SELECT person_id FROM person WHERE (column1=1 AND column2='62') a,
(SELECT person_id FROM person WHERE (column1=1 AND column2='189') b
WHERE a.person_id = b.person_id;
or perhaps:
SELECT a.person_id
FROM person a, person b
WHERE a.column1=1 AND a.column2='62'
AND b.column1=1 AND b.column2='189'
AND a.person_id = b.person_id;
Which will probably generate a merge join...
> > > *Query1*
> > > SELECT person_id FROM person WHERE (column1=1 AND column2='62')
> > > INTERSECT
> > > SELECT person_id FROM person WHERE (column1=1 AND column2='189')> I get the same plan(see below) with 'sort' for 'intersect all' operationSet operations are rather inefficient. To find the intersection of two
> too. Why is intersect not an effecient way? Is there any other way this
> query/index can be written/created so that I can get the intersect results
> in an efficient way?
arbitrary sets you need to sort them and compare. A query like you
write would be better expressed as a join, something like:
SELECT a.person_id
FROM (SELECT person_id FROM person WHERE (column1=1 AND column2='62') a,
(SELECT person_id FROM person WHERE (column1=1 AND column2='189') b
WHERE a.person_id = b.person_id;
or perhaps:
SELECT a.person_id
FROM person a, person b
WHERE a.column1=1 AND a.column2='62 '
AND b.column1=1 AND b.column2='189'
AND a.person_id = b.person_id;
Which will probably generate a merge join...
> On Fri, Nov 30, 2007 at 08:21:18AM -0500, Josh Harrison wrote:
>> > > *Query1*
>> > > SELECT person_id FROM person WHERE (column1=1 AND column2='62')
>> > > INTERSECT
>> > > SELECT person_id FROM person WHERE (column1=1 AND column2='189')
>
>> I get the same plan(see below) with 'sort' for 'intersect all' operation
>> too. Why is intersect not an effecient way? Is there any other way this
>> query/index can be written/created so that I can get the intersect results
>> in an efficient way?
>
> Set operations are rather inefficient. To find the intersection of two
> arbitrary sets you need to sort them and compare.
I think all the set operations are implemented this way. It's actually a
pretty clever plan if you're processing two large lists without indexes but,
it would be nice to support a fuller set of plans like we do for other kinds
of queries. For INTERSECT star-schema joins might actually be best.
> A query like you write would be better expressed as a join, something like:
>
> SELECT a.person_id
> FROM (SELECT person_id FROM person WHERE (column1=1 AND column2='62') a,
> (SELECT person_id FROM person WHERE (column1=1 AND column2='189') b
> WHERE a.person_id = b.person_id;
>
> or perhaps:
>
> SELECT a.person_id
> FROM person a, person b
> WHERE a.column1=1 AND a.column2='62'
> AND b.column1=1 AND b.column2='189'
> AND a.person_id = b.person_id;
Or using an IN or EXISTS query:
SELECT person_id
FROM person
WHERE column1=1
AND column2='62'
AND person_id IN (
SELECT person_id
FROM person
WHERE column1=1
AND column2='189'
)
or
SELECT person_id
FROM person AS parent
WHERE column1=1
AND column2='62'
AND EXISTS (
SELECT 1
FROM person
WHERE parent.person_id = person_id
AND column1=1
AND column2='189'
)
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?
Or using an IN or EXISTS query:AND person_id IN (
SELECT person_id
FROM person
WHERE column1=1
AND column2='62'SELECT person_idor
FROM person
WHERE column1=1
AND column2='189'
)
SELECT person_id
FROM person AS parentWHERE column1=1AND EXISTS (
AND column2='62'
SELECT 1
FROM person
WHERE parent.person_id = person_id
AND column1=1
AND column2='189'
)
I'm trying to imagine what it would take to avoid the heap access after
the index scan I don't think it's possible. It would require that the
bitmaps generated by the bitmap scan have the person_id attached and
then have the bitmap AND operation only happen if the person IDs match.
No such machinary currently exists.
You do have somewhat of a worst case scenario, intersecting 300k
records with 3million records. I'm not sure if there is a good way to
handle that. The only things I can think of is to add the person_id to
the index also so that you can avoid the sort (not sure if first or
last is more helpful). Or perhaps clustering on that index to reduce
disk access...
> On Fri, Nov 30, 2007 at 11:27:24AM -0500, Josh Harrison wrote:
>> Thanks for your reply
>> Is there a way to get them not to use the
>> heap for intermediate result and go to heap only for final data? This will
>> drastically improve the performance but Im not sure if postgres can do that?
>> Will creating the index in a different way and/or rewriting the query in a
>> different way achieve this result?
>
> I'm trying to imagine what it would take to avoid the heap access after
> the index scan I don't think it's possible. It would require that the
> bitmaps generated by the bitmap scan have the person_id attached and
> then have the bitmap AND operation only happen if the person IDs match.
> No such machinary currently exists.
I think you're describing a star schema join. This is a common checklist item
for data warehousing databases.
The classic data warehouse has a table like "person" which has the info you're
looking for, and dozens of tables with person_id and possibly some associated
data. In some cases those tables don't even have any other data, the mere
existence of the person_id in that table is enough.
So a typical query could look like something like:
select *
from person
where person_id in (select person_id from people_who_used_service_in_the_past)
and person_id in (select person_id from people_with_big_balances)
and person_id in (select person_id from people_...)
and person_id not in (select person_id from people_who_unsubscribed)
and person_id not in (select person_id from people_who_we_mailed_last_week)
The best plan for this is to gather up the person_ids in a kind of bitmap scan
with a bitmap of ids. And once the bitmap is done scan an index on person for
just the matching records. Postgres doesn't support anything like this (yet:).
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majo...@postgresql.org so that your
message can get through to the mailing list cleanly