If qsort is to blame, then maybe this patch could help. It sorts
equal key values on item pointer. And if it doesn't help index
creation speed, at least the resulting index has better correlation.
Test script:
CREATE TABLE t (i int NOT NULL, t text NOT NULL);
INSERT INTO t VALUES (1, 'lajshdflasjhdflajhsdfljhasdlfjhasdf');
INSERT INTO t SELECT * FROM t;
INSERT INTO t SELECT * FROM t;
INSERT INTO t VALUES (100, 's,dmfa.,smdn.famsndfamdnsbfmansdbf');
INSERT INTO t SELECT * FROM t;
INSERT INTO t SELECT * FROM t;
INSERT INTO t SELECT * FROM t;
INSERT INTO t SELECT * FROM t;
INSERT INTO t SELECT * FROM t;
INSERT INTO t SELECT * FROM t;
INSERT INTO t SELECT * FROM t;
INSERT INTO t SELECT * FROM t;
INSERT INTO t SELECT * FROM t;
INSERT INTO t SELECT * FROM t;
INSERT INTO t SELECT * FROM t;
INSERT INTO t SELECT * FROM t;
INSERT INTO t SELECT * FROM t;
INSERT INTO t SELECT * FROM t;
ANALYZE t;
CREATE INDEX t_i ON t(i);
SET enable_seqscan = 0;
SELECT ctid FROM t WHERE i=100 LIMIT 10;
Result without patch:
ctid
----------
(153,14)
(306,23)
(305,80)
(152,91)
(76,68)
(38,34)
(153,34)
(305,50)
(9,62)
(305,40)
(10 rows)
Result with patch:
ctid
--------
(0,5)
(0,10)
(0,15)
(0,20)
(0,25)
(0,30)
(0,35)
(0,40)
(0,45)
(0,50)
(10 rows)
For testing purposes I have made a second patch that provides a
boolean GUC variable sort_index. It is available here:
http://www.pivot.at/pg/23.test-IdxTupleSort.diff
Servus
Manfred
I don't know why that TODO entry exists, but I think the idea is
counterproductive. The existing btree code will tend to put newer
versions of a row earlier (because it puts a new entry in front of any
with duplicate keys), which usually reduces the time spent skipping dead
rows. Forcing tid ordering will cost us more in dead-row skipping than
it's likely to save elsewhere.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majo...@postgresql.org
I assume you are talking about a unique index that probably only has a
few non-expired rows (in which case the newer rows first is better).
The TODO deals with cases where you have lots of valid duplicate index
rows, and you want to spin through all the matching rows in heap order
rather than randomly. This is related to our CLUSTER capability. The
idea originally came from Vadim.
At what point does this patch do the sorting?
--
Bruce Momjian | http://candle.pha.pa.us
pg...@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majo...@postgresql.org so that your
message can get through to the mailing list cleanly
> I assume you are talking about a unique index that probably only has a
> few non-expired rows (in which case the newer rows first is better).
> The TODO deals with cases where you have lots of valid duplicate index
> rows, and you want to spin through all the matching rows in heap order
> rather than randomly.
Maybe so, but it would degrade the performance in the unique-index case
if we do it as the TODO is worded.
My own opinion is that the bitmap-index-lookup approach will be superior
to trying to keep the index entries in TID order. (That's the idea
we've been discussing for awhile of separating the heap-fetch stage from
the index-scan stage: scan the index, make a sparse bitmap of the TIDs
we need to visit, possibly AND or OR this bitmap with maps derived from
other indexes, and finally visit the rows in heap order.)
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majo...@postgresql.org)
Yes, the wording is just a guide.
> My own opinion is that the bitmap-index-lookup approach will be superior
> to trying to keep the index entries in TID order. (That's the idea
> we've been discussing for awhile of separating the heap-fetch stage from
> the index-scan stage: scan the index, make a sparse bitmap of the TIDs
> we need to visit, possibly AND or OR this bitmap with maps derived from
> other indexes, and finally visit the rows in heap order.)
Oh, yes.
--
Bruce Momjian | http://candle.pha.pa.us
pg...@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
I don't think so, because the patch does nothing to keep the sort
order once the index is initially created.
> If you want to post it now, [...]
I did already post it. It's only the last page or so of the original
message. The link in that message points to a testing aid which is
not part of what I would like to see committed.
Servus
Manfred
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
The patch would only hurt with a unique index, if there are lots of
duplicate tuples at CREATE INDEX time.
>My own opinion is that the bitmap-index-lookup approach will be superior
So is mine, but I was not able to do this in 30 lines. Sorry ;-)
>to trying to keep the index entries in TID order.
^^^^
... which the patch does not. I see its main advantage in creating
better b-tree indices when you restore a large database with many
duplicate index entries.
It is intended to be only effective at CREATE INDEX or REINDEX time.
I don't believe it is activated when you insert a single new entry,
otherwise it wouldn't pass regression tests ...
But on average this argument only holds true for unique indexes, no ?
Is there any code that stops the heap lookup after the visible tuple is found ?
At least in an index with more rows per key you will fetch all heaps after the
first one anyway to get at the next row. This is better done in heap order, no ?
And the bitmap approach will not work for large result sets.
Summa summarum I would leave the TODO item (maybe add a comment
(only for non-unique, evaluate performance))
Andreas
Yes, yes, yes, and yes. Seems we all agree on that; the patch has
been queued for 7.5.
Servus
Manfred
---------------------------(end of broadcast)---------------------------
http://momjian.postgresql.org/cgi-bin/pgpatches
I will try to apply it within the next 48 hours.
---------------------------------------------------------------------------
> diff -ruN ../base/src/backend/utils/sort/tuplesort.c src/backend/utils/sort/tuplesort.c
> --- ../base/src/backend/utils/sort/tuplesort.c 2003-08-17 21:58:06.000000000 +0200
> +++ src/backend/utils/sort/tuplesort.c 2003-09-05 10:04:22.000000000 +0200
> @@ -2071,6 +2071,33 @@
> (errcode(ERRCODE_UNIQUE_VIOLATION),
> errmsg("could not create unique index"),
> errdetail("Table contains duplicated values.")));
> + else
> + {
> + /*
> + * If key values are equal, we sort on ItemPointer. This might help
> + * for some bad qsort implementation having performance problems
> + * with many equal items. OTOH I wouldn't trust such a weak qsort
> + * to handle pre-sorted sequences very well ...
> + *
> + * Anyway, this code doesn't hurt much, and it helps produce indices
> + * with better index correlation which is a good thing per se.
> + */
> + ItemPointer tid1 = &tuple1->t_tid;
> + ItemPointer tid2 = &tuple2->t_tid;
> + BlockNumber blk1 = ItemPointerGetBlockNumber(tid1);
> + BlockNumber blk2 = ItemPointerGetBlockNumber(tid2);
> +
> + if (blk1 != blk2)
> + return (blk1 < blk2) ? -1 : 1;
> + else
> + {
> + OffsetNumber pos1 = ItemPointerGetOffsetNumber(tid1);
> + OffsetNumber pos2 = ItemPointerGetOffsetNumber(tid2);
> +
> + if (pos1 != pos2)
> + return (pos1 < pos2) ? -1 : 1;
> + }
> + }
>
> return 0;
> }
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
--
Bruce Momjian | http://candle.pha.pa.us
pg...@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
---------------------------------------------------------------------------
> diff -ruN ../base/src/backend/utils/sort/tuplesort.c src/backend/utils/sort/tuplesort.c
Because it sorts on tuple position, which is certainly about as low
level as you can get. More to the point, though, no evidence has been
provided that this is a good idea.
regards, tom lane