After reading the previous posts about indexes and large joins, etc., I'm
now very confused about the results I'm seeing on a simpler join query...
I'm using source from CVS as of about 12:00 PM PST, 03/23/2000.  I still get
a big difference between enable_seqscan='on' and enable_seqscan='off', and
even with 'off' my queries seem to take too long.  All the results listed
below are with enable_seqscan='off'.  I have vacuum'd the db a few minutes
before recording these results.  I am running Linux 2.2.13 on a K62-400 with
256MB memory, with everything already in cache - my CPU runs wide open on
these queries with no disk activity.
My tables are:
  ordhead: 49726 rows
  ordline: 292193 rows
My indexes:
  ordhead.sales_ord (integer) (unique)
  ordhead.customer (char(6))
  ordline.sales_ord (integer)
  ordline.item (char(14))
Query one:
select h.sales_ord,h.date_enter,h.customer,l.item,l.qty_ord from ordhead h
join ordline l using (sales_ord) where h.customer = 'blah';
Query two:
select h.sales_ord,h.date_enter,h.customer,l.item,l.qty_ord from ordhead h
join ordline l using (sales_ord) where l.item = 'blah';
Query one explain:
Merge Join  (cost=1511.07..143462.64 rows=1452959 width=44)
  ->  Index Scan using idx_ordline_sales_ord on ordline l
(cost=0.00..138292.95 rows=292193 width=24)
  ->  Sort  (cost=1511.07..1511.07 rows=497 width=20)
        ->  Index Scan using idx_ordhead_customer on ordhead h
(cost=0.00..1488.79 rows=497 width=20)
Query two explain:
Merge Join  (cost=10038.63..37075.94 rows=1452959 width=44)
  ->  Sort  (cost=10038.63..10038.63 rows=2922 width=24)
        ->  Index Scan using idx_ordline_item on ordline l
(cost=0.00..9870.43 rows=2922 width=24)
  ->  Index Scan using idx_ordhead_sales_ord on ordhead h
(cost=0.00..26379.21 rows=49726 width=20)
Query one takes ~5 seconds, query 2 takes ~1 second, both returning ~400
rows.  Seems quite excessive considering the small size of my tables and the
fact that all fields are indexed.  I would expect to see < 1 second for both
queries.  What confuses me is the row counts for the merge join, and the
drastic difference between my explains and the ones from the previous poster
on this subject.  Actually, all my row counts are much higher.
 --
Glen Parker
glen...@nwlink.com
What happens (both to real execution time and the EXPLAIN results) if
you don't turn off enable_seqscan?
> I have vacuum'd the db a few minutes before recording these results.
Have you ever done a 'vacuum analyze' on these tables?
> What confuses me is the row counts for the merge join,
It looks to me like all the row counts are based on default selectivity
estimates, which is why I suspect a vacuum analyze is needed.
regards, tom lane
Ah ha...  If only I'd read the vacuum docs a little closer :-)  OK, that
helps.  In fact, one of the queries now runs at expected speed with and
without the set enable_seqscan='off', so one down one to go.  Here's the new
report on the other...
I'm running on a different DB (same schema, more data) now.
I now have:
  ordhead: 298765 rows
  ordline: 1973632 rows
And this is now running on a PentiumIII 600 with 384MB of memory.
Query:
select h.sales_ord,h.date_enter,h.customer,l.item,l.qty_ord from ordhead h
join ordline l using (sales_ord) where h.customer='000100'
With set enable_seqscan='on':
Hash Join  (cost=3676.54..126747.51 rows=6414 width=44)
  ->  Seq Scan on ordline l  (cost=0.00..93438.32 rows=1973632 width=24)
  ->  Hash  (cost=3674.11..3674.11 rows=971 width=20)
        ->  Index Scan using idx_ordhead_customer on ordhead h
(cost=0.00..3674.11 rows=971 width=20)
This takes 45 seconds to return 4778 rows (in the form of a select into
table suchandsuch).
As you can see the sequential scan is still being chosen on this plan.
With set enable_seqscan='off':
Nested Loop  (cost=0.00..659465.00 rows=6414 width=44)
  ->  Index Scan using idx_ordhead_customer on ordhead h
(cost=0.00..3674.11 rows=971 width=20)
  ->  Index Scan using idx_ordline_sales_ord on ordline l
(cost=0.00..673.26 rows=169 width=24)
This takes 10 seconds to return 4778 rows (in the form of a select into
table suchandsuch).
10 seconds seems reasonable enough to me...
 --
Glen Parker
glen...@nwlink.com
Hmm.  This is guessing about 971 ordhead rows will satisfy
h.customer='000100', and then about 169 rows from ordline
will match any given value of sales_ord.  How close to reality
are those guesses?
regards, tom lane
> I've been curious about all this explain voodoo from the start.  Tom, is
> there any documentation describing what all the plan numbers mean? 
> Preferrably documentation not ending in *.c... :)
Not much.  I put some basic info into the current EXPLAIN ref page
(http://www.postgresql.org/docs/postgres/sql-explain.htm),
but plan-reading is an art that deserves a tutorial, and I haven't
had time to write one.  Here is some quick & dirty explanation:
The numbers that are currently quoted by EXPLAIN are:
* Estimated startup cost (time expended before output scan can start,
  eg, time to do the sorting in a SORT node).
* Estimated total cost (if all tuples are retrieved, which they may not
  be --- LIMIT will stop short of paying the total cost, for example).
* Estimated number of rows output by this plan node.
* Estimated average width (in bytes) of rows output by this plan node.
The costs are measured in units of disk page fetches.  (There are some
fairly bogus fudge-factors for converting CPU effort estimates into
disk-fetch units; see the SET ref page if you want to play with these.)
It's important to note that the cost of an upper-level node includes
the cost of all its child nodes.  It's also important to realize that
the cost only reflects things that the planner/optimizer cares about.
In particular, the cost does not consider the time spent transmitting
result tuples to the frontend --- which could be a pretty dominant
factor in the true elapsed time, but the planner ignores it because
it cannot change it by altering the plan.  (Every correct plan will
output the same tuple set, we trust.)
Rows output is a little tricky because it is *not* the number of rows
processed/scanned by the query --- it is usually less, reflecting the
estimated selectivity of any WHERE-clause constraints that are being
applied at this node.
Average width is pretty bogus because the thing really doesn't have
any idea of the average length of variable-length columns.  I'm thinking
about improving that in the future, but it may not be worth the trouble,
because the width isn't used for very much.
Here are some examples (using the regress test database after a
vacuum analyze, and current sources):
regression=# explain select * from tenk1;
NOTICE:  QUERY PLAN:
Seq Scan on tenk1 (cost=0.00..333.00 rows=10000 width=148)
About as straightforward as it gets.  If you do "select * from pg_class
where relname = 'tenk1'" you'll find out that tenk1 has 233 disk
pages and 10000 tuples.  So the cost is estimated at 233 block
reads, defined as 1.0 apiece, plus 10000 * cpu_tuple_cost which is
currently 0.01 (try "show cpu_tuple_cost").
regression=# explain select * from tenk1 where unique1 < 1000;
NOTICE:  QUERY PLAN:
Seq Scan on tenk1 (cost=0.00..358.00 rows=1000 width=148)
Estimated output rows has gone down because of the WHERE clause.
(The uncannily accurate estimate is just because tenk1 is a particularly
simple case --- the unique1 column has 10000 distinct values ranging
from 0 to 9999, so the estimator's linear interpolation between min and
max column values is dead-on.)  However, the scan will still have to
visit all 10000 rows, so the cost hasn't decreased; in fact it has gone
up a bit to reflect the extra CPU time spent checking the WHERE
condition.
regression=# explain select * from tenk1 where unique1 < 100;
NOTICE:  QUERY PLAN:
Index Scan using tenk1_unique1 on tenk1 (cost=0.00..89.35 rows=100 width=148)
If we make the WHERE condition selective enough, the planner will
eventually decide that an indexscan is cheaper than a sequential scan.
This plan will only have to visit 100 tuples because of the index,
so it wins despite the fact that each individual fetch is expensive.
regression=# explain select * from tenk1 where unique1 < 100 and
regression-# stringu1 = 'xxx';
NOTICE:  QUERY PLAN:
Index Scan using tenk1_unique1 on tenk1 (cost=0.00..89.60 rows=1 width=148)
The added clause "stringu1 = 'xxx'" reduces the output-rows estimate,
but not the cost because we still have to visit the same set of tuples.
regression=# explain select * from tenk1 t1, tenk2 t2 where t1.unique1 < 100
regression-# and t1.unique2 = t2.unique2;
NOTICE:  QUERY PLAN:
Nested Loop  (cost=0.00..144.07 rows=100 width=296)
  ->  Index Scan using tenk1_unique1 on tenk1 t1  (cost=0.00..89.35 rows=100 width=148)
  ->  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.00..0.53 rows=1 width=148)
In this nested-loop join, the outer scan is the same indexscan we had
in the example before last, and the cost and row count are the same
because we are applying the "unique1 < 100" WHERE clause at this node.
The "t1.unique2 = t2.unique2" clause isn't relevant yet, so it doesn't
affect the row count.  For the inner scan, we assume that the current
outer-scan tuple's unique2 value is plugged into the inner indexscan
to produce an indexqual like "t2.unique2 = <constant>".  So we get the
same inner-scan plan and costs that we'd get from, say, "explain select
* from tenk2 where unique2 = 42".  The loop node's costs are then set
on the basis of the outer scan's cost, plus one repetition of the
inner scan for each outer tuple (100 * 0.53, here), plus a little CPU
time for join processing.
In this example the loop's output row count is the same as the product
of the two scans' row counts, but that's not true in general, because
in general you can have WHERE clauses that mention both relations and
so can only be applied at the join point, not to either input scan.
For example, if we added "WHERE ... AND t1.hundred < t2.hundred",
that'd decrease the output row count of the join node, but not change
either input scan.
We can look at variant plans by forcing the planner to disregard
whatever strategy it thought was the winner (a pretty crude tool,
but it's what we've got at the moment):
regression=# set enable_nestloop = 'off';
SET VARIABLE
regression=# explain select * from tenk1 t1, tenk2 t2 where t1.unique1 < 100
regression-# and t1.unique2 = t2.unique2;
NOTICE:  QUERY PLAN:
Hash Join  (cost=89.60..574.10 rows=100 width=296)
  ->  Seq Scan on tenk2 t2  (cost=0.00..333.00 rows=10000 width=148)
  ->  Hash  (cost=89.35..89.35 rows=100 width=148)
        ->  Index Scan using tenk1_unique1 on tenk1 t1  (cost=0.00..89.35 rows=100 width=148)
This plan proposes to extract the 100 interesting rows of tenk1
using ye same olde indexscan, stash them into an in-memory hash table,
and then do a sequential scan of tenk2, probing into the hash table
for possible matches of "t1.unique2 = t2.unique2" at each tenk2 tuple.
The cost to read tenk1 and set up the hash table is entirely startup
cost for the hash join, since we won't get any tuples out until we can
start reading tenk2.  The total time estimate for the join also
includes a pretty hefty charge for CPU time to probe the hash table
10000 times.  Note, however, that we are NOT charging 10000 times 89.35;
the hash table setup is only done once in this plan type.
regards, tom lane
I have a situation where exactly this is causing me a problem.  I have a
table with around 20,000 news stories in it.  Since news stories can
easily be larger than 8k, I have split it into two tables with all the
base story information in one table, and then another table which has
chunks of story and which has three fields: story_id, chunk_seq and
story_body.
Primarily I want my indexes to be on the story table, and most lookups
are on the story table, but I still do joins to the story_body table and
the moment I do, Postgres decides that the 'quickest' way to do this is
to do a seqscan on story_body, whereas the fastest answer is definitely
to use the indexes on story...
I'm currently using 6.5.3 (I've only just started lurking on the
'hackers' list) under Debian, so I don't think I can disable seqscan
yet, but it would be nice to indicate to Postgres somehow that the
average size for my story_body 'TEXT' field is 3500 bytes.  Perhaps
'vacuum analyze' could be generating a statistic along these lines?
Regards,
					Andrew McMillan.
-- 
_____________________________________________________________________
            Andrew McMillan, e-mail: And...@cat-it.co.nz
Catalyst IT Ltd, PO Box 10-225, Level 22, 105 The Terrace, Wellington
Me: +64 (21) 635 694, Fax: +64 (4) 499 5596, Office: +64 (4) 499 2267
- Thomas
-- 
Thomas Lockhart				lock...@alumni.caltech.edu
South Pasadena, California
It's pretty rough, but I guess it's better than nothing...
regards, tom lane
> I'm currently using 6.5.3 (I've only just started lurking on the
> 'hackers' list) under Debian, so I don't think I can disable seqscan
> yet, but it would be nice to indicate to Postgres somehow that the
> average size for my story_body 'TEXT' field is 3500 bytes.  Perhaps
> 'vacuum analyze' could be generating a statistic along these lines?
Right, the idea I was toying with was to add average field width
to the stats computed by VACUUM ANALYZE.  It'd be easy enough to
compute and store it.  The thing I'm wondering about is whether having
more accurate width info would improve the planning enough to justify
the extra overhead to look it up.  That overhead would occur on every
query, whether the info helped or not.  So far, I'm not convinced it'd
be a good tradeoff.
Right now the width info isn't used at all unless the system is
contemplating a hashjoin or sort operation (it needs to guess whether
the hash/sort data will fit in memory, so the width helps then).
So I'm not sure that better width info would help your situation
anyway.  The costs for seqscan versus indexscan are based on number of
pages and number of tuples in the relation --- I guess that indirectly
tells the average tuple width, but there's not a direct dependence on
the estimated width.
I'd be interested to look more closely at your example and see what
we can do with it.  Can you show an example query, the plan you get
for it, and schema and size stats for the tables involved?
BTW, you can suppress particular plan types in 6.5, it's just less
convenient than in 7.0.  You have to start psql (not the backend;
psql, or your choice of libpq-based client) with environment variable
PGOPTIONS set to contain one or more of these switches:
	-fs	disable seqscan
	-fi	disable indexscan
	-fn	disable nestloop
	-fm	disable mergejoin
	-fh	disable hashjoin
So it's kind of painful, but you can see the estimates and measure the
actual runtimes for different plan types that way.
But really I'd like to get you to fire up 7.0beta and see what results
you get from it on your data.  It's not too late to tweak 7.0 if we
can figure out how to improve its planning.
regards, tom lane
Possibly not.  In cases like mine I would be happy to tell the system to
"use that index", which is a great last resort option that I have
occasionally used with Progress in the past, and it has zero overhead in
action (or less than zero!).  Of course, that sort of thing does have a
development time overhead but since it doesn't apply to a large number
of queries that's really OK.
I guess the average width is the file size / number of tuples, so if you
know both of those already it might not be much of an overhead to do the
division?
> I'd be interested to look more closely at your example and see what
> we can do with it.  Can you show an example query, the plan you get
> for it, and schema and size stats for the tables involved?
First, for my existing case:
EXPLAIN SELECT * FROM story WHERE head1='t';
-->> Index Scan using story_sk4 on story  (cost=987.84 rows=12937
width=125)
EXPLAIN SELECT * FROM story, story_body WHERE head1='t' AND
story_body.story_id=story.story_id;
-->> Hash Join  (cost=5459.47 rows=23330 width=145)
-->>   ->  Index Scan using story_sk4 on story  (cost=987.84 rows=12937
width=125)
-->>   ->  Hash  (cost=3309.86 rows=19632 width=20)
-->>         ->  Seq Scan on story_body  (cost=3309.86 rows=19632
width=20)
EXPLAIN SELECT * FROM story, story_body WHERE head1='t' AND
story_body.story_id=story.story_id ORDER BY written DESC;
-->> Sort  (cost=5410.60 rows=21786 width=145)
-->>   ->  Hash Join  (cost=5410.60 rows=21786 width=145)
-->>         ->  Index Scan using story_sk4 on story  (cost=987.84
rows=12937 width=125)
-->>         ->  Hash  (cost=3309.86 rows=18333 width=20)
-->>               ->  Seq Scan on story_body  (cost=3309.86 rows=18333
width=20)
And after I set PGOPTIONS=-fs:
EXPLAIN SELECT * FROM story WHERE head1='t';
-->> Index Scan using story_sk4 on story  (cost=987.84 rows=12937
width=125)
EXPLAIN SELECT * FROM story, story_body WHERE head1='t' AND
story_body.story_id=story.story_id;
-->> Hash Join  (cost=5890.21 rows=23330 width=145)
-->>   ->  Index Scan using story_sk4 on story  (cost=987.84 rows=12937
width=125)
-->>   ->  Hash  (cost=3740.60 rows=19632 width=20)
-->>         ->  Index Scan using story_body_pkey on story_body 
(cost=3740.60 rows=19632 width=20)
EXPLAIN SELECT * FROM story, story_body WHERE head1='t' AND
story_body.story_id=story.story_id;
-->> Nested Loop  (cost=27951.11 rows=21786 width=145)
-->>   ->  Index Scan using story_sk4 on story  (cost=987.84 rows=12937
width=125)
-->>   ->  Index Scan using story_body_pkey on story_body  (cost=2.08
rows=18333 width=20)
(an ORDER BY clause just adds a redundant 'sort' to this).
When I run these queries I actually put a 'LIMIT 40' on them because I
only want to display the 40 most recent entries here.  The performance
of the queries with PGOPTIONS=-fs is exactly what I'm looking for, but
if I turn it on globally might it screw things up somewhere else?
Here's what the tables look like:
story ( story_id INT4, author INT4, written DATETIME, embargo DATETIME,
wcount INT4, chunk_count INT4, head1 CHAR, internal CHAR, rating FLOAT4,
reviews INT4, source TEXT, story_type TEXT, byline TEXT, title TEXT,
precis TEXT );
story_sk4 is an index on ( head1, written )
NOTICE:  --Relation story--
NOTICE:  Pages 376: Changed 0, Reapped 2, Empty 0, New 0; Tup 18898: Vac
0, Keep/VTL 0/0, Crash 0, UnUsed 15, MinLen 104, MaxLen 212; Re-using:
Free/Avail. Space 4600/156; EndEmpty/Avail. Pages 0/1. Elapsed 0/1 sec.
NOTICE:  Index story_sk4: Pages 121; Tuples 18898: Deleted 0. Elapsed
0/0 sec.
The file is almost exactly 3 MB.
story_body ( story_id INT4, chunk_seq INT4, story_body TEXT, PRIMARY KEY
( story_id, chunk_seq) );
NOTICE:  --Relation story_body--
NOTICE:  Pages 2662: Changed 0, Reapped 0, Empty 0, New 0; Tup 19632:
Vac 0, Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 50, MaxLen 4044;
Re-using: Free/Avail. Space 0/0; EndEmpty/Avail. Pages 0/0. Elapsed 0/1
sec.
NOTICE:  Index story_body_pkey: Pages 97; Tuples 19632. Elapsed 0/0 sec.
> BTW, you can suppress particular plan types in 6.5, it's just less
> convenient than in 7.0.  You have to start psql (not the backend;
> psql, or your choice of libpq-based client) with environment variable
> PGOPTIONS set to contain one or more of these switches:
>         -fs     disable seqscan
>         -fi     disable indexscan
>         -fn     disable nestloop
>         -fm     disable mergejoin
>         -fh     disable hashjoin
> So it's kind of painful, but you can see the estimates and measure the
> actual runtimes for different plan types that way.
Any way to do that from inside PHP?  That's where all my queries are
coming from...
> But really I'd like to get you to fire up 7.0beta and see what results
> you get from it on your data.  It's not too late to tweak 7.0 if we
> can figure out how to improve its planning.
OK, I'll do that once I'm back on a high-speed internet connection on
Monday.
Thanks for your help,
					Andrew.
> EXPLAIN SELECT * FROM story, story_body WHERE head1='t' AND
> story_body.story_id=story.story_id;
--> Nested Loop  (cost=27951.11 rows=21786 width=145)
--> ->  Index Scan using story_sk4 on story  (cost=987.84 rows=12937
> width=125)
--> ->  Index Scan using story_body_pkey on story_body  (cost=2.08
> rows=18333 width=20)
> (an ORDER BY clause just adds a redundant 'sort' to this).
I'm confused --- you show two different plans for the same query
and same conditions?  Cut&paste mistake, maybe?
> When I run these queries I actually put a 'LIMIT 40' on them because I
> only want to display the 40 most recent entries here.  The performance
> of the queries with PGOPTIONS=-fs is exactly what I'm looking for, but
> if I turn it on globally might it screw things up somewhere else?
Undoubtedly.  These options are really only useful for debugging;
I'd never suggest turning them on globally.  (7.0's SET command
is marginally better since you can turn it on for just one query,
but it's still a kluge rather than a solution...)
One good bit of news is that 7.0 does consider LIMIT in planning,
and will choose a fast-start plan if you have a small LIMIT (that's
the whole reason for the separate accounting for startup costs in
7.0).  So you might find that your queries act like you want in 7.0
without any help.  I'd still like to look into whether we can improve
the planning estimates, though.
regards, tom lane
OK, I installed 7.0beta3:
explain select * from story, story_body where head1='t' AND
story_body.story_id=story.story_id AND chunk_seq=0 order by written desc
limit 40;
Nested Loop  (cost=0.00..27753.01 rows=1810 width=145)
  ->  Index Scan Backward using story_sk1 on story  (cost=0.00..2020.68
rows=1810 width=125)
Wow! That's a heck of an improvement alright :-)
When I increase the limit to 400, as you explained would happen:
explain select * from story, story_body where head1='t' AND
story_body.story_id=story.story_id AND chunk_seq=0 order by written desc
limit 400;
Sort  (cost=3932.71..3932.71 rows=1810 width=145)
  ->  Hash Join  (cost=616.75..3834.77 rows=1810 width=145)
        ->  Seq Scan on story_body  (cost=0.00..2907.40 rows=18898
width=20)
        ->  Hash  (cost=612.23..612.23 rows=1810 width=125)
              ->  Seq Scan on story  (cost=0.00..612.23 rows=1810
width=125)
but since I have "WHERE head1=<constant> ... ORDER BY written" and there
is an index on (head1, written) should the optimisation be looking at
the index in any case?  I guess this sort of question can be endless...
:-)
BTW, I did have one problem with upgrading my databases using
pg_upgrade:  the pg_dumpall from 6.5.3 dumped some tables with boolean
fields like:
  xxxx bool default 't'
and (most) others like:
  xxxx bool default bool 't'
I'm not sure why this happened, but pg_upgrade choked on the first
syntax, didn't create one of the (core) tables for one database and left
me with a nice pile of custard!
Cheers,
That is the fault of pg_dump not working properly.
-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pg...@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
It evidently thinks that the cost of an indexscan over that many entries
will exceed the cost of a seqscan.  I'm not sure I have the cost
equations for this tradeoff quite right yet --- so the first question
is whether it's really guessing right.  You can experiment by setting
	set enable_seqscan = 'off';
which will discourage it from using a seqscan if it can find an
indexscan alternative.  Check out the plans you get that way, and
then do some timings to see if the estimated costs have anything
to do with reality... let me know ;-)
> BTW, I did have one problem with upgrading my databases using
> pg_upgrade:  the pg_dumpall from 6.5.3 dumped some tables with boolean
> fields like:
>   xxxx bool default 't'
> and (most) others like:
>   xxxx bool default bool 't'
> I'm not sure why this happened, but pg_upgrade choked on the first
> syntax, didn't create one of the (core) tables for one database and left
> me with a nice pile of custard!
Odd. Either syntax works for me with current sources:
regression=# create table foo (xxxx bool default 't');
CREATE
regression=# create table foof (xxxx bool default bool 't');
CREATE
What error did you get exactly?
regards, tom lane