/*The hash join plan is,
smalltable has about 48,000 records.
bigtable has about 168,000,000 records.
invtranref is char(10) and is the primary key for both tables
*/
SELECT
*
FROM IM_Match_Table smalltable
inner join invtran bigtable on
bigtable.invtranref = smalltable.invtranref
"Hash Join (cost=1681.87..6414169.04 rows=48261 width=171)"The nested loop join plan is,
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref, bigtable.itbatchref, bigtable.invtranref, bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units, bigtable.tranamount, bigtable.createmode, bigtable.transtat, bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"
" Hash Cond: (bigtable.invtranref = smalltable.invtranref)"
" -> Seq Scan on public.invtran bigtable (cost=0.00..4730787.28 rows=168121728 width=108)"
" Output: bigtable.itbatchref, bigtable.invtranref, bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units, bigtable.tranamount, bigtable.createmode, bigtable.transtat, bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"
" -> Hash (cost=1078.61..1078.61 rows=48261 width=63)"
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref"
" -> Seq Scan on public.im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63)"
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref"
"Nested Loop (cost=0.00..12888684.07 rows=48261 width=171)"The version is PostgreSQL 9.2.0 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit. Server specs are:
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref, bigtable.itbatchref, bigtable.invtranref, bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units, bigtable.tranamount, bigtable.createmode, bigtable.transtat, bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"
" -> Seq Scan on public.im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63)"
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref"
" -> Index Scan using pk_invtran on public.invtran bigtable (cost=0.00..267.03 rows=1 width=108)"
" Output: bigtable.itbatchref, bigtable.invtranref, bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units, bigtable.tranamount, bigtable.createmode, bigtable.transtat, bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"
" Index Cond: (bigtable.invtranref = smalltable.invtranref)"
Config changes are
Many Thanks
Huan
Hello All
While investigating switching to Postgres, we came across a query plan that uses a hash join and is a lot slower than a nested loop join.
I don't understand why the optimiser chooses the hash join in favor of the nested loop. What can I do to get the optimiser to make a better decision (nested loop in this case)? I have run analyze on both tables.
On Dec 12, 2012, at 8:44 AM, Huan Ruan <huan.r...@gmail.com> wrote:
On 12 December 2012 15:33, Evgeny Shishkin <itpar...@gmail.com> wrote:

The optimiser thinks that the nested loop is more expensive because of the point PK lookups, which are random I/O. Can you set random_page_cost to 2 or 3 and try again?
Hi Evgeny
Thanks for the quick reply. Setting random_page_cost to 3 doesn't make a difference, but setting it to 2 makes the optimiser choose the nested loop. However, with such a small penalty for random I/O, I worry that this setting will make other small queries incorrectly use an index where a sequential scan would be better. I understand random I/O is expensive, but in this case the optimiser already knows the big table is really big; shouldn't it work out that a sequential scan over 170 million records will be slower than 50,000 index lookups? Any thoughts?
Yes, this is the most common issue for me. Usually you just have to find the right combination of random and seq scan costs, shared_buffers, and effective_cache_size. If some of the queries work well with another value of, say, random_page_cost then, since it is a per-session parameter, you can SET it in your session before the query. But over time your table may change in size and distribution, and everything breaks. Not to mention the general ugliness from the application standpoint. Maybe somebody more experienced can help. Also, you can set different costs per tablespace.

Thanks
Huan
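A sketch of the two approaches mentioned above (the tablespace name is purely illustrative):

```sql
-- Per-session override: affects only the current connection,
-- so other queries keep the server-wide default.
SET random_page_cost = 2.0;
-- ... run the problem join here ...
RESET random_page_cost;

-- Per-tablespace override (hypothetical tablespace name),
-- useful when some tablespaces sit on faster storage:
ALTER TABLESPACE fast_ssd SET (random_page_cost = 2.0);
```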
This looks like the same large-index over-penalty as discussed in the
recent thread "[PERFORM] Slow query: bitmap scan troubles".
Back-patching the log(npages) change is starting to look like a good idea.
Cheers,
Jeff
Huan Ruan wrote:

> is a lot slower than a nested loop join.

Giving actual numbers is more useful than terms like "a lot". Even
better is to provide the output of EXPLAIN ANALYZE rather than
just EXPLAIN. This shows estimates against actual numbers, and gives
timings. For more suggestions see this page:
http://wiki.postgresql.org/wiki/SlowQueryQuestions
"QUERY PLAN"
"Hash Join (cost=1681.87..6414169.04 rows=48261 width=171) (actual time=2182.450..88158.645 rows=48257 loops=1)"
" Hash Cond: (bigtable.invtranref = smalltable.invtranref)"
" Buffers: shared hit=3950 read=3046219"
" -> Seq Scan on invtran bigtable (cost=0.00..4730787.28 rows=168121728 width=108) (actual time=0.051..32581.052 rows=168121657 loops=1)"
" Buffers: shared hit=3351 read=3046219"" -> Hash (cost=1078.61..1078.61 rows=48261 width=63) (actual time=21.751..21.751 rows=48261 loops=1)"" Buckets: 8192 Batches: 1 Memory Usage: 4808kB"" Buffers: shared hit=596"" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63) (actual time=0.007..8.299 rows=48261 loops=1)"" Buffers: shared hit=596""Total runtime: 88162.417 ms"
"QUERY PLAN""Hash Join (cost=1681.87..6414169.04 rows=48261 width=171) (actual time=2280.390..87934.540 rows=48257 loops=1)"
" Hash Cond: (bigtable.invtranref = smalltable.invtranref)"
" Buffers: shared hit=3982 read=3046187"" -> Seq Scan on invtran bigtable (cost=0.00..4730787.28 rows=168121728 width=108) (actual time=0.052..32747.805 rows=168121657 loops=1)"" Buffers: shared hit=3383 read=3046187"" -> Hash (cost=1078.61..1078.61 rows=48261 width=63) (actual time=62.161..62.161 rows=48261 loops=1)"" Buckets: 8192 Batches: 1 Memory Usage: 4808kB"" Buffers: shared hit=596"" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63) (actual time=0.006..8.209 rows=48261 loops=1)"" Buffers: shared hit=596""Total runtime: 87938.584 ms"
"QUERY PLAN"
"Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual time=0.056..551.438 rows=48257 loops=1)"
" Buffers: shared hit=242267"" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63) (actual time=0.009..7.353 rows=48261 loops=1)"" Buffers: shared hit=596"
" -> Index Scan using pk_invtran on invtran bigtable (cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010 rows=1 loops=48261)"
" Index Cond: (invtranref = smalltable.invtranref)"" Buffers: shared hit=241671""Total runtime: 555.336 ms"
"QUERY PLAN""Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual time=0.058..554.215 rows=48257 loops=1)"" Buffers: shared hit=242267"" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63) (actual time=0.009..7.416 rows=48261 loops=1)"" Buffers: shared hit=596"
" -> Index Scan using pk_invtran on invtran bigtable (cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010 rows=1 loops=48261)"
" Index Cond: (invtranref = smalltable.invtranref)"" Buffers: shared hit=241671""Total runtime: 558.095 ms"
> Config changes are
> I don't understand why the optimiser chooses the hash join in
> favor of the nested loop. What can I do to get the optimiser to
> make a better decision (nested loop in this case)? I have run
> analyze on both tables.
>
> - shared_buffers = 6GB
> - effective_cache_size = 18GB
> - work_mem = 10MB
> - maintenance_work_mem = 3GB
As already suggested, there was a change made in 9.2 which may have
over-penalized nested loops using index scans. This may be fixed in
the next minor release.
Also, as already suggested, you may want to reduce random_page_cost
to bring it in line with the actual cost relative to seq_page_cost,
based on your cache hit ratio.
Additionally, I just routinely set cpu_tuple_cost higher than the
default of 0.01. I find that 0.03 to 0.05 better models the actual
relative cost of processing a tuple.
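To see how that parameter feeds into the estimates in the posted plans: a sequential scan is costed at roughly pages * seq_page_cost + rows * cpu_tuple_cost (ignoring smaller terms). Plugging in the invtran numbers (about 3,049,570 pages, per the BUFFERS counts, and 168,121,728 rows) with the defaults seq_page_cost = 1.0 and cpu_tuple_cost = 0.01 reproduces the 4730787.28 estimate, and raising cpu_tuple_cost to 0.03 makes the scan look considerably more expensive:

```sql
-- Approximate invtran seq-scan cost with default settings:
SELECT 3049570 * 1.0 + 168121728 * 0.01;  -- 4730787.28
-- The same scan with cpu_tuple_cost = 0.03:
SELECT 3049570 * 1.0 + 168121728 * 0.03;  -- 8093221.84
```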
-Kevin
Huan Ruan wrote:
> Hash 1st run
> "Hash Join (cost=1681.87..6414169.04 rows=48261 width=171)
> (actual time=2182.450..88158.645 rows=48257 loops=1)"
> " -> Seq Scan on invtran bigtable (cost=0.00..4730787.28194 nanoseconds per row suggests 100% cache hits.
> rows=168121728 width=108) (actual time=0.051..32581.052
> rows=168121657 loops=1)"
> NL 1st run
> "Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual
> time=0.056..551.438 rows=48257 loops=1)"
> " -> Index Scan using pk_invtran on invtran bigtable10 microseconds per index scan (each index scan requiring multiple
> (cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010
> rows=1 loops=48261)"
"random" accesses) also suggests 100% cache hits.
If you don't expect such high cache hit ratios in production, you
probably don't want to go so low with page costs.
>>> - shared_buffers = 6GB
>>> - effective_cache_size = 18GB
>>> - work_mem = 10MB
>>> - maintenance_work_mem = 3GB
> Can you see any obvious issues with the other memory settings I
> changed?

I might bump work_mem up to somewhere in the 20MB to 60MB range,
as long as you're not
going crazy with max_connections. I would probably take
maintenance_work_mem down to 1GB to 2GB -- you can have several of
these allocations at one time, and you don't want to blow away your
cache. (I think it might actually be adjusted down to 2GB
internally anyway; but I would need to check.)
-Kevin
With a low cache hit rate, that would generally be when the number
of lookups into the table exceeds about 10% of the table's rows.
Quite possibly, but it could be any of a number of other things,
like a type mismatch. It might be best to rule out other causes. If
you post the new query and EXPLAIN ANALYZE output, along with the
settings you have now adopted, someone may be able to spot
something. It wouldn't hurt to repeat OS and hardware info with it
so people have it handy for reference.
Config changes are
explain (analyze, buffers)
SELECT
*
FROM IM_Match_Table smalltable
inner join invtran bigtable on bigtable.invtranref = smalltable.invtranref
"QUERY PLAN"
"Nested Loop (cost=0.00..341698.92 rows=48261 width=171) (actual time=0.042..567.980 rows=48257 loops=1)"
" Buffers: shared hit=242267"
" -> Seq Scan on im_match_table smalltable (cost=0.00..2472.65 rows=48261 width=63) (actual time=0.006..8.230 rows=48261 loops=1)"
" Buffers: shared hit=596"
" -> Index Scan using pk_invtran on invtran bigtable (cost=0.00..6.98 rows=1 width=108) (actual time=0.010..0.011 rows=1 loops=48261)"
" Index Cond: (invtranref = smalltable.invtranref)"
" Buffers: shared hit=241671"
"Total runtime: 571.662 ms"
Frankly, at 12 microseconds per matched pair of rows, I think
you're doing OK.
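For reference, that per-pair figure comes directly from the final plan's totals:

```sql
-- 571.662 ms total runtime over 48,257 matched rows,
-- converted from milliseconds to microseconds per pair:
SELECT 571.662 / 48257 * 1000;  -- ~11.8 microseconds
```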