/*The hash join plan is,
smalltable has about 48,000 records.
bigtable has about 168,000,000 records.
invtranref is char(10) and is the primary key for both tables
*/
SELECT
*
FROM IM_Match_Table smalltable
inner join invtran bigtable on
bigtable.invtranref = smalltable.invtranref
"Hash Join (cost=1681.87..6414169.04 rows=48261 width=171)"The nested loop join plan is,
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref, bigtable.itbatchref, bigtable.invtranref, bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units, bigtable.tranamount, bigtable.createmode, bigtable.transtat, bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"
" Hash Cond: (bigtable.invtranref = smalltable.invtranref)"
" -> Seq Scan on public.invtran bigtable (cost=0.00..4730787.28 rows=168121728 width=108)"
" Output: bigtable.itbatchref, bigtable.invtranref, bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units, bigtable.tranamount, bigtable.createmode, bigtable.transtat, bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"
" -> Hash (cost=1078.61..1078.61 rows=48261 width=63)"
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref"
" -> Seq Scan on public.im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63)"
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref"
"Nested Loop (cost=0.00..12888684.07 rows=48261 width=171)"The version is PostgreSQL 9.2.0 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit. Server specs are:
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref, bigtable.itbatchref, bigtable.invtranref, bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units, bigtable.tranamount, bigtable.createmode, bigtable.transtat, bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"
" -> Seq Scan on public.im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63)"
" Output: smalltable.invtranref, smalltable.itbatchref, smalltable.trantype, smalltable.trandate, smalltable.invprodref, smalltable.invheadref"
" -> Index Scan using pk_invtran on public.invtran bigtable (cost=0.00..267.03 rows=1 width=108)"
" Output: bigtable.itbatchref, bigtable.invtranref, bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units, bigtable.tranamount, bigtable.createmode, bigtable.transtat, bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"
" Index Cond: (bigtable.invtranref = smalltable.invtranref)"
Config changes are
Many Thanks
Huan
Hello All
While investigating switching to Postgres, we came across a query plan that uses a hash join and is a lot slower than a nested loop join.
I don't understand why the optimiser chooses the hash join in favor of the nested loop. What can I do to get the optimiser to make a better decision (nested loop in this case)? I have run analyze on both tables.
On Dec 12, 2012, at 8:44 AM, Huan Ruan <huan.r...@gmail.com> wrote:
On 12 December 2012 15:33, Evgeny Shishkin <itpar...@gmail.com> wrote:

The optimiser thinks that the nested loop is more expensive because of the point PK lookups, which are random I/O. Can you set random_page_cost to 2 or 3 and try again?
Hi Evgeny
Thanks for the quick reply. Setting random_page_cost to 3 doesn't make a difference, but setting it to 2 makes the optimiser choose the nested loop. However, with such a small penalty for random I/O, I worry that this setting will make other small queries incorrectly use an index where a sequential scan would be better. I understand random I/O is expensive, but in this case the optimiser already knows the big table is really big; shouldn't it work out that a sequential scan over 170 million records will be slower than 50,000 index lookups? Any thoughts?
Yes, this is the most common issue for me. Usually you just have to find the right combination of random and seq scan costs, shared_buffers, and effective_cache_size. If some of the queries work well with another value of, say, random_page_cost then, since it is a per-session parameter, you can SET it in your session before the query. But over time your table may change in size and distribution, and everything breaks. Not to mention the general ugliness from the application standpoint. Maybe somebody more experienced can help. Also, you can set different costs per tablespace.

Thanks
Huan
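A sketch of the two approaches mentioned above (the tablespace name is purely illustrative):

```sql
-- Per-session override: affects only the current connection,
-- so other queries keep the server-wide default.
SET random_page_cost = 2.0;
-- ... run the problem join here ...
RESET random_page_cost;

-- Per-tablespace override (hypothetical tablespace name),
-- useful when some tablespaces sit on faster storage:
ALTER TABLESPACE fast_ssd SET (random_page_cost = 2.0);
```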
This looks like the same large-index over-penalty as discussed in the
recent thread "[PERFORM] Slow query: bitmap scan troubles".
Back-patching the log(npages) change is starting to look like a good idea.
Cheers,
Jeff
Huan Ruan wrote:

> is a lot slower than a nested loop join.

Giving actual numbers is more useful than terms like "a lot". Even
better is to provide the output of EXPLAIN ANALYZE rather than
just EXPLAIN. This shows estimates against actual numbers, and gives
timings. For more suggestions see this page:
http://wiki.postgresql.org/wiki/SlowQueryQuestions
"QUERY PLAN"
"Hash Join (cost=1681.87..6414169.04 rows=48261 width=171) (actual time=2182.450..88158.645 rows=48257 loops=1)"
" Hash Cond: (bigtable.invtranref = smalltable.invtranref)"
" Buffers: shared hit=3950 read=3046219"
" -> Seq Scan on invtran bigtable (cost=0.00..4730787.28 rows=168121728 width=108) (actual time=0.051..32581.052 rows=168121657 loops=1)"
" Buffers: shared hit=3351 read=3046219"" -> Hash (cost=1078.61..1078.61 rows=48261 width=63) (actual time=21.751..21.751 rows=48261 loops=1)"" Buckets: 8192 Batches: 1 Memory Usage: 4808kB"" Buffers: shared hit=596"" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63) (actual time=0.007..8.299 rows=48261 loops=1)"" Buffers: shared hit=596""Total runtime: 88162.417 ms"
"QUERY PLAN""Hash Join (cost=1681.87..6414169.04 rows=48261 width=171) (actual time=2280.390..87934.540 rows=48257 loops=1)"
" Hash Cond: (bigtable.invtranref = smalltable.invtranref)"
" Buffers: shared hit=3982 read=3046187"" -> Seq Scan on invtran bigtable (cost=0.00..4730787.28 rows=168121728 width=108) (actual time=0.052..32747.805 rows=168121657 loops=1)"" Buffers: shared hit=3383 read=3046187"" -> Hash (cost=1078.61..1078.61 rows=48261 width=63) (actual time=62.161..62.161 rows=48261 loops=1)"" Buckets: 8192 Batches: 1 Memory Usage: 4808kB"" Buffers: shared hit=596"" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63) (actual time=0.006..8.209 rows=48261 loops=1)"" Buffers: shared hit=596""Total runtime: 87938.584 ms"
"QUERY PLAN"
"Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual time=0.056..551.438 rows=48257 loops=1)"
" Buffers: shared hit=242267"" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63) (actual time=0.009..7.353 rows=48261 loops=1)"" Buffers: shared hit=596"
" -> Index Scan using pk_invtran on invtran bigtable (cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010 rows=1 loops=48261)"
" Index Cond: (invtranref = smalltable.invtranref)"" Buffers: shared hit=241671""Total runtime: 555.336 ms"
"QUERY PLAN""Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual time=0.058..554.215 rows=48257 loops=1)"" Buffers: shared hit=242267"" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61 rows=48261 width=63) (actual time=0.009..7.416 rows=48261 loops=1)"" Buffers: shared hit=596"
" -> Index Scan using pk_invtran on invtran bigtable (cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010 rows=1 loops=48261)"
" Index Cond: (invtranref = smalltable.invtranref)"" Buffers: shared hit=241671""Total runtime: 558.095 ms"
> Config changes are
> I don't understand why the optimiser chooses the hash join in
> favor of the nested loop. What can I do to get the optimiser to
> make a better decision (nested loop in this case)? I have run
> analyze on both tables.
>
> - shared_buffers = 6GB
> - effective_cache_size = 18GB
> - work_mem = 10MB
> - maintenance_work_mem = 3GB
As already suggested, there was a change made in 9.2 which may have
over-penalized nested loops using index scans. This may be fixed in
the next minor release.
Also, as already suggested, you may want to reduce random_page_cost
to bring it in line with the actual cost relative to seq_page_cost,
based on your cache hit ratio.
Additionally, I just routinely set cpu_tuple_cost higher than the
default of 0.01. I find that 0.03 to 0.05 better models the actual
relative cost of processing a tuple.
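To see how that parameter feeds into the estimates in the posted plans: a sequential scan is costed at roughly pages * seq_page_cost + rows * cpu_tuple_cost (ignoring smaller terms). Plugging in the invtran numbers (about 3,049,570 pages, per the BUFFERS counts, and 168,121,728 rows) with the defaults seq_page_cost = 1.0 and cpu_tuple_cost = 0.01 reproduces the 4730787.28 estimate, and raising cpu_tuple_cost to 0.03 makes the scan look considerably more expensive:

```sql
-- Approximate invtran seq-scan cost with default settings:
SELECT 3049570 * 1.0 + 168121728 * 0.01;  -- 4730787.28
-- The same scan with cpu_tuple_cost = 0.03:
SELECT 3049570 * 1.0 + 168121728 * 0.03;  -- 8093221.84
```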
-Kevin
Huan Ruan wrote:
> Hash 1st run
> "Hash Join (cost=1681.87..6414169.04 rows=48261 width=171)
> (actual time=2182.450..88158.645 rows=48257 loops=1)"
> " -> Seq Scan on invtran bigtable (cost=0.00..4730787.28194 nanoseconds per row suggests 100% cache hits.
> rows=168121728 width=108) (actual time=0.051..32581.052
> rows=168121657 loops=1)"
> NL 1st run
> "Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual
> time=0.056..551.438 rows=48257 loops=1)"
> " -> Index Scan using pk_invtran on invtran bigtable10 microseconds per index scan (each index scan requiring multiple
> (cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010
> rows=1 loops=48261)"
"random" accesses) also suggests 100% cache hits.
If you don't expect such high cache hit ratios in production, you
probably don't want to go so low with page costs.
>>> - shared_buffers = 6GB
>>> - effective_cache_size = 18GB
>>> - work_mem = 10MB
>>> - maintenance_work_mem = 3GB
> Can you see any obvious issues with the other memory settings I
> changed?

I might bump work_mem up to somewhere in the 20MB to 60MB range,
as long as you're not
going crazy with max_connections. I would probably take
maintenance_work_mem down to 1GB to 2GB -- you can have several of
these allocations at one time, and you don't want to blow away your
cache. (I think it might actually be adjusted down to 2GB
internally anyway; but I would need to check.)
-Kevin
With a low cache hit rate, that would generally be when the number
of lookups into the table exceeds about 10% of the table's rows.
Quite possibly, but it could be any of a number of other things,
like a type mismatch. It might be best to rule out other causes. If
you post the new query and EXPLAIN ANALYZE output, along with the
settings you have now adopted, someone may be able to spot
something. It wouldn't hurt to repeat OS and hardware info with it
so people have it handy for reference.
Config changes are
explain (analyze, buffers)
SELECT
*
FROM IM_Match_Table smalltable
inner join invtran bigtable on bigtable.invtranref = smalltable.invtranref
"QUERY PLAN"
"Nested Loop (cost=0.00..341698.92 rows=48261 width=171) (actual time=0.042..567.980 rows=48257 loops=1)"
" Buffers: shared hit=242267"
" -> Seq Scan on im_match_table smalltable (cost=0.00..2472.65 rows=48261 width=63) (actual time=0.006..8.230 rows=48261 loops=1)"
" Buffers: shared hit=596"
" -> Index Scan using pk_invtran on invtran bigtable (cost=0.00..6.98 rows=1 width=108) (actual time=0.010..0.011 rows=1 loops=48261)"
" Index Cond: (invtranref = smalltable.invtranref)"
" Buffers: shared hit=241671"
"Total runtime: 571.662 ms"
Frankly, at 12 microseconds per matched pair of rows, I think
you're doing OK.
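For reference, that per-pair figure comes directly from the final plan's totals:

```sql
-- 571.662 ms total runtime over 48,257 matched rows,
-- converted from milliseconds to microseconds per pair:
SELECT 571.662 / 48257 * 1000;  -- ~11.8 microseconds
```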