Optimizing query


Linus Pettersson

Jan 25, 2012, 11:14:31 AM
to rubyonra...@googlegroups.com
Hi!

I'm trying to speed up some queries in my app, which runs Rails 3.2 on Heroku with PostgreSQL, and I need some help. I'm new to PostgreSQL and would like to optimize a query so that it actually uses my indices.

This is the query I'm currently working on:

In the link above you can also see the full EXPLAIN (query plan) as well as my current indices. It doesn't seem to use any of my indices at the moment...

So, any help is much appreciated.

Regards
Linus

Jim

Jan 25, 2012, 12:08:13 PM
to Ruby on Rails: Talk
It would probably help if you posted the generated SQL as well as the
output of EXPLAIN ANALYZE, which shows not just what the planner
thinks the costs are, but what the costs actually were.

You could also try running ANALYZE on the tables in question, or
ANALYZE VERBOSE, and check whether the estimates are close to
reality, e.g.

ANALYZE VERBOSE categories;
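
For EXPLAIN ANALYZE, something like this from the Rails console should work
(a rough, untested sketch; substitute whatever relation your controller
actually builds):

# Ask PostgreSQL to execute the generated SQL and report the real timings.
# to_sql and connection.execute are standard ActiveRecord calls.
relation = Category.order("categories.name ASC")
result = ActiveRecord::Base.connection.execute("EXPLAIN ANALYZE #{relation.to_sql}")
result.each { |row| puts row["QUERY PLAN"] }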

Jim Crate

Linus Pettersson

Jan 25, 2012, 2:20:16 PM
to rubyonra...@googlegroups.com
Thank you. I have added the generated SQL to the snipt in my first post.

Regarding the EXPLAIN ANALYZE, can I somehow run that on Heroku? I'm using their shared database so I don't have access to any psql console...

I did run it on my local machine though. The thing is that it differs both in database version (local is PostgreSQL 9.1.2, while Heroku still uses 8.3 or so) and in size.

Anyway, the EXPLAIN ANALYZE from my local machine can be found here:

Regards
Linus

Peter Vandenabeele

Jan 25, 2012, 2:29:36 PM
to rubyonra...@googlegroups.com
On Wed, Jan 25, 2012 at 8:20 PM, Linus Pettersson <linus.pe...@gmail.com> wrote:
> Thank you. I have added the generated SQL to the snipt in my first post.
>
> Regarding the EXPLAIN ANALYZE, can I somehow run that on Heroku? I'm using their shared database so I don't have access to any psql console...
>
> I did run it on my local machine though. The thing is that it differs both in database version (local is PostgreSQL 9.1.2, while Heroku still uses 8.3 or so) and in size.

On Debian/Ubuntu you typically have a choice of an 8.4 or a 9.1 ...

Try

$ apt-cache search postgresql-8.4

and then you could install that alongside 9.1.

It will run on a different port
(one on 5432 and one on 5433 IIRC, check $ netstat -atn).

That said, I have no information that the optimization would be
different for such a "simple" query.


> Anyway, the EXPLAIN ANALYZE from my local machine can be found here:


<speculative>
Just out of curiosity ... what problem are you trying to solve? Did you
actually see any slow queries? I don't see obvious problems with the
query ... Not sure if this is the area that deserves your first attention
w.r.t. performance optimization ... there is the whole story about
"premature optimization" (Google it to find many rants about it ...).
</speculative>

HTH,

Peter

Linus Pettersson

Jan 25, 2012, 2:51:27 PM
to rubyonra...@googlegroups.com, pe...@vandenabeele.com
Well, maybe it's not necessary... It is the slowest of my queries as far as I can see anyway.

I'm using Rails 3.2 and I have enabled automatic EXPLAIN on slow queries (>0.5s). This query is currently being "explained" in my console when I run it on my localhost. "SQL (556.5ms) ..."
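
For reference, that is this Rails 3.2 setting in config/environments/development.rb:

# log an automatic EXPLAIN for any query that takes longer than half a second
config.active_record.auto_explain_threshold_in_seconds = 0.5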

556.5ms seems a bit slow to me, doesn't it?

Then a bunch of other stuff is happening and I get this:
"Completed 200 OK in 2737ms (Views: 813.5ms | ActiveRecord: 703.2ms)"

So, 550ms of the total 703ms is the above query.

Maybe I'm just picky? :)

Peter Vandenabeele

Jan 25, 2012, 4:56:23 PM
to rubyonra...@googlegroups.com
On Wed, Jan 25, 2012 at 8:51 PM, Linus Pettersson <linus.pe...@gmail.com> wrote:
> Well, maybe it's not necessary... It is the slowest of my queries as far as I can see anyway.
>
> I'm using Rails 3.2 and I have enabled automatic EXPLAIN on slow queries (>0.5s). This query is currently being "explained" in my console when I run it on my localhost. "SQL (556.5ms) ..."
>
> 556.5ms seems a bit slow to me, doesn't it?

Indeed.

TL;DR
* On my system, a similar query is a lot faster
* You may have an expensive sort on categories.name and subcategories.name without an index?


I had a similar database structure here, filled it with 10,000 records (you seem to have around 2,200?),
PostgreSQL 9.1 on Ubuntu, and ran a similar query:

cl1 = ContactLine.includes(:contact) ; nil   # the trailing `; nil` just keeps the console from printing the relation
cl2 = cl1.includes(:contact => :person) ; nil
cl3 = cl2.where("people.first_name = 'M' OR people.first_name = 'longer_word'") ; nil # 8 times 'M' and 8 times 'longer_word'
cl4 = cl3.order("contacts.email") ; nil

167:0> puts cl4.explain
  SQL (16.9ms)  SELECT "contact_lines"."id" AS ...
FROM "contact_lines"
LEFT OUTER JOIN "contacts" ON "contacts"."id" = "contact_lines"."contact_id"
LEFT OUTER JOIN "people" ON "people"."id" = "contacts"."person_id"
WHERE (people.first_name = 'M' OR people.first_name = 'longer_word')
ORDER BY contacts.email
  EXPLAIN (1.4ms)  EXPLAIN SELECT "contact_lines"."id" ... <same query description>
EXPLAIN for: SELECT "contact_lines"."id" ...
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=691.05..691.06 rows=1 width=2699)
   Sort Key: contacts.email
   ->  Hash Join  (cost=455.53..691.04 rows=1 width=2699)
         Hash Cond: (contact_lines.contact_id = contacts.id)
         ->  Seq Scan on contact_lines  (cost=0.00..198.00 rows=10000 width=50)
         ->  Hash  (cost=455.52..455.52 rows=1 width=2649)
               ->  Hash Join  (cost=229.01..455.52 rows=1 width=2649)
                     Hash Cond: (contacts.person_id = people.id)
                     ->  Seq Scan on contacts  (cost=0.00..189.00 rows=10000 width=41)
                     ->  Hash  (cost=229.00..229.00 rows=1 width=2608)
                           ->  Seq Scan on people  (cost=0.00..229.00 rows=1 width=2608)
                                 Filter: (((first_name)::text = 'M'::text) OR ((first_name)::text = 'longer_word'::text))
(12 rows)
=> nil

The log on a development server (plain `rails s`) is:

Started GET "/contact_lines" for 127.0.0.1 at 2012-01-25 22:26:41 +0100
Processing by ContactLinesController#index as HTML
  SQL (8.6ms)  SELECT "contact_lines"."id" AS t0_r0...
... ORDER BY contacts.email
  Rendered contact_lines/index.html.haml within layouts/application (99.4ms)
Completed 200 OK in 105ms (Views: 89.5ms | ActiveRecord: 14.7ms)


About indexes ... The above is without explicitly set indexes.

Assuming that the first_name::text filtering would be the most expensive, I added
this migration:

class AddIndexFirstName < ActiveRecord::Migration
  def change
    add_index :people, :first_name
  end
end

This worked:

# \d people
                                      Table "public.people"
    Column    |            Type             |                      Modifiers                      
--------------+-----------------------------+-----------------------------------------------------
 id           | integer                     | not null default nextval('people_id_seq'::regclass)
 full_name    | character varying(255)      | 
 first_name   | character varying(255)      | 
...
Indexes:
    "people_pkey" PRIMARY KEY, btree (id)
    "index_people_on_first_name" btree (first_name)

The EXPLAIN changed:

SQL (12.7ms)  SELECT "contact_lines"."id" AS t0_r0, ...

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=474.59..474.59 rows=1 width=2699)
   Sort Key: contacts.email
   ->  Hash Join  (cost=239.07..474.58 rows=1 width=2699)
         Hash Cond: (contact_lines.contact_id = contacts.id)
         ->  Seq Scan on contact_lines  (cost=0.00..198.00 rows=10000 width=50)
         ->  Hash  (cost=239.05..239.05 rows=1 width=2649)
               ->  Hash Join  (cost=12.54..239.05 rows=1 width=2649)
                     Hash Cond: (contacts.person_id = people.id)
                     ->  Seq Scan on contacts  (cost=0.00..189.00 rows=10000 width=41)
                     ->  Hash  (cost=12.53..12.53 rows=1 width=2608)
                           ->  Bitmap Heap Scan on people  (cost=8.52..12.53 rows=1 width=2608)
                                 Recheck Cond: (((first_name)::text = 'M'::text) OR ((first_name)::text = 'longer_word'::text))
                                 ->  BitmapOr  (cost=8.52..8.52 rows=1 width=0)
                                       ->  Bitmap Index Scan on index_people_on_first_name  (cost=0.00..4.26 rows=1 width=0)
                                             Index Cond: ((first_name)::text = 'M'::text)
                                       ->  Bitmap Index Scan on index_people_on_first_name  (cost=0.00..4.26 rows=1 width=0)
                                             Index Cond: ((first_name)::text = 'longer_word'::text)
(17 rows)

But the performance was the same. A log of a hit to the server:

Processing by ContactLinesController#index as HTML
  SQL (11.7ms)  SELECT "contact_lines"."id" AS t0_r0, ...
...
WHERE (people.first_name = 'M' OR people.first_name = 'longer_word')
  Rendered contact_lines/index.html.haml within layouts/application (104.1ms)
Completed 200 OK in 112ms (Views: 90.8ms | ActiveRecord: 19.8ms)



Maybe it is related to something in your query plan that I do not see here
and that looks suspicious:


Sort  (cost=3267.32..3272.25 rows=1972 width=2104) (actual time=39.308..39.468 rows=1880 loops=1)
   Sort Method: quicksort  Memory: 313kB  ####### ??



You seem not to have an index on those 2 `name` columns?


What happens to performance when you try removing the sort?

Category.eager_load(:subcategories)
        .joins("INNER JOIN products AS p ON resellercategories.id = p.resellercategory_id")
## WITHOUT SORT        .order("categories.name ASC, subcategories.name ASC")
        .where("p.gender = 'unisex' OR p.gender = 'male'")

If that makes a significant difference, what happens when you add indices on those 2
columns (and possibly remove some indices on other columns)?
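
Something like this migration (a sketch, using the table/column names from
your ORDER BY):

class AddNameIndices < ActiveRecord::Migration
  def change
    # indexes on the two columns the sort runs over
    add_index :categories, :name
    add_index :subcategories, :name
  end
end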



> Then a bunch of other stuff is happening and I get this:
> "Completed 200 OK in 2737ms (Views: 813.5ms | ActiveRecord: 703.2ms)"
>
> So, 550ms of the total 703ms is the above query.
>
> Maybe I'm just picky? :)

Those indeed seem "large" numbers for a new/fresh application...

Peter

Linus Pettersson

Jan 26, 2012, 12:15:17 PM
to rubyonra...@googlegroups.com, pe...@vandenabeele.com
Hi

I tried removing the .order(...) part and indeed, the query time goes down to ~100ms. However, adding indices doesn't help, at least not the way I did it :)

add_index :categories, :name
add_index :subcategories, :name

I did some more testing, and if I keep the .order(...) but don't join the products table, the query runs in about 55ms. So the bottleneck seems to be the products table.
The query that I'm running looks like this:

Category.eager_load(:subcategories)
             .joins("INNER JOIN products AS p ON resellercategories.id = p.resellercategory_id")
             .order("categories.name ASC, subcategories.name ASC")

(Skipping the gender here...)

What I have is Categories and Subcategories. They are related to each other through a Resellercategories table. Products are related to Resellercategories.
So, the reason that I want to join the products as well is that I only want to show categories and subcategories that actually have some products (there are still some empty categories/subcategories).
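
In model terms, roughly (a simplified sketch; the association names are approximate):

# simplified sketch of the associations described above
class Category < ActiveRecord::Base
  has_many :resellercategories
  has_many :subcategories, :through => :resellercategories
end

class Resellercategory < ActiveRecord::Base
  belongs_to :category
  belongs_to :subcategory
  has_many :products
end

class Product < ActiveRecord::Base
  belongs_to :resellercategory
end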

So the above query is what we came up with in another thread here in the group.

- Maybe there is a better way to check whether a category/subcategory has products, without joining the entire products table?


BTW, in my dev database there are about 8,700 rows in the products table, 2,200 rows in resellercategories, 5 rows in categories and 45 rows in subcategories.

Linus Pettersson

Jan 26, 2012, 12:31:51 PM
to rubyonra...@googlegroups.com, pe...@vandenabeele.com
Also, another reason for the query that I forgot to mention: if a user filters the products for female products only, for instance, it should only show categories and subcategories that contain products for that gender.

The gender is specified in the products table...

Peter Vandenabeele

Jan 26, 2012, 12:37:12 PM
to rubyonra...@googlegroups.com
On Thu, Jan 26, 2012 at 6:15 PM, Linus Pettersson <linus.pe...@gmail.com> wrote:
> Hi
>
> I tried removing the .order(...) part and indeed, the query time goes down to ~100ms. However, adding indices doesn't help, at least not the way I did it :)
>
> add_index :categories, :name
> add_index :subcategories, :name
>
> I did some more testing, and if I keep the .order(...) but don't join the products table, the query runs in about 55ms. So the bottleneck seems to be the products table.
> The query that I'm running looks like this:
>
> Category.eager_load(:subcategories)
>              .joins("INNER JOIN products AS p ON resellercategories.id = p.resellercategory_id")
>              .order("categories.name ASC, subcategories.name ASC")
>
> (Skipping the gender here...)
>
> What I have is Categories and Subcategories. They are related to each other through a Resellercategories table. Products are related to Resellercategories.
> So, the reason that I want to join the products as well is that I only want to show categories and subcategories that actually have some products (there are still some empty categories/subcategories).
>
> So the above query is what we came up with in another thread here in the group.
>
> - Maybe there is a better way to check whether a category/subcategory has products, without joining the entire products table?


It is possible to add a :counter_cache, but then you need to make sure you
use the proper methods for each product that you add to or remove from the
association.
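
E.g. (a sketch; it assumes an integer products_count column on resellercategories):

class Product < ActiveRecord::Base
  # Rails keeps resellercategories.products_count up to date automatically,
  # but only via create/destroy, not delete_all or raw SQL
  belongs_to :resellercategory, :counter_cache => true
end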

As an alternative to the default counter cache (from Rails), you could build your own
logic, as in:

* has_male_products
* ...

changing your query to ...

Category.eager_load(:subcategories).
               where(:has_male_products => true).
               order(...)

Then you would need to set the cache on the appropriate categories in
an after_save on the product you are creating/updating/deactivating/(deleting?).
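
Roughly like this (a sketch; it assumes Category also has
has_many :products, :through => :resellercategories):

class Product < ActiveRecord::Base
  belongs_to :resellercategory
  after_save :refresh_gender_flags

  private

  # recompute the cached flag from the products the category currently has
  def refresh_gender_flags
    category = resellercategory.category
    category.update_attribute(:has_male_products,
      category.products.where(:gender => ['male', 'unisex']).exists?)
  end
end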

Both ideas would probably be faster for querying, but certainly more
complex, since you must make sure the cache is always correct.

HTH,

Peter

Peter Vandenabeele

Jan 26, 2012, 2:08:33 PM
to rubyonra...@googlegroups.com
Sorry to reply to my own post.


TL;DR  Is there pagination? Then a smaller set may return much faster.


I was thinking over my reply and may have forgotten a fundamental
aspect ... you say around 2,200 categories and 8,000 products.

How many entries does your query return?
(Replace .all with .count at the end.)

How many do you need?

What happens when you add .limit(20) to your query?

By which "primary object" do you want to sort and paginate ?
(I will assume 'Product' in the discussion below).

With the includes that are currently implemented, you may
have to redo the whole query as 2 queries ...

1) for fetching the "primary objects"
(e.g. exactly 20 Products, no additional "has_many" data, because
that would increase the number of returned rows for 1 product and
make proper pagination in the database impossible; including
"belongs_to" here is no problem)

2) a second query for fetching the eventual "has_many" data
on those 20 "primary products" (is that "reification"?)
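
In code, roughly (a sketch; Product, products.name and the :product_images
association are just placeholders for your real names):

# 1) paginate only the "primary objects" -- no has_many joins here,
#    so LIMIT counts 20 products, not 20 joined rows
page_ids = Product.order("products.name").limit(20).pluck(:id)

# 2) fetch the has_many data for exactly those 20 in a second query
products = Product.includes(:product_images).where(:id => page_ids)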

If the performance problem can be solved by taking the
pagination into account, that would be a _much_ better
solution than building cache columns in this early phase
of your project.

HTH,

Peter

Linus Pettersson

Jan 26, 2012, 2:51:14 PM
to rubyonra...@googlegroups.com, pe...@vandenabeele.com
This query is only to get the appropriate Categories & Subcategories. There are 5 Categories and 45 Subcategories.

My products are imported from webshops which are using different categories for the same things ("tshirt", "t-shirts", "t-shirt", "short sleeved shirts" may all be the same). To cope with this issue I have the "Resellercategories" that I relate to a specific Category and Subcategory.

Category/Subcategory -> Resellercategory -> Product

(Resellercategories are never seen in the frontend)

If I filter the products for "male" I only want to display (non-empty) Categories/Subcategories that have male products.


To answer your question, there is no pagination on the Categories and Subcategories that I'm fetching here with this query. The query only returns the appropriate Categories/Subcategories. So if filtered by "male" it returns 3 Categories and 12 Subcategories (in my dev environment).


I am using pagination for the actual products though, but that is a separate query.

Peter Vandenabeele

Jan 26, 2012, 3:06:42 PM
to rubyonra...@googlegroups.com
OK, I understand (a small number of categories, many more products).

If you find no other way, then caching "has_male_products" etc.
in the Category/Subcategory may be the best remaining option.

I was also thinking, maybe it is acceptable that the set of Categories
with "male_products" etc. is only updated every hour or so. That would
avoid the complexity of real-time updates of that cache column. On the
other hand, an after_save on products is not _that_ difficult.

Just as a test, does your performance improve significantly if you add
a "has_male_products" column on categories and filter on that?

HTH,

Peter