slow eager loading (& fix)

Frederick Cheung

Aug 7, 2007, 1:22:50 PM
to rubyonra...@googlegroups.com
Executive Summary:
================

I've recently come across some performance problems with eager
loading multiple has_many relationships (and, to a lesser extent, a
single has_many with many objects in the collection) and had some
thoughts.

The case I came across involved some models like so:

class Question < ActiveRecord::Base
  has_many :incoming_messages
  has_many :outgoing_messages
end

class IncomingMessage < ActiveRecord::Base
  belongs_to :question
end

class OutgoingMessage < ActiveRecord::Base
  belongs_to :question
end

In various parts of the app we load a question (or multiple ones)
with :include => [:incoming_messages, :outgoing_messages].

Typically a question has a small number of incoming and outgoing
messages (often only 1 or 2) and this all works absolutely fine.
However, at some point we ended up with a question with many incoming
and outgoing messages. Our servers (quite literally) ground to a halt
whenever loading that question with the aforementioned includes, so I
had a look under the hood.

The underlying problem is that in this case Question.find(1, :include
=> [:incoming_messages, :outgoing_messages]) returns a great many rows
(one per combination of incoming and outgoing message), so even
fairly small per-row costs add up very quickly.

I've put together some changes that improve the situation, along with
some numbers.

Numbers:
===========

In my benchmarks I've used 2 instances of Question: one with 150
incoming and 80 outgoing messages (big question) and one with 225
incoming and 120 outgoing (huge question), i.e. 50% more of each, so
the total row count goes up by a factor of 2.25.

require 'benchmark'

Benchmark.bmbm(5) do |x|
  # Single has_many include: one row per incoming message.
  x.report("big question incoming_messages") do
    Question.find_by_id(big_question.id, :include => :incoming_messages)
  end
  # Two has_many includes: one row per (incoming, outgoing) pair.
  x.report("big question all") do
    Question.find_by_id(big_question.id, :include => [:incoming_messages, :outgoing_messages])
  end
  x.report("huge question incoming_messages") do
    Question.find_by_id(huge_question.id, :include => :incoming_messages)
  end
  x.report("huge question all") do
    Question.find_by_id(huge_question.id, :include => [:incoming_messages, :outgoing_messages])
  end
end

Vanilla rails 1.2.3:

                                      user     system      total        real
big question incoming_messages    0.050000   0.000000   0.050000 (  0.052436)
big question all                  6.060000   0.080000   6.140000 (  6.362781)
huge question incoming_messages   0.100000   0.000000   0.100000 (  0.111775)
huge question all                20.960000   0.290000  21.250000 ( 23.186990)


Rails 1.2.3 + my changes:

                                      user     system      total        real
big question incoming_messages    0.010000   0.010000   0.020000 (  0.013589)
big question all                  1.040000   0.070000   1.110000 (  1.325704)
huge question incoming_messages   0.020000   0.000000   0.020000 (  0.019577)
huge question all                 2.310000   0.160000   2.470000 (  2.944555)

Note that in the current code (huge question all)/(big question all)
~= 3.4, i.e. greater than the increase in the size of the data set
(2.25). With my fix that ratio is about the same as the increase in
the size of the data set (i.e. execution time scales linearly with the
data set size).


Morally I think eager loading in such cases is bad: it's silly to get
the db to send us 10000 rows when there are actually only 200 rows of
real content. However (at least for my usage) huge questions are an
insignificant proportion of the overall set of questions (in terms of
their number, not their effect). 99.999% of the time eager loading
saves us time, so I don't really want to throw the baby out with the
bathwater. As long as we cope reasonably in the rare cases when things
go awry I can happily continue eager loading.

The changes:
I've put together a range of things, but by far the biggest
performance boost came from one change: rails checked whether the
collection already contained an object for the row being considered
with collection.target.include?(association), which is a linear scan
of the collection for every row. I've used hashes instead.
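
To illustrate the difference outside of Rails, here is a minimal sketch in plain Ruby. It is not the patched Rails code: the Message struct and both method names are made up for the example, and the row ids simply mimic the shape of the joined result for the big question (each incoming message id repeated once per outgoing message).

```ruby
# Hypothetical stand-in for an instantiated associated record.
Message = Struct.new(:id)

# Dedupe the way the unpatched code does: instantiate a record for every row,
# then scan the target array for it. Array#include? is a linear scan, so the
# total work grows with rows * collection size.
def collect_with_include(row_ids)
  target = []
  row_ids.each do |id|
    message = Message.new(id)
    target.push(message) unless target.include?(message)
  end
  target
end

# Dedupe with a hash keyed on the primary key: one hash lookup per row, and a
# record is only instantiated the first time its id is seen.
def collect_with_hash(row_ids)
  target = []
  seen = {}
  row_ids.each do |id|
    next if seen[id]
    message = Message.new(id)
    target.push(message)
    seen[id] = message
  end
  target
end

# 150 incoming x 80 outgoing: every incoming id shows up in 80 joined rows.
ROW_IDS = (1..150).flat_map { |id| [id] * 80 }

puts collect_with_include(ROW_IDS).size
puts collect_with_hash(ROW_IDS).size
```

Both methods build the same collection; the hash version just avoids re-scanning the array and re-instantiating a record for each duplicate row.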

The tests pass in my environment (minus some unrelated failures that
were already failing before I started changing code, which I think are
down to me using mysql 4.1). My instinct is that this change is
more or less database independent.

If people are interested I will tidy & write this up as a ticket/
patch (at the very least I think it is worth mentioning in the docs
that overzealous eager loading with several large :has_manys can be a
bad thing).

Thanks to anyone who has kept reading this far, thoughts/comments
appreciated
Fred


Index: activerecord/lib/active_record/associations.rb
@@ -1313,12 +1313,19 @@
   end

   def instantiate(rows)
+    @already_loaded_associations_cache = {}
+    @collections = {}
+
+    join_associations_copy = join_associations
     rows.each_with_index do |row, i|
       primary_id = join_base.record_id(row)
       unless @base_records_hash[primary_id]
         @base_records_in_order << (@base_records_hash[primary_id] = join_base.instantiate(row))
       end
-      construct(@base_records_hash[primary_id], @associations, join_associations.dup, row)
+      construct(@base_records_hash[primary_id], primary_id.to_s, @associations, join_associations_copy, row)
     end
     return @base_records_in_order
   end
@@ -1350,42 +1357,56 @@
     end
   end

-  def construct(parent, associations, joins, row)
+  def construct(parent, parent_id, associations, joins, row)
     case associations
       when Symbol, String
-        while (join = joins.shift).reflection.name.to_s != associations.to_s
-          raise ConfigurationError, "Not Enough Associations" if joins.empty?
+        association_name = associations.to_s
+        joins.each do |join|
+          if join.reflection.name.to_s == association_name
+            return construct_association(parent, parent_id, association_name, join, row)
+          end
         end
-        construct_association(parent, join, row)
+        raise ConfigurationError, "Not Enough Associations" if joins.empty?
       when Array
         associations.each do |association|
-          construct(parent, association, joins, row)
+          construct(parent, parent_id, association, joins, row)
         end
       when Hash
         associations.keys.sort{|a,b|a.to_s<=>b.to_s}.each do |name|
-          association = construct_association(parent, joins.shift, row)
-          construct(association, associations[name], joins, row) if association
+          association = construct_association(parent, parent_id, joins.first.reflection.name, joins.first, row)
+          construct(association, association.id.to_s, associations[name], joins[1..-1], row) if association
         end
       else
         raise ConfigurationError, associations.inspect
     end
   end

-  def construct_association(record, join, row)
+  def construct_association(record, record_id, join_reflection_name, join, row)
     case join.reflection.macro
       when :has_many, :has_and_belongs_to_many
-        collection = record.send(join.reflection.name)
-        collection.loaded
+        association_id = row[join.aliased_primary_key]

-        return nil if record.id.to_s != join.parent.record_id(row).to_s or row[join.aliased_primary_key].nil?
-        association = join.instantiate(row)
-        collection.target.push(association) unless collection.target.include?(association)
+        unless collection_target = @collections[record_id + join_reflection_name.to_s]
+          collection = record.send(join_reflection_name)
+          collection.loaded
+          collection_target = collection.target
+          @collections[record_id + join_reflection_name.to_s] = collection_target
+        end
+
+        return nil if record_id != join.parent.record_id(row).to_s or association_id.nil?
+
+        cache = (@already_loaded_associations_cache[collection_target.object_id] ||= {})
+        unless association = cache[association_id]
+          association = join.instantiate(row)
+          collection_target.push(association)
+          cache[association_id] = association
+        end
       when :has_one
-        return if record.id.to_s != join.parent.record_id(row).to_s
+        return if record_id != join.parent.record_id(row).to_s
         association = join.instantiate(row) unless row[join.aliased_primary_key].nil?
         record.send("set_#{join.reflection.name}_target", association)
       when :belongs_to
-        return if record.id.to_s != join.parent.record_id(row).to_s or row[join.aliased_primary_key].nil?
+        return if record_id != join.parent.record_id(row).to_s or row[join.aliased_primary_key].nil?
         association = join.instantiate(row)
         record.send("set_#{join.reflection.name}_target", association)
       else
@@ -1402,6 +1423,7 @@
     @active_record = active_record
     @cached_record = {}
     @table_joins = joins
+    aliased_primary_key
   end

   def aliased_prefix
@@ -1409,7 +1431,7 @@
   end

   def aliased_primary_key
-    "#{ aliased_prefix }_r0"
+    @aliased_primary_key ||= "#{ aliased_prefix }_r0"
   end

   def aliased_table_name
@@ -1431,7 +1453,7 @@
   end

   def record_id(row)
-    row[aliased_primary_key]
+    row[@aliased_primary_key]
   end

   def instantiate(row)
@@ -1455,7 +1477,9 @@
     @aliased_prefix = "t#{ join_dependency.joins.size }"
     @aliased_table_name = table_name #.tr('.', '_') # start with the table name, sub out any .'s
     @parent_table_name = parent.active_record.table_name
-
+    @aliased_primary_key = nil
+    aliased_primary_key
+
     if !parent.table_joins.blank? && parent.table_joins.to_s.downcase =~ %r{join(\s+\w+)?\s+#{aliased_table_name.downcase}\son}
       join_dependency.table_aliases[aliased_table_name] += 1
     end


Jeremy Evans

Aug 7, 2007, 4:55:47 PM
to rubyonra...@googlegroups.com

The main issue you are running into is that Rails' SQL queries for
multiple included has_many associations return the cartesian product
of the has_many associations. Ideally, the best way to handle this is
to send two or three separate SQL statements. You'd have one
statement for each association, and then combine them together. The
most efficient way is probably n+1 queries where n is the number of
has_many associations, with one query to get the information on the
main object, and one query for each has_many association that only
includes the association information and the main object's id (in
order to associate it). That would cut the number of rows
returned for the queries you mention from 12,000 to 231 and from 27,000
to 346. It's more complex than the current implementation, but it
will perform much better. I'm not volunteering to implement it,
though. :)
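
The row counts above can be reproduced with a little arithmetic. The following is only a sketch in plain Ruby (the method and constant names are invented for the example, and no database is involved): a single joined query returns one row per combination of associated records, while n+1 separate queries return one row for the main object plus one row per associated record.

```ruby
# Rows returned by one joined query: the product of the collection sizes.
# An empty collection still yields a row, assuming the eager-loading joins
# are LEFT OUTER JOINs.
def joined_row_count(collection_sizes)
  collection_sizes.inject(1) { |product, size| product * [size, 1].max }
end

# Rows returned by n+1 separate queries: one for the main object plus one
# row per associated record.
def separate_row_count(collection_sizes)
  1 + collection_sizes.inject(0) { |sum, size| sum + size }
end

BIG  = [150, 80]   # incoming, outgoing for the "big question"
HUGE = [225, 120]  # incoming, outgoing for the "huge question"

[BIG, HUGE].each do |sizes|
  puts "joined: #{joined_row_count(sizes)}, separate: #{separate_row_count(sizes)}"
end
```

For the two benchmark questions this gives 12000 vs 231 and 27000 vs 346 rows respectively.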

As a workaround, how about:

question = Question.find_by_id(big_question.id, :include => :incoming_messages)
question.instance_variable_set('@outgoing_messages',
  Question.find_by_id(big_question.id, :include => :outgoing_messages).outgoing_messages)

Also, note that for a single object, you are probably better off using
lazy loading has_many associations (eager loading belongs_to
associations is fine). Eager loading has_many associations should
only be done if you are getting multiple objects at once (i.e. find
:all).

Jeremy

Trevor Squires

Aug 7, 2007, 5:37:25 PM
to rubyonra...@googlegroups.com
Hey,

2 comments from me:

1 - using :include one-level-deep when you are fetching *one*
toplevel object *and* you are not issuing :conditions on
the :included tables is *always* (well, I've never found an
exception) slower.

x = Foo.find(4678, :include => [:incoming_messages, :outgoing_messages])

is slower than:

x = Foo.find(4678)
x.incoming_messages
x.outgoing_messages

If your reaction is to say "but it's always faster with eager
loading" then I urge you to *measure* it and get back to me if you
find that my measurements are wrong.

2 - I've got a plugin that improves your situation where you are
fetching multiple toplevel objects *and* you don't have
any :conditions that relate to the associations you're pulling in.

foos = Foo.find(:all, :hydrate => [:incoming_messages, :outgoing_messages])

It will split that find() into 3 queries, wiring up the relation
targets. I've chosen the explicit :hydrate option to give the user
more control over the strategy for pulling in associations and it
works just fine with Rick O's scope plugin too.
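
For readers wondering what "wiring up the relation targets" might look like, here is a rough sketch in plain Ruby. It is not the plugin's code (which has not been published in this thread): the in-memory arrays stand in for the results of three separate queries, and hydrate and wire_up are hypothetical names.

```ruby
# Hypothetical in-memory "tables"; in a real app each would be one SQL query.
FOOS     = [{ :id => 1 }, { :id => 2 }]
INCOMING = [{ :id => 10, :foo_id => 1 }, { :id => 11, :foo_id => 1 }, { :id => 12, :foo_id => 2 }]
OUTGOING = [{ :id => 20, :foo_id => 2 }]

# Attach each child row to its parent by grouping on the foreign key, so the
# work is proportional to parents + children rather than to their product.
def wire_up(parents, children, name)
  by_parent = children.group_by { |child| child[:foo_id] }
  parents.each { |parent| parent[name] = by_parent.fetch(parent[:id], []) }
end

def hydrate(parents, associations)
  parents = parents.map { |parent| parent.dup }  # "query 1": the parents
  associations.each do |name, rows|              # one further "query" per association
    wire_up(parents, rows, name)
  end
  parents
end

HYDRATED = hydrate(FOOS, :incoming_messages => INCOMING, :outgoing_messages => OUTGOING)
HYDRATED.each do |foo|
  puts "foo #{foo[:id]}: #{foo[:incoming_messages].size} incoming, #{foo[:outgoing_messages].size} outgoing"
end
```

Parents without any matching children end up with an empty collection rather than nil, which mirrors what a loaded but empty association would look like.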

I've been sitting on the plugin since railsconf *last* year and never
released it because I had so many doubts about whether the strategy
was solid enough. I've used it enough times that I'm confident it
works. I'll be releasing it for public consumption in the next
couple of weeks.

Trev

Rick Olson

Aug 7, 2007, 5:59:01 PM
to rubyonra...@googlegroups.com
On 8/7/07, Trevor Squires <tre...@protocool.com> wrote:
>
> Hey,
>
> 2 comments from me:
>
> 1 - using :include one-level-deep when you are fetching *one*
> toplevel object *and* you are not issuing :conditions on
> the :included tables is *always* (well, I've never found an
> exception) slower.
>
> x = Foo.find(4678, :include => [:incoming_messages, :outgoing_messages])
>
> is slower than:
>
> x = Foo.find(4678)
> x.incoming_messages
> x.outgoing_messages
>
> If your reaction is to say "but it's always faster with eager
> loading" then I urge you to *measure* it and get back to me if you
> find that my measurements are wrong.

I agree. Eager including is only really beneficial if you're doing
lots of queries looping through the messages:

# or use paginate in the lovely will_paginate plugin
@incoming_messages = @foo.incoming_messages.find(:all, :include => :author)

Doing your eager include here should be faster than doing a query on
each row for each message author.

Personally I've stopped using all eager includes in favor of the new
ActiveRecord connection caching and my own active_record_context
plugin (currently in use in Lighthouse):

http://activereload.net/2007/5/23/spend-less-time-in-the-database-and-more-time-outdoors

> 2 - I've got a plugin that improves your situation where you are
> fetching multiple toplevel objects *and* you don't have
> any :conditions that relate to the associations you're pulling in.
>
> foos = Foo.find(:all, :hydrate =>
> [:incoming_messages, :outgoing_messages])
>
> It will split that find() into 3 queries, wiring up the relation
> targets. I've chosen the explicit :hydrate option to give the user
> more control over the strategy for pulling in associations and it
> works just fine with Rick O's scope plugin too.
>
> I've been sitting on the plugin since railsconf *last* year and never
> released it because I had so many doubts about whether the strategy
> was solid enough. I've used it enough times that I'm confident it
> works. I'll be releasing it for public consumption in the next
> couple of weeks.

Oh, I think Coda Hale had a similar plugin too. You guys should
totally team up!

--
Rick Olson
http://lighthouseapp.com
http://weblog.techno-weenie.net
http://mephistoblog.com

Frederick Cheung

Aug 8, 2007, 4:30:53 AM
to rubyonra...@googlegroups.com

On 7 Aug 2007, at 22:37, Trevor Squires wrote:

>
> Hey,
>
> 2 comments from me:
>
> 1 - using :include one-level-deep when you are fetching *one*
> toplevel object *and* you are not issuing :conditions on
> the :included tables is *always* (well, I've never found an
> exception) slower.
>
> x = Foo.find(4678, :include =>
> [:incoming_messages, :outgoing_messages])
>
> is slower than:
>
> x = Foo.find(4678)
> x.incoming_messages
> x.outgoing_messages
>
> If your reaction is to say "but it's always faster with eager
> loading" then I urge you to *measure* it and get back to me if you
> find that my measurements are wrong.
>

Ah yes, you are right. It's not a huge difference but it's definitely
there.

> 2 - I've got a plugin that improves your situation where you are
> fetching multiple toplevel objects *and* you don't have
> any :conditions that relate to the associations you're pulling in.
>
> foos = Foo.find(:all, :hydrate =>
> [:incoming_messages, :outgoing_messages])
>
> It will split that find() into 3 queries, wiring up the relation
> targets. I've chosen the explicit :hydrate option to give the user
> more control over the strategy for pulling in associations and it
> works just fine with Rick O's scope plugin too.
>
> I've been sitting on the plugin since railsconf *last* year and never
> released it because I had so many doubts about whether the strategy
> was solid enough. I've used it enough times that I'm confident it
> works. I'll be releasing it for public consumption in the next
> couple of weeks.
>

Very interesting, I look forward to seeing it!

Fred


Frederick Cheung

Aug 8, 2007, 4:37:04 AM
to rubyonra...@googlegroups.com

On 7 Aug 2007, at 21:55, Jeremy Evans wrote:

>
>
> The main issue you are running into is that Rails' SQL queries for
> multiple included has_many associations return the cartesian product
> of the has_many associations. Ideally, the best way to handle this is
> to send two or three separate SQL statements. You'd have one
> statement for each association, and then combine them together. The
> most efficient way is probably n+1 queries where n is the number of
> has_many associations, with one query to get the information on the
> main object, and one query for each has_many association that only
> includes the association information and the main object's id (in
> order to associate it). That would cut the number of rows
> returned for the queries you mention from 12,000 to 231 and from 27,000
> to 346. It's more complex than the current implementation, but it
> will perform much better. I'm not volunteering to implement it,
> though. :)
>

That would be the ideal


> As a workaround, how about:
>
> question = Question.find_by_id(big_question.id, :include
> => :incoming_messages)
> question.instance_variable_set('@outgoing_messages',
> Question.find_by_id(big_question.id, :include =>
> :outgoing_messages).outgoing_messages)

A bit fiddly for me. For now I'm going to junk one of the eager loads
(I do have a frequent use case where I display a list of
questions). Forgoing some eager loading seems to be the way forward
for now.

>
> Also, note that for a single object, you are probably better off using
> lazy loading has_many associations (eager loading belongs_to
> associations is fine). Eager loading has_many associations should
> only be done if you are getting multiple objects at once (i.e. find
> :all).

Yes, I'm coming round to that view

Frederick Cheung

Aug 8, 2007, 4:48:41 AM
to rubyonra...@googlegroups.com

On 7 Aug 2007, at 22:59, Rick Olson wrote:

>
> Personally I've stopped using all eager includes in favor of the new
> ActiveRecord connection caching and my own active_record_context
> plugin (currently in use in Lighthouse):
>
> http://activereload.net/2007/5/23/spend-less-time-in-the-database-
> and-more-time-outdoors
>

Very awesome! I'll definitely be having a look at that one!

Fred

Gabe da Silveira

Aug 8, 2007, 4:40:30 PM
to rubyonra...@googlegroups.com
While we're on the topic, another place where eager loading falls down is on tables with large fields that aren't needed. Normally you could :select only what you need, but that doesn't work with eager loading, where the :select option is ignored.

I've run into a few other cases where I needed eager loading but for one reason or another it resulted in poor performance or had other limitations.  The more I think about it, the more I think it would be really cool to be able to "manually" eager load with :select and :joins options (or possibly even find_by_sql).  

The ability to instantiate multiple model types from a single query in a more flexible way than vanilla eager-loading would be quite useful.  The tricky part would be coming up with a clean interface.  Anyway, just food for thought.

Michael Koziarski

Aug 8, 2007, 6:21:04 PM
to rubyonra...@googlegroups.com
> The ability to instantiate multiple model types from a single query in a
> more flexible way than vanilla eager-loading would be quite useful. The
> tricky part would be coming up with a clean interface. Anyway, just food
> for thought.

Yeah, we've discussed this in the past and all we're waiting on is
someone to come up with a nice interface and a wicked patch. Being
able to construct a graph of objects from the results of a SQL query
would be a nice feature for some people; we just haven't found someone
who needs it badly enough to warrant spending time on investigating a
solution :)

--
Cheers

Koz

dasil003

Aug 10, 2007, 8:50:42 PM
to Ruby on Rails: Core
I'm interested in tackling this one. I've been doing some pretty
heavy stuff with dynamically built finds, so I have a lot of ideas.

I did a writeup at http://darwinweb.net/article/Free_Form_Manual_Eager_Loading
to solicit some input. I need to fix my commenting system because
it's kind of confusing. For anyone commenting: make sure that you
submit twice, as the first time is only previewing your comment.

Mark Reginald James

Aug 11, 2007, 2:42:02 PM
to rubyonra...@googlegroups.com
Trevor Squires wrote:

> 1 - using :include one-level-deep when you are fetching *one*
> toplevel object *and* you are not issuing :conditions on
> the :included tables is *always* (well, I've never found an
> exception) slower.
>
> x = Foo.find(4678, :include => [:incoming_messages, :outgoing_messages])
>
> is slower than:
>
> x = Foo.find(4678)
> x.incoming_messages
> x.outgoing_messages
>
> If your reaction is to say "but it's always faster with eager
> loading" then I urge you to *measure* it and get back to me if you
> find that my measurements are wrong.

I'm surprised the effect of the extra data and data processing
that comes with use of :include so clearly trumps the chained
latency of the extra database calls.

Do you think you would get the same result when:
1. The database is on another server,
2. :select is used to restrict the size of base model data returned, and
3. Hashes are used to speed matching in the construction of the object
hierarchy, like in Fred's patch?

Jeremy's n+1 solution obviously becomes more attractive as the include
chain gets longer and the records larger. But there would still be a
switchover point that's a function of database comms latency. Such a
solution would really be aided by sending the database commands in parallel.
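
One way to picture that switchover point is with a toy cost model. Everything here is an assumption made for illustration, not a measurement: the model charges a fixed latency per round trip plus a fixed cost per returned row, and the per-row cost is a made-up number. The row counts are the ones quoted earlier in the thread for the big question.

```ruby
# Toy cost model (an assumption, not a measurement):
#   time = round_trips * latency + rows * per_row_cost
def query_time(round_trips, rows, latency, per_row_cost)
  round_trips * latency + rows * per_row_cost
end

# Latency at which one joined query and n+1 separate queries cost the same.
# Solving  latency + joined_rows * c == (n + 1) * latency + separate_rows * c
# gives    latency == c * (joined_rows - separate_rows) / n
def break_even_latency(joined_rows, separate_rows, extra_queries, per_row_cost)
  per_row_cost * (joined_rows - separate_rows) / extra_queries.to_f
end

PER_ROW    = 0.0001 # hypothetical seconds spent per returned row
BREAK_EVEN = break_even_latency(12_000, 231, 2, PER_ROW)

puts "separate queries win while latency is below #{BREAK_EVEN} seconds"
```

Under this model, separate queries are faster whenever the round-trip latency is below the break-even value, and the single joined query only wins once latency dominates the per-row processing cost.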

--
We develop, watch us RoR, in numbers too big to ignore.

Gabe da Silveira

Aug 11, 2007, 5:05:11 PM
to rubyonra...@googlegroups.com
On 8/11/07, Mark Reginald James <m...@bigpond.net.au> wrote:

   2. :select is used to restrict the size of base model data returned, and

Unfortunately eager loading ignores the select option.

Mark Reginald James

Aug 11, 2007, 5:23:57 PM
to rubyonra...@googlegroups.com
Gabe da Silveira wrote:

> On 8/11/07, *Mark Reginald James* <m...@bigpond.net.au> wrote:
> 2. :select is used to restrict the size of base model data returned
>
> Unfortunately eager loading ignores the select option.

There's a plugin for this now:

http://rubyforge.org/projects/include-select/
