Using ActiveRecord with large result sets?

Pat Shaughnessy

unread,

Aug 18, 2010, 5:27:50 PM8/18/10

to Boston Ruby Group

Hey everyone - at my day job I need to query a large number of records
(25000 - 50000) using ActiveRecord (with Oracle) and return them all
in a single HTTP request in the form of an Excel spreadsheet. I have
an existing app I need to refactor quickly, so switching to data
mapper, some other ORM or to a NoSQL database is not an option. My
report needs to join across 5-10 different tables.

Anyone have experience or suggestions for this sort of thing?

This seems like an really interesting and neglected use case for
ActiveRecord: handling large result sets efficiently in terms of both
time and memory. I’m guessing that most Rails sites probably use AR
with pagination (e.g. will_paginate) almost all the time, since such a
large dataset wouldn’t fit or be readable on the web anyway. But it
would be nice to use AR for reporting situations also.

Right now I’m considering these approaches:

1. Using connection.execute with a hard-coded SQL statement, and
processing the results as an array of hashes, one for each record.
This is probably the fastest option since hashes are fast in Ruby, but
I would lose the flexibilty/power of named scopes or searchlogic since
I would have to write the SQL by hand.

2. Using find :all (or named scopes/searchlogic) as usual, but hacking
AR somehow to return an array of hashes, just like connection.execute
would. This would provide me with the flexibility of AR, but with the
performance benefits of using hashes instead of full model objects.

3. Using find_each, looping through the results and streaming out the
response gradually. This would control memory usage at least, but
might still be slow.

4. As a last resort generate the report offline and email it to the
user later.

- pat
http://patshaughnessy.net

Neil Cook

unread,

Aug 18, 2010, 5:48:08 PM8/18/10

to boston-r...@googlegroups.com

Pat,

When using option 2, use :select to just return the fields you need. This should speed up serialization/deserialization at least and also might speed up your query.

/Neil

--
You received this message because you are subscribed to the Boston Ruby Group mailing list
To post to this group, send email to boston-r...@googlegroups.com
To unsubscribe from this group, send email to boston-rubygro...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/boston-rubygroup

Brian Cardarella

unread,

Aug 18, 2010, 6:08:06 PM8/18/10

to boston-r...@googlegroups.com

Pat,

I would look into a ActiveRecord::Base.connection.execute call.
This will return the raw result in an array format. The problem that
you're probably running into with a normal ActiveRecord::Base.find is
that it instantizes each record and that is a lt of memory. So with
the numbers you're talking about you might start swapping which is
going to bring everything to a crawl.

- Brian

Scott Schulthess

unread,

Aug 18, 2010, 6:30:33 PM8/18/10

to boston-r...@googlegroups.com

Look at connection.select_values, it just makes activerecord do just a straight select, minus the deserializing into objects and whatnot, thus faster

Sent from my iPhone

Robert Thau

unread,

Aug 18, 2010, 8:50:05 PM8/18/10

to boston-r...@googlegroups.com

> I would look into a ActiveRecord::Base.connection.execute call.
> This will return the raw result in an array format. The problem that
> you're probably running into with a normal ActiveRecord::Base.find is
> that it instantizes each record and that is a lt of memory. So with
> the numbers you're talking about you might start swapping which is
> going to bring everything to a crawl.

Well, that's what find_each is meant to take care of --- it
instantiates just a few records at a time, so you don't have to fit
the whole set in the memory of a single process. Though, of course,
you're still out the CPU time for constructing all the objects,
and the time spent later when the GC is mopping them up. Just calling
AR::Base.initialize 50,000 times (or more, if objects from multiple
tables are involved) can add up.

(Then again, if the query is using :include to try to instantiate
objects from multiple tables in a single fetch, it might be worth
doublechecking in the debug log to make sure that :include and
find_each are playing well together. If you want a 50,000 row
request to finish in ten seconds, you've got 200 microseconds per
row; an extra SQL query per row to find the record at the end of
some belongs_to is probably not going to work.)

One last thought: if you aren't completely comfortable writing the
query in raw SQL, it might be worth writing a quickie prototype in
pure AR, for sanity-checking on the spec, and to verify that the
raw SQL is doing what you want....

rst

Josh Nichols

unread,

Aug 19, 2010, 2:01:05 AM8/19/10

to boston-r...@googlegroups.com

Hi Pat,

It's a tough problem, and your biggest enemies are going to be memory usage spiking and slow response times. Your app server workers (passenger, mongrel, etc) are pretty likely to start leaking memory at this point, and one worker is going to be tied up... and when that happens, it'll affect production's availability, possibly even crashing the entire server in the case of running out of memory.

Investigating with 1, 2, or 3 could definitely help alleviate the extent of the memory use and response time. I think it'd be a time sync though, especially if you're looking to turn this around quickly.

So, instead of offline generating of reports and emailing it, I'd suggest looking into something like delayed_job or resque to asynchronously building it up. I could imagine the work flow going like:

* request the report

* has the report been generated recently?

* if yes, serve up the recently generated report

* if no...

* make a new job to generate the report, and show a 'waiting' page that checks back periodically to see if it's ready

* also check if there's a job already in progress, and just wait on that if that's the case

There are definitely more moving parts in this approach, but I believe it'll significantly reduce the impact of viewing/generating these reports on the rest of your application's production availability and performance.

- Josh

On Wed, Aug 18, 2010 at 5:27 PM, Pat Shaughnessy <p...@patshaughnessy.net> wrote:

--

James K

unread,

Aug 19, 2010, 7:08:40 AM8/19/10

to Boston Ruby Group

Pat,

It's possible to generate SQL from a scope chain, then feed that SQL
to select_values. It's a wee bit of a hack, but if you call
construct_finder_sql on a scope, out pops the SQL.

>> User.verified.male.send(:construct_finder_sql, {})
=> "SELECT * FROM \"users\" WHERE ((\"users\".\"verified\" = 't' AND
\"users\".\"sex\" = E'M')) "

So you can leverage your existing scopes/searchlogic), then pass the
sql to select_values for loading much faster than hydrating models.
That should be many times faster. This works in rails 2.3, but being a
private method, prone to moving somewhere else in the future, but as
long as AR knows how to generate SQL, there should be a way to make it
give the SQL to you :)

-James

Keenan Brock

unread,

Aug 19, 2010, 7:40:43 AM8/19/10

to boston-r...@googlegroups.com

Pat,

+1 select_values

Great for memory usage / reduced object creation.

Sooner or later you will have to use resque or delayed_job as your mongrel will time out.

--Keenan

Pat Shaughnessy

unread,

Aug 19, 2010, 5:05:21 PM8/19/10

to boston-r...@googlegroups.com

Wow - thanks everyone for all the great suggestions!

FYI if anyone is interested I got a prototype working today that uses
ActiveRecord::Batches.find_in_batches to loop through the large result
set in chunks. Then by using render :text with a proc I was able to
have the data streamed out to the client gradually... so the browser
displays a “downloading file” progress bar to the user.

The key to all of this was that the user’s expectations were lowered
by seeing the downloading file dialog box... they were willing to wait
for the large file to download. So the performance is still slow, but
the perceived performance was fast since the download dialog box
appeared right away. After that point the user is willing to wait for
a large file.

Also find_in_batches prevents Ruby from running out of memory.

Josh - I agree having an offline process generate the reports is
ideal, but was too much coding for me given my deadlines. And my app
is running in a low-usage intranet environment, so it’s probably ok
for now to have a Passenger server process tied up looping through the
SQL results once in a while. I'll have to look into doing this later.

Brian, Keenan - it turned out most of the time was spent in the DB, so
refactoring the code to use select_values or something similar
wouldn't have been worthwhile for me in this case.

thanks again! - pat

>> ruby...@googlegroups.com

Andrew Kuklewicz

unread,

Aug 19, 2010, 5:20:26 PM8/19/10

to boston-r...@googlegroups.com

We do offline generation of a star-schema and roll-up tables for fast report (and graph) generation without having to join 4+ tables at request time.

I use active warehouse for this, with some custom code (I should really submit as patches) for the rollup tables.

We do this on millions of rows of data, and it works great - 50K would be wicked fast both to generate the tables and to process.

Only downside is it is not real-time data - only as good as when you last updated the data, but you could do that hourly or something - with such a small record count, it would not take long.

Cheers,

Andrew Kuklewicz

Hongli Lai

unread,

Aug 29, 2010, 1:05:47 PM8/29/10

to Boston Ruby Group

Maybe you're looking for this? http://github.com/FooBarWidget/streaming_find
I've had this code lying around since 2008 but I figure now's a good
time to release it.

Pat Shaughnessy

unread,

Aug 30, 2010, 9:56:24 AM8/30/10

to boston-r...@googlegroups.com

Thanks for sharing this, Hongli! This does look very useful since my
SQL query is slow and find_in_batches does just repeat it once for
each batch. So streaming solves the memory usage and timeout problems,
but now I will need to tuned the batch size carefully to avoid making
too many SQL calls. Your code might avoid that problem...

Sadly, at my day job I need to use Oracle... To use your plugin with
Oracle would I have to extend the Oracle Enhanced Adapter similar to
what you are doing in mysql_adapter_extensions.rb for the MySQL
adapter? Is there anything else I’ll need to do? Is there any reason
you can think of why this wouldn’t work with Oracle?

thanks!! - pat
http://patshaughnessy.net/

Hongli Lai

unread,

Aug 30, 2010, 12:09:29 PM8/30/10

to boston-r...@googlegroups.com

On Mon, Aug 30, 2010 at 3:56 PM, Pat Shaughnessy <p...@patshaughnessy.net> wrote:
> Thanks for sharing this, Hongli! This does look very useful since my SQL
> query is slow and find_in_batches does just repeat it once for each batch.
> So streaming solves the memory usage and timeout problems, but now I will
> need to tuned the batch size carefully to avoid making too many SQL calls.
> Your code might avoid that problem...
>
> Sadly, at my day job I need to use Oracle... To use your plugin with Oracle
> would I have to extend the Oracle Enhanced Adapter similar to what you are
> doing in mysql_adapter_extensions.rb for the MySQL adapter? Is there
> anything else I’ll need to do? Is there any reason you can think of why this
> wouldn’t work with Oracle?

Yes I think all you need to do is to extend select_all to accept a
block, like what I did for the MySQL adapter.

--
Phusion | The Computer Science Company

Web: http://www.phusion.nl/
E-mail: in...@phusion.nl
Chamber of commerce no: 08173483 (The Netherlands)

Reply all

Reply to author

Forward