Export very large select statements to file

17 views
Skip to first unread message

Mike Barton

unread,
Oct 2, 2008, 6:11:48 AM10/2/08
to NWRUG
Hi,

I'm having trouble getting data out of my database through
activerecord when the amount data returned by the sql is very large. I
could use plain mysql at the command line, but would prefer to keep it
in Ruby.

This gist shows the helper I'm using at the moment, works for datasets
in the tens of thousands, but not in the millions. The error message I
get is that activerecord timed out.

http://gist.github.com/14333

Any suggestions would be great.

Thanks

Mike

Francis Fish

unread,
Oct 2, 2008, 6:28:42 AM10/2/08
to nwrug-...@googlegroups.com
No time to respond in detail

Break data into chunks (order by ID or something like that)

something like this (untested - set limit to something bigger than 1000 - but this also stops you eating all the memory)

def output_from_sql(sql_file,target_file)
 
  raise ArgumentError, "SQL file not found : #{sql_file}" unless File.exists?(sql_file)
 
  keys = data.first.keys
  FasterCSV.open(target_file,'w') do |csv|
    csv << keys
    last_key = 0
    while true
      data = ActiveRecord::Base.connection.select_all(File.open(sql_file).read,:order => 'id',:limit => 2000, :conditions => ['id > ?',last_key])
      data.each do |row|
        csv << keys.map {|key| row[key]}
      end
      break if data.empty?
      last_key = data.last.id
    end
  end
end
--
Thanks and regards,

Francis Fish

John Leach

unread,
Oct 2, 2008, 6:38:02 AM10/2/08
to nwrug-...@googlegroups.com
On Thu, 2008-10-02 at 03:11 -0700, Mike Barton wrote:
> I'm having trouble getting data out of my database through
> activerecord when the amount data returned by the sql is very large.
<snip>
> http://gist.github.com/14333
>

Hi Mike,

All kinds of things could be going wrong here. You really should be
grabbing these in batches from the database, rather than all at once.

Use the limit option to grab a thousand at a time or something. If
you're using an ACID database, do the whole thing in a transaction to
ensure records inserting mid batch don't mess things up.

Here is an extract from a rake task I wrote to do batched conversion of
Wordpress comments. I'm using some Rails models, but you can see what
is going on:

http://gist.github.com/14336

I'm sure there is a better way to do this tbh, but it works ok.

Really though, if it's not a complex query you're running and it's a
very large dataset, then consider using mysqldump - it'll be faster and
use less ram than Ruby, and I'm sure it supports csv output.

John.

--
http://www.brightbox.co.uk - UK Ruby on Rails hosting
http://johnleach.co.uk

Carl Drinkwater | 29degrees

unread,
Oct 2, 2008, 6:41:42 AM10/2/08
to nwrug-...@googlegroups.com
Or simpler, IMHO at least, use the MySQL adapter directly. Again,
untested though :


csv = FasterCSV.open( target_file, "w" )
res = ActiveRecord::Base.connection.execute( your_sql )

# Header
csv << res.fetch_fields.map { |f| f.name }

# Results
res.each do |row_data|
csv << row_data
end

csv.close


This is probably specific to the MySQL adapter, although I'm sure
there will be similar methods to fetch_fields in the other adapters.

Regards,
Carl.

--
Carl Drinkwater
Director,
29degrees Ltd

Bespoke Web Application Development

t: 0161 953 6669
m: 07515 351 589
f: 07092 217 531
w: http://29degrees.co.uk

-----------------------------------------------
Office: 23 New Mount Street, Manchester, M4 4DE

Registered Address: St. George's House, 215/219
Chester Road, Manchester, M14 4JE.

Company Number 05952987 | VAT Number 897293655
-----------------------------------------------

John Leach

unread,
Oct 2, 2008, 7:16:55 AM10/2/08
to nwrug-...@googlegroups.com
On Thu, 2008-10-02 at 11:41 +0100, Carl Drinkwater | 29degrees wrote:
> Or simpler, IMHO at least, use the MySQL adapter directly. Again,
> untested though :
>
> csv = FasterCSV.open( target_file, "w" )
> res = ActiveRecord::Base.connection.execute( your_sql )
>
> # Header
> csv << res.fetch_fields.map { |f| f.name }
>
> # Results
> res.each do |row_data|
> csv << row_data
> end
>
> csv.close
>

Nice - that's a very tidy solution. I'm assuming the each method on res
fetches the rows one at a time from the MySQL server, rather than
grabbing them all at once.

Carl Drinkwater | 29degrees

unread,
Oct 2, 2008, 7:19:26 AM10/2/08
to nwrug-...@googlegroups.com
Hi John,

> Nice - that's a very tidy solution. I'm assuming the each method on
> res
> fetches the rows one at a time from the MySQL server, rather than
> grabbing them all at once.

Yeah, exactly.

Regards,
Carl.

John Leach

unread,
Oct 2, 2008, 7:29:28 AM10/2/08
to nwrug-...@googlegroups.com
Oh, and I meant to mention, in comparison to my proposed solution, this
one executes one query whereas mine executes (total number of rows /
100) queries.

What's bad with mine is that MySQL does the query on the whole table on
every loop and then just sends you the batch you asked for. With a big
table, this is really bad.

Anyone know a good way of having ActiveRecord models do this kind of
execute and fetch on each, rather than get the whole lot at the start?
It's clearly preferable sometimes.

John.

Carl Drinkwater | 29degrees

unread,
Oct 2, 2008, 7:34:46 AM10/2/08
to nwrug-...@googlegroups.com
Hi John,

> Anyone know a good way of having ActiveRecord models do this kind of
> execute and fetch on each, rather than get the whole lot at the start?
> It's clearly preferable sometimes.

ActiveRecord doesn't support it, so you have to write your own methods
in your models in cases where you need it :


class Foo < ActiveRecord::Base
def many_rows( &block )
ActiveRecord::Base.connection.execute( sql ).each do |row|
yield row
end
end
end

Foo.new.many_rows do |r|
puts r.inspect
end


Note that ActiveRecord::Base.connection.execute returns a
MySQL::Result object, and MySQL::Result#each yields just an array -
You don't get the column names. That said, there might be a
#each_as_hash or something - I've not checked.

Using the method above, you're side-stepping ActiveRecord completely.
Sometimes that's necessary though, and you shouldn't feel dirty by
doing so :)

Regards,
Carl.

Mike Barton

unread,
Oct 2, 2008, 7:37:32 AM10/2/08
to NWRUG
Thanks for all your replies, they were very insightful.
The reason I was using SQL is because I didn't want all the active
record models loaded into the memory at once. I hadn't thought of
breaking the data into chunks using offsets, which would solve this
problem, I mean I could use pure Ruby. I refactored your solution John
to make something that could be added to any AR model

http://gist.github.com/14338

Which could then be used as

Customer.each_in_chunks_of{ |cust| # Do something with cust }

This doesn't allow for code like Comments.approved as in your example
though.

Mike Barton

unread,
Oct 2, 2008, 7:38:13 AM10/2/08
to NWRUG
Sorry that should be
Customer.each_in_chunks_of(100) { |cust| # Do something with cust }

Mike Barton

unread,
Oct 2, 2008, 7:54:19 AM10/2/08
to NWRUG
I've reinvented the wheel, there's a discussion of 'paginating'
database queries here
http://weblog.jamisbuck.org/2007/4/6/faking-cursors-in-activerecord
Reply all
Reply to author
Forward
0 new messages