Bulk load 70GB csv to database using R

TMOD

Jan 31, 2014, 5:37:24 AM
to manip...@googlegroups.com
If I'm doing analysis that involves CSV files too large for memory, say 70GB each, is there an easy way to bulk load them from R into a database of my choosing, to be used with dplyr?

I'm picturing something very similar to dbWriteTable(), except that it takes a filename instead of a data.frame as an argument. Like dbWriteTable(), it would automatically figure out the correct column types based on a sample of the CSV.

Ideally it'd use something like SQL*Loader (Oracle), bcp (MS SQL Server), etc. in the background to make the loads as fast as possible.

Could something like that be a good fit to add to dplyr's copy_to()?
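
To make it concrete, here is a rough sketch of the kind of wrapper I mean (everything below is hypothetical: it only covers the Postgres case, assumes the target table already exists with matching columns, and assumes psql is on the PATH):

# Hypothetical sketch: bulk load a CSV by shelling out to the database's
# native loader, here PostgreSQL's \copy via the psql client.
db_bulk_load_postgres <- function(dbname, table, csv_path) {
  copy_cmd <- sprintf("\\copy %s FROM %s WITH (FORMAT csv, HEADER)",
                      table, shQuote(csv_path))
  status <- system2("psql", args = c("-d", dbname, "-c", shQuote(copy_cmd)))
  if (status != 0) stop("bulk load failed with exit status ", status)
  invisible(TRUE)
}

# db_bulk_load_postgres("analysis", "big_table", "huge_70gb_file.csv")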






Hadley Wickham

Feb 2, 2014, 10:21:58 PM
to TMOD, manipulatr
Not currently.

I have thought a little about an on-disk data source that worked much
like the SQL sources, delaying all operations until asked for,
potentially making it possible to work with larger-than-memory files.
However, doing that sort of compilation is somewhat tricky, and it's
currently relatively low priority.

But maybe you don't need to go that far for it to be useful. A standard
interface to bulk ingest that worked across multiple databases would
be very nice.
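
Something along those lines, perhaps (purely a sketch, not an existing API: the generic name is made up, it leans on DBI with the RPostgres and RMariaDB connection classes, the Postgres COPY assumes the file is readable by the database server, and the MySQL path assumes LOCAL INFILE is enabled):

library(DBI)

# Sketch of a cross-database bulk-ingest generic. Both methods assume the
# target table already exists with columns matching the CSV.
db_bulk_load <- function(con, table, path, ...) UseMethod("db_bulk_load")

# PostgreSQL (RPostgres): server-side COPY, so `path` must be readable by
# the database server itself.
db_bulk_load.PqConnection <- function(con, table, path, ...) {
  sql <- paste0("COPY ", dbQuoteIdentifier(con, table),
                " FROM ", dbQuoteString(con, path),
                " WITH (FORMAT csv, HEADER)")
  dbExecute(con, sql)
}

# MySQL/MariaDB (RMariaDB): client-side load; needs LOCAL INFILE enabled.
db_bulk_load.MariaDBConnection <- function(con, table, path, ...) {
  sql <- paste0("LOAD DATA LOCAL INFILE ", dbQuoteString(con, path),
                " INTO TABLE ", dbQuoteIdentifier(con, table),
                " FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'",
                " IGNORE 1 LINES")
  dbExecute(con, sql)
}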

Hadley



--
http://had.co.nz/

Sean O'Riordain

Feb 3, 2014, 9:22:35 AM
to Hadley Wickham, TMOD, manipulatr
This might be a completely off-the-wall topic here, but it might be worth considering (at some future point) persuading R to play nicely with something like Talend, a well-known open-source ETL tool, for work that is essentially ETL with a stats twist. I'm guessing it would probably be R in Talend rather than Talend in R.

Sean

nicolas paris

May 28, 2014, 6:15:08 PM
to manip...@googlegroups.com
Hello,

This might be a completely off-the-wall topic here too. I am testing R in
Talend (thanks to JRI and the tExecuteRscript component,
http://gabrielebaldassarre.com/talend/texecuterscript/) and I agree, this
R/Talend symbiosis is nice.

Talend works as a high-level platform to extract/transform/load data, while
R with JRI adds the powerful data/stats capabilities.

I still think that transforming data within R is better than doing it in
Talend, but I have quickly been impressed with Talend's extract and load
features. For a 70GB CSV, I would use a txxBulkLoad component
(http://www.vikramtakkar.com/2013/07/improve-load-performance-with-bulk-load.html),
which is designed for exactly that.

Both are open source, and Talend's learning curve seems to be short.


Christopher Peters

May 28, 2014, 6:41:12 PM
to manip...@googlegroups.com
I'd be interested in a general bulk-load function that works with both MySQL and Postgres. I'm thinking about using dplyr for ETL operations.

Christopher Peters



Sean O'Riordain

May 29, 2014, 3:25:46 AM
to Christopher Peters, manipulatr
Christopher,
For plain db to db, I'd strongly recommend using a dedicated ETL tool like Talend - this is what they are designed for! :-)

There is a learning curve at the start, but once it's overcome (there are plenty of tutorials on YouTube), the GUI makes putting together a workflow like this quite quick and easy. Some years ago I did some MySQL (and Oracle) to Postgres work and Talend was definitely the easiest option - the runtime is surprisingly quick too, even in non-bulk mode. The early releases of Talend were quite buggy, but they have been improving.

Kind regards,
Sean

Christopher Peters

May 29, 2014, 10:15:38 AM
to Sean O'Riordain, manipulatr

The reason I'd like to use dplyr is that it's already in my workflow.

Suppose I build a Shiny dashboard that's not very performant. I could cron the queries and save the results as RData for Shiny to pick up. Or I can just copy_to() the results into my database, hence ETL. What I find is that these tables are useful in and of themselves: the data is ETL'd in a way I actually use.
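
For example, roughly this (a sketch only; the connection, table, and column names are all made up):

library(dplyr)

# Summarise raw events once, then persist the small result table in the
# database so the Shiny app can read the pre-aggregated version.
con <- src_postgres(dbname = "analytics")

daily_summary <- tbl(con, "events") %>%
  group_by(day, channel) %>%
  summarise(visits = n(), revenue = sum(amount)) %>%
  collect()

copy_to(con, daily_summary, name = "daily_summary",
        temporary = FALSE, indexes = list("day"))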

dplyr already does this, so I'm just throwing out a vote for more features in this area.

Hadley Wickham

Jun 6, 2014, 11:11:21 AM
to Christopher Peters, Sean O'Riordain, manipulatr
R's db packages definitely need more facilities in this area - most db
APIs provide some function for efficiently loading large quantities of
data, but they're not currently well exposed in R.

Hadley
http://had.co.nz/