convert from data.table into a "distributed data frame" stored on disk

Enzo

Jul 6, 2015, 4:51:25 PM
to tesser...@googlegroups.com
I'm working on two sets of data, each made up of >1,700 CSV files.

I consider that too many files to read with read.csv (and, I suspect, with drRead.csv as well).

Currently, on my MacBook Pro, it takes ~69 s to read the data with fread and create a 2.4 GB data.table (this is the smaller of the two datasets: the second is >5 GB and takes proportionally longer to read).

Of course, these datasets are large enough that I would like to use disk storage (at least in the first instance), probably with 3 (or 4?) cores.

Is it possible to convert a data.table as above into a "distributed data frame"  stored on disk?

How?

Ryan Hafen

Jul 6, 2015, 5:11:52 PM
to Enzo, tesser...@googlegroups.com
Good question.  Assuming there is enough memory, fread will certainly be much faster than drRead.csv.  The way you can convert this in-memory data.table to a local disk ddf would be the following.  

library(datadr)

# supposing you want arbitrary chunks as an initial ddf:
chunk_size <- 100000
n <- nrow(large_dt)

large_dt_ddf_conn <- localDiskConn("__path__")

# loop over chunks and save each as a key-value pair to the connection
for (ii in seq_len(ceiling(n / chunk_size))) {
  # clip the upper bound at n so the final chunk does not index past the last row
  idx <- ((ii - 1) * chunk_size + 1):min(ii * chunk_size, n)
  addData(large_dt_ddf_conn, kvPairs(kvPair(ii, large_dt[idx, ])))
}

# point to this connection as a ddf
large_dt_ddf <- ddf(large_dt_ddf_conn)

This loops over your data set and saves each chunk as a key-value pair to a local disk connection.  If you’d rather save subsets according to some prescribed division of your data.table, you can modify the for loop to iterate over the factor levels of the dividing variable instead.
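
A minimal sketch of that per-level variation, assuming a small stand-in table with a hypothetical grouping column named `site` (neither the column nor the temporary path comes from the original data). Each subset is converted to a plain data.frame before saving, which is a conservative choice here rather than a known requirement:

```r
library(data.table)
library(datadr)

# Small stand-in for large_dt; "site" is a hypothetical grouping column
large_dt <- data.table(
  site  = rep(c("a", "b", "c"), times = c(3, 2, 4)),
  value = 1:9
)

# Write to a fresh temporary directory (autoYes creates it without prompting)
by_site_path <- tempfile("large_dt_by_site")
by_site_conn <- localDiskConn(by_site_path, autoYes = TRUE)

# One key-value pair per level: key = the level, value = the matching rows
for (lev in unique(large_dt$site)) {
  rows <- as.data.frame(large_dt[site == lev])
  addData(by_site_conn, kvPairs(kvPair(lev, rows)))
}

# Point to the connection as a ddf, as in the chunked version
by_site_ddf <- ddf(by_site_conn)
```

With the real data you would replace the stand-in table with your own data.table and `site` with whichever variable defines the division.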

Ryan


