Writing a large data.frame / data.table to HDFS: ~ parallel rhwrite


Saptarshi Guha

Dec 15, 2015, 8:29:28 PM
to essera-Users
Hello,

I have a large data.table, about 2MM rows. It takes a few to several seconds (under a minute) to load using the load command.

I would like to write it to HDFS, with every row of the data table becoming a value (in the key/value paradigm). This output file will be the input to subsequent MR jobs.

How do I get the file onto HDFS? rhwrite is sequential and slow, very, very slow.

Here is some code that is essentially a parallel rhwrite. Because it takes only a small, fixed amount of time to load the Rdata file (once per mapper), this approach is feasible.

allparams is the data table I would like to write to disk. The following is _much_ faster than

rhwrite(allparams, file="dz",chunk=1,numfiles=1000,kvpairs=FALSE,verbose=TRUE)
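
One detail not shown in the job below is how allparams becomes available on each mapper. A minimal sketch of that "load once per mapper" step, assuming the table was saved to HDFS with rhsave and that your RHIPE version accepts a per-task setup expression (the path and this particular mechanism are my assumptions, not part of the job as written):

## Hypothetical: save the table to HDFS once from the R session
rhsave(allparams, file="/user/sguha/tmp/allparams.Rdata")

## Hypothetical: load it once per map task, so the map expression
## below can refer to allparams directly
setup <- expression(map = {
    rhload("/user/sguha/tmp/allparams.Rdata")
})
## ...and pass setup=setup to the rhwatch() call below.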


C <- 1000                                         # number of chunks = number of map keys
chu <- as.integer(ceiling(nrow(allparams)/C))     # rows per chunk; ceiling so trailing rows are not dropped
rhwatch(map=function(a,b){
    # key 'a' is the chunk index (1..C): emit that chunk's rows as key/value pairs
    start <- (a-1)*chu+1
    end <- min(start+chu-1, nrow(allparams))
    if(start <= end)
        for(i in start:end)
            rhcollect(i, allparams[i,])
}
, input=c(C,C)                      # lapply-style input: keys 1..C spread over C map tasks
, reduce=0                          # map-only job, no reducers
, output="/user/sguha/tmp/foo"
, read=FALSE
, mapred=list(mapred.task.timeout=0))   # disable the task timeout for long-running maps
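
As a quick sanity check (just a sketch, assuming the output path above and the usual rhread defaults), the first few pairs can be read back into the session:

res <- rhread("/user/sguha/tmp/foo", max=10)   # list of key/value pairs
res[[1]][[1]]   # key: the row index
res[[1]][[2]]   # value: the corresponding one-row slice of allparams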




Cheers
Saptarshi


