Hello,
I have a large data table, about 2 million rows. It takes several seconds (under a minute) to load using the load command.
I would like to write it to HDFS, with every row of the data table becoming a value (in the key/value paradigm). This output file will be the input to subsequent MapReduce jobs.
How do I get the table onto HDFS? rhwrite is sequential and very, very slow.
Here is some code that is essentially a parallel rhwrite. Because loading the .Rdata file takes only a small, fixed amount of time (once per mapper), this approach is feasible; one way to ship the table to the mappers is sketched just below.
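For completeness, here is one way to make allparams available inside each mapper. This is only a minimal sketch, separate from the job below: it assumes rhput for the HDFS copy, rhwatch's setup expression (evaluated once per map task), and the shared argument for distributing the file through the distributed cache; the paths are illustrative.

save(allparams, file="allparams.Rdata")                      # serialize the table locally
rhput("allparams.Rdata", "/user/sguha/tmp/allparams.Rdata")  # copy it to HDFS

setup <- expression(map={
    load("allparams.Rdata")   # runs once per map task; the file arrives
})                            # in the working directory via 'shared'

# then: rhwatch(..., setup=setup, shared="/user/sguha/tmp/allparams.Rdata")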
allparams is the data table I would like to write to disk. The following is _much_ faster than
rhwrite(allparams, file="dz", chunk=1, numfiles=1000, kvpairs=FALSE, verbose=TRUE)
C <- 1000                                   # number of chunks / map tasks
chu <- ceiling(nrow(allparams)/C)           # rows per chunk; ceiling() (rather than
                                            # truncating) keeps the last nrow %% C
                                            # rows from being dropped
rhwatch(map=function(a,b){
          # 'a' is the chunk index in 1..C; allparams (and chu) must be
          # available inside the mapper, e.g. shipped as sketched above
          start <- (a-1)*chu+1
          end <- min(start+chu-1, nrow(allparams))
          if(start <= end)                  # skip empty trailing chunks
            for(i in start:end)
              rhcollect(i, allparams[i,])
        }
, input=c(C,C)                              # lapply-style input: keys 1..C
, reduce=0                                  # map-only job
, output="/user/sguha/tmp/foo"
, read=FALSE
, mapred=list(mapred.task.timeout=0))       # disable the task timeout
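As a quick sanity check, a few key/value pairs can be read back from HDFS afterwards; a sketch, assuming rhread and its max argument:

out <- rhread("/user/sguha/tmp/foo", max=10)   # first 10 key/value pairs
str(out[[1]])                                  # each element is list(key, value)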
Cheers
Saptarshi