Parallel computing with data.table


ygu...@gmail.com

Mar 16, 2016, 7:36:47 AM
to Israel R User Group
I'm trying to run a parallel computation with data.table. I have a big data set; I'd like to split it into a number of smaller ones and apply a transformation to each part independently.

Let DataP be a big data set with columns: ID, x1, x2, x3, group

My code is:

# I split an index (indx) rather than the data itself, because splitting
# the data directly takes a lot of time with my data.
setkey(DataP, SplitKey_f)
indx <- split(seq(nrow(DataP)), DataP$group)

library(parallel)
library(doParallel)
library(foreach)
cl <- makeCluster(8)
registerDoParallel(cl)

foreach(i = seq_along(indx), .combine = rbind) %dopar% {
  library(data.table)
  Psubset <- DataP[, indx[[i]]]
  # do some transformations on the data
}
stopCluster(cl)
stopCluster(cl)

The above doesn't work because foreach with parallel computing cannot execute the line `Psubset <- DataP[, indx[[i]]]`. However, %do% instead of %dopar% works (but takes a lot of time).

How can I fix the problem of subsetting a data.table within a parallel loop?

Jonathan Rosenblatt

Mar 16, 2016, 9:21:04 AM
to israel-r-user-group
Generally, I find this little blog post to be the best intro to foreach:

I suspect that what is going wrong is that the object DataP is not available to the various workers.
If you are running Linux, use registerDoMC() instead of registerDoParallel().
If you are running Windows, you will need to export the object manually, or build an iterator over DataP.
More on iterators and foreach can be found here:
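A minimal sketch of the two suggestions above, reusing `DataP` and `indx` from the original post (the backend choice and `.export` usage are illustrative, not from the thread):

```r
library(data.table)
library(foreach)

# On Linux/macOS: doMC forks the master process, so workers
# see DataP in shared memory without an explicit export.
library(doMC)
registerDoMC(8)

res <- foreach(i = seq_along(indx), .combine = rbind) %dopar% {
  Psubset <- DataP[indx[[i]], ]
  # ... transformations on Psubset ...
  Psubset
}

# On Windows (PSOCK cluster): export the object explicitly.
# foreach(i = seq_along(indx), .combine = rbind,
#         .packages = "data.table", .export = "DataP") %dopar% { ... }
```

Note that foreach usually auto-exports variables referenced in the loop body; `.export` is the fallback when that analysis fails.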





--
You received this message because you are subscribed to the Google Groups "Israel R User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to israel-r-user-g...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Jonathan Rosenblatt
Dept. of Industrial Engineering and Management
Ben Gurion University of the Negev

Yury Gubman

Mar 16, 2016, 9:57:33 AM
to israel-r-...@googlegroups.com
Thank you. I'll try. 

Yury

Dr. Yury Gubman

Project Manager, Statistics

Jerusalem, Israel

Tel. +972-54-5717409

Mail: ygu...@gmail.com



ygu...@gmail.com

Mar 17, 2016, 9:16:55 AM
to Israel R User Group
The solution is rather simple; no special iterators are needed, even on Windows.
Since others may run into similar problems, I'm posting it:

setkey(DataP, SplitKey_f)
indx <- unique(DataP$SplitKey_f)

library(parallel)
library(doParallel)
library(foreach)
cl <- makeCluster(8)
registerDoParallel(cl)

Results <- foreach(i = indx, .packages = "data.table", .combine = rbind) %dopar% {
  # subset the data.table
  DataP_sub <- DataP[SplitKey_f == i, ]
  # data manipulation with DataP_sub
}
stopCluster(cl)
stopCluster(cl)

Jonathan Rosenblatt

Mar 17, 2016, 12:49:12 PM
to israel-r-user-group
Thank you!
So the problem was data.table specific?
Now I am puzzled: data.table::setkey() allows the data.table to be changed in place.
But registerDoParallel() is not shared-memory parallelization.
This means that each core gets a *full copy* of DataP and changes it in place.
If this is indeed the case, I think you can improve efficiency by exporting only a subset of the data with an iterator.
Then again, if DataP is not too large, you will not be gaining much.
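A sketch of the iterator idea: `iterators::isplit()` hands each worker only its own group's rows instead of the full table (`DataP` and `SplitKey_f` are from the thread; the transformation is a placeholder):

```r
library(data.table)
library(foreach)
library(iterators)

# isplit() yields one chunk per level of the split factor; each chunk
# (d$value) is shipped to a worker on its own, so no worker ever
# receives the whole of DataP.
Results <- foreach(d = isplit(DataP, DataP$SplitKey_f),
                   .packages = "data.table", .combine = rbind) %dopar% {
  DataP_sub <- as.data.table(d$value)
  # ... transformations on DataP_sub ...
  DataP_sub
}
```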





Yury Gubman

Mar 18, 2016, 9:45:37 AM
to israel-r-...@googlegroups.com
Thank you, it seems that you are right and the whole data set is shipped to each node.
The link you sent earlier iterates over a data frame by columns/rows. An iterator on a data.table is not so straightforward, am I right?
split() is not an option for 50 million records.





Jonathan Rosenblatt

Mar 18, 2016, 9:48:47 AM
to israel-r-user-group
Never tried an iterator on a data.table.
But if the data is on disk, it may be best to let each slave read its own data from disk, thus avoiding a full copy on each slave.
You then only have to pass each slave the subset of rows it should read.
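One way to sketch this read-from-disk idea with `data.table::fread(skip=, nrows=)`; the file name, total row count, and chunking are all assumptions for illustration:

```r
library(parallel)
library(doParallel)
library(foreach)

n_total   <- 50e6                       # assumed known total row count
n_workers <- 8
chunk     <- ceiling(n_total / n_workers)
starts    <- seq(0, n_total - 1, by = chunk)

cl <- makeCluster(n_workers)
registerDoParallel(cl)
Results <- foreach(s = starts, .packages = "data.table", .combine = rbind) %dopar% {
  # skip = s + 1 steps over the header line plus the s rows
  # belonging to earlier workers; only this slice is read into memory.
  DataP_sub <- fread("DataP.csv", skip = s + 1,
                     nrows = min(chunk, n_total - s), header = FALSE)
  # ... transformations on DataP_sub ...
  DataP_sub
}
stopCluster(cl)
```

Each worker then holds at most `chunk` rows, so the master never has to serialize the full table to the slaves.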
