Hi JD,
Thanks for replying so fast! My dataList is 8 Mb and my internet upload speed is at most 1 Mbps. I'm doing bootstrapping. Would it be better to upload the original dataset to all nodes once using rObjectsOnNodes = c('data_ori') and use seedList to generate the bootstrap samples on the nodes, instead of uploading all the bootstrap samples (i.e., dataList)? I guess that would avoid the upload speed problem?
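To make the idea concrete, here is a minimal local sketch of what I have in mind. The toy data, the placeholder myFunction and the helper name bootOnce are all made up for illustration, and plain lapply stands in for the cluster call, so this only shows the seed-based resampling, not the actual segue setup.

```r
# Toy stand-in for the real dataset (assumption: the real data_ori is a data frame)
set.seed(1)
data_ori <- data.frame(x = rnorm(100), y = rnorm(100))

# One seed per bootstrap replicate; only these small integers would be uploaded
seedList <- as.list(1:5)

# Placeholder for the real (3-hour) model fit
myFunction <- function(d) coef(lm(y ~ x, data = d))

# Hypothetical helper: rebuild the bootstrap sample from the seed, then fit
bootOnce <- function(seed) {
  set.seed(seed)                                  # makes the resample reproducible
  idx <- sample(nrow(data_ori), replace = TRUE)   # row indices drawn with replacement
  myFunction(data_ori[idx, , drop = FALSE])
}

# Local stand-in; on the cluster I assume this would be an emrlapply call
# over seedList, with data_ori shipped once via rObjectsOnNodes
results <- lapply(seedList, bootOnce)
```

If this is right, each task only needs its seed, and the same seed always reproduces the same bootstrap sample, so a failed task could be rerun on its own.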
Also, I had no idea how EMR works, so I only just realized that my real problem is that I should have used a larger numInstances. If numInstances=2, then 1 instance is the master instance and I have only 1 working instance (slave instance), which can really slow things down. I can't believe I'm so ignorant...
I have a few more questions. Estimating my model once (i.e., running myFunction once) takes 3 hours on my computer, and I need to do 300-500 bootstrap replicates. So I guess I really need to use high-CPU instances. But I'm very confused by the "vCPU" concept and also by "instancePerNode" in the segue package.
1) If I specify "instancePerNode = 1", does that mean only one vCPU in each instance is working?
2) If I use an instance with a lot of vCPUs (e.g., c1.xlarge has 8 vCPUs), should I definitely set "instancePerNode = 8" so that all 8 CPUs in one c1.xlarge instance are working? Is "instancePerNode" similar to the idea of multicore parallelizing on our own computers?
3) Is using one instance with 8 vCPUs equivalent to using 8 instances with only 1 vCPU each?
4) If running myFunction once takes 3 hours on my computer, should I set "taskTimeout" to more than 3 hours, so that my work will definitely not be stopped?
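For context, this is the rough arithmetic behind my questions. It is only a back-of-envelope sketch: estimateHours is a name I made up, and it assumes that one instance is a master that runs no tasks, that every worker is about as fast as my computer, and that tasks run in equal-length waves. Any of these may be wrong.

```r
# Rough wall-clock estimate (hypothetical helper, not part of segue)
estimateHours <- function(nReps, hoursPerRep, numInstances, workersPerInstance) {
  # Assumption: one instance is the master and does no bootstrap work
  workers <- (numInstances - 1) * workersPerInstance
  # Tasks run in waves of `workers` at a time, each wave taking hoursPerRep
  ceiling(nReps / workers) * hoursPerRep
}

estimateHours(500, 3, numInstances = 2,  workersPerInstance = 1)  # 1500 hours
estimateHours(500, 3, numInstances = 11, workersPerInstance = 8)  # 21 hours
```

So with numInstances=2 the job would take months, while 10 working instances with 8 workers each would bring it down to about a day, if my assumptions hold.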
Sorry to bother everybody with my silly questions. I hope they will be helpful to people in a similar position (no computer science background, but a huge embarrassingly parallel job to run!). I really appreciate the segue package and the help from this group!
Best,
Sophie