Hi,
I'm testing rmr2 with the following setup:
- Hadoop 1.2.1, 5 nodes, each 3.7 GB RAM + 1 GB swap
- R 3.0.2
- latest versions of all the RHadoop packages
- a 1.6 GB CSV file that contains exactly 1 million rows with 40 columns (this is only a test file - we plan to go up to 100 million rows)
The mapper function is as follows:
function(k, r) {
  # compute the derived value for every row of the chunk
  r$ppdm <- r[[idx1]] / r[[idx2]]
  # emit the rows keyed by the splitter column
  keyval(r[[splitter]], r)
}
Basically, I need to compute a value for every row, add it to the row, and return the data with the Hadoop key being one of the values from the row. There are about 75,000 distinct values in the splitter column.
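For reference, the job is invoked roughly like this (the HDFS path and the column positions below are just placeholders, not the real ones):

library(rmr2)

# placeholder column positions -- the real idx1/idx2/splitter are not shown here
idx1     <- 2
idx2     <- 3
splitter <- 5

# "csv" input format; extra arguments are passed through to read.table()
csv.format <- make.input.format("csv", sep = ",")

result <- mapreduce(
  input        = "/user/test/data.csv",   # placeholder HDFS path
  input.format = csv.format,
  map          = function(k, r) {
    r$ppdm <- r[[idx1]] / r[[idx2]]
    keyval(r[[splitter]], r)
  })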
When I run the job, Hadoop decides to use 25 mappers, which means each mapper should process about 40,000 rows - which is nothing for either Java or R and should be pretty quick IMHO.
But the R processes on all 5 nodes that run the first 5 maps (with 20 pending) behave exactly the same way - they slowly eat more than 3 GB of physical memory and are then killed by the Linux OOM killer. The Hadoop job retries a bunch of times and then finally dies. The same happens with a 500,000-row dataset.
When I run the same job in local mode with only 100,000 rows I get the same problem - the R process goes well north of 3 GB of RAM and is beheaded.
The same thing with 30 rows of data works exactly as it should.
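The 30-row sanity check looks roughly like this (the data frame and column names are made up for illustration):

library(rmr2)
rmr.options(backend = "local")    # everything runs inside the local R session

# tiny synthetic data frame standing in for a 30-row sample of the real CSV
d <- data.frame(a = runif(30), b = runif(30), g = sample(letters[1:5], 30, TRUE))

res <- from.dfs(
  mapreduce(
    input = to.dfs(d),
    map   = function(k, r) {
      r$ppdm <- r$a / r$b         # same per-row computation
      keyval(r$g, r)              # keyed by the "splitter" column
    }))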
The Hadoop & R setup is OK and works well for other types of algorithms I've tried.
Do you have any idea what's wrong?