Hi,
I'm testing rmr2 with the following setup:
- Hadoop 1.2.1, 5 nodes, each 3.7 GB RAM + 1 GB swap
- R 3.0.2
- latest versions of all the RHadoop packages
- a 1.6 GB CSV file that contains exactly 1 million rows with 40 columns (this is only a test file - we plan to go up to 100 million rows)
The mapper function is as follows:
function(k, r) {
  # compute the derived value for every row of the chunk
  r$ppdm <- r[[idx1]] / r[[idx2]]
  # emit the rows keyed by the splitter column
  keyval(r[[splitter]], r)
}
Basically, I need to compute a value for every row, add it to the row, and return the data with the Hadoop key being one of the values from the row. There are about 75,000 distinct values in the splitter column.
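For reference, the job is invoked roughly like this (the HDFS path and the column positions below are just placeholders, not the real ones):

library(rmr2)

# placeholder column positions -- the real idx1/idx2/splitter are not shown here
idx1     <- 2
idx2     <- 3
splitter <- 5

# "csv" input format; extra arguments are passed through to read.table()
csv.format <- make.input.format("csv", sep = ",")

result <- mapreduce(
  input        = "/user/test/data.csv",   # placeholder HDFS path
  input.format = csv.format,
  map          = function(k, r) {
    r$ppdm <- r[[idx1]] / r[[idx2]]
    keyval(r[[splitter]], r)
  })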
When I run the job, Hadoop decides to use 25 mappers, which means each mapper should process about 40,000 rows - which is nothing for either Java or R and should be pretty quick IMHO.
But the R processes on all 5 nodes that run the first 5 maps (with 20 pending) behave exactly the same way - they slowly eat more than 3 GB of physical memory and are then killed by the Linux OOM killer. The Hadoop job retries a bunch of times and then finally dies. The same happens with a 500,000-row dataset.
When I run the same job in local mode with only 100,000 rows I get the same problem - the R process goes well north of 3 GB of RAM and is beheaded.
The same thing with 30 rows of data works exactly as it should.
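The 30-row sanity check looks roughly like this (the data frame and column names are made up for illustration):

library(rmr2)
rmr.options(backend = "local")    # everything runs inside the local R session

# tiny synthetic data frame standing in for a 30-row sample of the real CSV
d <- data.frame(a = runif(30), b = runif(30), g = sample(letters[1:5], 30, TRUE))

res <- from.dfs(
  mapreduce(
    input = to.dfs(d),
    map   = function(k, r) {
      r$ppdm <- r$a / r$b         # same per-row computation
      keyval(r$g, r)              # keyed by the "splitter" column
    }))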
The Hadoop & R setup is OK and works well for other types of algorithms I've tried.
Do you have any idea what's wrong?