wordcount of a big file


Jarcem

Oct 1, 2013, 4:09:40 AM
to rha...@googlegroups.com
I have a cluster with 3 servers, running Hadoop 2.0.0-cdh4.3.0, R version 3.0.1 and rmr 2.2.2.
I'm using wordcount as a performance test.
When I use a big text file (around 3 GB), I've seen that during the "reduce" phase R uses a lot of memory and that there is only one reduce task.

I wonder why I have only one reduce task and why R uses so much memory. What can I do to increase the number of reducers and/or decrease the memory used by R?

The test file has 100,000,000 lines; each one contains a line number followed by 5 random numbers between 0 and 32000.
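
(For reference, a small sketch that generates a scaled-down file in the same format; the line count and file name here are illustrative, not the actual test setup:)

# Sketch: write n lines, each a line number followed by 5 random
# integers between 0 and 32000, separated by single spaces.
n = 1e5   # scaled down from the 100,000,000-line test file
lines = vapply(
  seq_len(n),
  function(i)
    paste(i, paste(sample(0:32000, 5, replace = TRUE), collapse = " ")),
  character(1))
writeLines(lines, "wordcount-sample.txt")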

library(rmr2)

wordcount =
  function(
    input,
    output = NULL,
    pattern = " ") {

    # map: split each line into words and emit (word, 1) pairs
    wc.map =
      function(., lines) {
        keyval(
          unlist(
            strsplit(
              x = lines,
              split = pattern)),
          1)}

    # reduce: sum the counts for each word
    wc.reduce =
      function(word, counts) {
        keyval(word, sum(counts))}

    mapreduce(
      input = input,
      input.format = "text",
      output = output,
      map = wc.map,
      reduce = wc.reduce,
      combine = TRUE)}
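
(For reference, a hedged usage sketch; the HDFS paths below are hypothetical:)

# Usage sketch: run the job and pull the result back into R.
out = wordcount("/user/jorge/lines.txt", "/user/jorge/wordcount-out")
counts = from.dfs(out)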

Antonio Piccolboni

Oct 1, 2013, 12:44:58 PM
to rha...@googlegroups.com
First, keep in mind that the examples are in general not optimized but kept as simple as possible.

The number of reduce tasks defaults to one in many Hadoop distributions. You need to change that. One way is to supply this additional argument to mapreduce:

backend.parameters = list(hadoop = list(D = "mapred.reduce.tasks=10"))
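
For concreteness, a hedged sketch of that argument in context, added to the mapreduce() call from the wordcount function above:

# Sketch: the mapreduce() call from the wordcount example, with
# backend.parameters added; the value 10 is a placeholder (see below).
mapreduce(
  input = input,
  input.format = "text",
  output = output,
  map = wc.map,
  reduce = wc.reduce,
  combine = TRUE,
  backend.parameters =
    list(hadoop = list(D = "mapred.reduce.tasks=10")))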


The value 10 should actually be slightly less than the number of available reduce slots. This argument is deprecated because it was being abused, but I think this use case will be preserved by whatever replaces it.

The memory issue points to something else, like a word with a very large count. Even then, with the combiner on, it should work. Nonetheless, I would check that the pattern is appropriate for your file; here that would mean single-space-separated words. Another thing is to try with a smaller, local file, say 1 MB, on the local backend, do a debug(wc.map) and debug(wc.reduce), and see that they do what you expect them to do (a sketch of such a session follows at the end of this reply). Have you tried different data sizes and written down the times-to-completion for the map phase and the reduce phase? How do they grow? That usually gives some insight. Another thing you can do is add

rmr.str(word) 
rmr.str(counts)

as the first two lines of the reduce function. It will write a lot to your standard error, so use it at smaller data sizes.
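
A minimal sketch of the instrumented reducer, assuming the wc.reduce from the example above (the name wc.reduce.debug is just illustrative):

wc.reduce.debug =
  function(word, counts) {
    rmr.str(word)     # write str() of the current key to standard error
    rmr.str(counts)   # write str() of the counts vector to standard error
    keyval(word, sum(counts))}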
This is just an example after all, and it hasn't been tested on more than one file. The debugging guidelines on the wiki can, to some extent, also be used as a performance troubleshooting guide. There are other things you could try in order to move the work to the map phase, but I would first understand what the problem is instead of trying solutions. My suspicion is that your lines are short and your vocabulary huge and uniformly distributed. In real life, lines are longer and the vocabulary smaller, with a Zipf distribution. That could explain speed issues but not memory issues, so we need to dig deeper to figure out what's wrong.
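
Picking up the earlier suggestion about the local backend, a rough sketch of an interactive debugging session (it assumes wc.map and wc.reduce are defined at the top level, and the file name is hypothetical):

library(rmr2)
rmr.options(backend = "local")   # run map and reduce in the current R session
debug(wc.map)                    # step through the mapper interactively
debug(wc.reduce)                 # step through the reducer interactively
wordcount("small-sample.txt")    # hypothetical ~1 MB local test file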


Antonio

Jarcem

Oct 11, 2013, 5:53:30 AM
to rha...@googlegroups.com
Hi Antonio,

 thank you very much for your help. I found that parameter in my client configuration files, and it was set to 1 :-(

 I changed it to a higher value and all my problems went away.

 You were 100% right about my test file :-O short lines with a huge, uniformly distributed vocabulary.
 I run these performance tests with wordcount because I'm also comparing against a Java wordcount, and I'm comparing performance across different file sizes.

 Now I want to get better performance, because there's too much difference between R and Java. I see that Revolution Analytics has rmr, rhdfs and rhbase; are they the same versions that you have on GitHub? I wonder whether they have optimized just R or also your libraries.

Best regards,
 Jorge

Antonio Piccolboni

Oct 11, 2013, 10:46:29 AM
to RHadoop Google Group
I managed to get within 20% of Java on a collocation analysis task. That was on EC2, where network latencies and other overhead may mask the differences. On a smaller dataset on my laptop the difference was 6X, with the R implementation constantly CPU-bound. If you don't bring other dimensions into your analysis, such as the productivity and greater statistical sophistication afforded by R, I can spare you the effort: just go with Java. Then please complain on LinkedIn that you can't find enough data scientists!

The wordcount example is what it is, an example for purposes of explanation and teaching, and there may be optimizations available that I haven't thought of. If you want, you can turn the profile.nodes option on and share the data produced with me, and I will take a look. I would also try increasing the keyval.length option 10X and see if that helps.
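
A small sketch of setting those two options; exact defaults and accepted values differ between rmr2 versions, so the numbers below are only illustrative:

library(rmr2)
rmr.options(profile.nodes = TRUE)    # collect profiling data on the task nodes
rmr.options(keyval.length = 10000)   # illustrative: try roughly 10x your current value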

The Revolution distribution is behind GitHub by a few versions. RHadoop will run on Revo R and take advantage of the optimizations therein; nonetheless, I would advise against running Revo R in multicore mode on Hadoop, as having two separate mechanisms for parallelism will probably create more problems than it solves.


Antonio


