How to control the number of map tasks?


Jingmin

Nov 24, 2013, 4:03:01 AM
to rha...@googlegroups.com
My file has 30,000 lines and is only about 10 MB. By default only 2 map tasks are generated. I have 6 datanodes and want to increase the number of map tasks to improve computation performance.
My computation is very complex and can only handle about 2,000 lines at a time, so I want to increase the number of map tasks to 8 or more. How do I increase the number of map tasks in RHadoop? Thank you!

Antonio Piccolboni

Nov 24, 2013, 1:26:49 PM
to RHadoop Google Group
There is an option that you can set with backend.parameters, called mapred.map.tasks. Unfortunately, it is not clear what it does, if anything (lower bound? upper bound? "suggestion"?). If you google around, it's entertaining to collect all the incompatible, high-confidence answers. It sounds like the number of splits is an important factor, and this is determined by your input size and input format class. It looks like it can be affected with mapred.max.split.size.
So I would try something like


mapreduce(... other args here ...,
          backend.parameters = list(
            hadoop = list(D = "mapred.map.tasks=24",
                          D = "mapred.max.split.size=1000")))

and experiment with those numbers and see what happens. Other people suggest just splitting the file into smaller chunks; you can use the rmr2 utility scatter for that. It'd be great if you could report back on what works and what doesn't: this question has been asked many times and we need to document the answer. The main point is that Hadoop is not meant to process small data sets in very complex ways; that's the domain of traditional HPC. On the other hand it can be done. For instance, see the record-setting computation of pi, which has essentially no input. So it can be done, but the default settings have to be tampered with for this specific application (another issue may be timeouts, if your application takes a long time before outputting anything).
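
For concreteness, the back-of-the-envelope arithmetic for the numbers in this thread (a roughly 10 MB file, a target of 8 mappers) would look like this; the property values are assumptions to experiment with, not verified settings:

split.size <- as.integer(ceiling(10e6 / 8))    # ~1,250,000 bytes per split, aiming for ~8 splits
params <- list(hadoop = list(D = "mapred.map.tasks=8",
                             D = sprintf("mapred.max.split.size=%d", split.size)))
# then pass backend.parameters = params to mapreduce()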


Antonio




Jingmin

Nov 29, 2013, 10:56:01 AM
to rha...@googlegroups.com, ant...@piccolboni.info
I used mapred.map.tasks, which you told me can control the number of tasks, but it produced a warning. So I put it in Hadoop's mapred-default.xml instead, and there is no warning. The timeouts you mentioned also happened; I am trying to solve that now. Before that, can I ask you one thing?
There are two files: one is M.txt and the other is N.txt. They may have several million lines each. A, B, C, D and a, b, c, d are row names.
The contents of M.txt are like:
A  4.953156 13.558079  8.837385  3.262974  4.972366 10.827528  7.577138  3.750967  2.074705  2.851451
B  9.637207 12.183856 11.907440  8.077675  1.170748 12.376517 11.503032 12.508148  9.692648  8.168134
C  5.061248 12.668217  2.292028 14.500193  1.709681  7.151250 13.130690  1.784773 13.101968  8.557451
D  4.913432 12.620262 14.487713 14.397911 13.668904  6.830494 12.443367  2.822725  8.139648  1.525864

The contents of N.txt are like:
a  4.211084 9.463963 2.665201 3.401210 6.787613 8.255354 5.082539 8.193901 1.988843 6.955243
b  6.177279 4.248939 2.614391 7.588413 6.548253 2.426467 4.073928 6.597446 8.195755 7.093351
c  5.876866  3.182274  1.648620 13.399885 14.494392  1.824633 11.081571  2.662918  7.443045  5.137352
d  2.121217 11.868432  6.142129 13.383439 13.477533 10.797223  7.939662  5.005920  2.131644 14.468207

What I want is to compute the correlation of every pair across the two files, like cor(A,a), cor(A,b), cor(A,c), cor(A,d), cor(B,a), cor(B,b), cor(B,c), cor(B,d), cor(C,a), cor(C,b), cor(C,c), cor(C,d), cor(D,a), cor(D,b), cor(D,c), cor(D,d).
How can I implement this in RHadoop?
If I just read both of them from HDFS, the files will be split randomly. Each map would take one input split of M.txt and one input split of N.txt and compute on those, so some pairs would be missed and the result would not be complete.
So how can I handle this problem? Is there any method to make the computation complete?
Thank you so much for your help

Antonio Piccolboni

Dec 2, 2013, 12:22:15 PM
to rha...@googlegroups.com, ant...@piccolboni.info
In general we don't answer questions about how to solve specific problems (of course, other group members are free to do so). But let me sketch how I would go about this one, as there is interest in how to use RHadoop for CPU-bound work. The first question is whether we can load one matrix in main memory; the answer should be yes, by a good margin. So let's say you have M (or the smaller of the two) in memory and N in a Hadoop file. The function cor operates on columns, so you need to transpose both matrices.

Mt = t(M)

out = mapreduce(N, map = function(k,v) cor(Mt, t(v)))

You may want to reduce the default value of keyval.length (via rmr.options) to avoid timeouts.
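
For anyone finding this later, here is a minimal self-contained sketch of the above that runs on rmr2's local backend; the toy matrices and the use of to.dfs are illustrative additions, not part of the original question:

library(rmr2)
rmr.options(backend = "local")   # test locally before moving to Hadoop

# Toy stand-ins for the two files; rows are the named series.
M <- matrix(rnorm(40), nrow = 4, dimnames = list(c("A","B","C","D"), NULL))
N <- matrix(rnorm(40), nrow = 4, dimnames = list(c("a","b","c","d"), NULL))

Mt <- t(M)                       # cor() correlates columns, hence the transposes
N.dfs <- to.dfs(N)               # N plays the role of the big distributed input

out <- mapreduce(input = N.dfs,
                 map = function(k, v) keyval(NULL, cor(Mt, t(v))))
values(from.dfs(out))            # rows A..D, columns a..d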


Antonio

Jingmin

Dec 2, 2013, 9:27:45 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Thanks for your tips.

Returning to the first question: I used mapred.map.tasks=8 to control the number of maps and mapred.task.timeout=3600000 to avoid timeouts,
but the problem is that it runs much slower than on a single computer. In R (one computer) it needs 20 minutes; in RHadoop (1 namenode and 4 datanodes) it needs more than 40 minutes.
Even though I set the number of map tasks to 8 and see Num Tasks = 8 on port 50030, I think the real data is assigned to only 3 of those tasks: the other 5 tasks finish in less than 10 seconds, while the 3 tasks need more than 40 minutes.
The counters of the 3 tasks and of the other 5 tasks are different, as in the attached file: the 3 tasks really call rmr (just once) and the others do not.
I don't know why this happens. Each of the 3 tasks deals with 10,000 lines (my file has 30,000 lines). Is that because I didn't set keyval.length? My map function doesn't have reduce(NULL).
Even though it is assigned to only 3 tasks, I think it should still be faster than running on one computer. It makes me so confused.

And one more thing: I saw this in my task log:
stderr logs
Loading required package: rhdfs
Loading required package: methods
Loading required package: rJava
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
  call: fun(libname, pkgname)
  error: Environment variable HADOOP_CMD must be set before loading package rhdfs
Warning in FUN(c("rhdfs", "rJava", "rmr2", "reshape2", "plyr", "stringr",  :
  can't load rhdfs
Loading required package: rmr2
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: bitops
Loading required package: digest
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
It runs the mapreduce function fine, but it shows this error and warning. Why can't it load rhdfs? I have set Sys.setenv(), so why is there an error?

Thanks a lot for your help.
counter.jpg

Jingmin

Dec 2, 2013, 11:57:08 PM
to rha...@googlegroups.com, ant...@piccolboni.info
I found out it was affected by keyval.length. I changed keyval.length to 3000 and now it needs just 6 minutes, 4 times faster than in R (one computer). Thank you so much for your help.
I also still don't know about the errors and warnings in the task log that I mentioned; maybe you can tell me what they are. Thanks a lot.

Antonio Piccolboni

Dec 3, 2013, 12:03:10 AM
to RHadoop Google Group
On Mon, Dec 2, 2013 at 6:27 PM, Jingmin <jingm...@gmail.com> wrote:
> Thanks for your tips.

> Returning to the first question: I used mapred.map.tasks=8 to control the number of maps and mapred.task.timeout=3600000 to avoid timeouts,

Timeouts are necessary; by setting them so high you are eliminating fault tolerance. You need to get your job to output something at regular intervals, like a status message if not actual output. You can call the function status to achieve that.
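
A sketch of what that might look like inside a map function; I'm assuming status() accepts a short message string, so check its help page in your rmr2 version:

map.fun <- function(k, v) {
  status("crunching a chunk")    # report progress so the task isn't presumed dead
  Sys.sleep(1)                   # stand-in for a long CPU-bound computation on v
  keyval(NULL, nrow(v))
}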
 
> but the problem is that it runs much slower than on a single computer. In R (one computer) it needs 20 minutes; in RHadoop (1 namenode and 4 datanodes) it needs more than 40 minutes.

Comparing in-memory, sequential programs with Hadoop programs is not as straightforward as N nodes implying 1/N time. Hadoop was designed for scalability first, with efficiency taking second place. That said, since your job seems to be CPU bound and decomposable into independent units (you said it's a map-only job), this performance result is surprising. I don't know what the differences are in detail between the two programs being compared, nor do I know the hardware configurations involved, so it's hard for me to tell. If the map function is essentially the same as the sequential program and the hardware is comparable, then you need to use the profiler to figure out where the time is going.
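
One possibly useful pointer: rmr2 can run R's profiler on the nodes. I believe the knob is profile.nodes in rmr.options, but treat the exact name and the location of the output as things to verify for your version:

rmr.options(profile.nodes = TRUE)
# run the job, then look for the Rprof output in the task logs/tmp directory
# of each node and inspect it with summaryRprof()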



> Even though I set the number of map tasks to 8 and see Num Tasks = 8 on port 50030, I think the real data is assigned to only 3 of those tasks: the other 5 tasks finish in less than 10 seconds, while the 3 tasks need more than 40 minutes.
> The counters of the 3 tasks and of the other 5 tasks are different, as in the attached file: the 3 tasks really call rmr (just once) and the others do not.
> I don't know why this happens. Each of the 3 tasks deals with 10,000 lines (my file has 30,000 lines). Is that because I didn't set keyval.length?

Setting keyval.length to a smaller number would probably solve your timeout issues, but not the number-of-mappers issue.
 
> My map function doesn't have reduce(NULL).

A map function can't have a reduce. You probably meant that your mapreduce call has the reduce argument set to NULL. Using precise language will go a long way toward resolving your issue, as would providing a reproducible test case.
 
> Even though it is assigned to only 3 tasks, I think it should still be faster than running on one computer. It makes me so confused.

> And one more thing: I saw this in my task log:
> stderr logs
> Loading required package: rhdfs
> Loading required package: methods
> Loading required package: rJava
> Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
>   call: fun(libname, pkgname)
>   error: Environment variable HADOOP_CMD must be set before loading package rhdfs
> Warning in FUN(c("rhdfs", "rJava", "rmr2", "reshape2", "plyr", "stringr",  :
>   can't load rhdfs
> Loading required package: rmr2
> Loading required package: Rcpp
> Loading required package: RJSONIO
> Loading required package: bitops
> Loading required package: digest
> Loading required package: stringr
> Loading required package: plyr
> Loading required package: reshape2
> It runs the mapreduce function fine, but it shows this error and warning. Why can't it load rhdfs? I have set Sys.setenv(), so why is there an error?

Where did you install rhdfs? What environment are you talking about? You are dealing with a cluster; there are many machines and many sessions involved. If you are not using rhdfs in your mapreduce call, your best options are 1) ignore the errors or 2) detach rhdfs before calling mapreduce. rmr2 doesn't need rhdfs.
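
For option 2, a one-liner along these lines should do it (plain R, nothing rmr2-specific):

detach("package:rhdfs", unload = TRUE)   # run in your session before calling mapreduce()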
 

> Thanks a lot for your help.

You are welcome


Antonio 

Antonio Piccolboni

Dec 3, 2013, 12:11:23 AM
to RHadoop Google Group
That points to a performance bug in your program (like a step of quadratic complexity in keyval.length) or maybe you were hitting memory limits and causing thrashing. Short of that, keyval.length can't make anything faster, or we would make the default 0 (warp speed).

Antonio

Jingmin

Dec 3, 2013, 7:56:47 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Could you tell me exactly what keyval.length means? I think it means that if I have an input file with 30,000 lines and I set keyval.length=3000, then each map will deal with 3,000 lines. I set the number of map tasks to 4. Two of the maps will be called 2 times and the other two will be called 3 times. Is what I think right?

Antonio Piccolboni

Dec 4, 2013, 12:50:43 AM
to RHadoop Google Group
I can't totally follow you; let me try to rephrase. Each map call will process about 3,000 lines (not exactly; it can be more or less). There will be about 10 map calls. How Hadoop spreads those across map tasks depends on many factors, such as load balancing and data locality. Hopefully it will be roughly 10/4 per process, but I wouldn't count on it.
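
In code, the arithmetic for this thread's numbers looks like this (keyval.length is set through rmr.options; the spread over 4 tasks is up to Hadoop, as said above):

rmr.options(keyval.length = 3000)   # about 3,000 records handed to each map call
# 30,000 input lines / 3,000 per call => about 10 map calls,
# spread by Hadoop across however many map tasks actually run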


Antonio