I used mapred.map.tasks, which you told me can control the number of tasks, and it produced a warning. So I put it in Hadoop's mapred-default.html, and now there is no warning. The timeouts you mentioned also happened, and I am now trying to solve them. Before that, may I ask you one thing?
There are two files, M.txt and N.txt. Each may have several million lines. A, B, C, D and a, b, c, d are the row names.
The contents of M.txt are like:
A 4.953156 13.558079 8.837385 3.262974 4.972366 10.827528 7.577138 3.750967 2.074705 2.851451
B 9.637207 12.183856 11.907440 8.077675 1.170748 12.376517 11.503032 12.508148 9.692648 8.168134
C 5.061248 12.668217 2.292028 14.500193 1.709681 7.151250 13.130690 1.784773 13.101968 8.557451
D 4.913432 12.620262 14.487713 14.397911 13.668904 6.830494 12.443367 2.822725 8.139648 1.525864
The contents of N.txt are like:
a 4.211084 9.463963 2.665201 3.401210 6.787613 8.255354 5.082539 8.193901 1.988843 6.955243
b 6.177279 4.248939 2.614391 7.588413 6.548253 2.426467 4.073928 6.597446 8.195755 7.093351
c 5.876866 3.182274 1.648620 13.399885 14.494392 1.824633 11.081571 2.662918 7.443045 5.137352
d 2.121217 11.868432 6.142129 13.383439 13.477533 10.797223 7.939662 5.005920 2.131644 14.468207
What I want is to compute the correlation of each pair of rows, one row from M.txt and one row from N.txt: cor(A,a), cor(A,b), cor(A,c), cor(A,d), cor(B,a), cor(B,b), cor(B,c), cor(B,d), cor(C,a), cor(C,b), cor(C,c), cor(C,d), cor(D,a), cor(D,b), cor(D,c), cor(D,d).
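To make the goal concrete, here is a minimal single-machine sketch of the computation in plain Python. It is not RHadoop code. It assumes cor means the Pearson correlation, and the function names (parse_rows, pearson, all_pair_correlations) are made up for this illustration. The data are the sample rows shown above.

```python
import math

# Sample rows copied from the post: a row name followed by ten values.
M_TEXT = """
A 4.953156 13.558079 8.837385 3.262974 4.972366 10.827528 7.577138 3.750967 2.074705 2.851451
B 9.637207 12.183856 11.907440 8.077675 1.170748 12.376517 11.503032 12.508148 9.692648 8.168134
C 5.061248 12.668217 2.292028 14.500193 1.709681 7.151250 13.130690 1.784773 13.101968 8.557451
D 4.913432 12.620262 14.487713 14.397911 13.668904 6.830494 12.443367 2.822725 8.139648 1.525864
"""
N_TEXT = """
a 4.211084 9.463963 2.665201 3.401210 6.787613 8.255354 5.082539 8.193901 1.988843 6.955243
b 6.177279 4.248939 2.614391 7.588413 6.548253 2.426467 4.073928 6.597446 8.195755 7.093351
c 5.876866 3.182274 1.648620 13.399885 14.494392 1.824633 11.081571 2.662918 7.443045 5.137352
d 2.121217 11.868432 6.142129 13.383439 13.477533 10.797223 7.939662 5.005920 2.131644 14.468207
"""


def parse_rows(text):
    """Parse lines of the form 'name v1 v2 ...' into {name: [floats]}."""
    rows = {}
    for line in text.strip().splitlines():
        name, *values = line.split()
        rows[name] = [float(v) for v in values]
    return rows


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def all_pair_correlations(m_rows, n_rows):
    """Correlate every row of M with every row of N (a full cross product)."""
    return {
        (m_name, n_name): pearson(m_values, n_values)
        for m_name, m_values in m_rows.items()
        for n_name, n_values in n_rows.items()
    }


correlations = all_pair_correlations(parse_rows(M_TEXT), parse_rows(N_TEXT))
for (m_name, n_name), value in correlations.items():
    print(f"cor({m_name},{n_name}) = {value:.6f}")
```

With four rows in each file this produces 16 values, one for each pair listed above. With millions of rows in each file the cross product is far too large for one machine, which is why I am asking about Hadoop.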
How can I implement this in RHadoop?
If I just read both files from HDFS, they will be split arbitrarily. Each map task takes one input split of M.txt and one input split of N.txt and computes the correlations for those rows only, so some pairs will be missed and the result won't be complete.
So how can I handle this problem? Is there any method to make sure every pair is computed?
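To illustrate what "complete" means here, the following plain Python sketch (again not RHadoop code) counts which row pairs are covered when blocks of M.txt and N.txt are matched one-to-one, compared with when every M block is paired with every N block. The block size of 2 and the helper names are hypothetical, and whether the second scheme can be expressed in rmr2 is exactly what I am unsure about.

```python
from itertools import product


def split_into_blocks(names, block_size):
    """Cut a list of row names into consecutive blocks (stand-ins for input splits)."""
    return [names[i:i + block_size] for i in range(0, len(names), block_size)]


def pairs_covered(tasks):
    """Collect every (M row, N row) pair that some task would compute."""
    covered = set()
    for m_block, n_block in tasks:
        covered.update(product(m_block, n_block))
    return covered


m_names = ["A", "B", "C", "D"]
n_names = ["a", "b", "c", "d"]
m_blocks = split_into_blocks(m_names, block_size=2)
n_blocks = split_into_blocks(n_names, block_size=2)

# One-to-one matching: block i of M only ever meets block i of N.
one_to_one = pairs_covered(zip(m_blocks, n_blocks))

# Cross pairing: every block of M meets every block of N.
cross = pairs_covered(product(m_blocks, n_blocks))

wanted = set(product(m_names, n_names))
print("one-to-one missing:", sorted(wanted - one_to_one))
print("cross pairing missing:", sorted(wanted - cross))
```

In this small case the one-to-one matching misses pairs such as cor(A,c) and cor(C,a), while the cross pairing misses none. The cost is that each block is read once for every block of the other file.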
Thank you so much for your help