MR function runs locally but fails on EMR

Mukul Biswas

Apr 28, 2014, 10:11:21 AM

to rha...@googlegroups.com

Hello RHadoop experts,

My program has several MR functions chained in series: the output of one is the input to the next. The error occurs in the 3rd MR function. The 1st and 2nd MR functions work fine and their output files (on HDFS) look perfect, so it seems I am doing something funny in the 3rd one. Please note that the entire program runs fine in local mode but breaks down on EMR with an error stating failed reduce tasks. Pointers to possible causes would be greatly appreciated.

The MR code looks like this -

correlatedPairs = mapreduce(
  input = dCustomerPairs,
  output = file.path(hdfs.output.dir, 'correlated.csv'),
  output.format = make.output.format("csv", sep = ","),
  map = function(k, v) {
    keyval(v, "rho") # rho is just a dummy value
  },
  reduce = function(k, v) {
    ts1 = getSalesDataByCustomerIdFromDfs(as.numeric(k[1]))
    ts2 = getSalesDataByCustomerIdFromDfs(as.numeric(k[2]))
    if (length(ts1) == length(ts2)) {
      rho = cor(ts1, ts2)
    } else {
      rho = 0
    }
    if (rho >= MAX_RHO_LIMIT) keyval(k, rho)
  }
)

The MR function is at line 200 of the attached source on GitHub. The README.md should give some idea of what I am trying to achieve.

Antonio Piccolboni

Apr 28, 2014, 12:48:37 PM

to RHadoop Google Group

Could you access the standard error of the failed task (the R process) and paste it into your next message? What looks suspicious is k[2]: keys are always length one, unless they are two-dimensional, in which case they have one row. There is no difference that I can think of in this respect between the local and hadoop backends, so my intuition doesn't fit with your observation that the program runs fine on the local backend.

Antonio
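
A small base R sketch of the point about key shape, with no rmr2 or Hadoop involved. The objects k_matrix and k_df are made-up stand-ins for what a reducer might receive when whole rows are used as keys; the actual class of the key depends on the input format and the rmr2 version, so treat this as an illustration only.

```r
# Hypothetical one-row, two-column key (a customer pair).
k_matrix <- matrix(c(3, 6), nrow = 1)

# Single-index subsetting walks a matrix element by element, so with
# exactly one row, k[1] and k[2] happen to be the two customer ids.
first_id  <- as.numeric(k_matrix[1])
second_id <- as.numeric(k_matrix[2])

# Row/column indexing states the same intent explicitly.
first_id_explicit  <- k_matrix[1, 1]
second_id_explicit <- k_matrix[1, 2]

# If the key is a one-row data frame instead, k[[i]] extracts
# column i as a plain vector.
k_df <- data.frame(a = 3, b = 6)
df_first  <- as.numeric(k_df[[1]])
df_second <- as.numeric(k_df[[2]])
```

With a single-row key both indexing styles agree, which would be consistent with k[2] working in local mode. Printing class(k) and dim(k) inside the reducer is one way to check what the key actually looks like on each backend.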

Mukul Biswas

May 1, 2014, 6:23:42 AM

to rha...@googlegroups.com

Hi Antonio, thanks for looking into this. I was wrong to say that my program "fails"; what actually happens is that it gets "stuck" for a very long time. The previous MR function takes less than a minute, but this one runs for well over 30 minutes (until I kill the job) with no apparent change in the HDFS output folder. The output folder ('correlated.csv') is created in HDFS, but it contains only a '_temporary' folder. I suspect one of the following as the cause of the current behaviour:

1. The function 'getSalesDataByCustomerIdFromDfs', called in reduce(), itself makes a mapreduce call. In effect, these are nested MR calls.
2. My local versions are R 3.0.2 and rmr2 3.0.0, while the versions on EMR are R 2.15.3 and rmr2 2.3.0. Something might be handled differently between them.
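
One way to confirm which versions a given machine is really running is to record them from R itself. This is only a sketch: the variable names are made up, and rmr2_version comes back as NA if rmr2 is not installed where the snippet runs.

```r
# Version string of the running R interpreter.
r_version <- R.version.string

# Installed rmr2 version as text, or NA if the package is absent.
rmr2_version <- if (requireNamespace("rmr2", quietly = TRUE)) {
  as.character(utils::packageVersion("rmr2"))
} else {
  NA_character_
}

cat("R:", r_version, "\n")
cat("rmr2:", rmr2_version, "\n")
```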

To your point on k[1], the input file has unique keys (one key per row), like this:

1,2
1,3
2,3
2,4
2,6
3,6

... and I am expecting only one entry per reduce call. For the last row, k[1] = 3 and k[2] = 6. It works in local mode, with my results showing up.

I have been through the job & node logs and could not find any ERR entry. There is nothing worth blaming in any of the WARN entries.

I am planning to remove the MR call from within the 'dubious' function and try accessing a part of the data frame using a key (not sure if that is possible with the bigdata objects).

Any number of pointers would help me at this point of time.

Thanks.

Antonio Piccolboni

May 1, 2014, 11:56:30 AM

to RHadoop Google Group

Sorry, my brain did not fully engage on the first answer. We do not support EMR, meaning it is not supposed to work. I am surprised you got even as far as you did; maybe they changed their API to be more standard? I haven't looked into the EMR issue for a long while. To run on Amazon EC2, we use whirr to create an rmr2 cluster with standard MR on it. As to your other points: nested MR calls are not possible, neither in rmr2 nor in any other MR-compatible or MR-like system I know of.

Antonio
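
Since nested MR calls are ruled out, one possible restructuring is to fetch the sales series once, up front, and do the per-pair lookup in memory. The base R sketch below shows only that lookup-and-correlate logic, with no rmr2 calls. Everything in it is hypothetical: the sample series, the value of MAX_RHO_LIMIT (the thread never states it), and the helper name pair_rho. On a real cluster the lookup would more likely be a separate join step or a table read once per task, which is not shown here.

```r
# Hypothetical sales series per customer id, loaded once up front
# instead of being fetched by a job inside every reduce call.
sales_by_customer <- list(
  "1" = c(10, 12, 14, 16),
  "2" = c(20, 24, 28, 32),
  "3" = c(5, 3, 8, 1),
  "6" = c(7, 9)
)

# Candidate pairs, one per row, in the style of the "1,2" input rows.
pairs <- data.frame(a = c(1, 1, 2, 3), b = c(2, 3, 3, 6))

MAX_RHO_LIMIT <- 0.9  # assumed threshold, not taken from the thread

# Correlation for one pair via in-memory lookup; no job is launched.
pair_rho <- function(a, b, lookup) {
  ts1 <- lookup[[as.character(a)]]
  ts2 <- lookup[[as.character(b)]]
  # Same rule as the posted reducer: unequal lengths give rho = 0.
  if (length(ts1) != length(ts2)) return(0)
  cor(ts1, ts2)
}

pairs$rho <- mapply(pair_rho, pairs$a, pairs$b,
                    MoreArgs = list(lookup = sales_by_customer))

# Keep only the pairs at or above the threshold.
correlated <- pairs[pairs$rho >= MAX_RHO_LIMIT, ]
print(correlated)
```

This only works as written when the series fit in memory. Note also that cor() returns NA for a constant series, which would make the threshold comparison NA and would need handling in real data.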

Mukul Biswas

May 1, 2014, 1:38:02 PM

to rha...@googlegroups.com, ant...@piccolboni.info

Thanks for the quick response. I guess I will have to take an alternative approach (EC2, maybe).

If anyone is interested in running rmr2 on AWS EMR (say, for purely academic reasons), I am leaving my current code (with some modifications, such as omitting the nested MR part) on GitHub.

We will close this item.

Antonio Piccolboni

Nov 19, 2014, 1:15:22 PM

to RHadoop Google Group

Take a look at this post; it may address your problem.
