Row-wise comparison in RHadoop rmr2 mapreduce


aparna

Jun 5, 2015, 1:12:19 PM
to rha...@googlegroups.com
Hi, I am new to RHadoop. I am trying to implement a use case where I have two data sets (say A and B, with A large and B small) and I have to compare them line by line using the R stringdist package.

The HDFS path to A is given as the mapper input, and B is loaded into R memory (cached) with from.dfs() so that it is available on the other nodes.
I expected A to be processed line by line (just as in a Hadoop mapper class), and B to be read line by line when iterated over in a loop.

I tried to compare each row of A with all rows of B (using a for loop over B[i,]) like below:
stringdist(row_of_A, B[i,])

The problem is that I am unable to process A and B row-wise.
1. Instead, all the rows of A are loaded as a single chunk.
2. B cannot be read line by line in a loop.

Please suggest a solution. Isn't RHadoop the best option for such ROW-WISE COMPARISON CASES?

Thanks
Aparna

Antonio Piccolboni

Jun 5, 2015, 1:28:42 PM
to rha...@googlegroups.com
On Fri, Jun 5, 2015 at 10:12 AM aparna <aparna...@gmail.com> wrote:
Hi, I am new to RHadoop. I am trying to implement a use case where I have two data sets (say A and B, with A large and B small) and I have to compare them line by line using the R stringdist package.

The HDFS path to A is given as the mapper input, and B is loaded into R memory (cached) with from.dfs() so that it is available on the other nodes.
I expected A to be processed line by line (just as in a Hadoop mapper class), and B to be read line by line when iterated over in a loop.

Then you read the documentation, found your reasonable expectation to be wrong, found the rationale behind violating the expectation acceptable, read the many examples that illustrate how this expectation is always wrong for rmr2, and finally solved your problem. Really, there is not a single example that reads one row at a time in the mapper. If you had read one of them, you'd know exactly how things work. If you had traced a mapper once, you'd know. Row-at-a-time is a hyper-inefficient programming style in R. You can't do big data that way. You can't even do medium data. It works only if every row is big in and of itself, like a matrix. Now you just have to write

  stringdist(A[i:j,], B[i,])

for some i<j, and you have to write it efficiently (loop over the rows of A and you go back to square one). It's called vectorization in R (a misnomer, but used consistently in R circles). You either learn it or no Hadoop will rescue your programs. If you just want to pretend R is Java with more statistical functions, you are not going to be successful.
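
To make the vectorized style concrete, here is a minimal sketch, not a definitive implementation. It assumes the stringdist package is installed and that both data sets carry the strings in a single character column called name; the column name, the sample data, and the helper compare_chunk are all hypothetical. The map function receives a whole chunk of rows of A and compares it against all of B in one call to stringdistmatrix(), with no explicit loop over rows.

```r
# Sketch only: column name 'name', the sample data and the helper
# 'compare_chunk' are made up for illustration.
library(stringdist)

# The small data set B, held in memory. In the thread it would come from
# from.dfs(); here it is built by hand so the example runs without Hadoop.
B <- data.frame(name = c("hadoop", "mapreduce", "vector"),
                stringsAsFactors = FALSE)

# In rmr2 the map function gets a chunk of many rows of A in 'v',
# not a single row, so the work is done on the whole chunk at once.
compare_chunk <- function(k, v) {
  # One vectorized call: one row per string in the chunk of A,
  # one column per string in B (Levenshtein distance).
  d <- stringdistmatrix(v$name, B$name, method = "lv")
  # For every row of the chunk, the column index of the closest string in B.
  best <- max.col(-d, ties.method = "first")
  data.frame(a         = v$name,
             closest_b = B$name[best],
             distance  = d[cbind(seq_along(best), best)],
             stringsAsFactors = FALSE)
}

# Local check on a hand-made chunk of A.
chunk  <- data.frame(name = c("hadop", "vectors", "map reduce"),
                     stringsAsFactors = FALSE)
result <- compare_chunk(NULL, chunk)
print(result)

# With rmr2 the same function could be plugged in roughly like this
# (the path is a placeholder, and the input format must be one that
# delivers data frames with a 'name' column):
#   mapreduce(input = "/path/to/A",
#             map   = function(k, v) keyval(NULL, compare_chunk(k, v)))
```

Testing the function on a small in-memory chunk first, as above, is usually easier than debugging it inside a running job.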




aparna

Jun 7, 2015, 10:11:53 AM
to rha...@googlegroups.com
Hi Antonio,

Thanks for your reply. I will try implementing this with the suggested changes.

Aparna.

