Sorting Data using RHadoop

Beta

Mar 22, 2015, 12:42:10 PM

to rha...@googlegroups.com

Hi,

I'm pretty new to Hadoop and RHadoop, so I was trying to sort data in a MapReduce job using RHadoop, but I can't get the data sorted. The code is given below. Can anybody please help me find out where I'm making the mistake?

small.ints <- runif(100, 10.0, 20.0)
data <- sample(1:100, 100, replace = FALSE)
data1 <- data.frame(data, small.ints)
hdfs.input <- to.dfs(data1)

# Mapper
mapper <- function(k, v) {
  key <- data
  value <- small.ints
  keyval(key, value)
}

# Reducer
reducer <- function(k, v) {
  key <- k
  value <- v
  keyval(key, arrange(v))
}

# mapreduce program
out <- mapreduce(
  input = hdfs.input,
  map = mapper,
  reduce = reducer)

Thanks a lot!

Antonio Piccolboni

Mar 23, 2015, 12:41:51 PM

to RHadoop Google Group

You are ignoring the arguments in the mapper; that seems like a big one. Do you know why functions have arguments? Map functions are just functions. The other major thing is that you are calling arrange on a vector instead of a data frame. There are also a number of hard-to-decode steps, like initializing variables and then ignoring them. I would suggest strengthening your understanding of R functions, as well as the dplyr library, neither of which is the subject of this group.

Once the basics are out of the way, you may want to review the concept of sorting in mapreduce. With big data, as the data is partitioned into multiple files, what "sorted" means is not entirely clear. The usual definition is to have each partition sorted internally, with the partitions covering disjoint ranges of the data. This is hard to do with rmr2, as we don't have access to custom partitioners, which are necessary to create partitions as described. Moreover, data sorted this way still doesn't allow certain important operations on sorted data, such as applying a moving window operator or computing differences. On the positive side, many important operations can be achieved, sometimes more efficiently, without sorting: approximate quantiles and the top and bottom k elements are two examples.
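To make the partitioning idea concrete, here is a minimal sketch in plain R. It does not use rmr2 or Hadoop at all; it only simulates partitions with split(). The break points, the number of chunks, and k are assumptions picked for illustration, and the variable names are hypothetical. The first half shows what "each partition sorted internally and covering disjoint ranges" looks like; the second half shows how the top k elements can be found from arbitrary chunks with no range partitioning.

```r
# Plain-R simulation only: no Hadoop, no rmr2.
set.seed(42)                      # reproducible toy data
x <- runif(100, 10, 20)           # same kind of values as small.ints

# Assumed range partitioner: 4 partitions covering disjoint, ordered ranges.
breaks <- c(10, 12.5, 15, 17.5, 20)
part <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Sort each partition on its own, as a reducer would.
partitions <- lapply(split(x, part), sort)

# Because the ranges are disjoint and ordered, reading the partitions
# one after another gives the globally sorted data.
globally.sorted <- unlist(partitions, use.names = FALSE)

# Top k without range partitioning: split the data arbitrarily,
# keep the k largest values of each chunk, then merge the candidates.
k <- 5
chunks <- split(x, rep(1:4, length.out = length(x)))
candidates <- unlist(
  lapply(chunks, function(v) head(sort(v, decreasing = TRUE), k)),
  use.names = FALSE)
top.k <- head(sort(candidates, decreasing = TRUE), k)
```

The top-k half works because every one of the k largest values overall is also among the k largest values of whichever chunk it landed in, so only a small candidate set is ever merged.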

Beta

Mar 23, 2015, 2:00:36 PM

to rha...@googlegroups.com, ant...@piccolboni.info

Thank you, Antonio! I'm still trying to understand the mapreduce function in RHadoop, so I just made up a random problem to solve with it. Your answer gave me a better understanding of it. Thanks again.
