How to find variance in RHadoop?


Ashok Kumar Harnal

Oct 24, 2013, 5:10:44 AM
to rha...@googlegroups.com

Is there a way to find variance of a dataset in RHadoop using map-reduce functions?

I will be grateful for help.

Antonio Piccolboni

Oct 24, 2013, 1:24:27 PM
to RHadoop Google Group
There sure is. What approaches have you tried?


Antonio



Ashok Kumar Harnal

Oct 25, 2013, 5:42:06 AM
to rha...@googlegroups.com
The approach I am taking is to run mapreduce in step 1 to find the mean. In step 2, I run mapreduce again; this time I subtract the mean from each value, square the difference, and sum all the squared differences. The following is my code:


xbar = mapreduce(
  input  = "/user/test/test2.txt",
  map    = function(k, v) { keyval(1, mean(as.numeric(v))) },
  reduce = function(k, v) { keyval(1, mean(as.numeric(v))) })

m <- from.dfs(xbar)

diffsquare = mapreduce(
  input  = "/user/test/test2.txt",
  map    = function(k, v) { keyval(1, sum(as.numeric(v) - as.numeric(m$val))^2) },
  reduce = function(k, v) { keyval(1, sum(as.numeric(v))) })

The mean (xbar) is getting evaluated (almost) correctly, but not the sum of squared differences from the mean (m$val).


I do not know what the problem is.

Antonio Piccolboni

Oct 25, 2013, 12:10:23 PM
to RHadoop Google Group
Since the chunks of data that are passed to the map function in the map phase are arbitrary subsets of the data, by taking their means you put yourself in a dead end. What do you do with a collection of means? I would say, just compute the same and the sum of squares and a count of data points. Then the variance is E[X^2] - E[X]^2. Let me know if this gets you going.
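To spell that identity out in plain R (just a sanity check, not MapReduce code; note it gives the population variance, whereas R's var() divides by n - 1):

x <- rnorm(1000)
mean(x^2) - mean(x)^2                    # E[X^2] - E[X]^2
mean((x - mean(x))^2)                    # the population variance; agrees with the line above
var(x) * (length(x) - 1) / length(x)     # var() rescaled from the n - 1 to the n denominator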

Antonio



Antonio Piccolboni

Oct 25, 2013, 7:09:25 PM
to rha...@googlegroups.com, ant...@piccolboni.info
I meant the sum and the sum of squares, sorry.


Antonio

Ashok Kumar Harnal

Oct 27, 2013, 1:21:48 AM
to rha...@googlegroups.com

Thanks for the guidance.

To find E(X^2), I used the following map/reduce functions:

map    = function(k, v) { keyval(1, (as.numeric(v))^2) }
reduce = function(k, v) { keyval(length(v), sum(v) / length(v)) }

To find (E(X))^2, I used the following map/reduce:

map1    = function(k, v) { keyval(1, as.numeric(v)) }
reduce1 = function(k, v) { keyval(length(v), (sum(v) / length(v))^2) }

In both instances, since the map function emits only one key (the value 1), the load will fall on a single reducer: only one reducer will be summing up all the numbers. Am I right? If so, I think this process is not efficient, as multiple reducers are not being used.

Thanks,

Antonio Piccolboni

Oct 27, 2013, 1:36:50 AM
to RHadoop Google Group
Your observation on the single reducer problem is spot on. You need to use the combiner. To be able to do that you need to delay computing the ratio. Just compute the numerator and denominator separately with the combiner on. 
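A minimal sketch of what this could look like for the E[X^2] piece, assuming rmr2's combine argument to mapreduce, the same input handling as the snippets above (as.numeric(v) gives the chunk's data points), and that values sharing a key are row-bound before reaching reduce:

sumsq = mapreduce(
  input = "/user/test/test2.txt",
  map = function(k, v) {
    x = as.numeric(v)
    # one row per chunk: numerator (sum of squares) and denominator (count)
    keyval(1, matrix(c(sum(x^2), length(x)), nrow = 1))
  },
  # with combine = TRUE this reduce also runs as the combiner,
  # so it must only add partial totals and keep the same one-row shape
  reduce = function(k, v) keyval(1, matrix(colSums(v), nrow = 1)),
  combine = TRUE)

totals = from.dfs(sumsq)$val
ex2 = totals[1] / totals[2]    # divide only once all partial sums are in

Because the combiner may run any number of times (including zero), its output has to be re-summable, which is exactly why the ratio has to wait until after from.dfs.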


Antonio



Ashok Kumar Harnal

Oct 28, 2013, 4:44:07 AM
to rha...@googlegroups.com
Thanks for the guidance. I did it with the combiner. The numerator and the denominator are now calculated separately, in two different mapreduce jobs.





Antonio Piccolboni

Oct 28, 2013, 11:23:45 AM
to RHadoop Google Group
Great. I think you can do it in a single job: have the value be cbind(x, x^2) and then do a matrix sum in the reducer.
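A sketch of that single-job version under the same assumptions as above; a third column of ones is added to the suggested cbind(x, x^2) so the row count travels along with the sums:

variance.job = mapreduce(
  input = "/user/test/test2.txt",
  map = function(k, v) {
    x = as.numeric(v)
    keyval(1, cbind(x, x^2, 1))    # one row per value: x, x^2, and a 1 for counting
  },
  # matrix sum in the reducer (also reused as the combiner)
  reduce = function(k, v) keyval(1, matrix(colSums(v), nrow = 1)),
  combine = TRUE)

s = from.dfs(variance.job)$val     # totals: sum(x), sum(x^2), n
n = s[3]
s[2] / n - (s[1] / n)^2            # E[X^2] - E[X]^2, the population variance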


Antonio



Ashok Kumar Harnal

Oct 29, 2013, 8:17:22 PM
to rha...@googlegroups.com
Thanks for all the help. I think I have completed my first lesson in RHadoop. I am now on to the second lesson, manipulating a CSV file. I am studying your replies in other forums (Stack Overflow). Thanks.

