MannWhitney U Test with Cascading

HIMANSHU VERMA

Feb 3, 2017, 3:42:36 PM

to cascading-user

Hi Everyone,

I want to perform a Mann-Whitney U test on the Cascading platform. How should I approach this problem? Thanks in advance.

Mann-Whitney test worked example:

http://users.sussex.ac.uk/~grahamh/RM1web/Mann-Whitney%20worked%20example.pdf

Thanks

Himanshu

Ken Krugler

Feb 3, 2017, 4:06:52 PM

to cascadi...@googlegroups.com

You need as input a Tuple with just brand and rating (since you don’t care about the explicit participant id, and you assume every participant rates only one of the two products).

You then need to do a GroupBy(Fields.NONE, “rating”) to generate a single group (thus Fields.NONE for the grouping key) sorted by the rating.

Now you need to write a custom Buffer that you use with the subsequent Every(), which calculates the rank for each Tuple, and emits <“brand”, “rank”>.

It’s not clear exactly how the rank is calculated when there are multiple ratings that are the same (e.g. if the lowest three ratings were all the same, would they still get 1.5, or are you calculating the average rank for the set?).

This assumes you don’t have so many tuples with the same rating that it creates a memory problem. You have to keep all tuples with the same rating in memory until you’ve got the complete set, so that you can calculate the average rank and then emit the tuples with the “rank” field added.
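
As a rough illustration of the average-rank option, here is a minimal plain-Java sketch (no Cascading classes involved). It assumes the ratings are already sorted ascending, which is what the GroupBy sort would provide, and that tied ratings share the mean of the positions they occupy. The class and method names are hypothetical. Each run of equal ratings is the set of tuples a custom Buffer would have to hold in memory before emitting.

```java
final class AverageRanks {
    /**
     * Assigns 1-based ranks to ratings that are already sorted ascending.
     * Tied ratings all receive the average of the positions they occupy
     * (an assumption about how ties should be handled).
     */
    static double[] averageRanks(double[] sortedRatings) {
        double[] ranks = new double[sortedRatings.length];
        int start = 0;
        while (start < sortedRatings.length) {
            // Extend 'end' to the last element with the same rating as 'start'.
            int end = start;
            while (end + 1 < sortedRatings.length
                    && sortedRatings[end + 1] == sortedRatings[start]) {
                end++;
            }
            // Mean of the 1-based positions start+1 .. end+1.
            double average = ((start + 1) + (end + 1)) / 2.0;
            for (int i = start; i <= end; i++) {
                ranks[i] = average;
            }
            start = end + 1;
        }
        return ranks;
    }
}
```

With this rule, sorted ratings 1, 2, 2, 3 would get ranks 1, 2.5, 2.5, 4. The brand for each rating can travel alongside by index.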

Now you can do a SumBy(“brand”, “rank”) to create a “summed_rank” field that has the sum of ranks for each brand. So now your tuple looks like <“brand”, “summed_rank”>

At this point you need to split off a stream from the original stream, GroupBy(“brand”), followed by an Every() with a Count() to get <“brand”, “participants”>.

Now do a CoGroup(“brand”) of these two pipes to give you <“brand”, “participants”, “summed_rank”>

[At this point your data set is probably small enough to write to a text file, which you can then process using regular Java code. You will have one tuple per brand, so it’s going to be a very small amount of data]

If you did want to continue in Cascading, some rough ideas for what comes next...

You can then do a GroupBy(Fields.NONE, “summed_rank”, true) to get one group with all of the rankings, ordered high-to-low. Follow this with a First (or do a FirstBy that combines the GroupBy and the First) to get the brand with the max rank. So now this Tuple stream has <“brand”, “summed_rank”> with a single tuple in it (the summed_rank is your Tx).

Now I’d do a CoGroup with this “top brand” pipe and the pipe that has <“brand”, “participants”, “summed_rank”>, with Fields.NONE to get a single group that has all your data.

You could follow that with a custom Buffer that calculates your n1 x n2 (x n3…?) + nx x (nx + 1)/2 - Tx, to get U.
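
For the two-brand case, that last step is small enough to sketch in plain Java, whether it runs inside a Buffer or on the text file mentioned earlier. This is only an illustration under assumptions: the class and method names are hypothetical, it handles exactly two brands, and it takes Tx to be the larger of the two rank sums with nx the participant count of that same brand.

```java
final class MannWhitneyU {
    /**
     * Computes U = n1 * n2 + nx * (nx + 1) / 2 - Tx for two groups.
     *
     * @param n1 participants in the first brand
     * @param t1 summed rank of the first brand
     * @param n2 participants in the second brand
     * @param t2 summed rank of the second brand
     */
    static double u(long n1, double t1, long n2, double t2) {
        // Tx is the larger rank sum; nx is the size of the group that produced it.
        boolean firstIsLarger = t1 >= t2;
        double tx = firstIsLarger ? t1 : t2;
        long nx = firstIsLarger ? n1 : n2;
        return n1 * n2 + nx * (nx + 1) / 2.0 - tx;
    }
}
```

For example, with three participants per brand and rank sums of 6 and 15, this gives 9 + 6 - 15 = 0.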

— Ken

--------------------------

Ken Krugler

+1 530-210-6378

http://www.scaleunlimited.com

custom big data solutions & training

Hadoop, Cascading, Cassandra & Solr
