Scalding: How to generate a List of values from the values in a field in a pipe


Jeremy Calbreath

Mar 25, 2015, 1:59:00 PM
to cascadi...@googlegroups.com

I need to take a pipe that has a column of labels with associated values, and pivot that pipe so that there is a column for each label with the correct values in each column. So for example, if I have this:

Id  Label Value 
1   Red   5
1   Blue  6
2   Red   7
2   Blue  8
3   Red   9
3   Blue  10

I need to turn it into this:

ID Red Blue
1  5   6
2  7   8
3  9   10

I know how to do this using the pivot command, but I have to explicitly know (and declare) the values of the labels. How can I dynamically read the labels from the "label" column into a list that I can then pass into the pivot command? I have tried to create a list with:

pipe.groupBy('id) { _.toList('label) }

, but I get a type mismatch saying it found a symbol but is expecting (cascading.tuple.Fields, cascading.tuple.Fields). Also, from reading online, it sounds like using toList is frowned upon. The number of things in 'label is finite and not that big (30-50 items maybe), but may be different depending on what sample of data I am working with.

Any suggestions you have would be great. Thanks very much!
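
For reference, the type mismatch above arises because the Fields API `toList` expects a pair of fields (source -> target) plus an element type, not a single symbol. A minimal sketch of a call that type-checks, assuming a hypothetical job whose `input` and `output` arguments and Tsv column layout are made up for illustration. It only addresses the compile error: it builds one list of labels per id, not the global list of labels the pivot needs.

```scala
import com.twitter.scalding._

// Hypothetical job: the "input"/"output" arguments and the Tsv layout are assumptions.
class LabelsPerId(args: Args) extends Job(args) {
  Tsv(args("input"), ('id, 'label, 'value))
    .read
    // toList takes a (source -> target) field pair and an element type,
    // so a bare symbol such as 'label does not type-check
    .groupBy('id) { _.toList[String]('label -> 'labels) }
    .write(Tsv(args("output")))
}
```

Because `toList` holds every value for a key in memory, it is usually reserved for groups known to be small, such as the 30-50 labels mentioned above.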


Oscar Boykin

Mar 25, 2015, 10:06:06 PM
to cascadi...@googlegroups.com
Look at pivot and unpivot in the Fields API, which do exactly this (see RichPipe and GroupBuilder).
--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/04c5e7b8-0cf9-479e-bbc3-e1b2c95fb2b4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Oscar Boykin :: @posco :: http://twitter.com/posco

Jeremy Calbreath

Mar 25, 2015, 11:27:10 PM
to cascadi...@googlegroups.com
Pivot, as I understand its use, requires me to explicitly define the values that need to be pivoted. In my example above, I have to know, and specify, the values in the "label" column, red and blue. I want to do the pivot operation without knowing ahead of time which values are there.

So I might run this on one dataset that has green and yellow, and the final output would have two columns named green and yellow. Another run might be on data with red and blue. I don't know what is there beforehand, and I want to get the column names from the actual data.

If I have misunderstood how to use pivot, please let me know.

Antonios Chalkiopoulos

Mar 26, 2015, 4:14:35 AM
to cascadi...@googlegroups.com

Jeremy Calbreath

Mar 26, 2015, 10:31:06 AM
to cascadi...@googlegroups.com
As I mentioned above, pivot requires an explicit argument for the values to be pivoted into columns.  The code you referenced shows this (highlighted below)

.groupBy('quarter) { group => group.pivot(('product,'sales) -> ('wine, 'beer,'coffee), defaultValue) }

What I need to do is be able to pivot without supplying this argument, and have the values read directly from the field, or somehow generate the values into a list based on the data.  I know I can pass a list into the argument like this:


val MyList = List('wine, 'beer, 'coffee)
pipe.groupBy('quarter) { group => group.pivot(('product,'sales) -> (MyList), defaultValue) }

but I need to be able to generate that list automatically, instead of explicitly defining it and having to hard-code it. I want to make my program a little more generic.
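
To make the request concrete, here is a minimal sketch of the intended logic on plain Scala collections rather than a Scalding pipe. The `Row` type, the `DynamicPivot` object, and the default of 0 for missing labels are assumptions for illustration; it only shows the shape of the computation: one pass to collect the labels, a second to pivot with them.

```scala
// Plain Scala collections, not a Scalding pipe: illustrates the logic only.
object DynamicPivot {
  // Hypothetical row type mirroring the Id / Label / Value columns
  final case class Row(id: Int, label: String, value: Int)

  // Pass 1: collect the distinct labels in first-seen order
  def labels(rows: Seq[Row]): List[String] =
    rows.map(_.label).distinct.toList

  // Pass 2: one output row per id, one column per label, default for gaps
  def pivot(rows: Seq[Row], cols: List[String], default: Int = 0): List[(Int, List[Int])] =
    rows.groupBy(_.id).toList.sortBy(_._1).map { case (id, rs) =>
      val byLabel = rs.map(r => r.label -> r.value).toMap
      (id, cols.map(l => byLabel.getOrElse(l, default)))
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Row(1, "Red", 5), Row(1, "Blue", 6),
      Row(2, "Red", 7), Row(2, "Blue", 8),
      Row(3, "Red", 9), Row(3, "Blue", 10))
    val cols = labels(rows)
    println(("ID" :: cols).mkString("\t"))
    pivot(rows, cols).foreach { case (id, vs) => println((id :: vs).mkString("\t")) }
  }
}
```

Run on the sample data from the first message, this prints the ID / Red / Blue table requested there, tab-separated.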


On Thursday, March 26, 2015 at 4:14:35 AM UTC-4, Antonios Chalkiopoulos wrote:
https://github.com/scalding-io/ProgrammingWithScalding/blob/master/chapter3/src/main/scala/pivotUnpivot.scala

Oscar Boykin

Mar 26, 2015, 3:39:40 PM
to cascadi...@googlegroups.com
So if you are going to generate the list from the data, you have to make a full pass over the data first, since we can't make multiple passes without spilling.

You write one job to collect all the labels, then another job that reads those labels and goes from there. We have a way to do that more easily with the Typed-API using a type called Execution, which allows you to clearly write such iterative or multi-step jobs when the output of a previous step changes the plan of the next step:
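
A minimal sketch of that two-step shape with the Typed API, not taken from this thread: the `DynamicPivotJob` name, the `input`/`output` arguments, the `(Int, String, Int)` column layout, and the default of 0 for missing labels are all assumptions. The first step collects the distinct labels; the second step is planned only once those labels are known.

```scala
import com.twitter.scalding._

// Hypothetical job: the "input"/"output" arguments and column layout are assumptions.
object DynamicPivotJob extends ExecutionApp {
  def job: Execution[Unit] =
    Execution.getArgs.flatMap { args =>
      val rows = TypedPipe.from(TypedTsv[(Int, String, Int)](args("input")))

      // Step 1: a full pass over the data to collect the distinct labels
      rows.map(_._2).distinct.toIterableExecution.flatMap { found =>
        val labels = found.toList.sorted

        // Step 2: the plan now depends on the labels found in step 1
        rows
          .map { case (id, label, value) => (id, (label, value)) }
          .group
          .toList
          .toTypedPipe
          .map { case (id, pairs) =>
            val byLabel = pairs.toMap
            // Missing labels are written as 0 here, an arbitrary default
            (id :: labels.map(l => byLabel.getOrElse(l, 0))).mkString("\t")
          }
          .writeExecution(TypedTsv[String](args("output")))
      }
    }
}
```

Each output line holds the id followed by one value per label in sorted label order. Writing the header row is left out of the sketch.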



Jeremy Calbreath

Mar 31, 2015, 8:56:49 AM
to cascadi...@googlegroups.com
Thanks.  This looks like it should do what I want.