Advice on how to pivot


jd

May 18, 2012, 6:53:00 PM
to cascadi...@googlegroups.com

Still looking for advice on the best way to do a pivot in Cascading. I see the Scala version, but I'm afraid I'm not very fluent in Scala.
I wonder if someone could give me some pointers on how to accomplish this.
Input of: Date,Type,count.
Output of Date,Count_of_Type1, Count_of_Type2, Count_of_Type3 ... etc


Ken Krugler

May 20, 2012, 12:50:11 PM
to cascadi...@googlegroups.com
Hi jd,

> Output of Date,Count_of_Type1, Count_of_Type2, Count_of_Type3 … etc

Assuming count is an integer…

1. First use SumBy to get sums for each date/type combination:

Pipe p = new SumBy(incomingPipe, new Fields("Date", "Type"), new Fields("count"), new Fields("sum"), Integer.class);

2. Then use a custom Buffer (e.g. "Pivot") to create the output that you want.

Sorting by Type within each Date group simplifies the logic in Pivot:

p = new GroupBy(p, new Fields("Date"), new Fields("Type"));
p = new Every(p, new Pivot(validTypes...), Fields.RESULTS);

where Pivot() would need to know the range of the "Type" field in order to output the summed counts in appropriate columns.
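
As a rough illustration of the two steps above, here is a plain-Java sketch of the same logic run in memory. It does not use any Cascading classes: `PivotSketch`, `Row`, `sumByDateAndType` and `pivotRow` are hypothetical names, and emitting 0 for a type with no records on a given date is an assumption rather than something specified in this thread. Inside a real Buffer the per-date map would be built from the grouped arguments iterator instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the sum-then-pivot logic; no Cascading classes involved. */
final class PivotSketch {

    /** One input record: Date, Type, count. */
    static final class Row {
        final String date;
        final String type;
        final int count;

        Row(String date, String type, int count) {
            this.date = date;
            this.type = type;
            this.count = count;
        }
    }

    /** Step 1 (the role SumBy plays): sum count for each date/type combination. */
    static Map<String, Map<String, Integer>> sumByDateAndType(List<Row> rows) {
        Map<String, Map<String, Integer>> sums = new TreeMap<>();
        for (Row r : rows) {
            sums.computeIfAbsent(r.date, d -> new TreeMap<>())
                .merge(r.type, r.count, Integer::sum);
        }
        return sums;
    }

    /**
     * Step 2 (the role of the custom Pivot buffer): for one date, place each
     * summed count in the column of its type. Column order follows validTypes.
     * Assumption: a type with no records for this date is emitted as 0.
     */
    static List<Integer> pivotRow(Map<String, Integer> sumsForDate, List<String> validTypes) {
        List<Integer> out = new ArrayList<>();
        for (String type : validTypes) {
            out.add(sumsForDate.getOrDefault(type, 0));
        }
        return out;
    }
}
```

The list of valid types fixes both the number and the order of output columns, which is why Pivot needs it up front.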

If you wanted to avoid having to tell Pivot what to use as "column names" then you'd need to do another operation upstream that calculates the unique set of values, which works (at the cost of another job, and some fun joining that in).

I've got the start of a general Pivot() subassembly (in Java), but it's not done yet.

-- Ken

--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr









jd

May 20, 2012, 12:58:22 PM
to cascadi...@googlegroups.com
Thanks Ken.
I follow you. I've been a bit ambivalent on how to attach the generic upstream data (the set of all types) to the buffer that does the final compute.

My thinking was to compute a single tuple holding the set of types, then HashJoin it to the original data set so it is available when I run the buffer. But this seemed a bit heavy-handed.

jd

May 23, 2012, 2:19:57 PM
to cascadi...@googlegroups.com
Ken,
Have you figured out how to recover the new field names? Ideally I don't want to end up with Fields.UNKNOWN after doing a pivot.




Ken Krugler

May 25, 2012, 10:38:46 AM
to cascadi...@googlegroups.com
Hi jd,

On May 23, 2012, at 11:19am, jd wrote:

> Have you figured out how to recover the new field names? Ideally I don't want to end up with Fields.UNKNOWN after doing a pivot.

The approach I'd try here is to artificially break up the workflow into two Flows.

1. Calculate the set of unique field values via Unique(), and write that out to HDFS.

2. When defining the second Flow, read the set of unique values and build a Fields() using that, which you pass to the custom Pivot class's constructor so it can tell Cascading what Fields it will be emitting.
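
A minimal sketch of the second step's preparation, assuming the first Flow wrote one unique Type value per line and those lines have already been read into a list. `PivotColumns` and `columnNames` are hypothetical names, and the `Count_of_` prefix simply follows the output naming in the original question. The sketch is plain Java and does not touch HDFS or Cascading.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Builds output column names from the unique Type values written by the first Flow. */
final class PivotColumns {

    /**
     * Assumption: each element of uniqueTypeLines is one line of the first
     * Flow's output and holds a single Type value. Blank lines are skipped and
     * duplicates collapsed; sorting gives a stable column order across runs.
     */
    static List<String> columnNames(List<String> uniqueTypeLines) {
        TreeSet<String> types = new TreeSet<>();
        for (String line : uniqueTypeLines) {
            String type = line.trim();
            if (!type.isEmpty()) {
                types.add(type);
            }
        }
        List<String> names = new ArrayList<>();
        names.add("Date");
        for (String type : types) {
            names.add("Count_of_" + type);
        }
        return names;
    }
}
```

The resulting names could then be used to build the Fields instance that is handed to the Pivot constructor, so the declared output fields are known when the second Flow is planned.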

But probably Chris has a better idea here :)

-- Ken



Chris K Wensel

May 25, 2012, 10:54:53 AM
to cascadi...@googlegroups.com
It was an explicit design goal to not provide any meta/reflective properties like promoting Tuple values up into Field names.

These capabilities should be layered on top, if at all, as they are error-prone (no fail-fast type safety).

Finally, neither Hadoop nor Cascading was intended as a reporting tool. A pivot table is quite useful for presenting data, but data at scale could easily have 100k or 1M unique values to pivot onto, which isn't human-readable. There are clearly exceptions.

My recommendation is to load the data into a reporting tool or database and use that for presenting your data. Mondrian comes to mind.

ckw