Can't GroupBy with TupleEntry containing TupleEntry


JPatrick Davenport

Mar 3, 2015, 1:04:31 PM3/3/15
to cascadi...@googlegroups.com
Hello,
I have a TupleEntry with the following keys: Name, Age, Address. Name and Age are just strings. Address is a TupleEntry with the sole key of City. Each operation works fine on its own: I can use Identity, and I can pass the entries to a Function.

When I pass the root TupleEntry to GroupBy, I get the stacktrace below.

Here's the code.
Pipe in = new Pipe("FromArango");
Pipe grouping = new GroupBy(in, new Fields("name"));
Pipe buf = new Every(grouping, new BuffOp());
Pipe cause = new Each(buf, new Identity());
Pipe out = new Each(cause, new ComplexRead.Flattener());

If I get rid of the GroupBy and Every, it works fine.

Is it possible to have TupleEntries within TupleEntries? If so, what's the config dial I need?

cascading.flow.FlowException: local step failed
    at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:230)
    at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:150)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: cascading.CascadingException: unable to load serializer for: cascading.tuple.TupleEntry from: org.apache.hadoop.io.serializer.SerializationFactory
    at cascading.tuple.hadoop.TupleSerialization.getNewSerializer(TupleSerialization.java:453)
    at cascading.tuple.hadoop.TupleSerialization$SerializationElementWriter.write(TupleSerialization.java:743)
    at cascading.tuple.io.TupleOutputStream.writeElement(TupleOutputStream.java:114)
    at cascading.tuple.io.TupleOutputStream.write(TupleOutputStream.java:89)
    at cascading.tuple.io.TupleOutputStream.writeTuple(TupleOutputStream.java:64)
    at cascading.tuple.hadoop.io.TupleSerializer.serialize(TupleSerializer.java:37)
    at cascading.tuple.hadoop.io.TupleSerializer.serialize(TupleSerializer.java:28)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1074)
    at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:591)
    at cascading.tap.hadoop.util.MeasuredOutputCollector.collect(MeasuredOutputCollector.java:69)
    at cascading.flow.hadoop.stream.HadoopGroupByGate.receive(HadoopGroupByGate.java:68)
    at cascading.flow.hadoop.stream.HadoopGroupByGate.receive(HadoopGroupByGate.java:37)
    at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
    at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
    at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:130)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    ... 4 more

Thanks,
JPD

Chris K Wensel

Mar 3, 2015, 1:45:59 PM3/3/15
to cascadi...@googlegroups.com
Putting a TupleEntry into a Tuple isn't explicitly unsupported, but it is definitely dubious, since you will be serializing Classes/CoercibleTypes, Comparators, and Strings (field names) for every Tuple (which is something Cascading takes great pains never to do).

So we haven't spent any time testing it or making it work.

fwiw, it's worth pointing out Tuples aren't a type system. They are a container, a bin. You put stuff at the top and at the bottom, or in the top right corner, or the bottom left corner. Then you pluck stuff out, assemble it into something new, then drop it back in, as the bin moves down the conveyor.

If you need a hierarchical model, it's time to create a custom type and pop an assembled instance into the outgoing Tuple.
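
To make that concrete, here is a minimal sketch of the custom-type route, assuming the Hadoop platform and Cascading 2.x. The class and field names are made up for illustration, and marking Address as Serializable only works if Hadoop's JavaSerialization is registered in io.serializations; a dedicated Hadoop Serialization or a Writable is the usual choice when performance matters.

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

// Hypothetical value type you own; the nested data rides in this one object
// instead of a nested TupleEntry.
class Address implements java.io.Serializable {
    final String city;
    Address(String city) { this.city = city; }
}

// A Function that assembles the custom type and pops it into the outgoing Tuple.
class AssembleAddress extends BaseOperation<Void> implements Function<Void> {
    AssembleAddress() {
        super(1, new Fields("address"));
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall<Void> call) {
        String city = call.getArguments().getString("city");
        call.getOutputCollector().add(new Tuple(new Address(city)));
    }
}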

That said, it's probably time to enrich the Tuple/Fields model to support hierarchical data as a first-class concept, so simple things don't require a custom type and don't incur the overhead of serializing maps all over the place. And hopefully make it pluggable.

I explored an aspect of this 15 years ago.

ckw


JPatrick Davenport

Mar 3, 2015, 2:11:48 PM3/3/15
to cascadi...@googlegroups.com
Thanks. So I guess my guidance on the project will be to use complex TupleEntries only for map-side work for now.

I'll have to look at Avro and other formats to see how they handle objects.
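
For reference, a rough sketch of that map-side-only arrangement on the Hadoop platform (the flattener body below is hypothetical; ComplexRead.Flattener isn't shown in this thread): pull the nested entry apart in an Each before the GroupBy, so only primitive values ever cross the map/reduce serialization boundary.

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.pipe.Each;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

// Hypothetical map-side flattener: swaps the nested "address" entry for its "city" value.
class FlattenAddress extends BaseOperation<Void> implements Function<Void> {
    FlattenAddress() {
        super(1, new Fields("city"));
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall<Void> call) {
        TupleEntry address = (TupleEntry) call.getArguments().getObject("address");
        call.getOutputCollector().add(new Tuple(address.getString("city")));
    }
}

// Pipe assembly: flatten first, then group on primitive fields only.
Pipe in = new Pipe("FromArango");
Pipe flat = new Each(in, new Fields("address"), new FlattenAddress(), Fields.SWAP);
Pipe grouped = new GroupBy(flat, new Fields("name"));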
