Tuple stream semantics for pipe assemblies


Matthew Willson

Feb 25, 2015, 11:30:23 AM
to cascadi...@googlegroups.com
Hi again

Sorry, another design rationale question :)

It seems to me that cascading's pipe constructors come close to being a clearly-defined set of combinators for transforming tuple streams, a very nice thing.

But there are a few awkward exceptions -- kinds of pipe whose interpretation as a tuple stream is a little bit murky or context-sensitive, which act only as hints to the planner or nodes in a syntax tree, with the ultimate semantics dependent on what follows before or after them.

Are there any plans to move towards a uniform tuple stream semantics for all pipes in future? For example this might mean specifying the behaviour of every pipe constructor purely in terms of:

- How are the output fields of the pipe determined given the fields of the input pipes?
- How is the output tuple stream determined given the tuple streams and fields of the input pipes?

(In the case of a head Pipe the fields and tuple stream would be determined by the Scheme of the Tap they're bound to -- in fact thinking in these terms, Schemes to me seem a lot like special cases of Pipes, either without input or without output)
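To make the idea concrete, here is a minimal plain-Java sketch of a pipe defined purely by the two rules above. This is not Cascading's actual API; `StreamPipe`, `retain`, and the in-memory representation are all hypothetical names invented for illustration:

```java
import java.util.*;

// Hypothetical sketch (not the Cascading API): a pipe defined purely by
// (1) how its output fields derive from its input fields, and
// (2) how its output tuple stream derives from its input tuple stream.
interface StreamPipe {
    List<String> outputFields(List<String> inputFields);
    List<List<Object>> apply(List<String> inputFields, List<List<Object>> tuples);
}

public class UniformPipeSketch {
    // A "retain"-style pipe: keeps only the named fields, in order.
    static StreamPipe retain(List<String> keep) {
        return new StreamPipe() {
            public List<String> outputFields(List<String> in) { return keep; }
            public List<List<Object>> apply(List<String> in, List<List<Object>> tuples) {
                List<List<Object>> out = new ArrayList<>();
                for (List<Object> t : tuples) {
                    List<Object> row = new ArrayList<>();
                    for (String f : keep) row.add(t.get(in.indexOf(f)));
                    out.add(row);
                }
                return out;
            }
        };
    }

    public static void main(String[] args) {
        List<String> fields = List.of("user", "url", "ts");
        List<List<Object>> tuples = List.of(
            List.<Object>of("alice", "/a", 1), List.<Object>of("bob", "/b", 2));
        StreamPipe p = retain(List.of("user", "ts"));
        System.out.println(p.outputFields(fields)); // [user, ts]
        System.out.println(p.apply(fields, tuples)); // [[alice, 1], [bob, 2]]
    }
}
```

Under this reading, composing two pipes just means feeding one pipe's output fields and output stream into the next, with no hidden context.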

If possible, I think having a semantics like this would be quite helpful for newbies, as it makes it explicit how to interpret what's going on. It might also make automated reasoning about pipe assemblies easier, make them a closer fit for execution on platforms based on a DAG of tuple stream operations, make it easier to implement a simple and readable in-memory reference implementation, etc.

I'm sure I'm missing some context, though, perhaps relating to how much of a pain it is to implement a planner that maps this kind of general tuple stream graph to a MapReduce job flow?

-Matt

Chris K Wensel

Feb 25, 2015, 11:59:45 AM
to cascadi...@googlegroups.com
what exceptions? happy to add clarity or resolve this before 3.0 is out.


Matthew Willson

Feb 25, 2015, 1:04:13 PM
to cascadi...@googlegroups.com
(Something I missed, which is important but I don't think changes the basic point: I think the semantics would need to be defined in terms of operations on sets of tuple streams, rather than single tuple streams, to account for the way mappers and reducers operate in parallel on streamed data from partitions of the overall data.)


CoGroup with a BufferJoin is probably the most glaring example -- the BufferJoin only serves as a marker that an Every with a Buffer is going to follow, and it outputs Fields.NONE "as a hint to the planner". (Incidentally, this sometimes seems to break the planner's ability to infer the output fields of the following Every too, in 2.6 at least.)

GroupBy and CoGroup do, as I understand it, sort of have a general tuple stream semantics (at least when followed by an Each: the output is the input partitioned and sorted by the group fields, then by the sort fields within each group). But the way they're explained makes them sound very dependent on the combination with what follows (an Each, an Every, ...), rather than describing each as an operation with a clear interpretation in its own right.
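For what it's worth, here's a small in-memory plain-Java sketch (hypothetical, not Cascading code) of that standalone reading of GroupBy: the output stream is just the input reordered so that equal group keys are adjacent, with a secondary sort within each group:

```java
import java.util.*;

// Sketch: GroupBy read as a standalone tuple-stream operation. The output is
// the input sorted by the group field (keyIdx), then by the secondary sort
// field (sortIdx) within each group, so equal keys end up adjacent.
public class GroupBySketch {
    static List<List<Object>> groupBy(List<List<Object>> tuples, int keyIdx, int sortIdx) {
        List<List<Object>> out = new ArrayList<>(tuples);
        out.sort(Comparator
            .comparing((List<Object> t) -> t.get(keyIdx).toString())
            .thenComparing(t -> ((Number) t.get(sortIdx)).longValue()));
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> in = List.of(
            List.<Object>of("b", 2), List.<Object>of("a", 9), List.<Object>of("a", 1));
        System.out.println(groupBy(in, 0, 1)); // [[a, 1], [a, 9], [b, 2]]
    }
}
```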

The Every pipe that follows after a GroupBy/CoGroup requires special knowledge of what precedes it in order to know what the grouping field was, so it's not really a general tuple stream transformation in itself. It's not just that it can only follow after a CoGroup/GroupBy, it's that its behaviour depends on the internal parameters of the preceding GroupBy/CoGroup (the grouping fields), not just the fields and tuple stream that come out of the thing that precedes it. If an Every could work with any (grouped) stream of tuples provided you tell it how they are grouped, then it'd be a general tuple stream transformation, and that might actually be useful sometimes when doing map-side aggregations on pre-grouped data.
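Here's a sketch of what I mean by an Every that works on any grouped stream, again in plain, in-memory Java with invented names (`sumPerGroup` is not a Cascading operation): the grouping is passed in explicitly, so the aggregation needs nothing beyond the fields and stream it receives:

```java
import java.util.*;

// Sketch: an "Every"-like aggregation over any tuple stream already sorted by
// its group field. The grouping (keyIdx) is an explicit parameter, rather than
// inferred from a preceding GroupBy/CoGroup.
public class GroupedAggregateSketch {
    // Sum the value at valueIdx within each run of tuples sharing the key at keyIdx.
    static List<List<Object>> sumPerGroup(List<List<Object>> sorted, int keyIdx, int valueIdx) {
        List<List<Object>> out = new ArrayList<>();
        Object key = null;
        long sum = 0;
        boolean open = false;
        for (List<Object> t : sorted) {
            Object k = t.get(keyIdx);
            if (open && !k.equals(key)) { // key changed: close the previous group
                out.add(List.of(key, sum));
                sum = 0;
            }
            key = k;
            open = true;
            sum += ((Number) t.get(valueIdx)).longValue();
        }
        if (open) out.add(List.of(key, sum)); // close the final group
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> sorted = List.of(
            List.<Object>of("a", 1), List.<Object>of("a", 2), List.<Object>of("b", 5));
        System.out.println(sumPerGroup(sorted, 0, 1)); // [[a, 3], [b, 5]]
    }
}
```

Because the function only assumes "sorted by key", it would apply equally to map-side aggregation over pre-grouped data.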

-Matt

Matthew Willson

Mar 27, 2015, 11:50:00 AM
to cascadi...@googlegroups.com
Another example I spotted that breaks any kind of "combinators for transforming tuple streams" semantics: the way multiple Every(Aggregator) pipes can follow after a GroupBy.

The pipes are chained in serial, but it's not a composition of a series of transformations. The semantics are that they happen as one parallel transformation over a grouped tuple stream.
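To illustrate (plain Java, invented names, not the Cascading Aggregator API): a chain of aggregators after a GroupBy behaves like one pass per group, with every aggregator seeing the same grouped tuples, not the output of the previous aggregator:

```java
import java.util.*;

// Sketch: several Every(Aggregator)-style pipes chained after a GroupBy act
// as ONE parallel pass over each group, not as serial composition.
public class ParallelAggregatorsSketch {
    interface Agg { void start(); void add(List<Object> t); Object complete(); }

    static Agg count() {
        return new Agg() {
            long n;
            public void start() { n = 0; }
            public void add(List<Object> t) { n++; }
            public Object complete() { return n; }
        };
    }

    static Agg sum(int idx) {
        return new Agg() {
            long s;
            public void start() { s = 0; }
            public void add(List<Object> t) { s += ((Number) t.get(idx)).longValue(); }
            public Object complete() { return s; }
        };
    }

    // One traversal of the group drives every aggregator side by side.
    static List<Object> aggregate(List<List<Object>> group, List<Agg> aggs) {
        for (Agg a : aggs) a.start();
        for (List<Object> t : group)
            for (Agg a : aggs) a.add(t);
        List<Object> out = new ArrayList<>();
        for (Agg a : aggs) out.add(a.complete());
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> group = List.of(List.<Object>of("a", 2), List.<Object>of("a", 3));
        System.out.println(aggregate(group, List.of(count(), sum(1)))); // [2, 5]
    }
}
```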

I don't mean to harp on these kinds of examples; regardless, I like cascading as an abstraction over mapreduce. I guess the reason I was pushing a bit on this topic is that I'm wondering whether it'll be worth continuing to use cascading as an extra abstraction layer over one of the newer, nicer platforms like Spark, Tez, etc. These newer platforms seem to have quite nice, clean composable semantics, so (speaking personally) for any layer used on top to be justified, I'd like it also to be based on abstractions with clean composable semantics.

-Matt

Chris K Wensel

Mar 27, 2015, 1:34:47 PM
to cascadi...@googlegroups.com
The Every model is a composable model, which allows for reuse of Aggregate functions independently without coupling to other functions.

If you wish for a functional nested model, where downstream relies on upstream, this can be built into the Buffer interface very simply via the exposed iterators.
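Something along those lines can be sketched in plain Java (this is not the actual Cascading Buffer API, just an illustration of composing per-group logic by wrapping iterators, so a downstream step consumes an upstream step's output):

```java
import java.util.*;
import java.util.function.*;

// Sketch: nesting per-group transformations by wrapping the group's tuple
// iterator, so each stage lazily consumes the previous stage's output.
public class IteratorCompositionSketch {
    static <A, B> Iterator<B> map(Iterator<A> in, Function<A, B> f) {
        return new Iterator<B>() {
            public boolean hasNext() { return in.hasNext(); }
            public B next() { return f.apply(in.next()); }
        };
    }

    public static void main(String[] args) {
        Iterator<Integer> group = List.of(1, 2, 3).iterator();
        // upstream stage: double each value; downstream stage: format it
        Iterator<String> out = map(map(group, x -> x * 2), x -> "v=" + x);
        List<String> result = new ArrayList<>();
        out.forEachRemaining(result::add);
        System.out.println(result); // [v=2, v=4, v=6]
    }
}
```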

as for BufferJoin, yes, it's a hack. it was added before the 3.0 planner as an optimization for certain cases.

3.0 itself now improves our ability to add new primitive operations etc to the data flows so we can be more high brow about our model. this would be a great place to try new ideas and improvements in code.

fwiw, Cascading was designed as the lowest common denominator to allow for new frameworks and languages to be built on top. this has worked, since we are the basis for quite a few languages, frameworks, and commercial applications.

ckw


