Using the Fields to add Serializers


Oscar Boykin

Jan 2, 2015, 1:54:44 PM
to cascadi...@googlegroups.com
Happy New Year, Chris (and all cascading folks)!

We are looking more into making binary comparators easier. The current API is pretty good: if we pass you a Comparator that implements StreamComparator, we are good to go.

But if we look at how serializers are set up, as far as I know, the only way to register them is via Hadoop's reflection-based mechanism (which dispatches on .getClass of the data). Secondly, we have two parts of the code that need to know how data is serialized, located far from each other (one in the hadoop config, the other in the Fields Comparators).

It occurs to me that it would be nice if I could pass a Comparator that also implements cascading.tuple.Serializer (for instance). Then in one place I have the Comparator, StreamComparator and Serializer for my data type.

Next, this gives you a nice side benefit in terms of performance: since I am statically telling you how to serialize a particular field, you don't need to write the type token into each record, which can give significant savings, especially for wide records of small data types.
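
To make this concrete, here is a rough Java sketch of the kind of object I have in mind. The FieldSerializer interface is purely hypothetical (a stand-in for whatever hook cascading might expose); Comparator and StreamComparator are the existing interfaces, and LongFieldOrdering is just an illustrative ordering/serializer for a long-valued field:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Serializable;
import java.util.Comparator;

import cascading.tuple.StreamComparator;

/** Hypothetical hook: statically declares how one field is written and read. */
interface FieldSerializer<T> {
  void write(DataOutputStream out, T value) throws IOException;
  T read(DataInputStream in) throws IOException;
}

/** One object carries the Comparator, StreamComparator and serializer for a field.
    It should be Serializable, since comparators set on Fields are shipped with the job. */
class LongFieldOrdering implements Comparator<Long>, StreamComparator<InputStream>,
    FieldSerializer<Long>, Serializable {

  @Override
  public int compare(Long lhs, Long rhs) {
    return lhs.compareTo(rhs);
  }

  @Override
  public int compare(InputStream lhs, InputStream rhs) {
    // compare directly on the serialized bytes, no deserialization of the field
    try {
      return Long.compare(new DataInputStream(lhs).readLong(),
                          new DataInputStream(rhs).readLong());
    } catch (IOException exception) {
      throw new RuntimeException(exception);
    }
  }

  @Override
  public void write(DataOutputStream out, Long value) throws IOException {
    out.writeLong(value); // fixed width, no class token per record
  }

  @Override
  public Long read(DataInputStream in) throws IOException {
    return in.readLong();
  }
}

A user would opt in the same way custom comparators are attached today, e.g. fields.setComparator("user_id", new LongFieldOrdering()).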

There are significant issues with the .getClass-based dispatch that I won't go into in this email, which the above solution also sidesteps.

The nice part is this proposal would be totally API compatible since users would opt into it by passing a certain type of Comparator.

We could sketch an implementation of this, but I know you prefer to write the code for cascading so I thought we should discuss before code is written.

Best,
--
Oscar Boykin :: @posco :: http://twitter.com/posco

Chris K Wensel

Jan 3, 2015, 5:06:52 PM
to cascadi...@googlegroups.com
this all sounds reasonable. 

i’m fairly dissatisfied with Hadoop’s serialization framework, but in an effort to not reinvent everything, we inherited it to minimize the porting effort for any previous work with raw MapReduce. it’s some of the messiest bits of Cascading.

note local mode doesn’t have a serialization framework, so this might be the best all-around solution that would work across platforms (which would be the first step to any graceful degradation in local mode, making it much more robust for production use).

fwiw, if your Serializer instances implement Comparison, they will also source the required Comparator. curious how this will play out with your proposal.
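
for reference, a minimal sketch of that pattern, assuming the stock Hadoop Serialization interface and cascading.tuple.Comparison. Person and its wire format are made up for illustration:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Comparator;

import org.apache.hadoop.io.serializer.Deserializer;
import org.apache.hadoop.io.serializer.Serialization;
import org.apache.hadoop.io.serializer.Serializer;

import cascading.tuple.Comparison;

class Person {
  String name;
  int age;
  Person(String name, int age) { this.name = name; this.age = age; }
}

public class PersonSerialization implements Serialization<Person>, Comparison<Person> {

  @Override
  public boolean accept(Class<?> c) {
    return Person.class.isAssignableFrom(c);
  }

  @Override
  public Serializer<Person> getSerializer(Class<Person> c) {
    return new Serializer<Person>() {
      private DataOutputStream out;
      public void open(OutputStream stream) { out = new DataOutputStream(stream); }
      public void serialize(Person p) throws IOException {
        out.writeUTF(p.name);
        out.writeInt(p.age);
      }
      public void close() throws IOException { out.close(); }
    };
  }

  @Override
  public Deserializer<Person> getDeserializer(Class<Person> c) {
    return new Deserializer<Person>() {
      private DataInputStream in;
      public void open(InputStream stream) { in = new DataInputStream(stream); }
      public Person deserialize(Person reuse) throws IOException {
        return new Person(in.readUTF(), in.readInt());
      }
      public void close() throws IOException { in.close(); }
    };
  }

  @Override
  public Comparator<Person> getComparator(Class<Person> type) {
    // when no explicit Comparator is set on the field, Cascading can source this one
    return Comparator.comparing((Person p) -> p.name);
  }
}

it still gets registered through the normal io.serializations Hadoop property; the Comparison hook is the only addition.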

also, we need typed versions of TupleInput/OutputStream (TypedTupleInput/OutputStream) that leverage the declared types so type information is never written. I think this should be addressed first or in tandem. it would be a huge boost. (getting off MapReduce has taken priority, thus the delays on this work and questions around how much would port to the tez model)

this sums up our position on contributed code. in short we want more, but we need to be involved.


ckw
 

Oscar Boykin

Jun 11, 2015, 2:41:42 PM
to cascadi...@googlegroups.com
Chris,

Congrats on getting cascading 3 out! We look forward to making that the default for scalding (after we get a branch building and validate that things work with our cluster for our set of test jobs).

I'm still really interested in my proposal here as it seems highly backward compatible: if we pass a Comparator to the Fields object that also extends a new cascading serializer class that you could add, that would allow us to skip the tokens and the lookup that happen in the current path.

This turns out to be a big deal for us because scalding users so often use their own classes in the jobs (case classes in scala being excellent data records that are usually one-liners to declare). Secondly, with scala macros, we can actually generate the serializers at compile time, but right now we have an awful hack to get them registered (we wrap each item in a marker class of which we have 100 and then dispatch on that class id to get the macro generated serializer).

It would be a big performance, correctness, and code cleanliness win for us to have this feature.

Of course it does require some thought: you need a serializer everywhere you could cross between tasks. In cascading 2, that is only at HashJoin, Checkpoint, GroupBy, and CoGroup as far as I know, but in cascading 3, with Tez, you are also crossing the boundary at Merge if I'm not mistaken, so some careful thought should go into defining a contract for where it is and is not possible to cross boundaries. If we don't want this, we need to be prepared with a serializer in every fields object everywhere in case you might serialize at that point. This decouples things more, but it does complicate the type signatures for us and add complexity for the user.

How can we move forward on this?

Congrats again!



Chris K Wensel

Jun 11, 2015, 5:47:57 PM
to cascadi...@googlegroups.com
I think in order we need
 
- TypedTupleInput/OutputStream — when there are types declared, we don’t write the token — this is way overdue, hoping someone would have stepped up

- SerializableType that is possibly a kind of CoercibleType, to allow for custom serialization (CoercibleType enforces the stored/managed type)

- some generic means to map a custom type as presented by the storage format to its SerializableType — see a Person, dynamically sort out what its SerializableType is, if not declared in code

the propagation of type information is already there. If we read type information from a format that can provide it, we can push it around through the plans, but this requires all fields to have declared type information so it can be enforced (CoercibleType/Type). So there won’t be any real work in jacking in the Serializers; they are inherited.

the true challenge will be during sorting/partitioning etc., pushing the SerializableType to the underlying Serialization system so that it can be applied outside of Cascading’s control (comparators etc. are handed to Hadoop as String classnames, so we then have an initialization overhead via the Configuration etc.; this is where all the nasty bits are, see StreamComparators for some of the hoops)
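
as a strawman, the hook might be as small as the following (hypothetical, not shipping code), hanging a serialization handle off the existing CoercibleType:

import cascading.tuple.type.CoercibleType;

/** strawman only: a declared field type that also names its own serialization */
public interface SerializableType<Canonical> extends CoercibleType<Canonical> {
  /** the serialization implementation (e.g. a Hadoop Serialization class) to use for values of this type */
  Class<?> getSerialization();
}

during sorting/partitioning the planner would read getSerialization() off the declared field type and hand it (plus any Comparison-sourced comparator) to the underlying platform, instead of relying on .getClass() dispatch.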

If you guys want to test the waters with a TypedTupleIOStream implementation, that would help get this going.
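
to sketch the shape of it (a toy, not the real TupleOutputStream API): with the declared types known up front, each position is written by the writer for its declared type and no token ever hits the wire:

import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

class TypedTupleWriterSketch {
  private final Class<?>[] declaredTypes; // one per tuple position, taken from the declared Fields

  TypedTupleWriterSketch(Class<?>[] declaredTypes) {
    this.declaredTypes = declaredTypes;
  }

  void write(DataOutputStream out, List<Object> tuple) throws IOException {
    for (int i = 0; i < declaredTypes.length; i++) {
      Class<?> type = declaredTypes[i];
      Object value = tuple.get(i);
      // dispatch on the declared type, not value.getClass(); no size or type token is written
      if (type == long.class || type == Long.class)
        out.writeLong((Long) value);
      else if (type == int.class || type == Integer.class)
        out.writeInt((Integer) value);
      else if (type == String.class)
        out.writeUTF((String) value);
      else
        throw new IOException("no writer declared for " + type);
    }
  }
}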

ckw


Chris K Wensel

Jun 15, 2015, 6:04:11 PM
to cascadi...@googlegroups.com
fwiw, i’m doing a little work to see what the actual LOE will be. some parts are easier, but others are more challenging.

ckw


Chris K Wensel

Jun 19, 2015, 6:40:30 PM
to cascadi...@googlegroups.com
i’ll follow up with more details when available, but in short if types are fully declared, 

— Cascading will no longer write size or type information (no more lookup/writing serialization tokens), 
— and if no custom comparators or secondary sorting is in play, bitwise comparisons will be performed directly on the serialized byte array, bypassing any deserialization during comparisons (ordered partitioning).

one caveat: on MapReduce, during a join, all keys must be the exact same type (Long.class != Long.TYPE), else we degrade back to writing type info.
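
for example (assuming the Fields#applyTypes API), the fast path only holds when both sides declare exactly the same class:

import cascading.tuple.Fields;

public class DeclaredJoinKeys {
  public static void main(String[] args) {
    Fields lhsKey = new Fields("user_id").applyTypes(long.class); // primitive long
    Fields rhsKey = new Fields("user_id").applyTypes(Long.class); // boxed Long, a different class

    // joining lhsKey against rhsKey on MapReduce degrades back to writing type info;
    // declare the identical class on both sides to keep the token-free path
    System.out.println(lhsKey + " vs " + rhsKey);
  }
}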

early next week this will be a 3.1 wip. 

there are some other optimizations I would like to get in as well, including having temp taps leverage the declared type info (there are never temp taps in Tez).

and having the planner report when types are missing, even optionally failing at plan time to enforce their use.

I also haven’t prototyped the SerializableType thing yet. that will be after all the type changes above stick. but it looks like a trivial add if we hang it off CoercibleType.

ckw


Oscar Boykin

Jun 26, 2015, 6:04:06 PM
to cascadi...@googlegroups.com
I like the progress here. It will be a win for us, and I look forward to using it.

One note: assuming you can do bitwise comparisons is not safe. That should be opt in. Not all serializations have the property that if x == y, then ser(x) == ser(y), although most do. Consider using a type that implements Comparable, but lazily normalizes on compareTo: a hashSet of comparables, for instance. If someone calls compareTo, we make a sorted representation, but if they don't we use hashing.

These are corner cases for sure (but we've hit such cases). I think assuming bitwise comparison is safe without a custom comparator should be opt-in.
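
To make that concrete with plain JDK classes (a toy, not our actual code): two sets that are equal, and would compare equal under a comparator that sorts the elements first, but whose serialized bytes differ because iteration order differs:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class BitwiseUnsafeExample {
  static byte[] javaSerialize(Object value) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(value);
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    Set<String> a = new LinkedHashSet<>(Arrays.asList("x", "y")); // inserted x then y
    Set<String> b = new LinkedHashSet<>(Arrays.asList("y", "x")); // inserted y then x

    System.out.println(a.equals(b));                                        // true: equal values
    System.out.println(Arrays.equals(javaSerialize(a), javaSerialize(b)));  // false: different bytes
  }
}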

Lastly, in our case, we know the exact class in exactly the same cases that we have a custom comparator, hasher, and binary comparator. For us, we'd not want to lose the ability to plug all those in. We just want the win of not having the tokens.



Chris K Wensel

Jun 26, 2015, 6:37:29 PM
to cascadi...@googlegroups.com

One note: assuming you can do bitwise comparisons is not safe. That should be opt in. Not all serializations have the property that if x == y, then ser(x) == ser(y), although most do. Consider using a type that implements Comparable, but lazily normalizes on compareTo: a hashSet of comparables, for instance. If someone calls compareTo, we make a sorted representation, but if they don't we use hashing.

These are corner cases for sure (but we've hit such cases). I think assuming bitwise comparison is safe without a custom comparator should be opt-in.

Lastly, in our case, we know the exact class in exactly the same cases that we have a custom comparator, hasher, and binary comparator. For us, we'd not want to lose the ability to plug all those in. We just want the win of not having the tokens.


This is why I pushed a wip for people to play with. and made it opt out to force the issue. note if you don’t provide any type information, everything is as usual. 

that is, if you are grouping on foo.Person, and declare the ‘person’ field as being foo.Person.class, you will have trouble unless you set a Comparator on ‘person’ (or the PersonSerialization implements Comparison)

if a Comparator on the element is provided (via Fields or via cascading.tuple.Comparison), comparisons are deferred to it instead.
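
in other words, something like the following (Person, PersonComparator and PersonSerialization are placeholders, and this assumes the current Fields applyTypes/setComparator API):

import java.io.Serializable;
import java.util.Comparator;

import cascading.tuple.Fields;

public class GroupingFieldExample {
  static class Person { String name; }

  static class PersonComparator implements Comparator<Person>, Serializable {
    public int compare(Person lhs, Person rhs) { return lhs.name.compareTo(rhs.name); }
  }

  public static void main(String[] args) {
    // declaring the type alone is not enough for grouping on a custom class...
    Fields groupFields = new Fields("person").applyTypes(Person.class);

    // ...unless a Comparator is set on the field
    groupFields.setComparator("person", new PersonComparator());
    // (or the registered PersonSerialization implements cascading.tuple.Comparison,
    //  in which case the Comparator is sourced from it instead)
  }
}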

if you are performing secondary sorting, we also currently shut off bitwise comparisons. and I think they are off on MapReduce regardless when performing a join, thanks to the need to prefix keys with ordinality.

also, I suspect unicode will cause grief, so flipping the bit might be called for.

you also cannot get away with providing only some type info, it’s all or nothing per tuple. i’m hoping to provide some planner features to help identify these cases.

anyway, lots of caveats — it is a wip after all. feedback will help make it safe/reliable.

see these props settings to manipulate things.


anyway, 3.1-wip-1 is up if anyone wants to break it or their code.

Chris K Wensel




Oscar Boykin

Jun 26, 2015, 8:27:30 PM
to cascadi...@googlegroups.com
Just a note about unicode: UTF-8 string sort order is happily the same as bitwise sorting, so that is not something you need to care about.
