Proposed changes: freeze/thaw mechanism for objects

Steffen Mueller

Dec 18, 2013, 1:06:47 PM
to serea...@googlegroups.com, Damian Gryski, Rafaël Garcia-Suarez, Yves Orton
Hi all,

I've just pushed a proposed Sereal spec change to a branch in the
repository:

https://github.com/Sereal/Sereal/commit/3942c0b9cfee708a281a807c49d8d5aae6614079

I'd love some feedback.

Best regards,
Steffen

Ævar Arnfjörð Bjarmason

Dec 19, 2013, 2:11:16 PM
to Steffen Mueller, serea...@googlegroups.com, Damian Gryski, Rafaël Garcia-Suarez, Yves Orton
On Wed, Dec 18, 2013 at 7:06 PM, Steffen Mueller
<steffen...@booking.com> wrote:
> Hi all,
>
> I've just pushed a proposed Sereal spec change to a branch in the
> repository:
>
> https://github.com/Sereal/Sereal/commit/3942c0b9cfee708a281a807c49d8d5aae6614079
>
> I'd love some feedback.

I never liked this hook interface for serialization that relied on
sprinkling FREEZE/THAW methods to all of your classes.

Why not just go for a more general interface that trivially allows you
to do that if you want, i.e.:

object_encode_hook => sub {
    my $obj = shift;
    return $obj->FREEZE('Sereal') if $obj->can("FREEZE");
    return $obj;
},

But would also allow you to intercept and munge all objects if you
wanted. You could skip objects with "return;" and serialize them as
undef with "return undef" etc.

I can see that having a wide array of uses that just dispatching to
FREEZE/THAW in individual classes wouldn't give you. To name a simple
example, maybe you want to:

object_encode_hook => sub {
    my $obj = shift;
    die "PANIC: Only allow this for objects that can FREEZE"
        unless $obj->can("FREEZE");
    return $obj->FREEZE('Sereal');
},

Or deny objects that have a DESTROY method or whatever.

More generally it seems to me that the act of pre-processing the data
structures you pass to any serializer is a logically distinct step
from serializing. The only reason you don't do:

my $munged = Some::General::Munger::munge($data);
my $serialized = Some::Serializer::serialize($munged);

Is for optimization reasons, i.e. you want to be able to do streaming
serialization and not copy needlessly. It would be very interesting if
someone came up with an interface for doing this that serializers
could hook into, even if it was just some C function pointers. Then
you could write general munging modules that you could use with any
serializer.

But I'm dreaming. I still think having a single callback is way
better than dispatching to FREEZE/THAW methods.

Steffen Mueller

Dec 19, 2013, 4:01:28 PM
to Ævar Arnfjörð Bjarmason, serea...@googlegroups.com, Damian Gryski, Rafaël Garcia-Suarez, Yves Orton
On 12/19/2013 08:11 PM, Ævar Arnfjörð Bjarmason wrote:
> On Wed, Dec 18, 2013 at 7:06 PM, Steffen Mueller
> <steffen...@booking.com> wrote:
>> Hi all,
>>
>> I've just pushed a proposed Sereal spec change to a branch in the
>> repository:
>>
>> https://github.com/Sereal/Sereal/commit/3942c0b9cfee708a281a807c49d8d5aae6614079
>>
>> I'd love some feedback.
>
> I never liked this hook interface for serialization that relied on
> sprinkling FREEZE/THAW methods to all of your classes.

Yves is going to be very relieved to hear that you agree with him on
this. :)

> Why not just go for a more general interface that trivially allows you
> to do that if you want, i.e.:
>
> object_encode_hook => sub {
>     my $obj = shift;
>     return $obj->FREEZE('Sereal') if $obj->can("FREEZE");
>     return $obj;
> },

I think this is pretty much what Yves has wanted all along. I'm of two
minds about this:

a) This interface makes you pay a cost for all objects (worse than
cached method lookup).
b) This interface can do things that FREEZE/THAW can't. Like inject
custom logic for handling objects that maybe don't implement FREEZE/THAW
themselves.
c) Just preprocessing large data structures manually is sometimes an
option, but often means not only breaking encapsulation but also
reimplementing data structure walking over and over again. So this isn't
purely about optimization (as you say below). It's also about being
more programmer efficient.

So I see good and bad sides of each of the two valid alternatives ("just
preprocess in the application" is sometimes doable but often
impractical). Why not support them both? The user would have to choose
between the two and implement the FREEZE/THAW variant in the generic
callback if he wanted both.
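
The "choose one, and implement FREEZE/THAW inside the generic callback if you want both" arrangement can be sketched in a few lines. This is a minimal illustration in Go (the language of Damian's port), not the Perl encoder's API: Encoder, Hook, Freezer, prepare, session and freezeOrPass are all hypothetical names invented for the example.

```go
package main

import "fmt"

// Freezer is a hypothetical analog of a class that provides FREEZE.
type Freezer interface {
	Freeze(serializer string) interface{}
}

// Encoder is a hypothetical encoder. If Hook is set, every object goes
// through it and the per-type Freezer dispatch is skipped entirely.
type Encoder struct {
	Hook func(obj interface{}) interface{}
}

// prepare returns the value that would actually be serialized.
func (e *Encoder) prepare(obj interface{}) interface{} {
	if e.Hook != nil {
		return e.Hook(obj) // generic callback wins
	}
	if f, ok := obj.(Freezer); ok {
		return f.Freeze("Sereal") // per-type hook
	}
	return obj // plain data passes through unchanged
}

// session is a sample type that knows how to freeze itself.
type session struct{ token string }

func (s session) Freeze(string) interface{} { return "session:" + s.token }

// freezeOrPass re-implements the per-type dispatch inside the generic
// callback, which is what a user wanting both behaviours would write.
func freezeOrPass(obj interface{}) interface{} {
	if f, ok := obj.(Freezer); ok {
		return f.Freeze("Sereal")
	}
	return obj
}

func main() {
	plain := &Encoder{}
	hooked := &Encoder{Hook: freezeOrPass}
	fmt.Println(plain.prepare(session{"abc"}))
	fmt.Println(hooked.prepare(session{"abc"}))
	fmt.Println(hooked.prepare(42))
}
```

Note that the generic path pays for a function call on every object, which is the cost in point a), while the per-type path only does work for types that opted in.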

> More generally it seems to me that the act of pre-processing the data
> structures you pass to any serializer is a logically distinct step
> from serializing. The only reason you don't do:
>
> my $munged = Some::General::Munger::munge($data);
> my $serialized = Some::Serializer::serialize($munged);
>
> Is for optimization reasons, i.e. you want to be able to do streaming
> serialization and not copy needlessly. It would be very interesting if
> someone came up with an interface for doing this that serializers
> could hook into, even if it was just some C function pointers. Then
> you could write general munging modules that you could use with any
> serializer.
>
> But I'm dreaming. I still think having a single callback is way
> better than dispatching to FREEZE/THAW methods.

I think it's sometimes better. It's more general for sure. But I think
the FREEZE/THAW interface with potential serializer specialization is
actually not bad at all. It allows class authors to provide
non-encapsulation breaking ways of serializing hard-to-serialize
objects. Whether called through the generic object callback or by the
serializer is kind of secondary.

The other bit is that without the FREEZE/THAW convention,
object-DEserialize hooks are going to require the same special
callback-wielding tags that my branch adds to the spec. So this can
conveniently piggy-back.

There's one flaw in ALL variants, though. If we offer both facilities,
the user needs to basically control both sides, encoding and decoding,
code wise. IOW none of these concepts work well for generic wire
protocols / RPC. But then again, I don't think this was a strength of
Sereal's to begin with.

--Steffen

Ævar Arnfjörð Bjarmason

Dec 19, 2013, 4:34:51 PM
to Steffen Mueller, serea...@googlegroups.com, Damian Gryski, Rafaël Garcia-Suarez, Yves Orton
On Thu, Dec 19, 2013 at 10:01 PM, Steffen Mueller
As far as speed is concerned we could still have our cake and eat it
too. We could just provide some default built-in hook that did exactly
what your initial proposal suggests, or maybe a flag saying that the
hooks should only be called on classes that ->can($whatever). My point
was mainly that if we had a facility like this, it would be very nice
if it *also* provided first-class support for passing all objects
through the same callback.

It would also be interesting to be able to provide an integer as a
callback that would be used to call a C function pointer with the SV*
struct for the blessed object, since those that need this for large
data structures are likely to appreciate the speedup, but I digress.

Also, is the point of adding this to the actual Sereal format, rather
than just having it be an encoder/decoder implementation detail, that
you can avoid work when deserializing because you'll know which
objects you want to check ->can("THAW") on?

I'd just like to point out that if that's the case it might be useful
to either drop that or have a facility in the decoders to call this
for *all* objects regardless of whether they have the tag or not.

Consider the case of serializing Some::Object v0.01 to disk and later
trying to restore that with Some::Object v2.00 not having known at the
time that the internal structure changed from an ArrayRef to a HashRef
or something.

I could see that being able to munge such objects when you thaw them,
even though you weren't initially expecting to have to, would be very
useful in some cases.

Steffen Mueller

Dec 20, 2013, 2:14:14 AM
to Ævar Arnfjörð Bjarmason, serea...@googlegroups.com, Damian Gryski, Rafaël Garcia-Suarez, Yves Orton
On 12/19/2013 10:34 PM, Ævar Arnfjörð Bjarmason wrote:
> As far as speed is concerned we could still have our cake and eat it
> too. We could just provide some default built-in hook that did exactly
> what your initial proposal suggests, or maybe a flag saying that the
> hooks should only be called on classes that ->can($whatever). My point
> was mainly that if we had a facility like this, it would be very nice
> if it *also* provided first-class support for passing all objects
> through the same callback.

So for the record, a conversation that Yves and I had on Jabber the
other day leading up to this was basically (freely paraphrased):

me: "I still like the FREEZE/THAW interface. It's the only sane way to
avoid breaking encapsulation."

Yves: "Well, yeah, I don't care as long as I can have the generic
callback which is more powerful in many situations."

me: "But the thing is, I might give it a shot to implement F/T, but
currently have no interest in hacking on the other."

Yves: "We can agree to eventually have both and whoever does the work of
either gets to have it."

In other words, the only reason why F/T came first was because I removed
the foot from my mouth first. :)

> Also, is the point of adding this to the actual Sereal format, rather
> than just having it be an encoder/decoder implementation detail, that
> you can avoid work when deserializing because you'll know which
> objects you want to check ->can("THAW") on?

I think we all agree that the generic callback interface can be used to
implement F/T semantics as you demonstrated.

The point of having it in the actual format is that the representation
of the object is going to be different "on the wire" than in memory
(even beyond simple flattening). Thus, it makes sense to allow the
encoding side to signal things to the decoding side.

Damian has also said that this actually solves a problem in the Go
implementation wrt. things that the Go introspection doesn't allow him
to do. I would expect other statically typed languages to need such a
thing even more. I'll leave the explanation of how this is important to
his Go port to Damian.

> I'd just like to point out that if that's the case it might be useful
> to either drop that or have a facility in the decoders to call this
> for *all* objects regardless of whether they have the tag or not.
>
> Consider the case of serializing Some::Object v0.01 to disk and later
> trying to restore that with Some::Object v2.00 not having known at the
> time that the internal structure changed from an ArrayRef to a HashRef
> or something.

The versioning mismatch is a problem that ALL of these approaches have.
But who is more qualified to anticipate this than the author of the
class whose instance representation changed? Nobody, indeed. So if that
author were to implement FREEZE/THAW hooks, he is in the unique position
to write logic that will make them backwards compatible with previously
serialized v0.01 objects. Attempting to fist-violate encapsulation from
the outside isn't going to be pretty for such cases.
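
To make the "class author handles old layouts" point concrete, here is a rough sketch of a thaw routine that accepts both the old positional layout and the new keyed one. It is written in Go only so that it matches the Go port discussed elsewhere in this thread. Point, ThawPoint and both layouts are hypothetical stand-ins for Some::Object v0.01 and v2.00.

```go
package main

import (
	"errors"
	"fmt"
)

// Point stands in for Some::Object; the name and both frozen layouts
// are assumptions made up for this sketch.
type Point struct{ X, Y int }

// ThawPoint plays the role of a class-provided THAW: it is written by
// the type's author, who knows every layout the type has ever used.
func ThawPoint(payload interface{}) (Point, error) {
	switch p := payload.(type) {
	case []int:
		// v0.01 layout: positional [x, y].
		if len(p) != 2 {
			return Point{}, errors.New("v0.01 payload must have exactly two elements")
		}
		return Point{X: p[0], Y: p[1]}, nil
	case map[string]int:
		// v2.00 layout: keyed {"x": ..., "y": ...}.
		x, okX := p["x"]
		y, okY := p["y"]
		if !okX || !okY {
			return Point{}, errors.New("v2.00 payload must contain x and y")
		}
		return Point{X: x, Y: y}, nil
	default:
		return Point{}, fmt.Errorf("unsupported frozen layout %T", payload)
	}
}

func main() {
	old, _ := ThawPoint([]int{1, 2})
	cur, _ := ThawPoint(map[string]int{"x": 1, "y": 2})
	fmt.Println(old == cur)
}
```

The decoder never needs to know which layout it is handing over; that knowledge stays inside the type.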

Yves had a very, very important point to make about this, and I'll try
to paraphrase it: At B.com, we are in control of most of the code we
use: core infrastructure classes, application, all. That's not a
situation most users are in. If upstream authors adopt the relatively
generic (almost serializer-agnostic) F/T callbacks in their libraries,
then everybody wins. The most obvious example to us Perlers is of
course the modules on CPAN that we use.

> I could see that being able to munge such objects when you thaw them,
> even though you weren't initially expecting to have to, would be very
> useful in some cases.

I have no problem with supporting a generic callback at all, so let's
make sure you're not running into open doors here. I just have three
conditions (which are pretty similar to the conditions that Yves
proposed for me adding F/T):

a) I don't have to implement it.
b) "You only pay for it if you use it."
c) It's fairly well thought-through. There are lots of trade-offs to be
made between performance and more complex interfaces.

The nice thing about a generic callback is that it doesn't require any
changes to the Sereal protocol - certainly not any on top of the one I
am proposing here. That means it can be added at will in the future
whenever somebody volunteers.

Best regards,
Steffen

Damian Gryski

Dec 25, 2013, 3:32:57 PM
to serea...@googlegroups.com, Ævar Arnfjörð Bjarmason, Damian Gryski, Rafaël Garcia-Suarez, Yves Orton


On Friday, December 20, 2013 8:14:14 AM UTC+1, Steffen Mueller wrote:
> Damian has also said that this actually solves a problem in the Go
> implementation wrt. things that the Go introspection doesn't allow him
> to do. I would expect other statically typed languages to need such a
> thing even more. I'll leave the explanation of how this is important to
> his Go port to Damian.

Where this comes into play with Go (and in general other languages that have stronger visibility restrictions) is being able to serialize, or not, things you don't have access to. In Perl, a hash is a hash is a hash, so you can always get at the keys and the values and turn them into bytes. For Go (and similarly Java), private members simply aren't accessible. The way Go solves this is by providing generic interfaces that the serialization packages can call. The json package already does this: if your type has a MarshalJSON method, then when the json library hits your type, your MarshalJSON method is called instead of the standard reflection-based encoding. The XML library has a similar method (MarshalXML). For Go 1.2 (released December 1st), they made this a bit more generic, providing definitions for BinaryMarshaler and TextMarshaler (http://golang.org/pkg/encoding/). For the Go implementation, I would check to see if the type provides MarshalBinary(), and then call it.
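
The check described above might look roughly like this. It is a sketch, not the actual Go Sereal code: encodeValue and opaque are hypothetical names, and the fmt.Sprint fallback merely stands in for a real reflection-based encoder. Only encoding.BinaryMarshaler and the reflect calls come from the standard library.

```go
package main

import (
	"encoding"
	"errors"
	"fmt"
	"reflect"
)

// opaque is a hypothetical type whose state is entirely unexported.
type opaque struct{ secret byte }

// MarshalBinary implements encoding.BinaryMarshaler for opaque.
func (o opaque) MarshalBinary() ([]byte, error) { return []byte{o.secret}, nil }

// encodeValue is a hypothetical encoder entry point. It prefers the
// type's own MarshalBinary and only then falls back to reflection.
func encodeValue(v interface{}) ([]byte, error) {
	if m, ok := v.(encoding.BinaryMarshaler); ok {
		return m.MarshalBinary()
	}
	rv := reflect.ValueOf(v)
	if rv.Kind() == reflect.Struct {
		// A non-empty PkgPath marks an unexported field, which a
		// reflection-based encoder is not allowed to read.
		for i := 0; i < rv.NumField(); i++ {
			if f := rv.Type().Field(i); f.PkgPath != "" {
				return nil, errors.New("cannot encode unexported field " + f.Name)
			}
		}
	}
	// Placeholder for the real reflection-based encoding.
	return []byte(fmt.Sprint(v)), nil
}

func main() {
	b, err := encodeValue(opaque{secret: 7})
	fmt.Println(b, err)
	_, err = encodeValue(struct{ hidden int }{1})
	fmt.Println(err)
}
```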

As a concrete use case for this, the 'time.Time' type in Go does not contain any exported members, but it _does_ provide MarshalBinary / UnmarshalBinary. This means without implementing this type of logic, the Go Sereal implementation _cannot_ roundtrip a standard Go time object. (A quick grep through the source tree shows this to be the only example in the standard library.)

The same problem will show up in Java, although I believe with reflection you can do tricky things and work around Java's visibility rules.  (That doesn't make it any less hacky, just possible.) 

Steffen Mueller

Dec 27, 2013, 12:42:45 PM
to serea...@googlegroups.com, Ævar Arnfjörð Bjarmason, Damian Gryski, Rafaël Garcia-Suarez, Yves Orton
On 12/20/2013 08:14 AM, Steffen Mueller wrote:
[bla bla, we can have both f/t generic and per-class callbacks]

If nobody stops me, I plan to rebase/ff my branch into master some time
soon. Speak up now or forever...

--Steffen

Ævar Arnfjörð Bjarmason

Dec 27, 2013, 2:28:56 PM
to Steffen Mueller, Rafaël Garcia-Suarez, Damian Gryski, Yves Orton, serea...@googlegroups.com

Your blackjack looks fine to me, we can always add hookers later.
