Immutability of generated data structures

658 views
Skip to first unread message

Kannan Goundan

unread,
Apr 23, 2009, 8:03:22 AM4/23/09
to Protocol Buffers
The code generated by protoc seems to go to great lengths to make sure
that once a message object is created, it can't be modified. I'm
guessing that this is to avoid cycles in the object graph, so that the
serialization routine doesn't have to detect cycles. Is this
correct? Would a cycle in the object graph put the current serializer
into an infinite loop?

Jon Skeet <skeet@pobox.com>

unread,
Apr 23, 2009, 3:56:43 PM4/23/09
to Protocol Buffers
I think it's more because our experience in Google is that immutable
objects are easier to reason about - basically a lesson from
functional programming.

But yes, I suspect the side-benefit of preventing cycles is a
generally good thing too :)

Jon

Marc Gravell

unread,
Apr 23, 2009, 5:05:10 PM4/23/09
to Protocol Buffers
In many ways this is an implementation detail, though. protobuf-net
takes the other (mutable classes) approach - and indeed one
consequence is that it needs to be careful about loops when
serializing. That needn't be too invasive, though: I simply track the
depth, and don't even bother tracking objects until we get
suspiciously deep. Beyond a certain depth, it starts tracking the
references, and if a duplicate is found it raises an exception.

But yes; the google implementations are (AFAIK) all based on
immutability and builders.

Marc

Kenton Varda

unread,
Apr 23, 2009, 6:21:58 PM4/23/09
to Kannan Goundan, Protocol Buffers
You're specifically talking about the Java implementation.  We quite intentionally chose to make message objects completely immutable.  Version 1 of protocol buffers (never released publicly) used mutable objects, and we found it lead to many bugs as people would modify messages that were simultaneously being used elsewhere in the app.  To defend against such bugs, people had to constantly make "defensive copies" of message objects -- often unnecessarily, because it was hard to be sure when a copy was necessary.

In C++, we solve this problem using the "const" qualifier, but Java has no "const", so we had to go a different route in Java.

The idea to use immutable objects was actually first suggested to me by Josh Bloch (author of Effective Java, and Google employee).  Since I personally am a fan of functional programming, I liked the idea a lot and ran with it.  Most Java developers inside Google seem to think this was a big improvement.

Kannan Goundan

unread,
Apr 23, 2009, 6:59:04 PM4/23/09
to Kenton Varda, Protocol Buffers
The reason I'm asking all this is that I'm implementing a data
serialization format that has the same usage model as protocol buffers
(i.e. generate language bindings, serialize/deserialize).

For me, the biggest benefit of immutability here is that it prevents
cycles in the object graph. (Well, I think this is true for Java and
C++. In Haskell, though, this is not the case). Without cycles, the
serialization code doesn't have to worry about getting stuck in an
infinite loop (though it's more likely that it'll eventually stack
overflow...).

The downside is the potential inefficiency of making unnecessary
copies of the data structures. I'm on board with the benefits of
immutability, so I'm usually willing to take the performance hit, but
I wasn't sure if others would be as well. Have you guys gotten any
requests to add a "generate mutable data structures" mode to protoc?

If I do end up having to support mutable structures, maybe there's
some clever way to efficiently prevent cycles in an object graph as it
is being manipulated. I really don't want to have to detect cycles in
the serializer (I think cycle detection is what makes Java's built-in
serialization so slow...).

- Kannan

Kenton Varda

unread,
Apr 23, 2009, 9:07:24 PM4/23/09
to Kannan Goundan, Protocol Buffers
On Thu, Apr 23, 2009 at 3:59 PM, Kannan Goundan <kan...@cakoose.com> wrote:
The reason I'm asking all this is that I'm implementing a data
serialization format that has the same usage model as protocol buffers
(i.e. generate language bindings, serialize/deserialize).

For me, the biggest benefit of immutability here is that it prevents
cycles in the object graph.  (Well, I think this is true for Java and
C++.  In Haskell, though, this is not the case).  Without cycles, the
serialization code doesn't have to worry about getting stuck in an
infinite loop (though it's more likely that it'll eventually stack
overflow...).

The downside is the potential inefficiency of making unnecessary
copies of the data structures.

Note that immutable data structures also allow you to *avoid* copies in a lot of cases, since there's no need to make defensive copies.  So it's not necessarily always less efficient -- in some cases it is more efficient.  I'm not sure what the average case is.
 
 I'm on board with the benefits of
immutability, so I'm usually willing to take the performance hit, but
I wasn't sure if others would be as well.  Have you guys gotten any
requests to add a "generate mutable data structures" mode to protoc?

Not really.  Note that builder objects can be used like mutable data structures, with some limitations (e.g. the sub-messages aren't mutable).
 
If I do end up having to support mutable structures, maybe there's
some clever way to efficiently prevent cycles in an object graph as it
is being manipulated.  I really don't want to have to detect cycles in
the serializer (I think cycle detection is what makes Java's built-in
serialization so slow...).

Honestly, I don't think you need to worry about it.  It's really not very easy to accidentally write code which produces a cyclic data structure.  If someone manages to do it, they'll figure out what went wrong pretty quickly when their stack overflows.  Preventing cycles was not a consideration that factored into our design decision.

That said, if you have mutable data structures, then you probably want to keep track of ownership anyway.  That is, each object should only be permitted to have one parent.  Otherwise, it's surprising when editing a sub-object of one object can mysteriously affect some other object elsewhere.  This is exactly the kind of problem we were trying to prevent.

This would, of course, mean that your system can only be used to build trees, not DAGs.  But if you don't want to detect cycles in your serialization algorithm, I'm guessing you aren't interested in detecting when the same object appears in multiple places.  So your serialization is really writing trees anyway.
Reply all
Reply to author
Forward
0 new messages