Can serialized messages be used reliably as keys?

alopecoid

unread,

Sep 29, 2009, 3:22:27 PM9/29/09

to Protocol Buffers

Hi,

Can serialized messages be used reliably as keys?

In other words, is it guaranteed that...

- Two equal messages will always generate equal byte sequences?
(Are fields always written in the same order?)

- Two unequal messages will always generate unequal byte sequences?
(Are tag identifiers enough to delimit variable length fields from
accidentally producing equal byte sequences?)

I have a feeling that the answer is no. For example, given a proto
with two fields, both variable length int64 types, it seems that two
unequal messages could, by chance, generate the same byte sequence:

[1 byte tag] [3 byte value] [1 byte tag] [2 byte value] = 7 bytes
[1 byte tag] [2 byte value] [1 byte tag] [3 byte value] = 7 bytes
[1 byte tag] [6 byte value] = 7 bytes
... etc.

If those 7 bytes just happen to be equal, then the serialized messages
can NOT be used reliably as keys.

Thoughts?

Thank you.

Jon Skeet

unread,

Sep 29, 2009, 3:29:17 PM9/29/09

to Protocol Buffers

Given that the serialized bytes have to be able to *deserialize* back
to the original messages, surely if those original messages aren't
equal, the serialized forms would have to be different too - assuming
we're talking about the same message type. (Two messages of different
types could serialize to the same data, admittedly.)

The Java serialized form does serialize all the fields in order, I
believe.

Jon

alopecoid

unread,

Sep 29, 2009, 3:41:46 PM9/29/09

to Protocol Buffers

> Given that the serialized bytes have to be able to *deserialize* back
> to the original messages, surely if those original messages aren't
> equal, the serialized forms would have to be different too - assuming
> we're talking about the same message type

But, as in my example, that doesn't seem to be the case (necessarily).
Again, for example, let's say you have two messages, both of the same
type. The proto defines two optional fields, both of type variable
int64.

Say message A poopulates both optional fields:

[1 byte tag] [3 byte value] [1 byte tag] [2 byte value] = 7 bytes

And message B populates only one optional field:

[1 byte tag] [6 byte value] = 7 bytes

Couldn't these generate, by chance, the same 7 bytes? Yes, using
deserialize will correctly parse two unequal messages, but if you look
at the raw serialized byte sequences, they could actually be the same.

Kenton Varda

unread,

Sep 29, 2009, 3:59:03 PM9/29/09

to alopecoid, Protocol Buffers

On Tue, Sep 29, 2009 at 12:22 PM, alopecoid <alop...@gmail.com> wrote:

Hi,

Can serialized messages be used reliably as keys?

In other words, is it guaranteed that...

- Two equal messages will always generate equal byte sequences?
(Are fields always written in the same order?)

Only if:

1) The implementation you are using writes fields in canonical order. All official implementations do this, and probably most unofficial ones. However, implementations are technically allowed to write fields in any order.

2) There are no unknown fields in the message. Your message may have unknown fields if you originally parsed it off the wire and the sender is a newer binary that knows about new fields recently added to the .proto file. In C++ you can get rid of all unknown fields in a message by calling the DiscardUnknownFields() method.

If possible, I would recommend designing your application such that it only requires that equal messages have the same serialization *most* of the time. For example, if you were designing a cache where the cache key is the hash of a serialized message, then the worst that can happen if two equal messages had different serializations is that you'd perform the same operation twice rather than hitting cache. As long as this is relatively rare, it's no big deal.

- Two unequal messages will always generate unequal byte sequences?

As Jon said, this clearly has to be true. If two messages could have the same serialization, then how would the parser know which one to produce when parsing?

Kenton Varda

unread,

Sep 29, 2009, 4:03:57 PM9/29/09

to alopecoid, Protocol Buffers

On Tue, Sep 29, 2009 at 12:41 PM, alopecoid <alop...@gmail.com> wrote:

But, as in my example, that doesn't seem to be the case (necessarily).
Again, for example, let's say you have two messages, both of the same
type. The proto defines two optional fields, both of type variable
int64.

Say message A poopulates both optional fields:

[1 byte tag] [3 byte value] [1 byte tag] [2 byte value] = 7 bytes

And message B populates only one optional field:

[1 byte tag] [6 byte value] = 7 bytes

Couldn't these generate, by chance, the same 7 bytes?

No.

Yes, using
deserialize will correctly parse two unequal messages, but if you look
at the raw serialized byte sequences, they could actually be the same.

That makes no sense. If the bytes were the same, how would deserializing them be able to produce unequal messages?

If you must know the details, in the varint encoding, the upper bit of each byte is used to indicate whether there are more bytes in the value. So, in a 3-byte varint, the first two bytes have the upper bit set, but the last byte does not. So obviously a 3-byte varint cannot start with the same bytes as a 6-byte varint, because in a 6-byte varint the third byte would have the upper bit set.

However, these details really aren't necessary to answer the question. The same bytes, passed to the same parsing function, will produce the same output, regardless of how the encoding works.

Henner Zeller

unread,

Sep 29, 2009, 4:10:14 PM9/29/09

to alopecoid, Protocol Buffers

On Tue, Sep 29, 2009 at 12:41, alopecoid <alop...@gmail.com> wrote:
>
>> Given that the serialized bytes have to be able to *deserialize* back
>> to the original messages, surely if those original messages aren't
>> equal, the serialized forms would have to be different too - assuming
>> we're talking about the same message type
>
> But, as in my example, that doesn't seem to be the case (necessarily).
> Again, for example, let's say you have two messages, both of the same
> type. The proto defines two optional fields, both of type variable
> int64.
>
> Say message A poopulates both optional fields:
> [1 byte tag] [3 byte value] [1 byte tag] [2 byte value] = 7 bytes
>
> And message B populates only one optional field:
> [1 byte tag] [6 byte value] = 7 bytes

The varints are self synchronizing
http://code.google.com/apis/protocolbuffers/docs/encoding.html#varints
i.e. the first bit is always set in the bytes except for the last one.
So the 3 byte value will have something like 1xxxxxxx 1xxxxxxx
0xxxxxxx while the 6 byte value will have have a msb set to 1 at the
third byte. So they will always be different.

So yes, they will be different. As Jon said: the protocol decoder
needs to be able to decode it properly - a confusion between a (3byte
+ tag + 2 byte varint) vs. (6 byte varint) would not work. So two
different messages of the same message type are always different
(however two messages of different type could theoretically encode two
the same).

The thing you have to worry about more is the _sequence_ in which the
tags are encoded. The decoder does not care in which sequence the
fields are encoded, so it could be that messages with the same content
can be encoded in different ways.

However the encoding in the Google implementation guarantees that the
fields are always in a consistent order (I guess too many people
relied on the fact that messages can be used as a key/can be hashed).

-h

alopecoid

unread,

Sep 29, 2009, 5:14:36 PM9/29/09

to Protocol Buffers

> That makes no sense. If the bytes were the same, how would deserializing
> them be able to produce unequal messages?

Yes, I guess if we can rely on the canonical ordering of the fields,
that should be enough.

> If possible, I would recommend designing your application such that it only
> requires that equal messages have the same serialization *most* of the time.
> For example, if you were designing a cache where the cache key is the hash
> of a serialized message, then the worst that can happen if two equal
> messages had different serializations is that you'd perform the same
> operation twice rather than hitting cache. As long as this is relatively
> rare, it's no big deal.

I was thinking more along the lines of keys in a MapReduce (between
Mapper and Reducer phases). I don't think this would work in this
case.

> If you must know the details, in the varint encoding, the upper bit of each
> byte is used to indicate whether there are more bytes in the value. So, in
> a 3-byte varint, the first two bytes have the upper bit set, but the last
> byte does not. So obviously a 3-byte varint cannot start with the same
> bytes as a 6-byte varint, because in a 6-byte varint the third byte would
> have the upper bit set.

Yes, I actually know this :) I've implemented similar encoding, and
also looked at the source... the beauty of open source. It was just
hard to grok the entirety of the serialization scheme from the code.

Reply all

Reply to author

Forward