order of elements in encoding

3,690 views
Skip to first unread message

Willem de Jong

unread,
Jul 9, 2008, 11:34:18 AM7/9/08
to Protocol Buffers
Hi,

Can I assume that the order of the elements (key-value pairs, as the
are called in the documentation) in an encoded message is the same as
in the .proto file? What about extensions, are the added at the end?

Regards,
Willem

Alexey Zakhlestin

unread,
Jul 9, 2008, 11:38:40 AM7/9/08
to Willem de Jong, Protocol Buffers

order is specified by "digits" in .proto file

from the official example:

message PhoneNumber {
required string number = 1;
optional PhoneType type = 2 [default = HOME];
}

"= 1" means first
"= 2" means second


--
Alexey Zakhlestin
http://blog.milkfarmsoft.com/

Henner Zeller

unread,
Jul 9, 2008, 12:32:51 PM7/9/08
to Willem de Jong, Protocol Buffers
On Wed, Jul 9, 2008 at 5:34 PM, Willem de Jong <w.a.d...@gmail.com> wrote:
>
> Hi,
>
> Can I assume that the order of the elements (key-value pairs, as the
> are called in the documentation) in an encoded message is the same as
> in the .proto file?

Since each encode field have a header that describes their
tag+wiretype, the order in which they are sent does not matter. So a
protocol decoder must assume that fields come in any order.
For repeated fields, the elements end up in the same order they are
'on the wire'.

The documentation should probably state this more explicitly.
If you read the last paragraph in
http://code.google.com/apis/protocolbuffers/docs/encoding.html that
states that merging two protocol buffers is equivalent to parsing the
concatenated
binary representation this could be deduced.

So it is encoder dependent, in which order the fields are written to
the wire; typically, of course, you might see the elements written in
the order you see them in the proto file, but you must not assume this
when you write a decoder.

-henner

>
> Regards,
> Willem
> >
>

--
Henner Zeller | h.ze...@acm.org
Bücher kaufen und freie Software fördern | http://bookzilla.de

Kenton Varda

unread,
Jul 9, 2008, 1:45:06 PM7/9/08
to Willem de Jong, Protocol Buffers

All three existing implementations will always write known fields in order by field number, whether they are regular fields or extensions.  We guarantee this so that it is generally safe to use a hash of a serialized message as a cache key.  However, unknown fields (if they are retained; C++ and Java currently do this but not Python) will be appended to the end in arbitrary order -- it was too hard to merge them in without harming performance.  Parsers should not rely on ordering.

Willem de Jong

unread,
Jul 10, 2008, 2:18:19 AM7/10/08
to Kenton Varda, Protocol Buffers
Hello Kenton,
 
Thanks for the response. I hope its OK if I ask a little bit further:
 
- I am afraid I don't understand the point about unknown fields. What scenario are you thinking of? A sending system sending fields that are not present in the .proto file that is used on the sending side? In that case, how is the field number for such a field determined?
 
I guess these fields can only be used on the receiving side if it is present in the .proto file that is used on that side. Otherwise they would simply be ignored, correct?
 
- If you guarantee that known fields are ordered by field number, then why not make this a part of the encoding specification? I am asking, because (a) I have some code that I might reuse for writing a protobuf decoder, but it will be easier if I can be rely on fields being in sequence (some additional out-of-sequence fields at the end would be ignored), and (b) I think it would generally be possible to write a more efficient parser if the parser always 'knows what to receive next'. In that case it doesn't have to check a list of things that it might receive, but it just has to check the next item. Also, it wouldn't have to check in the end whether all required elements were present - it could fail immediately if it noticed a required element was missing.
 
Regards,
Willem

 

Kenton Varda

unread,
Jul 10, 2008, 3:42:58 AM7/10/08
to Willem de Jong, Protocol Buffers
On Wed, Jul 9, 2008 at 11:18 PM, Willem de Jong <w.a.d...@gmail.com> wrote:
- I am afraid I don't understand the point about unknown fields. What scenario are you thinking of? A sending system sending fields that are not present in the .proto file that is used on the sending side? In that case, how is the field number for such a field determined?

When you parse a message off the wire, if the parser sees unknown fields, it won't just ignore them.  It puts the field values off to the side, in the message's UnknownFieldSet.  If you then serialize the message without clearing it in between, the unknown fields are written back out.  This way, if you have a server that acts as a proxy -- receiving messages and then forwarding them elsewhere -- you do not have to upgrade it every time you add a new field to your format.
 
- If you guarantee that known fields are ordered by field number, then why not make this a part of the encoding specification? I am asking, because (a) I have some code that I might reuse for writing a protobuf decoder, but it will be easier if I can be rely on fields being in sequence (some additional out-of-sequence fields at the end would be ignored), and (b) I think it would generally be possible to write a more efficient parser if the parser always 'knows what to receive next'. In that case it doesn't have to check a list of things that it might receive, but it just has to check the next item. Also, it wouldn't have to check in the end whether all required elements were present - it could fail immediately if it noticed a required element was missing.

It's guaranteed that serializing a protocol message object will write the tags in order.  However, there are other ways to construct protocol messages.  If you simply concatenate two messages, for example, this has the effect of merging them as if you used MergeFrom().  This is a useful property which we have actually used in some cases.  We actually have found several cases where people wanted to write messages manually and not have to write the tags in order, so the format does not mandate an order.

I don't think you could really get that much of a performance improvement by assuming tags are ordered.  Unless you message is entirely required fields (no optional or repeated), you would still have to check each tag.  The code we generate now (for C++, at least) actually has an optimization where it predicts that the next tag in the input will be the next tag in sequence, and so it compares against that prediction before falling back to the switch.

Willem de Jong

unread,
Jul 10, 2008, 7:25:52 AM7/10/08
to Kenton Varda, Protocol Buffers
Hi,
 
Some additional comments... Feel free to ignore them, after all it is your format/standard/idea/tool.

 
On 7/10/08, Kenton Varda <ken...@google.com> wrote:
On Wed, Jul 9, 2008 at 11:18 PM, Willem de Jong <w.a.d...@gmail.com> wrote:
- I am afraid I don't understand the point about unknown fields. What scenario are you thinking of? A sending system sending fields that are not present in the .proto file that is used on the sending side? In that case, how is the field number for such a field determined?

 
When you parse a message off the wire, if the parser sees unknown fields, it won't just ignore them.  It puts the field values off to the side, in the message's UnknownFieldSet.  If you then serialize the message without clearing it in between, the unknown fields are written back out.  This way, if you have a server that acts as a proxy -- receiving messages and then forwarding them elsewhere -- you do not have to upgrade it every time you add a new field to your format.
 
But it wouldn't be terribly difficult (or even costly, I would say) to put them in the right order.

 
 
- If you guarantee that known fields are ordered by field number, then why not make this a part of the encoding specification? I am asking, because (a) I have some code that I might reuse for writing a protobuf decoder, but it will be easier if I can be rely on fields being in sequence (some additional out-of-sequence fields at the end would be ignored), and (b) I think it would generally be possible to write a more efficient parser if the parser always 'knows what to receive next'. In that case it doesn't have to check a list of things that it might receive, but it just has to check the next item. Also, it wouldn't have to check in the end whether all required elements were present - it could fail immediately if it noticed a required element was missing.

 
It's guaranteed that serializing a protocol message object will write the tags in order.  However, there are other ways to construct protocol messages.  If you simply concatenate two messages, for example, this has the effect of merging them as if you used MergeFrom().  This is a useful property which we have actually used in some cases.  We actually have found several cases where people wanted to write messages manually and not have to write the tags in order, so the format does not mandate an order.

 
I don't think you could really get that much of a performance improvement by assuming tags are ordered.  Unless you message is entirely required fields (no optional or repeated), you would still have to check each tag. The code we generate now (for C++, at least) actually has an optimization where it predicts that the next tag in the input will be the next tag in sequence, and so it compares against that prediction before falling back to the switch.
 
 
That supports my point, doesn't it? The optimization is probably counter productive if the fields are out-of-sequence, so it would make sense to add a recommendation to the standard that the fields SHOULD be in sequence whenever feasible. 

Note that, if the elements can be out of sequence, you have to maintain some administration of what you have parsed, in order to be able to check whether you received everything (and an additional step at the end to execute this check). If the fields are in sequence, you can conclude that a field is missing when you receive the next field - no administration. This can also be done when some of the fields are optional, that doesn't really change it.
 
Regards,
Willem
 
 

Kenton Varda

unread,
Jul 10, 2008, 3:35:38 PM7/10/08
to Willem de Jong, Protocol Buffers
On Thu, Jul 10, 2008 at 4:25 AM, Willem de Jong <w.a.d...@gmail.com> wrote:
But it wouldn't be terribly difficult (or even costly, I would say) to put them in the right order.

Actually, it would be very costly to get unknown fields to be ordered correctly relative to known ones.  Currently the generated code for serialization simply goes through all the fields in order and writes them.  If it had to worry about unknown fields, then before every single field it would have to check if there are any unknown fields to write.  The extra checks would hurt speed a little bit and probably add a lot of code size bloat.

Note that even when field numbers are sequential with no gaps in between, you still have to check for unknown fields because a field which has a known field number but the wrong wire type is treated the same as an unknown one.
 
That supports my point, doesn't it? The optimization is probably counter productive if the fields are out-of-sequence, so it would make sense to add a recommendation to the standard that the fields SHOULD be in sequence whenever feasible. 

Yes, I agree.  Fields *should* be in sequence, but are not required to be.  I'll make a note to add that to the encoding doc.
 
Note that, if the elements can be out of sequence, you have to maintain some administration of what you have parsed, in order to be able to check whether you received everything (and an additional step at the end to execute this check). If the fields are in sequence, you can conclude that a field is missing when you receive the next field - no administration. This can also be done when some of the fields are optional, that doesn't really change it.

In practice, we find that the vast majority of fields end up being optional or repeated, not required, so this isn't that helpful anyway.  Also, it would be hard to implement the "partial" parsing methods (that do not enforce required fields) if required field checking were not done in a second pass.  We've found these methods to be important in practice.
Reply all
Reply to author
Forward
0 new messages