Streaming Serialization - Suggestion


Yoav H

Mar 23, 2016, 2:50:36 PM
to Protocol Buffers
Hi,

I have a suggestion for improving the protobuf encoding.
Is proto3 final?

I like the simplicity of the protobuf encoding.
But I think it has one issue when serializing to a stream.
The problem is with length-delimited fields: the writer has to know the length ahead of time.
If we have a very long string, we need to encode the entire string before we know its length, so we basically duplicate the data in memory.
The same is true for embedded messages, where we need to encode the entire embedded message before we can append it to the stream.
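
To make the buffering problem concrete, here is a minimal Python sketch of how a length-delimited field (wire type 2) is written today. The helper names (`encode_varint`, `write_length_delimited`, `write_embedded`) are made up for illustration and are not the API of any protobuf library.

```python
import io


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def write_length_delimited(stream, field_number: int, payload: bytes) -> None:
    """Write tag, length, payload. The length is written before the payload."""
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    stream.write(encode_varint(tag))
    stream.write(encode_varint(len(payload)))  # needs the complete payload
    stream.write(payload)


def write_embedded(stream, field_number: int, fields) -> None:
    """Embedded message: the child is serialized into a temporary buffer first."""
    child = io.BytesIO()
    for child_field, child_payload in fields:
        write_length_delimited(child, child_field, child_payload)
    # Only now is the child's length known, so only now can it be written.
    write_length_delimited(stream, field_number, child.getvalue())
```

The temporary `child` buffer in `write_embedded` is the extra in-memory copy being described. A real implementation could avoid the copy by computing sizes in a separate pass first, but that still means walking the message before writing any of it.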

I think there is a simple solution for both issues.

For strings and byte arrays, a simple solution is to use "chunked encoding":
the byte array is split into chunks, and every chunk starts with the chunk length. The end of the array is indicated by a chunk length of zero.
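
A rough Python sketch of the chunked encoding proposed here, assuming chunk lengths are written as varints and a zero length ends the array. This is not part of the protobuf wire format, and the function names are hypothetical.

```python
import io


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_varint(stream) -> int:
    """Read one varint from a binary stream."""
    result = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("stream ended inside a varint")
        byte = raw[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def write_chunked(stream, chunks) -> None:
    """Write each chunk as soon as it is available; the total length is never needed."""
    for chunk in chunks:
        if not chunk:
            continue  # an empty chunk would be mistaken for the terminator
        stream.write(encode_varint(len(chunk)))
        stream.write(chunk)
    stream.write(encode_varint(0))  # zero length marks the end of the array


def read_chunked(stream) -> bytes:
    """Reassemble the byte array from its chunks."""
    parts = []
    while True:
        length = read_varint(stream)
        if length == 0:
            return b"".join(parts)
        parts.append(stream.read(length))
```

Because `write_chunked` accepts any iterable, the chunks can come from a generator that reads a file or socket piece by piece.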

For embedded messages, the solution is to have a "start embedding" tag and an "end embedding" tag.
Everything in between is the embedded message.

By adding these two new features, serialization can be fully streamable and there is no need to pre-serialize big chunks in memory before writing them to the stream.

Hope you'll find this suggestion useful and incorporate it into the protocol.

Thanks,
Yoav.


Yoav H

Mar 26, 2016, 11:31:14 PM
to Protocol Buffers
Any comment on this?
Will you consider this for proto3?

Peter Hultqvist

Mar 28, 2016, 1:24:17 PM
to Yoav H, Protocol Buffers

This exact suggestion came up for discussion a long time ago (years? before proto2?).

When it comes to taking suggestions, I'm only a third-party implementer, but my understanding is that the design process of protocol buffers and its goals are internal to Google, and they usually publish new versions of their code implementing new features before you can read about them in the documentation.


Yoav H

Mar 29, 2016, 1:53:23 AM
to Protocol Buffers, joe.dai...@gmail.com
They say on their website: "When evaluating new features, we look for additions that are very widely useful or very simple".
What I'm suggesting here is both very useful (speeding up serialization and eliminating memory duplication) and very simple (simple additions to the encoding, no need to change the language).
So far, no response from the Google guys...

Feng Xiao

Mar 29, 2016, 8:06:46 PM
to Yoav H, Protocol Buffers
Actually, there is already a "start embedding" tag and an "end embedding" tag in protobuf:

Type  Meaning      Used for
3     Start group  groups (deprecated)
4     End group    groups (deprecated)

They are deprecated though.

You mentioned it will be a performance gain, but our experience at Google says otherwise. For example, in a lot of places we are only interested in a few fields and want to skip over all the other fields (if we are building a proxy, or the field is simply an unknown field). The start group/end group tag pair forces the parser to decode every single field in the whole group even if the whole group is going to be ignored after parsing, and that's a very significant drawback.
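
The difference in skipping cost can be sketched in Python as follows. This is an illustrative skipper written for this thread, not code from the protobuf library; `skip_field` and its signature are assumptions.

```python
from typing import Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, position just past the varint)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def skip_field(buf: bytes, pos: int, wire_type: int, field_number: int) -> int:
    """Return the position just past a field value; pos points right after its tag."""
    if wire_type == 0:  # varint
        _, pos = read_varint(buf, pos)
        return pos
    if wire_type == 1:  # 64-bit
        return pos + 8
    if wire_type == 2:  # length-delimited: read the length, then one jump
        length, pos = read_varint(buf, pos)
        return pos + length
    if wire_type == 5:  # 32-bit
        return pos + 4
    if wire_type == 3:  # start group: every nested tag must be decoded
        while True:
            tag, pos = read_varint(buf, pos)
            inner_type, inner_field = tag & 0x07, tag >> 3
            if inner_type == 4:  # end group
                if inner_field != field_number:
                    raise ValueError("mismatched end group tag")
                return pos
            pos = skip_field(buf, pos, inner_type, inner_field)
    raise ValueError(f"cannot skip wire type {wire_type}")
```

For wire type 2 the cost of skipping does not depend on what is inside the field. For a group, the loop visits every nested field, recursively, until it finds the matching end tag.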

And adding a new wire type to protobuf is not a simple thing. Actually, I don't think we have ever added a new wire type to protobuf. There are a lot of issues to consider. For example, won't all code that switches on protobuf wire types suddenly be broken? If a new serializer uses the new wire type in its output, what will happen when parsers can't understand it?
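
As a small illustration of that concern, here is a Python sketch of a parser that dispatches on the wire type stored in the low three bits of a tag. The value 6 stands in for a hypothetical new wire type, and `describe_tag` is a made-up name.

```python
KNOWN_WIRE_TYPES = {
    0: "varint",
    1: "64-bit",
    2: "length-delimited",
    3: "start group",
    4: "end group",
    5: "32-bit",
}


def describe_tag(tag: int) -> str:
    """Split a tag into field number and wire type, rejecting unknown wire types."""
    field_number = tag >> 3
    wire_type = tag & 0x07
    if wire_type not in KNOWN_WIRE_TYPES:
        # Without knowing the wire type, the parser cannot tell where the
        # value ends, so it cannot even skip the field as unknown.
        raise ValueError(f"field {field_number}: unknown wire type {wire_type}")
    return f"field {field_number}: {KNOWN_WIRE_TYPES[wire_type]}"
```

An unknown field number is harmless to an old parser because the wire type still says how to skip the value. An unknown wire type gives it nothing to go on, so the rest of the message is unreadable.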

Proto3 is already finalized and we will not add new wire types in proto3. Whether to add it in proto4 depends on whether we have a good use for it and whether we can mitigate the risks of rolling out a new wire type.

David Yu

Mar 30, 2016, 12:12:12 AM
to Feng Xiao, Yoav H, Protocol Buffers
This is definitely the use-case where delimiting makes perfect sense (proxy/middleware service that reads part of a message).
The name 'protocol buffers' does kinda make that use-case obvious.
If you are using protobuf simply to serialize/deserialize, then start/end group would definitely benefit the streaming use-case.
Shameless plug: https://github.com/protostuff/protostuff optimizes for the latter use-case, which was mostly the reason it was created (Java only, though).



--
When the cat is away, the mouse is alone.
- David Yu

Yoav H

Mar 30, 2016, 8:27:53 PM
to Protocol Buffers, joe.dai...@gmail.com
I saw the start/end group wire types, but I couldn't find any information on them or how to use them.

Your point about skipping fields makes sense.
I think it is also solvable by applying the same idea of chunked encoding to sub-fields.
So instead of writing the full length of the child field, you allow the serializer to write it in smaller chunks.
The deserializer can then just read the chunk markers and skip the chunks.
A very basic serializer can write just one chunk (which is equivalent to the current encoding, plus one more zero marker at the end), but a more efficient serializer is free to stream the data.
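
Here is a Python sketch of skipping a chunked sub-field by hopping from one chunk marker to the next, under the same assumptions as the proposal (varint chunk lengths, zero as terminator). Again, this is a hypothetical encoding, not something protobuf parsers understand.

```python
from typing import Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, position just past the varint)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def skip_chunked(buf: bytes, pos: int) -> int:
    """Skip a chunked value without decoding its content: one jump per chunk."""
    while True:
        length, pos = read_varint(buf, pos)
        if length == 0:  # terminator
            return pos
        pos += length  # jump over the chunk body without looking inside


# A basic serializer writes one chunk: today's encoding plus a trailing zero.
single_chunk = b"\x03abc\x00"
# A streaming serializer may split the same payload into several chunks.
two_chunks = b"\x02ab\x01c\x00"
```

The cost of skipping grows with the number of chunks rather than with the number of nested fields, so a serializer that uses large chunks keeps skipping cheap.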

Regarding adding something to the encoding spec, do you allow proto2 serializers to talk to proto3 deserializers and vice versa?
I thought that if you have a protoX server, you expect clients to take the protoX file and generate a client from it, which will match the encoding of that proto version. Isn't that the case?

Thanks,
Yoav.

Josh Haberman

Apr 1, 2016, 7:21:27 PM
to Protocol Buffers, joe.dai...@gmail.com
Hi Yoav,

Chunked encoding is definitely an interesting idea, and I can see the benefits you mentioned. However, proto2 and proto3 are more or less frozen from a wire perspective. There are lots of existing clients out there already communicating with proto3, so we're not really at liberty to make any changes. Sorry about that.

Best,
Josh

Feng Xiao

Apr 1, 2016, 8:43:51 PM
to Yoav H, Protocol Buffers
Proto2 and proto3 are wire-compatible. We already have a lot of proto3 clients communicating with proto2 servers and vice versa. As Josh mentioned, we can't change proto3's wire format now.