Adding large binary attachments to protobuf messages


duselbaer

Jan 19, 2010, 7:46:30 AM
to Protocol Buffers
Hi there,

we're using protobuf messages here to do some kind of RPC over HTTP.
Now we need a service which returns either a metadata structure or a
message containing some large binary data.

The caller does not know the return type upon request instantiation
(e.g. a call getContent(pathname) where pathname is either a directory
or a file - for a directory a list of files is returned, and for a file
the content is returned - packed into the same structure). No problem
so far. It would be nice to add an attachment without reading its
content into memory.

But: these binary contents will become very large (hundreds of
MBytes), and as I understood the documentation, protobuf is not
designed to transfer large binary data (and as the Java developers
told me, reading a byte array with protobuf creates 3 copies of it in
memory).

So what would be a good solution to this?

We thought of the following solutions:

1. Make this a transport issue :)

Replace the buffer in the message which contains the attachment with
a unique identifier that references the attachment. Attachments are
added to the RpcChannel / RpcService implementation as a reference to
some stream instance they can be read from. At the transport level the
attachments are delivered as parts of a multipart HTTP request /
response, identified by a Content-Id header.

Sure - when attachments are read in the wrong order, some of them have
to be buffered in memory - but this should be no problem.
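
As a rough sketch, the channel side could look like this (names like
RpcChannelWithAttachments and addAttachment are just made up for
illustration):

  import java.io.InputStream;

  // Hypothetical extension of the RpcChannel implementation: the
  // message only carries a Content-Id string; the stream itself is
  // handed to the transport, which writes it as its own multipart part.
  interface RpcChannelWithAttachments {
    // Register a stream under a Content-Id before sending the request.
    void addAttachment(String contentId, InputStream data);

    // Look up an attachment part that arrived in a multipart response.
    // May have to buffer parts that were received out of order.
    InputStream getAttachment(String contentId);
  }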

This approach does not add full-fledged support for attachments to
protocol buffers messages: since the attachments are stored
externally, methods like 'SerializeToOstream()' will no longer work,
because they don't know anything about attachments.

2. Add attachment support to protocol buffers :)

What about adding a new datatype called 'attachment'? Let's create a
class Attachment which holds either a reference to a stream or a
pointer to a buffer containing the binary data.

When serializing a message, all instances of the Attachment class are
put into a list and appended after the message. The fields of type
attachment become some kind of reference to an attachment (e.g. an
integer which marks the attachment index).

Deserializing a message could be done by another method which creates
an instance of Attachment for each referenced attachment (e.g.
parseFromStreamWithoutAttachments). An application should query the
existence of an attachment before accessing it. If it does not exist
yet, the application should call
'loadAttachment(attachment_identifier, Attachment*)', which appends
the content of an attachment to a previously created instance of
Attachment (remember: pointing to a buffer or to a stream). Methods
like parseFromStream or parseFromFile will create Attachment instances
automatically and load the content into memory. Of course, there
should be another method which reads from the stream until the last
attachment has been read.

Thus we should be able to use all of the common serialization and
parsing methods, and if we're interested in handling attachments
efficiently, we can use the parseFromStreamWithoutAttachments and
loadAttachment methods.
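
A minimal sketch of what such an Attachment class could look like in
Java (all names are just illustrative):

  import java.io.ByteArrayInputStream;
  import java.io.InputStream;

  // Holds either a buffer or a stream - or nothing at all, if the
  // attachment was referenced in the message but not loaded yet
  // (as after parseFromStreamWithoutAttachments()).
  final class Attachment {
    private byte[] buffer;      // non-null once the content is in memory
    private InputStream stream; // non-null once a stream is attached

    // "Query the existence of an attachment before accessing it":
    // false means loadAttachment() has not filled this instance yet.
    boolean isLoaded() { return buffer != null || stream != null; }

    void setBuffer(byte[] data) { this.buffer = data; }
    void setStream(InputStream in) { this.stream = in; }

    InputStream open() {
      if (buffer != null) return new ByteArrayInputStream(buffer);
      if (stream != null) return stream;
      throw new IllegalStateException("attachment not loaded yet");
    }
  }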

I am not sure how attachments of embedded messages could be handled -
should these be serialized at the end of the embedded message itself
or at the end of the containing message(s)? Of course, this requires
the application to have an instance of the stream the attachments can
be read from.

What do you think?

- Would this be a solution to the attachment issue and would you like
to see this in protocol buffers?
- How would you deal with the issue?

Regards,

Ronny

Kenton Varda

Jan 19, 2010, 1:28:08 PM
to duselbaer, Protocol Buffers
On Tue, Jan 19, 2010 at 4:46 AM, duselbaer <ronny....@googlemail.com> wrote:
> (and as the Java developers told me, reading a byte
> array with protobuf creates 3 copies of it in memory).

This is not true in the common case.  However, it looks like there is a code path which creates 3 copies but could fairly easily be improved.  Remember that if you run into a performance problem with protobufs, it's usually because no one else has exercised that path before.  Please do not be afraid to send me patches to improve performance!
 
> So what would be a good solution to this?

Ideally:
1) Extend ByteString so that it can hold its data in multiple pieces instead of one, big, flat array.  Allocating gigantic flat arrays is dangerous so should be avoided.
2) Improve the parsing code so that it can parse large ByteStrings without making multiple copies of the data.
3) Do not count large ByteStrings against the 64MB message size limit.

Then you can use the "bytes" type to store blobs of hundreds of megs with no problems.  No new features are needed -- this is purely an optimization problem.
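
In the meantime, the default limit can already be raised by hand when
parsing - e.g. in Java, with MyMessage standing in for any generated
message type:

  import com.google.protobuf.CodedInputStream;
  import java.io.IOException;
  import java.io.InputStream;

  static MyMessage parseLarge(InputStream rawStream) throws IOException {
    CodedInputStream input = CodedInputStream.newInstance(rawStream);
    input.setSizeLimit(512 * 1024 * 1024);  // lift the default 64MB cap
    // Note: this only lifts the cap; the data still lands in one flat
    // array, so steps 1 and 2 above are still needed for efficiency.
    return MyMessage.parseFrom(input);
  }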

For the first step above, I think ByteString should have an internal interface like:

  private static interface Node {
    int size();
    byte get(int index);
    // other stuff
  }

There should then be two implementations of Node:  One that is a flat array and implemented as ByteString is currently, and one which is a concatenation of some set of sub-nodes.  The ByteString itself then just has a single field which points to the root Node.  Hopefully, this approach would allow the JIT to inline calls to get() in the common case where ByteStrings are composed of just one flat array, so that this case does not become significantly slower than it is now.  This approach also makes it easy to implement concatenation in O(1) time and substring in O(log n).
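
Roughly, the two implementations might look like this (names are just
illustrative):

  // Flat node: wraps a single byte array, as ByteString does today.
  private static final class FlatNode implements Node {
    private final byte[] bytes;

    FlatNode(byte[] bytes) { this.bytes = bytes; }

    public int size() { return bytes.length; }
    public byte get(int index) { return bytes[index]; }
  }

  // Concatenation node: joins two sub-nodes without copying, giving
  // O(1) concatenation; get() walks down one side of the tree.
  private static final class ConcatNode implements Node {
    private final Node left;
    private final Node right;
    private final int size;  // cached so size() stays O(1)

    ConcatNode(Node left, Node right) {
      this.left = left;
      this.right = right;
      this.size = left.size() + right.size();
    }

    public int size() { return size; }

    public byte get(int index) {
      return index < left.size()
          ? left.get(index)
          : right.get(index - left.size());
    }
  }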