We're using protobuf messages here to do a kind of RPC over HTTP.
Now we have the need for a service which returns either a metadata
structure or a message containing some large binary data.
The caller does not know the return type when it creates the request
(e.g. a call getContent(pathname) where pathname is either a directory
or a file: for a directory a list of files is returned and for a file
the content is returned, packed into the same structure). No problem
so far. It would be nice to add an attachment without reading its
content into memory.
But: these binary contents will become very large (hundreds of MBytes),
and as I understand the documentation, protobuf is not designed to
transfer large binary data (and as the Java developers told me, reading
a byte array with protobuf creates 3 copies of it in memory).
So what would be a good solution to this?
We thought of the following solutions:
1. Make this a transport issue :)
Replace the buffer in the message which contains the attachment by a
unique identifier which references an attachment. Attachments are
added to the RpcChannel / RpcService implementation as a reference to
some stream instance where they can be read from. At transport level
the attachments will be delivered as parts of a multipart HTTP
request / response identified by a Content-Id header.
Sure - if attachments are read in a different order than they arrive,
some of them have to be buffered in memory - this should be no problem.
This approach does not add full-fledged support of attachments to
protocol buffers messages because attachments will be stored
externally and methods like 'SerializeToOStream()' will not work
anymore because these don't know anything about attachments.
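
To make solution 1 a bit more concrete, here is a minimal sketch in Python
of how the transport could stream such a multipart body. It is only an
illustration under assumptions: the function name multipart_body, the
content id "message" for the first part and the 64 KiB chunk size are made
up for this example, and the serialized message is treated as opaque bytes
that refer to attachments only by their content id.

```python
import io

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size: read attachments in 64 KiB pieces


def multipart_body(message_bytes, attachments, boundary):
    """Yield the chunks of a multipart body, one small piece at a time.

    message_bytes: the already serialized message (opaque here); it refers
                   to attachments only by their content id.
    attachments:   dict mapping content id -> readable binary stream.
    boundary:      ASCII boundary string; it must not occur in the data.
    """
    delimiter = b"--" + boundary.encode("ascii")

    # First part: the small message itself.
    yield delimiter + b"\r\n"
    yield b"Content-Type: application/octet-stream\r\n"
    yield b"Content-ID: <message>\r\n\r\n"
    yield message_bytes

    # Following parts: one per attachment, streamed chunk by chunk so that
    # an attachment is never held in memory as a whole.
    for content_id, stream in attachments.items():
        yield b"\r\n" + delimiter + b"\r\n"
        yield b"Content-Type: application/octet-stream\r\n"
        yield b"Content-ID: <" + content_id.encode("ascii") + b">\r\n\r\n"
        while True:
            chunk = stream.read(CHUNK_SIZE)
            if not chunk:
                break
            yield chunk

    # Closing delimiter of the multipart body.
    yield b"\r\n" + delimiter + b"--\r\n"
```

The generator could be handed to any HTTP client or server that accepts an
iterable body. A real implementation would also have to pick a boundary
that cannot appear in the binary data (a long random string is the usual
choice) and parse the parts on the receiving side.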
2. Add attachment support to protocol buffers :)
What about adding a new datatype called 'attachment'? It would be
represented by a class called Attachment which holds either a reference
to a stream or a pointer to a buffer containing the binary data.
When serializing a message, all instances of the Attachment class will
be put into a list and appended after the message. The fields of type
attachment will become a kind of reference to an attachment (e.g. an
integer which holds the attachment index).
Deserializing a message could be done by another method (e.g.
parseFromStreamWithoutAttachments) which creates an instance of
Attachment for each referenced attachment. An application should query the
existence of an attachment before accessing it. If it does not exist
yet, the application should call 'loadAttachment
(attachment_identifier, Attachment*)' which appends the content of an
attachment to a previously created instance of Attachment (remember:
pointing to a buffer or to a stream). Methods like parseFromStream or
parseFromFile will create Attachment instances automatically and load
the content into memory. Of course there should be another method
which reads from the stream until the last attachment has been read.
Thus we should be able to use all of the common serialization and
parsing methods and if we're interested in handling attachments
efficiently, we should use the parseFromStreamWithoutAttachments and
loadAttachment methods.
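
The wire layout behind solution 2 could look roughly like the following
Python sketch. Everything in it is an assumption made for illustration:
the 8-byte length prefix, the helper names write_framed and read_exact,
and the Python spellings parse_without_attachments and load_attachment,
which only mirror the proposed method names. It is not an existing
protobuf API. To keep it short, load_attachment reads the next attachment
in stream order instead of taking an attachment identifier, and the
message is again treated as opaque bytes.

```python
import io
import struct

LENGTH = struct.Struct(">Q")  # assumed framing: 8-byte big-endian length prefix
CHUNK_SIZE = 64 * 1024        # hypothetical copy buffer size


def write_framed(out, message_bytes, attachments):
    """Write the message, then every attachment as (length, bytes).

    attachments: list of (size, stream) pairs; the list index is the
                 reference an attachment field in the message would carry.
    """
    out.write(LENGTH.pack(len(message_bytes)))
    out.write(message_bytes)
    for size, stream in attachments:
        out.write(LENGTH.pack(size))
        remaining = size
        while remaining:
            chunk = stream.read(min(CHUNK_SIZE, remaining))
            if not chunk:
                raise EOFError("attachment stream ended early")
            out.write(chunk)
            remaining -= len(chunk)


def read_exact(inp, n):
    """Read exactly n bytes from a buffered stream or fail."""
    data = inp.read(n)
    if len(data) != n:
        raise EOFError("unexpected end of stream")
    return data


def parse_without_attachments(inp):
    """Read only the message bytes; inp is left at the first attachment."""
    (size,) = LENGTH.unpack(read_exact(inp, LENGTH.size))
    return read_exact(inp, size)


def load_attachment(inp, sink):
    """Copy the next attachment from inp to sink in chunks; return its size."""
    (remaining,) = LENGTH.unpack(read_exact(inp, LENGTH.size))
    copied = 0
    while remaining:
        chunk = inp.read(min(CHUNK_SIZE, remaining))
        if not chunk:
            raise EOFError("attachment truncated")
        sink.write(chunk)
        remaining -= len(chunk)
        copied += len(chunk)
    return copied
```

A receiver would call parse_without_attachments first, inspect the
message, and then call load_attachment once per attachment with a file or
socket as the sink, so only one chunk is in memory at a time. Random
access by attachment index would need either a seekable stream or an
offset table, which this sketch leaves out.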
I am not sure how attachments of embedded messages could be handled -
should these be serialized at the end of the message itself or at the
end of the containing message(s)? Of course, this requires an instance
of the stream where the attachments could be read from in the
application.
What do you think?
- Would this be a solution to the attachment issue and would you like
to see this in protocol buffers?
- How would you deal with the issue?
Regards,
Ronny