Using a ByteBuffer instead of a ByteString?

Nicolae Mihalache

Jan 11, 2011, 12:45:27 AM
to Protocol Buffers
Hello,

I recently started to use GPB, great software! :)

But I have noticed that in Java it is impossible to create a message
containing a "bytes" field without copying some buffers around. For
example, if I have an encoded message of 1MB with a few regular fields
and one big bytes field, decoding the message will make a copy of the
entire buffer instead of keeping a reference to it.

Even worse when encoding: if I read some data from a file, it does not
seem possible to put it directly into a ByteString, so I first have to
make a byte[], then copy it into the ByteString, and when encoding, it
makes yet another byte[].


So my question: is it possible to make an exception from the
immutability for the "bytes" fields and use java.nio.ByteBuffers
instead of ByteStrings?

thanks,
nicolae

Evan Jones

Jan 11, 2011, 3:25:49 PM
to Nicolae Mihalache, Protocol Buffers
On Jan 11, 2011, at 0:45 , Nicolae Mihalache wrote:
> But I have noticed in java that it is impossible to create a message
> containing a "bytes" fields without copying some buffers around. For
> example if I have a encoded message of 1MB with a few regular fields
> and one big bytes field, decoding the message will make a copy of the
> entire buffer instead of keeping a reference to it.

By "decoding" I'm assuming you mean deserializing the message from a
file or something.

This is a disadvantage, but it makes things much easier: it means the
buffer used to read data can be recycled for the next message. Without
this copy, the library would need to do complicated tracking of chunks
of memory to determine if they are "in use" or not.
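
A minimal sketch of that trade-off, using only java.nio rather than the protobuf library. The class RecycledBufferDemo and its helpers are hypothetical names made up for illustration. It shows that a field kept as a view of the read buffer gets overwritten when the buffer is reused for the next message, while a copied field does not:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Protocol Buffers.
class RecycledBufferDemo {
    // "Parse" a field by keeping a view of the read buffer (no copy).
    static ByteBuffer fieldAsView(ByteBuffer readBuffer, int offset, int length) {
        ByteBuffer view = readBuffer.duplicate();
        view.limit(offset + length);
        view.position(offset);
        return view.slice(); // shares the same underlying bytes
    }

    // "Parse" a field by copying it out of the read buffer.
    static byte[] fieldAsCopy(ByteBuffer readBuffer, int offset, int length) {
        ByteBuffer src = readBuffer.duplicate();
        src.limit(offset + length);
        src.position(offset);
        byte[] copy = new byte[length];
        src.get(copy);
        return copy;
    }

    // Decode the remaining bytes of a buffer without moving its position.
    static String text(ByteBuffer b) {
        byte[] out = new byte[b.remaining()];
        b.duplicate().get(out);
        return new String(out, StandardCharsets.US_ASCII);
    }

    // Returns { text seen through the view, text held by the copy }
    // after the read buffer has been recycled for a second message.
    static String[] run() {
        ByteBuffer readBuffer = ByteBuffer.allocate(16);
        readBuffer.put("first!".getBytes(StandardCharsets.US_ASCII)); // message 1

        ByteBuffer view = fieldAsView(readBuffer, 0, 6);
        byte[] copy = fieldAsCopy(readBuffer, 0, 6);

        readBuffer.clear(); // recycle the buffer
        readBuffer.put("second".getBytes(StandardCharsets.US_ASCII)); // message 2

        return new String[] { text(view), new String(copy, StandardCharsets.US_ASCII) };
    }

    public static void main(String[] args) {
        String[] r = run();
        System.out.println("view: " + r[0] + ", copy: " + r[1]);
    }
}
```

Avoiding the copy would mean the reader has to know that the view is still alive before it reuses the buffer, which is the "in use" tracking mentioned above.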

However, now that you mention it: in the case of big buffers,
CodedInputStream.readBytes() gets called, which currently makes 2
copies of the data (it calls readRawBytes() then calls
ByteString.copyFrom()). This could probably be "fixed" in
CodedInputStream.readBytes(), which might improve performance a fair
bit. I'll put this on my TODO list of things to look at, since I think
my code does this pretty frequently.


> Even worse when encoding: if I read some data from file, does not seem
> possible to put it directly into a ByteString so I have to make first
> a byte[], then copy it into the ByteString and when encoding, it makes
> yet another byte[].

The copy cannot be avoided: it is what keeps the API simple (thread
safety, no need to worry about the ByteBuffer being accidentally
changed, etc.). The latest version of Protocol Buffers in Subversion
has ByteString.copyFrom(ByteBuffer), which will do what you want
efficiently.

Evan

--
Evan Jones
http://evanjones.ca/

Nicolae Mihalache

Jan 11, 2011, 5:53:46 PM
to Protocol Buffers
On Jan 11, 9:25 pm, Evan Jones <ev...@MIT.EDU> wrote:
> This is a disadvantage, but it makes things much easier: it means the  
> buffer used to read data can be recycled for the next message. Without  
> this copy, the library would need to do complicated tracking of chunks  
> of memory to determine if they are "in use" or not.
I have read in several places that allocating objects in Java, rather
than reusing them, is not so bad: the garbage collector is smart enough
to take care of it.

> However, now that you mention it: in the case of big buffers,  
> CodedInputStream.readBytes() gets called, which currently makes 2  
> copies of the data (it calls readRawBytes() then calls  
> ByteString.copyFrom()). This could probably be "fixed" in  
> CodedInputStream.readBytes(), which might improve performance a fair  
> bit. I'll put this on my TODO list of things to look at, since I think  
> my code does this pretty frequently.
ok, thanks.

>
> The copy cannot be avoided because it makes the API simpler (thread-
> safety, don't need to worry about the ByteBuffer being accidentally  
> changed, etc). The latest version of Protocol Buffers in Subversion  
> has ByteString.copyFrom(ByteBuffer) which will do what you want  
> efficiently.
>
I want to avoid copying data as much as possible (I'm aware it will
not be possible to eliminate it altogether).
I thought it wouldn't be so difficult to put an option in a message
definition that will make protoc generate ByteBuffer fields instead of
ByteString.
Then with a corresponding method in CodedInput/OutputStream it should
work, right?


Kenton Varda

Jan 12, 2011, 1:00:18 AM
to Nicolae Mihalache, Protocol Buffers
On Mon, Jan 10, 2011 at 9:45 PM, Nicolae Mihalache <xpro...@gmail.com> wrote:
> Hello,
>
> I recently started to use GPB, great software! :)
>
> But I have noticed that in Java it is impossible to create a message
> containing a "bytes" field without copying some buffers around. For
> example, if I have an encoded message of 1MB with a few regular fields
> and one big bytes field, decoding the message will make a copy of the
> entire buffer instead of keeping a reference to it.

We are actually looking at fixing this by allowing ByteStrings to share buffers.
 
> Even worse when encoding: if I read some data from a file, it does not
> seem possible to put it directly into a ByteString, so I first have to
> make a byte[], then copy it into the ByteString, and when encoding, it
> makes yet another byte[].

ByteString provides multiple methods of construction.  One is to copy from a byte array.  Another is to use an OutputStream that writes into a ByteString.  In future versions, we are looking at making it possible to concatenate ByteStrings without a copy.

But yes, if you start with a byte[], and you want a ByteString with the same content, you are going to need to make a copy, because ByteString has to guarantee immutability.
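
A minimal sketch of why that copy is needed. ImmutableBytes is a hypothetical stand-in written for this example, not protobuf's ByteString, and unsafeWrap exists only for contrast. If the container shared the caller's array, any later write to that array would change the "immutable" value:

```java
// Hypothetical stand-in for an immutable byte container; not protobuf's ByteString.
final class ImmutableBytes {
    private final byte[] bytes;

    private ImmutableBytes(byte[] bytes) {
        this.bytes = bytes;
    }

    // Defensive copy: the caller keeps its array, the container keeps a private clone.
    static ImmutableBytes copyFrom(byte[] source) {
        return new ImmutableBytes(source.clone());
    }

    // Unsafe alternative, shown only for contrast: shares the caller's array.
    static ImmutableBytes unsafeWrap(byte[] source) {
        return new ImmutableBytes(source);
    }

    byte byteAt(int index) {
        return bytes[index];
    }

    int size() {
        return bytes.length;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3};
        ImmutableBytes copied = copyFrom(data);
        ImmutableBytes wrapped = unsafeWrap(data);

        data[0] = 42; // the caller mutates its array afterwards

        System.out.println(copied.byteAt(0));  // prints 1
        System.out.println(wrapped.byteAt(0)); // prints 42
    }
}
```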
 
> So my question: is it possible to make an exception from the
> immutability for the "bytes" fields and use java.nio.ByteBuffers
> instead of ByteStrings?

No, sorry, making any exception to immutability would end up unraveling the whole library.  You can go from ByteString to ByteBuffer without a copy (by calling asReadOnlyByteBuffer()), but you can't go the other way, because there is no way to know given a ByteBuffer pointer whether or not someone might be able to modify it in the future.
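
A short sketch of the "someone might modify it in the future" problem, using only java.nio. Note that it uses ByteBuffer.asReadOnlyBuffer() from the JDK, not ByteString.asReadOnlyByteBuffer(), and the class name is made up for this example. A read-only view rejects writes made through the view itself, but it still reflects writes made through the original buffer:

```java
import java.nio.ByteBuffer;
import java.nio.ReadOnlyBufferException;

// Hypothetical demo class; not part of Protocol Buffers.
class ReadOnlyIsNotImmutable {
    // Returns the first byte seen through a read-only view after the
    // original writable buffer has been modified.
    static byte firstByteAfterWrite() {
        ByteBuffer writable = ByteBuffer.wrap(new byte[] {1, 2, 3});
        ByteBuffer readOnly = writable.asReadOnlyBuffer(); // shares content, no copy

        writable.put(0, (byte) 99); // someone else still holds the writable buffer
        return readOnly.get(0);     // absolute get: sees the change
    }

    // Returns true if writing through the read-only view is rejected.
    static boolean viewRejectsWrites() {
        ByteBuffer readOnly = ByteBuffer.allocate(4).asReadOnlyBuffer();
        try {
            readOnly.put(0, (byte) 1);
            return false;
        } catch (ReadOnlyBufferException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstByteAfterWrite()); // prints 99
        System.out.println(viewRejectsWrites());   // prints true
    }
}
```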

Storing ByteBuffer in message objects directly has additional problems.  ByteBuffer is a stateful class -- it maintains a pointer to the current read location, for example.  So a protocol message object with ByteBuffers inside it would be thread-hostile no matter how you look at it.  This just leads to too many problems...
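
To illustrate the statefulness point, here is a small single-threaded sketch using only java.nio; the class and method names are hypothetical. A relative read moves the buffer's position, so a second reader handed the same ByteBuffer object sees nothing unless each reader works on its own duplicate():

```java
import java.nio.ByteBuffer;

// Hypothetical demo class; not part of Protocol Buffers.
class StatefulBufferDemo {
    // Consumes everything remaining in the buffer and returns the byte count.
    static int readAll(ByteBuffer buf) {
        int count = 0;
        while (buf.hasRemaining()) {
            buf.get(); // relative get: advances the buffer's position
            count++;
        }
        return count;
    }

    // Two readers sharing one ByteBuffer object: the second finds it drained.
    static int[] sharedReads() {
        ByteBuffer shared = ByteBuffer.wrap(new byte[] {10, 20, 30});
        int first = readAll(shared);
        int second = readAll(shared);
        return new int[] {first, second};
    }

    // Each reader gets its own position via duplicate(); content is still shared.
    static int[] duplicatedReads() {
        ByteBuffer shared = ByteBuffer.wrap(new byte[] {10, 20, 30});
        int first = readAll(shared.duplicate());
        int second = readAll(shared.duplicate());
        return new int[] {first, second};
    }

    public static void main(String[] args) {
        int[] a = sharedReads();
        int[] b = duplicatedReads();
        System.out.println(a[0] + " then " + a[1]); // prints "3 then 0"
        System.out.println(b[0] + " then " + b[1]); // prints "3 then 3"
    }
}
```

With two threads reading the shared buffer at once, the position updates would also race, which is the thread-hostility described above.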