A quick question regarding writing protobuf message to Stream preceded by Header

Saptarshi

Aug 23, 2009, 12:06:37 PM

to Protocol Buffers

Hello,
I would like to write the size of the serialized message, followed by
the message itself, to stdout.
One approach is to serialize to an array, write the length, and then write the
array.

Method 1
v = ByteSize()
SerializeWithCachedSizesToArray(uint8* data)
and then write v followed by data

But this requires I create an array.

Method 2
Another approach is to write ByteSize() to stdout and then call
SerializeWithCachedSizes(OstreamOutputStream), writing directly to
stdout.

Q:
Is the second method more efficient? I just need to create messages
and throw them out.
I don't need the serialized data hanging around.

Regards
Saptarshi

Kenton Varda

Aug 24, 2009, 3:29:59 PM

to Saptarshi, Protocol Buffers

Generally the most efficient way to serialize a message to stdout is:

message.SerializeToFileDescriptor(STDOUT_FILENO);

(If your system doesn't define STDOUT_FILENO, just use the number 1.)

If you normally use C++'s cout, you might want to write to that instead:

message.SerializeToOstream(std::cout);

For small messages, it may be slightly faster to serialize to a string and then write that. But the difference there would be small, and if it matters to you we should probably just fix the protobuf library to do this optimization automatically...

All of these methods require that you write the size first if you intend to write multiple messages to the stream.
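
A minimal sketch of that framing, assuming a fixed 4-byte little-endian length prefix (the thread does not say which encoding to use, so that choice is an assumption). The payload string stands in for an already-serialized message, and the names WriteFrame and ReadFrame are hypothetical helpers, not part of the protobuf library:

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Writes one frame: a 4-byte little-endian length, then the payload bytes.
// The payload stands in for an already-serialized message.
bool WriteFrame(std::ostream& out, const std::string& payload) {
  if (payload.size() > 0xFFFFFFFFu) return false;  // does not fit the prefix
  const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
  const char header[4] = {static_cast<char>(n & 0xFF),
                          static_cast<char>((n >> 8) & 0xFF),
                          static_cast<char>((n >> 16) & 0xFF),
                          static_cast<char>((n >> 24) & 0xFF)};
  out.write(header, 4);
  out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
  return static_cast<bool>(out);
}

// Reads one frame into *payload. Returns false on EOF or a truncated frame.
// A real reader would also cap n before resizing.
bool ReadFrame(std::istream& in, std::string* payload) {
  unsigned char header[4];
  if (!in.read(reinterpret_cast<char*>(header), 4)) return false;
  const std::uint32_t n = static_cast<std::uint32_t>(header[0]) |
                          (static_cast<std::uint32_t>(header[1]) << 8) |
                          (static_cast<std::uint32_t>(header[2]) << 16) |
                          (static_cast<std::uint32_t>(header[3]) << 24);
  payload->resize(n);
  if (n > 0 && !in.read(&(*payload)[0], static_cast<std::streamsize>(n))) {
    return false;
  }
  return true;
}
```

Without the prefix a reader cannot tell where one message ends and the next begins, because a serialized message carries no end marker of its own.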

Saptarshi Guha

Aug 27, 2009, 5:06:15 PM

to Kenton Varda, Protocol Buffers

Hello
I was thinking about this and had some questions

On Mon, Aug 24, 2009 at 3:29 PM, Kenton Varda<ken...@google.com> wrote:
> Generally the most efficient way to serialize a message to stdout is:
> message.SerializeToFileDescriptor(STDOUT_FILENO);
> (If your system doesn't define STDOUT_FILENO, just use the number 1.)
> If you normally use C++'s cout, you might want to write to that instead:
> message.SerializeToOstream(std::cout);

Does the protobuf library buffer on the file descriptor? Or does it depend
on OS-level buffering? Given a file descriptor, I guess it
uses "write" calls
and not fwrite.
I am opening stdout in binary mode, changing the buffer size (setvbuf),
and writing to that.
If I give SerializeToFileDescriptor the file descriptor of this new
FILE* object, I guess it won't
use my buffer (I know fwrite uses write, but does write care about the
buffer of the FILE* object?).

> For small messages, it may be slightly faster to serialize to a string and
> then write that. But the difference there would be small, and if it matters
> to you we should probably just fix the protobuf library to do this
> optimization automatically...

I should point out that my messages will be in the kb and definitely
less than an MB.

You mention serializing to string. However, I also see a method
"SerializeToArray".
What is the difference?
To avoid repeated mallocs/free, I intend to keep one global
array (resizing if required),
writing to that array, keeping track of the bytes written, and
writing the array out to the stream.
Since my app is not threaded, I do not have an issue of multiple
threads writing to that single array.
However, if SerializeToFileDescriptor is still better than this
approach, there is no need for this.

> All of these methods require that you write the size first if you intend to
> write multiple messages to the stream.

Yes, I will be writing the length first.

I should point out I haven't had much experience with write and fwrite, so
my understanding might be incomplete.

Many thanks for the advice
Regards
Saptarshi

Kenton Varda

Aug 27, 2009, 10:18:16 PM

to sg...@purdue.edu, Protocol Buffers

On Thu, Aug 27, 2009 at 2:06 PM, Saptarshi Guha <saptars...@gmail.com> wrote:

> Hello
> I was thinking about this and had some questions
>
> On Mon, Aug 24, 2009 at 3:29 PM, Kenton Varda<ken...@google.com> wrote:
> > Generally the most efficient way to serialize a message to stdout is:
> > message.SerializeToFileDescriptor(STDOUT_FILENO);
> > (If your system doesn't define STDOUT_FILENO, just use the number 1.)
> > If you normally use C++'s cout, you might want to write to that instead:
> > message.SerializeToOstream(std::cout);
>
> Does the protobuf library buffer on the file descriptor?

Yes.

> I am opening stdout in binary mode, changing the buffer size (setvbuf),
> and writing to that.
> If I give SerializeToFileDescriptor the file descriptor of this new
> FILE* object, I guess it won't
> use my buffer (I know fwrite uses write, but does write care about the
> buffer of the FILE* object?).

That is correct. FILE* adds a buffering layer on top of the fd. If you wanted protobuf to write to that buffer, you could probably write an implementation of protobuf::io::CopyingOutputStream for FILE* and wrap it in a protobuf::io::CopyingOutputStreamAdaptor, then pass that to message.SerializeToZeroCopyStream().
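
The layering can be seen without protobuf at all. The sketch below is an illustration for POSIX systems only, and BufferedThenDirect is a hypothetical helper name. It puts one byte into the FILE* buffer with fwrite(), then sends another byte straight to the descriptor with write(), which is the path a descriptor-based writer takes:

```cpp
#include <cstddef>
#include <stdio.h>
#include <string>
#include <unistd.h>  // write(), lseek(), read(); POSIX only

// Returns the bytes that end up in a temporary file after one buffered
// fwrite() followed by one direct write() on the same descriptor.
std::string BufferedThenDirect() {
  FILE* f = tmpfile();
  if (f == nullptr) return "";
  static char buf[1 << 16];
  setvbuf(f, buf, _IOFBF, sizeof buf);  // full buffering, 64 KiB

  fwrite("A", 1, 1, f);  // stays in the stdio buffer for now
  const int fd = fileno(f);
  if (write(fd, "B", 1) != 1) {  // bypasses the stdio buffer entirely
    fclose(f);
    return "";
  }
  fflush(f);  // only now is "A" handed to the descriptor

  std::string result(2, '\0');
  lseek(fd, 0, SEEK_SET);
  const ssize_t got = read(fd, &result[0], 2);
  fclose(f);
  result.resize(got > 0 ? static_cast<std::size_t>(got) : 0);
  return result;
}
```

On a typical POSIX system the file should contain "BA" rather than "AB": the byte passed to write() lands first because it never goes through the buffer configured with setvbuf(). Mixing the two paths on one stream therefore needs an fflush() before every direct write, or output can be reordered.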

> > For small messages, it may be slightly faster to serialize to a string and
> > then write that. But the difference there would be small, and if it matters
> > to you we should probably just fix the protobuf library to do this
> > optimization automatically...
>
> I should point out that my messages will be in the kb and definitely
> less than an MB.

For "small messages", I mean ~4kb or less. The issue is that SerializeToFileDescriptor() allocates an 8k buffer internally, which is wasteful if the message is much less than 8k. We should fix it so that it doesn't do that for small messages.

> You mention serializing to string. However, I also see a method
> "SerializeToArray".
> What is the difference?

With SerializeToArray() you need to make sure the array is big enough ahead of time, whereas SerializeToString() will allocate a string of the correct size. You can call ByteSize() in order to size your array, but when you then call SerializeToArray() it will have to call ByteSize() again internally, which is wasteful. To allocate a correctly-sized array and serialize to it with optimal efficiency you have to use ByteSize() and then call SerializeToArrayWithCachedSizes() -- which reuses the sizes computed by the previous ByteSize() call. Actually, I guess that's not very hard, is it? It used to be harder.

> To avoid repeated mallocs/free, I intend to keep one global
> array (resizing if required)

If you reuse a single std::string object, you should get the same effect. string::clear() does not free the backing array, it just sets the size to zero. So, it will reuse that array the next time you serialize into it.
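
A small sketch of that reuse pattern, with CapacityAfterClear as a hypothetical helper and a run of 'x' bytes standing in for serialized data. The C++ standard does not strictly promise that clear() keeps the allocation, so treat the retained capacity as the behaviour of the common standard library implementations rather than a guarantee:

```cpp
#include <cstddef>
#include <string>

// Fills *buffer with `size` bytes (standing in for a serialized message),
// clears it, and reports the capacity that survives the clear().
std::size_t CapacityAfterClear(std::string* buffer, std::size_t size) {
  buffer->assign(size, 'x');  // may allocate on the first call
  buffer->clear();            // size becomes 0; storage is normally kept
  return buffer->capacity();
}
```

If the returned capacity is at least `size`, the next message of the same size or smaller can be written into the string without another allocation.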

> writing to that array, keeping track of the bytes written, and
> writing the array out to the stream.
> Since my app is not threaded, I do not have an issue of multiple
> threads writing to that single array.
> However, if SerializeToFileDescriptor is still better than this
> approach, there is no need for this.

SerializeToFileDescriptor() is better if your messages are very large because it avoids allocating large contiguous blocks of memory, which can cause memory fragmentation. Otherwise it has no advantage over serializing to an array and then writing it to the file.
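
To illustrate why block-wise output avoids one large contiguous allocation, here is a toy sketch. It is not protobuf's implementation: BlockWriter is a hypothetical class, and the 8192-byte default only mirrors the 8k buffer size mentioned above. Bytes are collected in one fixed-size block that is forwarded to the sink each time it fills up:

```cpp
#include <cstddef>
#include <cstring>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects bytes in a fixed-size block and forwards the
// block to `sink` whenever it fills up, so memory use stays bounded no matter
// how much data passes through. block_size must be greater than zero.
class BlockWriter {
 public:
  explicit BlockWriter(std::ostream* sink, std::size_t block_size = 8192)
      : sink_(sink), block_(block_size), used_(0), flushes_(0) {}

  // Copies `size` bytes in, flushing every time the block becomes full.
  void Append(const char* data, std::size_t size) {
    while (size > 0) {
      const std::size_t room = block_.size() - used_;
      const std::size_t n = size < room ? size : room;
      std::memcpy(block_.data() + used_, data, n);
      used_ += n;
      data += n;
      size -= n;
      if (used_ == block_.size()) Flush();
    }
  }

  // Sends any pending bytes to the sink.
  void Flush() {
    if (used_ == 0) return;
    sink_->write(block_.data(), static_cast<std::streamsize>(used_));
    used_ = 0;
    ++flushes_;
  }

  std::size_t flushes() const { return flushes_; }

 private:
  std::ostream* sink_;
  std::vector<char> block_;
  std::size_t used_;
  std::size_t flushes_;
};
```

Pushing 20000 bytes through the default block takes three flushes (two full blocks and one partial), while the writer itself never holds more than 8192 bytes. The cost is the extra copy into the block, which is why this approach has no advantage for messages that fit comfortably in a single array.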

> > All of these methods require that you write the size first if you intend to
> > write multiple messages to the stream.
>
> Yes, I will be writing the length first.

Ah, of course, in this case you have to call ByteSize() anyway, so if you're really worried about performance then you want to call Serialize*WithCachedSizes().

Kenton Varda

Aug 27, 2009, 10:19:18 PM

to sg...@purdue.edu, Protocol Buffers

BTW, when I talk about one thing being more efficient than another, it's really a matter of a few percent difference. For the vast majority of apps, it does not matter. I'd suggest not worrying about it unless you're really sure you need to improve your performance *and* profiling shows that you spend a lot of time in protobuf code.

Saptarshi Guha

Aug 28, 2009, 11:14:41 AM

to Kenton Varda, Protocol Buffers

Hello,
Thanks much for the answers. I did perform some tests and your
statements hold true (the differences are marginal, however),
i.e., for small messages (~7kb), the FDescriptor method is faster than
SerializeToString. For larger messages the latter is faster.

I tried a typical case (for me): creating an R runif(N) object (once),
serializing it using ProtoBufs, writing it out, and repeating this M
times.
For N of, say, 125, *FD is better, and for larger N (2000, about 15KB)
SerializeToString is better. However, I did notice about a 10% improvement
(not a very rigorous experiment) for the *FD method over the *String method
when it came to writing tiny messages (~1KB) 10MM (=M) times.

Surprisingly, the output to array is much slower than the other two.

Thanks for your input, it was really helpful.
Regards
Saptarshi

Kenton Varda

Aug 28, 2009, 12:49:56 PM

to sg...@purdue.edu, Protocol Buffers

On Fri, Aug 28, 2009 at 8:14 AM, Saptarshi Guha <saptars...@gmail.com> wrote:

> Hello,
> Thanks much for the answers. I did perform some tests and your
> statements hold true (marginal differences however)
> i.e for small messages (~7kb), the FDescriptor method is faster than
> SerializeToString. For larger messages the latter is faster.

Err, I had it the other way around. :) SerializeToFileDescriptor() should definitely be slower than SerializeToString() for small messages. And actually, for large messages it is still probably slower, but avoiding memory fragmentation seems more important.

> Surprisingly, the output to array is much slower than the other two.

That doesn't seem right, since SerializeToString() and SerializeToArray() share the same implementation. The only difference is that SerializeToString() has to allocate space first, which should make it slower.

Dave W.

Sep 7, 2009, 10:03:29 AM

to Protocol Buffers

> To allocate a correctly-sized array and
> serialize to it with optimal efficiency you have to use ByteSize() and then
> call SerializeToArrayWithCachedSizes() -- which reuses the sizes computed by
> the previous ByteSize() call.

Where is this SerializeToArrayWithCachedSizes() call? I can't find it
anywhere in the code. Does it really exist somewhere?

Kenton Varda

Sep 8, 2009, 12:45:13 AM

to Dave W., Protocol Buffers

Sorry, it's actually SerializeWithCachedSizesToArray(). It's defined on the MessageLite interface so every protocol message object has this method.

Paul Kotlyar

Aug 2, 2025, 3:51:20 AM

to Protocol Buffers

I solved this problem https://medium.com/@unclepaul84/efficiently-reading-and-writing-very-large-protobuf-files-local-disk-and-s3-approaches-a289c8855606
