Java deserialization - any best practices for performance?


Alex Black

Jul 17, 2009, 11:13:55 PM
to Protocol Buffers
When I write out messages using C++ I'm careful to clear messages and
re-use them. Is there something equivalent on the Java side when
reading those same messages in?

My code looks like:

CodedInputStream stream = CodedInputStream.newInstance(inputStream);

while ( !stream.isAtEnd() )
{
    MyMessage.Builder builder = MyMessage.newBuilder();
    stream.readMessage(builder, null);
    MyMessage myMessage = builder.build();

    for ( MessageValue messageValue : myMessage.getValuesList() )
    {
        ......
    }
}

I'm passing 150 messages each with 1000 items, so presumably memory is
allocated 150 times for each of the messages...

- Alex

Alek Storm

Jul 18, 2009, 12:25:23 AM
to Alex Black, Protocol Buffers
I think what you want is lazy parsing, which unfortunately isn't available yet.  You could always read bytes off the stream in chunks, or write your own CodedInputStream to skip to the end of each message every time it sees a length.

Alek
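
Alek's second suggestion relies on each message being preceded by its length. Here is a minimal sketch of that idea in plain Java, assuming every record is a base-128 varint length followed by that many payload bytes (which is what the loop in the original post appears to read). `LengthPrefixedSkipper` and all of its methods are hypothetical helpers written for illustration; they are not part of the protobuf library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper (not a protobuf API): steps over varint-length-prefixed
// records without parsing their payloads.
class LengthPrefixedSkipper {

    // Reads a base-128 varint of at most five bytes; returns -1 at a clean end of stream.
    static long readVarint(InputStream in) throws IOException {
        long result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.read();
            if (b == -1) {
                if (shift == 0) {
                    return -1; // no more records
                }
                throw new EOFException("truncated length prefix");
            }
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("malformed varint");
    }

    // Skips exactly n bytes, because InputStream.skip may skip fewer than requested.
    static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                if (in.read() == -1) {
                    throw new EOFException("truncated record");
                }
                skipped = 1;
            }
            n -= skipped;
        }
    }

    // Counts records by reading each length and jumping over the payload.
    static int countRecords(InputStream in) throws IOException {
        int count = 0;
        long length;
        while ((length = readVarint(in)) != -1) {
            if (length > Integer.MAX_VALUE) {
                throw new IOException("record too large");
            }
            skipFully(in, length);
            count++;
        }
        return count;
    }

    // Convenience wrapper for in-memory data; rethrows I/O errors unchecked.
    static int countRecords(byte[] data) {
        try {
            return countRecords(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes one record: varint length followed by the payload bytes.
    static void writeRecord(ByteArrayOutputStream out, byte[] payload) {
        int n = payload.length;
        while ((n & ~0x7F) != 0) {
            out.write((n & 0x7F) | 0x80);
            n >>>= 7;
        }
        out.write(n);
        out.write(payload, 0, payload.length);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeRecord(out, new byte[3]);
        writeRecord(out, new byte[300]); // length needs a two-byte varint
        writeRecord(out, new byte[0]);
        System.out.println(countRecords(out.toByteArray())); // prints 3
    }
}
```

A real reader would hand selected payloads to the parser rather than skip all of them; the sketch only shows that a length prefix lets you step over a message without allocating anything for its contents.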

Alex Black

Jul 18, 2009, 10:17:17 AM
to Protocol Buffers
Hi Alek, can you elaborate a bit on what you mean by lazy parsing?

I think what I want is to be able to *reuse* my objects, specifically
the instances of MyMessage, instead of allocating new ones each time
through the loop. This is analogous to what my C++ code does when
writing out messages: it re-uses the same message object, clearing it
between uses.

On Jul 18, 12:25 am, Alek Storm <alek.st...@gmail.com> wrote:
> I think what you want is lazy parsing, which unfortunately isn't available
> yet.  You could always read bytes off the stream in chunks, or write your
> own CodedInputStream to skip to the end of each message every time it sees a
> length.
>
> Alek
>

Kenton Varda

Jul 18, 2009, 9:22:38 PM
to Alex Black, Protocol Buffers
On Fri, Jul 17, 2009 at 8:13 PM, Alex Black <al...@alexblack.ca> wrote:

> When I write out messages using C++ I'm careful to clear messages and
> re-use them, is there something equivalent on the java side when
> reading those same messages in?

No.  Sorry.  This just doesn't fit at all with the Java library's design, and even if it did, you cannot reuse Java String objects, which often account for most of the memory usage.  However, memory allocation is cheaper in Java than in C++, so there's less to gain from it.
 

alopecoid

Jul 23, 2009, 3:32:48 AM
to Protocol Buffers
Hi,

I haven't actually used the Java protobuf API, but it seems to me from
the quick occasional glance that this isn't entirely true. I mean,
specifically in response to the code snippet posted in the original
message, I would possibly:

1. Reuse the Builder object by calling its clear() method. This would
avoid creating a new Builder object for each iteration of the
outermost loop.

2. Iterate over the repeated field using the get*Count() and get*(index)
methods instead of the get*List() method. I'm not sure if this
would save anything, but depending on how things are implemented in
the generated code, this could avoid allocating a new List object.

Also, might "bytes" type fields perform better than any "string" type
fields that you may have in your particular data set? I'm not sure,
but it might be worth benchmarking.


Kenton Varda

Jul 23, 2009, 1:42:45 PM
to alopecoid, Protocol Buffers
On Thu, Jul 23, 2009 at 12:32 AM, alopecoid <alop...@gmail.com> wrote:

> Hi,
>
> I haven't actually used the Java protobuf API, but it seems to me from
> the quick occasional glance that this isn't entirely true. I mean,
> specifically in response to the code snippet posted in the original
> message, I would possibly:
>
> 1. Reuse the Builder object by calling its clear() method. This would
> save from the need to create a new Builder object for each iteration
> of the outermost loop.

You can't continue to use a Builder after calling build().  Even if we made it so you could, it would be building an entirely new object, not reusing the old one.  We can't make it reuse the old one because that would break the immutability guarantee of message objects.

Reusing the actual builder object is not that useful since it's only a very small object containing a pointer to a message object.
 
> 2. Iterate over the repeated field using the get*Count() and get*(index)
> methods instead of the get*List() method. I'm not sure if this
> would save anything, but depending on how things are implemented in
> the generated code, this could save from allocating a new List object.

Won't save anything; we still need a list object internally.

But seriously, object allocation with a modern generational garbage collector is extremely cheap, especially for objects that don't stick around very long.  So I don't think there's much to gain here.

alopecoid

Jul 23, 2009, 10:15:52 PM
to Protocol Buffers
Hi Kenton,

Thanks for your reply.

> You can't continue to use a Builder after calling build(). Even if we made
> it so you could, it would be building an entirely new object, not reusing
> the old one. We can't make it reuse the old one because that would break
> the immutability guarantee of message objects.

Hmm... that strikes me as strange. I understand that the Message
objects are immutable, but the Builders are as well? I thought that
they would work more along the lines of String and StringBuilder,
where String is obviously immutable and StringBuilder is mutable/
reusable.

> But seriously, object allocation with a modern generational garbage
> collector is extremely cheap, especially for objects that don't stick around
> very long.  So I don't think there's much to gain here.

While I agree that object allocation is relatively cheap in Java, I
have noticed that if you generate a lot of garbage, you have to also
spend some time tweaking the garbage collector settings to avoid long/
frequent garbage collection pauses. I know that there has been a lot
of recent work done in Java 7 (and experimentally in Java 6) to avoid
this, but I haven't had the opportunity to test this yet. In fact, I
find that this is often the real source of the performance difference
between Java and C++ in the cases where C++ seems to perform
significantly faster: different object allocation practices (and,
more importantly, implementation/design choices). I don't know how
well this holds true for a spectrum of different usage patterns, but
my experience has been more from the large scale data processing side
of things. And don't get me wrong, I'm actually one of the few people
(out of my closest colleagues) who think that data processing can and
should be done in Java over C++, but that's another discussion
entirely :)

But while we're on the subject, I have been looking for some rough
benchmarks comparing the performance of Protocol Buffers in Java
versus C++. Do you (the collective you) have any [rough] idea as to
how they compare performance wise? I am thinking more in terms of
batch-style processing (disk I/O, parsing centric) rather than RPC
centric usage patterns. Any experiences you can share would be great.

Thanks!

Kenton Varda

Jul 24, 2009, 6:02:10 PM
to alopecoid, Protocol Buffers
On Thu, Jul 23, 2009 at 7:15 PM, alopecoid <alop...@gmail.com> wrote:
> Hmm... that strikes me as strange. I understand that the Message
> objects are immutable, but the Builders are as well? I thought that
> they would work more along the lines of String and StringBuilder,
> where String is obviously immutable and StringBuilder is mutable/
> reusable.

The point is that it's the Message object that contains all the stuff allocated by the Builder, and therefore none of that stuff can actually be reused.  (When you call build(), nothing is copied -- it just returns the object that it has been working on.)  So reusing the builder itself is kind of useless, because it's just a trivial object containing one pointer (to the message object it is working on constructing).
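
The design described above can be illustrated with a small sketch. `SketchMessage` and its nested `Builder` are hypothetical classes that mimic the behaviour stated in this reply (the builder holds a single reference, and build() hands that object over without copying); they are not the generated protobuf code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical message type; immutable from the caller's point of view.
final class SketchMessage {
    // Filled in by the Builder only; never modified after build() returns.
    private final List<String> values = new ArrayList<>();

    private SketchMessage() {}

    List<String> getValuesList() {
        return Collections.unmodifiableList(values);
    }

    static Builder newBuilder() {
        return new Builder();
    }

    static final class Builder {
        // The builder's only state: a reference to the message under construction.
        private SketchMessage result = new SketchMessage();

        Builder addValue(String value) {
            if (result == null) {
                throw new IllegalStateException("build() already called");
            }
            result.values.add(value);
            return this;
        }

        SketchMessage build() {
            if (result == null) {
                throw new IllegalStateException("build() already called");
            }
            SketchMessage built = result;
            result = null; // hand the message over without copying; the builder is now spent
            return built;
        }
    }
}
```

Under this assumption, everything that costs memory (the list and its strings) belongs to the message, so keeping the builder shell around would save only one small allocation per message.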
 
> But while we're on the subject, I have been looking for some rough
> benchmarks comparing the performance of Protocol Buffers in Java
> versus C++. Do you (the collective you) have any [rough] idea as to
> how they compare performance wise? I am thinking more in terms of
> batch-style processing (disk I/O, parsing centric) rather than RPC
> centric usage patterns. Any experiences you can share would be great.

I have some benchmarks that IIRC show that Java parsing and serialization is roughly half the speed of C++.  As I recall a lot of the speed difference is from UTF-8 decoding/encoding -- in C++ we just leave the bytes encoded, but in Java we need to decode them in order to construct standard String objects.

I've been planning to release these benchmarks publicly but it will take some work and there's a lot of higher-priority stuff to do.  :/  (I think Jon Skeet did get the Java side of the benchmarks into SVN but there's no C++ equivalent yet.)
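
The UTF-8 point in this reply can be seen with the standard library alone. The sketch below is only an illustration; `Utf8DecodeSketch` is a hypothetical name and nothing here is taken from the protobuf runtime. It shows that turning wire bytes into a `String` produces a new object on every decode, whereas the raw bytes could be kept as they are.

```java
import java.nio.charset.StandardCharsets;

class Utf8DecodeSketch {
    // Decoding transcodes the bytes and always allocates a fresh String.
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "h\u00e9llo" contains one non-ASCII character, so UTF-8 needs 6 bytes for 5 chars.
        byte[] wire = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        String a = decode(wire);
        String b = decode(wire);
        System.out.println(a.equals(b)); // prints true: same characters
        System.out.println(a == b);      // prints false: two separate objects
        System.out.println(wire.length + " bytes, " + a.length() + " chars"); // prints 6 bytes, 5 chars
    }
}
```

This is also why the earlier question about "bytes" versus "string" fields is worth benchmarking on your own data rather than guessing.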

Christopher Smith

Jul 24, 2009, 10:05:52 PM
to Kenton Varda, alopecoid, Protocol Buffers
The best way to think of it is:

Builder : Java Message :: C++ Message : const C++ Message

As far as performance goes, it is a common mistake to confuse C/C++
heap allocation costs with Java heap allocation costs. In the common
case, allocations in Java are just a few instructions, comparable to
stack allocations in C/C++. What normally gets you in Java is the
initialization cost, and in this particular scenario there is no way
around that.

If you are worried, you could benchmark the difference between
constantly allocating builders as you go vs. starting with an array of
N builders (allocating the array would be done outside of the
benchmark). I am sure it will prove enlightening.
--
Sent from my mobile device

Chris
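
The comparison suggested above could be sketched roughly as follows. This is a crude timing loop, not a rigorous benchmark: JIT warm-up and escape analysis can distort the numbers, and `TinyBuilder` is a hypothetical stand-in for a generated builder (a single reference field), so it measures only the cost of the builder shell and not of the message it would build.

```java
class BuilderAllocationSketch {
    // Hypothetical stand-in for a generated builder: one reference field.
    static final class TinyBuilder {
        Object message = new Object();
    }

    // Allocates a fresh builder on every iteration.
    static long allocateEachTime(int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            TinyBuilder b = new TinyBuilder();
            checksum += (b.message != null) ? 1 : 0;
        }
        return checksum;
    }

    // Cycles through builders that were allocated up front.
    static long reuseFromPool(TinyBuilder[] pool, int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            TinyBuilder b = pool[i % pool.length];
            checksum += (b.message != null) ? 1 : 0;
        }
        return checksum;
    }

    // Builds the array of N builders outside the timed region.
    static TinyBuilder[] newPool(int n) {
        TinyBuilder[] pool = new TinyBuilder[n];
        for (int i = 0; i < n; i++) {
            pool[i] = new TinyBuilder();
        }
        return pool;
    }

    public static void main(String[] args) {
        int iterations = 10_000_000;
        TinyBuilder[] pool = newPool(1024);

        // Crude warm-up so both loops have a chance to be compiled.
        allocateEachTime(iterations);
        reuseFromPool(pool, iterations);

        long t0 = System.nanoTime();
        long a = allocateEachTime(iterations);
        long t1 = System.nanoTime();
        long r = reuseFromPool(pool, iterations);
        long t2 = System.nanoTime();

        System.out.printf("allocate each time: %d ms, pooled: %d ms (checksums %d / %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a, r);
    }
}
```

The checksums exist only so that the loops have an observable result; the timings will vary by JVM and machine.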