Serializing large arrays in Java

1,056 views
Skip to first unread message

Joshua Hartman

unread,
Jun 22, 2010, 10:16:34 PM6/22/10
to Protocol Buffers
Hi,

I'm trying to serialize a large primitive array of int in Java, and
I'm testing using 1,000,000 elements. My .proto message is just a
packed repeated field of type int32. To serialize, I just create a
builder and loop through the array, adding elements one at a time. I
took a peek at the underlying implementation and it seems that it uses
an ArrayList. However, there's no way to reserve the size of the
ArrayList, so serializing is slower than it could be. In fact, it's
faster for me to just serialize the data an ObjectOutputStream to put
the data in a protocol buffer bytes message.

Is there a way for me to do the serialization faster? I took a peek
through the API but didn't see anything obvious. I saw
CodedOutputStream as well, but I didn't understand the intent of that
class. Could anyone please explain what it's used for?

Thanks,
Josh H

Kenton Varda

unread,
Jun 23, 2010, 9:26:08 PM6/23/10
to Joshua Hartman, Protocol Buffers
This is a use case that is, unfortunately, not well-optimized by the Java implementation.  What's needed is versions of ArrayList which store unboxed values.  Unfortunately, a separate such implementation would be needed for every primitive field type, since Java generics work only with boxed types.  It will be tedious, but you're welcome to work on it.


--
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To post to this group, send email to prot...@googlegroups.com.
To unsubscribe from this group, send email to protobuf+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.


Marc-André Laverdière

unread,
Jun 24, 2010, 12:07:32 AM6/24/10
to Protocol Buffers
There is an Apache Commons project that provide the primitive lists:
http://commons.apache.org/primitives/

I think that those would do the heavy lifting for us :)

Marc-André LAVERDIÈRE
"Perseverance must finish its work so that you may be mature and
complete, not lacking anything." -James 1:4
mlaverd.theunixplace.com/blog

/"\
\ / ASCII Ribbon Campaign
X against HTML e-mail
/ \

2010/6/24 Kenton Varda <ken...@google.com>:

Josh Hartman

unread,
Jun 24, 2010, 2:05:42 PM6/24/10
to Protocol Buffers
Using something like Fastutil or the collections library posted would definitely improve performance. Google guys/committers - what do you think about adding a dependency on these libraries? I don't mind contributing code for this.

I still think a huge performance increase could be seen if we could preallocate the ArrayList in the Builder. Looking at generated code for a simple repeated int32 type, whenever an item is added to the list ArrayLists's append is called. If we exposed ArrayList's ensureCapacity method, we could reserve the capacity we needed and wouldn't have to reallocate objects and copy over as we build up the List.



2010/6/23 Marc-André Laverdière <marcandre....@gmail.com>

Kenton Varda

unread,
Jun 28, 2010, 3:10:43 PM6/28/10
to Josh Hartman, Protocol Buffers
Ideally, we want to avoid dependencies if at all possible.  The library is already considered to be too big (motivating protobuf-lite, among other things).  A dependency could make things worse.

I wonder how much code it would take to implement an absolutely minimal set of primitive-value ArrayLists, designed solely for use by protobufs.

And yeah, providing a way to call ensureCapacity() seems worthwhile.

Marc-André Laverdière

unread,
Jun 28, 2010, 6:16:17 PM6/28/10
to Protocol Buffers
Good question.

I think you could re-implement whatever those libraries we talked
about are doing. Conceptually it is very simple, but I'm pretty sure
that you must have a bunch of tricky things for performance to be
careful.
So a naive implementation (with memcpys) could be done in a day I
think. A more solid implementation (no memcpys) might take another
day. More optimizations: no clue.

But this is going to be very very boilerplate-like. It might be nice
to have a code generator (something as simple as StringTemplate). It
would save you a lot of time.

Marc-André LAVERDIÈRE
"Perseverance must finish its work so that you may be mature and
complete, not lacking anything." -James 1:4
mlaverd.theunixplace.com/blog

/"\
\ / ASCII Ribbon Campaign
X against HTML e-mail
/ \

2010/6/29 Kenton Varda <ken...@google.com>:

Josh Hartman

unread,
Jun 29, 2010, 5:49:36 PM6/29/10
to Protocol Buffers
I've started working on implementations for this. We can also keep the current interface stable by having our custom IntList implement List<Integer>. Unfortunately, users will have to pay the boxing penalty for certain operations but we can make others a lot faster and save memory.

I've also added ensureCapacity. The code looks like this in java_primitive_field.cc:
"public Builder ensure$capitalized_name$Capacity(int minCapacity) {\n"
"  if (result.$name$_.isEmpty()) {\n"
"    result.$name$_ = new java.util.ArrayList<$boxed_type$>(minCapacity);\n"
"  } else {\n"
"    ((java.util.ArrayList<$boxed_type$>) result.$name$_).ensureCapacity(minCapacity);\n"
"  }\n"
"  return this;\n"
. Unfortunately, it's actually slower for me to use this than not. I've played around for hours to get optimal GC settings since GC contributes a significant portion of the execution time, but setting the capacity before adding all the elements actually takes a tad longer. Does anyone have any suggestions why this might be the case? I see nothing weird when poring over the GC logs or anything. I'm on OSX 1.6.20 64bit sun JVM. I'm going to try again on a windows VM when I can get the chance.


Josh Hartman



2010/6/28 Marc-André Laverdière <marcandre....@gmail.com>

Marc-André Laverdière

unread,
Jun 29, 2010, 9:00:22 PM6/29/10
to Protocol Buffers
If we add this support in the new version, we can afford to add a new
interface for the new primitive-enabled API.

But we probably will need to keep the old interface as wrapper or something...

I think that, for the sake of saving many hours, one could prototype
using the Apache stuff. If the benchmarks are good, then we can think
about porting it and tweaking to fit our needs better. That might be
more enjoyable than tweaking GC settings :)

Marc-André LAVERDIÈRE
"Perseverance must finish its work so that you may be mature and
complete, not lacking anything." -James 1:4
mlaverd.theunixplace.com/blog

/"\
\ / ASCII Ribbon Campaign
X against HTML e-mail
/ \

2010/6/30 Josh Hartman <atomi...@gmail.com>:

Evan Jones

unread,
Jun 30, 2010, 5:24:34 PM6/30/10
to Josh Hartman, Protocol Buffers
On Jun 29, 2010, at 17:49 , Josh Hartman wrote:
> "public Builder ensure$capitalized_name$Capacity(int minCapacity)
> {\n"
> " if (result.$name$_.isEmpty()) {\n"
> " result.$name$_ = new java.util.ArrayList<$boxed_type
> $>(minCapacity);\n"

I think the problem is that in the empty case, the next call to add*
will re-create the ArrayList as empty, which defeats the call to
ensureCapacity. Try hacking something in that method to make this work
(check the type?)

Evan

--
Evan Jones
http://evanjones.ca/

Josh Hartman

unread,
Jun 30, 2010, 6:09:58 PM6/30/10
to Evan Jones, Protocol Buffers
Good catch. This was the problem and the fix results in a speedup of about 70% all the way from serializing 1000 elements to 1,000,000 elements in an int array. I'll throw my changes up on github this evening.

Instead of doing an empty check, I just do a reference check against Collections.EMPTY_LIST.

Thanks,
Josh

On Wed, Jun 30, 2010 at 2:24 PM, Evan Jones <ev...@mit.edu> wrote:
On Jun 29, 2010, at 17:49 , Josh Hartman wrote:
       "public Builder ensure$capitalized_name$Capacity(int minCapacity) {\n"
       "  if (result.$name$_.isEmpty()) {\n"
       "    result.$name$_ = new java.util.ArrayList<$boxed_type$>(minCapacity);\n"

Prakash Rao

unread,
Aug 11, 2010, 10:40:16 AM8/11/10
to Protocol Buffers
I'm also looking for proper ways to set initial capacity for my
repeatable fields (in protobuf java 2.3.0). Is this possible without
hacking the generated code?

Regards,
Prakash
> >http://evanjones.ca/- Hide quoted text -
>
> - Show quoted text -

Josh Hartman

unread,
Aug 11, 2010, 11:25:26 AM8/11/10
to Prakash Rao, Protocol Buffers
I have code ready for the compiler to do just this. Please see http://github.com/jhartman/Google-Protocol-Buffers/tree/java-optimizations

I need to clean up and benchmark the code for using apache primitives to store the repeated primitive fields. I'll try to finish it this week and post it to the mailing list.

Thanks,
Josh H


--

Prakash Rao

unread,
Aug 12, 2010, 12:44:13 AM8/12/10
to Protocol Buffers
Thanks for your quick response. This really helps :-)

Regards,
Prakash

On Aug 11, 8:25 pm, Josh Hartman <atomicfo...@gmail.com> wrote:
> I have code ready for the compiler to do just this. Please seehttp://github.com/jhartman/Google-Protocol-Buffers/tree/java-optimiza...
>
> I need to clean up and benchmark the code for using apache primitives to
> store the repeated primitive fields. I'll try to finish it this week and
> post it to the mailing list.
>
> Thanks,
> Josh H
>
> > > >http://evanjones.ca/-Hide quoted text -
>
> > > - Show quoted text -
>
> > --
> > You received this message because you are subscribed to the Google Groups
> > "Protocol Buffers" group.
> > To post to this group, send email to prot...@googlegroups.com.
> > To unsubscribe from this group, send email to
> > protobuf+u...@googlegroups.com<protobuf%2Bunsubscribe@googlegroups.c­om>
> > .
> > For more options, visit this group at
> >http://groups.google.com/group/protobuf?hl=en.- Hide quoted text -
Reply all
Reply to author
Forward
0 new messages