[kryo] buffer size


Felix Ng

Apr 22, 2010, 1:55:35 PM4/22/10
to kryo-users
i have an arraylist with 10 thousand+ strings i would like to serialize,
and i have to use ByteBuffer.allocate(10 * 1024 * 1024); in order to
avoid a buffer overflow...

is this normal practice, or is there any way to make use of a smaller
buffer?
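
fwiw, this is roughly what i'm doing now, as a compilable sketch (assuming
the kryo 1.x api where writeObject takes a ByteBuffer):

import com.esotericsoftware.kryo.Kryo;
import java.nio.ByteBuffer;
import java.util.ArrayList;

public class BigListWrite {
    public static void main( String[] args ) {
        Kryo kryo = new Kryo();
        kryo.register( ArrayList.class );
        ArrayList<String> list = new ArrayList<String>();
        for( int i = 0; i < 10000; i++ )
            list.add( "string " + i );
        // have to guess big - a smaller buffer overflows
        ByteBuffer buffer = ByteBuffer.allocate( 10 * 1024 * 1024 );
        kryo.writeObject( buffer, list );
        System.out.println( "used " + buffer.position() + " bytes" );
    }
}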


Nate

Apr 22, 2010, 3:24:46 PM4/22/10
to kryo-...@googlegroups.com
This is normal. There is more info on this subject here:
http://code.google.com/p/kryo/issues/detail?id=10&can=1

Support to work around this hasn't been added to Kryo because an object graph being serialized is normally going to fit in memory anyway. The required size of the ByteBuffer will likely be smaller than the in-memory Java representation, and memory is relatively abundant.

-Nate

seth/nqzero

Apr 23, 2010, 1:37:57 PM4/23/10
to kryo-...@googlegroups.com
nate - i've been thinking about this some more, and i'd like to come
up with something cleaner ... i'm running many threads doing
serialization, so i need a buffer for each thread. most of my objects
are small, but i want to be able to handle the occasional monster. so
i've got a bunch of "huge" ByteBuffers lying around, doing nothing
most of the time

the status quo is to push the responsibility off to the end-users. for
me (and i think for martin too) this isn't good since we're trying to
make serialization transparent to the end users

here are a few options i've considered

- extend ByteBuffer so that it handles the overflows. would delegate
to an array of ByteBuffers - each larger than the last. conceptually
simple, and nice in that it doesn't affect kryo at all (and might be
useful in other applications), but i'm guessing that the delegation
would be messy

- replace the ByteBuffer usage with something stream oriented. i think
you said that this would break the compression stuff. i'm not using it
right now, but i don't think i'd want to give this up permanently, so
i need to evaluate workarounds

- i think i saw some code (not sure if it was yours or a submission)
that bombed out, caught the exception, realloced, and tried again
(roughly like the sketch below). probably got the job done, but it was
invasive and ugly (imho)
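
for reference, that catch / realloc / retry idea boils down to something
like this (a sketch, not tested - i'm assuming the overflow surfaces as
java.nio.BufferOverflowException; kryo might wrap it in a
SerializationException, in which case you'd check the cause):

import com.esotericsoftware.kryo.Kryo;
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

public class GrowAndRetry {
    public static ByteBuffer write( Kryo kryo, Object object, int initialCapacity ) {
        int capacity = initialCapacity;
        while( true ) {
            ByteBuffer buffer = ByteBuffer.allocate( capacity );
            try {
                kryo.writeObject( buffer, object );
                buffer.flip();
                return buffer;
            } catch( BufferOverflowException overflow ) {
                capacity *= 2; // toss the buffer and retry with a bigger one
            }
        }
    }
}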

none of these explicitly corrects the problem of having a bunch of
huge buffers lying around, but i think that once we had the auto
allocation working, the dealloc would end up being pretty easy ...

not totally happy with any of the above, but some combination has got
to work. i haven't started coding anything yet, so if there was an
approach that you were more likely to accept, i'd try that first -
though i'm on vacation right now, so not sure how much will get done
in the next few weeks

one thing i haven't considered till just now is to just alloc a huge
buffer on demand, and then immediately delete it when i'm done with
it. there are enough interactions going on that i'm not sure what
performance impact that would have ... native arrays, context
switching, kernel handling of cache. and i guess that in effect this
is what i'm doing now, and it's working ...

anyone have favorites, prototypes, comments or new/other ideas ?

seth

Jim

Apr 28, 2010, 5:46:39 PM4/28/10
to kryo-users
The fixed ByteBuffer seems okay for smaller object graphs, but, like
Felix, I have a few really large ones. I also need to use kryo in a
multi-threaded web server that has to support as high a load as
possible.

Is it possible to support serialization directly to/from OutputStreams
and InputStreams as well as to/from ByteBuffers? (XStream and Jackson
both have this feature.)

Thanks,

Jim

Martin Grotzke

Apr 29, 2010, 5:04:39 PM4/29/10
to kryo-...@googlegroups.com
Just as a note: Input/OutputStream would also be sufficient for me. As
I'm writing sessions to memcached, the serialized bytes should not
exceed some megabytes - actually, I haven't measured how expensive it
is to always allocate e.g. 1 MB instead of only the 100 kB that would
actually be needed.

Cheers,
Martin
--
Martin Grotzke
http://www.javakaffee.de/blog/

Martin Grotzke

Apr 29, 2010, 5:51:12 PM4/29/10
to kryo-...@googlegroups.com
On Thu, Apr 29, 2010 at 11:04 PM, Martin Grotzke
<martin....@googlemail.com> wrote:
> Just as a note: Input/OutputStream would also be sufficient for me. As
> I'm writing sessions to memcached, the serialized bytes should not
> exceed some megabytes - actually I didn't measure how expensive it is
> to allocate e.g. always 1Mb instead of allocating only 100kb if this
> would be the actual size needed.
Just to let you know: as I was curious about this, I wrote a very
simple program that serializes strings of different sizes with
different ObjectBuffer initial sizes. These are the results (on my
T61), all sizes in kB, times in millis:
initialSize 100, objSize 100: allocationTime=57, serializationTime=1722, total: 1781 ms
initialSize 100, objSize 200: allocationTime=46, serializationTime=4135, total: 4185 ms
initialSize 200, objSize 200: allocationTime=87, serializationTime=3377, total: 3468 ms
initialSize 100, objSize 400: allocationTime=52, serializationTime=9349, total: 9406 ms
initialSize 400, objSize 400: allocationTime=182, serializationTime=7116, total: 7303 ms
initialSize 100, objSize 800: allocationTime=50, serializationTime=20675, total: 20735 ms
initialSize 800, objSize 800: allocationTime=316, serializationTime=14311, total: 14634 ms

It seems that it takes
- 0.75 sec for growing from 100 -> 200 (vs. 0.04 sec more needed for initially 200 compared to initially 100)
- 2.2 sec for growing from 100 -> 400 (vs. 0.13 sec more needed for initially 400 compared to initially 100)
- 6 sec for growing from 100 -> 800 (vs. 0.25 sec more needed for initially 800 compared to initially 100)

So it seems a larger initial size is cheaper compared to growing.

This is the simple program:

public static void main( final String[] args ) throws InterruptedException {
    bench( 100, 100 );
    bench( 200, 100 );
    bench( 200, 200 );
    bench( 400, 100 );
    bench( 400, 400 );
    bench( 800, 100 );
    bench( 800, 800 );
}

private static void bench( final int objSizeInKb, final int initialSizeInKb ) throws InterruptedException {
    final String val = newString( objSizeInKb );
    final Kryo kryo = new Kryo();
    final long start = System.currentTimeMillis();
    long allocationTime = 0, serializationTime = 0;
    for( int i = 0; i < 1000; i++ ) {
        final long startAllocate = System.currentTimeMillis();
        final ObjectBuffer buffer = new ObjectBuffer( kryo, initialSizeInKb * 1024, 2000 * 1024 );
        allocationTime += System.currentTimeMillis() - startAllocate;
        final long startSerialize = System.currentTimeMillis();
        buffer.writeObject( val );
        serializationTime += System.currentTimeMillis() - startSerialize;
    }
    System.out.println( "initialSize " + initialSizeInKb + ", objSize " + objSizeInKb
        + ": allocationTime=" + allocationTime + ", serializationTime=" + serializationTime
        + ", total: " + ( System.currentTimeMillis() - start ) + " ms" );
    System.gc();
    Thread.sleep( 1000 );
}

private static String newString( final int lengthInKb ) {
    final StringBuilder sb = new StringBuilder( lengthInKb * 1024 );
    final Random random = new Random();
    for( int i = 0; i < lengthInKb * 1024; i++ ) {
        sb.append( random.nextInt( 9 ) );
    }
    return sb.toString();
}

Just for the sake of numbers *g*

Cheers,
Martin

Martin Grotzke

Apr 29, 2010, 6:41:52 PM4/29/10
to kryo-...@googlegroups.com
On Thu, Apr 29, 2010 at 11:51 PM, Martin Grotzke
<martin....@googlemail.com> wrote:
> Just to let you know, as I was curious about this now I wrote a very
> simple prog that serializes a string of different sizes with different
> ObjectBuffer initial sizes. This is the result (on my T61), all sizes
> in kb, time in millis:
...and everything for 1000 cycles (forgot to mention that)...

Cheers,
Martin

Nate

May 5, 2010, 7:11:52 PM5/5/10
to kryo-...@googlegroups.com
Hi Felix, Seth, Jim, and Martin,

I hear what you guys are saying.

Using one large buffer is the most flexible and simplest approach, at the expense of memory. A stream-based approach is easy on memory, but is less flexible. When a length needs to be written before data, a temporary buffer is required, which is potentially just as bad as using one large buffer.

Most of the time, there is no problem with using one large buffer, even for very large object graphs. All the objects fit in memory, and the buffer size required is somewhat less than that. It is inefficient though, and multiple threads can make the problem worse.

Seth brought up some potential solutions:

1) ByteBuffer cannot be extended because its constructors are package-private.
2) Replacing ByteBuffer with something else that handles the buffering/streaming is one solution. I wonder how much overhead adding a layer here would add.
3) ObjectBuffer uses an inefficient grow-and-retry approach. This only relieves you of knowing how much memory you need beforehand; it doesn't solve the problem of needing so much memory.

Another solution is to periodically dump the contents of the buffer. This would break any serializers that access the buffer contents after a dump, but would provide an option for memory efficient serialization. For deserialization, the buffer would have to be periodically filled. I like this approach because it doesn't affect the fast path if what you need is one large ByteBuffer, and doesn't limit the random access of the buffer if streaming isn't needed.

Where is the best place(s) to dump/fill the buffer? There currently isn't a good, central place. One option would be to wrap all serializers with a serializer that delegates to the wrapped serializer before dumping/filling.

How does the API expose dumping/filling bytes? Objects could be placed into the context, so that the wrapper serializer could access them.

Interestingly, this could be built without needing changes to Kryo. I am curious how well this would work. Anyone feel like implementing it? :)
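
A rough sketch of the wrapper idea (FlushingSerializer is a hypothetical name; I'm assuming the current Serializer writeObjectData/readObjectData signatures, and only the write side is shown):

import com.esotericsoftware.kryo.Serializer;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

public class FlushingSerializer extends Serializer {
    private final Serializer delegate;
    private final OutputStream out;

    public FlushingSerializer( Serializer delegate, OutputStream out ) {
        this.delegate = delegate;
        this.out = out;
    }

    public void writeObjectData( ByteBuffer buffer, Object object ) {
        delegate.writeObjectData( buffer, object );
        // dump whatever the wrapped serializer wrote, then reset the buffer
        buffer.flip();
        byte[] chunk = new byte[buffer.remaining()];
        buffer.get( chunk );
        try {
            out.write( chunk );
        } catch( IOException ex ) {
            throw new RuntimeException( ex );
        }
        buffer.clear();
    }

    public <T> T readObjectData( ByteBuffer buffer, Class<T> type ) {
        return delegate.readObjectData( buffer, type ); // periodic filling not sketched
    }
}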

-Nate

seth/nqzero

May 13, 2010, 10:38:51 AM5/13/10
to kryo-...@googlegroups.com
just back from a few weeks of vacation and behind on a bunch of stuff,
but i'll start thinking about this again. if i can get my head wrapped
around the dumping / filling strategy, i'll take a shot at
implementing it. eta's at least a week or two though -- seth

seth/nqzero

Jun 3, 2010, 2:08:08 AM6/3/10
to kryo-...@googlegroups.com
finally starting to look at this. figured the place to start was
measuring the performance of the current options

technique -- mean -- std -- iters between gc

Direct_0   || 0.342401  0.023697  3
Direct_4   || 1.109712  0.088039  2
IDC_0      || 0.318936  0.016013  5
DC_4       || 0.337229  0.013684  5
InDirect_0 || 0.343316  0.025373  5
InDirect_4 || 0.870705  0.167930  2

techniques:
Direct - a new ByteBuffer.allocateDirect for each call to kryo serialize
InDirect - a new ByteBuffer.allocate for each call to kryo serialize
IDC - InDirect, Cached (the buffer is reused)
DC - Direct, Cached
_0 - use a buffer slightly larger than what's required
_4 - use a buffer that's 16 times longer than _0

the test object has a single field - a long array (1 << 20) of random
ints. after writing to the array, the sum of the bytes is printed (to
try to account for any difference between the OS (direct) and java
(indirect) memory). on some runs, i used -verbose:gc to create the gc
column - it's the number of iterations in between gc passes (higher is
better)

problems:
i couldn't think of a way to monitor GC passes programmatically, so i
didn't interleave the tests, and i paused and System.gc()'d before each
technique. can anyone think of an easy way to PROGRAMMATICALLY MONITOR
GC ACTIVITY ?
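
one thing i haven't actually tried that might work: the jmx gc beans in
java.lang.management. a sketch - sample before and after a run and diff
the counts:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCounter {
    // total collections across all collectors; a collector reports -1
    // if it doesn't track counts, so skip those
    public static long gcCount() {
        long count = 0;
        for( GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans() ) {
            long c = bean.getCollectionCount();
            if( c != -1 )
                count += c;
        }
        return count;
    }
}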

my conclusions:
this was 1 run of 15 iterations of each technique, but i've done many
- they all look about the same. direct vs indirect doesn't seem to
matter much, and for "small" buffers, allocating for each kryo call
doesn't seem to matter. for larger buffers (64M in this case),
per-kryo-call allocation isn't efficient. so i'm planning to go forward
with the flushing meta serializer

any thoughts ?
seth

Nate

Jun 3, 2010, 5:05:22 AM6/3/10
to kryo-...@googlegroups.com
In general, it is best for an app to allocate one large ByteBuffer and slice it into many small ones, as needed. Probably doesn't affect your benchmarks though.
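
For instance, something like this (just a sketch; the sizes are arbitrary):

import java.nio.ByteBuffer;

public class Slices {
    public static void main( String[] args ) {
        ByteBuffer master = ByteBuffer.allocateDirect( 8 * 1024 * 1024 );
        // carve out a 64k window; the slice shares storage with master
        // but has its own independent position and limit
        master.position( 0 );
        master.limit( 64 * 1024 );
        ByteBuffer slice = master.slice();
        System.out.println( slice.capacity() ); // 65536
    }
}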

A bit of trivia... If an app never creates a non-direct ByteBuffer and only direct ByteBuffers are used, then the ByteBuffer abstract class will have only the one implementation loaded. The JVM can optimize for this and make all calls non-virtual. As soon as ANY code in the JVM creates a non-direct ByteBuffer, all the calls get deoptimized and direct ByteBuffer calls cost 10x more. The server compiler, however, uses type-based profiling for each call site, which means mostly no performance loss.

I would think a lot of factors could affect your benchmark, especially hardware. Were you using the server compiler? Did you do a JIT warm-up of many thousand iterations before measuring? Were you using a large initial and max heap size?

Ideally you would be able to complete each test without a GC occurring during the test. There are a lot of factors for GC kicking in, and performance can change for various versions of the JVM.

I would look hard at your use case and try to write a test appropriate to that. 64MB of Kryo serialized data is a very large object graph. The JVM's in-memory representation will be larger, likely much larger. Are you really dealing with 100MB+ object graphs? How many threads need to serialize these graphs concurrently? Do you have the memory to allocate that many buffers? A 32-bit VM is limited to a ~1200MB heap, but a 64-bit VM gives you many GBs to play with. If you have the memory, you could pool buffers or pay the small performance hit for allocating each time. How time sensitive is the processing all these threads are doing? If you don't have the memory, you could still pool buffers and block threads as needed to limit concurrent serialization. If this is just too slow, then OK, maybe you need a flushing serializer. :)

-Nate

seth/nqzero

Jun 5, 2010, 9:07:15 PM6/5/10
to kryo-...@googlegroups.com
not really looking for a perfect benchmark (see the note below) - i was
just looking at a near-worst-case scenario. in my mind, this is many
userland threads that might call kryo with data that tends to be
small, but might be large. the options supported by kryo are:

1. use a pool of huge buffers, which requires a lot of memory
allocated statically (sketched below)
2. allocate a huge buffer for each call - i did my benchmark to see if
this was practical - it isn't
3. use some sort of catch / retry for overflows - inefficient (n log
n, i think), and still a memory hog

buffers in #1 and #2 need to be huge to accommodate the largest
possible object. i can certainly live with any of these options, but
they leave the app with the responsibility of bolting on some sort of
mechanism. if there was an easy way to simplify the management of the
buffer(s), i think it'd be worth it ...
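
for concreteness, #1 is basically this (a sketch - assumes the kryo
writeObject(ByteBuffer, Object) call, and pool / buffer sizes you'd
have to tune for your app):

import com.esotericsoftware.kryo.Kryo;
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;

public class BufferPool {
    private final ArrayBlockingQueue<ByteBuffer> pool;

    public BufferPool( int buffers, int capacity ) {
        pool = new ArrayBlockingQueue<ByteBuffer>( buffers );
        for( int i = 0; i < buffers; i++ )
            pool.add( ByteBuffer.allocate( capacity ) );
    }

    public byte[] write( Kryo kryo, Object object ) throws InterruptedException {
        ByteBuffer buffer = pool.take(); // blocks if every buffer is in use
        try {
            buffer.clear();
            kryo.writeObject( buffer, object );
            buffer.flip();
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get( bytes );
            return bytes;
        } finally {
            pool.put( buffer ); // hand the buffer back to the pool
        }
    }
}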

nate wrote:
> Interestingly, this could be built without needing changes to Kryo. I
> am curious how well this would work. Anyone feel like implementing it? :)

i've added a couple of proof-of-concept serializers that show that
it's possible to implement a "streaming" writing api without any
changes to kryo - http://code.google.com/p/kryo/issues/detail?id=10

while these work, they're not perfect - either falling back to #3
above, or imposing an approximately 15% runtime penalty (i'm assuming
this is due to intercepting all the serializer calls). i think my
next attempt will be more invasive - see if i can eliminate the
penalty by moving the streaming code into the serializers themselves
(instead of wrapping them)

seth

NOTE on benchmarks: given the complexity and ad hoc rules of the jit,
benchmarks tend to be pretty crude. yes, you can guard against gc,
allow a long burn-in, and limit which classes you use. and you can
make any particular test give great results. but under real-life
conditions, it's pretty hard to reproduce that result, especially if
your app is large, uses 3rd-party components, or is ultimately a
library that somebody else uses

Nate

Jun 5, 2010, 9:36:34 PM6/5/10
to kryo-...@googlegroups.com
Thanks for the flushing code, I will definitely check it out!

There are currently some outstanding Kryo issues, namely streaming and reference handling. I haven't had a lot of time lately for the project, but I expect in a few weeks I will. I've been thinking of doing some large refuctoring ;) to better handle these cases. Possibly it would make sense to have a centralized place to hook serializers. Right now you have to wrap each serializer, which is a bit of a pain. It is almost like the serializers should do their job, but have mechanisms for cross-cutting customization. Performance and usability are important, so doing this may be tricky.

While microbenchmarks are generally pretty unreliable, I think the 3 items I mentioned are pretty basic for providing a realistic test environment. Any server code should be using the server VM. A large initial and max heap size can reduce the effect of GC on the benchmark. Benchmarking without a warmup is measuring code unoptimized by the JIT. It should be easy to apply these things, just run the JVM with "-Xmx1024M -Xms1024M -server" and do a few thousand iterations before measuring. :)
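
A sketch of the shape I mean (work() is just a stand-in for whatever serialization you're measuring):

public class Bench {
    public static void main( String[] args ) {
        // run with: java -server -Xms1024M -Xmx1024M Bench
        for( int i = 0; i < 10000; i++ )
            work(); // warm-up, so the JIT compiles the hot path first
        long start = System.nanoTime();
        for( int i = 0; i < 1000; i++ )
            work(); // now measure
        System.out.println( ( System.nanoTime() - start ) / 1e6 + " ms" );
    }

    static void work() {
        // stand-in for the code under test
        StringBuilder sb = new StringBuilder();
        for( int i = 0; i < 1000; i++ )
            sb.append( i );
        if( sb.length() == 0 ) throw new AssertionError();
    }
}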

-Nate