pooled-byte-array


Georges Gomes

Nov 30, 2013, 8:02:21 AM
to mechanica...@googlegroups.com
Hi Gentlemen,

Following various discussions around byte array allocation
(the Netty4-Twitter case, for example):

I started a couple of classes around a byte[] pool that
'automatically' recycles byte[] via the garbage collector.
https://github.com/georges-gomes/pooled-byte-array

Simple, but I don't really feel good about it.
Playing with references in finalize() raises a small red flag in my head.
I would appreciate your feedback or experiences with something similar.

Kind regards
Georges

Norman Maurer

Nov 30, 2013, 8:09:43 AM
to Georges Gomes, mechanica...@googlegroups.com
I did not check the code yet, but I would not use finalize(). You don't want to do this.
If someone uses a pool, he/she should also be responsible for freeing the resources.

Bye,
Norman

PS: Feel free to ask stuff about the pool in Netty, as I'm one of the Netty devs ;)


-- 
Norman Maurer

Peter Lawrey

Nov 30, 2013, 9:08:59 AM
to mechanica...@googlegroups.com
The most efficient pool is one which is just pre-allocated and which you keep reusing. Passing resources, especially byte[], between threads is very expensive when recycling, as the byte[] will actually get copied back between CPU caches even if you overwrite it, i.e. the CPU cache is too stupid not to read in a cache line you are about to overwrite.

Using finalize is very expensive, as it adds nodes to a queue to be processed in a background thread. This is likely to be more expensive than letting the GC do the work. It is made worse if a finalized object blocks for a while, as that prevents anything else from being finalized.

I suggest looking at using a ring buffer of buffers. This has the advantage that when you have a free slot to write data, you also have a recycled buffer to write to as well. It looks something like this.


      RingBuffer<ByteBuffer> ring = ...

      // producer
      ByteBuffer bytes = ring.nextFree();
      populate(bytes);
      ring.writeComplete(bytes);

      // consumer
      ByteBuffer bytes = ring.nextUsed();
      process(bytes);
      ring.readComplete(bytes);

This doesn't solve the problem of writing the buffer back again, but it does mean you can queue buffers and recycle them in one action.
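For illustration, a minimal single-producer/single-consumer sketch of such a ring (class and method names are invented for this example, not an existing library API): slots are filled with buffers once, up front, and the same ByteBuffer objects are reused forever.

      import java.nio.ByteBuffer;

      // Minimal SPSC ring of pre-allocated buffers (sketch only).
      // 'head' is advanced only by the consumer and 'tail' only by
      // the producer, so volatile reads/writes suffice here.
      final class BufferRing {
          private final ByteBuffer[] slots;
          private final int mask;
          private volatile long head; // next slot to read (consumer)
          private volatile long tail; // next slot to write (producer)

          BufferRing(int size, int bufferCapacity) { // size: power of two
              slots = new ByteBuffer[size];
              mask = size - 1;
              for (int i = 0; i < size; i++)
                  slots[i] = ByteBuffer.allocate(bufferCapacity);
          }

          ByteBuffer nextFree() {                  // producer side
              while (tail - head == slots.length)  // ring full: wait
                  Thread.yield();
              ByteBuffer b = slots[(int) (tail & mask)];
              b.clear();                           // recycled, ready to overwrite
              return b;
          }
          void writeComplete() { tail = tail + 1; } // publish to consumer

          ByteBuffer nextUsed() {                  // consumer side
              while (tail == head)                 // ring empty: wait
                  Thread.yield();
              return slots[(int) (head & mask)];
          }
          void readComplete()  { head = head + 1; } // slot is free again
      }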

Georges Gomes

Nov 30, 2013, 10:05:09 AM
to Norman Maurer, mechanica...@googlegroups.com

Hi Norman
Thanks for replying.
Netty 4 buffer pool is on my to-read list. Talk to you soon :)
Cheers
Gg

Georges Gomes

Nov 30, 2013, 11:22:56 AM
to mechanica...@googlegroups.com
Thanks for taking the time on this, Peter. Additional comments inline.

The most efficient pool is one which is just pre-allocated and which you keep reusing. Passing resources, especially byte[], between threads is very expensive when recycling, as the byte[] will actually get copied back between CPU caches even if you overwrite it, i.e. the CPU cache is too stupid not to read in a cache line you are about to overwrite.

Yes, but even if you pre-allocate the buffer in a class, if multiple threads access it there is a CPU cache 'reload' that is going to happen (not to mention the locking required to coordinate access to the buffer).

I tried to keep the byte[] 'hot' by implementing the pool as a LIFO (ConcurrentDeque).
OK, it's quite naive :) but this is all very experimental.
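Conceptually it is something like this (a sketch with invented names, not necessarily the actual code in the repo): the most recently released array is handed out first, on the theory that it is still warm in some CPU cache.

      import java.util.concurrent.ConcurrentLinkedDeque;

      // LIFO pool: the most recently released byte[] is reused
      // first, so it has the best chance of still being in cache.
      final class LifoBytePool {
          private final ConcurrentLinkedDeque<byte[]> free =
                  new ConcurrentLinkedDeque<>();
          private final int arraySize;

          LifoBytePool(int arraySize) { this.arraySize = arraySize; }

          byte[] acquire() {
              byte[] a = free.pollFirst();    // take from the "hot" end
              return (a != null) ? a : new byte[arraySize];
          }

          void release(byte[] a) {
              free.offerFirst(a);             // push back onto the hot end
          }
      }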

Your comment is quite right: to get the best performance out of the byte[] we need to keep it local to its cache. What would you think of an implementation that pools the byte[] depending on the core that 'asks' for it?
 
Using finalize is very expensive, as it adds nodes to a queue to be processed in a background thread. This is likely to be more expensive than letting the GC do the work. It is made worse if a finalized object blocks for a while, as that prevents anything else from being finalized.

First, don't get me wrong, I do fear finalize() as well, but I would like to find hard facts to convince myself :)

I have an even bigger fear in pushing the responsibility for recycling onto the developer, because if the reference to the pooled byte[] is passed to multiple threads, the last thread needs to recycle it. This would require a very well written reference-count-like armada, and it is likely to lead to memory leaks or worse: a byte[] shared by 2 'contexts', corrupting the data in a very bad way(!) Finding references is what the GC does best, and it is going to do it anyway, so why not use it? There are too many benefits not to dig into this one...
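For concreteness, that reference-count-like armada is roughly this pattern (a hedged sketch in the spirit of Netty 4's ReferenceCounted, not its actual code; LifoBytePool is the pool sketch from above): the last thread to release triggers the recycle, and the failure modes are exactly a missed release() (a leak) or an extra one (two 'contexts' sharing the array).

      import java.util.concurrent.atomic.AtomicInteger;

      // Manual reference counting: each thread that keeps the buffer
      // calls retain(), each thread that is done calls release();
      // the last release() recycles the byte[].
      final class RefCountedBytes {
          private final AtomicInteger refCnt = new AtomicInteger(1);
          final byte[] array;
          private final LifoBytePool pool;    // pool sketch from above

          RefCountedBytes(byte[] array, LifoBytePool pool) {
              this.array = array;
              this.pool = pool;
          }

          void retain() { refCnt.incrementAndGet(); }

          void release() {
              if (refCnt.decrementAndGet() == 0)
                  pool.release(array);        // no one references it anymore
          }
      }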

Recycling 'manually' through a recycle() method or getting it done in finalize() is likely to take the same time. Maybe the finalizers are executed sequentially by the GC, hence not parallelized, so you are right in a highly concurrent scenario. But it would need to be benchmarked.
  
I suggest looking at using a ring buffer of buffers. This has the advantage that when you have a free slot to write data, you also have a recycled buffer to write to as well. It looks something like this.


      RingBuffer<ByteBuffer> ring = ...

      // producer
      ByteBuffer bytes = ring.nextFree();
      populate(bytes);
      ring.writeComplete(bytes);

      // consumer
      ByteBuffer bytes = ring.nextUsed();
      process(bytes);
      ring.readComplete(bytes);

This doesn't solve the problem of writing the buffer back again, but it does mean you can queue buffers and recycle them in one action.

This is interesting for producer and consumer threads, but a little bit outside of what I'm trying to explore here.

Kirk Pepperdine

Nov 30, 2013, 3:36:18 PM
to mechanica...@googlegroups.com
If you look at the implementation of finalize (or any reference type) you'll see that there are special bits that the JVM plays with to understand what it's supposed to do next. Using finalize to "recover" an object will cause those bits to be set, leaving the object in question in a half-dead state. I'd have to review, but a half-dead object cannot be a GC root, and there are some other things that don't quite work out so well. So, in addition to more expensive creation (the object must be wrapped with a Final reference, and the C++ code in the JVM allocates these extra bits to manage reference handling), GC is also more expensive, as it has to play with the bits and then move the object to the finalization thread's reference queue... plus you get the knock-on effect of actually having to run finalization.

----
Kirk Pepperdine
Principal Consultant
http://www.kodewerk.com
Tel: +36 60 213 6543
skype: kcpeppe
twitter: @kcpeppe
Java Champion
NetBeans Dream Team

Georges Gomes

Nov 30, 2013, 4:06:58 PM
to mechanica...@googlegroups.com

Hi Kirk

That's quite factual! :)

Since the pooled object wrapping the byte array is not 'resurrected', do you think this still applies to the byte array?

I'm going to look at the OpenJDK source code, you've got my interest!

Anyway
Non-finalizer 3 - 0 finalizer
It's hard to fight :)

I have a lot to explore now.
I'll keep you posted if I find anything interesting.

Many thanks for your time
Gg

Kirk Pepperdine

Nov 30, 2013, 4:31:33 PM
to mechanica...@googlegroups.com
The byte[] will be ok, but don't use the wrapper. That said (and as Peter has mentioned), you are looking at at least 3 threads touching the array: mutator, GC, and the finalization helper thread...

BTW, I'm not saying don't do it.. but I'm not saying do it either... just saying, don't use the wrapper ;-)

----
Kirk Pepperdine
Principal Consultant
http://www.kodewerk.com
Tel: +36 60 213 6543
skype: kcpeppe
twitter: @kcpeppe
Java Champion
NetBeans Dream Team

Gil Tene

Dec 1, 2013, 2:04:24 AM
to mechanica...@googlegroups.com
There are multiple parts to address here. I'll start with the small style stuff and progress through to the "this is very inefficient" conclusion.

First, the use of a finalizer is correctly frowned upon. Virtually anything you can do with a finalizer can be more cleanly done with either a weak reference (for post-mortem processing) or a phantom reference (for pre-mortem processing). In this specific case, the cleaner way to express what you are trying to achieve (the recycling of the byte array upon the death of the containing object) is to use a weak reference. You do so by creating a subclass of WeakReference, say PooledByteArrayReference, which would include a field that strongly references the byte array. PooledByteArrayReference instances would be created in the PooledByteArray constructor, and the queue they are put in is then processed in the background to reap and recycle byte buffers. This way there are no additional liveness states, and PooledByteArray instances properly die when they can.
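In code, the shape of this is roughly the following (a sketch using the names from the description above; the pool and the reaper thread's lifecycle are elided):

      import java.lang.ref.ReferenceQueue;
      import java.lang.ref.WeakReference;

      // Weakly reference the wrapper, strongly reference the byte[]:
      // when a PooledByteArray dies, its reference is enqueued and a
      // background thread recycles the array it still holds.
      final class PooledByteArrayReference extends WeakReference<PooledByteArray> {
          final byte[] array;                 // survives the wrapper's death

          PooledByteArrayReference(PooledByteArray owner, byte[] array,
                                   ReferenceQueue<PooledByteArray> queue) {
              super(owner, queue);            // enqueued when 'owner' dies
              this.array = array;
          }
      }

      // Background reaper loop (sketch):
      //   PooledByteArrayReference ref =
      //       (PooledByteArrayReference) queue.remove(); // blocks until a death
      //   pool.recycle(ref.array);                       // byte[] back to the pool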

Second, the notion that allocating byte buffers instead of recycling them with this specific recycling pattern would be expensive usually turns out to be false (actually measuring it would be the best way to tell). The byte zeroing avoidance here is not much of a savings unless your byte buffers are multi-MB ones, since (assuming you actually use the byte buffers for storage) the storage of contents into a buffer right after allocation will usually overlap with the cost of zeroing, making the actual zeroing cost near-zero. This happens because both the zeroing stores and the contents stores have to bring the buffer's cache lines into the cache in exclusive mode, and once a line is in the L3 due to zeroing, the contents store is a very likely cache hit.

Third, the added de-reference cost on every array access operation here is likely to be MUCH more expensive than any (nearly zero to begin with) savings that come from avoiding the zeroing of buffers in normal allocation.

Fourth (and most dominant): Unlike both a limited size pool with explicit freeing, or simply allocating the buffers and letting GC pick them up, this recycling pattern is likely to fill the heap up with many more live byte buffers than are actually needed when under pressure, and throw GC into thrashing conditions. The reason for this is that recycling won't happen until the heap has filled up enough to trigger a GC. But when GC eventually triggers with this recycling approach, it will find tons of live byte buffers in the heap (in contrast to lots of dead ones in the trivial allocation/GC pattern without this recycling approach).


Kirk Pepperdine

Dec 1, 2013, 2:55:44 AM
to mechanica...@googlegroups.com

----
Kirk Pepperdine
Principal Consultant
http://www.kodewerk.com
Tel: +36 60 213 6543
skype: kcpeppe
twitter: @kcpeppe
Java Champion
NetBeans Dream Team

On 2013-12-01, at 7:59 AM, Gil Tene <g...@azulsystems.com> wrote:

There are multiple parts to address here. I'll start with the small style stuff and progress through to the "this is very inefficient" conclusion.

First, the use of a finalizer is correctly frowned upon. Virtually anything you can do with a finalizer can be more cleanly done with either a weak reference (for post-mortem processing) or a phantom reference (for pre-mortem processing). In this specific case, the cleaner way to express what you are trying to achieve (the recycling of the byte array upon the death of the containing object) is to use a weak reference. You do so by creating a subclass of WeakReference, say PooledByteArrayReference, which would include a field that strongly references the byte array. PooledByteArrayReference instances would be created in the PooledByteArray constructor, and the queue they are put in is then processed in the background to reap and recycle byte buffers. This way there are no additional liveness states, and PooledByteArray instances properly die when they can.

I was about to suggest Phantom, but quickly realized that it suffers from the same failures as finalize. In fact, if you look at the implementation, finalize is probably safer than Phantom in that finalize has guards built in that most developers wouldn't think of using. That said, Weak "may be" a better option in that by default it gets dropped into a null ReferenceQueue and can therefore be collected without any special processing.

Second, the notion that allocating byte buffers instead of recycling them with this specific recycling pattern would be expensive usually turns out to be false (actually measuring it would be the best way to tell). The byte zeroing avoidance here is not much of a savings unless your byte buffers are multi-MB ones, since (assuming you actually use the byte buffers for storage) the storage of contents into a buffer right after allocation will usually overlap with the cost of zeroing, making the actual zeroing cost near-zero. This happens because both the zeroing stores and the contents stores have to bring the buffer's cache lines into the cache in exclusive mode, and once a line is in the L3 due to zeroing, the contents store is a very likely cache hit.


Third, the added de-reference cost on every array access operation here is likely to be MUCH more expensive than any (nearly zero to begin with) savings that come from avoiding the zeroing of buffers in normal allocation.

Fourth (and most dominant): Unlike both the fixed sized or limited size pool with explicit freeing, or simply allocating the buffers and letting GC pick them up, this recycling pattern is likely to fill the heap up with many more live byte buffers than are actually needed when under pressure, and throw GC into thrashing conditions. The reason for this is that recycling won't happen until the heap has filled up enough to trigger a GC. But when GC eventually triggers with this recycling approach, it will find tons of live byte buffers in the heap (in contrast to lots of dead ones in the trivial allocation/GC pattern without this recycling approach).

And even if the objects are dead, weak references present an expensive path to the collector.

However, a measurement is what is really needed. I say this because it used to be said that you should always tenure (i.e., don't use survivor spaces) because survivor copying would be very expensive. Measurements in the field very quickly dispelled that myth. So, in total life-cycle costs, is pooling byte buffers more or less expensive than re-allocating them? I've never measured this so I don't know for sure, but my feeling is that, as expensive as byte array allocation can be, you'd have to be performing it very frequently for it to approach the total life-cycle costs.

-- Kirk

Georges Gomes

Dec 1, 2013, 3:40:36 AM
to mechanica...@googlegroups.com
Thanks Gil, your insight is always helpful.
Comments inlined.

On Sun, Dec 1, 2013 at 7:59 AM, Gil Tene <g...@azulsystems.com> wrote:
There are multiple parts to address here. I'll start with the small style stuff and progress through to the "this is very inefficient" conclusion.

First, the use of a finalizer is correctly frowned upon. Virtually anything you can do with a finalizer can be more cleanly done with either a weak reference (for post-mortem processing) or a phantom reference (for pre-mortem processing). In this specific case, the cleaner way to express what you are trying to achieve (the recycling of the byte array upon the death of the containing object) is to use a weak reference. You do so by creating a subclass of WeakReference, say PooledByteArrayReference, which would include a field that strongly references the byte array. PooledByteArrayReference instances would be created in the PooledByteArray constructor, and the queue they are put in is then processed in the background to reap and recycle byte buffers. This way there are no additional liveness states, and PooledByteArray instances properly die when they can.

This is nice. Never thought of WeakReference this way.
I have to try this just for the sport!

Second, the notion that allocating byte buffers instead of recycling them with this specific recycling pattern would be expensive usually turns out to be false (actually measuring it would be the best way to tell). The byte zeroing avoidance here is not much of a savings unless your byte buffers are multi-MB ones, since (assuming you actually use the byte buffers for storage) the storage of contents into a buffer right after allocation will usually overlap with the cost of zeroing, making the actual zeroing cost near-zero. This happens because both the zeroing stores and the contents stores have to bring the buffer's cache lines into the cache in exclusive mode, and once a line is in the L3 due to zeroing, the contents store is a very likely cache hit.

That was the offline comment of my friend Jean-Philippe Bempel as well, and based on some testing it has proven to be true. The only 'but' is: if the buffer is over-provisioned, then more zeroing and cache-line traffic happen than necessary... But if you allocate a new byte[] inline, then you usually know how much you need.

Third, the added de-reference cost on every array access operation here is likely to be MUCH more expensive than any (nearly zero to begin with) savings that come from avoiding the zeroing of buffers in normal allocation.

True on a byte-by-byte basis.
Copying 'large' chunks through System.arraycopy will have less impact.
But the latter is probably not the most common use case for a byte[]...

I would need to find a way to achieve the same thing without the extra de-reference to make this useful for performance. This would only be possible with a PhantomReference on the byte[] directly
(as mentioned by Kirk in the next email, I think).

Fourth (and most dominant): Unlike both the fixed sized or limited size pool with explicit freeing, or simply allocating the buffers and letting GC pick them up, this recycling pattern is likely to fill the heap up with many more live byte buffers than are actually needed when under pressure, and throw GC into thrashing conditions. The reason for this is that recycling won't happen until the heap has filled up enough to trigger a GC. But when GC eventually triggers with this recycling approach, it will find tons of live byte buffers in the heap (in contrast to lots of dead ones in the trivial allocation/GC pattern without this recycling approach).

I'm not too worried about this one, because the life cycle of a trivial byte[] vs a PooledByteArray would be the same: collected in young or old depending on the situation. It would be easy to then limit the maximum number of byte[] recycled in the pool, to avoid going through the roof after exceptional spikes.
This would have the same footprint as a fixed/limited-size pool.
Am I missing something?

Georges Gomes

Dec 1, 2013, 4:35:52 AM
to mechanica...@googlegroups.com
Yes, the next step is to run some benchmarks of all this with different buffer sizes.

Regarding tenuring, for large byte[] I was thinking of combining it with -XX:PretenureSizeThreshold=xxxx,
which allocates 'large' byte[] directly in the old generation without affecting young/survivor/tenuring.
This should have no effect on 99.9% of all allocations, at the expense of only minimal unnecessary pretenuring. Depends on the software obviously; we have a lot of RAM on servers now...
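For example (HotSpot-specific, and worth verifying against the collector in use, since some young-generation collectors reportedly ignore this flag; 'MyApp' and the 64 KB threshold are placeholders):

      # objects larger than 64 KB are allocated directly in the old generation
      java -XX:PretenureSizeThreshold=65536 -Xms4g -Xmx4g MyApp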

Cheers
GG



Paul de Verdière

Dec 1, 2013, 12:17:55 PM
to mechanica...@googlegroups.com
I applied this exact principle in a toy project that I started last summer and work on when time permits: https://github.com/pcdv/jocket

The use-case is single-producer / single-consumer exchange of bytes between two threads (IPC or intraprocess). It uses a shared buffer containing a ring of packet information and a ring buffer of actual data.

In a recent commit I added experimental support for zero-copy, looking exactly like your code snippet, cf issue https://github.com/pcdv/jocket/issues/2.

Hope some of you will find it interesting.

PS: don't scream at my benchmarks, they don't use JMH or Caliper and they do suffer from coordinated omission, sorry guys ;)
Gil Tene

Dec 2, 2013, 12:13:21 PM
to <mechanical-sympathy@googlegroups.com>, mechanica...@googlegroups.com
Comment inline.

On Dec 1, 2013, at 12:40 AM, "Georges Gomes" <george...@gmail.com> wrote:
...
Fourth (and most dominant): Unlike both the fixed sized or limited size pool with explicit freeing, or simply allocating the buffers and letting GC pick them up, this recycling pattern is likely to fill the heap up with many more live byte buffers than are actually needed when under pressure, and throw GC into thrashing conditions. The reason for this is that recycling won't happen until the heap has filled up enough to trigger a GC. But when GC eventually triggers with this recycling approach, it will find tons of live byte buffers in the heap (in contrast to lots of dead ones in the trivial allocation/GC pattern without this recycling approach).

I'm not too worried about this one, because the life cycle of a trivial byte[] vs a PooledByteArray would be the same: collected in young or old depending on the situation. It would be easy to then limit the maximum number of byte[] recycled in the pool, to avoid going through the roof after exceptional spikes.
This would have the same footprint as a fixed/limited-size pool.
Am I missing something?

What you are missing here is the fact that the otherwise abundant empty space normally found by an oldgen GC will likely be filled with live byte[] objects, thrashing the collector. For every dead PooledByteArray object that lived long enough to get promoted past newgen, there will be a live byte[] object that will survive the next oldgen cycle.

With a non-explicit pooling scheme that uses post-death (of PooledByteArray) actions for recycling, you have no way to limit the number of byte[] objects that survive the next oldgen Full GC cycle. You can limit the number that will continue to be kept alive by your recycling code after that GC has run, but such pruning will not take effect until the next oldgen cycle. At that point you are too late, as the collector is already thrashed.

On CMS, for example, I'd expect this scheme to reliably cause constant concurrent mode failures and fallbacks to a full GC on every second CMS cycle, and with much more frequent oldgen/FullGC occurrences than without it. That is, when an application actually makes heavy dynamic use of these PooledByteArray objects, of course (if no heavy use is made, there is no point to the discussion to begin with...).

My logic here is based on analysis and understanding of the underlying mechanisms, but empirical measurement will probably best demonstrate or disprove it. To convince yourself of the qualities of this scheme compared to others, I'd do a simple side-by-side measurement of three scenarios: this scheme, the trivial scheme (allocate regular byte[] and let GC take care of them), and an explicit alloc/free pool. In all cases your test should consistently retain some buffers long enough for them to be promoted (unlike a benchmark that allocates and immediately forgets). The behaviors I'd look for are:

A) latency behavior under a given allocation rate. E.g. 1M new 1KB buffers per second, with 1 in 20 buffers living at least 20 seconds (so ~1GB/sec allocation, ~50MB/sec promotion). Run this in a 4GB heap for at least 20 minutes to see what happens. (You can use jHiccup to get a simple picture of latency behavior for each measured scenario.)
B) sustainable throughput: the number of allocated buffers per second* you can keep going without blowing some service level, like keeping the 99.9%'ile below 1 second. (Here too, jHiccup gives you an easy way to establish the entire percentile spectrum and compare it to an SLA.)
C) total throughput (rate of allocated buffers per second that can be done regardless of SLA).

I'd also track GC behavior (number of cycles, amount of GC work, size of pauses, etc.), but not so much as a metric, just as a way of understanding the behavior that dominates the other metrics.

I would expect all three behaviors (latency behavior, sustainable throughput, and total throughput) to suffer in this scheme compared to the other two.

* [You'll want to keep the ratio of allocated buffers to lived-long-enough-to-be-promoted buffers fixed for this. It's actually the promotion rate that you should be doing throughput measurements of, except that the explicit alloc/free scenario will not have any promotion.]
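A skeleton of the allocation load in scenario (A) might look like this (a rough sketch; the byte[] allocation is the line you would swap for PooledByteArray or an explicit pool, and jHiccup would run alongside):

      import java.util.ArrayDeque;

      // ~1M x 1KB allocations/sec, 1 in 20 retained for ~20s: that is
      // ~1GB/sec allocation and ~50MB/sec promotion. Run with -Xmx4g.
      public final class AllocLoad {
          public static void main(String[] args) throws InterruptedException {
              ArrayDeque<byte[]> retained = new ArrayDeque<>(1_000_000);
              long count = 0;
              while (true) {
                  byte[] b = new byte[1024];  // swap in the scheme under test
                  if (count++ % 20 == 0) {
                      retained.addLast(b);
                      if (retained.size() > 1_000_000) // ~20s at 50K retained/sec
                          retained.pollFirst();
                  }
                  if (count % 100_000 == 0)
                      Thread.sleep(100);      // crude throttle toward ~1M/sec
              }
          }
      }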

Georges Gomes

Dec 2, 2013, 12:20:04 PM
to mechanica...@googlegroups.com

Thanks Gil
You know I trust any word you say :)

I'm going to do some benchmarks just to get to the bottom of it.
I'll keep you posted with the results.

Thanks again
Gg


Georges Gomes

Dec 11, 2013, 6:33:06 PM
to mechanica...@googlegroups.com
Hi

I haven't progressed much, but I posted some results of byte[] vs PooledByteArray.

In a nutshell


#Buffer 4096 bytes

new byte[] => 10 GCs, 300us each!
PooledByteArray => 4 GCs, 20ms each!


#Buffer 32768 bytes

new byte[] => 50+ GCs, 1ms each
PooledByteArray => 7 GCs, 20ms each


It proves your points around finalize().
I didn't suspect this kind of impact... really massive.
I will continue to dig into alternatives to finalize() and see where it goes.

Cheers
GG
