How do we set up ByteBuf pools?


travis...@gmail.com

Jun 16, 2013, 8:56:44 PM
to ne...@googlegroups.com
Hello,

I need to create a pool of X ByteBuf instances of maximum size S (bytes) and am having trouble deciphering how to set up a Netty pooled allocator. There aren't many comments in the buffer pool code, and it seems like the built-in pools are far more complex than what I need. However, I am still interested in using ReferenceCounted to trigger placing a ByteBuf back into the pool once the refCnt reaches zero.

With the more complex implementations, we have fields such as,

    int pageSize, int maxOrder, int pageShifts, int chunkSize

What are reasonable settings for a simple example of 500 ByteBuf instances with a maximum size of 512 bytes?

-Travis

ryan rawson

Jun 17, 2013, 2:11:30 AM
to ne...@googlegroups.com, ne...@googlegroups.com
Why use pooled objects? Java can be very efficient with short-lived objects.

Be sure to test regardless. Getting objects promoted to old gen isn't great either. 

Sent from your iPhone

"이희승 (Trustin Lee)"

Jun 17, 2013, 2:21:48 AM
to ne...@googlegroups.com
Hello Travis,
The default setup should be fine. The pool implementation is basically
a variant of jemalloc, so if you don't have prior knowledge about
jemalloc, it will not be easy to understand.

To explain very simply, the allocator creates a large buffer called a
'chunk' and slices it into smaller ones to lend to you. The size of a
chunk is determined by pageSize and maxOrder: chunkSize = pageSize <<
maxOrder, where the default pageSize is 8192 and maxOrder is 11 (i.e.
chunkSize is 16 MiB). pageShifts is another derived value that is
relevant only to the internals.

If a user asks for a buffer larger than chunkSize, the allocator will
return an unpooled buffer. If a user asks for a small buffer, say less
than half of pageSize (or maybe some other threshold I can't remember
right now), it will slice an 8192-byte page into smaller pieces.
Otherwise the buffer will span multiple pages.

So, I think the default configuration parameters should be fine in most
cases unless you allocate buffers greater than 16 MiB. If you want the
allocator to create smaller chunks, I'd recommend decreasing maxOrder.
Also, allocating a buffer that spans multiple pages takes longer than
allocating a sub-page buffer, so you might want to increase pageSize if
you very often allocate buffers larger than one page (8 KiB).
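
For illustration, here is a minimal sketch of constructing a pooled allocator with these parameters spelled out and releasing a buffer back into the pool. The exact constructor overloads vary across Netty 4.x releases, so treat this as an approximation and check the PooledByteBufAllocator javadoc for your version; the values are just the defaults discussed above.

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.PooledByteBufAllocator;

    public class PooledAllocatorExample {
        public static void main(String[] args) {
            // Defaults discussed above: pageSize = 8192, maxOrder = 11,
            // so chunkSize = 8192 << 11 = 16 MiB per chunk.
            PooledByteBufAllocator alloc = new PooledByteBufAllocator(
                    true,  // preferDirect
                    1,     // nHeapArena
                    1,     // nDirectArena
                    8192,  // pageSize
                    11);   // maxOrder

            ByteBuf buf = alloc.buffer(512); // 512 B is served from a sub-page slice of one page
            try {
                buf.writeInt(42);
            } finally {
                buf.release(); // refCnt drops to 0 and the slice goes back to the pool
            }
        }
    }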

HTH,
T

--
https://twitter.com/trustin
https://twitter.com/trustin_ko
https://twitter.com/netty_project

"이희승 (Trustin Lee)"

Jun 17, 2013, 2:24:53 AM
to ne...@googlegroups.com
However efficient the garbage collector is, creating a byte[] or ByteBuffer
requires the JVM to fill the buffer with zeroes, which consumes quite a bit
of memory bandwidth. A buffer pool removes that overhead completely.

Also, although it is true that garbage collection of short-lived
objects is very cheap, it is not completely free. When we can pool
an object, we should, and doing so helps the JVM deal with the garbage
that other parts of the application produce.

T

"이희승 (Trustin Lee)"

Jun 17, 2013, 2:26:30 AM
to ne...@googlegroups.com
Also, please take a look at io.netty.util.Recycler, which might be
interesting to you when you want to pool an arbitrary object. See
io.netty.channel.MessageList to learn how to use it.
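
A rough sketch of the Recycler pattern is below, assuming the newer Handle-based API (the exact method signatures have changed across Netty releases, so check the Recycler javadoc for your version); the class and field names are made up for illustration.

    import io.netty.util.Recycler;

    // Hypothetical pooled object; replace the fields with whatever state you need.
    final class PooledEvent {
        private static final Recycler<PooledEvent> RECYCLER = new Recycler<PooledEvent>() {
            @Override
            protected PooledEvent newObject(Handle<PooledEvent> handle) {
                return new PooledEvent(handle);
            }
        };

        private final Recycler.Handle<PooledEvent> handle;
        long value;

        private PooledEvent(Recycler.Handle<PooledEvent> handle) {
            this.handle = handle;
        }

        static PooledEvent newInstance(long value) {
            PooledEvent e = RECYCLER.get(); // reuses a recycled instance when one is available
            e.value = value;
            return e;
        }

        void recycle() {
            value = 0;            // clear state before returning the instance
            handle.recycle(this); // hand the instance back to the Recycler
        }
    }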

ryan rawson

Jun 17, 2013, 2:32:31 AM
to ne...@googlegroups.com, ne...@googlegroups.com
Pooling is a strategy that may or may not improve performance. I think the message should be "may (frequently) improve performance; YMMV, so benchmark". Trust, but verify, if you will.



Sent from your iPhone

"이희승 (Trustin Lee)"

Jun 17, 2013, 2:46:24 AM
to ne...@googlegroups.com
Sure thing. At least for buffer allocation, check out the
netty-microbenchmark module.

travis...@gmail.com

Jun 17, 2013, 12:26:24 PM
to ne...@googlegroups.com
Thanks for the details Trustin.

I'm experimenting with the LMAX Disruptor in combination with Netty, but I don't want the Disruptor to pre-allocate the ByteBuf instances, because data from clients is copied into a ByteBuf only after numerous channel requests/responses. It seems like it would be most efficient to use a Netty ByteBuf pool, pass a ByteBuf instance to the Disruptor, and then return the ByteBuf to the pool after the last Disruptor "worker" is done with the data.

Other details: I'm using a Flyweight pattern to read headers/messages from the ByteBuf, and I'm going to be doing I/O with the ByteBuf (journal file and socket broadcast). The Disruptor was recently updated to work better with multiple producer threads, so that scenario seems like a good match for how Netty uses multiple threads to accept client communication.
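
A rough sketch of that handoff is below, assuming a recent Disruptor (3.x DSL) and Java 8; the event class, handler names, and buffer sizes are invented for illustration, and in a real server the publishing side would live inside a Netty channel handler.

    import com.lmax.disruptor.EventHandler;
    import com.lmax.disruptor.RingBuffer;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.util.DaemonThreadFactory;
    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.PooledByteBufAllocator;

    public class ByteBufHandoff {
        // Ring buffer slot: only carries a reference to a pooled ByteBuf.
        static final class BufEvent {
            ByteBuf buf;
        }

        public static void main(String[] args) {
            Disruptor<BufEvent> disruptor = new Disruptor<>(
                    BufEvent::new, 1024, DaemonThreadFactory.INSTANCE);

            EventHandler<BufEvent> journal = (event, sequence, endOfBatch) -> {
                // ... write event.buf to the journal file ...
            };
            EventHandler<BufEvent> broadcastAndRelease = (event, sequence, endOfBatch) -> {
                // ... broadcast event.buf to sockets ...
                event.buf.release(); // last worker is done: refCnt reaches 0, buffer returns to the pool
                event.buf = null;    // avoid holding a stale reference in the ring slot
            };
            disruptor.handleEventsWith(journal).then(broadcastAndRelease);
            RingBuffer<BufEvent> ring = disruptor.start();

            // Publishing side (would normally run inside a Netty handler):
            // take a pooled buffer, fill it, then hand it to the ring.
            ByteBuf msg = PooledByteBufAllocator.DEFAULT.buffer(512);
            msg.writeLong(System.nanoTime());
            long seq = ring.next();
            try {
                ring.get(seq).buf = msg;
            } finally {
                ring.publish(seq);
            }

            disruptor.shutdown(); // wait for the workers to drain the ring
        }
    }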

This is really an experiment and an opportunity to learn more about how/where the LMAX Disruptor brings a big performance benefit. Thanks!

Kevin Burton

Jun 18, 2013, 11:23:07 AM
to ne...@googlegroups.com


On Sunday, June 16, 2013 11:24:53 PM UTC-7, Trustin Lee wrote:
> However efficient the garbage collector is, creating a byte[] or ByteBuffer
> requires the JVM to fill the buffer with zeroes, which consumes quite a bit
> of memory bandwidth. A buffer pool removes that overhead completely.


There is also the issue of direct buffers.

If you're using direct buffers, the JVM does NOT manage them correctly. IMO it's completely broken and a bad bug.

Java doesn't free the buffers until the GC decides to, and by that time you could be out of OS memory, which means your JVM can get OOM-killed.

Horrible, horrible behavior, basically.

But with pooled direct byte buffers you allocate once on startup and re-use.

Also, can't sendfile be used on direct byte buffers? I've been meaning to look into whether Netty does the right thing here.

Kevin Burton

Jun 18, 2013, 11:24:53 AM
to ne...@googlegroups.com
Nice. Post your thoughts regarding the Disruptor and Netty to the list when you're done (if you wouldn't mind).

I'm about to do the same thing with Peregrine so it would be interesting to leverage any insight you have.

Kevin Burton

Jun 18, 2013, 3:38:14 PM
to ne...@googlegroups.com
Do you run this yourself manually, or do you have more of a continuous-integration process with metrics and graphs to track performance?

I am going to be running benchmarks on all commits in Peregrine. I'm totally sold on continuous integration for making development way easier and much more fun and productive. If you have an error, you can immediately see which commit broke the build when a unit test fails.

But with benchmarking it's a bit more subtle.

You could lose 10% of performance and not even realize it...

This way, with a benchmark for every commit, you could see which ones slowed performance (and correct them sooner).

It turns out YourKit has a command-line option for profiling, so I might try to incorporate that too.

"이희승 (Trustin Lee)"

Jun 18, 2013, 3:43:46 PM
to ne...@googlegroups.com
+1. Please feel free to get back to us. We're very interested in your
experiment!

Cheers,
T

"이희승 (Trustin Lee)"

Jun 18, 2013, 3:44:32 PM
to ne...@googlegroups.com
It's manual and ad-hoc currently, but I'm going to introduce something
nice maybe next quarter.

travis...@gmail.com

Jun 19, 2013, 11:12:17 AM
to ne...@googlegroups.com
I still have some more work to do before I start benchmarking, but here are some thoughts on how this differs from the original LMAX Disruptor "best case" scenario. My impression was that the LMAX server read pre-encoded byte messages with a single thread doing socket accept, so it was able to read small messages very quickly and maintain the single-writer principle as that same thread placed messages onto the ring buffer. Disruptor worker threads can read off the ring buffer in batches (if they want to), so the sequential data could take advantage of cache locality and prefetching.

With my use case, and many real-world servers, the messages coming from clients are of varying size, and I have to do some processing on them before I place them on the ring buffer. The fact that the messages are of varying size is a big problem, or at least greatly diminishes how easily we can take full advantage of the cache/memory-friendly nature of the Disruptor. For my use case I am going to enforce a maximum message size, but the general case (messages of varying size) should be investigated in the future.

So, using a Netty ByteBuf pool, with the asynchronous gap between when a ByteBuf is taken from the pool and when it is actually placed on the ring buffer, is probably going to turn a "dense" memory allocation (when the pool was first created) into a much more sparse one over time. It seems like what is needed is a way for the pool to clean itself up and get back to a "dense" state every now and then, but I need to think about possible algorithms for that. To be clear, the pool only becomes "sparse" from the perspective of worker threads that are reading messages off the ring buffer in batches.

Another potential issue is that when the ring buffer gets backed up (completely full because workers can't keep up) it blocks, which would in turn block the Netty threads that are handling client communication. The ring buffer always needs to be larger than necessary, and it will require a lot of measuring and tweaking before one can have confidence in the configuration. Still, if the worker threads can use thread affinity (pinned to a core/socket), then some really nice performance should be possible with a server that has a Netty "front end".

It really seems like the Disruptor is best for servers with lots of cores, and definitely not for a server running in a hosted VM environment.