Concurrent BlockingQueue


Simone Bordet

Jan 25, 2013, 7:15:19 AM1/25/13
to mechanica...@googlegroups.com
Hi,

does anyone know of a concurrent BlockingQueue implementation that
does not allocate a "node" for each call to offer() and that does not
use locks (aside from await/signal notifications) ?

I read that Martin Thompson had in mind to open source something along
these lines... :)

Thanks !

--
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless. Victoria Livschitz

Chris Vest

Jan 25, 2013, 7:44:47 AM1/25/13
to Simone Bordet, mechanica...@googlegroups.com
A design that is frugal with allocation, doesn't use locks, yet can give back-pressure when full.
This sounds like the Disruptor, actually.  :)

Simone Bordet

Jan 25, 2013, 9:53:26 AM1/25/13
to Chris Vest, mechanica...@googlegroups.com
Hi,

On Fri, Jan 25, 2013 at 1:44 PM, Chris Vest <mr.chr...@gmail.com> wrote:
> A design that is frugal with allocation, doesn't use locks yet can give
> back-pressure when full.
> This sounds like the Disruptor, actually. :)

Ugh :)

I was looking more for a single class rather than a whole framework.

In any case, where can I find Disruptor examples or documentation for
multi-producer with multi-consumer ?

Stefano Fago

Jan 25, 2013, 11:51:33 AM1/25/13
to mechanica...@googlegroups.com, Chris Vest
something like this or simpler?

http://mentaqueue.soliveirajr.com/Page.mtw

It builds on the experience of the Disruptor to obtain simpler interactions...

There are also various examples on Google Code and GitHub that use the same
Disruptor-style circular-array logic... one example is http://code.google.com/p/jodk/

Bye :-)

Martin Thompson

Jan 25, 2013, 12:09:31 PM1/25/13
to mechanica...@googlegroups.com
Hi Simone,

I assume you want the semantics of put() and take() if you're talking about a BlockingQueue?  Normal queues only have offer() and poll(), which are non-blocking.

I have queues that are much faster than the Disruptor and that do not allocate nodes.  I'm working on an open-source project that will eventually make these available, but that may take some time yet.  On my training course we work through examples where the students learn to build such things.

ArrayBlockingQueue does not allocate nodes internally.  However, you should note that any blocking implementation using j.u.c locks maintains its waiters as a linked queue internally (AbstractQueuedSynchronizer).  So if avoiding garbage is your goal, using locks does not get you there.
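As an illustration of the pre-allocated approach under discussion, here is a minimal single-producer/single-consumer ring buffer: no locks, no per-offer node allocation. This is only a sketch (capacity is assumed to be a power of two), not Martin's implementation, and it is not safe for multiple producers or multiple consumers.

```java
import java.util.concurrent.atomic.AtomicLong;

// Single-producer/single-consumer ring buffer: pre-allocated slots,
// no per-offer node allocation, no locks. Head and tail are monotonically
// increasing counters; the index is taken modulo the capacity.
final class SpscRingBuffer<E> {
    private final Object[] buffer;
    private final int mask;                           // capacity must be a power of two
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write

    SpscRingBuffer(int capacityPowerOfTwo) {
        buffer = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    boolean offer(E e) {
        long t = tail.get();
        if (t - head.get() == buffer.length) return false; // full
        buffer[(int) (t & mask)] = e;
        tail.lazySet(t + 1);            // ordered store publishes the element
        return true;
    }

    @SuppressWarnings("unchecked")
    E poll() {
        long h = head.get();
        if (h == tail.get()) return null;              // empty
        int idx = (int) (h & mask);
        E e = (E) buffer[idx];
        buffer[idx] = null;             // release the slot for GC of the element
        head.lazySet(h + 1);
        return e;
    }
}
```

Blocking take()/put() semantics would still need a wait/notify strategy on top (spinning, parking, or condition variables), which is exactly where the AQS garbage mentioned above creeps back in.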

What is the requirement you are trying to address?  It would then be easier to put this in context.

Regards,
Martin...

Simone Bordet

Jan 25, 2013, 2:53:39 PM1/25/13
to Stefano Fago, mechanica...@googlegroups.com, Chris Vest
Hi,

On Fri, Jan 25, 2013 at 5:51 PM, Stefano Fago <stefa...@gmail.com> wrote:
> something like this or simpler?
>
> http://mentaqueue.soliveirajr.com/Page.mtw
>
> It builds on the experience of the Disruptor to obtain simpler
> interactions...

I need it multi-producer and multi-consumer.
From that page it looks like it supports only 1 producer and 1 consumer ?

--
Simone Bordet
----
http://cometd.org
http://webtide.com
http://intalio.com
Developer advice, training, services and support
from the Jetty & CometD experts.
Intalio, the modern way to build business applications.

Simone Bordet

Jan 25, 2013, 3:16:07 PM1/25/13
to Martin Thompson, mechanica...@googlegroups.com
Hi,

On Fri, Jan 25, 2013 at 6:09 PM, Martin Thompson <mjp...@gmail.com> wrote:
> Hi Simone,
>
> I assume you want the semantics of put() and take() if you're talking about a
> BlockingQueue? Normal queues only have offer() and poll(), which are
> non-blocking.

Yes, I need the take() semantics; I can live without the put() semantics.

> I have queues that are much faster than the Disruptor and that do not allocate
> nodes. I'm working on an open-source project that will eventually make these
> available, but that may take some time yet. On my training course we work
> through examples where the students learn to build such things.

Yeah, I read you were working on something, and that's what I was
referring to in my previous email.
I could write it myself, but I thought I'd ask before
diving into it :)

> ArrayBlockingQueue does not allocate nodes internally. However, you should
> note that any blocking implementation using j.u.c locks maintains its
> waiters as a linked queue internally (AbstractQueuedSynchronizer). So if
> avoiding garbage is your goal, using locks does not get you there.
>
> What is the requirement you are trying to address? It would then be easier
> to put this in context.

It's a queue for the thread pool in Jetty.

Before you scream at me :), we have benchmarked the JDK thread pools,
and they're slower than what we have now.

We currently have a BlockingArrayQueue that uses a circular array
(that can grow) and 2 Locks to protect head and tail.
We are working on eliminating false sharing (we have a naive approach
in the current code, but it's totally broken and we know it) between
head and tail (something that does not seem to be addressed in JDK's
ConcurrentLinkedQueue).
On multicore machines, it seems that use of locks is the biggest
bottleneck, so I was looking for some lock-free implementation.

Subclassing JDK's ConcurrentLinkedQueue to maintain the queue size
kind of works, but produces garbage and ConcurrentLinkedQueue's
implementation is difficult to extend to avoid allocation (for example
to reuse Node instances).

On a related note, my experience is that queues are often mostly
empty, so array-based implementations may be subject to false sharing
on the first elements (the producer writes the array element with the
value, the consumer overwrites it with null).
Needless to say, spacing elements by a cache line is a massive waste
of memory :)
Any consideration/experience here ?

On a similar quest, we're trying to avoid allocation for schedulers:
java.util.Timer does not allocate but uses locks (and boy it really
sucks), while ScheduledExecutorService allocates.
Here, the problem is more difficult because the queue must be ordered.
I found a good non-blocking algorithm online (Sundell & Tsigas), but
if you have insights, they're much appreciated.

Martin Thompson

Jan 27, 2013, 3:07:25 AM1/27/13
to mechanica...@googlegroups.com, Martin Thompson


On Friday, January 25, 2013 8:16:07 PM UTC, Simone Bordet wrote:

Before you scream at me :), we have benchmarked the JDK thread pools,
and they're slower than what we have now.

We currently have a BlockingArrayQueue that uses a circular array
(that can grow) and 2 Locks to protect head and tail.
We are working on eliminating false sharing (we have a naive approach
in the current code, but it's totally broken and we know it) between
head and tail (something that does not seem to be addressed in JDK's
ConcurrentLinkedQueue).
On multicore machines, it seems that use of locks is the biggest
bottleneck, so I was looking for some lock-free implementation.

Subclassing JDK's ConcurrentLinkedQueue to maintain the queue size
kind of works, but produces garbage and ConcurrentLinkedQueue's
implementation is difficult to extend to avoid allocation (for example
to reuse Node instances).
 
For a number of clients I've worked on various Executor and thread pool implementations that perform significantly better than what comes with the JDK.  Even keeping the standard API I can get between a 4X and 50X improvement in throughput, depending on use case.

If you are willing to move to a different API it is possible to do even better, and to achieve things like having the same thread always run a certain category of task, which is useful for avoiding contention elsewhere in a system and improving thread affinity.
 
On a related note, my experience is that queues are often mostly
empty, so array-based implementations may be subject to false sharing
on the first elements (the producer writes the array element with the
value, the consumer overwrites it with null).
Needless to say, spacing elements by a cache line is a massive waste
of memory :)
Any consideration/experience here ?

There is a small impact because of the false sharing here.  When nulling out a field you can do this with lazySet rather than a volatile write.  This way the write goes to the store buffer and you do not block the core until it is drained.  However I've found this a great way of exposing JVM bugs/features in a multi-producer multi-consumer case.
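The slot-nulling trade-off Martin describes can be sketched like this (the class and method names are illustrative, not from any real queue implementation): set() is a volatile write, which on x86 is a locked instruction that stalls until the store buffer drains, while lazySet() is only an ordered store that becomes visible shortly after.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Illustrative consumer side of an array-backed queue. The contrast is in
// which kind of store is used to null out a consumed slot.
final class SlotConsumer {
    private final AtomicReferenceArray<Object> slots;

    SlotConsumer(int capacity) {
        slots = new AtomicReferenceArray<>(capacity);
    }

    void publish(int i, Object task) {
        slots.set(i, task);       // volatile write: full fence, immediate visibility
    }

    Object consume(int i) {
        Object task = slots.get(i);
        slots.lazySet(i, null);   // ordered store: no full fence, "visible soon"
        return task;
    }
}
```

The caveat in the message above applies: lazySet relies on correct pairing with a volatile read on the other side, which is exactly where the JVM bugs/features Martin mentions tend to surface.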

If you space elements by a cache line, I've found it actually degrades performance rather than helping, because of the increased bandwidth washing through the CPU cache.
 

On a similar quest, we're trying to avoid allocation for schedulers:
java.util.Timer does not allocate but uses locks (and boy it really
sucks), while ScheduledExecutorService allocates.
Here, the problem is more difficult because the queue must be ordered.
I found a good non-blocking algorithm online (Sundell & Tsigas), but
if you have insights, they're much appreciated.

What is the desire to totally avoid allocation?  Are these objects living just too long and thus getting promoted from the young generation?  What type of tasks are you trying to schedule?

Simone Bordet

Jan 28, 2013, 8:02:30 AM1/28/13
to Martin Thompson, mechanica...@googlegroups.com
Hi,

On Sun, Jan 27, 2013 at 9:07 AM, Martin Thompson <mjp...@gmail.com> wrote:
> There is a small impact because of the false sharing here. When nulling out
> a field you can do this with lazySet rather than a volatile write. This way
> the write goes to the store buffer and you do not block the core until it is
> drained. However I've found this a great way of exposing JVM bugs/features
> in a multi-producer multi-consumer case.

Interesting. Care to expand ?

> If you space elements by cacheline, I've found it actually degrades
> performance rather than helping because of the increased bandwidth washing
> through the CPU cache.

Makes sense.

> What is the desire to totally avoid allocation? Are these objects living
> just too long and thus getting promoted from the young generation? What
> type of tasks are you trying to schedule?

We have some Jetty users running 10k-15k HTTP requests/s (without
breaking a sweat); we are trying hard to submit as few tasks as possible
to either the thread pool or the scheduler, and it would be good if we
could avoid allocation.
Given a small Java object that gets submitted, say 8 bytes or so,
we're in the range of hundreds of KiB/s of allocation just for these.
I'm not worried by GC itself (which is pretty cheap, as long as I can mark
those objects dead), but by the impact on GC frequency (the young
generation fills much more quickly), and by the "hidden cost of
allocation" mentioned by Cliff Click: trashing the TLABs (IIUC).
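A quick check on the numbers above (the per-object size is an assumption: a "small" object on 64-bit HotSpot occupies at least 16 bytes once the header and alignment padding are counted, even if its fields total 8 bytes):

```java
// Back-of-envelope allocation rate for one small task object per request.
final class AllocationEstimate {
    static long bytesPerSecond(long submissionsPerSecond, long bytesPerObject) {
        return submissionsPerSecond * bytesPerObject;
    }

    public static void main(String[] args) {
        long rate = bytesPerSecond(15_000, 16);   // 240,000 B/s
        System.out.println(rate + " B/s = ~" + rate / 1024 + " KiB/s");
    }
}
```

That lands in the low hundreds of KiB/s, consistent with the estimate in the text, and grows linearly with any extra wrappers allocated per submission.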

I'm trying to gather information about the possibility of a
non-allocating solution.
In parallel we're working on reducing those submissions to a minimum
(and we have succeeded partially).
If there is a non-allocating solution, we can make a big step now, and
work on reducing the submissions anyway with a bit less pressure :)

Brian Toal

Jan 29, 2013, 1:37:19 AM1/29/13
to mechanica...@googlegroups.com, Martin Thompson
Simone, we're using Jetty 8.1.4 and I didn't see BlockingArrayQueue in use.  Can you share the source and the benchmark?

I'm interested in understanding what feature in Jetty is being impacted at the moment.

Gil Tene

Jan 29, 2013, 12:34:47 PM1/29/13
to
On Monday, January 28, 2013 5:02:30 AM UTC-8, Simone Bordet wrote:
> I'm not worried by GC (which is pretty cheap, as long as I can mark 
> those objects dead), but impact on GC frequency (young generation will 
> be filling way quicker), and the "hidden cost of allocation" as 
> mentioned by Cliff Click: trashing the TLABs (IIUC). 

I won't presume to speak for Cliff on this (having worked with him for 9 years, I know we can still find technical stuff to disagree on). But I'll give you my high level thoughts on the "real cost of allocation":

Allocation in Java (on modern JVMs) is about as cheap as it comes from a CPU cycle point of view. Cheaper than any other form of memory management, including virtually all forms of pooling of objects, free list management, and certainly cheaper than malloc(). TLABs (on modern JVMs) do a great job, and are really hard to trash or thrash (unless you dramatically misconfigure your newgen doing something silly like mixing a tiny newgen with lots of concurrently runnable threads). But the real core benefit comes from the prevalent copying/compacting newgen collectors that produce contiguous empty memory to allocate into (with TLABs and bump-pointer allocation), and do so using algorithms that avoid spending any cycles thinking about the dead matter they get rid of. 

The *only* downside for allocation of short-lived objects in most JVMs is the occurrence of newgen collections, and the frequency of those collections. Frequency is linear to (allocation_rate / newgen_size), and the amount of work done in each collection can be roughly thought of as a constant (not sensitive to either allocation rate or newgen size as long as stuff still gets to die young). Given your control of the denominator in the frequency expression, you hold ultimate control over the efficiency of allocation: you can make allocation/collection cost arbitrarily cheap simply by growing the [otherwise empty] newgen memory in your JVM configuration, making newgen frequency lower at the same time.
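The frequency relation above can be sketched numerically (all figures here are invented for illustration): halving the program's allocation rate and doubling the newgen size have the same effect on collection frequency.

```java
// collection frequency ~ allocation_rate / newgen_size
final class NewgenFrequency {
    static double collectionsPerSecond(double allocRateMBs, double newgenSizeMB) {
        return allocRateMBs / newgenSizeMB;
    }

    public static void main(String[] args) {
        double base          = collectionsPerSecond(200, 1024);
        double halvedAlloc   = collectionsPerSecond(100, 1024); // hard: rewrite the code
        double doubledNewgen = collectionsPerSecond(200, 2048); // easy: a JVM flag
        System.out.println(halvedAlloc == doubledNewgen);       // same frequency
    }
}
```

As the text goes on to note, neither lever changes the duration of each pause; only its frequency.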

*IF* newgen collections are stop-the-world events, you end up facing periodic newgen pauses. That's usually the real downside: pauses, not efficiency, speed, or wasted CPU. Unfortunately, reducing allocation pressure doesn't make these pauses go away, or make them any shorter; it just makes them less frequent.

So from an efficiency point of view, think of it this way: you can either spend a lot of coding effort to avoid allocation and cut your overall allocation rate (for the entire program) by half, or you can double your newgen size to achieve the same thing. The effect will be virtually the same with either choice: newgen GC frequency will be cut in half, GC CPU will be cut in half, and GC pause times will remain the same.

Now try the same trick with a 10x effect. You can easily do that by increasing newgen size. Not so easy when it comes to getting rid of 90% of the overall allocation in your program.

The real bugaboo is the newgen pause, and unfortunately no amount of code improvement or newgen increase will get rid of it. Both will only serve to delay it (and not by much). Since all commercial JVMs except for one [can you guess which?] use monolithic stop-the-world newgen collectors, these pauses will come to bite you. If you only care about avoiding multi-second noise, those bites are usually more like tiny nibbles. But if you care about noise levels in the milliseconds, or even tens or low hundreds of milliseconds, a stop-the-world newgen collector usually means periodic SLA failure.

Given that I think "the real problem" is stop-the-world newgen algorithms, I'm continually surprised at the lack of work on non-stop-the-world newgen collectors. Even academic work on the subject is very sparse.

> I'm trying to gather information about the possibility of a 
> non-allocating solution. 

Don't get me wrong. I like non-allocating solutions when they "fall into place" naturally. (My HdrHistogram is non-allocating for both recording and iteration, for example.) And I think it's natural for things like constant-footprint data structures and circular buffers. Circular queues with an embedded, fixed number of queue elements may make sense as well, but then you start messing around with how to synchronize slot allocation and such [beyond head and tail stuff], and you can quickly find yourself halfway to building a disruptor.

But when you start dealing with variable-sized stuff coming in from requestors, which probably means a bunch of allocation ahead of (and behind) some sort of actual queue, allocating [or not allocating] queue elements is probably not going to make much of a difference in the overall scheme of things. Not when you count things in hundreds of thousands per second. Not even when you count them in a few millions per second. Think about it this way: each x86 core will happily allocate hundreds of MB/sec without breaking a sweat, all the while doing useful Java work. Newgens can get rid of it just as fast if it's dead matter.

On the other hand, leveraging and embracing allocation (and automatic, background garbage collection) can make concurrent things both simpler and faster. To quote Cliff on this subject: "...Many concurrent algorithms are very easy to write with a GC and totally hard (to down right impossible) using explicit free..."

Adam Browning

Jan 30, 2013, 6:11:32 PM1/30/13
to mechanica...@googlegroups.com
I wonder if a hybrid approach where the head of the queue is a small linked list of pre-allocated nodes, with an array overflow queue would be a good match for what you're dealing with. It would limit allocation while (probably) avoiding a lot of false sharing, though I suspect the overhead of maintaining it would be brutal if it frequently fills the linked-list to overflow into the array.

Simone Bordet

Jan 31, 2013, 3:42:47 AM1/31/13
to Brian Toal, mechanica...@googlegroups.com, Martin Thompson
Hi,

On Tue, Jan 29, 2013 at 7:37 AM, Brian Toal <brian...@gmail.com> wrote:
> Simone, we're using Jetty 8.1.4 and I didn't see BlockingArrayQueue in use.
> Can you share the source and the benchmark?

BAQ is the default queue for Jetty's thread pool, see the latest
version in Jetty 9 here:
http://git.eclipse.org/c/jetty/org.eclipse.jetty.project.git/tree/jetty-util/src/main/java/org/eclipse/jetty/util/BlockingArrayQueue.java

Jetty 8's version is on branch "jetty-8", same location.

The benchmark is not published, but I may find some time to polish it
and publish it.

> I'm interested in understanding what feature in Jetty is being impacted at
> the moment.

None. As I said, BAQ works better than JDK queues, and Jetty's thread
pools do too, so there's no impact apart from better performance :)

I was just trying to see if some work had been done in this area, but
it looks like not much has, or I am worrying too much.

Martin Thompson

Jan 31, 2013, 5:21:55 AM1/31/13
to


On Monday, January 28, 2013 1:02:30 PM UTC, Simone Bordet wrote:

On Sun, Jan 27, 2013 at 9:07 AM, Martin Thompson <mjp...@gmail.com> wrote:
> There is a small impact because of the false sharing here.  When nulling out
> a field you can do this with lazySet rather than a volatile write.  This way
> the write goes to the store buffer and you do not block the core until it is
> drained.  However I've found this a great way of exposing JVM bugs/features
> in a multi-producer multi-consumer case.

Interesting. Care to expand ?

Atomic*.lazySet() is usually implemented via Unsafe.putOrdered*().  On x86 this should be sufficient for many algorithms when we have a single uncontended writer for a field in a concurrent environment.  However, I keep finding that, for the reference implementation in particular, the write is not paired correctly with a volatile read.  In each instance when the bug has been found by a JVM vendor it has usually been an optimisation issue.  If I fall back to a volatile write then the issue can often be resolved, at a performance cost.  Interestingly, these issues are easier to recreate in a resource-constrained environment, say a 2-core + HT laptop.  On server-class machines I've never seen the issues arise.
 
> What is the desire to totally avoid allocation?  Are these objects living 
> just too long and thus getting promoted from the young generation?  What
> type of tasks are you trying to schedule?

We have some Jetty users running 10k-15k HTTP requests/s (without
breaking a sweat); we are trying hard to submit as few tasks as possible
to either the thread pool or the scheduler, and it would be good if we
could avoid allocation.
Given a small Java object that gets submitted, say 8 bytes or so,
we're in the range of hundreds of KiB/s of allocation just for these.
I'm not worried by GC itself (which is pretty cheap, as long as I can mark
those objects dead), but by the impact on GC frequency (the young
generation fills much more quickly), and by the "hidden cost of
allocation" mentioned by Cliff Click: trashing the TLABs (IIUC).

I'm trying to gather information about the possibility of a
non-allocating solution.
In parallel we're working on reducing those submissions to a minimum
(and we have succeeded partially).
If there is a non-allocating solution, we can make a big step now, and
work on reducing the submissions anyway with a bit less pressure :)

I've done a fair bit of work helping messaging and web server product vendors.  In my experience aiming for zero allocation is quite far down the list of performance focus areas.  Profiling usually leads me to parsing, integration with IO, logging, and shared-state management issues first.  Have you ranked the bottlenecks in Jetty to ensure you are working on the top issue at all times?  Remember the Theory of Constraints: after fixing the number 1 issue, the list below it can radically change.

Take parsing for example.  Besides typically being one of the largest sources of allocation, it often relies on standard JDK classes for dealing with primitive types, dates, timestamps, and other structured formats.  The JDK classes mostly work on strings in UCS-2 format, yet you are dealing with bytes on and off the wire, so you are converting before doing any real work.  Having libraries that deal with the bytes directly can be a huge benefit to many applications.  Also, the standard JDK classes are not designed to work in a reactive or asynchronous manner, which is generally the best way to do systems-level programming.  Well, I think asynchronous is better generally, but that is a can of worms :-)
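The "deal with the bytes directly" point can be sketched as follows (a hypothetical helper, not from any real parser): parse a non-negative ASCII integer, such as a Content-Length value, straight out of the wire buffer.

```java
// Parses decimal digits from a byte buffer without building an
// intermediate String, so it touches each byte once and allocates nothing.
final class AsciiInt {
    static int parse(byte[] buf, int offset, int length) {
        int value = 0;
        for (int i = offset; i < offset + length; i++) {
            byte b = buf[i];
            if (b < '0' || b > '9') {
                throw new NumberFormatException("not a digit at index " + i);
            }
            value = value * 10 + (b - '0');
        }
        return value;
    }
}
```

Compare with the conventional route, new String(buf, off, len, charset) followed by Integer.parseInt(), which allocates a String (and its char[]) and converts the bytes to UCS-2 before any real work happens.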

Simone Bordet

Jan 31, 2013, 6:23:22 AM1/31/13
to Martin Thompson, mechanica...@googlegroups.com
Hi,

On Thu, Jan 31, 2013 at 11:18 AM, Martin Thompson <mjp...@gmail.com> wrote:
> I've done a fair bit of work helping messaging and web server product
> vendors. In my experience aiming for zero allocation is quite far down the
> list of performance focus areas. Profiling usually leads me to parsing,
> integration with IO, logging, and shared state management issues first.
> Have you ranked the bottlenecks in Jetty to ensure you are working on the
> top issue at all times? Remember Theory of Constraints - After fixing the
> number 1 issue the list below can radically change.
>
> Take parsing for example. Besides typically being one of the largest
> sources of allocation, it often relies on standard JDK classes for dealing
> with primitive types, dates, and other structured formats. The JDK classes
> mostly work on strings in UCS-2 format, yet you are dealing with bytes on
> and off the wire, so you are converting before doing any real work.
> Having libraries that deal with the bytes directly can be a huge benefit to
> many applications.

Yes to all, we are doing all that.

For the queue, as I said, I tried to ask here if there was a
non-allocating concurrent BlockingQueue available, because that
would have been low-hanging fruit: take the better impl, put it in
Jetty, done.
But it seems that is not the case, so it will require some work that,
as many of you said, may not be worth the effort - in the sense that
we may concentrate on something bigger for now.

We do have our own faster UTF handling.

We try to be smart about HTTP parsing, but HTTP is a PITA to parse, so we
had to make some compromises there.

As posted on this group we have blogged about threading and cache
issues we have figured out in Jetty.

I am sure there are tons more to do.

> Also the standard JDK classes are not designed to work
> in a reactive or asynchronous manner which is generally the best way to do
> systems level programming. Well I think asynchronous is better generally
> but that is a can of worms :-)

If you mean async I/O, then yes: in Jetty 9 all I/O is asynchronous
at the low level, and we block only to respect the Servlet
API semantics.
SPDY and WebSocket offer both blocking and asynchronous APIs, but the
implementation is all asynchronous.
And no, we're not using JDK 7's AsynchronousChannel - not good enough for us.

Anyhow, we're probably a bit off topic talking about how to optimize
Jetty - sorry about that.

Thanks to all for the useful comments !