On Mon, Dec 23, 2013 at 11:55 AM, Rajiv Kurian <
geet...@gmail.com> wrote:
> My use case is the following:
>
> 1) A single thread reads bytes off the network and produces work.
> 2) Multiple worker threads actually complete the work. There are only a few
> types of work (say at most 20) but each one takes a variable amount of time.
>
> I first tried using multiple SPSC ring buffers (connecting the single
> producer to a worker thread) to distribute the work. The producer would just
> round robin between the SPSC ring buffers and enqueue some work to each
> worker. This avoids contention, since the producer thread and each worker
> thread are connected by a separate queue. It ends up causing imbalanced
> queues, though: some workers are starved of work while others have too much.
> This especially applies to my workload, where some work items take a few
> seconds while others take a few hundred microseconds.
Hi!
Yes, this definitely does not look like a good solution.
> I thought of a few other options:
>
> 1) Use a SPMC ring buffer. The workers just compete to get a slot on the
> ring buffer and then do their work. The nice thing is that they now are
> never idle if there is work to be done. The bad thing is contention on the
> queue. This is the simplest option though.
>
> 2) Use some kind of work requesting. I just saw the work requesting design
> on Dmitry's blog. This seems like a nice pattern especially since, if most
> workers are busy there is no contention. The problem though is if a worker
> is stuck working on a large request (one that takes a few seconds) it won't
> be able to hand off work to an idle worker.
>
> 3) Build a cost model of requests. For my application this doesn't seem like
> a bad thing. Since I have only a few types of requests (20 or so) I could
> run a tuning phase at start up that builds a cost model for each input task.
> Let's say for an image processing application the types of work are filters
> like blur, sharpen, resize etc. We could run a phase on startup where we
> check how much time each of the filter takes with image of different sizes
> (user supplied maybe). Once we have this info down, we could build a cost
> model which estimates how much time say a blur of an image of size x px * y
> px would take. Now we could go back to the multiple SPSC queues design and
> instead of round robin we just assign work to the lightest loaded worker
> based on our cost model. So in essence we are just trying to ensure that the
> workers never get starved or over-worked at the source of scheduling instead
> of after the fact.
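For option (3), the scheduling step itself is simple once the cost model exists. A minimal sketch of least-loaded assignment (the per-worker load counters and `pick_worker` are illustrative assumptions; real code would have each worker subtract the cost again when it finishes an item):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assign an incoming task to the worker with the least estimated
// outstanding work. `cost` comes from the startup tuning phase;
// `load[i]` tracks the estimated pending work of worker i.
size_t pick_worker(std::vector<double>& load, double cost) {
    size_t best = 0;
    for (size_t i = 1; i < load.size(); ++i)
        if (load[i] < load[best])
            best = i;
    load[best] += cost;  // worker decrements this when the task completes
    return best;
}
```

The SPSC queues stay uncontended; only the load estimates steer the round-robin away from busy workers.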
There is also "classical" work stealing. You can have N SPMC queues;
the producer enqueues in round-robin, and each thread takes elements
from its own queue and processes them. If a thread runs out of work, it
tries to steal half of the elements from another thread's queue.
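A mutex-based sketch of that stealing scheme, just to show the take/steal logic (a real implementation would use lock-free deques; `Task` and the steal-half policy are placeholders):

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

using Task = int;  // placeholder for the real work item

struct WorkerQueue {
    std::mutex m;
    std::deque<Task> q;
};

// Producer distributes tasks round-robin across the per-worker queues.
void produce(std::vector<WorkerQueue>& qs, const std::vector<Task>& tasks) {
    for (size_t i = 0; i < tasks.size(); ++i) {
        WorkerQueue& wq = qs[i % qs.size()];
        std::lock_guard<std::mutex> g(wq.m);
        wq.q.push_back(tasks[i]);
    }
}

// Worker `self` takes from its own queue; if that is empty it steals
// half the elements (rounded up) from the first non-empty victim.
std::optional<Task> take(std::vector<WorkerQueue>& qs, size_t self) {
    {
        std::lock_guard<std::mutex> g(qs[self].m);
        if (!qs[self].q.empty()) {
            Task t = qs[self].q.front();
            qs[self].q.pop_front();
            return t;
        }
    }
    for (size_t v = 0; v < qs.size(); ++v) {
        if (v == self) continue;
        std::scoped_lock g(qs[self].m, qs[v].m);  // locks both, deadlock-free
        size_t n = (qs[v].q.size() + 1) / 2;      // steal half, at least one
        for (size_t i = 0; i < n; ++i) {
            qs[self].q.push_back(qs[v].q.front());
            qs[v].q.pop_front();
        }
        if (!qs[self].q.empty()) {
            Task t = qs[self].q.front();
            qs[self].q.pop_front();
            return t;
        }
    }
    return std::nullopt;  // no work anywhere
}
```

The point of stealing half (rather than one element) is that a thread that was starved comes away with enough work to stay busy for a while before it has to touch a shared queue again.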
However, I would consider using an efficient SPMC queue as well,
because it will be much simpler.
You can take this queue:
http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
and remove the CAS from enqueue because you have only 1 producer.
Or you can even take this modification of Chris's as a base:
https://groups.google.com/forum/#!searchin/lock-free/thomasson/lock-free/acjQ3-89abE/a6-Di0GZsyEJ
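Roughly, the single-producer variant of that bounded queue looks like this (a sketch adapted from the linked algorithm, so treat the details as assumptions; capacity must be a power of two):

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

using Task = int;  // placeholder for the real work item

class SpmcQueue {
    struct Cell {
        std::atomic<size_t> seq;
        Task data;
    };
    std::vector<Cell> buf_;
    size_t mask_;
    size_t epos_ = 0;              // plain variable: only the producer touches it
    std::atomic<size_t> dpos_{0};  // consumers race on this with CAS

public:
    explicit SpmcQueue(size_t capacity) : buf_(capacity), mask_(capacity - 1) {
        for (size_t i = 0; i != capacity; ++i)
            buf_[i].seq.store(i, std::memory_order_relaxed);
    }

    // Single producer: the CAS on the enqueue position is replaced
    // by a plain increment.
    bool enqueue(Task t) {
        Cell& c = buf_[epos_ & mask_];
        if (c.seq.load(std::memory_order_acquire) != epos_)
            return false;  // queue full
        c.data = t;
        c.seq.store(epos_ + 1, std::memory_order_release);
        ++epos_;
        return true;
    }

    // Multiple consumers: unchanged from the MPMC version.
    bool dequeue(Task& t) {
        size_t pos = dpos_.load(std::memory_order_relaxed);
        for (;;) {
            Cell& c = buf_[pos & mask_];
            intptr_t dif = (intptr_t)c.seq.load(std::memory_order_acquire)
                         - (intptr_t)(pos + 1);
            if (dif == 0) {
                if (dpos_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    t = c.data;
                    c.seq.store(pos + mask_ + 1, std::memory_order_release);
                    return true;
                }
            } else if (dif < 0) {
                return false;  // queue empty
            } else {
                pos = dpos_.load(std::memory_order_relaxed);
            }
        }
    }
};
```

Each cell's sequence number tells the producer and the consumers whose turn it is, so enqueue and dequeue never touch each other's position counters.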
Since you have a single queue in the program, you can afford to pad
each element to cache-line size. This should keep contention very low.
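Concretely, the padding just means aligning each cell so two cells never share a cache line (64 bytes is an assumption; it is the common line size on x86):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

using Task = int;  // placeholder for the real work item

// alignas(64) rounds each cell's size up to a whole cache line, so
// consumers hammering adjacent cells never false-share a line.
struct alignas(64) PaddedCell {
    std::atomic<size_t> seq;
    Task data;
};

static_assert(sizeof(PaddedCell) % 64 == 0,
              "each cell must occupy whole cache lines");
```

The memory cost is one line per element, which is why this only makes sense when the program has a single (or very few) such queues.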