On Mon, Dec 23, 2013 at 11:55 AM, Rajiv Kurian <
geet...@gmail.com> wrote:
> My use case is the following:
>
> 1) A single thread reads bytes off the network and produces work.
> 2) Multiple worker threads actually complete the work. There are only a few
> types of work (say at most 20) but each one takes a variable amount of time.
>
> I first tried using multiple SPSC ring buffers (connecting the single
> producer to a worker thread) to distribute the work. The producer would just
> round robin between the SPSC ring buffers and enqueue some work to each
> worker. This avoids contention, since the producer thread and each worker
> thread are connected by a separate queue. It ends up causing imbalanced
> queues, though: some workers are starved of work while others have too much.
> This especially applies to my workload, where some work items take a few
> seconds while others take a few hundred microseconds.
Hi!
Yes, this definitely does not look like a good solution.
> I thought of a few other options:
>
> 1) Use a SPMC ring buffer. The workers just compete to get a slot on the
> ring buffer and then do their work. The nice thing is that they now are
> never idle if there is work to be done. The bad thing is contention on the
> queue. This is the simplest option though.
>
> 2) Use some kind of work requesting. I just saw the work requesting design
> on Dmitry's blog. This seems like a nice pattern especially since, if most
> workers are busy there is no contention. The problem though is if a worker
> is stuck working on a large request (one that takes a few seconds) it won't
> be able to hand off work to an idle worker.
>
> 3) Build a cost model of requests. For my application this doesn't seem like
> a bad thing. Since I have only a few types of requests (20 or so) I could
> run a tuning phase at start up that builds a cost model for each input task.
> Let's say for an image processing application the types of work are filters
> like blur, sharpen, resize etc. We could run a phase on startup where we
> check how much time each of the filter takes with image of different sizes
> (user supplied maybe). Once we have this info down, we could build a cost
> model which estimates how much time say a blur of an image of size x px * y
> px would take. Now we could go back to the multiple SPSC queues design and
> instead of round robin we just assign work to the lightest loaded worker
> based on our cost model. So in essence we are just trying to ensure that the
> workers never get starved or over-worked at the source of scheduling instead
> of after the fact.
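For option (3), the scheduling step itself is simple once the cost model exists. A minimal sketch of least-loaded assignment (the per-worker load counters and `pick_worker` are illustrative assumptions; real code would have each worker subtract the cost again when it finishes an item):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assign an incoming task to the worker with the least estimated
// outstanding work. `cost` comes from the startup tuning phase;
// `load[i]` tracks the estimated pending work of worker i.
size_t pick_worker(std::vector<double>& load, double cost) {
    size_t best = 0;
    for (size_t i = 1; i < load.size(); ++i)
        if (load[i] < load[best])
            best = i;
    load[best] += cost;  // worker decrements this when the task completes
    return best;
}
```

The SPSC queues stay uncontended; only the load estimates steer the round-robin away from busy workers.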
There is also "classical" work stealing. You can have N SPMC queues;
the producer enqueues in round-robin, and each thread takes elements
from its own queue and processes them. If a thread runs out of work, it
tries to steal half of the elements from another thread's queue.
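A mutex-based sketch of that stealing scheme, just to show the take/steal logic (a real implementation would use lock-free deques; `Task` and the steal-half policy are placeholders):

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

using Task = int;  // placeholder for the real work item

struct WorkerQueue {
    std::mutex m;
    std::deque<Task> q;
};

// Producer distributes tasks round-robin across the per-worker queues.
void produce(std::vector<WorkerQueue>& qs, const std::vector<Task>& tasks) {
    for (size_t i = 0; i < tasks.size(); ++i) {
        WorkerQueue& wq = qs[i % qs.size()];
        std::lock_guard<std::mutex> g(wq.m);
        wq.q.push_back(tasks[i]);
    }
}

// Worker `self` takes from its own queue; if that is empty it steals
// half the elements (rounded up) from the first non-empty victim.
std::optional<Task> take(std::vector<WorkerQueue>& qs, size_t self) {
    {
        std::lock_guard<std::mutex> g(qs[self].m);
        if (!qs[self].q.empty()) {
            Task t = qs[self].q.front();
            qs[self].q.pop_front();
            return t;
        }
    }
    for (size_t v = 0; v < qs.size(); ++v) {
        if (v == self) continue;
        std::scoped_lock g(qs[self].m, qs[v].m);  // locks both, deadlock-free
        size_t n = (qs[v].q.size() + 1) / 2;      // steal half, at least one
        for (size_t i = 0; i < n; ++i) {
            qs[self].q.push_back(qs[v].q.front());
            qs[v].q.pop_front();
        }
        if (!qs[self].q.empty()) {
            Task t = qs[self].q.front();
            qs[self].q.pop_front();
            return t;
        }
    }
    return std::nullopt;  // no work anywhere
}
```

The point of stealing half (rather than one element) is that a thread that was starved comes away with enough work to stay busy for a while before it has to touch a shared queue again.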
However, I would consider using an efficient SPMC queue as well,
because it will be much simpler.
You can take this queue:
http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
and remove the CAS from enqueue because you have only 1 producer.
Or you can even take this modification of Chris's as a base:
https://groups.google.com/forum/#!searchin/lock-free/thomasson/lock-free/acjQ3-89abE/a6-Di0GZsyEJ
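Roughly, the single-producer variant of that bounded queue looks like this (a sketch adapted from the linked algorithm, so treat the details as assumptions; capacity must be a power of two):

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

using Task = int;  // placeholder for the real work item

class SpmcQueue {
    struct Cell {
        std::atomic<size_t> seq;
        Task data;
    };
    std::vector<Cell> buf_;
    size_t mask_;
    size_t epos_ = 0;              // plain variable: only the producer touches it
    std::atomic<size_t> dpos_{0};  // consumers race on this with CAS

public:
    explicit SpmcQueue(size_t capacity) : buf_(capacity), mask_(capacity - 1) {
        for (size_t i = 0; i != capacity; ++i)
            buf_[i].seq.store(i, std::memory_order_relaxed);
    }

    // Single producer: the CAS on the enqueue position is replaced
    // by a plain increment.
    bool enqueue(Task t) {
        Cell& c = buf_[epos_ & mask_];
        if (c.seq.load(std::memory_order_acquire) != epos_)
            return false;  // queue full
        c.data = t;
        c.seq.store(epos_ + 1, std::memory_order_release);
        ++epos_;
        return true;
    }

    // Multiple consumers: unchanged from the MPMC version.
    bool dequeue(Task& t) {
        size_t pos = dpos_.load(std::memory_order_relaxed);
        for (;;) {
            Cell& c = buf_[pos & mask_];
            intptr_t dif = (intptr_t)c.seq.load(std::memory_order_acquire)
                         - (intptr_t)(pos + 1);
            if (dif == 0) {
                if (dpos_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    t = c.data;
                    c.seq.store(pos + mask_ + 1, std::memory_order_release);
                    return true;
                }
            } else if (dif < 0) {
                return false;  // queue empty
            } else {
                pos = dpos_.load(std::memory_order_relaxed);
            }
        }
    }
};
```

Each cell's sequence number tells the producer and the consumers whose turn it is, so enqueue and dequeue never touch each other's position counters.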
Since you have a single queue in the program, you can afford to pad
each element to cache-line size. This should keep contention very low.
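Concretely, the padding just means aligning each cell so two cells never share a cache line (64 bytes is an assumption; it is the common line size on x86):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

using Task = int;  // placeholder for the real work item

// alignas(64) rounds each cell's size up to a whole cache line, so
// consumers hammering adjacent cells never false-share a line.
struct alignas(64) PaddedCell {
    std::atomic<size_t> seq;
    Task data;
};

static_assert(sizeof(PaddedCell) % 64 == 0,
              "each cell must occupy whole cache lines");
```

The memory cost is one line per element, which is why this only makes sense when the program has a single (or very few) such queues.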