Hi Artur,
The algorithm looks correct to me.
Relacy can be useful for verification of such kind of algorithms:
http://www.1024cores.net/home/relacy-race-detector
It does not only verify all possible interleavings of threads, but
also model relaxed memory model and ensures that all the memory fences
and ordering constraints are correct.
Several comments wrt performance. The padding between head and tail
does not make sense, because threads always touch them at the same
time. So with padding they only need to load 2 cache lines instead of
1.
You don't need fetch_add to update positions. It is usually compiled
to an expensive atomic read-modify-write instruction. Plain store is
enough here.
The most efficient SPSC queues do not require producer and consumer to
synchronize during each operation. Instead of both producer and
consumer looking at both head and tail, they use per-element
synchronization state that allows them to use mostly independently.
Search this group or comp.programming.threads archives for SPSC, or
check out these articles:
http://www.1024cores.net/home/lock-free-algorithms/queues/fastflow
http://www.1024cores.net/home/lock-free-algorithms/queues/unbounded-spsc-queue
--
Dmitry Vyukov
All about lockfree/waitfree algorithms, multicore, scalability,
parallel computing and related topics:
http://www.1024cores.net