Thank you for guiding me to the MPMC queue algorithm posted here.
I read about your initial idea and the experiments you did on it.
I'm very interested in these kinds of algorithms and think I have a lot to learn from this.
Please bear with me if I bore you with well-known details, and correct me if my observations are wrong.
Here are my observations:
Let's consider a single producer and a single consumer working on the queue.
The producer does an atomic fetch_and_add on producer_pos and then writes the enqueued value into the array of items.
The consumer does an atomic fetch_and_add on consumer_pos and then reads an item from the array of items.
So a producer makes two writes to finish an enqueue, while a consumer makes one write (updating consumer_pos) and one read (of an item in the array) to finish a dequeue.
It should be noted that the two operations done by the producer and by the consumer are separated in the time domain.
Considering the initial idea where the cell is defined as,
item_t data [(cache_line_size - 2*sizeof(short)) / sizeof(item_t)];
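For concreteness, here is roughly how I picture the whole cell and the two-step operations. This is only a sketch: the field names and item type are my assumptions, and the bounds handling and synchronization a real queue needs are elided, since only the memory access pattern matters here.

#include <atomic>
#include <cstddef>

typedef int item_t;                          // assumed item type
constexpr std::size_t cache_line_size = 64;  // assumed line size

// Both positions and the items share a single cache line.
struct cell_t {
    std::atomic<unsigned short> producer_pos;
    std::atomic<unsigned short> consumer_pos;
    item_t data[(cache_line_size - 2 * sizeof(short)) / sizeof(item_t)];
};

// Assumes pos stays within the cell; the real algorithm would detect a
// full/empty cell and move on, which is elided here.
void enqueue(cell_t& c, item_t value) {
    unsigned short pos = c.producer_pos.fetch_add(1); // write 1: claim a slot
    c.data[pos] = value;                              // write 2: store the item
}

item_t dequeue(cell_t& c) {
    unsigned short pos = c.consumer_pos.fetch_add(1); // write: claim a slot
    return c.data[pos];                               // read: load the item
}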
The first write, to producer_pos, results in an Intent To Write (ITOW), which ensures an Exclusive copy of cell_t is present in the producer's L1 cache. The producer then modifies producer_pos, and the cache line containing cell_t moves to the Modified state. At this point, if the consumer thread does not interfere with this cell_t, the producer's next write, to the array of items, will not cause another ITOW, because the producer already holds the cache line in the Modified state; it can perform data[pos] = value with no further coherence cost.
But if the consumer thread writes consumer_pos before the producer writes to the array of items, then the consumer has to obtain an exclusive copy of cell_t in its L1 cache. The consumer's write therefore issues an ITOW, which forces the cache line modified by the producer to be written back to their common ancestor (putting the producer's copy into the Invalid state) before an exclusive copy is handed to the consumer to modify. Now, if the producer tries to write to the array of items again, it issues yet another ITOW, because its copy of the line is Invalid. The producer's ITOW invalidates the consumer's copy (after a write-back to the common ancestor, of course), obtains exclusive access from the common ancestor, and modifies the line.
When the consumer thread then wants to read the value from the array, it cannot, because its copy of the cache line is Invalid; it issues an Intent To Read (ITOR) to obtain a shared copy of the line, which again causes a flush: the producer's modified line is written back to the common ancestor. At the end of this ITOR, both threads' L1 copies of the line are in the Shared state, and the consumer reads the value.
So what happens in this worst case, where the producer and consumer alternate their steps in time, is continuous invalidations all the way up to the common ancestor just to obtain L1 copies to modify or read. The case gets even worse when the producer and consumer threads run on cores of different processors, since the cache line then has to be updated across two ancestor levels. It is worst of all when there are multiple producers.
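The ping-pong described above is just false sharing, and it can be reproduced outside the queue. A minimal sketch (the 64-byte line size and the counter names are my assumptions):

#include <atomic>
#include <thread>

// Two counters on the same cache line: each increment invalidates the
// other core's copy, like producer_pos/consumer_pos above.
struct same_line {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// One line per counter: no coherence traffic between the two threads.
struct separate_lines {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <class Counters>
void hammer(Counters& c, long iters) {
    std::thread t1([&] { for (long i = 0; i < iters; ++i) c.a.fetch_add(1); });
    std::thread t2([&] { for (long i = 0; i < iters; ++i) c.b.fetch_add(1); });
    t1.join();
    t2.join();
}

Timing hammer() on the two structs should expose the cost of the invalidations directly, and the gap should widen further when the two threads are pinned to cores on different processors.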
Now coming to the variant,
// note 256 here
item_t data [(256 - 3*sizeof(uintptr_t)) / sizeof(item_t)];
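For concreteness, here is how I picture the full variant cell. The field order is purely my guess; what matters for the argument below is that the two positions end up on different 64-byte lines of the 256-byte cell:

#include <cstdint>

typedef int item_t;  // assumed
struct block_t;      // forward declaration, used only as a pointer

struct cell_t {
    uintptr_t produce_pos;  // first cache line
    item_t data[(256 - 3 * sizeof(uintptr_t)) / sizeof(item_t)];
    uintptr_t consume_pos;  // last cache line
    block_t* next;          // my guess at the third pointer-sized field
};                          // 256 bytes = four 64-byte lines in total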
Now, the reasoning I can find for why the above variant performs better is that consume_pos and produce_pos are located on different cache lines, so producers and consumers do not contend for an exclusive copy in order to write to them. Different producers still contend for produce_pos, and different consumers for consume_pos, but I do not think that can be avoided.
So I would expect the following variant to perform at least as well as your best implementation, and better than your initial idea:

item_t data[(page_size - 2 * cache_line_size - sizeof(block_t*)) / sizeof(item_t)];

Here consume_pos and produce_pos are on two separate cache lines.
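A sketch of the layout I have in mind (the concrete sizes are assumptions):

#include <cstddef>
#include <cstdint>

typedef int item_t;  // assumed
struct block_t;

constexpr std::size_t cache_line_size = 64;  // assumed
constexpr std::size_t page_size = 4096;      // assumed

struct cell_t {
    alignas(cache_line_size) uintptr_t produce_pos;  // own cache line
    alignas(cache_line_size) uintptr_t consume_pos;  // own cache line
    block_t* next;
    item_t data[(page_size - 2 * cache_line_size - sizeof(block_t*)) / sizeof(item_t)];
};
static_assert(sizeof(cell_t) <= page_size, "one cell per page");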
And to build on this, I think it would also help if the positions chosen in the data array were directed to different cache lines for consecutive push/pop operations on the queue; see the sketch below. In short, this would be similar to the way consecutive operations choose micro-queues in an LRU fashion in TBB's concurrent_queue.
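A sketch of the kind of index mapping I mean (the sizes, and a 4-byte item, are assumptions): consecutive operation numbers rotate across the cache lines of the cell first and only then move within a line, so back-to-back pushes (or pops) never hit the same line:

#include <cstddef>

constexpr std::size_t items_per_line = 64 / sizeof(int); // 16 ints per 64-byte line
constexpr std::size_t lines_in_cell  = 4;                // e.g. a 256-byte cell
constexpr std::size_t slots          = items_per_line * lines_in_cell;

// Bijective for k in [0, slots): the k-th operation lands on line
// (k % lines_in_cell), so operations k and k+1 touch different lines.
constexpr std::size_t slot_for(std::size_t k) {
    return (k % lines_in_cell) * items_per_line + (k / lines_in_cell) % items_per_line;
}

Of course the positions claimed by fetch_and_add are sequential, so a remapping like this would have to be applied when turning a claimed position into an array index.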