The problem is that it is slow, even slower than a naive lock-based implementation. I am looking for a way to make it faster, but I wonder whether it is possible to make it significantly faster at all.
As usual, the problem is the CAS instruction; high contention is especially painful here. I manage memory manually, so every thread "taking" the head or tail must increment it. Because so many threads compete with each other, the CAS loop generates heavy traffic on the bus (I mean the interconnect between the cores and the memory controller), which can lead to cache misses.
I've tried using _mm_pause, but it didn't improve performance (perhaps I used it incorrectly). I've seen that the Boost implementation avoids the increment, but it doesn't manage memory manually; it uses something like a pool. I tried to look for a similar implementation, but I found only bounded ones (which can be simpler and faster).
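For reference, here is a minimal sketch of the kind of CAS retry loop with exponential backoff around _mm_pause that I have in mind. It is not my actual queue code: `increment_with_backoff`, `cpu_relax`, and the spin cap of 1024 are hypothetical names and values chosen for illustration, and a plain atomic counter stands in for the head/tail increment.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
// On x86, _mm_pause hints to the CPU that we are in a spin-wait loop.
inline void cpu_relax() { _mm_pause(); }
#else
// Assumption: on other architectures, fall back to yielding the thread.
inline void cpu_relax() { std::this_thread::yield(); }
#endif

// Hypothetical helper: increments `counter` with a CAS loop and backs off
// exponentially after each failed attempt. Returns the value seen just
// before the successful increment.
inline std::uint64_t increment_with_backoff(std::atomic<std::uint64_t>& counter) {
    std::uint64_t expected = counter.load(std::memory_order_relaxed);
    unsigned spins = 1;
    const unsigned max_spins = 1024;  // illustrative cap, not a tuned value

    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        // On failure, `expected` already holds the current value, so there
        // is no need to reload it. Stay off the cache line for a while
        // before retrying.
        for (unsigned i = 0; i < spins; ++i) {
            cpu_relax();
        }
        if (spins < max_spins) {
            spins *= 2;  // double the wait after each failed attempt
        }
    }
    return expected;
}
```

The idea is that a thread which loses the CAS waits progressively longer before touching the contended cache line again, instead of calling _mm_pause once and retrying immediately. Whether this helps depends on the number of threads and on the cap, so the values above would need measuring.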
Any suggestions would be appreciated.
Thanks in advance.