On Jul 30 2009, 6:51 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> Here is my multi-producer/single-consumer queue: http://groups.google.ru/group/lock-free/browse_frm/thread/55df71b87ac...
A couple of questions.
First off, I'm sorry if this topic has already been covered somewhere else, but I couldn't find any discussions on it.
I'm going to skip the discussion about memory fences, as the code seems to assume that it will run on a very strict memory model (such as x86).
Q. #1
It seems to me that the test "if (n != t)" and the whole block under it could be skipped, since the code that follows takes care of cases when the tail was changed from "under" us. Am I right?
Q. #2
Why are we loading head_ in stages and not as a whole number? Since "head_" is volatile, the compiler is not allowed to optimize two reads into a single read. That means two separate reads must be issued, and that it's possible to read a pointer with an older value and a counter with a newer value. I don't see whether that creates correctness issues, but it sure is confusing.
Thanks,
Andy.
Hi Andrew,
Yes, it assumes a strict memory model. By "// 32-bit, Windows, MSVC" I imply x86.
> Q. #1
> It seems to me that the test "if (n != t)" and the whole block under it
> could be skipped, since the code that follows takes care of cases when
> the tail was changed from "under" us. Am I right?
No. It handles the following special case.
If (n->next_ == 0) and (n != t), then the producer is already in the middle of an enqueue, i.e. he changed tail_ with XCHG but has not yet updated the next pointer; he is currently at the point marked with "(***)".
So the consumer just spins waiting for the producer; that's what the "(n != t)" block handles.
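To make that window concrete, here is a minimal runnable sketch of the idea. The names (node_t, tail_, next_) follow the thread, but std::atomic replaces the MSVC intrinsics and the head_ counter is omitted, so this is an illustration, not the original queue:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Sketch of the window described above: enqueue swings tail_ with an atomic
// exchange (the XCHG) first, and only afterwards links prev->next_; a
// producer preempted between the two is sitting at "(***)".
struct node_t {
    std::atomic<node_t*> next_{nullptr};
    int value{};
};

struct mpsc_queue {
    node_t stub_;                          // dummy node, queue starts empty
    std::atomic<node_t*> tail_{&stub_};    // shared among producers
    node_t* head_{&stub_};                 // touched only by the consumer

    void enqueue(node_t* n) {
        n->next_.store(nullptr, std::memory_order_relaxed);
        node_t* prev = tail_.exchange(n);  // tail_ already points to n...
        // (***) ...but the list link below is not established yet
        prev->next_.store(n, std::memory_order_release);
    }

    node_t* dequeue() {
        node_t* n = head_;
        node_t* next = n->next_.load(std::memory_order_acquire);
        while (next == nullptr) {
            if (tail_.load() == n)         // n == t: queue really is empty
                return nullptr;
            // n != t with next_ still null: a producer is at (***),
            // so the consumer spins until the link appears.
            std::this_thread::yield();
            next = n->next_.load(std::memory_order_acquire);
        }
        head_ = next;
        return next;                       // the node that was enqueued
    }
};
```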
> Q. #2
> Why are we loading head_ in stages and not as a whole number? Since
> "head_" is volatile, the compiler is not allowed to optimize two reads
> into a single read. That means two separate reads must be issued, and
> that it's possible to read a pointer with an older value and a counter
> with a newer value. I don't see whether that creates correctness
> issues, but it sure is confusing.
What I wanted to emphasize is that the algorithm does not require an atomic double-word load (there are algorithms that do require an atomic double-word load, or at least require the loads to be issued in a particular order).
Of course, you may code it as an atomic load if you wish; that won't sacrifice correctness.
As for performance, a plain load from a cached location is pretty cheap (a fraction of a cycle), while an atomic double-word load has to use MMX/SSE load instructions along with a conversion from an MMX/SSE register to a general-purpose register, which I suspect is slower.
Btw, if you issue a plain load on an __int64 variable, the compiler will indeed issue two 32-bit loads.
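For illustration, here is a sketch of what "loading head_ in stages" means. The head_t layout and field names are assumed from the thread; under concurrency the two plain loads can observe the pair at different moments (older ptr_, newer cnt_), and the point above is that the algorithm tolerates such a torn read:

```cpp
#include <cassert>
#include <cstdint>

// Assumed {pointer, ABA counter} pair, as discussed in the thread.
// On 32-bit x86 there is no double-word atomicity in plain loads,
// so the two fields are read independently.
struct head_t {
    void*    ptr_;
    uint32_t cnt_;   // counter packed next to the pointer
};

head_t load_in_stages(head_t volatile& h) {
    head_t copy;
    copy.ptr_ = h.ptr_;  // first load: the pointer
    copy.cnt_ = h.cnt_;  // second load: the counter, a later moment in time
    return copy;
}
```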
--
Dmitriy V'jukov
Thank you for your explanation.
But it looks like I'm still missing something.
On Feb 9, 2:31 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On Feb 8, 11:23 pm, Andrew Venikov <aveni...@gmail.com> wrote:
<snip>
> > Q. #1
> > It seems to me that the test "if (n != t)" and the whole block under it
> > could be skipped, since the code that follows takes care of cases when
> > the tail was changed from "under" us. Am I right?
>
> No. It handles the following special case.
> If (n->next_ == 0) and (n != t), then the producer is already in the
> middle of an enqueue, i.e. he changed tail_ with XCHG but has not yet
> updated the next pointer; he is currently at the point marked with
> "(***)".
> So the consumer just spins waiting for the producer; that's what the
> "(n != t)" block handles.
What I meant to say here is that even if you remove the (n != t) check and the block that gets executed when it's true, the algorithm would still be correct.
Let's see what happens if you remove it.
First, we get into that area only if, prior to that, we've established that there's only one element in the queue (n->next_ was NULL).
Now, let's say that producers have modified the tail, possibly several times.
The first thing we'd do in that case is a CAS on the head. If it doesn't succeed, fine, we're back to looping. If it does succeed, however, we'll try to CAS the tail, which will fail in our case. That means we need to wait for the producer to update the "next" pointer of the first node in the queue, which you do in your "***" loop.
I think what I'm trying to say is that you do this "waiting" loop in two different places. Detecting that the tail has been modified earlier (at the n != t line) adds no benefit compared to detecting it later (at the CAS(tail_) line).
What am I missing?
Thanks,
Andy.
Hi Andrew,
I've carefully re-examined the algorithm, and I think you are generally right. However, here is one caveat.
The second spin-wait, "while (n->next_ == 0) SwitchToThread();", is executed while the consumer has 'head_' reset to zero, and this prevents all other consumers from making any progress (they just instantly bail out once they see 'head_ == 0'). That's not very good.
I think that part of the algorithm can be improved along the lines of:
if (prev.whole_ == h.whole_)
{
    node_t* prev_tail = (node_t*)_InterlockedCompareExchange
        ((long*)&tail_, (long)&head_.ptr_, (long)n);
    if (prev_tail == n)
        return n;
    // CHANGED START HERE \/ \/ \/
    if (n->next_)
    {
        // if the next pointer is already established by the producer,
        // then we consume the current message
        head_.ptr_ = n->next_;
        return n;
    }
    // otherwise restore 'head_' value to unblock other consumers
    // and wait for the producer to establish the next link
    head_.ptr_ = n;
    SwitchToThread();
    h.ptr_ = head_.ptr_;
    h.cnt_ = head_.cnt_;
    continue;
}
And even with this improvement I would still prefer to leave the first spin-wait "while (n->next_ == 0) SwitchToThread();" as is, because if we know that the producer is preempted at (***), there is no need for constant CASes and modifications of shared state. It's like the TATAS mutex algorithm: why try to modify shared state if we know that we will fail?
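For reference, the TATAS (test-and-test-and-set) idea mentioned above can be sketched like this; it is a generic illustration with std::atomic, not code from the queue itself:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// TATAS lock: spin on a plain read of the flag, and attempt the modifying
// atomic exchange only once the read suggests the lock is free -- the same
// "don't touch shared state when you know you'll fail" idea as above.
class tatas_lock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            // read-only spin: no write traffic on the cache line
            while (locked_.load(std::memory_order_relaxed))
                std::this_thread::yield();
            // the read said "free": now try the expensive atomic RMW
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;
        }
    }
    void unlock() {
        locked_.store(false, std::memory_order_release);
    }
};
```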
--
Dmitriy V'jukov