Eliminating the window of inconsistency using rollforward and rollback

Eric Grant

Feb 15, 2014, 9:58:23 PM
to lock...@googlegroups.com
Colleagues,

I have a thought experiment that might interest you.

For our point of departure, let's revisit Dmitry's "Multi-producer/multi-consumer SEH-based queue," https://groups.google.com/forum/#!topic/lock-free/yOMgHaSmowA. Here, I'm interested in the enqueue() routine, which Dmitry presented thus:

void enqueue(node_t* node)
{
    ASSERT(node);
    node->next_ = 0;
    node_t** prev = (node_t**)
        _InterlockedExchange((long*)&tail_, (long)node);
    ASSERT(prev);
    // <--- the window of inconsistency is HERE (***)
    prev[0] = node;
}
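For readers who prefer portable code, the same enqueue can be restated with C11 atomics. This is my sketch, not Dmitry's code; the `node_t` layout, the `tail_` variable, and the stub-node initialization are assumptions:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct node { struct node *next_; } node_t;

/* Assumed queue state: tail_ starts out pointing at a stub node. */
static node_t stub;
static _Atomic(node_t *) tail_ = &stub;

void enqueue(node_t *node)
{
    assert(node);
    node->next_ = NULL;
    /* Atomically swing tail_ to the new node; 'prev' is the old tail. */
    node_t *prev = atomic_exchange_explicit(&tail_, node,
                                            memory_order_acq_rel);
    assert(prev);
    /* <--- the window of inconsistency is HERE (***) */
    prev->next_ = node;
}
```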

Let's assume the x86-64 ISA and translate this into assembly language:

mov node, prev // for 'xchg' below
movq $0, NEXT(node) // node->next = 0
xchg prev, (tail) // prev = XCHG (&tail, node)
// window of inconsistency
mov node, NEXT(prev) // prev->next = node

Dmitry observed that the "window of inconsistency" is one machine instruction in length. We can see here that the one instruction is the final 'mov': if the kernel preempts the enqueue() routine after the 'xchg' and before that final 'mov', the queue suffers the trauma that requires making the corresponding dequeue() routine so complicated.
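To see why that trauma lands on the consumer side, consider this sketch of what a dequeuer must do when it reaches a node whose successor has not yet been published. The helper and its name are illustrative, not Dmitry's actual dequeue:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct node { _Atomic(struct node *) next_; } node_t;

/* Illustrative consumer-side helper: a dequeuer that needs n's successor
 * must wait until the producer has executed its final 'mov'. If that
 * producer is preempted inside the window, this spins for the entire
 * duration of the preemption. */
static node_t *wait_for_next(node_t *n)
{
    node_t *next;
    while ((next = atomic_load_explicit(&n->next_,
                                        memory_order_acquire)) == NULL)
        ; /* producer is still inside the window of inconsistency */
    return next;
}
```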

What if we could tell the kernel NOT to preempt the routine in this window? To put it another way, what if we could direct the kernel to allow the routine to "roll forward" through the window before the routine may be preempted? Something like this:

mov node, prev // for 'xchg' below
movq $0, NEXT(node) // node->next = 0
xchg prev, (tail) // prev = XCHG (&tail, node)

.rollforward_enter

mov node, NEXT(prev) // prev->next = node

.rollforward_leave

Would this solve the problem? If we are using "wired" (i.e., non-pageable) memory to hold all of the nodes, then I believe so. But if we are using normally allocated memory, which can be paged out, then we face the potential problem that the write to NEXT(prev) could result in a page fault, which simply cannot be handled without an indeterminate delay (at least on Linux/x86 or xnu/x86). Thus, rollforward alone does not suffice.
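On a POSIX system, the "wired" alternative mentioned above can be approximated with mlock(2). A minimal sketch, where the pool size, the alignment, and the function name are arbitrary choices of mine:

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Illustrative size; a real allocator would size the pool to the workload. */
enum { POOL_BYTES = 4096 };

/* Allocate a page-aligned node pool and pin it in RAM so that writes to
 * nodes (e.g., the store to NEXT(prev)) cannot take a page fault. */
void *alloc_wired_pool(void)
{
    void *pool = NULL;
    if (posix_memalign(&pool, 4096, POOL_BYTES) != 0)
        return NULL;
    if (mlock(pool, POOL_BYTES) != 0) { /* pin: pages stay resident */
        free(pool);
        return NULL;
    }
    return pool;
}
```

Note that mlock is subject to RLIMIT_MEMLOCK, so pinning a large node pool may require elevated limits.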

What if we read NEXT(prev) before the 'xchg' to ensure that the 'prev' node was paged in? We could accomplish this by adding lines 3 and 4 below. Of course, you will point out that the node read from TAIL(queue) in line 3 may not be the same node (implicitly) read from TAIL(queue) in line 5; another thread could have enqueued a new node in the meantime. But that newly enqueued node will itself be paged in by virtue of the other thread's execution of line 2. Like this:

1: mov node, prev // for 'xchg' below
2: movq $0, NEXT(node) // node->next = 0
3: mov TAIL(queue), scratch // scratch = queue->tail
4: mov NEXT(scratch), scratch // scratch = scratch->next (ignore result)
5: xchg prev, (tail) // prev = XCHG (&tail, node)

.rollforward_enter

6: mov node, NEXT(prev) // prev->next = node

.rollforward_leave
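In C terms, lines 3 and 4 amount to a dummy read through the current tail before the exchange. A sketch (again my restatement, with assumed names; the volatile cast stands in for the "ignore result" read of line 4):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct node { struct node *next_; } node_t;

static node_t stub;
static _Atomic(node_t *) tail_ = &stub;

void enqueue_pretouch(node_t *node)
{
    assert(node);
    node->next_ = NULL;                           /* line 2 */
    node_t *scratch = atomic_load(&tail_);        /* line 3 */
    /* Line 4: touch the candidate prev node so its page is resident.
     * The volatile access keeps the compiler from eliding the read. */
    (void)*(node_t *volatile *)&scratch->next_;
    node_t *prev = atomic_exchange(&tail_, node); /* line 5 */
    prev->next_ = node;                           /* line 6 */
}
```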

Doubtless you will now point out that one or both threads could be preempted between lines 2 and 5 and that during a sufficiently lengthy interruption, the operating system could page out the very 'prev' node that we had counted on being paged in as a result of having executed lines 2 through 4. This brings us to another "what if?"

What if we could tell the kernel that if it preempts us between line 2 and line 5, it should "roll back" the instruction pointer to line 2 when it returns control to the preempted thread? That might result in the re-execution of lines 2 through 4 (or some subset thereof); however, like the re-execution of instructions in a CAS loop, that would not be problematic here. More importantly, the hypothesized rollback would effectively ensure that when line 6 is executed, it will have been executed only at the end of an uninterrupted sequence of instructions beginning with line 2. With this guarantee, we can be certain that the 'prev' node returned by the 'xchg' in line 5 will be paged in, such that the write to NEXT(prev) in line 6 will NOT generate a page fault. Like this:

1: mov node, prev // for 'xchg' below

.rollback_enter

2: movq $0, NEXT(node) // node->next = 0
3: mov TAIL(queue), scratch // scratch = queue->tail
4: mov NEXT(scratch), scratch // scratch = scratch->next (ignore result)
5: xchg prev, (tail) // prev = XCHG (&tail, node)

.rollforward_enter

6: mov node, NEXT(prev) // prev->next = node

.rollforward_leave
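No such kernel interface exists today; as part of the thought experiment, the markers above might be communicated to a hypothetical kernel module as address triples. Every name below is invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Entirely hypothetical user/kernel descriptor. At preemption time the
 * kernel would compare the saved instruction pointer against the bounds:
 *   in [rollback_enter, rollforward_enter)    -> IP := rollback_enter
 *   in [rollforward_enter, rollforward_leave) -> run to leave first */
struct restart_region {
    uintptr_t rollback_enter;
    uintptr_t rollforward_enter;
    uintptr_t rollforward_leave;
};

/* Would the saved IP be rolled back to rollback_enter? */
static int should_roll_back(const struct restart_region *r, uintptr_t ip)
{
    return ip >= r->rollback_enter && ip < r->rollforward_enter;
}
```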

Problem solved? No more window of inconsistency? Again, I believe so.

Of course, I don't expect you to agree without an empirical demonstration. But I am interested in any theoretical flaws that you see. Or any other comments that you might have.

Thanks,
Eric Grant

Dmitriy V'jukov

Feb 17, 2014, 1:50:53 PM
to lock...@googlegroups.com
Hi!

How do you prevent the data from being evicted from memory between 5 and 6?
If you consider things like page faults and interrupts, I do not think
that you can solve the problem of forward progress in user space. Even
if you solve the problem on the producer side, what if a consumer hits
a page fault right after reading an element? It will have essentially
the same effect: a lack of forward progress.
If you are in kernel space, then you can ensure some of the properties
by disabling interrupts around the operations. Then you can be sure
that each operation finishes in N machine instructions.
But even then, if you have a simple MOV instruction and the data is in
memory, how long can it take? There is actually no upper bound. And
this gets us back to where we started.
So I argue that all that matters is common case performance. And my
algorithm is already quite good from this point of view.
If you go for hard real time, you need to start by designing and
manufacturing your own hardware, then an OS. Only then can you have
guarantees that some operations will finish in a bounded amount of time.
Until then, lock/wait-free algorithms are illusory.









--
Dmitry Vyukov

All about lockfree/waitfree algorithms, multicore, scalability,
parallel computing and related topics:
http://www.1024cores.net

Eric Grant

Feb 18, 2014, 4:22:07 AM
to lock...@googlegroups.com
On Feb 17, 2014, at 10:50 AM, Dmitriy V'jukov wrote:

> On Sun, Feb 16, 2014 at 6:58 AM, Eric Grant <ma...@eagrant.com> wrote:
>>
>> <snip>
>>
>> 1: mov node, prev // for 'xchg' below
>>
>> .rollback_enter
>>
>> 2: movq $0, NEXT(node) // node->next = 0
>> 3: mov TAIL(queue), scratch // scratch = queue->tail
>> 4: mov NEXT(scratch), scratch // scratch = scratch->next (ignore result)
>> 5: xchg prev, (tail) // prev = XCHG (&tail, node)
>>
>> .rollforward_enter
>>
>> 6: mov node, NEXT(prev) // prev->next = node
>>
>> .rollforward_leave
>
> Hi!

Hi Dmitriy,

> How do you prevent the data from being evicted from memory between 5 and 6?

The data will not be evicted from memory between 5 and 6 because paging a block of data out of RAM to disk and replacing it with other data from disk is a long, complicated operation that will take the kernel's virtual memory subsystem *hundreds* of instructions to accomplish, involving the acquisition of one or more locks. By construction, the window in our routine is at most *four* instructions, which will execute in sequence without interruption.

> If you consider things like page faults and interrupts, I do not think
> that you can solve the problem of forward progress in user space.

You are certainly correct in this regard. We can instruct the kernel to do things like roll forward and roll back only if the kernel is prepared to heed these instructions. That means modifying the kernel source or at least adding code to the kernel dynamically, e.g., a kernel module on Linux or a kernel extension on the Mac OS.

> Even
> if you solve the problem on the producer side, what if a consumer hits
> a page fault right after reading an element? It will have essentially
> the same effect: a lack of forward progress.

I'm not sure what you mean by this. If a consumer thread faults or is otherwise preempted *before* executing the CAS (cmpxchg[16b] on x86-64) that dequeues a node, that preemption will not prevent other consumer threads from dequeuing the node while the first thread is suspended. If the consumer thread is preempted *after* the CAS that successfully dequeues a node, then again that preemption will not prevent other consumer threads from dequeuing whatever nodes remain. The concept I presented is not intended to guarantee any particular rate of forward progress; it is intended to prevent the "inconsistency" (as you yourself called it) that occurs if preemption occurs at one critical location in the enqueue routine.

> If you are in kernel space, then you can ensure some of the properties
> by disabling interrupts around the operations. Then you can be sure
> that each operation finishes in N machine instructions.
> But even then, if you have a simple MOV instruction and the data is in
> memory, how long can it take? There is actually no upper bound. And
> this gets us back to where we started.
> So I argue that all that matters is common case performance. And my
> algorithm is already quite good from this point of view.

I agree that your algorithm is already quite good. Wouldn't it be better if the window of inconsistency were eliminated?

> If you go for hard real time, you need to start by designing and
> manufacturing your own hardware, then an OS. Only then can you have
> guarantees that some operations will finish in a bounded amount of time.
> Until then, lock/wait-free algorithms are illusory.

I didn't offer any guarantee of hard real time or of a bounded amount of time to finish. I merely offered a (theoretical) improvement on an already efficient algorithm. You are correct, though, that (as noted above) the improvement is in the realm of the OS, not pure user-space code.

If we can eliminate the window of inconsistency, then we can vastly simplify the dequeue routine. I will present that modification in another post.

Cheers,
Eric

Dmitriy V'jukov

Feb 19, 2014, 5:00:45 PM
to lock...@googlegroups.com
On Tue, Feb 18, 2014 at 9:22 AM, Eric Grant <ma...@eagrant.com> wrote:
> On Feb 17, 2014, at 10:50 AM, Dmitriy V'jukov wrote:
>
>> On Sun, Feb 16, 2014 at 6:58 AM, Eric Grant <ma...@eagrant.com> wrote:
>>>
>>> <snip>
>>>
>>> 1: mov node, prev // for 'xchg' below
>>>
>>> .rollback_enter
>>>
>>> 2: movq $0, NEXT(node) // node->next = 0
>>> 3: mov TAIL(queue), scratch // scratch = queue->tail
>>> 4: mov NEXT(scratch), scratch // scratch = scratch->next (ignore result)
>>> 5: xchg prev, (tail) // prev = XCHG (&tail, node)
>>>
>>> .rollforward_enter
>>>
>>> 6: mov node, NEXT(prev) // prev->next = node
>>>
>>> .rollforward_leave
>>
>> Hi!
>
> Hi Dmitriy,
>
>> How do you prevent the data from being evicted from memory between 5 and 6?
>
> The data will not be evicted from memory between 5 and 6 because paging a block of data out of RAM to disk and replacing it with other data from disk is a long, complicated operation that will take the kernel's virtual memory subsystem *hundreds* of instructions to accomplish, involving the acquisition of one or more locks. By construction, the window in our routine is at most *four* instructions, which will execute in sequence without interruption.

The page eviction can be initiated on another core, and "take effect"
exactly between 5 and 6.

>
>> If you consider things like page faults and interrupts, I do not think
>> that you can solve the problem of forward progress in user space.
>
> You are certainly correct in this regard. We can instruct the kernel to do things like roll forward and roll back only if the kernel is prepared to heed these instructions. That means modifying the kernel source or at least adding code to the kernel dynamically, e.g., a kernel module on Linux or a kernel extension on the Mac OS.
>
>> Even
>> if you solve the problem on the producer side, what if a consumer hits
>> a page fault right after reading an element? It will have essentially
>> the same effect: a lack of forward progress.
>
> I'm not sure what you mean by this. If a consumer thread faults or is otherwise preempted *before* executing the CAS (cmpxchg[16b] on x86-64) that dequeues a node, that preemption will not prevent other consumer threads from dequeuing the node while the first thread is suspended. If the consumer thread is preempted *after* the CAS that successfully dequeues a node, then again that preemption will not prevent other consumer threads from dequeuing whatever nodes remain. The concept I presented is not intended to guarantee any particular rate of forward progress; it is intended to prevent the "inconsistency" (as you yourself called it) that occurs if preemption occurs at one critical location in the enqueue routine.

I meant that what matters in the end is user-visible properties like
latency and bandwidth. And solving just the "window of inconsistency"
problem buys you nothing.


>> If you are in kernel space, then you can ensure some of the properties
>> by disabling interrupts around the operations. Then you can be sure
>> that each operation finishes in N machine instructions.
>> But even then, if you have a simple MOV instruction and the data is in
>> memory, how long can it take? There is actually no upper bound. And
>> this gets us back to where we started.
>> So I argue that all that matters is common case performance. And my
>> algorithm is already quite good from this point of view.
>
> I agree that your algorithm is already quite good. Wouldn't it be better if the window of inconsistency were eliminated?


It definitely would be better. The question that interests me is how
much better, exactly. And I suspect that the answer is: not that much.
I may be wrong. What is your expectation?


>> If you go for hard real time, you need to start by designing and
>> manufacturing your own hardware, then an OS. Only then can you have
>> guarantees that some operations will finish in a bounded amount of time.
>> Until then, lock/wait-free algorithms are illusory.
>
> I didn't offer any guarantee of hard real time or of a bounded amount of time to finish. I merely offered a (theoretical) improvement on an already efficient algorithm. You are correct, though, that (as noted above) the improvement is in the realm of the OS, not pure user-space code.
>
> If we can eliminate the window of inconsistency, then we can vastly simplify the dequeue routine. I will present that modification in another post.
>
> Cheers,
> Eric