sys_synch?

John M. Dlugosz

unread,

Oct 19, 2009, 1:51:48 PM10/19/09

to Scalable Synchronization Algorithms

> On Fri, Oct 16, 2009 at 9:07 AM, John Dlugosz <JDlu...@tradestation.com> wrote:
> > Howdy!
> >
> > I happened upon your post <
> > http://groups.google.com/group/lock-free/browse_frm/thread/1efdc652571
> > c6137>
> >
> > and read other that were linked from it, and searched the web for
> > rcu_syncrohize, etc.
> >
> > I've been doing quite a bit of code that avoids both locks and "atomic"
> > instructions, on the x86/x64 architecture under Windows.
> >
> > But I really don't understand what is going on here. What's all the
> > hullabaloo with sys_sych() etc? I'm guessing that some processor
> > architecture will buffer writes in a way that they are not visible to
> > other cores. But, my understanding is that all machines for sale
> > today actually are "ccNUMA", that is, the cache is coherent across all
> > cores even on different nodes.
> >
> > So, what am I missing?
> >
> > Also, I see you have _ReadWriteBarrier() as the last statement in
> > reader_lock. That seems odd. It's to deal with inlining of that
> > function, right?
>
>
> Hi John,
>
> I will write a kind of article on asymmetric synchronization, I was going to do that anyway. However it will take some time, so here is quick answers.
>
> No, sys_sych() is not about cache coherency, it's about ordering. It's needed for the same things you use MFENCE instruction on x86.
>
> _ReadWriteBarrier() is required in reader_lock()/reader_unlock() so as to code from user critical section does not intermix with synchronization code in reader_lock()/reader_unlock(). If you have, let's say, _InterlockedExchange() in mutex acquire function then this provides some guarantees that user code will not hoist above that. But since I have basically no synchronization in reader_lock()/reader_unlock(), user code can hoist above acquire or sink below release.
>
> While I am writing my "article" please refer to David Dice et al "Asymmetric Dekker Synchronization":
> http://home.comcast.net/~pjbishop/Dave/Asymmetric-Dekker-Synchronization.txt
> They describe basically the same idea. They call SYNCHRONIZE(thread) what I called sys_sych(). However they applied the idea only to two-thread mutual exclusion, while I applied it to the reader-write problem.
>
> If you will have any questions after reading "Asymmetric Dekker Synchronization", feel free to ask me. However I would prefer to have a discussion on the lock-free group:
> http://groups.google.com/group/lock-free
>
> Thank you for the interest.
>
> --
> Dmitriy V'jukov
>
> Relacy Race Detector: Make your synchronization correct!
> http://groups.google.ru/group/relacy

That's what I meant about sys_sync and cache coherency: once the core
processes the write (which may have been queued to an I/O pipeline),
sending it "to memory", it is visible to all cores even though the
write starts off in the first of a hierarchy of caches before getting
to the main RAM chips.

So, given that once a write is actually issued (not just queued to be
performed later) it is visible to all, what is the reason for
requiring a thread on every other core to perform some special
operation? The write queuing is something that is only on this core.
Making sure it is really issued before doing the read is something
local to the thread doing it.

What architectures require this? The original writings here don't
mention anything so seems to be general purpose. But I've not
encountered any such concept on x86/x64.

Dmitriy Vyukov

unread,

Oct 20, 2009, 2:51:49 AM10/20/09

to Scalable Synchronization Algorithms

On Oct 19, 10:51 am, "John M. Dlugosz" <11lrha...@gmail.com> wrote:

> That's what I meant about sys_sync and cache coherency: once the core
> processes the write (which may have been queued to an I/O pipeline),
> sending it "to memory", it is visible to all cores even though the
> write starts off in the first of a hierarchy of caches before getting
> to the main RAM chips.
>
> So, given that once a write is actually issued (not just queued to be
> performed later) it is visible to all, what is the reason for
> requiring a thread on every other core to perform some special
> operation? The write queuing is something that is only on this core.
> Making sure it is really issued before doing the read is something
> local to the thread doing it.

Yes, it's private to the thread doing it. Thus each thread that
participate in synchronization have to cope with it, namely execute
MFENCE. And that's traditional 'symmetric' synchronization.

Ok, here is what I write to date for 'the article':

Here is the essence of the asymmetric synchronization from the
implementation point of view. Consider following code (which is the
core of the Dekker mutual exclusion algorithm):

X = Y = 0;

// thread 1
X = 1
R1 = X

// thread 2
Y = 1
R2 = Y

Under sequentially consistent memory model (no reorderings possible)
the outcome R1 == R2 == 0 is impossible (either R1 or R2 must be equal
to 1). However, on all modern hardware architectures the outcome R1 ==
R2 == 0 is indeed possible, because they feature relaxed memory models
(memory accesses can be reordered). To prevent that one has to use
store-load memory fences:

// thread 1
X = 1
store-load-memory-fence
R1 = X

// thread 2
Y = 1
store-load-memory-fence
R2 = Y

Now, outcome R1 == R2 == 0 is impossible, however store-load-memory-
fence is the most expensive memory fence (MFENCE instruction on x86,
or #StoreLoad membar on SPARC).
That was symmetric synchronization (both threads execute the same
code). In order to utilize asymmetric synchronization we rewrite the
code as:

// thread 1
X = 1
compiler-fence
R1 = X

// thread 2
Y = 1
system-wide-fence
R2 = Y

compiler-fence is basically costless, it prevents any reorderings on
compiler level only (_ReadWriteBarrier() for MSVC, or __asm __volatile
("":::"memory") for gcc).
And the most interesting thing here is system-wide-fence. During
execution of a system-wide-fence each thread (or more precisely -
processor/core) executes store-load-memory-fence. Informally, a thread
executes memory fence "for me and for that guy", so "that guy" (i.e.
first thread) may not execute expensive memory fence.
You may see that under such definitions the outcome R1 == R2 == 0 is
impossible too, however first thread has virtually zero
synchronization overhead now.

> What architectures require this? The original writings here don't
> mention anything so seems to be general purpose. But I've not
> encountered any such concept on x86/x64.

Basically all architectures require it.

--
Dmitriy V'jukov

John M. Dlugosz

unread,

Oct 20, 2009, 4:47:39 PM10/20/09

to Scalable Synchronization Algorithms

I see... the point is to prevent the fast thread from needing a MFENCE
at all.

At what cost? The global delay or task switch is orders of magnitude
slower, so you would have to understand when this is a benefit. I'm
surprised that MFENCE is so inefficient, and expected to have much
better characteristics than the LOCK prefix.

My own methodology is that having specialized patterns allows for much
more relaxed synchronization needs than a single general purpose
primitive. For example, I implemented a queue that has one writer and
multiple readers, for communicating jobs to worker threads.

Another is separating "commitment" of a value forms. For example,
setting a flag (originally zero/null/empty handle) is easy. Threads
don't normally signal completion of each item, since that is slow,
except when another thread wants such a signal, and then sets that
mode. The worker has zero overhead to test that the handle is null
and thus not use it to signal with; the interested thread only gets
involved when it needs to wait anyway, and it has degeneracy: set the
flag, then check the condition directly, then (if not satisfied yet)
wait for the flag. If the setting of the enable flag was "raced" with
its use, the interested thread may get both a true direct test and a
set signal which it ignores. Setting the signal multiple times is
also OK.

On the other hand, if I want to modify queued job, and a worker might
have started it or be just starting it, the caller needs to positively
know if the modification "took". For example, revoke something, and
then delete the pointer if it was indeed revoked, or realize that it's
still owned (and may be in the middle of being deleted) by the worker
otherwise.

Another interesting thing is a reference counter (for lifetimes) that
has two different counts for two threads (e.g. main and one worker),
which can be done without slow instructions except when it is hitting
zero. Having provision for exactly two threads makes it possible,
when a general multi-thread reference count is slow.

BTW, what is the concept called? First we have lock-free and wait-
free code, that uses atomic instructions instead. But what do you
call code that avoids the slow atomic instructions? xxxx-free code?

--John

John M. Dlugosz

unread,

Oct 20, 2009, 4:57:46 PM10/20/09

to Scalable Synchronization Algorithms

> code). In order to utilize asymmetric synchronization we rewrite the
> code as:
>
> // thread 1
> X = 1
> compiler-fence
> R1 = X
>
> // thread 2
> Y = 1
> system-wide-fence
> R2 = Y
>
> compiler-fence is basically costless, it prevents any reorderings on

The compiler-fence is not necessary in your case. Even with it, the
other thread can see the wrong order, so why bother? It is only
needed in front of the "safe area" that the system-wide-fence forces
execution to.

The use of the system-wide-fence for such asymmetry is not a plug-in
replacement for the "interesting part" of Dekker's, as you show.
After being forced to a known safe area, the first thread might come
around again and be in the hot zone _again_ before thread 2 can
react. It needs another set of flags around this to recover what is
done by this alone in the symmetric case (as shown in the article you
cited earlier).

--John

Dmitriy Vyukov

unread,

Oct 21, 2009, 12:57:17 AM10/21/09

to Scalable Synchronization Algorithms

On Oct 20, 1:57 pm, "John M. Dlugosz" <11lrha...@gmail.com> wrote:
> > code). In order to utilize asymmetric synchronization we rewrite the
> > code as:
>
> > // thread 1
> > X = 1
> > compiler-fence
> > R1 = X
>
> > // thread 2
> > Y = 1
> > system-wide-fence
> > R2 = Y
>
> > compiler-fence is basically costless, it prevents any reorderings on
>
> The compiler-fence is not necessary in your case. Even with it, the
> other thread can see the wrong order, so why bother? It is only
> needed in front of the "safe area" that the system-wide-fence forces
> execution to.

Nope. W/o the compiler-fence the algorithm is dead-broken.
If there will be NO compiler-fence, then compiler may reorder code in
asymmetric rw mutex as:

void reader_lock()
{
if (writer_pending)
{
reader_inside[current_thread_index] = false;
EnterCriticalSection(&cs);
reader_inside[current_thread_index] = true;
LeaveCriticalSection(&cs);
}
reader_inside[current_thread_index] = true;
}

Now assume, reader thread checks writer_pending, notices that
writer_pending=false, then reader thread is preempted.
Then writer thread sets writer_pending=true, executes sys_synch(),
verifies all reader_inside flags (which are equal to false) and enters
critical section.
Now reader thread wakes up, sets reader_inside=true, and enters
critical section too.
Taa-daa! Both threads are in critical section.

sys_synch() can only fight against hardware reorderings, but it is
powerless against compiler reorderings. sys_synch() is only a
replacement for MFENCE, a kind of REMOTE_MFENCE, compiler ordering is
still necessary.

--
Dmitriy V'jukov

Dmitriy Vyukov

unread,

Oct 21, 2009, 1:03:28 AM10/21/09

to Scalable Synchronization Algorithms

On Oct 20, 1:57 pm, "John M. Dlugosz" <11lrha...@gmail.com> wrote:

> The use of the system-wide-fence for such asymmetry is not a plug-in
> replacement for the "interesting part" of Dekker's, as you show.
> After being forced to a known safe area, the first thread might come
> around again and be in the hot zone _again_ before thread 2 can
> react. It needs another set of flags around this to recover what is
> done by this alone in the symmetric case (as shown in the article you
> cited earlier).

But of course. I only showed "the core" of the Dekker synchronization.
There is additional logic in asymmetric rw mutex and in asymmetric
Dekker mutex.
Anyway, following code:

// thread 1
X = 1

// some fence here

R1 = X

// thread 2
Y = 1

// some fence here
R2 = Y

May perfectly yield R1==R2==1, no matter what kind of fence one will
put between store and load, or even no fence at all.
So this code is not sufficient for mutual exclusion. It only
guarantees (provided correct fences) that R1==R2==0 is impossible.

--
Dmitriy V'jukov

Dmitriy Vyukov

unread,

Oct 21, 2009, 1:33:02 AM10/21/09

to Scalable Synchronization Algorithms

On Oct 20, 1:47 pm, "John M. Dlugosz" <11lrha...@gmail.com> wrote:
> I see... the point is to prevent the fast thread from needing a MFENCE
> at all.

Exactly!

> At what cost? The global delay or task switch is orders of magnitude
> slower, so you would have to understand when this is a benefit. I'm
> surprised that MFENCE is so inefficient, and expected to have much
> better characteristics than the LOCK prefix.

Yes, it is orders of magnitude slower. And yes, one have to understand
trade-offs. This algorithm is intended for applications with high read-
to-write ratios and large protected data. Read-to-write ratio can
easily be 10^7 or even higher, consider server with a lot of worker
threads, high transaction throughput, and once per minute update of
the centralized data structure (for example, routing server where
centralized data structure is a routing table).
Assume that MFENCE (or uncontended atomic RMW) costs 50 cycles. If you
cut down 50 cycles for each of 10^7 operations, you have quite a lot
of cycles to repay sys_synch().
And note that the algorithm still provides "traditional" mutual
exclusion. There are other techniques that allow you to eliminate that
nasty MFENCE, for example, RCU. However with RCU you introduce quite a
lot of complexity into your system, and you have to have enough memory
to keep alive at least 2 versions of the data.

And if you will use traditional centralized rw mutex, let's assume
that atomic RMW on shared data costs 300 cycles, and you have 2 of
them (1 to enter cs, and 1 to leave cs), and read-to-write ratio is
10^7. So now you have 6*10^9 cycles lost on reader synchronization.

And if you have shared data of moderate size and high read-to-write-
ration (at least 10) then you may consider my "optimized sequence
lock":
http://groups.google.com/group/lock-free/browse_frm/thread/5705265ef7a1a1a3
It provides both costless access and lock-free properties for readers
(i.e. readers never wait). Writers have to make a copy of the data,
and so you have to have memory for at least 2 versions of data. Must
not be a problem provided moderate size data, though.

--
Dmitriy V'jukov

Dmitriy Vyukov

unread,

Oct 21, 2009, 1:55:23 AM10/21/09

to Scalable Synchronization Algorithms

On Oct 20, 1:47 pm, "John M. Dlugosz" <11lrha...@gmail.com> wrote:

> My own methodology is that having specialized patterns allows for much
> more relaxed synchronization needs than a single general purpose
> primitive.

I totally agree.
My asymmetric rw mutex is just an another specialized primitive.
Thank you for sharing your designs. I see that we are thinking in the
same direction
\/\/\/\/\/\/\/\/\/

> For example, I implemented a queue that has one writer and
> multiple readers, for communicating jobs to worker threads.

You may see quite a lot of them here:
http://groups.google.com/group/lock-free/topics

> Another is separating "commitment" of a value forms. For example,
> setting a flag (originally zero/null/empty handle) is easy. Threads
> don't normally signal completion of each item, since that is slow,
> except when another thread wants such a signal, and then sets that
> mode. The worker has zero overhead to test that the handle is null
> and thus not use it to signal with; the interested thread only gets
> involved when it needs to wait anyway, and it has degeneracy: set the
> flag, then check the condition directly, then (if not satisfied yet)
> wait for the flag. If the setting of the enable flag was "raced" with
> its use, the interested thread may get both a true direct test and a
> set signal which it ignores. Setting the signal multiple times is
> also OK.

Note that it's basically Dekker synchronization. So you need MFENCE
between both (set handle and check condition/flag) and (set condition/
flag and check handle). Otherwise both threads may miss signal from
another.

> On the other hand, if I want to modify queued job, and a worker might
> have started it or be just starting it, the caller needs to positively
> know if the modification "took". For example, revoke something, and
> then delete the pointer if it was indeed revoked, or realize that it's
> still owned (and may be in the middle of being deleted) by the worker
> otherwise.
>
> Another interesting thing is a reference counter (for lifetimes) that
> has two different counts for two threads (e.g. main and one worker),
> which can be done without slow instructions except when it is hitting
> zero. Having provision for exactly two threads makes it possible,
> when a general multi-thread reference count is slow.

Atomic reference counting is indeed slow. It's a multicore anti-
pattern.
I've designed a similar thing, however more general purpose:
http://groups.google.com/group/lock-free/browse_frm/thread/46d00a0f543a758e
In short, each thread has it's own counter for each object. Then there
are periodic aggregations of these private counters, and some tricky
logic to detect when an object may be safely deleted.
There are also a lot of interesting optimizations to the general
scheme, that I did not post here though.

> BTW, what is the concept called? First we have lock-free and wait-
> free code, that uses atomic instructions instead. But what do you
> call code that avoids the slow atomic instructions? xxxx-free code?

I call it "atomic-free". It's not a very good name, and not a widely
accepted one. However I do not have a better one.
The problem is that lock-/wait-/obstruction-free is all about progress
guarantees. Lock-free algorithm may use 100 CASes per operation and be
dead slow as compared to mutex-based solution, but it's still lock-
free.
So I use "atomic-free" for algorithms that are all about performance/
scalability. They do not use atomic RMW, MFENCE and do not access
shared mutable state at least on fast-path. On slow-path they may use
mutex and whatever they want, so formally they may be not even
obstruction-free, i.e. they may be blocking.

--
Dmitriy V'jukov

John M. Dlugosz

unread,

Oct 21, 2009, 6:40:02 PM10/21/09

to Scalable Synchronization Algorithms

On Oct 20, 11:57 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> > > // thread 1
> > > X = 1
> > > compiler-fence
> > > R1 = X

> Nope. W/o the compiler-fence the algorithm is dead-broken.
> If there will be NO compiler-fence, then compiler may reorder code in
> asymmetric rw mutex as:
>
> void reader_lock()
> {
> if (writer_pending)
> {
> reader_inside[current_thread_index] = false;
> EnterCriticalSection(&cs);
> reader_inside[current_thread_index] = true;
> LeaveCriticalSection(&cs);
> }
> reader_inside[current_thread_index] = true;
> }
>

Uh, how does
X=1; R1=X;
get re-arranged to become all that stuff?

Sure, no doubt some compiler fencing is needed in your actual code,
but the example in the earlier post doesn't have any of that. You
were exposing purely on one instance of a write-then-read problem.

John M. Dlugosz

unread,

Oct 21, 2009, 6:45:20 PM10/21/09

to Scalable Synchronization Algorithms

> But of course. I only showed "the core" of the Dekker synchronization.
> There is additional logic in asymmetric rw mutex and in asymmetric
> Dekker mutex.

I'm sure your implementation is OK, as I saw on the original topic.
It's the explanatory writing that needs work. If this is to be a
"paper" (or posting or whatever) it needs to match up what is stated
with the earlier statements in the article.

(I'm a writer as much as a programmer, and have also edited magazines)
<http://www.dlugosz.com/Magazine/CLM/Nov89/> is about threading from
20 years ago! That was my first "cover".

--John

Dmitriy Vyukov

unread,

Oct 30, 2009, 4:09:58 AM10/30/09

to Scalable Synchronization Algorithms

Unfortunately, I am only a programmer... plus not a native English
speaker...

--
Dmitriy V'jukov

Dmitriy Vyukov

unread,

Oct 30, 2009, 4:10:05 AM10/30/09

to Scalable Synchronization Algorithms

On Oct 21, 3:40 pm, "John M. Dlugosz" <11lrha...@gmail.com> wrote:
> On Oct 20, 11:57 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
>
>
>
> > > > // thread 1
> > > > X = 1
> > > > compiler-fence
> > > > R1 = X
> > Nope. W/o the compiler-fence the algorithm is dead-broken.
> > If there will be NO compiler-fence, then compiler may reorder code in
> > asymmetric rw mutex as:
>
> > void reader_lock()
> > {
> > if (writer_pending)
> > {
> > reader_inside[current_thread_index] = false;
> > EnterCriticalSection(&cs);
> > reader_inside[current_thread_index] = true;
> > LeaveCriticalSection(&cs);
> > }
> > reader_inside[current_thread_index] = true;
> > }
>
> Uh, how does
> X=1; R1=X;
> get re-arranged to become all that stuff?

"reader_inside[current_thread_index] = true" is "X=1"
"if (writer_pending)" is "R1=X"

> Sure, no doubt some compiler fencing is needed in your actual code,
> but the example in the earlier post doesn't have any of that. You
> were exposing purely on one instance of a write-then-read problem.

Write-then-read is a core of the Dekker/Peterson mutual exclusion
algorithms. You may see that David Dice el el. in the paper explain
asymmetric synchronization on the same pattern:

[Dekker with MEMBARs]
Thread1 would execute ST T1; MEMBAR; LD T2; test value ...
Thread2 would execute ST T2; MEMBAR; LD T1; test value ...

--
Dmitriy V'jukov

John M. Dlugosz

unread,

Oct 30, 2009, 4:58:53 PM10/30/09

to Scalable Synchronization Algorithms

When you are writing your "paper" to post on the web, you might ask me
to proof it. Always, feel free to ask me about English grammar and
writing questions.

--John

Reply all

Reply to author

Forward