aarch64: waiter.hh - should t variable be an atomic?


Waldek Kozaczuk

Mar 25, 2021, 11:10:36 PM
to OSv Development
While researching the SMP issue described here - https://github.com/cloudius-systems/osv/issues/1123 - I noticed that the 't' variable in https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33 may be related to why OSv hangs with two or more CPUs.

As I describe here - https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270 - a write to the thread pointer variable by thread T1 on cpu0 may not be visible to thread T2 on cpu1 under a weak memory model.

According to https://github.com/Broadcom/arm64-linux/blob/master/Documentation/memory-barriers.txt, especially the "WHERE ARE MEMORY BARRIERS NEEDED?" and "INTERPROCESSOR INTERACTION" chapters, it seems that whenever a piece of memory is accessed (read/written) by multiple CPUs - for example in the scheduler or other kinds of cross-CPU coordination - one must use proper memory barriers to ensure that those memory changes are visible between the CPUs. One way to achieve this is to use atomics with the default memory ordering (see "One-way barriers" in https://developer.arm.com/documentation/100941/0100/Barriers). On aarch64, the LDAR/STLR instructions are used to implement the stronger orderings of the atomics.
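To make the one-way-barrier idea concrete, here is a minimal standalone C++ sketch (toy names, not OSv code) of the pattern: a plain variable is published through an atomic flag, and the flag's default (seq_cst) store/load - which compile to STLR/LDAR on aarch64 - also order the surrounding plain accesses:

```cpp
#include <atomic>
#include <thread>

// Plain (non-atomic) shared data, playing the role of waiter::t.
int payload = 0;
// Atomic flag: its default seq_cst store/load provide release/acquire
// ordering, so the plain write to 'payload' is published with it.
std::atomic<bool> ready{false};

int demo() {
    payload = 0;
    ready.store(false);
    std::thread writer([] {
        payload = 42;          // plain write...
        ready.store(true);     // ...made visible by the seq_cst store
    });
    while (!ready.load()) {}   // seq_cst load pairs with the store above
    int seen = payload;        // guaranteed to observe 42
    writer.join();
    return seen;
}
```

Without the atomic flag (or an equivalent barrier), the reader could legally observe `ready == true` but a stale `payload` on a weakly ordered machine.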

My experiments seem to show that using an atomic helps with the hang problem. But I am afraid that the 't' variable in the waiter object may not be the ONLY place where we need to worry about memory-change visibility.

I wonder what you think about my findings and reasoning?

Regards,
Waldek

Nadav Har'El

Apr 1, 2021, 12:36:19 PM
to Waldek Kozaczuk, OSv Development
On Fri, Mar 26, 2021 at 6:10 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
As I have been researching a bit the SMP issue described here - https://github.com/cloudius-systems/osv/issues/1123 - I have noticed that the 't' variable in https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33 might be related to why OSv hangs with two or more CPUs.

As I am describing here - https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270 - writing to the thread pointer variable by thread T1 on cpu0, may not be visible to thread T2 on cpu1 in the weak-memory model scenario. 

This is an interesting question. Indeed, waiter::t is not marked atomic, and it's been a very long time since I wrote this code. Note that "t" is not public, and the only methods that access t are wake() and wait().

I *think* (if I remember correctly) my thinking was that each of the wake() and wait() operations already has its *own* memory barriers. Namely, wake_impl() and prepare_wait() both access the state of the thread with CST ordering (https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering). I think (again, I hope I remember correctly) we were hoping that these barriers also apply to the t access before or after them.

The thinking was that it is similar to how one uses a shared variable protected by a mutex: you don't need the shared variable to be std::atomic or to care about memory order or barriers - the mutex already takes care of that for you.

But you're right that it's hard to catch bugs in this kind of memory ordering on x86, and it's easier to get it wrong on ARM.

The question is what exactly is wrong:
  1. Is the ordering in wake_impl() and prepare_wait() (called by waiter::wake() and waiter::wait()) not strong enough on ARM to serialize *everything*, not just a specific variable?
  2. Do we have this serialization in the right place? E.g., maybe we have the ordering point before writing t but we need it after writing t, and so on.
  3. Maybe the problem isn't in waiter::wake() or waiter::wait() but in other functions like waiter::woken(), waiter::clear()?
I suggest that you change waiter.hh to use std::atomic for t, then use "relaxed" ordering to read/write t in some of the waiter methods and "cst" ordering in the others - and see which methods need the non-relaxed ordering. Once we know which specific method is missing the ordering, we can try to understand why it is missing.
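A rough sketch of that bisection approach (with a stand-in thread type, since this is illustrative rather than OSv's actual waiter class):

```cpp
#include <atomic>

struct thread_stub {};  // stand-in for sched::thread

// Toy version of waiter with 't' made std::atomic, so the memory order used
// by each method can be weakened or strengthened independently while bisecting.
class waiter {
    std::atomic<thread_stub*> t;
public:
    explicit waiter(thread_stub* tp) : t(tp) {}
    void wake() {
        // Start from seq_cst everywhere; relax to memory_order_relaxed one
        // method at a time and see which change reintroduces the hang.
        t.store(nullptr, std::memory_order_seq_cst);
    }
    bool woken() const {
        return t.load(std::memory_order_seq_cst) == nullptr;
    }
};
```

The method whose ordering cannot be relaxed without the hang returning is the one where the missing barrier actually matters.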

Unfortunately, after several years of working in Seastar more than OSv, my understanding of memory ordering issues has rusted, so my intuition on these issues isn't as good as it used to be :-(


According to https://github.com/Broadcom/arm64-linux/blob/master/Documentation/memory-barriers.txt, especially "WHERE ARE MEMORY BARRIERS NEEDED?" and "INTERPROCESSOR INTERACTION" chapters, it seems that every time some piece of memory needs to be accessed (read/written) by many CPUs for example in the scheduler or other kind of coordination, one must use proper memory barriers to enforce that those memory changes are visible to between the CPUs. One way to achieve this is using atomic with the default memory ordering (see "One-way barriers" in https://developer.arm.com/documentation/100941/0100/Barriers). The instructions LDAR/STLR are used to implement strong versions of the atomics.

My experiments seem to prove that using atomic seems to help with the hang problem. But I am afraid that 't' variable in the waiter object maybe not the ONLY place where we need to worry about memory change visibility.

This is definitely not the only place where we needed to care about memory ordering - we were pretty careful to do it properly, and in fact one of the main reasons we chose C++11 when we started the OSv project is its nice support for std::atomic and memory ordering. But it is indeed not unlikely that we have bugs in this area, because OSv initially ran only on x86, where fewer memory reorderings can happen than on ARM.
 

I wonder what you think about my findings and the reasoning?

Regards,
Waldek

--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/5b16202b-8a9b-4ff3-8bc1-68bc87baa812n%40googlegroups.com.

Waldek Kozaczuk

Apr 4, 2021, 11:00:10 AM
to OSv Development
On Thursday, April 1, 2021 at 12:36:19 PM UTC-4 Nadav Har'El wrote:
On Fri, Mar 26, 2021 at 6:10 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
As I have been researching a bit the SMP issue described here - https://github.com/cloudius-systems/osv/issues/1123 - I have noticed that the 't' variable in https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33 might be related to why OSv hangs with two or more CPUs.

As I am describing here - https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270 - writing to the thread pointer variable by thread T1 on cpu0, may not be visible to thread T2 on cpu1 in the weak-memory model scenario. 

This is an interesting question. Indeed, waiter::t is not marked atomic, and it's been a very long time since I wrote this code. Note that "t" is not public, and the only methods that access t are wake() and wait().

I *think* (if I remember correctly) my thinking was that each of the wake() and wait() operations already has its *own* memory barriers. Namely, wake_impl() and prepare_wait() both access the state of the thread with CST ordering (https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering). I think (again, I hope I remember correctly) we were hoping that these barriers also apply to the t access before or after them.

The thinking was that it is similar to how one uses a shared variable protected by the mutex: You don't need the shared variable to be std::atomic or care about memory order or barriers - the mutex already takes care of that for you.

I think that is true on Intel but not necessarily on ARM. Specifically, in the SMP case, the changes made to that shared variable on CPU 1, even within a mutex critical section, are not visible to CPU 2 because the changes made by the former will still be in the L1 cache. Only when we use proper memory barrier instructions (which I have figured out which ones to use and where exactly) OR std::atomic<> (which on ARM translates to the LDAR/STLR instructions that by nature make the changes to memory atomic and visible to all CPUs) do the changes to _t become visible.

Otherwise the wait() method in wait_record.hh will not see _t equal to null when the thread waiting on CPU2 is woken. So the effect is that the interrupt arrives and the thread is woken, but it does not see the _t value equal to 0. But I might be wrong in my reasoning.


But you're right that it's hard to catch bugs in this memory ordering on x86 and it's easier to do it wrong on ARM.

The question is what exactly is wrong:
  1. Is the ordering in wake_impl() and prepare_wait() (called by waiter::wake() and waiter::wait()) not strong enough on ARM to serialize *everything*, not just a specific variable?
I think this is specific to _t. 
  2. Do we have this serialization in the right place? E.g., maybe we have the ordering point before writing t but we need it after writing t, and so on.
  3. Maybe the problem isn't in waiter::wake() or waiter::wait() but in other functions like waiter::woken(), waiter::clear()?
I suggest that you change waiter.hh to use std::atomic for t, and then using "relaxed" ordering to read/write t for some of the waiter methods and "cst" ordering for other methods - and see which method needs the non-relaxed ordering. When we know if one specific method is missing the ordering, we can try to understand why it is missing.
Which specific place in the code do you have in mind? Nothing in wait_record.hh uses methods that involve memory ordering.
 

Unfortunately, after several years of working in Seastar more than OSv, my understanding of memory ordering issues has rusted, so my intuition on these issues isn't as good as it used to be :-(


According to https://github.com/Broadcom/arm64-linux/blob/master/Documentation/memory-barriers.txt, especially "WHERE ARE MEMORY BARRIERS NEEDED?" and "INTERPROCESSOR INTERACTION" chapters, it seems that every time some piece of memory needs to be accessed (read/written) by many CPUs for example in the scheduler or other kind of coordination, one must use proper memory barriers to enforce that those memory changes are visible to between the CPUs. One way to achieve this is using atomic with the default memory ordering (see "One-way barriers" in https://developer.arm.com/documentation/100941/0100/Barriers). The instructions LDAR/STLR are used to implement strong versions of the atomics.

My experiments seem to prove that using atomic seems to help with the hang problem. But I am afraid that 't' variable in the waiter object maybe not the ONLY place where we need to worry about memory change visibility.

Definitely this is not the only other place where we needed to care about memory ordering, but we were pretty careful to do it properly, and in fact one of the main reasons why we chose C++11 when we started the OSv project is its nice support for std::atomic and memory ordering. But, it is indeed not unlikely that we have bugs in this area, because OSv only ran on x86 initially, where the memory reorderings that can happen is less than can happen on ARM.
 

I wonder what you think about my findings and the reasoning?

Regards,
Waldek

So if my theory about _t visibility (or the lack of it between CPUs) is correct, we could "templatize" wait_record.hh and make the type of _t a template parameter: sched::thread* on Intel and std::atomic<sched::thread*> on ARM.
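A sketch of what that could look like (the names and the stand-in thread type here are hypothetical, not taken from wait_record.hh):

```cpp
#include <atomic>

struct thread_stub {};  // hypothetical stand-in for sched::thread

// The pointer type becomes a template parameter: a plain pointer where the
// architecture's memory model is strong enough, std::atomic<> where it isn't.
template <typename ThreadPtr>
struct basic_wait_record {
    ThreadPtr t;
    explicit basic_wait_record(thread_stub* tp) : t(tp) {}
    // atomic: seq_cst store; plain pointer: ordinary write
    void wake() { t = nullptr; }
    // atomic: seq_cst load via the implicit conversion; plain: ordinary read
    bool woken() const { return t == nullptr; }
};

using wait_record_x86 = basic_wait_record<thread_stub*>;              // Intel
using wait_record_arm = basic_wait_record<std::atomic<thread_stub*>>; // aarch64
```

Because std::atomic<T*> supports assignment and implicit conversion to T*, the same method bodies compile for both instantiations; only the generated barriers differ.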

Waldek Kozaczuk

Apr 4, 2021, 11:02:42 AM
to OSv Development
On Sunday, April 4, 2021 at 11:00:10 AM UTC-4 Waldek Kozaczuk wrote:
On Thursday, April 1, 2021 at 12:36:19 PM UTC-4 Nadav Har'El wrote:
On Fri, Mar 26, 2021 at 6:10 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
As I have been researching a bit the SMP issue described here - https://github.com/cloudius-systems/osv/issues/1123 - I have noticed that the 't' variable in https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33 might be related to why OSv hangs with two or more CPUs.

As I am describing here - https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270 - writing to the thread pointer variable by thread T1 on cpu0, may not be visible to thread T2 on cpu1 in the weak-memory model scenario. 

This is an interesting question. Indeed, waiter::t is not marked atomic, and it's been a very long time since I wrote this code. Note that "t" is not public, and the only methods that access t are wake() and wait().

I *think* (if I remember correctly) my thinking was that each of the wake() and wait() operations already has its *own* memory barriers. Namely, wake_impl() and prepare_wait() both access the state of the thread with CST ordering (https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering). I think (again, I hope I remember correctly) we were hoping that these barriers also apply to the t access before or after them.

The thinking was that it is similar to how one uses a shared variable protected by the mutex: You don't need the shared variable to be std::atomic or care about memory order or barriers - the mutex already takes care of that for you.

I think that is true on Intel but not necessarily on ARM. Specifically, in the SMP case, the changes made to that shared variable on CPU 1, even within a mutex critical section, are not visible to CPU 2 because the changes made by the former will still be in the L1 cache. Only when we use proper memory barrier instructions (which I have
I meant "I have NOT figured" 

Nadav Har'El

Apr 4, 2021, 12:32:04 PM
to Waldek Kozaczuk, OSv Development
On Sun, Apr 4, 2021 at 6:00 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:


On Thursday, April 1, 2021 at 12:36:19 PM UTC-4 Nadav Har'El wrote:
On Fri, Mar 26, 2021 at 6:10 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
As I have been researching a bit the SMP issue described here - https://github.com/cloudius-systems/osv/issues/1123 - I have noticed that the 't' variable in https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33 might be related to why OSv hangs with two or more CPUs.

As I am describing here - https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270 - writing to the thread pointer variable by thread T1 on cpu0, may not be visible to thread T2 on cpu1 in the weak-memory model scenario. 

This is an interesting question. Indeed, waiter::t is not marked atomic, and it's been a very long time since I wrote this code. Note that "t" is not public, and the only methods that access t are wake() and wait().

I *think* (if I remember correctly) my thinking was that each of the wake() and wait() operations already has its *own* memory barriers. Namely, wake_impl() and prepare_wait() both access the state of the thread with CST ordering (https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering). I think (again, I hope I remember correctly) we were hoping that these barriers also apply to the t access before or after them.

The thinking was that it is similar to how one uses a shared variable protected by the mutex: You don't need the shared variable to be std::atomic or care about memory order or barriers - the mutex already takes care of that for you.

I think that is true on Intel but not necessarily on ARM. Specifically, in the SMP case, the changes made to that shared variable on CPU 1, even within a mutex critical section, are not visible to CPU 2 because the changes made by the former will still be in the L1 cache. Only when we use proper memory barrier instructions (which I have figured out which ones to use and where exactly) OR std::atomic<> (which on ARM translates to the LDAR/STLR instructions that by nature make the changes to memory atomic and visible to all CPUs) do the changes to _t become visible.

I don't think this is the case. Multi-threaded code that uses a traditional Posix mutex does *not* use any std::atomic or barriers. People wrote such code before C++11's std::atomic, and without barriers in assembly language (again, I'm talking about Posix threads code with mutexes - not about lock-free code).
The assumption is that the mutex implementation uses these barriers - "acquire" and "release" - so that, for example, changes that one thread makes to a shared variable inside a critical section protected by a mutex will be visible to another thread on another CPU which later takes the same mutex.
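The point can be illustrated with a short standalone sketch (an ordinary std::mutex, nothing OSv-specific): the shared counter is deliberately not atomic, yet the result is always exact, because lock()/unlock() themselves supply the acquire/release barriers:

```cpp
#include <mutex>
#include <thread>
#include <vector>

long counter = 0;          // deliberately NOT std::atomic
std::mutex counter_mutex;  // the mutex supplies all needed barriers

long run(int threads, int iters) {
    counter = 0;
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; i++) {
        pool.emplace_back([iters] {
            for (int j = 0; j < iters; j++) {
                std::lock_guard<std::mutex> g(counter_mutex);
                ++counter;  // visible to the next locker; no explicit barrier
            }
        });
    }
    for (auto& t : pool) t.join();
    return counter;
}
```

If the increments were done without the lock, the plain `long` would race and undercount; with the lock, no atomics are needed even on a weakly ordered CPU.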
 
The wait_record.hh wait()/wake() do NOT use a mutex (they are used in the mutex implementation ;-)) but they do have barriers as I'll explain below:


Otherwise the wait() method in wait_record.hh will not see _t equal to null when the thread waiting on CPU2 is woken. So the effect is that the interrupt arrives, thread is woken but does not see _t value equal to 0. But I might be wrong in my reasoning.


But you're right that it's hard to catch bugs in this memory ordering on x86 and it's easier to do it wrong on ARM.

The question is what exactly is wrong:
  1. Is the ordering in wake_impl() and prepare_wait() (called by waiter::wake() and waiter::wait()) not strong enough on ARM to serialize *everything*, not just a specific variable?
I think this is specific to _t. 
  2. Do we have this serialization in the right place? E.g., maybe we have the ordering point before writing t but we need it after writing t, and so on.
  3. Maybe the problem isn't in waiter::wake() or waiter::wait() but in other functions like waiter::woken(), waiter::clear()?
I suggest that you change waiter.hh to use std::atomic for t, and then using "relaxed" ordering to read/write t for some of the waiter methods and "cst" ordering for other methods - and see which method needs the non-relaxed ordering. When we know if one specific method is missing the ordering, we can try to understand why it is missing.
Which specific place in code do you have in mind? Nothing in wait_record.hh uses memory ordering involved methods.

wait_record::wait() calls sched::thread::wait_until(). This calls do_wait_until(), which starts by creating a "wait_guard". This calls thread::prepare_wait(), which calls
    _detached_state->st.store(status::waiting);
Because there is no memory-ordering parameter, it uses the default memory_order_seq_cst ordering.
The thinking was that this does all the necessary memory barriers.

On the other side, wait_record::wake() calls thread::wake_with_from_mutex(), which calls do_wake_with(), which does
        auto ds = _detached_state.get();
This get() uses memory_order_seq_cst, and so it is the other side of the store barrier we had in wait().

Again, this was the thinking, or at least what I remember from many years ago when I was an expert in this code ;-)
Maybe there is something subtly wrong with this thinking on ARM, while it was correct for x86....

Waldek Kozaczuk

Apr 7, 2021, 1:36:30 AM
to OSv Development
I think you are right about wait_record and memory barriers (unless there is still some hole in our thinking :-)).

So let us get back to one of the observations I made and the related question I posed here - https://github.com/cloudius-systems/osv/issues/1123#issuecomment-803710337

The thread::wake_impl() method has deliberate "if" logic to validate that the current (old) status of the thread is one of the initial states allowed by the allowed_initial_states_mask argument. If it is NOT, wake_impl() simply returns without actually waking the thread (setting need_reschedule to true or calling send_wakeup_ipi()). It is interesting that neither wake_impl() nor wake() returns the result of the "waking process" - the return type is void. I wonder what the rationale behind that was. Maybe in most cases it does not matter whether we really woke the thread. But maybe in some it does, and we should be able to know that the thread was not really woken.

Normally, the thread that wake_impl() is called on (per the st argument) would most likely be in the waiting state. But clearly, it does not always have to be the case - the same thread could have just been woken by another thread on another CPU, for example.

I have added these debug printouts to see what states the thread might be in when wake_impl() is called:

void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
{
    status old_status = status::waiting;
    trace_sched_wake(st->t);
    while (!st->st.compare_exchange_weak(old_status, status::waking)) {
        if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
            if (allowed_initial_states_mask == ((1 << unsigned(status::waiting)) | (1 << unsigned(status::sending_lock)))) {
                if (old_status != status::waiting && old_status != status::sending_lock) {
                    debug_early_u64("! wake_impl: ", (unsigned)old_status);
                }
            }
            return;
        }
    }
    // ... rest of wake_impl() unchanged

Please note I am specifically logging only the cases when wake_impl() is called from thread::wake_with_from_mutex(Action action). With one CPU, I occasionally see "queued" logged. With 2 CPUs, besides "queued", I also occasionally see "waking" and "running", which is interesting. Most of the time, even when I see these "waking" and "running" printouts, OSv does NOT hang. But sometimes it hangs right after.

Now, if we focus on the lock-free mutex (core/lfmutex.cc) and its two key methods - lock() and unlock() - I think we can imagine the following scenario:

1. Thread T1 on CPU0 calls lock() on some lock-free mutex M, ends up creating a wait_record "waiter" on the stack (it could NOT acquire the lock), adds "waiter" to M's wait queue, and finally calls waiter.wait(), which ends up calling sched::do_wait_until(). Eventually T1 gets scheduled out and becomes "waiting".
2. Later, thread T2 on CPU0 calls thread::wake(), or some other code path not related to the lock-free mutex calls wake_impl(), to wake thread T1 (I am not sure what such a scenario might be).
3a. Thread T1 becomes running again on CPU0, but it checks the condition of the waiter, which obviously is NOT true (!t), and it goes back to "waiting".
3b. Almost at the same time as T1 is running per 3a on CPU0, thread T3 on CPU1 calls unlock() on the same mutex M, pops the waiter from 1) as part of the "Responsibility Hand-Off" protocol, and calls wake() (assuming it will simply wake the thread). But in the end, the "if" in wake_impl() determines that T1 is running, so it never actually wakes it, and T1 stays stuck like this (no?) unless some other thread later simply wakes it again (t was set to null in the waiter). But what if such a thread never comes to wake it?

Am I misunderstanding something or describing a scenario that should never happen? Or maybe it is a symptom of another problem?

If my thinking is correct, then:
1. Shouldn't the lock-free mutex code check whether wake_impl() actually worked all the way, and push the waiter back onto its wait queue if it did not? Or is it even more complicated, and we should iterate to find a waiter it can actually wake?
2. Shouldn't the implementation of thread::do_wake_with(Action action, unsigned allowed_initial_states_mask) change to call action ONLY if wake_impl() actually woke the thread (meaning the while loop succeeded)? We would need another wake_impl() that takes action as an argument.
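A hypothetical sketch of suggestion 2 above - a wake_impl() variant that reports whether it actually moved the thread into "waking", so a caller like do_wake_with() could run the action only on success. The types here are simplified stand-ins, not OSv's real sched.hh code:

```cpp
#include <atomic>

// Simplified copy of OSv's thread status values for illustration.
enum class status { running, waiting, waking, queued, sending_lock };

// Returns true only if the CAS from an allowed initial state succeeded,
// i.e. the thread was really transitioned to "waking" by this caller.
bool try_wake(std::atomic<status>& st, unsigned allowed_initial_states_mask) {
    status old_status = status::waiting;
    while (!st.compare_exchange_weak(old_status, status::waking)) {
        if (!((1u << unsigned(old_status)) & allowed_initial_states_mask)) {
            return false;  // e.g. thread was already running: not woken by us
        }
    }
    return true;
}
```

With a boolean result like this, the lock-free mutex could detect the "wake was a no-op" case and, for example, re-queue the waiter instead of silently losing the wakeup.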
 
If that is indeed a real problem, why does it not happen on Intel?

Waldek Kozaczuk

Apr 7, 2021, 10:56:43 AM
to OSv Development
On Wed, Apr 7, 2021 at 01:36 Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
I think you are right about wait_record and memory barriers (unless there is still some hole in our thinking :-)).

So let us get back to one of the observations I made and the related question I posed here - https://github.com/cloudius-systems/osv/issues/1123#issuecomment-803710337

The thread::wake_impl() method has a deliberate "if" logic to validate that the current (old) status of the thread is one of the initial states per the specified allowed_initial_states_mask argument. If it is NOT, the wake_impl() simply returns without actually waking a thread (setting need_reschedule to true or calling send_wakeup_ipi()). It is interesting that neither wake_impl() nor wake() returns a result of the "waking process" - the return type is void. I wonder what the rationale behind it was. Maybe in most cases, it does not matter whether we really wake it or not. But maybe in some, it does and we should be able to know that thread was not really woken.

Normally the thread, the wake_impl() is called on (per st argument) would be most likely in the waiting state. But clearly, it does not have to be the case always - the same thread could have just been woken by another thread on another CPU for example. 

I have added these debug printouts to see what states thread might be when woke_impl() is called:

void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
{
    status old_status = status::waiting;
    trace_sched_wake(st->t);
    while (!st->st.compare_exchange_weak(old_status, status::waking)) {
        if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
            if (allowed_initial_states_mask == ((1 << unsigned(status::waiting)) | (1 << unsigned(status::sending_lock)))) {
                if (old_status != status::waiting && old_status != status::sending_lock) {
                   debug_early_u64("! wake_impl: ", (unsigned)old_status);
                }    
            }    
            return;
        }    
    }    

Please note I am specifically logging only cases when wake_impl() is called from thread::wake_with_from_mutex(Action action). With one CPU, I occasionally see that "queued" is logged. With 2 CPUs, besides "queued", I also occasionally see "waking" and "running", which is interesting. Most of the time even if I see these "waking" and "running" printouts, OSv would NOT hang. But sometimes it would right after.
I think that "queued" is not an issue, as this thread is in the runqueue and will eventually be running. With a single CPU the "running" case is impossible (only one thread can be running) and "waking" is also OK.

I wonder if step 2 below is ever possible or should be?

Or maybe we have some memory-ordering issue with the wait list, where popping a waiter is not visible on the other CPU and it is still in that list - or maybe we have duplicates of the same thread.

Waldek Kozaczuk

Apr 8, 2021, 11:15:43 AM
to OSv Development, Nadav Har'El
I have applied the same patch and ran the same tests, and I could see the same occasional "waking" and "running" printouts. But on x64 it never hangs. So maybe the scenarios where wake_impl() fails to change the state to waking are legitimate cases that are correctly handled - but I still do not understand how.

So, back to the drawing board. I still think we are missing some memory barrier instructions somewhere. See below.
You are right about those memory barriers in prepare_wait() and do_wake_with(), but I may have found another closely related place where we are missing some.

In thread::do_wait_until() we do use wait_guard, which calls prepare_wait(). However, prepare_wait() uses store(), which perhaps only issues a store memory barrier instruction. And wait_until() checks the shared variable - t in this case. For t changed by the action in do_wake_with(), are we not missing a load barrier before checking the predicate?

Relatedly, as you can see above, wake_impl() may fail to change the st variable to waking for the reasons described above. In that case, a compare_exchange_weak() call that returns false perhaps does NOT issue a store barrier, does it? So maybe in this case, before we return from that while loop in wake_impl(), we should also issue a memory barrier instruction?
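One way to express that idea portably is std::atomic_thread_fence rather than a raw dmb. The sketch below uses toy names, not OSv's actual states: a failed compare_exchange performs only a load, not a store, so an explicit fence on the bail-out path is one way to keep earlier plain writes (e.g. to a wait_record's 't') ordered before returning:

```cpp
#include <atomic>

std::atomic<int> state{0};

// Toy version of the bail-out path: if the CAS fails, no store happened,
// so we emit a full fence (roughly what "dmb ish" provides on aarch64)
// before giving up, so that writes made before this call stay ordered.
bool try_transition(int expected, int desired) {
    int old = expected;
    if (!state.compare_exchange_strong(old, desired)) {
        std::atomic_thread_fence(std::memory_order_seq_cst);
        return false;
    }
    return true;
}
```

Whether OSv actually needs a fence at this exact point is the open question in this thread; the snippet only shows where such a fence would sit.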

I conducted experiments where I added a generic memory barrier instruction - dmb ishst - to both prepare_wait() and wake_impl() in the places described, and I could see my hello world execute 1-2K times before hanging in some other place. So not as good as using an atomic for t, but these extra memory barriers definitely help.

@@ -1143,6 +1164,7 @@ void thread::prepare_wait()
     preempt_disable();
     assert(_detached_state->st.load() == status::running);
     _detached_state->st.store(status::waiting);
+    asm volatile ("dmb ishst");
 }

@@ -1192,6 +1214,13 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)

     trace_sched_wake(st->t);
     while (!st->st.compare_exchange_weak(old_status, status::waking)) {
         if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
+            asm volatile ("dmb ishst");
             return;

Waldek Kozaczuk

Apr 9, 2021, 10:37:55 AM
to OSv Development, Nadav Har'El
+            asm volatile ("dmb ish");
 }

One thing I missed is that prepare_wait() calls both load() and then store() on _detached_state->st, which is interesting: it means that all memory operations - both loads and stores - should be ordered BEFORE and AFTER. This would indicate that "asm volatile ("dmb ish");" (a general memory barrier instruction that waits for both the store and load instructions to complete) is NOT necessary. But when I remove it, I get back the same hang failures.

Maybe it is because the "asm volatile ("dmb ishst");" I use in wake_impl() is NOT specific to _detached_state->st, so the rules of atomic memory ordering do not apply? Meaning the load() has to match a store() of the SAME atomic variable, no?

For example, this statement from C++ memory ordering documentation seems to indicate:

"memory_order_release - A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering below) and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic (see Release-Consume ordering below). "

So maybe because we use a non-atomic barrier instruction in wake_impl(), we have to use a similar one in prepare_wait() to make sure that changes to the shared variable t are visible between the 2 threads.

Waldek Kozaczuk

Apr 10, 2021, 11:44:28 PM
to OSv Development
Regarding your explanation about thread::prepare_wait() and do_wake_with(): the more I read about using acquire/release to make changes to some shared variable synced and visible between threads, the more I wonder - shouldn't this be reversed? If you look at do_wake_with(), it calls action() after "auto ds = _detached_state.get();", which is an acquire operation, right?

But normally the order is this:
1) some thread trying to acquire a lock calls wait_record::wait(), which then effectively loops in thread::do_wait_until() until it sees t as null, right?
2) another thread calls wait_record::wake(), which then ultimately sets t to null and calls wake_impl()

So given that, shouldn't we have the acquire operation in the wait_record::wait() chain, and the release operation in do_wake_with() after we call the action in wake_impl()?

In theory, prepare_wait() has this line:

assert(_detached_state->st.load() == status::running); // the acquire operation

followed by the store, which is the release operation.


And wake_impl() has the loop that calls st.compare_exchange_weak(), which I think is a release operation if it succeeds. So maybe we are OK except when we break from that loop and return because the thread state was neither waiting nor sending_lock.

My atomic_thread_fence() calls try to fill those missing gaps, I think.




Again, this was the thinking, or at least what I remember from many years ago when I was an expert in this code ;-)
Maybe there is something subtly wrong with this thinking on ARM, while it was correct for x86....

So you are right about these memory barriers in prepare_wait() and do_wake_with(), but I may have found another, closely related place where we are missing them.

In thread::do_wait_until() we do use wait_guard, which calls prepare_wait(). However, prepare_wait() uses store(), which perhaps only issues a store memory barrier instruction. And wait_until() checks the shared variable, t in this case. For t changed by the action in do_wake_with(), are we not missing a load barrier before checking the predicate?

Relatedly, as you can see above, wake_impl() may fail to change the st variable to waking for the reasons described above. In that case, a compare_exchange_weak() call that returns false perhaps does NOT issue a store barrier, does it? So maybe in this case, before we return from that while loop in wake_impl(), we should also issue a memory barrier instruction?

I conducted experiments where I added generic memory barrier instructions - dmb ish / dmb ishst - to both prepare_wait() and wake_impl() in the places described, and I could see my hello world execute 1-2K times before hanging in some other place. So not as good as using an atomic for t, but these extra memory barriers definitely help.

@@ -1143,6 +1164,7 @@ void thread::prepare_wait()
     preempt_disable();
     assert(_detached_state->st.load() == status::running);
     _detached_state->st.store(status::waiting);
+    asm volatile ("dmb ish");
 }

One thing I missed is that prepare_wait() calls both load() and then store() on _detached_state->st, which is interesting, as it means that all memory operations - both loads and stores - should be ordered BEFORE and AFTER. This would indicate that "asm volatile ("dmb ish");" (a general memory barrier instruction that waits for both store and load instructions to complete) is NOT necessary. But when I remove it, I get back the same hangs.

Maybe it is because when I use "asm volatile ("dmb ishst");" in wake_impl(), which is NOT specific to _detached_state->st, the acquire/release pairing rules do not apply? Meaning the load() has to match a store() of the SAME atomic variable, no?

For example, this statement from the C++ memory ordering documentation seems to indicate as much:

"memory_order_release - A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering below) and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic (see Release-Consume ordering below). "

The more I dig into atomics and acquire-release semantics, the more I learn that load()/store() (and their derivatives) indeed need to be called on the same atomic variable.

So maybe because we use a non-atomic barrier instruction in wake_impl(), we have to use a similar one in prepare_wait() to make sure that changes to the shared variable t are visible between the two threads.

@@ -1192,6 +1214,13 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
     trace_sched_wake(st->t);
     while (!st->st.compare_exchange_weak(old_status, status::waking)) {
         if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
+            asm volatile ("dmb ishst");
             return;
         }
The C++ equivalent to the asm instructions would be this:

@@ -1143,6 +1164,8 @@ void thread::prepare_wait()
     preempt_disable();
     assert(_detached_state->st.load() == status::running);
     _detached_state->st.store(status::waiting);
+    std::atomic_thread_fence(std::memory_order_acquire);
 }
 
 // This function is responsible for changing a thread's state from
@@ -1192,6 +1215,15 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
     trace_sched_wake(st->t);
     while (!st->st.compare_exchange_weak(old_status, status::waking)) {
         if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
+            std::atomic_thread_fence(std::memory_order_release);
             return;
         }
     }
@@ -1209,6 +1241,7 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)

Waldek Kozaczuk

Apr 19, 2021, 12:50:07 AM
to OSv Development
The more I read about the C++11 memory model and atomics, the more I think that maybe we are missing the correct "synchronizes-with" relationships in the right places needed for changes to _t in wait_record (and maybe other places) to be visible between threads on different CPUs.

More specifically, wake_impl(), called from thread::do_wake_with() AFTER calling the action (_t = nullptr), may break from the loop and therefore fail to store a new value of _detached_state->st. In such a scenario there is potentially no "release" operation that would tie to a corresponding load of _detached_state->st (the "acquire") on the other CPU.

I think the only way to provide such a "synchronizes-with" relationship for _t is to turn it into an atomic and use release/acquire memory order for the corresponding store()/load() calls in wait_record.hh.

The _detached_state is a std::unique_ptr&lt;detached_state&gt;, so calling get() does not provide any load barrier. I think you may have meant another place, where load() is called on the atomic _detached_state->st.