While researching the SMP issue described here - https://github.com/cloudius-systems/osv/issues/1123 - I have noticed that the 't' variable in https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33 might be related to why OSv hangs with two or more CPUs. As I describe here - https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270 - a write to the thread pointer variable by thread T1 on cpu0 may not be visible to thread T2 on cpu1 in a weak-memory-model scenario.
According to https://github.com/Broadcom/arm64-linux/blob/master/Documentation/memory-barriers.txt - especially the "WHERE ARE MEMORY BARRIERS NEEDED?" and "INTERPROCESSOR INTERACTION" chapters - every time some piece of memory needs to be accessed (read/written) by many CPUs, for example in the scheduler or some other kind of coordination, one must use proper memory barriers to enforce that those memory changes are visible between the CPUs. One way to achieve this is to use atomics with the default memory ordering (see "One-way barriers" in https://developer.arm.com/documentation/100941/0100/Barriers); the LDAR/STLR instructions are used to implement the strong versions of the atomics. My experiments suggest that using an atomic helps with the hang problem. But I am afraid that the 't' variable in the waiter object may not be the ONLY place where we need to worry about memory-change visibility.
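For illustration, here is a minimal, self-contained sketch of that idea (this is NOT the actual OSv class - the names and layout below are simplified stand-ins for what is in wait_record.hh): the pointer is made a std::atomic and accessed with the default, sequentially consistent ordering, which on ARMv8 compiles to LDAR/STLR.

#include <atomic>
#include <thread>

struct fake_thread { int id; };   // stand-in for sched::thread

struct waiter_sketch {
    std::atomic<fake_thread*> t{nullptr};

    void prepare(fake_thread* owner) { t.store(owner); }   // default seq_cst -> STLR on ARMv8
    void wake()                      { t.store(nullptr); } // the store becomes visible to all CPUs
    bool woken() const               { return t.load() == nullptr; } // default seq_cst -> LDAR
};

int main() {
    fake_thread me{1};
    waiter_sketch w;
    w.prepare(&me);
    std::thread waker([&] { w.wake(); });
    while (!w.woken()) { }   // the waiting side eventually observes the wake
    waker.join();
}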
I wonder what you think about my findings and the reasoning?

Regards,
Waldek
On Fri, Mar 26, 2021 at 6:10 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
> While researching the SMP issue described here - https://github.com/cloudius-systems/osv/issues/1123 - I have noticed that the 't' variable in https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33 might be related to why OSv hangs with two or more CPUs. As I describe here - https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270 - a write to the thread pointer variable by thread T1 on cpu0 may not be visible to thread T2 on cpu1 in a weak-memory-model scenario.

This is an interesting question. Indeed, waiter::t is not marked atomic, and it's been a very long time since I wrote this code. Note that "t" is not public, and the only methods that access t are wake() and wait().

I *think* (if I remember correctly) my thinking was that each of these wake() and wait() operations already has its *own* memory barriers: namely, wake_impl() and prepare_wait() both access the state of the thread with CST ordering (https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering). I think (again, I hope I remember correctly) we were hoping that these barriers also apply to the t access before or after them.

The thinking was that it is similar to how one uses a shared variable protected by a mutex: you don't need the shared variable to be std::atomic or care about memory order or barriers - the mutex already takes care of that for you.
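To illustrate that reasoning, here is a small self-contained example (not OSv code - the variable names are made up) of the pattern being described: a plain, non-atomic variable published and consumed through sequentially consistent operations on a *different* atomic, the way a mutex orders accesses to the data it protects.

#include <atomic>
#include <cassert>
#include <thread>

int plain_t = 0;               // stands in for the non-atomic waiter::t
std::atomic<int> state{0};     // stands in for the thread status word

int main() {
    std::thread waker([] {
        plain_t = 42;          // plain write to the shared variable...
        state.store(1);        // ...published by the seq_cst store to 'state'
    });
    std::thread waiter([] {
        while (state.load() != 1) { }   // the seq_cst load pairs with the store above,
        assert(plain_t == 42);          // so the plain write is guaranteed to be visible here
    });
    waker.join();
    waiter.join();
}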
But you're right that it's hard to catch bugs in this memory ordering on x86, and it's easier to do it wrong on ARM. The question is what exactly is wrong:
- Is the ordering in wake_impl() and prepare_wait() (called by waiter::wake() and waiter::wait()) not strong enough on ARM to serialize *everything*, not just a specific variable?
I suggest that you change waiter.hh to use std::atomic for t, and then use "relaxed" ordering to read/write t in some of the waiter methods and "cst" ordering in the others - and see which method needs the non-relaxed ordering (see the sketch after this list). When we know which specific method is missing the ordering, we can try to understand why it is missing.
- Do we have this serialization in the right place? E.g., maybe we have the ordering point before writing t but we need it after writing t, and so on.
- Maybe the problem isn't in waiter::wake() or waiter::wait() but in other functions like waiter::woken(), waiter::clear()?
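A hedged sketch of that bisection experiment (the class below is a simplified stand-in, not the real waiter from wait_record.hh): give every access to t an explicit memory order, start with everything relaxed, and strengthen one accessor at a time until the hang disappears.

#include <atomic>

struct thread_stub;   // placeholder for sched::thread

struct waiter_sketch {
    std::atomic<thread_stub*> t;

    explicit waiter_sketch(thread_stub* owner) : t(owner) {}

    // Flip each of these between std::memory_order_relaxed and
    // std::memory_order_seq_cst, one method at a time, and re-run the failing test:
    void clear()               { t.store(nullptr, std::memory_order_relaxed); }
    bool woken() const         { return t.load(std::memory_order_relaxed) == nullptr; }
    thread_stub* owner() const { return t.load(std::memory_order_relaxed); }
};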
Unfortunately, after several years of working on Seastar more than on OSv, my understanding of memory-ordering issues has rusted, so my intuition on these issues isn't as good as it used to be :-(

> According to https://github.com/Broadcom/arm64-linux/blob/master/Documentation/memory-barriers.txt - especially the "WHERE ARE MEMORY BARRIERS NEEDED?" and "INTERPROCESSOR INTERACTION" chapters - every time some piece of memory needs to be accessed (read/written) by many CPUs, for example in the scheduler or some other kind of coordination, one must use proper memory barriers to enforce that those memory changes are visible between the CPUs. One way to achieve this is to use atomics with the default memory ordering (see "One-way barriers" in https://developer.arm.com/documentation/100941/0100/Barriers); the LDAR/STLR instructions are used to implement the strong versions of the atomics. My experiments suggest that using an atomic helps with the hang problem. But I am afraid that the 't' variable in the waiter object may not be the ONLY place where we need to worry about memory-change visibility.

This is definitely not the only place where we needed to care about memory ordering, but we were pretty careful to do it properly - in fact, one of the main reasons why we chose C++11 when we started the OSv project was its nice support for std::atomic and memory ordering. But it is indeed not unlikely that we have bugs in this area, because OSv initially only ran on x86, where fewer memory reorderings can happen than on ARM.
On Thursday, April 1, 2021 at 12:36:19 PM UTC-4 Nadav Har'El wrote:
> The thinking was that it is similar to how one uses a shared variable protected by a mutex: you don't need the shared variable to be std::atomic or care about memory order or barriers - the mutex already takes care of that for you.

I think that is true on Intel but not necessarily on ARM. Specifically, in the SMP case, the changes made to that shared variable on CPU 1 - even within a mutex critical section - are not visible to CPU 2, because the changes made by the former will still be in the L1 cache. Only when we use proper memory barrier instructions (and I have figured out which ones to use and where exactly), OR std::atomic<> (which on ARM translates to the LDAR/STLR instructions that by nature make the changes to memory atomic and visible to all CPUs), do the changes to _t become visible.
Otherwise the wait() method in wait_record.hh will not see _t equal to null when the thread waiting on CPU2 is woken. So the effect is that the interrupt arrives and the thread is woken, but it does not see the _t value equal to 0. But I might be wrong in my reasoning.

> But you're right that it's hard to catch bugs in this memory ordering on x86, and it's easier to do it wrong on ARM. The question is what exactly is wrong:
> - Is the ordering in wake_impl() and prepare_wait() (called by waiter::wake() and waiter::wait()) not strong enough on ARM to serialize *everything*, not just a specific variable?
I think this is specific to _t.

> I suggest that you change waiter.hh to use std::atomic for t, and then use "relaxed" ordering to read/write t in some of the waiter methods and "cst" ordering in the others - and see which method needs the non-relaxed ordering. When we know which specific method is missing the ordering, we can try to understand why it is missing.
> - Do we have this serialization in the right place? E.g., maybe we have the ordering point before writing t but we need it after writing t, and so on.
> - Maybe the problem isn't in waiter::wake() or waiter::wait() but in other functions like waiter::woken(), waiter::clear()?
Which specific place in the code do you have in mind? Nothing in wait_record.hh uses any memory-ordering-related methods.
I think you are right about wait_record and memory barriers (unless there is still some hole in our thinking :-)).

So let us get back to one of the observations I made and the related question I posed here - https://github.com/cloudius-systems/osv/issues/1123#issuecomment-803710337.

The thread::wake_impl() method has deliberate "if" logic to validate that the current (old) status of the thread is one of the initial states allowed by the allowed_initial_states_mask argument. If it is NOT, wake_impl() simply returns without actually waking the thread (setting need_reschedule to true or calling send_wakeup_ipi()). It is interesting that neither wake_impl() nor wake() returns the result of the "waking process" - the return type is void. I wonder what the rationale behind that was. Maybe in most cases it does not matter whether we really woke the thread or not. But maybe in some cases it does, and we should be able to know that the thread was not really woken (see the hypothetical sketch at the end of this message).

Normally the thread that wake_impl() is called on (per the st argument) would most likely be in the waiting state. But clearly it does not always have to be the case - the same thread could have just been woken by another thread on another CPU, for example.

I have added these debug printouts to see what states the thread might be in when wake_impl() is called:

void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
{
    status old_status = status::waiting;
    trace_sched_wake(st->t);
    while (!st->st.compare_exchange_weak(old_status, status::waking)) {
        if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
            if (allowed_initial_states_mask ==
                ((1 << unsigned(status::waiting)) | (1 << unsigned(status::sending_lock)))) {
                if (old_status != status::waiting && old_status != status::sending_lock) {
                    debug_early_u64("! wake_impl: ", (unsigned)old_status);
                }
            }
            return;
        }
    }

Please note I am specifically logging only the cases when wake_impl() is called from thread::wake_with_from_mutex(Action action). With one CPU, I occasionally see that "queued" is logged. With 2 CPUs, besides "queued", I also occasionally see "waking" and "running", which is interesting. Most of the time, even if I see these "waking" and "running" printouts, OSv would NOT hang. But sometimes it would, right after.
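As a purely hypothetical sketch of the question about the void return type (this is not a change that exists in OSv - it just shows what "knowing that the thread was not really woken" could look like, based on the snippet above):

bool thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
{
    status old_status = status::waiting;
    trace_sched_wake(st->t);
    while (!st->st.compare_exchange_weak(old_status, status::waking)) {
        if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
            return false;   // tell the caller the thread was NOT actually woken
        }
    }
    // ... rest of the function unchanged ...
    return true;            // the state was successfully moved to status::waking
}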
+ asm volatile ("dmb ish");
}
assert(_detached_state->st.load() == status::running); - the acquire operation
followed by the store which is release().
And the wake_impl() has the loop to call st.compare_exchange_weak() which I think is a release operation If it succeeds. So maybe we are OK except when we break from that loop and return because the thread state was neither waiting nor sending_lock.
My atomic_thread_fence() calls try to fill those missing gaps, I think.
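To make the intent of those fences concrete, here is a self-contained illustration (not OSv code) of the pairing they rely on: a plain store becomes reliably visible to another thread once a release fence plus an atomic flag store on the writing side is paired with a flag load plus an acquire fence on the reading side.

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain, non-atomic shared variable
std::atomic<bool> flag{false};

int main() {
    std::thread writer([] {
        payload = 42;                                         // plain write
        std::atomic_thread_fence(std::memory_order_release);  // "store-side" barrier
        flag.store(true, std::memory_order_relaxed);
    });
    std::thread reader([] {
        while (!flag.load(std::memory_order_relaxed)) { }     // spin until published
        std::atomic_thread_fence(std::memory_order_acquire);  // "load-side" barrier
        assert(payload == 42);                                // now guaranteed visible
    });
    writer.join();
    reader.join();
}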
Again, this was the thinking, or at least what I remember from many years ago when I was an expert in this code ;-) Maybe there is something subtly wrong with this thinking on ARM, while it was correct for x86....

So you are right about these memory barriers in prepare_wait() and do_wake_with(), but I may have found another very related place where we are missing them.

In thread::do_wait_until() we do use a wait_guard that calls prepare_wait(). However, prepare_wait() uses store(), which perhaps only issues a store memory barrier instruction. And wait_until() checks the shared variable - t, in this case. For t changed by the action in do_wake_with(), are we not missing a load barrier before checking the predicate?

Relatedly, as you can see above, wake_impl() may fail to change the st variable to waking for the reasons described above. In that case, the compare_exchange_weak() call that returns false perhaps does NOT issue a store barrier, does it? So maybe in this case, before we return from that while loop in wake_impl(), we should also issue a memory barrier instruction?

I conducted experiments where I added a generic memory barrier instruction - dmb ish - to both prepare_wait() and wake_impl() in the places described, and I could see my hello world execute 1-2K times before hanging in some other place. So not as good as using an atomic for t, but these extra memory barriers definitely help.

@@ -1143,6 +1164,7 @@ void thread::prepare_wait()
preempt_disable();
assert(_detached_state->st.load() == status::running);
_detached_state->st.store(status::waiting);
+ asm volatile ("dmb ish");
}

One thing I missed is that prepare_wait() calls both load() and then store() on _detached_state->st, which is interesting, as it means that all memory operations - both loads and stores - should be ordered BEFORE and AFTER. This would indicate that the asm volatile ("dmb ish") (a general memory barrier instruction that waits for both store and load instructions to complete) is NOT necessary. But when I remove it, I get back the same hang failures.

Maybe it is because, when I use asm volatile ("dmb ishst") in wake_impl(), which is NOT specific to _detached_state->st, the rules of memory barriers do not apply? Meaning the load() has to match a store() of the SAME atomic variable, no?

For example, this statement from the C++ memory ordering documentation seems to indicate that:

"memory_order_release - A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable (see Release-Acquire ordering below) and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic (see Release-Consume ordering below)."
So maybe, because we use a non-atomic barrier instruction in wake_impl(), we have to use a similar one in prepare_wait() to make sure that changes to the shared variable t are visible between the 2 threads.

@@ -1192,6 +1214,13 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)
trace_sched_wake(st->t);
while (!st->st.compare_exchange_weak(old_status, status::waking)) {
if (!((1 << unsigned(old_status)) & allowed_initial_states_mask)) {
+ asm volatile ("dmb ishst");
return;
}
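As a side note on the memory_order_release paragraph quoted above, here is a self-contained illustration (again, not OSv code) of the "same atomic variable" pairing it describes: the guarantee kicks in when the acquiring load reads the value written by the releasing store on the SAME atomic - which is the kind of pairing the question above is about.

#include <atomic>
#include <cassert>
#include <thread>

int plain_value = 0;
std::atomic<bool> ready{false};

int main() {
    std::thread producer([] {
        plain_value = 7;                                   // plain write
        ready.store(true, std::memory_order_release);      // release store on 'ready'
    });
    std::thread consumer([] {
        while (!ready.load(std::memory_order_acquire)) { } // acquire load on the SAME atomic
        assert(plain_value == 7);                          // release/acquire pairing makes this visible
    });
    producer.join();
    consumer.join();
}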