[PATCH] aarch64: make RCU work correctly in SMP mode

Waldemar Kozaczuk

unread,

May 9, 2021, 4:13:13 PM5/9/21

to osv...@googlegroups.com, Waldemar Kozaczuk

As the issue #1134 explains, some unit tests hang in the do_main_thread
while shutting down when OSv runs with 2 or more CPUs on aarch64. This
happens even with the patch to address the issue #1123 applied.

After connecting with gdb, the thread dump always looks very similar
where the do_main_thread() thread waits on semaphore in the osv::rcu_flush()
method that calls rcu_defer() to force execution of any outstanding
callbacks and that callback never seems be executed.
The threads of cpu_quiescent_state_thread type (one for each CPU)
seem to be stuck waiting to converge to progress to the next generation.
More specifically it seems that even though a condition involving
the atomic variable `_generation` is true, one of the threads is stuck
in sched::thread::wait_until(). This seems to indicate that most likely
even though this thread was woken, it could not see correct value
of `_generation` atomic variable modified by other thread on other CPUs.

This looks somewhat similar to the issue #1123 in that it is caused by
memory order related deficiencies in the code that come to light when
running on aarch64 which comes with weak memory model.

To address those, this patch modifies the memory order qualifier
for some atomic operations in rcu.cc to make sure that the changes made
on one CPU are visible on another one. More specifically the cpu_quiescent_state_thread
threads interact with each other while converging on next generation by
writing and reading the _request, _requested and _generation atomics.
Currently all operations on those use the memory_order_relaxed qualifier
which only guarantees atomicity but not visibility. To make sure that
the changes to _request, _requested and _generation are visible across
CPUs in sched::thread::wait_until(), we modify the memory modifier
in relevant places to memory_order_acquire (when reading), memory_order_release (when writing)
and memory_order_acq_rel for the CAS operation. These changes create
proper acquire-release relationships between relevant read/write
operations on those atomics and thus enforce necessary memory barrier
instructions generated around those.

Before this patch, the tst-evenfd.so test would hang under 100 runs
with 2 vCPUs with the symptoms described by #1134. After this patch
I could execute the same test over 13K times.

Fixes #1134

Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
---
core/rcu.cc | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/core/rcu.cc b/core/rcu.cc
index d6f587b8..95969109 100644
--- a/core/rcu.cc
+++ b/core/rcu.cc
@@ -73,7 +73,7 @@ cpu_quiescent_state_thread::cpu_quiescent_state_thread(sched::cpu* cpu)
void cpu_quiescent_state_thread::request(uint64_t generation)
{
auto r = _request.load(std::memory_order_relaxed);
- while (generation > r && !_request.compare_exchange_weak(r, generation, std::memory_order_relaxed)) {
+ while (generation > r && !_request.compare_exchange_weak(r, generation, std::memory_order_acq_rel)) {
// nothing to do
}
_t->wake();
@@ -90,7 +90,7 @@ void cpu_quiescent_state_thread::set_generation(uint64_t generation)
// Wake the quiescent threads who might be interested in my _generation.
for (auto cqst : cpu_quiescent_state_threads) {
if (cqst != this &&
- cqst->_requested.load(std::memory_order_relaxed)) {
+ cqst->_requested.load(std::memory_order_acquire)) {
cqst->_t->wake();
}
}
@@ -135,7 +135,7 @@ void cpu_quiescent_state_thread::do_work()
}
if (toclean) {
auto g = next_generation.fetch_add(1, std::memory_order_relaxed) + 1;
- _requested.store(true, std::memory_order_relaxed);
+ _requested.store(true, std::memory_order_release);
// copy cpu_quiescent_state_threads to prevent a hotplugged cpu
// from changing the number of cpus we request a new generation on,
// and the number of cpus we wait on
@@ -152,7 +152,7 @@ void cpu_quiescent_state_thread::do_work()
while (true) {
sched::thread::wait_until([&cqsts, &g, this] {
return ( (_generation.load(std::memory_order_relaxed) <
- _request.load(std::memory_order_relaxed))
+ _request.load(std::memory_order_acquire))
|| all_at_generation(cqsts, g)); });
auto r = _request.load(std::memory_order_relaxed);
if (_generation.load(std::memory_order_relaxed) < r) {
@@ -177,7 +177,7 @@ void cpu_quiescent_state_thread::do_work()
// wants to clean up, or we are woken to clean up our callbacks
sched::thread::wait_until([=] {
return (_generation.load(std::memory_order_relaxed) <
- _request.load(std::memory_order_relaxed)) ||
+ _request.load(std::memory_order_acquire)) ||
percpu_callbacks->ncallbacks[percpu_callbacks->buf]; });
auto r = _request.load(std::memory_order_relaxed);
if (_generation.load(std::memory_order_relaxed) < r) {
--
2.30.2

Nadav Har'El

unread,

May 9, 2021, 5:06:08 PM5/9/21

to Waldemar Kozaczuk, Osv Dev

Very nice detective work!

I'm very rusty in this memory ordering business, so don't have a very useful review, and rather have a question below.

--
Nadav Har'El

n...@scylladb.com

I sadly have to admit I'm too rusty in memory ordering to check this code :-(

I see that this place is the only memory_order_release you did. I would think that this means that whatever we wrote *before* this release, becomes visible for code that does a memory_order_acquire on _requested. But what did we write in this code before the release? For example, we only write _generation a few lines down...

Again, I'm probably missing something because I'm so rusty in this...

// copy cpu_quiescent_state_threads to prevent a hotplugged cpu
// from changing the number of cpus we request a new generation on,
// and the number of cpus we wait on
@@ -152,7 +152,7 @@ void cpu_quiescent_state_thread::do_work()
while (true) {
sched::thread::wait_until([&cqsts, &g, this] {
return ( (_generation.load(std::memory_order_relaxed) <
- _request.load(std::memory_order_relaxed))
+ _request.load(std::memory_order_acquire))
|| all_at_generation(cqsts, g)); });
auto r = _request.load(std::memory_order_relaxed);
if (_generation.load(std::memory_order_relaxed) < r) {
@@ -177,7 +177,7 @@ void cpu_quiescent_state_thread::do_work()
// wants to clean up, or we are woken to clean up our callbacks
sched::thread::wait_until([=] {
return (_generation.load(std::memory_order_relaxed) <
- _request.load(std::memory_order_relaxed)) ||
+ _request.load(std::memory_order_acquire)) ||
percpu_callbacks->ncallbacks[percpu_callbacks->buf]; });
auto r = _request.load(std::memory_order_relaxed);
if (_generation.load(std::memory_order_relaxed) < r) {
--
2.30.2

--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/20210509201256.876696-1-jwkozaczuk%40gmail.com.

Waldek Kozaczuk

unread,

May 9, 2021, 6:49:33 PM5/9/21

to OSv Development

This a place where we acquire and release the _request atomic from in the loop in lines 144-148 when called on cpu A on the behalf of other cpu B so that is properly visible on that cpu B in sched::thread::wait_until().

// nothing to do
}
_t->wake();
@@ -90,7 +90,7 @@ void cpu_quiescent_state_thread::set_generation(uint64_t generation)
// Wake the quiescent threads who might be interested in my _generation.
for (auto cqst : cpu_quiescent_state_threads) {
if (cqst != this &&
- cqst->_requested.load(std::memory_order_relaxed)) {
+ cqst->_requested.load(std::memory_order_acquire)) {
cqst->_t->wake();
}
}
@@ -135,7 +135,7 @@ void cpu_quiescent_state_thread::do_work()
}
if (toclean) {
auto g = next_generation.fetch_add(1, std::memory_order_relaxed) + 1;
- _requested.store(true, std::memory_order_relaxed);
+ _requested.store(true, std::memory_order_release);

I sadly have to admit I'm too rusty in memory ordering to check this code :-(

I see that this place is the only memory_order_release you did. I would think that this means that whatever we wrote *before* this release, becomes visible for code that does a memory_order_acquire on _requested. But what did we write in this code before the release? For example, we only write _generation a few lines down...

Please see my comment above about _request.

Again, I'm probably missing something because I'm so rusty in this...

I think that it should be enough to use relaxed in most places against _generation as this atomic is only modified by this CPU thread. But to be safe then sorry we should probably change _generation.store() in set_generation() to use release and then change relaxed to acquire in _generation.load() in check() as check() is called by all_at_generation() always on other CPUs. So for visibility reasons we should do it. I will make this change and run more tests.

Waldemar Kozaczuk

unread,

May 12, 2021, 1:42:30 PM5/12/21

to osv...@googlegroups.com, Waldemar Kozaczuk

Changes since V1:
- changed `_generation.load()` in the `check()` method to use
memory_order_acquire qualifier and `_generation.store()` in
the set_generation() method to use memory_order_release qualifier
in order to establish another acquire-release relationship. The
`check()` is always called on different CPUs so we also need to make sure
the `_generation` atomic changes are visible across CPUs.

-----

core/rcu.cc | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/core/rcu.cc b/core/rcu.cc
index d6f587b8..2bc7a75f 100644

--- a/core/rcu.cc
+++ b/core/rcu.cc
@@ -73,7 +73,7 @@ cpu_quiescent_state_thread::cpu_quiescent_state_thread(sched::cpu* cpu)
void cpu_quiescent_state_thread::request(uint64_t generation)
{
auto r = _request.load(std::memory_order_relaxed);
- while (generation > r && !_request.compare_exchange_weak(r, generation, std::memory_order_relaxed)) {
+ while (generation > r && !_request.compare_exchange_weak(r, generation, std::memory_order_acq_rel)) {

// nothing to do
}
_t->wake();

@@ -81,16 +81,16 @@ void cpu_quiescent_state_thread::request(uint64_t generation)

bool cpu_quiescent_state_thread::check(uint64_t generation)
{
- return _generation.load(std::memory_order_relaxed) >= generation;
+ return _generation.load(std::memory_order_acquire) >= generation;
}

void cpu_quiescent_state_thread::set_generation(uint64_t generation)
{
- _generation.store(generation, std::memory_order_relaxed);
+ _generation.store(generation, std::memory_order_release);

// Wake the quiescent threads who might be interested in my _generation.
for (auto cqst : cpu_quiescent_state_threads) {
if (cqst != this &&
- cqst->_requested.load(std::memory_order_relaxed)) {
+ cqst->_requested.load(std::memory_order_acquire)) {
cqst->_t->wake();
}
}
@@ -135,7 +135,7 @@ void cpu_quiescent_state_thread::do_work()
}
if (toclean) {
auto g = next_generation.fetch_add(1, std::memory_order_relaxed) + 1;
- _requested.store(true, std::memory_order_relaxed);
+ _requested.store(true, std::memory_order_release);

Commit Bot

unread,

May 13, 2021, 3:22:45 PM5/13/21

to osv...@googlegroups.com, Waldemar Kozaczuk

From: Waldemar Kozaczuk <jwkoz...@gmail.com>
Committer: Waldemar Kozaczuk <jwkoz...@gmail.com>
Branch: master

aarch64: make RCU work correctly in SMP mode

diff --git a/core/rcu.cc b/core/rcu.cc
--- a/core/rcu.cc
+++ b/core/rcu.cc
@@ -73,24 +73,24 @@ cpu_quiescent_state_thread::cpu_quiescent_state_thread(sched::cpu* cpu)

void cpu_quiescent_state_thread::request(uint64_t generation)
{
auto r = _request.load(std::memory_order_relaxed);
- while (generation > r && !_request.compare_exchange_weak(r, generation, std::memory_order_relaxed)) {
+ while (generation > r && !_request.compare_exchange_weak(r, generation, std::memory_order_acq_rel)) {
// nothing to do
}
_t->wake();
}

Reply all

Reply to author

Forward