[PATCH 1/1] aio_general_context: flush: provide 1 second grace for retries

Benny Halevy

<bhalevy@scylladb.com>

unread,

Nov 25, 2021, 8:58:01 AM11/25/21

to seastar-dev@googlegroups.com, Benny Halevy

The existing assert that was introduced in
528762adbaf2c6dde857957b920606378959e8fb
is too harsh. It will fail on the first retry
failure after `needs_preeempt` returned true.

This change uses the lowres_clock to allow flush to make progress
within 1 second of the first `io_submit` failure.

If the failure happens only once, flush will complete
submitting all iocbs. If another failure happens,
we check if at least one iocb has been submitted
and if so, return a "short flush", and copy the remaining
iocbs to be submitted next time flush is called.

If no progress has been made within 1 second
of the first failure an assert will fail, to prevent
an infinite loop.

Fixes #975

Test: unit(dev)

Signed-off-by: Benny Halevy <bha...@scylladb.com>
---
src/core/reactor_backend.cc | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/src/core/reactor_backend.cc b/src/core/reactor_backend.cc
index 25b13090..007078bf 100644
--- a/src/core/reactor_backend.cc
+++ b/src/core/reactor_backend.cc
@@ -26,6 +26,9 @@
#include <seastar/core/internal/buffer_allocator.hh>
#include <seastar/util/defer.hh>
#include <seastar/util/read_first_line.hh>
+#include <seastar/core/lowres_clock.hh>
+
+#include <optional>
#include <chrono>
#include <sys/poll.h>
#include <sys/syscall.h>
@@ -272,7 +275,8 @@ void aio_general_context::queue(linux_abi::iocb* iocb) {

size_t aio_general_context::flush() {
auto begin = iocbs.get();
- auto retried = last;
+ constexpr lowres_clock::time_point no_time_point = lowres_clock::time_point(lowres_clock::duration(0));
+ auto retry_until = no_time_point;
while (begin != last) {
auto r = io_submit(io_context, last - begin, begin);
if (__builtin_expect(r > 0, true)) {
@@ -280,11 +284,19 @@ size_t aio_general_context::flush() {
continue;
}
// errno == EAGAIN is expected here. We don't explicitly assert that
- // since the assert below requires that some progress will be
+ // since the logic below requires that some progress will be
// made, preventing an endless loop for any reason.
- if (need_preempt()) {
- assert(retried != begin);
- retried = begin;
+ if (retry_until == no_time_point) {
+ // allow retrying for 1 second
+ retry_until = lowres_clock::now() + 1s;
+ } else if (begin != iocbs.get()) {
+ // partial flush, return the number of successful submissions
+ std::copy(begin, last, iocbs.get());
+ auto nr = begin - iocbs.get();
+ last -= nr;
+ return nr;
+ } else {
+ assert(lowres_clock::now() < retry_until);
}
}
auto nr = last - iocbs.get();
--
2.31.1

Avi Kivity

<avi@scylladb.com>

unread,

Nov 25, 2021, 9:17:35 AM11/25/21

to Benny Halevy, seastar-dev@googlegroups.com

On 25/11/2021 15.57, Benny Halevy wrote:
> The existing assert that was introduced in
> 528762adbaf2c6dde857957b920606378959e8fb
> is too harsh. It will fail on the first retry
> failure after `needs_preeempt` returned true.
>
> This change uses the lowres_clock to allow flush to make progress
> within 1 second of the first `io_submit` failure.
>
> If the failure happens only once, flush will complete
> submitting all iocbs. If another failure happens,
> we check if at least one iocb has been submitted
> and if so, return a "short flush", and copy the remaining
> iocbs to be submitted next time flush is called.

I don't think it's legal to defer a flush. Suppose this
aio_general_context is preempt_io_context. If we don't flush, we won't
arm the preemption monitor and highres timer. If a loopy continuation
then runs, we won't get kicked out by preemption.

>
> If no progress has been made within 1 second
> of the first failure an assert will fail, to prevent
> an infinite loop.

Did you check the patch fixes the problem (e.g. is it reproducible)?

Do you know if it was the preemption context of the pollable fd context
that crashed?

Benny Halevy

<bhalevy@scylladb.com>

unread,

Nov 26, 2021, 3:47:07 AM11/26/21

to Avi Kivity, seastar-dev@googlegroups.com

On Thu, 2021-11-25 at 16:17 +0200, Avi Kivity wrote:
>
> On 25/11/2021 15.57, Benny Halevy wrote:
> > The existing assert that was introduced in
> > 528762adbaf2c6dde857957b920606378959e8fb
> > is too harsh. It will fail on the first retry
> > failure after `needs_preeempt` returned true.
> >
> > This change uses the lowres_clock to allow flush to make progress
> > within 1 second of the first `io_submit` failure.
> >
> > If the failure happens only once, flush will complete
> > submitting all iocbs. If another failure happens,
> > we check if at least one iocb has been submitted
> > and if so, return a "short flush", and copy the remaining
> > iocbs to be submitted next time flush is called.
>
>
> I don't think it's legal to defer a flush. Suppose this
> aio_general_context is preempt_io_context. If we don't flush, we won't
> arm the preemption monitor and highres timer. If a loopy continuation
> then runs, we won't get kicked out by preemption.

Hmm, even the comment for aio_general_context::flush agrees with you

// submit all queued iocbs and return their count.
size_t flush();

I'll drop the partial-flush support then.
It will remain all or nothing (== abort).

>
>
> >
> > If no progress has been made within 1 second
> > of the first failure an assert will fail, to prevent
> > an infinite loop.
>
>
> Did you check the patch fixes the problem (e.g. is it reproducible)?

I'll try the add_50_nodes dtest before and after and see.
It seems to be quite reproducible on Jenkins,
hopefully it will reproduce locally for me.

>
>
> Do you know if it was the preemption context of the pollable fd context
> that crashed?

Here's the backtrace:

seastar::aio_general_context::flush() at ./build/release/seastar/./seastar/src/core/reactor_backend.cc:286
(inlined by) seastar::reactor_backend_aio::wait_and_process_events(__sigset_t const*) at ./build/release/seastar/./seastar/src/core/reactor_backend.cc:489
seastar::reactor::sleep() at ./build/release/seastar/./seastar/src/core/reactor.cc:3041
(inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3007
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2839
seastar::app_template::run_deprecated(int, char**, std::function&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:228
seastar::app_template::run(int, char**, std::function ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:128
main at ./main.cc:492

this is from a call to _polling_io.flush() in wait_and_process_events

so it's unrelated to pollable fd IIUC,
but rather to internal reactor backend usage for
either hrtimer_poll_completion or smp_wakeup_aio_completion

Benny Halevy

<bhalevy@scylladb.com>

unread,

Nov 26, 2021, 6:16:05 AM11/26/21

to Avi Kivity, seastar-dev@googlegroups.com

It didn't reproduce for me locally.
Trying again with the fix, just in case.

Benny Halevy

<bhalevy@scylladb.com>

unread,

Nov 26, 2021, 7:14:57 AM11/26/21

to seastar-dev@googlegroups.com, Benny Halevy

The existing assert that was introduced in
528762adbaf2c6dde857957b920606378959e8fb
is too harsh. It will fail on the first retry
failure after `needs_preeempt` returned true.

This lead to the following crash:
```
scylla: ./seastar/src/core/reactor_backend.cc:286: size_t seastar::aio_general_context::flush(): Assertion `retried != begin' failed.
```

Decoded backtrace:

```
seastar::aio_general_context::flush() at ./build/release/seastar/./seastar/src/core/reactor_backend.cc:286
(inlined by) seastar::reactor_backend_aio::wait_and_process_events(__sigset_t const*) at ./build/release/seastar/./seastar/src/core/reactor_backend.cc:489
seastar::reactor::sleep() at ./build/release/seastar/./seastar/src/core/reactor.cc:3041
(inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3007
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2839
seastar::app_template::run_deprecated(int, char**, std::function&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:228
seastar::app_template::run(int, char**, std::function ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:128
main at ./main.cc:492

```

This change uses the lowres_clock to allow flush to complete

within 1 second of the first `io_submit` failure.

Fixes #975

Test: unit(release)

Dtest: update_cluster_layout_tests.py:TestLargeScaleCluster.add_50_nodes_test(release)

Signed-off-by: Benny Halevy <bha...@scylladb.com>
---

src/core/reactor_backend.cc | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)

In v2:
- dropped the early return with "short flush"
when some progress has been made following the first failure.
- now flush always completes, or aborts if no progress has
been made in 1 second.

diff --git a/src/core/reactor_backend.cc b/src/core/reactor_backend.cc
index 25b13090..fe76ae80 100644
--- a/src/core/reactor_backend.cc
+++ b/src/core/reactor_backend.cc
@@ -26,6 +26,8 @@

#include <seastar/core/internal/buffer_allocator.hh>
#include <seastar/util/defer.hh>
#include <seastar/util/read_first_line.hh>
+#include <seastar/core/lowres_clock.hh>
+

#include <chrono>
#include <sys/poll.h>
#include <sys/syscall.h>

@@ -272,7 +274,8 @@ void aio_general_context::queue(linux_abi::iocb* iocb) {

size_t aio_general_context::flush() {
auto begin = iocbs.get();
- auto retried = last;
+ constexpr lowres_clock::time_point no_time_point = lowres_clock::time_point(lowres_clock::duration(0));
+ auto retry_until = no_time_point;
while (begin != last) {
auto r = io_submit(io_context, last - begin, begin);
if (__builtin_expect(r > 0, true)) {

@@ -280,11 +283,12 @@ size_t aio_general_context::flush() {

continue;
}
// errno == EAGAIN is expected here. We don't explicitly assert that
- // since the assert below requires that some progress will be

- // made, preventing an endless loop for any reason.

- if (need_preempt()) {
- assert(retried != begin);
- retried = begin;

+ // since the assert below prevents an endless loop for any reason.

+ if (retry_until == no_time_point) {
+ // allow retrying for 1 second
+ retry_until = lowres_clock::now() + 1s;

+ } else {

+ assert(lowres_clock::now() < retry_until);
}
}
auto nr = last - iocbs.get();

--
2.31.1

Benny Halevy

<bhalevy@scylladb.com>

unread,

Oct 4, 2022, 3:38:07 AM10/4/22

to seastar-dev@googlegroups.com

ping

Avi Kivity

<avi@scylladb.com>

unread,

Oct 4, 2022, 5:29:52 AM10/4/22

to Benny Halevy, seastar-dev@googlegroups.com

On Fri, 2021-11-26 at 14:14 +0200, Benny Halevy wrote:

lowres_clock won't advance if the reactor isn't running, so this just
spins forever.

Benny Halevy

<bhalevy@scylladb.com>

unread,

Oct 4, 2022, 7:02:45 AM10/4/22

to Avi Kivity, seastar-dev@googlegroups.com

Good point.

Okay to use std::chrono::steady_clock here?

Avi Kivity

<avi@scylladb.com>

unread,

Oct 4, 2022, 7:32:49 AM10/4/22

to Benny Halevy, seastar-dev@googlegroups.com

So long as it's only read after we fail, yes.

Commit Bot

<bot@cloudius-systems.com>

unread,

Oct 6, 2022, 11:51:58 AM10/6/22

to seastar-dev@googlegroups.com, Benny Halevy

From: Benny Halevy <bha...@scylladb.com>
Committer: Avi Kivity <a...@scylladb.com>
Branch: master

aio_general_context: flush: provide 1 second grace for retries

The existing assert that was introduced in
528762adbaf2c6dde857957b920606378959e8fb
is too harsh. It will fail on the first retry
failure after `needs_preeempt` returned true.

This lead to the following crash:
```
scylla: ./seastar/src/core/reactor_backend.cc:286: size_t seastar::aio_general_context::flush(): Assertion `retried != begin' failed.
```

Decoded backtrace:
```
seastar::aio_general_context::flush() at ./build/release/seastar/./seastar/src/core/reactor_backend.cc:286
(inlined by) seastar::reactor_backend_aio::wait_and_process_events(__sigset_t const*) at ./build/release/seastar/./seastar/src/core/reactor_backend.cc:489
seastar::reactor::sleep() at ./build/release/seastar/./seastar/src/core/reactor.cc:3041
(inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3007
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2839
seastar::app_template::run_deprecated(int, char**, std::function&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:228
seastar::app_template::run(int, char**, std::function ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:128
main at ./main.cc:492
```

This change uses the std::chrono::steady_clock to allow flush to complete

within 1 second of the first `io_submit` failure.

Note that the seastar lowres_clock is unsuitable for this purpose
since it won't advance if the reactor has no chance to run.

Fixes #975

Test: unit(release)

Signed-off-by: Benny Halevy <bha...@scylladb.com>

---

diff --git a/src/core/reactor_backend.cc b/src/core/reactor_backend.cc
--- a/src/core/reactor_backend.cc
+++ b/src/core/reactor_backend.cc
@@ -27,6 +27,7 @@

#include <seastar/core/internal/buffer_allocator.hh>
#include <seastar/util/defer.hh>
#include <seastar/util/read_first_line.hh>
+

#include <chrono>
#include <filesystem>
#include <sys/poll.h>
@@ -312,19 +313,22 @@ void aio_general_context::queue(linux_abi::iocb* iocb) {

size_t aio_general_context::flush() {
auto begin = iocbs.get();
- auto retried = last;

+ using clock = std::chrono::steady_clock;
+ constexpr clock::time_point no_time_point = clock::time_point(clock::duration(0));

+ auto retry_until = no_time_point;
while (begin != last) {
auto r = io_submit(io_context, last - begin, begin);
if (__builtin_expect(r > 0, true)) {

begin += r;

continue;
}
// errno == EAGAIN is expected here. We don't explicitly assert that
- // since the assert below requires that some progress will be
- // made, preventing an endless loop for any reason.
- if (need_preempt()) {
- assert(retried != begin);
- retried = begin;
+ // since the assert below prevents an endless loop for any reason.
+ if (retry_until == no_time_point) {
+ // allow retrying for 1 second

+ retry_until = clock::now() + 1s;
+ } else {
+ assert(clock::now() < retry_until);

Reply all

Reply to author

Forward