Stall detector doesn't report

Vanegicloh J

<vanegicloh@gmail.com>

Oct 17, 2025, 9:33:40 AM

to seastar-dev

Hello. I would like to figure out how the stall detector works. It's a bit hard for me to find and understand all related code in reactor.cc. Could you please help me?

As a playground, I wrote a UT. It looks like this:

...
uint reports{}; // will be incremented on stall
seastar::engine().set_stall_detector_report_function([&] { ++reports; });

auto stallDetectorTimeout = seastar::engine().get_blocked_reactor_notify_ms();

auto spin = [stallDetectorTimeout] {
    static constexpr auto MULT = 5; // 5 is minimum value, less is not working at all

    // Should be enough time to starve the reactor
    auto end = stall_clock_t::now() + stallDetectorTimeout * MULT;

    for (uint i{};; ++i) // busy loop without preemption
    {
        do_not_optimize(i); // otherwise compiler will throw away the loop
        if (stall_clock_t::now() > end) { break; }
    }
};

spin();
co_await seastar::sleep(10ms); // give some time to report?
REQUIRE(reports > 0);
...

So, I created a busy loop without preemption and expect the stall detector to report it. I ran the UT 100 times in Release mode.
The batch always fails at some point with REQUIRE(0 > 0), which means the stall detector didn't report. Sometimes it fails immediately on the first run. Sometimes the first ~50 runs pass and the next one fails. So the UT is flaky.

What is the magic here? I expect that a simple busy loop without preemption must always be reported by the stall detector. It should be a simple and stable test. What am I missing? What should I do to correct the test and get a stable UT? Thank you.

Avi Kivity

<avi@scylladb.com>

Oct 21, 2025, 10:28:50 AM

to Vanegicloh J, seastar-dev

What is UT?

What is stall_clock_t?

Vanegicloh J

<vanegicloh@gmail.com>

Oct 21, 2025, 10:57:34 AM

to seastar-dev

UT means unit test. stall_clock_t is an alias: using stall_clock_t = seastar::internal::cpu_stall_detector::clock_type;

Tuesday, October 21, 2025 at 17:28:50 UTC+3, a...@scylladb.com:

Avi Kivity

<avi@scylladb.com>

Oct 21, 2025, 11:45:48 AM

to Vanegicloh J, seastar-dev

Next time, write "Unit Test", so I may understand you.

Use a normal high-resolution clock. I think it should have worked, but maybe this clock has low resolution on your machine.

You can try strace -ff to see if the stall detector sent a signal as expected.
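A minimal sketch of the clock suggestion, using only the standard library: spin against `std::chrono::steady_clock`, and measure the smallest step a given clock actually reports. `spin_for` and `smallest_tick` are hypothetical helper names introduced for illustration, not Seastar APIs, and nothing here touches the reactor.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not a Seastar API): busy-wait for `d` without yielding,
// timed against std::chrono::steady_clock rather than the detector's clock type.
// Returns the number of loop iterations executed.
inline unsigned long spin_for(std::chrono::steady_clock::duration d) {
    const auto end = std::chrono::steady_clock::now() + d;
    volatile unsigned long iterations = 0; // volatile keeps the loop from being optimized away
    while (std::chrono::steady_clock::now() < end) {
        iterations = iterations + 1;
    }
    return iterations;
}

// Hypothetical helper: smallest positive step observed between two consecutive
// readings of Clock, over `samples` attempts. A value that is large relative to
// the stall threshold would suggest the clock is too coarse for timing the spin.
template <typename Clock>
typename Clock::duration smallest_tick(int samples = 1000) {
    auto best = Clock::duration::max();
    for (int i = 0; i < samples; ++i) {
        const auto t0 = Clock::now();
        auto t1 = Clock::now();
        while (t1 <= t0) { t1 = Clock::now(); } // wait until the clock advances
        if (t1 - t0 < best) { best = t1 - t0; }
    }
    return best;
}
```

In the test above, `spin_for(stallDetectorTimeout * MULT)` could stand in for the `spin` lambda, and `smallest_tick<stall_clock_t>()` could be compared with `smallest_tick<std::chrono::steady_clock>()`, assuming `stall_clock_t` exposes the usual `now()` and `duration` members.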

Vanegicloh J

<vanegicloh@gmail.com>

Oct 21, 2025, 12:21:35 PM

to seastar-dev

I will investigate it, thank you. One more question, if I may.

I found the "update_blocked_reactor_notify_ms" function in the public Seastar API. It changes the stall detector's timeout. Let's say I set it to 50ms: "seastar::engine().update_blocked_reactor_notify_ms(50ms)". How can I make sure that the stall detector has been updated? When will the change be applied?

The "update_blocked_reactor_notify_ms" function updates the stall detector's config, but the detector's timer is rearmed with the new value only on the next "run_some_tasks" call, at this line: "r._cpu_stall_detector->start_task_run(now());" https://github.com/scylladb/seastar/blob/b47cd506d8699e50de366da890c173f4f6e3d989/src/core/reactor.cc#L3120. How do I properly wait for this to happen?

I tried putting seastar::yield, seastar::sleep(10ms), or seastar::check_for_io_immediately after update_blocked_reactor_notify_ms. Unfortunately, none of these guaranteed that the stall detector had changed, and the unit test remained unstable. It seems I misunderstand how the change is supposed to work.
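The ordering described in this message can be pictured with a toy model. This is a sketch of the poster's reading only, which the thread does not confirm: `toy_stall_detector` and all of its members are hypothetical names, not Seastar code. It shows that a config update alone does not change the armed value until the next task-run start.

```cpp
#include <cassert>
#include <chrono>

// Toy model only: none of these names are Seastar APIs. It mimics the
// described behavior "the config changes immediately, but the timer picks
// the new value up at the start of the next task run".
class toy_stall_detector {
    std::chrono::milliseconds _configured; // value stored by update_threshold()
    std::chrono::milliseconds _armed;      // value the timer was last armed with
    unsigned _task_runs = 0;               // how many task runs have started

public:
    explicit toy_stall_detector(std::chrono::milliseconds initial)
        : _configured(initial), _armed(initial) {}

    // Stores the new threshold; deliberately does not rearm the timer.
    void update_threshold(std::chrono::milliseconds t) { _configured = t; }

    // Start of a task run: the timer is rearmed with the current config.
    void start_task_run() {
        _armed = _configured;
        ++_task_runs;
    }

    std::chrono::milliseconds armed_threshold() const { return _armed; }
    unsigned task_runs() const { return _task_runs; }
};
```

Under this model, a test that wants the new threshold in force has to observe that a task run started after the update, for example by comparing `task_runs()` before and after. Whether the real reactor exposes anything comparable is not established in this thread.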

Tuesday, October 21, 2025 at 18:45:48 UTC+3, a...@scylladb.com:
