Hello. I would like to figure out how the stall detector works. It's a bit hard for me to find and understand all the related code in reactor.cc. Could you please help me?
As a playground I wrote a unit test (UT). It looks like this:
...
uint reports{}; // will be incremented on stall
seastar::engine().set_stall_detector_report_function([&] { ++reports; });
auto stallDetectorTimeout = seastar::engine().get_blocked_reactor_notify_ms();
auto spin = [stallDetectorTimeout] {
    static constexpr auto MULT = 5; // 5 is the minimum value that works; anything less never triggers a report
    // Should be enough time to starve the reactor
    auto end = stall_clock_t::now() + stallDetectorTimeout * MULT;
    for (uint i{};; ++i) // busy loop without preemption
    {
        do_not_optimize(i); // otherwise the compiler will throw away the loop
        if (stall_clock_t::now() > end) { break; }
    }
};
spin();
co_await seastar::sleep(10ms); // give some time to report?
REQUIRE(reports > 0);
...
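For reference, the spin lambda above can be reduced to a standalone sketch that does not depend on Seastar. This is only an approximation under stated assumptions: `spin_for` is a hypothetical helper name, `std::chrono::steady_clock` stands in for `stall_clock_t`, and a `volatile` write stands in for `do_not_optimize`. Note that `steady_clock` measures elapsed wall-clock time, so the sketch only guarantees that the thread was inside the loop for that long, not that it consumed that much CPU time.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not part of Seastar): spins on the calling thread for
// at least `duration` without yielding, and returns the number of loop
// iterations that were executed.
inline std::uint64_t spin_for(std::chrono::steady_clock::duration duration) {
    const auto end = std::chrono::steady_clock::now() + duration;
    volatile std::uint64_t sink = 0; // volatile write keeps the loop from being optimized away
    std::uint64_t iterations = 0;
    while (std::chrono::steady_clock::now() < end) {
        sink = iterations;
        ++iterations;
    }
    return iterations;
}
```

Calling `spin_for(std::chrono::milliseconds(20))` blocks the caller for at least 20 ms of wall-clock time, which makes it easy to check the spin itself separately from the stall detector.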
So, I created a busy loop without preemption and expect the stall detector to report it. I ran the UT 100 times in Release mode.
The batch of 100 runs always fails at some point with REQUIRE(0 > 0), which means that the stall detector did not report. Sometimes it fails immediately on the first run. Sometimes the first ~50 runs pass, but the next one fails. So the UT is very flaky and unstable.
What is the magic here? I expected that a simple busy loop without preemption would always be reported by the stall detector, giving a simple and stable test. What am I missing here? What should I do to correct the test and get a stable UT? Thank you.