[PATCH seastar v1 0/9] Wean the stall detector off the timer thread

Avi Kivity

<avi@scylladb.com>

unread,

Nov 22, 2018, 12:30:14 PM11/22/18

to seastar-dev@googlegroups.com

The long-awaited aio poll patches remove the timer thread; but the cpu
stall detector still depends on it. This patchset disentangles the stall
detector from the reactor (to some extent), and converts it to use a dedicated
kernel timer.

To avoid the need to re-arm the timer every task quota expiration, some slop
is introduced. Since the measurements don't need to be exact, I expect this
is fine.

Since we now have a dedicated timer, we also switch it to use a thread CPU timer.
This protects against preemption false positives, at the expense of kernel
blocks false negatives. Since the latter are rare, I think the tradeoff is
worthwhile. Kernel blocks also have their own detector.

Finally, a unit test is added.

Tests: unit (release)

https://github.com/avikivity/seastar stall-detector/v1

Avi Kivity (9):
reactor: extract cpu stall detector state into a class
reactor: fold block_notifier_rate_limit into cpu_stall_detector
reactor: move stall-detection code from timer thread to stall detector
tick function
reactor: introduce cpu_stall_detector::account_for_missed_ticks
reactor: change cpu stall detector not to depend on timer thread
reactor: use thread cputime clock for stall detection
reactor: add get_blocked_reactor_notify_ms()
reactor: add hook to capture reports of the cpu stall detector
tests: add unit test for cpu stall detector

configure.py | 3 +
test.py | 1 +
include/seastar/core/reactor.hh | 11 +-
src/core/stall_detector.hh | 83 +++++++++++
src/core/reactor.cc | 238 +++++++++++++++++++-----------
tests/unit/stall_detector_test.cc | 82 ++++++++++
6 files changed, 327 insertions(+), 91 deletions(-)
create mode 100644 src/core/stall_detector.hh
create mode 100644 tests/unit/stall_detector_test.cc

--
2.19.1

Avi Kivity

<avi@scylladb.com>

unread,

Nov 22, 2018, 12:30:16 PM11/22/18

to seastar-dev@googlegroups.com

As the first step of disentangling the cpu stall detector from the reactor,
move the internal state and the signal handler into a class. The reactor
still manages the state, but we'll change that in a follow-up.
---
include/seastar/core/reactor.hh | 7 ++--
src/core/stall_detector.hh | 58 +++++++++++++++++++++++++++
src/core/reactor.cc | 70 ++++++++++++++++++++++-----------
3 files changed, 109 insertions(+), 26 deletions(-)
create mode 100644 src/core/stall_detector.hh

diff --git a/include/seastar/core/reactor.hh b/include/seastar/core/reactor.hh
index 5336c28ca5..5af5cd2387 100644
--- a/include/seastar/core/reactor.hh
+++ b/include/seastar/core/reactor.hh
@@ -637,10 +637,11 @@ class io_queue {
};

namespace internal {

class reactor_stall_sampler;
+class cpu_stall_detector;

}

class io_desc;
class disk_config_params;
@@ -776,14 +777,11 @@ class reactor {
file_desc _task_quota_timer;
promise<> _start_promise;
semaphore _cpu_started;
std::atomic<uint64_t> _tasks_processed = { 0 };
std::atomic<uint64_t> _polls = { 0 };
- std::atomic<unsigned> _tasks_processed_stalled = { 0 };
- unsigned _tasks_processed_report_threshold;
- unsigned _stall_detector_reports_per_minute;
- std::atomic<uint64_t> _stall_detector_missed_ticks = { 0 };
+ std::unique_ptr<internal::cpu_stall_detector> _cpu_stall_detector;

unsigned _max_task_backlog = 1000;
timer_set<timer<>, &timer<>::_link> _timers;
timer_set<timer<>, &timer<>::_link>::timer_list_t _expired_timers;
timer_set<timer<lowres_clock>, &timer<lowres_clock>::_link> _lowres_timers;
@@ -908,10 +906,11 @@ class reactor {
};

signals _signals;
thread_pool _thread_pool;
friend class thread_pool;
+ friend class internal::cpu_stall_detector;

uint64_t pending_task_count() const;
void run_tasks(task_queue& tq);
bool have_more_tasks() const;
bool posix_reuseport_detect();
diff --git a/src/core/stall_detector.hh b/src/core/stall_detector.hh
new file mode 100644
index 0000000000..8f5006714d
--- /dev/null
+++ b/src/core/stall_detector.hh
@@ -0,0 +1,58 @@
+
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2018 ScyllaDB
+ */
+
+#pragma once
+
+#include <signal.h>
+#include <limits>
+#include <chrono>
+
+namespace seastar {
+
+class reactor;
+
+namespace internal {
+
+struct cpu_stall_detector_config {
+ std::chrono::duration<double> threshold = std::chrono::seconds(2);
+ unsigned stall_detector_reports_per_minute = 1;
+};
+
+// Detects stalls in continuations that run for too long
+class cpu_stall_detector {
+ reactor* _r;
+ std::atomic<unsigned> _tasks_processed_stalled = { 0 };
+ unsigned _tasks_processed_report_threshold;
+ unsigned _stall_detector_reports_per_minute;
+ std::atomic<uint64_t> _stall_detector_missed_ticks = { 0 };
+ cpu_stall_detector_config _config;
+ friend reactor;
+public:
+ cpu_stall_detector(reactor* r, cpu_stall_detector_config cfg = {});
+ static int signal_number() { return SIGRTMIN + 1; }
+ void generate_trace();
+ void update_config(cpu_stall_detector_config cfg);
+ cpu_stall_detector_config get_config() const;
+};
+
+}
+}
diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index f35883c7df..6651be531c 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -110,10 +110,11 @@
#include <seastar/util/defer.hh>
#include <seastar/core/alien.hh>
#include <seastar/core/metrics.hh>
#include <seastar/core/execution_stage.hh>
#include <seastar/core/exception_hacks.hh>
+#include "stall_detector.hh"

#include <yaml-cpp/yaml.h>

#ifdef SEASTAR_TASK_HISTOGRAM
#include <typeinfo>
@@ -442,14 +443,10 @@ inline int alarm_signal() {
// We don't want to use SIGALRM, because the boost unit test library
// also plays with it.
return SIGRTMIN;
}

-inline int block_notifier_signal() {
- return SIGRTMIN + 1;
-}
-
// Installs signal handler stack for current thread.
// The stack remains installed as long as the returned object is kept alive.
// When it goes out of scope the previous handler is restored.
static decltype(auto) install_signal_handler_stack() {
size_t size = SIGSTKSZ;
@@ -544,10 +541,11 @@ reactor::reactor(unsigned id)
[&] { timer_thread_func(); }, sched::thread::attr().stack(4096).name("timer_thread").pin(sched::cpu::current()))
, _engine_thread(sched::thread::current())
#endif
, _task_quota_timer(file_desc::timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC))
, _cpu_started(0)
+ , _cpu_stall_detector(std::make_unique<cpu_stall_detector>(this))
, _io_context(0)
, _reuseport(posix_reuseport_detect())
, _task_quota_timer_thread(&reactor::task_quota_timer_thread_fn, this)
, _thread_pool(seastar::format("syscall-{}", id)) {
_task_queues.push_back(std::make_unique<task_queue>(0, "main", 1000));
@@ -572,11 +570,11 @@ reactor::reactor(unsigned id)
sev._sigev_un._tid = syscall(SYS_gettid);
sev.sigev_signo = alarm_signal();
r = timer_create(CLOCK_MONOTONIC, &sev, &_steady_clock_timer);
assert(r >= 0);
sigemptyset(&mask);
- sigaddset(&mask, block_notifier_signal());
+ sigaddset(&mask, cpu_stall_detector::signal_number());
r = ::pthread_sigmask(SIG_UNBLOCK, &mask, NULL);
assert(r == 0);
#endif
memory::set_reclaim_hook([this] (std::function<void ()> reclaim_fn) {
add_high_priority_task(make_task(default_scheduling_group(), [fn = std::move(reclaim_fn)] {
@@ -586,11 +584,11 @@ reactor::reactor(unsigned id)
}

reactor::~reactor() {
sigset_t mask;
sigemptyset(&mask);
- sigaddset(&mask, block_notifier_signal());
+ sigaddset(&mask, cpu_stall_detector::signal_number());
auto r = ::pthread_sigmask(SIG_BLOCK, &mask, NULL);
assert(r == 0);

_dying.store(true, std::memory_order_relaxed);
_task_quota_timer.timerfd_settime(0, seastar::posix::to_relative_itimerspec(1ns, 1ms)); // Make the timer fire soon
@@ -621,10 +619,30 @@ inline Integral add_nonatomically(std::atomic<Integral>& value, Integral inc) {
template <typename Integral>
inline Integral increment_nonatomically(std::atomic<Integral>& value) {
return add_nonatomically(value, Integral(1));
}

+cpu_stall_detector::cpu_stall_detector(reactor* r, cpu_stall_detector_config cfg)
+ : _r(r) {
+ update_config(cfg);
+}
+
+cpu_stall_detector_config
+cpu_stall_detector::get_config() const {
+ return _config;
+}
+
+void cpu_stall_detector::update_config(cpu_stall_detector_config cfg) {
+ _config = cfg;
+ auto tq = _r->_task_quota;
+ if (tq == 0s) {
+ tq = 1ms; // safe default, initialization order is wrong
+ }
+ _tasks_processed_report_threshold = cfg.threshold / tq;
+ _stall_detector_reports_per_minute = cfg.stall_detector_reports_per_minute;
+}
+
void
reactor::task_quota_timer_thread_fn() {
auto thread_name = seastar::format("timer-{}", _id);
pthread_setname_np(pthread_self(), thread_name.c_str());

@@ -637,11 +655,11 @@ reactor::task_quota_timer_thread_fn() {
if (r) {
seastar_logger.info("Thread {}: failed to block signals. Aborting.", thread_name.c_str());
abort();
}

- unsigned report_at = _tasks_processed_report_threshold;
+ unsigned report_at = _cpu_stall_detector->_tasks_processed_report_threshold;
uint64_t last_tasks_processed_seen = 0;
uint64_t last_polls_seen = 0;

class block_notifier_rate_limiter {
unsigned _reported = 0;
@@ -696,54 +714,60 @@ reactor::task_quota_timer_thread_fn() {
{
uint64_t events;
_task_quota_timer.read(&events, 8);
_local_need_preempt = true;
}
- block_notifier_rate_limiter rate_limit(unsigned(60s / _task_quota), _stall_detector_reports_per_minute, _id);
+ block_notifier_rate_limiter rate_limit(unsigned(60s / _task_quota), _cpu_stall_detector->_stall_detector_reports_per_minute, _id);
uint64_t saved_missed_ticks = 0;

while (!_dying.load(std::memory_order_relaxed)) {
uint64_t events;
_task_quota_timer.read(&events, 8);
_local_need_preempt = true;

- auto missed_ticks = _stall_detector_missed_ticks.load(std::memory_order_relaxed);
+ auto missed_ticks = _cpu_stall_detector->_stall_detector_missed_ticks.load(std::memory_order_relaxed);
rate_limit.tick(std::max(uint64_t(1), missed_ticks - saved_missed_ticks));
saved_missed_ticks = missed_ticks;

auto tp = _tasks_processed.load(std::memory_order_relaxed);
auto p = _polls.load(std::memory_order_relaxed);
if ((tp == last_tasks_processed_seen) && (p == last_polls_seen)) {
- if ((increment_nonatomically(_tasks_processed_stalled) == report_at)) {
- rate_limit.maybe_report(_thread_id, block_notifier_signal());
+ if ((increment_nonatomically(_cpu_stall_detector->_tasks_processed_stalled) == report_at)) {
+ rate_limit.maybe_report(_thread_id, cpu_stall_detector::signal_number());
report_at <<= 1;
}
} else {
last_tasks_processed_seen = tp;
last_polls_seen = p;
- _tasks_processed_stalled.store(0, std::memory_order_relaxed);
- report_at = _tasks_processed_report_threshold;
+ _cpu_stall_detector->_tasks_processed_stalled.store(0, std::memory_order_relaxed);
+ report_at = _cpu_stall_detector->_tasks_processed_report_threshold;
}

// We're in a different thread, but guaranteed to be on the same core, so even
// a signal fence is overdoing it
std::atomic_signal_fence(std::memory_order_seq_cst);
}
}
void
reactor::update_blocked_reactor_notify_ms(std::chrono::milliseconds ms) {
- unsigned threshold = ms / _task_quota;
- if (threshold != _tasks_processed_report_threshold) {
- _tasks_processed_report_threshold = threshold;
+ auto cfg = _cpu_stall_detector->get_config();
+ if (ms != cfg.threshold) {
+ cfg.threshold = ms;
+ _cpu_stall_detector->update_config(cfg);
seastar_logger.info("updated: blocked-reactor-notify-ms={}", ms.count());
}
}

void
reactor::block_notifier(int) {
- auto steps = engine()._tasks_processed_stalled.load(std::memory_order_relaxed);
- auto delta = std::chrono::duration_cast<std::chrono::milliseconds>(engine()._task_quota * steps);
+ engine()._cpu_stall_detector->generate_trace();
+}
+
+void
+cpu_stall_detector::generate_trace() {
+ auto steps = _tasks_processed_stalled.load(std::memory_order_relaxed);
+ auto delta = std::chrono::duration_cast<std::chrono::milliseconds>(_r->_task_quota * steps);

backtrace_buffer buf;
buf.append("Reactor stalled for ");
buf.append_decimal(uint64_t(delta.count()));
buf.append(" ms");
@@ -850,12 +874,14 @@ void reactor::configure(boost::program_options::variables_map vm) {
_handle_sigint = !vm.count("no-handle-interrupt");
auto task_quota = vm["task-quota-ms"].as<double>() * 1ms;
_task_quota = std::chrono::duration_cast<sched_clock::duration>(task_quota);

auto blocked_time = vm["blocked-reactor-notify-ms"].as<unsigned>() * 1ms;
- _tasks_processed_report_threshold = unsigned(blocked_time / task_quota);
- _stall_detector_reports_per_minute = vm["blocked-reactor-reports-per-minute"].as<unsigned>();
+ cpu_stall_detector_config csdc;
+ csdc.threshold = blocked_time;
+ csdc.stall_detector_reports_per_minute = vm["blocked-reactor-reports-per-minute"].as<unsigned>();
+ _cpu_stall_detector->update_config(csdc);

_max_task_backlog = vm["max-task-backlog"].as<unsigned>();
_max_poll_time = vm["idle-poll-time-us"].as<unsigned>() * 1us;
if (vm.count("poll-mode")) {
_max_poll_time = std::chrono::nanoseconds::max();
@@ -3252,11 +3278,11 @@ int reactor::run() {
auto& task_quote_itimerspec = its;

struct sigaction sa_block_notifier = {};
sa_block_notifier.sa_handler = &reactor::block_notifier;
sa_block_notifier.sa_flags = SA_RESTART;
- auto r = sigaction(block_notifier_signal(), &sa_block_notifier, nullptr);
+ auto r = sigaction(cpu_stall_detector::signal_number(), &sa_block_notifier, nullptr);
assert(r == 0);

bool idle = false;

std::function<bool()> check_for_work = [this] () {
@@ -3316,11 +3342,11 @@ int reactor::run() {
_task_quota_timer.timerfd_settime(0, zero_itimerspec);
auto start_sleep = sched_clock::now();
sleep();
// We may have slept for a while, so freshen idle_end
idle_end = sched_clock::now();
- add_nonatomically(_stall_detector_missed_ticks, uint64_t((idle_end - start_sleep)/_task_quota));
+ add_nonatomically(_cpu_stall_detector->_stall_detector_missed_ticks, uint64_t((idle_end - start_sleep)/_task_quota));
_total_sleep += idle_end - start_sleep;
_task_quota_timer.timerfd_settime(0, task_quote_itimerspec);
}
} else {
// We previously ran pure_check_for_work(), might not actually have performed
--
2.19.1

Avi Kivity

<avi@scylladb.com>

unread,

Nov 22, 2018, 12:30:16 PM11/22/18

to seastar-dev@googlegroups.com

They have a 1:1 relationship, and are part of the same task, so merge them.
---
src/core/stall_detector.hh | 9 ++++
src/core/reactor.cc | 90 +++++++++++++++++---------------------
2 files changed, 48 insertions(+), 51 deletions(-)

diff --git a/src/core/stall_detector.hh b/src/core/stall_detector.hh
index 8f5006714d..462f0bff49 100644
--- a/src/core/stall_detector.hh
+++ b/src/core/stall_detector.hh
@@ -42,17 +42,26 @@ class cpu_stall_detector {
reactor* _r;

std::atomic<unsigned> _tasks_processed_stalled = { 0 };

unsigned _tasks_processed_report_threshold;
unsigned _stall_detector_reports_per_minute;

std::atomic<uint64_t> _stall_detector_missed_ticks = { 0 };

+ unsigned _reported = 0;
+ unsigned _ticks = 0;
+ unsigned _ticks_per_minute;
+ unsigned _max_reports_per_minute;
+ unsigned _shard_id;
+ unsigned _thread_id;
cpu_stall_detector_config _config;
friend reactor;
+private:
+ void maybe_report(pthread_t who, int sig);
public:

cpu_stall_detector(reactor* r, cpu_stall_detector_config cfg = {});

static int signal_number() { return SIGRTMIN + 1; }

void generate_trace();
void update_config(cpu_stall_detector_config cfg);
cpu_stall_detector_config get_config() const;
+ void tick(unsigned ticks = 1);
};

}
}
diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index 6651be531c..3649d7329f 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -620,11 +620,12 @@ template <typename Integral>

inline Integral increment_nonatomically(std::atomic<Integral>& value) {
return add_nonatomically(value, Integral(1));
}

cpu_stall_detector::cpu_stall_detector(reactor* r, cpu_stall_detector_config cfg)
- : _r(r) {
+ : _r(r)
+ , _shard_id(_r->cpu_id()) {
update_config(cfg);
}

cpu_stall_detector_config
cpu_stall_detector::get_config() const {
@@ -637,10 +638,45 @@ void cpu_stall_detector::update_config(cpu_stall_detector_config cfg) {
if (tq == 0s) {

tq = 1ms; // safe default, initialization order is wrong
}

_tasks_processed_report_threshold = cfg.threshold / tq;

_stall_detector_reports_per_minute = cfg.stall_detector_reports_per_minute;
+ _ticks_per_minute = 60s / tq;
+ _max_reports_per_minute = cfg.stall_detector_reports_per_minute;
+}
+
+void cpu_stall_detector::maybe_report(pthread_t who, int sig) {
+ if (_reported++ < _max_reports_per_minute) {
+ pthread_kill(who, sig);
+ }
+}
+// We use a tick at every timer firing so we can report suppressed backtraces.
+// Best case it's a correctly predicted branch. If a backtrace had happened in
+// the near past it's an increment and two branches.
+//
+// We can do it a cheaper if we don't report suppressed backtraces.
+void cpu_stall_detector::tick(unsigned ticks) {
+ if (!_reported) {
+ return;
+ }
+ _ticks += ticks;
+ if (_ticks >= _ticks_per_minute) {
+ if (_reported > _max_reports_per_minute) {
+ auto supressed = _reported - _max_reports_per_minute;
+ backtrace_buffer buf;
+ // Reuse backtrace buffer infrastructure so we don't have to allocate here
+ buf.append("Rate-limit: supressed ");
+ buf.append_decimal(_reported - _max_reports_per_minute);
+ supressed == 1 ? buf.append(" backtrace") : buf.append(" backtraces");
+ buf.append(" on shard ");
+ buf.append_decimal(_shard_id);
+ buf.append("\n");
+ buf.flush();
+ }
+ _reported = 0;
+ _ticks = 0;

+ }
}

void
reactor::task_quota_timer_thread_fn() {
auto thread_name = seastar::format("timer-{}", _id);

@@ -659,82 +695,34 @@ reactor::task_quota_timer_thread_fn() {

unsigned report_at = _cpu_stall_detector->_tasks_processed_report_threshold;
uint64_t last_tasks_processed_seen = 0;
uint64_t last_polls_seen = 0;

- class block_notifier_rate_limiter {
- unsigned _reported = 0;
- unsigned _ticks = 0;
- unsigned _ticks_per_minute;
- unsigned _max_reports_per_minute;
- unsigned _shard_id;
- unsigned _thread_id;
- public:
- void maybe_report(pthread_t who, int sig) {
- if (_reported++ < _max_reports_per_minute) {
- pthread_kill(who, sig);
- }
- }
- // We use a tick at every timer firing so we can report suppressed backtraces.
- // Best case it's a correctly predicted branch. If a backtrace had happened in
- // the near past it's an increment and two branches.
- //
- // We can do it a cheaper if we don't report suppressed backtraces.
- void tick(unsigned ticks = 1) {
- if (!_reported) {
- return;
- }
- _ticks += ticks;
- if (_ticks >= _ticks_per_minute) {
- if (_reported > _max_reports_per_minute) {
- auto supressed = _reported - _max_reports_per_minute;
- backtrace_buffer buf;
- // Reuse backtrace buffer infrastructure so we don't have to allocate here
- buf.append("Rate-limit: supressed ");
- buf.append_decimal(_reported - _max_reports_per_minute);
- supressed == 1 ? buf.append(" backtrace") : buf.append(" backtraces");
- buf.append(" on shard ");
- buf.append_decimal(_shard_id);
- buf.append("\n");
- buf.flush();
- }
- _reported = 0;
- _ticks = 0;
- }
- }
- block_notifier_rate_limiter(unsigned ticks_per_minute, unsigned max_reports, unsigned shard_id)
- : _ticks_per_minute(ticks_per_minute)
- , _max_reports_per_minute(max_reports)
- , _shard_id(shard_id)
- {}
- };
-
// We need to wait until task quota is set before we can calculate how many ticks are to
// a minute. Technically task_quota is used from many threads, but since it is read-only here
// and only used during initialization we will avoid complicating the code.

{
uint64_t events;
_task_quota_timer.read(&events, 8);
_local_need_preempt = true;
}

- block_notifier_rate_limiter rate_limit(unsigned(60s / _task_quota), _cpu_stall_detector->_stall_detector_reports_per_minute, _id);

uint64_t saved_missed_ticks = 0;

while (!_dying.load(std::memory_order_relaxed)) {
uint64_t events;
_task_quota_timer.read(&events, 8);
_local_need_preempt = true;

auto missed_ticks = _cpu_stall_detector->_stall_detector_missed_ticks.load(std::memory_order_relaxed);

- rate_limit.tick(std::max(uint64_t(1), missed_ticks - saved_missed_ticks));
+ _cpu_stall_detector->tick(std::max(uint64_t(1), missed_ticks - saved_missed_ticks));

saved_missed_ticks = missed_ticks;

auto tp = _tasks_processed.load(std::memory_order_relaxed);
auto p = _polls.load(std::memory_order_relaxed);
if ((tp == last_tasks_processed_seen) && (p == last_polls_seen)) {

if ((increment_nonatomically(_cpu_stall_detector->_tasks_processed_stalled) == report_at)) {

- rate_limit.maybe_report(_thread_id, cpu_stall_detector::signal_number());
+ _cpu_stall_detector->maybe_report(_thread_id, cpu_stall_detector::signal_number());

report_at <<= 1;
}
} else {
last_tasks_processed_seen = tp;
last_polls_seen = p;

--
2.19.1

Avi Kivity

<avi@scylladb.com>

unread,

Nov 22, 2018, 12:30:17 PM11/22/18

to seastar-dev@googlegroups.com

Isolate the code in prepration to modification.
---
src/core/stall_detector.hh | 6 +++++-
src/core/reactor.cc | 43 ++++++++++++++++++--------------------
2 files changed, 25 insertions(+), 24 deletions(-)

diff --git a/src/core/stall_detector.hh b/src/core/stall_detector.hh
index 462f0bff49..b0a4179182 100644
--- a/src/core/stall_detector.hh
+++ b/src/core/stall_detector.hh
@@ -48,20 +48,24 @@ class cpu_stall_detector {
unsigned _ticks = 0;
unsigned _ticks_per_minute;
unsigned _max_reports_per_minute;
unsigned _shard_id;
unsigned _thread_id;
+ unsigned _report_at{};
+ unsigned _saved_missed_ticks{};
+ uint64_t _last_tasks_processed_seen{};
+ uint64_t _last_polls_seen{};
cpu_stall_detector_config _config;
friend reactor;
private:

void maybe_report(pthread_t who, int sig);
public:
cpu_stall_detector(reactor* r, cpu_stall_detector_config cfg = {});
static int signal_number() { return SIGRTMIN + 1; }
void generate_trace();
void update_config(cpu_stall_detector_config cfg);
cpu_stall_detector_config get_config() const;

- void tick(unsigned ticks = 1);
+ void tick();
};

}
}
diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index 3649d7329f..e963283471 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -652,11 +652,29 @@ void cpu_stall_detector::maybe_report(pthread_t who, int sig) {

// We use a tick at every timer firing so we can report suppressed backtraces.

// Best case it's a correctly predicted branch. If a backtrace had happened in

// the near past it's an increment and two branches.

//

// We can do it a cheaper if we don't report suppressed backtraces.

-void cpu_stall_detector::tick(unsigned ticks) {
+void cpu_stall_detector::tick() {
+ auto missed_ticks = _stall_detector_missed_ticks.load(std::memory_order_relaxed);
+ auto ticks = std::max(uint64_t(1), missed_ticks - _saved_missed_ticks);
+ _saved_missed_ticks = missed_ticks;
+
+ auto tp = _r->_tasks_processed.load(std::memory_order_relaxed);
+ auto p = _r->_polls.load(std::memory_order_relaxed);
+ if ((tp == _last_tasks_processed_seen) && (p == _last_polls_seen)) {
+ if ((increment_nonatomically(_tasks_processed_stalled) == _report_at)) {
+ maybe_report(_r->_thread_id, cpu_stall_detector::signal_number());
+ _report_at <<= 1;
+ }
+ } else {
+ _last_tasks_processed_seen = tp;
+ _last_polls_seen = p;
+ _tasks_processed_stalled.store(0, std::memory_order_relaxed);
+ _report_at = _tasks_processed_report_threshold;
+ }
+
if (!_reported) {
return;
}
_ticks += ticks;
if (_ticks >= _ticks_per_minute) {
@@ -691,46 +709,25 @@ reactor::task_quota_timer_thread_fn() {

if (r) {
seastar_logger.info("Thread {}: failed to block signals. Aborting.", thread_name.c_str());
abort();
}

- unsigned report_at = _cpu_stall_detector->_tasks_processed_report_threshold;
- uint64_t last_tasks_processed_seen = 0;
- uint64_t last_polls_seen = 0;
-

// We need to wait until task quota is set before we can calculate how many ticks are to
// a minute. Technically task_quota is used from many threads, but since it is read-only here
// and only used during initialization we will avoid complicating the code.
{
uint64_t events;
_task_quota_timer.read(&events, 8);
_local_need_preempt = true;
}

- uint64_t saved_missed_ticks = 0;

while (!_dying.load(std::memory_order_relaxed)) {
uint64_t events;
_task_quota_timer.read(&events, 8);
_local_need_preempt = true;

- auto missed_ticks = _cpu_stall_detector->_stall_detector_missed_ticks.load(std::memory_order_relaxed);
- _cpu_stall_detector->tick(std::max(uint64_t(1), missed_ticks - saved_missed_ticks));
- saved_missed_ticks = missed_ticks;
-
- auto tp = _tasks_processed.load(std::memory_order_relaxed);
- auto p = _polls.load(std::memory_order_relaxed);
- if ((tp == last_tasks_processed_seen) && (p == last_polls_seen)) {
- if ((increment_nonatomically(_cpu_stall_detector->_tasks_processed_stalled) == report_at)) {
- _cpu_stall_detector->maybe_report(_thread_id, cpu_stall_detector::signal_number());
- report_at <<= 1;
- }
- } else {
- last_tasks_processed_seen = tp;
- last_polls_seen = p;
- _cpu_stall_detector->_tasks_processed_stalled.store(0, std::memory_order_relaxed);
- report_at = _cpu_stall_detector->_tasks_processed_report_threshold;
- }
+ _cpu_stall_detector->tick();

// We're in a different thread, but guaranteed to be on the same core, so even
// a signal fence is overdoing it
std::atomic_signal_fence(std::memory_order_seq_cst);
}

--
2.19.1

Avi Kivity

<avi@scylladb.com>

unread,

Nov 22, 2018, 12:30:18 PM11/22/18

to seastar-dev@googlegroups.com

Instead of letting the idle loop reach into the internals of the stall detector,
have it call an API.
---
src/core/stall_detector.hh | 1 +
src/core/reactor.cc | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/core/stall_detector.hh b/src/core/stall_detector.hh
index b0a4179182..194a9e0733 100644
--- a/src/core/stall_detector.hh
+++ b/src/core/stall_detector.hh
@@ -63,9 +63,10 @@ class cpu_stall_detector {

static int signal_number() { return SIGRTMIN + 1; }
void generate_trace();
void update_config(cpu_stall_detector_config cfg);
cpu_stall_detector_config get_config() const;

void tick();
+ void account_for_missed_ticks(std::chrono::steady_clock::duration idle_time);
};

}
}
diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index e963283471..ce4fa3dc05 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -693,10 +693,15 @@ void cpu_stall_detector::tick() {
_reported = 0;
_ticks = 0;
}
}

+void
+cpu_stall_detector::account_for_missed_ticks(std::chrono::steady_clock::duration idle_time) {
+ add_nonatomically(_stall_detector_missed_ticks, uint64_t(idle_time / _r->_task_quota));
+}

+
void
reactor::task_quota_timer_thread_fn() {
auto thread_name = seastar::format("timer-{}", _id);

pthread_setname_np(pthread_self(), thread_name.c_str());

@@ -3327,11 +3332,11 @@ int reactor::run() {

_task_quota_timer.timerfd_settime(0, zero_itimerspec);
auto start_sleep = sched_clock::now();
sleep();
// We may have slept for a while, so freshen idle_end
idle_end = sched_clock::now();

- add_nonatomically(_cpu_stall_detector->_stall_detector_missed_ticks, uint64_t((idle_end - start_sleep)/_task_quota));
+ _cpu_stall_detector->account_for_missed_ticks(idle_end - start_sleep);

Avi Kivity

<avi@scylladb.com>

unread,

Nov 22, 2018, 12:30:20 PM11/22/18

to seastar-dev@googlegroups.com

The reactor will soon gain an aio poll mode, which does not have a timer thread.
To avoid losing the stall detector, we migrate it to use a timer+signal directly.

In order to prevent excessive signals, the timer is set to the report threshold,
rather than the task quota. This means that almost every delivered signal
corresponds to a quota violation (the exceptions are when we've stopped running tasks
by the time the signal was delivered).

To avoid excessive calls to timer_settime(), we allow ourselves some slack in the
timing. For a threshold of 10ms we'll set the timer to 13ms and rearm it after 3ms
have gone by. So when we enter task execution, we have between 10ms and 13ms on the
clock.
---
src/core/stall_detector.hh | 29 ++++++----
src/core/reactor.cc | 107 +++++++++++++++++++++++--------------
2 files changed, 85 insertions(+), 51 deletions(-)

diff --git a/src/core/stall_detector.hh b/src/core/stall_detector.hh
index 194a9e0733..41d57ab5cd 100644
--- a/src/core/stall_detector.hh
+++ b/src/core/stall_detector.hh
@@ -23,50 +23,59 @@
#pragma once

#include <signal.h>
#include <limits>
#include <chrono>
+#include <seastar/core/posix.hh>

namespace seastar {

class reactor;

namespace internal {

struct cpu_stall_detector_config {

std::chrono::duration<double> threshold = std::chrono::seconds(2);

unsigned stall_detector_reports_per_minute = 1;
+ float slack = 0.3; // fraction of threshold that we're allowed to overshoot
};

// Detects stalls in continuations that run for too long

class cpu_stall_detector {
reactor* _r;
- std::atomic<unsigned> _tasks_processed_stalled = { 0 };
- unsigned _tasks_processed_report_threshold;
+ bool _timer_initialized = false;
+ timer_t _timer;
+ std::atomic<bool> _active{};

unsigned _stall_detector_reports_per_minute;
std::atomic<uint64_t> _stall_detector_missed_ticks = { 0 };

unsigned _reported = 0;
- unsigned _ticks = 0;

- unsigned _ticks_per_minute;

unsigned _max_reports_per_minute;
unsigned _shard_id;
unsigned _thread_id;

unsigned _report_at{};
- unsigned _saved_missed_ticks{};
- uint64_t _last_tasks_processed_seen{};
- uint64_t _last_polls_seen{};
+ std::chrono::steady_clock::time_point _minute_mark{};
+ std::chrono::steady_clock::time_point _rearm_timer_at{};
+ std::chrono::steady_clock::time_point _run_started_at{};
+ std::chrono::steady_clock::duration _threshold;
+ std::chrono::steady_clock::duration _slack;

cpu_stall_detector_config _config;
friend reactor;
private:

- void maybe_report(pthread_t who, int sig);
+ void maybe_report();
+ void arm_timer();
+ void report_suppressions(std::chrono::steady_clock::time_point now);

public:
cpu_stall_detector(reactor* r, cpu_stall_detector_config cfg = {});

+ ~cpu_stall_detector();

static int signal_number() { return SIGRTMIN + 1; }

+ void start_task_run(std::chrono::steady_clock::time_point now);
+ void end_task_run(std::chrono::steady_clock::time_point now);

void generate_trace();
void update_config(cpu_stall_detector_config cfg);
cpu_stall_detector_config get_config() const;

- void tick();
- void account_for_missed_ticks(std::chrono::steady_clock::duration idle_time);
+ void on_signal();
+ void start_sleep();
+ void end_sleep();
};

}
}
diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index ce4fa3dc05..1fcfe14ffd 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -623,64 +623,63 @@ inline Integral increment_nonatomically(std::atomic<Integral>& value) {

cpu_stall_detector::cpu_stall_detector(reactor* r, cpu_stall_detector_config cfg)

: _r(r)
, _shard_id(_r->cpu_id()) {
update_config(cfg);
+ struct sigevent sev = {};
+ sev.sigev_notify = SIGEV_THREAD_ID;
+ sev.sigev_signo = signal_number();
+ sev._sigev_un._tid = syscall(SYS_gettid);
+ int err = timer_create(CLOCK_MONOTONIC, &sev, &_timer);
+ if (err) {
+ throw std::system_error(std::error_code(err, std::system_category()));
+ }
+ _timer_initialized = true;
+
+}
+
+cpu_stall_detector::~cpu_stall_detector() {
+ if (_timer_initialized) {
+ timer_delete(_timer);
+ }
}

cpu_stall_detector_config
cpu_stall_detector::get_config() const {
return _config;
}

void cpu_stall_detector::update_config(cpu_stall_detector_config cfg) {
_config = cfg;
- auto tq = _r->_task_quota;
- if (tq == 0s) {
- tq = 1ms; // safe default, initialization order is wrong
- }
- _tasks_processed_report_threshold = cfg.threshold / tq;
+ _threshold = std::chrono::duration_cast<std::chrono::steady_clock::duration>(cfg.threshold);
+ _slack = std::chrono::duration_cast<std::chrono::steady_clock::duration>(cfg.threshold * cfg.slack);
_stall_detector_reports_per_minute = cfg.stall_detector_reports_per_minute;
- _ticks_per_minute = 60s / tq;
_max_reports_per_minute = cfg.stall_detector_reports_per_minute;
+ _rearm_timer_at = std::chrono::steady_clock::now();
}

-void cpu_stall_detector::maybe_report(pthread_t who, int sig) {
+void cpu_stall_detector::maybe_report() {

if (_reported++ < _max_reports_per_minute) {
- pthread_kill(who, sig);

+ generate_trace();

}
}
// We use a tick at every timer firing so we can report suppressed backtraces.
// Best case it's a correctly predicted branch. If a backtrace had happened in
// the near past it's an increment and two branches.
//
// We can do it a cheaper if we don't report suppressed backtraces.

-void cpu_stall_detector::tick() {
- auto missed_ticks = _stall_detector_missed_ticks.load(std::memory_order_relaxed);
- auto ticks = std::max(uint64_t(1), missed_ticks - _saved_missed_ticks);
- _saved_missed_ticks = missed_ticks;
-
- auto tp = _r->_tasks_processed.load(std::memory_order_relaxed);
- auto p = _r->_polls.load(std::memory_order_relaxed);
- if ((tp == _last_tasks_processed_seen) && (p == _last_polls_seen)) {
- if ((increment_nonatomically(_tasks_processed_stalled) == _report_at)) {
- maybe_report(_r->_thread_id, cpu_stall_detector::signal_number());
- _report_at <<= 1;
- }
- } else {
- _last_tasks_processed_seen = tp;
- _last_polls_seen = p;
- _tasks_processed_stalled.store(0, std::memory_order_relaxed);
- _report_at = _tasks_processed_report_threshold;
+void cpu_stall_detector::on_signal() {
+ if (_active.load(std::memory_order_relaxed)) {
+ maybe_report();
+ _report_at <<= 1;
+ arm_timer();
}
+}

- if (!_reported) {
- return;
- }
- _ticks += ticks;
- if (_ticks >= _ticks_per_minute) {
- if (_reported > _max_reports_per_minute) {

+void cpu_stall_detector::report_suppressions(std::chrono::steady_clock::time_point now) {
+ if (now > _minute_mark + 60s) {
+ if (_reported) {

auto supressed = _reported - _max_reports_per_minute;

backtrace_buffer buf;

// Reuse backtrace buffer infrastructure so we don't have to allocate here

buf.append("Rate-limit: supressed ");
buf.append_decimal(_reported - _max_reports_per_minute);
@@ -689,17 +688,43 @@ void cpu_stall_detector::tick() {
buf.append_decimal(_shard_id);
buf.append("\n");
buf.flush();

}
_reported = 0;
- _ticks = 0;

+ _minute_mark = now;
}
}

-void
-cpu_stall_detector::account_for_missed_ticks(std::chrono::steady_clock::duration idle_time) {
- add_nonatomically(_stall_detector_missed_ticks, uint64_t(idle_time / _r->_task_quota));
+void cpu_stall_detector::arm_timer() {
+ auto its = posix::to_relative_itimerspec(_threshold * _report_at + _slack, 0s);
+ timer_settime(_timer, 0, &its, nullptr);
+}
+
+void cpu_stall_detector::start_task_run(std::chrono::steady_clock::time_point now) {
+ if (now > _rearm_timer_at) {
+ report_suppressions(now);
+ _report_at = 1;
+ _run_started_at = now;
+ _rearm_timer_at = now + _threshold * _report_at;
+ arm_timer();
+ }
+ _active.store(true, std::memory_order_relaxed);
+ std::atomic_signal_fence(std::memory_order_release); // Don't delay this write, so the signal handler can see it
+}
+
+void cpu_stall_detector::end_task_run(std::chrono::steady_clock::time_point now) {
+ std::atomic_signal_fence(std::memory_order_release); // Don't delay this write, so the signal handler can see it
+ _active.store(true, std::memory_order_relaxed);
+}
+
+void cpu_stall_detector::start_sleep() {
+ auto its = posix::to_relative_itimerspec(0s, 0s);
+ timer_settime(_timer, 0, &its, nullptr);
+ _rearm_timer_at = std::chrono::steady_clock::now();
+}
+
+void cpu_stall_detector::end_sleep() {

}

void
reactor::task_quota_timer_thread_fn() {
auto thread_name = seastar::format("timer-{}", _id);

@@ -728,12 +753,10 @@ reactor::task_quota_timer_thread_fn() {

while (!_dying.load(std::memory_order_relaxed)) {
uint64_t events;
_task_quota_timer.read(&events, 8);
_local_need_preempt = true;

- _cpu_stall_detector->tick();
-

// We're in a different thread, but guaranteed to be on the same core, so even
// a signal fence is overdoing it
std::atomic_signal_fence(std::memory_order_seq_cst);
}
}

@@ -752,16 +775,15 @@ reactor::block_notifier(int) {
engine()._cpu_stall_detector->generate_trace();
}

void
cpu_stall_detector::generate_trace() {
- auto steps = _tasks_processed_stalled.load(std::memory_order_relaxed);
- auto delta = std::chrono::duration_cast<std::chrono::milliseconds>(_r->_task_quota * steps);
+ auto delta = std::chrono::steady_clock::now() - _run_started_at;

backtrace_buffer buf;
buf.append("Reactor stalled for ");

- buf.append_decimal(uint64_t(delta.count()));
+ buf.append_decimal(uint64_t(delta / 1ms));
buf.append(" ms");
print_with_backtrace(buf);
}

template <typename T, typename E, typename EnableFunc>
@@ -3125,10 +3147,11 @@ reactor::run_some_tasks() {
sched_print("run_some_tasks: start");
g_need_preempt = false;

sched_clock::time_point t_run_completed = std::chrono::steady_clock::now();
STAP_PROBE(seastar, reactor_run_tasks_start);
+ _cpu_stall_detector->start_task_run(t_run_completed);
do {
auto t_run_started = t_run_completed;
insert_activating_task_queues();
auto tq = _active_task_queues.front();
_active_task_queues.pop_front();
@@ -3146,10 +3169,11 @@ reactor::run_some_tasks() {
insert_active_task_queue(tq);
} else {
tq->_active = false;
}
} while (have_more_tasks() && !need_preempt());
+ _cpu_stall_detector->end_task_run(t_run_completed);
STAP_PROBE(seastar, reactor_run_tasks_end);
*internal::current_scheduling_group_ptr() = default_scheduling_group(); // Prevent inheritance from last group run
sched_print("run_some_tasks: end");
}

@@ -3329,14 +3353,15 @@ int reactor::run() {
if (idle_end - idle_start > _max_poll_time) {
// Turn off the task quota timer to avoid spurious wakeups
struct itimerspec zero_itimerspec = {};

_task_quota_timer.timerfd_settime(0, zero_itimerspec);
auto start_sleep = sched_clock::now();

+ _cpu_stall_detector->start_sleep();
sleep();
+ _cpu_stall_detector->end_sleep();

// We may have slept for a while, so freshen idle_end
idle_end = sched_clock::now();

- _cpu_stall_detector->account_for_missed_ticks(idle_end - start_sleep);

Avi Kivity

<avi@scylladb.com>

unread,

Nov 22, 2018, 12:30:21 PM11/22/18

to seastar-dev@googlegroups.com

Using the thread cputime clock prevents false positives due to an overcommitted
processor, and this improves the quality of the reports.

It could also generate false negatives if system calls block (since the cputime
clock is stopped when a thread switches contexts voluntarily), but these events
are rarer these days and are better investigated using the other latency detector.
---
src/core/reactor.cc | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index 1fcfe14ffd..2d12315fb1 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -627,11 +627,11 @@ cpu_stall_detector::cpu_stall_detector(reactor* r, cpu_stall_detector_config cfg
update_config(cfg);
struct sigevent sev = {};
sev.sigev_notify = SIGEV_THREAD_ID;
sev.sigev_signo = signal_number();
sev._sigev_un._tid = syscall(SYS_gettid);
- int err = timer_create(CLOCK_MONOTONIC, &sev, &_timer);
+ int err = timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &_timer);
if (err) {

throw std::system_error(std::error_code(err, std::system_category()));
}

_timer_initialized = true;

--
2.19.1

Avi Kivity

<avi@scylladb.com>

unread,

Nov 22, 2018, 12:30:22 PM11/22/18

to seastar-dev@googlegroups.com

For symmetry with the update() function; will be useful in a unit test later.
---
include/seastar/core/reactor.hh | 1 +
src/core/reactor.cc | 6 ++++++
2 files changed, 7 insertions(+)

diff --git a/include/seastar/core/reactor.hh b/include/seastar/core/reactor.hh
index 5af5cd2387..417b720739 100644
--- a/include/seastar/core/reactor.hh
+++ b/include/seastar/core/reactor.hh
@@ -1178,10 +1178,11 @@ class reactor {
void set_bypass_fsync(bool value) {
_bypass_fsync = value;
}

void update_blocked_reactor_notify_ms(std::chrono::milliseconds ms);
+ std::chrono::milliseconds get_blocked_reactor_notify_ms() const;
};

template <typename Func> // signature: bool ()
inline
std::unique_ptr<reactor::pollfn>
diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index 2d12315fb1..1b1c4f183f 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -768,10 +768,16 @@ reactor::update_blocked_reactor_notify_ms(std::chrono::milliseconds ms) {

_cpu_stall_detector->update_config(cfg);
seastar_logger.info("updated: blocked-reactor-notify-ms={}", ms.count());
}
}

+std::chrono::milliseconds
+reactor::get_blocked_reactor_notify_ms() const {
+ auto d = _cpu_stall_detector->get_config().threshold;
+ return std::chrono::duration_cast<std::chrono::milliseconds>(d);
+}
+
void
reactor::block_notifier(int) {
engine()._cpu_stall_detector->generate_trace();
}

--
2.19.1

Avi Kivity

<avi@scylladb.com>

unread,

Nov 22, 2018, 12:30:23 PM11/22/18

to seastar-dev@googlegroups.com

Useful in unit testing.
---
include/seastar/core/reactor.hh | 3 +++
src/core/stall_detector.hh | 2 ++
src/core/reactor.cc | 17 +++++++++++++++++
3 files changed, 22 insertions(+)

diff --git a/include/seastar/core/reactor.hh b/include/seastar/core/reactor.hh
index 417b720739..1e6f118862 100644
--- a/include/seastar/core/reactor.hh
+++ b/include/seastar/core/reactor.hh
@@ -1179,10 +1179,13 @@ class reactor {

_bypass_fsync = value;
}

void update_blocked_reactor_notify_ms(std::chrono::milliseconds ms);

std::chrono::milliseconds get_blocked_reactor_notify_ms() const;

+ // For testing:
+ void set_stall_detector_report_function(std::function<void ()> report);
+ std::function<void ()> get_stall_detector_report_function() const;

};

template <typename Func> // signature: bool ()
inline
std::unique_ptr<reactor::pollfn>

diff --git a/src/core/stall_detector.hh b/src/core/stall_detector.hh
index 41d57ab5cd..128b9707a1 100644
--- a/src/core/stall_detector.hh
+++ b/src/core/stall_detector.hh
@@ -23,10 +23,11 @@

#pragma once

#include <signal.h>
#include <limits>
#include <chrono>

+#include <functional>

#include <seastar/core/posix.hh>

namespace seastar {

class reactor;

@@ -35,10 +36,11 @@ namespace internal {

struct cpu_stall_detector_config {
std::chrono::duration<double> threshold = std::chrono::seconds(2);
unsigned stall_detector_reports_per_minute = 1;

float slack = 0.3; // fraction of threshold that we're allowed to overshoot

+ std::function<void ()> report; // alternative reporting function for tests

};

// Detects stalls in continuations that run for too long
class cpu_stall_detector {
reactor* _r;

diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index 1b1c4f183f..04a0cf30d3 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -774,19 +774,36 @@ std::chrono::milliseconds
reactor::get_blocked_reactor_notify_ms() const {

auto d = _cpu_stall_detector->get_config().threshold;

return std::chrono::duration_cast<std::chrono::milliseconds>(d);
}

+void
+reactor::set_stall_detector_report_function(std::function<void ()> report) {

+ auto cfg = _cpu_stall_detector->get_config();

+ cfg.report = std::move(report);
+ _cpu_stall_detector->update_config(std::move(cfg));
+}
+
+std::function<void ()>
+reactor::get_stall_detector_report_function() const {
+ return _cpu_stall_detector->get_config().report;

+}
+
void
reactor::block_notifier(int) {
engine()._cpu_stall_detector->generate_trace();
}

void
cpu_stall_detector::generate_trace() {

auto delta = std::chrono::steady_clock::now() - _run_started_at;

+ if (_config.report) {
+ _config.report();
+ return;
+ }
+

backtrace_buffer buf;
buf.append("Reactor stalled for ");

buf.append_decimal(uint64_t(delta / 1ms));
buf.append(" ms");
print_with_backtrace(buf);

--
2.19.1

Avi Kivity

<avi@scylladb.com>

unread,

Nov 22, 2018, 12:30:25 PM11/22/18

to seastar-dev@googlegroups.com

---
configure.py | 3 ++
test.py | 1 +
tests/unit/stall_detector_test.cc | 82 +++++++++++++++++++++++++++++++
3 files changed, 86 insertions(+)
create mode 100644 tests/unit/stall_detector_test.cc

diff --git a/configure.py b/configure.py
index 0fcf7053e3..f443625d14 100755
--- a/configure.py
+++ b/configure.py
@@ -310,10 +310,11 @@ tests = [
'tests/unit/net_config_test',
'tests/unit/abort_source_test',
'tests/unit/alien_test',
'tests/unit/signal_test',
'tests/unit/simple_stream_test',
+ 'tests/unit/stall_detector_test',
] + perf_tests

apps = [
'apps/httpd/httpd',
'apps/seawreck/seawreck',
@@ -640,10 +641,11 @@ deps = {
'tests/unit/net_config_test': ['tests/unit/net_config_test.cc'] + core + libnet,
'tests/unit/abort_source_test': ['tests/unit/abort_source_test.cc'] + core,
'tests/unit/alien_test': ['tests/unit/alien_test.cc'] + core,
'tests/unit/signal_test': ['tests/unit/signal_test.cc'] + core,
'tests/unit/simple_stream_test': ['tests/unit/simple_stream_test.cc'] + core,
+ 'tests/unit/stall_detector_test': ['tests/unit/stall_detector_test.cc'] + core,
}

boost_tests = [
'apps/memcached/tests/memcached_ascii_parser_test',
'tests/unit/file_io_test',
@@ -664,10 +666,11 @@ boost_tests = [
'tests/unit/dns_test',
'tests/unit/execution_stage_test',
'tests/unit/lowres_clock_test',
'tests/unit/abort_source_test',
'tests/unit/signal_test',
+ 'tests/unit/stall_detector_test',
]

for bt in boost_tests:
deps[bt] += boost_test_lib

diff --git a/test.py b/test.py
index bb34126c60..d99ac834a1 100755
--- a/test.py
+++ b/test.py
@@ -56,10 +56,11 @@ boost_tests = [
'unit/noncopyable_function_test',
'unit/abort_source_test',
'unit/signal_test',
'unit/simple_stream_test',
'unit/dns_test',
+ 'unit/stall_detector_test',
]

other_tests = [
'unit/smp_test',
'unit/timer_test',
diff --git a/tests/unit/stall_detector_test.cc b/tests/unit/stall_detector_test.cc
new file mode 100644
index 0000000000..519a2f2a66
--- /dev/null
+++ b/tests/unit/stall_detector_test.cc
@@ -0,0 +1,82 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*

+ * Copyright (C) 2018 ScyllaDB Ltd.
+ */
+
+#include <seastar/core/reactor.hh>
+#include "test-utils.hh"
+#include <atomic>
+#include <chrono>
+
+using namespace seastar;
+using namespace std::chrono_literals;
+
+class temporary_stall_detector_settings {
+ std::chrono::milliseconds _old_threshold;
+ std::function<void ()> _old_report;
+public:
+ temporary_stall_detector_settings(std::chrono::duration<double> threshold, std::function<void ()> report)
+ : _old_threshold(engine().get_blocked_reactor_notify_ms())
+ , _old_report(engine().get_stall_detector_report_function()) {
+ engine().update_blocked_reactor_notify_ms(std::chrono::duration_cast<std::chrono::milliseconds>(threshold));
+ engine().set_stall_detector_report_function(std::move(report));
+ }
+ ~temporary_stall_detector_settings() {
+ engine().update_blocked_reactor_notify_ms(_old_threshold);
+ engine().set_stall_detector_report_function(std::move(_old_report));
+ }
+};
+
+void spin(std::chrono::duration<double> how_much) {
+ auto end = std::chrono::steady_clock::now() + how_much;
+ while (std::chrono::steady_clock::now() < end) {
+ // spin!
+ }
+}
+
+void spin_some_cooperatively(std::chrono::duration<double> how_much) {
+ auto end = std::chrono::steady_clock::now() + how_much;
+ while (std::chrono::steady_clock::now() < end) {
+ spin(200us);
+ if (need_preempt()) {
+ thread::yield();
+ }
+ }
+}
+
+SEASTAR_THREAD_TEST_CASE(normal_case) {
+ std::atomic<unsigned> reports{};
+ temporary_stall_detector_settings tsds(10ms, [&] { ++reports; });
+ spin_some_cooperatively(1s);
+ BOOST_REQUIRE_EQUAL(reports, 0);
+}
+
+SEASTAR_THREAD_TEST_CASE(simple_stalls) {
+ std::atomic<unsigned> reports{};
+ temporary_stall_detector_settings tsds(10ms, [&] { ++reports; });
+ unsigned nr = 10;
+ for (unsigned i = 0; i < nr; ++i) {
+ spin_some_cooperatively(100ms);
+ spin(20ms);
+ }
+ spin_some_cooperatively(100ms);
+ BOOST_REQUIRE_EQUAL(reports, 10);
+}
+
+
--
2.19.1

Glauber Costa

<glauber@scylladb.com>

unread,

Nov 26, 2018, 2:59:40 PM11/26/18

to Avi Kivity, seastar-dev@googlegroups.com

On Thu, Nov 22, 2018 at 12:30 PM Avi Kivity <a...@scylladb.com> wrote:

The reactor will soon gain an aio poll mode, which does not have a timer thread.
To avoid losing the stall detector, we migrate it to use a timer+signal directly.

In order to prevent excessive signals, the timer is set to the report threshold,
rather than the task quota. This means that almost every delivered signal
corresponds to a quota violation (the exceptions are when we've stopped running tasks
by the time the signal was delivered).

I don't understand the above. If we set the timer to 10ms (or really any value above the task quota),

and the timer triggered, how is that not a violation - regardless of whether or not we stopped eventually?

To avoid excessive calls to timer_settime(), we allow ourselves some slack in the
timing. For a threshold of 10ms we'll set the timer to 13ms and rearm it after 3ms
have gone by. So when we enter task execution, we have between 10ms and 13ms on the
clock.

I am also failing a bit to see the motivation for this.

---
src/core/stall_detector.hh | 29 ++++++----
src/core/reactor.cc | 107 +++++++++++++++++++++++--------------
2 files changed, 85 insertions(+), 51 deletions(-)

diff --git a/src/core/stall_detector.hh b/src/core/stall_detector.hh
index 194a9e0733..41d57ab5cd 100644
--- a/src/core/stall_detector.hh
+++ b/src/core/stall_detector.hh
@@ -23,50 +23,59 @@
#pragma once

#include <signal.h>
#include <limits>
#include <chrono>
+#include <seastar/core/posix.hh>

namespace seastar {

class reactor;

namespace internal {

struct cpu_stall_detector_config {
std::chrono::duration<double> threshold = std::chrono::seconds(2);
unsigned stall_detector_reports_per_minute = 1;
+ float slack = 0.3; // fraction of threshold that we're allowed to overshoot

We shouldn't have variables called "slack". It is insensitive, some of us have PTSD on the sight of the word.

};

--
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To post to this group, send email to seast...@googlegroups.com.
Visit this group at https://groups.google.com/group/seastar-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/20181122173009.18506-6-avi%40scylladb.com.
For more options, visit https://groups.google.com/d/optout.

Glauber Costa

<glauber@scylladb.com>

unread,

Nov 26, 2018, 3:02:00 PM11/26/18

to Avi Kivity, seastar-dev@googlegroups.com

On Thu, Nov 22, 2018 at 12:30 PM Avi Kivity <a...@scylladb.com> wrote:

Using the thread cputime clock prevents false positives due to an overcommitted
processor, and this improves the quality of the reports.

It could also generate false negatives if system calls block (since the cputime
clock is stopped when a thread switches contexts voluntarily), but these events
are rarer these days and are better investigated using the other latency detector.

This is a bit out of of scope for this series, but the other latency detector (I am assuming you are talking about the rusage one)

is a bit opaque these days.

It's hard to understand how much time it takes and whether or not we are having problems coming from it.

For a while now we have been exporting the total time spent in violation as a prometheus metric but we don't do this for things that block in pollers - and they do happen, unfortunately.

---
src/core/reactor.cc | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index 1fcfe14ffd..2d12315fb1 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -627,11 +627,11 @@ cpu_stall_detector::cpu_stall_detector(reactor* r, cpu_stall_detector_config cfg
update_config(cfg);
struct sigevent sev = {};
sev.sigev_notify = SIGEV_THREAD_ID;
sev.sigev_signo = signal_number();
sev._sigev_un._tid = syscall(SYS_gettid);
- int err = timer_create(CLOCK_MONOTONIC, &sev, &_timer);
+ int err = timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &_timer);
if (err) {
throw std::system_error(std::error_code(err, std::system_category()));
}
_timer_initialized = true;

--
2.19.1

--
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To post to this group, send email to seast...@googlegroups.com.
Visit this group at https://groups.google.com/group/seastar-dev.

To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/20181122173009.18506-7-avi%40scylladb.com.

Glauber Costa

<glauber@scylladb.com>

unread,

Nov 26, 2018, 3:19:27 PM11/26/18

to Avi Kivity, seastar-dev@googlegroups.com

FWIW, I played with this test a bit

I added a new test trying to make sure we are catching stalls as low as 1ms (maybe we should add that pre-merge),

and also added a new test in which I ran a CPU hog in the same CPU as the test to make sure the normal case test wasn't

misfiring.

They all worked well, and aside from a couple of things I didn't quite understand in one of the patches I am very happy with this.

On Thu, Nov 22, 2018 at 12:30 PM Avi Kivity <a...@scylladb.com> wrote:

--
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To post to this group, send email to seast...@googlegroups.com.
Visit this group at https://groups.google.com/group/seastar-dev.

To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/20181122173009.18506-10-avi%40scylladb.com.

Avi Kivity

<avi@scylladb.com>

unread,

Nov 27, 2018, 2:55:11 AM11/27/18

to Glauber Costa, seastar-dev@googlegroups.com

On 11/26/18 9:59 PM, Glauber Costa wrote:

On Thu, Nov 22, 2018 at 12:30 PM Avi Kivity <a...@scylladb.com> wrote:

The reactor will soon gain an aio poll mode, which does not have a timer thread.
To avoid losing the stall detector, we migrate it to use a timer+signal directly.

In order to prevent excessive signals, the timer is set to the report threshold,
rather than the task quota. This means that almost every delivered signal
corresponds to a quota violation (the exceptions are when we've stopped running tasks
by the time the signal was delivered).

I don't understand the above. If we set the timer to 10ms (or really any value above the task quota),

and the timer triggered, how is that not a violation - regardless of whether or not we stopped eventually?

If the report threshold is 10ms and we set the timer to 10ms, we will have to re-arm the timer the timer every time we run tasks. Otherwise, some time will have been taken off the timer on the previous run.

Since we don't want a system call every time we run some tasks, we add in some slack and set the timer to 13ms. This means 3ms of runtime can pass until the timer has just 10ms and we need to rearm it, so we rearm at most 300 times per second (per shard).

You can't fool me, you're in love with that thing.

Avi Kivity

<avi@scylladb.com>

unread,

Nov 27, 2018, 2:58:17 AM11/27/18

to Glauber Costa, seastar-dev@googlegroups.com

On 11/26/18 10:01 PM, Glauber Costa wrote:

On Thu, Nov 22, 2018 at 12:30 PM Avi Kivity <a...@scylladb.com> wrote:

Using the thread cputime clock prevents false positives due to an overcommitted
processor, and this improves the quality of the reports.

It could also generate false negatives if system calls block (since the cputime
clock is stopped when a thread switches contexts voluntarily), but these events
are rarer these days and are better investigated using the other latency detector.

This is a bit out of of scope for this series, but the other latency detector (I am assuming you are talking about the rusage one)

is a bit opaque these days.

It's also a bit not-enabled-by-default. We should merge the two.

It's hard to understand how much time it takes and whether or not we are having problems coming from it.

For a while now we have been exporting the total time spent in violation as a prometheus metric but we don't do this for things that block in pollers - and they do happen, unfortunately.

We may also have CPU stalls in pollers, so we should extend the detector to cover that too.

Glauber Costa

<glauber@scylladb.com>

unread,

Nov 27, 2018, 8:38:57 AM11/27/18

to Avi Kivity, seastar-dev@googlegroups.com

On Tue, Nov 27, 2018 at 2:58 AM Avi Kivity <a...@scylladb.com> wrote:

On 11/26/18 10:01 PM, Glauber Costa wrote:

On Thu, Nov 22, 2018 at 12:30 PM Avi Kivity <a...@scylladb.com> wrote:

Using the thread cputime clock prevents false positives due to an overcommitted
processor, and this improves the quality of the reports.

It could also generate false negatives if system calls block (since the cputime
clock is stopped when a thread switches contexts voluntarily), but these events
are rarer these days and are better investigated using the other latency detector.

This is a bit out of of scope for this series, but the other latency detector (I am assuming you are talking about the rusage one)

is a bit opaque these days.

It's also a bit not-enabled-by-default. We should merge the two.

It's hard to understand how much time it takes and whether or not we are having problems coming from it.

For a while now we have been exporting the total time spent in violation as a prometheus metric but we don't do this for things that block in pollers - and they do happen, unfortunately.

We may also have CPU stalls in pollers, so we should extend the detector to cover that too.

Ok, we should have issue numbers.

Avi Kivity

<avi@scylladb.com>

unread,

Nov 29, 2018, 9:57:39 AM11/29/18

to seastar-dev@googlegroups.com

Ping? I don't think Glauber's comments were actionable.

Glauber Costa

<glauber@scylladb.com>

unread,

Nov 29, 2018, 11:27:37 AM11/29/18

to Avi Kivity, seastar-dev@googlegroups.com

On Thu, Nov 29, 2018 at 9:57 AM Avi Kivity <a...@scylladb.com> wrote:

Ping? I don't think Glauber's comments were actionable.

they were not indeed.

on the contrary, I tested it and it worked very well.

--
You received this message because you are subscribed to the Google Groups "seastar-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to seastar-dev...@googlegroups.com.
To post to this group, send email to seast...@googlegroups.com.
Visit this group at https://groups.google.com/group/seastar-dev.

To view this discussion on the web visit https://groups.google.com/d/msgid/seastar-dev/b1ac2793-544b-94e5-47fc-66d7aa16e556%40scylladb.com.

Duarte Nunes

<duarte@scylladb.com>

unread,

Nov 29, 2018, 6:14:47 PM11/29/18

to Avi Kivity, seastar-dev@googlegroups.com

I don't think _timer_initialized is needed, we can assume it is.

Avi Kivity

<avi@scylladb.com>

unread,

Dec 2, 2018, 11:45:27 AM12/2/18

to Duarte Nunes, seastar-dev@googlegroups.com

Right. I wanted something in case we have a failure after the timer is
initialized but before the constructor is complete but (1) in that case
we won't reach this destructor and (2) it can't happen because there's
nothing after timer initialization.

I'll replace it with a comment in the constructor warning the next
developer that defer() and cancel() might be needed next time.

>
>

Avi Kivity

<avi@scylladb.com>

unread,

Dec 3, 2018, 4:54:35 AM12/3/18

to seastar-dev@googlegroups.com

The long-awaited aio poll patches remove the timer thread; but the cpu
stall detector still depends on it. This patchset disentangles the stall
detector from the reactor (to some extent), and converts it to use a dedicated
kernel timer.

To avoid the need to re-arm the timer every task quota expiration, some slop
is introduced. Since the measurements don't need to be exact, I expect this
is fine.

Since we now have a dedicated timer, we also switch it to use a thread CPU timer.
This protects against preemption false positives, at the expense of kernel
blocks false negatives. Since the latter are rare, I think the tradeoff is
worthwhile. Kernel blocks also have their own detector.

Finally, a unit test is added.

Tests: unit (release)

https://github.com/avikivity/seastar stall-detector/v2

v2:
- remove unneeded _timer_initialized variable (patch 5)
- fix end_task_run (was a copy of start_task_run by mistake) (patch 5)

Avi Kivity (9):
reactor: extract cpu stall detector state into a class
reactor: fold block_notifier_rate_limit into cpu_stall_detector
reactor: move stall-detection code from timer thread to stall detector
tick function
reactor: introduce cpu_stall_detector::account_for_missed_ticks
reactor: change cpu stall detector not to depend on timer thread
reactor: use thread cputime clock for stall detection
reactor: add get_blocked_reactor_notify_ms()
reactor: add hook to capture reports of the cpu stall detector
tests: add unit test for cpu stall detector

include/seastar/core/reactor.hh | 11 +-
src/core/stall_detector.hh | 82 +++++++++++
src/core/reactor.cc | 235 +++++++++++++++++++-----------
tests/unit/stall_detector_test.cc | 82 +++++++++++
tests/unit/CMakeLists.txt | 3 +
5 files changed, 322 insertions(+), 91 deletions(-)

create mode 100644 src/core/stall_detector.hh
create mode 100644 tests/unit/stall_detector_test.cc

--

2.19.2

Avi Kivity

<avi@scylladb.com>

unread,

Dec 3, 2018, 4:54:37 AM12/3/18

to seastar-dev@googlegroups.com

As the first step of disentangling the cpu stall detector from the reactor,
move the internal state and the signal handler into a class. The reactor
still manages the state, but we'll change that in a follow-up.
---
include/seastar/core/reactor.hh | 7 ++--
src/core/stall_detector.hh | 58 +++++++++++++++++++++++++++
src/core/reactor.cc | 70 ++++++++++++++++++++++-----------

3 files changed, 109 insertions(+), 26 deletions(-)
create mode 100644 src/core/stall_detector.hh

diff --git a/include/seastar/core/reactor.hh b/include/seastar/core/reactor.hh
index 8f613be08d..442c8afe75 100644
--- a/include/seastar/core/reactor.hh
+++ b/include/seastar/core/reactor.hh
@@ -638,10 +638,11 @@ class io_queue {

};

namespace internal {

class reactor_stall_sampler;
+class cpu_stall_detector;

}

class io_desc;
class disk_config_params;

@@ -777,14 +778,11 @@ class reactor {

file_desc _task_quota_timer;
promise<> _start_promise;
semaphore _cpu_started;
std::atomic<uint64_t> _tasks_processed = { 0 };
std::atomic<uint64_t> _polls = { 0 };

- std::atomic<unsigned> _tasks_processed_stalled = { 0 };
- unsigned _tasks_processed_report_threshold;

- unsigned _stall_detector_reports_per_minute;
- std::atomic<uint64_t> _stall_detector_missed_ticks = { 0 };
+ std::unique_ptr<internal::cpu_stall_detector> _cpu_stall_detector;

unsigned _max_task_backlog = 1000;
timer_set<timer<>, &timer<>::_link> _timers;
timer_set<timer<>, &timer<>::_link>::timer_list_t _expired_timers;
timer_set<timer<lowres_clock>, &timer<lowres_clock>::_link> _lowres_timers;

@@ -909,10 +907,11 @@ class reactor {

};

signals _signals;
thread_pool _thread_pool;
friend class thread_pool;
+ friend class internal::cpu_stall_detector;

uint64_t pending_task_count() const;
void run_tasks(task_queue& tq);
bool have_more_tasks() const;
bool posix_reuseport_detect();
diff --git a/src/core/stall_detector.hh b/src/core/stall_detector.hh

new file mode 100644

index 0000000000..8f5006714d
--- /dev/null
+++ b/src/core/stall_detector.hh
@@ -0,0 +1,58 @@
+

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2018 ScyllaDB

diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index efd1a497d3..ca84d857a4 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc

inline Integral increment_nonatomically(std::atomic<Integral>& value) {

return add_nonatomically(value, Integral(1));
}

+cpu_stall_detector::cpu_stall_detector(reactor* r, cpu_stall_detector_config cfg)
+ : _r(r) {
+ update_config(cfg);
+}
+
+cpu_stall_detector_config
+cpu_stall_detector::get_config() const {
+ return _config;
+}
+
+void cpu_stall_detector::update_config(cpu_stall_detector_config cfg) {
+ _config = cfg;
+ auto tq = _r->_task_quota;
+ if (tq == 0s) {
+ tq = 1ms; // safe default, initialization order is wrong
+ }
+ _tasks_processed_report_threshold = cfg.threshold / tq;
+ _stall_detector_reports_per_minute = cfg.stall_detector_reports_per_minute;
+}
+

void
reactor::task_quota_timer_thread_fn() {
auto thread_name = seastar::format("timer-{}", _id);

pthread_setname_np(pthread_self(), thread_name.c_str());

@@ -637,11 +655,11 @@ reactor::task_quota_timer_thread_fn() {

if (r) {
seastar_logger.info("Thread {}: failed to block signals. Aborting.", thread_name.c_str());
abort();
}

- unsigned report_at = _tasks_processed_report_threshold;
+ unsigned report_at = _cpu_stall_detector->_tasks_processed_report_threshold;

uint64_t last_tasks_processed_seen = 0;
uint64_t last_polls_seen = 0;

class block_notifier_rate_limiter {
unsigned _reported = 0;
@@ -696,54 +714,60 @@ reactor::task_quota_timer_thread_fn() {
{

uint64_t events;
_task_quota_timer.read(&events, 8);
_local_need_preempt = true;
}

- block_notifier_rate_limiter rate_limit(unsigned(60s / _task_quota), _stall_detector_reports_per_minute, _id);

+ block_notifier_rate_limiter rate_limit(unsigned(60s / _task_quota), _cpu_stall_detector->_stall_detector_reports_per_minute, _id);
uint64_t saved_missed_ticks = 0;

while (!_dying.load(std::memory_order_relaxed)) {
uint64_t events;
_task_quota_timer.read(&events, 8);
_local_need_preempt = true;

- auto missed_ticks = _stall_detector_missed_ticks.load(std::memory_order_relaxed);

+ auto missed_ticks = _cpu_stall_detector->_stall_detector_missed_ticks.load(std::memory_order_relaxed);

rate_limit.tick(std::max(uint64_t(1), missed_ticks - saved_missed_ticks));

saved_missed_ticks = missed_ticks;

auto tp = _tasks_processed.load(std::memory_order_relaxed);

auto p = _polls.load(std::memory_order_relaxed);

if ((tp == last_tasks_processed_seen) && (p == last_polls_seen)) {

- if ((increment_nonatomically(_tasks_processed_stalled) == report_at)) {
- rate_limit.maybe_report(_thread_id, block_notifier_signal());

+ if ((increment_nonatomically(_cpu_stall_detector->_tasks_processed_stalled) == report_at)) {
+ rate_limit.maybe_report(_thread_id, cpu_stall_detector::signal_number());

report_at <<= 1;
}
} else {
last_tasks_processed_seen = tp;

last_polls_seen = p;
- _tasks_processed_stalled.store(0, std::memory_order_relaxed);

- report_at = _tasks_processed_report_threshold;
+ _cpu_stall_detector->_tasks_processed_stalled.store(0, std::memory_order_relaxed);
+ report_at = _cpu_stall_detector->_tasks_processed_report_threshold;
}

// We're in a different thread, but guaranteed to be on the same core, so even
// a signal fence is overdoing it
std::atomic_signal_fence(std::memory_order_seq_cst);
}
}

void

reactor::update_blocked_reactor_notify_ms(std::chrono::milliseconds ms) {

- unsigned threshold = ms / _task_quota;
- if (threshold != _tasks_processed_report_threshold) {
- _tasks_processed_report_threshold = threshold;

+ auto cfg = _cpu_stall_detector->get_config();

+ if (ms != cfg.threshold) {
+ cfg.threshold = ms;

+ _cpu_stall_detector->update_config(cfg);

seastar_logger.info("updated: blocked-reactor-notify-ms={}", ms.count());
}
}

void
reactor::block_notifier(int) {
- auto steps = engine()._tasks_processed_stalled.load(std::memory_order_relaxed);
- auto delta = std::chrono::duration_cast<std::chrono::milliseconds>(engine()._task_quota * steps);
+ engine()._cpu_stall_detector->generate_trace();
+}
+
+void
+cpu_stall_detector::generate_trace() {

+ auto steps = _tasks_processed_stalled.load(std::memory_order_relaxed);
+ auto delta = std::chrono::duration_cast<std::chrono::milliseconds>(_r->_task_quota * steps);

backtrace_buffer buf;
buf.append("Reactor stalled for ");

_task_quota_timer.timerfd_settime(0, zero_itimerspec);
auto start_sleep = sched_clock::now();

sleep();
// We may have slept for a while, so freshen idle_end
idle_end = sched_clock::now();

- add_nonatomically(_stall_detector_missed_ticks, uint64_t((idle_end - start_sleep)/_task_quota));

+ add_nonatomically(_cpu_stall_detector->_stall_detector_missed_ticks, uint64_t((idle_end - start_sleep)/_task_quota));

_total_sleep += idle_end - start_sleep;
_task_quota_timer.timerfd_settime(0, task_quote_itimerspec);
}
} else {
// We previously ran pure_check_for_work(), might not actually have performed

--
2.19.2

Avi Kivity

<avi@scylladb.com>

unread,

Dec 3, 2018, 4:54:38 AM12/3/18

to seastar-dev@googlegroups.com

They have a 1:1 relationship, and are part of the same task, so merge them.
---
src/core/stall_detector.hh | 9 ++++
src/core/reactor.cc | 90 +++++++++++++++++---------------------

2 files changed, 48 insertions(+), 51 deletions(-)

diff --git a/src/core/stall_detector.hh b/src/core/stall_detector.hh
index 8f5006714d..462f0bff49 100644
--- a/src/core/stall_detector.hh
+++ b/src/core/stall_detector.hh

@@ -42,17 +42,26 @@ class cpu_stall_detector {
reactor* _r;

std::atomic<unsigned> _tasks_processed_stalled = { 0 };

unsigned _tasks_processed_report_threshold;
unsigned _stall_detector_reports_per_minute;

std::atomic<uint64_t> _stall_detector_missed_ticks = { 0 };

+ unsigned _reported = 0;
+ unsigned _ticks = 0;
+ unsigned _ticks_per_minute;
+ unsigned _max_reports_per_minute;
+ unsigned _shard_id;
+ unsigned _thread_id;
cpu_stall_detector_config _config;
friend reactor;
+private:

+ void maybe_report(pthread_t who, int sig);
public:

cpu_stall_detector(reactor* r, cpu_stall_detector_config cfg = {});

static int signal_number() { return SIGRTMIN + 1; }

void generate_trace();
void update_config(cpu_stall_detector_config cfg);
cpu_stall_detector_config get_config() const;

+ void tick(unsigned ticks = 1);
};

}
}
diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index ca84d857a4..21af317038 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -620,11 +620,12 @@ template <typename Integral>

inline Integral increment_nonatomically(std::atomic<Integral>& value) {
return add_nonatomically(value, Integral(1));
}

cpu_stall_detector::cpu_stall_detector(reactor* r, cpu_stall_detector_config cfg)

- : _r(r) {
+ : _r(r)
+ , _shard_id(_r->cpu_id()) {
update_config(cfg);
}

cpu_stall_detector_config
cpu_stall_detector::get_config() const {
@@ -637,10 +638,45 @@ void cpu_stall_detector::update_config(cpu_stall_detector_config cfg) {
if (tq == 0s) {

tq = 1ms; // safe default, initialization order is wrong
}

_tasks_processed_report_threshold = cfg.threshold / tq;

+ }
}

void
reactor::task_quota_timer_thread_fn() {
auto thread_name = seastar::format("timer-{}", _id);

@@ -659,82 +695,34 @@ reactor::task_quota_timer_thread_fn() {

unsigned report_at = _cpu_stall_detector->_tasks_processed_report_threshold;
uint64_t last_tasks_processed_seen = 0;
uint64_t last_polls_seen = 0;

- class block_notifier_rate_limiter {
- unsigned _reported = 0;

- unsigned _ticks = 0;
- unsigned _ticks_per_minute;

- unsigned _max_reports_per_minute;
- unsigned _shard_id;
- unsigned _thread_id;
- public:

- void maybe_report(pthread_t who, int sig) {
- if (_reported++ < _max_reports_per_minute) {
- pthread_kill(who, sig);
- }
- }
- // We use a tick at every timer firing so we can report suppressed backtraces.
- // Best case it's a correctly predicted branch. If a backtrace had happened in
- // the near past it's an increment and two branches.
- //
- // We can do it a cheaper if we don't report suppressed backtraces.
- void tick(unsigned ticks = 1) {

- if (!_reported) {
- return;
- }
- _ticks += ticks;
- if (_ticks >= _ticks_per_minute) {
- if (_reported > _max_reports_per_minute) {

- auto supressed = _reported - _max_reports_per_minute;
- backtrace_buffer buf;
- // Reuse backtrace buffer infrastructure so we don't have to allocate here
- buf.append("Rate-limit: supressed ");
- buf.append_decimal(_reported - _max_reports_per_minute);
- supressed == 1 ? buf.append(" backtrace") : buf.append(" backtraces");
- buf.append(" on shard ");
- buf.append_decimal(_shard_id);
- buf.append("\n");
- buf.flush();
- }

- _reported = 0;
- _ticks = 0;

- }
- }
- block_notifier_rate_limiter(unsigned ticks_per_minute, unsigned max_reports, unsigned shard_id)
- : _ticks_per_minute(ticks_per_minute)
- , _max_reports_per_minute(max_reports)
- , _shard_id(shard_id)
- {}
- };

-
// We need to wait until task quota is set before we can calculate how many ticks are to
// a minute. Technically task_quota is used from many threads, but since it is read-only here
// and only used during initialization we will avoid complicating the code.
{