[PATCH 00/10] Simplify io-queue configuration

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 17, 2022, 11:00:34 AM1/17/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

The rate-limiter scheduler requires 2 pairs of iops:bandiwdth counters
in configuration -- one for calculating the number of tokens a request
should require from token buckets, the other one to calculate how many
units a request contributes to shares accounting using accumulators.

Both pairs (represented as tickets) are in latency_goal[us] proportion,
i.e. by default the shares-capacity ticket is 750 larger than its cost
peer.

This set replaces the shares-capacity with the cost-capacity, so that
both -- token buckets and accumulators -- use the same units.

Another place where the shares-capacity config numbers are used is in
limiting the maximum request length. This place is simple to patch, by
coincidence code just re-uses the variable, so the new one is added
(spoilter: even thie new one will be reworked some time soon).

And finally there's a code that configures the queue from the legacy
--max-requests option. It's a BUG, the shares-capacity configured like
this doesn't work.

branch: https://github.com/xemul/seastar/tree/br-unify-fq-costs
tests: unit(dev), manual.rl-iosched(dev)

Pavel Emelyanov (10):
fair_queue: Move duration calculation into group
io_queue: Move max blocks count onto group
fair_queue: Merge cost/shares capacities
fair_queue: Treat accumulator as capacity
fair_queue: Use the fair_queue::capacity_t everywhere
seastar: Introduce --io-latency-goal option
fair_queue: Adjust duration/factor for smaller rates
io_queue: Check latency goal and warn the user
reactor: Tune up legacy configuration
io_queue: Remove max counts from config

include/seastar/core/fair_queue.hh | 51 ++++++---------------
include/seastar/core/io_queue.hh | 5 +-
include/seastar/core/reactor_config.hh | 4 ++
src/core/fair_queue.cc | 63 +++++++++++++++++++-------
src/core/io_queue.cc | 51 ++++++---------------
src/core/reactor.cc | 18 +++++---
tests/perf/fair_queue_perf.cc | 4 --
tests/unit/fair_queue_test.cc | 2 -
8 files changed, 94 insertions(+), 104 deletions(-)

--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 17, 2022, 11:00:36 AM1/17/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

This removes the need to export rate from fair_group's API. Also
there's a peer function on group that converts time to capacity.

Out-line the next_pending_aio() while patching one.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
include/seastar/core/fair_queue.hh | 32 +++++++-----------------------
src/core/fair_queue.cc | 20 +++++++++++++++++++
2 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
index 68f173a0..db9b9855 100644
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh
@@ -243,6 +243,12 @@ class fair_group {
using rate_resolution = std::chrono::duration<double, std::micro>;
static constexpr float fixed_point_factor = float(1 << 14);

+ // Estimated time to process the given amount of capacity
+ // (peer of accumulated_capacity() helper)
+ rate_resolution capacity_duration(capacity_t cap) const noexcept {
+ return rate_resolution(cap / _replenish_rate);
+ }
+
struct config {
sstring label = "";
unsigned max_weight;
@@ -269,7 +275,6 @@ class fair_group {

capacity_t capacity_deficiency(capacity_t from) const noexcept;
capacity_t ticket_capacity(fair_queue_ticket ticket) const noexcept;
- capacity_t rate() const noexcept { return _replenish_rate; }
};

/// \brief Fair queuing class
@@ -349,12 +354,6 @@ class fair_queue {
void push_priority_class_from_idle(priority_class_data& pc);
void pop_priority_class(priority_class_data& pc);

- // Estimated time to process the given ticket
- std::chrono::microseconds duration(fair_group::capacity_t desc) const noexcept {
- auto duration_ms = fair_group::rate_resolution(desc / _group.rate());
- return std::chrono::duration_cast<std::chrono::microseconds>(duration_ms);
- }
-
bool grab_capacity(const fair_queue_entry& ent) noexcept;
bool grab_pending_capacity(const fair_queue_entry& ent) noexcept;
public:
@@ -405,24 +404,7 @@ class fair_queue {
/// Try to execute new requests if there is capacity left in the queue.
void dispatch_requests(std::function<void(fair_queue_entry&)> cb);

- clock_type::time_point next_pending_aio() const noexcept {
- if (_pending) {
- /*
- * We expect the disk to release the ticket within some time,
- * but it's ... OK if it doesn't -- the pending wait still
- * needs the head rover value to be ahead of the needed value.
- *
- * It may happen that the capacity gets released before we think
- * it will, in this case we will wait for the full value again,
- * which's sub-optimal. The expectation is that we think disk
- * works faster, than it really does.
- */
- fair_group::capacity_t over = _group.capacity_deficiency(_pending->head);
- return std::chrono::steady_clock::now() + duration(over);
- }
-
- return std::chrono::steady_clock::time_point::max();
- }
+ clock_type::time_point next_pending_aio() const noexcept;
};
/// @}

diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
index 8d5aff19..3c289c9a 100644
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc
@@ -327,6 +327,26 @@ void fair_queue::notify_request_cancelled(fair_queue_entry& ent) noexcept {
ent._ticket = fair_queue_ticket();
}

+fair_queue::clock_type::time_point fair_queue::next_pending_aio() const noexcept {
+ if (_pending) {
+ /*
+ * We expect the disk to release the ticket within some time,
+ * but it's ... OK if it doesn't -- the pending wait still
+ * needs the head rover value to be ahead of the needed value.
+ *
+ * It may happen that the capacity gets released before we think
+ * it will, in this case we will wait for the full value again,
+ * which's sub-optimal. The expectation is that we think disk
+ * works faster, than it really does.
+ */
+ auto over = _group.capacity_deficiency(_pending->head);
+ auto ticks = _group.capacity_duration(over);
+ return std::chrono::steady_clock::now() + std::chrono::duration_cast<std::chrono::microseconds>(ticks);
+ }
+
+ return std::chrono::steady_clock::time_point::max();
+}
+
void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {
fair_group::capacity_t dispatched = 0;

--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 17, 2022, 11:00:37 AM1/17/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

This value is used to calculate request accumulator cost and to check if
the request length is too large. The former usage is about to be replaced
with rate-limter token cost, but it's still needed to know the maximum
length of the request, so keep the limit as explicitly separate param.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/io_queue.hh | 3 ++-
src/core/io_queue.cc | 14 ++++++--------
2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/seastar/core/io_queue.hh b/include/seastar/core/io_queue.hh
index 08116a37..6158742f 100644
--- a/include/seastar/core/io_queue.hh
+++ b/include/seastar/core/io_queue.hh
@@ -99,7 +99,7 @@ class io_queue {
dev_t devid;
unsigned capacity = std::numeric_limits<unsigned>::max();
unsigned max_req_count = std::numeric_limits<int>::max();
- mutable unsigned max_blocks_count = std::numeric_limits<int>::max();
+ unsigned max_blocks_count = std::numeric_limits<int>::max();
unsigned long req_count_rate = std::numeric_limits<int>::max();
unsigned long blocks_count_rate = std::numeric_limits<int>::max();
unsigned disk_req_write_to_read_multiplier = read_request_base_count;
@@ -172,6 +172,7 @@ class io_group {
friend class ::io_queue_for_tests;

const io_queue::config _config;
+ unsigned max_ticket_size = std::numeric_limits<int>::max();
std::vector<std::unique_ptr<fair_group>> _fgs;

static fair_group::config make_fair_group_config(const io_queue::config& qcfg) noexcept;
diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
index 81edca08..62a949ea 100644
--- a/src/core/io_queue.cc
+++ b/src/core/io_queue.cc
@@ -394,9 +394,7 @@ io_group::io_group(io_queue::config io_cfg) noexcept

/*
* The maximum request size shouldn't result in the capacity that would
- * be larger than the group's replenisher limit. It's not the same as the
- * max_blocks_count which is now used purely for shares accounting and no
- * longer has anything to do with replenisher.
+ * be larger than the group's replenisher limit.
*
* To get the correct value check the 2^N sizes and find the largest one
* with little enough capacity. Actually (FIXME) requests should calculate
@@ -430,7 +428,7 @@ io_group::io_group(io_queue::config io_cfg) noexcept

update_max_size(internal::io_direction_and_length::write_idx);
update_max_size(internal::io_direction_and_length::read_idx);
- _config.max_blocks_count = max_size;
+ max_ticket_size = max_size;

seastar_logger.info("Created io group, length limit {}:{}, rate {}:{}",
(max_size / io_queue::read_request_base_count) << io_queue::block_size_shift,
@@ -658,14 +656,14 @@ fair_queue_ticket io_queue::request_fq_ticket(internal::io_direction_and_length

static thread_local size_t oversize_warning_threshold = 0;

- if (size > get_config().max_blocks_count) {
+ if (size > _group->max_ticket_size) {
if (size > oversize_warning_threshold) {
oversize_warning_threshold = size;
io_log.warn("oversized request (length {}) submitted. "
"dazed and confuzed, trimming its weight from {} down to {}", dnl.length(),
- size, get_config().max_blocks_count);
+ size, _group->max_ticket_size);
}
- size = get_config().max_blocks_count;
+ size = _group->max_ticket_size;
}

return fair_queue_ticket(weight, size);
@@ -673,7 +671,7 @@ fair_queue_ticket io_queue::request_fq_ticket(internal::io_direction_and_length

io_queue::request_limits io_queue::get_request_limits() const noexcept {
request_limits l;
- size_t max_length = get_config().max_blocks_count << block_size_shift;
+ size_t max_length = _group->max_ticket_size << block_size_shift;
l.max_read = align_down<size_t>(std::min<size_t>(get_config().disk_read_saturation_length, max_length / read_request_base_count), 1 << block_size_shift);
l.max_write = align_down<size_t>(std::min<size_t>(get_config().disk_write_saturation_length, max_length / get_config().disk_blocks_write_to_read_multiplier), 1 << block_size_shift);
return l;
--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 17, 2022, 11:00:38 AM1/17/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

Currently there are two tickets on the fair-group used to normalize
per-request tickets -- the cost_capacity and the shares_capacity. The
former one is used to evaluate how much a request contributes to the
disk run-time capacity (a.k.a. the tokens in the buckets), while the
former one is used to calculate the accumulator -- a counter used to
balance between classes according to their shares. The shares ticket
is exactly latency_goal[us] times larger than the cost one.

It's better to remove the shares ticket and use the cost one in all
the calculations. Note, that the accumulator values instantly become
~1k times less, but for now it's fine -- the accumulator is (still)
a floating-point number.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/fair_queue.hh | 5 -----
src/core/fair_queue.cc | 11 +++++------
src/core/io_queue.cc | 3 ---
tests/perf/fair_queue_perf.cc | 4 ----
tests/unit/fair_queue_test.cc | 2 --
5 files changed, 5 insertions(+), 20 deletions(-)

diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
index db9b9855..7f310b46 100644
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh
@@ -172,8 +172,6 @@ class fair_group {
using fair_group_atomic_rover = std::atomic<capacity_t>;
static_assert(fair_group_atomic_rover::is_always_lock_free);

- const fair_queue_ticket _shares_capacity;
-
/*
* The dF/dt <= K limitation is managed by the modified token bucket
* algo where tokens are ticket.normalize(cost_capacity), the refill
@@ -251,8 +249,6 @@ class fair_group {

struct config {
sstring label = "";

- unsigned max_weight;
- unsigned max_size;
unsigned min_weight = 0;
unsigned min_size = 0;
unsigned long weight_rate;
@@ -264,7 +260,6 @@ class fair_group {
explicit fair_group(config cfg) noexcept;
fair_group(fair_group&&) = delete;

- fair_queue_ticket shares_capacity() const noexcept { return _shares_capacity; }
fair_queue_ticket cost_capacity() const noexcept { return _cost_capacity; }
capacity_t maximum_capacity() const noexcept { return _replenish_limit; }
capacity_t grab_capacity(capacity_t cap) noexcept;
diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
index 3c289c9a..cabbd6e3 100644
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc
@@ -90,8 +90,7 @@ uint64_t wrapping_difference(const uint64_t& a, const uint64_t& b) noexcept {
}

fair_group::fair_group(config cfg) noexcept
- : _shares_capacity(cfg.max_weight, cfg.max_size)
- , _cost_capacity(cfg.weight_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count(), cfg.size_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count())
+ : _cost_capacity(cfg.weight_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count(), cfg.size_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count())
, _replenish_rate(cfg.rate_factor * fixed_point_factor)
, _replenish_limit(_replenish_rate * std::chrono::duration_cast<rate_resolution>(cfg.rate_limit_duration).count())
, _replenish_threshold(std::max((capacity_t)1, ticket_capacity(fair_queue_ticket(cfg.min_weight, cfg.min_size))))
@@ -101,8 +100,8 @@ fair_group::fair_group(config cfg) noexcept
, _capacity_ceil(_replenish_limit)
{
assert(!wrapping_difference(_capacity_tail.load(std::memory_order_relaxed), _capacity_head.load(std::memory_order_relaxed)));
- seastar_logger.info("Created fair group {}, capacity shares {} rate {}, limit {}, rate {} (factor {}), threshold {}", cfg.label,
- _shares_capacity, _cost_capacity, _replenish_limit, _replenish_rate, cfg.rate_factor, _replenish_threshold);
+ seastar_logger.info("Created fair group {}, capacity rate {}, limit {}, rate {} (factor {}), threshold {}", cfg.label,
+ _cost_capacity, _replenish_limit, _replenish_rate, cfg.rate_factor, _replenish_threshold);
}

auto fair_group::grab_capacity(capacity_t cap) noexcept -> capacity_t {
@@ -214,7 +213,7 @@ void fair_queue::push_priority_class_from_idle(priority_class_data& pc) {
// duration. For this estimate how many capacity units can be
// accumulated with the current class shares per rate resulution
// and scale it up to tau.
- accumulator_t max_deviation = _group.cost_capacity().normalize(_group.shares_capacity()) / pc._shares * std::chrono::duration_cast<fair_group::rate_resolution>(_config.tau).count();
+ accumulator_t max_deviation = fair_group::fixed_point_factor / pc._shares * std::chrono::duration_cast<fair_group::rate_resolution>(_config.tau).count();
pc._accumulated = std::max(_last_accumulated - max_deviation, pc._accumulated);
_handles.push(&pc);
pc._queued = true;
@@ -371,7 +370,7 @@ void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {
_requests_executing++;
_requests_queued--;

- auto req_cost = req._ticket.normalize(_group.shares_capacity()) / h._shares;
+ auto req_cost = (accumulator_t)_group.ticket_capacity(req._ticket) / h._shares;
auto next_accumulated = h._accumulated + req_cost;
if (std::isinf(next_accumulated)) {
for (auto& pc : _priority_classes) {
diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
index 62a949ea..f5327f82 100644
--- a/src/core/io_queue.cc
+++ b/src/core/io_queue.cc
@@ -367,13 +367,10 @@ fair_group::config io_group::make_fair_group_config(const io_queue::config& qcfg
*/
if (max_req_count < max_req_count_min) {
seastar_logger.warn("The disk request rate is too low, configuring it to {}, but you may experience latency problems", max_req_count_min);
- max_req_count = max_req_count_min;
}

fair_group::config cfg;
cfg.label = fmt::format("io-queue-{}", qcfg.devid);
- cfg.max_weight = max_req_count;
- cfg.max_size = qcfg.max_blocks_count;
cfg.min_weight = std::min(io_queue::read_request_base_count, qcfg.disk_req_write_to_read_multiplier);
cfg.min_size = std::min(io_queue::read_request_base_count, qcfg.disk_blocks_write_to_read_multiplier);
cfg.weight_rate = qcfg.req_count_rate;
diff --git a/tests/perf/fair_queue_perf.cc b/tests/perf/fair_queue_perf.cc
index 054cdfe2..d22af74b 100644
--- a/tests/perf/fair_queue_perf.cc
+++ b/tests/perf/fair_queue_perf.cc
@@ -39,8 +39,6 @@ struct local_fq_and_class {

static fair_group::config fg_config() {
fair_group::config cfg;
- cfg.max_weight = 1;
- cfg.max_size = 1;
cfg.weight_rate = std::numeric_limits<int>::max();
cfg.size_rate = std::numeric_limits<int>::max();
return cfg;
@@ -81,8 +79,6 @@ struct perf_fair_queue {

static fair_group::config fg_config() {
fair_group::config cfg;
- cfg.max_weight = smp::count;
- cfg.max_size = smp::count;
cfg.weight_rate = std::numeric_limits<int>::max();
cfg.size_rate = std::numeric_limits<int>::max();
return cfg;
diff --git a/tests/unit/fair_queue_test.cc b/tests/unit/fair_queue_test.cc
index 7e7ef902..b2e65230 100644
--- a/tests/unit/fair_queue_test.cc
+++ b/tests/unit/fair_queue_test.cc
@@ -64,8 +64,6 @@ class test_env {

static fair_group::config fg_config(unsigned cap) {
fair_group::config cfg;
- cfg.max_weight = cap;
- cfg.max_size = std::numeric_limits<int>::max();
cfg.weight_rate = 1'000'000;
cfg.size_rate = std::numeric_limits<int>::max();
cfg.rate_limit_duration = std::chrono::microseconds(cap);
--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 17, 2022, 11:00:39 AM1/17/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

As was described in the previous patch -- capacity is the number of
tokens a request contributes to the bucket, while the accumulator is
the number of units a request contributes to its class. The latter
value is numerically equal to the former one so it's natural to use
these two interchangeably. All the more so the accumulated value on
a class represents the number of tokens consumed by this class in the
rate-limiting formula.

However, the accumulated values are the tokens divided by the number
of shares. For unconfigured queue the smallest "cost" of a request is
~1900 (the 4GB/s disk has this value ~50000), so large shares values
may produce something less than 1 and converting it to integer may
produce 0 which, in turn, means a class makes no progress and
monopolizes (locks) the queue. So make sure the division doesn't
produce 0 by chance.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/fair_queue.hh | 5 +++--
src/core/fair_queue.cc | 23 +++++++++++++++--------
2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
index 7f310b46..b0a685fa 100644
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh
@@ -304,7 +304,8 @@ class fair_queue {

using class_id = unsigned int;
class priority_class_data;
- using accumulator_t = double;
+ using capacity_t = fair_group::capacity_t;
+ using signed_capacity_t = std::make_signed<capacity_t>::type;

private:
using clock_type = std::chrono::steady_clock;
@@ -323,7 +324,7 @@ class fair_queue {
using prioq = std::priority_queue<priority_class_ptr, std::vector<priority_class_ptr>, class_compare>;
prioq _handles;
std::vector<std::unique_ptr<priority_class_data>> _priority_classes;
- accumulator_t _last_accumulated = 0;
+ capacity_t _last_accumulated = 0;

/*
* When the shared capacity os over the local queue delays
diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
index cabbd6e3..9ac71a63 100644
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc
@@ -158,7 +158,7 @@ auto fair_group::fetch_add(fair_group_atomic_rover& rover, capacity_t cap) noexc
class fair_queue::priority_class_data {
friend class fair_queue;
uint32_t _shares = 0;
- fair_queue::accumulator_t _accumulated = 0;
+ capacity_t _accumulated = 0;
fair_queue_entry::container_list_t _queue;
bool _queued = false;

@@ -213,8 +213,12 @@ void fair_queue::push_priority_class_from_idle(priority_class_data& pc) {

// duration. For this estimate how many capacity units can be
// accumulated with the current class shares per rate resulution
// and scale it up to tau.

- accumulator_t max_deviation = fair_group::fixed_point_factor / pc._shares * std::chrono::duration_cast<fair_group::rate_resolution>(_config.tau).count();
- pc._accumulated = std::max(_last_accumulated - max_deviation, pc._accumulated);
+ capacity_t max_deviation = fair_group::fixed_point_factor / pc._shares * std::chrono::duration_cast<fair_group::rate_resolution>(_config.tau).count();
+ // On start this deviation can go to negative values, so not to
+ // introduce extra if's for that short corner case, use signed
+ // arithmetics and make sure the _accumulated value doesn't grow
+ // over signed maximum (see overflow check below)
+ pc._accumulated = std::max<signed_capacity_t>(_last_accumulated - max_deviation, pc._accumulated);
_handles.push(&pc);
pc._queued = true;
}
@@ -370,9 +374,13 @@ void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {
_requests_executing++;
_requests_queued--;

- auto req_cost = (accumulator_t)_group.ticket_capacity(req._ticket) / h._shares;
- auto next_accumulated = h._accumulated + req_cost;
- if (std::isinf(next_accumulated)) {
+ // Usually the cost of request is tens to hundreeds of thousands. However, for
+ // unrestricted queue it can be as low as 2k. With large enough shares this
+ // has chances to be translated into zero cost which, in turn, will make the
+ // class show no progress and monopolize the queue.
+ auto req_cost = std::max(_group.ticket_capacity(req._ticket) / h._shares, (capacity_t)1);
+ // signed overflow check to make push_priority_class_from_idle math work
+ if (h._accumulated >= std::numeric_limits<signed_capacity_t>::max() - req_cost) {
for (auto& pc : _priority_classes) {
if (pc) {
if (pc->_queued) {
@@ -383,9 +391,8 @@ void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {
}
}
_last_accumulated = 0;
- next_accumulated = h._accumulated + req_cost;
}
- h._accumulated = next_accumulated;
+ h._accumulated += req_cost;

if (!h._queue.empty()) {
push_priority_class(h);
--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 17, 2022, 11:00:41 AM1/17/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

The previous patch introduced this type for its specific reason, now
the rest of the fair-queue code can benefit from it a bit.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/fair_queue.hh | 4 ++--
src/core/fair_queue.cc | 8 ++++----
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
index b0a685fa..a38e5049 100644
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh
@@ -338,10 +338,10 @@ class fair_queue {
* in the middle of the waiting
*/
struct pending {
- fair_group::capacity_t head;
+ capacity_t head;
fair_queue_ticket ticket;

- pending(fair_group::capacity_t t, fair_queue_ticket c) noexcept : head(t), ticket(c) {}
+ pending(capacity_t t, fair_queue_ticket c) noexcept : head(t), ticket(c) {}
};

std::optional<pending> _pending;
diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
index 9ac71a63..2df0057b 100644
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc
@@ -240,7 +240,7 @@ bool fair_queue::grab_pending_capacity(const fair_queue_entry& ent) noexcept {
if (ent._ticket == _pending->ticket) {
_pending.reset();
} else {
- fair_group::capacity_t cap = _group.ticket_capacity(ent._ticket);
+ capacity_t cap = _group.ticket_capacity(ent._ticket);
/*
* This branch is called when the fair queue decides to
* submit not the same request that entered it into the
@@ -259,8 +259,8 @@ bool fair_queue::grab_capacity(const fair_queue_entry& ent) noexcept {
return grab_pending_capacity(ent);
}

- fair_group::capacity_t cap = _group.ticket_capacity(ent._ticket);
- fair_group::capacity_t want_head = _group.grab_capacity(cap) + cap;
+ capacity_t cap = _group.ticket_capacity(ent._ticket);
+ capacity_t want_head = _group.grab_capacity(cap) + cap;
if (_group.capacity_deficiency(want_head)) {
_pending.emplace(want_head, ent._ticket);
return false;
@@ -351,7 +351,7 @@ fair_queue::clock_type::time_point fair_queue::next_pending_aio() const noexcept

}

void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {

- fair_group::capacity_t dispatched = 0;
+ capacity_t dispatched = 0;

while (!_handles.empty() && (dispatched < _group.maximum_capacity() / smp::count)) {
priority_class_data& h = *_handles.top();
--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 17, 2022, 11:00:43 AM1/17/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

The option controls the maximum IO latency the scheduler should aim at. If
not specified, the legacy task-quota-ms * 1.5 is used.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/reactor_config.hh | 4 ++++
src/core/reactor.cc | 7 ++++++-
2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/seastar/core/reactor_config.hh b/include/seastar/core/reactor_config.hh
index fa2d7360..90de473b 100644
--- a/include/seastar/core/reactor_config.hh
+++ b/include/seastar/core/reactor_config.hh
@@ -61,6 +61,10 @@ struct reactor_options : public program_options::option_group {
///
/// Default: 500.
program_options::value<double> task_quota_ms;
+ /// \brief Max time (ms) IO operations must take.
+ ///
+ /// Default: 1.5 * task_quota_ms value
+ program_options::value<double> io_latency_goal;
/// \brief Maximum number of task backlog to allow.
///
/// When the number of tasks grow above this, we stop polling (e.g. I/O)
diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index db607641..9bb446dd 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -3457,6 +3457,7 @@ reactor_options::reactor_options(program_options::option_group* parent_group)
, poll_aio(*this, "poll-aio", true,
"busy-poll for disk I/O (reduces latency and increases throughput)")
, task_quota_ms(*this, "task-quota-ms", 500, "Max time (ms) between polls")
+ , io_latency_goal(*this, "io-latency-goal", 0, "Max time (ms) io operations must take")
, max_task_backlog(*this, "max-task-backlog", 1000, "Maximum number of task backlog to allow; above this we ignore I/O")
, blocked_reactor_notify_ms(*this, "blocked-reactor-notify-ms", 200, "threshold in miliseconds over which the reactor is considered blocked if no progress is made")
, blocked_reactor_reports_per_minute(*this, "blocked-reactor-reports-per-minute", 5, "Maximum number of backtraces reported by stall detector per minute")
@@ -3697,7 +3698,11 @@ class disk_config_params {

void parse_config(const smp_options& smp_opts, const reactor_options& reactor_opts) {
seastar_logger.debug("smp::count: {}", smp::count);
- _latency_goal = std::chrono::duration_cast<std::chrono::duration<double>>(reactor_opts.task_quota_ms.get_value() * 1.5 * 1ms);
+ if (reactor_opts.io_latency_goal.defaulted()) {
+ _latency_goal = std::chrono::duration_cast<std::chrono::duration<double>>(reactor_opts.task_quota_ms.get_value() * 1.5 * 1ms);
+ } else {
+ _latency_goal = std::chrono::duration_cast<std::chrono::duration<double>>(reactor_opts.io_latency_goal.get_value() * 1ms);
+ }
seastar_logger.debug("latency_goal: {}", latency_goal().count());

if (smp_opts.max_io_requests) {
--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 17, 2022, 11:00:45 AM1/17/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

The fair queue math calculates the costs/rates with usec resolution.
This means that disks should have IOPS and bandwidth as large as to
produce non-zero costs each microsecond. This means at least 8k IOPS
(and 4MB/s) which can be a bit too much for cloud disks and HDDs.

Change the resolution to be milliseconds and adjust the float point
factor accordingly.

While at it -- make sure the cost capacity doesn't become zero for
slow disks.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/fair_queue.hh | 5 +++--
src/core/fair_queue.cc | 5 +++++
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
index a38e5049..b180465c 100644
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh
@@ -71,6 +71,7 @@ class fair_queue_ticket {
/// For a fair_queue ticket to be non-zero, at least one of its represented quantities need to
/// be non-zero
explicit operator bool() const noexcept;
+ bool is_non_zero() const noexcept;

friend std::ostream& operator<<(std::ostream& os, fair_queue_ticket t);

@@ -238,8 +239,8 @@ class fair_group {
* time period for which the speeds from F (in above formula) are taken.
*/

- using rate_resolution = std::chrono::duration<double, std::micro>;
- static constexpr float fixed_point_factor = float(1 << 14);
+ using rate_resolution = std::chrono::duration<double, std::milli>;
+ static constexpr float fixed_point_factor = float(1 << 24);

// Estimated time to process the given amount of capacity

// (peer of accumulated_capacity() helper)

diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
index 2df0057b..c0ad30a6 100644
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc
@@ -72,6 +72,10 @@ fair_queue_ticket::operator bool() const noexcept {
return (_weight > 0) || (_size > 0);
}

+bool fair_queue_ticket::is_non_zero() const noexcept {
+ return (_weight > 0) && (_size > 0);
+}
+
bool fair_queue_ticket::operator==(const fair_queue_ticket& o) const noexcept {
return _weight == o._weight && _size == o._size;
}
@@ -100,6 +104,7 @@ fair_group::fair_group(config cfg) noexcept

, _capacity_ceil(_replenish_limit)
{
assert(!wrapping_difference(_capacity_tail.load(std::memory_order_relaxed), _capacity_head.load(std::memory_order_relaxed)));

+ assert(_cost_capacity.is_non_zero());

seastar_logger.info("Created fair group {}, capacity rate {}, limit {}, rate {} (factor {}), threshold {}", cfg.label,
_cost_capacity, _replenish_limit, _replenish_rate, cfg.rate_factor, _replenish_threshold);
}

--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 17, 2022, 11:00:46 AM1/17/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

If the disk is too slow it may result that even the smallest request
cannot get into the disk. In the previous scheduler it was checked by
the max-counts values, but today the more relevant check is the rate
limiter replenishment limit value.

This check is performed in io_group construction when it tries to find
the maximum length of the requests. Now it's asserted that the minimal
request is at least 1k, but the better behavior is to ask the user to
increase the latency goal.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/io_queue.hh | 2 +-
src/core/io_queue.cc | 11 +++++++++--
2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/seastar/core/io_queue.hh b/include/seastar/core/io_queue.hh
index 6158742f..62574dc5 100644
--- a/include/seastar/core/io_queue.hh
+++ b/include/seastar/core/io_queue.hh
@@ -165,7 +165,7 @@ class io_queue {

class io_group {
public:
- explicit io_group(io_queue::config io_cfg) noexcept;
+ explicit io_group(io_queue::config io_cfg);

private:
friend class io_queue;
diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
index f5327f82..7f9aba12 100644
--- a/src/core/io_queue.cc
+++ b/src/core/io_queue.cc
@@ -380,7 +380,7 @@ fair_group::config io_group::make_fair_group_config(const io_queue::config& qcfg
return cfg;
}

-io_group::io_group(io_queue::config io_cfg) noexcept
+io_group::io_group(io_queue::config io_cfg)
: _config(std::move(io_cfg))
{
auto fg_cfg = make_fair_group_config(_config);
@@ -415,7 +415,14 @@ io_group::io_group(io_queue::config io_cfg) noexcept

auto cap = _fgs[g_idx]->ticket_capacity(fair_queue_ticket(weight, size));
if (cap > max_cap) {
- assert(shift > 0);
+ if (shift == 0) {
+ auto n_goal = std::chrono::duration_cast<std::chrono::duration<float, std::milli>>(_config.rate_limit_duration * cap / max_cap);
+ throw std::runtime_error(fmt::format("The disk cannot meet latency goal, the minimal value is {:.1f}ms", n_goal.count()));
+ }
+ if (shift < 5) {
+ seastar_logger.warn("The disk meets latency goal, but the maximum request size would be less than 16k");
+ }
+
max_size = std::min(max_size, prev_size);
break;
}
--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 17, 2022, 11:00:48 AM1/17/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

The --max-requests option is still not removed (~1y had passed since
deprecation), but the rate-limiter code doesn't take it into account.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

src/core/reactor.cc | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index 9bb446dd..cce625a4 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -3797,11 +3797,12 @@ class disk_config_params {
// For backwards compatibility
cfg.capacity = *_capacity;
// Legacy configuration when only concurrency is specified.
- unsigned max_req_count = std::min(*_capacity, reactor::max_aio_per_queue);
- cfg.max_req_count = io_queue::read_request_base_count * max_req_count;
+ unsigned max_req_count = std::min(*_capacity * 1000, reactor::max_aio_per_queue);
+ cfg.req_count_rate = io_queue::read_request_base_count * max_req_count;
// specify size in terms of 16kB IOPS.
- static_assert(reactor::max_aio_per_queue << (14 - io_queue::block_size_shift) <= std::numeric_limits<decltype(cfg.max_blocks_count)>::max() / io_queue::read_request_base_count);
- cfg.max_blocks_count = io_queue::read_request_base_count * (max_req_count << (14 - io_queue::block_size_shift));
+ static_assert(reactor::max_aio_per_queue << (14 - io_queue::block_size_shift) <= std::numeric_limits<decltype(cfg.blocks_count_rate)>::max() / io_queue::read_request_base_count);
+ cfg.blocks_count_rate = io_queue::read_request_base_count * (max_req_count << (14 - io_queue::block_size_shift));
+
}
return cfg;
}
--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 17, 2022, 11:00:50 AM1/17/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

Now all the usage of the max-counts is gone, the configuration can
be removed. A reminder -- these counts were used to

- evaluate accumulators (switched to cost ticket that uses rates)
- calculate maximum request length (added replenisher limit check)
- configure legacy --max-requests option (patched to use rates)

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/io_queue.hh | 2 --
src/core/io_queue.cc | 23 -----------------------
src/core/reactor.cc | 2 --
3 files changed, 27 deletions(-)

diff --git a/include/seastar/core/io_queue.hh b/include/seastar/core/io_queue.hh
index 62574dc5..2cc08d59 100644
--- a/include/seastar/core/io_queue.hh
+++ b/include/seastar/core/io_queue.hh
@@ -98,8 +98,6 @@ class io_queue {
struct config {

dev_t devid;
unsigned capacity = std::numeric_limits<unsigned>::max();

- unsigned max_req_count = std::numeric_limits<int>::max();
- unsigned max_blocks_count = std::numeric_limits<int>::max();
unsigned long req_count_rate = std::numeric_limits<int>::max();
unsigned long blocks_count_rate = std::numeric_limits<int>::max();
unsigned disk_req_write_to_read_multiplier = read_request_base_count;
diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
index 7f9aba12..a7a08768 100644
--- a/src/core/io_queue.cc
+++ b/src/core/io_queue.cc
@@ -346,29 +346,6 @@ io_queue::io_queue(io_group_ptr group, internal::io_sink& sink)
}

fair_group::config io_group::make_fair_group_config(const io_queue::config& qcfg) noexcept {
- /*
- * It doesn't make sense to configure requests limit higher than
- * it can be if the queue is full of minimal requests. At the same
- * time setting too large value increases the chances to overflow
- * the group rovers and lock-up the queue.
- *
- * The same is technically true for blocks limit, but the group
- * rovers are configured in blocks (ticket size shift), and this
- * already makes a good protection.
- */
- auto max_req_count = std::min(qcfg.max_req_count, qcfg.max_blocks_count);
- auto max_req_count_min = std::max(io_queue::read_request_base_count, qcfg.disk_req_write_to_read_multiplier);
- /*
- * Read requests weight read_request_base_count, writes weight
- * disk_req_write_to_read_multiplier. The fair queue limit must
- * be enough to pass the largest one through. The same is true
- * for request sizes, but that check is done run-time, see the
- * request_fq_ticket() method.
- */
- if (max_req_count < max_req_count_min) {
- seastar_logger.warn("The disk request rate is too low, configuring it to {}, but you may experience latency problems", max_req_count_min);
- }
-

fair_group::config cfg;
cfg.label = fmt::format("io-queue-{}", qcfg.devid);

cfg.min_weight = std::min(io_queue::read_request_base_count, qcfg.disk_req_write_to_read_multiplier);

diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index cce625a4..e3ca4348 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -3774,12 +3774,10 @@ class disk_config_params {

if (!_capacity) {
if (p.read_bytes_rate != std::numeric_limits<uint64_t>::max()) {
- cfg.max_blocks_count = (io_queue::read_request_base_count * per_io_group(p.read_bytes_rate * latency_goal().count(), nr_groups)) >> io_queue::block_size_shift;
cfg.blocks_count_rate = (io_queue::read_request_base_count * (unsigned long)per_io_group(p.read_bytes_rate, nr_groups)) >> io_queue::block_size_shift;
cfg.disk_blocks_write_to_read_multiplier = (io_queue::read_request_base_count * p.read_bytes_rate) / p.write_bytes_rate;
}
if (p.read_req_rate != std::numeric_limits<uint64_t>::max()) {
- cfg.max_req_count = io_queue::read_request_base_count * per_io_group(p.read_req_rate * latency_goal().count(), nr_groups);
cfg.req_count_rate = io_queue::read_request_base_count * (unsigned long)per_io_group(p.read_req_rate, nr_groups);
cfg.disk_req_write_to_read_multiplier = (io_queue::read_request_base_count * p.read_req_rate) / p.write_req_rate;
}
--
2.20.1

Avi Kivity

<avi@scylladb.com>

unread,

Jan 18, 2022, 9:32:10 AM1/18/22

to Pavel Emelyanov, seastar-dev@googlegroups.com

On 17/01/2022 18.00, Pavel Emelyanov wrote:
> The option controls the maximum IO latency the scheduler should aim at. If
> not specified, the legacy task-quota-ms * 1.5 is used.

In the future, I'd like iotune to measure iodepth=1 latency (e.g.
100usec for SSD and 10ms for HDD), and calculate the default latency
goal as max(5*basic_disk_latency, 1.5*task_quota_ms) so that HDDs can
have more than one request queued.

Avi Kivity

<avi@scylladb.com>

unread,

Jan 18, 2022, 9:35:14 AM1/18/22

to Pavel Emelyanov, seastar-dev@googlegroups.com

On 17/01/2022 18.00, Pavel Emelyanov wrote:

It's better to adjust the latency goal ourselves (and tell the user),
rather than tell the user to do it.

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 18, 2022, 10:01:32 AM1/18/22

to Avi Kivity, seastar-dev@googlegroups.com

User will most likely ignore this message, so what's the point?

BTW, I just thought that we can make the latency-goal be the part of the disk profile taken
from the io_properties, not the CLI option. Maybe it's better to go this route instead?

-- Pavel

Avi Kivity

<avi@scylladb.com>

unread,

Jan 18, 2022, 10:03:26 AM1/18/22

to Pavel Emelyanov, seastar-dev@googlegroups.com

How can they ignore a throw? The system won't start.

> BTW, I just thought that we can make the latency-goal be the part of
> the disk profile taken
> from the io_properties, not the CLI option. Maybe it's better to go
> this route instead?

I think the latency goal is not part of the disk itself. I think
io-properties.conf should be an objective description of the disk, and
application requirements can be added on top.

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 18, 2022, 10:09:08 AM1/18/22

to Avi Kivity, seastar-dev@googlegroups.com

If we auto-adjust the latency goal the system will start, so user will just ignore a message.

Avi Kivity

<avi@scylladb.com>

unread,

Jan 18, 2022, 10:11:28 AM1/18/22

to Pavel Emelyanov, seastar-dev@googlegroups.com

Well, what's the problem? the system will work. It's not like we could
deliver a lower latency goal with that disk.

i.e. we can't deliver 1ms latency with HDD, nothing the user can
configure can do that. Why bother the user then?

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 18, 2022, 11:37:42 AM1/18/22

to Avi Kivity, seastar-dev@googlegroups.com

The problem is that with SSDs having default 0.75ms IO goal is good choice from both
sides -- we have this IO latency for reads AND can afford 64k/128k requests to have good
throughput for compaction. So everybody is happy. If the disk cannot provide this latency
then someone has to make a choice between smaller read latency vs larger compaction/flush
throughput. So what should this auto-adjustment logic aim at?

-- Pavel

Avi Kivity

<avi@scylladb.com>

unread,

Jan 18, 2022, 11:54:42 AM1/18/22

to Pavel Emelyanov, seastar-dev@googlegroups.com

basic_disk_latency * 5. There's no point in anything below
basic_disk_latency, the disk won't be able to achieve it. At
~basic_disk_latency, the head will move each time between random read
and sequential write (assuming 50:50 shares), so both bandwidth and iops
suffer. At 5*basic_disk_latency we do 5 seeks, then 1 seek + some
sequential writes, then 5 seeks, etc.

Of course, if the disk is really a RAID array (which will be common)
we'll stuff 5*nr_array_member seeks or the equivalent in sequential
writes for the same goal. But it's the same calculation.

At 50*basic_disk_latency, we're even more efficient (less time moving
away from sequential writes), but it's diminishing returns.

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 18, 2022, 12:10:25 PM1/18/22

to Avi Kivity, seastar-dev@googlegroups.com

Agree. But we don't have the basic_disk_latency yet. Also, even when we have one, if
seastar will start without this parameter it will have to abort gracefully with some
meaningful message (e.g. to re-run iotune), not the "assertion 'shift > 0' failed" one.

-- Pavel

Avi Kivity

<avi@scylladb.com>

unread,

Jan 18, 2022, 12:13:29 PM1/18/22

to Pavel Emelyanov, seastar-dev@googlegroups.com

Sure, but if we can adjust latency_goal so that Seastar starts, even if
it isn't the optimal setting, it's better, no? After all it started
before we changed the I/O scheduler, with suboptimal latency goal.

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 20, 2022, 12:21:09 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

The rate-limiter scheduler requires 2 pairs of iops:bandiwdth counters
in configuration -- one for calculating the number of tokens a request
should require from token buckets, the other one to calculate how many
units a request contributes to shares accounting using accumulators.

Both pairs (represented as tickets) are in latency_goal[us] proportion,
i.e. by default the shares-capacity ticket is 750 larger than its cost
peer.

This set replaces the shares-capacity with the cost-capacity, so that
both -- token buckets and accumulators -- use the same units.

Another place where the shares-capacity config numbers are used is in
limiting the maximum request length. This place is simple to patch, by
coincidence code just re-uses the variable, so the new one is added
(spoilter: even thie new one will be reworked some time soon).

And finally there's a code that configures the queue from the legacy
--max-requests option. It's a BUG, the shares-capacity configured like
this doesn't work.

branch: https://github.com/xemul/seastar/tree/br-unify-fq-costs-2
tests: unit(dev), manual.rl-iosched(dev)

v2:
- renamed --io-latency-goal option into --io-latency-goal-ms
- auto increase the given latency goal if it's too small
This ^^ is rather a big rework of the patch #8 as the io-properties
data gets heavily converted while its gets passed towards fair queue,
and the place where it's finally known whether a request "fits" the
latency goal is not earlier than during fair_group construction. So
for the reactor.cc it's almost impossible to know if the configured
latency goal is good enough without making similar conversions, and
it's not nice to duplicate all this math. Hence some not extremely
obvious manipulations on the fair-queue level.

Pavel Emelyanov (10):
fair_queue: Move duration calculation into group
io_queue: Move max blocks count onto group
fair_queue: Merge cost/shares capacities
fair_queue: Treat accumulator as capacity
fair_queue: Use the fair_queue::capacity_t everywhere
seastar: Introduce --io-latency-goal option
fair_queue: Adjust duration/factor for smaller rates

io_queue: Auto-increase the io-latency goal

reactor: Tune up legacy configuration
io_queue: Remove max counts from config

tests/perf/fair_queue_perf.cc | 4 --
tests/unit/fair_queue_test.cc | 2 -

8 files changed, 102 insertions(+), 103 deletions(-)

--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 20, 2022, 12:21:10 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

This value is used to calculate request accumulator cost and to check if
the request length is too large. The former usage is about to be replaced
with rate-limter token cost, but it's still needed to know the maximum
length of the request, so keep the limit as explicitly separate param.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/io_queue.hh | 3 ++-
src/core/io_queue.cc | 14 ++++++--------

2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/seastar/core/io_queue.hh b/include/seastar/core/io_queue.hh
index 08116a37..6158742f 100644
--- a/include/seastar/core/io_queue.hh
+++ b/include/seastar/core/io_queue.hh

@@ -99,7 +99,7 @@ class io_queue {

dev_t devid;
unsigned capacity = std::numeric_limits<unsigned>::max();

unsigned max_req_count = std::numeric_limits<int>::max();

- mutable unsigned max_blocks_count = std::numeric_limits<int>::max();
+ unsigned max_blocks_count = std::numeric_limits<int>::max();

unsigned long req_count_rate = std::numeric_limits<int>::max();
unsigned long blocks_count_rate = std::numeric_limits<int>::max();
unsigned disk_req_write_to_read_multiplier = read_request_base_count;

@@ -172,6 +172,7 @@ class io_group {
friend class ::io_queue_for_tests;

const io_queue::config _config;
+ unsigned max_ticket_size = std::numeric_limits<int>::max();
std::vector<std::unique_ptr<fair_group>> _fgs;

static fair_group::config make_fair_group_config(const io_queue::config& qcfg) noexcept;
diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
index 81edca08..62a949ea 100644
--- a/src/core/io_queue.cc
+++ b/src/core/io_queue.cc

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 20, 2022, 12:21:10 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

This removes the need to export rate from fair_group's API. Also
there's a peer function on group that converts time to capacity.

Out-line the next_pending_aio() while patching one.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/fair_queue.hh | 32 +++++++-----------------------
src/core/fair_queue.cc | 20 +++++++++++++++++++

2 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
index 68f173a0..db9b9855 100644
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh

@@ -243,6 +243,12 @@ class fair_group {

using rate_resolution = std::chrono::duration<double, std::micro>;

static constexpr float fixed_point_factor = float(1 << 14);

+ // Estimated time to process the given amount of capacity
+ // (peer of accumulated_capacity() helper)
+ rate_resolution capacity_duration(capacity_t cap) const noexcept {
+ return rate_resolution(cap / _replenish_rate);
+ }
+

struct config {
sstring label = "";

diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
index 8d5aff19..3c289c9a 100644
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc

@@ -327,6 +327,26 @@ void fair_queue::notify_request_cancelled(fair_queue_entry& ent) noexcept {
ent._ticket = fair_queue_ticket();
}

+fair_queue::clock_type::time_point fair_queue::next_pending_aio() const noexcept {
+ if (_pending) {
+ /*
+ * We expect the disk to release the ticket within some time,
+ * but it's ... OK if it doesn't -- the pending wait still
+ * needs the head rover value to be ahead of the needed value.
+ *
+ * It may happen that the capacity gets released before we think
+ * it will, in this case we will wait for the full value again,
+ * which's sub-optimal. The expectation is that we think disk
+ * works faster, than it really does.
+ */
+ auto over = _group.capacity_deficiency(_pending->head);
+ auto ticks = _group.capacity_duration(over);
+ return std::chrono::steady_clock::now() + std::chrono::duration_cast<std::chrono::microseconds>(ticks);
+ }
+
+ return std::chrono::steady_clock::time_point::max();
+}
+

void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {

fair_group::capacity_t dispatched = 0;

--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 20, 2022, 12:21:12 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

As was described in the previous patch -- capacity is the number of
tokens a request contributes to the bucket, while the accumulator is
the number of units a request contributes to its class. The latter
value is numerically equal to the former one so it's natural to use
these two interchangeably. All the more so the accumulated value on
a class represents the number of tokens consumed by this class in the
rate-limiting formula.

However, the accumulated values are the tokens divided by the number
of shares. For unconfigured queue the smallest "cost" of a request is
~1900 (the 4GB/s disk has this value ~50000), so large shares values
may produce something less than 1 and converting it to integer may
produce 0 which, in turn, means a class makes no progress and
monopolizes (locks) the queue. So make sure the division doesn't
produce 0 by chance.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/fair_queue.hh | 5 +++--
src/core/fair_queue.cc | 23 +++++++++++++++--------

2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
index 7f310b46..b0a685fa 100644
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh

@@ -304,7 +304,8 @@ class fair_queue {

using class_id = unsigned int;
class priority_class_data;
- using accumulator_t = double;
+ using capacity_t = fair_group::capacity_t;
+ using signed_capacity_t = std::make_signed<capacity_t>::type;

private:
using clock_type = std::chrono::steady_clock;
@@ -323,7 +324,7 @@ class fair_queue {
using prioq = std::priority_queue<priority_class_ptr, std::vector<priority_class_ptr>, class_compare>;
prioq _handles;
std::vector<std::unique_ptr<priority_class_data>> _priority_classes;
- accumulator_t _last_accumulated = 0;
+ capacity_t _last_accumulated = 0;

/*
* When the shared capacity os over the local queue delays

diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
index cabbd6e3..9ac71a63 100644
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 20, 2022, 12:21:12 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

Currently there are two tickets on the fair-group used to normalize
per-request tickets -- the cost_capacity and the shares_capacity. The
former one is used to evaluate how much a request contributes to the
disk run-time capacity (a.k.a. the tokens in the buckets), while the
former one is used to calculate the accumulator -- a counter used to
balance between classes according to their shares. The shares ticket
is exactly latency_goal[us] times larger than the cost one.

It's better to remove the shares ticket and use the cost one in all
the calculations. Note, that the accumulator values instantly become
~1k times less, but for now it's fine -- the accumulator is (still)
a floating-point number.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

5 files changed, 5 insertions(+), 20 deletions(-)

diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
index db9b9855..7f310b46 100644
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh

@@ -172,8 +172,6 @@ class fair_group {
using fair_group_atomic_rover = std::atomic<capacity_t>;
static_assert(fair_group_atomic_rover::is_always_lock_free);

- const fair_queue_ticket _shares_capacity;
-
/*
* The dF/dt <= K limitation is managed by the modified token bucket
* algo where tokens are ticket.normalize(cost_capacity), the refill
@@ -251,8 +249,6 @@ class fair_group {

struct config {
sstring label = "";

- unsigned max_weight;
- unsigned max_size;
unsigned min_weight = 0;
unsigned min_size = 0;
unsigned long weight_rate;
@@ -264,7 +260,6 @@ class fair_group {
explicit fair_group(config cfg) noexcept;
fair_group(fair_group&&) = delete;

- fair_queue_ticket shares_capacity() const noexcept { return _shares_capacity; }
fair_queue_ticket cost_capacity() const noexcept { return _cost_capacity; }
capacity_t maximum_capacity() const noexcept { return _replenish_limit; }
capacity_t grab_capacity(capacity_t cap) noexcept;

diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
index 3c289c9a..cabbd6e3 100644
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc

@@ -90,8 +90,7 @@ uint64_t wrapping_difference(const uint64_t& a, const uint64_t& b) noexcept {
}

fair_group::fair_group(config cfg) noexcept
- : _shares_capacity(cfg.max_weight, cfg.max_size)
- , _cost_capacity(cfg.weight_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count(), cfg.size_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count())
+ : _cost_capacity(cfg.weight_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count(), cfg.size_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count())
, _replenish_rate(cfg.rate_factor * fixed_point_factor)
, _replenish_limit(_replenish_rate * std::chrono::duration_cast<rate_resolution>(cfg.rate_limit_duration).count())
, _replenish_threshold(std::max((capacity_t)1, ticket_capacity(fair_queue_ticket(cfg.min_weight, cfg.min_size))))

@@ -101,8 +100,8 @@ fair_group::fair_group(config cfg) noexcept

, _capacity_ceil(_replenish_limit)
{
assert(!wrapping_difference(_capacity_tail.load(std::memory_order_relaxed), _capacity_head.load(std::memory_order_relaxed)));

- seastar_logger.info("Created fair group {}, capacity shares {} rate {}, limit {}, rate {} (factor {}), threshold {}", cfg.label,
- _shares_capacity, _cost_capacity, _replenish_limit, _replenish_rate, cfg.rate_factor, _replenish_threshold);
+ seastar_logger.info("Created fair group {}, capacity rate {}, limit {}, rate {} (factor {}), threshold {}", cfg.label,
+ _cost_capacity, _replenish_limit, _replenish_rate, cfg.rate_factor, _replenish_threshold);
}

auto fair_group::grab_capacity(capacity_t cap) noexcept -> capacity_t {

@@ -214,7 +213,7 @@ void fair_queue::push_priority_class_from_idle(priority_class_data& pc) {

// duration. For this estimate how many capacity units can be
// accumulated with the current class shares per rate resulution
// and scale it up to tau.

- accumulator_t max_deviation = _group.cost_capacity().normalize(_group.shares_capacity()) / pc._shares * std::chrono::duration_cast<fair_group::rate_resolution>(_config.tau).count();
+ accumulator_t max_deviation = fair_group::fixed_point_factor / pc._shares * std::chrono::duration_cast<fair_group::rate_resolution>(_config.tau).count();

pc._accumulated = std::max(_last_accumulated - max_deviation, pc._accumulated);

_handles.push(&pc);
pc._queued = true;
@@ -371,7 +370,7 @@ void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {
_requests_executing++;
_requests_queued--;

- auto req_cost = req._ticket.normalize(_group.shares_capacity()) / h._shares;
+ auto req_cost = (accumulator_t)_group.ticket_capacity(req._ticket) / h._shares;

auto next_accumulated = h._accumulated + req_cost;

if (std::isinf(next_accumulated)) {
for (auto& pc : _priority_classes) {

diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
index 62a949ea..f5327f82 100644
--- a/src/core/io_queue.cc
+++ b/src/core/io_queue.cc
@@ -367,13 +367,10 @@ fair_group::config io_group::make_fair_group_config(const io_queue::config& qcfg
*/
if (max_req_count < max_req_count_min) {

seastar_logger.warn("The disk request rate is too low, configuring it to {}, but you may experience latency problems", max_req_count_min);

- max_req_count = max_req_count_min;

}

fair_group::config cfg;
cfg.label = fmt::format("io-queue-{}", qcfg.devid);

- cfg.max_weight = max_req_count;
- cfg.max_size = qcfg.max_blocks_count;

cfg.min_weight = std::min(io_queue::read_request_base_count, qcfg.disk_req_write_to_read_multiplier);

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 20, 2022, 12:21:13 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

The previous patch introduced this type for its specific reason, now
the rest of the fair-queue code can benefit from it a bit.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/fair_queue.hh | 4 ++--
src/core/fair_queue.cc | 8 ++++----

2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
index b0a685fa..a38e5049 100644
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh

@@ -338,10 +338,10 @@ class fair_queue {
* in the middle of the waiting
*/
struct pending {
- fair_group::capacity_t head;
+ capacity_t head;
fair_queue_ticket ticket;

- pending(fair_group::capacity_t t, fair_queue_ticket c) noexcept : head(t), ticket(c) {}
+ pending(capacity_t t, fair_queue_ticket c) noexcept : head(t), ticket(c) {}
};

std::optional<pending> _pending;

diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
index 9ac71a63..2df0057b 100644
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc

@@ -240,7 +240,7 @@ bool fair_queue::grab_pending_capacity(const fair_queue_entry& ent) noexcept {
if (ent._ticket == _pending->ticket) {
_pending.reset();
} else {
- fair_group::capacity_t cap = _group.ticket_capacity(ent._ticket);
+ capacity_t cap = _group.ticket_capacity(ent._ticket);
/*
* This branch is called when the fair queue decides to
* submit not the same request that entered it into the
@@ -259,8 +259,8 @@ bool fair_queue::grab_capacity(const fair_queue_entry& ent) noexcept {
return grab_pending_capacity(ent);
}

- fair_group::capacity_t cap = _group.ticket_capacity(ent._ticket);
- fair_group::capacity_t want_head = _group.grab_capacity(cap) + cap;
+ capacity_t cap = _group.ticket_capacity(ent._ticket);
+ capacity_t want_head = _group.grab_capacity(cap) + cap;
if (_group.capacity_deficiency(want_head)) {
_pending.emplace(want_head, ent._ticket);
return false;

@@ -351,7 +351,7 @@ fair_queue::clock_type::time_point fair_queue::next_pending_aio() const noexcept

}

void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 20, 2022, 12:21:14 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

The fair queue math calculates the costs/rates with usec resolution.
This means that disks should have IOPS and bandwidth as large as to
produce non-zero costs each microsecond. This means at least 8k IOPS
(and 4MB/s) which can be a bit too much for cloud disks and HDDs.

Change the resolution to be milliseconds and adjust the float point
factor accordingly.

While at it -- make sure the cost capacity doesn't become zero for
slow disks.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/fair_queue.hh | 5 +++--
src/core/fair_queue.cc | 5 +++++

2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
index a38e5049..b180465c 100644
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh

@@ -71,6 +71,7 @@ class fair_queue_ticket {
/// For a fair_queue ticket to be non-zero, at least one of its represented quantities need to
/// be non-zero
explicit operator bool() const noexcept;
+ bool is_non_zero() const noexcept;

friend std::ostream& operator<<(std::ostream& os, fair_queue_ticket t);

@@ -238,8 +239,8 @@ class fair_group {
* time period for which the speeds from F (in above formula) are taken.
*/

- using rate_resolution = std::chrono::duration<double, std::micro>;
- static constexpr float fixed_point_factor = float(1 << 14);
+ using rate_resolution = std::chrono::duration<double, std::milli>;
+ static constexpr float fixed_point_factor = float(1 << 24);

// Estimated time to process the given amount of capacity

// (peer of accumulated_capacity() helper)

diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
index 2df0057b..c0ad30a6 100644
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc

@@ -72,6 +72,10 @@ fair_queue_ticket::operator bool() const noexcept {
return (_weight > 0) || (_size > 0);
}

+bool fair_queue_ticket::is_non_zero() const noexcept {
+ return (_weight > 0) && (_size > 0);
+}
+
bool fair_queue_ticket::operator==(const fair_queue_ticket& o) const noexcept {
return _weight == o._weight && _size == o._size;
}

@@ -100,6 +104,7 @@ fair_group::fair_group(config cfg) noexcept

, _capacity_ceil(_replenish_limit)
{
assert(!wrapping_difference(_capacity_tail.load(std::memory_order_relaxed), _capacity_head.load(std::memory_order_relaxed)));

+ assert(_cost_capacity.is_non_zero());

seastar_logger.info("Created fair group {}, capacity rate {}, limit {}, rate {} (factor {}), threshold {}", cfg.label,
_cost_capacity, _replenish_limit, _replenish_rate, cfg.rate_factor, _replenish_threshold);
}

--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 20, 2022, 12:21:14 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

The option controls the maximum IO latency the scheduler should aim at. If
not specified, the legacy task-quota-ms * 1.5 is used.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/reactor_config.hh | 4 ++++
src/core/reactor.cc | 9 ++++++++-
2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/seastar/core/reactor_config.hh b/include/seastar/core/reactor_config.hh
index fa2d7360..1080ccb1 100644

--- a/include/seastar/core/reactor_config.hh
+++ b/include/seastar/core/reactor_config.hh
@@ -61,6 +61,10 @@ struct reactor_options : public program_options::option_group {
///
/// Default: 500.
program_options::value<double> task_quota_ms;
+ /// \brief Max time (ms) IO operations must take.
+ ///
+ /// Default: 1.5 * task_quota_ms value

+ program_options::value<double> io_latency_goal_ms;

/// \brief Maximum number of task backlog to allow.
///
/// When the number of tasks grow above this, we stop polling (e.g. I/O)
diff --git a/src/core/reactor.cc b/src/core/reactor.cc

index db607641..80598c56 100644

--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -3457,6 +3457,7 @@ reactor_options::reactor_options(program_options::option_group* parent_group)
, poll_aio(*this, "poll-aio", true,
"busy-poll for disk I/O (reduces latency and increases throughput)")
, task_quota_ms(*this, "task-quota-ms", 500, "Max time (ms) between polls")

+ , io_latency_goal_ms(*this, "io-latency-goal-ms", {}, "Max time (ms) io operations must take (1.5 * task-quota-ms if not set)")

, max_task_backlog(*this, "max-task-backlog", 1000, "Maximum number of task backlog to allow; above this we ignore I/O")
, blocked_reactor_notify_ms(*this, "blocked-reactor-notify-ms", 200, "threshold in miliseconds over which the reactor is considered blocked if no progress is made")
, blocked_reactor_reports_per_minute(*this, "blocked-reactor-reports-per-minute", 5, "Maximum number of backtraces reported by stall detector per minute")

@@ -3695,9 +3696,15 @@ class disk_config_params {
return _latency_goal;
}

+ double latency_goal_opt(const reactor_options& opts) const {
+ return opts.io_latency_goal_ms ?
+ opts.io_latency_goal_ms.get_value() :
+ opts.task_quota_ms.get_value() * 1.5;
+ }
+

void parse_config(const smp_options& smp_opts, const reactor_options& reactor_opts) {
seastar_logger.debug("smp::count: {}", smp::count);
- _latency_goal = std::chrono::duration_cast<std::chrono::duration<double>>(reactor_opts.task_quota_ms.get_value() * 1.5 * 1ms);

+ _latency_goal = std::chrono::duration_cast<std::chrono::duration<double>>(latency_goal_opt(reactor_opts) * 1ms);

seastar_logger.debug("latency_goal: {}", latency_goal().count());

if (smp_opts.max_io_requests) {

--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 20, 2022, 12:21:16 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

It may happen that the disk iops/bandwidth are too low for even a
single request to be served within the scheduler tick. In this case
the latency-goal value should be increased.

In current scheduler implementation this means checking the token
bucket limit against the replenisher threshold. The limit means how
many requests can be submitted within the latency goal, while the
replenish threshold is the number of tokens needed for the minimal
request. If the limit is below the threshold then even the smallest
request doesn't fit, and the goal is too low. And since it's better
if seastar "just works" even if the configuration is not correct,
the first change here is to set the limit to be at least N times
larger than the threshold to let at least N minimal requests in.

N is selected to be 3, because the limit value is used to get the
maximum request _length_ (in bytes), and 1.5K smells somewhat better
than 512 bytes ... Oh, well.

Finally, the io-queue checks if the latency goal was increased by
the fair-queue, and if it was the warning is printed with the newly
set-up value.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/fair_queue.hh | 9 ++++++++-
src/core/fair_queue.cc | 2 +-
src/core/io_queue.cc | 7 +++++++
3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
index b180465c..e9416b4b 100644
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh
@@ -194,8 +194,8 @@ class fair_group {

const fair_queue_ticket _cost_capacity;
const capacity_t _replenish_rate;
- const capacity_t _replenish_limit;
const capacity_t _replenish_threshold;
+ const capacity_t _replenish_limit;
std::atomic<clock_type::time_point> _replenished;

/*
@@ -242,6 +242,13 @@ class fair_group {

using rate_resolution = std::chrono::duration<double, std::milli>;

static constexpr float fixed_point_factor = float(1 << 24);

+ /*
+ * This defines how may requests of minimal capacity should fit into
+ * the replenish limit. If the actual number happens to be below this
+ * then the rate_limit_duration is too low and should be increased.
+ */
+ static constexpr int duration_capacity = 3;
+

// Estimated time to process the given amount of capacity
// (peer of accumulated_capacity() helper)

rate_resolution capacity_duration(capacity_t cap) const noexcept {

diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
index c0ad30a6..2f0ab81a 100644
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc
@@ -96,8 +96,8 @@ uint64_t wrapping_difference(const uint64_t& a, const uint64_t& b) noexcept {
fair_group::fair_group(config cfg) noexcept

: _cost_capacity(cfg.weight_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count(), cfg.size_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count())
, _replenish_rate(cfg.rate_factor * fixed_point_factor)

- , _replenish_limit(_replenish_rate * std::chrono::duration_cast<rate_resolution>(cfg.rate_limit_duration).count())

, _replenish_threshold(std::max((capacity_t)1, ticket_capacity(fair_queue_ticket(cfg.min_weight, cfg.min_size))))

+ , _replenish_limit(std::max<capacity_t>(_replenish_rate * std::chrono::duration_cast<rate_resolution>(cfg.rate_limit_duration).count(), _replenish_threshold * duration_capacity))
, _replenished(clock_type::now())
, _capacity_tail(0)
, _capacity_head(0)
diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
index f5327f82..15b847bd 100644
--- a/src/core/io_queue.cc
+++ b/src/core/io_queue.cc
@@ -389,6 +389,13 @@ io_group::io_group(io_queue::config io_cfg) noexcept
_fgs.push_back(std::make_unique<fair_group>(fg_cfg));
}

+ std::chrono::duration<double> io_lat = _fgs[0]->capacity_duration(_fgs[0]->maximum_capacity());
+ if (io_lat > fg_cfg.rate_limit_duration) {
+ seastar_logger.warn("IO latency goal {:.3f} is too low for device {}, using {:.3f}ms instead",
+ std::chrono::duration_cast<std::chrono::duration<double, std::milli>>(fg_cfg.rate_limit_duration).count(),
+ _config.mountpoint, std::chrono::duration_cast<std::chrono::duration<double, std::milli>>(io_lat).count());
+ }
+

/*
* The maximum request size shouldn't result in the capacity that would

* be larger than the group's replenisher limit.

--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 20, 2022, 12:21:17 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

The --max-requests option is still not removed (~1y had passed since
deprecation), but the rate-limiter code doesn't take it into account.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

src/core/reactor.cc | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index 80598c56..a85a5d34 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -3799,11 +3799,12 @@ class disk_config_params {

Pavel Emelyanov

<xemul@scylladb.com>

unread,

Jan 20, 2022, 12:21:17 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

Now all the usage of the max-counts is gone, the configuration can
be removed. A reminder -- these counts were used to

- evaluate accumulators (switched to cost ticket that uses rates)
- calculate maximum request length (added replenisher limit check)
- configure legacy --max-requests option (patched to use rates)

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---

include/seastar/core/io_queue.hh | 2 --
src/core/io_queue.cc | 23 -----------------------
src/core/reactor.cc | 2 --

3 files changed, 27 deletions(-)

diff --git a/include/seastar/core/io_queue.hh b/include/seastar/core/io_queue.hh
index 6158742f..17a119c3 100644
--- a/include/seastar/core/io_queue.hh
+++ b/include/seastar/core/io_queue.hh

@@ -98,8 +98,6 @@ class io_queue {
struct config {

dev_t devid;
unsigned capacity = std::numeric_limits<unsigned>::max();

- unsigned max_req_count = std::numeric_limits<int>::max();

- unsigned max_blocks_count = std::numeric_limits<int>::max();
unsigned long req_count_rate = std::numeric_limits<int>::max();
unsigned long blocks_count_rate = std::numeric_limits<int>::max();
unsigned disk_req_write_to_read_multiplier = read_request_base_count;
diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
index 15b847bd..215f269b 100644
--- a/src/core/io_queue.cc
+++ b/src/core/io_queue.cc

@@ -346,29 +346,6 @@ io_queue::io_queue(io_group_ptr group, internal::io_sink& sink)
}

fair_group::config io_group::make_fair_group_config(const io_queue::config& qcfg) noexcept {
- /*
- * It doesn't make sense to configure requests limit higher than
- * it can be if the queue is full of minimal requests. At the same
- * time setting too large value increases the chances to overflow
- * the group rovers and lock-up the queue.
- *
- * The same is technically true for blocks limit, but the group
- * rovers are configured in blocks (ticket size shift), and this
- * already makes a good protection.
- */
- auto max_req_count = std::min(qcfg.max_req_count, qcfg.max_blocks_count);
- auto max_req_count_min = std::max(io_queue::read_request_base_count, qcfg.disk_req_write_to_read_multiplier);
- /*
- * Read requests weight read_request_base_count, writes weight
- * disk_req_write_to_read_multiplier. The fair queue limit must
- * be enough to pass the largest one through. The same is true
- * for request sizes, but that check is done run-time, see the
- * request_fq_ticket() method.
- */
- if (max_req_count < max_req_count_min) {

- seastar_logger.warn("The disk request rate is too low, configuring it to {}, but you may experience latency problems", max_req_count_min);
- }
-

fair_group::config cfg;
cfg.label = fmt::format("io-queue-{}", qcfg.devid);

cfg.min_weight = std::min(io_queue::read_request_base_count, qcfg.disk_req_write_to_read_multiplier);

diff --git a/src/core/reactor.cc b/src/core/reactor.cc
index a85a5d34..ef958495 100644
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -3776,12 +3776,10 @@ class disk_config_params {

Avi Kivity

<avi@scylladb.com>

unread,

Jan 20, 2022, 1:39:41 PM1/20/22

to Pavel Emelyanov, seastar-dev@googlegroups.com

We'll want to gather feedback from HDD users and see if this should be
increased, trading off latency for throughput. I believe that at 3 the
throughput hit will be large because sequential reads will be random
because the disk never sees two of them back-to-back.

However, it depends on disk read-ahead and write-behind. If the disk
reads > 128k even when it gets a single request, then it will be able to
serve the next request from its internal cache.

Another thing we can do is automatically tune the sequential I/O size.
It just doesn't make sense to use 128k writes for HDD. A good size is
k*bw/iops. For HDD bw=200 MB/s and iops=100, so it ends up k*2MB (k=3).

Commit Bot

<bot@cloudius-systems.com>

unread,

Jan 20, 2022, 1:48:33 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

From: Pavel Emelyanov <xe...@scylladb.com>
Committer: Pavel Emelyanov <xe...@scylladb.com>
Branch: master

fair_queue: Move duration calculation into group

This removes the need to export rate from fair_group's API. Also
there's a peer function on group that converts time to capacity.

Out-line the next_pending_aio() while patching one.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>

---

diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh
@@ -243,6 +243,12 @@ public:
using rate_resolution = std::chrono::duration<double, std::micro>;
static constexpr float fixed_point_factor = float(1 << 14);

+ // Estimated time to process the given amount of capacity

+ // (peer of accumulated_capacity() helper)
+ rate_resolution capacity_duration(capacity_t cap) const noexcept {
+ return rate_resolution(cap / _replenish_rate);
+ }
+

struct config {
sstring label = "";

unsigned max_weight;
@@ -269,7 +275,6 @@ public:

capacity_t capacity_deficiency(capacity_t from) const noexcept;
capacity_t ticket_capacity(fair_queue_ticket ticket) const noexcept;
- capacity_t rate() const noexcept { return _replenish_rate; }
};

/// \brief Fair queuing class

@@ -349,12 +354,6 @@ private:

void push_priority_class_from_idle(priority_class_data& pc);
void pop_priority_class(priority_class_data& pc);

- // Estimated time to process the given ticket
- std::chrono::microseconds duration(fair_group::capacity_t desc) const noexcept {
- auto duration_ms = fair_group::rate_resolution(desc / _group.rate());
- return std::chrono::duration_cast<std::chrono::microseconds>(duration_ms);
- }
-
bool grab_capacity(const fair_queue_entry& ent) noexcept;
bool grab_pending_capacity(const fair_queue_entry& ent) noexcept;
public:

@@ -405,24 +404,7 @@ public:

/// Try to execute new requests if there is capacity left in the queue.
void dispatch_requests(std::function<void(fair_queue_entry&)> cb);

- clock_type::time_point next_pending_aio() const noexcept {
- if (_pending) {
- /*
- * We expect the disk to release the ticket within some time,
- * but it's ... OK if it doesn't -- the pending wait still
- * needs the head rover value to be ahead of the needed value.
- *
- * It may happen that the capacity gets released before we think
- * it will, in this case we will wait for the full value again,
- * which's sub-optimal. The expectation is that we think disk
- * works faster, than it really does.
- */
- fair_group::capacity_t over = _group.capacity_deficiency(_pending->head);
- return std::chrono::steady_clock::now() + duration(over);
- }
-
- return std::chrono::steady_clock::time_point::max();
- }
+ clock_type::time_point next_pending_aio() const noexcept;
};
/// @}

diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc

--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc

@@ -327,6 +327,26 @@ void fair_queue::notify_request_cancelled(fair_queue_entry& ent) noexcept {
ent._ticket = fair_queue_ticket();
}

+fair_queue::clock_type::time_point fair_queue::next_pending_aio() const noexcept {
+ if (_pending) {
+ /*
+ * We expect the disk to release the ticket within some time,
+ * but it's ... OK if it doesn't -- the pending wait still
+ * needs the head rover value to be ahead of the needed value.
+ *
+ * It may happen that the capacity gets released before we think
+ * it will, in this case we will wait for the full value again,
+ * which's sub-optimal. The expectation is that we think disk
+ * works faster, than it really does.
+ */
+ auto over = _group.capacity_deficiency(_pending->head);
+ auto ticks = _group.capacity_duration(over);
+ return std::chrono::steady_clock::now() + std::chrono::duration_cast<std::chrono::microseconds>(ticks);
+ }
+
+ return std::chrono::steady_clock::time_point::max();
+}
+

void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {

Commit Bot

<bot@cloudius-systems.com>

unread,

Jan 20, 2022, 1:48:34 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

From: Pavel Emelyanov <xe...@scylladb.com>
Committer: Pavel Emelyanov <xe...@scylladb.com>
Branch: master

io_queue: Move max blocks count onto group

This value is used to calculate request accumulator cost and to check if
the request length is too large. The former usage is about to be replaced
with rate-limter token cost, but it's still needed to know the maximum
length of the request, so keep the limit as explicitly separate param.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>

---

diff --git a/include/seastar/core/io_queue.hh b/include/seastar/core/io_queue.hh
--- a/include/seastar/core/io_queue.hh
+++ b/include/seastar/core/io_queue.hh
@@ -99,7 +99,7 @@ public:

dev_t devid;
unsigned capacity = std::numeric_limits<unsigned>::max();

unsigned max_req_count = std::numeric_limits<int>::max();

- mutable unsigned max_blocks_count = std::numeric_limits<int>::max();
+ unsigned max_blocks_count = std::numeric_limits<int>::max();

unsigned long req_count_rate = std::numeric_limits<int>::max();
unsigned long blocks_count_rate = std::numeric_limits<int>::max();
unsigned disk_req_write_to_read_multiplier = read_request_base_count;

@@ -172,6 +172,7 @@ private:

friend class ::io_queue_for_tests;

const io_queue::config _config;
+ unsigned max_ticket_size = std::numeric_limits<int>::max();
std::vector<std::unique_ptr<fair_group>> _fgs;

static fair_group::config make_fair_group_config(const io_queue::config& qcfg) noexcept;
diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
--- a/src/core/io_queue.cc
+++ b/src/core/io_queue.cc
@@ -394,9 +394,7 @@ io_group::io_group(io_queue::config io_cfg) noexcept

/*
* The maximum request size shouldn't result in the capacity that would

- * be larger than the group's replenisher limit. It's not the same as the
- * max_blocks_count which is now used purely for shares accounting and no
- * longer has anything to do with replenisher.
+ * be larger than the group's replenisher limit.
*
* To get the correct value check the 2^N sizes and find the largest one
* with little enough capacity. Actually (FIXME) requests should calculate
@@ -430,7 +428,7 @@ io_group::io_group(io_queue::config io_cfg) noexcept

update_max_size(internal::io_direction_and_length::write_idx);
update_max_size(internal::io_direction_and_length::read_idx);
- _config.max_blocks_count = max_size;
+ max_ticket_size = max_size;

seastar_logger.info("Created io group, length limit {}:{}, rate {}:{}",
(max_size / io_queue::read_request_base_count) << io_queue::block_size_shift,

@@ -658,22 +656,22 @@ fair_queue_ticket io_queue::request_fq_ticket(internal::io_direction_and_length

static thread_local size_t oversize_warning_threshold = 0;

- if (size > get_config().max_blocks_count) {
+ if (size > _group->max_ticket_size) {
if (size > oversize_warning_threshold) {
oversize_warning_threshold = size;
io_log.warn("oversized request (length {}) submitted. "
"dazed and confuzed, trimming its weight from {} down to {}", dnl.length(),
- size, get_config().max_blocks_count);
+ size, _group->max_ticket_size);
}
- size = get_config().max_blocks_count;
+ size = _group->max_ticket_size;
}

return fair_queue_ticket(weight, size);
}

Commit Bot

<bot@cloudius-systems.com>

unread,

Jan 20, 2022, 1:48:36 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

From: Pavel Emelyanov <xe...@scylladb.com>
Committer: Pavel Emelyanov <xe...@scylladb.com>
Branch: master

fair_queue: Merge cost/shares capacities

Currently there are two tickets on the fair-group used to normalize
per-request tickets -- the cost_capacity and the shares_capacity. The
former one is used to evaluate how much a request contributes to the
disk run-time capacity (a.k.a. the tokens in the buckets), while the
former one is used to calculate the accumulator -- a counter used to
balance between classes according to their shares. The shares ticket
is exactly latency_goal[us] times larger than the cost one.

It's better to remove the shares ticket and use the cost one in all
the calculations. Note, that the accumulator values instantly become
~1k times less, but for now it's fine -- the accumulator is (still)
a floating-point number.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>

---

diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh

@@ -172,8 +172,6 @@ private:

using fair_group_atomic_rover = std::atomic<capacity_t>;
static_assert(fair_group_atomic_rover::is_always_lock_free);

- const fair_queue_ticket _shares_capacity;
-
/*
* The dF/dt <= K limitation is managed by the modified token bucket
* algo where tokens are ticket.normalize(cost_capacity), the refill

@@ -251,8 +249,6 @@ public:

struct config {
sstring label = "";

- unsigned max_weight;
- unsigned max_size;
unsigned min_weight = 0;
unsigned min_size = 0;
unsigned long weight_rate;

@@ -264,7 +260,6 @@ public:

explicit fair_group(config cfg) noexcept;
fair_group(fair_group&&) = delete;

- fair_queue_ticket shares_capacity() const noexcept { return _shares_capacity; }
fair_queue_ticket cost_capacity() const noexcept { return _cost_capacity; }
capacity_t maximum_capacity() const noexcept { return _replenish_limit; }
capacity_t grab_capacity(capacity_t cap) noexcept;

diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc

@@ -90,8 +90,7 @@ uint64_t wrapping_difference(const uint64_t& a, const uint64_t& b) noexcept {
}

fair_group::fair_group(config cfg) noexcept
- : _shares_capacity(cfg.max_weight, cfg.max_size)
- , _cost_capacity(cfg.weight_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count(), cfg.size_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count())
+ : _cost_capacity(cfg.weight_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count(), cfg.size_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count())
, _replenish_rate(cfg.rate_factor * fixed_point_factor)

, _replenish_limit(_replenish_rate * std::chrono::duration_cast<rate_resolution>(cfg.rate_limit_duration).count())
, _replenish_threshold(std::max((capacity_t)1, ticket_capacity(fair_queue_ticket(cfg.min_weight, cfg.min_size))))

@@ -101,8 +100,8 @@ fair_group::fair_group(config cfg) noexcept

, _capacity_ceil(_replenish_limit)
{
assert(!wrapping_difference(_capacity_tail.load(std::memory_order_relaxed), _capacity_head.load(std::memory_order_relaxed)));

- seastar_logger.info("Created fair group {}, capacity shares {} rate {}, limit {}, rate {} (factor {}), threshold {}", cfg.label,
- _shares_capacity, _cost_capacity, _replenish_limit, _replenish_rate, cfg.rate_factor, _replenish_threshold);
+ seastar_logger.info("Created fair group {}, capacity rate {}, limit {}, rate {} (factor {}), threshold {}", cfg.label,
+ _cost_capacity, _replenish_limit, _replenish_rate, cfg.rate_factor, _replenish_threshold);
}

auto fair_group::grab_capacity(capacity_t cap) noexcept -> capacity_t {
@@ -214,7 +213,7 @@ void fair_queue::push_priority_class_from_idle(priority_class_data& pc) {
// duration. For this estimate how many capacity units can be
// accumulated with the current class shares per rate resulution
// and scale it up to tau.
- accumulator_t max_deviation = _group.cost_capacity().normalize(_group.shares_capacity()) / pc._shares * std::chrono::duration_cast<fair_group::rate_resolution>(_config.tau).count();
+ accumulator_t max_deviation = fair_group::fixed_point_factor / pc._shares * std::chrono::duration_cast<fair_group::rate_resolution>(_config.tau).count();
pc._accumulated = std::max(_last_accumulated - max_deviation, pc._accumulated);
_handles.push(&pc);
pc._queued = true;
@@ -371,7 +370,7 @@ void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {
_requests_executing++;
_requests_queued--;

- auto req_cost = req._ticket.normalize(_group.shares_capacity()) / h._shares;
+ auto req_cost = (accumulator_t)_group.ticket_capacity(req._ticket) / h._shares;
auto next_accumulated = h._accumulated + req_cost;
if (std::isinf(next_accumulated)) {
for (auto& pc : _priority_classes) {

diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
--- a/src/core/io_queue.cc
+++ b/src/core/io_queue.cc

@@ -367,13 +367,10 @@ fair_group::config io_group::make_fair_group_config(const io_queue::config& qcfg
*/
if (max_req_count < max_req_count_min) {

seastar_logger.warn("The disk request rate is too low, configuring it to {}, but you may experience latency problems", max_req_count_min);

- max_req_count = max_req_count_min;

}

fair_group::config cfg;
cfg.label = fmt::format("io-queue-{}", qcfg.devid);

- cfg.max_weight = max_req_count;
- cfg.max_size = qcfg.max_blocks_count;

cfg.min_weight = std::min(io_queue::read_request_base_count, qcfg.disk_req_write_to_read_multiplier);

cfg.min_size = std::min(io_queue::read_request_base_count, qcfg.disk_blocks_write_to_read_multiplier);
cfg.weight_rate = qcfg.req_count_rate;
diff --git a/tests/perf/fair_queue_perf.cc b/tests/perf/fair_queue_perf.cc

--- a/tests/perf/fair_queue_perf.cc
+++ b/tests/perf/fair_queue_perf.cc
@@ -39,8 +39,6 @@ struct local_fq_and_class {

static fair_group::config fg_config() {
fair_group::config cfg;
- cfg.max_weight = 1;
- cfg.max_size = 1;
cfg.weight_rate = std::numeric_limits<int>::max();
cfg.size_rate = std::numeric_limits<int>::max();
return cfg;
@@ -81,8 +79,6 @@ struct perf_fair_queue {

static fair_group::config fg_config() {
fair_group::config cfg;
- cfg.max_weight = smp::count;
- cfg.max_size = smp::count;
cfg.weight_rate = std::numeric_limits<int>::max();
cfg.size_rate = std::numeric_limits<int>::max();
return cfg;
diff --git a/tests/unit/fair_queue_test.cc b/tests/unit/fair_queue_test.cc

Commit Bot

<bot@cloudius-systems.com>

unread,

Jan 20, 2022, 1:48:37 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

From: Pavel Emelyanov <xe...@scylladb.com>
Committer: Pavel Emelyanov <xe...@scylladb.com>
Branch: master

fair_queue: Treat accumulator as capacity

As was described in the previous patch -- capacity is the number of
tokens a request contributes to the bucket, while the accumulator is
the number of units a request contributes to its class. The latter
value is numerically equal to the former one so it's natural to use
these two interchangeably. All the more so the accumulated value on
a class represents the number of tokens consumed by this class in the
rate-limiting formula.

However, the accumulated values are the tokens divided by the number
of shares. For unconfigured queue the smallest "cost" of a request is
~1900 (the 4GB/s disk has this value ~50000), so large shares values
may produce something less than 1 and converting it to integer may
produce 0 which, in turn, means a class makes no progress and
monopolizes (locks) the queue. So make sure the division doesn't
produce 0 by chance.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>

---
diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh

@@ -304,7 +304,8 @@ public:

using class_id = unsigned int;
class priority_class_data;
- using accumulator_t = double;
+ using capacity_t = fair_group::capacity_t;
+ using signed_capacity_t = std::make_signed<capacity_t>::type;

private:
using clock_type = std::chrono::steady_clock;

@@ -323,7 +324,7 @@ private:

using prioq = std::priority_queue<priority_class_ptr, std::vector<priority_class_ptr>, class_compare>;
prioq _handles;
std::vector<std::unique_ptr<priority_class_data>> _priority_classes;
- accumulator_t _last_accumulated = 0;
+ capacity_t _last_accumulated = 0;

/*
* When the shared capacity os over the local queue delays

diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc

@@ -158,7 +158,7 @@ auto fair_group::fetch_add(fair_group_atomic_rover& rover, capacity_t cap) noexc
class fair_queue::priority_class_data {
friend class fair_queue;
uint32_t _shares = 0;
- fair_queue::accumulator_t _accumulated = 0;
+ capacity_t _accumulated = 0;
fair_queue_entry::container_list_t _queue;
bool _queued = false;

@@ -213,8 +213,12 @@ void fair_queue::push_priority_class_from_idle(priority_class_data& pc) {

// duration. For this estimate how many capacity units can be
// accumulated with the current class shares per rate resulution
// and scale it up to tau.

- accumulator_t max_deviation = fair_group::fixed_point_factor / pc._shares * std::chrono::duration_cast<fair_group::rate_resolution>(_config.tau).count();
- pc._accumulated = std::max(_last_accumulated - max_deviation, pc._accumulated);
+ capacity_t max_deviation = fair_group::fixed_point_factor / pc._shares * std::chrono::duration_cast<fair_group::rate_resolution>(_config.tau).count();
+ // On start this deviation can go to negative values, so not to
+ // introduce extra if's for that short corner case, use signed
+ // arithmetics and make sure the _accumulated value doesn't grow
+ // over signed maximum (see overflow check below)

+ pc._accumulated = std::max<signed_capacity_t>(_last_accumulated - max_deviation, pc._accumulated);
_handles.push(&pc);
pc._queued = true;
}
@@ -370,9 +374,13 @@ void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {
_requests_executing++;
_requests_queued--;

Commit Bot

<bot@cloudius-systems.com>

unread,

Jan 20, 2022, 1:48:39 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

From: Pavel Emelyanov <xe...@scylladb.com>
Committer: Pavel Emelyanov <xe...@scylladb.com>
Branch: master

fair_queue: Use the fair_queue::capacity_t everywhere

The previous patch introduced this type for its specific reason, now
the rest of the fair-queue code can benefit from it a bit.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>

---
diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh

@@ -338,10 +338,10 @@ private:

* in the middle of the waiting
*/
struct pending {
- fair_group::capacity_t head;
+ capacity_t head;
fair_queue_ticket ticket;

- pending(fair_group::capacity_t t, fair_queue_ticket c) noexcept : head(t), ticket(c) {}
+ pending(capacity_t t, fair_queue_ticket c) noexcept : head(t), ticket(c) {}
};

std::optional<pending> _pending;

diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc

@@ -240,7 +240,7 @@ bool fair_queue::grab_pending_capacity(const fair_queue_entry& ent) noexcept {
if (ent._ticket == _pending->ticket) {
_pending.reset();
} else {
- fair_group::capacity_t cap = _group.ticket_capacity(ent._ticket);
+ capacity_t cap = _group.ticket_capacity(ent._ticket);
/*
* This branch is called when the fair queue decides to
* submit not the same request that entered it into the
@@ -259,8 +259,8 @@ bool fair_queue::grab_capacity(const fair_queue_entry& ent) noexcept {
return grab_pending_capacity(ent);
}

- fair_group::capacity_t cap = _group.ticket_capacity(ent._ticket);
- fair_group::capacity_t want_head = _group.grab_capacity(cap) + cap;
+ capacity_t cap = _group.ticket_capacity(ent._ticket);
+ capacity_t want_head = _group.grab_capacity(cap) + cap;
if (_group.capacity_deficiency(want_head)) {
_pending.emplace(want_head, ent._ticket);
return false;

@@ -351,7 +351,7 @@ fair_queue::clock_type::time_point fair_queue::next_pending_aio() const noexcept

}

void fair_queue::dispatch_requests(std::function<void(fair_queue_entry&)> cb) {

Commit Bot

<bot@cloudius-systems.com>

unread,

Jan 20, 2022, 1:48:40 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

From: Pavel Emelyanov <xe...@scylladb.com>
Committer: Pavel Emelyanov <xe...@scylladb.com>
Branch: master

seastar: Introduce --io-latency-goal option

The option controls the maximum IO latency the scheduler should aim at. If
not specified, the legacy task-quota-ms * 1.5 is used.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>

---

diff --git a/include/seastar/core/reactor_config.hh b/include/seastar/core/reactor_config.hh

--- a/include/seastar/core/reactor_config.hh
+++ b/include/seastar/core/reactor_config.hh
@@ -61,6 +61,10 @@ struct reactor_options : public program_options::option_group {
///
/// Default: 500.
program_options::value<double> task_quota_ms;
+ /// \brief Max time (ms) IO operations must take.
+ ///
+ /// Default: 1.5 * task_quota_ms value
+ program_options::value<double> io_latency_goal_ms;
/// \brief Maximum number of task backlog to allow.
///
/// When the number of tasks grow above this, we stop polling (e.g. I/O)
diff --git a/src/core/reactor.cc b/src/core/reactor.cc

--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc

Commit Bot

<bot@cloudius-systems.com>

unread,

Jan 20, 2022, 1:48:41 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

From: Pavel Emelyanov <xe...@scylladb.com>
Committer: Pavel Emelyanov <xe...@scylladb.com>
Branch: master

fair_queue: Adjust duration/factor for smaller rates

The fair queue math calculates the costs/rates with usec resolution.
This means that disks should have IOPS and bandwidth as large as to
produce non-zero costs each microsecond. This means at least 8k IOPS
(and 4MB/s) which can be a bit too much for cloud disks and HDDs.

Change the resolution to be milliseconds and adjust the float point
factor accordingly.

While at it -- make sure the cost capacity doesn't become zero for
slow disks.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>

---

diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh

@@ -71,6 +71,7 @@ public:

/// For a fair_queue ticket to be non-zero, at least one of its represented quantities need to
/// be non-zero
explicit operator bool() const noexcept;
+ bool is_non_zero() const noexcept;

friend std::ostream& operator<<(std::ostream& os, fair_queue_ticket t);

@@ -238,8 +239,8 @@ public:

* time period for which the speeds from F (in above formula) are taken.
*/

- using rate_resolution = std::chrono::duration<double, std::micro>;
- static constexpr float fixed_point_factor = float(1 << 14);
+ using rate_resolution = std::chrono::duration<double, std::milli>;
+ static constexpr float fixed_point_factor = float(1 << 24);

// Estimated time to process the given amount of capacity

// (peer of accumulated_capacity() helper)

diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc

@@ -72,6 +72,10 @@ fair_queue_ticket::operator bool() const noexcept {
return (_weight > 0) || (_size > 0);
}

+bool fair_queue_ticket::is_non_zero() const noexcept {
+ return (_weight > 0) && (_size > 0);
+}
+
bool fair_queue_ticket::operator==(const fair_queue_ticket& o) const noexcept {
return _weight == o._weight && _size == o._size;
}

@@ -100,6 +104,7 @@ fair_group::fair_group(config cfg) noexcept

, _capacity_ceil(_replenish_limit)
{
assert(!wrapping_difference(_capacity_tail.load(std::memory_order_relaxed), _capacity_head.load(std::memory_order_relaxed)));

+ assert(_cost_capacity.is_non_zero());

Commit Bot

<bot@cloudius-systems.com>

unread,

Jan 20, 2022, 1:48:42 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

From: Pavel Emelyanov <xe...@scylladb.com>
Committer: Pavel Emelyanov <xe...@scylladb.com>
Branch: master

io_queue: Auto-increase the io-latency goal

It may happen that the disk iops/bandwidth are too low for even a
single request to be served within the scheduler tick. In this case
the latency-goal value should be increased.

In current scheduler implementation this means checking the token
bucket limit against the replenisher threshold. The limit means how
many requests can be submitted within the latency goal, while the
replenish threshold is the number of tokens needed for the minimal
request. If the limit is below the threshold then even the smallest
request doesn't fit, and the goal is too low. And since it's better
if seastar "just works" even if the configuration is not correct,
the first change here is to set the limit to be at least N times
larger than the threshold to let at least N minimal requests in.

N is selected to be 3, because the limit value is used to get the
maximum request _length_ (in bytes), and 1.5K smells somewhat better
than 512 bytes ... Oh, well.

Finally, the io-queue checks if the latency goal was increased by
the fair-queue, and if it was the warning is printed with the newly
set-up value.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>

---
diff --git a/include/seastar/core/fair_queue.hh b/include/seastar/core/fair_queue.hh
--- a/include/seastar/core/fair_queue.hh
+++ b/include/seastar/core/fair_queue.hh

@@ -194,8 +194,8 @@ private:

const fair_queue_ticket _cost_capacity;
const capacity_t _replenish_rate;
- const capacity_t _replenish_limit;
const capacity_t _replenish_threshold;
+ const capacity_t _replenish_limit;
std::atomic<clock_type::time_point> _replenished;

/*

@@ -242,6 +242,13 @@ public:

using rate_resolution = std::chrono::duration<double, std::milli>;

static constexpr float fixed_point_factor = float(1 << 24);

+ /*
+ * This defines how may requests of minimal capacity should fit into
+ * the replenish limit. If the actual number happens to be below this
+ * then the rate_limit_duration is too low and should be increased.
+ */
+ static constexpr int duration_capacity = 3;
+

// Estimated time to process the given amount of capacity
// (peer of accumulated_capacity() helper)

rate_resolution capacity_duration(capacity_t cap) const noexcept {

diff --git a/src/core/fair_queue.cc b/src/core/fair_queue.cc
--- a/src/core/fair_queue.cc
+++ b/src/core/fair_queue.cc

@@ -96,8 +96,8 @@ uint64_t wrapping_difference(const uint64_t& a, const uint64_t& b) noexcept {
fair_group::fair_group(config cfg) noexcept

: _cost_capacity(cfg.weight_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count(), cfg.size_rate / std::chrono::duration_cast<rate_resolution>(std::chrono::seconds(1)).count())
, _replenish_rate(cfg.rate_factor * fixed_point_factor)

- , _replenish_limit(_replenish_rate * std::chrono::duration_cast<rate_resolution>(cfg.rate_limit_duration).count())

, _replenish_threshold(std::max((capacity_t)1, ticket_capacity(fair_queue_ticket(cfg.min_weight, cfg.min_size))))

+ , _replenish_limit(std::max<capacity_t>(_replenish_rate * std::chrono::duration_cast<rate_resolution>(cfg.rate_limit_duration).count(), _replenish_threshold * duration_capacity))
, _replenished(clock_type::now())
, _capacity_tail(0)
, _capacity_head(0)

diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
--- a/src/core/io_queue.cc
+++ b/src/core/io_queue.cc

@@ -389,6 +389,13 @@ io_group::io_group(io_queue::config io_cfg) noexcept
_fgs.push_back(std::make_unique<fair_group>(fg_cfg));
}

+ std::chrono::duration<double> io_lat = _fgs[0]->capacity_duration(_fgs[0]->maximum_capacity());
+ if (io_lat > fg_cfg.rate_limit_duration) {
+ seastar_logger.warn("IO latency goal {:.3f} is too low for device {}, using {:.3f}ms instead",
+ std::chrono::duration_cast<std::chrono::duration<double, std::milli>>(fg_cfg.rate_limit_duration).count(),

+ _config.mountpoint, std::chrono::duration_cast<std::chrono::duration<double, std::milli>>(io_lat).count());
+ }
+

/*
* The maximum request size shouldn't result in the capacity that would

Commit Bot

<bot@cloudius-systems.com>

unread,

Jan 20, 2022, 1:48:43 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

From: Pavel Emelyanov <xe...@scylladb.com>
Committer: Pavel Emelyanov <xe...@scylladb.com>
Branch: master

reactor: Tune up legacy configuration

The --max-requests option is still not removed (~1y had passed since
deprecation), but the rate-limiter code doesn't take it into account.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>

---

diff --git a/src/core/reactor.cc b/src/core/reactor.cc
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc

Commit Bot

<bot@cloudius-systems.com>

unread,

Jan 20, 2022, 1:48:44 PM1/20/22

to seastar-dev@googlegroups.com, Pavel Emelyanov

From: Pavel Emelyanov <xe...@scylladb.com>
Committer: Pavel Emelyanov <xe...@scylladb.com>
Branch: master

io_queue: Remove max counts from config

Now all the usage of the max-counts is gone, the configuration can
be removed. A reminder -- these counts were used to

- evaluate accumulators (switched to cost ticket that uses rates)
- calculate maximum request length (added replenisher limit check)
- configure legacy --max-requests option (patched to use rates)

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>

---

diff --git a/include/seastar/core/io_queue.hh b/include/seastar/core/io_queue.hh
--- a/include/seastar/core/io_queue.hh
+++ b/include/seastar/core/io_queue.hh

@@ -98,8 +98,6 @@ public:
struct config {

dev_t devid;
unsigned capacity = std::numeric_limits<unsigned>::max();

- unsigned max_req_count = std::numeric_limits<int>::max();

- unsigned max_blocks_count = std::numeric_limits<int>::max();
unsigned long req_count_rate = std::numeric_limits<int>::max();
unsigned long blocks_count_rate = std::numeric_limits<int>::max();
unsigned disk_req_write_to_read_multiplier = read_request_base_count;

diff --git a/src/core/io_queue.cc b/src/core/io_queue.cc
--- a/src/core/io_queue.cc
+++ b/src/core/io_queue.cc

@@ -346,29 +346,6 @@ io_queue::io_queue(io_group_ptr group, internal::io_sink& sink)
}

fair_group::config io_group::make_fair_group_config(const io_queue::config& qcfg) noexcept {
- /*
- * It doesn't make sense to configure requests limit higher than
- * it can be if the queue is full of minimal requests. At the same
- * time setting too large value increases the chances to overflow
- * the group rovers and lock-up the queue.
- *
- * The same is technically true for blocks limit, but the group
- * rovers are configured in blocks (ticket size shift), and this
- * already makes a good protection.
- */
- auto max_req_count = std::min(qcfg.max_req_count, qcfg.max_blocks_count);
- auto max_req_count_min = std::max(io_queue::read_request_base_count, qcfg.disk_req_write_to_read_multiplier);
- /*
- * Read requests weight read_request_base_count, writes weight
- * disk_req_write_to_read_multiplier. The fair queue limit must
- * be enough to pass the largest one through. The same is true
- * for request sizes, but that check is done run-time, see the
- * request_fq_ticket() method.
- */
- if (max_req_count < max_req_count_min) {

- seastar_logger.warn("The disk request rate is too low, configuring it to {}, but you may experience latency problems", max_req_count_min);
- }
-

fair_group::config cfg;
cfg.label = fmt::format("io-queue-{}", qcfg.devid);

cfg.min_weight = std::min(io_queue::read_request_base_count, qcfg.disk_req_write_to_read_multiplier);

diff --git a/src/core/reactor.cc b/src/core/reactor.cc
--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc

Reply all

Reply to author

Forward