[PATCH v3 0/6] Switch partitions cache from BST to B+tree & array


Pavel Emelyanov

<xemul@scylladb.com>
May 6, 2020, 2:21:26 PM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The data model is now

bplus::tree<Key = int64_t, T = array<cache_entry>>

The whole thing is encapsulated into a collection called "double_decker"
from patch #5. The array<T> is an array of T-s with zero bytes of overhead,
used to resolve hash conflicts (patch #4).

Changes in v3:

- replace managed_vector with array
- use int64_t as tree key instead of dht::token
- simple tune-up of insertion algo for better packing (15% less inner nodes)
- optimize bplus::tree::erase(iterator)

This version is still without the SIMD node key lookup

branch: https://github.com/xemul/scylla/commits/row-cache-over-bptree-3
tests:
unit(debug) for new collections
unit(dev) for the rest
perf(dev) for row_cache

Testing results are somewhat promising:

* memory_footprint:
sizeof(old cache_entry) = 104

sizeof(new cache_entry) = 72
sizeof(bptree::data) = 80 (1 pointer overhead)

in cache:
nopatch: 98400000
8 keys: 98960202 (+0.5%, 5.6 b/part)
16 keys: 98438582 (+0.03%, 0.4 b/part)
32 keys: 98198074 (-0.2%, 2.0 b/part)
64 keys: 98061524 (-0.3%, 3.4 b/part)

* row_cache_update:

3 sub-tests:
small_partitions
partition_with_few_small_rows
partition_with_lots_of_small_rows

numbers are update / invalidation times [ms]

The invalidation turned out _not_ to be improved by the O(logN) eviction:
since it effectively does a range erase with the tree at hand, both
std::set and B+ are O(1) here.

nopatch 1449.702 / 382.161 1369.786 / 346.426 478.766 / 0.417

8, lin 993.980 / 312.738 1117.000 / 264.217 446.281 / 0.419
16, lin 996.407 / 334.595 1108.257 / 257.385 438.133 / 0.426
32, lin 1088.253 / 316.095 1168.123 / 258.666 457.052 / 0.422

16, bin 1399.507 / 423.926 1225.133 / 347.217 448.318 / 0.431
32, bin 972.932 / 322.350 1103.503 / 256.497 449.912 / 0.419
64, bin 1136.614 / 336.679 1119.315 / 264.116 456.312 / 0.433

* perf_simple_query:

Same as memory footprint here -- the larger the nodes are, the better.

nopatch median 95362.44 abs.dev.: 194.58

8, lin median 88081.63 -7.6% abs.dev.: 23.10
16, lin median 93586.65 -1.9% abs.dev.: 29.07
32, lin median 96735.37 +1.4% abs.dev.: 47.11

16, bin median 94094.68 -1.3% abs.dev.: 36.83
32, bin median 96706.78 +1.4% abs.dev.: 16.54
64, bin median 95558.57 +0.2% abs.dev.: 5.47


The next TODOs according to the above results are:

- SIMD to improve lookup, and checking how large nodes behave with it.
There are two places to look at -- key search on plain lookup and
node-pointer search on insert/remove

Currently a 4-key tree with 5M randomly generated keys would
result in

inner nodes: 494527 (10% from 5M)
with 2 keys: 169853 (34%)
with 3 keys: 126217 (25%)
with 4 keys: 198457 (40%)
leaves: 1512186 (30% from 5M)
with 2 keys: 285564 (18%)
with 3 keys: 477616 (31%)
with 4 keys: 749006 (49%)

- More sophisticated insert/remove algos to produce better packing.
Current micro-benchmarks of B+ vs std::set show we can sacrifice more
cycles for it and still work faster

- Or background packing of B+ nodes -- instead of spending time packing
the stuff on insert/remove -- do it later, when compaction starts

Pavel Emelyanov (6):
row_cache: Simplify clean_now()
test: Move perf measurement helpers into header
core: B+ tree implementation
utils: Array with trusted bounds
double-decker: A combination of B+tree with array
row_cache: Switch partition tree onto B+ rails

configure.py | 12 +
dht/token.hh | 11 +
row_cache.hh | 66 +-
test/perf/perf.hh | 71 +
test/unit/bptree_key.hh | 101 ++
test/unit/bptree_validation.hh | 318 ++++
utils/array_trusted_bounds.hh | 237 +++
utils/bptree.hh | 1862 +++++++++++++++++++++++
utils/double-decker.hh | 288 ++++
dht/token.cc | 22 +-
row_cache.cc | 141 +-
test/boost/array_trusted_bounds_test.cc | 189 +++
test/boost/bptree_test.cc | 344 +++++
test/boost/double_decker_test.cc | 313 ++++
test/perf/memory_footprint_test.cc | 3 +-
test/perf/perf_bptree.cc | 165 ++
test/perf/perf_bptree_drain.cc | 154 ++
test/perf/perf_row_cache_update.cc | 71 -
test/unit/bptree_compaction_test.cc | 210 +++
test/unit/bptree_stress_test.cc | 236 +++
20 files changed, 4636 insertions(+), 178 deletions(-)
create mode 100644 test/unit/bptree_key.hh
create mode 100644 test/unit/bptree_validation.hh
create mode 100644 utils/array_trusted_bounds.hh
create mode 100644 utils/bptree.hh
create mode 100644 utils/double-decker.hh
create mode 100644 test/boost/array_trusted_bounds_test.cc
create mode 100644 test/boost/bptree_test.cc
create mode 100644 test/boost/double_decker_test.cc
create mode 100644 test/perf/perf_bptree.cc
create mode 100644 test/perf/perf_bptree_drain.cc
create mode 100644 test/unit/bptree_compaction_test.cc
create mode 100644 test/unit/bptree_stress_test.cc

--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>
May 6, 2020, 2:21:30 PM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The clean_now() wants to remove everything but the very last element
from the set. Reworking this into "clean everything, then put the
trailing entry back" will greatly help the B+-tree implementation, as
the latter has a linear clear operation vs a logarithmic range erase.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
row_cache.hh | 4 ++++
row_cache.cc | 31 ++++++++++++++++---------------
2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/row_cache.hh b/row_cache.hh
index 3dd90fac4..fecdeba44 100644
--- a/row_cache.hh
+++ b/row_cache.hh
@@ -478,6 +478,10 @@ class row_cache final {
//
// internal_updater is only kept alive until its invocation returns.
future<> do_update(external_updater eu, internal_updater iu) noexcept;
+
+ void init_empty(is_continuous cont);
+ void drain();
+
public:
~row_cache();
row_cache(schema_ptr, snapshot_source, cache_tracker&, is_continuous = is_continuous::no);
diff --git a/row_cache.cc b/row_cache.cc
index 0c8c96d56..805a9ac02 100644
--- a/row_cache.cc
+++ b/row_cache.cc
@@ -778,8 +778,7 @@ row_cache::make_reader(schema_ptr s,
}
}

-
-row_cache::~row_cache() {
+void row_cache::drain() {
with_allocator(_tracker.allocator(), [this] {
_partitions.clear_and_dispose([this, deleter = current_deleter<cache_entry>()] (auto&& p) mutable {
if (!p->is_dummy_entry()) {
@@ -791,15 +790,13 @@ row_cache::~row_cache() {
});
}

+row_cache::~row_cache() {
+ drain();
+}
+
void row_cache::clear_now() noexcept {
- with_allocator(_tracker.allocator(), [this] {
- auto it = _partitions.erase_and_dispose(_partitions.begin(), partitions_end(), [this, deleter = current_deleter<cache_entry>()] (auto&& p) mutable {
- _tracker.on_partition_erase();
- p->evict(_tracker);
- deleter(p);
- });
- _tracker.clear_continuity(*it);
- });
+ drain();
+ init_empty(is_continuous::no);
}

template<typename CreateEntry, typename VisitEntry>
@@ -1170,6 +1167,14 @@ void row_cache::evict() {
while (_tracker.region().evict_some() == memory::reclaiming_result::reclaimed_something) {}
}

+void row_cache::init_empty(is_continuous cont) {
+ with_allocator(_tracker.allocator(), [this, cont] {
+ cache_entry* entry = current_allocator().construct<cache_entry>(cache_entry::dummy_entry_tag());
+ _partitions.insert_before(_partitions.end(), *entry);
+ entry->set_continuous(bool(cont));
+ });
+}
+
row_cache::row_cache(schema_ptr s, snapshot_source src, cache_tracker& tracker, is_continuous cont)
: _tracker(tracker)
, _schema(std::move(s))
@@ -1177,11 +1182,7 @@ row_cache::row_cache(schema_ptr s, snapshot_source src, cache_tracker& tracker,
, _underlying(src())
, _snapshot_source(std::move(src))
{
- with_allocator(_tracker.allocator(), [this, cont] {
- cache_entry* entry = current_allocator().construct<cache_entry>(cache_entry::dummy_entry_tag());
- _partitions.insert_before(_partitions.end(), *entry);
- entry->set_continuous(bool(cont));
- });
+ init_empty(cont);
}

cache_entry::cache_entry(cache_entry&& o) noexcept
--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>
May 6, 2020, 2:21:38 PM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
test/perf/perf.hh | 71 ++++++++++++++++++++++++++++++
test/perf/perf_row_cache_update.cc | 71 ------------------------------
2 files changed, 71 insertions(+), 71 deletions(-)

diff --git a/test/perf/perf.hh b/test/perf/perf.hh
index 9de2b2541..e73ac859a 100644
--- a/test/perf/perf.hh
+++ b/test/perf/perf.hh
@@ -24,7 +24,10 @@
#include <seastar/core/print.hh>
#include <seastar/core/future-util.hh>
#include <seastar/core/distributed.hh>
+#include <seastar/core/weak_ptr.hh>
#include "seastarx.hh"
+#include "utils/extremum_tracking.hh"
+#include "utils/estimated_histogram.hh"

#include <chrono>
#include <iosfwd>
@@ -126,3 +129,71 @@ std::vector<double> time_parallel(Func func, unsigned concurrency_per_core, int
}
return results;
}
+
+template<typename Func>
+auto duration_in_seconds(Func&& f) {
+ using clk = std::chrono::steady_clock;
+ auto start = clk::now();
+ f();
+ auto end = clk::now();
+ return std::chrono::duration_cast<std::chrono::duration<float>>(end - start);
+}
+
+class scheduling_latency_measurer : public weakly_referencable<scheduling_latency_measurer> {
+ using clk = std::chrono::steady_clock;
+ clk::time_point _last = clk::now();
+ utils::estimated_histogram _hist{300};
+ min_max_tracker<clk::duration> _minmax;
+ bool _stop = false;
+private:
+ void schedule_tick();
+ void tick() {
+ auto old = _last;
+ _last = clk::now();
+ auto latency = _last - old;
+ _minmax.update(latency);
+ _hist.add(latency.count());
+ if (!_stop) {
+ schedule_tick();
+ }
+ }
+public:
+ void start() {
+ schedule_tick();
+ }
+ void stop() {
+ _stop = true;
+ later().get(); // so that the last scheduled tick is counted
+ }
+ const utils::estimated_histogram& histogram() const {
+ return _hist;
+ }
+ clk::duration min() const { return _minmax.min(); }
+ clk::duration max() const { return _minmax.max(); }
+};
+
+void scheduling_latency_measurer::schedule_tick() {
+ seastar::schedule(make_task(default_scheduling_group(), [self = weak_from_this()] () mutable {
+ if (self) {
+ self->tick();
+ }
+ }));
+}
+
+std::ostream& operator<<(std::ostream& out, const scheduling_latency_measurer& slm) {
+ auto to_ms = [] (int64_t nanos) {
+ return float(nanos) / 1e6;
+ };
+ return out << sprint("{count: %d, "
+ //"min: %.6f [ms], "
+ //"50%%: %.6f [ms], "
+ //"90%%: %.6f [ms], "
+ "99%%: %.6f [ms], "
+ "max: %.6f [ms]}",
+ slm.histogram().count(),
+ //to_ms(slm.min().count()),
+ //to_ms(slm.histogram().percentile(0.5)),
+ //to_ms(slm.histogram().percentile(0.9)),
+ to_ms(slm.histogram().percentile(0.99)),
+ to_ms(slm.max().count()));
+}
diff --git a/test/perf/perf_row_cache_update.cc b/test/perf/perf_row_cache_update.cc
index 181ce6730..2a3cbde65 100644
--- a/test/perf/perf_row_cache_update.cc
+++ b/test/perf/perf_row_cache_update.cc
@@ -19,16 +19,13 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/

-#include <chrono>
#include <seastar/core/distributed.hh>
#include <seastar/core/app-template.hh>
#include <seastar/core/sstring.hh>
#include <seastar/core/thread.hh>
-#include <seastar/core/weak_ptr.hh>
#include <seastar/core/reactor.hh>

#include "utils/managed_bytes.hh"
-#include "utils/extremum_tracking.hh"
#include "utils/logalloc.hh"
#include "row_cache.hh"
#include "log.hh"
@@ -40,74 +37,6 @@ static const int update_iterations = 16;
static const int cell_size = 128;
static bool cancelled = false;

-template<typename Func>
-auto duration_in_seconds(Func&& f) {
- using clk = std::chrono::steady_clock;
- auto start = clk::now();
- f();
- auto end = clk::now();
- return std::chrono::duration_cast<std::chrono::duration<float>>(end - start);
-}
-
-class scheduling_latency_measurer : public weakly_referencable<scheduling_latency_measurer> {
- using clk = std::chrono::steady_clock;
- clk::time_point _last = clk::now();
- utils::estimated_histogram _hist{300};
- min_max_tracker<clk::duration> _minmax;
- bool _stop = false;
-private:
- void schedule_tick();
- void tick() {
- auto old = _last;
- _last = clk::now();
- auto latency = _last - old;
- _minmax.update(latency);
- _hist.add(latency.count());
- if (!_stop) {
- schedule_tick();
- }
- }
-public:
- void start() {
- schedule_tick();
- }
- void stop() {
- _stop = true;
- later().get(); // so that the last scheduled tick is counted
- }
- const utils::estimated_histogram& histogram() const {
- return _hist;
- }
- clk::duration min() const { return _minmax.min(); }
- clk::duration max() const { return _minmax.max(); }
-};
-
-void scheduling_latency_measurer::schedule_tick() {
- seastar::schedule(make_task(default_scheduling_group(), [self = weak_from_this()] () mutable {
- if (self) {
- self->tick();
- }
- }));
-}
-
-std::ostream& operator<<(std::ostream& out, const scheduling_latency_measurer& slm) {
- auto to_ms = [] (int64_t nanos) {
- return float(nanos) / 1e6;
- };
- return out << sprint("{count: %d, "
- //"min: %.6f [ms], "
- //"50%%: %.6f [ms], "
- //"90%%: %.6f [ms], "
- "99%%: %.6f [ms], "
- "max: %.6f [ms]}",
- slm.histogram().count(),
- //to_ms(slm.min().count()),
- //to_ms(slm.histogram().percentile(0.5)),
- //to_ms(slm.histogram().percentile(0.9)),
- to_ms(slm.histogram().percentile(0.99)),
- to_ms(slm.max().count()));
-}
-
template<typename MutationGenerator>
void run_test(const sstring& name, schema_ptr s, MutationGenerator&& gen) {
cache_tracker tracker;
--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>
May 6, 2020, 2:21:49 PM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
A plain array of elements that grows and shrinks by
constructing a new instance from an existing one and
moving the elements over.

Behaves similarly to a vector's external array, but has
zero bytes of overhead. The array bounds (the 0-th and
N-th elements) are determined by checking flags on the
elements themselves, so the type must support getters
and setters for these flags.

Also comes with a lower_bound() helper that helps keep
the elements sorted, and a from_element() one that
returns a reference back to the array in which the element
sits.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
configure.py | 1 +
utils/array_trusted_bounds.hh | 237 ++++++++++++++++++++++++
test/boost/array_trusted_bounds_test.cc | 189 +++++++++++++++++++
3 files changed, 427 insertions(+)
create mode 100644 utils/array_trusted_bounds.hh
create mode 100644 test/boost/array_trusted_bounds_test.cc

diff --git a/configure.py b/configure.py
index c0e622d9b..f413f75a5 100755
--- a/configure.py
+++ b/configure.py
@@ -328,6 +328,7 @@ scylla_tests = set([
'test/boost/log_heap_test',
'test/boost/logalloc_test',
'test/boost/managed_vector_test',
+ 'test/boost/array_trusted_bounds_test',
'test/boost/map_difference_test',
'test/boost/memtable_test',
'test/boost/meta_test',
diff --git a/utils/array_trusted_bounds.hh b/utils/array_trusted_bounds.hh
new file mode 100644
index 000000000..c86bb69b3
--- /dev/null
+++ b/utils/array_trusted_bounds.hh
@@ -0,0 +1,237 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+#include <array>
+#include <cassert>
+
+#include "utils/allocation_strategy.hh"
+
+GCC6_CONCEPT(
+ template <typename T>
+ concept bool BoundsKeeper = requires (T val, bool bit) {
+ { val.is_head() } -> bool;
+ { val.set_head(bit) } -> void;
+ { val.is_tail() } -> bool;
+ { val.set_tail(bit) } -> void;
+ };
+
+ template <typename K, typename T, typename Compare>
+ concept bool Comparable = requires (const K& a, const T& b, Compare cmp) {
+ { cmp(a, b) } -> int;
+ };
+)
+
+/*
+ * A plain array of T-s that grows and shrinks by constructing new
+ * instances. Holds at least one element. Has facilities for sorting
+ * the elements and for doing "container_of" by the given element
+ * pointer. LSA-compactible.
+ *
+ * Important feature of the array is zero memory overhead -- it doesn't
+ * keep its size/capacity onboard. The size is calculated each time by
+ * walking the array of T-s and checking which one of them is the tail
+ * element. Respectively, the T must keep head/tail flags on itself.
+ */
+template <typename T>
+GCC6_CONCEPT( requires BoundsKeeper<T> && std::is_nothrow_move_constructible_v<T> )
+class array_trusted_bounds {
+ union maybe_constructed {
+ maybe_constructed() { }
+ ~maybe_constructed() { }
+ T object;
+ };
+
+ maybe_constructed _data[1];
+
+ int _number_of_elements() const {
+ for (int i = 0; ; i++) {
+ if (_data[i].object.is_tail()) {
+ return i + 1;
+ }
+ }
+
+ assert(false);
+ }
+
+public:
+ using iterator = T*;
+
+ /*
+ * There are 3 constructing options for the array: initial, grow
+ * and shrink.
+ *
+ * * initial just creates a 1-element array
+ * * grow -- makes a new one moving all elements from the original
+ * array and inserting the one (only one) more element at the given
+ * position
+ * * shrink -- also makes a new array skipping the not needed
+ * element while moving them from the original one
+ *
+ * In all cases a sufficiently big memory chunk must be provided by the
+ * caller!
+ *
+ * Note, that none of them calls destructors on T-s, unlike vector.
+ * This is because when the older array is destroyed it has no idea
+ * about whether or not it was grown/shrunk and thus it destroys
+ * T-s itself.
+ */
+
+ // Initial
+ template <typename... Args>
+ array_trusted_bounds(Args&&... args) {
+ new (&_data[0].object) T(std::forward<Args>(args)...);
+ _data[0].object.set_head(true);
+ _data[0].object.set_tail(true);
+ }
+
+ // Growing
+ template <typename... Args>
+ array_trusted_bounds(array_trusted_bounds& from, int add_pos, Args&&... args) {
+ // The add_pos is strongly _expected_ to be within bounds
+ int i, off = 0;
+ bool tail = false;
+
+ for (i = 0; !tail; i++) {
+ if (i == add_pos) {
+ off = 1;
+ continue;
+ }
+
+ tail = from._data[i - off].object.is_tail();
+ new (&_data[i].object) T(std::move(from._data[i - off].object));
+ }
+
+ new (&_data[add_pos].object) T(std::forward<Args>(args)...);
+
+ _data[0].object.set_head(true);
+ if (add_pos == 0) {
+ _data[1].object.set_head(false);
+ }
+ _data[i - off].object.set_tail(true);
+ if (off == 0) {
+ _data[i - 1].object.set_tail(false);
+ }
+ }
+
+ // Shrinking
+ // The shrink_tag is to explicitly distinguish from grow-constructor
+ struct shrink_tag{};
+
+ array_trusted_bounds(array_trusted_bounds& from, int del_pos, struct shrink_tag) {
+ int i, off = 0;
+ bool tail = false;
+
+ for (i = 0; !tail; i++) {
+ tail = from._data[i].object.is_tail();
+ if (i == del_pos) {
+ off = 1;
+ } else {
+ new (&_data[i - off].object) T(std::move(from._data[i].object));
+ }
+ }
+
+ _data[0].object.set_head(true);
+ _data[i - off - 1].object.set_tail(true);
+ }
+
+ array_trusted_bounds(const array_trusted_bounds& other) = delete;
+ array_trusted_bounds(array_trusted_bounds&& other) noexcept {
+ bool tail = false;
+
+ for (int i = 0; !tail; i++) {
+ tail = other._data[i].object.is_tail();
+
+ new (&_data[i].object) T(std::move(other._data[i].object));
+ }
+ }
+
+ ~array_trusted_bounds() {
+ bool tail = false;
+
+ for (int i = 0; !tail; i++) {
+ tail = _data[i].object.is_tail();
+ _data[i].object.~T();
+ }
+ }
+
+ T& operator[](int pos) noexcept { return _data[pos].object; }
+ const T& operator[](int pos) const noexcept { return _data[pos].object; }
+
+ iterator begin() noexcept { return &_data[0].object; }
+ iterator end() noexcept { return &_data[_number_of_elements()].object; }
+
+ int index_of(iterator i) { return i - &_data[0].object; }
+ bool is_single_element() const { return _data[0].object.is_tail(); }
+
+ /* A helper for keeping the array sorted */
+ template <typename K, typename Compare>
+ GCC6_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator lower_bound(const K& val, Compare cmp, bool& match) {
+ int i = 0;
+
+ do {
+ int x = cmp(_data[i].object, val);
+ if (x >= 0) {
+ match = (x == 0);
+ break;
+ }
+ } while (!_data[i++].object.is_tail());
+
+ return &_data[i].object;
+ }
+
+ template <typename K, typename Compare>
+ GCC6_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator lower_bound(const K& val, Compare cmp) {
+ bool match = false;
+ return lower_bound(val, cmp, match);
+ }
+
+ template <typename Func>
+ GCC6_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ void for_each(Func&& fn) {
+ bool tail = false;
+
+ for (int i = 0; !tail; i++) {
+ tail = _data[i].object.is_tail();
+ fn(_data[i].object);
+ }
+ }
+
+ size_t storage_size() const { return size_t(_number_of_elements() * sizeof(T)); }
+ size_t size() { return _number_of_elements(); }
+
+ friend size_t size_for_allocation_strategy(const array_trusted_bounds& obj) {
+ return obj.storage_size();
+ }
+
+ static array_trusted_bounds& from_element(T* ptr, int& idx) {
+ while (!ptr->is_head()) {
+ idx++;
+ ptr--;
+ }
+
+ static_assert(offsetof(array_trusted_bounds, _data[0].object) == 0);
+ return *reinterpret_cast<array_trusted_bounds*>(ptr);
+ }
+};
diff --git a/test/boost/array_trusted_bounds_test.cc b/test/boost/array_trusted_bounds_test.cc
new file mode 100644
index 000000000..27dc1ffc3
--- /dev/null
+++ b/test/boost/array_trusted_bounds_test.cc
@@ -0,0 +1,189 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <boost/test/unit_test.hpp>
+#include <seastar/testing/thread_test_case.hh>
+#include <fmt/core.h>
+
+#include "utils/array_trusted_bounds.hh"
+#include "utils/logalloc.hh"
+
+class element {
+ bool _head;
+ bool _tail;
+ long _data;
+ int *_cookie;
+ int *_cookie2;
+
+public:
+ explicit element(long val) : _head(false), _tail(false), _data(val),
+ _cookie(new int(0)), _cookie2(new int(0)) { }
+
+ element(const element& other) = delete;
+ element(element&& other) noexcept : _head(other._head), _tail(other._tail), _data(other._data),
+ _cookie(other._cookie), _cookie2(new int(0)) {
+ other._cookie = nullptr;
+ }
+
+ ~element() {
+ if (_cookie != nullptr) {
+ delete _cookie;
+ }
+
+ delete _cookie2;
+ }
+
+ bool is_head() const { return _head; }
+ void set_head(bool v) { _head = v; }
+ bool is_tail() const { return _tail; }
+ void set_tail(bool v) { _tail = v; }
+
+ bool operator==(long v) const { return v == _data; }
+ long operator*() const { return _data; }
+
+ bool bound_check(int idx, int size) {
+ return ((idx == 0 && is_head()) || (idx != 0 && !is_head())) &&
+ ((idx == size - 1 && is_tail()) || (idx != size - 1 && !is_tail()));
+ }
+};
+
+using test_array = array_trusted_bounds<element>;
+
+void show(test_array& a, int sz) {
+ for (int i = 0; i < sz; i++) {
+ fmt::print("{}{}{}", a[i].is_head() ? 'H' : ' ', *a[i], a[i].is_tail() ? 'T' : ' ');
+ }
+ fmt::print("\n");
+}
+
+SEASTAR_THREAD_TEST_CASE(test_basic_construct) {
+ test_array array(12);
+
+ for (auto i = array.begin(); i != array.end(); i++) {
+ BOOST_REQUIRE(*i == 12);
+ }
+}
+
+test_array* grow(test_array& from, int nsize, int npos, long ndat) {
+ auto ptr = current_allocator().alloc(&get_standard_migrator<test_array>(), sizeof(element) * nsize, alignof(test_array));
+ return new (ptr) test_array(from, npos, ndat);
+}
+
+test_array* shrink(test_array& from, int nsize, int spos) {
+ auto ptr = current_allocator().alloc(&get_standard_migrator<test_array>(), sizeof(element) * nsize, alignof(test_array));
+ return new (ptr) test_array(from, spos, test_array::shrink_tag{});
+}
+
+void grow_shrink_and_check(test_array& cur, int size) {
+ for (int i = 0; i <= size; i++) {
+ long nel = size + 12;
+ test_array* narr = grow(cur, size + 1, i, nel);
+ int idx = 0;
+
+ for (auto ni = narr->begin(); ni != narr->end(); ni++) {
+ if (idx == i) {
+ BOOST_REQUIRE(*ni == nel);
+ } else if (idx < i) {
+ BOOST_REQUIRE(*ni == *cur[idx]);
+ } else {
+ BOOST_REQUIRE(*ni == *cur[idx - 1]);
+ }
+
+ BOOST_REQUIRE(ni->bound_check(idx, size + 1));
+ idx++;
+ }
+
+ if (size < 5) {
+ grow_shrink_and_check(*narr, size + 1);
+ }
+
+ current_allocator().destroy(narr);
+ }
+
+ if (size > 1) {
+ for (int i = 0; i < size; i++) {
+ test_array* narr = shrink(cur, size - 1, i);
+ int idx = 0;
+
+ for (auto ni = narr->begin(); ni != narr->end(); ni++) {
+ if (idx == i) {
+ continue;
+ } else if (idx < i) {
+ BOOST_REQUIRE(*ni == *cur[idx]);
+ } else {
+ BOOST_REQUIRE(*ni == *cur[idx + 1]);
+ }
+
+ BOOST_REQUIRE(ni->bound_check(idx, size - 1));
+ idx++;
+ }
+
+ current_allocator().destroy(narr);
+ }
+ }
+}
+
+SEASTAR_THREAD_TEST_CASE(test_grow_shrink_construct) {
+ test_array array(12);
+ grow_shrink_and_check(array, 1);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_lower_bound) {
+ test_array a1(12);
+ struct compare {
+ int operator()(const element& a, const element& b) const { return *a - *b; }
+ };
+
+ test_array *a2 = grow(a1, 2, 1, 14);
+
+ auto i = a2->lower_bound(element(13), compare{});
+ BOOST_REQUIRE(*i == 14 && a2->index_of(i) == 1);
+
+ test_array *a3 = grow(*a2, 3, 2, 17);
+
+ bool match;
+ BOOST_REQUIRE(*a3->lower_bound(element(11), compare{}, match) == 12 && !match);
+ BOOST_REQUIRE(*a3->lower_bound(element(12), compare{}, match) == 12 && match);
+ BOOST_REQUIRE(*a3->lower_bound(element(13), compare{}, match) == 14 && !match);
+ BOOST_REQUIRE(*a3->lower_bound(element(14), compare{}, match) == 14 && match);
+ BOOST_REQUIRE(*a3->lower_bound(element(15), compare{}, match) == 17 && !match);
+ BOOST_REQUIRE(*a3->lower_bound(element(16), compare{}, match) == 17 && !match);
+ BOOST_REQUIRE(*a3->lower_bound(element(17), compare{}, match) == 17 && match);
+ BOOST_REQUIRE(a3->lower_bound(element(18), compare{}, match) == a3->end());
+
+ current_allocator().destroy(a3);
+ current_allocator().destroy(a2);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_from_element) {
+ test_array a1(12);
+ test_array *a2 = grow(a1, 2, 1, 14);
+ test_array *a3 = grow(*a2, 3, 2, 17);
+
+ element* i = &((*a3)[2]);
+ BOOST_REQUIRE(*i == 17);
+ int idx = 0;
+ test_array& x = test_array::from_element(i, idx);
+ BOOST_REQUIRE(&x == a3 && idx == 2);
+
+ current_allocator().destroy(a3);
+ current_allocator().destroy(a2);
+}
--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>
May 6, 2020, 2:21:49 PM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
// The story is at
// https://groups.google.com/forum/#!msg/scylladb-dev/sxqTHM9rSDQ/WqwF1AQDAQAJ

This is the B+ version which satisfies several specific requirements
to be suitable for row-cache usage.

1. Insert/Remove doesn't invalidate iterators
2. Elements should be LSA-compactable
3. Low overhead of data nodes (1 pointer)
4. External less-only comparator
5. As little actions on insert/delete as possible
6. Iterator walks the sorted keys

The design, briefly, is:

There are 3 types of nodes -- inner, leaf and data. Inner and leaf
nodes keep a built-in array of N keys and N(+1) node pointers. Leaf
nodes sit in a doubly linked list. Data nodes live separately from the
leaf ones and keep pointers to them. The tree handle keeps pointers to
the root and to the left-most and right-most leaves. Nodes do _not_
keep pointers or references to the tree (except 3 of them, see below).

Update in v6:

- Insertion tries to push kids to siblings before splitting

Before this change, insertion into a full node resulted in the
node being split into two equal parts. Under a random-keys stress
this behaviour gives a tree with ~2/3 of the nodes half-filled.

With this change, before splitting, the full node tries to push one
element to each of its siblings (if they exist and are not full).
This slows insertion down a bit (but it's still way faster than
std::set), and gives 15% fewer nodes in total.

- Iterator method to reconstruct the data at the given position

The helper creates a new data node, emplaces data into it and
replaces the iterator's current one with it. Needed to keep arrays
of data in the tree.

- Milli-optimize erase()
- Return back an iterator that will likely not be re-validated
- Do not try to update the ancestors' separation key for the leftmost kid

This caused clear()-like workloads to perform poorly as compared to
std::set. In particular, the row_cache::invalidate() method does
exactly this, and this change improves its timing.

- Perf test to measure drain speed
- Helper call to collect tree counters

Update in v5:

- Fix corner case of iterator.emplace_before()
- Clean heterogeneous lookup API
- Handle exceptions from nodes allocations
- Explicitly mark places where the key is copied (for future)
- Extend the tree.lower_bound() API to report back whether
the bound hit the key or not
- Addressed style/cleanness review comments

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
configure.py | 10 +
test/unit/bptree_key.hh | 101 ++
test/unit/bptree_validation.hh | 318 +++++
utils/bptree.hh | 1862 +++++++++++++++++++++++++++
test/boost/bptree_test.cc | 344 +++++
test/perf/perf_bptree.cc | 165 +++
test/perf/perf_bptree_drain.cc | 154 +++
test/unit/bptree_compaction_test.cc | 210 +++
test/unit/bptree_stress_test.cc | 236 ++++
9 files changed, 3400 insertions(+)
create mode 100644 test/unit/bptree_key.hh
create mode 100644 test/unit/bptree_validation.hh
create mode 100644 utils/bptree.hh
create mode 100644 test/boost/bptree_test.cc
create mode 100644 test/perf/perf_bptree.cc
create mode 100644 test/perf/perf_bptree_drain.cc
create mode 100644 test/unit/bptree_compaction_test.cc
create mode 100644 test/unit/bptree_stress_test.cc

diff --git a/configure.py b/configure.py
index 794c19541..c0e622d9b 100755
--- a/configure.py
+++ b/configure.py
@@ -381,6 +381,7 @@ scylla_tests = set([
'test/boost/view_schema_ckey_test',
'test/boost/vint_serialization_test',
'test/boost/virtual_reader_test',
+ 'test/boost/bptree_test',
'test/manual/ec2_snitch_test',
'test/manual/gce_snitch_test',
'test/manual/gossip',
@@ -398,6 +399,8 @@ scylla_tests = set([
'test/perf/perf_fast_forward',
'test/perf/perf_hash',
'test/perf/perf_mutation',
+ 'test/perf/perf_bptree',
+ 'test/perf/perf_bptree_drain',
'test/perf/perf_row_cache_update',
'test/perf/perf_simple_query',
'test/perf/perf_sstable',
@@ -405,6 +408,8 @@ scylla_tests = set([
'test/unit/lsa_sync_eviction_test',
'test/unit/row_cache_alloc_stress_test',
'test/unit/row_cache_stress_test',
+ 'test/unit/bptree_stress_test',
+ 'test/unit/bptree_compaction_test',
])

perf_tests = set([
@@ -939,6 +944,7 @@ pure_boost_tests = set([
'test/boost/small_vector_test',
'test/boost/top_k_test',
'test/boost/vint_serialization_test',
+ 'test/boost/bptree_test',
'test/manual/json_test',
'test/manual/streaming_histogram_test',
])
@@ -952,10 +958,14 @@ tests_not_using_seastar_test_framework = set([
'test/perf/perf_cql_parser',
'test/perf/perf_hash',
'test/perf/perf_mutation',
+ 'test/perf/perf_bptree',
+ 'test/perf/perf_bptree_drain',
'test/perf/perf_row_cache_update',
'test/unit/lsa_async_eviction_test',
'test/unit/lsa_sync_eviction_test',
'test/unit/row_cache_alloc_stress_test',
+ 'test/unit/bptree_stress_test',
+ 'test/unit/bptree_compaction_test',
'test/manual/sstable_scan_footprint_test',
]) | pure_boost_tests

diff --git a/test/unit/bptree_key.hh b/test/unit/bptree_key.hh
new file mode 100644
index 000000000..54347a54f
--- /dev/null
+++ b/test/unit/bptree_key.hh
@@ -0,0 +1,101 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+/*
+ * Helper class used to check that the tree
+ * - works with keys without a default constructor
+ * - moves the keys around properly
+ */
+class test_key {
+ int _val;
+ int* _cookie;
+ int* _p_cookie;
+
+public:
+ bool is_alive() const {
+ if (_val == -1) {
+ fmt::print("key value is reset\n");
+ return false;
+ }
+
+ if (_cookie == nullptr) {
+ fmt::print("key cookie is reset\n");
+ return false;
+ }
+
+ if (*_cookie != 0) {
+ fmt::print("key cookie value is corrupted {}\n", *_cookie);
+ return false;
+ }
+
+ return true;
+ }
+
+ bool less(const test_key& o) const {
+ return _val < o._val;
+ }
+
+ explicit test_key(int nr) : _val(nr) {
+ _cookie = new int(0);
+ _p_cookie = new int(1);
+ }
+
+ operator int() const { return _val; }
+
+ test_key& operator=(const test_key& other) = delete;
+ test_key& operator=(test_key&& other) = delete;
+
+private:
+ /*
+ * Keep this private to make bptree.hh explicitly call
+ * copy_key() in the places where the key is copied
+ */
+ test_key(const test_key& other) noexcept : _val(other._val) {
+ _cookie = new int(*other._cookie);
+ _p_cookie = new int(*other._p_cookie);
+ }
+
+ friend test_key copy_key(const test_key&);
+
+public:
+ test_key(test_key&& other) noexcept : _val(other._val) {
+ other._val = -1;
+ _cookie = other._cookie;
+ other._cookie = nullptr;
+ _p_cookie = new int(*other._p_cookie);
+ }
+
+ ~test_key() {
+ if (_cookie != nullptr) {
+ delete _cookie;
+ }
+ assert(_p_cookie != nullptr);
+ delete _p_cookie;
+ }
+};
+
+test_key copy_key(const test_key& other) { return test_key(other); }
+
+struct test_key_compare {
+ bool operator()(const test_key& a, const test_key& b) const { return a.less(b); }
+};
diff --git a/test/unit/bptree_validation.hh b/test/unit/bptree_validation.hh
new file mode 100644
index 000000000..cdf137bda
--- /dev/null
+++ b/test/unit/bptree_validation.hh
@@ -0,0 +1,318 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+namespace bplus {
+
+template <typename K, typename T, typename Less, int NodeSize>
+class validator {
+ using tree = class tree<K, T, Less, NodeSize, key_search::both, with_debug::yes>;
+ using node = typename tree::node;
+
+ void validate_node(const tree& t, const node& n, int& prev, int& min, bool is_root);
+ void validate_list(const tree& t);
+
+public:
+ void print_tree(const tree& t, char pfx) const {
+ fmt::print("/ {} <- | {} | -> {}\n", t._left->id(), t._root->id(), t._right->id());
+ print_node(*t._root, pfx, 2);
+ fmt::print("\\\n");
+ }
+
+ void print_node(const node& n, char pfx, int indent) const {
+ int i;
+
+ fmt::print("{:<{}c}{:s} {:d} ({:d} keys, {:x} flags):", pfx, indent,
+ n.is_leaf() ? "leaf" : "node", n.id(), n._num_keys, n._flags);
+ if (n.is_leaf()) {
+ for (i = 0; i < n._num_keys; i++) {
+ fmt::print(" {}", (int)n._keys[i].v);
+ }
+ fmt::print("\n");
+
+ return;
+ }
+ fmt::print("\n");
+
+ if (n._kids[0].n != nullptr) {
+ print_node(*n._kids[0].n, pfx, indent + 2);
+ }
+ for (i = 0; i < n._num_keys; i++) {
+ fmt::print("{:<{}c}---{}---\n", pfx, indent, (int)n._keys[i].v);
+ print_node(*n._kids[i + 1].n, pfx, indent + 2);
+ }
+ }
+
+ void validate(const tree& t);
+};
+
+
+template <typename K, typename T, typename L, int NS>
+void validator<K, T, L, NS>::validate_node(const tree& t, const node& n, int& prev_key, int& min_key, bool is_root) {
+ int i;
+
+ if (n.is_root() != is_root) {
+ fmt::print("node {} needs to {} root, but {}\n", n.id(), is_root ? "be" : "not be", n._flags);
+ throw "root broken";
+ }
+
+ for (i = 0; i < n._num_keys; i++) {
+ if (!n._keys[i].v.is_alive()) {
+ fmt::print("node {} key {} is not alive\n", n.id(), i);
+ throw "key dead";
+ }
+ }
+
+ if (n.is_leaf()) {
+ for (i = 0; i < n._num_keys; i++) {
+ if (t._less(n._keys[i].v, K(prev_key))) {
+ fmt::print("node misordered @{} (prev {})\n", (int)n._keys[i].v, prev_key);
+ throw "misorder";
+ }
+ if (n._kids[i + 1].d->_leaf != &n) {
+ fmt::print("data mispoint\n");
+ throw "data backlink";
+ }
+
+ prev_key = n._keys[i].v;
+ if (!n._kids[i + 1].d->value.match_key(n._keys[i].v)) {
+ fmt::print("node value corrupted @{:d}.{:d}\n", n.id(), i);
+ throw "data corruption";
+ }
+ }
+
+ if (n._num_keys > 0) {
+ min_key = (int)n._keys[0].v;
+ }
+ } else if (n._num_keys > 0) {
+ node* k = n._kids[0].n;
+
+ if (k->_parent != &n) {
+ fmt::print("node {:d} -parent-> {:d}, expect {:d}\n", k->id(), k->_parent->id(), n.id());
+ throw "mis-parented node";
+ }
+ validate_node(t, *k, prev_key, min_key, false);
+ for (i = 0; i < n._num_keys; i++) {
+ k = n._kids[i + 1].n;
+ if (k->_parent != &n) {
+ fmt::print("node {:d} -parent-> {:d}, expect {:d}\n",
+ k->id(), k->_parent ? k->_parent->id() : -1, n.id());
+ throw "mis-parented node";
+ }
+ if (t._less(k->_keys[0].v, n._keys[i].v)) {
+ fmt::print("node {:d}.{:d}, separation key {}, kid has {}\n", n.id(), k->id(),
+ (int)n._keys[i].v, (int)k->_keys[0].v);
+ throw "separation key mismatch";
+ }
+
+ int min = 0;
+ validate_node(t, *k, prev_key, min, false);
+ if (t._less(n._keys[i].v, K(min)) || t._less(K(min), n._keys[i].v)) {
+ fmt::print("node {:d}.[{:d}]{:d}, separation key {}, min {}\n",
+ n.id(), i, k->id(), (int)n._keys[i].v, min);
+ if (strict_separation_key || t._less(K(min), n._keys[i].v)) {
+ throw "separation key screw";
+ }
+ }
+ }
+ }
+}
+
+template <typename K, typename T, typename L, int NS>
+void validator<K, T, L, NS>::validate_list(const tree& t) {
+ int prev = 0;
+
+ node* lh = t.left_leaf_slow();
+ node* rh = t.right_leaf_slow();
+
+ if (lh != t._left) {
+ fmt::print("left {:d}, slow {:d}\n", t._left->id(), lh->id());
+ throw "list broken";
+ }
+
+ if (!(lh->_flags & node::NODE_LEFTMOST)) {
+ fmt::print("left {:d} is not marked as such {}\n", t._left->id(), t._left->_flags);
+ throw "list broken";
+ }
+
+ if (rh != t._right) {
+ fmt::print("right {:d}, slow {:d}\n", t._right->id(), rh->id());
+ throw "list broken";
+ }
+
+ if (!(rh->_flags & node::NODE_RIGHTMOST)) {
+ fmt::print("right {:d} is not marked as such {}\n", t._right->id(), t._right->_flags);
+ throw "list broken";
+ }
+
+ node* r = lh;
+ while (1) {
+ node *ln;
+
+ if (!r->is_rightmost()) {
+ ln = r->get_next();
+ if (ln->get_prev() != r) {
+ fmt::print("next leaf {:d} points to {:d}, expect {:d}\n", ln->id(), ln->get_prev()->id(), r->id());
+ throw "list broken";
+ }
+ } else if (r->_rightmost_tree != &t) {
+ fmt::print("right leaf doesn't point to tree\n");
+ throw "list broken";
+ }
+
+ if (!r->is_leftmost()) {
+ ln = r->get_prev();
+ if (ln->get_next() != r) {
+ fmt::print("prev leaf {:d} points to {:d}, expect {:d}\n", ln->id(), ln->get_next()->id(), r->id());
+ throw "list broken";
+ }
+ } else if (r->_kids[0]._leftmost_tree != &t) {
+ fmt::print("left leaf doesn't point to tree\n");
+ throw "list broken";
+ }
+
+ if (r->_num_keys > 0 && t._less(r->_keys[0].v, K(prev))) {
+ fmt::print("list misorder on element {:d}, keys {}..., prev {:d}\n", r->id(), (int)r->_keys[0].v, prev);
+ throw "list broken";
+ }
+
+ if (!r->is_root() && r->_parent != nullptr) {
+ const auto p = r->_parent;
+ int i = p->index_for(r->_keys[0].v, t._less);
+ if (i > 0) {
+ if (p->_kids[i - 1].n != r->get_prev()) {
+ fmt::print("list misorder on parent check: node {:d}.{:d}, parent prev {:d}, list prev {:d}\n",
+ p->id(), r->id(), p->_kids[i - 1].n->id(), r->get_prev()->id());
+ throw "list broken";
+ }
+ }
+ if (i < p->_num_keys - 1) {
+ if (p->_kids[i + 1].n != r->get_next()) {
+ fmt::print("list misorder on parent check: node {:d}.{:d}, parent next {:d}, list next {:d}\n",
+ p->id(), r->id(), p->_kids[i + 1].n->id(), r->get_next()->id());
+ throw "list broken";
+ }
+ }
+ }
+
+ if (r->_num_keys > 0) {
+ prev = (int)r->_keys[r->_num_keys - 1].v;
+ }
+
+ if (r != t._left && r != t._right && (r->_flags & (node::NODE_LEFTMOST | node::NODE_RIGHTMOST))) {
+ fmt::print("middle {:d} is marked as left/right {}\n", r->id(), r->_flags);
+ throw "list broken";
+ }
+
+ if (r->is_rightmost()) {
+ break;
+ }
+
+ r = r->get_next();
+ }
+}
+
+template <typename K, typename T, typename L, int NS>
+void validator<K, T, L, NS>::validate(const tree& t) {
+ try {
+ validate_list(t);
+ int min = 0, prev = 0;
+ if (t._root->_root_tree != &t) {
+ fmt::print("root doesn't point to tree\n");
+ throw "root broken";
+ }
+
+ validate_node(t, *t._root, prev, min, true);
+ } catch (...) {
+ print_tree(t, '|');
+ fmt::print("[ ");
+ node* lh = t._left;
+ while (1) {
+ fmt::print(" {:d}", lh->id());
+ if (lh->is_rightmost()) {
+ break;
+ }
+ lh = lh->get_next();
+ }
+ fmt::print("]\n");
+ throw;
+ }
+}
+
+template <typename K, typename T, typename Less, int NodeSize>
+class iterator_checker {
+ using tree = class tree<K, T, Less, NodeSize, key_search::both, with_debug::yes>;
+
+ validator<K, T, Less, NodeSize>& _tv;
+ tree& _t;
+ typename tree::iterator _fwd, _fend;
+ T _fprev;
+
+public:
+ iterator_checker(validator<K, T, Less, NodeSize>& tv, tree& t) : _tv(tv), _t(t),
+ _fwd(t.begin()), _fend(t.end()) {
+ }
+
+ bool step() {
+ try {
+ return forward_check();
+ } catch(...) {
+ _tv.print_tree(_t, ':');
+ throw;
+ }
+ }
+
+ bool here(const K& k) {
+ return _fwd != _fend && _fwd->match_key(k);
+ }
+
+private:
+ bool forward_check() {
+ if (_fwd == _fend) {
+ return false;
+ }
+ _fwd++;
+ if (_fwd == _fend) {
+ return false;
+ }
+ T val = *_fwd;
+ _fwd++;
+ if (_fwd == _fend) {
+ return false;
+ }
+ _fwd--;
+ if (val != *_fwd) {
+ fmt::print("Iterator broken, {:d} != {:d}\n", val, *_fwd);
+ throw "iterator";
+ }
+ if (val < _fprev) {
+ fmt::print("Iterator broken, {:d} < {:d}\n", val, _fprev);
+ throw "iterator";
+ }
+ _fprev = val;
+
+ return true;
+ }
+};
+
+} // namespace
+
diff --git a/utils/bptree.hh b/utils/bptree.hh
new file mode 100644
index 000000000..165b61ced
--- /dev/null
+++ b/utils/bptree.hh
@@ -0,0 +1,1862 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+#include <boost/intrusive/parent_from_member.hpp>
+#include <seastar/util/defer.hh>
+#include <cassert>
+#include "utils/logalloc.hh"
+
+namespace bplus {
+
+enum class with_debug { no, yes };
+
+/*
+ * Linear search in a sorted array of keys slightly beats the
+ * binary one at small sizes. For debugging purposes both methods
+ * should be used (and their results must coincide).
+ */
+enum class key_search { linear, binary, both };
+
+/*
+ * The node_id class is purely a debugging thing -- when reading
+ * the validator print-outs it is much handier to look at IDs
+ * consisting of 1-3 digits than at the 16 hex digits of a printed pointer
+ */
+template <with_debug D>
+struct node_id {
+ int operator()() const { return reinterpret_cast<uintptr_t>(this); }
+};
+
+template <>
+struct node_id<with_debug::yes> {
+ unsigned int _id;
+ static unsigned int _next() {
+ static std::atomic<unsigned int> rover {1};
+ return rover.fetch_add(1);
+ }
+
+ node_id() : _id(_next()) {}
+ int operator()() const { return _id; }
+};
+
+/*
+ * This wrapper prevents the value from being default-constructed
+ * when its container is created. The intended usage is to wrap
+ * elements of static arrays or containers with .emplace() methods
+ * that can live some time without the value in it.
+ *
+ * Similarly, the value is _not_ automatically destructed when this
+ * thing is, so ~Vaue() must be called by hands. For this there is the
+ * .remove() method and two helpers for common cases -- std::move-ing
+ * the value into another maybe-location (.emplace(maybe&&)) and
+ * constructing the new in place of the existing one (.replace(args...))
+ */
+template <typename Value>
+union maybe {
+ Value v;
+ maybe() noexcept {}
+ ~maybe() {}
+
+ void reset() { v.~Value(); }
+
+ /*
+ * Constructs the value inside the empty maybe wrapper.
+ */
+ template <typename... Args>
+ void emplace(Args&&... args) {
+ new (&v) Value (std::forward<Args>(args)...);
+ }
+
+ /*
+ * The special-case handling of moving some other alive maybe-value.
+ * Calls the source destructor after the move.
+ */
+ void emplace(maybe&& other) {
+ new (&v) Value(std::move(other.v));
+ other.reset();
+ }
+
+ /*
+ * Similar to emplace, but to be used on the alive maybe.
+ * Calls the destructor on it before constructing the new value.
+ */
+ template <typename... Args>
+ void replace(Args&&... args) {
+ reset();
+ emplace(std::forward<Args>(args)...);
+ }
+
+ void replace(maybe&& other) = delete; // not to be called by chance
+};
+
+// For .{do_something_with_data}_and_dispose methods below
+template <typename T>
+void default_dispose(T& value) { }
+
+/*
+ * Helper to explicitly capture all key copying.
+ * See test_key for more information.
+ */
+template <typename Key>
+GCC6_CONCEPT(requires std::is_nothrow_copy_constructible_v<Key>)
+Key copy_key(const Key& other) {
+ return Key(other);
+}
+
+/*
+ * Consider a small 2-level tree like this
+ *
+ * [ . 5 . ]
+ * | |
+ * +------+ +-----+
+ * | |
+ * [ 1 . 2 . 3 . ] [ 5 . 6 . 7 . ]
+ *
+ * And we remove key 5 from it. First -- the key is removed
+ * from the leaf entry
+ *
+ * [ . 5 . ]
+ * | |
+ * +------+ +-----+
+ * | |
+ * [ 1 . 2 . 3 . ] [ 6 . 7 . ]
+ *
+ * At this point we have a choice -- whether or not to update
+ * the separation key on the parent (root). Strictly speaking,
+ * the whole tree is correct now -- all the keys on the right
+ * are greater-or-equal than their separation key, though the
+ * "equal" never happens.
+ *
+ * This can be problematic if the keys are stored on data nodes
+ * and are referenced from the (non-)leaf nodes. In this case
+ * the separation key must be updated to point to some real key
+ * in its sub-tree.
+ *
+ * [ . 6 . ] <--- this key updated
+ * | |
+ * +------+ +-----+
+ * | |
+ * [ 1 . 2 . 3 . ] [ 6 . 7 . ]
+ *
+ * As this update takes some time, this behaviour is tunable.
+ *
+ */
+constexpr bool strict_separation_key = true;
+
+/*
+ * This is for testing, validator will be everybody's friend
+ * to have rights to check if the tree is internally correct.
+ */
+template <typename Key, typename T, typename Less, int NodeSize> class validator;
+template <with_debug Debug> class statistics;
+
+template <typename Key, typename T, typename Less, int NodeSize, key_search Search, with_debug Debug> class node;
+template <typename Key, typename T, typename Less, int NodeSize, key_search Search, with_debug Debug> class data;
+
+/*
+ * The tree itself.
+ * Equipped with O(1) (small-constant) begin() and end()
+ * and an iterator that scans through the sorted keys and is not
+ * invalidated on insert/remove.
+ *
+ * The NodeSize parameter describes the number of keys to be
+ * held on each node. Inner nodes will thus have N+1 sub-trees,
+ * leaf nodes will have N data pointers.
+ */
+
+GCC6_CONCEPT(
+ template <typename Key1, typename Key2, typename Less>
+ concept bool LessComparable = requires (const Key1& a, const Key2& b, Less less) {
+ { less(a, b) } -> bool;
+ { less(b, a) } -> bool;
+ };
+
+ template <typename T, typename Key>
+ concept bool CanGetKeyFromValue = requires (T val) {
+ { val.key() } -> Key;
+ };
+)
+
+struct stats {
+ unsigned long nodes;
+ std::vector<unsigned long> nodes_filled;
+ unsigned long leaves;
+ std::vector<unsigned long> leaves_filled;
+ unsigned long datas;
+};
+
+template <typename Key, typename T, typename Less, int NodeSize,
+ key_search Search = key_search::binary, with_debug Debug = with_debug::no>
+GCC6_CONCEPT( requires LessComparable<Key, Key, Less> &&
+ std::is_nothrow_move_constructible_v<Key> &&
+ std::is_nothrow_move_constructible_v<T>
+)
+class tree {
+public:
+ class iterator;
+ friend class validator<Key, T, Less, NodeSize>;
+ friend class node<Key, T, Less, NodeSize, Search, Debug>;
+
+ // Sanity check: do not allow the slow key search in non-debug mode
+ static_assert(Debug == with_debug::yes || Search != key_search::both);
+
+ using node = class node<Key, T, Less, NodeSize, Search, Debug>;
+ using data = class data<Key, T, Less, NodeSize, Search, Debug>;
+
+private:
+
+ node* _root = nullptr;
+ node* _left = nullptr;
+ node* _right = nullptr;
+ Less _less;
+
+ template <typename K>
+ node& find_leaf_for(const K& k) const {
+ node* cur = _root;
+
+ while (!cur->is_leaf()) {
+ int i = cur->index_for(k, _less);
+ cur = cur->_kids[i].n;
+ }
+
+ return *cur;
+ }
+
+ void maybe_init_empty_tree() {
+ if (_root != nullptr) {
+ return;
+ }
+
+ node* n = node::create();
+ n->_flags |= node::NODE_LEAF | node::NODE_ROOT | node::NODE_RIGHTMOST | node::NODE_LEFTMOST;
+ do_set_root(n);
+ do_set_left(n);
+ do_set_right(n);
+ }
+
+ node* left_leaf_slow() const {
+ node* cur = _root;
+ while (!cur->is_leaf()) {
+ cur = cur->_kids[0].n;
+ }
+ return cur;
+ }
+
+ node* right_leaf_slow() const {
+ node* cur = _root;
+ while (!cur->is_leaf()) {
+ cur = cur->_kids[cur->_num_keys].n;
+ }
+ return cur;
+ }
+
+ template <typename K>
+ iterator get_bound(const K& k, bool& upper) {
+ maybe_init_empty_tree();
+
+ node& n = find_leaf_for(k);
+ int i = n.index_for(k, _less);
+
+ /*
+ * The element at i (key at i - 1) is less than or equal to k,
+ * the next element is greater. Mind the corner cases.
+ */
+
+ if (i == 0) {
+ assert(n.is_leftmost());
+ return begin();
+ } else if (i <= n._num_keys) {
+ iterator cur = iterator(n._kids[i].d, i);
+ if (upper || _less(n._keys[i - 1].v, k)) {
+ cur++;
+ } else {
+ // Here 'upper' becomes 'match'
+ upper = true;
+ }
+
+ return cur;
+ } else {
+ assert(n.is_rightmost());
+ return end();
+ }
+ }
+
+public:
+
+ tree(const tree& other) = delete;
+ const tree& operator=(const tree& other) = delete;
+ tree& operator=(tree&& other) = delete;
+
+ explicit tree(Less less) : _less(less) { }
+ ~tree() {
+ if (_root != nullptr) {
+ node::destroy(*_root);
+ }
+ }
+
+ Less less() const { return _less; }
+
+ tree(tree&& other) noexcept : _less(std::move(other._less)) {
+ if (other._root) {
+ do_set_root(other._root);
+ do_set_left(other._left);
+ do_set_right(other._right);
+
+ other._root = nullptr;
+ other._left = nullptr;
+ other._right = nullptr;
+ }
+ }
+
+ // XXX -- this uses a linear scan over the leaf nodes
+ size_t size_slow() const {
+ if (_root == nullptr) {
+ return 0;
+ }
+
+ size_t ret = 0;
+ const node* leaf = _left;
+ while (1) {
+ assert(leaf->is_leaf());
+ ret += leaf->_num_keys;
+ if (leaf == _right) {
+ break;
+ }
+ leaf = leaf->get_next();
+ }
+
+ return ret;
+ }
+
+ // Returns the element equal to k (neither less than the other)
+ template <typename K = Key>
+ GCC6_CONCEPT(requires LessComparable<K, Key, Less>)
+ iterator find(const K& k) {
+ maybe_init_empty_tree();
+
+ node& n = find_leaf_for(k);
+ int i = n.index_for(k, _less);
+
+ if (i >= 1 && !_less(n._keys[i - 1].v, k)) {
+ return iterator(n._kids[i].d, i);
+ } else {
+ return end();
+ }
+ }
+
+ // Returns the least x out of those !less(x, k)
+ template <typename K = Key>
+ GCC6_CONCEPT(requires LessComparable<K, Key, Less>)
+ iterator lower_bound(const K& k) {
+ bool upper = false;
+ return get_bound(k, upper);
+ }
+
+ template <typename K = Key>
+ GCC6_CONCEPT(requires LessComparable<K, Key, Less>)
+ iterator lower_bound(const K& k, bool& match) {
+ match = false;
+ return get_bound(k, match);
+ }
+
+ // Returns the least x out of those less(k, x)
+ template <typename K = Key>
+ GCC6_CONCEPT(requires LessComparable<K, Key, Less>)
+ iterator upper_bound(const K& k) {
+ bool upper = true;
+ return get_bound(k, upper);
+ }
+
+ /*
+ * Constructs the element with key k inside the tree and returns
+ * an iterator to it. If the key already exists -- just returns an
+ * iterator to it with the .second set to false.
+ */
+ template <typename... Args>
+ std::pair<iterator, bool> emplace(Key k, Args&&... args) {
+ maybe_init_empty_tree();
+
+ node& n = find_leaf_for(k);
+ int i = n.index_for(k, _less);
+
+ if (i >= 1 && !_less(n._keys[i - 1].v, k)) {
+ // Direct hit
+ return std::pair(iterator(n._kids[i].d, i), false);
+ }
+
+ data* d = data::create(std::forward<Args>(args)...);
+ auto x = seastar::defer([&d] { data::destroy(*d, default_dispose<T>); });
+ n.insert(i, std::move(k), d, _less);
+ assert(d->attached());
+ x.cancel();
+ return std::pair(iterator(d, i + 1), true);
+ }
+
+ template <typename Func>
+ GCC6_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ iterator erase_and_dispose(const Key& k, Func&& disp) {
+ maybe_init_empty_tree();
+
+ node& n = find_leaf_for(k);
+
+ data* d;
+ int i = n.index_for(k, _less) - 1;
+
+ if (i < 0) {
+ return end();
+ }
+
+ assert(n._num_keys > 0);
+
+ if (_less(n._keys[i].v, k)) {
+ return end();
+ }
+
+ d = n._kids[i + 1].d;
+ iterator it(d, i + 1);
+ it++;
+
+ n.remove(i, _less);
+
+ data::destroy(*d, disp);
+ return it;
+ }
+
+ template <typename Func>
+ GCC6_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ iterator erase_and_dispose(iterator from, iterator to, Func&& disp) {
+ /*
+ * FIXME: this is a dog-slow k*logN algo, need a k+logN one
+ */
+ while (from != to) {
+ from = from.erase_and_dispose(disp, _less);
+ }
+
+ return to;
+ }
+
+ template <typename... Args>
+ iterator erase(Args&&... args) { return erase_and_dispose(std::forward<Args>(args)..., default_dispose<T>); }
+
+ template <typename Func>
+ GCC6_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ void clear_and_dispose(Func&& disp) {
+ if (_root != nullptr) {
+ _root->clear(
+ [this, &disp] (data* d) { data::destroy(*d, disp); },
+ [this] (node* n) { node::destroy(*n); }
+ );
+
+ node::destroy(*_root);
+ _root = nullptr;
+ _left = nullptr;
+ _right = nullptr;
+ }
+ }
+
+ void clear() { clear_and_dispose(default_dispose<T>); }
+
+private:
+ void do_set_left(node *n) {
+ assert(n->is_leftmost());
+ _left = n;
+ n->_kids[0]._leftmost_tree = this;
+ }
+
+ void do_set_right(node *n) {
+ assert(n->is_rightmost());
+ _right = n;
+ n->_rightmost_tree = this;
+ }
+
+ void do_set_root(node *n) {
+ assert(n->is_root());
+ n->_root_tree = this;
+ _root = n;
+ }
+
+public:
+ /*
+ * Iterator. Scans the datas in the sorted-by-key order.
+ * Is not invalidated by emplace/erase-s of other elements.
+ * Move constructors may make the _idx invalid, but the
+ * .revalidate() method makes it good again.
+ */
+ class iterator {
+ friend class tree;
+
+ /*
+ * When the iterator gets to the end the _data is
+ * replaced with the _tree obtained from the right
+ * leaf, and the _idx is set to npos
+ */
+ union {
+ tree* _tree;
+ data* _data;
+ };
+ int _idx;
+
+ /*
+ * It could be 0 as well, as leaf nodes cannot have
+ * kids (data nodes) at 0 position, but ...
+ */
+ static constexpr int npos = -1;
+
+ bool is_end() const { return _idx == npos; }
+
+ explicit iterator(tree* t) : _tree(t), _idx(npos) { }
+ iterator(data* d, int idx) : _data(d), _idx(idx) { }
+
+ /*
+ * The routine makes sure the iterator's index is valid
+ * and returns the leaf that points to it.
+ */
+ node* revalidate() {
+ assert(!is_end());
+
+ node* leaf = _data->_leaf;
+
+ /*
+ * The data._leaf pointer is always valid (it's updated
+ * on insert/remove operations) and the datas do not move
+ * either, so if the leaf still points at us, _idx is valid.
+ */
+ if (_idx > leaf->_num_keys || leaf->_kids[_idx].d != _data) {
+ _idx = leaf->index_for(_data);
+ }
+
+ return leaf;
+ }
+
+ public:
+ using iterator_category = std::bidirectional_iterator_tag;
+ using value_type = T;
+ using difference_type = ssize_t;
+ using pointer = value_type*;
+ using reference = value_type&;
+
+ /*
+ * Special constructor for the case when an iterator to a given
+ * value pointer is needed. In this case we need to get three
+ * things:
+ * - a pointer to the data class: we assume that the value pointer
+ * is indeed embedded into the data and do the "container_of"
+ * maneuver
+ * - the index at which the data is seen on the leaf: use the
+ * standard revalidation. Note that we start with index 1,
+ * which gives us a 1/NodeSize chance of hitting the right index
+ * right away :)
+ * - the tree itself: the worst part here -- creating an iterator
+ * like this is a logN operation
+ */
+ iterator(T* value) : _idx(1) {
+ _data = boost::intrusive::get_parent_from_member(value, &data::value);
+ revalidate();
+ }
+
+ iterator() : iterator(static_cast<tree*>(nullptr)) {}
+
+ reference operator*() const { return _data->value; }
+ pointer operator->() const { return &_data->value; }
+
+ iterator& operator++() {
+ node* leaf = revalidate();
+ if (_idx < leaf->_num_keys) {
+ _idx++;
+ } else {
+ if (leaf->is_rightmost()) {
+ _idx = npos;
+ _tree = leaf->_rightmost_tree;
+ return *this;
+ }
+
+ leaf = leaf->get_next();
+ _idx = 1;
+ }
+ _data = leaf->_kids[_idx].d;
+ return *this;
+ }
+
+ iterator& operator--() {
+ if (is_end()) {
+ node* n = _tree->_right;
+ assert(n->_num_keys > 0);
+ _data = n->_kids[n->_num_keys].d;
+ _idx = n->_num_keys;
+ return *this;
+ }
+
+ node* leaf = revalidate();
+ if (_idx > 1) {
+ _idx--;
+ } else {
+ leaf = leaf->get_prev();
+ _idx = leaf->_num_keys;
+ }
+ _data = leaf->_kids[_idx].d;
+ return *this;
+ }
+
+ iterator operator++(int) {
+ iterator cur = *this;
+ operator++();
+ return cur;
+ }
+
+ iterator operator--(int) {
+ iterator cur = *this;
+ operator--();
+ return cur;
+ }
+
+ bool operator==(const iterator& o) const { return is_end() ? o.is_end() : _data == o._data; }
+ bool operator!=(const iterator& o) const { return !(*this == o); }
+
+ /*
+ * The key _MUST_ be in order and must not already exist;
+ * neither condition is checked
+ */
+ template <typename KeyFn, typename... Args>
+ iterator emplace_before(KeyFn key, Less less, Args&&... args) {
+ node* leaf;
+ int i;
+
+ if (!is_end()) {
+ leaf = revalidate();
+ i = _idx - 1;
+
+ if (i == 0 && !leaf->is_leftmost()) {
+ /*
+ * If we're about to insert a key before the 0th one, then
+ * we must make sure the separation keys from upper layers
+ * will separate the new key as well. If they won't then we
+ * should select the left sibling for insertion.
+ *
+ * For !strict_separation_key the solution is simple -- the
+ * upper level separation keys match the current 0th one, so
+ * we always switch to the left sibling.
+ *
+ * If we're already on the left-most leaf -- just insert, as
+ * there's no separation key above it.
+ */
+ if (!strict_separation_key) {
+ assert(false && "Not implemented");
+ }
+ leaf = leaf->get_prev();
+ i = leaf->_num_keys;
+ }
+ } else {
+ _tree->maybe_init_empty_tree();
+ leaf = _tree->_right;
+ i = leaf->_num_keys;
+ }
+
+ assert(i >= 0);
+
+ data* d = data::create(std::forward<Args>(args)...);
+ auto x = seastar::defer([&d] { data::destroy(*d, default_dispose<T>); });
+ leaf->insert(i, std::move(key(d)), d, less);
+ assert(d->attached());
+ x.cancel();
+ /*
+ * XXX -- if the node was not split we can ++ its index
+ * and keep the iterator valid :)
+ */
+ return iterator(d, i);
+ }
+
+ template <typename... Args>
+ iterator emplace_before(Key k, Less less, Args&&... args) {
+ return emplace_before([&k] (data*) -> Key { return std::move(k); },
+ less, std::forward<Args>(args)...);
+ }
+
+ template <typename... Args>
+ GCC6_CONCEPT(requires CanGetKeyFromValue<T, Key>)
+ iterator emplace_before(Less less, Args&&... args) {
+ return emplace_before([] (data* d) -> Key { return d->value.key(); },
+ less, std::forward<Args>(args)...);
+ }
+
+ private:
+ /*
+ * Prepare a likely-valid iterator for the next element.
+ * Likely means that, unless the removal starts rebalancing
+ * the datas, _idx will be for the correct pointer.
+ *
+ * This is just like the operator++, with the exception
+ * that staying on the leaf doesn't increase the _idx, as
+ * in this case the next element will be shifted left to
+ * the current position.
+ */
+ iterator next_after_erase(node* leaf) {
+ if (_idx < leaf->_num_keys) {
+ return iterator(leaf->_kids[_idx + 1].d, _idx);
+ }
+
+ if (leaf->is_rightmost()) {
+ return iterator(leaf->_rightmost_tree);
+ }
+
+ leaf = leaf->get_next();
+ return iterator(leaf->_kids[1].d, 1);
+ }
+
+ public:
+ template <typename Func>
+ iterator erase_and_dispose(Func&& disp, Less less) {
+ node* leaf = revalidate();
+ iterator cur = next_after_erase(leaf);
+
+ leaf->remove(_idx - 1, less);
+ data::destroy(*_data, disp);
+
+ return cur;
+ }
+
+ iterator erase(Less less) { return erase_and_dispose(default_dispose<T>, less); }
+
+ template <typename... Args>
+ void reconstruct(size_t new_size, Args&&... args) {
+ node* leaf = revalidate();
+ auto ptr = current_allocator().alloc(&get_standard_migrator<data>(), new_size, alignof(data));
+ data *dat, *cur = _data;
+
+ try {
+ dat = new (ptr) data(std::forward<Args>(args)...);
+ } catch(...) {
+ current_allocator().free(ptr, new_size);
+ throw;
+ }
+
+ dat->_leaf = leaf;
+ cur->_leaf = nullptr;
+
+ _data = dat;
+ leaf->_kids[_idx].d = dat;
+
+ current_allocator().destroy(cur);
+ }
+
+ size_t storage_size() const { return _data->storage_size(); }
+ };
+
+ iterator begin() {
+ if (_root == nullptr || _root->_num_keys == 0) {
+ return end();
+ }
+
+ assert(_left->_num_keys > 0);
+ // Leaf nodes have data pointers starting from index 1
+ return iterator(_left->_kids[1].d, 1);
+ }
+ iterator end() { return iterator(this); }
+
+ using reverse_iterator = std::reverse_iterator<iterator>;
+ reverse_iterator rbegin() { return std::make_reverse_iterator(end()); }
+ reverse_iterator rend() { return std::make_reverse_iterator(begin()); }
+
+ struct stats get_stats() {
+ struct stats st;
+
+ st.nodes = 0;
+ st.leaves = 0;
+ st.datas = 0;
+
+ if (_root != nullptr) {
+ st.nodes_filled.resize(NodeSize + 1);
+ st.leaves_filled.resize(NodeSize + 1);
+ _root->fill_stats(st);
+ }
+
+ return st;
+ }
+};
+
+
+/*
+ * A node describes both, inner and leaf nodes.
+ */
+template <typename Key, typename T, typename Less, int NodeSize, key_search Search, with_debug Debug>
+class node final {
+ friend class validator<Key, T, Less, NodeSize>;
+ friend class tree<Key, T, Less, NodeSize, Search, Debug>;
+
+ using tree = class tree<Key, T, Less, NodeSize, Search, Debug>;
+ using data = class data<Key, T, Less, NodeSize, Search, Debug>;
+
+ class prealloc;
+
+ /*
+     * The NodeHalf is the level at which the node is considered
+     * to be underflowed and should be re-filled. This slightly
+     * differs for even and odd sizes.
+     *
+     * For odd sizes the node will stand until it contains strictly
+     * more than 1/2 of its size (e.g. for size 5 keeping 3 keys
+     * is OK). For even sizes this barrier is less than the actual
+     * half (e.g. for size 4 keeping 2 is still OK).
+ */
+ static constexpr int NodeHalf = ((NodeSize - 1) / 2);
+ static_assert(NodeHalf >= 1);
+
+ union node_or_data_or_tree {
+ node* n;
+ data* d;
+
+ tree* _leftmost_tree; // See comment near node::__next about this
+ };
+
+ using node_or_data = node_or_data_or_tree;
+
+ friend data::data(data&&);
+
+ [[no_unique_address]] node_id<Debug> id;
+
+ unsigned short _num_keys;
+ unsigned short _flags;
+
+ static const unsigned short NODE_ROOT = 0x1;
+ static const unsigned short NODE_LEAF = 0x2;
+ static const unsigned short NODE_LEFTMOST = 0x4;
+ static const unsigned short NODE_RIGHTMOST = 0x8;
+
+ bool is_leaf() const { return _flags & NODE_LEAF; }
+ bool is_root() const { return _flags & NODE_ROOT; }
+ bool is_rightmost() const { return _flags & NODE_RIGHTMOST; }
+ bool is_leftmost() const { return _flags & NODE_LEFTMOST; }
+
+ /*
+ * separation keys
+ * non-leaf nodes:
+     *    the kids[i] contains keys[i - 1] <= k < keys[i] (for i >= 1)
+ * kids[0] contains keys < all keys in the node
+ * leaf nodes:
+ * kids[i + 1] is the data for keys[i]
+ * kids[0] is unused
+ *
+ * In the examples below the leaf nodes will be shown like
+ *
+ * keys: [012]
+ * datas: [-012]
+ *
+ * and the non-leaf ones like
+ *
+ * keys: [012]
+ * kids: [A012]
+ *
+     * so that digits correspond to different elements and stay
+     * in their correct positions. The A kid is the left-most one
+     * at index 0 of a non-leaf node.
+ */
+
+ /*
+     * Wrap keys in an array of "maybe" so that yet-unused keys
+     * are not default-constructed on node allocation.
+ */
+
+ maybe<Key> _keys[NodeSize];
+ node_or_data _kids[NodeSize + 1];
+
+ /*
+ * The root node uses this to point to the tree object. This is
+ * needed to update tree->_root on node move.
+ */
+ union {
+ node* _parent;
+ tree* _root_tree;
+ };
+
+ /*
+     * Leaf nodes are linked in a list. Since leaf nodes do
+     * not use the _kids[0] pointer we re-use it; respectively,
+     * non-leaf nodes don't use the __next one.
+     *
+     * Also, the leftmost and rightmost leaves have prev and next
+     * (respectively) pointing to the tree object itself. This is
+     * done for the _left/_right update on node move.
+ */
+ union {
+ node* __next;
+ tree* _rightmost_tree;
+ };
+
+ node* get_next() const {
+ assert(is_leaf());
+ return __next;
+ }
+
+ void set_next(node *n) {
+ assert(is_leaf());
+ __next = n;
+ }
+
+ node* get_prev() const {
+ assert(is_leaf());
+ return _kids[0].n;
+ }
+
+ void set_prev(node* n) {
+ assert(is_leaf());
+ _kids[0].n = n;
+ }
+
+ // Links the new node n right after the current one
+ void link(node& n) {
+ if (is_rightmost()) {
+ _flags &= ~NODE_RIGHTMOST;
+ n._flags |= node::NODE_RIGHTMOST;
+ tree* t = _rightmost_tree;
+ assert(t->_right == this);
+ t->do_set_right(&n);
+ } else {
+ n.set_next(get_next());
+ get_next()->set_prev(&n);
+ }
+
+ n.set_prev(this);
+ set_next(&n);
+ }
+
+ void unlink() {
+ node* x;
+ tree* t;
+
+ switch (_flags & (node::NODE_LEFTMOST | node::NODE_RIGHTMOST)) {
+ case node::NODE_LEFTMOST:
+ x = get_next();
+ _flags &= ~node::NODE_LEFTMOST;
+ x->_flags |= node::NODE_LEFTMOST;
+ t = _kids[0]._leftmost_tree;
+ assert(t->_left == this);
+ t->do_set_left(x);
+ break;
+ case node::NODE_RIGHTMOST:
+ x = get_prev();
+ _flags &= ~node::NODE_RIGHTMOST;
+ x->_flags |= node::NODE_RIGHTMOST;
+ t = _rightmost_tree;
+ assert(t->_right == this);
+ t->do_set_right(x);
+ break;
+ case 0:
+ get_prev()->set_next(get_next());
+ get_next()->set_prev(get_prev());
+ break;
+ default:
+ /*
+             * A node that is both right- and left-most can only be the
+             * root; otherwise it would mean we have a root with 0 keys.
+ */
+ assert(false);
+ }
+
+ set_next(this);
+ set_prev(this);
+ }
+
+ node(const node& other) = delete;
+ const node& operator=(const node& other) = delete;
+ node& operator=(node&& other) = delete;
+
+ /*
+     * There's no pointer/reference from nodes to the tree, nor is
+     * there one from data, because otherwise we'd have to update
+     * all of them inside the tree move constructor, which in turn
+     * would make it a slow linear operation. Thus we walk the
+     * ._parent chain up to the root node which has the _root_tree.
+ */
+ tree* tree_slow() const {
+ const node* cur = this;
+
+ while (!cur->is_root()) {
+ cur = cur->_parent;
+ }
+
+ return cur->_root_tree;
+ }
+
+ /*
+     * Finds the index i of the subtree to which k belongs.
+     * That is, keys[i - 1] <= k < keys[i] (and if i == 0
+     * the node is inner and the key is in the leftmost subtree).
+ */
+ template <typename K>
+ int index_for(const K& k, Less less) const {
+ return index_for(k, less, std::integral_constant<key_search, Search>());
+ }
+
+ template <typename K>
+ int index_for(const K& k, Less less, std::integral_constant<key_search, key_search::both>) const {
+ int rl = index_for(k, less, std::integral_constant<key_search, key_search::linear>());
+ int rb = index_for(k, less, std::integral_constant<key_search, key_search::binary>());
+ assert(rl == rb);
+ return rl;
+ }
+
+ template <typename K>
+ int index_for(const K& k, Less less, std::integral_constant<key_search, key_search::binary>) const {
+ int s = 0, e = _num_keys - 1, c = 0;
+
+ while (s <= e) {
+ int i = (s + e) / 2;
+ c++;
+ if (less(k, _keys[i].v)) {
+ e = i - 1;
+ } else {
+ s = i + 1;
+ }
+ }
+
+ return s;
+ }
+
+ template <typename K>
+ int index_for(const K& k, Less less, std::integral_constant<key_search, key_search::linear>) const {
+ int i;
+
+ for (i = 0; i < _num_keys; i++) {
+ if (less(k, _keys[i].v)) {
+ break;
+ }
+ }
+
+ return i;
+ }
+
+ int index_for(node *n) const {
+ // Keep index on kid (FIXME?)
+
+ int i;
+
+ for (i = 0; i <= _num_keys; i++) {
+ if (_kids[i].n == n) {
+ break;
+ }
+ }
+ assert(i <= _num_keys);
+ return i;
+ }
+
+ bool need_refill() const {
+ return _num_keys <= NodeHalf;
+ }
+
+ bool can_grab_from() const {
+ return _num_keys > NodeHalf + 1;
+ }
+
+ bool can_push_to() const {
+ return _num_keys < NodeSize;
+ }
+
+ bool can_merge_with(const node& n) const {
+ return _num_keys + n._num_keys + (is_leaf() ? 0 : 1) <= NodeSize;
+ }
+
+ void shift_right(int s) {
+ for (int i = _num_keys - 1; i >= s; i--) {
+ _keys[i + 1].emplace(std::move(_keys[i]));
+ _kids[i + 2] = _kids[i + 1];
+ }
+ _num_keys++;
+ }
+
+ void shift_left(int s) {
+        // The key at s is expected to have been reset or moved-from already!
+ for (int i = s; i < _num_keys - 1; i++) {
+ _keys[i].emplace(std::move(_keys[i + 1]));
+ _kids[i + 1] = _kids[i + 2];
+ }
+ _num_keys--;
+ }
+
+ void move_keys_and_kids(int foff, node& to, int toff, int count) {
+ for (int i = 0; i < count; i++) {
+ to._keys[toff + i].emplace(std::move(_keys[foff + i]));
+ to._kids[toff + i + 1] = _kids[foff + i + 1];
+ }
+ }
+
+ void move_to(node& to, int off, int count) {
+ move_keys_and_kids(off, to, 0, count);
+ _num_keys = off;
+ to._num_keys = count;
+ if (is_leaf()) {
+ for (int i = 0; i < count; i++) {
+ to._kids[i + 1].d->reattach(&to);
+ }
+ } else {
+ for (int i = 0; i < count; i++) {
+ to._kids[i + 1].n->_parent = &to;
+ }
+ }
+ }
+
+ void grab_from_left(node& from, maybe<Key>& sep) {
+ /*
+ * Grab one element from the left sibling and return
+ * the new separation key for them.
+ *
+ * Leaf: just move the last key (and the last kid) and report
+ * it as new separation key
+ *
+ * keys: [012] -> [56] = [01] [256] 2 is new separation
+ * datas: [-012] -> [-56] = [-01] [-256]
+ *
+         * Non-leaf is trickier. We need the current separation key
+         * as we're grabbing the last element, which has no right
+         * boundary in the node. So the parent node tells us one.
+ *
+ * keys: [012] -> s [56] = [01] 2 [s56] 2 is new separation
+ * kids: [A012] -> [B56] = [A01] [2B56]
+ */
+
+ int i = from._num_keys - 1;
+
+ shift_right(0);
+ from._num_keys--;
+
+ if (is_leaf()) {
+ _keys[0].emplace(std::move(from._keys[i]));
+ _kids[1] = from._kids[i + 1];
+ _kids[1].d->reattach(this);
+ sep.replace(std::move(copy_key(_keys[0].v)));
+ } else {
+ _keys[0].emplace(std::move(sep));
+ _kids[1] = _kids[0];
+ _kids[0] = from._kids[i + 1];
+ _kids[0].n->_parent = this;
+ sep.emplace(std::move(from._keys[i]));
+ }
+ }
+
+ void merge_into(node& t, Key key) {
+ /*
+ * Merge current node into t preparing it for being
+ * killed. This merge is slightly different for leaves
+ * and for non-leaves wrt the 0th element.
+ *
+         * Non-leaves. For those we need the separation key, which
+ * is passed to us. The caller "knows" that this and t are
+ * two siblings and thus the separation key is the one from
+ * the parent node. For this reason merging two non-leaf
+ * nodes needs one more slot in the target as compared to
+ * the leaf-nodes case.
+ *
+ * keys: [012] + K + [456] = [012K456]
+ * kids: [A012] + [B456] = [A012B456]
+ *
+ * Leaves. This is simple -- just go ahead and merge.
+ *
+ * keys: [012] + [456] = [012456]
+ * datas: [-012] + [-456] = [-012456]
+ */
+
+ if (!t.is_leaf()) {
+ int i = t._num_keys;
+ t._keys[i].emplace(std::move(key));
+ t._kids[i + 1] = _kids[0];
+ t._kids[i + 1].n->_parent = &t;
+ t._num_keys++;
+ }
+
+ move_keys_and_kids(0, t, t._num_keys, _num_keys);
+
+ if (t.is_leaf()) {
+ for (int i = t._num_keys; i < t._num_keys + _num_keys; i++) {
+ t._kids[i + 1].d->reattach(&t);
+ }
+ } else {
+ for (int i = t._num_keys; i < t._num_keys + _num_keys; i++) {
+ t._kids[i + 1].n->_parent = &t;
+ }
+ }
+
+ t._num_keys += _num_keys;
+ _num_keys = 0;
+ }
+
+ void grab_from_right(node& from, maybe<Key>& sep) {
+ /*
+ * Grab one element from the right sibling and return
+ * the new separation key for them.
+ *
+ * Leaf: just move the 0th key (and 1st kid) and the
+ * new separation key is what becomes 0 in the source.
+ *
+ * keys: [01] <- [456] = [014] [56] 5 is new separation
+ * datas: [-01] <- [-456] = [-014] [-56]
+ *
+ * Non-leaf is trickier. We need the current separation
+ * key as we're grabbing the kids[0] element which has no
+ * corresponding keys[-1]. So the parent node tells us one.
+ *
+ * keys: [01] <- s [456] = [01s] 4 [56] 4 is new separation
+ * kids: [A01] <- [B456] = [A01B] [456]
+ */
+
+ int i = _num_keys;
+
+ if (is_leaf()) {
+ _keys[i].emplace(std::move(from._keys[0]));
+ _kids[i + 1] = from._kids[1];
+ _kids[i + 1].d->reattach(this);
+ sep.replace(std::move(copy_key(from._keys[1].v)));
+ } else {
+ _kids[i + 1] = from._kids[0];
+ _kids[i + 1].n->_parent = this;
+ _keys[i].emplace(std::move(sep));
+ from._kids[0] = from._kids[1];
+ sep.emplace(std::move(from._keys[0]));
+ }
+
+ _num_keys++;
+ from.shift_left(0);
+ }
+
+ /*
+ * When splitting, the result should be almost equal. The
+ * "almost" depends on the node-size being odd or even and
+ * on the node itself being leaf or inner.
+ */
+ bool equally_split(const node& n2) const {
+ if (Debug == with_debug::yes) {
+ return (_num_keys == n2._num_keys) ||
+ (_num_keys == n2._num_keys + 1) ||
+ (_num_keys + 1 == n2._num_keys);
+ }
+ return true;
+ }
+
+ // Helper for assert(). See comment for do_insert for details.
+ bool left_kid_sorted(const Key& k, Less less) const {
+ if (Debug == with_debug::yes && !is_leaf() && _num_keys > 0) {
+ node* x = _kids[0].n;
+ if (x != nullptr && less(k, x->_keys[x->_num_keys - 1].v)) {
+ return false;
+ }
+ }
+
+ return true;
+ }
+
+ template <typename DFunc, typename NFunc>
+ GCC6_CONCEPT(requires
+ requires (DFunc f, data* val) { { f(val) } -> void; } &&
+ requires (NFunc f, node* n) { { f(n) } -> void; }
+ )
+ void clear(DFunc&& ddisp, NFunc&& ndisp) {
+ if (is_leaf()) {
+ _flags &= ~(node::NODE_LEFTMOST | node::NODE_RIGHTMOST);
+ set_next(this);
+ set_prev(this);
+ } else {
+ node* n = _kids[0].n;
+ n->clear(ddisp, ndisp);
+ ndisp(n);
+ }
+
+ for (int i = 0; i < _num_keys; i++) {
+ _keys[i].reset();
+ if (is_leaf()) {
+ ddisp(_kids[i + 1].d);
+ } else {
+ node* n = _kids[i + 1].n;
+ n->clear(ddisp, ndisp);
+ ndisp(n);
+ }
+ }
+
+ _num_keys = 0;
+ }
+
+ static node* create() {
+ return current_allocator().construct<node>();
+ }
+
+ static void destroy(node& n) {
+ current_allocator().destroy(&n);
+ }
+
+ void drop() {
+ assert(!is_root());
+ if (is_leaf()) {
+ unlink();
+ }
+ destroy(*this);
+ }
+
+ void insert_into_full(int idx, Key k, node_or_data nd, Less less, prealloc& nodes) {
+ if (!is_root()) {
+ node& p = *_parent;
+ int i = p.index_for(_keys[0].v, less);
+
+ /*
+ * Try to push left or right existing keys to the respective
+ * siblings. Keep in mind two corner cases:
+ *
+ * 1. Push to left. In this case the new key should not go
+ * to the [0] element, otherwise we'd have to update the p's
+ * separation key one more time.
+ *
+ * 2. Push to right. In this case we must make sure the new
+         * key is not the rightmost itself, otherwise it's the new
+         * key that must be pushed there.
+ *
+ * Both corner cases are possible to implement though.
+ */
+ if (idx > 1 && i > 0) {
+ node* left = p._kids[i - 1].n;
+ if (left->can_push_to()) {
+ /*
+                 * We've moved the 0th element from this, so the index
+ * for the new key shifts too
+ */
+ idx--;
+ left->grab_from_right(*this, p._keys[i - 1]);
+ }
+ }
+
+ if (idx < _num_keys && i < p._num_keys) {
+ node* right = p._kids[i + 1].n;
+ if (right->can_push_to()) {
+ right->grab_from_left(*this, p._keys[i]);
+ }
+ }
+
+ if (_num_keys < NodeSize) {
+ do_insert(idx, std::move(k), nd, less);
+ nodes.drain();
+ return;
+ }
+
+ /*
+ * We can only get here if both ->can_push_to() checks above
+             * have failed. In this case -- go ahead and split this.
+ */
+ }
+
+ split_and_insert(idx, std::move(k), nd, less, nodes);
+ }
+
+ void split_and_insert(int idx, Key k, node_or_data nd, Less less, prealloc& nodes) {
+ assert(_num_keys >= NodeSize);
+
+ node* nn = nodes.pop();
+ maybe<Key> sep;
+
+ /*
+ * Insertion with split.
+         * 1. Existing node (this) is split into two. We try a bit
+         *    harder than strictly necessary to make the split equal.
+ * 2. The new element is added to either of the resulting nodes.
+ * 3. The new node nn is inserted into parent one with the help
+ * of a separation key sep
+ *
+ * First -- find the position in the current node at which the
+ * new element should have appeared.
+ */
+
+ int off = NodeHalf + (idx > NodeHalf ? 1 : 0);
+
+ if (is_leaf()) {
+ nn->_flags |= NODE_LEAF;
+ link(*nn);
+
+ /*
+ * Split of leaves. This is simple -- just copy the needed
+ * amount of keys and kids from this to nn, then insert the
+ * new pair into the proper place. When inserting the new
+ * node into parent the separation key is the one latter
+ * starts with.
+ *
+ * keys: [01234]
+ * datas: [-01234]
+ *
+ * if the new key is below 2, then
+ * keys: -> [01] [234] -> [0n1] [234] -> sep is 2
+ * datas: -> [-01] [-234] -> [-0n1] [-234]
+ *
+ * if the new key is above 2, then
+ * keys: -> [012] [34] -> [012] [3n4] -> sep is 3 (or n)
+ * datas: -> [-012] [-34] -> [-012] [-3n4]
+ */
+ move_to(*nn, off, NodeSize - off);
+
+ if (idx <= NodeHalf) {
+ do_insert(idx, std::move(k), nd, less);
+ } else {
+ nn->do_insert(idx - off, std::move(k), nd, less);
+ }
+ sep.emplace(std::move(copy_key(nn->_keys[0].v)));
+ } else {
+ /*
+ * Node insertion has one special case -- when the new key
+ * gets directly into the middle.
+ */
+ if (idx == NodeHalf + 1) {
+ /*
+                 * Split of nodes and the new key is in the middle. In
+                 * this case we need to split the node into two, but take
+                 * the k as the separation key. The corresponding kid
+                 * becomes the new node's kids[0].
+ *
+ * keys: [012345] -> [012] k [345] (and the k goes up)
+ * kids: [A012345] -> [A012] [n345]
+ */
+ move_to(*nn, off, NodeSize - off);
+ sep.emplace(std::move(k));
+ nn->_kids[0] = nd;
+ nn->_kids[0].n->_parent = nn;
+ } else {
+ /*
+                 * Split of nodes and the new key gets into either of the
+                 * halves. This is like the leaves split, but we need to
+                 * carefully handle the kids[0] for both. The corresponding
+                 * key is not on the node and "has" an index of -1 and thus
+                 * becomes the separation one for the upper layer.
+                 *
+                 * keys: [012345]
+                 * kids: [A012345]
+                 *
+                 * if the new key goes left then
+                 * keys: -> [01] 2 [345] -> [0n1] 2 [345]
+                 * kids: -> [A01] [2345] -> [A0n1] [2345]
+                 *
+                 * if the new key goes right then
+                 * keys: -> [012] 3 [45] -> [012] 3 [4n5]
+                 * kids: -> [A012] [345] -> [A012] [34n5]
+ */
+ move_to(*nn, off + 1, NodeSize - off - 1);
+ sep.emplace(std::move(_keys[off]));
+ nn->_kids[0] = _kids[off + 1];
+ nn->_kids[0].n->_parent = nn;
+ _num_keys--;
+
+ if (idx <= NodeHalf) {
+ do_insert(idx, std::move(k), nd, less);
+ } else {
+ nd.n->_parent = nn;
+ nn->do_insert(idx - off - 1, std::move(k), nd, less);
+ }
+ }
+ }
+
+ assert(equally_split(*nn));
+
+ if (is_root()) {
+ insert_into_root(*nn, std::move(sep.v), nodes);
+ } else {
+ insert_into_parent(*nn, std::move(sep.v), less, nodes);
+ }
+ sep.reset();
+ }
+
+ void do_insert(int i, Key k, node_or_data nd, Less less) {
+ assert(_num_keys < NodeSize);
+
+ /*
+ * The new k:nd pair should be put into the given index and
+ * shift offenders to the right. However, if it should be
+ * put left to the non-leaf's left-most node -- it's a BUG,
+ * as there's no corresponding key here.
+ *
+ * Non-leaf nodes get here when their kids are split, and
+ * when they do, if the kid gets into the left-most sub-tree,
+ * it's directly put there, and this helper is not called.
+         * That said, if we're inserting a new pair, the newbie can
+ * only get to the right of the left-most kid.
+ */
+ assert(i != 0 || left_kid_sorted(k, less));
+
+ shift_right(i);
+
+        // Puts the k:nd pair at position i (keys[i] and kids[i + 1])
+ _keys[i].emplace(std::move(k));
+ _kids[i + 1] = nd;
+ if (is_leaf()) {
+ nd.d->attach(*this);
+ }
+ }
+
+ void insert_into_parent(node& nn, Key sep, Less less, prealloc& nodes) {
+ nn._parent = _parent;
+ _parent->insert_key(std::move(sep), node_or_data{n: &nn}, less, nodes);
+ }
+
+ void insert_into_root(node& nn, Key sep, prealloc& nodes) {
+ tree* t = _root_tree;
+
+ node* nr = nodes.pop();
+
+ nr->_num_keys = 1;
+ nr->_keys[0].emplace(std::move(sep));
+ nr->_kids[0].n = this;
+ nr->_kids[1].n = &nn;
+ _flags &= ~node::NODE_ROOT;
+ _parent = nr;
+ nn._parent = nr;
+
+ nr->_flags |= node::NODE_ROOT;
+ t->do_set_root(nr);
+ }
+
+ void insert_key(Key k, node_or_data nd, Less less, prealloc& nodes) {
+ int i = index_for(k, less);
+ insert(i, std::move(k), nd, less, nodes);
+ }
+
+ void insert(int i, Key k, node_or_data nd, Less less, prealloc& nodes) {
+ if (_num_keys == NodeSize) {
+ insert_into_full(i, std::move(k), nd, less, nodes);
+ } else {
+ do_insert(i, std::move(k), nd, less);
+ }
+ }
+
+ void insert(int i, Key k, data* d, Less less) {
+ prealloc nodes;
+
+ /*
+         * Prepare the nodes for the split in advance: if node::create
+         * started throwing mid-split, we'd have trouble "unsplitting"
+         * the nodes back.
+ */
+ node* cur = this;
+ while (cur->_num_keys == NodeSize) {
+ nodes.push();
+ if (cur->is_root()) {
+ nodes.push();
+ break;
+ }
+ cur = cur->_parent;
+ }
+
+ insert(i, std::move(k), node_or_data{d: d}, less, nodes);
+ assert(nodes.empty());
+ }
+
+ void remove_from(int i, Less less) {
+ _keys[i].reset();
+ shift_left(i);
+
+ if (!is_root()) {
+ if (need_refill()) {
+ refill(less);
+ }
+ } else if (_num_keys == 0 && !is_leaf()) {
+ node* nr;
+ nr = _kids[0].n;
+ nr->_flags |= node::NODE_ROOT;
+ _root_tree->do_set_root(nr);
+
+ _flags &= ~node::NODE_ROOT;
+ _parent = nullptr;
+ drop();
+ }
+ }
+
+ void merge_kids(node& t, node& n, int idx, Less less) {
+ n.merge_into(t, std::move(_keys[idx].v));
+ n.drop();
+ remove_from(idx, less);
+ }
+
+ void refill(Less less) {
+ node& p = *_parent, *left, *right;
+
+ /*
+         * We need to locate this node's index in the parent array by
+         * using the 0th key, so make sure it exists. We could manage
+         * without it, but since we don't have to, let's stay on the
+         * safe side.
+ */
+ assert(_num_keys > 0);
+ int i = p.index_for(_keys[0].v, less);
+ assert(p._kids[i].n == this);
+
+ /*
+         * The node is "underflowed" (see comment near NodeHalf
+         * about what this means), so we try to refill it at the
+         * siblings' expense. Many cases are possible, but we go
+         * with only four:
+         *
+         * 1. Left sibling exists and it has at least 1 item
+         * above being half-full -> we grab one element
+         * from it.
+         *
+         * 2. Left sibling exists and we can merge current with
+         * it. "Can" means the resulting node will not overflow,
+         * which, in turn, differs by one for leaf and non-leaf
+         * nodes. For leaves the merge is possible if the total
+         * number of elements fits the maximum. For non-leaf
+         * nodes we'll need room for one more element, here's why:
+ *
+ * [012] + [456] -> [012X456]
+ * [A012] + [B456] -> [A012B456]
+ *
+ * The key X in the middle separates B from everything on
+ * the left and this key was not sitting on either of the
+ * wannabe merging nodes. This X is the current separation
+ * of these two nodes taken from their parent.
+ *
+ * And two same cases for the right sibling.
+ */
+
+ left = i > 0 ? p._kids[i - 1].n : nullptr;
+ right = i < p._num_keys ? p._kids[i + 1].n : nullptr;
+
+ if (left != nullptr && left->can_grab_from()) {
+ grab_from_left(*left, p._keys[i - 1]);
+ return;
+ }
+
+ if (right != nullptr && right->can_grab_from()) {
+ grab_from_right(*right, p._keys[i]);
+ return;
+ }
+
+ if (left != nullptr && can_merge_with(*left)) {
+ p.merge_kids(*left, *this, i - 1, less);
+ return;
+ }
+
+ if (right != nullptr && can_merge_with(*right)) {
+ p.merge_kids(*this, *right, i, less);
+ return;
+ }
+
+ /*
+         * Surprisingly, a non-root node in the B+ tree can violate
+         * the "minimally filled" rule. It _can_ stay with less than
+         * half of the elements on board. The next remove from it or
+         * from either of its siblings will probably refill it.
+         *
+         * Keeping 1 key on a non-root node is possible, but needs
+         * some special care -- if we remove this last key from
+         * the node, the code will try to refill it and will not
+         * be able to find this node's index in the parent (the
+         * index_for() call above).
+ */
+ assert(_num_keys > 1);
+ }
+
+ void remove(int i, Less less) {
+ assert(i >= 0);
+ /*
+ * Update the matching separation key from above. It
+ * exists only if we're removing the 0th key, but for
+ * the left-most child it doesn't exist.
+ *
+         * Note that the latter check is crucial for clear()
+         * performance, as it always removes the left-most
+         * key; without this check each remove() would walk the
+         * tree upwards in vain.
+ */
+ if (strict_separation_key && i == 0 && !is_leftmost()) {
+ const Key& k = _keys[i].v;
+ node* p = this;
+
+ while (!p->is_root()) {
+ p = p->_parent;
+ int j = p->index_for(k, less) - 1;
+ if (j >= 0) {
+ p->_keys[j].replace(std::move(copy_key(_keys[1].v)));
+ break;
+ }
+ }
+ }
+
+ remove_from(i, less);
+ }
+
+public:
+ explicit node() : _num_keys(0) , _flags(0) , _parent(nullptr) { }
+
+ ~node() {
+ assert(_num_keys == 0);
+ assert(is_root() || !is_leaf() || (get_prev() == this && get_next() == this));
+ }
+
+ node(node&& other) noexcept : _flags(other._flags) {
+ if (is_leaf()) {
+ if (!is_rightmost()) {
+ set_next(other.get_next());
+ get_next()->set_prev(this);
+ } else {
+ other._rightmost_tree->do_set_right(this);
+ }
+
+ if (!is_leftmost()) {
+ set_prev(other.get_prev());
+ get_prev()->set_next(this);
+ } else {
+ other._kids[0]._leftmost_tree->do_set_left(this);
+ }
+
+ other._flags &= ~(NODE_LEFTMOST | NODE_RIGHTMOST);
+ other.set_next(&other);
+ other.set_prev(&other);
+ } else {
+ _kids[0].n = other._kids[0].n;
+ _kids[0].n->_parent = this;
+ }
+
+ other.move_to(*this, 0, other._num_keys);
+
+ if (!is_root()) {
+ _parent = other._parent;
+ int i = _parent->index_for(&other);
+ assert(_parent->_kids[i].n == &other);
+ _parent->_kids[i].n = this;
+ } else {
+ other._root_tree->do_set_root(this);
+ }
+ }
+
+ int index_for(data *d) const {
+ /*
+         * We could look up the data's new index with binary search,
+         * but we don't have the key at hand.
+ */
+
+ int i;
+
+ for (i = 1; i <= _num_keys; i++) {
+ if (_kids[i].d == d) {
+ break;
+ }
+ }
+ assert(i <= _num_keys);
+ return i;
+ }
+
+private:
+ class prealloc {
+ std::vector<node*> _nodes;
+ public:
+ bool empty() { return _nodes.empty(); }
+
+ void push() {
+ _nodes.push_back(node::create());
+ }
+
+ node* pop() {
+ assert(!_nodes.empty());
+ node* ret = _nodes.back();
+ _nodes.pop_back();
+ return ret;
+ }
+
+ void drain() {
+ while (!empty()) {
+ node::destroy(*pop());
+ }
+ }
+
+ ~prealloc() {
+ drain();
+ }
+ };
+
+ void fill_stats(struct stats& st) {
+ if (is_leaf()) {
+ st.leaves_filled[_num_keys]++;
+ st.leaves++;
+ st.datas += _num_keys;
+ } else {
+ st.nodes_filled[_num_keys]++;
+ st.nodes++;
+ for (int i = 0; i <= _num_keys; i++) {
+ _kids[i].n->fill_stats(st);
+ }
+ }
+ }
+};
+
+/*
+ * The data class represents a data node (the actual data is stored
+ * "outside" of the tree). The tree::emplace() constructs the payload
+ * inside the data before inserting it into the tree.
+ */
+template <typename K, typename T, typename Less, int NS, key_search S, with_debug D>
+class data final {
+ friend class validator<K, T, Less, NS>;
+ template <typename c1, typename c2, typename c3, int s1, key_search p1, with_debug p2>
+ friend class tree<c1, c2, c3, s1, p1, p2>::iterator;
+
+ using node = class node<K, T, Less, NS, S, D>;
+
+ node* _leaf;
+ T value;
+
+public:
+ template <typename... Args>
+ static data* create(Args&&... args) {
+ return current_allocator().construct<data>(std::forward<Args>(args)...);
+ }
+
+ template <typename Func>
+ GCC6_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ static void destroy(data& d, Func&& disp) {
+ disp(d.value);
+ d._leaf = nullptr;
+ current_allocator().destroy(&d);
+ }
+
+ template <typename... Args>
+ data(Args&& ... args) : _leaf(nullptr), value(std::forward<Args>(args)...) {}
+
+ data(data&& other) noexcept : _leaf(other._leaf), value(std::move(other.value)) {
+ if (attached()) {
+ int i = _leaf->index_for(&other);
+ _leaf->_kids[i].d = this;
+ other._leaf = nullptr;
+ }
+ }
+
+ ~data() { assert(!attached()); }
+
+ bool attached() const { return _leaf != nullptr; }
+
+ void attach(node& to) {
+ assert(!attached());
+ _leaf = &to;
+ }
+
+ void reattach(node* to) {
+ assert(attached());
+ _leaf = to;
+ }
+
+ size_t storage_size() const {
+ return sizeof(data) - sizeof(T) + size_for_allocation_strategy(value);
+ }
+
+ friend size_t size_for_allocation_strategy(const data& obj) {
+ return obj.storage_size();
+ }
+};
+
+} // namespace bplus
diff --git a/test/boost/bptree_test.cc b/test/boost/bptree_test.cc
new file mode 100644
index 000000000..ea8c6ce71
--- /dev/null
+++ b/test/boost/bptree_test.cc
@@ -0,0 +1,344 @@
+
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#define BOOST_TEST_MODULE bptree
+
+#include <boost/test/unit_test.hpp>
+#include <fmt/core.h>
+
+#include "utils/bptree.hh"
+#include "test/unit/bptree_key.hh"
+
+struct int_compare {
+ bool operator()(const int& a, const int& b) const { return a < b; }
+};
+
+using namespace bplus;
+using test_tree = tree<int, unsigned long, int_compare, 4, key_search::both, with_debug::yes>;
+
+BOOST_AUTO_TEST_CASE(test_ops_empty_tree) {
+ /* Sanity checks for no nullptr dereferences */
+ test_tree t(int_compare{});
+ t.erase(1);
+ t.find(1);
+}
+
+BOOST_AUTO_TEST_CASE(test_double_insert) {
+ /* No assertions should happen in ~tree */
+ test_tree t(int_compare{});
+ auto i = t.emplace(1, 1);
+ BOOST_REQUIRE(i.second);
+ i = t.emplace(1, 1);
+ BOOST_REQUIRE(!i.second);
+ t.erase(1);
+}
+
+BOOST_AUTO_TEST_CASE(test_cookie_find) {
+ struct int_to_key_compare {
+ bool operator()(const test_key& a, const int& b) const { return (int)a < b; }
+ bool operator()(const int& a, const test_key& b) const { return a < (int)b; }
+ bool operator()(const test_key& a, const test_key& b) const {
+ test_key_compare cmp;
+ return cmp(a, b);
+ }
+ };
+
+ using test_tree = tree<test_key, int, int_to_key_compare, 4, key_search::both, with_debug::yes>;
+
+ test_tree t(int_to_key_compare{});
+ t.emplace(test_key{1}, 132);
+
+ auto i = t.find(1);
+ BOOST_REQUIRE(*i == 132);
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_double_erase) {
+ test_tree t(int_compare{});
+ t.emplace(1, 1);
+ t.emplace(2, 2);
+ auto i = t.erase(1);
+ BOOST_REQUIRE(*i == 2);
+ i = t.erase(1);
+ BOOST_REQUIRE(i == t.end());
+ i = t.erase(2);
+ BOOST_REQUIRE(i == t.end());
+ t.erase(2);
+}
+
+BOOST_AUTO_TEST_CASE(test_remove_corner_case) {
+    /* Sanity check that erasure is precise */
+ test_tree t(int_compare{});
+ t.emplace(1, 1);
+ t.emplace(2, 123);
+ t.emplace(3, 3);
+ t.erase(1);
+ t.erase(3);
+ auto f = t.find(2);
+ BOOST_REQUIRE(*f == 123);
+ t.erase(2);
+}
+
+BOOST_AUTO_TEST_CASE(test_end_iterator) {
+ /* Check std::prev(end()) */
+ test_tree t(int_compare{});
+ t.emplace(1, 123);
+ auto i = std::prev(t.end());
+    BOOST_REQUIRE(*i == 123);
+ t.erase(1);
+}
+
+BOOST_AUTO_TEST_CASE(test_next_to_end_iterator) {
+ /* Same, but with "artificial" end iterator */
+ test_tree t(int_compare{});
+ auto i = t.emplace(1, 123).first;
+ i++;
+ BOOST_REQUIRE(i == t.end());
+ i--;
+    BOOST_REQUIRE(*i == 123);
+ t.erase(1);
+}
+
+BOOST_AUTO_TEST_CASE(test_clear) {
+ /* Quick check for tree::clear */
+ test_tree t(int_compare{});
+
+ for (int i = 0; i < 32; i++) {
+ t.emplace(i, i);
+ }
+
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_post_clear) {
+    /* Check that the tree is usable after clear */
+ test_tree t(int_compare{});
+
+ t.emplace(1, 1);
+ t.clear();
+ t.emplace(2, 2);
+ t.erase(2);
+}
+
+BOOST_AUTO_TEST_CASE(test_iterator_erase) {
+ /* Check iterator::erase */
+ test_tree t(int_compare{});
+ auto it = t.emplace(2, 2);
+ t.emplace(1, 321);
+ it.first.erase(int_compare{});
+ BOOST_REQUIRE(*t.find(1) == 321);
+ t.erase(1);
+}
+
+BOOST_AUTO_TEST_CASE(test_iterator_equal) {
+ test_tree t(int_compare{});
+ auto i1 = t.emplace(1, 1);
+ auto i2 = t.emplace(2, 2);
+ auto i3 = t.find(1);
+ BOOST_REQUIRE(i1.first == i3);
+ BOOST_REQUIRE(i1.first != i2.first);
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_lower_bound) {
+ test_tree t(int_compare{});
+ t.emplace(1, 11);
+ t.emplace(3, 13);
+
+ bool match;
+ BOOST_REQUIRE(*t.lower_bound(0, match) == 11 && !match);
+ BOOST_REQUIRE(*t.lower_bound(1, match) == 11 && match);
+ BOOST_REQUIRE(*t.lower_bound(2, match) == 13 && !match);
+ BOOST_REQUIRE(*t.lower_bound(3, match) == 13 && match);
+ BOOST_REQUIRE(t.lower_bound(4, match) == t.end() && !match);
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_upper_bound) {
+ test_tree t(int_compare{});
+ t.emplace(1, 11);
+ t.emplace(3, 13);
+
+ BOOST_REQUIRE(*t.upper_bound(0) == 11);
+ BOOST_REQUIRE(*t.upper_bound(1) == 13);
+ BOOST_REQUIRE(*t.upper_bound(2) == 13);
+ BOOST_REQUIRE(t.upper_bound(3) == t.end());
+ BOOST_REQUIRE(t.upper_bound(4) == t.end());
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_iterator_index) {
+ /* Check insertion iterator ++ and duplicate key */
+ test_tree t(int_compare{});
+ t.emplace(1, 10);
+ t.emplace(3, 13);
+ auto i = t.emplace(2, 2).first;
+ i++;
+ BOOST_REQUIRE(*i == 13);
+ auto i2 = t.emplace(2, 2); /* 2nd insert finds the previous */
+ BOOST_REQUIRE(!i2.second);
+ i2.first++;
+ BOOST_REQUIRE(*(i2.first) == 13);
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_before) {
+ /* Check iterator::insert_before */
+ test_tree t(int_compare{});
+ auto i3 = t.emplace(3, 13).first;
+ auto i2 = i3.emplace_before(2, int_compare{}, 12);
+ BOOST_REQUIRE(++i2 == i3);
+ BOOST_REQUIRE(*i3 == 13);
+ BOOST_REQUIRE(*--i2 == 12);
+ BOOST_REQUIRE(*--i3 == 12);
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_before_end) {
+ /* The same but for end() iterator */
+ test_tree t(int_compare{});
+ auto i = t.emplace(1, 1).first;
+ auto i2 = t.end().emplace_before(2, int_compare{}, 12);
+ BOOST_REQUIRE(++i == i2);
+ BOOST_REQUIRE(++i2 == t.end());
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_before_end_empty) {
+ /* The same, but for empty tree */
+ test_tree t(int_compare{});
+ auto i = t.end().emplace_before(42, int_compare{}, 142);
+ BOOST_REQUIRE(i == t.begin());
+ t.erase(42);
+}
+
+BOOST_AUTO_TEST_CASE(test_iterators) {
+ test_tree t(int_compare{});
+
+ for (auto i = t.rbegin(); i != t.rend(); i++) {
+ BOOST_REQUIRE(false);
+ }
+ for (auto i = t.begin(); i != t.end(); i++) {
+ BOOST_REQUIRE(false);
+ }
+
+ t.emplace(1, 7);
+ t.emplace(2, 9);
+
+ {
+ auto i = t.begin();
+ BOOST_REQUIRE(*(i++) == 7);
+ BOOST_REQUIRE(*(i++) == 9);
+ BOOST_REQUIRE(i == t.end());
+ }
+
+ {
+ auto i = t.rbegin();
+ BOOST_REQUIRE(*(i++) == 9);
+ BOOST_REQUIRE(*(i++) == 7);
+ BOOST_REQUIRE(i == t.rend());
+ }
+
+ t.clear();
+}
+
+/*
+ * Special test that makes sure "self-iterator" works OK.
+ * See comment near the bptree::iterator(T* d) constructor
+ * for details.
+ */
+class tree_data {
+ int _key;
+ int _cookie;
+public:
+ explicit tree_data(int cookie) : _key(-1), _cookie(cookie) {}
+ tree_data(int key, int cookie) : _key(key), _cookie(cookie) {}
+ int cookie() const { return _cookie; }
+ int key() const {
+ assert(_key != -1);
+ return _key;
+ }
+};
+
+BOOST_AUTO_TEST_CASE(test_data_self_iterator) {
+ using test_tree = tree<int, tree_data, int_compare, 4, key_search::both, with_debug::yes>;
+
+ test_tree t(int_compare{});
+ auto i = t.emplace(1, 42);
+ BOOST_REQUIRE(i.second);
+
+ tree_data* d = &(*i.first);
+ BOOST_REQUIRE(d->cookie() == 42);
+
+ test_tree::iterator di(d);
+ BOOST_REQUIRE(di->cookie() == 42);
+
+ di.erase(int_compare{});
+ BOOST_REQUIRE(t.find(1) == t.end());
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_before_nokey) {
+ using test_tree = tree<int, tree_data, int_compare, 4, key_search::both, with_debug::yes>;
+
+ test_tree t(int_compare{});
+ auto i = t.emplace(2, 52).first;
+ auto ni = i.emplace_before(int_compare{}, 1, 42);
+ BOOST_REQUIRE(ni->cookie() == 42);
+ ni++;
+ BOOST_REQUIRE(ni == i);
+ t.clear();
+}
+
+
+BOOST_AUTO_TEST_CASE(test_self_iterator_rover) {
+ test_tree t(int_compare{});
+ auto i = t.emplace(2, 42).first;
+ unsigned long* d = &(*i);
+ test_tree::iterator di(d);
+
+ i = di.emplace_before(1, int_compare{}, 31);
+ BOOST_REQUIRE(*i == 31);
+ BOOST_REQUIRE(*(++i) == 42);
+ BOOST_REQUIRE(++i == t.end());
+ BOOST_REQUIRE(++di == t.end());
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_erase_range) {
+ /* Quick check for erasing a range of elements */
+ test_tree t(int_compare{});
+
+ for (int i = 0; i < 32; i++) {
+ t.emplace(i, i);
+ }
+
+ auto b = t.find(8);
+ auto e = t.find(25);
+ t.erase(b, e);
+
+ BOOST_REQUIRE(*t.find(7) == 7);
+ BOOST_REQUIRE(t.find(8) == t.end());
+ BOOST_REQUIRE(t.find(24) == t.end());
+ BOOST_REQUIRE(*t.find(25) == 25);
+
+ t.clear();
+}
diff --git a/test/perf/perf_bptree.cc b/test/perf/perf_bptree.cc
new file mode 100644
index 000000000..2de5e5139
--- /dev/null
+++ b/test/perf/perf_bptree.cc
@@ -0,0 +1,165 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <seastar/core/app-template.hh>
+#include <seastar/core/thread.hh>
+#include <algorithm>
+#include <vector>
+#include <random>
+#include <fmt/core.h>
+
+using key_t = int;
+
+struct key_compare {
+ bool operator()(const key_t& a, const key_t& b) const { return a < b; }
+};
+
+#include "utils/bptree.hh"
+
+using namespace bplus;
+using namespace seastar;
+
+constexpr int TEST_NODE_SIZE = 4;
+
+/* On node size 4 (this test) linear search works better */
+using test_tree = tree<key_t, unsigned long, key_compare, TEST_NODE_SIZE, key_search::linear>;
+
+class collection_tester {
+public:
+ virtual void insert(key_t k) = 0;
+ virtual void erase(key_t k) = 0;
+ virtual void show_stats() = 0;
+ virtual ~collection_tester() {};
+};
+
+class bptree_tester : public collection_tester {
+ test_tree _t;
+public:
+ bptree_tester() : _t(key_compare{}) {}
+ virtual void insert(key_t k) override { _t.emplace(k, 0); }
+ virtual void erase(key_t k) override { _t.erase(k); }
+ virtual void show_stats() override {
+ struct bplus::stats st = _t.get_stats();
+ fmt::print("nodes: {}\n", st.nodes);
+ for (int i = 0; i < (int)st.nodes_filled.size(); i++) {
+ fmt::print(" {}: {} ({}%)\n", i, st.nodes_filled[i], st.nodes_filled[i] * 100 / st.nodes);
+ }
+ fmt::print("leaves: {}\n", st.leaves);
+ for (int i = 0; i < (int)st.leaves_filled.size(); i++) {
+ fmt::print(" {}: {} ({}%)\n", i, st.leaves_filled[i], st.leaves_filled[i] * 100 / st.leaves);
+ }
+ fmt::print("datas: {}\n", st.datas);
+ }
+ virtual ~bptree_tester() = default;
+};
+
+class set_tester : public collection_tester {
+ std::set<key_t> _s;
+public:
+ virtual void insert(key_t k) override { _s.insert(k); }
+ virtual void erase(key_t k) override { _s.erase(k); }
+ virtual void show_stats() override { }
+ virtual ~set_tester() = default;
+};
+
+class map_tester : public collection_tester {
+ std::map<key_t, unsigned long> _m;
+public:
+ virtual void insert(key_t k) override { _m[k] = 0; }
+ virtual void erase(key_t k) override { _m.erase(k); }
+ virtual void show_stats() override { }
+ virtual ~map_tester() = default;
+};
+
+int main(int argc, char **argv) {
+ namespace bpo = boost::program_options;
+ app_template app;
+ app.add_options()
+ ("count", bpo::value<int>()->default_value(5000000), "number of keys to fill the tree with")
+ ("batch", bpo::value<int>()->default_value(100), "number of operations between deferring points")
+ ("iters", bpo::value<int>()->default_value(1), "number of iterations")
+ ("col", bpo::value<std::string>()->default_value("bptree"), "collection to test")
+ ("stats", bpo::value<bool>()->default_value(false), "show stats");
+
+ return app.run(argc, argv, [&app] {
+ auto count = app.configuration()["count"].as<int>();
+ auto iters = app.configuration()["iters"].as<int>();
+ auto batch = app.configuration()["batch"].as<int>();
+ auto col = app.configuration()["col"].as<std::string>();
+ auto stats = app.configuration()["stats"].as<bool>();
+
+ return seastar::async([count, iters, batch, col, stats] {
+ int rep = iters;
+ collection_tester* c;
+
+ if (col == "bptree") {
+ c = new bptree_tester();
+ } else if (col == "set") {
+ c = new set_tester();
+ } else if (col == "map") {
+ c = new map_tester();
+ } else {
+ fmt::print("Unknown collection\n");
+ return;
+ }
+
+ std::vector<int> keys;
+
+ for (int i = 0; i < count; i++) {
+ keys.push_back(i + 1);
+ }
+
+ std::random_device rd;
+ std::mt19937 g(rd());
+
+ fmt::print("Inserting {:d} k:v pairs into {} {:d} times\n", count, col, iters);
+
+ again:
+ std::shuffle(keys.begin(), keys.end(), g);
+ seastar::thread::yield();
+ for (int i = 0; i < count; i++) {
+ c->insert(keys[i]);
+ if ((i + 1) % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+
+ if (stats) {
+ c->show_stats();
+ }
+
+ std::shuffle(keys.begin(), keys.end(), g);
+ seastar::thread::yield();
+ for (int i = 0; i < count; i++) {
+ c->erase(keys[i]);
+ if ((i + 1) % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+
+ if (--rep > 0) {
+ goto again;
+ }
+
+ delete c;
+ });
+ });
+}
diff --git a/test/perf/perf_bptree_drain.cc b/test/perf/perf_bptree_drain.cc
new file mode 100644
index 000000000..703e15719
--- /dev/null
+++ b/test/perf/perf_bptree_drain.cc
@@ -0,0 +1,154 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <chrono>
+#include <seastar/core/app-template.hh>
+#include <seastar/core/thread.hh>
+#include <seastar/core/weak_ptr.hh>
+#include <algorithm>
+#include <vector>
+#include <random>
+#include <fmt/core.h>
+#include "perf.hh"
+
+using key_t = int;
+
+struct key_compare {
+ bool operator()(const key_t& a, const key_t& b) const { return a < b; }
+};
+
+#include "utils/bptree.hh"
+
+using namespace bplus;
+using namespace seastar;
+
+constexpr int TEST_NODE_SIZE = 4;
+
+/* On node size 4 (this test) linear search works better */
+using test_tree = tree<key_t, unsigned long, key_compare, TEST_NODE_SIZE, key_search::linear>;
+
+class collection_tester {
+public:
+ virtual void insert(key_t k) = 0;
+ virtual void drain(int batch) = 0;
+ virtual ~collection_tester() {};
+};
+
+class bptree_tester : public collection_tester {
+ test_tree _t;
+public:
+ bptree_tester() : _t(key_compare{}) {}
+ virtual void insert(key_t k) override { _t.emplace(k, 0); }
+ virtual void drain(int batch) override {
+ int x = 0;
+ auto i = _t.begin();
+ while (i != _t.end()) {
+ i = i.erase(key_compare{});
+ if (++x % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+ }
+ virtual ~bptree_tester() = default;
+};
+
+class set_tester : public collection_tester {
+ std::set<key_t> _s;
+public:
+ virtual void insert(key_t k) override { _s.insert(k); }
+ virtual void drain(int batch) override {
+ int x = 0;
+ auto i = _s.begin();
+ while (i != _s.end()) {
+ i = _s.erase(i);
+ if (++x % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+ }
+ virtual ~set_tester() = default;
+};
+
+int main(int argc, char **argv) {
+ namespace bpo = boost::program_options;
+ app_template app;
+ app.add_options()
+ ("count", bpo::value<int>()->default_value(5000000), "number of keys to fill the tree with")
+ ("batch", bpo::value<int>()->default_value(100), "number of operations between deferring points")
+ ("iters", bpo::value<int>()->default_value(1), "number of iterations")
+ ("col", bpo::value<std::string>()->default_value("bptree"), "collection to test");
+
+ return app.run(argc, argv, [&app] {
+ auto count = app.configuration()["count"].as<int>();
+ auto iters = app.configuration()["iters"].as<int>();
+ auto batch = app.configuration()["batch"].as<int>();
+ auto col = app.configuration()["col"].as<std::string>();
+
+ return seastar::async([count, iters, batch, col] {
+ int rep = iters;
+ collection_tester* c;
+
+ if (col == "bptree") {
+ c = new bptree_tester();
+ } else if (col == "set") {
+ c = new set_tester();
+ } else {
+ fmt::print("Unknown collection\n");
+ return;
+ }
+
+ std::vector<int> keys;
+
+ for (int i = 0; i < count; i++) {
+ keys.push_back(i + 1);
+ }
+
+ std::random_device rd;
+ std::mt19937 g(rd());
+
+ fmt::print("Inserting {:d} k:v pairs into {} {:d} times\n", count, col, iters);
+
+ again:
+ std::shuffle(keys.begin(), keys.end(), g);
+ seastar::thread::yield();
+ for (int i = 0; i < count; i++) {
+ c->insert(keys[i]);
+ if ((i + 1) % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+
+ scheduling_latency_measurer invalidate_slm;
+ invalidate_slm.start();
+ auto d = duration_in_seconds([&] {
+ c->drain(batch);
+ });
+ invalidate_slm.stop();
+ fmt::print("drain: {:.6f} [ms], preemption: {}\n", d.count() * 1000, invalidate_slm);
+
+ if (--rep > 0) {
+ goto again;
+ }
+
+ delete c;
+ });
+ });
+}
diff --git a/test/unit/bptree_compaction_test.cc b/test/unit/bptree_compaction_test.cc
new file mode 100644
index 000000000..9b1b48072
--- /dev/null
+++ b/test/unit/bptree_compaction_test.cc
@@ -0,0 +1,210 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <seastar/core/app-template.hh>
+#include <seastar/core/thread.hh>
+#include <map>
+#include <vector>
+#include <random>
+#include <string>
+#include <iostream>
+#include <fmt/core.h>
+#include "utils/logalloc.hh"
+
+constexpr int TEST_NODE_SIZE = 7;
+
+#include "bptree_key.hh"
+#include "utils/bptree.hh"
+#include "bptree_validation.hh"
+
+using namespace bplus;
+using namespace seastar;
+
+class test_data {
+ int _value;
+public:
+ test_data() : _value(0) {}
+ test_data(test_key& k) : _value((int)k + 10) {}
+
+ operator unsigned long() const { return _value; }
+ bool match_key(const test_key& k) const { return _value == (int)k + 10; }
+};
+using test_tree = tree<test_key, test_data, test_key_compare, TEST_NODE_SIZE, key_search::both, with_debug::yes>;
+using test_validator = validator<test_key, test_data, test_key_compare, TEST_NODE_SIZE>;
+
+class reference {
+ reference* _ref = nullptr;
+public:
+ reference() = default;
+ reference(const reference& other) = delete;
+
+ reference(reference&& other) noexcept : _ref(other._ref) {
+ if (_ref != nullptr) {
+ _ref->_ref = this;
+ }
+ other._ref = nullptr;
+ }
+
+ ~reference() {
+ if (_ref != nullptr) {
+ _ref->_ref = nullptr;
+ }
+ }
+
+ void link(reference& other) {
+ assert(_ref == nullptr);
+ _ref = &other;
+ other._ref = this;
+ }
+
+ reference* get() {
+ assert(_ref != nullptr);
+ return _ref;
+ }
+};
+
+class tree_pointer {
+ reference _ref;
+
+ class tree_wrapper {
+ friend class tree_pointer;
+ test_tree _tree;
+ reference _ref;
+ public:
+ tree_wrapper() : _tree(test_key_compare{}) {}
+ };
+
+ tree_wrapper* get_wrapper() {
+ return boost::intrusive::get_parent_from_member(_ref.get(), &tree_wrapper::_ref);
+ }
+
+public:
+
+ tree_pointer(const tree_pointer& other) = delete;
+ tree_pointer(tree_pointer&& other) = delete;
+
+ tree_pointer() {
+ tree_wrapper *t = current_allocator().construct<tree_wrapper>();
+ _ref.link(t->_ref);
+ }
+
+ test_tree* operator->() {
+ tree_wrapper *tw = get_wrapper();
+ return &tw->_tree;
+ }
+
+ test_tree& operator*() {
+ tree_wrapper *tw = get_wrapper();
+ return tw->_tree;
+ }
+
+ ~tree_pointer() {
+ tree_wrapper *tw = get_wrapper();
+ current_allocator().destroy(tw);
+ }
+};
+
+int main(int argc, char **argv) {
+ namespace bpo = boost::program_options;
+ app_template app;
+ app.add_options()
+ ("count", bpo::value<int>()->default_value(10000), "number of keys to fill the tree with")
+ ("iters", bpo::value<int>()->default_value(13), "number of iterations")
+ ("verb", bpo::value<bool>()->default_value(false), "be verbose");
+
+ return app.run(argc, argv, [&app] {
+ auto count = app.configuration()["count"].as<int>();
+ auto rep = app.configuration()["iters"].as<int>();
+ auto verb = app.configuration()["verb"].as<bool>();
+
+ return seastar::async([count, rep, verb] {
+ int iter = rep;
+ std::vector<int> keys;
+ for (int i = 0; i < count; i++) {
+ keys.push_back(i + 1);
+ }
+
+ std::random_device rd;
+ std::mt19937 g(rd());
+
+ fmt::print("Compacting {:d} k:v pairs {:d} times\n", count, iter);
+
+ test_validator tv;
+
+ logalloc::region mem;
+
+ with_allocator(mem.allocator(), [&] {
+ tree_pointer t;
+
+ again:
+ {
+ std::shuffle(keys.begin(), keys.end(), g);
+
+ logalloc::reclaim_lock rl(mem);
+
+ for (int i = 0; i < count; i++) {
+ test_key k(keys[i]);
+
+ auto ti = t->emplace(std::move(copy_key(k)), k);
+ assert(ti.second);
+ seastar::thread::maybe_yield();
+ }
+ }
+
+ mem.full_compaction();
+
+ if (verb) {
+ fmt::print("After fill + compact\n");
+ tv.print_tree(*t, '|');
+ }
+
+ tv.validate(*t);
+
+ {
+ std::shuffle(keys.begin(), keys.end(), g);
+
+ logalloc::reclaim_lock rl(mem);
+
+ for (int i = 0; i < count; i++) {
+ test_key k(keys[i]);
+
+ t->erase(k);
+ seastar::thread::maybe_yield();
+ }
+ }
+
+ mem.full_compaction();
+
+ if (verb) {
+ fmt::print("After erase + compact\n");
+ tv.print_tree(*t, '|');
+ }
+
+ tv.validate(*t);
+
+ if (--iter > 0) {
+ seastar::thread::yield();
+ goto again;
+ }
+ });
+ });
+ });
+}
diff --git a/test/unit/bptree_stress_test.cc b/test/unit/bptree_stress_test.cc
new file mode 100644
index 000000000..3060b1a7b
--- /dev/null
+++ b/test/unit/bptree_stress_test.cc
@@ -0,0 +1,236 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <seastar/core/app-template.hh>
+#include <seastar/core/thread.hh>
+#include <map>
+#include <vector>
+#include <random>
+#include <string>
+#include <iostream>
+#include <fmt/core.h>
+#include <fmt/ostream.h>
+
+constexpr int TEST_NODE_SIZE = 16;
+
+#include "bptree_key.hh"
+#include "utils/bptree.hh"
+#include "bptree_validation.hh"
+
+using namespace bplus;
+using namespace seastar;
+
+class test_data {
+ int _value;
+public:
+ test_data() : _value(0) {}
+ test_data(test_key& k) : _value((int)k + 10) {}
+
+ operator unsigned long() const { return _value; }
+ bool match_key(const test_key& k) const { return _value == (int)k + 10; }
+};
+
+std::ostream& operator<<(std::ostream& os, test_data d) {
+ os << (unsigned long)d;
+ return os;
+}
+
+using test_tree = tree<test_key, test_data, test_key_compare, TEST_NODE_SIZE, key_search::both, with_debug::yes>;
+using test_node = typename test_tree::node;
+using test_validator = validator<test_key, test_data, test_key_compare, TEST_NODE_SIZE>;
+using test_iterator_checker = iterator_checker<test_key, test_data, test_key_compare, TEST_NODE_SIZE>;
+
+int main(int argc, char **argv) {
+ namespace bpo = boost::program_options;
+ app_template app;
+ app.add_options()
+ ("count", bpo::value<int>()->default_value(4132), "number of keys to fill the tree with")
+ ("iters", bpo::value<int>()->default_value(9), "number of iterations")
+ ("keys", bpo::value<std::string>()->default_value("rand"), "how to generate keys (rand, asc, desc)")
+ ("verb", bpo::value<bool>()->default_value(false), "be verbose");
+
+ return app.run(argc, argv, [&app] {
+ auto count = app.configuration()["count"].as<int>();
+ auto iters = app.configuration()["iters"].as<int>();
+ auto ks = app.configuration()["keys"].as<std::string>();
+ auto verb = app.configuration()["verb"].as<bool>();
+
+ return seastar::async([count, iters, ks, verb] {
+ int rep = iters;
+ auto *t = new test_tree(test_key_compare{});
+ std::map<int, unsigned long> oracle;
+
+ int p = count / 10;
+ if (p == 0) {
+ p = 1;
+ }
+
+ std::vector<int> keys;
+
+ for (int i = 0; i < count; i++) {
+ keys.push_back(i + 1);
+ }
+
+ std::random_device rd;
+ std::mt19937 g(rd());
+
+ fmt::print("Inserting {:d} k:v pairs {:d} times\n", count, rep);
+
+ test_validator tv;
+
+ if (ks == "desc") {
+ fmt::print("Reversing keys vector\n");
+ std::reverse(keys.begin(), keys.end());
+ }
+
+ bool shuffle = ks == "rand";
+ if (shuffle) {
+ fmt::print("Will shuffle keys each iteration\n");
+ }
+
+
+ again:
+ auto* itc = new test_iterator_checker(tv, *t);
+
+ if (shuffle) {
+ std::shuffle(keys.begin(), keys.end(), g);
+ }
+
+ for (int i = 0; i < count; i++) {
+ test_key k(keys[i]);
+
+ if (verb) {
+ fmt::print("+++ {}\n", (int)k);
+ }
+
+ if (rep % 2 != 1) {
+ auto ir = t->emplace(std::move(copy_key(k)), k);
+ assert(ir.second);
+ } else {
+ auto ir = t->lower_bound(k);
+ ir.emplace_before(std::move(copy_key(k)), test_key_compare{}, k);
+ }
+ oracle[keys[i]] = keys[i] + 10;
+
+ if (verb) {
+ fmt::print("Validating\n");
+ tv.print_tree(*t, '|');
+ }
+
+ /* Limit validation rate for many keys */
+ if (i % (i/1000 + 1) == 0) {
+ tv.validate(*t);
+ }
+
+ if (i % 7 == 0) {
+ if (!itc->step()) {
+ delete itc;
+ itc = new test_iterator_checker(tv, *t);
+ }
+ }
+
+ seastar::thread::maybe_yield();
+ }
+
+ auto sz = t->size_slow();
+ if (sz != (size_t)count) {
+ fmt::print("Size {} != count {}\n", sz, count);
+ throw "size";
+ }
+
+ auto ti = t->begin();
+ for (auto oe : oracle) {
+ if (*ti != oe.second) {
+ fmt::print("Data mismatch {} vs {}\n", oe.second, *ti);
+ throw "oracle";
+ }
+ ti++;
+ }
+
+ if (shuffle) {
+ std::shuffle(keys.begin(), keys.end(), g);
+ }
+
+ for (int i = 0; i < count; i++) {
+ test_key k(keys[i]);
+
+ /*
+ * kill iterator if we're removing what it points to,
+ * otherwise it's not invalidated
+ */
+ if (itc->here(k)) {
+ delete itc;
+ itc = nullptr;
+ }
+
+ if (verb) {
+ fmt::print("--- {}\n", (int)k);
+ }
+
+ if (rep % 3 != 2) {
+ t->erase(k);
+ } else {
+ auto ri = t->find(k);
+ auto ni = ri;
+ ni++;
+ auto eni = ri.erase(test_key_compare{});
+ assert(ni == eni);
+ }
+
+ oracle.erase(keys[i]);
+
+ if (verb) {
+ fmt::print("Validating\n");
+ tv.print_tree(*t, '|');
+ }
+
+ if ((count-i) % ((count-i)/1000 + 1) == 0) {
+ tv.validate(*t);
+ }
+
+ if (itc == nullptr) {
+ itc = new test_iterator_checker(tv, *t);
+ }
+
+ if (i % 5 == 0) {
+ if (!itc->step()) {
+ delete itc;
+ itc = new test_iterator_checker(tv, *t);
+ }
+ }
+
+ seastar::thread::maybe_yield();
+ }
+
+ delete itc;
+
+ if (--rep > 0) {
+ if (verb) {
+ fmt::print("{:d} iterations left\n", rep);
+ }
+ goto again;
+ }
+
+ oracle.clear();
+ delete t;
+ });
+ });
+}
--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>
May 6, 2020, 2:21:53 PM5/6/20
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The collection is a K:V store

bplus::tree<Key = K, Value = array_trusted_bounds<V>>

It will be used as the partitions cache. The outer tree quickly
maps a token to a cache_entry; the inner array resolves the
(expected to be rare) hash collisions.

It also must be equipped with two comparators -- a "less" one
for keys and a full (three-way) one for values.

The core API consists of just two calls:

- Heterogeneous lower_bound(search_key) -> iterator : finds the
element that's greater than or equal to the provided search key.

Besides the iterator, the call returns a "hint" object
that helps the next call.

- emplace_before(iterator, key, hint, ...) : constructs the
element right before the given iterator. The key and hint
enable a more optimal algorithm but, strictly speaking, are
not required.

Adding an entry to the double_decker may result in growing the
node's array. This is where the B+ iterator's .reconstruct() method
comes into play: a new array is created, the old elements are moved
onto it, and then the fresh array replaces the old one.

// TODO: Ideally this should be turned into the
// template <typename OuterCollection, typename InnerCollection>
// but for now the double_decker still has some intimate knowledge
// about what outer and inner collections are.

Insertion into this collection _may_ invalidate iterators, but may
also leave them intact. Invalidation only happens in case of a hash
conflict, which can be clearly seen from the hint object, so there's
good room for improvement here.

The main usage by row_cache (the find_or_create_entry path) looks like

    cache_entry& find_or_create_entry() {
        bound_hint hint;

        auto it = lower_bound(decorated_key, &hint);
        if (!found) {
            it = emplace_before(it, decorated_key.token(), hint,
                                <constructor args>);
        }
        return *it;
    }

Now, the hint. It contains three booleans:

- match: set to true when the "greater or equal" condition
evaluates to "equal". This frees the caller from the need
to manually check whether the returned entry matches the
search key or whether a new one should be inserted.

This is the "!found" check from the above snippet.

To explain the next 2 bools, here's a small example. Consider
the tree containing two elements {token, partition key}:

{ 3, "a" }, { 5, "z" }

As the collection is sorted they go in the order shown. Next,
this is what the lower_bound would return for some cases:

{ 3, "z" } -> { 5, "z" }
{ 4, "a" } -> { 5, "z" }
{ 5, "a" } -> { 5, "z" }

Apparently, the lower bound for those three elements is the same,
but the code flows for emplacing each of them before it differ
drastically.

{ 3, "z" } : need to get the previous element from the tree and
push the element to the back of its vector
{ 4, "a" } : need to create new element in the tree and populate
its empty vector with the single element
{ 5, "a" } : need to put the new element in the found tree
element right before the found vector position

To make one of the above decisions, .emplace_before would need
to perform another set of comparisons of keys and elements.
Fortunately, the needed information is already known inside the
lower_bound call and can be reported back via the hint.

That said,

- key_match: set to true if tree.lower_bound() found the element
for the Key (which is the token). For the above examples this
will be true for cases 3z and 5a.

- key_tail: set to true if the tree element was found, but when
comparing values from the array, the bounding element turned out
to belong to the next tree element, so the iterator was ++-ed.
For the above examples this would be true for case 3z only.

And last, but not least -- the "erase self" feature, which, given
only a cache_entry pointer, removes the entry from the collection.
To make this happen we need to take two steps:

1. get the array the entry sits in
2. get the B+ tree node the array sits in

Both methods are provided by array_trusted_bounds and bplus::tree.
So, when we need to get an iterator from a given T pointer, the
algo looks like

- Walk back the T array until hitting the head element
- Call array_trusted_bounds::from_element() to get the array
- Construct a B+ iterator from the obtained array
- Construct the double_decker iterator from the B+ iterator and
the number of "steps back" from above
- Call double_decker::iterator.erase()

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
configure.py | 1 +
utils/double-decker.hh | 286 ++++++++++++++++++++++++++++
test/boost/double_decker_test.cc | 313 +++++++++++++++++++++++++++++++
3 files changed, 600 insertions(+)
create mode 100644 utils/double-decker.hh
create mode 100644 test/boost/double_decker_test.cc

diff --git a/configure.py b/configure.py
index f413f75a5..9dc0d9872 100755
--- a/configure.py
+++ b/configure.py
@@ -383,6 +383,7 @@ scylla_tests = set([
'test/boost/vint_serialization_test',
'test/boost/virtual_reader_test',
'test/boost/bptree_test',
+ 'test/boost/double_decker_test',
'test/manual/ec2_snitch_test',
'test/manual/gce_snitch_test',
'test/manual/gossip',
diff --git a/utils/double-decker.hh b/utils/double-decker.hh
new file mode 100644
index 000000000..4029b82ae
--- /dev/null
+++ b/utils/double-decker.hh
@@ -0,0 +1,286 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+#include "utils/bptree.hh"
+#include "utils/array_trusted_bounds.hh"
+#include <fmt/core.h>
+
+/*
+ * The double-decker is the ordered keeper of key:value pairs having
+ * the pairs sorted by both key and value (key first).
+ *
+ * Key collisions are expected to be rare enough to afford holding
+ * the values in a sorted array with the help of linear algorithms.
+ */
+
+GCC6_CONCEPT(
+ template <typename T1, typename T2, typename Compare>
+ concept bool Comparable = requires (const T1& a, const T2& b, Compare cmp) {
+ { cmp(a, b) } -> int;
+ };
+)
+
+template <typename Key, typename T, typename Less, typename Compare, int NodeSize,
+ bplus::key_search Search = bplus::key_search::binary, bplus::with_debug Debug = bplus::with_debug::no>
+GCC6_CONCEPT( requires Comparable<T, T, Compare> && std::is_nothrow_move_constructible_v<T> )
+class double_decker {
+ using inner_array = array_trusted_bounds<T>;
+ using outer_tree = bplus::tree<Key, inner_array, Less, NodeSize, Search, Debug>;
+ using outer_iterator = typename outer_tree::iterator;
+
+ outer_tree _tree;
+ Compare _cmp;
+
+public:
+ class iterator {
+ friend class double_decker;
+
+ using inner_array = typename double_decker::inner_array;
+ using outer_iterator = typename double_decker::outer_iterator;
+
+ outer_iterator _bucket;
+ int _idx;
+
+ public:
+ using iterator_category = std::bidirectional_iterator_tag;
+ using difference_type = ssize_t;
+ using value_type = T;
+ using pointer = value_type*;
+ using reference = value_type&;
+
+ iterator() = default;
+ iterator(outer_iterator bkt, int idx) : _bucket(bkt), _idx(idx) {}
+
+ iterator(T* ptr) : _idx(0) {
+ inner_array& arr = inner_array::from_element(ptr, _idx);
+ _bucket = outer_iterator(&arr);
+ }
+
+ reference operator*() const { return (*_bucket)[_idx]; }
+ pointer operator->() const { return &((*_bucket)[_idx]); }
+
+ iterator& operator++() {
+ if ((*_bucket)[_idx++].is_tail()) {
+ _bucket++;
+ _idx = 0;
+ }
+
+ return *this;
+ }
+
+ iterator operator++(int) {
+ iterator cur = *this;
+ operator++();
+ return cur;
+ }
+
+ iterator& operator--() {
+ if (_idx-- == 0) {
+ _bucket--;
+ _idx = _bucket->index_of(_bucket->end()) - 1;
+ }
+
+ return *this;
+ }
+
+ iterator operator--(int) {
+ iterator cur = *this;
+ operator--();
+ return cur;
+ }
+
+ bool operator==(const iterator& o) const { return _bucket == o._bucket && _idx == o._idx; }
+ bool operator!=(const iterator& o) const { return !(*this == o); }
+
+ template <typename Func>
+ GCC6_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ iterator erase_and_dispose(Less less, Func&& disp) {
+ disp(**this);
+
+ if (_bucket->is_single_element()) {
+ outer_iterator bkt = _bucket.erase(less);
+ return iterator(bkt, 0);
+ }
+
+ bool tail = (*_bucket)[_idx].is_tail();
+ _bucket.reconstruct(_bucket.storage_size() - sizeof(T), *_bucket, _idx, typename inner_array::shrink_tag{});
+
+ if (tail) {
+ _bucket++;
+ _idx = 0;
+ }
+
+ return *this;
+ }
+
+ iterator erase(Less less) { return erase_and_dispose(less, bplus::default_dispose<T>); }
+ };
+
+ /*
+     * Structure that sheds some more light on how the lower_bound
+ * actually found the bounding elements.
+ */
+ struct bound_hint {
+ /*
+         * Set to true if the element fully matched the key
+ * according to Compare
+ */
+ bool match;
+ /*
+ * Set to true if the bucket for the given key exists
+ */
+ bool key_match;
+ /*
+         * Set to true if the given key is greater than anything
+         * in the bucket and the iterator was switched to the next
+         * one (or when key_match is false)
+ */
+ bool key_tail;
+ };
+
+ iterator begin() { return iterator(_tree.begin(), 0); }
+ iterator end() { return iterator(_tree.end(), 0); }
+
+ double_decker(Less less, Compare cmp) : _tree(less), _cmp(cmp) { }
+
+ double_decker(const double_decker& other) = delete;
+ double_decker(double_decker&& other)
+ : _tree(std::move(other._tree))
+ , _cmp(std::move(other._cmp)) {}
+
+ iterator insert(Key k, T value) {
+ std::pair<outer_iterator, bool> oip = _tree.emplace(std::move(k), std::move(value));
+ outer_iterator& bkt = oip.first;
+ int idx = 0;
+
+ if (!oip.second) {
+ /*
+             * Unlikely, but in this case we reconstruct the array. The value
+             * cannot have been moved by the failed emplace() above.
+ */
+ idx = bkt->index_of(bkt->lower_bound(value, _cmp));
+ bkt.reconstruct(bkt.storage_size() + sizeof(T), *bkt, idx, std::move(value));
+ }
+
+ return iterator(bkt, idx);
+ }
+
+ template <typename... Args>
+ iterator emplace_before(iterator i, Key k, const bound_hint& hint, Args&&... args) {
+ assert(!hint.match);
+ outer_iterator& bucket = i._bucket;
+
+ if (!hint.key_match) {
+            /*
+             * The most common case -- no key conflict, so the bucket was
+             * not found and i points to the next one. Just emplace the
+             * new bucket before i and push the 0th element into it.
+             */
+ outer_iterator nb = bucket.emplace_before(std::move(k), _tree.less(), std::forward<Args>(args)...);
+ return iterator(nb, 0);
+ }
+
+ /*
+     * Key conflict: some inner array needs expanding, but there are
+     * still two cases -- either the bounding element is in k's bucket,
+     * or the bound search overflowed and switched to the next one.
+ */
+
+ int idx = i._idx;
+
+ if (hint.key_tail) {
+ /*
+             * The latter case -- i points to the next bucket. Step back
+             * and append the new element to its tail.
+ */
+ bucket--;
+ idx = bucket->index_of(bucket->end());
+ }
+
+ bucket.reconstruct(bucket.storage_size() + sizeof(T), *bucket, idx, std::forward<Args>(args)...);
+ return iterator(bucket, idx);
+ }
+
+ template <typename K = Key>
+ GCC6_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator find(const K& key) {
+ outer_iterator bkt = _tree.find(key);
+ int idx = 0;
+
+ if (bkt != _tree.end()) {
+ bool match = false;
+ idx = bkt->index_of(bkt->lower_bound(key, _cmp, match));
+ if (!match) {
+ bkt = _tree.end();
+ idx = 0;
+ }
+ }
+
+ return iterator(bkt, idx);
+ }
+
+ template <typename K = Key>
+ GCC6_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator lower_bound(const K& key, bound_hint& hint) {
+ outer_iterator bkt = _tree.lower_bound(key, hint.key_match);
+
+ hint.key_tail = false;
+ hint.match = false;
+
+ if (bkt == _tree.end() || !hint.key_match) {
+ return iterator(bkt, 0);
+ }
+
+ int i = bkt->index_of(bkt->lower_bound(key, _cmp, hint.match));
+
+ if (i != 0 && (*bkt)[i - 1].is_tail()) {
+ /*
+             * The lower_bound is after the last element -- shift
+             * to the next bucket's 0th one.
+ */
+ bkt++;
+ i = 0;
+ hint.key_tail = true;
+ }
+
+ return iterator(bkt, i);
+ }
+
+ template <typename K = Key>
+ GCC6_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator lower_bound(const K& key) {
+ bound_hint hint;
+ return lower_bound(key, hint);
+ }
+
+ template <typename Func>
+ GCC6_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ void clear_and_dispose(Func&& disp) {
+ _tree.clear_and_dispose([&disp] (inner_array& arr) {
+ arr.for_each(disp);
+ });
+ }
+
+ void clear() { clear_and_dispose(bplus::default_dispose<T>); }
+};
diff --git a/test/boost/double_decker_test.cc b/test/boost/double_decker_test.cc
new file mode 100644
index 000000000..cc8fe5dfc
--- /dev/null
+++ b/test/boost/double_decker_test.cc
@@ -0,0 +1,313 @@
+
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#define BOOST_TEST_MODULE double_decker
+
+#include <seastar/core/print.hh>
+#include <boost/test/unit_test.hpp>
+#include <fmt/core.h>
+#include <string>
+
+#include "utils/double-decker.hh"
+#include "test/lib/random_utils.hh"
+
+class compound_key {
+public:
+ int key;
+ std::string sub_key;
+
+    compound_key(int k, std::string sk) noexcept : key(k), sub_key(std::move(sk)) {}
+
+ compound_key(const compound_key& other) = delete;
+ compound_key(compound_key&& other) noexcept : key(other.key), sub_key(std::move(other.sub_key)) {}
+
+ compound_key& operator=(const compound_key& other) = delete;
+ compound_key& operator=(compound_key&& other) noexcept {
+ key = other.key;
+ sub_key = std::move(other.sub_key);
+ return *this;
+ }
+
+ std::string format() const {
+ return seastar::format("{}.{}", key, sub_key);
+ }
+
+ bool operator==(const compound_key& other) const {
+ return key == other.key && sub_key == other.sub_key;
+ }
+
+ bool operator!=(const compound_key& other) const { return !(*this == other); }
+
+ struct compare {
+ int operator()(const int& a, const int& b) const { return a - b; }
+ int operator()(const int& a, const compound_key& b) const { return a - b.key; }
+ int operator()(const compound_key& a, const int& b) const { return a.key - b; }
+
+ int operator()(const compound_key& a, const compound_key& b) const {
+ if (a.key != b.key) {
+ return this->operator()(a.key, b.key);
+ } else {
+ return a.sub_key.compare(b.sub_key);
+ }
+ }
+ };
+
+ struct less_compare {
+ compare cmp;
+
+ template <typename A, typename B>
+ bool operator()(const A& a, const B& b) const {
+ return cmp(a, b) < 0;
+ }
+ };
+};
+
+class test_data {
+ compound_key _key;
+ bool _head = false;
+ bool _tail = false;
+ int *_cookie;
+ int *_cookie2;
+public:
+ bool is_head() const { return _head; }
+ bool is_tail() const { return _tail; }
+ void set_head(bool v) { _head = v; }
+ void set_tail(bool v) { _tail = v; }
+
+ test_data(int key, std::string sub) : _key(key, sub), _cookie(new int(0)), _cookie2(new int(0)) {}
+
+ test_data(const test_data& other) = delete;
+ test_data(test_data&& other) noexcept : _key(std::move(other._key)), _head(other._head), _tail(other._tail),
+ _cookie(other._cookie), _cookie2(new int(0)) {
+ other._cookie = nullptr;
+ }
+
+ ~test_data() {
+ if (_cookie != nullptr) {
+ delete _cookie;
+ }
+ delete _cookie2;
+ }
+
+ bool operator==(const compound_key& k) { return _key == k; }
+
+ test_data& operator=(const test_data& other) = delete;
+ test_data& operator=(test_data&& other) = delete;
+
+ std::string format() const { return _key.format(); }
+
+ struct compare {
+ compound_key::compare kcmp;
+ int operator()(const int& a, const int& b) { return kcmp(a, b); }
+ int operator()(const compound_key& a, const int& b) { return kcmp(a.key, b); }
+ int operator()(const int& a, const compound_key& b) { return kcmp(a, b.key); }
+ int operator()(const compound_key& a, const compound_key& b) { return kcmp(a, b); }
+ int operator()(const compound_key& a, const test_data& b) { return kcmp(a, b._key); }
+ int operator()(const test_data& a, const compound_key& b) { return kcmp(a._key, b); }
+ int operator()(const test_data& a, const test_data& b) { return kcmp(a._key, b._key); }
+ };
+};
+
+using collection = double_decker<int, test_data, compound_key::less_compare, test_data::compare, 4,
+ bplus::key_search::both, bplus::with_debug::yes>;
+using oracle = std::set<compound_key, compound_key::less_compare>;
+
+BOOST_AUTO_TEST_CASE(test_lower_bound) {
+ collection c(compound_key::less_compare{}, test_data::compare{});
+
+ c.insert(3, test_data(3, "e"));
+ c.insert(5, test_data(5, "i"));
+ c.insert(5, test_data(5, "o"));
+
+ collection::bound_hint h;
+
+ BOOST_REQUIRE(*c.lower_bound(compound_key(2, "a"), h) == compound_key(3, "e") && !h.key_match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(3, "a"), h) == compound_key(3, "e") && h.key_match && !h.key_tail && !h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(3, "e"), h) == compound_key(3, "e") && h.key_match && !h.key_tail && h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(3, "o"), h) == compound_key(5, "i") && h.key_match && h.key_tail && !h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(4, "i"), h) == compound_key(5, "i") && !h.key_match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(5, "a"), h) == compound_key(5, "i") && h.key_match && !h.key_tail && !h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(5, "i"), h) == compound_key(5, "i") && h.key_match && !h.key_tail && h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(5, "l"), h) == compound_key(5, "o") && h.key_match && !h.key_tail && !h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(5, "o"), h) == compound_key(5, "o") && h.key_match && !h.key_tail && h.match);
+ BOOST_REQUIRE(c.lower_bound(compound_key(5, "q"), h) == c.end() && h.key_match && h.key_tail);
+ BOOST_REQUIRE(c.lower_bound(compound_key(6, "q"), h) == c.end() && !h.key_match);
+
+ c.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_self_iterator) {
+ collection c(compound_key::less_compare{}, test_data::compare{});
+
+ c.insert(1, std::move(test_data(1, "a")));
+ c.insert(1, std::move(test_data(1, "b")));
+ c.insert(2, std::move(test_data(2, "c")));
+ c.insert(3, std::move(test_data(3, "d")));
+ c.insert(3, std::move(test_data(3, "e")));
+
+ auto erase_by_ptr = [&c] (int key, std::string sub) {
+ test_data* d = &*c.find(compound_key(key, sub));
+ collection::iterator di(d);
+ di.erase(compound_key::less_compare{});
+ };
+
+ erase_by_ptr(1, "b");
+ erase_by_ptr(2, "c");
+ erase_by_ptr(3, "d");
+
+ auto i = c.begin();
+ BOOST_REQUIRE(*i++ == compound_key(1, "a"));
+ BOOST_REQUIRE(*i++ == compound_key(3, "e"));
+ BOOST_REQUIRE(i == c.end());
+
+ c.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_end_iterator) {
+ collection c(compound_key::less_compare{}, test_data::compare{});
+
+ c.insert(1, std::move(test_data(1, "a")));
+ auto i = std::prev(c.end());
+ BOOST_REQUIRE(*i == compound_key(1, "a"));
+
+ c.clear();
+}
+
+void validate_sorted(collection& c) {
+ auto i = c.begin();
+ if (i == c.end()) {
+ return;
+ }
+
+ while (1) {
+ auto cur = i;
+ i++;
+ if (i == c.end()) {
+ break;
+ }
+ test_data::compare cmp;
+ BOOST_REQUIRE(cmp(*cur, *i) < 0);
+ }
+}
+
+void compare_with_set(collection& c, oracle& s) {
+ /* All keys must be findable */
+ for (auto i = s.begin(); i != s.end(); i++) {
+ auto j = c.find(*i);
+ BOOST_REQUIRE(j != c.end() && *j == *i);
+ }
+
+    /* Both iterators must coincide */
+ auto i = c.begin();
+ auto j = s.begin();
+
+ while (i != c.end()) {
+ BOOST_REQUIRE(*i == *j);
+ i++;
+ j++;
+ }
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_via_emplace) {
+ collection c(compound_key::less_compare{}, test_data::compare{});
+ oracle s;
+ int nr = 0;
+
+ while (nr < 4000) {
+ compound_key k(tests::random::get_int<int>(900), tests::random::get_sstring(4));
+
+ collection::bound_hint h;
+ auto i = c.lower_bound(k, h);
+
+ if (i == c.end() || !h.match) {
+ auto it = c.emplace_before(i, k.key, h, k.key, k.sub_key);
+ BOOST_REQUIRE(*it == k);
+ s.insert(std::move(k));
+ nr++;
+ }
+ }
+
+ compare_with_set(c, s);
+ c.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_and_erase) {
+ collection c(compound_key::less_compare{}, test_data::compare{});
+ int nr = 0;
+
+ while (nr < 500) {
+ compound_key k(tests::random::get_int<int>(100), tests::random::get_sstring(3));
+
+ if (c.find(k) == c.end()) {
+ auto it = c.insert(k.key, std::move(test_data(k.key, k.sub_key)));
+ BOOST_REQUIRE(*it == k);
+ nr++;
+ }
+ }
+
+ validate_sorted(c);
+
+ while (nr > 0) {
+ int n = tests::random::get_int<int>() % nr;
+
+ auto i = c.begin();
+ while (n > 0) {
+ i++;
+ n--;
+ }
+
+ i.erase(compound_key::less_compare{});
+ nr--;
+
+ validate_sorted(c);
+ }
+}
+
+BOOST_AUTO_TEST_CASE(test_compaction) {
+ logalloc::region reg;
+ with_allocator(reg.allocator(), [&] {
+ collection c(compound_key::less_compare{}, test_data::compare{});
+ oracle s;
+
+ {
+ logalloc::reclaim_lock rl(reg);
+
+ int nr = 0;
+
+ while (nr < 1500) {
+ compound_key k(tests::random::get_int<int>(400), tests::random::get_sstring(3));
+
+ if (c.find(k) == c.end()) {
+ auto it = c.insert(k.key, std::move(test_data(k.key, k.sub_key)));
+ BOOST_REQUIRE(*it == k);
+ s.insert(std::move(k));
+ nr++;
+ }
+ }
+ }
+
+ reg.full_compaction();
+
+ compare_with_set(c, s);
+ c.clear();

Pavel Emelyanov <xemul@scylladb.com>
May 6, 2020, 2:21:59 PM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The row_cache::partitions_type is switched from boost::intrusive::set
to bplus::tree<Key = int64_t, T = array_trusted_bounds<cache_entry>>,
where the raw token is used to quickly locate the partition and the
internal array resolves hashing conflicts.
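
The two-level layout can be pictured with standard containers: an outer
ordered map keyed by the (possibly colliding) int64_t token, and a small
sorted inner sequence holding all entries that share that token. This is a
conceptual sketch only -- std::map/std::vector/std::string stand in for
bplus::tree, array_trusted_bounds and cache_entry, and the dd_* names are
made up for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Conceptual model only: outer ordered tree keyed by the raw token, inner
// sorted array resolving token collisions by the full key.
std::map<int64_t, std::vector<std::string>> cache;

void dd_insert(int64_t token, std::string key) {
    auto& b = cache[token];  // find-or-create the bucket for this token
    b.insert(std::upper_bound(b.begin(), b.end(), key), std::move(key));
}

bool dd_find(int64_t token, const std::string& key) {
    auto i = cache.find(token);
    if (i == cache.end()) {
        return false;  // no bucket -- no entry with this token at all
    }
    // Resolve the collision inside the bucket by the full key.
    return std::binary_search(i->second.begin(), i->second.end(), key);
}
```

Two partitions whose tokens collide end up in one bucket, kept sorted by
the full key, while distinct tokens get distinct buckets.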

Summary of changes in cache_entry:

- compare's operator() returns int instead of bool; this makes comparing
entries in the array simpler and sometimes takes fewer steps

- token_compare is added that compares raw tokens in a less-than manner

- on initialization the dummy entry is added with the "after_all_keys"
kind, not "before_all_keys" as it was by default. This keeps the tree
entries sorted by token

- insertion and removal of cache_entries happens inside double_decker;
most of the changes in row_cache.cc are about passing the constructor
args from current_allocator().construct() into double_decker::emplace_before()

- _flags is extended to keep the array head/tail bits. There's room
for them, so sizeof(cache_entry) remains unchanged

The rest fits smoothly into the double_decker API.

Also, as noted in the previous patch, insertion and removal _may_
invalidate iterators, or may leave them intact. However, currently
this doesn't seem to be a problem, as cache_tracker::insert() and
::on_partition_erase() invalidate iterators unconditionally.

Later this can be optimized: double_decker only invalidates iterators
in case of a hash conflict; otherwise it doesn't change the arrays,
and the B+ tree doesn't invalidate its iterators.

tests: unit(dev), perf(dev)

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
dht/token.hh | 11 +++
row_cache.hh | 62 +++++++++++-----
utils/double-decker.hh | 2 +
dht/token.cc | 22 +++---
row_cache.cc | 114 +++++++++++++----------------
test/perf/memory_footprint_test.cc | 3 +-
6 files changed, 120 insertions(+), 94 deletions(-)

diff --git a/dht/token.hh b/dht/token.hh
index 439b6b45e..43cd6cd43 100644
--- a/dht/token.hh
+++ b/dht/token.hh
@@ -160,10 +160,21 @@ class token {
return 0; // hardcoded for now; unlikely to change
}

+ int64_t raw() const {
+ if (is_minimum()) {
+ return std::numeric_limits<int64_t>::min();
+ }
+ if (is_maximum()) {
+ return std::numeric_limits<int64_t>::max();
+ }
+
+ return _data;
+ }
};

const token& minimum_token();
const token& maximum_token();
+int tri_compare_raw(const int64_t& t1, const int64_t& t2);
int tri_compare(const token& t1, const token& t2);
inline bool operator==(const token& t1, const token& t2) { return tri_compare(t1, t2) == 0; }
inline bool operator<(const token& t1, const token& t2) { return tri_compare(t1, t2) < 0; }
diff --git a/row_cache.hh b/row_cache.hh
index fecdeba44..d9ab4e512 100644
--- a/row_cache.hh
+++ b/row_cache.hh
@@ -40,6 +40,7 @@
#include <seastar/core/metrics_registration.hh>
#include "flat_mutation_reader.hh"
#include "mutation_cleaner.hh"
+#include "utils/double-decker.hh"

namespace bi = boost::intrusive;

@@ -61,11 +62,6 @@ class lsa_manager;
//
// TODO: Make memtables use this format too.
class cache_entry {
- // We need auto_unlink<> option on the _cache_link because when entry is
- // evicted from cache via LRU we don't have a reference to the container
- // and don't want to store it with each entry.
- using cache_link_type = bi::set_member_hook<bi::link_mode<bi::auto_unlink>>;
-
schema_ptr _schema;
dht::decorated_key _key;
partition_entry _pe;
@@ -73,8 +69,9 @@ class cache_entry {
struct {
bool _continuous : 1;
bool _dummy_entry : 1;
+ bool _head : 1;
+ bool _tail : 1;
} _flags{};
- cache_link_type _cache_link;
friend class size_calculator;

flat_mutation_reader do_read(row_cache&, cache::read_context& reader);
@@ -82,6 +79,11 @@ class cache_entry {
friend class row_cache;
friend class cache_tracker;

+ bool is_head() const { return _flags._head; }
+ void set_head(bool v) { _flags._head = v; }
+ bool is_tail() const { return _flags._tail; }
+ void set_tail(bool v) { _flags._tail = v; }
+
struct dummy_entry_tag{};
struct incomplete_tag{};
struct evictable_tag{};
@@ -90,6 +92,8 @@ class cache_entry {
: _key{dht::token(), partition_key::make_empty()}
{
_flags._dummy_entry = true;
+ _flags._head = false;
+ _flags._tail = false;
}

// Creates an entry which is fully discontinuous, except for the partition tombstone.
@@ -101,7 +105,10 @@ class cache_entry {
: _schema(std::move(s))
, _key(key)
, _pe(partition_entry::make_evictable(*_schema, mutation_partition(*_schema, p)))
- { }
+ {
+ _flags._head = false;
+ _flags._tail = false;
+ }

cache_entry(schema_ptr s, dht::decorated_key&& key, mutation_partition&& p)
: cache_entry(evictable_tag(), s, std::move(key),
@@ -114,9 +121,13 @@ class cache_entry {
: _schema(std::move(s))
, _key(std::move(key))
, _pe(std::move(pe))
- { }
+ {
+ _flags._head = false;
+ _flags._tail = false;
+ }

cache_entry(cache_entry&&) noexcept;
+
~cache_entry();

static cache_entry& container_of(partition_entry& pe) {
@@ -148,34 +159,48 @@ class cache_entry {

bool is_dummy_entry() const { return _flags._dummy_entry; }

+ struct token_compare {
+ bool operator()(const int64_t& k1, const int64_t& k2) const {
+ return dht::tri_compare_raw(k1, k2) < 0;
+ }
+
+ bool operator()(const dht::ring_position_view& k1, const int64_t& k2) const {
+ return dht::tri_compare_raw(k1.token().raw(), k2) < 0;
+ }
+
+ bool operator()(const int64_t& k1, const dht::ring_position_view& k2) const {
+ return dht::tri_compare_raw(k1, k2.token().raw()) < 0;
+ }
+ };
+
struct compare {
- dht::ring_position_less_comparator _c;
+ dht::ring_position_comparator _c;

compare(schema_ptr s)
: _c(*s)
{}

- bool operator()(const dht::decorated_key& k1, const cache_entry& k2) const {
+ int operator()(const dht::decorated_key& k1, const cache_entry& k2) const {
return _c(k1, k2.position());
}

- bool operator()(dht::ring_position_view k1, const cache_entry& k2) const {
+ int operator()(dht::ring_position_view k1, const cache_entry& k2) const {
return _c(k1, k2.position());
}

- bool operator()(const cache_entry& k1, const cache_entry& k2) const {
+ int operator()(const cache_entry& k1, const cache_entry& k2) const {
return _c(k1.position(), k2.position());
}

- bool operator()(const cache_entry& k1, const dht::decorated_key& k2) const {
+ int operator()(const cache_entry& k1, const dht::decorated_key& k2) const {
return _c(k1.position(), k2);
}

- bool operator()(const cache_entry& k1, dht::ring_position_view k2) const {
+ int operator()(const cache_entry& k1, dht::ring_position_view k2) const {
return _c(k1.position(), k2);
}

- bool operator()(dht::ring_position_view k1, dht::ring_position_view k2) const {
+ int operator()(dht::ring_position_view k1, dht::ring_position_view k2) const {
return _c(k1, k2);
}
};
@@ -315,10 +340,9 @@ void cache_tracker::insert(partition_entry& pe) noexcept {
class row_cache final {
public:
using phase_type = utils::phased_barrier::phase_type;
- using partitions_type = bi::set<cache_entry,
- bi::member_hook<cache_entry, cache_entry::cache_link_type, &cache_entry::_cache_link>,
- bi::constant_time_size<false>, // we need this to have bi::auto_unlink on hooks
- bi::compare<cache_entry::compare>>;
+ using partitions_type = double_decker<int64_t, cache_entry,
+ cache_entry::token_compare, cache_entry::compare,
+ 32, bplus::key_search::binary>;
friend class cache::autoupdating_underlying_reader;
friend class single_partition_populating_reader;
friend class cache_entry;
diff --git a/utils/double-decker.hh b/utils/double-decker.hh
index 4029b82ae..28324145b 100644
--- a/utils/double-decker.hh
+++ b/utils/double-decker.hh
@@ -44,10 +44,12 @@ template <typename Key, typename T, typename Less, typename Compare, int NodeSiz
bplus::key_search Search = bplus::key_search::binary, bplus::with_debug Debug = bplus::with_debug::no>
GCC6_CONCEPT( requires Comparable<T, T, Compare> && std::is_nothrow_move_constructible_v<T> )
class double_decker {
+public:
using inner_array = array_trusted_bounds<T>;
using outer_tree = bplus::tree<Key, inner_array, Less, NodeSize, Search, Debug>;
using outer_iterator = typename outer_tree::iterator;

+private:
outer_tree _tree;
Compare _cmp;

diff --git a/dht/token.cc b/dht/token.cc
index ee9443a2a..a3e10707b 100644
--- a/dht/token.cc
+++ b/dht/token.cc
@@ -33,11 +33,7 @@ namespace dht {
using uint128_t = unsigned __int128;

inline int64_t long_token(const token& t) {
- if (t.is_minimum() || t.is_maximum()) {
- return std::numeric_limits<int64_t>::min();
- }
-
- return t._data;
+ return t.raw();
}

static const token min_token{ token::kind::before_all_keys, 0 };
@@ -53,19 +49,21 @@ maximum_token() {
return max_token;
}

+int tri_compare_raw(const int64_t& l1, const int64_t& l2) {
+ if (l1 == l2) {
+ return 0;
+ } else {
+ return l1 < l2 ? -1 : 1;
+ }
+}
+
int tri_compare(const token& t1, const token& t2) {
if (t1._kind < t2._kind) {
return -1;
} else if (t1._kind > t2._kind) {
return 1;
} else if (t1._kind == token_kind::key) {
- auto l1 = long_token(t1);
- auto l2 = long_token(t2);
- if (l1 == l2) {
- return 0;
- } else {
- return l1 < l2 ? -1 : 1;
- }
+ return tri_compare_raw(long_token(t1), long_token(t2));
}
return 0;
}
diff --git a/row_cache.cc b/row_cache.cc
index 805a9ac02..f8548559f 100644
--- a/row_cache.cc
+++ b/row_cache.cc
@@ -284,12 +284,12 @@ class partition_range_cursor final {
// Strong exception guarantees.
bool advance_to(dht::ring_position_view pos) {
auto cmp = cache_entry::compare(_cache.get()._schema);
- if (cmp(_end_pos, pos)) { // next() may have moved _start_pos past the _end_pos.
+ if (cmp(_end_pos, pos) < 0) { // next() may have moved _start_pos past the _end_pos.
_end_pos = pos;
}
- _end = _cache.get()._partitions.lower_bound(_end_pos, cmp);
- _it = _cache.get()._partitions.lower_bound(pos, cmp);
- auto same = !cmp(pos, _it->position());
+ _end = _cache.get()._partitions.lower_bound(_end_pos);
+ _it = _cache.get()._partitions.lower_bound(pos);
+ auto same = !(cmp(pos, _it->position()) < 0);
set_position(*_it);
_last_reclaim_count = _cache.get().get_cache_tracker().allocator().invalidate_counter();
return same;
@@ -375,13 +375,13 @@ class single_partition_populating_reader final : public flat_mutation_reader::im
_cache._read_section(_cache._tracker.region(), [this] {
with_allocator(_cache._tracker.allocator(), [this] {
dht::decorated_key dk = _read_context->range().start()->value().as_decorated_key();
- _cache.do_find_or_create_entry(dk, nullptr, [&] (auto i) {
+ _cache.do_find_or_create_entry(dk, nullptr, [&] (auto i, const row_cache::partitions_type::bound_hint& hint) {
mutation_partition mp(_cache._schema);
- cache_entry* entry = current_allocator().construct<cache_entry>(
+ row_cache::partitions_type::iterator entry = _cache._partitions.emplace_before(i, dk.token().raw(), hint,
_cache._schema, std::move(dk), std::move(mp));
_cache._tracker.insert(*entry);
entry->set_continuous(i->continuous());
- return _cache._partitions.insert_before(i, *entry);
+ return entry;
}, [&] (auto i) {
_cache._tracker.on_miss_already_populated();
});
@@ -496,8 +496,7 @@ class range_populating_reader {
return;
}
if (!_reader.range().end() || !_reader.range().end()->is_inclusive()) {
- cache_entry::compare cmp(_cache._schema);
- auto it = _reader.range().end() ? _cache._partitions.find(_reader.range().end()->value(), cmp)
+ auto it = _reader.range().end() ? _cache._partitions.find(_reader.range().end()->value())
: std::prev(_cache._partitions.end());
if (it != _cache._partitions.end()) {
if (it == _cache._partitions.begin()) {
@@ -748,8 +747,8 @@ row_cache::make_reader(schema_ptr s,
return with_linearized_managed_bytes([&] {
cache_entry::compare cmp(_schema);
auto&& pos = ctx->range().start()->value();
- auto i = _partitions.lower_bound(pos, cmp);
- if (i != _partitions.end() && !cmp(pos, i->position())) {
+ auto i = _partitions.lower_bound(pos);
+ if (i != _partitions.end() && !(cmp(pos, i->position()) < 0)) {
cache_entry& e = *i;
upgrade_entry(e);
on_partition_hit();
@@ -780,12 +779,11 @@ row_cache::make_reader(schema_ptr s,

void row_cache::drain() {
with_allocator(_tracker.allocator(), [this] {
- _partitions.clear_and_dispose([this, deleter = current_deleter<cache_entry>()] (auto&& p) mutable {
- if (!p->is_dummy_entry()) {
+ _partitions.clear_and_dispose([this] (cache_entry& p) mutable {
+ if (!p.is_dummy_entry()) {
_tracker.on_partition_erase();
}
- p->evict(_tracker);
- deleter(p);
+ p.evict(_tracker);
});
});
}
@@ -809,9 +807,10 @@ cache_entry& row_cache::do_find_or_create_entry(const dht::decorated_key& key,
{
return with_allocator(_tracker.allocator(), [&] () -> cache_entry& {
return with_linearized_managed_bytes([&] () -> cache_entry& {
- auto i = _partitions.lower_bound(key, cache_entry::compare(_schema));
- if (i == _partitions.end() || !i->key().equal(*_schema, key)) {
- i = create_entry(i);
+ partitions_type::bound_hint hint;
+ auto i = _partitions.lower_bound(key, hint);
+ if (i == _partitions.end() || !hint.match) {
+ i = create_entry(i, hint);
} else {
visit_entry(i);
}
@@ -834,10 +833,11 @@ cache_entry& row_cache::do_find_or_create_entry(const dht::decorated_key& key,
}

cache_entry& row_cache::find_or_create(const dht::decorated_key& key, tombstone t, row_cache::phase_type phase, const previous_entry_pointer* previous) {
- return do_find_or_create_entry(key, previous, [&] (auto i) { // create
- auto entry = current_allocator().construct<cache_entry>(cache_entry::incomplete_tag{}, _schema, key, t);
+ return do_find_or_create_entry(key, previous, [&] (auto i, const partitions_type::bound_hint& hint) { // create
+ partitions_type::iterator entry = _partitions.emplace_before(i, key.token().raw(), hint,
+ cache_entry::incomplete_tag{}, _schema, key, t);
_tracker.insert(*entry);
- return _partitions.insert_before(i, *entry);
+ return entry;
}, [&] (auto i) { // visit
_tracker.on_miss_already_populated();
cache_entry& e = *i;
@@ -848,14 +848,13 @@ cache_entry& row_cache::find_or_create(const dht::decorated_key& key, tombstone

void row_cache::populate(const mutation& m, const previous_entry_pointer* previous) {
_populate_section(_tracker.region(), [&] {
- do_find_or_create_entry(m.decorated_key(), previous, [&] (auto i) {
- cache_entry* entry = current_allocator().construct<cache_entry>(
+ do_find_or_create_entry(m.decorated_key(), previous, [&] (auto i, const partitions_type::bound_hint& hint) {
+ partitions_type::iterator entry = _partitions.emplace_before(i, m.decorated_key().token().raw(), hint,
m.schema(), m.decorated_key(), m.partition());
_tracker.insert(*entry);
entry->set_continuous(i->continuous());
- i = _partitions.insert_before(i, *entry);
- upgrade_entry(*i);
- return i;
+ upgrade_entry(*entry);
+ return entry;
}, [&] (auto i) {
throw std::runtime_error(format("cache already contains entry for {}", m.key()));
});
@@ -939,7 +938,6 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
partition_presence_checker is_present = _prev_snapshot->make_partition_presence_checker();
while (!m.partitions.empty()) {
with_allocator(_tracker.allocator(), [&] () {
- auto cmp = cache_entry::compare(_schema);
{
size_t partition_count = 0;
{
@@ -954,8 +952,9 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
_update_section(_tracker.region(), [&] {
memtable_entry& mem_e = *m.partitions.begin();
size_entry = mem_e.size_in_allocator_without_rows(_tracker.allocator());
- auto cache_i = _partitions.lower_bound(mem_e.key(), cmp);
- update = updater(_update_section, cache_i, mem_e, is_present, real_dirty_acc);
+ partitions_type::bound_hint hint;
+ auto cache_i = _partitions.lower_bound(mem_e.key(), hint);
+ update = updater(_update_section, cache_i, mem_e, is_present, real_dirty_acc, hint);
});
}
// We use cooperative deferring instead of futures so that
@@ -1000,11 +999,11 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
future<> row_cache::update(external_updater eu, memtable& m) {
return do_update(std::move(eu), m, [this] (logalloc::allocating_section& alloc,
row_cache::partitions_type::iterator cache_i, memtable_entry& mem_e, partition_presence_checker& is_present,
- real_dirty_memory_accounter& acc) mutable {
+ real_dirty_memory_accounter& acc, const partitions_type::bound_hint& hint) mutable {
// If cache doesn't contain the entry we cannot insert it because the mutation may be incomplete.
// FIXME: keep a bitmap indicating which sstables we do cover, so we don't have to
// search it.
- if (cache_i != partitions_end() && cache_i->key().equal(*_schema, mem_e.key())) {
+ if (cache_i != partitions_end() && hint.match) {
cache_entry& entry = *cache_i;
upgrade_entry(entry);
assert(entry._schema == _schema);
@@ -1016,12 +1015,11 @@ future<> row_cache::update(external_updater eu, memtable& m) {
|| with_allocator(standard_allocator(), [&] { return is_present(mem_e.key()); })
== partition_presence_checker_result::definitely_doesnt_exist) {
// Partition is absent in underlying. First, insert a neutral partition entry.
- cache_entry* entry = current_allocator().construct<cache_entry>(cache_entry::evictable_tag(),
- _schema, dht::decorated_key(mem_e.key()),
+ partitions_type::iterator entry = _partitions.emplace_before(cache_i, mem_e.key().token().raw(), hint,
+ cache_entry::evictable_tag(), _schema, dht::decorated_key(mem_e.key()),
partition_entry::make_evictable(*_schema, mutation_partition(_schema)));
entry->set_continuous(cache_i->continuous());
_tracker.insert(*entry);
- _partitions.insert_before(cache_i, *entry);
mem_e.upgrade_schema(_schema, _tracker.memtable_cleaner());
return entry->partition().apply_to_incomplete(*_schema, std::move(mem_e.partition()), _tracker.memtable_cleaner(),
alloc, _tracker.region(), _tracker, _underlying_phase, acc);
@@ -1034,7 +1032,7 @@ future<> row_cache::update(external_updater eu, memtable& m) {
future<> row_cache::update_invalidating(external_updater eu, memtable& m) {
return do_update(std::move(eu), m, [this] (logalloc::allocating_section& alloc,
row_cache::partitions_type::iterator cache_i, memtable_entry& mem_e, partition_presence_checker& is_present,
- real_dirty_memory_accounter& acc)
+ real_dirty_memory_accounter& acc, const partitions_type::bound_hint&)
{
if (cache_i != partitions_end() && cache_i->key().equal(*_schema, mem_e.key())) {
// FIXME: Invalidate only affected row ranges.
@@ -1057,7 +1055,7 @@ void row_cache::refresh_snapshot() {
void row_cache::touch(const dht::decorated_key& dk) {
_read_section(_tracker.region(), [&] {
with_linearized_managed_bytes([&] {
- auto i = _partitions.find(dk, cache_entry::compare(_schema));
+ auto i = _partitions.find(dk);
if (i != _partitions.end()) {
for (partition_version& pv : i->partition().versions_from_oldest()) {
for (rows_entry& row : pv.partition().clustered_rows()) {
@@ -1072,7 +1070,7 @@ void row_cache::touch(const dht::decorated_key& dk) {
void row_cache::unlink_from_lru(const dht::decorated_key& dk) {
_read_section(_tracker.region(), [&] {
with_linearized_managed_bytes([&] {
- auto i = _partitions.find(dk, cache_entry::compare(_schema));
+ auto i = _partitions.find(dk);
if (i != _partitions.end()) {
for (partition_version& pv : i->partition().versions_from_oldest()) {
for (rows_entry& row : pv.partition().clustered_rows()) {
@@ -1085,15 +1083,14 @@ void row_cache::unlink_from_lru(const dht::decorated_key& dk) {
}

void row_cache::invalidate_locked(const dht::decorated_key& dk) {
- auto pos = _partitions.lower_bound(dk, cache_entry::compare(_schema));
+ partitions_type::iterator pos = _partitions.lower_bound(dk);
if (pos == partitions_end() || !pos->key().equal(*_schema, dk)) {
_tracker.clear_continuity(*pos);
} else {
- auto it = _partitions.erase_and_dispose(pos,
- [this, &dk, deleter = current_deleter<cache_entry>()](auto&& p) mutable {
+ auto it = pos.erase_and_dispose(cache_entry::token_compare{},
+ [this](cache_entry& p) mutable {
_tracker.on_partition_erase();
- p->evict(_tracker);
- deleter(p);
+ p.evict(_tracker);
});
_tracker.clear_continuity(*it);
}
@@ -1123,17 +1120,15 @@ future<> row_cache::invalidate(external_updater eu, dht::partition_range_vector&
while (true) {
auto done = with_linearized_managed_bytes([&] {
return _update_section(_tracker.region(), [&] {
- auto cmp = cache_entry::compare(_schema);
- auto it = _partitions.lower_bound(*_prev_snapshot_pos, cmp);
- auto end = _partitions.lower_bound(dht::ring_position_view::for_range_end(range), cmp);
+ auto it = _partitions.lower_bound(*_prev_snapshot_pos);
+ auto end = _partitions.lower_bound(dht::ring_position_view::for_range_end(range));
return with_allocator(_tracker.allocator(), [&] {
- auto deleter = current_deleter<cache_entry>();
while (it != end) {
- it = _partitions.erase_and_dispose(it, [&] (cache_entry* p) mutable {
- _tracker.on_partition_erase();
- p->evict(_tracker);
- deleter(p);
- });
+ it = it.erase_and_dispose(cache_entry::token_compare{},
+ [&] (cache_entry& p) mutable {
+ _tracker.on_partition_erase();
+ p.evict(_tracker);
+ });
// it != end is necessary for correctness. We cannot set _prev_snapshot_pos to end->position()
// because after resuming something may be inserted before "end" which falls into the next range.
if (need_preempt() && it != end) {
@@ -1169,8 +1164,9 @@ void row_cache::evict() {

void row_cache::init_empty(is_continuous cont) {
with_allocator(_tracker.allocator(), [this, cont] {
- cache_entry* entry = current_allocator().construct<cache_entry>(cache_entry::dummy_entry_tag());
- _partitions.insert_before(_partitions.end(), *entry);
+ partitions_type::iterator entry = _partitions.emplace_before(_partitions.end(),
+ std::numeric_limits<int64_t>::max(),
+ partitions_type::bound_hint(), cache_entry::dummy_entry_tag());
entry->set_continuous(bool(cont));
});
}
@@ -1178,7 +1174,7 @@ void row_cache::init_empty(is_continuous cont) {
row_cache::row_cache(schema_ptr s, snapshot_source src, cache_tracker& tracker, is_continuous cont)
: _tracker(tracker)
, _schema(std::move(s))
- , _partitions(cache_entry::compare(_schema))
+ , _partitions(cache_entry::token_compare{}, cache_entry::compare(_schema))
, _underlying(src())
, _snapshot_source(std::move(src))
{
@@ -1190,13 +1186,7 @@ cache_entry::cache_entry(cache_entry&& o) noexcept
, _key(std::move(o._key))
, _pe(std::move(o._pe))
, _flags(o._flags)
- , _cache_link()
{
- {
- using container_type = row_cache::partitions_type;
- container_type::node_algorithms::replace_node(o._cache_link.this_ptr(), _cache_link.this_ptr());
- container_type::node_algorithms::init(o._cache_link.this_ptr());
- }
}

cache_entry::~cache_entry() {
@@ -1211,11 +1201,11 @@ void row_cache::set_schema(schema_ptr new_schema) noexcept {
}

void cache_entry::on_evicted(cache_tracker& tracker) noexcept {
- auto it = row_cache::partitions_type::s_iterator_to(*this);
+ row_cache::partitions_type::iterator it(this);
std::next(it)->set_continuous(false);
evict(tracker);
- current_deleter<cache_entry>()(this);
tracker.on_partition_eviction();
+ it.erase(cache_entry::token_compare{});
}

void rows_entry::on_evicted(cache_tracker& tracker) noexcept {
diff --git a/test/perf/memory_footprint_test.cc b/test/perf/memory_footprint_test.cc
index 444c99615..32aca47f0 100644
--- a/test/perf/memory_footprint_test.cc
+++ b/test/perf/memory_footprint_test.cc
@@ -56,11 +56,12 @@ class size_calculator {
public:
static void print_cache_entry_size() {
std::cout << prefix() << "sizeof(cache_entry) = " << sizeof(cache_entry) << "\n";
+ std::cout << prefix() << "sizeof(bptree::node) = " << sizeof(row_cache::partitions_type::outer_tree::node) << "\n";
+ std::cout << prefix() << "sizeof(bptree::data) = " << sizeof(row_cache::partitions_type::outer_tree::data) << "\n";

{
nest n;
std::cout << prefix() << "sizeof(decorated_key) = " << sizeof(dht::decorated_key) << "\n";
- std::cout << prefix() << "sizeof(cache_link_type) = " << sizeof(cache_entry::cache_link_type) << "\n";
print_mutation_partition_size();
}

--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>
May 7, 2020, 9:21:17 AM5/7/20
to scylladb-dev
> * perf_simple_query:
>
> Same as memory footprint here -- the larger the nodes are, the better.
>
> nopatch median 95362.44 abs.dev.: 194.58
>
> 8, lin median 88081.63 -7.6% abs.dev.: 23.10
> 16, lin median 93586.65 -1.9% abs.dev.: 29.07
> 32, lin median 96735.37 +1.4% abs.dev.: 47.11
>
> 16, bin median 94094.68 -1.3% abs.dev.: 36.83
> 32, bin median 96706.78 +1.4% abs.dev.: 16.54
> 64, bin median 95558.57 +0.2% abs.dev.: 5.47

Surprisingly, the row_cache::_partitions implementation doesn't affect this
test :\ This is because all the partitions that get created are kept in memtables,
so the row_cache tree constantly sits with a single entry in it. Even when running
the test for much longer than the default 5-second duration, the numbers don't
differ much.

Once memtables are flushed, the row_cache gets filled with partitions, the new
collection finally comes into play, and the results look much better:


* 10k partitions: +1%..2%

no flush: 96k
with flush, no-patch: 88k

8 keys, linear/binary: 89k / -
16 keys, linear/binary: 89k / 91k
32 keys, linear/binary: 88k / 90k
64 keys, linear/binary: - / 90k

B+ makes it 1%-2% better.


* 1M partitions: +9%..12%

no flush: 87k
with flush, no-patch: 77k

8 keys, linear/binary: 85k / -
16 keys, linear/binary: 85k / 86k
32 keys, linear/binary: 84k / 86k
64 keys, linear/binary: - / 85k

B+ makes it 9%-12% better.


Another observation -- scaling from 10k to 1M results in
12% penalty on std::set and only 4% on B+.

> The next TODO according to the above results is:

Accordingly, one item is added here:

- Check if switching memtable partitions tree onto B+ makes things better (or worse)

-- Pavel

Avi Kivity

<avi@scylladb.com>
May 7, 2020, 9:29:34 AM5/7/20
to Pavel Emelyanov, scylladb-dev
I assumed you'd switch both. I think we'll see an improvement, even in
inserts, which is the critical load for memtables, even though we end up
doing some more work when inserting.


Memtables are also always scanned, when flushing the memtable to
sstable. That's faster too with btree.


Tomasz Grabiec

<tgrabiec@scylladb.com>
May 7, 2020, 9:59:24 AM5/7/20
to Pavel Emelyanov, scylladb-dev
Maybe if we randomized the LRU a bit so that LSA compaction is triggered and objects are moved around, it would show up.

There is also test_eviction in row_cache_test.cc which could be used as a microbenchmark.

Botond Dénes

<bdenes@scylladb.com>
May 11, 2020, 10:34:12 AM5/11/20
to Pavel Emelyanov, scylladb-dev@googlegroups.com
On Wed, 2020-05-06 at 21:21 +0300, Pavel Emelyanov wrote:
> A plain array of elements that grows and shrinks by
> constructing the new instance from an existing one and
> moving the elements from it.
>
> Behaves similarly to vector's external array, but has
> 0-bytes overhead. The array bounds (0-th and N-th
> elements) are determined by checking the flags on the
> elements themselves. For this the type must support
> getters and setters for the flags.


I believe such containers are called intrusive ones; maybe
intrusive_array would be a better name.
Out of curiosity: why is this better than a maybe_constructed* _data?

> +
> + int _number_of_elements() const {


size_t

We don't use leading _ for method names.

> + for (int i = 0; ; i++) {
> + if (_data[i].object.is_tail()) {
> + return i + 1;
> + }
> + }
> +
> + assert(false);


std::abort();
This calls for an assert.
Why aren't these implemented as mutating `emplace(iterator it,
Args&&... args)` and `erase(iterator it)` respectively?
size_t
Why implement this here? We have std::for_each().

Botond Dénes

<bdenes@scylladb.com>
May 11, 2020, 11:03:34 AM5/11/20
to Pavel Emelyanov, scylladb-dev@googlegroups.com
On Wed, 2020-05-06 at 21:21 +0300, Pavel Emelyanov wrote:
Not necessary, the _flags{} in the declaration takes care of this.

> }
>
> // Creates an entry which is fully discontinuous, except for the
> partition tombstone.
> @@ -101,7 +105,10 @@ class cache_entry {
> : _schema(std::move(s))
> , _key(key)
> , _pe(partition_entry::make_evictable(*_schema,
> mutation_partition(*_schema, p)))
> - { }
> + {
> + _flags._head = false;
> + _flags._tail = false;


Same here.

> + }
>
> cache_entry(schema_ptr s, dht::decorated_key&& key,
> mutation_partition&& p)
> : cache_entry(evictable_tag(), s, std::move(key),
> @@ -114,9 +121,13 @@ class cache_entry {
> : _schema(std::move(s))
> , _key(std::move(key))
> , _pe(std::move(pe))
> - { }
> + {
> + _flags._head = false;
> + _flags._tail = false;


Same here.

> + }
>
> cache_entry(cache_entry&&) noexcept;
> +
> ~cache_entry();
>
> static cache_entry& container_of(partition_entry& pe) {
> @@ -148,34 +159,48 @@ class cache_entry {
>
> bool is_dummy_entry() const { return _flags._dummy_entry; }
>
> + struct token_compare {
> + bool operator()(const int64_t& k1, const int64_t& k2) const
> {


Maybe it would be better to have dht::token::value_type, instead of
just passing int64_t around.

Pavel Emelyanov

<xemul@scylladb.com>
May 11, 2020, 1:45:49 PM5/11/20
to Botond Dénes, scylladb-dev


Mon, May 11, 2020, 17:34 Botond Dénes <bde...@scylladb.com>:

That would need memory to carry this very pointer on board. Right now the memory overhead is zero.
True, lost one in a rebase.
Because I cannot keep "this" at its memory location and will have to allocate new memory and return the new "this" back. This isn't the .emplace/.erase people are used to (methinks).
So as not to walk the array twice -- once for end(), the other for for_each itself.

Pavel Emelyanov

<xemul@scylladb.com>
May 11, 2020, 2:11:20 PM5/11/20
to Botond Dénes, scylladb-dev@googlegroups.com
The ability of C++ to drastically change things with a single character is amazing.
Looking at git blame makes me think it was patched to pass the very int64_t around...

-- Pavel

Botond Dénes

<bdenes@scylladb.com>
May 12, 2020, 4:22:59 AM5/12/20
to Pavel Emelyanov, scylladb-dev
C arrays are sneaky.
Right, I see. Can we also add a tag to the growing constructor, to make it more explicit?

Also, we could use a tagged integer (not sure we have a generic one), so the tag actually wraps the position we want to add/remove:

struct add_pos {
size_t pos;
};

struct remove_pos {
size_t pos;
};

array_trusted_bounds(array_trusted_bounds& from, add_pos pos);
array_trusted_bounds(array_trusted_bounds& from, remove_pos pos);

This is explicit and compact at the same time :).
Right, makes sense.
Maybe, if these are common, it would be worth having a static helper method for them. I haven't seen how the array is used in double_decker yet.

Botond Dénes

<bdenes@scylladb.com>
May 12, 2020, 5:09:37 AM5/12/20
to Pavel Emelyanov, scylladb-dev@googlegroups.com
Ok.

Pavel Emelyanov

<xemul@scylladb.com>
May 13, 2020, 12:46:04 PM5/13/20
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The data model is now

bplus::tree<Key = int64_t, T = array<cache_entry>>

The whole thing is encapsulated into a collection called "double_decker"
from patch #6. The array<T> is an array of T-s with 0-bytes overhead used
to resolve hash conflicts (patch #5).

Changes in v4:

- SIMD lookup
- switched memtable on B+
- double-decker doesn't carry comparator on board

This is because the comparator keeps a reference to the schema, and it's
not easy to prevent use-after-free on it. Since we have the schema
in all the calls anyway, we can just use it (as the current
code does)

- B+ find/lower_bound do not init empty tree

If a lookup happens without the proper with_allocator() call, the tree
root gets allocated and freed in different allocators

- Review comments

Changes in v3:

- replace managed_vector with array
- use int64_t as tree key instead of dht::token
- simple tune-up of insertion algo for better packing (15% less inner nodes)
- optimize bplus::tree::erase(iterator)

branch: https://github.com/xemul/scylla/commits/row-cache-over-bptree-4
tests:
unit(debug) for new collections, memtable and row_cache
unit(dev) for the rest
perf(dev) below

Preliminary testing results are quite promising:

* memory_footprint:
sizeof(old cache_entry) = 104
sizeof(new cache_entry) = 72

sizeof(old memtable_entry) = 96
sizeof(new memtable_entry) = 72

sizeof(bptree::data) = 80 (1 pointer overhead)
sizeof(bptree::node<16 keys>) = 288

in cache:
nopatch: 98400000
8 keys: 98960202 (+0.5%, 5.6 b/part)
16 keys: 98438582 (+0.03%, 0.4 b/part)
32 keys: 98198074 (-0.2%, 2.0 b/part)
64 keys: 98061524 (-0.3%, 3.4 b/part)

in memtable:
nopatch: 74200000
16 keys: 75028384 (+1.1%, 8.2 b/part)
32 keys: 74794752 (+0.8%, 5.9 b.part)

With memtables the situation is worse, because the memory saving from
dropping the boost set link from the entry is not that big. But
there's a chance to improve it with better packing (in TODO)

* row_cache_update (didn't re-measure one with SIMD):

3 sub-tests:
small_partitions
partition_with_few_small_rows
partition_with_lots_of_small_rows

numbers are update:/invalidation: times [ms]

nopatch 1449.702 / 382.161 1369.786 / 346.426 478.766 / 0.417

8, lin 993.980 / 312.738 1117.000 / 264.217 446.281 / 0.419
16, lin 996.407 / 334.595 1108.257 / 257.385 438.133 / 0.426
32, lin 1088.253 / 316.095 1168.123 / 258.666 457.052 / 0.422

16, bin 1399.507 / 423.926 1225.133 / 347.217 448.318 / 0.431
32, bin 972.932 / 322.350 1103.503 / 256.497 449.912 / 0.419
64, bin 1136.614 / 336.679 1119.315 / 264.116 456.312 / 0.433

The invalidation turned out _not_ to benefit from the improved O(logN)
eviction, as it effectively does a range erase with the tree at hand;
std::set and B+ are both O(1) here. The eviction micro-benchmark was
taken from boost/row_cache_test (spoiler: the winner)

* eviction test from boost/row_cache_test

eviction of 1M keys:
nopatch: 6.1 sec , 3 reactor stall warnings
patch: 4.2 sec , 2 reactor stall warnings sometimes =)

* perf_simple_query:

1M partitions:

with flush:
nopatch: 77k
16 keys linear: 84k
16 keys simd: 87k

no flush:
nopatch: 87k
16 keys linear: 92k
16 keys simd: 95k
32 keys linear: 92k
32 keys simd: 93k

The next TODO:

- More sophisticated insert/remove algos to produce better packing
Micro-benchmarks of B+ vs std::set show we can sacrifice more cycles
for it and still work faster

- Merging nodes together during compaction

Currently a 4-key tree with 5M randomly generated keys results in

inner nodes: 494527 (10% from 5M)
with 2 keys: 169853 (34%)
with 3 keys: 126217 (25%)
with 4 keys: 198457 (40%)
leaves: 1512186 (30% from 5M)
with 2 keys: 285564 (18%)
with 3 keys: 477616 (31%)
with 4 keys: 749006 (49%)

- Unit test for hash conflicts

- The number of various comparators exceeds expectations. It's worth
cleaning this part up some day

Pavel Emelyanov (9):
row_cache: Simplify clean_now()
memtable: Count partitions separately
test: Move perf measurement helpers into header
core: B+ tree implementation
utils: Array with trusted bounds
double-decker: A combination of B+tree with array
bptree: AVX linear searcher for int64_t keys
row_cache: Switch partition tree onto B+ rails
memtable: Switch onto B+ rails

configure.py | 10 +
dht/i_partitioner.hh | 10 +
dht/token.hh | 11 +
memtable.hh | 58 +-
row_cache.hh | 55 +-
test/perf/perf.hh | 71 +
test/unit/bptree_key.hh | 101 ++
test/unit/bptree_validation.hh | 318 ++++
utils/array-search.hh | 96 ++
utils/array_trusted_bounds.hh | 259 +++
utils/bptree.hh | 1943 +++++++++++++++++++++++
utils/double-decker.hh | 328 ++++
dht/token.cc | 22 +-
memtable.cc | 78 +-
row_cache.cc | 135 +-
test/boost/array_trusted_bounds_test.cc | 189 +++
test/boost/bptree_test.cc | 399 +++++
test/boost/double_decker_test.cc | 342 ++++
test/perf/memory_footprint_test.cc | 4 +-
test/perf/perf_bptree.cc | 246 +++
test/perf/perf_row_cache_update.cc | 71 -
test/unit/bptree_compaction_test.cc | 210 +++
test/unit/bptree_stress_test.cc | 236 +++
23 files changed, 4972 insertions(+), 220 deletions(-)
create mode 100644 test/unit/bptree_key.hh
create mode 100644 test/unit/bptree_validation.hh
create mode 100644 utils/array-search.hh
create mode 100644 utils/array_trusted_bounds.hh
create mode 100644 utils/bptree.hh
create mode 100644 utils/double-decker.hh
create mode 100644 test/boost/array_trusted_bounds_test.cc
create mode 100644 test/boost/bptree_test.cc
create mode 100644 test/boost/double_decker_test.cc
create mode 100644 test/perf/perf_bptree.cc
create mode 100644 test/unit/bptree_compaction_test.cc
create mode 100644 test/unit/bptree_stress_test.cc

--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>
May 13, 2020, 12:46:05 PM5/13/20
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The clean_now() method wants to remove everything but the very last element
from the set. Reworking this into "clean everything, then put the
trailing entry back" will greatly help the B+-tree implementation, as
the latter will have a linear clear operation vs a logarithmic range erase.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
row_cache.hh | 4 ++++
row_cache.cc | 31 ++++++++++++++++---------------
2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/row_cache.hh b/row_cache.hh
index 3dd90fac4..fecdeba44 100644
--- a/row_cache.hh
+++ b/row_cache.hh
@@ -478,6 +478,10 @@ class row_cache final {
//
// internal_updater is only kept alive until its invocation returns.
future<> do_update(external_updater eu, internal_updater iu) noexcept;
+
+ void init_empty(is_continuous cont);
+ void drain();
+
public:
~row_cache();
row_cache(schema_ptr, snapshot_source, cache_tracker&, is_continuous = is_continuous::no);
diff --git a/row_cache.cc b/row_cache.cc
index 0c8c96d56..805a9ac02 100644
--- a/row_cache.cc
+++ b/row_cache.cc
@@ -778,8 +778,7 @@ row_cache::make_reader(schema_ptr s,
}
}

-
-row_cache::~row_cache() {
+void row_cache::drain() {
with_allocator(_tracker.allocator(), [this] {
_partitions.clear_and_dispose([this, deleter = current_deleter<cache_entry>()] (auto&& p) mutable {
if (!p->is_dummy_entry()) {
@@ -791,15 +790,13 @@ row_cache::~row_cache() {
});
}

+row_cache::~row_cache() {
+ drain();
+}
+
void row_cache::clear_now() noexcept {
- with_allocator(_tracker.allocator(), [this] {
- auto it = _partitions.erase_and_dispose(_partitions.begin(), partitions_end(), [this, deleter = current_deleter<cache_entry>()] (auto&& p) mutable {
- _tracker.on_partition_erase();
- p->evict(_tracker);
- deleter(p);
- });
- _tracker.clear_continuity(*it);
- });
+ drain();
+ init_empty(is_continuous::no);
}

template<typename CreateEntry, typename VisitEntry>
@@ -1170,6 +1167,14 @@ void row_cache::evict() {
while (_tracker.region().evict_some() == memory::reclaiming_result::reclaimed_something) {}
}

+void row_cache::init_empty(is_continuous cont) {
+ with_allocator(_tracker.allocator(), [this, cont] {
+ cache_entry* entry = current_allocator().construct<cache_entry>(cache_entry::dummy_entry_tag());
+ _partitions.insert_before(_partitions.end(), *entry);
+ entry->set_continuous(bool(cont));
+ });
+}
+
row_cache::row_cache(schema_ptr s, snapshot_source src, cache_tracker& tracker, is_continuous cont)
: _tracker(tracker)
, _schema(std::move(s))
@@ -1177,11 +1182,7 @@ row_cache::row_cache(schema_ptr s, snapshot_source src, cache_tracker& tracker,
, _underlying(src())
, _snapshot_source(std::move(src))
{

Pavel Emelyanov

<xemul@scylladb.com>
May 13, 2020, 12:46:06 PM5/13/20
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The B+ tree will not have a constant-time .size() call, so maintain the count by hand

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
memtable.hh | 4 +++-
memtable.cc | 13 ++++++++-----
row_cache.cc | 6 +++---
3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/memtable.hh b/memtable.hh
index e17108913..cbff1ecc6 100644
--- a/memtable.hh
+++ b/memtable.hh
@@ -137,6 +137,7 @@ class memtable final : public enable_lw_shared_from_this<memtable>, private loga
logalloc::allocating_section _read_section;
logalloc::allocating_section _allocating_section;
partitions_type partitions;
+ size_t nr_partitions = 0;
db::replay_position _replay_position;
db::rp_set _rp_set;
// mutation source to which reads fall-back after mark_flushed()
@@ -203,6 +204,7 @@ class memtable final : public enable_lw_shared_from_this<memtable>, private loga
void apply(const mutation& m, db::rp_handle&& = {});
// The mutation is upgraded to current schema.
void apply(const frozen_mutation& m, const schema_ptr& m_schema, db::rp_handle&& = {});
+ void evict_entry(memtable_entry& e, mutation_cleaner& cleaner);

static memtable& from_region(logalloc::region& r) {
return static_cast<memtable&>(r);
@@ -236,7 +238,7 @@ class memtable final : public enable_lw_shared_from_this<memtable>, private loga
return _memtable_list;
}

- size_t partition_count() const;
+ size_t partition_count() const { return nr_partitions; }
logalloc::occupancy_stats occupancy() const;

// Creates a reader of data in this memtable for given partition range.
diff --git a/memtable.cc b/memtable.cc
index 573074d1f..a5a02c164 100644
--- a/memtable.cc
+++ b/memtable.cc
@@ -137,11 +137,16 @@ uint64_t memtable::dirty_size() const {
return occupancy().total_space();
}

+void memtable::evict_entry(memtable_entry& e, mutation_cleaner& cleaner) {
+ e.partition().evict(cleaner);
+ nr_partitions--;
+}
+
void memtable::clear() noexcept {
auto dirty_before = dirty_size();
with_allocator(allocator(), [this] {
partitions.clear_and_dispose([this] (memtable_entry* e) {
- e->partition().evict(_cleaner);
+ evict_entry(*e, _cleaner);
current_deleter<memtable_entry>()(e);
});
});
@@ -154,6 +159,7 @@ future<> memtable::clear_gently() noexcept {
auto& alloc = allocator();

auto p = std::move(partitions);
+ nr_partitions = 0;
while (!p.empty()) {
auto dirty_before = dirty_size();
with_allocator(alloc, [&] () noexcept {
@@ -210,6 +216,7 @@ memtable::find_or_create_partition(const dht::decorated_key& key) {
memtable_entry* entry = current_allocator().construct<memtable_entry>(
_schema, dht::decorated_key(key), mutation_partition(_schema));
partitions.insert_before(i, *entry);
+ ++nr_partitions;
++_table_stats.memtable_partition_insertions;
return entry->partition();
} else {
@@ -753,10 +760,6 @@ mutation_source memtable::as_data_source() {
});
}

-size_t memtable::partition_count() const {
- return partitions.size();
-}
-
memtable_entry::memtable_entry(memtable_entry&& o) noexcept
: _link()
, _schema(std::move(o._schema))
diff --git a/row_cache.cc b/row_cache.cc
index 805a9ac02..85d9a2f2c 100644
--- a/row_cache.cc
+++ b/row_cache.cc
@@ -887,14 +887,14 @@ void row_cache::invalidate_sync(memtable& m) noexcept {
bool blow_cache = false;
// Note: clear_and_dispose() ought not to look up any keys, so it doesn't require
// with_linearized_managed_bytes(), but invalidate() does.
- m.partitions.clear_and_dispose([this, deleter = current_deleter<memtable_entry>(), &blow_cache] (memtable_entry* entry) {
+ m.partitions.clear_and_dispose([this, &m, deleter = current_deleter<memtable_entry>(), &blow_cache] (memtable_entry* entry) {
with_linearized_managed_bytes([&] {
try {
invalidate_locked(entry->key());
} catch (...) {
blow_cache = true;
}
- entry->partition().evict(_tracker.memtable_cleaner());
+ m.evict_entry(*entry, _tracker.memtable_cleaner());
deleter(entry);
});
});
@@ -970,7 +970,7 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
auto i = m.partitions.begin();
memtable_entry& mem_e = *i;
m.partitions.erase(i);
- mem_e.partition().evict(_tracker.memtable_cleaner());
+ m.evict_entry(mem_e, _tracker.memtable_cleaner());
current_allocator().destroy(&mem_e);
});
++partition_count;
--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>
May 13, 2020, 12:46:08 PM5/13/20
to scylladb-dev@googlegroups.com, Pavel Emelyanov
Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
+}
+
+std::ostream& operator<<(std::ostream& out, const scheduling_latency_measurer& slm) {
+ auto to_ms = [] (int64_t nanos) {
+ return float(nanos) / 1e6;
+ };
+ return out << sprint("{count: %d, "
+ //"min: %.6f [ms], "
+ //"50%%: %.6f [ms], "
+ //"90%%: %.6f [ms], "
+ "99%%: %.6f [ms], "
+ "max: %.6f [ms]}",
+ slm.histogram().count(),
+ //to_ms(slm.min().count()),
+ //to_ms(slm.histogram().percentile(0.5)),
+ //to_ms(slm.histogram().percentile(0.9)),
+ to_ms(slm.histogram().percentile(0.99)),
+ to_ms(slm.max().count()));
+}
diff --git a/test/perf/perf_row_cache_update.cc b/test/perf/perf_row_cache_update.cc
index 181ce6730..2a3cbde65 100644
--- a/test/perf/perf_row_cache_update.cc
+++ b/test/perf/perf_row_cache_update.cc
@@ -19,16 +19,13 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
- return _hist;
- }

Pavel Emelyanov

<xemul@scylladb.com>
May 13, 2020, 12:46:11 PM5/13/20
to scylladb-dev@googlegroups.com, Pavel Emelyanov
A plain array of elements that grows and shrinks by
constructing a new instance from an existing one and
moving the elements over.

Behaves similarly to vector's external array, but has
0-bytes overhead. The array bounds (the 0-th and N-th
elements) are determined by checking the flags on the
elements themselves. For this the type must support
getters and setters for the flags.

Also comes with a lower_bound() helper that helps keep
the elements sorted, and a from_element() one that
returns a reference back to the array in which the
element sits.

Changes in v4:

- Smoother API for grow/shrink
- The upper_bound() helper

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
configure.py | 1 +
utils/array_trusted_bounds.hh | 259 ++++++++++++++++++++++++
test/boost/array_trusted_bounds_test.cc | 189 +++++++++++++++++
3 files changed, 449 insertions(+)
create mode 100644 utils/array_trusted_bounds.hh
create mode 100644 test/boost/array_trusted_bounds_test.cc

diff --git a/configure.py b/configure.py
index 262597d60..393426fc0 100755
--- a/configure.py
+++ b/configure.py
@@ -328,6 +328,7 @@ scylla_tests = set([
'test/boost/log_heap_test',
'test/boost/logalloc_test',
'test/boost/managed_vector_test',
+ 'test/boost/array_trusted_bounds_test',
'test/boost/map_difference_test',
'test/boost/memtable_test',
'test/boost/meta_test',
diff --git a/utils/array_trusted_bounds.hh b/utils/array_trusted_bounds.hh
new file mode 100644
index 000000000..c7c8a11a5
--- /dev/null
+++ b/utils/array_trusted_bounds.hh
@@ -0,0 +1,259 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+GCC6_CONCEPT( requires BoundsKeeper<T> && std::is_nothrow_move_constructible_v<T> )
+class array_trusted_bounds {
+ union maybe_constructed {
+ maybe_constructed() { }
+ ~maybe_constructed() { }
+ T object;
+ };
+
+ maybe_constructed _data[1];
+
+ size_t number_of_elements() const {
+ for (int i = 0; ; i++) {
+ if (_data[i].object.is_tail()) {
+ return i + 1;
+ }
+ }
+
+ std::abort();
+ struct grow_tag {
+ int add_pos;
+ };
+
+ template <typename... Args>
+ array_trusted_bounds(array_trusted_bounds& from, grow_tag grow, Args&&... args) {
+ // The add_pos is strongly _expected_ to be within bounds
+ int i, off = 0;
+ bool tail = false;
+
+ for (i = 0; !tail; i++) {
+ if (i == grow.add_pos) {
+ off = 1;
+ continue;
+ }
+
+ tail = from._data[i - off].object.is_tail();
+ new (&_data[i].object) T(std::move(from._data[i - off].object));
+ }
+
+ assert(grow.add_pos <= i);
+
+ new (&_data[grow.add_pos].object) T(std::forward<Args>(args)...);
+
+ _data[0].object.set_head(true);
+ if (grow.add_pos == 0) {
+ _data[1].object.set_head(false);
+ }
+ _data[i - off].object.set_tail(true);
+ if (off == 0) {
+ _data[i - 1].object.set_tail(false);
+ }
+ }
+
+ // Shrinking
+ struct shrink_tag {
+ int del_pos;
+ };
+
+ array_trusted_bounds(array_trusted_bounds& from, shrink_tag shrink) {
+ int i, off = 0;
+ bool tail = false;
+
+ for (i = 0; !tail; i++) {
+ tail = from._data[i].object.is_tail();
+ if (i == shrink.del_pos) {
+ off = 1;
+ } else {
+ new (&_data[i - off].object) T(std::move(from._data[i].object));
+ }
+ }
+
+ _data[0].object.set_head(true);
+ _data[i - off - 1].object.set_tail(true);
+ }
+ iterator end() noexcept { return &_data[number_of_elements()].object; }
+
+ size_t index_of(iterator i) { return i - &_data[0].object; }
+ bool is_single_element() const { return _data[0].object.is_tail(); }
+
+ // A helper for keeping the array sorted
+ template <typename K, typename Compare>
+ GCC6_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator lower_bound(const K& val, Compare cmp, bool& match) {
+ int i = 0;
+
+ do {
+ int x = cmp(_data[i].object, val);
+ if (x >= 0) {
+ match = (x == 0);
+ break;
+ }
+ } while (!_data[i++].object.is_tail());
+
+ return &_data[i].object;
+ }
+
+ template <typename K, typename Compare>
+ GCC6_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator lower_bound(const K& val, Compare cmp) {
+ bool match = false;
+ return lower_bound(val, cmp, match);
+ }
+
+ // And its peer ... just to be used
+ template <typename K, typename Compare>
+ GCC6_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator upper_bound(const K& val, Compare cmp) {
+ int i = 0;
+
+ do {
+ if (cmp(_data[i].object, val) > 0) {
+ break;
+ }
+ } while (!_data[i++].object.is_tail());
+
+ return &_data[i].object;
+ }
+
+ template <typename Func>
+ GCC6_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ void for_each(Func&& fn) {
+ bool tail = false;
+
+ for (int i = 0; !tail; i++) {
+ tail = _data[i].object.is_tail();
+ fn(_data[i].object);
+ }
+ }
+
+ size_t storage_size() const { return number_of_elements() * sizeof(T); }
+ size_t size() { return number_of_elements(); }
+
+ friend size_t size_for_allocation_strategy(const array_trusted_bounds& obj) {
+ return obj.storage_size();
+ }
+
+ static array_trusted_bounds& from_element(T* ptr, int& idx) {
+ while (!ptr->is_head()) {
+ idx++;
+ ptr--;
+ }
+
+ static_assert(offsetof(array_trusted_bounds, _data[0].object) == 0);
+ return *reinterpret_cast<array_trusted_bounds*>(ptr);
+ }
+};
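
The head/tail flag scheme above can be modeled in isolation. This is a minimal standalone sketch (names like `flagged_int` and `lb` are illustrative, not the actual array_trusted_bounds API): instead of storing a length, the first element carries a head flag and the last a tail flag, so linear scans stop at the tail with zero per-array overhead.

```cpp
#include <cassert>

// Model of the zero-overhead bounds trick: no stored length, the first
// element is flagged "head" and the last "tail", so a scan stops at the
// tail and index recovery can walk back to the head.
struct flagged_int {
    int value;
    bool head = false;
    bool tail = false;
};

// Linear lower_bound over a flag-bounded run, mirroring the patch's loop:
// returns the first element not less than val, or one past the tail.
flagged_int* lb(flagged_int* a, int val, bool& match) {
    int i = 0;
    do {
        int x = a[i].value - val;
        if (x >= 0) {
            match = (x == 0);
            break;
        }
    } while (!a[i++].tail);
    return a + i;
}
```

As in the patch, a search key greater than the tail element yields a pointer one past the run, which the layer above interprets as "move on to the next bucket".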
diff --git a/test/boost/array_trusted_bounds_test.cc b/test/boost/array_trusted_bounds_test.cc
new file mode 100644
index 000000000..2b6a5a528
--- /dev/null
+++ b/test/boost/array_trusted_bounds_test.cc
@@ -0,0 +1,189 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ bool is_head() const { return _head; }
+ void set_head(bool v) { _head = v; }
+ bool is_tail() const { return _tail; }
+ return new (ptr) test_array(from, test_array::grow_tag{npos}, ndat);
+}
+
+test_array* shrink(test_array& from, int nsize, int spos) {
+    auto ptr = current_allocator().alloc(&get_standard_migrator<test_array>(), sizeof(element) * nsize, alignof(test_array));
+ return new (ptr) test_array(from, test_array::shrink_tag{spos});
+}

Pavel Emelyanov <xemul@scylladb.com>
May 13, 2020, 12:46:12 PM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The collection is a K:V store

bplus::tree<Key = K, Value = array_trusted_bounds<V>>

It will be used as partitions cache. The outer tree is used to
quickly map token to cache_entry, the inner array -- to resolve
(expected to be rare) hash collisions.

It must also be equipped with two comparators -- a less-only one for
keys and a full (three-way) one for values. The latter is not kept
on-board, but is required on all calls.
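
The bucket semantics can be sketched with standard containers. This is a conceptual model only (the names are made up; the real collection is flat, intrusive and LSA-friendly): an ordered outer map from key to a small sorted inner array that absorbs collisions linearly.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Conceptual model of the double_decker: an ordered outer structure keyed
// by K, with a sorted inner array of V resolving (rare) key collisions.
struct double_decker_model {
    std::map<long, std::vector<std::string>> buckets;

    // Values inside a bucket stay sorted on insertion; collisions are
    // rare, so linear work inside a bucket is acceptable.
    void insert(long key, std::string value) {
        auto& b = buckets[key];
        b.insert(std::upper_bound(b.begin(), b.end(), value), std::move(value));
    }

    bool contains(long key, const std::string& value) const {
        auto i = buckets.find(key);
        return i != buckets.end() &&
               std::binary_search(i->second.begin(), i->second.end(), value);
    }
};
```

Iterating the outer map and each inner vector in order yields the pairs sorted by key first, then value -- the ordering the real collection maintains.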
Changes in v4:

- The comparator is not kept on-board
- Helper to check if the iterators got invalidated by .emplace()

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
configure.py | 1 +
utils/double-decker.hh | 308 ++++++++++++++++++++++++++++
test/boost/double_decker_test.cc | 342 +++++++++++++++++++++++++++++++
3 files changed, 651 insertions(+)
create mode 100644 utils/double-decker.hh
create mode 100644 test/boost/double_decker_test.cc

diff --git a/configure.py b/configure.py
index 393426fc0..593f982eb 100755
--- a/configure.py
+++ b/configure.py
@@ -383,6 +383,7 @@ scylla_tests = set([
'test/boost/vint_serialization_test',
'test/boost/virtual_reader_test',
'test/boost/bptree_test',
+ 'test/boost/double_decker_test',
'test/manual/ec2_snitch_test',
'test/manual/gce_snitch_test',
'test/manual/gossip',
diff --git a/utils/double-decker.hh b/utils/double-decker.hh
new file mode 100644
index 000000000..d797124a7
--- /dev/null
+++ b/utils/double-decker.hh
@@ -0,0 +1,308 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+#include "utils/bptree.hh"
+#include "utils/array_trusted_bounds.hh"
+#include <fmt/core.h>
+
+/*
+ * The double-decker is an ordered container of key:value pairs,
+ * sorted by both key and value (key first).
+ *
+ * Key collisions are expected to be rare enough that keeping the
+ * colliding values in a sorted array with linear algorithms is cheap.
+ */
+
+GCC6_CONCEPT(
+ template <typename T1, typename T2, typename Compare>
+ concept bool Comparable = requires (const T1& a, const T2& b, Compare cmp) {
+ { cmp(a, b) } -> int;
+ };
+)
+
+template <typename Key, typename T, typename Less, typename Compare, int NodeSize,
+ bplus::key_search Search = bplus::key_search::binary, bplus::with_debug Debug = bplus::with_debug::no>
+GCC6_CONCEPT( requires Comparable<T, T, Compare> && std::is_nothrow_move_constructible_v<T> )
+class double_decker {
+ using inner_array = array_trusted_bounds<T>;
+ using outer_tree = bplus::tree<Key, inner_array, Less, NodeSize, Search, Debug>;
+ using outer_iterator = typename outer_tree::iterator;
+
+ outer_tree _tree;
+
+
+ template <typename Func>
+ GCC6_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ iterator erase_and_dispose(Less less, Func&& disp) {
+ disp(**this);
+
+ if (_bucket->is_single_element()) {
+ outer_iterator bkt = _bucket.erase(less);
+ return iterator(bkt, 0);
+ }
+
+ bool tail = (*_bucket)[_idx].is_tail();
+ _bucket.reconstruct(_bucket.storage_size() - sizeof(T), *_bucket, typename inner_array::shrink_tag{_idx});
+
+ if (tail) {
+ _bucket++;
+ _idx = 0;
+ }
+
+ return *this;
+ }
+
+ iterator erase(Less less) { return erase_and_dispose(less, bplus::default_dispose<T>); }
+ };
+
+ /*
+ * Structure that sheds some more light on how lower_bound
+ * actually found the bounding element.
+ */
+ struct bound_hint {
+ /*
+ * Set to true if the element fully matched to the key
+ * according to Compare
+ */
+ bool match;
+ /*
+ * Set to true if the bucket for the given key exists
+ */
+ bool key_match;
+ /*
+ * Set to true if the given key is greater than anything
+ * in the bucket and the iterator was switched to the next
+ * one (or when key_match is false)
+ */
+ bool key_tail;
+ };
+
+ iterator begin() { return iterator(_tree.begin(), 0); }
+ iterator end() { return iterator(_tree.end(), 0); }
+
+ double_decker(Less less) : _tree(less) { }
+
+ double_decker(const double_decker& other) = delete;
+ double_decker(double_decker&& other) : _tree(std::move(other._tree)) {}
+
+ iterator insert(Key k, T value, Compare cmp) {
+ std::pair<outer_iterator, bool> oip = _tree.emplace(std::move(k), std::move(value));
+ outer_iterator& bkt = oip.first;
+ int idx = 0;
+
+ if (!oip.second) {
+ /*
+ * Unlikely, but in this case we reconstruct the array. The value
+ * must not have been moved by emplace() above.
+ */
+ idx = bkt->index_of(bkt->lower_bound(value, cmp));
+ bkt.reconstruct(bkt.storage_size() + sizeof(T), *bkt,
+ typename inner_array::grow_tag{idx}, std::move(value));
+ }
+
+ return iterator(bkt, idx);
+ }
+
+ template <typename... Args>
+ iterator emplace_before(iterator i, Key k, const bound_hint& hint, Args&&... args) {
+ assert(!hint.match);
+ outer_iterator& bucket = i._bucket;
+
+ if (!hint.key_match) {
+ /*
+ * The most common case -- no key conflict, so the bucket was
+ * not found and i points to the next one. Just go ahead and
+ * emplace the new bucket before i and push the 0th element
+ * into it.
+ */
+ outer_iterator nb = bucket.emplace_before(std::move(k), _tree.less(), std::forward<Args>(args)...);
+ return iterator(nb, 0);
+ }
+
+ /*
+ * Key conflict: we need to expand the inner array, but there are
+ * still two cases -- either the bounding element is in k's bucket,
+ * or the bound search overflowed and switched to the next one.
+ */
+
+ int idx = i._idx;
+
+ if (hint.key_tail) {
+ /*
+ * The latter case -- i points to the next one. Need to shift
+ * back and append the new element to its tail.
+ */
+ bucket--;
+ idx = bucket->index_of(bucket->end());
+ }
+
+ bucket.reconstruct(bucket.storage_size() + sizeof(T), *bucket,
+ typename inner_array::grow_tag{idx}, std::forward<Args>(args)...);
+ return iterator(bucket, idx);
+ }
+
+ template <typename K = Key>
+ GCC6_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator find(const K& key, Compare cmp) {
+ outer_iterator bkt = _tree.find(key);
+ int idx = 0;
+
+ if (bkt != _tree.end()) {
+ bool match = false;
+ idx = bkt->index_of(bkt->lower_bound(key, cmp, match));
+ if (!match) {
+ bkt = _tree.end();
+ idx = 0;
+ }
+ }
+
+ return iterator(bkt, idx);
+ }
+
+ template <typename K = Key>
+ GCC6_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator lower_bound(const K& key, Compare cmp, bound_hint& hint) {
+ outer_iterator bkt = _tree.lower_bound(key, hint.key_match);
+
+ hint.key_tail = false;
+ hint.match = false;
+
+ if (bkt == _tree.end() || !hint.key_match) {
+ return iterator(bkt, 0);
+ }
+
+ int i = bkt->index_of(bkt->lower_bound(key, cmp, hint.match));
+
+ if (i != 0 && (*bkt)[i - 1].is_tail()) {
+ /*
+ * The lower_bound is after the last element -- shift
+ * to the next bucket's 0th one.
+ */
+ bkt++;
+ i = 0;
+ hint.key_tail = true;
+ }
+
+ return iterator(bkt, i);
+ }
+
+ template <typename K = Key>
+ GCC6_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator lower_bound(const K& key, Compare cmp) {
+ bound_hint hint;
+ return lower_bound(key, cmp, hint);
+ }
+
+ template <typename K = Key>
+ GCC6_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator upper_bound(const K& key, Compare cmp) {
+ bool key_match;
+ outer_iterator bkt = _tree.lower_bound(key, key_match);
+
+ if (bkt == _tree.end() || !key_match) {
+ return iterator(bkt, 0);
+ }
+
+ int i = bkt->index_of(bkt->upper_bound(key, cmp));
+
+ if (i != 0 && (*bkt)[i - 1].is_tail()) {
+ // Beyond the end() boundary
+ bkt++;
+ i = 0;
+ }
+
+ return iterator(bkt, i);
+ }
+
+ template <typename Func>
+ GCC6_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ void clear_and_dispose(Func&& disp) {
+ _tree.clear_and_dispose([&disp] (inner_array& arr) {
+ arr.for_each(disp);
+ });
+ }
+
+ void clear() { clear_and_dispose(bplus::default_dispose<T>); }
+
+ bool empty() const { return _tree.empty(); }
+};
diff --git a/test/boost/double_decker_test.cc b/test/boost/double_decker_test.cc
new file mode 100644
index 000000000..1f51c40dc
--- /dev/null
+++ b/test/boost/double_decker_test.cc
@@ -0,0 +1,342 @@
+
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+ struct compare {
+ int *_cookie;
+ int *_cookie2;
+public:
+ bool is_head() const { return _head; }
+ bool is_tail() const { return _tail; }
+ void set_head(bool v) { _head = v; }
+ void set_tail(bool v) { _tail = v; }
+
+ test_data(int key, std::string sub) : _key(key, sub), _cookie(new int(0)), _cookie2(new int(0)) {}
+
+ test_data(const test_data& other) = delete;
+ test_data(test_data&& other) noexcept : _key(std::move(other._key)), _head(other._head), _tail(other._tail),
+ _cookie(other._cookie), _cookie2(new int(0)) {
+ other._cookie = nullptr;
+ }
+
+ ~test_data() {
+ if (_cookie != nullptr) {
+ delete _cookie;
+ }
+ delete _cookie2;
+ }
+
+ bool operator==(const compound_key& k) { return _key == k; }
+
+ test_data& operator=(const test_data& other) = delete;
+ test_data& operator=(test_data&& other) = delete;
+
+ std::string format() const { return _key.format(); }
+
+ struct compare {
+ compound_key::compare kcmp;
+ int operator()(const int& a, const int& b) { return kcmp(a, b); }
+ int operator()(const compound_key& a, const int& b) { return kcmp(a.key, b); }
+ int operator()(const int& a, const compound_key& b) { return kcmp(a, b.key); }
+ int operator()(const compound_key& a, const compound_key& b) { return kcmp(a, b); }
+ int operator()(const compound_key& a, const test_data& b) { return kcmp(a, b._key); }
+ int operator()(const test_data& a, const compound_key& b) { return kcmp(a._key, b); }
+ int operator()(const test_data& a, const test_data& b) { return kcmp(a._key, b._key); }
+ };
+};
+
+using collection = double_decker<int, test_data, compound_key::less_compare, test_data::compare, 4,
+ bplus::key_search::both, bplus::with_debug::yes>;
+using oracle = std::set<compound_key, compound_key::less_compare>;
+
+BOOST_AUTO_TEST_CASE(test_lower_bound) {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+
+ c.insert(3, test_data(3, "e"), cmp);
+ c.insert(5, test_data(5, "i"), cmp);
+ c.insert(5, test_data(5, "o"), cmp);
+
+ collection::bound_hint h;
+
+ BOOST_REQUIRE(*c.lower_bound(compound_key(2, "a"), cmp, h) == compound_key(3, "e") && !h.key_match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(3, "a"), cmp, h) == compound_key(3, "e") && h.key_match && !h.key_tail && !h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(3, "e"), cmp, h) == compound_key(3, "e") && h.key_match && !h.key_tail && h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(3, "o"), cmp, h) == compound_key(5, "i") && h.key_match && h.key_tail && !h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(4, "i"), cmp, h) == compound_key(5, "i") && !h.key_match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(5, "a"), cmp, h) == compound_key(5, "i") && h.key_match && !h.key_tail && !h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(5, "i"), cmp, h) == compound_key(5, "i") && h.key_match && !h.key_tail && h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(5, "l"), cmp, h) == compound_key(5, "o") && h.key_match && !h.key_tail && !h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(5, "o"), cmp, h) == compound_key(5, "o") && h.key_match && !h.key_tail && h.match);
+ BOOST_REQUIRE(c.lower_bound(compound_key(5, "q"), cmp, h) == c.end() && h.key_match && h.key_tail);
+ BOOST_REQUIRE(c.lower_bound(compound_key(6, "q"), cmp, h) == c.end() && !h.key_match);
+
+ c.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_upper_bound) {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+
+ c.insert(3, test_data(3, "e"), cmp);
+ c.insert(5, test_data(5, "i"), cmp);
+ c.insert(5, test_data(5, "o"), cmp);
+
+ BOOST_REQUIRE(*c.upper_bound(compound_key(2, "a"), cmp) == compound_key(3, "e"));
+ BOOST_REQUIRE(*c.upper_bound(compound_key(3, "a"), cmp) == compound_key(3, "e"));
+ BOOST_REQUIRE(*c.upper_bound(compound_key(3, "e"), cmp) == compound_key(5, "i"));
+ BOOST_REQUIRE(*c.upper_bound(compound_key(3, "o"), cmp) == compound_key(5, "i"));
+ BOOST_REQUIRE(*c.upper_bound(compound_key(4, "i"), cmp) == compound_key(5, "i"));
+ BOOST_REQUIRE(*c.upper_bound(compound_key(5, "a"), cmp) == compound_key(5, "i"));
+ BOOST_REQUIRE(*c.upper_bound(compound_key(5, "i"), cmp) == compound_key(5, "o"));
+ BOOST_REQUIRE(*c.upper_bound(compound_key(5, "l"), cmp) == compound_key(5, "o"));
+ BOOST_REQUIRE(c.upper_bound(compound_key(5, "o"), cmp) == c.end());
+ BOOST_REQUIRE(c.upper_bound(compound_key(5, "q"), cmp) == c.end());
+ BOOST_REQUIRE(c.upper_bound(compound_key(6, "q"), cmp) == c.end());
+
+ c.clear();
+}
+BOOST_AUTO_TEST_CASE(test_self_iterator) {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+
+ c.insert(1, std::move(test_data(1, "a")), cmp);
+ c.insert(1, std::move(test_data(1, "b")), cmp);
+ c.insert(2, std::move(test_data(2, "c")), cmp);
+ c.insert(3, std::move(test_data(3, "d")), cmp);
+ c.insert(3, std::move(test_data(3, "e")), cmp);
+
+ auto erase_by_ptr = [&] (int key, std::string sub) {
+ test_data* d = &*c.find(compound_key(key, sub), cmp);
+ collection::iterator di(d);
+ di.erase(compound_key::less_compare{});
+ };
+
+ erase_by_ptr(1, "b");
+ erase_by_ptr(2, "c");
+ erase_by_ptr(3, "d");
+
+ auto i = c.begin();
+ BOOST_REQUIRE(*i++ == compound_key(1, "a"));
+ BOOST_REQUIRE(*i++ == compound_key(3, "e"));
+ BOOST_REQUIRE(i == c.end());
+
+ c.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_end_iterator) {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+
+ c.insert(1, std::move(test_data(1, "a")), cmp);
+ auto i = std::prev(c.end());
+ BOOST_REQUIRE(*i == compound_key(1, "a"));
+
+ c.clear();
+}
+
+void validate_sorted(collection& c) {
+ auto i = c.begin();
+ if (i == c.end()) {
+ return;
+ }
+
+ while (1) {
+ auto cur = i;
+ i++;
+ if (i == c.end()) {
+ break;
+ }
+ test_data::compare cmp;
+ BOOST_REQUIRE(cmp(*cur, *i) < 0);
+ }
+}
+
+void compare_with_set(collection& c, oracle& s) {
+ test_data::compare cmp;
+ /* All keys must be findable */
+ for (auto i = s.begin(); i != s.end(); i++) {
+ auto j = c.find(*i, cmp);
+ BOOST_REQUIRE(j != c.end() && *j == *i);
+ }
+
+ /* Both iterations must coincide */
+ auto i = c.begin();
+ auto j = s.begin();
+
+ while (i != c.end()) {
+ BOOST_REQUIRE(*i == *j);
+ i++;
+ j++;
+ }
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_via_emplace) {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+ oracle s;
+ int nr = 0;
+
+ while (nr < 4000) {
+ compound_key k(tests::random::get_int<int>(900), tests::random::get_sstring(4));
+
+ collection::bound_hint h;
+ auto i = c.lower_bound(k, cmp, h);
+
+ if (i == c.end() || !h.match) {
+ auto it = c.emplace_before(i, k.key, h, k.key, k.sub_key);
+ BOOST_REQUIRE(*it == k);
+ s.insert(std::move(k));
+ nr++;
+ }
+ }
+
+ compare_with_set(c, s);
+ c.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_and_erase) {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+ int nr = 0;
+
+ while (nr < 500) {
+ compound_key k(tests::random::get_int<int>(100), tests::random::get_sstring(3));
+
+ if (c.find(k, cmp) == c.end()) {
+ auto it = c.insert(k.key, std::move(test_data(k.key, k.sub_key)), cmp);
+ BOOST_REQUIRE(*it == k);
+ nr++;
+ }
+ }
+
+ validate_sorted(c);
+
+ while (nr > 0) {
+ int n = tests::random::get_int<int>() % nr;
+
+ auto i = c.begin();
+ while (n > 0) {
+ i++;
+ n--;
+ }
+
+ i.erase(compound_key::less_compare{});
+ nr--;
+
+ validate_sorted(c);
+ }
+}
+
+BOOST_AUTO_TEST_CASE(test_compaction) {
+ logalloc::region reg;
+ with_allocator(reg.allocator(), [&] {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+ oracle s;
+
+ {
+ logalloc::reclaim_lock rl(reg);
+
+ int nr = 0;
+
+ while (nr < 1500) {
+ compound_key k(tests::random::get_int<int>(400), tests::random::get_sstring(3));
+
+ if (c.find(k, cmp) == c.end()) {
+ auto it = c.insert(k.key, std::move(test_data(k.key, k.sub_key)), cmp);

Pavel Emelyanov <xemul@scylladb.com>
May 13, 2020, 12:46:13 PM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
// The story is at
// https://groups.google.com/forum/#!msg/scylladb-dev/sxqTHM9rSDQ/WqwF1AQDAQAJ

This is the B+ tree version, which satisfies several specific
requirements making it suitable for row-cache usage.

1. Insert/Remove doesn't invalidate iterators
2. Elements should be LSA-compactable
3. Low overhead of data nodes (1 pointer)
4. Exteral less-only comparator
5. As little actions on insert/delete as possible
6. Iterator walks the sorted keys

The design, briefly, is:

There are 3 types of nodes -- inner, leaf and data. Inner and leaf
nodes keep a built-in array of N keys and N(+1) child nodes. Leaf
nodes sit in a doubly linked list. Data nodes live separately from
the leaf ones and keep pointers to them. The tree handle keeps
pointers to the root and to the left-most and right-most leaves.
Nodes do _not_ keep pointers or references to the tree (except 3
of them, see below).
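
A rough sketch of the leaf layout and the leaf-list walk described above (field and function names are illustrative assumptions, not the actual utils/bptree.hh declarations):

```cpp
#include <cassert>

// Illustrative only: a leaf embeds a fixed array of up to NodeSize keys
// and sits in a linked list, so whole-tree scans never touch inner nodes.
template <typename Key, int NodeSize>
struct leaf_sketch {
    int num_keys = 0;
    leaf_sketch* next = nullptr;  // doubly linked in the real tree;
    leaf_sketch* prev = nullptr;  // one direction suffices for this sketch
    Key keys[NodeSize];
};

// Counting elements by walking the leaf list from the left-most leaf,
// the way tree::size_slow() does in the patch.
template <typename Key, int NodeSize>
int count_elements(const leaf_sketch<Key, NodeSize>* left) {
    int n = 0;
    for (auto* l = left; l != nullptr; l = l->next) {
        n += l->num_keys;
    }
    return n;
}
```

Because the tree handle caches the left-most and right-most leaves, begin() and end() are O(1), and forward iteration follows the leaf list only.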

Update in v7:

- The index_for() implementation is templatized the other way
around, to make an AVX key-search specialization possible (further
patching)

- find(), lower_bound() and upper_bound() do not initialize an empty tree
Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
configure.py | 8 +
test/unit/bptree_key.hh | 101 ++
test/unit/bptree_validation.hh | 318 +++++
utils/bptree.hh | 1875 +++++++++++++++++++++++++++
test/boost/bptree_test.cc | 344 +++++
test/perf/perf_bptree.cc | 246 ++++
test/unit/bptree_compaction_test.cc | 210 +++
test/unit/bptree_stress_test.cc | 236 ++++
8 files changed, 3338 insertions(+)
create mode 100644 test/unit/bptree_key.hh
create mode 100644 test/unit/bptree_validation.hh
create mode 100644 utils/bptree.hh
create mode 100644 test/boost/bptree_test.cc
create mode 100644 test/perf/perf_bptree.cc
create mode 100644 test/unit/bptree_compaction_test.cc
create mode 100644 test/unit/bptree_stress_test.cc

diff --git a/configure.py b/configure.py
index 4f39ad55e..262597d60 100755
--- a/configure.py
+++ b/configure.py
@@ -381,6 +381,7 @@ scylla_tests = set([
'test/boost/view_schema_ckey_test',
'test/boost/vint_serialization_test',
'test/boost/virtual_reader_test',
+ 'test/boost/bptree_test',
'test/manual/ec2_snitch_test',
'test/manual/gce_snitch_test',
'test/manual/gossip',
@@ -398,6 +399,7 @@ scylla_tests = set([
'test/perf/perf_fast_forward',
'test/perf/perf_hash',
'test/perf/perf_mutation',
+ 'test/perf/perf_bptree',
'test/perf/perf_row_cache_update',
'test/perf/perf_simple_query',
'test/perf/perf_sstable',
@@ -405,6 +407,8 @@ scylla_tests = set([
'test/unit/lsa_sync_eviction_test',
'test/unit/row_cache_alloc_stress_test',
'test/unit/row_cache_stress_test',
+ 'test/unit/bptree_stress_test',
+ 'test/unit/bptree_compaction_test',
])

perf_tests = set([
@@ -943,6 +947,7 @@ pure_boost_tests = set([
'test/boost/small_vector_test',
'test/boost/top_k_test',
'test/boost/vint_serialization_test',
+ 'test/boost/bptree_test',
'test/manual/json_test',
'test/manual/streaming_histogram_test',
])
@@ -956,10 +961,13 @@ tests_not_using_seastar_test_framework = set([
'test/perf/perf_cql_parser',
'test/perf/perf_hash',
'test/perf/perf_mutation',
+ 'test/perf/perf_bptree',
'test/perf/perf_row_cache_update',
'test/unit/lsa_async_eviction_test',
'test/unit/lsa_sync_eviction_test',
'test/unit/row_cache_alloc_stress_test',
+ 'test/unit/bptree_stress_test',
+ 'test/unit/bptree_compaction_test',
'test/manual/sstable_scan_footprint_test',
]) | pure_boost_tests

diff --git a/test/unit/bptree_key.hh b/test/unit/bptree_key.hh
new file mode 100644
index 000000000..54347a54f
--- /dev/null
+++ b/test/unit/bptree_key.hh
@@ -0,0 +1,101 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+ other._cookie = nullptr;
+ _p_cookie = new int(*other._p_cookie);
+ }
+
+ ~test_key() {
+ if (_cookie != nullptr) {
+ delete _cookie;
+ }
+ assert(_p_cookie != nullptr);
+ delete _p_cookie;
+ }
+};
+
+test_key copy_key(const test_key& other) { return test_key(other); }
+
+struct test_key_compare {
+ bool operator()(const test_key& a, const test_key& b) const { return a.less(b); }
+};
diff --git a/test/unit/bptree_validation.hh b/test/unit/bptree_validation.hh
new file mode 100644
index 000000000..cdf137bda
--- /dev/null
+++ b/test/unit/bptree_validation.hh
@@ -0,0 +1,318 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+namespace bplus {
+
+template <typename K, typename T, typename Less, int NodeSize>
+class validator {
+ using tree = class tree<K, T, Less, NodeSize, key_search::both, with_debug::yes>;
+ using node = typename tree::node;
+
+ void validate_node(const tree& t, const node& n, int& prev, int& min, bool is_root);
+ void validate_list(const tree& t);
+
+public:
+ void print_tree(const tree& t, char pfx) const {
+ fmt::print("/ {} <- | {} | -> {}\n", t._left->id(), t._root->id(), t._right->id());
+ print_node(*t._root, pfx, 2);
+ fmt::print("\\\n");
+ }
+
+ void print_node(const node& n, char pfx, int indent) const {
+ int i;
+
+ fmt::print("{:<{}c}{:s} {:d} ({:d} keys, {:x} flags):", pfx, indent,
+ n.is_leaf() ? "leaf" : "node", n.id(), n._num_keys, n._flags);
+ if (n.is_leaf()) {
+ for (i = 0; i < n._num_keys; i++) {
+ fmt::print(" {}", (int)n._keys[i].v);
+ }
+ fmt::print("\n");
+
+ return;
+ }
+ fmt::print("\n");
+
+ if (n._kids[0].n != nullptr) {
+ print_node(*n._kids[0].n, pfx, indent + 2);
+ }
+ for (i = 0; i < n._num_keys; i++) {
+ fmt::print("{:<{}c}---{}---\n", pfx, indent, (int)n._keys[i].v);
+ print_node(*n._kids[i + 1].n, pfx, indent + 2);
+ }
+ }
+
+ void validate(const tree& t);
+};
+
+
+template <typename K, typename T, typename L, int NS>
+void validator<K, T, L, NS>::validate_node(const tree& t, const node& n, int& prev_key, int& min_key, bool is_root) {
+ int i;
+
+ if (n.is_root() != is_root) {
+ fmt::print("node {} needs to {} root, but {}\n", n.id(), is_root ? "be" : "be not", n._flags);
+ throw "root broken";
+ }
+
+ }
+ }
+ }
+ }
+}
+ }
+ }
+ }
+
+ }
+ fmt::print("]\n");
+ }
+};
+
+} // namespace
+
diff --git a/utils/bptree.hh b/utils/bptree.hh
new file mode 100644
index 000000000..34ec631f0
--- /dev/null
+++ b/utils/bptree.hh
@@ -0,0 +1,1875 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+#include <boost/intrusive/parent_from_member.hpp>
+#include <seastar/util/defer.hh>
+#include <cassert>
+#include "utils/logalloc.hh"
+
+namespace bplus {
+
+enum class with_debug { no, yes };
+
+/*
+ * Linear search in a sorted array of keys slightly beats the
+ * binary one on small sizes. For debugging purposes both methods
+ * should be used (and the results must coincide).
+ */
+enum class key_search { linear, binary, both };
+
+/*
+ * The node_id class is purely a debugging thing -- when reading
+ * the validator's printouts it's handier to look at IDs consisting
+ * of 1-3 digits rather than the 16 hex digits of a printed pointer
+ */
+template <with_debug D>
+struct node_id {
+ int operator()() const { return reinterpret_cast<uintptr_t>(this); }
+};
+
+template <>
+struct node_id<with_debug::yes> {
+ unsigned int _id;
+ static unsigned int _next() {
+ static std::atomic<unsigned int> rover {1};
+ return rover.fetch_add(1);
+ }
+
+ node_id() : _id(_next()) {}
+ int operator()() const { return _id; }
+};
+
+/*
+ * This wrapper prevents the value from being default-constructed
+ * when its container is created. The intended usage is to wrap
+ * elements of static arrays or containers with .emplace() methods
+ * that can live some time without the value in it.
+ *
+ * Similarly, the value is _not_ automatically destructed when this
+ * thing is, so ~Value() must be called by hand. For this there is the
+ * .reset() method and two helpers for common cases -- std::move-ing
+ * the value into another maybe-location (.emplace(maybe&&)) and
+ * constructing a new one in place of the existing one (.replace(args...))
+ */
+template <typename Value>
+union maybe_key {
+ Value v;
+ maybe_key() noexcept {}
+ ~maybe_key() {}
+
+ void reset() { v.~Value(); }
+
+ /*
+ * Constructs the value inside the empty maybe wrapper.
+ */
+ template <typename... Args>
+ void emplace(Args&&... args) {
+ new (&v) Value (std::forward<Args>(args)...);
+ }
+
+ /*
+ * The special-case handling of moving some other alive maybe-value.
+ * Calls the source destructor after the move.
+ */
+ void emplace(maybe_key&& other) {
+ new (&v) Value(std::move(other.v));
+ other.reset();
+ }
+
+ /*
+ * Similar to emplace, but to be used on the alive maybe.
+ * Calls the destructor on it before constructing the new value.
+ */
+ template <typename... Args>
+ void replace(Args&&... args) {
+ reset();
+ emplace(std::forward<Args>(args)...);
+ }
+
+ void replace(maybe_key&& other) = delete; // not to be called by chance
+};
+
+/*
+ * This is for testing, validator will be everybody's friend
+ * to have rights to check if the tree is internally correct.
+ */
+template <typename Key, typename T, typename Less, int NodeSize> class validator;
+template <with_debug Debug> class statistics;
+
+template <typename Key, typename T, typename Less, int NodeSize, key_search Search, with_debug Debug> class node;
+template <typename Key, typename T, typename Less, int NodeSize, key_search Search, with_debug Debug> class data;
+
+/*
+ * The tree itself.
+ * Equipped with O(1) (with little constant) begin() and end()
+ * and an iterator that scans through the sorted keys and is not
+ * invalidated on insert/remove.
+ *
+ * The NodeSize parameter describes the amount of keys to be
+ * held on each node. Inner nodes will thus have N+1 sub-trees,
+ * leaf nodes will have N data pointers.
+ */
+
+GCC6_CONCEPT(
+ template <typename Key1, typename Key2, typename Less>
+ concept bool LessComparable = requires (const Key1& a, const Key2& b, Less less) {
+ { less(a, b) } -> bool;
+ { less(b, a) } -> bool;
+ };
+
+ template <typename T, typename Key>
+ concept bool CanGetKeyFromValue = requires (T val) {
+ { val.key() } -> Key;
+ };
+)
+
+struct stats {
+ unsigned long nodes;
+ std::vector<unsigned long> nodes_filled;
+ unsigned long leaves;
+ std::vector<unsigned long> leaves_filled;
+ unsigned long datas;
+};
+
+template <typename Key, typename T, typename Less, int NodeSize,
+ key_search Search = key_search::binary, with_debug Debug = with_debug::no>
+GCC6_CONCEPT( requires LessComparable<Key, Key, Less> &&
+ std::is_nothrow_move_constructible_v<Key> &&
+ std::is_nothrow_move_constructible_v<T>
+)
+class tree {
+public:
+ class iterator;
+ friend class validator<Key, T, Less, NodeSize>;
+ friend class node<Key, T, Less, NodeSize, Search, Debug>;
+
+ // Sanity not to allow slow key-search in non-debug mode
+ static_assert(Debug == with_debug::yes || Search != key_search::both);
+
+ using node = class node<Key, T, Less, NodeSize, Search, Debug>;
+ using data = class data<Key, T, Less, NodeSize, Search, Debug>;
+
+private:
+
+ node* _root = nullptr;
+ node* _left = nullptr;
+ node* _right = nullptr;
+ Less _less;
+
+ template <typename K>
+ node& find_leaf_for(const K& k) const {
+ node* cur = _root;
+
+ while (!cur->is_leaf()) {
+ int i = cur->index_for(k, _less);
+ cur = cur->_kids[i].n;
+ }
+
+ return *cur;
+ }
+
+ void maybe_init_empty_tree() {
+ if (_root != nullptr) {
+ return;
+ }
+
+ node* n = node::create();
+ n->_flags |= node::NODE_LEAF | node::NODE_ROOT | node::NODE_RIGHTMOST | node::NODE_LEFTMOST;
+ do_set_root(n);
+ do_set_left(n);
+ do_set_right(n);
+ }
+
+ node* left_leaf_slow() const {
+ node* cur = _root;
+ while (!cur->is_leaf()) {
+ cur = cur->_kids[0].n;
+ }
+ return cur;
+ }
+
+ node* right_leaf_slow() const {
+ node* cur = _root;
+ while (!cur->is_leaf()) {
+ cur = cur->_kids[cur->_num_keys].n;
+ }
+ return cur;
+ }
+
+ template <typename K>
+ iterator get_bound(const K& k, bool& upper) {
+ if (empty()) {
+ return end();
+ }
+
+ node& n = find_leaf_for(k);
+ int i = n.index_for(k, _less);
+
+ /*
+         * The element at i (key at i - 1) is less than or equal to k,
+         * the next element is greater. Mind the corner cases.
+ */
+
+ if (i == 0) {
+ assert(n.is_leftmost());
+ return begin();
+ } else if (i <= n._num_keys) {
+ iterator cur = iterator(n._kids[i].d, i);
+ if (upper || _less(n._keys[i - 1].v, k)) {
+ cur++;
+ } else {
+ // Here 'upper' becomes 'match'
+ upper = true;
+ }
+
+ return cur;
+ } else {
+ assert(n.is_rightmost());
+ return end();
+ }
+ }
+
+public:
+
+ tree(const tree& other) = delete;
+ const tree& operator=(const tree& other) = delete;
+ tree& operator=(tree&& other) = delete;
+
+ explicit tree(Less less) : _less(less) { }
+ ~tree() {
+ if (_root != nullptr) {
+ node::destroy(*_root);
+ }
+ }
+
+ Less less() const { return _less; }
+
+ tree(tree&& other) noexcept : _less(std::move(other._less)) {
+ if (other._root) {
+ do_set_root(other._root);
+ do_set_left(other._left);
+ do_set_right(other._right);
+
+ other._root = nullptr;
+ other._left = nullptr;
+ other._right = nullptr;
+ }
+ }
+
+ // XXX -- this uses linear scan over the leaf nodes
+ size_t size_slow() const {
+ if (_root == nullptr) {
+ return 0;
+ }
+
+ size_t ret = 0;
+ const node* leaf = _left;
+ while (1) {
+ assert(leaf->is_leaf());
+ ret += leaf->_num_keys;
+ if (leaf == _right) {
+ break;
+ }
+ leaf = leaf->get_next();
+ }
+
+ return ret;
+ }
+
+ // Returns result that is equal (both not less than each other)
+ template <typename K = Key>
+ GCC6_CONCEPT(requires LessComparable<K, Key, Less>)
+ iterator find(const K& k) {
+ if (empty()) {
+ return end();
+ }
+
+ node& n = find_leaf_for(k);
+ int i = n.index_for(k, _less);
+
+ if (i >= 1 && !_less(n._keys[i - 1].v, k)) {
+ return iterator(n._kids[i].d, i);
+ } else {
+ return end();
+ }
+ }
+
+ // Returns the least x out of those !less(x, k)
+ template <typename K = Key>
+ GCC6_CONCEPT(requires LessComparable<K, Key, Less>)
+ iterator lower_bound(const K& k) {
+ bool upper = false;
+ return get_bound(k, upper);
+ }
+
+ template <typename K = Key>
+ GCC6_CONCEPT(requires LessComparable<K, Key, Less>)
+ iterator lower_bound(const K& k, bool& match) {
+ match = false;
+ return get_bound(k, match);
+ }
+
+ // Returns the least x out of those less(k, x)
+ template <typename K = Key>
+ GCC6_CONCEPT(requires LessComparable<K, Key, Less>)
+ iterator upper_bound(const K& k) {
+ bool upper = true;
+ return get_bound(k, upper);
+ }
+
+ /*
+ * Constructs the element with key k inside the tree and returns
+ * iterator on it. If the key already exists -- just returns the
+ * iterator on it and sets the .second to false.
+ */
+ template <typename... Args>
+ std::pair<iterator, bool> emplace(Key k, Args&&... args) {
+ maybe_init_empty_tree();
+
+ node& n = find_leaf_for(k);
+ int i = n.index_for(k, _less);
+
+ if (i >= 1 && !_less(n._keys[i - 1].v, k)) {
+ // Direct hit
+ return std::pair(iterator(n._kids[i].d, i), false);
+ }
+
+ data* d = data::create(std::forward<Args>(args)...);
+ auto x = seastar::defer([&d] { data::destroy(*d, default_dispose<T>); });
+ n.insert(i, std::move(k), d, _less);
+ assert(d->attached());
+ x.cancel();
+ return std::pair(iterator(d, i + 1), true);
+ }
+
+    template <typename Func>
+    GCC6_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+    iterator erase_and_dispose(iterator it, Func&& disp) {
+        return it.erase_and_dispose(disp, _less);
+    }
+
+ template <typename Func>
+ GCC6_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ iterator erase_and_dispose(iterator from, iterator to, Func&& disp) {
+ /*
+ * FIXME this is dog slow k*logN algo, need k+logN one
+ */
+ while (from != to) {
+ from = from.erase_and_dispose(disp, _less);
+ }
+
+ return to;
+ }
+
+ template <typename... Args>
+ iterator erase(Args&&... args) { return erase_and_dispose(std::forward<Args>(args)..., default_dispose<T>); }
+
+ template <typename Func>
+ GCC6_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ void clear_and_dispose(Func&& disp) {
+ if (_root != nullptr) {
+ _root->clear(
+ [this, &disp] (data* d) { data::destroy(*d, disp); },
+ [this] (node* n) { node::destroy(*n); }
+ );
+
+ node::destroy(*_root);
+ _root = nullptr;
+ _left = nullptr;
+ _right = nullptr;
+ }
+ }
+
+ class iterator {
+ friend class tree;
+
+ /*
+ * When the iterator gets to the end the _data is
+ * replaced with the _tree obtained from the right
+ * leaf, and the _idx is set to npos
+ */
+ union {
+ tree* _tree;
+ data* _data;
+ };
+ int _idx;
+
+ /*
+ * It could be 0 as well, as leaf nodes cannot have
+ * kids (data nodes) at 0 position, but ...
+ */
+ static constexpr int npos = -1;
+
+ bool is_end() const { return _idx == npos; }
+
+ explicit iterator(tree* t) : _tree(t), _idx(npos) { }
+ iterator(data* d, int idx) : _data(d), _idx(idx) { }
+
+ /*
+ * The routine makes sure the iterator's index is valid
+ * and returns back the leaf that points to it.
+ */
+ node* revalidate() {
+ assert(!is_end());
+
+ node* leaf = _data->_leaf;
+
+ /*
+ * The data._leaf pointer is always valid (it's updated
+         * on insert/remove operations) and the datas do not move
+         * either, so if the leaf still points at us, it is valid.
+ */
+ if (_idx > leaf->_num_keys || leaf->_kids[_idx].d != _data) {
+ _idx = leaf->index_for(_data);
+ }
+
+ return leaf;
+ }
+
+ public:
+ using iterator_category = std::bidirectional_iterator_tag;
+ using value_type = T;
+ using difference_type = ssize_t;
+ using pointer = value_type*;
+ using reference = value_type&;
+
+ /*
+ * Special constructor for the case when there's the need for an
+         * iterator to the given value pointer. In this case we need to
+ * get three things:
+ * - pointer on class data: we assume that the value pointer
+ * is indeed embedded into the data and do the "container_of"
+ * maneuver
+ * - index at which the data is seen on the leaf: use the
+ * standard revalidation. Note, that we start with index 1
+ * which gives us 1/NodeSize chance of hitting the right index
+ * right at once :)
+ * - the tree itself: the worst thing here, creating an iterator
+ * like this is logN operation
+ */
+ iterator(T* value) : _idx(1) {
+ _data = boost::intrusive::get_parent_from_member(value, &data::value);
+ revalidate();
+ }
+
+ iterator() : iterator(static_cast<tree*>(nullptr)) {}
+
+ reference operator*() const { return _data->value; }
+ pointer operator->() const { return &_data->value; }
+
+ iterator& operator++() {
+ node* leaf = revalidate();
+ if (_idx < leaf->_num_keys) {
+ _idx++;
+ } else {
+ if (leaf->is_rightmost()) {
+ _idx = npos;
+ _tree = leaf->_rightmost_tree;
+ return *this;
+ }
+
+ leaf = leaf->get_next();
+ _idx = 1;
+ }
+ _data = leaf->_kids[_idx].d;
+ return *this;
+ }
+
+ iterator& operator--() {
+ if (is_end()) {
+ node* n = _tree->_right;
+ assert(n->_num_keys > 0);
+ _data = n->_kids[n->_num_keys].d;
+ _idx = n->_num_keys;
+ return *this;
+ }
+
+ node* leaf = revalidate();
+ if (_idx > 1) {
+ _idx--;
+ } else {
+ leaf = leaf->get_prev();
+ _idx = leaf->_num_keys;
+ }
+ _data = leaf->_kids[_idx].d;
+ return *this;
+ }
+
+ iterator operator++(int) {
+ iterator cur = *this;
+ operator++();
+ return cur;
+ }
+
+ iterator operator--(int) {
+ iterator cur = *this;
+ operator--();
+ return cur;
+ }
+
+ bool operator==(const iterator& o) const { return is_end() ? o.is_end() : _data == o._data; }
+ bool operator!=(const iterator& o) const { return !(*this == o); }
+
+ /*
+ * The key _MUST_ be in order and not exist,
+ * neither of those is checked
+ */
+ template <typename KeyFn, typename... Args>
+ iterator emplace_before(KeyFn key, Less less, Args&&... args) {
+            node* leaf;
+            int i;
+
+            if (!is_end()) {
+                leaf = revalidate();
+                i = _idx - 1;
+            } else {
+                tree* t = _tree;
+                t->maybe_init_empty_tree();
+                leaf = t->_right;
+                i = leaf->_num_keys;
+            }
+
+            data* d = data::create(std::forward<Args>(args)...);
+            auto x = seastar::defer([&d] { data::destroy(*d, default_dispose<T>); });
+            leaf->insert(i, key(d), d, less);
+            assert(d->attached());
+            x.cancel();
+
+            return iterator(d, i + 1);
+        }
+
+ template <typename... Args>
+ iterator emplace_before(Key k, Less less, Args&&... args) {
+ return emplace_before([&k] (data*) -> Key { return std::move(k); },
+ less, std::forward<Args>(args)...);
+ }
+
+ template <typename... Args>
+ GCC6_CONCEPT(requires CanGetKeyFromValue<T, Key>)
+ iterator emplace_before(Less less, Args&&... args) {
+ return emplace_before([] (data* d) -> Key { return d->value.key(); },
+ less, std::forward<Args>(args)...);
+ }
+
+ private:
+ /*
+ * Prepare a likely valid iterator for the next element.
+ * Likely means, that unless removal starts rebalancing
+ * datas the _idx will be for the correct pointer.
+ *
+ * This is just like the operator++, with the exception
+ * that staying on the leaf doesn't increase the _idx, as
+ * in this case the next element will be shifted left to
+ * the current position.
+ */
+ iterator next_after_erase(node* leaf) const {
+ if (_idx < leaf->_num_keys) {
+ return iterator(leaf->_kids[_idx + 1].d, _idx);
+ }
+
+ if (leaf->is_rightmost()) {
+ return iterator(leaf->_rightmost_tree);
+ }
+
+ leaf = leaf->get_next();
+ return iterator(leaf->_kids[1].d, 1);
+ }
+
+ public:
+ template <typename Func>
+ iterator erase_and_dispose(Func&& disp, Less less) {
+ node* leaf = revalidate();
+ iterator cur = next_after_erase(leaf);
+
+ leaf->remove(_idx - 1, less);
+ data::destroy(*_data, disp);
+
+ return cur;
+ }
+
+ iterator erase(Less less) { return erase_and_dispose(default_dispose<T>, less); }
+
+ template <typename... Args>
+ void reconstruct(size_t new_size, Args&&... args) {
+ node* leaf = revalidate();
+ auto ptr = current_allocator().alloc(&get_standard_migrator<data>(), new_size, alignof(data));
+ data *dat, *cur = _data;
+
+ try {
+ dat = new (ptr) data(std::forward<Args>(args)...);
+ } catch(...) {
+ current_allocator().free(ptr, new_size);
+ throw;
+ }
+
+ dat->_leaf = leaf;
+ cur->_leaf = nullptr;
+
+ _data = dat;
+ leaf->_kids[_idx].d = dat;
+
+ current_allocator().destroy(cur);
+ }
+
+ size_t storage_size() const { return _data->storage_size(); }
+ };
+
+ iterator begin() {
+ if (empty()) {
+ return end();
+ }
+
+ assert(_left->_num_keys > 0);
+ // Leaf nodes have data pointers starting from index 1
+ return iterator(_left->_kids[1].d, 1);
+ }
+ iterator end() { return iterator(this); }
+
+ using reverse_iterator = std::reverse_iterator<iterator>;
+ reverse_iterator rbegin() { return std::make_reverse_iterator(end()); }
+ reverse_iterator rend() { return std::make_reverse_iterator(begin()); }
+
+ bool empty() const { return _root == nullptr || _root->_num_keys == 0; }
+
+ struct stats get_stats() {
+ struct stats st;
+
+ st.nodes = 0;
+ st.leaves = 0;
+ st.datas = 0;
+
+ if (_root != nullptr) {
+ st.nodes_filled.resize(NodeSize + 1);
+ st.leaves_filled.resize(NodeSize + 1);
+ _root->fill_stats(st);
+ }
+
+ return st;
+ }
+};
+
+/*
+ * Algorithms for searching a key in array
+ */
+
+template <typename K, typename Key, typename Less, int Size, key_search Search>
+struct searcher { };
+
+template <typename K, typename Key, typename Less, int Size>
+struct searcher<K, Key, Less, Size, key_search::linear> {
+ static int gt(const K& k, const maybe_key<Key>* keys, int nr, Less less) {
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ if (less(k, keys[i].v)) {
+ break;
+ }
+ }
+
+ return i;
+ };
+};
+
+template <typename K, typename Key, typename Less, int Size>
+struct searcher<K, Key, Less, Size, key_search::binary> {
+ static int gt(const K& k, const maybe_key<Key>* keys, int nr, Less less) {
+ int s = 0, e = nr - 1, c = 0;
+
+ while (s <= e) {
+ int i = (s + e) / 2;
+ c++;
+ if (less(k, keys[i].v)) {
+ e = i - 1;
+ } else {
+ s = i + 1;
+ }
+ }
+
+ return s;
+ }
+};
+
+template <typename K, typename Key, typename Less, int Size>
+struct searcher<K, Key, Less, Size, key_search::both> {
+ static int gt(const K& k, const maybe_key<Key>* keys, int nr, Less less) {
+ int rl = searcher<K, Key, Less, Size, key_search::linear>::gt(k, keys, nr, less);
+ int rb = searcher<K, Key, Less, Size, key_search::binary>::gt(k, keys, nr, less);
+ assert(rl == rb);
+ return rl;
+ }
+};
+
+/*
+ * A node describes both, inner and leaf nodes.
+ */
+template <typename Key, typename T, typename Less, int NodeSize, key_search Search, with_debug Debug>
+class node final {
+ friend class validator<Key, T, Less, NodeSize>;
+ friend class tree<Key, T, Less, NodeSize, Search, Debug>;
+
+ using tree = class tree<Key, T, Less, NodeSize, Search, Debug>;
+ using data = class data<Key, T, Less, NodeSize, Search, Debug>;
+
+ class prealloc;
+
+ /*
+ * separation keys
+ * non-leaf nodes:
+     *    the kids[i] subtree holds keys[i - 1] <= k < keys[i]
+ * kids[0] contains keys < all keys in the node
+ * leaf nodes:
+ * kids[i + 1] is the data for keys[i]
+ * kids[0] is unused
+ *
+ * In the examples below the leaf nodes will be shown like
+ *
+ * keys: [012]
+ * datas: [-012]
+ *
+ * and the non-leaf ones like
+ *
+ * keys: [012]
+ * kids: [A012]
+ *
+ * to have digits correspond to different elements and staying
+ * in its correct positions. And the A kid is this left-most one
+ * at index 0 for the non-leaf node.
+ */
+
+ maybe_key<Key> _keys[NodeSize];
+ node_or_data _kids[NodeSize + 1];
+
+ /*
+ * The root node uses this to point to the tree object. This is
+ * needed to update tree->_root on node move.
+ */
+ union {
+ node* _parent;
+ tree* _root_tree;
+ };
+
+    node& operator=(node&& other) = delete;
+
+ /*
+ * There's no pointer/reference from nodes to the tree, neither
+ * there is such from data, because otherwise we'd have to update
+ * all of them inside tree move constructor, which in turn would
+ * make it toooo slow linear operation. Thus we walk up the nodes
+ * ._parent chain up to the root node which has the _root_tree.
+ */
+ tree* tree_slow() const {
+ const node* cur = this;
+
+ while (!cur->is_root()) {
+ cur = cur->_parent;
+ }
+
+ return cur->_root_tree;
+ }
+
+ /*
+ * Finds the index of the subtree to which the k belongs.
+     * That is, key[i - 1] <= k < key[i] (and if i == 0
+ * then the node is inner and the key is in leftmost subtree).
+ */
+ template <typename K>
+ int index_for(const K& k, Less less) const {
+ return searcher<K, Key, Less, NodeSize, Search>::gt(k, _keys, _num_keys, less);
+ }
+
+ int index_for(node *n) const {
+ // Keep index on kid (FIXME?)
+
+ int i;
+
+ for (i = 0; i <= _num_keys; i++) {
+ if (_kids[i].n == n) {
+ break;
+ }
+ }
+ assert(i <= _num_keys);
+ return i;
+ }
+
+ bool need_refill() const {
+ return _num_keys <= NodeHalf;
+ }
+
+ bool can_grab_from() const {
+ return _num_keys > NodeHalf + 1;
+ }
+
+ bool can_push_to() const {
+ return _num_keys < NodeSize;
+ }
+
+ bool can_merge_with(const node& n) const {
+ return _num_keys + n._num_keys + (is_leaf() ? 0 : 1) <= NodeSize;
+ }
+
+ void shift_right(int s) {
+ for (int i = _num_keys - 1; i >= s; i--) {
+ _keys[i + 1].emplace(std::move(_keys[i]));
+ _kids[i + 2] = _kids[i + 1];
+ }
+ _num_keys++;
+ }
+
+ void shift_left(int s) {
+ // The key at s is expected to be .remove()-d !
+ for (int i = s; i < _num_keys - 1; i++) {
+ _keys[i].emplace(std::move(_keys[i + 1]));
+ _kids[i + 1] = _kids[i + 2];
+ }
+ _num_keys--;
+ }
+
+ void move_keys_and_kids(int foff, node& to, int toff, int count) {
+ for (int i = 0; i < count; i++) {
+ to._keys[toff + i].emplace(std::move(_keys[foff + i]));
+ to._kids[toff + i + 1] = _kids[foff + i + 1];
+ }
+ }
+
+ void move_to(node& to, int off, int count) {
+ move_keys_and_kids(off, to, 0, count);
+ _num_keys = off;
+ to._num_keys = count;
+ if (is_leaf()) {
+ for (int i = 0; i < count; i++) {
+ to._kids[i + 1].d->reattach(&to);
+ }
+ } else {
+ for (int i = 0; i < count; i++) {
+ to._kids[i + 1].n->_parent = &to;
+ }
+ }
+
+ }
+
+ void grab_from_left(node& from, maybe_key<Key>& sep) {
+ /*
+ * Grab one element from the left sibling and return
+ * the new separation key for them.
+ *
+ * Leaf: just move the last key (and the last kid) and report
+ * it as new separation key
+ *
+ * keys: [012] -> [56] = [01] [256] 2 is new separation
+ * datas: [-012] -> [-56] = [-01] [-256]
+ *
+ * Non-leaf is trickier. We need the current separation key
+ * as we're grabbing the last element which has no the right
+ * boundary on the node. So the parent node tells us one.
+ *
+ * keys: [012] -> s [56] = [01] 2 [s56] 2 is new separation
+ * kids: [A012] -> [B56] = [A01] [2B56]
+ */
+
+        int i = from._num_keys;
+
+        shift_right(0);
+
+        if (is_leaf()) {
+            _keys[0].emplace(std::move(from._keys[i - 1]));
+            _kids[1] = from._kids[i];
+            _kids[1].d->reattach(this);
+            sep.replace(std::move(copy_key(_keys[0].v)));
+        } else {
+            _kids[1] = _kids[0];
+            _kids[0] = from._kids[i];
+            _kids[0].n->_parent = this;
+            _keys[0].emplace(std::move(sep));
+            sep.emplace(std::move(from._keys[i - 1]));
+        }
+
+        from._num_keys--;
+    }
+
+ void grab_from_right(node& from, maybe_key<Key>& sep) {
+ /*
+ * Grab one element from the right sibling and return
+ * the new separation key for them.
+ *
+ * Leaf: just move the 0th key (and 1st kid) and the
+ * new separation key is what becomes 0 in the source.
+ *
+ * keys: [01] <- [456] = [014] [56] 5 is new separation
+ * datas: [-01] <- [-456] = [-014] [-56]
+ *
+ * Non-leaf is trickier. We need the current separation
+ * key as we're grabbing the kids[0] element which has no
+ * corresponding keys[-1]. So the parent node tells us one.
+ *
+ * keys: [01] <- s [456] = [01s] 4 [56] 4 is new separation
+ * kids: [A01] <- [B456] = [A01B] [456]
+ */
+
+ int i = _num_keys;
+
+ if (is_leaf()) {
+ _keys[i].emplace(std::move(from._keys[0]));
+ _kids[i + 1] = from._kids[1];
+ _kids[i + 1].d->reattach(this);
+ sep.replace(std::move(copy_key(from._keys[1].v)));
+ } else {
+ _kids[i + 1] = from._kids[0];
+ _kids[i + 1].n->_parent = this;
+ _keys[i].emplace(std::move(sep));
+ from._kids[0] = from._kids[1];
+ sep.emplace(std::move(from._keys[0]));
+ }
+
+ _num_keys++;
+ from.shift_left(0);
+ }
+
+ /*
+ * When splitting, the result should be almost equal. The
+ * "almost" depends on the node-size being odd or even and
+ * on the node itself being leaf or inner.
+ */
+ bool equally_split(const node& n2) const {
+ if (Debug == with_debug::yes) {
+ return (_num_keys == n2._num_keys) ||
+ (_num_keys == n2._num_keys + 1) ||
+ (_num_keys + 1 == n2._num_keys);
+ }
+ return true;
+ }
+
+ // Helper for assert(). See comment for do_insert for details.
+ bool left_kid_sorted(const Key& k, Less less) const {
+ if (Debug == with_debug::yes && !is_leaf() && _num_keys > 0) {
+ node* x = _kids[0].n;
+ if (x != nullptr && less(k, x->_keys[x->_num_keys - 1].v)) {
+ return false;
+ }
+ }
+
+ return true;
+ }
+
+ template <typename DFunc, typename NFunc>
+ GCC6_CONCEPT(requires
+ requires (DFunc f, data* val) { { f(val) } -> void; } &&
+ requires (NFunc f, node* n) { { f(n) } -> void; }
+ )
+ void clear(DFunc&& ddisp, NFunc&& ndisp) {
+ if (is_leaf()) {
+ _flags &= ~(node::NODE_LEFTMOST | node::NODE_RIGHTMOST);
+ set_next(this);
+ set_prev(this);
+ } else {
+ node* n = _kids[0].n;
+ n->clear(ddisp, ndisp);
+ ndisp(n);
+ }
+
+ for (int i = 0; i < _num_keys; i++) {
+ _keys[i].reset();
+ if (is_leaf()) {
+ ddisp(_kids[i + 1].d);
+ } else {
+ node* n = _kids[i + 1].n;
+ n->clear(ddisp, ndisp);
+ ndisp(n);
+ }
+ }
+
+ _num_keys = 0;
+ }
+
+ static node* create() {
+ return current_allocator().construct<node>();
+ }
+
+ static void destroy(node& n) {
+ current_allocator().destroy(&n);
+ }
+
+ void drop() {
+ assert(!is_root());
+ if (is_leaf()) {
+ unlink();
+ }
+ destroy(*this);
+ }
+
+ void insert_into_full(int idx, Key k, node_or_data nd, Less less, prealloc& nodes) {
+ if (!is_root()) {
+ node& p = *_parent;
+ int i = p.index_for(_keys[0].v, less);
+
+ /*
+ * Try to push left or right existing keys to the respective
+ * siblings. Keep in mind two corner cases:
+ *
+ * 1. Push to left. In this case the new key should not go
+ * to the [0] element, otherwise we'd have to update the p's
+ * separation key one more time.
+ *
+ * 2. Push to right. In this case we must make sure the new
+ * key is not the rightmost itself, otherwise it's _him_ who
+ * must be pushed there.
+ *
+ * Both corner cases are possible to implement though.
+ */
+ if (idx > 1 && i > 0) {
+ node* left = p._kids[i - 1].n;
+ if (left->can_push_to()) {
+ /*
+                 * We've moved the 0th element from this, so the index
+ * for the new key shifts too
+ */
+ idx--;
+ left->grab_from_right(*this, p._keys[i - 1]);
+ }
+ }
+
+ if (idx < _num_keys && i < p._num_keys) {
+ node* right = p._kids[i + 1].n;
+ if (right->can_push_to()) {
+ right->grab_from_left(*this, p._keys[i]);
+ }
+ }
+
+ if (_num_keys < NodeSize) {
+ do_insert(idx, std::move(k), nd, less);
+ nodes.drain();
+ return;
+ }
+
+ /*
+ * We can only get here if both ->can_push_to() checks above
+ * had failed. In this case -- go ahead and split this.
+ */
+ }
+
+ split_and_insert(idx, std::move(k), nd, less, nodes);
+ }
+
+ void split_and_insert(int idx, Key k, node_or_data nd, Less less, prealloc& nodes) {
+ assert(_num_keys >= NodeSize);
+
+ node* nn = nodes.pop();
+ maybe_key<Key> sep;
+
+ /*
+ * Insertion with split.
+     * 1. Existing node (this) is split into two. We try a bit harder
+     * than strictly necessary to make the split equal.
+ * 2. The new element is added to either of the resulting nodes.
+ * 3. The new node nn is inserted into parent one with the help
+ * of a separation key sep
+ *
+ * First -- find the position in the current node at which the
+ * new element should have appeared.
+ */
+
+ int off = NodeHalf + (idx > NodeHalf ? 1 : 0);
+
+ if (is_leaf()) {
+ nn->_flags |= NODE_LEAF;
+ link(*nn);
+
+            /*
+             * Leaf split: keys [off, NodeSize) and their datas move
+             * to the new node, whose 0th key becomes the new
+             * separation key (it also stays on the new leaf).
+             */
+            move_to(*nn, off, _num_keys - off);
+            sep.emplace(copy_key(nn->_keys[0].v));
+        } else {
+            /*
+             * Inner split: the key at off moves up to become the
+             * separation key, the tail keys together with their
+             * kids (including the one at off + 1) go to the new node.
+             */
+            sep.emplace(std::move(_keys[off]));
+            nn->_kids[0] = _kids[off + 1];
+            move_to(*nn, off + 1, _num_keys - off - 1);
+            nn->_kids[0].n->_parent = nn;
+            _num_keys = off;
+        }
+
+        if (idx <= off) {
+            do_insert(idx, std::move(k), nd, less);
+        } else {
+            nn->do_insert(idx - off - (is_leaf() ? 0 : 1), std::move(k), nd, less);
+        }
+
+ assert(equally_split(*nn));
+
+ if (is_root()) {
+ insert_into_root(*nn, std::move(sep.v), nodes);
+ } else {
+ insert_into_parent(*nn, std::move(sep.v), less, nodes);
+ }
+ sep.reset();
+ }
+
+ void do_insert(int i, Key k, node_or_data nd, Less less) {
+ assert(_num_keys < NodeSize);
+
+        assert(left_kid_sorted(k, less));
+
+        shift_right(i);
+        _keys[i].emplace(std::move(k));
+        _kids[i + 1] = nd;
+        if (is_leaf()) {
+            _kids[i + 1].d->attach(*this);
+        } else {
+            _kids[i + 1].n->_parent = this;
+        }
+    }
+
+    /*
+     * Refill an underflown node at its siblings' expense (the cases
+     * considered are described below). Must not be called on root.
+     */
+    void maybe_refill(Less less) {
+        node& p = *_parent;
+        node *left, *right;
+
+        /*
+         * We need to locate this node's index in the parent array by
+         * its 0th key, so make sure it exists. We could manage
+         * without it, but since we don't need to, let's stay on the
+         * safe side.
+         */
+        assert(_num_keys > 0);
+        int i = p.index_for(_keys[0].v, less);
+        assert(p._kids[i].n == this);
+
+ /*
+ * The node is "underflown" (see comment near NodeHalf
+ * about what this means), so we try to refill it at the
+ * siblings' expense. Many cases possible, but we go with
+ * only four:
+ *
+ * 1. Left sibling exists and it has at least 1 item
+ * above being the half-full. -> we grab one element
+ * from it.
+ *
+ * 2. Left sibling exists and we can merge current with
+ * it. "Can" means the resulting node will not overflow
+ * which, in turn, differs by one for leaf and non-leaf
+         * nodes. For leaves the merge is possible if the total
+ * number of the elements fits the maximum. For non-leaf
+ * we'll need room for one more element, here's why:
+ *
+ * [012] + [456] -> [012X456]
+ * [A012] + [B456] -> [A012B456]
+ *
+ * The key X in the middle separates B from everything on
+ * the left and this key was not sitting on either of the
+ * wannabe merging nodes. This X is the current separation
+ * of these two nodes taken from their parent.
+ *
+ * And two same cases for the right sibling.
+ */
+
+ left = i > 0 ? p._kids[i - 1].n : nullptr;
+ right = i < p._num_keys ? p._kids[i + 1].n : nullptr;
+
+ if (left != nullptr && left->can_grab_from()) {
+ grab_from_left(*left, p._keys[i - 1]);
+ return;
+ }
+
+ if (right != nullptr && right->can_grab_from()) {
+ grab_from_right(*right, p._keys[i]);
+ return;
+ }
+
+ if (left != nullptr && can_merge_with(*left)) {
+ p.merge_kids(*left, *this, i - 1, less);
+ return;
+ }
+
+ if (right != nullptr && can_merge_with(*right)) {
+ p.merge_kids(*this, *right, i, less);
+ return;
+ }
+
+ /*
+         * Surprisingly, a node in the B+ tree can violate the
+ * "minimally filled" rule for non roots. It _can_ stay with
+ * less than half elements on board. The next remove from
+ * it or either of its siblings will probably refill it.
+ *
+ * Keeping 1 key on the non-root node is possible, but needs
+ * some special care -- if we will remove this last key from
+ * this node, the code will try to refill one and will not
+ * be able to find this node's index at parent (the call for
+ * index_for() above).
+ */
+ assert(_num_keys > 1);
+ }
+
+ void remove(int i, Less less) {
+ assert(i >= 0);
+ /*
+ * Update the matching separation key from above. It
+ * exists only if we're removing the 0th key, but for
+ * the left-most child it doesn't exist.
+ *
+ * Note, that the latter check is crucial for clear()
+         * performance, as it always removes the left-most
+ * key, without this check each remove() would walk the
+ * tree upwards in vain.
+ */
+ if (strict_separation_key && i == 0 && !is_leftmost()) {
+ const Key& k = _keys[i].v;
+ node* p = this;
+
+ while (!p->is_root()) {
+ p = p->_parent;
+ int j = p->index_for(k, less) - 1;
+ if (j >= 0) {
+ p->_keys[j].replace(std::move(copy_key(_keys[1].v)));
+ break;
+ }
+ }
+ }
+
+        _keys[i].reset();
+        shift_left(i);
+    }
+};
+
+/*
+ * The data represents data node (the actual data is stored "outside"
+ * of the tree). The tree::emplace() constructs the payload inside the
+ * data before inserting it into the tree.
+ */
+template <typename K, typename T, typename Less, int NS, key_search S, with_debug D>
+class data final {
+ friend class validator<K, T, Less, NS>;
+ template <typename c1, typename c2, typename c3, int s1, key_search p1, with_debug p2>
+ friend class tree<c1, c2, c3, s1, p1, p2>::iterator;
+
+ using node = class node<K, T, Less, NS, S, D>;
+
+ node* _leaf;
+ T value;
+
+public:
+ template <typename... Args>
+ static data* create(Args&&... args) {
+ return current_allocator().construct<data>(std::forward<Args>(args)...);
+ }
+
+ template <typename Func>
+ GCC6_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ static void destroy(data& d, Func&& disp) {
+ disp(d.value);
+ d._leaf = nullptr;
+ current_allocator().destroy(&d);
+ }
+
+ template <typename... Args>
+ data(Args&& ... args) : _leaf(nullptr), value(std::forward<Args>(args)...) {}
+
+ data(data&& other) noexcept : _leaf(other._leaf), value(std::move(other.value)) {
+ if (attached()) {
+ int i = _leaf->index_for(&other);
+ _leaf->_kids[i].d = this;
+ other._leaf = nullptr;
+ }
+ }
+
+ ~data() { assert(!attached()); }
+
+ bool attached() const { return _leaf != nullptr; }
+
+ void attach(node& to) {
+ assert(!attached());
+ _leaf = &to;
+ }
+
+ void reattach(node* to) {
+ assert(attached());
+ _leaf = to;
+ }
+
+ size_t storage_size() const {
+ return sizeof(data) - sizeof(T) + size_for_allocation_strategy(value);
+ }
+
+ friend size_t size_for_allocation_strategy(const data& obj) {
+ return obj.storage_size();
+ }
+};
+
+} // namespace bplus
diff --git a/test/boost/bptree_test.cc b/test/boost/bptree_test.cc
new file mode 100644
index 000000000..ea8c6ce71
--- /dev/null
+++ b/test/boost/bptree_test.cc
@@ -0,0 +1,344 @@
+
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+
+ {
+ auto i = t.begin();
+ BOOST_REQUIRE(*(i++) == 7);
+ BOOST_REQUIRE(*(i++) == 9);
+ BOOST_REQUIRE(i == t.end());
+ }
+
+ {
+
+ for (int i = 0; i < 32; i++) {
+ t.emplace(i, i);
+ }
+
+ auto b = t.find(8);
+ auto e = t.find(25);
+ t.erase(b, e);
+
+ BOOST_REQUIRE(*t.find(7) == 7);
+ BOOST_REQUIRE(t.find(8) == t.end());
+ BOOST_REQUIRE(t.find(24) == t.end());
+ BOOST_REQUIRE(*t.find(25) == 25);
+
+ t.clear();
+}
diff --git a/test/perf/perf_bptree.cc b/test/perf/perf_bptree.cc
new file mode 100644
index 000000000..ddef05c2a
--- /dev/null
+++ b/test/perf/perf_bptree.cc
@@ -0,0 +1,246 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <seastar/core/app-template.hh>
+#include <seastar/core/thread.hh>
+#include <algorithm>
+#include <vector>
+#include <random>
+#include <fmt/core.h>
+#include "perf.hh"
+
+using per_key_t = int64_t;
+
+struct key_compare {
+ bool operator()(const per_key_t& a, const per_key_t& b) const { return a < b; }
+};
+
+#include "utils/bptree.hh"
+
+using namespace bplus;
+using namespace seastar;
+
+constexpr int TEST_NODE_SIZE = 4;
+
+/* On small node sizes (4 in this test) linear search works better */
+using test_tree = tree<per_key_t, unsigned long, key_compare, TEST_NODE_SIZE, key_search::linear>;
+
+class collection_tester {
+public:
+ virtual void insert(per_key_t k) = 0;
+ virtual void lower_bound(per_key_t k) = 0;
+ virtual void erase(per_key_t k) = 0;
+ virtual void drain(int batch) = 0;
+ virtual void show_stats() = 0;
+    virtual ~collection_tester() = default;
+};
+
+class bptree_tester : public collection_tester {
+ test_tree _t;
+public:
+ bptree_tester() : _t(key_compare{}) {}
+ virtual void insert(per_key_t k) override { _t.emplace(k, 0); }
+ virtual void lower_bound(per_key_t k) override {
+ auto i = _t.lower_bound(k);
+ assert(i != _t.end());
+ }
+ virtual void erase(per_key_t k) override { _t.erase(k); }
+ virtual void drain(int batch) override {
+ int x = 0;
+ auto i = _t.begin();
+ while (i != _t.end()) {
+ i = i.erase(key_compare{});
+ if (++x % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+ }
+    virtual void show_stats() override {
+ struct bplus::stats st = _t.get_stats();
+ fmt::print("nodes: {}\n", st.nodes);
+ for (int i = 0; i < (int)st.nodes_filled.size(); i++) {
+ fmt::print(" {}: {} ({}%)\n", i, st.nodes_filled[i], st.nodes_filled[i] * 100 / st.nodes);
+ }
+ fmt::print("leaves: {}\n", st.leaves);
+ for (int i = 0; i < (int)st.leaves_filled.size(); i++) {
+ fmt::print(" {}: {} ({}%)\n", i, st.leaves_filled[i], st.leaves_filled[i] * 100 / st.leaves);
+ }
+ fmt::print("datas: {}\n", st.datas);
+ }
+ virtual ~bptree_tester() {
+ _t.clear();
+ }
+};
+
+class set_tester : public collection_tester {
+ std::set<per_key_t> _s;
+public:
+ virtual void insert(per_key_t k) override { _s.insert(k); }
+ virtual void lower_bound(per_key_t k) override {
+ auto i = _s.lower_bound(k);
+ assert(i != _s.end());
+ }
+ virtual void erase(per_key_t k) override { _s.erase(k); }
+ virtual void drain(int batch) override {
+ int x = 0;
+ auto i = _s.begin();
+ while (i != _s.end()) {
+ i = _s.erase(i);
+ if (++x % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+ }
+    virtual void show_stats() override { }
+ virtual ~set_tester() = default;
+};
+
+class map_tester : public collection_tester {
+ std::map<per_key_t, unsigned long> _m;
+public:
+ virtual void insert(per_key_t k) override { _m[k] = 0; }
+ virtual void lower_bound(per_key_t k) override {
+ auto i = _m.lower_bound(k);
+ assert(i != _m.end());
+ }
+ virtual void erase(per_key_t k) override { _m.erase(k); }
+ virtual void drain(int batch) override {
+ int x = 0;
+ auto i = _m.begin();
+ while (i != _m.end()) {
+ i = _m.erase(i);
+ if (++x % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+ }
+    virtual void show_stats() override { }
+ virtual ~map_tester() = default;
+};
+
+int main(int argc, char **argv) {
+ namespace bpo = boost::program_options;
+ app_template app;
+ app.add_options()
+ ("count", bpo::value<int>()->default_value(5000000), "number of keys to fill the tree with")
+ ("batch", bpo::value<int>()->default_value(50), "number of operations between deferring points")
+ ("iters", bpo::value<int>()->default_value(1), "number of iterations")
+ ("col", bpo::value<std::string>()->default_value("bptree"), "collection to test")
+ ("test", bpo::value<std::string>()->default_value("erase"), "what to test (erase, drain, find)")
+ ("stats", bpo::value<bool>()->default_value(false), "show stats");
+
+ return app.run(argc, argv, [&app] {
+ auto count = app.configuration()["count"].as<int>();
+ auto iters = app.configuration()["iters"].as<int>();
+ auto batch = app.configuration()["batch"].as<int>();
+ auto col = app.configuration()["col"].as<std::string>();
+ auto tst = app.configuration()["test"].as<std::string>();
+ auto stats = app.configuration()["stats"].as<bool>();
+
+ return seastar::async([count, iters, batch, col, tst, stats] {
+ int rep = iters;
+ collection_tester* c;
+
+ if (col == "bptree") {
+ c = new bptree_tester();
+ } else if (col == "set") {
+ c = new set_tester();
+ } else if (col == "map") {
+ c = new map_tester();
+ } else {
+ fmt::print("Unknown collection\n");
+ return;
+ }
+
+ std::vector<per_key_t> keys;
+
+ for (per_key_t i = 0; i < count; i++) {
+ keys.push_back(i + 1);
+ }
+
+ std::random_device rd;
+ std::mt19937 g(rd());
+
+ fmt::print("Inserting {:d} k:v pairs into {} {:d} times\n", count, col, iters);
+
+ again:
+ std::shuffle(keys.begin(), keys.end(), g);
+ seastar::thread::yield();
+
+ auto d = duration_in_seconds([&] {
+ for (int i = 0; i < count; i++) {
+ c->insert(keys[i]);
+ if ((i + 1) % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+ });
+
+ fmt::print("fill: {:.6f} ms\n", d.count() * 1000);
+
+ if (stats) {
+ c->show_stats();
+ }
+
+ if (tst == "erase") {
+ std::shuffle(keys.begin(), keys.end(), g);
+ seastar::thread::yield();
+
+ d = duration_in_seconds([&] {
+ for (int i = 0; i < count; i++) {
+ c->erase(keys[i]);
+ if ((i + 1) % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+ });
+
+ fmt::print("erase: {:.6f} ms\n", d.count() * 1000);
+ } else if (tst == "drain") {
+ d = duration_in_seconds([&] {
+ c->drain(batch);
+ });
+
+ fmt::print("drain: {:.6f} ms\n", d.count() * 1000);
+ } else if (tst == "find") {
+ std::shuffle(keys.begin(), keys.end(), g);
+ seastar::thread::yield();
+
+ d = duration_in_seconds([&] {
+ for (int i = 0; i < count; i++) {
+ c->lower_bound(keys[i]);
+ if ((i + 1) % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+ });
+
+ fmt::print("find: {:.6f} ms\n", d.count() * 1000);
+ }
+
+ if (--rep > 0) {
+ goto again;
+ }
+
+ delete c;
+ });
+ });
+}
diff --git a/test/unit/bptree_compaction_test.cc b/test/unit/bptree_compaction_test.cc
new file mode 100644
index 000000000..9b1b48072
--- /dev/null
+++ b/test/unit/bptree_compaction_test.cc
@@ -0,0 +1,210 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+ reference(const reference& other) = delete;
+
+ reference(reference&& other) noexcept : _ref(other._ref) {
+ if (_ref != nullptr) {
+ _ref->_ref = this;
+ }
+ other._ref = nullptr;
+ }
+
+ ~reference() {
+ if (_ref != nullptr) {
+ _ref->_ref = nullptr;
+ }
+ }
+
+ tree_pointer(tree_pointer&& other) = delete;
+
+ for (int i = 0; i < count; i++) {
+ keys.push_back(i + 1);
+ }
+
+ std::random_device rd;
+ std::mt19937 g(rd());
+
+ fmt::print("Compacting {:d} k:v pairs {:d} times\n", count, iter);
+
+ test_validator tv;
+
+ logalloc::region mem;
+
+ with_allocator(mem.allocator(), [&] {
+ tree_pointer t;
+
+ again:
+ {
+ std::shuffle(keys.begin(), keys.end(), g);
+
+ logalloc::reclaim_lock rl(mem);
+
+ for (int i = 0; i < count; i++) {
+ test_key k(keys[i]);
+
+ auto ti = t->emplace(std::move(copy_key(k)), k);
+ assert(ti.second);
+ seastar::thread::maybe_yield();
+ }
+ }
+
+ mem.full_compaction();
+
+ if (verb) {
+ fmt::print("After fill + compact\n");
+ tv.print_tree(*t, '|');
+ }
+
+ tv.validate(*t);
+
+ {
+ std::shuffle(keys.begin(), keys.end(), g);
+
+ logalloc::reclaim_lock rl(mem);
+
new file mode 100644
index 000000000..3060b1a7b
--- /dev/null
+++ b/test/unit/bptree_stress_test.cc
@@ -0,0 +1,236 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+};
+
+
+ for (int i = 0; i < count; i++) {
+ keys.push_back(i + 1);
+ }
+
+ std::random_device rd;
+ std::mt19937 g(rd());
+
+ fmt::print("Inserting {:d} k:v pairs {:d} times\n", count, rep);
+
+ test_validator tv;
+
+ if (ks == "desc") {
+ fmt::print("Reversing keys vector\n");
+ std::reverse(keys.begin(), keys.end());
+ }
+
+ bool shuffle = ks == "rand";
+ if (shuffle) {
+ fmt::print("Will shuffle keys each iteration\n");
+ }
+
+
+ again:
+ auto* itc = new test_iterator_checker(tv, *t);
+
+ if (shuffle) {
+ std::shuffle(keys.begin(), keys.end(), g);
+ }
+
+ for (int i = 0; i < count; i++) {
+ test_key k(keys[i]);
+
+ if (verb) {
+ fmt::print("+++ {}\n", (int)k);
+ }
+
+ if (rep % 2 != 1) {
+ auto ir = t->emplace(std::move(copy_key(k)), k);
+ assert(ir.second);
+ } else {
+ auto ir = t->lower_bound(k);
+ ir.emplace_before(std::move(copy_key(k)), test_key_compare{}, k);
+ }
+ oracle[keys[i]] = keys[i] + 10;
+
+ if (verb) {
+ fmt::print("Validating\n");
+ tv.print_tree(*t, '|');
+ }
+
+
+ for (int i = 0; i < count; i++) {
+ test_key k(keys[i]);
+
+ /*
+ * kill iterator if we're removing what it points to,
+ * otherwise it's not invalidated
+ */
+ if (itc->here(k)) {
+ delete itc;
+ itc = nullptr;
+ }
+

Pavel Emelyanov <xemul@scylladb.com>
May 13, 2020, 12:46:13 PM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
Done by specializing the searcher template for Key = int64_t and
Search = key_search::linear. Some notes on this:

1. Searching in partially filled nodes needs to take the trailing
(unused) keys into account. There are two options: limit the
comparisons by hand, or keep a known value in the tail and compare
the full node, unused slots included. The former approach increases
the array scan time by up to 50%.

The latter approach needs some trickery when filling the unused keys,
which is done by specializing the maybe_key template -- the unused
keys are set to the int64_t minimum, a value real tokens can never
take.

2. The AVX searcher template is not equipped with a constraint: if it
were, search keys not satisfying it (even by mistake) would make the
compiler silently fall back to the default linear scanner, and nobody
would know this happened.

3. Run-time selection of the array search code uses gcc's
__attribute__((target)), available since gcc 4.8. For AVX-less chips
the default linear scanner is provided as a fallback.

With this, a 5M-key tree with 4 keys per node finds random keys
~10% faster.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
utils/array-search.hh | 96 +++++++++++++++++++++++++++++++++++++++
utils/bptree.hh | 68 +++++++++++++++++++++++++++
test/boost/bptree_test.cc | 55 ++++++++++++++++++++++
3 files changed, 219 insertions(+)
create mode 100644 utils/array-search.hh

diff --git a/utils/array-search.hh b/utils/array-search.hh
new file mode 100644
index 000000000..828beb631
--- /dev/null
+++ b/utils/array-search.hh
@@ -0,0 +1,96 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+#include <x86intrin.h>
+
+namespace utils {
+
+#define __target__(tgt) __attribute__((target(tgt)))
+#define __default __target__("default")
+#define __avx2 __target__("avx2")
+
+/*
+ * array_search_gt(value, array, capacity, size)
+ *
+ * Returns the index of the first element in the array that's greater
+ * than the given value.
+ */
+
+/*
+ * The AVX2 version doesn't take @size argument into account and expects
+ * all the elements above it to be less than any possible value.
+ *
+ * To make it work without this requirement we'd need to:
+ * - limit the loop iterations to size instead of capacity
+ * - explicitly set to 1 all the mask's bits for elements >= size
+ * Both options make things up to 50% slower.
+ */
+
+static inline int __avx2 array_search_gt(const int64_t& val, const int64_t* array, const int capacity, const int size) {
+ int cnt = 0;
+
+ // 0. Load key into 256-bit ymm
+ __m256i k = _mm256_set1_epi64x(val);
+ for (int i = 0; i < capacity; i += 4) {
+ // 4. Count the number of 1-s, each gt match gives 8 bits
+ cnt += _mm_popcnt_u32(
+ // 3. Pack result into 4 bytes -- 1 byte from each comparison
+ _mm256_movemask_epi8(
+ // 2. Compare array[i] > key, 4 elements in one go
+ _mm256_cmpgt_epi64(
+ // 1. Load next 4 elements into ymm
+ _mm256_lddqu_si256((__m256i*)&array[i]), k
+ )
+ )
+ ) / 8;
+ }
+
+ /*
+ * 5. We need the index of the first gt value. Unused elements are < k
+ * for sure, so count from the tail of the used part.
+ *
+ * <grumble>
+ * We might have done it the other way -- store the maximum in unused,
+ * check for key >= array[i] in the above loop and just return the cnt,
+ * but ... the AVX2 instruction set doesn't have PCMPGE
+ *
+ * SSE* set (predecessor) has cmpge, but eats 2 keys in one go
+ * AVX-512 (successor) has it back, and even eats 8 keys, but is
+ * not widely available
+ * </grumble>
+ */
+ return size - cnt;
+}
+
+static inline int __default array_search_gt(const int64_t& val, const int64_t* array, const int capacity, const int size) {
+ int i;
+
+ for (i = 0; i < size; i++) {
+ if (val < array[i])
+ break;
+ }
+
+ return i;
+}
+
+};
diff --git a/utils/bptree.hh b/utils/bptree.hh
index 34ec631f0..81ccc98bf 100644
--- a/utils/bptree.hh
+++ b/utils/bptree.hh
@@ -25,6 +25,7 @@
#include <seastar/util/defer.hh>
#include <cassert>
#include "utils/logalloc.hh"
+#include "utils/array-search.hh"

namespace bplus {

@@ -109,6 +110,43 @@ union maybe_key {
void replace(maybe_key&& other) = delete; // not to be called by chance
};

+/*
+ * When using int64_t as key the index_for(key) helper might use AVX2
+ * instructions for faster lookup. To make it work correctly (see comment
+ * in array_search_gt() why) the unused keys must be set to minimum.
+ *
+ * Respectively, the maybe_key wrapper is specialized to put the minimum
+ * into both newly created key slots and slots freed from the array.
+ */
+
+template <>
+union maybe_key<int64_t> {
+ int64_t v;
+
+ maybe_key() noexcept : v(std::numeric_limits<int64_t>::min()) {}
+ ~maybe_key() { reset(); }
+
+ void reset() { v = std::numeric_limits<int64_t>::min(); }
+
+ template <typename... Args>
+ void emplace(Args&&... args) {
+ new (&v) int64_t (std::forward<Args>(args)...);
+ }
+
+ void emplace(maybe_key&& other) {
+ v = other.v;
+ other.reset();
+ }
+
+ template <typename... Args>
+ void replace(Args&&... args) {
+ reset();
+ emplace(std::forward<Args>(args)...);
+ }
+
+ void replace(maybe_key&& other) = delete;
+};
+
// For .{do_something_with_data}_and_dispose methods below
template <typename T>
void default_dispose(T& value) { }
@@ -840,6 +878,36 @@ struct searcher<K, Key, Less, Size, key_search::binary> {
}
};

+/*
+ * SIMD-optimized version for int64_t Key with linear search
+ * Note: it ignores the provided Less-er, assuming it's equivalent to a < b
+ */
+
+template <typename Key, typename Less, int Size>
+struct searcher<Key, int64_t, Less, Size, key_search::linear> {
+ /*
+ * ! The constraint is not on the searcher<> specialization itself: if it
+ * were and it failed, the compiler would silently pick the "default"
+ * linear searcher for any K not satisfying it, and the AVX optimization
+ * would silently stay off.
+ */
+ template <typename K = Key>
+ GCC6_CONCEPT( requires requires (const K& key) { { key.int64_key() } -> int64_t } )
+ static int64_t to_int64_key(const K& k) { return k.int64_key(); }
+
+ static_assert(sizeof(maybe_key<int64_t>) == sizeof(int64_t));
+
+ static int gt(const Key& k, const maybe_key<int64_t>* keys, int nr, Less) {
+ return utils::array_search_gt(to_int64_key(k), reinterpret_cast<const int64_t*>(keys), Size, nr);
+ }
+};
+
+template <typename Less, int Size>
+struct searcher<int64_t, int64_t, Less, Size, key_search::linear> {
+ static int gt(const int64_t& k, const maybe_key<int64_t>* keys, int nr, Less) {
+ return utils::array_search_gt(k, reinterpret_cast<const int64_t*>(keys), Size, nr);
+ }
+};
+
template <typename K, typename Key, typename Less, int Size>
struct searcher<K, Key, Less, Size, key_search::both> {
static int gt(const K& k, const maybe_key<Key>* keys, int nr, Less less) {
diff --git a/test/boost/bptree_test.cc b/test/boost/bptree_test.cc
index ea8c6ce71..40030dff4 100644
--- a/test/boost/bptree_test.cc
+++ b/test/boost/bptree_test.cc
@@ -342,3 +342,58 @@ BOOST_AUTO_TEST_CASE(test_erase_range) {

t.clear();
}
+
+BOOST_AUTO_TEST_CASE(test_avx_search) {
+ struct key_wrap {
+ int64_t v;
+ int64_t int64_key() const { return v; }
+ };
+
+ struct int64_compare {
+ bool operator()(const int64_t& a, const int64_t& b) const { return a < b; }
+ bool operator()(const key_wrap& a, const int64_t& b) { return a.v < b; }
+ bool operator()(const int64_t& a, const key_wrap& b) { return a < b.v; }
+ };
+
+ using test_tree = tree<int64_t, unsigned long, int64_compare, 4, key_search::linear>;
+
+ test_tree t(int64_compare{});
+
+ t.emplace(int64_t(0), 0);
+
+ auto i = t.lower_bound(int64_t(0));
+ BOOST_REQUIRE(*i == 0);
+
+ i = t.lower_bound(key_wrap{1});
+ BOOST_REQUIRE(i == t.end());
+
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_min_max_avx_keys) {
+ struct int64_compare {
+ bool operator()(const int64_t& a, const int64_t& b) const { return a < b; }
+ };
+
+ int64_compare cmp{};
+
+ using test_tree = tree<int64_t, unsigned long, int64_compare, 4, key_search::both, with_debug::yes>;
+ test_tree t(cmp);
+
+ t.emplace(std::numeric_limits<int64_t>::max(), 123);
+ auto i = t.find(std::numeric_limits<int64_t>::max());
+ BOOST_REQUIRE(*i == 123);
+ i.erase(cmp);
+
+ // min() cannot be used as token value
+ t.emplace(std::numeric_limits<int64_t>::min() + 1, 321);
+ i = t.find(std::numeric_limits<int64_t>::min() + 1);
+ BOOST_REQUIRE(*i == 321);
+ i.erase(cmp);
+
+ t.emplace(std::numeric_limits<int64_t>::min() + 1, 231);
+ t.emplace(std::numeric_limits<int64_t>::max(), 132);
+ i = t.find(std::numeric_limits<int64_t>::max());
+ BOOST_REQUIRE(*i == 132);
+ t.clear();
+}
--
2.20.1

Pavel Emelyanov <xemul@scylladb.com>
May 13, 2020, 12:46:16 PM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The row_cache::partitions_type is changed from boost::intrusive::set
to bplus::tree<Key = int64_t, T = array_trusted_bounds<cache_entry>>

The token is used to quickly locate the partition, and the internal
array resolves hashing conflicts.

Summary of changes in cache_entry:

- compare's operator() returns int instead of bool; this makes comparing
entries in the array simpler and sometimes takes fewer steps

- token_less is added, which compares tokens in a less-than manner

- on initialization the dummy entry is added with the "after_all_keys"
kind, not "before_all_keys" as it was by default. This keeps tree
entries sorted by token

- insertion and removal of cache_entries happen inside double_decker;
most of the changes in row_cache.cc are about passing constructor args
from current_allocator.construct into double_decker.emplace_before()

- _flags is extended to keep the array head/tail bits. There is room
for it; sizeof(cache_entry) remains unchanged

- some classes are equipped with int64_key() methods for the AVX B+ tree
lookup

The rest fits smoothly into the double_decker API.

Also, as noted in the previous patch, insertion and removal _may_
invalidate iterators, or may leave them intact. Currently this doesn't
seem to be a problem, as cache_tracker::insert() and
::on_partition_erase() invalidate iterators unconditionally.

Later this can be optimized: double_decker invalidates iterators only
on a hash conflict; otherwise it doesn't change the arrays, and the
B+ tree doesn't invalidate its iterators.

tests: unit(dev), perf(dev)

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
dht/i_partitioner.hh | 10 ++++
dht/token.hh | 11 ++++
row_cache.hh | 50 ++++++++++------
utils/double-decker.hh | 2 +
dht/token.cc | 22 ++++---
row_cache.cc | 94 ++++++++++++++----------------
test/perf/memory_footprint_test.cc | 3 +-
7 files changed, 112 insertions(+), 80 deletions(-)

diff --git a/dht/i_partitioner.hh b/dht/i_partitioner.hh
index 2717ab711..92f5e877e 100644
--- a/dht/i_partitioner.hh
+++ b/dht/i_partitioner.hh
@@ -116,6 +116,10 @@ class decorated_key {
return _token;
}

+ int64_t int64_key() const {
+ return _token.raw();
+ }
+
const partition_key& key() const {
return _key;
}
@@ -282,6 +286,10 @@ class ring_position {
return _token;
}

+ int64_t int64_key() const {
+ return _token.raw();
+ }
+
// Valid when !has_key()
token_bound bound() const {
return _token_bound;
@@ -423,6 +431,7 @@ class ring_position_view {

const dht::token& token() const { return *_token; }
const partition_key* key() const { return _key; }
+ int64_t int64_key() const { return token().raw(); }

// Only when key() == nullptr
token_bound get_token_bound() const { return token_bound(_weight); }
@@ -552,6 +561,7 @@ class ring_position_ext {

const dht::token& token() const { return _token; }
const std::optional<partition_key>& key() const { return _key; }
+ int64_t int64_key() const { return _token.raw(); }

// Only when key() == std::nullopt
token_bound get_token_bound() const { return token_bound(_weight); }
diff --git a/dht/token.hh b/dht/token.hh
index 439b6b45e..43cd6cd43 100644
--- a/dht/token.hh
+++ b/dht/token.hh
@@ -160,10 +160,21 @@ class token {
return 0; // hardcoded for now; unlikely to change
}

+ int64_t raw() const {
+ if (is_minimum()) {
+ return std::numeric_limits<int64_t>::min();
+ }
+ if (is_maximum()) {
+ return std::numeric_limits<int64_t>::max();
+ }
+
+ return _data;
+ }
};

const token& minimum_token();
const token& maximum_token();
+int tri_compare_raw(const int64_t& t1, const int64_t& t2);
int tri_compare(const token& t1, const token& t2);
inline bool operator==(const token& t1, const token& t2) { return tri_compare(t1, t2) == 0; }
inline bool operator<(const token& t1, const token& t2) { return tri_compare(t1, t2) < 0; }
diff --git a/row_cache.hh b/row_cache.hh
index fecdeba44..ecfda3f30 100644
--- a/row_cache.hh
+++ b/row_cache.hh
@@ -117,6 +119,7 @@ class cache_entry {
{ }

cache_entry(cache_entry&&) noexcept;
+
~cache_entry();

static cache_entry& container_of(partition_entry& pe) {
@@ -148,34 +151,48 @@ class cache_entry {

bool is_dummy_entry() const { return _flags._dummy_entry; }

+ struct token_compare {
+ bool operator()(const int64_t& k1, const int64_t& k2) const {
@@ -315,10 +332,9 @@ void cache_tracker::insert(partition_entry& pe) noexcept {
class row_cache final {
public:
using phase_type = utils::phased_barrier::phase_type;
- using partitions_type = bi::set<cache_entry,
- bi::member_hook<cache_entry, cache_entry::cache_link_type, &cache_entry::_cache_link>,
- bi::constant_time_size<false>, // we need this to have bi::auto_unlink on hooks
- bi::compare<cache_entry::compare>>;
+ using partitions_type = double_decker<int64_t, cache_entry,
+ cache_entry::token_compare, cache_entry::compare,
+ 16, bplus::key_search::linear>;
friend class cache::autoupdating_underlying_reader;
friend class single_partition_populating_reader;
friend class cache_entry;
diff --git a/utils/double-decker.hh b/utils/double-decker.hh
index d797124a7..c8d4879bf 100644
--- a/utils/double-decker.hh
+++ b/utils/double-decker.hh
@@ -44,10 +44,12 @@ template <typename Key, typename T, typename Less, typename Compare, int NodeSiz
bplus::key_search Search = bplus::key_search::binary, bplus::with_debug Debug = bplus::with_debug::no>
GCC6_CONCEPT( requires Comparable<T, T, Compare> && std::is_nothrow_move_constructible_v<T> )
class double_decker {
+public:
using inner_array = array_trusted_bounds<T>;
using outer_tree = bplus::tree<Key, inner_array, Less, NodeSize, Search, Debug>;
using outer_iterator = typename outer_tree::iterator;

+private:
outer_tree _tree;

public:
diff --git a/dht/token.cc b/dht/token.cc
index 96de48607..17e599fd1 100644
diff --git a/row_cache.cc b/row_cache.cc
index 85d9a2f2c..81e4628b2 100644
--- a/row_cache.cc
+++ b/row_cache.cc
@@ -284,12 +284,12 @@ class partition_range_cursor final {
// Strong exception guarantees.
bool advance_to(dht::ring_position_view pos) {
auto cmp = cache_entry::compare(_cache.get()._schema);
- if (cmp(_end_pos, pos)) { // next() may have moved _start_pos past the _end_pos.
+ if (cmp(_end_pos, pos) < 0) { // next() may have moved _start_pos past the _end_pos.
_end_pos = pos;
}
_end = _cache.get()._partitions.lower_bound(_end_pos, cmp);
_it = _cache.get()._partitions.lower_bound(pos, cmp);
- auto same = !cmp(pos, _it->position());
+ auto same = !(cmp(pos, _it->position()) < 0);
set_position(*_it);
_last_reclaim_count = _cache.get().get_cache_tracker().allocator().invalidate_counter();
return same;
@@ -375,13 +375,13 @@ class single_partition_populating_reader final : public flat_mutation_reader::im
_cache._read_section(_cache._tracker.region(), [this] {
with_allocator(_cache._tracker.allocator(), [this] {
dht::decorated_key dk = _read_context->range().start()->value().as_decorated_key();
- _cache.do_find_or_create_entry(dk, nullptr, [&] (auto i) {
+ _cache.do_find_or_create_entry(dk, nullptr, [&] (auto i, const row_cache::partitions_type::bound_hint& hint) {
mutation_partition mp(_cache._schema);
- cache_entry* entry = current_allocator().construct<cache_entry>(
+ row_cache::partitions_type::iterator entry = _cache._partitions.emplace_before(i, dk.token().raw(), hint,
_cache._schema, std::move(dk), std::move(mp));
_cache._tracker.insert(*entry);
entry->set_continuous(i->continuous());
- return _cache._partitions.insert_before(i, *entry);
+ return entry;
}, [&] (auto i) {
_cache._tracker.on_miss_already_populated();
});
@@ -749,7 +749,7 @@ row_cache::make_reader(schema_ptr s,
cache_entry::compare cmp(_schema);
auto&& pos = ctx->range().start()->value();
auto i = _partitions.lower_bound(pos, cmp);
- if (i != _partitions.end() && !cmp(pos, i->position())) {
+ if (i != _partitions.end() && !(cmp(pos, i->position()) < 0)) {
cache_entry& e = *i;
upgrade_entry(e);
on_partition_hit();
@@ -780,12 +780,11 @@ row_cache::make_reader(schema_ptr s,

void row_cache::drain() {
with_allocator(_tracker.allocator(), [this] {
- _partitions.clear_and_dispose([this, deleter = current_deleter<cache_entry>()] (auto&& p) mutable {
- if (!p->is_dummy_entry()) {
+ _partitions.clear_and_dispose([this] (cache_entry& p) mutable {
+ if (!p.is_dummy_entry()) {
_tracker.on_partition_erase();
}
- p->evict(_tracker);
- deleter(p);
+ p.evict(_tracker);
});
});
}
@@ -809,9 +808,11 @@ cache_entry& row_cache::do_find_or_create_entry(const dht::decorated_key& key,
{
return with_allocator(_tracker.allocator(), [&] () -> cache_entry& {
return with_linearized_managed_bytes([&] () -> cache_entry& {
- auto i = _partitions.lower_bound(key, cache_entry::compare(_schema));
- if (i == _partitions.end() || !i->key().equal(*_schema, key)) {
- i = create_entry(i);
+ partitions_type::bound_hint hint;
+ cache_entry::compare cmp(_schema);
+ auto i = _partitions.lower_bound(key, cmp, hint);
+ if (i == _partitions.end() || !hint.match) {
+ i = create_entry(i, hint);
} else {
visit_entry(i);
}
@@ -834,10 +835,11 @@ cache_entry& row_cache::do_find_or_create_entry(const dht::decorated_key& key,
}

cache_entry& row_cache::find_or_create(const dht::decorated_key& key, tombstone t, row_cache::phase_type phase, const previous_entry_pointer* previous) {
- return do_find_or_create_entry(key, previous, [&] (auto i) { // create
- auto entry = current_allocator().construct<cache_entry>(cache_entry::incomplete_tag{}, _schema, key, t);
+ return do_find_or_create_entry(key, previous, [&] (auto i, const partitions_type::bound_hint& hint) { // create
+ partitions_type::iterator entry = _partitions.emplace_before(i, key.token().raw(), hint,
+ cache_entry::incomplete_tag{}, _schema, key, t);
_tracker.insert(*entry);
- return _partitions.insert_before(i, *entry);
+ return entry;
}, [&] (auto i) { // visit
_tracker.on_miss_already_populated();
cache_entry& e = *i;
@@ -848,14 +850,13 @@ cache_entry& row_cache::find_or_create(const dht::decorated_key& key, tombstone

void row_cache::populate(const mutation& m, const previous_entry_pointer* previous) {
_populate_section(_tracker.region(), [&] {
- do_find_or_create_entry(m.decorated_key(), previous, [&] (auto i) {
- cache_entry* entry = current_allocator().construct<cache_entry>(
+ do_find_or_create_entry(m.decorated_key(), previous, [&] (auto i, const partitions_type::bound_hint& hint) {
+ partitions_type::iterator entry = _partitions.emplace_before(i, m.decorated_key().token().raw(), hint,
m.schema(), m.decorated_key(), m.partition());
_tracker.insert(*entry);
entry->set_continuous(i->continuous());
- i = _partitions.insert_before(i, *entry);
- upgrade_entry(*i);
- return i;
+ upgrade_entry(*entry);
+ return entry;
}, [&] (auto i) {
throw std::runtime_error(format("cache already contains entry for {}", m.key()));
});
@@ -954,8 +955,9 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
_update_section(_tracker.region(), [&] {
memtable_entry& mem_e = *m.partitions.begin();
size_entry = mem_e.size_in_allocator_without_rows(_tracker.allocator());
- auto cache_i = _partitions.lower_bound(mem_e.key(), cmp);
- update = updater(_update_section, cache_i, mem_e, is_present, real_dirty_acc);
+ partitions_type::bound_hint hint;
+ auto cache_i = _partitions.lower_bound(mem_e.key(), cmp, hint);
+ update = updater(_update_section, cache_i, mem_e, is_present, real_dirty_acc, hint);
});
}
// We use cooperative deferring instead of futures so that
@@ -1000,11 +1002,11 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
future<> row_cache::update(external_updater eu, memtable& m) {
return do_update(std::move(eu), m, [this] (logalloc::allocating_section& alloc,
row_cache::partitions_type::iterator cache_i, memtable_entry& mem_e, partition_presence_checker& is_present,
- real_dirty_memory_accounter& acc) mutable {
+ real_dirty_memory_accounter& acc, const partitions_type::bound_hint& hint) mutable {
// If cache doesn't contain the entry we cannot insert it because the mutation may be incomplete.
// FIXME: keep a bitmap indicating which sstables we do cover, so we don't have to
// search it.
- if (cache_i != partitions_end() && cache_i->key().equal(*_schema, mem_e.key())) {
+ if (cache_i != partitions_end() && hint.match) {
cache_entry& entry = *cache_i;
upgrade_entry(entry);
assert(entry._schema == _schema);
@@ -1016,12 +1018,11 @@ future<> row_cache::update(external_updater eu, memtable& m) {
|| with_allocator(standard_allocator(), [&] { return is_present(mem_e.key()); })
== partition_presence_checker_result::definitely_doesnt_exist) {
// Partition is absent in underlying. First, insert a neutral partition entry.
- cache_entry* entry = current_allocator().construct<cache_entry>(cache_entry::evictable_tag(),
- _schema, dht::decorated_key(mem_e.key()),
+ partitions_type::iterator entry = _partitions.emplace_before(cache_i, mem_e.key().token().raw(), hint,
+ cache_entry::evictable_tag(), _schema, dht::decorated_key(mem_e.key()),
partition_entry::make_evictable(*_schema, mutation_partition(_schema)));
entry->set_continuous(cache_i->continuous());
_tracker.insert(*entry);
- _partitions.insert_before(cache_i, *entry);
mem_e.upgrade_schema(_schema, _tracker.memtable_cleaner());
return entry->partition().apply_to_incomplete(*_schema, std::move(mem_e.partition()), _tracker.memtable_cleaner(),
alloc, _tracker.region(), _tracker, _underlying_phase, acc);
@@ -1034,7 +1035,7 @@ future<> row_cache::update(external_updater eu, memtable& m) {
future<> row_cache::update_invalidating(external_updater eu, memtable& m) {
return do_update(std::move(eu), m, [this] (logalloc::allocating_section& alloc,
row_cache::partitions_type::iterator cache_i, memtable_entry& mem_e, partition_presence_checker& is_present,
- real_dirty_memory_accounter& acc)
+ real_dirty_memory_accounter& acc, const partitions_type::bound_hint&)
{
if (cache_i != partitions_end() && cache_i->key().equal(*_schema, mem_e.key())) {
// FIXME: Invalidate only affected row ranges.
@@ -1089,11 +1090,10 @@ void row_cache::invalidate_locked(const dht::decorated_key& dk) {
if (pos == partitions_end() || !pos->key().equal(*_schema, dk)) {
_tracker.clear_continuity(*pos);
} else {
- auto it = _partitions.erase_and_dispose(pos,
- [this, &dk, deleter = current_deleter<cache_entry>()](auto&& p) mutable {
+ auto it = pos.erase_and_dispose(cache_entry::token_compare{},
+ [this](cache_entry& p) mutable {
_tracker.on_partition_erase();
- p->evict(_tracker);
- deleter(p);
+ p.evict(_tracker);
});
_tracker.clear_continuity(*it);
}
@@ -1127,13 +1127,12 @@ future<> row_cache::invalidate(external_updater eu, dht::partition_range_vector&
auto it = _partitions.lower_bound(*_prev_snapshot_pos, cmp);
auto end = _partitions.lower_bound(dht::ring_position_view::for_range_end(range), cmp);
return with_allocator(_tracker.allocator(), [&] {
- auto deleter = current_deleter<cache_entry>();
while (it != end) {
- it = _partitions.erase_and_dispose(it, [&] (cache_entry* p) mutable {
- _tracker.on_partition_erase();
- p->evict(_tracker);
- deleter(p);
- });
+ it = it.erase_and_dispose(cache_entry::token_compare{},
+ [&] (cache_entry& p) mutable {
+ _tracker.on_partition_erase();
+ p.evict(_tracker);
+ });
// it != end is necessary for correctness. We cannot set _prev_snapshot_pos to end->position()
// because after resuming something may be inserted before "end" which falls into the next range.
if (need_preempt() && it != end) {
@@ -1169,8 +1168,9 @@ void row_cache::evict() {

void row_cache::init_empty(is_continuous cont) {
with_allocator(_tracker.allocator(), [this, cont] {
- cache_entry* entry = current_allocator().construct<cache_entry>(cache_entry::dummy_entry_tag());
- _partitions.insert_before(_partitions.end(), *entry);
+ partitions_type::iterator entry = _partitions.emplace_before(_partitions.end(),
+ std::numeric_limits<int64_t>::max(),
+ partitions_type::bound_hint(), cache_entry::dummy_entry_tag());
entry->set_continuous(bool(cont));
});
}
@@ -1178,7 +1178,7 @@ void row_cache::init_empty(is_continuous cont) {
row_cache::row_cache(schema_ptr s, snapshot_source src, cache_tracker& tracker, is_continuous cont)
: _tracker(tracker)
, _schema(std::move(s))
- , _partitions(cache_entry::compare(_schema))
+ , _partitions(cache_entry::token_compare{})
, _underlying(src())
, _snapshot_source(std::move(src))
{
@@ -1190,13 +1190,7 @@ cache_entry::cache_entry(cache_entry&& o) noexcept
, _key(std::move(o._key))
, _pe(std::move(o._pe))
, _flags(o._flags)
- , _cache_link()
{
- {
- using container_type = row_cache::partitions_type;
- container_type::node_algorithms::replace_node(o._cache_link.this_ptr(), _cache_link.this_ptr());
- container_type::node_algorithms::init(o._cache_link.this_ptr());
- }
}

cache_entry::~cache_entry() {
@@ -1211,11 +1205,11 @@ void row_cache::set_schema(schema_ptr new_schema) noexcept {
}

void cache_entry::on_evicted(cache_tracker& tracker) noexcept {
- auto it = row_cache::partitions_type::s_iterator_to(*this);
+ row_cache::partitions_type::iterator it(this);
std::next(it)->set_continuous(false);
evict(tracker);
- current_deleter<cache_entry>()(this);
tracker.on_partition_eviction();
+ it.erase(cache_entry::token_compare{});
}

void rows_entry::on_evicted(cache_tracker& tracker) noexcept {
diff --git a/test/perf/memory_footprint_test.cc b/test/perf/memory_footprint_test.cc
index 465b0169d..14f6a64f2 100644

Pavel Emelyanov

<xemul@scylladb.com>
May 13, 2020, 12:46:17 PM5/13/20
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The change is the same as with row-cache -- use a B+ tree with the int64_t
token as key and an array of memtable_entry-s inside it.

The changes are:

Similar to those for row_cache:

- compare() is made to report int; token_compare is added to report bool

- insertion and removal happen with the help of double_decker; most
of the changes are about its slightly different semantics

- flags are added to memtable_entry; this makes its size larger than
it could be, but still smaller than it was before

Memtable-specific:

- when a new entry is inserted into the tree, iterators _might_ get
invalidated by the double-decker inner array. It is easy to check
when this happens, so the invalidation is avoided when possible

- the size_in_allocator_without_rows() is now not very precise. This
is because after the patch memtable_entries are not allocated
individually as they used to be. Entries with conflicting tokens
can be squashed together, so asking the allocator for the occupied
memory slot is not possible. The size of the enclosing B+ data
node is used as the closest (lower) estimate

- the ::slice() call wants const iterators to work with, but the new
collections do not have them (yet), so a const_cast is used there :(

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
memtable.hh | 54 +++++++++++++++++-------
row_cache.hh | 1 -
utils/double-decker.hh | 18 ++++++++
memtable.cc | 67 +++++++++++++++++-------------
row_cache.cc | 14 +++----
test/perf/memory_footprint_test.cc | 1 +
6 files changed, 104 insertions(+), 51 deletions(-)

diff --git a/memtable.hh b/memtable.hh
index cbff1ecc6..5c0559a6a 100644
--- a/memtable.hh
+++ b/memtable.hh
@@ -32,11 +32,11 @@
#include "db/commitlog/replay_position.hh"
#include "db/commitlog/rp_set.hh"
#include "utils/extremum_tracking.hh"
-#include "utils/logalloc.hh"
#include "partition_version.hh"
#include "flat_mutation_reader.hh"
#include "mutation_cleaner.hh"
#include "sstables/types.hh"
+#include "utils/double-decker.hh"

class frozen_mutation;

@@ -44,11 +44,17 @@ class frozen_mutation;
namespace bi = boost::intrusive;

class memtable_entry {
- bi::set_member_hook<> _link;
schema_ptr _schema;
dht::decorated_key _key;
partition_entry _pe;
+ bool _head = false;
+ bool _tail = false;
public:
+ bool is_head() const { return _head; }
+ void set_head(bool v) { _head = v; }
+ bool is_tail() const { return _tail; }
+ void set_tail(bool v) { _tail = v; }
+
friend class memtable;

memtable_entry(schema_ptr s, dht::decorated_key key, mutation_partition p)
@@ -77,8 +83,10 @@ class memtable_entry {
return _key.key().external_memory_usage();
}

+ size_t object_memory_size(allocation_strategy& allocator);
+
size_t size_in_allocator_without_rows(allocation_strategy& allocator) {
- return allocator.object_memory_size_in_allocator(this) + external_memory_usage_without_rows();
+ return object_memory_size(allocator) + external_memory_usage_without_rows();
}

size_t size_in_allocator(allocation_strategy& allocator) {
@@ -89,30 +97,48 @@ class memtable_entry {
return size;
}

+ struct token_compare {
+ bool operator()(const int64_t& k1, const int64_t& k2) const {
+ return dht::tri_compare_raw(k1, k2) < 0;
+ }
+ bool operator()(const dht::decorated_key& k1, const int64_t& k2) const {
+ return dht::tri_compare_raw(k1.int64_key(), k2) < 0;
+ }
+ bool operator()(const int64_t& k1, const dht::decorated_key& k2) const {
+ return dht::tri_compare_raw(k1, k2.int64_key()) < 0;
+ }
+ bool operator()(const dht::ring_position& k1, const int64_t& k2) const {
+ return dht::tri_compare_raw(k1.token().raw(), k2) < 0;
+ }
+ bool operator()(const int64_t& k1, const dht::ring_position& k2) const {
+ return dht::tri_compare_raw(k1, k2.token().raw()) < 0;
+ }
+ };
+
struct compare {
- dht::decorated_key::less_comparator _c;
+ dht::ring_position_comparator _c;

compare(schema_ptr s)
- : _c(std::move(s))
+ : _c(*s)
{}

- bool operator()(const dht::decorated_key& k1, const memtable_entry& k2) const {
+ int operator()(const dht::decorated_key& k1, const memtable_entry& k2) const {
return _c(k1, k2._key);
}

- bool operator()(const memtable_entry& k1, const memtable_entry& k2) const {
+ int operator()(const memtable_entry& k1, const memtable_entry& k2) const {
return _c(k1._key, k2._key);
}

- bool operator()(const memtable_entry& k1, const dht::decorated_key& k2) const {
+ int operator()(const memtable_entry& k1, const dht::decorated_key& k2) const {
return _c(k1._key, k2);
}

- bool operator()(const memtable_entry& k1, const dht::ring_position& k2) const {
+ int operator()(const memtable_entry& k1, const dht::ring_position& k2) const {
return _c(k1._key, k2);
}

- bool operator()(const dht::ring_position& k1, const memtable_entry& k2) const {
+ int operator()(const dht::ring_position& k1, const memtable_entry& k2) const {
return _c(k1, k2._key);
}
};
@@ -126,9 +152,9 @@ struct table_stats;
// Managed by lw_shared_ptr<>.
class memtable final : public enable_lw_shared_from_this<memtable>, private logalloc::region {
public:
- using partitions_type = bi::set<memtable_entry,
- bi::member_hook<memtable_entry, bi::set_member_hook<>, &memtable_entry::_link>,
- bi::compare<memtable_entry::compare>>;
+ using partitions_type = double_decker<int64_t, memtable_entry,
+ memtable_entry::token_compare, memtable_entry::compare,
+ 16, bplus::key_search::linear>;
private:
dirty_memory_manager& _dirty_mgr;
mutation_cleaner _cleaner;
@@ -179,7 +205,7 @@ class memtable final : public enable_lw_shared_from_this<memtable>, private loga
friend class flush_reader;
friend class flush_memory_accounter;
private:
- boost::iterator_range<partitions_type::const_iterator> slice(const dht::partition_range& r) const;
+ boost::iterator_range<partitions_type::iterator> slice(const dht::partition_range& r) const;
partition_entry& find_or_create_partition(const dht::decorated_key& key);
partition_entry& find_or_create_partition_slow(partition_key_view key);
void upgrade_entry(memtable_entry&);
diff --git a/row_cache.hh b/row_cache.hh
index ecfda3f30..e407758de 100644
--- a/row_cache.hh
+++ b/row_cache.hh
@@ -31,7 +31,6 @@

#include "mutation_reader.hh"
#include "mutation_partition.hh"
-#include "utils/logalloc.hh"
#include "utils/phased_barrier.hh"
#include "utils/histogram.hh"
#include "partition_version.hh"
diff --git a/utils/double-decker.hh b/utils/double-decker.hh
index c8d4879bf..406eab1dc 100644
--- a/utils/double-decker.hh
+++ b/utils/double-decker.hh
@@ -157,6 +157,15 @@ class double_decker {
* one (or when the key_match is false)
*/
bool key_tail;
+
+ /*
+ * This helper says whether the emplace will invalidate (some)
+ * iterators or not. Emplacing with !key_match will go and create
+ * new node in B+ which doesn't invalidate iterators. In another
+ * case some existing B+ data node will be reconstructed, so the
+ * iterators on those nodes will become invalid.
+ */
+ bool emplace_keeps_iterators() const { return !key_match; }
};

iterator begin() { return iterator(_tree.begin(), 0); }
@@ -307,4 +316,13 @@ class double_decker {
void clear() { clear_and_dispose(bplus::default_dispose<T>); }

bool empty() const { return _tree.empty(); }
+
+ static size_t estimated_object_memory_size_in_allocator(allocation_strategy& allocator, const T* obj) {
+ /*
+ * The T-s are merged together in array, so getting any run-time
+ * value of a pointer would be wrong. So here's some guessing of
+ * how much memory would this thing occupy in memory
+ */
+ return sizeof(typename outer_tree::data);
+ }
};
diff --git a/memtable.cc b/memtable.cc
index a5a02c164..afff0dfd5 100644
--- a/memtable.cc
+++ b/memtable.cc
@@ -117,7 +117,7 @@ memtable::memtable(schema_ptr schema, dirty_memory_manager& dmm, table_stats& ta
, _cleaner(*this, no_cache_tracker, table_stats.memtable_app_stats, compaction_scheduling_group)
, _memtable_list(memtable_list)
, _schema(std::move(schema))
- , partitions(memtable_entry::compare(_schema))
+ , partitions(memtable_entry::token_compare{})
, _table_stats(table_stats) {
}

@@ -145,9 +145,8 @@ void memtable::evict_entry(memtable_entry& e, mutation_cleaner& cleaner) {
void memtable::clear() noexcept {
auto dirty_before = dirty_size();
with_allocator(allocator(), [this] {
- partitions.clear_and_dispose([this] (memtable_entry* e) {
- evict_entry(*e, _cleaner);
- current_deleter<memtable_entry>()(e);
+ partitions.clear_and_dispose([this] (memtable_entry& e) {
+ e.partition().evict(_cleaner);
});
});
remove_flushed_memory(dirty_before - dirty_size());
@@ -167,9 +166,7 @@ future<> memtable::clear_gently() noexcept {
if (p.begin()->clear_gently() == stop_iteration::no) {
break;
}
- p.erase_and_dispose(p.begin(), [&] (auto e) {
- alloc.destroy(e);
- });
+ p.begin().erase(memtable_entry::token_compare{});
if (need_preempt()) {
break;
}
@@ -178,6 +175,13 @@ future<> memtable::clear_gently() noexcept {
remove_flushed_memory(dirty_before - dirty_size());
seastar::thread::yield();
}
+
+ /*
+ * The collection is not guaranteed to free everything
+ * with the last erase. If anything gets freed in destructor,
+ * it will be unaccounted from wrong allocator, so handle it
+ */
+ with_allocator(alloc, [&p] { p.clear(); });
});
auto f = t->join();
return f.then([t = std::move(t)] {});
@@ -211,13 +215,17 @@ memtable::find_or_create_partition(const dht::decorated_key& key) {
assert(!reclaiming_enabled());

// call lower_bound so we have a hint for the insert, just in case.
- auto i = partitions.lower_bound(key, memtable_entry::compare(_schema));
+ partitions_type::bound_hint hint;
+ auto i = partitions.lower_bound(key, memtable_entry::compare(_schema), hint);
if (i == partitions.end() || !key.equal(*_schema, i->key())) {
- memtable_entry* entry = current_allocator().construct<memtable_entry>(
- _schema, dht::decorated_key(key), mutation_partition(_schema));
- partitions.insert_before(i, *entry);
+ partitions_type::iterator entry = partitions.emplace_before(i,
+ key.token().raw(), hint,
+ _schema, dht::decorated_key(key), mutation_partition(_schema));
++nr_partitions;
++_table_stats.memtable_partition_insertions;
+ if (!hint.emplace_keeps_iterators()) {
+ current_allocator().invalidate_references();
+ }
return entry->partition();
} else {
++_table_stats.memtable_partition_hits;
@@ -226,12 +234,14 @@ memtable::find_or_create_partition(const dht::decorated_key& key) {
return i->partition();
}

-boost::iterator_range<memtable::partitions_type::const_iterator>
+boost::iterator_range<memtable::partitions_type::iterator>
memtable::slice(const dht::partition_range& range) const {
+ partitions_type& part = const_cast<partitions_type&>(partitions); // none of them has needed const stuff
+
if (query::is_single_partition(range)) {
const query::ring_position& pos = range.start()->value();
- auto i = partitions.find(pos, memtable_entry::compare(_schema));
- if (i != partitions.end()) {
+ partitions_type::iterator i = part.find(pos, memtable_entry::compare(_schema));
+ if (i != part.end()) {
return boost::make_iterator_range(i, std::next(i));
} else {
return boost::make_iterator_range(i, i);
@@ -241,15 +251,15 @@ memtable::slice(const dht::partition_range& range) const {

auto i1 = range.start()
? (range.start()->is_inclusive()
- ? partitions.lower_bound(range.start()->value(), cmp)
- : partitions.upper_bound(range.start()->value(), cmp))
- : partitions.cbegin();
+ ? part.lower_bound(range.start()->value(), cmp)
+ : part.upper_bound(range.start()->value(), cmp))
+ : part.begin();

auto i2 = range.end()
? (range.end()->is_inclusive()
- ? partitions.upper_bound(range.end()->value(), cmp)
- : partitions.lower_bound(range.end()->value(), cmp))
- : partitions.cend();
+ ? part.upper_bound(range.end()->value(), cmp)
+ : part.lower_bound(range.end()->value(), cmp))
+ : part.end();

return boost::make_iterator_range(i1, i2);
}
@@ -761,15 +771,12 @@ mutation_source memtable::as_data_source() {
}

memtable_entry::memtable_entry(memtable_entry&& o) noexcept
- : _link()
- , _schema(std::move(o._schema))
+ : _schema(std::move(o._schema))
, _key(std::move(o._key))
, _pe(std::move(o._pe))
-{
- using container_type = memtable::partitions_type;
- container_type::node_algorithms::replace_node(o._link.this_ptr(), _link.this_ptr());
- container_type::node_algorithms::init(o._link.this_ptr());
-}
+ , _head(o._head)
+ , _tail(o._tail)
+{ }

stop_iteration memtable_entry::clear_gently() noexcept {
return _pe.clear_gently(no_cache_tracker);
@@ -805,9 +812,13 @@ void memtable::set_schema(schema_ptr new_schema) noexcept {
_schema = std::move(new_schema);
}

+size_t memtable_entry::object_memory_size(allocation_strategy& allocator) {
+ return memtable::partitions_type::estimated_object_memory_size_in_allocator(allocator, this);
+}
+
std::ostream& operator<<(std::ostream& out, memtable& mt) {
logalloc::reclaim_lock rl(mt);
- return out << "{memtable: [" << ::join(",\n", mt.partitions) << "]}";
+ return out << "{memtable: [" << ::join(",\n", mt.partitions.begin(), mt.partitions.end()) << "]}";
}

std::ostream& operator<<(std::ostream& out, const memtable_entry& mt) {
diff --git a/row_cache.cc b/row_cache.cc
index 81e4628b2..bdfdc7542 100644
--- a/row_cache.cc
+++ b/row_cache.cc
@@ -888,15 +888,14 @@ void row_cache::invalidate_sync(memtable& m) noexcept {
bool blow_cache = false;
// Note: clear_and_dispose() ought not to look up any keys, so it doesn't require
// with_linearized_managed_bytes(), but invalidate() does.
- m.partitions.clear_and_dispose([this, &m, deleter = current_deleter<memtable_entry>(), &blow_cache] (memtable_entry* entry) {
+ m.partitions.clear_and_dispose([this, &m, &blow_cache] (memtable_entry& entry) {
with_linearized_managed_bytes([&] {
try {
- invalidate_locked(entry->key());
+ invalidate_locked(entry.key());
} catch (...) {
blow_cache = true;
}
- m.evict_entry(*entry, _tracker.memtable_cleaner());
- deleter(entry);
+ m.evict_entry(entry, _tracker.memtable_cleaner());
});
});
if (blow_cache) {
@@ -970,10 +969,9 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
real_dirty_acc.unpin_memory(size_entry);
_update_section(_tracker.region(), [&] {
auto i = m.partitions.begin();
- memtable_entry& mem_e = *i;
- m.partitions.erase(i);
- m.evict_entry(mem_e, _tracker.memtable_cleaner());
- current_allocator().destroy(&mem_e);
+ i.erase_and_dispose(memtable_entry::token_compare{}, [&] (memtable_entry& e) {
+ m.evict_entry(e, _tracker.memtable_cleaner());
+ });
});
++partition_count;
});
diff --git a/test/perf/memory_footprint_test.cc b/test/perf/memory_footprint_test.cc
index 14f6a64f2..4e3026198 100644
--- a/test/perf/memory_footprint_test.cc
+++ b/test/perf/memory_footprint_test.cc
@@ -56,6 +56,7 @@ class size_calculator {
public:
static void print_cache_entry_size() {
std::cout << prefix() << "sizeof(cache_entry) = " << sizeof(cache_entry) << "\n";
+ std::cout << prefix() << "sizeof(memtable_entry) = " << sizeof(memtable_entry) << "\n";
std::cout << prefix() << "sizeof(bptree::node) = " << sizeof(row_cache::partitions_type::outer_tree::node) << "\n";
std::cout << prefix() << "sizeof(bptree::data) = " << sizeof(row_cache::partitions_type::outer_tree::data) << "\n";

--
2.20.1

Benny Halevy

<bhalevy@scylladb.com>
May 14, 2020, 7:48:09 AM5/14/20
to Pavel Emelyanov, scylladb-dev@googlegroups.com
nit: what's the motivation?
should be documented in the commit log.

Avi Kivity

<avi@scylladb.com>
May 14, 2020, 10:47:34 AM5/14/20
to Pavel Emelyanov, scylladb-dev@googlegroups.com, Tomasz Grabiec

On 13/05/2020 19.45, Pavel Emelyanov wrote:
> Done by specializing searcher template for Key = int64_t and
> Search = key_search::linear. Some notes on this
>
> 1. Searching on not filled nodes needs to take trailing (unused)
> keys into account. Two options for that -- limit the comparisons
> "by hands" or keep some known value in the tail and compare full
> node including the unused ones. The former approach increases
> the array scan time up to 50%.
>
> The latter approach needs some trickery with filling the unused
> keys, which is done by specializing the maybe_key template -- the
> unused keys are set to int64_t::min value as real tokens cannot
> take it.
>
> 2. The AVX searcher template is not equipped with constraint, as
> otherwise search keys not meeting it (even by mistake) would cause
> the compiler to pick some other one, it would choose the default
> linear scanner and nobody will know this happened.
>
> 3. To make run-time selection of the array search code the gcc's
> __attribute__((target)) is used. It's available since gcc 4.8.
> For AVX-less chips the default linear scanner is also provided.
>
> With this the 5M keys tree with 4 keys-per-node finds random keys
> ~10% faster.


Let's take this out of the series, and apply it later.  The reason is
that this is just an optimization, and I'm going to nit-pick it, while
Tomek should review the integration with the cache which should go in
more quickly.
Names beginning with __ are reserved for the compiler.


I think you can write this in a C++-friendly format as
[[gnu::target("avx2")]].



> +
> +/*
> + * array_search_gt(value, array, capacity, size)
> + *
> + * Returns the index of the first element in the array that's greater
> + * than the given value.
> + */
> +
> +/*
> + * The AVX2 version doesn't take @size argument into account and expects
> + * all the elements above it to be less than any possible value.


This is terrible from a purity point of view but very practical for us.


I think you could work around it by creating a mask from size, and
masking out anything above size. But let's pretend we didn't see this
and just use the trick with min_int.
Does the compiler unroll the loops? Maybe add
[[gnu::optimize("unroll-loops")]].


Although I expect the CPU will unroll the loops if the compiler doesn't,
with minimal performance impact. What's most important is issuing the
reads early so they can proceed in parallel.




> +
> + /*
> + * 5. We need the index of the first gt value. Unused elements are < k
> + * for sure, so count from the tail of the used part.
> + *
> + * <grumble>
> + * We might have done it the other way -- store the maximum in unused,
> + * check for key >= array[i] in the above loop and just return the cnt,
> + * but ... AVX2 instructions set doesn't have the PCMPGE
> + *
> + * SSE* set (predecessor) has cmpge, but eats 2 keys in one go
> + * AVX-512 (successor) has it back, and even eats 8 keys, but is
> + * not widely available


I don't see cmpge anywhere.


Note that AVX-512 is worthless for us, it slows down the clock for
everyone. So it's only useful if the entire code uses 512-bit vectors
and can tolerate the slowdown.


> + * </grumble>
> + */
> + return size - cnt;
> +}
> +
> +static inline int __default array_search_gt(const int64_t& val, const int64_t* array, const int capacity, const int size) {


Why pass the key by reference? By value is simpler.
Ok, I thought I'd have more nitpicks, but this is really well done.


Avi Kivity

<avi@scylladb.com>
May 14, 2020, 10:58:33 AM5/14/20
to Pavel Emelyanov, scylladb-dev@googlegroups.com

On 13/05/2020 19.45, Pavel Emelyanov wrote:
Please add #ifdef so it doesn't fail on non-x86.



> + */
> +
> +static inline int __avx2 array_search_gt(const int64_t& val, const int64_t* array, const int capacity, const int size) {
> + int cnt = 0;
> +


And here too, of course.

Pavel Emelyanov

<xemul@scylladb.com>
May 14, 2020, 4:53:05 PM5/14/20
to Avi Kivity, scylladb-dev, Tomasz Grabiec


Thu, May 14, 2020, 17:47 Avi Kivity <a...@scylladb.com>:

On 13/05/2020 19.45, Pavel Emelyanov wrote:
> Done by specializing searcher template for Key = int64_t and
> Search = key_search::linear. Some notes on this
>
> 1. Searching on not filled nodes needs to take trailing (unused)
> keys into account. Two options for that -- limit the comparisons
> "by hands" or keep some known value in the tail and compare full
> node including the unused ones. The former approach increases
> the array scan time up to 50%.
>
> The latter approach needs some trickery with filling the unused
> keys, which is done by specializing the maybe_key template -- the
> unused keys are set to int64_t::min value as real tokens cannot
> take it.
>
> 2. The AVX searcher template is not equipped with constraint, as
> otherwise search keys not meeting it (even by mistake) would cause
> the compiler to pick some other one, it would choose the default
> linear scanner and nobody will know this happened.
>
> 3. To make run-time selection of the array search code the gcc's
> __attribute__((target)) is used. It's available since gcc 4.8.
> For AVX-less chips the default linear scanner is also provided.
>
> With this the 5M keys tree with 4 keys-per-node finds random keys
> ~10% faster.


Let's take this out of the series, and apply it later. 

Sure, I'll push another branch without it if the review goes ok.
That's exactly what I've spent several hours on.

Any math to evaluate this mask results in a 30% to 50% slowdown on a nano-benchmark that just scans the array. Doing a loop not to capacity (a compile-time constant), but to size (a variable), makes things worse. Even the size - nr at the end of this function seems to work a little bit slower than capacity - nr, which is why I lamented the lack of cmpge.

I can brush up that nano-benchmark and send it to you; maybe there is a way, but so far I haven't found it.
With capacity being 8 yes. For 16 -- IIRC yes. For 32 -- no.

Maybe add
[[gnu::optimize("unroll-loops")]].

Ok, will check.

Although I expect the cpu will unroll the loops if the compiler doesn't 
with minimal performance impact. What's most important is issuing the
reads early so they can proceed in parallel.




> +
> +    /*
> +     * 5. We need the index of the first gt value. Unused elements are < k
> +     *    for sure, so count from the tail of the used part.
> +     *
> +     *   <grumble>
> +     *    We might have done it the other way -- store the maximum in unused,
> +     *    check for key >= array[i] in the above loop and just return the cnt,
> +     *    but ...  AVX2 instructions set doesn't have the PCMPGE
> +     *
> +     *    SSE* set (predecessor) has cmpge, but eats 2 keys in one go
> +     *    AVX-512 (successor) has it back, and even eats 8 keys, but is
> +     *    not widely available


I don't see cmpge anywhere.

There's an interactive guide for all this stuff:


The URL already searches for cmpge; it shows only AVX-512 (red bars) and SSE (green ones).

Note that AVX-512 is worthless for us, it slows down the clock for
everyone. So it's only useful if the entire code uses 512-bit vectors
can can tolerate the slowdown.


> +     *   </grumble>
> +     */
> +    return size - cnt;
> +}
> +
> +static inline int __default array_search_gt(const int64_t& val, const int64_t* array, const int capacity, const int size) {


Why pass they key by reference? by value is simpler.

I guess I just made it similar to the whole searcher set of templates in bptree and forgot to clean.

Avi Kivity

<avi@scylladb.com>
May 15, 2020, 2:02:13 AM5/15/20
to Pavel Emelyanov, scylladb-dev, Tomasz Grabiec

Oh, looping to size instead of capacity will be bad since the loop will be mispredicted.


Possibly you can use the bsf instruction to find the first zero. But I don't think it's worth pursuing; what you have is pretty good and suitable for our needs.

Does it place all the loads at the beginning? This is the important bit.




Maybe add
[[gnu::optimize("unroll-loops")]].

Ok, will check.

Although I expect the cpu will unroll the loops if the compiler doesn't 
with minimal performance impact. What's most important is issuing the
reads early so they can proceed in parallel.




> +
> +    /*
> +     * 5. We need the index of the first gt value. Unused elements are < k
> +     *    for sure, so count from the tail of the used part.
> +     *
> +     *   <grumble>
> +     *    We might have done it the other way -- store the maximum in unused,
> +     *    check for key >= array[i] in the above loop and just return the cnt,
> +     *    but ...  AVX2 instructions set doesn't have the PCMPGE
> +     *
> +     *    SSE* set (predecessor) has cmpge, but eats 2 keys in one go
> +     *    AVX-512 (successor) has it back, and even eats 8 keys, but is
> +     *    not widely available


I don't see cmpge anywhere.

There's an interactive guide for all this stuff:


The URL already searches for cmpge, it shows only avx-512 (red bars) and sse (green ones).


Ah, I see it. Don't know why I missed it before.

Pavel Emelyanov

<xemul@scylladb.com>
May 15, 2020, 5:35:26 AM5/15/20
to Avi Kivity, scylladb-dev, Tomasz Grabiec
>> I think you could work around it by creating a mask from size, and
>> masking out anything above size.
>>
>>
>> That's exactly what I've spent several hours on.
>>
>>  Any math to evaluate this mask results in a 30% to 50% slowdown on a nano-benchmark that just scans the array. Doing a loop not to capacity (compile time constant), but to size (variable) makes things worse.
>
>
> Oh, looping to size instead of capacity will be bad since the loop will be mispredicted.
>
>
> Possibly you can use the bsf instruction to find the first zero. But I don't think it's worth pursuing; what you have is pretty good and suitable for our needs.
>
>
>> Even the size - nr at the end of this function seems to work a little bit slower than capacity - nr, which is why I lamented the lack of cmpge.
>>
>> I can brush up that nano-benchmark and send to you, maybe there is a way, but so far I haven't found it.

If you want to play with it a bit more -- attached is the nano-bench. Just do sh ./test.sh and it will run.
On my box for 16 keys the timings look like

---
loop 16 keys 0.327 s 65.416 ns/search 4.089 ns/op
binary 16 keys 0.277 s 55.409 ns/search 3.463 ns/op

avx_nomsk 16 keys 0.096 s 19.295 ns/search 1.206 ns/op

avx_nomsk_n 16 keys 0.100 s 19.965 ns/search 1.248 ns/op
avx_nomsk_acmp 16 keys 0.113 s 22.506 ns/search 1.407 ns/op
avx_nomsk_bcmp 16 keys 0.103 s 20.666 ns/search 1.292 ns/op
avx_nomsk_smin 16 keys 0.119 s 23.742 ns/search 1.484 ns/op
avx_nomsk_ssemin 16 keys 0.112 s 22.480 ns/search 1.405 ns/op

avx_condmsk 16 keys 0.208 s 41.615 ns/search 2.601 ns/op
avx_ifmsk 16 keys 0.159 s 31.882 ns/search 1.993 ns/op
avx_limmsk 16 keys 0.123 s 24.502 ns/search 1.531 ns/op
avx_arrmsk 16 keys 0.149 s 29.876 ns/search 1.867 ns/op

The avx_nomsk is an incorrect, straightforward finder, just for calibration.
The avx_nomsk_* variants don't mask anything, but try to handle
size-related corner cases with various tricks.
The avx_*msk variants try to evaluate a mask depending on size.

This patch for B+ uses the avx_nomsk_n approach.
Yes, for 8 keys:

401288: c4 c2 7d 59 14 c0 vpbroadcastq (%r8,%rax,8),%ymm2
40128e: c5 ff f0 0c f3 vlddqu (%rbx,%rsi,8),%ymm1
401293: c5 ff f0 44 f3 20 vlddqu 0x20(%rbx,%rsi,8),%ymm0
401299: 89 fe mov %edi,%esi
40129b: c4 e2 75 37 ca vpcmpgtq %ymm2,%ymm1,%ymm1
4012a0: c4 e2 7d 37 c2 vpcmpgtq %ymm2,%ymm0,%ymm0
4012a5: c5 fd d7 c1 vpmovmskb %ymm1,%eax
4012a9: f3 0f b8 c0 popcnt %eax,%eax
4012ad: c1 f8 03 sar $0x3,%eax
4012b0: 89 c2 mov %eax,%edx
4012b2: c5 fd d7 c0 vpmovmskb %ymm0,%eax
4012b6: f3 0f b8 c0 popcnt %eax,%eax

For 16 keys it compiled it as a loop.

>>
>> Maybe add
>> [[gnu::optimize("unroll-loops")]].

With this it unrolls even 64 keys and groups vlddqu-s, but every 3 loads are
followed by vpmovmskb, though it doesn't reuse the vlddqu's target ymm registers.

-- Pavel
cmp.hh
test.cc
test.sh

Tomasz Grabiec

<tgrabiec@scylladb.com>
May 15, 2020, 11:33:11 AM5/15/20
to Pavel Emelyanov, scylladb-dev
clear_now() is marked as noexcept, but init_empty() is not, because it allocates.

I don't see a strong reason to have this patch.

Tomasz Grabiec

<tgrabiec@scylladb.com>
May 15, 2020, 1:02:31 PM5/15/20
to Pavel Emelyanov, scylladb-dev
On Wed, May 13, 2020 at 6:46 PM Pavel Emelyanov <xe...@scylladb.com> wrote:
Looks like this belongs to the patch which introduces double_decker.
 
diff --git a/dht/token.cc b/dht/token.cc
index 96de48607..17e599fd1 100644
--- a/dht/token.cc
+++ b/dht/token.cc
@@ -33,11 +33,7 @@ namespace dht {
 using uint128_t = unsigned __int128;

 inline int64_t long_token(const token& t) {
-    if (t.is_minimum() || t.is_maximum()) {
-        return std::numeric_limits<int64_t>::min();
-    }
-
-    return t._data;
+    return t.raw();

This changes the result if t.is_maximum() from min() to max(). Why is this ok? 

Should be a separate patch with explanation if intended.
 
 }

 static const token min_token{ token::kind::before_all_keys, 0 };
@@ -53,19 +49,21 @@ maximum_token() {
     return max_token;
 }

+int tri_compare_raw(const int64_t& l1, const int64_t& l2) {

It's better to pass by value.
Unrelated changes
 

         {

             nest n;
             std::cout << prefix() << "sizeof(decorated_key) = " << sizeof(dht::decorated_key) << "\n";
-            std::cout << prefix() << "sizeof(cache_link_type) = " << sizeof(cache_entry::cache_link_type) << "\n";
             print_mutation_partition_size();
         }

--
2.20.1

--
You received this message because you are subscribed to the Google Groups "ScyllaDB development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-dev/20200513164550.7136-9-xemul%40scylladb.com.

Raphael S. Carvalho

<raphaelsc@scylladb.com>
May 15, 2020, 1:20:16 PM5/15/20
to Pavel Emelyanov, scylladb-dev
Wouldn't it be better to call this function raw_token()? Binding a
type to an entity is not future-proof. It also improves readability in
the caller and avoids some confusion: when I read int64_key()
I may think that it's a transformation of the partition key using 64 bits,
not the data representation of the token.
Perhaps we should define a type alias for the raw token to increase readability.

Tomasz Grabiec

<tgrabiec@scylladb.com>
May 15, 2020, 1:20:27 PM5/15/20
to Pavel Emelyanov, scylladb-dev
Why aren't you calling evict_entry() anymore? nr_partitions is not updated now.

Avi Kivity

<avi@scylladb.com>
unread,
May 15, 2020, 1:57:12 PM5/15/20
to Pavel Emelyanov, scylladb-dev, Tomasz Grabiec
On 5/15/20 12:35 PM, Pavel Emelyanov wrote:
>>>     I think you could work around it by creating a mask from size, and
>>>     masking out anything above size.
>>>
>>>
>>> That's exactly what I've spent several hours at.
>>>
>>>  Any math to evaluate this mask results in a 30% to 50% slowdown on
>>> the nano-benchmark that just scans the array. Doing a loop not to
>>> capacity (a compile-time constant), but to size (a variable) makes
>>> things worse.
>>
>>
>> Oh, looping to size instead of capacity will be bad since the loop
>> will be mispredicted.
>>
>>
>> Possibly you can use the bsf instruction to find the first zero. But
>> I don't think it's worth pursuing; what you have is pretty good and
>> suitable for our needs.
>>
>>
>>> Even the size - nr at the end of this function seems to work a
>>> little bit slower than capacity - nr, that's why I lamented about
>>> the lack of cmpge.
>>>
>>> I can brush up that nano-benchmark and send to you, maybe there is a
>>> way, but so far I haven't found it.
>
> If you want to play with it a bit more -- attached is the nano-bench.
> Just do sh ./test.sh and it will run.
> On my box for 16 keys the timings look like


It will be nice to add it to test/perf.


But I am just wasting our time here, because I like this low level stuff
so much. What you have is already excellent and shaving another cycle
here or there is unlikely to make a difference in the real world.
Silly compiler should start the loads earlier.


> 401299:       89 fe                   mov    %edi,%esi
>   40129b:       c4 e2 75 37 ca          vpcmpgtq %ymm2,%ymm1,%ymm1
>   4012a0:       c4 e2 7d 37 c2          vpcmpgtq %ymm2,%ymm0,%ymm0
>   4012a5:       c5 fd d7 c1             vpmovmskb %ymm1,%eax
>   4012a9:       f3 0f b8 c0             popcnt %eax,%eax
>   4012ad:       c1 f8 03                sar    $0x3,%eax
>   4012b0:       89 c2                   mov    %eax,%edx
>   4012b2:       c5 fd d7 c0             vpmovmskb %ymm0,%eax


And do this earlier too, this instruction has high latency.


>
>   4012b6:       f3 0f b8 c0             popcnt %eax,%eax
>
> for 16 keys it compiled it as loop.
>
>>>
>>>     Maybe add
>>>     [[gnu::optimize("unroll-loops")]].
>
> With this it unrolls even 64 keys and groups vlddqu-s, but every 3
> loads are
> followed by vpmovmskb, though it doesn't reuse the vlddqu's target ymm
> registers.


Strange. But while I'm sure we can improve it, it's probably not
worth the effort on the macro scale.
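As a side note for readers, the "mask out anything above size" idea discussed in this sub-thread can be sketched in scalar form roughly like this (function name, signature, and return convention are my assumptions; the real utils::array_search_gt is vectorized and may differ):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar sketch of the mask-from-size idea: compare every slot up to the
// compile-time capacity (a predictable loop bound), build a bitmask of
// "key > val" results, then clear the bits for lanes beyond the live size.
// Assumes size <= capacity < 64 so the mask fits in a uint64_t.
inline int array_search_gt_sketch(int64_t val, const int64_t* keys,
                                  size_t capacity, size_t size) {
    assert(size <= capacity && capacity < 64);
    uint64_t mask = 0;
    for (size_t i = 0; i < capacity; i++) {
        mask |= uint64_t(keys[i] > val) << i;
    }
    mask &= (uint64_t(1) << size) - 1;   // drop results past 'size'
    // index of the first key greater than val, or 'size' if there is none
    return mask != 0 ? __builtin_ctzll(mask) : int(size);
}
```

With a sorted keys array this returns the upper-bound position, which is what a B+tree node lookup needs; __builtin_ctzll is the GCC/Clang counterpart of the bsf/tzcnt instruction mentioned above.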



Pavel Emelyanov

<xemul@scylladb.com>
unread,
May 18, 2020, 4:17:44 AM5/18/20
to Tomasz Grabiec, scylladb-dev

>  void rows_entry::on_evicted(cache_tracker& tracker) noexcept {
> diff --git a/test/perf/memory_footprint_test.cc b/test/perf/memory_footprint_test.cc
> index 465b0169d..14f6a64f2 100644
> --- a/test/perf/memory_footprint_test.cc
> +++ b/test/perf/memory_footprint_test.cc
> @@ -56,11 +56,12 @@ class size_calculator {
>  public:
>      static void print_cache_entry_size() {
>          std::cout << prefix() << "sizeof(cache_entry) = " << sizeof(cache_entry) << "\n";
> +        std::cout << prefix() << "sizeof(bptree::node) = " << sizeof(row_cache::partitions_type::outer_tree::node) << "\n";
> +        std::cout << prefix() << "sizeof(bptree::data) = " << sizeof(row_cache::partitions_type::outer_tree::data) << "\n";
>
>
> Unrelated changes

I believe they are related, as the memory footprint changes after this patch specifically
because we no longer have a sole cache_entry allocation, but also all the bptree:: stuff,
and knowing their sizes helps in finding out why it changes the way it does.

Or did you mean that there should be a separate two-line patch with this hunk?

-- Pavel

>          {
>              nest n;
>              std::cout << prefix() << "sizeof(decorated_key) = " << sizeof(dht::decorated_key) << "\n";
> -            std::cout << prefix() << "sizeof(cache_link_type) = " << sizeof(cache_entry::cache_link_type) << "\n";
>              print_mutation_partition_size();
>          }
>
> --
> 2.20.1
>

Pavel Emelyanov

<xemul@scylladb.com>
unread,
May 18, 2020, 4:20:19 AM5/18/20
to Tomasz Grabiec, scylladb-dev


On 15.05.2020 20:20, Tomasz Grabiec wrote:
>
>
> On Wed, May 13, 2020 at 6:46 PM Pavel Emelyanov <xe...@scylladb.com <mailto:xe...@scylladb.com>> wrote:
>
> The change is the same as with row-cache -- use B+ with int64_t token
> as key and array of memtable_entry-s inside it.
>
> The changes are:
>
> Similar to those for row_cache:
>
> - compare() is made to report int, added token_compare to report bool
>
> - insertion and removal happens with the help of double_decker, most
>   of the places are about slightly changed semantics of it
>
> - flags are added to memtable_entry, this makes its size larger than
>   it could be, but still smaller than it was before
>
> Memtable-specific:
>
> - when the new entry is inserted into tree iterators _might_ get
>   invalidated by double-decker inner array. This is easy to check
>   when it happens, so the invalidation is avoided when possible
>
> - the size_in_allocator_without_rows() is now not very precise. This
>   is because after the patch memtable_entries are not allocated
>   individually as they used to. They can be squashed together with
>   those having token conflict and asking allocator for the occupied
>   memory slot is not possible. As the closest (lower) estimate the
>   size of enclosing B+ data node is used
>
> - the ::slice() call wants const iterators to work with, but new
>   collections do not have it (yet), so a const_cast there :(
>
> Signed-off-by: Pavel Emelyanov <xe...@scylladb.com <mailto:xe...@scylladb.com>>
Buggy rebase :( The evict_entry() call is supposed to be here.

Pavel Emelyanov

<xemul@scylladb.com>
unread,
May 18, 2020, 4:28:13 AM5/18/20
to Raphael S. Carvalho, scylladb-dev
This very function is to satisfy the B+-tree's AVX searcher's template constraint
from patch #7, this one (I've removed unrelated lines):

+template <typename Key, typename Less, int Size>
+struct searcher<Key, int64_t, Less, Size, key_search::linear> {
+ template <typename K = Key>
+ GCC6_CONCEPT( requires requires (const K& key) { { key.int64_key() } -> int64_t } )
+ static int64_t to_int64_key(const K& k) { return k.int64_key(); }
+
+ static int gt(const Key& k, const maybe_key<int64_t>* keys, int nr, Less) {
+ return utils::array_search_gt(to_int64_key(k), reinterpret_cast<const int64_t*>(keys), Size, nr);
+ }
+};

the searcher wants a "random" key, the one used in heterogeneous lookup, to be
explicitly convertible to an int64 value which, in turn, is again a _key_ in the B+ tree.
So the conversion function is named like this. Having the word "token" in its name
would be confusing.
We had a quick discussion of this with Botond; he also suggested having
using value_type = int64_t in dht::token. But looking at the git blame of
dht/token.hh gives the impression that it had been recently patched to have
this very int64_t around, not anything else.

-- Pavel

Pavel Emelyanov

<xemul@scylladb.com>
unread,
May 18, 2020, 4:29:00 AM5/18/20
to Benny Halevy, scylladb-dev@googlegroups.com


On 14.05.2020 14:48, Benny Halevy wrote:
> nit: what's the motivation?
> should be documented in the commit log.

It's to be used in new perf tests. Will add in next re-spin.

-- Pavel

Pavel Emelyanov

<xemul@scylladb.com>
unread,
May 18, 2020, 5:41:23 AM5/18/20
to Tomasz Grabiec, scylladb-dev


On 15.05.2020 20:02, Tomasz Grabiec wrote:
> Signed-off-by: Pavel Emelyanov <xe...@scylladb.com <mailto:xe...@scylladb.com>>
This is to make dummy entry appear as the last entry in the B+. In current
code this is done with this "fix":

dht::ring_position_view cache_entry::position() const {
    if (is_dummy_entry()) {
        return dht::ring_position_view::max();
    }
    return _key;
}

but with B+ this needs to be done on the int64_t level. Maybe it's too big a hammer indeed.

-- Pavel
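To illustrate the point with a hypothetical sketch (the function name and shape are my assumptions, not the actual patch): on the int64_t level the dummy entry needs a key that sorts after every real token, e.g.:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical illustration: pick the B+tree key so that the dummy entry
// compares greater-or-equal to any real token's raw value and thus lands
// at the end of the tree's ordering.
inline int64_t tree_key_for(bool is_dummy_entry, int64_t raw_token) {
    return is_dummy_entry ? std::numeric_limits<int64_t>::max()
                          : raw_token;
}
```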

Tomasz Grabiec

<tgrabiec@scylladb.com>
unread,
May 18, 2020, 8:31:48 AM5/18/20
to Pavel Emelyanov, scylladb-dev
On Mon, May 18, 2020 at 10:17 AM Pavel Emelyanov <xe...@scylladb.com> wrote:

>       void rows_entry::on_evicted(cache_tracker& tracker) noexcept {
>     diff --git a/test/perf/memory_footprint_test.cc b/test/perf/memory_footprint_test.cc
>     index 465b0169d..14f6a64f2 100644
>     --- a/test/perf/memory_footprint_test.cc
>     +++ b/test/perf/memory_footprint_test.cc
>     @@ -56,11 +56,12 @@ class size_calculator {
>       public:
>           static void print_cache_entry_size() {
>               std::cout << prefix() << "sizeof(cache_entry) = " << sizeof(cache_entry) << "\n";
>     +        std::cout << prefix() << "sizeof(bptree::node) = " << sizeof(row_cache::partitions_type::outer_tree::node) << "\n";
>     +        std::cout << prefix() << "sizeof(bptree::data) = " << sizeof(row_cache::partitions_type::outer_tree::data) << "\n";
>
>
> Unrelated changes

I believe they are related, as the memory footprint changes after this patch specifically
because we no longer have a sole cache_entry allocation, but also all the bptree:: stuff,
and knowing their sizes helps in finding out why it changes the way it does.

Or did you mean that there should be a separate two-line patch with this hunk?

Yes, I would expect this to go in a separate commit in the series. What you wrote above would be a nice commit log message explaining the motivation.

We try to have individual commits achieve one logical thing. If the "thing" here is adapting cache to use bptree, then this change is not essential. It has merit after the patch, so it belongs to the series.

Raphael S. Carvalho

<raphaelsc@scylladb.com>
unread,
May 18, 2020, 11:25:48 AM5/18/20
to Pavel Emelyanov, scylladb-dev
Isn't it better to make this a standalone function then? Like token_to_int64_key()?

This is a concept that belongs outside the class. So better to not pollute it with things that lie outside. 

Pavel Emelyanov

<xemul@scylladb.com>
unread,
May 18, 2020, 11:44:35 AM5/18/20
to Raphael S. Carvalho, scylladb-dev


On 18.05.2020 18:25, Raphael S. Carvalho wrote:
>
>
> On Mon, May 18, 2020 at 5:28 AM Pavel Emelyanov <xe...@scylladb.com <mailto:xe...@scylladb.com>> wrote:
>
> On 15.05.2020 20:19, Raphael S. Carvalho wrote:
> >> Signed-off-by: Pavel Emelyanov <xe...@scylladb.com <mailto:xe...@scylladb.com>>
Won't it result in a token_to_int64_key() function for every Key that needs it anyway?

However... maybe templatizing the introduced to_int64_key like

template <typename K = Key>
requires requires (const K& key) { { key.token() } -> dht::token }
static int64_t to_int64_key(const K& k) {
    return k.token().raw();
}

?

Raphael S. Carvalho

<raphaelsc@scylladb.com>
unread,
May 18, 2020, 11:49:01 AM5/18/20
to Pavel Emelyanov, scylladb-dev
My thought was to make it static so that it could be inlined.
 

However... maybe templatizing the introduced to_int64_key like

template <typename K = Key>
requires requires (const K& key) { { key.token() } -> dht::token }
static int64_t to_int64_key(const K& k) {
    return k.token().raw();
}

?

Yes, I think that could work well.

Pavel Emelyanov

<xemul@scylladb.com>
unread,
May 19, 2020, 9:14:58 AM5/19/20
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The data model is now

bplus::tree<Key = int64_t, T = array<entry>>

where entry can be cache_entry or memtable_entry.

The whole thing is encapsulated into a collection called "double_decker"
from patch #3. The array<T> is an array of T-s with 0-bytes overhead used
to resolve hash conflicts (patch #2).

Changes in v5:

- less intrusive conversion from token to int64_t value (patch #6)
- no-throwing double_decker.erase()

it came unnoticed in the previous set, but erasing from the double-decker
could throw because of a potential reallocation of the underlying
array of elements (not triggered without hash conflicts). Fixed this
by making it possible to drop an element "in place" from the array at the
cost of carrying the unused memory slot until the next insertion (just
like vector) and one more bit on the element

- double_decker range erase for row_cache::clear_now (the previous trick
could throw while it shouldn't)

- lost memtable unaccount on eviction
- use double-decker insertion hint in memtable's find_or_create_partition
- no SIMD lookup (postponed for later)
- comments from previous review

Changes in v4:

- SIMD lookup
- switched memtable on B+
- double-decker doesn't carry comparator on board

This is because the comparator keeps a reference to the schema and it's
not that easy to prevent use-after-free. Since
we have the schema in all the calls anyway, we can use it (as the current
code does)

- B+ find/lower_bound do not init empty tree

If a lookup happens without the proper with_allocator() call, the tree
root gets allocated and freed in different allocators

- Review comments

Changes in v3:

- replace managed_vector with array
- use int64_t as tree key instead of dht::token
- simple tune-up of insertion algo for better packing (15% less inner nodes)
- optimize bplus::tree::erase(iterator)

branch: https://github.com/xemul/scylla/commits/row-cache-over-bptree-5
tests:
unit(debug) for new collections, memtable and row_cache
unit(dev) for the rest

Pavel Emelyanov (9):
core: B+ tree implementation
utils: Array with trusted bounds
double-decker: A combination of B+tree with array
memtable: Count partitions separately
test: Move perf measurement helpers into header
token: Introduce raw() helper and raw comparator
row_cache: Switch partition tree onto B+ rails
memtable: Switch onto B+ rails
test: Print more sizes in memory_footprint_test

configure.py | 10 +
dht/token.hh | 11 +
memtable.hh | 63 +-
row_cache.hh | 53 +-
test/perf/perf.hh | 71 +
test/unit/bptree_key.hh | 101 ++
test/unit/bptree_validation.hh | 318 ++++
utils/array_trusted_bounds.hh | 326 ++++
utils/bptree.hh | 1880 +++++++++++++++++++++++
utils/double-decker.hh | 365 +++++
dht/token.cc | 16 +-
memtable.cc | 79 +-
row_cache.cc | 114 +-
test/boost/array_trusted_bounds_test.cc | 243 +++
test/boost/bptree_test.cc | 344 +++++
test/boost/double_decker_test.cc | 386 +++++
test/perf/memory_footprint_test.cc | 4 +-
test/perf/perf_bptree.cc | 246 +++
test/perf/perf_row_cache_update.cc | 71 -
test/unit/bptree_compaction_test.cc | 210 +++
test/unit/bptree_stress_test.cc | 236 +++
21 files changed, 4940 insertions(+), 207 deletions(-)
create mode 100644 test/unit/bptree_key.hh
create mode 100644 test/unit/bptree_validation.hh
create mode 100644 utils/array_trusted_bounds.hh
create mode 100644 utils/bptree.hh
create mode 100644 utils/double-decker.hh
create mode 100644 test/boost/array_trusted_bounds_test.cc
create mode 100644 test/boost/bptree_test.cc
create mode 100644 test/boost/double_decker_test.cc
create mode 100644 test/perf/perf_bptree.cc
create mode 100644 test/unit/bptree_compaction_test.cc
create mode 100644 test/unit/bptree_stress_test.cc

--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>
unread,
May 19, 2020, 9:15:02 AM5/19/20
to scylladb-dev@googlegroups.com, Pavel Emelyanov
A plain array of elements that grows and shrinks by
constructing a new instance from an existing one and
moving the elements from it.

Behaves similarly to vector's external array, but has
0-bytes overhead. The array bounds (0-th and N-th
elements) are determined by checking the flags on the
elements themselves. For this the type must support
getters and setters for the flags.

To remove an element from the array there's also a nothrow
option that drops the requested element, shifts the
ones to its right left, and keeps the trailing
unused memory (the so-called "train") until reconstruction
or destruction.

Also comes with a lower_bound() helper that helps keep
the elements sorted, and a from_element() one that
returns a reference back to the array in which the element
sits.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
configure.py | 1 +
utils/array_trusted_bounds.hh | 326 ++++++++++++++++++++++++
test/boost/array_trusted_bounds_test.cc | 243 ++++++++++++++++++
3 files changed, 570 insertions(+)
create mode 100644 utils/array_trusted_bounds.hh
create mode 100644 test/boost/array_trusted_bounds_test.cc

diff --git a/configure.py b/configure.py
index 262597d60..393426fc0 100755
--- a/configure.py
+++ b/configure.py
@@ -328,6 +328,7 @@ scylla_tests = set([
'test/boost/log_heap_test',
'test/boost/logalloc_test',
'test/boost/managed_vector_test',
+ 'test/boost/array_trusted_bounds_test',
'test/boost/map_difference_test',
'test/boost/memtable_test',
'test/boost/meta_test',
diff --git a/utils/array_trusted_bounds.hh b/utils/array_trusted_bounds.hh
new file mode 100644
index 000000000..074d02a59
--- /dev/null
+++ b/utils/array_trusted_bounds.hh
@@ -0,0 +1,326 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+#include <array>
+#include <cassert>
+#include <seastar/util/concepts.hh>
+
+#include "utils/allocation_strategy.hh"
+
+SEASTAR_CONCEPT(
+ template <typename T>
+ concept bool BoundsKeeper = requires (T val, bool bit) {
+ { val.is_head() } -> bool;
+ { val.set_head(bit) } -> void;
+ { val.is_tail() } -> bool;
+ { val.set_tail(bit) } -> void;
+ { val.with_train() } -> bool;
+ { val.set_train(bit) } -> void;
+ };
+
+ template <typename K, typename T, typename Compare>
+ concept bool Comparable = requires (const K& a, const T& b, Compare cmp) {
+ { cmp(a, b) } -> int;
+ };
+)
+
+/*
+ * A plain array of T-s that grows and shrinks by constructing new
+ * instances. Holds at least one element. Has facilities for sorting
+ * the elements and for doing "container_of" by the given element
+ * pointer. LSA-compactible.
+ *
+ * An important feature of the array is zero memory overhead -- it doesn't
+ * keep its size/capacity onboard. The size is calculated each time by
+ * walking the array of T-s and checking which one of them is the tail
+ * element. Respectively, the T must keep head/tail flags on itself.
+ */
+template <typename T>
+SEASTAR_CONCEPT( requires BoundsKeeper<T> && std::is_nothrow_move_constructible_v<T> )
+class array_trusted_bounds {
+ union maybe_constructed {
+ maybe_constructed() { }
+ ~maybe_constructed() { }
+ T object;
+
+ /*
+ * Train is 1 or more allocated but unoccupied memory slots after
+ * the tail one. Being unused, this memory keeps the train length.
+ * An array with the train is marked with the respective flag on
+ * the 0th element. Train is created by the erase() call and can
+ * be up to 65535 elements long
+ *
+ * Train length is included into the storage_size() to make
+ * allocator and compaction work correctly, but is not included
+ * into the number_of_elements(), so the array behaves just like
+ * there's no train
+ *
+ * Respectively both grow and shrink constructors do not carry
+ * the train (and drop the bit from 0th element) and don't expect
+ * the memory for the new array to include one
+ */
+ unsigned short train_len;
+ static_assert(sizeof(T) >= sizeof(unsigned short));
+ };
+
+ maybe_constructed _data[1];
+
+ size_t number_of_elements() const {
+ for (int i = 0; ; i++) {
+ if (_data[i].object.is_tail()) {
+ return i + 1;
+ }
+ }
+
+ std::abort();
+ }
+
+ size_t storage_size() const {
+ size_t nr = number_of_elements();
+ if (_data[0].object.with_train()) {
+ nr += _data[nr].train_len;
+ }
+ return nr * sizeof(T);
+ }
+
+public:
+ using iterator = T*;
+
+ /*
+ * There are 3 constructing options for the array: initial, grow
+ * and shrink.
+ *
+ * * initial just creates a 1-element array
+ * * grow -- makes a new one moving all elements from the original
+ * array and inserting the one (only one) more element at the given
+ * position
+ * * shrink -- also makes a new array, skipping the unneeded
+ * element while moving the rest from the original one
+ *
+ * In all cases a big enough memory chunk must be provided by the
+ * caller!
+ *
+ * Note that none of them calls destructors on T-s, unlike vector.
+ * This is because when the old array is destroyed it has no idea
+ * whether or not it was grown/shrunk and thus it destroys the
+ * T-s itself.
+ */
+
+ // Initial
+ template <typename... Args>
+ array_trusted_bounds(Args&&... args) {
+ new (&_data[0].object) T(std::forward<Args>(args)...);
+ _data[0].object.set_head(true);
+ _data[0].object.set_tail(true);
+ }
+
+ // Growing
+ struct grow_tag {
+ int add_pos;
+ };
+
+ template <typename... Args>
+ array_trusted_bounds(array_trusted_bounds& from, grow_tag grow, Args&&... args) {
+ // The add_pos is strongly _expected_ to be within bounds
+ int i, off = 0;
+ bool tail = false;
+
+ for (i = 0; !tail; i++) {
+ if (i == grow.add_pos) {
+ off = 1;
+ continue;
+ }
+
+ tail = from._data[i - off].object.is_tail();
+ new (&_data[i].object) T(std::move(from._data[i - off].object));
+ }
+
+ assert(grow.add_pos <= i);
+
+ new (&_data[grow.add_pos].object) T(std::forward<Args>(args)...);
+
+ _data[0].object.set_head(true);
+ _data[0].object.set_train(false);
+ if (grow.add_pos == 0) {
+ _data[1].object.set_head(false);
+ }
+ _data[i - off].object.set_tail(true);
+ if (off == 0) {
+ _data[i - 1].object.set_tail(false);
+ }
+ }
+
+ // Shrinking
+ struct shrink_tag {
+ int del_pos;
+ };
+
+ array_trusted_bounds(array_trusted_bounds& from, shrink_tag shrink) {
+ int i, off = 0;
+ bool tail = false;
+
+ for (i = 0; !tail; i++) {
+ tail = from._data[i].object.is_tail();
+ if (i == shrink.del_pos) {
+ off = 1;
+ } else {
+ new (&_data[i - off].object) T(std::move(from._data[i].object));
+ }
+ }
+
+ _data[0].object.set_head(true);
+ _data[0].object.set_train(false);
+ _data[i - off - 1].object.set_tail(true);
+ }
+
+ array_trusted_bounds(const array_trusted_bounds& other) = delete;
+ array_trusted_bounds(array_trusted_bounds&& other) noexcept {
+ bool tail = false;
+ int i;
+
+ for (i = 0; !tail; i++) {
+ tail = other._data[i].object.is_tail();
+
+ new (&_data[i].object) T(std::move(other._data[i].object));
+ }
+
+ if (_data[0].object.with_train()) {
+ _data[i].train_len = other._data[i].train_len;
+ }
+ }
+
+ ~array_trusted_bounds() {
+ bool tail = false;
+
+ for (int i = 0; !tail; i++) {
+ tail = _data[i].object.is_tail();
+ _data[i].object.~T();
+ }
+ }
+
+ /*
+ * Drops the element in-place at position @pos and grows the
+ * "train". To be used in places where reconstruction is not
+ * welcome (e.g. because it throws)
+ *
+ * A single-element array cannot be erased from; just drop it
+ * altogether if needed
+ */
+ void erase(int pos) {
+ assert(!is_single_element());
+
+ bool with_train = _data[0].object.with_train();
+ bool tail = _data[pos].object.is_tail();
+ _data[pos].object.~T();
+
+ if (tail) {
+ assert(pos > 0);
+ _data[pos - 1].object.set_tail(true);
+ } else {
+ while (!tail) {
+ new (&_data[pos].object) T(std::move(_data[pos + 1].object));
+ _data[pos + 1].object.~T();
+ tail = _data[pos++].object.is_tail();
+ }
+ _data[0].object.set_head(true);
+ }
+
+ _data[0].object.set_train(true);
+ _data[pos].train_len = (with_train ? _data[pos + 1].train_len + 1 : 1);
+ }
+
+ T& operator[](int pos) noexcept { return _data[pos].object; }
+ const T& operator[](int pos) const noexcept { return _data[pos].object; }
+
+ iterator begin() noexcept { return &_data[0].object; }
+ iterator end() noexcept { return &_data[number_of_elements()].object; }
+
+ size_t index_of(iterator i) const { return i - &_data[0].object; }
+ bool is_single_element() const { return _data[0].object.is_tail(); }
+
+ // A helper for keeping the array sorted
+ template <typename K, typename Compare>
+ SEASTAR_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator lower_bound(const K& val, Compare cmp, bool& match) {
+ int i = 0;
+
+ do {
+ int x = cmp(_data[i].object, val);
+ if (x >= 0) {
+ match = (x == 0);
+ break;
+ }
+ } while (!_data[i++].object.is_tail());
+
+ return &_data[i].object;
+ }
+
+ template <typename K, typename Compare>
+ SEASTAR_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator lower_bound(const K& val, Compare cmp) {
+ bool match = false;
+ return lower_bound(val, cmp, match);
+ }
+
+ // And its upper_bound() peer, kept for completeness
+ template <typename K, typename Compare>
+ SEASTAR_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator upper_bound(const K& val, Compare cmp) {
+ int i = 0;
+
+ do {
+ if (cmp(_data[i].object, val) > 0) {
+ break;
+ }
+ } while (!_data[i++].object.is_tail());
+
+ return &_data[i].object;
+ }
+
+ template <typename Func>
+ SEASTAR_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ void for_each(Func&& fn) {
+ bool tail = false;
+
+ for (int i = 0; !tail; i++) {
+ tail = _data[i].object.is_tail();
+ fn(_data[i].object);
+ }
+ }
+
+ size_t size() { return number_of_elements(); }
+
+ friend size_t size_for_allocation_strategy(const array_trusted_bounds& obj) {
+ return obj.storage_size();
+ }
+
+ static array_trusted_bounds& from_element(T* ptr, int& idx) {
+ while (!ptr->is_head()) {
+ idx++;
+ ptr--;
+ }
+
+ static_assert(offsetof(array_trusted_bounds, _data[0].object) == 0);
+ return *reinterpret_cast<array_trusted_bounds*>(ptr);
+ }
+};
diff --git a/test/boost/array_trusted_bounds_test.cc b/test/boost/array_trusted_bounds_test.cc
new file mode 100644
index 000000000..afffe4477
--- /dev/null
+++ b/test/boost/array_trusted_bounds_test.cc
@@ -0,0 +1,243 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <boost/test/unit_test.hpp>
+#include <seastar/testing/thread_test_case.hh>
+#include <fmt/core.h>
+
+#include "utils/array_trusted_bounds.hh"
+#include "utils/logalloc.hh"
+
+class element {
+ bool _head = false;
+ bool _tail = false;
+ bool _train = false;
+
+ long _data;
+ int *_cookie;
+ int *_cookie2;
+
+public:
+ explicit element(long val) : _data(val), _cookie(new int(0)), _cookie2(new int(0)) { }
+
+ element(const element& other) = delete;
+ element(element&& other) noexcept : _head(other._head), _tail(other._tail), _train(other._train),
+ _data(other._data), _cookie(other._cookie), _cookie2(new int(0)) {
+ other._cookie = nullptr;
+ }
+
+ ~element() {
+ if (_cookie != nullptr) {
+ delete _cookie;
+ }
+
+ delete _cookie2;
+ }
+
+ bool is_head() const { return _head; }
+ void set_head(bool v) { _head = v; }
+ bool is_tail() const { return _tail; }
+ void set_tail(bool v) { _tail = v; }
+ bool with_train() const { return _train; }
+ void set_train(bool v) { _train = v; }
+
+ bool operator==(long v) const { return v == _data; }
+ long operator*() const { return _data; }
+
+ bool bound_check(int idx, int size) {
+ return ((idx == 0) == is_head()) && ((idx == size - 1) == is_tail());
+ }
+};
+
+using test_array = array_trusted_bounds<element>;
+
+static bool size_check(test_array& a, size_t size, unsigned short tlen) {
+ return a[size - 1].is_tail() && a.size() == size &&
+ size_for_allocation_strategy(a) == (size + tlen) * sizeof(element) &&
+ ((tlen != 0) == a[0].with_train()) &&
+ ((tlen == 0) || *reinterpret_cast<unsigned short*>(&a[size]) == tlen);
+}
+
+void show(const char *pfx, test_array& a, int sz) {
+ int i;
+
+ fmt::print("{}", pfx);
+ for (i = 0; i < sz; i++) {
+ fmt::print("{}{}{}", a[i].is_head() ? 'H' : ' ', *a[i], a[i].is_tail() ? 'T' : ' ');
+ }
+ if (a[0].with_train()) {
+ fmt::print(" ~{}", *reinterpret_cast<unsigned short *>(&a[i]));
+ }
+ fmt::print("\n");
+}
+
+SEASTAR_THREAD_TEST_CASE(test_basic_construct) {
+ test_array array(12);
+
+ for (auto i = array.begin(); i != array.end(); i++) {
+ BOOST_REQUIRE(*i == 12);
+ }
+}
+
+test_array* grow(test_array& from, size_t nsize, int npos, long ndat) {
+ BOOST_REQUIRE(from.size() + 1 == nsize);
+ auto ptr = current_allocator().alloc(&get_standard_migrator<test_array>(), sizeof(element) * nsize, alignof(test_array));
+ return new (ptr) test_array(from, test_array::grow_tag{npos}, ndat);
+}
+
+test_array* shrink(test_array& from, size_t nsize, int spos) {
+ BOOST_REQUIRE(from.size() - 1 == nsize);
+ auto ptr = current_allocator().alloc(&get_standard_migrator<test_array>(), sizeof(element) * nsize, alignof(test_array));
+ return new (ptr) test_array(from, test_array::shrink_tag{spos});
+}
+
+void grow_shrink_and_check(test_array& cur, int size, int depth) {
+ for (int i = 0; i <= size; i++) {
+ long nel = size + 12;
+ test_array* narr = grow(cur, size + 1, i, nel);
+ int idx = 0;
+
+ BOOST_REQUIRE(size_check(*narr, size + 1, 0));
+
+ for (auto ni = narr->begin(); ni != narr->end(); ni++) {
+ if (idx == i) {
+ BOOST_REQUIRE(*ni == nel);
+ } else if (idx < i) {
+ BOOST_REQUIRE(*ni == *cur[idx]);
+ } else {
+ BOOST_REQUIRE(*ni == *cur[idx - 1]);
+ }
+
+ BOOST_REQUIRE(ni->bound_check(idx, size + 1));
+ idx++;
+ }
+
+ if (size < depth) {
+ grow_shrink_and_check(*narr, size + 1, depth);
+ }
+
+ current_allocator().destroy(narr);
+ }
+
+ if (size > 1) {
+ for (int i = 0; i < size; i++) {
+ test_array* narr = shrink(cur, size - 1, i);
+ int idx = 0;
+
+ BOOST_REQUIRE(size_check(*narr, size - 1, 0));
+
+ for (auto ni = narr->begin(); ni != narr->end(); ni++) {
+ if (idx == i) {
+ continue;
+ } else if (idx < i) {
+ BOOST_REQUIRE(*ni == *cur[idx]);
+ } else {
+ BOOST_REQUIRE(*ni == *cur[idx + 1]);
+ }
+
+ BOOST_REQUIRE(ni->bound_check(idx, size - 1));
+ idx++;
+ }
+
+ current_allocator().destroy(narr);
+ }
+ }
+}
+
+SEASTAR_THREAD_TEST_CASE(test_grow_shrink_construct) {
+ test_array array(12);
+ grow_shrink_and_check(array, 1, 5);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_erase) {
+ test_array a1(10);
+ test_array *a2 = grow(a1, 2, 1, 20);
+ test_array *a3 = grow(*a2, 3, 2, 30);
+
+ for (int i = 0; i < 4; i++) {
+ for (int j = 0; j < 3; j++) {
+ for (int k = 0; k < 2; k++) {
+ std::vector<int> x({10, 20, 30, 40});
+ test_array *a4 = grow(*a3, 4, 3, 40);
+
+ auto test_fn = [&] (int idx, int sz) {
+ a4->erase(idx);
+ x.erase(x.begin() + idx);
+ BOOST_REQUIRE(size_check(*a4, sz, 4 - sz));
+ for (int a = 0; a < sz; a++) {
+ BOOST_REQUIRE(x[a] == *(*a4)[a]);
+ }
+ };
+
+ test_fn(i, 3);
+ test_fn(j, 2);
+ test_fn(k, 1);
+
+ current_allocator().destroy(a4);
+ }
+ }
+ }
+
+ current_allocator().destroy(a3);
+ current_allocator().destroy(a2);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_lower_bound) {
+ test_array a1(12);
+ struct compare {
+ int operator()(const element& a, const element& b) const { return *a - *b; }
+ };
+
+ test_array *a2 = grow(a1, 2, 1, 14);
+
+ auto i = a2->lower_bound(element(13), compare{});
+ BOOST_REQUIRE(*i == 14 && a2->index_of(i) == 1);
+
+ test_array *a3 = grow(*a2, 3, 2, 17);
+
+ bool match;
+ BOOST_REQUIRE(*a3->lower_bound(element(11), compare{}, match) == 12 && !match);
+ BOOST_REQUIRE(*a3->lower_bound(element(12), compare{}, match) == 12 && match);
+ BOOST_REQUIRE(*a3->lower_bound(element(13), compare{}, match) == 14 && !match);
+ BOOST_REQUIRE(*a3->lower_bound(element(14), compare{}, match) == 14 && match);
+ BOOST_REQUIRE(*a3->lower_bound(element(15), compare{}, match) == 17 && !match);
+ BOOST_REQUIRE(*a3->lower_bound(element(16), compare{}, match) == 17 && !match);
+ BOOST_REQUIRE(*a3->lower_bound(element(17), compare{}, match) == 17 && match);
+ BOOST_REQUIRE(a3->lower_bound(element(18), compare{}, match) == a3->end());
+
+ current_allocator().destroy(a3);
+ current_allocator().destroy(a2);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_from_element) {
+ test_array a1(12);
+ test_array *a2 = grow(a1, 2, 1, 14);
+ test_array *a3 = grow(*a2, 3, 2, 17);
+
+ element* i = &((*a3)[2]);
+ BOOST_REQUIRE(*i == 17);
+ int idx = 0;
+ test_array& x = test_array::from_element(i, idx);
+ BOOST_REQUIRE(&x == a3 && idx == 2);
+
+ current_allocator().destroy(a3);
+ current_allocator().destroy(a2);
+}
--
2.20.1

Pavel Emelyanov <xemul@scylladb.com>
May 19, 2020, 9:15:05 AM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
So that the code can be used by the new perf tests in next patches.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.

Pavel Emelyanov <xemul@scylladb.com>
May 19, 2020, 9:15:06 AM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
In the next patches the entries carrying a token on-board will be
moved onto B+-tree rails. For this, the int64_t value of the token
will be used as the B+ key, so prepare for that here.

One corner case -- the after_all_keys tokens must be resolved to
the int64::max value to appear at the "end" of the tree. This is
not the same as the "before_all_keys" case, which maps to the
int64::min value that is not allowed for regular tokens. But for
the sake of the B+ switch this is OK: conflicts between raw token
values are explicitly resolved in the next patches.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
dht/token.hh | 11 +++++++++++
dht/token.cc | 16 +++++++++-------
2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/dht/token.hh b/dht/token.hh
index 439b6b45e..6c9e86e5c 100644
--- a/dht/token.hh
+++ b/dht/token.hh
@@ -160,10 +160,21 @@ class token {
return 0; // hardcoded for now; unlikely to change
}

+ int64_t raw() const {
+ if (is_minimum()) {
+ return std::numeric_limits<int64_t>::min();
+ }
+ if (is_maximum()) {
+ return std::numeric_limits<int64_t>::max();
+ }
+
+ return _data;
+ }
};

const token& minimum_token();
const token& maximum_token();
+int tri_compare_raw(const int64_t t1, const int64_t t2);
int tri_compare(const token& t1, const token& t2);
inline bool operator==(const token& t1, const token& t2) { return tri_compare(t1, t2) == 0; }
inline bool operator<(const token& t1, const token& t2) { return tri_compare(t1, t2) < 0; }
diff --git a/dht/token.cc b/dht/token.cc
index 96de48607..1154d4343 100644
--- a/dht/token.cc
+++ b/dht/token.cc
@@ -53,19 +53,21 @@ maximum_token() {
return max_token;
}

+int tri_compare_raw(const int64_t l1, const int64_t l2) {
+ if (l1 == l2) {
+ return 0;
+ } else {
+ return l1 < l2 ? -1 : 1;
+ }
+}
+
int tri_compare(const token& t1, const token& t2) {
if (t1._kind < t2._kind) {
return -1;
} else if (t1._kind > t2._kind) {
return 1;
} else if (t1._kind == token_kind::key) {
- auto l1 = long_token(t1);
- auto l2 = long_token(t2);
- if (l1 == l2) {
- return 0;
- } else {
- return l1 < l2 ? -1 : 1;
- }
+ return tri_compare_raw(long_token(t1), long_token(t2));
}
return 0;
}
--
2.20.1

Pavel Emelyanov <xemul@scylladb.com>
May 19, 2020, 9:15:08 AM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The row_cache::partitions_type is switched from boost::intrusive::set
to bplus::tree<Key = int64_t, T = array_trusted_bounds<cache_entry>>

The token is used to quickly locate the partition, and the internal
array resolves hashing conflicts.

Summary of changes in cache_entry:

- compare's operator() reports int instead of bool; this makes
comparing entries in the array simpler and sometimes takes fewer steps

- token_compare is added that compares raw tokens in a less-than manner

- on initialization the dummy entry is added with the "after_all_keys"
kind, not the default "before_all_keys". This keeps the tree
entries sorted by token

- insertion and removal of cache_entries happen inside double_decker;
most of the changes in row_cache.cc are about passing constructor args
from current_allocator.construct into double_decker.emplace_before()

- the _flags is extended to keep the array head/tail bits. There's room
for it, so sizeof(cache_entry) remains unchanged

The rest fits smoothly into the double_decker API.

Also, as noted in the previous patch, insertion and removal _may_
invalidate iterators, but may also leave them intact. Currently
this doesn't seem to be a problem, as cache_tracker::insert() and
::on_partition_erase invalidate iterators unconditionally.

Later this can be optimized: double-decker invalidates iterators
only in the case of a hash conflict; otherwise it doesn't change
the arrays, and the B+ tree doesn't invalidate its iterators.

tests: unit(dev), perf(dev)

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
row_cache.hh | 52 ++++++++++-----
row_cache.cc | 100 +++++++++++++----------------
test/perf/memory_footprint_test.cc | 1 -
3 files changed, 81 insertions(+), 72 deletions(-)

diff --git a/row_cache.hh b/row_cache.hh
index 3dd90fac4..fc222c57a 100644
--- a/row_cache.hh
+++ b/row_cache.hh
@@ -40,6 +40,7 @@
#include <seastar/core/metrics_registration.hh>
#include "flat_mutation_reader.hh"
#include "mutation_cleaner.hh"
+#include "utils/double-decker.hh"

namespace bi = boost::intrusive;

@@ -61,11 +62,6 @@ class lsa_manager;
//
// TODO: Make memtables use this format too.
class cache_entry {
- // We need auto_unlink<> option on the _cache_link because when entry is
- // evicted from cache via LRU we don't have a reference to the container
- // and don't want to store it with each entry.
- using cache_link_type = bi::set_member_hook<bi::link_mode<bi::auto_unlink>>;
-
schema_ptr _schema;
dht::decorated_key _key;
partition_entry _pe;
@@ -73,8 +69,10 @@ class cache_entry {
struct {
bool _continuous : 1;
bool _dummy_entry : 1;
+ bool _head : 1;
+ bool _tail : 1;
+ bool _train : 1;
} _flags{};
- cache_link_type _cache_link;
friend class size_calculator;

flat_mutation_reader do_read(row_cache&, cache::read_context& reader);
@@ -82,6 +80,13 @@ class cache_entry {
friend class row_cache;
friend class cache_tracker;

+ bool is_head() const { return _flags._head; }
+ void set_head(bool v) { _flags._head = v; }
+ bool is_tail() const { return _flags._tail; }
+ void set_tail(bool v) { _flags._tail = v; }
+ bool with_train() const { return _flags._train; }
+ void set_train(bool v) { _flags._train = v; }
+
struct dummy_entry_tag{};
struct incomplete_tag{};
struct evictable_tag{};
@@ -148,34 +153,48 @@ class cache_entry {

bool is_dummy_entry() const { return _flags._dummy_entry; }

+ struct token_compare {
+ bool operator()(const int64_t k1, const int64_t k2) const {
+ return dht::tri_compare_raw(k1, k2) < 0;
+ }
+
+ bool operator()(const dht::ring_position_view& k1, const int64_t k2) const {
+ return dht::tri_compare_raw(k1.token().raw(), k2) < 0;
+ }
+
+ bool operator()(const int64_t k1, const dht::ring_position_view& k2) const {
@@ -315,10 +334,9 @@ void cache_tracker::insert(partition_entry& pe) noexcept {
class row_cache final {
public:
using phase_type = utils::phased_barrier::phase_type;
- using partitions_type = bi::set<cache_entry,
- bi::member_hook<cache_entry, cache_entry::cache_link_type, &cache_entry::_cache_link>,
- bi::constant_time_size<false>, // we need this to have bi::auto_unlink on hooks
- bi::compare<cache_entry::compare>>;
+ using partitions_type = double_decker<int64_t, cache_entry,
+ cache_entry::token_compare, cache_entry::compare,
+ 16, bplus::key_search::linear>;
friend class cache::autoupdating_underlying_reader;
friend class single_partition_populating_reader;
friend class cache_entry;
diff --git a/row_cache.cc b/row_cache.cc
index fc42ead10..3f76494ed 100644
@@ -781,22 +781,20 @@ row_cache::make_reader(schema_ptr s,

row_cache::~row_cache() {
with_allocator(_tracker.allocator(), [this] {
- _partitions.clear_and_dispose([this, deleter = current_deleter<cache_entry>()] (auto&& p) mutable {
- if (!p->is_dummy_entry()) {
+ _partitions.clear_and_dispose([this] (cache_entry& p) mutable {
+ if (!p.is_dummy_entry()) {
_tracker.on_partition_erase();
}
- p->evict(_tracker);
- deleter(p);
+ p.evict(_tracker);
});
});
}

void row_cache::clear_now() noexcept {
with_allocator(_tracker.allocator(), [this] {
- auto it = _partitions.erase_and_dispose(_partitions.begin(), partitions_end(), [this, deleter = current_deleter<cache_entry>()] (auto&& p) mutable {
+ auto it = _partitions.erase_and_dispose(_partitions.begin(), partitions_end(), [this] (cache_entry& p) {
_tracker.on_partition_erase();
- p->evict(_tracker);
- deleter(p);
+ p.evict(_tracker);
});
_tracker.clear_continuity(*it);
});
@@ -812,9 +810,11 @@ cache_entry& row_cache::do_find_or_create_entry(const dht::decorated_key& key,
{
return with_allocator(_tracker.allocator(), [&] () -> cache_entry& {
return with_linearized_managed_bytes([&] () -> cache_entry& {
- auto i = _partitions.lower_bound(key, cache_entry::compare(_schema));
- if (i == _partitions.end() || !i->key().equal(*_schema, key)) {
- i = create_entry(i);
+ partitions_type::bound_hint hint;
+ cache_entry::compare cmp(_schema);
+ auto i = _partitions.lower_bound(key, cmp, hint);
+ if (i == _partitions.end() || !hint.match) {
+ i = create_entry(i, hint);
} else {
visit_entry(i);
}
@@ -837,10 +837,11 @@ cache_entry& row_cache::do_find_or_create_entry(const dht::decorated_key& key,
}

cache_entry& row_cache::find_or_create(const dht::decorated_key& key, tombstone t, row_cache::phase_type phase, const previous_entry_pointer* previous) {
- return do_find_or_create_entry(key, previous, [&] (auto i) { // create
- auto entry = current_allocator().construct<cache_entry>(cache_entry::incomplete_tag{}, _schema, key, t);
+ return do_find_or_create_entry(key, previous, [&] (auto i, const partitions_type::bound_hint& hint) { // create
+ partitions_type::iterator entry = _partitions.emplace_before(i, key.token().raw(), hint,
+ cache_entry::incomplete_tag{}, _schema, key, t);
_tracker.insert(*entry);
- return _partitions.insert_before(i, *entry);
+ return entry;
}, [&] (auto i) { // visit
_tracker.on_miss_already_populated();
cache_entry& e = *i;
@@ -851,14 +852,13 @@ cache_entry& row_cache::find_or_create(const dht::decorated_key& key, tombstone

void row_cache::populate(const mutation& m, const previous_entry_pointer* previous) {
_populate_section(_tracker.region(), [&] {
- do_find_or_create_entry(m.decorated_key(), previous, [&] (auto i) {
- cache_entry* entry = current_allocator().construct<cache_entry>(
+ do_find_or_create_entry(m.decorated_key(), previous, [&] (auto i, const partitions_type::bound_hint& hint) {
+ partitions_type::iterator entry = _partitions.emplace_before(i, m.decorated_key().token().raw(), hint,
m.schema(), m.decorated_key(), m.partition());
_tracker.insert(*entry);
entry->set_continuous(i->continuous());
- i = _partitions.insert_before(i, *entry);
- upgrade_entry(*i);
- return i;
+ upgrade_entry(*entry);
+ return entry;
}, [&] (auto i) {
throw std::runtime_error(format("cache already contains entry for {}", m.key()));
});
@@ -957,8 +957,9 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
_update_section(_tracker.region(), [&] {
memtable_entry& mem_e = *m.partitions.begin();
size_entry = mem_e.size_in_allocator_without_rows(_tracker.allocator());
- auto cache_i = _partitions.lower_bound(mem_e.key(), cmp);
- update = updater(_update_section, cache_i, mem_e, is_present, real_dirty_acc);
+ partitions_type::bound_hint hint;
+ auto cache_i = _partitions.lower_bound(mem_e.key(), cmp, hint);
+ update = updater(_update_section, cache_i, mem_e, is_present, real_dirty_acc, hint);
});
}
// We use cooperative deferring instead of futures so that
@@ -1003,11 +1004,11 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
future<> row_cache::update(external_updater eu, memtable& m) {
return do_update(std::move(eu), m, [this] (logalloc::allocating_section& alloc,
row_cache::partitions_type::iterator cache_i, memtable_entry& mem_e, partition_presence_checker& is_present,
- real_dirty_memory_accounter& acc) mutable {
+ real_dirty_memory_accounter& acc, const partitions_type::bound_hint& hint) mutable {
// If cache doesn't contain the entry we cannot insert it because the mutation may be incomplete.
// FIXME: keep a bitmap indicating which sstables we do cover, so we don't have to
// search it.
- if (cache_i != partitions_end() && cache_i->key().equal(*_schema, mem_e.key())) {
+ if (cache_i != partitions_end() && hint.match) {
cache_entry& entry = *cache_i;
upgrade_entry(entry);
assert(entry._schema == _schema);
@@ -1019,12 +1020,11 @@ future<> row_cache::update(external_updater eu, memtable& m) {
|| with_allocator(standard_allocator(), [&] { return is_present(mem_e.key()); })
== partition_presence_checker_result::definitely_doesnt_exist) {
// Partition is absent in underlying. First, insert a neutral partition entry.
- cache_entry* entry = current_allocator().construct<cache_entry>(cache_entry::evictable_tag(),
- _schema, dht::decorated_key(mem_e.key()),
+ partitions_type::iterator entry = _partitions.emplace_before(cache_i, mem_e.key().token().raw(), hint,
+ cache_entry::evictable_tag(), _schema, dht::decorated_key(mem_e.key()),
partition_entry::make_evictable(*_schema, mutation_partition(_schema)));
entry->set_continuous(cache_i->continuous());
_tracker.insert(*entry);
- _partitions.insert_before(cache_i, *entry);
mem_e.upgrade_schema(_schema, _tracker.memtable_cleaner());
return entry->partition().apply_to_incomplete(*_schema, std::move(mem_e.partition()), _tracker.memtable_cleaner(),
alloc, _tracker.region(), _tracker, _underlying_phase, acc);
@@ -1037,7 +1037,7 @@ future<> row_cache::update(external_updater eu, memtable& m) {
future<> row_cache::update_invalidating(external_updater eu, memtable& m) {
return do_update(std::move(eu), m, [this] (logalloc::allocating_section& alloc,
row_cache::partitions_type::iterator cache_i, memtable_entry& mem_e, partition_presence_checker& is_present,
- real_dirty_memory_accounter& acc)
+ real_dirty_memory_accounter& acc, const partitions_type::bound_hint&)
{
if (cache_i != partitions_end() && cache_i->key().equal(*_schema, mem_e.key())) {
// FIXME: Invalidate only affected row ranges.
@@ -1092,11 +1092,10 @@ void row_cache::invalidate_locked(const dht::decorated_key& dk) {
if (pos == partitions_end() || !pos->key().equal(*_schema, dk)) {
_tracker.clear_continuity(*pos);
} else {
- auto it = _partitions.erase_and_dispose(pos,
- [this, &dk, deleter = current_deleter<cache_entry>()](auto&& p) mutable {
+ auto it = pos.erase_and_dispose(cache_entry::token_compare{},
+ [this](cache_entry& p) mutable {
_tracker.on_partition_erase();
- p->evict(_tracker);
- deleter(p);
+ p.evict(_tracker);
});
_tracker.clear_continuity(*it);
}
@@ -1130,13 +1129,12 @@ future<> row_cache::invalidate(external_updater eu, dht::partition_range_vector&
auto it = _partitions.lower_bound(*_prev_snapshot_pos, cmp);
auto end = _partitions.lower_bound(dht::ring_position_view::for_range_end(range), cmp);
return with_allocator(_tracker.allocator(), [&] {
- auto deleter = current_deleter<cache_entry>();
while (it != end) {
- it = _partitions.erase_and_dispose(it, [&] (cache_entry* p) mutable {
- _tracker.on_partition_erase();
- p->evict(_tracker);
- deleter(p);
- });
+ it = it.erase_and_dispose(cache_entry::token_compare{},
+ [&] (cache_entry& p) mutable {
+ _tracker.on_partition_erase();
+ p.evict(_tracker);
+ });
// it != end is necessary for correctness. We cannot set _prev_snapshot_pos to end->position()
// because after resuming something may be inserted before "end" which falls into the next range.
if (need_preempt() && it != end) {
@@ -1173,14 +1171,14 @@ void row_cache::evict() {
row_cache::row_cache(schema_ptr s, snapshot_source src, cache_tracker& tracker, is_continuous cont)
: _tracker(tracker)
, _schema(std::move(s))
- , _partitions(cache_entry::compare(_schema))
+ , _partitions(cache_entry::token_compare{})
, _underlying(src())
, _snapshot_source(std::move(src))
{
with_allocator(_tracker.allocator(), [this, cont] {
- cache_entry* entry = current_allocator().construct<cache_entry>(cache_entry::dummy_entry_tag());
- _partitions.insert_before(_partitions.end(), *entry);
- entry->set_continuous(bool(cont));
+ cache_entry entry(cache_entry::dummy_entry_tag{});
+ entry.set_continuous(bool(cont));
+ _partitions.insert(entry.position().token().raw(), std::move(entry), cache_entry::compare{_schema});
});
}

@@ -1189,13 +1187,7 @@ cache_entry::cache_entry(cache_entry&& o) noexcept
, _key(std::move(o._key))
, _pe(std::move(o._pe))
, _flags(o._flags)
- , _cache_link()
{
- {
- using container_type = row_cache::partitions_type;
- container_type::node_algorithms::replace_node(o._cache_link.this_ptr(), _cache_link.this_ptr());
- container_type::node_algorithms::init(o._cache_link.this_ptr());
- }
}

cache_entry::~cache_entry() {
@@ -1210,11 +1202,11 @@ void row_cache::set_schema(schema_ptr new_schema) noexcept {
}

void cache_entry::on_evicted(cache_tracker& tracker) noexcept {
- auto it = row_cache::partitions_type::s_iterator_to(*this);
+ row_cache::partitions_type::iterator it(this);
std::next(it)->set_continuous(false);
evict(tracker);
- current_deleter<cache_entry>()(this);
tracker.on_partition_eviction();
+ it.erase(cache_entry::token_compare{});
}

void rows_entry::on_evicted(cache_tracker& tracker) noexcept {
diff --git a/test/perf/memory_footprint_test.cc b/test/perf/memory_footprint_test.cc
index 465b0169d..9252c77e0 100644
--- a/test/perf/memory_footprint_test.cc
+++ b/test/perf/memory_footprint_test.cc
@@ -60,7 +60,6 @@ class size_calculator {
{

Pavel Emelyanov <xemul@scylladb.com>
May 19, 2020, 9:15:09 AM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The change is the same as with row-cache -- use B+ with int64_t token
as key and array of memtable_entry-s inside it.

The changes are:

Similar to those for row_cache:

- compare() is made to report int; token_compare is added to report bool

- insertion and removal happen with the help of double_decker; most
of the changed places just follow its slightly different semantics

- flags are added to memtable_entry. This makes its size larger than
it could be, but still smaller than it was before

Memtable-specific:

- when a new entry is inserted into the tree, iterators _might_ get
invalidated by the double-decker inner array. It is easy to check
when this happens, so the invalidation is avoided when possible

- the size_in_allocator_without_rows() is now not very precise. This
is because after the patch memtable_entries are not allocated
individually as they used to be. They can be squashed together with
entries having a token conflict, so asking the allocator for the
occupied memory slot is not possible. As the closest (lower) estimate,
the size of the enclosing B+ data node is used

- the ::slice() call wants const iterators to work with, but the new
collections do not have them (yet), so a const_cast there :(

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
memtable.hh | 59 +++++++++++++++++++++++++++---------
row_cache.hh | 1 -
utils/double-decker.hh | 9 ++++++
memtable.cc | 68 ++++++++++++++++++++++++------------------
row_cache.cc | 14 ++++-----
5 files changed, 99 insertions(+), 52 deletions(-)

diff --git a/memtable.hh b/memtable.hh
index cbff1ecc6..ab83b9ba3 100644
--- a/memtable.hh
+++ b/memtable.hh
@@ -32,11 +32,11 @@
#include "db/commitlog/replay_position.hh"
#include "db/commitlog/rp_set.hh"
#include "utils/extremum_tracking.hh"
-#include "utils/logalloc.hh"
#include "partition_version.hh"
#include "flat_mutation_reader.hh"
#include "mutation_cleaner.hh"
#include "sstables/types.hh"
+#include "utils/double-decker.hh"

class frozen_mutation;

@@ -44,11 +44,22 @@ class frozen_mutation;
namespace bi = boost::intrusive;

class memtable_entry {
- bi::set_member_hook<> _link;
schema_ptr _schema;
dht::decorated_key _key;
partition_entry _pe;
+ struct {
+ bool _head : 1;
+ bool _tail : 1;
+ bool _train : 1;
+ } _flags{};
public:
+ bool is_head() const { return _flags._head; }
+ void set_head(bool v) { _flags._head = v; }
+ bool is_tail() const { return _flags._tail; }
+ void set_tail(bool v) { _flags._tail = v; }
+ bool with_train() const { return _flags._train; }
+ void set_train(bool v) { _flags._train = v; }
+
friend class memtable;

memtable_entry(schema_ptr s, dht::decorated_key key, mutation_partition p)
@@ -77,8 +88,10 @@ class memtable_entry {
return _key.key().external_memory_usage();
}

+ size_t object_memory_size(allocation_strategy& allocator);
+
size_t size_in_allocator_without_rows(allocation_strategy& allocator) {
- return allocator.object_memory_size_in_allocator(this) + external_memory_usage_without_rows();
+ return object_memory_size(allocator) + external_memory_usage_without_rows();
}

size_t size_in_allocator(allocation_strategy& allocator) {
@@ -89,30 +102,48 @@ class memtable_entry {
return size;
}

+ struct token_compare {
+ bool operator()(const int64_t k1, const int64_t k2) const {
+ return dht::tri_compare_raw(k1, k2) < 0;
+ }
+ bool operator()(const dht::decorated_key& k1, const int64_t k2) const {
+ return dht::tri_compare_raw(k1.token().raw(), k2) < 0;
+ }
+ bool operator()(const int64_t k1, const dht::decorated_key& k2) const {
+ return dht::tri_compare_raw(k1, k2.token().raw()) < 0;
+ }
+ bool operator()(const dht::ring_position& k1, const int64_t k2) const {
+ return dht::tri_compare_raw(k1.token().raw(), k2) < 0;
+ }
+ bool operator()(const int64_t k1, const dht::ring_position& k2) const {
+ return dht::tri_compare_raw(k1, k2.token().raw()) < 0;
+ }
+ };
+
struct compare {
- dht::decorated_key::less_comparator _c;
+ dht::ring_position_comparator _c;

compare(schema_ptr s)
- : _c(std::move(s))
+ : _c(*s)
{}

- bool operator()(const dht::decorated_key& k1, const memtable_entry& k2) const {
+ int operator()(const dht::decorated_key& k1, const memtable_entry& k2) const {
return _c(k1, k2._key);
}

- bool operator()(const memtable_entry& k1, const memtable_entry& k2) const {
+ int operator()(const memtable_entry& k1, const memtable_entry& k2) const {
return _c(k1._key, k2._key);
}

- bool operator()(const memtable_entry& k1, const dht::decorated_key& k2) const {
+ int operator()(const memtable_entry& k1, const dht::decorated_key& k2) const {
return _c(k1._key, k2);
}

- bool operator()(const memtable_entry& k1, const dht::ring_position& k2) const {
+ int operator()(const memtable_entry& k1, const dht::ring_position& k2) const {
return _c(k1._key, k2);
}

- bool operator()(const dht::ring_position& k1, const memtable_entry& k2) const {
+ int operator()(const dht::ring_position& k1, const memtable_entry& k2) const {
return _c(k1, k2._key);
}
};
@@ -126,9 +157,9 @@ struct table_stats;
// Managed by lw_shared_ptr<>.
class memtable final : public enable_lw_shared_from_this<memtable>, private logalloc::region {
public:
- using partitions_type = bi::set<memtable_entry,
- bi::member_hook<memtable_entry, bi::set_member_hook<>, &memtable_entry::_link>,
- bi::compare<memtable_entry::compare>>;
+ using partitions_type = double_decker<int64_t, memtable_entry,
+ memtable_entry::token_compare, memtable_entry::compare,
+ 16, bplus::key_search::linear>;
private:
dirty_memory_manager& _dirty_mgr;
mutation_cleaner _cleaner;
@@ -179,7 +210,7 @@ class memtable final : public enable_lw_shared_from_this<memtable>, private loga
friend class flush_reader;
friend class flush_memory_accounter;
private:
- boost::iterator_range<partitions_type::const_iterator> slice(const dht::partition_range& r) const;
+ boost::iterator_range<partitions_type::iterator> slice(const dht::partition_range& r) const;
partition_entry& find_or_create_partition(const dht::decorated_key& key);
partition_entry& find_or_create_partition_slow(partition_key_view key);
void upgrade_entry(memtable_entry&);
diff --git a/row_cache.hh b/row_cache.hh
index fc222c57a..a02f45091 100644
--- a/row_cache.hh
+++ b/row_cache.hh
@@ -31,7 +31,6 @@

#include "mutation_reader.hh"
#include "mutation_partition.hh"
-#include "utils/logalloc.hh"
#include "utils/phased_barrier.hh"
#include "utils/histogram.hh"
#include "partition_version.hh"
diff --git a/utils/double-decker.hh b/utils/double-decker.hh
index 798760514..1bf8ad399 100644
--- a/utils/double-decker.hh
+++ b/utils/double-decker.hh
@@ -353,4 +353,13 @@ class double_decker {
}

bool empty() const { return _tree.empty(); }
+
+ static size_t estimated_object_memory_size_in_allocator(allocation_strategy& allocator, const T* obj) {
+ /*
+ * The T-s are merged together in one array, so asking the allocator
+ * for the run-time size of an individual object would be wrong.
+ * Instead, estimate with the size of the enclosing B+ data node.
+ */
+ return sizeof(typename outer_tree::data);
+ }
};
diff --git a/memtable.cc b/memtable.cc
index a5a02c164..0551617d0 100644
--- a/memtable.cc
+++ b/memtable.cc
@@ -117,7 +117,7 @@ memtable::memtable(schema_ptr schema, dirty_memory_manager& dmm, table_stats& ta
, _cleaner(*this, no_cache_tracker, table_stats.memtable_app_stats, compaction_scheduling_group)
, _memtable_list(memtable_list)
, _schema(std::move(schema))
- , partitions(memtable_entry::compare(_schema))
+ , partitions(memtable_entry::token_compare{})
, _table_stats(table_stats) {
}

@@ -145,9 +145,8 @@ void memtable::evict_entry(memtable_entry& e, mutation_cleaner& cleaner) {
void memtable::clear() noexcept {
auto dirty_before = dirty_size();
with_allocator(allocator(), [this] {
- partitions.clear_and_dispose([this] (memtable_entry* e) {
- evict_entry(*e, _cleaner);
- current_deleter<memtable_entry>()(e);
+ partitions.clear_and_dispose([this] (memtable_entry& e) {
+ evict_entry(e, _cleaner);
});
});
remove_flushed_memory(dirty_before - dirty_size());
@@ -167,9 +166,7 @@ future<> memtable::clear_gently() noexcept {
if (p.begin()->clear_gently() == stop_iteration::no) {
break;
}
- p.erase_and_dispose(p.begin(), [&] (auto e) {
- alloc.destroy(e);
- });
+ p.begin().erase(memtable_entry::token_compare{});
if (need_preempt()) {
break;
}
@@ -178,6 +175,13 @@ future<> memtable::clear_gently() noexcept {
remove_flushed_memory(dirty_before - dirty_size());
seastar::thread::yield();
}
+
+ /*
+ * The collection is not guaranteed to free everything
+ * with the last erase. If anything gets freed in destructor,
+ * it will be unaccounted from wrong allocator, so handle it
+ */
+ with_allocator(alloc, [&p] { p.clear(); });
});
auto f = t->join();
return f.then([t = std::move(t)] {});
@@ -211,13 +215,17 @@ memtable::find_or_create_partition(const dht::decorated_key& key) {
assert(!reclaiming_enabled());

// call lower_bound so we have a hint for the insert, just in case.
- auto i = partitions.lower_bound(key, memtable_entry::compare(_schema));
- if (i == partitions.end() || !key.equal(*_schema, i->key())) {
- memtable_entry* entry = current_allocator().construct<memtable_entry>(
- _schema, dht::decorated_key(key), mutation_partition(_schema));
- partitions.insert_before(i, *entry);
+ partitions_type::bound_hint hint;
+ auto i = partitions.lower_bound(key, memtable_entry::compare(_schema), hint);
+ if (i == partitions.end() || !hint.match) {
+ partitions_type::iterator entry = partitions.emplace_before(i,
@@ -761,15 +771,11 @@ mutation_source memtable::as_data_source() {
}

memtable_entry::memtable_entry(memtable_entry&& o) noexcept
- : _link()
- , _schema(std::move(o._schema))
+ : _schema(std::move(o._schema))
, _key(std::move(o._key))
, _pe(std::move(o._pe))
-{
- using container_type = memtable::partitions_type;
- container_type::node_algorithms::replace_node(o._link.this_ptr(), _link.this_ptr());
- container_type::node_algorithms::init(o._link.this_ptr());
-}
+ , _flags(o._flags)
+{ }

stop_iteration memtable_entry::clear_gently() noexcept {
return _pe.clear_gently(no_cache_tracker);
@@ -805,9 +811,13 @@ void memtable::set_schema(schema_ptr new_schema) noexcept {
_schema = std::move(new_schema);
}

+size_t memtable_entry::object_memory_size(allocation_strategy& allocator) {
+ return memtable::partitions_type::estimated_object_memory_size_in_allocator(allocator, this);
+}
+
std::ostream& operator<<(std::ostream& out, memtable& mt) {
logalloc::reclaim_lock rl(mt);
- return out << "{memtable: [" << ::join(",\n", mt.partitions) << "]}";
+ return out << "{memtable: [" << ::join(",\n", mt.partitions.begin(), mt.partitions.end()) << "]}";
}

std::ostream& operator<<(std::ostream& out, const memtable_entry& mt) {
diff --git a/row_cache.cc b/row_cache.cc
index 3f76494ed..527acf650 100644
--- a/row_cache.cc
+++ b/row_cache.cc
@@ -890,15 +890,14 @@ void row_cache::invalidate_sync(memtable& m) noexcept {
bool blow_cache = false;
// Note: clear_and_dispose() ought not to look up any keys, so it doesn't require
// with_linearized_managed_bytes(), but invalidate() does.
- m.partitions.clear_and_dispose([this, &m, deleter = current_deleter<memtable_entry>(), &blow_cache] (memtable_entry* entry) {
+ m.partitions.clear_and_dispose([this, &m, &blow_cache] (memtable_entry& entry) {
with_linearized_managed_bytes([&] {
try {
- invalidate_locked(entry->key());
+ invalidate_locked(entry.key());
} catch (...) {
blow_cache = true;
}
- m.evict_entry(*entry, _tracker.memtable_cleaner());
- deleter(entry);
+ m.evict_entry(entry, _tracker.memtable_cleaner());
});
});
if (blow_cache) {
@@ -972,10 +971,9 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
real_dirty_acc.unpin_memory(size_entry);
_update_section(_tracker.region(), [&] {
auto i = m.partitions.begin();
- memtable_entry& mem_e = *i;
- m.partitions.erase(i);
- m.evict_entry(mem_e, _tracker.memtable_cleaner());
- current_allocator().destroy(&mem_e);
+ i.erase_and_dispose(memtable_entry::token_compare{}, [&] (memtable_entry& e) {
+ m.evict_entry(e, _tracker.memtable_cleaner());
+ });
});
++partition_count;
});
--
2.20.1

Pavel Emelyanov <xemul@scylladb.com>
May 19, 2020, 9:15:32 AM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
// The story is at
// https://groups.google.com/forum/#!msg/scylladb-dev/sxqTHM9rSDQ/WqwF1AQDAQAJ

This is the B+ version which satisfies several specific requirements
to be suitable for row-cache usage.

1. Insert/Remove doesn't invalidate iterators
2. Elements should be LSA-compactable
3. Low overhead of data nodes (1 pointer)
4. External less-only comparator
5. As little actions on insert/delete as possible
6. Iterator walks the sorted keys

The design, briefly is:

There are 3 types of nodes: inner, leaf and data. Inner and leaf
nodes keep built-in arrays of N keys and N(+1) kid pointers. Leaf
nodes sit in a doubly linked list. Data nodes live separately from
the leaf ones and keep pointers back to them. The tree handle keeps
pointers to the root and to the left-most and right-most leaves.
Nodes do _not_ keep pointers or references to the tree (except 3
of them, see below).
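A hypothetical sketch of that layout (names and field types are illustrative, not the actual utils/bptree.hh definitions, and without the LSA machinery):

```cpp
#include <cstdint>

// Illustrative layout only: N keys per node, kids are a union of
// inner-node pointers and leaf data pointers, leaves are linked.
constexpr int N = 4;

struct data_node;

struct bnode {
    bool leaf = false;
    int num_keys = 0;
    int64_t keys[N] = {};           // built-in array of N keys
    union kid {
        bnode* n;                   // inner node: N+1 sub-trees
        data_node* d;               // leaf node: pointers to data nodes
    } kids[N + 1] = {};
    bnode* prev = nullptr;          // leaves sit in a doubly linked list
    bnode* next = nullptr;
};

struct data_node {
    bnode* leaf = nullptr;          // back-pointer: the 1-pointer overhead
    long value = 0;                 // the payload (T)
};

// Wire one data node into a leaf and check the back-link, the way the
// validator checks that data points back at its leaf.
bool layout_selftest() {
    bnode l;
    l.leaf = true;
    l.num_keys = 1;
    l.keys[0] = 7;
    data_node d;
    d.value = 42;
    l.kids[1].d = &d;               // key i lives at kid slot i + 1
    d.leaf = &l;
    return l.kids[1].d->value == 42 && d.leaf == &l;
}
```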

Update in v7:

- The index_for() implementation is templatized differently
to make an AVX key-search specialization possible (in further
patches)
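For context, a scalar sketch of what index_for() computes (a simplification, not the actual templatized code): the number of keys not greater than k, i.e. the kid slot to descend into. An AVX specialization could compare a whole node's key array against k at once instead of looping.

```cpp
#include <cstdint>
#include <vector>

// Simplified scalar index_for(): returns one past the last key that is
// not greater than k. The real code expresses the comparison via the
// external Less comparator only; <= is used here for brevity.
int index_for(const std::vector<int64_t>& keys, int64_t k) {
    int i = 0;
    // linear scan -- slightly beats binary search on node-sized arrays
    while (i < (int)keys.size() && keys[i] <= k) {
        i++;
    }
    return i;
}
```

With this convention an exact hit leaves the matching key at index i - 1, which is how find() and emplace() detect duplicates.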

Update in v6:

- Insertion tries to push kids to siblings before split

Before this change, insertion into a full node resulted in that
node being split into two equal parts. Under a random-keys stress
test this behaviour gives a tree with ~2/3 of the nodes half-filled.

With this change, before splitting a full node we first try to push
one element to each of its siblings (if they exist and are not full).
This slows insertion down a bit (it's still way faster than
std::set), but gives 15% fewer nodes in total.
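The idea can be sketched with a toy model over a flat list of leaves (an assumption-laden simplification for illustration, not the actual tree code, which also has inner nodes and separation keys to maintain):

```cpp
#include <algorithm>
#include <vector>

// Toy model of "push to siblings before split": leaves hold up to Cap
// sorted keys; on insertion into a full leaf we first shove one key into
// a non-full sibling, and only split when both siblings are full.
struct toy_tree {
    static constexpr size_t Cap = 4;
    std::vector<std::vector<int>> nodes; // sorted leaves, left to right

    void insert(int k) {
        if (nodes.empty()) {
            nodes.push_back({k});
            return;
        }
        // route to the leaf whose first key is the greatest one <= k
        size_t i = 0;
        while (i + 1 < nodes.size() && k >= nodes[i + 1].front()) {
            i++;
        }
        if (nodes[i].size() == Cap) {
            if (i > 0 && nodes[i - 1].size() < Cap) {
                // push the smallest key to the left sibling
                nodes[i - 1].push_back(nodes[i].front());
                nodes[i].erase(nodes[i].begin());
            } else if (i + 1 < nodes.size() && nodes[i + 1].size() < Cap) {
                // push the largest key (or k itself) to the right sibling
                int hi = std::max(k, nodes[i].back());
                if (hi != k) {
                    nodes[i].pop_back();
                }
                nodes[i + 1].insert(nodes[i + 1].begin(), hi);
                if (hi == k) {
                    return;
                }
            } else {
                // both siblings full (or absent): split into two halves
                std::vector<int> right(nodes[i].begin() + Cap / 2, nodes[i].end());
                nodes[i].resize(Cap / 2);
                bool goes_right = k >= right.front();
                nodes.insert(nodes.begin() + i + 1, std::move(right));
                if (goes_right) {
                    i++;
                }
            }
        }
        auto& n = nodes[i];
        n.insert(std::upper_bound(n.begin(), n.end(), k), k);
    }
};
```

Pushing keeps more leaves packed close to Cap, which is where the reduction in total node count comes from.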

- Iterator method to reconstruct the data at the given position

The helper creates a new data node, emplaces the data into it and
replaces the iterator's current one with it. This is needed to keep
arrays of data in the tree.

- Milli-optimize erase()
- Return an iterator that will likely not need re-validation
- Do not try to update the ancestors' separation key for the leftmost kid

These used to make clear()-like workloads perform poorly compared
to std::set. In particular, the row_cache::invalidate() method does
exactly this, and this change improves its timing.

- Perf test to measure drain speed
- Helper call to collect tree counters

Update in v5:

- Fix corner case of iterator.emplace_before()
- Clean heterogeneous lookup API
- Handle exceptions from nodes allocations
- Explicitly mark places where the key is copied (for future)
- Extend the tree.lower_bound() API to report back whether
the bound hit the key or not
- Addressed style/cleanness review comments
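The extended lower_bound() contract from the v5 list can be sketched with std::map standing in for bplus::tree (an illustration of the intended API, not the bptree.hh implementation):

```cpp
#include <map>

// lower_bound(k, match) contract: return the least element x such that
// !less(x, k), and report via 'match' whether the bound hit k exactly.
template <typename K, typename V>
typename std::map<K, V>::iterator
lower_bound_match(std::map<K, V>& m, const K& k, bool& match) {
    auto it = m.lower_bound(k);
    match = (it != m.end()) && !(k < it->first);
    return it;
}
```

Reporting the exact-hit flag from the same walk saves the caller a second key comparison (or a second logN lookup).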

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
configure.py | 8 +
test/unit/bptree_key.hh | 101 ++
test/unit/bptree_validation.hh | 318 +++++
utils/bptree.hh | 1880 +++++++++++++++++++++++++++
test/boost/bptree_test.cc | 344 +++++
test/perf/perf_bptree.cc | 246 ++++
test/unit/bptree_compaction_test.cc | 210 +++
test/unit/bptree_stress_test.cc | 236 ++++
8 files changed, 3343 insertions(+)
create mode 100644 test/unit/bptree_key.hh
create mode 100644 test/unit/bptree_validation.hh
create mode 100644 utils/bptree.hh
create mode 100644 test/boost/bptree_test.cc
create mode 100644 test/perf/perf_bptree.cc
create mode 100644 test/unit/bptree_compaction_test.cc
create mode 100644 test/unit/bptree_stress_test.cc

diff --git a/configure.py b/configure.py
index 4f39ad55e..262597d60 100755
--- a/configure.py
+++ b/configure.py
@@ -381,6 +381,7 @@ scylla_tests = set([
'test/boost/view_schema_ckey_test',
'test/boost/vint_serialization_test',
'test/boost/virtual_reader_test',
+ 'test/boost/bptree_test',
'test/manual/ec2_snitch_test',
'test/manual/gce_snitch_test',
'test/manual/gossip',
@@ -398,6 +399,7 @@ scylla_tests = set([
'test/perf/perf_fast_forward',
'test/perf/perf_hash',
'test/perf/perf_mutation',
+ 'test/perf/perf_bptree',
'test/perf/perf_row_cache_update',
'test/perf/perf_simple_query',
'test/perf/perf_sstable',
@@ -405,6 +407,8 @@ scylla_tests = set([
'test/unit/lsa_sync_eviction_test',
'test/unit/row_cache_alloc_stress_test',
'test/unit/row_cache_stress_test',
+ 'test/unit/bptree_stress_test',
+ 'test/unit/bptree_compaction_test',
])

perf_tests = set([
@@ -943,6 +947,7 @@ pure_boost_tests = set([
'test/boost/small_vector_test',
'test/boost/top_k_test',
'test/boost/vint_serialization_test',
+ 'test/boost/bptree_test',
'test/manual/json_test',
'test/manual/streaming_histogram_test',
])
@@ -956,10 +961,13 @@ tests_not_using_seastar_test_framework = set([
'test/perf/perf_cql_parser',
'test/perf/perf_hash',
'test/perf/perf_mutation',
+ 'test/perf/perf_bptree',
'test/perf/perf_row_cache_update',
'test/unit/lsa_async_eviction_test',
'test/unit/lsa_sync_eviction_test',
'test/unit/row_cache_alloc_stress_test',
+ 'test/unit/bptree_stress_test',
+ 'test/unit/bptree_compaction_test',
'test/manual/sstable_scan_footprint_test',
]) | pure_boost_tests

diff --git a/test/unit/bptree_key.hh b/test/unit/bptree_key.hh
new file mode 100644
index 000000000..54347a54f
--- /dev/null
+++ b/test/unit/bptree_key.hh
@@ -0,0 +1,101 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+/*
+ * Helper class that checks that the tree
+ * - works with keys without a default constructor
+ * - moves the keys around properly
+ */
+class test_key {
+ int _val;
+ int* _cookie;
+ int* _p_cookie;
+
+public:
+ bool is_alive() const {
+ if (_val == -1) {
+ fmt::print("key value is reset\n");
+ return false;
+ }
+
+ if (_cookie == nullptr) {
+ fmt::print("key cookie is reset\n");
+ return false;
+ }
+
+ if (*_cookie != 0) {
+ fmt::print("key cookie value is corrupted {}\n", *_cookie);
+ return false;
+ }
+
+ return true;
+ }
+
+ bool less(const test_key& o) const {
+ return _val < o._val;
+ }
+
+ explicit test_key(int nr) : _val(nr) {
+ _cookie = new int(0);
+ _p_cookie = new int(1);
+ }
+
+ operator int() const { return _val; }
+
+ test_key& operator=(const test_key& other) = delete;
+ test_key& operator=(test_key&& other) = delete;
+
+private:
+ /*
+ * Keep this private to make bptree.hh explicitly call the
+ * copy_key in the places where the key is copied
+ */
+ test_key(const test_key& other) noexcept : _val(other._val) {
+ _cookie = new int(*other._cookie);
+ _p_cookie = new int(*other._p_cookie);
+ }
+
+ friend test_key copy_key(const test_key&);
+
+public:
+ test_key(test_key&& other) noexcept : _val(other._val) {
+ other._val = -1;
+ _cookie = other._cookie;
+ other._cookie = nullptr;
+ _p_cookie = new int(*other._p_cookie);
+ }
+
+ ~test_key() {
+ if (_cookie != nullptr) {
+ delete _cookie;
+ }
+ assert(_p_cookie != nullptr);
+ delete _p_cookie;
+ }
+};
+
+test_key copy_key(const test_key& other) { return test_key(other); }
+
+struct test_key_compare {
+ bool operator()(const test_key& a, const test_key& b) const { return a.less(b); }
+};
diff --git a/test/unit/bptree_validation.hh b/test/unit/bptree_validation.hh
new file mode 100644
index 000000000..cdf137bda
--- /dev/null
+++ b/test/unit/bptree_validation.hh
@@ -0,0 +1,318 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+namespace bplus {
+
+template <typename K, typename T, typename Less, int NodeSize>
+class validator {
+ using tree = class tree<K, T, Less, NodeSize, key_search::both, with_debug::yes>;
+ using node = typename tree::node;
+
+ void validate_node(const tree& t, const node& n, int& prev, int& min, bool is_root);
+ void validate_list(const tree& t);
+
+public:
+ void print_tree(const tree& t, char pfx) const {
+ fmt::print("/ {} <- | {} | -> {}\n", t._left->id(), t._root->id(), t._right->id());
+ print_node(*t._root, pfx, 2);
+ fmt::print("\\\n");
+ }
+
+ void print_node(const node& n, char pfx, int indent) const {
+ int i;
+
+ fmt::print("{:<{}c}{:s} {:d} ({:d} keys, {:x} flags):", pfx, indent,
+ n.is_leaf() ? "leaf" : "node", n.id(), n._num_keys, n._flags);
+ if (n.is_leaf()) {
+ for (i = 0; i < n._num_keys; i++) {
+ fmt::print(" {}", (int)n._keys[i].v);
+ }
+ fmt::print("\n");
+
+ return;
+ }
+ fmt::print("\n");
+
+ if (n._kids[0].n != nullptr) {
+ print_node(*n._kids[0].n, pfx, indent + 2);
+ }
+ for (i = 0; i < n._num_keys; i++) {
+ fmt::print("{:<{}c}---{}---\n", pfx, indent, (int)n._keys[i].v);
+ print_node(*n._kids[i + 1].n, pfx, indent + 2);
+ }
+ }
+
+ void validate(const tree& t);
+};
+
+
+template <typename K, typename T, typename L, int NS>
+void validator<K, T, L, NS>::validate_node(const tree& t, const node& n, int& prev_key, int& min_key, bool is_root) {
+ int i;
+
+ if (n.is_root() != is_root) {
+ fmt::print("node {} needs to {} root, but {}\n", n.id(), is_root ? "be" : "be not", n._flags);
+ throw "root broken";
+ }
+
+ for (i = 0; i < n._num_keys; i++) {
+ if (!n._keys[i].v.is_alive()) {
+ fmt::print("node {} key {} is not alive\n", n.id(), i);
+ throw "key dead";
+ }
+ }
+
+ if (n.is_leaf()) {
+ for (i = 0; i < n._num_keys; i++) {
+ if (t._less(n._keys[i].v, K(prev_key))) {
+ fmt::print("node misordered @{} (prev {})\n", (int)n._keys[i].v, prev_key);
+ throw "misorder";
+ }
+ if (n._kids[i + 1].d->_leaf != &n) {
+ fmt::print("data mispoint\n");
+ throw "data backlink";
+ }
+
+ prev_key = n._keys[i].v;
+ if (!n._kids[i + 1].d->value.match_key(n._keys[i].v)) {
+ fmt::print("node value corrupted @{:d}.{:d}\n", n.id(), i);
+ throw "data corruption";
+ }
+ }
+
+ if (n._num_keys > 0) {
+ min_key = (int)n._keys[0].v;
+ }
+ } else if (n._num_keys > 0) {
+ node* k = n._kids[0].n;
+
+ if (k->_parent != &n) {
+ fmt::print("node {:d} -parent-> {:d}, expect {:d}\n", k->id(), k->_parent->id(), n.id());
+ throw "mis-parented node";
+ }
+ validate_node(t, *k, prev_key, min_key, false);
+ for (i = 0; i < n._num_keys; i++) {
+ k = n._kids[i + 1].n;
+ if (k->_parent != &n) {
+ fmt::print("node {:d} -parent-> {:d}, expect {:d}\n",
+ k->id(), k->_parent ? k->_parent->id() : -1, n.id());
+ throw "mis-parented node";
+ }
+ if (t._less(k->_keys[0].v, n._keys[i].v)) {
+ fmt::print("node {:d}.{:d}, separation key {}, kid has {}\n", n.id(), k->id(),
+ (int)n._keys[i].v, (int)k->_keys[0].v);
+ throw "separation key mismatch";
+ }
+
+ int min = 0;
+ validate_node(t, *k, prev_key, min, false);
+ if (t._less(n._keys[i].v, K(min)) || t._less(K(min), n._keys[i].v)) {
+ fmt::print("node {:d}.[{:d}]{:d}, separation key {}, min {}\n",
+ n.id(), i, k->id(), (int)n._keys[i].v, min);
+ if (strict_separation_key || t._less(K(min), n._keys[i].v)) {
+ throw "separation key screw";
+ }
+ }
+ }
+ }
+}
+
+template <typename K, typename T, typename L, int NS>
+void validator<K, T, L, NS>::validate_list(const tree& t) {
+ int prev = 0;
+
+ node* lh = t.left_leaf_slow();
+ node* rh = t.right_leaf_slow();
+
+ if (lh != t._left) {
+ fmt::print("left {:d}, slow {:d}\n", t._left->id(), lh->id());
+ throw "list broken";
+ }
+
+ if (!(lh->_flags & node::NODE_LEFTMOST)) {
+ fmt::print("left {:d} is not marked as such {}\n", t._left->id(), t._left->_flags);
+ throw "list broken";
+ }
+
+ if (rh != t._right) {
+ fmt::print("right {:d}, slow {:d}\n", t._right->id(), rh->id());
+ throw "list broken";
+ }
+
+ if (!(rh->_flags & node::NODE_RIGHTMOST)) {
+ fmt::print("right {:d} is not marked as such {}\n", t._right->id(), t._right->_flags);
+ throw "list broken";
+ }
+
+ node* r = lh;
+ while (1) {
+ node *ln;
+
+ if (!r->is_rightmost()) {
+ ln = r->get_next();
+ if (ln->get_prev() != r) {
+ fmt::print("next leaf {:d} points to {:d}, expect {:d}\n", ln->id(), ln->get_prev()->id(), r->id());
+ throw "list broken";
+ }
+ } else if (r->_rightmost_tree != &t) {
+ fmt::print("right leaf doesn't point to tree\n");
+ throw "list broken";
+ }
+
+ if (!r->is_leftmost()) {
+ ln = r->get_prev();
+ if (ln->get_next() != r) {
+ fmt::print("prev leaf {:d} points to {:d}, expect {:d}\n", ln->id(), ln->get_next()->id(), r->id());
+ throw "list broken";
+ }
+ } else if (r->_kids[0]._leftmost_tree != &t) {
+ fmt::print("left leaf doesn't point to tree\n");
+ throw "list broken";
+ }
+
+ if (r->_num_keys > 0 && t._less(r->_keys[0].v, K(prev))) {
+ fmt::print("list misorder on element {:d}, keys {}..., prev {:d}\n", r->id(), (int)r->_keys[0].v, prev);
+ throw "list broken";
+ }
+
+ if (!r->is_root() && r->_parent != nullptr) {
+ const auto p = r->_parent;
+ int i = p->index_for(r->_keys[0].v, t._less);
+ if (i > 0) {
+ if (p->_kids[i - 1].n != r->get_prev()) {
+ fmt::print("list misorder on parent check: node {:d}.{:d}, parent prev {:d}, list prev {:d}\n",
+ p->id(), r->id(), p->_kids[i - 1].n->id(), r->get_prev()->id());
+ throw "list broken";
+ }
+ }
+ if (i < p->_num_keys - 1) {
+ if (p->_kids[i + 1].n != r->get_next()) {
+ fmt::print("list misorder on parent check: node {:d}.{:d}, parent next {:d}, list next {:d}\n",
+ p->id(), r->id(), p->_kids[i + 1].n->id(), r->get_next()->id());
+ throw "list broken";
+ }
+ }
+ }
+
+ if (r->_num_keys > 0) {
+ prev = (int)r->_keys[r->_num_keys - 1].v;
+ }
+
+ if (r != t._left && r != t._right && (r->_flags & (node::NODE_LEFTMOST | node::NODE_RIGHTMOST))) {
+ fmt::print("middle {:d} is marked as left/right {}\n", r->id(), r->_flags);
+ throw "list broken";
+ }
+
+ if (r->is_rightmost()) {
+ break;
+ }
+
+ r = r->get_next();
+ }
+}
+
+template <typename K, typename T, typename L, int NS>
+void validator<K, T, L, NS>::validate(const tree& t) {
+ try {
+ validate_list(t);
+ int min = 0, prev = 0;
+ if (t._root->_root_tree != &t) {
+ fmt::print("root doesn't point to tree\n");
+ throw "root broken";
+ }
+
+ validate_node(t, *t._root, prev, min, true);
+ } catch (...) {
+ print_tree(t, '|');
+ fmt::print("[ ");
+ node* lh = t._left;
+ while (1) {
+ fmt::print(" {:d}", lh->id());
+ if (lh->is_rightmost()) {
+ break;
+ }
+ lh = lh->get_next();
+ }
+ fmt::print("]\n");
+ throw;
+ }
+}
+
+template <typename K, typename T, typename Less, int NodeSize>
+class iterator_checker {
+ using tree = class tree<K, T, Less, NodeSize, key_search::both, with_debug::yes>;
+
+ validator<K, T, Less, NodeSize>& _tv;
+ tree& _t;
+ typename tree::iterator _fwd, _fend;
+ T _fprev;
+
+public:
+ iterator_checker(validator<K, T, Less, NodeSize>& tv, tree& t) : _tv(tv), _t(t),
+ _fwd(t.begin()), _fend(t.end()) {
+ }
+
+ bool step() {
+ try {
+ return forward_check();
+ } catch(...) {
+ _tv.print_tree(_t, ':');
+ throw;
+ }
+ }
+
+ bool here(const K& k) {
+ return _fwd != _fend && _fwd->match_key(k);
+ }
+
+private:
+ bool forward_check() {
+ if (_fwd == _fend) {
+ return false;
+ }
+ _fwd++;
+ if (_fwd == _fend) {
+ return false;
+ }
+ T val = *_fwd;
+ _fwd++;
+ if (_fwd == _fend) {
+ return false;
+ }
+ _fwd--;
+ if (val != *_fwd) {
+ fmt::print("Iterator broken, {:d} != {:d}\n", val, *_fwd);
+ throw "iterator";
+ }
+ if (val < _fprev) {
+ fmt::print("Iterator broken, {:d} < {:d}\n", val, _fprev);
+ throw "iterator";
+ }
+ _fprev = val;
+
+ return true;
+ }
+};
+
+} // namespace
+
diff --git a/utils/bptree.hh b/utils/bptree.hh
new file mode 100644
index 000000000..e75da5ee8
--- /dev/null
+++ b/utils/bptree.hh
@@ -0,0 +1,1880 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+#include <boost/intrusive/parent_from_member.hpp>
+#include <seastar/util/defer.hh>
+#include <seastar/util/concepts.hh>
+#include <cassert>
+#include "utils/logalloc.hh"
+
+namespace bplus {
+
+enum class with_debug { no, yes };
+
+/*
+ * Linear search in a sorted array of keys slightly beats the
+ * binary one on small sizes. For debugging purposes both methods
+ * should be used (and the results must coincide).
+ */
+enum class key_search { linear, binary, both };
+
+/*
+ * The node_id class is purely a debugging thing -- when reading
+ * the validator's printouts it is handier to look at IDs consisting
+ * of 1-3 digits rather than the 16 hex digits of a printed pointer
+ */
+template <with_debug D>
+struct node_id {
+ int operator()() const { return reinterpret_cast<uintptr_t>(this); }
+};
+
+template <>
+struct node_id<with_debug::yes> {
+ unsigned int _id;
+ static unsigned int _next() {
+ static std::atomic<unsigned int> rover {1};
+ return rover.fetch_add(1);
+ }
+
+ node_id() : _id(_next()) {}
+ int operator()() const { return _id; }
+};
+
+/*
+ * This wrapper prevents the value from being default-constructed
+ * when its container is created. The intended usage is to wrap
+ * elements of static arrays or containers with .emplace() methods
+ * that can live for some time without a value in them.
+ *
+ * Similarly, the value is _not_ automatically destructed when this
+ * thing is, so ~Value() must be called by hand. For this there is the
+ * .reset() method and two helpers for common cases -- std::move-ing
+ * the value into another maybe-location (.emplace(maybe&&)) and
+ * constructing a new one in place of the existing one (.replace(args...))
+ */
+template <typename Value>
+union maybe_key {
+ Value v;
+ maybe_key() noexcept {}
+ ~maybe_key() {}
+
+ void reset() { v.~Value(); }
+
+ /*
+ * Constructs the value inside the empty maybe wrapper.
+ */
+ template <typename... Args>
+ void emplace(Args&&... args) {
+ new (&v) Value (std::forward<Args>(args)...);
+ }
+
+ /*
+ * The special-case handling of moving some other alive maybe-value.
+ * Calls the source destructor after the move.
+ */
+ void emplace(maybe_key&& other) {
+ new (&v) Value(std::move(other.v));
+ other.reset();
+ }
+
+ /*
+ * Similar to emplace, but to be used on the alive maybe.
+ * Calls the destructor on it before constructing the new value.
+ */
+ template <typename... Args>
+ void replace(Args&&... args) {
+ reset();
+ emplace(std::forward<Args>(args)...);
+ }
+
+ void replace(maybe_key&& other) = delete; // not to be called by chance
+};
+
+// For .{do_something_with_data}_and_dispose methods below
+template <typename T>
+void default_dispose(T& value) { }
+
+/*
+ * Helper to make all key copying explicit.
+ * Check test_key for more information.
+ */
+template <typename Key>
+SEASTAR_CONCEPT(requires std::is_nothrow_copy_constructible_v<Key>)
+Key copy_key(const Key& other) {
+ return Key(other);
+}
+
+/*
+ * Consider a small 2-level tree like this
+ *
+ * [ . 5 . ]
+ * | |
+ * +------+ +-----+
+ * | |
+ * [ 1 . 2 . 3 . ] [ 5 . 6 . 7 . ]
+ *
+ * And we remove key 5 from it. First -- the key is removed
+ * from the leaf entry
+ *
+ * [ . 5 . ]
+ * | |
+ * +------+ +-----+
+ * | |
+ * [ 1 . 2 . 3 . ] [ 6 . 7. ]
+ *
+ * At this point we have a choice -- whether or not to update
+ * the separation key on the parent (root). Strictly speaking,
+ * the whole tree is correct now -- all the keys on the right
+ * are greater than or equal to their separation key, though the
+ * "equal" never happens.
+ *
+ * This can be problematic if the keys are stored on data nodes
+ * and are referenced from the (non-)leaf nodes. In this case
+ * the separation key must be updated to point to some real key
+ * in its sub-tree.
+ *
+ * [ . 6 . ] <--- this key updated
+ * | |
+ * +------+ +-----+
+ * | |
+ * [ 1 . 2 . 3 . ] [ 6 . 7. ]
+ *
+ * As this update takes some time, this behaviour is tunable.
+ *
+ */
+constexpr bool strict_separation_key = true;
+
+/*
+ * This is for testing: the validator is everybody's friend
+ * to have the rights to check if the tree is internally correct.
+ */
+template <typename Key, typename T, typename Less, int NodeSize> class validator;
+template <with_debug Debug> class statistics;
+
+template <typename Key, typename T, typename Less, int NodeSize, key_search Search, with_debug Debug> class node;
+template <typename Key, typename T, typename Less, int NodeSize, key_search Search, with_debug Debug> class data;
+
+/*
+ * The tree itself.
+ * Equipped with O(1) (with little constant) begin() and end()
+ * and an iterator that scans through the sorted keys and is not
+ * invalidated on insert/remove.
+ *
+ * The NodeSize parameter describes the number of keys to be
+ * held on each node. Inner nodes will thus have N+1 sub-trees,
+ * leaf nodes will have N data pointers.
+ */
+
+SEASTAR_CONCEPT(
+ template <typename Key1, typename Key2, typename Less>
+ concept bool LessComparable = requires (const Key1& a, const Key2& b, Less less) {
+ { less(a, b) } -> bool;
+ { less(b, a) } -> bool;
+ };
+
+ template <typename T, typename Key>
+ concept bool CanGetKeyFromValue = requires (T val) {
+ { val.key() } -> Key;
+ };
+)
+
+struct stats {
+ unsigned long nodes;
+ std::vector<unsigned long> nodes_filled;
+ unsigned long leaves;
+ std::vector<unsigned long> leaves_filled;
+ unsigned long datas;
+};
+
+template <typename Key, typename T, typename Less, int NodeSize,
+ key_search Search = key_search::binary, with_debug Debug = with_debug::no>
+SEASTAR_CONCEPT( requires LessComparable<Key, Key, Less> &&
+ std::is_nothrow_move_constructible_v<Key> &&
+ std::is_nothrow_move_constructible_v<T>
+)
+class tree {
+public:
+ class iterator;
+ friend class validator<Key, T, Less, NodeSize>;
+ friend class node<Key, T, Less, NodeSize, Search, Debug>;
+
+ // Sanity check: do not allow slow key search in non-debug mode
+ static_assert(Debug == with_debug::yes || Search != key_search::both);
+
+ using node = class node<Key, T, Less, NodeSize, Search, Debug>;
+ using data = class data<Key, T, Less, NodeSize, Search, Debug>;
+
+private:
+
+ node* _root = nullptr;
+ node* _left = nullptr;
+ node* _right = nullptr;
+ Less _less;
+
+ template <typename K>
+ node& find_leaf_for(const K& k) const {
+ node* cur = _root;
+
+ while (!cur->is_leaf()) {
+ int i = cur->index_for(k, _less);
+ cur = cur->_kids[i].n;
+ }
+
+ return *cur;
+ }
+
+ void maybe_init_empty_tree() {
+ if (_root != nullptr) {
+ return;
+ }
+
+ node* n = node::create();
+ n->_flags |= node::NODE_LEAF | node::NODE_ROOT | node::NODE_RIGHTMOST | node::NODE_LEFTMOST;
+ do_set_root(n);
+ do_set_left(n);
+ do_set_right(n);
+ }
+
+ node* left_leaf_slow() const {
+ node* cur = _root;
+ while (!cur->is_leaf()) {
+ cur = cur->_kids[0].n;
+ }
+ return cur;
+ }
+
+ node* right_leaf_slow() const {
+ node* cur = _root;
+ while (!cur->is_leaf()) {
+ cur = cur->_kids[cur->_num_keys].n;
+ }
+ return cur;
+ }
+
+ template <typename K>
+ iterator get_bound(const K& k, bool& upper) {
+ if (empty()) {
+ return end();
+ }
+
+ node& n = find_leaf_for(k);
+ int i = n.index_for(k, _less);
+
+ /*
+ * Element at i (key at i - 1) is less than or equal to k,
+ * the next element is greater. Mind corner cases.
+ */
+
+ if (i == 0) {
+ assert(n.is_leftmost());
+ return begin();
+ } else if (i <= n._num_keys) {
+ iterator cur = iterator(n._kids[i].d, i);
+ if (upper || _less(n._keys[i - 1].v, k)) {
+ cur++;
+ } else {
+ // Here 'upper' becomes 'match'
+ upper = true;
+ }
+
+ return cur;
+ } else {
+ assert(n.is_rightmost());
+ return end();
+ }
+ }
+
+public:
+
+ tree(const tree& other) = delete;
+ const tree& operator=(const tree& other) = delete;
+ tree& operator=(tree&& other) = delete;
+
+ explicit tree(Less less) : _less(less) { }
+ ~tree() {
+ if (_root != nullptr) {
+ node::destroy(*_root);
+ }
+ }
+
+ Less less() const { return _less; }
+
+ tree(tree&& other) noexcept : _less(std::move(other._less)) {
+ if (other._root) {
+ do_set_root(other._root);
+ do_set_left(other._left);
+ do_set_right(other._right);
+
+ other._root = nullptr;
+ other._left = nullptr;
+ other._right = nullptr;
+ }
+ }
+
+ // XXX -- this uses linear scan over the leaf nodes
+ size_t size_slow() const {
+ if (_root == nullptr) {
+ return 0;
+ }
+
+ size_t ret = 0;
+ const node* leaf = _left;
+ while (1) {
+ assert(leaf->is_leaf());
+ ret += leaf->_num_keys;
+ if (leaf == _right) {
+ break;
+ }
+ leaf = leaf->get_next();
+ }
+
+ return ret;
+ }
+
+ // Returns result that is equal (both not less than each other)
+ template <typename K = Key>
+ SEASTAR_CONCEPT(requires LessComparable<K, Key, Less>)
+ iterator find(const K& k) {
+ if (empty()) {
+ return end();
+ }
+
+ node& n = find_leaf_for(k);
+ int i = n.index_for(k, _less);
+
+ if (i >= 1 && !_less(n._keys[i - 1].v, k)) {
+ return iterator(n._kids[i].d, i);
+ } else {
+ return end();
+ }
+ }
+
+ // Returns the least x out of those !less(x, k)
+ template <typename K = Key>
+ SEASTAR_CONCEPT(requires LessComparable<K, Key, Less>)
+ iterator lower_bound(const K& k) {
+ bool upper = false;
+ return get_bound(k, upper);
+ }
+
+ template <typename K = Key>
+ SEASTAR_CONCEPT(requires LessComparable<K, Key, Less>)
+ iterator lower_bound(const K& k, bool& match) {
+ match = false;
+ return get_bound(k, match);
+ }
+
+ // Returns the least x out of those less(k, x)
+ template <typename K = Key>
+ SEASTAR_CONCEPT(requires LessComparable<K, Key, Less>)
+ iterator upper_bound(const K& k) {
+ bool upper = true;
+ return get_bound(k, upper);
+ }
+
+ /*
+ * Constructs the element with key k inside the tree and returns
+ * an iterator to it. If the key already exists -- just returns an
+ * iterator to it and sets the .second to false.
+ */
+ template <typename... Args>
+ std::pair<iterator, bool> emplace(Key k, Args&&... args) {
+ maybe_init_empty_tree();
+
+ node& n = find_leaf_for(k);
+ int i = n.index_for(k, _less);
+
+ if (i >= 1 && !_less(n._keys[i - 1].v, k)) {
+ // Direct hit
+ return std::pair(iterator(n._kids[i].d, i), false);
+ }
+
+ data* d = data::create(std::forward<Args>(args)...);
+ auto x = seastar::defer([&d] { data::destroy(*d, default_dispose<T>); });
+ n.insert(i, std::move(k), d, _less);
+ assert(d->attached());
+ x.cancel();
+ return std::pair(iterator(d, i + 1), true);
+ }
+
+ template <typename Func>
+ SEASTAR_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ iterator erase_and_dispose(const Key& k, Func&& disp) {
+ maybe_init_empty_tree();
+
+ node& n = find_leaf_for(k);
+
+ data* d;
+ int i = n.index_for(k, _less) - 1;
+
+ if (i < 0) {
+ return end();
+ }
+
+ assert(n._num_keys > 0);
+
+ if (_less(n._keys[i].v, k)) {
+ return end();
+ }
+
+ d = n._kids[i + 1].d;
+ iterator it(d, i + 1);
+ it++;
+
+ n.remove(i, _less);
+
+ data::destroy(*d, disp);
+ return it;
+ }
+
+ template <typename Func>
+ SEASTAR_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ iterator erase_and_dispose(iterator from, iterator to, Func&& disp) {
+ /*
+ * FIXME this is dog slow k*logN algo, need k+logN one
+ */
+ while (from != to) {
+ from = from.erase_and_dispose(disp, _less);
+ }
+
+ return to;
+ }
+
+ template <typename... Args>
+ iterator erase(Args&&... args) { return erase_and_dispose(std::forward<Args>(args)..., default_dispose<T>); }
+
+ template <typename Func>
+ SEASTAR_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ void clear_and_dispose(Func&& disp) {
+ if (_root != nullptr) {
+ _root->clear(
+ [this, &disp] (data* d) { data::destroy(*d, disp); },
+ [this] (node* n) { node::destroy(*n); }
+ );
+
+ node::destroy(*_root);
+ _root = nullptr;
+ _left = nullptr;
+ _right = nullptr;
+ }
+ }
+
+ void clear() { clear_and_dispose(default_dispose<T>); }
+
+private:
+ void do_set_left(node *n) {
+ assert(n->is_leftmost());
+ _left = n;
+ n->_kids[0]._leftmost_tree = this;
+ }
+
+ void do_set_right(node *n) {
+ assert(n->is_rightmost());
+ _right = n;
+ n->_rightmost_tree = this;
+ }
+
+ void do_set_root(node *n) {
+ assert(n->is_root());
+ n->_root_tree = this;
+ _root = n;
+ }
+
+public:
+ /*
+ * Iterator. Scans the datas in the sorted-by-key order.
+ * Is not invalidated by emplace/erase-s of other elements.
+ * Move constructors may turn the _idx invalid, but the
+ * .revalidate() method makes it good again.
+ */
+ class iterator {
+ friend class tree;
+
+ /*
+ * When the iterator gets to the end the _data is
+ * replaced with the _tree obtained from the right
+ * leaf, and the _idx is set to npos
+ */
+ union {
+ tree* _tree;
+ data* _data;
+ };
+ int _idx;
+
+ /*
+ * It could be 0 as well, as leaf nodes cannot have
+ * kids (data nodes) at position 0, but ...
+ */
+ static constexpr int npos = -1;
+
+ bool is_end() const { return _idx == npos; }
+
+ explicit iterator(tree* t) : _tree(t), _idx(npos) { }
+ iterator(data* d, int idx) : _data(d), _idx(idx) { }
+
+ /*
+ * The routine makes sure the iterator's index is valid
+ * and returns back the leaf that points to it.
+ */
+ node* revalidate() {
+ assert(!is_end());
+
+ node* leaf = _data->_leaf;
+
+ /*
+ * The data._leaf pointer is always valid (it's updated
+ * on insert/remove operations), the datas do not move
+ * as well, so if the leaf still points at us, it is valid.
+ */
+ if (_idx > leaf->_num_keys || leaf->_kids[_idx].d != _data) {
+ _idx = leaf->index_for(_data);
+ }
+
+ return leaf;
+ }
+
+ public:
+ using iterator_category = std::bidirectional_iterator_tag;
+ using value_type = T;
+ using difference_type = ssize_t;
+ using pointer = value_type*;
+ using reference = value_type&;
+
+ /*
+ * Special constructor for the case when there's the need for an
+ * iterator to a given value pointer. In this case we need to
+ * get three things:
+ * - pointer on class data: we assume that the value pointer
+ * is indeed embedded into the data and do the "container_of"
+ * maneuver
+ * - index at which the data is seen on the leaf: use the
+ * standard revalidation. Note, that we start with index 1
+ * which gives us 1/NodeSize chance of hitting the right index
+ * right at once :)
+ * - the tree itself: the worst part here, as creating an iterator
+ * like this is a logN operation
+ */
+ iterator(T* value) : _idx(1) {
+ _data = boost::intrusive::get_parent_from_member(value, &data::value);
+ revalidate();
+ }
+
+ iterator() : iterator(static_cast<tree*>(nullptr)) {}
+
+ reference operator*() const { return _data->value; }
+ pointer operator->() const { return &_data->value; }
+
+ iterator& operator++() {
+ node* leaf = revalidate();
+ if (_idx < leaf->_num_keys) {
+ _idx++;
+ } else {
+ if (leaf->is_rightmost()) {
+ _idx = npos;
+ _tree = leaf->_rightmost_tree;
+ return *this;
+ }
+
+ leaf = leaf->get_next();
+ _idx = 1;
+ }
+ _data = leaf->_kids[_idx].d;
+ return *this;
+ }
+
+ iterator& operator--() {
+ if (is_end()) {
+ node* n = _tree->_right;
+ assert(n->_num_keys > 0);
+ _data = n->_kids[n->_num_keys].d;
+ _idx = n->_num_keys;
+ return *this;
+ }
+
+ node* leaf = revalidate();
+ if (_idx > 1) {
+ _idx--;
+ } else {
+ leaf = leaf->get_prev();
+ _idx = leaf->_num_keys;
+ }
+ _data = leaf->_kids[_idx].d;
+ return *this;
+ }
+
+ iterator operator++(int) {
+ iterator cur = *this;
+ operator++();
+ return cur;
+ }
+
+ iterator operator--(int) {
+ iterator cur = *this;
+ operator--();
+ return cur;
+ }
+
+ bool operator==(const iterator& o) const { return is_end() ? o.is_end() : _data == o._data; }
+ bool operator!=(const iterator& o) const { return !(*this == o); }
+
+ /*
+ * The key _MUST_ be in order and not exist,
+ * neither of which is checked
+ */
+ template <typename KeyFn, typename... Args>
+ iterator emplace_before(KeyFn key, Less less, Args&&... args) {
+ node* leaf;
+ int i;
+
+ if (!is_end()) {
+ leaf = revalidate();
+ i = _idx - 1;
+
+ if (i == 0 && !leaf->is_leftmost()) {
+ /*
+ * If we're about to insert a key before the 0th one, then
+ * we must make sure the separation keys from upper layers
+ * will separate the new key as well. If they won't then we
+ * should select the left sibling for insertion.
+ *
+ * For !strict_separation_key the solution is simple -- the
+ * upper level separation keys match the current 0th one, so
+ * we always switch to the left sibling.
+ *
+ * If we're already on the left-most leaf -- just insert, as
+ * there's no separation key above it.
+ */
+ if (!strict_separation_key) {
+ assert(false && "Not implemented");
+ }
+ leaf = leaf->get_prev();
+ i = leaf->_num_keys;
+ }
+ } else {
+ _tree->maybe_init_empty_tree();
+ leaf = _tree->_right;
+ i = leaf->_num_keys;
+ }
+
+ assert(i >= 0);
+
+ data* d = data::create(std::forward<Args>(args)...);
+ auto x = seastar::defer([&d] { data::destroy(*d, default_dispose<T>); });
+ leaf->insert(i, std::move(key(d)), d, less);
+ assert(d->attached());
+ x.cancel();
+ /*
+ * XXX -- if the node was not split we could ++ its index
+ * and keep the iterator valid :)
+ */
+ return iterator(d, i);
+ }
+
+ template <typename... Args>
+ iterator emplace_before(Key k, Less less, Args&&... args) {
+ return emplace_before([&k] (data*) -> Key { return std::move(k); },
+ less, std::forward<Args>(args)...);
+ }
+
+ template <typename... Args>
+ SEASTAR_CONCEPT(requires CanGetKeyFromValue<T, Key>)
+ iterator emplace_before(Less less, Args&&... args) {
+ return emplace_before([] (data* d) -> Key { return d->value.key(); },
+ less, std::forward<Args>(args)...);
+ }
+
+ private:
+ /*
+ * Prepare a likely valid iterator for the next element.
+ * "Likely" means that unless the removal starts rebalancing
+ * the leaves, the _idx will reference the correct pointer.
+ *
+ * This is just like the operator++, with the exception
+ * that staying on the leaf doesn't increase the _idx, as
+ * in this case the next element will be shifted left to
+ * the current position.
+ */
+ iterator next_after_erase(node* leaf) const {
+ if (_idx < leaf->_num_keys) {
+ return iterator(leaf->_kids[_idx + 1].d, _idx);
+ }
+
+ if (leaf->is_rightmost()) {
+ return iterator(leaf->_rightmost_tree);
+ }
+
+ leaf = leaf->get_next();
+ return iterator(leaf->_kids[1].d, 1);
+ }
+
+ public:
+ template <typename Func>
+ iterator erase_and_dispose(Func&& disp, Less less) {
+ node* leaf = revalidate();
+ iterator cur = next_after_erase(leaf);
+
+ leaf->remove(_idx - 1, less);
+ data::destroy(*_data, disp);
+
+ return cur;
+ }
+
+ iterator erase(Less less) { return erase_and_dispose(default_dispose<T>, less); }
+
+ template <typename... Args>
+ void reconstruct(size_t new_size, Args&&... args) {
+ node* leaf = revalidate();
+ auto ptr = current_allocator().alloc(&get_standard_migrator<data>(), new_size, alignof(data));
+ data *dat, *cur = _data;
+
+ try {
+ dat = new (ptr) data(std::forward<Args>(args)...);
+ } catch(...) {
+ current_allocator().free(ptr, new_size);
+ throw;
+ }
+
+ dat->_leaf = leaf;
+ cur->_leaf = nullptr;
+
+ _data = dat;
+ leaf->_kids[_idx].d = dat;
+
+ current_allocator().destroy(cur);
+ }
+
+ size_t storage_size(size_t payload) const { return _data->storage_size(payload); }
+ };
+
+ iterator begin() {
+ if (empty()) {
+ return end();
+ }
+
+ assert(_left->_num_keys > 0);
+ // Leaf nodes have data pointers starting from index 1
+ return iterator(_left->_kids[1].d, 1);
+ }
+ iterator end() { return iterator(this); }
+
+ using reverse_iterator = std::reverse_iterator<iterator>;
+ reverse_iterator rbegin() { return std::make_reverse_iterator(end()); }
+ reverse_iterator rend() { return std::make_reverse_iterator(begin()); }
+
+ bool empty() const { return _root == nullptr || _root->_num_keys == 0; }
+
+ struct stats get_stats() const {
+ struct stats st;
+
+ st.nodes = 0;
+ st.leaves = 0;
+ st.datas = 0;
+
+ if (_root != nullptr) {
+ st.nodes_filled.resize(NodeSize + 1);
+ st.leaves_filled.resize(NodeSize + 1);
+ _root->fill_stats(st);
+ }
+
+ return st;
+ }
+};
+
+/*
+ * Algorithms for searching a key in array
+ */
+
+template <typename K, typename Key, typename Less, int Size, key_search Search>
+struct searcher { };
+
+template <typename K, typename Key, typename Less, int Size>
+struct searcher<K, Key, Less, Size, key_search::linear> {
+ static int gt(const K& k, const maybe_key<Key>* keys, int nr, Less less) {
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ if (less(k, keys[i].v)) {
+ break;
+ }
+ }
+
+ return i;
+ };
+};
+
+template <typename K, typename Key, typename Less, int Size>
+struct searcher<K, Key, Less, Size, key_search::binary> {
+ static int gt(const K& k, const maybe_key<Key>* keys, int nr, Less less) {
+ int s = 0, e = nr - 1, c = 0;
+
+ while (s <= e) {
+ int i = (s + e) / 2;
+ c++;
+ if (less(k, keys[i].v)) {
+ e = i - 1;
+ } else {
+ s = i + 1;
+ }
+ }
+
+ return s;
+ }
+};
+
+template <typename K, typename Key, typename Less, int Size>
+struct searcher<K, Key, Less, Size, key_search::both> {
+ static int gt(const K& k, const maybe_key<Key>* keys, int nr, Less less) {
+ int rl = searcher<K, Key, Less, Size, key_search::linear>::gt(k, keys, nr, less);
+ int rb = searcher<K, Key, Less, Size, key_search::binary>::gt(k, keys, nr, less);
+ assert(rl == rb);
+ return rl;
+ }
+};
+
+/*
+ * A node describes both, inner and leaf nodes.
+ */
+template <typename Key, typename T, typename Less, int NodeSize, key_search Search, with_debug Debug>
+class node final {
+ friend class validator<Key, T, Less, NodeSize>;
+ friend class tree<Key, T, Less, NodeSize, Search, Debug>;
+
+ using tree = class tree<Key, T, Less, NodeSize, Search, Debug>;
+ using data = class data<Key, T, Less, NodeSize, Search, Debug>;
+
+ class prealloc;
+
+ /*
+ * The NodeHalf is the level at which the node is considered
+ * to be underflown and should be re-filled. This slightly
+ * differs for even and odd sizes.
+ *
+ * For odd sizes the node will stand until it contains strictly
+ * more than 1/2 of its size (e.g. for size 5 keeping 3 keys
+ * is OK). For even sizes this barrier is less than the actual
+ * half (e.g. for size 4 keeping 2 is still OK).
+ */
+ static constexpr int NodeHalf = ((NodeSize - 1) / 2);
+ static_assert(NodeHalf >= 1);
+
+ union node_or_data_or_tree {
+ node* n;
+ data* d;
+
+ tree* _leftmost_tree; // See comment near node::__next about this
+ };
+
+ using node_or_data = node_or_data_or_tree;
+
+ friend data::data(data&&);
+
+ [[no_unique_address]] node_id<Debug> id;
+
+ unsigned short _num_keys;
+ unsigned short _flags;
+
+ static const unsigned short NODE_ROOT = 0x1;
+ static const unsigned short NODE_LEAF = 0x2;
+ static const unsigned short NODE_LEFTMOST = 0x4;
+ static const unsigned short NODE_RIGHTMOST = 0x8;
+
+ bool is_leaf() const { return _flags & NODE_LEAF; }
+ bool is_root() const { return _flags & NODE_ROOT; }
+ bool is_rightmost() const { return _flags & NODE_RIGHTMOST; }
+ bool is_leftmost() const { return _flags & NODE_LEFTMOST; }
+
+ /*
+ * separation keys
+ * non-leaf nodes:
+ * the kids[i] contains keys[i - 1] <= k < keys[i]
+ * kids[0] contains keys < all keys in the node
+ * leaf nodes:
+ * kids[i + 1] is the data for keys[i]
+ * kids[0] is unused
+ *
+ * In the examples below the leaf nodes will be shown like
+ *
+ * keys: [012]
+ * datas: [-012]
+ *
+ * and the non-leaf ones like
+ *
+ * keys: [012]
+ * kids: [A012]
+ *
+ * so that digits correspond to distinct elements and stay
+ * in their correct positions. The A kid is the left-most one
+ * at index 0 of a non-leaf node.
+ */
+
+ maybe_key<Key> _keys[NodeSize];
+ node_or_data _kids[NodeSize + 1];
+
+ /*
+ * The root node uses this to point to the tree object. This is
+ * needed to update tree->_root on node move.
+ */
+ union {
+ node* _parent;
+ tree* _root_tree;
+ };
+
+ /*
+ * Leaf nodes are linked in a list; since leaf nodes do
+ * not use the _kids[0] pointer, we re-use it. Respectively,
+ * non-leaf nodes don't use the __next one.
+ *
+ * Also, the leftmost and rightmost leaves have their prev and
+ * next (respectively) pointing to the tree object itself. This
+ * is done for the _left/_right update on node move.
+ */
+ union {
+ node* __next;
+ tree* _rightmost_tree;
+ };
+
+ node* get_next() const {
+ assert(is_leaf());
+ return __next;
+ }
+
+ void set_next(node *n) {
+ assert(is_leaf());
+ __next = n;
+ }
+
+ node* get_prev() const {
+ assert(is_leaf());
+ return _kids[0].n;
+ }
+
+ void set_prev(node* n) {
+ assert(is_leaf());
+ _kids[0].n = n;
+ }
+
+ // Links the new node n right after the current one
+ void link(node& n) {
+ if (is_rightmost()) {
+ _flags &= ~NODE_RIGHTMOST;
+ n._flags |= node::NODE_RIGHTMOST;
+ tree* t = _rightmost_tree;
+ assert(t->_right == this);
+ t->do_set_right(&n);
+ } else {
+ n.set_next(get_next());
+ get_next()->set_prev(&n);
+ }
+
+ n.set_prev(this);
+ set_next(&n);
+ }
+
+ void unlink() {
+ node* x;
+ tree* t;
+
+ switch (_flags & (node::NODE_LEFTMOST | node::NODE_RIGHTMOST)) {
+ case node::NODE_LEFTMOST:
+ x = get_next();
+ _flags &= ~node::NODE_LEFTMOST;
+ x->_flags |= node::NODE_LEFTMOST;
+ t = _kids[0]._leftmost_tree;
+ assert(t->_left == this);
+ t->do_set_left(x);
+ break;
+ case node::NODE_RIGHTMOST:
+ x = get_prev();
+ _flags &= ~node::NODE_RIGHTMOST;
+ x->_flags |= node::NODE_RIGHTMOST;
+ t = _rightmost_tree;
+ assert(t->_right == this);
+ t->do_set_right(x);
+ break;
+ case 0:
+ get_prev()->set_next(get_next());
+ get_next()->set_prev(get_prev());
+ break;
+ default:
+ /*
+ * Right- and left-most at the same time can only be the root,
+ * otherwise this would mean we have a root with 0 keys.
+ */
+ assert(false);
+ }
+
+ set_next(this);
+ set_prev(this);
+ }
+
+ node(const node& other) = delete;
+ const node& operator=(const node& other) = delete;
+ node& operator=(node&& other) = delete;
+
+ /*
+ * There's no pointer/reference from nodes to the tree, nor is
+ * there one from data, because otherwise we'd have to update
+ * all of them inside the tree move constructor, which in turn
+ * would make it a slow linear operation. Thus we walk up the
+ * ._parent chain to the root node, which holds the _root_tree.
+ */
+ tree* tree_slow() const {
+ const node* cur = this;
+
+ while (!cur->is_root()) {
+ cur = cur->_parent;
+ }
+
+ return cur->_root_tree;
+ }
+
+ /*
+ * Finds the index of the subtree to which the k belongs.
+ * Respectively, the key[i - 1] <= k < key[i] (and if i == 0
+ * then the node is inner and the key is in leftmost subtree).
+ */
+ template <typename K>
+ int index_for(const K& k, Less less) const {
+ return searcher<K, Key, Less, NodeSize, Search>::gt(k, _keys, _num_keys, less);
+ }
+
+ int index_for(node *n) const {
+ // Keep index on kid (FIXME?)
+
+ int i;
+
+ for (i = 0; i <= _num_keys; i++) {
+ if (_kids[i].n == n) {
+ break;
+ }
+ }
+ assert(i <= _num_keys);
+ return i;
+ }
+
+ bool need_refill() const {
+ return _num_keys <= NodeHalf;
+ }
+
+ bool can_grab_from() const {
+ return _num_keys > NodeHalf + 1;
+ }
+
+ bool can_push_to() const {
+ return _num_keys < NodeSize;
+ }
+
+ bool can_merge_with(const node& n) const {
+ return _num_keys + n._num_keys + (is_leaf() ? 0 : 1) <= NodeSize;
+ }
+
+ void shift_right(int s) {
+ for (int i = _num_keys - 1; i >= s; i--) {
+ _keys[i + 1].emplace(std::move(_keys[i]));
+ _kids[i + 2] = _kids[i + 1];
+ }
+ _num_keys++;
+ }
+
+ void shift_left(int s) {
+ // The key at s is expected to be .remove()-d !
+ for (int i = s; i < _num_keys - 1; i++) {
+ _keys[i].emplace(std::move(_keys[i + 1]));
+ _kids[i + 1] = _kids[i + 2];
+ }
+ _num_keys--;
+ }
+
+ void move_keys_and_kids(int foff, node& to, int toff, int count) {
+ for (int i = 0; i < count; i++) {
+ to._keys[toff + i].emplace(std::move(_keys[foff + i]));
+ to._kids[toff + i + 1] = _kids[foff + i + 1];
+ }
+ }
+
+ void move_to(node& to, int off, int count) {
+ move_keys_and_kids(off, to, 0, count);
+ _num_keys = off;
+ to._num_keys = count;
+ if (is_leaf()) {
+ for (int i = 0; i < count; i++) {
+ to._kids[i + 1].d->reattach(&to);
+ }
+ } else {
+ for (int i = 0; i < count; i++) {
+ to._kids[i + 1].n->_parent = &to;
+ }
+ }
+
+ }
+
+ void grab_from_left(node& from, maybe_key<Key>& sep) {
+ /*
+ * Grab one element from the left sibling and return
+ * the new separation key for them.
+ *
+ * Leaf: just move the last key (and the last kid) and report
+ * it as new separation key
+ *
+ * keys: [012] -> [56] = [01] [256] 2 is new separation
+ * datas: [-012] -> [-56] = [-01] [-256]
+ *
+ * Non-leaf is trickier. We need the current separation key
+ * as we're grabbing the last element, which has no right
+ * boundary on the node. So the parent node tells us one.
+ *
+ * keys: [012] -> s [56] = [01] 2 [s56] 2 is new separation
+ * kids: [A012] -> [B56] = [A01] [2B56]
+ */
+
+ int i = from._num_keys - 1;
+
+ shift_right(0);
+ from._num_keys--;
+
+ if (is_leaf()) {
+ _keys[0].emplace(std::move(from._keys[i]));
+ _kids[1] = from._kids[i + 1];
+ _kids[1].d->reattach(this);
+ sep.replace(std::move(copy_key(_keys[0].v)));
+ } else {
+ _keys[0].emplace(std::move(sep));
+ _kids[1] = _kids[0];
+ _kids[0] = from._kids[i + 1];
+ _kids[0].n->_parent = this;
+ sep.emplace(std::move(from._keys[i]));
+ }
+ }
+
+ void merge_into(node& t, Key key) {
+ /*
+ * Merge current node into t preparing it for being
+ * killed. This merge is slightly different for leaves
+ * and for non-leaves wrt the 0th element.
+ *
+ * Non-leaves. For those we need the separation key, which
+ * is passed to us. The caller "knows" that this and t are
+ * two siblings and thus the separation key is the one from
+ * the parent node. For this reason merging two non-leaf
+ * nodes needs one more slot in the target as compared to
+ * the leaf-nodes case.
+ *
+ * keys: [012] + K + [456] = [012K456]
+ * kids: [A012] + [B456] = [A012B456]
+ *
+ * Leaves. This is simple -- just go ahead and merge.
+ *
+ * keys: [012] + [456] = [012456]
+ * datas: [-012] + [-456] = [-012456]
+ */
+
+ if (!t.is_leaf()) {
+ int i = t._num_keys;
+ t._keys[i].emplace(std::move(key));
+ t._kids[i + 1] = _kids[0];
+ t._kids[i + 1].n->_parent = &t;
+ t._num_keys++;
+ }
+
+ move_keys_and_kids(0, t, t._num_keys, _num_keys);
+
+ if (t.is_leaf()) {
+ for (int i = t._num_keys; i < t._num_keys + _num_keys; i++) {
+ t._kids[i + 1].d->reattach(&t);
+ }
+ } else {
+ for (int i = t._num_keys; i < t._num_keys + _num_keys; i++) {
+ t._kids[i + 1].n->_parent = &t;
+ }
+ }
+
+ t._num_keys += _num_keys;
+ _num_keys = 0;
+ }
+
+ void grab_from_right(node& from, maybe_key<Key>& sep) {
+ /*
+ * Grab one element from the right sibling and return
+ * the new separation key for them.
+ *
+ * Leaf: just move the 0th key (and 1st kid) and the
+ * new separation key is what becomes 0 in the source.
+ *
+ * keys: [01] <- [456] = [014] [56] 5 is new separation
+ * datas: [-01] <- [-456] = [-014] [-56]
+ *
+ * Non-leaf is trickier. We need the current separation
+ * key as we're grabbing the kids[0] element which has no
+ * corresponding keys[-1]. So the parent node tells us one.
+ *
+ * keys: [01] <- s [456] = [01s] 4 [56] 4 is new separation
+ * kids: [A01] <- [B456] = [A01B] [456]
+ */
+
+ int i = _num_keys;
+
+ if (is_leaf()) {
+ _keys[i].emplace(std::move(from._keys[0]));
+ _kids[i + 1] = from._kids[1];
+ _kids[i + 1].d->reattach(this);
+ sep.replace(std::move(copy_key(from._keys[1].v)));
+ } else {
+ _kids[i + 1] = from._kids[0];
+ _kids[i + 1].n->_parent = this;
+ _keys[i].emplace(std::move(sep));
+ from._kids[0] = from._kids[1];
+ sep.emplace(std::move(from._keys[0]));
+ }
+
+ _num_keys++;
+ from.shift_left(0);
+ }
+
+ /*
+ * When splitting, the result should be almost equal. The
+ * "almost" depends on the node-size being odd or even and
+ * on the node itself being leaf or inner.
+ */
+ bool equally_split(const node& n2) const {
+ if (Debug == with_debug::yes) {
+ return (_num_keys == n2._num_keys) ||
+ (_num_keys == n2._num_keys + 1) ||
+ (_num_keys + 1 == n2._num_keys);
+ }
+ return true;
+ }
+
+ // Helper for assert(). See comment for do_insert for details.
+ bool left_kid_sorted(const Key& k, Less less) const {
+ if (Debug == with_debug::yes && !is_leaf() && _num_keys > 0) {
+ node* x = _kids[0].n;
+ if (x != nullptr && less(k, x->_keys[x->_num_keys - 1].v)) {
+ return false;
+ }
+ }
+
+ return true;
+ }
+
+ template <typename DFunc, typename NFunc>
+ SEASTAR_CONCEPT(requires
+ requires (DFunc f, data* val) { { f(val) } -> void; } &&
+ requires (NFunc f, node* n) { { f(n) } -> void; }
+ )
+ void clear(DFunc&& ddisp, NFunc&& ndisp) {
+ if (is_leaf()) {
+ _flags &= ~(node::NODE_LEFTMOST | node::NODE_RIGHTMOST);
+ set_next(this);
+ set_prev(this);
+ } else {
+ node* n = _kids[0].n;
+ n->clear(ddisp, ndisp);
+ ndisp(n);
+ }
+
+ for (int i = 0; i < _num_keys; i++) {
+ _keys[i].reset();
+ if (is_leaf()) {
+ ddisp(_kids[i + 1].d);
+ } else {
+ node* n = _kids[i + 1].n;
+ n->clear(ddisp, ndisp);
+ ndisp(n);
+ }
+ }
+
+ _num_keys = 0;
+ }
+
+ static node* create() {
+ return current_allocator().construct<node>();
+ }
+
+ static void destroy(node& n) {
+ current_allocator().destroy(&n);
+ }
+
+ void drop() {
+ assert(!is_root());
+ if (is_leaf()) {
+ unlink();
+ }
+ destroy(*this);
+ }
+
+ void insert_into_full(int idx, Key k, node_or_data nd, Less less, prealloc& nodes) {
+ if (!is_root()) {
+ node& p = *_parent;
+ int i = p.index_for(_keys[0].v, less);
+
+ /*
+ * Try to push left or right existing keys to the respective
+ * siblings. Keep in mind two corner cases:
+ *
+ * 1. Push to left. In this case the new key should not go
+ * to the [0] element, otherwise we'd have to update the p's
+ * separation key one more time.
+ *
+ * 2. Push to right. In this case we must make sure the new
+ * key is not the rightmost itself, otherwise it's _him_ who
+ * must be pushed there.
+ *
+ * Both corner cases are possible to implement though.
+ */
+ if (idx > 1 && i > 0) {
+ node* left = p._kids[i - 1].n;
+ if (left->can_push_to()) {
+ /*
+ * We've moved the 0th element from this node, so the index
+ * for the new key shifts too
+ */
+ idx--;
+ left->grab_from_right(*this, p._keys[i - 1]);
+ }
+ }
+
+ if (idx < _num_keys && i < p._num_keys) {
+ node* right = p._kids[i + 1].n;
+ if (right->can_push_to()) {
+ right->grab_from_left(*this, p._keys[i]);
+ }
+ }
+
+ if (_num_keys < NodeSize) {
+ do_insert(idx, std::move(k), nd, less);
+ nodes.drain();
+ return;
+ }
+
+ /*
+ * We can only get here if both ->can_push_to() checks above
+ * had failed. In this case -- go ahead and split this.
+ */
+ }
+
+ split_and_insert(idx, std::move(k), nd, less, nodes);
+ }
+
+ void split_and_insert(int idx, Key k, node_or_data nd, Less less, prealloc& nodes) {
+ assert(_num_keys >= NodeSize);
+
+ node* nn = nodes.pop();
+ maybe_key<Key> sep;
+
+ /*
+ * Insertion with split.
+ * 1. Existing node (this) is split into two. We try a bit harder
+ * than we strictly need to, to make the split equal.
+ * 2. The new element is added to either of the resulting nodes.
+ * 3. The new node nn is inserted into parent one with the help
+ * of a separation key sep
+ *
+ * First -- find the position in the current node at which the
+ * new element should have appeared.
+ */
+
+ int off = NodeHalf + (idx > NodeHalf ? 1 : 0);
+
+ if (is_leaf()) {
+ nn->_flags |= NODE_LEAF;
+ link(*nn);
+
+ /*
+ * Split of leaves. This is simple -- just copy the needed
+ * amount of keys and kids from this to nn, then insert the
+ * new pair into the proper place. When inserting the new
+ * node into the parent, the separation key is the one the
+ * latter starts with.
+ *
+ * keys: [01234]
+ * datas: [-01234]
+ *
+ * if the new key is below 2, then
+ * keys: -> [01] [234] -> [0n1] [234] -> sep is 2
+ * datas: -> [-01] [-234] -> [-0n1] [-234]
+ *
+ * if the new key is above 2, then
+ * keys: -> [012] [34] -> [012] [3n4] -> sep is 3 (or n)
+ * datas: -> [-012] [-34] -> [-012] [-3n4]
+ */
+ move_to(*nn, off, NodeSize - off);
+
+ if (idx <= NodeHalf) {
+ do_insert(idx, std::move(k), nd, less);
+ } else {
+ nn->do_insert(idx - off, std::move(k), nd, less);
+ }
+ sep.emplace(std::move(copy_key(nn->_keys[0].v)));
+ } else {
+ /*
+ * Node insertion has one special case -- when the new key
+ * gets directly into the middle.
+ */
+ if (idx == NodeHalf + 1) {
+ /*
+ * Split of nodes where the new key is in the middle. In this
+ * case we need to split the node into two, but take the k as
+ * the separation key. The corresponding data becomes the new
+ * node's 0th kid.
+ *
+ * keys: [012345] -> [012] k [345] (and the k goes up)
+ * kids: [A012345] -> [A012] [n345]
+ */
+ move_to(*nn, off, NodeSize - off);
+ sep.emplace(std::move(k));
+ nn->_kids[0] = nd;
+ nn->_kids[0].n->_parent = nn;
+ } else {
+ /*
+ * Split of nodes where the new key gets into either of the
+ * halves. This is like the leaves split, but we need to carefully
+ * handle the kids[0] for both. The corresponding key is not
+ * on the node and "has" an index of -1 and thus becomes the
+ * separation one for the upper layer.
+ *
+ * keys: [012345]
+ * kids: [A012345]
+ *
+ * if the new key goes left then
+ * keys: -> [01] 2 [345] -> [0n1] 2 [345]
+ * kids: -> [A01] [2345] -> [A0n1] [2345]
+ *
+ * if the new key goes right then
+ * keys: -> [012] 3 [45] -> [012] 3 [4n5]
+ * kids: -> [A012] [345] -> [A012] [34n5]
+ */
+ move_to(*nn, off + 1, NodeSize - off - 1);
+ sep.emplace(std::move(_keys[off]));
+ nn->_kids[0] = _kids[off + 1];
+ nn->_kids[0].n->_parent = nn;
+ _num_keys--;
+
+ if (idx <= NodeHalf) {
+ do_insert(idx, std::move(k), nd, less);
+ } else {
+ nd.n->_parent = nn;
+ nn->do_insert(idx - off - 1, std::move(k), nd, less);
+ }
+ }
+ }
+
+ assert(equally_split(*nn));
+
+ if (is_root()) {
+ insert_into_root(*nn, std::move(sep.v), nodes);
+ } else {
+ insert_into_parent(*nn, std::move(sep.v), less, nodes);
+ }
+ sep.reset();
+ }
+
+ void do_insert(int i, Key k, node_or_data nd, Less less) {
+ assert(_num_keys < NodeSize);
+
+ /*
+ * The new k:nd pair should be put at the given index,
+ * shifting offenders to the right. However, if it were to
+ * be put left of a non-leaf's left-most kid -- that's a BUG,
+ * as there's no corresponding key for it.
+ *
+ * Non-leaf nodes get here when their kids are split, and
+ * when they do, if the kid gets into the left-most sub-tree,
+ * it's directly put there, and this helper is not called.
+ * That said, if we're inserting a new pair, the newbie can
+ * only get to the right of the left-most kid.
+ */
+ assert(i != 0 || left_kid_sorted(k, less));
+
+ shift_right(i);
+
+ // Puts k:nd pair at position idx (keys[idx] and kids[idx + 1])
+ _keys[i].emplace(std::move(k));
+ _kids[i + 1] = nd;
+ if (is_leaf()) {
+ nd.d->attach(*this);
+ }
+ }
+
+ void insert_into_parent(node& nn, Key sep, Less less, prealloc& nodes) {
+ nn._parent = _parent;
+ _parent->insert_key(std::move(sep), node_or_data{n: &nn}, less, nodes);
+ }
+
+ void insert_into_root(node& nn, Key sep, prealloc& nodes) {
+ tree* t = _root_tree;
+
+ node* nr = nodes.pop();
+
+ nr->_num_keys = 1;
+ nr->_keys[0].emplace(std::move(sep));
+ nr->_kids[0].n = this;
+ nr->_kids[1].n = &nn;
+ _flags &= ~node::NODE_ROOT;
+ _parent = nr;
+ nn._parent = nr;
+
+ nr->_flags |= node::NODE_ROOT;
+ t->do_set_root(nr);
+ }
+
+ void insert_key(Key k, node_or_data nd, Less less, prealloc& nodes) {
+ int i = index_for(k, less);
+ insert(i, std::move(k), nd, less, nodes);
+ }
+
+ void insert(int i, Key k, node_or_data nd, Less less, prealloc& nodes) {
+ if (_num_keys == NodeSize) {
+ insert_into_full(i, std::move(k), nd, less, nodes);
+ } else {
+ do_insert(i, std::move(k), nd, less);
+ }
+ }
+
+ void insert(int i, Key k, data* d, Less less) {
+ prealloc nodes;
+
+ /*
+ * Prepare the nodes for the split in advance; if node::create
+ * started throwing while splitting we'd have trouble
+ * "unsplitting" the nodes back.
+ */
+ node* cur = this;
+ while (cur->_num_keys == NodeSize) {
+ nodes.push();
+ if (cur->is_root()) {
+ nodes.push();
+ break;
+ }
+ cur = cur->_parent;
+ }
+
+ insert(i, std::move(k), node_or_data{d: d}, less, nodes);
+ assert(nodes.empty());
+ }
+
+ void remove_from(int i, Less less) {
+ _keys[i].reset();
+ shift_left(i);
+
+ if (!is_root()) {
+ if (need_refill()) {
+ refill(less);
+ }
+ } else if (_num_keys == 0 && !is_leaf()) {
+ node* nr;
+ nr = _kids[0].n;
+ nr->_flags |= node::NODE_ROOT;
+ _root_tree->do_set_root(nr);
+
+ _flags &= ~node::NODE_ROOT;
+ _parent = nullptr;
+ drop();
+ }
+ }
+
+ void merge_kids(node& t, node& n, int idx, Less less) {
+ n.merge_into(t, std::move(_keys[idx].v));
+ n.drop();
+ remove_from(idx, less);
+ }
+
+ void refill(Less less) {
+ node& p = *_parent, *left, *right;
+
+ /*
+ * We need to locate this node's index in the parent's array
+ * by using the 0th key, so make sure it exists. We could
+ * manage without it, but since we don't, let's stay on the
+ * safe side.
+ */
+ assert(_num_keys > 0);
+ int i = p.index_for(_keys[0].v, less);
+ assert(p._kids[i].n == this);
+
+ /*
+ * The node is "underflown" (see comment near NodeHalf
+ * about what this means), so we try to refill it at the
+ * siblings' expense. Many cases possible, but we go with
+ * only four:
+ *
+ * 1. Left sibling exists and it has at least 1 item
+ * above being the half-full. -> we grab one element
+ * from it.
+ *
+ * 2. Left sibling exists and we can merge current with
+ * it. "Can" means the resulting node will not overflow,
+ * which, in turn, differs by one for leaf and non-leaf
+ * nodes. For leaves the merge is possible if the total
+ * number of elements fits the maximum. For non-leaf
+ * nodes we'll need room for one more element, here's why:
+ *
+ * [012] + [456] -> [012X456]
+ * [A012] + [B456] -> [A012B456]
+ *
+ * The key X in the middle separates B from everything on
+ * the left and this key was not sitting on either of the
+ * wannabe merging nodes. This X is the current separation
+ * of these two nodes taken from their parent.
+ *
+ * And two same cases for the right sibling.
+ */
+
+ left = i > 0 ? p._kids[i - 1].n : nullptr;
+ right = i < p._num_keys ? p._kids[i + 1].n : nullptr;
+
+ if (left != nullptr && left->can_grab_from()) {
+ grab_from_left(*left, p._keys[i - 1]);
+ return;
+ }
+
+ if (right != nullptr && right->can_grab_from()) {
+ grab_from_right(*right, p._keys[i]);
+ return;
+ }
+
+ if (left != nullptr && can_merge_with(*left)) {
+ p.merge_kids(*left, *this, i - 1, less);
+ return;
+ }
+
+ if (right != nullptr && can_merge_with(*right)) {
+ p.merge_kids(*this, *right, i, less);
+ return;
+ }
+
+ /*
+ * Surprisingly, a node in the B+ tree can violate the
+ * "minimally filled" rule for non-roots. It _can_ stay with
+ * fewer than half of the elements on board. The next remove
+ * from it or either of its siblings will probably refill it.
+ *
+ * Keeping 1 key on a non-root node is possible, but needs
+ * some special care -- if we removed that last key from
+ * the node, the code would try to refill it and would not
+ * be able to find this node's index at the parent (the
+ * call to index_for() above).
+ */
+ assert(_num_keys > 1);
+ }
+
+ void remove(int i, Less less) {
+ assert(i >= 0);
+ /*
+ * Update the matching separation key from above. It
+ * exists only if we're removing the 0th key, but for
+ * the left-most child it doesn't exist.
+ *
+ * Note that the latter check is crucial for clear()
+ * performance, as it always removes the left-most
+ * key; without this check each remove() would walk the
+ * tree upwards in vain.
+ */
+ if (strict_separation_key && i == 0 && !is_leftmost()) {
+ const Key& k = _keys[i].v;
+ node* p = this;
+
+ while (!p->is_root()) {
+ p = p->_parent;
+ int j = p->index_for(k, less) - 1;
+ if (j >= 0) {
+ p->_keys[j].replace(std::move(copy_key(_keys[1].v)));
+ break;
+ }
+ }
+ }
+
+ remove_from(i, less);
+ }
+
+public:
+ explicit node() : _num_keys(0) , _flags(0) , _parent(nullptr) { }
+
+ ~node() {
+ assert(_num_keys == 0);
+ assert(is_root() || !is_leaf() || (get_prev() == this && get_next() == this));
+ }
+
+ node(node&& other) noexcept : _flags(other._flags) {
+ if (is_leaf()) {
+ if (!is_rightmost()) {
+ set_next(other.get_next());
+ get_next()->set_prev(this);
+ } else {
+ other._rightmost_tree->do_set_right(this);
+ }
+
+ if (!is_leftmost()) {
+ set_prev(other.get_prev());
+ get_prev()->set_next(this);
+ } else {
+ other._kids[0]._leftmost_tree->do_set_left(this);
+ }
+
+ other._flags &= ~(NODE_LEFTMOST | NODE_RIGHTMOST);
+ other.set_next(&other);
+ other.set_prev(&other);
+ } else {
+ _kids[0].n = other._kids[0].n;
+ _kids[0].n->_parent = this;
+ }
+
+ other.move_to(*this, 0, other._num_keys);
+
+ if (!is_root()) {
+ _parent = other._parent;
+ int i = _parent->index_for(&other);
+ assert(_parent->_kids[i].n == &other);
+ _parent->_kids[i].n = this;
+ } else {
+ other._root_tree->do_set_root(this);
+ }
+ }
+
+ int index_for(data *d) const {
+ /*
+ * We could look up the data's new index with a binary search,
+ * but we don't have the key at hand
+ */
+
+ int i;
+
+ for (i = 1; i <= _num_keys; i++) {
+ if (_kids[i].d == d) {
+ break;
+ }
+ }
+ assert(i <= _num_keys);
+ return i;
+ }
+
+private:
+ class prealloc {
+ std::vector<node*> _nodes;
+ public:
+ bool empty() { return _nodes.empty(); }
+
+ void push() {
+ _nodes.push_back(node::create());
+ }
+
+ node* pop() {
+ assert(!_nodes.empty());
+ node* ret = _nodes.back();
+ _nodes.pop_back();
+ return ret;
+ }
+
+ void drain() {
+ while (!empty()) {
+ node::destroy(*pop());
+ }
+ }
+
+ ~prealloc() {
+ drain();
+ }
+ };
+
+ void fill_stats(struct stats& st) const {
+ if (is_leaf()) {
+ st.leaves_filled[_num_keys]++;
+ st.leaves++;
+ st.datas += _num_keys;
+ } else {
+ st.nodes_filled[_num_keys]++;
+ st.nodes++;
+ for (int i = 0; i <= _num_keys; i++) {
+ _kids[i].n->fill_stats(st);
+ }
+ }
+ }
+};
+
+/*
+ * The data class represents a data node (the actual data is stored "outside"
+ * of the tree). The tree::emplace() constructs the payload inside the
+ * data before inserting it into the tree.
+ */
+template <typename K, typename T, typename Less, int NS, key_search S, with_debug D>
+class data final {
+ friend class validator<K, T, Less, NS>;
+ template <typename c1, typename c2, typename c3, int s1, key_search p1, with_debug p2>
+ friend class tree<c1, c2, c3, s1, p1, p2>::iterator;
+
+ using node = class node<K, T, Less, NS, S, D>;
+
+ node* _leaf;
+ T value;
+
+public:
+ template <typename... Args>
+ static data* create(Args&&... args) {
+ return current_allocator().construct<data>(std::forward<Args>(args)...);
+ }
+
+ template <typename Func>
+ SEASTAR_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ static void destroy(data& d, Func&& disp) {
+ disp(d.value);
+ d._leaf = nullptr;
+ current_allocator().destroy(&d);
+ }
+
+ template <typename... Args>
+ data(Args&& ... args) : _leaf(nullptr), value(std::forward<Args>(args)...) {}
+
+ data(data&& other) noexcept : _leaf(other._leaf), value(std::move(other.value)) {
+ if (attached()) {
+ int i = _leaf->index_for(&other);
+ _leaf->_kids[i].d = this;
+ other._leaf = nullptr;
+ }
+ }
+
+ ~data() { assert(!attached()); }
+
+ bool attached() const { return _leaf != nullptr; }
+
+ void attach(node& to) {
+ assert(!attached());
+ _leaf = &to;
+ }
+
+ void reattach(node* to) {
+ assert(attached());
+ _leaf = to;
+ }
+
+ size_t storage_size(size_t payload) const {
+ return sizeof(data) - sizeof(T) + payload;
+ }
+
+ size_t storage_size() const {
+ return storage_size(size_for_allocation_strategy(value));
+ }
+
+ friend size_t size_for_allocation_strategy(const data& obj) {
+ return obj.storage_size();
+ }
+};
+
+} // namespace bplus
diff --git a/test/boost/bptree_test.cc b/test/boost/bptree_test.cc
new file mode 100644
index 000000000..ea8c6ce71
--- /dev/null
+++ b/test/boost/bptree_test.cc
@@ -0,0 +1,344 @@
+
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#define BOOST_TEST_MODULE bptree
+
+#include <boost/test/unit_test.hpp>
+#include <fmt/core.h>
+
+#include "utils/bptree.hh"
+#include "test/unit/bptree_key.hh"
+
+struct int_compare {
+ bool operator()(const int& a, const int& b) const { return a < b; }
+};
+
+using namespace bplus;
+using test_tree = tree<int, unsigned long, int_compare, 4, key_search::both, with_debug::yes>;
+
+BOOST_AUTO_TEST_CASE(test_ops_empty_tree) {
+ /* Sanity checks for no nullptr dereferences */
+ test_tree t(int_compare{});
+ t.erase(1);
+ t.find(1);
+}
+
+BOOST_AUTO_TEST_CASE(test_double_insert) {
+ /* No assertions should happen in ~tree */
+ test_tree t(int_compare{});
+ auto i = t.emplace(1, 1);
+ BOOST_REQUIRE(i.second);
+ i = t.emplace(1, 1);
+ BOOST_REQUIRE(!i.second);
+ t.erase(1);
+}
+
+BOOST_AUTO_TEST_CASE(test_cookie_find) {
+ struct int_to_key_compare {
+ bool operator()(const test_key& a, const int& b) const { return (int)a < b; }
+ bool operator()(const int& a, const test_key& b) const { return a < (int)b; }
+ bool operator()(const test_key& a, const test_key& b) const {
+ test_key_compare cmp;
+ return cmp(a, b);
+ }
+ };
+
+ using test_tree = tree<test_key, int, int_to_key_compare, 4, key_search::both, with_debug::yes>;
+
+ test_tree t(int_to_key_compare{});
+ t.emplace(test_key{1}, 132);
+
+ auto i = t.find(1);
+ BOOST_REQUIRE(*i == 132);
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_double_erase) {
+ test_tree t(int_compare{});
+ t.emplace(1, 1);
+ t.emplace(2, 2);
+ auto i = t.erase(1);
+ BOOST_REQUIRE(*i == 2);
+ i = t.erase(1);
+ BOOST_REQUIRE(i == t.end());
+ i = t.erase(2);
+ BOOST_REQUIRE(i == t.end());
+ t.erase(2);
+}
+
+BOOST_AUTO_TEST_CASE(test_remove_corner_case) {
+ /* Sanity check that erasure is precise */
+ test_tree t(int_compare{});
+ t.emplace(1, 1);
+ t.emplace(2, 123);
+ t.emplace(3, 3);
+ t.erase(1);
+ t.erase(3);
+ auto f = t.find(2);
+ BOOST_REQUIRE(*f == 123);
+ t.erase(2);
+}
+
+BOOST_AUTO_TEST_CASE(test_end_iterator) {
+ /* Check std::prev(end()) */
+ test_tree t(int_compare{});
+ t.emplace(1, 123);
+ auto i = std::prev(t.end());
+ BOOST_REQUIRE(*i == 123);
+ t.erase(1);
+}
+
+BOOST_AUTO_TEST_CASE(test_next_to_end_iterator) {
+ /* Same, but with "artificial" end iterator */
+ test_tree t(int_compare{});
+ auto i = t.emplace(1, 123).first;
+ i++;
+ BOOST_REQUIRE(i == t.end());
+ i--;
+ BOOST_REQUIRE(*i == 123);
+ t.erase(1);
+}
+
+BOOST_AUTO_TEST_CASE(test_clear) {
+ /* Quick check for tree::clear */
+ test_tree t(int_compare{});
+
+ for (int i = 0; i < 32; i++) {
+ t.emplace(i, i);
+ }
+
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_post_clear) {
+ /* Check that tree is workable after clear */
+ test_tree t(int_compare{});
+
+ t.emplace(1, 1);
+ t.clear();
+ t.emplace(2, 2);
+ t.erase(2);
+}
+
+BOOST_AUTO_TEST_CASE(test_iterator_erase) {
+ /* Check iterator::erase */
+ test_tree t(int_compare{});
+ auto it = t.emplace(2, 2);
+ t.emplace(1, 321);
+ it.first.erase(int_compare{});
+ BOOST_REQUIRE(*t.find(1) == 321);
+ t.erase(1);
+}
+
+BOOST_AUTO_TEST_CASE(test_iterator_equal) {
+ test_tree t(int_compare{});
+ auto i1 = t.emplace(1, 1);
+ auto i2 = t.emplace(2, 2);
+ auto i3 = t.find(1);
+ BOOST_REQUIRE(i1.first == i3);
+ BOOST_REQUIRE(i1.first != i2.first);
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_lower_bound) {
+ test_tree t(int_compare{});
+ t.emplace(1, 11);
+ t.emplace(3, 13);
+
+ bool match;
+ BOOST_REQUIRE(*t.lower_bound(0, match) == 11 && !match);
+ BOOST_REQUIRE(*t.lower_bound(1, match) == 11 && match);
+ BOOST_REQUIRE(*t.lower_bound(2, match) == 13 && !match);
+ BOOST_REQUIRE(*t.lower_bound(3, match) == 13 && match);
+ BOOST_REQUIRE(t.lower_bound(4, match) == t.end() && !match);
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_upper_bound) {
+ test_tree t(int_compare{});
+ t.emplace(1, 11);
+ t.emplace(3, 13);
+
+ BOOST_REQUIRE(*t.upper_bound(0) == 11);
+ BOOST_REQUIRE(*t.upper_bound(1) == 13);
+ BOOST_REQUIRE(*t.upper_bound(2) == 13);
+ BOOST_REQUIRE(t.upper_bound(3) == t.end());
+ BOOST_REQUIRE(t.upper_bound(4) == t.end());
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_iterator_index) {
+ /* Check insertion iterator ++ and duplicate key */
+ test_tree t(int_compare{});
+ t.emplace(1, 10);
+ t.emplace(3, 13);
+ auto i = t.emplace(2, 2).first;
+ i++;
+ BOOST_REQUIRE(*i == 13);
+ auto i2 = t.emplace(2, 2); /* 2nd insert finds the previous */
+ BOOST_REQUIRE(!i2.second);
+ i2.first++;
+ BOOST_REQUIRE(*(i2.first) == 13);
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_before) {
+ /* Check iterator::insert_before */
+ test_tree t(int_compare{});
+ auto i3 = t.emplace(3, 13).first;
+ auto i2 = i3.emplace_before(2, int_compare{}, 12);
+ BOOST_REQUIRE(++i2 == i3);
+ BOOST_REQUIRE(*i3 == 13);
+ BOOST_REQUIRE(*--i2 == 12);
+ BOOST_REQUIRE(*--i3 == 12);
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_before_end) {
+ /* The same but for end() iterator */
+ test_tree t(int_compare{});
+ auto i = t.emplace(1, 1).first;
+ auto i2 = t.end().emplace_before(2, int_compare{}, 12);
+ BOOST_REQUIRE(++i == i2);
+ BOOST_REQUIRE(++i2 == t.end());
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_before_end_empty) {
+ /* The same, but for empty tree */
+ test_tree t(int_compare{});
+ auto i = t.end().emplace_before(42, int_compare{}, 142);
+ BOOST_REQUIRE(i == t.begin());
+ t.erase(42);
+}
+
+BOOST_AUTO_TEST_CASE(test_iterators) {
+ test_tree t(int_compare{});
+
+ for (auto i = t.rbegin(); i != t.rend(); i++) {
+ BOOST_REQUIRE(false);
+ }
+ for (auto i = t.begin(); i != t.end(); i++) {
+ BOOST_REQUIRE(false);
+ }
+
+ t.emplace(1, 7);
+ t.emplace(2, 9);
+
+ {
+ auto i = t.begin();
+ BOOST_REQUIRE(*(i++) == 7);
+ BOOST_REQUIRE(*(i++) == 9);
+ BOOST_REQUIRE(i == t.end());
+ }
+
+ {
+ auto i = t.rbegin();
+ BOOST_REQUIRE(*(i++) == 9);
+ BOOST_REQUIRE(*(i++) == 7);
+ BOOST_REQUIRE(i == t.rend());
+ }
+
+ t.clear();
+}
+
+/*
+ * Special test that makes sure "self-iterator" works OK.
+ * See comment near the bptree::iterator(T* d) constructor
+ * for details.
+ */
+class tree_data {
+ int _key;
+ int _cookie;
+public:
+ explicit tree_data(int cookie) : _key(-1), _cookie(cookie) {}
+ tree_data(int key, int cookie) : _key(key), _cookie(cookie) {}
+ int cookie() const { return _cookie; }
+ int key() const {
+ assert(_key != -1);
+ return _key;
+ }
+};
+
+BOOST_AUTO_TEST_CASE(test_data_self_iterator) {
+ using test_tree = tree<int, tree_data, int_compare, 4, key_search::both, with_debug::yes>;
+
+ test_tree t(int_compare{});
+ auto i = t.emplace(1, 42);
+ BOOST_REQUIRE(i.second);
+
+ tree_data* d = &(*i.first);
+ BOOST_REQUIRE(d->cookie() == 42);
+
+ test_tree::iterator di(d);
+ BOOST_REQUIRE(di->cookie() == 42);
+
+ di.erase(int_compare{});
+ BOOST_REQUIRE(t.find(1) == t.end());
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_before_nokey) {
+ using test_tree = tree<int, tree_data, int_compare, 4, key_search::both, with_debug::yes>;
+
+ test_tree t(int_compare{});
+ auto i = t.emplace(2, 52).first;
+ auto ni = i.emplace_before(int_compare{}, 1, 42);
+ BOOST_REQUIRE(ni->cookie() == 42);
+ ni++;
+ BOOST_REQUIRE(ni == i);
+ t.clear();
+}
+
+
+BOOST_AUTO_TEST_CASE(test_self_iterator_rover) {
+ test_tree t(int_compare{});
+ auto i = t.emplace(2, 42).first;
+ unsigned long* d = &(*i);
+ test_tree::iterator di(d);
+
+ i = di.emplace_before(1, int_compare{}, 31);
+ BOOST_REQUIRE(*i == 31);
+ BOOST_REQUIRE(*(++i) == 42);
+ BOOST_REQUIRE(++i == t.end());
+ BOOST_REQUIRE(++di == t.end());
+ t.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_erase_range) {
+ /* Quick check for tree::erase(range) */
+ test_tree t(int_compare{});
+
+ for (int i = 0; i < 32; i++) {
+ t.emplace(i, i);
+ }
+
+ auto b = t.find(8);
+ auto e = t.find(25);
+ t.erase(b, e);
+
+ BOOST_REQUIRE(*t.find(7) == 7);
+ BOOST_REQUIRE(t.find(8) == t.end());
+ BOOST_REQUIRE(t.find(24) == t.end());
+ BOOST_REQUIRE(*t.find(25) == 25);
+
+ t.clear();
+}
diff --git a/test/perf/perf_bptree.cc b/test/perf/perf_bptree.cc
new file mode 100644
index 000000000..ddef05c2a
--- /dev/null
+++ b/test/perf/perf_bptree.cc
@@ -0,0 +1,246 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <seastar/core/app-template.hh>
+#include <seastar/core/thread.hh>
+#include <algorithm>
+#include <vector>
+#include <random>
+#include <fmt/core.h>
+#include "perf.hh"
+
+using per_key_t = int64_t;
+
+struct key_compare {
+ bool operator()(const per_key_t& a, const per_key_t& b) const { return a < b; }
+};
+
+#include "utils/bptree.hh"
+
+using namespace bplus;
+using namespace seastar;
+
+constexpr int TEST_NODE_SIZE = 4;
+
+/* On small node sizes like this test's, linear search works better */
+using test_tree = tree<per_key_t, unsigned long, key_compare, TEST_NODE_SIZE, key_search::linear>;
+
+class collection_tester {
+public:
+ virtual void insert(per_key_t k) = 0;
+ virtual void lower_bound(per_key_t k) = 0;
+ virtual void erase(per_key_t k) = 0;
+ virtual void drain(int batch) = 0;
+ virtual void show_stats() = 0;
+ virtual ~collection_tester() {};
+};
+
+class bptree_tester : public collection_tester {
+ test_tree _t;
+public:
+ bptree_tester() : _t(key_compare{}) {}
+ virtual void insert(per_key_t k) override { _t.emplace(k, 0); }
+ virtual void lower_bound(per_key_t k) override {
+ auto i = _t.lower_bound(k);
+ assert(i != _t.end());
+ }
+ virtual void erase(per_key_t k) override { _t.erase(k); }
+ virtual void drain(int batch) override {
+ int x = 0;
+ auto i = _t.begin();
+ while (i != _t.end()) {
+ i = i.erase(key_compare{});
+ if (++x % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+ }
+ virtual void show_stats() {
+ struct bplus::stats st = _t.get_stats();
+ fmt::print("nodes: {}\n", st.nodes);
+ for (int i = 0; i < (int)st.nodes_filled.size(); i++) {
+ fmt::print(" {}: {} ({}%)\n", i, st.nodes_filled[i], st.nodes_filled[i] * 100 / st.nodes);
+ }
+ fmt::print("leaves: {}\n", st.leaves);
+ for (int i = 0; i < (int)st.leaves_filled.size(); i++) {
+ fmt::print(" {}: {} ({}%)\n", i, st.leaves_filled[i], st.leaves_filled[i] * 100 / st.leaves);
+ }
+ fmt::print("datas: {}\n", st.datas);
+ }
+ virtual ~bptree_tester() {
+ _t.clear();
+ }
+};
+
+class set_tester : public collection_tester {
+ std::set<per_key_t> _s;
+public:
+ virtual void insert(per_key_t k) override { _s.insert(k); }
+ virtual void lower_bound(per_key_t k) override {
+ auto i = _s.lower_bound(k);
+ assert(i != _s.end());
+ }
+ virtual void erase(per_key_t k) override { _s.erase(k); }
+ virtual void drain(int batch) override {
+ int x = 0;
+ auto i = _s.begin();
+ while (i != _s.end()) {
+ i = _s.erase(i);
+ if (++x % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+ }
+ virtual void show_stats() { }
+ virtual ~set_tester() = default;
+};
+
+class map_tester : public collection_tester {
+ std::map<per_key_t, unsigned long> _m;
+public:
+ virtual void insert(per_key_t k) override { _m[k] = 0; }
+ virtual void lower_bound(per_key_t k) override {
+ auto i = _m.lower_bound(k);
+ assert(i != _m.end());
+ }
+ virtual void erase(per_key_t k) override { _m.erase(k); }
+ virtual void drain(int batch) override {
+ int x = 0;
+ auto i = _m.begin();
+ while (i != _m.end()) {
+ i = _m.erase(i);
+ if (++x % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+ }
+ virtual void show_stats() { }
+ virtual ~map_tester() = default;
+};
+
+int main(int argc, char **argv) {
+ namespace bpo = boost::program_options;
+ app_template app;
+ app.add_options()
+ ("count", bpo::value<int>()->default_value(5000000), "number of keys to fill the tree with")
+ ("batch", bpo::value<int>()->default_value(50), "number of operations between deferring points")
+ ("iters", bpo::value<int>()->default_value(1), "number of iterations")
+ ("col", bpo::value<std::string>()->default_value("bptree"), "collection to test")
+ ("test", bpo::value<std::string>()->default_value("erase"), "what to test (erase, drain, find)")
+ ("stats", bpo::value<bool>()->default_value(false), "show stats");
+
+ return app.run(argc, argv, [&app] {
+ auto count = app.configuration()["count"].as<int>();
+ auto iters = app.configuration()["iters"].as<int>();
+ auto batch = app.configuration()["batch"].as<int>();
+ auto col = app.configuration()["col"].as<std::string>();
+ auto tst = app.configuration()["test"].as<std::string>();
+ auto stats = app.configuration()["stats"].as<bool>();
+
+ return seastar::async([count, iters, batch, col, tst, stats] {
+ int rep = iters;
+ collection_tester* c;
+
+ if (col == "bptree") {
+ c = new bptree_tester();
+ } else if (col == "set") {
+ c = new set_tester();
+ } else if (col == "map") {
+ c = new map_tester();
+ } else {
+ fmt::print("Unknown collection\n");
+ return;
+ }
+
+ std::vector<per_key_t> keys;
+
+ for (per_key_t i = 0; i < count; i++) {
+ keys.push_back(i + 1);
+ }
+
+ std::random_device rd;
+ std::mt19937 g(rd());
+
+ fmt::print("Inserting {:d} k:v pairs into {} {:d} times\n", count, col, iters);
+
+ again:
+ std::shuffle(keys.begin(), keys.end(), g);
+ seastar::thread::yield();
+
+ auto d = duration_in_seconds([&] {
+ for (int i = 0; i < count; i++) {
+ c->insert(keys[i]);
+ if ((i + 1) % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+ });
+
+ fmt::print("fill: {:.6f} ms\n", d.count() * 1000);
+
+ if (stats) {
+ c->show_stats();
+ }
+
+ if (tst == "erase") {
+ std::shuffle(keys.begin(), keys.end(), g);
+ seastar::thread::yield();
+
+ d = duration_in_seconds([&] {
+ for (int i = 0; i < count; i++) {
+ c->erase(keys[i]);
+ if ((i + 1) % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+ });
+
+ fmt::print("erase: {:.6f} ms\n", d.count() * 1000);
+ } else if (tst == "drain") {
+ d = duration_in_seconds([&] {
+ c->drain(batch);
+ });
+
+ fmt::print("drain: {:.6f} ms\n", d.count() * 1000);
+ } else if (tst == "find") {
+ std::shuffle(keys.begin(), keys.end(), g);
+ seastar::thread::yield();
+
+ d = duration_in_seconds([&] {
+ for (int i = 0; i < count; i++) {
+ c->lower_bound(keys[i]);
+ if ((i + 1) % batch == 0) {
+ seastar::thread::yield();
+ }
+ }
+ });
+
+ fmt::print("find: {:.6f} ms\n", d.count() * 1000);
+ }
+
+ if (--rep > 0) {
+ goto again;
+ }
+
+ delete c;
+ });
+ });
+}
diff --git a/test/unit/bptree_compaction_test.cc b/test/unit/bptree_compaction_test.cc
new file mode 100644
index 000000000..9b1b48072
--- /dev/null
+++ b/test/unit/bptree_compaction_test.cc
@@ -0,0 +1,210 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <seastar/core/app-template.hh>
+#include <seastar/core/thread.hh>
+#include <map>
+#include <vector>
+#include <random>
+#include <string>
+#include <iostream>
+#include <fmt/core.h>
+#include "utils/logalloc.hh"
+
+constexpr int TEST_NODE_SIZE = 7;
+
+#include "bptree_key.hh"
+#include "utils/bptree.hh"
+#include "bptree_validation.hh"
+
+using namespace bplus;
+using namespace seastar;
+
+class test_data {
+ int _value;
+public:
+ test_data() : _value(0) {}
+ test_data(test_key& k) : _value((int)k + 10) {}
+
+ operator unsigned long() const { return _value; }
+ bool match_key(const test_key& k) const { return _value == (int)k + 10; }
+};
+using test_tree = tree<test_key, test_data, test_key_compare, TEST_NODE_SIZE, key_search::both, with_debug::yes>;
+using test_validator = validator<test_key, test_data, test_key_compare, TEST_NODE_SIZE>;
+
+class reference {
+ reference* _ref = nullptr;
+public:
+ reference() = default;
+ reference(const reference& other) = delete;
+
+ reference(reference&& other) noexcept : _ref(other._ref) {
+ if (_ref != nullptr) {
+ _ref->_ref = this;
+ }
+ other._ref = nullptr;
+ }
+
+ ~reference() {
+ if (_ref != nullptr) {
+ _ref->_ref = nullptr;
+ }
+ }
+
+ void link(reference& other) {
+ assert(_ref == nullptr);
+ _ref = &other;
+ other._ref = this;
+ }
+
+ reference* get() {
+ assert(_ref != nullptr);
+ return _ref;
+ }
+};
+
+class tree_pointer {
+ reference _ref;
+
+ class tree_wrapper {
+ friend class tree_pointer;
+ test_tree _tree;
+ reference _ref;
+ public:
+ tree_wrapper() : _tree(test_key_compare{}) {}
+ };
+
+ tree_wrapper* get_wrapper() {
+ return boost::intrusive::get_parent_from_member(_ref.get(), &tree_wrapper::_ref);
+ }
+
+public:
+
+ tree_pointer(const tree_pointer& other) = delete;
+ tree_pointer(tree_pointer&& other) = delete;
+
+ tree_pointer() {
+ tree_wrapper *t = current_allocator().construct<tree_wrapper>();
+ _ref.link(t->_ref);
+ }
+
+ test_tree* operator->() {
+ tree_wrapper *tw = get_wrapper();
+ return &tw->_tree;
+ }
+
+ test_tree& operator*() {
+ tree_wrapper *tw = get_wrapper();
+ return tw->_tree;
+ }
+
+ ~tree_pointer() {
+ tree_wrapper *tw = get_wrapper();
+ current_allocator().destroy(tw);
+ }
+};
+
+int main(int argc, char **argv) {
+ namespace bpo = boost::program_options;
+ app_template app;
+ app.add_options()
+ ("count", bpo::value<int>()->default_value(10000), "number of keys to fill the tree with")
+ ("iters", bpo::value<int>()->default_value(13), "number of iterations")
+ ("verb", bpo::value<bool>()->default_value(false), "be verbose");
+
+ return app.run(argc, argv, [&app] {
+ auto count = app.configuration()["count"].as<int>();
+ auto rep = app.configuration()["iters"].as<int>();
+ auto verb = app.configuration()["verb"].as<bool>();
+
+ return seastar::async([count, rep, verb] {
+ int iter = rep;
+ std::vector<int> keys;
+ for (int i = 0; i < count; i++) {
+ keys.push_back(i + 1);
+ }
+
+ std::random_device rd;
+ std::mt19937 g(rd());
+
+ fmt::print("Compacting {:d} k:v pairs {:d} times\n", count, iter);
+
+ test_validator tv;
+
+ logalloc::region mem;
+
+ with_allocator(mem.allocator(), [&] {
+ tree_pointer t;
+
+ again:
+ {
+ std::shuffle(keys.begin(), keys.end(), g);
+
+ logalloc::reclaim_lock rl(mem);
+
+ for (int i = 0; i < count; i++) {
+ test_key k(keys[i]);
+
+ auto ti = t->emplace(std::move(copy_key(k)), k);
+ assert(ti.second);
+ seastar::thread::maybe_yield();
+ }
+ }
+
+ mem.full_compaction();
+
+ if (verb) {
+ fmt::print("After fill + compact\n");
+ tv.print_tree(*t, '|');
+ }
+
+ tv.validate(*t);
+
+ {
+ std::shuffle(keys.begin(), keys.end(), g);
+
+ logalloc::reclaim_lock rl(mem);
+
+ for (int i = 0; i < count; i++) {
+ test_key k(keys[i]);
+
+ t->erase(k);
+ seastar::thread::maybe_yield();
+ }
+ }
+
+ mem.full_compaction();
+
+ if (verb) {
+ fmt::print("After erase + compact\n");
+ tv.print_tree(*t, '|');
+ }
+
+ tv.validate(*t);
+
+ if (--iter > 0) {
+ seastar::thread::yield();
+ goto again;
+ }
+ });
+ });
+ });
+}
diff --git a/test/unit/bptree_stress_test.cc b/test/unit/bptree_stress_test.cc
new file mode 100644
index 000000000..3060b1a7b
--- /dev/null
+++ b/test/unit/bptree_stress_test.cc
@@ -0,0 +1,236 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <seastar/core/app-template.hh>
+#include <seastar/core/thread.hh>
+#include <map>
+#include <vector>
+#include <random>
+#include <string>
+#include <iostream>
+#include <fmt/core.h>
+#include <fmt/ostream.h>
+
+constexpr int TEST_NODE_SIZE = 16;
+
+#include "bptree_key.hh"
+#include "utils/bptree.hh"
+#include "bptree_validation.hh"
+
+using namespace bplus;
+using namespace seastar;
+
+class test_data {
+ int _value;
+public:
+ test_data() : _value(0) {}
+ test_data(test_key& k) : _value((int)k + 10) {}
+
+ operator unsigned long() const { return _value; }
+ bool match_key(const test_key& k) const { return _value == (int)k + 10; }
+};
+
+std::ostream& operator<<(std::ostream& os, test_data d) {
+ os << (unsigned long)d;
+ return os;
+}
+
+using test_tree = tree<test_key, test_data, test_key_compare, TEST_NODE_SIZE, key_search::both, with_debug::yes>;
+using test_node = typename test_tree::node;
+using test_validator = validator<test_key, test_data, test_key_compare, TEST_NODE_SIZE>;
+using test_iterator_checker = iterator_checker<test_key, test_data, test_key_compare, TEST_NODE_SIZE>;
+
+int main(int argc, char **argv) {
+ namespace bpo = boost::program_options;
+ app_template app;
+ app.add_options()
+ ("count", bpo::value<int>()->default_value(4132), "number of keys to fill the tree with")
+ ("iters", bpo::value<int>()->default_value(9), "number of iterations")
+ ("keys", bpo::value<std::string>()->default_value("rand"), "how to generate keys (rand, asc, desc)")
+ ("verb", bpo::value<bool>()->default_value(false), "be verbose");
+
+ return app.run(argc, argv, [&app] {
+ auto count = app.configuration()["count"].as<int>();
+ auto iters = app.configuration()["iters"].as<int>();
+ auto ks = app.configuration()["keys"].as<std::string>();
+ auto verb = app.configuration()["verb"].as<bool>();
+
+ return seastar::async([count, iters, ks, verb] {
+ int rep = iters;
+ auto *t = new test_tree(test_key_compare{});
+ std::map<int, unsigned long> oracle;
+
+ int p = count / 10;
+ if (p == 0) {
+ p = 1;
+ }
+
+ std::vector<int> keys;
+
+ for (int i = 0; i < count; i++) {
+ keys.push_back(i + 1);
+ }
+
+ std::random_device rd;
+ std::mt19937 g(rd());
+
+ fmt::print("Inserting {:d} k:v pairs {:d} times\n", count, rep);
+
+ test_validator tv;
+
+ if (ks == "desc") {
+ fmt::print("Reversing keys vector\n");
+ std::reverse(keys.begin(), keys.end());
+ }
+
+ bool shuffle = ks == "rand";
+ if (shuffle) {
+ fmt::print("Will shuffle keys each iteration\n");
+ }
+
+
+ again:
+ auto* itc = new test_iterator_checker(tv, *t);
+
+ if (shuffle) {
+ std::shuffle(keys.begin(), keys.end(), g);
+ }
+
+ for (int i = 0; i < count; i++) {
+ test_key k(keys[i]);
+
+ if (verb) {
+ fmt::print("+++ {}\n", (int)k);
+ }
+
+ if (rep % 2 != 1) {
+ auto ir = t->emplace(std::move(copy_key(k)), k);
+ assert(ir.second);
+ } else {
+ auto ir = t->lower_bound(k);
+ ir.emplace_before(std::move(copy_key(k)), test_key_compare{}, k);
+ }
+ oracle[keys[i]] = keys[i] + 10;
+
+ if (verb) {
+ fmt::print("Validating\n");
+ tv.print_tree(*t, '|');
+ }
+
+ /* Limit validation rate for many keys */
+ if (i % (i/1000 + 1) == 0) {
+ tv.validate(*t);
+ }
+
+ if (i % 7 == 0) {
+ if (!itc->step()) {
+ delete itc;
+ itc = new test_iterator_checker(tv, *t);
+ }
+ }
+
+ seastar::thread::maybe_yield();
+ }
+
+ auto sz = t->size_slow();
+ if (sz != (size_t)count) {
+ fmt::print("Size {} != count {}\n", sz, count);
+ throw "size";
+ }
+
+ auto ti = t->begin();
+ for (auto oe : oracle) {
+ if (*ti != oe.second) {
+ fmt::print("Data mismatch {} vs {}\n", oe.second, *ti);
+ throw "oracle";
+ }
+ ti++;
+ }
+
+ if (shuffle) {
+ std::shuffle(keys.begin(), keys.end(), g);
+ }
+
+ for (int i = 0; i < count; i++) {
+ test_key k(keys[i]);
+
+ /*
+ * kill iterator if we're removing what it points to,
+ * otherwise it's not invalidated
+ */
+ if (itc->here(k)) {
+ delete itc;
+ itc = nullptr;
+ }
+
+ if (verb) {
+ fmt::print("--- {}\n", (int)k);
+ }
+
+ if (rep % 3 != 2) {
+ t->erase(k);
+ } else {
+ auto ri = t->find(k);
+ auto ni = ri;
+ ni++;
+ auto eni = ri.erase(test_key_compare{});
+ assert(ni == eni);
+ }
+
+ oracle.erase(keys[i]);
+
+ if (verb) {
+ fmt::print("Validating\n");
+ tv.print_tree(*t, '|');
+ }
+
+ if ((count-i) % ((count-i)/1000 + 1) == 0) {
+ tv.validate(*t);
+ }
+
+ if (itc == nullptr) {
+ itc = new test_iterator_checker(tv, *t);
+ }
+
+ if (i % 5 == 0) {
+ if (!itc->step()) {
+ delete itc;
+ itc = new test_iterator_checker(tv, *t);
+ }
+ }
+
+ seastar::thread::maybe_yield();
+ }
+
+ delete itc;
+
+ if (--rep > 0) {
+ if (verb) {
+ fmt::print("{:d} iterations left\n", rep);
+ }
+ goto again;
+ }
+
+ oracle.clear();
+ delete t;
+ });
+ });
+}
--
2.20.1

Pavel Emelyanov <xemul@scylladb.com>
May 19, 2020, 9:15:34 AM
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The collection is a K:V store

bplus::tree<Key = K, Value = array_trusted_bounds<V>>

It will be used as the partitions cache. The outer tree is used to
quickly map a token to a cache_entry, the inner array -- to resolve
the (expected to be rare) hash collisions.

It also must be equipped with two comparators -- a "less" one for
keys and a full one for values. The latter is not kept on board,
but is required on all calls.

The core API consists of just 2 calls

- Heterogeneous lower_bound(search_key) -> iterator : finds the
element that's greater than or equal to the provided search key

Besides the iterator, the call returns a "hint" object
that helps the next call.

- emplace_before(iterator, key, hint, ...) : the call constructs
the element right before the given iterator. The key and hint
are needed for a more optimal algorithm, but are, strictly
speaking, not required.

Adding an entry to the double_decker may result in growing the
node's array. Here the B+ iterator's .reconstruct() method
comes into play. The new array is created, old elements are
moved onto it, then the fresh node replaces the old one.

// TODO: Ideally this should be turned into the
// template <typename OuterCollection, typename InnerCollection>
// but for now the double_decker still has some intimate knowledge
// about what outer and inner collections are.

Insertion into this collection _may_ invalidate iterators, but
may also leave them intact. Invalidation only happens in case of a
hash collision, which can be clearly seen from the hint object, so
there's good room for improvement.

The main usage by row_cache (the find_or_create_entry) looks like

cache_entry find_or_create_entry() {
    bound_hint hint;

    it = lower_bound(decorated_key, &hint);
    if (!found) {
        it = emplace_before(it, decorated_key.token(), hint,
                            <constructor args>);
    }
    return *it;
}

Now the hint. It contains three booleans:

- match: set to true when the "greater or equal" condition
evaluated to "equal". This frees the caller from the need
to manually check whether the entry returned matches the
search key or the new one should be inserted.

This is the "!found" check from the above snippet.

To explain the next 2 bools, here's a small example. Consider
the tree containing two elements {token, partition key}:

{ 3, "a" }, { 5, "z" }

As the collection is sorted they go in the order shown. Next,
this is what the lower_bound would return for some cases:

{ 3, "z" } -> { 5, "z" }
{ 4, "a" } -> { 5, "z" }
{ 5, "a" } -> { 5, "z" }

The lower bound for all three elements is the same, but the
code flows for emplacing each of them before it differ drastically.

{ 3, "z" } : need to get the previous element from the tree and
push the new element to the back of its vector
{ 4, "a" } : need to create a new element in the tree and populate
its empty vector with the single element
{ 5, "a" } : need to put the new element in the found tree
element right before the found vector position

To make one of the above decisions, .emplace_before would need
to perform another set of comparisons of keys and elements.
Fortunately, the needed information is already known inside the
lower_bound call and can be reported via the hint.

That said,

- key_match: set to true if tree.lower_bound() found the element
for the Key (which is the token). For the above examples this will be
true for cases 3z and 5a.

- key_tail: set to true if the tree element was found, but when
comparing values from its array the bounding element turned out
to belong to the next tree element and the iterator was ++-ed.
For the above examples this would be true for case 3z only.

And last, but not least -- the "erase self" feature, which, given
only the cache_entry pointer, removes the entry from the
collection. To make this happen we need to take two steps:

1. get the array the entry sits in
2. get the B+ tree node the array sits in

Both methods are provided by array_trusted_bounds and bplus::tree.
So, when we need to get an iterator from a given T pointer, the
algorithm looks like:

- Walk back the T array until hitting the head element
- Call array_trusted_bounds::from_element() getting the array
- Construct b+ iterator from obtained array
- Construct the double_decker iterator from b+ iterator and from
the number of "steps back" from above
- Call double_decker::iterator.erase()

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
configure.py | 1 +
utils/double-decker.hh | 356 ++++++++++++++++++++++++++++
test/boost/double_decker_test.cc | 386 +++++++++++++++++++++++++++++++
3 files changed, 743 insertions(+)
create mode 100644 utils/double-decker.hh
create mode 100644 test/boost/double_decker_test.cc

diff --git a/configure.py b/configure.py
index 393426fc0..593f982eb 100755
--- a/configure.py
+++ b/configure.py
@@ -383,6 +383,7 @@ scylla_tests = set([
'test/boost/vint_serialization_test',
'test/boost/virtual_reader_test',
'test/boost/bptree_test',
+ 'test/boost/double_decker_test',
'test/manual/ec2_snitch_test',
'test/manual/gce_snitch_test',
'test/manual/gossip',
diff --git a/utils/double-decker.hh b/utils/double-decker.hh
new file mode 100644
index 000000000..798760514
--- /dev/null
+++ b/utils/double-decker.hh
@@ -0,0 +1,356 @@
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#pragma once
+
+#include <seastar/util/concepts.hh>
+#include "utils/bptree.hh"
+#include "utils/array_trusted_bounds.hh"
+#include <fmt/core.h>
+
+/*
+ * The double-decker is an ordered container of key:value pairs,
+ * with the pairs sorted by both key and value (key first).
+ *
+ * Key collisions are expected to be rare enough to afford holding
+ * the values in a sorted array with the help of linear algorithms.
+ */
+
+SEASTAR_CONCEPT(
+ template <typename T1, typename T2, typename Compare>
+ concept bool Comparable = requires (const T1& a, const T2& b, Compare cmp) {
+ { cmp(a, b) } -> int;
+ };
+)
+
+template <typename Key, typename T, typename Less, typename Compare, int NodeSize,
+ bplus::key_search Search = bplus::key_search::binary, bplus::with_debug Debug = bplus::with_debug::no>
+SEASTAR_CONCEPT( requires Comparable<T, T, Compare> && std::is_nothrow_move_constructible_v<T> )
+class double_decker {
+public:
+ using inner_array = array_trusted_bounds<T>;
+ using outer_tree = bplus::tree<Key, inner_array, Less, NodeSize, Search, Debug>;
+ using outer_iterator = typename outer_tree::iterator;
+
+private:
+ outer_tree _tree;
+
+public:
+ class iterator {
+ friend class double_decker;
+
+ using inner_array = typename double_decker::inner_array;
+ using outer_iterator = typename double_decker::outer_iterator;
+
+ outer_iterator _bucket;
+ int _idx;
+
+ public:
+ using iterator_category = std::bidirectional_iterator_tag;
+ using difference_type = ssize_t;
+ using value_type = T;
+ using pointer = value_type*;
+ using reference = value_type&;
+
+ iterator() = default;
+ iterator(outer_iterator bkt, int idx) : _bucket(bkt), _idx(idx) {}
+
+ iterator(T* ptr) : _idx(0) {
+ inner_array& arr = inner_array::from_element(ptr, _idx);
+ _bucket = outer_iterator(&arr);
+ }
+
+ reference operator*() const { return (*_bucket)[_idx]; }
+ pointer operator->() const { return &((*_bucket)[_idx]); }
+
+ iterator& operator++() {
+ if ((*_bucket)[_idx++].is_tail()) {
+ _bucket++;
+ _idx = 0;
+ }
+
+ return *this;
+ }
+
+ iterator operator++(int) {
+ iterator cur = *this;
+ operator++();
+ return cur;
+ }
+
+ iterator& operator--() {
+ if (_idx-- == 0) {
+ _bucket--;
+ _idx = _bucket->index_of(_bucket->end()) - 1;
+ }
+
+ return *this;
+ }
+
+ iterator operator--(int) {
+ iterator cur = *this;
+ operator--();
+ return cur;
+ }
+
+ bool operator==(const iterator& o) const { return _bucket == o._bucket && _idx == o._idx; }
+ bool operator!=(const iterator& o) const { return !(*this == o); }
+
+ template <typename Func>
+ SEASTAR_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ iterator erase_and_dispose(Less less, Func&& disp) {
+ disp(**this);
+
+ if (_bucket->is_single_element()) {
+ outer_iterator bkt = _bucket.erase(less);
+ return iterator(bkt, 0);
+ }
+
+ bool tail = (*_bucket)[_idx].is_tail();
+ _bucket->erase(_idx);
+ if (tail) {
+ _bucket++;
+ _idx = 0;
+ }
+
+ return *this;
+ }
+
+ iterator erase(Less less) { return erase_and_dispose(less, bplus::default_dispose<T>); }
+ };
+
+ /*
+ * Structure that sheds some more light on how lower_bound
+ * actually found the bounding element.
+ */
+ struct bound_hint {
+ /*
+ * Set to true if the element fully matched the key
+ * according to Compare
+ */
+ bool match;
+ /*
+ * Set to true if the bucket for the given key exists
+ */
+ bool key_match;
+ /*
+ * Set to true if the given key is greater than anything
+ * in the bucket and the iterator was switched to the next
+ * one (or when key_match is false)
+ */
+ bool key_tail;
+
+ /*
+ * This helper says whether emplace will invalidate (some)
+ * iterators or not. Emplacing with !key_match creates a new
+ * node in the B+ tree, which doesn't invalidate iterators.
+ * Otherwise some existing B+ data node will be reconstructed,
+ * so iterators into that node will become invalid.
+ */
+ bool emplace_keeps_iterators() const { return !key_match; }
+ };
+
+ iterator begin() { return iterator(_tree.begin(), 0); }
+ iterator end() { return iterator(_tree.end(), 0); }
+
+ double_decker(Less less) : _tree(less) { }
+
+ double_decker(const double_decker& other) = delete;
+ double_decker(double_decker&& other) noexcept : _tree(std::move(other._tree)) {}
+
+ iterator insert(Key k, T value, Compare cmp) {
+ std::pair<outer_iterator, bool> oip = _tree.emplace(std::move(k), std::move(value));
+ outer_iterator& bkt = oip.first;
+ int idx = 0;
+
+ if (!oip.second) {
+ /*
+ * Unlikely, but in this case we reconstruct the array. The value
+ * must not have been moved by emplace() above.
+ */
+ idx = bkt->index_of(bkt->lower_bound(value, cmp));
+ size_t new_size = (bkt->size() + 1) * sizeof(T);
+ bkt.reconstruct(bkt.storage_size(new_size), *bkt,
+ typename inner_array::grow_tag{idx}, std::move(value));
+ }
+
+ return iterator(bkt, idx);
+ }
+
+ template <typename... Args>
+ iterator emplace_before(iterator i, Key k, const bound_hint& hint, Args&&... args) {
+ assert(!hint.match);
+ outer_iterator& bucket = i._bucket;
+
+ if (!hint.key_match) {
+ /*
+ * The most likely case -- no key conflict, i.e. the bucket
+ * was not found and i points to the next one. Just go ahead,
+ * emplace the new bucket before i and push the 0th element
+ * into it.
+ */
+ outer_iterator nb = bucket.emplace_before(std::move(k), _tree.less(), std::forward<Args>(args)...);
+ return iterator(nb, 0);
+ }
+
+ /*
+ * Key conflict -- need to expand some inner array, but there are
+ * still two cases: either the bounding element is in k's bucket,
+ * or the bound search overflowed and switched to the next one.
+ */
+
+ int idx = i._idx;
+
+ if (hint.key_tail) {
+ /*
+ * The latter case -- i points to the next one. Need to shift
+ * back and append the new element to its tail.
+ */
+ bucket--;
+ idx = bucket->index_of(bucket->end());
+ }
+
+ size_t new_size = (bucket->size() + 1) * sizeof(T);
+ bucket.reconstruct(bucket.storage_size(new_size), *bucket,
+ typename inner_array::grow_tag{idx}, std::forward<Args>(args)...);
+ return iterator(bucket, idx);
+ }
+
+ template <typename K = Key>
+ SEASTAR_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator find(const K& key, Compare cmp) {
+ outer_iterator bkt = _tree.find(key);
+ int idx = 0;
+
+ if (bkt != _tree.end()) {
+ bool match = false;
+ idx = bkt->index_of(bkt->lower_bound(key, cmp, match));
+ if (!match) {
+ bkt = _tree.end();
+ idx = 0;
+ }
+ }
+
+ return iterator(bkt, idx);
+ }
+
+ template <typename K = Key>
+ SEASTAR_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator lower_bound(const K& key, Compare cmp, bound_hint& hint) {
+ outer_iterator bkt = _tree.lower_bound(key, hint.key_match);
+
+ hint.key_tail = false;
+ hint.match = false;
+
+ if (bkt == _tree.end() || !hint.key_match) {
+ return iterator(bkt, 0);
+ }
+
+ int i = bkt->index_of(bkt->lower_bound(key, cmp, hint.match));
+
+ if (i != 0 && (*bkt)[i - 1].is_tail()) {
+ /*
+ * The lower_bound is past the last element -- shift
+ * to the next bucket's 0th one.
+ */
+ bkt++;
+ i = 0;
+ hint.key_tail = true;
+ }
+
+ return iterator(bkt, i);
+ }
+
+ template <typename K = Key>
+ SEASTAR_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator lower_bound(const K& key, Compare cmp) {
+ bound_hint hint;
+ return lower_bound(key, cmp, hint);
+ }
+
+ template <typename K = Key>
+ SEASTAR_CONCEPT( requires Comparable<K, T, Compare> )
+ iterator upper_bound(const K& key, Compare cmp) {
+ bool key_match;
+ outer_iterator bkt = _tree.lower_bound(key, key_match);
+
+ if (bkt == _tree.end() || !key_match) {
+ return iterator(bkt, 0);
+ }
+
+ int i = bkt->index_of(bkt->upper_bound(key, cmp));
+
+ if (i != 0 && (*bkt)[i - 1].is_tail()) {
+ // Beyond the end() boundary
+ bkt++;
+ i = 0;
+ }
+
+ return iterator(bkt, i);
+ }
+
+ template <typename Func>
+ SEASTAR_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ void clear_and_dispose(Func&& disp) {
+ _tree.clear_and_dispose([&disp] (inner_array& arr) {
+ arr.for_each(disp);
+ });
+ }
+
+ void clear() { clear_and_dispose(bplus::default_dispose<T>); }
+
+ template <typename Func>
+ SEASTAR_CONCEPT(requires requires (Func f, T val) { { f(val) } -> void; } )
+ iterator erase_and_dispose(iterator begin, iterator end, Func&& disp) {
+ // Drop the tail of the starting bucket if it's not fully erased
+ while (begin._idx != 0) {
+ begin = begin.erase_and_dispose(_tree.less(), disp);
+ }
+
+ // Drop all the buckets in between
+ outer_iterator nb = _tree.erase_and_dispose(begin._bucket, end._bucket, [&disp] (inner_array& arr) {
+ arr.for_each(disp);
+ });
+
+ assert(nb == end._bucket);
+
+ /*
+ * Drop the head of the ending bucket. Every erased element is the 0th
+ * one; erasing it shifts the rest left and reconstructs the array,
+ * thus we cannot rely on end to keep either _bucket or _idx.
+ *
+ * That said -- just erase the required number of elements. The corner
+ * case when end points to the tree end is handled: _idx is 0 then.
+ */
+ iterator next(nb, 0);
+ while (end._idx-- != 0) {
+ next = next.erase_and_dispose(_tree.less(), disp);
+ }
+
+ return next;
+ }
+
+ iterator erase(iterator begin, iterator end) {
+ return erase_and_dispose(begin, end, bplus::default_dispose<T>);
+ }
+
+ bool empty() const { return _tree.empty(); }
+};
diff --git a/test/boost/double_decker_test.cc b/test/boost/double_decker_test.cc
new file mode 100644
index 000000000..0ffc35574
--- /dev/null
+++ b/test/boost/double_decker_test.cc
@@ -0,0 +1,386 @@
+
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+/*
+ * This file is part of Scylla.
+ *
+ * Scylla is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU Affero General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * Scylla is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#define BOOST_TEST_MODULE double_decker
+
+#include <seastar/core/print.hh>
+#include <boost/test/unit_test.hpp>
+#include <fmt/core.h>
+#include <string>
+
+#include "utils/double-decker.hh"
+#include "test/lib/random_utils.hh"
+
+class compound_key {
+public:
+ int key;
+ std::string sub_key;
+
+ compound_key(int k, std::string sk) noexcept : key(k), sub_key(sk) {}
+
+ compound_key(const compound_key& other) = delete;
+ compound_key(compound_key&& other) noexcept : key(other.key), sub_key(std::move(other.sub_key)) {}
+
+ compound_key& operator=(const compound_key& other) = delete;
+ compound_key& operator=(compound_key&& other) noexcept {
+ key = other.key;
+ sub_key = std::move(other.sub_key);
+ return *this;
+ }
+
+ std::string format() const {
+ return seastar::format("{}.{}", key, sub_key);
+ }
+
+ bool operator==(const compound_key& other) const {
+ return key == other.key && sub_key == other.sub_key;
+ }
+
+ bool operator!=(const compound_key& other) const { return !(*this == other); }
+
+ struct compare {
+ int operator()(const int& a, const int& b) const { return a - b; }
+ int operator()(const int& a, const compound_key& b) const { return a - b.key; }
+ int operator()(const compound_key& a, const int& b) const { return a.key - b; }
+
+ int operator()(const compound_key& a, const compound_key& b) const {
+ if (a.key != b.key) {
+ return this->operator()(a.key, b.key);
+ } else {
+ return a.sub_key.compare(b.sub_key);
+ }
+ }
+ };
+
+ struct less_compare {
+ compare cmp;
+
+ template <typename A, typename B>
+ bool operator()(const A& a, const B& b) const {
+ return cmp(a, b) < 0;
+ }
+ };
+};
+
+class test_data {
+ compound_key _key;
+ bool _head = false;
+ bool _tail = false;
+ bool _train = false;
+
+ int *_cookie;
+ int *_cookie2;
+public:
+ bool is_head() const { return _head; }
+ bool is_tail() const { return _tail; }
+ bool with_train() const { return _train; }
+ void set_head(bool v) { _head = v; }
+ void set_tail(bool v) { _tail = v; }
+ void set_train(bool v) { _train = v; }
+
+ test_data(int key, std::string sub) : _key(key, sub), _cookie(new int(0)), _cookie2(new int(0)) {}
+
+ test_data(const test_data& other) = delete;
+ test_data(test_data&& other) noexcept : _key(std::move(other._key)),
+ _head(other._head), _tail(other._tail), _train(other._train),
+ _cookie(other._cookie), _cookie2(new int(0)) {
+ other._cookie = nullptr;
+ }
+
+ ~test_data() {
+ if (_cookie != nullptr) {
+ delete _cookie;
+ }
+ delete _cookie2;
+ }
+
+ bool operator==(const compound_key& k) { return _key == k; }
+
+ test_data& operator=(const test_data& other) = delete;
+ test_data& operator=(test_data&& other) = delete;
+
+ std::string format() const { return _key.format(); }
+
+ struct compare {
+ compound_key::compare kcmp;
+ int operator()(const int& a, const int& b) { return kcmp(a, b); }
+ int operator()(const compound_key& a, const int& b) { return kcmp(a.key, b); }
+ int operator()(const int& a, const compound_key& b) { return kcmp(a, b.key); }
+ int operator()(const compound_key& a, const compound_key& b) { return kcmp(a, b); }
+ int operator()(const compound_key& a, const test_data& b) { return kcmp(a, b._key); }
+ int operator()(const test_data& a, const compound_key& b) { return kcmp(a._key, b); }
+ int operator()(const test_data& a, const test_data& b) { return kcmp(a._key, b._key); }
+ };
+};
+
+using collection = double_decker<int, test_data, compound_key::less_compare, test_data::compare, 4,
+ bplus::key_search::both, bplus::with_debug::yes>;
+using oracle = std::set<compound_key, compound_key::less_compare>;
+
+BOOST_AUTO_TEST_CASE(test_lower_bound) {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+
+ c.insert(3, test_data(3, "e"), cmp);
+ c.insert(5, test_data(5, "i"), cmp);
+ c.insert(5, test_data(5, "o"), cmp);
+
+ collection::bound_hint h;
+
+ BOOST_REQUIRE(*c.lower_bound(compound_key(2, "a"), cmp, h) == compound_key(3, "e") && !h.key_match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(3, "a"), cmp, h) == compound_key(3, "e") && h.key_match && !h.key_tail && !h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(3, "e"), cmp, h) == compound_key(3, "e") && h.key_match && !h.key_tail && h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(3, "o"), cmp, h) == compound_key(5, "i") && h.key_match && h.key_tail && !h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(4, "i"), cmp, h) == compound_key(5, "i") && !h.key_match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(5, "a"), cmp, h) == compound_key(5, "i") && h.key_match && !h.key_tail && !h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(5, "i"), cmp, h) == compound_key(5, "i") && h.key_match && !h.key_tail && h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(5, "l"), cmp, h) == compound_key(5, "o") && h.key_match && !h.key_tail && !h.match);
+ BOOST_REQUIRE(*c.lower_bound(compound_key(5, "o"), cmp, h) == compound_key(5, "o") && h.key_match && !h.key_tail && h.match);
+ BOOST_REQUIRE(c.lower_bound(compound_key(5, "q"), cmp, h) == c.end() && h.key_match && h.key_tail);
+ BOOST_REQUIRE(c.lower_bound(compound_key(6, "q"), cmp, h) == c.end() && !h.key_match);
+
+ c.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_upper_bound) {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+
+ c.insert(3, test_data(3, "e"), cmp);
+ c.insert(5, test_data(5, "i"), cmp);
+ c.insert(5, test_data(5, "o"), cmp);
+
+ BOOST_REQUIRE(*c.upper_bound(compound_key(2, "a"), cmp) == compound_key(3, "e"));
+ BOOST_REQUIRE(*c.upper_bound(compound_key(3, "a"), cmp) == compound_key(3, "e"));
+ BOOST_REQUIRE(*c.upper_bound(compound_key(3, "e"), cmp) == compound_key(5, "i"));
+ BOOST_REQUIRE(*c.upper_bound(compound_key(3, "o"), cmp) == compound_key(5, "i"));
+ BOOST_REQUIRE(*c.upper_bound(compound_key(4, "i"), cmp) == compound_key(5, "i"));
+ BOOST_REQUIRE(*c.upper_bound(compound_key(5, "a"), cmp) == compound_key(5, "i"));
+ BOOST_REQUIRE(*c.upper_bound(compound_key(5, "i"), cmp) == compound_key(5, "o"));
+ BOOST_REQUIRE(*c.upper_bound(compound_key(5, "l"), cmp) == compound_key(5, "o"));
+ BOOST_REQUIRE(c.upper_bound(compound_key(5, "o"), cmp) == c.end());
+ BOOST_REQUIRE(c.upper_bound(compound_key(5, "q"), cmp) == c.end());
+ BOOST_REQUIRE(c.upper_bound(compound_key(6, "q"), cmp) == c.end());
+
+ c.clear();
+}
+BOOST_AUTO_TEST_CASE(test_self_iterator) {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+
+ c.insert(1, std::move(test_data(1, "a")), cmp);
+ c.insert(1, std::move(test_data(1, "b")), cmp);
+ c.insert(2, std::move(test_data(2, "c")), cmp);
+ c.insert(3, std::move(test_data(3, "d")), cmp);
+ c.insert(3, std::move(test_data(3, "e")), cmp);
+
+ auto erase_by_ptr = [&] (int key, std::string sub) {
+ test_data* d = &*c.find(compound_key(key, sub), cmp);
+ collection::iterator di(d);
+ di.erase(compound_key::less_compare{});
+ };
+
+ erase_by_ptr(1, "b");
+ erase_by_ptr(2, "c");
+ erase_by_ptr(3, "d");
+
+ auto i = c.begin();
+ BOOST_REQUIRE(*i++ == compound_key(1, "a"));
+ BOOST_REQUIRE(*i++ == compound_key(3, "e"));
+ BOOST_REQUIRE(i == c.end());
+
+ c.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_end_iterator) {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+
+ c.insert(1, std::move(test_data(1, "a")), cmp);
+ auto i = std::prev(c.end());
+ BOOST_REQUIRE(*i == compound_key(1, "a"));
+
+ c.clear();
+}
+
+void validate_sorted(collection& c) {
+ auto i = c.begin();
+ if (i == c.end()) {
+ return;
+ }
+
+ while (1) {
+ auto cur = i;
+ i++;
+ if (i == c.end()) {
+ break;
+ }
+ test_data::compare cmp;
+ BOOST_REQUIRE(cmp(*cur, *i) < 0);
+ }
+}
+
+void compare_with_set(collection& c, oracle& s) {
+ test_data::compare cmp;
+ /* All keys must be findable */
+ for (auto i = s.begin(); i != s.end(); i++) {
+ auto j = c.find(*i, cmp);
+ BOOST_REQUIRE(j != c.end() && *j == *i);
+ }
+
+ /* Both iterations must coincide */
+ auto i = c.begin();
+ auto j = s.begin();
+
+ while (i != c.end()) {
+ BOOST_REQUIRE(*i == *j);
+ i++;
+ j++;
+ }
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_via_emplace) {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+ oracle s;
+ int nr = 0;
+
+ while (nr < 4000) {
+ compound_key k(tests::random::get_int<int>(900), tests::random::get_sstring(4));
+
+ collection::bound_hint h;
+ auto i = c.lower_bound(k, cmp, h);
+
+ if (i == c.end() || !h.match) {
+ auto it = c.emplace_before(i, k.key, h, k.key, k.sub_key);
+ BOOST_REQUIRE(*it == k);
+ s.insert(std::move(k));
+ nr++;
+ }
+ }
+
+ compare_with_set(c, s);
+ c.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_insert_and_erase) {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+ int nr = 0;
+
+ while (nr < 500) {
+ compound_key k(tests::random::get_int<int>(100), tests::random::get_sstring(3));
+
+ if (c.find(k, cmp) == c.end()) {
+ auto it = c.insert(k.key, std::move(test_data(k.key, k.sub_key)), cmp);
+ BOOST_REQUIRE(*it == k);
+ nr++;
+ }
+ }
+
+ validate_sorted(c);
+
+ while (nr > 0) {
+ int n = tests::random::get_int<int>() % nr;
+
+ auto i = c.begin();
+ while (n > 0) {
+ i++;
+ n--;
+ }
+
+ i.erase(compound_key::less_compare{});
+ nr--;
+
+ validate_sorted(c);
+ }
+}
+
+BOOST_AUTO_TEST_CASE(test_compaction) {
+ logalloc::region reg;
+ with_allocator(reg.allocator(), [&] {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+ oracle s;
+
+ {
+ logalloc::reclaim_lock rl(reg);
+
+ int nr = 0;
+
+ while (nr < 1500) {
+ compound_key k(tests::random::get_int<int>(400), tests::random::get_sstring(3));
+
+ if (c.find(k, cmp) == c.end()) {
+ auto it = c.insert(k.key, std::move(test_data(k.key, k.sub_key)), cmp);
+ BOOST_REQUIRE(*it == k);
+ s.insert(std::move(k));
+ nr++;
+ }
+ }
+ }
+
+ reg.full_compaction();
+
+ compare_with_set(c, s);
+ c.clear();
+ });
+}
+
+BOOST_AUTO_TEST_CASE(test_range_erase) {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+
+ c.insert(1, std::move(test_data(1, "a")), cmp);
+ c.insert(1, std::move(test_data(1, "b")), cmp);
+ c.insert(1, std::move(test_data(1, "c")), cmp);
+ c.insert(2, std::move(test_data(2, "a")), cmp);
+ c.insert(2, std::move(test_data(2, "b")), cmp);
+ c.insert(2, std::move(test_data(2, "c")), cmp);
+ c.insert(3, std::move(test_data(3, "a")), cmp);
+ c.insert(3, std::move(test_data(3, "b")), cmp);
+ c.insert(3, std::move(test_data(3, "c")), cmp);
+
+ auto i = c.erase(c.find(compound_key(1, "b"), cmp), c.find(compound_key(3, "b"), cmp));
+ BOOST_REQUIRE(*i == compound_key(3, "b"));
+
+ auto x = c.begin();
+ BOOST_REQUIRE(*(x++) == compound_key(1, "a"));
+ BOOST_REQUIRE(x++ == i);
+ BOOST_REQUIRE(*(x++) == compound_key(3, "c"));
+ BOOST_REQUIRE(x == c.end());
+
+ c.clear();
+}
+
+BOOST_AUTO_TEST_CASE(test_range_full_erase) {
+ collection c(compound_key::less_compare{});
+ test_data::compare cmp;
+
+ c.insert(1, std::move(test_data(1, "a")), cmp);
+ c.insert(1, std::move(test_data(1, "b")), cmp);
+ c.insert(2, std::move(test_data(2, "a")), cmp);
+ c.insert(2, std::move(test_data(2, "b")), cmp);
+
+ auto i = c.erase(c.begin(), c.end());
+ BOOST_REQUIRE(i == c.end());
+}
--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>
unread,
May 19, 2020, 9:15:35 AM5/19/20
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The row cache memory footprint changed after the switch to B+,
because there is no longer a sole cache_entry allocation: the
bplus::data and bplus::node objects are allocated too. Knowing
their sizes helps in analyzing the footprint changes.

Also print the size of memtable_entry, which is now also stored
in the B+ tree's data.

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
test/perf/memory_footprint_test.cc | 3 +++
1 file changed, 3 insertions(+)

diff --git a/test/perf/memory_footprint_test.cc b/test/perf/memory_footprint_test.cc
index 9252c77e0..4e3026198 100644
--- a/test/perf/memory_footprint_test.cc
+++ b/test/perf/memory_footprint_test.cc
@@ -56,6 +56,9 @@ class size_calculator {
public:
static void print_cache_entry_size() {
std::cout << prefix() << "sizeof(cache_entry) = " << sizeof(cache_entry) << "\n";
+ std::cout << prefix() << "sizeof(memtable_entry) = " << sizeof(memtable_entry) << "\n";
+ std::cout << prefix() << "sizeof(bptree::node) = " << sizeof(row_cache::partitions_type::outer_tree::node) << "\n";
+ std::cout << prefix() << "sizeof(bptree::data) = " << sizeof(row_cache::partitions_type::outer_tree::data) << "\n";

{
nest n;
--
2.20.1

Pavel Emelyanov

<xemul@scylladb.com>
unread,
May 19, 2020, 9:15:35 AM5/19/20
to scylladb-dev@googlegroups.com, Pavel Emelyanov
The B+ tree will not have a constant-time .size() call, so maintain the partition count by hand

Signed-off-by: Pavel Emelyanov <xe...@scylladb.com>
---
memtable.hh | 4 +++-
memtable.cc | 13 ++++++++-----
row_cache.cc | 6 +++---
3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/memtable.hh b/memtable.hh
index e17108913..cbff1ecc6 100644
--- a/memtable.hh
+++ b/memtable.hh
@@ -137,6 +137,7 @@ class memtable final : public enable_lw_shared_from_this<memtable>, private loga
logalloc::allocating_section _read_section;
logalloc::allocating_section _allocating_section;
partitions_type partitions;
+ size_t nr_partitions = 0;
db::replay_position _replay_position;
db::rp_set _rp_set;
// mutation source to which reads fall-back after mark_flushed()
@@ -203,6 +204,7 @@ class memtable final : public enable_lw_shared_from_this<memtable>, private loga
void apply(const mutation& m, db::rp_handle&& = {});
// The mutation is upgraded to current schema.
void apply(const frozen_mutation& m, const schema_ptr& m_schema, db::rp_handle&& = {});
+ void evict_entry(memtable_entry& e, mutation_cleaner& cleaner);

static memtable& from_region(logalloc::region& r) {
return static_cast<memtable&>(r);
@@ -236,7 +238,7 @@ class memtable final : public enable_lw_shared_from_this<memtable>, private loga
return _memtable_list;
}

- size_t partition_count() const;
+ size_t partition_count() const { return nr_partitions; }
logalloc::occupancy_stats occupancy() const;

// Creates a reader of data in this memtable for given partition range.
diff --git a/memtable.cc b/memtable.cc
index 573074d1f..a5a02c164 100644
--- a/memtable.cc
+++ b/memtable.cc
@@ -137,11 +137,16 @@ uint64_t memtable::dirty_size() const {
return occupancy().total_space();
}

+void memtable::evict_entry(memtable_entry& e, mutation_cleaner& cleaner) {
+ e.partition().evict(cleaner);
+ nr_partitions--;
+}
+
void memtable::clear() noexcept {
auto dirty_before = dirty_size();
with_allocator(allocator(), [this] {
partitions.clear_and_dispose([this] (memtable_entry* e) {
- e->partition().evict(_cleaner);
+ evict_entry(*e, _cleaner);
current_deleter<memtable_entry>()(e);
});
});
@@ -154,6 +159,7 @@ future<> memtable::clear_gently() noexcept {
auto& alloc = allocator();

auto p = std::move(partitions);
+ nr_partitions = 0;
while (!p.empty()) {
auto dirty_before = dirty_size();
with_allocator(alloc, [&] () noexcept {
@@ -210,6 +216,7 @@ memtable::find_or_create_partition(const dht::decorated_key& key) {
memtable_entry* entry = current_allocator().construct<memtable_entry>(
_schema, dht::decorated_key(key), mutation_partition(_schema));
partitions.insert_before(i, *entry);
+ ++nr_partitions;
++_table_stats.memtable_partition_insertions;
return entry->partition();
} else {
@@ -753,10 +760,6 @@ mutation_source memtable::as_data_source() {
});
}

-size_t memtable::partition_count() const {
- return partitions.size();
-}
-
memtable_entry::memtable_entry(memtable_entry&& o) noexcept
: _link()
, _schema(std::move(o._schema))
diff --git a/row_cache.cc b/row_cache.cc
index 0c8c96d56..fc42ead10 100644
--- a/row_cache.cc
+++ b/row_cache.cc
@@ -890,14 +890,14 @@ void row_cache::invalidate_sync(memtable& m) noexcept {
bool blow_cache = false;
// Note: clear_and_dispose() ought not to look up any keys, so it doesn't require
// with_linearized_managed_bytes(), but invalidate() does.
- m.partitions.clear_and_dispose([this, deleter = current_deleter<memtable_entry>(), &blow_cache] (memtable_entry* entry) {
+ m.partitions.clear_and_dispose([this, &m, deleter = current_deleter<memtable_entry>(), &blow_cache] (memtable_entry* entry) {
with_linearized_managed_bytes([&] {
try {
invalidate_locked(entry->key());
} catch (...) {
blow_cache = true;
}
- entry->partition().evict(_tracker.memtable_cleaner());
+ m.evict_entry(*entry, _tracker.memtable_cleaner());
deleter(entry);
});
});
@@ -973,7 +973,7 @@ future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater)
auto i = m.partitions.begin();
memtable_entry& mem_e = *i;
m.partitions.erase(i);
- mem_e.partition().evict(_tracker.memtable_cleaner());
+ m.evict_entry(mem_e, _tracker.memtable_cleaner());
current_allocator().destroy(&mem_e);
});
++partition_count;
--
2.20.1

Botond Dénes

<bdenes@scylladb.com>
unread,
May 19, 2020, 9:46:29 AM5/19/20
to Pavel Emelyanov, scylladb-dev@googlegroups.com
Already made this comment on an earlier version: I think this should be
renamed to intrusive_array and the comment above should mention that
the array doesn't manage its own storage, although this would be kind
of obvious if it were called intrusive_array.

Pavel Emelyanov

<xemul@scylladb.com>
unread,
May 19, 2020, 11:06:34 AM5/19/20
to Botond Dénes, scylladb-dev@googlegroups.com
I remember that. But it still sounds to me like "seekable pipe" :(
Does intrusive-ness mainly and mostly mean "doesn't manage its own storage"?

Nonetheless: https://github.com/xemul/scylla/tree/row-cache-over-bptree-5.1

-- Pavel

Botond Dénes

<bdenes@scylladb.com>
unread,
May 19, 2020, 11:12:02 AM5/19/20
to Pavel Emelyanov, scylladb-dev@googlegroups.com
As far as I know yes.

>
> Nonetheless:
> https://github.com/xemul/scylla/tree/row-cache-over-bptree-5.1



I don't remember seeing any reaction to that comment. You don't have to
agree and/or do everything I (or other reviewers) ask, but if you don't
respond I don't know whether you just forgot about it or don't agree.

Avi Kivity

<avi@scylladb.com>
unread,
May 24, 2020, 4:14:01 AM5/24/20
to Pavel Emelyanov, Raphael S. Carvalho, scylladb-dev
I don't think btree should be involved in the conversion of tokens to
int64 at all. The caller should convert tokens to int64 and back.



Pavel Emelyanov

<xemul@scylladb.com>
unread,
May 25, 2020, 4:25:39 AM5/25/20
to Avi Kivity, Raphael S. Carvalho, scylladb-dev


Sun, May 24, 2020, 11:14 Avi Kivity <a...@scylladb.com>:

But isn't that the goal of "heterogeneous lookup" -- making it possible to look up a value by a key of any type that's "compatible" with the in-tree one?

The caller of B+ uses several types for lookup (cache_entry, memtable_entry, ring_position_view, etc.); they all have a .token() method that the comparators use to call tri_compare in the end. Why can't the same method be used to get an int64_t for the optimized low-level lookup?

-- Pavel




Avi Kivity

<avi@scylladb.com>
unread,
May 25, 2020, 4:40:45 AM5/25/20
to Pavel Emelyanov, Raphael S. Carvalho, scylladb-dev

Yes.


The caller of B+ uses several types for lookup (cache_entry, memtable_entry, ring_position_view, etc.); they all have a .token() method that the comparators use to call tri_compare in the end. Why can't the same method be used to get an int64_t for the optimized low-level lookup?


Maybe I lost the plot. The btree template isn't aware of tokens. So there needs to be a translation layer outside btree from tokens to int64_t.


Where do you propose to add to_int64_key?



Pavel Emelyanov

<xemul@scylladb.com>
unread,
May 25, 2020, 4:55:26 AM5/25/20
to Avi Kivity, Raphael S. Carvalho, scylladb-dev


Mon, May 25, 2020, 11:40 Avi Kivity <a...@scylladb.com>:

There's a template called searcher that searches for a subtree on a node by a provided search key. It has two specializations, for the binary and linear search types. In both it accepts a search key and a comparator that compares the search key to a tree key (from the node).

This patch adds a 3rd specialization -- for the case when the search type is linear and the tree key is int64_t. In this case it needs to convert an arbitrary search key to int64_t to pass it, along with the tree keys, into the AVX scanner helper.

The to_int64_key is effectively this searcher's auxiliary sub-template that imposes a convertibility concept on the search key type. If I put the concept on the searcher itself, then any search key type that doesn't satisfy it will be compiled with the default linear searcher, which is not good.

-- Pavel



Avi Kivity

<avi@scylladb.com>
unread,
May 25, 2020, 5:29:23 AM5/25/20
to Pavel Emelyanov, Raphael S. Carvalho, scylladb-dev

This is traditionally done by the comparator (which is supplied by the user). If the comparator knows how to compare a token to an int64_t, that is okay.


I think I see the problem - operator< is not enough here, because we perform special hacks.


How about this: if comparator::simplify_key() and comparator::simplified_key_type exist, call it on any input key. Otherwise just use the key directly. We will have simplify_key() that converts a token to int64_t. Then all compares are done on the int64_t, and the searcher's specialization works.


It can also be a separate template parameter, key_simplifier or something.
