[PATCH seastar v1 0/6] memory: optimize thread-local initialization

Avi Kivity

<avi@scylladb.com>

unread,

May 15, 2021, 3:21:02 PM5/15/21

to seastar-dev@googlegroups.com

The memory allocator uses lots of thread-local variables, which cause
the compiler to emit checks for thread-local guard variables and lazy
initialization. This was observed (#829) to be expensive.

This series slims down those initializations. Verified by inspecting
the generated code, and using Scylla's perf_simple_query:

before: 186247.78 tps ( 87.1 allocs/op, 20.1 tasks/op, 53777 insns/op)
after: 189393.75 tps ( 87.1 allocs/op, 20.1 tasks/op, 50858 insns/op)

More than 5% reduction in instructions executed.

Fixes #829.

https://github.com/avikivity/seastar memory-reduce-thread-local-overhead/v1

Avi Kivity (6):
memory: initialize cpu_mem access pointer eagerly, not lazily
memory: do not return original value when incrementing statistics
memory: fix coding style in statistics increment()
memory: use relaxed memory order for incrementing cross-thread stats
memory: fast-path local statistics
memory: avoid gratuitous cpu_mem initialization checks

src/core/memory.cc | 168 +++++++++++++++++++++++----------------------
1 file changed, 87 insertions(+), 81 deletions(-)

--
2.31.1

Avi Kivity

<avi@scylladb.com>

unread,

May 15, 2021, 3:21:04 PM5/15/21

to seastar-dev@googlegroups.com

cpu_mem is an expensive variable to use, since it is thread-local
and has non-trivial constructors and destructors. To work around
the compiler thread-local guard variable checks, we have get_cpu_mem
which is supposed to be inlined and the initialization optimized,
but it was observed not to be when we increased complexity.

To fix this, eagerly initialize the cpu_mem pointer, making
get_cpu_mem() trivial. It can now be inlined without pulling
in cpu_mem initialization.

This was made possible by the new checks for is_reactor_thread
that give us an opportunity for eager initialiation (calling a
new function init_cpu_mem()).
---
src/core/memory.cc | 48 ++++++++++++++++++++++++----------------------
1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/src/core/memory.cc b/src/core/memory.cc
index 68d475be0d..4ebac25142 100644
--- a/src/core/memory.cc
+++ b/src/core/memory.cc
@@ -1332,44 +1332,43 @@ void free_large(void* ptr) {

size_t object_size(void* ptr) {
return cpu_pages::all_cpus[object_cpu_id(ptr)]->object_size(ptr);
}

+static thread_local cpu_pages* cpu_mem_ptr = nullptr;
+
// Mark as cold so that GCC8+ can move to .text.unlikely.
[[gnu::cold]]
-static void init_cpu_mem_ptr(cpu_pages*& cpu_mem_ptr) {
+static void init_cpu_mem() {
cpu_mem_ptr = &cpu_mem;
-};
+ cpu_mem.initialize();
+}

[[gnu::always_inline]]
static inline cpu_pages& get_cpu_mem()
{
// cpu_pages has a non-trivial constructor which means that the compiler
// must make sure the instance local to the current thread has been
- // constructed before each access.
- // Unfortunately, this means that GCC will emit an unconditional call
- // to __tls_init(), which may incurr a noticeable overhead in applications
- // that are heavy on memory allocations.
- // This can be solved by adding an easily predictable branch checking
- // whether the object has already been constructed.
- static thread_local cpu_pages* cpu_mem_ptr;
- if (__builtin_expect(!bool(cpu_mem_ptr), false)) {
- init_cpu_mem_ptr(cpu_mem_ptr);
- }
+ // constructed before each access. So instead we access cpu_mem_ptr
+ // which has been initialized by calls to init_cpu_mem() before it is
+ // accessed.
return *cpu_mem_ptr;
}

#ifdef SEASTAR_DEBUG_ALLOCATIONS
static constexpr int debug_allocation_pattern = 0xab;
#endif

void* allocate(size_t size) {
- // original_malloc_func might be null for allocations before main
- // in constructors before original_malloc_func ctor is called
- if (!is_reactor_thread && original_malloc_func) {
- alloc_stats::increment(alloc_stats::types::foreign_mallocs);
- return original_malloc_func(size);
+ if (!is_reactor_thread) {
+ if (original_malloc_func) {
+ alloc_stats::increment(alloc_stats::types::foreign_mallocs);
+ return original_malloc_func(size);
+ }
+ // original_malloc_func might be null for allocations before main
+ // in constructors before original_malloc_func ctor is called
+ init_cpu_mem();
}
if (size <= sizeof(free_object)) {
size = sizeof(free_object);
}
void* ptr;
@@ -1389,15 +1388,18 @@ void* allocate(size_t size) {
alloc_stats::increment(alloc_stats::types::allocs);
return ptr;
}

void* allocate_aligned(size_t align, size_t size) {
- // original_realloc_func might be null for allocations before main
- // in constructors before original_realloc_func ctor is called
- if (!is_reactor_thread && original_aligned_alloc_func) {
- alloc_stats::increment(alloc_stats::types::foreign_mallocs);
- return original_aligned_alloc_func(align, size);
+ if (!is_reactor_thread) {
+ if (original_aligned_alloc_func) {
+ alloc_stats::increment(alloc_stats::types::foreign_mallocs);
+ return original_aligned_alloc_func(align, size);
+ }
+ // original_realloc_func might be null for allocations before main
+ // in constructors before original_realloc_func ctor is called
+ init_cpu_mem();
}
if (size <= sizeof(free_object)) {
size = std::max(sizeof(free_object), align);
}
void* ptr;
@@ -1492,11 +1494,11 @@ void configure(std::vector<resource::memory> m, bool mbind,
// and we might reach configure without ever allocating, hence without ever calling
// cpu_pages::initialize.
// The correct solution is to add a condition inside cpu_mem.resize, but since all
// other paths to cpu_pages::resize are already verifying initialize was called, we
// verify that here.
- cpu_mem.initialize();
+ init_cpu_mem();
is_reactor_thread = true;
size_t total = 0;
for (auto&& x : m) {
total += x.bytes;
}
--
2.31.1

Avi Kivity

<avi@scylladb.com>

unread,

May 15, 2021, 3:21:05 PM5/15/21

to seastar-dev@googlegroups.com

This is not expected to improve performance, but stops my eyes
from bleeding.
---
src/core/memory.cc | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/core/memory.cc b/src/core/memory.cc
index 4ec76184da..1100842137 100644
--- a/src/core/memory.cc
+++ b/src/core/memory.cc
@@ -212,11 +212,11 @@ std::array<stats_atomic_array, max_cpus> alien_stats{};
static void increment(types stat_type, uint64_t size=1)
{
auto i = static_cast<std::size_t>(stat_type);
// fast path, reactor threads takes thread local statistics
if (is_reactor_thread) {
- stats[i]+=size;
+ stats[i] += size;
} else {
auto hash = std::hash<std::thread::id>()(std::this_thread::get_id());
alien_stats[hash % alien_stats.size()][i].fetch_add(size);
}
}
--
2.31.1

Avi Kivity

<avi@scylladb.com>

unread,

May 15, 2021, 3:21:05 PM5/15/21

to seastar-dev@googlegroups.com

The function that increments statistics returns the original value,
but no one is interested in it. Simplify by dropping the return
value.
---
src/core/memory.cc | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/core/memory.cc b/src/core/memory.cc
index 4ebac25142..4ec76184da 100644
--- a/src/core/memory.cc
+++ b/src/core/memory.cc
@@ -207,21 +207,19 @@ using stats_array = std::array<uint64_t, static_cast<std::size_t>(types::enum_si
using stats_atomic_array = std::array<std::atomic_uint64_t, static_cast<std::size_t>(types::enum_size)>;

thread_local stats_array stats;

std::array<stats_atomic_array, max_cpus> alien_stats{};

-static uint64_t increment(types stat_type, uint64_t size=1)
+static void increment(types stat_type, uint64_t size=1)

{
auto i = static_cast<std::size_t>(stat_type);
// fast path, reactor threads takes thread local statistics
if (is_reactor_thread) {

- auto tmp = stats[i];
stats[i]+=size;
- return tmp;

} else {
auto hash = std::hash<std::thread::id>()(std::this_thread::get_id());

- return alien_stats[hash % alien_stats.size()][i].fetch_add(size);
+ alien_stats[hash % alien_stats.size()][i].fetch_add(size);
}
}

static uint64_t get(types stat_type)
{
--
2.31.1

Avi Kivity

<avi@scylladb.com>

unread,

May 15, 2021, 3:21:06 PM5/15/21

to seastar-dev@googlegroups.com

The default, sequentially consistent increment, is expensive on some
architectures. This isn't a big deal since this is a slow path, but
is still worth it.

---
src/core/memory.cc | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/core/memory.cc b/src/core/memory.cc
index 1100842137..88b2638f85 100644
--- a/src/core/memory.cc
+++ b/src/core/memory.cc
@@ -215,11 +215,11 @@ static void increment(types stat_type, uint64_t size=1)

// fast path, reactor threads takes thread local statistics
if (is_reactor_thread) {

stats[i] += size;

} else {
auto hash = std::hash<std::thread::id>()(std::this_thread::get_id());

- alien_stats[hash % alien_stats.size()][i].fetch_add(size);
+ alien_stats[hash % alien_stats.size()][i].fetch_add(size, std::memory_order_relaxed);

Avi Kivity

<avi@scylladb.com>

unread,

May 15, 2021, 3:21:08 PM5/15/21

to seastar-dev@googlegroups.com

We keep two different kinds of statistics: thread-local, for reactor
threads, and global/hashed, for alien threads. The increment() function
decides what type it needs to use.

Reduce instruction count by splitting the local option into a new
increment_local() function, and call it directly when it is known
that it is the correct choice.
---
src/core/memory.cc | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/src/core/memory.cc b/src/core/memory.cc
index 88b2638f85..406d25686e 100644
--- a/src/core/memory.cc
+++ b/src/core/memory.cc
@@ -207,18 +207,22 @@ using stats_array = std::array<uint64_t, static_cast<std::size_t>(types::enum_si

using stats_atomic_array = std::array<std::atomic_uint64_t, static_cast<std::size_t>(types::enum_size)>;

thread_local stats_array stats;
std::array<stats_atomic_array, max_cpus> alien_stats{};

+static void increment_local(types stat_type, uint64_t size = 1) {
+ stats[static_cast<std::size_t>(stat_type)] += size;
+}
+

static void increment(types stat_type, uint64_t size=1)

{
- auto i = static_cast<std::size_t>(stat_type);

// fast path, reactor threads takes thread local statistics
if (is_reactor_thread) {

- stats[i] += size;
+ increment_local(stat_type, size);

} else {
auto hash = std::hash<std::thread::id>()(std::this_thread::get_id());

+ auto i = static_cast<std::size_t>(stat_type);

alien_stats[hash % alien_stats.size()][i].fetch_add(size, std::memory_order_relaxed);
}
}

static uint64_t get(types stat_type)

@@ -744,11 +748,11 @@ cpu_pages::allocate_large_and_trim(unsigned n_pages) {
return mem() + span_idx * page_size;
}

void
cpu_pages::warn_large_allocation(size_t size) {
- alloc_stats::increment(alloc_stats::types::large_allocs);
+ alloc_stats::increment_local(alloc_stats::types::large_allocs);
seastar_memory_logger.warn("oversized allocation: {} bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at {}", size, current_backtrace());
large_allocation_warning_threshold *= 1.618; // prevent spam
}

void
@@ -886,11 +890,11 @@ bool cpu_pages::drain_cross_cpu_freelist() {
return false;
}
auto p = xcpu_freelist.exchange(nullptr, std::memory_order_acquire);
while (p) {
auto n = p->next;
- alloc_stats::increment(alloc_stats::types::frees);
+ alloc_stats::increment_local(alloc_stats::types::frees);
free(p);
p = n;
}
return true;
}
@@ -939,11 +943,11 @@ cpu_pages::try_foreign_free(void* ptr) {
if (__builtin_expect((reinterpret_cast<uintptr_t>(ptr) & cpu_id_and_mem_base_mask) == local_expected_cpu_id, true)) {
return false;
}
if (!is_seastar_memory(ptr)) {
if (is_reactor_thread) {
- alloc_stats::increment(alloc_stats::types::foreign_cross_frees);
+ alloc_stats::increment_local(alloc_stats::types::foreign_cross_frees);
} else {
alloc_stats::increment(alloc_stats::types::foreign_frees);
}
original_free_func(ptr);
return true;
@@ -1113,11 +1117,11 @@ void cpu_pages::resize(size_t new_size, allocate_system_memory_fn alloc_memory)
reclaiming_result cpu_pages::run_reclaimers(reclaimer_scope scope, size_t n_pages) {
auto target = std::max<size_t>(nr_free_pages + n_pages, min_free_pages);
reclaiming_result result = reclaiming_result::reclaimed_nothing;
while (nr_free_pages < target) {
bool made_progress = false;
- alloc_stats::increment(alloc_stats::types::reclaims);
+ alloc_stats::increment_local(alloc_stats::types::reclaims);
for (auto&& r : reclaimers) {
if (r->scope() >= scope) {
made_progress |= r->do_reclaim((target - nr_free_pages) * page_size) == reclaiming_result::reclaimed_something;
}
}
@@ -1381,11 +1385,11 @@ void* allocate(size_t size) {
} else {
#ifdef SEASTAR_DEBUG_ALLOCATIONS
std::memset(ptr, debug_allocation_pattern, size);
#endif
}
- alloc_stats::increment(alloc_stats::types::allocs);
+ alloc_stats::increment_local(alloc_stats::types::allocs);

return ptr;
}

void* allocate_aligned(size_t align, size_t size) {

if (!is_reactor_thread) {
@@ -1414,27 +1418,27 @@ void* allocate_aligned(size_t align, size_t size) {
} else {
#ifdef SEASTAR_DEBUG_ALLOCATIONS
std::memset(ptr, debug_allocation_pattern, size);
#endif
}
- alloc_stats::increment(alloc_stats::types::allocs);
+ alloc_stats::increment_local(alloc_stats::types::allocs);
return ptr;
}

void free(void* obj) {
if (cpu_pages::try_foreign_free(obj)) {
return;
}
- alloc_stats::increment(alloc_stats::types::frees);
+ alloc_stats::increment_local(alloc_stats::types::frees);
get_cpu_mem().free(obj);
}

void free(void* obj, size_t size) {
if (cpu_pages::try_foreign_free(obj)) {
return;
}
- alloc_stats::increment(alloc_stats::types::frees);
+ alloc_stats::increment_local(alloc_stats::types::frees);
get_cpu_mem().free(obj, size);
}

void free_aligned(void* obj, size_t align, size_t size) {
if (size <= sizeof(free_object)) {
@@ -1446,12 +1450,12 @@ void free_aligned(void* obj, size_t align, size_t size) {
}
free(obj, size);
}

void shrink(void* obj, size_t new_size) {
- alloc_stats::increment(alloc_stats::types::frees);
- alloc_stats::increment(alloc_stats::types::allocs); // keep them balanced
+ alloc_stats::increment_local(alloc_stats::types::frees);
+ alloc_stats::increment_local(alloc_stats::types::allocs); // keep them balanced
cpu_mem.shrink(obj, new_size);
}

void set_reclaim_hook(std::function<void (std::function<void ()>)> hook) {
cpu_mem.set_reclaim_hook(hook);
--
2.31.1

Avi Kivity

<avi@scylladb.com>

unread,

May 15, 2021, 3:21:09 PM5/15/21

to seastar-dev@googlegroups.com

We have get_cpu_mem() to avoid unneeded cpu_mem intitialization
checks, but we don't use it everywhere. This causes extra checks
to be emitted. Fix by switching to get_cpu_mem() in all call sites
where it should already be initialized.

Most of the cases don't matter for performance, but switching
makes it easier to verify by inspection that no unnecessary
call sites remain.
---
src/core/memory.cc | 86 ++++++++++++++++++++++++----------------------
1 file changed, 44 insertions(+), 42 deletions(-)

diff --git a/src/core/memory.cc b/src/core/memory.cc
index 406d25686e..830ebb16a5 100644
--- a/src/core/memory.cc
+++ b/src/core/memory.cc
@@ -552,24 +552,26 @@ struct cpu_pages {

static thread_local cpu_pages cpu_mem;
std::atomic<unsigned> cpu_pages::cpu_id_gen;
cpu_pages* cpu_pages::all_cpus[max_cpus];

+static cpu_pages& get_cpu_mem();
+
#ifdef SEASTAR_HEAPPROF

void set_heap_profiling_enabled(bool enable) {
- bool is_enabled = cpu_mem.collect_backtrace;
+ bool is_enabled = get_cpu_mem().collect_backtrace;
if (enable) {
if (!is_enabled) {
seastar_logger.info("Enabling heap profiler");
}
} else {
if (is_enabled) {
seastar_logger.info("Disabling heap profiler");
}
}
- cpu_mem.collect_backtrace = enable;
+ get_cpu_mem().collect_backtrace = enable;
}

static thread_local int64_t scoped_heap_profiling_embed_count = 0;

scoped_heap_profiling::scoped_heap_profiling() noexcept {
@@ -775,16 +777,16 @@ cpu_pages::allocate_large_aligned(unsigned align_pages, unsigned n_pages) {
// buddy allocation is always aligned
return allocate_large_and_trim(n_pages);
}

disable_backtrace_temporarily::disable_backtrace_temporarily() {
- _old = cpu_mem.collect_backtrace;
- cpu_mem.collect_backtrace = false;
+ _old = get_cpu_mem().collect_backtrace;
+ get_cpu_mem().collect_backtrace = false;
}

disable_backtrace_temporarily::~disable_backtrace_temporarily() {
- cpu_mem.collect_backtrace = _old;
+ get_cpu_mem().collect_backtrace = _old;
}

static
saved_backtrace get_backtrace() noexcept {
disable_backtrace_temporarily dbt;
@@ -811,11 +813,11 @@ allocation_site_ptr get_allocation_site() {
#ifdef SEASTAR_HEAPPROF

allocation_site_ptr&
small_pool::alloc_site_holder(void* ptr) {
if (objects_page_aligned()) {
- return cpu_mem.to_page(ptr)->alloc_site;
+ return get_cpu_mem().to_page(ptr)->alloc_site;
} else {
return *reinterpret_cast<allocation_site_ptr*>(reinterpret_cast<char*>(ptr) + _object_size - sizeof(allocation_site_ptr));
}
}

@@ -1237,12 +1239,12 @@ small_pool::deallocate(void* object) {

void
small_pool::add_more_objects() {
auto goal = (_min_free + _max_free) / 2;
while (!_span_list.empty() && _free_count < goal) {
- page& span = _span_list.front(cpu_mem.pages);
- _span_list.pop_front(cpu_mem.pages);
+ page& span = _span_list.front(get_cpu_mem().pages);
+ _span_list.pop_front(get_cpu_mem().pages);
while (span.freelist) {
auto obj = span.freelist;
span.freelist = span.freelist->next;
obj->next = _free;
_free = obj;
@@ -1251,19 +1253,19 @@ small_pool::add_more_objects() {
}
}
while (_free_count < goal) {
disable_backtrace_temporarily dbt;
auto span_size = _span_sizes.preferred;
- auto data = reinterpret_cast<char*>(cpu_mem.allocate_large(span_size));
+ auto data = reinterpret_cast<char*>(get_cpu_mem().allocate_large(span_size));
if (!data) {
span_size = _span_sizes.fallback;
- data = reinterpret_cast<char*>(cpu_mem.allocate_large(span_size));
+ data = reinterpret_cast<char*>(get_cpu_mem().allocate_large(span_size));
if (!data) {
return;
}
}
- auto span = cpu_mem.to_page(data);
+ auto span = get_cpu_mem().to_page(data);
span_size = span->span_size;
_pages_in_use += span_size;
for (unsigned i = 0; i < span_size; ++i) {
span[i].offset_in_span = i;
span[i].pool = this;
@@ -1285,22 +1287,22 @@ small_pool::trim_free_list() {
auto goal = (_min_free + _max_free) / 2;
while (_free && _free_count > goal) {
auto obj = _free;
_free = _free->next;
--_free_count;
- page* span = cpu_mem.to_page(obj);
+ page* span = get_cpu_mem().to_page(obj);
span -= span->offset_in_span;
if (!span->freelist) {
new (&span->link) page_list_link();
- _span_list.push_front(cpu_mem.pages, *span);
+ _span_list.push_front(get_cpu_mem().pages, *span);
}
obj->next = span->freelist;
span->freelist = obj;
if (--span->nr_small_alloc == 0) {
_pages_in_use -= span->span_size;
- _span_list.erase(cpu_mem.pages, *span);
- cpu_mem.free_span(span - cpu_mem.pages, span->span_size);
+ _span_list.erase(get_cpu_mem().pages, *span);
+ get_cpu_mem().free_span(span - get_cpu_mem().pages, span->span_size);
}
}
}

void
@@ -1315,23 +1317,23 @@ void* allocate_large(size_t size) {
abort_on_underflow(size);
unsigned size_in_pages = (size + page_size - 1) >> page_bits;
if ((size_t(size_in_pages) << page_bits) < size) {
return nullptr; // (size + page_size - 1) caused an overflow
}
- return cpu_mem.allocate_large(size_in_pages);
+ return get_cpu_mem().allocate_large(size_in_pages);

}

void* allocate_large_aligned(size_t align, size_t size) {
abort_on_underflow(size);
unsigned size_in_pages = (size + page_size - 1) >> page_bits;
unsigned align_in_pages = std::max(align, page_size) >> page_bits;
- return cpu_mem.allocate_large_aligned(align_in_pages, size_in_pages);
+ return get_cpu_mem().allocate_large_aligned(align_in_pages, size_in_pages);
}

void free_large(void* ptr) {
- return cpu_mem.free_large(ptr);
+ return get_cpu_mem().free_large(ptr);

}

size_t object_size(void* ptr) {
return cpu_pages::all_cpus[object_cpu_id(ptr)]->object_size(ptr);
}

@@ -1452,15 +1454,15 @@ void free_aligned(void* obj, size_t align, size_t size) {

}

void shrink(void* obj, size_t new_size) {

alloc_stats::increment_local(alloc_stats::types::frees);

alloc_stats::increment_local(alloc_stats::types::allocs); // keep them balanced

- cpu_mem.shrink(obj, new_size);
+ get_cpu_mem().shrink(obj, new_size);

}

void set_reclaim_hook(std::function<void (std::function<void ()>)> hook) {

- cpu_mem.set_reclaim_hook(hook);
+ get_cpu_mem().set_reclaim_hook(hook);
}

reclaimer::reclaimer(std::function<reclaiming_result ()> reclaim, reclaimer_scope scope)
: reclaimer([reclaim = std::move(reclaim)] (request) {
return reclaim();
@@ -1468,28 +1470,28 @@ reclaimer::reclaimer(std::function<reclaiming_result ()> reclaim, reclaimer_scop
}

reclaimer::reclaimer(std::function<reclaiming_result (request)> reclaim, reclaimer_scope scope)
: _reclaim(std::move(reclaim))
, _scope(scope) {
- cpu_mem.reclaimers.push_back(this);
+ get_cpu_mem().reclaimers.push_back(this);
}

reclaimer::~reclaimer() {
- auto& r = cpu_mem.reclaimers;
+ auto& r = get_cpu_mem().reclaimers;
r.erase(std::find(r.begin(), r.end(), this));
}

void set_large_allocation_warning_threshold(size_t threshold) {
- cpu_mem.large_allocation_warning_threshold = threshold;
+ get_cpu_mem().large_allocation_warning_threshold = threshold;
}

size_t get_large_allocation_warning_threshold() {
- return cpu_mem.large_allocation_warning_threshold;
+ return get_cpu_mem().large_allocation_warning_threshold;
}

void disable_large_allocation_warning() {
- cpu_mem.large_allocation_warning_threshold = std::numeric_limits<size_t>::max();
+ get_cpu_mem().large_allocation_warning_threshold = std::numeric_limits<size_t>::max();

}

void configure(std::vector<resource::memory> m, bool mbind,

optional<std::string> hugetlbfs_path) {
// we need to make sure cpu_mem is initialize since configure calls cpu_mem.resize
@@ -1510,19 +1512,19 @@ void configure(std::vector<resource::memory> m, bool mbind,
// a shared_ptr to allow sys_alloc to be copied around
auto fdp = make_lw_shared<file_desc>(file_desc::temporary(*hugetlbfs_path));
sys_alloc = [fdp] (void* where, size_t how_much) {
return allocate_hugetlbfs_memory(*fdp, where, how_much);
};
- cpu_mem.replace_memory_backing(sys_alloc);
+ get_cpu_mem().replace_memory_backing(sys_alloc);
}
- cpu_mem.resize(total, sys_alloc);
+ get_cpu_mem().resize(total, sys_alloc);
size_t pos = 0;
for (auto&& x : m) {
#ifdef SEASTAR_HAVE_NUMA
unsigned long nodemask = 1UL << x.nodeid;
if (mbind) {
- auto r = ::mbind(cpu_mem.mem() + pos, x.bytes,
+ auto r = ::mbind(get_cpu_mem().mem() + pos, x.bytes,
MPOL_PREFERRED,
&nodemask, std::numeric_limits<unsigned long>::digits,
MPOL_MF_MOVE);

if (r == -1) {
@@ -1542,23 +1544,23 @@ statistics stats() {
cpu_mem.nr_pages * page_size, cpu_mem.nr_free_pages * page_size, alloc_stats::get(alloc_stats::types::reclaims), alloc_stats::get(alloc_stats::types::large_allocs),
alloc_stats::get(alloc_stats::types::foreign_mallocs), alloc_stats::get(alloc_stats::types::foreign_frees), alloc_stats::get(alloc_stats::types::foreign_cross_frees)};
}

bool drain_cross_cpu_freelist() {
- return cpu_mem.drain_cross_cpu_freelist();
+ return get_cpu_mem().drain_cross_cpu_freelist();
}

memory_layout get_memory_layout() {
- return cpu_mem.memory_layout();
+ return get_cpu_mem().memory_layout();
}

size_t min_free_memory() {
- return cpu_mem.min_free_pages * page_size;
+ return get_cpu_mem().min_free_pages * page_size;
}

void set_min_free_pages(size_t pages) {
- cpu_mem.set_min_free_pages(pages);
+ get_cpu_mem().set_min_free_pages(pages);
}

static thread_local int report_on_alloc_failure_suppressed = 0;

class disable_report_on_alloc_failure_temporarily {
@@ -1643,12 +1645,12 @@ static human_readable_value to_hr_number(uint64_t number) {
const std::array<char, 5> suffixes = {'\0', 'k', 'm', 'b', 't'};
return to_human_readable_value(number, 1000, 10000, suffixes);
}

seastar::internal::log_buf::inserter_iterator do_dump_memory_diagnostics(seastar::internal::log_buf::inserter_iterator it) {
- auto free_mem = cpu_mem.nr_free_pages * page_size;
- auto total_mem = cpu_mem.nr_pages * page_size;
+ auto free_mem = get_cpu_mem().nr_free_pages * page_size;
+ auto total_mem = get_cpu_mem().nr_pages * page_size;
it = fmt::format_to(it, "Dumping seastar memory diagnostics\n");

it = fmt::format_to(it, "Used memory: {}\n", to_hr_size(total_mem - free_mem));
it = fmt::format_to(it, "Free memory: {}\n", to_hr_size(free_mem));
it = fmt::format_to(it, "Total memory: {}\n\n", to_hr_size(total_mem));
@@ -1659,12 +1661,12 @@ seastar::internal::log_buf::inserter_iterator do_dump_memory_diagnostics(seastar
});
}

it = fmt::format_to(it, "Small pools:\n");
it = fmt::format_to(it, "objsz\tspansz\tusedobj\tmemory\tunused\twst%\n");
- for (unsigned i = 0; i < cpu_mem.small_pools.nr_small_pools; i++) {
- auto& sp = cpu_mem.small_pools[i];
+ for (unsigned i = 0; i < get_cpu_mem().small_pools.nr_small_pools; i++) {
+ auto& sp = get_cpu_mem().small_pools[i];
// We don't use pools too small to fit a free_object, so skip these, they
// are always empty.
if (sp.object_size() < sizeof(free_object)) {
continue;
}
@@ -1675,11 +1677,11 @@ seastar::internal::log_buf::inserter_iterator do_dump_memory_diagnostics(seastar
// becomes too large: they are instead attached to the spans allocated to this
// pool. To count this second category, we iterate over the spans below.
uint32_t span_freelist_objs = 0;
auto front = sp._span_list._front;
while (front) {
- auto& span = cpu_mem.pages[front];
+ auto& span = get_cpu_mem().pages[front];
auto capacity_in_objects = span.span_size * page_size / sp.object_size();
span_freelist_objs += capacity_in_objects - span.nr_small_alloc;
front = span.link._next;
}
const auto free_objs = sp._free_count + span_freelist_objs; // pool + span free objects
@@ -1700,26 +1702,26 @@ seastar::internal::log_buf::inserter_iterator do_dump_memory_diagnostics(seastar
it = fmt::format_to(it, "index\tsize\tfree\tused\tspans\n");

std::array<uint32_t, cpu_pages::nr_span_lists> span_size_histogram;
span_size_histogram.fill(0);

- for (unsigned i = 0; i < cpu_mem.nr_pages;) {
- const auto span_size = cpu_mem.pages[i].span_size;
+ for (unsigned i = 0; i < get_cpu_mem().nr_pages;) {
+ const auto span_size = get_cpu_mem().pages[i].span_size;
if (!span_size) {
++i;
continue;
}
++span_size_histogram[log2ceil(span_size)];
i += span_size;
}

- for (unsigned i = 0; i< cpu_mem.nr_span_lists; i++) {
- auto& span_list = cpu_mem.free_spans[i];
+ for (unsigned i = 0; i< get_cpu_mem().nr_span_lists; i++) {
+ auto& span_list = get_cpu_mem().free_spans[i];
auto front = span_list._front;
uint32_t free_pages = 0;
while (front) {
- auto& span = cpu_mem.pages[front];
+ auto& span = get_cpu_mem().pages[front];
free_pages += span.span_size;
front = span.link._next;
}
const auto total_spans = span_size_histogram[i];
const auto total_pages = total_spans * (1 << i);
--
2.31.1

Avi Kivity

<avi@scylladb.com>

unread,

May 27, 2021, 10:51:41 AM5/27/21

to seastar-dev@googlegroups.com, Tomasz Grabiec

Tomek, please review, it fixes a bug you reported.

Pekka Enberg

<penberg@scylladb.com>

unread,

May 28, 2021, 5:08:05 AM5/28/21

to Avi Kivity, seastar-dev

On Sat, May 15, 2021 at 10:21 PM Avi Kivity <a...@scylladb.com> wrote:
> The memory allocator uses lots of thread-local variables, which cause
> the compiler to emit checks for thread-local guard variables and lazy
> initialization. This was observed (#829) to be expensive.
>
> This series slims down those initializations. Verified by inspecting
> the generated code, and using Scylla's perf_simple_query:
>
> before: 186247.78 tps ( 87.1 allocs/op, 20.1 tasks/op, 53777 insns/op)
> after: 189393.75 tps ( 87.1 allocs/op, 20.1 tasks/op, 50858 insns/op)
>
> More than 5% reduction in instructions executed.
>
> Fixes #829.
>
> https://github.com/avikivity/seastar memory-reduce-thread-local-overhead/v1

Reviewed-by: Pekka Enberg <pen...@scylladb.com>

Reply all

Reply to author

Forward