I wanted to start a discussion around the advice/design for foreign_ptr:
https://github.com/scylladb/seastar/blob/master/doc/tutorial.md#foreign-pointers
After reading Nadav's notes above on cross-core memory management, I wanted
to measure what 'expensive' actually means when freeing on a remote core:
BLUF (bottom line up front):
void cpu_pages::free_cross_cpu(unsigned cpu_id, void* ptr) {
    if (!live_cpus[cpu_id].load(std::memory_order_relaxed)) {
        // Thread was destroyed; leak object.
        // Should only happen for boost unit-tests.
        return;
    }
    auto p = reinterpret_cast<cross_cpu_free_item*>(ptr);
    auto& list = all_cpus[cpu_id]->xcpu_freelist;
    auto old = list.load(std::memory_order_relaxed);
    do {
        p->next = old;
    } while (!list.compare_exchange_weak(old, p,
        std::memory_order_release, std::memory_order_relaxed));
    ++g_cross_cpu_frees;
}
Details:
The current design of the allocator is exceedingly simple, which is good:
// Memory map:
//
// 0x0000'sccc'vvvv'vvvv
//
// 0000 - required by architecture (only 48 bits of address space)
// s - chosen to satisfy system allocator (1-7)
// ccc - cpu number (0-12 bits allocated vary according to system)
// v - virtual address within cpu (32-44 bits, according to
//     how much ccc leaves us)
So finding the owning cpu of a pointer is effectively a couple of operations:
inline
unsigned object_cpu_id(const void* ptr) {
    return (reinterpret_cast<uintptr_t>(ptr) >> cpu_id_shift) & 0xff;
}
Here is the generated assembly for it:
mov rax, QWORD PTR [rbp-8]
shr rax, 36
movzx eax, al
How is this all wired up?
bool
cpu_pages::try_cross_cpu_free(void* ptr) {
    auto obj_cpu = object_cpu_id(ptr);
    if (obj_cpu != cpu_id) {
        free_cross_cpu(obj_cpu, ptr); // ............. expensive part
        return true;
    }
    return false;
}
and the top-level function is what you would expect:
void free(void* obj) {
    if (get_cpu_mem().try_cross_cpu_free(obj)) {
        return;
    }
    ++g_frees;
    get_cpu_mem().free(obj);
}
At Vectorized we have cross-core memory sinks, like most seastar apps: we
take in a buffer that is to be _consumed entirely_ by the destination core.
Say a user pushes a request to Kafka and they just get some very small
metadata back (logical offset of the append request).
Scanning the scylla source code, the gist is that you want to use a
foreign_ptr<> _every_ time you are doing cross-core movement of any kind.
foreign_ptr's are viral because they become part of the interface. The
same is true of seastar::future<> and sstring, which is why I wanted to
start this discussion before I decorate some of my types with foreign_ptrs.
However, I wrote the benchmark below and the numbers tell me a
different story.
The TL;DR: for small object graphs (1 or 2 foreign_ptr<>) the plain
unique_ptr version is around 10-15% faster (I ran each bench for 500
seconds), and for larger graphs foreign_ptr is ~30% slower.
The Code:
static inline future<> simple_int_for_all() {
    return parallel_for_each(
      boost::irange<unsigned>(0, smp::count), [](unsigned c) {
          auto v = std::make_unique<int>(42);
          return smp::submit_to(
            c, [v = std::move(v)] { perf_tests::do_not_optimize(v); });
      });
}
static inline future<> foreign_int_for_all() {
    return parallel_for_each(
      boost::irange<unsigned>(0, smp::count), [](unsigned c) {
          auto v = make_foreign<std::unique_ptr<int>>(
            std::make_unique<int>(42));
          return smp::submit_to(
            c, [v = std::move(v)] { perf_tests::do_not_optimize(v); });
      });
}
PERF_TEST(xcore_dealloc, simple_n_square) {
    return parallel_for_each(
      boost::irange<unsigned>(0, smp::count),
      [](unsigned) { return simple_int_for_all(); });
}
PERF_TEST(xcore_dealloc, foreign_ptr_n_square) {
    return parallel_for_each(
      boost::irange<unsigned>(0, smp::count),
      [](unsigned) { return foreign_int_for_all(); });
}
static inline future<> large_simple_for_all() {
    return parallel_for_each(
      boost::irange<unsigned>(0, smp::count), [](unsigned c) {
          using ptr = std::unique_ptr<int>;
          using vec_t = std::vector<ptr>;
          auto vec = std::make_unique<vec_t>();
          // reserve, not resize: resize(200) would prepend 200 null
          // pointers before the push_backs below
          vec->reserve(200);
          for (auto i = 0; i < 200; ++i) {
              vec->push_back(std::make_unique<int>(i));
          }
          return smp::submit_to(
            c, [v = std::move(vec)] { perf_tests::do_not_optimize(v); });
      });
}
static inline future<> large_foreign_for_all() {
    return parallel_for_each(
      boost::irange<unsigned>(0, smp::count), [](unsigned c) {
          using ptr = foreign_ptr<std::unique_ptr<int>>;
          using vec_t = std::vector<ptr>;
          auto vec = make_foreign<std::unique_ptr<vec_t>>(
            std::make_unique<vec_t>());
          // reserve, not resize: avoid 200 leading null pointers
          vec->reserve(200);
          for (auto i = 0; i < 200; ++i) {
              vec->push_back(
                make_foreign<std::unique_ptr<int>>(std::make_unique<int>(i)));
          }
          return smp::submit_to(
            c, [v = std::move(vec)] { perf_tests::do_not_optimize(v); });
      });
}
PERF_TEST(xcore_dealloc, large_simple_n_square) {
return parallel_for_each(
boost::irange<unsigned>(0, smp::count),
[](unsigned) { return large_simple_for_all(); });
}
PERF_TEST(xcore_dealloc, large_foreign_ptr_n_square) {
return parallel_for_each(
boost::irange<unsigned>(0, smp::count),
[](unsigned) { return large_foreign_for_all(); });
}
The results (iterations, median, median absolute deviation, min, max):

xcore_dealloc.simple_n_square                 92890   10.722us   36.735ns  10.519us  10.822us
xcore_dealloc.foreign_ptr_n_square            84201   11.904us  104.836ns  11.784us  12.086us
xcore_dealloc.large_simple_n_square            3435  286.240us  800.340ns 284.491us 289.458us
xcore_dealloc.large_foreign_ptr_n_square       2638  377.165us  998.519ns 373.607us 380.197us
Looking at the numbers, it makes sense to use foreign_ptr for a purpose
other than performance: namely, you are probably holding a semaphore or
some other thread-local resource that you want to release on the owning
core.
However, the advice around performance doesn't seem to hold up under
benchmarking.
Am I missing something?