[DISCUSS] foreign_ptr


Alexander Gallego

<alex@vectorized.io>
unread,
Oct 5, 2019, 11:52:54 AM10/5/19
to seastar-dev@googlegroups.com
I wanted to start a discussion around the advice/design for foreign_ptr

https://github.com/scylladb/seastar/blob/master/doc/tutorial.md#foreign-pointers

After reading Nadav's notes above on memory management across cores, I
wanted to measure what 'expensive' actually means when freeing on a
remote core:


BLUF (bottom line up front):


void cpu_pages::free_cross_cpu(unsigned cpu_id, void* ptr) {
    if (!live_cpus[cpu_id].load(std::memory_order_relaxed)) {
        // Thread was destroyed; leak object
        // should only happen for boost unit-tests.
        return;
    }
    auto p = reinterpret_cast<cross_cpu_free_item*>(ptr);
    auto& list = all_cpus[cpu_id]->xcpu_freelist;
    auto old = list.load(std::memory_order_relaxed);
    do {
        p->next = old;
    } while (!list.compare_exchange_weak(old, p,
            std::memory_order_release, std::memory_order_relaxed));
    ++g_cross_cpu_frees;
}


Details:

The current design of the allocator is exceedingly simple, which is good:

// Memory map:
//
// 0x0000'sccc'vvvv'vvvv
//
// 0000 - required by architecture (only 48 bits of address space)
// s    - chosen to satisfy system allocator (1-7)
// ccc  - cpu number (0-12 bits allocated vary according to system)
// v    - virtual address within cpu (32-44 bits, according to
//        how much ccc leaves us)


So finding a pointer's home CPU is effectively a couple of operations:

inline
unsigned object_cpu_id(const void* ptr) {
    return (reinterpret_cast<uintptr_t>(ptr) >> cpu_id_shift) & 0xff;
}


Here is the generated assembly for it:

mov rax, QWORD PTR [rbp-8]
shr rax, 36
movzx eax, al
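
As a worked example, assuming cpu_id_shift == 36 (which the `shr rax, 36`
above implies; the address below is made up for illustration):

ptr          = 0x0000'5050'1234'5678
ptr >> 36    = 0x505
0x505 & 0xff = 0x05   // the object's home cpu is 5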


How is this all wired up?

bool
cpu_pages::try_cross_cpu_free(void* ptr) {
    auto obj_cpu = object_cpu_id(ptr);
    if (obj_cpu != cpu_id) {
        free_cross_cpu(obj_cpu, ptr); // ............. expensive part
        return true;
    }
    return false;
}


and the top-level function is what you would expect:


void free(void* obj) {
    if (get_cpu_mem().try_cross_cpu_free(obj)) {
        return;
    }
    ++g_frees;
    get_cpu_mem().free(obj);
}


At Vectorized we have cross-core memory sinks, like most seastar apps.
We take in a buffer that is to be _consumed entirely_ by the destination
core. Say a user pushes a request to Kafka and just gets some very small
metadata back (the logical offset of the append request).
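
A minimal sketch of that pattern (hypothetical names like
append_on_owner and consume_and_get_offset; assumes the same seastar
headers and namespace as the snippets below):

future<uint64_t> append_on_owner(unsigned shard, temporary_buffer<char> buf) {
    return smp::submit_to(shard, [buf = std::move(buf)]() mutable {
        // consume_and_get_offset() is a hypothetical stand-in: the
        // buffer is consumed - and eventually freed - on the
        // destination shard, and only a small offset travels back
        return consume_and_get_offset(std::move(buf));
    });
}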


Scanning the Scylla source code, the gist is that you want to use a
foreign_ptr<> _every_ time you are doing cross-core movement of any
kind. foreign_ptr's are viral because they become part of the interface.
The same is true of seastar::future<> and sstring, which is why I wanted
to start this discussion before I decorate some of my types with
foreign_ptrs.

However, I wrote this benchmark below and the numbers tell me a
different story.


The TL;DR: for small object graphs (1 or 2 foreign_ptr<>) the
foreign_ptr version is around 10-15% slower (I ran each bench for 500
seconds); for larger graphs it is actually ~30% slower.

The Code:

static inline future<> simple_int_for_all() {
    return parallel_for_each(
      boost::irange<unsigned>(0, smp::count), [](unsigned c) {
          auto v = std::make_unique<int>(42);
          return smp::submit_to(
            c, [v = std::move(v)] { perf_tests::do_not_optimize(v); });
      });
}
static inline future<> foreign_int_for_all() {
    return parallel_for_each(
      boost::irange<unsigned>(0, smp::count), [](unsigned c) {
          auto v = make_foreign<std::unique_ptr<int>>(
            std::make_unique<int>(42));
          return smp::submit_to(
            c, [v = std::move(v)] { perf_tests::do_not_optimize(v); });
      });
}

PERF_TEST(xcore_dealloc, simple_n_square) {
    return parallel_for_each(
      boost::irange<unsigned>(0, smp::count),
      [](unsigned) { return simple_int_for_all(); });
}
PERF_TEST(xcore_dealloc, foreign_ptr_n_square) {
    return parallel_for_each(
      boost::irange<unsigned>(0, smp::count),
      [](unsigned) { return foreign_int_for_all(); });
}

static inline future<> large_simple_for_all() {
    return parallel_for_each(
      boost::irange<unsigned>(0, smp::count), [](unsigned c) {
          using ptr = std::unique_ptr<int>;
          using vec_t = std::vector<ptr>;
          auto vec = std::make_unique<vec_t>();
          // note: resize() default-constructs 200 empty entries, so the
          // loop below leaves the vector holding 400 elements
          // (reserve() was likely intended)
          vec->resize(200);
          for (auto i = 0; i < 200; ++i) {
              vec->push_back(std::make_unique<int>(i));
          }
          return smp::submit_to(
            c, [v = std::move(vec)] { perf_tests::do_not_optimize(v); });
      });
}
static inline future<> large_foreign_for_all() {
    return parallel_for_each(
      boost::irange<unsigned>(0, smp::count), [](unsigned c) {
          using ptr = foreign_ptr<std::unique_ptr<int>>;
          using vec_t = std::vector<ptr>;
          auto vec = make_foreign<std::unique_ptr<vec_t>>(
            std::make_unique<vec_t>());
          // same note as above: 200 default-constructed entries plus
          // 200 pushed ones
          vec->resize(200);
          for (auto i = 0; i < 200; ++i) {
              vec->push_back(
                make_foreign<std::unique_ptr<int>>(std::make_unique<int>(i)));
          }
          return smp::submit_to(
            c, [v = std::move(vec)] { perf_tests::do_not_optimize(v); });
      });
}

PERF_TEST(xcore_dealloc, large_simple_n_square) {
    return parallel_for_each(
      boost::irange<unsigned>(0, smp::count),
      [](unsigned) { return large_simple_for_all(); });
}
PERF_TEST(xcore_dealloc, large_foreign_ptr_n_square) {
    return parallel_for_each(
      boost::irange<unsigned>(0, smp::count),
      [](unsigned) { return large_foreign_for_all(); });
}


The results:

xcore_dealloc.simple_n_square              92890   10.722us   36.735ns  10.519us  10.822us
xcore_dealloc.foreign_ptr_n_square         84201   11.904us  104.836ns  11.784us  12.086us
xcore_dealloc.large_simple_n_square         3435  286.240us  800.340ns 284.491us 289.458us
xcore_dealloc.large_foreign_ptr_n_square    2638  377.165us  998.519ns 373.607us 380.197us



Looking at the numbers, it makes sense to use foreign_ptr for a purpose
other than performance. That is, because you are probably holding a
semaphore or some other thread-local thing that you want to release.
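
For instance (a hedged sketch; run_on and the semaphore are
hypothetical), units from a shard-local semaphore can ride a foreign_ptr
so they are released back on the shard where the semaphore lives:

future<> run_on(unsigned target, semaphore& local_sem) {
    return get_units(local_sem, 1).then([target](semaphore_units<> units) {
        // wrap the units so they travel as a foreign_ptr
        auto fp = make_foreign(
          std::make_unique<semaphore_units<>>(std::move(units)));
        return smp::submit_to(target, [fp = std::move(fp)]() mutable {
            // ... cross-shard work ...
            // when fp dies here, the units are destroyed back on the
            // originating shard, releasing the semaphore where it lives
        });
    });
}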

However, the advice around performance doesn't seem to hold up under
benchmarking.

Am I missing something?



Alexander Gallego

<alex@vectorized.io>
unread,
Oct 5, 2019, 12:04:05 PM10/5/19
to seastar-dev
Here is a working, compilable gist (depends on my mailing list patch for exposing seastar_perf_testing lib)

Nadav Har'El

<nyh@scylladb.com>
unread,
Oct 5, 2019, 12:15:34 PM10/5/19
to Alexander Gallego, seastar-dev
On Sat, Oct 5, 2019 at 6:52 PM Alexander Gallego <al...@vectorized.io> wrote:
I wanted to start a discussion around the advice/design for foreign_ptr

I'd be happy to.
I'll also be happy if additional people start contributing text to the tutorial. Even though I wrote most of the
text currently in the tutorial, I'm not the top expert in many of the things I explained, and other people may have
better understanding than me in some things. Moreover, some things are more opinions than just "facts" (why
certain features exist? why one feature is recommended, or un-recommended?). So we definitely need more
people commenting on tutorial.md - or better yet - writing text for it.


https://github.com/scylladb/seastar/blob/master/doc/tutorial.md#foreign-pointers

After reading Nadav's notes above on memory management across cores, I
wanted to measure what 'expensive' actually means when freeing on a
remote core:

Something I tried to explain in the text, and maybe didn't explain well
enough, is that foreign pointers are *not* about freeing
remotely-allocated pointers. As I said, this is already supported
without foreign pointers. Yes, I said it was "slow"; maybe that was an
overstatement (and it will be good to understand whether it's really
slow). Rather, foreign pointers give you two other things:

1. It automates running the object's destructor on its home shard. This
is a different thing from where to free the memory! In most C++ code,
the destructor does a lot more than just freeing. Often, the destructor
needs to access shared memory, and needs to do this on the owner shard.

2. It is a "signal" to the programmer to remember that not just the
destructor, but other methods as well, need to be run on the owner
shard. A foreign pointer, like you said, is "viral": you can't forget
you are holding one, so you can't forget to treat the pointer
accordingly.
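
Conceptually, point 1 works something like this simplified sketch (not
seastar's actual implementation, which lives in the foreign_ptr code and
handles more cases):

template <typename PtrType>
void destroy_on_owner(unsigned owner_shard, PtrType ptr) {
    if (owner_shard == engine().cpu_id()) {
        return; // already home: ptr's destructor runs right here
    }
    // ship the pointer back so its destructor - not just its free() -
    // runs on the shard that created it (fire-and-forget in this sketch)
    (void)smp::submit_to(owner_shard, [p = std::move(ptr)]() mutable {});
}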

If it seemed from my explanation that the only, or main, purpose of
foreign_ptr is performance, then I didn't explain it well. If you can
send a patch to improve the explanation, it will be great.

At Vectorized we have cross-core memory sinks, like most seastar apps.
We take in a buffer that is to be _consumed entirely_ by the destination
core. Say a user pushes a request to Kafka and just gets some very small
metadata back (the logical offset of the append request).

Scanning the Scylla source code, the gist is that you want to use a
foreign_ptr<> _every_ time you are doing cross-core movement of any
kind. foreign_ptr's are viral because they become part of the interface.
The same is true of seastar::future<> and sstring, which is why I wanted
to start this discussion before I decorate some of my types with
foreign_ptrs.

Maybe when you pass an object which has a trivial destructor - e.g., a
unique_ptr<vector<char>> - then indeed a normal pointer is better than a
foreign_ptr, because:
1. In this case we don't need the feature of running the destructor on
the owner shard (there is no destructor with side effects), and
2. you don't need to "tell" the new owner - via the type name - that it
needs to coordinate with the owner shard, because ownership was passed.

Maybe we should indeed codify a way to do this - or simply explain this issue in words in the tutorial.


However, I wrote this benchmark below and the numbers tell me a
different story.


The TL;DR: for small object graphs (1 or 2 foreign_ptr<>) the
foreign_ptr version is around 10-15% slower (I ran each bench for 500
seconds); for larger graphs it is actually ~30% slower.

By "graphs", you mean objects?

Alexander Gallego

<alex@vectorized.io>
unread,
Oct 5, 2019, 12:27:19 PM10/5/19
to seastar-dev
I 100% agree on the correctness argument. This class is very much in line with RAII thinking.

I am happy to send in a patch for docs.

I think your docs are amazing in general; that's how I started ~3 years ago.

I was just being cautious about performance.


 

At Vectorized we have cross-core memory sinks, like most seastar apps.
We take in a buffer that is to be _consumed entirely_ by the destination
core. Say a user pushes a request to Kafka and just gets some very small
metadata back (the logical offset of the append request).

Scanning the Scylla source code, the gist is that you want to use a
foreign_ptr<> _every_ time you are doing cross-core movement of any
kind. foreign_ptr's are viral because they become part of the interface.
The same is true of seastar::future<> and sstring, which is why I wanted
to start this discussion before I decorate some of my types with
foreign_ptrs.

Maybe when you pass an object which has a trivial destructor - e.g., a
unique_ptr<vector<char>> - then indeed a normal pointer is better than a
foreign_ptr, because:
1. In this case we don't need the feature of running the destructor on
the owner shard (there is no destructor with side effects), and
2. you don't need to "tell" the new owner - via the type name - that it
needs to coordinate with the owner shard, because ownership was passed.

Maybe we should indeed codify a way to do this - or simply explain this issue in words in the tutorial.


I can patch the docs. But note that at first I was doing this with definitely non-trivial types, with similar results.

 


However, I wrote this benchmark below and the numbers tell me a
different story.


The TL;DR: for small object graphs (1 or 2 foreign_ptr<>) the
foreign_ptr version is around 10-15% slower (I ran each bench for 500
seconds); for larger graphs it is actually ~30% slower.

By "graphs", you mean objects?


object-graphs refer to nested structures.

 

Nadav Har'El

<nyh@scylladb.com>
unread,
Oct 5, 2019, 5:45:06 PM10/5/19
to Alexander Gallego, seastar-dev
Great.

By the way, another thing we should consider when it comes to documenting the feature is the sentence I opened this section with (http://nadav.harel.org.il/seastar/23.html#foreign-pointers):

"Freeing memory on the wrong thread is strongly discouraged, but is currently supported (albeit slowly) to support library code beyond Seastar’s control."

This indeed, if my memory serves me correctly (Avi probably remembers better), was correct historically: we wanted freeing on the wrong CPU to be illegal, but when we tried to do this we had a mess from other libraries which allocate memory, so we had to support this.
But we need to decide if it's still discouraged - and in some sense even deprecated (i.e., can be removed in the future if we fix the issues with the libraries) - or, given the point you are raising now, which sounds like it has a lot of merit - that freeing memory on the wrong shard even has benefits in some cases - then it's not discouraged, and certainly not deprecated.

That being said, it is still dangerous to not use foreign pointers for non-trivial (for some sense of non-trivial... more on this below) types because their destructors and other methods will likely be run on the wrong CPU. If we don't "discourage" this use in general, we should at least explain when it makes sense, and when it does not.

I think your docs are amazing in general, that's how I started almost ~3 years ago.

I was just being cautious about performance.

The whole "performance" issue can be addressed by removing two words "(albeit slowly)" from the text.
I'll admit that I never benchmarked this, and I assumed the penalty is fairly high without actually having numbers to support this assumption.

That being said, IIUC you compared the performance of "free on the wrong CPU" to "free via foreign-ptr". This is not what I had in
mind when I said that "free on the wrong CPU" is slow. What I had in mind was that "free on the wrong CPU" was slower than
"freeing on the right CPU" (i.e., just a normal run-of-the-mill free of an object this CPU allocated), which suggests that
applications should be "mostly" sharded - i.e. most of the objects will be created and deleted on the same CPU - except
specific objects which you want to send cross shards.
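
That comparison would look something like this, in the same perf-test
style as the benchmarks above (a sketch with hypothetical test names;
engine().cpu_id() is the current shard):

PERF_TEST(xcore_dealloc, right_cpu_free) {
    auto v = std::make_unique<int>(42);
    perf_tests::do_not_optimize(v);
    return make_ready_future<>(); // v is freed here, on the allocating shard
}
PERF_TEST(xcore_dealloc, wrong_cpu_free) {
    auto v = std::make_unique<int>(42);
    // plain delete runs on a different shard; the allocator routes the
    // memory home via the cross-CPU freelist shown at the top
    return smp::submit_to((engine().cpu_id() + 1) % smp::count,
      [v = std::move(v)] { perf_tests::do_not_optimize(v); });
}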

 

At Vectorized we have cross-core memory sinks, like most seastar apps.
We take in a buffer that is to be _consumed entirely_ by the destination
core. Say a user pushes a request to Kafka and just gets some very small
metadata back (the logical offset of the append request).

Scanning the Scylla source code, the gist is that you want to use a
foreign_ptr<> _every_ time you are doing cross-core movement of any
kind. foreign_ptr's are viral because they become part of the interface.
The same is true of seastar::future<> and sstring, which is why I wanted
to start this discussion before I decorate some of my types with
foreign_ptrs.

Maybe when you pass an object which has a trivial destructor - e.g., a
unique_ptr<vector<char>> - then indeed a normal pointer is better than a
foreign_ptr, because:
1. In this case we don't need the feature of running the destructor on
the owner shard (there is no destructor with side effects), and
2. you don't need to "tell" the new owner - via the type name - that it
needs to coordinate with the owner shard, because ownership was passed.

Maybe we should indeed codify a way to do this - or simply explain this issue in words in the tutorial.


I can patch the docs. But note that at first I was doing this with definitely non-trivial types, with similar results.

When I wrote "trivial" I didn't mean just "trivial" in the C++ standard sense. I probably should not have used that word.
What I actually meant was a destructor which besides freeing memory does not have any side-effects on other objects.
Freeing a std::vector, for example, does not go about touching other vectors, or some central registry of vectors, or anything
of this sort. But imagine you had a new type "myvector", which registers itself in some shard-local registry of vectors.
In this case, it is imperative that you destruct the myvector object on the same shard where it was created - otherwise
the destructor will not be able to delete itself from the registry on the shard where it was originally registered.
As another example, consider an object which contains a lw_shared_ptr to various other
objects it needs. This object needs to be destroyed on the home shard so the lw_shared_ptr counters will be
decremented on the correct shard.
These sorts of "non-trivial" (or whatever we call them...) destructors need to be run on the home shard...
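
A minimal sketch of that hypothetical "myvector" (assumes <set> and
<vector> are included):

struct myvector;
thread_local std::set<myvector*> vector_registry; // one instance per shard

struct myvector {
    std::vector<char> data;
    myvector() { vector_registry.insert(this); }
    ~myvector() {
        // if this runs on the wrong shard it consults *that* shard's
        // registry, misses, and the entry on the home shard is never
        // removed - which is why the destructor must run at home
        vector_registry.erase(this);
    }
};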


 


>>> However, I wrote this benchmark below and the numbers tell me a
different story.


>>> The TL;DR: for small object graphs (1 or 2 foreign_ptr<>) the
>>> foreign_ptr version is around 10-15% slower (I ran each bench for
>>> 500 seconds); for larger graphs it is actually ~30% slower.

By "graphs", you mean objects?


object-graphs refer to nested structures.

I see, you mean an object containing many foreign_ptr<>.
I wonder if this use-case can somehow be optimized, perhaps by batching these frees more, or something,
so foreign_ptr<> will not only be safer than "freeing on the wrong shard", it will also be faster...
 

Alexander Gallego

<alex@vectorized.io>
unread,
Oct 6, 2019, 12:21:39 AM10/6/19
to Nadav Har'El, seastar-dev
>>> 1. It automates running the object's *destructor* on its home shard. This
>>> is a different thing
>>> from where to free the memory! In most C++ code, the destructor does a
>>> lot more than just
>>> freeing. Often, the destructor needs to access shared memory, and needs
>>> to do this on the
>>> owner shard.
>>>
>>> 2. It is a "signal" to the programmer to remember that not just the
>>> destructor, but other
>>> methods as well, need to be run on the owner shard. A foreign pointer,
>>> like you said, is
>>> "viral": you can't forget you are holding one, so you can't forget to
>>> treat the pointer
>>> accordingly.
>>>
>>> If it seemed from my explanation that the only, or main, purpose of
>>> foreign_ptr is performance,
>>> then I didn't explain it well. If you can send a patch to improve the
>>> explanation, it will be great.
>>>
>>
>>
>> I 100% agree on the correctness argument. This class is very much in line
>> with RAII thinking.
>>
>> I am happy to send in a patch for docs.
>>
>
> Great.
>
> By the way, another thing we should consider when it comes to documenting
> feature is the sentence I opened this section with (
> http://nadav.harel.org.il/seastar/23.html#foreign-pointers):
>
> "Freeing memory on the *wrong* thread is strongly discouraged, but is
> currently supported (albeit slowly) to support library code beyond
> Seastar’s control."
>
> This indeed, if my memory serves me correctly (Avi probably remembers
> better), was correct historically: we *wanted* freeing on the wrong CPU to
> be illegal, but when we tried to do this we had a mess from other libraries
> which allocate memory, so we had to support this.
> But we need to decide if it's still discouraged - and in some sense even
> deprecated (i.e., can be removed in the future if we fix the issues with
> the libraries) - or, given the point you are raising now, which sounds like
> it has a lot of merit - that freeing memory on the wrong shard even has
> benefits in some cases - then it's not discouraged, and certainly not
> deprecated.
>
> That being said, it is still *dangerous* to not use foreign pointers for
> non-trivial (for some sense of non-trivial... more on this below) types
> because their destructors and other methods will likely be run on the wrong
> CPU. If we don't "discourage" this use in general, we should at least
> explain when it makes sense, and when it does not.
>
>>
>> I think your docs are amazing in general; that's how I started ~3
>> years ago.
>>
>> I was just being cautious about performance.
>>
>
> The whole "performance" issue can be addressed by removing two words
> "(albeit slowly)" from the text.


+1 That should be enough.

When you read docs like that, it makes you think you did something
wrong when you measure and the results come out different.

Anyway, yeah, that should be good.


> I'll admit that I never benchmarked this, and I assumed the penalty is
> fairly high without actually having numbers to support this assumption.
>
> That being said, IIUC you compared the performance of "free on the wrong
> CPU" to "free via foreign-ptr". This is not what I had in
> mind when I said that "free on the wrong CPU" is slow. What I had in mind
> was that "free on the wrong CPU" was slower than
> "freeing on the right CPU" (i.e., just a normal run-of-the-mill free of an
> object this CPU allocated), which suggests that
> applications should be "mostly" sharded - i.e. most of the objects will be
> created and deleted on the same CPU - except
> specific objects which you want to send cross shards.
>
>
>>
>>>
>>> At Vectorized we have cross-core memory sinks, like most seastar apps. We take
>>>> in a buffer that is to be _consumed entirely_ by the destination core.
>>>> Say a user pushes a request to Kafka and just gets some very small
>>>> metadata back (the logical offset of the append request).
>>>>
>>>>
>>>> Scanning the Scylla source code, the gist is that you want to use a
>>>> foreign_ptr<> _every_ time you are doing cross-core movement of any kind.
>>>> foreign_ptr's are viral because they become part of the interface. The
>>>> same is true of seastar::future<> and sstring, which is why I wanted to
>>>> start this discussion before I decorate some of my types with foreign_ptrs.
>>>>
>>>
>>> Maybe when you *pass* an object which has a *trivial destructor* - e.g.,
Yup. tempbufs are such an example if you `.share()` them.

The original thread was never about the correctness, which is easy to
see. I think we're on the same page.

Gleb Natapov

<gleb@scylladb.com>
unread,
Oct 6, 2019, 4:22:24 AM10/6/19
to Nadav Har'El, Alexander Gallego, seastar-dev
On Sun, Oct 06, 2019 at 12:44:53AM +0300, Nadav Har'El wrote:
> "Freeing memory on the *wrong* thread is strongly discouraged, but is
> currently supported (albeit slowly) to support library code beyond
> Seastar’s control."
>
> This indeed, if my memory serves me correctly (Avi probably remembers
> better), was correct historically: we *wanted* freeing on the wrong CPU to
> be illegal, but when we tried to do this we had a mess from other libraries
> which allocate memory, so we had to support this.
The main offender, and the reason we had to support it, was
exception_ptr. It is out of our control where it is allocated and freed.
The reason it is deemed to be slower than freeing using foreign_ptr is
that it uses an MPSC queue (while the latter uses SPSC queues) and is
hence prone to congestion on large systems with a lot of freeing going
on. Our SPSC queue is also optimized for batching to reduce atomic
operations, and the xcpu freeing code lacks that.
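
For illustration, the batching idea looks roughly like this (a sketch
only, not actual seastar code; it reuses the cross_cpu_free_item type
from the snippet at the top of the thread and assumes <atomic>):

struct free_batch {
    cross_cpu_free_item* head = nullptr;
    cross_cpu_free_item* tail = nullptr;
    void add(void* ptr) {
        // link locally: no atomics at all on this path
        auto p = reinterpret_cast<cross_cpu_free_item*>(ptr);
        p->next = head;
        head = p;
        if (!tail) {
            tail = p;
        }
    }
    void flush(std::atomic<cross_cpu_free_item*>& list) {
        if (!head) {
            return;
        }
        // publish the whole chain with a single CAS loop, instead of
        // one CAS per object as in free_cross_cpu() above
        auto old = list.load(std::memory_order_relaxed);
        do {
            tail->next = old;
        } while (!list.compare_exchange_weak(old, head,
                std::memory_order_release, std::memory_order_relaxed));
        head = tail = nullptr;
    }
};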

--
Gleb.

noah@vectorized.io

<noah@vectorized.io>
unread,
Oct 6, 2019, 12:06:40 PM10/6/19
to seastar-dev
Is it correct to say that this example is not about the safety of freeing memory, but about controlling concurrent access to the hypothetical registry?
 
As another example, consider an object which contains a lw_shared_ptr to various other
objects it needs. This object needs to be destroyed on the home shard so the lw_shared_ptr counters will be
decremented on the correct shard.

Same question here--in this case because the underlying counter is non-atomic.

These sorts of "non-trivial" (or whatever we call them...) destructors need to be run on the home shard...

The use of foreign_ptr as a mechanism to control where the destructor is called from makes sense (especially in the above two examples), but I'm still a little fuzzy on whether there are _any_ scenarios in which the safety of freeing memory depends on shard location. The two examples you gave above seem to be the only exceptions alluded to, so their answers might be fairly instructive.

Nadav Har'El

<nyh@scylladb.com>
unread,
Oct 6, 2019, 12:22:11 PM10/6/19
to noah@vectorized.io, seastar-dev
On Sun, Oct 6, 2019 at 7:06 PM <no...@vectorized.io> wrote:
 
But imagine you had a new type "myvector", which registers itself in some shard-local registry of vectors.
In this case, it is imperative that you destruct the myvector object on the same shard where it was created - otherwise
the destructor will not be able to delete itself from the registry on the shard where it was originally registered.

Is it correct to say that this example is not about the safety of freeing memory, but about controlling concurrent access to the hypothetical registry?

Yes, it's not about the safety of "freeing memory" (free(), delete(), etc.) which Seastar ensures is always safe - even if called on the wrong CPU.
The problem in this example is the safety of destructing the object, i.e., running its user-defined destructor.

In this example, if this destructor needs to update some shard-local (C++ thread_local) registry, it will not find it when running on the wrong shard. It's not even an issue of concurrency if every shard has its own local registry (in this hypothetical example) and the wrong one is looked up.

 
As another example, consider an object which contains a lw_shared_ptr to various other
objects it needs. This object needs to be destroyed on the home shard so the lw_shared_ptr counters will be
decremented on the correct shard.

Same question here--in this case because the underlying counter is non-atomic.

Indeed: if several CPUs try to destroy their copies of the same lw_shared_ptr concurrently, they will decrement the same counter concurrently, and it is not atomic (this is the main difference between Seastar's and C++'s shared pointers) and not allowed.
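
To make the hazard concrete, a hedged sketch (`config` is a
hypothetical type):

future<> broken(lw_shared_ptr<config> cfg) {
    // the captured copy is destroyed on shard 1, decrementing the same
    // non-atomic counter this shard may be touching at the same time
    return smp::submit_to(1, [copy = cfg] { /* ... use *copy ... */ });
}
future<> safe(lw_shared_ptr<config> cfg) {
    // the foreign_ptr ships destruction back home, so the counter is
    // only ever touched on the owning shard
    auto fp = make_foreign(std::move(cfg));
    return smp::submit_to(1, [fp = std::move(fp)]() mutable { /* ... */ });
}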
 

These sorts of "non-trivial" (or whatever we call them...) destructors need to be run on the home shard...

The use of the foreign_ptr as a mechanism to control where the destructor is called from makes sense (especially in the above two examples), but I'm still a little fuzzy on if there are _any_ scenarios in which the safety of freeing memory is dependent on shard location.

Again, it depends on what you call "freeing memory"... The freeing itself is always safe - the problem is running the destructor. Because the destructor can run any user-written code, it can do basically anything, and in particular above we saw two broad classes of bad things which can happen if the destructor - or any method for that matter - is called from the wrong shard:

1. The destructor may look for an object in some shard-local structure (C++ thread-local, or something linked to from another thread-local object) and won't find it because it is looking in the wrong shard.

2. The destructor may follow a pointer to an object "owned" by some other shard to write something (in the previous example - the counter), and because multiple shards may be using this pointed-to object concurrently, we can have unprotected concurrent access to shared memory, with unknown results.
 
The two examples you gave above seem to be the only exceptions alluded to, so their answers might be fairly instructive.

The above two scenarios are fairly broad, they are not just two specific problems.

Avi Kivity

<avi@scylladb.com>
unread,
Oct 7, 2019, 3:25:12 AM10/7/19
to Alexander Gallego, seastar-dev@googlegroups.com

On 05/10/2019 18.52, Alexander Gallego wrote:
> I wanted to start a discussion around the advice/design for foreign_ptr
>

<snip>


> The results:
>
> xcore_dealloc.simple_n_square              92890   10.722us   36.735ns  10.519us  10.822us
> xcore_dealloc.foreign_ptr_n_square         84201   11.904us  104.836ns  11.784us  12.086us
> xcore_dealloc.large_simple_n_square         3435  286.240us  800.340ns 284.491us 289.458us
> xcore_dealloc.large_foreign_ptr_n_square    2638  377.165us  998.519ns 373.607us 380.197us
>
>

It's helpful to add labels to the results so we know what the numbers
mean, and to normalize results to be per-operation rather than having
the reader do the normalization in their head. Also, if your results
aren't accurate to 1 part per million, don't include so many decimal
places; they obfuscate the results.


In the end, I have no idea what this pile of numbers means.


>
> Looking at the numbers, it makes sense for foreign_ptr to be used for
> different purpose than performance. That is, because you probably
> holding a semaphore or some other thread-local thing that you want to
> release.
>
> However, the advise around performance doesn't seem to hold up after
> benchmarking.
>
> Am i missing something?
>
>

Did you test on a laptop or a 2s36c72t server?

Alexander Gallego

<alex@vectorized.io>
unread,
Oct 7, 2019, 11:48:02 AM10/7/19
to Avi Kivity, seastar-dev@googlegroups.com


On 10/7/19 12:25 AM, Avi Kivity wrote:
>
> On 05/10/2019 18.52, Alexander Gallego wrote:
>> I wanted to start a discussion around the advice/design for foreign_ptr
>>
>
> <snip>
>
>
>> The results:
>>
>> xcore_dealloc.simple_n_square              92890   10.722us   36.735ns  10.519us  10.822us
>> xcore_dealloc.foreign_ptr_n_square         84201   11.904us  104.836ns  11.784us  12.086us
>> xcore_dealloc.large_simple_n_square         3435  286.240us  800.340ns 284.491us 289.458us
>> xcore_dealloc.large_foreign_ptr_n_square    2638  377.165us  998.519ns 373.607us 380.197us
>>
>>
>
> It's helpful to add labels to the results so we know what the numbers
> mean, and to normalize results to be per-operation rather than having
> the reader do the normalization in their head. Also, if your results
> aren't accurate to 1 part per million, don't include so many decimal
> places; they obfuscate the results.
>
>
> In the end, I have no idea what this pile of numbers means.
>

It's the seastar benchmark framework.

Happy to add the headers when I run on a larger machine, but it seems
reasonable to post the numbers in the format of the internal
benchmarking framework.

>
>>
>> Looking at the numbers, it makes sense to use foreign_ptr for a
>> purpose other than performance. That is, because you are probably
>> holding a semaphore or some other thread-local thing that you want to
>> release.
>>
>> However, the advice around performance doesn't seem to hold up under
>> benchmarking.
>>
>> Am I missing something?
>>
>>
>
> Did you test on a laptop or a 2s36c72t server?
>

Good point. It was on a small computer. I'll test on a large machine.

I assume the MPSC vs SPSC queue perf difference from Gleb's comment
will become larger there.

Will post back


