Re: Optimizing Linux WaitableEvent's using futex2

72 views
Skip to first unread message

Benoit Lize

unread,
May 3, 2021, 7:54:28 AM5/3/21
to André Almeida, Ken Rockot, Chromium-dev, Gabriel Charette, ker...@collabora.com, Joel Fernandes
Hello, and thanks for reaching out!

I had seen the futex2 proposal (from Hacker News), and was curious about it :)
Looking at Chrome's code (using cross-references in code search), WaitableEvent::WaitMulitple() is not really used widely outside of tests, with the only use case I can see that looks like it could be hot in mojo. Perhaps +Ken Rockot would know whether it's a hot code path. So while the proposed system calls seems to closely match WaitableEvent::WaitMultiple(), I'm not sure how much that would change things for Chrome (outside of simpler code, as you mention).

Are there other features of your patch set that could help with locking performance? Chrome has a large number of locks, in various places, with a pretty wide range of coding patterns.

Thanks,

-- 
Benoit



On Fri, Apr 30, 2021 at 1:48 AM André Almeida <andre...@collabora.com> wrote:
Hi there,

I'm the author of futex2[0], a WIP new set of Linux's syscalls that
allows userspace to write efficient sync mechanisms. I believe that
Chromium's WaitableEvent could benefit from it, so I would like to hear
what you think about it :)

* What's futex?

futex stands for "fast userspace mutexes". It's a syscall that basically
exposes a wait queue, where userspace can ask to put threads to sleep
and to wake them. The index of such queue is an unsigned int address
(that is usually called a futex).

It's value doesn't mean anything for the kernel, userspace is
responsible to defining the rules of it and implementing the logic.
Usually there's a value for when the lock is free, and another one for
when the lock is hold, for instance. Also, it doesn't require to call a
syscall on the uncontested path.

* Can you point some examples of what use futex?

glibc uses it to implement pthreads, Wine uses it to emulate WinAPI's
sync primitives. Chromium uses for partition allocator's spinlock[1],
and some of it's dependencies like abseill-cpp and tcmalloc uses futex
as well.

* Why futex2?

futex has long standing issues that seems impossible to be solved in the
current interface, so we are designing a new set of syscalls taking
those problems in account (more on that at [0]). It's not merged yet,
and I'm researching some uses cases to validate and get feedback about
the API.

One of those features is the ability to wait on multiple futexes at the
same time (called futex_waitv()). The main use case for that is to
properly map WinAPI's WaitForMultipleObjects()[2] to a Linux alternative
in Wine.

* Wait for multiple futexes at Chromium

Chromium's WaitableEvent API is implemented in different ways,
correspondingly to it's target platform. When compiling for Windows[3],
WaitableEvent::WaitMany() calls WaitForMultipleObjects().

At Linux[4], it seems to enqueue a "global" SyncWaiter object to each
one of the events it's waiting for, and wait for it. When it's time to a
event to signal, it will signal both it's own SyncWaiter object and the
global one. Please, correct me if I'm wrong!

I'm not sure what are the performance implications of this design, but
sharing a global object with multiple events could led to some problems
if they all signal at same time and the access to the global object
needs to be serialized, or if the same event is used for several
WaitAny()s at the same time, meaning that it would have to signal a lot
of global objects.

My assumption is that futex_waitv() could be used here, in the same
design used for the Windows. No need to create and enqueue a new global
waiter SyncWaiter, just wait on all those objects at the same time. I
speculate that this could improve the code performance, or at least make
the code simpler, so I'm here to hear the opinion of Chromium developers
about it. Please, let me know your thoughts on the subject!

I even wrote a simple draft showing how it would look like to use
futexes at WaitableEvent class: https://gitlab.collabora.com/-/snippets/101

Thanks,
        André

Additional resources:
- futex2 talk at OpenSource Summit:
https://www.youtube.com/watch?v=eG6GMBTcPQ8
- Basics of futex: https://eli.thegreenplace.net/2018/basics-of-futexes/
- Futexes Are Tricky: https://www.akkadia.org/drepper/futex.pdf

[0]
https://lore.kernel.org/lkml/20210427231248.22...@collabora.com/
[1]
https://github.com/chromium/chromium/blob/master/base/allocator/partition_allocator/spinning_mutex.cc
[2]
https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitformultipleobjects
[3]
https://github.com/chromium/chromium/blob/master/base/synchronization/waitable_event_win.cc
[4]
https://github.com/chromium/chromium/blob/master/base/synchronization/waitable_event_posix.cc

Ken Rockot

unread,
May 3, 2021, 10:46:37 AM5/3/21
to Benoit Lize, André Almeida, Chromium-dev, Gabriel Charette, ker...@collabora.com, Joel Fernandes


On Mon, May 3, 2021 at 7:44 AM Ken Rockot <roc...@chromium.org> wrote:


On Mon, May 3, 2021 at 4:53 AM Benoit Lize <li...@chromium.org> wrote:
Hello, and thanks for reaching out!

I had seen the futex2 proposal (from Hacker News), and was curious about it :)
Looking at Chrome's code (using cross-references in code search), WaitableEvent::WaitMulitple() is not really used widely outside of tests, with the only use case I can see that looks like it could be hot in mojo. Perhaps +Ken Rockot would know whether it's a hot code path. So while the proposed system calls seems to closely match WaitableEvent::WaitMultiple(), I'm not sure how much that would change things for Chrome (outside of simpler code, as you mention).

Yes, WaitMain() is used almost exclusively by Mojo and legacy IPC to wait on sync IPC replies. Since there's a lot of sync IPC happening, it's

Oof. WaitMany().
 
probably quite hot especially under heavy graphics load (e.g. Meet).

Note however that in an overwhelming majority of cases we only call WaitMany() with at most 2 events at a time, so that may limit the practical impact of any optimization there.

Ken Rockot

unread,
May 3, 2021, 10:47:23 AM5/3/21
to Benoit Lize, André Almeida, Chromium-dev, Gabriel Charette, ker...@collabora.com, Joel Fernandes
On Mon, May 3, 2021 at 4:53 AM Benoit Lize <li...@chromium.org> wrote:
Hello, and thanks for reaching out!

I had seen the futex2 proposal (from Hacker News), and was curious about it :)
Looking at Chrome's code (using cross-references in code search), WaitableEvent::WaitMulitple() is not really used widely outside of tests, with the only use case I can see that looks like it could be hot in mojo. Perhaps +Ken Rockot would know whether it's a hot code path. So while the proposed system calls seems to closely match WaitableEvent::WaitMultiple(), I'm not sure how much that would change things for Chrome (outside of simpler code, as you mention).

Yes, WaitMain() is used almost exclusively by Mojo and legacy IPC to wait on sync IPC replies. Since there's a lot of sync IPC happening, it's probably quite hot especially under heavy graphics load (e.g. Meet).

Note however that in an overwhelming majority of cases we only call WaitMany() with at most 2 events at a time, so that may limit the practical impact of any optimization there.

Benoit Lize

unread,
May 4, 2021, 12:57:52 PM5/4/21
to André Almeida, Chromium-dev, Gabriel Charette, ker...@collabora.com, Joel Fernandes, Ken Rockot


On Mon, May 3, 2021 at 4:56 PM André Almeida <andre...@collabora.com> wrote:
Hi Benoit,

Às 08:52 de 03/05/21, Benoit Lize escreveu:

> Hello, and thanks for reaching out!
>
> I had seen the futex2 proposal (from Hacker News), and was curious about
> it :)
> Looking at Chrome's code (using cross-references in code search
> WaitableEvent::WaitMulitple() is not really used widely outside of
> tests, with the only use case I can see that looks like it could be hot
> in mojo. Perhaps +Ken Rockot <mailto:roc...@chromium.org> would know
> whether it's a hot code path. So while the proposed system calls seems
> to closely match WaitableEvent::WaitMultiple(), I'm not sure how much
> that would change things for Chrome (outside of simpler code, as you
> mention).
>

Thanks for the clarification :) I didn't find any sync primitive
implementation that would benefit from "wait on multiple" apart from
"WaitableEvent::WaitMultiple()", but let me know if you find anything
and I can do some research.


> Are there other features of your patch set that could help with locking
> performance? Chrome has a large number of locks, in various places, with
> a pretty wide range of coding patterns.
>

Here are the new features in the new interface:
   a) Wait on multiple

   b) Different sizes of futex: 8, 16, 32 and 64 bits (the current
interface only supports 32 bit sized futexes).

   c) NUMA awareness (currently, all kernel data about futex is stored
in a single node).

I'm not sure if they would help Chromium:

   b) Using 8/16 bit futexes could reduce some memory utilization for
Chromium. Also, for some cases is useful to wait on a memory address,
and that's why we support 64 bits.

In most cases, we use pthread_mutex_t under the hood, which is much larger anyway, so unless we move to futex() directly from pthread_mutex_t, this would not help us indeed. The fact that pthread mutexes are large does not seem to be an issue in Chrome though, at least it hasn't been shown as an issue. So smaller futexes would likely not help much.
 

   c) I don't think that NUMA awareness would help a desktop app, where
usually users are running it in an UMA machine.

André Almeida

unread,
May 4, 2021, 4:57:47 PM5/4/21
to Chromium-dev, g...@chromium.org, li...@chromium.org, jo...@joelfernandes.org
Hi there,

I'm the author of futex2[0], a WIP new set of Linux's syscalls that allows userspace to write efficient sync mechanisms. I believe that Chromium's WaitableEvent could benefit from it, so I would like to hear what you think about it

* What's futex?

futex stands for "fast userspace mutexes". It's a syscall that basically exposes a wait queue, where userspace can ask to put threads to sleep and to wake them. The index of such queue is an unsigned int address (that is usually called a futex).

It's value doesn't mean anything for the kernel, userspace is responsible to defining the rules of it and implementing the logic. Usually there's a value for when the lock is free, and another one for when the lock is hold, for instance. Also, it doesn't require to call a syscall on the uncontested path.

* Can you point some examples of what use futex?

glibc uses it to implement pthreads, Wine uses it to emulate WinAPI's sync primitives. Chromium uses for partition allocator's spinlock[1], and some of it's dependencies like abseill-cpp and tcmalloc uses futex as well.

* Why futex2?

futex has long standing issues that seems impossible to be solved in the current interface, so we are designing a new set of syscalls taking those problems in account (more on that at [0]). It's not merged yet, and I'm researching some uses cases to validate and get feedback about the API.

One of those features is the ability to wait on multiple futexes at the same time (called futex_waitv()). The main use case for that is to properly map WinAPI's WaitForMultipleObjects()[2] to a Linux alternative in Wine.

* Wait for multiple futexes at Chromium

Chromium's WaitableEvent API is implemented in different ways, correspondingly to it's target platform. When compiling for Windows[3], WaitableEvent::WaitMany() calls WaitForMultipleObjects().

At Linux[4], it seems to enqueue a "global" SyncWaiter object to each one of the events it's waiting for, and wait for it. When it's time to a event to signal, it will signal both it's own SyncWaiter object and the global one. Please, correct me if I'm wrong!

I'm not sure what are the performance implications of this design, but sharing a global object with multiple events could led to some problems if they all signal at same time and the access to the global object needs to be serialized, or if the same event is used for several WaitAny()s at the same time, meaning that it would have to signal a lot of global objects.

My assumption is that futex_waitv() could be used here, in the same design used for the Windows. No need to create and enqueue a new global waiter SyncWaiter, just wait on all those objects at the same time. I speculate that this could improve the code performance, or at least make the code simpler, so I'm here to hear the opinion of Chromium developers about it. Please, let me know your thoughts on the subject!

I even wrote a simple draft showing how it would look like to use futexes at WaitableEvent class: https://gitlab.collabora.com/-/snippets/101

Thanks,
    André

Additional resources:
- futex2 talk at OpenSource Summit: https://www.youtube.com/watch?v=eG6GMBTcPQ8
[0] https://lore.kernel.org/lkml/20210427231248.22...@collabora.com/
[1] https://github.com/chromium/chromium/blob/master/base/allocator/partition_allocator/spinning_mutex.cc
[2] https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitformultipleobjects
[3] https://github.com/chromium/chromium/blob/master/base/synchronization/waitable_event_win.cc
[4] https://github.com/chromium/chromium/blob/master/base/synchronization/waitable_event_posix.cc
Reply all
Reply to author
Forward
0 new messages