[RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9)

Mathieu Desnoyers

Feb 12, 2010, 5:50:02 PM

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads of the current process. It can be used
to distribute the cost of user-space memory barriers asymmetrically by
transforming pairs of memory barriers into pairs consisting of sys_membarrier()
and a compiler barrier. For synchronization primitives that distinguish between
read-side and write-side (e.g. userspace RCU, rwlocks), the read-side can be
accelerated significantly by moving the bulk of the memory barrier overhead to
the write-side.

The first user of this system call is the "liburcu" Userspace RCU implementation
found at http://lttng.org/urcu. It aims at greatly simplifying and enhancing the
current implementation, which uses a scheme similar to sys_membarrier(), but
based on signals sent to each reader thread.

Editorial question:

This synchronization only takes care of threads using the current process memory
map. It should not be used to synchronize accesses performed on memory maps
shared between different processes. Is that a limitation we can live with?

Changes since v8:
- Go back to rq spin locks taken by sys_membarrier() rather than adding memory
barriers to the scheduler. This implies a potential RoS (reduction of service)
if sys_membarrier() is executed in a busy-loop by a user, but nothing more
than what is already possible with other existing system calls, and it saves
memory barriers in the scheduler fast path.
- re-add the memory barrier comments to x86 switch_mm() as an example to other
architectures.
- Update documentation of the memory barriers in sys_membarrier and switch_mm().
- Append execution scenarios to the changelog showing the purpose of each memory
barrier.

Changes since v7:
- Move spinlock-mb and scheduler related changes to separate patches.
- Add support for sys_membarrier on x86_32.
- Only x86 32/64 system calls are reserved in this patch. It is planned to
incrementally reserve syscall IDs on other architectures as these are tested.

Changes since v6:
- Remove some unlikely() annotations that were not so unlikely.
- Add the proper scheduler memory barriers needed to only use the RCU read lock
  in sys_membarrier rather than take each runqueue spinlock:
  - Move memory barriers from per-architecture switch_mm() to schedule() and
    finish_lock_switch(), where they clearly document that all data protected by
    the rq lock is guaranteed to have memory barriers issued between the
    scheduler update and the task execution. Replacing the spin lock
    acquire/release barriers with these memory barriers implies either no
    overhead (the x86 spinlock atomic instruction already implies a full mb) or
    some hopefully small overhead caused by the upgrade of the spinlock
    acquire/release barriers to the more heavyweight smp_mb().
  - The "generic" version of spinlock-mb.h declares both a mapping to standard
    spinlocks and full memory barriers. Each architecture can specialize this
    header following its own needs and declare CONFIG_HAVE_SPINLOCK_MB to use
    its own spinlock-mb.h.
  - Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
    implementations on a wide range of architectures would be welcome.

Changes since v5:
- Plan ahead for extensibility by introducing mandatory/optional masks to the
"flags" system call parameter. Past experience with accept4(), signalfd4(),
eventfd2(), epoll_create1(), dup3(), pipe2(), and inotify_init1() indicates
that this is the kind of thing we want to plan for. Return -EINVAL if the
mandatory flags received are unknown.
- Create include/linux/membarrier.h to define these flags.
- Add MEMBARRIER_QUERY optional flag.
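
As a rough illustration of the mandatory/optional split described in the entry
above, the sketch below mirrors the flag check in plain user-space C. The flag
values are copied from the include/linux/membarrier.h added by this patch; the
function name membarrier_check_flags() is hypothetical and only models the
validation step, not the system call itself.

```c
#include <assert.h>
#include <errno.h>

/* Values as defined in the include/linux/membarrier.h added by this patch. */
#define MEMBARRIER_MANDATORY_MASK	0x0000FFFF	/* low 16 bits */
#define MEMBARRIER_OPTIONAL_MASK	0xFFFF0000	/* high 16 bits */
#define MEMBARRIER_EXPEDITED		(1 << 0)
#define MEMBARRIER_DELAYED		(1 << 1)
#define MEMBARRIER_QUERY		(1 << 16)

/*
 * Hypothetical user-space model of the validation done at the top of
 * sys_membarrier(): exactly one of the two mandatory flags must be set,
 * and any unknown mandatory bit is rejected with -EINVAL. Bits in the
 * optional (high) half are ignored by this check.
 */
static int membarrier_check_flags(unsigned int flags)
{
	switch (flags & MEMBARRIER_MANDATORY_MASK) {
	case MEMBARRIER_EXPEDITED:
	case MEMBARRIER_DELAYED:
		return 0;
	default:
		return -EINVAL;
	}
}
```

With this model, MEMBARRIER_QUERY on its own is rejected (no mandatory flag is
set), while MEMBARRIER_DELAYED | MEMBARRIER_QUERY is accepted.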

Changes since v4:
- Add "int expedited" parameter, use synchronize_sched() in the non-expedited
case. Thanks to Lai Jiangshan for making us consider seriously using
synchronize_sched() to provide the low-overhead membarrier scheme.
- Check num_online_cpus() == 1, return quickly without doing anything.

Changes since v3a:
- Confirm that each CPU indeed runs the current task's ->mm before sending an
IPI. Ensures that we do not disturb RT tasks in the presence of lazy TLB
shootdown.
- Document memory barriers needed in switch_mm().
- Surround helper functions with #ifdef CONFIG_SMP.

Changes since v2:
- Simply send-to-many to the mm_cpumask. It contains the list of processors we
have to IPI (those which use the mm), and this mask is updated atomically.

Changes since v1:
- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptive IPI scheme (single vs many IPI with threshold).
- Issue smp_mb() at the beginning and end of the system call.

To explain the benefit of this scheme, let's introduce two example threads:

Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses with
respect to smp_mb() present in Thread B, we can change each smp_mb() within
Thread A into calls to sys_membarrier() and each smp_mb() within
Thread B into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pair:

Thread A                    Thread B
previous mem accesses       previous mem accesses
smp_mb()                    smp_mb()
following mem accesses      following mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses
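
To make the transformation more concrete, here is a minimal user-space sketch
of the pattern, under stated assumptions. All names (reader_section(),
writer_update(), stand_in_membarrier()) are hypothetical. The heavy barrier is
passed as a function pointer so the sketch builds without this patch; the
stand-in only issues a full barrier on the calling thread, so it does NOT
provide the cross-thread guarantee by itself. A real user would plug in a
wrapper around sys_membarrier() there. The barrier() macro uses GCC/Clang
inline asm.

```c
#include <assert.h>

/* Compiler barrier: constrains compiler reordering only, emits no instruction. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Heavy barrier used by the write side; returns < 0 on failure. */
typedef int (*heavy_barrier_fn)(void);

static volatile int reader_active;	/* written by the reader (Thread B) */
static volatile int shared_data;	/* written by the writer (Thread A) */

/*
 * Stand-in for a sys_membarrier() wrapper. It only orders the caller's own
 * accesses; it is here so the sketch compiles and runs anywhere.
 */
static int stand_in_membarrier(void)
{
	__sync_synchronize();
	return 0;
}

/* Thread B (frequent): each former smp_mb() becomes barrier(). */
static int reader_section(void)
{
	int value;

	reader_active = 1;
	barrier();		/* was smp_mb() */
	value = shared_data;
	barrier();		/* was smp_mb() */
	reader_active = 0;
	return value;
}

/* Thread A (non-frequent): each former smp_mb() becomes the heavy barrier. */
static int writer_update(int value, heavy_barrier_fn heavy)
{
	shared_data = value;
	if (heavy() < 0)	/* was smp_mb() */
		return -1;
	return reader_active;	/* 1 if a reader may still be in its section */
}
```

The store-then-load shape on both sides is the case where a full barrier pair
is normally required, which is why the reader can only drop its smp_mb() if the
writer's barrier is executed on behalf of all running threads.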

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() by the IPIs executing memory barriers on each active
thread of the process. Non-running threads of the process are
intrinsically serialized by the scheduler.

* Benchmarks

For an Intel Xeon E5405
(one thread is calling sys_membarrier, the other T threads are busy looping)

* expedited

10,000,000 sys_membarrier calls:

T=1: 0m20.173s
T=2: 0m20.506s
T=3: 0m22.632s
T=4: 0m24.759s
T=5: 0m26.633s
T=6: 0m29.654s
T=7: 0m30.669s

----> About 2-3 microseconds/call.

* non-expedited

1000 sys_membarrier calls:

T=1-7: 0m16.002s

----> About 16 milliseconds/call. (~5000-8000 times slower than expedited)
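
For readers who want to check the per-call figures, the arithmetic can be
sketched as below. The helper names are hypothetical; the inputs are the
timings listed above. The expedited case works out to roughly 2.0 us/call
(T=1) up to roughly 3.1 us/call (T=7), the non-expedited case to roughly
16 ms/call, which gives a ratio between about 5200 and 7900.

```c
#include <assert.h>

/* Average latency of one call, in microseconds. */
static double per_call_us(double total_seconds, double calls)
{
	return total_seconds * 1e6 / calls;
}

/* How many times slower the slow path is compared to the fast path. */
static double slowdown(double slow_us, double fast_us)
{
	return slow_us / fast_us;
}
```

For example, per_call_us(20.173, 1e7) covers the T=1 expedited run and
per_call_us(16.002, 1000) the non-expedited run.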

* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the
write-side are turned into an invocation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every thread of
the process (as we currently do), we diminish the number of unnecessary
wake ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

(what we previously had)
memory barriers in reader: 973494744 reads, 892368 writes
signal-based scheme: 6289946025 reads, 1251 writes

(what we have now, with dynamic sys_membarrier check, expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 4316818891 reads, 503790 writes

(dynamic sys_membarrier check, non-expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 8698725501 reads, 313 writes

So the dynamic sys_membarrier availability check adds some overhead to the
read-side, but besides that, with the expedited scheme, we can see that we are
close to the read-side performance of the signal-based scheme and also close
(5/8) to the performance of the memory-barrier write-side. We have a write-side
speedup of 400:1 over the signal-based scheme by using the sys_membarrier system
call. This allows a 4.5:1 read-side speedup over the memory barrier scheme.
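
The "dynamic sys_membarrier availability check" mentioned above could look
roughly like the sketch below. This is not the liburcu code: the names
(init_membarrier_check(), read_side_barrier(), has_sys_membarrier) are
hypothetical, the query is passed in as a function pointer standing for a call
with MEMBARRIER_QUERY set, and the fallback to a full barrier on the read side
is an assumption. It shows where the extra read-side cost comes from: one test
of a cached flag per barrier.

```c
#include <assert.h>
#include <errno.h>

/* Compiler barrier (GCC/Clang inline asm). */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Result of the one-time query, cached at library load. */
static int has_sys_membarrier;

/* Stands for "call sys_membarrier() with MEMBARRIER_QUERY set". */
typedef int (*membarrier_query_fn)(void);

/* Run once: any return value >= 0 means the flags are supported. */
static void init_membarrier_check(membarrier_query_fn query)
{
	has_sys_membarrier = (query() >= 0);
}

/*
 * Read-side barrier with the dynamic check. Returns 0 when only a compiler
 * barrier was needed, 1 when the full-barrier fallback was taken.
 */
static int read_side_barrier(void)
{
	if (has_sys_membarrier) {
		barrier();
		return 0;
	}
	__sync_synchronize();
	return 1;
}

/* Two fake query results, for illustration only. */
static int query_unsupported(void) { return -ENOSYS; }
static int query_supported(void) { return 0; }
```

Because the system call always returns the same value for a given set of flags
on a given kernel, caching the result once is enough.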

The non-expedited scheme indeed adds much lower overhead on the read-side,
both because we do not send IPIs and because we perform fewer updates, which in
turn generates fewer cache-line exchanges. The write-side latency becomes even
higher than with the signal-based scheme. The advantage of the non-expedited
sys_membarrier() scheme over the signal-based scheme is that it does not need
to wake up all the process threads.

* More information about memory barriers in:

- sys_membarrier()
- membarrier_ipi()
- switch_mm()
- issued with ->mm update while the rq lock is held

The goal of these memory barriers is to ensure that all memory accesses to
user-space addresses performed by every processor which executes threads
belonging to the current process are observed to be in program order at least
once between the two memory barriers surrounding sys_membarrier().

If we were to simply broadcast an IPI to all processors between the two smp_mb()
in sys_membarrier(), membarrier_ipi() would execute on each processor, and
waiting for these handlers to complete execution guarantees that each running
processor passed through a state where user-space memory address accesses were
in program order.

However, this "big hammer" approach does not please people concerned with
real-time behavior. It would let a non-RT task disturb real-time tasks by
sending useless IPIs to processors not concerned by the memory of the current
process.

This is why we iterate on the mm_cpumask, which is a superset of the processors
concerned by the process memory map, and check each processor's ->mm with the
rq lock held to confirm that the processor is indeed running a thread concerned
with our mm (and not just part of the mm_cpumask due to lazy TLB shootdown).

The barriers added in switch_mm() have one objective: user-space memory address
accesses must be in program order when mm_cpumask is set or cleared. (more
details in the x86 switch_mm() comments).

For each CPU in the mm_cpumask, the check that the ->mm of the task currently
running on that CPU's runqueue matches the current ->mm needs to be done with
the rq lock held. This ensures that each time a rq ->mm is modified, a memory
barrier (typically implied by the change of memory mapping) is also issued. The
->mm update and its memory barrier are made atomic by the rq spinlock.
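
The filtering step described above can be modelled in user space as follows.
This is only a sketch of the selection logic: the CPU mask is a plain unsigned
long, each mm is represented by an integer id, and the runqueue locking is
reduced to a comment. The names select_ipi_targets(), NR_CPUS_MODEL and
example_rq_mm are hypothetical.

```c
#include <assert.h>

#define NR_CPUS_MODEL 8	/* size of the toy system, not a kernel constant */

/* Example data: id of the mm currently running on each CPU. */
static const int example_rq_mm[NR_CPUS_MODEL] = { 7, 7, 3, 7, 0, 0, 0, 0 };

/*
 * Start from the mm_cpumask (a superset of the CPUs running our mm), drop
 * the calling CPU, then drop every CPU whose current task uses another mm.
 * The remaining bits are the CPUs that would receive the IPI.
 */
static unsigned long select_ipi_targets(unsigned long mm_cpumask,
					const int *rq_curr_mm,
					int self_cpu, int my_mm)
{
	unsigned long targets = mm_cpumask & ~(1UL << self_cpu);
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		if (!(targets & (1UL << cpu)))
			continue;
		/* In the kernel, this read is done with the rq lock held. */
		if (rq_curr_mm[cpu] != my_mm)
			targets &= ~(1UL << cpu);
	}
	return targets;
}
```

With the example data, a caller on CPU 0 using mm 7 and an mm_cpumask covering
CPUs 0-3 ends up targeting CPUs 1 and 3 only: CPU 2 is in the mask but runs
another mm, as would happen with lazy TLB shootdown.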

The execution scenario (1) shows the behavior of the sys_membarrier() system
call executed on Thread A while Thread B executes memory accesses that need to
be ordered. Thread B is running. Memory accesses in Thread B are in program
order (e.g. separated by a compiler barrier()).

1) Thread B running, ordering ensured by the membarrier_ipi():

Thread A                                    Thread B
-------------------------------------------------------------------------
prev accesses to userspace addr.            prev accesses to userspace addr.
sys_membarrier
  smp_mb
    IPI  ------------------------------>    membarrier_ipi()
                                              smp_mb
                                              return
  smp_mb
following accesses to userspace addr.       following accesses to userspace addr.

The execution scenarios (2-3-4-5) show the same setup as (1), but Thread B is
not running while sys_membarrier() is called. Thanks to the memory barriers
added to switch_mm(), Thread B user-space address memory accesses are already in
program order when sys_membarrier finds out that either the mm_cpumask does not
contain Thread B's CPU or that this CPU is not running the current process
mm.

2) Context switch in, showing rq spin lock synchronization:

Thread A                                    Thread B
-------------------------------------------------------------------------
                                            <prev accesses to userspace addr. saved
                                             on stack>
prev accesses to userspace addr.
sys_membarrier
  smp_mb
    for each cpu in mm_cpumask
      <Thread B CPU is present e.g. due
       to lazy TLB shootdown>
      spin lock cpu rq
      mm = cpu rq mm
      spin unlock cpu rq
                                            context switch in
                                            <spin lock cpu rq by other thread>
                                            load_cr3 (or equiv. mem. barrier)
                                            spin unlock cpu rq
                                            following accesses to userspace addr.
      if (mm == current rq mm)
        <false>
  smp_mb
following accesses to userspace addr.

Here, the important point is that Thread B has passed through a point where all
its userspace memory address accesses were in program order between the two
smp_mb() in sys_membarrier.

3) Context switch out, showing rq spin lock synchronization:

Thread A                                    Thread B
-------------------------------------------------------------------------
                                            prev accesses to userspace addr.
prev accesses to userspace addr.
sys_membarrier
  smp_mb
    for each cpu in mm_cpumask
                                            context switch out
                                            spin lock cpu rq
                                            load_cr3 (or equiv. mem. barrier)
                                            <spin unlock cpu rq by other thread>
                                            <following accesses to userspace addr.
                                             will happen when rescheduled>
      spin lock cpu rq
      mm = cpu rq mm
      spin unlock cpu rq
      if (mm == current rq mm)
        <false>
  smp_mb
following accesses to userspace addr.

Same as (2): the important point is that Thread B has passed through a point
where all its userspace memory address accesses were in program order between
the two smp_mb() in sys_membarrier.

4) Context switch in, showing mm_cpumask synchronization:

Thread A                                    Thread B
-------------------------------------------------------------------------
                                            <prev accesses to userspace addr. saved
                                             on stack>
prev accesses to userspace addr.
sys_membarrier
  smp_mb
    for each cpu in mm_cpumask
      <Thread B CPU not in mask>
                                            context switch in
                                            set cpu bit in mm_cpumask
                                            load_cr3 (or equiv. mem. barrier)
                                            following accesses to userspace addr.
  smp_mb
following accesses to userspace addr.

Same as 2-3: Thread B is passing through a point where userspace memory address
accesses are in program order between the two smp_mb() in sys_membarrier().

5) Context switch out, showing mm_cpumask synchronization:

Thread A                                    Thread B
-------------------------------------------------------------------------
                                            prev accesses to userspace addr.
prev accesses to userspace addr.
sys_membarrier
  smp_mb
                                            context switch out
                                            smp_mb__before_clear_bit
                                            clear cpu bit in mm_cpumask
                                            <following accesses to userspace addr.
                                             will happen when rescheduled>
    for each cpu in mm_cpumask
      <Thread B CPU not in mask>
  smp_mb
following accesses to userspace addr.

Same as 2-3-4: Thread B is passing through a point where userspace memory
address accesses are in program order between the two smp_mb() in
sys_membarrier().

This patch only adds the system calls to x86 32/64. See the sys_membarrier()
comments for the memory barrier requirements in switch_mm() when porting to
other architectures.

Signed-off-by: Mathieu Desnoyers <mathieu....@efficios.com>
Acked-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
Acked-by: Steven Rostedt <ros...@goodmis.org>
CC: "Paul E. McKenney" <pau...@linux.vnet.ibm.com>
CC: Nicholas Miell <nmi...@comcast.net>
CC: Linus Torvalds <torv...@linux-foundation.org>
CC: mi...@elte.hu
CC: la...@cn.fujitsu.com
CC: dipa...@in.ibm.com
CC: ak...@linux-foundation.org
CC: jo...@joshtriplett.org
CC: dvh...@us.ibm.com
CC: n...@us.ibm.com
CC: tg...@linutronix.de
CC: pet...@infradead.org
CC: Valdis.K...@vt.edu
CC: dhow...@redhat.com
---
arch/x86/ia32/ia32entry.S | 1
arch/x86/include/asm/mmu_context.h | 28 +++++
arch/x86/include/asm/unistd_32.h | 3
arch/x86/include/asm/unistd_64.h | 2
arch/x86/kernel/syscall_table_32.S | 1
include/linux/Kbuild | 1
include/linux/membarrier.h | 47 +++++++++
kernel/sched.c | 189 +++++++++++++++++++++++++++++++++++++
8 files changed, 269 insertions(+), 3 deletions(-)

Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
===================================================================
--- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h 2010-02-12 14:00:43.000000000 -0500
+++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h 2010-02-12 14:21:04.000000000 -0500
@@ -663,6 +663,8 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)
#define __NR_recvmmsg 299
__SYSCALL(__NR_recvmmsg, sys_recvmmsg)
+#define __NR_membarrier 300
+__SYSCALL(__NR_membarrier, sys_membarrier)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c 2010-02-12 14:00:43.000000000 -0500
+++ linux-2.6-lttng/kernel/sched.c 2010-02-12 16:27:29.000000000 -0500
@@ -71,6 +71,7 @@
#include <linux/debugfs.h>
#include <linux/ctype.h>
#include <linux/ftrace.h>
+#include <linux/membarrier.h>

#include <asm/tlb.h>
#include <asm/irq_regs.h>
@@ -10929,6 +10930,194 @@ struct cgroup_subsys cpuacct_subsys = {
};
#endif /* CONFIG_CGROUP_CPUACCT */

+#ifdef CONFIG_SMP
+
+/*
+ * Execute a memory barrier on all active threads from the current process
+ * on SMP systems. Do not rely on implicit barriers in IPI handler execution,
+ * because batched IPI lists are synchronized with spinlocks rather than full
+ * memory barriers. This is not the bulk of the overhead anyway, so let's stay
+ * on the safe side.
+ */
+static void membarrier_ipi(void *unused)
+{
+	smp_mb();
+}
+
+/*
+ * Handle out-of-mem by sending per-cpu IPIs instead.
+ */
+static void membarrier_retry(void)
+{
+	struct mm_struct *mm;
+	int cpu;
+
+	for_each_cpu(cpu, mm_cpumask(current->mm)) {
+		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+		mm = cpu_curr(cpu)->mm;
+		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+		if (current->mm == mm)
+			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
+	}
+}
+
+#endif /* #ifdef CONFIG_SMP */
+
+/*
+ * sys_membarrier - issue memory barrier on current process running threads
+ * @flags: One of these must be set:
+ * MEMBARRIER_EXPEDITED
+ * Adds some overhead, fast execution (few microseconds)
+ * MEMBARRIER_DELAYED
+ * Low overhead, but slow execution (few milliseconds)
+ *
+ * MEMBARRIER_QUERY
+ * This optional flag can be set to query if the kernel supports
+ * a set of flags.
+ *
+ * return values: Returns -EINVAL if the flags are incorrect. Testing for kernel
+ * sys_membarrier support can be done by checking for -ENOSYS return value.
+ * Return values >= 0 indicate success. For a given set of flags on a given
+ * kernel, this system call will always return the same value. It is therefore
+ * correct to check the return value only once at library load, passing the
+ * MEMBARRIER_QUERY flag in addition to only check if the flags are supported,
+ * without performing any synchronization.
+ *
+ * This system call executes a memory barrier on all running threads of the
+ * current process. Upon completion, the caller thread is ensured that all
+ * process threads have passed through a state where all memory accesses to
+ * user-space addresses match program order. (non-running threads are de facto
+ * in such a state)
+ *
+ * Using the non-expedited mode is recommended for applications which can
+ * afford leaving the caller thread waiting for a few milliseconds. A good
+ * example would be a thread dedicated to execute RCU callbacks, which waits
+ * for callbacks to enqueue most of the time anyway.
+ *
+ * The expedited mode is recommended whenever the application needs to have
+ * control returning to the caller thread as quickly as possible. An example
+ * of such application would be one which uses the same thread to perform
+ * data structure updates and issue the RCU synchronization.
+ *
+ * It is perfectly safe to call both expedited and non-expedited
+ * sys_membarrier() in a process.
+ *
+ * mm_cpumask is used as an approximation of the processors which run threads
+ * belonging to the current process. It is a superset of the cpumask to which we
+ * must send IPIs, mainly due to lazy TLB shootdown. Therefore, for each CPU in
+ * the mm_cpumask, we check each runqueue with the rq lock held to make sure our
+ * ->mm is indeed running on them. The rq lock ensures that a memory barrier is
+ * issued each time the rq current task is changed. This reduces the risk of
+ * disturbing a RT task by sending unnecessary IPIs. There is still a slight
+ * chance to disturb an unrelated task, because we do not lock the runqueues
+ * while sending IPIs, but the real-time effect of this heavy locking would be
+ * worse than the comparatively small disruption of an IPI.
+ *
+ * RED PEN: before assigning a system call number for sys_membarrier() to an
+ * architecture, we must ensure that switch_mm issues full memory barriers
+ * (or a synchronizing instruction having the same effect) between:
+ * - memory accesses to user-space addresses and clear mm_cpumask.
+ * - set mm_cpumask and memory accesses to user-space addresses.
+ *
+ * The reason why these memory barriers are required is that mm_cpumask updates,
+ * as well as iteration on the mm_cpumask, offer no ordering guarantees.
+ * These added memory barriers ensure that any thread modifying the mm_cpumask
+ * is in a state where all memory accesses to user-space addresses are
+ * guaranteed to be in program order.
+ *
+ * In some cases adding a comment to this effect will suffice, in others we
+ * will need to add smp_mb__before_clear_bit()/smp_mb__after_clear_bit() or
+ * simply smp_mb(). These barriers are required to ensure we do not _miss_ a
+ * CPU that needs to receive an IPI, which would be a bug.
+ *
+ * On uniprocessor systems, this system call simply returns 0 without doing
+ * anything, so user-space knows it is implemented.
+ *
+ * The flags argument has room for extensibility, with 16 lower bits holding
+ * mandatory flags for which older kernels will fail if they encounter an
+ * unknown flag. The high 16 bits are used for optional flags, which older
+ * kernels don't have to care about.
+ *
+ * This synchronization only takes care of threads using the current process
+ * memory map. It should not be used to synchronize accesses performed on memory
+ * maps shared between different processes.
+ */
+SYSCALL_DEFINE1(membarrier, unsigned int, flags)
+{
+#ifdef CONFIG_SMP
+	struct mm_struct *mm;
+	cpumask_var_t tmpmask;
+	int cpu;
+
+	/*
+	 * Expect _only_ one of expedited or delayed flags.
+	 * Don't care about optional mask for now.
+	 */
+	switch (flags & MEMBARRIER_MANDATORY_MASK) {
+	case MEMBARRIER_EXPEDITED:
+	case MEMBARRIER_DELAYED:
+		break;
+	default:
+		return -EINVAL;
+	}
+	if (unlikely(flags & MEMBARRIER_QUERY
+		     || thread_group_empty(current))
+		     || num_online_cpus() == 1)
+		return 0;
+	if (flags & MEMBARRIER_DELAYED) {
+		synchronize_sched();
+		return 0;
+	}
+	/*
+	 * Memory barrier on the caller thread between previous memory accesses
+	 * to user-space addresses and sending memory-barrier IPIs. Orders all
+	 * user-space address memory accesses prior to sys_membarrier() before
+	 * mm_cpumask read and membarrier_ipi executions. This barrier is paired
+	 * with memory barriers in:
+	 * - membarrier_ipi() (for each running thread of the current process)
+	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
+	 *                accesses to user-space addresses)
+	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
+	 *   A memory barrier is issued each time ->mm is changed while the rq
+	 *   lock is held.
+	 */
+	smp_mb();
+	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
+		membarrier_retry();
+		goto out;
+	}
+	cpumask_copy(tmpmask, mm_cpumask(current->mm));
+	preempt_disable();
+	cpumask_clear_cpu(smp_processor_id(), tmpmask);
+	for_each_cpu(cpu, tmpmask) {
+		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+		mm = cpu_curr(cpu)->mm;
+		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+		if (current->mm != mm)
+			cpumask_clear_cpu(cpu, tmpmask);
+	}
+	smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
+	preempt_enable();
+	free_cpumask_var(tmpmask);
+out:
+	/*
+	 * Memory barrier on the caller thread between sending&waiting for
+	 * memory-barrier IPIs and following memory accesses to user-space
+	 * addresses. Orders mm_cpumask read and membarrier_ipi executions
+	 * before all user-space address memory accesses following
+	 * sys_membarrier(). This barrier is paired with memory barriers in:
+	 * - membarrier_ipi() (for each running thread of the current process)
+	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
+	 *                accesses to user-space addresses)
+	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
+	 *   A memory barrier is issued each time ->mm is changed while the rq
+	 *   lock is held.
+	 */
+	smp_mb();
+#endif /* #ifdef CONFIG_SMP */
+	return 0;
+}
+
#ifndef CONFIG_SMP

int rcu_expedited_torture_stats(char *page)
Index: linux-2.6-lttng/include/linux/membarrier.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/membarrier.h 2010-02-12 16:27:32.000000000 -0500
@@ -0,0 +1,47 @@
+#ifndef _LINUX_MEMBARRIER_H
+#define _LINUX_MEMBARRIER_H
+
+/* First argument to membarrier syscall */
+
+/*
+ * Mandatory flags to the membarrier system call that the kernel must
+ * understand are in the low 16 bits.
+ */
+#define MEMBARRIER_MANDATORY_MASK 0x0000FFFF /* Mandatory flags */
+
+/*
+ * Optional hints that the kernel can ignore are in the high 16 bits.
+ */
+#define MEMBARRIER_OPTIONAL_MASK 0xFFFF0000 /* Optional hints */
+
+/* Expedited: adds some overhead, fast execution (few microseconds) */
+#define MEMBARRIER_EXPEDITED (1 << 0)
+/* Delayed: Low overhead, but slow execution (few milliseconds) */
+#define MEMBARRIER_DELAYED (1 << 1)
+
+/* Query flag support, without performing synchronization */
+#define MEMBARRIER_QUERY (1 << 16)
+
+
+/*
+ * All memory accesses performed in program order from each process threads are
+ * guaranteed to be ordered with respect to sys_membarrier(). If we use the
+ * semantic "barrier()" to represent a compiler barrier forcing memory accesses
+ * to be performed in program order across the barrier, and smp_mb() to
+ * represent explicit memory barriers forcing full memory ordering across the
+ * barrier, we have the following ordering table for each pair of barrier(),
+ * sys_membarrier() and smp_mb() :
+ *
+ * The pair ordering is detailed as (O: ordered, X: not ordered):
+ *
+ *                     barrier()   smp_mb()   sys_membarrier()
+ * barrier()              X           X              O
+ * smp_mb()               X           O              O
+ * sys_membarrier()       O           O              O
+ *
+ * This synchronization only takes care of threads using the current process
+ * memory map. It should not be used to synchronize accesses performed on memory
+ * maps shared between different processes.
+ */
+
+#endif
Index: linux-2.6-lttng/include/linux/Kbuild
===================================================================
--- linux-2.6-lttng.orig/include/linux/Kbuild 2010-02-12 14:00:43.000000000 -0500
+++ linux-2.6-lttng/include/linux/Kbuild 2010-02-12 14:21:04.000000000 -0500
@@ -110,6 +110,7 @@ header-y += magic.h
header-y += major.h
header-y += map_to_7segment.h
header-y += matroxfb.h
+header-y += membarrier.h
header-y += meye.h
header-y += minix_fs.h
header-y += mmtimer.h
Index: linux-2.6-lttng/arch/x86/include/asm/unistd_32.h
===================================================================
--- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_32.h 2010-02-12 14:00:43.000000000 -0500
+++ linux-2.6-lttng/arch/x86/include/asm/unistd_32.h 2010-02-12 14:21:04.000000000 -0500
@@ -343,10 +343,11 @@
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_event_open 336
#define __NR_recvmmsg 337
+#define __NR_membarrier 338

#ifdef __KERNEL__

-#define NR_syscalls 338
+#define NR_syscalls 339

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
Index: linux-2.6-lttng/arch/x86/ia32/ia32entry.S
===================================================================
--- linux-2.6-lttng.orig/arch/x86/ia32/ia32entry.S 2010-02-12 14:00:43.000000000 -0500
+++ linux-2.6-lttng/arch/x86/ia32/ia32entry.S 2010-02-12 14:21:04.000000000 -0500
@@ -842,4 +842,5 @@ ia32_sys_call_table:
.quad compat_sys_rt_tgsigqueueinfo /* 335 */
.quad sys_perf_event_open
.quad compat_sys_recvmmsg
+ .quad sys_membarrier
ia32_syscall_end:
Index: linux-2.6-lttng/arch/x86/kernel/syscall_table_32.S
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/syscall_table_32.S 2010-02-12 14:00:43.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/syscall_table_32.S 2010-02-12 14:21:04.000000000 -0500
@@ -337,3 +337,4 @@ ENTRY(sys_call_table)
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_event_open
.long sys_recvmmsg
+ .long sys_membarrier
Index: linux-2.6-lttng/arch/x86/include/asm/mmu_context.h
===================================================================
--- linux-2.6-lttng.orig/arch/x86/include/asm/mmu_context.h 2010-02-12 14:00:43.000000000 -0500
+++ linux-2.6-lttng/arch/x86/include/asm/mmu_context.h 2010-02-12 15:26:11.000000000 -0500
@@ -36,6 +36,16 @@ static inline void switch_mm(struct mm_s
 	unsigned cpu = smp_processor_id();
 
 	if (likely(prev != next)) {
+		/*
+		 * smp_mb() between memory accesses to user-space addresses and
+		 * mm_cpumask clear is required by sys_membarrier(). This
+		 * ensures that all user-space address memory accesses are in
+		 * program order when the mm_cpumask is cleared.
+		 * smp_mb__before_clear_bit() turns into a barrier() on x86. It
+		 * is left here to document that this barrier is needed, as an
+		 * example for other architectures.
+		 */
+		smp_mb__before_clear_bit();
 		/* stop flush ipis for the previous mm */
 		cpumask_clear_cpu(cpu, mm_cpumask(prev));
#ifdef CONFIG_SMP
@@ -43,7 +53,13 @@ static inline void switch_mm(struct mm_s
 		percpu_write(cpu_tlbstate.active_mm, next);
 #endif
 		cpumask_set_cpu(cpu, mm_cpumask(next));
-
+		/*
+		 * smp_mb() between mm_cpumask set and memory accesses to
+		 * user-space addresses is required by sys_membarrier(). This
+		 * ensures that all user-space address memory accesses performed
+		 * by the current thread are in program order when the
+		 * mm_cpumask is set. Implied by load_cr3.
+		 */
 		/* Re-load page tables */
 		load_cr3(next->pgd);

@@ -59,9 +75,17 @@ static inline void switch_mm(struct mm_s
 		BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next);
 
 		if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
-			/* We were in lazy tlb mode and leave_mm disabled
+			/*
+			 * We were in lazy tlb mode and leave_mm disabled
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
+			 *
+			 * smp_mb() between mm_cpumask set and memory accesses
+			 * to user-space addresses is required by
+			 * sys_membarrier(). This ensures that all user-space
+			 * address memory accesses performed by the current
+			 * thread are in program order when the mm_cpumask is
+			 * set. Implied by load_cr3.
 			 */
 			load_cr3(next->pgd);
 			load_LDT_nolock(&next->context);

Paul E. McKenney

Feb 15, 2010, 3:00:03 PM

On Fri, Feb 12, 2010 at 05:46:06PM -0500, Mathieu Desnoyers wrote:
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads of the current process. It can be used
> to distribute the cost of user-space memory barriers asymmetrically by
> transforming pairs of memory barriers into pairs consisting of sys_membarrier()
> and a compiler barrier. For synchronization primitives that distinguish between
> read-side and write-side (e.g. userspace RCU, rwlocks), the read-side can be
> accelerated significantly by moving the bulk of the memory barrier overhead to
> the write-side.
>
> The first user of this system call is the "liburcu" Userspace RCU implementation
> found at http://lttng.org/urcu. It aims at greatly simplifying and enhancing the
> current implementation, which uses a scheme similar to the sys_membarrier(), but
> based on signals sent to each reader thread.
>
> Editorial question:
>
> This synchronization only takes care of threads using the current process memory
> map. It should not be used to synchronize accesses performed on memory maps
> shared between different processes. Is that a limitation we can live with ?

Acked-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>

KOSAKI Motohiro

Feb 15, 2010, 8:00:01 PM

> On Fri, Feb 12, 2010 at 05:46:06PM -0500, Mathieu Desnoyers wrote:
> > Here is an implementation of a new system call, sys_membarrier(), which
> > executes a memory barrier on all threads of the current process. It can be used
> > to distribute the cost of user-space memory barriers asymmetrically by
> > transforming pairs of memory barriers into pairs consisting of sys_membarrier()
> > and a compiler barrier. For synchronization primitives that distinguish between
> > read-side and write-side (e.g. userspace RCU, rwlocks), the read-side can be
> > accelerated significantly by moving the bulk of the memory barrier overhead to
> > the write-side.
> >
> > The first user of this system call is the "liburcu" Userspace RCU implementation
> > found at http://lttng.org/urcu. It aims at greatly simplifying and enhancing the
> > current implementation, which uses a scheme similar to the sys_membarrier(), but
> > based on signals sent to each reader thread.
> >
> > Editorial question:
> >
> > This synchronization only takes care of threads using the current process memory
> > map. It should not be used to synchronize accesses performed on memory maps
> > shared between different processes. Is that a limitation we can live with ?
>
> Acked-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>

Yes.

Personally, I think this patch's concept is clear and it can form the basis
of userland lockless programming.
If a userland programmer wants to use lockless techniques, hazard pointers are
one of the common ones, and this syscall helps with them.

Chris Friesen

Feb 22, 2010, 1:40:02 PM

On 02/12/2010 04:46 PM, Mathieu Desnoyers wrote:

> Editorial question:
>
> This synchronization only takes care of threads using the current process memory
> map. It should not be used to synchronize accesses performed on memory maps
> shared between different processes. Is that a limitation we can live with ?

It makes sense for an initial version. It would be unfortunate if this
were a permanent limitation, since using separate processes with
explicit shared memory is a useful way to mitigate memory trampler issues.

If we were going to allow that, it might make sense to add an address
range such that only those processes which have mapped that range would
execute the barrier. Come to think of it, it might be possible to use
this somehow to avoid having to execute the barrier on *all* threads
within a process.

Chris

Mathieu Desnoyers

Feb 22, 2010, 4:30:01 PM

* Chris Friesen (cfri...@nortel.com) wrote:
> On 02/12/2010 04:46 PM, Mathieu Desnoyers wrote:
>
> > Editorial question:
> >
> > This synchronization only takes care of threads using the current process memory
> > map. It should not be used to synchronize accesses performed on memory maps
> > shared between different processes. Is that a limitation we can live with ?
>
> It makes sense for an initial version. It would be unfortunate if this
> were a permanent limitation, since using separate processes with
> explicit shared memory is a useful way to mitigate memory trampler issues.
>
> If we were going to allow that, it might make sense to add an address
> range such that only those processes which have mapped that range would
> execute the barrier. Come to think of it, it might be possible to use
> this somehow to avoid having to execute the barrier on *all* threads
> within a process.

The extensible system call mandatory and optional flags will allow this kind of
improvement later on if this appears to be needed. It will also allow user-space
to detect if later kernels support these new features or not. But meanwhile I
think it's good to start with this implementation that covers 99.99% of
use-cases I can currently think of (ok, well, maybe I'm just unimaginative) ;)

Thanks,

Mathieu

Nick Piggin

Feb 24, 2010, 4:20:03 AM

On Mon, Feb 22, 2010 at 04:23:21PM -0500, Mathieu Desnoyers wrote:
> * Chris Friesen (cfri...@nortel.com) wrote:
> > On 02/12/2010 04:46 PM, Mathieu Desnoyers wrote:
> >
> > > Editorial question:
> > >
> > > This synchronization only takes care of threads using the current process memory
> > > map. It should not be used to synchronize accesses performed on memory maps
> > > shared between different processes. Is that a limitation we can live with ?
> >
> > It makes sense for an initial version. It would be unfortunate if this
> > were a permanent limitation, since using separate processes with
> > explicit shared memory is a useful way to mitigate memory trampler issues.
> >
> > If we were going to allow that, it might make sense to add an address
> > range such that only those processes which have mapped that range would
> > execute the barrier. Come to think of it, it might be possible to use
> > this somehow to avoid having to execute the barrier on *all* threads
> > within a process.
>
> The extensible system call mandatory and optional flags will allow this kind of
> improvement later on if this appears to be needed. It will also allow user-space
> to detect if later kernels support these new features or not. But meanwhile I
> think it's good to start with this implementation that covers 99.99% of
> use-cases I can currently think of (ok, well, maybe I'm just unimaginative) ;)

It's a good point, I think having at least the ability to do
process-shared or process-private in the first version of the API might
be a good idea. That matches glibc's synchronisation routines so it
would probably be a desirable feature even if you don't implement it in
your library initially.

When writing multiprocessor scalable software, threads should often be
avoided. They share so much state that it is easy to run into
scalability issues in the kernel. So yes it would be really nice to
have userspace RCU available in a process-shared mode.

Mathieu Desnoyers

Feb 24, 2010, 10:30:02 AM

I am tempted to say that we should probably wait for users of this API feature
to manifest themselves before we go on and implement it. This will ensure that
we don't end up maintaining an unused feature and this provides a minimum
testability. For now, returning -EINVAL seems like an appropriate response for
this system call feature.

As I said above, given the extensible nature of the sys_membarrier flags, we can
assign a MEMBARRIER_SHARED_MEM or something like that to a mandatory flag bit
later on. So when userspace starts using this flag on old kernels that do not
support it, -EINVAL will be returned, and then the application will know it must
use a fallback. So, basically, we don't even need to define this flag now.
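
A minimal sketch of the detection scheme described above, assuming a mock
"old kernel" that rejects unknown mandatory flag bits with -EINVAL. The flag
names, bit values and function names below are hypothetical and chosen only
for illustration; they are not taken from the patch.

```c
#include <errno.h>

/* Hypothetical flag bits, for illustration only. */
#define SKETCH_FLAG_EXPEDITED  (1u << 0) /* known to the mock old kernel */
#define SKETCH_FLAG_SHARED_MEM (1u << 1) /* stands in for a later MEMBARRIER_SHARED_MEM */

/* Mandatory flag bits the mock old kernel understands. */
#define SKETCH_KNOWN_FLAGS     SKETCH_FLAG_EXPEDITED

enum sketch_mode { SKETCH_USE_SYSCALL, SKETCH_USE_FALLBACK };

/* Mock of an old kernel's entry point: any unknown mandatory bit is an error. */
static int mock_old_kernel_membarrier(unsigned int flags)
{
	if (flags & ~SKETCH_KNOWN_FLAGS)
		return -EINVAL;
	return 0;
}

/* Userspace side: probe once and fall back if the kernel refuses the flags. */
static enum sketch_mode sketch_choose_mode(unsigned int wanted_flags)
{
	if (mock_old_kernel_membarrier(wanted_flags) == -EINVAL)
		return SKETCH_USE_FALLBACK;
	return SKETCH_USE_SYSCALL;
}
```

With these definitions, asking for SKETCH_FLAG_EXPEDITED alone selects the
system call, while adding SKETCH_FLAG_SHARED_MEM makes the application take
its fallback path.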

>
> When writing multiprocessor scalable software, threads should often be
> avoided. They share so much state that it is easy to run into
> scalability issues in the kernel. So yes it would be really nice to
> have userspace RCU available in a process-shared mode.
>

Agreed, although some major modifications would also be needed in the userspace
RCU library to do that, because it currently relies on being able to access other
threads' TLS.

Thanks,

Mathieu

Darren Hart

Feb 24, 2010, 12:40:02 PM

Nick Piggin wrote:

> When writing multiprocessor scalable software, threads should often be
> avoided. They share so much state that it is easy to run into
> scalability issues in the kernel. So yes it would be really nice to
> have userspace RCU available in a process-shared mode.

A bit off topic, but I'm interested in what you feel some of these
scalability issues are. Is it mostly bouncing this shared context from
one CPU to the next and the related cache effects, or is there something
more you are referring to?

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

Nick Piggin

Feb 25, 2010, 12:30:02 AM

On Wed, Feb 24, 2010 at 09:29:46AM -0800, Darren Hart wrote:
> Nick Piggin wrote:
>
> >When writing multiprocessor scalable software, threads should often be
> >avoided. They share so much state that it is easy to run into
> >scalability issues in the kernel. So yes it would be really nice to
> >have userspace RCU available in a process-shared mode.
>
> A bit off topic, but I'm interested in what you feel some of these
> scalability issues are. Is it mostly bouncing this shared context
> from one CPU to the next and the related cache effects, or is there
> something more you are referring to?

Just in general shared state is almost always going to be more costly in
SMP than non-shared.

From VM to files and fs state to signals and timers and process
accounting. And this also carries up to libc, and critical user code
like the heap allocator.

Linux is usually pretty good, a lot due to RCU, but there are still
contention points.

Andrew had investigated this a lot (in relation to samba) and had a good
talk on it, but the slides don't really do it justice.
http://www.samba.org/~tridge/talks/threads.pdf

Nick Piggin

Feb 25, 2010, 12:40:02 AM

It would be very trivial compared to the process-private case. Just IPI
all CPUs. It would allow older kernels to work with newer process based
apps as they get implemented. But... not a really big deal I suppose.

> As I said above, given the extensible nature of the sys_membarrier flags, we can
> assign a MEMBARRIER_SHARED_MEM or something like that to a mandatory flag bit
> later on. So when userspace starts using this flag on old kernels that do not
> support it, -EINVAL will be returned, and then the application will know it must
> use a fallback. So, basically, we don't even need to define this flag now.
>
> >
> > When writing multiprocessor scalable software, threads should often be
> > avoided. They share so much state that it is easy to run into
> > scalability issues in the kernel. So yes it would be really nice to
> > have userspace RCU available in a process-shared mode.
> >
>
> Agreed, although some major modifications would also be needed in the userspace
> RCU library to do that, because it currently relies on being able to access other
> threads' TLS.

OK. It would be a good feature to keep in mind, I believe.

Mathieu Desnoyers

Feb 25, 2010, 12:00:03 PM

This is actually what I did in v1 of the patch, but this implementation met
resistance from the RT people, who were concerned about the impact on RT tasks
of a lower priority process doing lots of sys_membarrier() calls. So if we want
to do other-process-aware sys_membarrier(), we would have to iterate on all
cpus, for every running process shared memory maps and see if there is something
shared with all shm of the current process. This is clearly not as trivial as
just broadcasting the IPI to all cpus.

>
>
> > As I said above, given the extensible nature of the sys_membarrier flags, we can
> > assign a MEMBARRIER_SHARED_MEM or something like that to a mandatory flag bit
> > later on. So when userspace starts using this flag on old kernels that do not
> > support it, -EINVAL will be returned, and then the application will know it must
> > use a fallback. So, basically, we don't even need to define this flag now.
> >
> > >
> > > When writing multiprocessor scalable software, threads should often be
> > > avoided. They share so much state that it is easy to run into
> > > scalability issues in the kernel. So yes it would be really nice to
> > > have userspace RCU available in a process-shared mode.
> > >
> >
> > Agreed, although some major modifications would also be needed in the userspace
> > RCU library to do that, because it currently relies on being able to access other
> > threads' TLS.
>
> OK. It would be a good feature to keep in mind, I believe.
>

Sure.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency Consultant
EfficiOS Inc.
http://www.efficios.com

Steven Rostedt

Feb 25, 2010, 12:30:02 PM

On Thu, 2010-02-25 at 11:53 -0500, Mathieu Desnoyers wrote:

> > It would be very trivial compared to the process-private case. Just IPI
> > all CPUs. It would allow older kernels to work with newer process based
> > apps as they get implemented. But... not a really big deal I suppose.
>
> This is actually what I did in v1 of the patch, but this implementation met
> resistance from the RT people, who were concerned about the impact on RT tasks
> of a lower priority process doing lots of sys_membarrier() calls. So if we want
> to do other-process-aware sys_membarrier(), we would have to iterate on all
> cpus, for every running process shared memory maps and see if there is something
> shared with all shm of the current process. This is clearly not as trivial as
> just broadcasting the IPI to all cpus.

Right, it may require another syscall or parameter to let the tasks
register a shared page. Then have some mechanism to find a way to
quickly check if a CPU is running a process with that page.

-- Steve

Mathieu Desnoyers

Feb 25, 2010, 1:00:02 PM

* Steven Rostedt (ros...@goodmis.org) wrote:
> On Thu, 2010-02-25 at 11:53 -0500, Mathieu Desnoyers wrote:
>
> > > It would be very trivial compared to the process-private case. Just IPI
> > > all CPUs. It would allow older kernels to work with newer process based
> > > apps as they get implemented. But... not a really big deal I suppose.
> >
> > This is actually what I did in v1 of the patch, but this implementation met
> > resistance from the RT people, who were concerned about the impact on RT tasks
> > of a lower priority process doing lots of sys_membarrier() calls. So if we want
> > to do other-process-aware sys_membarrier(), we would have to iterate on all
> > cpus, for every running process shared memory maps and see if there is something
> > shared with all shm of the current process. This is clearly not as trivial as
> > just broadcasting the IPI to all cpus.
>
> Right, it may require another syscall or parameter to let the tasks
> register a shared page. Then have some mechanism to find a way to
> quickly check if a CPU is running a process with that page.

Well, either we explicitly require the task to register its shared pages, which
could be error-prone in terms of API, or simply consider all pages that are
shared between the current process and every process running on other CPUs. That
would be much simpler to use from a user-level perspective I think. The
downside is that it may generate a few IPIs to processes that happen not to need
them, but we are talking of a relatively small overhead to processes that we are
interacting with anyway. It's not like we would be interrupting completely
unrelated RT threads. I'm just not sure if it would be valid to exclude COW and
RO shared pages from that check. For instance, if a page is mapped as RO in one
process and RW on another, then we have to synchronize these processes. Similar
weird cases could happen if a memory map is changed from RW to RO right after
the content is modified, and then we need to execute sys_membarrier: we might
miss a memory map that actually needs to be synchronized.

And yes, as you say, we'd have to find a way to quickly compare shared-memory
maps from two processes. The dumb approach, O(n^2), would be to compare these
entries element by element. Assuming a relatively low amount of shared mmaps,
this could make sense, otherwise we'd have to construct a lookup hash table to
accelerate the lookup, but it adds either a basic runtime overhead if we
construct it within sys_membarrier() or a memory overhead if we choose to add it
to the task struct (which I'd really like to avoid).

But... either way we choose, we can extend the system call flags and parameters
as needed, so I think it really should not be part of this initial
implementation.
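
A minimal sketch of the "dumb" O(n^2) comparison mentioned above, assuming
each shared mapping can be reduced to a single identifier. The struct, its
field and the function name are hypothetical; how the kernel would really
identify a shared mapping is left open in this discussion.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for whatever identifies a shared memory mapping. */
struct sketch_shm_map {
	unsigned long backing_id;
};

/*
 * Element-by-element comparison of two processes' shared mappings.
 * Returns true as soon as one mapping is found in both lists.
 * Cost is O(na * nb), which is acceptable only for short lists.
 */
static bool sketch_maps_overlap(const struct sketch_shm_map *a, size_t na,
				const struct sketch_shm_map *b, size_t nb)
{
	for (size_t i = 0; i < na; i++)
		for (size_t j = 0; j < nb; j++)
			if (a[i].backing_id == b[j].backing_id)
				return true;
	return false;
}
```

A hash table keyed on the same identifier would cut the lookup cost, at the
price of the construction or memory overhead described above.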

Thanks,

Mathieu

>
> -- Steve
>
>

--
Mathieu Desnoyers
Operating System Efficiency Consultant
EfficiOS Inc.
http://www.efficios.com

Steven Rostedt

Feb 25, 2010, 1:10:02 PM

On Thu, 2010-02-25 at 12:51 -0500, Mathieu Desnoyers wrote:

> But... either way we choose, we can extend the system call flags and parameters
> as needed, so I think it really should not be part of this initial
> implementation.

I agree here too.

If you have two different tasks doing lockless RCU or what not on shared
memory, it's best to stick with the mb() on the reader side. Yeah, it
makes the performance go down, but heck, I'm really worried about the
crazy complexity that would need to go into the kernel to prevent this.

-- Steve

Mathieu Desnoyers

Feb 25, 2010, 1:10:02 PM

* Mathieu Desnoyers (mathieu....@efficios.com) wrote:
[...]

> But... either way we choose, we can extend the system call flags and parameters
> as needed, so I think it really should not be part of this initial
> implementation.

So... considering all this discussion is about future enhancements that are not
required by anyone at this stage, and that it will be possible to add these
later on thanks to the extensible sys_membarrier() flags, I propose to merge v9
of this patch for 2.6.34. I think the logical path for this patch is to go
through Ingo's tree, as it sits mostly along with the scheduler, but I have not
heard anything from him yet. Am I taking the correct path ?

Thanks,

Mathieu

Steven Rostedt

Feb 25, 2010, 1:30:02 PM

On Thu, 2010-02-25 at 13:00 -0500, Mathieu Desnoyers wrote:

> So... considering all this discussion is about future enhancements that are not
> required by anyone at this stage, and that it will be possible to add these
> later on thanks to the extensible sys_membarrier() flags, I propose to merge v9
> of this patch for 2.6.34. I think the logical path for this patch is to go
> through Ingo's tree, as it sits mostly along with the scheduler, but I have not
> heard anything from him yet. Am I taking the correct path ?

I agree this should probably go through tip. I'm sure Ingo is busy
working through the merge window now too, and is not focusing on this
thread.

Anyway, this thread still has RFC in it. Send out a new patch (new
thread) with the Subject:

[PATCH -tip] introduce sys_membarrier(): process-wide memory barrier (v9)

With all Acked-by's given and state that it is ready for inclusion in
v2.6.34. (make this statement at the top of the email) It may still not
make 2.6.34, but at least it will be on its way to 2.6.35 (or 3.0
*wish*)

-- Steve

Mathieu Desnoyers

Feb 25, 2010, 6:30:01 PM

I am proposing this patch for the 2.6.34 merge window, as I think it is ready
for inclusion.

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads of the current process. It can be used
to distribute the cost of user-space memory barriers asymmetrically by
transforming pairs of memory barriers into pairs consisting of sys_membarrier()
and a compiler barrier. For synchronization primitives that distinguish between
read-side and write-side (e.g. userspace RCU, rwlocks), the read-side can be
accelerated significantly by moving the bulk of the memory barrier overhead to
the write-side.

The first user of this system call is the "liburcu" Userspace RCU implementation
found at http://lttng.org/urcu. It aims at greatly simplifying and enhancing the
current implementation, which uses a scheme similar to the sys_membarrier(), but
based on signals sent to each reader thread.

This patch mostly sits in kernel/sched.c (it needs to access struct rq). It is
based on tip/master commit bd37c0157993c3f2fcf9eecbe1a04c246df69eab. (also
applies correctly to 2.6.33) I think the -tip tree would be the right one to
pick up this patch, as it touches sched.c.

* Benchmarks

* expedited

10,000,000 sys_membarrier calls:

* non-expedited

1000 sys_membarrier calls:

T=1-7: 0m16.002s

Results in liburcu:

Acked-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>

CC: Nicholas Miell <nmi...@comcast.net>
CC: Linus Torvalds <torv...@linux-foundation.org>
CC: mi...@elte.hu
CC: la...@cn.fujitsu.com
CC: dipa...@in.ibm.com
CC: ak...@linux-foundation.org
CC: jo...@joshtriplett.org
CC: dvh...@us.ibm.com
CC: n...@us.ibm.com
CC: tg...@linutronix.de
CC: pet...@infradead.org
CC: Valdis.K...@vt.edu
CC: dhow...@redhat.com

CC: Nick Piggin <npi...@suse.de>
CC: Chris Friesen <cfri...@nortel.com>

Index: linux.trees.git/arch/x86/include/asm/unistd_64.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/unistd_64.h 2010-02-25 18:15:06.000000000 -0500
+++ linux.trees.git/arch/x86/include/asm/unistd_64.h 2010-02-25 18:16:13.000000000 -0500

@@ -663,6 +663,8 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)
#define __NR_recvmmsg 299
__SYSCALL(__NR_recvmmsg, sys_recvmmsg)
+#define __NR_membarrier 300
+__SYSCALL(__NR_membarrier, sys_membarrier)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR

Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c 2010-02-25 18:15:06.000000000 -0500
+++ linux.trees.git/kernel/sched.c 2010-02-25 18:16:13.000000000 -0500

@@ -71,6 +71,7 @@
#include <linux/debugfs.h>
#include <linux/ctype.h>
#include <linux/ftrace.h>
+#include <linux/membarrier.h>

#include <asm/tlb.h>
#include <asm/irq_regs.h>

@@ -9077,6 +9078,194 @@ struct cgroup_subsys cpuacct_subsys = {

Index: linux.trees.git/include/linux/membarrier.h

===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000

+++ linux.trees.git/include/linux/membarrier.h 2010-02-25 18:16:13.000000000 -0500

Index: linux.trees.git/include/linux/Kbuild
===================================================================
--- linux.trees.git.orig/include/linux/Kbuild 2010-02-25 18:15:06.000000000 -0500
+++ linux.trees.git/include/linux/Kbuild 2010-02-25 18:16:13.000000000 -0500

@@ -110,6 +110,7 @@ header-y += magic.h
header-y += major.h
header-y += map_to_7segment.h
header-y += matroxfb.h
+header-y += membarrier.h
header-y += meye.h
header-y += minix_fs.h
header-y += mmtimer.h

Index: linux.trees.git/arch/x86/include/asm/unistd_32.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/unistd_32.h 2010-02-25 18:15:05.000000000 -0500
+++ linux.trees.git/arch/x86/include/asm/unistd_32.h 2010-02-25 18:16:13.000000000 -0500

@@ -343,10 +343,11 @@
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_event_open 336
#define __NR_recvmmsg 337
+#define __NR_membarrier 338

#ifdef __KERNEL__

-#define NR_syscalls 338
+#define NR_syscalls 339

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR

Index: linux.trees.git/arch/x86/ia32/ia32entry.S
===================================================================
--- linux.trees.git.orig/arch/x86/ia32/ia32entry.S 2010-02-25 18:15:06.000000000 -0500
+++ linux.trees.git/arch/x86/ia32/ia32entry.S 2010-02-25 18:16:13.000000000 -0500

@@ -842,4 +842,5 @@ ia32_sys_call_table:
.quad compat_sys_rt_tgsigqueueinfo /* 335 */
.quad sys_perf_event_open
.quad compat_sys_recvmmsg
+ .quad sys_membarrier
ia32_syscall_end:

Index: linux.trees.git/arch/x86/kernel/syscall_table_32.S
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/syscall_table_32.S 2010-02-25 18:15:05.000000000 -0500
+++ linux.trees.git/arch/x86/kernel/syscall_table_32.S 2010-02-25 18:16:13.000000000 -0500

@@ -337,3 +337,4 @@ ENTRY(sys_call_table)
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_event_open
.long sys_recvmmsg
+ .long sys_membarrier

Index: linux.trees.git/arch/x86/include/asm/mmu_context.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/mmu_context.h 2010-02-25 18:15:06.000000000 -0500
+++ linux.trees.git/arch/x86/include/asm/mmu_context.h 2010-02-25 18:16:13.000000000 -0500

--
Mathieu Desnoyers
Operating System Efficiency Consultant
EfficiOS Inc.
http://www.efficios.com

Nick Piggin

Feb 26, 2010, 12:10:02 AM

I don't see how this is fundamentally worse than your existing approach,
because on some architectures with ASIDs, the mm_cpumask isn't cleared
when a process is scheduled off the CPU, so you could effectively just
cause IPIs to lots of CPUs anyway.

x86 may also one day implement ASIDs in the same way.

So if we are worried about this then we need to solve it properly IMO.
Rate-limiting it might work.

Steven Rostedt

Feb 26, 2010, 12:40:02 AM

On Fri, 2010-02-26 at 16:08 +1100, Nick Piggin wrote:
> On Thu, Feb 25, 2010 at 11:53:01AM -0500, Mathieu Desnoyers wrote:

> > This is actually what I did in v1 of the patch, but this implementation met
> > resistance from the RT people, who were concerned about the impact on RT tasks
> > of a lower priority process doing lots of sys_membarrier() calls. So if we want
> > to do other-process-aware sys_membarrier(), we would have to iterate on all
> > cpus, for every running process shared memory maps and see if there is something
> > shared with all shm of the current process. This is clearly not as trivial as
> > just broadcasting the IPI to all cpus.
>
> I don't see how this is fundamentally worse than your existing approach,
> because on some architectures with ASIDs, the mm_cpumask isn't cleared
> when a process is scheduled off the CPU, so you could effectively just
> cause IPIs to lots of CPUs anyway.

That's why checking the mm_cpumask isn't the only check. That just
limits which CPUs we check, but before an IPI is sent, that CPU's rq
lock is taken and cpu_curr(cpu)->mm is compared against current->mm. If
they differ, then that CPU does not have an IPI sent to it.
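
A minimal userspace model of the two-stage filter described above, assuming
a fixed number of CPUs. It does not use kernel APIs: the structs and names
are hypothetical, and the rq lock is only mentioned in a comment rather than
modelled.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 4

/* Hypothetical stand-in for struct mm_struct. */
struct sketch_mm {
	int id;
};

struct sketch_cpu {
	bool in_mm_cpumask;              /* models this CPU's bit in the caller's mm_cpumask */
	const struct sketch_mm *curr_mm; /* models cpu_curr(cpu)->mm */
};

/*
 * Stage 1: mm_cpumask narrows the set of candidate CPUs (it may be stale).
 * Stage 2: only CPUs currently running the caller's mm receive an IPI.
 * Fills out[] and returns the number of CPUs selected.
 */
static size_t sketch_pick_ipi_targets(const struct sketch_cpu *cpus,
				      const struct sketch_mm *current_mm,
				      bool *out)
{
	size_t n = 0;

	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		out[cpu] = false;
		if (!cpus[cpu].in_mm_cpumask)
			continue;
		/* In the kernel, this comparison is done with that CPU's rq lock held. */
		if (cpus[cpu].curr_mm != current_mm)
			continue;
		out[cpu] = true;
		n++;
	}
	return n;
}
```

A CPU whose mask bit is still set but which is now running a different mm is
skipped by the second stage, which is the case raised for architectures that
do not clear mm_cpumask on schedule-out.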

-- Steve

Mathieu Desnoyers

Mar 1, 2010, 9:30:02 AM

Hello,

I sent this patch (v9) for the 3rd time 4 days ago (v9 has been sent once as RFC
and twice as merge requests, and only received acked-by, but no effective
merge happened). I understand that Ingo is quite busy, but I just want to make
sure it did not fall into a mailbox vortex. This is why I am sending this
friendly reminder.

Thanks,

Mathieu

* Mathieu Desnoyers (mathieu....@efficios.com) wrote:

Josh Triplett

Mar 2, 2010, 1:00:03 PM

On Thu, Feb 25, 2010 at 06:23:16PM -0500, Mathieu Desnoyers wrote:
> I am proposing this patch for the 2.6.34 merge window, as I think it is ready
> for inclusion.
>
> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads of the current process.

[...]

> Signed-off-by: Mathieu Desnoyers <mathieu....@efficios.com>
> Acked-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> Acked-by: Steven Rostedt <ros...@goodmis.org>
> Acked-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>
> CC: Nicholas Miell <nmi...@comcast.net>
> CC: Linus Torvalds <torv...@linux-foundation.org>
> CC: mi...@elte.hu
> CC: la...@cn.fujitsu.com
> CC: dipa...@in.ibm.com
> CC: ak...@linux-foundation.org
> CC: jo...@joshtriplett.org

Acked-by: Josh Triplett <jo...@joshtriplett.org>

I agree that v9 seems ready for inclusion.

Out of curiosity, do you have any benchmarks for the case of not
detecting sys_membarrier dynamically? Detecting it at library
initialization time, for instance, or even just compiling to assume its
presence? I'd like to know how much that would improve the numbers.

If significant, it might make sense to try to have a mechanism similar
to SMP alternatives, to have different code in either case. dlopen,
function pointers, runtime code patching (nop out the rmb), or similar.
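
A minimal sketch of the function-pointer variant suggested above, assuming
the library learns once, at initialization, whether sys_membarrier is
available. All names are hypothetical, the probe result is passed in as a
plain boolean, and C11 fences stand in for liburcu's own barrier macros.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Full memory barrier: needed on the read side when no sys_membarrier exists. */
static void sketch_smp_mb(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* Compiler-only barrier: enough on the read side when the writer uses sys_membarrier. */
static void sketch_compiler_barrier(void)
{
	atomic_signal_fence(memory_order_seq_cst);
}

/* Read-side barrier, chosen once; defaults to the always-safe full barrier. */
static void (*sketch_reader_barrier)(void) = sketch_smp_mb;

/* Called once at library initialization with the result of a (hypothetical) probe. */
static void sketch_init(bool have_sys_membarrier)
{
	sketch_reader_barrier = have_sys_membarrier ? sketch_compiler_barrier
						    : sketch_smp_mb;
}
```

This moves the availability check out of the read-side fast path, but it
replaces a conditional branch with an indirect call, so whether it helps
would have to be measured.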

- Josh Triplett

Mathieu Desnoyers

Mar 2, 2010, 6:10:01 PM

* Josh Triplett (jo...@joshtriplett.org) wrote:
> On Thu, Feb 25, 2010 at 06:23:16PM -0500, Mathieu Desnoyers wrote:
> > I am proposing this patch for the 2.6.34 merge window, as I think it is ready
> > for inclusion.
> >
> > Here is an implementation of a new system call, sys_membarrier(), which
> > executes a memory barrier on all threads of the current process.
> [...]
>
> > Signed-off-by: Mathieu Desnoyers <mathieu....@efficios.com>
> > Acked-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> > Acked-by: Steven Rostedt <ros...@goodmis.org>
> > Acked-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>
> > CC: Nicholas Miell <nmi...@comcast.net>
> > CC: Linus Torvalds <torv...@linux-foundation.org>
> > CC: mi...@elte.hu
> > CC: la...@cn.fujitsu.com
> > CC: dipa...@in.ibm.com
> > CC: ak...@linux-foundation.org
> > CC: jo...@joshtriplett.org
>
> Acked-by: Josh Triplett <jo...@joshtriplett.org>
>
> I agree that v9 seems ready for inclusion.

Thanks!

>
> Out of curiosity, do you have any benchmarks for the case of not
> detecting sys_membarrier dynamically? Detecting it at library
> initialization time, for instance, or even just compiling to assume its
> presence? I'd like to know how much that would improve the numbers.

Citing the patch changelog:

Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

(what we previously had)
memory barriers in reader: 973494744 reads, 892368 writes
signal-based scheme: 6289946025 reads, 1251 writes

(what we have now, with dynamic sys_membarrier check, expedited scheme)
memory barriers in reader: 907693804 reads, 817793 writes
sys_membarrier scheme: 4316818891 reads, 503790 writes

So basically, yes, there is a significant overhead on the read-side if we
compare the dynamic check (0.39 ns/read per reader) to the signal-based scheme
(0.26 ns/read per reader) (which only needs the barrier()). On the update-side,
we cannot care less though.
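
The per-read figures above appear to come from dividing the 10 s run by the
total read count and then by the 6 readers. That derivation is inferred from
the numbers, not stated in the changelog; the helper name below is
hypothetical.

```c
/*
 * Nanoseconds per read, per reader, computed as:
 *   run time / total reads / number of readers
 */
static double sketch_ns_per_read(double seconds, double total_reads, int readers)
{
	return seconds * 1e9 / total_reads / readers;
}
```

Applied to the quoted results, sketch_ns_per_read(10, 4316818891.0, 6) gives
about 0.39 for the dynamic sys_membarrier check and
sketch_ns_per_read(10, 6289946025.0, 6) gives about 0.26 for the signal-based
scheme, matching the figures in the text.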

>
> If significant, it might make sense to try to have a mechanism similar
> to SMP alternatives, to have different code in either case. dlopen,
> function pointers, runtime code patching (nop out the rmb), or similar.

Yes, definitely. It could also be useful to switch between UP and SMP primitives
dynamically when spawning the second thread in a process. We should be careful
when sharing memory maps between processes though.

Thanks,

Mathieu

>
> - Josh Triplett

--
Mathieu Desnoyers
Operating System Efficiency Consultant
EfficiOS Inc.
http://www.efficios.com

Josh Triplett

Mar 2, 2010, 9:00:02 PM

Just wanted to confirm that the signal results also hold for the
assume-sys_membarrier approach.

> > If significant, it might make sense to try to have a mechanism similar
> > to SMP alternatives, to have different code in either case. dlopen,
> > function pointers, runtime code patching (nop out the rmb), or similar.
>
> Yes, definitely. It could also be useful to switch between UP and SMP primitives
> dynamically when spawning the second thread in a process. We should be careful
> when sharing memory maps between processes though.

Might prove useful for some use cases, sure. Not a high priority given
complexity:performance ratio though, I think.

- Josh

Mathieu Desnoyers

Mar 4, 2010, 11:00:01 AM

* Ingo Molnar (mi...@elte.hu) wrote:

>
> * Mathieu Desnoyers <mathieu....@efficios.com> wrote:
>
> > I am proposing this patch for the 2.6.34 merge window, as I think it is
> > ready for inclusion.
>

> It's a bit late for this merge window i think.

OK, no problem. Thanks for taking time to review the patch. See below for
response to your comments.

>
> > Here is an implementation of a new system call, sys_membarrier(), which
> > executes a memory barrier on all threads of the current process. It can be
> > used to distribute the cost of user-space memory barriers asymmetrically by
> > transforming pairs of memory barriers into pairs consisting of
> > sys_membarrier() and a compiler barrier. For synchronization primitives that
> > distinguish between read-side and write-side (e.g. userspace RCU, rwlocks),
> > the read-side can be accelerated significantly by moving the bulk of the
> > memory barrier overhead to the write-side.
>

> Why is this such a low level and still special-purpose facility?
>
> Synchronization facilities for high-performance threading may want to do a bit
> more than just execute a barrier instruction on another CPU that has a
> relevant thread running.

Yep, I'm aware of that.

>
> You cited signal based numbers:

>
> > (what we have now, with dynamic sys_membarrier check, expedited scheme)
> > memory barriers in reader: 907693804 reads, 817793 writes
> > sys_membarrier scheme: 4316818891 reads, 503790 writes
> >
> > (dynamic sys_membarrier check, non-expedited scheme)
> > memory barriers in reader: 907693804 reads, 817793 writes
> > sys_membarrier scheme: 8698725501 reads, 313 writes
>

> Much of that signal handler overhead is i think due to:
>
> - FPU/SSE context save/restore
> - the need to wake up, run and deschedule all threads

This second point hurts, especially if we have more threads than processors.

>
> Instead i'd suggest for you to try to implement user-space RCU speedups not
> via the new sys_membarrier() syscall, but via two new signal extensions:
>
> - SA_NOFPU: on x86 to skip the FPU/SSE save/restore, for such fast in/out special
> purpose signal handlers? (can whip up a quick patch for you if you want)

This could help.

>
> - SA_RUNNING: a way to signal only running threads - as a way for user-space
> based concurrency control mechanisms to deschedule running threads (or, like
> in your case, to implement barrier / garbage collection schemes).
>
> ( Note: to properly sync back you'll also need an sa_info field to tell
> target tasks how many tasks were woken up. That way a futex can be used
> as a semaphore to signal back to the issuing thread, and make it all
> properly event triggered and nicely scalable. Also, queued signals are a
> must for such a scheme. )

Ah, nice! I wondered how you'd propose to deal with that one. It was actually my
main problem: how to wait for all running threads to complete their execution.
This added sa_info count and futex usage will indeed deal with the problem. And
rt_sigqueueinfo() will ensure that we don't collapse multiple concurrent
requests for execution of the same signal. For syncing back, I think we can do
this without modifying sa_info. Simply passing a pointer to the counter to
increment in the sigval value to rt_sigqueueinfo() should do the trick.
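
A rough sketch of the "pointer in sigval" idea, using the POSIX sigqueue()
wrapper rather than calling rt_sigqueueinfo() directly. The function and
variable names are made up for illustration, the signal is sent to the calling
process only, and the sketch does not address how the sender would learn how
many threads it has to wait for.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <unistd.h>

/* Handler: issue a full memory barrier, then bump the counter whose address
 * travelled in the signal's sigval payload. */
static void mb_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    __sync_synchronize();
    int *counter = info->si_value.sival_ptr;
    __sync_fetch_and_add(counter, 1);
}

/* Queue one real-time signal to our own process, carrying &acks.
 * Returns the number of acknowledgements seen, or -1 on error. */
static int ping_self_with_counter(void)
{
    static int acks;            /* static so the address stays valid */
    struct sigaction sa;
    union sigval val;

    acks = 0;
    sa.sa_sigaction = mb_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;   /* needed to receive the sigval payload */
    if (sigaction(SIGRTMIN, &sa, NULL) != 0)
        return -1;

    val.sival_ptr = &acks;
    if (sigqueue(getpid(), SIGRTMIN, val) != 0)
        return -1;

    return __sync_fetch_and_add(&acks, 0);  /* atomic read */
}
```

In a single-threaded process with the signal unblocked, the handler runs before
sigqueue() returns; with several threads the sender would have to wait (for
instance on a futex) until the counter reaches the expected value.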

>
> My estimation is that it will be _much_ faster than the naive signal based
> approach - maybe even quite comparable to an open-coded sys_membarrier():

Yes, especially given that your proposal makes it possible to send all signals
in "broadcast to all running threads" mode, in a single system call.

>
> - as most of the overhead in a real scenario ought to be the IPI sending and
> latency - not the syscall entry/exit. (with a signal approach we'd still go
> into target thread user-mode, so one more syscall exit+re-entry)
>
> - or for the common case where there are no other threads running, we are
> just in/out of SA_RUNNING without having to do any synchronization. In that
> case it should be quite close to sys_membarrier() - modulo some minimal
> signal API overhead. [which we could optimize some more, if it's visible in
> your benchmarks.]
>
> Signals per se are pretty scalable these days - now that most of the fastpaths
> are decoupled from tasklist_lock and everything is RCU-ized.
>
> Further benefits are:
>
> - both SA_NOFPU and SA_RUNNING could be used by a _lot_ more user-space
> facilities than just user-space RCU.
>
> - synergetic effects: growing some real high-performance facility based on
> signals would ensure further signal speedups in the future as well.
> Currently any server app that runs into signal limitations tends to shy
> away from them and use some different (and often inferior) signalling
> scheme. It would be better extend signals with 'lightweight' capabilities
> as well.
>
> All in one, signals are used by like 99.9% of Linux apps, while
> sys_membarrier() would be used only by [WAG] 0.00001% of them.
>
> So before we can merge this (at least via the RCU tree, which you have sent it
> to), i'd like to see you try _much_, _MUCH_ harder to fix the very obvious
> signal overhead performance problems you have demoed via the numbers above so
> nicely.

I think we can start with the SA_RUNNING+modified sa_info approach to signal
only running threads. I expect that much of the benefit will come from there.
Then, from that point, we can see if SA_NOFPU provides a significant performance
improvement.

Now, a very basic question: in the signal-based approach I currently use, I
reserve SIGUSR1 _from my liburcu library_ (yeah, that's pretty ugly). The
problem is: how can I reserve new signal numbers from a library without the
application using them too? We have room left in the rt signal numbers, so
maybe this is a lesser problem than with standard signals, which are quite
full, but the problem of making sure the application does not conflict
remains.
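
One best-effort approach a library could take is sketched below: scan the
real-time signal range and pick a number that still has its default
disposition. The helper name is hypothetical, and this is only a heuristic: it
cannot stop the application from installing its own handler on the same signal
later, which is exactly the conflict described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Return the first real-time signal whose disposition is still SIG_DFL,
 * or -1 if every one of them already has a handler or is ignored. */
static int pick_unused_rt_signal(void)
{
    for (int sig = SIGRTMIN; sig <= SIGRTMAX; sig++) {
        struct sigaction cur;

        /* A NULL new action only queries the current disposition. */
        if (sigaction(sig, NULL, &cur) != 0)
            continue;
        if (cur.sa_handler == SIG_DFL)
            return sig;
    }
    return -1;
}
```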

>
> If _that_ fails, and if we get all the fruits of that, _then_ we might
> perhaps, with a lot of hesitation, concede defeat and think about adding yet
> another syscall.
>
> I know it's cool to add a brand new syscall - but, unfortunately, in practice
> it doesnt help Linux apps all that much. (at least until we have tools/klibc/
> or so.)
>
> [ There's also a few small cleanliness details i noticed in your patch: enums
> are a tiny bit nicer for ABIs than #define's, the #ifdef SMP is ugly, etc. -
> but it doesnt really matter much as i think we should concentrate on the
> scalability problems of signals first. ]

OK, let's do that.

Thanks,

Mathieu

>
> Thanks,
>
> Ingo

Mathieu Desnoyers

Mar 4, 2010, 11:10:02 AM

* Mathieu Desnoyers (mathieu....@efficios.com) wrote:

Hrm, I overlooked the fact that this counter must be written by the signal
sender. So we probably need to add a field to sa_info as you proposed.

Thanks,

Mathieu

Ingo Molnar

Mar 4, 2010, 11:20:01 AM

* Mathieu Desnoyers <mathieu....@efficios.com> wrote:

> I am proposing this patch for the 2.6.34 merge window, as I think it is
> ready for inclusion.

It's a bit late for this merge window i think.

> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads of the current process. It can be
> used to distribute the cost of user-space memory barriers asymmetrically by
> transforming pairs of memory barriers into pairs consisting of
> sys_membarrier() and a compiler barrier. For synchronization primitives that
> distinguish between read-side and write-side (e.g. userspace RCU, rwlocks),
> the read-side can be accelerated significantly by moving the bulk of the
> memory barrier overhead to the write-side.

Why is this such a low level and still special-purpose facility?

Synchronization facilities for high-performance threading may want to do a bit
more than just execute a barrier instruction on another CPU that has a
relevant thread running.

You cited signal based numbers:

> (what we have now, with dynamic sys_membarrier check, expedited scheme)
> memory barriers in reader: 907693804 reads, 817793 writes
> sys_membarrier scheme: 4316818891 reads, 503790 writes
>
> (dynamic sys_membarrier check, non-expedited scheme)
> memory barriers in reader: 907693804 reads, 817793 writes
> sys_membarrier scheme: 8698725501 reads, 313 writes

Much of that signal handler overhead is i think due to:

- FPU/SSE context save/restore
- the need to wake up, run and deschedule all threads

Instead i'd suggest for you to try to implement user-space RCU speedups not
via the new sys_membarrier() syscall, but via two new signal extensions:

- SA_NOFPU: on x86 to skip the FPU/SSE save/restore, for such fast in/out special
purpose signal handlers? (can whip up a quick patch for you if you want)

- SA_RUNNING: a way to signal only running threads - as a way for user-space
based concurrency control mechanisms to deschedule running threads (or, like
in your case, to implement barrier / garbage collection schemes).

( Note: to properly sync back you'll also need an sa_info field to tell
target tasks how many tasks were woken up. That way a futex can be used
as a semaphore to signal back to the issuing thread, and make it all
properly event triggered and nicely scalable. Also, queued signals are a
must for such a scheme. )

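My estimation is that it will be _much_ faster than the naive signal based
approach - maybe even quite comparable to an open-coded sys_membarrier():

- as most of the overhead in a real scenario ought to be the IPI sending and
latency - not the syscall entry/exit. (with a signal approach we'd still go
into target thread user-mode, so one more syscall exit+re-entry)

- or for the common case where there are no other threads running, we are
just in/out of SA_RUNNING without having to do any synchronization. In that
case it should be quite close to sys_membarrier() - modulo some minimal
signal API overhead. [which we could optimize some more, if it's visible in
your benchmarks.]

Signals per se are pretty scalable these days - now that most of the fastpaths
are decoupled from tasklist_lock and everything is RCU-ized.

Further benefits are:

- both SA_NOFPU and SA_RUNNING could be used by a _lot_ more user-space
facilities than just user-space RCU.

- synergetic effects: growing some real high-performance facility based on
signals would ensure further signal speedups in the future as well.
Currently any server app that runs into signal limitations tends to shy
away from them and use some different (and often inferior) signalling
scheme. It would be better to extend signals with 'lightweight' capabilities
as well.

All in one, signals are used by like 99.9% of Linux apps, while
sys_membarrier() would be used only by [WAG] 0.00001% of them.

So before we can merge this (at least via the RCU tree, which you have sent it
to), i'd like to see you try _much_, _MUCH_ harder to fix the very obvious
signal overhead performance problems you have demoed via the numbers above so
nicely.

If _that_ fails, and if we get all the fruits of that, _then_ we might
perhaps, with a lot of hesitation, concede defeat and think about adding yet
another syscall.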

I know it's cool to add a brand new syscall - but, unfortunately, in practice
it doesnt help Linux apps all that much. (at least until we have tools/klibc/
or so.)

[ There's also a few small cleanliness details i noticed in your patch: enums
are a tiny bit nicer for ABIs than #define's, the #ifdef SMP is ugly, etc. -
but it doesnt really matter much as i think we should concentrate on the
scalability problems of signals first. ]

Thanks,

Ingo

Linus Torvalds

Mar 4, 2010, 11:40:02 AM

On Thu, 4 Mar 2010, Ingo Molnar wrote:
>
> - SA_NOFPU: on x86 to skip the FPU/SSE save/restore, for such fast in/out special
> purpose signal handlers? (can whip up a quick patch for you if you want)

I'd love to do this, but it's wrong.

It's too damn easy to use the FPU by mistake in user land, without ever
being aware of it. memset()/memcpy() are obvious potential users of SSE, but
they might be called in non-obvious ways implicitly by the compiler (ie
structure copy and setup).

And modern glibc ends up using SSE4 even for things like strstr and
strlen, so it really is creeping into all kinds of trivial helper
functions that might not be obvious. So SA_NOFPU is a lovely idea, but
it's also an idea that sucks rotten eggs in practice, with quite possibly
the same _binary_ working or not working depending on what kind of CPU and
what shared library it happens to be using.

Too damn fragile, in other words.
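
To make the "implicit use" point concrete, here is a small sketch of a handler
that contains no floating-point arithmetic at all, only a structure assignment.
The structure and function names are invented for the example. On x86-64,
compilers commonly implement such a copy with SSE register moves or a call to
memcpy(), though the exact code depends on the compiler, flags and C library.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

struct sample {
    double values[8];
    long seq;
};

static struct sample latest;
static struct sample snapshot;

/* No explicit FPU/SSE use here, yet the struct assignment may be compiled
 * into vector loads/stores or an out-of-line memcpy(). */
static void copy_handler(int sig)
{
    (void)sig;
    snapshot = latest;
}

/* Install the handler, trigger it once, and report what it copied. */
static long copy_in_handler(void)
{
    struct sigaction sa;

    sa.sa_handler = copy_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    latest.seq = 42;
    raise(SIGUSR1);     /* handler runs before raise() returns */
    return snapshot.seq;
}
```

Inspecting the generated assembly for copy_handler() (for example with
`gcc -O2 -S`) is one way to see whether a given toolchain touches the vector
registers here.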

(Now, if it's accompanied by the kernel actually _testing_ that there is
no FPU activity, by setting the TS flag and checking at fault time and
causing a SIGFPE, then that would be better. At least you'd get a nice
clear signal rather than random FPU state corruption. But you're still in
the situation that now the binary might work on some machines and setups,
and not on others.)

> - SA_RUNNING: a way to signal only running threads - as a way for user-space
> based concurrency control mechanisms to deschedule running threads (or, like
> in your case, to implement barrier / garbage collection schemes).

Hmm. This sounds less fundamentally broken, but at the same time also
_way_ more invasive in the signal handling layer. It's already one of our
more "exciting" layers out there.

Linus

Paul E. McKenney

Mar 4, 2010, 12:00:03 PM

On Thu, Mar 04, 2010 at 08:34:16AM -0800, Linus Torvalds wrote:
>
>
> On Thu, 4 Mar 2010, Ingo Molnar wrote:
> >
> > - SA_NOFPU: on x86 to skip the FPU/SSE save/restore, for such fast in/out special
> > purpose signal handlers? (can whip up a quick patch for you if you want)
>
> I'd love to do this, but it's wrong.
>
> It's too damn easy to use the FPU by mistake in user land, without ever
> being aware of it. memset()/memcpy are obvious potential users SSE, but
> they might be called in non-obvious ways implicitly by the compiler (ie
> structure copy and setup).
>
> And modern glibc ends up using SSE4 even for things like strstr and
> strlen, so it really is creeping into all kinds of trivial helper
> functions that might not be obvious. So SA_NOFPU is a lovely idea, but
> it's also an idea that sucks rotten eggs in practice, with quite possibly
> the same _binary_ working or not working depending on what kind of CPU and
> what shared library it happens to be using.
>
> Too damn fragile, in other words.
>
> (Now, if it's accompanied by the kernel actually _testing_ that there is
> no FPU activity, by setting the TS flag and checking at fault time and
> causing a SIGFPE, then that would be better. At least you'd get a nice
> clear signal rather than random FPU state corruption. But you're still in
> the situation that now the binary might work on some machines and setups,
> and not on others.

I was assuming that using the FPU in the special handler would result in
a SIGFPE -- but that it would not affect normal signal handlers, only
those invoked by this user-level-RCU acceleration mechanism.

Thanx, Paul

Mathieu Desnoyers

Mar 4, 2010, 1:00:03 PM

* Linus Torvalds (torv...@linux-foundation.org) wrote:
> > - SA_RUNNING: a way to signal only running threads - as a way for user-space
> > based concurrency control mechanisms to deschedule running threads (or, like
> > in your case, to implement barrier / garbage collection schemes).
>
> Hmm. This sounds less fundamentally broken, but at the same time also
> _way_ more invasive in the signal handling layer. It's already one of our
> more "exciting" layers out there.
>

Hrm, thinking about it a bit further, the only way I see we could provide a
usable SA_RUNNING flag would be to add hooks to the scheduler. These hooks would
somehow have to call user-space code (!) when scheduling in/out a thread. Yes,
this sounds utterly broken (since these hooks would have to be preemptable).

The idea is this: if we look, for instance, at the kernel preemptable RCU
implementations, they consist of two parts: one is iteration on all CPUs to
consider all active CPUs, and the other is a modification of the scheduler to
note all preempted tasks that were in a preemptable RCU critical section.

Just for the memory barrier we consider for sys_membarrier(), I had to ensure
that the scheduler issues memory barriers to order accesses to user-space memory
and mm_cpumask modifications. In reality, what we are doing is to ensure that
the operation required on the running thread is done by the scheduler too when
scheduling in/out the task.

As soon as we have signal handlers which perform more than a simple memory
barrier (e.g. something that has side-effects outside of the processor), I doubt
it would ever make sense to only run the handler on running threads unless we
have hooks in the scheduler too.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency Consultant
EfficiOS Inc.
http://www.efficios.com

Ingo Molnar

Mar 4, 2010, 3:30:02 PM

* Linus Torvalds <torv...@linux-foundation.org> wrote:

>
> On Thu, 4 Mar 2010, Ingo Molnar wrote:
> >
> > - SA_NOFPU: on x86 to skip the FPU/SSE save/restore, for such fast in/out special
> > purpose signal handlers? (can whip up a quick patch for you if you want)
>
> I'd love to do this, but it's wrong.
>
> It's too damn easy to use the FPU by mistake in user land, without ever
> being aware of it. memset()/memcpy are obvious potential users SSE, but they
> might be called in non-obvious ways implicitly by the compiler (ie structure
> copy and setup).
>
> And modern glibc ends up using SSE4 even for things like strstr and strlen,
> so it really is creeping into all kinds of trivial helper functions that
> might not be obvious. So SA_NOFPU is a lovely idea, but it's also an idea
> that sucks rotten eggs in practice, with quite possibly the same _binary_
> working or not working depending on what kind of CPU and what shared library
> it happens to be using.
>
> Too damn fragile, in other words.
>
> (Now, if it's accompanied by the kernel actually _testing_ that there is no
> FPU activity, by setting the TS flag and checking at fault time and causing
> a SIGFPE, then that would be better. At least you'd get a nice clear signal
> rather than random FPU state corruption. But you're still in the situation
> that now the binary might work on some machines and setups, and not on
> others.

Perhaps NOFPU could do lazy context saving: clear the TS flag and only save
the FPU state if it's actually used by the signal handler?

This turns it into a 'hint', not into an FPU state corruption issue.

Clearing/enabling FPU instructions is still faster than a full-blown FPU
context save/restore.

Careful and lightweight signal handlers (like a GC scheme would likely be)
would thus be faster. In the worst case it incurs an extra trap and a
(measurable/profilable) slowdown.

In any case this would be a secondary optimization - i'd expect the biggest
difference from the 'dont wake up the world' logic:

> > - SA_RUNNING: a way to signal only running threads - as a way for user-space
> > based concurrency control mechanisms to deschedule running threads (or, like
> > in your case, to implement barrier / garbage collection schemes).
>
> Hmm. This sounds less fundamentally broken, but at the same time also _way_
> more invasive in the signal handling layer. It's already one of our more
> "exciting" layers out there.

Yeah, definitely. But i still tend to think it should be actively tried, at
which point we can still say 'yuck this cannot work, lets go for the
sys_membarrier() solution'.

Ingo

Linus Torvalds

Mar 6, 2010, 2:50:02 PM

On Thu, 4 Mar 2010, Ingo Molnar wrote:
>
> Perhaps NOFPU could do lazy context saving: clear the TS flag and only save
> the FPU state if it's actually used by the signal handler?

If we can get that working reliably, we probably shouldn't use NOFPU at
all, and we should just do it unconditionally. That big (and almost always
pointless) FPU state save is a _big_ performance issue on signal handling,
and if we can do it lazily, we should.

However, I'm not at all convinced we can do this reliably. How do we
detect the "signal frame is dead" case with things like siglongjmp() etc?

And if we can't detect that "frame no longer exists", we can't really do
the lazy context saving.

Now, there's _also_ the issue of the signal handler function possibly
actually looking at the FPU state on the stack, and for that, a SA_NOFPU
would be a good way to say "you can't do that". So it's possible that even
if we could reliably detect the frame liveness we'd really have to use
that new flag anyway.

But if we do need a SA_NOFPU flag, then that means that basically no app
will use it, and it will be some special case for some really unusual
library. So I really don't think this whole thing is worth it unless you
could do it automatically.

(The "user accesses the frame" case _could_ possibly be handled by
pointing the FP frame to a special faulting location, and never nesting
the FP optimization. Nested signal handlers are unusual enough that they
aren't worth optimizing for anyway. So I'm sure that there are possible
solutions for "automatically just do the right thing" in theory, but I
suspect they get rather complex)

Linus

Nick Piggin

Mar 9, 2010, 2:10:03 AM

On Sat, Mar 06, 2010 at 11:43:26AM -0800, Linus Torvalds wrote:
>
>
> On Thu, 4 Mar 2010, Ingo Molnar wrote:
> >
> > Perhaps NOFPU could do lazy context saving: clear the TS flag and only save
> > the FPU state if it's actually used by the signal handler?
>
> If we can get that working reliably, we probably shouldn't use NOFPU at
> all, and we should just do it unconditionally. That big (and almost always
> pointless) FPU state save is a _big_ performance issue on signal handling,
> and if we can do it lazily, we should.
>
> However, I'm not at all convinced we can do this reliably. How do we
> detect the "signal frame is dead" case with things like siglongjmp() etc?
>
> And if we can't detect that "frame no longer exists", we can't really do
> the lazy context saving.
>
> Now, there's _also_ the issue of the signal handler function possibly
> actually looking at the FPU state on the stack, and for that, a SA_NOFPU
> would be a good way to say "you can't do that". So it's possible that even
> if we could reliably detect the frame liveness we'd really have to use
> that new flag anyway.
>
> But if we do need a SA_NOFPU flag, then that means that basically no app
> will use it, and it will be some special case for some really unusual
> library. So I really don't think this whole thing is worth it unless you
> could do it automatically.

The library is librcu, which I suspect will become quite important for
parallel programming in future (maybe I hope for too much).

But maybe it's better to not merge _any_ librcu special case until
we see results from programs using it. More general speedups or features
(that also help librcu) are a different story.

Mathieu Desnoyers

Mar 9, 2010, 11:20:01 PM

* Nick Piggin (npi...@suse.de) wrote:

[...]

> The library is librcu, which I suspect will become quite important for
> parallel programming in future (maybe I hope for too much).
>
> But maybe it's better to not merge _any_ librcu special case until
> we see results from programs using it. More general speedups or features
> (that also help librcu) is a different story.
>

Hi Nick,

So, about the current state of liburcu and its users:

It is currently packaged in Debian, Ubuntu, Gentoo, and it is also being
packaged for Fedora. It is already being used by a few programs/libraries, and
given its wide availability, we can expect more in the near future.

The first user of this library is the UST (Userspace Tracing) library, a port of
LTTng to userspace.

http://lttng.org/ust

Modulo a few changes to port it to userspace, the kernel and user-space LTTng
should be expected to have similar performance, because they use essentially the
same lockless buffering scheme, described in chapter 5 of my thesis:

http://www.lttng.org/pub/thesis/desnoyers-dissertation-2009-12.pdf

Here is the impact of two additional memory barriers on the LTTng tracer fast
path:

Intel Core Xeon 2.0 GHz
LTTng probe writing 16-byte worth of data to the trace (+4 byte event header)
(execution of 200000 loops, therefore trace buffers are cache-hot)

119 ns per event

adding 2 memory barriers, one before and one after the tracepoint:

155 ns per event

So we have roughly a 30% slowdown on the tracer fast path (36 ns more per
event), which is quite significant when it comes to trace-heavy workloads. The
slowdown ratio may change slightly for non cache-hot scenarios, but I expect it
to stay in the same range. Section 8.4 of my thesis discusses the overhead of
cache-cold buffers (around 333 ns per event rather than 119 ns). I expect the
cost of the memory barriers to increase too in a cache-cold scenario.
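
For reference, the relative cost can be worked out directly from the two
per-event figures quoted above (119 ns and 155 ns). The helper below is only an
illustration of that arithmetic; its name is made up.

```c
/* Extra per-event cost, as a percentage of the baseline cost. */
static double slowdown_percent(double base_ns, double with_barriers_ns)
{
    return (with_barriers_ns - base_ns) / base_ns * 100.0;
}
```

With the numbers above, (155 - 119) / 119 gives about 30%: the two barriers add
36 ns to each event.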

If you want some insight into the class of applications that can be improved
with the userspace RCU library, you can have a look at Section 6.3 "User-Space
RCU Usage Scenarios" of my dissertation.

If you still wonder "who is using/contributing to LTTng ?", see section 9.2 of
my thesis. Or here is a quick list, taken from our website:

Google, IBM, Ericsson, Autodesk, Wind River, Fujitsu, Monta Vista, ST
Microelectronics, C2 Microsystems, Sony, Siemens, Nokia, Defence Research and
Development Canada.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency Consultant
EfficiOS Inc.
http://www.efficios.com

Mathieu Desnoyers

Mar 15, 2010, 5:00:02 PM

* Mathieu Desnoyers (mathieu....@efficios.com) wrote:

Unless this question is answered, Ingo's SA_RUNNING signal proposal, as
appealing as it may look at a first glance, falls into the "fundamentally
broken" category. I don't see any neat way to make the scheduler call into
user-space hooks to deal with inherent synchronization required between
iteration on active threads and scheduler activity. But who knows, maybe it's
just a lack of imagination on my part.

Thanks,

Mathieu

Ingo Molnar

Mar 16, 2010, 3:40:01 AM

* Mathieu Desnoyers <mathieu....@efficios.com> wrote:

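> Unless this question is answered, Ingo's SA_RUNNING signal proposal, as
> appealing as it may look at a first glance, falls into the
> "fundamentally broken" category. [...]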

How is it different from your syscall? I.e. which lines of code make the
difference? We could certainly apply the (trivial) barrier change to
context_switch().

Ingo

Nick Piggin

Mar 16, 2010, 4:00:03 AM

I think it is just easy for userspace to misuse or think it does
something that it doesn't (because of races).

If a context switch includes a barrier, then it is easy to know that
either the task of interest will execute the barrier, or it will have
context switched.

What more complex operation could be done in the signal handler that
isn't broken by races? Programs that use realtime scheduling policies,
and maybe some statistical or heuristic operations... Any cool use that
would make anybody other than librcu bother using it?

Mathieu Desnoyers

Mar 16, 2010, 9:10:02 AM

* Nick Piggin (npi...@suse.de) wrote:

Yep, this is exactly my point.

> If a context switch includes a barrier, then it is easy to know that
> either the task of interest will execute the barrier, or it will have
> context switched.
>
> What more complex operation could be done in the signal handler that
> isn't broken by races? Programs that use realtime scheduling policies,
> and maybe some statistical or heuristic operations... Any cool use that
> would make anybody other than librcu bother using it?
>

Yes, there seems to be no point in providing a nice flexible interface through
signals if the only race-less use we can find is to issue memory barriers
(which would be race-less because we add the proper barriers to the scheduler mm
switch code). And even if we find a userland use for such a signal, I tend to
think that the inherent risk of misuse and races would outweigh its benefit.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency Consultant
EfficiOS Inc.
http://www.efficios.com

Ingo Molnar

Mar 16, 2010, 9:20:02 AM

* Nick Piggin <npi...@suse.de> wrote:

> On Tue, Mar 16, 2010 at 08:36:35AM +0100, Ingo Molnar wrote:
> >
> > * Mathieu Desnoyers <mathieu....@efficios.com> wrote:
> >
> > > Unless this question is answered, Ingo's SA_RUNNING signal proposal, as
> > > appealing as it may look at a first glance, falls into the
> > > "fundamentally broken" category. [...]
> >
> > How is it different from your syscall? I.e. which lines of code make the
> > difference? We could certainly apply the (trivial) barrier change to
> > context_switch().
>
> I think it is just easy for userspace to misuse or think it does something
> that it doesn't (because of races).

That wasnt my question though. The question i asked Mathieu was to show how
SA_RUNNING is "fundamentally broken" for librcu use while sys_membarrier() is
not?

This is really what he claims above. (i preserved the quote)

It must be a misunderstanding either on my side or on his side. (Once that is
cleared we can discuss further usecases for SA_RUNNING.)

Thanks,

Ingo

Mathieu Desnoyers

Mar 16, 2010, 9:40:02 AM

* Ingo Molnar (mi...@elte.hu) wrote:
>

> * Nick Piggin <npi...@suse.de> wrote:
>
> > On Tue, Mar 16, 2010 at 08:36:35AM +0100, Ingo Molnar wrote:
> > >
> > > * Mathieu Desnoyers <mathieu....@efficios.com> wrote:
> > >
> > > > Unless this question is answered, Ingo's SA_RUNNING signal proposal, as
> > > > appealing as it may look at a first glance, falls into the
> > > > "fundamentally broken" category. [...]
> > >
> > > How is it different from your syscall? I.e. which lines of code make the
> > > difference? We could certainly apply the (trivial) barrier change to
> > > context_switch().
> >
> > I think it is just easy for userspace to misuse or think it does something
> > that it doesn't (because of races).
>
> That wasnt my question though. The question i asked Mathieu was to show how
> SA_RUNNING is "fundamentally broken" for librcu use while sys_membarrier() is
> not?
>
> This is really what he claims above. (i preserved the quote)
>
> It must be a misunderstanding either on my side or on his side. (Once that is
> cleared we can discuss further usecases for SA_RUNNING.)

Well, it's not broken for sys_membarrier() specifically if we add the proper
memory barriers to the scheduler, but it's broken when we try to use it for
anything else. What makes it broken is that it requires the scheduler switch
to guarantee the same side-effect on a running thread as execution of the
per-running-thread signal handler.

What's different with the sys_membarrier system call is that it does not try to
make generic something that should probably stay case-specific due to its
close coupling with the scheduler.

Thanks,

Mathieu

>
> Thanks,
>
> Ingo

--
Mathieu Desnoyers
Operating System Efficiency Consultant
EfficiOS Inc.
http://www.efficios.com

Ingo Molnar

Mar 16, 2010, 10:00:02 AM

* Mathieu Desnoyers <mathieu....@efficios.com> wrote:

> * Ingo Molnar (mi...@elte.hu) wrote:
> >
> > * Nick Piggin <npi...@suse.de> wrote:
> >
> > > On Tue, Mar 16, 2010 at 08:36:35AM +0100, Ingo Molnar wrote:
> > > >
> > > > * Mathieu Desnoyers <mathieu....@efficios.com> wrote:
> > > >
> > > > > Unless this question is answered, Ingo's SA_RUNNING signal proposal, as
> > > > > appealing as it may look at a first glance, falls into the
> > > > > "fundamentally broken" category. [...]
> > > >
> > > > How is it different from your syscall? I.e. which lines of code make the
> > > > difference? We could certainly apply the (trivial) barrier change to
> > > > context_switch().
> > >
> > > I think it is just easy for userspace to misuse or think it does something
> > > that it doesn't (because of races).
> >
> > That wasnt my question though. The question i asked Mathieu was to show how
> > SA_RUNNING is "fundamentally broken" for librcu use while sys_membarrier() is
> > not?
> >
> > This is really what he claims above. (i preserved the quote)
> >
> > It must be a misunderstanding either on my side or on his side. (Once that is
> > cleared we can discuss further usecases for SA_RUNNING.)
>
> Well, it's not broken for sys_membarrier() specifically if we add the proper
> memory barriers to the scheduler, but it's broken when we try to use it for
> anything else. [...]

That's quite an important distinction to an unqualified "fundamentally
broken", right?

> [...] What makes it broken is that it requires that the scheduler switch
> guarantee to have the same side-effect on a running thread than execution on
> the per-running-thread signal handler.
>
> What's different with the sys_membarrier system call is that it does not try
> to make generic something that should probably stay case-specific due to its
> close coupling with the scheduler.

Yeah, that's a fair point.

Without another realistic usecase SA_RUNNING would just essentially be a
SA_BARRIER special-case. (IMO even in that case signal handling speedups
driven via this usecase would still be tempting though.)

But note that some other usecase is possible as well:

In theory we could inject signals at context-switch time (if that signal is
not pending yet) - signals are fairly atomic [with a preallocated pool] and
the 'wakeup' property of signals is not needed as the to-be-running task is
obviously up to execution. (so there's no deadlock. It doesn't have to run with
the rq lock taken in any case - it can run from sched_tail() i suspect.)

So all this could be done via the ret-to-user framework that KVM uses at
essentially no extra scheduler overhead. I think :-) It would be a bit like
SIGALRM for timers.

Plus another performance optimization would be useful as well: signals could
be turned on/off without having to enter the kernel. This could be done via a
in-user-memory enable/disable-signals flag/mask associated with each task. (it
would pin a page of memory.)

The question is, do we want to enable user-space to trigger a signal upon
context-switches?

It probably cannot be a queued one, as preemption from the signal handler
itself would be rather yucky. As long as concurrency control is involved,
user-space only wants a callback for the _first_ reschedule - subsequent
reschedules don't need to trigger a signal, until the signal handler has
finished.
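
The "first reschedule only" behaviour described above can be modelled in plain user-space C. This is only a sketch under stated assumptions: struct resched_notify and the three resched_notify_*() functions are hypothetical names invented for illustration, not an existing kernel or libc interface, and the context-switch hook is simulated by an ordinary function call.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-task state; nothing like this exists in the patch. */
struct resched_notify {
    atomic_bool armed;       /* true: the next context switch raises the signal */
    unsigned int delivered;  /* signals raised (bookkeeping for the sketch) */
    unsigned int coalesced;  /* reschedules that raised nothing */
};

static void resched_notify_init(struct resched_notify *n)
{
    atomic_init(&n->armed, true);
    n->delivered = 0;
    n->coalesced = 0;
}

/*
 * Stand-in for the context-switch hook. Only the first reschedule after
 * (re-)arming raises the signal; later ones are dropped, not queued.
 * Returns true if the signal was raised.
 */
static bool resched_notify_on_switch(struct resched_notify *n)
{
    if (atomic_exchange_explicit(&n->armed, false, memory_order_acq_rel)) {
        n->delivered++;
        return true;
    }
    n->coalesced++;
    return false;
}

/* Called when the signal handler has finished: re-arm for the next reschedule. */
static void resched_notify_handler_done(struct resched_notify *n)
{
    atomic_store_explicit(&n->armed, true, memory_order_release);
}
```

With this model, three consecutive calls to resched_notify_on_switch() raise one signal and coalesce two; a further call raises a signal again only after resched_notify_handler_done() has run.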

Ingo

Mathieu Desnoyers

Mar 16, 2010, 10:20:02 AM

OK, I guess "conceptually broken" would be more precise in this case. ;)

>
> > [...] What makes it broken is that it requires that the scheduler switch
> > guarantee to have the same side-effect on a running thread as execution of
> > the per-running-thread signal handler.
> >
> > What's different with the sys_membarrier system call is that it does not try
> > to make generic something that should probably stay case-specific due to its
> > close coupling with the scheduler.
>
> Yeah, that's a fair point.
>
> Without another realistic usecase SA_RUNNING would just essentially be a
> SA_BARRIER special-case. (IMO even in that case signal handling speedups
> driven via this usecase would still be tempting though.)
>
> But note that some other usecase is possible as well:
>
> In theory we could inject signals at context-switch time (if that signal is
> not pending yet) - signals are fairly atomic [with a preallocated pool] and
> the 'wakeup' property of signals is not needed as the to-be-running task is
> obviously up to execution. (so there's no deadlock. It doesn't have to run with
> the rq lock taken in any case - it can run from sched_tail() i suspect.)
>
> So all this could be done via the ret-to-user framework that KVM uses at
> essentially no extra scheduler overhead. I think :-) It would be a bit like
> SIGALRM for timers.

That could be an interesting approach to hook into the scheduler "return to
userspace" path. We have to consider that this signal should probably have a
very high priority if we expect it to effectively nest over other signal
handlers.

But it does not address the hook needed upon entry into the scheduler context
switch. I fear this one might be a bit harder to do without tons of extra
overhead.

>
> Plus another performance optimization would be useful as well: signals could
> be turned on/off without having to enter the kernel. This could be done via a
> in-user-memory enable/disable-signals flag/mask associated with each task. (it
> would pin a page of memory.)

Hrm, it makes me wonder if this optimization would not add a slight overhead to
the scheduler. By allowing this kind of enable/disable flag, we would have to
check for blocked signal delivery upon each return to userspace. With the
current system call used for masking signals, this check can accurately be done
only in the signal-related system calls. (but maybe the scheduler already has to
take part of this burden for other reasons I'm not aware of). But yes,
independently of the SA_RUNNING topic, this optimization might very well be
worth it. I've actually been thinking along the same lines for an enable-disable
"thread migration" flag too, but that's a completely different topic (and has
impact on scheduler migration and cpu hotplug, so it's not as easy as it seems).
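
To make the cost being discussed concrete, here is a rough user-space model of such an in-user-memory flag. It is a sketch under stated assumptions, not a proposed interface: struct usig_ctl, the usig_*() helpers and kernel_try_deliver() are hypothetical names, the "kernel" side is simulated by a plain function, and the model is single-threaded. The load of ctl->disabled in kernel_try_deliver() stands for the extra check that would have to run on each return to userspace.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical control word in user memory (would live in a pinned page). */
struct usig_ctl {
    atomic_bool disabled;  /* written by user space without a system call */
    atomic_bool pending;   /* set by the "kernel" when delivery was deferred */
};

typedef void (*usig_handler_t)(void);

/* Demo handler for the sketch: just counts invocations. */
static int handler_runs;
static void count_handler(void)
{
    handler_runs++;
}

static void usig_init(struct usig_ctl *ctl)
{
    atomic_init(&ctl->disabled, false);
    atomic_init(&ctl->pending, false);
}

/*
 * Stand-in for the check on return to userspace. If signals are disabled,
 * remember that one is pending instead of delivering it.
 * Returns true if the handler ran now.
 */
static bool kernel_try_deliver(struct usig_ctl *ctl, usig_handler_t handler)
{
    if (atomic_load_explicit(&ctl->disabled, memory_order_acquire)) {
        atomic_store_explicit(&ctl->pending, true, memory_order_release);
        return false;
    }
    handler();
    return true;
}

/* User side: mask signals with a plain store, no kernel entry. */
static void usig_disable(struct usig_ctl *ctl)
{
    atomic_store_explicit(&ctl->disabled, true, memory_order_release);
}

/*
 * User side: unmask, then run any delivery that was deferred meanwhile.
 * Returns true if a deferred signal was handled.
 */
static bool usig_enable(struct usig_ctl *ctl, usig_handler_t handler)
{
    atomic_store_explicit(&ctl->disabled, false, memory_order_release);
    if (atomic_exchange_explicit(&ctl->pending, false, memory_order_acq_rel)) {
        handler();
        return true;
    }
    return false;
}
```

In this model a delivery attempted between usig_disable() and usig_enable() is deferred and handled once by usig_enable(). How a real kernel would notice the re-enable without a system call is left open here, as it is in the discussion above.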

>
> The question is, do we want to enable user-space to trigger a signal upon
> context-switches?
>
> It probably cannot be a queued one, as preemption from the signal handler
> itself would be rather yucky. As long as concurrency control is involved,
> user-space only wants a callback for the _first_ reschedule - subsequent
> reschedules don't need to trigger a signal, until the signal handler has
> finished.

That could work for return to userspace. Any clever idea about how to deal with
the hook to call upon entry into context switch?

Thanks,

Mathieu

>
> Ingo

--
Mathieu Desnoyers
Operating System Efficiency Consultant
EfficiOS Inc.
http://www.efficios.com