[this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops

Christoph Lameter

Dec 18, 2009, 5:30:02 PM
Leftovers from the earlier patchset rediffed to 2.6.33-rc1.
Mostly applications of per cpu counters to core components.

After this patchset there will be only one user of local_t left: Mathieu's
trace ringbuffer.

I added some patches that define additional this_cpu ops in order to help
Mathieu convert the trace ringbuffer to this_cpu ops. These have barely
been tested (the kernel boots fine on 32-bit and 64-bit, but there is no user
of these operations yet). RFC state.
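
As a rough illustration of the kind of conversion involved (sketch only; the
names below are made up and follow the same pattern as the module conversion
later in this series):

#include <linux/percpu.h>
#include <asm/local.h>

/* Before: a local_t per cpu counter needs an explicit cpu var lookup. */
static DEFINE_PER_CPU(local_t, example_events);

static void count_event_old(void)
{
        local_inc(&get_cpu_var(example_events));        /* disables preemption */
        put_cpu_var(example_events);
}

/* After: a dynamically allocated per cpu counter updated with a single
 * this_cpu RMW operation (a segment prefixed increment on x86). */
struct example_counter {
        unsigned long count;
};
static struct example_counter *example_counters;        /* from alloc_percpu() */

static void count_event_new(void)
{
        this_cpu_inc(example_counters->count);
}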

V7->V8
- Fix issue in slub patch
- Fix issue in modules patch
- Rediff page allocator patch
- Provide new this_cpu ops needed for ringbuffer [RFC state]

V6->V7
- Drop patches merged in 2.6.33 merge cycle
- Drop risky slub patches

V5->V6:
- Drop patches merged by Tejun.
- Drop irqless slub fastpath for now.
- Patches against Tejun percpu for-next branch.

V4->V5:
- Avoid setup_per_cpu_area() modifications and fold the remainder of the
patch into the page allocator patch.
- Irq disable / per cpu ptr fixes for page allocator patch.

V3->V4:
- Fix various macro definitions.
- Provide experimental percpu based fastpath that does not disable
interrupts for SLUB.

V2->V3:
- Available via git tree against latest upstream from
git://git.kernel.org/pub/scm/linux/kernel/git/christoph/percpu.git linus
- Rework SLUB per cpu operations. Get rid of dynamic DMA slab creation
for CONFIG_ZONE_DMA
- Create fallback framework so that 64 bit ops on 32 bit platforms
can fallback to the use of preempt or interrupt disable. 64 bit
platforms can use 64 bit atomic per cpu ops.

V1->V2:
- Various minor fixes
- Add SLUB conversion
- Add Page allocator conversion
- Patch against the git tree of today


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Christoph Lameter

Dec 18, 2009, 5:30:01 PM
percpu_page_allocator

Christoph Lameter

Dec 18, 2009, 5:30:01 PM
percpu_slub_remove_fields

Christoph Lameter

Dec 18, 2009, 5:30:01 PM
percpu_slub_conversion

Christoph Lameter

Dec 18, 2009, 7:30:01 PM
percpu_add_cmpxchg

Christoph Lameter

Dec 18, 2009, 7:30:03 PM
percpu_add_xchg_x86

Christoph Lameter

Dec 18, 2009, 7:30:02 PM
percpu_add_xchg

Christoph Lameter

Dec 18, 2009, 7:30:03 PM
percpu_add_add_return_x86

Christoph Lameter

Dec 18, 2009, 7:30:03 PM
percpu_add_cmpxchg_x86

Christoph Lameter

Dec 18, 2009, 7:30:03 PM
percpu_add_inc_dec

Christoph Lameter

Dec 18, 2009, 7:30:02 PM
percpu_local_t_remove_cpu_alloc

Christoph Lameter

Dec 18, 2009, 7:30:02 PM
percpu_add_inc_dec_x86

Christoph Lameter

Dec 18, 2009, 7:30:03 PM
percpu_local_t_remove_modules

Christoph Lameter

Dec 18, 2009, 7:30:03 PM
percpu_add_add_return

Mathieu Desnoyers

Dec 19, 2009, 9:50:01 AM
* Christoph Lameter (c...@linux-foundation.org) wrote:
> Provide support for this_cpu_cmpxchg.
>
> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>
>
> ---
> include/linux/percpu.h | 99 +++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 99 insertions(+)
>
> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h 2009-12-18 15:17:05.000000000 -0600
> +++ linux-2.6/include/linux/percpu.h 2009-12-18 15:33:12.000000000 -0600
> @@ -443,6 +443,62 @@ do { \
> # define this_cpu_xor(pcp, val) __pcpu_size_call(this_cpu_or_, (pcp), (val))
> #endif
>
> +static inline unsigned long __this_cpu_cmpxchg_generic(volatile void *pptr,
> + unsigned long old, unsigned long new, int size)
> +{
> + unsigned long prev;
> + volatile void * ptr = __this_cpu_ptr(pptr);
> +
> + switch (size) {
> + case 1: prev = *(u8 *)ptr;
> + if (prev == old)
> + *(u8 *)ptr = (u8)new;
> + break;
> + case 2: prev = *(u16 *)ptr;
> + if (prev == old)
> + *(u16 *)ptr = (u16)new;
> + break;
> + case 4: prev = *(u32 *)ptr;
> + if (prev == old)
> + *(u32 *)ptr = (u32)new;
> + break;
> + case 8: prev = *(u64 *)ptr;
> + if (prev == old)
> + *(u64 *)ptr = (u64)new;
> + break;
> + default:
> + __bad_size_call_parameter();
> + }
> + return prev;
> +}

Hi Christoph,

I am a bit concerned about the "generic" version of this_cpu_cmpxchg.
Given that what LTTng needs is basically an atomic, nmi-safe version of
the primitive (on all architectures that have something close to an NMI),
this means that it could not switch over to your primitives until we add
the equivalent support we currently have with local_t to all
architectures. The transition would be faster if we create an
atomic_cpu_*() variant which would map to local_t operations in the
initial version.
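
To make this concern concrete, here is a minimal sketch (hypothetical helper
names, nothing from the patchset) of the difference:

#include <linux/preempt.h>
#include <asm/local.h>

/* The generic fallback: an NMI arriving between the load and the store
 * can update the slot, and its update is then silently overwritten. */
static unsigned long fallback_cmpxchg(unsigned long *ptr,
                                      unsigned long old, unsigned long new)
{
        unsigned long prev;

        preempt_disable();
        prev = *ptr;            /* an NMI can hit here ...      */
        if (prev == old)
                *ptr = new;     /* ... and its update is lost   */
        preempt_enable();
        return prev;
}

/* local_cmpxchg(): a single (non LOCKed) cmpxchg instruction on x86,
 * which an NMI on the same cpu sees either fully before or fully after. */
static unsigned long nmi_safe_cmpxchg(local_t *l,
                                      unsigned long old, unsigned long new)
{
        return local_cmpxchg(l, old, new);
}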

Or maybe I have missed something in your patchset that addresses this?

Thanks,

Mathieu

> +
> +static inline unsigned long this_cpu_cmpxchg_generic(volatile void *ptr,
> + unsigned long old, unsigned long new, int size)
> +{
> + unsigned long prev;
> +
> + preempt_disable();
> + prev = __this_cpu_cmpxchg_generic(ptr, old, new, size);
> + preempt_enable();
> + return prev;
> +}
> +
> +#ifndef this_cpu_cmpxchg
> +# ifndef this_cpu_cmpxchg_1
> +# define this_cpu_cmpxchg_1(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 1)
> +# endif
> +# ifndef this_cpu_cmpxchg_2
> +# define this_cpu_cmpxchg_2(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 2)
> +# endif
> +# ifndef this_cpu_cmpxchg_4
> +# define this_cpu_cmpxchg_4(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 4)
> +# endif
> +# ifndef this_cpu_cmpxchg_8
> +# define this_cpu_cmpxchg_8(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 8)
> +# endif
> +# define this_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(__this_cpu_cmpxchg_, (old), (new))
> +#endif
> +
> /*
> * Generic percpu operations that do not require preemption handling.
> * Either we do not care about races or the caller has the
> @@ -594,6 +650,22 @@ do { \
> # define __this_cpu_xor(pcp, val) __pcpu_size_call(__this_cpu_xor_, (pcp), (val))
> #endif
>
> +#ifndef __this_cpu_cmpxchg
> +# ifndef __this_cpu_cmpxchg_1
> +# define __this_cpu_cmpxchg_1(pcp, old, new) __this_cpu_cmpxchg_generic((pcp), (old), (new), 1)
> +# endif
> +# ifndef __this_cpu_cmpxchg_2
> +# define __this_cpu_cmpxchg_2(pcp, old, new) __this_cpu_cmpxchg_generic((pcp), (old), (new), 2)
> +# endif
> +# ifndef __this_cpu_cmpxchg_4
> +# define __this_cpu_cmpxchg_4(pcp, old, new) __this_cpu_cmpxchg_generic((pcp), (old), (new), 4)
> +# endif
> +# ifndef __this_cpu_cmpxchg_8
> +# define __this_cpu_cmpxchg_8(pcp, old, new) __this_cpu_cmpxchg_generic((pcp), (old), (new), 8)
> +# endif
> +# define __this_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(__this_cpu_cmpxchg_, (old), (new))
> +#endif
> +
> /*
> * IRQ safe versions of the per cpu RMW operations. Note that these operations
> * are *not* safe against modification of the same variable from another
> @@ -709,4 +781,31 @@ do { \
> # define irqsafe_cpu_xor(pcp, val) __pcpu_size_call(irqsafe_cpu_xor_, (val))
> #endif
>
> +static inline unsigned long irqsafe_cpu_cmpxchg_generic(volatile void *ptr,
> + unsigned long old, unsigned long new, int size)
> +{
> + unsigned long flags, prev;
> +
> + local_irq_save(flags);
> + prev = __this_cpu_cmpxchg_generic(ptr, old, new, size);
> + local_irq_restore(flags);
> + return prev;
> +}
> +
> +#ifndef irqsafe_cpu_cmpxchg
> +# ifndef irqsafe_cpu_cmpxchg_1
> +# define irqsafe_cpu_cmpxchg_1(pcp, old, new) irqsafe_cpu_cmpxchg_generic(((pcp), (old), (new), 1)
> +# endif
> +# ifndef irqsafe_cpu_cmpxchg_2
> +# define irqsafe_cpu_cmpxchg_2(pcp, old, new) irqsafe_cpu_cmpxchg_generic(((pcp), (old), (new), 2)
> +# endif
> +# ifndef irqsafe_cpu_cmpxchg_4
> +# define irqsafe_cpu_cmpxchg_4(pcp, old, new) irqsafe_cpu_cmpxchg_generic(((pcp), (old), (new), 4)
> +# endif
> +# ifndef irqsafe_cpu_cmpxchg_8
> +# define irqsafe_cpu_cmpxchg_8(pcp, old, new) irqsafe_cpu_cmpxchg_generic(((pcp), (old), (new), 8)
> +# endif
> +# define irqsafe_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(irqsafe_cpu_cmpxchg_, (old), (new))
> +#endif
> +
> #endif /* __LINUX_PERCPU_H */
>
> --

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

Pekka Enberg

Dec 20, 2009, 4:20:01 AM
Christoph Lameter wrote:
> Using per cpu allocations removes the need for the per cpu arrays in the
> kmem_cache struct. These could get quite big if we have to support systems
> with thousands of cpus. The use of this_cpu_xx operations results in:
>
> 1. The size of kmem_cache for SMP configuration shrinks since we will only
> need 1 pointer instead of NR_CPUS. The same pointer can be used by all
> processors. Reduces cache footprint of the allocator.
>
> 2. We can dynamically size kmem_cache according to the actual nodes in the
> system meaning less memory overhead for configurations that may potentially
> support up to 1k NUMA nodes / 4k cpus.
>
> 3. We can remove the diddle widdle with allocating and releasing of
> kmem_cache_cpu structures when bringing up and shutting down cpus. The cpu
> alloc logic will do it all for us. Removes some portions of the cpu hotplug
> functionality.
>
> 4. Fastpath performance increases since per cpu pointer lookups and
> address calculations are avoided.
>
> V7-V8
> - Convert missed get_cpu_slab() under CONFIG_SLUB_STATS
>
> Cc: Pekka Enberg <pen...@cs.helsinki.fi>
> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>

Patches 3-6 applied.
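
For context, the struct change described in the quoted changelog is roughly
the following (simplified, hypothetical definitions, not the exact SLUB code):

/* Before: one kmem_cache_cpu pointer per possible cpu embedded in every
 * kmem_cache, so the footprint scales with NR_CPUS. */
struct kmem_cache_before {
        struct kmem_cache_cpu *cpu_slab[NR_CPUS];
        /* ... */
};

/* After: a single pointer to storage obtained with alloc_percpu(); each
 * processor reaches its own copy through this_cpu operations. */
struct kmem_cache_after {
        struct kmem_cache_cpu *cpu_slab;        /* per cpu offset */
        /* ... */
};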

Tejun Heo

Dec 21, 2009, 2:30:02 AM
On 12/19/2009 07:26 AM, Christoph Lameter wrote:
> Current this_cpu ops only allow an arch to specify add RMW operations or inc
> and dec for all sizes. Some arches can do more efficient inc and dec
> operations. Allow size specific override of fallback functions like with
> the other operations.

Wouldn't it be better to use __builtin_constant_p() and switching in
arch add/dec macros? It just seems a bit extreme to define all those
different variants from the generic header. I'm quite unsure whether
providing overrides for all the different size variants from the generic
header is necessary at all. If an arch is going to override the
operation, it's probably best to just let it override the whole thing.
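
Something like the following, sketched for an imaginary arch (the macro and
helper names are illustrative only; on x86 the inc/dec forms would map to
"inc %gs:..." / "dec %gs:..."):

/* The arch add macro picks the cheaper instruction for constant +/-1, so
 * the generic header does not need size specific inc/dec overrides. */
#define arch_this_cpu_add(pcp, val)                             \
do {                                                            \
        if (__builtin_constant_p(val) && (val) == 1)            \
                arch_this_cpu_inc_insn(pcp);                    \
        else if (__builtin_constant_p(val) && (val) == -1)      \
                arch_this_cpu_dec_insn(pcp);                    \
        else                                                    \
                arch_this_cpu_add_insn(pcp, val);               \
} while (0)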

Thanks.

--
tejun

Tejun Heo

Dec 21, 2009, 2:50:02 AM
Hello,

On 12/19/2009 07:26 AM, Christoph Lameter wrote:

> Use cpu ops to deal with the per cpu data instead of a local_t. Reduces memory
> requirements, cache footprint and decreases cycle counts.
>
> The this_cpu_xx operations are also used for !SMP mode. Otherwise we could
> not drop the use of __module_ref_addr() which would make per cpu data handling
> complicated. this_cpu_xx operations have their own fallback for !SMP.
>
> The last hold out of users of local_t is the tracing ringbuffer after this patch
> has been applied.
>
> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>

Please keep Rusty Russell cc'd on module changes.

> Index: linux-2.6/kernel/trace/ring_buffer.c
> ===================================================================
> --- linux-2.6.orig/kernel/trace/ring_buffer.c 2009-12-18 13:13:24.000000000 -0600
> +++ linux-2.6/kernel/trace/ring_buffer.c 2009-12-18 14:15:57.000000000 -0600
> @@ -12,6 +12,7 @@
> #include <linux/hardirq.h>
> #include <linux/kmemcheck.h>
> #include <linux/module.h>
> +#include <asm/local.h>

This doesn't belong to this patch, right? I stripped this part out,
added Cc: to Rusty and applied 1, 2, 7 and 8 to percpu branch. I'll
post the updated patch here.

Thanks.

--
tejun

Tejun Heo

Dec 21, 2009, 3:00:02 AM
Here's the version I committed to percpu branch.

Thanks.

From 8af47ddd01364ae3c663c0bc92415c06fe887ba1 Mon Sep 17 00:00:00 2001
From: Christoph Lameter <c...@linux-foundation.org>
Date: Fri, 18 Dec 2009 16:26:24 -0600
Subject: [PATCH] module: Use this_cpu_xx to dynamically allocate counters

Use cpu ops to deal with the per cpu data instead of a local_t. Reduces memory
requirements, cache footprint and decreases cycle counts.

The this_cpu_xx operations are also used for !SMP mode. Otherwise we could
not drop the use of __module_ref_addr() which would make per cpu data handling
complicated. this_cpu_xx operations have their own fallback for !SMP.

The last hold out of users of local_t is the tracing ringbuffer after this patch
has been applied.

Signed-off-by: Christoph Lameter <c...@linux-foundation.org>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Signed-off-by: Tejun Heo <t...@kernel.org>
---
include/linux/module.h | 38 ++++++++++++++------------------------
kernel/module.c | 30 ++++++++++++++++--------------
2 files changed, 30 insertions(+), 38 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index 482efc8..9350577 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -16,8 +16,7 @@
#include <linux/kobject.h>
#include <linux/moduleparam.h>
#include <linux/tracepoint.h>
-
-#include <asm/local.h>
+#include <linux/percpu.h>
#include <asm/module.h>

#include <trace/events/module.h>
@@ -361,11 +360,9 @@ struct module
/* Destruction function. */
void (*exit)(void);

-#ifdef CONFIG_SMP
- char *refptr;
-#else
- local_t ref;
-#endif
+ struct module_ref {
+ int count;
+ } *refptr;
#endif

#ifdef CONFIG_CONSTRUCTORS
@@ -452,25 +449,16 @@ void __symbol_put(const char *symbol);
#define symbol_put(x) __symbol_put(MODULE_SYMBOL_PREFIX #x)
void symbol_put_addr(void *addr);

-static inline local_t *__module_ref_addr(struct module *mod, int cpu)
-{
-#ifdef CONFIG_SMP
- return (local_t *) (mod->refptr + per_cpu_offset(cpu));
-#else
- return &mod->ref;
-#endif
-}
-
/* Sometimes we know we already have a refcount, and it's easier not
to handle the error case (which only happens with rmmod --wait). */
static inline void __module_get(struct module *module)
{
if (module) {
- unsigned int cpu = get_cpu();
- local_inc(__module_ref_addr(module, cpu));
+ preempt_disable();
+ __this_cpu_inc(module->refptr->count);
trace_module_get(module, _THIS_IP_,
- local_read(__module_ref_addr(module, cpu)));
- put_cpu();
+ __this_cpu_read(module->refptr->count));
+ preempt_enable();
}
}

@@ -479,15 +467,17 @@ static inline int try_module_get(struct module *module)
int ret = 1;

if (module) {
- unsigned int cpu = get_cpu();
+ preempt_disable();
+
if (likely(module_is_live(module))) {
- local_inc(__module_ref_addr(module, cpu));
+ __this_cpu_inc(module->refptr->count);
trace_module_get(module, _THIS_IP_,
- local_read(__module_ref_addr(module, cpu)));
+ __this_cpu_read(module->refptr->count));
}
else
ret = 0;
- put_cpu();
+
+ preempt_enable();
}
return ret;
}
diff --git a/kernel/module.c b/kernel/module.c
index 64787cd..9bc04de 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -474,9 +474,10 @@ static void module_unload_init(struct module *mod)

INIT_LIST_HEAD(&mod->modules_which_use_me);
for_each_possible_cpu(cpu)
- local_set(__module_ref_addr(mod, cpu), 0);
+ per_cpu_ptr(mod->refptr, cpu)->count = 0;
+
/* Hold reference count during initialization. */
- local_set(__module_ref_addr(mod, raw_smp_processor_id()), 1);
+ __this_cpu_write(mod->refptr->count, 1);
/* Backwards compatibility macros put refcount during init. */
mod->waiter = current;
}
@@ -555,6 +556,7 @@ static void module_unload_free(struct module *mod)
kfree(use);
sysfs_remove_link(i->holders_dir, mod->name);
/* There can be at most one match. */
+ free_percpu(i->refptr);
break;
}
}
@@ -619,7 +621,7 @@ unsigned int module_refcount(struct module *mod)
int cpu;

for_each_possible_cpu(cpu)
- total += local_read(__module_ref_addr(mod, cpu));
+ total += per_cpu_ptr(mod->refptr, cpu)->count;
return total;
}
EXPORT_SYMBOL(module_refcount);
@@ -796,14 +798,15 @@ static struct module_attribute refcnt = {
void module_put(struct module *module)
{
if (module) {
- unsigned int cpu = get_cpu();
- local_dec(__module_ref_addr(module, cpu));
+ preempt_disable();
+ __this_cpu_dec(module->refptr->count);
+
trace_module_put(module, _RET_IP_,
- local_read(__module_ref_addr(module, cpu)));
+ __this_cpu_read(module->refptr->count));
/* Maybe they're waiting for us to drop reference? */
if (unlikely(!module_is_live(module)))
wake_up_process(module->waiter);
- put_cpu();
+ preempt_enable();
}
}
EXPORT_SYMBOL(module_put);
@@ -1377,9 +1380,9 @@ static void free_module(struct module *mod)
kfree(mod->args);
if (mod->percpu)
percpu_modfree(mod->percpu);
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
+#if defined(CONFIG_MODULE_UNLOAD)
if (mod->refptr)
- percpu_modfree(mod->refptr);
+ free_percpu(mod->refptr);
#endif
/* Free lock-classes: */
lockdep_free_key_range(mod->module_core, mod->core_size);
@@ -2145,9 +2148,8 @@ static noinline struct module *load_module(void __user *umod,
mod = (void *)sechdrs[modindex].sh_addr;
kmemleak_load_module(mod, hdr, sechdrs, secstrings);

-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
- mod->refptr = percpu_modalloc(sizeof(local_t), __alignof__(local_t),
- mod->name);
+#if defined(CONFIG_MODULE_UNLOAD)
+ mod->refptr = alloc_percpu(struct module_ref);
if (!mod->refptr) {
err = -ENOMEM;
goto free_init;
@@ -2373,8 +2375,8 @@ static noinline struct module *load_module(void __user *umod,
kobject_put(&mod->mkobj.kobj);
free_unload:
module_unload_free(mod);
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
- percpu_modfree(mod->refptr);
+#if defined(CONFIG_MODULE_UNLOAD)
+ free_percpu(mod->refptr);
free_init:
#endif
module_free(mod, mod->module_init);
--
1.6.4.2

Tejun Heo

Dec 21, 2009, 3:20:01 AM
On 12/21/2009 04:59 PM, Tejun Heo wrote:
> Here's the version I committed to percpu branch.
>
> Thanks.
>
> From 8af47ddd01364ae3c663c0bc92415c06fe887ba1 Mon Sep 17 00:00:00 2001
> From: Christoph Lameter <c...@linux-foundation.org>
> Date: Fri, 18 Dec 2009 16:26:24 -0600
> Subject: [PATCH] module: Use this_cpu_xx to dynamically allocate counters
>
> Use cpu ops to deal with the per cpu data instead of a local_t. Reduces memory
> requirements, cache footprint and decreases cycle counts.
>
> The this_cpu_xx operations are also used for !SMP mode. Otherwise we could
> not drop the use of __module_ref_addr() which would make per cpu data handling
> complicated. this_cpu_xx operations have their own fallback for !SMP.
>
> The last hold out of users of local_t is the tracing ringbuffer after this patch
> has been applied.
>
> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>
> Cc: Rusty Russell <ru...@rustcorp.com.au>

This was changed to Acked-by as Rusty acked on the previous thread.

Thanks.

--
tejun

Rusty Russell

Dec 21, 2009, 6:30:02 PM
On Mon, 21 Dec 2009 06:29:36 pm Tejun Heo wrote:
> @@ -555,6 +556,7 @@ static void module_unload_free(struct module *mod)
> kfree(use);
> sysfs_remove_link(i->holders_dir, mod->name);
> /* There can be at most one match. */
> + free_percpu(i->refptr);
> break;
> }
> }

This looks very wrong.

Rusty

Tejun Heo

Dec 21, 2009, 7:10:02 PM
On 12/22/2009 08:28 AM, Rusty Russell wrote:
> On Mon, 21 Dec 2009 06:29:36 pm Tejun Heo wrote:
>> @@ -555,6 +556,7 @@ static void module_unload_free(struct module *mod)
>> kfree(use);
>> sysfs_remove_link(i->holders_dir, mod->name);
>> /* There can be at most one match. */
>> + free_percpu(i->refptr);
>> break;
>> }
>> }
>
> This looks very wrong.

Indeed, thanks for spotting it. Christoph, I'm rolling back all
patches from this series. Please re-post with updates.

Thanks.

--
tejun

Christoph Lameter

Dec 22, 2009, 11:00:03 AM
On Sat, 19 Dec 2009, Mathieu Desnoyers wrote:

> I am a bit concerned about the "generic" version of this_cpu_cmpxchg.
> Given that what LTTng needs is basically an atomic, nmi-safe version of
> the primitive (on all architectures that have something close to an NMI),
> this means that it could not switch over to your primitives until we add
> the equivalent support we currently have with local_t to all
> architectures. The transition would be faster if we create an
> atomic_cpu_*() variant which would map to local_t operations in the
> initial version.
>
> Or maybe I have missed something in your patchset that addresses this?

NMI safeness is not covered by this_cpu operations.

We could add nmi_safe_.... ops?

The atomic_cpu reference makes me think that you want full (LOCK)
semantics? Then use the regular atomic ops?
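
For reference, a rough sketch of the three families being compared here (the
helper and variable names are made up):

#include <linux/percpu.h>
#include <asm/local.h>
#include <asm/atomic.h>

static void counter_examples(atomic_t *shared, local_t *mine,
                             unsigned long *pcp)
{
        /* shared: ordinary kernel memory; mine: points into this cpu's
         * per cpu data; pcp: per cpu pointer from alloc_percpu(). */
        atomic_inc(shared);     /* LOCKed RMW, safe across cpus */
        local_inc(mine);        /* unlocked RMW on this cpu's data; NMI safe
                                 * where the arch implements local_t atomically */
        this_cpu_inc(*pcp);     /* unlocked RMW relocated to this cpu; NMI
                                 * safety is not part of the this_cpu contract */
}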

Christoph Lameter

Dec 22, 2009, 11:00:03 AM
On Mon, 21 Dec 2009, Tejun Heo wrote:

>
> > Index: linux-2.6/kernel/trace/ring_buffer.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/trace/ring_buffer.c 2009-12-18 13:13:24.000000000 -0600
> > +++ linux-2.6/kernel/trace/ring_buffer.c 2009-12-18 14:15:57.000000000 -0600
> > @@ -12,6 +12,7 @@
> > #include <linux/hardirq.h>
> > #include <linux/kmemcheck.h>
> > #include <linux/module.h>
> > +#include <asm/local.h>
>
> This doesn't belong to this patch, right? I stripped this part out,
> added Cc: to Rusty and applied 1, 2, 7 and 8 to percpu branch. I'll
> post the updated patch here. Thanks.

If you strip this part out then ringbuffer.c will no longer compile
since it relies on the "#include <asm/local.h>" that is removed by this patch
from the module.h file.

Christoph Lameter

Dec 22, 2009, 11:00:03 AM
On Mon, 21 Dec 2009, Tejun Heo wrote:

> index 482efc8..9350577 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -16,8 +16,7 @@
> #include <linux/kobject.h>
> #include <linux/moduleparam.h>
> #include <linux/tracepoint.h>
> -
> -#include <asm/local.h>
> +#include <linux/percpu.h>

This is going to break ringbuffer.c.

Christoph Lameter

Dec 22, 2009, 11:00:03 AM
On Mon, 21 Dec 2009, Tejun Heo wrote:

> On 12/19/2009 07:26 AM, Christoph Lameter wrote:
> > Current this_cpu ops only allow an arch to specify add RMW operations or inc
> > and dec for all sizes. Some arches can do more efficient inc and dec
> > operations. Allow size specific override of fallback functions like with
> > the other operations.
>
> Wouldn't it be better to use __builtin_constant_p() and switching in
> arch add/dec macros? It just seems a bit extreme to define all those

Yes that could be done but I am on vacation till next year ;-)...

Christoph Lameter

Dec 22, 2009, 11:20:02 AM
On Tue, 22 Dec 2009, Tejun Heo wrote:

> On 12/22/2009 08:28 AM, Rusty Russell wrote:
> > On Mon, 21 Dec 2009 06:29:36 pm Tejun Heo wrote:
> >> @@ -555,6 +556,7 @@ static void module_unload_free(struct module *mod)
> >> kfree(use);
> >> sysfs_remove_link(i->holders_dir, mod->name);
> >> /* There can be at most one match. */
> >> + free_percpu(i->refptr);
> >> break;
> >> }
> >> }
> >
> > This looks very wrong.
>
> Indeed, thanks for spotting it. Christoph, I'm rolling back all
> patches from this series. Please re-post with updates.

Simply drop the chunk?

Mathieu Desnoyers

Dec 22, 2009, 12:30:02 PM
* Christoph Lameter (c...@linux-foundation.org) wrote:
> On Sat, 19 Dec 2009, Mathieu Desnoyers wrote:
>
> > I am a bit concerned about the "generic" version of this_cpu_cmpxchg.
> > Given that what LTTng needs is basically an atomic, nmi-safe version of
> > the primitive (on all architectures that have something close to an NMI),
> > this means that it could not switch over to your primitives until we add
> > the equivalent support we currently have with local_t to all
> > architectures. The transition would be faster if we create an
> > atomic_cpu_*() variant which would map to local_t operations in the
> > initial version.
> >
> > Or maybe I have missed something in your patchset that addresses this?
>
> NMI safeness is not covered by this_cpu operations.
>
> We could add nmi_safe_.... ops?
>
> The atomic_cpu reference makes me think that you want full (LOCK)
> semantics? Then use the regular atomic ops?

nmi_safe would probably make sense here.

But given that we have to disable preemption to add precision in terms
of trace clock timestamp, I wonder if we would really gain something
considerable performance-wise.

I also thought about the design change this requires for the per-cpu
buffer commit count pointer which would have to become a per-cpu pointer
independent of the buffer structure, and I foresee a problem with
Steven's irq off tracing, which needs to perform buffer exchanges while
tracing is active. Basically, having only one top-level pointer for the
buffer makes it possible to exchange it atomically, but if we have to
have two separate pointers (one for per-cpu buffer, one for per-cpu
commit count array), then we are stuck.

So given that per-cpu ops limit us in terms of data structure layout, I
am less and less sure it's the best fit for ring buffers, especially if
we don't gain much performance-wise.

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

Tejun Heo

Dec 22, 2009, 9:10:02 PM
Hello, Christoph.

On 12/23/2009 12:56 AM, Christoph Lameter wrote:
> On Mon, 21 Dec 2009, Tejun Heo wrote:
>
>> On 12/19/2009 07:26 AM, Christoph Lameter wrote:
>>> Current this_cpu ops only allow an arch to specify add RMW operations or inc
>>> and dec for all sizes. Some arches can do more efficient inc and dec
>>> operations. Allow size specific override of fallback functions like with
>>> the other operations.
>>
>> Wouldn't it be better to use __builtin_constant_p() and switching in
>> arch add/dec macros? It just seems a bit extreme to define all those
>
> Yes that could be done but I am on vacation till next year ;-)...

I'm not sure how much I would be working from now till the end of the
year but if I end up doing some work I'll give it a shot.

Happy holidays and enjoy your vacation!

--
tejun

Tejun Heo

Dec 22, 2009, 9:10:02 PM
On 12/23/2009 12:57 AM, Christoph Lameter wrote:
> On Mon, 21 Dec 2009, Tejun Heo wrote:
>
>>
>>> Index: linux-2.6/kernel/trace/ring_buffer.c
>>> ===================================================================
>>> --- linux-2.6.orig/kernel/trace/ring_buffer.c 2009-12-18 13:13:24.000000000 -0600
>>> +++ linux-2.6/kernel/trace/ring_buffer.c 2009-12-18 14:15:57.000000000 -0600
>>> @@ -12,6 +12,7 @@
>>> #include <linux/hardirq.h>
>>> #include <linux/kmemcheck.h>
>>> #include <linux/module.h>
>>> +#include <asm/local.h>
>>
>> This doesn't belong to this patch, right? I stripped this part out,
>> added Cc: to Rusty and applied 1, 2, 7 and 8 to percpu branch. I'll
>> post the updated patch here. Thanks.
>
> If you strip this part out then ringbuffer.c will no longer compile
> since it relies on the "#include <asm/local.h>" that is removed by this patch
> from the module.h file.

Oh... alright. I'll add that comment and drop the offending chunk and
recommit.

Thanks.

--
tejun

Christoph Lameter

Jan 4, 2010, 12:30:02 PM
On Wed, 23 Dec 2009, Tejun Heo wrote:

> On 12/23/2009 12:57 AM, Christoph Lameter wrote:
> > On Mon, 21 Dec 2009, Tejun Heo wrote:
> >
> >>
> >>> Index: linux-2.6/kernel/trace/ring_buffer.c
> >>> ===================================================================
> >>> --- linux-2.6.orig/kernel/trace/ring_buffer.c 2009-12-18 13:13:24.000000000 -0600
> >>> +++ linux-2.6/kernel/trace/ring_buffer.c 2009-12-18 14:15:57.000000000 -0600
> >>> @@ -12,6 +12,7 @@
> >>> #include <linux/hardirq.h>
> >>> #include <linux/kmemcheck.h>
> >>> #include <linux/module.h>
> >>> +#include <asm/local.h>
> >>
> >> This doesn't belong to this patch, right? I stripped this part out,
> >> added Cc: to Rusty and applied 1, 2, 7 and 8 to percpu branch. I'll
> >> post the updated patch here. Thanks.
> >
> > If you strip this part out then ringbuffer.c will no longer compile
> > since it relies on the "#include <asm/local.h>" that is removed by this patch
> > from the module.h file.
>
> Oh... alright. I'll add that comment and drop the offending chunk and
> recommit.

So we need a separate patch to deal with removal of the #include
<asm/local.h> from module.h?

Christoph Lameter

Jan 4, 2010, 12:30:02 PM
On Tue, 22 Dec 2009, Mathieu Desnoyers wrote:

> > > I am a bit concerned about the "generic" version of this_cpu_cmpxchg.
> > > Given that what LTTng needs is basically an atomic, nmi-safe version of
> > > the primitive (on all architectures that have something close to a NMI),
> > > this means that it could not switch over to your primitives until we add
> > > the equivalent support we currently have with local_t to all
> > > architectures. The transition would be faster if we create an
> > > atomic_cpu_*() variant which would map to local_t operations in the
> > > initial version.
> > >
> > > Or maybe have I missed something in your patchset that address this ?
> >
> > NMI safeness is not covered by this_cpu operations.
> >
> > We could add nmi_safe_.... ops?
> >
> > The atomic_cpu reference make me think that you want full (LOCK)
> > semantics? Then use the regular atomic ops?
>
> nmi_safe would probably make sense here.

I am not sure how to implement a fallback for nmi_safe operations, though,
since there is no way of disabling NMIs.

> But given that we have to disable preemption to add precision in terms
> of trace clock timestamp, I wonder if we would really gain something
> considerable performance-wise.

Not sure what exactly you attempt to do there.

> I also thought about the design change this requires for the per-cpu
> buffer commit count pointer which would have to become a per-cpu pointer
> independent of the buffer structure, and I foresee a problem with
> Steven's irq off tracing, which needs to perform buffer exchanges while
> tracing is active. Basically, having only one top-level pointer for the
> buffer makes it possible to exchange it atomically, but if we have to
> have two separate pointers (one for per-cpu buffer, one for per-cpu
> commit count array), then we are stuck.

You just need to keep percpu pointers that are offsets into the percpu
area. They can be relocated as needed to the processor specific addresses
using the cpu ops.
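
A sketch of what this could look like (structure and field names are made up,
not the actual ring buffer code):

/* One per cpu allocation holds everything a cpu needs; the single base
 * pointer is what gets installed or swapped. */
struct rb_cpu_data {
        unsigned long   commit_count;
        void            *buffer_page;
};

struct rb {
        struct rb_cpu_data *cpu_data;   /* per cpu offset from alloc_percpu() */
};

static void rb_commit(struct rb *rb)
{
        /* caller is assumed to have preemption disabled; relocate to this
         * cpu's instance and update it */
        __this_cpu_inc(rb->cpu_data->commit_count);
}

static unsigned long rb_commit_count(struct rb *rb, int cpu)
{
        /* relocate to a specific cpu's instance */
        return per_cpu_ptr(rb->cpu_data, cpu)->commit_count;
}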

> So given that per-cpu ops limit us in terms of data structure layout, I
> am less and less sure it's the best fit for ring buffers, especially if
> we don't gain much performance-wise.

I don't understand how exactly the ring buffer logic works and what you are
trying to do here.

The ringbuffers are per cpu structures, right, and you do not change cpus
while performing operations on them? If not, then the per cpu ops are not
useful to you.

If you don't: How can you safely use the local_t operations for the
ringbuffer logic?

Mathieu Desnoyers

Jan 4, 2010, 1:00:02 PM
* Christoph Lameter (c...@linux-foundation.org) wrote:
> On Wed, 23 Dec 2009, Tejun Heo wrote:
>
> > On 12/23/2009 12:57 AM, Christoph Lameter wrote:
> > > On Mon, 21 Dec 2009, Tejun Heo wrote:
> > >
> > >>
> > >>> Index: linux-2.6/kernel/trace/ring_buffer.c
> > >>> ===================================================================
> > >>> --- linux-2.6.orig/kernel/trace/ring_buffer.c 2009-12-18 13:13:24.000000000 -0600
> > >>> +++ linux-2.6/kernel/trace/ring_buffer.c 2009-12-18 14:15:57.000000000 -0600
> > >>> @@ -12,6 +12,7 @@
> > >>> #include <linux/hardirq.h>
> > >>> #include <linux/kmemcheck.h>
> > >>> #include <linux/module.h>
> > >>> +#include <asm/local.h>
> > >>
> > >> This doesn't belong to this patch, right? I stripped this part out,
> > >> added Cc: to Rusty and applied 1, 2, 7 and 8 to percpu branch. I'll
> > >> post the updated patch here. Thanks.
> > >
> > > If you strip this part out then ringbuffer.c will no longer compile
> > > since it relies on the "#include <local.h>" that is removed by this patch
> > > from the module.h file.
> >
> > Oh... alright. I'll add that comment and drop the offending chunk and
> > recommit.
>
> So we need a separate patch to deal with removal of the #include
> <asm/local.h> from module.h?
>

I think this would make sense, yes. This would be a patch specific to
the ring-buffer code that would go through (or be acked by) Steven.

Mathieu


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

Mathieu Desnoyers

Jan 5, 2010, 5:30:02 PM
* Christoph Lameter (c...@linux-foundation.org) wrote:

Trying to make a long story short:

In this scheme, where Steven moves buffers from one CPU to another, he only
performs this operation when some other exclusion mechanism ensures that
the buffer is not being used for writing by the CPU when the move
is done.

It is therefore correct, and needs the local_t type to deal with the data
structure hierarchy vs atomic exchange, as I pointed out in my email.

Mathieu

>
> If you don't: How can you safely use the local_t operations for the
> ringbuffer logic?
>

--

Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

Christoph Lameter

Jan 5, 2010, 5:40:02 PM
On Tue, 5 Jan 2010, Mathieu Desnoyers wrote:

> In this scheme, where Steven moves buffers from one CPU to another, he only
> performs this operation when some other exclusion mechanism ensures that
> the buffer is not being used for writing by the CPU when the move
> is done.
>
> It is therefore correct, and needs the local_t type to deal with the data
> structure hierarchy vs atomic exchange, as I pointed out in my email.

Yes this_cpu_xx does not seem to work for your scheme. Please review the
patchset that I sent you instead.
