[this_cpu_xx V4 00/20] Introduce per cpu atomic operations and avoid per cpu address arithmetic

c...@linux-foundation.org

unread,

Oct 1, 2009, 6:40:06 PM10/1/09

to

V3->V4:
- Fix various macro definitions.
- Provider experimental percpu based fastpath that does not disable
interrupts for SLUB.

V2->V3:
- Available via git tree against latest upstream from
git://git.kernel.org/pub/scm/linux/kernel/git/christoph/percpu.git linus
- Rework SLUB per cpu operations. Get rid of dynamic DMA slab creation
for CONFIG_ZONE_DMA
- Create fallback framework so that 64 bit ops on 32 bit platforms
can fallback to the use of preempt or interrupt disable. 64 bit
platforms can use 64 bit atomic per cpu ops.

V1->V2:
- Various minor fixes
- Add SLUB conversion
- Add Page allocator conversion
- Patch against the git tree of today

The patchset introduces various operations to allow efficient access
to per cpu variables for the current processor. Currently there is
no way in the core to calculate the address of the instance
of a per cpu variable without a table lookup. So we see a lot of

per_cpu_ptr(x, smp_processor_id())

The patchset introduces a way to calculate the address using the offset
that is available in arch specific ways (register or special memory
locations) using

this_cpu_ptr(x)

In addition macros are provided that can operate on per cpu
variables in a per cpu atomic way. With that scalars in structures
allocated with the new percpu allocator can be modified without disabling
preempt or interrupts. This works by generating a single instruction that
does both the relocation of the address to the proper percpu area and
the RMW action.

F.e.

this_cpu_add(x->var, 20)

can be used to generate an instruction that uses a segment register for the
relocation of the per cpu address into the per cpu area of the current processor
and then increments the variable by 20. The instruction cannot be interrupted
and therefore the modification is atomic vs the cpu (it either happens or not).
Rescheduling or interrupt can only happen before or after the instruction.

Per cpu atomicness does not provide protection from concurrent modifications from
other processors. In general per cpu data is modified only from the processor
that the per cpu area is associated with. So per cpu atomicness provides a fast
and effective means of dealing with concurrency. It may allow development of
better fastpaths for allocators and other important subsystems.

The per cpu atomic RMW operations can be used to avoid having to dimension pointer
arrays in the allocators (patches for page allocator and slub are provided) and
avoid pointer lookups in the hot paths of the allocators thereby decreasing
latency of critical OS paths. The macros could be used to revise the critical
paths in the allocators to no longer need to disable interrupts (not included).

Per cpu atomic RMW operations are useful to decrease the overhead of counter
maintenance in the kernel. A this_cpu_inc() f.e. can generate a single
instruction that has no needs for registers on x86. preempt on / off can
be avoided in many places.

Patchset will reduce the code size and increase speed of operations for
dynamically allocated per cpu based statistics. A set of patches modifies
the fastpaths of the SLUB allocator reducing code size and cache footprint
through the per cpu atomic operations.

This patch depends on all arches supporting the new per cpu allocator.
IA64 still uses the old percpu allocator. Tejon has patches to fixup IA64
and the patch was approved by Tony Luck but the IA64 patches have not been
merged yet.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

c...@linux-foundation.org

unread,

Oct 1, 2009, 6:40:11 PM10/1/09

to

this_cpu_slub_aggressive_cpu_ops

c...@linux-foundation.org

unread,

Oct 1, 2009, 7:10:05 PM10/1/09

to

this_cpu_x86_ops

c...@linux-foundation.org

unread,

Oct 1, 2009, 7:10:09 PM10/1/09

to

this_cpu_snmp

c...@linux-foundation.org

unread,

Oct 1, 2009, 7:40:06 PM10/1/09

to

this_cpu_ptr_eliminate_get_put_cpu

c...@linux-foundation.org

unread,

Oct 1, 2009, 7:40:08 PM10/1/09

to

this_cpu_nfs

c...@linux-foundation.org

unread,

Oct 1, 2009, 10:00:15 PM10/1/09

to

this_cpu_ptr_xfs

c...@linux-foundation.org

unread,

Oct 1, 2009, 10:00:21 PM10/1/09

to

this_cpu_rcu

c...@linux-foundation.org

unread,

Oct 1, 2009, 10:00:22 PM10/1/09

to

this_cpu_slub_cleanup_stat

c...@linux-foundation.org

unread,

Oct 1, 2009, 11:40:05 PM10/1/09

to

this_cpu_net

c...@linux-foundation.org

unread,

Oct 1, 2009, 11:40:04 PM10/1/09

to

this_cpu_ptr_straight_transforms

c...@linux-foundation.org

unread,

Oct 1, 2009, 11:40:05 PM10/1/09

to

this_cpu_page_allocator

c...@linux-foundation.org

unread,

Oct 2, 2009, 3:10:04 AM10/2/09

to

this_cpu_slub_static_dma_kmalloc

c...@linux-foundation.org

unread,

Oct 2, 2009, 3:10:06 AM10/2/09

to

this_cpu_vmstats

c...@linux-foundation.org

unread,

Oct 2, 2009, 3:10:07 AM10/2/09

to

this_cpu_slub_irqless

c...@linux-foundation.org

unread,

Oct 2, 2009, 3:10:08 AM10/2/09

to

this_cpu_move_initialization

c...@linux-foundation.org

unread,

Oct 2, 2009, 3:10:05 AM10/2/09

to

this_cpu_ptr_intro

Tejun Heo

unread,

Oct 2, 2009, 5:20:04 AM10/2/09

to

c...@linux-foundation.org wrote:
> Basically the existing percpu ops can be used for this_cpu variants that allow
> operations also on dynamically allocated percpu data. However, we do not pass a
> reference to a percpu variable in. Instead a dynamically or statically
> allocated percpu variable is provided.
>
> Preempt, the non preempt and the irqsafe operations generate the same code.
> It will always be possible to have the requires per cpu atomicness in a single
> RMW instruction with segment override on x86.
>
> 64 bit this_cpu operations are not supported on 32 bit.
>
> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>

Acked-by: Tejun Heo <t...@kernel.org>

--
tejun

Tejun Heo

unread,

Oct 2, 2009, 5:20:06 AM10/2/09

to

Hello,

c...@linux-foundation.org wrote:
> This patch introduces two things: First this_cpu_ptr and then per cpu
> atomic operations.

I'm still not quite sure about the lvalue parameter but given that
get/put_user() is already using it, I don't think my indecisiveness
warrants NACK, so...

Acked-by: Tejun Heo <t...@kernel.org>

--
tejun

Tejun Heo

unread,

Oct 2, 2009, 5:40:09 AM10/2/09

to

Hello,

c...@linux-foundation.org wrote:
> V3->V4:
> - Fix various macro definitions.
> - Provider experimental percpu based fastpath that does not disable
> interrupts for SLUB.

The series looks very good to me. percpu#for-next now has ia64 bits
included and the legacy allocator is gone there so it can carry this
series. Sans the last one, they seem they can be stable and
incremental from now on, right? Shall I include this series into the
percpu tree?

Thanks.

--
tejun

Ingo Molnar

unread,

Oct 2, 2009, 5:40:09 AM10/2/09

to

* c...@linux-foundation.org <c...@linux-foundation.org> wrote:

> --- linux-2.6.orig/include/asm-generic/percpu.h 2009-10-01 14:14:00.000000000 -0500
> +++ linux-2.6/include/asm-generic/percpu.h 2009-10-01 14:14:02.000000000 -0500

> @@ -66,6 +69,8 @@ extern void setup_per_cpu_areas(void);
> #define per_cpu(var, cpu) (*((void)(cpu), &per_cpu_var(var)))
> #define __get_cpu_var(var) per_cpu_var(var)
> #define __raw_get_cpu_var(var) per_cpu_var(var)
> +#define this_cpu_ptr(ptr) per_cpu_ptr(ptr, 0)
> +#define __this_cpu_ptr(ptr) this_cpu_ptr(ptr)

Small detail: please have a look at the existing vertical alignment
style of the code there and follow it with new entries.

Thanks,

Ingo

Ingo Molnar

unread,

Oct 2, 2009, 6:00:18 AM10/2/09

to

* Tejun Heo <t...@kernel.org> wrote:

> Hello,
>
> c...@linux-foundation.org wrote:
> > V3->V4:
> > - Fix various macro definitions.
> > - Provider experimental percpu based fastpath that does not disable
> > interrupts for SLUB.
>

> The series looks very good to me. [...]

Seconded, very nice series!

One final step/cleanup seems to be missing from it: it should replace
current uses of percpu_op() [percpu_read(), etc.] in the x86 tree and
elsewhere with the new this_cpu_*() primitives. this_cpu_*() is using
per_cpu_from_op/per_cpu_to_op directly, we dont need those percpu_op()
variants anymore.

There should also be a kernel image size comparison done for that step,
to make sure all the new primitives are optimized to the max on the
instruction level.

> [...] percpu#for-next now has ia64 bits included and the legacy

> allocator is gone there so it can carry this series. Sans the last
> one, they seem they can be stable and incremental from now on, right?
> Shall I include this series into the percpu tree?

I'd definitely recommend doing that - it should be tested early and wide
for v2.6.33, and together with other percpu bits.

Ingo

Ingo Molnar

unread,

Oct 2, 2009, 6:00:18 AM10/2/09

to

* c...@linux-foundation.org <c...@linux-foundation.org> wrote:

> Basically the existing percpu ops can be used for this_cpu variants
> that allow operations also on dynamically allocated percpu data.
> However, we do not pass a reference to a percpu variable in. Instead a
> dynamically or statically allocated percpu variable is provided.
>
> Preempt, the non preempt and the irqsafe operations generate the same
> code. It will always be possible to have the requires per cpu
> atomicness in a single RMW instruction with segment override on x86.
>
> 64 bit this_cpu operations are not supported on 32 bit.
>
> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>

Acked-by: Ingo Molnar <mi...@elte.hu>

Ingo

Mel Gorman

unread,

Oct 2, 2009, 10:20:10 AM10/2/09

to

On Thu, Oct 01, 2009 at 05:25:33PM -0400, c...@linux-foundation.org wrote:
> Explicitly initialize the pagesets after the per cpu areas have been
> initialized. This is necessary in order to be able to use per cpu
> operations in later patches.
>

Can you be more explicit about this? I think the reasoning is as follows

A later patch will use DEFINE_PER_CPU which allocates memory later in
the boot-cycle after zones have already been initialised. Without this
patch, use of DEFINE_PER_CPU would result in invalid memory accesses
during pageset initialisation.

Have another question below as well.

> Cc: Mel Gorman <m...@csn.ul.ie>
> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>
>
> ---
> arch/ia64/kernel/setup.c | 1 +
> arch/powerpc/kernel/setup_64.c | 1 +
> arch/sparc/kernel/smp_64.c | 1 +
> arch/x86/kernel/setup_percpu.c | 2 ++
> include/linux/mm.h | 1 +
> mm/page_alloc.c | 40 +++++++++++++++++++++++++++++-----------
> mm/percpu.c | 2 ++
> 7 files changed, 37 insertions(+), 11 deletions(-)
>
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c 2009-10-01 08:54:19.000000000 -0500
> +++ linux-2.6/mm/page_alloc.c 2009-10-01 09:36:19.000000000 -0500
> @@ -3270,23 +3270,42 @@ void zone_pcp_update(struct zone *zone)
> stop_machine(__zone_pcp_update, zone, NULL);
> }
>
> -static __meminit void zone_pcp_init(struct zone *zone)
> +/*
> + * Early setup of pagesets.
> + *
> + * In the NUMA case the pageset setup simply results in all zones pcp
> + * pointer being directed at a per cpu pageset with zero batchsize.
> + *

The batchsize becomes 1, not 0 if you look at setup_pageset() but that aside,
it's unclear from the comment *why* the batchsize is 1 in the NUMA case.
Maybe something like the following?

=====
In the NUMA case, the boot_pageset is used until the slab allocator is
available to allocate per-zone pagesets as each CPU is brought up. At
this point, the batchsize is set to 1 to prevent pages "leaking" onto the
boot_pageset freelists.
=====

Otherwise, nothing in the patch jumped out at me other than to double
check CPU-up events actually result in process_zones() being called and
that boot_pageset is not being accidentally used in the long term.

> + * This means that every free and every allocation occurs directly from
> + * the buddy allocator tables.
> + *
> + * The pageset never queues pages during early boot and is therefore usable
> + * for every type of zone.
> + */
> +__meminit void setup_pagesets(void)
> {
> int cpu;
> - unsigned long batch = zone_batchsize(zone);
> + struct zone *zone;
>
> - for (cpu = 0; cpu < NR_CPUS; cpu++) {
> + for_each_zone(zone) {
> #ifdef CONFIG_NUMA
> - /* Early boot. Slab allocator not functional yet */
> - zone_pcp(zone, cpu) = &boot_pageset[cpu];
> - setup_pageset(&boot_pageset[cpu],0);
> + unsigned long batch = 0;
> +
> + for (cpu = 0; cpu < NR_CPUS; cpu++) {
> + /* Early boot. Slab allocator not functional yet */
> + zone_pcp(zone, cpu) = &boot_pageset[cpu];
> + }
> #else
> - setup_pageset(zone_pcp(zone,cpu), batch);
> + unsigned long batch = zone_batchsize(zone);
> #endif
> +
> + for_each_possible_cpu(cpu)
> + setup_pageset(zone_pcp(zone, cpu), batch);
> +
> + if (zone->present_pages)
> + printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
> + zone->name, zone->present_pages, batch);
> }
> - if (zone->present_pages)
> - printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
> - zone->name, zone->present_pages, batch);
> }
>
> __meminit int init_currently_empty_zone(struct zone *zone,
> @@ -3841,7 +3860,6 @@ static void __paginginit free_area_init_
>
> zone->prev_priority = DEF_PRIORITY;
>
> - zone_pcp_init(zone);
> for_each_lru(l) {
> INIT_LIST_HEAD(&zone->lru[l].list);
> zone->reclaim_stat.nr_saved_scan[l] = 0;
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h 2009-10-01 08:54:19.000000000 -0500
> +++ linux-2.6/include/linux/mm.h 2009-10-01 09:36:19.000000000 -0500
> @@ -1060,6 +1060,7 @@ extern void show_mem(void);
> extern void si_meminfo(struct sysinfo * val);
> extern void si_meminfo_node(struct sysinfo *val, int nid);
> extern int after_bootmem;
> +extern void setup_pagesets(void);
>
> #ifdef CONFIG_NUMA
> extern void setup_per_cpu_pageset(void);
> Index: linux-2.6/arch/ia64/kernel/setup.c
> ===================================================================
> --- linux-2.6.orig/arch/ia64/kernel/setup.c 2009-10-01 08:54:19.000000000 -0500
> +++ linux-2.6/arch/ia64/kernel/setup.c 2009-10-01 09:35:39.000000000 -0500
> @@ -864,6 +864,7 @@ void __init
> setup_per_cpu_areas (void)
> {
> /* start_kernel() requires this... */
> + setup_pagesets();
> }
> #endif
>
> Index: linux-2.6/arch/powerpc/kernel/setup_64.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/kernel/setup_64.c 2009-10-01 08:54:19.000000000 -0500
> +++ linux-2.6/arch/powerpc/kernel/setup_64.c 2009-10-01 09:35:39.000000000 -0500
> @@ -578,6 +578,7 @@ static void ppc64_do_msg(unsigned int sr
> snprintf(buf, 128, "%s", msg);
> ppc_md.progress(buf, 0);
> }
> + setup_pagesets();
> }
>
> /* Print a boot progress message. */
> Index: linux-2.6/arch/sparc/kernel/smp_64.c
> ===================================================================
> --- linux-2.6.orig/arch/sparc/kernel/smp_64.c 2009-10-01 08:54:19.000000000 -0500
> +++ linux-2.6/arch/sparc/kernel/smp_64.c 2009-10-01 09:35:39.000000000 -0500
> @@ -1486,4 +1486,5 @@ void __init setup_per_cpu_areas(void)
> of_fill_in_cpu_data();
> if (tlb_type == hypervisor)
> mdesc_fill_in_cpu_data(cpu_all_mask);
> + setup_pagesets();
> }
> Index: linux-2.6/arch/x86/kernel/setup_percpu.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/setup_percpu.c 2009-10-01 08:54:19.000000000 -0500
> +++ linux-2.6/arch/x86/kernel/setup_percpu.c 2009-10-01 09:35:39.000000000 -0500
> @@ -269,4 +269,6 @@ void __init setup_per_cpu_areas(void)
>
> /* Setup cpu initialized, callin, callout masks */
> setup_cpu_local_masks();
> +
> + setup_pagesets();
> }
> Index: linux-2.6/mm/percpu.c
> ===================================================================
> --- linux-2.6.orig/mm/percpu.c 2009-10-01 08:54:19.000000000 -0500
> +++ linux-2.6/mm/percpu.c 2009-10-01 09:35:39.000000000 -0500
> @@ -2062,5 +2062,7 @@ void __init setup_per_cpu_areas(void)
> delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
> for_each_possible_cpu(cpu)
> __per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
> +
> + setup_pagesets();
> }
> #endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
>
> --

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majo...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Mel Gorman

unread,

Oct 2, 2009, 11:20:04 AM10/2/09

to

On Thu, Oct 01, 2009 at 05:25:34PM -0400, c...@linux-foundation.org wrote:
> Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
>
> This drastically reduces the size of struct zone for systems with large
> amounts of processors and allows placement of critical variables of struct
> zone in one cacheline even on very large systems.
>

This seems reasonably accurate. The largest shrink is on !NUMA configured
systems but the NUMA case deletes a lot of pointers.

> Another effect is that the pagesets of one processor are placed near one
> another. If multiple pagesets from different zones fit into one cacheline
> then additional cacheline fetches can be avoided on the hot paths when
> allocating memory from multiple zones.
>

Out of curiousity, how common an occurance is it that a CPU allocate from
multiple zones? I would have thought it was rare but I never checked
either.

> Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
> are reduced and we can drop the zone_pcp macro.
>
> Hotplug handling is also simplified since cpu alloc can bring up and
> shut down cpu areas for a specific cpu as a whole. So there is no need to
> allocate or free individual pagesets.

>
> Cc: Mel Gorman <m...@csn.ul.ie>
> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>
>
> ---

> include/linux/mm.h | 4 -
> include/linux/mmzone.h | 12 ---
> mm/page_alloc.c | 156 ++++++++++++++-----------------------------------
> mm/vmstat.c | 14 ++--
> 4 files changed, 55 insertions(+), 131 deletions(-)
>
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h 2009-09-29 09:30:37.000000000 -0500
> +++ linux-2.6/include/linux/mm.h 2009-09-29 09:30:39.000000000 -0500
> @@ -1062,11 +1062,7 @@ extern void si_meminfo_node(struct sysin
> extern int after_bootmem;
> extern void setup_pagesets(void);
>
> -#ifdef CONFIG_NUMA
> extern void setup_per_cpu_pageset(void);
> -#else
> -static inline void setup_per_cpu_pageset(void) {}
> -#endif
>
> extern void zone_pcp_update(struct zone *zone);
>
> Index: linux-2.6/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mmzone.h 2009-09-29 09:30:25.000000000 -0500
> +++ linux-2.6/include/linux/mmzone.h 2009-09-29 09:30:39.000000000 -0500
> @@ -184,13 +184,7 @@ struct per_cpu_pageset {
> s8 stat_threshold;
> s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
> #endif
> -} ____cacheline_aligned_in_smp;
> -
> -#ifdef CONFIG_NUMA
> -#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
> -#else
> -#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
> -#endif
> +};
>
> #endif /* !__GENERATING_BOUNDS.H */
>
> @@ -306,10 +300,8 @@ struct zone {
> */
> unsigned long min_unmapped_pages;
> unsigned long min_slab_pages;
> - struct per_cpu_pageset *pageset[NR_CPUS];
> -#else
> - struct per_cpu_pageset pageset[NR_CPUS];
> #endif
> + struct per_cpu_pageset *pageset;
> /*
> * free areas of different sizes
> */
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c 2009-09-29 09:30:37.000000000 -0500
> +++ linux-2.6/mm/page_alloc.c 2009-09-29 09:30:50.000000000 -0500
> @@ -1011,7 +1011,7 @@ static void drain_pages(unsigned int cpu
> struct per_cpu_pageset *pset;
> struct per_cpu_pages *pcp;
>
> - pset = zone_pcp(zone, cpu);
> + pset = per_cpu_ptr(zone->pageset, cpu);
>
> pcp = &pset->pcp;
> local_irq_save(flags);
> @@ -1098,7 +1098,7 @@ static void free_hot_cold_page(struct pa
> arch_free_page(page, 0);
> kernel_map_pages(page, 1, 0);
>
> - pcp = &zone_pcp(zone, get_cpu())->pcp;
> + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> migratetype = get_pageblock_migratetype(page);
> set_page_private(page, migratetype);
> local_irq_save(flags);
> @@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
>
> out:
> local_irq_restore(flags);
> - put_cpu();

Previously we get_cpu() to be preemption safe. We then disable
interrupts and potentially take a spinlock later.

Is the point we disable interrupts a pre-emption point? Even if it's not
on normal kernels, is it a preemption point on the RT kernel?

If it is a pre-emption point, what stops us getting rescheduled on another
CPU after PCP has been looked up? Sorry if this has been brought up and
resolved already. This is my first proper look at this patchset. The same
query applies to any section that was

get_cpu()
look up PCP structure
disable interrupts
stuff
enable interrupts
put_cpu()

is converted to

this_cpu_ptr() looks up PCP
disable interrupts
enable interrupts

> }
>
> void free_hot_page(struct page *page)
> @@ -1183,15 +1182,13 @@ struct page *buffered_rmqueue(struct zon
> unsigned long flags;
> struct page *page;
> int cold = !!(gfp_flags & __GFP_COLD);
> - int cpu;
>
> again:
> - cpu = get_cpu();
> if (likely(order == 0)) {
> struct per_cpu_pages *pcp;
> struct list_head *list;
>
> - pcp = &zone_pcp(zone, cpu)->pcp;
> + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> list = &pcp->lists[migratetype];
> local_irq_save(flags);
> if (list_empty(list)) {
> @@ -1234,7 +1231,6 @@ again:
> __count_zone_vm_events(PGALLOC, zone, 1 << order);
> zone_statistics(preferred_zone, zone);
> local_irq_restore(flags);
> - put_cpu();
>
> VM_BUG_ON(bad_range(zone, page));
> if (prep_new_page(page, order, gfp_flags))
> @@ -1243,7 +1239,6 @@ again:
>
> failed:
> local_irq_restore(flags);
> - put_cpu();
> return NULL;
> }
>
> @@ -2172,7 +2167,7 @@ void show_free_areas(void)
> for_each_online_cpu(cpu) {
> struct per_cpu_pageset *pageset;
>
> - pageset = zone_pcp(zone, cpu);
> + pageset = per_cpu_ptr(zone->pageset, cpu);
>
> printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
> cpu, pageset->pcp.high,
> @@ -3087,7 +3082,6 @@ static void setup_pagelist_highmark(stru
> }
>
>
> -#ifdef CONFIG_NUMA
> /*
> * Boot pageset table. One per cpu which is going to be used for all
> * zones and all nodes. The parameters will be set in such a way
> @@ -3095,112 +3089,67 @@ static void setup_pagelist_highmark(stru
> * the buddy list. This is safe since pageset manipulation is done
> * with interrupts disabled.
> *
> - * Some NUMA counter updates may also be caught by the boot pagesets.
> - *
> - * The boot_pagesets must be kept even after bootup is complete for
> - * unused processors and/or zones. They do play a role for bootstrapping
> - * hotplugged processors.
> + * Some counter updates may also be caught by the boot pagesets.
> *
> * zoneinfo_show() and maybe other functions do
> * not check if the processor is online before following the pageset pointer.
> * Other parts of the kernel may not check if the zone is available.
> */
> -static struct per_cpu_pageset boot_pageset[NR_CPUS];
> -
> -/*
> - * Dynamically allocate memory for the
> - * per cpu pageset array in struct zone.
> - */
> -static int __cpuinit process_zones(int cpu)
> -{
> - struct zone *zone, *dzone;
> - int node = cpu_to_node(cpu);
> -
> - node_set_state(node, N_CPU); /* this node has a cpu */
> -
> - for_each_populated_zone(zone) {
> - zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
> - GFP_KERNEL, node);
> - if (!zone_pcp(zone, cpu))
> - goto bad;
> -
> - setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
> -
> - if (percpu_pagelist_fraction)
> - setup_pagelist_highmark(zone_pcp(zone, cpu),
> - (zone->present_pages / percpu_pagelist_fraction));
> - }
> -
> - return 0;
> -bad:
> - for_each_zone(dzone) {
> - if (!populated_zone(dzone))
> - continue;
> - if (dzone == zone)
> - break;
> - kfree(zone_pcp(dzone, cpu));
> - zone_pcp(dzone, cpu) = &boot_pageset[cpu];
> - }
> - return -ENOMEM;
> -}
> -
> -static inline void free_zone_pagesets(int cpu)
> -{
> - struct zone *zone;
> -
> - for_each_zone(zone) {
> - struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
> -
> - /* Free per_cpu_pageset if it is slab allocated */
> - if (pset != &boot_pageset[cpu])
> - kfree(pset);

> - zone_pcp(zone, cpu) = &boot_pageset[cpu];
> - }

> -}
> +static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>
> static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
> unsigned long action,
> void *hcpu)
> {
> int cpu = (long)hcpu;
> - int ret = NOTIFY_OK;
>
> switch (action) {
> case CPU_UP_PREPARE:
> case CPU_UP_PREPARE_FROZEN:
> - if (process_zones(cpu))
> - ret = NOTIFY_BAD;
> - break;
> - case CPU_UP_CANCELED:
> - case CPU_UP_CANCELED_FROZEN:
> - case CPU_DEAD:
> - case CPU_DEAD_FROZEN:
> - free_zone_pagesets(cpu);
> + node_set_state(cpu_to_node(cpu), N_CPU);
> break;
> default:
> break;
> }
> - return ret;
> + return NOTIFY_OK;
> }
>
> static struct notifier_block __cpuinitdata pageset_notifier =
> { &pageset_cpuup_callback, NULL, 0 };
>
> +/*
> + * Allocate per cpu pagesets and initialize them.
> + * Before this call only boot pagesets were available.
> + * Boot pagesets will no longer be used after this call is complete.

If they are no longer used, do we get the memory back?

> + */
> void __init setup_per_cpu_pageset(void)
> {
> - int err;
> + struct zone *zone;
> + int cpu;
> +
> + for_each_populated_zone(zone) {
> + zone->pageset = alloc_percpu(struct per_cpu_pageset);
>
> - /* Initialize per_cpu_pageset for cpu 0.
> - * A cpuup callback will do this for every cpu
> - * as it comes online
> + for_each_possible_cpu(cpu) {
> + struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
> +
> + setup_pageset(pcp, zone_batchsize(zone));
> +
> + if (percpu_pagelist_fraction)
> + setup_pagelist_highmark(pcp,
> + (zone->present_pages /
> + percpu_pagelist_fraction));
> + }
> + }

This would have been easier to review if you left process_zones() where it
was and converted it to the new API. I'm assuming this is just shuffling
code around.

> +
> + /*
> + * The boot cpu is always the first active.
> + * The boot node has a processor
> */
> - err = process_zones(smp_processor_id());
> - BUG_ON(err);
> + node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
> register_cpu_notifier(&pageset_notifier);
> }
>
> -#endif
> -
> static noinline __init_refok
> int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
> {
> @@ -3254,7 +3203,7 @@ static int __zone_pcp_update(void *data)
> struct per_cpu_pageset *pset;
> struct per_cpu_pages *pcp;
>
> - pset = zone_pcp(zone, cpu);
> + pset = per_cpu_ptr(zone->pageset, cpu);
> pcp = &pset->pcp;
>
> local_irq_save(flags);
> @@ -3272,15 +3221,7 @@ void zone_pcp_update(struct zone *zone)
>
> /*

> * Early setup of pagesets.

> - *
> - * In the NUMA case the pageset setup simply results in all zones pcp
> - * pointer being directed at a per cpu pageset with zero batchsize.
> - *
> - * This means that every free and every allocation occurs directly from
> - * the buddy allocator tables.
> - *
> - * The pageset never queues pages during early boot and is therefore usable
> - * for every type of zone.
> + * At this point various allocators are not operational yet.
> */
> __meminit void setup_pagesets(void)
> {
> @@ -3288,23 +3229,15 @@ __meminit void setup_pagesets(void)
> struct zone *zone;
>
> for_each_zone(zone) {
> -#ifdef CONFIG_NUMA
> - unsigned long batch = 0;
> -

> - for (cpu = 0; cpu < NR_CPUS; cpu++) {

> - /* Early boot. Slab allocator not functional yet */
> - zone_pcp(zone, cpu) = &boot_pageset[cpu];
> - }

> -#else

> - unsigned long batch = zone_batchsize(zone);

> -#endif
> + zone->pageset = &per_cpu_var(boot_pageset);
>
> + /*
> + * Special pagesets with zero elements so that frees
> + * and allocations are not buffered at all.
> + */
> for_each_possible_cpu(cpu)
> - setup_pageset(zone_pcp(zone, cpu), batch);
> + setup_pageset(per_cpu_ptr(zone->pageset, cpu), 0);

>
> - if (zone->present_pages)
> - printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
> - zone->name, zone->present_pages, batch);
> }
> }
>

> @@ -4818,10 +4751,11 @@ int percpu_pagelist_fraction_sysctl_hand
> if (!write || (ret == -EINVAL))
> return ret;
> for_each_populated_zone(zone) {
> - for_each_online_cpu(cpu) {
> + for_each_possible_cpu(cpu) {
> unsigned long high;
> high = zone->present_pages / percpu_pagelist_fraction;
> - setup_pagelist_highmark(zone_pcp(zone, cpu), high);
> + setup_pagelist_highmark(
> + per_cpu_ptr(zone->pageset, cpu), high);
> }
> }
> return 0;
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c 2009-09-29 09:30:25.000000000 -0500
> +++ linux-2.6/mm/vmstat.c 2009-09-29 09:30:43.000000000 -0500
> @@ -139,7 +139,8 @@ static void refresh_zone_stat_thresholds
> threshold = calculate_threshold(zone);
>
> for_each_online_cpu(cpu)
> - zone_pcp(zone, cpu)->stat_threshold = threshold;
> + per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> + = threshold;
> }
> }
>
> @@ -149,7 +150,8 @@ static void refresh_zone_stat_thresholds
> void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
> int delta)
> {
> - struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> + struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> +
> s8 *p = pcp->vm_stat_diff + item;
> long x;
>
> @@ -202,7 +204,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
> */
> void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
> {
> - struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> + struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> s8 *p = pcp->vm_stat_diff + item;
>
> (*p)++;
> @@ -223,7 +225,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);
>
> void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
> {
> - struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> + struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> s8 *p = pcp->vm_stat_diff + item;
>
> (*p)--;
> @@ -300,7 +302,7 @@ void refresh_cpu_vm_stats(int cpu)
> for_each_populated_zone(zone) {
> struct per_cpu_pageset *p;
>
> - p = zone_pcp(zone, cpu);
> + p = per_cpu_ptr(zone->pageset, cpu);
>
> for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
> if (p->vm_stat_diff[i]) {
> @@ -738,7 +740,7 @@ static void zoneinfo_show_print(struct s
> for_each_online_cpu(i) {
> struct per_cpu_pageset *pageset;
>
> - pageset = zone_pcp(zone, i);
> + pageset = per_cpu_ptr(zone->pageset, i);
> seq_printf(m,
> "\n cpu: %i"
> "\n count: %i"
>
> --
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Christoph Lameter

unread,

Oct 2, 2009, 1:20:12 PM10/2/09

to

On Fri, 2 Oct 2009, Ingo Molnar wrote:

> > @@ -66,6 +69,8 @@ extern void setup_per_cpu_areas(void);
> > #define per_cpu(var, cpu) (*((void)(cpu), &per_cpu_var(var)))
> > #define __get_cpu_var(var) per_cpu_var(var)
> > #define __raw_get_cpu_var(var) per_cpu_var(var)
> > +#define this_cpu_ptr(ptr) per_cpu_ptr(ptr, 0)
> > +#define __this_cpu_ptr(ptr) this_cpu_ptr(ptr)
>
> Small detail: please have a look at the existing vertical alignment
> style of the code there and follow it with new entries.

Ok. Fixed.

Christoph Lameter

unread,

Oct 2, 2009, 1:20:10 PM10/2/09

to

On Fri, 2 Oct 2009, Tejun Heo wrote:

> c...@linux-foundation.org wrote:
> > V3->V4:
> > - Fix various macro definitions.
> > - Provider experimental percpu based fastpath that does not disable
> > interrupts for SLUB.
>
> The series looks very good to me. percpu#for-next now has ia64 bits
> included and the legacy allocator is gone there so it can carry this
> series. Sans the last one, they seem they can be stable and
> incremental from now on, right? Shall I include this series into the
> percpu tree?

You can include all but the last patch that is experimental.

Christoph Lameter

unread,

Oct 2, 2009, 1:30:09 PM10/2/09

to

On Fri, 2 Oct 2009, Ingo Molnar wrote:

> One final step/cleanup seems to be missing from it: it should replace
> current uses of percpu_op() [percpu_read(), etc.] in the x86 tree and
> elsewhere with the new this_cpu_*() primitives. this_cpu_*() is using
> per_cpu_from_op/per_cpu_to_op directly, we dont need those percpu_op()
> variants anymore.

Well after things settle with this_cpu_xx we can drop those.

> There should also be a kernel image size comparison done for that step,
> to make sure all the new primitives are optimized to the max on the
> instruction level.

Right. There will be a time period in which other arches will need to add
support for this_cpu_xx first.

Ingo Molnar

unread,

Oct 2, 2009, 1:40:04 PM10/2/09

to

* Christoph Lameter <c...@linux-foundation.org> wrote:

> On Fri, 2 Oct 2009, Ingo Molnar wrote:
>
> > One final step/cleanup seems to be missing from it: it should
> > replace current uses of percpu_op() [percpu_read(), etc.] in the x86
> > tree and elsewhere with the new this_cpu_*() primitives.
> > this_cpu_*() is using per_cpu_from_op/per_cpu_to_op directly, we
> > dont need those percpu_op() variants anymore.
>
> Well after things settle with this_cpu_xx we can drop those.
>
> > There should also be a kernel image size comparison done for that
> > step, to make sure all the new primitives are optimized to the max
> > on the instruction level.
>
> Right. There will be a time period in which other arches will need to
> add support for this_cpu_xx first.

Size comparison should be only on architectures that support it (i.e.
x86 right now). The generic fallbacks might be bloaty, no argument about
that. ( => the more reason for any architecture to add optimizations for
this_cpu_*() APIs. )

Ingo

Christoph Lameter

unread,

Oct 2, 2009, 1:40:05 PM10/2/09

to

On Fri, 2 Oct 2009, Mel Gorman wrote:

> On Thu, Oct 01, 2009 at 05:25:33PM -0400, c...@linux-foundation.org wrote:
> > Explicitly initialize the pagesets after the per cpu areas have been
> > initialized. This is necessary in order to be able to use per cpu
> > operations in later patches.
> >
>
> Can you be more explicit about this? I think the reasoning is as follows
>
> A later patch will use DEFINE_PER_CPU which allocates memory later in
> the boot-cycle after zones have already been initialised. Without this
> patch, use of DEFINE_PER_CPU would result in invalid memory accesses
> during pageset initialisation.

Nope. Pagesets are not statically allocated per cpu data. They are
allocated with the per cpu allocator.

The per cpu allocator is not initialized that early in boot. We cannot
allocate the pagesets then. Therefore we use a fake single item pageset
(like used now for NUMA boot) to take its place until the slab and percpu
allocators are up. Then we allocate the real pagesets.

> > -static __meminit void zone_pcp_init(struct zone *zone)
> > +/*
> > + * Early setup of pagesets.
> > + *
> > + * In the NUMA case the pageset setup simply results in all zones pcp
> > + * pointer being directed at a per cpu pageset with zero batchsize.
> > + *
>
> The batchsize becomes 1, not 0 if you look at setup_pageset() but that aside,
> it's unclear from the comment *why* the batchsize is 1 in the NUMA case.
> Maybe something like the following?
>
> =====
> In the NUMA case, the boot_pageset is used until the slab allocator is
> available to allocate per-zone pagesets as each CPU is brought up. At
> this point, the batchsize is set to 1 to prevent pages "leaking" onto the
> boot_pageset freelists.
> =====
>
> Otherwise, nothing in the patch jumped out at me other than to double
> check CPU-up events actually result in process_zones() being called and
> that boot_pageset is not being accidentally used in the long term.

This is already explained in a commment where boot_pageset is defined.
Should we add some more elaborate comments to zone_pcp_init()?

Christoph Lameter

unread,

Oct 2, 2009, 1:50:06 PM10/2/09

to

On Fri, 2 Oct 2009, Mel Gorman wrote:

> On Thu, Oct 01, 2009 at 05:25:34PM -0400, c...@linux-foundation.org wrote:
> > Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
> >
> > This drastically reduces the size of struct zone for systems with large
> > amounts of processors and allows placement of critical variables of struct
> > zone in one cacheline even on very large systems.
> >
>
> This seems reasonably accurate. The largest shrink is on !NUMA configured
> systems but the NUMA case deletes a lot of pointers.

True, the !NUMA case will then avoid allocating pagesets for unused
zones. But the NUMA case will have the most benefit since the large arrays
in struct zone are gone. Removing the pagesets from struct zone also
increases the cacheability of struct zone information. This is
particularly useful since the size of the pagesets grew with the addition
of the various types of allocation queues.

> > Another effect is that the pagesets of one processor are placed near one
> > another. If multiple pagesets from different zones fit into one cacheline
> > then additional cacheline fetches can be avoided on the hot paths when
> > allocating memory from multiple zones.
> >
>
> Out of curiousity, how common an occurance is it that a CPU allocate from
> multiple zones? I would have thought it was rare but I never checked
> either.

zone allocations are determined by their use. GFP_KERNEL allocs come from
ZONE_NORMAL whereas typical application pages may come from ZONE_HIGHMEM.
The mix depends on what the kernel and the application are doing.

> > pcp = &pset->pcp;

> > + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > migratetype = get_pageblock_migratetype(page);
> > set_page_private(page, migratetype);
> > local_irq_save(flags);
> > @@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
> >
> > out:
> > local_irq_restore(flags);
> > - put_cpu();
>
> Previously we get_cpu() to be preemption safe. We then disable
> interrupts and potentially take a spinlock later.

Right. WE need to move the local_irq_save() up two lines. Why disable
preempt and two instructions later disable interrupts? Isnt this bloating
the code?

> this_cpu_ptr() looks up PCP
> disable interrupts
> enable interrupts

Move disable interrupts before the this_cpu_ptr?

> > +/*
> > + * Allocate per cpu pagesets and initialize them.
> > + * Before this call only boot pagesets were available.
> > + * Boot pagesets will no longer be used after this call is complete.
>
> If they are no longer used, do we get the memory back?

No we need to keep them for onlining new processors.

> > - * as it comes online
> > + for_each_possible_cpu(cpu) {
> > + struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
> > +
> > + setup_pageset(pcp, zone_batchsize(zone));
> > +
> > + if (percpu_pagelist_fraction)
> > + setup_pagelist_highmark(pcp,
> > + (zone->present_pages /
> > + percpu_pagelist_fraction));
> > + }
> > + }
>
> This would have been easier to review if you left process_zones() where it
> was and converted it to the new API. I'm assuming this is just shuffling
> code around.

Yes I think this was the result of reducing #ifdefs.

Christoph Lameter

unread,

Oct 2, 2009, 2:00:14 PM10/2/09

to

On Fri, 2 Oct 2009, Ingo Molnar wrote:

> > Right. There will be a time period in which other arches will need to
> > add support for this_cpu_xx first.
>
> Size comparison should be only on architectures that support it (i.e.
> x86 right now). The generic fallbacks might be bloaty, no argument about
> that. ( => the more reason for any architecture to add optimizations for
> this_cpu_*() APIs. )

The fallbacks basically generate the same code (at least for the core
code) that was there before. F.e.

Before:

#define SNMP_INC_STATS(mib, field) \
do { \
per_cpu_ptr(mib[!in_softirq()], get_cpu())->mibs[field]++; \
put_cpu(); \
} while (0)

After

#define SNMP_INC_STATS_USER(mib, field) \
this_cpu_inc(mib[1]->mibs[field])

For the x86 case this means that we can use a simple atomic increment
with a segment prefix to do all the work.

The fallback case for arches not providing per cpu atomics is:

preempt_disable();
*__this_cpu_ptr(&mib[1]->mibs[field]) += 1;
preempt_enable();

If the arch can optimize __this_cpu_ptr (and provides __my_cpu_offset)
because it has the per cpu offset of the local cpu in some priviledged
location then this is still going to be a win since we avoid
smp_processor_id() entirely and we also avoid the array lookup.

If the arch has no such mechanism then we fall back for this_cpu_ptr too:

#ifndef __my_cpu_offset
#define __my_cpu_offset per_cpu_offset(raw_smp_processor_id())
#endif

And then the result in terms of overhead is the same as before the
per_cpu_xx patches since get_cpu() does both a preempt_disable as well as
a smp_processor_id() call.

Tejun Heo

unread,

Oct 3, 2009, 6:30:09 AM10/3/09

to

c...@linux-foundation.org wrote:
> Explicitly initialize the pagesets after the per cpu areas have been
> initialized. This is necessary in order to be able to use per cpu
> operations in later patches.
>

> Cc: Mel Gorman <m...@csn.ul.ie>
> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>
>
> ---

Hmmm... why not call this function from start_kernel() after calling
setup_per_cpu_areas() instead of modifying every implementation of
setup_per_cpu_areas()?

Thanks.

--
tejun

Tejun Heo

unread,

Oct 3, 2009, 7:00:08 AM10/3/09

to

c...@linux-foundation.org wrote:
> RCU does not do dynamic allocations but it increments per cpu variables
> a lot. These instructions results in a move to a register and then back
> to memory. This patch will make it use the inc/dec instructions on x86
> that do not need a register.
>
> Acked-by: Tejun Heo <t...@kernel.org>
> Acked-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>
> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>

Patch titles slightly cleaned up and 0001-0011 are committed and
published to percpu#for-next.

Pekka Enberg

unread,

Oct 3, 2009, 3:40:04 PM10/3/09

to

Hi,

Ingo Molnar wrote:
> * c...@linux-foundation.org <c...@linux-foundation.org> wrote:
>
>> Basically the existing percpu ops can be used for this_cpu variants
>> that allow operations also on dynamically allocated percpu data.
>> However, we do not pass a reference to a percpu variable in. Instead a
>> dynamically or statically allocated percpu variable is provided.
>>
>> Preempt, the non preempt and the irqsafe operations generate the same
>> code. It will always be possible to have the requires per cpu
>> atomicness in a single RMW instruction with segment override on x86.
>>
>> 64 bit this_cpu operations are not supported on 32 bit.
>>
>> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>
>
> Acked-by: Ingo Molnar <mi...@elte.hu>

I haven't looked at the series in detail but AFAICT the SLUB patches
depend on the x86 ones. Any suggestions how to get all this into
linux-next? Should I make a topic branch in slab.git on top of -tip or
something?

Pekka

Ingo Molnar

unread,

Oct 4, 2009, 12:50:07 PM10/4/09

to

* Pekka Enberg <pen...@cs.helsinki.fi> wrote:

> Hi,
>
> Ingo Molnar wrote:
>> * c...@linux-foundation.org <c...@linux-foundation.org> wrote:
>>
>>> Basically the existing percpu ops can be used for this_cpu variants
>>> that allow operations also on dynamically allocated percpu data.
>>> However, we do not pass a reference to a percpu variable in. Instead
>>> a dynamically or statically allocated percpu variable is provided.
>>>
>>> Preempt, the non preempt and the irqsafe operations generate the same
>>> code. It will always be possible to have the requires per cpu
>>> atomicness in a single RMW instruction with segment override on x86.
>>>
>>> 64 bit this_cpu operations are not supported on 32 bit.
>>>
>>> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>
>>
>> Acked-by: Ingo Molnar <mi...@elte.hu>
>
> I haven't looked at the series in detail but AFAICT the SLUB patches
> depend on the x86 ones. Any suggestions how to get all this into
> linux-next? Should I make a topic branch in slab.git on top of -tip or
> something?

I'd suggest to keep these patches together in the right topical tree:
Tejun's percpu tree. Any problem with that approach?

Ingo

Pekka Enberg

unread,

Oct 4, 2009, 1:00:18 PM10/4/09

to

Hi Ingo,

Ingo Molnar wrote:
> * Pekka Enberg <pen...@cs.helsinki.fi> wrote:
>
>> Hi,
>>
>> Ingo Molnar wrote:
>>> * c...@linux-foundation.org <c...@linux-foundation.org> wrote:
>>>
>>>> Basically the existing percpu ops can be used for this_cpu variants
>>>> that allow operations also on dynamically allocated percpu data.
>>>> However, we do not pass a reference to a percpu variable in. Instead
>>>> a dynamically or statically allocated percpu variable is provided.
>>>>
>>>> Preempt, the non preempt and the irqsafe operations generate the same
>>>> code. It will always be possible to have the requires per cpu
>>>> atomicness in a single RMW instruction with segment override on x86.
>>>>
>>>> 64 bit this_cpu operations are not supported on 32 bit.
>>>>
>>>> Signed-off-by: Christoph Lameter <c...@linux-foundation.org>
>>> Acked-by: Ingo Molnar <mi...@elte.hu>
>> I haven't looked at the series in detail but AFAICT the SLUB patches
>> depend on the x86 ones. Any suggestions how to get all this into
>> linux-next? Should I make a topic branch in slab.git on top of -tip or
>> something?
>
> I'd suggest to keep these patches together in the right topical tree:
> Tejun's percpu tree. Any problem with that approach?

I'm fine with that. Just wanted to make sure who is taking the patches
and if I should pick any of them up. We can get some conflicts between
the per-cpu tree and slab.git if new SLUB patches get merged but that's
probably not a huge problem.

Pekka

Mel Gorman

unread,

Oct 5, 2009, 5:50:09 AM10/5/09

to

On Fri, Oct 02, 2009 at 01:30:58PM -0400, Christoph Lameter wrote:
> On Fri, 2 Oct 2009, Mel Gorman wrote:
>
> > On Thu, Oct 01, 2009 at 05:25:33PM -0400, c...@linux-foundation.org wrote:
> > > Explicitly initialize the pagesets after the per cpu areas have been
> > > initialized. This is necessary in order to be able to use per cpu
> > > operations in later patches.
> > >
> >
> > Can you be more explicit about this? I think the reasoning is as follows
> >
> > A later patch will use DEFINE_PER_CPU which allocates memory later in
> > the boot-cycle after zones have already been initialised. Without this
> > patch, use of DEFINE_PER_CPU would result in invalid memory accesses
> > during pageset initialisation.
>
> Nope. Pagesets are not statically allocated per cpu data. They are
> allocated with the per cpu allocator.
>

I don't think I said they were statically allocated.

> The per cpu allocator is not initialized that early in boot. We cannot
> allocate the pagesets then. Therefore we use a fake single item pageset
> (like used now for NUMA boot) to take its place until the slab and percpu
> allocators are up. Then we allocate the real pagesets.
>

Ok, that explanation matches my expectations. Thanks.

> > > -static __meminit void zone_pcp_init(struct zone *zone)
> > > +/*
> > > + * Early setup of pagesets.
> > > + *
> > > + * In the NUMA case the pageset setup simply results in all zones pcp
> > > + * pointer being directed at a per cpu pageset with zero batchsize.
> > > + *
> >
> > The batchsize becomes 1, not 0 if you look at setup_pageset() but that aside,
> > it's unclear from the comment *why* the batchsize is 1 in the NUMA case.
> > Maybe something like the following?
> >
> > =====
> > In the NUMA case, the boot_pageset is used until the slab allocator is
> > available to allocate per-zone pagesets as each CPU is brought up. At
> > this point, the batchsize is set to 1 to prevent pages "leaking" onto the
> > boot_pageset freelists.
> > =====
> >
> > Otherwise, nothing in the patch jumped out at me other than to double
> > check CPU-up events actually result in process_zones() being called and
> > that boot_pageset is not being accidentally used in the long term.
>
> This is already explained in a commment where boot_pageset is defined.
> Should we add some more elaborate comments to zone_pcp_init()?
>

I suppose not. Point them to the comment in boot_pageset so there is a
chance the comment stays up to date.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Mel Gorman

unread,

Oct 5, 2009, 5:50:11 AM10/5/09

to

On Fri, Oct 02, 2009 at 01:39:28PM -0400, Christoph Lameter wrote:
> On Fri, 2 Oct 2009, Mel Gorman wrote:
>
> > On Thu, Oct 01, 2009 at 05:25:34PM -0400, c...@linux-foundation.org wrote:
> > > Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
> > >
> > > This drastically reduces the size of struct zone for systems with large
> > > amounts of processors and allows placement of critical variables of struct
> > > zone in one cacheline even on very large systems.
> > >
> >
> > This seems reasonably accurate. The largest shrink is on !NUMA configured
> > systems but the NUMA case deletes a lot of pointers.
>
> True, the !NUMA case will then avoid allocating pagesets for unused
> zones. But the NUMA case will have the most benefit since the large arrays
> in struct zone are gone.

Indeed. Out of curiousity, has this patchset been performance tested? I
would expect there to be a small but measurable improvement. If there is
a regression, it might point to poor placement of read/write fields in
the zone.

> Removing the pagesets from struct zone also
> increases the cacheability of struct zone information. This is
> particularly useful since the size of the pagesets grew with the addition
> of the various types of allocation queues.
>
> > > Another effect is that the pagesets of one processor are placed near one
> > > another. If multiple pagesets from different zones fit into one cacheline
> > > then additional cacheline fetches can be avoided on the hot paths when
> > > allocating memory from multiple zones.
> > >
> >
> > Out of curiousity, how common an occurance is it that a CPU allocate from
> > multiple zones? I would have thought it was rare but I never checked
> > either.
>
> zone allocations are determined by their use. GFP_KERNEL allocs come from
> ZONE_NORMAL whereas typical application pages may come from ZONE_HIGHMEM.
> The mix depends on what the kernel and the application are doing.
>

I just wouldn't have expected a significant enough mix to make a measurable
performance difference. It's no biggie.

> > > pcp = &pset->pcp;
> > > + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > > migratetype = get_pageblock_migratetype(page);
> > > set_page_private(page, migratetype);
> > > local_irq_save(flags);
> > > @@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
> > >
> > > out:
> > > local_irq_restore(flags);
> > > - put_cpu();
> >
> > Previously we get_cpu() to be preemption safe. We then disable
> > interrupts and potentially take a spinlock later.
>
> Right. WE need to move the local_irq_save() up two lines.

Just so I'm 100% clear, IRQ disabling is considered a preemption point?

> Why disable
> preempt and two instructions later disable interrupts? Isnt this bloating
> the code?
>

By and large, IRQs are disabled at the last possible moment with the minimum
amount of code in between. While the current location does not make perfect
sense, it was probably many small changes that placed it like this each
person avoiding IRQ-disabling for too long without considering what the cost
of get_cpu() was.

Similar care needs to be taken with the other removals of get_cpu() in
this patch to ensure it's still preemption-safe.

> > this_cpu_ptr() looks up PCP
> > disable interrupts
> > enable interrupts
>
> Move disable interrupts before the this_cpu_ptr?
>

In this case, why not move this_cpu_ptr() down until its first use just
before the if (cold) check? It'll still be within the IRQ disabling but
without significantly increasing the amount of time the IRQ is disabled.

> > > +/*
> > > + * Allocate per cpu pagesets and initialize them.
> > > + * Before this call only boot pagesets were available.
> > > + * Boot pagesets will no longer be used after this call is complete.
> >
> > If they are no longer used, do we get the memory back?
>
> No we need to keep them for onlining new processors.
>

That comment would appear to disagree.

> > > - * as it comes online
> > > + for_each_possible_cpu(cpu) {
> > > + struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
> > > +
> > > + setup_pageset(pcp, zone_batchsize(zone));
> > > +
> > > + if (percpu_pagelist_fraction)
> > > + setup_pagelist_highmark(pcp,
> > > + (zone->present_pages /
> > > + percpu_pagelist_fraction));
> > > + }
> > > + }
> >
> > This would have been easier to review if you left process_zones() where it
> > was and converted it to the new API. I'm assuming this is just shuffling
> > code around.
>
> Yes I think this was the result of reducing #ifdefs.
>

Ok.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Christoph Lameter

unread,

Oct 5, 2009, 11:00:10 AM10/5/09

to

On Sat, 3 Oct 2009, Tejun Heo wrote:

> > arch/ia64/kernel/setup.c | 1 +
> > arch/powerpc/kernel/setup_64.c | 1 +
> > arch/sparc/kernel/smp_64.c | 1 +
> > arch/x86/kernel/setup_percpu.c | 2 ++
> > include/linux/mm.h | 1 +
> > mm/page_alloc.c | 40 +++++++++++++++++++++++++++++-----------
> > mm/percpu.c | 2 ++
>
> Hmmm... why not call this function from start_kernel() after calling
> setup_per_cpu_areas() instead of modifying every implementation of
> setup_per_cpu_areas()?

Because it has to be called immediately after per cpu areas become
available. Otherwise page allocator uses will fail.

Christoph Lameter

unread,

Oct 5, 2009, 11:00:08 AM10/5/09

to

On Mon, 5 Oct 2009, Mel Gorman wrote:

> > Right. WE need to move the local_irq_save() up two lines.
>
> Just so I'm 100% clear, IRQ disabling is considered a preemption point?

Yes.

> > Move disable interrupts before the this_cpu_ptr?
> >
>
> In this case, why not move this_cpu_ptr() down until its first use just
> before the if (cold) check? It'll still be within the IRQ disabling but
> without significantly increasing the amount of time the IRQ is disabled.

Good idea. Ill put that into the next release.

> > > > + * Before this call only boot pagesets were available.
> > > > + * Boot pagesets will no longer be used after this call is complete.
> > >
> > > If they are no longer used, do we get the memory back?
> >
> > No we need to keep them for onlining new processors.
> >
>
> That comment would appear to disagree.

The comment is accurate for a processor. Once the pagesets are allocated
for a processor then the boot pageset is no longer used.

Tejun Heo

unread,

Oct 5, 2009, 11:10:07 AM10/5/09

to

Christoph Lameter wrote:
> On Sat, 3 Oct 2009, Tejun Heo wrote:
>
>>> arch/ia64/kernel/setup.c | 1 +
>>> arch/powerpc/kernel/setup_64.c | 1 +
>>> arch/sparc/kernel/smp_64.c | 1 +
>>> arch/x86/kernel/setup_percpu.c | 2 ++
>>> include/linux/mm.h | 1 +
>>> mm/page_alloc.c | 40 +++++++++++++++++++++++++++++-----------
>>> mm/percpu.c | 2 ++
>> Hmmm... why not call this function from start_kernel() after calling
>> setup_per_cpu_areas() instead of modifying every implementation of
>> setup_per_cpu_areas()?
>
> Because it has to be called immediately after per cpu areas become
> available. Otherwise page allocator uses will fail.

But... setup_per_cpu_area() isn't supposed to call page allocator and
start_kernel() is the only caller of setup_per_cpu_areas().

--
tejun

Christoph Lameter

unread,

Oct 5, 2009, 11:10:07 AM10/5/09

to

Changes to this patch so far:

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2009-10-05 09:49:07.000000000 -0500
+++ linux-2.6/mm/page_alloc.c 2009-10-05 09:48:43.000000000 -0500
@@ -1098,8 +1098,6 @@ static void free_hot_cold_page(struct pa

arch_free_page(page, 0);
kernel_map_pages(page, 1, 0);

- local_irq_save(flags);
- pcp = &this_cpu_ptr(zone->pageset)->pcp;

migratetype = get_pageblock_migratetype(page);
set_page_private(page, migratetype);

if (unlikely(wasMlocked))
@@ -1121,6 +1119,8 @@ static void free_hot_cold_page(struct pa
migratetype = MIGRATE_MOVABLE;
}

+ local_irq_save(flags);

+ pcp = &this_cpu_ptr(zone->pageset)->pcp;

if (cold)
list_add_tail(&page->lru, &pcp->lists[migratetype]);
else
@@ -3120,7 +3120,8 @@ static struct notifier_block __cpuinitda
/*

* Allocate per cpu pagesets and initialize them.

* Before this call only boot pagesets were available.

- * Boot pagesets will no longer be used after this call is complete.
+ * Boot pagesets will no longer be used by this processorr
+ * after setup_per_cpu_pageset().
*/
void __init setup_per_cpu_pageset(void)
{
@@ -3232,11 +3233,11 @@ __meminit void setup_pagesets(void)
zone->pageset = &per_cpu_var(boot_pageset);

/*
- * Special pagesets with zero elements so that frees
+ * Special pagesets with one element so that frees

* and allocations are not buffered at all.

*/
for_each_possible_cpu(cpu)
- setup_pageset(per_cpu_ptr(zone->pageset, cpu), 0);
+ setup_pageset(per_cpu_ptr(zone->pageset, cpu), 1);

Christoph Lameter

unread,

Oct 5, 2009, 11:20:09 AM10/5/09

to

On Tue, 6 Oct 2009, Tejun Heo wrote:

> > Because it has to be called immediately after per cpu areas become
> > available. Otherwise page allocator uses will fail.
>
> But... setup_per_cpu_area() isn't supposed to call page allocator and
> start_kernel() is the only caller of setup_per_cpu_areas().

setup_per_cpu_areas() is not calling the page allocator. However, any
caller after that can call the page allocator.

There are various arch implementations that do their own implementation of
setup_per_cpu_areas() at their own time (Check sparc and ia64 for
example).

Tejun Heo

unread,

Oct 5, 2009, 11:30:10 AM10/5/09

to

Christoph Lameter wrote:
> On Tue, 6 Oct 2009, Tejun Heo wrote:
>
>>> Because it has to be called immediately after per cpu areas become
>>> available. Otherwise page allocator uses will fail.
>> But... setup_per_cpu_area() isn't supposed to call page allocator and
>> start_kernel() is the only caller of setup_per_cpu_areas().
>
> setup_per_cpu_areas() is not calling the page allocator. However, any
> caller after that can call the page allocator.
>
> There are various arch implementations that do their own implementation of
> setup_per_cpu_areas() at their own time (Check sparc and ia64 for
> example).

sparc is doing everything on setup_per_cpu_areas(). ia64 is an
exception. Hmmm... I really don't like scattering mostly unrelated
init call to every setup_per_cpu_areas() implementation. How about
adding "static bool initialized __initdata" to the function and allow
it to be called earlier when necessary (only ia64 at the moment)?

Thanks.

--
tejun

Christoph Lameter

unread,

Oct 5, 2009, 11:50:15 AM10/5/09

to

On Tue, 6 Oct 2009, Tejun Heo wrote:

> I'm under the impression that ia64 needs its percpu areas setup
> earlier during the boot so I'm not sure what you have in mind but as
> long as the call isn't scattered over different setup_per_cpu_areas()
> implementations, I'm okay.

Currently it does. But there must be some way to untangle that mess.

Tejun Heo

unread,

Oct 5, 2009, 11:50:16 AM10/5/09

to

Christoph Lameter wrote:
> On Tue, 6 Oct 2009, Tejun Heo wrote:
>
>>> There are various arch implementations that do their own implementation of
>>> setup_per_cpu_areas() at their own time (Check sparc and ia64 for
>>> example).
>> sparc is doing everything on setup_per_cpu_areas(). ia64 is an
>> exception. Hmmm... I really don't like scattering mostly unrelated
>> init call to every setup_per_cpu_areas() implementation. How about
>> adding "static bool initialized __initdata" to the function and allow
>> it to be called earlier when necessary (only ia64 at the moment)?
>

> It would be best to consolidate all setup_per_cpu_areas() to work at the
> same time during bootup. How about having a single setup_per_cpu_areas()
> function and then within the function allow arch specific processing. Try
> to share as much code as possible?

I'm under the impression that ia64 needs its percpu areas setup
earlier during the boot so I'm not sure what you have in mind but as
long as the call isn't scattered over different setup_per_cpu_areas()
implementations, I'm okay.

Thanks.

Christoph Lameter

unread,

Oct 5, 2009, 11:50:14 AM10/5/09

to

On Tue, 6 Oct 2009, Tejun Heo wrote:

> > There are various arch implementations that do their own implementation of
> > setup_per_cpu_areas() at their own time (Check sparc and ia64 for
> > example).
>
> sparc is doing everything on setup_per_cpu_areas(). ia64 is an
> exception. Hmmm... I really don't like scattering mostly unrelated
> init call to every setup_per_cpu_areas() implementation. How about
> adding "static bool initialized __initdata" to the function and allow
> it to be called earlier when necessary (only ia64 at the moment)?

It would be best to consolidate all setup_per_cpu_areas() to work at the

same time during bootup. How about having a single setup_per_cpu_areas()
function and then within the function allow arch specific processing. Try
to share as much code as possible?

--

Mel Gorman

unread,

Oct 6, 2009, 6:00:17 AM10/6/09

to

On Mon, Oct 05, 2009 at 10:55:49AM -0400, Christoph Lameter wrote:
>
> Changes to this patch so far:
>
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c 2009-10-05 09:49:07.000000000 -0500
> +++ linux-2.6/mm/page_alloc.c 2009-10-05 09:48:43.000000000 -0500
> @@ -1098,8 +1098,6 @@ static void free_hot_cold_page(struct pa
> arch_free_page(page, 0);
> kernel_map_pages(page, 1, 0);
>
> - local_irq_save(flags);
> - pcp = &this_cpu_ptr(zone->pageset)->pcp;
> migratetype = get_pageblock_migratetype(page);
> set_page_private(page, migratetype);
> if (unlikely(wasMlocked))

Why did you move local_irq_save() ? It should have stayed where it was
because VM counters are updated under the lock. Only the this_cpu_ptr
should be moving.

--

Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Christoph Lameter

unread,

Oct 6, 2009, 12:50:09 PM10/6/09

to

On Tue, 6 Oct 2009, Mel Gorman wrote:

> > - local_irq_save(flags);
> > - pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > migratetype = get_pageblock_migratetype(page);
> > set_page_private(page, migratetype);
> > if (unlikely(wasMlocked))
>
> Why did you move local_irq_save() ? It should have stayed where it was
> because VM counters are updated under the lock. Only the this_cpu_ptr
> should be moving.

The __count_vm_event()? VM counters may be incremented in a racy way if
convenient. x86 usually produces non racy code (and with this patchset
will always produce non racy code) but f.e. IA64 has always had racy
updates. I'd rather shorted the irq off section.

See the comment in vmstat.h.

Mel Gorman

unread,

Oct 6, 2009, 1:10:09 PM10/6/09

to

On Tue, Oct 06, 2009 at 12:34:56PM -0400, Christoph Lameter wrote:
> On Tue, 6 Oct 2009, Mel Gorman wrote:
>
> > > - local_irq_save(flags);
> > > - pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > > migratetype = get_pageblock_migratetype(page);
> > > set_page_private(page, migratetype);
> > > if (unlikely(wasMlocked))
> >
> > Why did you move local_irq_save() ? It should have stayed where it was
> > because VM counters are updated under the lock. Only the this_cpu_ptr
> > should be moving.
>
> The __count_vm_event()?

and the __dec_zone_page_state within free_page_mlock(). However, it's already
atomic so it shouldn't be a problem.

> VM counters may be incremented in a racy way if
> convenient. x86 usually produces non racy code (and with this patchset
> will always produce non racy code) but f.e. IA64 has always had racy
> updates. I'd rather shorted the irq off section.
>

The count_vm_event is now racier than it was and no longer symmetric with
the PGALLOC counting which still happens with IRQs disabled. The assymetry
could look very strange if there are a lot more frees than allocs for example
because the raciness between the counters is difference.

While I have no problem as such with the local_irq_save() moving (although
I would like PGFREE and PGALLOC to be accounted both with or without IRQs
enabled), I think it deserves to be in a patch all to itself and not hidden
in an apparently unrelated change.

> See the comment in vmstat.h.
>

--

Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Christoph Lameter

unread,

Oct 6, 2009, 2:10:14 PM10/6/09

to

On Tue, 6 Oct 2009, Mel Gorman wrote:

> While I have no problem as such with the local_irq_save() moving (although
> I would like PGFREE and PGALLOC to be accounted both with or without IRQs
> enabled), I think it deserves to be in a patch all to itself and not hidden
> in an apparently unrelated change.

Ok I have moved the local_irq_save back.

Full patch (will push it back into my git tree if you approve)

From: Christoph Lameter <c...@linux-foundation.org>
Subject: this_cpu_ops: page allocator conversion

Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

This drastically reduces the size of struct zone for systems with large
amounts of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one

another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.

Cc: Mel Gorman <m...@csn.ul.ie>
Signed-off-by: Christoph Lameter <c...@linux-foundation.org>

---
include/linux/mm.h | 4 -
include/linux/mmzone.h | 12 ---
mm/page_alloc.c | 157 ++++++++++++++-----------------------------------
mm/vmstat.c | 14 ++--
4 files changed, 56 insertions(+), 131 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2009-10-06 12:41:19.000000000 -0500
+++ linux-2.6/include/linux/mm.h 2009-10-06 12:41:19.000000000 -0500
@@ -1062,11 +1062,7 @@ extern void si_meminfo_node(struct sysin
extern int after_bootmem;
extern void setup_pagesets(void);

-#ifdef CONFIG_NUMA
extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif

extern void zone_pcp_update(struct zone *zone);

Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h 2009-10-05 15:33:08.000000000 -0500
+++ linux-2.6/include/linux/mmzone.h 2009-10-06 12:41:19.000000000 -0500
@@ -184,13 +184,7 @@ struct per_cpu_pageset {
s8 stat_threshold;
s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
#endif
-} ____cacheline_aligned_in_smp;
-
-#ifdef CONFIG_NUMA
-#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
-#else
-#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
-#endif
+};

#endif /* !__GENERATING_BOUNDS.H */

@@ -306,10 +300,8 @@ struct zone {
*/
unsigned long min_unmapped_pages;
unsigned long min_slab_pages;
- struct per_cpu_pageset *pageset[NR_CPUS];
-#else
- struct per_cpu_pageset pageset[NR_CPUS];
#endif
+ struct per_cpu_pageset *pageset;
/*
* free areas of different sizes
*/
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2009-10-06 12:41:19.000000000 -0500
+++ linux-2.6/mm/page_alloc.c 2009-10-06 12:43:27.000000000 -0500
@@ -1011,7 +1011,7 @@ static void drain_pages(unsigned int cpu
struct per_cpu_pageset *pset;
struct per_cpu_pages *pcp;

- pset = zone_pcp(zone, cpu);
+ pset = per_cpu_ptr(zone->pageset, cpu);

pcp = &pset->pcp;
local_irq_save(flags);
@@ -1098,7 +1098,6 @@ static void free_hot_cold_page(struct pa

arch_free_page(page, 0);
kernel_map_pages(page, 1, 0);

- pcp = &zone_pcp(zone, get_cpu())->pcp;

migratetype = get_pageblock_migratetype(page);
set_page_private(page, migratetype);

local_irq_save(flags);
@@ -1121,6 +1120,7 @@ static void free_hot_cold_page(struct pa
migratetype = MIGRATE_MOVABLE;
}

+ pcp = &this_cpu_ptr(zone->pageset)->pcp;

if (cold)
list_add_tail(&page->lru, &pcp->lists[migratetype]);
else

@@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa

out:
local_irq_restore(flags);
- put_cpu();
}

void free_hot_page(struct page *page)
@@ -1183,15 +1182,13 @@ struct page *buffered_rmqueue(struct zon
unsigned long flags;
struct page *page;
int cold = !!(gfp_flags & __GFP_COLD);
- int cpu;

again:
- cpu = get_cpu();
if (likely(order == 0)) {
struct per_cpu_pages *pcp;
struct list_head *list;

- pcp = &zone_pcp(zone, cpu)->pcp;
+ pcp = &this_cpu_ptr(zone->pageset)->pcp;
list = &pcp->lists[migratetype];
local_irq_save(flags);
if (list_empty(list)) {
@@ -1234,7 +1231,6 @@ again:
__count_zone_vm_events(PGALLOC, zone, 1 << order);
zone_statistics(preferred_zone, zone);
local_irq_restore(flags);
- put_cpu();

VM_BUG_ON(bad_range(zone, page));
if (prep_new_page(page, order, gfp_flags))
@@ -1243,7 +1239,6 @@ again:

failed:
local_irq_restore(flags);
- put_cpu();
return NULL;
}

@@ -2172,7 +2167,7 @@ void show_free_areas(void)
for_each_online_cpu(cpu) {
struct per_cpu_pageset *pageset;

- pageset = zone_pcp(zone, cpu);
+ pageset = per_cpu_ptr(zone->pageset, cpu);

printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
cpu, pageset->pcp.high,
@@ -3087,7 +3082,6 @@ static void setup_pagelist_highmark(stru
}

-#ifdef CONFIG_NUMA
/*
* Boot pageset table. One per cpu which is going to be used for all
* zones and all nodes. The parameters will be set in such a way
@@ -3095,112 +3089,68 @@ static void setup_pagelist_highmark(stru
* the buddy list. This is safe since pageset manipulation is done
* with interrupts disabled.
*
- * Some NUMA counter updates may also be caught by the boot pagesets.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
+ * Some counter updates may also be caught by the boot pagesets.
*
* zoneinfo_show() and maybe other functions do
* not check if the processor is online before following the pageset pointer.
* Other parts of the kernel may not check if the zone is available.
*/
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
-
-/*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
- */
-static int __cpuinit process_zones(int cpu)
-{
- struct zone *zone, *dzone;
- int node = cpu_to_node(cpu);
-
- node_set_state(node, N_CPU); /* this node has a cpu */
-
- for_each_populated_zone(zone) {
- zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
- GFP_KERNEL, node);
- if (!zone_pcp(zone, cpu))
- goto bad;
-
- setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
-
- if (percpu_pagelist_fraction)
- setup_pagelist_highmark(zone_pcp(zone, cpu),
- (zone->present_pages / percpu_pagelist_fraction));
- }
-
- return 0;
-bad:
- for_each_zone(dzone) {
- if (!populated_zone(dzone))
- continue;
- if (dzone == zone)
- break;
- kfree(zone_pcp(dzone, cpu));
- zone_pcp(dzone, cpu) = &boot_pageset[cpu];
- }
- return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
- struct zone *zone;
-
- for_each_zone(zone) {
- struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
-
- /* Free per_cpu_pageset if it is slab allocated */
- if (pset != &boot_pageset[cpu])
- kfree(pset);
- zone_pcp(zone, cpu) = &boot_pageset[cpu];
- }
-}
+static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);

static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
unsigned long action,
void *hcpu)
{
int cpu = (long)hcpu;
- int ret = NOTIFY_OK;

switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
- if (process_zones(cpu))
- ret = NOTIFY_BAD;
- break;
- case CPU_UP_CANCELED:
- case CPU_UP_CANCELED_FROZEN:
- case CPU_DEAD:
- case CPU_DEAD_FROZEN:
- free_zone_pagesets(cpu);
+ node_set_state(cpu_to_node(cpu), N_CPU);
break;
default:
break;
}
- return ret;
+ return NOTIFY_OK;
}

static struct notifier_block __cpuinitdata pageset_notifier =
{ &pageset_cpuup_callback, NULL, 0 };

+/*
+ * Allocate per cpu pagesets and initialize them.
+ * Before this call only boot pagesets were available.

+ * Boot pagesets will no longer be used by this processorr
+ * after setup_per_cpu_pageset().

+ */
void __init setup_per_cpu_pageset(void)
{
- int err;
+ struct zone *zone;
+ int cpu;
+
+ for_each_populated_zone(zone) {
+ zone->pageset = alloc_percpu(struct per_cpu_pageset);

- /* Initialize per_cpu_pageset for cpu 0.
- * A cpuup callback will do this for every cpu

- * as it comes online
+ for_each_possible_cpu(cpu) {
+ struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
+
+ setup_pageset(pcp, zone_batchsize(zone));
+
+ if (percpu_pagelist_fraction)
+ setup_pagelist_highmark(pcp,
+ (zone->present_pages /
+ percpu_pagelist_fraction));
+ }
+ }

+
+ /*
+ * The boot cpu is always the first active.
+ * The boot node has a processor
*/
- err = process_zones(smp_processor_id());
- BUG_ON(err);
+ node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
register_cpu_notifier(&pageset_notifier);
}

-#endif
-
static noinline __init_refok
int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
{
@@ -3254,7 +3204,7 @@ static int __zone_pcp_update(void *data)
struct per_cpu_pageset *pset;
struct per_cpu_pages *pcp;

- pset = zone_pcp(zone, cpu);
+ pset = per_cpu_ptr(zone->pageset, cpu);
pcp = &pset->pcp;

local_irq_save(flags);
@@ -3272,15 +3222,7 @@ void zone_pcp_update(struct zone *zone)

/*

* Early setup of pagesets.

- *
- * In the NUMA case the pageset setup simply results in all zones pcp
- * pointer being directed at a per cpu pageset with zero batchsize.
- *
- * This means that every free and every allocation occurs directly from
- * the buddy allocator tables.
- *
- * The pageset never queues pages during early boot and is therefore usable
- * for every type of zone.
+ * At this point various allocators are not operational yet.
*/
__meminit void setup_pagesets(void)
{
@@ -3288,23 +3230,15 @@ __meminit void setup_pagesets(void)
struct zone *zone;

for_each_zone(zone) {
-#ifdef CONFIG_NUMA
- unsigned long batch = 0;
-
- for (cpu = 0; cpu < NR_CPUS; cpu++) {
- /* Early boot. Slab allocator not functional yet */
- zone_pcp(zone, cpu) = &boot_pageset[cpu];
- }
-#else
- unsigned long batch = zone_batchsize(zone);
-#endif
+ zone->pageset = &per_cpu_var(boot_pageset);

+ /*

+ * Special pagesets with one element so that frees

+ * and allocations are not buffered at all.
+ */
for_each_possible_cpu(cpu)
- setup_pageset(zone_pcp(zone, cpu), batch);

+ setup_pageset(per_cpu_ptr(zone->pageset, cpu), 1);

- if (zone->present_pages)
- printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
- zone->name, zone->present_pages, batch);
}
}

@@ -4818,10 +4752,11 @@ int percpu_pagelist_fraction_sysctl_hand
if (!write || (ret == -EINVAL))
return ret;
for_each_populated_zone(zone) {
- for_each_online_cpu(cpu) {
+ for_each_possible_cpu(cpu) {
unsigned long high;
high = zone->present_pages / percpu_pagelist_fraction;
- setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+ setup_pagelist_highmark(
+ per_cpu_ptr(zone->pageset, cpu), high);
}
}
return 0;
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c 2009-10-05 15:33:08.000000000 -0500
+++ linux-2.6/mm/vmstat.c 2009-10-06 12:43:22.000000000 -0500
@@ -139,7 +139,8 @@ static void refresh_zone_stat_thresholds
threshold = calculate_threshold(zone);

for_each_online_cpu(cpu)
- zone_pcp(zone, cpu)->stat_threshold = threshold;
+ per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+ = threshold;
}
}

@@ -149,7 +150,8 @@ static void refresh_zone_stat_thresholds
void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
int delta)
{
- struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+ struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
+
s8 *p = pcp->vm_stat_diff + item;
long x;

@@ -202,7 +204,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
*/
void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
- struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+ struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
s8 *p = pcp->vm_stat_diff + item;

(*p)++;
@@ -223,7 +225,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);

void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
{
- struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+ struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
s8 *p = pcp->vm_stat_diff + item;

(*p)--;
@@ -300,7 +302,7 @@ void refresh_cpu_vm_stats(int cpu)
for_each_populated_zone(zone) {
struct per_cpu_pageset *p;

- p = zone_pcp(zone, cpu);
+ p = per_cpu_ptr(zone->pageset, cpu);

for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
if (p->vm_stat_diff[i]) {
@@ -738,7 +740,7 @@ static void zoneinfo_show_print(struct s
for_each_online_cpu(i) {
struct per_cpu_pageset *pageset;

- pageset = zone_pcp(zone, i);
+ pageset = per_cpu_ptr(zone->pageset, i);
seq_printf(m,
"\n cpu: %i"
"\n count: %i"

Mel Gorman

unread,

Oct 6, 2009, 2:50:06 PM10/6/09

to

On Tue, Oct 06, 2009 at 01:51:38PM -0400, Christoph Lameter wrote:
> On Tue, 6 Oct 2009, Mel Gorman wrote:
>
> > While I have no problem as such with the local_irq_save() moving (although
> > I would like PGFREE and PGALLOC to be accounted both with or without IRQs
> > enabled), I think it deserves to be in a patch all to itself and not hidden
> > in an apparently unrelated change.
>
> Ok I have moved the local_irq_save back.
>

Thanks.

> Full patch (will push it back into my git tree if you approve)
>

Few more comments I'm afraid :(

It's not your fault and it doesn't actually matter to the current callers
of drain_pages, but you might as well move the per_cpu_ptr inside the
local_irq_save() here as well while you're changing here.

I believe this falls foul of the same problem as in the free path. We
are no longer preempt safe and this_cpu_ptr() needs to move within the
local_irq_save().

I didn't spot anything out of the ordinary after this but I haven't tested
the series.

--

Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Christoph Lameter

unread,

Oct 6, 2009, 3:20:13 PM10/6/09

to

On Tue, 6 Oct 2009, Mel Gorman wrote:

> > --- linux-2.6.orig/mm/page_alloc.c 2009-10-06 12:41:19.000000000 -0500
> > +++ linux-2.6/mm/page_alloc.c 2009-10-06 12:43:27.000000000 -0500
> > @@ -1011,7 +1011,7 @@ static void drain_pages(unsigned int cpu
> > struct per_cpu_pageset *pset;
> > struct per_cpu_pages *pcp;
> >
> > - pset = zone_pcp(zone, cpu);
> > + pset = per_cpu_ptr(zone->pageset, cpu);
> >
> > pcp = &pset->pcp;
> > local_irq_save(flags);
>
> It's not your fault and it doesn't actually matter to the current callers
> of drain_pages, but you might as well move the per_cpu_ptr inside the
> local_irq_save() here as well while you're changing here.

The comments before drain_pages() clearly state that the caller must be
pinned to a processor. But lets change it for consistencies sake.

> > - cpu = get_cpu();
> > if (likely(order == 0)) {
> > struct per_cpu_pages *pcp;
> > struct list_head *list;
> >
> > - pcp = &zone_pcp(zone, cpu)->pcp;
> > + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > list = &pcp->lists[migratetype];
> > local_irq_save(flags);
>
> I believe this falls foul of the same problem as in the free path. We
> are no longer preempt safe and this_cpu_ptr() needs to move within the
> local_irq_save().

Ok.

From: Christoph Lameter <c...@linux-foundation.org>
Subject: this_cpu_ops: page allocator conversion

Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

This drastically reduces the size of struct zone for systems with large
amounts of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.

Cc: Mel Gorman <m...@csn.ul.ie>
Signed-off-by: Christoph Lameter <c...@linux-foundation.org>

---
include/linux/mm.h | 4 -
include/linux/mmzone.h | 12 ---

mm/page_alloc.c | 161 ++++++++++++++-----------------------------------
mm/vmstat.c | 14 ++--
4 files changed, 58 insertions(+), 133 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2009-10-06 13:54:25.000000000 -0500
+++ linux-2.6/include/linux/mm.h 2009-10-06 13:54:25.000000000 -0500

@@ -1062,11 +1062,7 @@ extern void si_meminfo_node(struct sysin
extern int after_bootmem;
extern void setup_pagesets(void);

-#ifdef CONFIG_NUMA
extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif

extern void zone_pcp_update(struct zone *zone);

Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h 2009-10-06 12:48:46.000000000 -0500
+++ linux-2.6/include/linux/mmzone.h 2009-10-06 13:54:25.000000000 -0500

#endif /* !__GENERATING_BOUNDS.H */

--- linux-2.6.orig/mm/page_alloc.c 2009-10-06 13:54:25.000000000 -0500
+++ linux-2.6/mm/page_alloc.c 2009-10-06 13:59:27.000000000 -0500
@@ -1011,10 +1011,10 @@ static void drain_pages(unsigned int cpu

struct per_cpu_pageset *pset;
struct per_cpu_pages *pcp;

- pset = zone_pcp(zone, cpu);

+ local_irq_save(flags);

+ pset = per_cpu_ptr(zone->pageset, cpu);

pcp = &pset->pcp;
- local_irq_save(flags);
free_pcppages_bulk(zone, pcp->count, pcp);
pcp->count = 0;
local_irq_restore(flags);

@@ -1098,7 +1098,6 @@ static void free_hot_cold_page(struct pa
arch_free_page(page, 0);
kernel_map_pages(page, 1, 0);

- pcp = &zone_pcp(zone, get_cpu())->pcp;
migratetype = get_pageblock_migratetype(page);
set_page_private(page, migratetype);
local_irq_save(flags);
@@ -1121,6 +1120,7 @@ static void free_hot_cold_page(struct pa
migratetype = MIGRATE_MOVABLE;
}

+ pcp = &this_cpu_ptr(zone->pageset)->pcp;
if (cold)
list_add_tail(&page->lru, &pcp->lists[migratetype]);
else
@@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa

out:
local_irq_restore(flags);
- put_cpu();
}

void free_hot_page(struct page *page)
@@ -1183,17 +1182,15 @@ struct page *buffered_rmqueue(struct zon

unsigned long flags;
struct page *page;
int cold = !!(gfp_flags & __GFP_COLD);
- int cpu;

again:
- cpu = get_cpu();
if (likely(order == 0)) {
struct per_cpu_pages *pcp;
struct list_head *list;

- pcp = &zone_pcp(zone, cpu)->pcp;

- list = &pcp->lists[migratetype];
local_irq_save(flags);

+ pcp = &this_cpu_ptr(zone->pageset)->pcp;

+ list = &pcp->lists[migratetype];
if (list_empty(list)) {
pcp->count += rmqueue_bulk(zone, 0,
pcp->batch, list,

- /* Initialize per_cpu_pageset for cpu 0.

- * A cpuup callback will do this for every cpu
- * as it comes online
+ for_each_populated_zone(zone) {
+ zone->pageset = alloc_percpu(struct per_cpu_pageset);

+

--- linux-2.6.orig/mm/vmstat.c 2009-10-06 12:48:46.000000000 -0500
+++ linux-2.6/mm/vmstat.c 2009-10-06 13:59:23.000000000 -0500

Rusty Russell

unread,

Oct 6, 2009, 7:40:05 PM10/6/09

to

On Sat, 3 Oct 2009 02:41:54 am Christoph Lameter wrote:
> On Fri, 2 Oct 2009, Ingo Molnar wrote:
>
> > > @@ -66,6 +69,8 @@ extern void setup_per_cpu_areas(void);
> > > #define per_cpu(var, cpu) (*((void)(cpu), &per_cpu_var(var)))
> > > #define __get_cpu_var(var) per_cpu_var(var)
> > > #define __raw_get_cpu_var(var) per_cpu_var(var)
> > > +#define this_cpu_ptr(ptr) per_cpu_ptr(ptr, 0)
> > > +#define __this_cpu_ptr(ptr) this_cpu_ptr(ptr)

I still think that it's not symmetrical: get_cpu_var implies get_cpu_ptr; there's no
"this" in any Linux API until now.

OTOH, this_cpu_<op> makes much more sense than cpu_<op> or something, so I'm
not really going to complain.

Does this mean we can kill local.h soon?

Thanks for all this!
Rusty.

Tejun Heo

unread,

Oct 6, 2009, 8:10:06 PM10/6/09

to

Hello,

Rusty Russell wrote:
> I still think that it's not symmetrical: get_cpu_var implies
> get_cpu_ptr; there's no "this" in any Linux API until now.

Yeah, the naming of percpu related stuff is a big mess. :-( Given that
percpu variables aren't being used too widely at this time, cleaning
up the API is an option. What do you think? Any good proposal on
mind?

Thanks.

--
tejun

Christoph Lameter

unread,

Oct 6, 2009, 9:10:10 PM10/6/09

to

On Tue, 6 Oct 2009, Rusty Russell wrote:

> Does this mean we can kill local.h soon?

Yes if you let me ... Last time Andrew had some concerns so I hid my
patches that remove local.h ;-).

Mel Gorman

unread,

Oct 7, 2009, 6:50:07 AM10/7/09

to

On Tue, Oct 06, 2009 at 03:06:27PM -0400, Christoph Lameter wrote:
> On Tue, 6 Oct 2009, Mel Gorman wrote:
>
> > > --- linux-2.6.orig/mm/page_alloc.c 2009-10-06 12:41:19.000000000 -0500
> > > +++ linux-2.6/mm/page_alloc.c 2009-10-06 12:43:27.000000000 -0500
> > > @@ -1011,7 +1011,7 @@ static void drain_pages(unsigned int cpu
> > > struct per_cpu_pageset *pset;
> > > struct per_cpu_pages *pcp;
> > >
> > > - pset = zone_pcp(zone, cpu);
> > > + pset = per_cpu_ptr(zone->pageset, cpu);
> > >
> > > pcp = &pset->pcp;
> > > local_irq_save(flags);
> >
> > It's not your fault and it doesn't actually matter to the current callers
> > of drain_pages, but you might as well move the per_cpu_ptr inside the
> > local_irq_save() here as well while you're changing here.
>
> The comments before drain_pages() clearly state that the caller must be
> pinned to a processor. But lets change it for consistencies sake.
>

I noted the comment all right hence me saying that it doesn't matter to
the current callers because they obey the rules.

It was consistency I was looking at but I should have kept quiet because
there are a few oddities like this. It doesn't hurt to fix though.

I can't see anything else to complain about. Performance figures would
be nice but otherwise

Reviewed-by: Mel Gorman <m...@csn.ul.ie>

Thanks

--

Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Christoph Lameter

unread,

Oct 8, 2009, 2:20:06 PM10/8/09

to

On Tue, 6 Oct 2009, Rusty Russell wrote:

> Does this mean we can kill local.h soon?

Can we remove local.h from modules?

Subject: Module handling: Use this_cpu_xx to dynamically allocate counters

Use cpu ops to deal with the per cpu data instead of a local_t. Reduces memory
requirements, cache footprint and decreases cycle counts.

The this_cpu_xx operations are also used for !SMP mode. Otherwise we could
not drop the use of __module_ref_addr() which would make per cpu data handling
complicated. this_cpu_xx operations have their own fallback for !SMP.

Signed-off-by: Christoph Lameter <c...@linux-foundation.org>

---
include/linux/module.h | 36 ++++++++++++------------------------
kernel/module.c | 30 ++++++++++++++++--------------
kernel/trace/ring_buffer.c | 1 +
3 files changed, 29 insertions(+), 38 deletions(-)

Index: linux-2.6/include/linux/module.h
===================================================================
--- linux-2.6.orig/include/linux/module.h 2009-10-08 11:36:05.000000000 -0500
+++ linux-2.6/include/linux/module.h 2009-10-08 11:40:57.000000000 -0500
@@ -16,8 +16,7 @@
#include <linux/kobject.h>
#include <linux/moduleparam.h>
#include <linux/tracepoint.h>
-
-#include <asm/local.h>
+#include <linux/percpu.h>
#include <asm/module.h>

#include <trace/events/module.h>
@@ -361,11 +360,9 @@ struct module
/* Destruction function. */
void (*exit)(void);

-#ifdef CONFIG_SMP
- char *refptr;
-#else
- local_t ref;
-#endif
+ struct module_ref {
+ int count;
+ } *refptr;
#endif

#ifdef CONFIG_CONSTRUCTORS
@@ -452,25 +449,16 @@ void __symbol_put(const char *symbol);
#define symbol_put(x) __symbol_put(MODULE_SYMBOL_PREFIX #x)
void symbol_put_addr(void *addr);

-static inline local_t *__module_ref_addr(struct module *mod, int cpu)
-{
-#ifdef CONFIG_SMP
- return (local_t *) (mod->refptr + per_cpu_offset(cpu));
-#else
- return &mod->ref;
-#endif
-}
-
/* Sometimes we know we already have a refcount, and it's easier not
to handle the error case (which only happens with rmmod --wait). */
static inline void __module_get(struct module *module)
{
if (module) {
- unsigned int cpu = get_cpu();
- local_inc(__module_ref_addr(module, cpu));
+ preempt_disable();
+ __this_cpu_inc(module->refptr->count);
trace_module_get(module, _THIS_IP_,
- local_read(__module_ref_addr(module, cpu)));
- put_cpu();
+ __this_cpu_read(module->refptr->count));
+ preempt_enable();
}
}

@@ -479,15 +467,15 @@ static inline int try_module_get(struct
int ret = 1;

if (module) {
- unsigned int cpu = get_cpu();
if (likely(module_is_live(module))) {
- local_inc(__module_ref_addr(module, cpu));
+ preempt_disable();
+ __this_cpu_inc(module->refptr->count);
trace_module_get(module, _THIS_IP_,
- local_read(__module_ref_addr(module, cpu)));
+ __this_cpu_read(module->refptr->count));
+ preempt_enable();
}
else
ret = 0;
- put_cpu();
}
return ret;
}
Index: linux-2.6/kernel/module.c
===================================================================
--- linux-2.6.orig/kernel/module.c 2009-10-08 11:36:05.000000000 -0500
+++ linux-2.6/kernel/module.c 2009-10-08 11:42:05.000000000 -0500
@@ -474,9 +474,10 @@ static void module_unload_init(struct mo

INIT_LIST_HEAD(&mod->modules_which_use_me);
for_each_possible_cpu(cpu)
- local_set(__module_ref_addr(mod, cpu), 0);
+ per_cpu_ptr(mod->refptr, cpu)->count = 0;
+
/* Hold reference count during initialization. */
- local_set(__module_ref_addr(mod, raw_smp_processor_id()), 1);
+ __this_cpu_write(mod->refptr->count, 1);
/* Backwards compatibility macros put refcount during init. */
mod->waiter = current;
}
@@ -555,6 +556,7 @@ static void module_unload_free(struct mo
kfree(use);
sysfs_remove_link(i->holders_dir, mod->name);
/* There can be at most one match. */
+ free_percpu(i->refptr);
break;
}
}
@@ -619,7 +621,7 @@ unsigned int module_refcount(struct modu
int cpu;

for_each_possible_cpu(cpu)
- total += local_read(__module_ref_addr(mod, cpu));
+ total += per_cpu_ptr(mod->refptr, cpu)->count;
return total;
}
EXPORT_SYMBOL(module_refcount);
@@ -796,14 +798,15 @@ static struct module_attribute refcnt =
void module_put(struct module *module)
{
if (module) {
- unsigned int cpu = get_cpu();
- local_dec(__module_ref_addr(module, cpu));
+ preempt_disable();
+ __this_cpu_dec(module->refptr->count);
+
trace_module_put(module, _RET_IP_,
- local_read(__module_ref_addr(module, cpu)));
+ __this_cpu_read(module->refptr->count));
/* Maybe they're waiting for us to drop reference? */
if (unlikely(!module_is_live(module)))
wake_up_process(module->waiter);
- put_cpu();
+ preempt_enable();
}
}
EXPORT_SYMBOL(module_put);
@@ -1377,9 +1380,9 @@ static void free_module(struct module *m
kfree(mod->args);
if (mod->percpu)
percpu_modfree(mod->percpu);
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
+#if defined(CONFIG_MODULE_UNLOAD)
if (mod->refptr)
- percpu_modfree(mod->refptr);
+ free_percpu(mod->refptr);
#endif
/* Free lock-classes: */
lockdep_free_key_range(mod->module_core, mod->core_size);
@@ -2145,9 +2148,8 @@ static noinline struct module *load_modu
mod = (void *)sechdrs[modindex].sh_addr;
kmemleak_load_module(mod, hdr, sechdrs, secstrings);

-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
- mod->refptr = percpu_modalloc(sizeof(local_t), __alignof__(local_t),
- mod->name);
+#if defined(CONFIG_MODULE_UNLOAD)
+ mod->refptr = alloc_percpu(struct module_ref);
if (!mod->refptr) {
err = -ENOMEM;
goto free_init;
@@ -2373,8 +2375,8 @@ static noinline struct module *load_modu
kobject_put(&mod->mkobj.kobj);
free_unload:
module_unload_free(mod);
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
- percpu_modfree(mod->refptr);
+#if defined(CONFIG_MODULE_UNLOAD)
+ free_percpu(mod->refptr);
free_init:
#endif
module_free(mod, mod->module_init);
Index: linux-2.6/kernel/trace/ring_buffer.c
===================================================================
--- linux-2.6.orig/kernel/trace/ring_buffer.c 2009-10-08 12:46:29.000000000 -0500
+++ linux-2.6/kernel/trace/ring_buffer.c 2009-10-08 12:46:46.000000000 -0500
@@ -12,6 +12,7 @@
#include <linux/hardirq.h>
#include <linux/kmemcheck.h>
#include <linux/module.h>
+#include <asm/local.h>
#include <linux/percpu.h>
#include <linux/mutex.h>
#include <linux/init.h>

Christoph Lameter

unread,

Oct 8, 2009, 2:20:07 PM10/8/09

to

On Tue, 6 Oct 2009, Rusty Russell wrote:

> Does this mean we can kill local.h soon?

We can kill cpu_local_xx right now. this_cpu_xx is a superset of that
functionality and cpu_local_xx is not used at all. We have to wait for the
merging of other patches to do more.

Subject: [percpu next] Remove cpu_local_xx macros

These macros have not been used for awhile now.

Signed-off-by: Christoph Lameter <c...@linux-foundation.org>

---
arch/alpha/include/asm/local.h | 17 -----------------
arch/m32r/include/asm/local.h | 25 -------------------------
arch/mips/include/asm/local.h | 25 -------------------------
arch/powerpc/include/asm/local.h | 25 -------------------------
arch/x86/include/asm/local.h | 37 -------------------------------------
include/asm-generic/local.h | 19 -------------------
6 files changed, 148 deletions(-)

Index: linux-2.6/arch/alpha/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/alpha/include/asm/local.h 2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/alpha/include/asm/local.h 2009-10-08 12:57:40.000000000 -0500
@@ -98,21 +98,4 @@ static __inline__ long local_sub_return(
#define __local_add(i,l) ((l)->a.counter+=(i))
#define __local_sub(i,l) ((l)->a.counter-=(i))

-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable, not an address.
- */
-#define cpu_local_read(l) local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i) local_set(&__get_cpu_var(l), (i))
-
-#define cpu_local_inc(l) local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l) local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l) local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l) local_sub((i), &__get_cpu_var(l))
-
-#define __cpu_local_inc(l) __local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l) __local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l) __local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l) __local_sub((i), &__get_cpu_var(l))
-
#endif /* _ALPHA_LOCAL_H */
Index: linux-2.6/arch/m32r/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/m32r/include/asm/local.h 2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/m32r/include/asm/local.h 2009-10-08 12:57:40.000000000 -0500
@@ -338,29 +338,4 @@ static inline void local_set_mask(unsign
* a variable, not an address.
*/

-/* Need to disable preemption for the cpu local counters otherwise we could
- still access a variable of a previous CPU in a non local way. */
-#define cpu_local_wrap_v(l) \
- ({ local_t res__; \
- preempt_disable(); \
- res__ = (l); \
- preempt_enable(); \
- res__; })
-#define cpu_local_wrap(l) \
- ({ preempt_disable(); \
- l; \
- preempt_enable(); }) \
-
-#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l) cpu_local_inc(l)
-#define __cpu_local_dec(l) cpu_local_dec(l)
-#define __cpu_local_add(i, l) cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l) cpu_local_sub((i), (l))
-
#endif /* __M32R_LOCAL_H */
Index: linux-2.6/arch/mips/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/mips/include/asm/local.h 2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/mips/include/asm/local.h 2009-10-08 12:57:40.000000000 -0500
@@ -193,29 +193,4 @@ static __inline__ long local_sub_return(
#define __local_add(i, l) ((l)->a.counter+=(i))
#define __local_sub(i, l) ((l)->a.counter-=(i))

-/* Need to disable preemption for the cpu local counters otherwise we could
- still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l) \
- ({ local_t res__; \
- preempt_disable(); \
- res__ = (l); \
- preempt_enable(); \
- res__; })
-#define cpu_local_wrap(l) \
- ({ preempt_disable(); \
- l; \
- preempt_enable(); }) \
-
-#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l) cpu_local_inc(l)
-#define __cpu_local_dec(l) cpu_local_dec(l)
-#define __cpu_local_add(i, l) cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l) cpu_local_sub((i), (l))
-
#endif /* _ARCH_MIPS_LOCAL_H */
Index: linux-2.6/arch/powerpc/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/local.h 2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/powerpc/include/asm/local.h 2009-10-08 12:57:40.000000000 -0500
@@ -172,29 +172,4 @@ static __inline__ long local_dec_if_posi
#define __local_add(i,l) ((l)->a.counter+=(i))
#define __local_sub(i,l) ((l)->a.counter-=(i))

-/* Need to disable preemption for the cpu local counters otherwise we could
- still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l) \
- ({ local_t res__; \
- preempt_disable(); \
- res__ = (l); \
- preempt_enable(); \
- res__; })
-#define cpu_local_wrap(l) \
- ({ preempt_disable(); \
- l; \
- preempt_enable(); }) \
-
-#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l) cpu_local_inc(l)
-#define __cpu_local_dec(l) cpu_local_dec(l)
-#define __cpu_local_add(i, l) cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l) cpu_local_sub((i), (l))
-
#endif /* _ARCH_POWERPC_LOCAL_H */
Index: linux-2.6/arch/x86/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/local.h 2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/arch/x86/include/asm/local.h 2009-10-08 12:57:40.000000000 -0500
@@ -195,41 +195,4 @@ static inline long local_sub_return(long
#define __local_add(i, l) local_add((i), (l))
#define __local_sub(i, l) local_sub((i), (l))

-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable, not an address.
- *
- * X86_64: This could be done better if we moved the per cpu data directly
- * after GS.
- */
-
-/* Need to disable preemption for the cpu local counters otherwise we could
- still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l) \
-({ \
- local_t res__; \
- preempt_disable(); \
- res__ = (l); \
- preempt_enable(); \
- res__; \
-})
-#define cpu_local_wrap(l) \
-({ \
- preempt_disable(); \
- (l); \
- preempt_enable(); \
-}) \
-
-#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var((l))))
-#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var((l)), (i)))
-#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var((l))))
-#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var((l))))
-#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var((l))))
-#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var((l))))
-
-#define __cpu_local_inc(l) cpu_local_inc((l))
-#define __cpu_local_dec(l) cpu_local_dec((l))
-#define __cpu_local_add(i, l) cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l) cpu_local_sub((i), (l))
-
#endif /* _ASM_X86_LOCAL_H */
Index: linux-2.6/include/asm-generic/local.h
===================================================================
--- linux-2.6.orig/include/asm-generic/local.h 2009-10-08 12:57:38.000000000 -0500
+++ linux-2.6/include/asm-generic/local.h 2009-10-08 12:57:40.000000000 -0500
@@ -52,23 +52,4 @@ typedef struct
#define __local_add(i,l) local_set((l), local_read(l) + (i))
#define __local_sub(i,l) local_set((l), local_read(l) - (i))

-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable (eg. mystruct.foo), not an address.
- */
-#define cpu_local_read(l) local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i) local_set(&__get_cpu_var(l), (i))
-#define cpu_local_inc(l) local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l) local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l) local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l) local_sub((i), &__get_cpu_var(l))
-
-/* Non-atomic increments, ie. preemption disabled and won't be touched
- * in interrupt, etc. Some archs can optimize this case well.
- */
-#define __cpu_local_inc(l) __local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l) __local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l) __local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l) __local_sub((i), &__get_cpu_var(l))
-
#endif /* _ASM_GENERIC_LOCAL_H */