V3->V4:
- Fix various macro definitions.
- Provide experimental percpu based fastpath that does not disable
interrupts for SLUB.
V2->V3:
- Available via git tree against latest upstream from
git://git.kernel.org/pub/scm/linux/kernel/git/christoph/percpu.git linus
- Rework SLUB per cpu operations. Get rid of dynamic DMA slab creation
for CONFIG_ZONE_DMA
- Create fallback framework so that 64 bit ops on 32 bit platforms
can fall back to disabling preemption or interrupts. 64 bit
platforms can use 64 bit atomic per cpu ops.
V1->V2:
- Various minor fixes
- Add SLUB conversion
- Add Page allocator conversion
- Patchset against today's git tree
The patchset introduces various operations to allow efficient access
to per cpu variables for the current processor. Currently there is
no way in the core to calculate the address of the instance
of a per cpu variable without a table lookup. So we see a lot of
per_cpu_ptr(x, smp_processor_id())
The patchset introduces a way to calculate the address from the per cpu
offset that is available in arch specific ways (a register or a special
memory location):
this_cpu_ptr(x)
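A minimal usage sketch of the two forms (the structure, field and function
names below are made up for illustration and are not part of the patchset):

#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/smp.h>

/* Hypothetical per cpu statistics; "stats" would come from alloc_percpu() */
struct my_stats {
	unsigned long events;
};
static struct my_stats *stats;

static void touch_stats(void)
{
	struct my_stats *s;

	preempt_disable();

	/* Old style: table lookup keyed by the processor number */
	s = per_cpu_ptr(stats, smp_processor_id());
	s->events++;

	/* New style: the address is derived from the arch specific
	 * per cpu offset (segment register or special memory location) */
	s = this_cpu_ptr(stats);
	s->events++;

	preempt_enable();
}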
In addition, macros are provided that operate on per cpu variables in a
per cpu atomic way. With those, scalars in structures allocated with the
new percpu allocator can be modified without disabling preemption or
interrupts. This works by generating a single instruction that does both
the relocation of the address to the proper percpu area and the RMW action.
F.e.
this_cpu_add(x->var, 20)
can be used to generate a single instruction that uses a segment register
to relocate the address of the variable into the per cpu area of the
current processor and then increments the variable by 20. The instruction
cannot be interrupted, so the modification is atomic with respect to the
cpu (it either happens or it does not). Rescheduling or an interrupt can
only occur before or after the instruction.
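A hedged sketch of that usage on a dynamically allocated per cpu structure
(conn_stats and its fields are invented for illustration):

#include <linux/percpu.h>

/* Hypothetical per connection statistics, one instance per cpu;
 * "conn" would be the result of alloc_percpu(struct conn_stats). */
struct conn_stats {
	unsigned long bytes;
	unsigned long packets;
};
static struct conn_stats *conn;

static void account_packet(unsigned int len)
{
	/* No preempt_disable()/local_irq_save() needed: each operation
	 * becomes a single segment prefixed RMW instruction on x86 that
	 * relocates into the per cpu area of the current processor. */
	this_cpu_add(conn->bytes, len);
	this_cpu_inc(conn->packets);
}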
Per cpu atomicity does not protect against concurrent modifications from
other processors. But per cpu data is in general modified only from the
processor that the per cpu area is associated with, so per cpu atomicity
provides a fast and effective means of dealing with that concurrency. It
may allow the development of better fastpaths for allocators and other
important subsystems.
The per cpu atomic RMW operations can be used to avoid having to dimension
pointer arrays in the allocators (patches for the page allocator and SLUB
are provided) and to avoid pointer lookups in the hot paths of the
allocators, thereby decreasing the latency of critical OS paths. The macros
could also be used to revise the critical paths in the allocators so that
they no longer need to disable interrupts (not included). A condensed
sketch of the conversion pattern follows.
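A rough before/after sketch of that conversion, abridged from the page
allocator patch quoted below (the names are the real ones from the patch,
the layout here is simplified):

/*
 * Before: one pageset pointer per possible cpu embedded in struct zone,
 * accessed through an array lookup:
 *
 *	struct per_cpu_pageset *pageset[NR_CPUS];
 *	pset = zone_pcp(zone, cpu);
 *
 * After: a single pointer into the per cpu area, allocated with
 * alloc_percpu(struct per_cpu_pageset):
 *
 *	struct per_cpu_pageset *pageset;
 *	pset = per_cpu_ptr(zone->pageset, cpu);	(remote cpu)
 *	pset = this_cpu_ptr(zone->pageset);	(current cpu, no table lookup)
 */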
Per cpu atomic RMW operations are useful to decrease the overhead of
counter maintenance in the kernel. A this_cpu_inc() f.e. can generate a
single instruction that has no need for registers on x86. Preempt on/off
can be avoided in many places.
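A hedged sketch of such a counter (the counter name is invented; the
spelling for a statically defined per cpu variable follows the
per_cpu_var() convention visible in the patch below and may differ in
other trees):

#include <linux/percpu.h>

/* Hypothetical event counter, one instance per cpu */
static DEFINE_PER_CPU(unsigned long, nr_foo_events);

static void note_foo_event(void)
{
	/* Before: explicit preempt handling around the per cpu access
	 *
	 *	preempt_disable();
	 *	__get_cpu_var(nr_foo_events)++;
	 *	preempt_enable();
	 */

	/* After: a single x86 instruction, no scratch register needed
	 * and no preempt on/off */
	this_cpu_inc(per_cpu_var(nr_foo_events));
}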
The patchset reduces code size and increases the speed of operations for
dynamically allocated per cpu based statistics. A set of patches modifies
the fastpaths of the SLUB allocator, reducing code size and cache footprint
through the per cpu atomic operations.
This patchset depends on all arches supporting the new per cpu allocator.
IA64 still uses the old percpu allocator. Tejun has patches to fix up IA64
that were approved by Tony Luck, but the IA64 patches have not been merged
yet.
---
c...@linux-foundation.org wrote:
> +/*
> + * Boot pageset table. One per cpu which is going to be used for all
> + * zones and all nodes. The parameters will be set in such a way
> + * that an item put on a list will immediately be handed over to
> + * the buddy list. This is safe since pageset manipulation is done
> + * with interrupts disabled.
> + *
> + * The boot_pagesets must be kept even after bootup is complete for
> + * unused processors and/or zones. They do play a role for bootstrapping
> + * hotplugged processors.
> + *
> + * zoneinfo_show() and maybe other functions do
> + * not check if the processor is online before following the pageset pointer.
> + * Other parts of the kernel may not check if the zone is available.
> + */
> +static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
> +static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
This looks much better but I'm not sure whether it's safe. percpu
offsets have not been set up before setup_per_cpu_areas() is complete
on most archs. But if all that's necessary is getting the page
allocator up and running as soon as the static per cpu areas and offsets
are set up (which basically means as soon as cpu init is complete on
ia64 and setup_per_cpu_areas() is complete on all other archs), this
should be correct. Is this what you're expecting?
Thanks.
--
tejun
Also, as I'm not very familiar with the code, I'd really appreciate
Mel Gorman's Acked-by or Reviewed-by.
I haven't tested the patch series but it now looks good to my eyes at
least. Thanks
Acked-by: Mel Gorman <m...@csn.ul.ie>
> ---
> include/linux/mm.h | 4 -
> include/linux/mmzone.h | 12 ---
> mm/page_alloc.c | 187 ++++++++++++++++++-------------------------------
> mm/vmstat.c | 14 ++-
> 4 files changed, 81 insertions(+), 136 deletions(-)
>
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h 2009-10-07 14:34:25.000000000 -0500
> +++ linux-2.6/include/linux/mm.h 2009-10-07 14:48:09.000000000 -0500
> @@ -1061,11 +1061,7 @@ extern void si_meminfo(struct sysinfo *
> extern void si_meminfo_node(struct sysinfo *val, int nid);
> extern int after_bootmem;
>
> -#ifdef CONFIG_NUMA
> extern void setup_per_cpu_pageset(void);
> -#else
> -static inline void setup_per_cpu_pageset(void) {}
> -#endif
>
> extern void zone_pcp_update(struct zone *zone);
>
> Index: linux-2.6/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mmzone.h 2009-10-07 14:34:25.000000000 -0500
> +++ linux-2.6/include/linux/mmzone.h 2009-10-07 14:48:09.000000000 -0500
> @@ -184,13 +184,7 @@ struct per_cpu_pageset {
> s8 stat_threshold;
> s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
> #endif
> -} ____cacheline_aligned_in_smp;
> -
> -#ifdef CONFIG_NUMA
> -#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
> -#else
> -#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
> -#endif
> +};
>
> #endif /* !__GENERATING_BOUNDS.H */
>
> @@ -306,10 +300,8 @@ struct zone {
> */
> unsigned long min_unmapped_pages;
> unsigned long min_slab_pages;
> - struct per_cpu_pageset *pageset[NR_CPUS];
> -#else
> - struct per_cpu_pageset pageset[NR_CPUS];
> #endif
> + struct per_cpu_pageset *pageset;
> /*
> * free areas of different sizes
> */
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c 2009-10-07 14:34:25.000000000 -0500
> +++ linux-2.6/mm/page_alloc.c 2009-10-07 14:48:09.000000000 -0500
> @@ -1011,10 +1011,10 @@ static void drain_pages(unsigned int cpu
> struct per_cpu_pageset *pset;
> struct per_cpu_pages *pcp;
>
> - pset = zone_pcp(zone, cpu);
> + local_irq_save(flags);
> + pset = per_cpu_ptr(zone->pageset, cpu);
>
> pcp = &pset->pcp;
> - local_irq_save(flags);
> free_pcppages_bulk(zone, pcp->count, pcp);
> pcp->count = 0;
> local_irq_restore(flags);
> @@ -1098,7 +1098,6 @@ static void free_hot_cold_page(struct pa
> arch_free_page(page, 0);
> kernel_map_pages(page, 1, 0);
>
> - pcp = &zone_pcp(zone, get_cpu())->pcp;
> migratetype = get_pageblock_migratetype(page);
> set_page_private(page, migratetype);
> local_irq_save(flags);
> @@ -1121,6 +1120,7 @@ static void free_hot_cold_page(struct pa
> migratetype = MIGRATE_MOVABLE;
> }
>
> + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> if (cold)
> list_add_tail(&page->lru, &pcp->lists[migratetype]);
> else
> @@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
>
> out:
> local_irq_restore(flags);
> - put_cpu();
> }
>
> void free_hot_page(struct page *page)
> @@ -1183,17 +1182,15 @@ struct page *buffered_rmqueue(struct zon
> unsigned long flags;
> struct page *page;
> int cold = !!(gfp_flags & __GFP_COLD);
> - int cpu;
>
> again:
> - cpu = get_cpu();
> if (likely(order == 0)) {
> struct per_cpu_pages *pcp;
> struct list_head *list;
>
> - pcp = &zone_pcp(zone, cpu)->pcp;
> - list = &pcp->lists[migratetype];
> local_irq_save(flags);
> + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> + list = &pcp->lists[migratetype];
> if (list_empty(list)) {
> pcp->count += rmqueue_bulk(zone, 0,
> pcp->batch, list,
> @@ -1234,7 +1231,6 @@ again:
> __count_zone_vm_events(PGALLOC, zone, 1 << order);
> zone_statistics(preferred_zone, zone);
> local_irq_restore(flags);
> - put_cpu();
>
> VM_BUG_ON(bad_range(zone, page));
> if (prep_new_page(page, order, gfp_flags))
> @@ -1243,7 +1239,6 @@ again:
>
> failed:
> local_irq_restore(flags);
> - put_cpu();
> return NULL;
> }
>
> @@ -2172,7 +2167,7 @@ void show_free_areas(void)
> for_each_online_cpu(cpu) {
> struct per_cpu_pageset *pageset;
>
> - pageset = zone_pcp(zone, cpu);
> + pageset = per_cpu_ptr(zone->pageset, cpu);
>
> printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
> cpu, pageset->pcp.high,
> @@ -2735,10 +2730,29 @@ static void build_zonelist_cache(pg_data
>
> #endif /* CONFIG_NUMA */
>
> +/*
> + * Boot pageset table. One per cpu which is going to be used for all
> + * zones and all nodes. The parameters will be set in such a way
> + * that an item put on a list will immediately be handed over to
> + * the buddy list. This is safe since pageset manipulation is done
> + * with interrupts disabled.
> + *
> + * The boot_pagesets must be kept even after bootup is complete for
> + * unused processors and/or zones. They do play a role for bootstrapping
> + * hotplugged processors.
> + *
> + * zoneinfo_show() and maybe other functions do
> + * not check if the processor is online before following the pageset pointer.
> + * Other parts of the kernel may not check if the zone is available.
> + */
> +static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
> +static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
> +
> /* return values int ....just for stop_machine() */
> static int __build_all_zonelists(void *dummy)
> {
> int nid;
> + int cpu;
>
> #ifdef CONFIG_NUMA
> memset(node_load, 0, sizeof(node_load));
> @@ -2749,6 +2763,14 @@ static int __build_all_zonelists(void *d
> build_zonelists(pgdat);
> build_zonelist_cache(pgdat);
> }
> +
> + /*
> + * Initialize the boot_pagesets that are going to be used
> + * for bootstrapping processors.
> + */
> + for_each_possible_cpu(cpu)
> + setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> +
> return 0;
> }
>
> @@ -3087,120 +3109,60 @@ static void setup_pagelist_highmark(stru
> }
>
>
> -#ifdef CONFIG_NUMA
> -/*
> - * Boot pageset table. One per cpu which is going to be used for all
> - * zones and all nodes. The parameters will be set in such a way
> - * that an item put on a list will immediately be handed over to
> - * the buddy list. This is safe since pageset manipulation is done
> - * with interrupts disabled.
> - *
> - * Some NUMA counter updates may also be caught by the boot pagesets.
> - *
> - * The boot_pagesets must be kept even after bootup is complete for
> - * unused processors and/or zones. They do play a role for bootstrapping
> - * hotplugged processors.
> - *
> - * zoneinfo_show() and maybe other functions do
> - * not check if the processor is online before following the pageset pointer.
> - * Other parts of the kernel may not check if the zone is available.
> - */
> -static struct per_cpu_pageset boot_pageset[NR_CPUS];
> -
> -/*
> - * Dynamically allocate memory for the
> - * per cpu pageset array in struct zone.
> - */
> -static int __cpuinit process_zones(int cpu)
> -{
> - struct zone *zone, *dzone;
> - int node = cpu_to_node(cpu);
> -
> - node_set_state(node, N_CPU); /* this node has a cpu */
> -
> - for_each_populated_zone(zone) {
> - zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
> - GFP_KERNEL, node);
> - if (!zone_pcp(zone, cpu))
> - goto bad;
> -
> - setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
> -
> - if (percpu_pagelist_fraction)
> - setup_pagelist_highmark(zone_pcp(zone, cpu),
> - (zone->present_pages / percpu_pagelist_fraction));
> - }
> -
> - return 0;
> -bad:
> - for_each_zone(dzone) {
> - if (!populated_zone(dzone))
> - continue;
> - if (dzone == zone)
> - break;
> - kfree(zone_pcp(dzone, cpu));
> - zone_pcp(dzone, cpu) = &boot_pageset[cpu];
> - }
> - return -ENOMEM;
> -}
> -
> -static inline void free_zone_pagesets(int cpu)
> -{
> - struct zone *zone;
> -
> - for_each_zone(zone) {
> - struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
> -
> - /* Free per_cpu_pageset if it is slab allocated */
> - if (pset != &boot_pageset[cpu])
> - kfree(pset);
> - zone_pcp(zone, cpu) = &boot_pageset[cpu];
> - }
> -}
> -
> static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
> unsigned long action,
> void *hcpu)
> {
> int cpu = (long)hcpu;
> - int ret = NOTIFY_OK;
>
> switch (action) {
> case CPU_UP_PREPARE:
> case CPU_UP_PREPARE_FROZEN:
> - if (process_zones(cpu))
> - ret = NOTIFY_BAD;
> - break;
> - case CPU_UP_CANCELED:
> - case CPU_UP_CANCELED_FROZEN:
> - case CPU_DEAD:
> - case CPU_DEAD_FROZEN:
> - free_zone_pagesets(cpu);
> + node_set_state(cpu_to_node(cpu), N_CPU);
> break;
> default:
> break;
> }
> - return ret;
> + return NOTIFY_OK;
> }
>
> static struct notifier_block __cpuinitdata pageset_notifier =
> { &pageset_cpuup_callback, NULL, 0 };
>
> +/*
> + * Allocate per cpu pagesets and initialize them.
> + * Before this call only boot pagesets were available.
> + * Boot pagesets will no longer be used by this processor
> + * after setup_per_cpu_pageset().
> + */
> void __init setup_per_cpu_pageset(void)
> {
> - int err;
> + struct zone *zone;
> + int cpu;
> +
> + for_each_populated_zone(zone) {
> + zone->pageset = alloc_percpu(struct per_cpu_pageset);
> +
> + for_each_possible_cpu(cpu) {
> + struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
> +
> + setup_pageset(pcp, zone_batchsize(zone));
> +
> + if (percpu_pagelist_fraction)
> + setup_pagelist_highmark(pcp,
> + (zone->present_pages /
> + percpu_pagelist_fraction));
> + }
> + }
>
> - /* Initialize per_cpu_pageset for cpu 0.
> - * A cpuup callback will do this for every cpu
> - * as it comes online
> + /*
> + * The boot cpu is always the first active.
> + * The boot node has a processor
> */
> - err = process_zones(smp_processor_id());
> - BUG_ON(err);
> + node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
> register_cpu_notifier(&pageset_notifier);
> }
>
> -#endif
> -
> static noinline __init_refok
> int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
> {
> @@ -3254,7 +3216,7 @@ static int __zone_pcp_update(void *data)
> struct per_cpu_pageset *pset;
> struct per_cpu_pages *pcp;
>
> - pset = zone_pcp(zone, cpu);
> + pset = per_cpu_ptr(zone->pageset, cpu);
> pcp = &pset->pcp;
>
> local_irq_save(flags);
> @@ -3272,21 +3234,13 @@ void zone_pcp_update(struct zone *zone)
>
> static __meminit void zone_pcp_init(struct zone *zone)
> {
> - int cpu;
> - unsigned long batch = zone_batchsize(zone);
> + /* Use boot pagesets until we have the per cpu allocator up */
> + zone->pageset = &per_cpu_var(boot_pageset);
>
> - for (cpu = 0; cpu < NR_CPUS; cpu++) {
> -#ifdef CONFIG_NUMA
> - /* Early boot. Slab allocator not functional yet */
> - zone_pcp(zone, cpu) = &boot_pageset[cpu];
> - setup_pageset(&boot_pageset[cpu],0);
> -#else
> - setup_pageset(zone_pcp(zone,cpu), batch);
> -#endif
> - }
> if (zone->present_pages)
> - printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
> - zone->name, zone->present_pages, batch);
> + printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%u\n",
> + zone->name, zone->present_pages,
> + zone_batchsize(zone));
> }
>
> __meminit int init_currently_empty_zone(struct zone *zone,
> @@ -4800,10 +4754,11 @@ int percpu_pagelist_fraction_sysctl_hand
> if (!write || (ret == -EINVAL))
> return ret;
> for_each_populated_zone(zone) {
> - for_each_online_cpu(cpu) {
> + for_each_possible_cpu(cpu) {
> unsigned long high;
> high = zone->present_pages / percpu_pagelist_fraction;
> - setup_pagelist_highmark(zone_pcp(zone, cpu), high);
> + setup_pagelist_highmark(
> + per_cpu_ptr(zone->pageset, cpu), high);
> }
> }
> return 0;
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c 2009-10-07 14:34:25.000000000 -0500
> +++ linux-2.6/mm/vmstat.c 2009-10-07 14:48:09.000000000 -0500
> @@ -139,7 +139,8 @@ static void refresh_zone_stat_thresholds
> threshold = calculate_threshold(zone);
>
> for_each_online_cpu(cpu)
> - zone_pcp(zone, cpu)->stat_threshold = threshold;
> + per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> + = threshold;
> }
> }
>
> @@ -149,7 +150,8 @@ static void refresh_zone_stat_thresholds
> void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
> int delta)
> {
> - struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> + struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> +
> s8 *p = pcp->vm_stat_diff + item;
> long x;
>
> @@ -202,7 +204,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
> */
> void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
> {
> - struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> + struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> s8 *p = pcp->vm_stat_diff + item;
>
> (*p)++;
> @@ -223,7 +225,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);
>
> void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
> {
> - struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> + struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> s8 *p = pcp->vm_stat_diff + item;
>
> (*p)--;
> @@ -300,7 +302,7 @@ void refresh_cpu_vm_stats(int cpu)
> for_each_populated_zone(zone) {
> struct per_cpu_pageset *p;
>
> - p = zone_pcp(zone, cpu);
> + p = per_cpu_ptr(zone->pageset, cpu);
>
> for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
> if (p->vm_stat_diff[i]) {
> @@ -738,7 +740,7 @@ static void zoneinfo_show_print(struct s
> for_each_online_cpu(i) {
> struct per_cpu_pageset *pageset;
>
> - pageset = zone_pcp(zone, i);
> + pageset = per_cpu_ptr(zone->pageset, i);
> seq_printf(m,
> "\n cpu: %i"
> "\n count: %i"
>
> --
>
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
> > +static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
> > +static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>
> This looks much better but I'm not sure whether it's safe. percpu
> offsets have not been set up before setup_per_cpu_areas() is complete
> on most archs. But if all that's necessary is getting the page
> allocator up and running as soon as the static per cpu areas and offsets
> are set up (which basically means as soon as cpu init is complete on
> ia64 and setup_per_cpu_areas() is complete on all other archs), this
> should be correct. Is this what you're expecting?
paging_init() is called after the per cpu areas have been initialized, so
I thought this would be safe. Tested it on x86.
zone_pcp_init() only sets up the per cpu pointers to the pagesets. That
works regardless of the boot stage. Then build_all_zonelists()
initializes the actual contents of the per cpu variables.
Finally, the per cpu pagesets are allocated from the percpu allocator once
all allocators are up, and the pagesets are properly sized.
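A condensed view of that ordering, with the function names taken from the
patch above (the call sites in the boot sequence are abridged):

/*
 * 1. zone_pcp_init(zone)
 *	zone->pageset = &per_cpu_var(boot_pageset);
 *	Only a pointer assignment, so it is safe at any boot stage.
 *
 * 2. build_all_zonelists()
 *	for_each_possible_cpu(cpu)
 *		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
 *	The boot pagesets now hand pages straight to the buddy lists.
 *
 * 3. setup_per_cpu_pageset()
 *	zone->pageset = alloc_percpu(struct per_cpu_pageset);
 *	Real, properly sized pagesets once all allocators are up.
 */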
Christoph Lameter wrote:
>> The biggest grief I have is that the meaning of __ is different among
>> different accessors. If that can be cleared up, we would be in much
>> better shape without adding any extra macros. Can we just remove all
>> __'s and use meaningful pre or suffixes like raw or irq or whatever?
>
> It currently means that we do not deal with preempt and do not check for
> preemption. That is consistent.
If you define it inclusively, it can be consistent.
> Sure we could change the API to have even more macros than the large
> amount it already has so that we can check for proper preempt disablement.
>
> I guess that would mean adding
>
> raw_nopreempt_this_cpu_xx and nopreempt_this_cpu_xx variants? The thing
> gets huge. I think we could just leave it. __ suggests that serialization
> and checking is not performed like in the full versions and that is true.
I don't think we'll need to add new variants. Just renaming the existing
ones so that they have more specific pre/suffixes should make things
clearer. I'll take a shot at that once the sparse annotation patchset
is merged.
Thanks.
--
tejun