[PATCH] [RESEND] x86_64: add memory hotremove config option

Gary Hade

Sep 5, 2008, 1:30:20 PM

Resending with linux-...@vger.kernel.org and x...@kernel.org copied
this time. No changes other than this and modified Subject line. The
only response so far on linux-mm has been an Acked-by: from
Yasunori Goto <y-g...@jp.fujitsu.com>

Add memory hotremove config option to x86_64

Memory hotremove functionality can currently be configured into
the ia64, powerpc, and s390 kernels. This patch makes it possible
to configure the memory hotremove functionality into the x86_64
kernel as well.

Signed-off-by: Gary Hade <gary...@us.ibm.com>

---
 arch/x86/Kconfig      |    3 +++
 arch/x86/mm/init_64.c |   18 ++++++++++++++++++
 2 files changed, 21 insertions(+)

Index: linux-2.6.27-rc5/arch/x86/Kconfig
===================================================================
--- linux-2.6.27-rc5.orig/arch/x86/Kconfig 2008-09-03 13:33:59.000000000 -0700
+++ linux-2.6.27-rc5/arch/x86/Kconfig 2008-09-03 13:34:55.000000000 -0700
@@ -1384,6 +1384,9 @@
 	def_bool y
 	depends on X86_64 || (X86_32 && HIGHMEM)
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 config HAVE_ARCH_EARLY_PFN_TO_NID
 	def_bool X86_64
 	depends on NUMA
Index: linux-2.6.27-rc5/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.27-rc5.orig/arch/x86/mm/init_64.c 2008-09-03 13:34:08.000000000 -0700
+++ linux-2.6.27-rc5/arch/x86/mm/init_64.c 2008-09-03 13:34:55.000000000 -0700
@@ -740,6 +740,24 @@
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn, end_pfn;
+	unsigned long timeout = 120 * HZ;
+	int ret;
+	start_pfn = start >> PAGE_SHIFT;
+	end_pfn = start_pfn + (size >> PAGE_SHIFT);
+	ret = offline_pages(start_pfn, end_pfn, timeout);
+	if (ret)
+		goto out;
+	/* Arch-specific calls go here */
+out:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(remove_memory);
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 /*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Ingo Molnar

Sep 5, 2008, 1:50:16 PM

* Gary Hade <gary...@us.ibm.com> wrote:

> Add memory hotremove config option to x86_64
>
> Memory hotremove functionality can currently be configured into the
> ia64, powerpc, and s390 kernels. This patch makes it possible to
> configure the memory hotremove functionality into the x86_64 kernel as
> well.

hm, why is it for 64-bit only?

> +++ linux-2.6.27-rc5/arch/x86/Kconfig 2008-09-03 13:34:55.000000000 -0700
> @@ -1384,6 +1384,9 @@
>  	def_bool y
>  	depends on X86_64 || (X86_32 && HIGHMEM)
>
> +config ARCH_ENABLE_MEMORY_HOTREMOVE
> +	def_bool y

so this will break the build on 32-bit, if CONFIG_MEMORY_HOTREMOVE=y?
mm/memory_hotplug.c assumes that remove_memory() is provided by the
architecture.

> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int remove_memory(u64 start, u64 size)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	unsigned long timeout = 120 * HZ;
> +	int ret;
> +	start_pfn = start >> PAGE_SHIFT;
> +	end_pfn = start_pfn + (size >> PAGE_SHIFT);
> +	ret = offline_pages(start_pfn, end_pfn, timeout);
> +	if (ret)
> +		goto out;
> +	/* Arch-specific calls go here */
> +out:
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(remove_memory);
> +#endif /* CONFIG_MEMORY_HOTREMOVE */

hm, nothing appears to be arch-specific about this trivial wrapper
around offline_pages().

Shouldn't this be moved to the CONFIG_MEMORY_HOTREMOVE portion of
mm/memory_hotplug.c instead, as a weak function? That way architectures
only have to enable ARCH_ENABLE_MEMORY_HOTREMOVE - and architectures
with different/special needs can override it.
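
To make the weak-function suggestion concrete, here is a minimal userspace sketch, not kernel code. It assumes the GCC/Clang `__attribute__((weak))` extension, 4 KiB pages and an illustrative HZ value; the offline_pages() below is a hypothetical stand-in that only records the pfn range it is handed, and uint64_t stands in for the kernel's u64.

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4 KiB pages, as on x86_64 */
#define HZ 250          /* assumption: illustrative tick rate only */

/* Hypothetical stand-in for the kernel's offline_pages(): it records the
 * requested pfn range and reports success, nothing more. */
static unsigned long last_start_pfn, last_end_pfn;

static int offline_pages(unsigned long start_pfn, unsigned long end_pfn,
                         unsigned long timeout)
{
    (void)timeout;
    last_start_pfn = start_pfn;
    last_end_pfn = end_pfn;
    return 0;
}

/* Generic default. Because it is weak, an architecture with special needs
 * could provide a non-weak remove_memory() in another object file and the
 * linker would pick that one instead. */
__attribute__((weak)) int remove_memory(uint64_t start, uint64_t size)
{
    unsigned long start_pfn = start >> PAGE_SHIFT;
    unsigned long end_pfn = start_pfn + (size >> PAGE_SHIFT);

    return offline_pages(start_pfn, end_pfn, 120 * HZ);
}
```

Under these assumptions, remove_memory(0x100000000, 0x8000000), that is 128 MiB starting at the 4 GiB boundary, requests the pfn range 0x100000 up to 0x108000.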

Ingo

Andi Kleen

Sep 5, 2008, 2:10:12 PM

Gary Hade <gary...@us.ibm.com> writes:
>
> Add memory hotremove config option to x86_64
>
> Memory hotremove functionality can currently be configured into
> the ia64, powerpc, and s390 kernels. This patch makes it possible
> to configure the memory hotremove functionality into the x86_64
> kernel as well.

You forgot to describe how you tested it. Does it actually work?
And why do you want to do it? What's the use case?

The general understanding was that it doesn't work very well on a real
machine at least because it cannot be controlled how that memory maps
to real pluggable hardware (and you cannot completely empty a node at runtime)
and a Hypervisor would likely use different interfaces anyways.

-Andi

Badari Pulavarty

Sep 5, 2008, 2:20:13 PM

Yes. All the archs (ppc64, ia64, s390, x86_64) have exact same
function. No architecture needed special handling so far (initial
versions of ppc64 needed extra handling, but I moved the code
to different place).

We can make this generic and kill all arch-specific ones.
Initially, we didn't know if any arch needs special handling -
so ended up having private functions for each arch.
I think it's time to merge them all.

>
> Shouldnt this be moved to the CONFIG_MEMORY_HOTREMOVE portion of
> mm/memory_hotplug.c instead, as a weak function? That way architectures
> only have to enable ARCH_ENABLE_MEMORY_HOTREMOVE - and architectures
> with different/special needs can override it.

Yes. We should do that. I will send out a patch.

Thanks,
Badari

Ingo Molnar

Sep 5, 2008, 2:20:16 PM

ok - if all architectures have the same function then please make it a
regular function not a weak one, and remove all the duplications.

Ingo

Badari Pulavarty

Sep 5, 2008, 2:40:17 PM

On Fri, 2008-09-05 at 20:04 +0200, Andi Kleen wrote:
> Gary Hade <gary...@us.ibm.com> writes:
> >
> > Add memory hotremove config option to x86_64
> >
> > Memory hotremove functionality can currently be configured into
> > the ia64, powerpc, and s390 kernels. This patch makes it possible
> > to configure the memory hotremove functionality into the x86_64
> > kernel as well.
>
> You forgot to describe how you tested it? Does it actually work.
> And why do you want to do it it? What's the use case?

I will let Gary answer these :)

> The general understanding was that it doesn't work very well on a real
> machine at least because it cannot be controlled how that memory maps
> to real pluggable hardware (and you cannot completely empty a node at runtime)
> and a Hypervisor would likely use different interfaces anyways.

At this time we are interested in node removal (on x86_64).
It doesn't really work well at this time, because some of the structures
(pgdat etc.) are striped across all nodes. There is no easy way to
relocate them. Yasunori Goto is working on patches to address some of
these issues.

But we are considering adding support to restrict/skip bootmem
allocations on selected nodes. That way, we should be able to do
node remove.

(BTW, on ppc64 this works fine - since we are interested mostly in
removing *some* sections of memory to give it back to hypervisor -
not entire node removal).

Thanks,
Badari

Andi Kleen

Sep 5, 2008, 3:00:19 PM

> At this time we are interested on node remove (on x86_64).
> It doesn't really work well at this time -

That's a quite euphemistic way to put it.

> due to some of the structures

That means you can never put any slab data on specific nodes.
And all the kernel subsystems on that node will not ever get local
memory. How are you going to solve that? And if you disallow
kernel allocations in so large memory areas you get many of the highmem
issues that plagued 32bit back in the 64bit kernel.

There are lots of other issues. It's quite questionable if this
whole exercise makes sense at all.

> (BTW, on ppc64 this works fine - since we are interested mostly in
> removing *some* sections of memory to give it back to hypervisor -
> not entire node removal).

Ok for hypervisors you can do it reasonably easy on x86 too, but it's likely
that some hypercall interface is better than going through
sysfs.

-Andi

Gary Hade

Sep 5, 2008, 4:00:20 PM

On Fri, Sep 05, 2008 at 08:04:55PM +0200, Andi Kleen wrote:
> Gary Hade <gary...@us.ibm.com> writes:
> >
> > Add memory hotremove config option to x86_64
> >
> > Memory hotremove functionality can currently be configured into
> > the ia64, powerpc, and s390 kernels. This patch makes it possible
> > to configure the memory hotremove functionality into the x86_64
> > kernel as well.
>
> You forgot to describe how you tested it? Does it actually work.

So far, I have tested it on a 2-node IBM x460, 2-node IBM x3950, and
a 4-node IBM x3950 M2 and have been able to successfully offline and
re-online all memory sections marked as removable multiple times with
no apparent problems.
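
For readers who want to see which memory blocks are marked removable, a small sketch follows. It assumes the sysfs layout /sys/devices/system/memory/memoryN/ with a `removable` attribute that reads as 0 or 1; the helper names are made up for illustration and the block numbering on a given machine is not implied by this thread.

```c
#include <stdio.h>

/* Build "/sys/devices/system/memory/memory<block>/<attr>" into buf.
 * Hypothetical helper; returns whatever snprintf() returns. */
static int memory_block_path(char *buf, size_t len, unsigned int block,
                             const char *attr)
{
    return snprintf(buf, len, "/sys/devices/system/memory/memory%u/%s",
                    block, attr);
}

/* Returns 1 if the block reports itself removable, 0 if not,
 * and -1 if the attribute cannot be opened or parsed. */
static int memory_block_removable(unsigned int block)
{
    char path[128];
    int value = -1;
    FILE *f;

    memory_block_path(path, sizeof(path), block, "removable");
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

A caller would loop over the block numbers present under /sys/devices/system/memory/ and only attempt to offline blocks for which this returns 1.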

By directing the change to -mm our hope is that others will try it
on their systems and help us shake out any issues that they may find.

> And why do you want to do it it? What's the use case?

A baby step towards eventual total node removal.

>
> The general understanding was that it doesn't work very well on a real
> machine at least because it cannot be controlled how that memory maps
> to real pluggable hardware (and you cannot completely empty a node at runtime)
> and a Hypervisor would likely use different interfaces anyways.

The inability to offline all non-primary node memory sections
certainly needs to be addressed. The pgdat removal work that
Yasunori Goto has started will hopefully continue and help resolve
this issue. We have only just started thinking about issues related
to resources other than CPUs and memory that will need to be released
in preparation for node removal (e.g. memory and i/o resources
assigned to PCI devices on a node targeted for removal). Much of
this is new territory for us so any suggestions that you and others
can offer will be much appreciated.

Thanks for asking.

Gary

--
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503 IBM T/L: 775-4503
gary...@us.ibm.com
http://www.ibm.com/linux/ltc

Andi Kleen

Sep 5, 2008, 4:10:09 PM

> The inability to offline all non-primary node memory sections
> certainly needs to be addressed. The pgdat removal work that
> Yasunori Goto has started will hopefully continue and help resolve
> this issue.

You make it sound like it's just some minor technical hurdle
that needs to be addressed. But from all analysis of these issues
I've seen so far it's extremely hard and all possible solutions
have serious issues. So before doing some baby steps there
should be at least some general idea how this thing is supposed
to work in the end.

> We have only just started thinking about issues related
> to resources other that CPUs and memory that will need to be released
> in preparation for node removal (e.g. memory and i/o resources
> assigned to PCI devices on a node targeted for removal).

That's the easy stuff. The hard parts are all the kernel objects
that you cannot move.

-Andi

Gary Hade

Sep 5, 2008, 6:00:13 PM

On Fri, Sep 05, 2008 at 10:04:01PM +0200, Andi Kleen wrote:
> > The inability to offline all non-primary node memory sections
> > certainly needs to be addressed. The pgdat removal work that
> > Yasunori Goto has started will hopefully continue and help resolve
> > this issue.
>
> You make it sound like it's just some minor technical hurdle
> that needs to be addressed.

Sorry, that was not my intent.

> But from all analysis of these issues
> I've seen so far it's extremly hard and all possible solutions
> have serious issues. So before doing some baby steps there
> should be at least some general idea how this thing is supposed
> to work in the end.

I am not sure if I understand why you appear to be opposed to
enabling the hotremove function before all the issues related
to an eventual goal of being able to free all memory on a node
are addressed. Even in the absence of solutions for these issues
it seems like there could still be other possible benefits such
as the ability to selectively expand and shrink available memory
for testing or debugging purposes. I believe it would also be
helpful to those working on or testing possible solutions for
the removal issues.

Gary


Badari Pulavarty

Sep 5, 2008, 6:40:05 PM

On Fri, 2008-09-05 at 20:54 +0200, Andi Kleen wrote:
> > At this time we are interested on node remove (on x86_64).
> > It doesn't really work well at this time -
>
> That's a quite euphemistic way to put it.
>
> > due to some of the structures
>
> That means you can never put any slab data on specific nodes.
> And all the kernel subsystems on that node will not ever get local
> memory. How are you going to solve that? And if you disallow
> kernel allocations in so large memory areas you get many of the highmem
> issues that plagued 32bit back in the 64bit kernel.

You are absolutely correct. There is no easy solution - one has
to lose performance in order to support node removal, along with
some old x86 issues :(

We were contemplating the idea of limiting node removal to a select
few nodes as a compromise - but it didn't sound right :(

>
> There are lots of other issues. It's quite questionable if this
> whole exercise makes sense at all.

The same issues exist with ia64, and x86_64 won't be any worse off.
Gary was trying to enable the functionality so that we can at least
test offlining of memory sections more easily (test page migration,
isolation code and hash out issues).

Another possible idea being considered (still a lot of unknowns) is
to use the memory section offline feature for power management
(*cough*).

Anyway, as you can see this patch doesn't add any code - just
enables config option for x86_64. (if you are worried about
code bloat).

> > (BTW, on ppc64 this works fine - since we are interested mostly in
> > removing *some* sections of memory to give it back to hypervisor -
> > not entire node removal).
>
> Ok for hypervisors you can do it reasonably easy on x86 too, but it's likely
> that some hypercall interface is better than going through
> sysfs.

sysfs interface already exists to offline sections of memory. (same
interface as online).
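
As a rough illustration of that interface, the sketch below writes "offline" or "online" to a memory block's `state` file. It assumes the sysfs path /sys/devices/system/memory/memoryN/state; the function names are hypothetical, and a real tool would need root privileges and better error reporting.

```c
#include <stdio.h>

/* Write a state string such as "offline" or "online" to the given file.
 * Returns 0 on success, -1 on any failure (including a write error that
 * only surfaces when the stream is flushed at fclose() time). */
static int write_state(const char *path, const char *state)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(state, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical wrapper: request a state change for memory block <block>. */
static int set_memory_block_state(unsigned int block, const char *state)
{
    char path[128];

    snprintf(path, sizeof(path),
             "/sys/devices/system/memory/memory%u/state", block);
    return write_state(path, state);
}
```

Since online and offline go through the same file, a test loop can call set_memory_block_state(n, "offline") followed by set_memory_block_state(n, "online") and treat a -1 from the first call as "this block could not be emptied".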

The proposed patch provides an easy way to find out which sections of
memory belong to which node (could be useful on its own).

Thanks,
Badari

Andi Kleen

Sep 5, 2008, 8:00:14 PM

> I am not sure if I understand why you appear to be opposed to
> enabling the hotremove function before all the issues related

I'm quite sceptical that it can be ever made to work in a useful
way for real hardware (as opposed to an hypervisor para virtual setup
for which this interface is not the right way -- it should be done
in some specific driver instead)

And if it cannot be made to work then it will be a false promise
to the user. They will see it and think it will work, but it will
not.

This means I don't see a real use case for this feature.

-Andi

Yasunori Goto

Sep 6, 2008, 3:10:10 AM

> > I am not sure if I understand why you appear to be opposed to
> > enabling the hotremove function before all the issues related
>
> I'm quite sceptical that it can be ever made to work in a useful
> way for real hardware (as opposed to an hypervisor para virtual setup
> for which this interface is not the right way -- it should be done
> in some specific driver instead)
> And if it cannot be made to work then it will be a false promise
> to the user. They will see it and think it will work, but it will
> not.
>
> This means I don't see a real use case for this feature.

I don't think such a driver is almighty.
IIRC, a balloon driver can cause fragmentation on a 24-7 system.

In addition, I have heard that memory hotplug would be useful for reducing
the power consumption of DIMMs.

I have to admit that memory hotplug has many issues, but I would like to
solve them step by step.

Thanks.
--
Yasunori Goto

Andi Kleen

Sep 6, 2008, 5:00:16 AM

On Sat, Sep 06, 2008 at 04:06:38PM +0900, Yasunori Goto wrote:
> > not.
> >
> > This means I don't see a real use case for this feature.
>
> I don't think its driver is almighty.
> IIRC, balloon driver can be cause of fragmentation for 24-7 system.

Sure the balloon driver can be likely improved too, it's just
that I don't think a balloon driver should call into the function
the original patch in the series hooked up.

>
> In addition, I have heard that memory hotplug would be useful for reducing
> of power consumption of DIMM.

It's unclear that memory hotplug is the right model for DIMM power management.
The problem is that DIMMs are interleaved, so you again have to completely
free a quite large area. It's not much easier than node hotplug.

> I have to admit that memory hotplug has many issues, but I would like to

Let's call it "node" or "hardware" memory hot unplug, so that
no one confuses it with the easier VM based hot unplug or the really
easy hotadd.

> solve them step by step.

The question is if they are even solvable in a useful way.
I'm not sure it's that useful to start and then find out
that it doesn't work anyways.

-Andi

Ingo Molnar

Sep 6, 2008, 10:40:05 AM

* Yasunori Goto <y-g...@jp.fujitsu.com> wrote:

> I don't think its driver is almighty. IIRC, balloon driver can be
> cause of fragmentation for 24-7 system.
>
> In addition, I have heard that memory hotplug would be useful for
> reducing of power consumption of DIMM.
>
> I have to admit that memory hotplug has many issues, but I would like
> to solve them step by step.

What would be nice is to insert the information both during bootup and
in /proc/meminfo and 'free' output that hot-removable memory segments
are not generic free memory, it's currently a limited resource that
might or might not be sufficient to serve a given workload.

Perhaps even exclude it from 'total' memory reported by meminfo - to be
on the safe side of user expectations. In terms of user-space memory it
is already generic swappable memory but in terms of kernel-space
allocations it is not.

As i said it earlier in the thread, i certainly have no objections from
the x86 maintenance side - nothing is worse than a generic kernel
feature only available on certain less frequently used platforms. Memory
hotplug has been available for some time in the MM and it's not really
causing any maintenance trouble at the moment and it is not enabled by
default either.

Having said that, i have my doubts about its generic utility (the power
saving aspects are likely not realizable - nobody really wants DIMMs to
just sit there unused and the cost of dynamic migration is just
horrendous) - but as long as it's opt-in there's no reason to limit the
availability of an in-kernel feature artificially.

Removing those limitations of kernel-space allocations should indeed be
done in baby steps - and whether it's worth turning such memory into
completely generic kernel memory is an open question.

But the fact that a piece of memory is not fully generic is no reason
not to allow users to create special, capability-limited RAM resources
like they can already do via hugetlbfs or ramfs, as long as the
capability limitations are advertised clearly.

Yes, memory hotplug has limitations we all understand, but still it's an
arguably useful feature in some circumstances. If we never give a
feature a chance to evolve on the main Linux platform that 90%+ of our
users use, it won't ever be truly useful.

Please send the new patches against -git or -tip and we can put them
into a separate standalone feature topic and can test it on various x86
boxes and send them towards linux-next if Andrew agrees with that
process too.

Btw., it would be nice if memory hotplug had a self-test that could be
activated from the .config and would run autonomously (a bit like
rcu-torture): it would mark say 10% of all RAM as hot-pluggable during
bootup and would periodically hot-plug and hot-unplug that memory, every
10 seconds or 30 seconds or so, transparently. That would also test the
x86 architecture's pagetable init code, the page migration code, etc.
(Disabled by default and dependent on DEBUG_KERNEL && EXPERIMENTAL.)

Ingo

kamezaw...@jp.fujitsu.com

Sep 6, 2008, 12:10:16 PM

----- Original Message -----
>>Having said that, i have my doubts about its generic utility (the power
>>saving aspects are likely not realizable - nobody really wants DIMMs to
>>just sit there unused and the cost of dynamic migration is just
>>horrendous) - but as long as it's opt-in there's no reason to limit the
>>availability of an in-kernel feature artificially.
>

>Nobody ? maybe just a trade-off problem in user side.
>Even without DIMM hotplug or DIMM's power save mode, making a DIMM idle
>is of no use ? I think memory consumes much power when it used.
>Memory Hotplug and ZONE_MOVABLE can make some memory idle.
>(I'm sorry if my thinking is wrong.)
>
But I have to point out that HDD access consumes far more power than memory.
That trade-off depends on usage, anyway.

Thanks,
-Kame

kamezaw...@jp.fujitsu.com

Sep 6, 2008, 12:10:17 PM

----- Original Message -----
>* Yasunori Goto <y-g...@jp.fujitsu.com> wrote:
>
>> I don't think its driver is almighty. IIRC, balloon driver can be
>> cause of fragmentation for 24-7 system.
>>
>> In addition, I have heard that memory hotplug would be useful for
>> reducing of power consumption of DIMM.
>>
>> I have to admit that memory hotplug has many issues, but I would like
>> to solve them step by step.
>
>What would be nice is to insert the information both during bootup and
>in /proc/meminfo and 'free' output that hot-removable memory segments
>are not generic free memory, it's currently a limited resource that
>might or might not be sufficient to serve a given workload.
>
>Perhaps even exclude it from 'total' memory reported by meminfo - to be
>on the safe side of user expectations. In terms of user-space memory it
>is already generic swappable memory but in terms of kernel-space
>allocations it is not.
>

I wonder why nobody talks about ZONE_MOVABLE... When I wrote memory
hotplug, I assumed the help of ZONE_MOVABLE and SPARSEMEM. It is shown in
meminfo. (I think memory hotplug is useful only when ZONE_MOVABLE is used.)

Most of the problems Goto wrote about concern the placement of memmap,
pgdat, and zones. One example is that "when SPARSEMEM_VMEMMAP is enabled,
memmap is not removed even when memory is removed."

>As i said it earlier in the thread, i certainly have no objections from
>the x86 maintenance side - nothing is worse than a generic kernel
>feature only available on certain less frequently used platforms. Memory
>hotplug has been available for some time in the MM and it's not really
>causing any maintenance trouble at the moment and it is not enabled by
>default either.
>
>Having said that, i have my doubts about its generic utility (the power
>saving aspects are likely not realizable - nobody really wants DIMMs to
>just sit there unused and the cost of dynamic migration is just
>horrendous) - but as long as it's opt-in there's no reason to limit the
>availability of an in-kernel feature artificially.

Nobody? Maybe it is just a trade-off on the user side.

Even without DIMM hotplug or a DIMM power-save mode, is making a DIMM idle
of no use? I think memory consumes a lot of power when it is used.
Memory hotplug and ZONE_MOVABLE can make some memory idle.
(I'm sorry if my thinking is wrong.)

>

>Removing those limitations of kernel-space allocations should indeed be
>done in baby steps - and whether it's worth turning such memory into
>completely generic kernel memory is an open question.
>

I think generic kernel space memory hotplug will never be available.

>But the fact that a piece of memory is not fully generic is no reason
>not to allow users to create special, capability-limited RAM resources
>like they can already do via hugetlbfs or ramfs, as long as the the
>capability limitations are advertised clearly.
>

Hmm, adding a feature like
- offline some memory at boot.
- online-memory-as-hugetlb mode

is useful for generic pc users ?

Regards,
-Kame

Ingo Molnar

Sep 6, 2008, 12:20:11 PM

* kamezaw...@jp.fujitsu.com <kamezaw...@jp.fujitsu.com> wrote:

> > Removing those limitations of kernel-space allocations should indeed
> > be done in baby steps - and whether it's worth turning such memory
> > into completely generic kernel memory is an open question.
>
> I think generic kernel space memory hotplug will never be available.

yeah, most likely. (It's possible technically even on a native kernel -
just very expensive to various aspects of the kernel.)

> > But the fact that a piece of memory is not fully generic is no
> > reason not to allow users to create special, capability-limited RAM
> > resources like they can already do via hugetlbfs or ramfs, as long
> > as the the capability limitations are advertised clearly.
>
> Hmm, adding a feature like
> - offline some memory at boot.
> - online-memory-as-hugeltb mode
>
> is useful for generic pc users ?

yeah - it's actually the way how hugetlb should be done. Plus expand
gbpages to hugetlbfs and hotplug memory on Barcelona CPUs and you can do
user-space apps that can run for a long time without any TLB misses.
_That_ might make sense to explore in practice. (i'm not holding my
breath though, TLB misses are _fast_ on the best x86 CPUs.)

But we won't be able to make such experiments without having the
capability on x86. So i'd like to break the catch-22 by accepting all
this into arch/x86, it certainly is simple and makes some sense, it's
just that i'm not that convinced about it personally at the moment.

So feel free to turn it all into a killer feature (make hugetlb backed
memory transparent to user-space, etc. etc.) that high-performance
computing users strive for and all that will change. Please send the
reshaped patches so we can move past the 'what if' discussion phase ;-)

Ingo

Nick Piggin

Sep 8, 2008, 2:00:19 AM

On Saturday 06 September 2008 18:53, Andi Kleen wrote:
> On Sat, Sep 06, 2008 at 04:06:38PM +0900, Yasunori Goto wrote:
> > > not.
> > >
> > > This means I don't see a real use case for this feature.
> >
> > I don't think its driver is almighty.
> > IIRC, balloon driver can be cause of fragmentation for 24-7 system.
>
> Sure the balloon driver can be likely improved too, it's just
> that I don't think a balloon driver should call into the function
> the original patch in the series hooked up.
>
> > In addition, I have heard that memory hotplug would be useful for
> > reducing of power consumption of DIMM.
>
> It's unclear that memory hotplug is the right model for DIMM power
> management. The problem is that DIMMs are interleaved, so you again have to
> completely free a quite large area. It's not much easier than node hotplug.
>
> > I have to admit that memory hotplug has many issues, but I would like to
>
> Let's call it "node" or "hardware" memory hot unplug, not that
> anyone confuses it with the easier VM based hot unplug or the really
> easy hotadd.
>
> > solve them step by step.
>
> The question is if they are even solvable in a useful way.
> I'm not sure it's that useful to start and then find out
> that it doesn't work anyways.

You use non-linear mappings for the kernel, so that kernel data is
not tied to a specific physical address. AFAIK, that is the only way
to really do it completely (like the fragmentation problem).

Of course, I don't think that would be a good idea to do that in the
foreseeable future.

Andi Kleen

Sep 8, 2008, 5:40:06 AM

> You use non-linear mappings for the kernel, so that kernel data is
> not tied to a specific physical address. AFAIK, that is the only way
> to really do it completely (like the fragmentation problem).

Even with that there are lots of issues, like keeping track of
DMAs or handling executing kernel code.

>
> Of course, I don't think that would be a good idea to do that in the
> forseeable future.

Agreed.

-Andi

--
a...@linux.intel.com

Nick Piggin

Sep 8, 2008, 5:50:06 AM

On Monday 08 September 2008 19:36, Andi Kleen wrote:
> > You use non-linear mappings for the kernel, so that kernel data is
> > not tied to a specific physical address. AFAIK, that is the only way
> > to really do it completely (like the fragmentation problem).
>
> Even with that there are lots of issues, like keeping track of
> DMAs or handling executing kernel code.

Right, but the "high level" software solution is to have nonlinear
kernel mappings. Executing kernel code should not be so hard because
it could be handled just like executing user code (ie. the CPU that
is executing will subsequently fault and be blocked until the
relocation is complete).

DMAs aren't trivial at all, but I guess there could be say, a method
to submit and revoke areas of memory for DMA, and the submit would
block if the memory is currently being relocated underneath it (then
it would be able to find the new address).

Anyway, whatever the case, yeah I'm not trying to say it is trivial
at all. Even without thinking about DMA it would be costly.

> > Of course, I don't think that would be a good idea to do that in the
> > forseeable future.
>
> Agreed.

Same as the "anti-frag" patches. We must not proceed with this kind of
thing on the justification that "in future we'll be able to unplug any
bit of memory". Because it is not just a matter of logical steps to
reach that point, but basically a fundamental rethink of how the kernel
memory mapping should work.

Other realistic justifications are OK, but if someone wants to unplug
everything, then please put effort into *first* making the kernel
mapping nonlinear, and then we can look at the complexity and
performance costs of that fundamental step.

Andi Kleen

Sep 8, 2008, 6:30:18 AM

On Mon, Sep 08, 2008 at 07:46:30PM +1000, Nick Piggin wrote:
> On Monday 08 September 2008 19:36, Andi Kleen wrote:
> > > You use non-linear mappings for the kernel, so that kernel data is
> > > not tied to a specific physical address. AFAIK, that is the only way
> > > to really do it completely (like the fragmentation problem).
> >
> > Even with that there are lots of issues, like keeping track of
> > DMAs or handling executing kernel code.
>
> Right, but the "high level" software solution is to have nonlinear
> kernel mappings. Executing kernel code should not be so hard because
> it could be handled just like executing user code (ie. the CPU that
> is executing will subsequently fault and be blocked until the
> relocation is complete).

First, blocking arbitrary code is hard. There are some code parts
which are not allowed to block arbitrarily. Machine check or NMI
handlers come to mind, but there are likely more.

Then that would be essentially a hypervisor or micro kernel approach.
e.g. Xen does that already kind of, but even there it would
be quite hard to do fully in a general way. And for hardware hotplug
only the fully general way is actually useful unfortunately.

-Andi

Nick Piggin

Sep 8, 2008, 7:30:20 AM

On Monday 08 September 2008 20:30, Andi Kleen wrote:
> On Mon, Sep 08, 2008 at 07:46:30PM +1000, Nick Piggin wrote:
> > On Monday 08 September 2008 19:36, Andi Kleen wrote:
> > > > You use non-linear mappings for the kernel, so that kernel data is
> > > > not tied to a specific physical address. AFAIK, that is the only way
> > > > to really do it completely (like the fragmentation problem).
> > >
> > > Even with that there are lots of issues, like keeping track of
> > > DMAs or handling executing kernel code.
> >
> > Right, but the "high level" software solution is to have nonlinear
> > kernel mappings. Executing kernel code should not be so hard because
> > it could be handled just like executing user code (ie. the CPU that
> > is executing will subsequently fault and be blocked until the
> > relocation is complete).
>
> First, blocking arbitrary code is hard. There are some parts of the code
> which are not allowed to block arbitrarily. Machine check or NMI
> handlers come to mind, but there are likely more.

Sorry, by "block", I really mean spin I guess. I mean that the CPU will
be forced to stop executing due to the page fault during this sequence:

for prot RO:
alloc new page
memcpy(new, old)
ptep_clear_flush(ptep) <--- from here
set_pte(ptep, newpte) <--- until here

for prot RW, the window also would include the memcpy, however if that
adds too much latency for execute/reads, then it can be mapped RO first,
then memcpy, then flushed and switched.
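
The read-only sequence above can be modelled in userspace as a rough sketch. This is not kernel code: the "PTE" is assumed to be a plain atomic pointer, and `toy_pte_t`, `toy_read()` and `toy_migrate_ro()` are hypothetical names invented for illustration. The two atomic stores only stand in for ptep_clear_flush() and set_pte(); no TLB is involved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy "PTE": a pointer to the backing page; NULL means "not present". */
typedef _Atomic(unsigned char *) toy_pte_t;

/* Reader side: spin while the entry is clear, the way a faulting CPU
 * would be held off until the new mapping is installed. */
static unsigned char toy_read(toy_pte_t *pte, size_t off)
{
    unsigned char *page;

    assert(off < TOY_PAGE_SIZE);
    while ((page = atomic_load(pte)) == NULL)
        ; /* "fault" and retry */
    return page[off];
}

/* Migrate a read-only page: alloc, copy, clear, set.
 * Returns the old page (the caller frees it once no reader can still
 * hold it), or NULL if the allocation failed and nothing changed. */
static unsigned char *toy_migrate_ro(toy_pte_t *pte)
{
    unsigned char *old = atomic_load(pte);
    unsigned char *fresh = malloc(TOY_PAGE_SIZE);

    if (fresh == NULL)
        return NULL;
    memcpy(fresh, old, TOY_PAGE_SIZE);       /* page is RO, so the copy is stable */
    atomic_store(pte, (unsigned char *)NULL); /* stands in for ptep_clear_flush() */
    atomic_store(pte, fresh);                 /* stands in for set_pte() */
    return old;
}
```

In this model the window in which a reader spins is exactly the gap between the two stores. For a writable page the copy would also have to sit inside that window, which is why the text suggests mapping it read-only first.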

> Then that would be essentially a hypervisor or micro kernel approach.

What would be? Blocking in interrupts? Or non-linear kernel mapping in
general? I don't think anyone disputes that nonlinear kernel mapping is
the only way to defragment (for unplug or large allocations) arbitrary
physical memory with any sort of guarantee. In the future, if TLB costs
grow very much larger, I think this might be worth considering.

But until that becomes inevitable, I really don't want to hack the VM
with crap like transparent variable order mappings etc. but rather
"encourage" CPU manufacturers to have big fast TLBs :)

> e.g. Xen kind of does that already, but even there it would
> be quite hard to do fully in a general way. And for hardware hotplug
> only the fully general way is actually useful, unfortunately.

Yeah, I don't really get the hardware hotplug thing. For reliability or
anything like that, it should all be done in hardware (e.g. a warm/hot spare
memory module). For power I guess there is some argument, but I would prefer
to wait the trends out longer before committing to something big:
non-volatile RAM replacing DRAM, for example, might be achieved in the
future.

But if anybody disagrees, they are sure free to implement non-linear
kernel mappings and physical defragmentation and shut me up with
real numbers!

Andi Kleen

Sep 8, 2008, 7:30:17 AM

> Sorry, by "block", I really mean spin I guess. I mean that the CPU will
> be forced to stop executing due to the page fault during this sequence:

It's hard for NMIs at least. They cannot execute faults.

In the end you would need to define a core kernel which
cannot be remapped and the rest which can and you end up
with even more micro kernel like mess.

> ptep_clear_flush(ptep) <--- from here
> set_pte(ptep, newpte) <--- until here
>
> for prot RW, the window also would include the memcpy, however if that
> adds too much latency for execute/reads, then it can be mapped RO first,
> then memcpy, then flushed and switched.
>
>
> > Then that would be essentially a hypervisor or micro kernel approach.
>
> What would be? Blocking in interrupts? Or non-linear kernel mapping in

Well in general someone remapping all the memory beyond you.
That's essentially a hypervisor in my book.

-Andi

Nick Piggin

Sep 8, 2008, 9:50:08 AM

On Monday 08 September 2008 21:30, Andi Kleen wrote:
> > Sorry, by "block", I really mean spin I guess. I mean that the CPU will
> > be forced to stop executing due to the page fault during this sequence:
>
> It's hard for NMIs at least. They cannot execute faults.

Well, just for executing code (and reading RO data), then it shouldn't
matter at all actually if the CPU starts executing from the new page
or the old page, so long as there is a way to quiesce NMIs before freeing
the old page.

So the NMI can run, and read data, but it may have a problem with stores.
At least, some kind of redesign of NMI handlers might be required so that
they can make a note of the pending operation and try to do something
sane in that case. Or, there could be a small region of memory; a page or
two, which does not get migrated and NMIs can write to it. I don't think
you need to go so far as saying the entire kernel image must be
non-movable just for NMIs.

> In the end you would need to define a core kernel which
> cannot be remapped and the rest which can and you end up
> with even more micro kernel like mess.

Are there any important NMIs that really can't fit with this?

> > ptep_clear_flush(ptep) <--- from here
> > set_pte(ptep, newpte) <--- until here
> >
> > for prot RW, the window also would include the memcpy, however if that
> > adds too much latency for execute/reads, then it can be mapped RO first,
> > then memcpy, then flushed and switched.
> >
> > > Then that would be essentially a hypervisor or micro kernel approach.
> >
> > What would be? Blocking in interrupts? Or non-linear kernel mapping in
>
> Well in general someone remapping all the memory beyond you.
> That's essentially a hypervisor in my book.

I don't see it. It is one of the things a hypervisor may do.
But anyway, call it what you will.

Badari Pulavarty

Sep 8, 2008, 6:00:14 PM

There is nothing architecture specific about remove_memory().
remove_memory() function is common for all architectures which
support hotplug memory remove. Instead of duplicating it in every
architecture, collapse them into arch neutral function.

Signed-off-by: Badari Pulavarty <pba...@us.ibm.com>

arch/ia64/mm/init.c | 17 -----------------
arch/powerpc/mm/mem.c | 17 -----------------
arch/s390/mm/init.c | 11 -----------
mm/memory_hotplug.c | 10 ++++++++++
4 files changed, 10 insertions(+), 45 deletions(-)

Index: linux-2.6.27-rc5/arch/ia64/mm/init.c
===================================================================
--- linux-2.6.27-rc5.orig/arch/ia64/mm/init.c 2008-08-28 15:52:02.000000000 -0700
+++ linux-2.6.27-rc5/arch/ia64/mm/init.c 2008-09-08 12:38:59.000000000 -0700
@@ -701,23 +701,6 @@ int arch_add_memory(int nid, u64 start,

return ret;
}
-#ifdef CONFIG_MEMORY_HOTREMOVE
-int remove_memory(u64 start, u64 size)
-{
- unsigned long start_pfn, end_pfn;
- unsigned long timeout = 120 * HZ;
- int ret;
- start_pfn = start >> PAGE_SHIFT;
- end_pfn = start_pfn + (size >> PAGE_SHIFT);
- ret = offline_pages(start_pfn, end_pfn, timeout);
- if (ret)
- goto out;
- /* we can free mem_map at this point */
-out:
- return ret;
-}
-EXPORT_SYMBOL_GPL(remove_memory);
-#endif /* CONFIG_MEMORY_HOTREMOVE */
#endif

/*
Index: linux-2.6.27-rc5/arch/powerpc/mm/mem.c
===================================================================
--- linux-2.6.27-rc5.orig/arch/powerpc/mm/mem.c 2008-08-28 15:52:02.000000000 -0700
+++ linux-2.6.27-rc5/arch/powerpc/mm/mem.c 2008-09-08 12:39:19.000000000 -0700
@@ -135,23 +135,6 @@ int arch_add_memory(int nid, u64 start,

return __add_pages(zone, start_pfn, nr_pages);
}
-
-#ifdef CONFIG_MEMORY_HOTREMOVE
-int remove_memory(u64 start, u64 size)
-{
- unsigned long start_pfn, end_pfn;
- int ret;
-
- start_pfn = start >> PAGE_SHIFT;
- end_pfn = start_pfn + (size >> PAGE_SHIFT);
- ret = offline_pages(start_pfn, end_pfn, 120 * HZ);
- if (ret)
- goto out;
- /* Arch-specific calls go here - next patch */
-out:
- return ret;
-}
-#endif /* CONFIG_MEMORY_HOTREMOVE */

#endif /* CONFIG_MEMORY_HOTPLUG */

/*

Index: linux-2.6.27-rc5/arch/s390/mm/init.c
===================================================================
--- linux-2.6.27-rc5.orig/arch/s390/mm/init.c 2008-08-28 15:52:02.000000000 -0700
+++ linux-2.6.27-rc5/arch/s390/mm/init.c 2008-09-08 12:40:41.000000000 -0700
@@ -189,14 +189,3 @@ int arch_add_memory(int nid, u64 start,
return rc;
}
#endif /* CONFIG_MEMORY_HOTPLUG */
-
-#ifdef CONFIG_MEMORY_HOTREMOVE
-int remove_memory(u64 start, u64 size)
-{
- unsigned long start_pfn, end_pfn;
-
- start_pfn = PFN_DOWN(start);
- end_pfn = start_pfn + PFN_DOWN(size);
- return offline_pages(start_pfn, end_pfn, 120 * HZ);
-}
-#endif /* CONFIG_MEMORY_HOTREMOVE */
Index: linux-2.6.27-rc5/mm/memory_hotplug.c
===================================================================
--- linux-2.6.27-rc5.orig/mm/memory_hotplug.c 2008-08-28 15:52:02.000000000 -0700
+++ linux-2.6.27-rc5/mm/memory_hotplug.c 2008-09-08 12:41:37.000000000 -0700
@@ -26,6 +26,7 @@
#include <linux/delay.h>
#include <linux/migrate.h>
#include <linux/page-isolation.h>
+#include <linux/pfn.h>

#include <asm/tlbflush.h>

@@ -849,6 +850,15 @@ failed_removal:

return ret;
}
+
+int remove_memory(u64 start, u64 size)
+{
+ unsigned long start_pfn, end_pfn;
+
+ start_pfn = PFN_DOWN(start);
+ end_pfn = start_pfn + PFN_DOWN(size);
+ return offline_pages(start_pfn, end_pfn, 120 * HZ);
+}
#else
int remove_memory(u64 start, u64 size)
{
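
The range arithmetic in the consolidated remove_memory() can be tried out in userspace with the sketch below. It assumes 4 KiB pages (PAGE_SHIFT of 12, which is per-architecture in the kernel) and redefines PFN_DOWN() locally to mirror the macro in <linux/pfn.h>. `struct pfn_range` and `range_to_pfns()` are hypothetical names for illustration only. It also shows why the removed ia64 and powerpc versions (`start >> PAGE_SHIFT`) and the s390 version (`PFN_DOWN(start)`) compute the same thing.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The real value is architecture-specific. */
#define PAGE_SHIFT 12

/* Local copy of the kernel's PFN_DOWN(): byte address -> page frame number. */
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

struct pfn_range {
    unsigned long start_pfn;
    unsigned long end_pfn; /* exclusive, as passed to offline_pages() */
};

/* Hypothetical helper: the same conversion remove_memory() performs
 * before calling offline_pages(). */
static struct pfn_range range_to_pfns(uint64_t start, uint64_t size)
{
    struct pfn_range r;

    r.start_pfn = (unsigned long)PFN_DOWN(start);
    r.end_pfn = r.start_pfn + (unsigned long)PFN_DOWN(size);
    assert(r.end_pfn >= r.start_pfn); /* catches wraparound in this sketch */
    return r;
}
```

Both shifts round down, so any sub-page remainder in `start` or `size` is dropped rather than rejected.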

Badari Pulavarty

Sep 8, 2008, 6:00:18 PM

Cleaned-up patch without remove_memory().
Depends on the "make remove_memory() arch neutral" patch.

Thanks,
Badari

Add memory hotremove config option to x86

Memory hotremove functionality can currently be configured into
the ia64, powerpc, and s390 kernels. This patch makes it possible
to configure the memory hotremove functionality into the x86
kernel as well.

Signed-off-by: Badari Pulavarty <pba...@us.ibm.com>
Signed-off-by: Gary Hade <gary...@us.ibm.com>
---
arch/x86/Kconfig | 4 ++++
1 file changed, 4 insertions(+)

Index: linux-2.6.27-rc5/arch/x86/Kconfig
===================================================================
--- linux-2.6.27-rc5.orig/arch/x86/Kconfig 2008-09-08 12:36:06.000000000 -0700
+++ linux-2.6.27-rc5/arch/x86/Kconfig 2008-09-08 12:45:30.000000000 -0700
@@ -1384,6 +1384,10 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
def_bool y
depends on X86_64 || (X86_32 && HIGHMEM)

+config ARCH_ENABLE_MEMORY_HOTREMOVE
+ def_bool y
+ depends on MEMORY_HOTPLUG
+
config HAVE_ARCH_EARLY_PFN_TO_NID
def_bool X86_64
depends on NUMA

Andrew Morton

Sep 8, 2008, 9:00:18 PM

On Mon, 08 Sep 2008 14:52:34 -0700
Badari Pulavarty <pba...@us.ibm.com> wrote:

> There is nothing architecture specific about remove_memory().
> remove_memory() function is common for all architectures which
> support hotplug memory remove. Instead of duplicating it in every
> architecture, collapse them into arch neutral function.
>
> Signed-off-by: Badari Pulavarty <pba...@us.ibm.com>
>
> arch/ia64/mm/init.c | 17 -----------------
> arch/powerpc/mm/mem.c | 17 -----------------
> arch/s390/mm/init.c | 11 -----------
> mm/memory_hotplug.c | 10 ++++++++++
> 4 files changed, 10 insertions(+), 45 deletions(-)

I spent some time trying to build-test this on ia64 and gave up. How
the heck do you turn on memory hotplug on ia64?

Randy Dunlap

Sep 8, 2008, 9:30:24 PM

On Mon, 8 Sep 2008 17:56:21 -0700 Andrew Morton wrote:

> On Mon, 08 Sep 2008 14:52:34 -0700
> Badari Pulavarty <pba...@us.ibm.com> wrote:
>
> > There is nothing architecture specific about remove_memory().
> > remove_memory() function is common for all architectures which
> > support hotplug memory remove. Instead of duplicating it in every
> > architecture, collapse them into arch neutral function.
> >
> > Signed-off-by: Badari Pulavarty <pba...@us.ibm.com>
> >
> > arch/ia64/mm/init.c | 17 -----------------
> > arch/powerpc/mm/mem.c | 17 -----------------
> > arch/s390/mm/init.c | 11 -----------
> > mm/memory_hotplug.c | 10 ++++++++++
> > 4 files changed, 10 insertions(+), 45 deletions(-)
>
> I spent some time trying to build-test this on ia64 and gave up. How
> the heck do you turn on memory hotplug on ia64?

After using the ia64 defconfig, all I had to do was enable the Sparse Memory
model instead of Discontiguous.

---
~Randy
Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA
http://linuxplumbersconf.org/

Yasunori Goto

Sep 8, 2008, 9:30:24 PM

> On Mon, 08 Sep 2008 14:52:34 -0700
> Badari Pulavarty <pba...@us.ibm.com> wrote:
>
> > There is nothing architecture specific about remove_memory().
> > remove_memory() function is common for all architectures which
> > support hotplug memory remove. Instead of duplicating it in every
> > architecture, collapse them into arch neutral function.
> >
> > Signed-off-by: Badari Pulavarty <pba...@us.ibm.com>
> >
> > arch/ia64/mm/init.c | 17 -----------------
> > arch/powerpc/mm/mem.c | 17 -----------------
> > arch/s390/mm/init.c | 11 -----------
> > mm/memory_hotplug.c | 10 ++++++++++
> > 4 files changed, 10 insertions(+), 45 deletions(-)
>
> I spent some time trying to build-test this on ia64 and gave up. How
> the heck do you turn on memory hotplug on ia64?
>

EXPORT_SYMBOL_GPL(remove_memory) is removed.
It is required by drivers/acpi/acpi_memhotplug.ko.

--
Yasunori Goto

Badari Pulavarty

Sep 9, 2008, 11:20:22 AM

On Tue, 2008-09-09 at 10:21 +0900, Yasunori Goto wrote:
> > On Mon, 08 Sep 2008 14:52:34 -0700
> > Badari Pulavarty <pba...@us.ibm.com> wrote:
> >
> > > There is nothing architecture specific about remove_memory().
> > > remove_memory() function is common for all architectures which
> > > support hotplug memory remove. Instead of duplicating it in every
> > > architecture, collapse them into arch neutral function.
> > >
> > > Signed-off-by: Badari Pulavarty <pba...@us.ibm.com>
> > >
> > > arch/ia64/mm/init.c | 17 -----------------
> > > arch/powerpc/mm/mem.c | 17 -----------------
> > > arch/s390/mm/init.c | 11 -----------
> > > mm/memory_hotplug.c | 10 ++++++++++
> > > 4 files changed, 10 insertions(+), 45 deletions(-)
> >
> > I spent some time trying to build-test this on ia64 and gave up. How
> > the heck do you turn on memory hotplug on ia64?
> >
>
> EXPORT_SYMBOL_GPL(remove_memory) is removed.
> It is required by drivers/acpi/acpi_memhotplug.ko.

Thanks for catching it. I forgot that it was being used
by ACPI. Since we didn't export it for ppc and s390,
I assumed it was safe to remove the export. Sorry!

Thanks,
Badari