Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

Tim Gardner

Feb 25, 2013, 10:32:41 AM

to linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, H. Peter Anvin, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com

On 02/25/2013 08:02 AM, Tim Gardner wrote:
> Is this an expected warning ? I'll boot a vanilla kernel just to be sure.
>
> rebased against ab7826595e9ec51a51f622c5fc91e2f59440481a in Linus' repo:
>

Same with a vanilla kernel, so it doesn't appear that any Ubuntu cruft
is having an impact:

[ 0.170435] ------------[ cut here ]------------
[ 0.170450] WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x71/0x84()
[ 0.170452] Hardware name: S2600CP
[ 0.170454] sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[ 0.156000] smpboot: Booting Node 1, Processors #1
[ 0.170455] Modules linked in:
[ 0.170460] Pid: 0, comm: swapper/1 Not tainted 3.8.0+ #1
[ 0.170461] Call Trace:
[ 0.170466] [<ffffffff810597bf>] warn_slowpath_common+0x7f/0xc0
[ 0.170473] [<ffffffff810598b6>] warn_slowpath_fmt+0x46/0x50
[ 0.170477] [<ffffffff816cc752>] topology_sane.isra.2+0x71/0x84
[ 0.170482] [<ffffffff816cc9de>] set_cpu_sibling_map+0x23f/0x436
[ 0.170487] [<ffffffff816ccd0c>] start_secondary+0x137/0x201
[ 0.170502] ---[ end trace 09222f596307ca1d ]---

rtg
--
Tim Gardner tim.g...@canonical.com

Don Morris

Feb 25, 2013, 4:27:58 PM

to Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, H. Peter Anvin, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

On 02/25/2013 10:32 AM, Tim Gardner wrote:
> On 02/25/2013 08:02 AM, Tim Gardner wrote:
>> Is this an expected warning ? I'll boot a vanilla kernel just to be sure.
>>
>> rebased against ab7826595e9ec51a51f622c5fc91e2f59440481a in Linus' repo:
>>
>
> Same with a vanilla kernel, so it doesn't appear that any Ubuntu cruft
> is having an impact:

Reproduced on a HP z620 workstation (E5-2620 instead of E5-2680, but
still Sandy Bridge, though I don't think that matters).

Bisection leads to:
# bad: [e8d1955258091e4c92d5a975ebd7fd8a98f5d30f] acpi, memory-hotplug:
parse SRAT before memblock is ready

Nothing terribly obvious leaps out as to *why* that reshuffling messes
up the cpu<-->node bindings, but I wanted to put this out there while
I poke around further. [Note that the SRAT: PXM -> APIC -> Node print
outs during boot are the same either way -- if you look at the APIC
numbers of the processors (from /proc/cpuinfo), the processors should
be assigned to the correct node, but they aren't.] cc'ing Tang Chen
in case this is obvious to him or he's already fixed it somewhere not
on Linus's tree yet.

Don Morris

>
> [ 0.170435] ------------[ cut here ]------------
> [ 0.170450] WARNING: at arch/x86/kernel/smpboot.c:324
> topology_sane.isra.2+0x71/0x84()
> [ 0.170452] Hardware name: S2600CP
> [ 0.170454] sched: CPU #1's llc-sibling CPU #0 is not on the same
> node! [node: 1 != 0]. Ignoring dependency.
> [ 0.156000] smpboot: Booting Node 1, Processors #1
> [ 0.170455] Modules linked in:
> [ 0.170460] Pid: 0, comm: swapper/1 Not tainted 3.8.0+ #1
> [ 0.170461] Call Trace:
> [ 0.170466] [<ffffffff810597bf>] warn_slowpath_common+0x7f/0xc0
> [ 0.170473] [<ffffffff810598b6>] warn_slowpath_fmt+0x46/0x50
> [ 0.170477] [<ffffffff816cc752>] topology_sane.isra.2+0x71/0x84
> [ 0.170482] [<ffffffff816cc9de>] set_cpu_sibling_map+0x23f/0x436
> [ 0.170487] [<ffffffff816ccd0c>] start_secondary+0x137/0x201
> [ 0.170502] ---[ end trace 09222f596307ca1d ]---
>
> rtg
>

Yinghai Lu

Feb 25, 2013, 5:51:03 PM

to Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

That commit is totally broken, and it should be reverted.

1. numa_init() is called several times, NOT just for SRAT, so these
     nodes_clear(numa_nodes_parsed)
     memset(&numa_meminfo, 0, sizeof(numa_meminfo))
   cannot simply be removed.
   Please consider that the sequence is: numaq, srat, amd, dummy.
   You need to make the fallback path work!

2. It simply splits acpi_numa_init() out into early_parse_srat().
   a. early_parse_srat() is NOT called on ia64, so you break ia64.
   b. for (i = 0; i < MAX_LOCAL_APIC; i++)
          set_apicid_to_node(i, NUMA_NO_NODE)
      is still left in numa_init(), so it will just clear the result from
      early_parse_srat(). It should be moved before that...

3. The patch TITLE is totally misleading: there is NO x86 in the title,
   but it changes x86 code.

4. It was not CC'd to TJ and the other NUMA guys...

Thanks

Yinghai
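
The ordering problem in point 2b can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: `early_parse`, `clear_apicid_map`, the two `node_of_cpu1_*` helpers and the toy constants are hypothetical names made up for the illustration. It shows why a clearing loop that runs after the early parse throws the parsed CPU-to-node mapping away.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_APIC 4      /* toy stand-in for MAX_LOCAL_APIC */
#define NO_NODE  (-1)   /* toy stand-in for NUMA_NO_NODE */

static int apicid_to_node[MAX_APIC];

/* Models the early SRAT pass: records which node each APIC ID is on. */
static void early_parse(void)
{
    assert(MAX_APIC >= 2);
    apicid_to_node[0] = 0;
    apicid_to_node[1] = 1;
}

/* Models the clearing loop that was left behind in the later init step. */
static void clear_apicid_map(void)
{
    for (size_t i = 0; i < MAX_APIC; i++)
        apicid_to_node[i] = NO_NODE;
}

/* Broken order: parse first, clear afterwards, so the result is lost. */
static int node_of_cpu1_broken(void)
{
    early_parse();
    clear_apicid_map();
    return apicid_to_node[1];
}

/* Intended order: clear first, then parse, so the mapping survives. */
static int node_of_cpu1_fixed(void)
{
    clear_apicid_map();
    early_parse();
    return apicid_to_node[1];
}
```

In the broken order the second CPU ends up with no node at all, which is the kind of lost binding that could later trip a topology sanity check.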

Tang Chen

Feb 25, 2013, 8:52:29 PM

to Yinghai Lu, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com

Hi Yinghai, Don,

OK, I see this. I'll fix it soon. :)

Thanks. :)

Martin Bligh

Feb 25, 2013, 10:22:05 PM

to Yinghai Lu, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Ingo Molnar, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

>>> 4, it does not CC to TJ and other numa guys...
>>

>> attached workaround the problem for now.
>> but it will assume NUMAQ would not have SRAT table.
>
> Martin, can you confirm that numaq does not have srat?

No, it's pre-SRAT. I forget the exact name of the table, but no SRAT until x440.

OTOH, you should probably feel free to break it by now, I can't
imagine they are any use to man nor beast any more.

M.

Yinghai Lu

Feb 25, 2013, 11:20:35 PM

to Martin Bligh, Ingo Molnar, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

On Mon, Feb 25, 2013 at 7:21 PM, Martin Bligh <mbl...@mbligh.org> wrote:
>>>> 4, it does not CC to TJ and other numa guys...
>>>
>>> attached workaround the problem for now.
>>> but it will assume NUMAQ would not have SRAT table.
>>
>> Martin, can you confirm that numaq does not have srat?
>
> No, it's pre-SRAT. I forget the exact name of the table, but no SRAT until x440.
>
> OTOH, you should probably feel free to break it by now, I can't
> imagine they are any use to man nor beast any more.

Do you mean we can remove numaq x86 32bit code now?

Thanks

Yinghai

Martin Bligh

Feb 25, 2013, 11:51:38 PM

to Yinghai Lu, Ingo Molnar, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

> Do you mean we can remove numaq x86 32bit code now?

Wouldn't bother me at all. The machine is from 1995, end of life c. 2000?
Was useful in the early days of getting NUMA up and running on Linux,
but is now too old to be a museum piece, really.

M.

Tang Chen

Feb 26, 2013, 1:10:50 AM

to Martin Bligh, Yinghai Lu, Ingo Molnar, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com

On 02/26/2013 12:51 PM, Martin Bligh wrote:
>> Do you mean we can remove numaq x86 32bit code now?
>
> Wouldn't bother me at all. The machine is from 1995, end of life c. 2000?
> Was useful in the early days of getting NUMA up and running on Linux,
> but is now too old to be a museum piece, really.
>
> M.
>

Hi Martin, Yinghai,

It was my mistake: I failed to keep the numa_init() fallback path working,
and forgot to call early_parse_srat() on ia64. Sorry for breaking the
other platforms. :)

So now, is Yinghai's patch enough for this problem?
Or can we encapsulate the following cleanup work into one function?

+       for (i = 0; i < MAX_LOCAL_APIC; i++)
+               set_apicid_to_node(i, NUMA_NO_NODE);
+       nodes_clear(numa_nodes_parsed);
+       memset(&numa_meminfo, 0, sizeof(numa_meminfo));

Thanks. :)
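
A minimal userspace sketch of what such an encapsulated cleanup could look like, combined with the fallback sequence Yinghai described. Everything here is an assumption for illustration: `numa_reset_state`, `numa_init_with_fallback`, `struct toy_meminfo` and the two stub methods are hypothetical names, and the real kernel data structures are richer than these toy ones.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_APIC   8      /* toy stand-in for MAX_LOCAL_APIC */
#define NO_NODE    (-1)   /* toy stand-in for NUMA_NO_NODE */
#define MAX_BLOCKS 4

struct toy_meminfo {
    int nr_blocks;
    unsigned long start[MAX_BLOCKS];
    unsigned long end[MAX_BLOCKS];
};

static int apicid_to_node[MAX_APIC];
static unsigned long nodes_parsed;      /* one bit per node */
static struct toy_meminfo meminfo;

/* The three cleanup steps from the mail, gathered in one place. */
static void numa_reset_state(void)
{
    for (size_t i = 0; i < MAX_APIC; i++)
        apicid_to_node[i] = NO_NODE;
    nodes_parsed = 0;
    memset(&meminfo, 0, sizeof(meminfo));
}

typedef int (*numa_init_fn)(void);

/* Try each method in order, resetting state before every attempt so a
 * failed method cannot leak stale mappings into the next one. Returns
 * the index of the method that succeeded, or -1 if all of them failed. */
static int numa_init_with_fallback(const numa_init_fn *methods, size_t n)
{
    assert(methods != NULL);
    for (size_t i = 0; i < n; i++) {
        numa_reset_state();
        if (methods[i]() == 0)
            return (int)i;
    }
    return -1;
}

/* Hypothetical method that leaves partial state behind, then fails. */
static int failing_method(void)
{
    apicid_to_node[1] = 1;
    nodes_parsed |= 1UL << 1;
    meminfo.nr_blocks = 1;
    return -1;
}

/* Hypothetical last resort: a single node covering every CPU. */
static int dummy_method(void)
{
    for (size_t i = 0; i < MAX_APIC; i++)
        apicid_to_node[i] = 0;
    nodes_parsed |= 1UL << 0;
    return 0;
}
```

With the reset done inside the loop, the partial state written by the failing method is gone by the time the dummy method runs.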

Yinghai Lu

Feb 26, 2013, 1:57:46 AM

to Tang Chen, Martin Bligh, Ingo Molnar, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com

On Mon, Feb 25, 2013 at 10:09 PM, Tang Chen <tang...@cn.fujitsu.com> wrote:
> On 02/26/2013 12:51 PM, Martin Bligh wrote:
>>>
>>> Do you mean we can remove numaq x86 32bit code now?
>>
>>
>> Wouldn't bother me at all. The machine is from 1995, end of life c. 2000?
>> Was useful in the early days of getting NUMA up and running on Linux,
>> but is now too old to be a museum piece, really.
>>
>> M.
>>
>
> Hi Martin, Yinghai,
>
> It was me that I failed to make numa_init() fall back path working, and
> forgot
> to call early_parse_srat in ia64. Sorry for the breaking of other platform.
> :)
>
> So now, is Yinghai's patch enough for this problem ?
> Or we can encapsulate the following clear up work into one function ?
>
>
> + for (i = 0; i < MAX_LOCAL_APIC; i++)
> + set_apicid_to_node(i, NUMA_NO_NODE);
> + nodes_clear(numa_nodes_parsed);
> + memset(&numa_meminfo, 0, sizeof(numa_meminfo));
>
>

That is a temporary workaround, and your patch plus this workaround make
the x86 ACPI NUMA init too messy.

I don't see the point of hacking SRAT to make memory hotplug work.

Did you guys check whether you could use PMTT from the ACPI spec instead?

Yinghai

Tang Chen

Feb 26, 2013, 2:30:22 AM

to Yinghai Lu, Martin Bligh, Ingo Molnar, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com

On 02/26/2013 02:57 PM, Yinghai Lu wrote:
> That is temporary workaround and your patch and this workaround make
> x86 acpi numa init too messy.
>
> I don't see the point to hack SRAT to make memory hotplug working.
>
> Do you guys check and use PMTT in ACPI spec instead?

Hi Yinghai,

Thanks for the suggestion. :)

The reason we are using SRAT is that we need the hot-pluggable bit in SRAT.
I didn't find such info in PMTT or elsewhere.

We use SRAT this way to satisfy users who don't want to specify physical
address ranges on the kernel command line. They want to use SRAT to
determine which memory is hot-pluggable, and which is not.

To achieve this, we have to ensure we have the SRAT info before memblock
starts to allocate memory, so that we can prevent memblock from allocating
memory in the hot-pluggable area. That is why I have to parse SRAT earlier.

I don't think the code is that messy. I think we can encapsulate the cleanup
job into one function, and call it where it is needed.

What do you think?

Thanks. :)
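
The idea of keeping early allocations out of hot-pluggable memory can be sketched as a toy range search. This is not memblock; `struct range`, `overlaps_any` and `find_unmovable` are hypothetical names, and the simple top-down scan is just one assumed way to pick an address once the hot-pluggable ranges are known.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range {
    uint64_t start;   /* inclusive */
    uint64_t end;     /* exclusive */
};

/* Returns nonzero if [start, end) intersects any range in the list. */
static int overlaps_any(uint64_t start, uint64_t end,
                        const struct range *list, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (start < list[i].end && list[i].start < end)
            return 1;
    return 0;
}

/* Top-down search in [lo, hi) for `size` bytes, aligned to `align`, that
 * avoid every hot-pluggable range. Returns the start address, or
 * UINT64_MAX if nothing fits. */
static uint64_t find_unmovable(uint64_t lo, uint64_t hi,
                               uint64_t size, uint64_t align,
                               const struct range *hotplug, size_t n)
{
    assert(align != 0 && size != 0);
    if (hi < lo || hi - lo < size)
        return UINT64_MAX;

    uint64_t addr = (hi - size) / align * align;
    for (;;) {
        if (addr < lo)
            return UINT64_MAX;
        if (!overlaps_any(addr, addr + size, hotplug, n))
            return addr;
        if (addr < align)
            return UINT64_MAX;   /* cannot step down any further */
        addr -= align;
    }
}
```

The search can only skip hot-pluggable ranges it already knows about, which is why the hot-pluggable information has to be available before the first early allocation is made.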

Yasuaki Ishimatsu

Feb 26, 2013, 2:55:04 AM

to Yinghai Lu, Tang Chen, Martin Bligh, Ingo Molnar, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com

Hi Yinghai,

2013/02/26 15:57, Yinghai Lu wrote:
> On Mon, Feb 25, 2013 at 10:09 PM, Tang Chen <tang...@cn.fujitsu.com> wrote:
>> On 02/26/2013 12:51 PM, Martin Bligh wrote:
>>>>
>>>> Do you mean we can remove numaq x86 32bit code now?
>>>
>>>
>>> Wouldn't bother me at all. The machine is from 1995, end of life c. 2000?
>>> Was useful in the early days of getting NUMA up and running on Linux,
>>> but is now too old to be a museum piece, really.
>>>
>>> M.
>>>
>>
>> Hi Martin, Yinghai,
>>
>> It was me that I failed to make numa_init() fall back path working, and
>> forgot
>> to call early_parse_srat in ia64. Sorry for the breaking of other platform.
>> :)
>>
>> So now, is Yinghai's patch enough for this problem ?
>> Or we can encapsulate the following clear up work into one function ?
>>
>>
>> + for (i = 0; i < MAX_LOCAL_APIC; i++)
>> + set_apicid_to_node(i, NUMA_NO_NODE);
>> + nodes_clear(numa_nodes_parsed);
>> + memset(&numa_meminfo, 0, sizeof(numa_meminfo));
>>
>>
>
> That is temporary workaround and your patch and this workaround make
> x86 acpi numa init too messy.
>
> I don't see the point to hack SRAT to make memory hotplug working.
>
> Do you guys check and use PMTT in ACPI spec instead?

I read the PMTT specification in ACPI spec revision 5.0, but this table
does not carry hot-pluggable information, so we cannot tell from it which
memory devices can be hot-plugged.

Thanks,
Yasuaki Ishimatsu

Yinghai Lu

Feb 26, 2013, 4:36:52 PM

to Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Thomas Renninger, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

c. It breaks ACPI_TABLE_OVERRIDE... as the ACPI table scan is moved
earlier, before the override from the initrd is settled.

>
> 3. that patch TITLE is total misleading, there is NO x86 in the title,
> but it changes
> to x86 code.
>
> 4, it does not CC to TJ and other numa guys...

Yinghai Lu

Feb 26, 2013, 5:44:56 PM

to Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Thomas Renninger, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

After looking at the code more, I think the theory of not letting the
kernel use RAM in the hotplug area is not right.

After that commit, the following cannot use movable RAM:
1. real_mode code... well, funny: could legacy cpu0 [0,1M) be hot-removed?
2. dma_contiguous?
3. log buffer ring.
4. initrd... why? It will be freed after booting, so it could be on movable...
5. crashkernel for kdump...: looks like we cannot put the kdump kernel
   above 4G anymore.
6. initmem_init: it will allocate page tables to set up the kernel mapping
   for memory...; it should be with BRK and near the end of max_pfn...

If a node is hot-pluggable, the memory-related stuff like page tables and
vmemmap could be on that node without problem, and should be on that node.

Assume the first cpu only has 1G of RAM, and the other 31 sockets each have
a bunch of RAM, and those cpus with RAM can be hot-added and hot-removed.
Now you want to put the page tables and vmemmap on the first node.
The system would not boot, as there is not enough memory to cover the whole
system RAM.

e8d1955258091e4c92d5a975ebd7fd8a98f5d30f and related commits should just be
reverted now.

Thanks

Yinghai
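
A rough worked calculation of the 1G concern above, as a sketch only. It assumes 4 KiB base pages and a 64-byte `struct page` (a common size on x86_64, but configuration dependent), and it ignores page tables and everything else the kernel needs; `vmemmap_bytes` and `vmemmap_fits` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES   4096ULL   /* assumed base page size */
#define STRUCT_PAGE_BYTES 64ULL     /* assumed sizeof(struct page) */

/* Bytes of vmemmap needed to describe ram_bytes of memory. */
static uint64_t vmemmap_bytes(uint64_t ram_bytes)
{
    return ram_bytes / PAGE_SIZE_BYTES * STRUCT_PAGE_BYTES;
}

/* Nonzero if the vmemmap for the whole system would fit in the boot
 * node alone (ignoring every other consumer of that node's memory). */
static int vmemmap_fits(uint64_t boot_node_bytes, uint64_t total_ram_bytes)
{
    assert(boot_node_bytes <= total_ram_bytes);
    return vmemmap_bytes(total_ram_bytes) <= boot_node_bytes;
}
```

Under these assumptions the vmemmap costs 1/64 of the RAM it describes, so a 1 GiB boot node is already fully consumed by the vmemmap of a 64 GiB system, before anything else is allocated.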

Yasuaki Ishimatsu

Feb 26, 2013, 7:53:53 PM

to Yinghai Lu, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Thomas Renninger, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

If you use "movablemem_map=srat", the above memory cannot be placed in
movable memory. But in my understanding, current Linux cannot move the
above memory, so it should not be placed in movable memory.

>
> If node is hotplugable, the mem related stuff like page table and
> vmemmap could be
> on the that node without problem and should be on that node.
>

> assume first cpu only have 1G ram, and other 31 socket will have bunch of ram
> and those cpu with ram could be hotadd and hotremoved.
> Now you want to put page table and vmemmap on first node.
> The system would not boot as not enough memory for cover whole system RAM.

Even if we solve the issues you mention above, the system cannot boot.
In this case, the user should:
o add RAM to the first cpu
o decrease the hot-pluggable RAM by:
  - changing the hot-pluggable information in SRAT
  - using movablemem_map=nn[KMG]@ss[KMG]

Thanks,
Yasuaki Ishimatsu

Tang Chen

Feb 26, 2013, 9:14:58 PM

to Yinghai Lu, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Thomas Renninger, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com

Hi Yinghai,

Please see below. :)

AFAIK, the Linux kernel currently cannot migrate memory used by the kernel
itself, so any memory used by the kernel should not be in the movable area.

>
> If node is hotplugable, the mem related stuff like page table and
> vmemmap could be
> on the that node without problem and should be on that node.

page tables and vmemmap are kernel memory. They should not be movable, I
think.

>
> assume first cpu only have 1G ram, and other 31 socket will have bunch of ram
> and those cpu with ram could be hotadd and hotremoved.
> Now you want to put page table and vmemmap on first node.
> The system would not boot as not enough memory for cover whole system RAM.

Yes, you are right. And a more extreme situation has been talked about
by HPA.

"If all the memory is hot-pluggable, then the kernel won't be able to boot."

So, please refer to commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb:
acpi, memory-hotplug: support getting hotplug info from SRAT

I have excluded all the memory reserved by memblock: any node that has
memory reserved by memblock will be set to un-hot-pluggable, which means
we will have enough memory (all the memory on the node) to boot the kernel.
So I think the problem you are talking about has been solved.
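
The rule described here (a node that already holds memblock-reserved memory is treated as not hot-pluggable) can be sketched as follows. This is a simplified userspace model with hypothetical names (`struct toy_node`, `struct mem_range`, `pin_nodes_with_reservations`), not the code from the commit referenced above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mem_range {
    uint64_t start;   /* inclusive */
    uint64_t end;     /* exclusive */
};

struct toy_node {
    struct mem_range mem;
    int hotpluggable;         /* as reported by firmware */
};

/* Clear the hotpluggable flag on every node that overlaps a reserved
 * range. Returns the number of nodes that remain hot-pluggable. */
static size_t pin_nodes_with_reservations(struct toy_node *nodes,
                                          size_t nr_nodes,
                                          const struct mem_range *reserved,
                                          size_t nr_reserved)
{
    size_t still_hotpluggable = 0;

    assert(nodes != NULL);
    for (size_t i = 0; i < nr_nodes; i++) {
        for (size_t j = 0; j < nr_reserved; j++) {
            if (reserved[j].start < nodes[i].mem.end &&
                nodes[i].mem.start < reserved[j].end) {
                nodes[i].hotpluggable = 0;
                break;
            }
        }
        if (nodes[i].hotpluggable)
            still_hotpluggable++;
    }
    return still_hotpluggable;
}
```

Note that this only guarantees the kernel the memory of the nodes it has already touched; it does not by itself address how much memory those nodes need to hold.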

Yinghai Lu

Feb 26, 2013, 9:24:31 PM

to Tang Chen, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Thomas Renninger, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com

That depends.

The initrd will be freed later, so it could be put anywhere under
max_pfn during boot.

>
>
>>
>> If node is hotplugable, the mem related stuff like page table and
>> vmemmap could be
>> on the that node without problem and should be on that node.
>
>
> page tables and vmemmap are kernel memory. They should not be movable, I
> think.

Why do you need to migrate the page tables and vmemmap for a memory range
that is going to be offlined?

>
>
>>
>> assume first cpu only have 1G ram, and other 31 socket will have bunch of
>> ram
>> and those cpu with ram could be hotadd and hotremoved.
>> Now you want to put page table and vmemmap on first node.
>> The system would not boot as not enough memory for cover whole system RAM.
>
>
> Yes, you are right. And a more extreme situation has been talked about by
> HPA.
>
> "If all the memory is hot-pluggable, then the kernel won't be able to boot."
>
> So, please refer to commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb:
> acpi, memory-hotplug: support getting hotplug info from SRAT
>
> I have excluded all the memory reserved by memblock, and any node that has
> memory
> reserved by memblock will be set to un-hot-pluggable, which means we will
> have
> enough memory (all the memory on the node) to boot the kernel. So I think
> the problem
> you are talking about has been solved.

I don't think you understand the problem.

On such a system, all page tables and vmemmap will be put in the 1G of RAM
on the first cpu: as all other RAM is MOVABLE, memblock_find_in_range()
will not use any local RAM on those nodes.

Yinghai Lu

Feb 26, 2013, 9:30:25 PM

to Yasuaki Ishimatsu, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Thomas Renninger, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

That depends; the initrd, for example, can be relocated to a different position.

>
>>
>> If node is hotplugable, the mem related stuff like page table and
>> vmemmap could be
>> on the that node without problem and should be on that node.
>>
>
>> assume first cpu only have 1G ram, and other 31 socket will have bunch of
>> ram
>> and those cpu with ram could be hotadd and hotremoved.
>> Now you want to put page table and vmemmap on first node.
>> The system would not boot as not enough memory for cover whole system RAM.
>
>
> Even if we solve your above mentions, the system cannot boot.
> In this case, user should:
> o add ram to first cpu
> o decreases hotpluggable ram by :
> - changing hotpluggable information of SRAT
> - using movablemem_map=nn[KMG]@ss[KMG]

Do you mean you cannot boot a one-socket system with 1G of RAM?

Assume socket 0 does not support hotplug, and the other 31 sockets do.

Then we could boot the system with only socket 0, and later hot-add the
other cpus one by one.

We should simulate it that way: boot the system with PXM0 first, and
later, during the ACPI scan, add the other cpus/RAM.

Yasuaki Ishimatsu

Feb 26, 2013, 10:39:55 PM

to Yinghai Lu, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Thomas Renninger, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

In this case, the system can boot. But hot-plugging the other cpus with a
bunch of RAM may fail, since the system does not have enough memory to
cover the hot-added memory. When hot-adding a memory device, the kernel
objects for that memory are allocated from the 1G of RAM, since the
hot-added memory has not been enabled yet.

Thanks,
Yasuaki Ishimatsu

Yinghai Lu

Feb 26, 2013, 11:04:46 PM

to Yasuaki Ishimatsu, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Thomas Renninger, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
<isimatu...@jp.fujitsu.com> wrote:
> 2013/02/27 11:30, Yinghai Lu wrote:
>> Do you mean you can not boot one socket system with 1G ram ?
>> Assume socket 0 does not support hotplug, other 31 sockets support hot
>> plug.
>>
>> So we could boot system only with socket0, and later one by one hot
>> add other cpus.
>
>
> In this case, system can boot. But other cpus with bunch of ram hot
> plug may fails, since system does not have enough memory for cover
> hot added memory. When hot adding memory device, kernel object for the
> memory is allocated from 1G ram since hot added memory has not been
> enabled.
>

Yes, it may fail, if the page tables and vmemmap needed for one node's
memory take more than 1G...

For hot-added memory we need to:
1. add another wrapper for init_memory_mapping(), just like
   init_mem_mapping() for the boot path.
2. make memblock more generic, so we can use it with hot-added
   memory at runtime.
3. with that, initialize the page tables for a hot-added node with RAM:
   a. the initial page table for the 2M near the node top comes from node0
      (which does not support hotplug).
   b. then that 2M is used for the memory below the node top...
   c. with that we make sure the page tables stay on the local node.
   alloc_low_pages() needs to be updated to support that.
4. make sure vmemmap is on the local node too.

Then hot-removing a node will work later too.

In the long run, we should make the boot path and the hot-add path more
similar and share as much code as possible.
That will give the code more test coverage.

Tang Chen

Feb 26, 2013, 11:33:32 PM

to Yinghai Lu, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Thomas Renninger, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com

On 02/27/2013 10:24 AM, Yinghai Lu wrote:
>>> After looked at the code more, thought that theory that does not let
>>> kernel use ram
>>> on hotplug area is not right.
>>>
>>> after that commit, following range can not use movable ram:
>>> 1. real_mode code.... well..funny, legacy cpu0 [0,1M) could be
>>> hot-removed?
>>> 2. dma_continguous ?
>>> 3. log buff ring.
>>> 4. initrd... why it will be freed after booting, so it could be on
>>> movable...
>>> 5. crashkernel for kdump...: : looks like we can not put kdump kernel
>>> above 4G anymore
>>> 6. initmem_init: it will allocate page table to setup kernel mapping
>>> for memory..., it should
>>> be with BRK and near end of max_pfn....
>>
>>
>> AFAIK, Linux kernel now cannot migrate memory used by the kernel because. So
>> any memory
>> used by the kernel should not be on movable area.
>
> that depends.
>
> initrd will be freed later, so it should be put anywhere that is under
> max_pfn during boot.
>

OK, but the initrd is not that big. Actually, before my code starts to
work, memblock has already reserved some memory, but not that much. On the
other hand, it is not that easy to find out which memory should be kept in
the unmovable area, and which should not.

>>
>>
>>>
>>> If node is hotplugable, the mem related stuff like page table and
>>> vmemmap could be
>>> on the that node without problem and should be on that node.
>>
>>
>> page tables and vmemmap are kernel memory. They should not be movable, I
>> think.
>
> why do you need to migrate pagetable and vmemmap for the memory range
> that will be
> offline ?

Hmm, you are right. :)

True, we can store the page tables and vmemmap on the hot-pluggable node.
But just as with the page_cgroup structs, we need additional work to handle it.

The existing code doesn't do any special handling for this, though. I think
we can improve it if needed. :)

>
>>
>>
>>>
>>> assume first cpu only have 1G ram, and other 31 socket will have bunch of
>>> ram
>>> and those cpu with ram could be hotadd and hotremoved.
>>> Now you want to put page table and vmemmap on first node.
>>> The system would not boot as not enough memory for cover whole system RAM.
>>
>>
>> Yes, you are right. And a more extreme situation has been talked about by
>> HPA.
>>
>> "If all the memory is hot-pluggable, then the kernel won't be able to boot."
>>
>> So, please refer to commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb:
>> acpi, memory-hotplug: support getting hotplug info from SRAT
>>
>> I have excluded all the memory reserved by memblock, and any node that has
>> memory
>> reserved by memblock will be set to un-hot-pluggable, which means we will
>> have
>> enough memory (all the memory on the node) to boot the kernel. So I think
>> the problem
>> you are talking about has been solved.
>
> I don't think that you understand the problem.
>
> for the system that will put all pagetable and vmemmap on the 1G ram
> of first cpu.
> as all other ram are MOVABLE, so memblock_find_in_range will not use any local
> ram on those nodes.
>

Yes, I know that. :)

In this case, the kernel will not be able to use local RAM on those nodes,
which will cost some performance.

What I mean is: if the 1G of RAM is not enough for the kernel to boot, the
current code will set all the RAM on the same node as un-hot-pluggable.

If all the RAM on the node is not enough for the kernel to boot, that is a
really extreme situation, IIUC.

I think users can solve this problem in two ways:
1) add more RAM to the node.
2) use movablemem_map=nn[KMG]@ss[KMG] to configure more RAM as unmovable.

Thanks. :)
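
For readers unfamiliar with the nn[KMG]@ss[KMG] notation used above, here is a hedged userspace sketch of how such an argument could be parsed. It assumes, by analogy with the kernel's memmap= option, that nn is a size and ss a start address; `parse_size` and `parse_movablemem` are hypothetical helpers written for this illustration, not the kernel's own parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a decimal number with an optional K/M/G suffix. On success,
 * stores the value in *out, leaves *end just past the parsed text and
 * returns 0; returns -1 if s does not start with a digit. */
static int parse_size(const char *s, const char **end, uint64_t *out)
{
    char *p;

    assert(s != NULL && end != NULL && out != NULL);
    if (!isdigit((unsigned char)*s))
        return -1;

    uint64_t v = strtoull(s, &p, 10);
    switch (*p) {
    case 'G': case 'g':
        v <<= 10;
        /* fall through */
    case 'M': case 'm':
        v <<= 10;
        /* fall through */
    case 'K': case 'k':
        v <<= 10;
        p++;
        break;
    default:
        break;
    }
    *end = p;
    *out = v;
    return 0;
}

/* Parse "nn[KMG]@ss[KMG]": size first, then start address (assumed). */
static int parse_movablemem(const char *arg, uint64_t *size, uint64_t *start)
{
    const char *p;

    if (parse_size(arg, &p, size) != 0 || *p != '@')
        return -1;
    if (parse_size(p + 1, &p, start) != 0 || *p != '\0')
        return -1;
    return 0;
}
```

A string such as "4G@8G" would then describe 4 GiB starting at the 8 GiB mark, while a keyword form such as "srat" is rejected by this sketch and would need separate handling.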

Yasuaki Ishimatsu

Feb 26, 2013, 11:44:40 PM

to Yinghai Lu, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Tony Luck, Thomas Renninger, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

I think so too. By this, memory hot plug becomes more useful.

Thanks,
Yasuaki Ishimatsu

Yasuaki Ishimatsu

Feb 27, 2013, 12:51:04 AM

to Yinghai Lu, Andrew Morton, Tang Chen, Don Morris, Tim Gardner, H. Peter Anvin, Linus Torvalds, Tejun Heo, Tony Luck, Thomas Renninger, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, a.p.zi...@chello.nl, jarkko....@intel.com

2013/02/27 14:11, Yinghai Lu wrote:
> On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu

I agree with your idea, but I think the ideas above are future work.
So first we should use movable memory for memory hotplug.
After that, we will implement the ideas above.

>>
>>>
>>> so hot-remove node will work too later.
>>>
>>> In the long run, we should make booting path and hot adding more
>>> similar and share at most code.
>>> That will make code get more test coverage.
>

> Tang, Yasuaki, Andrew,
>
> Please check if you are ok with attached reverting patch.

We have no objection and will fix this problem, so please wait a while.

Also, the problem is caused by "movablemem_map=srat", not by "movablemem_map=nn[KMG]@ss[KMG]".
If you do want to revert, you should at least revert only the "movablemem_map=srat" part.

Thanks,
Yasuaki Ishimatsu

>
> Tim, Don,
> Can you try if attached reverting patch fix all the problems for you ?

Yinghai Lu

Feb 27, 2013, 1:55:00 AM

to Yasuaki Ishimatsu, Andrew Morton, Tang Chen, Don Morris, Tim Gardner, H. Peter Anvin, Linus Torvalds, Tejun Heo, Tony Luck, Thomas Renninger, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, a.p.zi...@chello.nl, jarkko....@intel.com

On Tue, Feb 26, 2013 at 9:49 PM, Yasuaki Ishimatsu

Those patches are tangled together.

Also, it looks funny to ask the user to specify a memory range on the boot
command line to enable memory hotplug.

Tang Chen

Feb 27, 2013, 2:12:02 AM

to Yinghai Lu, Yasuaki Ishimatsu, Andrew Morton, Don Morris, Tim Gardner, H. Peter Anvin, Linus Torvalds, Tejun Heo, Tony Luck, Thomas Renninger, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, a.p.zi...@chello.nl, jarkko....@intel.com

No, they are not.

The following commits support "movablemem_map=nn[KMG]@ss[KMG]":

commit fb06bc8e5f42f38c011de0e59481f464a82380f6
    page_alloc: bootmem limit with movablecore_map
commit 42f47e27e761fee07da69e04612ec7dd0d490edd
    page_alloc: make movablemem_map have higher priority
commit 6981ec31146cf19454c55c130625f6cee89aab95
    page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes
commit 34b71f1e04fcba578e719e675b4882eeeb2a1f6f
    page_alloc: add movable_memmap kernel parameter
commit 4d59a75125d5a4717e57e9fc62c64b3d346e603e
    x86: get pg_data_t's memory from other node

And the following support "movablemem_map=srat":

commit f7210e6c4ac795694106c1c5307134d3fc233e88
    mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().
commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
    acpi, memory-hotplug: support getting hotplug info from SRAT
commit 27168d38fa209073219abedbe6a9de7ba9acbfad
    acpi, memory-hotplug: extend movablemem_map ranges to the end of node
commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
    acpi, memory-hotplug: parse SRAT before memblock is ready

>

> Also it looks funny to ask user to specify mem range in boot command
> line to enable mem hotplug.

Well, I think sometimes users don't like the memory layout SRAT gives them,
and want to increase or reduce the hot-pluggable memory themselves. It is
also useful for debugging firmware bugs.

I agree that the "movablemem_map=srat" functionality needs more work.
Can we not revert it, and improve it during the 3.9 rc cycle instead? I think
during rc time we can at least fix the problems brought by early_parse_srat().

Thanks. :)

Yinghai Lu

Feb 27, 2013, 2:25:28 AM

to Tang Chen, Yasuaki Ishimatsu, Andrew Morton, Don Morris, Tim Gardner, H. Peter Anvin, Linus Torvalds, Tejun Heo, Tony Luck, Thomas Renninger, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, a.p.zi...@chello.nl, jarkko....@intel.com

On Tue, Feb 26, 2013 at 11:11 PM, Tang Chen <tang...@cn.fujitsu.com> wrote:
> On 02/27/2013 02:54 PM, Yinghai Lu wrote:
>>
>> Those patches are tangled together.
>
>
> No, they are not.
>
> The following commits supports "movablemem_map=nn[KMG]@ss[KMG]".
>
> commit fb06bc8e5f42f38c011de0e59481f464a82380f6
> page_alloc: bootmem limit with movablecore_map
> commit 42f47e27e761fee07da69e04612ec7dd0d490edd
> page_alloc: make movablemem_map have higher priority
> commit 6981ec31146cf19454c55c130625f6cee89aab95
> page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes
> commit 34b71f1e04fcba578e719e675b4882eeeb2a1f6f
> page_alloc: add movable_memmap kernel parameter
> commit 4d59a75125d5a4717e57e9fc62c64b3d346e603e
> x86: get pg_data_t's memory from other node
>
> And the following supports "movablemem_map=srat".
>
> commit f7210e6c4ac795694106c1c5307134d3fc233e88
> mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().
> commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
> acpi, memory-hotplug: support getting hotplug info from SRAT
> commit 27168d38fa209073219abedbe6a9de7ba9acbfad
> acpi, memory-hotplug: extend movablemem_map ranges to the end of node
> commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
> acpi, memory-hotplug: parse SRAT before memblock is ready

those four can be reverted cleanly?

>
>>
>> Also it looks funny to ask user to specify mem range in boot command
>> line to enable mem hotplug.
>
>
> Well, I think sometimes users don't like the SRAT memory style, and want to
> increase or reduce hot-pluggable memory by themselves. And also, it is
> useful
> for debuging firmware bugs.
>
> I agree that "movablemem_map=srat" functionality need more work to improve.
> Can we not revert it, and improve it during 3.9rc ? I think during rc time,
> at least we can fix the problems brought by early_parse_srat().

looks like acpi_override can not be fixed.

Thanks

Yinghai

Tang Chen

Feb 27, 2013, 2:45:31 AM

to Yinghai Lu, Yasuaki Ishimatsu, Andrew Morton, Don Morris, Tim Gardner, H. Peter Anvin, Linus Torvalds, Tejun Heo, Tony Luck, Thomas Renninger, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, a.p.zi...@chello.nl, jarkko....@intel.com

Sorry, if you want to revert, you just need to revert:

commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
acpi, memory-hotplug: parse SRAT before memblock is ready

commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
acpi, memory-hotplug: support getting hotplug info from SRAT

The other two have nothing to do with SRAT. And they are necessary.

Seeing from the code, I think it is clean. But we'd better test it.

>
>>
>>>
>>> Also it looks funny to ask user to specify mem range in boot command
>>> line to enable mem hotplug.
>>
>>
>> Well, I think sometimes users don't like the SRAT memory style, and want to
>> increase or reduce hot-pluggable memory by themselves. And also, it is
>> useful
>> for debuging firmware bugs.
>>
>> I agree that "movablemem_map=srat" functionality need more work to improve.
>> Can we not revert it, and improve it during 3.9rc ? I think during rc time,
>> at least we can fix the problems brought by early_parse_srat().
>
> looks like acpi_override can not be fixed.

I need to investigate this problem, and I think we can
give it a try.

I do hope we can keep these patches, and leave the improvement work for the
future. :)

Thanks. :)

Lai Jiangshan

Feb 27, 2013, 2:58:50 AM

to Yinghai Lu, Andrew Morton, Yasuaki Ishimatsu, Tang Chen, Don Morris, Tim Gardner, H. Peter Anvin, Linus Torvalds, Tejun Heo, Tony Luck, Thomas Renninger, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, a.p.zi...@chello.nl, jarkko....@intel.com

On 02/27/2013 01:11 PM, Yinghai Lu wrote:
> On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu

>>> so hot-remove node will work too later.
>>>
>>> In the long run, we should make booting path and hot adding more
>>> similar and share at most code.
>>> That will make code get more test coverage.
>

> Tang, Yasuaki, Andrew,
>
> Please check if you are ok with attached reverting patch.
>

> Tim, Don,
> Can you try if attached reverting patch fix all the problems for you ?
>

Hi, Yinghai, Andrew

In the mails and the changelog of the revert patch, I think Yinghai
mainly worries about 3 problems.

1) The current implementation has bugs and bad code.

Yes. Any bug should be fixed. We should fix it directly, or
we can revert the related patches and then send the fixed patches.

But only one or two patches are related, so it is not a good idea
to revert the whole patchset or the whole feature. Right?

Thank you all for addressing the bug. We are on the way to fixing it.

2) A lot of memory could be placed in hotpluggable memory, but we have not
moved it there yet: vmemmap, some page tables, etc.

This is a restriction in the current kernel; we can't convert them quickly.
We must convert them step by step. For example, we are converting the memory
of page_cgroup to hotpluggable memory.

3) If the user (or firmware) specifies too little un-hotpluggable memory, the
system can't work, or can't even boot.

Any feature/system has its own minimum requirements; the user should
meet the requirements and specify more un-hotpluggable memory.
So I don't think it is a problem in kernel land.

But problem 2) above makes this feature's "minimum requirements"
much higher. That is the real thing that Yinghai worries about.

But all systems which use this feature can meet this higher requirement
very easily. Users should specify enough un-hotpluggable memory
both before and after we decrease the "minimum requirements".

The whole feature works very well if the user specifies enough
un-hotpluggable memory. So problems 2) and 3) are not urgent
problems.

And our team has another problem: we are still not good at community work
(for example, the patch TITLE is totally misleading), but we are growing up.
We are sorry, and thank you for pointing out the mistakes.

The feature/patchset does have problems. But it is not good to tangle
all the problems together and revert the whole feature.

Thanks,
Lai

Don Morris

Feb 27, 2013, 7:40:20 AM

to Yinghai Lu, Andrew Morton, Yasuaki Ishimatsu, Tang Chen, Tim Gardner, H. Peter Anvin, Linus Torvalds, Tejun Heo, Tony Luck, Thomas Renninger, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, a.p.zi...@chello.nl, jarkko....@intel.com

On 02/27/2013 12:11 AM, Yinghai Lu wrote:
> On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu

>>> so hot-remove node will work too later.
>>>
>>> In the long run, we should make booting path and hot adding more
>>> similar and share at most code.
>>> That will make code get more test coverage.
>

> Tang, Yasuaki, Andrew,
>
> Please check if you are ok with attached reverting patch.
>
> Tim, Don,
> Can you try if attached reverting patch fix all the problems for you ?

I'm sure from the discussion on how to leave in memory hotplug it
likely won't be just a clean reversion, but as a data point -- yes,
this patch does remove the problem as expected (and I don't see
any new ones at first glance... though I'm not trying hotplug yet
obviously).

Thanks,
Don Morris

Luck, Tony

Feb 27, 2013, 11:28:49 AM

to Yasuaki Ishimatsu, Yinghai Lu, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Thomas Renninger, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, Sakkinen, Jarkko, tang...@cn.fujitsu.com

> assume first cpu only have 1G ram, and other 31 socket will have bunch of ram

That doesn't seem to be a very realistic assumption. Can you even still buy 1G
DIMMs for servers? I'd think that a minimum would be to have each of four
channels populated with a 4G DIMM - so 16GB on first cpu. But even that feels
rather low.

I think that making sure that the system can boot is good (and maybe it should
ignore/override[*] parameters that would prevent booting). But let's be realistic
about the cases we actually have to deal with (before somebody comes and talks
about systems with just 16MB).

-Tony

[*] with some noisy warnings in the console log

Yinghai Lu

Feb 27, 2013, 12:31:10 PM

to Luck, Tony, Yasuaki Ishimatsu, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Thomas Renninger, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, Sakkinen, Jarkko, tang...@cn.fujitsu.com

On Wed, Feb 27, 2013 at 8:28 AM, Luck, Tony <tony...@intel.com> wrote:
>> assume first cpu only have 1G ram, and other 31 socket will have bunch of ram
>
> That doesn't seem to be a very realistic assumption. Can you even still buy 1G
> DIMMs for servers? I'd think that a minimum would be to have each of four
> channels populated with a 4G DIMM - so 16GB on first cpu. But even that feels
> rather low.

We could use memmap= to exclude mem, right?

>
> I think that making sure that the system can boot is good (and maybe it should
> ignore/override[*] parameters that would prevent booting). But let's be realistic
> about the cases we actually have to deal with (before somebody comes and talks
> about systems with just 16MB).

About making memory hotplug work:
1. Find out which RAM is used by the kernel early on.
2. Check whether
   a. it holds kernel code that will not be moved,
      like real_mode.
   b. it will be freed to slub before run time,
      like init code and the initrd.
   c. it is on local node RAM, which will not prevent mem hot-remove,
      like page table and vmemmap.
      Currently we already have vmemmap and node_data on the local node.
      We may need to put the page table on the local node too, or just put
      the page table on the local node that the kernel is on.
   d. it could be anywhere, and could be moved down after
      slub is ready.
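
One way to read the checklist above is as a classification of each early allocation, followed by a per-node test for whether it would block hot-remove. The sketch below is an interpretation under that assumption, not kernel code; every type and function name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical categories, one per item a-d in the checklist. */
enum early_mem_kind {
    MEM_FIXED_KERNEL,   /* a. never moved, e.g. real-mode code           */
    MEM_FREED_AT_BOOT,  /* b. freed before run time: init code, initrd   */
    MEM_NODE_LOCAL,     /* c. per-node metadata: page table, vmemmap     */
    MEM_RELOCATABLE     /* d. can be moved down once the slab is ready   */
};

struct early_alloc {
    enum early_mem_kind kind;
    int describes_node;  /* node whose RAM this metadata describes (kind c) */
    int placed_on_node;  /* node whose RAM actually holds the allocation    */
};

/* Return 1 if this allocation would prevent hot-removing `node`. */
static int blocks_hot_remove(const struct early_alloc *a, int node)
{
    assert(a != NULL);
    if (a->placed_on_node != node)
        return 0;               /* not stored on that node at all */

    switch (a->kind) {
    case MEM_FIXED_KERNEL:
        return 1;               /* pinned for the lifetime of the system */
    case MEM_FREED_AT_BOOT:
    case MEM_RELOCATABLE:
        return 0;               /* gone, or movable, by run time */
    case MEM_NODE_LOCAL:
        /* Metadata for a node goes away together with that node, so it
         * only blocks removal when it describes some other node. */
        return a->describes_node != node;
    }
    return 1;                   /* unknown kind: be conservative */
}
```

Under this reading, a page table that describes node 2 but lives in node 1's RAM blocks the removal of node 1, while the same page table placed on node 2 blocks nothing.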

The movablemem_map patchset prevents the kernel from using RAM on the local node.

In that case, they should just boot the system with numa=off.

Thanks

Yinghai

Luck, Tony

Feb 27, 2013, 12:51:07 PM

to Yinghai Lu, Yasuaki Ishimatsu, Don Morris, H. Peter Anvin, Tejun Heo, Andrew Morton, Thomas Renninger, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, Sakkinen, Jarkko, tang...@cn.fujitsu.com

> b. it will be freed to slub before run time.
> like init code and initrd disk.

If this is a problem - I'd be inclined to disable the code that frees it. It's only
a few hundred KB of code, and possibly a few MB of initrd. Too small to
worry about on a hot pluggable server.

> In that case, so they should just boot system with numa=off.

But we will still care about NUMA locality.

-Tony

Andrew Morton

Feb 27, 2013, 4:26:21 PM

to Lai Jiangshan, Yinghai Lu, Yasuaki Ishimatsu, Tang Chen, Don Morris, Tim Gardner, H. Peter Anvin, Linus Torvalds, Tejun Heo, Tony Luck, Thomas Renninger, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, a.p.zi...@chello.nl, jarkko....@intel.com

On Wed, 27 Feb 2013 16:00:36 +0800
Lai Jiangshan <la...@cn.fujitsu.com> wrote:

> In the mails and the changlog of the revert-patch, I think Yinghai
> mainly worries about 3 problems.
>
> 1) the current implement has bug and bad code.
>
> Yes. Any bug should be fixed. we should fix it directly, or
> we can revert the related patches and then send the fixed patches.
>
> But the related patch is only one or two, it is not good idea
> to revert the whole patchset or the whole feature. Right?

Reverting a new patchset isn't really a big deal. The patchset gets
fixed up, retested then reapplied. We like to do things this way
because it minimises the amount of trouble which the regression is
causing other people.

Reverting one or two patches from a fairly large and complex patchset
sounds risky - we're putting an untested patch combination straight
into mainline with minimal testing. It would be safer to revert
everything.

So I'm thinking that the best approach here is to revert everything and
then try again for 3.10-rc1. This gives people time to test the code
while it's only in linux-next. (Hint!)

> Thank you all for addressing the bug. we are on the way to fix it.

How long do you think this will take?

> 2) many memory can be put into hotplugable memory, but we have not yet moved them
> into hotplugable memory yet. like: vmemmap, some page table ...etc, a lot.
>
> This is a restriction in the currently kernel, we can't convert them quickly.
> we must convert them step by step. example, we are converting the memory of
> page_cgroup to hotplugable memory.
>
>
> 3) if the user(or firmware) specify the un-hotplugable memory too small, the system can't
> work, even can't boot.
>
> Any feature/system has its own minimum requirements, the user should
> meet the requirements and specify more un-hotplugable memory.
> so I don't think it is a problem in kernel land.
>
> But the problem 2)(above) make this feature's "minimum requirements"
> much higher. It is the real thing that Yinghai worries about.
>
> But all systems which use this feature can offer this higher requirement
> very easily. The users should specify enough un-hotplugable memory
> before and after we decrease the "minimum requirements".
>
> The whole feature works very well if the user specify enough
> un-hotplugable memory. So the problem 2) and 3) are not urgent
> problems.

Yes, let's not mingle concepts. From a feature perspective we've
always understood that 3.9 memory hotplug would be "has limitations,
needs work, but better than it was before". Let's consider that
separately from "your patchset broke my kernel".

Tang Chen

Feb 28, 2013, 5:02:24 AM

to Andrew Morton, Lai Jiangshan, Yinghai Lu, Yasuaki Ishimatsu, Don Morris, Tim Gardner, H. Peter Anvin, Linus Torvalds, Tejun Heo, Tony Luck, Thomas Renninger, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, a.p.zi...@chello.nl, jarkko....@intel.com

Hi Andrew,

On 02/28/2013 05:26 AM, Andrew Morton wrote:
>> Thank you all for addressing the bug. we are on the way to fix it.
>
> How long do you think this will take?
>

I think we need one week to solve these problems. I do hope we can make
the merge window for 3.9.

Thanks. :)

Yinghai Lu

Feb 28, 2013, 11:07:12 AM

to Tang Chen, Andrew Morton, Benjamin Herrenschmidt, Tejun Heo, Yasuaki Ishimatsu, Don Morris, Tim Gardner, H. Peter Anvin, Linus Torvalds, Tony Luck, Thomas Renninger, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, a.p.zi...@chello.nl, jarkko....@intel.com

On Tue, Feb 26, 2013 at 11:44 PM, Tang Chen <tang...@cn.fujitsu.com> wrote:
>
> Sorry, if you want to revert, you just need to revert:
>
> commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
> acpi, memory-hotplug: parse SRAT before memblock is ready
> commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
> acpi, memory-hotplug: support getting hotplug info from SRAT
>
> The other two have nothing to do with SRAT. And they are necessary.
>
> Seeing from the code, I think it is clean. But we'd better test it.

We should revert them all.

as

commit fb06bc8e5f42f38c011de0e59481f464a82380f6
Author: Tang Chen <tang...@cn.fujitsu.com>
Date: Fri Feb 22 16:33:42 2013 -0800

page_alloc: bootmem limit with movablecore_map

The TITLE is totally misleading. Come on, what is movablecore_map?

It actually uses movablemem_map to exclude some ranges during
memblock_find_in_range.

That makes memblock less generic.

That patch is the base of the whole patchset.
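
To make the objection concrete, here is a simplified userspace sketch of what "exclude some ranges during a range search" means. It is not the memblock implementation; struct range and find_range_excluding() are hypothetical names, and the top-down search order is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range { uint64_t start, end; };   /* half-open: [start, end) */

/* True if [start, end) intersects the range r. */
static int overlaps(const struct range *r, uint64_t start, uint64_t end)
{
    return start < r->end && r->start < end;
}

/* Search [lo, hi) from the top down for `size` bytes that avoid every
 * excluded range. Returns the start address found, or 0 if nothing fits
 * (so address 0 itself is never returned as a valid result). */
static uint64_t find_range_excluding(uint64_t lo, uint64_t hi, uint64_t size,
                                     const struct range *excl, size_t n)
{
    uint64_t cand_end = hi;

    assert(lo <= hi);
    while (size != 0 && cand_end >= lo && cand_end - lo >= size) {
        uint64_t cand_start = cand_end - size;
        size_t i;

        for (i = 0; i < n; i++)
            if (overlaps(&excl[i], cand_start, cand_end))
                break;
        if (i == n)
            return cand_start;          /* no collision: done */

        /* Collided: retry just below the excluded range. This strictly
         * lowers cand_end, so the loop terminates. */
        cand_end = excl[i].start;
    }
    return 0;
}
```

The point of contention in the thread is that a generic allocator then has to know about one feature's exclusion list; in this sketch that coupling shows up as the extra `excl` parameter on every search.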

Also you and Yasuaki keep saying: movablemem_map=srat.
But where are the doc and code for it?
Looks like there is only movablemem_map=acpi.

I'm upset by this patchset.

Next time, please get an Ack from TJ or Ben when you touch memblock code.
And at least make sure the TITLE is right.

Thanks

Yinghai

Tang Chen

Feb 28, 2013, 9:00:21 PM

to Yinghai Lu, Andrew Morton, Benjamin Herrenschmidt, Tejun Heo, Yasuaki Ishimatsu, Don Morris, Tim Gardner, H. Peter Anvin, Linus Torvalds, Tony Luck, Thomas Renninger, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, a.p.zi...@chello.nl, jarkko....@intel.com

On 03/01/2013 12:07 AM, Yinghai Lu wrote:
> On Tue, Feb 26, 2013 at 11:44 PM, Tang Chen<tang...@cn.fujitsu.com> wrote:
>>
>> Sorry, if you want to revert, you just need to revert:
>>
>> commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
>> acpi, memory-hotplug: parse SRAT before memblock is ready
>> commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
>> acpi, memory-hotplug: support getting hotplug info from SRAT
>>
>> The other two have nothing to do with SRAT. And they are necessary.
>>
>> Seeing from the code, I think it is clean. But we'd better test it.
>
> We should revert them all.
>
> as
>
> commit fb06bc8e5f42f38c011de0e59481f464a82380f6
> Author: Tang Chen<tang...@cn.fujitsu.com>
> Date: Fri Feb 22 16:33:42 2013 -0800
>
> page_alloc: bootmem limit with movablecore_map
>
> It is totally misleading in the TITLE. Come on, what is movablecore_map?
>
> It actually use movablemem_map to exclude some range during
> memblock_find_in_range.
>
> That make memblock less generic.
>
> That patch is the base of the whole patchset.
>
> Also you and Yasuaki keep saying: movablemem_map=srat.
> But where is doc and code for it?
> Looks like there is only movablemem_map=acpi.

Hi Yinghai,

I think I forgot to change the title when merging the related bugfix patches
into one. And yes, movablecore_map has been renamed to movablemem_map.

How about this:

For now, let's revert the SRAT-related patches, and keep
movablemem_map=nn[KMG]@ss[KMG].

About the SRAT thing, we have the following solution:

1) keep the original init sequence: parse the ACPI tables and modify global
   variables as before
2) introduce a new function to obtain SRAT info earlier, store the info
   somewhere, and touch nothing NUMA-related
3) use the info to set up movablemem_map, and free it when that is done

In this way, we keep our code isolated from the NUMA code, and NUMA will
be initialized as before. This can be done in one week or less. I'll cc the
x86 guys, and they can choose when to merge the new code.

As for the movablemem_map=nn[KMG]@ss[KMG] code, it does no harm to the
kernel. We have documented that using this option will reduce NUMA
performance. Users who don't want to lose NUMA performance can boot the
kernel without this option, and the kernel will work as before.

I do hope we can keep the code in 3.9, and make more improvements in the
future. So please just revert the two SRAT-related patches.

Thanks. :)

Linus Torvalds

Feb 28, 2013, 10:13:20 PM

to Andrew Morton, Lai Jiangshan, Yinghai Lu, Yasuaki Ishimatsu, Tang Chen, Don Morris, Tim Gardner, H. Peter Anvin, Tejun Heo, Tony Luck, Thomas Renninger, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jarkko Sakkinen

On Wed, Feb 27, 2013 at 1:26 PM, Andrew Morton
<ak...@linux-foundation.org> wrote:
>
> So I'm thinking that the best approach here is to revert everything and
> then try again for 3.10-rc1. This gives people time to test the code
> while it's only in linux-next. (Hint!)

I'd prefer to revert too by now - the bug seems to be known, and
apparently it's not a trivial fix. We're getting close to the end of
the merge window, and it's still being discussed, it clearly wasn't
really fully cooked.

Can we agree on some minimal set of reverts? Can somebody send me a
patch with the revert and the commit explanation for the revert?
Yinghai? Or I can do the reverts too if just the exact set of commits
is clear, but I'd rather get it from somebody who sees and understand
the problem, and can test the state afterwards..

Linus

Tang Chen

Feb 28, 2013, 10:46:55 PM

to Linus Torvalds, Andrew Morton, Lai Jiangshan, Yinghai Lu, Yasuaki Ishimatsu, Don Morris, Tim Gardner, H. Peter Anvin, Tejun Heo, Tony Luck, Thomas Renninger, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jarkko Sakkinen, Benjamin Herrenschmidt, Wen Congyang, Lin Feng, guz....@cn.fujitsu.com, Gui jianfeng

Hi Linus,

Please refer to the attached patch.

This patch reverts only the following two patches.

commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
acpi, memory-hotplug: support getting hotplug info from SRAT
commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
acpi, memory-hotplug: parse SRAT before memblock is ready

Without these two patches, users can use "movablemem_map=nn[KMG]@ss[KMG]"
correctly, and it causes no problems.

And of course, the kernel will work as before if users don't use
"movablemem_map=nn[KMG]@ss[KMG]".

I do hope we can keep "movablemem_map=nn[KMG]@ss[KMG]" in 3.9.

We are working on fixing the SRAT problems, and we aim to push the SRAT-related
patches in 3.10. We will also keep improving the "movablemem_map=nn[KMG]@ss[KMG]"
functionality in the future.

Thanks. :)

0001-x86-ACPI-mm-Revert-SRAT-support-from-movablemem_map-.patch

Linus Torvalds

Feb 28, 2013, 11:32:24 PM

to Tang Chen, Andrew Morton, Lai Jiangshan, Yinghai Lu, Yasuaki Ishimatsu, Don Morris, Tim Gardner, H. Peter Anvin, Tejun Heo, Tony Luck, Thomas Renninger, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jarkko Sakkinen, Benjamin Herrenschmidt, Wen Congyang, Lin Feng, guz....@cn.fujitsu.com, Gui jianfeng

Yinghai, Andrew,
is this ok with you two?

Linus

H. Peter Anvin

Feb 28, 2013, 11:41:33 PM

to Linus Torvalds, Tang Chen, Andrew Morton, Lai Jiangshan, Yinghai Lu, Yasuaki Ishimatsu, Don Morris, Tim Gardner, Tejun Heo, Tony Luck, Thomas Renninger, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jarkko Sakkinen, Benjamin Herrenschmidt, Wen Congyang, Lin Feng, guz....@cn.fujitsu.com, Gui jianfeng

On 02/28/2013 08:32 PM, Linus Torvalds wrote:
> Yingai, Andrew,
> is this ok with you two?
>
> Linus

FWIW, it makes sense to me iff it resolves the problems.

-hpa

Andrew Morton

Feb 28, 2013, 11:41:35 PM

to Linus Torvalds, Tang Chen, Lai Jiangshan, Yinghai Lu, Yasuaki Ishimatsu, Don Morris, Tim Gardner, H. Peter Anvin, Tejun Heo, Tony Luck, Thomas Renninger, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jarkko Sakkinen, Benjamin Herrenschmidt, Wen Congyang, Lin Feng, guz....@cn.fujitsu.com, Gui jianfeng

On Thu, 28 Feb 2013 20:32:15 -0800 Linus Torvalds <torv...@linux-foundation.org> wrote:

> Yingai, Andrew,
> is this ok with you two?

If it works. I haven't tested it yet! Ordinarily I'd give it a few
days for -next testing and to let Fengguang's testbot chew on it.

Yasuaki Ishimatsu

Mar 1, 2013, 1:03:53 AM

to Yinghai Lu, H. Peter Anvin, Linus Torvalds, Tang Chen, Andrew Morton, Lai Jiangshan, Don Morris, Tim Gardner, Tejun Heo, Tony Luck, Thomas Renninger, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jarkko Sakkinen, Benjamin Herrenschmidt, Wen Congyang, Lin Feng, guz....@cn.fujitsu.com, Gui jianfeng

2013/03/01 14:00, Yinghai Lu wrote:

> On Thursday, February 28, 2013, H. Peter Anvin wrote:
>
>> On 02/28/2013 08:32 PM, Linus Torvalds wrote:
>>> Yingai, Andrew,
>>> is this ok with you two?
>>>
>>> Linus
>>
>> FWIW, it makes sense to me iff it resolves the problems
>
>

> I prefer to reverting all 8 patches.
>
> Actually I have worked out one patch that could solve all problems, but it
> is too intrusive that I do not want to split it to small pieces to
> post it.
>

> Leaving the movablemem_map related changes in the upstream tree,
> will prevent me from continuing to make memblock to be used to allocate
> page table on local node ram for hot add.

The original issue is caused by two patches, and it is fixed by Tang's revert
patch. So the other patches are obviously unrelated to the original problem. Thus
there is no reason to revert all the patches related to movablemem_map.

If there is a reason, it is only that the movablemem_map patches get in the
way of your work.

If you continue developing your work, you should take those patches into
account.

Thanks,
Yasuaki Ishimatsu

>
> Will send reverting patch and putting page table on local node patch around
> 10pm after I get home.
>
> Thanks

Tang Chen

Mar 1, 2013, 1:19:46 AM

to Yinghai Lu, H. Peter Anvin, Linus Torvalds, Andrew Morton, Lai Jiangshan, Yasuaki Ishimatsu, Don Morris, Tim Gardner, Tejun Heo, Tony Luck, Thomas Renninger, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jarkko Sakkinen, Benjamin Herrenschmidt, Wen Congyang, Lin Feng, guz....@cn.fujitsu.com, Gui jianfeng

On 03/01/2013 01:00 PM, Yinghai Lu wrote:
> On Thursday, February 28, 2013, H. Peter Anvin wrote:
>

>> On 02/28/2013 08:32 PM, Linus Torvalds wrote:
>>> Yingai, Andrew,
>>> is this ok with you two?
>>>
>>> Linus
>>

>> FWIW, it makes sense to me iff it resolves the problems
>
>
> I prefer to reverting all 8 patches.
>
> Actually I have worked out one patch that could solve all problems, but it
> is too intrusive that I do not want to split it to small pieces to
> post it.
>
> Leaving the movablemem_map related changes in the upstream tree,
> will prevent me from continuing to make memblock to be used to allocate
> page table on local node ram for hot add.

Hi Yinghai,

Would you please give me a url to your code ?

I don't think movablemem_map will block your work a lot. According to your
description, you are modifying memblock to reserve some memory for local
node pagetables, right ?

If so, I think it won't be too difficult to make the code OK with your work.

Thanks. :)

>
> Will send reverting patch and putting page table on local node patch around
> 10pm after I get home.
>
> Thanks
>

H. Peter Anvin

Mar 1, 2013, 1:38:58 AM

to Martin Bligh, Yinghai Lu, Ingo Molnar, Don Morris, Tejun Heo, Andrew Morton, Tony Luck, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

On 02/25/2013 08:51 PM, Martin Bligh wrote:
>> Do you mean we can remove numaq x86 32bit code now?
>
> Wouldn't bother me at all. The machine is from 1995, end of life c. 2000?
> Was useful in the early days of getting NUMA up and running on Linux,
> but is now too old to be a museum piece, really.
>

I'd be very happy to get the NUMAQ code ripped out. I am wondering if
there are any reasons to keep any 32-bit x86 NUMA code at all.

-hpa

Yinghai Lu

Mar 1, 2013, 2:55:37 AM

to Yasuaki Ishimatsu, H. Peter Anvin, Linus Torvalds, Tang Chen, Andrew Morton, Lai Jiangshan, Don Morris, Tim Gardner, Tejun Heo, Tony Luck, Thomas Renninger, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jarkko Sakkinen, Benjamin Herrenschmidt, Wen Congyang, Lin Feng, guz....@cn.fujitsu.com, Gui jianfeng

On Thu, Feb 28, 2013 at 10:02 PM, Yasuaki Ishimatsu
<isimatu...@jp.fujitsu.com> wrote:
> 2013/03/01 14:00, Yinghai Lu wrote:
>
> Original issue occurs by two patches. And it is fixed by Tang's reverting
> patch. So other patches are obviously unrelated to original problem. Thus
> there is no reason to revert all patches related with movablemem_map.
>
> If there is a reason, movablemem_map patches prevent only your work.
>
> If you keep on developing your work, you should develop it in consideration
> of those patches.

Let me try again:

movablemem_map is a broken idea or a poor design.

It just pushes kernel memory from the local node down to some other place.

It is ridiculous to make the user specify a mem range on the command line to
make memory hotplug work.
Think about different memory layout configurations; that will drive customers crazy.
Not to mention the performance cost of putting NUMA data low.

The right way, or good practice, is:
Find the kernel memory that can not be moved, and either put it low
or put it in local node RAM.

Thanks

Yinghai

Yinghai Lu

Mar 1, 2013, 3:02:34 AM

to Tang Chen, H. Peter Anvin, Linus Torvalds, Andrew Morton, Lai Jiangshan, Yasuaki Ishimatsu, Don Morris, Tim Gardner, Tejun Heo, Tony Luck, Thomas Renninger, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jarkko Sakkinen, Benjamin Herrenschmidt, Wen Congyang, Lin Feng, guz....@cn.fujitsu.com, Gui jianfeng

On Thu, Feb 28, 2013 at 10:18 PM, Tang Chen <tang...@cn.fujitsu.com> wrote:
> On 03/01/2013 01:00 PM, Yinghai Lu wrote:
>>
>> On Thursday, February 28, 2013, H. Peter Anvin wrote:
>>
>>> On 02/28/2013 08:32 PM, Linus Torvalds wrote:
>>>>
>>>> Yingai, Andrew,
>>>> is this ok with you two?
>>>>
>>>> Linus
>>>
>>>
>>> FWIW, it makes sense to me iff it resolves the problems
>>
>>
>>
>> I prefer to reverting all 8 patches.
>>
>> Actually I have worked out one patch that could solve all problems, but it
>> is too intrusive that I do not want to split it to small pieces to
>> post it.
>>
>> Leaving the movablemem_map related changes in the upstream tree,
>> will prevent me from continuing to make memblock to be used to allocate
>> page table on local node ram for hot add.
>
>
> Hi Yinghai,
>
> Would you please give me a url to your code ?
>
> I don't think movablemem_map will block your work a lot. According to your
> description, you are modifying memblock to reserve some memory for local
> node pagetables, right ?

My idea:
Currently, for hot-added mem, the page table comes from other nodes, via slub.
That is not right; it will prevent those other nodes from being hot removed.

To fix the problem:
a. keep memblock alive after booting.
b. or have a separate dynamic memblock.

The second way looks cleaner.
So alloc_low_pages will get the initial page for the page table from the low
range with slub,
and later will get page tables from its own just-mapped range.

Now we need to make memblock cleaner and remove the hardcoded references in
those functions.

Thanks

Yinghai

Yinghai Lu

Mar 1, 2013, 3:05:30 AM

to H. Peter Anvin, Martin Bligh, Ingo Molnar, Don Morris, Tejun Heo, Andrew Morton, Tony Luck, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

On Thu, Feb 28, 2013 at 10:37 PM, H. Peter Anvin <h...@zytor.com> wrote:
> On 02/25/2013 08:51 PM, Martin Bligh wrote:
>>> Do you mean we can remove numaq x86 32bit code now?
>>
>> Wouldn't bother me at all. The machine is from 1995, end of life c. 2000?
>> Was useful in the early days of getting NUMA up and running on Linux,
>> but is now too old to be a museum piece, really.
>>
>
> I'd be very happy to get the NUMAQ code ripped out. I am wondering if
> there are any reasons to keep any 32-bit x86 NUMA code at all.

Agreed!

Yasuaki Ishimatsu

Mar 1, 2013, 3:40:26 AM

to Yinghai Lu, Tang Chen, H. Peter Anvin, Linus Torvalds, Andrew Morton, Lai Jiangshan, Don Morris, Tim Gardner, Tejun Heo, Tony Luck, Thomas Renninger, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jarkko Sakkinen, Benjamin Herrenschmidt, Wen Congyang, Lin Feng, guz....@cn.fujitsu.com, Gui jianfeng

2013/03/01 17:02, Yinghai Lu wrote:
> On Thu, Feb 28, 2013 at 10:18 PM, Tang Chen <tang...@cn.fujitsu.com> wrote:
>> On 03/01/2013 01:00 PM, Yinghai Lu wrote:
>>>
>>> On Thursday, February 28, 2013, H. Peter Anvin wrote:
>>>
>>>> On 02/28/2013 08:32 PM, Linus Torvalds wrote:
>>>>>
>>>>> Yingai, Andrew,
>>>>> is this ok with you two?
>>>>>
>>>>> Linus
>>>>
>>>>
>>>> FWIW, it makes sense to me iff it resolves the problems
>>>
>>>
>>>
>>> I prefer to reverting all 8 patches.
>>>
>>> Actually I have worked out one patch that could solve all problems, but it
>>> is too intrusive that I do not want to split it to small pieces to
>>> post it.
>>>
>>> Leaving the movablemem_map related changes in the upstream tree,
>>> will prevent me from continuing to make memblock to be used to allocate
>>> page table on local node ram for hot add.
>>
>>
>> Hi Yinghai,
>>
>> Would you please give me a url to your code ?
>>
>> I don't think movablemem_map will block your work a lot. According to your
>> description, you are modifying memblock to reserve some memory for local
>> node pagetables, right ?
>

> My idea:
> current for hotadd mem, page table will from other nodes from slub.
> that is not right. that will prevent others nodes to be hot removed.

If we use your idea, pglist_data and zone are also allocated from the local
node. In my understanding, pglist_data and zone cannot be deleted safely,
since there is no way to guarantee that nobody is using them. So it means
that no node can be hot-removed.
If you develop your idea, you should consider memory hot-remove.

Thanks,
Yasuaki Ishimatsu

Ingo Molnar

Mar 1, 2013, 6:00:08 AM3/1/13

to H. Peter Anvin, Martin Bligh, Yinghai Lu, Don Morris, Tejun Heo, Andrew Morton, Tony Luck, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

* H. Peter Anvin <h...@zytor.com> wrote:

> On 02/25/2013 08:51 PM, Martin Bligh wrote:
> >> Do you mean we can remove numaq x86 32bit code now?
> >
> > Wouldn't bother me at all. The machine is from 1995, end of life c. 2000? Was
> > useful in the early days of getting NUMA up and running on Linux, but is now too
> > old to be a museum piece, really.
>
> I'd be very happy to get the NUMAQ code ripped out. I am wondering if there are
> any reasons to keep any 32-bit x86 NUMA code at all.

Not much I suspect.

Thanks,

Ingo

Borislav Petkov

Mar 1, 2013, 6:03:31 AM3/1/13

to H. Peter Anvin, Martin Bligh, Yinghai Lu, Ingo Molnar, Don Morris, Tejun Heo, Andrew Morton, Tony Luck, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

On Thu, Feb 28, 2013 at 10:37:10PM -0800, H. Peter Anvin wrote:
> I'd be very happy to get the NUMAQ code ripped out. I am wondering if
> there are any reasons to keep any 32-bit x86 NUMA code at all.

How much would it hurt us if we said 3.8 is the last kernel that
supported NUMAQ? If anyone wants the functionality, they should use 3.8
or older.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Ingo Molnar

Mar 1, 2013, 6:24:44 AM3/1/13

to Borislav Petkov, H. Peter Anvin, Martin Bligh, Yinghai Lu, Don Morris, Tejun Heo, Andrew Morton, Tony Luck, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

* Borislav Petkov <b...@alien8.de> wrote:

> On Thu, Feb 28, 2013 at 10:37:10PM -0800, H. Peter Anvin wrote:
> > I'd be very happy to get the NUMAQ code ripped out. I am wondering if
> > there are any reasons to keep any 32-bit x86 NUMA code at all.
>
> How much would it hurt us if we said 3.8 is the last kernel that supported NUMAQ?
> If anyone wants the functionality, they should use 3.8 or older.

v3.9 - any non-trivial patch in the stage of being contemplated near the end of the
v3.9 merge window is most likely v3.10 material.

Thanks,

Ingo

Tang Chen

Mar 1, 2013, 6:30:12 AM3/1/13

to Yinghai Lu, H. Peter Anvin, Linus Torvalds, Andrew Morton, Konrad Rzeszutek Wilk, Stefano Stabellini, Yasuaki Ishimatsu, Don Morris, Tim Gardner, Tejun Heo, Tony Luck, Thomas Renninger, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Benjamin Herrenschmidt

On 03/01/2013 03:43 PM, Yinghai Lu wrote:
> Please check attached patches.
>
> Plan A. revert all 8 patches:
> revert_movablemem_map.patch
>
> Plan B. fix movablemem_map:
> kill_max_low_pfn_mapped.patch and fix_movablemem_map.patch
>
> fix_movablemem_map.patch is too risky, and need more test.
>

Hi Yinghai,

In your Plan B, you allocated pgdat on the local node, right?

- nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
+ nd_pa = memblock_find_in_range_node(start, end, nd_size,
+ SMP_CACHE_BYTES, nid); <---------------- Here, right?

Without movablemem_map, pgdat will be allocated successfully on the local
node, right?

If so, this will prevent node hot-plug because, as mentioned by Kamezawa,
there is no way to ensure pgdat is not used by others on stack.

I do hope you can stop putting pgdat and zone on the local node for now, and
improve it in the future.

I also hope you can apply my revert SRAT patch first, and then do your work.
That will seem cleaner to me.

Thanks. :)
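
For readers following the nd_pa change quoted above, here is a rough userspace model of the two allocation policies involved: strictly node-local, and node-local with fallback to any node. The names alloc_nid, alloc_try_nid and node_free are hypothetical stand-ins invented for this sketch; they are not the memblock API and they ignore alignment and address ranges entirely.

```c
#include <assert.h>

#define NR_NODES 2

/* Hypothetical model: free bytes remaining on each node. */
static long node_free[NR_NODES] = { 4096, 64 };

/*
 * Strict policy: allocate from node nid only.
 * Returns the node that satisfied the request, or -1 on failure.
 */
static int alloc_nid(long size, int nid)
{
	assert(size > 0);
	if (nid < 0 || nid >= NR_NODES || node_free[nid] < size)
		return -1;
	node_free[nid] -= size;
	return nid;
}

/*
 * Fallback policy: prefer node nid, otherwise take the first node
 * that still has room. Returns the node used, or -1 on failure.
 */
static int alloc_try_nid(long size, int nid)
{
	int got = alloc_nid(size, nid);
	int i;

	if (got >= 0)
		return got;
	for (i = 0; i < NR_NODES; i++) {
		got = alloc_nid(size, i);
		if (got >= 0)
			return got;
	}
	return -1;
}
```

In this model a 128-byte request for node 1 fails under the strict policy and lands on node 0 under the fallback policy. That is the trade-off being argued here: data placed on the local node stays with the node, while data placed elsewhere ties the node to another node's memory.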

H. Peter Anvin

Mar 1, 2013, 10:34:39 AM3/1/13

to Ingo Molnar, Borislav Petkov, Martin Bligh, Yinghai Lu, Don Morris, Tejun Heo, Andrew Morton, Tony Luck, Linus Torvalds, Tim Gardner, linux-...@vger.kernel.org, tg...@linutronix.de, mi...@redhat.com, x...@kernel.org, a.p.zi...@chello.nl, jarkko....@intel.com, tang...@cn.fujitsu.com

If NUMAQ is breaking real stuff we can kill it by marking it BROKEN. Rip-out is 3.10 at this stage.

Ingo Molnar <mi...@kernel.org> wrote:

>
>* Borislav Petkov <b...@alien8.de> wrote:
>
>> On Thu, Feb 28, 2013 at 10:37:10PM -0800, H. Peter Anvin wrote:
>> > I'd be very happy to get the NUMAQ code ripped out. I am wondering if
>> > there are any reasons to keep any 32-bit x86 NUMA code at all.
>>
>> How much would it hurt us if we said 3.8 is the last kernel that supported NUMAQ?
>> If anyone wants the functionality, they should use 3.8 or older.
>
>v3.9 - any non-trivial patch in the stage of being contemplated near the end of the
>v3.9 merge window is most likely v3.10 material.
>
>Thanks,
>
> Ingo

--

Sent from my mobile phone. Please excuse brevity and lack of formatting.

H. Peter Anvin

Mar 1, 2013, 10:46:08 AM3/1/13

to Yinghai Lu, Yasuaki Ishimatsu, Linus Torvalds, Tang Chen, Andrew Morton, Lai Jiangshan, Don Morris, Tim Gardner, Tejun Heo, Tony Luck, Thomas Renninger, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Jarkko Sakkinen, Benjamin Herrenschmidt, Wen Congyang, Lin Feng, guz....@cn.fujitsu.com, Gui jianfeng

On 02/28/2013 11:55 PM, Yinghai Lu wrote:
>
> Let me try again:
>
> movablemem_map is broken idea or poor design.
>

Very much so. I have said this before: this is potentially useful
during development/testing, but anyone who expects to actually tell
their customers to use it is abusive.

-hpa

Yinghai Lu

Mar 1, 2013, 5:53:22 PM3/1/13

to Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton, Linus Torvalds, linux-...@vger.kernel.org, Yinghai Lu, Tony Luck, Thomas Renninger, Tejun Heo, Tang Chen, Yasuaki Ishimatsu

Tim found:
[ 0.181441] WARNING: at
/home/rtg/ukb/raring/amd64/unstable-3.9/ubuntu-raring/arch/x86/kernel/smpboot.c:324
topology_sane.isra.2+0x6f/0x80()
[ 0.181443] Hardware name: S2600CP
[ 0.181445] sched: CPU #1's llc-sibling CPU #0 is not on the same
node! [node: 1 != 0]. Ignoring dependency.
[ 0.166925] smpboot: Booting Node 1, Processors #1
[ 0.181446] Modules linked in:
[ 0.181451] Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
[ 0.181452] Call Trace:
[ 0.181457] [<ffffffff816c0270>] ? topology_sane.isra.2+0x6f/0x80
[ 0.181463] [<ffffffff8105914f>] warn_slowpath_common+0x7f/0xc0
[ 0.181469] [<ffffffff8105924c>] warn_slowpath_fmt+0x4c/0x50
[ 0.181473] [<ffffffff816bf737>] ? mcheck_cpu_init+0x378/0x3fb
[ 0.181478] [<ffffffff816ca32b>] ? cpuid_eax+0x27/0x2e
[ 0.181483] [<ffffffff816c0270>] topology_sane.isra.2+0x6f/0x80
[ 0.181488] [<ffffffff816c0534>] set_cpu_sibling_map+0x279/0x449
[ 0.181493] [<ffffffff816c0821>] start_secondary+0x11d/0x1e5
[ 0.181507] ---[ end trace 8c24ebb220b8c665 ]---

Don Morris reproduced this on an HP z620 workstation and bisected it to:
# bad: [e8d1955258091e4c92d5a975ebd7fd8a98f5d30f] acpi, memory-hotplug: parse SRAT before memblock is ready

It turns out movablemem_map has some problems, and it breaks several things:
1. numa_init is called several times, NOT just for srat, so those
nodes_clear(numa_nodes_parsed)
memset(&numa_meminfo, 0, sizeof(numa_meminfo))
can not simply be removed. The call sequence is: numaq, srat, amd, dummy,
and the fallback path has to keep working.
2. It simply splits acpi_numa_init to early_parse_srat.
a. early_parse_srat is NOT called for ia64, so ia64 is broken.
b. for (i = 0; i < MAX_LOCAL_APIC; i++)
set_apicid_to_node(i, NUMA_NO_NODE)
is still left in numa_init, so it just clears the result from early_parse_srat.
It should be moved before that.
c. It breaks ACPI table override, as the acpi table scan is moved
early, before the override from INITRD is settled.
3. The patch TITLE is totally misleading: there is NO x86 in the title,
but it changes critical x86 code. As a result the x86 guys did not
pay attention and did not find the problem early. Those patches really should
be routed via tip/x86/mm.
4. After that commit, the following ranges can not use movable ram:
a. real_mode code. Well, funny: could legacy Node0 [0,1M) be hot-removed?
b. initrd. It will be freed after booting, so it could be on movable ram.
c. crashkernel for kdump: looks like we can not put the kdump kernel above 4G
anymore.
d. init_mem_mapping: can not put the page table high anymore.
e. initmem_init: vmemmap can not be high on the local node anymore. That is
not good.

If a node is hotpluggable, the memory-related ranges such as the page table and
vmemmap could be on that node without problem, and should be on that node.

We have a workaround patch that could fix some of the problems, but some
can not be fixed.
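
Point 1 above can be pictured with a small userspace model: several init methods are tried in order, and the shared state is cleared before each attempt so that a method which fails midway leaves nothing behind for the next one. This is only a sketch under stated assumptions. struct numa_state, try_init and the two init functions are hypothetical names made up for illustration and do not mirror the real numa_init() beyond the reset-then-try pattern.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 4

/* Hypothetical stand-in for the parsed-node state that gets reset. */
struct numa_state {
	int nr_parsed;
	int parsed[MAX_NODES];
};

static struct numa_state state;

typedef int (*init_fn)(struct numa_state *);

/* A method that records node 1 and then fails, leaving partial results. */
static int init_fails_midway(struct numa_state *s)
{
	assert(s->nr_parsed < MAX_NODES);
	s->parsed[s->nr_parsed++] = 1;
	return -1;
}

/* The last-resort method: a single node 0 covering everything. */
static int init_dummy(struct numa_state *s)
{
	assert(s->nr_parsed < MAX_NODES);
	s->parsed[s->nr_parsed++] = 0;
	return 0;
}

/* Clear the state before every attempt, then run the method. */
static int try_init(init_fn fn)
{
	memset(&state, 0, sizeof(state));
	return fn(&state);
}

/* Try each method in order and stop at the first one that succeeds. */
static int init_with_fallback(void)
{
	init_fn methods[] = { init_fails_midway, init_dummy };
	size_t i;

	for (i = 0; i < sizeof(methods) / sizeof(methods[0]); i++)
		if (try_init(methods[i]) == 0)
			return 0;
	return -1;
}
```

If the memset() in try_init() were dropped, the dummy method would inherit the stale node 1 entry from the failed attempt. That is the kind of leftover the clearing calls guard against.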

So just remove that offending commit and related ones including:
commit f7210e6c4ac795694106c1c5307134d3fc233e88
mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().

commit 01a178a94e8eaec351b29ee49fbb3d1c124cb7fb
acpi, memory-hotplug: support getting hotplug info from SRAT

commit 27168d38fa209073219abedbe6a9de7ba9acbfad
acpi, memory-hotplug: extend movablemem_map ranges to the end of node

commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
acpi, memory-hotplug: parse SRAT before memblock is ready

commit fb06bc8e5f42f38c011de0e59481f464a82380f6
page_alloc: bootmem limit with movablecore_map

commit 42f47e27e761fee07da69e04612ec7dd0d490edd
page_alloc: make movablemem_map have higher priority

commit 6981ec31146cf19454c55c130625f6cee89aab95
page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes

commit 34b71f1e04fcba578e719e675b4882eeeb2a1f6f
page_alloc: add movable_memmap kernel parameter

commit 4d59a75125d5a4717e57e9fc62c64b3d346e603e
x86: get pg_data_t's memory from other node

Later we should have patches that make sure the kernel puts the page table
and vmemmap on local node ram instead of pushing them down to node0.
We also need to find a way to put other kernel-used ram on local node ram.

Reported-by: Tim Gardner <tim.g...@canonical.com>
Reported-by: Don Morris <don.m...@hp.com>
Bisected-by: Don Morris <don.m...@hp.com>
Tested-by: Don Morris <don.m...@hp.com>
Signed-off-by: Yinghai Lu <yin...@kernel.org>
Cc: Tony Luck <tony...@intel.com>
Cc: Thomas Renninger <tr...@suse.de>
Cc: Tejun Heo <t...@kernel.org>
Cc: Tang Chen <tang...@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu...@jp.fujitsu.com>

---
Documentation/kernel-parameters.txt | 36 ----
arch/x86/kernel/setup.c | 13 -
arch/x86/mm/numa.c | 11 -
arch/x86/mm/srat.c | 125 ---------------
drivers/acpi/numa.c | 23 +-
include/linux/acpi.h | 8 -
include/linux/memblock.h | 2
include/linux/mm.h | 18 --
mm/memblock.c | 50 ------
mm/page_alloc.c | 285 ------------------------------------
10 files changed, 27 insertions(+), 544 deletions(-)

Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt
+++ linux-2.6/Documentation/kernel-parameters.txt
@@ -1645,42 +1645,6 @@ bytes respectively. Such letter suffixes
that the amount of memory usable for all allocations
is not too small.

- movablemem_map=acpi
- [KNL,X86,IA-64,PPC] This parameter is similar to
- memmap except it specifies the memory map of
- ZONE_MOVABLE.
- This option inform the kernel to use Hot Pluggable bit
- in flags from SRAT from ACPI BIOS to determine which
- memory devices could be hotplugged. The corresponding
- memory ranges will be set as ZONE_MOVABLE.
- NOTE: Whatever node the kernel resides in will always
- be un-hotpluggable.
-
- movablemem_map=nn[KMG]@ss[KMG]
- [KNL,X86,IA-64,PPC] This parameter is similar to
- memmap except it specifies the memory map of
- ZONE_MOVABLE.
- If user specifies memory ranges, the info in SRAT will
- be ingored. And it works like the following:
- - If more ranges are all within one node, then from
- lowest ss to the end of the node will be ZONE_MOVABLE.
- - If a range is within a node, then from ss to the end
- of the node will be ZONE_MOVABLE.
- - If a range covers two or more nodes, then from ss to
- the end of the 1st node will be ZONE_MOVABLE, and all
- the rest nodes will only have ZONE_MOVABLE.
- If memmap is specified at the same time, the
- movablemem_map will be limited within the memmap
- areas. If kernelcore or movablecore is also specified,
- movablemem_map will have higher priority to be
- satisfied. So the administrator should be careful that
- the amount of movablemem_map areas are not too large.
- Otherwise kernel won't have enough memory to start.
- NOTE: We don't stop users specifying the node the
- kernel resides in as hotpluggable so that this
- option can be used as a workaround of firmware
- bugs.
-
MTD_Partition= [MTD]
Format: <name>,<region-number>,<size>,<offset>

Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -1056,15 +1056,6 @@ void __init setup_arch(char **cmdline_p)
setup_bios_corruption_check();
#endif

- /*
- * In the memory hotplug case, the kernel needs info from SRAT to
- * determine which memory is hotpluggable before allocating memory
- * using memblock.
- */
- acpi_boot_table_init();
- early_acpi_boot_init();
- early_parse_srat();
-
#ifdef CONFIG_X86_32
printk(KERN_DEBUG "initial memory mapped: [mem 0x00000000-%#010lx]\n",
(max_pfn_mapped<<PAGE_SHIFT) - 1);
@@ -1110,6 +1101,10 @@ void __init setup_arch(char **cmdline_p)
/*
* Parse the ACPI tables for possible boot-time SMP configuration.
*/
+ acpi_boot_table_init();
+
+ early_acpi_boot_init();
+
initmem_init();
memblock_find_dma_reserve();

Index: linux-2.6/arch/x86/mm/numa.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/numa.c
+++ linux-2.6/arch/x86/mm/numa.c
@@ -212,9 +212,10 @@ static void __init setup_node_data(int n
* Allocate node data. Try node-local memory and then any node.
* Never allocate in DMA zone.
*/
- nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
+ nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
if (!nd_pa) {
- pr_err("Cannot find %zu bytes in any node\n", nd_size);
+ pr_err("Cannot find %zu bytes in node %d\n",
+ nd_size, nid);
return;
}
nd = __va(nd_pa);
@@ -559,12 +560,10 @@ static int __init numa_init(int (*init_f
for (i = 0; i < MAX_LOCAL_APIC; i++)
set_apicid_to_node(i, NUMA_NO_NODE);

- /*
- * Do not clear numa_nodes_parsed or zero numa_meminfo here, because
- * SRAT was parsed earlier in early_parse_srat().
- */
+ nodes_clear(numa_nodes_parsed);
nodes_clear(node_possible_map);
nodes_clear(node_online_map);
+ memset(&numa_meminfo, 0, sizeof(numa_meminfo));
WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
numa_reset_distance();

Index: linux-2.6/arch/x86/mm/srat.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/srat.c
+++ linux-2.6/arch/x86/mm/srat.c
@@ -141,126 +141,11 @@ static inline int save_add_info(void) {r
static inline int save_add_info(void) {return 0;}
#endif

-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-static void __init
-handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable)
-{
- int overlap, i;
- unsigned long start_pfn, end_pfn;
-
- start_pfn = PFN_DOWN(start);
- end_pfn = PFN_UP(end);
-
- /*
- * For movablemem_map=acpi:
- *
- * SRAT: |_____| |_____| |_________| |_________| ......
- * node id: 0 1 1 2
- * hotpluggable: n y y n
- * movablemem_map: |_____| |_________|
- *
- * Using movablemem_map, we can prevent memblock from allocating memory
- * on ZONE_MOVABLE at boot time.
- *
- * Before parsing SRAT, memblock has already reserve some memory ranges
- * for other purposes, such as for kernel image. We cannot prevent
- * kernel from using these memory, so we need to exclude these memory
- * even if it is hotpluggable.
- * Furthermore, to ensure the kernel has enough memory to boot, we make
- * all the memory on the node which the kernel resides in
- * un-hotpluggable.
- */
- if (hotpluggable && movablemem_map.acpi) {
- /* Exclude ranges reserved by memblock. */
- struct memblock_type *rgn = &memblock.reserved;
-
- for (i = 0; i < rgn->cnt; i++) {
- if (end <= rgn->regions[i].base ||
- start >= rgn->regions[i].base +
- rgn->regions[i].size)
- continue;
-
- /*
- * If the memory range overlaps the memory reserved by
- * memblock, then the kernel resides in this node.
- */
- node_set(node, movablemem_map.numa_nodes_kernel);
-
- goto out;
- }
-
- /*
- * If the kernel resides in this node, then the whole node
- * should not be hotpluggable.
- */
- if (node_isset(node, movablemem_map.numa_nodes_kernel))
- goto out;
-
- insert_movablemem_map(start_pfn, end_pfn);
-
- /*
- * numa_nodes_hotplug nodemask represents which nodes are put
- * into movablemem_map.map[].
- */
- node_set(node, movablemem_map.numa_nodes_hotplug);
- goto out;
- }
-
- /*
- * For movablemem_map=nn[KMG]@ss[KMG]:
- *
- * SRAT: |_____| |_____| |_________| |_________| ......
- * node id: 0 1 1 2
- * user specified: |__| |___|
- * movablemem_map: |___| |_________| |______| ......
- *
- * Using movablemem_map, we can prevent memblock from allocating memory
- * on ZONE_MOVABLE at boot time.
- *
- * NOTE: In this case, SRAT info will be ingored.
- */
- overlap = movablemem_map_overlap(start_pfn, end_pfn);
- if (overlap >= 0) {
- /*
- * If part of this range is in movablemem_map, we need to
- * add the range after it to extend the range to the end
- * of the node, because from the min address specified to
- * the end of the node will be ZONE_MOVABLE.
- */
- start_pfn = max(start_pfn,
- movablemem_map.map[overlap].start_pfn);
- insert_movablemem_map(start_pfn, end_pfn);
-
- /*
- * Set the nodemask, so that if the address range on one node
- * is not continuse, we can add the subsequent ranges on the
- * same node into movablemem_map.
- */
- node_set(node, movablemem_map.numa_nodes_hotplug);
- } else {
- if (node_isset(node, movablemem_map.numa_nodes_hotplug))
- /*
- * Insert the range if we already have movable ranges
- * on the same node.
- */
- insert_movablemem_map(start_pfn, end_pfn);
- }
-out:
- return;
-}
-#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
-static inline void
-handle_movablemem(int node, u64 start, u64 end, u32 hotpluggable)
-{
-}
-#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
-
/* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
int __init
acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
{
u64 start, end;
- u32 hotpluggable;
int node, pxm;

if (srat_disabled())
@@ -269,8 +154,7 @@ acpi_numa_memory_affinity_init(struct ac
goto out_err_bad_srat;
if ((ma->flags & ACPI_SRAT_MEM_ENABLED) == 0)
goto out_err;
- hotpluggable = ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE;
- if (hotpluggable && !save_add_info())
+ if ((ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) && !save_add_info())
goto out_err;

start = ma->base_address;
@@ -290,12 +174,9 @@ acpi_numa_memory_affinity_init(struct ac

node_set(node, numa_nodes_parsed);

- printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx] %s\n",
+ printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
node, pxm,
- (unsigned long long) start, (unsigned long long) end - 1,
- hotpluggable ? "Hot Pluggable": "");
-
- handle_movablemem(node, start, end, hotpluggable);
+ (unsigned long long) start, (unsigned long long) end - 1);

return 0;
out_err_bad_srat:
Index: linux-2.6/drivers/acpi/numa.c
===================================================================
--- linux-2.6.orig/drivers/acpi/numa.c
+++ linux-2.6/drivers/acpi/numa.c
@@ -282,10 +282,10 @@ acpi_table_parse_srat(enum acpi_srat_typ
handler, max_entries);
}

-static int srat_mem_cnt;
-
-void __init early_parse_srat(void)
+int __init acpi_numa_init(void)
{
+ int cnt = 0;
+
/*
* Should not limit number with cpu num that is from NR_CPUS or nr_cpus=
* SRAT cpu entries could have different order with that in MADT.
@@ -295,24 +295,21 @@ void __init early_parse_srat(void)
/* SRAT: Static Resource Affinity Table */
if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
acpi_table_parse_srat(ACPI_SRAT_TYPE_X2APIC_CPU_AFFINITY,
- acpi_parse_x2apic_affinity, 0);
+ acpi_parse_x2apic_affinity, 0);
acpi_table_parse_srat(ACPI_SRAT_TYPE_CPU_AFFINITY,
- acpi_parse_processor_affinity, 0);
- srat_mem_cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
- acpi_parse_memory_affinity,
- NR_NODE_MEMBLKS);
+ acpi_parse_processor_affinity, 0);
+ cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
+ acpi_parse_memory_affinity,
+ NR_NODE_MEMBLKS);
}
-}

-int __init acpi_numa_init(void)
-{
/* SLIT: System Locality Information Table */
acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);

acpi_numa_arch_fixup();

- if (srat_mem_cnt < 0)
- return srat_mem_cnt;
+ if (cnt < 0)
+ return cnt;
else if (!parsed_numa_memblks)
return -ENOENT;
return 0;
Index: linux-2.6/include/linux/acpi.h
===================================================================
--- linux-2.6.orig/include/linux/acpi.h
+++ linux-2.6/include/linux/acpi.h
@@ -485,14 +485,6 @@ static inline bool acpi_driver_match_dev

#endif /* !CONFIG_ACPI */

-#ifdef CONFIG_ACPI_NUMA
-void __init early_parse_srat(void);
-#else
-static inline void early_parse_srat(void)
-{
-}
-#endif
-
#ifdef CONFIG_ACPI
void acpi_os_set_prepare_sleep(int (*func)(u8 sleep_state,
u32 pm1a_ctrl, u32 pm1b_ctrl));
Index: linux-2.6/include/linux/memblock.h
===================================================================
--- linux-2.6.orig/include/linux/memblock.h
+++ linux-2.6/include/linux/memblock.h
@@ -42,7 +42,6 @@ struct memblock {

extern struct memblock memblock;
extern int memblock_debug;
-extern struct movablemem_map movablemem_map;

#define memblock_dbg(fmt, ...) \
if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
@@ -61,7 +60,6 @@ int memblock_reserve(phys_addr_t base, p
void memblock_trim_memory(phys_addr_t align);

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
unsigned long *out_end_pfn, int *out_nid);

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -1333,24 +1333,6 @@ extern void free_bootmem_with_active_reg
unsigned long max_low_pfn);
extern void sparse_memory_present_with_active_regions(int nid);

-#define MOVABLEMEM_MAP_MAX MAX_NUMNODES
-struct movablemem_entry {
- unsigned long start_pfn; /* start pfn of memory segment */
- unsigned long end_pfn; /* end pfn of memory segment (exclusive) */
-};
-
-struct movablemem_map {
- bool acpi; /* true if using SRAT info */
- int nr_map;
- struct movablemem_entry map[MOVABLEMEM_MAP_MAX];
- nodemask_t numa_nodes_hotplug; /* on which nodes we specify memory */
- nodemask_t numa_nodes_kernel; /* on which nodes kernel resides in */
-};
-
-extern void __init insert_movablemem_map(unsigned long start_pfn,
- unsigned long end_pfn);
-extern int __init movablemem_map_overlap(unsigned long start_pfn,
- unsigned long end_pfn);
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

#if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
Index: linux-2.6/mm/memblock.c
===================================================================
--- linux-2.6.orig/mm/memblock.c
+++ linux-2.6/mm/memblock.c
@@ -92,58 +92,9 @@ static long __init_memblock memblock_ove
*
* Find @size free area aligned to @align in the specified range and node.
*
- * If we have CONFIG_HAVE_MEMBLOCK_NODE_MAP defined, we need to check if the
- * memory we found if not in hotpluggable ranges.
- *
* RETURNS:
* Found address on success, %0 on failure.
*/
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
- phys_addr_t end, phys_addr_t size,
- phys_addr_t align, int nid)
-{
- phys_addr_t this_start, this_end, cand;
- u64 i;
- int curr = movablemem_map.nr_map - 1;
-
- /* pump up @end */
- if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
- end = memblock.current_limit;
-
- /* avoid allocating the first page */
- start = max_t(phys_addr_t, start, PAGE_SIZE);
- end = max(start, end);
-
- for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
- this_start = clamp(this_start, start, end);
- this_end = clamp(this_end, start, end);
-
-restart:
- if (this_end <= this_start || this_end < size)
- continue;
-
- for (; curr >= 0; curr--) {
- if ((movablemem_map.map[curr].start_pfn << PAGE_SHIFT)
- < this_end)
- break;
- }
-
- cand = round_down(this_end - size, align);
- if (curr >= 0 &&
- cand < movablemem_map.map[curr].end_pfn << PAGE_SHIFT) {
- this_end = movablemem_map.map[curr].start_pfn
- << PAGE_SHIFT;
- goto restart;
- }
-
- if (cand >= this_start)
- return cand;
- }
-
- return 0;
-}
-#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
phys_addr_t end, phys_addr_t size,
phys_addr_t align, int nid)
@@ -172,7 +123,6 @@ phys_addr_t __init_memblock memblock_fin
}
return 0;
}
-#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

/**
* memblock_find_in_range - find free area in given range
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -202,18 +202,11 @@ static unsigned long __meminitdata nr_al
static unsigned long __meminitdata dma_reserve;

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-/* Movable memory ranges, will also be used by memblock subsystem. */
-struct movablemem_map movablemem_map = {
- .acpi = false,
- .nr_map = 0,
-};
-
static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
static unsigned long __initdata required_kernelcore;
static unsigned long __initdata required_movablecore;
static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
-static unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];

/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
@@ -4412,77 +4405,6 @@ static unsigned long __meminit zone_abse
return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
}

-/**
- * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
- *
- * zone_movable_limit is initialized as 0. This function will try to get
- * the first ZONE_MOVABLE pfn of each node from movablemem_map, and
- * assigne them to zone_movable_limit.
- * zone_movable_limit[nid] == 0 means no limit for the node.
- *
- * Note: Each range is represented as [start_pfn, end_pfn)
- */
-static void __meminit sanitize_zone_movable_limit(void)
-{
- int map_pos = 0, i, nid;
- unsigned long start_pfn, end_pfn;
-
- if (!movablemem_map.nr_map)
- return;
-
- /* Iterate all ranges from minimum to maximum */
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
- /*
- * If we have found lowest pfn of ZONE_MOVABLE of the node
- * specified by user, just go on to check next range.
- */
- if (zone_movable_limit[nid])
- continue;
-
-#ifdef CONFIG_ZONE_DMA
- /* Skip DMA memory. */
- if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA])
- start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA];
-#endif
-
-#ifdef CONFIG_ZONE_DMA32
- /* Skip DMA32 memory. */
- if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA32])
- start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA32];
-#endif
-
-#ifdef CONFIG_HIGHMEM
- /* Skip lowmem if ZONE_MOVABLE is highmem. */
- if (zone_movable_is_highmem() &&
- start_pfn < arch_zone_lowest_possible_pfn[ZONE_HIGHMEM])
- start_pfn = arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
-#endif
-
- if (start_pfn >= end_pfn)
- continue;
-
- while (map_pos < movablemem_map.nr_map) {
- if (end_pfn <= movablemem_map.map[map_pos].start_pfn)
- break;
-
- if (start_pfn >= movablemem_map.map[map_pos].end_pfn) {
- map_pos++;
- continue;
- }
-
- /*
- * The start_pfn of ZONE_MOVABLE is either the minimum
- * pfn specified by movablemem_map, or 0, which means
- * the node has no ZONE_MOVABLE.
- */
- zone_movable_limit[nid] = max(start_pfn,
- movablemem_map.map[map_pos].start_pfn);
-
- break;
- }
- }
-}
-
#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
@@ -4500,6 +4422,7 @@ static inline unsigned long __meminit zo

return zholes_size[zone_type];
}
+
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
@@ -4941,19 +4864,12 @@ static void __init find_zone_movable_pfn
required_kernelcore = max(required_kernelcore, corepages);
}

- /*
- * If neither kernelcore/movablecore nor movablemem_map is specified,
- * there is no ZONE_MOVABLE. But if movablemem_map is specified, the
- * start pfn of ZONE_MOVABLE has been stored in zone_movable_limit[].
- */
- if (!required_kernelcore) {
- if (movablemem_map.nr_map)
- memcpy(zone_movable_pfn, zone_movable_limit,
- sizeof(zone_movable_pfn));
+ /* If kernelcore was not specified, there is no ZONE_MOVABLE */
+ if (!required_kernelcore)
goto out;
- }

/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
+ find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];

restart:
@@ -4981,24 +4897,10 @@ restart:
for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
unsigned long size_pages;

- /*
- * Find more memory for kernelcore in
- * [zone_movable_pfn[nid], zone_movable_limit[nid]).
- */
start_pfn = max(start_pfn, zone_movable_pfn[nid]);
if (start_pfn >= end_pfn)
continue;

- if (zone_movable_limit[nid]) {
- end_pfn = min(end_pfn, zone_movable_limit[nid]);
- /* No range left for kernelcore in this node */
- if (start_pfn >= end_pfn) {
- zone_movable_pfn[nid] =
- zone_movable_limit[nid];
- break;
- }
- }
-
/* Account for what is only usable for kernelcore */
if (start_pfn < usable_startpfn) {
unsigned long kernel_pages;
@@ -5058,12 +4960,12 @@ restart:
if (usable_nodes && required_kernelcore > usable_nodes)
goto restart;

-out:
/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
for (nid = 0; nid < MAX_NUMNODES; nid++)
zone_movable_pfn[nid] =
roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);

+out:
/* restore the node_state */
node_states[N_MEMORY] = saved_node_state;
}
@@ -5126,8 +5028,6 @@ void __init free_area_init_nodes(unsigne

/* Find the PFNs that ZONE_MOVABLE begins at in each node */
memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
- find_usable_zone_for_movable();
- sanitize_zone_movable_limit();
find_zone_movable_pfns_for_nodes();

/* Print out the zone ranges */
@@ -5211,181 +5111,6 @@ static int __init cmdline_parse_movablec
early_param("kernelcore", cmdline_parse_kernelcore);
early_param("movablecore", cmdline_parse_movablecore);

-/**
- * movablemem_map_overlap() - Check if a range overlaps movablemem_map.map[].
- * @start_pfn: start pfn of the range to be checked
- * @end_pfn: end pfn of the range to be checked (exclusive)
- *
- * This function checks if a given memory range [start_pfn, end_pfn) overlaps
- * the movablemem_map.map[] array.
- *
- * Return: index of the first overlapped element in movablemem_map.map[]
- * or -1 if they don't overlap each other.
- */
-int __init movablemem_map_overlap(unsigned long start_pfn,
- unsigned long end_pfn)
-{
- int overlap;
-
- if (!movablemem_map.nr_map)
- return -1;
-
- for (overlap = 0; overlap < movablemem_map.nr_map; overlap++)
- if (start_pfn < movablemem_map.map[overlap].end_pfn)
- break;
-
- if (overlap == movablemem_map.nr_map ||
- end_pfn <= movablemem_map.map[overlap].start_pfn)
- return -1;
-
- return overlap;
-}
-
-/**
- * insert_movablemem_map - Insert a memory range in to movablemem_map.map.
- * @start_pfn: start pfn of the range
- * @end_pfn: end pfn of the range
- *
- * This function will also merge the overlapped ranges, and sort the array
- * by start_pfn in monotonic increasing order.
- */
-void __init insert_movablemem_map(unsigned long start_pfn,
- unsigned long end_pfn)
-{
- int pos, overlap;
-
- /*
- * pos will be at the 1st overlapped range, or the position
- * where the element should be inserted.
- */
- for (pos = 0; pos < movablemem_map.nr_map; pos++)
- if (start_pfn <= movablemem_map.map[pos].end_pfn)
- break;
-
- /* If there is no overlapped range, just insert the element. */
- if (pos == movablemem_map.nr_map ||
- end_pfn < movablemem_map.map[pos].start_pfn) {
- /*
- * If pos is not the end of array, we need to move all
- * the rest elements backward.
- */
- if (pos < movablemem_map.nr_map)
- memmove(&movablemem_map.map[pos+1],
- &movablemem_map.map[pos],
- sizeof(struct movablemem_entry) *
- (movablemem_map.nr_map - pos));
- movablemem_map.map[pos].start_pfn = start_pfn;
- movablemem_map.map[pos].end_pfn = end_pfn;
- movablemem_map.nr_map++;
- return;
- }
-
- /* overlap will be at the last overlapped range */
- for (overlap = pos + 1; overlap < movablemem_map.nr_map; overlap++)
- if (end_pfn < movablemem_map.map[overlap].start_pfn)
- break;
-
- /*
- * If there are more ranges overlapped, we need to merge them,
- * and move the rest elements forward.
- */
- overlap--;
- movablemem_map.map[pos].start_pfn = min(start_pfn,
- movablemem_map.map[pos].start_pfn);
- movablemem_map.map[pos].end_pfn = max(end_pfn,
- movablemem_map.map[overlap].end_pfn);
-
- if (pos != overlap && overlap + 1 != movablemem_map.nr_map)
- memmove(&movablemem_map.map[pos+1],
- &movablemem_map.map[overlap+1],
- sizeof(struct movablemem_entry) *
- (movablemem_map.nr_map - overlap - 1));
-
- movablemem_map.nr_map -= overlap - pos;
-}
-
-/**
- * movablemem_map_add_region - Add a memory range into movablemem_map.
- * @start: physical start address of range
- * @end: physical end address of range
- *
- * This function transform the physical address into pfn, and then add the
- * range into movablemem_map by calling insert_movablemem_map().
- */
-static void __init movablemem_map_add_region(u64 start, u64 size)
-{
- unsigned long start_pfn, end_pfn;
-
- /* In case size == 0 or start + size overflows */
- if (start + size <= start)
- return;
-
- if (movablemem_map.nr_map >= ARRAY_SIZE(movablemem_map.map)) {
- pr_err("movablemem_map: too many entries;"
- " ignoring [mem %#010llx-%#010llx]\n",
- (unsigned long long) start,
- (unsigned long long) (start + size - 1));
- return;
- }
-
- start_pfn = PFN_DOWN(start);
- end_pfn = PFN_UP(start + size);
- insert_movablemem_map(start_pfn, end_pfn);
-}
-
-/*
- * cmdline_parse_movablemem_map - Parse boot option movablemem_map.
- * @p: The boot option of the following format:
- * movablemem_map=nn[KMG]@ss[KMG]
- *
- * This option sets the memory range [ss, ss+nn) to be used as movable memory.
- *
- * Return: 0 on success or -EINVAL on failure.
- */
-static int __init cmdline_parse_movablemem_map(char *p)
-{
- char *oldp;
- u64 start_at, mem_size;
-
- if (!p)
- goto err;
-
- if (!strcmp(p, "acpi"))
- movablemem_map.acpi = true;
-
- /*
- * If user decide to use info from BIOS, all the other user specified
- * ranges will be ingored.
- */
- if (movablemem_map.acpi) {
- if (movablemem_map.nr_map) {
- memset(movablemem_map.map, 0,
- sizeof(struct movablemem_entry)
- * movablemem_map.nr_map);
- movablemem_map.nr_map = 0;
- }
- return 0;
- }
-
- oldp = p;
- mem_size = memparse(p, &p);
- if (p == oldp)
- goto err;
-
- if (*p == '@') {
- oldp = ++p;
- start_at = memparse(p, &p);
- if (p == oldp || *p != '\0')
- goto err;
-
- movablemem_map_add_region(start_at, mem_size);
- return 0;
- }
-err:
- return -EINVAL;
-}
-early_param("movablemem_map", cmdline_parse_movablemem_map);
-
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */

/**
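
For readers following the removed code: below is a minimal user-space sketch of the sorted insert-and-merge behaviour that the kernel-doc comment on insert_movablemem_map() describes (ranges kept sorted by start_pfn, overlapping ranges merged). It is an illustration only, not the kernel code. The names struct pfn_range, struct range_map, MAX_RANGES and range_map_insert() are hypothetical, and the capacity value is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_RANGES 32	/* hypothetical capacity, not the kernel's value */

struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
};

struct range_map {
	size_t nr;
	struct pfn_range map[MAX_RANGES];
};

/*
 * Insert [start_pfn, end_pfn) into a map kept sorted by start_pfn,
 * merging it with any ranges it overlaps or touches.
 * Returns 0 on success, -1 if the range is empty or the map is full.
 */
static int range_map_insert(struct range_map *m,
			    unsigned long start_pfn, unsigned long end_pfn)
{
	size_t pos, last;

	assert(m != NULL);
	if (start_pfn >= end_pfn)
		return -1;

	/* First entry that ends at or after the new start. */
	for (pos = 0; pos < m->nr; pos++)
		if (start_pfn <= m->map[pos].end_pfn)
			break;

	/* No overlap: shift the tail up by one slot and insert. */
	if (pos == m->nr || end_pfn < m->map[pos].start_pfn) {
		if (m->nr >= MAX_RANGES)
			return -1;
		memmove(&m->map[pos + 1], &m->map[pos],
			sizeof(m->map[0]) * (m->nr - pos));
		m->map[pos].start_pfn = start_pfn;
		m->map[pos].end_pfn = end_pfn;
		m->nr++;
		return 0;
	}

	/* Find the last entry that overlaps or touches the new range. */
	for (last = pos; last + 1 < m->nr; last++)
		if (end_pfn < m->map[last + 1].start_pfn)
			break;

	/* Merge everything from pos..last into map[pos]. */
	if (start_pfn < m->map[pos].start_pfn)
		m->map[pos].start_pfn = start_pfn;
	m->map[pos].end_pfn = end_pfn > m->map[last].end_pfn ?
			      end_pfn : m->map[last].end_pfn;

	/* Close the gap left by the merged entries. */
	memmove(&m->map[pos + 1], &m->map[last + 1],
		sizeof(m->map[0]) * (m->nr - last - 1));
	m->nr -= last - pos;
	return 0;
}
```

Starting from an empty map, inserting [10, 20), [40, 50) and [25, 30) leaves three sorted entries; inserting [15, 45) afterwards overlaps all of them and collapses the map to the single range [10, 50).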

Yinghai Lu

Mar 2, 2013, 12:46:10 AM

to chen tang, H. Peter Anvin, Linus Torvalds, Andrew Morton, Konrad Rzeszutek Wilk, Stefano Stabellini, Tang Chen, Yasuaki Ishimatsu, Don Morris, Tim Gardner, Tejun Heo, Tony Luck, Thomas Renninger, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Benjamin Herrenschmidt

On Fri, Mar 1, 2013 at 7:03 PM, chen tang <imtan...@gmail.com> wrote:
>
> Thank you for your suggestion and fix work. :)
> I would prefer your Plan b. But one last thing I want to confirm:
>
> Will "allocating pgdat and zone on local node" prevent node hot-removing ?
> Or is it safe to free all node data when removing a node ?
> AFAIK, no way to ensure node data is not on thread stack.

Not sure. I need to go over the code.
That is slub's limitation.

If it is not, it should be fixed.

>
> If it is OK, I think Plan B is OK, and we can improve movablemem_map more in
> the future.
>
> BTW, I didn't mean to deny your idea and work. NUMA performance is always
> understand our consideration.
> It's just we plan it as a long way development in the future.
> movablemem_map is very important to us. And we do hope to keep it in kernel
> now, and improve it later.

That does not look like the right way to use the mainline tree when
developing new features.

You don't need to put development/testing support patches in the mainline.
Just put those support patches in your local tree.

Everyone has a bunch of development/debug/test-stub patches on their own
hard disk for their working area, but they don't need to put them into the
mainline tree.

Good practice should be:
Have the feature completely done in your local tree first,
then send it out as several patchsets, and get them reviewed and merged
one by one.

Sometimes it turns out that your whole patchset has a problem that
cannot be fixed during review, and it has to be redesigned.

The mainline tree is NOT a testbed.

For pci-root-bus hotplug, I already had the code completely done.
Then I sent out the patchsets one by one to get each fully reviewed.
One patchset, about acpi-scan, was totally rewritten by Rafael, with a better
and cleaner design, after he understood our needs.
The ioapic and iommu patchsets are still left; they have been in
my local tree for more than 6 months and I keep optimizing them.

BTW, please do not top-post in future.

Thanks

Yinghai