Signed-off-by: Haicheng Li <haich...@linux.intel.com>
---
mm/slab.c | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 7dfa481..a9486a0 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -966,18 +966,17 @@ static void *alternate_node_alloc(struct kmem_cache *, gfp_t);
static struct array_cache **alloc_alien_cache(int node, int limit, gfp_t gfp)
{
struct array_cache **ac_ptr;
- int memsize = sizeof(void *) * nr_node_ids;
+ int memsize = sizeof(void *) * MAX_NUMNODES;
int i;
if (limit > 1)
limit = 12;
ac_ptr = kmalloc_node(memsize, gfp, node);
if (ac_ptr) {
+ memset(ac_ptr, 0, memsize);
for_each_node(i) {
- if (i == node || !node_online(i)) {
- ac_ptr[i] = NULL;
+ if (i == node || !node_online(i))
continue;
- }
ac_ptr[i] = alloc_arraycache(node, limit, 0xbaadf00d, gfp);
if (!ac_ptr[i]) {
for (i--; i >= 0; i--)
--
1.6.0.rc1
> Memory hotplug can bring a new node online at runtime, after which the reap
> timer selects this new node as a reap node. In that case, for each existing
> kmem_list, we need to ensure that the alien cache entry corresponding to the
> newly added node is NULL. Otherwise, reap_alien() might hit a BUG when it
> touches the newly added node.
>
> Signed-off-by: Haicheng Li <haich...@linux.intel.com>
Acked-by: Andi Kleen <a...@linux.intel.com>
IMHO this is a candidate for 2.6.33 and even for stable.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
> struct array_cache **ac_ptr;
> - int memsize = sizeof(void *) * nr_node_ids;
> + int memsize = sizeof(void *) * MAX_NUMNODES;
> int i;
Why does the alien cache pointer array size have to be increased? Node ids
beyond nr_node_ids cannot be used.
>
> if (limit > 1)
> limit = 12;
> ac_ptr = kmalloc_node(memsize, gfp, node);
Use kzalloc to ensure zeroed memory.
> if (ac_ptr) {
> + memset(ac_ptr, 0, memsize);
> for_each_node(i) {
> - if (i == node || !node_online(i)) {
> - ac_ptr[i] = NULL;
> + if (i == node || !node_online(i))
> continue;
> - }
> ac_ptr[i] = alloc_arraycache(node, limit, 0xbaadf00d,
> gfp);
> if (!ac_ptr[i]) {
> for (i--; i >= 0; i--)
>
--
> ac_ptr = kmalloc_node(memsize, gfp, node);
> if (ac_ptr) {
> + memset(ac_ptr, 0, memsize);
Please use kzalloc_node here.
I'm not sure what's going on with nr_node_ids vs MAX_NUMNODES, so I think
we need to see an answer to Christoph's question before going forward
with this.
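To illustrate the zeroing request, here is a minimal userspace sketch, not the kernel code. It assumes calloc() as a stand-in for kzalloc_node(), and the names cache_model, alloc_alien_model() and free_alien_model() are hypothetical. Because the pointer table starts out zeroed, slots for the local node and for offline nodes need no explicit NULL store (this relies on NULL being all-bits-zero, as it is on the platforms Linux supports).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct array_cache. */
struct cache_model {
	int avail;
};

/*
 * Model of alloc_alien_cache() with a zeroing allocator: only slots for
 * online nodes other than 'self' are filled, all other slots stay NULL.
 */
static struct cache_model **alloc_alien_model(int self, int nr_ids,
					      const bool *online)
{
	struct cache_model **tbl;

	assert(self >= 0 && self < nr_ids);
	tbl = calloc((size_t)nr_ids, sizeof(*tbl));
	if (!tbl)
		return NULL;

	for (int i = 0; i < nr_ids; i++) {
		if (i == self || !online[i])
			continue;	/* slot is already NULL */
		tbl[i] = calloc(1, sizeof(**tbl));
		if (!tbl[i]) {
			/* Unwind the entries allocated so far. */
			for (i--; i >= 0; i--)
				free(tbl[i]);
			free(tbl);
			return NULL;
		}
	}
	return tbl;
}

/* Release every per-node entry, then the table itself. */
static void free_alien_model(struct cache_model **tbl, int nr_ids)
{
	if (!tbl)
		return;
	for (int i = 0; i < nr_ids; i++)
		free(tbl[i]);
	free(tbl);
}
```

With three ids, node 0 local and node 2 offline, only slot 1 ends up non-NULL.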
--
http://selenic.com : development and support for Mercurial and Linux
Thanks for the review. Node ids beyond nr_node_ids can be used in the case of
memory hot-add.
Let me explain:
First, nr_node_ids is originally 1 + the nid of the highest POSSIBLE node.
Second, consider hot-adding memory that sits on a newly added node:
1. When the ACPI event is triggered:
acpi_memory_device_add() -> acpi_memory_enable_device() -> add_memory() -> node_set_online()
node_states[N_ONLINE] is updated to include the new node,
and the id of this new node is beyond nr_node_ids.
2. Then, when the reap timer is scheduled:
cache_reap() -> next_reap_node() -> node = next_node(node, node_online_map)
the newly added node is selected as the reap node of that CPU, for example CPU X.
3. When the reap timer of CPU X fires again:
cache_reap() -> reap_alien()
it accesses the newly added node as the reap node of CPU X.
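The three steps above can be modeled in userspace. This is only a sketch under stated assumptions: model_online is a hypothetical stand-in for node_online_map, MODEL_MAX_NODES for MAX_NUMNODES, and model_next_reap_node() imitates the round-robin walk over online nodes rather than reproducing the kernel function.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 8	/* hypothetical stand-in for MAX_NUMNODES */

/* Hypothetical stand-in for node_online_map: true means the node is online. */
static bool model_online[MODEL_MAX_NODES];

/*
 * Model of the reap-node selection: pick the next online node after 'node',
 * wrapping around to the first online node at the end of the map.
 */
static int model_next_reap_node(int node)
{
	assert(node >= 0 && node < MODEL_MAX_NODES);
	for (int step = 1; step <= MODEL_MAX_NODES; step++) {
		int candidate = (node + step) % MODEL_MAX_NODES;

		if (model_online[candidate])
			return candidate;
	}
	return node;	/* no node online at all: stay where we are */
}

/* An alien array sized for nr_ids entries has valid indexes 0 .. nr_ids - 1. */
static bool model_index_is_valid(int node, int nr_ids)
{
	return node >= 0 && node < nr_ids;
}
```

With nodes 0 and 1 online, the walk alternates between them and every index fits an array of 2 entries. Once node 2 is marked online, the walk from node 1 yields 2, which is no longer a valid index for an array sized when nr_node_ids was 2.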
I caught this BUG in our memory hot-add testing, as shown below.
The test scenario: the machine originally has 2 nodes enabled,
and memory is then hot-added on the 3rd node.
The BUG is:
[ 141.667487] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[ 141.667782] IP: [<ffffffff810b8a64>] cache_reap+0x71/0x236
[ 141.667969] PGD 0
[ 141.668129] Oops: 0000 [#1] SMP
[ 141.668357] last sysfs file: /sys/class/scsi_host/host4/proc_name
[ 141.668469] CPU
[ 141.668630] Modules linked in: ipv6 autofs4 rfcomm l2cap crc16 bluetooth rfkill
binfmt_misc dm_mirror dm_region_hash dm_log dm_multipath dm_mod video output sbs sbshc fan
battery ac parport_pc lp parport joydev usbhid sr_mod cdrom thermal processor thermal_sys
container button rtc_cmos rtc_core rtc_lib i2c_i801 i2c_core pcspkr uhci_hcd ohci_hcd
ehci_hcd usbcore
[ 141.671659] Pid: 126, comm: events/27 Not tainted 2.6.32 #9 Server
[ 141.671771] RIP: 0010:[<ffffffff810b8a64>] [<ffffffff810b8a64>] cache_reap+0x71/0x236
[ 141.671981] RSP: 0018:ffff88027e81bdf0 EFLAGS: 00010206
[ 141.672089] RAX: 0000000000000002 RBX: 0000000000000078 RCX: ffff88047d86e580
[ 141.672204] RDX: ffff88047dfcbc00 RSI: ffff88047f13f6c0 RDI: ffff88047d9136c0
[ 141.672319] RBP: ffff88027e81be30 R08: 0000000000000001 R09: 0000000000000001
[ 141.672433] R10: 0000000000000000 R11: 0000000000000086 R12: ffff88047d87c200
[ 141.672548] R13: ffff88047d87d680 R14: ffffffff810b89f3 R15: 0000000000000002
[ 141.672663] FS: 0000000000000000(0000) GS:ffff88028b5a0000(0000) knlGS:0000000000000000
[ 141.672807] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 141.672917] CR2: 0000000000000078 CR3: 0000000001001000 CR4: 00000000000006e0
[ 141.673032] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 141.673147] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 141.673262] Process events/27 (pid: 126, threadinfo ffff88027e81a000, task
ffff88027f3ea040)
[ 141.673406] Stack:
[ 141.673503] ffff88027e81be30 ffff88028b5b05a0 0000000100000000 ffff88027e81be80
[ 141.673808] <0> ffff88028b5b5b40 ffff88028b5b05a0 ffffffff810b89f3 fffffffff00000c6
[ 141.674265] <0> ffff88027e81bec0 ffffffff81057394 ffffffff8105733e ffffffff81369f3a
[ 141.674813] Call Trace:
[ 141.674915] [<ffffffff810b89f3>] ? cache_reap+0x0/0x236
[ 141.675028] [<ffffffff81057394>] worker_thread+0x17a/0x27b
[ 141.675138] [<ffffffff8105733e>] ? worker_thread+0x124/0x27b
[ 141.675256] [<ffffffff81369f3a>] ? thread_return+0x3e/0xee
[ 141.675369] [<ffffffff8105a244>] ? autoremove_wake_function+0x0/0x38
[ 141.675482] [<ffffffff8105721a>] ? worker_thread+0x0/0x27b
[ 141.675593] [<ffffffff8105a146>] kthread+0x7d/0x87
[ 141.675707] [<ffffffff81012daa>] child_rip+0xa/0x20
[ 141.675817] [<ffffffff81012710>] ? restore_args+0x0/0x30
[ 141.675927] [<ffffffff8105a0c9>] ? kthread+0x0/0x87
[ 141.676035] [<ffffffff81012da0>] ? child_rip+0x0/0x20
[ 141.676142] Code: a4 c5 68 08 00 00 65 48 8b 04 25 00 e4 00 00 48 8b 04 18 49 8b 4c 24
78 48 85 c9 74 5b 41 89 c7 48 98 48 8b 1c c1 48 85 db 74 4d <83> 3b 00 74 48 48 83 3d ff
d4 65 00 00 75 04 0f 0b eb fe fa 66
[ 141.680610] RIP [<ffffffff810b8a64>] cache_reap+0x71/0x236
[ 141.680785] RSP <ffff88027e81bdf0>
[ 141.680886] CR2: 0000000000000078
[ 141.681016] ---[ end trace b1e17069ef81fe83 ]--
The corresponding assembly code is:
ffffffff810b8a3f: 65 48 8b 04 25 00 e4 mov %gs:0xe400,%rax
ffffffff810b8a46: 00 00
ffffffff810b8a48: 48 8b 04 18 mov (%rax,%rbx,1),%rax
ffffffff810b8a4c: 49 8b 4c 24 78 mov 0x78(%r12),%rcx
ffffffff810b8a51: 48 85 c9 test %rcx,%rcx
ffffffff810b8a54: 74 5b je ffffffff810b8ab1 <cache_reap+0xbe>
ffffffff810b8a56: 41 89 c7 mov %eax,%r15d
ffffffff810b8a59: 48 98 cltq
ffffffff810b8a5b: 48 8b 1c c1 mov (%rcx,%rax,8),%rbx
ffffffff810b8a5f: 48 85 db test %rbx,%rbx
ffffffff810b8a62: 74 4d je ffffffff810b8ab1 <cache_reap+0xbe>
ffffffff810b8a64: 83 3b 00 cmpl $0x0,(%rbx)
Here, 0xffffffff810b8a64 is the oops point, which corresponds to $KSRC/mm/slab.c:1035:
1025 /*
1026 * Called from cache_reap() to regularly drain alien caches round robin.
1027 */
1028 static void reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3)
1029 {
1030 int node = __get_cpu_var(reap_node);
1031
1032 if (l3->alien) {
1033 struct array_cache *ac = l3->alien[node];
1034
1035 if (ac && ac->avail && spin_trylock_irq(&ac->lock)) {
1036 __drain_alien_cache(cachep, ac, node);
1037 spin_unlock_irq(&ac->lock);
1038 }
1039 }
1040 }
RAX: 0000000000000002 -> node
RBX: 0000000000000078 -> ac
(%rbx) -> ac->avail
The value of ac is random and invalid, so dereferencing ac->avail causes the oops.
The reap node (the 3rd node) is the node newly added by memory hot-add. However, for an old kmem_list,
l3->alien has only 2 cache entries (nr_node_ids = 2), so l3->alien[2] is out of bounds.
Otherwise, reap_alien() might hit a BUG when it touches the newly added node.
V2: use kzalloc_node() to ensure zeroed memory.
CC: Pekka Enberg <pen...@cs.helsinki.fi>
Acked-by: Andi Kleen <a...@linux.intel.com>
Reviewed-by: Christoph Lameter <c...@linux-foundation.org>
Reviewed-by: Matt Mackall <m...@selenic.com>
Signed-off-by: Haicheng Li <haich...@linux.intel.com>
---
mm/slab.c | 8 +++-----
1 files changed, 3 insertions(+), 5 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 7dfa481..000e9ed 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -966,18 +966,16 @@ static void *alternate_node_alloc(struct kmem_cache *, gfp_t);
static struct array_cache **alloc_alien_cache(int node, int limit, gfp_t gfp)
{
struct array_cache **ac_ptr;
- int memsize = sizeof(void *) * nr_node_ids;
+ int memsize = sizeof(void *) * MAX_NUMNODES;
int i;
if (limit > 1)
limit = 12;
- ac_ptr = kmalloc_node(memsize, gfp, node);
+ ac_ptr = kzalloc_node(memsize, gfp, node);
if (ac_ptr) {
for_each_node(i) {
- if (i == node || !node_online(i)) {
- ac_ptr[i] = NULL;
+ if (i == node || !node_online(i))
continue;
- }
ac_ptr[i] = alloc_arraycache(node, limit, 0xbaadf00d, gfp);
if (!ac_ptr[i]) {
for (i--; i >= 0; i--)
--
1.5.3.8
Then this is a violation of the first statement:
nr_node_ids = 1 + nid of highest POSSIBLE node.
If your system allows hotplugging of new nodes, then POSSIBLE nodes should include them
at boot time.
The same holds for cpus and nr_cpu_ids. If a cpu is added, then its id MUST be < nr_cpu_ids.
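The invariant stated above can be written down as a small userspace sketch. It is a model, not the kernel's setup code: the 'possible' array is a hypothetical stand-in for node_possible_map, and model_nr_node_ids() is an illustrative name.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the invariant: the id count is one more than the highest node id
 * marked possible. 'possible' is a hypothetical stand-in for
 * node_possible_map.
 */
static int model_nr_node_ids(const bool *possible, int max_nodes)
{
	int highest = -1;

	assert(max_nodes > 0);
	for (int node = 0; node < max_nodes; node++)
		if (possible[node])
			highest = node;
	return highest + 1;
}
```

If firmware reports only the two populated nodes as possible, the result is 2 and a later hot-added node falls outside every array sized by it. If hot-pluggable nodes are reported as possible at boot, the result already covers them.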
Agreed, nr_node_ids must cover all possible nodes.
It should have been set up by the SRAT parser (modulo regressions)
Haicheng, did you verify with printk it's really incorrect at this point?
-Andi
Yup. See the debug patch and oops info below.
If the SRAT parser is supposed to detect every possible node (even a node whose
cpu+mem is not populated on the motherboard), then this is an ACPI parser issue or a
BIOS issue rather than a slab issue. In that case, I think this patch would only be a
workaround for a buggy system board, and we might need to look into the ACPI SRAT parser
code as well :).
---
diff --git a/mm/slab.c b/mm/slab.c
index 7dfa481..3a4e1f4 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1032,6 +1032,9 @@ static void reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3)
if (l3->alien) {
struct array_cache *ac = l3->alien[node];
+ if (node >= nr_node_ids)
+ printk("node=%d, nr_node_ids=%d, ac=%p\n",
+ node, nr_node_ids, ac);
if (ac && ac->avail && spin_trylock_irq(&ac->lock)) {
__drain_alien_cache(cachep, ac, node);
spin_unlock_irq(&ac->lock);
---
[ 151.732864] node=3, nr_node_ids=2, ac=(null)
[ 151.732873] node=3, nr_node_ids=2, ac=(null)
[ 151.732882] node=3, nr_node_ids=2, ac=(null)
[ 151.732889] node=3, nr_node_ids=2, ac=(null)
[ 151.732897] node=3, nr_node_ids=2, ac=000000004b31f78f
[ 151.732941] BUG: unable to handle kernel paging request at 000000004b31f78f
[ 151.741026] IP: [<ffffffff810bd460>] cache_reap+0x8d/0x252
[ 151.747363] PGD 0
[ 151.749793] Oops: 0000 [#1] SMP
[ 151.753658] last sysfs file: /sys/kernel/kexec_crash_loaded
[ 151.759990] CPU
[ 151.762509] Modules linked in: ipv6 autofs4 rfcomm l2cap crc16 bluetooth rfkill
binfmt_misc dm_mirror dm_region_hash dm_log dm_multipath dm_mod video output sbs sbshc fan
battery ac parport_pc lp parport joydev usbhid sr_mod cdrom processor thermal thermal_sys
container button rtc_cmos rtc_core rtc_lib i2c_i801 i2c_core pcspkr uhci_hcd ohci_hcd
ehci_hcd usbcore
[ 151.802035] Pid: 120, comm: events/21 Not tainted 2.6.32-haicheng-cpuhp #34 Server
[ 151.810911] RIP: 0010:[<ffffffff810bd460>] [<ffffffff810bd460>] cache_reap+0x8d/0x252
[ 151.815485] RSP: 0018:ffff88027e81ddf0 EFLAGS: 00010202
[ 151.815491] RAX: 000000000000003d RBX: 000000004b31f78f RCX: 0000000000000000
[ 151.815496] RDX: ffff88027f3f5040 RSI: 0000000000000001 RDI: 0000000000000286
[ 151.815503] RBP: ffff88027e81de30 R08: 0000000000000002 R09: ffffffff8105ee06
[ 151.815507] R10: ffff88027e81dbe0 R11: ffffffff81066722 R12: ffff88047f223080
[ 151.815513] R13: ffff88047dd201c0 R14: 0000000000000003 R15: fffffffff00000c6
[ 151.815518] FS: 0000000000000000(0000) GS:ffff88028b540000(0000) knlGS:0000000000000000
[ 151.815524] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 151.815528] CR2: 000000004b31f78f CR3: 0000000001001000 CR4: 00000000000006e0
[ 151.815533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 151.815538] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 151.815547] Process events/21 (pid: 120, threadinfo ffff88027e81c000, task
ffff88027f3f5040)
[ 151.815550] Stack:
[ 151.817896] ffff88027e81de30 ffff88028b5517c0 0000000100000000 ffff88027e81de80
[ 151.817908] <0> ffff88028b556d80 ffff88028b5517c0 ffffffff810bd3d3 fffffffff00000c6
[ 151.817918] <0> ffff88027e81dec0 ffffffff81058d0c ffffffff81058cb6 ffffffff81376fea
[ 151.817928] Call Trace:
[ 151.820772] [<ffffffff810bd3d3>] ? cache_reap+0x0/0x252
[ 151.820786] [<ffffffff81058d0c>] worker_thread+0x17a/0x27b
[ 151.820793] [<ffffffff81058cb6>] ? worker_thread+0x124/0x27b
[ 151.820806] [<ffffffff81376fea>] ? thread_return+0x3e/0xee
[ 151.820816] [<ffffffff8105bbbc>] ? autoremove_wake_function+0x0/0x38
[ 151.820827] [<ffffffff81058b92>] ? worker_thread+0x0/0x27b
[ 151.820833] [<ffffffff8105babe>] kthread+0x7d/0x87
[ 151.820848] [<ffffffff81012daa>] child_rip+0xa/0x20
[ 151.820857] [<ffffffff81012710>] ? restore_args+0x0/0x30
[ 151.820863] [<ffffffff8105ba41>] ? kthread+0x0/0x87
[ 151.820874] [<ffffffff81012da0>] ? child_rip+0x0/0x20
[ 151.820879] Code: 77 48 63 c6 41 89 f6 48 8b 1c c2 8b 15 be 28 6e 00 39 d6 7c 11 48 89
d9 48 c7 c7 97 98 4c 81 31 c0 e8 23 bf f8 ff 48 85 db 74 4d <83> 3b 00 74 48 48 83 3d 83
ab 66 00 00 75 04 0f 0b eb fe fa 66
[ 151.845235] RIP [<ffffffff810bd460>] cache_reap+0x8d/0x252
[ 151.845255] RSP <ffff88027e81ddf0>
[ 151.845260] CR2: 000000004b31f78f
[ 151.845415] ---[ end trace be6e21fde5d02b06 ]---
> > It should have been set up by the SRAT parser (modulo regressions)
> >
> > Haicheng, did you verify with printk it's really incorrect at this point?
>
> Yup. See the debug patch and oops info below.
>
> If the SRAT parser is supposed to detect every possible node (even a node
> whose cpu+mem is not populated on the motherboard), then this is an ACPI
> parser issue or a BIOS issue rather than a slab issue. In that case, I think
> this patch would only be a workaround for a buggy system board, and we might
> need to look into the ACPI SRAT parser code as well :).
Right. Let's fix the SRAT / ACPI issue. Code elsewhere also dimensions
arrays to nr_node_ids.
> @@ -966,18 +966,16 @@ static void *alternate_node_alloc(struct kmem_cache *,
> gfp_t);
> static struct array_cache **alloc_alien_cache(int node, int limit, gfp_t gfp)
> {
> struct array_cache **ac_ptr;
> - int memsize = sizeof(void *) * nr_node_ids;
> + int memsize = sizeof(void *) * MAX_NUMNODES;
> int i;
Remove this change and I will ack the patch.
CC: Pekka Enberg <pen...@cs.helsinki.fi>
CC: Eric Dumazet <eric.d...@gmail.com>
Acked-by: Andi Kleen <a...@linux.intel.com>
Acked-by: Christoph Lameter <c...@linux-foundation.org>
Reviewed-by: Matt Mackall <m...@selenic.com>
Signed-off-by: Haicheng Li <haich...@linux.intel.com>
---
mm/slab.c | 6 ++----
1 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 7dfa481..5d1a782 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -971,13 +971,11 @@ static struct array_cache **alloc_alien_cache(int node, int limit, gfp_t gfp)
if (limit > 1)
limit = 12;
- ac_ptr = kmalloc_node(memsize, gfp, node);
+ ac_ptr = kzalloc_node(memsize, gfp, node);
if (ac_ptr) {
for_each_node(i) {
- if (i == node || !node_online(i)) {
- ac_ptr[i] = NULL;
+ if (i == node || !node_online(i))
continue;
- }
ac_ptr[i] = alloc_arraycache(node, limit, 0xbaadf00d, gfp);
if (!ac_ptr[i]) {
for (i--; i >= 0; i--)
--
1.5.3.8
I can find a trace of Andi acking the previous version of this patch
but I don't see an ACK from Christoph nor a Reviewed-by from Matt. Was
I not CC'd on those emails or what's going on here?
Pekka
Pekka,
Christoph said he would ack this patch if I removed the MAX_NUMNODES change (see below),
so I added his Acked-by directly in this revised patch. I also got review
comments from Matt on v1 and changed the patch accordingly.
Is this a violation of the rules? If so, I'm sorry; I'm not quite clear on the rules.
Christoph Lameter wrote:
> On Wed, 23 Dec 2009, Haicheng Li wrote:
>
>> @@ -966,18 +966,16 @@ static void *alternate_node_alloc(struct kmem_cache *,
>> gfp_t);
>> static struct array_cache **alloc_alien_cache(int node, int limit, gfp_t gfp)
>> {
>> struct array_cache **ac_ptr;
>> - int memsize = sizeof(void *) * nr_node_ids;
>> + int memsize = sizeof(void *) * MAX_NUMNODES;
>> int i;
>
> Remove this change and I will ack the patch.
>
Haicheng Li wrote:
> Pekka Enberg wrote:
> > I can find a trace of Andi acking the previous version of this patch
> > but I don't see an ACK from Christoph nor a Reviewed-by from Matt. Was
> > I not CC'd on those emails or what's going on here?
> >
>
> Christoph said he would ack this patch if I removed the MAX_NUMNODES change
> (see below), so I added his Acked-by directly in this revised patch. I also
> got review comments from Matt on v1 and changed the patch accordingly.
>
> Is this a violation of the rules? If so, I'm sorry; I'm not quite clear on
> the rules.
See Section 14 of Documentation/SubmittingPatches. You should never add
tags unless they came from the said person. The ACK from Andi is fine,
the one from Christoph is borderline but OK and the one from Matt is
_not_ OK.
I can fix those up but I'll wait for an explicit ACK from Christoph first.
Pekka
Christoph? Matt?
Acked-by: Christoph Lameter <c...@linux-foundation.org>
Looks like a fine cleanup.
Acked-by: Matt Mackall <m...@selenic.com>