Re: [PATCH V6] x86: NX protection for kernel data

Siarhei Liakh

unread,

Jan 31, 2010, 6:20:02 PM1/31/10

to

On Fri, Oct 16, 2009 at 8:27 AM, Ingo Molnar <mi...@elte.hu> wrote:
>
[...]
> Well, if we are going to touch this area i'd like to see them addressed
> together.
>
> The 1MB thing would obviously be a default-off Kconfig option so in that
> sense it should not break anything by default.

Ingo,

I've got this 1MB thing fixed. I have also adjusted my previous
patches to be compatible with 2.6.33-rc5.
I will be re-posting this patch set to LKML shortly.
Let me know if there is anything else you would like me to adjust.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Siarhei Liakh

unread,

Jan 31, 2010, 6:30:02 PM1/31/10

to

This patch expands functionality of CONFIG_DEBUG_RODATA to set main
(static) kernel data area as NX.
The following steps are taken to achieve this:
1. Linker script is adjusted so .text always starts and ends on a page boundary
2. Linker script is adjusted so .rodata and .data always start and
end on a page boundary
3. void mark_nxdata_nx(void) added to arch/x86/mm/init.c with actual
functionality: NX is set for all pages from _etext through _end.
4. mark_nxdata_nx() called from free_initmem() (after init has been released)
5. free_init_pages() sets released memory NX in arch/x86/mm/init.c

The patch have been developed for Linux 2.6.31-rc7 x86 by Siarhei Liakh
<sliak...@gmail.com> and Xuxian Jiang <ji...@cs.ncsu.edu>.

V1: initial patch for 2.6.30
V2: patch for 2.6.31-rc7
V3: moved all code into arch/x86, adjusted credits
V4: fixed ifdef, removed credits from CREDITS
V5: fixed an address calculation bug in mark_nxdata_nx()
V6: updated for compatibility with 2.6.33-rc5
---

Signed-off-by: Siarhei Liakh <sliak...@gmail.com>
Signed-off-by: Xuxian Jiang <ji...@cs.ncsu.edu>
Acked-by: Arjan van de Ven <ar...@linux.intel.com>

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index f92a0da..2cb7369 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -69,7 +69,7 @@ jiffies_64 = jiffies;

PHDRS {
text PT_LOAD FLAGS(5); /* R_E */
- data PT_LOAD FLAGS(7); /* RWE */
+ data PT_LOAD FLAGS(6); /* RW_ */
#ifdef CONFIG_X86_64
user PT_LOAD FLAGS(5); /* R_E */
#ifdef CONFIG_SMP
@@ -108,6 +108,8 @@ SECTIONS
IRQENTRY_TEXT
*(.fixup)
*(.gnu.warning)
+ /* .text should occupy whole number of pages */
+ . = ALIGN(PAGE_SIZE);
/* End of text section */
_etext = .;
} :text = 0x9090
@@ -143,6 +145,8 @@ SECTIONS
/* rarely changed data like cpu maps */
READ_MOSTLY_DATA(INTERNODE_CACHE_BYTES)

+ /* .data should occupy whole number of pages */
+ . = ALIGN(PAGE_SIZE);
/* End of data section */
_edata = .;
} :data
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index d406c52..d613d0a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -356,9 +356,10 @@ void free_init_pages(char *what, unsigned long
begin, unsigned long end)
/*
* We just marked the kernel text read only above, now that
* we are going to free part of that, we need to make that
- * writeable first.
+ * writeable and non-executable first.
*/
set_memory_rw(begin, (end - begin) >> PAGE_SHIFT);
+ set_memory_nx(begin, (end - begin) >> PAGE_SHIFT);

printk(KERN_INFO "Freeing %s: %luk freed\n", what, (end - begin) >> 10);

@@ -373,11 +374,29 @@ void free_init_pages(char *what, unsigned long
begin, unsigned long end)
#endif
}

+void mark_nxdata_nx(void)
+{
+#ifdef CONFIG_DEBUG_RODATA
+ /*
+ * When this called, init has already been executed and released,
+ * so everything past _etext sould be NX.
+ */
+ unsigned long start = PAGE_ALIGN((unsigned long)(&_etext));
+ unsigned long size = PAGE_ALIGN((unsigned long)(&_end)) - start;
+
+ printk(KERN_INFO "NX-protecting the kernel data: %lx, %lu pages\n",
+ start, size >> PAGE_SHIFT);
+ set_memory_nx(start, size >> PAGE_SHIFT);
+#endif
+}
+
void free_initmem(void)
{
free_init_pages("unused kernel memory",
(unsigned long)(&__init_begin),
(unsigned long)(&__init_end));
+ /* Set kernel's data as NX */
+ mark_nxdata_nx();
}

#ifdef CONFIG_BLK_DEV_INITRD

James Morris

unread,

Feb 1, 2010, 4:50:02 PM2/1/10

to

On Sun, 31 Jan 2010, Siarhei Liakh wrote:

> V1: initial patch for 2.6.30
> V2: patch for 2.6.31-rc7
> V3: moved all code into arch/x86, adjusted credits
> V4: fixed ifdef, removed credits from CREDITS
> V5: fixed an address calculation bug in mark_nxdata_nx()
> V6: updated for compatibility with 2.6.33-rc5
> ---
>
> Signed-off-by: Siarhei Liakh <sliak...@gmail.com>
> Signed-off-by: Xuxian Jiang <ji...@cs.ncsu.edu>
> Acked-by: Arjan van de Ven <ar...@linux.intel.com>

Reviewed-by: James Morris <jmo...@namei.org>

--
James Morris
<jmo...@namei.org>

tip-bot for Siarhei Liakh

unread,

Feb 17, 2010, 3:00:03 PM2/17/10

to

Commit-ID: 01ab31371da90a795b774d87edf2c21bb3a64dda
Gitweb: http://git.kernel.org/tip/01ab31371da90a795b774d87edf2c21bb3a64dda
Author: Siarhei Liakh <sliak...@gmail.com>
AuthorDate: Sun, 31 Jan 2010 18:27:55 -0500
Committer: H. Peter Anvin <h...@zytor.com>
CommitDate: Wed, 17 Feb 2010 10:11:24 -0800

x86, mm: NX protection for kernel data

This patch expands functionality of CONFIG_DEBUG_RODATA to set main
(static) kernel data area as NX.

The following steps are taken to achieve this:
1. Linker script is adjusted so .text always starts and ends on a page boundary
2. Linker script is adjusted so .rodata and .data always start and
end on a page boundary
3. void mark_nxdata_nx(void) added to arch/x86/mm/init.c with actual
functionality: NX is set for all pages from _etext through _end.
4. mark_nxdata_nx() called from free_initmem() (after init has been released)
5. free_init_pages() sets released memory NX in arch/x86/mm/init.c

V1: initial patch for 2.6.30

V2: patch for 2.6.31-rc7
V3: moved all code into arch/x86, adjusted credits
V4: fixed ifdef, removed credits from CREDITS
V5: fixed an address calculation bug in mark_nxdata_nx()
V6: updated for compatibility with 2.6.33-rc5

Signed-off-by: Siarhei Liakh <sliak...@gmail.com>

Signed-off-by: Xuxian Jiang <ji...@cs.ncsu.edu>
Acked-by: Arjan van de Ven <ar...@linux.intel.com>

Reviewed-by: James Morris <jmo...@namei.org>
LKML-Reference: <817ecb6f1001311527w791...@mail.gmail.com>
Signed-off-by: H. Peter Anvin <h...@zytor.com>
---
arch/x86/kernel/vmlinux.lds.S | 6 +++++-
arch/x86/mm/init.c | 21 ++++++++++++++++++++-
2 files changed, 25 insertions(+), 2 deletions(-)

Ingo Molnar

unread,

Feb 22, 2010, 6:00:02 AM2/22/10

to

* tip-bot for Siarhei Liakh <sliak...@gmail.com> wrote:

> Commit-ID: 01ab31371da90a795b774d87edf2c21bb3a64dda
> Gitweb: http://git.kernel.org/tip/01ab31371da90a795b774d87edf2c21bb3a64dda
> Author: Siarhei Liakh <sliak...@gmail.com>
> AuthorDate: Sun, 31 Jan 2010 18:27:55 -0500
> Committer: H. Peter Anvin <h...@zytor.com>
> CommitDate: Wed, 17 Feb 2010 10:11:24 -0800
>
> x86, mm: NX protection for kernel data
>
> This patch expands functionality of CONFIG_DEBUG_RODATA to set main
> (static) kernel data area as NX.

-tip testing is seeing boot hangs along the lines of:

[ 15.568108] EXT3-fs (sda1): recovery complete
[ 15.573064] EXT3-fs (sda1): mounted filesystem with ordered data mode
[ 15.580313] VFS: Mounted root (ext3 filesystem) readonly on device 8:1.
[ 15.584021] async_waiting @ 1
[ 15.588008] async_continuing @ 1 after 0 usec
[ 15.592163] Freeing unused kernel memory: 540k freed
[ 15.600126] NX-protecting the kernel data: c15ab000, 2919 pages

which i suspect could be due to the commit above.

Config attached. Athlon64 testbox.

Ingo

config

Ingo Molnar

unread,

Feb 22, 2010, 6:10:02 AM2/22/10

to

Yep, that's confirmed now, applying these 3 reverts makes it boot fine:

833e0ca: Revert "x86, mm: NX protection for kernel data"
ce4b6b4: Revert "x86: RO/NX protection for loadable kernel modules"
e357312: Revert "module: fix () used as prototype in include/linux/module.h"

Ingo

Ingo Molnar

unread,

Feb 22, 2010, 12:30:02 PM2/22/10

to

* H. Peter Anvin <h...@zytor.com> wrote:

> On 02/22/2010 03:01 AM, Ingo Molnar wrote:
> >>
> >>> Commit-ID: 01ab31371da90a795b774d87edf2c21bb3a64dda
> >>> Gitweb: http://git.kernel.org/tip/01ab31371da90a795b774d87edf2c21bb3a64dda
> >>> Author: Siarhei Liakh <sliak...@gmail.com>
> >>> AuthorDate: Sun, 31 Jan 2010 18:27:55 -0500
> >>> Committer: H. Peter Anvin <h...@zytor.com>
> >>> CommitDate: Wed, 17 Feb 2010 10:11:24 -0800
> >>>
> >>> x86, mm: NX protection for kernel data
> >>>
> >>> This patch expands functionality of CONFIG_DEBUG_RODATA to set main
> >>> (static) kernel data area as NX.
> >>
> >> -tip testing is seeing boot hangs along the lines of:
> >>
> >> [ 15.568108] EXT3-fs (sda1): recovery complete
> >> [ 15.573064] EXT3-fs (sda1): mounted filesystem with ordered data mode
> >> [ 15.580313] VFS: Mounted root (ext3 filesystem) readonly on device 8:1.
> >> [ 15.584021] async_waiting @ 1
> >> [ 15.588008] async_continuing @ 1 after 0 usec
> >> [ 15.592163] Freeing unused kernel memory: 540k freed
> >> [ 15.600126] NX-protecting the kernel data: c15ab000, 2919 pages
> >>
> >> which i suspect could be due to the commit above.
> >
> > Yep, that's confirmed now, applying these 3 reverts makes it boot fine:
> >
> > 833e0ca: Revert "x86, mm: NX protection for kernel data"
> > ce4b6b4: Revert "x86: RO/NX protection for loadable kernel modules"
> > e357312: Revert "module: fix () used as prototype in include/linux/module.h"
> >
>

> Given that e357312 is a () -> (void) prototype fix, is hardly seems
> likely to be at fault. The RO/NX stuff, on the other hand, make a lot
> of sense.

Yes, i reverted e357312 because it was a dependent change.

Thanks,

H. Peter Anvin

unread,

Feb 22, 2010, 12:30:01 PM2/22/10

to

On 02/22/2010 03:01 AM, Ingo Molnar wrote:
>>
>>> Commit-ID: 01ab31371da90a795b774d87edf2c21bb3a64dda
>>> Gitweb: http://git.kernel.org/tip/01ab31371da90a795b774d87edf2c21bb3a64dda
>>> Author: Siarhei Liakh <sliak...@gmail.com>
>>> AuthorDate: Sun, 31 Jan 2010 18:27:55 -0500
>>> Committer: H. Peter Anvin <h...@zytor.com>
>>> CommitDate: Wed, 17 Feb 2010 10:11:24 -0800
>>>
>>> x86, mm: NX protection for kernel data
>>>
>>> This patch expands functionality of CONFIG_DEBUG_RODATA to set main
>>> (static) kernel data area as NX.
>>
>> -tip testing is seeing boot hangs along the lines of:
>>
>> [ 15.568108] EXT3-fs (sda1): recovery complete
>> [ 15.573064] EXT3-fs (sda1): mounted filesystem with ordered data mode
>> [ 15.580313] VFS: Mounted root (ext3 filesystem) readonly on device 8:1.
>> [ 15.584021] async_waiting @ 1
>> [ 15.588008] async_continuing @ 1 after 0 usec
>> [ 15.592163] Freeing unused kernel memory: 540k freed
>> [ 15.600126] NX-protecting the kernel data: c15ab000, 2919 pages
>>
>> which i suspect could be due to the commit above.
>
> Yep, that's confirmed now, applying these 3 reverts makes it boot fine:
>
> 833e0ca: Revert "x86, mm: NX protection for kernel data"
> ce4b6b4: Revert "x86: RO/NX protection for loadable kernel modules"
> e357312: Revert "module: fix () used as prototype in include/linux/module.h"
>

Given that e357312 is a () -> (void) prototype fix, is hardly seems

likely to be at fault. The RO/NX stuff, on the other hand, make a lot
of sense.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

castet....@free.fr

unread,

Mar 2, 2010, 11:30:01 AM3/2/10

to

> At this point I need some help and guidance on how to track down what
> exactly happens there, as I am not very familiar with what goes into
> .data and why are we trying to execute it.
Can't you add debug printk in the fault handler before any exception processing

Something like that.

Matthieu

test.diff

Siarhei Liakh

unread,

Mar 2, 2010, 1:00:02 PM3/2/10

to

That does not really give any additional information. The message does
not show up in the output and the stack trace says that we are somehow
ended up in doublefault_fn.

any other ideas?

I really appreciate your help.
===========
...
[ 17.652000] BUG: unable to handle kernel NULL pointer dereference at 00000014
[ 17.652000] IP: [<c102e722>] vprintk+0x12/0x398
[ 17.652000] *pdpt = 00000000018e7001 *pde = 0000000000000000
[ 17.652000] Oops: 0000 [#1] SMP
[ 17.652000] last sysfs file:
[ 17.652000] Modules linked in:
[ 17.652000]
[ 17.652000] Pid: 314, comm: rcu_torture_rea Not tainted 2.6.33-tip+ #15 /
[ 17.652000] EIP: 0060:[<c102e722>] EFLAGS: 00004082 CPU: 0
[ 17.652000] EIP is at vprintk+0x12/0x398
[ 17.652000] EAX: c171dc07 EBX: c2802000 ECX: 00000000 EDX: 00000000
[ 17.652000] ESI: c18f1b90 EDI: 00000000 EBP: c18f1b74 ESP: c18f1b10
[ 17.652000] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 17.652000] Process rcu_torture_rea (pid: 314, ti=c18f1000
task=f797c000 task.ti=f791a000)
[ 17.652000] Stack:
[ 17.652000] 00000000 00000000 c171dc07 00000000 00000000 00000000
00000000 00000000
[ 17.652000] <0> 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[ 17.652000] <0> 00000000 00000000 00000000 00000000 00000000
00000000 c2802000 00000000
[ 17.652000] Call Trace:
[ 17.652000] [<c15babb1>] ? printk+0xf/0x16
[ 17.652000] [<c101458b>] ? doublefault_fn+0x2b/0xd8
[ 17.652000] Code: 76 80 c1 e8 4e 8d 01 00 c7 05 f0 34 8f c1 00 00
00 00 e8 66 fc ff ff 5d c3 55 89 e5 57 56 53 83 ec 58 8b 45 08 89 45
a4 8b 75 0c <65> 8b 15 14 00 00 00 89 55 f0 31 d2 a1 a0 75 80 c1 89 45
a8 8b
[ 17.652000] EIP: [<c102e722>] vprintk+0x12/0x398 SS:ESP 0068:c18f1b10
[ 17.652000] CR2: 0000000000000014
[ 17.652000] ---[ end trace 6164e4a9acb59023 ]---
[ 17.656000] BUG: spinlock lockup on CPU#0, rcu_torture_rea/314, c2809800
[ 17.656000] Pid: 314, comm: rcu_torture_rea Tainted: G D
2.6.33-tip+ #15
[ 17.656000] Call Trace:
[ 17.656000] [<c15babb1>] ? printk+0xf/0x16
[ 17.656000] [<c12758a5>] do_raw_spin_lock+0xfb/0x126
[ 17.656000] [<c15bd2d3>] _raw_spin_lock+0x22/0x2a
[ 17.656000] [<c102b5c2>] ? scheduler_tick+0x33/0x233
[ 17.656000] [<c102b5c2>] scheduler_tick+0x33/0x233
[ 17.656000] [<c1032e17>] ? raise_softirq+0x43/0x50
[ 17.656000] [<c1039a5b>] update_process_times+0x3c/0x48
[ 17.656000] [<c104e92f>] tick_periodic+0x66/0x72
[ 17.656000] [<c104e954>] tick_handle_periodic+0x19/0x71
[ 17.656000] [<c1010c19>] smp_apic_timer_interrupt+0x6a/0x7d
[ 17.656000] [<c15bdee6>] apic_timer_interrupt+0x36/0x40
[ 17.656000] [<c1060331>] ? acct_collect+0x12e/0x134
[ 17.656000] [<c15bd72e>] ? _raw_spin_unlock_irq+0x22/0x26
[ 17.656000] [<c15bd730>] ? _raw_spin_unlock_irq+0x24/0x26
[ 17.656000] [<c1060331>] acct_collect+0x12e/0x134
[ 17.656000] [<c1031000>] do_exit+0x187/0x625
[ 17.656000] [<c102eba7>] ? kmsg_dump+0xff/0x113
[ 17.656000] [<c102ddf5>] ? oops_exit+0x2a/0x2f
[ 17.656000] [<c15beaaa>] oops_end+0x92/0x9a
[ 17.656000] [<c1016c90>] no_context+0x15f/0x169
[ 17.656000] [<c1016dec>] __bad_area_nosemaphore+0x152/0x15a
[ 17.656000] [<c1016e01>] bad_area_nosemaphore+0xd/0x10
[ 17.656000] [<c15c053d>] do_page_fault+0x199/0x30a
[ 17.656000] [<c15c03a4>] ? do_page_fault+0x0/0x30a
[ 17.656000] [<c15be108>] error_code+0x78/0x80
[ 17.656000] [<c102e722>] ? vprintk+0x12/0x398
[ 17.656000] [<c15babb1>] printk+0xf/0x16
[ 17.656000] [<c101458b>] doublefault_fn+0x2b/0xd8
[ 17.656000] sending NMI to all CPUs:
[ 17.656000] NMI backtrace for cpu 0
[ 17.656000] Modules linked in:
[ 17.656000]
[ 17.656000] Pid: 314, comm: rcu_torture_rea Tainted: G D
2.6.33-tip+ #15 /
[ 17.656000] EIP: 0060:[<c10112b6>] EFLAGS: 00000046 CPU: 0
[ 17.656000] EIP is at default_send_IPI_mask_logical+0xc3/0xdb
[ 17.656000] EAX: ffffb300 EBX: 01000000 ECX: c1011239 EDX: 00000c00
[ 17.656000] ESI: 00000002 EDI: 00000002 EBP: c18f1848 ESP: c18f1838
[ 17.656000] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 17.656000] Process rcu_torture_rea (pid: 314, ti=c18f1000
task=f797c000 task.ti=f791a000)
[ 17.656000] Stack:
[ 17.656000] 00000800 08453000 00000000 c2809800 c18f1854 c101106a
08453000 c18f1864
[ 17.656000] <0> c1011739 c171c7e2 08453000 c18f189c c12758aa
c175c91e 00000000 f797c318
[ 17.656000] <0> 0000013a c2809800 f797c318 f797c000 08453000
00000001 c2809800 c2809800
[ 17.656000] Call Trace:
[ 17.656000] [<c101106a>] ? default_send_IPI_all+0x22/0x62
[ 17.656000] [<c1011739>] ? arch_trigger_all_cpu_backtrace+0x2b/0x4f
[ 17.656000] [<c12758aa>] ? do_raw_spin_lock+0x100/0x126
[ 17.656000] [<c15bd2d3>] ? _raw_spin_lock+0x22/0x2a
[ 17.656000] [<c102b5c2>] ? scheduler_tick+0x33/0x233
[ 17.656000] [<c102b5c2>] ? scheduler_tick+0x33/0x233
[ 17.656000] [<c1032e17>] ? raise_softirq+0x43/0x50
[ 17.656000] [<c1039a5b>] ? update_process_times+0x3c/0x48
[ 17.656000] [<c104e92f>] ? tick_periodic+0x66/0x72
[ 17.656000] [<c104e954>] ? tick_handle_periodic+0x19/0x71
[ 17.656000] [<c1010c19>] ? smp_apic_timer_interrupt+0x6a/0x7d
[ 17.656000] [<c15bdee6>] ? apic_timer_interrupt+0x36/0x40
[ 17.656000] [<c1060331>] ? acct_collect+0x12e/0x134
[ 17.656000] [<c15bd72e>] ? _raw_spin_unlock_irq+0x22/0x26
[ 17.656000] [<c15bd730>] ? _raw_spin_unlock_irq+0x24/0x26
[ 17.656000] [<c1060331>] ? acct_collect+0x12e/0x134
[ 17.656000] [<c1031000>] ? do_exit+0x187/0x625
[ 17.656000] [<c102eba7>] ? kmsg_dump+0xff/0x113
[ 17.656000] [<c102ddf5>] ? oops_exit+0x2a/0x2f
[ 17.656000] [<c15beaaa>] ? oops_end+0x92/0x9a
[ 17.656000] [<c1016c90>] ? no_context+0x15f/0x169
[ 17.656000] [<c1016dec>] ? __bad_area_nosemaphore+0x152/0x15a
[ 17.656000] [<c1016e01>] ? bad_area_nosemaphore+0xd/0x10
[ 17.656000] [<c15c053d>] ? do_page_fault+0x199/0x30a
[ 17.656000] [<c15c03a4>] ? do_page_fault+0x0/0x30a
[ 17.656000] [<c15be108>] ? error_code+0x78/0x80
[ 17.656000] [<c102e722>] ? vprintk+0x12/0x398
[ 17.656000] [<c15babb1>] ? printk+0xf/0x16
[ 17.656000] [<c101458b>] ? doublefault_fn+0x2b/0xd8
[ 17.656000] Code: 00 89 da 89 10 83 fe 02 74 07 8b 55 f0 09 f2 eb
06 8b 55 f0 80 ce 04 a1 2c 6c 80 c1 2d 00 3d 00 00 89 10 f7 c7 00 02
00 00 75 09 <57> 9d e8 0c 0c 04 00 eb 07 e8 9e 1a 04 00 57 9d 8d 65 f4
5b 5e
[ 17.656000] Call Trace:
[ 17.656000] [<c101106a>] default_send_IPI_all+0x22/0x62
[ 17.656000] [<c1011739>] arch_trigger_all_cpu_backtrace+0x2b/0x4f
[ 17.656000] [<c12758aa>] do_raw_spin_lock+0x100/0x126
[ 17.656000] [<c15bd2d3>] _raw_spin_lock+0x22/0x2a
[ 17.656000] [<c102b5c2>] ? scheduler_tick+0x33/0x233
[ 17.656000] [<c102b5c2>] scheduler_tick+0x33/0x233
[ 17.656000] [<c1032e17>] ? raise_softirq+0x43/0x50
[ 17.656000] [<c1039a5b>] update_process_times+0x3c/0x48
[ 17.656000] [<c104e92f>] tick_periodic+0x66/0x72
[ 17.656000] [<c104e954>] tick_handle_periodic+0x19/0x71
[ 17.656000] [<c1010c19>] smp_apic_timer_interrupt+0x6a/0x7d
[ 17.656000] [<c15bdee6>] apic_timer_interrupt+0x36/0x40
[ 17.656000] [<c1060331>] ? acct_collect+0x12e/0x134
[ 17.656000] [<c15bd72e>] ? _raw_spin_unlock_irq+0x22/0x26
[ 17.656000] [<c15bd730>] ? _raw_spin_unlock_irq+0x24/0x26
[ 17.656000] [<c1060331>] acct_collect+0x12e/0x134
[ 17.656000] [<c1031000>] do_exit+0x187/0x625
[ 17.656000] [<c102eba7>] ? kmsg_dump+0xff/0x113
[ 17.656000] [<c102ddf5>] ? oops_exit+0x2a/0x2f
[ 17.656000] [<c15beaaa>] oops_end+0x92/0x9a
[ 17.656000] [<c1016c90>] no_context+0x15f/0x169
[ 17.656000] [<c1016dec>] __bad_area_nosemaphore+0x152/0x15a
[ 17.656000] [<c1016e01>] bad_area_nosemaphore+0xd/0x10
[ 17.656000] [<c15c053d>] do_page_fault+0x199/0x30a
[ 17.656000] [<c15c03a4>] ? do_page_fault+0x0/0x30a
[ 17.656000] [<c15be108>] error_code+0x78/0x80
[ 17.656000] [<c102e722>] ? vprintk+0x12/0x398
[ 17.656000] [<c15babb1>] printk+0xf/0x16
[ 17.656000] [<c101458b>] doublefault_fn+0x2b/0xd8
[ 17.656000] Pid: 314, comm: rcu_torture_rea Tainted: G D
2.6.33-tip+ #15
[ 17.656000] Call Trace:
[ 17.656000] [<c10076eb>] ? show_regs+0x1a/0x20
[ 17.656000] [<c15bf177>] nmi_watchdog_tick+0xa3/0x181
[ 17.656000] [<c15be62c>] do_nmi+0xc6/0x2d1
[ 17.656000] [<c15be1d0>] nmi_stack_correct+0x2f/0x34
[ 17.656000] [<c1011239>] ? default_send_IPI_mask_logical+0x46/0xdb
[ 17.656000] [<c10112b6>] ? default_send_IPI_mask_logical+0xc3/0xdb
[ 17.656000] [<c101106a>] default_send_IPI_all+0x22/0x62
[ 17.656000] [<c1011739>] arch_trigger_all_cpu_backtrace+0x2b/0x4f
[ 17.656000] [<c12758aa>] do_raw_spin_lock+0x100/0x126
[ 17.656000] [<c15bd2d3>] _raw_spin_lock+0x22/0x2a
[ 17.656000] [<c102b5c2>] ? scheduler_tick+0x33/0x233
[ 17.656000] [<c102b5c2>] scheduler_tick+0x33/0x233
[ 17.656000] [<c1032e17>] ? raise_softirq+0x43/0x50
[ 17.656000] [<c1039a5b>] update_process_times+0x3c/0x48
[ 17.656000] [<c104e92f>] tick_periodic+0x66/0x72
[ 17.656000] [<c104e954>] tick_handle_periodic+0x19/0x71
[ 17.656000] [<c1010c19>] smp_apic_timer_interrupt+0x6a/0x7d
[ 17.656000] [<c15bdee6>] apic_timer_interrupt+0x36/0x40
[ 17.656000] [<c1060331>] ? acct_collect+0x12e/0x134
[ 17.656000] [<c15bd72e>] ? _raw_spin_unlock_irq+0x22/0x26
[ 17.656000] [<c15bd730>] ? _raw_spin_unlock_irq+0x24/0x26
[ 17.656000] [<c1060331>] acct_collect+0x12e/0x134
[ 17.656000] [<c1031000>] do_exit+0x187/0x625
[ 17.656000] [<c102eba7>] ? kmsg_dump+0xff/0x113
[ 17.656000] [<c102ddf5>] ? oops_exit+0x2a/0x2f
[ 17.656000] [<c15beaaa>] oops_end+0x92/0x9a
[ 17.656000] [<c1016c90>] no_context+0x15f/0x169
[ 17.656000] [<c1016dec>] __bad_area_nosemaphore+0x152/0x15a
[ 17.656000] [<c1016e01>] bad_area_nosemaphore+0xd/0x10
[ 17.656000] [<c15c053d>] do_page_fault+0x199/0x30a
[ 17.656000] [<c15c03a4>] ? do_page_fault+0x0/0x30a
[ 17.656000] [<c15be108>] error_code+0x78/0x80
[ 17.656000] [<c102e722>] ? vprintk+0x12/0x398
[ 17.656000] [<c15babb1>] printk+0xf/0x16
[ 17.656000] [<c101458b>] doublefault_fn+0x2b/0xd8

Siarhei Liakh

unread,

Mar 2, 2010, 1:10:02 PM3/2/10

to

On Tue, Mar 2, 2010 at 12:51 PM, Siarhei Liakh <sliak...@gmail.com> wrote:
>>> At this point I need some help and guidance on how to track down what
>>> exactly happens there, as I am not very familiar with what goes into
>>> .data and why are we trying to execute it.
>> Can't you add debug printk in the fault handler before any exception processing
>>
>> Something like that.
>
> That does not really give any additional information. The message does
> not show up in the output and the stack trace says that we are somehow
> ended up in doublefault_fn.

Forgot to add one more thing: it looks like fault on instruction
prefetch is already taken care of in fault handlers and should
generate an appropriate message. The way I interpret the fact that
doublefault_fn is the one that prints out a message, is that main
fault handler is somehow dependent on .data being executable.... (but
this is pure speculation on my part at this point)...

Ingo Molnar

unread,

Mar 10, 2010, 8:40:02 AM3/10/10

to

* Siarhei Liakh <sliak...@gmail.com> wrote:

> Any suggestions on how should I proceed forward in troubleshooting this issue?

Can you reproduce it in KVM?

If yes then that might give you a more debuggable state than a crashed native
system.

Siarhei Liakh

unread,

Mar 10, 2010, 10:10:02 AM3/10/10

to

>> Any suggestions on how should I proceed forward in troubleshooting this issue?
>
> Can you reproduce it in KVM?

Yes, it crashes in KVM.

> If yes then that might give you a more debuggable state than a crashed native
> system.

Running kernel in KVM under kgdb with a watchpoint set for
($eip>=_etext)&&($eip<=_edata) still produces a dump but without ever
triggering the watchpoint.
I was hoping that some Kernel Locking Guru would say "take a look at
__xxx_yyy_zzz(), the things it does always looked suspicious to me."
:)

But joking aside, I would really appreciate ANY advice on debugging
this problem, as my best idea at this point is to pretty much
single-step through the whole thing... And since I am not familiar
with lock debugging at all, this will take a long time to figure out
on my own.

Thank you, guys.

Siarhei Liakh

unread,

Mar 11, 2010, 10:20:01 PM3/11/10

to

On Sat, Mar 6, 2010 at 2:44 PM, Siarhei Liakh <sliak...@gmail.com> wrote:

> On Mon, Feb 22, 2010 at 12:21 PM, Ingo Molnar <mi...@elte.hu> wrote:
>>
>> * H. Peter Anvin <h...@zytor.com> wrote:
>>
>>> On 02/22/2010 03:01 AM, Ingo Molnar wrote:
>>> >>
>>> >>> Commit-ID: 01ab31371da90a795b774d87edf2c21bb3a64dda
>>> >>> Gitweb: http://git.kernel.org/tip/01ab31371da90a795b774d87edf2c21bb3a64dda

[ . . . ]
> I was able to narrow down the issue to spinlock debugging. More
> specifically, DEBUG_SPINLOCK=y seem to be somehow incompatible with
> kernel's RW-data being NX.
[ . . . ]
> Kernel crash dump:
> ============================================
> [ 2.844000] EXT3-fs (sda1): warning: maximal mount count reached,
> running e2fsck is recommended
> [ 2.848000] EXT3-fs (sda1): using internal journal
> [ 2.849556] EXT3-fs (sda1): recovery complete
> [ 2.852000] EXT3-fs (sda1): mounted filesystem with ordered data mode
> [ 2.854168] VFS: Mounted root (ext3 filesystem) on device 8:1.
> [ 2.856000] Freeing unused kernel memory (init): 540k freed
> [ 2.857056] NX-protecting the kernel data: 0xc15b3000 - 0xc1834000, 641 pages
> [ 2.860328] do_page_fault - entry
> [ 2.862554] do_page_fault: 0xc17ebdb8
> [ 2.864000] do_page_fault - kernel space
> [ 2.864000] do_page_fault - about to call bad_area_nosemaphore()
> [ 2.864000] BUG: unable to handle kernel paging request at c17ebdb8
> [ 2.864000] IP: [<c12609f7>] do_raw_spin_unlock+0x5e/0x71
> [ 2.864000] *pdpt = 00000000018c0001 *pde = 80000000016001e1
> [ 2.864000] Oops: 0003 [#1] SMP
> [ 2.864000] last sysfs file:
> [ 2.864000] Modules linked in:
> [ 2.864000]
> [ 2.864000] Pid: 1, comm: swapper Not tainted 2.6.33-tip+ #41 /
> [ 2.864000] EIP: 0060:[<c12609f7>] EFLAGS: 00010046 CPU: 0
> [ 2.864000] EIP is at do_raw_spin_unlock+0x5e/0x71
> [ 2.864000] EAX: 00000000 EBX: c17ebdac ECX: 00000001 EDX: 00000c0b
> [ 2.864000] ESI: 00000246 EDI: c18c0058 EBP: f780fe14 ESP: f780fe10
> [ 2.864000] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> [ 2.864000] Process swapper (pid: 1, ti=f780f000 task=f7826000
> task.ti=f780f000)
> [ 2.864000] Stack:
> [ 2.864000] c17ebdac f780fe24 c15ad3f2 00000000 00000000 f780ff18
> c1017a57 00000000
> [ 2.864000] <0> 016001e3 00000000 016001e3 f77a8004 00000001
> 00000000 00000163 80000000
> [ 2.864000] <0> 00000000 ffffffff ffffffff 80000000 000001e1
> 80000000 00000000 80000000
> [ 2.864000] Call Trace:
> [ 2.864000] [<c15ad3f2>] ? _raw_spin_unlock_irqrestore+0x20/0x3c
> [ 2.864000] [<c1017a57>] ? __change_page_attr_set_clr+0x65c/0x945
> [ 2.864000] [<c1092245>] ? vm_unmap_aliases+0x17b/0x186
> [ 2.864000] [<c15b3000>] ? _etext+0x0/0x24
> [ 2.864000] [<c1017eb4>] ? change_page_attr_set_clr+0x174/0x312
> [ 2.864000] [<c15b3000>] ? _etext+0x0/0x24
> [ 2.864000] [<c10182d1>] ? set_memory_nx+0x2d/0x32
> [ 2.864000] [<c10163ab>] ? mark_nxdata_nx+0x37/0x41
> [ 2.864000] [<c15b3000>] ? _etext+0x0/0x24
> [ 2.864000] [<c1834000>] ? i386_start_kernel+0x0/0xaa
> [ 2.864000] [<c101649d>] ? free_initmem+0x1c/0x1e
> [ 2.864000] [<c1001148>] ? init_post+0xd/0x121
> [ 2.864000] [<c1834401>] ? kernel_init+0x1d5/0x1df
> [ 2.864000] [<c183422c>] ? kernel_init+0x0/0x1df
> [ 2.864000] [<c1002e66>] ? kernel_thread_helper+0x6/0x10
> [ 2.864000] Code: 54 8b c1 39 43 0c 74 0c ba 74 e1 73 c1 89 d8 e8
> 31 ff ff ff 64 a1 d8 6b 8b c1 39 43 08 74 0c ba 80 e1 73 c1 89 d8 e8
> 1a ff ff ff <c7> 43 0c ff ff ff ff c7 43 08 ff ff ff ff fe 03 5b 5d c3
> 55 89
> [ 2.864000] EIP: [<c12609f7>] do_raw_spin_unlock+0x5e/0x71 SS:ESP
> 0068:f780fe10
> [ 2.864000] CR2: 00000000c17ebdb8
> [ 2.864000] ---[ end trace 0d94f53e9dfe82f9 ]---
> [ 2.948071] swapper used greatest stack depth: 1804 bytes left
> [ 2.952000] Kernel panic - not syncing: Attempted to kill init!
> ============================================
>
> looking for c17ebdb8 in system.map points to a location in pgd_lock:
> ============================================
> $grep c17ebd System.map
> c17ebd68 d bios_check_work
> c17ebda8 d highmem_pages
> c17ebdac D pgd_lock
> c17ebdc8 D pgd_list
> c17ebdd0 D show_unhandled_signals
> c17ebdd4 d cpa_lock
> c17ebdf0 d memtype_lock
> ============================================
>
> I've looked at the lock debugging and could not find any place that
> would look like an attempt to execute data. This would lead me to
> think that calling set_memory_nx from kernel_init somehow confuses the
> lock debugging subsystem, or set_memory_nx does not change page
> attributes in a safe manner (for example when a lock is stored inside
> the page whose attributes are being changed).

I've done some extra debugging and it really does look like the crash
happens when we are setting NX on a large page which has pgd_lock
inside it.

Here is a trace of printk's that I added to troubleshoot this issue:
=========================
[ 3.072003] try_preserve_large_page - enter
[ 3.073185] try_preserve_large_page - address: 0xc1600000
[ 3.074513] try_preserve_large_page - 2M page
[ 3.075606] try_preserve_large_page - about to call static_protections
[ 3.076000] try_preserve_large_page - back from static_protections
[ 3.076000] try_preserve_large_page - past loop
[ 3.076000] try_preserve_large_page - new_prot != old_prot
[ 3.076000] try_preserve_large_page - the address is aligned and
the number of pages covers the full range
[ 3.076000] try_preserve_large_page - about to call __set_pmd_pte
[ 3.076000] __set_pmd_pte - enter
[ 3.076000] __set_pmd_pte - address: 0xc1600000
[ 3.076000] __set_pmd_pte - about to call
set_pte_atomic(*0xc18c0058(low=0x16001e3, high=0x0), (low=0x16001e1,
high=0x80000000))
[lock-up here]
=========================

matthieu castet

unread,

Mar 13, 2010, 7:20:02 AM3/13/10

to

Hi,

This may be stupid but :

0xc1600000 2MB page is in 0xc1600000-0xc1800000 range. pgd_lock (0xc17ebdac) seems to be in that range.

You change attribute from (low=0x16001e3, high=0x0) to (low=0x16001e1, high=0x80000000). IE you set
NX bit (bit 63), but you also clear R/W bit (bit 2). So the page become read only, but you are using a lock
inside this page that need RW access. So you got a page fault.

Now I don't know what should be done.
Is that normal we set the page RO ?

Matthieu

Siarhei Liakh

unread,

Mar 15, 2010, 2:30:02 PM3/15/10

to

[...]

> 0xc1600000 2MB page is in 0xc1600000-0xc1800000 range. pgd_lock
> (0xc17ebdac) seems to be in that range.

That's what I was thinking...

> You change attribute from (low=0x16001e3, high=0x0) to (low=0x16001e1,
> high=0x80000000). IE you set
> NX bit (bit 63), but you also clear R/W bit (bit 2). So the page become read
> only, but you are using a lock
> inside this page that need RW access. So you got a page fault.

Yes, that would do it.

> Now I don't know what should be done.
> Is that normal we set the page RO ?

No, this page should not be RO, as it contains kernel's RW data.
The interesting part is that the call that initiates the change is
set_memory_nx(), so it should not be clearing RW bit... The
interesting part is that the kernel does not crash with lock debugging
disabled.

Thanks for your help.

Siarhei Liakh

unread,

Mar 15, 2010, 5:50:01 PM3/15/10

to

On Mon, Mar 15, 2010 at 2:20 PM, Siarhei Liakh <sliak...@gmail.com> wrote:
> On Sat, Mar 13, 2010 at 8:12 AM, matthieu castet
> <castet....@free.fr> wrote:
>> Hi,
>>
>>> > looking for c17ebdb8 in system.map points to a location in pgd_lock:
>>> > ============================================
>>> > $grep c17ebd System.map
>>> > c17ebd68 d bios_check_work
>>> > c17ebda8 d highmem_pages
>>> > c17ebdac D pgd_lock
>>> > c17ebdc8 D pgd_list
>>> > c17ebdd0 D show_unhandled_signals
>>> > c17ebdd4 d cpa_lock
>>> > c17ebdf0 d memtype_lock
>>> > ============================================

[ . . . ]

>>> Here is a trace of printk's that I added to troubleshoot this issue:
>>> =========================
>>> [ 3.072003] try_preserve_large_page - enter
>>> [ 3.073185] try_preserve_large_page - address: 0xc1600000
>>> [ 3.074513] try_preserve_large_page - 2M page
>>> [ 3.075606] try_preserve_large_page - about to call static_protections
>>> [ 3.076000] try_preserve_large_page - back from static_protections
>>> [ 3.076000] try_preserve_large_page - past loop
>>> [ 3.076000] try_preserve_large_page - new_prot != old_prot
>>> [ 3.076000] try_preserve_large_page - the address is aligned and
>>> the number of pages covers the full range
>>> [ 3.076000] try_preserve_large_page - about to call __set_pmd_pte
>>> [ 3.076000] __set_pmd_pte - enter
>>> [ 3.076000] __set_pmd_pte - address: 0xc1600000
>>> [ 3.076000] __set_pmd_pte - about to call
>>> set_pte_atomic(*0xc18c0058(low=0x16001e3, high=0x0), (low=0x16001e1,
>>> high=0x80000000))
>>> [lock-up here]
>>> =========================
>>>
[...]
>> 0xc1600000 2MB page is in 0xc1600000-0xc1800000 range. pgd_lock
>> (0xc17ebdac) seems to be in that range.

[ . . . ]

>> You change attribute from (low=0x16001e3, high=0x0) to (low=0x16001e1,
>> high=0x80000000). IE you set
>> NX bit (bit 63), but you also clear R/W bit (bit 2). So the page become read
>> only, but you are using a lock
>> inside this page that need RW access. So you got a page fault.

[ . . . ]

>> Now I don't know what should be done.
>> Is that normal we set the page RO ?
>
> No, this page should not be RO, as it contains kernel's RW data.
> The interesting part is that the call that initiates the change is
> set_memory_nx(), so it should not be clearing RW bit... The
> interesting part is that the kernel does not crash with lock debugging
> disabled.

Turns out that address is indeed within .rodata range, so
static_protections() flips RW bit to 0:

[ 0.000000] Memory: 889320k/914776k available (5836k kernel code,
25064k reserved, 2564k data, 540k init, 0k highmem)
[ 0.000000] virtual kernel memory layout:
[ 0.000000] fixmap : 0xffd58000 - 0xfffff000 (2716 kB)
[ 0.000000] vmalloc : 0xf8556000 - 0xffd56000 ( 120 MB)
[ 0.000000] lowmem : 0xc0000000 - 0xf7d56000 ( 893 MB)
[ 0.000000] .init : 0xc1834000 - 0xc18bb000 ( 540 kB)
[ 0.000000] .data : 0xc15b3000 - 0xc1834000 (2564 kB)
[ 0.000000] .rodata : 0xc15b4000 - 0xc17e3000 (2236 kB)
[ 0.000000] .text : 0xc1000000 - 0xc15b3000 (5836 kB)
[ 0.000000] pgd_lock address: 0xc17ebdac
[...]
[ 3.496969] try_preserve_large_page - enter
[ 3.500004] try_preserve_large_page - address: 0xc1600000
[ 3.501730] try_preserve_large_page - 2M page
[ 3.503100] try_preserve_large_page - NX:1 RW:1
[ 3.504000] try_preserve_large_page - about to call static_protections
[ 3.504000] static_protections - .rodata PFN:0x1600 VA:0xc1600000
[ 3.504000] try_preserve_large_page - back from static_protections
[ 3.504000] try_preserve_large_page - NX:1 RW:0

So, her is what we have:
1. RO-data is at 0xc15b4000 - 0xc17e3000
2. pgd_lock is at 0xc17ebdac
3. single large page maps tail end of RO-data, and a head of RW-data,
including pgd_lock
4. static_protections says that 0xc1600000 - 0xc17e2000 should be
read-only, and that is true
5. However, try_preserve_large_page assumes that whole large page is
RO since whole requested RO-range fits within the page (0xc1600000 -
0xc1800000) -- FALSE. The problem is that try_preserve_large_page()
never checks static_protections() for the remainder of the page, which
is wrong.

The bug seems to be in the following piece of code (arch/x86/mm/pageattr.c:434):
================================================
/*
* We need to check the full range, whether
* static_protection() requires a different pgprot for one of
* the pages in the range we try to preserve:
*/
addr = address + PAGE_SIZE;
pfn++;
for (i = 1; i < cpa->numpages; i++, addr += PAGE_SIZE, pfn++) {
pgprot_t chk_prot = static_protections(new_prot, addr, pfn);

if (pgprot_val(chk_prot) != pgprot_val(new_prot))
goto out_unlock;
}
================================================

It seems to me that the for loop needs to run for EACH small page
within large page, instead of just from addr through cpa->numpages:
================================================
- addr = address + PAGE_SIZE;
- pfn++;
- for (i = 1; i < cpa->numpages; i++, addr += PAGE_SIZE, pfn++) {
+ addr = address & pmask;
+ pfn = pte_pfn(old_pte);
+ for ( i = 0; i < (psize >> PAGE_SHIFT); i++, addr +=
PAGE_SIZE, pfn++) {
pgprot_t chk_prot = static_protections(new_prot, addr, pfn);

if (pgprot_val(chk_prot) != pgprot_val(new_prot))
goto out_unlock;
}
================================================

Further, I do not think that the conditions for "whole-pageness" are
correct (arch/x86/mm/pageattr.c:457)
================================================
/*
* We need to change the attributes. Check, whether we can
* change the large page in one go. We request a split, when
* the address is not aligned and the number of pages is
* smaller than the number of pages in the large page. Note
* that we limited the number of possible pages already to
* the number of pages in the large page.
*/
- if (address == (nextpage_addr - psize) && cpa->numpages == numpages) {
+ if (address == (address & pmask) && cpa->numpages == (psize
>> PAGE_SHIFT)) {
/*
* The address is aligned and the number of pages
* covers the full page.
*/
new_pte = pfn_pte(pte_pfn(old_pte), canon_pgprot(new_prot));
__set_pmd_pte(kpte, address, new_pte);
cpa->flags |= CPA_FLUSHTLB;
do_split = 0;
}
================================================

Please let me know if this makes any sense, and I will submit a proper patch.

Thank you.