Ingo,
I've got this 1MB thing fixed. I have also adjusted my previous
patches to be compatible with 2.6.33-rc5.
I will be re-posting this patch set to LKML shortly.
Let me know if there is anything else you would like me to adjust.
Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
The patch have been developed for Linux 2.6.31-rc7 x86 by Siarhei Liakh
<sliak...@gmail.com> and Xuxian Jiang <ji...@cs.ncsu.edu>.
V1: initial patch for 2.6.30
V2: patch for 2.6.31-rc7
V3: moved all code into arch/x86, adjusted credits
V4: fixed ifdef, removed credits from CREDITS
V5: fixed an address calculation bug in mark_nxdata_nx()
V6: updated for compatibility with 2.6.33-rc5
---
Signed-off-by: Siarhei Liakh <sliak...@gmail.com>
Signed-off-by: Xuxian Jiang <ji...@cs.ncsu.edu>
Acked-by: Arjan van de Ven <ar...@linux.intel.com>
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index f92a0da..2cb7369 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -69,7 +69,7 @@ jiffies_64 = jiffies;
PHDRS {
text PT_LOAD FLAGS(5); /* R_E */
- data PT_LOAD FLAGS(7); /* RWE */
+ data PT_LOAD FLAGS(6); /* RW_ */
#ifdef CONFIG_X86_64
user PT_LOAD FLAGS(5); /* R_E */
#ifdef CONFIG_SMP
@@ -108,6 +108,8 @@ SECTIONS
IRQENTRY_TEXT
*(.fixup)
*(.gnu.warning)
+ /* .text should occupy whole number of pages */
+ . = ALIGN(PAGE_SIZE);
/* End of text section */
_etext = .;
} :text = 0x9090
@@ -143,6 +145,8 @@ SECTIONS
/* rarely changed data like cpu maps */
READ_MOSTLY_DATA(INTERNODE_CACHE_BYTES)
+ /* .data should occupy whole number of pages */
+ . = ALIGN(PAGE_SIZE);
/* End of data section */
_edata = .;
} :data
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index d406c52..d613d0a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -356,9 +356,10 @@ void free_init_pages(char *what, unsigned long
begin, unsigned long end)
/*
* We just marked the kernel text read only above, now that
* we are going to free part of that, we need to make that
- * writeable first.
+ * writeable and non-executable first.
*/
set_memory_rw(begin, (end - begin) >> PAGE_SHIFT);
+ set_memory_nx(begin, (end - begin) >> PAGE_SHIFT);
printk(KERN_INFO "Freeing %s: %luk freed\n", what, (end - begin) >> 10);
@@ -373,11 +374,29 @@ void free_init_pages(char *what, unsigned long
begin, unsigned long end)
#endif
}
+void mark_nxdata_nx(void)
+{
+#ifdef CONFIG_DEBUG_RODATA
+ /*
+ * When this called, init has already been executed and released,
+ * so everything past _etext sould be NX.
+ */
+ unsigned long start = PAGE_ALIGN((unsigned long)(&_etext));
+ unsigned long size = PAGE_ALIGN((unsigned long)(&_end)) - start;
+
+ printk(KERN_INFO "NX-protecting the kernel data: %lx, %lu pages\n",
+ start, size >> PAGE_SHIFT);
+ set_memory_nx(start, size >> PAGE_SHIFT);
+#endif
+}
+
void free_initmem(void)
{
free_init_pages("unused kernel memory",
(unsigned long)(&__init_begin),
(unsigned long)(&__init_end));
+ /* Set kernel's data as NX */
+ mark_nxdata_nx();
}
#ifdef CONFIG_BLK_DEV_INITRD
> V1: initial patch for 2.6.30
> V2: patch for 2.6.31-rc7
> V3: moved all code into arch/x86, adjusted credits
> V4: fixed ifdef, removed credits from CREDITS
> V5: fixed an address calculation bug in mark_nxdata_nx()
> V6: updated for compatibility with 2.6.33-rc5
> ---
>
> Signed-off-by: Siarhei Liakh <sliak...@gmail.com>
> Signed-off-by: Xuxian Jiang <ji...@cs.ncsu.edu>
> Acked-by: Arjan van de Ven <ar...@linux.intel.com>
Reviewed-by: James Morris <jmo...@namei.org>
--
James Morris
<jmo...@namei.org>
x86, mm: NX protection for kernel data
This patch expands functionality of CONFIG_DEBUG_RODATA to set main
(static) kernel data area as NX.
The following steps are taken to achieve this:
1. Linker script is adjusted so .text always starts and ends on a page boundary
2. Linker script is adjusted so .rodata and .data always start and
end on a page boundary
3. void mark_nxdata_nx(void) added to arch/x86/mm/init.c with actual
functionality: NX is set for all pages from _etext through _end.
4. mark_nxdata_nx() called from free_initmem() (after init has been released)
5. free_init_pages() sets released memory NX in arch/x86/mm/init.c
V1: initial patch for 2.6.30
V2: patch for 2.6.31-rc7
V3: moved all code into arch/x86, adjusted credits
V4: fixed ifdef, removed credits from CREDITS
V5: fixed an address calculation bug in mark_nxdata_nx()
V6: updated for compatibility with 2.6.33-rc5
Signed-off-by: Siarhei Liakh <sliak...@gmail.com>
Signed-off-by: Xuxian Jiang <ji...@cs.ncsu.edu>
Acked-by: Arjan van de Ven <ar...@linux.intel.com>
Reviewed-by: James Morris <jmo...@namei.org>
LKML-Reference: <817ecb6f1001311527w791...@mail.gmail.com>
Signed-off-by: H. Peter Anvin <h...@zytor.com>
---
arch/x86/kernel/vmlinux.lds.S | 6 +++++-
arch/x86/mm/init.c | 21 ++++++++++++++++++++-
2 files changed, 25 insertions(+), 2 deletions(-)
> Commit-ID: 01ab31371da90a795b774d87edf2c21bb3a64dda
> Gitweb: http://git.kernel.org/tip/01ab31371da90a795b774d87edf2c21bb3a64dda
> Author: Siarhei Liakh <sliak...@gmail.com>
> AuthorDate: Sun, 31 Jan 2010 18:27:55 -0500
> Committer: H. Peter Anvin <h...@zytor.com>
> CommitDate: Wed, 17 Feb 2010 10:11:24 -0800
>
> x86, mm: NX protection for kernel data
>
> This patch expands functionality of CONFIG_DEBUG_RODATA to set main
> (static) kernel data area as NX.
-tip testing is seeing boot hangs along the lines of:
[ 15.568108] EXT3-fs (sda1): recovery complete
[ 15.573064] EXT3-fs (sda1): mounted filesystem with ordered data mode
[ 15.580313] VFS: Mounted root (ext3 filesystem) readonly on device 8:1.
[ 15.584021] async_waiting @ 1
[ 15.588008] async_continuing @ 1 after 0 usec
[ 15.592163] Freeing unused kernel memory: 540k freed
[ 15.600126] NX-protecting the kernel data: c15ab000, 2919 pages
which i suspect could be due to the commit above.
Config attached. Athlon64 testbox.
Ingo
Yep, that's confirmed now, applying these 3 reverts makes it boot fine:
833e0ca: Revert "x86, mm: NX protection for kernel data"
ce4b6b4: Revert "x86: RO/NX protection for loadable kernel modules"
e357312: Revert "module: fix () used as prototype in include/linux/module.h"
Ingo
> On 02/22/2010 03:01 AM, Ingo Molnar wrote:
> >>
> >>> Commit-ID: 01ab31371da90a795b774d87edf2c21bb3a64dda
> >>> Gitweb: http://git.kernel.org/tip/01ab31371da90a795b774d87edf2c21bb3a64dda
> >>> Author: Siarhei Liakh <sliak...@gmail.com>
> >>> AuthorDate: Sun, 31 Jan 2010 18:27:55 -0500
> >>> Committer: H. Peter Anvin <h...@zytor.com>
> >>> CommitDate: Wed, 17 Feb 2010 10:11:24 -0800
> >>>
> >>> x86, mm: NX protection for kernel data
> >>>
> >>> This patch expands functionality of CONFIG_DEBUG_RODATA to set main
> >>> (static) kernel data area as NX.
> >>
> >> -tip testing is seeing boot hangs along the lines of:
> >>
> >> [ 15.568108] EXT3-fs (sda1): recovery complete
> >> [ 15.573064] EXT3-fs (sda1): mounted filesystem with ordered data mode
> >> [ 15.580313] VFS: Mounted root (ext3 filesystem) readonly on device 8:1.
> >> [ 15.584021] async_waiting @ 1
> >> [ 15.588008] async_continuing @ 1 after 0 usec
> >> [ 15.592163] Freeing unused kernel memory: 540k freed
> >> [ 15.600126] NX-protecting the kernel data: c15ab000, 2919 pages
> >>
> >> which i suspect could be due to the commit above.
> >
> > Yep, that's confirmed now, applying these 3 reverts makes it boot fine:
> >
> > 833e0ca: Revert "x86, mm: NX protection for kernel data"
> > ce4b6b4: Revert "x86: RO/NX protection for loadable kernel modules"
> > e357312: Revert "module: fix () used as prototype in include/linux/module.h"
> >
>
> Given that e357312 is a () -> (void) prototype fix, is hardly seems
> likely to be at fault. The RO/NX stuff, on the other hand, make a lot
> of sense.
Yes, i reverted e357312 because it was a dependent change.
Thanks,
Given that e357312 is a () -> (void) prototype fix, is hardly seems
likely to be at fault. The RO/NX stuff, on the other hand, make a lot
of sense.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
> At this point I need some help and guidance on how to track down what
> exactly happens there, as I am not very familiar with what goes into
> .data and why are we trying to execute it.
Can't you add debug printk in the fault handler before any exception processing
Something like that.
Matthieu
That does not really give any additional information. The message does
not show up in the output and the stack trace says that we are somehow
ended up in doublefault_fn.
any other ideas?
I really appreciate your help.
===========
...
[ 17.652000] BUG: unable to handle kernel NULL pointer dereference at 00000014
[ 17.652000] IP: [<c102e722>] vprintk+0x12/0x398
[ 17.652000] *pdpt = 00000000018e7001 *pde = 0000000000000000
[ 17.652000] Oops: 0000 [#1] SMP
[ 17.652000] last sysfs file:
[ 17.652000] Modules linked in:
[ 17.652000]
[ 17.652000] Pid: 314, comm: rcu_torture_rea Not tainted 2.6.33-tip+ #15 /
[ 17.652000] EIP: 0060:[<c102e722>] EFLAGS: 00004082 CPU: 0
[ 17.652000] EIP is at vprintk+0x12/0x398
[ 17.652000] EAX: c171dc07 EBX: c2802000 ECX: 00000000 EDX: 00000000
[ 17.652000] ESI: c18f1b90 EDI: 00000000 EBP: c18f1b74 ESP: c18f1b10
[ 17.652000] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 17.652000] Process rcu_torture_rea (pid: 314, ti=c18f1000
task=f797c000 task.ti=f791a000)
[ 17.652000] Stack:
[ 17.652000] 00000000 00000000 c171dc07 00000000 00000000 00000000
00000000 00000000
[ 17.652000] <0> 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[ 17.652000] <0> 00000000 00000000 00000000 00000000 00000000
00000000 c2802000 00000000
[ 17.652000] Call Trace:
[ 17.652000] [<c15babb1>] ? printk+0xf/0x16
[ 17.652000] [<c101458b>] ? doublefault_fn+0x2b/0xd8
[ 17.652000] Code: 76 80 c1 e8 4e 8d 01 00 c7 05 f0 34 8f c1 00 00
00 00 e8 66 fc ff ff 5d c3 55 89 e5 57 56 53 83 ec 58 8b 45 08 89 45
a4 8b 75 0c <65> 8b 15 14 00 00 00 89 55 f0 31 d2 a1 a0 75 80 c1 89 45
a8 8b
[ 17.652000] EIP: [<c102e722>] vprintk+0x12/0x398 SS:ESP 0068:c18f1b10
[ 17.652000] CR2: 0000000000000014
[ 17.652000] ---[ end trace 6164e4a9acb59023 ]---
[ 17.656000] BUG: spinlock lockup on CPU#0, rcu_torture_rea/314, c2809800
[ 17.656000] Pid: 314, comm: rcu_torture_rea Tainted: G D
2.6.33-tip+ #15
[ 17.656000] Call Trace:
[ 17.656000] [<c15babb1>] ? printk+0xf/0x16
[ 17.656000] [<c12758a5>] do_raw_spin_lock+0xfb/0x126
[ 17.656000] [<c15bd2d3>] _raw_spin_lock+0x22/0x2a
[ 17.656000] [<c102b5c2>] ? scheduler_tick+0x33/0x233
[ 17.656000] [<c102b5c2>] scheduler_tick+0x33/0x233
[ 17.656000] [<c1032e17>] ? raise_softirq+0x43/0x50
[ 17.656000] [<c1039a5b>] update_process_times+0x3c/0x48
[ 17.656000] [<c104e92f>] tick_periodic+0x66/0x72
[ 17.656000] [<c104e954>] tick_handle_periodic+0x19/0x71
[ 17.656000] [<c1010c19>] smp_apic_timer_interrupt+0x6a/0x7d
[ 17.656000] [<c15bdee6>] apic_timer_interrupt+0x36/0x40
[ 17.656000] [<c1060331>] ? acct_collect+0x12e/0x134
[ 17.656000] [<c15bd72e>] ? _raw_spin_unlock_irq+0x22/0x26
[ 17.656000] [<c15bd730>] ? _raw_spin_unlock_irq+0x24/0x26
[ 17.656000] [<c1060331>] acct_collect+0x12e/0x134
[ 17.656000] [<c1031000>] do_exit+0x187/0x625
[ 17.656000] [<c102eba7>] ? kmsg_dump+0xff/0x113
[ 17.656000] [<c102ddf5>] ? oops_exit+0x2a/0x2f
[ 17.656000] [<c15beaaa>] oops_end+0x92/0x9a
[ 17.656000] [<c1016c90>] no_context+0x15f/0x169
[ 17.656000] [<c1016dec>] __bad_area_nosemaphore+0x152/0x15a
[ 17.656000] [<c1016e01>] bad_area_nosemaphore+0xd/0x10
[ 17.656000] [<c15c053d>] do_page_fault+0x199/0x30a
[ 17.656000] [<c15c03a4>] ? do_page_fault+0x0/0x30a
[ 17.656000] [<c15be108>] error_code+0x78/0x80
[ 17.656000] [<c102e722>] ? vprintk+0x12/0x398
[ 17.656000] [<c15babb1>] printk+0xf/0x16
[ 17.656000] [<c101458b>] doublefault_fn+0x2b/0xd8
[ 17.656000] sending NMI to all CPUs:
[ 17.656000] NMI backtrace for cpu 0
[ 17.656000] Modules linked in:
[ 17.656000]
[ 17.656000] Pid: 314, comm: rcu_torture_rea Tainted: G D
2.6.33-tip+ #15 /
[ 17.656000] EIP: 0060:[<c10112b6>] EFLAGS: 00000046 CPU: 0
[ 17.656000] EIP is at default_send_IPI_mask_logical+0xc3/0xdb
[ 17.656000] EAX: ffffb300 EBX: 01000000 ECX: c1011239 EDX: 00000c00
[ 17.656000] ESI: 00000002 EDI: 00000002 EBP: c18f1848 ESP: c18f1838
[ 17.656000] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 17.656000] Process rcu_torture_rea (pid: 314, ti=c18f1000
task=f797c000 task.ti=f791a000)
[ 17.656000] Stack:
[ 17.656000] 00000800 08453000 00000000 c2809800 c18f1854 c101106a
08453000 c18f1864
[ 17.656000] <0> c1011739 c171c7e2 08453000 c18f189c c12758aa
c175c91e 00000000 f797c318
[ 17.656000] <0> 0000013a c2809800 f797c318 f797c000 08453000
00000001 c2809800 c2809800
[ 17.656000] Call Trace:
[ 17.656000] [<c101106a>] ? default_send_IPI_all+0x22/0x62
[ 17.656000] [<c1011739>] ? arch_trigger_all_cpu_backtrace+0x2b/0x4f
[ 17.656000] [<c12758aa>] ? do_raw_spin_lock+0x100/0x126
[ 17.656000] [<c15bd2d3>] ? _raw_spin_lock+0x22/0x2a
[ 17.656000] [<c102b5c2>] ? scheduler_tick+0x33/0x233
[ 17.656000] [<c102b5c2>] ? scheduler_tick+0x33/0x233
[ 17.656000] [<c1032e17>] ? raise_softirq+0x43/0x50
[ 17.656000] [<c1039a5b>] ? update_process_times+0x3c/0x48
[ 17.656000] [<c104e92f>] ? tick_periodic+0x66/0x72
[ 17.656000] [<c104e954>] ? tick_handle_periodic+0x19/0x71
[ 17.656000] [<c1010c19>] ? smp_apic_timer_interrupt+0x6a/0x7d
[ 17.656000] [<c15bdee6>] ? apic_timer_interrupt+0x36/0x40
[ 17.656000] [<c1060331>] ? acct_collect+0x12e/0x134
[ 17.656000] [<c15bd72e>] ? _raw_spin_unlock_irq+0x22/0x26
[ 17.656000] [<c15bd730>] ? _raw_spin_unlock_irq+0x24/0x26
[ 17.656000] [<c1060331>] ? acct_collect+0x12e/0x134
[ 17.656000] [<c1031000>] ? do_exit+0x187/0x625
[ 17.656000] [<c102eba7>] ? kmsg_dump+0xff/0x113
[ 17.656000] [<c102ddf5>] ? oops_exit+0x2a/0x2f
[ 17.656000] [<c15beaaa>] ? oops_end+0x92/0x9a
[ 17.656000] [<c1016c90>] ? no_context+0x15f/0x169
[ 17.656000] [<c1016dec>] ? __bad_area_nosemaphore+0x152/0x15a
[ 17.656000] [<c1016e01>] ? bad_area_nosemaphore+0xd/0x10
[ 17.656000] [<c15c053d>] ? do_page_fault+0x199/0x30a
[ 17.656000] [<c15c03a4>] ? do_page_fault+0x0/0x30a
[ 17.656000] [<c15be108>] ? error_code+0x78/0x80
[ 17.656000] [<c102e722>] ? vprintk+0x12/0x398
[ 17.656000] [<c15babb1>] ? printk+0xf/0x16
[ 17.656000] [<c101458b>] ? doublefault_fn+0x2b/0xd8
[ 17.656000] Code: 00 89 da 89 10 83 fe 02 74 07 8b 55 f0 09 f2 eb
06 8b 55 f0 80 ce 04 a1 2c 6c 80 c1 2d 00 3d 00 00 89 10 f7 c7 00 02
00 00 75 09 <57> 9d e8 0c 0c 04 00 eb 07 e8 9e 1a 04 00 57 9d 8d 65 f4
5b 5e
[ 17.656000] Call Trace:
[ 17.656000] [<c101106a>] default_send_IPI_all+0x22/0x62
[ 17.656000] [<c1011739>] arch_trigger_all_cpu_backtrace+0x2b/0x4f
[ 17.656000] [<c12758aa>] do_raw_spin_lock+0x100/0x126
[ 17.656000] [<c15bd2d3>] _raw_spin_lock+0x22/0x2a
[ 17.656000] [<c102b5c2>] ? scheduler_tick+0x33/0x233
[ 17.656000] [<c102b5c2>] scheduler_tick+0x33/0x233
[ 17.656000] [<c1032e17>] ? raise_softirq+0x43/0x50
[ 17.656000] [<c1039a5b>] update_process_times+0x3c/0x48
[ 17.656000] [<c104e92f>] tick_periodic+0x66/0x72
[ 17.656000] [<c104e954>] tick_handle_periodic+0x19/0x71
[ 17.656000] [<c1010c19>] smp_apic_timer_interrupt+0x6a/0x7d
[ 17.656000] [<c15bdee6>] apic_timer_interrupt+0x36/0x40
[ 17.656000] [<c1060331>] ? acct_collect+0x12e/0x134
[ 17.656000] [<c15bd72e>] ? _raw_spin_unlock_irq+0x22/0x26
[ 17.656000] [<c15bd730>] ? _raw_spin_unlock_irq+0x24/0x26
[ 17.656000] [<c1060331>] acct_collect+0x12e/0x134
[ 17.656000] [<c1031000>] do_exit+0x187/0x625
[ 17.656000] [<c102eba7>] ? kmsg_dump+0xff/0x113
[ 17.656000] [<c102ddf5>] ? oops_exit+0x2a/0x2f
[ 17.656000] [<c15beaaa>] oops_end+0x92/0x9a
[ 17.656000] [<c1016c90>] no_context+0x15f/0x169
[ 17.656000] [<c1016dec>] __bad_area_nosemaphore+0x152/0x15a
[ 17.656000] [<c1016e01>] bad_area_nosemaphore+0xd/0x10
[ 17.656000] [<c15c053d>] do_page_fault+0x199/0x30a
[ 17.656000] [<c15c03a4>] ? do_page_fault+0x0/0x30a
[ 17.656000] [<c15be108>] error_code+0x78/0x80
[ 17.656000] [<c102e722>] ? vprintk+0x12/0x398
[ 17.656000] [<c15babb1>] printk+0xf/0x16
[ 17.656000] [<c101458b>] doublefault_fn+0x2b/0xd8
[ 17.656000] Pid: 314, comm: rcu_torture_rea Tainted: G D
2.6.33-tip+ #15
[ 17.656000] Call Trace:
[ 17.656000] [<c10076eb>] ? show_regs+0x1a/0x20
[ 17.656000] [<c15bf177>] nmi_watchdog_tick+0xa3/0x181
[ 17.656000] [<c15be62c>] do_nmi+0xc6/0x2d1
[ 17.656000] [<c15be1d0>] nmi_stack_correct+0x2f/0x34
[ 17.656000] [<c1011239>] ? default_send_IPI_mask_logical+0x46/0xdb
[ 17.656000] [<c10112b6>] ? default_send_IPI_mask_logical+0xc3/0xdb
[ 17.656000] [<c101106a>] default_send_IPI_all+0x22/0x62
[ 17.656000] [<c1011739>] arch_trigger_all_cpu_backtrace+0x2b/0x4f
[ 17.656000] [<c12758aa>] do_raw_spin_lock+0x100/0x126
[ 17.656000] [<c15bd2d3>] _raw_spin_lock+0x22/0x2a
[ 17.656000] [<c102b5c2>] ? scheduler_tick+0x33/0x233
[ 17.656000] [<c102b5c2>] scheduler_tick+0x33/0x233
[ 17.656000] [<c1032e17>] ? raise_softirq+0x43/0x50
[ 17.656000] [<c1039a5b>] update_process_times+0x3c/0x48
[ 17.656000] [<c104e92f>] tick_periodic+0x66/0x72
[ 17.656000] [<c104e954>] tick_handle_periodic+0x19/0x71
[ 17.656000] [<c1010c19>] smp_apic_timer_interrupt+0x6a/0x7d
[ 17.656000] [<c15bdee6>] apic_timer_interrupt+0x36/0x40
[ 17.656000] [<c1060331>] ? acct_collect+0x12e/0x134
[ 17.656000] [<c15bd72e>] ? _raw_spin_unlock_irq+0x22/0x26
[ 17.656000] [<c15bd730>] ? _raw_spin_unlock_irq+0x24/0x26
[ 17.656000] [<c1060331>] acct_collect+0x12e/0x134
[ 17.656000] [<c1031000>] do_exit+0x187/0x625
[ 17.656000] [<c102eba7>] ? kmsg_dump+0xff/0x113
[ 17.656000] [<c102ddf5>] ? oops_exit+0x2a/0x2f
[ 17.656000] [<c15beaaa>] oops_end+0x92/0x9a
[ 17.656000] [<c1016c90>] no_context+0x15f/0x169
[ 17.656000] [<c1016dec>] __bad_area_nosemaphore+0x152/0x15a
[ 17.656000] [<c1016e01>] bad_area_nosemaphore+0xd/0x10
[ 17.656000] [<c15c053d>] do_page_fault+0x199/0x30a
[ 17.656000] [<c15c03a4>] ? do_page_fault+0x0/0x30a
[ 17.656000] [<c15be108>] error_code+0x78/0x80
[ 17.656000] [<c102e722>] ? vprintk+0x12/0x398
[ 17.656000] [<c15babb1>] printk+0xf/0x16
[ 17.656000] [<c101458b>] doublefault_fn+0x2b/0xd8
Forgot to add one more thing: it looks like fault on instruction
prefetch is already taken care of in fault handlers and should
generate an appropriate message. The way I interpret the fact that
doublefault_fn is the one that prints out a message, is that main
fault handler is somehow dependent on .data being executable.... (but
this is pure speculation on my part at this point)...
> Any suggestions on how should I proceed forward in troubleshooting this issue?
Can you reproduce it in KVM?
If yes then that might give you a more debuggable state than a crashed native
system.
Yes, it crashes in KVM.
> If yes then that might give you a more debuggable state than a crashed native
> system.
Running kernel in KVM under kgdb with a watchpoint set for
($eip>=_etext)&&($eip<=_edata) still produces a dump but without ever
triggering the watchpoint.
I was hoping that some Kernel Locking Guru would say "take a look at
__xxx_yyy_zzz(), the things it does always looked suspicious to me."
:)
But joking aside, I would really appreciate ANY advice on debugging
this problem, as my best idea at this point is to pretty much
single-step through the whole thing... And since I am not familiar
with lock debugging at all, this will take a long time to figure out
on my own.
Thank you, guys.
I've done some extra debugging and it really does look like the crash
happens when we are setting NX on a large page which has pgd_lock
inside it.
Here is a trace of printk's that I added to troubleshoot this issue:
=========================
[ 3.072003] try_preserve_large_page - enter
[ 3.073185] try_preserve_large_page - address: 0xc1600000
[ 3.074513] try_preserve_large_page - 2M page
[ 3.075606] try_preserve_large_page - about to call static_protections
[ 3.076000] try_preserve_large_page - back from static_protections
[ 3.076000] try_preserve_large_page - past loop
[ 3.076000] try_preserve_large_page - new_prot != old_prot
[ 3.076000] try_preserve_large_page - the address is aligned and
the number of pages covers the full range
[ 3.076000] try_preserve_large_page - about to call __set_pmd_pte
[ 3.076000] __set_pmd_pte - enter
[ 3.076000] __set_pmd_pte - address: 0xc1600000
[ 3.076000] __set_pmd_pte - about to call
set_pte_atomic(*0xc18c0058(low=0x16001e3, high=0x0), (low=0x16001e1,
high=0x80000000))
[lock-up here]
=========================
This may be stupid but :
0xc1600000 2MB page is in 0xc1600000-0xc1800000 range. pgd_lock (0xc17ebdac) seems to be in that range.
You change attribute from (low=0x16001e3, high=0x0) to (low=0x16001e1, high=0x80000000). IE you set
NX bit (bit 63), but you also clear R/W bit (bit 2). So the page become read only, but you are using a lock
inside this page that need RW access. So you got a page fault.
Now I don't know what should be done.
Is that normal we set the page RO ?
Matthieu
That's what I was thinking...
> You change attribute from (low=0x16001e3, high=0x0) to (low=0x16001e1,
> high=0x80000000). IE you set
> NX bit (bit 63), but you also clear R/W bit (bit 2). So the page become read
> only, but you are using a lock
> inside this page that need RW access. So you got a page fault.
Yes, that would do it.
> Now I don't know what should be done.
> Is that normal we set the page RO ?
No, this page should not be RO, as it contains kernel's RW data.
The interesting part is that the call that initiates the change is
set_memory_nx(), so it should not be clearing RW bit... The
interesting part is that the kernel does not crash with lock debugging
disabled.
Thanks for your help.
Turns out that address is indeed within .rodata range, so
static_protections() flips RW bit to 0:
[ 0.000000] Memory: 889320k/914776k available (5836k kernel code,
25064k reserved, 2564k data, 540k init, 0k highmem)
[ 0.000000] virtual kernel memory layout:
[ 0.000000] fixmap : 0xffd58000 - 0xfffff000 (2716 kB)
[ 0.000000] vmalloc : 0xf8556000 - 0xffd56000 ( 120 MB)
[ 0.000000] lowmem : 0xc0000000 - 0xf7d56000 ( 893 MB)
[ 0.000000] .init : 0xc1834000 - 0xc18bb000 ( 540 kB)
[ 0.000000] .data : 0xc15b3000 - 0xc1834000 (2564 kB)
[ 0.000000] .rodata : 0xc15b4000 - 0xc17e3000 (2236 kB)
[ 0.000000] .text : 0xc1000000 - 0xc15b3000 (5836 kB)
[ 0.000000] pgd_lock address: 0xc17ebdac
[...]
[ 3.496969] try_preserve_large_page - enter
[ 3.500004] try_preserve_large_page - address: 0xc1600000
[ 3.501730] try_preserve_large_page - 2M page
[ 3.503100] try_preserve_large_page - NX:1 RW:1
[ 3.504000] try_preserve_large_page - about to call static_protections
[ 3.504000] static_protections - .rodata PFN:0x1600 VA:0xc1600000
[ 3.504000] try_preserve_large_page - back from static_protections
[ 3.504000] try_preserve_large_page - NX:1 RW:0
So, her is what we have:
1. RO-data is at 0xc15b4000 - 0xc17e3000
2. pgd_lock is at 0xc17ebdac
3. single large page maps tail end of RO-data, and a head of RW-data,
including pgd_lock
4. static_protections says that 0xc1600000 - 0xc17e2000 should be
read-only, and that is true
5. However, try_preserve_large_page assumes that whole large page is
RO since whole requested RO-range fits within the page (0xc1600000 -
0xc1800000) -- FALSE. The problem is that try_preserve_large_page()
never checks static_protections() for the remainder of the page, which
is wrong.
The bug seems to be in the following piece of code (arch/x86/mm/pageattr.c:434):
================================================
/*
* We need to check the full range, whether
* static_protection() requires a different pgprot for one of
* the pages in the range we try to preserve:
*/
addr = address + PAGE_SIZE;
pfn++;
for (i = 1; i < cpa->numpages; i++, addr += PAGE_SIZE, pfn++) {
pgprot_t chk_prot = static_protections(new_prot, addr, pfn);
if (pgprot_val(chk_prot) != pgprot_val(new_prot))
goto out_unlock;
}
================================================
It seems to me that the for loop needs to run for EACH small page
within large page, instead of just from addr through cpa->numpages:
================================================
- addr = address + PAGE_SIZE;
- pfn++;
- for (i = 1; i < cpa->numpages; i++, addr += PAGE_SIZE, pfn++) {
+ addr = address & pmask;
+ pfn = pte_pfn(old_pte);
+ for ( i = 0; i < (psize >> PAGE_SHIFT); i++, addr +=
PAGE_SIZE, pfn++) {
pgprot_t chk_prot = static_protections(new_prot, addr, pfn);
if (pgprot_val(chk_prot) != pgprot_val(new_prot))
goto out_unlock;
}
================================================
Further, I do not think that the conditions for "whole-pageness" are
correct (arch/x86/mm/pageattr.c:457)
================================================
/*
* We need to change the attributes. Check, whether we can
* change the large page in one go. We request a split, when
* the address is not aligned and the number of pages is
* smaller than the number of pages in the large page. Note
* that we limited the number of possible pages already to
* the number of pages in the large page.
*/
- if (address == (nextpage_addr - psize) && cpa->numpages == numpages) {
+ if (address == (address & pmask) && cpa->numpages == (psize
>> PAGE_SHIFT)) {
/*
* The address is aligned and the number of pages
* covers the full page.
*/
new_pte = pfn_pte(pte_pfn(old_pte), canon_pgprot(new_prot));
__set_pmd_pte(kpte, address, new_pte);
cpa->flags |= CPA_FLUSHTLB;
do_split = 0;
}
================================================
Please let me know if this makes any sense, and I will submit a proper patch.
Thank you.