Oops in 2.6.19.1

54 views
Skip to first unread message

Alistair John Strachan

unread,
Dec 20, 2006, 9:48:46 AM12/20/06
to LKML
Hi,

Any ideas?

BUG: unable to handle kernel NULL pointer dereference at virtual address
00000009
printing eip:
c0156f60
*pde = 00000000
Oops: 0002 [#1]
Modules linked in: ipt_recent ipt_REJECT xt_tcpudp ipt_MASQUERADE iptable_nat
xt_state iptable_filter ip_tables x_tables prism54 yenta_socket
rsrc_nonstatic pcmcia_core snd_via82xx snd_ac97_codec snd_ac97_bus snd_pcm
snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd soundcore ehci_hcd
usblp eth1394 uhci_hcd usbcore ohci1394 ieee1394 via_agp agpgart vt1211
hwmon_vid hwmon ip_nat_ftp ip_nat ip_conntrack_ftp ip_conntrack
CPU: 0
EIP: 0060:[<c0156f60>] Not tainted VLI
EFLAGS: 00010246 (2.6.19.1 #1)
EIP is at pipe_poll+0xa0/0xb0
eax: 00000008 ebx: 00000000 ecx: 00000008 edx: 00000000
esi: f70f3e9c edi: f7017c00 ebp: f70f3c1c esp: f70f3c0c
ds: 007b es: 007b ss: 0068
Process python (pid: 4178, ti=f70f2000 task=f70c4a90 task.ti=f70f2000)
Stack: 00000000 00000000 f70f3e9c f6e111c0 f70f3fa4 c015d7f3 f70f3c54 f70f3fac
084c44a0 00000030 084c44d0 00000000 f70f3e94 f70f3e94 00000006 f70f3ecc
00000000 f70f3e94 c015e580 00000000 00000000 00000006 f6e111c0 00000000
Call Trace:
[<c015d7f3>] do_sys_poll+0x253/0x480
[<c015da53>] sys_poll+0x33/0x50
[<c0102c97>] syscall_call+0x7/0xb
[<b7f6b402>] 0xb7f6b402
=======================
Code: 58 01 00 00 0f 4f c2 09 c1 89 c8 83 c8 08 85 db 0f 44 c8 8b 5d f4 89 c8
8b 75 f8 8b 7d fc 89 ec 5d c3 89 ca 8b 46 6c 83 ca 10 3b <87> 68 01 00 00 0f
45 ca eb b6 8d b6 00 00 00 00 55 b8 01 00 00
EIP: [<c0156f60>] pipe_poll+0xa0/0xb0 SS:ESP 0068:f70f3c0c

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Greg KH

unread,
Dec 20, 2006, 11:31:34 AM12/20/06
to Alistair John Strachan
On Wed, Dec 20, 2006 at 02:21:03PM +0000, Alistair John Strachan wrote:
> Hi,
>
> Any ideas?

Does the problem also happen in 2.6.19?

thanks,

greg k-h

Alistair John Strachan

unread,
Dec 20, 2006, 11:45:19 AM12/20/06
to Greg KH
On Wednesday 20 December 2006 16:30, Greg KH wrote:
> On Wed, Dec 20, 2006 at 02:21:03PM +0000, Alistair John Strachan wrote:
> > Hi,
> >
> > Any ideas?
>
> Does the problem also happen in 2.6.19?

No idea. I ran 2.6.19 for a couple of weeks without problems. It took 2 days
to oops 2.6.19.1, so if it happens again within that time period I guess that
might be indicative of a -stable patch.

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Chuck Ebbert

unread,
Dec 20, 2006, 3:52:52 PM12/20/06
to Alistair John Strachan
In-Reply-To: <200612201421....@sms.ed.ac.uk>

On Wed, 20 Dec 2006 14:21:03 +0000, Alistair John Strachan wrote:

> Any ideas?
>
> BUG: unable to handle kernel NULL pointer dereference at virtual address
> 00000009

83 ca 10 or $0x10,%edx
3b .byte 0x3b
87 68 01 xchg %ebp,0x1(%eax) <=====
00 00 add %al,(%eax)

Somehow it is trying to execute code in the middle of an instruction.
That almost never works, even when the resulting fragment is a legal
opcode. :)

The real instruction is:

3b 87 68 01 00 00 00 cmp 0x168(%edi),%eax

I'd guess you have some kind of hardware problem. It could also be
a kernel problem where the saved address was corrupted during an
interrupt, but that's not likely.
--
MBTI: IXTP

Alistair John Strachan

unread,
Dec 20, 2006, 5:39:39 PM12/20/06
to Chuck Ebbert
On Wednesday 20 December 2006 20:48, Chuck Ebbert wrote:
[snip]

> I'd guess you have some kind of hardware problem. It could also be
> a kernel problem where the saved address was corrupted during an
> interrupt, but that's not likely.

Seems pretty unlikely on a 4 year old Via Epia. Never had any problems with it
before now.

Maybe a cosmic ray event? ;-)

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Chuck Ebbert

unread,
Dec 21, 2006, 3:10:02 AM12/21/06
to Alistair John Strachan
In-Reply-To: <200612202215....@sms.ed.ac.uk>

On Wed, 20 Dec 2006 22:15:50 +0000, Alistair John Strachan wrote:

> > I'd guess you have some kind of hardware problem. It could also be
> > a kernel problem where the saved address was corrupted during an
> > interrupt, but that's not likely.
>
> Seems pretty unlikely on a 4 year old Via Epia. Never had any problems with it
> before now.
>
> Maybe a cosmic ray event? ;-)

The low byte of eip should be 5f and it changed to 60, so that's
probably not it. And the oops report is consistent with that being
the instruction that was really executed, so it's not the kernel
misreporting the address after it happened.

You weren't trying kprobes or something, were you? Have you ever
had another unexplained oops with this machine?

--
MBTI: IXTP

Alistair John Strachan

unread,
Dec 21, 2006, 9:23:30 AM12/21/06
to Chuck Ebbert
On Thursday 21 December 2006 08:05, Chuck Ebbert wrote:
> In-Reply-To: <200612202215....@sms.ed.ac.uk>
>
> On Wed, 20 Dec 2006 22:15:50 +0000, Alistair John Strachan wrote:
> > > I'd guess you have some kind of hardware problem. It could also be
> > > a kernel problem where the saved address was corrupted during an
> > > interrupt, but that's not likely.
> >
> > Seems pretty unlikely on a 4 year old Via Epia. Never had any problems
> > with it before now.
> >
> > Maybe a cosmic ray event? ;-)
>
> The low byte of eip should be 5f and it changed to 60, so that's
> probably not it. And the oops report is consistent with that being
> the instruction that was really executed, so it's not the kernel
> misreporting the address after it happened.
>
> You weren't trying kprobes or something, were you? Have you ever
> had another unexplained oops with this machine?

Nope, it's a stock kernel and it's running on a server, kprobes isn't in use.

And no, to my knowledge there's not been another "unexplained" oops. I've had
crashes, but they've always been known issues or BIOS trouble.

The machine was recently tampered with to install additional HDDs, but the
memory was memtest'ed when it was installed and passed several times without
issue. I'm rather puzzled.

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Valdis.K...@vt.edu

unread,
Dec 21, 2006, 10:32:36 AM12/21/06
to Alistair John Strachan
On Wed, 20 Dec 2006 22:15:50 GMT, Alistair John Strachan said:
> Seems pretty unlikely on a 4 year old Via Epia. Never had any problems with it
> before now.
>
> Maybe a cosmic ray event? ;-)

More likely a stray alpha particle from a radioactive decay in the actual chip
casing - I saw some research a while back that said that the average commodity
system should *expect* to see 1 or 2 alpha-induced single-bit errors per year,
and the chance that *you* saw the event was directly related to whether the
memory had ECC, and how much of the other circuitry had ECC on it....

Alistair John Strachan

unread,
Dec 23, 2006, 10:49:50 AM12/23/06
to LKML
On Wednesday 20 December 2006 14:21, Alistair John Strachan wrote:
> Hi,
>
> Any ideas?

Pretty much like clockwork, it happened again. I think it's time to take this
seriously as a software bug, and not some hardware problem. I've ran kernels
since 2.6.0 on this machine without such crashes, and now two of the same in
2.6.19.1? Pretty unlikely!

BUG: unable to handle kernel NULL pointer dereference at virtual address
00000009
printing eip:
c0156f60
*pde = 00000000
Oops: 0002 [#1]
Modules linked in: ipt_recent ipt_REJECT xt_tcpudp ipt_MASQUERADE iptable_nat
xt_sta
te iptable_filter ip_tables x_tables prism54 yenta_socket rsrc_nonstatic
pcmcia_core snd_via82xx snd_ac97_codec snd_ac97_bus
snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd soundcore

usblp ehci_hcd eth1394 uhci_hcd usbcore ohci1394 i


eee1394 via_agp agpgart vt1211 hwmon_vid hwmon ip_nat_ftp ip_nat
ip_conntrack_ftp ip_conntrack
CPU: 0
EIP: 0060:[<c0156f60>] Not tainted VLI
EFLAGS: 00010246 (2.6.19.1 #1)
EIP is at pipe_poll+0xa0/0xb0
eax: 00000008 ebx: 00000000 ecx: 00000008 edx: 00000000

esi: ee1b9e9c edi: f4d80a00 ebp: ee1b9c1c esp: ee1b9c0c


ds: 007b es: 007b ss: 0068

Process java (pid: 5374, ti=ee1b8000 task=f7117560 task.ti=ee1b8000)
Stack: 00000000 00000000 ee1b9e9c f6c17160 ee1b9fa4 c015d7f3 ee1b9c54 ee1b9fac
082dff90 00000010 082dffa0 00000000 ee1b9e94 ee1b9e94 00000002 ee1b9eac
00000000 ee1b9e94 c015e580 00000000 00000000 00000002 f6c17160 00000000


Call Trace:
[<c015d7f3>] do_sys_poll+0x253/0x480
[<c015da53>] sys_poll+0x33/0x50
[<c0102c97>] syscall_call+0x7/0xb

[<b7f26402>] 0xb7f26402


=======================
Code: 58 01 00 00 0f 4f c2 09 c1 89 c8 83 c8 08 85 db 0f 44 c8 8b 5d f4 89 c8
8b 75
f8 8b 7d fc 89 ec 5d c3 89 ca 8b 46 6c 83 ca 10 3b <87> 68 01 00 00 0f 45 ca
eb b6 8d b6 00 00 00 00 55 b8 01 00 00

EIP: [<c0156f60>] pipe_poll+0xa0/0xb0 SS:ESP 0068:ee1b9c0c

Alistair John Strachan

unread,
Dec 24, 2006, 9:41:37 AM12/24/06
to Chuck Ebbert
On Sunday 24 December 2006 04:23, Chuck Ebbert wrote:
> In-Reply-To: <200612231540....@sms.ed.ac.uk>

>
> On Sat, 23 Dec 2006 15:40:46 +0000, Alistair John Strachan wrote:
> > Pretty much like clockwork, it happened again. I think it's time to take
> > this seriously as a software bug, and not some hardware problem. I've ran
> > kernels since 2.6.0 on this machine without such crashes, and now two of
> > the same in 2.6.19.1? Pretty unlikely!
>
> Stranger things have happened, e.g. your system might have started
> to overheat just recently.

True, I've considered it, I'll replace the CPU fan.

> Anyway, post your complete .config. And exactly which one of the
> many Via cpus are you using? Are you using the Padlock unit?

No, much older than that:

[alistair] 14:38 [~] cat /proc/cpuinfo
processor : 0
vendor_id : CentaurHauls
cpu family : 6
model : 9
model name : VIA Nehemiah
stepping : 1
cpu MHz : 999.569
cache size : 64 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu de tsc msr cx8 mtrr pge cmov mmx fxsr sse fxsr_opt
bogomips : 2000.02

> What do those java/python programs do that are running? What pipe
> are they polling?
>
> You could try going back to 2.6.18.x for a while in the meantime.

Well, I have had a thought. I recently upgraded the toolchain on the machine
from binutils 2.16.x and GCC 3.4.3 (2.6.19 was built with this) to binutils
2.17 and GCC 4.1.1. It's conceivable that this is some sort of compiler bug.

Alistair John Strachan

unread,
Dec 24, 2006, 9:51:55 AM12/24/06
to Chuck Ebbert
On Sunday 24 December 2006 04:23, Chuck Ebbert wrote:
[snip]

> Anyway, post your complete .config.

Config attached.

config-2.6.19.1

Zhang, Yanmin

unread,
Dec 26, 2006, 9:07:57 PM12/26/06
to Alistair John Strachan
Above codes look weird. Could you disassemble kernel image and post
the part around address 0xc0156f60?

"87 68 01 00 00" is instruction xchg, but if I disassemble from the begining,
I couldn't see instruct xchg.


> EIP: [<c0156f60>] pipe_poll+0xa0/0xb0 SS:ESP 0068:ee1b9c0c
>

Alistair John Strachan

unread,
Dec 27, 2006, 7:35:19 AM12/27/06
to Zhang, Yanmin
On Wednesday 27 December 2006 02:07, Zhang, Yanmin wrote:
[snip]

> > 00000000 Call Trace:
> > [<c015d7f3>] do_sys_poll+0x253/0x480
> > [<c015da53>] sys_poll+0x33/0x50
> > [<c0102c97>] syscall_call+0x7/0xb
> > [<b7f26402>] 0xb7f26402
> > =======================
> > Code: 58 01 00 00 0f 4f c2 09 c1 89 c8 83 c8 08 85 db 0f 44 c8 8b 5d f4
> > 89 c8 8b 75
> > f8 8b 7d fc 89 ec 5d c3 89 ca 8b 46 6c 83 ca 10 3b <87> 68 01 00 00 0f 45
> > ca eb b6 8d b6 00 00 00 00 55 b8 01 00 00
>
> Above codes look weird. Could you disassemble kernel image and post
> the part around address 0xc0156f60?
>
> "87 68 01 00 00" is instruction xchg, but if I disassemble from the
> begining, I couldn't see instruct xchg.
>
> > EIP: [<c0156f60>] pipe_poll+0xa0/0xb0 SS:ESP 0068:ee1b9c0c

Unfortunately, after suspecting the toolchain, I did a manual rebuild of
binutils, gcc and glibc from the official sites, and then rebuilt 2.6.19.1.
This might upset the decompile below, versus the original report.

Assuming it's NOT a bug in my distro's toolchain (because I am now running the
GNU stuff), it'll crash again, so this is still useful.

Here's a current decompilation of vmlinux/pipe_poll() from the running kernel,
the addresses have changed slightly. There's no xchg there either:

c0156ec0 <pipe_poll>:
c0156ec0: 55 push %ebp
c0156ec1: 89 e5 mov %esp,%ebp
c0156ec3: 83 ec 10 sub $0x10,%esp
c0156ec6: 89 5d f4 mov %ebx,0xfffffff4(%ebp)
c0156ec9: 85 d2 test %edx,%edx
c0156ecb: 89 d3 mov %edx,%ebx
c0156ecd: 89 75 f8 mov %esi,0xfffffff8(%ebp)
c0156ed0: 89 c6 mov %eax,%esi
c0156ed2: 89 7d fc mov %edi,0xfffffffc(%ebp)
c0156ed5: 8b 40 08 mov 0x8(%eax),%eax
c0156ed8: 8b 40 08 mov 0x8(%eax),%eax
c0156edb: 8b b8 f0 00 00 00 mov 0xf0(%eax),%edi
c0156ee1: 74 0c je c0156eef <pipe_poll+0x2f>
c0156ee3: 85 ff test %edi,%edi
c0156ee5: 74 08 je c0156eef <pipe_poll+0x2f>
c0156ee7: 89 d1 mov %edx,%ecx
c0156ee9: 89 f0 mov %esi,%eax
c0156eeb: 89 fa mov %edi,%edx
c0156eed: ff 13 call *(%ebx)
c0156eef: 0f b7 5e 1c movzwl 0x1c(%esi),%ebx
c0156ef3: 31 c9 xor %ecx,%ecx
c0156ef5: 8b 47 08 mov 0x8(%edi),%eax
c0156ef8: f6 c3 01 test $0x1,%bl
c0156efb: 89 45 f0 mov %eax,0xfffffff0(%ebp)
c0156efe: 74 20 je c0156f20 <pipe_poll+0x60>
c0156f00: 85 c0 test %eax,%eax
c0156f02: b8 41 00 00 00 mov $0x41,%eax
c0156f07: 0f 4f c8 cmovg %eax,%ecx
c0156f0a: 8b 87 5c 01 00 00 mov 0x15c(%edi),%eax
c0156f10: 85 c0 test %eax,%eax
c0156f12: 74 43 je c0156f57 <pipe_poll+0x97>
c0156f14: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
c0156f1a: 8d bf 00 00 00 00 lea 0x0(%edi),%edi
c0156f20: f6 c3 02 test $0x2,%bl
c0156f23: 74 23 je c0156f48 <pipe_poll+0x88>
c0156f25: 83 7d f0 0f cmpl $0xf,0xfffffff0(%ebp)
c0156f29: b8 04 01 00 00 mov $0x104,%eax
c0156f2e: ba 00 00 00 00 mov $0x0,%edx
c0156f33: 8b 9f 58 01 00 00 mov 0x158(%edi),%ebx
c0156f39: 0f 4f c2 cmovg %edx,%eax
c0156f3c: 09 c1 or %eax,%ecx
c0156f3e: 89 c8 mov %ecx,%eax
c0156f40: 83 c8 08 or $0x8,%eax
c0156f43: 85 db test %ebx,%ebx
c0156f45: 0f 44 c8 cmove %eax,%ecx
c0156f48: 8b 5d f4 mov 0xfffffff4(%ebp),%ebx
c0156f4b: 89 c8 mov %ecx,%eax
c0156f4d: 8b 75 f8 mov 0xfffffff8(%ebp),%esi
c0156f50: 8b 7d fc mov 0xfffffffc(%ebp),%edi
c0156f53: 89 ec mov %ebp,%esp
c0156f55: 5d pop %ebp
c0156f56: c3 ret
c0156f57: 89 ca mov %ecx,%edx
c0156f59: 8b 46 6c mov 0x6c(%esi),%eax
c0156f5c: 83 ca 10 or $0x10,%edx
c0156f5f: 3b 87 68 01 00 00 cmp 0x168(%edi),%eax
c0156f65: 0f 45 ca cmovne %edx,%ecx
c0156f68: eb b6 jmp c0156f20 <pipe_poll+0x60>
c0156f6a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Zhang, Yanmin

unread,
Dec 27, 2006, 9:41:59 PM12/27/06
to Alistair John Strachan
Could you reproduce the bug by the new kernel, so we could get the exact address
and instruction of the bug?

Alistair John Strachan

unread,
Dec 27, 2006, 11:02:49 PM12/27/06
to Zhang, Yanmin
On Thursday 28 December 2006 02:41, Zhang, Yanmin wrote:
[snip]

> > Here's a current decompilation of vmlinux/pipe_poll() from the running
> > kernel, the addresses have changed slightly. There's no xchg there
> > either:
>
> Could you reproduce the bug by the new kernel, so we could get the exact
> address and instruction of the bug?

It crashed again, but this time with no output (machine locked solid). To be
honest, the disassembly looks right (it's like Chuck said, it's jumping back
half way through an instruction):

c0156f5f: 3b 87 68 01 00 00 cmp 0x168(%edi),%eax

So c0156f60 is 87 68 01 00 00..

This is with the GCC recompile, so it's not a distro problem. It could still
either be GCC 4.x, or a 2.6.19.1 specific bug, but it's serious. 2.6.19 with
GCC 3.4.3 is 100% stable.

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Alistair John Strachan

unread,
Dec 27, 2006, 11:14:28 PM12/27/06
to Zhang, Yanmin
On Thursday 28 December 2006 04:02, Alistair John Strachan wrote:
> On Thursday 28 December 2006 02:41, Zhang, Yanmin wrote:
> [snip]
>
> > > Here's a current decompilation of vmlinux/pipe_poll() from the running
> > > kernel, the addresses have changed slightly. There's no xchg there
> > > either:
> >
> > Could you reproduce the bug by the new kernel, so we could get the exact
> > address and instruction of the bug?
>
> It crashed again, but this time with no output (machine locked solid). To
> be honest, the disassembly looks right (it's like Chuck said, it's jumping
> back half way through an instruction):
>
> c0156f5f: 3b 87 68 01 00 00 cmp 0x168(%edi),%eax
>
> So c0156f60 is 87 68 01 00 00..
>
> This is with the GCC recompile, so it's not a distro problem. It could
> still either be GCC 4.x, or a 2.6.19.1 specific bug, but it's serious.
> 2.6.19 with GCC 3.4.3 is 100% stable.

Looks like a similar crash here:

http://ubuntuforums.org/showthread.php?p=1803389

Alistair John Strachan

unread,
Dec 30, 2006, 12:00:20 PM12/30/06
to Zhang, Yanmin
On Thursday 28 December 2006 04:14, Alistair John Strachan wrote:
> On Thursday 28 December 2006 04:02, Alistair John Strachan wrote:
> > On Thursday 28 December 2006 02:41, Zhang, Yanmin wrote:
> > [snip]
> >
> > > > Here's a current decompilation of vmlinux/pipe_poll() from the
> > > > running kernel, the addresses have changed slightly. There's no xchg
> > > > there either:
> > >
> > > Could you reproduce the bug by the new kernel, so we could get the
> > > exact address and instruction of the bug?
> >
> > It crashed again, but this time with no output (machine locked solid). To
> > be honest, the disassembly looks right (it's like Chuck said, it's
> > jumping back half way through an instruction):
> >
> > c0156f5f: 3b 87 68 01 00 00 cmp 0x168(%edi),%eax
> >
> > So c0156f60 is 87 68 01 00 00..
> >
> > This is with the GCC recompile, so it's not a distro problem. It could
> > still either be GCC 4.x, or a 2.6.19.1 specific bug, but it's serious.
> > 2.6.19 with GCC 3.4.3 is 100% stable.
>
> Looks like a similar crash here:
>
> http://ubuntuforums.org/showthread.php?p=1803389

I've eliminated 2.6.19.1 as the culprit, and also tried toggling "optimize for
size", various debug options. 2.6.19 compiled with GCC 4.1.1 on an Via
Nehemiah C3-2 seems to crash in pipe_poll reliably, within approximately 12
hours.

The machine passes 6 hours of Prime95 (a CPU stability tester), four memtest86
passes, and there are no heat problems.

I have compiled GCC 3.4.6 and compiled 2.6.19 with an identical config using
this compiler (but the same binutils), and will report back if it crashes. My
bet is that it won't, however.

Chuck Ebbert

unread,
Dec 30, 2006, 12:27:57 PM12/30/06
to Alistair John Strachan
In-Reply-To: <200612301659....@sms.ed.ac.uk>

On Sat, 30 Dec 2006 16:59:35 +0000, Alistair John Strachan wrote:

> I've eliminated 2.6.19.1 as the culprit, and also tried toggling "optimize for
> size", various debug options. 2.6.19 compiled with GCC 4.1.1 on an Via
> Nehemiah C3-2 seems to crash in pipe_poll reliably, within approximately 12
> hours.

Which CPU are you compiling for? You should try different options.

Can you post disassembly of pipe_poll() for both the one that crashes
and the one that doesn't? Use 'objdump -D -r fs/pipe.o' so we get the
relocation info and post just the one function from each for now.

--
MBTI: IXTP

James Courtier-Dutton

unread,
Dec 30, 2006, 1:07:12 PM12/30/06
to Chuck Ebbert
Chuck Ebbert wrote:
> In-Reply-To: <200612201421....@sms.ed.ac.uk>
>
> On Wed, 20 Dec 2006 14:21:03 +0000, Alistair John Strachan wrote:
>
>> Any ideas?
>>
>> BUG: unable to handle kernel NULL pointer dereference at virtual address
>> 00000009
>
> 83 ca 10 or $0x10,%edx
> 3b .byte 0x3b
> 87 68 01 xchg %ebp,0x1(%eax) <=====
> 00 00 add %al,(%eax)
>
> Somehow it is trying to execute code in the middle of an instruction.
> That almost never works, even when the resulting fragment is a legal
> opcode. :)
>
> The real instruction is:
>
> 3b 87 68 01 00 00 00 cmp 0x168(%edi),%eax
>
> I'd guess you have some kind of hardware problem. It could also be
> a kernel problem where the saved address was corrupted during an
> interrupt, but that's not likely.

This looks rather strange.
The times I have seen this sort of problem is:
1) when one bit of the kernel is corrupting another part of it.
2) Kernel modules compiled with different gcc than rest of kernel.
3) kernel headers do not match the kernel being used.

One way to start tracking this down would be to run it with the fewest
amount of kernel modules loaded as one can, but still reproduce the problem.

James

Alistair John Strachan

unread,
Dec 30, 2006, 1:29:14 PM12/30/06
to Chuck Ebbert
On Saturday 30 December 2006 17:21, Chuck Ebbert wrote:
> In-Reply-To: <200612301659....@sms.ed.ac.uk>
>
> On Sat, 30 Dec 2006 16:59:35 +0000, Alistair John Strachan wrote:
> > I've eliminated 2.6.19.1 as the culprit, and also tried toggling
> > "optimize for size", various debug options. 2.6.19 compiled with GCC
> > 4.1.1 on an Via Nehemiah C3-2 seems to crash in pipe_poll reliably,
> > within approximately 12 hours.
>
> Which CPU are you compiling for? You should try different options.

I should, I haven't thought of that. Currently it's compiling for
CONFIG_MVIAC3_2, but I could try i686 for example.

> Can you post disassembly of pipe_poll() for both the one that crashes
> and the one that doesn't? Use 'objdump -D -r fs/pipe.o' so we get the
> relocation info and post just the one function from each for now.

Sure, no problem:

http://devzero.co.uk/~alistair/2.6.19-via-c3-pipe_poll/

Both use identical configs, neither are optimised for size. The config is
available from the same location.

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Alistair John Strachan

unread,
Dec 30, 2006, 1:32:33 PM12/30/06
to James Courtier-Dutton
On Saturday 30 December 2006 18:06, James Courtier-Dutton wrote:
> > I'd guess you have some kind of hardware problem. It could also be
> > a kernel problem where the saved address was corrupted during an
> > interrupt, but that's not likely.
>
> This looks rather strange.
[snip]

> 2) Kernel modules compiled with different gcc than rest of kernel.

Previously there was only one GCC version (4.1.1 totally replaced 3.4.3, and
is the system wide GCC), now I have installed 3.4.6 into /opt/gcc-3.4.6 and
it is only PATH'ed explicitly by me when I wish to compile a kernel using it:

export PATH=/opt/gcc-3.4.6/bin:$PATH
cp /boot/config-2.6.19-test .config
make oldconfig
make

> 3) kernel headers do not match the kernel being used.

The tree is a pristine 2.6.19.

> One way to start tracking this down would be to run it with the fewest
> amount of kernel modules loaded as one can, but still reproduce the
> problem.

Crippling the machine, though. Impractical for something that isn't
immediately reproducible.

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Alistair John Strachan

unread,
Dec 31, 2006, 8:47:06 AM12/31/06
to Zhang, Yanmin
On Saturday 30 December 2006 16:59, Alistair John Strachan wrote:
> I have compiled GCC 3.4.6 and compiled 2.6.19 with an identical config
> using this compiler (but the same binutils), and will report back if it
> crashes. My bet is that it won't, however.

Still fine after >24 hours. Linux 2.6.19, GCC 3.4.6, Binutils 2.17.

Adrian Bunk

unread,
Dec 31, 2006, 11:27:53 AM12/31/06
to Alistair John Strachan

There are occasional reports of problems with kernels compiled with
gcc 4.1 that vanish when using older versions of gcc.

AFAIK, until now noone has ever debugged whether that's a gcc bug,
gcc exposing a kernel bug or gcc exposing a hardware bug.

Comparing your report and [1], it seems that if these are the same
problem, it's not a hardware bug but a gcc or kernel bug.

> Cheers,
> Alistair.

cu
Adrian

[1] http://bugzilla.kernel.org/show_bug.cgi?id=7176

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

Adrian Bunk

unread,
Dec 31, 2006, 11:28:50 AM12/31/06
to Alistair John Strachan
On Sat, Dec 30, 2006 at 06:29:15PM +0000, Alistair John Strachan wrote:
> On Saturday 30 December 2006 17:21, Chuck Ebbert wrote:
> > In-Reply-To: <200612301659....@sms.ed.ac.uk>
> >
> > On Sat, 30 Dec 2006 16:59:35 +0000, Alistair John Strachan wrote:
> > > I've eliminated 2.6.19.1 as the culprit, and also tried toggling
> > > "optimize for size", various debug options. 2.6.19 compiled with GCC
> > > 4.1.1 on an Via Nehemiah C3-2 seems to crash in pipe_poll reliably,
> > > within approximately 12 hours.
> >
> > Which CPU are you compiling for? You should try different options.
>
> I should, I haven't thought of that. Currently it's compiling for
> CONFIG_MVIAC3_2, but I could try i686 for example.
>
> > Can you post disassembly of pipe_poll() for both the one that crashes
> > and the one that doesn't? Use 'objdump -D -r fs/pipe.o' so we get the
> > relocation info and post just the one function from each for now.
>
> Sure, no problem:
>
> http://devzero.co.uk/~alistair/2.6.19-via-c3-pipe_poll/
>
> Both use identical configs, neither are optimised for size. The config is
> available from the same location.

Can you try enabling as many debug options as possible?

> Cheers,
> Alistair.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

-

Alistair John Strachan

unread,
Dec 31, 2006, 11:48:45 AM12/31/06
to Adrian Bunk
On Sunday 31 December 2006 16:28, Adrian Bunk wrote:
> On Sat, Dec 30, 2006 at 06:29:15PM +0000, Alistair John Strachan wrote:
> > On Saturday 30 December 2006 17:21, Chuck Ebbert wrote:
> > > In-Reply-To: <200612301659....@sms.ed.ac.uk>
> > >
> > > On Sat, 30 Dec 2006 16:59:35 +0000, Alistair John Strachan wrote:
> > > > I've eliminated 2.6.19.1 as the culprit, and also tried toggling
> > > > "optimize for size", various debug options. 2.6.19 compiled with GCC
> > > > 4.1.1 on an Via Nehemiah C3-2 seems to crash in pipe_poll reliably,
> > > > within approximately 12 hours.
> > >
> > > Which CPU are you compiling for? You should try different options.
> >
> > I should, I haven't thought of that. Currently it's compiling for
> > CONFIG_MVIAC3_2, but I could try i686 for example.
> >
> > > Can you post disassembly of pipe_poll() for both the one that crashes
> > > and the one that doesn't? Use 'objdump -D -r fs/pipe.o' so we get the
> > > relocation info and post just the one function from each for now.
> >
> > Sure, no problem:
> >
> > http://devzero.co.uk/~alistair/2.6.19-via-c3-pipe_poll/
> >
> > Both use identical configs, neither are optimised for size. The config is
> > available from the same location.
>
> Can you try enabling as many debug options as possible?

Specifically what? I've already had:

CONFIG_DETECT_SOFTLOCKUP
CONFIG_FRAME_POINTER
CONFIG_UNWIND_INFO

Enabled. CONFIG_4KSTACKS is disabled. Are there any debugging features
actually pertinent to this bug?

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Alistair John Strachan

unread,
Dec 31, 2006, 11:55:49 AM12/31/06
to Adrian Bunk

This bug specifically indicates some kind of miscompilation in a driver,
causing boot time hangs. My problem is quite different, and more subtle. The
crash happens in the same place every time, which does suggest determinism
(even with various options toggled on and off, and a 300K smaller kernel
image), but it takes 8-12 hours to manifest and only happens with GCC 4.1.1.

Unless we can start narrowing this down, it would be a mammoth task to seek
out either the kernel or GCC change that first exhibited this bug, due to the
non-immediate reproducibility of the bug, the lack of clues, and this
machine's role as a stable, high-availability server.

(If I had another Epia M10000 or another computer I could reproduce the bug
on, I would be only too happy to boot as many kernels as required to fix it;
however I cannot spare this machine).

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Chuck Ebbert

unread,
Dec 31, 2006, 4:48:55 PM12/31/06
to Alistair John Strachan
In-Reply-To: <200612301829....@sms.ed.ac.uk>

On Sat, 30 Dec 2006 18:29:15 +0000, Alistair John Strachan wrote:

> > Can you post disassembly of pipe_poll() for both the one that crashes
> > and the one that doesn't? Use 'objdump -D -r fs/pipe.o' so we get the
> > relocation info and post just the one function from each for now.
>
> Sure, no problem:
>
> http://devzero.co.uk/~alistair/2.6.19-via-c3-pipe_poll/
>
> Both use identical configs, neither are optimised for size. The config is
> available from the same location.

Those were compiled without frame pointers. Can you post them compiled
with frame pointers so they match your original bug report? And confirm
that pipe_poll() is still at 0xc0156ec0 in vmlinux?

--
MBTI: IXTP

Alistair John Strachan

unread,
Dec 31, 2006, 5:16:55 PM12/31/06
to Chuck Ebbert
On Sunday 31 December 2006 21:43, Chuck Ebbert wrote:
> In-Reply-To: <200612301829....@sms.ed.ac.uk>
>
> On Sat, 30 Dec 2006 18:29:15 +0000, Alistair John Strachan wrote:
> > > Can you post disassembly of pipe_poll() for both the one that crashes
> > > and the one that doesn't? Use 'objdump -D -r fs/pipe.o' so we get the
> > > relocation info and post just the one function from each for now.
> >
> > Sure, no problem:
> >
> > http://devzero.co.uk/~alistair/2.6.19-via-c3-pipe_poll/
> >
> > Both use identical configs, neither are optimised for size. The config is
> > available from the same location.
>
> Those were compiled without frame pointers. Can you post them compiled
> with frame pointers so they match your original bug report? And confirm
> that pipe_poll() is still at 0xc0156ec0 in vmlinux?

c0156ec0 <pipe_poll>:

I used the config I original sent you to rebuild it again. This time I've put
up the whole vmlinux for both kernels, the config is replaced, the
decompilation is re-done, I've confirmed the offset in the GCC 4.1.1 kernel
is identical. Sorry for the confusion.

The reason I changed the configs was to experiment with enabling and disabling
debugging (and other such) options that might have shaken out compiler bugs.

However none of these kernels have ever crashed gracefully again, most of them
hang the machine (no nmi watchdog though) so I've not been able to look at
the oops. It's the same root cause, however, as GCC 3.4.6 kernels do not
crash.

http://devzero.co.uk/~alistair/2.6.19-via-c3-pipe_poll/

Happy new year, btw.

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Adrian Bunk

unread,
Jan 2, 2007, 4:11:09 PM1/2/07
to Alistair John Strachan
>...

Sorry if my point goes a bit away from your problem:

My point is that we have several reported problems only visible
with gcc 4.1.

Other bug reports are e.g. [2] and [3], but they are only present with
using gcc 4.1 _and_ using -Os.

There's simply a bunch of bugs only present with gcc 4.1, and what
worries me most is that the estimated number of unknown cases is most
likely very high since most people won't check different compiler
versions when running into a problem.

> Cheers,
> Alistair.

cu
Adrian

[1] http://bugzilla.kernel.org/show_bug.cgi?id=7176
[2] http://bugzilla.kernel.org/show_bug.cgi?id=7106
[3] https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=186852

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

-

Adrian Bunk

unread,
Jan 2, 2007, 4:12:46 PM1/2/07
to Alistair John Strachan

No, that's only an "enable as much as possible and hope one helps" shot
in the dark.

> Cheers,
> Alistair.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

-