Any ideas?
BUG: unable to handle kernel NULL pointer dereference at virtual address
00000009
printing eip:
c0156f60
*pde = 00000000
Oops: 0002 [#1]
Modules linked in: ipt_recent ipt_REJECT xt_tcpudp ipt_MASQUERADE iptable_nat
xt_state iptable_filter ip_tables x_tables prism54 yenta_socket
rsrc_nonstatic pcmcia_core snd_via82xx snd_ac97_codec snd_ac97_bus snd_pcm
snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd soundcore ehci_hcd
usblp eth1394 uhci_hcd usbcore ohci1394 ieee1394 via_agp agpgart vt1211
hwmon_vid hwmon ip_nat_ftp ip_nat ip_conntrack_ftp ip_conntrack
CPU: 0
EIP: 0060:[<c0156f60>] Not tainted VLI
EFLAGS: 00010246 (2.6.19.1 #1)
EIP is at pipe_poll+0xa0/0xb0
eax: 00000008 ebx: 00000000 ecx: 00000008 edx: 00000000
esi: f70f3e9c edi: f7017c00 ebp: f70f3c1c esp: f70f3c0c
ds: 007b es: 007b ss: 0068
Process python (pid: 4178, ti=f70f2000 task=f70c4a90 task.ti=f70f2000)
Stack: 00000000 00000000 f70f3e9c f6e111c0 f70f3fa4 c015d7f3 f70f3c54 f70f3fac
084c44a0 00000030 084c44d0 00000000 f70f3e94 f70f3e94 00000006 f70f3ecc
00000000 f70f3e94 c015e580 00000000 00000000 00000006 f6e111c0 00000000
Call Trace:
[<c015d7f3>] do_sys_poll+0x253/0x480
[<c015da53>] sys_poll+0x33/0x50
[<c0102c97>] syscall_call+0x7/0xb
[<b7f6b402>] 0xb7f6b402
=======================
Code: 58 01 00 00 0f 4f c2 09 c1 89 c8 83 c8 08 85 db 0f 44 c8 8b 5d f4 89 c8
8b 75 f8 8b 7d fc 89 ec 5d c3 89 ca 8b 46 6c 83 ca 10 3b <87> 68 01 00 00 0f
45 ca eb b6 8d b6 00 00 00 00 55 b8 01 00 00
EIP: [<c0156f60>] pipe_poll+0xa0/0xb0 SS:ESP 0068:f70f3c0c
--
Cheers,
Alistair.
Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
Does the problem also happen in 2.6.19?
thanks,
greg k-h
No idea. I ran 2.6.19 for a couple of weeks without problems. It took 2 days
to oops 2.6.19.1, so if it happens again within that time period I guess that
might be indicative of a -stable patch.
--
Cheers,
Alistair.
On Wed, 20 Dec 2006 14:21:03 +0000, Alistair John Strachan wrote:
> Any ideas?
>
> BUG: unable to handle kernel NULL pointer dereference at virtual address
> 00000009
83 ca 10 or $0x10,%edx
3b .byte 0x3b
87 68 01 xchg %ebp,0x1(%eax) <=====
00 00 add %al,(%eax)
Somehow it is trying to execute code in the middle of an instruction.
That almost never works, even when the resulting fragment is a legal
opcode. :)
The real instruction is:
3b 87 68 01 00 00 cmp 0x168(%edi),%eax
I'd guess you have some kind of hardware problem. It could also be
a kernel problem where the saved address was corrupted during an
interrupt, but that's not likely.
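As a cross-check on the decoding above, the stray xchg and the register dump agree with the reported fault. This is only a minimal sketch, assuming the usual i386 page-fault error-code layout (bit 0 = page present, bit 1 = write, bit 2 = user mode); the function names are illustrative and not taken from the kernel:

```python
# Values copied from the oops report.
EAX = 0x00000008
FAULT_ADDRESS = 0x00000009
OOPS_ERROR_CODE = 0x0002


def xchg_effective_address(eax: int, displacement: int = 0x1) -> int:
    """Effective address of the memory operand in `xchg %ebp,0x1(%eax)`."""
    return (eax + displacement) & 0xFFFFFFFF


def decode_page_fault_error(code: int) -> dict:
    """Decode the low three bits of an i386 page-fault error code."""
    return {
        "page_present": bool(code & 0x1),  # False: the page was not mapped
        "write": bool(code & 0x2),         # True: the access was a write
        "user_mode": bool(code & 0x4),     # False: the fault came from kernel mode
    }


address = xchg_effective_address(EAX)
fault = decode_page_fault_error(OOPS_ERROR_CODE)
print(hex(address), fault)
```

Since xchg writes to its memory operand, a kernel-mode write fault at address 9 with %eax = 8 fits the misdecoded instruction.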
--
MBTI: IXTP
Seems pretty unlikely on a 4 year old Via Epia. Never had any problems with it
before now.
Maybe a cosmic ray event? ;-)
--
Cheers,
Alistair.
On Wed, 20 Dec 2006 22:15:50 +0000, Alistair John Strachan wrote:
> > I'd guess you have some kind of hardware problem. It could also be
> > a kernel problem where the saved address was corrupted during an
> > interrupt, but that's not likely.
>
> Seems pretty unlikely on a 4 year old Via Epia. Never had any problems with it
> before now.
>
> Maybe a cosmic ray event? ;-)
The low byte of eip should be 5f and it changed to 60, so that's
probably not it. And the oops report is consistent with that being
the instruction that was really executed, so it's not the kernel
misreporting the address after it happened.
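The single-bit-flip argument can be checked with a few lines of arithmetic. This is just an illustrative sketch; the helper name is made up for the example:

```python
def differing_bits(a: int, b: int) -> int:
    """Count the bit positions in which two values differ."""
    return bin(a ^ b).count("1")


expected_low_byte = 0x5F  # low byte eip should have had
observed_low_byte = 0x60  # low byte of the eip reported in the oops

flipped = differing_bits(expected_low_byte, observed_low_byte)
print(flipped)
```

Going from 0x5f to 0x60 changes six bits at once, which is why a single-bit upset (cosmic ray or alpha particle) does not explain it.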
You weren't trying kprobes or something, were you? Have you ever
had another unexplained oops with this machine?
--
MBTI: IXTP
Nope, it's a stock kernel and it's running on a server, kprobes isn't in use.
And no, to my knowledge there's not been another "unexplained" oops. I've had
crashes, but they've always been known issues or BIOS trouble.
The machine was recently tampered with to install additional HDDs, but the
memory was memtest'ed when it was installed and passed several times without
issue. I'm rather puzzled.
--
Cheers,
Alistair.
More likely a stray alpha particle from a radioactive decay in the actual chip
casing - I saw some research a while back that said that the average commodity
system should *expect* to see 1 or 2 alpha-induced single-bit errors per year,
and the chance that *you* saw the event was directly related to whether the
memory had ECC, and how much of the other circuitry had ECC on it....
Pretty much like clockwork, it happened again. I think it's time to take this
seriously as a software bug, and not some hardware problem. I've run kernels
since 2.6.0 on this machine without such crashes, and now two of the same in
2.6.19.1? Pretty unlikely!
BUG: unable to handle kernel NULL pointer dereference at virtual address
00000009
printing eip:
c0156f60
*pde = 00000000
Oops: 0002 [#1]
Modules linked in: ipt_recent ipt_REJECT xt_tcpudp ipt_MASQUERADE iptable_nat
xt_state iptable_filter ip_tables x_tables prism54 yenta_socket
rsrc_nonstatic pcmcia_core snd_via82xx snd_ac97_codec snd_ac97_bus snd_pcm
snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd soundcore usblp
ehci_hcd eth1394 uhci_hcd usbcore ohci1394 ieee1394 via_agp agpgart vt1211
hwmon_vid hwmon ip_nat_ftp ip_nat ip_conntrack_ftp ip_conntrack
CPU: 0
EIP: 0060:[<c0156f60>] Not tainted VLI
EFLAGS: 00010246 (2.6.19.1 #1)
EIP is at pipe_poll+0xa0/0xb0
eax: 00000008 ebx: 00000000 ecx: 00000008 edx: 00000000
esi: ee1b9e9c edi: f4d80a00 ebp: ee1b9c1c esp: ee1b9c0c
ds: 007b es: 007b ss: 0068
Process java (pid: 5374, ti=ee1b8000 task=f7117560 task.ti=ee1b8000)
Stack: 00000000 00000000 ee1b9e9c f6c17160 ee1b9fa4 c015d7f3 ee1b9c54 ee1b9fac
082dff90 00000010 082dffa0 00000000 ee1b9e94 ee1b9e94 00000002 ee1b9eac
00000000 ee1b9e94 c015e580 00000000 00000000 00000002 f6c17160 00000000
Call Trace:
[<c015d7f3>] do_sys_poll+0x253/0x480
[<c015da53>] sys_poll+0x33/0x50
[<c0102c97>] syscall_call+0x7/0xb
[<b7f26402>] 0xb7f26402
=======================
Code: 58 01 00 00 0f 4f c2 09 c1 89 c8 83 c8 08 85 db 0f 44 c8 8b 5d f4 89 c8
8b 75 f8 8b 7d fc 89 ec 5d c3 89 ca 8b 46 6c 83 ca 10 3b <87> 68 01 00 00 0f
45 ca eb b6 8d b6 00 00 00 00 55 b8 01 00 00
EIP: [<c0156f60>] pipe_poll+0xa0/0xb0 SS:ESP 0068:ee1b9c0c
True, I've considered it, I'll replace the CPU fan.
> Anyway, post your complete .config. And exactly which one of the
> many Via cpus are you using? Are you using the Padlock unit?
No, much older than that:
[alistair] 14:38 [~] cat /proc/cpuinfo
processor : 0
vendor_id : CentaurHauls
cpu family : 6
model : 9
model name : VIA Nehemiah
stepping : 1
cpu MHz : 999.569
cache size : 64 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu de tsc msr cx8 mtrr pge cmov mmx fxsr sse fxsr_opt
bogomips : 2000.02
> What do those java/python programs do that are running? What pipe
> are they polling?
>
> You could try going back to 2.6.18.x for a while in the meantime.
Well, I have had a thought. I recently upgraded the toolchain on the machine
from binutils 2.16.x and GCC 3.4.3 (2.6.19 was built with this) to binutils
2.17 and GCC 4.1.1. It's conceivable that this is some sort of compiler bug.
"87 68 01 00 00" is an xchg instruction, but if I disassemble from the
beginning, I can't see any xchg instruction.
> EIP: [<c0156f60>] pipe_poll+0xa0/0xb0 SS:ESP 0068:ee1b9c0c
>
Unfortunately, after suspecting the toolchain, I did a manual rebuild of
binutils, gcc and glibc from the official sites, and then rebuilt 2.6.19.1.
This might upset the decompile below, versus the original report.
Assuming it's NOT a bug in my distro's toolchain (because I am now running the
GNU stuff), it'll crash again, so this is still useful.
Here's a current decompilation of vmlinux/pipe_poll() from the running kernel,
the addresses have changed slightly. There's no xchg there either:
c0156ec0 <pipe_poll>:
c0156ec0: 55 push %ebp
c0156ec1: 89 e5 mov %esp,%ebp
c0156ec3: 83 ec 10 sub $0x10,%esp
c0156ec6: 89 5d f4 mov %ebx,0xfffffff4(%ebp)
c0156ec9: 85 d2 test %edx,%edx
c0156ecb: 89 d3 mov %edx,%ebx
c0156ecd: 89 75 f8 mov %esi,0xfffffff8(%ebp)
c0156ed0: 89 c6 mov %eax,%esi
c0156ed2: 89 7d fc mov %edi,0xfffffffc(%ebp)
c0156ed5: 8b 40 08 mov 0x8(%eax),%eax
c0156ed8: 8b 40 08 mov 0x8(%eax),%eax
c0156edb: 8b b8 f0 00 00 00 mov 0xf0(%eax),%edi
c0156ee1: 74 0c je c0156eef <pipe_poll+0x2f>
c0156ee3: 85 ff test %edi,%edi
c0156ee5: 74 08 je c0156eef <pipe_poll+0x2f>
c0156ee7: 89 d1 mov %edx,%ecx
c0156ee9: 89 f0 mov %esi,%eax
c0156eeb: 89 fa mov %edi,%edx
c0156eed: ff 13 call *(%ebx)
c0156eef: 0f b7 5e 1c movzwl 0x1c(%esi),%ebx
c0156ef3: 31 c9 xor %ecx,%ecx
c0156ef5: 8b 47 08 mov 0x8(%edi),%eax
c0156ef8: f6 c3 01 test $0x1,%bl
c0156efb: 89 45 f0 mov %eax,0xfffffff0(%ebp)
c0156efe: 74 20 je c0156f20 <pipe_poll+0x60>
c0156f00: 85 c0 test %eax,%eax
c0156f02: b8 41 00 00 00 mov $0x41,%eax
c0156f07: 0f 4f c8 cmovg %eax,%ecx
c0156f0a: 8b 87 5c 01 00 00 mov 0x15c(%edi),%eax
c0156f10: 85 c0 test %eax,%eax
c0156f12: 74 43 je c0156f57 <pipe_poll+0x97>
c0156f14: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
c0156f1a: 8d bf 00 00 00 00 lea 0x0(%edi),%edi
c0156f20: f6 c3 02 test $0x2,%bl
c0156f23: 74 23 je c0156f48 <pipe_poll+0x88>
c0156f25: 83 7d f0 0f cmpl $0xf,0xfffffff0(%ebp)
c0156f29: b8 04 01 00 00 mov $0x104,%eax
c0156f2e: ba 00 00 00 00 mov $0x0,%edx
c0156f33: 8b 9f 58 01 00 00 mov 0x158(%edi),%ebx
c0156f39: 0f 4f c2 cmovg %edx,%eax
c0156f3c: 09 c1 or %eax,%ecx
c0156f3e: 89 c8 mov %ecx,%eax
c0156f40: 83 c8 08 or $0x8,%eax
c0156f43: 85 db test %ebx,%ebx
c0156f45: 0f 44 c8 cmove %eax,%ecx
c0156f48: 8b 5d f4 mov 0xfffffff4(%ebp),%ebx
c0156f4b: 89 c8 mov %ecx,%eax
c0156f4d: 8b 75 f8 mov 0xfffffff8(%ebp),%esi
c0156f50: 8b 7d fc mov 0xfffffffc(%ebp),%edi
c0156f53: 89 ec mov %ebp,%esp
c0156f55: 5d pop %ebp
c0156f56: c3 ret
c0156f57: 89 ca mov %ecx,%edx
c0156f59: 8b 46 6c mov 0x6c(%esi),%eax
c0156f5c: 83 ca 10 or $0x10,%edx
c0156f5f: 3b 87 68 01 00 00 cmp 0x168(%edi),%eax
c0156f65: 0f 45 ca cmovne %edx,%ecx
c0156f68: eb b6 jmp c0156f20 <pipe_poll+0x60>
c0156f6a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
--
Cheers,
Alistair.
It crashed again, but this time with no output (machine locked solid). To be
honest, the disassembly looks right (it's like Chuck said, it's jumping back
half way through an instruction):
c0156f5f: 3b 87 68 01 00 00 cmp 0x168(%edi),%eax
So c0156f60 is 87 68 01 00 00..
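To make the "half way through an instruction" point concrete, here is a small sketch that maps the faulting EIP onto the instruction boundaries from the objdump listing. The table is copied from the tail of pipe_poll above; the helper name is only illustrative:

```python
import bisect

# (start address, length in bytes, text) for the tail of pipe_poll,
# copied from the objdump listing.
INSTRUCTIONS = [
    (0xC0156F57, 2, "mov %ecx,%edx"),
    (0xC0156F59, 3, "mov 0x6c(%esi),%eax"),
    (0xC0156F5C, 3, "or $0x10,%edx"),
    (0xC0156F5F, 6, "cmp 0x168(%edi),%eax"),
    (0xC0156F65, 3, "cmovne %edx,%ecx"),
    (0xC0156F68, 2, "jmp c0156f20"),
]


def locate(eip: int):
    """Return (instruction text, byte offset inside it) for an address."""
    starts = [start for start, _, _ in INSTRUCTIONS]
    index = bisect.bisect_right(starts, eip) - 1
    if index < 0:
        raise ValueError("address is before the listed instructions")
    start, length, text = INSTRUCTIONS[index]
    if eip >= start + length:
        raise ValueError("address is past the listed instructions")
    return text, eip - start


PIPE_POLL = 0xC0156EC0       # start of pipe_poll in vmlinux
eip = PIPE_POLL + 0xA0       # "EIP is at pipe_poll+0xa0"
text, offset = locate(eip)
print(hex(eip), text, offset)
```

An offset of 0 would mean a clean instruction boundary; here the address lands one byte into the six-byte cmp.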
This is with the GCC recompile, so it's not a distro problem. It could still
either be GCC 4.x, or a 2.6.19.1 specific bug, but it's serious. 2.6.19 with
GCC 3.4.3 is 100% stable.
--
Cheers,
Alistair.
Looks like a similar crash here:
http://ubuntuforums.org/showthread.php?p=1803389
I've eliminated 2.6.19.1 as the culprit, and also tried toggling "optimize for
size", various debug options. 2.6.19 compiled with GCC 4.1.1 on a Via
Nehemiah C3-2 seems to crash in pipe_poll reliably, within approximately 12
hours.
The machine passes 6 hours of Prime95 (a CPU stability tester), four memtest86
passes, and there are no heat problems.
I have compiled GCC 3.4.6 and compiled 2.6.19 with an identical config using
this compiler (but the same binutils), and will report back if it crashes. My
bet is that it won't, however.
On Sat, 30 Dec 2006 16:59:35 +0000, Alistair John Strachan wrote:
> I've eliminated 2.6.19.1 as the culprit, and also tried toggling "optimize for
> size", various debug options. 2.6.19 compiled with GCC 4.1.1 on an Via
> Nehemiah C3-2 seems to crash in pipe_poll reliably, within approximately 12
> hours.
Which CPU are you compiling for? You should try different options.
Can you post disassembly of pipe_poll() for both the one that crashes
and the one that doesn't? Use 'objdump -D -r fs/pipe.o' so we get the
relocation info and post just the one function from each for now.
--
MBTI: IXTP
This looks rather strange.
The times I have seen this sort of problem are:
1) when one bit of the kernel is corrupting another part of it.
2) Kernel modules compiled with different gcc than rest of kernel.
3) kernel headers do not match the kernel being used.
One way to start tracking this down would be to run with as few kernel
modules loaded as possible while still reproducing the problem.
James
I should, I haven't thought of that. Currently it's compiling for
CONFIG_MVIAC3_2, but I could try i686 for example.
> Can you post disassembly of pipe_poll() for both the one that crashes
> and the one that doesn't? Use 'objdump -D -r fs/pipe.o' so we get the
> relocation info and post just the one function from each for now.
Sure, no problem:
http://devzero.co.uk/~alistair/2.6.19-via-c3-pipe_poll/
Both use identical configs, neither are optimised for size. The config is
available from the same location.
--
Cheers,
Alistair.
> 2) Kernel modules compiled with different gcc than rest of kernel.
Previously there was only one GCC version (4.1.1 totally replaced 3.4.3, and
is the system wide GCC), now I have installed 3.4.6 into /opt/gcc-3.4.6 and
it is only PATH'ed explicitly by me when I wish to compile a kernel using it:
export PATH=/opt/gcc-3.4.6/bin:$PATH
cp /boot/config-2.6.19-test .config
make oldconfig
make
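When switching compilers through PATH like this, it can help to confirm which GCC actually built the kernel that is running, for instance from the banner in /proc/version. A minimal sketch, assuming the older "gcc version X.Y.Z" wording of that banner; the sample string is hypothetical and not output from this machine:

```python
import re
from typing import Optional


def gcc_version_from_banner(banner: str) -> Optional[str]:
    """Extract the GCC version from a /proc/version style banner, if present."""
    match = re.search(r"gcc version (\d+(?:\.\d+)*)", banner)
    return match.group(1) if match else None


# Hypothetical banner, shaped like the contents of /proc/version.
sample = (
    "Linux version 2.6.19 (user@buildhost) (gcc version 3.4.6) "
    "#1 Sat Dec 30 12:00:00 GMT 2006"
)
print(gcc_version_from_banner(sample))
```

On the machine itself the banner would come from reading /proc/version rather than from a literal string.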
> 3) kernel headers do not match the kernel being used.
The tree is a pristine 2.6.19.
> One way to start tracking this down would be to run it with the fewest
> amount of kernel modules loaded as one can, but still reproduce the
> problem.
Crippling the machine, though. Impractical for something that isn't
immediately reproducible.
--
Cheers,
Alistair.
Still fine after >24 hours. Linux 2.6.19, GCC 3.4.6, Binutils 2.17.
There are occasional reports of problems with kernels compiled with
gcc 4.1 that vanish when using older versions of gcc.
AFAIK, until now no one has ever debugged whether that's a gcc bug,
gcc exposing a kernel bug or gcc exposing a hardware bug.
Comparing your report and [1], it seems that if these are the same
problem, it's not a hardware bug but a gcc or kernel bug.
> Cheers,
> Alistair.
cu
Adrian
[1] http://bugzilla.kernel.org/show_bug.cgi?id=7176
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
Can you try enabling as many debug options as possible?
> Cheers,
> Alistair.
cu
Adrian
Specifically what? I've already had:
CONFIG_DETECT_SOFTLOCKUP
CONFIG_FRAME_POINTER
CONFIG_UNWIND_INFO
Enabled. CONFIG_4KSTACKS is disabled. Are there any debugging features
actually pertinent to this bug?
--
Cheers,
Alistair.
This bug specifically indicates some kind of miscompilation in a driver,
causing boot time hangs. My problem is quite different, and more subtle. The
crash happens in the same place every time, which does suggest determinism
(even with various options toggled on and off, and a 300K smaller kernel
image), but it takes 8-12 hours to manifest and only happens with GCC 4.1.1.
Unless we can start narrowing this down, it would be a mammoth task to seek
out either the kernel or GCC change that first exhibited this bug, due to the
non-immediate reproducibility of the bug, the lack of clues, and this
machine's role as a stable, high-availability server.
(If I had another Epia M10000 or another computer I could reproduce the bug
on, I would be only too happy to boot as many kernels as required to fix it;
however I cannot spare this machine).
--
Cheers,
Alistair.
On Sat, 30 Dec 2006 18:29:15 +0000, Alistair John Strachan wrote:
> > Can you post disassembly of pipe_poll() for both the one that crashes
> > and the one that doesn't? Use 'objdump -D -r fs/pipe.o' so we get the
> > relocation info and post just the one function from each for now.
>
> Sure, no problem:
>
> http://devzero.co.uk/~alistair/2.6.19-via-c3-pipe_poll/
>
> Both use identical configs, neither are optimised for size. The config is
> available from the same location.
Those were compiled without frame pointers. Can you post them compiled
with frame pointers so they match your original bug report? And confirm
that pipe_poll() is still at 0xc0156ec0 in vmlinux?
--
MBTI: IXTP
c0156ec0 <pipe_poll>:
I used the config I originally sent you to rebuild it again. This time I've put
up the whole vmlinux for both kernels, the config is replaced, the
decompilation is re-done, I've confirmed the offset in the GCC 4.1.1 kernel
is identical. Sorry for the confusion.
The reason I changed the configs was to experiment with enabling and disabling
debugging (and other such) options that might have shaken out compiler bugs.
However none of these kernels have ever crashed gracefully again, most of them
hang the machine (no nmi watchdog though) so I've not been able to look at
the oops. It's the same root cause, however, as GCC 3.4.6 kernels do not
crash.
http://devzero.co.uk/~alistair/2.6.19-via-c3-pipe_poll/
Happy new year, btw.
--
Cheers,
Alistair.
Sorry if my point goes a bit away from your problem:
My point is that we have several reported problems only visible
with gcc 4.1.
Other bug reports are e.g. [2] and [3], but they are only present with
using gcc 4.1 _and_ using -Os.
There's simply a bunch of bugs only present with gcc 4.1, and what
worries me most is that the estimated number of unknown cases is most
likely very high since most people won't check different compiler
versions when running into a problem.
> Cheers,
> Alistair.
cu
Adrian
[1] http://bugzilla.kernel.org/show_bug.cgi?id=7176
[2] http://bugzilla.kernel.org/show_bug.cgi?id=7106
[3] https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=186852
No, that's only an "enable as much as possible and hope one helps" shot
in the dark.
> Cheers,
> Alistair.
cu
Adrian
I find [2] most compelling, and I can confirm that I do have the same problem
with or without optimisation for size. I don't use selinux nor has it ever
been enabled.
At any rate, I have absolute confirmation that it is GCC 4.1.1, because with
GCC 3.4.6 the same kernel I reported booting three days ago is still
cheerfully working. I regularly get uptimes of 60+ days on that machine,
rebooting only for kernel upgrades. 2.6.19 seems to be no worse in this
regard.
Perhaps fortunately, the configs I've tried have consistently failed to shake
the crash, so I have a semi-reproducible test case here on C3-2 hardware if
somebody wants to investigate the problem (though it still takes 6-12 hours).
--
Cheers,
Alistair.
On Tue, 2 Jan 2007, Adrian Bunk wrote:
>
> My point is that we have several reported problems only visible
> with gcc 4.1.
>
> Other bug reports are e.g. [2] and [3], but they are only present with
> using gcc 4.1 _and_ using -Os.
Traditionally, afaik, -Os has tended to show compiler problems that
_could_ happen with -O2 too, but never do in practice. It may be that
gcc-4.1 without -Os miscompiles some very unusual code, and then with -Os
we just hit more cases of that.
That said, I think gcc-4.1.1 is very common - I know it's the Fedora
compiler. Also, CC_OPTIMIZE_FOR_SIZE defaults to 'y' if you have
EXPERIMENTAL on, and from all the bug-reports about other features that
are marked EXPERIMENTAL, I know that a lot of people do seem to select for
it. So I would expect that gcc-4.1.1 and -Os is actually a fairly common
combination. I just checked, and it's what I use personally, for example.
Of course, my main machine is an x86-64, and it has more registers. At
least some historical -Os bug was about bad things happening under
register pressure, iirc, and so x86-64 would show fewer problems than
regular 32-bit x86 (which has far fewer registers for the compiler to
use).
It is a bit worrisome. These things seem to be about 50:50 real kernel
bugs (just hidden by some common code generation sequence) and real
honest-to-goodness compiler bugs. But they are hard as hell to find.
Linus
The GCC code generator appears to have been rewritten between 3.4.6 and
4.1.1....
I took a look at the dump he posted and there are some minor and some massive
differences between the code. In one case some of the code is swapped, in
another there is code in the 3.4.6 version that isn't in the 4.1.1... Finally
the 4.1.1 version of the function has what appears to be function calls and
these don't appear in the code generated by 3.4.6
In other words - the code generation for 4.1.1 appears to be broken when it
comes to generating system code.
DRH
On Tue, 2 Jan 2007, Alistair John Strachan wrote:
>
> At any rate, I have absolute confirmation that it is GCC 4.1.1, because with
> GCC 3.4.6 the same kernel I reported booting three days ago is still
> cheerfully working. I regularly get uptimes of 60+ days on that machine,
> rebooting only for kernel upgrades. 2.6.19 seems to be no worse in this
> regard.
>
> Perhaps fortunately, the configs I've tried have consistently failed to shake
> the crash, so I have a semi-reproducible test case here on C3-2 hardware if
> somebody wants to investigate the problem (though it still takes 6-12 hours).
Historically, some people have actually used horrible hacks like trying to
figure out which particular C file gets miscompiled by basically having
both compilers installed, and then trying out different subdirectories
with different compilers. And once the subdirectory has been pinpointed,
pinpointing which particular file it is.. etc.
Pretty damn horrible to do, and I'm afraid we don't have any real helpful
scripts to do any of the work for you. So it's all effectively manual
(basically boils down to: "compile everything with known-good compiler.
Then replace the good compiler with the bad one, remove the object files
from one directory, and recompile the kernel". "Rinse and repeat").
I don't think anybody has ever done that with something where triggering
the cause then also takes that long - that just ends up making the whole
thing even more painful.
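The manual procedure described above goes one directory at a time. A halving variant of the same idea can be sketched as follows. This is a different strategy from the one described, and it assumes exactly one directory is miscompiled and that a crash shows up reliably within the observation window. The `crashes` callback is hypothetical: it stands for "rebuild with the suspect compiler used only for these directories, boot, wait, and report whether it oopsed".

```python
from typing import Callable, Sequence, Set


def find_miscompiled_dir(
    dirs: Sequence[str],
    crashes: Callable[[Set[str]], bool],
) -> str:
    """Halve the candidate directories until one suspect remains."""
    if not dirs:
        raise ValueError("need at least one directory")
    candidates = list(dirs)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Build only `half` with the suspect compiler, the rest with the good one.
        if crashes(set(half)):
            candidates = half
        else:
            candidates = candidates[len(candidates) // 2:]
    return candidates[0]


# Simulated run: pretend the problem is somewhere under fs/.
top_level = ["arch", "drivers", "fs", "kernel", "mm", "net"]
culprit = find_miscompiled_dir(top_level, lambda suspect: "fs" in suspect)
print(culprit)
```

With a crash that takes up to 12 hours to appear, each call to `crashes` costs most of a day, so halving keeps the number of boots roughly logarithmic in the number of directories.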
What are the exact crash details? That might narrow things down enough
that maybe you could try just one or two files that are "suspect".
Linus
> Traditionally, afaik, -Os has tended to show compiler problems that
> _could_ happen with -O2 too, but never do in practice. It may be that
> gcc-4.1 without -Os miscompiles some very unusual code, and then with -Os
> we just hit more cases of that.
>
gcc optimizations were almost completely rewritten between 3.4.6 and 4.1,
and one of the subtle changes that may have been introduced is with regard
to the heuristics used to determine whether to inline an 'inline' function
or not when using -Os. This problem can show up in dynamic linking and
break on certain architectures but should be detectable by using -Winline.
David
On Tuesday 02 January 2007 22:13, Linus Torvalds wrote:
[snip]
> What are the exact crash details? That might narrow things down enough
> that maybe you could try just one or two files that are "suspect".
I'll do a digest of the problem for you and anybody else that's lost track of
the debugging story so far..
There are no hardware problems evidenced by any testing I have performed
(memtest, prime95 CPU torture tests, temp monitors). Furthermore, kernels
compiled with older GCCs have been running without problems for literally
years on this machine.
Here is an example of an oops. The kernel continued to limp along after this.
BUG: unable to handle kernel NULL pointer dereference at virtual address
00000009
printing eip:
c0156f60
*pde = 00000000
Oops: 0002 [#1]
Modules linked in: ipt_recent ipt_REJECT xt_tcpudp ipt_MASQUERADE iptable_nat
xt_state iptable_filter ip_tables x_tables prism54 yenta_socket
rsrc_nonstatic pcmcia_core snd_via82xx snd_ac97_codec snd_ac97_bus snd_pcm
snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd soundcore ehci_hcd
usblp eth1394 uhci_hcd usbcore ohci1394 ieee1394 via_agp agpgart vt1211
hwmon_vid hwmon ip_nat_ftp ip_nat ip_conntrack_ftp ip_conntrack
CPU: 0
EIP: 0060:[<c0156f60>] Not tainted VLI
EFLAGS: 00010246 (2.6.19.1 #1)
EIP is at pipe_poll+0xa0/0xb0
eax: 00000008 ebx: 00000000 ecx: 00000008 edx: 00000000
esi: f70f3e9c edi: f7017c00 ebp: f70f3c1c esp: f70f3c0c
ds: 007b es: 007b ss: 0068
Process python (pid: 4178, ti=f70f2000 task=f70c4a90 task.ti=f70f2000)
Stack: 00000000 00000000 f70f3e9c f6e111c0 f70f3fa4 c015d7f3 f70f3c54 f70f3fac
084c44a0 00000030 084c44d0 00000000 f70f3e94 f70f3e94 00000006 f70f3ecc
00000000 f70f3e94 c015e580 00000000 00000000 00000006 f6e111c0 00000000
Call Trace:
[<c015d7f3>] do_sys_poll+0x253/0x480
[<c015da53>] sys_poll+0x33/0x50
[<c0102c97>] syscall_call+0x7/0xb
[<b7f6b402>] 0xb7f6b402
=======================
Code: 58 01 00 00 0f 4f c2 09 c1 89 c8 83 c8 08 85 db 0f 44 c8 8b 5d f4 89 c8
8b 75 f8 8b 7d fc 89 ec 5d c3 89 ca 8b 46 6c 83 ca 10 3b <87> 68 01 00 00 0f
45 ca eb b6 8d b6 00 00 00 00 55 b8 01 00 00
EIP: [<c0156f60>] pipe_poll+0xa0/0xb0 SS:ESP 0068:f70f3c0c
Chuck observed that the kernel tries to reenter pipe_poll half way through an
instruction (c0156f5f->c0156f60); it's not a single-bit error but an
off-by-one.
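For anyone re-checking the dump: in the "Code:" line the byte in angle brackets is the one at EIP, so the listing can be lined up against the disassembly. A minimal sketch with an illustrative helper name:

```python
CODE_LINE = (
    "58 01 00 00 0f 4f c2 09 c1 89 c8 83 c8 08 85 db 0f 44 c8 8b 5d f4 89 c8 "
    "8b 75 f8 8b 7d fc 89 ec 5d c3 89 ca 8b 46 6c 83 ca 10 3b <87> 68 01 00 00 0f "
    "45 ca eb b6 8d b6 00 00 00 00 55 b8 01 00 00"
)
EIP = 0xC0156F60


def parse_code_line(line: str):
    """Return (raw bytes, index of the byte marked with <..>)."""
    tokens = line.split()
    marked = [i for i, tok in enumerate(tokens) if tok.startswith("<") and tok.endswith(">")]
    if len(marked) != 1:
        raise ValueError("expected exactly one <..> marked byte")
    data = bytes(int(tok.strip("<>"), 16) for tok in tokens)
    return data, marked[0]


data, eip_index = parse_code_line(CODE_LINE)
first_byte_address = EIP - eip_index  # address of the first byte in the dump
print(eip_index, hex(first_byte_address), data[eip_index - 1:eip_index + 5].hex())
```

The byte just before the marked one is 3b, the opcode of the cmp, and the five bytes from EIP onwards are the "87 68 01 00 00" that decode as the bogus xchg.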
On Wednesday 20 December 2006 20:48, Chuck Ebbert wrote:
> In-Reply-To: <200612201421....@sms.ed.ac.uk>
>
> On Wed, 20 Dec 2006 14:21:03 +0000, Alistair John Strachan wrote:
> > Any ideas?
> >
> > BUG: unable to handle kernel NULL pointer dereference at virtual address
> > 00000009
>
> 83 ca 10 or $0x10,%edx
> 3b .byte 0x3b
> 87 68 01 xchg %ebp,0x1(%eax) <=====
> 00 00 add %al,(%eax)
>
> Somehow it is trying to execute code in the middle of an instruction.
> That almost never works, even when the resulting fragment is a legal
> opcode. :)
>
> The real instruction is:
>
> 3b 87 68 01 00 00 cmp 0x168(%edi),%eax
I've tried a multitude of kernel configs and compiler options, but none have
made any difference. That first oops was pretty lucky, very often the machine
locks up after oopsing (panic_on_oops=1 doesn't work). I've not seen oopses
anywhere but in pipe_poll, but I've not seen many oopses.
The machine runs jabberd 2.x which uses separate python processes as
transports to different networks. The server hosts 50-100 users. One of my
oops reports had Java crashing in the same place, that's Azureus.
I've got binutils 2.17, gcc 4.1.1 hand bootstrapped from GNU sources (not
distro versions). I've got another, secondary compiler (3.4.6), also compiled
from GNU sources, installed elsewhere which I have used to build working
kernels. So the only variable, for sure, is GCC itself.
Both compilers were built with "make bootstrap" and I built binutils with the
resulting GCC, and GCC with the resulting binutils, just to be sure. The only
slightly non-standard thing I do is to compile everything (GCC, binutils, the
kernels) on a dual-opteron box, inside a 32bit chroot, which is rsync'ed over
to the Via C3-2 box with the problem. I can't see how this would cause any
problems (and indeed have done it successfully for years), but I thought I'd
point it out.
The crashes take time to appear, which is why so many people suspected
hardware initially. But the uptime of a GCC 4.1.1 kernel will always be less
than 12 hours, where a 3.4.6 kernel will survive for months. I've had no
other mysterious software crashes, ever.
On Sunday 31 December 2006 22:16, Alistair John Strachan wrote:
> On Sunday 31 December 2006 21:43, Chuck Ebbert wrote:
> > Those were compiled without frame pointers. Can you post them compiled
> > with frame pointers so they match your original bug report? And confirm
> > that pipe_poll() is still at 0xc0156ec0 in vmlinux?
>
> c0156ec0 <pipe_poll>:
>
> I used the config I original sent you to rebuild it again. This time I've
> put up the whole vmlinux for both kernels, the config is replaced, the
> decompilation is re-done, I've confirmed the offset in the GCC 4.1.1 kernel
> is identical. Sorry for the confusion.
[snip]
> http://devzero.co.uk/~alistair/2.6.19-via-c3-pipe_poll/
At the above URL can be found vmlinux images, the config used to build both,
and decompilations of the fs/pipe.o file (with relocation information).
The suggestions I've had so far which I have not yet tried:
- Select a different x86 CPU in the config.
- Unfortunately the C3-2 flags seem to simply tell GCC
to schedule for ppro (like i686) and enabled MMX and SSE
- Probably useless
- Enable as many debug options as possible ("a shot in the dark")
- Try compiling a minimal kernel config, sans modules that are not required
for booting. The problem with this one (whilst it might uncover some bizarre
memory scribbling or stack corruption) is that the machine's primary role is
that of a router, so I require most of the modules loaded for the oops to be
reproduced (chicken, egg?).
If I can provide any more information, please do let me know.
--
Cheers,
Alistair.
Differences are expected since we disable unit-at-a-time for gcc < 4
and gcc development didn't stall between 3.4 and 4.1.
> In other words - the code generation for 4.1.1 appears to be broken when it
> comes to generating system code.
Do you have a bug number in the gcc Bugzilla, either already open or filed
by you, for what you claim to be a bug in gcc?
> DRH
cu
Adrian
Okay. Thing is that these noted differences, aside from where 4.1.1 doesn't
generate an opcode that 3.4.6 does, aren't all that fatal, IMHO. The fact that
it does generate calls rather than jumps for local pointer moves
(IIRC - been a while since I looked at the dump of pipe_poll that he
provided) might be part of the problem.
> > In other words - the code generation for 4.1.1 appears to be broken when
> > it comes to generating system code.
>
> Bug number for an either already open or created by you bug in the gcc
> Bugzilla for what you claim to be a bug in gcc?
None. I didn't file a report on this because I didn't find the bug, just noted
a problem that appears to occur. In this case the calls generated seem to
wrap loops - something I've never heard of anyone doing. These *might* be
causing the off-by-one that is causing the function to re-enter in the middle
of an instruction.
Seeing this I'd guess that this follows for all system-level code generated by
4.1.1 and this is exactly what I was reporting. If you'd like I'll go dig up
the dumps he posted and post the two related segments side-by-side to give
you a better example what I'm referring to.
DRH
On Tue, 2 Jan 2007, Alistair John Strachan wrote:
>
> eax: 00000008 ebx: 00000000 ecx: 00000008 edx: 00000000
> esi: f70f3e9c edi: f7017c00 ebp: f70f3c1c esp: f70f3c0c
>
> Code: 58 01 00 00 0f 4f c2 09 c1 89 c8 83 c8 08 85 db 0f 44 c8 8b 5d f4 89 c8
> 8b 75 f8 8b 7d fc 89 ec 5d c3 89 ca 8b 46 6c 83 ca 10 3b <87> 68 01 00 00 0f
> 45 ca eb b6 8d b6 00 00 00 00 55 b8 01 00 00
> EIP: [<c0156f60>] pipe_poll+0xa0/0xb0 SS:ESP 0068:f70f3c0c
>
> Chuck observed that the kernel tries to reenter pipe_poll half way through an
> instruction (c0156f5f->c0156f60); it's not a single-bit error but an
> off-by-one.
It's not an off-by-one either (e.g. say we're taking an exception and
screwing up %eip by one somehow).
The code sequence in question is
    mov    %ecx,%edx
    mov    0x6c(%esi),%eax
    or     $0x10,%edx
    cmp    0x168(%edi),%eax    <--
    cmovne %edx,%ecx
    jmp    ...
and it's in the second byte of the "cmp".
And yes, it definitely entered there, because trying other random
entry-points will have either invalid instructions or instructions that
would fault due to NULL pointers. HOWEVER, it's also not as simple as
"took an interrupt, and returned with %eip incremented by one", because
your %edx is zero, so it won't have done that "or $0x10,%edx" and then some
interrupt happened and screwed up just %eip.
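As a rough cross-check of the mid-instruction entry, the bytes from the
marked <87> onward can be decoded by hand. The sketch below is only an
illustration (the struct and helper names are made up for this note, they
are not kernel code): it splits a ModRM byte and computes the effective
address for the "base register + disp8" form from the register values in
the oops. Read from the marked byte, "87 68 01" decodes as an xchg between
%ebp and 1(%eax).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper types for illustration only. */
struct modrm {
	unsigned mod;	/* addressing mode, bits 7..6 */
	unsigned reg;	/* register operand, bits 5..3 */
	unsigned rm;	/* base register, bits 2..0 */
};

static struct modrm split_modrm(uint8_t b)
{
	struct modrm m = { (b >> 6) & 3u, (b >> 3) & 7u, b & 7u };
	return m;
}

/*
 * Effective address for mod == 1 (base register + sign-extended disp8).
 * regs[] is indexed by x86 register number:
 * 0=eax 1=ecx 2=edx 3=ebx 4=esp 5=ebp 6=esi 7=edi.
 * rm == 4 (a SIB byte follows) is not handled in this sketch.
 */
static uint32_t ea_disp8(const uint32_t regs[8], struct modrm m, int8_t disp)
{
	assert(m.mod == 1 && m.rm != 4);
	return regs[m.rm] + (uint32_t)(int32_t)disp;
}
```

With the oops values (ModRM 0x68, disp8 0x01, %eax = 00000008) this gives
reg = 5 (%ebp), base = %eax and an effective address of 0x9. That matches
the reported fault address 00000009, and an xchg with memory is a write,
which is consistent with the write bit set in "Oops: 0002".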
So it's literally a random %eip, but since you say it's consistently in
that function, it's not truly "random". There's something that triggers it
just _there_.
However, that's a damn simple function. There's _nothing_ there. The
particular code that is involved right there is literally
    if (!pipe->writers && filp->f_version != pipe->w_counter)
        mask |= POLLHUP;
and that's it. There's not even anything half-way interesting around it,
except for the "poll_wait()" call, but even that is about as common as
you can humanly get..
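For reference, the mov/or/cmp/cmovne sequence quoted earlier looks like a
branch-free form of that if statement: the candidate value mask | POLLHUP
is computed into a scratch register, and cmovne keeps it only when the two
counters differ. A minimal sketch of the equivalence follows. The structs
sketch_pipe and sketch_file are hypothetical stand-ins that model only the
fields used here, not the kernel's definitions, and the !pipe->writers test
is assumed to be handled by an ordinary branch before the cmov sequence.

```c
#include <assert.h>
#include <poll.h>

/* Hypothetical stand-ins; only the fields used here are modelled. */
struct sketch_pipe { unsigned int writers; unsigned int w_counter; };
struct sketch_file { unsigned long f_version; };

/* The C source form, as quoted in the mail. */
static unsigned int hup_branchy(unsigned int mask,
				const struct sketch_file *filp,
				const struct sketch_pipe *pipe)
{
	if (!pipe->writers && filp->f_version != pipe->w_counter)
		mask |= POLLHUP;
	return mask;
}

/* Roughly what the generated sequence does once writers == 0 is known. */
static unsigned int hup_cmov_style(unsigned int mask,
				   const struct sketch_file *filp,
				   const struct sketch_pipe *pipe)
{
	unsigned int with_hup = mask | POLLHUP;	/* mov %ecx,%edx; or $0x10,%edx */

	if (pipe->writers)
		return mask;
	/* cmp 0x168(%edi),%eax; cmovne %edx,%ecx */
	return (filp->f_version != pipe->w_counter) ? with_hup : mask;
}

/* Small self-check that both forms agree on one input. */
static void hup_selfcheck(void)
{
	struct sketch_pipe p = { 0, 7 };
	struct sketch_file f = { 3 };

	assert(hup_branchy(0x8, &f, &p) == hup_cmov_style(0x8, &f, &p));
}
```

The point of the sketch is only that the "or" result is computed
unconditionally and then either kept or discarded, so a misbehaving cmov
would show up as a wrong mask rather than a wrong jump target.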
Looking at the register set and the stack, I see:
Stack: 00000000
00000000 <- saved %ebx (dunno, seems dead in caller)
f70f3e9c <- saved %esi (== pollfd in do_pollfd)
f6e111c0 <- saved %edi (== filp)
f70f3fa4 <- outer EBP (looks reasonable)
c015d7f3 <- return address (do_sys_poll+0x253/0x480)
and the strange thing is that when the oops happens, it really looks like
%esi _still_ contains the value it had originally (and that is saved on
the stack). But afaik, from your disassembly, it should have been
overwritten by the initial %eax, which should have had the same value as
%edi on entry...
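The annotation of the stack dump follows from the frame layout. With
%esp = f70f3c0c and %ebp = f70f3c1c, the i-th 32-bit word of the dump sits
at %esp + 4*i, and the epilogue bytes in the Code: line (8b 5d f4,
8b 75 f8, 8b 7d fc) reload %ebx, %esi and %edi from -0xc, -8 and -4
relative to %ebp. A small sketch of that arithmetic (the function name is
made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Offset of the i-th dumped stack word relative to the frame pointer.
 * Illustrative helper only; assumes 32-bit stack words, as in the oops.
 */
static int32_t slot_offset_from_ebp(uint32_t esp, uint32_t ebp, unsigned i)
{
	assert(ebp >= esp);
	return (int32_t)(4u * i) - (int32_t)(ebp - esp);
}
```

For the values in this oops, words 1, 2 and 3 come out at -12, -8 and -4
(the saved %ebx, %esi and %edi slots), word 4 at 0 (the saved outer %ebp)
and word 5 at +4 (the return address into do_sys_poll).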
IOW, none of it really makes any sense. The stack frames look fine, so we
_did_ enter at the beginning of the function (and it wasn't the *poll fn
pointer that was corrupt).
> The suggestions I've had so far which I have not yet tried:
>
> - Select a different x86 CPU in the config.
> - Unfortunately the C3-2 flags seem to simply tell GCC
> to schedule for ppro (like i686) and enabled MMX and SSE
> - Probably useless
Actually, try this one. Try using something that doesn't like "cmov".
Maybe the C3-2 simply has some internal cmov bugginess.
Linus
[...]
> None. I didn't file a report on this because I didn't find the bug, just
> noted a problem that appears to occur. In this case the calls generated
> seem to wrap loops - something I've never heard of anyone doing.
Example code showing this weirdness?
> These
> *might* be causing the off-by-one that is causing the function to
> re-enter in the middle of an instruction.
If something like this happened, programs would be crashing left and right.
> Seeing this I'd guess that this follows for all system-level code
> generated by 4.1.1
Define "system-level code". What makes it different from, say,
bog-of-the-mill compiler code (yes, gcc compiles itself as part of its
sanity checking)?
> and this is exactly what I was reporting. If you'd
> like I'll go dig up the dumps he posted and post the two related segments
> side-by-side to give you a better example what I'm referring to.
If the related segments show code that is somehow wrong, by all means
report it /with your detailed analysis/ to the compiler people. Just a
warning, gcc is pretty smart in what it does, its code is often surprising
to the unwashed. Also, the C standard is subtle; the error might be in an
unwarranted assumption in the source code.
Or just C3 (not C3-2), which is what I've done.
I'll report back whether it crashes or not.
--
Cheers,
Alistair.
That's a good suggestion. Earlier C3s didn't have cmov so it's
not entirely unlikely that cmov in C3-2 is broken in some cases.
Configuring for P5MMX or 486 should be good safe alternatives.
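Whether a CPU advertises cmov at all can be read from CPUID: leaf 1
reports CMOV in bit 15 of EDX. A minimal sketch of the bit test is below.
The macro and function names are made up for illustration, and obtaining
the EDX value (for example by executing the cpuid instruction) is left to
the caller.

```c
#include <assert.h>
#include <stdint.h>

#define CPUID1_EDX_CMOV (1u << 15)	/* CPUID leaf 1, EDX bit 15 */

/* Returns nonzero if the given leaf-1 EDX value has the CMOV bit set. */
static int edx_reports_cmov(uint32_t cpuid1_edx)
{
	return (cpuid1_edx & CPUID1_EDX_CMOV) != 0;
}

/* Usage example with synthetic EDX values. */
static void cmov_bit_example(void)
{
	assert(edx_reports_cmov(0x00008000u));
	assert(!edx_reports_cmov(0x00000000u));
}
```

This only says what the CPU reports. It says nothing about whether the
implementation behaves correctly, which is the open question here.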
/Mikael
Agreed! When I developed the cmov emulator, I used an early C3 for the
tests (well, a "Samuel2" to be precise), because it did not report "cmov"
in i