
Via KT133 pci corruption: stock 2.4.18pre2 oopses as well


Ville Herva

Jan 9, 2002, 10:26:04 AM
On Wed, Jan 09, 2002 at 02:45:49PM +0200, you [Ville Herva] claimed:
>
> We also got the oops with 2.2.20+patches, so this is not a pre2 thing.
> Rather, the difference is that we now ran ping -f in the background.
>
> The bad news is that all the BIOS configurations we thought stable
> (that had run the hpt370 read/write test without a hitch for days) now give
> oopses and corruption pretty quickly when we run ping -f in the background :(.
>
> Also, ping -f shows "...EEE.EE.EEE.." which I gather means the packets get
> corrupted somewhere.

It also happens with _pristine_ 2.4.18pre2. I ran

cat /dev/hde > /dev/null& cat /dev/hdg > /dev/null& ping -f -s 64000 box2

in single user mode. (hde and hdg are Samsung 80GB disks on HPT370, eth0 is
a 3c905). After just a few seconds I got the following oops:

Unable to handle kernel paging request at virtual address 1d292ee9
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0131ce0>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010203
eax: 00000000 ebx: 1d292ed1 ecx: 000001d0 edx: 00000000
esi: 00000000 edi: cf12e940 ebp: c10acb80 esp: c1433f0c
ds: 0018 es: 0018 ss: 0018
Process kswapd (pid: 5, stackpage=c1433000)
Stack: c10acb80 cf12ef40 cf12e940 c0131e37 cf12e940 c10acb80 000001d0 00000017
00000200 c013066a c10acb80 000001d0 00000000 c10acb80 c01282f7 c10acb80
000001d0 00000020 000001d0 00000020 00000006 00000006 000022a5 000001d0
Call Trace: [<c0131e37>] [<c013066a>] [<c01282f7>] [<c012852b>] [<c012859c>]
[<c0128653>] [<c01286c6>] [<c01287e7>] [<c0105000>] [<c0105523>]
Code: f6 43 18 06 0f 84 7f 00 00 00 b8 07 00 00 00 0f ab 43 18 19

>>EIP; c0131ce0 <sync_page_buffers+10/b0> <=====
Trace; c0131e37 <try_to_free_buffers+b7/e0>
Trace; c013066a <try_to_release_page+3a/40>
Trace; c01282f7 <shrink_cache+1b7/2c0>
Trace; c012852b <shrink_caches+5b/90>
Trace; c012859c <try_to_free_pages+3c/60>
Trace; c0128653 <kswapd_balance_pgdat+53/b0>
Trace; c01286c6 <kswapd_balance+16/30>
Trace; c01287e7 <kswapd+a7/d0>
Trace; c0105000 <_stext+0/0>
Trace; c0105523 <kernel_thread+23/30>
Code; c0131ce0 <sync_page_buffers+10/b0>
00000000 <_EIP>:
Code; c0131ce0 <sync_page_buffers+10/b0> <=====
0: f6 43 18 06 testb $0x6,0x18(%ebx) <=====
Code; c0131ce4 <sync_page_buffers+14/b0>
4: 0f 84 7f 00 00 00 je 89 <_EIP+0x89> c0131d69 <sync_page_buffers+99/b0>
Code; c0131cea <sync_page_buffers+1a/b0>
a: b8 07 00 00 00 mov $0x7,%eax
Code; c0131cef <sync_page_buffers+1f/b0>
f: 0f ab 43 18 bts %eax,0x18(%ebx)
Code; c0131cf3 <sync_page_buffers+23/b0>
13: 19 00 sbb %eax,(%eax)


This is pretty similar to the 2.2.20 oopses (here's one):

Unable to handle kernel paging request at virtual address 4d7ebf3e
current->tss.cr3 = 0e912000, %cr3 = 0e912000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0120631>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010006
eax: ccf0efe0 ebx: ccf0efe0 ecx: 4d7ebf3e edx: ccf0ee00
esi: 00000800 edi: cffef740 ebp: 00000282 esp: cf05fc9c
ds: 0018 es: 0018 ss: 0018
Process cat (pid: 708, process nr: 29, stackpage=cf05f000)
Stack: 00000000 00000400 c0126db5 cffef740 00000005 ccf0ee00 00000000 c0126e42
00000000 00000400 00000400 cc908000 00002100 cf05fcdc cf05fcdc cf05e000
cf05e000 00000000 c0127615 cc908000 00000400 00000000 00000000 00000400
Call Trace: [<c0126db5>] [<c0126e42>] [<c0127615>] [<c012679e>] [<c012695a>]
[<c0129ac1>] [<c0111fb6>]
[<c0111fb6>] [<c011255b>] [<c017e988>] [<c018015d>] [<c0196ad9>]
[<c0181bad>] [<c0196a7c>] [<c015dc19>]
[<c0165a18>] [<c015dc19>] [<c0165a18>] [<c01661a9>] [<c016560b>]
[<c01659d6>] [<c015f5a1>] [<c0124fc9>]
[<c0124ece>] [<c0108924>]
Code: 8b 01 89 03 85 c0 74 2b 8b 73 04 85 f6 75 10 89 19 89 c8 2b

>>EIP; c0120631 <kmem_cache_alloc+31/124> <=====
Trace; c0126db5 <get_unused_buffer_head+55/a0>
Trace; c0126e42 <create_buffers+42/198>
Trace; c0127615 <grow_buffers+55/fc>
Trace; c012679e <refill_freelist+a/38>
Trace; c012695a <getblk+11e/144>
Trace; c0129ac1 <block_read+2c1/4f4>
Trace; c0111fb6 <wake_up_process+3a/44>
Trace; c0111fb6 <wake_up_process+3a/44>
Trace; c011255b <__wake_up+4f/6c>
Trace; c017e988 <end_that_request_last+28/2c>
Trace; c018015d <ide_end_request+61/6c>
Trace; c0196ad9 <ide_dma_intr+5d/94>
Trace; c0181bad <ide_intr+111/130>
Trace; c0196a7c <ide_dma_intr+0/94>
Trace; c015dc19 <alloc_skb+71/dc>
Trace; c0165a18 <ip_frag_create+18/60>
Trace; c015dc19 <alloc_skb+71/dc>
Trace; c0165a18 <ip_frag_create+18/60>
Trace; c01661a9 <ip_defrag+2f9/360>
Trace; c016560b <ip_local_deliver+2f/1c4>
Trace; c01659d6 <ip_rcv+236/260>
Trace; c015f5a1 <net_bh+181/1dc>
Trace; c0124fc9 <sys_write+e5/118>
Trace; c0124ece <sys_read+ae/c4>
Trace; c0108924 <system_call+34/38>
Code; c0120631 <kmem_cache_alloc+31/124>
00000000 <_EIP>:
Code; c0120631 <kmem_cache_alloc+31/124> <=====
0: 8b 01 mov (%ecx),%eax <=====
Code; c0120633 <kmem_cache_alloc+33/124>
2: 89 03 mov %eax,(%ebx)
Code; c0120635 <kmem_cache_alloc+35/124>
4: 85 c0 test %eax,%eax
Code; c0120637 <kmem_cache_alloc+37/124>
6: 74 2b je 33 <_EIP+0x33> c0120664 <kmem_cache_alloc+64/124>
Code; c0120639 <kmem_cache_alloc+39/124>
8: 8b 73 04 mov 0x4(%ebx),%esi
Code; c012063c <kmem_cache_alloc+3c/124>
b: 85 f6 test %esi,%esi
Code; c012063e <kmem_cache_alloc+3e/124>
d: 75 10 jne 1f <_EIP+0x1f> c0120650 <kmem_cache_alloc+50/124>
Code; c0120640 <kmem_cache_alloc+40/124>
f: 89 19 mov %ebx,(%ecx)
Code; c0120642 <kmem_cache_alloc+42/124>
11: 89 c8 mov %ecx,%eax
Code; c0120644 <kmem_cache_alloc+44/124>
13: 2b 00 sub (%eax),%eax


This is with the BIOS settings we thought stable.

Any ideas?


-- v --

v...@iki.fi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Andrew Morton

Jan 9, 2002, 4:00:53 PM
Ville Herva wrote:
>
> >>EIP; c0131ce0 <sync_page_buffers+10/b0> <=====

Looks like a corrupted `next' pointer in the page's buffer_head
ring. Your report is identical to Todd Eigenschink's repeatable
oops. http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.3/0689.html
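
One way to see why the register dump points at a bad pointer is to redo the address arithmetic from the 2.4.18pre2 oops. The sketch below is only an illustration: the register values are copied from the report, while `PAGE_OFFSET` and the helper names are assumptions (the usual i386 3G/1G split, not something stated in the thread).

```python
# Values copied from the 2.4.18pre2 oops quoted in this thread.
EBX = 0x1D292ED1            # base register of the faulting instruction
DISPLACEMENT = 0x18         # from "testb $0x6,0x18(%ebx)"
FAULT_ADDRESS = 0x1D292EE9  # "virtual address 1d292ee9"

# Assumption: default i386 3G/1G split, kernel mappings start here.
PAGE_OFFSET = 0xC0000000


def effective_address(base: int, displacement: int) -> int:
    """Address touched by a base+displacement operand on a 32-bit CPU."""
    return (base + displacement) & 0xFFFFFFFF


def looks_like_kernel_pointer(value: int) -> bool:
    """Rough plausibility check: kernel-mapped and 4-byte aligned."""
    return value >= PAGE_OFFSET and value % 4 == 0


if __name__ == "__main__":
    # The faulting address is exactly ebx + 0x18.
    print(hex(effective_address(EBX, DISPLACEMENT)))
    # ebx itself is odd and below PAGE_OFFSET, so it is not a plausible pointer.
    print(looks_like_kernel_pointer(EBX))
    # edi from the same dump, for comparison.
    print(looks_like_kernel_pointer(0xCF12E940))
```

The pointer held in ebx is both unaligned and outside the kernel mapping, which is consistent with a pointer that was overwritten rather than one that was merely stale.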

In another thread, yesterday, we were discussing the elusive
"end_request: buffer-list destroyed" crash.

I am able to trigger this in around ten minutes on 2.4.13 and
later kernels. However 2.4.13-pre6 ran the test for nine hours
and did not fail.

I've put the 2.4.13-pre6 -> 2.4.13 diff at http://www.zip.com.au/~akpm/1.gz


MAINTAINERS | 6 ++
Makefile | 2
arch/i386/kernel/smp.c | 58 +++++++++++-----------------
drivers/message/i2o/i2o_block.c | 44 ++++++++-------------
drivers/message/i2o/i2o_config.c | 1
drivers/message/i2o/i2o_core.c | 39 ++++++++++++++++---
drivers/message/i2o/i2o_lan.c | 4 +
drivers/message/i2o/i2o_pci.c | 14 ++++++
drivers/message/i2o/i2o_proc.c | 16 +++----
drivers/message/i2o/i2o_scsi.c | 17 ++++++--
drivers/scsi/dpt_i2o.c | 14 +++---
drivers/sound/ymfpci.c | 52 +++++++++++--------------
fs/buffer.c | 54 ++++++++++++++++----------
fs/ntfs/fs.c | 1
include/linux/fs.h | 3 -
include/linux/locks.h | 2
include/linux/mm.h | 17 ++++----
include/linux/slab.h | 2
include/linux/swap.h | 4 -
kernel/exit.c | 13 +-----
mm/highmem.c | 4 -
mm/page_alloc.c | 39 +++++++++----------
mm/swap.c | 4 -
mm/vmscan.c | 80 ++++++++++++++++++++++-----------------

There were VM changes, and a messy, complex and undocumented
change to sync_page_buffers(), which was the point at which
I ceased to understand that function. The patch was never Cc'ed
to the mailing list, was never explained. This sort of thing
makes it very hard for other developers to hunt down bugs.

Probably, the bug lies elsewhere and perhaps my bug is different
from yours and Todd's. It is timing-related, and the VM and
buffer changes may just have triggered it.

I have a debug patch from Jens to try tonight.

It could just be some random memory scribbler. Dunno yet. It's
awfully repeatable.


Ville Herva

Jan 9, 2002, 4:57:22 PM
On Wed, Jan 09, 2002 at 01:00:53PM -0800, you [Andrew Morton] claimed:

> Ville Herva wrote:
> >
> > >>EIP; c0131ce0 <sync_page_buffers+10/b0> <=====
>
> Looks like a corrupted `next' pointer in the page's buffer_head
> ring. Your report is identical to Todd Eigenschink's repeatable
> oops. http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.3/0689.html
>
> In another thread, yesterday, we were discussing the elusive
> "end_request: buffer-list destroyed" crash.

(...)



> There were VM changes, and a messy, complex and undocumented change to
> sync_page_buffers(), which was the point at which I ceased to understand
> that function.

Nice, yet one more variable in the equation ;). And I thought I could rule
out kernel bugs by reproducing this on a supposedly stable kernel (the 2.2.20
I used had all sorts of patches in it; ide, e2compr and raid to name the
largest ones.)

This could be a sync_page_buffers() bug, but what puzzles me is that I can
reproduce the oopses on 2.2 as well (although they can of course be
different oopses).

Also, I'm seeing ide and network corruption that would very much point to
pci transfer corruption. Of course, it can be that the oopses are not caused
by that.

> It could just be some random memory scribbler. Dunno yet. It's awfully
> repeatable.

Yep.


-- v --

v...@iki.fi

Daniel J Blueman

Jan 9, 2002, 6:33:28 PM
> Nice, yet one more variable in the equation ;). And I thought
> I could rule out kernel bugs by reproducing this on a
> supposedly stable kernel (the 2.2.20 I used had all sorts of
> patches in it; ide, e2compr and raid to name the largest ones.)
>
> This could be a sync_page_buffers() bug, but what puzzles me
> is that I can reproduce the oopses on 2.2 as well (although
> they can of course be different oopses).
>
> Also, I'm seeing ide and network corruption that would very
> much point to pci transfer corruption. Of course, it can be
> that the oopses are not caused by that.
[snip]

From what I've read, it looks like there can be issues with the VIA
KT133 PCI implementation, possibly applying to other VIA chipsets too.

Master memory reads can take 45 cycles rather than 16 (the max defined
in the PCI spec) - this sounds like it could be due to either a) bad
motherboard design with signal problems, or b) BIOS chipset
configuration (try setting 'PCI master read caching' to on?). I say this
because problems have been reported with motherboards of different makes using
the same chipset, and those are the only two factors that differ.
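
To put the two cycle counts in time units, here is a small worked conversion. It assumes a conventional 33 MHz PCI clock, which the thread does not state; the 16 and 45 figures are the ones quoted in this message.

```python
# Assumption: conventional 33 MHz PCI bus (nominally 33.33 MHz).
PCI_CLOCK_HZ = 33_333_333

SPEC_MAX_CYCLES = 16   # limit quoted in the message
OBSERVED_CYCLES = 45   # figure quoted in the message


def cycles_to_ns(cycles: int, clock_hz: int = PCI_CLOCK_HZ) -> float:
    """Convert a number of bus clocks into nanoseconds."""
    return cycles * 1e9 / clock_hz


if __name__ == "__main__":
    for label, cycles in (("quoted limit", SPEC_MAX_CYCLES),
                          ("reported", OBSERVED_CYCLES)):
        print(f"{label}: {cycles} clocks = {cycles_to_ns(cycles):.0f} ns")
    print(f"ratio: {OBSERVED_CYCLES / SPEC_MAX_CYCLES:.2f}x")
```

Under that assumption the quoted limit is roughly 480 ns and the reported figure roughly 1350 ns, so a master would wait almost three times longer than it is entitled to expect.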

Of course, this may well not help if it is genuinely a bug in the
kernel, but may solve the PCI corruption (if any).

Also, if it is a chipset issue, updating the BIOS can help at times,
with the vendor incorporating work-arounds for known chipset problems
(eg the well-publicised IDE corruption issue).

Dan
___________________
Daniel J Blueman

Martin Josefsson

Jan 9, 2002, 6:30:37 PM
On Wed, 9 Jan 2002, Ville Herva wrote:

> On Wed, Jan 09, 2002 at 01:00:53PM -0800, you [Andrew Morton] claimed:
> > Ville Herva wrote:
> > >
> > > >>EIP; c0131ce0 <sync_page_buffers+10/b0> <=====
> >
> > Looks like a corrupted `next' pointer in the page's buffer_head
> > ring. Your report is identical to Todd Eigenschink's repeatable
> > oops. http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.3/0689.html
> >
> > In another thread, yesterday, we were discussing the elusive
> > "end_request: buffer-list destroyed" crash.
>
> (...)
>
> > There were VM changes, and a messy, complex and undocumented change to
> > sync_page_buffers(), which was the point at which I ceased to understand
> > that function.
>

> Nice, yet one more variable in the equation ;). And I thought I could rule
> out kernel bugs by reproducing this on a supposedly stable kernel (the 2.2.20
> I used had all sorts of patches in it; ide, e2compr and raid to name the
> largest ones.)
>
> This could be a sync_page_buffers() bug, but what puzzles me is that I can
> reproduce the oopses on 2.2 as well (although they can of course be
> different oopses).
>
> Also, I'm seeing ide and network corruption that would very much point to
> pci transfer corruption. Of course, it can be that the oopses are not caused
> by that.

I haven't followed this thread, but I have a machine with an Asus A7V
motherboard with the KT133 chipset, and we had massive corruption before
Christmas: both IDE and network had corrupted packets. After
Christmas we ran memtest86 on it and a 256MB module was very, very broken.
We got a lot of oopses and all kinds of strange stuff happened.

We've replaced that memory module and now it's better, but I have to
say that the KT133, or at least the Asus A7V motherboard, seems to be quite
broken. We have a lot of spurious IRQs, and the IDE controllers freak out when
put under some load: they start getting IRQ timeouts and reset the IDE
channels over and over again, with some delay in between when it kind of
works, slow as hell but works.

We are going to replace the motherboard with one with the VIA KT266A chipset;
hope that works better.

/Martin

Never argue with an idiot. They drag you down to their level, then beat you with experience.

Martin Josefsson

Jan 9, 2002, 7:41:16 PM
On Thu, 10 Jan 2002, Daniel J Blueman wrote:

> There are known issues with the VIA 82C686A/B chipset south-bridge and
> IDE in particular. Make sure you have the latest BIOS and latest VIA
> 4in1 drivers to workaround the IDE corruption and other known issues
> (sound problems with certain soundcards).

Yes, I'm aware of these problems. I thought that the VIA 4in1 drivers were
wintendo drivers. And I also thought that there are workarounds for these
bugs in the kernel.

Daniel J Blueman

Jan 9, 2002, 7:19:38 PM
> I haven't followed this thread but I have a machine with an
> Asus A7V motherboard with KT133 chipset and we had massive
> corruption before christmas, both ide and network had
> corrupted packets. and now after christmas we ran memtest86
> on it and a 256MB module was very very broken. and we got
> a lot of Oopses and all kinds of strange stuff happened.
>
> We've replaced that memory module now and now it's better but
> I have to say that the KT133 or at least the Asus A7V
> motherboard seems to be quite broken. we have a lot of
> spurious irq's and the ide controllers freak when put under
> some load and start getting irq timeouts and resets the ide
> channels over and over again with some delay in between when
> it kind of works, slow as hell but works.

There are known issues with the VIA 82C686A/B chipset south-bridge and
IDE in particular. Make sure you have the latest BIOS and latest VIA
4in1 drivers to workaround the IDE corruption and other known issues
(sound problems with certain soundcards).

Dan
___________________
Daniel J Blueman


Daniel J Blueman

Jan 9, 2002, 8:04:17 PM
> On Thu, 10 Jan 2002, Daniel J Blueman wrote:
>
> > There are known issues with the VIA 82C686A/B chipset south-bridge and
> > IDE in particular. Make sure you have the latest BIOS and latest VIA
> > 4in1 drivers to workaround the IDE corruption and other known issues
> > (sound problems with certain soundcards).
>
> Yes, I'm aware of these problems. I thought that the VIA 4in1
> drivers were wintendo drivers. And I also thought that there
> are workarounds for these bugs in the kernel.

Yep, the VIA 4in1 drivers are purely for Windows.

In Linux, if the chipset-fixup code is being triggered on boot (and
appears in your dmesg?), then it looks like the problem may be
elsewhere...

On the other hand, perhaps that fixup code isn't complete (or relies on
certain chipset features being on/off by default, vendor specific
defaults?)

Ville Herva

Jan 10, 2002, 12:34:41 AM
On Thu, Jan 10, 2002 at 12:30:37AM +0100, you [Martin Josefsson] claimed:

>
> I haven't followed this thread but I have a machine with an Asus A7V
> motherboard with KT133 chipset and we had massive corruption before
> christmas, both ide and network had corrupted packets. and now after
> christmas we ran memtest86 on it and a 256MB module was very very broken.
> and we got a lot of Oopses and all kinds of strange stuff happened.

We ran memtest86 at one point but it showed nothing. We changed the memory
modules, and it didn't help. (It did seem like the order and placement of the
modules on the mobo made a difference, but that turned out to be a false
positive. Trying harder made the corruption happen again.)

Also, we only began to see oopses when we stress tested the hpt370 IDE AND the
network (until then we had only stressed the hpt370 and run "normal" stuff). The
board never oopsed or behaved strangely other than the hpt370 corruption and the
hpt370+3c905 stress test oopses.
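
For anyone wanting to repeat the combined disk and network load, the one-liner from the first message can be wrapped as below. This is a rough sketch only: the function names are made up for illustration, the device names and the `box2` host are the ones from that message, and the launcher needs root (flood ping) and the real hardware, so the example only prints the commands.

```python
import subprocess


def build_stress_commands(disks, host, payload_bytes=64000):
    """Return argv lists equivalent to the shell one-liner from the first message."""
    readers = [["cat", disk] for disk in disks]            # cat /dev/hdX > /dev/null &
    flood = ["ping", "-f", "-s", str(payload_bytes), host]  # ping -f -s 64000 box2
    return readers, flood


def run_stress(disks, host):
    """Start the disk readers in the background, then flood ping in the foreground.

    Hypothetical helper: needs root and the real devices, so it is not called here.
    """
    readers, flood = build_stress_commands(disks, host)
    procs = [subprocess.Popen(cmd, stdout=subprocess.DEVNULL) for cmd in readers]
    try:
        subprocess.run(flood, check=False)
    finally:
        # Stop the background readers once ping exits or is interrupted.
        for proc in procs:
            proc.terminate()


if __name__ == "__main__":
    # Dry run: only show what would be executed.
    readers, flood = build_stress_commands(["/dev/hde", "/dev/hdg"], "box2")
    for cmd in readers + [flood]:
        print(" ".join(cmd))
```

Dropping either the readers or the ping from the lists gives the single-load variants, which is a simple way to compare IDE-only, network-only and combined runs.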



> We are going to replace the motherboard with one with VIA KT266A chipset,
> hope that works better.

If we get around to replacing the bugger, the one thing I'll make sure is that
the replacement is not Via. Even if 1000 people told me KT266A was stable.


-- v --

v...@iki.fi

Ville Herva

Jan 10, 2002, 1:34:13 AM
On Wed, Jan 09, 2002 at 01:00:53PM -0800, you [Andrew Morton] claimed:
> Ville Herva wrote:
> >
> > >>EIP; c0131ce0 <sync_page_buffers+10/b0> <=====
>
> Looks like a corrupted `next' pointer in the page's buffer_head
> ring. Your report is identical to Todd Eigenschink's repeatable
> oops. http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.3/0689.html
<snip>
> I am able to trigger this in around ten minutes on 2.4.13 and
> later kernels. However 2.4.13-pre6 ran the test for nine hours
> and did not fail.

Out of curiosity: what kind of load do you use to trigger it?

> I've put the 2.4.13-pre6 -> 2.4.13 diff at http://www.zip.com.au/~akpm/1.gz

Seems your diff didn't include some bits (Maintainers changes and something
else.)

Anyhow, I compiled 2.4.13pre6 and it collapsed in just a few minutes. My
best guess is that network card pci dma is somehow fubar, and it writes
stuff to where it shouldn't.


Unable to handle kernel paging request at virtual address 86061d0e
c012f354
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c012f354>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 86061cee ebx: cb4aba40 ecx: 68158c40 edx: 0000aa25
esi: cb4aba40 edi: 00000001 ebp: 00000001 esp: ceb45b48
ds: 0018 es: 0018 ss: 0018
Process ping (pid: 4134, stackpage=ceb45000)
Stack: 00000001 cb4aba40 c012fc93 cb4aba40 ceb45ba8 00000000 c012fcca cb4aba40
c017cdc3 cb4aba40 cb4ab7c0 000003f0 cb4ab7c0 c1324000 00000008 c1405e50
0000dc2f cb4aba40 c0131766 00000001 00000001 ceb45ba8 00000000 cb4aba40
Call Trace: [<c012fc93>] [<c012fcca>] [<c017cdc3>] [<c0131766>] [<c01318a3>]
[<c0127a69>] [<c0127d4f>] [<c0127d9d>] [<c01286e5>] [<c012893f>] [<c0128688>]
[<c01289ba>] [<c0126922>] [<c0126bef>] [<c01e4c2b>] [<c01e4311>] [<c01f59c3>]
[<c01f5d3e>] [<c020c250>] [<c020c5cf>] [<c020c250>] [<c0212f70>] [<c0212fa9>]
[<c01e1f01>] [<c0212f70>] [<c01e3177>] [<c0193800>] [<c0193a3e>] [<c0193a20>]
[<c018f2e0>] [<c018f32a>] [<c018f696>] [<c0193a3e>] [<c018faba>] [<c0193a50>]
[<c01e360c>] [<c0106ebb>]
Code: 89 48 20 c1 e2 02 be 24 0b 2c c0 89 41 24 39 1c 32 75 0a 31

>>EIP; c012f354 <__remove_from_lru_list+14/60> <=====
Trace; c012fc93 <__refile_buffer+33/60>
Trace; c012fcca <refile_buffer+a/10>
Trace; c017cdc3 <ll_rw_block+1a3/1c0>
Trace; c0131766 <sync_page_buffers+46/a0>
Trace; c01318a3 <try_to_free_buffers+e3/110>
Trace; c0127a69 <shrink_cache+129/2b0>
Trace; c0127d4f <shrink_caches+5f/90>
Trace; c0127d9d <try_to_free_pages+1d/50>
Trace; c01286e5 <balance_classzone+55/170>
Trace; c012893f <__alloc_pages+13f/1b0>
Trace; c0128688 <_alloc_pages+18/20>
Trace; c01289ba <__get_free_pages+a/20>
Trace; c0126922 <kmem_cache_grow+a2/200>
Trace; c0126bef <kmalloc+bf/e0>
Trace; c01e4c2b <alloc_skb+cb/180>
Trace; c01e4311 <sock_alloc_send_skb+71/110>
Trace; c01f59c3 <ip_build_xmit_slow+193/4c0>
Trace; c01f5d3e <ip_build_xmit+4e/350>
Trace; c020c250 <raw_getfrag+0/30>
Trace; c020c5cf <raw_sendmsg+28f/300>
Trace; c020c250 <raw_getfrag+0/30>
Trace; c0212f70 <inet_sendmsg+0/40>
Trace; c0212fa9 <inet_sendmsg+39/40>
Trace; c01e1f01 <sock_sendmsg+81/b0>
Trace; c0212f70 <inet_sendmsg+0/40>
Trace; c01e3177 <sys_sendmsg+197/1f0>
Trace; c0193800 <hpt370_rw_proc+0/10>
Trace; c0193a3e <hpt370_dmaproc+1e/30>
Trace; c0193a20 <hpt370_dmaproc+0/30>
Trace; c018f2e0 <start_request+190/240>
Trace; c018f32a <start_request+1da/240>
Trace; c018f696 <ide_do_request+296/2e0>
Trace; c0193a3e <hpt370_dmaproc+1e/30>
Trace; c018faba <ide_intr+7a/140>
Trace; c0193a50 <ide_dma_intr+0/c0>
Trace; c01e360c <sys_socketcall+1cc/1f0>
Trace; c0106ebb <system_call+33/38>
Code; c012f354 <__remove_from_lru_list+14/60>
00000000 <_EIP>:
Code; c012f354 <__remove_from_lru_list+14/60> <=====
0: 89 48 20 mov %ecx,0x20(%eax) <=====
Code; c012f357 <__remove_from_lru_list+17/60>
3: c1 e2 02 shl $0x2,%edx
Code; c012f35a <__remove_from_lru_list+1a/60>
6: be 24 0b 2c c0 mov $0xc02c0b24,%esi
Code; c012f35f <__remove_from_lru_list+1f/60>
b: 89 41 24 mov %eax,0x24(%ecx)
Code; c012f362 <__remove_from_lru_list+22/60>
e: 39 1c 32 cmp %ebx,(%edx,%esi,1)
Code; c012f365 <__remove_from_lru_list+25/60>
11: 75 0a jne 1d <_EIP+0x1d> c012f371 <__remove_from_lru_list+31/60>
Code; c012f367 <__remove_from_lru_list+27/60>
13: 31 00 xor %eax,(%eax)


<1>Unable to handle kernel paging request at virtual address b8f2ed62
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c012f4f1>] Not tainted
EFLAGS: 00010282
eax: b8f2ed5e ebx: cb4ab9c0 ecx: 000002f0 edx: ff2eca38
esi: cb4abd40 edi: cb4ab9c0 ebp: c127e940 esp: ceb43d34
ds: 0018 es: 0018 ss: 0018
Process cat (pid: 4133, stackpage=ceb43000)
Stack: c0131812 cb4ab9c0 00000000 c127e940 000002f0 00002298 c0127a69 c127e940
000002f0 00000020 000002f0 00000006 00002299 00000020 000002f0 c0127d4f
00000000 00000006 00000020 00000000 000002f0 00019f2e c0127d9d 00000000
Call Trace: [<c0131812>] [<c0127a69>] [<c0127d4f>] [<c0127d9d>] [<c01286e5>]
[<c012893f>] [<c0128688>] [<c01289ba>] [<c0126922>] [<c0126b19>] [<c012fdc3>]
[<c012fe4b>] [<c0130097>] [<c01305c8>] [<c01288d8>] [<c0121b76>] [<c0132fef>]
[<c0132f80>] [<c0121c15>] [<c01220c4>] [<c0122327>] [<c01226b5>] [<c01225e0>]
[<c012dffe>] [<c0106ebb>]
Code: 89 50 04 89 02 c3 89 f6 8d bc 27 00 00 00 00 8b 54 24 04 31

>>EIP; c012f4f1 <__remove_inode_queue+11/20> <=====
Trace; c0131812 <try_to_free_buffers+52/110>
Trace; c0127a69 <shrink_cache+129/2b0>
Trace; c0127d4f <shrink_caches+5f/90>
Trace; c0127d9d <try_to_free_pages+1d/50>
Trace; c01286e5 <balance_classzone+55/170>
Trace; c012893f <__alloc_pages+13f/1b0>
Trace; c0128688 <_alloc_pages+18/20>
Trace; c01289ba <__get_free_pages+a/20>
Trace; c0126922 <kmem_cache_grow+a2/200>
Trace; c0126b19 <kmem_cache_alloc+99/b0>
Trace; c012fdc3 <get_unused_buffer_head+33/80>
Trace; c012fe4b <create_buffers+1b/130>
Trace; c0130097 <create_empty_buffers+17/50>
Trace; c01305c8 <block_read_full_page+58/290>
Trace; c01288d8 <__alloc_pages+d8/1b0>
Trace; c0121b76 <add_to_page_cache_unique+66/70>
Trace; c0132fef <blkdev_readpage+f/20>
Trace; c0132f80 <blkdev_get_block+0/40>
Trace; c0121c15 <page_cache_read+95/c0>
Trace; c01220c4 <generic_file_readahead+104/150>
Trace; c0122327 <do_generic_file_read+1e7/4a0>
Trace; c01226b5 <generic_file_read+75/90>
Trace; c01225e0 <file_read_actor+0/60>
Trace; c012dffe <sys_read+8e/d0>
Trace; c0106ebb <system_call+33/38>
Code; c012f4f1 <__remove_inode_queue+11/20>
00000000 <_EIP>:
Code; c012f4f1 <__remove_inode_queue+11/20> <=====
0: 89 50 04 mov %edx,0x4(%eax) <=====
Code; c012f4f4 <__remove_inode_queue+14/20>
3: 89 02 mov %eax,(%edx)
Code; c012f4f6 <__remove_inode_queue+16/20>
5: c3 ret
Code; c012f4f7 <__remove_inode_queue+17/20>
6: 89 f6 mov %esi,%esi
Code; c012f4f9 <__remove_inode_queue+19/20>
8: 8d bc 27 00 00 00 00 lea 0x0(%edi,1),%edi
Code; c012f500 <inode_has_buffers+0/20>
f: 8b 54 24 04 mov 0x4(%esp,1),%edx
Code; c012f504 <inode_has_buffers+4/20>
13: 31 00 xor %eax,(%eax)
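
The "Oops:" values in the four reports so far can be decoded as i386 page-fault error codes, where bit 0 is set for a protection violation, bit 1 for a write and bit 2 for user mode. The sketch below just applies that bit layout to the values copied from the dumps; the function and table names are illustrative.

```python
def decode_oops_code(code: int) -> dict:
    """Split an i386 page-fault error code into its three low bits."""
    return {
        "cause": "protection violation" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
    }


# (kernel, faulting function, "Oops:" value) copied from the reports in this thread.
REPORTS = [
    ("2.4.18pre2", "sync_page_buffers", 0x0000),
    ("2.2.20", "kmem_cache_alloc", 0x0000),
    ("2.4.13pre6", "__remove_from_lru_list", 0x0002),
    ("2.4.13pre6", "__remove_inode_queue", 0x0002),
]

if __name__ == "__main__":
    for kernel, function, code in REPORTS:
        info = decode_oops_code(code)
        print(f"{kernel:11} {function:24} {info['mode']} {info['access']}, {info['cause']}")
```

All four are kernel-mode accesses to unmapped addresses. The two reads match the `testb` and `mov (%ecx),%eax` instructions, and the two writes match the `mov` stores into list neighbours, so in each case the code is dereferencing a garbage pointer taken from a data structure.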

Andrew Morton

Jan 10, 2002, 1:40:28 AM
Ville Herva wrote:
>
> On Wed, Jan 09, 2002 at 01:00:53PM -0800, you [Andrew Morton] claimed:
> > Ville Herva wrote:
> > >
> > > >>EIP; c0131ce0 <sync_page_buffers+10/b0> <=====
> >
> > Looks like a corrupted `next' pointer in the page's buffer_head
> > ring. Your report is identical to Todd Eigenschink's repeatable
> > oops. http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.3/0689.html
> <snip>
> > I am able to trigger this in around ten minutes on 2.4.13 and
> > later kernels. However 2.4.13-pre6 ran the test for nine hours
> > and did not fail.
>
> Out of curiosity: what kind of load do you use to trigger it?

Massive VM load and ext3. I've found the buffer-list destroyed
bug. It's incorrect buffer locking in ext3. It used to work,
sleazily, but blockdev-in-pagecache pulled its pants down.

> > I've put the 2.4.13-pre6 -> 2.4.13 diff at http://www.zip.com.au/~akpm/1.gz
>
> Seems your diff didn't include some bits (Maintainers changes and something
> else.)
>
> Anyhow, I compiled 2.4.13pre6 and it collapsed in just a few minutes. My
> best guess is that network card pci dma is somehow fubar, and it writes
> stuff to where it shouldn't.

OK. Looks like they're different things - you have hardware problems,
I have brain problems.


Henrique de Moraes Holschuh

Jan 10, 2002, 7:01:02 AM
On Thu, 10 Jan 2002, Martin Josefsson wrote:
> We've replaced that memory module now and now it's better but I have to
> say that the KT133 or at least the Asus A7V motherboard seems to be quite
> broken. we have a lot of spurious irq's and the ide controllers freak when
> put under some load and start getting irq timeouts and resets the ide
> channels over and over again with some delay in between when it kind of
> works, slow as hell but works.

Well, my A7V is also acting up, with spurious IRQs (but not too many), and
PCI lockups if the load on the PCI bus increases too much -- this is
probably the last time I ever buy a VIA board (because they take soooo much
time to acknowledge their screw ups and help people fix it) unless they
start issuing non-binary-only fixes (heck, all it takes is a doc telling us
what to do on the PCI registers!).

The IDE corruption and lockups you can fix: just apply the latest IDE
patches. The 2.4.18pre IDE subsystem is not to be used on a KT133; it will
not work at all if you give it a slightly bigger load on the Promise
controller, for example.

> We are going to replace the motherboard with one with VIA KT266A chipset,
> hope that works better.

Without the IDE patches, it will (most probably) not help.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

Martin Josefsson

Jan 10, 2002, 7:56:58 AM
On Thu, 10 Jan 2002, Henrique de Moraes Holschuh wrote:

> On Thu, 10 Jan 2002, Martin Josefsson wrote:
> > We've replaced that memory module now and now it's better but I have to
> > say that the KT133 or at least the Asus A7V motherboard seems to be quite
> > broken. we have a lot of spurious irq's and the ide controllers freak when
> > put under some load and start getting irq timeouts and resets the ide
> > channels over and over again with some delay in between when it kind of
> > works, slow as hell but works.
>
> Well, my A7V is also acting up, with spurious IRQs (but not too many), and
> PCI lockups if the load on the PCI bus increases too much -- this is
> probably the last time I ever buy a VIA board (because they take soooo much
> time to acknowledge their screw ups and help people fix it) unless they
> start issuing non-binary-only fixes (heck, all it takes is a doc telling us
> what to do on the PCI registers!).
>
> The IDE corruption and lockups you can fix, just apply the latest IDE
> patches, the 2.4.18pre IDE subsystem is not to be used on a KT133, it will
> not work at all if you give it a slightly bigger load on the promise
> controller, for example.
>
> > We are going to replace the motherboard with one with VIA KT266A chipset,
> > hope that works better.
>
> Without the IDE patches, it will (most probably) not help.

I am using the IDE patch. I've heard that the A7V133, which is based on the
KT133A chipset, works much better in Linux. I know people using it in a
router for a 1000 client network on a 100Mbit connection and it's working
fine, no problems at all. If we push the networking too hard we get a lot
of spurious interrupts, and it appears that we lose some interrupts as well,
as NIC drivers and IDE drivers start complaining sometimes. Once it has
started losing interrupts, only a reboot can bring it back to
"normal" operation.

/Martin

Never argue with an idiot. They drag you down to their level, then beat you with experience.


Ville Herva

Jan 10, 2002, 8:37:02 AM
On Thu, Jan 10, 2002 at 10:01:02AM -0200, you [Henrique de Moraes Holschuh] claimed:

>
> The IDE corruption and lockups you can fix, just apply the latest IDE
> patches, the 2.4.18pre IDE subsystem is not to be used on a KT133, it will
> not work at all if you give it a slightly bigger load on the promise
> controller, for example.

We just tried with 2.4.18pre2 + Hedrick ATA patch, but it oopsed just like
2.4.18pre2 vanilla. I reckon the ide corruption will also happen if we leave
the "ping -f" out of the equation.

This is probably a PCI issue, not an IDE issue.


-- v --

v...@iki.fi
