kswapd OOPS under 2.4.19-pre8 (ext3, Reiserfs + (soft)raid0)

Denis Vlasenko

May 15, 2002, 9:51:41 AM

On 14 May 2002 13:09, Todd R. Eigenschink wrote:
> I never saw any responses to my oops post...but now I've narrowed
> things further, to a point that makes it seem more serious.
...
> Dual P3/500, 2 GB RAM, Intel L440-GXC mainboard.

> Oops: 0000
> CPU: 0
> EIP: 0010:[<c0115ba8>] Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010087
> eax: c2802db4 ebx: c2002db4 ecx: 00000000 edx: 00000003
> esi: c2802db0 edi: c2802db0 ebp: cd2fdf48 esp: cd2fdf2c
> ds: 0018 es: 0018 ss: 0018
> Process ld-linux.so.2 (pid: 18110, stackpage=cd2fd000)
> Stack: c1095000 c2802db0 00000000 c2802db4 00000000 00000282 00000003
> 00000000 c01295fe c1095000 00001000 c012bef0 00000000 ce03f5c0 ffffffea
> 00001000 e0855de8 00001000 00000000 00001000 00001000 00001000 00004000
> 00000000
> Call Trace: [<c01295fe>] [<c012bef0>] [<c0136d57>] [<c010889b>]
> Code: 8b 01 85 45 fc 74 69 31 c0 9c 5e fa f0 fe 0d 80 a9 30 c0 0f
>
> >>EIP; c0115ba8 <__wake_up+40/d0> <=====
> Trace; c01295fe <unlock_page+62/68>
> Trace; c012bef0 <generic_file_write+578/778>
> Trace; c0136d57 <sys_write+8f/100>
> Trace; c010889b <system_call+33/38>
>
> Code; c0115ba8 <__wake_up+40/d0>
> 00000000 <_EIP>:
> Code; c0115ba8 <__wake_up+40/d0> <=====
> 0: 8b 01 mov (%ecx),%eax <=====
> 2: 85 45 fc test %eax,0xfffffffc(%ebp)
> 5: 74 69 je 70 <_EIP+0x70> c0115c18 <__wake_up+b0/d0>
> 7: 31 c0 xor %eax,%eax
> 9: 9c pushf
> a: 5e pop %esi
> b: fa cli
> c: f0 fe 0d 80 a9 30 c0 lock decb 0xc030a980
> 13: 0f 00 00 sldt (%eax)

%ecx is a NULL ptr here. It must be in __wake_up_common:

void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr)
{
	if (q) {
		unsigned long flags;
		wq_read_lock_irqsave(&q->lock, flags);
		__wake_up_common(q, mode, nr, 0);
		wq_read_unlock_irqrestore(&q->lock, flags);
	}
}

static inline void __wake_up_common (wait_queue_head_t *q, unsigned int mode,
				     int nr_exclusive, const int sync)
{
	struct list_head *tmp;
	struct task_struct *p;

	CHECK_MAGIC_WQHEAD(q);
	WQ_CHECK_LIST_HEAD(&q->task_list);

	list_for_each(tmp,&q->task_list) {
		unsigned int state;
		wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list);

		CHECK_MAGIC(curr->__magic);
		p = curr->task;
		state = p->state;
		if (state & mode) {
			WQ_NOTE_WAKER(curr);
			if (try_to_wake_up(p, sync) && (curr->flags&WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
				break;
		}
	}
}

Can you compile kernel/sched.c into asm and look at where it did *NULL?
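
To find the matching spot in the generated asm, it helps to decode the ksymoops notation first: in "c0115ba8 <__wake_up+40/d0>" the offset and the function size are both hex. What follows is only a rough user-space sketch of that decoding, not part of ksymoops; struct oops_ref, parse_oops_ref() and symbol_start() are made-up names for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One decoded "ADDR <symbol+offset/size>" entry, as printed by ksymoops. */
struct oops_ref {
    unsigned long addr;    /* absolute address, e.g. 0xc0115ba8 */
    char symbol[64];       /* nearest symbol, e.g. "__wake_up" */
    unsigned long offset;  /* bytes past the start of the symbol (hex) */
    unsigned long size;    /* total size of the symbol (hex) */
};

/* Hypothetical helper: parse text such as "c0115ba8 <__wake_up+40/d0>".
 * Returns 0 on success, -1 if the text does not have that shape. */
static int parse_oops_ref(const char *text, struct oops_ref *out)
{
    assert(text != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    if (sscanf(text, "%lx <%63[^+]+%lx/%lx>",
               &out->addr, out->symbol, &out->offset, &out->size) != 4)
        return -1;
    return 0;
}

/* Start address of the symbol: absolute address minus the offset. */
static unsigned long symbol_start(const struct oops_ref *ref)
{
    return ref->addr - ref->offset;
}
```

For the EIP above this gives symbol "__wake_up", offset 0x40, size 0xd0, so the function would start at c0115b68 and the faulting instruction sits 0x40 bytes into it.
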
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Todd R. Eigenschink

May 16, 2002, 8:28:44 AM

Mike Galbraith writes:
>Methinks there's an easier way to get to the line in question. Compile sched.c with -g via make kernel/sched.o EXTRA_CFLAGS=-g. gdb can then be used to get you the line with list *__wake_up+0xb2.

Ooh, spiffy idea. (Like I said, asm rookie.) I just compiled gdb,
and here's what it says. Interesting, to me, at least.

(gdb) list *__wake_up+0xb2
0x9d6 is in __wake_up (/src/linux-2.4.19-pre8/include/asm/processor.h:488).
483 #ifdef CONFIG_MPENTIUMIII
484
485 #define ARCH_HAS_PREFETCH
486 extern inline void prefetch(const void *x)
487 {
488 __asm__ __volatile__ ("prefetchnta (%0)" : : "r"(x));
489 }
490
491 #elif CONFIG_X86_USE_3DNOW
492

William Lee Irwin III

May 16, 2002, 3:38:34 PM

On Thu, May 16, 2002 at 07:28:44AM -0500, Todd R. Eigenschink wrote:
> Ooh, spiffy idea. (Like I said, asm rookie.) I just compiled gdb,
> and here's what it says. Interesting, to me, at least.
> (gdb) list *__wake_up+0xb2
> 0x9d6 is in __wake_up (/src/linux-2.4.19-pre8/include/asm/processor.h:488).
> 483 #ifdef CONFIG_MPENTIUMIII
> 484
> 485 #define ARCH_HAS_PREFETCH
> 486 extern inline void prefetch(const void *x)
> 487 {
> 488 __asm__ __volatile__ ("prefetchnta (%0)" : : "r"(x));
> 489 }
> 490
> 491 #elif CONFIG_X86_USE_3DNOW

list_for_each() uses prefetch() and is used in __wake_up_common(), which
is in turn used by __wake_up(). This is waitqueue list corruption.
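
A user-space sketch of that relationship, for anyone following along. The macro is modelled on the 2.4 list_for_each() as I remember it and is not copied from the tree; prefetch() here is just a stand-in built on the GCC builtin, and count_entries() is a made-up helper.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space copy of the kernel's circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

/* Stand-in for the kernel's prefetch(); __builtin_prefetch is a GCC/Clang
 * builtin and is only a cache hint. */
#define prefetch(x) __builtin_prefetch(x)

/* Evaluating pos->next is a real memory load from pos, even though the
 * prefetch itself is only a hint. If a corrupted entry leaves pos NULL,
 * that load is what faults. */
#define list_for_each(pos, head) \
    for (pos = (head)->next, prefetch(pos->next); pos != (head); \
         pos = pos->next, prefetch(pos->next))

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    assert(node != NULL && head != NULL);
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Walk the list the same way __wake_up_common() does and count entries. */
static int count_entries(struct list_head *head)
{
    struct list_head *tmp;
    int n = 0;

    list_for_each(tmp, head)
        n++;
    return n;
}
```

Because the macro is expanded inline, the debugger can attribute an address inside __wake_up() to the prefetch() line in processor.h, which is what the gdb listing shows.
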

Cheers,
Bill

Mike Fedyk

May 17, 2002, 1:15:31 PM

On Wed, May 15, 2002 at 07:16:29PM -0500, Todd R. Eigenschink wrote:
> I just reset after another oops. It's similar to, but different from,
> the previous one; the call stack is the same but they die in different
> places.
>
> I put the output of "gcc -E" and "gcc -S" (with the rest of the
> command-line parameters) at the following URLs so you can see what the
> asm turned into on my machine (gcc 2.95.3); I'm not very x86-asm
> literate, so most of it's $FOREIGN_LANG to me.
>

Have you tried testing the memory and power supply?

Todd R. Eigenschink

May 20, 2002, 8:58:25 AM

Todd R. Eigenschink writes:
>Mike Galbraith writes:
>>Methinks there's an easier way to get to the line in question. Compile sched.c with -g via make kernel/sched.o EXTRA_CFLAGS=-g. gdb can then be used to get you the line with list *__wake_up+0xb2.

Since the particular snippet of code at the point of oops in the last
one I posted was P3-specific, I recompiled for 586. The oops remains
the same, although the call stack happens to be a lot longer this
time.

I'm going to run memtest86 on it for a while after it gets done with
its morning processing, although this failure seems a little too
consistent to be memory related.

Trace; c0129b39 <unlock_page+81/88>
Trace; c0139179 <end_buffer_io_async+8d/a8>
Trace; c01b6f45 <end_that_request_first+65/c8>
Trace; c01c1c3c <ide_end_request+68/a8>
Trace; c01c806a <ide_dma_intr+6a/ac>
Trace; c01c38ad <ide_intr+f9/164>
Trace; c01c8000 <ide_dma_intr+0/ac>
Trace; c010a1e1 <handle_IRQ_event+59/84>
Trace; c010a3d9 <do_IRQ+a9/f4>
Trace; c010c568 <call_do_IRQ+5/d>
Trace; c0154b07 <statm_pgd_range+133/1a8>
Trace; c0154c43 <proc_pid_statm+c7/16c>
Trace; c015279e <proc_info_read+5a/118>
Trace; c0137497 <sys_read+8f/104>
Trace; c0108a43 <system_call+33/40>

Code; c0116383 <__wake_up+3b/c0>
00000000 <_EIP>:
Code; c0116383 <__wake_up+3b/c0> <=====
0: 8b 01 mov (%ecx),%eax <=====
Code; c0116385 <__wake_up+3d/c0>
2: 85 45 fc test %eax,0xfffffffc(%ebp)
Code; c0116388 <__wake_up+40/c0>
5: 74 66 je 6d <_EIP+0x6d> c01163f0 <__wake_up+a8/c0>
Code; c011638a <__wake_up+42/c0>
7: 31 d2 xor %edx,%edx
Code; c011638c <__wake_up+44/c0>
9: 9c pushf
Code; c011638d <__wake_up+45/c0>
a: 5e pop %esi
Code; c011638e <__wake_up+46/c0>
b: fa cli
Code; c011638f <__wake_up+47/c0>
c: f0 fe 0d 80 99 30 c0 lock decb 0xc0309980
Code; c0116396 <__wake_up+4e/c0>
13: 0f 00 00 sldtl (%eax)

(gdb) list *__wake_up+0x3b
0x96f is in __wake_up (kernel/sched.c:732).
727 wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list);
728
729 CHECK_MAGIC(curr->__magic);
730 p = curr->task;
731 state = p->state;
732 if (state & mode) {
733 WQ_NOTE_WAKER(curr);
734 if (try_to_wake_up(p, sync) && (curr->flags&WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
735 break;
736 }

William Lee Irwin III

May 20, 2002, 1:00:59 PM

On Mon, May 20, 2002 at 07:58:25AM -0500, Todd R. Eigenschink wrote:
> Since the particular snippet of code at the point of oops in the last
> one I posted was P3-specific, I recompiled for 586. The oops remains
> the same, although the call stack happens to be a lot longer this
> time.

I suspect the lowest parts of the call chain are being handed bad data.

On Mon, May 20, 2002 at 07:58:25AM -0500, Todd R. Eigenschink wrote:
> I'm going to run memtest86 on it for a while after it gets done with
> its morning processing, although this failure seems a little too
> consistent to be memory related.

I hope I didn't say that.

On Mon, May 20, 2002 at 07:58:25AM -0500, Todd R. Eigenschink wrote:
> Trace; c0129b39 <unlock_page+81/88>
> Trace; c0139179 <end_buffer_io_async+8d/a8>
> Trace; c01b6f45 <end_that_request_first+65/c8>
> Trace; c01c1c3c <ide_end_request+68/a8>
> Trace; c01c806a <ide_dma_intr+6a/ac>
> Trace; c01c38ad <ide_intr+f9/164>
> Trace; c01c8000 <ide_dma_intr+0/ac>
> Trace; c010a1e1 <handle_IRQ_event+59/84>
> Trace; c010a3d9 <do_IRQ+a9/f4>
> Trace; c010c568 <call_do_IRQ+5/d>
> Trace; c0154b07 <statm_pgd_range+133/1a8>
> Trace; c0154c43 <proc_pid_statm+c7/16c>
> Trace; c015279e <proc_info_read+5a/118>
> Trace; c0137497 <sys_read+8f/104>
> Trace; c0108a43 <system_call+33/40>

The __wake_up()/unlock_page() isn't the interesting part of the call
chain; the parts from end_buffer_io_async() to ide_dma_intr() are.

Any chance you can list them in gdb?

Cheers,
Bill

Todd R. Eigenschink

May 20, 2002, 4:26:56 PM

William Lee Irwin III writes:
>On Mon, May 20, 2002 at 07:58:25AM -0500, Todd R. Eigenschink wrote:
>> I'm going to run memtest86 on it for a while after it gets done with
>> its morning processing, although this failure seems a little too
>> consistent to be memory related.
>
>I hope I didn't say that.

Someone else had suggested testing the memory and power supply.
memtest86 is easy to run, so I'll try that. It'll have to be tonight,
now.

>The __wake_up()/unlock_page() isn't the interesting part of the call
>chain; the parts from end_buffer_io_async() to ide_dma_intr() are.
>
>Any chance you can list them in gdb?

Well, after my posting from earlier today, I recompiled the kernel
after stripping some more stuff. I just induced an oops in that one,
so I can list the call stack for it.

No IDE stuff this time; this looks a lot like most of the other ones
I've seen. This morning was the first time I've ever seen IDE stuff
in the post-oops call stack.

It seems I can pretty much induce them at will, now. I started up
four simultaneous Webtrends sessions, which grow fairly quickly to
400-600 MB each, give or take. (The machine has 2 GB of RAM, so it
only swaps a little, sometimes.) Within half an hour, it fell over.

Here's the oops itself, then the gdb output.

----------------------------------------------------------------------
Oops: 0000
CPU: 1
EIP: 0010:[<c0116363>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010087
eax: c2802db4 ebx: c2002db4 ecx: 00000000 edx: 00000003
esi: c2802db0 edi: c2802db0 ebp: f7bf3ee8 esp: f7bf3ecc
ds: 0018 es: 0018 ss: 0018
Process kswapd (pid: 5, stackpage=f7bf3000)
Stack: c133d790 c2802db0 c02acbf4 c2802db4 00000000 00000282 00000003 d911d9f0
c0129b19 0076eb00 c133d790 f7bf3f4c c0130817 00000000 c12e9ca0 00000020
00008efe 81e65000 81a7d000 1147d047 00000009 81c00000 f6c99818 81c00000
Call Trace: [<c0129b19>] [<c0130817>] [<c0130ca7>] [<c0130ea0>] [<c0130efc>]
[<c0130f97>] [<c0130ff6>] [<c0131107>] [<c010712c>]
Code: 8b 01 85 45 fc 74 66 31 d2 9c 5e fa f0 fe 0d 80 99 30 c0 0f

>>EIP; c0116363 <__wake_up+3b/c0> <=====

>>eax; c2802db4 <END_OF_CODE+249b758/????>
>>ebx; c2002db4 <END_OF_CODE+1c9b758/????>
>>esi; c2802db0 <END_OF_CODE+249b754/????>
>>edi; c2802db0 <END_OF_CODE+249b754/????>
>>ebp; f7bf3ee8 <END_OF_CODE+3788c88c/????>
>>esp; f7bf3ecc <END_OF_CODE+3788c870/????>

Trace; c0129b19 <unlock_page+81/88>
Trace; c0130817 <swap_out+347/4b4>
Trace; c0130ca7 <shrink_cache+323/3cc>
Trace; c0130ea0 <shrink_caches+5c/84>
Trace; c0130efc <try_to_free_pages+34/54>
Trace; c0130f97 <kswapd_balance_pgdat+47/90>
Trace; c0130ff6 <kswapd_balance+16/2c>
Trace; c0131107 <kswapd+9b/b6>
Trace; c010712c <kernel_thread+28/38>

Code; c0116363 <__wake_up+3b/c0>
00000000 <_EIP>:
Code; c0116363 <__wake_up+3b/c0> <=====
0: 8b 01 mov (%ecx),%eax <=====
Code; c0116365 <__wake_up+3d/c0>
2: 85 45 fc test %eax,0xfffffffc(%ebp)
Code; c0116368 <__wake_up+40/c0>
5: 74 66 je 6d <_EIP+0x6d> c01163d0 <__wake_up+a8/c0>
Code; c011636a <__wake_up+42/c0>
7: 31 d2 xor %edx,%edx
Code; c011636c <__wake_up+44/c0>
9: 9c pushf
Code; c011636d <__wake_up+45/c0>
a: 5e pop %esi
Code; c011636e <__wake_up+46/c0>
b: fa cli
Code; c011636f <__wake_up+47/c0>
c: f0 fe 0d 80 99 30 c0 lock decb 0xc0309980
Code; c0116376 <__wake_up+4e/c0>
13: 0f 00 00 sldtl (%eax)

----------------------------------------------------------------------

(gdb) list *__wake_up+0x3b
0x973 is in __wake_up (sched.c:731).
726 unsigned int state;
727 wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list);
728
729 CHECK_MAGIC(curr->__magic);
730 p = curr->task;
731 state = p->state;
732 if (state & mode) {
733 WQ_NOTE_WAKER(curr);
734 if (try_to_wake_up(p, sync) && (curr->flags&WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
735 break;

(gdb) list *unlock_page+0x81
0xcf9 is in unlock_page (filemap.c:845).
840 smp_mb__before_clear_bit();
841 if (!test_and_clear_bit(PG_locked, &(page)->flags))
842 BUG();
843 smp_mb__after_clear_bit();
844 if (waitqueue_active(waitqueue))
845 wake_up_all(waitqueue);
846 }
847
848 /*
849 * Get a lock on the page, assuming we need to sleep

(gdb) list *swap_out+0x347
No source file for address 0x347.

(gdb) list *swap_out
0x0 is in kswapd_init (vmscan.c:750).
745 }
746 }
747
748 static int __init kswapd_init(void)
749 {
750 printk("Starting kswapd\n");
751 swap_setup();
752 kernel_thread(kswapd, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGNAL);
753 return 0;
754 }

(I'm fuzzy on swap_out...can I not see the code because it's a static
function?)

(gdb) list *shrink_cache+0x323
0x7d7 is in shrink_cache (vmscan.c:483).
478 * Alert! We've found too many mapped pages on the
479 * inactive list, so we start swapping out now!
480 */
481 spin_unlock(&pagemap_lru_lock);
482 swap_out(priority, gfp_mask, classzone);
483 return nr_pages;
484 }
485
486 /*
487 * It is critical to check PageDirty _after_ we made sure

(gdb) list *shrink_caches+0x5c
0x9d0 is in shrink_caches (vmscan.c:571).
566 nr_pages = chunk_size;
567 /* try to keep the active list 2/3 of the size of the cache */
568 ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
569 refill_inactive(ratio);
570
571 nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
572 if (nr_pages <= 0)
573 return 0;
574
575 shrink_dcache_memory(priority, gfp_mask);

(gdb) list *try_to_free_pages+0x34
0xa2c is in try_to_free_pages (vmscan.c:591).
586 int priority = DEF_PRIORITY;
587 int nr_pages = SWAP_CLUSTER_MAX;
588
589 gfp_mask = pf_gfp_mask(gfp_mask);
590 do {
591 nr_pages = shrink_caches(classzone, priority, gfp_mask, nr_pages);
592 if (nr_pages <= 0)
593 return 1;
594 } while (--priority);
595

(gdb) list *kswapd_balance_pgdat+0x47
0xac7 is in kswapd_balance_pgdat (vmscan.c:630).
625 zone = pgdat->node_zones + i;
626 if (unlikely(current->need_resched))
627 schedule();
628 if (!zone->need_balance)
629 continue;
630 if (!try_to_free_pages(zone, GFP_KSWAPD, 0)) {
631 zone->need_balance = 0;
632 __set_current_state(TASK_INTERRUPTIBLE);
633 schedule_timeout(HZ);
634 continue;

(gdb) list *kswapd_balance+0x16
0xb26 is in kswapd_balance (vmscan.c:655).
650 do {
651 need_more_balance = 0;
652 pgdat = pgdat_list;
653 do
654 need_more_balance |= kswapd_balance_pgdat(pgdat);
655 while ((pgdat = pgdat->node_next));
656 } while (need_more_balance);
657 }
658
659 static int kswapd_can_sleep_pgdat(pg_data_t * pgdat)

(gdb) list *kswapd+0x9b
0xc37 is in kswapd (/src/linux-2.4.19-pre8/include/linux/tqueue.h:121).
116
117 extern void __run_task_queue(task_queue *list);
118
119 static inline void run_task_queue(task_queue *list)
120 {
121 if (TQ_ACTIVE(*list))
122 __run_task_queue(list);
123 }
124
125 #endif /* _LINUX_TQUEUE_H */

(gdb) list *kernel_thread+0x28
0x3fc is in kernel_thread (process.c:492).
487 */
488 int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
489 {
490 long retval, d0;
491
492 __asm__ __volatile__(
493 "movl %%esp,%%esi\n\t"
494 "int $0x80\n\t" /* Linux/i386 system call */
495 "cmpl %%esp,%%esi\n\t" /* child or parent? */
496 "je 1f\n\t" /* parent - jump */

----------------------------------------------------------------------

William Lee Irwin III

May 20, 2002, 6:36:13 PM

On Mon, May 20, 2002 at 03:26:56PM -0500, Todd R. Eigenschink wrote:
> Someone else had suggested testing the memory and power supply.
> memtest86 is easy to run, so I'll try that. It'll have to be tonight,
> now.

Bitflips are usually things where a pointer turns up invalid (or
non-NULL) and the difference between it and a valid pointer (or NULL)
is one bit. I don't see that here and don't like blaming hardware.
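
A rough sketch of that rule of thumb, in plain user-space C. The helper names are made up, and the pointer values in the usage note are illustrative rather than taken from a known-good run.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bit positions in which two 32-bit values differ. */
static int differing_bits(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    int n = 0;

    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* A value "looks like" a bitflip if it is exactly one bit away from
 * the value that was expected at that location. */
static int looks_like_bitflip(uint32_t observed, uint32_t expected)
{
    return differing_bits(observed, expected) == 1;
}
```

With a made-up expected pointer of 0xc1095000, an observed 0xc1095400 is one bit away and would look like a flip; an observed 0x00000000 is seven bits away and would not.
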

On Mon, May 20, 2002 at 03:26:56PM -0500, Todd R. Eigenschink wrote:
> Well, after my posting from earlier today, I recompiled the kernel
> after stripping some more stuff. I just induced an oops in that one,
> so I can list the call stack for it.

Nice, I presume you've got -g there? Any chance of doing something like
objdump --disassemble --source vmlinux and sending me the annotated
disassembly of __wake_up()? I want to doublecheck something...

On Mon, May 20, 2002 at 03:26:56PM -0500, Todd R. Eigenschink wrote:
> No IDE stuff this time; this looks a lot like most of the other ones
> I've seen. This morning was the first time I've ever seen IDE stuff
> in the post-oops call stack.

This is pretty strange, yes.

On Mon, May 20, 2002 at 03:26:56PM -0500, Todd R. Eigenschink wrote:
> It seems I can pretty much induce them at will, now. I started up
> four simultaneous Webtrends sessions, which grow fairly quickly to
> 400-600 MB each, give or take. (The machine has 2 GB of RAM, so it
> only swaps a little, sometimes.) Within half an hour, it fell over.
> Here's the oops itself, then the gdb output.

Great stuff! Thanks.

On Mon, May 20, 2002 at 03:26:56PM -0500, Todd R. Eigenschink wrote:
> Oops: 0000
> CPU: 1
> EIP: 0010:[<c0116363>] Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010087
> eax: c2802db4 ebx: c2002db4 ecx: 00000000 edx: 00000003
> esi: c2802db0 edi: c2802db0 ebp: f7bf3ee8 esp: f7bf3ecc
> ds: 0018 es: 0018 ss: 0018

Okay, %ecx is 0 -- no bitflip, just plain old NULL...

On Mon, May 20, 2002 at 03:26:56PM -0500, Todd R. Eigenschink wrote:
> Code; c0116363 <__wake_up+3b/c0>
> 00000000 <_EIP>:
> Code; c0116363 <__wake_up+3b/c0> <=====
> 0: 8b 01 mov (%ecx),%eax <=====
> Code; c0116365 <__wake_up+3d/c0>
> 2: 85 45 fc test %eax,0xfffffffc(%ebp)
> Code; c0116368 <__wake_up+40/c0>
> 5: 74 66 je 6d <_EIP+0x6d> c01163d0 <__wake_up+a8/c

Okay, the offending instruction is mov (%ecx), %eax -- dereferencing the
NULL %ecx...

On Mon, May 20, 2002 at 03:26:56PM -0500, Todd R. Eigenschink wrote:
> (gdb) list *__wake_up+0x3b
> 0x973 is in __wake_up (sched.c:731).
> 726 unsigned int state;
> 727 wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list);
> 728
> 729 CHECK_MAGIC(curr->__magic);
> 730 p = curr->task;
> 731 state = p->state;
> 732 if (state & mode) {
> 733 WQ_NOTE_WAKER(curr);
> 734 if (try_to_wake_up(p, sync) && (curr->flags&WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> 735 break;

This makes it pretty clear the offending instruction corresponds to the
first dereference of curr->task. Someone's leaving a NULL pointer in
there when they shouldn't. So this entire call chain has nothing to do
with the offender -- it only trips over the bad pointer the offending
code left behind. This looks like a PITA. The objdump --disassemble
--source stuff is just to have the assembly and source next to each
other for a "more convincing" demonstration, not that this isn't already
pretty good as it stands. Of course, finding the offender will be painful.
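
To spell out the reasoning in a form that can be compiled in user space: "mov (%ecx),%eax" has no displacement, so it loads from offset 0 of whatever %ecx points at. That fits p->state with p == NULL if state is the first member of the task structure, which is an assumption about the 2.4 layout here. The structs below are heavily simplified, hypothetical stand-ins, not the kernel definitions, and both helpers are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for task_struct; only the first members matter. */
struct fake_task {
    volatile long state;   /* assumed to sit at offset 0 */
    unsigned long flags;
};

/* Hypothetical stand-in for wait_queue_t. */
struct fake_wait_queue {
    unsigned int flags;
    struct fake_task *task;
};

/* Address the CPU would load from for "p->state", computed without
 * actually dereferencing p. */
static uintptr_t state_load_address(const struct fake_task *p)
{
    return (uintptr_t)p + offsetof(struct fake_task, state);
}

/* Checked version of "p = curr->task; state = p->state;".
 * Returns 0 and stores the state, or -1 if curr->task is NULL. */
static int read_state_checked(const struct fake_wait_queue *curr, long *state)
{
    assert(curr != NULL && state != NULL);
    if (curr->task == NULL)
        return -1;
    *state = curr->task->state;
    return 0;
}
```

With p == NULL the computed load address is 0, which is consistent with the register dump showing %ecx as 00000000 at the faulting instruction.
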

Cheers,
Bill

William Lee Irwin III

May 20, 2002, 7:28:07 PM

On Mon, May 20, 2002 at 06:07:12PM -0500, Todd R. Eigenschink wrote:
> For whatever this may be worth--probably nothing--I have softdog
> compiled in, but it has only successfully rebooted after an oops maybe
> twice out of 20 or more oopsen. On a bunch of them, the message has
> come out to the serial console that it was initiating a reboot (but it
> didn't). Most of the time, it's just the oops and then...darkness.

Actually, getting a notion of your sourcebase and what's actually
running sounds like a great idea. Any chance you could rattle off what
patches you've got and/or name the tree, and maybe send me a .config?
Also, any chance you could tell me a little about the hardware?
I'm not going to tell you what to run or not to run, I just want to
know where to start looking.

On Mon, May 20, 2002 at 06:07:12PM -0500, Todd R. Eigenschink wrote:
> Also, on the off chance that this is a code generation problem, this
> is gcc 2.95.3. I actually was about to say 3.0.4 and wait for the
> slaps-upside-the-head, but I just checked and realized I haven't
> upgraded this box.

I don't know of any particular issues with gcc 2.95.3, but I'll compare
the disassemblies you sent me just in case.

Your help in tracking this down has been immense, I hope you have the
patience to bear with me as I try to fix this for you.

Thanks,

Todd R. Eigenschink

May 20, 2002, 7:59:33 PM

William Lee Irwin III writes:

>Actually, getting a notion of your sourcebase and what's actually
>running sounds like a great idea. Any chance you could rattle off what
>patches you've got and/or name the tree, and maybe send me a .config?
>Also, any chance you could tell me a little about the hardware?
>I'm not going to tell you what to run or not to run, I just want to
>know where to start looking.

Kernel: vanilla 2.4.19-pre8 at the moment. I recompiled after adding
Stephen Tweedie's latest ext3 patch the other night, but that's it.
I've been following the 2.4.19-pre kernels "religiously", but never
mix in *any* other patches. While I don't have any actual oops output
from previous kernels, I think this has been around in every
2.4.19-pre. (I've been having trouble for longer than that, but my
last round--see link below--at least *appeared* different.)

Stuff That Runs: vanilla. syslog-ng, bind 9.2.1, gated, portmap,
ypserv, xinetd, automount, cron, rpc.mountd, ypbind, rpc.nfsd, Apache
(hardly ever touched), Backup Exec agent, postgres 7.2.1 (only hit by
Apache).

Webtrends runs early every morning. A bunch of other machines rcp log
files to it between midnight and 04:00. I've had oopsen while
webtrends is running and while it's not running. I've had them just
when there are rsh/rcp sessions from a couple different machines at
the same time. I've even had them when the machine is (as far as I
could predict) completely idle.

If you have suggestions for stuff to run (or not)--whatever--I'll be
glad to try it. I can start going backwards kernel-wise, if you want
me to try to pin a starting point for the problem.

A couple other references:

http://groups.google.com/groups?q=todd+eigenschink&hl=en&lr=&ie=utf-8&oe=utf-8&scoring=d&selm=linux.kernel.15404.36497.77658.797884%40rtfm.ofc.tekinteractive.com&rnum=7

http://groups.google.com/groups?q=todd+eigenschink&hl=en&lr=&ie=utf-8&oe=utf-8&scoring=d&selm=linux.kernel.3C3D375C.E4A7EE77%40zip.com.au&rnum=6

>Your help in tracking this down has been immense, I hope you have the
>patience to bear with me as I try to fix this for you.

I have a lot more patience than kernel hacking skill, so I'll do what
I can, and you do your thing. :-)

A steak dinner and a case of your favorite if you fix it. I'm
*really* tired of getting paged and driving in to the office in the
wee hours of the morning to hit the freaking reset button. I do
preemptive reboots some evenings so I can control it, but it may still
croak a couple hours later. (I'd love an APC MasterSwitch right now,
but I can do a *lot* of driving and switch-flipping for $600.)

Todd

(Hardware info and .config follows.)

----------------------------------------------------------------------

Hardware:

Intel L440GX-C mainboard. Dual P3/500 CPUs, 2 GB of RAM.

1 9GB SCSI disk, 1 36GB SCSI, 4 x 30GB IDE disks, all on the internal
IDE & Adaptec SCSI. (The IDE used to be one 4-disk softraid RAID0
partition; now it's two separate 2-disk RAID0 partitions.)

----------------------------------------------------------------------
"grep =y .config" (nothing configured as modules). It had been
CONFIG_MPENTIUMIII; I recompiled as M586 a few days ago. No change.

CONFIG_X86=y
CONFIG_ISA=y
CONFIG_UID16=y
CONFIG_EXPERIMENTAL=y
CONFIG_MODULES=y
CONFIG_MODVERSIONS=y
CONFIG_KMOD=y
CONFIG_M586=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_USE_STRING_486=y
CONFIG_X86_ALIGNMENT_16=y
CONFIG_X86_PPRO_FENCE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_HIGHMEM4G=y
CONFIG_HIGHMEM=y
CONFIG_MTRR=y
CONFIG_SMP=y
CONFIG_HAVE_DEC_LOCK=y
CONFIG_NET=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_PCI=y
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_NAMES=y
CONFIG_SYSVIPC=y
CONFIG_SYSCTL=y
CONFIG_KCORE_ELF=y
CONFIG_BINFMT_ELF=y
CONFIG_BLK_DEV_FD=y
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_RAID0=y
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_NETLINK_DEV=y
CONFIG_NETFILTER=y
CONFIG_FILTER=y
CONFIG_UNIX=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_NF_CONNTRACK=y
CONFIG_IP_NF_FTP=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MATCH_MULTIPORT=y
CONFIG_IP_NF_MATCH_STATE=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_TARGET_LOG=y
CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_BLK_DEV_IDECD=y
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_BLK_DEV_IDEDMA_PCI=y
CONFIG_IDEDMA_PCI_AUTO=y
CONFIG_BLK_DEV_IDEDMA=y
CONFIG_BLK_DEV_ADMA=y
CONFIG_BLK_DEV_PIIX=y
CONFIG_PIIX_TUNING=y
CONFIG_IDE_CHIPSETS=y
CONFIG_IDEDMA_AUTO=y
CONFIG_BLK_DEV_IDE_MODES=y
CONFIG_SCSI=y
CONFIG_BLK_DEV_SD=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_AIC7XXX=y
CONFIG_NETDEVICES=y
CONFIG_NET_ETHERNET=y
CONFIG_NET_PCI=y
CONFIG_EEPRO100=y
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_SERIAL=y
CONFIG_SERIAL_CONSOLE=y
CONFIG_UNIX98_PTYS=y
CONFIG_WATCHDOG=y
CONFIG_SOFT_WATCHDOG=y
CONFIG_RTC=y
CONFIG_AUTOFS_FS=y
CONFIG_AUTOFS4_FS=y
CONFIG_EXT3_FS=y
CONFIG_JBD=y
CONFIG_RAMFS=y
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_PROC_FS=y
CONFIG_DEVPTS_FS=y
CONFIG_EXT2_FS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
CONFIG_SUNRPC=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_MSDOS_PARTITION=y
CONFIG_NLS=y
CONFIG_NLS_CODEPAGE_437=y
CONFIG_NLS_ISO8859_1=y
CONFIG_VGA_CONSOLE=y

----------------------------------------------------------------------