[gentoo-user] kernel panic logging

shabanip

Feb 25, 2005, 3:50:09 AM

How can I log kernel panics (with core dump details) on kernel 2.6?
Thanks,
Payam Shabanian
shab...@avapajoohesh.com

--
gento...@gentoo.org mailing list

Frank Schafer

Feb 25, 2005, 4:10:12 AM

Hi,

I don't think that anyone will find a means to log something if your
kernel panics.

Regards
Frank

PS: Regardless of the kernel version used ;)


Bastian Balthazar Bux

Feb 25, 2005, 4:40:35 AM

shabanip wrote:

Doesn't
# ulimit -c unlimited
work?

--
No problem is so formidable that you can't walk away from it.
~ Charles M. Schulz
But sometimes run fast is better
~ Francesco R.

Mac

Feb 25, 2005, 4:50:13 AM

*FYI :)*
*Original article:*
http://resource.intel.com/telecom/support/tnotes/tnbyos/2000/tn062.htm

*How to Troubleshoot Linux Kernel Panics*

*Problem Description:*
Kernel panics on Linux are hard to identify and troubleshoot.
Troubleshooting kernel panics often requires reproducing a situation
that occurs rarely and collecting data that is difficult to gather.

*Solution Summary:*
This document outlines several techniques that will help reduce the
amount of time necessary to troubleshoot a kernel panic.

*Technical Discussion:*
*What is a kernel panic?*
As the name implies, the Linux kernel gets into a situation where it
doesn’t know what to do next. When this happens, the kernel prints as
much information as it can about the problem; how much it can print
depends on what caused the panic.

There are two main kinds of kernel panics:

1. Hard Panic – also known as Aieee!
2. Soft Panic – also known as Oops

*What can cause a kernel panic?*
Only modules that are located within kernel space can _directly_ cause
the kernel to panic. To see what modules are dynamically loaded, run
lsmod – this shows all dynamically loaded modules (Dialogic drivers,
LiS, SCSI driver, filesystem, etc.). In addition to these dynamically
loaded modules, components that are built into the kernel (memory map,
etc.) can cause a panic.

Since hard panics and soft panics are different in nature, we will
discuss how to deal with each separately.

*How to Troubleshoot a Hard Kernel Panic*
*Hard panics – symptoms:*

1. Machine is completely locked up and unusable.
2. Num Lock / Caps Lock / Scroll Lock keys usually blink.
3. If in console mode, dump is displayed on monitor (including the
phrase “Aieee!”).
4. Similar to Windows Blue Screen.

*Hard panics – causes:*
The most common cause of a hard kernel panic is when a driver crashes
within an interrupt handler, usually because it tried to access a null
pointer within the interrupt handler. When this happens, that driver
cannot handle any new interrupts and eventually the system crashes. This
is not exclusive to Dialogic drivers.

*Hard panics – information to collect:*
Depending on the nature of the panic, the kernel will log all
information it can prior to locking up. Since a kernel panic is a
drastic failure, it is uncertain how much information will be logged.
Below are key pieces of information to collect. It is important to
collect as many of these as possible, but there is no guarantee that all
of them will be available, especially the first time a panic is seen.

1. /var/log/messages -- sometimes the entire kernel panic stack trace
   will be logged there
2. Application / Library logs (RTF, cheetah, etc.) – may show what
   was happening before the panic
3. Other information about what happened just prior to the panic, or
   how to reproduce
4. Screen dump from console. Since the OS is locked, you cannot cut
   and paste from the screen. There are two common ways to get this
   info:
   * Digital picture of the screen (preferred, since it’s quicker and
     easier)
   * Copying the screen with pen and paper or typing it into another
     computer

If the dump is not available either in /var/log/messages or on the
screen, follow these tips to get a dump:

1. If in GUI mode, switch to full console mode – no dump info is
   passed to the GUI (not even to GUI shell).
2. Make sure the screen stays on during the full test run – if a
   screen saver kicks in, the screen won’t return after a kernel
   panic. Use these settings to ensure the screen stays on.
   * setterm -blank 0
   * setterm -powerdown 0
   * setvesablank off
3. From console, copy dump from screen (see above).

*Hard panics – Troubleshooting when a full trace is available*
The stack trace is the most important piece of information to use in
troubleshooting a kernel panic. It is often crucial to have a full stack
trace, something that may not be available if only a screen dump is
provided – the top of the stack may scroll off the screen, leaving only
a partial stack trace. *If a full trace is available, it is usually
sufficient to isolate root cause.* To identify whether or not you have a
large enough stack trace, look for a line with EIP, which will show what
function call and module caused the panic. In the example below, this is
shown in the following line:
EIP is at _dlgn_setevmask [streams-dlgnDriver] 0xe

If the culprit is a Dialogic driver you will see a module name with:
streams-xxxxDriver (xxxx = dlgn, dvbm, mercd, etc.)
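
A quick way to check a saved trace for that line is sketched below. The helper names has_eip_line and eip_culprit are made up for illustration, and the pattern assumes the 2.4-style "EIP is at function [module] offset" layout shown in this article; other kernel versions may format the line differently.

```shell
# Hypothetical helper: succeed if the saved trace file contains an "EIP is at" line.
has_eip_line() {
    grep -q 'EIP is at' "$1"
}

# Hypothetical helper: print the function and module named on that line.
# Assumes the layout "EIP is at <function> [<module>] <offset>".
eip_culprit() {
    sed -n 's/^.*EIP is at \([^ ]*\) \[\([^]]*\)].*$/function=\1 module=\2/p' "$1"
}
```

With the trace saved to a file (say panic.txt, a placeholder name), `has_eip_line panic.txt && eip_culprit panic.txt` prints the function and module if the line is present and prints nothing for a partial trace.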

*Hard panic – full trace example:*
Unable to handle kernel NULL pointer dereference at virtual address 0000000c
printing eip:
f89e568a
*pde = 32859001
*pte = 00000000
Oops: 0000
Kernel 2.4.9-31enterprise
CPU: 1
EIP: 0010:[<f89e568a>] Tainted: PF
EFLAGS: 00010096
*EIP is at _dlgn_setevmask [streams-dlgnDriver] 0xe*
eax: 00000000 ebx: f65f5410 ecx: f5e16710 edx: f65f5410
esi: 00001ea0 edi: f5e23c30 ebp: f65f5410 esp: f1cf7e78
ds: 0018 es: 0018 ss: 0018
Process pwcallmgr (pid: 10334, stackpage=f1cf7000)
Stack: 00000000 c01067fa 00000086 f1cf7ec0 00001ea0 f5e23c30 f65f5410 f89e53ec
f89fcd60 f5e16710 f65f5410 f65f5410 f8a54420 f1cf7ec0 f8a4d73a 0000139e
f5e16710 f89fcd60 00000086 f5e16710 f5e16754 f65f5410 0000034a f894e648
Call Trace: [setup_sigcontext+218/288] setup_sigcontext [kernel] 0xda
Call Trace: [<c01067fa>] setup_sigcontext [kernel] 0xda
[<f89e53ec>] dlgnwput [streams-dlgnDriver] 0xe8
[<f89fcd60>] Sm_Handle [streams-dlgnDriver] 0x1ea0
[<f8a54420>] intdrv_lock [streams-dlgnDriver] 0x0
[<f8a4d73a>] Gn_Maxpm [streams-dlgnDriver] 0x8ba
[<f89fcd60>] Sm_Handle [streams-dlgnDriver] 0x1ea0
[<f894e648>] lis_safe_putnext [streams] 0x168
[<f8a7b098>] __insmod_streams-dvbmDriver_S.bss_L117376 [streams-dvbmDriver] 0xab8
[<f8a78821>] dvbmwput [streams-dvbmDriver] 0x6f5
[<f8a79f98>] dvwinit [streams-dvbmDriver] 0x2c0
[<f894e648>] lis_safe_putnext [streams] 0x168
[<f893e6d8>] lis_strputpmsg [streams] 0x54c
[<f895482e>] __insmod_streams_S.rodata_L35552 [streams] 0x182e
[<f8951227>] sys_putpmsg [streams] 0x6f
[system_call+51/56] system_call [kernel] 0x33
[<c010719b>] system_call [kernel] 0x33
Nov 28 12:17:58 talus kernel:
Nov 28 12:17:58 talus kernel:
Code: 8b 70 0c 8b 06 83 f8 20 8b 54 24 20 8b 6c 24 24 76 1c 89 5c

*Hard panics – Troubleshooting when a full trace is not available*
If only a partial stack trace is available, it can be tricky to isolate
the root cause, since there is no explicit information about what module
or function call caused the panic. Instead, only the calls leading up to
the final call will be seen in a partial stack trace. In this case,
it is very important to collect as much information as possible about
what happened leading up to the kernel panic (application logs, library
traces, steps to reproduce, etc).

*Hard panic – partial trace example (note there is no line with EIP
information):*
[<c01e42e7>] ip_rcv [kernel] 0x357
[<f8a179d5>] sramintr [streams_dlgnDriver] 0x32d
[<f89a3999>] lis_spin_lock_irqsave_fcn [streams] 0x7d
[<f8a82fdc>] inthw_lock [streams_dlgnDriver] 0x1c
[<f8a7bad8>] pwswtbl [streams_dlgnDriver] 0x0
[<f8a15442>] dlgnintr [streams_dlgnDriver] 0x4b
[<f8a7c30a>] Gn_Maxpm [streams_dlgnDriver] 0x7ae
[<c0123bc1>] __run_timers [kernel] 0xd1
[<c0108a6e>] handle_IRQ_event [kernel] 0x5e
[<c0108c74>] do_IRQ [kernel] 0xa4
[<c0105410>] default_idle [kernel] 0x0
[<c0105410>] default_idle [kernel] 0x0
[<c022fab0>] call_do_IRQ [kernel] 0x5
[<c0105410>] default_idle [kernel] 0x0
[<c0105410>] default_idle [kernel] 0x0
[<c010543d>] default_idle [kernel] 0x2d
[<c01054c2>] cpu_idle [kernel] 0x2d
[<c011bb86>] __call_console_drivers [kernel] 0x4b
[<c011bcfb>] call_console_drivers [kernel] 0xeb
Code: 8b 50 0c 85 d2 74 31 f6 42 0a 02 74 04 89 44 24 08 31 f6 0f
<0> Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
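
When there is no EIP line, one rough triage step is to count which modules appear in the trace entries. The sketch below is only an illustration: the helper name module_tally is hypothetical, it assumes entries of the form "[<address>] function [module] offset" as in the example above, and a high count does not prove a module is at fault; it only suggests where to look first.

```shell
# Hypothetical helper: read a trace on stdin and count how often each module
# appears in entries of the form "[<address>] function [module] offset".
# If a line holds several entries, only the last one on that line is counted.
module_tally() {
    sed -n 's/^.*\[<[0-9a-f]*>] [^ ]* \[\([^]]*\)].*$/\1/p' | sort | uniq -c | sort -rn
}
```

For a trace saved to a file (placeholder name partial.txt), `module_tally < partial.txt` lists the modules, most frequent first.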

*Hard panics – using kernel debugger (KDB)*
If only a partial trace is available and the supporting information is
not sufficient to isolate root cause, it may be useful to use KDB. KDB
is a tool that is compiled into the kernel that causes the kernel to
break into a shell rather than lock up when a panic occurs. This enables
you to collect additional information about the panic, which is often
useful in determining root cause.

Some important things to note about using KDB:

1. If this is a potential Dialogic issue, technical support should be
   contacted prior to the use of KDB.
2. Must use base kernel – i.e. 2.4.18 kernel instead of 2.4.18-5 from
   Red Hat. This is because KDB is only available for the base
   kernels, and not the builds created by Red Hat. While this does
   create a slight deviation from the original configuration, it
   usually does not interfere with root cause analysis.
3. Need different Dialogic drivers compiled to handle the specific
   kernel.

*How to Troubleshoot a Soft Kernel Panic*
*Soft panics – symptoms:*

1. Much less severe than hard panic.
2. Usually results in a segmentation fault.
3. Can see an oops message – search /var/log/messages for string ‘Oops’.
4. Machine still somewhat usable (but should be rebooted after
   information is collected).
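
The search in item 3 can be done with grep. This is a minimal sketch: find_oops is a hypothetical helper name, and the default path is the /var/log/messages location used in this article, which may differ depending on how syslog is configured on a given system.

```shell
# Hypothetical helper: list lines containing "Oops", with line numbers.
# Defaults to /var/log/messages, the path used in this article.
find_oops() {
    grep -n 'Oops' "${1:-/var/log/messages}"
}
```

Reading the system log usually requires root. The line numbers make it easier to cut the surrounding trace out of the log afterwards.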

*Soft panics – causes:*
Almost anything that causes a module to crash when it is not within an
interrupt handler can cause a soft panic. In this case, the driver
itself will crash, but will not cause catastrophic system failure since
it was not locked in the interrupt handler. The same possible causes
exist for soft panics as do for hard panics (e.g. accessing a null
pointer during runtime).

*Soft panics – information to collect:*
When a soft panic occurs, the kernel will generate a dump that contains
kernel symbols – this information is logged in /var/log/messages. To
begin troubleshooting, use the ksymoops utility to turn kernel symbols
into meaningful data.

To generate a ksymoops file:

1. Create new file from text of stack trace found in
   /var/log/messages. Make sure to strip off timestamps, otherwise
   ksymoops will fail.
2. Run ksymoops on new stack trace file:
   Generic: ksymoops -o [location of Dialogic drivers] filename
   Example: ksymoops -o /lib/modules/2.4.18-5/misc ksymoops.log
   All other defaults should work fine
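
Step 1 can be scripted with sed. The sketch below assumes the classic syslog prefix seen earlier in this article ("Nov 28 12:17:58 talus kernel:"); the helper name strip_syslog_prefix and the input file name are placeholders, and a syslog daemon configured with a different timestamp format would need a different pattern.

```shell
# Hypothetical helper: strip a leading "Mon DD HH:MM:SS host kernel: " prefix
# from each line on stdin; lines without the prefix pass through unchanged.
strip_syslog_prefix() {
    sed 's/^[A-Z][a-z][a-z] [ 0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] [^ ]* kernel: \{0,1\}//'
}

# Example (file names are placeholders):
#   strip_syslog_prefix < oops-from-messages.txt > ksymoops.log
#   ksymoops -o /lib/modules/2.4.18-5/misc ksymoops.log
```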

For a man page on ksymoops, see the following webpage:
http://gd.tuwien.ac.at/linuxcommand.org/man_pages/ksymoops8.html

*Soft panics – oops trace example:*
Code: 8b 70 0c 50 e8 69 f9 f8 ff 83 c4 10 83 f8 08 74 35 66 c7 47
*EIP; f89ba71e <[streams-dlgnDriver]_dlgn_setidlestate+1e/8c>*
Trace; f8951bd6 <[streams]lis_wakeup_close+86/110>
Trace; f8a2705c <[streams-dlgnDriver]__module_parm_r4_feature+280/1453>
Trace; f8a27040 <[streams-dlgnDriver]__module_parm_r4_feature+264/1453>
Trace; f89b9198 <[streams-dlgnDriver]dlgnwput+e8/204>

*Product List:*
System Release 5.1 for Linux
System Release 5.1 Service Pack 1 for Linux
System Release 5.1 Feature Pack 1 for Linux
System Release 5.1 Service Pack 3 for Linux

*Glossary of Acronyms/Terms:*
LiS -- Linux Streams
SCSI -- Small Computer Systems Interface
RTF -- Runtime Tracing Facility
KDB -- Kernel Debugger

*Related Documentation:*
N/A

*First Published:*
03/26/2004

*Last Updated:*
03/26/2004

*Attachments:*
N/A


Manuel McLure

Feb 25, 2005, 1:10:11 PM

shabanip wrote:
> how can i log kernel panics (with core dump details) on kernel 2.6?
> thanks,
> Payam Shabanian
> shab...@avapajoohesh.com

Back in the day when I was trying to help Linus debug an emu10k1 bug I
set up a serial console to another PC - that helped a lot. If you're in
X most of the time, there's not much you can do without setting up a
serial console or one of the kernel dump patches. These patches dump the
panic information into the swap area, where tools can look at it at
the next reboot. The problem is that once you've panicked it may be
unsafe to use any of the standard kernel methods to write to disk, since
pointers may be screwed up and you may end up corrupting a filesystem.
"Dump to floppy" patches are somewhat safer.

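For reference, a serial console is normally enabled through the kernel's console= boot parameter. The sketch below is an illustration under stated assumptions, not the setup described in this post: the port (ttyS0), the speed (115200n8) and the helper name has_serial_console are all placeholders.

```shell
# Kernel command line fragment for the boot loader entry. The port and speed
# are assumptions; use whatever matches the cable and the receiving PC:
#   console=tty0 console=ttyS0,115200n8
#
# Hypothetical helper: succeed if the given kernel command line names a
# serial console (a console= argument pointing at a ttyS device).
has_serial_console() {
    case " $1 " in
        *" console=ttyS"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the running kernel:
#   has_serial_console "$(cat /proc/cmdline)" && echo "serial console configured"
```

With both console= arguments present, kernel messages go to the local screen and to the serial line, so the second PC can capture a panic even when X is running.
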
--
Manuel A. McLure KE6TAW <man...@mclure.org> <http://www.mclure.org>
...for in Ulthar, according to an ancient and significant law,
no man may kill a cat. -- H.P. Lovecraft

Philip Nilsson

Feb 27, 2005, 6:10:13 AM

I think there's something one can do with kexec... I have not
investigated it, but the documentation should be included with the
kernel (if your kernel has kexec support, of course).