howto diagnose random reboots

12,532 views
Skip to first unread message

Shawn Brown

unread,
May 15, 2011, 6:13:00 AM5/15/11
to android-platform googlegroups.com
Hi,

I'm sure its not the hardware, so how to diagnose why my phone reboots
seemingly randomly.

Nexus One running stock 2.3.4

My previous Nexus One suffered the same issue running 2.2.x. I sent
it in to find the replacement suffers the same problem.

Note: I am in Japan so wonder if something related to the radio could
be the fault. Still, my GDP2 running stock 1.6 works fine. I run the
same apps on these (GDP2 1.6 and Nexus One 2.3.4) so am confused by
what it might be.

Well ok either 2.x doesn't like Japan or my apps but how to figure out
which one. Stop using all apps? Shouldn't there be a log if an app
took the system down? no? wishful thinking?

What should I do to figure this out?

Shawn

PS yes it rebooted while writing this message. I wanted to say that
it seems perhaps more frequent when using the sdcard but the phone was
just there idle...

Tez

unread,
May 17, 2011, 12:17:09 PM5/17/11
to android-platform
usually the cause for a reboot is an unhandled exception in the system
server. you monitor the logs from logcat to know what happened.
of course, getting it at the right moment is a sort of game of chance,
but its a start.

-Earlence

Chris Stratton

unread,
May 17, 2011, 12:52:18 PM5/17/11
to android-...@googlegroups.com
On Tuesday, May 17, 2011 12:17:09 PM UTC-4, Tez wrote:
usually the cause for a reboot is an unhandled exception in the system
server. you monitor the logs from logcat to know what happened.
of course, getting it at the right moment is a sort of game of chance,
but its a start.

Not exactly.  An exception may cause an android runtime restart, but should not cause a device reboot.  To the user the difference between the two sorts of restarts may not be very obvious.

Depending on android version, interesting log files from past crashes (of apps, runtime, and kernel) can be found in places such as

/proc/last_kmsg
/data/tombstones/
/data/dontpanic/
/data/system/dropbox/

And probably other places as well

Christopher Tate

unread,
May 19, 2011, 6:45:48 PM5/19/11
to android-...@googlegroups.com
Okay, a quick primer on how to diagnose "reboots"....

As Chris Stratton mentioned, there are two different kinds of event
that look similar. One is a hard [kernel- or radio-level] reboot of
the entire device; the other is a restart of only the Android runtime
while the kernel keeps running properly.

If you see the initial static splash image for some period of time,
followed by the Android boot animation, then it's a hard reboot. If
instead the crash takes the device immediately to the animated portion
of the sequence, it's a runtime restart.

If it's a runtime restart, you need to try to pull a bugreport from
the device as soon as possible -- during the boot animation, even;
certainly as soon as possible after you hit the home application
again. Even 30 seconds can make a difference here; there is a *lot*
of log activity during early boot, and what you want to do is make
sure you capture the content of the logs leading up to the runtime
restart.

Just capturing 'logcat' is not sufficient; you need a full bugreport.
There's a reason all that information is in the bugreport. :)

So, let's assume that it was a hard reboot: a kernel- or radio-level
crash and reset. In this case, the single most valuable section of
the bugreport is the "LAST KMSG". That's a copy of the dmesg output
from the previous boot session of the kernel -- the one that ended
with a kernel or similar crash. Look at the end of that section to
see whether there was some sort of kernel-level issue. Oftentimes
you'll find that there's a kernel panic being reported, or some such
thing. At that point you're off and running to diagnose what the
problem was, taking your usual approach to debugging kernel crashes.
There can also be a panic log in the bugreport, which may contain
other significant information.

It's possible that there was a radio-side problem. In that sort of
case, the end of the "LAST KMSG" kernel log will read something like
this:

[ 535.383270] ARM9 has CRASHED
[ 535.383666] smem: DIAG 'MOD SM-TM 00000 '

Obviously you will then worth with your radio code provider to address
the issue.

Now, if the device crash was just a runtime restart, not a
kernel-level crash, then the bugreport has different useful
information. Most importantly, the various logs (main, radio, event)
remain intact and active through a runtime restart, so the logcat
block, the event log, and the radio log all show continuity across
whatever it was that caused the Android runtime to crash. Learn how
to read the event log and the logcat output in particular.

Runtime restarts are almost always for one of two reasons: either a
component running in the system_server process has crashed outright,
or something has caused the system_server's process primary looper
thread to deadlock. The deadlock case is announced by a line in the
logcat buffer reading something very like this:

04-04 14:06:16.888 885 1089 W Watchdog: *** WATCHDOG KILLING SYSTEM
PROCESS: null

I'm going to talk about the NON-deadlock cases first; they're simpler
to diagnose: just look in the logcat trace for the messages about the
system process crashing. There will be a Dalvik stack trace there to
the point at which the fatal exception was thrown, and you then know
where to go looking for the bug. As always, the event log and the
primary text log contain different information; both are useful for
reviewing the state and activity of the device leading up to a crash.

If the crash *was* due to a deadlock, things get a little interesting.
When the watchdog declares a deadlock and forcibly kills the system
process, it first captures the current stack trace of every
system_server process thread into the usual ANR stack trace file
(/data/anr/traces.txt on most devices). This file is automatically
included in a bugreport -- yet another reason to pull a full bugreport
rather than just logcat.

Look at the timestamp of where the watchdog declared a problem. In
the example line I gave above, it was 04-04 14:06:16.888. That tells
you which set of stack traces you're interested in within the "LAST
ANR" section of the bugreport. Find the system_server stack dumps
that were captured at the time of the watchdog's message. For
example, in the bugreport that my example came from, the "LAST ANR"
section was headed:

------ VM TRACES AT LAST ANR (/data/anr/traces.txt: 2011-04-04 14:06:14) ------

and the file holds *two* sets of system_process stacks. One of them
starts like this:

----- pid 885 at 2011-04-04 14:05:42 -----
Cmd line: system_server

and the second starts like this:

----- pid 885 at 2011-04-04 14:06:12 -----
Cmd line: system_server

Note that the first was taken 30 seconds before the second, and the
second was just a few seconds before the watchdog message about
killing the system process to restart the system. The watchdog takes
a thread snapshot after 30 seconds of the looper being unresponsive,
then another snapshot after a full minute of unresponsiveness; and
it's after that full minute that it declares the deadlock and restarts
everything. You probably want to start with the later thread stacks,
the ones that reflect the state of affairs when the watchdog finally
gave up.

In most Android applications the primary looper thread is called
"main", but this is *not* true of the system_process. The primary
system_process looper thread is titled "android.server.ServerThread".
Find that thread's stack trace that was taken at the time the watchdog
declared the deadlock, and you'll be off and running as far as
diagnosing what has caused that looper thread to be unresponsive for >
60 seconds.

At this point you've got the information you need to figure out what
locks are being contended for, what unapproved long-running operations
might be mistakenly being run on the system process's main looper,
etc. Dalvik's stack dump output is very useful, especially since it
tells you things like which other thread currently holds a lock being
requested.

Good luck!

--
christopher tate
android framework engineer

> --
> You received this message because you are subscribed to the Google Groups "android-platform" group.
> To post to this group, send email to android-...@googlegroups.com.
> To unsubscribe from this group, send email to android-platfo...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/android-platform?hl=en.
>
>

Jean-Baptiste Queru

unread,
May 19, 2011, 6:53:34 PM5/19/11
to android-...@googlegroups.com
Do you mind if I grab a copy of that and put it on source.android.com?

JBQ

--
Jean-Baptiste M. "JBQ" Queru
Software Engineer, Android Open-Source Project, Google.

Questions sent directly to me that have no reason for being private
will likely get ignored or forwarded to a public forum with no further
warning.

Shawn Brown

unread,
Aug 15, 2011, 5:45:13 AM8/15/11
to android-...@googlegroups.com
> It's possible that there was a radio-side problem.  In that sort of
> case, the end of the "LAST KMSG" kernel log will read something like
> this:
>
> [ 535.383270] ARM9 has CRASHED
> [ 535.383666] smem: DIAG 'MOD SM-TM   00000

OK below is what I found.

[ 1818.629425] ARM9 has CRASHED
[ 1818.630249] smem: DIAG '[WCDMA] 09:29:27 FATAL: IST0 (mdsptask.c:00476) (1)
[ 1818.630279] '
[ 1818.631134]
[ 1818.631866] Restarting Linux version
2.6.37.4-cyanogenmod-01332-g7f230e8 (shade@toxygene) (gcc version
4.4.3 (GCC) ) #1 PREEMPT Tue Apr 12 12:54:14 EDT 2011
[ 1818.631927]

Does this look like a radio problem?

Shawn

Shawn Brown

unread,
Aug 15, 2011, 6:03:58 AM8/15/11
to android-...@googlegroups.com
> OK below is what I found.
>
> [ 1818.629425] ARM9 has CRASHED
> [ 1818.630249] smem: DIAG '[WCDMA] 09:29:27 FATAL: IST0 (mdsptask.c:00476) (1)
> [ 1818.630279] '
> [ 1818.631134]
> [ 1818.631866] Restarting Linux version
> 2.6.37.4-cyanogenmod-01332-g7f230e8 (shade@toxygene) (gcc version
> 4.4.3 (GCC) ) #1 PREEMPT Tue Apr 12 12:54:14 EDT 2011
> [ 1818.631927]


FYI, I've seen this on two different handset running stock ROMs as
well as this one now on cynogenmod.

Build: GRI40
Bootloader: 0.35.0017
Radio: 5.08.00.04
Network: SoftBank
Kernel: Linux version 2.6.37.4-cyanogenmod-01332-g7f230e8


(shade@toxygene) (gcc version 4.4.3 (GCC) ) #1 PREEMPT Tue Apr 12
12:54:14 EDT 2011

Command line: board_mahimahi.disable_uart3=0
board_mahimahi.usb_h2w_sw=0 board_mahimahi.disable_sdcard=0
diag.enabled=0 board_mahimahi.debug_uart=0 smisize=0
androidboot.baseband=5.08.00.04 androidboot.cid=GOOGL001
androidboot.carrier=GOOGLE androidboot.mid=PB9910000
androidboot.keycaps=qwerty androidboot.mode=normal
androidboot.serialno=MB047P800301 androidboot.bootloader=0.35.0017
no_console_suspend=1 wire.search_count=5

Shawn Brown

unread,
Aug 16, 2011, 6:37:24 PM8/16/11
to android-...@googlegroups.com, android-d...@googlegroups.com
Here is another bugreport of what keeps my nexus one crashing. Where
would I file a bug report on this?

[ 2836.578948] ARM9 has CRASHED
[ 2836.579284] smem: DIAG '[WCDMA] 22:27:18 FATAL: IST0
(mcalwcdma_demod.c:02059) (1)
[ 2836.579284] '
[ 2836.579620]
[ 2836.579864] Restarting Linux version


2.6.37.4-cyanogenmod-01332-g7f230e8 (shade@toxygene) (gcc version
4.4.3 (GCC) ) #1 PREEMPT Tue Apr 12 12:54:14 EDT 2011

[ 2836.579864]

below is another

>> [ 1818.629425] ARM9 has CRASHED
>> [ 1818.630249] smem: DIAG '[WCDMA] 09:29:27 FATAL: IST0 (mdsptask.c:00476) (1)
>> [ 1818.630279] '
>> [ 1818.631134]
>> [ 1818.631866] Restarting Linux version
>> 2.6.37.4-cyanogenmod-01332-g7f230e8 (shade@toxygene) (gcc version
>> 4.4.3 (GCC) ) #1 PREEMPT Tue Apr 12 12:54:14 EDT 2011
>> [ 1818.631927]

> Build: GRI40


> Bootloader: 0.35.0017
> Radio: 5.08.00.04
> Network: SoftBank
> Kernel: Linux version 2.6.37.4-cyanogenmod-01332-g7f230e8
> (shade@toxygene) (gcc version 4.4.3 (GCC) ) #1 PREEMPT Tue Apr 12
> 12:54:14 EDT 2011
> Command line: board_mahimahi.disable_uart3=0
> board_mahimahi.usb_h2w_sw=0 board_mahimahi.disable_sdcard=0
> diag.enabled=0 board_mahimahi.debug_uart=0 smisize=0
> androidboot.baseband=5.08.00.04 androidboot.cid=GOOGL001
> androidboot.carrier=GOOGLE androidboot.mid=PB9910000
> androidboot.keycaps=qwerty androidboot.mode=normal
> androidboot.serialno=MB047P800301 androidboot.bootloader=0.35.0017
> no_console_suspend=1 wire.search_count=5

Note: This is definately not CynogenMod related as sending the phone
with the original ROM back to HTC and trying a new handset did NOT
solve anything.

Christopher Tate

unread,
Aug 16, 2011, 6:45:28 PM8/16/11
to android-...@googlegroups.com
On Tue, Aug 16, 2011 at 3:37 PM, Shawn Brown <big.coff...@gmail.com> wrote:
Here is another bugreport of what keeps my nexus one crashing.  Where
would I file a bug report on this?

[ 2836.578948] ARM9 has CRASHED
[ 2836.579284] smem: DIAG '[WCDMA] 22:27:18 FATAL: IST0
(mcalwcdma_demod.c:02059) (1)
[ 2836.579284] '
[ 2836.579620]
 
Like it says: it's the radio code crashing.  The radio code is owned by the manufacturer, in this case I believe that means HTC (licensed from Qualcomm).  Is this a Nexus One, or an HTC Desire?  If it's a Nexus One I think you're simply out of luck; that device is no longer in production and effectively EOL'd.  Does HTC still sell the Desire?  If so then it's conceivable that they might still consider radio-firmware updates... though I wouldn't get my hopes up.

disillusioned

unread,
Aug 17, 2011, 8:25:11 PM8/17/11
to android-platform
Mine just started doing this, and I haven't looked at bugreport yet,
but I'm guessing it's the same thing based on this discussion.

I think the question is, what would cause this to spontaneously start
occurring? I've been on 2.3.4 for weeks now, with ample free space on
my Nexus One and no new apps or anything like that... what makes the
radio code just start freaking out? Meanwhile, my friend one office
over with the same exactly kernel hasn't had any issues...

Thanks for your help with tracking down the bugreports!

On Aug 16, 3:45 pm, Christopher Tate <ct...@google.com> wrote:

Chris Stratton

unread,
Aug 18, 2011, 1:06:22 AM8/18/11
to android-...@googlegroups.com
On Wednesday, August 17, 2011 8:25:11 PM UTC-4, disillusioned wrote:
I think the question is, what would cause this to spontaneously start
occurring? I've been on 2.3.4 for weeks now, with ample free space on
my Nexus One and no new apps or anything like that... what makes the
radio code just start freaking out? Meanwhile, my friend one office
over with the same exactly kernel hasn't had any issues...

A big questions would be if your friend has exactly the same radio firmware, since that (rather than the kernel) seems to be what is crashing.

Though most software bugs which make it through pre-shipment qualification and early lifecycle undetected to require some combination of factors or less-used pathways to trigger, and hitting those could be a result of external causes.

For that matter, do you have the same network provider?  Has your device crashed in locations where the other one has not?



Naseer

unread,
Aug 18, 2011, 1:49:09 AM8/18/11
to android-platform
Hi Chris,

That is very useful info on debugging crashes.
Do you have similar info for lockups in native code ?
For example - is there a way to get the stack of all the native
threads of system_server ?
I know it can be done with gdb, but it would be valuable to have this
info in a offline dump, like the bugreport.


Thanks,
-Naseer


On Aug 17, 3:45 am, Christopher Tate <ct...@google.com> wrote:

Naseer

unread,
Aug 18, 2011, 2:03:29 AM8/18/11
to android-platform
Which radio image version do you have ?

Can you try flashing this one - http://forum.cyanogenmod.com/topic/2547-new-radio-5080004/
?

disillusioned

unread,
Aug 18, 2011, 4:52:40 PM8/18/11
to android-platform
Identical radio firmware, identical device, purchased a month apart,
in the same office, (literally 5 feet from each other; location
doesn't seem to matter much) same service provider (TMO), same intact
bootloader, same exact build, both not rooted, both stock, etc.

2.3.4 GRJ22
Baseband:
32.41.00.32U_5.08.00.04

Kernel:
2.6.35.7-59463-g52630a3 android-build@apa28 #1

Looked at bugreport and had the exact same output in LAST KMSG:

ARM9 has CRASHED
smem: DIAG '[WCDMA] 22:27:18 FATAL: IST0
(mcalwcdma_demod.c:02059) (1)

(My timestamp was different I think; it's at home and I'm at the
office.)

Crashed WHILE logcatting and got nothing interesting at all.

Discussion here: http://www.google.com/support/forum/p/Google+Mobile/thread?fid=3f6f936e159bccc20004aacd32fa9189&hl=en
prompted me to turn off wifi, which seems to slow down the reboots.
(Yesterday, it was averaging about once an hour.)

I also turned off 3G and wifi and bt before bed last night and it made
it through the night. It failed again a few hours after I woke up
though.

Also changed the battery, did a full reboot, etc. Haven't done a
complete factory reset because of others saying it won't help, but I'm
considering it. Ugh.

Thanks for your follow up. I'm just not sure who we can get this in
front of that might care, but it's looking like a strangely endemic
issue that's cropping up all of a sudden in the past week or two.

Shawn Brown

unread,
Aug 18, 2011, 5:27:23 PM8/18/11
to android-...@googlegroups.com
Hi,

> Which radio image version do you have ?

Radio: 5.08.00.04

I also tried 5.12.00.8 - same problem.

Shawn

shridutt kothari

unread,
Jul 2, 2014, 3:58:24 AM7/2/14
to android-...@googlegroups.com
Thanks very helpful thread.
Reply all
Reply to author
Forward
0 new messages