Issue with 0.9.9.1 64-bit AMIs


cera

Dec 17, 2009, 1:00:10 PM
to ec2-on-rails-discuss
I'm using the 0.9.9.1 64-bit AMI (ami-5594733c), and my instances crash
from time to time, seemingly for no reason. After the first crash, an
instance goes into a death spiral: it crashes repeatedly for hours,
within minutes of each reboot, and is then stable for weeks. I have two
running instances that exhibit this behavior. The 32-bit instances
(with almost the same build) don't have this problem.

By crash, I mean my SSH connection gets closed and one of two things
happens when I try to SSH back into it:

Connection Timeout: I think because SSH isn't running yet
Connection Refused: I'm not sure why this is happening

The SSH error message will sometimes alternate between timeout and
refused. I've verified that this is because the box will sometimes
crash or halt in the middle of the reboot (so it can take 3 reboots
before I actually get a shell). Then the box will be up for 1-10
minutes. If it stays up for about 15 minutes, it then stays up until
the next episode.
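
A minimal sketch of how the two failure modes above could be logged automatically from a machine outside the instance, assuming Python 3 is available there. The names probe_port and watch are made up for illustration and are not part of ec2onrails. As a general rule, a refused connection means the host's network stack answered but nothing was listening on the port, while a timeout means no answer came back at all, so a timestamped record of the transitions may help show how far each reboot got.

```python
import socket
import time


def probe_port(host, port=22, timeout=5.0):
    """Classify one TCP connection attempt: 'open', 'refused', 'timeout' or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied with a reset: it is up, but nothing listens on the port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No reply within the timeout: host down, network not up, or packets dropped.
        return "timeout"
    except OSError:
        # Anything else (DNS failure, no route to host, ...).
        return "error"


def watch(host, port=22, interval=10.0, max_checks=None):
    """Probe repeatedly and print a timestamped line whenever the state changes.

    max_checks=None loops until interrupted; a number stops after that many
    probes. Returns the list of (timestamp, state) transitions that were seen.
    """
    transitions = []
    last = None
    checks = 0
    while max_checks is None or checks < max_checks:
        state = probe_port(host, port)
        if state != last:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(stamp, state)
            transitions.append((stamp, state))
            last = state
        checks += 1
        if max_checks is None or checks < max_checks:
            time.sleep(interval)
    return transitions
```

Calling watch with the instance's public hostname and leaving it running through a crash episode would give a timeline of open, refused, and timeout periods to compare against the reboot times.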

I cannot figure out the source of the crash, and it's not clear whether
the first crash, which starts the death spiral, is significant. I've
scoured everything in /mnt/log and /var/log and I'm completely baffled
as to the problem. There might be some new kernel logging facilities
that I'm unaware of, but I think I've covered all of the obvious log
files. The box will even crash if I shut down most applications:

monit unmonitor all
/etc/init.d/cron stop
/etc/init.d/mysql stop
/etc/init.d/apache2 stop
/etc/init.d/mongrel stop
/etc/init.d/postfix stop
/etc/init.d/memcached stop
/etc/init.d/dbus stop
/etc/init.d/atd stop

I'm not sure what else is safe to shut down out of the following list
from ps:

root       1  0.0  0.0   3996   920 ?     Ss   19:37 0:00 /sbin/init
root       2  0.0  0.0      0     0 ?     S    19:37 0:00 [migration/0]
root       3  0.0  0.0      0     0 ?     SN   19:37 0:00 [ksoftirqd/0]
root       4  0.0  0.0      0     0 ?     S    19:37 0:00 [watchdog/0]
root       5  0.0  0.0      0     0 ?     S<   19:37 0:00 [events/0]
root       6  0.0  0.0      0     0 ?     S<   19:37 0:00 [khelper]
root       7  0.0  0.0      0     0 ?     S<   19:37 0:00 [kthread]
root       9  0.0  0.0      0     0 ?     S<   19:37 0:00 [xenwatch]
root      10  0.0  0.0      0     0 ?     S<   19:37 0:00 [xenbus]
root      18  0.0  0.0      0     0 ?     S<   19:37 0:00 [migration/1]
root      19  0.0  0.0      0     0 ?     SN   19:37 0:00 [ksoftirqd/1]
root      20  0.0  0.0      0     0 ?     S<   19:37 0:00 [watchdog/1]
root      21  0.0  0.0      0     0 ?     S<   19:37 0:00 [events/1]
root      22  0.0  0.0      0     0 ?     S<   19:37 0:00 [migration/2]
root      23  0.0  0.0      0     0 ?     SN   19:37 0:00 [ksoftirqd/2]
root      24  0.0  0.0      0     0 ?     S<   19:37 0:00 [watchdog/2]
root      25  0.0  0.0      0     0 ?     S<   19:37 0:00 [events/2]
root      26  0.0  0.0      0     0 ?     S<   19:37 0:00 [migration/3]
root      27  0.0  0.0      0     0 ?     SN   19:37 0:00 [ksoftirqd/3]
root      28  0.0  0.0      0     0 ?     S<   19:37 0:00 [watchdog/3]
root      29  0.0  0.0      0     0 ?     S<   19:37 0:00 [events/3]
root      61  0.0  0.0      0     0 ?     S<   19:37 0:00 [kblockd/0]
root      62  0.0  0.0      0     0 ?     S<   19:37 0:00 [kblockd/1]
root      63  0.0  0.0      0     0 ?     S<   19:37 0:00 [kblockd/2]
root      64  0.0  0.0      0     0 ?     S<   19:37 0:00 [kblockd/3]
root      65  0.0  0.0      0     0 ?     S<   19:37 0:00 [cqueue/0]
root      66  0.0  0.0      0     0 ?     S<   19:37 0:00 [cqueue/1]
root      67  0.0  0.0      0     0 ?     S<   19:37 0:00 [cqueue/2]
root      68  0.0  0.0      0     0 ?     S<   19:37 0:00 [cqueue/3]
root      73  0.0  0.0      0     0 ?     S<   19:37 0:00 [khubd]
root      75  0.0  0.0      0     0 ?     S<   19:37 0:00 [kseriod]
root     113  0.0  0.0      0     0 ?     S    19:37 0:00 [pdflush]
root     114  0.0  0.0      0     0 ?     S    19:37 0:00 [pdflush]
root     115  0.0  0.0      0     0 ?     S<   19:37 0:00 [kswapd0]
root     116  0.0  0.0      0     0 ?     S<   19:37 0:00 [aio/0]
root     117  0.0  0.0      0     0 ?     S<   19:37 0:00 [aio/1]
root     118  0.0  0.0      0     0 ?     S<   19:37 0:00 [aio/2]
root     119  0.0  0.0      0     0 ?     S<   19:37 0:00 [aio/3]
root     232  0.0  0.0      0     0 ?     S<   19:37 0:00 [kpsmoused]
root     275  0.0  0.0      0     0 ?     S<   19:37 0:00 [kjournald]
root     384  0.0  0.0  16848   956 ?     S<s  19:37 0:00 /sbin/udevd --daemon
dhcp     809  0.0  0.0  15100   756 ?     S<s  19:37 0:00 dhclient3 -e IF_METRIC=100 -pf /var/run/dhclient.eth0.pid -lf /var/lib/dhcp3/dhclient.eth0.leases eth0
root     880  0.0  0.0      0     0 ?     S<   19:37 0:00 [kjournald]
root    1591  0.0  0.0 106540  3480 ?     Ssl  19:40 0:00 /usr/local/bin/monit -I
syslog  1661  0.0  0.0  12288   740 ?     Ss   19:40 0:00 /sbin/syslogd -u syslog
root    1682  0.0  0.0   8128   584 ?     S    19:40 0:00 /bin/dd bs 1 if /proc/kmsg of /var/run/klogd/kmsg
klog    1684  0.0  0.0   5396  2160 ?     Ss   19:40 0:00 /sbin/klogd -P /var/run/klogd/kmsg
root    1727  0.0  0.0  50908  1160 ?     Ss   19:40 0:00 /usr/sbin/sshd
root    1898 71.2  0.0 161744  2592 ?     Rs   19:40 2:40 /usr/sbin/console-kit-daemon
root    2159  0.0  0.0   3856   580 tty1  Ss+  19:40 0:00 /sbin/getty 38400 tty1
root    2165  0.0  0.0  65892  2940 ?     Ss   19:41 0:00 sshd: admin [priv]
admin   2167  0.0  0.0  65892  1680 ?     S    19:41 0:00 sshd: admin@pts/0
admin   2168  0.0  0.0  21424  4308 pts/0 Ss   19:41 0:00 -bash
root    2272  0.0  0.0  65888  2936 ?     Ss   19:41 0:00 sshd: admin [priv]
admin   2277  0.0  0.0  65888  1676 ?     S    19:41 0:00 sshd: admin@pts/1
admin   2278  0.0  0.0  21420  4308 pts/1 Ss   19:41 0:00 -bash
root    2509  0.0  0.0   9364   644 pts/1 S+   19:42 0:00 tail -f /var/run/klogd/kmsg
admin   2519  0.0  0.0  16040  1096 pts/0 R+   19:44 0:00 ps auxww


This is more of a Linux debugging issue than an ec2onrails issue, but
can anyone answer any of the following questions?

1) Has anyone experienced something like this?
2) Can I shut down any other user processes? I haven't touched any
kernel modules yet, but that seems like a logical next step.
3) Is anyone aware of any additional Linux or Xen debugging/logging
facilities? I feel like I might see something if I could see the
console log on a monitor.
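
A rough sketch of how the manual search through /mnt/log and /var/log could be repeated after each crash, assuming Python 3 on the instance. The function name scan_logs and the pattern list are illustrative guesses at common kernel trouble signatures, not a complete set, and compressed rotated logs are not decoded.

```python
import os
import re

# Assumed signatures of kernel-level trouble; extend the list as needed.
CRASH_PATTERNS = re.compile(
    r"kernel panic|Oops|BUG:|Call Trace|Out of memory|oom-killer|soft lockup",
    re.IGNORECASE,
)


def scan_logs(roots=("/var/log", "/mnt/log")):
    """Yield (path, line_number, line) for every line matching CRASH_PATTERNS."""
    for root in roots:
        # os.walk silently yields nothing for a directory that does not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    # errors="replace" keeps binary or mis-encoded files from aborting the scan.
                    with open(path, "r", errors="replace") as handle:
                        for number, line in enumerate(handle, start=1):
                            if CRASH_PATTERNS.search(line):
                                yield path, number, line.rstrip("\n")
                except OSError:
                    # Unreadable file (permissions, removed during rotation): skip it.
                    continue
```

Looping over scan_logs() and printing each hit right after a reboot would show whether anything was written before the box went down. An empty result would suggest the crash happens before syslog can flush, which is where the console log would matter.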

Any help is greatly appreciated. Thanks! -Chris
