On 28.08.2012 20:17, Charlie Poole wrote:
> I've been very happy with the service I now use (Core Networks) but
> they only provide the bare metal. Learning to admin the system has
> been a fun challenge and has generally kept us up much more reliably
> than before.
>
> However, we are now coming down about once a month, on average. The
> problem started when I was on vacation and I "solved" it each time by
> doing a cold boot from the service provider's control panel.
>
> Now it's time to actually deal with the cause, which of course must
> first be discovered.
>
> I'd welcome suggestions from anyone with more experience than I have
> maintaining Linux Debian systems, either here or offline.
>
> BTW, when the system is not responding, my own access through WebMin
> is down as well. Whatever I do now has to be based on tracks left in
> logs, etc.

Linux systems that just lock up are, in my experience, mainly caused by
two factors: faulty hardware and run-away processes that thrash the
machine.

Another possible cause of a system going offline would be some kind of
misconfiguration and/or admin error. I exclude this on the grounds that
you're renting only metal and you don't seem to be playing around with
the system.

The next possibility would be development versions of the kernel and
other software. I exclude that for the same reason as above.

I hope you have installed all security updates.
I hope you have recent and restorable backups.

If the server is not responsive at any level (web, shell, ping), this
would point towards a hardware problem. A ping reply in particular is
generated by the kernel itself and should always be available as long
as the kernel is still running.

If the server's IP is still pingable while the web server no longer
reacts, some kind of user-space foulness should be suspected.

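To make that distinction concrete, here is a rough Python sketch of a probe
you could run from a second machine. It is only an illustration: the function
names are made up, the ping flags assume the Linux iputils ping, and the
wording of the verdicts is my own shorthand for the two cases above.

```python
import http.client
import subprocess
import urllib.error
import urllib.request


def ping_ok(host, timeout_s=2):
    """Return True if a single ICMP echo request to host is answered."""
    # Calls the system ping binary (assumed: Linux iputils, -c count, -W timeout).
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def http_ok(url, timeout_s=10):
    """Return True if the web server sends back any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s):
            return True
    except urllib.error.HTTPError:
        # An error status (404, 500, ...) still proves the server process answers.
        return True
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, broken response, ...
        return False


def classify(ping_up, web_up):
    """Map the two probe results onto the cases discussed above."""
    if ping_up and web_up:
        return "ok"
    if ping_up:
        return "kernel alive, web server down: suspect user space"
    if web_up:
        return "web server up, ping unanswered: ICMP is probably filtered"
    return "nothing responds: machine may be gone, suspect hardware"
```

Something like classify(ping_ok(host), http_ok(url)) run every minute from
cron would do. The check asks for a real HTTP response rather than just an
open TCP port, because the kernel can still accept connections for a web
server process that has stopped doing any work.
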
As a next step I'd look into some kind of monitoring solution to notice
the next outage in a timely fashion, and then try to ascertain whether
or not the machine is really totally gone. If it is, you'll need to
contact your provider for replacement hardware. A log of recent failures
often helps to establish that it's not a software failure. Depending on
the provider, it might help, or even be necessary, to call while the
server is exhibiting its problems.

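For the failure log, it is enough to record one (timestamp, is_up) sample per
check and boil the samples down to outage intervals afterwards. A minimal
sketch of that step, assuming the samples are sorted by time (the function
name and the sample format are my invention, not part of any monitoring tool):

```python
def outage_intervals(samples):
    """Collapse sorted (timestamp, is_up) samples into (start, end) outages.

    An outage starts at the first failed sample and ends at the next
    successful one. An outage still ongoing at the last sample is
    reported with end=None.
    """
    intervals = []
    start = None
    for timestamp, is_up in samples:
        if not is_up and start is None:
            start = timestamp  # first failed check opens an outage
        elif is_up and start is not None:
            intervals.append((start, timestamp))  # first good check closes it
            start = None
    if start is not None:
        intervals.append((start, None))
    return intervals
```

The timestamps can be anything sortable, for example the values returned by
time.time(). The resulting list is the kind of thing you can paste into a
support ticket.
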
If the server is still responding at any level, it becomes complicated.
A lock-up followed by a hard reset often leaves the final file-system
transactions open; these contain the relevant log entries and will be
rolled back on restart. I'd still start by having a look into the log
files (/var/log) around the time of the hang-up. Maybe the kernel
started reporting out-of-memory situations, or perhaps you can identify
unusual access patterns in the web logs.

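If you'd rather not page through the logs by hand, a few lines of Python can
pull out the out-of-memory messages. This is a sketch under assumptions: the
exact kernel wording differs between kernel versions, so the pattern below
only covers the variants I know of, and the file names (/var/log/kern.log and
/var/log/syslog, the usual Debian defaults) may differ on your box.

```python
import re

# Phrases the Linux kernel logs around an OOM kill (wording varies by version).
OOM_PATTERN = re.compile(r"invoked oom-killer|Out of memory: Kill(ed)? process")


def find_oom_lines(lines):
    """Return the log lines that mention the kernel OOM killer."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path):
    """Scan one log file, replacing bytes that are not valid UTF-8."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_oom_lines(handle)
```

For example, print("\n".join(scan_log("/var/log/kern.log"))) lists the hits
with their timestamps, which you can then compare against the times of your
cold boots.
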
Some kind of performance monitoring would be nice, but that might not
help for very short-lived events. Look into munin or collectd. Some kind
of log shipping would be nice too, but that also might not help for very
short-lived events. Do you have an external syslog server available? I
could help with that for a while.

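As a poor man's version of both ideas, the server could send a small heartbeat
line with load and free memory to the external syslog server. The sketch below
uses Python's standard logging.handlers.SysLogHandler over UDP; the helper
names, the logger name "heartbeat" and the choice of fields are assumptions of
mine, and it reads /proc/meminfo, so it is Linux only.

```python
import logging
import logging.handlers
import os


def make_remote_logger(host, port=514, name="heartbeat"):
    """Return a logger that sends each record to a syslog server via UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integers (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def heartbeat_message():
    """Build one line with the load averages and free memory/swap."""
    load1, load5, load15 = os.getloadavg()
    with open("/proc/meminfo") as handle:
        mem = parse_meminfo(handle.read())
    return "load=%.2f/%.2f/%.2f memfree_kb=%d swapfree_kb=%d" % (
        load1, load5, load15, mem.get("MemFree", -1), mem.get("SwapFree", -1))
```

Run from cron once a minute, something like
make_remote_logger("logs.example.org").info(heartbeat_message()) would do,
where logs.example.org is a placeholder for whatever syslog host you end up
using. The last heartbeat that arrived tells you when the machine was still
alive and roughly what state it was in, although, as said, a very short-lived
event can still slip through between two heartbeats.
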
If you can get access to the console (serial or otherwise), it makes
sense to configure (r)syslog to write there. This can produce output in
cases where most other avenues are blocked.

You can contact me privately if you need further assistance.

Good luck hunting,
David