Kernel panic running on Digital Ocean


Paul Dixon

Dec 4, 2016, 5:57:08 PM
to CoreOS User
I've got a CoreOS host running v1185.3.0 on a Digital Ocean droplet. This host runs a number of containers, and typically within 24 hours the host will completely fail.

There are no clues in the journal, but the console for the server will display a message like this:

yyy:xxx  blocked for more than 120 seconds
Not tainted 4.7.3-coreos-r2
echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message

I've attached a screengrab of the console output showing this and the call stack.

Does anyone have suggestions for further diagnosis? I wondered if there was a problem with virtual memory, and there is a spike in dirty pages around the time of the crash, but I don't know if that's a cause or effect. What could a container be doing to cause this kind of failure?
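For anyone wanting to watch the dirty-page and swap figures in the run-up to a hang, here is a minimal sketch that pulls the relevant fields out of /proc/meminfo. It assumes Python 3 is available (for example inside a container, since /proc/meminfo is not namespaced); the function names are hypothetical and chosen only for illustration.

```python
def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo field names to their numeric value.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def writeback_snapshot(text):
    """Pick out the fields related to dirty pages and swap usage.

    Fields missing from the input are returned as None.
    """
    info = parse_meminfo(text)
    keys = ("Dirty", "Writeback", "SwapTotal", "SwapFree")
    return {key: info.get(key) for key in keys}
```

Calling `writeback_snapshot(open("/proc/meminfo").read())` every few seconds and logging the result somewhere off the host would show whether the Dirty figure climbs before the task blocks or only afterwards. That would not prove cause or effect by itself, but it narrows the timing.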

console-crash.png

Brandon Philips

Dec 4, 2016, 7:45:24 PM
to Paul Dixon, CoreOS User
Can you disable swap? Swap + cloud machines generally isn't a great combination.



Paul Dixon

Dec 6, 2016, 4:06:41 AM
to CoreOS User
On 5 December 2016 at 00:45, Brandon Philips <brandon...@coreos.com> wrote:
Can you disable swap? Swap + cloud machines generally isn't a great combination.

Thanks Brandon. I can certainly do that; the swap is more of a safety net / force of habit (though I'd rather see a blip in swap usage than have the OOM killer strike!)
 

That post referenced https://github.com/coreos/bugs/issues/429 which looks very similar to mine. There, they report that the root cause appeared to be the use of losetup when creating the swap file.

I too have swap initialised by a unit which used losetup. It looks like the use of losetup was suggested for btrfs (https://github.com/coreos/docs/issues/52#issuecomment-45418877), but I'm using ext4. The fleet unit I'm using for swap was written a couple of years ago and would appear to have blindly incorporated this advice. 
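As a quick way to tell whether a host is in the same situation, the sketch below scans the contents of /proc/swaps for swap areas that sit on a loop device (the result of running losetup before swapon). It assumes Python 3 is available, for example inside a container; the function name is hypothetical.

```python
def loop_backed_swaps(swaps_text):
    """Return the swap entries from /proc/swaps that are backed by a loop device."""
    found = []
    # The first line of /proc/swaps is the column header, so skip it.
    for line in swaps_text.splitlines()[1:]:
        cols = line.split()
        if cols and cols[0].startswith("/dev/loop"):
            found.append(cols[0])
    return found
```

An empty result from `loop_backed_swaps(open("/proc/swaps").read())` means no active swap area goes through a loop device; a swap file enabled directly with swapon shows up under its own path instead.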

I can see that a page on swap usage was recently added to the documentation, and it doesn't mention losetup: https://coreos.com/os/docs/latest/adding-swap.html

After running a few tests, I'm almost certain this is the root cause of my crash.

Many thanks for the pointer!

Paul