Kernel panic running on Digital Ocean

Paul Dixon

Dec 4, 2016, 5:57:08 PM

to CoreOS User

I've got a CoreOS host running v1185.3.0 on a Digital Ocean droplet. This host runs a number of containers, and typically within 24 hours the host will completely fail.

There are no clues in the journal, but the server console displays a message like this:

yyy:xxx blocked for more than 120 seconds

Not tainted 4.7.3-coreos-r2

echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message

I've attached a screengrab of the console output showing this and the call stack.

Does anyone have suggestions for further diagnosis? I wondered if there was a problem with virtual memory, and there is a spike in dirty pages around the time of the crash, but I don't know if that's a cause or effect. What could a container be doing to cause this kind of failure?
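One way to capture the dirty-page spike mentioned above is to sample /proc/meminfo at intervals and log a few fields. The following is a minimal sketch, not a tool from this thread: the function names `parse_meminfo` and `writeback_pressure` are hypothetical, and the choice of fields (Dirty, Writeback, SwapTotal, SwapFree) is an assumption about what is worth watching here.

```python
def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo field names to their integer values.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def writeback_pressure(text):
    """Pick out the fields relevant to dirty-page and swap activity (0 if absent)."""
    info = parse_meminfo(text)
    keys = ("Dirty", "Writeback", "SwapTotal", "SwapFree")
    return {key: info.get(key, 0) for key in keys}
```

On the host, passing the contents of /proc/meminfo to `writeback_pressure` every few seconds and writing the result to a file on disk would leave a record that survives the hang, which may help show whether the dirty-page spike precedes or follows the blocked task.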

console-crash.png

Brandon Philips

Dec 4, 2016, 7:45:24 PM

to Paul Dixon, CoreOS User

Can you disable swap? Swap + cloud machines generally isn't a great combination.

xref https://groups.google.com/d/msg/coreos-user/xw4aeC68k6Q/-heoI8fvBQAJ


Paul Dixon

Dec 6, 2016, 4:06:41 AM

to CoreOS User

On 5 December 2016 at 00:45, Brandon Philips <brandon...@coreos.com> wrote:

Can you disable swap? Swap + cloud machines generally isn't a great combination.

Thanks Brandon. I can certainly do that; the swap is more of a safety net and force of habit (though I'd rather see a blip in swap usage than have the OOM killer strike!).

xref https://groups.google.com/d/msg/coreos-user/xw4aeC68k6Q/-heoI8fvBQAJ

That post references https://github.com/coreos/bugs/issues/429, which looks very similar to my problem. There, they report that the root cause appeared to be the use of losetup while creating the swap file.

I too have swap initialised by a unit which used losetup. It looks like the use of losetup was suggested for btrfs (https://github.com/coreos/docs/issues/52#issuecomment-45418877), but I'm using ext4. The fleet unit I'm using for swap was written a couple of years ago and would appear to have blindly incorporated this advice.
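To check whether a host is affected in the same way, one can look for loop devices among the active swap areas listed in /proc/swaps. This is a rough sketch under that assumption; `loop_backed_swaps` is a hypothetical helper name, and it only inspects the path column, so it would not detect a loop device reached through a symlink.

```python
def loop_backed_swaps(text):
    """Return the swap areas in /proc/swaps-style text whose path is a loop device."""
    found = []
    for line in text.splitlines()[1:]:  # the first line is the column header
        columns = line.split()
        if columns and columns[0].startswith("/dev/loop"):
            found.append(columns[0])
    return found
```

Feeding it the contents of /proc/swaps returns a list such as one containing /dev/loop0 when swap was set up through losetup, and an empty list when the swap file or partition is used directly.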

I can see that documentation for swap usage was recently added, and it doesn't mention losetup: https://coreos.com/os/docs/latest/adding-swap.html

After running a few tests, I'm almost certain this is the root cause of my crash.

Many thanks for the pointer!

Paul
