I've got a CoreOS host running v1185.3.0 on a DigitalOcean droplet. This host runs a number of containers, and it typically fails completely within 24 hours.
There are no clues in the journal, but the server's console displays a message like this:
yyy:xxx blocked for more than 120 seconds
Not tainted 4.7.3-coreos-r2
echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message
I've attached a screen grab of the console output showing this message and the call stack.
Does anyone have suggestions for further diagnosis? I wondered if there was a problem with virtual memory, and there is a spike in dirty pages around the time of the crash, but I don't know if that's a cause or effect. What could a container be doing to cause this kind of failure?
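For anyone who wants to reproduce the dirty-page observation, here is a minimal sketch of one way to sample it. It is not necessarily how the figures above were gathered. It reads the `Dirty` and `Writeback` fields from `/proc/meminfo`, which the kernel reports in kB. The function names `parse_meminfo`, `sample_dirty` and `log_dirty_pages` are made up for illustration. Since CoreOS does not ship Python on the host, this assumes the script runs inside a container or toolbox, where `/proc/meminfo` should, as far as I know, still show host-wide values.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def sample_dirty(path="/proc/meminfo"):
    """Return (Dirty, Writeback) in kB, or None for a field that is missing."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return info.get("Dirty"), info.get("Writeback")


def log_dirty_pages(interval=10, samples=None, path="/proc/meminfo"):
    """Print a timestamped sample every `interval` seconds.

    Runs until interrupted when `samples` is None, otherwise stops after
    that many samples.
    """
    taken = 0
    while samples is None or taken < samples:
        dirty, writeback = sample_dirty(path)
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        # flush so the last line before a hang is not lost in a buffer
        print(f"{stamp} Dirty={dirty}kB Writeback={writeback}kB", flush=True)
        taken += 1
        if samples is None or taken < samples:
            time.sleep(interval)
```

Calling `log_dirty_pages()` and sending the output somewhere that survives the hang (a remote log, for example) would give a timeline to compare against the moment the "blocked for more than 120 seconds" message appears.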