CoreOS crash on VMware ESX under memory load

Eric Anderson

Oct 24, 2016, 3:25:21 PM
to CoreOS User
Good day folks,

I am running CoreOS 1122.3.0 in development and 899.13.0 in production, both hosted on VMware ESX. We recently ran into an issue where one of our production CoreOS nodes locked up hard under load and required a reset of the VM. I have been working on reproducing the issue using tools like "cpuburn" and "stress" running inside Docker. Load testing the CPU did not cause the lockup, but running "stress" with its memory tests reliably locks up a VM. I get a crash dump on the console (I have saved screenshots) and then have to reset the VM in ESX. There is nothing in /sys/fs/pstore after the reboot. How else can I capture any kind of post-crash debugging?
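
For context on capture options: when pstore comes up empty, one alternative is to point the kernel console at a virtual serial port, which ESX can back with a file on the datastore. A minimal sketch, assuming CoreOS's OEM GRUB config is where extra kernel arguments go (path taken from the CoreOS docs; a serial port must first be added to the VM in ESX):

# /usr/share/oem/grub.cfg (path assumed) -- mirror the kernel console
# to the VM's first serial port so ESX can record the panic text:
set linux_append="console=tty0 console=ttyS0,115200n8"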

For reference, the two tools I am using for load testing are "cpuburn" and "stress", both run inside Docker containers.
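
As a rough sketch, the memory test that triggers the lockup looks something like this (the image and sizes here are illustrative, not our exact invocation; the flags are from the standard stress(1) tool):

# --vm spawns workers that loop on malloc()/free(); --vm-bytes sets
# the allocation size per worker. Image and sizes are illustrative.
docker run --rm -it ubuntu:16.04 sh -c \
  'apt-get update && apt-get install -y stress && \
   stress --vm 4 --vm-bytes 1024M --timeout 120s'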

Brandon Philips

Oct 24, 2016, 3:41:52 PM
to Eric Anderson, CoreOS User
Can you post the screenshots?

Do you run with swap on? CoreOS doesn't use swap by default. WIP guide: https://github.com/coreos/docs/issues/52
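
A quick way to check on a node (standard util-linux tools, nothing CoreOS-specific):

# List active swap areas; empty output means swap is off.
swapon -s
cat /proc/swaps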

Eric Anderson

Oct 24, 2016, 3:47:25 PM
to CoreOS User, ericla...@gmail.com

We do run with swap enabled.

Eric Anderson

Oct 24, 2016, 3:50:24 PM
to CoreOS User, ericla...@gmail.com
I should clarify that we run with swap enabled, using a setup similar to the unit file posted in the linked issue: https://github.com/coreos/docs/issues/52

Eric Anderson

Oct 25, 2016, 2:28:08 PM
to CoreOS User
On Monday, October 24, 2016 at 2:50:24 PM UTC-5, Eric Anderson wrote:
I should clarify that we run with swap enabled, using a setup similar to the unit file posted in the linked issue: https://github.com/coreos/docs/issues/52

It is interesting that you ask whether we have swap enabled. After years of running Solaris, FreeBSD, and Linux, some habits are hard to break, so we enable swap via a fleet-wide unit file:

[Unit]
Description=on disk swap file

[Service]
Type=oneshot
Environment="SWAPFILE=/coreos.swap"
Environment="SWAPSIZE=2048"
RemainAfterExit=true
ExecStartPre=/usr/bin/touch ${SWAPFILE}
ExecStartPre=/usr/bin/fallocate -l ${SWAPSIZE}m ${SWAPFILE}
ExecStartPre=/usr/bin/chmod 600 ${SWAPFILE}
ExecStartPre=/usr/sbin/mkswap ${SWAPFILE}
ExecStartPre=/usr/sbin/losetup -f ${SWAPFILE}
ExecStart=/usr/bin/sh -c "/sbin/swapon $(/usr/sbin/losetup -j ${SWAPFILE} | /usr/bin/cut -d : -f 1)"
ExecStop=/usr/bin/sh -c "/sbin/swapoff $(/usr/sbin/losetup -j ${SWAPFILE} | /usr/bin/cut -d : -f 1)"
ExecStopPost=/usr/bin/sh -c "/usr/sbin/losetup -d $(/usr/sbin/losetup -j ${SWAPFILE} | /usr/bin/cut -d : -f 1)"
ExecStopPost=/usr/bin/rm ${SWAPFILE}

[X-Fleet]
Global=true

We noticed that on a standalone bare-metal server running CoreOS 1022.3.0, the same stress tests did not lock up the machine. That server had no swap enabled. Once we enabled swap on it, we were able to lock it up quickly. We then retested the VM with swap disabled and saw no lockup.
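
For anyone repeating that comparison, toggling swap off for a test run is a one-liner (re-enabling is via whatever unit created the swap file, or swapon directly):

# Disable all active swap before re-running the memory stress test:
sudo swapoff -a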

Brandon Philips

Oct 26, 2016, 4:06:16 PM
to Eric Anderson, CoreOS User
On Tue, Oct 25, 2016 at 11:28 AM Eric Anderson <ericla...@gmail.com> wrote:
On Monday, October 24, 2016 at 2:50:24 PM UTC-5, Eric Anderson wrote:
I should clarify that we run with swap enabled, using a setup similar to the unit file posted in the linked issue: https://github.com/coreos/docs/issues/52

It is interesting that you ask whether we have swap enabled. After years of running Solaris, FreeBSD, and Linux, some habits are hard to break, so we enable swap via a fleet-wide unit file:

We noticed that on a standalone bare-metal server running CoreOS 1022.3.0, the same stress tests did not lock up the machine. That server had no swap enabled. Once we enabled swap on it, we were able to lock it up quickly. We then retested the VM with swap disabled and saw no lockup.

That certainly shouldn't be happening, but honestly, swap is used less and less in modern systems. See this discussion on Kubernetes: https://github.com/kubernetes/kubernetes/issues/7294

There are valid uses for swap if you have applications whose huge working sets rely on virtual memory to work correctly. But those applications are rare and generally need dedicated hardware/disks to perform reliably.

With all of that said, it would be great to root-cause this, as the entire machine certainly shouldn't lock up. My hunch is that it is an interaction between journald and your disk I/O. Try setting journald's Storage= option to "volatile" or "none": https://www.freedesktop.org/software/systemd/man/journald.conf.html#Storage=
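
For reference, a drop-in is enough to apply that setting (the drop-in mechanism is per journald.conf(5); the file name here is arbitrary):

# /etc/systemd/journald.conf.d/10-storage.conf
# "volatile" keeps the journal in RAM only; "none" discards logging.
[Journal]
Storage=volatile

Then restart journald to pick it up:

sudo systemctl restart systemd-journald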

Cheers,

Brandon

Eric Anderson

Oct 28, 2016, 1:22:37 PM
to CoreOS User, ericla...@gmail.com
I tried both "volatile" and "none" and neither fixed the issue. I then looked at an older issue (https://github.com/coreos/bugs/issues/429) where somebody suggested doing away with "losetup" in the swap creation process (which we were using). We did that, and the system is now much more stable (no lockups). I am glad we had this discussion, though, as it has made us re-evaluate our need for swap at all.
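
For completeness, the adjusted setup amounts to running swapon on the file directly; a sketch of our earlier unit with the loop-device steps removed (untested as written here):

[Unit]
Description=on disk swap file

[Service]
Type=oneshot
Environment="SWAPFILE=/coreos.swap"
Environment="SWAPSIZE=2048"
RemainAfterExit=true
ExecStartPre=/usr/bin/touch ${SWAPFILE}
ExecStartPre=/usr/bin/fallocate -l ${SWAPSIZE}m ${SWAPFILE}
ExecStartPre=/usr/bin/chmod 600 ${SWAPFILE}
ExecStartPre=/usr/sbin/mkswap ${SWAPFILE}
ExecStart=/sbin/swapon ${SWAPFILE}
ExecStop=/sbin/swapoff ${SWAPFILE}
ExecStopPost=/usr/bin/rm ${SWAPFILE}

[X-Fleet]
Global=true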

Thanks

Eric Anderson

Oct 28, 2016, 1:24:07 PM
to CoreOS User


On Friday, October 28, 2016 at 12:22:37 PM UTC-5, Eric Anderson wrote:
I tried both "volatile" and "none" and neither fixed the issue. I then looked at an older issue (https://github.com/coreos/bugs/issues/429) where somebody suggested doing away with "losetup" in the swap creation process (which we were using). We did that, and the system is now much more stable (no lockups). I am glad we had this discussion, though, as it has made us re-evaluate our need for swap at all.


Sorry, I meant we have had no more lockups since adjusting the swap creation process. 