CoreOS nodes randomly rebooting - what info can I gather to determine root cause?


Derek Olsen

Apr 14, 2017, 5:22:19 PM
to CoreOS User
We are running CoreOS in AWS:


NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1339.0.0
VERSION_ID=1339.0.0
BUILD_ID=2017-03-01-2346
PRETTY_NAME="Container Linux by CoreOS 1339.0.0 (Ladybug)"
ANSI_COLOR="38;5;75"

We have been having some problems for the past two weeks (not sure how long it has been going on - we only noticed it in the past few weeks) where nodes randomly reboot. The only evidence I can find in the logs is the following message:

'-- Reboot --'

The messages prior to the reboot line are just our applications logging output. We have disabled automatic updates via the following in our cloud-config:

update:
    reboot-strategy: off
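(The snippet above is abbreviated; for reference, a minimal sketch of the full documented form, assuming the standard cloud-config layout with a top-level coreos: key:

#cloud-config
coreos:
  update:
    reboot-strategy: "off"

Quoting "off" sidesteps YAML parsing it as a boolean.)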

Additionally, we have masked locksmithd:
core@ip-10-20-13-19 ~ $ systemctl list-unit-files | grep masked
locksmithd.service                     masked-runtime
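For reference, a masked-runtime state like the one above comes from something along these lines; note that a --runtime mask does not persist across reboots:

sudo systemctl stop locksmithd.service
sudo systemctl mask --runtime locksmithd.service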


Also, I would assume that if the reboot were driven by an update, I would get more logs than just '-- Reboot --' and the CoreOS version would change. From reading a similar thread, I now know to capture details from /sys/fs/pstore/ if any exist. I'm curious what else I can gather that might help me determine the root cause of these reboots.
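For capturing pstore, I'm thinking of something rough like the following right after a crash (the destination path is just a placeholder), since the backing store is small and old records can be overwritten:

sudo mkdir -p /var/log/pstore-archive
sudo cp -a /sys/fs/pstore/. /var/log/pstore-archive/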

Thanks.

Rob Szumski

Apr 17, 2017, 12:53:52 PM
to Derek Olsen, CoreOS User
If Container Linux is triggering this, you can tell for sure by looking at the update engine logs:

journalctl -u update-engine

That should log all update checks and applied updates.
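You can also ask the update engine for its current state directly; assuming the stock Container Linux tooling:

update_engine_client -status

If an update has been staged, the current operation should read something like UPDATE_STATUS_UPDATED_NEED_REBOOT.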

 - Rob


Alex Crawford

Apr 17, 2017, 12:58:01 PM
to Rob Szumski, Derek Olsen, CoreOS User
On 04/17, Rob Szumski wrote:
> If Container Linux is triggering this, you can tell for sure by looking at the update engine logs:
>
> journalctl -u update-engine
>
> That should log all update checks and applied updates.

In this case, there is an abrupt loss of logs, which is indicative of a
kernel panic. The panic trace should be caught in the pstore or the
serial logs. Accessing this data is going to be highly dependent on the
platform being used.
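To confirm whether pstore even has a registered backend on your instances, a quick check along these lines should tell you:

ls -l /sys/fs/pstore/
dmesg | grep -i pstore

If no backend is registered, a panic will never land in pstore, and the serial console is the only place it can show up.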

-Alex

Derek Olsen

Apr 17, 2017, 2:23:37 PM
to CoreOS User, some...@gmail.com
Rob,

I do see these messages on all of our nodes, but we have 'reboot-strategy=off', so I figured it just pulls down the update, applies it to the other partition, and then waits for something to reboot the node. Since only a small percentage of the node population is rebooting, and only somewhat recently, it makes me think this isn't related to update-engine. Happy to be shown I'm wrong, though, as that would seem to be an easier problem to solve. :)

Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]:     </actions>
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]:    </manifest>
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]:   </updatecheck>
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]:  </app>
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]: </response>
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]: I0417 17:36:58.152194   896 action_processor.cc:65] ActionProcessor::ActionComplete: finished last action of type OmahaRequestAction
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]: I0417 17:36:58.152201   896 action_processor.cc:73] ActionProcessor::ActionComplete: finished last action of type OmahaRequestAction
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]: I0417 17:36:58.152207   896 update_attempter.cc:290] Processing Done.
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]: I0417 17:36:58.152809   896 update_attempter.cc:316] Update successfully applied, waiting to reboot.
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]: I0417 17:36:58.152835   896 update_check_scheduler.cc:74] Next update check in 50m0s

Derek Olsen

Apr 17, 2017, 2:26:14 PM
to CoreOS User, rob.s...@coreos.com, some...@gmail.com
We are running AWS EC2 instances. We had the situation over the weekend, and I looked in /sys/fs/pstore, but no files existed after the reboot. If no data is in /sys/fs/pstore/, should I try to get the console output when this happens again?

Thanks, Derek.


Alex Crawford

Apr 17, 2017, 2:31:08 PM
to Derek Olsen, CoreOS User, rob.s...@coreos.com
On 04/17, Derek Olsen wrote:
> We are running AWS EC2 instances. We had the situation over the weekend,
> and I looked in /sys/fs/pstore, but no files existed after the reboot. If
> no data is in /sys/fs/pstore/, should I try to get the console output
> when this happens again?

Yes, AWS doesn't have hardware support for pstore, so you'll need to
look at the system log in the EC2 console. Hopefully that will fully
capture the kernel panic.
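If it helps to automate this, the same data should be reachable from the AWS CLI; a sketch, with a placeholder instance ID:

aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text

Keep in mind the console output is captured periodically, so it may lag the actual crash.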

-Alex

Derek Olsen

Apr 17, 2017, 2:34:38 PM
to CoreOS User, some...@gmail.com, rob.s...@coreos.com


> Yes, AWS doesn't have hardware support for pstore,

Good to know! Thanks for the tips. I'll be prepared for the next event.