CoreOS nodes randomly rebooting - what info can I gather to determine root cause?


Derek Olsen

Apr 14, 2017, 5:22:19 PM
to CoreOS User
We are running CoreOS in AWS:


NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1339.0.0
VERSION_ID=1339.0.0
BUILD_ID=2017-03-01-2346
PRETTY_NAME="Container Linux by CoreOS 1339.0.0 (Ladybug)"
ANSI_COLOR="38;5;75"

We have been having some problems for the past two weeks (not sure how long it has been going on - we only noticed it in the past few weeks) where nodes randomly reboot. The only evidence I can find in the logs is the following message:

'-- Reboot --'

The messages prior to the reboot line are just our applications logging output. We have disabled automatic updates via the following in our cloud-config:

update:
    reboot-strategy: off
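(The snippet above is abbreviated; for reference, a minimal sketch of the full documented form, assuming the standard cloud-config layout with a top-level coreos: key:

#cloud-config
coreos:
  update:
    reboot-strategy: "off"

Quoting "off" sidesteps YAML parsing it as a boolean.)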

Additionally, we have masked locksmithd:
core@ip-10-20-13-19 ~ $ systemctl list-unit-files | grep masked
locksmithd.service                     masked-runtime
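For reference, a masked-runtime state like the one above comes from something along these lines; note that a --runtime mask does not persist across reboots:

sudo systemctl stop locksmithd.service
sudo systemctl mask --runtime locksmithd.service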


Also, I would assume that if the reboot were driven by an update, I would get more logs than just '-- Reboot --' and the CoreOS version would change. From reading a similar thread, I now know to capture details from /sys/fs/pstore/ if any exist. I'm curious what else I can gather that might help me determine the root cause of these reboots.
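For capturing pstore, I'm thinking of something rough like the following right after a crash (the destination path is just a placeholder), since the backing store is small and old records can be overwritten:

sudo mkdir -p /var/log/pstore-archive
sudo cp -a /sys/fs/pstore/. /var/log/pstore-archive/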

Thanks.

Rob Szumski

Apr 17, 2017, 12:53:52 PM
to Derek Olsen, CoreOS User
If Container Linux is triggering this, you can tell for sure by looking at the update engine logs:

journalctl -u update-engine

That should log all update checks and applied updates.
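You can also ask the update engine for its current state directly; assuming the stock Container Linux tooling:

update_engine_client -status

If an update has been staged, the current operation should read something like UPDATE_STATUS_UPDATED_NEED_REBOOT.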

 - Rob


Alex Crawford

Apr 17, 2017, 12:58:01 PM
to Rob Szumski, Derek Olsen, CoreOS User
On 04/17, Rob Szumski wrote:
> If Container Linux is triggering this, you can tell for sure by looking at the update engine logs:
>
> journalctl -u update-engine
>
> That should log all update checks and applied updates.

In this case, there is an abrupt loss of logs, which is indicative of a
kernel panic. The panic trace should be caught in the pstore or the
serial logs. Accessing this data is going to be highly dependent on the
platform being used.
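To confirm whether pstore even has a registered backend on your instances, a quick check along these lines should tell you:

ls -l /sys/fs/pstore/
dmesg | grep -i pstore

If no backend is registered, a panic will never land in pstore, and the serial console is the only place it can show up.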

-Alex

Derek Olsen

Apr 17, 2017, 2:23:37 PM
to CoreOS User, some...@gmail.com
Rob,

I do see these messages on all of our nodes, but we have 'reboot-strategy=off', so I figured it just pulls down the update, applies it to the other partition, and then waits for something to reboot the node. Since only a small percentage of the node population is rebooting, and only somewhat recently, it makes me think this isn't related to update-engine. Happy to be shown I'm wrong, though, as that would seem to be an easier problem to solve. :)

Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]:     </actions>
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]:    </manifest>
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]:   </updatecheck>
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]:  </app>
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]: </response>
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]: I0417 17:36:58.152194   896 action_processor.cc:65] ActionProcessor::ActionComplete: finished last action of type OmahaRequestAction
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]: I0417 17:36:58.152201   896 action_processor.cc:73] ActionProcessor::ActionComplete: finished last action of type OmahaRequestAction
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]: I0417 17:36:58.152207   896 update_attempter.cc:290] Processing Done.
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]: I0417 17:36:58.152809   896 update_attempter.cc:316] Update successfully applied, waiting to reboot.
Apr 17 17:36:58 ip-10-20-0-104.ec2.internal update_engine[896]: I0417 17:36:58.152835   896 update_check_scheduler.cc:74] Next update check in 50m0s

Derek Olsen

Apr 17, 2017, 2:26:14 PM
to CoreOS User, rob.s...@coreos.com, some...@gmail.com
We are running AWS EC2 instances. We had the situation over the weekend, and I looked in /sys/fs/pstore, but no files existed after the reboot. If no data is in /sys/fs/pstore/, should I try to get the console output when this happens again?

Thanks, Derek.


Alex Crawford

Apr 17, 2017, 2:31:08 PM
to Derek Olsen, CoreOS User, rob.s...@coreos.com
On 04/17, Derek Olsen wrote:
> We are running AWS EC2 instances. We had the situation over the weekend,
> and I looked in /sys/fs/pstore, but no files existed after the reboot. If
> no data is in /sys/fs/pstore/, should I try to get the console output
> when this happens again?

Yes, AWS doesn't have hardware support for pstore, so you'll need to
look at the system log in the EC2 console. Hopefully that will fully
capture the kernel panic.
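If it helps to automate this, the same data should be reachable from the AWS CLI; a sketch, with a placeholder instance ID:

aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text

Keep in mind the console output is captured periodically, so it may lag the actual crash.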

-Alex

Derek Olsen

Apr 17, 2017, 2:34:38 PM
to CoreOS User, some...@gmail.com, rob.s...@coreos.com


> Yes, AWS doesn't have hardware support for pstore,

Good to know! Thanks for the tips. I'll be prepared for the next event.