CoreOS v1010.5.0: frequent server reboots on DL380p G8 and DL360 G9

Manuel Carlo Ranieri

Jun 22, 2016, 7:53:59 AM
to CoreOS User
Hi all, I have 9 servers running CoreOS VERSION_ID=1010.5.0:

2 DL360 Gen9
7 DL380p Gen8

locksmithd is disabled
update-engine is disabled

I am seeing frequent server reboots, up to 4 per week or more.
iLO reports nothing except 'server reset'.
There are no log entries.
As far as I can tell, these reboots have no explanation.

Rob Szumski

Jun 22, 2016, 1:30:42 PM
to Manuel Carlo Ranieri, CoreOS User
Can you provide some more details? Are all of these machines rebooting this frequently? Did this happen on previous CoreOS versions? Or did it just pop up?

There aren’t any logs whatsoever around the reboot events? Can you take a peek at the update-engine logs, just to make sure updates weren’t involved? Even if they were, 4 reboots a week is much higher than anything you would see from automatic updates.

Would it be possible to switch to beta on one or two of these machines? Beta has a slightly newer kernel.

 - Rob
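
For anyone following along, here is a minimal sketch of how the logs around a reboot could be gathered on a systemd host. It assumes the journal is persistent, so that entries from the previous boot survive; the function name and the output directory are hypothetical, while `update-engine.service` and `locksmithd.service` are the units discussed in this thread.

```shell
#!/bin/sh
# Sketch: save journal excerpts that help explain an unexpected reboot.
# Assumption: journald keeps a persistent journal; otherwise "-b -1" has nothing to show.

collect_reboot_logs() {
    out_dir="${1:-/tmp/reboot-logs}"   # hypothetical output directory
    mkdir -p "$out_dir" || return 1

    if ! command -v journalctl >/dev/null 2>&1; then
        echo "journalctl not found" >&2
        return 1
    fi

    # One line per recorded boot, with first and last timestamps.
    journalctl --list-boots --no-pager > "$out_dir/boots.txt"
    # Last 200 journal lines of the previous boot, i.e. just before the reset.
    journalctl -b -1 -n 200 --no-pager > "$out_dir/previous-boot-tail.txt"
    # Everything the update services logged.
    journalctl -u update-engine.service --no-pager > "$out_dir/update-engine.txt"
    journalctl -u locksmithd.service --no-pager > "$out_dir/locksmithd.txt"
}
```

Calling `collect_reboot_logs /tmp/core-1-logs` and reading the end of `previous-boot-tail.txt` may show what happened right before the reset. If that file stops abruptly with no shutdown messages, the reset probably was not an orderly reboot requested by the OS.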

Manuel Carlo Ranieri

Jun 22, 2016, 5:00:48 PM
to CoreOS User, emme...@gmail.com
Yes, of course I can provide more details. I agree that 4 reboots are a lot...
Attached are the log snippets, one for the reboot and one for update-engine.

update-engine: disabled
locksmithd: disabled
update strategy:  OFF
  cat /etc/coreos/update.conf
  GROUP=stable
  REBOOT_STRATEGY=off
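
The settings above can be double-checked with a small sketch like the following. It assumes a POSIX shell and a plain KEY=value file; the helper names `get_conf_value` and `reboots_disabled` are made up for illustration.

```shell
#!/bin/sh
# Sketch: read settings from a KEY=value file such as /etc/coreos/update.conf.

# Print the value of KEY from FILE (the last assignment wins if KEY repeats).
get_conf_value() {
    file="$1"
    key="$2"
    sed -n "s/^${key}=//p" "$file" | tail -n 1
}

# Succeed only when the file sets REBOOT_STRATEGY=off.
reboots_disabled() {
    [ "$(get_conf_value "$1" REBOOT_STRATEGY)" = "off" ]
}
```

On a host, `reboots_disabled /etc/coreos/update.conf && echo off` would confirm the strategy, and `systemctl is-active update-engine.service locksmithd.service` should show whether the two services are really stopped.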

Honestly, I do not remember whether reboots occurred with the previous version. update-engine/locksmith was enabled then, and I set up the reboot windows only a few weeks ago.
I think we can switch one server to the beta channel if etcd/flannel/docker are compatible. We have already run into a problem with a rolling update, when etcd2 was updated to a version that cannot be downgraded.
I have to check, but as you know, the beta channel is not acceptable for a production environment.

- man
reboot-core-1.log
update-engine-core-1.log

Manuel Carlo Ranieri

Jun 30, 2016, 5:41:45 AM
to CoreOS User, emme...@gmail.com
We changed the OS, from CoreOS to Ubuntu 16.04.

There were 2 major problems.

1) The continuous stream of operating system updates shipped etcd2 2.3. etcd2 2.3 cannot be rolled back once the etcd2 cluster has upgraded from 2.2.3 to 2.3 (from stable 899.13.0 to 1010.5.0).
This means I cannot downgrade my CoreOS cluster if there is a problem once the etcd2 cluster update has finished.
2) The continuous, uncontrolled reboots when servers are under load; and I cannot roll back because of the first problem.


Ciao
man

Brandon Philips

Jul 6, 2016, 6:44:41 PM
to Manuel Carlo Ranieri, CoreOS User
On Thu, Jun 30, 2016 at 2:41 AM Manuel Carlo Ranieri <emme...@gmail.com> wrote:
1) The continuous stream of operating system updates shipped etcd2 2.3. etcd2 2.3 cannot be rolled back once the etcd2 cluster has upgraded from 2.2.3 to 2.3 (from stable 899.13.0 to 1010.5.0).
This means I cannot downgrade my CoreOS cluster if there is a problem once the etcd2 cluster update has finished.

In the next few versions of CoreOS we will be updating our docs to drive people towards running etcd in containers. This will make it easier for you to pin etcd to a particular version.
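
As a rough sketch of what pinning could look like, the function below only prints a `docker run` command line with a fixed image tag; it does not start anything. The image name `quay.io/coreos/etcd`, the tag `v2.2.3`, the member name, and the data directory are assumptions for illustration, and a real cluster member needs additional peer and client URL flags.

```shell
#!/bin/sh
# Sketch: print (not run) a docker command that starts etcd from a pinned tag.
# Assumptions: image name and tag below; verify that the tag exists before use.
ETCD_IMAGE="${ETCD_IMAGE:-quay.io/coreos/etcd}"
ETCD_TAG="${ETCD_TAG:-v2.2.3}"

etcd_run_command() {
    member="$1"      # etcd member name, e.g. core-1 (hypothetical)
    data_dir="$2"    # host directory that holds the etcd data

    # The first --name names the container; the second one is etcd's own flag.
    printf 'docker run -d --name etcd --net=host -v %s:/data %s:%s --name %s --data-dir /data\n' \
        "$data_dir" "$ETCD_IMAGE" "$ETCD_TAG" "$member"
}
```

Because the tag is written out explicitly, an OS update would not change the etcd version; upgrading etcd becomes a separate, deliberate change of `ETCD_TAG`.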
 
2) The continuous, uncontrolled reboots when servers are under load; and I cannot roll back because of the first problem.

The updates can be controlled via locksmith. We want to add additional smarts to these mechanisms; one idea is Kubernetes locksmith integration: https://github.com/coreos/bugs/issues/1274

Cheers,

Brandon

Manuel Carlo Ranieri

Jul 10, 2016, 9:09:04 AM
to CoreOS User, emme...@gmail.com


On Thursday, July 7, 2016 at 00:44:41 UTC+2, Brandon Philips wrote:
In the next few versions of CoreOS we will be updating our docs to drive people towards running etcd in containers. This will make it easier for you to pin etcd to a particular version.
 

This could be a good idea. IMHO, in a production environment it is mandatory to have control over updates that can introduce problems.

 