stable 1068.6.0: fleet engine leadership lost

Juan José Amor

Jul 14, 2016, 7:56:34 AM

to CoreOS User

Hello again,

This may be related to my previous message, but I am also seeing several of these:

fleetd: ERROR engine.go:217: Engine leadership lost, renewal failed: context deadline exceeded.

This occurs mainly when activating units.

In previous stable releases it never occurred :(

Any ideas?

Many thanks!

Juanjo Amor

Jul 15, 2016, 4:19:07 PM

to CoreOS User

Hello again,

I'm not sure, but I suspect this error occurs because we had two clusters in the same subnet.

Both clusters were configured statically (without discovery, as described in https://coreos.com/etcd/docs/latest/clustering.html ) and, for some reason, when both clusters are running, fleet randomly shows the "engine leadership lost" error.

Although machine-ids and cluster-ids are different, we stopped one of the clusters and the other stopped giving us the error.

We do not know why.

Thanks!

Juan José Amor

http://dramor.net/


Rob Szumski

Jul 15, 2016, 4:38:55 PM

to Juanjo Amor, CoreOS User

Did you have some sort of extended networking outage, latency, or packet loss within that subnet?

I’m thinking that both clusters reacted the same way to external forces, not necessarily that they were connected.

Juan Jose Amor Iglesias

Jul 15, 2016, 4:58:24 PM

to Rob Szumski, CoreOS User

Hello Rob,

On 15/07/16 at 22:38, Rob Szumski wrote:

> Did you have some sort of extended networking outage, latency, or packet
> loss within that subnet?

I don't think so. It is a 10 Gbps LAN, and whenever I activate both
clusters, the error occurs.

>
> I’m thinking that both clusters reacted the same way to external forces,
> not necessarily that they were connected.

I'm wondering: although the cluster and machine IDs are different, should
we also use different ETCD_INITIAL_CLUSTER_TOKEN values?

Thank you!


--

Juan Jose Amor Iglesias // http://about.me/jjamor
jjamor -at- gmail.com // juanjo -at- dramor.net

-------------------- Visit my Blog! ---------------------
The Boring Stories Written By DrAmor: http://dramor.net/blog/
---------------------------------------------------------------

Rob Szumski

Jul 15, 2016, 6:40:24 PM

to Juan Jose Amor Iglesias, CoreOS User

Gotcha, ok. I haven’t seen this particular behavior that I can remember, but if you can track down a way to reproduce it, that would be fantastic.

Juanjo Amor

Jul 18, 2016, 1:52:29 AM

to Rob Szumski, CoreOS User

Well, I guess you only need to reproduce my infrastructure:

- A three-node cluster "lab" in one subnet

- A three-node cluster "production" in the same subnet

Both clusters were created without access to a discovery etcd server.

This is the cloud-config data for node 1 of cluster 1 (node hostnames: srvlabcoreos01, srvlabcoreos02 and srvlabcoreos03; the IPs are given in the data below):

...
etcd2:
  name: srvlabcoreos01
  initial-advertise-peer-urls: http://$private_ipv4:2380
  listen-peer-urls: http://$private_ipv4:2380,http://$private_ipv4:7001
  initial-cluster-token: etcd-cluster-1
  initial-cluster: srvlabcoreos01=http://192.168.9.152:2380,srvlabcoreos02=http://192.168.9.153:2380,srvlabcoreos03=http://192.168.9.154:2380
  initial-cluster-state: new
  advertise-client-urls: http://$private_ipv4:2379,http://$private_ipv4:4001
  listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
...

And this is the data for node 1 of cluster 2 (hostnames: srvprecoreos01, srvprecoreos02 and srvprecoreos03):

...
etcd2:
  name: srvprecoreos01
  initial-advertise-peer-urls: http://$private_ipv4:2380
  listen-peer-urls: http://$private_ipv4:2380,http://$private_ipv4:7001
  initial-cluster-token: etcd-cluster-1
  initial-cluster: srvprecoreos01=http://192.168.9.161:2380,srvprecoreos02=http://192.168.9.162:2380,srvprecoreos03=http://192.168.9.163:2380
  initial-cluster-state: new
  advertise-client-urls: http://$private_ipv4:2379,http://$private_ipv4:4001
  listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
...

Note that all IPs belong to the 192.168.9.0/24 network. When we boot both clusters and start using them, we randomly see "engine leadership lost" messages while interacting with fleet, and we lose fleet functionality for several seconds.
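
One detail in the two configs above is that both clusters use the same initial-cluster-token (etcd-cluster-1). The etcd clustering guide recommends a unique token per cluster, so that the generated cluster and member IDs differ even when the rest of the configuration is identical. Below is a minimal sketch of the second cluster's section with its own token. The value etcd-cluster-2 is a hypothetical name, and nothing in this thread confirms that changing it resolves the fleet error.

```yaml
# Sketch only: etcd2 section for node 1 of the second cluster.
# "etcd-cluster-2" is a hypothetical token, not taken from the original post.
# All other keys are unchanged from the config shown above.
etcd2:
  name: srvprecoreos01
  initial-advertise-peer-urls: http://$private_ipv4:2380
  listen-peer-urls: http://$private_ipv4:2380,http://$private_ipv4:7001
  initial-cluster-token: etcd-cluster-2
  initial-cluster: srvprecoreos01=http://192.168.9.161:2380,srvprecoreos02=http://192.168.9.162:2380,srvprecoreos03=http://192.168.9.163:2380
  initial-cluster-state: new
  advertise-client-urls: http://$private_ipv4:2379,http://$private_ipv4:4001
  listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
```

Since the initial-* settings are only read when a cluster is first bootstrapped, a change like this would presumably require re-creating the cluster rather than just restarting etcd2.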

Many thanks!

Best regards,

Juan José Amor

http://dramor.net/
