[clientv3] Best way to detect re-connection

Milan Lenco

May 30, 2017, 10:06:02 AM
to etcd-dev
Hello etcd-dev,

Could you please recommend the best approach for asynchronous detection of clientv3 re-connection? I have searched through the clientv3 implementation, and a little through grpc as well, but couldn't find any channels/callbacks/etc. through which connection status changes are propagated upwards.

My current approach is rather silly: a separate goroutine repeatedly attempts to read some random key inside a context with a 2-second timeout. If the timeout expires before a response is returned, the connection is considered broken (like a ping).
Firstly, this is not quite accurate under all possible scenarios, and secondly, it is not as efficient as I would like.
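
Roughly, the probe goroutine looks like this sketch (the key name, the 2-second intervals, and the onReconnect callback are placeholders of mine):

    import (
        "context"
        "time"

        "github.com/coreos/etcd/clientv3"
    )

    // pingLoop polls an arbitrary key; a missed deadline means the
    // connection is presumed broken, and the first successful read
    // after that is treated as a re-connection.
    func pingLoop(cli *clientv3.Client, onReconnect func()) {
        connected := true
        for {
            ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
            _, err := cli.Get(ctx, "health-probe-key")
            cancel()
            if err != nil {
                connected = false
            } else if !connected {
                connected = true
                onReconnect() // reload keys to catch up on missed notifications
            }
            time.Sleep(2 * time.Second)
        }
    }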

I'm especially interested in the event of connection re-establishment, upon which I run a procedure that reloads a range of keys to check whether something was added/changed/deleted while my client's connection was down (basically to handle lost notifications).

Is there a better way to get signaled about the connection status changes?

Thanks,
Milan

anthony...@coreos.com

May 30, 2017, 12:06:27 PM
to etcd-dev
> Could you please recommend me the best approach for an asynchronous detection of clientv3 re-connection?

The Go client doesn't expose this information at the moment. There used to be a way to watch the grpc client connection for state changes, but that API has since been removed from grpc-go.


> My current approach is rather silly: a separate goroutine repeatedly attempts to read some random key inside a context with a 2-second timeout.

This is probably one of the more reliable ways to do it.


> I'm especially interested in the event of connection re-establishment, at which I'm running a procedure that reloads a range of keys to check if something has been added/changed/deleted while the connection of my client was down (basically to handle lost notifications).

v3 watches will automatically handle resume after disconnect and won't drop any events. Is there a reason why watches wouldn't work for this case?
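
For example, a plain receive loop like this sketch keeps delivering events across reconnects without any extra handling (cli is a placeholder client; imports elided):

    rch := cli.Watch(context.Background(), "my-prefix/", clientv3.WithPrefix())
    for wresp := range rch {
        if err := wresp.Err(); err != nil {
            log.Println("watch error:", err)
        }
        for _, ev := range wresp.Events {
            // The channel transparently resumes after a disconnect,
            // so no events are skipped.
            log.Printf("%s %q -> %q", ev.Type, ev.Kv.Key, ev.Kv.Value)
        }
    }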

Neil Wilson

Jun 2, 2017, 5:11:19 AM
to etcd-dev
On Tuesday, May 30, 2017 at 5:06:27 PM UTC+1, anthony...@coreos.com wrote:


v3 watches will automatically handle resume after disconnect and won't drop any events. Is there a reason why watches wouldn't work for this case?


I've been having trouble with that over the last couple of days and I was wondering what the reconnect design is in APIv3.

I have three replicas, which I reference as endpoints with 'etcdctl' in API=3 mode from two other machines. One machine updates a key every twenty seconds on a ten-second lease, and the other machine has a watch on that key.

I then kill the replica the watch client is connected to (by abruptly terminating the server), and the watch hangs. This appears to be because there is no heartbeat across the link, so TCP has no reason to believe it should close the connection; there simply isn't any data coming in at the moment, and the TCP connection will sit there indefinitely.

There doesn't appear to be a connection to any of the other endpoints (they appear to be connection-tested and then reset by the client), so there is no way one of the other replicas can tell the client that there has been a topology change and a new leader elected.

So I don't see how a watch can resume if a replica just vanishes due to a server crash or VM/container disappearing.
anthony...@coreos.com

Jun 2, 2017, 12:23:18 PM
to etcd-dev
Yes, it's a problem. Opened an issue about heartbeats: https://github.com/coreos/etcd/issues/8022
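
In the meantime, client-side gRPC keepalives are the likely shape of the fix. Assuming options along the lines proposed in that issue eventually land in clientv3.Config (say, DialKeepAliveTime/DialKeepAliveTimeout; hypothetical until merged), the configuration would look roughly like:

    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"},
        DialTimeout: 5 * time.Second,
        // Assumed keepalive options: ping the server every 10s and drop
        // the connection if no acknowledgement arrives within 3s, so a
        // vanished replica is detected even with no watch traffic.
        DialKeepAliveTime:    10 * time.Second,
        DialKeepAliveTimeout: 3 * time.Second,
    })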

Thanks!

Mark Petrovic

Mar 23, 2020, 10:17:19 AM
to etcd-dev
It's not clear to me whether this issue has been resolved, namely the vanishing replica and the hung watch. I see this as the latest issue that speaks to it, but I'm unsure whether they are truly one and the same: https://github.com/etcd-io/etcd/issues/8673

If the issue is not resolved, could one not establish a watch with WithProgressNotify() and reset a watchdog timer (https://golang.org/pkg/time/#Timer) on each response to ascertain whether the watch is hung?
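
Something like this sketch is what I have in mind (the 15-minute budget and the handle function are placeholders of mine; the budget just needs to exceed the server's progress-notify interval, and cli/ctx are assumed to exist):

    rch := cli.Watch(ctx, "my-key", clientv3.WithProgressNotify())
    watchdog := time.NewTimer(15 * time.Minute)
    for {
        select {
        case wresp, ok := <-rch:
            if !ok {
                return // watch channel closed; caller re-establishes it
            }
            // Any response, events or a bare progress notification,
            // proves the watch is alive: reset the watchdog.
            if !watchdog.Stop() {
                <-watchdog.C
            }
            watchdog.Reset(15 * time.Minute)
            handle(wresp)
        case <-watchdog.C:
            // No traffic for the whole budget: presume the watch is
            // hung and tear it down so it can be recreated.
            return
        }
    }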

Many thanks.

