I've got a concern with this Routing API and I'm not really sure of the best way to solve it.
Today if a gorouter fails to receive a route.register message for X amount of time (droplet_stale_threshold) then it will remove the route from its tables. Which is fine. However, it expires this route regardless of whether NATS is up or down. So, if for some reason all of our NATS servers go down for an extended period of time (which we have had happen before) then the routers eventually forget all their routes and all of the applications in our deployment are no longer accessible.
I think there is an easy fix for this issue in the gorouter today. The gorouter should simply not consider a route stale unless NATS is accessible. Why should it make a route stale if it knows it is impossible for applications to refresh their entry?
Anyway, this is an issue that I've been meaning to submit to the router for a while but haven't gotten around to it.
My main concern is that, if not implemented correctly, the Routing API could exacerbate this issue.
Suppose a request is made to the Routing API service to update/register a route, the service updates etcd, and the router then reads the route from etcd. We now have a blind layer of indirection: if a route in etcd goes stale, the router has no way of knowing whether it went stale because the application that registered the route went down or because the application that hosts the Routing API endpoint is down. If the Routing API application is down, or if etcd itself is down, I would think that the router wouldn't want to expire any of its routes, because applications would be unable to refresh them. But how can the router know this and avoid considering a route stale?
Thoughts?
Mike