Proposal for Routing API

Dieu Cao

Jan 14, 2015, 1:14:56 AM

to vcap...@cloudfoundry.org

Hi All,

We've put together a proposal for a new Routing API [1].

The goal of this feature is to build a Routing API that services such as the Cloud Controller, UAA and service brokers can use to register/unregister routes, without needing to know the details of the backing service, and for the router to load balance traffic to those services.

We'd like to be able to iterate on this document quickly, so please keep your input and feedback in the comments of the document.

-Dieu

CF Runtime PM

[1] https://docs.google.com/document/d/1iSrqmjRDEM0y_nIgk-iXNR9P-ODV7E-StN4UMj8y0lQ/edit?usp=sharing

Mike Youngstrom

Feb 27, 2015, 2:00:23 PM

to vcap...@cloudfoundry.org

I've got a concern with this Routing API and I'm not really sure of the best way to solve it.

Today, if a gorouter fails to receive a route.register message for a route within a configured period (droplet_stale_threshold), it removes the route from its tables, which is fine. However, it expires the route regardless of whether NATS is up or down. So if all of our NATS servers go down for an extended period (which has happened to us before), the routers eventually forget all their routes and every application in our deployment becomes inaccessible.

I think there is an easy fix for this issue in the gorouter today. The gorouter should simply not consider a route stale unless NATS is accessible. Why should it make a route stale if it knows it is impossible for applications to refresh their entry?
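
The difference between the current pruning behaviour and the fix proposed above can be sketched in a few lines of Python. This is an illustrative toy, not gorouter's code: the `RouteTable` class, its method names, and the `skip_when_disconnected` flag are all made up for this example; only `route.register` and `droplet_stale_threshold` come from the discussion.

```python
import time


class RouteTable:
    """Toy route table that prunes entries not refreshed within a threshold."""

    def __init__(self, stale_threshold, clock=time.monotonic):
        self.stale_threshold = stale_threshold  # seconds, cf. droplet_stale_threshold
        self.clock = clock
        self.last_seen = {}  # (uri, backend) -> time of the last route.register

    def register(self, uri, backend):
        # Called for every route.register message received from the bus.
        self.last_seen[(uri, backend)] = self.clock()

    def prune(self, bus_connected, skip_when_disconnected=False):
        # Current behaviour: prune regardless of the bus state.
        # Proposed behaviour: skip pruning while the bus is unreachable.
        if skip_when_disconnected and not bus_connected:
            return []
        now = self.clock()
        stale = [key for key, seen in self.last_seen.items()
                 if now - seen > self.stale_threshold]
        for key in stale:
            del self.last_seen[key]
        return stale
```

With `skip_when_disconnected=True`, a router that has lost its bus connection keeps every route it already knows, which is exactly the property being asked for here, and also the property that makes stale entries dangerous if backends move in the meantime.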

Anyway, this is an issue that I've been meaning to submit to the router for a while but haven't gotten around to it.

My main concern is that, if not implemented correctly, the Routing API could exacerbate this issue.

Suppose a request is made to the Routing API service to register or update a route, this service updates etcd, and the router then reads from etcd. We now have a blind layer of indirection: if a route in etcd goes stale, the router has no way of knowing whether that happened because the application that registered the route went down or because the application that hosts the Routing API endpoint is down. If the Routing API application is down, or if etcd itself is down, I would think the router wouldn't want to expire any of its routes, because applications would be unable to refresh them. But how can the router know this and avoid treating the routes as stale?
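
A small simulation may make the ambiguity concrete. It is only a sketch under simplifying assumptions: the store is modelled as a dictionary of last-write timestamps, and `simulate`, `router_view`, the TTL and the refresh interval are all hypothetical names and values chosen for illustration.

```python
def router_view(store, now, ttl):
    """Return the routes the router would consider stale, given only the store."""
    return sorted(uri for uri, updated_at in store.items() if now - updated_at > ttl)


def simulate(app_alive, routing_api_alive, ttl=120, ticks=10, interval=30):
    # store maps a route URI to the time of the last successful write.
    store = {"app.example.com": 0}
    now = 0
    for _ in range(ticks):
        now += interval
        # A refresh reaches the store only if BOTH the app and the API are up.
        if app_alive and routing_api_alive:
            store["app.example.com"] = now
    return router_view(store, now, ttl)
```

Both `simulate(app_alive=False, routing_api_alive=True)` and `simulate(app_alive=True, routing_api_alive=False)` leave the router looking at the same stale entry, so the store contents alone cannot tell it which component failed.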

Thoughts?

Mike

--
You received this message because you are subscribed to the Google Groups "Cloud Foundry Developers" group.
To view this discussion on the web visit https://groups.google.com/a/cloudfoundry.org/d/msgid/vcap-dev/e0990b63-b7a8-4482-9d3d-ed9d9b3deba5%40cloudfoundry.org.

To unsubscribe from this group and stop receiving emails from it, send an email to vcap-dev+u...@cloudfoundry.org.

Dieu Cao

Feb 27, 2015, 6:09:24 PM

to vcap...@cloudfoundry.org

Hi Mike,

We actually tried this a while back, where we added logic to not expire stale routes if NATS is inaccessible.

We then saw in one of our environments a situation where during a deploy, one of four routers lost connectivity to NATS, and then DEAs started to roll while we were trying to figure out what was going on.

Because routes were not expiring and apps were being moved around, it was possible during this time for a user to be routed to the wrong app.

We believe being routed to the wrong app is much worse than not being able to reach your app, so we then ripped that code out.

If we want to not expire routes, we need a way to ensure that users would never get routed to the wrong place. Michael Fraenkel suggested this could be done via an iptables rule that would be checked against a private instance id injected into the request. [1]
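
As a rough illustration of the instance-id idea, here is a Python sketch. It performs the check in application code rather than as an iptables rule, and the header name, the status codes, and the `route` and `accept_request` helpers are assumptions made up for this example, not anything gorouter or the DEA provides.

```python
INSTANCE_ID_HEADER = "X-Instance-Id"  # hypothetical header name


def accept_request(headers, local_instance_id):
    """Return True only if the request was addressed to this instance."""
    return headers.get(INSTANCE_ID_HEADER) == local_instance_id


def route(route_table, backends, uri):
    """Forward using a (possibly stale) route table and return a status code."""
    address, expected_instance = route_table[uri]
    actual_instance = backends.get(address)  # who really listens there now
    if actual_instance is None:
        return 502  # nothing is listening: the app is unreachable
    headers = {INSTANCE_ID_HEADER: expected_instance}
    if not accept_request(headers, actual_instance):
        return 503  # another app holds the port: rejected, not misrouted
    return 200
```

With a check like this, a stale entry degrades to "cannot reach the app" instead of "reached the wrong app", which is the ordering of failures preferred above.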

Please let us know if you have some other ideas about how we could achieve this.

-Dieu

CF Runtime PM

[1] https://www.pivotaltracker.com/story/show/83120430

Mike Youngstrom

Feb 27, 2015, 6:51:44 PM

to vcap...@cloudfoundry.org

Hi Dieu, thanks for the insight.

I would think that something like Mike Fraenkel's suggestion would work well. In addition, it would take only a little more effort for the same solution to address my related concern with route services [0]. If an app wishes to support direct connections, let it deploy with --no-route, and we disable this validation for it. Otherwise, why not put a rule in iptables that always checks that the request came from a router and that the destination application is the correct one?

That said, I don't know whether this would eliminate my main concern with the Routing API concept: a router not knowing whether it should expire a route when the Routing API components are failing to update route data.

Mike

[0] https://groups.google.com/a/cloudfoundry.org/d/msg/vcap-dev/bfin6_Bl_zQ/IGhNbe1j4sUJ

Mike Youngstrom

Feb 27, 2015, 7:11:37 PM

to vcap...@cloudfoundry.org

Although, after doing a little research, it appears that it would be very difficult to do something like this with iptables, even using the concepts discussed in the article Michael referenced. Anyway, thanks for the insight into why you removed the NATS connectivity checks from gorouter.

If I can think of a solution short of running a proxy on every DEA, I'll bring it up.

Mike

Mike Youngstrom

Apr 7, 2015, 7:58:50 PM

to vcap...@cloudfoundry.org

> Please let us know if you have some other ideas about how we could achieve this.

Dieu,

It seems that even with the current solution of always dropping stale routes after 120 seconds, you could still run into scenarios where DEAs roll and a route starts going to the wrong app while the router is not connected to NATS.

One thing I did think of that could improve this situation further would be more randomized Warden port assignments. Perhaps instead of starting at the beginning of the port pool on Warden/DEA start, you could start at a random place in the port pool. That would make routing to the wrong app happen less often. Warden could even persist its last place in the pool across restarts and continue from that point. In the case of a new stemcell, the random starting point would have to be good enough, but it would still be better than always taking the first port.
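
The random-start idea could look roughly like the following Python sketch. It is not Warden's allocator: the `PortPool` class, its parameters, and the port numbers used with it are illustrative assumptions. Passing `start_offset` stands in for resuming from a persisted position after a restart.

```python
import random


class PortPool:
    """Hands out ports sequentially, wrapping around, from a chosen start."""

    def __init__(self, first_port, size, start_offset=None, rng=random):
        self.first_port = first_port
        self.size = size
        # Random start unless a persisted offset is supplied (e.g. after restart).
        if start_offset is None:
            self.offset = rng.randrange(size)
        else:
            self.offset = start_offset % size
        self.in_use = set()

    def acquire(self):
        # Walk the pool at most once, skipping ports that are still in use.
        for _ in range(self.size):
            port = self.first_port + self.offset
            self.offset = (self.offset + 1) % self.size
            if port not in self.in_use:
                self.in_use.add(port)
                return port
        raise RuntimeError("port pool exhausted")

    def release(self, port):
        self.in_use.discard(port)
```

Two freshly started hosts built this way are unlikely to hand the same first port to different apps, so a router holding a stale `host:port` entry is less likely to find an unrelated app listening there. Saving `pool.offset` and passing it back as `start_offset` covers the "continue from that point" case.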

Just a thought.

Mike

Dieu Cao

Apr 8, 2015, 2:05:14 AM

to vcap...@cloudfoundry.org

That's a good suggestion.

I added a story for it [1].

Thanks Mike!

-Dieu

[1] https://www.pivotaltracker.com/story/show/92010618

Mike Youngstrom

Apr 8, 2015, 12:28:12 PM

to vcap...@cloudfoundry.org

Another approach could be to derive the application port from a consistent hash of the app guid, combined with an incremental input. That would help ensure that instances of the same application consume the same ports on each DEA more often. Again, it's not foolproof, but it would probably help prevent misdirected routes even more, with the side benefit that routers would more often happen to hit the correct port even when disconnected from NATS.
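
One possible reading of this suggestion, sketched in Python: hash the app guid to pick a starting slot in the pool, then probe successive slots until a free port is found. This uses a plain SHA-256 with linear probing rather than a hash ring, and `preferred_ports`, `assign_port`, and the pool bounds are hypothetical names and values, not part of Warden or the DEA.

```python
import hashlib


def preferred_ports(app_guid, first_port, size):
    """Yield candidate ports for an app in a deterministic, guid-derived order."""
    digest = hashlib.sha256(app_guid.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % size
    for step in range(size):
        # The incremental input: probe successive slots after the hashed start.
        yield first_port + (start + step) % size


def assign_port(app_guid, first_port, size, in_use):
    """Take the first free port in the app's preferred order on this host."""
    for port in preferred_ports(app_guid, first_port, size):
        if port not in in_use:
            in_use.add(port)
            return port
    raise RuntimeError("port pool exhausted")
```

On two hosts with empty pools, the same guid yields the same port, so a stale entry for that app tends to land either on another instance of the same app or on nothing at all. Collisions between different guids are still possible, which is why this reduces misrouting rather than eliminating it.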

Mike
