Hey all,
We're considering implementing some patches to the golang grpc implementation. These are things we think would better fit inside of grpc rather than trying to achieve from outside. Before we go through the effort, we'd like to gauge whether these features would be welcome (assuming we'll work with owners to get a quality implementation). Some of these ideas are not fully fleshed out or may not be the best solution to the problem they aim to solve. I also try to state the problem, so if you have ideas on better ways to address these problems, please share :)
Add DialOption MaxConnectionLifetime
Currently, once a connection is established, it lives until there is a transport error or the client proactively closes the connection. These long-lived connections are problematic when using a TCP load balancer, such as the one provided by Google Container Engine and Google Compute Engine. At a clean start, clients will be somewhat evenly distributed among the servers behind the load balancer, but if the servers go through a rolling restart, the servers will become unbalanced: clients have a higher likelihood of being connected to the first server that restarts, and the most recently restarted server ends up with close to zero clients.
We propose fixing this by adding a MaxConnectionLifetime, which will force clients to disconnect after some period of time. We'll use the same mechanism as when an address is removed from a balancer (e.g. drain the connection, rather than abruptly throw errors).
Add DialOption NumConnectionsPerServer
This is related to the problem above. When a client is provided with a single address that points to a TCP load balancer, it's sometimes beneficial for the client to have multiple connections, since the underlying performance might vary.
Add ServerOption MaxConcurrentGlobalStreams
Currently there is only a way to limit the number of streams per client, but it'd be useful to do this globally. This could be achieved via an interceptor that returns StreamRefused, but we thought it might be useful in grpc.
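For reference, the interceptor-style workaround mentioned above boils down to a counting semaphore shared by all clients. The following is a minimal sketch using only the standard library; `globalLimiter`, `handle`, and `errStreamRefused` are hypothetical names for illustration and are not grpc-go identifiers. In a real server the same check would presumably sit inside a stream interceptor.

```go
package main

import (
	"errors"
	"fmt"
)

// errStreamRefused is a hypothetical stand-in for the "StreamRefused" signal
// mentioned in the thread; it is not a grpc-go identifier.
var errStreamRefused = errors.New("stream refused: global limit reached")

// globalLimiter caps the number of streams in flight across all clients.
type globalLimiter struct {
	slots chan struct{}
}

func newGlobalLimiter(max int) *globalLimiter {
	return &globalLimiter{slots: make(chan struct{}, max)}
}

// handle runs fn if a slot is free and refuses immediately otherwise.
func (l *globalLimiter) handle(fn func() error) error {
	select {
	case l.slots <- struct{}{}:
		defer func() { <-l.slots }() // free the slot when the stream ends
		return fn()
	default:
		return errStreamRefused
	}
}

func main() {
	lim := newGlobalLimiter(2)
	release := make(chan struct{})
	started := make(chan struct{})
	done := make(chan error, 2)

	// Occupy both slots with handlers that block until released.
	for i := 0; i < 2; i++ {
		go func() {
			done <- lim.handle(func() error {
				started <- struct{}{}
				<-release
				return nil
			})
		}()
	}
	<-started
	<-started

	// A third stream is refused while the limit is reached.
	fmt.Println(lim.handle(func() error { return nil }))

	close(release)
	<-done
	<-done

	// Slots are free again, so the next stream is accepted.
	fmt.Println(lim.handle(func() error { return nil }))
}
```

Refusing immediately rather than queueing keeps the server's memory use bounded, which is the point of a global limit; whether to queue briefly instead is a design choice this sketch does not explore.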
Add facility for retries
Currently, retries must happen in user-level code, but it'd be beneficial for performance and robustness to have a way to do this within GRPC. Today, if the server refuses a request with StreamRefused, the client doesn't have a way to retry on a different server; it can only reissue the request and hope it gets a different server. It also forces the client to reserialize the request, which is unnecessary, and given the cost of serialization with proto, it'd be nice to avoid this.
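To make the serialization point concrete, here is a rough sketch of a retry loop that serializes the request once and reuses the bytes for each attempt against a different backend. Everything here is an assumption for illustration: `callWithRetry`, `sendFunc`, and `errRefused` are hypothetical names, and `encoding/json` stands in for proto serialization so the example needs only the standard library.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// errRefused is a hypothetical marker for "the server refused the stream".
var errRefused = errors.New("stream refused")

// sendFunc is a hypothetical transport hook: it delivers an already
// serialized payload to one backend.
type sendFunc func(backend string, payload []byte) error

// callWithRetry serializes req once, then tries each backend in order,
// moving on only when a backend refuses the stream.
func callWithRetry(req interface{}, backends []string, send sendFunc) (string, error) {
	payload, err := json.Marshal(req) // stand-in for proto serialization
	if err != nil {
		return "", err
	}
	for _, b := range backends {
		err = send(b, payload)
		if err == nil {
			return b, nil
		}
		if !errors.Is(err, errRefused) {
			return "", err // other failures are not retried in this sketch
		}
	}
	return "", fmt.Errorf("all %d backends refused: %w", len(backends), errRefused)
}

func main() {
	send := func(backend string, payload []byte) error {
		if backend == "a" {
			return errRefused // the first backend is at its limit
		}
		return nil
	}
	served, err := callWithRetry(map[string]int{"x": 1}, []string{"a", "b"}, send)
	fmt.Println(served, err)
}
```

Retrying only on a refusal is deliberate: a refused stream was never processed, so resending it cannot duplicate work, which is not true of arbitrary errors.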
Change behavior of Dial to not block on the balancer's initial list
Currently, when you construct a *grpc.ClientConn with a balancer, the call to Dial blocks until the initial set of servers is returned from the balancer and errors if the balancer returns an empty list. This is inconsistent with the behavior of the client when the balancer produces an empty list later in the life of the client.
We propose changing the behavior such that Dial does not wait for the response of the balancer and thus also can't return an error when the list is empty. This not only makes the behavior consistent, it has the added benefit that callers don't need to do their own retries of Dial.
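The proposed behavior can be sketched without grpc at all: the constructor returns immediately, and it is the individual call that waits for the balancer to produce an address. The names below (`lazyConn`, `newLazyConn`, `pick`, `errCanceled`) are hypothetical and only illustrate the idea; a real API would most likely take a `context.Context` instead of the plain `done` channel used here to keep the example small.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errCanceled = errors.New("canceled before any address was available")

// lazyConn is a hypothetical client handle that is usable before the
// balancer has produced any addresses.
type lazyConn struct {
	mu      sync.Mutex
	addrs   []string
	changed chan struct{} // closed and replaced on every balancer update
}

// newLazyConn returns at once; it never fails on an empty address list.
func newLazyConn(updates <-chan []string) *lazyConn {
	c := &lazyConn{changed: make(chan struct{})}
	go func() {
		for addrs := range updates {
			c.mu.Lock()
			c.addrs = addrs
			close(c.changed) // wake up every waiting pick
			c.changed = make(chan struct{})
			c.mu.Unlock()
		}
	}()
	return c
}

// pick waits until some address is known or done is closed. An empty list
// is handled the same way at startup and later in the client's life.
func (c *lazyConn) pick(done <-chan struct{}) (string, error) {
	for {
		c.mu.Lock()
		if len(c.addrs) > 0 {
			addr := c.addrs[0]
			c.mu.Unlock()
			return addr, nil
		}
		changed := c.changed
		c.mu.Unlock()

		select {
		case <-changed:
		case <-done:
			return "", errCanceled
		}
	}
}

func main() {
	updates := make(chan []string)
	c := newLazyConn(updates) // returns before the balancer has answered

	go func() { updates <- []string{"10.0.0.1:443"} }()

	addr, err := c.pick(make(chan struct{}))
	fmt.Println(addr, err)
}
```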
To reiterate, these are just rough ideas and we're also in search of other solutions to these problems if you have ideas.
Thanks!
--
You received this message because you are subscribed to the Google Groups "grpc.io" group.
To unsubscribe from this group and stop receiving emails from it, send an email to grpc-io+unsubscribe@googlegroups.com.
To post to this group, send email to grp...@googlegroups.com.
Visit this group at https://groups.google.com/group/grpc-io.
To view this discussion on the web visit https://groups.google.com/d/msgid/grpc-io/abaa9977-78ee-41d0-b0f5-a4e273dfd13a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Thanks for the info. My comments are inline.
On Sun, Dec 4, 2016 at 7:27 PM, Arya Asemanfar <arya.as...@mixpanel.com> wrote:
> Add DialOption MaxConnectionLifetime
> Currently, once a connection is established, it lives until there is a transport error or the client proactively closes the connection. These long-lived connections are problematic when using a TCP load balancer, such as the one provided by Google Container Engine and Google Compute Engine. At a clean start, clients will be somewhat evenly distributed among the servers behind the load balancer, but if the servers go through a rolling restart, the servers will become unbalanced: clients have a higher likelihood of being connected to the first server that restarts, and the most recently restarted server ends up with close to zero clients.
I do not think long-lived connections are problematic as long as there is live traffic on them. We do have a plan to add idle shutdown to actively close TCP connections which live long and have no traffic for a while. Which server is chosen really depends on the load balancing policy you choose -- I do not see why your description could happen if you use a round-robin load balance policy.
> We propose fixing this by adding a MaxConnectionLifetime, which will force clients to disconnect after some period of time. We'll use the same mechanism as when an address is removed from a balancer (e.g. drain the connection, rather than abruptly throw errors).
This should be achieved by the GRPCLB load balancer, which can sense the workload of all the servers and send a refreshed backend list when needed. I am not convinced MaxConnectionLifetime is a must.
> Add DialOption NumConnectionsPerServer
> This is related to the problem above. When a client is provided with a single address that points to a TCP load balancer, it's sometimes beneficial for the client to have multiple connections, since the underlying performance might vary.
I am not clear what you plan to do here. Do you want to create multiple connections to a single endpoint (e.g., TCP load balancer)? If yes, you can customize your load balancer impl to do that already (the endpoints with same address but different metadata are treated as different ones in grpc internals).
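The metadata trick can be illustrated with a local stand-in type. The `address` struct below is an assumption modeled on an address-plus-opaque-metadata pair; it is not imported from grpc-go, and how grpc actually keys its connections internally is not shown here. The sketch only demonstrates that entries sharing one address string but carrying different metadata compare as distinct values, which is what a custom balancer would rely on.

```go
package main

import "fmt"

// address is a local stand-in with an address string plus opaque metadata.
// It is an illustrative assumption, not the grpc-go type itself.
type address struct {
	Addr     string
	Metadata interface{}
}

// fanOut returns n entries that share one address but differ in metadata,
// as a custom balancer might do to get n connections to one endpoint.
func fanOut(addr string, n int) []address {
	out := make([]address, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, address{Addr: addr, Metadata: i})
	}
	return out
}

func main() {
	conns := make(map[address]bool)
	for _, a := range fanOut("10.0.0.1:443", 4) {
		conns[a] = true
	}
	fmt.Println(len(conns)) // prints 4: one key per metadata value
}
```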
> Add ServerOption MaxConcurrentGlobalStreams
> Currently there is only a way to limit the number of streams per client, but it'd be useful to do this globally. This could be achieved via an interceptor that returns StreamRefused, but we thought it might be useful in grpc.
This is something similar to what we plan to add for flow control purposes. gRPC servers will have some knobs (e.g., ServerOption) to throttle the resource usage (e.g., memory) of the entire server.
> Add facility for retries
> Currently, retries must happen in user-level code, but it'd be beneficial for performance and robustness to have a way to do this within GRPC. Today, if the server refuses a request with StreamRefused, the client doesn't have a way to retry on a different server; it can only reissue the request and hope it gets a different server. It also forces the client to reserialize the request, which is unnecessary, and given the cost of serialization with proto, it'd be nice to avoid this.
This is also something on our road map.
> Change behavior of Dial to not block on the balancer's initial list
> Currently, when you construct a *grpc.ClientConn with a balancer, the call to Dial blocks until the initial set of servers is returned from the balancer and errors if the balancer returns an empty list. This is inconsistent with the behavior of the client when the balancer produces an empty list later in the life of the client.
> We propose changing the behavior such that Dial does not wait for the response of the balancer and thus also can't return an error when the list is empty. This not only makes the behavior consistent, it has the added benefit that callers don't need to do their own retries of Dial.
If my memory serves, this discussion has happened before. The name "Dial" indicates that the dial operation needs to have been triggered by the time it returns. We probably can add another public surface like "NewClientConn" to achieve what you want here.
--
Thanks,
-Qi
Thanks for the feedback. Good idea re metadata for getting the Balancer to treat the connections as different. Will take a look at that.
Some clarifications/questions inline:
On Mon, Dec 5, 2016 at 11:11 AM, 'Qi Zhao' via grpc.io <grp...@googlegroups.com> wrote:
> I do not think long-lived connections are problematic as long as there is live traffic on them. We do have a plan to add idle shutdown to actively close TCP connections which live long and have no traffic for a while. Which server is chosen really depends on the load balancing policy you choose -- I do not see why your description could happen if you use a round-robin load balance policy.
We have a single IP address that we give to GRPC (since the IP address is Google Cloud's TCP load balancer). The client establishes one connection and has no reason to disconnect in normal conditions.
Here's an example scenario that results in uneven load:
- 100 clients connected evenly to 10 servers
- each of the 10 servers has about 10 connections
- each of the clients sends about an equal amount of traffic to the server they are connected to
- one of the servers restarts
- the 10 clients that were connected to that 1 server re-establish connections
- the new server, assuming it came up in time, has on average 1 connection, with each of the other 9 having 1 additional connection
- now we have 10 servers, one with 1 client and 9 with 11 clients, so the load is unevenly distributed
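The arithmetic in the scenario above can be reproduced with a small simulation. This is a sketch under a simplifying assumption: the displaced clients are spread round-robin across all servers, which stands in for the "on average" distribution a connection-level load balancer would produce. `restartOne` is a hypothetical helper name.

```go
package main

import "fmt"

// restartOne models one server restart behind a connection-level balancer.
// conns[i] is the number of client connections on server i. The clients of
// the restarted server reconnect and are spread round-robin over all
// servers, an idealized stand-in for the balancer's even distribution.
func restartOne(conns []int, restarted int) []int {
	out := make([]int, len(conns))
	copy(out, conns)
	displaced := out[restarted]
	out[restarted] = 0 // the restarted server comes back with no clients
	for i := 0; i < displaced; i++ {
		out[i%len(out)]++
	}
	return out
}

func main() {
	// 100 clients connected evenly to 10 servers.
	conns := []int{10, 10, 10, 10, 10, 10, 10, 10, 10, 10}
	fmt.Println(restartOne(conns, 0)) // prints [1 11 11 11 11 11 11 11 11 11]
}
```

Nothing ever moves the existing connections back, so the imbalance persists until clients disconnect for some other reason.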
Since a TCP load balancer is only aware of TCP packets and not HTTP2 frames, it cannot multiplex requests from multiple clients onto 1 connection. A TCP load balancer makes a load balancing decision at connection establishment, not per stream, request, or packet.
Re calling Close after a timer fires: this terminates in-flight requests, so we'd need to duplicate the bookkeeping of outstanding streams, which makes this cumbersome.
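To show what that duplicated bookkeeping would look like in user code, here is a rough sketch of a wrapper that counts outstanding calls and closes the underlying connection only once a drain has been requested and nothing is in flight. All names (`drainingConn`, `begin`, `end`, `drain`, `closeFn`) are hypothetical; a lifetime timer such as `time.AfterFunc` would be what calls `drain`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errDraining = errors.New("connection is draining")

// drainingConn is a hypothetical wrapper around a client connection. It
// counts outstanding calls so the connection is closed only after they
// have all finished.
type drainingConn struct {
	mu       sync.Mutex
	inflight int
	draining bool
	closed   bool
	closeFn  func() // hypothetical hook, e.g. the real connection's Close
}

// begin registers a new call, or refuses it if the connection is draining.
func (c *drainingConn) begin() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.draining {
		return errDraining
	}
	c.inflight++
	return nil
}

// end marks a call as finished and closes the connection if it was the
// last outstanding call on a draining connection.
func (c *drainingConn) end() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.inflight--
	c.maybeCloseLocked()
}

// drain is what a lifetime timer would call: stop accepting new calls and
// close as soon as nothing is in flight.
func (c *drainingConn) drain() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.draining = true
	c.maybeCloseLocked()
}

// maybeCloseLocked must be called with c.mu held.
func (c *drainingConn) maybeCloseLocked() {
	if c.draining && c.inflight == 0 && !c.closed {
		c.closed = true
		c.closeFn()
	}
}

func main() {
	c := &drainingConn{closeFn: func() { fmt.Println("closed") }}
	_ = c.begin()          // one call in flight
	c.drain()              // timer fires: nothing is closed yet
	fmt.Println(c.begin()) // new calls are refused
	c.end()                // last call finishes, the connection closes
}
```

Every call site has to be wrapped in `begin` and `end` for this to work, which is the cumbersome part; grpc already tracks streams per transport, so doing it there avoids the duplication.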