Support for sharding in gRPC


Antoine Tollenaere

Mar 27, 2023, 7:55:00 AM
to grp...@googlegroups.com
Hey.

My organization is working on implementing a generic sharding mechanism where teams can configure a mapping of keys to a set of shards, and of shards to a set of containers, for a given service. This is analogous to Facebook's shard manager [1]. As far as the routing layer is concerned, the idea is that:

- a control plane component provides information for selecting a shard based on information contained in the incoming request (e.g. a mapping from individual keys to a shard identifier, or from key ranges to a shard identifier). Assignments of keys to shards do not have to change very frequently (e.g. monthly), but the explicit mapping can be quite large (e.g. millions of entries).
- a control plane component provides information on how to map shards to running containers (network endpoints). A rough sketch of both mappings follows this list.
- the data plane (gRPC or Envoy) uses that information to send requests to the appropriate endpoint.
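To make the two mappings concrete, here is a minimal sketch of what the control plane could serve, written as Go types (all names are illustrative, not an existing API):

```go
package sharding

// ShardID identifies a logical shard. All names in this sketch are
// illustrative; they are not part of any existing gRPC or Envoy API.
type ShardID string

// KeyRange assigns a contiguous range of keys to a shard. An explicit
// key->shard table could be used instead where ranges are not practical.
type KeyRange struct {
	Start, End string // inclusive start, exclusive end
	Shard      ShardID
}

// Assignment is the state the control plane distributes: the slowly
// changing key->shard mapping plus the shard->endpoints mapping.
type Assignment struct {
	Ranges    []KeyRange
	Endpoints map[ShardID][]string // endpoint addresses, e.g. "10.0.0.5:8443"
}
```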

I'm looking for guidance on how to implement the routing layer in gRPC, potentially using xDS. We also use Envoy for the same purpose, so ideally, the solution is similar for both. In our initial design for Envoy, we use:

1. A custom filter to map requests to shards. In gRPC, this could be done via an interceptor that injects the shard identifier into the request context (a sketch follows this list).
2. Load Balancer Subsets [2] to route requests to individual endpoints within a CDS cluster. This assumes that all shards are part of the same CDS cluster, and that identifiers are communicated from the control plane to the clients via the EDS config.endpoint.v3.LbEndpoint metadata field [3]. The problem here is that, as far as I can see, Load Balancer Subsets are not implemented in gRPC.
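For point 1, the interceptor could look roughly like this in Go (the x-shard-id header name and the two lookup callbacks are placeholders of ours, not an established convention):

```go
package shardingclient

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// ShardInterceptor attaches the shard identifier computed from the request
// to the outgoing metadata. keyFromRequest extracts the sharding key from
// the request message; shardForKey consults the key->shard mapping
// distributed by the control plane. Both are placeholders.
func ShardInterceptor(
	keyFromRequest func(req any) string,
	shardForKey func(key string) string,
) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply any,
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		shard := shardForKey(keyFromRequest(req))
		// Expose the shard identifier as request metadata so that a filter
		// or load balancing policy further down the stack can route on it.
		ctx = metadata.AppendToOutgoingContext(ctx, "x-shard-id", shard)
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}
```

The client would install it with grpc.WithUnaryInterceptor when creating the channel.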

For 2., an alternative design would use a different CDS cluster for each shard. This has some advantages, such as more configuration options and detailed metrics for each shard, but it is more costly in terms of Envoy memory footprint, metrics cost (Envoy generates a lot of metrics for each cluster -- but we could work around that one), and the number of CDS and EDS subscriptions. So first, I'd like to know whether support for Load Balancer Subsets is something the gRPC team has considered and rejected, or whether it simply has not been implemented because it is low priority (I do not see any mention of it in the xDS-related gRFCs).

If the team does not plan on supporting load balancer subsets, would you have recommendations for an alternative way to implement sharding within or on top of gRPC? There are a few ways I can think of:

1. implement sharding entirely on top of the stack, using multiple gRPC channels (a rough sketch follows this list). Exposing an interface that looks like a single gRPC client is some amount of work on the data plane/library side, but it is possible.
2. put each shard in a different CDS cluster rather than using load balancer subsets, simply using the mechanisms described in A28 - gRPC xDS traffic splitting and routing [4] to route to the correct endpoint. This would increase the number of xDS subscriptions by a factor of 1000 for sharded services, which may become a problem. I think it would also increase the number of connections to each backend, since each container typically hosts multiple shards.
3. implement a custom load balancing policy that does the mapping for gRPC based on load balancer subsets [5]. I haven't dug too much into what that would look like in practice.
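To illustrate what option 1 could look like, here is a rough sketch of a shard-aware wrapper that owns one channel per shard (names are ours; error handling, channel lifecycle, and refreshing the mapping from the control plane are omitted):

```go
package shardingclient

import (
	"fmt"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// ShardedConns hides one ClientConn per shard behind a single lookup.
// In practice the shard->address table would be refreshed from the
// control plane rather than fixed at construction time.
type ShardedConns struct {
	mu    sync.RWMutex
	conns map[string]*grpc.ClientConn
}

// Dial creates one channel per shard. Each address could itself be a
// resolver target that expands to several replicas of the shard.
// Insecure credentials are used here only for brevity.
func Dial(addrs map[string]string) (*ShardedConns, error) {
	s := &ShardedConns{conns: make(map[string]*grpc.ClientConn)}
	for shard, addr := range addrs {
		cc, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			return nil, err
		}
		s.conns[shard] = cc
	}
	return s, nil
}

// ConnFor returns the channel for a shard; callers create their stubs on it.
func (s *ShardedConns) ConnFor(shard string) (*grpc.ClientConn, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	cc, ok := s.conns[shard]
	if !ok {
		return nil, fmt.Errorf("unknown shard %q", shard)
	}
	return cc, nil
}
```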

If that matters, we plan to use Go and Java throughout (we also use Python, but in that case we typically go through an egress Envoy proxy). Has anyone successfully done any of these? (Some teams here have done 1. to some extent, but we are aiming to make sharding more transparent and less work for users.)

Thanks,
Antoine.

sanjay...@google.com

Apr 2, 2023, 3:10:09 AM
to grpc.io
Hi Antoine,

This is an interesting use case, and when I started thinking about possible solutions (including the ones you have suggested) I noticed some structural similarity with something we have been working on - I will come to that shortly.

But first some clarifications:

- you said "map shards to running containers (network endpoints)" :  is there a single container (endpoint) for a shard or can you have more than one endpoint and a load balancer picks any one of those endpoints (based on some LB policy)?

- if there is a single endpoint per shard, then the shard identifier is basically immaterial to the gRPC routing component: the two mapping steps can be combined so that, in effect, the keys in the incoming request map directly to an endpoint - unless I am missing something

- the "information contained in the incoming request" i.e. the list of keys are part of the request metadata and not the request payload, is that correct? 

There is something called stateful session affinity, described in A55: xDS-Based Stateful Session Affinity for Proxyless gRPC, and there is a follow-on proposal, "A60: new gRFC for weighted clusters support for Stateful Session Affinity", currently in a PR. The solution I have in mind for your problem is structurally similar to what is proposed in A60.

As you mentioned for your Envoy solution, you would basically have a "custom" filter that maps a request to a shard and/or an endpoint (since the shard-to-cluster mapping is not clear, let's say the filter specifies both a cluster and an endpoint within that cluster). Using the technique described in A60, the cluster selection made by the config selector would be overridden by the ClusterSelectionFilter, and the endpoint selection would be done by the xds_override_host policy, based on the pick attribute set by your custom filter.
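To give a feel for the endpoint-selection half, here is a structural sketch in Go of a picker that honors an endpoint override carried with the call. It is only an illustration: the real A55/A60 mechanism uses gRPC-internal call attributes set by the filter rather than a plain context key, and all names below are made up.

```go
package overridehostsketch

import (
	"context"

	"google.golang.org/grpc/balancer"
)

// overrideKey is a hypothetical context key under which the custom filter
// (or interceptor) records the endpoint chosen from the shard mapping.
type overrideKey struct{}

// WithEndpointOverride marks a call as destined for a specific endpoint.
func WithEndpointOverride(ctx context.Context, addr string) context.Context {
	return context.WithValue(ctx, overrideKey{}, addr)
}

// picker mirrors, in spirit, what xds_override_host does: if the call
// carries an endpoint override and a usable SubConn exists for that
// address, pick it; otherwise defer to the child policy's picker.
type picker struct {
	byAddr   map[string]balancer.SubConn
	fallback balancer.Picker
}

func (p *picker) Pick(info balancer.PickInfo) (balancer.PickResult, error) {
	if addr, ok := info.Ctx.Value(overrideKey{}).(string); ok {
		if sc, ok := p.byAddr[addr]; ok {
			return balancer.PickResult{SubConn: sc}, nil
		}
	}
	return p.fallback.Pick(info)
}
```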

So the questions left to be answered are how the custom filter would be injected into the gRPC client, and whether you agree that there is a structural similarity between the two use cases. The custom filter in your case uses keys in the incoming request (instead of a cookie) to figure out the cluster and the endpoint - and possibly also interacts with an external component to obtain the keys->shard and shard->endpoint mappings. Based on the answers to these questions we can discuss further.

Sanjay

Antoine Tollenaere

Apr 7, 2023, 6:00:04 AM
to sanjay...@google.com, grpc.io
Hey Sanjay. Thank you for your answer.

Answers inline below.

On Sun, Apr 2, 2023 at 9:10 AM 'sanjay...@google.com' via grpc.io <grp...@googlegroups.com> wrote:
- you said "map shards to running containers (network endpoints)" :  is there a single container (endpoint) for a shard or can you have more than one endpoint and a load balancer picks any one of those endpoints (based on some LB policy)?

There are 3 typical setups:

1. primary only: each shard has a single replica
2. secondary only: each shard has multiple replicas
3. primary/secondary: one replica is responsible for writes, and there are multiple replicas available for reads.

While 1. is a popular option and a good first iteration, the design should definitely account for the possibility of having multiple replicas and using other load balancing policies to route to them.

- if there is a single endpoint per shard, then the shard identifier is basically immaterial to the gRPC routing component: the two mapping steps can be combined so that, in effect, the keys in the incoming request map directly to an endpoint - unless I am missing something

Yes, that would be true iff there were a 1<->1 mapping between shards and endpoints. That being said, we plan to have the service discovery component, rather than the application, be aware of the shard <-> endpoint mapping, to simplify integration with Envoy and, if possible, reuse gRPC's built-in load balancing when multiple replicas are present. Ideally the shard -> host mapping is not known to the client application, only to gRPC (an alternative would be to pick the host directly in the filter/interceptor, but that would require it to implement load balancing, which is not ideal).
 
- the "information contained in the incoming request" i.e. the list of keys are part of the request metadata and not the request payload, is that correct? 

Yes, we can make this assumption.
 
There is something called stateful session affinity, described in A55: xDS-Based Stateful Session Affinity for Proxyless gRPC, and there is a follow-on proposal, "A60: new gRFC for weighted clusters support for Stateful Session Affinity", currently in a PR. The solution I have in mind for your problem is structurally similar to what is proposed in A60.
As you mentioned for your Envoy solution, you would basically have a "custom" filter that maps a request to a shard and/or an endpoint (since the shard-to-cluster mapping is not clear, let's say the filter specifies both a cluster and an endpoint within that cluster). Using the technique described in A60, the cluster selection made by the config selector would be overridden by the ClusterSelectionFilter, and the endpoint selection would be done by the xds_override_host policy, based on the pick attribute set by your custom filter.

Thanks for pointing out A55 and A60. This is very interesting in that it shows how filters can interact with the load balancer to pick a particular host. I think if we were to implement metadata subsets, we would need to do something very similar.
 
So the questions left to be answered are how the custom filter would be injected into the gRPC client, and whether you agree that there is a structural similarity between the two use cases. The custom filter in your case uses keys in the incoming request (instead of a cookie) to figure out the cluster and the endpoint - and possibly also interacts with an external component to obtain the keys->shard and shard->endpoint mappings. Based on the answers to these questions we can discuss further.

In the first iteration of this work, we'll ask clients to add a custom interceptor that adds the necessary shard identifier to the request context. It would be ideal if that could be done directly via built-in interceptors and configured via an xDS extension, but I'd rather concentrate on the shard -> endpoint mapping for now.

Antoine.