Re: [kubernetes/community] Design for consistent API chunking in Kubernetes (#896)

Kubernetes Submit Queue

Aug 29, 2017, 4:02:02 PM

to kubernetes/community, k8s-mirror-api-machinery-api-reviews, Team mention

Automatic merge from submit-queue

Kubernetes Submit Queue

Aug 29, 2017, 4:02:06 PM

to kubernetes/community, k8s-mirror-api-machinery-api-reviews, Team mention

Merged #896.

Daniel Smith

Aug 30, 2017, 8:10:08 PM

to kubernetes/community, k8s-mirror-api-machinery-api-reviews, Team mention

@lavalamp commented on this pull request.

Sorry, late to the party. A few thoughts. No major complaints.

In contributors/design-proposals/api-chunking.md:

> +  "metadata": {"resourceVersion": "147"},
+  "items": [
+     // no more than 500 items
+   ]
+}
+```
+
+The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range. 
+
+The continue token is not required to encode other filtering parameters present on the initial request, and clients may alter their filter parameters on subsequent chunk reads. However, the server implementation **may** reject such changes with a `400 Bad Request` error, and clients should consider this behavior undefined and left to future clarification. Chunking is intended to return consistent lists, and clients **should not** alter their filter parameters on subsequent chunk reads.
+
+If the resource version parameter specified on the request is inconsistent with the `continue` token, the server **must** reject the request with a `400 Bad Request` error.
+
+The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients **must** consider the continue token as opaque. Server implementations **should** ensure that continue tokens can persist across server restarts and across upgrades.
+
+Servers **may** return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server **may** return zero results with a valid `continue` token. A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned. A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned. If the server does not return a `continue` token, the server **must** return all remaining results. The server **may** return zero results with no `continue` token on the last call.

I understand why you have it this way, but does it prevent us from moving the filtering step into the db layer in the future? I guess not. But it is an implementation detail leaking through.

In contributors/design-proposals/api-chunking.md:

> +}
+```
+
+The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range. 
+
+The continue token is not required to encode other filtering parameters present on the initial request, and clients may alter their filter parameters on subsequent chunk reads. However, the server implementation **may** reject such changes with a `400 Bad Request` error, and clients should consider this behavior undefined and left to future clarification. Chunking is intended to return consistent lists, and clients **should not** alter their filter parameters on subsequent chunk reads.
+
+If the resource version parameter specified on the request is inconsistent with the `continue` token, the server **must** reject the request with a `400 Bad Request` error.
+
+The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients **must** consider the continue token as opaque. Server implementations **should** ensure that continue tokens can persist across server restarts and across upgrades.
+
+Servers **may** return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server **may** return zero results with a valid `continue` token. A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned. A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned. If the server does not return a `continue` token, the server **must** return all remaining results. The server **may** return zero results with no `continue` token on the last call.
+
+The server **may** limit the amount of time a continue token is valid for. Clients **should** assume continue tokens last only a few minutes.
+
+The server **must** support `continue` tokens that are valid across multiple API servers. The server **must** support a mechanism for rolling restart such that continue tokens are valid after one or all API servers have been restarted.

And storage backend restarts.

In contributors/design-proposals/api-chunking.md:

> +     // no more than 500 items
+   ]
+}
+```
+
+The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range. 
+
+The continue token is not required to encode other filtering parameters present on the initial request, and clients may alter their filter parameters on subsequent chunk reads. However, the server implementation **may** reject such changes with a `400 Bad Request` error, and clients should consider this behavior undefined and left to future clarification. Chunking is intended to return consistent lists, and clients **should not** alter their filter parameters on subsequent chunk reads.
+
+If the resource version parameter specified on the request is inconsistent with the `continue` token, the server **must** reject the request with a `400 Bad Request` error.
+
+The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients **must** consider the continue token as opaque. Server implementations **should** ensure that continue tokens can persist across server restarts and across upgrades.
+
+Servers **may** return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server **may** return zero results with a valid `continue` token. A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned. A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned. If the server does not return a `continue` token, the server **must** return all remaining results. The server **may** return zero results with no `continue` token on the last call.
+
+The server **may** limit the amount of time a continue token is valid for. Clients **should** assume continue tokens last only a few minutes.

Servers must give a clear error on expired continue tokens.

What happens if the collection is large enough that the last page has always expired by the time the client gets there?

In contributors/design-proposals/api-chunking.md:

> +The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients **must** consider the continue token as opaque. Server implementations **should** ensure that continue tokens can persist across server restarts and across upgrades.
+
+Servers **may** return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server **may** return zero results with a valid `continue` token. A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned. A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned. If the server does not return a `continue` token, the server **must** return all remaining results. The server **may** return zero results with no `continue` token on the last call.
+
+The server **may** limit the amount of time a continue token is valid for. Clients **should** assume continue tokens last only a few minutes.
+
+The server **must** support `continue` tokens that are valid across multiple API servers. The server **must** support a mechanism for rolling restart such that continue tokens are valid after one or all API servers have been restarted.
+
+
+### Proposed Implementations
+
+etcd3 is the primary Kubernetes store and has been designed to support consistent range reads in chunks for this use case. The etcd3 store is an ordered map of keys to values, and Kubernetes places all keys within a resource type under a common prefix, with namespaces being a further prefix of those keys. A read of all keys within a resource type is an in-order scan of the etcd3 map, and therefore we can retrieve in chunks by defining a start key for the next chunk that skips the last key read.
+
+etcd2 will not be supported as it has no option to perform a consistent read and is on track to be deprecated in Kubernetes.  Other databases that might back Kubernetes could either choose to not implement limiting, or leverage their own transactional characteristics to return a consistent list. In the near term our primary store remains etcd3 which can provide this capability at low complexity.
+
+Implementations that cannot offer consistent ranging (returning a set of results that are logically equivalent to receiving all results in one response) must not allow continuation, because consistent listing is a requirement of the Kubernetes API list and watch pattern.

+1, we should come up with a stress test for this.
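One possible shape for such a test, as a minimal sketch: the quoted requirement says a chunked read must be logically equivalent to receiving all results in one response, so the check below pins a revision, reads it in chunks while unrelated writes land, and compares against a single read at the same revision. `MVCCStore` and `chunked_read_under_writes` are hypothetical names, the store is a toy in-memory stand-in for etcd3, and compaction is not modeled.

```python
import random


class MVCCStore:
    """Toy multi-version store: every write is recorded under a new revision."""

    def __init__(self):
        self.rev = 0
        self.history = {}  # key -> list of (revision, value); None marks a delete

    def put(self, key, value):
        self.rev += 1
        self.history.setdefault(key, []).append((self.rev, value))

    def range(self, rev, start="", limit=None):
        """Return (items, next_start) as seen at revision `rev`, in key order."""
        live = []
        for key in sorted(self.history):
            if key < start:
                continue
            value = None
            # Versions are appended in revision order, so the last match wins.
            for r, v in self.history[key]:
                if r <= rev:
                    value = v
            if value is not None:
                live.append((key, value))
        if limit is None or len(live) <= limit:
            return live, None
        # More results remain: the next chunk starts at the first key not returned.
        return live[:limit], live[limit][0]


def chunked_read_under_writes(store, rev, limit, rng):
    """Read revision `rev` in chunks, interleaving a random write after each chunk."""
    items, start = [], ""
    while start is not None:
        page, start = store.range(rev, start, limit)
        items.extend(page)
        key = "k%03d" % rng.randrange(50)
        store.put(key, rng.choice([None, rng.randrange(1000)]))
    return items
```

A real stress test would run against an actual apiserver and etcd3 with concurrent writers; the sketch only shows the invariant being checked.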

In contributors/design-proposals/api-chunking.md:

> +
+etcd3 is the primary Kubernetes store and has been designed to support consistent range reads in chunks for this use case. The etcd3 store is an ordered map of keys to values, and Kubernetes places all keys within a resource type under a common prefix, with namespaces being a further prefix of those keys. A read of all keys within a resource type is an in-order scan of the etcd3 map, and therefore we can retrieve in chunks by defining a start key for the next chunk that skips the last key read.
+
+etcd2 will not be supported as it has no option to perform a consistent read and is on track to be deprecated in Kubernetes. Other databases that might back Kubernetes could either choose to not implement limiting, or leverage their own transactional characteristics to return a consistent list. In the near term our primary store remains etcd3 which can provide this capability at low complexity.
+
+Implementations that cannot offer consistent ranging (returning a set of results that are logically equivalent to receiving all results in one response) must not allow continuation, because consistent listing is a requirement of the Kubernetes API list and watch pattern.
+
+#### etcd3
+
+For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.
+
+The storage layer in the apiserver must apply consistency checking to the provided continue token to ensure that malicious users cannot trick the server into serving results outside of its range. The storage layer must perform defensive checking on the provided value, check for path traversal attacks, and have stable versioning for the continue token.
+
+#### Possible SQL database implementation
+
+A SQL database backing a Kubernetes server would need to implement a consistent snapshot read of an entire resource type, plus support changefeed style updates in order to implement the WATCH primitive. A likely implementation in SQL would be a table that stores multiple versions of each object, ordered by key and version, and filters out all historical versions of an object. A consistent paged list over such a table might be similar to:

Wow, do you really expect anyone to try this? Does it really need to be in this proposal?

In contributors/design-proposals/api-chunking.md:

> +For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the  storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.
+
+The storage layer in the apiserver must apply consistency checking to the provided continue token to ensure that malicious users cannot trick the server into serving results outside of its range. The storage layer must perform defensive checking on the provided value, check for path traversal attacks, and have stable versioning for the continue token.
+
+#### Possible SQL database implementation
+
+A SQL database backing a Kubernetes server would need to implement a consistent snapshot read of an entire resource type, plus support changefeed style updates in order to implement the WATCH primitive. A likely implementation in SQL would be a table that stores multiple versions of each object, ordered by key and version, and filters out all historical versions of an object. A consistent paged list over such a table might be similar to:
+
+    SELECT * FROM resource_type WHERE resourceVersion < ? AND deleted = false AND namespace > ? AND name > ? LIMIT ? ORDER BY namespace, name ASC
+
+where `namespace` and `name` are part of the continuation token and an index exists over `(namespace, name, resourceVersion, deleted)` that makes the range query performant. The highest returned resource version row for each `(namespace, name)` tuple would be returned.
+
+
+### Security implications of returning last or next key in the continue token
+
+If the continue token encodes the next key in the range, that key may expose info that is considered security sensitive, whether simply the name or namespace of resources not under the current tenant's control, or more seriously the name of a resource which is also a shared secret (for example, an access token stored as a kubernetes resource). There are a number of approaches to mitigating this impact:

The keys of all the things the user actually got are present in the list; I don't know if I follow why exposing the next one would be problematic.

In contributors/design-proposals/api-chunking.md:

> +#### Possible SQL database implementation
+
+A SQL database backing a Kubernetes server would need to implement a consistent snapshot read of an entire resource type, plus support changefeed style updates in order to implement the WATCH primitive. A likely implementation in SQL would be a table that stores multiple versions of each object, ordered by key and version, and filters out all historical versions of an object. A consistent paged list over such a table might be similar to:
+
+    SELECT * FROM resource_type WHERE resourceVersion < ? AND deleted = false AND namespace > ? AND name > ? LIMIT ? ORDER BY namespace, name ASC
+
+where `namespace` and `name` are part of the continuation token and an index exists over `(namespace, name, resourceVersion, deleted)` that makes the range query performant. The highest returned resource version row for each `(namespace, name)` tuple would be returned.
+
+
+### Security implications of returning last or next key in the continue token
+
+If the continue token encodes the next key in the range, that key may expose info that is considered security sensitive, whether simply the name or namespace of resources not under the current tenant's control, or more seriously the name of a resource which is also a shared secret (for example, an access token stored as a kubernetes resource). There are a number of approaches to mitigating this impact:
+
+1. Disable chunking on specific resources
+2. Disable chunking when the user does not have permission to view all resources within a range
+3. Encrypt the next key or the continue token using a shared secret across all API servers

Generate a random string as the continuation token; a shared map maps from that string to the actual data structure.
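A minimal sketch of this suggestion, under stated assumptions: `TokenStore` is a hypothetical name, and the in-process dictionary stands in for whatever shared map the API servers would actually use (the proposal requires tokens to be valid across multiple API servers, so a per-process map would not be enough in practice). The five-minute lifetime is only an illustrative default.

```python
import secrets
import time


class TokenStore:
    """Hypothetical map from a random token to the real continuation state."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # token -> (expires_at, (resource_version, next_key))

    def issue(self, resource_version, next_key):
        # The token is pure randomness, so it reveals nothing about the next key.
        token = secrets.token_urlsafe(32)
        self._entries[token] = (self._clock() + self._ttl, (resource_version, next_key))
        return token

    def resolve(self, token):
        entry = self._entries.get(token)
        if entry is None:
            raise KeyError("unknown continue token")
        expires_at, state = entry
        if self._clock() >= expires_at:
            del self._entries[token]
            raise KeyError("expired continue token")
        return state
```

The trade-off is that the server now holds state per outstanding list, whereas a self-describing token keeps the server stateless.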

In contributors/design-proposals/api-chunking.md:

> +
+A SQL database backing a Kubernetes server would need to implement a consistent snapshot read of an entire resource type, plus support changefeed style updates in order to implement the WATCH primitive. A likely implementation in SQL would be a table that stores multiple versions of each object, ordered by key and version, and filters out all historical versions of an object. A consistent paged list over such a table might be similar to:
+
+    SELECT * FROM resource_type WHERE resourceVersion < ? AND deleted = false AND namespace > ? AND name > ? LIMIT ? ORDER BY namespace, name ASC
+
+where `namespace` and `name` are part of the continuation token and an index exists over `(namespace, name, resourceVersion, deleted)` that makes the range query performant. The highest returned resource version row for each `(namespace, name)` tuple would be returned.
+
+
+### Security implications of returning last or next key in the continue token
+
+If the continue token encodes the next key in the range, that key may expose info that is considered security sensitive, whether simply the name or namespace of resources not under the current tenant's control, or more seriously the name of a resource which is also a shared secret (for example, an access token stored as a kubernetes resource). There are a number of approaches to mitigating this impact:
+
+1. Disable chunking on specific resources
+2. Disable chunking when the user does not have permission to view all resources within a range
+3. Encrypt the next key or the continue token using a shared secret across all API servers
+4. When chunking, continue reading until the next visible start key is located after filtering, so that start keys are always keys the user has access to.

If you're worried only about construction and not about making it opaque, you could sign or HMAC the key to prevent users from constructing them.
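A minimal sketch of the signing approach, assuming a secret shared by all API servers. `sign_token`, `verify_token`, and the JSON field names are hypothetical, not the format Kubernetes uses. As the comment says, this prevents construction only: anyone who base64-decodes the token can still read the key.

```python
import base64
import hashlib
import hmac
import json


def sign_token(secret: bytes, resource_version: int, start_key: str) -> str:
    """Serialize the continuation state and prepend an HMAC-SHA256 tag."""
    payload = json.dumps(
        {"v": 1, "rv": resource_version, "start": start_key}, sort_keys=True
    ).encode()
    mac = hmac.new(secret, payload, hashlib.sha256).digest()  # 32 bytes
    return base64.urlsafe_b64encode(mac + payload).decode()


def verify_token(secret: bytes, token: str):
    """Return (resource_version, start_key), or raise ValueError if forged or malformed."""
    try:
        raw = base64.urlsafe_b64decode(token.encode("ascii"))
    except ValueError as err:
        raise ValueError("malformed continue token") from err
    mac, payload = raw[:32], raw[32:]
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many tag bytes matched.
    if not hmac.compare_digest(mac, expected):
        raise ValueError("continue token signature mismatch")
    data = json.loads(payload)
    return data["rv"], data["start"]
```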

In contributors/design-proposals/api-chunking.md:

> +}
+```
+
+Some clients may wish to follow a failed paged list with a full list attempt.
+
+The 5 minute default compaction interval for etcd3 bounds how long a list can run.  Since clients may wish to perform processing over very large sets, increasing that timeout may make sense for large clusters. It should be possible to alter the interval at which compaction runs to accommodate larger clusters.
+
+
+#### Types of clients and impact
+
+Some clients such as controllers, receiving a 410 error, may instead wish to perform a full LIST without chunking.
+
+* Controllers with full caches
+  * Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results
+* `kubectl get`
+  * Most administrators would probably prefer to see a very large set with some inconsistency rather than no results (due to a timeout under load).  They would likely be ok with handling `410 ResourceExpired` as "continue from the last key I processed"

Doesn't etcd's ordering potentially change with each update? I would expect lots of duplicates / omissions in this case.

In contributors/design-proposals/api-chunking.md:

> +
+The 5 minute default compaction interval for etcd3 bounds how long a list can run.  Since clients may wish to perform processing over very large sets, increasing that timeout may make sense for large clusters. It should be possible to alter the interval at which compaction runs to accommodate larger clusters.
+
+
+#### Types of clients and impact
+
+Some clients such as controllers, receiving a 410 error, may instead wish to perform a full LIST without chunking.
+
+* Controllers with full caches
+  * Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results
+* `kubectl get`
+  * Most administrators would probably prefer to see a very large set with some inconsistency rather than no results (due to a timeout under load).  They would likely be ok with handling `410 ResourceExpired` as "continue from the last key I processed"
+* Migration style commands
+  * Assuming a migration command has to run on the full data set (to upgrade a resource from json to protobuf, or to check a large set of resources for errors) and is performing some expensive calculation on each, very large sets may not complete over the server expiration window.
+
+For clients that do not care about consistency, the server **may** return a `continue` value on the `ResourceExpired` error that allows the client to restart from the same prefix key, but using the latest resource version.  This would allow clients that do not require a fully consistent LIST to opt in to partially consistent LISTs but still be able to scan the entire working set. It is likely this could be a sub field (opaque data) of the `Status` response under `statusDetails`.

Are etcd responses sorted by key? For some reason I thought it was by position in the etcd db file. @jpbetz

In contributors/design-proposals/api-chunking.md:

> +
+Since the goal is to reduce spikiness of load, the standard API rate limiter might prefer to rate limit page requests differently from global lists, allowing full LISTs only slowly while smaller pages can proceed more quickly.
+
+
+### Chunk by default?
+
+On a very large data set, chunking trades total memory allocated in etcd, the apiserver, and the client for higher overhead per request (request/response processing, authentication, authorization). Picking a sufficiently high chunk value like 500 or 1000 would not impact smaller clusters, but would reduce the peak memory load of a very large cluster (10k resources and up). In testing, no significant overhead was shown in etcd3 for a paged historical query which is expected since the etcd3 store is an MVCC store and must always filter some values to serve a list.
+
+For clients that must perform sequential processing of lists (kubectl get, migration commands) this change dramatically improves initial latency - clients got their first chunk of data in milliseconds, rather than seconds for the full set. It also improves user experience for web consoles that may be accessed by administrators with access to large parts of the system.
+
+It is recommended that most clients attempt to page by default at a large page size (500 or 1000) and gracefully degrade to not chunking.
+
+
+### Other solutions
+
+Compression from the apiserver and between the apiserver and etcd can reduce total network bandwidth, but cannot reduce the peak CPU and memory used inside the client, apiserver, or etcd processes.

Why can't apiserver do stream processing?

Clayton Coleman

Aug 31, 2017, 7:44:29 PM

to kubernetes/community, k8s-mirror-api-machinery-api-reviews, Team mention

@smarterclayton commented on this pull request.

In contributors/design-proposals/api-chunking.md:

> +  "metadata": {"resourceVersion": "147"},
+  "items": [
+     // no more than 500 items
+   ]
+}
+```
+
+The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range. 
+
+The continue token is not required to encode other filtering parameters present on the initial request, and clients may alter their filter parameters on subsequent chunk reads. However, the server implementation **may** reject such changes with a `400 Bad Request` error, and clients should consider this behavior undefined and left to future clarification. Chunking is intended to return consistent lists, and clients **should not** alter their filter parameters on subsequent chunk reads.
+
+If the resource version parameter specified on the request is inconsistent with the `continue` token, the server **must** reject the request with a `400 Bad Request` error.
+
+The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients **must** consider the continue token as opaque. Server implementations **should** ensure that continue tokens can persist across server restarts and across upgrades.
+
+Servers **may** return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server **may** return zero results with a valid `continue` token. A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned. A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned. If the server does not return a `continue` token, the server **must** return all remaining results. The server **may** return zero results with no `continue` token on the last call.

No, it doesn't. In fact, filtering at the DB level should be both opaque and possible: the continue token is generated by storage and surfaced all the way up, and filtering at the DB would simply identify the appropriate next key.
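To make the contract concrete, here is a minimal sketch in which the storage layer both filters and picks the next key. `list_chunk` and `list_all` are hypothetical names, a plain dict stands in for the store, and the bare next key stands in for a real continue token (which would also carry a resource version). The chunk function scans at most `limit` keys, so it can return fewer than `limit` items, or none, while still handing back a token; the client loops on the token alone.

```python
import bisect


def list_chunk(store, limit, predicate, start_key=""):
    """Scan up to `limit` keys from `start_key`; return (matching items, next key or None)."""
    keys = sorted(store)
    i = bisect.bisect_left(keys, start_key)  # start_key is inclusive
    items = []
    scanned = 0
    while i < len(keys) and scanned < limit:
        key = keys[i]
        if predicate(store[key]):
            items.append((key, store[key]))
        i += 1
        scanned += 1
    # The storage layer decides where the next chunk begins.
    next_key = keys[i] if i < len(keys) else None
    return items, next_key


def list_all(store, limit, predicate):
    """Client loop: only the presence of a token says whether more results exist."""
    out, token = [], ""
    while token is not None:
        items, token = list_chunk(store, limit, predicate, token)
        out.extend(items)
    return out
```

A store that filters natively could instead scan until it has `limit` matches; the client loop would not change, which is the sense in which the detail stays opaque.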

Clayton Coleman

Aug 31, 2017, 7:45:36 PM

to kubernetes/community, k8s-mirror-api-machinery-api-reviews, Team mention

@smarterclayton commented on this pull request.

In contributors/design-proposals/api-chunking.md:

> +     // no more than 500 items
+   ]
+}
+```
+
+The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range. 
+
+The continue token is not required to encode other filtering parameters present on the initial request, and clients may alter their filter parameters on subsequent chunk reads. However, the server implementation **may** reject such changes with a `400 Bad Request` error, and clients should consider this behavior undefined and left to future clarification. Chunking is intended to return consistent lists, and clients **should not** alter their filter parameters on subsequent chunk reads.
+
+If the resource version parameter specified on the request is inconsistent with the `continue` token, the server **must** reject the request with a `400 Bad Request` error.
+
+The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients **must** consider the continue token as opaque. Server implementations **should** ensure that continue tokens can persist across server restarts and across upgrades.
+
+Servers **may** return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server **may** return zero results with a valid `continue` token. A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned. A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned. If the server does not return a `continue` token, the server **must** return all remaining results. The server **may** return zero results with no `continue` token on the last call.
+
+The server **may** limit the amount of time a continue token is valid for. Clients **should** assume continue tokens last only a few minutes.

They do: `410 ResourceExpired`.

For that case, the default pager implementation falls back to an unpaged list. On our largest pathological data sets (1.2 GB at rest, 20 GB in memory) I was able to read 100k secrets (500 MB of data) in chunks of 500 in under 4 seconds.
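The fallback can be sketched roughly as follows. This illustrates the behaviour only and is not the actual pager code: `ResourceExpired`, `paged_list`, and `make_server` are hypothetical names, and the fake server simply starts rejecting tokens after a fixed number of calls.

```python
class ResourceExpired(Exception):
    """Stand-in for a 410 ResourceExpired response."""


def paged_list(list_fn, page_size):
    """Read all items in pages; on an expired token, redo the read as one unpaged list."""
    items, token = [], None
    while True:
        try:
            page, token = list_fn(limit=page_size, continue_token=token)
        except ResourceExpired:
            # Discard the partial pages so the result comes from a single read.
            page, _ = list_fn(limit=None, continue_token=None)
            return page
        items.extend(page)
        if token is None:
            return items


def make_server(data, expire_after_calls):
    """Fake list endpoint whose continue tokens stop working after N calls."""
    calls = {"n": 0}

    def list_fn(limit, continue_token):
        calls["n"] += 1
        if limit is None:
            return list(data), None  # unpaged list always succeeds
        if continue_token is not None and calls["n"] > expire_after_calls:
            raise ResourceExpired()
        start = continue_token or 0
        end = start + limit
        return data[start:end], (end if end < len(data) else None)

    return list_fn
```

For example, `paged_list(make_server(list(range(10)), 2), 3)` reads two pages, hits the expiry on the third request, and then returns the full list from the unpaged call.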

Clayton Coleman

Aug 31, 2017, 7:46:49 PM

to kubernetes/community, k8s-mirror-api-machinery-api-reviews, Team mention

@smarterclayton commented on this pull request.

In contributors/design-proposals/api-chunking.md:

> +
+etcd3 is the primary Kubernetes store and has been designed to support consistent range reads in chunks for this use case. The etcd3 store is an ordered map of keys to values, and Kubernetes places all keys within a resource type under a common prefix, with namespaces being a further prefix of those keys. A read of all keys within a resource type is an in-order scan of the etcd3 map, and therefore we can retrieve in chunks by defining a start key for the next chunk that skips the last key read.
+
+etcd2 will not be supported as it has no option to perform a consistent read and is on track to be deprecated in Kubernetes. Other databases that might back Kubernetes could either choose to not implement limiting, or leverage their own transactional characteristics to return a consistent list. In the near term our primary store remains etcd3 which can provide this capability at low complexity.
+
+Implementations that cannot offer consistent ranging (returning a set of results that are logically equivalent to receiving all results in one response) must not allow continuation, because consistent listing is a requirement of the Kubernetes API list and watch pattern.
+
+#### etcd3
+
+For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.
+
+The storage layer in the apiserver must apply consistency checking to the provided continue token to ensure that malicious users cannot trick the server into serving results outside of its range. The storage layer must perform defensive checking on the provided value, check for path traversal attacks, and have stable versioning for the continue token.
+
+#### Possible SQL database implementation
+
+A SQL database backing a Kubernetes server would need to implement a consistent snapshot read of an entire resource type, plus support changefeed style updates in order to implement the WATCH primitive. A likely implementation in SQL would be a table that stores multiple versions of each object, ordered by key and version, and filters out all historical versions of an object. A consistent paged list over such a table might be similar to:

I just want to be really clear on the equivalence between an MVCC store like etcd and what someone who really believes they want an alternate backend would need to provide. At our scales it doesn't look like it's necessary anymore, but I want to be able to point back to a clear consideration.
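To make that equivalence concrete, here is a minimal sketch of a consistent paged list over a multi-version table, using Python's built-in `sqlite3`. The table name `resources`, its columns, and the helpers `open_store`, `write`, and `list_page` are all assumptions made for illustration; nothing here is taken from the proposal or from a real Kubernetes backend.

```python
import sqlite3

# Hypothetical schema: one row per (namespace, name, resource_version);
# a delete is recorded as a new row with deleted = 1.
SCHEMA = """
CREATE TABLE resources (
    namespace TEXT NOT NULL,
    name TEXT NOT NULL,
    resource_version INTEGER NOT NULL,
    deleted INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (namespace, name, resource_version)
)
"""

# For each object, pick the newest row at or below the snapshot, drop it if
# that row is a delete marker, and resume strictly after the cursor tuple.
PAGE_QUERY = """
SELECT r.namespace, r.name, r.resource_version
FROM resources AS r
WHERE r.resource_version = (
        SELECT MAX(h.resource_version)
        FROM resources AS h
        WHERE h.namespace = r.namespace
          AND h.name = r.name
          AND h.resource_version <= :snapshot
      )
  AND r.deleted = 0
  AND (r.namespace > :ns OR (r.namespace = :ns AND r.name > :name))
ORDER BY r.namespace, r.name
LIMIT :limit
"""


def open_store():
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    return db


def write(db, namespace, name, resource_version, deleted=False):
    db.execute(
        "INSERT INTO resources VALUES (?, ?, ?, ?)",
        (namespace, name, resource_version, int(deleted)),
    )


def list_page(db, snapshot, limit, after=("", "")):
    """Return (rows, cursor); cursor is None once the list is complete.

    Assumes limit >= 1. One extra row is fetched to detect a further page.
    """
    params = {"snapshot": snapshot, "ns": after[0], "name": after[1],
              "limit": limit + 1}
    rows = db.execute(PAGE_QUERY, params).fetchall()
    page = rows[:limit]
    cursor = (page[-1][0], page[-1][1]) if len(rows) > limit else None
    return page, cursor
```

In this sketch the snapshot version and the `(namespace, name)` cursor together play the role of the continue token: writes that land between page reads carry a higher version and are invisible to the rest of the list. Selecting the newest row first and only then testing `deleted` keeps an object that was deleted before the snapshot from reappearing through an older row.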

Clayton Coleman

Aug 31, 2017, 7:48:02 PM

to kubernetes/community, k8s-mirror-api-machinery-api-reviews, Team mention

@smarterclayton commented on this pull request.

In contributors/design-proposals/api-chunking.md:

> +For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the  storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.
+
+The storage layer in the apiserver must apply consistency checking to the provided continue token to ensure that malicious users cannot trick the server into serving results outside of its range. The storage layer must perform defensive checking on the provided value, check for path traversal attacks, and have stable versioning for the continue token.
+
+#### Possible SQL database implementation
+
+A SQL database backing a Kubernetes server would need to implement a consistent snapshot read of an entire resource type, plus support changefeed style updates in order to implement the WATCH primitive. A likely implementation in SQL would be a table that stores multiple versions of each object, ordered by key and version, and filters out all historical versions of an object. A consistent paged list over such a table might be similar to:
+
+    SELECT * FROM resource_type WHERE resourceVersion < ? AND deleted = false AND namespace > ? AND name > ? LIMIT ? ORDER BY namespace, name ASC
+
+where `namespace` and `name` are part of the continuation token and an index exists over `(namespace, name, resourceVersion, deleted)` that makes the range query performant. The highest returned resource version row for each `(namespace, name)` tuple would be returned.
+
+
+### Security implications of returning last or next key in the continue token
+
+If the continue token encodes the next key in the range, that key may expose info that is considered security sensitive, whether simply the name or namespace of resources not under the current tenant's control, or more seriously the name of a resource which is also a shared secret (for example, an access token stored as a kubernetes resource). There are a number of approaches to mitigating this impact:

If we filter by ACL on a subset (user A can see namespaces 1, 3, and 10) and the last page of 3 extends into 4, then unless we continue reading until the next visible key, we would return a startKey of `4/somename`, which would expose both the presence of 4 and `somename` to a user who shouldn't know that 4 exists.
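A minimal sketch of that leak, assuming keys of the form `namespace/name` held in a sorted list and a hypothetical `list_page` helper (the zero-padded namespace names are only there to keep the lexicographic order obvious):

```python
from bisect import bisect_left


def namespace_of(key):
    return key.split("/", 1)[0]


def list_page(keys, visible_namespaces, start_key, limit, skip_hidden):
    """Scan sorted keys from start_key and return (items, next_key).

    Only keys in visible namespaces are returned. With skip_hidden=False the
    next key is simply whatever follows the last key read, visible or not.
    With skip_hidden=True the scan continues to the next visible key.
    """
    items = []
    i = bisect_left(keys, start_key)
    while i < len(keys) and len(items) < limit:
        if namespace_of(keys[i]) in visible_namespaces:
            items.append(keys[i])
        i += 1
    if skip_hidden:
        while i < len(keys) and namespace_of(keys[i]) not in visible_namespaces:
            i += 1
    next_key = keys[i] if i < len(keys) else None
    return items, next_key


keys = ["01/a", "03/a", "03/b", "04/somename", "10/x"]
visible = {"01", "03", "10"}

# Naive continuation: the token would carry a key the user may not see.
_, leaked = list_page(keys, visible, "", 3, skip_hidden=False)

# Reading ahead to the next visible key avoids exposing namespace 04.
_, safe = list_page(keys, visible, "", 3, skip_hidden=True)
```

Here `leaked` is `04/somename` while `safe` is `10/x`. Reading ahead costs extra scanning when long hidden ranges sit between visible namespaces, which is part of why the other mitigations are worth weighing.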

Clayton Coleman

Aug 31, 2017, 7:48:22 PM

to kubernetes/community, k8s-mirror-api-machinery-api-reviews, Team mention

@smarterclayton commented on this pull request.

In contributors/design-proposals/api-chunking.md:

> +#### Possible SQL database implementation
+
+A SQL database backing a Kubernetes server would need to implement a consistent snapshot read of an entire resource type, plus support changefeed style updates in order to implement the WATCH primitive. A likely implementation in SQL would be a table that stores multiple versions of each object, ordered by key and version, and filters out all historical versions of an object. A consistent paged list over such a table might be similar to:
+
+    SELECT * FROM resource_type WHERE resourceVersion < ? AND deleted = false AND namespace > ? AND name > ? LIMIT ? ORDER BY namespace, name ASC
+
+where `namespace` and `name` are part of the continuation token and an index exists over `(namespace, name, resourceVersion, deleted)` that makes the range query performant. The highest returned resource version row for each `(namespace, name)` tuple would be returned.
+
+
+### Security implications of returning last or next key in the continue token
+
+If the continue token encodes the next key in the range, that key may expose info that is considered security sensitive, whether simply the name or namespace of resources not under the current tenant's control, or more seriously the name of a resource which is also a shared secret (for example, an access token stored as a kubernetes resource). There are a number of approaches to mitigating this impact:
+
+1. Disable chunking on specific resources
+2. Disable chunking when the user does not have permission to view all resources within a range
+3. Encrypt the next key or the continue token using a shared secret across all API servers

Wouldn't work across api servers unless persisted to etcd.

Clayton Coleman

Aug 31, 2017, 7:48:30 PM

to kubernetes/community, k8s-mirror-api-machinery-api-reviews, Team mention

@smarterclayton commented on this pull request.

In contributors/design-proposals/api-chunking.md:

> +}
+```
+
+Some clients may wish to follow a failed paged list with a full list attempt.
+
+The 5 minute default compaction interval for etcd3 bounds how long a list can run.  Since clients may wish to perform processing over very large sets, increasing that timeout may make sense for large clusters. It should be possible to alter the interval at which compaction runs to accomodate larger clusters.
+
+
+#### Types of clients and impact
+
+Some clients such as controllers, receiving a 410 error, may instead wish to perform a full LIST without chunking.
+
+* Controllers with full caches
+  * Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results
+* `kubectl get`
+  * Most administrators would probably prefer to see a very large set with some inconsistency rather than no results (due to a timeout under load).  They would likely be ok with handling `410 ResourceExpired` as "continue from the last key I processed"

It does not. Etcd guarantees key order.

Clayton Coleman

Aug 31, 2017, 7:48:47 PM

to kubernetes/community, k8s-mirror-api-machinery-api-reviews, Team mention

@smarterclayton commented on this pull request.

In contributors/design-proposals/api-chunking.md:

> +
+The 5 minute default compaction interval for etcd3 bounds how long a list can run.  Since clients may wish to perform processing over very large sets, increasing that timeout may make sense for large clusters. It should be possible to alter the interval at which compaction runs to accomodate larger clusters.
+
+
+#### Types of clients and impact
+
+Some clients such as controllers, receiving a 410 error, may instead wish to perform a full LIST without chunking.
+
+* Controllers with full caches
+  * Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results
+* `kubectl get`
+  * Most administrators would probably prefer to see a very large set with some inconsistency rather than no results (due to a timeout under load).  They would likely be ok with handling `410 ResourceExpired` as "continue from the last key I processed"
+* Migration style commands
+  * Assuming a migration command has to run on the full data set (to upgrade a resource from json to protobuf, or to check a large set of resources for errors) and is performing some expensive calculation on each, very large sets may not complete over the server expiration window.
+
+For clients that do not care about consistency, the server **may** return a `continue` value on the `ResourceExpired` error that allows the client to restart from the same prefix key, but using the latest resource version.  This would allow clients that do not require a fully consistent LIST to opt in to partially consistent LISTs but still be able to scan the entire working set. It is likely this could be a sub field (opaque data) of the `Status` response under `statusDetails`.

Etcd is a sorted map in lexicographic key order.
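Because the order is lexicographic, "continue from the last key I processed" can be expressed as a range read that starts just past that key. A minimal sketch, assuming an in-memory sorted list stands in for the store and that appending a zero byte to the last key gives the smallest key that sorts after it; `range_read` and the example key layout are hypothetical:

```python
from bisect import bisect_left


def range_read(sorted_keys, prefix, start_key, limit):
    """Read up to `limit` keys under `prefix`, beginning at `start_key`.

    Returns (page, next_start). next_start is None when nothing remains,
    otherwise the last returned key plus a zero byte, which sorts
    immediately after that key in lexicographic order.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    begin = bisect_left(sorted_keys, start_key or prefix)
    page = []
    next_start = None
    for key in sorted_keys[begin:]:
        if not key.startswith(prefix):
            break  # left the resource type's key range
        if len(page) == limit:
            next_start = page[-1] + "\x00"
            break
        page.append(key)
    return page, next_start


# First page is read from an older snapshot of the key space.
old = sorted(["/pods/ns1/a", "/pods/ns1/b", "/pods/ns2/c", "/services/ns1/s"])
first, resume_at = range_read(old, "/pods/", None, 2)

# The old snapshot has expired; resume from the same key on the latest data.
latest = sorted(old + ["/pods/ns1/aa", "/pods/ns2/d"])
second, done = range_read(latest, "/pods/", resume_at, 2)
```

In this example the resumed scan picks up `/pods/ns2/d`, which was created after the first page, but never sees `/pods/ns1/aa`, which sorts before the resume point. No key is returned twice and the scan still terminates, which is the partially consistent behaviour described for clients that opt in.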

Clayton Coleman

Aug 31, 2017, 7:49:14 PM

to kubernetes/community, k8s-mirror-api-machinery-api-reviews, Team mention

@smarterclayton commented on this pull request.

In contributors/design-proposals/api-chunking.md:

> +
+Since the goal is to reduce spikiness of load, the standard API rate limiter might prefer to rate limit page requests differently from global lists, allowing full LISTs only slowly while smaller pages can proceed more quickly.
+
+
+### Chunk by default?
+
+On a very large data set, chunking trades total memory allocated in etcd, the apiserver, and the client for higher overhead per request (request/response processing, authentication, authorization). Picking a sufficiently high chunk value like 500 or 1000 would not impact smaller clusters, but would reduce the peak memory load of a very large cluster (10k resources and up). In testing, no significant overhead was shown in etcd3 for a paged historical query which is expected since the etcd3 store is an MVCC store and must always filter some values to serve a list.
+
+For clients that must perform sequential processing of lists (kubectl get, migration commands) this change dramatically improves initial latency - clients got their first chunk of data in milliseconds, rather than seconds for the full set. It also improves user experience for web consoles that may be accessed by administrators with access to large parts of the system.
+
+It is recommended that most clients attempt to page by default at a large page size (500 or 1000) and gracefully degrade to not chunking.
+
+
+### Other solutions
+
+Compression from the apiserver and between the apiserver and etcd can reduce total network bandwidth, but cannot reduce the peak CPU and memory used inside the client, apiserver, or etcd processes.

Depends on how you define a stream. It can. Would require a dramatic change to the API shape of LIST.

Brian Grant

Aug 31, 2017, 11:40:00 PM

to kubernetes/community, k8s-mirror-api-machinery-api-reviews, Team mention

Speaking of dramatic changes to APIs, we need to address the Endpoints API eventually.

Marek Siarkowicz

Aug 28, 2023, 9:22:34 AM

to kubernetes/community, k8s-mirror-api-machinery-api-reviews, Team mention

@serathius commented on this pull request.

In contributors/design-proposals/api-chunking.md:

> +The server **must** support `continue` tokens that are valid across multiple API servers. The server **must** support a mechanism for rolling restart such that continue tokens are valid after one or all API servers have been restarted.
+
+
+### Proposed Implementations
+
+etcd3 is the primary Kubernetes store and has been designed to support consistent range reads in chunks for this use case. The etcd3 store is an ordered map of keys to values, and Kubernetes places all keys within a resource type under a common prefix, with namespaces being a further prefix of those keys. A read of all keys within a resource type is an in-order scan of the etcd3 map, and therefore we can retrieve in chunks by defining a start key for the next chunk that skips the last key read.
+
+etcd2 will not be supported as it has no option to perform a consistent read and is on track to be deprecated in Kubernetes. Other databases that might back Kubernetes could either choose to not implement limiting, or leverage their own transactional characteristics to return a consistent list. In the near term our primary store remains etcd3 which can provide this capability at low complexity.
+
+Implementations that cannot offer consistent ranging (returning a set of results that are logically equivalent to receiving all results in one response) must not allow continuation, because consistent listing is a requirement of the Kubernetes API list and watch pattern.
+
+#### etcd3
+
+For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.
+
+The storage layer in the apiserver must apply consistency checking to the provided continue token to ensure that malicious users cannot trick the server into serving results outside of its range. The storage layer must perform defensive checking on the provided value, check for path traversal attacks, and have stable versioning for the continue token.

This cannot happen, as etcd has no notion of a path. A key is just a blob; it is Kubernetes that puts paths in the key.
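Since any path structure is imposed by Kubernetes rather than by etcd, a check of this kind would have to live in the apiserver's handling of the token. A minimal sketch of what such defensive decoding could look like; the token layout (base64-encoded JSON with fields `v`, `rv`, and `start`) and the function names are assumptions for illustration, not the format the proposal specifies:

```python
import base64
import json

TOKEN_VERSION = 1  # hypothetical version tag carried inside the token


def encode_continue(resource_version, next_key, prefix):
    """Build an opaque token holding the snapshot and a prefix-relative key."""
    if not next_key.startswith(prefix):
        raise ValueError("next key is outside the listed prefix")
    payload = {"v": TOKEN_VERSION, "rv": resource_version,
               "start": next_key[len(prefix):]}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()


def decode_continue(token, prefix):
    """Return (resource_version, start_key) or raise ValueError.

    Malformed base64 and malformed JSON both surface as ValueError.
    """
    payload = json.loads(base64.urlsafe_b64decode(token))
    if not isinstance(payload, dict) or payload.get("v") != TOKEN_VERSION:
        raise ValueError("unsupported continue token version")
    rv, start = payload.get("rv"), payload.get("start")
    if isinstance(rv, bool) or not isinstance(rv, int) or rv <= 0:
        raise ValueError("invalid resource version in continue token")
    if not isinstance(start, str) or not start:
        raise ValueError("invalid start key in continue token")
    # The store treats keys as blobs, so reject path-like tricks here:
    # empty, "." and ".." segments could otherwise point outside the prefix
    # once some other layer interprets the key as a path.
    if any(segment in ("", ".", "..") for segment in start.split("/")):
        raise ValueError("start key escapes the listed prefix")
    return rv, prefix + start
```

Storing the start key relative to the prefix means the token cannot name a key under a different resource type at all; the segment check only guards against layers that later treat the key as a path.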
