Informer that caches on disk instead of in-memory

mspre...@gmail.com

Aug 29, 2022, 2:16:53 PM
to K8s API Machinery SIG
I am interested in use cases that strain memory.  One of the big strains on memory is an informer's local cache.  In some of these use cases I think it makes sense to keep the bulk of the cache on disk instead of in memory (perhaps the indices might be OK to keep in memory, not sure about that yet).  The disk-based Indexer could be given a Scheme, so it knows how to serialize/deserialize.  I could almost do this with no changes in-tree, maintaining the disk-based Indexer out-of-tree and using it to construct such an informer.  Sadly, the informer constructor is hard-wired to one particular constructor for its Indexer.  So that brings me to a couple of questions.

1. What do you think of exposing a lower-level informer constructor that takes the Indexer as a parameter, for whatever use cases client developers may have?  (A rough sketch of what I mean appears below.)

2. What do you think of making a disk-based Indexer?  Both as a general idea, and as a proposal for something to maintain in-tree?
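To make question 1 concrete, here is a rough sketch of the kind of constructor I mean.  It does not exist in client-go today; the name and signature are made up, and the body would be essentially the existing NewSharedIndexInformer except that it uses the supplied Indexer instead of constructing one itself (assume the usual imports: time, k8s.io/apimachinery/pkg/runtime, k8s.io/client-go/tools/cache).

// Hypothetical lower-level constructor: accepts a caller-supplied Indexer
// (for example, a disk-backed implementation) instead of always calling
// cache.NewIndexer internally.
func NewSharedIndexInformerWithIndexer(
	lw cache.ListerWatcher,
	exampleObject runtime.Object,
	defaultEventHandlerResyncPeriod time.Duration,
	indexer cache.Indexer, // any cache.Indexer implementation, e.g. disk-backed
) cache.SharedIndexInformer {
	// Same body as cache.NewSharedIndexInformer, except the supplied indexer
	// is used in place of
	// cache.NewIndexer(cache.DeletionHandlingMetaNamespaceKeyFunc, indexers).
	panic("sketch only")
}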

Thanks,
Mike

mspre...@gmail.com

Aug 29, 2022, 3:27:45 PM
to K8s API Machinery SIG
This is also motivated by wanting to support clients that can restart while disconnected from the apiservers.  That would require a bit more change in-tree: introducing something like an option to sync from the local store if no apiserver connection is established within some startup time limit.

Karl Isenberg

Aug 29, 2022, 5:56:21 PM
to mspre...@gmail.com, K8s API Machinery SIG
Caching to disk sounds interesting. We have several addon controllers that have been bitten by the memory bloat from the informer cache. Here are some of the other mitigations I’ve seen:
1. Fork some of the informer code to add filters, to reduce what objects are cached to only the ones cared about. 
2. Add an option to drop the field manager annotation, which is huge and rarely needs to be read or modified by most controllers (see the sketch after this list).
3. Add an option to just store metadata, for controllers that don’t need the spec or status cached. 
4. Add VPA to handle memory increases
5. Ensure objects are being dereferenced aggressively to minimize memory footprint outside of the informers
6. Remove memory limits to avoid oomkills (not a great option, but a tolerable temporary measure)
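Regarding #2, here’s a minimal sketch of what that can look like with SharedInformer.SetTransform from recent client-go releases; the factory setup around it is only illustrative. (And for #3, client-go already ships a metadata-only informer factory in k8s.io/client-go/metadata/metadatainformer.)

package main

import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// Sketch: strip managedFields before objects land in the informer's cache.
// SetTransform must be called before the informer is started.
func podInformerWithoutManagedFields(clientset kubernetes.Interface) error {
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	informer := factory.Core().V1().Pods().Informer()
	return informer.SetTransform(func(obj interface{}) (interface{}, error) {
		if accessor, err := meta.Accessor(obj); err == nil {
			accessor.SetManagedFields(nil)
		}
		return obj, nil
	})
}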

One of the other related issues is that CRDs are sent over the wire as json, not proto, which uses up a lot of memory and CPU for parsing and formatting. If there were some way to make CRDs use proto, that would be more efficient.


Kevin Wiesmueller

Aug 29, 2022, 5:57:17 PM
to mspre...@gmail.com, K8s API Machinery SIG
I'd be curious about the memory savings this provides in our standard implementation.
I've noticed on several occasions, with some informers in high throughput/turnover clusters, that the memory cost of holding the cache is one thing, but the bigger problem is actually the allocations from decoding etc. that put high load on the GC.
So a disk cache implementation could be even more interesting in a non-GC language depending on resource turnover.

Daniel Smith

Aug 29, 2022, 7:26:43 PM
to mspre...@gmail.com, K8s API Machinery SIG
On Mon, Aug 29, 2022 at 12:27 PM mspre...@gmail.com <mspre...@gmail.com> wrote:
> This is also motivated by wanting to support clients that can restart while disconnected from the apiservers.

An approach to this is recorded in this issue: https://github.com/kubernetes/kubernetes/issues/90339

It hasn't been high priority though...
I'm more in favor of idea 1 than 2. The problem with 2 isn't that it's bad or anything; rather, a cluster / collection big enough that controllers have to use this technique is too big for any non-fancy controllers.

Additionally, huge collections are a problem to get to controllers in the first place.

Thus, before investigating this avenue (if we're going to do this much work), I'd want to consider whether the apiserver can break things into shards that do fit in memory and can be transferred to controllers in a reasonable amount of time. (That approach would also be a lot of work and would require some correspondingly large justification, of course.)

Karl Isenberg

Aug 29, 2022, 9:13:37 PM
to Daniel Smith, K8s API Machinery SIG, mspre...@gmail.com
Anything client-side is gonna be much less work and time than a server-side change.

I think it would be worth considering modifying the informer to take optional field filters (fields that don’t get cached or passed to the caller) and optional object meta filters (whole objects that don’t get cached or passed to the caller). These kinds of optimizations would allow for rather significant pruning of what is kept in memory, even if it does still get sent over the wire and parsed client-side before being filtered.

That said, an optional disk cache sounds like a pretty similar amount of work.

mspre...@gmail.com

Aug 31, 2022, 11:48:22 AM
to K8s API Machinery SIG
1. Regarding the client memory mitigation ideas listed by karlis...@google.com:

> Fork some of the informer code to add filters, to reduce what objects are cached to only the ones cared about.

I am not exactly sure what this is saying.  The existing constructor (https://github.com/kubernetes/client-go/blob/v0.25.0/tools/cache/shared_informer.go#L225) is already parameterized by the ListerWatcher, which is an interface, meaning that the client developer can pass a value that does server-side filtering and/or arbitrary client-side filtering.
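For example, a ListerWatcher can ask the server to filter; a minimal sketch (the label selector value is made up for illustration):

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// Sketch: a ListerWatcher that does server-side filtering, handed to the
// existing informer constructor.  The label selector is hypothetical.
func filteredPodInformer(clientset kubernetes.Interface) cache.SharedIndexInformer {
	lw := cache.NewFilteredListWatchFromClient(
		clientset.CoreV1().RESTClient(),
		"pods",
		metav1.NamespaceAll,
		func(options *metav1.ListOptions) {
			options.LabelSelector = "app=example" // hypothetical label
		},
	)
	return cache.NewSharedIndexInformer(lw, &corev1.Pod{}, 0, cache.Indexers{})
}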

> Add an option to drop the field manager annotation which is huge and rarely needing to be read or modified by most controllers
> Add an option to just store metadata, for controllers that don’t need the spec or status cached

The client can supply a ListerWatcher that trims (or even transforms, consistently of course) the objects.

> Ensure objects are being dereferenced aggressively to minimize memory footprint outside of the informers

I am not sure what this one means.  "dereference" usually means to follow a pointer.  Perhaps the idea is something about removing unneeded pointers from somewhere?

> CRDs are sent over the wire as json

As noted, this is a big deal for other reasons too.  It is a sufficiently big deal to discuss on its own.

2. Regarding memory costs of decoding.

Decoding requires holding two copies in memory: the JSON or protobuf bytes received, and the decoded copy.  At some point after decoding is done, the received bytes become unreachable and thus subject to garbage collection.  There are also helper objects created during decoding that become unreachable even before decoding returns.  If you look carefully, garbage collection is somewhat squishy.  One slow and crude kind of squishiness is that the GC can be tuned through runtime configuration.  For another, the behavior of the collector in the Go runtime depends on the combination of all the relevant mutator activity in that process.  And even less obviously: when it releases memory, the Go runtime does _not_ actually itself decrease the memory usage of the process; instead it marks the released pages as ones that the operating system can reclaim, and whether and when the OS does this is another can of worms.  In short, YMMV.
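(For concreteness, the "tuned through runtime configuration" part refers to standard Go runtime knobs, nothing informer-specific.  A tiny sketch:)

import "runtime/debug"

// Standard Go runtime knobs: SetGCPercent (equivalently the GOGC environment
// variable) controls how much the heap may grow before the next collection;
// FreeOSMemory forces a collection and asks the runtime to return as much
// memory as possible to the OS.  Whether the OS actually reclaims those pages
// is, as noted above, up to the OS.
func tuneGC() {
	debug.SetGCPercent(50) // collect more eagerly than the default of 100
	debug.FreeOSMemory()
}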

Caching on disk does not remove the need to hold the relevant copies of an object in memory at once.  Caching on disk would add more encoding and decoding.

Caching on disk would help in scenarios where the cache holds significantly more volume than the client needs to be reachable in memory at any given time.

3. Lavalamp, https://github.com/kubernetes/kubernetes/issues/90339 does not enable a client process to start while disconnected.  That is what I meant by "restart".

Daniel Smith

Aug 31, 2022, 12:06:21 PM
to mspre...@gmail.com, K8s API Machinery SIG
On Wed, Aug 31, 2022 at 8:48 AM mspre...@gmail.com <mspre...@gmail.com> wrote:
> [...]
> 3. Lavalamp, https://github.com/kubernetes/kubernetes/issues/90339 does not enable a client process to start while disconnected.  That is what I meant by "restart".

The technique is about making it efficient to catch up a client that already has *some* data. A client that stored its state would need something that accomplishes this. Local state storage is a separate (and likely easier) problem.

You can already restart today if you store everything and restart before you're out of the watch window (i.e. you are down for only a minute or two).

In clusters with enough churn there are going to be diminishing returns to "efficiently" catching up clients with very old states, as nearly everything will have changed anyway. (A reason to focus on sharding as a superior answer to large data sets.)
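(To make the watch-window point concrete, here is a hypothetical helper, names made up, sketching the resume-or-relist flow a state-storing client would need:)

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Hypothetical sketch: try to resume watching from a resourceVersion saved
// before a restart; fall back to a full (expensive) relist if that version
// has already fallen out of the watch window (410 Gone).
func resumeOrRelist(ctx context.Context, client kubernetes.Interface, ns, savedRV string) error {
	w, err := client.CoreV1().ConfigMaps(ns).Watch(ctx, metav1.ListOptions{
		ResourceVersion:     savedRV,
		AllowWatchBookmarks: true,
	})
	if apierrors.IsResourceExpired(err) || apierrors.IsGone(err) {
		// Out of the window: the only option today is a full relist.
		_, err = client.CoreV1().ConfigMaps(ns).List(ctx, metav1.ListOptions{})
		return err
	}
	if err != nil {
		return err
	}
	defer w.Stop()
	// Note: the 410 can also arrive as an Error event on w.ResultChan(), which
	// a real client (the Reflector does this) has to handle as well.
	return nil
}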
 

Daniel Smith

Aug 31, 2022, 12:16:12 PM
to mspre...@gmail.com, K8s API Machinery SIG
My overarching point is that data sizes that require caching on disk can't currently be delivered to a client before the list revision goes out of the window (i.e. such clients can't start up in the first place), so it's not going to be useful to solve the former problem without solving the latter.

If you let a collection size get this large currently and there is some problem restarting a controller for an important resource type, the cluster is bricked without manual intervention of some sort.

In fact you can likely get into this condition today for certain cluster collection sizes + bandwidths + controller locations, and even if you don't brick the cluster it can take a few tries before all the important controllers load the state.

So, I currently consider the initial state transport problem to be a lot more important than supporting larger collections, which would only make that transport problem worse.