Hi memory-dev,
I'm working on a feature that introduces new storage implemented in the network service. We currently keep the feature's state in memory, which (perhaps not surprisingly) causes some metrics to show that we are using more memory than before (dashboard link; sorry, Google-internal).
We have some quota knobs we can tune to decrease this memory use, or we could consider architectural changes that keep less of the data in memory.
It seems like the biggest user-facing impact comes from tail memory metrics, which we don't see regressing on the linked dashboard. I know memory metrics can be noisy, so I want to understand whether there is a first-principles reason to believe that tail memory "really is regressing" even if the metrics don't show it.
If a feature reads (say) 20 KiB from disk into a long-lived in-memory object close to process initialization, would we always expect tail memory metrics to increase by a similar amount? If so, we could reason through our optimization options without needing the regression to be confirmed by any particular metric.
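
To make the scenario concrete, here is a minimal sketch of the pattern I have in mind. The names and structure are made up for illustration, not our actual code; the real state is more structured than a raw string, but the memory shape is the same: data read from disk once, near startup, and held for the life of the process.

    #include <fstream>
    #include <iterator>
    #include <string>

    // Hypothetical long-lived holder for the feature's state.
    class FeatureState {
     public:
      static FeatureState& GetInstance() {
        static FeatureState instance;  // Lives until process exit.
        return instance;
      }

      // Called once, close to process initialization.
      void LoadFromDisk(const std::string& path) {
        std::ifstream file(path, std::ios::binary);
        data_.assign(std::istreambuf_iterator<char>(file),
                     std::istreambuf_iterator<char>());
      }

      const std::string& data() const { return data_; }

     private:
      std::string data_;  // ~20 KiB resident for the rest of the process.
    };
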
Alternatively, are the moving pieces complex enough that understanding tail memory impact always requires observing a regression empirically? (If so, we'd probably hold off and revisit the numbers after a bigger or longer rollout.)
Thanks!