I was recently dealing with some heavily loaded instances that would occasionally get OOM killed.
The maths there mentions "double for GC", which gives you low and high watermark thresholds: your usage should land somewhere between "1x" (the base memory usage of the data held in Prometheus) and "2x" (that base doubled to account for Go's garbage collector). On top of that there will be some memory used by queries, recording rules and so on, but what I found was that my real usage was close to 2x (so already accounting for GC), meaning most of that extra memory needed ends up hidden inside the "double for GC" headroom.
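To sanity-check where you sit between those two marks you can compare Prometheus' own Go heap metrics against its RSS. A minimal sketch, assuming you scrape your Prometheus under a `job="prometheus"` label (that selector is my assumption, adjust to your setup):

```
# heap actually in use by Prometheus - roughly the "1x" base
go_memstats_heap_inuse_bytes{job="prometheus"}

# what the OS sees, including GC headroom - this is what gets you OOM killed
process_resident_memory_bytes{job="prometheus"}
```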
I did drop a lot of high-cardinality metrics to get the memory usage down, mostly from exporters like cadvisor and node_exporter that expose a lot of detailed metrics. metric_relabel_configs in your scrape target configuration can be used to drop individual expensive metrics.
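Here's a minimal sketch of what that looks like, assuming a cadvisor scrape job; the job name, target and the exact metrics being dropped are just placeholders, so pick the ones that are actually expensive for you:

```yaml
scrape_configs:
  - job_name: cadvisor              # hypothetical job name
    static_configs:
      - targets: ['cadvisor:8080']  # placeholder target
    metric_relabel_configs:
      # drop a couple of metrics that tend to be high cardinality;
      # check your own /metrics output before deciding what to cut
      - source_labels: [__name__]
        regex: 'container_tasks_state|container_memory_failures_total'
        action: drop
```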
The next thing to know is that metrics churn can cost you a lot - if a target exposes some unique metric for just a few minutes and then it disappears, you will still pay the memory cost of it until the next head compaction. I found this a bit harder to quantify, but if you plot Prometheus memory usage over time you will see it drop after every head compaction, and how often that happens can be tweaked with the storage.tsdb.max-block-duration flag.
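If you'd rather measure the churn than guess at it, Prometheus exposes metrics about its own head block (again assuming a `job="prometheus"` self-scrape):

```
# how fast brand new series are being created - sustained high values mean churn
rate(prometheus_tsdb_head_series_created_total{job="prometheus"}[5m])

# series currently held in the head - this drops when the head gets compacted away
prometheus_tsdb_head_series{job="prometheus"}
```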
If you do have a lot of churn, the extra memory pressure from some operations, like head compaction, can push your instances over the memory limit, so a low value of storage.tsdb.max-block-duration might help; chances are you've already set a custom value for that flag if you're using Thanos (see
https://thanos.io/v0.16/components/sidecar.md/). This was mostly a problem on big Prometheus instances (256GB of RAM), where the default Go garbage collection threshold made the GC act a bit too lazily and leave too much memory to be reclaimed, so when there was a spike in memory pressure the GC would react too slowly compared to how fast new allocations were happening. Tweaking the GOGC environment variable improved that for me.
After trimming all the unnecessary metrics, the final things that helped were to:
1. enforce the TSDB block duration: --storage.tsdb.max-block-duration=2h --storage.tsdb.min-block-duration=2h is a good start, but we've set it to only 30m on the biggest boxes where we see a lot of churn
2. set GOGC=40 - this trades more CPU usage for lower RSS memory usage. A different value might work better for you; I didn't find any benefit in setting it lower (30 would eat more CPU without giving me any memory savings). Both settings are shown together in the sketch after this list.
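Put together, a launch wrapper might look something like this - the binary and config paths are placeholders, and the flag values are just the ones discussed above, not a recommendation for every setup:

```sh
#!/bin/sh
# Lower GOGC so the Go GC keeps RSS closer to the live heap (at the cost of CPU).
export GOGC=40

# Force small, fixed-size head blocks so memory held by churned series
# is released more often.
exec /usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=2h
```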