--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-devel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/5ab29a58-e5a4-43c5-b662-4436db61f20an%40googlegroups.com.
Soft / partial failure modes can be very hard problems to deal with. You have to be a lot more careful to not end up missing partial failures.
While a soft sample_limit may seem good and the hard sample_limit bad, failing fast will serve you better in the long run. Most of the Prometheus monitoring design assumes fail fast. Partial results are too hard to reason about from a monitoring perspective. With fail fast you will know quickly and decisively that you've hit a problem. If you treat monitoring outages as just as bad as an actual service outage, you'll end up with a healthier system overall.
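To make the difference concrete, here is a minimal Python sketch of the two behaviors. It is not Prometheus code. The hard variant follows the documented sample_limit semantics (a scrape over the limit is treated as failed and nothing is ingested, with 0 meaning no limit). The soft variant is the hypothetical behavior discussed in this thread; the function names and the "keep the first N samples" truncation rule are assumptions made purely for illustration.

```python
from typing import List, Tuple

Sample = Tuple[str, float]  # (series identifier, value); simplified for illustration


def scrape_hard_limit(samples: List[Sample], sample_limit: int) -> Tuple[bool, List[Sample]]:
    """Fail fast: if the scrape exceeds the limit, reject the whole scrape."""
    if sample_limit > 0 and len(samples) > sample_limit:
        return False, []  # scrape counts as failed, nothing is ingested
    return True, list(samples)


def scrape_soft_limit(samples: List[Sample], sample_limit: int) -> Tuple[bool, List[Sample]]:
    """Hypothetical soft limit: keep the first sample_limit samples, drop the rest."""
    if sample_limit > 0 and len(samples) > sample_limit:
        return True, list(samples[:sample_limit])  # scrape looks healthy but is partial
    return True, list(samples)
```

With the hard limit the problem surfaces as a failed scrape, which is easy to alert on. With the soft variant the scrape reports success while silently missing series, which is the partial-failure case that is hard to reason about.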
For the case of label explosions, there are some good meta metrics[0] that can help you. The "scrape_series_added" metric lets you soft-detect label leaks. In addition, there is a new feature flag[1] that adds extra target metrics for monitoring targets that are nearing their limits.
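As a rough illustration of soft detection with "scrape_series_added", the Python sketch below sums the per-scrape values for each target over a window and flags targets whose total exceeds a threshold. The function name, the input shape, and the threshold are assumptions for illustration; in practice this check would more likely live in an alerting rule than in a script.

```python
from typing import Dict, List


def targets_with_series_churn(
    series_added: Dict[str, List[float]],
    threshold: float,
) -> List[str]:
    """Return targets whose total of new series over the window exceeds threshold.

    series_added maps a target name to the scrape_series_added values
    observed for it during the window, one value per scrape.
    """
    flagged = []
    for target, values in sorted(series_added.items()):
        if sum(values) > threshold:
            flagged.append(target)
    return flagged
```

A target that keeps adding new series on every scrape is a likely label leak, and it can be flagged this way well before it reaches a hard sample_limit.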
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/7d71d3c3-ee76-43ef-bd09-4263055ab1d3n%40googlegroups.com.
I agree that this is not side-effect free, with the worst possible outcome being bogus alerts. But we're talking about metrics, so from my PoV the consequences range not from "annoying to dangerous" but from "annoying to more annoying".
There's no real danger in missing metrics. Nothing will break if they are missing.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/8751b359-bc6e-4b2c-8c8e-5502675e8ab6n%40googlegroups.com.
While I believe we should be open to different solutions to limit the damage of Prometheus going OOM, I just want to point out that metrics use cases are wider than visibility and alerting. For example, missing metrics can cause you to miss autoscaling ..., and hence be dangerous. I'd be curious whether we can be selective about partial failures, and have more control over which priority metrics are privileged for success.
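A selective partial failure could look roughly like the Python sketch below. This is entirely hypothetical: Prometheus has no such option as far as this thread establishes, and the function name, the priority_metrics set, and the selection rule are all assumptions used only to illustrate the idea.

```python
from typing import List, Set, Tuple

Sample = Tuple[str, float]  # (metric name, value); simplified for illustration


def select_with_priority(
    samples: List[Sample],
    sample_limit: int,
    priority_metrics: Set[str],
) -> List[Sample]:
    """Keep at most sample_limit samples, preferring priority metrics."""
    if len(samples) <= sample_limit:
        return list(samples)
    # sorted() is stable, so original order is preserved within each group;
    # priority samples (key False) sort ahead of the rest (key True).
    ordered = sorted(samples, key=lambda s: s[0] not in priority_metrics)
    return ordered[:sample_limit]
```

Note that if the priority metrics alone exceed the limit, some of them are still dropped, so this only narrows the partial-failure problem rather than removing it.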
--
Jérôme
From: <prometheus...@googlegroups.com> on behalf of "l.mi...@gmail.com" <l.mi...@gmail.com>
Date: Monday, November 28, 2022 at 7:09 AM
To: Prometheus Developers <prometheus...@googlegroups.com>
Subject: RE: [EXTERNAL][prometheus-developers] Preventing Prometheus from running out of memory
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/d740484c-9920-481c-aeb6-a847b0fbe20an%40googlegroups.com.