Metrics vs log level


Muthuveerappan Periyakaruppan

Oct 6, 2022, 11:09:08 PM
to Prometheus Users

Hi Team,

We have a situation where each Prometheus has 8 to 15 million head series, and we run 7 instances of them (federated). Our Prometheus servers are constantly flooded handling the incoming metrics and the back-end recording rules.

One thought that came up was: do we have something similar to log levels for Prometheus metrics? If so, we could benefit from it by configuring all targets to run at error level in production and at debug/info level in development. This would help control the flooding of metrics.

If we write a wrapper on top of the Prometheus Java client API it is going to be messy, hence I wanted to check whether this request makes sense, or whether there is another way out.

Let me know your thoughts on how this can be achieved. I would really like to hear how others handle this sort of situation and what the best way to tackle it is.

FYR - we have raised the same issue at the Prometheus Java client project: https://github.com/prometheus/client_java/issues/815


Many Thanks
Muthuveerappan

Stuart Clark

Oct 7, 2022, 3:55:27 AM
to Muthuveerappan Periyakaruppan, Prometheus Users
On 07/10/2022 04:09, Muthuveerappan Periyakaruppan wrote:
> we have a situation where each Prometheus has 8 to 15 million head
> series, and we run 7 instances of them (federated). Our Prometheus
> servers are constantly flooded handling the incoming metrics and the
> back-end recording rules.

8-15 million time series on a single Prometheus instance is pretty high.
What spec machine/pod are these?

When you say "flooded", what do you mean?

> One thought that came up was: do we have something similar to log
> levels for Prometheus metrics? If so, we could benefit from it by
> configuring all targets to run at error level in production and at
> debug/info level in development. This would help control the flooding
> of metrics.
>
I'm not sure I understand what you are suggesting. What would be the
difference between these hypothetical "error" and "debug" levels? Do
you mean that some metrics would only be exposed in some environments?

--
Stuart Clark

Brian Candler

Oct 7, 2022, 4:06:42 AM
to Prometheus Users
If you want to filter out some metrics, and you can't control this on the exporter, then you can use metric relabelling to drop any that you don't want to ingest into the database.
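
For example, a minimal sketch of such a rule in the Prometheus scrape config; the job name, target, and metric-name pattern here are just placeholders:

    scrape_configs:
      - job_name: "my-app"                      # hypothetical job
        static_configs:
          - targets: ["app.example.com:8080"]   # hypothetical target
        metric_relabel_configs:
          # Drop any series whose metric name matches this pattern
          # before it is written to the TSDB.
          - source_labels: [__name__]
            regex: "myapp_debug_.*"
            action: drop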

Matthias Rampke

Oct 7, 2022, 5:43:50 AM
to Muthuveerappan Periyakaruppan, Fabian Stäber, Prometheus Users
> If we write a wrapper on top of the Prometheus Java client API, it is going to be messy

You can make it relatively clean by creating (and incrementing) all the metrics, but only calling .register() on those that you want to expose in the given environment.
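
A rough sketch of that pattern with the Java simpleclient; the metric name and the environment check are made up for illustration:

    import io.prometheus.client.CollectorRegistry;
    import io.prometheus.client.Counter;

    // Build the counter without registering it anywhere.
    Counter debugCounter = Counter.build()
        .name("myapp_debug_operations_total")   // hypothetical metric name
        .help("Detailed per-operation counter, only exposed outside production.")
        .create();

    // Only register it (i.e. expose it on /metrics) outside production.
    if (!"production".equals(System.getenv("APP_ENV"))) {   // hypothetical env var
        CollectorRegistry.defaultRegistry.register(debugCounter);
    }

    // Incrementing is safe either way; an unregistered metric is simply never scraped.
    debugCounter.inc();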

Even more elaborately, you could have separate CollectorRegistry instances, and register each metric with the one(s) appropriate for its level. I think as it is, you will have to register "normal" level metrics with both the "normal" and "debug" CollectorRegistry.

I wonder (@fstab?) if it would make sense to have a CollectorRegistryCollector, so that in effect you could do `normalRegistry.register(debugRegistry)` and then decide when setting up the Exporter which registry to serve, because all the "normal" metrics are indirectly automatically registered with the debug registry. Or maybe that exists and I couldn't find it?
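
For what it's worth, something along those lines can be approximated today with a small custom Collector that re-exposes everything from another registry. This is only a sketch of the idea, not an existing client_java class:

    import io.prometheus.client.Collector;
    import io.prometheus.client.CollectorRegistry;
    import java.util.Collections;
    import java.util.List;

    // Hypothetical "CollectorRegistryCollector": exposes every metric of a
    // wrapped registry through whichever registry this collector is registered with.
    public class CollectorRegistryCollector extends Collector {
        private final CollectorRegistry inner;

        public CollectorRegistryCollector(CollectorRegistry inner) {
            this.inner = inner;
        }

        @Override
        public List<MetricFamilySamples> collect() {
            // Pull the current samples from the wrapped registry on every scrape.
            return Collections.list(inner.metricFamilySamples());
        }
    }

With that, debugRegistry.register(new CollectorRegistryCollector(normalRegistry)) would expose all the "normal" metrics through the debug registry as well, while debug-only metrics stay out of the normal one.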

/MR



Fabian Stäber

Oct 7, 2022, 6:05:28 AM
to Matthias Rampke, Muthuveerappan Periyakaruppan, Fabian Stäber, Prometheus Users
Hi,

Funny, I commented the same idea of having two separate registries on https://github.com/prometheus/client_java/issues/815 this morning.

Currently you would just register each metric with two registries manually

    errorLevelRegistry.register(myCounter);
    debugLevelRegistry.register(myCounter);

I'm not sure whether it's worthwhile to add an API to client_java to make this a one-liner. Writing a custom method for this is just four lines of code.
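
As an illustration, such a helper could look roughly like this, assuming errorLevelRegistry and debugLevelRegistry are static CollectorRegistry fields in scope (the method name is made up):

    // Register a collector with both the error-level and the debug-level registry,
    // so that "error" metrics are also served when the debug registry is exposed.
    static <T extends Collector> T registerAtBothLevels(T collector) {
        errorLevelRegistry.register(collector);
        debugLevelRegistry.register(collector);
        return collector;
    }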

Fabian

Muthuveerappan Periyakaruppan

Oct 7, 2022, 10:22:53 PM
to Prometheus Users
Please find replies inline.

On Friday, 7 October, 2022 at 1:25:27 pm UTC+5:30 Stuart Clark wrote:
On 07/10/2022 04:09, Muthuveerappan Periyakaruppan wrote:
> we have a situation where each Prometheus has 8 to 15 million head
> series, and we run 7 instances of them (federated). Our Prometheus
> servers are constantly flooded handling the incoming metrics and the
> back-end recording rules.

8-15 million time series on a single Prometheus instance is pretty high.
What spec machine/pod are these?

90 GB RAM, 5000 millicores.
 
When you say "flooded", what do you mean?
 
Always high usage of RAM, no OOMs, although metrics go missing and the average scrape duration is around 35 seconds (maybe due to the number of targets/metrics).
CPU demand/usage is not that high.


> One thought that came up was: do we have something similar to log
> levels for Prometheus metrics? If so, we could benefit from it by
> configuring all targets to run at error level in production and at
> debug/info level in development. This would help control the flooding
> of metrics.
>
I'm not sure I understand what you are suggesting. What would be the
difference between these hypothetical "error" and "debug" levels? Do
you mean that some metrics would only be exposed in some environments?

Let's say every pod has close to 100 metrics; we may not need all of them in production.
Before adding a metric, a developer can assess how useful it will be in production and which indicators it covers, such as Utilization, Saturation, and Errors (USE) or Rate, Errors, and Duration (RED), and based on that choose the metric level.
Based on the metric level, only a few would be enabled (ERROR/SEVERE level) in production, while the rest would be enabled (INFO/DEBUG level) in development/testing/staging environments.
A few metrics should be enough to troubleshoot, and on demand we should have the option to change the metric level at runtime, like a log level, to get more metrics.

--
Stuart Clark

Muthuveerappan Periyakaruppan

Oct 7, 2022, 10:26:05 PM
to Prometheus Users
We already dropped a few that are unused; even after that we ended up at that number.

Is there a way to find out / query unused metrics, or the least queried ones? topk somehow does not work for us; it gives a different error every time and I have not been able to spend time on that. Do you have any handy PromQL?

Muthuveerappan Periyakaruppan

Oct 7, 2022, 10:26:25 PM
to Prometheus Users
The CollectorRegistryCollector idea is interesting.

Muthuveerappan Periyakaruppan

Oct 7, 2022, 10:26:55 PM
to Prometheus Users
Thanks a lot, Fabian, I will check it out and get back to you.

Ben Kochie

Oct 8, 2022, 3:34:10 AM
to Fabian Stäber, Matthias Rampke, Muthuveerappan Periyakaruppan, Fabian Stäber, Prometheus Users
So, different opinion here.

Metrics are meant to tell you _when_ to debug. They're not meant to be the debugging tool itself.

Metrics are supposed to tell you when it's time to get out other tools.
* Look at the logs
* Look at traces
* Look at profilers

Trying to get every dimension on every metric is just going to make your metrics bloated and useless for alerting, which is what Prometheus is primarily for.


Ben Kochie

Oct 8, 2022, 3:40:36 AM
to Muthuveerappan Periyakaruppan, Prometheus Users
On Sat, Oct 8, 2022 at 4:22 AM Muthuveerappan Periyakaruppan <muthu.v...@gmail.com> wrote:
Please find replies inline.

On Friday, 7 October, 2022 at 1:25:27 pm UTC+5:30 Stuart Clark wrote:
On 07/10/2022 04:09, Muthuveerappan Periyakaruppan wrote:
> we have a situation where each Prometheus has 8 to 15 million head
> series, and we run 7 instances of them (federated). Our Prometheus
> servers are constantly flooded handling the incoming metrics and the
> back-end recording rules.

8-15 million time series on a single Prometheus instance is pretty high.
What spec machine/pod are these?

90 GB RAM, 5000 millicores.

Wait, are you federating multiple Prometheus instances on multiple clusters into one? Maybe you should look at Thanos instead. It lets you federate, but without actually forcing you to put all the data in one service.

We have a Thanos setup with 250+ million metrics. Thousands of Prometheus instances across multiple large Kubernetes clusters.


 
When you say "flooded", what do you mean?
 
Always high usage of RAM, no OOMs, although metrics go missing and the average scrape duration is around 35 seconds (maybe due to the number of targets/metrics).
CPU demand/usage is not that high.


> One thought that came up was: do we have something similar to log
> levels for Prometheus metrics? If so, we could benefit from it by
> configuring all targets to run at error level in production and at
> debug/info level in development. This would help control the flooding
> of metrics.
>
I'm not sure I understand what you are suggesting. What would be the
difference between these hypothetical "error" and "debug" levels? Do
you mean that some metrics would only be exposed in some environments?

Let's say every pod has close to 100 metrics; we may not need all of them in production.

100 metrics per pod is not a lot. Is that really what you're using? That means you have between 80k and 150k pods. Is that in a single cluster?

And you said you have a scrape duration of 35s. For 100 metrics per pod, your scrape duration should be closer to 35 milliseconds.

Something in what you're saying doesn't add up.


Before adding a metric, a developer can assess how useful it will be in production and which indicators it covers, such as Utilization, Saturation, and Errors (USE) or Rate, Errors, and Duration (RED), and based on that choose the metric level.
Based on the metric level, only a few would be enabled (ERROR/SEVERE level) in production, while the rest would be enabled (INFO/DEBUG level) in development/testing/staging environments.
A few metrics should be enough to troubleshoot, and on demand we should have the option to change the metric level at runtime, like a log level, to get more metrics.

--
Stuart Clark
