organizing metrics

leso...@gmail.com

Jul 13, 2018, 3:15:58 AM
to Prometheus Users
Hi,
I have a question about how to organize metrics efficiently.
Let's assume we have one system parameter, CPU usage, which can be broken down into user time, system time, idle time, etc. (10 values).

There are two approaches to organizing these values into metrics.
1) Use one metric with one label that takes 10 unique values. For example:
cpu_usage: mode=user; 
cpu_usage: mode=system;
cpu_usage: mode=softirq;
and so on...

2) Use ten separate metrics, with no labels: cpu_usage_user, cpu_usage_system, cpu_usage_idle, etc...

The question is: which approach is more cost-effective from Prometheus's point of view? Which one takes less space in the TSDB and fewer resources when evaluating expressions?

It seems to me that the first approach is more convenient for writing expressions, because it is simpler to write 'mode=~"user|system|something_else"' than to sum separate metrics. But the second approach potentially avoids storing extra labels (correct me if I'm wrong here), so in theory we could avoid extra JOINs in the TSDB when evaluating expressions.

Thanks

Matthias Rampke

Jul 13, 2018, 3:27:18 AM
to leso...@gmail.com, Prometheus Users
My understanding is that the most efficient distribution is roughly similar cardinality across all labels; that is, you may run into issues if a single label has a million values. Tens of values are totally fine and nothing to worry about. The issues behind this have also been mitigated somewhat, so even moderate cardinalities should be OK.

The storage experts can chime in here, but I believe there is an optimization where time series of the same *name* are stored together and can be read more efficiently than many different names.

In general, I would structure metrics based on semantics first, and not worry about the details of the storage too much. In the end, the bulk of the querying load is reading and handling the actual metrics data, and that needs to happen whatever it is named.

For your example, absolutely use the mode label: there is a fixed set of cores, each spending its time in one state or another, so that partitioning is semantically correct and makes all the queries simpler.

When writing queries, keep in mind that exact label matches (=, !=) are more efficient than regex matches (=~, !~).
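
As a rough illustration of the matcher advice, here are a few selectors against the hypothetical cpu_usage metric and mode label from the question. The metric and label values are assumptions taken from the original post, not a real exporter's output.

```promql
# Exact match: selects only the series for one mode.
cpu_usage{mode="user"}

# Regex match: useful when several modes are wanted at once.
# Prometheus regex matchers are fully anchored, so this matches
# exactly "user" or "system".
sum(cpu_usage{mode=~"user|system"})

# Negative exact match: everything except idle, without a regex.
sum(cpu_usage{mode!="idle"})
```

Where a single value or a simple exclusion is enough, the exact matchers keep the query both cheaper and easier to read.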

/MR




Ben Kochie

Jul 13, 2018, 3:31:39 AM
to leso...@gmail.com, Prometheus Users
On Fri, Jul 13, 2018 at 9:16 AM <leso...@gmail.com> wrote:
The question is: which approach is more cost-effective from Prometheus's point of view? Which one takes less space in the TSDB and fewer resources when evaluating expressions?

There is no space difference between the two: Prometheus tracks either way as individual time series. Each unique combination of metric name and label values is stored separately.
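
One way to check this on a running server is to count the series each scheme produces. This is only a sketch: it assumes the hypothetical cpu_usage names from the question and that only one of the two schemes is exposed at a time.

```promql
# Approach 1: one metric name, ten values of the mode label.
count(cpu_usage)

# Approach 2: ten metric names, no mode label.
# The metric name is itself the __name__ label, so it can be matched
# like any other label.
count({__name__=~"cpu_usage_.+"})
```

For the same set of scrape targets, both queries should report the same number of series, which is why neither layout saves space.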
 

It seems to me that the first approach is more convenient for writing expressions, because it is simpler to write 'mode=~"user|system|something_else"' than to sum separate metrics. But the second approach potentially avoids storing extra labels (correct me if I'm wrong here), so in theory we could avoid extra JOINs in the TSDB when evaluating expressions.

When deciding between a metric name and a label, we usually think about logical operations first. In your example, it makes sense to sum(node_cpu_seconds_total), so this is a good use case for `mode` and `cpu` labels. This is exactly how we do it in the node_exporter.
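
A minimal sketch of what that label layout makes possible. The 5m range window and grouping by the standard instance label are assumptions chosen for illustration.

```promql
# Per-second CPU time by mode, summed across all cores of each instance.
sum by (instance, mode) (rate(node_cpu_seconds_total[5m]))

# Fraction of CPU time that is not idle, averaged over cores, per instance.
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

With ten separately named metrics, the first query would have to spell out every name and add them by hand.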

Relatedly, sometimes you do want separate metrics. For example, we tend to keep errors as a separate metric from the total counts. Take requests from an application to memcached.

You could have something like app_memcache_requests_total{status="hit"} and {status="miss"}.

But we usually recommend separate counters

app_memcache_misses_total
app_memcache_requests_total

This allows you to easily do math on them like this:

app_memcache_misses_total / app_memcache_requests_total
 
Note, in my examples above, I left out the likely necessary rate() functions to transform counters into gauges.
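
With rate() added, the ratio might look like the sketch below. The 5m window is an assumption; the metric names are the ones used in the examples above.

```promql
# Miss ratio over the last 5 minutes, per series.
rate(app_memcache_misses_total[5m]) / rate(app_memcache_requests_total[5m])

# The same ratio aggregated across all instances.
sum(rate(app_memcache_misses_total[5m]))
  /
sum(rate(app_memcache_requests_total[5m]))

# For comparison, the single-metric variant with a status label needs
# the label aggregated away on both sides before dividing.
sum(rate(app_memcache_requests_total{status="miss"}[5m]))
  /
sum(rate(app_memcache_requests_total[5m]))
```

The last form works, but the separate-counter version is shorter and harder to get wrong, since the two sides share identical label sets without any aggregation.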