Clarification about the context of not overusing label values


Pau Freixes

Aug 7, 2020, 3:57:30 AM
to Prometheus Users
Hi,

After reading this [1] and this [2], I have the feeling that a reader,
or maybe only me, can have some trouble understanding when this rule
applies, under what circumstances, and how.

From my understanding, correct me if I'm wrong, Prometheus encourages
the use of labels for slicing your metrics [2], for example to
identify which service owns a time series. Considering the following
HTTP metric, http_api_requests, it would be fine to have different
time series for the same metric name, identified by the following
label values:

http_api_requests{service_name="foo", status_code="200"}
http_api_requests{service_name="foo", status_code="500"}
http_api_requests{service_name="bar", status_code="200"}
http_api_requests{service_name="bar", status_code="500"}

And in the case of having not 2 services but 1K different services,
this would still be fine, since the total number of time series would
still be manageable.
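As a minimal sketch (standard library only; the real Prometheus client libraries work similarly), each distinct combination of label values becomes its own time series under the one metric name:

```python
# Minimal sketch: a counter metric where each distinct combination of
# label values is its own time series, as in Prometheus.
from collections import Counter as SeriesCounter

http_api_requests = SeriesCounter()

def inc(service_name, status_code):
    # The label set identifies the time series; incrementing bumps its sample.
    http_api_requests[(("service_name", service_name),
                       ("status_code", status_code))] += 1

inc("foo", "200")
inc("foo", "500")
inc("bar", "200")
inc("bar", "500")

# Four label combinations -> four time series for one metric name.
print(len(http_api_requests))  # 4
```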

From what can be read in [1], this could be misunderstood:

> As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.

Looking at the previous example and the general guideline, someone
could conclude that adding service_name as a label name breaks that
rule.

From my understanding, correct me if I'm wrong, this general guideline
should be circumscribed to the side effects of adding a label with
large cardinality, or of adding a label that, while not having large
cardinality on its own, implies an explosion in the number of time
series once combined with other labels.
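The explosion is multiplicative: the number of series for a metric is the product of the cardinalities of its labels. With hypothetical counts:

```python
# Hypothetical per-label cardinalities: the series count for one metric
# is the product of the number of values each label can take.
services = 1000      # service_name values
status_codes = 2     # e.g. 200 and 500

print(services * status_codes)  # 2000 -- still manageable

# Add a resource_path label and the count multiplies again.
paths = 100
print(services * status_codes * paths)  # 200000
```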

For example, going back to the http_api_requests example, what would
happen if we added the resource path as a label? We would have
something like this:

http_api_requests{service_name="foo", status_code="200", resource_path="/a"}
http_api_requests{service_name="foo", status_code="500", resource_path="/b"}
http_api_requests{service_name="bar", status_code="200", resource_path="/c"}
http_api_requests{service_name="bar", status_code="500", resource_path="/d"}

Would this become an issue? I have the feeling that it depends on how
the query is done. If the query also narrows by service name, this
should not be a problem, since the total number of time series
involved should still be manageable, while a query that is not
filtered by service name would most likely touch an unmanageable
number of time series.

If this is true, and the second kind of query most likely wouldn't
make any sense, why not prefix the metric name with the service name,
to avoid future queries that could break the system by mistake?

Another example: let's say that we add the pod id as a label. It can
have thousands of different values, but they are somewhat stable
during a time window. The metric will look like this:

http_api_requests{service_name="foo", status_code="200", resource_path="/a", pod_name="1ef"}
http_api_requests{service_name="foo", status_code="500", resource_path="/b", pod_name="2ef"}
http_api_requests{service_name="bar", status_code="200", resource_path="/c", pod_name="3ef"}
http_api_requests{service_name="bar", status_code="500", resource_path="/d", pod_name="4ef"}

The queries that we will typically run won't slice by pod, but they
will still narrow by service name. In a scenario with a more or less
stable number of 500 pods in a time window, would the query still be
manageable by Prometheus?

The node_exporter example that you provide seems fine to me, since we
will still always narrow the query to one specific service, which
dramatically reduces the number of time series involved in the query.

Am I missing something in my rationale? If not, would it make sense
to reword the following message a bit:

> As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.

Should the rule of thumb instead be the number of time series involved
in a query, where this number should be < X?

Thanks!


[1] https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels
[2] https://www.robustperception.io/target-labels-not-metric-name-prefixes

--
--pau

Brian Brazil

Aug 7, 2020, 4:09:38 AM
to Pau Freixes, Prometheus Users
On Fri, 7 Aug 2020 at 08:57, Pau Freixes <pfre...@gmail.com> wrote:
> Hi,
>
> After reading this [1] and this [2], I have the feeling that a reader,
> or maybe only me, can have some trouble understanding when this rule
> applies, under what circumstances, and how.
>
> From my understanding, correct me if I'm wrong, Prometheus encourages
> the use of labels for slicing your metrics [2], for example to
> identify which service owns a time series. Considering the following
> HTTP metric, http_api_requests, it would be fine to have different
> time series for the same metric name, identified by the following
> label values:
>
> http_api_requests{service_name="foo", status_code="200"}
> http_api_requests{service_name="foo", status_code="500"}
> http_api_requests{service_name="bar", status_code="200"}
> http_api_requests{service_name="bar", status_code="500"}
>
> And in the case of having not 2 services but 1K different services,
> this would still be fine, since the total number of time series would
> still be manageable.

1K services is high cardinality, and then that's also broken out by status_code.
 

> From what can be read in [1], this could be misunderstood:
>
> > As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.
>
> Looking at the previous example and the general guideline, someone
> could conclude that adding service_name as a label name breaks that
> rule.
>
> From my understanding, correct me if I'm wrong, this general guideline
> should be circumscribed to the side effects of adding a label with
> large cardinality, or of adding a label that, while not having large
> cardinality on its own, implies an explosion in the number of time
> series once combined with other labels.

That's the basic idea.
 

> For example, going back to the http_api_requests example, what would
> happen if we added the resource path as a label? We would have
> something like this:
>
> http_api_requests{service_name="foo", status_code="200", resource_path="/a"}
> http_api_requests{service_name="foo", status_code="500", resource_path="/b"}
> http_api_requests{service_name="bar", status_code="200", resource_path="/c"}
> http_api_requests{service_name="bar", status_code="500", resource_path="/d"}
>
> Would this become an issue? I have the feeling that it depends on how
> the query is done. If the query also narrows by service name, this
> should not be a problem, since the total number of time series
> involved should still be manageable, while a query that is not
> filtered by service name would most likely touch an unmanageable
> number of time series.
>
> If this is true, and the second kind of query most likely wouldn't
> make any sense, why not prefix the metric name with the service name,
> to avoid future queries that could break the system by mistake?

That doesn't change the cardinality, it just makes querying harder for users and is an anti-pattern.
 

> Another example: let's say that we add the pod id as a label. It can
> have thousands of different values, but they are somewhat stable
> during a time window. The metric will look like this:
>
> http_api_requests{service_name="foo", status_code="200", resource_path="/a", pod_name="1ef"}
> http_api_requests{service_name="foo", status_code="500", resource_path="/b", pod_name="2ef"}
> http_api_requests{service_name="bar", status_code="200", resource_path="/c", pod_name="3ef"}
> http_api_requests{service_name="bar", status_code="500", resource_path="/d", pod_name="4ef"}
>
> The queries that we will typically run won't slice by pod, but they
> will still narrow by service name. In a scenario with a more or less
> stable number of 500 pods in a time window, would the query still be
> manageable by Prometheus?

That will end very poorly, as we're now talking a cardinality of 4 million (presuming just 2 status codes and 4 paths) from each individual target.
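The arithmetic behind a figure of that order, using the cardinalities mentioned in this thread (1K services, 2 status codes, 4 paths, 500 pods):

```python
# Multiplying the label cardinalities from the examples in this thread:
# 1000 services x 2 status codes x 4 paths x 500 pods.
series = 1000 * 2 * 4 * 500
print(series)  # 4000000
```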

> The node_exporter example that you provide seems fine to me, since we
> will still always narrow the query to one specific service, which
> dramatically reduces the number of time series involved in the query.

It's not just about querying, it's also about how much data Prometheus has to store. Prometheus can practically hold somewhere in the low tens of millions of active time series.
 

> Am I missing something in my rationale? If not, would it make sense
> to reword the following message a bit:
>
> > As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.
>
> Should the rule of thumb instead be the number of time series involved
> in a query, where this number should be < X?

Less than 100K is a good guideline there I think. You can do more, but things start to get problematic by the time you're at 1M.

Brian
 



Brian Candler

Aug 7, 2020, 7:22:29 AM
to Prometheus Users
On Friday, 7 August 2020 08:57:30 UTC+1, Pau Freixes wrote:
> http_api_requests{service_name="foo", status_code="200", resource_path="/a"}
> http_api_requests{service_name="foo", status_code="500", resource_path="/b"}
> http_api_requests{service_name="bar", status_code="200", resource_path="/c"}
> http_api_requests{service_name="bar", status_code="500", resource_path="/d"}


Beware of any label that comes from user-supplied data. If I hit your server with path /xyz123 (which is an invalid resource, and you respond with 404), then you don't want to create a new time series with the label resource_path="/xyz123". This would mean I could create an unlimited number of time series. However, if you group all invalid resources together into the same bucket, that would be OK.
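One common mitigation, sketched here with a made-up route set and helper name, is to collapse any path that does not match a known route onto a single catch-all label value before incrementing the metric:

```python
# Sketch: map user-supplied paths onto a fixed set of label values so
# that invalid resources cannot create unbounded time series.
KNOWN_ROUTES = {"/a", "/b", "/c", "/d"}  # hypothetical route set

def path_label(path: str) -> str:
    # Only recognised routes keep their own label value; everything else
    # (e.g. a probe of /xyz123 answered with 404) shares one bucket.
    return path if path in KNOWN_ROUTES else "<invalid>"

print(path_label("/a"))        # /a
print(path_label("/xyz123"))   # <invalid>
```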