Hi,
After reading [1] and [2], I have the feeling that readers (or maybe only I) may have trouble understanding when this rule applies, under what circumstances it should be applied, and how.
As I understand it (correct me if I'm wrong), Prometheus encourages using labels to slice your metrics [2], for example to identify which service owns a time series.
Considering an HTTP metric named http_api_requests, it would be fine to have different time series under the same metric name, identified by the following label values:
http_api_requests service_name=foo, status_code=200
http_api_requests service_name=foo, status_code=500
http_api_requests service_name=bar, status_code=200
http_api_requests service_name=bar, status_code=500
And with not 2 but 1K different services, this would still be fine, since the total number of time series would still be manageable.
However, the following passage in [1] could be misunderstood:
> As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.
Reading the previous example alongside this general guideline, someone could conclude that adding service_name as a label breaks that rule.
As I understand it (again, correct me if I'm wrong), this guideline should be read as being about the side effects of adding a label with a large cardinality, or of adding a label that does not have a large cardinality on its own but, once combined with other labels, causes an explosion in the number of time series.
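To make the "explosion" concrete: for a single metric name, the number of time series is at most the product of the number of distinct values of each label. A minimal Python sketch, where max_series is a hypothetical helper and the per-label counts (1K services, 2 status codes, 50 resource paths) are purely illustrative:

```python
from math import prod


def max_series(distinct_values):
    """Upper bound on the number of series for one metric name:
    the product of the number of distinct values of each label.
    A metric with no labels (empty dict) has exactly one series."""
    return prod(distinct_values.values())


# Hypothetical label cardinalities for http_api_requests.
labels = {"service_name": 1000, "status_code": 2}
print(max_series(labels))  # 2000

# Adding a third label multiplies the bound again.
labels["resource_path"] = 50
print(max_series(labels))  # 100000
```

This is only an upper bound: if a given resource path appears under just one service, the real count is lower, but it still grows multiplicatively with every label added.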
For example, going back to http_api_requests, what would happen if we added the resource path as a label? We would have something like this:
http_api_requests service_name=foo, status_code=200, resource_path="/a"
http_api_requests service_name=foo, status_code=500, resource_path="/b"
http_api_requests service_name=bar, status_code=200, resource_path="/c"
http_api_requests service_name=bar, status_code=500, resource_path="/d"
Would this become an issue? I have the feeling that it depends on how the query is written. If the query also narrows by service name, this should not be a problem, since the number of time series involved should still be manageable; if the query is not filtered by service name, however, the number of time series involved will most likely be unmanageable.
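A rough way to see the difference between the filtered and the unfiltered query is to count how many series each selector matches. A minimal Python sketch under made-up assumptions (1K services, 5 status codes, 20 resource paths per service); make_series and select are hypothetical helpers, and select only mimics PromQL equality matchers, not how Prometheus actually evaluates a query:

```python
from itertools import product


def make_series(services, status_codes, paths_per_service):
    """Build one label set per (service, status_code, path) combination.
    Paths are per service (e.g. /svc0/p0), so they never overlap."""
    series = []
    for svc, code in product(services, status_codes):
        for p in range(paths_per_service):
            series.append({
                "service_name": svc,
                "status_code": code,
                "resource_path": f"/{svc}/p{p}",
            })
    return series


def select(series, **matchers):
    """Keep series whose labels equal every matcher, like {service_name="svc0"}."""
    return [s for s in series if all(s.get(k) == v for k, v in matchers.items())]


services = [f"svc{i}" for i in range(1000)]
all_series = make_series(services, ["200", "404", "500", "502", "503"], 20)
print(len(all_series))                               # 100000
print(len(select(all_series, service_name="svc0")))  # 100
```

Here the selector narrowed to one service touches 100 series while the unfiltered one touches all 100000. This only models the query side: all 100000 series are still ingested, stored and indexed regardless of how they are queried.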
If this is true, and the second query would most likely not make sense anyway, why not prefix the metric name with the service name, to prevent future queries from breaking the system by mistake?
As another example, let's consider adding the pod ID as a label. It can take thousands of different values, but they are somewhat stable within a given time window. The metric would look like this:
http_api_requests service_name=foo, status_code=200, resource_path="/a", pod_name="1ef"
http_api_requests service_name=foo, status_code=500, resource_path="/b", pod_name="2ef"
http_api_requests service_name=bar, status_code=200, resource_path="/c", pod_name="3ef"
http_api_requests service_name=bar, status_code=500, resource_path="/d", pod_name="4ef"
The queries we typically run won't slice by pod, but they will still narrow by service name. Consider a scenario where we have a roughly stable number of 500 pods within a time window: would such a query still be manageable by Prometheus?
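Since each pod belongs to exactly one service, pod_name multiplies the series of that service only, not of every service. A back-of-the-envelope Python sketch, assuming 100 services with 5 pods each (500 pods in total), 5 status codes and 20 resource paths per service; series_with_pods and its rollouts parameter are hypothetical, and each rollout is assumed to replace every pod with a newly named one:

```python
def series_with_pods(n_services, pods_per_service, n_codes,
                     paths_per_service, rollouts=0):
    """Return (series for one service, series for the whole metric).

    pod_name only multiplies the series of the service owning the pod.
    Assumption: each rollout replaces all pods with new names, and the
    old series still fall inside a range query spanning the rollout, so
    every pod "generation" adds a full new set of series to that window.
    """
    pod_generations = 1 + rollouts
    per_service = pods_per_service * pod_generations * n_codes * paths_per_service
    return per_service, n_services * per_service


print(series_with_pods(100, 5, 5, 20))              # (500, 50000)
print(series_with_pods(100, 5, 5, 20, rollouts=2))  # (1500, 150000)
```

With these numbers, a query narrowed to one service touches 500 series (1500 if the window spans two rollouts), while the whole metric has 50000 (150000).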
The node_exporter example you provide looks fine to me, since we would still always narrow the query to one specific service, which dramatically reduces the number of time series involved in the query.
Am I missing something in my reasoning? If not, would it make sense to reword the following passage a bit?
> As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.
Should the rule of thumb instead be based on the number of time series involved in a query, where this number should be < X?
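Expressed as code, the rule of thumb I am proposing would look roughly like the following Python sketch. within_budget is a hypothetical helper that only handles equality matchers, and x stands for the "X" above; I am not suggesting any particular value, and the 10 in the usage lines is arbitrary:

```python
def within_budget(series, matchers, x):
    """Count the series a selector matches and check that the count is < x.

    series:   list of label dicts, one per time series
    matchers: equality matchers, e.g. {"service_name": "svc0"}
    x:        the per-query series budget; the value is up to you
    """
    matched = sum(
        1 for s in series
        if all(s.get(name) == value for name, value in matchers.items())
    )
    return matched, matched < x


# 500 hypothetical pods spread evenly over 100 services.
series = [{"service_name": f"svc{i % 100}", "pod_name": f"pod{i}"} for i in range(500)]
print(within_budget(series, {"service_name": "svc0"}, x=10))  # (5, True)
print(within_budget(series, {}, x=10))                        # (500, False)
```

An empty matcher set selects every series, which is exactly the unfiltered query the budget is meant to catch.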
Thanks!
[1] https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels
[2] https://www.robustperception.io/target-labels-not-metric-name-prefixes
--
--pau