Metrics aggregation after label is dropped


Shubham

Dec 28, 2023, 8:53:25 AM
to Prometheus Users
I'm currently focused on optimizing Prometheus for higher performance by reducing cardinality in our setup. I'm seeking assistance in understanding how Prometheus aggregates metrics after a label is dropped. I couldn't find any documentation on this.

For example, consider the following metric:

response_bucket{le="0.1", status="200", path="/api/users"} = 10
response_bucket{le="0.1", status="500", path="/api/users"} = 5
response_bucket{le="0.1", status="200", path="/api/products"} = 8
response_bucket{le="0.1", status="500", path="/api/products"} = 8

Since we're not using the 'status' label in our dashboard queries, I want to drop the 'status' label. How would Prometheus create the final series values? 

I'm using the following config to drop the label from a specific metric:

      - source_labels: [__name__, status]
        regex: (response_bucket.*)
        replacement: ""
        target_label: status

Brian Candler

Dec 28, 2023, 10:37:35 AM
to Prometheus Users
If you do that, scraping will fail due to duplicate timeseries:

response_bucket{le="0.1",path="/api/users"} 10
response_bucket{le="0.1",path="/api/users"} 5   ** DUPLICATE TIMESERIES **
response_bucket{le="0.1",path="/api/products"} 8
response_bucket{le="0.1",path="/api/products"} 8  ** DUPLICATE TIMESERIES **

You can use Recording Rules to create new timeseries with the appropriate aggregations.  However, you will then be storing the original timeseries *as well as* the new timeseries.
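As a rough sketch of what such a rule could look like (the file name, group name and rule name below are made up for illustration, and the 5m window is an assumption): it takes the rate of each bucket first and then sums away the "status" label, so counter resets are handled per original series. With the sample values above, a plain `sum without (status) (response_bucket)` would give 15 for /api/users and 16 for /api/products.

```yaml
# rules.yml (hypothetical file name), referenced from rule_files in prometheus.yml
groups:
  - name: response_aggregation                  # hypothetical group name
    rules:
      # Per-second rate of every bucket, summed over all values of "status".
      # rate() is applied before sum() so that counter resets are handled
      # on each original series rather than on the aggregate.
      - record: job_path:response_bucket:rate5m   # hypothetical rule name
        expr: sum without (status) (rate(response_bucket[5m]))
```

The recorded series still carry the "le" label, so a dashboard query along the lines of `histogram_quantile(0.95, job_path:response_bucket:rate5m)` should work against them.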

You can change your exporter to do the aggregation for you (for example, don't create separate buckets per status code).

But to be honest, I think you might be doing the wrong thing here.  Prometheus can easily store several million timeseries; making small reductions in the number of timeseries is unlikely to bring you any benefits.

Where you should be worried is where you have the potential for tens or hundreds of millions of series.  The number of HTTP status codes you generate is surely very small. How many different "path" values are there in your application? Probably only in the low hundreds.  But if you had a label which contained the client IP address, and the API could be accessed from anywhere on the Internet, that would probably result in a cardinality explosion.

Now, if you had tens of thousands of different servers, each with its own set of "response_bucket" metrics, and all scraped into the same Prometheus server, then maybe there would be an argument for aggregating them prior to storage. However, if you do this, you'd lose valuable information such as the proportion of 200 and 500 status responses (unless you have separate counters for those).  And it might be better to have multiple Prometheus servers, each scraping a subset of your estate.
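One possible way to split targets between servers is hashmod relabelling. This is only a sketch: the job name, the shard count of 2 and the temporary label name are assumptions, and each server would get the same config with a different shard number in the "keep" rule.

```yaml
scrape_configs:
  - job_name: app                   # hypothetical job name
    # service discovery or static_configs omitted
    relabel_configs:
      # Hash each target address into one of N shards.
      - source_labels: [__address__]
        modulus: 2                  # total number of Prometheus servers (assumed)
        target_label: __tmp_hash    # temporary label, dropped after relabelling
        action: hashmod
      # Keep only the targets that belong to this server's shard.
      - source_labels: [__tmp_hash]
        regex: "0"                  # use "1" on the second server
        action: keep
```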