How to aggregate data from different machines?

luv -

Mar 3, 2020, 4:49:08 AM

to Prometheus Users

Hi all!

I use Prometheus to scrape about 500 service machines, and each machine exposes about 3500 samples. When I use the federate interface, I find that Prometheus adds instance and job labels to the data from each machine, so there are 500 * 3500 = 1,750,000 samples. This makes queries and writes use a lot of memory. I don't care about the per-machine dimension. Is there any way to remove the instance label and aggregate the data before it is written?

for example:
http_requests_total{methed="GET", code="200", instance="ip1:port", job="job1"} 100
http_requests_total{methed="GET", code="200", instance="ip2:port", job="job1"} 50

merge to:
http_requests_total{methed="GET", code="200", instance="", job="job1"} 150

Has anyone encountered the same problem? How did you solve it, or is there another solution?

Thx!

Brian Candler

Mar 3, 2020, 6:11:51 AM

to Prometheus Users

On Tuesday, 3 March 2020 09:49:08 UTC, luv - wrote:

> I use Prometheus to scrape about 500 service machines, and each machine exposes about 3500 samples.

500 machines each generating 3500 different metrics is 1,750,000 timeseries.

> When I use the federate interface, I find that Prometheus adds instance and job labels to the data from each machine, so there are 500 * 3500 = 1,750,000 samples. This makes queries and writes use a lot of memory.

There are indeed 1,750,000 timeseries. The labels themselves don't use up any space, except in the timeseries index. A rule of thumb is that about 2 million timeseries is the point where you start thinking about splitting up scrapes between multiple servers.

> I don't care about the per-machine dimension. Is there any way to remove the instance label and aggregate the data before it is written?

> for example:
> http_requests_total{methed="GET", code="200", instance="ip1:port", job="job1"} 100
> http_requests_total{methed="GET", code="200", instance="ip2:port", job="job1"} 50
>
> merge to:
> http_requests_total{methed="GET", code="200", instance="", job="job1"} 150

You can't aggregate before writing, unless you write your own exporter that does this. Alternatively, you could use statsd_exporter and have all the targets push their counter updates to it.
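As a rough sketch of the statsd_exporter option, each target would send counter increments as StatsD lines over UDP instead of being scraped. The snippet assumes the exporter listens on UDP port 9125 on localhost and that labels are passed as DogStatsD-style tags. The metric name, host, port, and helper names are illustrative and not taken from the original setup.

```python
import socket


def format_counter(name, value, tags):
    """Build a StatsD counter line with DogStatsD-style tags (assumed format)."""
    tag_part = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    line = f"{name}:{value}|c"
    return f"{line}|#{tag_part}" if tag_part else line


def push_counter(line, host="127.0.0.1", port=9125):
    """Send one line over UDP. Host and port are assumptions for this sketch."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Each service machine would call push_counter(line) once per handled request.
line = format_counter("http_requests", 1, {"method": "GET", "code": "200"})
print(line)
```

Because every target increments the same counter in the exporter, no instance label exists to begin with, so the central Prometheus only sees the already-merged series.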

You can use a recording rule to generate the aggregate - and then when you scrape the /federate endpoint pass a match[] query so that only the aggregate timeseries is returned.
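To illustrate the federation half of this, here is a minimal sketch that builds a /federate URL with a match[] selector. It assumes the recording rule stores something like `sum without (instance) (http_requests_total)` under a name beginning with `job:`, for example `job:http_requests_total:sum`. The rule name, the host `prometheus.example:9090`, and the helper function are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url, matchers):
    """Return a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base_url}/federate?{query}"


# Select only series whose name starts with "job:", i.e. the recorded
# aggregates under the naming assumption above.
selector = '{__name__=~"job:.*"}'
url = federate_url("http://prometheus.example:9090", [selector])
print(url)

# Round-trip check: decoding the query string gives the selector back.
print(parse_qs(urlsplit(url).query)["match[]"])
```

In a real setup the same selector would usually go into the `params` section of the federation scrape job rather than being built by hand.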

Note that if you simply stripped the labels, you would get conflicting data. For example, at one scrape instant you might have:

http_requests_total{methed="GET", code="200"} 100
http_requests_total{methed="GET", code="200"} 50

Is the value of the counter at this point in time 100 or 50? Answer: it's neither (it should be 150). And if you look at the metric over time, it would bounce up and down as it flips between different counter values.
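The difference between stripping a label and aggregating over it can be sketched in a few lines. This is only an illustration using the two samples from the thread; the helper names are made up, and `sum_without` merely mimics what a PromQL `sum without (instance)` would compute for a single scrape instant.

```python
from collections import defaultdict

# The two samples from the thread (label names copied as written there).
samples = [
    ({"methed": "GET", "code": "200", "instance": "ip1:port", "job": "job1"}, 100),
    ({"methed": "GET", "code": "200", "instance": "ip2:port", "job": "job1"}, 50),
]


def series_key(labels, drop):
    """Identity of a series once the labels in `drop` are removed."""
    return tuple(sorted((k, v) for k, v in labels.items() if k not in drop))


def strip_labels(samples, drop):
    """Drop labels without aggregating: colliding series keep every value."""
    seen = defaultdict(list)
    for labels, value in samples:
        seen[series_key(labels, drop)].append(value)
    return dict(seen)


def sum_without(samples, drop):
    """Drop labels and add up the values of series that collide."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[series_key(labels, drop)] += value
    return dict(totals)


print(strip_labels(samples, {"instance"}))  # one series, two conflicting values
print(sum_without(samples, {"instance"}))   # one series, one summed value
```

Stripping leaves one series identity with two competing values (100 and 50), while the aggregation yields the single value 150.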

luv -

Mar 3, 2020, 10:21:25 PM

to Prometheus Users

Thank you for your help. Your reply is very clear, and I think I know how to do it now. I will try using a recording rule to optimize this.