How to aggregate data from different machines?


luv -

Mar 3, 2020, 4:49:08 AM
to Prometheus Users
Hi all!    
I use Prometheus to scrape about 500 service machines, and each machine generates about 3,500 samples. When I use the federation interface (/federate), I find that Prometheus adds instance and job labels to the data from each machine, so there are 500 * 3500 = 1,750,000 samples. This makes queries and writes use a very large amount of memory. I don't care about the per-machine dimension. Is there any way to remove the instance label and aggregate the data before it is written?

For example:
http_requests_total{methed="GET", code="200", instance="ip1:port", job="job1"} 100
http_requests_total{methed="GET", code="200", instance="ip2:port", job="job1"} 50

should be merged into:
http_requests_total{methed="GET", code="200", instance="", job="job1"} 150



Has anyone encountered the same problem? If so, how did you solve it, or is there another solution?
Thx!

Brian Candler

Mar 3, 2020, 6:11:51 AM
to Prometheus Users
On Tuesday, 3 March 2020 09:49:08 UTC, luv - wrote: 
I use Prometheus to scrape about 500 service machines, and each machine generates about 3,500 samples.

500 machines each generating 3500 different metrics is 1,750,000 timeseries.
 
When I use the federation interface (/federate), I find that Prometheus adds instance and job labels to the data from each machine, so there are 500 * 3500 = 1,750,000 samples. This makes queries and writes use a very large amount of memory.

There are indeed 1,750,000 timeseries.  The labels themselves don't use up any space, except in the timeseries index.  A rule of thumb is that about 2 million timeseries is the point where you start thinking about splitting up scrapes between multiple servers.
 
I don't care about the per-machine dimension. Is there any way to remove the instance label and aggregate the data before it is written?

For example:
http_requests_total{methed="GET", code="200", instance="ip1:port", job="job1"} 100
http_requests_total{methed="GET", code="200", instance="ip2:port", job="job1"} 50

should be merged into:
http_requests_total{methed="GET", code="200", instance="", job="job1"} 150




You can't aggregate before writing, unless you write your own exporter that does this. Alternatively, you could use statsd_exporter and have all the targets push their counter updates to it.

You can use a recording rule to generate the aggregate, and then pass a match[] query when you scrape the /federate endpoint so that only the aggregate timeseries are returned.
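
As a rough sketch of the recording-rule approach, a rule file on the Prometheus server that scrapes the 500 machines could look like the following. The file name, group name, and recorded series names are hypothetical and chosen only for illustration. `sum without (instance)` keeps every label except `instance`. The second rule is an optional variation that was not proposed in this thread: for counters it is generally recommended to take `rate()` before `sum()`, so that a restart of one machine does not make the aggregate drop.

```yaml
# aggregate.rules.yml (hypothetical file name), listed under rule_files in prometheus.yml
groups:
  - name: aggregate_without_instance   # hypothetical group name
    rules:
      # Sum the per-machine counters, dropping only the instance label.
      # For the example in this thread: 100 + 50 = 150.
      - record: job:http_requests_total:sum
        expr: sum without (instance) (http_requests_total)

      # Optional variation: per-second request rate, aggregated across machines.
      - record: job:http_requests:rate5m
        expr: sum without (instance) (rate(http_requests_total[5m]))
```

The upstream server could then federate only the recorded series. This is again a sketch: the job name and the target address `source-prometheus:9090` are placeholders, and the `match[]` selector assumes the `job:` prefix used in the rule names above.

```yaml
scrape_configs:
  - job_name: federate            # hypothetical job name
    honor_labels: true            # keep the job label from the source server
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'  # return only the aggregated series
    static_configs:
      - targets:
          - 'source-prometheus:9090'   # placeholder address
```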

Note that if you simply stripped the labels, you would get conflicting data.  For example, at one scrape instant you might have:

http_requests_total{methed="GET", code="200"} 100
http_requests_total{methed="GET", code="200"} 50

Is the value of the counter at this point in time 100 or 50?  Answer: it's neither (it should be 150).  And if you look at the metric over time, it would bounce up and down as it flips between different counter values.

luv -

Mar 3, 2020, 10:21:25 PM
to Prometheus Users
Thank you for your help. Your reply is very clear, and I think I know how to do it now. I will try using a recording rule to optimize this.

Thank you again!

On Tuesday, March 3, 2020 at 7:11:51 PM UTC+8, Brian Candler wrote: