
Hi,
I'm fairly new to Prometheus, so bear with me. What I'm trying to do is graph the number of EC2 instances that have been deleted by a tool of mine. I'm not even sure Prometheus is the right tool for the job.
I have a tool that's persistently running. Every hour or so it terminates a number of unused AWS EC2 instances and exports the numbers at a /metrics endpoint. It currently generates a GaugeMetricFamily called cleaned_instances_total with the labels [cloud, region, type] containing e.g. ['aws', 'us-west-2', 'm5.xlarge'], and the value is the number of instances it just removed.
Prometheus scrapes the target about twice an hour. The first problem I ran into is that I'd like to graph the number of instances that were removed in a day, per region and type. However, because Prometheus scrapes the target more often than the instances are being cleaned, I get repeating values. This is very visible in this output:
The maximum reasonable scrape interval is around 2 minutes due to
staleness, so trying to scrape every 30 minutes will likely cause
issues.
By eye I can tell that those exact repeating values are likely there because the metrics were scraped before the next cleanup run occurred, but I don't know how to express that in a PromQL query.
I suppose I could make the Gauge a Counter, but even then: let's say I have two data points, each reporting 10 terminated instances. How would I know whether the 10 instances at my second timestamp are the same 10 from the first one, or whether the tool terminated 10 instances, was restarted so the Counter reset to 0, and then terminated another 10 instances?
Prometheus isn't designed to give "exact" billing level answers, but is more for "good enough" information for system monitoring purposes. You are right that a counter reset will result in some potential loss of data between scrapes.
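To make the reset question concrete, here is a minimal Python sketch of the reset handling: any drop in a counter's value is treated as a restart from 0. This is a simplified illustration, not Prometheus's actual implementation (the real increase() also extrapolates to the edges of the query window), and the function name reset_aware_increase is hypothetical.

```python
from typing import Sequence


def reset_aware_increase(samples: Sequence[float]) -> float:
    """Total growth of a counter across samples, treating any drop as a reset.

    Simplified on purpose: no extrapolation, no staleness handling.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev  # normal growth (0 when the value just repeats)
        else:
            total += cur  # counter restarted from 0 and grew to cur
    return total


# Repeated scrapes between cleanup runs add nothing to the total.
print(reset_aware_increase([10, 10, 10, 25, 25]))  # 15.0
# Restart hidden between scrapes: 10 -> (reset) -> 10 looks flat, so 10 are lost.
print(reset_aware_increase([10, 10]))  # 0.0
# Restart that is visible because the value dropped.
print(reset_aware_increase([10, 4]))  # 4.0
```

The second case is the one from the question: if the counter climbs back to exactly its old value between two scrapes, the restart is invisible, which is why a shorter scrape interval reduces the loss.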
I would suggest using a counter of the number of instances
terminated (BTW, going by the naming recommendations, the _total
suffix in your metric name already suggests it is a counter) and
setting the scrape interval to something more like 1-2 minutes.
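As a rough sketch of what "counter" means for the exporter side: instead of exposing the number removed in the last run, keep a running total per label set and expose that. The example below is stdlib-only and renders the text exposition format by hand purely for illustration; the helper names record_cleanup and render_metrics and the in-memory store are assumptions, and a real exporter would more likely use the client library's counter type.

```python
from collections import defaultdict

# Hypothetical in-process store: (cloud, region, type) -> cumulative terminations.
_totals = defaultdict(int)


def record_cleanup(cloud: str, region: str, instance_type: str, removed: int) -> None:
    """Add this run's terminations to the running total instead of overwriting it."""
    _totals[(cloud, region, instance_type)] += removed


def render_metrics() -> str:
    """Render the totals in the Prometheus text exposition format.

    Label values are not escaped here; this is a sketch only.
    """
    lines = ["# TYPE cleaned_instances_total counter"]
    for (cloud, region, instance_type), value in sorted(_totals.items()):
        lines.append(
            f'cleaned_instances_total{{cloud="{cloud}",region="{region}",type="{instance_type}"}} {value}'
        )
    return "\n".join(lines) + "\n"


# Two cleanup runs: the exposed value is the sum, not the latest run.
record_cleanup("aws", "us-west-2", "m5.xlarge", 10)
record_cleanup("aws", "us-west-2", "m5.xlarge", 3)
print(render_metrics())
```

With a cumulative value like this, a query along the lines of sum by (region, type) (increase(cleaned_instances_total[1d])) would be the usual way to ask for an approximate per-day figure, and scraping the same value several times between cleanup runs no longer inflates the result.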
I guess one question right now would be, is there a way to deduplicate those identical values? Like, if all labels and values at a point in time are the same as the ones from previous points in time consider it as a single timestamp... or something along those lines.
Is Prometheus even the right tool for what I'm trying to do? Basically, I'm not trying to graph something that happens over time; I'm trying to graph some number of events that happen at a point in time.
It really depends on what you are hoping for. If you want a graph
over time with spikes when things roughly happen, then it can do
that - graph the rate() of your counter. If, however, you want
perfect details of when things happened, you want an event system
rather than a metrics system, so something like Elasticsearch,
Splunk or a more generalised database.
-- Stuart Clark