Recording the hostnames of all targets.

Yagyansh S. Kumar

Apr 24, 2020, 2:45:35 AM
to Prometheus Users
Hi. I am using IP:PORT as targets (I know it's not ideal, I should have only IPs, and will switch soon) for all my node exporter jobs. I get the hostnames of my servers into the alerts by using group_left to join them onto my original alert query. The problem is that when the Node Exporter Down alert or the server-down alert (determined with a Blackbox ICMP ping) fires, I cannot have the hostname as Node Exporter won't be scraping any metrics at that point.

Can a recording rule do the job here? I am not sure how such a recording rule would be created or how it would work.
Also, if there is a better way to approach this, please point me in that direction.

Thanks in advance!

Brian Candler

Apr 24, 2020, 4:36:31 AM
to Prometheus Users
> I cannot have the hostname as Node Exporter won't be scraping any metrics at that point.

Can you give some more specific examples?  What metric are you joining with - perhaps node_uname_info?

Note that the "up" metric will still exist (with a value of 0) when a scrape fails - this means:
(a) you can join on it, and
(b) you can alert on this condition, i.e. scrape failed / node_exporter is down.  This is a different condition than "blackbox_exporter says host/service is down, but node_exporter is still being scraped".  Hence the alerting rule for (up == 0) can be written to avoid the join.  There is actually a benefit here: you'll only get one alert when the host goes down, instead of lots.
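
As a rough sketch of point (b), an alerting rule on `up` that needs no join might look like the following. The group name, the job name `node`, the `for` duration and the severity label are assumptions for illustration, not taken from the thread.

```yaml
groups:
  - name: availability            # hypothetical group name
    rules:
      - alert: NodeExporterDown
        # "up" is written by Prometheus itself for every target, so it
        # still exists (with value 0) when the scrape fails.
        expr: up{job="node"} == 0   # job name is an assumption
        for: 5m                     # assumed grace period
        labels:
          severity: critical
        annotations:
          summary: "Scrape of {{ $labels.instance }} is failing"
```

This alert carries only the target's own labels (such as `instance`), so it has no hostname unless one is attached to the target by other means.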

Other solutions you can consider:

1. Add labels to your targets at scrape time, either by adding static labels (file_sd_config) or using relabelling

2. Generate an entirely separate metadata timeseries, which is not scraped from the node itself.

This can be done by:

(a) a static recording rule as you suggested, see

There it's being used for alert thresholds, but you can just as well do this for metadata as per

(b) a static web page that you scrape containing all the metadata for all the targets - for an example see
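
To illustrate option 1 above, here is a minimal sketch of a file_sd targets file that attaches a static label to each target. The addresses, the hostnames and the label name `hostname` are made up; a name other than `nodename` is used here only to avoid clashing with the label already present on node_uname_info.

```yaml
# Hypothetical targets file, referenced from file_sd_configs in the
# node exporter scrape job.
- targets:
    - "10.0.0.11:9100"
  labels:
    hostname: "web-01"    # assumed label name and value
- targets:
    - "10.0.0.12:9100"
  labels:
    hostname: "db-01"
```

Labels set this way belong to the target, so they appear on every series scraped from it and also on `up`, including when the scrape fails.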

Yagyansh S. Kumar

Apr 24, 2020, 5:09:44 AM
to Prometheus Users
Thanks Brian.

> Can you give some more specific examples?  What metric are you joining with - perhaps node_uname_info?

      - alert: HighCpuLoadCrit
        expr: (node_load15 > (2 * count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))) * on(instance) group_left(nodename) node_uname_info

> Note that the "up" metric will still exist (with a value of 0) when a scrape fails - this means:
> (a) you can join on it, and

UP metrics will exist but if the node exporter itself is down, it won't expose the metric at that time right? So, I won't get the "nodename" label from node_uname_info.

> (b) you can alert on this condition, i.e. scrape failed / node_exporter is down.  This is a different condition than "blackbox_exporter says host/service is down, but node_exporter is still being scraped".  Hence the alerting rule for (up == 0) can be written to avoid the join.  There is actually a benefit here: you'll only get one alert when the host goes down, instead of lots.

I am only using up == 0, and I also use it as an inhibition rule, but (up == 0) by itself won't give me the hostname. My main aim is to get the hostname for every alert. But when the server is actually down, node exporter will also be down, and again I won't get the nodename label.

Please correct me if I am wrong anywhere.

Brian Candler

Apr 24, 2020, 4:30:28 PM
to Prometheus Users
On Friday, 24 April 2020 10:09:44 UTC+1, Yagyansh S. Kumar wrote:
> UP metrics will exist but if the node exporter itself is down, it won't expose the metric at that time right? So, I won't get the "nodename" label from node_uname_info.

That's correct.  You could at least alert on that one condition without a "nodename", in the knowledge that all other alerts *would* have a nodename.

If you want to join on something always, then you have to make sure that thing exists - such as a static recording rule, or a metric which is being scraped from somewhere else, like a static http page.
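
A minimal sketch of the static recording rule idea, assuming a made-up metric name `node_meta`, made-up addresses and hostnames, and a job called `node`. Each rule emits one constant series per target, so the join still has something to match when node_exporter is down:

```yaml
groups:
  - name: static_metadata          # hypothetical group name
    rules:
      # One constant series per target; none of this is scraped from
      # the node itself.
      - record: node_meta          # assumed metric name
        expr: vector(1)
        labels:
          instance: "10.0.0.11:9100"
          hostname: "web-01"
      - record: node_meta
        expr: vector(1)
        labels:
          instance: "10.0.0.12:9100"
          hostname: "db-01"

      - alert: NodeExporterDown
        # The join copies "hostname" onto the alert. A target without
        # a node_meta series drops out of the result, so the static
        # list has to be kept in sync with the targets.
        expr: (up{job="node"} == 0) * on(instance) group_left(hostname) node_meta
```

The trade-off is that the hostname list is maintained by hand (or generated) alongside the target list, rather than discovered from the node.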