Recording the hostnames of all targets.

Yagyansh S. Kumar

Apr 24, 2020, 2:45:35 AM

to Prometheus Users

Hi. I am using IP:PORT as targets for all my node exporter jobs (I know it's not ideal; I should be using only IPs, and I will switch to that soon). I am getting the hostnames of my servers into the alerts by using group_left to join them onto my original alert query. The problem is that when the Node Exporter Down alert or the Server Down alert (determined using a Blackbox ICMP ping) fires, I can't get the hostname, because Prometheus can't scrape any metrics from Node Exporter at that point.

Can a recording rule do the job here? I can't work out how such a recording rule would be created or how it would work.

Also, if there is any better idea or method to approach it, please point me in that direction.

Thanks in advance!

Brian Candler

Apr 24, 2020, 4:36:31 AM

to Prometheus Users

> I cannot have the hostname as Node Exporter won't be scraping any metrics at that point.

Can you give some more specific examples? What metric are you joining with - perhaps node_uname_info?

Note that the "up" metric will still exist (with a value of 0) when a scrape fails - this means:

(a) you can join on it, and

(b) you can alert on this condition, i.e. scrape failed / node_exporter is down. This is a different condition than "blackbox_exporter says host/service is down, but node_exporter is still being scraped". Hence the alerting rule for (up == 0) can be written to avoid the join. There is actually a benefit here: you'll only get one alert when the host goes down, instead of lots.
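A minimal sketch of an alerting rule on (up == 0) that needs no join, assuming the node_exporter scrape job is named `node` (a hypothetical name) and that a `for` duration of 2m suits your environment:

```yaml
groups:
  - name: node_exporter_health
    rules:
      # up is written with value 0 whenever a scrape fails, so this fires
      # once per failed target without needing node_uname_info.
      # job="node" is a hypothetical job name; use your own.
      - alert: NodeExporterDown
        expr: up{job="node"} == 0
        for: 2m
        annotations:
          summary: "node_exporter on {{ $labels.instance }} is not being scraped"
```

The alert can only identify the target by its `instance` label (IP:PORT here), which is exactly the limitation discussed further down the thread.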

Yagyansh S. Kumar

Apr 24, 2020, 5:09:44 AM

to Prometheus Users

Thanks, Brian.

> Can you give some more specific examples? What metric are you joining with - perhaps node_uname_info?

- alert: HighCpuLoadCrit
  expr: (node_load15 > (2 * count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))) * on(instance) group_left(nodename) node_uname_info
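For context, a minimal sketch of how the joined `nodename` label can then be used in the alert text; the `for` duration and the annotation wording are hypothetical additions, only the expression comes from the rule quoted here:

```yaml
groups:
  - name: node_load
    rules:
      - alert: HighCpuLoadCrit
        # group_left(nodename) copies nodename from node_uname_info onto the
        # result; node_uname_info has value 1, so the load value is unchanged.
        expr: (node_load15 > (2 * count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))) * on(instance) group_left(nodename) node_uname_info
        for: 10m  # hypothetical
        annotations:
          summary: "15-minute load on {{ $labels.nodename }} ({{ $labels.instance }}) exceeds 2x its CPU count"
```

This only works while node_uname_info is being scraped, which is why the same trick cannot label the "node exporter down" case.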

> Note that the "up" metric will still exist (with a value of 0) when a scrape fails - this means:
> (a) you can join on it, and

The up metric will exist, but if node exporter itself is down, it won't expose node_uname_info at that time, right? So I won't get the "nodename" label from node_uname_info.

> (b) you can alert on this condition, i.e. scrape failed / node_exporter is down. This is a different condition than "blackbox_exporter says host/service is down, but node_exporter is still being scraped". Hence the alerting rule for (up == 0) can be written to avoid the join. There is actually a benefit here: you'll only get one alert when the host goes down, instead of lots.

I am already alerting on up == 0 and also using it in an inhibition rule, but up == 0 by itself won't give me the hostname. My main aim is to get the hostname in every alert. But when the server is actually down, node exporter will be down too, and again I won't get the nodename label.
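A minimal sketch of such an inhibition rule in Alertmanager's configuration, assuming the up == 0 alert is named `NodeExporterDown` (a hypothetical name). The `source_matchers`/`target_matchers` form is supported by newer Alertmanager releases; older ones use `source_match`/`target_match` instead:

```yaml
inhibit_rules:
  # While NodeExporterDown fires for a target, suppress every other alert
  # carrying the same instance label. NodeExporterDown is hypothetical.
  - source_matchers:
      - alertname="NodeExporterDown"
    target_matchers:
      - alertname!="NodeExporterDown"
    equal: ['instance']
```

Because `equal` requires identical label values, this only suppresses alerts from the same node_exporter target (same IP:PORT). A blackbox ICMP alert for the same host would typically carry a different `instance` value, so it would not be matched by this rule.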

Please correct me if I am wrong anywhere.

Brian Candler

Apr 24, 2020, 4:30:28 PM

to Prometheus Users

On Friday, 24 April 2020 10:09:44 UTC+1, Yagyansh S. Kumar wrote:

> UP metrics will exist but if the node exporter itself is down, it won't expose the metric at that time right? So, I won't get the "nodename" label from node_uname_info.

That's correct. You could at least alert on that one condition without a "nodename", in the knowledge that all other alerts *would* have a nodename.

If you want to always join on something, then you have to make sure that thing always exists: for example, a static recording rule, or a metric which is being scraped from somewhere else, like a static HTTP page.
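A minimal sketch of the static recording rule approach. The metric name `target_hostname_info`, the job name `node`, and the addresses and hostnames are all hypothetical placeholders: `vector(1)` yields a single series with no labels, and each rule's `labels` block attaches one instance-to-hostname mapping.

```yaml
groups:
  - name: static_hostnames
    rules:
      # One rule per target; all share a (hypothetical) metric name and
      # differ only in their labels. Values and names are placeholders.
      - record: target_hostname_info
        expr: vector(1)
        labels:
          instance: "10.0.0.1:9100"
          nodename: "web01"
      - record: target_hostname_info
        expr: vector(1)
        labels:
          instance: "10.0.0.2:9100"
          nodename: "db01"

  - name: node_exporter_health
    rules:
      # up exists (value 0) when the scrape fails, and the static series
      # always exists, so the join still supplies nodename when the host
      # or node_exporter is down. The alert fires on the 0-valued result.
      - alert: NodeExporterDown
        expr: (up{job="node"} == 0) * on(instance) group_left(nodename) target_hostname_info
```

One caveat of this design: a target missing from the mapping drops out of the join, so its alert would silently not fire. The mapping must be kept in step with the scrape targets. A blackbox ICMP alert would also need mapping entries keyed on whatever `instance` value that job produces.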
