Checking whether NFS is hung using node_exporter

Yagyansh S. Kumar

Mar 3, 2020, 8:10:47 AM
to Prometheus Users
I want to check whether an NFS mount is hung, i.e. whether it is accessible from the server and, if so, what response time it is getting. I know the mountstats and nfs collectors expose a lot of NFS metrics, but I haven't found any that reliably tells me every time an NFS mount hangs.
Thanks in advance.

Murali Krishna Kanagala

Mar 3, 2020, 10:05:03 AM
to Yagyansh S. Kumar, Prometheus Users
Try enabling the NFS options in the node exporter config. They will expose some metrics about the NFS status.

Also look at the disk I/O metrics from node exporter; if you see no activity, that indicates NFS is not doing anything.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/06929518-d3b5-4c2f-9490-b08cc664d26b%40googlegroups.com.

Yagyansh S. Kumar

Mar 3, 2020, 10:12:28 AM
to Prometheus Users
I have already enabled the nfs and nfsd collectors. So far I haven't found anything that accurately tells me about an NFS hang.
Correct me if I am wrong, but I don't think the absence of disk I/O is a good indicator of an NFS hang: there may be times when no activity is happening on an NFS mount, but that does not mean it is hung. (E.g. I have 25 NFS mounts on one of my servers; some of them are rarely used, so we won't find any substantial I/O on those mounts, but I still need to know whether they are accessible.) Still, thanks for the suggestion; I will try it out.

Murali Krishna Kanagala

Mar 3, 2020, 10:16:27 AM
to Yagyansh S. Kumar, Prometheus Users
I would write a small shell script that tries to write to the NFS mount path and records the result in a file that the textfile collector can read, then schedule that script with cron. I think this is the easiest solution.
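
A minimal sketch of the cron-driven probe described above, under several assumptions that are not in the thread: the metric name nfs_mount_accessible, the output directory, the 5-second timeout, and the --run flag are all hypothetical, and it assumes GNU coreutils (timeout, stat -f) on Linux. It substitutes a read-only statfs check for the write suggested above, so it needs no write permission on the mounts.

```shell
#!/bin/sh
# Hypothetical NFS probe for the node_exporter textfile collector.
# Assumption: node_exporter is started with --collector.textfile.directory
# pointing at TEXTFILE_DIR (this path is an example, not a default).
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"
PROBE_TIMEOUT="${PROBE_TIMEOUT:-5}"   # seconds to wait per mount

# Print 1 if a statfs on the path returns within the timeout, else 0.
# SIGKILL is used so a blocked probe is terminated as forcefully as possible.
probe_mount() {
    if timeout -s KILL "$PROBE_TIMEOUT" stat -f "$1" >/dev/null 2>&1; then
        echo 1
    else
        echo 0
    fi
}

# List NFS mount points from /proc/mounts (field 2 = path, field 3 = type).
list_nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" {print $2}' /proc/mounts
}

# Print one sample per mount path given as an argument.
# Paths containing quotes or backslashes are not escaped in this sketch.
emit_metrics() {
    echo '# HELP nfs_mount_accessible 1 if the mount answered within the timeout, 0 otherwise.'
    echo '# TYPE nfs_mount_accessible gauge'
    for mp in "$@"; do
        printf 'nfs_mount_accessible{mountpoint="%s"} %s\n' "$mp" "$(probe_mount "$mp")"
    done
}

# Write to a temporary file and rename it, so the collector never reads
# a half-written file.
write_metrics() {
    tmp="$TEXTFILE_DIR/nfs_probe.prom.$$"
    # Word splitting is intentional here: one mount point per word
    # (mount paths containing spaces are not handled).
    emit_metrics $(list_nfs_mounts) > "$tmp" && mv "$tmp" "$TEXTFILE_DIR/nfs_probe.prom"
}

# Only probe the real mounts when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
    write_metrics
fi
```

Installed at a hypothetical path such as /usr/local/bin/nfs_probe.sh, it could be scheduled with a crontab entry like "* * * * * /usr/local/bin/nfs_probe.sh --run", and an alert could then fire on nfs_mount_accessible == 0. If the probe itself gets stuck, the file stops being refreshed; the textfile collector's node_textfile_mtime_seconds metric can be used to catch that case.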

Yagyansh S. Kumar

Mar 3, 2020, 10:26:13 AM
to Prometheus Users
I also thought about doing the same, but I am keeping that as a last resort because that would require me to push the script to all my 2500+ servers.

Serkan Çoban

Mar 3, 2020, 12:03:01 PM
to Yagyansh S. Kumar, Prometheus Users
If I remember correctly, node exporter will hang too when an NFS share hangs. Maybe you can test it.

Ben Kochie

Mar 3, 2020, 12:12:14 PM
to Serkan Çoban, Yagyansh S. Kumar, Prometheus Users
We added some mitigation for filesystem hangs. The node_exporter will notice a stuck filesystem and stop attempting to gather metrics from it until it gets unstuck. However, I don't think we have any metrics for when that happens, only log errors.

Yagyansh S. Kumar

Mar 3, 2020, 12:17:40 PM
to Prometheus Users
If it stops gathering the metrics altogether, can we safely say that whenever we have no metrics for an NFS mount, it is because the mount is stuck?

Julien Pivotto

Mar 3, 2020, 12:18:02 PM
to Ben Kochie, Serkan Çoban, Yagyansh S. Kumar, Prometheus Users
Hi,

We have a dedicated job that collects disk metrics:

- job_name: node_disks
  params:
    collect[]:
      - diskstats
      - filefd
      - filesystem
      - mdadm
      - mountstats
      - nfs
      - nfsd
- job_name: node
  params:
    collect[]:
      - arp
      - bonding
      - conntrack
      - cpu
      - entropy
      - hwmon
      - infiniband
      - loadavg
      - meminfo
      - netclass
      - netdev
      - netstat
      - ntp
      - processes
      - sockstat
      - stat
      - textfile
      - time
      - timex
      - uname
      - vmstat
      - xfs

A stale NFS mount will usually be noticed by this rule:
up{job="node_disks"}==0 and label_replace(up{job="node"}==1,"job","node_disks","","")
And a second rule:
node_filesystem_avail_bytes offset 8h unless node_filesystem_avail_bytes and on(job, instance) up == 1

Those two expressions seem to have worked fine for us in the past.

--
(o- Julien Pivotto
//\ Open-Source Consultant
V_/_ Inuits - https://www.inuits.eu

sayf eddine Hammemi

Mar 3, 2020, 12:19:57 PM
to Ben Kochie, Serkan Çoban, Yagyansh S. Kumar, Prometheus Users
If node exporter logs errors when the NFS share hangs, then you can use mtail, for example, to parse the node exporter log files and export the NFS errors. That would be better than using a hand-made script.

Yagyansh S. Kumar

Mar 3, 2020, 12:33:13 PM
to Prometheus Users
@Julien - Can you please explain a bit what exactly you are checking and how you conclude that the NFS mount is in fact in a hung state?

Yagyansh S. Kumar

Mar 3, 2020, 12:38:24 PM
to Prometheus Users
@sayf eddine Hammemi -
This seems like a very tedious task, given that I have to monitor 2500+ servers.

Murali Krishna Kanagala

Mar 3, 2020, 7:44:47 PM
to Yagyansh S. Kumar, Prometheus Users
Another option would be to run a custom exporter on one box that can SSH to the rest and run the necessary commands.
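
The central approach could be sketched as a loop over hosts, assuming key-based SSH access from the monitoring box. The host label, the metric name nfs_mount_accessible, the SSH_CMD override, and the timeouts are hypothetical. Rather than a full exporter, this sketch only prints samples in the Prometheus text format, which something else would have to serve or write into a textfile collector directory.

```shell
#!/bin/sh
# Hypothetical central probe: checks one mount path on each remote host over SSH.
SSH_CMD="${SSH_CMD:-ssh}"          # overridable, e.g. for testing
SSH_TIMEOUT="${SSH_TIMEOUT:-10}"   # overall seconds allowed per host

# Print 1 if the remote statfs succeeds within the timeout, else 0.
# A 0 can also mean that the host itself or SSH is unreachable.
probe_host() {
    host="$1"
    mount="$2"
    if timeout -s KILL "$SSH_TIMEOUT" "$SSH_CMD" -o BatchMode=yes -o ConnectTimeout=5 \
        "$host" "stat -f -- '$mount'" >/dev/null 2>&1; then
        echo 1
    else
        echo 0
    fi
}

# Usage: probe_hosts MOUNT HOST...
# Prints one sample per host for the given mount path.
probe_hosts() {
    mount="$1"
    shift
    echo '# TYPE nfs_mount_accessible gauge'
    for host in "$@"; do
        printf 'nfs_mount_accessible{host="%s",mountpoint="%s"} %s\n' \
            "$host" "$mount" "$(probe_host "$host" "$mount")"
    done
}
```

For example, probe_hosts /mnt/data web1 web2 would print one line per host. The loop is sequential, so with thousands of servers it would need batching or parallelism, and a killed SSH session can leave the remote stat process blocked on the hung mount.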
