How to sense disk read/write errors


M Moore

May 18, 2023, 11:25:05 AM
to Prometheus Users
Had two users come at me with "why didn't you...?" about a machine that had disk
hardware failures but no alerts before the device died. They pointed at these messages
in the kernel dmesg:

> [Wed May 17 06:07:05 2023] nvme nvme3: async event result 00010300
> [Wed May 17 06:07:25 2023] nvme nvme3: controller is down; will reset: CSTS=0x2, PCI_STATUS=0x10
> [Wed May 17 11:56:04 2023] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
> [Wed May 17 11:56:04 2023] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
> [Thu May 18 08:06:04 2023] Buffer I/O error on dev nvme3n1, logical block 390703424, async page read
> [Thu May 18 08:07:37 2023] print_req_error: I/O error, dev nvme3c33n1, sector 0
> [Thu May 18 08:07:37 2023] print_req_error: I/O error, dev nvme3c33n1, sector 256
I didn't find an "errors" counter in iostats [1], so I assume node_exporter won't have one either. I did find node_filesystem_device_error, but it stayed at zero the whole time.

What would be the Prometheus-y way to sense these errors so my users can have their alerts? I'm hoping to avoid wiring "logtail | grep -c 'error'" into a counter.

[1] https://www.kernel.org/doc/html/latest/admin-guide/iostats.html

dub...@at.encryp.ch

May 18, 2023, 11:39:57 AM
to promethe...@googlegroups.com
For proper NVMe metrics monitoring you need additional collector script, for example: https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/blob/master/nvme_metrics.sh
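The idea behind that script can be sketched in a few lines: periodically run `nvme smart-log` (from nvme-cli) on each controller, turn the error counters into Prometheus exposition format, and drop the result atomically into node_exporter's textfile-collector directory. This is a hedged sketch, not the real nvme_metrics.sh: the metric names nvme_media_errors and nvme_error_log_entries are made up for illustration, and it assumes nvme-cli is installed and node_exporter runs with --collector.textfile.directory pointing at $OUTDIR.

```shell
#!/bin/sh
# Sketch of a textfile collector for NVMe error counters.
# Run it from cron or a systemd timer, e.g. every minute.

# Convert `nvme smart-log` output ("key : value" lines) on stdin
# into Prometheus metrics. $1 is the device path used as a label.
smart_log_to_metrics() {
  awk -v dev="$1" '
    $1 == "media_errors"        { printf "nvme_media_errors{device=\"%s\"} %s\n", dev, $3 }
    $1 == "num_err_log_entries" { printf "nvme_error_log_entries{device=\"%s\"} %s\n", dev, $3 }
  '
}

collect() {
  # Must match node_exporter's --collector.textfile.directory flag.
  OUTDIR="${OUTDIR:-/var/lib/node_exporter/textfile_collector}"
  TMP="$(mktemp)"
  for dev in /dev/nvme[0-9]; do
    [ -e "$dev" ] || continue
    nvme smart-log "$dev" | smart_log_to_metrics "$dev" >>"$TMP"
  done
  # Atomic rename so node_exporter never scrapes a half-written file.
  mv "$TMP" "$OUTDIR/nvme_errors.prom"
}

# Uncomment to actually collect:
# collect
```

With a counter like this in place, the users' alert becomes an ordinary rule, e.g. increase(nvme_media_errors[1h]) > 0, instead of grepping logs.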