to Prometheus Users
Had two users come at me with "why didn't you alert on this?" about a machine whose disk
died of hardware failure with no alerts beforehand. They pointed at these messages
in the kernel dmesg:
> [Wed May 17 06:07:05 2023] nvme nvme3: async event result 00010300
> [Wed May 17 06:07:25 2023] nvme nvme3: controller is down; will reset: CSTS=0x2, PCI_STATUS=0x10
> [Wed May 17 11:56:04 2023] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
> [Wed May 17 11:56:04 2023] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
> [Thu May 18 08:06:04 2023] Buffer I/O error on dev nvme3n1, logical block 390703424, async page read
> [Thu May 18 08:07:37 2023] print_req_error: I/O error, dev nvme3c33n1, sector 0
> [Thu May 18 08:07:37 2023] print_req_error: I/O error, dev nvme3c33n1, sector 256
I didn't find an "errors" counter in iostats [1], so I'm guessing node_exporter won't have one either. I did
find node_filesystem_device_error, but it stayed at zero the whole time.
What would be the Prometheus-y way to surface these errors so my users can have their alerts?
I'm hoping to avoid a "logtail | grep -c 'error'" counter.
[1: https://www.kernel.org/doc/html/latest/admin-guide/iostats.html ]
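One option worth considering (my sketch, not something from the thread): instead of grepping logs, scrape the drive's own error counters from the NVMe SMART log and publish them through node_exporter's textfile collector. The sketch below assumes the plain-text output format of nvme-cli's `nvme smart-log` (the SAMPLE string, field names, and metric name are illustrative assumptions); a cron job would run the real command and atomically write the rendered output to a `.prom` file in the collector directory.

```python
import re

# Illustrative sample of `nvme smart-log /dev/nvme3` plain-text output.
# A real deployment would capture this with subprocess instead.
SAMPLE = """\
Smart Log for NVME device:nvme3 namespace-id:ffffffff
critical_warning                    : 0
media_errors                        : 12
num_err_log_entries                 : 34
"""

def parse_smart_log(text):
    """Extract integer fields like 'media_errors : 12' into a dict."""
    metrics = {}
    for line in text.splitlines():
        m = re.match(r"(\w+)\s*:\s*([\d,]+)$", line.strip())
        if m:
            metrics[m.group(1)] = int(m.group(2).replace(",", ""))
    return metrics

def render_prom(device, metrics):
    """Render one counter in the Prometheus text exposition format."""
    return "\n".join([
        "# HELP nvme_media_errors_total Media errors from the NVMe SMART log.",
        "# TYPE nvme_media_errors_total counter",
        f'nvme_media_errors_total{{device="{device}"}} {metrics["media_errors"]}',
    ]) + "\n"

print(render_prom("nvme3", parse_smart_log(SAMPLE)))
```

Since the drive itself maintains these counters, they survive restarts of the scraper, and an alert like `increase(nvme_media_errors_total[1h]) > 0` would fire on new errors without any log parsing.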
dub...@at.encryp.ch
May 18, 2023, 11:39:57 AM