Hi,
Sorry for reviving an old discussion, but I just started monitoring my home box with Prometheus. As this is a home-grown disk box (>8 2.5'' disks, btrfs raid10) I wanted to monitor disk activity (with node-exporter) as well as SMART attributes.
I didn't want to parse smartctl output, so I linked against libatasmart instead. I've dumped the (ugly, initial) code into a gist, including sample output from my laptop's SSD (feel free to do whatever you want with the code):
https://gist.github.com/rtreffer/4ca899ed926955078099b8f623ff3c59

gcc main.c -latasmart && sudo ./a.out /dev/sd?
Each metric is exported as:
- The SMART "value" (usually on a 100-0, 200-0 or 253-0 scale)
- The raw value
- The pretty value, as provided by libatasmart (e.g. degrees kelvin), with a suffix where possible (e.g. _ms, _bytes, temp_k, _percent)
- The remaining health (value - threshold); zero or negative means the disk predicts failure (note that the scale is vendor dependent, so only the sign is meaningful)
Each data point is labeled with the device path (e.g. /dev/sdX) and the SMART
identity values (model, serial, firmware revision). This should make it easy
to monitor all disks of a certain kind for common failures.
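The core of the gist looks roughly like this (a trimmed sketch, assuming libatasmart's callback-based attribute parsing from atasmart.h; the metric names are just the ones I picked, not anything standard):

    #include <atasmart.h>
    #include <stdio.h>

    /* Called once per SMART attribute; emits one Prometheus sample per value. */
    static void print_attribute(SkDisk *d, const SkSmartAttributeParsedData *a,
                                void *userdata) {
        const char *labels = userdata; /* pre-rendered label set for this disk */
        (void) d;

        if (a->current_value_valid)
            printf("smart_value{%s,attribute=\"%s\"} %u\n",
                   labels, a->name, (unsigned) a->current_value);
        /* pretty value, already normalized by libatasmart (ms, bytes, mkelvin, ...) */
        printf("smart_pretty{%s,attribute=\"%s\"} %llu\n",
               labels, a->name, (unsigned long long) a->pretty_value);
        if (a->current_value_valid && a->threshold_valid)
            /* zero or negative: the vendor predicts failure */
            printf("smart_health{%s,attribute=\"%s\"} %d\n",
                   labels, a->name, (int) a->current_value - (int) a->threshold);
    }

    static void dump_disk(const char *path) {
        SkDisk *d;
        const SkIdentifyParsedData *id;
        char labels[256];

        if (sk_disk_open(path, &d) < 0)
            return;
        if (sk_disk_smart_read_data(d) >= 0 &&
            sk_disk_identify_parse(d, &id) >= 0) {
            snprintf(labels, sizeof(labels),
                     "device=\"%s\",model=\"%s\",serial=\"%s\",firmware=\"%s\"",
                     path, id->model, id->serial, id->firmware);
            sk_disk_smart_parse_attributes(d, print_attribute, labels);
        }
        sk_disk_free(d);
    }

    int main(int argc, char **argv) {
        for (int i = 1; i < argc; i++)
            dump_disk(argv[i]);
        return 0;
    }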
libatasmart was written because smartctl is not a library and thus doesn't integrate well with monitoring tools or the desktop. The only downside I've found is that it won't handle SMART attributes of disks behind a MegaRAID controller in RAID mode (I've only read the code, but it seems to be missing an equivalent of smartctl -d megaraid,X /dev/sda).
Side note: smartctl and libatasmart use a command pass-through mode which is only available to root and to binaries with special capabilities, so the assertion that a SMART utility requires root will hold. You can strace what smartctl is doing, or run smartctl -r ioctl to see what's going on.
Temperature works pretty well:
https://drive.google.com/file/d/0Bxx_x6DuLA2hdnZxdl9sMXNwalk/view (the cron job currently runs every hour for testing, so the chart isn't smooth)
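The sample itself comes straight out of libatasmart, which reports the temperature in millikelvin. Roughly this, reusing the SkDisk handle and label string from the sketch above:

    uint64_t mkelvin;
    /* sk_disk_smart_get_temperature() reports millikelvin */
    if (sk_disk_smart_get_temperature(d, &mkelvin) >= 0)
        printf("smart_temp_k{%s} %.3f\n", labels, mkelvin / 1000.0);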
Now I'd like to build a disk error rate chart to see if any disk starts spilling out errors. For this I'd like to sum up all error-related SMART attributes. I'd like to keep it in one chart, as most of the time the rate will hopefully be zero, and any increase should come from a single disk.
This means I'd have to add SMART attribute raw values, but each disk has a different set of attributes, so each metric is basically a sparse instant vector, and the intersection will be empty / all disks will be dropped from the graph (e.g. the SSD has wear counters whereas the HDD reports seek errors; adding both drops all disks).
Is there a way to add sparse instant vectors? Is this planned for the future?
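If not, my fallback would be to pre-aggregate in the exporter, so that every disk exports the same single error metric regardless of which attributes it happens to have. A minimal sketch; the attribute name list is just a guess at common error counters (libatasmart normalizes attribute names to that hyphenated style):

    #include <atasmart.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Attributes whose values I'd count as "errors". The list is a guess
     * at common error counters and would need tuning per disk family. */
    static const char *error_attrs[] = {
        "raw-read-error-rate", "seek-error-rate",
        "reallocated-sector-count", "udma-crc-error-count", NULL
    };

    /* Callback: add this attribute's value to the per-disk total. */
    static void sum_errors(SkDisk *d, const SkSmartAttributeParsedData *a,
                           void *userdata) {
        uint64_t *total = userdata;
        (void) d;
        for (const char **n = error_attrs; *n; n++)
            if (strcmp(a->name, *n) == 0)
                *total += a->pretty_value;
    }

    int main(int argc, char **argv) {
        for (int i = 1; i < argc; i++) {
            SkDisk *d;
            uint64_t errors = 0;
            if (sk_disk_open(argv[i], &d) < 0)
                continue;
            if (sk_disk_smart_read_data(d) >= 0)
                sk_disk_smart_parse_attributes(d, sum_errors, &errors);
            printf("smart_errors_total{device=\"%s\"} %llu\n",
                   argv[i], (unsigned long long) errors);
            sk_disk_free(d);
        }
        return 0;
    }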
Regards,
Rene Treffer
PS: I once had a RAID of disks known to lose data during SMART inquiries. I never bought a pack of identical disks for a RAID again. Just in case anyone is wondering.