A Nagios plugin to monitor for errors reported by SMART to alert on failing hard drives.
I'd long used the good old check_ide_smart plugin from Nagios itself, but it relies upon SMART detecting and reporting problems well and doesn't monitor the SMART logs for failed self-tests etc.
I've seen a few drives failing and throwing IO errors, all the while SMART happily proclaims:
SMART overall-health self-assessment test result: PASSED
Similarly, check_ide_smart reported no problems:
[dave@devvps:~]$ sudo /usr/lib/nagios/plugins/check_ide_smart -n /dev/sdb
OK - Operational (18/18 tests passed)
I can no longer trust it alone.
So, this plugin requests the error log (using smartctl -l error) and alarms if errors are found:
[dave@devvps:~]$ ./nagios_smart_log -d /dev/sdb
CHECKSMARTLOG CRITICAL - 5 SMART errors, latest: Error 13 occurred at disk
power-on lifetime: 29769 hours (1240 days + 9 hours)
Far more useful.
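For illustration, the core idea can be sketched in Python roughly as follows. This is not the plugin's actual source: the function names are hypothetical, and the parsing assumes that smartctl -l error prints one "Error N occurred at disk power-on lifetime: ..." line per logged error (most recent first) and "No Errors Logged" when the log is empty, which may vary between smartmontools versions and drive types.

```python
import re
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed format of an error entry header in `smartctl -l error` output.
ERROR_LINE = re.compile(
    r"^Error (\d+) occurred at disk power-on lifetime: (.+)$", re.MULTILINE
)


def read_error_log(device):
    """Return the raw `smartctl -l error` output for a device (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-l", "error", device],
        capture_output=True,
        text=True,
    )
    return result.stdout


def evaluate_error_log(log_text):
    """Turn smartctl error log text into a (exit_code, message) pair."""
    entries = ERROR_LINE.findall(log_text)
    if entries:
        # Assumption: smartctl lists the most recent error first.
        number, when = entries[0]
        message = (
            f"CHECKSMARTLOG CRITICAL - {len(entries)} SMART errors, "
            f"latest: Error {number} occurred at disk power-on lifetime: {when}"
        )
        return CRITICAL, message
    if "No Errors Logged" in log_text:
        return OK, "CHECKSMARTLOG OK - no SMART errors logged"
    # Neither errors nor the "empty log" marker: do not guess.
    return UNKNOWN, "CHECKSMARTLOG UNKNOWN - could not parse smartctl output"
```

A wrapper script would call evaluate_error_log(read_error_log("/dev/sdb")), print the message and exit with the returned code. Returning UNKNOWN when the output is unrecognised avoids reporting OK for a drive whose log could not be read at all.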
Update Feb 2024: I was having problems with check_ide_smart on a drive in an external enclosure, where running it directly would work fine, but when run via Nagios it would fail with CRITICAL - SMART_CMD_ENABLE.
I couldn't easily see what was going on, and a Google search turned up several people with the same problem, including a Debian bug report.
The Debian report suggested the problem was fixed in the Wheezy and Jessie packages, but I was still seeing it on my Debian 12 (Bookworm) box with monitoring-plugins package version 2.3.3-5+deb12u2.
I saw check_scsi_smart recommended as a more modern alternative to check_ide_smart; it works a treat for me and can also monitor the SMART logs, replacing the need for this plugin.
So, I'd recommend you give it a go first, and props to @spjmurray for it.