FYI, I think load is a pretty good proxy for trouble. No guarantees, but it's a good warning signal, IMHO.
We have a 30-node cluster, and while we have New Relic, Pepperdata, Ganglia, LogicMonitor, Nagios and who knows what else... call me a caveman, but I just run uptime from the command line in a loop. Nice 'n' simple.
So I get something like this:
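For anyone who wants to try the same thing, here is a minimal sketch of such a loop in Python. It is not the exact command I run; it assumes passwordless ssh to every node, and the function name uptime_report and the injectable run parameter are made up for illustration. The host names are the ones that appear in the output in this post.

```python
import subprocess


def uptime_report(hosts, run=None):
    """Return a list of (host, uptime_line) pairs, one per host.

    `run` is a hypothetical hook: a callable taking a host name and
    returning that host's uptime output. By default it shells out to ssh.
    """
    if run is None:
        def run(host):
            # Assumption: key-based ssh is set up. BatchMode=yes makes ssh
            # fail instead of hanging on a password prompt.
            result = subprocess.run(
                ["ssh", "-o", "BatchMode=yes", host, "uptime"],
                capture_output=True, text=True, timeout=10,
            )
            return result.stdout.strip()

    report = []
    for host in hosts:
        try:
            line = run(host)
        except (subprocess.SubprocessError, OSError) as exc:
            # Keep going if one node is unreachable or slow to answer.
            line = f"ERROR: {exc}"
        report.append((host, line))
    return report


if __name__ == "__main__":
    # Host names taken from the output in this post (dndwr2 to dndwr27).
    hosts = [f"dndwr{i}" for i in range(2, 28)]
    for host, line in uptime_report(hosts):
        print(host)
        print(line)
```

The nodes are polled one at a time, which is why the timestamps drift by a second or so from the first node to the last.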
dndwr2
09:18:43 up 221 days, 2:16, 0 users, load average: 0.62, 3.46, 5.99
dndwr3
09:18:43 up 231 days, 22:44, 0 users, load average: 1.26, 4.13, 6.27
dndwr4
09:18:43 up 1022 days, 14:41, 0 users, load average: 1.44, 5.10, 6.97
dndwr5
09:18:44 up 223 days, 2:04, 0 users, load average: 1.47, 4.56, 7.00
dndwr6
09:18:44 up 225 days, 52 min, 0 users, load average: 1.51, 3.76, 5.94
dndwr7
09:18:44 up 225 days, 1:40, 0 users, load average: 1.42, 3.97, 6.28
dndwr8
09:18:44 up 225 days, 58 min, 0 users, load average: 0.93, 4.00, 6.34
dndwr9
09:18:44 up 776 days, 18:46, 0 users, load average: 1.74, 4.38, 6.76
dndwr10
09:18:44 up 268 days, 2:59, 0 users, load average: 1.28, 4.50, 6.56
dndwr11
09:18:44 up 558 days, 19:29, 0 users, load average: 1.25, 3.76, 6.30
dndwr12
09:18:44 up 558 days, 18:56, 0 users, load average: 1.55, 4.48, 6.67
dndwr13
09:18:44 up 552 days, 23:51, 0 users, load average: 0.98, 3.74, 6.31
dndwr14
09:18:45 up 223 days, 2:10, 0 users, load average: 1.49, 4.32, 6.32
dndwr15
09:18:45 up 558 days, 18:52, 0 users, load average: 1.06, 3.96, 6.24
dndwr16
09:18:45 up 558 days, 20:50, 0 users, load average: 1.07, 4.27, 6.63
dndwr17
09:18:45 up 558 days, 19:41, 0 users, load average: 1.02, 4.50, 6.82
dndwr18
09:18:45 up 223 days, 2:22, 0 users, load average: 0.93, 3.46, 5.66
dndwr19
09:18:45 up 558 days, 19:35, 0 users, load average: 0.70, 3.73, 6.14
dndwr20
09:18:46 up 84 days, 15:45, 0 users, load average: 1.34, 3.95, 6.40
dndwr21
09:18:46 up 91 days, 8:30, 0 users, load average: 0.61, 3.87, 6.52
dndwr22
09:18:46 up 100 days, 16:45, 0 users, load average: 0.85, 3.43, 6.02
dndwr23
09:18:46 up 100 days, 16:04, 0 users, load average: 1.22, 3.79, 6.18
dndwr24
09:18:46 up 54 days, 2 min, 0 users, load average: 0.66, 3.17, 5.99
dndwr25
09:18:46 up 8 days, 21:15, 0 users, load average: 1.60, 4.00, 6.48
dndwr26
09:18:46 up 79 days, 7:50, 0 users, load average: 0.95, 2.77, 4.43
dndwr27
09:18:46 up 100 days, 16:03, 0 users, load average: 1.78, 4.66, 6.81
Pretty easy to spot something wacky on the cluster this way.
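If eyeballing the list gets old, the same check can be roughed out in code. This is only a sketch under my own assumptions: the names parse_load and flag_outliers are hypothetical, and "wacky" is defined here as a 15-minute load more than 25% away from the cluster median, which is an arbitrary threshold you would want to tune.

```python
import re
import statistics

# Matches the tail of an uptime line, e.g. "load average: 0.62, 3.46, 5.99".
LOAD_RE = re.compile(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)")


def parse_load(line):
    """Return the (1 min, 5 min, 15 min) load averages, or None if absent."""
    match = LOAD_RE.search(line)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


def flag_outliers(loads, index=2, tolerance=0.25):
    """Return hosts whose load differs from the cluster median by more
    than `tolerance` (as a fraction of the median).

    `loads` maps host name to a (1 min, 5 min, 15 min) tuple; `index`
    picks which of the three averages to compare (2 is the 15-minute one).
    """
    values = [averages[index] for averages in loads.values()]
    if not values:
        return []
    median = statistics.median(values)
    return [
        host
        for host, averages in loads.items()
        if abs(averages[index] - median) > tolerance * median
    ]
```

Comparing against the median rather than a fixed number means a busy but evenly loaded cluster stays quiet, and only the node that is out of step with the others gets flagged.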
Cheers,
Stephen.