Monitoring Unhealthy Nomad Jobs

hch...@instartlogic.com

Jul 24, 2018, 9:23:05 PM
to Nomad
Hi there!

Are there recommended best practices for monitoring unhealthy jobs in Nomad? We're currently deploying jobs across regions and datacenters. After deployment, is there a good way to check that all the services are running correctly and to alert us if not? We currently use Prometheus as our monitoring system.

We've looked at the option of using Consul health checks, as many of you do. However, if a task never comes up at all, there is no Consul health check to fail. It would also be great if there were a way to detect jobs that were stopped accidentally, for example when someone stops a job that's actually needed.

Thanks in advance.

Matt Veitas

Jul 29, 2018, 6:13:47 PM
to Nomad
I don't think there is anything out of the box, but we spent a day writing a number of simple Python scripts that run every 30 seconds, query the Consul and Nomad APIs for information about jobs and their health, and report these metrics to our monitoring system. So far it's working well.

-Matt
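
For readers wondering what such a script might look like, here is a minimal sketch. It is not the poster's actual code. It assumes the Nomad agent's HTTP API is reachable at the default local address, that the job list from `/v1/jobs` carries `ID`, `Status`, `Stop`, and `JobSummary` fields, and that you maintain your own set of job IDs that should always be running. The function names, the reason strings, and the `nomad_job_problem` metric name are all hypothetical.

```python
import json
import urllib.request

# Assumption: a Nomad agent listening on the default local HTTP address.
NOMAD_ADDR = "http://127.0.0.1:4646"


def fetch_jobs(addr=NOMAD_ADDR):
    """Fetch the job list from Nomad's /v1/jobs endpoint."""
    with urllib.request.urlopen(f"{addr}/v1/jobs", timeout=10) as resp:
        return json.load(resp)


def find_problems(jobs, expected_job_ids):
    """Compare the job list against the IDs we expect to be running.

    Returns (job_id, reason) tuples. Checking against an expected set is
    what catches a job that was stopped or never registered, the case a
    Consul health check cannot see.
    """
    by_id = {job["ID"]: job for job in jobs}
    problems = []
    for job_id in sorted(expected_job_ids):
        job = by_id.get(job_id)
        if job is None:
            problems.append((job_id, "missing"))
        elif job.get("Stop") or job.get("Status") == "dead":
            problems.append((job_id, "stopped"))
        elif job.get("Status") == "pending":
            problems.append((job_id, "pending"))
        else:
            # Per-task-group allocation counts, e.g. {"web": {"Running": 2}}.
            summary = (job.get("JobSummary") or {}).get("Summary") or {}
            if any(c.get("Running", 0) == 0 for c in summary.values()):
                problems.append((job_id, "no_running_allocs"))
    return problems


def to_prometheus_text(problems):
    """Render problems as Prometheus text-format lines (hypothetical metric)."""
    return "\n".join(
        f'nomad_job_problem{{job="{job_id}",reason="{reason}"}} 1'
        for job_id, reason in problems
    )

# Example wiring (needs a reachable Nomad agent; job IDs are placeholders):
#   print(to_prometheus_text(find_problems(fetch_jobs(), {"web", "api"})))
```

Run on a timer, the output could be written to a file for a Prometheus textfile collector or pushed to whatever monitoring system you use. Batch jobs normally end up `dead` once they complete, so the expected set should list only long-running services.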

Shantanu Gadgil

Jul 30, 2018, 1:33:14 PM
to Nomad
Hi,
Are these scripts available as open source somewhere? ☺️😊

Matt Veitas

Jul 31, 2018, 2:02:48 PM
to Nomad
Not yet, but this is something we (my company) might consider in the future.