Batch/Periodic Job Health Checks


Justin Walz

Oct 23, 2016, 10:18:52 PM
to Nomad
Hi,

Nomad does a great job of registering services with Consul. Is there a solution for monitoring batch jobs (both one-time and periodic)? Ideally, we'd see a failing health check whenever the job fails by any means (failing to run, start, or get allocated, returning a non-zero exit code, etc.).

Our current working hypothesis involves wrapping the task with a Consul health check register/deregister, but that seems prone to errors and edge cases. Meanwhile, Nomad itself has good knowledge of what happened while running (or attempting to run) the task.

Thanks,
Justin
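
[Editor's sketch] The wrapper idea in the post above could look roughly like the following. It is a minimal sketch, not something posted in the thread: it assumes a local Consul agent on the default port and uses Consul's agent check endpoints (register, pass, fail) with a TTL check. The names `CONSUL_ADDR`, `consul_put`, and `run_with_check` and the `24h` TTL are illustrative. The sketch deliberately does not deregister the check, so that a failure stays visible.

```python
import json
import subprocess
import sys
import urllib.request

# Assumption: a Consul agent is listening locally on its default HTTP port.
CONSUL_ADDR = "http://127.0.0.1:8500"


def consul_put(path, body=None):
    """Send a PUT request to the local Consul agent HTTP API."""
    data = json.dumps(body).encode() if body is not None else b""
    req = urllib.request.Request(CONSUL_ADDR + path, data=data, method="PUT")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


def run_with_check(cmd, check_id, ttl="24h", send=consul_put):
    """Run cmd and record its outcome in a Consul TTL check.

    `send` is injectable so the logic can be exercised without a Consul agent.
    """
    # Register (or re-register) a TTL check named after the job.
    send("/v1/agent/check/register", {"ID": check_id, "Name": check_id, "TTL": ttl})
    try:
        exit_code = subprocess.call(cmd)
    except OSError:
        exit_code = 127  # the command could not be started at all
    # Mark the check passing only on a zero exit code.
    state = "pass" if exit_code == 0 else "fail"
    send("/v1/agent/check/" + state + "/" + check_id)
    return exit_code
```

A task would then be launched as, for example, `run_with_check(["/usr/local/bin/report"], "nightly-report")` (a hypothetical command and check ID). This illustrates the edge case raised above: if the task is never allocated, the wrapper never runs and reports nothing, so the only remaining signal is the TTL expiring.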

Justin Walz

Oct 25, 2016, 1:17:43 AM
to Nomad
Or, another possibility would be emitting metrics/events that we could then monitor. https://www.nomadproject.io/docs/agent/telemetry.html

Any thoughts?

Mathias Lafeldt

Oct 25, 2016, 2:29:57 AM
to Justin Walz, Nomad
For batch jobs, you might consider pushing metrics to Prometheus via the Pushgateway (https://github.com/prometheus/pushgateway), or to whatever monitoring system you prefer.
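
[Editor's sketch] As a rough illustration of the Pushgateway suggestion, a batch task could push its exit code and finish time when it ends. This is a sketch under assumptions: a Pushgateway reachable on its default port, and hypothetical metric names (`batch_job_last_exit_code`, `batch_job_last_run_timestamp_seconds`) chosen for this example. It uses the Pushgateway's `/metrics/job/<name>` endpoint with the Prometheus text format.

```python
import time
import urllib.parse
import urllib.request

# Assumption: a Pushgateway is listening locally on its default port.
PUSHGATEWAY_ADDR = "http://127.0.0.1:9091"


def build_payload(exit_code, finished_at):
    """Render two gauges in the Prometheus text exposition format."""
    lines = [
        "# TYPE batch_job_last_exit_code gauge",
        f"batch_job_last_exit_code {exit_code}",
        "# TYPE batch_job_last_run_timestamp_seconds gauge",
        f"batch_job_last_run_timestamp_seconds {int(finished_at)}",
    ]
    # The text format requires a trailing newline.
    return "\n".join(lines) + "\n"


def push_result(job_name, exit_code, finished_at=None):
    """PUT the metrics for job_name, replacing any previously pushed group."""
    finished_at = time.time() if finished_at is None else finished_at
    url = PUSHGATEWAY_ADDR + "/metrics/job/" + urllib.parse.quote(job_name, safe="")
    body = build_payload(exit_code, finished_at).encode()
    req = urllib.request.Request(url, data=body, method="PUT")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

An alert could then fire when the exit code is non-zero or when the timestamp is older than the job's expected interval. As with any in-task reporting, a job that never starts pushes nothing, which is why the timestamp is included.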

It's also possible to wrap your job using a tool like https://github.com/Jimdo/periodicnoise if that is an option.

Last but not least, you can always ask Nomad's API for the status of job allocations and use that as a basis for rolling your own solution.
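
[Editor's sketch] A roll-your-own check built on the Nomad HTTP API might start like this. It is a sketch under assumptions: a Nomad agent on the default HTTP port, the `/v1/job/<job_id>/allocations` endpoint, and its `ClientStatus` field. The function names and the passing/critical/running/unknown mapping are illustrative choices, not a Nomad convention.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a Nomad agent is listening locally on its default HTTP port.
NOMAD_ADDR = "http://127.0.0.1:4646"


def fetch_allocations(job_id):
    """Return the allocation list Nomad reports for job_id."""
    url = NOMAD_ADDR + "/v1/job/" + urllib.parse.quote(job_id) + "/allocations"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def summarize(allocations):
    """Collapse allocation client statuses into one health verdict."""
    if not allocations:
        # Nothing was placed: the job may have failed to get allocated.
        return "unknown"
    statuses = {alloc.get("ClientStatus") for alloc in allocations}
    if statuses & {"failed", "lost"}:
        return "critical"
    if statuses == {"complete"}:
        return "passing"
    # Some allocation is still pending or running.
    return "running"
```

Calling `summarize(fetch_allocations("my-batch-job"))` (a hypothetical job ID) from an external poller covers the "never allocated" case that an in-task wrapper cannot see. For periodic jobs, each launch is likely a separate child job, so the poller would need to look up the child job IDs first.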

-Mathias


--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/nomad/issues
IRC: #nomad-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Nomad" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nomad-tool+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nomad-tool/1804eba2-93ab-417f-9bbd-769b4bb7ecd4%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Alex Dadgar

Nov 7, 2016, 1:45:51 PM
to Nomad
Hey Justin,

It would be great if you could file an issue with all your various thoughts on this. It is a good idea!

Thanks,
Alex 


Justin Walz

Nov 8, 2016, 5:11:02 PM
to Nomad
Hi Alex,

Sure - I just created one here: https://github.com/hashicorp/nomad/issues/1964.

Best, Justin

Zane Williamson

Mar 26, 2017, 6:42:44 PM
to Nomad
I put this together to send 'deadman switch' alerts to Slack for Nomad periodic jobs that fail to run. 


May be useful to others, so I figured I'd share on this thread. 

-Z
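
[Editor's sketch] The tool referred to in this post is not shown in the thread, so the following is not that tool. It is only a generic illustration of the deadman-switch idea: alert when a periodic job has gone too long without a successful run. The function names, the grace period, and the message wording are all hypothetical; the alert is sent as a JSON `text` payload to a Slack incoming webhook URL that the caller supplies.

```python
import json
import urllib.request


def is_overdue(last_success, interval, grace, now):
    """True if the last success is older than interval + grace (all in seconds)."""
    return now - last_success > interval + grace


def slack_message(job_name, last_success, now):
    """Build the JSON payload for a Slack incoming webhook."""
    minutes = int((now - last_success) // 60)
    return {"text": f"Periodic job {job_name} has not succeeded for {minutes} minutes"}


def alert(webhook_url, payload):
    """POST the payload to the given Slack incoming webhook URL."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

The checker has to run outside the job it watches (for example from a separate scheduler), since the point of a deadman switch is to notice when the job itself never ran.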



Alex Dadgar

Mar 27, 2017, 2:34:31 PM
to Nomad, Zane Williamson
Nice work! That looks great!

Thanks,
Alex Dadgar