Handling dead allocations for a job


Olve Hansen

Sep 30, 2015, 6:53:06 AM
to Nomad
Kudos for a great project. Nomad looks like the first project that gets the complexity down to a level that fits my use case.

I have a few questions, knowing that this is a 0.1.0 version.

I am running through the getting started tutorial experimenting with the cluster setup proposed at 

I was under the (perhaps mistaken) impression that the scheduler would restart any Docker processes that die. What should I do when I want to "re-assert" the job? I have now killed all my Redis Docker containers, and Nomad says they are all dead.

vagrant@nomad:~$ nomad status example
ID          = example
Name        = example
Type        = service
Priority    = 50
Datacenters = dc1
Status      = <none>

==> Evaluations
ID                                    Priority  TriggeredBy   Status
8162aee0-5573-37be-98aa-3ab47021815e  50        job-register  complete
9e5faf5d-90ed-487d-52c9-2296b8fc8401  50        job-register  complete

==> Allocations
ID                                    EvalID                                NodeID                                TaskGroup  Desired  Status
1e0daaf5-115e-76e8-6ad5-13506e0d5157  8162aee0-5573-37be-98aa-3ab47021815e  f6d3851a-1d4c-0271-a93f-ad8e72439a8b  cache      run      dead
6aa7ddf9-e726-bb13-afee-7ed9bd5c3ce4  8162aee0-5573-37be-98aa-3ab47021815e  afbe984b-a0f4-5a53-1f40-fc2905aa31ed  cache      run      dead
98f033b6-eafc-2c4d-a34c-ba7f25a1c70a  8162aee0-5573-37be-98aa-3ab47021815e  f6d3851a-1d4c-0271-a93f-ad8e72439a8b  cache      run      dead
vagrant@nomad:~$

So how do I get things back to normal using Nomad? Do I have to delete the job and run it again? If I just run it again, it leaves the allocations in the dead status. The same happens if I restart Docker: all processes are reported as dead.

Still - great project! Promising stuff!

Olve




Armon Dadgar

Oct 3, 2015, 6:19:55 PM
to Nomad, Olve Hansen
Hey Olve,

Currently the best way is to “kick” the scheduler into re-evaluating the job.
This can be done by using the `/v1/job/<ID>/evaluate` endpoint, or by just running
“nomad run” on the same job again.

In the future, any jobs with the “service” type will automatically be restarted if they
fail, so this hack shouldn’t be needed for long!
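For example, both approaches look something like this, assuming the agent's HTTP API is listening on the default address 127.0.0.1:4646 and the job file is named example.nomad (adjust both for your setup):

```shell
# Kick the scheduler by re-registering the same job...
nomad run example.nomad

# ...or by hitting the evaluate endpoint for the job directly
curl -X POST http://127.0.0.1:4646/v1/job/example/evaluate
```

Both should create a new evaluation for the job, which you can watch with `nomad status example`.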

Best Regards,
Armon Dadgar

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/nomad/issues
IRC: #nomad-tool on Freenode

Olve Hansen

Oct 5, 2015, 4:36:38 AM
to Nomad
It seems that running the job again does not re-evaluate it. I haven't gotten around to trying the API endpoints yet.

vagrant@nomad:~$ sudo docker kill  ecstatic_chandrasekhar
ecstatic_chandrasekhar
vagrant@nomad:~$ nomad status example
ID          = example
Name        = example
Type        = service
Priority    = 50
Datacenters = dc1
Status      = <none>

==> Evaluations
ID                                    Priority  TriggeredBy   Status
256e28f6-acc1-b8ba-fb0e-cda530ddfa5e  50        job-register  complete

==> Allocations
ID                                    EvalID                                NodeID                                TaskGroup  Desired  Status
8fdfd518-9711-6143-f703-93b27c61bff0  256e28f6-acc1-b8ba-fb0e-cda530ddfa5e  a76650d3-e4cd-ea2b-8b88-32e1b12f2552  cache      run      running
920ae146-b237-be4b-5f0e-4e75ccfee5d8  256e28f6-acc1-b8ba-fb0e-cda530ddfa5e  c9b97617-fef6-bd2b-a782-a9418c456596  cache      run      dead
9a9412ce-055f-54de-f07a-b7a0594159c9  256e28f6-acc1-b8ba-fb0e-cda530ddfa5e  a76650d3-e4cd-ea2b-8b88-32e1b12f2552  cache      run      running

vagrant@nomad:~$ nomad run example.nomad
==> Monitoring evaluation "6fe1fcb2-0ea4-0ba9-e3c9-b8115d0a039c"
    Evaluation triggered by job "example"
    Allocation "8fdfd518-9711-6143-f703-93b27c61bff0" modified: node "a76650d3-e4cd-ea2b-8b88-32e1b12f2552", group "cache"
    Allocation "920ae146-b237-be4b-5f0e-4e75ccfee5d8" modified: node "c9b97617-fef6-bd2b-a782-a9418c456596", group "cache"
    Allocation "9a9412ce-055f-54de-f07a-b7a0594159c9" modified: node "a76650d3-e4cd-ea2b-8b88-32e1b12f2552", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "6fe1fcb2-0ea4-0ba9-e3c9-b8115d0a039c" finished with status "complete"

vagrant@nomad:~$ nomad status example
ID          = example
Name        = example
Type        = service
Priority    = 50
Datacenters = dc1
Status      = <none>

==> Evaluations
ID                                    Priority  TriggeredBy   Status
6fe1fcb2-0ea4-0ba9-e3c9-b8115d0a039c  50        job-register  complete
256e28f6-acc1-b8ba-fb0e-cda530ddfa5e  50        job-register  complete

==> Allocations
ID                                    EvalID                                NodeID                                TaskGroup  Desired  Status
8fdfd518-9711-6143-f703-93b27c61bff0  6fe1fcb2-0ea4-0ba9-e3c9-b8115d0a039c  a76650d3-e4cd-ea2b-8b88-32e1b12f2552  cache      run      running
920ae146-b237-be4b-5f0e-4e75ccfee5d8  6fe1fcb2-0ea4-0ba9-e3c9-b8115d0a039c  c9b97617-fef6-bd2b-a782-a9418c456596  cache      run      dead
9a9412ce-055f-54de-f07a-b7a0594159c9  6fe1fcb2-0ea4-0ba9-e3c9-b8115d0a039c  a76650d3-e4cd-ea2b-8b88-32e1b12f2552  cache      run      running


Armon Dadgar

Oct 5, 2015, 9:26:01 AM
to Nomad, Olve Hansen
Hey Olve,

Ah, interesting. The scheduler already has the job in the desired “run” state, so it is
not noticing that the client has moved the allocation to the “dead” state. Could you please file
a ticket, including any logs you may have from the node with the dead allocation?

Thanks!

Best Regards,
Armon Dadgar

