Handling dead allocations for a job


Olve Hansen

Sep 30, 2015, 6:53:06 AM
to Nomad
Kudos for a great project. Nomad looks like the first project that gets the complexity down to a level that fits my use case.

I have a few questions, knowing that this is a 0.1.0 version.

I am running through the getting started tutorial experimenting with the cluster setup proposed at 

I was under the (perhaps mistaken) impression that the scheduler would restart any Docker processes that die. What should I do when I want to "re-assert" the job? I have now killed all my Redis Docker containers, and Nomad says they are all dead.

vagrant@nomad:~$ nomad status example
ID          = example
Name        = example
Type        = service
Priority    = 50
Datacenters = dc1
Status      = <none>

==> Evaluations
ID                                    Priority  TriggeredBy   Status
8162aee0-5573-37be-98aa-3ab47021815e  50        job-register  complete
9e5faf5d-90ed-487d-52c9-2296b8fc8401  50        job-register  complete

==> Allocations
ID                                    EvalID                                NodeID                                TaskGroup  Desired  Status
1e0daaf5-115e-76e8-6ad5-13506e0d5157  8162aee0-5573-37be-98aa-3ab47021815e  f6d3851a-1d4c-0271-a93f-ad8e72439a8b  cache      run      dead
6aa7ddf9-e726-bb13-afee-7ed9bd5c3ce4  8162aee0-5573-37be-98aa-3ab47021815e  afbe984b-a0f4-5a53-1f40-fc2905aa31ed  cache      run      dead
98f033b6-eafc-2c4d-a34c-ba7f25a1c70a  8162aee0-5573-37be-98aa-3ab47021815e  f6d3851a-1d4c-0271-a93f-ad8e72439a8b  cache      run      dead
vagrant@nomad:~$

So how do I get things back to normal using Nomad? Do I have to delete the job and run it again? If I just run it again, it leaves the allocations in the dead status. The same happens if I restart Docker: all processes are reported as dead.

Still - great project! Promising stuff!

Olve




Armon Dadgar

Oct 3, 2015, 6:19:55 PM
to Nomad, Olve Hansen
Hey Olve,

Currently the best way is to “kick” the scheduler into re-evaluating the job.
This can be done by using the `/v1/job/<ID>/evaluate` endpoint, or by just running
“nomad run” on the same job again.

In the future, any jobs with the “service” type will automatically be restarted if they
fail, so this hack shouldn’t be needed for long!
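For example, both approaches look something like this, assuming the agent's HTTP API is listening on the default address 127.0.0.1:4646 and the job file is named example.nomad (adjust both for your setup):

```shell
# Kick the scheduler by re-registering the same job...
nomad run example.nomad

# ...or by hitting the evaluate endpoint for the job directly
curl -X POST http://127.0.0.1:4646/v1/job/example/evaluate
```

Both should create a new evaluation for the job, which you can watch with `nomad status example`.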

Best Regards,
Armon Dadgar

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/nomad/issues
IRC: #nomad-tool on Freenode

Olve Hansen

Oct 5, 2015, 4:36:38 AM
to Nomad
It seems that running the job again does not re-evaluate it. I haven't gotten around to trying the API endpoints yet.

vagrant@nomad:~$ sudo docker kill  ecstatic_chandrasekhar
ecstatic_chandrasekhar
vagrant@nomad:~$ nomad status example
ID          = example
Name        = example
Type        = service
Priority    = 50
Datacenters = dc1
Status      = <none>

==> Evaluations
ID                                    Priority  TriggeredBy   Status
256e28f6-acc1-b8ba-fb0e-cda530ddfa5e  50        job-register  complete

==> Allocations
ID                                    EvalID                                NodeID                                TaskGroup  Desired  Status
8fdfd518-9711-6143-f703-93b27c61bff0  256e28f6-acc1-b8ba-fb0e-cda530ddfa5e  a76650d3-e4cd-ea2b-8b88-32e1b12f2552  cache      run      running
920ae146-b237-be4b-5f0e-4e75ccfee5d8  256e28f6-acc1-b8ba-fb0e-cda530ddfa5e  c9b97617-fef6-bd2b-a782-a9418c456596  cache      run      dead
9a9412ce-055f-54de-f07a-b7a0594159c9  256e28f6-acc1-b8ba-fb0e-cda530ddfa5e  a76650d3-e4cd-ea2b-8b88-32e1b12f2552  cache      run      running

vagrant@nomad:~$ nomad run example.nomad
==> Monitoring evaluation "6fe1fcb2-0ea4-0ba9-e3c9-b8115d0a039c"
    Evaluation triggered by job "example"
    Allocation "8fdfd518-9711-6143-f703-93b27c61bff0" modified: node "a76650d3-e4cd-ea2b-8b88-32e1b12f2552", group "cache"
    Allocation "920ae146-b237-be4b-5f0e-4e75ccfee5d8" modified: node "c9b97617-fef6-bd2b-a782-a9418c456596", group "cache"
    Allocation "9a9412ce-055f-54de-f07a-b7a0594159c9" modified: node "a76650d3-e4cd-ea2b-8b88-32e1b12f2552", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "6fe1fcb2-0ea4-0ba9-e3c9-b8115d0a039c" finished with status "complete"

vagrant@nomad:~$ nomad status example
ID          = example
Name        = example
Type        = service
Priority    = 50
Datacenters = dc1
Status      = <none>

==> Evaluations
ID                                    Priority  TriggeredBy   Status
6fe1fcb2-0ea4-0ba9-e3c9-b8115d0a039c  50        job-register  complete
256e28f6-acc1-b8ba-fb0e-cda530ddfa5e  50        job-register  complete

==> Allocations
ID                                    EvalID                                NodeID                                TaskGroup  Desired  Status
8fdfd518-9711-6143-f703-93b27c61bff0  6fe1fcb2-0ea4-0ba9-e3c9-b8115d0a039c  a76650d3-e4cd-ea2b-8b88-32e1b12f2552  cache      run      running
920ae146-b237-be4b-5f0e-4e75ccfee5d8  6fe1fcb2-0ea4-0ba9-e3c9-b8115d0a039c  c9b97617-fef6-bd2b-a782-a9418c456596  cache      run      dead
9a9412ce-055f-54de-f07a-b7a0594159c9  6fe1fcb2-0ea4-0ba9-e3c9-b8115d0a039c  a76650d3-e4cd-ea2b-8b88-32e1b12f2552  cache      run      running


Armon Dadgar

Oct 5, 2015, 9:26:01 AM
to Nomad, Olve Hansen
Hey Olve,

Ah, interesting. The scheduler already has the job in the desired “run” state, so it is
not noticing that the client has moved the allocation to the “dead” state. Could you please file
a ticket, including any logs you may have from the node with the dead allocation?

Thanks!

Best Regards,
Armon Dadgar

