messed up my nomad upgrade, how to recover?


greg

Sep 11, 2017, 4:31:11 PM
to Nomad
I have a development installation. I updated the executables and restarted, which was obviously the wrong thing to do. Anyway, I've updated all of my Nomad servers to 0.6.2, but the cluster is currently wedged. It shows a bunch of dead jobs when I run a status:

[root@hypervisor-03.. ~]# nomad status
ID                 Type     Priority  Status          Submit Date
etherpad-lite      service  50        dead (stopped)  <none>
goldfish-ui        service  50        dead (stopped)  <none>
hashi-ui           service  50        dead (stopped)  09/11/17 16:06:58 EDT
manila             service  50        dead (stopped)  <none>
nginx              service  50        dead (stopped)  09/11/17 16:05:42 EDT
nginx-arlington    service  50        dead (stopped)  <none>
palantir           service  50        dead (stopped)  <none>
rack-ui            service  50        dead (stopped)  <none>
rack-ui-arlington  service  50        dead (stopped)  <none>
registry-ui        service  50        dead (stopped)  <none>
s1-twinsburg       service  50        dead (stopped)  <none>
s1-twinsburg-cmdb  service  50        dead (stopped)  <none>
s1-twinsburg-vcbt  service  50        dead (stopped)  <none>
vault-ui           service  50        dead (stopped)  <none>


How can I reset things so nomad starts working again?  Do I clear the /var/nomad filesystems on the clients and the servers?


-g


greg

Sep 12, 2017, 7:51:35 AM
to Nomad
I reinstalled Consul 0.9.3 and then Nomad 0.6.3 on 3 server nodes. Consul and Nomad seem to be working: I can bring up the consul-ui and look around, and I can bring up hashi-ui and see the Nomad servers and clients. I went to one of the Nomad clients, ran nomad init, and edited the example.nomad file to use my datacenter. I then ran nomad run example.nomad. The job is stuck in the pending state.

My upgrade was from Nomad 0.5.6 (I think) and Consul 0.9.1. I did a complete reinstall.

Looking at the logs, I am seeing this error:

Sep 12 07:36:12 hypervisor-03 nomad: 2017/09/12 07:36:12.850775 [WARN] client: failed to start task "redis" for alloc "2ed6a584-e6ff-216a-d9f7-604f62fa02d3": Failed to start container 5887933013d804693dc2fefa78d3d22a64caccf64c88e8428434cb7f7b116d5e: API error (500): {"message":"driver failed programming external connectivity on endpoint redis-2ed6a584-e6ff-216a-d9f7-604f62fa02d3 (f80425e6a15a5c531c10751c211b97b67097ad07ee2ab4b3139c235159626cf2): Error starting userland proxy: listen tcp [fe80::1618:77ff:fe30:5e56]:27338: bind: invalid argument"}



I can go to that machine and look at the docker ps state.  There are images that are created, but none of them are running. 


I am running on linux.


-g






greg

Sep 12, 2017, 8:55:07 AM
to Nomad
I reinstalled using Nomad 0.5.6. Things are working again (I can run and stop a Nomad job).

I then installed 0.6.0, and I get an error when starting a job:

[install@hypervisor-03.. nomad]$ nomad run example.nomad
==> Monitoring evaluation "45844ce1"
    Evaluation triggered by job "example"
    Evaluation within deployment: "b5022b2f"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "45844ce1" finished with status "complete" but failed to place all allocations:
    Task Group "cache" (failed to place 1 allocation):
      * Resources exhausted on 3 nodes
      * Dimension "network: no networks available" exhausted on 3 nodes
    Evaluation "c06ee003" waiting for additional capacity to place remainder



-g





 

greg

Sep 12, 2017, 2:33:57 PM
to Nomad
This is a bug of some sort. Some help would be appreciated.  Is this the wrong place to request it?

I was able to bring up an Ubuntu machine, put Docker, Consul, and Nomad on it, and run a Nomad job. So this may be a bug specific to CentOS and/or Red Hat Linux? The issues started with version 0.6.0 of Nomad on Red Hat Linux. For now I am going to stick with 0.5.6.

I'll be at hashiconf next week if somebody is interested in seeing the issue.

-g

muha...@atsspec.co

Sep 12, 2017, 2:58:40 PM
to Nomad
You are lucky to be going to HashiConf :)

Armon Dadgar

Sep 12, 2017, 6:11:16 PM
to greg, Nomad
Greg,

I would recommend creating a Nomad ticket in GitHub so we can look into it. Given the posts, it sounds like you’ve upgraded/downgraded between a few different versions, so it may be hard to reproduce at this point with so many different variables. I would capture the relevant log files from the clients and servers and attach those to the ticket as well.

Best Regards,
Armon Dadgar
--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/nomad/issues
IRC: #nomad-tool on Freenode

greg

Sep 15, 2017, 11:58:27 AM
to Nomad
Alex sent an email saying it might be the Nomad client's interface fingerprinting selecting the wrong interface. That was it. The machines I am running on have maybe 50 interfaces declared. I declared the interface in the agent configuration and all is well.
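
For anyone hitting the same thing, here is a minimal sketch of what declaring the interface in the agent configuration might look like. It assumes the `network_interface` option of the `client` stanza is the setting in question, and `eth0` is only a placeholder, not the interface name used on these machines:

```hcl
# Nomad client agent configuration (fragment).
# Assumption: "eth0" is a placeholder; substitute the interface the
# client should fingerprint and use when assigning task ports.
client {
  enabled           = true
  network_interface = "eth0"
}
```

The agent would need to be restarted to pick up the change.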

I will open an issue over the weekend.  Thank you,

-g

Alex Dadgar

Sep 15, 2017, 1:11:01 PM
to Nomad, greg
Hey Greg,

Oops, didn’t realize I didn’t reply all! So Nomad will do its best to select the interface to bind to, but when you have that many interfaces, you generally have to specify it. This is not so much a bug as the fact that, beyond selecting an externally routable IP when not given the interface to use, Nomad can only do so much. If I am misunderstanding at all, feel free to file an issue and we can dig in more!

Thanks,
Alex Dadgar

greg

Sep 17, 2017, 9:44:34 AM
to Nomad
I raised an issue with a description of the workaround: https://github.com/hashicorp/nomad/issues/3236

-g