Strange behaviour on nomad client restart

Adam Twite

Feb 4, 2016, 12:45:00 AM
to Nomad
I've been testing Nomad's behaviour on client crashes and came across something strange.

I have a Nomad server and client configured on separate hosts. I ran a system job from the server with one Docker task called store-app-va, which successfully created a Docker container on the client. The Docker container ID for this job was e88e6630f436.
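For context, the job file looks roughly like this. The image, task name, job type and static port are the real ones; the job/group names, datacenter, port label and resource values are approximations, and the exact syntax may differ slightly between Nomad versions:

job "store" {
  datacenters = ["dc1"]
  type        = "system"

  group "store" {
    task "store-app-va" {
      driver = "docker"

      config {
        # private registry image, as seen in the logs below
        image = "192.168.0.70:5000/store-va"
      }

      resources {
        cpu    = 500   # approximate
        memory = 256   # approximate
        network {
          mbits = 10
          # statically reserve port 80 on the client
          port "http" {
            static = 80
          }
        }
      }
    }
  }
}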

Next I rebooted the client host. When it comes back up the Docker container is in a stopped state, but it is still visible with docker ps -a. At this point the Nomad client hasn't started yet, since I haven't configured it to start automatically. When I start the Nomad client I get the following log entries:

Feb 04 04:20:16 nomad-store1 nomad[3356]: 2016/02/04 04:20:16 [ERR] client: failed to open handle to task 'store-app-va' for alloc '51c01501-1e1d-d85d-9e16-494e5f14433e': Failed to find container e88e6630f436248176e0b034d3ce67e760b1801ff06521d96ad4b9247a03bad8: <nil>
Feb 04 04:20:16 nomad-store1 nomad[3356]: 2016/02/04 04:20:16 [DEBUG] client: starting task context for 'store-app-va' (alloc '51c01501-1e1d-d85d-9e16-494e5f14433e')
Feb 04 04:20:16 nomad-store1 nomad[3356]: 2016/02/04 04:20:16 [DEBUG] client: starting runner for alloc '51c01501-1e1d-d85d-9e16-494e5f14433e'
Feb 04 04:20:16 nomad-store1 nomad[3356]: 2016/02/04 04:20:16 [DEBUG] driver.docker: binding directories []string{"/var/log/nomad/alloc/51c01501-1e1d-d85d-9e16-494e5f14433e/alloc:/alloc:rw,z", "/var/log/nomad/alloc/51c01501-1e1d-d85d-9e16-494e5f14433e/store-app-va:/local:rw,Z"} for 192.168.0.70:5000/store-va
Feb 04 04:20:16 nomad-store1 nomad[3356]: 2016/02/04 04:20:16 [DEBUG] driver.docker: setting container name to: store-app-va-51c01501-1e1d-d85d-9e16-494e5f14433e
Feb 04 04:20:16 nomad-store1 docker[3303]: time="2016-02-04T04:20:16.300378727Z" level=info msg="POST /containers/create?name=store-app-va-51c01501-1e1d-d85d-9e16-494e5f14433e"
Feb 04 04:20:16 nomad-store1 docker[3303]: time="2016-02-04T04:20:16.305185725Z" level=error msg="Handler for POST /containers/create returned error: Conflict. The name \"store-app-va-51c01501-1e1d-d85d-9e16-494e5f14433e\" is already in use by container e88e6630f436. You have to delete (or rename) that container to be able to reuse that name."
Feb 04 04:20:16 nomad-store1 docker[3303]: time="2016-02-04T04:20:16.305264522Z" level=error msg="HTTP Error" err="Conflict. The name \"store-app-va-51c01501-1e1d-d85d-9e16-494e5f14433e\" is already in use by container e88e6630f436. You have to delete (or rename) that container to be able to reuse that name." statusCode=409
Feb 04 04:20:16 nomad-store1 docker[3303]: time="2016-02-04T04:20:16.305735310Z" level=info msg="GET /containers/json?all=1&filters=%7B%22name%22%3A%5B%22store-app-va-51c01501-1e1d-d85d-9e16-494e5f14433e%22%5D%7D"
Feb 04 04:20:16 nomad-store1 nomad[3356]: 2016/02/04 04:20:16 [INFO] driver.docker: a container with the name store-app-va-51c01501-1e1d-d85d-9e16-494e5f14433e already exists; will attempt to purge and re-create
Feb 04 04:20:27 nomad-store1 nomad[3356]: 2016/02/04 04:20:27 [INFO] driver.docker: purged container store-app-va-51c01501-1e1d-d85d-9e16-494e5f14433e
Feb 04 04:20:27 nomad-store1 docker[3303]: time="2016-02-04T04:20:27.403240577Z" level=info msg="POST /containers/create?name=store-app-va-51c01501-1e1d-d85d-9e16-494e5f14433e"
Feb 04 04:20:28 nomad-store1 nomad[3356]: 2016/02/04 04:20:28 [ERR] client: failed to start task 'store-app-va' for alloc '51c01501-1e1d-d85d-9e16-494e5f14433e': Failed to start container 0c9911612ce71af682727670bd29d682b97dcb35e3d4b09bd6d1567634f2be2a: API error (500): Cannot start container 0c9911612ce71af682727670bd29d682b97dcb35e3d4b09bd6d1567634f2be2a: Bind for 192.168.0.71:80 failed: port is already allocated
Feb 04 04:20:28 nomad-store1 nomad[3356]: 2016/02/04 04:20:28 [DEBUG] client: terminating runner for alloc '51c01501-1e1d-d85d-9e16-494e5f14433e'

The end result is that the original container has been purged and two new containers have been created. The container ID in the first logged error looks odd: it starts with the real container ID but is followed by a run of extra characters (or is that just the full 64-character ID, where docker ps only shows the 12-character short form?). What is the expected behaviour in this case? Should the client agent find the old container and start it again, or will it always create a new container when a client restarts? If the latter is true, then why did it create two new containers?

Alex Dadgar

Feb 4, 2016, 1:13:43 PM
to Adam Twite, Nomad
Hey,

The Nomad client will attempt to re-attach to all processes it has created when it is restarted. However, since you restarted the host machine, all of those processes (including the container) are dead, which is why the re-attach fails. It then deletes the old container and creates it again. It looks like something else is bound to port 80, which is why you see that last error message. So in the end, the expected behavior is: try to re-attach, and if you cannot, recreate the desired state.

- Alex

Adam Twite

Feb 4, 2016, 5:23:28 PM
to Nomad
Hi Alex, thanks for the info. The issue I'm having from here is that the Nomad agent created not one but two replacement containers. The containers are statically mapped to port 80 in the Nomad job file, so I think the port bind error is probably caused by the second container trying to start up. Do you know why this might be happening?
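To make that concrete, the relevant piece of the job sketch in my first message is the static port reservation (port label approximated):

network {
  # A static port can only be bound by one container on this host at a
  # time, so a second replacement container that tries to start while
  # another one already holds 192.168.0.71:80 fails with
  # "port is already allocated", exactly as in the log above.
  port "http" {
    static = 80
  }
}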