Why is a temporary Docker failure treated as a fatal error?

Andrey Kuzmin

Nov 8, 2016, 4:50:19 PM

to Nomad

Hello.

I have a system job (a Docker container) running on every node.

Occasionally (rarely) the container fails and Nomad tries to restart it.

But sometimes Docker fails to pull the image from Docker Hub.

In that case Nomad stops restarting the task and treats the failure as an unrecoverable fatal error. Why doesn't it keep restarting the task? Why is this a fatal error?

After that the job needs a manual restart, which is frustrating.

```
11/08/16 22:17:54 MSK Not Restarting Error was unrecoverable
11/08/16 22:17:54 MSK Driver Failure failed to start task 'nginx' for alloc 'fb54f744-0046-7b46-2b3b-8cdf5a74c3ed': failed to create image: Failed to pull `kaktuss/nginx-frontend:0.13`: Network timed out while trying to connect to https://index.docker.io/v1/repositories/kaktuss/nginx-frontend/images. You may want to check your internet connection or if you are behind a proxy.
11/08/16 22:17:03 MSK Restarting Task restarting in 16.284777174s
11/08/16 22:17:03 MSK Terminated Exit Code: 14, Exit Message: "Docker container exited with non-zero exit code: 14"
```

Alex Dadgar

Nov 8, 2016, 5:47:43 PM

to Andrey Kuzmin, Nomad

Hey Andrey,

This was a regression in 0.4.x. Nomad 0.5 will fix the issue and retry pulling the image.

Thanks,

Alex Dadgar

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.

GitHub Issues: https://github.com/hashicorp/nomad/issues
IRC: #nomad-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Nomad" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nomad-tool+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nomad-tool/77978fc6-d107-4c00-9bd7-5081fbb5cab3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andrey Kuzmin

Nov 8, 2016, 5:57:18 PM

to Nomad

Thanks!

Andrey Kuzmin

Nov 16, 2016, 3:19:03 PM

to Nomad

I am running 0.5rc2 and the error still occurs. Will the fix be in the release?

Alex Dadgar

Nov 17, 2016, 4:22:43 PM

to Nomad, Andrey Kuzmin

Hey,

Can you try 0.5.0 and, if that doesn't fix it, please report an issue? I looked over the code and it should work!

Thanks,

Alex

weslley...@ahgora.com.br

Nov 10, 2017, 2:39:01 PM

to Nomad

Hi,

I am using Nomad 0.7.0. I just rebooted my server and the Docker container did not come back up.

My job specification:

    group "aplicacao" {
      count = 2

      restart {
        attempts = 10
        interval = "1h"
        delay    = "25s"
        mode     = "delay"
      }

I got the following error in the UI:

11/10/17 17:04:01 UTC	Restarting	Exceeded allowed attempts, applying a delay - Task restarting in an hour
11/10/17 17:04:01 UTC	Terminated	Exit Code: 1, Exit Message: Docker container exited with non-zero exit code: 1

Is there anything I am missing?

Thank you.

Alex Dadgar

Nov 10, 2017, 3:41:27 PM

to weslley...@ahgora.com.br, Nomad

Hey Wesley,

Can you show the output of `nomad alloc-status -json <alloc-id>`?

Thanks,

Alex

Weslley Camilo

Nov 13, 2017, 8:36:31 AM

to Nomad

Hello @Alex,

Below is what I found out; the log you requested is attached.

11/13/17 11:06:12 UTC Driver Downloading image ahgora/pontoweb:latest
11/13/17 11:05:43 UTC Restarting Task restarting in 28.432016556s
11/13/17 11:05:43 UTC Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"

---------------------

root@bro7:/tmp# docker logs a1a5a33f2403

Error response from daemon: stat /tmp/plugin573374385: no such file or directory

-----------------

"ContainerIDFile": "",
"LogConfig": {
    "Type": "syslog",
    "Config": {
        "syslog-address": "unix:///tmp/plugin573374385"
    }
},

root@runner1:/home/weslley# nomad node-status
ID        DC    Name  Class   Drain  Status
8d025416  bro7  bro7  <none>  false  ready
a0768e02  bro7  bro7  <none>  false  down
eb3bbd56  bro7  bro7  <none>  false  down
87016ac2  bro5  bro5  <none>  false  ready
c6b9e297  bro7  bro7  <none>  false  down
b3a5e863  bro5  bro5  <none>  false  down

One thing I noticed is that it reports only one node as available:

"NodesAvailable": {
    "bro7": 0,
    "bro5": 1
},

Thank you.

alloc-status_nomad.txt

Alex Dadgar

Nov 13, 2017, 12:44:50 PM

to Weslley Camilo, Nomad

Hey Wesley,

Looking at the task states in the alloc-status output you attached, it looks like it has restarted many times so I think the restart policy is working correctly. It looks like your task is exiting with code 1 which indicates your application is starting and then exiting. I would look at the output using `nomad logs <alloc-id>` and `nomad logs -stderr <alloc-id>`, for stdout and stderr output respectively. Hope that helps!
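
For readers following along, the `restart` stanza posted earlier in the thread can be annotated roughly as below. This is a sketch based on the documented meaning of the stanza's fields, using the values from that job specification; it is not taken from the attached alloc-status output. It would explain the "Exceeded allowed attempts, applying a delay" event: with `mode = "delay"`, Nomad waits for the next interval instead of giving up.

```hcl
restart {
  # Up to 10 restarts are allowed within each 1-hour interval.
  attempts = 10
  interval = "1h"

  # Base wait between restarts. Nomad adds random jitter on top of this,
  # which is consistent with events such as "restarting in 28.432016556s".
  delay = "25s"

  # "delay": once attempts are exhausted, wait until the next interval
  #          begins and then keep restarting.
  # "fail":  once attempts are exhausted, mark the task as failed and
  #          stop restarting it.
  mode = "delay"
}
```

If the application itself exits with code 1 on every start, neither mode fixes the underlying problem; the policy only controls how often Nomad tries again.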

Thanks,

Alex

Weslley Camilo

Nov 14, 2017, 6:18:01 AM

to Nomad

Hi @Alex

I have been troubleshooting this and realized that the job does not spread across datacenters after it fails. Is that correct?

I came to that conclusion after seeing that I had a port conflict on one of my nodes, which I resolved with the settings below:

Job spec:

    datacenters = ["bro5", "bro7"]
    count = 2

Then I had to add the constraint below to make sure the allocations go to different hosts, since they would otherwise conflict on the port (I need to use a specific port):

    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }
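
To show how these pieces might fit together, here is a minimal sketch of a complete job file. Only the datacenters, the count, the constraint, the group name, and the image name come from this thread. The job name, task name, port label, port number, and the choice to put the constraint at the group level are assumptions made for illustration.

```hcl
# Hypothetical job and task names, for illustration only.
job "pontoweb" {
  datacenters = ["bro5", "bro7"]

  group "aplicacao" {
    count = 2

    # Place each allocation of this group on a different client node
    # (assumed to be at group level; the thread does not say where it was put).
    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "app" {
      driver = "docker"

      config {
        image = "ahgora/pontoweb:latest"
      }

      resources {
        network {
          # Assumed label and port number. A static port can be bound by
          # only one allocation per host, so two copies on the same node
          # would conflict without the constraint above.
          port "http" {
            static = 8080
          }
        }
      }
    }
  }
}
```

With `count = 2` and `distinct_hosts`, the scheduler needs at least two eligible nodes across the listed datacenters; if only one node is ready, the second allocation cannot be placed.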

After all that, I now get the following error.

Shouldn't it try a new name?

I am still having problems after restarting Docker and rebooting the OS. Is there anything else I am missing? I have been looking at the logs as you suggested.

Thank you.

Weslley Camilo

Nov 14, 2017, 7:12:16 AM

to Nomad

+++++

I found this in the documentation. When does the cleanup start?

docker.cleanup.image Defaults to true. Changing this to false will prevent Nomad from removing images from stopped tasks.

source [https://www.nomadproject.io/docs/drivers/docker.html]

Weslley Camilo

Nov 14, 2017, 12:58:02 PM

to Nomad

@Alex I got an answer on GitHub: https://github.com/hashicorp/nomad/issues/2084

It seems this is a bug that will be fixed in Nomad 0.7.1. In the meantime I am trying to find a Docker version that works, because I am using Docker 17.06.

Weslley Camilo

Nov 16, 2017, 8:12:17 AM

to Nomad

Hello,

There is a PR to fix it. I've tested it and it is working now.

https://github.com/hashicorp/nomad/issues/2084
