Why is a temporary Docker failure treated as a fatal error?


Andrey Kuzmin

Nov 8, 2016, 4:50:19 PM
to Nomad
Hello.

I have a system job (a Docker container) running on every node.
Sometimes (rarely) the container fails and Nomad tries to restart it.

But occasionally Docker fails to pull the image from Docker Hub.
In this case Nomad stops restarting the task and treats the failure as an unrecoverable fatal error. Why doesn't it keep trying to restart the task? Why is this a fatal error?
After that, the job needs a manual restart, which is unfortunate.

```
11/08/16 22:17:54 MSK  Not Restarting  Error was unrecoverable
11/08/16 22:17:54 MSK  Driver Failure  failed to start task 'nginx' for alloc 'fb54f744-0046-7b46-2b3b-8cdf5a74c3ed': failed to create image: Failed to pull `kaktuss/nginx-frontend:0.13`: Network timed out while trying to connect to https://index.docker.io/v1/repositories/kaktuss/nginx-frontend/images. You may want to check your internet connection or if you are behind a proxy.
11/08/16 22:17:03 MSK  Restarting      Task restarting in 16.284777174s
11/08/16 22:17:03 MSK  Terminated      Exit Code: 14, Exit Message: "Docker container exited with non-zero exit code: 14"
```

Alex Dadgar

Nov 8, 2016, 5:47:43 PM
to Andrey Kuzmin, Nomad
Hey Andrey,

This was a regression in 0.4.x. Nomad 0.5 will fix the issue and retry pulling the image.

Thanks,
Alex Dadgar

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/nomad/issues
IRC: #nomad-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Nomad" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nomad-tool+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nomad-tool/77978fc6-d107-4c00-9bd7-5081fbb5cab3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andrey Kuzmin

Nov 8, 2016, 5:57:18 PM
to Nomad
Thanks!

Andrey Kuzmin

Nov 16, 2016, 3:19:03 PM
to Nomad
I am running 0.5rc2 and this error still occurs. Will the fix be in the release?


Alex Dadgar

Nov 17, 2016, 4:22:43 PM
to Nomad, Andrey Kuzmin
Hey,

Can you try 0.5.0 and, if that doesn't fix it, please report an issue? I looked over the code and it should work!

Thanks,
Alex



weslley...@ahgora.com.br

Nov 10, 2017, 2:39:01 PM
to Nomad
Hi, 

I am using Nomad 0.7.0. I just rebooted my server and the job's Docker container did not come back up.

My job specification:

  group "aplicacao" {
    count = 2
    restart {
      attempts = 10
      interval = "1h"
      delay = "25s"
      mode = "delay"
    }


I got the following error in the UI:
11/10/17 17:04:01 UTC  Restarting  Exceeded allowed attempts, applying a delay - Task restarting in an hour
11/10/17 17:04:01 UTC  Terminated  Exit Code: 1, Exit Message: Docker container exited with non-zero exit code: 1
Is there anything I am missing?

Thank you. 

Alex Dadgar

Nov 10, 2017, 3:41:27 PM
to weslley...@ahgora.com.br, Nomad
Hey Wesley,

Can you show the output of `nomad alloc-status -json <alloc-id>`?


--
Thanks,
Alex

Weslley Camilo

Nov 13, 2017, 8:36:31 AM
to Nomad
Hello @Alex, 

Below is what I found; the log you requested is attached.


11/13/17 11:06:12 UTC  Driver      Downloading image ahgora/pontoweb:latest
11/13/17 11:05:43 UTC  Restarting  Task restarting in 28.432016556s
11/13/17 11:05:43 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"

---------------------
root@bro7:/tmp# docker logs  a1a5a33f2403 
Error response from daemon: stat /tmp/plugin573374385: no such file or directory

-----------------
"ContainerIDFile": "",
            "LogConfig": {
                "Type": "syslog",
                "Config": {
                    "syslog-address": "unix:///tmp/plugin573374385"
                }
            },


root@runner1:/home/weslley# nomad node-status
ID        DC    Name  Class   Drain  Status
8d025416  bro7  bro7  <none>  false  ready
a0768e02  bro7  bro7  <none>  false  down
eb3bbd56  bro7  bro7  <none>  false  down
87016ac2  bro5  bro5  <none>  false  ready
c6b9e297  bro7  bro7  <none>  false  down
b3a5e863  bro5  bro5  <none>  false  down


One thing I noticed is that it reports only one of the nodes as available.

        "NodesAvailable": {
            "bro7": 0,
            "bro5": 1
        },


Thank you. 

alloc-status_nomad.txt

Alex Dadgar

Nov 13, 2017, 12:44:50 PM
to Weslley Camilo, Nomad
Hey Wesley,

Looking at the task states in the alloc-status output you attached, it looks like it has restarted many times so I think the restart policy is working correctly. It looks like your task is exiting with code 1 which indicates your application is starting and then exiting. I would look at the output using `nomad logs <alloc-id>` and `nomad logs -stderr <alloc-id>`, for stdout and stderr output respectively. Hope that helps!


--
Thanks,
Alex
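
For context on the "Exceeded allowed attempts, applying a delay" event earlier in the thread, here is the posted restart stanza again, annotated with how it is generally documented to behave. Treat this as a sketch rather than an authoritative description: the values are copied from the post, the comments reflect the documented meaning of each field as I understand it, and exact timing may differ by version (the 28.43s restart shown above suggests Nomad adds some jitter to the 25s delay).

```hcl
group "aplicacao" {
  count = 2

  restart {
    # Number of restarts allowed within one interval.
    attempts = 10

    # The window over which attempts are counted.
    interval = "1h"

    # Wait roughly this long before each restart.
    delay = "25s"

    # What happens once attempts are exhausted:
    #   "delay" waits for the next interval, then tries again;
    #   "fail"  stops restarting the task.
    mode = "delay"
  }

  # ... tasks omitted, as in the original post ...
}
```

With these values, a task that exits right after starting uses up its 10 attempts within a few minutes (10 restarts at about 25s each), and Nomad then waits for the next interval before trying again, which is consistent with the "Task restarting in an hour" message.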

Weslley Camilo

Nov 14, 2017, 6:18:01 AM
to Nomad
Hi @Alex 

I have been troubleshooting this and noticed that the job does not spread across datacenters after it fails. Is that expected?

I came to this conclusion after finding a port conflict on one of my nodes, which I resolved with the settings below:

Job spec:

datacenters = ["bro5", "bro7"]
count = 2

Then I had to add the constraint below so that the allocations go to different hosts, since they would otherwise conflict on the port (I need to use a specific port):

constraint {
  operator = "distinct_hosts"
  value    = "true"
}

After all that, I now get the following error:


Shouldn't it try a new name?

I am still having problems after restarting Docker and rebooting the OS. Is there anything else I am missing? I have been looking at the logs as you suggested.

Thank you.
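
To make the setup described in this post concrete, here is a minimal sketch of how a static port and a `distinct_hosts` constraint fit together in one job file. Only `datacenters`, `count`, and the constraint come from the post, and the group name is reused from the earlier job specification. The job and task names, the image, the port label `http`, and the port number 8080 are placeholders. The task-level `resources { network { ... } }` layout is the syntax used around Nomad 0.7, so check it against the documentation for your version.

```hcl
job "example" {                      # hypothetical job name
  datacenters = ["bro5", "bro7"]

  group "aplicacao" {
    count = 2

    # Never place two allocations of this group on the same node,
    # so the static port below cannot collide with itself.
    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "web" {                     # hypothetical task name
      driver = "docker"

      config {
        image = "example/app:1.0"    # placeholder image
      }

      resources {
        network {
          port "http" {
            static = 8080            # placeholder port number
          }
        }
      }
    }
  }
}
```

Because `count` is 2 and each allocation needs its own host, at least two eligible nodes must be ready across the listed datacenters; otherwise one allocation cannot be placed.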

Weslley Camilo

Nov 14, 2017, 7:12:16 AM
to Nomad
+++++

I found this in the documentation. When does the cleanup start?


`docker.cleanup.image`: Defaults to `true`. Changing this to `false` will prevent Nomad from removing images from stopped tasks.
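
For reference, this option is set in the Nomad client configuration rather than in the job file. The fragment below is a sketch under assumptions: `docker.cleanup.image` is the option quoted above, while `docker.cleanup.image.delay` is, to my knowledge, a companion option that controls how long Nomad waits after an image becomes unused before removing it. Treat that second key and its value as an assumption and confirm both in the Docker driver documentation for your version.

```hcl
client {
  enabled = true

  options {
    # Keep images from stopped tasks instead of removing them.
    "docker.cleanup.image" = "false"

    # Assumed companion option: how long to wait after an image becomes
    # unused before removing it. Only relevant when cleanup is enabled.
    # "docker.cleanup.image.delay" = "3m"
  }
}
```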

Weslley Camilo

Nov 14, 2017, 12:58:02 PM
to Nomad
@Alex I got this answer on GitHub: https://github.com/hashicorp/nomad/issues/2084

It seems to be a bug that will be fixed in Nomad 0.7.1! I am trying to find out which version of Docker would work, because I am using Docker 17.06.

Weslley Camilo

Nov 16, 2017, 8:12:17 AM
to Nomad

Hello, 

There is a PR to fix it. I've tested it and it is working now. 
