Nomad job pending due to Docker i/o timeout on 3/4 nodes?


Jd Daniel

Apr 26, 2019, 12:56:36 PM
to Nomad
I'm having an issue where my jobs end up in an eternal pending state because the docker pull of the container I want keeps hitting i/o timeouts. I've read several times about changing the DNS to fix this, but it seems kinda hokey; I don't need a public Google address on a private network...
Here's the `nomad job status` for `ping-services.nomad` after a run.


    ○ → nomad job status ping_service
    ID            = ping_service
    Name          = ping_service
    Submit Date   = 2019-04-25T13:29:04-07:00
    Type          = service
    Priority      = 50
    Datacenters   = public-services,private-services,content-connector,backoffice
    Status        = running
    Periodic      = false
    Parameterized = false

    Summary
    Task Group          Queued  Starting  Running  Failed  Complete  Lost
    ping_service_group  0       3         1        0       4         0

    Allocations
    ID        Node ID   Task Group          Version  Desired  Status   Created     Modified
    05468ff2  23b79904  ping_service_group  2        run      pending  18h28m ago  19s ago      <- here
    5ce4c9ba  1601d6b1  ping_service_group  2        run      pending  18h28m ago  20s ago      <- here
    9eced817  2260997a  ping_service_group  2        run      running  18h28m ago  18h28m ago
    aefab4c3  032217e1  ping_service_group  2        run      pending  18h28m ago  42s ago      <- and here


You can see that only 1 of the 4 allocations actually made it to running. Here's `nomad alloc status 05468ff2` for one of the stuck ones:


    ○ → nomad alloc status 05468ff2
    ID                  = 05468ff2
    Eval ID             = 10b76231
    Name                = ping_service.ping_service_group[1]
    Node ID             = 23b79904
    Job ID              = ping_service
    Job Version         = 2
    Client Status       = pending
    Client Description  = <none>
    Desired Status      = run
    Desired Description = <none>
    Created             = 18h35m ago
    Modified            = 15s ago

    Task "ping_service_task" is "pending"
    Task Resources
    CPU      Memory  Disk    IOPS  Addresses
    100 MHz  20 MiB  50 MiB  0     http: xx.xxx.xxx.xxx:31215

    Task Events:
    Started At     = N/A
    Finished At    = N/A
    Total Restarts = 982
    Last Restart   = 2019-04-26T15:04:01Z

    Recent Events:
    Time                       Type            Description
    2019-04-26T08:04:28-07:00  Driver          Downloading image thobe/ping_service:0.0.9
    2019-04-26T08:04:01-07:00  Restarting      Task restarting in 27.061915977s
    2019-04-26T08:04:01-07:00  Driver Failure  failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556294011-ftjrcDBBZK4hiQV99v5QZXxvp34%3D: dial tcp 104.18.122.25:443: i/o timeout
    2019-04-26T08:03:19-07:00  Driver          Downloading image thobe/ping_service:0.0.9
    2019-04-26T08:02:51-07:00  Restarting      Task restarting in 27.302069343s
    2019-04-26T08:02:51-07:00  Driver Failure  failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293941-ZUevnKxoKohkLDGDkv5E4A79aZ8%3D: dial tcp 104.18.122.25:443: i/o timeout
    2019-04-26T08:02:12-07:00  Driver          Downloading image thobe/ping_service:0.0.9
    2019-04-26T08:01:46-07:00  Restarting      Task restarting in 25.629825445s
    2019-04-26T08:01:46-07:00  Driver Failure  failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293876-lE4pvy9Jsruduu76LeMoQxL0gxk%3D: dial tcp 104.18.123.25:443: i/o timeout
    2019-04-26T08:01:07-07:00  Driver          Downloading image thobe/ping_service:0.0.9


You can clearly see that the issue is an I/O timeout preventing us from pulling our layers, so, jumping on the node, let's try this manually...


    ## Make sure we're really logged into ECR/Docker
    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ docker login
    Authenticating with existing credentials...
    WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
    Configure a credential helper to remove this warning. See
    https://docs.docker.com/engine/reference/commandline/login/#credentials-store

    ## Attempt a manual pull...
    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ docker pull thobe/ping_service:0.0.9
    0.0.9: Pulling from thobe/ping_service
    ff3a5c916c92: Pulling fs layer
    3c5613eb8e39: Pulling fs layer
    error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293601-mrJGlZisGPDvwapT7cAbax7UWig%3D: dial tcp 104.18.125.25:443: i/o timeout

    ## Are you there God?
    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ ping -c1 production.cloudflare.docker.com
    PING production.cloudflare.docker.com (104.18.123.25) 56(84) bytes of data.

    --- production.cloudflare.docker.com ping statistics ---
    1 packets transmitted, 0 received, 100% packet loss, time 0ms


    ## NS of Google Pub DNS
    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 8.8.8.8
    ;; connection timed out; no servers could be reached

    ## NS of Primary nameserver
    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 10.128.8.8
    ;; connection timed out; no servers could be reached

    ## NS of Secondary nameserver
    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 10.128.0.2
    Server:        10.128.0.2
    Address:    10.128.0.2#53

    Non-authoritative answer:
    Name:    production.cloudflare.docker.com
    Address: 104.18.122.25
    Name:    production.cloudflare.docker.com
    Address: 104.18.123.25
    Name:    production.cloudflare.docker.com
    Address: 104.18.124.25
    Name:    production.cloudflare.docker.com
    Address: 104.18.125.25
    Name:    production.cloudflare.docker.com
    Address: 104.18.121.25

    ## Resolver
    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ cat /etc/resolv.conf
    options timeout:2 attempts:5
    ; generated by /usr/sbin/dhclient-script
    search nomad-eu-west-1 eu-west-1.compute.internal
    nameserver 10.128.8.8
    nameserver 10.128.0.2


There seems to be something going on with the bad nodes (aka, the ones that are not able to pull). Notice there also seems to be an issue with the `Docker Driver` not being detected? I only see this on a bad node; check out the node events....


    ○ → nomad node status 23b79904
    ID            = 23b79904
    Name          = i-xxxxxxx
    Class         = <none>
    DC            = public-services
    Drain         = false
    Eligibility   = eligible
    Status        = ready
    Uptime        = 21h43m20s
    Driver Status = docker,exec

    Node Events
    Time                  Subsystem       Message
    2019-04-25T20:39:48Z  Driver: docker  Driver is available and responsive
    2019-04-25T20:39:03Z  Driver: docker  Driver docker is not detected
    2019-04-25T18:06:53Z  Cluster         Node registered

    Allocated Resources
    CPU           Memory           Disk            IOPS
    500/2399 MHz  128 MiB/983 MiB  300 MiB/48 GiB  0/0

    Allocation Resource Utilization
    CPU         Memory
    5/2399 MHz  14 MiB/983 MiB

    Host Resource Utilization
    CPU          Memory           Disk
    24/2399 MHz  410 MiB/984 MiB  1.8 GiB/50 GiB

    Allocations
    ID        Node ID   Task Group          Version  Desired  Status   Created     Modified
    05468ff2  23b79904  ping_service_group  2        run      pending  19h19m ago  33s ago
    9f9ecba6  23b79904  fabio               0        run      running  21h33m ago  21h32m ago


Good Node below....


    ○ → nomad node status 2260997a
    ID            = 2260997a
    Name          = i-xxxxxxxxx
    Class         = <none>
    DC            = content-connector
    Drain         = false
    Eligibility   = eligible
    Status        = ready
    Uptime        = 21h43m28s
    Driver Status = docker,exec

    Node Events
    Time                  Subsystem  Message
    2019-04-25T18:07:04Z  Cluster    Node registered

    Allocated Resources
    CPU           Memory          Disk           IOPS
    100/2400 MHz  20 MiB/983 MiB  50 MiB/48 GiB  0/0

    Allocation Resource Utilization
    CPU         Memory
    0/2400 MHz  6.1 MiB/983 MiB

    Host Resource Utilization
    CPU          Memory           Disk
    23/2400 MHz  361 MiB/984 MiB  1.8 GiB/50 GiB

    Allocations
    ID        Node ID   Task Group          Version  Desired  Status   Created     Modified
    9eced817  2260997a  ping_service_group  2        run      running  19h19m ago  19h19m ago


Nomad version below


    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ nomad -v
    Nomad v0.8.6 (ab54ebcfcde062e9482558b7c052702d4cb8aa1b+CHANGES)

pre...@hashicorp.com

Apr 28, 2019, 2:30:58 PM
to Nomad
Hi Jd 
As you pointed out, the root cause is the networking issue preventing Nomad from downloading the image. The allocation will eventually fail and get rescheduled onto another node. If you want to control the number of local restarts it does before it reschedules, look at the restart stanza - https://www.nomadproject.io/docs/job-specification/restart.html#restart-parameters . You can set attempts=0/mode=fail to cause the allocation to go from pending to failed immediately, after which Nomad will attempt rescheduling. Rescheduling can be configured according to https://www.nomadproject.io/docs/job-specification/reschedule.html
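For example, a rough sketch of those two stanzas at the group level (the group name is from your job output; the reschedule values here are just illustrative, adjust them to your needs):


    group "ping_service_group" {
      # Don't keep restarting the failing pull on the same node;
      # fail the allocation immediately instead.
      restart {
        attempts = 0
        mode     = "fail"
      }

      # Let the scheduler keep trying to place the alloc on other nodes,
      # backing off exponentially between attempts.
      reschedule {
        delay          = "30s"
        delay_function = "exponential"
        max_delay      = "10m"
        unlimited      = true
      }

      # task "ping_service_task" { ... }
    }
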

As for docker being undetected and then detected, that's expected behavior. The Nomad client runs a fingerprinting mechanism to discover the docker daemon; it looks like that timed out at first, and after a few seconds it retried, connected to the docker daemon, and detected it.

I would suggest debugging the root cause of the networking issue - it sounds like even if docker is healthy and running, the networking problem would make it impossible to get any containers running. While Nomad can eventually reschedule that workload, it doesn't seem ideal to have bad nodes like this in the cluster. In the meantime you can use the node eligibility command to keep a bad node from being eligible for scheduling: https://www.nomadproject.io/docs/commands/node/eligibility.html
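For example (using the bad node's ID from your output above, with the 0.8.x CLI):


    ## Mark the bad node as ineligible for new placements
    nomad node eligibility -disable 23b79904

    ## Re-enable it once the networking issue is resolved
    nomad node eligibility -enable 23b79904
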

Hope this helps,
Preetha