Nomad job pending due to Docker i/o timeout on 3/4 nodes?


Jd Daniel

Apr 26, 2019, 12:56:36 PM
to Nomad
I'm having an issue where my jobs end up in an eternal pending state because the docker pull of the container I want keeps hitting i/o timeouts. I've read several times about changing the DNS to fix this, but it seems kinda hokey; I don't need a public Google address on a private network...
Here's the `nomad job status` for `ping-services.nomad` after a run.


    ○ → nomad job status ping_service
    ID            = ping_service
    Name          = ping_service
    Submit Date   = 2019-04-25T13:29:04-07:00
    Type          = service
    Priority      = 50
    Datacenters   = public-services,private-services,content-connector,backoffice
    Status        = running
    Periodic      = false
    Parameterized = false

    Summary
    Task Group          Queued  Starting  Running  Failed  Complete  Lost
    ping_service_group  0       3         1        0       4         0

    Allocations
    ID        Node ID   Task Group          Version  Desired  Status   Created     Modified
    05468ff2  23b79904  ping_service_group  2        run      pending  18h28m ago  19s ago      <- here
    5ce4c9ba  1601d6b1  ping_service_group  2        run      pending  18h28m ago  20s ago      <- here
    9eced817  2260997a  ping_service_group  2        run      running  18h28m ago  18h28m ago
    aefab4c3  032217e1  ping_service_group  2        run      pending  18h28m ago  42s ago      <- and here


You can see that only 1 of the 4 allocations actually made it to running. Here's `nomad alloc status 05468ff2` for one of the stuck ones:


    ○ → nomad alloc status 05468ff2
    ID                  = 05468ff2
    Eval ID             = 10b76231
    Name                = ping_service.ping_service_group[1]
    Node ID             = 23b79904
    Job ID              = ping_service
    Job Version         = 2
    Client Status       = pending
    Client Description  = <none>
    Desired Status      = run
    Desired Description = <none>
    Created             = 18h35m ago
    Modified            = 15s ago

    Task "ping_service_task" is "pending"
    Task Resources
    CPU      Memory  Disk    IOPS  Addresses
    100 MHz  20 MiB  50 MiB  0     http: xx.xxx.xxx.xxx:31215

    Task Events:
    Started At     = N/A
    Finished At    = N/A
    Total Restarts = 982
    Last Restart   = 2019-04-26T15:04:01Z

    Recent Events:
    Time                       Type            Description
    2019-04-26T08:04:28-07:00  Driver          Downloading image thobe/ping_service:0.0.9
    2019-04-26T08:04:01-07:00  Restarting      Task restarting in 27.061915977s
    2019-04-26T08:04:01-07:00  Driver Failure  failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556294011-ftjrcDBBZK4hiQV99v5QZXxvp34%3D: dial tcp 104.18.122.25:443: i/o timeout
    2019-04-26T08:03:19-07:00  Driver          Downloading image thobe/ping_service:0.0.9
    2019-04-26T08:02:51-07:00  Restarting      Task restarting in 27.302069343s
    2019-04-26T08:02:51-07:00  Driver Failure  failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293941-ZUevnKxoKohkLDGDkv5E4A79aZ8%3D: dial tcp 104.18.122.25:443: i/o timeout
    2019-04-26T08:02:12-07:00  Driver          Downloading image thobe/ping_service:0.0.9
    2019-04-26T08:01:46-07:00  Restarting      Task restarting in 25.629825445s
    2019-04-26T08:01:46-07:00  Driver Failure  failed to initialize task "ping_service_task" for alloc "05468ff2-f5a0-7a67-3dd7-947d4b30ec45": Failed to pull `thobe/ping_service:0.0.9`: error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293876-lE4pvy9Jsruduu76LeMoQxL0gxk%3D: dial tcp 104.18.123.25:443: i/o timeout
    2019-04-26T08:01:07-07:00  Driver          Downloading image thobe/ping_service:0.0.9


You can clearly see that the issue is an I/O timeout preventing us from pulling our layers, so, jumping on the node, let's try this manually...


    ## Make sure we're really logged into ECR/Docker
    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ docker login
    Authenticating with existing credentials...
    WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
    Configure a credential helper to remove this warning. See
    https://docs.docker.com/engine/reference/commandline/login/#credentials-store

    ## Attempt a manual pull...
    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ docker pull thobe/ping_service:0.0.9
    0.0.9: Pulling from thobe/ping_service
    ff3a5c916c92: Pulling fs layer
    3c5613eb8e39: Pulling fs layer
    error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/cf/cfaa80d7f11f028474f755c007960a0b219c90e1edc45d94039a987c46d7ca32/data?verify=1556293601-mrJGlZisGPDvwapT7cAbax7UWig%3D: dial tcp 104.18.125.25:443: i/o timeout

    ## Are you there God?
    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ ping -c1 production.cloudflare.docker.com
    PING production.cloudflare.docker.com (104.18.123.25) 56(84) bytes of data.

    --- production.cloudflare.docker.com ping statistics ---
    1 packets transmitted, 0 received, 100% packet loss, time 0ms


    ## NS of Google Pub DNS
    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 8.8.8.8
    ;; connection timed out; no servers could be reached

    ## NS of Primary nameserver
    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 10.128.8.8
    ;; connection timed out; no servers could be reached

    ## NS of Secondary nameserver
    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ nslookup production.cloudflare.docker.com 10.128.0.2
    Server:        10.128.0.2
    Address:    10.128.0.2#53

    Non-authoritative answer:
    Name:    production.cloudflare.docker.com
    Address: 104.18.122.25
    Name:    production.cloudflare.docker.com
    Address: 104.18.123.25
    Name:    production.cloudflare.docker.com
    Address: 104.18.124.25
    Name:    production.cloudflare.docker.com
    Address: 104.18.125.25
    Name:    production.cloudflare.docker.com
    Address: 104.18.121.25

    ## Resolver
    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ cat /etc/resolv.conf
    options timeout:2 attempts:5
    ; generated by /usr/sbin/dhclient-script
    search nomad-eu-west-1 eu-west-1.compute.internal
    nameserver 10.128.8.8
    nameserver 10.128.0.2


There seems to be something going on with the bad nodes (aka, the ones that are not able to pull). Notice there also seems to be an issue with the `Docker Driver` not being detected? I only see this on a bad node; check out the node events....


    ○ → nomad node status 23b79904
    ID            = 23b79904
    Name          = i-xxxxxxx
    Class         = <none>
    DC            = public-services
    Drain         = false
    Eligibility   = eligible
    Status        = ready
    Uptime        = 21h43m20s
    Driver Status = docker,exec

    Node Events
    Time                  Subsystem       Message
    2019-04-25T20:39:48Z  Driver: docker  Driver is available and responsive
    2019-04-25T20:39:03Z  Driver: docker  Driver docker is not detected
    2019-04-25T18:06:53Z  Cluster         Node registered

    Allocated Resources
    CPU           Memory           Disk            IOPS
    500/2399 MHz  128 MiB/983 MiB  300 MiB/48 GiB  0/0

    Allocation Resource Utilization
    CPU         Memory
    5/2399 MHz  14 MiB/983 MiB

    Host Resource Utilization
    CPU          Memory           Disk
    24/2399 MHz  410 MiB/984 MiB  1.8 GiB/50 GiB

    Allocations
    ID        Node ID   Task Group          Version  Desired  Status   Created     Modified
    05468ff2  23b79904  ping_service_group  2        run      pending  19h19m ago  33s ago
    9f9ecba6  23b79904  fabio               0        run      running  21h33m ago  21h32m ago


Good Node below....


    ○ → nomad node status 2260997a
    ID            = 2260997a
    Name          = i-xxxxxxxxx
    Class         = <none>
    DC            = content-connector
    Drain         = false
    Eligibility   = eligible
    Status        = ready
    Uptime        = 21h43m28s
    Driver Status = docker,exec

    Node Events
    Time                  Subsystem  Message
    2019-04-25T18:07:04Z  Cluster    Node registered

    Allocated Resources
    CPU           Memory          Disk           IOPS
    100/2400 MHz  20 MiB/983 MiB  50 MiB/48 GiB  0/0

    Allocation Resource Utilization
    CPU         Memory
    0/2400 MHz  6.1 MiB/983 MiB

    Host Resource Utilization
    CPU          Memory           Disk
    23/2400 MHz  361 MiB/984 MiB  1.8 GiB/50 GiB

    Allocations
    ID        Node ID   Task Group          Version  Desired  Status   Created     Modified
    9eced817  2260997a  ping_service_group  2        run      running  19h19m ago  19h19m ago


Nomad version below


    [ec2-user@ip-xx-xxx-xxx-xxx ~]$ nomad -v
    Nomad v0.8.6 (ab54ebcfcde062e9482558b7c052702d4cb8aa1b+CHANGES)

pre...@hashicorp.com

Apr 28, 2019, 2:30:58 PM
to Nomad
Hi Jd 
As you pointed out, the root cause is the networking issue preventing Nomad from downloading the image. The allocation will eventually fail and get rescheduled onto another node. If you want to control the number of local restarts it does before it reschedules, look at the restart stanza - https://www.nomadproject.io/docs/job-specification/restart.html#restart-parameters . You can set attempts=0/mode=fail to cause the allocation to go from pending to failed immediately, after which Nomad will attempt rescheduling. Rescheduling can be configured according to https://www.nomadproject.io/docs/job-specification/reschedule.html
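For example, a rough sketch of those two stanzas at the group level (the group name is from your job output; the reschedule values here are just illustrative, adjust them to your needs):


    group "ping_service_group" {
      # Don't keep restarting the failing pull on the same node;
      # fail the allocation immediately instead.
      restart {
        attempts = 0
        mode     = "fail"
      }

      # Let the scheduler keep trying to place the alloc on other nodes,
      # backing off exponentially between attempts.
      reschedule {
        delay          = "30s"
        delay_function = "exponential"
        max_delay      = "10m"
        unlimited      = true
      }

      # task "ping_service_task" { ... }
    }
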

As for docker being undetected and then detected, that's expected behavior. The Nomad client runs a fingerprinting mechanism to discover the docker daemon; it looks like that timed out at first, and after a few seconds it retried, connected to the docker daemon, and detected it.

I would suggest debugging the root cause of the networking issue - it sounds like even if docker is healthy and running, the networking problem would make it impossible to get any containers running. While Nomad can eventually reschedule that workload, it doesn't seem ideal to have bad nodes like this in the cluster. In the meantime you can use the node eligibility command to keep a bad node from being eligible for scheduling: https://www.nomadproject.io/docs/commands/node/eligibility.html
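For example (using the bad node's ID from your output above, with the 0.8.x CLI):


    ## Mark the bad node as ineligible for new placements
    nomad node eligibility -disable 23b79904

    ## Re-enable it once the networking issue is resolved
    nomad node eligibility -enable 23b79904
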

Hope this helps,
Preetha