nomad job pending for too long


lichuan shang

Dec 14, 2016, 11:36:27 PM
to Nomad
I set count = 2 in a task group using the docker driver. After running `nomad run`, one of the allocations changed its status to running within a few seconds, but the other took much longer (about one minute) to do so.

Here is my job file:


job "simple-server" {
  type = "service"
  datacenters = ["dc1"]
  update {
    stagger = "3s"
    max_parallel = 1
  }
  group "simple-server" {
    count = 2
    task "simple-server" {
      service {
        name = "simple-server"
        port = "http"
      }
      driver = "docker"
      config {
        auth {
          server_address = "https://mydockerregistry.com"
          username = "admin"
          password = "admin"
        }
        port_map = {
          http = 40000
        }
      }
      resources {
        cpu = 200
        memory = 100
        network {
          mbits = 1
          port "http" {
          }
        }
      }
    }
  }
}

The nomad version is 0.4.0.


When I deploy a new version, the simple-server service becomes unavailable while scheduling is in progress and returns a 503 response. The rolling-update strategy should reduce downtime, but it seems the service still does not stay available during a rolling update.

I am confused by this. Could anyone point out what the problem is, or am I just using this excellent tool incorrectly?

Thanks in advance

Diptanu Choudhury

Dec 15, 2016, 1:08:10 PM
to lichuan shang, Nomad
Hi,

The answer to your first question: the two allocations might take different amounts of time if they are running on different nodes, since Docker may take longer to download the image on one machine than on another, or even on the same machine Docker may simply take longer to start one container. Run `nomad alloc-status <alloc-id>` and look at the timestamps on the task events for each allocation to see whether Nomad was genuinely slower to start one allocation than the other.


On to the rolling update: where did the 503 come from? Did you see it in the Nomad CLI/API? If so, can you share your logs and the CLI output?

Nomad currently doesn't have smart rolling updates: it will wait 3 seconds (based on your config) and then start the next container. So if your container takes more than 3 seconds to become ready to serve traffic, you may have problems. We will integrate Consul health checks with rolling updates in the future to get around this problem. For now, I would suggest increasing the stagger duration.
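For example, a more conservative update stanza might look like the following. (The 30s value here is only an illustrative guess, not taken from your job; tune it to your container's actual startup time.)

```hcl
update {
  # Wait 30s between updating each allocation, giving the new
  # container time to start and begin serving traffic before
  # the next old allocation is replaced.
  stagger      = "30s"
  max_parallel = 1
}
```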

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/nomad/issues
IRC: #nomad-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Nomad" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nomad-tool+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nomad-tool/c8f981e3-1f4b-47c4-8065-6b2db308efb4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Thanks,
Diptanu Choudhury

lichuan shang

Dec 16, 2016, 2:40:24 AM
to Nomad
Thanks for your reply.

Running `nomad alloc-status <alloc-id>` was a great help in inspecting the problem. Something was wrong with one of the Nomad client nodes, probably the Docker engine. Later I started a new node and ran the same job with count = 2, and both allocations changed to running within seconds.

I use consul-template to update haproxy.cfg. Here is the relevant part of the consul-template template:

listen simple-server
    balance roundrobin
    bind *:9982{{range service "simple-server"}}
    server {{.Node}} {{.Address}}:{{.Port}}{{end}}
    option forwardfor
    option httpclose
    option http-server-close

During a rolling update, haproxy.cfg may not be updated by consul-template instantly, which means the `listen simple-server` section may still contain the old server address and port. This may be the reason why the 503 happens.
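One way to keep HAProxy from routing to instances that are not ready is to register a Consul health check in the Nomad service stanza, since consul-template's `service` function only returns healthy instances by default. A sketch, in which the `/health` path and the timing values are assumptions rather than anything from the original job:

```hcl
service {
  name = "simple-server"
  port = "http"

  check {
    type     = "http"
    path     = "/health"   # assumed endpoint; use whatever your app actually exposes
    interval = "10s"
    timeout  = "2s"
  }
}
```

With a check in place, a newly started container only appears in the rendered haproxy.cfg once Consul reports it passing, which narrows the window in which HAProxy can send traffic to an instance that is not yet serving.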