Running example.nomad job results in mysterious failed evaluation


cosmo....@anchor.com.au

Apr 4, 2016, 3:20:19 AM
to Nomad
Hi all,

I've been experimenting with nomad recently, but I've hit a bit of an impasse. I'm not sure if this is a bug or if I've messed something up, so I figured I'd ask here before making a GH issue.

Giving nomad a valid job - even the one generated by `nomad init` - results in a failed evaluation due to "maximum attempts reached (5)", and I can't see anything indicating why.

Passing in an invalid job (e.g. one that uses a non-existent datacenter) or one that tries to allocate too many resources will fail as expected with a message describing the problem.

I'm running the following inside an EC2 VPC.
  • Nomad v0.3.1
  • CoreOS 899.13.0
  • Consul v0.6.4
  • Docker version 1.9.1
The problem occurs even on a fresh cluster.

$ nomad server-members
Name                                        Address     Port  Status  Protocol  Build  Datacenter  Region
cptest-dev-master-00e94ad85f1d702fb.global  10.0.4.252  4648  alive   2         0.3.1  dc1         global
cptest-dev-master-048fc11bd951b6815.global  10.0.1.213  4648  alive   2         0.3.1  dc1         global
cptest-dev-master-0d8633708cf4778b7.global  10.0.2.222  4648  alive   2         0.3.1  dc1         global

$ nomad node-status
ID        DC   Name                                 Class   Drain  Status
eeb88d9d  dc1  cptest-dev-client-050b7e212203733f6  <none>  false  ready

$ nomad status
No running jobs

Attempting to run the job created by `nomad init` fails within about 1 second real time.

$ nomad init
Example job file written to example.nomad

$ nomad run example.nomad
==> Monitoring evaluation "96f07a2c"
    Evaluation triggered by job "example"
    Evaluation status changed: "pending" -> "failed"
==> Evaluation "96f07a2c" finished with status "failed"

The output doesn't really describe the problem.

$ nomad status example
ID          = example
Name        = example
Type        = service
Priority    = 50
Datacenters = dc1
Status      = pending
Periodic    = false

==> Evaluations
ID        Priority  Triggered By  Status
01e73507  50        job-register  blocked
96f07a2c  50        job-register  failed

==> Allocations
ID  Eval ID  Node ID  Task Group  Desired  Status

$ curl -s localhost:4646/v1/evaluations | python -m json.tool
[
    {
        "ID": "01e73507-8a4e-f961-3535-c6e9ca38e9de",
        "Priority": 50,
        "Type": "service",
        "TriggeredBy": "job-register",
        "JobID": "example",
        "JobModifyIndex": 6,
        "NodeID": "",
        "NodeModifyIndex": 0,
        "Status": "blocked",
        "StatusDescription": "",
        "Wait": 0,
        "NextEval": "",
        "PreviousEval": "96f07a2c-11a1-3916-f790-b37ab213794c",
        "ClassEligibility": {
            "v1:6305318303864028080": true
        },
        "EscapedComputedClass": false,
        "CreateIndex": 8,
        "ModifyIndex": 8
    },
    {
        "ID": "96f07a2c-11a1-3916-f790-b37ab213794c",
        "Priority": 50,
        "Type": "service",
        "TriggeredBy": "job-register",
        "JobID": "example",
        "JobModifyIndex": 6,
        "NodeID": "",
        "NodeModifyIndex": 0,
        "Status": "failed",
        "StatusDescription": "maximum attempts reached (5)",
        "Wait": 0,
        "NextEval": "",
        "PreviousEval": "",
        "ClassEligibility": null,
        "EscapedComputedClass": false,
        "CreateIndex": 7,
        "ModifyIndex": 9
    }
]
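(As an aside: on a busier cluster, the failed evaluations can be filtered out of that endpoint's output with a couple of lines of Python. This is a quick sketch of my own, not anything Nomad-specific, using the field names shown above:)

```python
import json

def failed_evals(evals):
    """Return {ID: StatusDescription} for evaluations whose Status is "failed"."""
    return {e["ID"]: e["StatusDescription"]
            for e in evals if e["Status"] == "failed"}

# Sample shaped like the /v1/evaluations output above.
sample = [
    {"ID": "01e73507-8a4e-f961-3535-c6e9ca38e9de",
     "Status": "blocked", "StatusDescription": ""},
    {"ID": "96f07a2c-11a1-3916-f790-b37ab213794c",
     "Status": "failed", "StatusDescription": "maximum attempts reached (5)"},
]
print(json.dumps(failed_evals(sample), indent=2))
```

In practice you'd feed it the parsed output of `curl -s localhost:4646/v1/evaluations` instead of the sample list.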


I've uploaded full debug logs for all four servers to https://gist.github.com/cosmopetrich/d6007c71e4cd89e49ba27ab4abd1c861

Does anyone have any ideas what might be causing this, or where I can look for more information?

Parveen Kumar

Apr 4, 2016, 5:43:35 AM
to Nomad
You can use the command below:

nomad fs cat <allocId of task> alloc/logs/jobname.stderr.0

It will show the log file. Or list the log files and adjust the command accordingly:

nomad fs ls <allocId of task> alloc/logs

cosmo....@anchor.com.au

Apr 4, 2016, 7:56:42 PM
to Nomad
Thanks for your reply, Parveen.


On Monday, April 4, 2016 at 7:43:35 PM UTC+10, Parveen Kumar wrote:
nomad fs ls <allocId fo task> alloc/logs

Thanks for the tip, I hadn't thought to look at `nomad fs`. However, it doesn't look like nomad has created any allocations that I can use with that command. 

After re-running `nomad run example.nomad` to make sure they weren't garbage-collected or something, I get this from the API:

$ curl -s localhost:4646/v1/allocations
[]

Parveen Kumar

Apr 4, 2016, 11:23:06 PM
to Nomad
Try either the HTTP APIs:

https://www.nomadproject.io/docs/http/index.html

or:

nomad run -verbose example.nomad

It will give you more details about the job.

cosmo....@anchor.com.au

Apr 5, 2016, 12:44:56 AM
to Nomad


On Tuesday, April 5, 2016 at 1:23:06 PM UTC+10, Parveen Kumar wrote:
try with either http apis
 
nomad run -verbose example.nomad

it will give you more details related to job

Adding the verbose flag doesn't result in any different output.

$ nomad run -verbose example.nomad
==> Monitoring evaluation "69bb3e59-8d56-f945-6331-d51faa9b1222"
    Evaluation triggered by job "example"
    Evaluation status changed: "pending" -> "failed"
==> Evaluation "69bb3e59-8d56-f945-6331-d51faa9b1222" finished with status "failed"

I gave the HTTP API a try with the example at https://www.nomadproject.io/docs/jobspec/json.html. Things looked good at first:

$ curl -XPOST localhost:4646/v1/job/example1 --data-binary '@example1.json'
{"EvalID":"","EvalCreateIndex":0,"JobModifyIndex":1358,"Index":0,"LastContact":0,"KnownLeader":false}


$ nomad status
ID        Type   Priority  Status
example1  batch  50        running

However, the first time the periodic job fires I end up with the same symptoms.
$ nomad status example1/
ID          = example1/periodic-1459831080
Name        = example1/periodic-1459831080
Type        = batch
Priority    = 50
Datacenters = dc1
Status      = pending
Periodic    = false

==> Evaluations
ID        Priority  Triggered By  Status
71a5b285  50        periodic-job  blocked
1e44d4af  50        periodic-job  failed

==> Allocations
ID  Eval ID  Node ID  Task Group  Desired  Status

Still nothing of note in the logs.

Parveen Kumar

Apr 5, 2016, 1:30:37 AM
to Nomad
Can you attach the config files as well, along with the operating system name?

cosmo....@anchor.com.au

Apr 5, 2016, 2:25:39 AM
to Nomad
Thanks very much for your help so far, Parveen.

I'm currently using CoreOS 899.13.0 (latest stable), though that's mostly as an easy way of getting access to a recent Docker version.

Nomad is being run as root directly on the host (i.e. not in a container) with:

nomad agent -log-level DEBUG -config /var/lib/nomad/server.hcl -bind $COREOS_PRIVATE_IPV4

or

nomad agent -log-level DEBUG -config /var/lib/nomad/client.hcl -bind $COREOS_PRIVATE_IPV4

The server config is as follows:

data_dir = "/var/lib/nomad/data"
disable_update_check = true

addresses {
  http = "0.0.0.0"
}

server {
  enabled = true
  bootstrap_expect = 3

  retry_join = ["consul.service.consul"]
}

and the client:

data_dir = "/var/lib/nomad/data"
disable_update_check = true

client {
  enabled = true

  servers = ["server.nomad.service.consul"]

  reserved {
    cpu = 500
    memory = 512
    disk = 10000
    reserved_ports = "22,8300-8600"
  }
}




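For reference, that reserved_ports value is a comma-separated list of single ports and inclusive ranges, so it covers a fair few ports once expanded. A rough sketch of the expansion (my own illustration, not Nomad's actual code):

```python
def expand_reserved_ports(spec):
    """Expand a spec like "22,8300-8600" into a sorted list of individual ports."""
    ports = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            ports.update(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            ports.add(int(part))
    return sorted(ports)

# "22,8300-8600" reserves port 22 plus the 301 ports from 8300 through 8600.
print(len(expand_reserved_ports("22,8300-8600")))  # 302
```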

Parveen Kumar

Apr 5, 2016, 2:33:40 AM
to Nomad
Do you have Docker installed on each client? Also, as per the config files, you have 3 servers and clients attached.



cosmo....@anchor.com.au

Apr 5, 2016, 2:46:26 AM
to Nomad

 Also as per config files you have 3 servers and clients attached.

There are currently 3 nodes running in server mode and one running in client mode. All use the same CoreOS version (they're built from the same AWS AMI, just with different services started on each). There's only a single small client node right now, since I hit this issue before needing to scale that layer out.

do you have docker installed on each client???

Docker is installed and looks to be working fine. If I invoke something like `docker run redis:latest` on the client the new image gets pulled and the redis container comes up as expected. Nomad looks to be reporting the docker capability in the client node's attributes:

$ nomad node-status
ID        DC   Name                                 Class   Drain  Status
eeb88d9d  dc1  cptest-dev-client-050b7e212203733f6  <none>  false  ready


$ nomad node-status eeb88d9d
ID         = eeb88d9d
Name       = cptest-dev-client-050b7e212203733f6
Class      = <none>
DC         = dc1
Drain      = false
Status     = ready
Attributes = arch:amd64, consul.datacenter:us-west-2, consul.revision:26a0ef8c41aa2252ab4cf0844fc6470c8e1d8256, consul.server:false, consul.version:0.6.4, cpu.frequency:2500.092000, cpu.modelname:Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, cpu.numcores:1, cpu.totalcompute:2500.092000, driver.docker:1, driver.docker.version:1.9.1, driver.exec:1, driver.rkt:1, driver.rkt.appc.version:0.7.4, driver.rkt.version:1.0.0, hostname:cptest-dev-client-050b7e212203733f6, kernel.name:linux, kernel.version:4.3.6-coreos, memory.totalbytes:2101010432, os.name:coreos, os.version:899.13.0, platform.aws.instance-type:t2.small, platform.aws.placement.availability-zone:us-west-2c, unique.cgroup.mountpoint:/sys/fs/cgroup, unique.consul.name:cptest-dev-client-050b7e212203733f6, unique.network.ip-address:10.0.4.198, unique.platform.aws.ami-id:ami-5bc4313b, unique.platform.aws.hostname:ip-10-0-4-198.service.consul, unique.platform.aws.instance-id:i-050b7e212203733f6, unique.platform.aws.local-hostname:ip-10-0-4-198.service.consul, unique.platform.aws.local-ipv4:10.0.4.198, unique.platform.aws.public-hostname:ec2-54-191-238-185.us-west-2.compute.amazonaws.com, unique.platform.aws.public-ipv4:54.191.238.185, unique.storage.bytesfree:97222623232, unique.storage.bytestotal:101552205824, unique.storage.volume:/dev/xvda9


==> Allocations
ID  Eval ID  Job ID  Task Group  Desired Status  Client Status

==> Resource Utilization
CPU  Memory MB  Disk MB  IOPS
0    0          0        0

Parveen Kumar

Apr 5, 2016, 2:58:19 AM
to Nomad
Everything looks fine. Try the commands below:
sudo docker stop $(docker ps -a -q)
sudo docker rm $(docker ps -a -q)
They will stop all the containers and then remove them.

I use these because some time ago I had a job that had been running perfectly start failing. The reason was that somehow the containers were not destroyed, so I ran the above two commands and then ran the job again; it started working.

cosmo....@anchor.com.au

Apr 5, 2016, 3:06:48 AM
to Nomad
On Tuesday, April 5, 2016 at 4:58:19 PM UTC+10, Parveen Kumar wrote:
everything looks fine. try below commands.
sudo docker stop $(docker ps -a -q)
sudo docker rm $(docker ps -a -q)
it will remove the all container first stopping them.

i am using the scripts as some time ago i had the issue where i ran the job which was running before perfectly but started failing. the reason was some how containers were not destroyed. so i ran the above two commands and then ran the job. it started working again.

That's good to know, though in this case there aren't any containers currently in docker, so the `docker stop` failed.

core@cptest-dev-client-050b7e212203733f6 ~ $ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
core@cptest-dev-client-050b7e212203733f6 ~ $

The client and server nodes are all totally fresh instances. Other than the container I started when testing for my previous message (with `docker run redis:latest`) and then removed (with `docker rm -v`), they haven't been used for anything besides testing for this topic.

Parveen Kumar

Apr 5, 2016, 3:50:16 AM
to Nomad
Can you test running this job?

job "test" {
    # Run the job in the global region, which is the default.
    # region = "global"

    # Specify the datacenters within the region this job can run in.
    datacenters = ["dc1"]

    # Priority controls our access to resources and scheduling priority.
    # This can be 1 to 100, inclusively, and defaults to 50.
    priority = 80   
   
    #constraint{
     #   distinct_hosts = true
    #}
   
    # Create a 'postgres' group. Each task in the group will be
    # scheduled onto the same machine.
    group "postgres" {
        # Control the number of instances of this groups.
        # Defaults to 1
        count = 2

        # Define a task to run
        task "postgres" {
            # Use Docker to run the task.
            driver = "docker"

            # Configure Docker driver with the image
            config {
                image = "postgres:9.5"               
                port_map {
                    TCP = 5432
                }
            }

            service {
                name = "postgres"
                tags = ["db"]
                port = "TCP"
            }

            # We must specify the resources required for
            # this task to ensure it runs on a machine with
            # enough capacity.
            resources {
                cpu = 500 # Mhz
                memory = 256 # MB

                network {
                    mbits = 10

                    # Request for a static port
                    port "TCP" {
                        static = 5432
                    }
                }
            }
        }
    }
}

Diptanu Choudhury

Apr 5, 2016, 4:10:10 AM
to cosmo....@anchor.com.au, Nomad
Hi,

The blocked eval was created because Nomad couldn't find a suitable job to run the job. So the failed allocation will have details around why Nomad couldn't find any suitable nodes.

Can you please share the output of "nomad alloc-status 96f07a2c"?

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/nomad/issues
IRC: #nomad-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Nomad" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nomad-tool+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nomad-tool/c8e2997d-a696-4ceb-9f15-7ed4512a9685%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Thanks,
Diptanu Choudhury

Diptanu Choudhury

Apr 5, 2016, 4:11:34 AM
to cosmo....@anchor.com.au, Nomad
Sorry for the typo, I meant - The blocked evaluation was created because Nomad couldn't find a suitable node to run the job.

Parveen Kumar

Apr 5, 2016, 5:58:35 AM
to Nomad
servers = ["server.nomad.service.consul"]

retry_join = ["consul.service.consul"]

Can you try giving IP addresses instead of resolving them using interpreted variables? Also check the Nomad log file; it will show which nodes it has joined. Because if the interpreted variables are not returning anything, your Nomad server will be alone in the cluster and the job will not be sent to any client, as a server won't run jobs on its own machine.

cosmo....@anchor.com.au

Apr 5, 2016, 5:55:36 PM
to Nomad
On Tuesday, April 5, 2016 at 6:11:34 PM UTC+10, Diptanu Choudhury wrote:
The blocked eval was created because Nomad couldn't find a suitable job to run the job. So the failed allocation will have details around why Nomad couldn't find any suitable nodes.

Can you please share the output of "nomad alloc-status 96f07a2c"?
 
Thanks for your reply Diptanu.

I get 'No allocation(s) with prefix or id "<alloc>" found':

$ nomad run example.nomad
==> Monitoring evaluation "b092d115"
    Evaluation triggered by job "example"
    Evaluation status changed: "pending" -> "failed"
==> Evaluation "b092d115" finished with status "failed"

$ nomad alloc-status b092d115
No allocation(s) with prefix or id "b092d115" found

Curling the '/v1/allocations' API endpoint returns an empty array.

Diptanu Choudhury

Apr 5, 2016, 5:57:37 PM
to cosmo....@anchor.com.au, Nomad
Hi, 

You are using the evaluation ID there; you need to use the allocation ID.

Running `nomad status example` should give you a list of allocations, and you can find the allocation ID of the failed allocation there.


cosmo....@anchor.com.au

Apr 5, 2016, 6:02:33 PM
to Nomad
On Wednesday, April 6, 2016 at 7:57:37 AM UTC+10, Diptanu Choudhury wrote:
You are using the evaluation ID there, you need to use the allocation id.

Running the nomad status example, should provide you with a list of allocations and you can find the allocation id of the failed allocation.

My apologies, I misunderstood what you were asking me to do.

However, nomad is not creating any allocations, so there isn't anything I can use as an argument to `nomad alloc-status`.

`nomad status example` shows an empty allocations set, there's some sample output near the bottom of my initial post[0].

Additionally, hitting the 'allocations' API endpoint via `curl localhost:4646/v1/allocations` returns an empty array.

Diptanu Choudhury

Apr 5, 2016, 6:08:02 PM
to cosmo....@anchor.com.au, Nomad
Oh I see, didn't realize that no allocations were created.

What does nomad eval-monitor for the failed eval say?


cosmo....@anchor.com.au

Apr 5, 2016, 6:15:30 PM
to Nomad
On Wednesday, April 6, 2016 at 8:08:02 AM UTC+10, Diptanu Choudhury wrote:
Oh I see, didn't realize that no allocations were created.

What does nomad eval-monitor for the failed eval say?

It looks to give similar output to `nomad run`.
 
==> Monitoring evaluation "b092d115"
    Evaluation triggered by job "example"
    Evaluation status changed: "pending" -> "failed"
==> Evaluation "b092d115" finished with status "failed"

If I pass in the verbose flag the output includes the full evaluation UUID (b092d115-[...]) but no additional information. 

cosmo....@anchor.com.au

Apr 5, 2016, 6:34:18 PM
to Nomad
On Tuesday, April 5, 2016 at 7:58:35 PM UTC+10, Parveen Kumar wrote:
can you try giving ip addresses instead of resolving it using interpreted variables?
also check nomad log file. it will show which all nodes it is joined.
because if interpreted variables are not returning anything then your nomad server will be alone in cluster and somehow the job will not be sent to any client as server won't run job in own machine.

The `nomad server-members` and `nomad node-status` commands show the expected number of client and server nodes. Hitting the '/v1/status/leader' API endpoint returns the private IP of one of the server nodes. I believe many Nomad commands (e.g. `nomad run`) will fail if the cluster doesn't have a leader, won't they?

I've included the logs from server startup to when `nomad run` failed in a github gist[0]. The logs from the server that eventually became the leader show the raft leadership election proceeding as expected, as far as I can tell. The logs also contain the correct private IP addresses for each node, so it looks like the Consul DNS ('consul.service.consul' etc) resolved as expected.

Apr 04 06:47:36 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:36 [INFO] serf: EventMemberJoin: cptest-dev-master-0d8633708cf4778b7.global 10.0.2.222
Apr 04 06:47:36 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:36 [INFO] nomad: starting 1 scheduling worker(s) for [service batch system _core]
Apr 04 06:47:36 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:36 [INFO] agent: Joining cluster...
Apr 04 06:47:36 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:36 [INFO] raft: Node at 10.0.2.222:4647 [Follower] entering Follower state
Apr 04 06:47:36 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:36 [INFO] nomad: adding server cptest-dev-master-0d8633708cf4778b7.global (Addr: 10.0.2.222:4647) (DC: dc1)
Apr 04 06:47:36 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:36 [INFO] agent: Join completed. Synced with 1 initial agents
Apr 04 06:47:41 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:41 [INFO] serf: EventMemberJoin: cptest-dev-master-048fc11bd951b6815.global 10.0.1.213
Apr 04 06:47:41 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:41 [INFO] nomad: adding server cptest-dev-master-048fc11bd951b6815.global (Addr: 10.0.1.213:4647) (DC: dc1)
Apr 04 06:47:43 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:43 [INFO] serf: EventMemberJoin: cptest-dev-master-00e94ad85f1d702fb.global 10.0.4.252
Apr 04 06:47:43 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:43 [INFO] nomad: adding server cptest-dev-master-00e94ad85f1d702fb.global (Addr: 10.0.4.252:4647) (DC: dc1)
Apr 04 06:47:43 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:43 [INFO] nomad: Attempting bootstrap with nodes: [10.0.2.222:4647 10.0.1.213:4647 10.0.4.252:4647]
Apr 04 06:47:43 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:43 [INFO] raft: Node at 10.0.2.222:4647 [Candidate] entering Candidate state
Apr 04 06:47:43 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:43 [INFO] raft: Election won. Tally: 2
Apr 04 06:47:43 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:43 [INFO] raft: Node at 10.0.2.222:4647 [Leader] entering Leader state
Apr 04 06:47:43 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:43 [INFO] nomad: cluster leadership acquired
Apr 04 06:47:43 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:43 [INFO] raft: pipelining replication to peer 10.0.1.213:4647
Apr 04 06:47:43 cptest-dev-master-0d8633708cf4778b7 nomad[792]: 2016/04/04 06:47:43 [INFO] raft: pipelining replication to peer 10.0.4.252:4647

cosmo....@anchor.com.au

Apr 5, 2016, 6:43:57 PM
to Nomad
On Tuesday, April 5, 2016 at 5:50:16 PM UTC+10, Parveen Kumar wrote:
can you test running this job

job "test" {
    [...]

After adding a second client node to account for the job's `count = 2` and waiting for the new node to appear as 'ready' in `nomad node-status`, I get output similar to the job created by `nomad init`:

==> Monitoring evaluation "0ee90e3d"
    Evaluation triggered by job "test"
    Evaluation status changed: "pending" -> "failed"
==> Evaluation "0ee90e3d" finished with status "failed"

Output from other commands is similar to that in my first comment (no allocations created, etc).

Parveen Kumar

Apr 5, 2016, 11:35:18 PM
to Nomad
Hi Cosmo,

Can you try running one of your client machines as both server and client? Just to test whether, when the job runs on a machine that is also a client, it deploys the task on Docker.

The docker.cleanup.image = false option disables the deletion of Docker images when we stop a job on Nomad. You can make a Nomad agent run both server and client at the same time, though it is not advisable in a production environment. A config file for testing purposes is below:

bind_addr = "0.0.0.0"

advertise {
  # We need to specify our host's IP because we can't
  # advertise 0.0.0.0 to other nodes in our cluster.
  rpc = "ip of same client machine:4647"
}

# Increase log verbosity
log_level = "DEBUG"

# Setup data dir
data_dir = "/opt/nomad/data"

# Enable the server
server {
    enabled = true
    bootstrap_expect = 1
    start_join = ["ip of same client machine"]
    retry_join = ["ip of same client machine"]
    retry_interval = "15s"
}

# Enable the client
client {
    enabled = true
    servers = ["ip of same client machine"]
    options {
             docker.cleanup.image = false
    }
}

Just try running example.nomad on the single server+client machine and check whether it works or not. If it works, then there is no issue with the node.

cosmo....@anchor.com.au

Apr 6, 2016, 7:30:41 PM
to Nomad
I believe I've found out what's causing it, though not why.

Inspired by Parveen's config in https://groups.google.com/d/msg/nomad-tool/x_CvBBDtoTc/Ea-75npnAQAJ I started messing around with mine some more.

It turns out this line is at fault:

reserved_ports = "22,8300-8600"

Removing it causes the evaluation to succeed. Attempting to reserve *any* ports causes it to fail, even if it's just "22", just "4646", or something totally unused like 9000.

I'll try to find out some more about what's going on so that I can file a github issue, if only so that anyone else who hits this can find the cause. If anyone has any ideas for any likely causes then I'm all ears.

Thanks again to Parveen and Diptanu for all the time they've spent on this.

Diptanu Choudhury

Apr 7, 2016, 7:41:32 PM
to cosmo....@anchor.com.au, Nomad
This has been fixed on master, and should come out with Nomad 0.3.2 some time next week!


Cameron Davison

May 17, 2016, 12:28:36 PM
to Nomad
I am running 0.3.2 and came across this thread because I was having the same problem. I likewise commented out
reserved_ports = "22,4194,7301,8300-8600"
restarted the client, and it worked.

Did this fix actually get pushed out?

Cameron

cosmo....@anchor.com.au

May 17, 2016, 6:32:05 PM
to Nomad
On Wednesday, May 18, 2016 at 2:28:36 AM UTC+10, Cameron Davison wrote:
I am running 0.3.2, came across this thread because I was having the same problem. Did the same commented out
reserved_ports = "22,4194,7301,8300-8600"
restarted the client, and it worked.

Did this fix actually get pushed out?

For reference, the github issue is here[0]. It looks like the change made it into master[1].

I just tried to replicate this issue with 0.3.2 using the steps in the first post of that github issue. Nomad was able to place the job successfully, even when I used `reserved_ports = "22,4194,7301,8300-8600"` rather than just `reserved_ports = "22"`.

Are you certain that you've fully updated to 0.3.2? Out of interest, does your job attempt to statically assign any of the ports in your reserved_ports range?
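If it helps, checking for that kind of overlap mechanically only takes a few lines. This is a rough sketch of my own, not anything from Nomad itself:

```python
def parse_spec(spec):
    """Parse "22,4194,7301,8300-8600" into (lo, hi) tuples; a single port becomes (p, p)."""
    ranges = []
    for part in spec.split(","):
        lo, _, hi = part.strip().partition("-")
        ranges.append((int(lo), int(hi or lo)))
    return ranges

def conflicts(static_ports, spec):
    """Return the job's static ports that fall inside any reserved range."""
    ranges = parse_spec(spec)
    return [p for p in static_ports if any(lo <= p <= hi for lo, hi in ranges)]

# e.g. a job asking for static ports 5432 and 8301 against that reserved_ports line
print(conflicts([5432, 8301], "22,4194,7301,8300-8600"))  # [8301]
```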

Cameron Davison

May 18, 2016, 7:47:00 PM
to cosmo....@anchor.com.au, Nomad
I am certain that I was running 0.3.2, since I never ran anything before that. I am running a cluster of 3 servers and 2 client nodes; no machines act as both. Not sure if that matters. I cannot reproduce it using the steps that were given in the issue, only when I set up the 5 servers.

Cameron


cosmo....@anchor.com.au

May 18, 2016, 9:28:32 PM
to Nomad
On Thursday, May 19, 2016 at 9:47:00 AM UTC+10, Cameron Davison wrote:
I am certain that I was running 0.3.2 since I never ran anything before that. I am running a cluster of 3 servers, and 2 client nodes. No machines act as both. Not sure if that matters. I cannot reproduce it using the steps that were given in the issue only when I setup the 5 servers.

It doesn't sound directly related to the original issue in this thread. Perhaps you've hit another similar bug, or maybe there's some more general issue with your setup. It's probably worth trying to put some minimal reproduction steps together. That'll either give a good foundation for a bug report or help you identify any accidental misconfiguration.

Bagelswitch

Aug 14, 2016, 12:12:59 PM
to Nomad
FWIW - I can repro this on 0.4.0 and 0.4.1-rc1 - on a single node server+client, using a couple of reserved port ranges, everything works fine. Running separate server and client nodes, if the client config specifies _any_ reserved ports, no job can run - the evaluation fails with no useful output anywhere and no allocations are created. If the reserved ports are removed from the client config, everything is fine. This is reproducible with any job, including trivial examples.

Alex Dadgar

Aug 15, 2016, 4:14:46 PM
to Bagelswitch, Nomad
Hey Bagelswitch,

I just verified this by adding reserved_ports = "20000-59990" to a client, asked for dynamic ports, and it worked. Please let me know how you are reproducing this. It would be best if you opened an issue.

Thanks,
Alex 

On Sun, Aug 14, 2016 at 9:12 AM, Bagelswitch <bagel...@gmail.com> wrote:
FWIW - I can repro this on 0.4.0 and 0.4.1-rc1 - on a single node server+client, using a couple of reserved port ranges, everything works fine. Running separate server and client nodes, if the client config specifies _any_ reserved ports, no job can run - the evaluation fails with no useful output anywhere and no allocations are created. If the reserved ports are removed from the client config, everything is fine. This is reproducible with any job, including trivial examples.


Bagelswitch

Aug 18, 2016, 1:54:11 PM
to Nomad
Thanks Alex - opened https://github.com/hashicorp/nomad/issues/1617 with details.


On Monday, August 15, 2016 at 1:14:46 PM UTC-7, Alex Dadgar wrote:
Hey Bagelswitch,

I just verified this by adding reserved_ports = "20000-59990" to a client and was asking for dynamic ports and it worked. Please let me know how you are reproducing this. It would be best if you opened an issue.

Thanks,
Alex 
On Sun, Aug 14, 2016 at 9:12 AM, Bagelswitch <bagel...@gmail.com> wrote:
FWIW - I can repro this on 0.4.0 and 0.4.1-rc1 - on a single node server+client, using a couple of reserved port ranges, everything works fine. Running separate server and client nodes, if the client config specifies _any_ reserved ports, no job can run - the evaluation fails with no useful output anywhere and no allocations are created. If the reserved ports are removed from the client config, everything is fine. This is reproducible with any job, including trivial examples.
