Hi,
We tried to use a system job with Nomad class filtering and we are seeing strange behavior in the "nomad run" exit code and output.
Our use case:
We have a system job that runs on an auto scaling group (on AWS).
The instances of this group have the Nomad class "foo", so the job definition looks like this:
job "test" {
datacenters = ["dc1"]
type = "system"
constraint {
attribute = "${node.class}"
value = "foo"
}
[...]
}
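(For context, the instances in the group get their class from the Nomad client configuration; a minimal sketch, assuming the class is set via node_class in the client stanza:)

client {
  enabled    = true
  node_class = "foo"
}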
So the job will be deployed on all instances of the auto scaling group, and if we scale up the group,
Nomad automatically deploys the job on the newly launched instance.
It's really cool, but at job submission we get strange output.
Here are our (simplified) cluster nodes:
- A: Instance with class="bar
- B: Instance with class="bar"
- C: Instance with class="baz"
Autoscaling group:
- D1: Instance with class="foo"
When we run the job above we have the following output:
==> Monitoring evaluation "d1e000cd"
Evaluation triggered by job "test"
Allocation "51b3d960" modified: node "a45700d3", group "test"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "d1e000cd" finished with status "complete" but failed to place all allocations:
Task Group "test" (failed to place 3 allocations):
* Class "bar" filtered 1 nodes
* Constraint "${node.class} = foo" filtered 1 nodes
I think this is because a system job has only one evaluation, but these numbers are weird:
- Class "bar" actually filtered 2 nodes
- The node.class constraint filtered 3 nodes (or indeed 1 if we subtract the previous line)
Also, the output contains a specific line for class "bar" but not for class "baz", which is pretty weird.
And our main problem is that the exit code of the "nomad run" command is 2.
I read in the code:
"On successful job submission and scheduling, exit code 0 will be
returned. If there are job placement issues encountered
(unsatisfiable constraints, resource exhaustion, etc), then the
exit code will be 2. Any other errors, including client connection
issues or internal errors, are indicated by exit code 1."
But in this case we can't programmatically differentiate whether Nomad managed to place at least one allocation or whether no allocations were placed at all.
In both cases, the exit code will be 2.
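The only workaround we see is to query the API ourselves after submission to check whether anything was placed. A rough sketch with the official Go client (github.com/hashicorp/nomad/api), assuming the "test" job above and the default NOMAD_ADDR:

package main

import (
	"fmt"
	"os"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		fmt.Fprintln(os.Stderr, "error creating client:", err)
		os.Exit(1)
	}

	// List the allocations of the system job submitted above ("test").
	allocs, _, err := client.Jobs().Allocations("test", false, nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error listing allocations:", err)
		os.Exit(1)
	}

	// If the list is empty, nothing was placed at all; otherwise at
	// least one node matched the constraint.
	if len(allocs) == 0 {
		fmt.Println("no allocations placed")
		os.Exit(1)
	}
	fmt.Printf("%d allocation(s) placed\n", len(allocs))
}

But it would be much nicer if the exit code made that distinction directly.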
Is this a good way to use constraints with a system job?
If yes (and I hope so :)), are the output and exit code normal?
Otherwise I can open issues on GitHub!
Thanks !
Cyril.