Hey all --
I have a job with some constraints at the TaskGroup level that aren't working as I'd expect.
The cause is probably my own naivete, but I've had some difficulty understanding how
to troubleshoot scheduling constraints in general, so I'm reaching out for tips.
Background:
I'm using these two constraint examples, which are directly from the docs:
constraint {
  attribute = "${attr.platform.aws.instance-type}"
  value     = "m4.2xlarge"
}
constraint {
  distinct_hosts = true
}
Also, I am using 'count = 2' to spawn two instances of my task on two separate hosts.
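For reference, this is roughly how those pieces fit together in the job file. It is a trimmed-down sketch rather than my exact file: the datacenter, the group and task names, the driver, and the image are placeholders, and only the constraints, the count, the job ID, and the job type match what I described.

```hcl
job "testjob" {
  datacenters = ["dc1"] # placeholder datacenter
  type        = "service"

  group "example-group" { # placeholder group name
    count = 2

    # Only place allocations on m4.2xlarge nodes.
    constraint {
      attribute = "${attr.platform.aws.instance-type}"
      value     = "m4.2xlarge"
    }

    # Never place two allocations of this group on the same node.
    constraint {
      distinct_hosts = true
    }

    task "example-task" { # placeholder task name
      driver = "docker" # placeholder driver

      config {
        image = "example/image:latest" # placeholder image
      }
    }
  }
}
```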
The constraints are defined at the TaskGroup level. I can see them represented in
the output of 'nomad deployment status' when the job is run, like so:
"Constraints": [
  {
    "LTarget": "",
    "Operand": "distinct_hosts",
    "RTarget": "true"
  },
  {
    "LTarget": "${attr.platform.aws.instance-type}",
    "Operand": "=",
    "RTarget": "m4.2xlarge"
  }
],
"Count": 2,
so I'm confident they are being added.
The expected behavior is that tasks would be scheduled on the two m4.2xlarge instances, one per host.
What happens is the tasks are scheduled on two distinct hosts, but often on instances that are NOT m4.2xlarge.
So the 'distinct_hosts' constraint seems to be honored, but the instance-type constraint seems
to be silently discarded.
I have some general questions about how to troubleshoot this:
1.) Is there a way to find out how 'attr.platform.aws.instance-type' is being
evaluated as part of the deployment?
2.) It seems like there should be a way to ask nomad how it made its
scheduling decisions, but the docs left me scratching my head.
The documentation explains how to use 'nomad eval-status <eval>',
which I can use to track down the specific eval that was used to erroneously
assign my task to the non-2xlarge host.
However, the output doesn't give any additional clues:
> nomad eval-status XXXXXXXX
ID = XXXXXXXX
Status = complete
Status Description = complete
Type = service
TriggeredBy = job-register
Job ID = testjob
Priority = 50
Placement Failures = false
As I said, I'm not experienced with nomad but I'm interested in digging deeper.
Any advice on how to troubleshoot scheduling constraints like this one would be
greatly appreciated.