Deploying jobs in Nomad takes an increasingly long time

Paris Morali

Feb 1, 2019, 5:58:50 AM
to nomad...@googlegroups.com
Hi,

On a freshly created Nomad cluster, deploying a job is very quick: 5-10 seconds at most from nomad run asdf.nomad to the allocation running.

However, as time goes by (weeks, months), submitting new jobs takes longer and longer, until we reach a 4-5 minute wait.

The bottleneck seems to be the "task setup/building directory" phase, just after downloading the artifact to deploy (jars or Docker images in my case).

Is there a way to perform effective housekeeping on my Nomad nodes in order to bring the deployment time back down?
I've tried cleaning up directories and files under the allocations/* directory, but that didn't seem to help.

---
Nomad version: 0.5.6
--- 

Thanks for your help
Paris

Lowe Schmidt

Feb 1, 2019, 10:39:17 AM
to Paris Morali, Nomad
Hey Paris,

Have you considered upgrading your Nomad version? 0.5.6 is almost two years old.
--
Lowe Schmidt | +46 723 867 157


Chris Stevens

Feb 1, 2019, 11:08:22 AM
to Nomad
Hi Paris,

I saw your post and just wanted to mention that I have observed similar behavior over the last 12-18 months, up to and including v0.8.7.

I don't really have any insights on the matter, but for us anyway, different environments behave differently in this regard.

We have a very lightly loaded "test" environment that has almost zero scheduling delay.

We have a European region, created within the last 9 months, that is slightly more loaded and also has almost zero scheduling delay.

Finally, our original production region, which has been running since the 0.5.5 days or so, takes 3-5 minutes or more to even place some allocations during deployments. As a result, we have had to stop setting the auto_revert flag, since some jobs simply don't finish deploying.

Paris Morali

Feb 1, 2019, 11:19:07 AM
to Lowe Schmidt, Nomad
Hi, thanks for replying.

I would consider upgrading if I knew the problem would go away. Are you suggesting there is a known issue with the version I'm running?

Michael Schurter

Feb 1, 2019, 12:06:25 PM
to Paris Morali, Lowe Schmidt, Nomad
Hi Paris,

This is not a known issue, and judging by Chris's reply, upgrading will not help. Scheduling performance should not degrade over time.

Chris and Paris,

Have you noticed this slowdown only with the java/exec drivers? If the slowdown is during task directory setup, that's when the Nomad client creates the chroot, which incurs significant IO. If your host filesystem is growing over time, so will the time it takes to create the chroot. Image-based drivers such as Docker would not see this slowdown.

You can customize what Nomad puts in the java/exec chroot with this agent configuration setting: https://www.nomadproject.io/docs/configuration/client.html#chroot_env-parameters
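
For illustration only, a trimmed-down chroot can be declared in the client stanza of the agent configuration. The set of host paths below is just a sketch; what you actually need depends on your java tasks:

client {
  chroot_env {
    # host path = destination path inside each task chroot
    "/bin"               = "/bin"
    "/lib"               = "/lib"
    "/lib64"             = "/lib64"
    "/etc/ld.so.cache"   = "/etc/ld.so.cache"
    "/etc/ld.so.conf"    = "/etc/ld.so.conf"
    "/etc/ld.so.conf.d"  = "/etc/ld.so.conf.d"
    "/etc/passwd"        = "/etc/passwd"
    "/usr/lib/jvm"       = "/usr/lib/jvm"
  }
}

The smaller the set of directories copied into each chroot, the less IO each task setup incurs.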

If the latency is occurring before allocations are picked up by the client nodes, then we need to start investigating your servers. Logs, the number of jobs, the number of nodes, etc. will all be useful data points.

Hope this helps!
Michael Schurter

Chris Stevens

Feb 1, 2019, 12:29:47 PM
to Nomad
We use the Docker driver exclusively, so the host filesystem optimizations likely won't help.

I don't have any detailed logs right now, only some notes from previous occurrences.

This is what we'll see while waiting for placement:

$ nomad status 3d959224
ID                  = 3d959224
Eval ID             = d81bead3
Name                = api.api-nginx[1]
Node ID             = 12a730a4
Job ID              = api
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 1h28m ago
Modified            = 1h28m ago
Deployment ID       = cd726d5a
Deployment Health   = unset

Couldn't retrieve stats: Unexpected response code: 404 (Unknown allocation "3d959224-ec24-4a79-7ad1-d0247a71c64c")

Many times in the most problematic datacenter, the allocation will get placed and come up correctly after several minutes in this condition.

Other times, the progress_deadline will be hit and the deployment of the task stops.
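
For reference, progress_deadline and the auto revert flag I mentioned earlier both live in the job's update stanza. This is only a rough sketch, reusing the job/group names from the status output above with illustrative values:

job "api" {
  group "api-nginx" {
    update {
      max_parallel      = 1
      progress_deadline = "10m"
      auto_revert       = false
    }
    # ... tasks ...
  }
}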

We co-locate the Nomad, Vault, and Consul server processes, so I can't rule out some kind of contention. But these are EC2 m5.xlarge or similar instances with SSD volumes, and they always seem to run at around 5% CPU and 30% memory in our monitoring.

Michael Schurter

Feb 1, 2019, 12:44:31 PM
to Chris Stevens, Nomad
Chris,

Any chance you have Nomad at debug-level logging? The log lines (with timestamps!) I'm interested in are:
  • On the servers, any lines mentioning the relevant evaluation
  • On the clients, any lines mentioning a slow allocation, as well as the preceding "updated allocations" lines, which should look something like:
2019/02/01 09:37:12.580329 [DEBUG] client: updated allocations at index 11 (total 1) (pulled 1) (filtered 0)
2019/02/01 09:37:12.580780 [DEBUG] client: allocs: (added 1) (removed 0) (updated 0) (ignore 0)

Those client log lines represent the time at which a client first receives the allocation.

The evaluation is the first ID output by nomad run:

$ nomad run hello.nomad
==> Monitoring evaluation "805af636"
    Evaluation triggered by job "hello"
    Allocation ...

Hopefully the timings between those events will pinpoint where the slowdown is. 
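
If debug logging isn't already enabled, it can be turned on in the agent configuration. A minimal sketch; the file path is only an example:

# e.g. /etc/nomad.d/agent.hcl
log_level = "DEBUG"

# the agent also accepts -log-level=DEBUG on the command line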


Paris Morali

Feb 4, 2019, 6:52:32 AM
to Michael Schurter, Paris Morali, Lowe Schmidt, Nomad
Thanks a lot Michael, this sounds like a good lead. We will give this a go and hopefully manage to resolve the issue and report back here.

Cheers
Paris

Michael Schurter

Feb 4, 2019, 10:33:13 AM
to Paris Morali, Paris Morali, Lowe Schmidt, Nomad
A coworker pointed out that 0.9.0 includes significant improvements to how the Nomad client garbage collects dead allocations. If the slowdown is in the client and the client has accumulated dead allocations (as clients in long-running clusters would), garbage collection may be the culprit. Garbage collecting old allocations can block the starting of new allocations, and during that time the alloc status will appear as Chris posted (available from the server but a 404 from the client).

Unfortunately this is a nontrivial thing to test in the 0.9.0 beta, but hopefully the logging will at least let us confirm we've found the root cause!
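
In the meantime, client garbage collection is tunable via the client stanza. This is only a rough sketch with illustrative values, and gc_max_allocs only exists in more recent releases; check the client configuration docs for the defaults in your version:

client {
  # how often the client scans for terminal allocations to garbage collect
  gc_interval = "1m"

  # upper bound on allocations a client tracks before it starts GCing terminal ones
  gc_max_allocs = 50

  # how many allocation directories are destroyed in parallel
  gc_parallel_destroys = 2
}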

Chris Stevens

Feb 4, 2019, 4:00:43 PM
to Nomad
This is a very interesting idea!

I believe we turn over our clients at least every 60 days, and maybe every 30, so some could be out there for a long time.

Another interesting aspect is that we run a bunch of parameterized jobs on the clients where we see this issue the most (by node class, so it is a clear delineation). I would imagine that there is a larger number of dead allocations on these nodes than on the other node classes.

I'll look for an opportunity to gather some log data, but it might be a bit.

- Chris
 

Michael Schurter

Feb 4, 2019, 4:07:14 PM
to Chris Stevens, Nomad
On Mon, Feb 4, 2019 at 1:00 PM Chris Stevens <chris....@traxo.com> wrote:
> Another interesting aspect is that we run a bunch of parameterized jobs on the clients where we see this issue the most (by node class, so it is a clear delineation). I would imagine that there is a larger number of dead allocations on these nodes than on the other node classes.

Yes! Batch workflows in general probably cause a much higher rate of dead allocations and GCing. 

Chris Stevens

Feb 14, 2019, 4:37:29 PM
to Nomad
I am trying to build Nomad from source (master branch) in my standard CentOS Linux VM.

I get stuck on something that seems too basic to be the real issue...

[vagrant@localhost nomad]$ make bootstrap
==> Updating build dependencies...
go get -u github.com/kardianos/govendor
go get -u github.com/hashicorp/go-bindata/go-bindata
go get -u github.com/elazarl/go-bindata-assetfs/go-bindata-assetfs
go get -u github.com/a8m/tree/cmd/tree
go get -u github.com/magiconair/vendorfmt/cmd/vendorfmt
go get -u gotest.tools/gotestsum
Installing codec/codecgen@v1.1.2 ...
Unknown option: -C
usage: git [--version] [--help] [-c name=value]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p|--paginate|--no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]
Unknown option: -C
usage: git [--version] [--help] [-c name=value]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p|--paginate|--no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]
make: *** [deps] Error 129

While on Darwin, it builds just fine for me.

Dani Tomlinson

Feb 14, 2019, 5:49:49 PM
to Chris Stevens, Nomad
Hey Chris,

It looks like your version of git is too old to support the `-C` option, which IIRC was introduced in git 1.8.5.

Could you update your version of git and try again?

Thanks!


Chris Stevens

Feb 15, 2019, 8:04:53 AM
to Nomad
Thanks Danielle! Old git (1.8.3.1) was indeed the issue on that line.

Now, it looks like an issue with gometalinter:

go get -u gotest.tools/gotestsum
Installing codec/codecgen@v1.1.2 ...
Installing protobuf/protoc-gen-go@v1.2.0 ...
==> Updating linter dependencies...
go get -u github.com/alecthomas/gometalinter
gometalinter --install
bash: gometalinter: command not found
make: *** [lint-deps] Error 127

This is a fairly new CentOS 7.6 VM with go version go1.11.4 linux/amd64.
