Running out of disk with long running batch jobs


Wes Higbee

Apr 12, 2016, 11:23:41 AM
to Nomad
I have a test environment set up with a single Nomad server that has 200 GB of disk space.

When batch jobs pile up, the disk on the Nomad server slowly runs out. A typical job has 300 tasks that take about 8 minutes apiece, and around 10 such jobs pile up at a time. Right now our cluster only has resources for ~50 of these jobs to run concurrently.

It seems that the jobs with repeated failed placements are what take up so much disk space. I could be wrong, though; the queued-up work could simply be coincidental.

I did some digging: the space is being consumed by raft/snapshots/number/state.bin files. There are three of these, clocking in at 51G, 65G and 65G. Is that normal?
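
For anyone wanting to repeat this check, here is a minimal sketch that walks a snapshots directory and reports the size of each state.bin file. The function names are made up for illustration, and the directory you pass in depends on your own data_dir setting; nothing here is part of Nomad itself.

```python
import os


def snapshot_sizes(snapshots_dir):
    """Map each state.bin path under snapshots_dir to its size in bytes."""
    sizes = {}
    for root, _dirs, files in os.walk(snapshots_dir):
        for name in files:
            if name == "state.bin":
                path = os.path.join(root, name)
                sizes[path] = os.path.getsize(path)
    return sizes


def total_gib(sizes):
    """Sum the sizes returned by snapshot_sizes and convert to GiB."""
    return sum(sizes.values()) / 2**30
```

Calling snapshot_sizes("/var/lib/nomad/server/raft/snapshots") would cover the layout reported later in this thread; adjust the path if your data directory lives elsewhere.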

Consequently, I have to stop the Nomad server and blow away the data directory to get things going again. I assume I could avoid this with multiple Nomad servers, but that might be a separate discussion.

Is the disk space problem normal? How should I estimate disk needs?

Alex Dadgar

Apr 12, 2016, 12:38:57 PM
to Wes Higbee, Nomad
Hey Wes,

Going to multiple servers will not actually reduce the storage; that file will be replicated on each server. Would you mind sharing:

* How many jobs you have submitted (https://www.nomadproject.io/docs/http/jobs.html: /v1/jobs)
* Number of allocations (https://www.nomadproject.io/docs/http/allocs.html: /v1/allocations)
* Number of evals (https://www.nomadproject.io/docs/http/evals.html: /v1/evaluations)
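
A rough sketch of how those three counts could be gathered in one go, assuming the agent's HTTP API is reachable at http://127.0.0.1:4646 (Nomad's default HTTP port) and that each list endpoint returns a JSON array. The helper names count_items and fetch_counts are made up for this example.

```python
import json
import urllib.request

# Assumption: the Nomad agent listens on the default HTTP address.
NOMAD_ADDR = "http://127.0.0.1:4646"

# The three list endpoints named in the bullets above.
ENDPOINTS = {
    "jobs": "/v1/jobs",
    "allocations": "/v1/allocations",
    "evaluations": "/v1/evaluations",
}


def count_items(body):
    """Return the number of entries in a JSON array response body."""
    items = json.loads(body)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array")
    return len(items)


def fetch_counts(addr=NOMAD_ADDR):
    """Query each endpoint once and return a name -> count mapping."""
    counts = {}
    for name, path in ENDPOINTS.items():
        with urllib.request.urlopen(addr + path) as resp:
            counts[name] = count_items(resp.read().decode("utf-8"))
    return counts
```

Running print(fetch_counts()) on a machine that can reach the server should give the three numbers. With very large allocation counts the responses can be big, so expect the call to take a while.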

I am not sure whether the job file you gave me had duplicate tasks just to reproduce that bug, but there is a syntax that can save you time and reduce the size of the jobs, and thus the space used on the servers.

Instead of repeating:

task "foo" {
  // Config
}

you can use:

group "my-group" {
  count = 100

  task "foo" {
    // Config
  }
}

Thanks,
Alex

Ki Wong

May 3, 2016, 12:51:37 PM
to Nomad
I just ran into this problem:


# du -sh /var/lib/nomad/
31G /var/lib/nomad/
# du -sh /var/lib/nomad/server/raft/*
4.0K /var/lib/nomad/server/raft/peers.json
527M /var/lib/nomad/server/raft/raft.db
30G /var/lib/nomad/server/raft/snapshots

I have a relatively small '/'. Here are some numbers for you:

* I have a 20-node cluster with 3 servers and 17 clients
* I submitted 155 jobs, each containing 42 task-groups with 1 task in each group.
* All tasks are of type 'batch' and using 'raw_exec' driver to launch 'docker-compose'
* In the end, each client node runs about 4 tasks concurrently

I returned this morning to find all my servers dead because of exhausted disk space on '/'.
What is the best practice for managing these snapshots? I have room to grow '/', but
it is not unlimited.

Thanks,

-kc

Alex Dadgar

May 3, 2016, 5:38:51 PM
to Ki Wong, Nomad
Hey Ki,

Sorry you ran into this. It is a bit hard to estimate, because batch jobs currently keep creating more allocations as long as there is a queue, which is what causes the unbounded growth of the snapshots. After 0.4, estimation should be easier: you can look at steady-state usage and then give yourself 50% headroom for the periods between GC runs.
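
As a worked example of that rule of thumb, here is a tiny sketch. The function name and the 40 GB figure are hypothetical, chosen only to illustrate the arithmetic; they are not measurements from this thread.

```python
def disk_estimate_gb(steady_state_gb, headroom=0.5):
    """Steady-state usage plus a fractional headroom (0.5 means 50%)."""
    return steady_state_gb * (1 + headroom)


# Hypothetical steady-state usage of 40 GB -> plan for 60 GB.
estimate = disk_estimate_gb(40)
```

The headroom is there to absorb growth between garbage-collection passes, so a cluster with a bursty job queue may want a larger fraction.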

Ki Wong

May 3, 2016, 10:07:05 PM
to Nomad
I now fully understand what you mean. I see all these failed allocations when I run `nomad status <job>`. Initially the cluster is mostly used for batch jobs, each performing some kind of simulation. So this situation is very common: a large number of jobs for far fewer slots. This means failed allocations will happen a lot. I think it is worth pondering how to tweak these repeated allocation attempts to work better for that model. I'm loath to build a layer on top that trickles batches of batch jobs to Nomad just to work around this allocation problem.

Thanks,

-kc

Alex Dadgar

May 4, 2016, 12:48:01 PM
to Ki Wong, Nomad
Hey Ki,

Please don't build that :) I will be tackling this problem rather soon; it should be solved in the next 2-3 weeks!

Thanks,
Alex

Ki Wong

May 4, 2016, 1:10:23 PM
to Nomad
Believe me, I don't want to. I have enough to wrestle with, from rbd docker-plugin exclusive lock problems to docker-compose doing weird container naming. I appreciate your putting some thought into this.

-kc