I have a test environment set up with a single Nomad server that has 200 GB of disk space on it.
When batch jobs pile up, the disk on the Nomad server slowly runs out. A typical job has 300 tasks that take about 8 minutes apiece, and a typical backlog is around 10 such jobs. Right now our cluster only has resources for ~50 of these jobs to run concurrently.
It seems like something about these jobs, which go through repeated failed placements, takes up a lot of disk space. I could be wrong, though; the queued-up work could simply be coincidental.
I did some digging, and the space is being consumed by raft/snapshots/number/state.bin files. There are three of these, clocking in at 51G, 65G and 65G, or about 181G in total on a 200 GB disk. Is that normal?
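For anyone who wants to check the same thing on their own server, here is a minimal sketch of how the snapshot sizes could be tallied. The root path below is only an assumed example location (use whatever data directory your server is configured with), and the helper names `snapshot_sizes` and `format_gib` are made up for illustration.

```python
import os


def snapshot_sizes(root):
    """Return (path, size_in_bytes) for every state.bin under root, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "state.bin" in filenames:
            path = os.path.join(dirpath, "state.bin")
            found.append((path, os.path.getsize(path)))
    return sorted(found, key=lambda item: item[1], reverse=True)


def format_gib(size_bytes):
    """Render a byte count as GiB with one decimal place."""
    return f"{size_bytes / 2**30:.1f} GiB"


if __name__ == "__main__":
    # Assumed path for illustration only; substitute your server's data directory.
    root = "/var/lib/nomad/server/raft/snapshots"
    sizes = snapshot_sizes(root)
    for path, size in sizes:
        print(format_gib(size), path)
    print("total:", format_gib(sum(size for _path, size in sizes)))
```

Running this periodically while the queue is backed up should show whether the snapshots grow along with the backlog or independently of it.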
Consequently, I have to stop the Nomad server and blow away the data directory to get things going again. I assume I could avoid this with multiple Nomad servers, but that might be a separate discussion.
Is the disk space problem normal? How should I estimate disk needs?