What is the life cycle of a batch job?


Chris Hines

Oct 14, 2015, 9:39:57 PM
to Nomad
The current documentation seems focused on service jobs. There are a few mentions of batch jobs, but they are not explained in much detail. Our current workload is almost entirely batch-oriented. We manage it with an in-house system that is several years old and struggles with some scalability issues. I've been replacing components to improve stability and performance, but we're open to using another system if it meets our needs. Nomad seems to check many of the boxes we need, so I am excited about its potential.

Here are some questions I have:
  • What happens to a batch job when all of its tasks exit?
  • How will the scheduler behave if there is more work than available resources?
  • Where do the stdout and stderr of a fork/exec task go?
  • Are there plans to support a workflow DAG for the tasks within a job?
  • What sort of job/task throughput can Nomad achieve if used nearly exclusively for batch workloads?
  • How many concurrent jobs/tasks can Nomad manage?
Thanks,
Chris

Armon Dadgar

Oct 14, 2015, 9:48:11 PM
to Nomad, Chris Hines
Chris,

You are correct: our initial focus was on service workloads, but we are looking to improve this
dramatically with Nomad 0.3. The major changes will be periodic jobs and job queueing.

Currently, because the focus was on services, a job is failed if there is no capacity. Batch jobs
should instead be queued. We will also split the batch definition from the batch instance to handle
things like overlapping batch jobs (e.g. periodic every 10 minutes, but taking 1 hour to complete).

So to answer each question:
* When tasks exit, the batch job should be considered complete and deleted eventually.
  Currently the job is retained until explicitly deleted.
* When resources are exhausted, batch jobs should queue. Currently they are failed.
* Output goes to local disk, in the allocation directory (something like data-dir/alloc-id/task-name/logs/stdout.log)
* Jobs will be able to depend on other jobs, which must form a DAG (potentially in 0.3)
* Nomad is designed for very large workloads. It is a concurrent scheduler, so if you have
  3 servers with 8 cores each, it can make 24 parallel scheduling decisions. Thousands of jobs
  should not be an issue.
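
To make the batch life cycle concrete, here is a rough Go sketch that watches a job's allocations over the HTTP API until every task has exited. The agent address, the job name "example", and the exact status strings are assumptions based on the /v1/job/<id>/allocations listing endpoint rather than anything pinned to a specific Nomad release:

    // watch.go - poll a batch job's allocations until they all reach a terminal state.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "time"
    )

    // allocStub holds the few fields we care about from the allocation listing.
    type allocStub struct {
        ID           string
        TaskGroup    string
        ClientStatus string // e.g. pending, running, complete, failed
    }

    func main() {
        const url = "http://127.0.0.1:4646/v1/job/example/allocations"
        for {
            resp, err := http.Get(url)
            if err != nil {
                fmt.Println("agent unreachable:", err)
                return
            }
            var allocs []allocStub
            err = json.NewDecoder(resp.Body).Decode(&allocs)
            resp.Body.Close()
            if err != nil {
                fmt.Println("decode:", err)
                return
            }

            done := len(allocs) > 0
            for _, a := range allocs {
                fmt.Printf("%s  %-12s %s\n", a.ID[:8], a.TaskGroup, a.ClientStatus)
                if a.ClientStatus != "complete" && a.ClientStatus != "failed" {
                    done = false
                }
            }
            if done {
                return // every task has exited, so the batch job is effectively finished
            }
            time.Sleep(5 * time.Second)
        }
    }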

We are definitely still in the early days. Nomad 0.2 should make the service experience
much better, while 0.3 will improve the batch experience. Hope that helps!

Best Regards,
Armon Dadgar

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/nomad/issues
IRC: #nomad-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Nomad" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nomad-tool+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nomad-tool/b72ceb92-dfc3-430e-96b1-c86f14318fb6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Chris Hines

Oct 15, 2015, 3:50:25 PM
to Nomad
Armon,

Thanks for the quick and thorough response. I think Nomad has a great future, and so far it looks very promising as a replacement for our current system. You may not remember, but we spoke at some length after your Consul presentation at the Galvanize event before GopherCon 2014. At the time I was already working on replacing some pieces of our in-house system and was intrigued by the idea of a Raft-powered job scheduler (maybe built on Consul) providing fault tolerance, and it was nice to bounce some of those ideas off you. So you can imagine my excitement to see that HashiCorp is putting resources into building a tool like Nomad.

Following up on your answers:

You are correct: our initial focus was on service workloads, but we are looking to improve this
dramatically with Nomad 0.3. The major changes will be periodic jobs and job queueing.

That is nice to hear. We already have higher-level systems that manage our recurring batch jobs as separate job submissions. In our case the jobs recur, but they also must wait for all of their inputs to be ready. For example, we have many jobs that run on an hourly cadence but must wait for upstream data feeds. We have a separate system that triggers job submission when files appear matching certain patterns (or are registered in a global file registry). It is low tech, but it works well and keeps the jobs nicely decoupled from each other. Each job is then a DAG of tasks (command lines) that run on available nodes. We almost always have hundreds of tasks queued, either waiting for other tasks to complete or waiting for processing slots that meet their needs. We could easily have a much longer queue, but to avoid scaling problems in our current scheduler we work hard to only submit jobs when they are likely to have some tasks scheduled soon. If we were not worried about overloading our scheduler, we could easily dump ~4,000 jobs (~50 tasks each) into the system each morning on top of the regular hourly submissions.
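
As a toy illustration of that file-trigger pattern (everything here is hypothetical: the glob, the poll interval, and payload.json, a pre-written job registration body whose exact shape depends on the Nomad version), a submission loop might look roughly like this:

    // trigger.go - register a pre-written Nomad job whenever new input files appear.
    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "os"
        "path/filepath"
        "time"
    )

    func main() {
        payload, err := os.ReadFile("payload.json") // hypothetical register request body
        if err != nil {
            fmt.Println("read payload:", err)
            return
        }
        seen := map[string]bool{}
        for {
            matches, _ := filepath.Glob("/feeds/incoming/*.csv") // hypothetical input pattern
            for _, m := range matches {
                if seen[m] {
                    continue
                }
                seen[m] = true
                req, _ := http.NewRequest(http.MethodPut, "http://127.0.0.1:4646/v1/jobs", bytes.NewReader(payload))
                resp, err := http.DefaultClient.Do(req)
                if err != nil {
                    fmt.Println("register:", err)
                    continue
                }
                resp.Body.Close()
                fmt.Println("registered job for", m, "->", resp.Status)
            }
            time.Sleep(30 * time.Second)
        }
    }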
 
Currently, because the focus was on services, a job is failed if there is no capacity. Batch jobs
should instead be queued. We will also split the batch definition from the batch instance to handle
things like overlapping batch jobs (e.g. periodic every 10 minutes, but taking 1 hour to complete).

By "batch instance" do you mean a specific execution? If so I am glad to hear you will split that from the batch definition. That is a feature I have on my to-do list for our in-house system. At the moment we sometimes have tasks fail and get retried, but because the new execution overwrites the first one we lose the logs related to the original failure.
 
So to answer each question:
* When tasks exit, the batch job should be considered complete and deleted eventually.
  Currently the job is retained until explicitly deleted.

We have a requirement to maintain 18 months' worth of task execution history. That currently amounts to ~140,000,000 tasks and growing. Do you think Nomad can handle that? If not, is it possible to archive the execution history to another system for reporting and metrics?
 
* When resources are exhausted, batch jobs should queue. Currently they are failed.

Understood.
 
* Output goes to local disk, in the allocation directory (something like data-dir/alloc-id/task-name/logs/stdout.log)

OK, good.
 
* Jobs will be able to depend on other jobs, which must form a DAG (potentially in 0.3)

Although we have some uses for inter-job dependencies, our current system makes them difficult to manage, so we handle the job-level DAG externally in other systems. I hope you will also support a DAG controlling the execution order of the tasks within a job. The vast majority of our jobs use task dependencies.
 
* Nomad is designed for very large workloads. It is a concurrent scheduler, so if you have
  3 servers with 8 cores each, it can make 24 parallel scheduling decisions. Thousands of jobs
  should not be an issue.

No question that it will be faster than our current system. :)
 
We are definitely still in the early days. Nomad 0.2 should make the service experience
much better, while 0.3 will improve the batch experience. Hope that helps!
 
Best Regards,
Armon Dadgar

Yes, thanks again.

New question: Does Nomad support (or do you have plans to support) custom resource constraints? For example, suppose we have a Hadoop cluster and we want to limit the total number of map-reduce jobs running concurrently in the Hadoop cluster. Could we define a global "map-reduce" resource and declare tasks to consume some amount of that resource?

Thanks again,
Chris Hines

Armon Dadgar

Oct 15, 2015, 9:02:14 PM
to Nomad, Chris Hines
Chris,

Some responses are inlined below!

Best Regards,
Armon Dadgar


From: Chris Hines <ggr...@cs-guy.com>
Reply: Chris Hines <ggr...@cs-guy.com>
Date: October 15, 2015 at 12:50:26 PM
To: Nomad <nomad...@googlegroups.com>
Subject:  Re: [nomad] What is the life cycle of a batch job?

Armon,

Thanks for the quick and thorough response. I think Nomad has a great future, and so far it looks very promising as a replacement for our current system. You may not remember, but we spoke at some length after your Consul presentation at the Galvanize event before GopherCon 2014. At the time I was already working on replacing some pieces of our in-house system and was intrigued by the idea of a Raft-powered job scheduler (maybe built on Consul) providing fault tolerance, and it was nice to bounce some of those ideas off you. So you can imagine my excitement to see that HashiCorp is putting resources into building a tool like Nomad.

Following up on your answers:

You are correct: our initial focus was on service workloads, but we are looking to improve this
dramatically with Nomad 0.3. The major changes will be periodic jobs and job queueing.

That is nice to hear. We already have higher-level systems that manage our recurring batch jobs as separate job submissions. In our case the jobs recur, but they also must wait for all of their inputs to be ready. For example, we have many jobs that run on an hourly cadence but must wait for upstream data feeds. We have a separate system that triggers job submission when files appear matching certain patterns (or are registered in a global file registry). It is low tech, but it works well and keeps the jobs nicely decoupled from each other. Each job is then a DAG of tasks (command lines) that run on available nodes. We almost always have hundreds of tasks queued, either waiting for other tasks to complete or waiting for processing slots that meet their needs. We could easily have a much longer queue, but to avoid scaling problems in our current scheduler we work hard to only submit jobs when they are likely to have some tasks scheduled soon. If we were not worried about overloading our scheduler, we could easily dump ~4,000 jobs (~50 tasks each) into the system each morning on top of the regular hourly submissions.
 
Currently, because the focus was on services, a job is failed if there is no capacity. Batch jobs
should instead be queued. We will also split the batch definition from the batch instance to handle
things like overlapping batch jobs (e.g. periodic every 10 minutes, but taking 1 hour to complete).

By "batch instance" do you mean a specific execution? If so I am glad to hear you will split that from the batch definition. That is a feature I have on my to-do list for our in-house system. At the moment we sometimes have tasks fail and get retried, but because the new execution overwrites the first one we lose the logs related to the original failure.

Exactly. The idea is that the execution instance will be distinct from the definition. This helps with things like updating the definition or defining periodicity separately from any particular instance.
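
One way to picture that split, as purely illustrative Go types rather than Nomad's actual data structures:

    // Illustrative only: a definition describes what to run and how often, while
    // each instance records a single execution with its own identity, status, and
    // logs, so a retry never overwrites an earlier run.
    package sketch

    import "time"

    type BatchDefinition struct {
        ID       string
        Periodic string // e.g. "every 10 minutes"
        Tasks    []string
    }

    type BatchInstance struct {
        DefinitionID string
        InstanceID   string // unique per execution
        LaunchedAt   time.Time
        Status       string // pending, running, complete, failed
        LogDir       string // each execution keeps its own output
    }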


 
So to answer each question:
* When tasks exit, the batch job should be considered complete and deleted eventually.
  Currently the job is retained until explicitly deleted.

We have a requirement to maintain 18 months' worth of task execution history. That currently amounts to ~140,000,000 tasks and growing. Do you think Nomad can handle that? If not, is it possible to archive the execution history to another system for reporting and metrics?

Definitely not designed to handle that level of history. Nomad relies on its state fitting in memory, and that is rather infeasible with 140M tasks, which likely fan out to hundreds or thousands of allocations. Nomad automatically GCs allocations and evaluations that have been in a terminal state for some period of time. Instead, you probably want to pull that information out of Nomad and store it in a system of record if you need that much data retention.
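
A rough sketch of that approach against the HTTP API, pulling terminal allocations into a local JSON-lines file before they are garbage collected (the /v1/allocations endpoint and field names follow the documented listing; the polling interval and archive format are arbitrary stand-ins for a real system of record):

    // archive.go - copy finished allocations out of Nomad into an append-only file.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "os"
        "time"
    )

    type alloc struct {
        ID           string
        JobID        string
        TaskGroup    string
        ClientStatus string
    }

    func main() {
        out, err := os.OpenFile("task-history.jsonl", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
        if err != nil {
            fmt.Println("open archive:", err)
            return
        }
        defer out.Close()
        enc := json.NewEncoder(out)
        archived := map[string]bool{} // a real system would track this in the archive itself

        for {
            resp, err := http.Get("http://127.0.0.1:4646/v1/allocations")
            if err != nil {
                fmt.Println("list allocations:", err)
                time.Sleep(time.Minute)
                continue
            }
            var allocs []alloc
            err = json.NewDecoder(resp.Body).Decode(&allocs)
            resp.Body.Close()
            if err != nil {
                fmt.Println("decode:", err)
                continue
            }
            for _, a := range allocs {
                terminal := a.ClientStatus == "complete" || a.ClientStatus == "failed"
                if terminal && !archived[a.ID] {
                    archived[a.ID] = true
                    enc.Encode(a) // one JSON record per finished allocation
                }
            }
            time.Sleep(time.Minute)
        }
    }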


 
* When resources are exhausted, batch jobs should queue. Currently they are failed.

Understood.
 
* Output goes to local disk, in the allocation directory (something like data-dir/alloc-id/task-name/logs/stdout.log)

OK, good.
 
* Jobs will be able to depend on other jobs, which must form a DAG (potentially in 0.3)

Although we have some uses for inter-job dependencies, our current system makes them difficult to manage, so we handle the job-level DAG externally in other systems. I hope you will also support a DAG controlling the execution order of the tasks within a job. The vast majority of our jobs use task dependencies.

The DAG handling will be built into the system, so eventually you shouldn't need an external system with Nomad.
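
For the intra-job case, the underlying idea is just a topological ordering of the task DAG. A toy sketch of that ordering with a hypothetical extract/transform/load job (generic code, not Nomad's scheduler):

    // dag.go - group a job's tasks into waves whose dependencies are already satisfied.
    package main

    import "fmt"

    func main() {
        // task -> tasks it depends on (hypothetical job)
        deps := map[string][]string{
            "extract":   {},
            "transform": {"extract"},
            "load":      {"transform"},
            "report":    {"transform"},
        }

        // Count unmet dependencies and invert the edges.
        indegree := map[string]int{}
        dependents := map[string][]string{}
        for task, ds := range deps {
            if _, ok := indegree[task]; !ok {
                indegree[task] = 0
            }
            for _, d := range ds {
                if _, ok := indegree[d]; !ok {
                    indegree[d] = 0
                }
                indegree[task]++
                dependents[d] = append(dependents[d], task)
            }
        }

        ready := []string{}
        for t, n := range indegree {
            if n == 0 {
                ready = append(ready, t)
            }
        }

        // Each wave could be dispatched in parallel; a task joins the next wave
        // once everything it depends on has finished.
        scheduled := 0
        for len(ready) > 0 {
            fmt.Println("runnable wave:", ready)
            scheduled += len(ready)
            next := []string{}
            for _, t := range ready {
                for _, d := range dependents[t] {
                    indegree[d]--
                    if indegree[d] == 0 {
                        next = append(next, d)
                    }
                }
            }
            ready = next
        }
        if scheduled != len(indegree) {
            fmt.Println("cycle detected: the task graph is not a DAG")
        }
    }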


 
* Nomad is designed for very large workloads. It is a concurrent scheduler, so if you have
  3 servers with 8 cores each, it can make 24 parallel scheduling decisions. Thousands of jobs
  should not be an issue.

No question that it will be faster than our current system. :)
 
We are definitely still in the early days. Nomad 0.2 should make the service experience
much better, while 0.3 will improve the batch experience. Hope that helps!
 
Best Regards,
Armon Dadgar

Yes, thanks again.

New question: Does Nomad support (or do you have plans to support) custom resource constraints? For example, suppose we have a Hadoop cluster and we want to limit the total number of map-reduce jobs running concurrently in the Hadoop cluster. Could we define a global "map-reduce" resource and declare tasks to consume some amount of that resource?

It is certainly architecturally possible; we just haven't seen a clear use case for it yet. I can imagine things like a global semaphore making sense, but we'd like to see more use cases to make sure whatever abstraction we build fits a broad set of needs.
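
In the meantime, one way to approximate a global "map-reduce slots" limit outside Nomad is a counting semaphore on Consul's KV store using check-and-set. A sketch of the acquire side (the key name and limit are hypothetical, and it deliberately ignores liveness; Consul sessions would be needed so a crashed holder does not leak a slot):

    // semaphore.go - acquire one of a fixed number of global "map-reduce" slots via Consul KV.
    package main

    import (
        "encoding/base64"
        "encoding/json"
        "fmt"
        "io"
        "net/http"
        "strconv"
        "strings"
    )

    const (
        kvURL = "http://127.0.0.1:8500/v1/kv/semaphores/map-reduce" // hypothetical key
        limit = 4                                                   // max concurrent map-reduce jobs
    )

    // read returns the current count and the ModifyIndex to use for check-and-set.
    func read() (count int, index uint64, err error) {
        resp, err := http.Get(kvURL)
        if err != nil {
            return 0, 0, err
        }
        defer resp.Body.Close()
        if resp.StatusCode == http.StatusNotFound {
            return 0, 0, nil // key absent: count is 0 and cas=0 means "create if missing"
        }
        var entries []struct {
            Value       string // base64-encoded by the KV API
            ModifyIndex uint64
        }
        if err := json.NewDecoder(resp.Body).Decode(&entries); err != nil || len(entries) == 0 {
            return 0, 0, fmt.Errorf("unexpected KV response: %v", err)
        }
        raw, _ := base64.StdEncoding.DecodeString(entries[0].Value)
        count, _ = strconv.Atoi(string(raw))
        return count, entries[0].ModifyIndex, nil
    }

    // cas writes the new count, succeeding only if nobody else changed it first.
    func cas(newCount int, index uint64) (bool, error) {
        req, _ := http.NewRequest(http.MethodPut,
            fmt.Sprintf("%s?cas=%d", kvURL, index),
            strings.NewReader(strconv.Itoa(newCount)))
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return false, err
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        return strings.TrimSpace(string(body)) == "true", nil
    }

    func main() {
        count, index, err := read()
        if err != nil {
            fmt.Println("read:", err)
            return
        }
        if count >= limit {
            fmt.Println("no slots free; queue the job and retry later")
            return
        }
        if ok, err := cas(count+1, index); err != nil || !ok {
            fmt.Println("lost the race, retry:", err)
            return
        }
        fmt.Println("slot acquired: run the map-reduce job, then cas the count back down when it exits")
    }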



Thanks again,
Chris Hines