We currently have a critical need for a workflow/job monitor component to help us better understand how our workflows and their constituent jobs are performing, as well as the overall performance state of our Azkaban system.
We have a fairly short time frame and are in the process of beginning this work, so we thought we would share with you a number of technical aspects of this work that are under consideration. Perhaps you could provide feedback or advice that might help us contribute our effort back to Azkaban in the long run. By the way, we are aware of AZK-37, and to the best of our knowledge we are in line with those needs.
The main architectural and design concepts of our approach are:
1. A monitor component solely within Azkaban that:
   1. Accumulates workflow and job statistics and state [discussed further below] based on key events triggered by the Azkaban Scheduler and ExecutableFlows, e.g. start job execution, schedule workflow, etc.
   2. Provides a thread-safe interface for external components to retrieve statistics and state. This is a callable interface, meaning that the external component is linked with Azkaban and runs within the Azkaban process.
   3. Provides a notification scheme that notifies any external component registered with the monitor component's notification interface whenever state or statistics are updated. Again, this assumes callable and potentially asynchronous notification.
2. Two types of state accessible from the monitor component, although we are only considering the first type presently:
   1. Global Performance State:
      1. For Azkaban itself: the number of workflows run, successful, failed, and pending, and similarly for jobs.
      2. For each workflow root job class [effectively representing a workflow class]: the number of times run, successful, failed, and cancelled, as well as the average run time. Because this is in the context of a workflow, the statistics aggregate over the executions of all successor jobs of the root.
      3. The same for each job class, but limited to the execution of that job.
   2. Detail Performance State:
      1. Information about each job execution, including execution time, retries, termination condition, etc.
      2. Information about each workflow execution, identified by workflow id. Each execution includes the order and identity of the individual jobs with their stats as given above, the aggregate execution time, etc.
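
To make item 1 and the per-job-class counters of the Global Performance State a little more concrete, here is a minimal Java sketch of how such an in-process monitor might look. Everything in it is an assumption for discussion purposes: the names WorkflowMonitorSketch, Stats, Outcome, Listener and jobFinished are hypothetical and are not existing Azkaban classes or methods, the hook that would call jobFinished from Scheduler or ExecutableFlow events is not shown, and notification is done synchronously for brevity even though an asynchronous scheme may be preferable.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical in-process monitor; none of these names exist in Azkaban.
final class WorkflowMonitorSketch {

    /** Terminal outcomes counted per job class (illustrative set). */
    public enum Outcome { SUCCEEDED, FAILED, CANCELLED }

    /** Immutable snapshot of the counters for one job class. */
    public static final class Stats {
        static final Stats EMPTY = new Stats(0, 0, 0, 0, 0);
        public final long runs, succeeded, failed, cancelled, totalRunMillis;

        private Stats(long runs, long succeeded, long failed, long cancelled, long totalRunMillis) {
            this.runs = runs;
            this.succeeded = succeeded;
            this.failed = failed;
            this.cancelled = cancelled;
            this.totalRunMillis = totalRunMillis;
        }

        /** Average run time in milliseconds, or 0.0 if nothing has run yet. */
        public double averageRunMillis() {
            return runs == 0 ? 0.0 : (double) totalRunMillis / runs;
        }

        /** Returns a new snapshot with one more finished execution folded in. */
        Stats plus(Outcome outcome, long runMillis) {
            return new Stats(
                runs + 1,
                succeeded + (outcome == Outcome.SUCCEEDED ? 1 : 0),
                failed + (outcome == Outcome.FAILED ? 1 : 0),
                cancelled + (outcome == Outcome.CANCELLED ? 1 : 0),
                totalRunMillis + runMillis);
        }
    }

    /** Callback implemented by external components linked into the same process. */
    public interface Listener {
        void onUpdate(String jobClass, Stats latest);
    }

    private final Map<String, Stats> statsByClass = new ConcurrentHashMap<>();
    private final List<Listener> listeners = new CopyOnWriteArrayList<>();

    /** Registers an external component for update notifications. */
    public void register(Listener listener) {
        listeners.add(listener);
    }

    /** Assumed to be called by the scheduler/flow event hook when a job ends. */
    public void jobFinished(String jobClass, Outcome outcome, long runMillis) {
        // merge() updates the entry atomically, so concurrent events are not lost.
        Stats updated = statsByClass.merge(
            jobClass,
            Stats.EMPTY.plus(outcome, runMillis),
            (old, unused) -> old.plus(outcome, runMillis));
        for (Listener listener : listeners) {
            listener.onUpdate(jobClass, updated); // synchronous here for brevity
        }
    }

    /** Thread-safe read; returns an immutable snapshot, never null. */
    public Stats statsFor(String jobClass) {
        return statsByClass.getOrDefault(jobClass, Stats.EMPTY);
    }
}
```

In this sketch, readers never see a half-updated record because each Stats object is immutable and is swapped in atomically. The same pattern could presumably be keyed by workflow root job class to aggregate over a whole workflow, and a real notification scheme might hand the callbacks to an executor so that slow listeners do not hold up the event source.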
At this time, we are not interested in building a UI servlet to display this information, nor JMX or REST interfaces, although we are willing to work with others on designing and building these components. They are not out of the question, but they are not critical to what we currently require. Our key requirement concerns the interaction between external components and the monitor component, and their access to it. Down the road, we would be interested in working with the LinkedIn folks on integration into Azkaban, if they see value in our approach.
If you have any input, please post to the azkaban-dev list. We would be interested in feedback, requirements of others, needs, etc.
Don