PROPOSAL – Moving resource management into a responsibility of the task runner

Charles Allen

Nov 10, 2015, 2:12:37 PM
to Druid Development

The current hard tie-in between ResourceManagementStrategy and RemoteTaskRunner is not sustainable if we want to accommodate more task runners with varying approaches to resource definition or resource strategy. To allow a cleaner definition of how a task runner handles its resources, I propose the following.


In the current workflow for scaling the resources available to the indexing service, ResourceManagement is a completely separate concept from the TaskRunner, except that the only resource management strategy depends exclusively on the RemoteTaskRunner. Additionally, many other task runners carry stub operations for items intended to be exported to a ResourceManagementStrategy.

The original intent, as I have come to understand it, was that the scaling of resources could be completely separate from the runner of resources. This has turned out to be a great challenge and has not yet been achieved. I believe this is not the best way to handle scaling of resources, mainly because different TaskRunners have completely different concepts of what a resource even is!

The RemoteTaskRunner has ZKWorkers; the ForkingTaskRunner has a fixed set of slots for running tasks; a potential cluster-level runner (like a MesosTaskRunner) would have a set of allocation strategies that could vary greatly from anything the other two runners would ever need.

Because the concept of a resource varies from runner to runner, and the way to allocate or free resources depends on the runner, this proposal is to make ResourceManagement the responsibility of the task runner rather than an independent concept.
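
To make the "different runners, different resources" point concrete, here is a minimal sketch of what runner-owned scaling could look like. Every name in it (ScalableRunner, FixedSlotRunner, WorkerRunner) is hypothetical and invented for illustration; none of it is existing Druid API, and the two runners are only loosely modeled on the descriptions above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RunnerOwnedResources {

    /** Hypothetical: a runner that owns resources of its own kind R and decides how to scale them. */
    public interface ScalableRunner<R> {
        List<R> resources();

        /** Try to add one resource; returns true only if one was actually added. */
        boolean scaleUp();
    }

    /** Fixed set of numbered slots (in the spirit of the ForkingTaskRunner description); cannot grow. */
    public static final class FixedSlotRunner implements ScalableRunner<Integer> {
        private final List<Integer> slots = new ArrayList<>();

        public FixedSlotRunner(int slotCount) {
            for (int i = 0; i < slotCount; i++) {
                slots.add(i);
            }
        }

        @Override
        public List<Integer> resources() {
            return Collections.unmodifiableList(slots);
        }

        @Override
        public boolean scaleUp() {
            return false; // the slot count is fixed, so there is nothing to scale
        }
    }

    /** Resources are named workers that can be added on demand (placeholder for real provisioning). */
    public static final class WorkerRunner implements ScalableRunner<String> {
        private final List<String> workers = new ArrayList<>();

        @Override
        public List<String> resources() {
            return Collections.unmodifiableList(workers);
        }

        @Override
        public boolean scaleUp() {
            workers.add("worker-" + workers.size());
            return true;
        }
    }
}
```

The point of the type parameter is that nothing outside the runner needs to know what a "resource" is; an external strategy written against one resource type would not fit the other.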

The result of this would be the following:

1. The ResourceManagementScheduler would start and stop with the TaskRunner instead of separately.

2. ResourceManagementStrategies would be potentially usable by a task runner, but not guaranteed to be used by all task runners.

3. Cluster-level task runners (like a MesosTaskRunner) can implement strategies appropriate for a container-running service.

4. Tiered runners can each manage their own resource strategies. (Viewing a Tier as a set of rules about resource management goes along with this well.)

5. If the default RemoteTaskRunner (RTR) is not used, the GET method for the workers would not return legacy results, which would cause the Druid console view of workers to break. (It will already break as soon as the concept of "worker as a single node" is violated.)

6. If the default RTR is not used, any health checks that monitor the /workers endpoint will probably break, or at least not function as intended.
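
Items 1 and 2 could be sketched roughly as follows: the runner starts and stops its resource management itself, and a runner without a strategy simply skips that step. This is an illustration under assumptions, not the actual implementation; Strategy, SketchRunner, and RecordingStrategy are hypothetical names and do not correspond to existing Druid classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class RunnerLifecycleSketch {

    /** Hypothetical stand-in for a resource management strategy and its scheduler. */
    public interface Strategy {
        void start();

        void stop();
    }

    /** Hypothetical runner whose lifecycle drives the (optional) strategy's lifecycle. */
    public static class SketchRunner {
        private final Optional<Strategy> strategy;
        private boolean running = false;

        public SketchRunner(Optional<Strategy> strategy) {
            this.strategy = strategy;
        }

        public void start() {
            running = true;
            // Resource management starts with the runner, if this runner uses any.
            strategy.ifPresent(Strategy::start);
        }

        public void stop() {
            // Stop resource management first so nothing is allocated for a stopped runner.
            strategy.ifPresent(Strategy::stop);
            running = false;
        }

        public boolean isRunning() {
            return running;
        }
    }

    /** Test helper that records lifecycle calls in order. */
    public static class RecordingStrategy implements Strategy {
        public final List<String> events = new ArrayList<>();

        @Override
        public void start() {
            events.add("start");
        }

        @Override
        public void stop() {
            events.add("stop");
        }
    }
}
```

A runner with fixed capacity would be constructed with Optional.empty() and would never touch a strategy, which is the "not guaranteed to be used by all task runners" case.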


Charles Allen

Nov 11, 2015, 4:12:03 PM
to Druid Development