Druid clusters currently suffer from a large number of sub-optimal resource allocations. This stems from the fact that there is no mechanism for scheduling tasks in an intelligent manner. This proposal addresses the concern that different tasks need to run with different levels of SLA, in a configuration that can improve overall cluster resource utilization. For the purposes of this document, a "task" refers to any Druid service, including both indexing tasks and core server types.
In order to balance task resources across a cluster, resources must be allocated or freed. Fulfillment of these resource needs should be compatible with modern cluster-level resource management systems such as YARN and Mesos, while still meeting SLAs.
The general workflow compared to the current method is shown below:
The general workflow for starting a task in Stage 1 is as follows:
1. The coordinator makes an optimistic request to a TierMaster.
2. The TierMaster knows (roughly) the capacity of its worker nodes and makes an optimistic scheduling attempt against one of the TierMasters whose domain is a single node.
3. The node TierMaster schedules and tracks the task, reporting status updates to its tier-level TierMaster, which then reports status to the coordinator.
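The two-level optimistic flow above can be sketched as follows. All class and method names here are hypothetical; the sketch illustrates the protocol, not a proposed API.

```python
# Sketch of the two-level optimistic scheduling flow: the coordinator
# asks a tier-level TierMaster, which tries its node-level TierMasters.

class NodeTierMaster:
    """Tier Master whose domain is a single worker node."""
    def __init__(self, name, slots):
        self.name = name
        self.slots = slots          # free task slots on this node
        self.tasks = {}

    def schedule(self, task_id):
        if self.slots <= 0:
            return ("TASK_FAIL", "lack of resources")
        self.slots -= 1
        self.tasks[task_id] = "RUNNING"
        return ("TASK_SCHEDULED", self.name)

class TierMaster:
    """Tier-level master that knows (roughly) its worker nodes' capacity."""
    def __init__(self, name, node_masters):
        self.name = name
        self.node_masters = node_masters

    def schedule(self, task_id):
        # Optimistic attempt: try each node Tier Master in turn.
        for node in self.node_masters:
            status, detail = node.schedule(task_id)
            if status == "TASK_SCHEDULED":
                return (status, detail)
        # Scheduling knowledge exhausted: report failure upward.
        return ("TASK_FAIL", "lack of resources")

tier_a = TierMaster("TierMaster-A", [NodeTierMaster("node1", slots=1)])
print(tier_a.schedule("task-1"))   # ('TASK_SCHEDULED', 'node1')
print(tier_a.schedule("task-2"))   # ('TASK_FAIL', 'lack of resources')
```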
In the event of resource contention, the workflow is as follows:
1. The coordinator requests that TierMaster-A schedule a task.
2. TierMaster-A attempts to schedule the task on TierMaster-node1.
3. TierMaster-node1 fails to schedule the task and reports the task failed due to lack of resources.
4. TierMaster-A receives the failure notification and requests that TierMaster-node2 schedule the task.
5. TierMaster-node2 also fails due to lack of resources and reports back to TierMaster-A.
6. TierMaster-A, having exhausted all of its scheduling knowledge, reports back to the coordinator that the task has failed due to lack of resources.
7. The coordinator requests that TierMaster-B schedule the task.
8. TierMaster-B requests that TierMaster-node3 schedule the task.
9. TierMaster-node3 has capacity, so it runs the task.
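The fallback from TierMaster-A to TierMaster-B can be sketched like this. The tier names and per-node slot counts are hypothetical and chosen to mirror the steps above.

```python
# Sketch of the contention workflow: the coordinator falls back from
# TierMaster-A (all nodes full) to TierMaster-B (node3 has capacity).

class TierMaster:
    def __init__(self, name, node_free_slots):
        self.name = name
        # free slots per node Tier Master in this tier's domain
        self.node_free_slots = dict(node_free_slots)

    def schedule(self, task_id):
        for node, free in self.node_free_slots.items():
            if free > 0:
                self.node_free_slots[node] -= 1
                return ("TASK_SCHEDULED", node)
        return ("TASK_FAIL", "lack of resources")

def coordinator_schedule(task_id, tier_masters):
    """Try each Tier Master in turn; a TASK_FAIL triggers fallback."""
    for tm in tier_masters:
        status, detail = tm.schedule(task_id)
        if status == "TASK_SCHEDULED":
            return (tm.name, detail)
    return (None, "lack of resources")

tier_a = TierMaster("TierMaster-A", {"node1": 0, "node2": 0})  # both full
tier_b = TierMaster("TierMaster-B", {"node3": 1})
print(coordinator_schedule("task-1", [tier_a, tier_b]))
# → ('TierMaster-B', 'node3')
```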
Autoscaling would behave as follows:
1. The coordinator requests TierMaster-A to schedule a task.
2. TierMaster-A requests TierMaster-node1 to schedule the task.
3. TierMaster-node1 reports back a failure to schedule the task due to lack of resources.
4. TierMaster-A reports the failure due to lack of resources back to the coordinator.
5. The coordinator requests TierMaster-A to expand its resource pool.
6. TierMaster-A scales up a new instance in response to the request.
7. The coordinator makes a new request to TierMaster-A to schedule the task.
8. TierMaster-A makes a request to TierMaster-node2 to schedule the task.
9. TierMaster-node2 runs the task.
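A minimal sketch of the fail / expand / retry loop above, with hypothetical names throughout. The scale-up is synchronous here only for illustration; in practice launching a new instance would be asynchronous.

```python
# Sketch of the autoscaling workflow: on a capacity failure the
# coordinator asks the Tier Master to expand its pool, then retries.

class TierMaster:
    def __init__(self, name, free_slots):
        self.name = name
        self.free_slots = free_slots
        self.nodes = 1

    def schedule(self, task_id):
        if self.free_slots <= 0:
            return ("TASK_FAIL", "lack of resources")
        self.free_slots -= 1
        return ("TASK_SCHEDULED", task_id)

    def expand_pool(self, slots_per_node=2):
        # In practice this would launch a new instance asynchronously.
        self.nodes += 1
        self.free_slots += slots_per_node

def coordinator_run(task_id, tier_master):
    status, _ = tier_master.schedule(task_id)
    if status == "TASK_FAIL":
        tier_master.expand_pool()                  # grow the Tier's pool
        status, _ = tier_master.schedule(task_id)  # retry the task
    return status

tm = TierMaster("TierMaster-A", free_slots=0)
print(coordinator_run("task-1", tm))   # → 'TASK_SCHEDULED'
```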
The general workflow for launching a task on a cluster-based resource management system would look something like the following:
1. The coordinator requests the TierMaster to launch a task.
2. The TierMaster finds capacity and launches the task.
Under resource contention, the cluster level masters may behave as follows:
1. The coordinator requests TierMaster-A to schedule a task.
2. TierMaster-A fails to allocate cluster resources and reports back to the coordinator that the task has failed due to lack of resources.
3. The coordinator requests TierMaster-B to schedule the task.
4. TierMaster-B has capacity and schedules the task.
A simple example of this is that realtime tasks should be executed as soon as possible and should not suffer a backlog, whereas Hadoop tasks can tolerate a limited backlog without severe repercussions.
For Druid, a particular group of cluster resources will be called a "Tier", continuing the nomenclature of the historical nodes. This brings the Druid "Tier" in line with other cluster resource management systems, such as YARN placement rules or Mesos roles.
This will replace (and thereby eliminate) the existing concept of "affinity". Instead of limiting the hosts of a task keyed by datasource, the task itself will have "affinity". If it is desirable to retain the ability to assign tasks keyed by a datasource to a particular Tier, the concept of affinity could be kept, but with slightly different functionality.
Additionally, this proposal seeks to separate the launching of tasks and the assignment of task pool resources from the coordination of tasks and the expansion of pool resources. The duties of launching a task and scaling a Tier's resource pool will be handled by a Tier Master. The duty of assigning a task to a particular Tier Master for completion will fall to the Coordinator. This will ultimately replace (and thereby eliminate) the existing concepts of the "Overlord" and the "Middle Manager", whose prior functionality will be split between the Tier Master and the Coordinator.
This document refers to “coordinator” as a logical unit. A Coordinator may be a distinct JVM instance, or simply a thread or group of threads in a JVM instance which is co-tenant with another logical Coordinator (example: task coordinator and segment coordinator are two separate logical coordinators).
It is important to note that this document refers to items in a logical sense and does not intend to force items to be separated in a physical sense. For example, a Tier Master may service multiple Coordinators, or a Tier Master may know how to request multiple other Tier Masters to run tasks. A single thread may also serve logic for multiple Tier Masters.
A Tier is a set of resources tied to a Tier Master. The Tier Master may or may not have the capacity to scale its resource pool upon request of the coordinator. Each Tier Master is responsible for maintaining the state of its tasks, launching new tasks, terminating tasks, reporting either TASK SUCCESS or TASK FAIL upon completion, and maintaining high availability across Tier Master failures. Further information on the reason for a status change should also be made available.
The key design constraints for a Tier are as follows:
Optimistic task assignment.
Decentralized task resource allocation (the Coordinator can only modify a Tier's pool, not allocate resources for a specific task).
Separation of duties of task logic vs ensuring a task is scheduled.
The coordinator is responsible for assigning a task to a Tier Master. The coordinator may or may not take the state of the Tier Master into account in its decisions, and may or may not ask the Tier Master to shrink or grow its resource pool, which the Tier Master may or may not accommodate. In the event of resource starvation, the coordinator will judge whether a Tier's resource pool needs to be expanded, or whether a Tier should terminate tasks to free up resources. A coordinator may communicate with one or more Tier Masters to accomplish task scheduling.
The method by which a Tier Master completes its duties is not specified here. This allows a Tier Master to operate as a standalone cluster manager, a Tier Master for a single node, a Mesos scheduler, or a YARN application master. It is even permissible to have a network-wide Tier Master maintain node Tier Masters, or a cross-DC Tier Master maintain AZ Tier Masters, which in turn maintain cell Tier Masters that actually control the launching of tasks. If a coordinator cannot find capacity in a particular AZ, it may, at its discretion, launch the task in a different AZ or in an entirely new DC.
Assigning a task to a Tier is accomplished by POSTing the task to the Tier Master, which will find room for it among the available resources. If resources cannot be allocated, the Tier Master will report TASK FAIL to the coordinator, giving lack of available resources as the reason for the failure. Once a Tier Master has launched a task, it will not terminate the task unless explicitly requested to.
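The assignment exchange might have a shape like the following. The endpoint path and field names below are illustrative assumptions only, not a fixed API.

```python
# Hypothetical shape of the assignment exchange: the coordinator POSTs
# a task spec to a Tier Master endpoint and receives either an
# acknowledgement or a TASK_FAIL with a reason for the failure.
import json

task_post = {
    "endpoint": "/druid/tierMaster/v1/task",   # hypothetical path
    "body": {
        "id": "index_realtime_wikipedia_2015",  # example task id
        "type": "index_realtime",
        "dataSource": "wikipedia",
    },
}

fail_response = {
    "task": "index_realtime_wikipedia_2015",
    "status": "TASK_FAIL",
    "reason": "lack of available resources",   # extra failure detail
}

print(json.dumps(fail_response, indent=2))
```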
It is the responsibility of the coordinator to request the allocation or freeing of resources if the Tier Master fails to launch a task due to lack of resources.
Since this proposal does not specify how many coordinators may be running in a cluster (for example, a STAGE coordinator may run alongside a PROD-RT coordinator and a PROD-BATCH coordinator in the same cluster), it is the duty of the coordinators to come to an agreement on the freeing and allocation of resources.
For example, a STAGE coordinator might have a task fail due to lack of capacity, query the Tier Master, and find that the capacity is filled with PROD-BATCH tasks. In that case it will have to either scale up the cluster and take resources from the new pool, or simply wait for resources to become available.
Another example: a PROD-RT coordinator needs to run a new task but fails to find capacity. Upon querying the Tier Master it notices that PROD-BATCH tasks are filling capacity, so it may terminate a PROD-BATCH task and launch its own task in the freed resources. The PROD-BATCH coordinator will receive a TASK FAIL with the reason that its task was terminated, and will either wait for capacity, request new capacity from the same Tier Master, or seek out a new Tier Master to complete the task.
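A preemption rule like the one in these examples could be sketched as below. The priority ordering (PROD-RT above PROD-BATCH above STAGE) is an assumption for illustration; the proposal leaves the agreement between coordinators unspecified.

```python
# Sketch of a cross-coordinator preemption rule: a higher-priority
# coordinator may terminate lower-priority tasks to free capacity.

PRIORITY = {"PROD-RT": 2, "PROD-BATCH": 1, "STAGE": 0}  # assumed ordering

def resolve_contention(requesting, running_tasks):
    """Return the task to preempt, or None if the requester must wait."""
    # Only preempt a task whose coordinator has strictly lower priority.
    candidates = [t for t in running_tasks
                  if PRIORITY[t["coordinator"]] < PRIORITY[requesting]]
    if not candidates:
        return None
    # Preempt the lowest-priority running task first.
    return min(candidates, key=lambda t: PRIORITY[t["coordinator"]])

running = [
    {"id": "batch-1", "coordinator": "PROD-BATCH"},
    {"id": "stage-1", "coordinator": "STAGE"},
]
print(resolve_contention("PROD-RT", running))
# → {'id': 'stage-1', 'coordinator': 'STAGE'}
print(resolve_contention("STAGE", running))   # → None
```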
The first stage is to turn the Middle Managers into Tier Masters and have them keep track of the TaskRunners in their domain. This involves moving the RemoteTaskRunner functionality out of the Overlord and into the MiddleManager, giving the MiddleManager a means to report task success or failure back to the Overlord, and having the Overlord assume that it is no longer the source of truth for task state. It also involves making two "types" of Tier Master: one which controls RemoteTaskRunners (which in turn talk to other Tier Masters), and another which controls ForkingTaskRunners / internal peon tasks for launching tasks on a single node. These should be orchestrated such that they can be restarted or upgraded without needing to restart the actual tasks.
A Tier Master should also have the ability to launch more nodes under its domain, and such behavior should be migrated from the Overlord to the Tier Master: the Overlord makes a request to the Tier Master to increase the resource pool, and the Tier Master performs the actual increase.
Autoscaling logic would need to change to accommodate the fact that the Overlord does not know everything about the cluster; instead, it should request that the appropriate Tier Master increase its resource pool.
The second stage is to merge the Overlord's segment logic (locking, etc.) into the Coordinator.
The third stage is to allow long-running tasks (such as historical nodes and brokers) to be launched as a "task" in a particular "tier"; as data grows, the number of historical "tasks" in a tier can be expanded to accommodate the larger data pool. The concept of a Data Tier is a natural fit for this task-running paradigm.
--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/af10d0e0-eae8-4d11-9554-19a8e7a2ed12%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.