[PROPOSAL] Task Tier


Charles Allen

May 28, 2015, 11:48:41 PM5/28/15
to druid-de...@googlegroups.com

Druid clusters currently suffer from a large number of sub-optimal resource allocations. This stems from the fact that tasks cannot be run in an intelligent manner. This proposal addresses the concern that different tasks should run with different levels of SLA, in a configuration that can improve overall cluster resource utilization. For the purpose of this document, a “task” refers to any Druid server process, including both indexing tasks and core server types.


In order to balance task resources across a cluster, resources must be allocated or freed. Fulfillment of these resource needs should be compatible with modern cluster-level resource management packages such as YARN and Mesos, while still meeting SLAs.


The general workflow compared to the current method is shown below:


The general workflow for starting a task in Stage 1 is as follows:

  1. The coordinator makes an optimistic request to a TierMaster.

  2. The TierMaster knows (roughly) the capacity of the worker nodes and makes an optimistic scheduling attempt against one of the node-level TierMasters (a TierMaster whose domain is a single node).

  3. The node-level TierMaster schedules and tracks the task, reporting status updates to the tier-level TierMaster, which then reports status to the coordinator.
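As a rough illustration, the delegation in the steps above might look like the following sketch. All names here (TierMaster, NodeTierMaster, DelegatingTierMaster, schedule) are hypothetical, not actual Druid APIs:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical interface: a scheduling attempt returns the accepting node,
// or empty to signal failure due to lack of resources.
interface TierMaster {
    Optional<String> schedule(String taskId);
}

// A node-level TierMaster with a fixed number of free task slots.
class NodeTierMaster implements TierMaster {
    private final String nodeName;
    private int freeSlots;

    NodeTierMaster(String nodeName, int freeSlots) {
        this.nodeName = nodeName;
        this.freeSlots = freeSlots;
    }

    @Override
    public Optional<String> schedule(String taskId) {
        if (freeSlots <= 0) {
            return Optional.empty(); // TASK FAIL: lack of resources
        }
        freeSlots--;
        return Optional.of(nodeName); // task is now tracked by this node
    }
}

// A tier-level TierMaster that optimistically delegates to its node masters in order.
class DelegatingTierMaster implements TierMaster {
    private final List<TierMaster> nodeMasters;

    DelegatingTierMaster(TierMaster... nodeMasters) {
        this.nodeMasters = Arrays.asList(nodeMasters);
    }

    @Override
    public Optional<String> schedule(String taskId) {
        for (TierMaster node : nodeMasters) {
            Optional<String> placement = node.schedule(taskId);
            if (placement.isPresent()) {
                return placement; // node master reports status upward from here
            }
        }
        // Exhausted all scheduling knowledge: report failure to the coordinator.
        return Optional.empty();
    }
}
```

The same delegating shape also covers the contention case: a tier master that runs out of node masters reports the failure upward.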


In the event of resource contention, the workflow is as follows:

  1. The coordinator requests TierMaster-A to schedule a task.

  2. TierMaster-A attempts to schedule the task on TierMaster-node1.

  3. TierMaster-node1 fails to schedule the task and reports a task failure due to lack of resources.

  4. TierMaster-A receives the failure notification and requests TierMaster-node2 to schedule the task.

  5. TierMaster-node2 also fails due to lack of resources and reports back to TierMaster-A.

  6. TierMaster-A, having exhausted all scheduling knowledge, reports back to the coordinator that the task has failed due to lack of resources.

  7. The coordinator requests TierMaster-B to schedule the task.

  8. TierMaster-B requests TierMaster-node3 to schedule the task.

  9. TierMaster-node3 has capacity, so it runs the task.
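From the coordinator's side, the contention workflow amounts to walking an ordered list of tiers until one accepts the task. A minimal sketch, with hypothetical names and hard-coded capacities standing in for TierMaster-A and TierMaster-B:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: the real coordinator would talk to remote Tier Masters,
// not an in-memory capacity table.
class Coordinator {
    // Each tier mapped to its remaining free slots, in preference order.
    private final Map<String, Integer> tierCapacity = new LinkedHashMap<>();

    Coordinator() {
        tierCapacity.put("TierMaster-A", 0); // A is full (nodes 1 and 2 exhausted)
        tierCapacity.put("TierMaster-B", 1); // B still has capacity (node3)
    }

    // Ask each tier in turn; a tier with no capacity reports TASK FAIL
    // (lack of resources) and the coordinator moves on to the next tier.
    Optional<String> assign(String taskId) {
        for (Map.Entry<String, Integer> tier : tierCapacity.entrySet()) {
            if (tier.getValue() > 0) {
                tier.setValue(tier.getValue() - 1);
                return Optional.of(tier.getKey());
            }
        }
        return Optional.empty(); // no tier could run the task
    }
}
```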


Autoscaling would behave as follows:

  1. The coordinator requests TierMaster-A to schedule a task

  2. TierMaster-A requests TierMaster-node1 to schedule a task

  3. TierMaster-node1 reports back a failure to schedule the task due to lack of resources.

  4. TierMaster-A reports back to the coordinator the failure due to lack of resources

  5. The coordinator requests TierMaster-A to expand its resource pool

  6. TierMaster-A scales up a new instance in response to the request.

  7. The coordinator makes a new request to TierMaster-A to schedule a task

  8. TierMaster-A makes a request to TierMaster-node2 to schedule a task.

  9. TierMaster-node2 runs the task.
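The retry-after-scale-up loop above can be sketched as follows. AutoscalingTier, grow, and scheduleWithScaleUp are illustrative names, and real autoscaling is of course asynchronous rather than an immediate grow:

```java
// Hypothetical sketch of the autoscaling workflow, not a Druid API.
class AutoscalingTier {
    private int freeSlots;

    AutoscalingTier(int freeSlots) {
        this.freeSlots = freeSlots;
    }

    boolean schedule() {
        if (freeSlots <= 0) {
            return false; // TASK FAIL: lack of resources (steps 2-4)
        }
        freeSlots--;
        return true;
    }

    void grow(int slots) {
        freeSlots += slots; // scale up new instances (step 6)
    }
}

class SchedulingCoordinator {
    // First attempt fails, the coordinator asks the tier to expand, then retries.
    static boolean scheduleWithScaleUp(AutoscalingTier tier) {
        if (tier.schedule()) {
            return true;
        }
        tier.grow(1);           // step 5-6: request pool expansion
        return tier.schedule(); // steps 7-9: retry on the new capacity
    }
}
```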


The general workflow for launching a task on a cluster-based resource management system would look something like the following:

  1. The coordinator requests that the TierMaster launch a task.

  2. The TierMaster finds capacity and launches the task.


Under resource contention, the cluster level masters may behave as follows:

  1. The coordinator requests TierMaster-A to schedule a task.

  2. TierMaster-A fails to allocate cluster resources and reports back to the coordinator that the task has failed due to lack of resources.

  3. The coordinator requests TierMaster-B to schedule a task.

  4. TierMaster-B has capacity and schedules the task.


A simple example of this is that realtime tasks should be executed as soon as possible and should not suffer a backlog, whereas Hadoop tasks can suffer a limited backlog without severe repercussions.


For Druid, the proposed name for a particular group of cluster resources is a “Tier,” continuing the nomenclature of the historical nodes. This helps bring the Druid “Tier” in line with other cluster resource management systems such as YARN placement rules or Mesos roles.


This will replace (and thereby eliminate) the existing concept of “affinity”. Instead of limiting the hosts of a task keyed by its datasource, the task itself will have “affinity”. If it is desired to retain the functionality of assigning tasks keyed by a datasource to a particular Tier, the concept of affinity could be retained, but with slightly different functionality.


Additionally, this proposal seeks to separate the launching of tasks and the assignment of task pool resources from the coordination of tasks and the expanding of pool resources. The duties of launching a task and scaling Tier resource pools are to be handled by a Tier Master. The duty of assigning a task to a particular Tier Master for completion belongs to the Coordinator. This will ultimately replace (and thereby eliminate) the existing concept of the “overlord”. It will also replace and eliminate the concept of the “Middle Manager”. The prior functionality of the overlord and middle manager will be split between the Tier Master and the Coordinator.


This document refers to “coordinator” as a logical unit. A Coordinator may be a distinct JVM instance, or simply a thread or group of threads in a JVM instance which is co-tenant with another logical Coordinator (example: task coordinator and segment coordinator are two separate logical coordinators).


It is important to note that this document refers to items in a logical sense and does not intend to force items to be separated in a physical sense. For example, a Tier Master may service multiple Coordinators, or a Tier Master may know how to request multiple other Tier Masters to run tasks. A single thread may also serve logic for multiple Tier Masters.

Tier concepts

A Tier is a set of resources tied to a Tier Master. The Tier Master may or may not have the capacity to scale its resource pool upon request of the coordinator. Each Tier Master is responsible for maintaining the state of its tasks, launching new tasks, terminating tasks, reporting either TASK SUCCESS or TASK FAIL upon completion, and maintaining high availability upon Tier Master failure. Further information on the reason for a status change should be made available.


The key design constraints for a Tier are as follows:

  • Optimistic task assignment.

  • Decentralized task resource allocation (the Coordinator can only modify a Tier pool, not allocate resources for a specific task).

  • Separation of duties of task logic vs ensuring a task is scheduled.


The coordinator is responsible for assigning a task to a Tier Master. The coordinator may or may not take the state of the Tier Master into account in its decisions, and may or may not ask the Tier Master to shrink or grow its resource pool, which the Tier Master may or may not accommodate. In the event of resource starvation, the coordinator will judge whether the resource pool of a Tier needs to be expanded, or whether a Tier should terminate tasks to free up resources. A coordinator may communicate with one or more Tier Masters to accomplish task scheduling.


The method by which a Tier Master completes its duties is not specified here. This allows a Tier Master to operate as a standalone cluster manager, a Tier Master for a single node, a Mesos scheduler, or a YARN application master. It is even permissible to have a network-wide Tier Master maintain node Tier Masters, or to have a cross-DC Tier Master maintain AZ Tier Masters, which in turn maintain cell Tier Masters that actually control the launching of tasks. If a coordinator cannot find capacity in a particular AZ, it may, at its discretion, launch the task in an entirely new DC, or in a particular AZ.


Assigning a Task to a Tier

Assigning a task to a tier is accomplished by POSTing the task to the Tier Master, which will find room among the available resources. If resources cannot be allocated, the Tier Master will report TASK FAIL to the coordinator, giving lack of available resources as the reason for the failure. Once a Tier Master has launched a task, it will not terminate the task unless explicitly requested.
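A task-assignment POST might be built as below. The endpoint path and payload shape are assumptions for illustration; the proposal does not specify the actual Tier Master API:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch of POSTing a task to a Tier Master. "/task" is a hypothetical
// endpoint path, not an actual Druid route.
class TaskSubmitter {
    static HttpRequest buildTaskPost(String tierMasterBase, String taskJson) {
        return HttpRequest.newBuilder()
            .uri(URI.create(tierMasterBase + "/task")) // hypothetical endpoint
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(taskJson))
            .build();
    }
}
```

The request would then be sent with java.net.http.HttpClient; a non-2xx response (or an explicit failure body) would map to TASK FAIL with a reason.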

Resource Allocation / Freeing

It is the responsibility of the coordinator to request the allocation or freeing of resources if the Tier Master fails to launch a task due to lack of resources.

Thrashing

Since this proposal does not specify how many coordinators may be running in a cluster (for example, a STAGE coordinator may run in the same cluster as a PROD-RT coordinator and a PROD-BATCH coordinator), it is the duty of the coordinators to come to an agreement on the freeing and allocation of resources.


An example: a STAGE coordinator has a task fail due to lack of capacity. It queries the Tier Master and finds the capacity filled with PROD-BATCH tasks, in which case it must either scale up the cluster and take resources from the new pool, or simply wait for resources to become available.


Another example: a PROD-RT coordinator needs a new task but fails to find capacity. Upon querying the Tier Master it notices that PROD-BATCH tasks are filling capacity, so it may terminate a PROD-BATCH task and launch its own task in those resources. The PROD-BATCH coordinator will receive a TASK FAIL with the reason that its task was terminated, and will either wait for capacity, request new capacity from the same Tier Master, or seek out a new Tier Master to complete the task.
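One way to express the preemption reasoning above is a simple priority comparison. The coordinator names come from the examples in this thread, but the numeric priorities and the policy itself are assumptions, not part of the proposal:

```java
import java.util.Map;

// Hypothetical preemption policy: a coordinator may terminate another
// coordinator's task only if it strictly outranks it.
class PreemptionPolicy {
    // Higher number = more important; illustrative values only.
    private static final Map<String, Integer> PRIORITY = Map.of(
        "PROD-RT", 3,
        "PROD-BATCH", 2,
        "STAGE", 1
    );

    static boolean mayPreempt(String requester, String running) {
        return PRIORITY.getOrDefault(requester, 0) > PRIORITY.getOrDefault(running, 0);
    }
}
```

Under this sketch, PROD-RT may displace PROD-BATCH, but STAGE must wait or scale up, matching the two examples above.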


Implementation Plan

Stage 1) Middle Manager Migration

The first stage is to turn the middle managers into Tier Masters and have them keep track of the TaskRunners in their domain. This involves moving RemoteTaskRunner functionality out of the Overlord and into the MiddleManager, giving the MiddleManager a means to report task success or failure back to the Overlord, and having the Overlord assume that it is not the source of truth on the state of the various tasks. This also involves making two “types” of Tier Masters: one which controls RemoteTaskRunners (which in turn talk to other Tier Masters), and another which controls ForkingTaskRunners / internal peon tasks for launching tasks on a single node. These should be orchestrated such that they can be restarted or upgraded without needing to restart the actual tasks.


A Tier Master should also have the ability to launch more nodes under its domain, and such behavior should be migrated from the overlord to the Tier Master. The overlord makes a request to the Tier Master to increase the resource pool, and the Tier Master performs the actual increase.


Autoscaling logic would need to change to accommodate the fact that the overlord does not know everything about the cluster; instead, it should request that the appropriate Tier Master increase its resource pool.

Stage 2) Overlord Coordinator merger

The second stage is to merge the overlord segment logic (locking, etc) into the Coordinator.


Stage 3) Long Running Task

The third stage is to allow long-running tasks (such as historical nodes and brokers) to be launched as a “task” in a particular “tier,” so that as data grows, the number of historical “tasks” in a tier can be expanded to accommodate the larger data pool. The concept of a Data Tier is a natural fit for this task-running paradigm.


Himanshu Gupta

May 29, 2015, 10:47:11 AM5/29/15
to druid-de...@googlegroups.com
I can't see any of the images. Are they rendering properly for others?

-- Himanshu

Zach Cox

May 29, 2015, 11:06:25 AM5/29/15
to druid-de...@googlegroups.com
I also cannot see the images.


Charles Allen

May 29, 2015, 1:46:02 PM5/29/15
to druid-de...@googlegroups.com
Are you looking through your mail client or through google groups?
There's the google groups link for the post

Himanshu Gupta

May 29, 2015, 2:39:56 PM5/29/15
to druid-de...@googlegroups.com
Can't see the images on google groups either.

Charles Allen

May 29, 2015, 2:46:53 PM5/29/15
to druid-de...@googlegroups.com

Himanshu

May 29, 2015, 2:50:06 PM5/29/15
to druid-de...@googlegroups.com
OK, I can see now :)

Charles Allen

Jun 12, 2015, 1:18:22 PM6/12/15
to druid-de...@googlegroups.com
I haven't heard much feedback one way or the other on this topic, so I'm assuming at this point that there are no major objections to moving forward.

Eric Tschetter

Jun 23, 2015, 2:06:38 PM6/23/15
to druid-de...@googlegroups.com
I think that's a fair assessment, Charles. It all looks like a reasonable path to follow.

If I can make a comment on how to chunk the code up: perhaps step (1) could start with the Overlord actually having multiple tiers internal to it, i.e. its code gets divided up into multiple tiers first. Then, after we merge that, we can separate it out so that the communication is no longer internal to the JVM and goes to a remote process instead.

--Eric

Charles Allen

Jun 23, 2015, 8:09:17 PM6/23/15
to druid-de...@googlegroups.com
Thanks! Sounds reasonable

Himanshu Gupta

Aug 31, 2015, 3:44:15 PM8/31/15
to Druid Development
Not sure if it is in the scope of this and covered in the doc, but will it be possible for a user to say that two tasks must run on different/named tiers (something similar to the concept of "availabilityGroup" for individual middle managers)? This would allow users to set up "replicated tiers" and enable faster upgrades (currently, while upgrading, a user takes down one middle manager host at a time, but later a user could take down a whole tier from a set of replicated tiers).

-- Himanshu

Charles Allen

Aug 31, 2015, 4:07:37 PM8/31/15
to Druid Development
I hadn't planned on implementing that as an explicit use case beyond what naturally arises. In the current scope you can have two tasks for HA purposes. In this proposal, there is not currently an expected change to the method by which HA stuff is run.

That being said, the way HA would be achieved in your example is that there could be two tiers, and whatever is submitting tasks would launch two tasks, one in each tier, much as they are launched now (with two tasks), with the exception that whoever submits the tasks knows that they are related.

The task runner has no plans currently to have knowledge about the tasks that it is running.

Even the overlord has very limited knowledge, only knowing that things can share locks by groupId.

If there is something that needs to be implemented at the task runner level I'd love to hear about it.
...

Charles Allen

Sep 1, 2015, 12:09:02 AM9/1/15
to Druid Development
Also of note is that one of the changes I'm adding in the first iteration is that the middle managers can be restarted and re-discover their prior running tasks.


On Monday, August 31, 2015 at 12:44:15 PM UTC-7, Himanshu Gupta wrote:
...

Charles Allen

Oct 6, 2015, 1:41:49 PM10/6/15
to Druid Development
FYI, the branch I'm putting code in is located at https://github.com/metamx/druid/tree/taskTier/extensions/tasktier

I'll update this thread with more information.
...

Charles Allen

Oct 6, 2015, 2:16:27 PM10/6/15
to Druid Development
Current significant changes from prior way of doing things:

  • "Peons" have a "deadhand" resource: an endpoint that listens for a heartbeat (POST) and calls System.exit if no heartbeat is heard within a configurable amount of time.
  • "Middle Managers" have a pinger which hits the deadhand resource on a configurable interval. In an emergency, ANY service can hit the endpoint to keep the peons alive.
  • Task reporting is now done through POSTing to the overlord.
  • The dependence on ZooKeeper is limited to Announcer and ServiceDiscovery. If a different consensus mechanism is implemented for announcing in the future, this can change.
  • Middle Managers can restart without losing the peon task.
  • Different "tier routes" can be defined. A "Tier" is a set of rules about resources, and a "tier route" is simply a way to submit a task which is to be run with a given set of rules about resources. These routes are injected from Properties at run-time, so they should be configurable by changing Properties.
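The deadhand behavior in the first bullet can be sketched as below. Time is passed in explicitly to keep the example deterministic; the real resource would use wall-clock time and call System.exit on expiry, and the class name is an assumption:

```java
// Hypothetical sketch of a heartbeat watchdog, not the actual Druid code.
class Deadhand {
    private final long timeoutMillis;
    private volatile long lastHeartbeatMillis;

    Deadhand(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeatMillis = nowMillis;
    }

    // Called on each heartbeat POST from the middle manager (or, in an
    // emergency, from any service keeping the peon alive).
    void heartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    // If this returns true, the peon would shut itself down (System.exit).
    boolean expired(long nowMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }
}
```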
...

Charles Allen

Oct 29, 2015, 1:29:09 PM10/29/15
to Druid Development
I'm currently thinking about how to handle auto-scaling here. In reality, the autoscaling and management strategy is intimately tied to the runner, but a lot of the code for running tasks does not look at it this way.

Currently, state is stored in two places: TaskRunner and ResourceManagementStrategy. I'm looking to see if the TaskRunner can give up its information on managing resources, and instead have a ResourceManagement object tied to the task runner which holds the state. The idea is that the ResourceManagement part should be pluggable, so that you can use YARN / Mesos / DruidIndexingService / other as a resource manager.
...

Gian Merlino

Oct 29, 2015, 2:16:55 PM10/29/15
to druid-de...@googlegroups.com
IMO it would be awesome if it's possible to rejigger the TaskRunner interface such that you could separate the assignment rules stuff (like handling of taskResource, availabilityGroups, assignment strategy, that kind of stuff) from the cluster management (the zookeeper and autoscaling stuff), in such a way that you could actually use a single implementation of the former with multiple implementations of the latter.

That would make it easy to use whatever cluster management you want, without losing all the Druid-specific rules about stuff like how to replicate things and the relative sizes of different tasks.

Charles Allen

Oct 30, 2015, 2:09:13 PM10/30/15
to Druid Development
Looking at the code some more, it really seems like the resources available to the runner and the runner itself are pretty tightly coupled. As such, I think it makes sense to have the resource strategy be in the domain of control of the runner, so the resource management strategy starts and stops at the behest of the runner.

Trying out some things, and if they work out well I'll get a PR in showcasing the idea.
...

Charles Allen

Jan 11, 2016, 5:28:34 PM1/11/16
to Druid Development
I opened up a PR for the work here:
https://github.com/druid-io/druid/pull/2246
...