Rescind revocable offer

Klaus Ma

Mar 13, 2016, 8:00:34 AM
to Mesos Resource Allocation Working Group

Hi team,


I'm starting this thread to discuss some cases involving allocated resources and revocable resources.

  • If reserved resources are allocated after they have already been offered as revocable resources, should the revocable offer be rescinded?

Taking Optimistic Offer Phase 1 as an example, there are two cases, depending on the order in which the frameworks register:

Pre-Condition:

  f1 in r1, 10 cpu reserved for r1 (cpu(r1):10)

  f2 in r2


Cases:

Case 1:

  T1: f1 registered, f1 got reserved resource offer cpu(r1):10

  T2: f2 registered, f2 can NOT get allocation_slack resources offer


Case 2:

  T1: f2 registered, f2 got allocation_slack resources offer

  T2: f1 registered, f1 got reserved resources offer: cpu(r1):10


For case 2, should we rescind the offer from f2? There is also another case: if both f1 and f2 are already in the allocator, the allocation result (case 1 or case 2) will depend on the sorter order (f1, f2 or f2, f1).
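
To make the ordering dependence concrete, here is a minimal toy model of the two cases. It is only a sketch and not Mesos code: the `Cluster` class, its fields, and `run` are hypothetical names, and it assumes the allocator lends reserved CPUs out as allocation slack only while they have not yet been offered to their own role.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Hypothetical toy model: one agent whose CPUs are all reserved for one role.
    reserved_role: str
    reserved_cpus: int
    reserved_allocated: bool = False  # reserved CPUs already offered to their role
    slack_offered: bool = False       # reserved CPUs already lent out as allocation slack
    offers: dict = field(default_factory=dict)

    def register(self, framework: str, role: str) -> None:
        if role == self.reserved_role:
            # The owning role always receives its reservation.
            self.offers[framework] = f"cpu({role}):{self.reserved_cpus}"
            self.reserved_allocated = True
        elif not self.reserved_allocated and not self.slack_offered:
            # Assumption: unallocated reserved CPUs are offered as revocable slack.
            self.offers[framework] = f"cpu(*){{REV}}:{self.reserved_cpus}"
            self.slack_offered = True
        else:
            # Nothing left to lend out.
            self.offers[framework] = None


def run(order):
    """Register frameworks in the given order and return the resulting offers."""
    cluster = Cluster(reserved_role="r1", reserved_cpus=10)
    for framework, role in order:
        cluster.register(framework, role)
    return cluster.offers


case1 = run([("f1", "r1"), ("f2", "r2")])  # f1 first: f2 gets no slack offer
case2 = run([("f2", "r2"), ("f1", "r1")])  # f2 first: both hold an offer for the same CPUs
```

In `case2` both frameworks end up holding an offer backed by the same 10 CPUs, which is exactly the situation where the rescind question arises. The same difference appears if both frameworks are already registered and only the sorter order changes.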


  • Rescind the offer or not? If we keep case 2's behaviour, will the master rescind the offer when a task is launched? For example:

Pre-Condition:

  f1 in r1, 10 cpu reserved for r1 (cpu(r1):10)

  f2 in r2

  f3 in r3, 10 cpu reserved for r3 (cpu(r3):10)

  f4 in r4


Case:

  T1: f1 registered, f1 got reserved resource offer cpu(r1):10

  T2: f2 registered, f2 can NOT get allocation_slack resources offer

  T3: f4 registered, f4 got allocation_slack resources offer cpu(*){REV}:10

  T4: f3 registered, f3 got reserved resources offer: cpu(r3):10

  T5: f1 launches a task; rescind f4's offer or not?


If over-rescinding offers is acceptable, I propose the following steps:

To handle a LAUNCH request with reserved resources, the master first checks whether the request is valid, and then keeps rescinding allocation_slack offers for the reserved resources until either 1.) all allocation_slack offers are rescinded or 2.) enough allocation_slack has been recovered to cover the reserved resources.

In case #1, the allocation_slack offers did not hold enough resources; the agent will then kill tasks/executors running on revocable resources to make room for the task on the reserved resources. In Phase 1, "RunTaskMessage.revocations" has only one item: role is empty (meaning the agent will decide which role's executors are evicted; balancing revocation between roles is post-MVP), and revocable_resources is `reserved resources` - `sum of allocation_slack offers' resources`.

In case #2, the allocation_slack offers held enough resources; in this case, the agent will not kill any tasks/executors, and "RunTaskMessage.revocations" is empty.
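
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the master's actual code: `handle_launch` is a hypothetical helper, offers are modelled as plain `(offer_id, cpus)` tuples, request validation is assumed to have already happened, and the revocation entry is a plain dict that reuses the `role` and `revocable_resources` field names from the description above.

```python
def handle_launch(reserved_cpus, slack_offers):
    """Sketch of the proposed rescind loop (hypothetical helper).

    reserved_cpus: CPUs the LAUNCH request wants from the reservation.
    slack_offers:  outstanding allocation_slack offers as (offer_id, cpus).
    Returns (rescinded_offer_ids, revocations).
    """
    rescinded = []
    recovered = 0
    for offer_id, cpus in slack_offers:
        if recovered >= reserved_cpus:
            break  # condition 2: enough allocation_slack recovered
        rescinded.append(offer_id)
        recovered += cpus
    # Falling out of the loop without the break is condition 1:
    # every allocation_slack offer has been rescinded.

    shortfall = reserved_cpus - recovered
    if shortfall > 0:
        # Case #1: the agent must evict revocable work to cover the rest.
        # An empty role means the agent picks which executors to evict.
        revocations = [{"role": "", "revocable_resources": shortfall}]
    else:
        # Case #2: the rescinded offers covered everything; nothing is killed.
        revocations = []
    return rescinded, revocations


# Case #1: offers hold only 7 of the 10 CPUs, so 3 must be revoked on the agent.
case1 = handle_launch(10, [("o1", 4), ("o2", 3)])
# Case #2: two offers already hold 12 CPUs; the third offer is left alone.
case2 = handle_launch(10, [("o1", 6), ("o2", 6), ("o3", 5)])
```

Note that `case2` recovers 12 CPUs for a 10 CPU request. Whole offers are rescinded, so some over-rescinding is inherent in this approach.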

To minimize over-rescinding, separate the offers into normal, usage slack, and allocation slack offers.


BTW, what's the final decision on offering allocation_slack resources to the framework itself? For example, with f1 in r1 and "cpu(r1):10", what is the expected offer: "cpu(r1):10" or "cpu(*){REV}:10;cpu(r1):10"?


Guangya Liu

Mar 13, 2016, 8:49:37 PM
to Mesos Resource Allocation Working Group
For the rescind issue, I think this is a common issue for both usage slack and allocation slack. My thinking is to leave this to the agent to simplify the logic here: if the agent finds that the resources are already in use, it just marks the task as TASK_LOST. There is already a related issue: https://issues.apache.org/jira/browse/MESOS-2647.

For the others, please see my comments inline.

Thanks,

Guangya

On Sunday, March 13, 2016 at 8:00:34 PM UTC+8, Klaus Ma wrote:

For case 2, should we rescind the offer from f2? There is also another case: if both f1 and f2 are already in the allocator, the allocation result (case 1 or case 2) will depend on the sorter order (f1, f2 or f2, f1).

There is already a JIRA that wants to add a new sorter to separate regular and revocable resources: https://issues.apache.org/jira/browse/MESOS-4923.

BTW, what's the final decision on offering allocation_slack resources to the framework itself? For example, with f1 in r1 and "cpu(r1):10", what is the expected offer: "cpu(r1):10" or "cpu(*){REV}:10;cpu(r1):10"?

I think it should be "cpu(*){REV}:10;cpu(r1):10". "r1" can be used by many frameworks, and some frameworks may prefer revocable resources while others prefer regular resources.

Klaus Ma

Mar 14, 2016, 11:51:49 PM
to Mesos Resource Allocation Working Group
My comments inline:


On Monday, March 14, 2016 at 8:49:37 AM UTC+8, Guangya Liu wrote:
For the rescind issue, I think this is a common issue for both usage slack and allocation slack. My thinking is to leave this to the agent to simplify the logic here: if the agent finds that the resources are already in use, it just marks the task as TASK_LOST. There is already a related issue: https://issues.apache.org/jira/browse/MESOS-2647.
[Klaus]: It's different. With oversubscription, only the agent has the latest resource information. But for allocation slack, the allocator manages those resources, so it's better to reduce the system's workload by avoiding the LAUNCH followed by TASK_LOST.
 
On Sunday, March 13, 2016 at 8:00:34 PM UTC+8, Klaus Ma wrote:

For case 2, should we rescind the offer from f2? There is also another case: if both f1 and f2 are already in the allocator, the allocation result (case 1 or case 2) will depend on the sorter order (f1, f2 or f2, f1).

There is already a JIRA that wants to add a new sorter to separate regular and revocable resources: https://issues.apache.org/jira/browse/MESOS-4923.

[Klaus]: This JIRA only resolves the sorter part of the issue; the issue caused by registration time is still there. If we rescind offers in the allocator, we had better wait for "Manage Offer in Allocator (MESOS-4553)". So I'd suggest keeping the current behaviour in Phase 1 and revisiting this after MESOS-4553.

BTW, what's the final decision on offering allocation_slack resources to the framework itself? For example, with f1 in r1 and "cpu(r1):10", what is the expected offer: "cpu(r1):10" or "cpu(*){REV}:10;cpu(r1):10"?

I think it should be "cpu(*){REV}:10;cpu(r1):10". "r1" can be used by many frameworks, and some frameworks may prefer revocable resources while others prefer regular resources.

[Klaus]: We did not reach a conclusion on this in the last meeting. IMO, it should be "cpu(r1):10". The reservation is at role level, and so are the revocable resources; if the reserved resources are allocated, the allocator should not offer them as revocable. If the allocator offered resources as "cpu(r1):10;cpu(*){REV}:10", the framework could not decide whether it should launch tasks on the revocable resources together with the reserved resources. I also appended my comments to the meeting minutes; let's sync up in tomorrow's meeting.
