Hi team,
I'm starting this thread to discuss some cases involving allocated resources and revocable resources.
- If the reserved resources are allocated after being offered as revocable resources, should the revocable offer be rescinded or not?
Take Optimistic Offer Phase 1 as an example; there are two cases, depending on when the frameworks registered:
Pre-Condition:
f1 in r1, 10 cpu reserved for r1 (cpu(r1):10)
f2 in r2
Cases:
Case 1:
T1: f1 registered, f1 got reserved resource offer cpu(r1):10
T2: f2 registered, f2 can NOT get allocation_slack resources offer
Case 2:
T1: f2 registered, f2 got allocation_slack resources offer
T2: f1 registered, f1 got reserved resources offer: cpu(r1):10
For case 2, should we rescind the offer from f2? There is also a further case: if both f1 and f2 are already in the allocator, the allocation result (case 1 or case 2) will depend on the sorter order (f1, f2 or f2, f1).
There is already a JIRA that wants to add a new sorter, https://issues.apache.org/jira/browse/MESOS-4923, to separate regular and revocable resources.
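To make the sorter dependence concrete, here is a toy sketch in Python (not the real allocator; the function name, the single-pass loop and the rule that unallocated reserved resources are offered as allocation slack are simplifying assumptions for illustration only):

```python
def allocate(order, roles, reserved_role="r1", reserved_cpus=10):
    """Toy single-pass allocation over frameworks in sorter order.

    `order` is the framework order produced by a (hypothetical) sorter,
    `roles` maps framework name -> role.
    """
    offers = {}
    reserved_allocated = False
    for fw in order:
        if roles[fw] == reserved_role:
            # The owning role gets the reserved resources as a regular offer.
            offers[fw] = f"cpu({reserved_role}):{reserved_cpus}"
            reserved_allocated = True
        elif not reserved_allocated:
            # Reserved but not yet allocated: offered as allocation slack.
            offers[fw] = f"cpu(*){{REV}}:{reserved_cpus}"
        else:
            # Reserved resources already allocated: no slack left to offer.
            offers[fw] = None
    return offers


roles = {"f1": "r1", "f2": "r2"}
case1 = allocate(["f1", "f2"], roles)  # same outcome as Case 1
case2 = allocate(["f2", "f1"], roles)  # same outcome as Case 2
```

With the same two frameworks, only the order changes whether f2 ends up holding an allocation_slack offer.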
- Rescind the offer or not? If we keep case 2's behaviour, will the master rescind the offer when a task is launched? For example,
Pre-Condition:
f1 in r1, 10 cpu reserved for r1 (cpu(r1):10)
f2 in r2
f3 in r3, 10 cpu reserved for r3 (cpu(r3):10)
f4 in r4
Case:
T1: f1 registered, f1 got reserved resource offer cpu(r1):10
T2: f2 registered, f2 can NOT get allocation_slack resources offer
T3: f4 registered, f4 got allocation_slack resources offer cpu(*){REV}:10
T4: f3 registered, f3 got reserved resources offer: cpu(r3):10
T5: f1 launches a task; rescind f4's offer or not?
If over-rescinding offers is acceptable, I propose the following steps:
To handle a LAUNCH request with reserved resources, the master first checks whether the request is valid, and then keeps rescinding allocation_slack offers for the reserved resources until either 1.) all allocation_slack offers are rescinded or 2.) enough allocation_slack has been recovered for the reserved resources.
Case #1 means the allocation_slack offers did not have enough resources; in this case, the agent will evict more revocable resources to run the task on the reserved resources. In Phase 1, "RunTaskMessage.revocations" has only one item: role is empty (meaning the agent will decide which role's executors are evicted; balancing revocation between roles is post-MVP), and revocable_resources is `reserved resources` - `sum of allocation_slack offers' resources`.
Case #2 means the allocation_slack offers had enough resources; in this case, the agent will not kill any tasks/executors. "RunTaskMessage.revocations" is empty.
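A rough sketch of the steps above, in Python rather than actual master code. SlackOffer and plan_rescind are hypothetical names, and resources are reduced to a single cpu number to keep it short:

```python
from dataclasses import dataclass


@dataclass
class SlackOffer:
    """Hypothetical stand-in for an outstanding allocation_slack offer."""
    framework: str
    cpus: float


def plan_rescind(reserved_cpus, slack_offers):
    """Rescind allocation_slack offers until the reserved amount is covered.

    Returns (offers to rescind, cpus the agent still has to revoke).
    """
    rescinded = []
    recovered = 0.0
    for offer in slack_offers:
        if recovered >= reserved_cpus:
            break  # condition 2.): enough allocation_slack recovered
        rescinded.append(offer)
        recovered += offer.cpus
    # Condition 1.): every offer is rescinded but we are still short. The
    # remainder is `reserved resources` - `sum of allocation_slack offers'
    # resources`, i.e. what would go into RunTaskMessage.revocations.
    to_revoke = max(0.0, reserved_cpus - recovered)
    return rescinded, to_revoke


# Case #1: offers cover only 7 of 10 cpus, so the agent must revoke 3 more.
short = plan_rescind(10, [SlackOffer("f4", 4), SlackOffer("f2", 3)])
# Case #2: 12 cpus rescinded for a 10 cpu request (over-rescind by 2).
enough = plan_rescind(10, [SlackOffer("f4", 6), SlackOffer("f2", 6)])
```

The second call shows where over-rescinding comes from: offers are rescinded whole, so more slack than needed can be taken back.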
To minimize over-rescinding, separate the offers into normal, usage slack, and allocation slack offers.
BTW, what's the final decision on offering allocation_slack resources to the framework itself? For example, with f1 in r1 and "cpu(r1):10", what's the expected offer: "cpu(r1):10" or "cpu(*){REV}:10;cpu(r1):10"?
For the rescind issue, I think this is common to both usage slack and allocation slack. My thinking is to leave this to the agent to simplify the logic here: if the agent finds that the resources are already in use, it just marks the task as TASK_LOST. There is actually an existing issue for this: https://issues.apache.org/jira/browse/MESOS-2647.
[Klaus]: It's different. In oversubscription, only the agent has the latest resource information. But for allocation slack, the allocator manages those resources; it's better to reduce the system's workload by avoiding the LAUNCH followed by TASK_LOST.
For the others, please refer to my comments inline.
Thanks,
Guangya
On Sunday, March 13, 2016 at 8:00:34 PM UTC+8, Klaus Ma wrote:
Hi team,
BTW, what's the final decision on offering allocation_slack resources to the framework itself? For example, with f1 in r1 and "cpu(r1):10", what's the expected offer: "cpu(r1):10" or "cpu(*){REV}:10;cpu(r1):10"?
I think it should be "cpu(*){REV}:10;cpu(r1):10": "r1" can be used by many frameworks, and some frameworks may prefer revocable resources while others prefer regular resources.
[Klaus]: We did not reach a conclusion on this in the last meeting. IMO, it should be "cpu(r1):10". The reservation is at the role level, and so are the revocable resources; if reserved resources are allocated, the allocator should not offer them as revocable. If the allocator offered the resources as "cpu(r1):10;cpu(*){REV}:10", the framework could not decide whether it should launch tasks on the revocable resources together with the reserved resources. I also appended my comments to the meeting minutes; let's sync up in tomorrow's meeting.