Re: [kubernetes/kubernetes] scheduler and storage provision (PV controller) coordination (#43504)

Jing Xu

unread,
Mar 22, 2017, 3:26:52 AM3/22/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

@kubernetes/sig-storage-misc
@kubernetes/sig-scheduling-misc
@kubernetes/sig-apps-misc



David Oppenheimer

unread,
Mar 22, 2017, 3:44:17 AM3/22/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

ref/ kubernetes/community#306 (for 2 and 4)

Jan Šafránek

unread,
Mar 22, 2017, 4:37:03 AM3/22/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

As the author of the PV controller, I admit it's quite stupid that when binding PVs to PVCs it ignores pod requirements entirely. It's a very simple process; however, it's complicated by our database not allowing us transactions. It would save us a lot of pain (and code!) if we could update two objects atomically in a single write, and such an operation could easily be done in the pod scheduler.

PV controller would remain there to coordinate provisioning, deletion and such.

[I know etcd allows transactions, however we intentionally don't use this feature].

Tim Hockin

unread,
Mar 22, 2017, 12:58:57 PM3/22/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

This started in email, so I'll bring in some of my notes from there.

In reality, the only difference between a zone and a node is cardinality. A node is just a tiny little zone with one choice. If we fold PV binding into scheduling, we get a more holistic sense of resources, which would be good. What I don't want is the (somewhat tricky) PV binding being done in multiple places.

Another consideration: provisioning. If a PVC is pending and provisioning is invoked, we really should decide the zone first and tell the provisioner what zone we want. But so far, that's optional (and opaque). As long as provisioners provision in whatever zone they feel like, we STILL have split-brain. For net-attached storage we get away with it because cardinality is usually > 1, whereas for local storage it is not.

I think the right answer might be to join these scheduling decisions and to be more prescriptive with topology wrt provisioning.

Kenneth Owens

unread,
Mar 22, 2017, 2:41:53 PM3/22/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

/ref #41598

Michail Kargakis

unread,
Mar 23, 2017, 2:10:03 PM3/23/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

[I know etcd allows transactions, however we intentionally don't use this feature].

This is the issue about transactions: #27548

Michelle Au

unread,
Apr 26, 2017, 2:11:14 PM4/26/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

Here is my rough idea about how to make the scheduler storage-topology aware. It is similar to the topologyKey for pod affinity/anti-affinity, except that you specify it in the StorageClass instead of the pod (a hypothetical StorageClass sketch follows the list below). The sequence could look something like:

  1. PVC-PV binding has to be delayed until there is a pod associated with the PVC.
  2. The scheduler has to look into the StorageClass of the PVC and pull out the topologyKey.
  3. It filters existing available PVs based on the topologyKey value, which also has to match on the node it's evaluating. The predicate returns true if there are enough available PVs.
  4. If no pre-existing PVs are available, ask the provisioner if it can provision in the topologyKey value of the node. The predicate returns true if the provisioner says it can.
  5. The scheduler picks a node for the pod out of the remaining choices based on some ranking.
  6. Kubelet waits until PVCs are bound.
  7. Once the pod is assigned to a node, do the PVC-PV binding/provisioning.
  8. If the binding fails, kubelet has to reject the pod.
  9. The scheduler retries a different node. Go back to 6).
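
For illustration, a minimal sketch of what a StorageClass carrying such a topologyKey might look like. The topologyKey field here is hypothetical (it follows this proposal, not any shipped API), and the provisioner, parameters, and name are placeholders:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: zonal-ssd                 # illustrative name
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
# hypothetical field from this proposal: the topology domain within which
# available PVs (or newly provisioned volumes) must match the candidate node
topologyKey: failure-domain.beta.kubernetes.io/zone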

Vish Kannan

unread,
Apr 26, 2017, 5:32:50 PM4/26/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

I have an alternate idea for dealing with topology.
I think we can use labels for expressing topology. Here is the algorithm that we discussed:

  1. PVs expose a single topology label based on the storage they represent:

kind: PersistentVolume
metadata:
  labels:
    topology.kubernetes.io/node: foo
spec:
  localStorage: ...

or

kind: PersistentVolume
metadata:
  labels:
    topology.kubernetes.io/zone: bar
spec:
  gcePersistentDisk: ...

  2. PVCs can already select using this topology label. StorageClasses should include a label selector that can specify topology constraints:

kind: StorageClass
metadata:
  name: local-fast-storage
spec:
  selector:
    - key: "topology.kubernetes.io/node"

or

kind: StorageClass
metadata:
  name: durable-slow
spec:
  selector:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values:
      - bar-zone

  3. Nodes will expose all aspects of topology via consistent label keys:

kind: Node
metadata:
  labels:
    topology.kubernetes.io/node: foo
    topology.kubernetes.io/zone: bar

  4. The scheduler can then combine the NodeSelector on the pod with the selector on the StorageClass while identifying nodes that can meet the storage locality requirements.

This method would require using consistent label keys across nodes and PVs. I hope that's not a non-starter.
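
As an illustration of item 2, a PVC that selects PVs by such a topology label could use the existing label selector on the claim. This is only a sketch; the topology.kubernetes.io/node key is the one assumed in the proposal above (not an established label), and the claim name and size are illustrative:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: local-fast-claim         # illustrative name
spec:
  storageClassName: local-fast-storage
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  selector:
    matchExpressions:
    - key: topology.kubernetes.io/node
      operator: In
      values:
      - foo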

Vish Kannan

unread,
Apr 26, 2017, 5:33:36 PM4/26/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

Kubelet is already exposing failure-domain labels that indicate the zone and region. Here is an example from GKE:

Labels:			
			failure-domain.beta.kubernetes.io/region=us-central1
			failure-domain.beta.kubernetes.io/zone=us-central1-b
			kubernetes.io/hostname=gke-ssd-default-pool-ef225ddf-xfrk

We can consider re-purposing the existing labels too.
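
For comparison, a PV carrying the matching failure-domain labels might look like the sketch below. On GCE and AWS these labels are applied by an admission controller (as noted later in this thread); the PV name, capacity, and disk name here are illustrative:

kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-zonal-example         # illustrative name
  labels:
    failure-domain.beta.kubernetes.io/region: us-central1
    failure-domain.beta.kubernetes.io/zone: us-central1-b
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  gcePersistentDisk:
    pdName: my-data-disk         # illustrative disk name
    fsType: ext4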

Jan Šafránek

unread,
Apr 27, 2017, 4:26:38 AM4/27/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

@vishh, matching existing PVs is IMO not the issue here; the problem is dynamic provisioning. You must know before provisioning in what zone / region / host / arbitrary topology item you want to provision the volume. And @msau42 proposes that this decision should be made during pod scheduling.

@msau42, technically this could work; however, it will break external provisioners. You can't ask them whether it's possible to provision a volume for a specific node in order to filter the nodes; you can only ask them to provision a volume, and they can either succeed or fail.

Michelle Au

unread,
Apr 27, 2017, 2:06:26 PM4/27/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

Yes, the sequence I am suggesting is going to require changes to the current provisioning protocol to add this additional request.

Vish Kannan

unread,
Apr 27, 2017, 2:14:42 PM4/27/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

You must know before the provisioning in what zone / region / host / arbitrary topology item you want to provision the volume

I was assuming that provisioning will be triggered by the scheduler in the future at which point the zone/region/rack or specific node a pod will land on will be known prior to provisioning.

Michelle Au

unread,
Apr 27, 2017, 3:55:36 PM4/27/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

I also think that the filtering is more of an optimization and can be optional. There are only a handful of zones, so we could try to provision in one zone, and if that fails, then try another zone until it succeeds.

But for the node case, being able to pre-filter available nodes will be important. It's not a scalable solution if we have to retry hundreds of nodes until we find one that succeeds.

Vish Kannan

unread,
Apr 27, 2017, 7:03:01 PM4/27/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

I also think that the filtering is more of an optimization and can be optional.

Storage should reside where pods are. If pods have a specific spreading constraint, then storage allocation has to ideally meet that constraint. The scenario you specified is OK for pods that do not have any specific spreading constraints.

It's not a scalable solution if we have to retry hundreds of nodes until we find one that succeeds.

Define success? For local PVs it's only a matter of applying label filters and performing capacity checks right?

Michelle Au

unread,
Apr 27, 2017, 7:28:56 PM4/27/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

I'm referring to a dynamic provisioning scenario, where the scheduler decides which node the pod should be on, and then triggers the provisioning on that node. But the scheduler should know beforehand some information about whether that node has enough provisionable capacity, so that it can pre-filter nodes.

Clayton Coleman

unread,
Apr 27, 2017, 10:35:15 PM4/27/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention
I think we should try to keep the pod as the central scheduling object, and if dynamic provisioning could fail in a particular part of the cluster, that info needs to be available to the scheduler prior to scheduling (whether via dynamic provisioning marking a PVC as being constrained, or the scheduler knowing about the storage class and a selector existing on the storage class status). The latter is less racy. I would hate, however, for the scheduler to have to have complex logic to reason about where a PVC can go.

I do think dynamic provisioners should be required to communicate capacity via status if the scheduler needs that info. I don't think the initial PVC placement is the responsibility of the dynamic provisioners.

Vish Kannan

unread,
Apr 29, 2017, 12:28:28 PM4/29/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention
The algorithm I was imagining is as follows:

1. Go through scheduling predicates to identify a list of viable nodes.
2. Go through priority functions and get a list of nodes sorted by priority.
3. Present this set of nodes to the dynamic provisioner and have the provisioner choose a node based on priority.
4. The scheduler completes the binding process if provisioning succeeded. It might assign to a specific node if a local PV was requested.

For remote PVs, if dynamic provisioning fails in the rack/zone/region, then the pod cannot be scheduled. The scheduler should not be changing its pod spreading policy based on storage availability.

David Oppenheimer

unread,
Apr 29, 2017, 3:30:56 PM4/29/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

For remote PVs, if dynamic provisioning fails in the rack/zone/region, then the pod cannot be scheduled. The scheduler should not be changing its pod spreading policy based on storage availability.

I'm not sure I understand this. It's important to distinguish between predicates (hard constraints) and priority functions (soft constraints/preferences). The scheduler spreading policy (assuming you're not talking about explicit requiredDuringScheduling anti-affinity) is in the latter category. So if there is a node where the pod can fit (and where storage is available or can be provisioned), it should always schedule, even if it "violates" the pod spreading policy.

BTW I like the idea of the StorageClass giving status that indicates the number and shape of PVs that it can allocate, so that the scheduler can use this information, plus its knowledge of the available PVs that have already been created, when making the assignment decision. I agree we probably need an alternative for "legacy provisioners" that don't expose this StorageClass status.
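
A rough sketch of what such a StorageClass status could look like. This is purely hypothetical (StorageClass has no status subresource), and the field names are invented only to illustrate the idea of reporting the number and shape of provisionable PVs per topology domain:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: durable-slow
provisioner: kubernetes.io/gce-pd
# hypothetical status, invented for illustration only
status:
  provisionable:
  - topology:
      failure-domain.beta.kubernetes.io/zone: us-central1-b
    remainingCapacity: 500Gi
  - topology:
      failure-domain.beta.kubernetes.io/zone: us-central1-c
    remainingCapacity: 2Ti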

Vish Kannan

unread,
May 1, 2017, 1:01:14 AM5/1/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

So if there is a node where the pod can fit (and where storage is available or can be provisioned), it should always schedule, even if it "violates" the pod spreading policy.

It is possible that storage constraints might violate pod scheduling constraints. What if a StatefulSet wants to use a storage class that is accessible only from a single zone, but the pods using that StorageClass are expected to be spread across zones? I feel this is an invalid configuration and scheduling should fail. If the scheduler were to (incorrectly) notice that storage is available in only one zone and then place all pods in the same zone, that would violate user expectations.

To be clear, local PVs can have a predicate. Local PVs are statically provisioned and from a scheduling standpoint are similar to "cpu" or "memory".
It is dynamic provisioning that will require an additional scheduling step, which runs after a sorted list of nodes is available for each pod in the scheduler.
@davidopp thoughts?

David Oppenheimer

unread,
May 1, 2017, 1:16:19 AM5/1/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

There are two kinds of spreading: constraint/hard requirement (called requiredDuringScheduling) and preference/soft requirement (called preferredDuringScheduling). I was just saying that it's OK to violate the second kind due to storage.

I think it's hard to pin down exactly what "user expectations" are for priority functions. We have a weighting scheme but there are so many factors that unless you manually adjust the weights yourself, you can't really have strong expectations.

Vish Kannan

unread,
May 1, 2017, 1:26:12 AM5/1/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

I was just saying that it's OK to violate the second kind due to storage.

Got it. If storage availability can be exposed in a portable manner across deployments, then it can definitely be a "soft constraint" as you mentioned. The easiest path forward now is to perform dynamic provisioning of storage once a list of nodes is available.

I think it's hard to pin down exactly what "user expectations" are for priority functions.

Got it. I was referring to "hard constraints" specifically.

Aaron Crickenberger

unread,
May 31, 2017, 4:21:46 PM5/31/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

@kubernetes/sig-storage-misc @kubernetes/sig-scheduling-misc @kubernetes/sig-apps-misc do you want this in for v1.7? Which is the correct SIG to own this?

Michelle Au

unread,
May 31, 2017, 4:39:05 PM5/31/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

We're targeting 1.8

Jing Xu

unread,
Aug 3, 2017, 2:52:23 PM8/3/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

The feature issue related to this is #43640

Michelle Au

unread,
Aug 4, 2017, 2:30:06 PM8/4/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

/assign

Kubernetes Submit Queue

unread,
Sep 1, 2017, 2:12:08 PM9/1/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

[MILESTONENOTIFIER] Milestone Labels Incomplete

@jingxu97 @msau42

Action required: This issue requires label changes. If the required changes are not made within 6 days, the issue will be moved out of the v1.8 milestone.

kind: Must specify at most one of ['kind/bug', 'kind/feature', 'kind/cleanup'].
priority: Must specify at most one of ['priority/critical-urgent', 'priority/important-soon', 'priority/important-longterm'].

Additional instructions available here

Michelle Au

unread,
Sep 1, 2017, 2:20:11 PM9/1/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

/kind feature
/priority important-soon

Kubernetes Submit Queue

unread,
Sep 1, 2017, 7:07:49 PM9/1/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

[MILESTONENOTIFIER] Milestone Labels Complete

@jingxu97 @msau42

Issue label settings:

  • sig/scheduling sig/storage: Issue will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the issue owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/feature: New functionality.
Additional instructions available here

Kubernetes Submit Queue

unread,
Sep 7, 2017, 2:06:01 PM9/7/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

[MILESTONENOTIFIER] Milestone Labels Complete

@jingxu97 @msau42

Issue label settings:

  • sig/scheduling sig/storage: Issue will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the issue owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/feature: New functionality.
Additional instructions available here The commands available for adding these labels are documented here

Avi Deitcher

unread,
Nov 7, 2017, 6:25:13 AM11/7/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

We're targeting 1.8

Did any of it make it into 1.8? I don't see PRs linked here, but I might have missed them.

Michelle Au

unread,
Nov 7, 2017, 11:44:30 AM11/7/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

@deitch No, unfortunately not. Scheduler improvements for static PV binding are targeted for 1.9. Dynamic provisioning will come after. The general feature tracker is at kubernetes/features#490.

Avi Deitcher

unread,
Nov 7, 2017, 11:46:53 AM11/7/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

@msau42 looks like we are talking on 2 separate issues about the same... "issue"? :-)

So 1.9 for static, 1.10+ or 2.0+ for dynamic?

Michelle Au

unread,
Nov 7, 2017, 11:56:36 AM11/7/17
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

Yes

wu105

unread,
Feb 3, 2018, 4:44:22 AM2/3/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

Did we cover the use case when two (or more) pods use the same pv? If the underlying infrastructure does not allow the pv to attach to more than one node, the second pod should be scheduled on the same node as the first pod.

Michelle Au

unread,
Feb 3, 2018, 11:08:50 AM2/3/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

Pods that use local PVs will be scheduled to the same node, but it's not going to work for zonal PVs. Pods that use zonal PVs will only be scheduled to the same zone.

wu105

unread,
Feb 3, 2018, 9:14:03 PM2/3/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

@msau42 that would require the user to specify the node name for the pod, which is not desirable: Kubernetes already has the information to schedule the second pod to the correct node, yet the user is burdened with selecting nodes.

On a different topic (I hope it's not off topic for this thread): node and PV zones with the OpenStack cloud provider. When the cloud provider is OpenStack, the node and PV zones seem to be copied from Nova and Cinder respectively. The node zones are network security zones, while PVs get a single zone from Cinder that serves multiple network zones, which is not suitable for Kubernetes scheduling. The OpenStack Nova and Cinder zones just do not seem to support Kubernetes scheduling. It would be more helpful if the Kubernetes admin could easily configure the node zones and the PV zones on OpenStack. PVs come and go, so it may help to add a PV zone override to the PV claim.

Michelle Au

unread,
Feb 4, 2018, 12:34:34 PM2/4/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

@wu105 are you referring to local or zonal PVs? The design goal of local PVs is that the pod does not need to specify any node name; it's all contained in the PV information.

The problem of node enforcement with zonal PVs is that you also need to take access mode into account. A multi-writer zonal PV does not have a node attachment restriction like ReadWriteOnce PVs do. I think the best way to solve the node problem for zonal PVs is to do access mode enforcement, instead of trying to conflate it with PV node affinity.

I'm not sure I completely understand your issue with OpenStack zone labelling. At least I know that for GCE and AWS volumes, we have admission controllers that already label the PV with the correct zone information. I imagine you can do the same for OpenStack.

Michelle Au

unread,
Feb 5, 2018, 11:56:00 AM2/5/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

I just realized as a workaround, you could use pod affinity to get two pods sharing the same zonal single attach PVC to be scheduled on the same node.

wu105

unread,
Feb 5, 2018, 2:37:32 PM2/5/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention
Affinity indeed is a workaround.

I selected the preferred rule because it does not seem to require that the first pod specify no affinity, which would require the user to track the number of pods using the PV.

I used the max weight of 100 in the hope that k8s will never ignore the "preferred" rule and put a pod on the wrong node.

In summary, the original pod spec --
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  containers:
  - name: busybox
    image: busybox
    stdin: true
    tty: true
    command:
    - /bin/sh
    - -i
    volumeMounts:
    - mountPath: /data
      name: node-pv
      readOnly: false
  volumes:
  - name: node-pv
    persistentVolumeClaim:
      claimName: app-pvc

becomes the following after adding the affinity specs --
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
  labels:
    volumeClaimName: app-pvc
spec:
  containers:
  - name: busybox
    image: busybox
    stdin: true
    tty: true
    command:
    - /bin/sh
    - -i
    volumeMounts:
    - mountPath: /data
      name: node-pv
      readOnly: false
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: volumeClaimName
              operator: In
              values:
              - app-pvc
          topologyKey: kubernetes.io/hostname
  volumes:
  - name: node-pv
    persistentVolumeClaim:
      claimName: app-pvc




Michelle Au

unread,
Feb 5, 2018, 2:46:25 PM2/5/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

@bsalamat anything we can do about the scenario where you specify podAffinity and it's the first pod (which is going to have no pods matching the selector yet)?

Bobby (Babak) Salamat

unread,
Feb 5, 2018, 4:46:26 PM2/5/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

@msau42 if two pending pods had affinity to one another, they would never be scheduled. Affinity is a way of specifying a dependency, and two pods having affinity to one another represents a circular dependency, which is an anti-pattern IMO.

wu105

unread,
Feb 5, 2018, 5:04:08 PM2/5/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

@bsalamat Maybe we can add a rule requiredDuringSchedulingIgnoredDuringExecutionOrFirstPod, or introduce a weight, e.g., 101, for the preferredDuringSchedulingIgnoredDuringExecution rule that would make it behave like requiredDuringSchedulingIgnoredDuringExecution when there is at least one matching pod.
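
A sketch of how the suggested rule might appear in a pod spec. Both the rule name and its semantics are hypothetical (it was never implemented); this is shown only to make the suggestion concrete, reusing the affinity term from the earlier workaround:

affinity:
  podAffinity:
    # hypothetical rule: behaves like requiredDuringSchedulingIgnoredDuringExecution,
    # except the first pod (when no pods match the selector yet) may schedule anywhere
    requiredDuringSchedulingIgnoredDuringExecutionOrFirstPod:
    - labelSelector:
        matchExpressions:
        - key: volumeClaimName
          operator: In
          values:
          - app-pvc
      topologyKey: kubernetes.io/hostname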

Bobby (Babak) Salamat

unread,
Feb 5, 2018, 7:15:23 PM2/5/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

Maybe we can add a rule requiredDuringSchedulingIgnoredDuringExecutionOrFirstPod,

So, the "OrFirstPod" part causes the pod to be scheduled even if the affinity rule cannot be satisfied? It could work, but I have to think about the possible performance implications. Affinity/anti-affinity already causes performance issues in medium and large clusters, and we are thinking about streamlining the design. We must be very careful about adding new features that could worsen the situation.

or introduce a weight, e.g., 101, for the preferredDuringSchedulingIgnoredDuringExecution rule that would make it behave like requiredDuringSchedulingIgnoredDuringExecution when there is at least one matching pod.

This is a hack. I wouldn't consider it an option.

wu105

unread,
Feb 5, 2018, 11:22:33 PM2/5/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention
The root issue is assigning a node to a pod and its PV so that they are on the same node. The new wrinkle is that the PV is assigned to a node first, via another pod. If we can check the PV and it is already attached to a node, we should assign the same node to the pod. Then we would not need this explicit affinity workaround, just as we do not need to spell out the affinity of PVs to their pod.

Performance probably won't be an issue when the correct node can always be determined by looking only at the pod and its PVs. If the correct node cannot be determined by looking only at the pod and its PVs, optionally specifying an order over scheduling might help.

Is it expensive to check whether a PV is already attached to a node, or which node a PV is attached to?

An alternative would be that when a pod fails to schedule because it was assigned to the wrong node, e.g., its PV cannot be attached because the PV is already attached to another node, the pod is automatically rescheduled onto the PV's node.

This is less desirable than the above, but the performance drag may be limited if such failures are rare. It is better than asking the users to handle it, which is quite difficult, as we are still trying to determine how to come up with a spec that would work. The special rule requiredDuringSchedulingIgnoredDuringExecutionOrFirstPod uses the same idea, i.e., the extra check is only performed when the user specifies this rule.



Michelle Au

unread,
Feb 5, 2018, 11:53:54 PM2/5/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

@wu105 like I mentioned earlier, I think the proper solution will be access mode enforcement. The PV NodeAffinity feature does not help here as it is unrelated to volume attaching + access modes. We cannot assume that all PVs are only attachable to a single node at a time. There have actually been quite a few other issues discussing this and the challenges: #26567, #30085, #47333

wu105

unread,
Feb 6, 2018, 1:39:04 AM2/6/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention
I quickly went through those referenced issues.

I did not quite capture all the fine points, but I think one issue is to nail down the definition of RWO; then, if it does not cover all the needs or variations, we may need to add additional modes.

For instance, the single-node PV attachment constraint comes from OpenStack Cinder, mixed with file systems on the volume that can support multiple writers, or applications willing to manage the risk of using a single-writer file system for multiple writers by implementing their own control logic. Kubernetes, at the infrastructure level, may want to relax some of the constraints for special needs, just as we are wary of privileged mode or mounting hostPath /var/run/docker.sock but still keep them available.

On the pods-sharing-a-PV situation: when Kubernetes allows it and has the information to assign nodes correctly but does it wrong, it feels more like a bug than a feature ☺

Michelle Au

unread,
Feb 6, 2018, 2:00:13 PM2/6/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

Agree, I think some new access mode API is needed to handle this case. Let's use #26567 to continue the discussion since that issue has the most history regarding access modes.

Michelle Au

unread,
Feb 26, 2018, 11:42:58 PM2/26/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

Dynamic provisioning topology design proposal is here: kubernetes/community#1857

fejta-bot

unread,
May 28, 2018, 1:00:07 AM5/28/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Avi Deitcher

unread,
May 28, 2018, 1:33:58 AM5/28/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

/remove-lifecycle stale

fejta-bot

unread,
Aug 26, 2018, 2:19:44 AM8/26/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

Alon Lubin

unread,
Sep 5, 2018, 8:05:56 AM9/5/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

/remove-lifecycle stale

Michelle Au

unread,
Sep 5, 2018, 10:41:52 AM9/5/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

Topology aware dynamic provisioning will be available in beta in 1.12. In-tree gce, aws and azure block disks are supported. Local and CSI volumes will be supported in a future release.

/close
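
For reference, a minimal sketch of the StorageClass fields behind the 1.12 beta feature, as I understand the API: volumeBindingMode: WaitForFirstConsumer delays binding and provisioning until a pod using the claim is scheduled, and allowedTopologies optionally restricts where volumes may be provisioned (the name and zone values here are illustrative):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: topology-aware-standard   # illustrative name
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - us-central1-b
    - us-central1-c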

k8s-ci-robot

unread,
Sep 5, 2018, 10:42:07 AM9/5/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

@msau42: Closing this issue.

In response to this:

Topology aware dynamic provisioning will be available in beta in 1.12. In-tree gce, aws and azure block disks are supported. Local and CSI volumes will be supported in a future release.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot

unread,
Sep 5, 2018, 10:42:40 AM9/5/18
to kubernetes/kubernetes, k8s-mirror-storage-misc, Team mention

Closed #43504.
