@kubernetes/sig-storage-misc
@kubernetes/sig-scheduling-misc
@kubernetes/sig-apps-misc
ref/ kubernetes/community#306 (for 2 and 4)
As the author of the PV controller, I admit it's quite stupid that it ignores pod requirements entirely when matching PVs to PVCs. The matching itself is a very simple process; however, it's complicated by our database not allowing transactions. It would save us a lot of pain (and code!) if we could update two objects atomically in a single write, and such an operation could easily be done in the pod scheduler.
The PV controller would remain to coordinate provisioning, deletion and such.
[I know etcd allows transactions, however we intentionally don't use this feature].
This started in email, so I'll bring in some of my notes from there.
In reality, the only difference between a zone and a node is cardinality. A node is just a tiny little zone with one choice. If we fold PV binding into scheduling, we get a more holistic sense of resources, which would be good. What I don't want is the (somewhat tricky) PV binding being done in multiple places.
Another consideration: provisioning. If a PVC is pending and provisioning is invoked, we really should decide the zone first and tell the provisioner what zone we want. But so far, that's optional (and opaque). As long as provisioners provision in whatever zone they feel like, we STILL have split-brain. For net-attached storage we get away with it because cardinality is usually > 1, whereas for local storage it is not.
I think the right answer might be to join these scheduling decisions and to be more prescriptive with topology wrt provisioning.
/ref #41598
[I know etcd allows transactions, however we intentionally don't use this feature].
This is the issue about transactions: #27548
Here is my rough idea about how to make the scheduler storage-topology aware. It is similar to the topologyKey for pod affinity/anti-affinity, except you specify it in the storageclass instead of the pod. The sequence could look something like:
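As a rough illustration only (paraphrasing the idea discussed later in the thread, where the scheduler picks the topology domain first and then triggers binding/provisioning there), here is where such a key might live. The topologyKey field on the StorageClass below is hypothetical, not an existing API field:

```yaml
# Hypothetical sketch only: StorageClass has no topologyKey field today.
# The idea is that the scheduler chooses a domain of this type (a node here)
# before binding or provisioning is triggered.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-fast
provisioner: example.com/local-ssd   # hypothetical provisioner name
topologyKey: kubernetes.io/hostname
```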
I have an alternate idea for dealing with topology.
I think we can use labels for expressing topology. Here is the algorithm that we discussed:
```yaml
kind: PersistentVolume
metadata:
  labels:
    topology.kubernetes.io/node: foo
spec:
  localStorage: ...
```
or
```yaml
kind: PersistentVolume
metadata:
  labels:
    topology.kubernetes.io/zone: bar
spec:
  GCEPersistentDisk: ...
```
```yaml
kind: StorageClass
metadata:
  name: local-fast-storage
spec:
  selector:
  - key: "topology.k8s.io/node"
```
or
```yaml
kind: StorageClass
metadata:
  name: durable-slow
spec:
  selector:
  - key: "topology.k8s.io/zone"
    operator: In
    values:
    - bar-zone
```
```yaml
kind: Node
metadata:
  labels:
    topology.kubernetes.io/node: foo
    topology.k8s.io/zone: bar
```
This method would require using consistent label keys across nodes and PVs. I hope that's not a non-starter.
Kubelet is already exposing a failure domain label that indicates the zone and region. Here is an example from GKE:
```
Labels:
    failure-domain.beta.kubernetes.io/region=us-central1
    failure-domain.beta.kubernetes.io/zone=us-central1-b
    kubernetes.io/hostname=gke-ssd-default-pool-ef225ddf-xfrk
```
We can consider re-purposing the existing labels too.
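For example, a zonal PV carrying those existing failure-domain labels might look like this (all values are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-gce-pd-1
  labels:
    failure-domain.beta.kubernetes.io/region: us-central1
    failure-domain.beta.kubernetes.io/zone: us-central1-b
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  gcePersistentDisk:
    pdName: my-data-disk   # illustrative disk name
    fsType: ext4
```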
@vishh, matching existing PVs is IMO not the issue here; the problem is dynamic provisioning. You must know, before provisioning, in what zone / region / host / arbitrary topology domain you want to provision the volume. And @msau42 proposes that this decision should be made during pod scheduling.
@msau42, technically this could work, however it will break external provisioners. You can't ask them whether it's possible to provision a volume for a specific node in order to filter the nodes; you can only ask them to provision a volume, and they either succeed or fail.
Yes, the sequence I am suggesting is going to require changes to the current provisioning protocol to add this additional request.
You must know, before provisioning, in what zone / region / host / arbitrary topology domain you want to provision the volume
I was assuming that provisioning will be triggered by the scheduler in the future, at which point the zone/region/rack or the specific node a pod will land on will be known prior to provisioning.
I also think that the filtering is more of an optimization and can be optional. There are only a handful of zones, so we could try to provision in one zone, and if that fails, then try another zone until it succeeds.
But for the node case, being able to pre-filter available nodes will be important. It's not a scalable solution if we have to retry hundreds of nodes until we find one that succeeds.
I also think that the filtering is more of an optimization and can be optional.
Storage should reside where pods are. If pods have a specific spreading constraint, then storage allocation ideally has to meet that constraint. The scenario you specified is OK for pods that do not have any specific spreading constraints.
It's not a scalable solution if we have to retry hundreds of nodes until we find one that succeeds.
Define success? For local PVs it's only a matter of applying label filters and performing capacity checks, right?
I'm referring to a dynamic provisioning scenario, where the scheduler decides which node the pod should be on, and then triggers the provisioning on that node. But the scheduler should know beforehand some information about whether that node has enough provisionable capacity, so that it can pre-filter the nodes.
For remote PVs, if dynamic provisioning fails in the rack/zone/region, then the pod cannot be scheduled. The scheduler should not be changing its pod spreading policy based on storage availability.
I'm not sure I understand this. It's important to distinguish between predicates (hard constraints) and priority functions (soft constraints/preferences). The scheduler spreading policy (assuming you're not talking about explicit requiredDuringScheduling anti-affinity) is in the latter category. So if there is a node where the pod can fit (and where storage is available or can be provisioned), it should always schedule, even if it "violates" the pod spreading policy.
BTW I like the idea of the StorageClass giving status that indicates the number and shape of PVs that it can allocate, so that the scheduler can use this information, plus its knowledge of the available PVs that have already been created, when making the assignment decision. I agree we probably need an alternative for "legacy provisioners" that don't expose this StorageClass status.
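To make that concrete, a hypothetical sketch of the kind of information a provisioner could publish. StorageClass has no such status today; every field under status below is made up for illustration:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-fast-storage
provisioner: example.com/local-ssd    # hypothetical provisioner name
status:                               # hypothetical status block
  provisionable:
  - topology:
      kubernetes.io/hostname: node-a
    capacity: 400Gi
    maxVolumes: 4
  - topology:
      kubernetes.io/hostname: node-b
    capacity: 100Gi
    maxVolumes: 1
```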
So if there is a node where the pod can fit (and where storage is available or can be provisioned), it should always schedule, even if it "violates" the pod spreading policy.
It is possible that storage constraints might violate pod scheduling constraints. What if a StatefulSet wants to use a storage class that is accessible only from a single zone, but the pods in that StatefulSet are expected to be spread across zones? I feel this is an invalid configuration and scheduling should fail. If the scheduler were to (incorrectly) notice that storage is available in only one zone and then place all pods in the same zone, that would violate user expectations.
To be clear, local PVs can have a predicate. Local PVs are statically provisioned and from a scheduling standpoint are similar to "cpu" or "memory".
It is dynamic provisioning that will require an additional scheduling step, which runs after a sorted list of nodes is available for each pod in the scheduler.
@davidopp thoughts?
There are two kinds of spreading: constraint/hard requirement (called requiredDuringScheduling) and preference/soft requirement (called preferredDuringScheduling). I was just saying that it's OK to violate the second kind due to storage.
I think it's hard to pin down exactly what "user expectations" are for priority functions. We have a weighting scheme but there are so many factors that unless you manually adjust the weights yourself, you can't really have strong expectations.
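For reference, a minimal sketch of the two kinds of spreading expressed as pod anti-affinity (labels and names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-0
  labels:
    app: web
spec:
  affinity:
    podAntiAffinity:
      # Hard constraint: must hold for the pod to schedule at all.
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web
        topologyKey: kubernetes.io/hostname
      # Soft constraint: a preference the scheduler may trade off,
      # e.g. against storage availability.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web
          topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: web
    image: nginx
```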
I was just saying that it's OK to violate the second kind due to storage.
Got it. If storage availability can be exposed in a portable manner across deployments, then it can definitely be a "soft constraint" as you mentioned. The easiest path forward now is that of performing dynamic provisioning of storage once a list of nodes is available.
I think it's hard to pin down exactly what "user expectations" are for priority functions.
Got it. I was referring to "hard constraints" specifically.
@kubernetes/sig-storage-misc @kubernetes/sig-scheduling-misc @kubernetes/sig-apps-misc do you want this in for v1.7? Which is the correct SIG to own this?
We're targeting 1.8
The feature issue related to this is #43640
/assign
[MILESTONENOTIFIER] Milestone Labels Incomplete
Action required: This issue requires label changes. If the required changes are not made within 6 days, the issue will be moved out of the v1.8 milestone.
kind: Must specify at most one of ['kind/bug', 'kind/feature', 'kind/cleanup'].
priority: Must specify at most one of ['priority/critical-urgent', 'priority/important-soon', 'priority/important-longterm'].
/kind feature
/priority important-soon
[MILESTONENOTIFIER] Milestone Labels Complete
Issue label settings:
We're targeting 1.8
Did any of it make it into 1.8? I don't see PRs linked here, but I might have missed them.
@deitch No unfortunately not. Scheduler improvements for static PV binding are targeted for 1.9. Dynamic provisioning will come after. The general feature tracker is at kubernetes/features#490.
@msau42 looks like we are talking on 2 separate issues about the same... "issue"? :-)
So 1.9 for static, 1.10+ or 2.0+ for dynamic?
Yes
Did we cover the use case where two (or more) pods use the same PV? If the underlying infrastructure does not allow the PV to attach to more than one node, the second pod should be scheduled on the same node as the first pod.
Pods that use local PVs will be scheduled to the same node, but it's not going to work for zonal PVs. Pods that use zonal PVs will only be scheduled to the same zone.
@msau42 that would require the user to specify the node name of the pod, which is not desirable because Kubernetes already has the information to schedule the second pod to the correct node, and the user would be burdened with selecting nodes.
On a different topic (I hope it's not off topic for this thread): the node and PV zones with the OpenStack cloud provider. When the cloud provider is OpenStack, the node and PV zones seem to be copied from Nova and Cinder respectively; the node zones are network security zones, while PVs get a single zone from Cinder that serves multiple network zones, which is not suitable for Kubernetes scheduling. The OpenStack Nova and Cinder zones just do not seem to support Kubernetes scheduling. It would be more helpful if the Kubernetes admin could easily configure the node zones and the PV zones on OpenStack. PVs come and go, so it may help to add a PV zone override to the PV claim.
@wu105 are you referring to local or zonal PVs? The design goal of local PVs is that the pod does not need to specify any node name; it's all contained in the PV information.
The problem of node enforcement with zonal PVs is that you also need to take into account the access mode. A multi-writer zonal PV does not have a node attachment restriction like ReadWriteOnce PVs do. I think the best way to solve the node problem for zonal PVs is to do access mode enforcement, instead of trying to conflate it with PV node affinity.
I'm not sure I completely understand your issue with OpenStack zone labelling. At least I know that for GCE and AWS volumes, we have admission controllers that already label the PV with the correct zone information. I imagine you can do the same for OpenStack.
I just realized that as a workaround, you could use pod affinity to get two pods sharing the same zonal, single-attach PVC to be scheduled on the same node.
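A minimal sketch of that workaround, assuming the first pod carries an app: pv-consumer label and both pods reference the same PVC (all names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pv-consumer-2
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: pv-consumer       # label carried by the first pod
        topologyKey: kubernetes.io/hostname
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: shared-zonal-pvc   # illustrative name of the shared PVC
```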
@bsalamat anything we can do about the scenario where you specify podAffinity and it's the first pod (which is going to have no pods matching the selector yet)?
@msau42 if two pending pods had affinity to one another, they would never be scheduled. Affinity is a way of specifying dependency and two pods having affinity to one another represents a circular dependency which is an anti-pattern IMO.
@bsalamat Maybe we can add a rule requiredDuringSchedulingIgnoredDuringExecutionOrFirstPod, or introduce a weight, e.g., 101 for the preferredDuringSchedulingIgnoredDuringExecution rule that would make it behave like the requiredDuringSchedulingIgnoredDuringExecution when there is at least one pod matching.
Maybe we can add a rule requiredDuringSchedulingIgnoredDuringExecutionOrFirstPod,
so, the "OrFirstPod" part causes the pod to be scheduled even if the affinity rule cannot be satisfied? It could work, but I have to think about possible performance implications of this. Affinity/anti-affinity already causes performance issues in medium and large clusters and we are thinking about streamlining the design. We must be very careful about adding new features which could worsen the situation.
or introduce a weight, e.g., 101 for the preferredDuringSchedulingIgnoredDuringExecution rule that would make it behave like the requiredDuringSchedulingIgnoredDuringExecution when there is at least one pod matching.
This is a hack. I wouldn't consider this as an option.
@wu105 like I mentioned earlier, I think the proper solution will be access mode enforcement. The PV NodeAffinity feature does not help here as it is unrelated to volume attaching + access modes. We cannot assume that all PVs are only attachable to a single node at a time. There have actually been quite a few other issues discussing this and the challenges: #26567, #30085, #47333
Agree, I think some new access mode API is needed to handle this case. Let's use #26567 to continue the discussion since that issue has the most history regarding access modes.
Dynamic provisioning topology design proposal is here: kubernetes/community#1857
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Topology aware dynamic provisioning will be available in beta in 1.12. In-tree gce, aws and azure block disks are supported. Local and CSI volumes will be supported in a future release.
/close
@msau42: Closing this issue.
Closed #43504.