I feel like in both cases you're mixing different life-span contexts in a single YAML/repo, and it's expected that you'll run into problems.
In both cases there is an entity (IP, Disk) which is associated with a context that is broader than a cluster (e.g. a "subscription" in Azure, a "project" in GCP).
Because we implicitly create these entities in the cluster context when they don't exist, we're providing "ease of use" that causes the customer problem.
The right model is:
Repo-1 <- "above cluster entity definitions" (e.g. the YAML/JSON to create the IP or the disk)
Repo-2 <- "the cluster configs that use the IP or disk"
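To make the split concrete, the consuming side (Repo-2) might look like the sketch below; the address value and names are made up, and Repo-1 would hold the corresponding out-of-band definition (e.g. the JSON/CLI invocation that reserves the address in the subscription/project):

```yaml
# Repo-2: cluster config that *uses* a pre-allocated IP.
# 203.0.113.10 is assumed to have been reserved via Repo-1
# in the surrounding subscription/project; the cluster never creates it.
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  type: LoadBalancer
  loadBalancerIP: 203.0.113.10   # reference to the above-cluster entity
  selector:
    app: frontend
  ports:
    - port: 80
      targetPort: 8080
```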
On Fri, May 14, 2021 at 9:37 AM Brendan Burns <bbu...@microsoft.com> wrote:
> The right model is:
> Repo-1 <- "above cluster entity definitions" (e.g. the YAML/JSON to create the IP or the disk)
> Repo-2 <- "the cluster configs that use the IP or disk"

Even this breaks down in the face of something like PD. That has to be an admin-owned resource, or else end-users can PVC their way into any disk they want. With IPs, we have this problem: we don't have any way to denote that "namespace foo can use static IP x, but namespace bar cannot". Should "foo" stop using that IP, "bar" could jump on it.
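The admin-owned version of this for disks is the existing PV/PVC split; a minimal sketch (disk and object names here are illustrative):

```yaml
# Admin-owned PV pinning a pre-existing disk by name; end users can
# only claim it via a PVC, not reference arbitrary disks directly.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-pv
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain   # deleting the claim keeps the disk
  gcePersistentDisk:
    pdName: existing-disk   # created out-of-band in the GCP project
    fsType: ext4
---
# User-side claim in namespace "foo", bound explicitly to the admin's PV.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
  namespace: foo
spec:
  accessModes: ["ReadWriteOnce"]
  volumeName: data-pv
  resources:
    requests:
      storage: 100Gi
```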
You received this message because you are subscribed to the Google Groups "kubernetes-api-reviewers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-api-rev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-api-reviewers/CAO_Rewbk_ChfP6c5ZreQKaKQ0uixBUTOGvki2QfmOb2YkjqijQ%40mail.gmail.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-sig-architecture/SJ0PR21MB2013552AD82824C97880D6D7DB509%40SJ0PR21MB2013.namprd21.prod.outlook.com.
I think there are two problems: 1) how do we reference an externally allocated and managed resource from within the cluster; 2) how do we manage access control to that resource. I don't think we really have a solution to the first one yet.
On Fri, May 14, 2021 at 9:37 AM Brendan Burns <bbu...@microsoft.com> wrote:
> Repo-1 <- "above cluster entity definitions" (e.g. the YAML/JSON to create the IP or the disk)
> Repo-2 <- "the cluster configs that use the IP or disk"

> Even this breaks down in the face of something like PD. That has to be an admin-owned resource, or else end-users can PVC their way into any disk they want. With IPs, we have this problem. We don't have any way to denote that "namespace foo can use static IP x, but namespace bar cannot". Should "foo" stop using that IP, "bar" could jump on it.

This makes it easier to move between clusters (I just refer to the backing IP by name) vs. PV/PVC (which is safer but requires the admin to make a new PV ref'ing the backing disk by name).
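The "refer to the backing IP by name" pattern exists today for GCE Ingress via an annotation; note that nothing scopes which namespace may use the name, which is exactly the gap described above (object names here are illustrative):

```yaml
# Ingress referring to a reserved global static IP by its *name*;
# any namespace's Ingress could claim the same name.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: frontend
  namespace: foo
  annotations:
    kubernetes.io/ingress.global-static-ip-name: frontend-ip  # reserved out-of-band
spec:
  defaultBackend:
    service:
      name: frontend
      port:
        number: 80
```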
No new turtles. If "adoption" happens based upon a user-defined string (e.g., name) then my requirement is satisfied. It's only if adoption can be done only via an allocated ID that there is a problem. For PV we're probably OK. For IP we don't necessarily keep the reference, so perhaps that's the issue.
On Fri, May 14, 2021 at 11:14 AM John Belamaric <jbela...@google.com> wrote:
> No new turtles. If "adoption" happens based upon a user-defined string (e.g., name) then my requirement is satisfied.

I don't think it's good to put this requirement on the driver/cloud provider. I doubt every possible target integration is going to have a place to record the sort of metadata necessary to make that work, and it sounds like with that model we'd be making the cloud provider controller a critical part of the security model. I think it's much better for implementers to only have to figure out how to allocate/deallocate, and have k8s handle adoption or access concerns within the k8s API paradigm.
On Fri, May 14, 2021 at 11:19 AM Daniel Smith <dbs...@google.com> wrote:
> I think it's much better for implementers to only have to figure out how to allocate/deallocate, and have k8s handle adoption or access concerns within the k8s API paradigm.

When they allocate or deallocate, they return an identifier. In order to solve the problem that Tim brought up, you need that identifier to be pre-allocated and known by the external system and your manifest. Otherwise, you have to store the identifier in your manifest post-allocation. In a PV this is generally going to be fine; for example, in GCP you use the pdName to bind it. For the IP case, we don't provide a similar way to do the binding. So, to your earlier question of "what's wrong with the PV / PVC model", I probably should have just answered "nothing" :)
There is one other point though: while a name is sufficient, metadata is better. Tim's example was "we allocate the IP, then we store it in the manifest". That works for a single instance of the service in a single cluster, and the name makes it slightly better because you get repeatability in the allocation. But if you want to deploy the same set of manifests across multiple clusters in different regions, you're actually going to need different names for those IPs (assuming the name is not regionally scoped). If instead the allocation is done by metadata, the region is just another piece of metadata.

This is why we would prefer not to have to pre-provision the IPs with specific names - it means you have a separate out-of-band step prior to configuring your manifests. Instead we want a "create or adopt" model. In non-cloud-provider cases, the enterprise IPAM needs to do these allocations from specific CIDRs based upon the data center the cluster lives in - it needs information to key into its model of the network to determine those CIDRs.
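One hypothetical shape for such a "create or adopt" request - none of this API exists, and every field name is illustrative - where the allocator keys on metadata rather than a pre-provisioned name:

```yaml
apiVersion: example.dev/v1alpha1   # hypothetical API group
kind: IPClaim
metadata:
  name: frontend-ip
  namespace: foo
spec:
  # Metadata the IPAM keys on instead of a globally unique name; the
  # region/data-center values resolve per cluster, so the same manifest
  # deploys unchanged everywhere.
  selector:
    region: us-east1
    tier: external
  adoptExisting: true   # adopt a matching allocation, or create one if absent
status:
  address: ""           # filled in by the controller post-allocation
```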
This comes down to the strength of that metadata - is it unfakeable? Are all
the inputs immutable? Is the context "closed"? E.g. if I tie it to a specific
cluster ID (inasmuch as we have one), then I can't hand it off to another
cluster. But if it is not tied to a cluster ID, then any cluster can take it.
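Following that thought, an allocation record could stamp immutable context at creation time so that the adoption context is "closed" (again purely hypothetical; the kind and field names are made up):

```yaml
# Hypothetical allocation record: immutable binding fields recorded at
# creation mean another cluster or namespace cannot re-adopt the address.
apiVersion: example.dev/v1alpha1
kind: IPAllocation
metadata:
  name: frontend-ip
spec:
  address: 203.0.113.10
  # Immutable once set; any adopt request must match all of these.
  boundClusterID: cluster-a   # illustrative cluster identifier
  boundNamespace: foo
```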