Future API evolution


Zane Bitter

May 12, 2020, 9:07:48 AM
to metal...@googlegroups.com
I've mentioned this a couple of times, but it increasingly seems to me
that at the root of some of our tricky design questions is the fact that
we are combining everything you can manage about a bare metal host into
a single CRD. As we've learned more about the kinds of things that users
want to be able to manage, that is seeming less sustainable. I've been
thinking about this a bit more deeply, and have identified three separate
tasks that we're currently trying to accomplish with the BareMetalHost
CRD:

1. Connection (remember that this host exists and how to connect to the BMC)
2. Inventory Management (make this host available in a certain pool of resources)
3. Deployment (provisioning an image, rebooting it when required)

In Ironic terms, these roughly correspond to stuff that administrators
do during discovery/day 0 setup; stuff that administrators or the Nova
service do through the Ironic API; and stuff end users do through the
Nova API. We have a habit of writing user stories that begin with "As a
user..." but there are actually several different personas that we
should be thinking about.

I think some of the current design issues we are wrestling with (e.g.
whether or not we should allow selecting BIOS vs. UEFI explicitly) can
be solved by us just talking right now about which parts of the existing
BareMetalHost implementation are conceptually parts of which group (and
since there is a minor API revision bump planned, we could perhaps take
that opportunity to group them together a bit better).

But as much as I am not looking forward to cutting-and-pasting all that
code and reworking it, in the medium term I think we will need to split
the current CRD into multiple CRDs. We ought to be able to delegate RBAC
control of any one of those phases to a user/service without delegating
all of it. For example, the CAPM3 should be able to deploy an image on a
host but not make the system forget how to contact the BMC. Making a
major API change is only going to get more painful as time goes on, so
now is the time to start thinking about it.
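
Concretely, the kind of delegation I have in mind is something like the Role
sketched below; "baremetaldeployments" is just a placeholder name for whatever
resource ends up representing the deployment phase, not an agreed API.

package rbacsketch

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Strawman only: a Role that lets the CAPM3 actor drive deployments while
// keeping it read-only on hosts, so it can never edit or delete the record
// of how to reach the BMC.
var capm3Deployer = rbacv1.Role{
	ObjectMeta: metav1.ObjectMeta{Name: "capm3-deployer", Namespace: "metal3"},
	Rules: []rbacv1.PolicyRule{
		{
			APIGroups: []string{"metal3.io"},
			Resources: []string{"baremetaldeployments"}, // placeholder resource name
			Verbs:     []string{"get", "list", "watch", "create", "update", "delete"},
		},
		{
			APIGroups: []string{"metal3.io"},
			Resources: []string{"baremetalhosts"},
			Verbs:     []string{"get", "list", "watch"},
		},
	},
}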

To that end, here is a strawman proposal for splitting the current
BareMetalHost CRD into four parts. I'll be interested to hear what y'all
think of both the particulars and the general idea.

I imagine all the CRs being linked together by sharing the same name and
namespace (but different types), rather than containing links to each
other. It's likely that we would still need only a single controller to
manage all of these custom resource types.

BareMetalHost:
Spec:
- Description
- BMC Credentials
- Management URL
- Management driver
- Disable TLS certificate verification
- PXE MAC address
- Boot mode (BIOS vs. UEFI)
Status:
- State
- Operational status
- Powered on
- Current image
- Error message
- Error type
- Good credentials
- Tried credentials
- Operation metrics
Annotations:
- baremetalhost.metal3.io/paused

This resource would be used by administrators. You should basically
never delete one unless you threw the hardware away (or handed
management of it over to some other system).

HardwareDetails:
Spec:
- Hardware details

This CRD would just contain the hardware details that currently appear
in the host status. It would usually be writable only by the
baremetal-operator. Having this as a separate CRD allows the hardware
details for master nodes to be pivoted into the cluster, or otherwise
generated from an earlier inspection.

BareMetalAllocation(?):
Spec:
- Externally provisioned
- Offline
- Root device hints
- Network data(?)
- Metadata(?)
Status:
- Powered on
- Available (i.e. not powered off or provisioned, externally or
otherwise)

If this resource doesn't exist, the host would be left powered down. If
it exists, the host would be powered up ready for fast-track booting
unless the 'offline' flag is set. The BareMetalMachine would search for
these resources (rather than BareMetalHosts) - and the hardware
classification controller would apply its labels here. Probably the bmo
would also provide an 'available' label for use as a selector. This
finally provides an easy answer for how to take a host out of service
for maintenance, without forgetting that it exists or where its BMC
credentials are stored: simply delete this resource.

BareMetalDeployment
Spec:
- Image URL
- Image checksum
- Checksum algorithm
- User data
- RAID config
- Preboot options (nested virt, SMT, &c.)
Status:
- Complete
- Powered On
- Error message
Annotations:
- reboot.metal3.io

Rather than having a consumer reference, the deployment resource could
simply be owned by whatever resource is doing the deployment.
Predictable naming prevents conflicts between multiple consumers.
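
To make the shape of this slightly more concrete, here is a very rough sketch
of what the four types might look like as Go API structs. Every type and field
name is illustrative only (lifted from the lists above), and the hardware
inventory, root device hint and RAID schemas are just stand-ins.

package v1alpha1

import corev1 "k8s.io/api/core/v1"

// BareMetalHostSpec: written by an administrator at enrolment time and then
// essentially never deleted.
type BareMetalHostSpec struct {
	Description    string
	BMC            BMCDetails
	BootMACAddress string
	BootMode       string // "UEFI" or "BIOS"
}

type BMCDetails struct {
	Address                        string // management URL
	Driver                         string // management driver
	CredentialsName                string // BMC credentials, presumably still a Secret as today
	DisableCertificateVerification bool
}

// HardwareDetailsSpec: written only by the baremetal-operator (or by whatever
// pivots resources into the cluster) from inspection data.
type HardwareDetailsSpec struct {
	Hardware map[string]string // stand-in for the real inventory schema
}

// BareMetalAllocationSpec: creating one of these puts the host into service.
type BareMetalAllocationSpec struct {
	ExternallyProvisioned bool
	Offline               bool
	RootDeviceHints       map[string]string // stand-in for the real hints schema
	NetworkData           *corev1.SecretReference
	MetaData              *corev1.SecretReference
}

// BareMetalDeploymentSpec: owned by whatever actor is doing the deployment.
type BareMetalDeploymentSpec struct {
	ImageURL          string
	ImageChecksum     string
	ChecksumAlgorithm string
	UserData          *corev1.SecretReference
	RAID              map[string]string // stand-in for a real RAID schema
	PrebootOptions    map[string]string // nested virt, SMT, etc.
}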

cheers,
Zane.

Doug Hellmann

May 12, 2020, 4:35:02 PM
to Zane Bitter, Metal3 Development List
How does splitting the host into separate CRDs help decide whether we want the user to have to specify the boot mode?
 

But as much as I am not looking forward to cutting-and-pasting all that
code and reworking it, in the medium term I think we will need to split
the current CRD into multiple CRDs. We ought to be able to delegate RBAC
control of any one of those phases to a user/service without delegating
all of it. For example, the CAPM3 should be able to deploy an image on a
host but not  make the system forget how to contact the BMC. Making a
major API change is only going to get more painful as time goes on, so
now is the time to start thinking about it.

Splitting the BMC out into its own CRD does make some sense for RBAC.
We wouldn't expect the user to ever write to this API, though, right? Which makes putting the data in the spec seem wrong. For the requirement for copying between clusters, we need to be able to rebuild the data *without* having to copy it. Are there other examples of CRDs with what amounts to internal state stored in the spec fields?
 

BareMetalAllocation(?):
  Spec:
   - Externally provisioned
   - Offline
   - Root device hints
   - Network data(?)
   - Metadata(?)
  Status:
   - Powered on
   - Available (i.e. not powered off or provisioned, externally or
otherwise)

If this resource doesn't exist, the host would be left powered down. If
it exists, the host would be powered up ready for fast-track booting
unless the 'offline' flag is set. The BareMetalMachine would search for
these resources (rather than BareMetalHosts) - and the hardware
classification controller would apply its labels here. Probably the bmo

It seems odd to have the classifier look at a HardwareDetails and then apply the labels on an Allocation. I get the reason, if that's what the machine API is going to search for, but it still feels off.
 
would also provide an 'available' label for use as a selector. This
finally provides an easy answer for how to take a host out of service
for maintenance, without forgetting that it exists or where its BMC
credentials are stored: simply delete this resource.

BareMetalDeployment
  Spec:
   - Image URL
   - Image checksum
   - Checksum algorithm
   - User data
   - RAID config
   - Preboot options (nested virt, SMT, &c.)
  Status:
   - Complete
   - Powered On
   - Error message
  Annotations:
   - reboot.metal3.io


Rather than having a consumer reference, the deployment resource could
simply be owned by whatever resource is doing the deployment.
Predictable naming prevents conflicts between multiple consumers.

I'm not really clear on the split between an Allocation and a Deployment. The Deployment has the RAIDConfig but the Allocation has the root device hints. Why aren't those in the same place? Same for network data vs. user data, why are those separated?

If we eliminate the online flag entirely (or make it part of a separate CRD for expressing the maintenance state of a host), could an Allocation and a Deployment be merged into 1 CRD? Aside from the flag they both seem to have partial instructions for provisioning a host.

If we have separate resources for deploying an image to a host and for managing its power, and both are changed, which controller wins? What is more important, finishing the provisioning work or powering the host off? We have that same issue today, but all of the logic lives in 1 controller so at least we can express the priority in 1 code base.

What if we have a BMC type; a Host type for the power management provisioning instructions, boot instructions, and hardware details; and another type to manage things like powering off or rebooting (name TBD). That gives some separation of control via RBAC. Does that buy us enough to go through the trouble of making it possible to migrate the types?

 

cheers,
Zane.


Doug Hellmann

May 12, 2020, 4:36:32 PM
to Zane Bitter, Metal3 Development List
Oops, typo. I meant to remove the "power management" from Host and say:

What if we have a BMC type; a Host type for the provisioning instructions, boot instructions, and hardware details; and another type to manage things like powering off or rebooting (name TBD). That gives some separation of control via RBAC. Does that buy us enough to go through the trouble of making it possible to migrate the types?

Mael Kimmerlin

May 13, 2020, 1:21:04 AM
to Zane Bitter, dhel...@redhat.com, Metal3 Development List
Hello,

Thank you for the proposal, it definitely addresses some of our pain points. We could discuss it during the community meeting for some live exchange on the topic. Maybe we could already put this into a Google doc as a WIP proposal?

About the proposal itself, it is good. I just want to emphasize one of Doug's points. The metadata, network data and root device hints are tied to a deployment rather than an allocation from the CAPM3 perspective. In addition, separating allocation and deployment into different objects would be great when it comes to upgrades. It would allow us to look in the direction of re-using the same node across an upgrade, in a simpler manner than what is considered now. That would be particularly useful for hosted storage (Ceph running on the node, for example) to avoid heavy traffic due to upgrades.

Best regards,
Maël


Zane Bitter

May 14, 2020, 9:48:07 AM
to Metal3 Development List
What might help is if we stop thinking about everyone and everything
that can interact with the CRD as a single entity, 'the user'.

Here's my theory: in OpenStack there are several places you can
configure the BIOS vs. UEFI mode - in the hardware itself, in the Ironic
config, in the Ironic API, or in the Nova API. I think we are unanimous
that we don't want it to be configurable by the actor doing the
deployment (i.e. the equivalent of the Nova API). But the feedback we've
had is that it is necessary to be able to configure it at a lower level,
because some hardware just requires that. So maybe we do want it to be
settable by the person enrolling the hardware, but not by whatever is
doing the deployment.

Perhaps that's not the answer, but if it is then we can actually address
the problem in the short term because we don't actually have to split
apart the CRD yet, just change how we think about it.
>    - baremetalhost.metal3.io/paused
>
> This resource would be used by administrators. You should basically
> never delete one unless you threw the hardware away (or handed
> management of it over to some other system).
>
> HardwareDetails:
>   Spec:
>    - Hardware details
>
> This CRD would just contain the hardware details that currently appear
> in the host status. It would usually be writable only by the
> baremetal-operator. Having this as a separate CRD allows the hardware
> details for master nodes to be pivoted into the cluster, or otherwise
> generated from an earlier inspection.
>
>
> We wouldn't expect the user to ever write to this API, though, right?

Correct. It'd be written by the operator and the thing that pivots
resources into the cluster.

> Which makes putting the data in the spec seem wrong. For the requirement
> for copying between clusters, we need to be able to rebuild the data
> *without* having to copy it. Are there other examples of CRDs with what
> amounts to internal state stored in the spec fields?

I was going by the discussion in
https://github.com/metal3-io/baremetal-operator/issues/431

I don't love having to separate it out, but it's not really just
internal state because if the pod gets rescheduled (thus wiping the
ironic-inspector DB) then we have no way to reconstruct it.
>    - reboot.metal3.io
>
>
>
> Rather than having a consumer reference, the deployment resource could
> simply be owned by whatever resource is doing the deployment.
> Predictable naming prevents conflicts between multiple consumers.
>
>
> I'm not really clear on the split between an Allocation and a
> Deployment. The Deployment has the RAIDConfig but the Allocation has the
> root device hints. Why aren't those in the same place? Same for network
> data vs. user data, why are those separated?

I tried to separate them the same way they would be in a cloud. If you
spin up a cloud instance then you get to control the user_data, but the
cloud provides the meta_data (and, if it's OpenStack, the network_data).
I don't think we necessarily want to force the BMO to be in charge of
producing the metadata, but we should want to enable people to build
systems with a component whose role it is to provide trusted information
about the environment in which the host is running and a separate 'user'
component that can only provide the image and the untrusted user-data.

It looks like I got the RAID Config in the wrong place here; according
to the Ironic docs the RAID configuration is done by the operator and
all the user can do is request a flavour that filters on a particular
RAID level. Come to think of it, that could also save us from having to
do a manual cleaning step before every RAID deployment.

> If we eliminate the online flag entirely (or make it part of a separate
> CRD for expressing the maintenance state of a host), could an Allocation
> and a Deployment be merged into 1 CRD? Aside from the flag they both
> seem to have partial instructions for provisioning a host.
>
> If we have separate resources for deploying an image to a host and for
> managing its power, and both are changed, which controller wins? What is
> more important, finishing the provisioning work or powering the host
> off? We have that same issue today, but all of the logic lives in 1
> controller so at least we can express the priority in 1 code base.

I'd expect that even with multiple CRDs we'd still have only one controller.

> What if we have a BMC type; a Host type for the provisioning instructions, boot instructions, and hardware details;

I think if the hardware details are not going to be separate then they
should be part of the BMC object. You probably don't want to delete them
every time you reprovision.

> and another type to manage things like powering off or rebooting (name TBD).

I do like the idea of a separate reboot request CRD (which will
necessarily need to allow holding the power off long-term as well)
instead of an annotation, if we can agree on what it should look like :)
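
As a conversation starter, I'm picturing something roughly like the sketch
below. Every name here is hypothetical.

package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// HostPowerRequestSpec is a strawman replacement for the reboot annotation.
type HostPowerRequestSpec struct {
	// Mode is "reboot" (one-shot) or "off" (held until this CR is deleted).
	Mode string
	// Hard asks for a power cycle via the BMC rather than a soft shutdown,
	// which is what fencing needs.
	Hard bool
}

type HostPowerRequestStatus struct {
	// Completed is set once a one-shot reboot has actually been performed.
	Completed   bool
	LastUpdated *metav1.Time
}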

I wonder if part of the reason the current interface is awkward is that
we actually want different controls for the 'ready' and 'provisioned'
states, to be used by different classes of actors for different reasons.

> That gives some separation of control via RBAC. Does that buy us enough to go through the trouble of making it possible to migrate the types?

TBH I don't think that's enough separation; it's reasonably foreseeable
that we might have to change it again.

cheers,
Zane.

Doug Hellmann

May 14, 2020, 11:59:55 AM
to Zane Bitter, Metal3 Development List
That's a good point. My assumption has always been that *most* of the time human users will configure Machine or MachineSet resources, and will not be touching the host directly. Instead the cluster API provider will update the host resource. So, we already have a separate resource for those settings that can have different RBAC.
Yes, relying on an (effectively) out-of-band process for inspection makes it hard to recover that data.
If we assume the Machine is the user-editable set of provisioning instructions, we can leave things we don't want users to set, like metadata, out of that layer of the API. But, leaving all of the fields in the host means the BMO only has to look at one resource in order to decide when it has all of the information it needs to manage the operations on the host.
 

It looks like I got the RAID Config in the wrong place here; according
to the Ironic docs the RAID configuration is done by the operator and
all the user can do is request a flavour that filters on a particular
RAID level. Come to think of it, that could also save us from having to
do a manual cleaning step before every RAID deployment.

I'm not sure what this means. Are you saying Ironic doesn't change the RAID settings to match what is required? I thought it did that as part of cleaning.
 

> If we eliminate the online flag entirely (or make it part of a separate
> CRD for expressing the maintenance state of a host), could an Allocation
> and a Deployment be merged into 1 CRD? Aside from the flag they both
> seem to have partial instructions for provisioning a host.
>
> If we have separate resources for deploying an image to a host and for
> managing its power, and both are changed, which controller wins? What is
> more important, finishing the provisioning work or powering the host
> off? We have that same issue today, but all of the logic lives in 1
> controller so at least we can express the priority in 1 code base.

I'd expect that even with multiple CRDs we'd still have only one controller.

> What if we have a BMC type; a Host type for the provisioning instructions, boot instructions, and hardware details;

I think if the hardware details are not going to be separate then they
should be part of the BMC object. You probably don't want to delete them
every time you reprovision.

That's reasonable.
 

> and another type to manage things like powering off or rebooting (name TBD).

I do like the idea of a separate reboot request CRD (which will
necessarily need to allow holding the power off long-term as well)
instead of an annotation, if we can agree on what it should look like :)

Yeah, the annotation approach was only ever meant to be a short-term solution.

I wonder if part of the reason the current interface is awkward is that
we actually want different controls for the 'ready' and 'provisioned'
states, to be used by different classes of actors for different reasons.

That's a useful way to think about it. Does using the Machine layer as the API for ready hosts help?

 

> That gives some separation of control via RBAC. Does that buy us enough to go through the trouble of making it possible to migrate the types?

TBH I don't think that's enough separation; it's reasonably foreseeable
that we might have to change it again.

cheers,
Zane.

--
You received this message because you are subscribed to the Google Groups "Metal3 Development List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to metal3-dev+...@googlegroups.com.

Dmitry Tantsur

May 15, 2020, 4:41:22 AM
to Zane Bitter, Metal3 Development List
I'm curious why. In the OpenStack world we've put quite some effort into enabling controlled access to these features by users. Granted, we don't enable them to apply any setting to any node, though.
In Ironic itself these things are provided the same way. At least network_data is something users may want to provide to configure their networking. A counter-argument may be that user_data covers all the features that meta_data and network_data provide, if used with something like Ignition or cloud-init.
 
I don't think we necessarily want to force the BMO to be in charge of
producing the metadata, but we should want to enable people to build
systems with a component whose role it is to provide trusted information
about the environment in which the host is running and a separate 'user'
component that can only provide the image and the untrusted user-data.

It looks like I got the RAID Config in the wrong place here; according
to the Ironic docs the RAID configuration is done by the operator and
all the user can do is request a flavour that filters on a particular
RAID level. Come to think of it, that could also save us from having to
do a manual cleaning step before every RAID deployment.

Just a word of caution: as I mentioned above, we're working on somewhat relaxing this restriction. The idea is that a node will provide possible RAID levels via so-called "deployment templates", and a user will be able to pick a template that is compatible with a node. This way an operator may enable deploy-time RAID.

Dmitry
 

Yu Zou

May 17, 2020, 10:50:49 PM
to Metal3 Development List


On Tuesday, May 12, 2020 at 9:07:48 PM UTC+8, Zane Bitter wrote:
What happens if both `Root device hints` and `Root volume` (in `RAID config`) are set, but not the same device?
Where should `BIOS config` be placed? Is it the same as `RAID config`?
Should `Boot mode` be set in `BIOS config`?

Zane Bitter

May 22, 2020, 5:46:39 PM
to Metal3 Development List
On 14/05/20 11:59 am, Doug Hellmann wrote:
> will not be touching the host directly. Instead the cluster API provider
> will update the host resource. So, we already have a separate resource
> for those settings that can have different RBAC.

That's true, but the principle of least privilege also suggests that the
Machine itself should only have the minimal access required to do its job.

Also:
https://github.com/metal3-io/metal3-docs/blob/master/design/api-design-principles.md#dont-assume-machine-api
Is looking at multiple resources a big concern?

>
> It looks like I got the RAID Config in the wrong place here; according
> to the Ironic docs the RAID configuration is done by the operator and
> all the user can do is request a flavour that filters on a particular
> RAID level. Come to think of it, that could also save us from having to
> do a manual cleaning step before every RAID deployment.
>
>
> I'm not sure what this means. Are you saying Ironic doesn't change the
> RAID settings to match what is required? I thought it did that as part
> of cleaning.

The current PR for adding RAID support adds an additional state
(Preparing, between Ready and Provisioning) that runs a 'manual
cleaning' in Ironic to set up RAID every time we provision. Presumably
we would be able to run this manual cleaning only once and then
provision multiple images with the same RAID configuration without doing
a manual cleaning every time. Although, on reflection the real
difference here would simply be that we remembered what the last RAID
config we set up on the host was... which would be a natural thing to do
with this API proposal, but in fact is something we could do
independently any time if we thought it was valuable.
Thinking about this a bit more, the reasons I can think of to want to
turn off a server are:

1) It's not provisioned and isn't assigned to any pool of resources
(i.e. we don't expect to have to provision it in a hurry). We want it
powered off to save electricity.
2) It's not provisioned and somebody needs to physically do something to
it and they would like not to fry any components or be electrocuted in
the process.
3) The same but it is provisioned.
4) An actor responsible for the stuff on the provisioned host wants it
shut down.

On reflection, there's no real distinction between 2 & 3. The real
distinction is between 3 & 4 - the reboot annotation is the de facto API
for the latter, which we should formalise when we make a real API for
it. There's also a distinction between 1 & 2 - the former is likely to
be managed in many cases by some sort of automated inventory management
system (in the simplest case, by one that just makes all hosts
available), whereas 2 & 3 are very much manual overrides. 1-3 are
currently implemented by the 'Online' flag. The awkwardness is partly
from mixing the concerns of automated inventory management and manual
maintenance, and partly that despite the fact that we don't have any
systems implementing 1 yet, it's the default (i.e. if you do nothing the
host will be off).

So my thought with the proposal was to effectively change the
default based on the stage we are at:

* Only the BMC is specified - host will be powered off, implementing (1)
* Host is allocated to some resource pool - it will be powered up and
ready to provision by default. An 'offline' flag allows (2) & (3)
* Host is provisioned - the reboot annotation (or a new API) implements (4)
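
In pseudocode-ish Go, that default logic would be roughly the sketch below.
BareMetalAllocation and BareMetalDeployment are the hypothetical split types
from my earlier strawman, and powerOffRequested stands in for the reboot
annotation or whatever API replaces it.

func desiredPowerOn(alloc *BareMetalAllocation, deploy *BareMetalDeployment, powerOffRequested bool) bool {
	if alloc == nil {
		// (1) Only the BMC is known: keep the host powered off to save electricity.
		return false
	}
	if alloc.Spec.Offline {
		// (2)/(3) The manual maintenance override, whether provisioned or not.
		return false
	}
	if deploy == nil {
		// Allocated but not provisioned: powered up, ready for fast-track booting.
		return true
	}
	// (4) Provisioned: power is otherwise controlled by whoever owns the
	// deployment, via the reboot annotation or its replacement API.
	return !powerOffRequested
}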

>
> > That gives some separation of control via RBAC. Does that buy us
> enough to go through the trouble of making it possible to migrate
> the types?
>
> TBH I don't think that's enough separation; it's reasonably foreseeable
> that we might have to change it again.
>
> cheers,
> Zane.
>

Doug Hellmann

May 26, 2020, 1:36:28 PM
to Zane Bitter, Metal3 Development List
I guess that's fair.

As we expand the API to support setting new details like RAID and BIOS settings, we have to keep answering questions about when those settings can be changed and what it means to change them at other times. Having them all on one resource means that updates can at least be atomic, so that once provisioning starts changes might be ignored. If some of the settings are on different resources, then we have to deal with cases where one resource is updated but another isn't and some operation starts but then the instructions change. There are ways to decompose the API so that isn't a particular problem. We could put all of the "provisioning" instructions in one resource, for example, but how much are we really going to split things up then?

The RBAC considerations you raise are really good points, though, and some decomposition makes a lot of sense. It's not obvious to me yet what the results need to look like.

 

>
>     It looks like I got the RAID Config in the wrong place here; according
>     to the Ironic docs the RAID configuration is done by the operator and
>     all the user can do is request a flavour that filters on a particular
>     RAID level. Come to think of it, that could also save us from having to
>     do a manual cleaning step before every RAID deployment.
>
>
> I'm not sure what this means. Are you saying Ironic doesn't change the
> RAID settings to match what is required? I thought it did that as part
> of cleaning.

The current PR for adding RAID support adds an additional state
(Preparing, between Ready and Provisioning) that runs a 'manual
cleaning' in Ironic to set up RAID every time we provision. Presumably
we would be able to run this manual cleaning only once and then
provision multiple images with the same RAID configuration without doing
a manual cleaning every time. Although, on reflection the real
difference here would simply be that we remembered what the last RAID
config we set up on the host was... which would be a natural thing to do
with this API proposal, but in fact is something we could do
independently any time if we thought it was valuable.

Is it safe to assume that nothing else has changed the RAID configuration out from under us since the last time we applied it? Is it important to wipe the RAID configuration only once? Wouldn't we want to do it when deprovisioning, too?
What does "allocated to some resource pool" mean? That a BareMetalHostAllocation resource exists (with the spec either empty or with details about the image)?

How would I represent an externally provisioned host for which I want power control to support fencing?

How would I represent an externally provisioned host for which I do not yet have BMC credentials but where I may have hardware inventory details?

 

>
>      > That gives some separation of control via RBAC. Does that buy us
>     enough to go through the trouble of making it possible to migrate
>     the types?
>
>     TBH I don't think that's enough separation; it's reasonably foreseeable
>     that we might have to change it again.
>
>     cheers,
>     Zane.
>


Yu Zou

May 26, 2020, 10:34:01 PM
to Metal3 Development List


On Wednesday, May 27, 2020 at 1:36:28 AM UTC+8, Doug Hellmann wrote:
I don't think we need to wipe the RAID configuration when deprovisioning.
First, some users want to keep the old RAID configuration.
Second, not every deleted host will be used again in a short time, so it is pointless to immediately wipe the RAID configuration for them.


Zane Bitter

May 27, 2020, 8:53:19 AM
to Metal3 Development List
On 26/05/20 1:36 pm, Doug Hellmann wrote:
> Is looking at multiple resources a big concern?
>
>
> As we expand the API to support setting new details like RAID and BIOS
> settings, we have to keep answering questions about when those settings
> can be changed and what it means to change them at other times.

Yeah, I think we've generally settled on an approach where we say they
are fixed at the time we start provisioning. (We should probably try to
do better at making sure we save them in the status so users aren't
misled as to what is current.)

> Having
> them all on one resource means that updates can at least be atomic, so
> that once provisioning starts changes might be ignored. If some of the
> settings are on different resources, then we have to deal with cases
> where one resource is updated but another isn't and some operation
> starts but then the instructions change.

I think I see what you're saying. If a user changes settings in one
resource (either the Host or Allocation) and immediately creates a
BareMetalDeployment, then is there a risk that the controller will read
an outdated cache of the settings that it then starts provisioning with?
Could a sequence of causally-ordered changes to different resources
nevertheless result in them being applied out of order? Given that the
premise of Kubernetes APIs is that resources are always converging to
the current spec, which theoretically makes ordering irrelevant, there's
a good chance that the answer is yes.

The way to solve this is to write the data as status to a single
resource. So assuming that when a BareMetalDeployment is created we
first copy the desired image details to the BareMetalHost's status, and
then deploy based on that, we can ensure that we have the latest
settings from the BareMetalHost itself (a write to the status would fail
if the settings had been updated).
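
With controller-runtime that would look roughly like the sketch below (the
types are the hypothetical split ones, and the status field name is made up).

package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// rollUpImage copies the desired image from the (hypothetical) BareMetalDeployment
// into the BareMetalHost's status, which is what provisioning would then work from.
// Update() sends the resourceVersion we read, so if the host changed in the
// meantime the write fails with a conflict instead of deploying stale settings.
func rollUpImage(ctx context.Context, c client.Client, deploy *BareMetalDeployment) error {
	host := &BareMetalHost{}
	key := client.ObjectKey{Namespace: deploy.Namespace, Name: deploy.Name} // linked by name
	if err := c.Get(ctx, key, host); err != nil {
		return err
	}
	host.Status.TargetImage = deploy.Spec.ImageURL // status field name made up for this sketch
	return c.Status().Update(ctx, host)            // conflict => reconcile again with fresh data
}

In practice that would live inside the reconcile loop (or be wrapped in
retry.RetryOnConflict), so a conflict just means we go around again with
fresh data.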

Any settings that end up part of the BareMetalAllocation are trickier.
We'll still want to copy them to the BareMetalHost, but this does not
guarantee ordering. A client could ensure ordering by waiting for the
new settings to be copied before creating the BareMetalDeployment,
though this does require some smarts on the part of the client. However,
if we "roll up" the image details from
BareMetalDeployment->BareMetalAllocation->BareMetalHost then we can
guarantee that the ordering is correct.

> There are ways to decompose the
> API so that isn't a particular problem. We could put all of the
> "provisioning" instructions in one resource, for example, but how much
> are really going to split things up then?
>
> The RBAC considerations you raise are really good points, though, and
> decomposing some makes a lot of sense. It's not obvious to me, yet, what
> the results need to look like, though.

Right there with ya ;)
AIUI we already clean automatically on deprovisioning, but if RAID is
specified we have to clean a second time when we reprovision because we
don't know if the RAID config has changed. At least on the surface this
seems like overkill. (It's possible that the specifics of Ironic's
implementation might actually make it unavoidable - my understanding of
it is pretty shallow.)

> So my thought with the proposal was to change effectively change the
> default based on the stage we are at:
>
> * Only the BMC is specified - host will be powered off, implementing (1)
> * Host is allocated to some resource pool - it will be powered up and
> ready to provision by default. An 'offline' flag allows (2) & (3)
> * Host is provisioned - the reboot annotation (or a new API)
> implements (4)
>
>
> What does "allocated to some resource pool" mean? That a
> BareMetalHostAllocation resource exists

Yes.

> How would I represent an externally provisioned host for which I want
> power control to support fencing?

Good question. I proposed that the ExternallyProvisioned flag would
exist in the BareMetalHostAllocation, but that the reboot annotation
would be on the BareMetalDeployment. So if there is no deployment there
is nowhere to put the annotation. That would be a good reason to
implement the reboot request CRD that we talked about upthread, so you
could still fence an externally provisioned host.

In typing this I realised something important about security. It's a
mistake to link all of these resources in the same namespace, because
then ability to reboot one host is ability to reboot all of the hosts in
that namespace (and transferring a host between namespaces requires
deleting the CR with the BMC credentials, which was the thing we are
trying to avoid). Instead the Allocation should define the namespace in
which to look for BareMetalDeployment and reboot request CRs for that
host (it could default to the same namespace). So the meaning of the
Allocation would be allocating control over what is running on the
server to a particular namespace.
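
Concretely that might be nothing more than an extra field on the Allocation
(name hypothetical):

type BareMetalAllocationSpec struct {
	// ConsumerNamespace is where the controller looks for the BareMetalDeployment
	// and reboot/power request CRs for this host; empty means the Allocation's
	// own namespace. This is what scopes who may control what runs on the host.
	ConsumerNamespace string
	// ... other fields as in the strawman above ...
}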

> How would I represent an externally provisioned host for which I do not
> yet have BMC credentials but where I may have hardware inventory details?

You could go ahead and create the BareMetalHostAllocation (to set the
ExternallyProvisioned flag) and HardwareDetails CRs. Since they're tied
together by name, in theory you wouldn't need to create the
BareMetalHost yet at all, unless we programmed the controller to delete
all Allocations/HardwareDetails resources that weren't associated with a
Host. Or you could just create the Host with blank credentials (I'm not
sure we handle this well today, but it's the ~same problem in either case).

- ZB

Andrew Beekhof

May 27, 2020, 11:06:14 PM
to Zane Bitter, Metal3 Development List
On Wed, 27 May 2020 at 22:53, Zane Bitter <zbi...@redhat.com> wrote:
On 26/05/20 1:36 pm, Doug Hellmann wrote:

> What does "allocated to some resource pool" mean? That a
> BareMetalHostAllocation resource exists

Yes.

Would it be possible to support a flow that (optionally) involved some sort of manual approval before that allocation is allowed?

Zane Bitter

May 28, 2020, 10:26:52 AM
to Metal3 Development List, Andrew Beekhof
I think it's up to whatever component does the allocation, so yes.

In a standalone cluster it would probably just be a thing that allocates
all of the baremetal hosts automatically. But if you wanted to manually
approve discovered hosts then you could manually create the allocations
(or manually approve them somehow in the component that does create
them). By separating this into a separate CRD, you can give separate
RBAC roles to the actors that do the discovery and allocation, respectively.

And in some sort of multicluster environment, the component that manages
inventory would create the assignments to the various clusters.

Doug Hellmann

May 28, 2020, 3:12:45 PM
to Zane Bitter, Metal3 Development List
On Wed, May 27, 2020 at 8:53 AM Zane Bitter <zbi...@redhat.com> wrote:
On 26/05/20 1:36 pm, Doug Hellmann wrote:
>     Is looking at multiple resources a big concern?
>
>
> As we expand the API to support setting new details like RAID and BIOS
> settings, we have to keep answering questions about when those settings
> can be changed and what it means to change them at other times.

Yeah, I think we've generally settled on an approach where we say they
are fixed at the time we start provisioning. (We should probably try to
do better at making sure we save them in the status so users aren't
misled as to what is current.)

Yes, good point.
 

> Having
> them all on one resource means that updates can at least be atomic, so
> that once provisioning starts changes might be ignored. If some of the
> settings are on different resources, then we have to deal with cases
> where one resource is updated but another isn't and some operation
> starts but then the instructions change.

I think I see what you're saying. If a user changes settings in one
resource (either the Host or Allocation) and immediately creates a
BareMetalDeployment, then is there a risk that the controller will read
an outdated cache of the settings that it then starts provisioning with?
Could a sequence of causally-ordered changes to different resources
nevertheless result in them being applied out of order? Given that the
premise of Kubernetes APIs is that resources are always converging to
the current spec, which theoretically makes ordering irrelevant, there's
a good chance that the answer is yes.

Yes, it could. For most resource types that doesn't matter, because reconciliation would keep nudging the state toward what is desired, even as the desired state changes. If we have a bunch of instructions that we're only going to read once when we start provisioning a host, though, we don't fit into that pattern and we have to add more protection.
 

The way to solve this is to write the data as status to a single
resource. So assuming that when a BareMetalDeployment is created we
first copy the desired image details to the BareMetalHost's status, and
then deploy based on that, we can ensure that we have the latest
settings from the BareMetalHost itself (a write to the status would fail
if the settings had been updated). 

Any settings that end up part of the BareMetalAllocation are trickier.
We'll still want to copy them to the BareMetalHost, but this does not
guarantee ordering. A client could ensure ordering by waiting for the
new settings to be copied before creating the BareMetalDeployment,
though this does require some smarts on the part of the client. However,
if we "roll up" the image details from
BareMetalDeployment->BareMetalAllocation->BareMetalHost then we can
guarantee that the ordering is correct.

At one point the general cluster API pattern was going to have a "core" resource with references to provider-specific resources hanging off of it. I've not kept up with how they actually implemented that, but if we do something like that we would have a BareMetalDeployment created independently of the host, and provisioning would only be triggered by adding the reference to the deployment into the allocation's spec field. If we make the BMD immutable, then we don't ever have to deal with it changing and our ordering issue goes away.

Suppose we have a BareMetalHost with BMC and MAC details, a BareMetalHardwareDetails with the discovered inventory information, a BareMetalAllocation which manages provisioning permissions and reports status, and a BareMetalDeployment which has the instructions for doing the provisioning.

Those CRDs are linked BMH -> BMHD, BMH -> BMA, and BMA->BMD (where the CRD on the left of the arrow has a field in which the name & namespace of the CR on the right is placed).
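
In Go terms the linkage might look something like this (a sketch only, reusing
the hypothetical type names from earlier in the thread):

package v1alpha1

import corev1 "k8s.io/api/core/v1"

type BareMetalHostSpec struct {
	// ... BMC and MAC details as before ...
	HardwareDetailsRef *corev1.ObjectReference // BMH -> BMHD, added by the inspection controller
	AllocationRef      *corev1.ObjectReference // BMH -> BMA, added when the host is approved for use
}

type BareMetalAllocationSpec struct {
	// BMA -> BMD: setting this reference is what triggers provisioning, and if
	// the BareMetalDeployment is immutable the instructions can't change
	// underneath a running operation.
	DeploymentRef *corev1.ObjectReference
}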

A hardware admin would need to be able to create a BMH. When the host is created, if credentials are available inspection would start.

Someone allowed to approve a host for use in a cluster (maybe the same hardware admin, maybe another role) would need to be able to create a BMA and modify (but not necessarily create) the BMH to have a reference to it.

Someone allowed to provision would need to be able to create a BMD and modify (but not create) a BMA to add the reference to the BMD. When the reference is updated, provisioning starts and progress is reported on the BMA.

A software controller would create or update a HardwareDetails resource based on introspection. That controller would need to be able to update the BMH to add a reference to the HardwareDetails resource. 

We're using the spec of the HardwareDetails to report discovered info, so that's a bit odd. I think we would want its controller to rewrite it if a user tries to modify any settings from what we have in the ironic-inspector database.

When we add discovery, the controller that handles that would also need permission to create a BMH so it could register the host and link the discovered HardwareDetails to it.

Now that I've typed all of that out, calling that a BareMetalDeployment is going to be a little confusing with the k8s Deployment type. Maybe we want to think harder about another name for that type.

Worrying about whether we should clean on deprovisioning seems like premature optimization, but maybe that's a flag in the BareMetalDeployment.
 

>     So my thought with the proposal was to effectively change the
>     default based on the stage we are at:
>
>     * Only the BMC is specified - host will be powered off, implementing (1)
>     * Host is allocated to some resource pool - it will be powered up and
>     ready to provision by default. An 'offline' flag allows (2) & (3)
>     * Host is provisioned - the reboot annotation (or a new API)
>     implements (4)
>
>
> What does "allocated to some resource pool" mean? That a
> BareMetalHostAllocation resource exists

Yes.

> How would I represent an externally provisioned host for which I want
> power control to support fencing?

Good question. I proposed that the ExternallyProvisioned flag would
exist in the BareMetalHostAllocation, but that the reboot annotation
would be on the BareMetalDeployment. So if there is no deployment there
is nowhere to put the annotation. That would be a good reason to
implement the reboot request CRD that we talked about upthread, so you
could still fence an externally provisioned host.

Yes, another CRD for power state management seems reasonable.

 

In typing this I realised something important about security. It's a
mistake to link all of these resources in the same namespace, because
then ability to reboot one host is ability to reboot all of the hosts in
that namespace (and transferring a host between namespaces requires
deleting the CR with the BMC credentials, which was the thing we are
trying to avoid). Instead the Allocation should define the namespace in
which to look for BareMetalDeployment and reboot request CRs for that
host (it could default to the same namespace). So the meaning of the
Allocation would be allocating control over what is running on the
server to a particular namespace.

I think here you're assuming that linked resources would have the same name? What about using explicit object references instead?
 

> How would I represent an externally provisioned host for which I do not
> yet have BMC credentials but where I may have hardware inventory details?

You could go ahead and create the BareMetalHostAllocation (to set the
ExternallyProvisioned flag) and HardwareDetails CRs. Since they're tied
together by name, in theory you wouldn't need to create the
BareMetalHost yet at all, unless we programmed the controller to delete
all Allocations/HardwareDetails resources that weren't associated with a
Host. Or you could just create the Host with blank credentials (I'm not
sure we handle this well today, but it's the ~same problem in either case).

- ZB


Dmitry Tantsur

May 29, 2020, 5:36:54 AM
to Zane Bitter, Metal3 Development List
One clarification inline.

You don't need to rebuild RAID between deployments (even software RAID) unless some 3rd party is messing with your hardware or the software RAID has been broken from within the system.

Dmitry
 