Bootable containers and KubeVirt


Stefan Hajnoczi

Mar 7, 2023, 5:10:46 PM
to kubevirt-dev
Hi,
I came across bootc (https://github.com/containers/bootc) and the
concept of bootable container images. The container image building
workflow is very popular and could become a welcome alternative to disk
image files for virtual machines, too. I wanted to discuss the idea of
adding support for bootable container images to KubeVirt.

Some use cases to start with:

- Building container images for containers and disk images for VM
  workloads running on Kubernetes involves two different workflows. It
  would be convenient to use container images with VMs and just have to
  learn one workflow.

- Starting with a container but then realizing specific kernel features
  are required (e.g. a kernel module) requires switching to VMs. That
  switch is difficult because the image needs to be built with different
  tools in order to work with VMs.

  The same applies the other way around: starting with a VM and deciding
  it would be better to deploy it as a container also requires redoing
  the image building.

  There are already some tools that can output both container and VM
  images, so conversion is possible, but that still leaves you with
  multiple image files instead of just one.

- The ability to share layers to reduce image build times and reduce
  storage space when there are many similar images. (In theory qcow2
  backing file images in a container registry could do the same thing,
  but I think that's not supported by KubeVirt today?)

How about adding a filesystem type to KubeVirt that attaches a container
image?

  apiVersion: kubevirt.io/v1
  kind: VirtualMachineInstance
  metadata:
    name: testvmi-fs
  spec:
    domain:
      devices:
        filesystems:
          - name: my-disk
            bootc:
              image: quay.io/fedora/fedora:latest

The VM would be launched with a virtiofs filesystem containing the
container image files.

Thanks to bootc the kernel inside the image can be booted.

These VMs don't need PVs unless they want persistent storage.
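
For illustration, the same boot can be prototyped by hand with QEMU and virtiofsd, roughly like this (the extracted-rootfs path and the virtiofsd flags are assumptions that depend on the local setup and virtiofsd version, and the initramfs must be able to mount a virtiofs root):

  # Assumed layout: the container image has been extracted to /var/lib/vmroot,
  # and its vmlinuz/initramfs copied out of /usr/lib/modules/<version>.
  virtiofsd --socket-path=/tmp/vhostqemu --shared-dir=/var/lib/vmroot &

  qemu-system-x86_64 -M q35 -m 4G -enable-kvm \
      -object memory-backend-memfd,id=mem,size=4G,share=on -numa node,memdev=mem \
      -chardev socket,id=char0,path=/tmp/vhostqemu \
      -device vhost-user-fs-pci,chardev=char0,tag=my-disk \
      -kernel vmlinuz -initrd initramfs.img \
      -append "rootfstype=virtiofs root=my-disk rw"

KubeVirt would do the equivalent internally when it sees the bootc filesystem source.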

Dan: Maybe libvirt would like to support this functionality directly so
it's available beyond KubeVirt. Otherwise KubeVirt can implement it
itself.

Colin: Is there a specification document for the bootc metadata? I know
the kernel version is included as metadata and the actual vmlinuz +
initramfs are in /usr/lib/modules/$VERSION. I'm not sure if kernel
parameters can also be included as metadata with the image?

Thanks,
Stefan

Colin Walters

Mar 8, 2023, 7:23:15 AM
to kubevirt-dev
On Tuesday, March 7, 2023 at 5:10:46 PM UTC-5 Stefan Hajnoczi wrote:
Hi,
I came across bootc (https://github.com/containers/bootc) and the
concept of bootable container images. The container image building
workflow is very popular and could become a welcome alternative to disk
image files for virtual machines, too. I wanted to discuss the idea of
adding support for bootable container images to KubeVirt.

Awesome, thanks so much for starting this thread!
 
Some use cases to start with:

- Building container images for containers and disk images for VM
  workloads running on Kubernetes involves two different workflows. It
  would be convenient to use container images with VMs and just have to
  learn one workflow.

Yes; this rationale is a huge driver for bootc; every tool and technique one knows from building containerized applications also applies here.

That said IMO it's not just about images; see e.g. https://github.com/containers/bootc/issues/22 which proposes extending the tooling in a crucial way.

Also note it is a top-level goal of bootc to create workflows that *also* work on bare metal, not just VMs.  (In fact, the quay.io/fedora/fedora-coreos:stable base image already works exactly the same across bare metal and most clouds; we have long eschewed having hypervisor-specific agents etc. in favor of a single image.  So if you derive from this image, you suddenly get something you can deploy *bit for bit* identically across metal and cloud.  For example, test the image in a virtualized environment, then deploy on metal.)

This property is quite useful for Kubernetes/OpenShift nodes as it's also how we deploy RHEL CoreOS today.


- Starting with a container but then realizing specific kernel features
  are required (e.g. a kernel module) requires switching to VMs. That
  switch is difficult because the image needs to be built with different
  tools in order to work with VMs.

  The same applies the other way around: starting with a VM and deciding
  it would be better to deploy it as a container also requires redoing
  the image building.

  There are already some tools that can output both container and VM
  images, so conversion is possible, but that still leaves you with
  multiple image files instead of just one.

Well...this is a nuanced topic.  We already have e.g. Kata Containers as the answer for "I want to build like a container, but have the stronger isolation properties of VMs".  As of recently there are even more options in that space.

So I think many cases are moving away from a need to build explicit VM images.

Also it's super important to emphasize here that bootc proposes building as a container, but deploying directly onto the host.  At runtime, systemd is still pid 1, etc.

A thing I see bootc as aiding, though, is "cross-compatible VM and container image builds"...e.g. using things like Ansible and similar tools as part of a container build, but deploying to the host.


 

- The ability to share layers to reduce image build times and reduce
  storage space when there are many similar images. (In theory qcow2
  backing file images in a container registry could do the same thing,
  but I think that's not supported by KubeVirt today?)

Right, though actually sharing layers requires generating layers in a *reproducible* way...which is something that https://coreos.github.io/rpm-ostree/container/ does today, but is not something you can really get from Dockerfile.
See also https://github.com/ostreedev/ostree-rs-ext/issues/69 and lots of links from there.
 

How about adding a filesystem type to KubeVirt that attaches a container
image?

  apiVersion: kubevirt.io/v1
  kind: VirtualMachineInstance
  metadata:
    name: testvmi-fs
  spec:
    domain:
      devices:
        filesystems:
          - name: my-disk
            bootc:
              image: quay.io/fedora/fedora:latest

The VM would be launched with a virtiofs filesystem containing the
container image files.

Thanks to bootc the kernel inside the image can be booted.

Hmm yes...but part of the idea of bootc is that the VM system can update itself in place.
This type of model is a bit more like PXE booting right?
 
Colin: Is there a specification document for the bootc metadata? I know
the kernel version is included as metadata and the actual vmlinuz +
initramfs are in /usr/lib/modules/$VERSION.

I think some people might want the virtiofs setup, but it's not clear how many would, versus wanting more persistent-style systems.
 
I'm not sure if kernel
parameters can also be included as metadata with the image?

I'd like to add standard support for this, indeed.  I'd also like it to be possible to do
`RUN grubby --add=nosmt`
in a Dockerfile and have that actually honored by bootc on the client system.

Alexander Wels

Mar 8, 2023, 9:00:11 AM
to Stefan Hajnoczi, kubevirt-dev
I am assuming you saw this https://kubevirt.io/user-guide/virtual_machines/disks_and_volumes/#containerdisk which is essentially a qcow2 file in a container image. CDI even supports importing images from the registry like that into PVCs. I think I am missing some nuance about what you are discussing here.
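
For reference, a containerDisk is built roughly like this, per the docs (the image and file names are placeholders):

  # containerDisk layout: wrap a qcow2 disk image in a container image.
  # KubeVirt expects the disk under /disk/, owned by UID/GID 107 (qemu).
  FROM scratch
  ADD --chown=107:107 fedora.qcow2 /disk/

It is then built and pushed with the usual `podman build` / `podman push` and referenced from the VMI as a containerDisk volume.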
 

Fabian Deutsch

Mar 8, 2023, 9:18:51 AM
to Colin Walters, kubevirt-dev
Yes.

This is a container intended for bare metal (as well as virt or regular containers), so how is it then intended to be booted on BM?
I assume that you still need a bootloader, kernel, and initramfs, which then ultimately pivots into the container as a rootfs?

IOW like Fedora IoT or Silverblue are booted and updated in-place?
 
 
Colin: Is there a specification document for the bootc metadata? I know
the kernel version is included as metadata and the actual vmlinuz +
initramfs are in /usr/lib/modules/$VERSION.

I think some people might want the virtiofs setup, but it's not clear how many would, versus wanting more persistent-style systems.
 
I'm not sure if kernel
parameters can also be included as metadata with the image?

I'd like to add standard support for this, indeed.  I'd also like it to be possible to do
`RUN grubby --add=nosmt`
in a Dockerfile and have that actually honored by bootc on the client system.


Stefan Hajnoczi

Mar 8, 2023, 10:09:08 AM
to kubevirt-dev
On Wednesday, March 8, 2023 at 7:23:15 AM UTC-5 Colin Walters wrote:
On Tuesday, March 7, 2023 at 5:10:46 PM UTC-5 Stefan Hajnoczi wrote:

- The ability to share layers to reduce image build times and reduce
  storage space when there are many similar images. (In theory qcow2
  backing file images in a container registry could do the same thing,
  but I think that's not supported by KubeVirt today?)

Right, though actually sharing layers requires generating layers in a *reproducible* way...which is something that https://coreos.github.io/rpm-ostree/container/ does today, but is not something you can really get from Dockerfile.

 
See also https://github.com/ostreedev/ostree-rs-ext/issues/69 and lots of links from there.
 

How about adding a filesystem type to KubeVirt that attaches a container
image?

  apiVersion: kubevirt.io/v1
  kind: VirtualMachineInstance
  metadata:
    name: testvmi-fs
  spec:
    domain:
      devices:
        filesystems:
          - name: my-disk
            bootc:
              image: quay.io/fedora/fedora:latest

The VM would be launched with a virtiofs filesystem containing the
container image files.

Thanks to bootc the kernel inside the image can be booted.

Hmm yes...but part of the idea of bootc is that the VM system can update itself in place.
This type of model is a bit more like PXE booting right?

Yes. The part I'm most interested in here is a standard for bootable container images, not the bootc tool itself.

bootc can already be used by creating a traditional KubeVirt VM and running bootc inside (like on bare metal). That's useful for long-lived VMs that need to update software.

The scenario I'm thinking about is more like launching a container. The container image doesn't change.
 
 
Colin: Is there a specification document for the bootc metadata? I know
the kernel version is included as metadata and the actual vmlinuz +
initramfs are in /usr/lib/modules/$VERSION.

I think some people might want the virtiofs setup, but it's not clear how many would, versus wanting more persistent-style systems.

There are 3 levels related to persistence:
1. A read-only root file system. (virtiofs)
2. A read-write root file system that does not persist after the VM is shut down. (virtiofs)
3. A k8s Persistent Volume that persists after the VM is shut down. (virtio-blk/scsi)

KubeVirt does #3 today (and a bunch of variations).

I don't think it does #1 or #2 today because the existing virtiofs support is tied to Persistent Volumes rather than container images.
 
 
I'm not sure if kernel
parameters can also be included as metadata with the image?

I'd like to add standard support for this, indeed.  I'd also like it to be possible to do
`RUN grubby --add=nosmt`
in a Dockerfile and have that actually honored by bootc on the client system.

I see. Would the kernel parameters be stored in a file inside the container image (e.g. /boot or /etc)?

Stefan Hajnoczi

Mar 8, 2023, 10:13:09 AM
to kubevirt-dev
It's convenient to build a container image and boot it without the overhead of PVs. Containers don't need a volume either. It's about offering this workflow.

Stefan

Stefan Hajnoczi

Mar 8, 2023, 10:20:04 AM
to kubevirt-dev
On Wednesday, March 8, 2023 at 9:18:51 AM UTC-5 Fabian Deutsch wrote:
On Wed, Mar 8, 2023 at 1:23 PM Colin Walters <wal...@redhat.com> wrote:


On Tuesday, March 7, 2023 at 5:10:46 PM UTC-5 Stefan Hajnoczi wrote:
How about adding a filesystem type to KubeVirt that attaches a container
image?

  apiVersion: kubevirt.io/v1
  kind: VirtualMachineInstance
  metadata:
    name: testvmi-fs
  spec:
    domain:
      devices:
        filesystems:
          - name: my-disk
            bootc:
              image: quay.io/fedora/fedora:latest

The VM would be launched with a virtiofs filesystem containing the
container image files.

Thanks to bootc the kernel inside the image can be booted.

Hmm yes...but part of the idea of bootc is that the VM system can update itself in place.
This type of model is a bit more like PXE booting right?

Yes.

This is a container intended for bare metal (as well as virt or regular containers), so how is it then intended to be booted on BM?
I assume that you still need a bootloader, kernel, and initramfs, which then ultimately pivots into the container as a rootfs?

A bootable container image contains a kernel and initramfs but no bootloader.

A VM can be booted by extracting the kernel and initramfs from the image (like the https://kubevirt.io/user-guide/virtual_machines/boot_from_external_source/ feature) and providing the container image as a virtiofs root file system.
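
As a rough sketch of that extraction step (the image name is a placeholder; a bootc-style image is expected to carry a single kernel under /usr/lib/modules):

  img=quay.io/example/bootable-os:latest
  ctr=$(podman create "$img")
  # a bootc-style image should contain exactly one kernel version
  kver=$(podman run --rm "$img" ls /usr/lib/modules)
  podman cp "$ctr:/usr/lib/modules/$kver/vmlinuz" vmlinuz
  podman cp "$ctr:/usr/lib/modules/$kver/initramfs.img" initramfs.img
  podman rm "$ctr"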

Colin Walters

Mar 8, 2023, 11:26:23 AM
to kubevi...@googlegroups.com


On Wed, Mar 8, 2023, at 9:18 AM, Fabian Deutsch wrote:
>
> This is a container, intended for bare metal (as well as virt or reg
> container), how is this then intended to be booted on BM?
>
> I assume that you still need a bootloader, kernel, and initramfs which
> is then ultimately pivoting into the container as a rootfs?

This is what "bootc install" does:
https://github.com/containers/bootc/#using-bootc-install

It uses the tools inside the (privileged) container to create the desired root filesystem at the time of install to the target block device, does a grub install etc. Note that when run as a privileged container, it uses the running kernel to make the filesystems, but e.g. `mkfs.xfs` etc. come from the container userspace.

Once an install is done and the machine booted, `bootc upgrade` itself pulls the updated container images. I also plan to roll in the bootupd logic, so `bootc bootloader upgrade` would also update the bootloader (grub2/shim/etc.)

Another way to say this is: bootc is designed for use anywhere one might use apt/yum/etc. today. Once an install is done, it owns the kernel and userspace updates.

That said, there's also e.g. `bootc install-to-filesystem` which is intended to support external install software (e.g. Anaconda and the like) for more complex block storage setups (LVM/Stratis/etc.)
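
Very roughly, the flow looks like this (the exact flags are still evolving, so treat it as illustrative rather than authoritative; the image and target device are examples):

  # Install the bootable container image onto a target disk, using the tools
  # inside the privileged container itself:
  podman run --rm --privileged --pid=host \
      quay.io/example/custom-os:latest \
      bootc install /dev/vdb

  # After rebooting into the installed system, later updates are pulled in-guest:
  bootc upgrade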

> IOW like Fedora IoT or Silverblue are booted and updated in-place?

Yes; bootc also aims to be a successor to rpm-ostree, which those things use. It's in fact today a seamless operation to switch those to pull bootable containers instead; see `bootc switch`.

https://pagure.io/releng/issue/11047 tracks having Fedora ship compatible containers for Fedora Silverblue; we're very close to doing so.

Fabian Deutsch

Mar 8, 2023, 2:32:43 PM
to Colin Walters, kubevi...@googlegroups.com
On Wed, Mar 8, 2023 at 5:26 PM Colin Walters <wal...@verbum.org> wrote:


On Wed, Mar 8, 2023, at 9:18 AM, Fabian Deutsch wrote:
>
> This is a container, intended for bare metal (as well as virt or reg
> container), how is this then intended to be booted on BM?
>
> I assume that you still need a bootloader, kernel, and initramfs which
> is then ultimately pivoting into the container as a rootfs?

This is what "bootc install" does:
https://github.com/containers/bootc/#using-bootc-install

It uses the tools inside the (privileged) container to create the desired root filesystem at the time of install to the target block device, does a grub install etc.  Note that when run as a privileged container, it uses the running kernel to make the filesystems, but e.g. `mkfs.xfs` etc. come from the container userspace.

Once an install is done and the machine booted, `bootc upgrade` itself pulls the updated container images. I also plan to roll in the bootupd logic, so `bootc bootloader upgrade` would also update the bootloader (grub2/shim/etc.)

Another way to say this is: bootc is designed for use anywhere one might use apt/yum/etc. today.  Once an install is done, it owns the kernel and userspace updates.

That said, there's also e.g. `bootc install-to-filesystem` which is intended to support external install software (e.g. Anaconda and the like) for more complex block storage setups (LVM/Stratis/etc.)

IOW `bootc install` is doing just enough in order to make the content/tree bootable on a device.
Only due to this support - the bootc install magic - can the container be booted.
I'm not picky or complaining, I'm just trying to see it in the right light :)

I think it also matters, because the "install" is host-sided, but the upgrade is guest-sided. That is, ownership gets transferred.
It would be nice if there were a way to keep both flows on the host side.
But then I wonder whether this is still valuable enough to classic VM users, or too container-ish.
 

> IOW like Fedora IoT or Silverblue are booted and updated in-place?

Yes; bootc also aims to be a successor to rpm-ostree, which those things use.  It's in fact today a seamless operation to switch those to pull bootable containers instead; see `bootc switch`.

https://pagure.io/releng/issue/11047 tracks having Fedora ship compatible containers for Fedora Silverblue; we're very close to doing so.

Thanks, I'm looking forward to it :)
 


Colin Walters

Mar 8, 2023, 4:28:10 PM
to kubevi...@googlegroups.com


On Wed, Mar 8, 2023, at 2:32 PM, Fabian Deutsch wrote:
>
> I think it also matters, because it's "host" sided to "install",

I wouldn't say that, no. "Installation" of bootc containers is still intended to be done by default in a guest context. I'm actually working on a "bootc install --takeover" mode where we move the running privileged container into RAM, and then rewrite the root volume.

So in IaaS platforms like AWS and KubeVirt, one could provision an existing stock cloud image (whether that's a "traditional" Fedora-derivative cloud image that uses yum and cloud-init e.g. or Ubuntu or whatever) and the cloud-init metadata injects code that runs the `podman run --privileged quay.io/myuser/example:latest --takeover /` and then everything that existed in the default image backing store (EBS/etc.) is wiped and replaced with the content of the container.
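
(As a sketch, the injected user-data could be as small as the following; the --takeover invocation mirrors the command above and is hypothetical until that mode actually lands:)

  #cloud-config
  runcmd:
    # hypothetical: --takeover is still a WIP mode at this point
    - podman run --privileged quay.io/myuser/example:latest --takeover /
    - reboot  # boot into the rewritten root (if the takeover doesn't reboot itself)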

One could also of course have a model that spins up a guest instance and does an install to a disk, then snapshot that disk and use it for later instances.

I could *imagine* that a project like KubeVirt gains some sort of native recognition of bootc-style container images and supports not just "host-managed upgrades" as the thread originally started, but also the GUI understands how to do the equivalent of "virt-install" style guided installations.

> but
> then guest sided to upgrade. Speak the ownership is getting transfered.

Right, I'd say there is no ownership transfer. bootc owns both halves by default; all code is executed in the guest context without hypervisor awareness. Which again we need because we also need to care about contexts where we *are* the hypervisor, on bare metal.

Colin Walters

Mar 9, 2023, 10:14:10 AM
to kubevi...@googlegroups.com


On Wed, Mar 8, 2023, at 4:27 PM, Colin Walters wrote:
> On Wed, Mar 8, 2023, at 2:32 PM, Fabian Deutsch wrote:
>>
>> I think it also matters, because it's "host" sided to "install",
>
> I wouldn't say that, no. "Installation" of bootc containers is still
> intended to be done by default in a guest context.

Here's another way to look at it: bootc is explicitly intended to "mirror" podman in system architecture. Today, by default KubeVirt does not need to do anything to enable podman to work inside a guest. Of course, one *can* create cross-intersections here (for example, having guests do pull-through caching of container images from a registry hosted "natively" in the kubevirt cluster). And note that that specific use case will also literally *just work* with bootc because we use https://github.com/containers/image behind the scenes so we also honor /etc/containers/registries.conf.

What you guys seem to be thinking does make some sense - but I don't think it's at all required. It's again not the primary focus for bootc, and I am skeptical we should even think of such hypervisor-managed boots as something to aim for by default even. Another way to say this is: there are systems out there that *always* PXE boot on bare metal. But I wouldn't call that the default.

Stefan Hajnoczi

Mar 9, 2023, 2:34:29 PM
to kubevirt-dev
I'm not sure anyone else is on board with the idea I proposed :).

I agree with you that it's not required in order to use bootable container images in KubeVirt. Using bootc inside a VM the same way it's used on bare metal is possible.

What I find exciting about bootable container images is the opportunity to simplify the VM workflow. If you build VMs as container images, then the extra steps of getting a Persistent Volume and running bootc inside the VM or using containerDisk to copy the contents of the image onto a Persistent Volume seem like friction. The workflow is simpler if the VM can launch directly from a bootable container image.

This direct launch doesn't work for some use cases like running a traditional Linux "pet" system, but it's great for "cattle" VMs. I think users might find it more convenient than the existing workflow.

Stefan

Alice Frosi

Mar 10, 2023, 3:26:28 AM
to Stefan Hajnoczi, kubevirt-dev
Hi Stefan,

On Thu, Mar 9, 2023 at 8:34 PM Stefan Hajnoczi <shaj...@redhat.com> wrote:


On Thursday, March 9, 2023 at 10:14:10 AM UTC-5 Colin Walters wrote:


On Wed, Mar 8, 2023, at 4:27 PM, Colin Walters wrote:
> On Wed, Mar 8, 2023, at 2:32 PM, Fabian Deutsch wrote:
>>
>> I think it also matters, because it's "host" sided to "install",
>
> I wouldn't say that, no. "Installation" of bootc containers is still
> intended to be done by default in a guest context.

Here's another way to look at it: bootc is explicitly intended to "mirror" podman in system architecture. Today, by default KubeVirt does not need to do anything to enable podman to work inside a guest. Of course, one *can* create cross-intersections here (for example, having guests do pull-through caching of container images from a registry hosted "natively" in the kubevirt cluster). And note that that specific use case will also literally *just work* with bootc because we use https://github.com/containers/image behind the scenes so we also honor /etc/containers/registries.conf.

What you guys seem to be thinking does make some sense - but I don't think it's at all required. It's again not the primary focus for bootc, and I am skeptical we should even think of such hypervisor-managed boots as something to aim for by default even. Another way to say this is: there are systems out there that *always* PXE boot on bare metal. But I wouldn't call that the default.

I'm not sure anyone else is on board with the idea I proposed :).

I agree with you that it's not required in order to use bootable container images in KubeVirt. Using bootc inside a VM the same way it's used on bare metal is possible.


What I don't quite understand is why you want to make the two workflows exactly the same. Or in other words, containers and VMs will always differ: a VM will need a kernel, initrd, and a bootloader. The common part could be the root filesystem, and this could be shared with virtiofs. However, this approach now looks very close to what Kata Containers and libkrun do today, with the difference that KubeVirt doesn't provide its own kernel and initrd.
 
What I find exciting about bootable container images is the opportunity to simplify the VM workflow. If you build VMs as container images, then the extra steps of getting a Persistent Volume and running bootc inside the VM or using containerDisk to copy the contents of the image onto a Persistent Volume seem like friction. The workflow is simpler if the VM can launch directly from a bootable container image.

Just a note on Persistent Volumes: they aren't necessary. By design, container images are ephemeral, and so are the container disks, which we could classify as the corresponding VM disk version.

This totally diverges from the topic, but it is kind of related. What I would find more interesting is an option to say "this image will be used by a VM-based runtime/workload", so that the container runtime stores the image on a device instead of a filesystem. With this option, the container image building remains the same, but you could decide at runtime how the image will be used (and stored). By the way, this was the idea behind this PR in cri-o (https://github.com/cri-o/cri-o/pull/5624).
 

This direct launch doesn't work for some use cases like running a traditional Linux "pet" system, but it's great for "cattle" VMs. I think users might find it more convenient than the existing workflow.

Stefan

Alice 

Stefan Hajnoczi

Mar 10, 2023, 8:02:54 AM
to Alice Frosi, Stefan Hajnoczi, kubevirt-dev
Kata Containers and podman+libkrun are different: from the user
perspective you're working with containers. The user does not have
access to and cannot control the VM configuration.

The workflow I'm suggesting is a VM workflow (you can control the VM
configuration through KubeVirt and there is no container, i.e.
cgroups/namespaces) but image building and deployment uses bootable
container images.

The bootable container image workflow would be nice for "cattle" VMs
with ephemeral root file systems.

> > What I find exciting about bootable container images is the opportunity to
> > simplify the VM workflow. If you build VMs as container images, then the
> > extra steps of getting a Persistent Volume and running bootc inside the VM
> > or using containerDisk to copy the contents of the image onto a Persistent
> > Volume seem like friction. The workflow is simpler if the VM can launch
> > directly from a bootable container image.
> >
>
> Just a note on Persistent Volumes, they aren't necessary. By design,
> container images are ephemeral, and so are the container disks, which we
> could classify as the corresponding VM disk version.

Thanks for pointing this out. I misunderstood how containerDisk (and
KubeVirt's Volumes in general) works. containerDisk is close to what I'm
thinking about, except it uses a block device instead of a file system.

I think what I'm imagining is a virtiofs equivalent of containerDisk.
Today KubeVirt only supports virtiofs on PVCs and has no equivalent of
containerDisk?

That, plus bootable containers support so KubeVirt knows how to parse
the container image and extract the kernel/initramfs without manual
configuration.

> This, totally, diverges from the topic, but it is kind of related. What I
> find more interesting would be an option to say, this image will be used by
> a VM-based runtime/workload, and in this way, the container runtime will
> store the image on a device instead of a filesystem. With this option, the
> container image building will remain the same, but you could decide at
> runtime how this image will be used (and stored). By the way, this was the
> idea behind this PR in cri-o (https://github.com/cri-o/cri-o/pull/5624)

I think there is interesting stuff to be done in that area too.

Kata Containers had a trick to pass through the underlying block device
to the sandbox VM. It seemed unsafe to me (because the container runtime
would have it mounted at the same time!) and I don't think it's used
much, but I think it was an optimization for device-mapper Docker
storage driver.

Hopefully in practice the same container image isn't used by VM-based
and regular containers at the same time because the same image would be
extracted twice when two storage drivers are configured. It would be
cool if the same container image blob could be either mounted on the
host or passed through to a sandbox VM. The tar file format that OCI
images use isn't a good fit for that though, so OCI images would still
need to be extracted first.

Stefan

Colin Walters

Mar 10, 2023, 2:51:40 PM
to kubevi...@googlegroups.com


On Thu, Mar 9, 2023, at 2:34 PM, Stefan Hajnoczi wrote:

> I'm not sure anyone else is on board with the idea I proposed :).

Heh sorry; I didn't mean to come across as negative actually.
>
> What I find exciting about bootable container images is the opportunity
> to simplify the VM workflow. If you build VMs as container images, then
> the extra steps of getting a Persistent Volume and running bootc inside
> the VM or using containerDisk to copy the contents of the image onto a
> Persistent Volume seem like friction. The workflow is simpler if the VM
> can launch directly from a bootable container image.

Yeah actually to be clear I agree this is an interesting topic and direction - if polished well I do agree it could be quite compelling for many use cases. There's intermediate degrees here too...I could imagine nodes exposing a "blob cache" service over virtiofs or so, which would apply to both podman containers but becomes even more interesting if it's used for the host too.

One detail here is with UKIs we can cut out the bootloader, which definitely simplifies things for this model.

> This direct launch doesn't work for some use cases like running a
> traditional Linux "pet" system, but it's great for "cattle" VMs. I
> think users might find it more convenient than the existing workflow.

Sort of an aside, but I try to use the term "reprovisionable" instead of "cattle". There's a bit more related to this in https://blog.verbum.org/2020/08/22/immutable-%E2%86%92-reprovisionable-anti-hysteresis/


Alice Frosi

Mar 14, 2023, 11:57:11 AM
to Stefan Hajnoczi, Stefan Hajnoczi, kubevirt-dev
Yes, you are right
 

> > What I find exciting about bootable container images is the opportunity to
> > simplify the VM workflow. If you build VMs as container images, then the
> > extra steps of getting a Persistent Volume and running bootc inside the VM
> > or using containerDisk to copy the contents of the image onto a Persistent
> > Volume seem like friction. The workflow is simpler if the VM can launch
> > directly from a bootable container image.
> >
>
> Just a note on Persistent Volumes, they aren't necessary. By design,
> container images are ephemeral, and so are the container disks, which we
> could classify as the corresponding VM disk version.

Thanks for pointing this out. I misunderstood how containerDisk (and
KubeVirt's Volumes in general) works. containerDisk is close to what I'm
thinking about, except it uses a block device instead of a file system.

I think what I'm imagining is a virtiofs equivalent of containerDisk.
Today KubeVirt only supports virtiofs on PVCs and has no equivalent of
containerDisk?

You can also use virtiofs with volume types other than PVCs. German recently added support for secrets and configmaps [1]. Virtiofs is configured on the domain side, so theoretically you could use it with all types of volumes. The missing part here is to extend container disks to support something other than raw/qcow images. Probably, in this case, we will need KubeVirt to directly mount the container filesystem in virt-launcher.
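
For reference, a filesystems entry backed by a configMap volume looks roughly like this today (written from memory, so double-check the exact field names; the ConfigMap and image names are placeholders):

  apiVersion: kubevirt.io/v1
  kind: VirtualMachineInstance
  metadata:
    name: testvmi-configfs
  spec:
    domain:
      resources:
        requests:
          memory: 1Gi
      devices:
        disks:
          - name: containerdisk
            disk:
              bus: virtio
        filesystems:
          - name: app-config
            virtiofs: {}
    volumes:
      - name: containerdisk
        containerDisk:
          image: quay.io/containerdisks/fedora:latest
      - name: app-config
        configMap:
          name: app-config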

 

That, plus bootable containers support so KubeVirt knows how to parse
the container image and extract the kernel/initramfs without manual
configuration.

> This, totally, diverges from the topic, but it is kind of related. What I
> find more interesting would be an option to say, this image will be used by
> a VM-based runtime/workload, and in this way, the container runtime will
> store the image on a device instead of a filesystem. With this option, the
> container image building will remain the same, but you could decide at
> runtime how this image will be used (and stored).  By the way, this was the
> idea behind this PR in cri-o (https://github.com/cri-o/cri-o/pull/5624)

I think there is interesting stuff to be done in that area too.

Kata Containers had a trick to pass through the underlying block device
to the sandbox VM. It seemed unsafe to me (because the container runtime
would have it mounted at the same time!) and I don't think it's used
much, but I think it was an optimization for device-mapper Docker
storage driver.

Yes, of course, that trick isn't nice and it could be improved. However, I still believe that teaching the container engines how to handle storage for VM-based workloads could be beneficial. For example, if the container engine understands that it is a VM-based workload, then it shouldn't mount the storage at all.
 

Hopefully in practice the same container image isn't used by VM-based
and regular containers at the same time because the same image would be
extracted twice when two storage drivers are configured. It would be
cool if the same container image blob could be either mounted on the
host or passed through to a sandbox VM. The tar file format that OCI
images use isn't a good fit for that though, so OCI images would still
need to be extracted first.

Again, we are talking about a feature that still doesn't exist. However, you could pull the image and cache it on the node before copying it to 2 different storage drivers. Of course, if we use 2 different storage drivers then we cannot avoid having a double local copy of the image.
We need to distinguish the pulling (and storing) of the image from the creation of the container filesystem at the time the workload is run.
As mentioned above, I think the mount of the image blob could be avoided with an extension of the container engine. The image is mounted only when the container is created, so when the workload is started. At runtime, we would know how we want to use the image.
 
Alice


Stefan

dvo...@redhat.com

Mar 15, 2023, 5:52:37 PM
to kubevirt-dev
On Wednesday, March 8, 2023 at 4:28:10 PM UTC-5 Colin Walters wrote:


On Wed, Mar 8, 2023, at 2:32 PM, Fabian Deutsch wrote:
>
> I think it also matters, because it's "host" sided to "install",

I wouldn't say that, no. "Installation" of bootc containers is still intended to be done by default in a guest context. I'm actually working on a "bootc install --takeover" mode where we move the running privileged container into RAM, and then rewrite the root volume.

So in IaaS platforms like AWS and KubeVirt, one could provision an existing stock cloud image (whether that's a "traditional" Fedora-derivative cloud image that uses yum and cloud-init e.g. or Ubuntu or whatever) and the cloud-init metadata injects code that runs the `podman run --privileged quay.io/myuser/example:latest --takeover /` and then everything that existed in the default image backing store (EBS/etc.) is wiped and replaced with the content of the container.

Oh, interesting. So once `bootc install --takeover` is implemented, there's really nothing stopping anyone from demoing this using bootc in KubeVirt.

It would be cool to see someone demo a multi-stage container build that both builds an application and ships it in a bootable container. Then use that container to launch the application in a KubeVirt VM (using cloud-init metadata + bootc takeover for now).
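
For the build half, I'm picturing roughly this shape (all names, paths, the unit file, and the base image are placeholders):

  # Stage 1: build the application
  FROM docker.io/library/golang:1.20 AS build
  COPY . /src
  WORKDIR /src
  RUN go build -o /myapp ./cmd/myapp

  # Stage 2: layer it onto a bootable base image
  FROM quay.io/fedora/fedora-coreos:stable
  COPY --from=build /myapp /usr/bin/myapp
  COPY myapp.service /usr/lib/systemd/system/myapp.service
  # enable the service by creating the wants symlink by hand
  RUN mkdir -p /usr/lib/systemd/system/multi-user.target.wants && \
      ln -s ../myapp.service /usr/lib/systemd/system/multi-user.target.wants/myapp.service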

I think a demo like that would open up more possibilities in the discussion by giving us a clear target for the workflow we're trying to optimize.

Stefan Hajnoczi

Mar 16, 2023, 10:10:09 AM
to Alice Frosi, Stefan Hajnoczi, kubevirt-dev
Yes, it's a good trick.

>
>
> >
> > Hopefully in practice the same container image isn't used by VM-based
> > and regular containers at the same time because the same image would be
> > extracted twice when two storage drivers are configured. It would be
> > cool if the same container image blob could be either mounted on the
> > host or passed through to a sandbox VM. The tar file format that OCI
> > images use isn't a good fit for that though, so OCI images would still
> > need to be extracted first.
> >
>
> Again, we are talking about a feature that still doesn't exist. However,
> you could pull the image and cache it on the node before copying it to 2
> different storage drivers. Of course, if we use 2 different storage drivers
> then we cannot avoid having a double local copy of the image.
> We need to distinguish the pulling (and storing) of the image from the
> moment when we create the container filesystem at the time you run the
> workload.
> As mentioned above, I think the mount of the image blob could be avoided
> with the extension of the container engine. The image is mounted only when
> the container is created, so when the workload is started. At runtime, we
> would now how we want to use the image.

I see what you mean. That makes sense to me.

In the long term, I think it would be nice to have kernel support to
mount the container image along the lines of composefs
(https://github.com/containers/composefs) but with the property that the
contents of the images are stored in a block device-friendly format.
(I'm describing the idea to avoid extracting images twice in more
detail.)

For regular containers, the runtime mounts composefs and the block device-friendly blob that contains the contents of the image.

For sandboxed containers, the runtime inside the guest mounts composefs
and the block device-friendly blob is located on a block device (e.g.
virtio-blk) presented by the host.

This way just one copy needs to be stored. Also, it's probably
unnecessary to keep the original OCI image around once it has been
extracted into the composefs block device-friendly store.

There are challenges with this approach like how to limit the data
exposed to the guest to just what's needed by the container, how to hot
plug/unplug more, etc. I think it would be necessary to have multiple,
smaller, block device-friendly "packs" instead of a single composefs
store.

Stefan

Colin Walters

Mar 17, 2023, 9:08:03 AM
to kubevi...@googlegroups.com


On Wed, Mar 15, 2023, at 5:52 PM, dvo...@redhat.com wrote:

> Oh, interesting. So once `bootc install --takeover` is implemented,
> there's really nothing stopping anyone from demoing this using bootc in
> kubevirt.

Yeah, `--takeover` will make it *way* easier to use in cloud scenarios since you
can just start from an existing (e.g. yum based/traditional) cloud image. I've
been tinkering with this in the background and posted my WIP so far
https://github.com/containers/bootc/pull/78

I think it's getting pretty close...

> It would be cool to see someone demo some a multi-stage container build
> for an application that both built the image and shipped it in a
> bootable container.

Can you elaborate on this a bit? What do you mean by "application" here?
We do have https://github.com/coreos/layering-examples
and specifically https://github.com/coreos/layering-examples/tree/main/inject-go-binary
demonstrates a multi-stage build; but there's nothing really novel in that on the build side - which is really the point =) What's new is on the client end.

By "application" are you thinking something like "single purpose machines (VMs)"? Something more like in the heyday of virtualization when [software appliances](https://en.wikipedia.org/wiki/Software_appliance) were the rage, and there was work on standardizing the OVA format for shipping them around etc? Replacing that type of flow with bootable containers kind of makes sense...but on the other hand, ISTM that actually regular non-bootable containers is better for most of that. I know people who want to take that to the next step are using tooling to package their containers up with a host system to re-create the appliance type flow.

A key thing bootc is trying to enable here is: given a CoreOS-like system, make it trivial to inject say the Cloudflare agent or nagios installed as a traditional package in a container build, then boot that image. The end user combines the OS base image with their agents and customization, not the app vendor.
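e.g. something as small as this (the package and base image are only examples; the pattern comes from the layering-examples repo linked above):

  FROM quay.io/fedora/fedora-coreos:stable
  # Layer a traditional package (your agent of choice) onto the base image
  RUN rpm-ostree install tmux && ostree container commit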

> Then use that container to launch the application
> in a KubeVirt VM (using cloud-init metadata + bootc takeover for now).

Right. But the other flow that will make total sense is a convenient process to go from "bootable container" -> "boot image" (in the Kubevirt case a containerdisk). And optimizing *that* is its own interesting topic. I could imagine it's a KubeVirt native feature, but OTOH again I think because of other popular non-container-native IaaS virt setups I think the default case is going to be a "service" running as a distinct guest.

> I think a demo like that would open up more possibilities in the
> discussion by giving us a clear target for the workflow we're trying to
> optimize.

Agree.

David Vossel

Mar 17, 2023, 12:29:33 PM
to Colin Walters, kubevi...@googlegroups.com
On Fri, Mar 17, 2023 at 9:08 AM Colin Walters <wal...@verbum.org> wrote:


On Wed, Mar 15, 2023, at 5:52 PM, dvo...@redhat.com wrote:

> Oh, interesting. So once `bootc install --takeover` is implemented,
> there's really nothing stopping anyone from demoing this using bootc in
> kubevirt.

Yeah, `--takeover` will make it *way* easier to use in cloud scenarios since you
can just start from an existing (e.g. yum based/traditional) cloud image.  I've
been tinkering with this in the background and posted my WIP so far
https://github.com/containers/bootc/pull/78

I think it's getting pretty close...

> It would be cool to see someone demo some a multi-stage container build
> for an application that both built the image and shipped it in a
> bootable container.

Can you elaborate on this a bit?  What do you mean by "application" here?
We do have https://github.com/coreos/layering-examples
and specifically https://github.com/coreos/layering-examples/tree/main/inject-go-binary
demonstrates a multi-stage build; but there's nothing really novel in that on the build side - which is really the point =)  What's new is on the client end.

By "application" are you thinking something like "single purpose machines (VMs)"?  Something more like in the heyday of virtualization when [software appliances](https://en.wikipedia.org/wiki/Software_appliance) were the rage, and there was work on standardizing the OVA format for shipping them around etc?  Replacing that type of flow with bootable containers kind of makes sense...but on the other hand, ISTM that actually regular non-bootable containers is better for most of that.  I know people who want to take that to the next step are using tooling to package their containers up with a host system to re-create the appliance type flow.


My suggestion of a demo was partially meant to help bring into focus what we think is valuable about this functionality in the context of KubeVirt.

My thoughts are: if someone is currently invested in packaging their applications into VMs, then they are likely entrenched in a legacy workflow. Modernization of that workflow would likely result in full containerization rather than just adopting a containerization build flow and a traditional virtualization runtime... It's possible this could gain a small amount of adoption though. It's also possible there are angles to this I haven't thought of.

People who are managing their own custom VM boot images and need to update various agents within that boot image could find this useful. However, similar to the previous "custom application", a user who needs this has already invested in legacy technology to ensure their agents are updated and might be resistant to novel technology at this point in the game... but again, some people might adopt this and I might not be seeing the full picture.

I don't think we have any instances of users rolling out their own CoreOS-like VM image management on KubeVirt. We see applications like OpenShift running on KubeVirt VMs now, but even then users are not typically touching the VM images in those setups. So the strongest use case for this technology doesn't seem to have much overlap with KubeVirt at the moment.

My request for a KubeVirt demo is really to expose how this tooling maps to KubeVirt use cases so we can judge clearly what the value is here for our community. I'm not seeing that value right now, and I think it would take someone providing a strong proof of concept to bring that value into focus.

Once the value is proven, then I think it makes sense to start discussing the technical details of how to further integrate with KubeVirt.


A key thing bootc is trying to enable here is: given a CoreOS-like system, make it trivial to inject say the Cloudflare agent or nagios installed as a traditional package in a container build, then boot that image.  The end user combines the OS base image with their agents and customization, not the app vendor.

> Then use that container to launch the application
> in a KubeVirt VM (using cloud-init metadata + bootc takeover for now).

Right.  But the other flow that will make total sense is a convenient process to go from "bootable container" -> "boot image" (in the Kubevirt case a containerdisk).  And optimizing *that* is its own interesting topic.  I could imagine it's a KubeVirt native feature, but OTOH again I think because of other popular non-container-native IaaS virt setups I think the default case is going to be a "service" running as a distinct guest.

> I think a demo like that would open up more possibilities in the
> discussion by giving us a clear target for the workflow we're trying to
> optimize.

Agree.


Stefan Hajnoczi

Mar 21, 2023, 7:35:31 AM
to Alice Frosi, Stefan Hajnoczi, kubevirt-dev
By the way, I looked more into the idea I mentioned, and it turns out there is already something like that in Linux:
https://www.kernel.org/doc/html/latest/filesystems/erofs.html

Nydus can use erofs:
https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md
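
A crude way to play with the idea today might be (paths and image names are placeholders; mkfs.erofs comes from erofs-utils):

  # Extract a container image and pack it into an EROFS image that could be
  # attached to a guest as a read-only block device.
  mkdir rootfs
  podman export "$(podman create quay.io/fedora/fedora:latest)" | tar -x -C rootfs
  mkfs.erofs rootfs.erofs rootfs/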

Stefan