Hi,
I came across bootc (https://github.com/containers/bootc) and the
concept of bootable container images. The container image building
workflow is very popular and could become a welcome alternative to disk
image files for virtual machines, too. I wanted to discuss the idea of
adding support for bootable container images to KubeVirt.
Some use cases to start with:
- Building container images for containers and disk images for VM
workloads running on Kubernetes involves two different workflows. It
would be convenient to use container images with VMs and just have to
learn one workflow.
- Starting with a container and then realizing that specific kernel
features (e.g. a kernel module) are required means switching to VMs.
That switch is difficult because the image needs to be built with
different tools in order to work with VMs.
The same applies the other way around: starting with a VM and deciding
it would be better to deploy it as a container also requires redoing
the image building.
There are already some tools that can output both container and VM
images, so conversion is possible, but that still leaves you with
multiple image files instead of just one.
- The ability to share layers to reduce image build times and reduce
storage space when there are many similar images. (In theory qcow2
backing file images in a container registry could do the same thing,
but I think that's not supported by KubeVirt today?)
How about adding a filesystem type to KubeVirt that attaches a container
image?
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: testvmi-fs
spec:
  domain:
    devices:
      filesystems:
      - name: my-disk
        bootc:
          image: quay.io/fedora/fedora:latest
The VM would be launched with a virtiofs filesystem containing the
container image files.
Thanks to bootc the kernel inside the image can be booted.
Colin: Is there a specification document for the bootc metadata? I know
the kernel version is included as metadata and the actual vmlinuz +
initramfs are in /usr/lib/modules/$VERSION.
I'm not sure if kernel parameters can also be included as metadata with the image?
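For illustration, extracting those pieces on the host might look
something like this (a sketch: the initramfs filename is an assumption,
and the image is the placeholder from the example above):

  # Create a stopped container and copy the kernel bits out of it.
  ctr=$(podman create quay.io/fedora/fedora:latest)
  kver=$(podman run --rm quay.io/fedora/fedora:latest ls /usr/lib/modules)
  podman cp "$ctr:/usr/lib/modules/$kver/vmlinuz" .
  podman cp "$ctr:/usr/lib/modules/$kver/initramfs.img" .
  podman rm "$ctr"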
On Tuesday, March 7, 2023 at 5:10:46 PM UTC-5 Stefan Hajnoczi wrote:
> - The ability to share layers to reduce image build times and reduce
> storage space when there are many similar images. (In theory qcow2
> backing file images in a container registry could do the same thing,
> but I think that's not supported by KubeVirt today?)

Right, though actually sharing layers requires generating layers in a *reproducible* way... which is something that https://coreos.github.io/rpm-ostree/container/ does today, but is not something you can really get from a Dockerfile.
See also https://github.com/ostreedev/ostree-rs-ext/issues/69 and lots of links from there.
> How about adding a filesystem type to KubeVirt that attaches a container
> image?
>
>   apiVersion: kubevirt.io/v1
>   kind: VirtualMachineInstance
>   metadata:
>     name: testvmi-fs
>   spec:
>     domain:
>       devices:
>         filesystems:
>         - name: my-disk
>           bootc:
>             image: quay.io/fedora/fedora:latest
>
> The VM would be launched with a virtiofs filesystem containing the
> container image files.
> Thanks to bootc the kernel inside the image can be booted.

Hmm yes... but part of the idea of bootc is that the VM system can update itself in place. This type of model is a bit more like PXE booting, right?
> Colin: Is there a specification document for the bootc metadata? I know
> the kernel version is included as metadata and the actual vmlinuz +
> initramfs are in /usr/lib/modules/$VERSION.

I think some people might want the virtiofs setup, but it's not clear how many versus more persistent style systems.

> I'm not sure if kernel parameters can also be included as metadata with the image?

I'd like to add standard support for this indeed. I'd also like it to work to do `RUN grubby --add=nosmt` in a Dockerfile and have that actually honored by bootc on the client system.
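For example, the desired flow in a Containerfile might look like this
(the base image name is an assumption, and as noted, bootc does not
honor this yet):

  FROM quay.io/fedora/fedora-bootc:latest
  # Desired: bootc would record this kernel argument change at build time
  # and apply it when the image is deployed on the client system.
  RUN grubby --add=nosmt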
On Wed, Mar 8, 2023 at 1:23 PM Colin Walters <wal...@redhat.com> wrote:
> On Tuesday, March 7, 2023 at 5:10:46 PM UTC-5 Stefan Hajnoczi wrote:
>> How about adding a filesystem type to KubeVirt that attaches a container
>> image?
>> [...]
>> Thanks to bootc the kernel inside the image can be booted.
>
> Hmm yes... but part of the idea of bootc is that the VM system can update
> itself in place. This type of model is a bit more like PXE booting, right?

Yes.

This is a container, intended for bare metal (as well as virt or reg container), how is this then intended to be booted on BM?

I assume that you still need a bootloader, kernel, and initramfs which is then ultimately pivoting into the container as a rootfs?
On Wed, Mar 8, 2023, at 9:18 AM, Fabian Deutsch wrote:
>
> This is a container, intended for bare metal (as well as virt or reg
> container), how is this then intended to be booted on BM?
>
> I assume that you still need a bootloader, kernel, and initramfs which
> is then ultimately pivoting into the container as a rootfs?
This is what "bootc install" does:
https://github.com/containers/bootc/#using-bootc-install
It uses the tools inside the (privileged) container to create the desired root filesystem at the time of install to the target block device, does a grub install etc. Note that when run as a privileged container, it uses the running kernel to make the filesystems, but e.g. `mkfs.xfs` etc. come from the container userspace.
Once an install is done and the machine booted, `bootc upgrade` itself pulls the updated container images. I also plan to roll in the bootupd logic, so `bootc bootloader upgrade` would also update the bootloader (grub2/shim/etc.)
Another way to say this is: bootc is designed for use anywhere one might use apt/yum/etc. today. Once an install is done, it owns the kernel and userspace updates.
That said, there's also e.g. `bootc install-to-filesystem` which is intended to support external install software (e.g. Anaconda and the like) for more complex block storage setups (LVM/Stratis/etc.)
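A rough sketch of the default flow (the image name and target device
are assumptions, and the exact CLI is still evolving; see the README
linked above):

  # Run the installer from the privileged container itself; mkfs, the
  # grub install, etc. all come from the container userspace.
  podman run --rm --privileged --pid=host \
      quay.io/myuser/example:latest \
      bootc install /dev/vda

  # After the installed system boots, it updates itself in place:
  bootc upgrade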
> IOW like Fedora IoT or Silverblue are booted and updated in-place?
Yes; bootc also aims to be a successor to rpm-ostree, which those things use. It's in fact today a seamless operation to switch those to pull bootable containers instead; see `bootc switch`.
https://pagure.io/releng/issue/11047 tracks having Fedora ship compatible containers for Fedora Silverblue; we're very close to doing so.
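For example, on an existing ostree-based system (the target image name
here is made up):

  bootc switch quay.io/fedora/fedora-silverblue:latest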
On Thursday, March 9, 2023 at 10:14:10 AM UTC-5 Colin Walters wrote:
On Wed, Mar 8, 2023, at 4:27 PM, Colin Walters wrote:
> On Wed, Mar 8, 2023, at 2:32 PM, Fabian Deutsch wrote:
>>
>> I think it also matters, because it's "host" sided to "install",
>
> I wouldn't say that, no. "Installation" of bootc containers is still
> intended to be done by default in a guest context.
Here's another way to look at it: bootc is explicitly intended to "mirror" podman in system architecture. Today, by default KubeVirt does not need to do anything to enable podman to work inside a guest. Of course, one *can* create cross-intersections here (for example, having guests do pull-through caching of container images from a registry hosted "natively" in the kubevirt cluster). And note that that specific use case will also literally *just work* with bootc because we use https://github.com/containers/image behind the scenes so we also honor /etc/containers/registries.conf.
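For illustration, that kind of mirror entry in an
/etc/containers/registries.conf.d/ drop-in might look like this (the
in-cluster registry host is made up):

  [[registry]]
  prefix = "quay.io"
  location = "quay.io"

  [[registry.mirror]]
  location = "registry.mycluster.svc:5000"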
What you guys seem to be thinking does make some sense - but I don't think it's at all required. It's again not the primary focus for bootc, and I am skeptical we should even think of such hypervisor-managed boots as something to aim for by default. Another way to say this is: there are systems out there that *always* PXE boot on bare metal. But I wouldn't call that the default.

I'm not sure anyone else is on board with the idea I proposed :).

I agree with you that it's not required in order to use bootable container images in KubeVirt. Using bootc inside a VM the same way it's used on bare metal is possible.
What I find exciting about bootable container images is the opportunity to simplify the VM workflow. If you build VMs as container images, then the extra steps of getting a Persistent Volume and running bootc inside the VM or using containerDisk to copy the contents of the image onto a Persistent Volume seem like friction. The workflow is simpler if the VM can launch directly from a bootable container image.
This direct launch doesn't work for some use cases like running a traditional Linux "pet" system, but it's great for "cattle" VMs. I think users might find it more convenient than the existing workflow.

Stefan
> > What I find exciting about bootable container images is the opportunity to
> > simplify the VM workflow. If you build VMs as container images, then the
> > extra steps of getting a Persistent Volume and running bootc inside the VM
> > or using containerDisk to copy the contents of the image onto a Persistent
> > Volume seem like friction. The workflow is simpler if the VM can launch
> > directly from a bootable container image.
> >
>
> Just a note on Persistent Volumes, they aren't necessary. By design,
> container images are ephemeral, and so are the container disks, which we
> could classify as the corresponding VM disk version.
Thanks for pointing this out. I misunderstood how containerDisk (and
KubeVirt's Volumes in general) works. containerDisk is close to what I'm
thinking about, except it uses a block device instead of a file system.
I think what I'm imagining is a virtiofs equivalent of containerDisk.
Today KubeVirt only supports virtiofs on PVCs and has no equivalent of
containerDisk?
That, plus bootable containers support so KubeVirt knows how to parse
the container image and extract the kernel/initramfs without manual
configuration.
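For reference, this is roughly how a containerDisk is declared today
(demo image name from the KubeVirt docs); the idea above would be the
same shape of declaration, but exposed to the guest via virtiofs rather
than a block device:

  spec:
    domain:
      devices:
        disks:
        - name: containerdisk
          disk:
            bus: virtio
    volumes:
    - name: containerdisk
      containerDisk:
        image: quay.io/kubevirt/fedora-cloud-container-disk-demo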
> This totally diverges from the topic, but it is kind of related. What I
> find more interesting would be an option to say that this image will be
> used by a VM-based runtime/workload, so that the container runtime will
> store the image on a device instead of a filesystem. With this option,
> the container image building will remain the same, but you could decide
> at runtime how this image will be used (and stored). By the way, this was
> the idea behind this PR in cri-o (https://github.com/cri-o/cri-o/pull/5624)
I think there is interesting stuff to be done in that area too.
Kata Containers had a trick to pass through the underlying block device
to the sandbox VM. It seemed unsafe to me (because the container runtime
would have it mounted at the same time!) and I don't think it's used
much, but I think it was an optimization for the device-mapper Docker
storage driver.
Hopefully in practice the same container image isn't used by VM-based
and regular containers at the same time because the same image would be
extracted twice when two storage drivers are configured. It would be
cool if the same container image blob could be either mounted on the
host or passed through to a sandbox VM. The tar file format that OCI
images use isn't a good fit for that though, so OCI images would still
need to be extracted first.
Stefan
On Wed, Mar 8, 2023, at 2:32 PM, Fabian Deutsch wrote:
>
> I think it also matters, because it's "host" sided to "install",
I wouldn't say that, no. "Installation" of bootc containers is still intended to be done by default in a guest context. I'm actually working on a "bootc install --takeover" mode where we move the running privileged container into RAM, and then rewrite the root volume.
So in IaaS platforms like AWS and KubeVirt, one could provision an existing stock cloud image (whether that's a "traditional" Fedora-derivative cloud image that uses yum and cloud-init, or Ubuntu, or whatever), and the cloud-init metadata injects code that runs `podman run --privileged quay.io/myuser/example:latest --takeover /`; everything that existed in the default image backing store (EBS/etc.) is then wiped and replaced with the content of the container.
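As a sketch, the injected cloud-init user-data might look like this
(the exact bootc flag syntax is still WIP and an assumption):

  #cloud-config
  runcmd:
    # Wipe the stock image in place and replace it with the container's content.
    - podman run --rm --privileged --pid=host quay.io/myuser/example:latest bootc install --takeover /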
On Wed, Mar 15, 2023, at 5:52 PM, dvo...@redhat.com wrote:
> Oh, interesting. So once `bootc install --takeover` is implemented,
> there's really nothing stopping anyone from demoing this using bootc in
> kubevirt.
Yeah, `--takeover` will make it *way* easier to use in cloud scenarios since you
can just start from an existing (e.g. yum based/traditional) cloud image. I've
been tinkering with this in the background and posted my WIP so far
https://github.com/containers/bootc/pull/78
I think it's getting pretty close...
> It would be cool to see someone demo a multi-stage container build
> for an application that both built the image and shipped it in a
> bootable container.
Can you elaborate on this a bit? What do you mean by "application" here?
We do have https://github.com/coreos/layering-examples
and specifically https://github.com/coreos/layering-examples/tree/main/inject-go-binary
demonstrates a multi-stage build; but there's nothing really novel in that on the build side - which is really the point =) What's new is on the client end.
By "application" are you thinking something like "single purpose machines (VMs)"? Something more like in the heyday of virtualization when [software appliances](https://en.wikipedia.org/wiki/Software_appliance) were the rage, and there was work on standardizing the OVA format for shipping them around etc? Replacing that type of flow with bootable containers kind of makes sense...but on the other hand, ISTM that actually regular non-bootable containers is better for most of that. I know people who want to take that to the next step are using tooling to package their containers up with a host system to re-create the appliance type flow.
A key thing bootc is trying to enable here is: given a CoreOS-like system, make it trivial to inject say the Cloudflare agent or nagios installed as a traditional package in a container build, then boot that image. The end user combines the OS base image with their agents and customization, not the app vendor.
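A sketch of that kind of derived build (the base image is the one used
in the layering examples above; nagios stands in for the user's agent of
choice):

  FROM quay.io/fedora/fedora-coreos:stable
  # Layer a traditional package on top of the OS base image; `ostree
  # container commit` is the pattern used in the layering examples.
  RUN rpm-ostree install nagios && ostree container commit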
> Then use that container to launch the application
> in a KubeVirt VM (using cloud-init metadata + bootc takeover for now).
Right. But the other flow that will make total sense is a convenient process to go from "bootable container" -> "boot image" (in the KubeVirt case a containerdisk). And optimizing *that* is its own interesting topic. I could imagine it as a KubeVirt native feature, but OTOH, because of other popular non-container-native IaaS virt setups, I think the default case is going to be a "service" running as a distinct guest.
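The packaging end of that conversion is already simple today; per the
containerdisk convention, the generated disk image just goes under
/disk/ in a new image, e.g.:

  FROM scratch
  # KubeVirt expects containerdisk images to carry the disk under /disk/
  # (ownership may need adjusting for the qemu user).
  ADD disk.qcow2 /disk/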
> I think a demo like that would open up more possibilities in the
> discussion by giving us a clear target for the workflow we're trying to
> optimize.
Agree.