Proposal Draft for OCI Image Spec V2

Till Wegmüller

May 23, 2020, 12:59:37 PM
to dev
Hello Everyone.

I took the liberty of advancing the OCIv2 discussion a bit by writing
down my own thoughts about the image spec in a Markdown document, to
help build a picture of the possibilities. I look forward to discussing
these thoughts and to adding the experience of the rest of the
community to either this or the final proposal.

Without further ado, happy reading.

https://gist.github.com/Toasterson/1dff780fe3a6339041f9b7604be3f068

-Till

Akihiro Suda

May 23, 2020, 1:26:12 PM
to Till Wegmüller, dev
Is this related to Aleksa's restic-based OCI v2 proposal?

On Sun, May 24, 2020 at 1:59, Till Wegmüller <toast...@gmail.com> wrote:
--
To unsubscribe from this group and stop receiving emails from it, send an email to dev+uns...@opencontainers.org.

Till Wegmüller

May 23, 2020, 1:27:15 PM
to Akihiro Suda, dev
Yes, this is related to those talks, and it adds new ideas on top of them.


Aleksa Sarai

May 23, 2020, 10:56:06 PM
to Till Wegmüller, dev
As discussed on the last call I was on, we should first agree on
requirements before we start discussing concrete proposals. The reason
is quite simple -- we need to establish which things are a priority and
what use cases folks have. And sorry for not getting around to this last
week, I will set up a HackMD and post it on the list on Monday.

To take your proposal as an example:

* It doesn't fulfil the canonical representation criterion, meaning
that different implementations will generate different images. Now,
this isn't as bad as with tar layers (the file data blobs will be the
same) but it does have an impact on image reproducibility.

* It would be at least slightly frustrating to implement an in-kernel
filesystem driver based on the manifest. If the intention is for this
to be one of the things included in the signed image, then ideally it
should be usable from inside the kernel without any modification.

In the actual proposal, it might be necessary for some processing
steps to be able to run an OCIv2 image inside the kernel-driver, but
we should avoid any processing steps as much as we can -- since each
processing step reduces the validity of the image signature.
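
To make the canonical-representation point from the first bullet concrete,
here is a minimal sketch. The manifest layout and the field names ("path",
"sha256") are made up for illustration and are not taken from either
proposal. It only shows that the same set of entries hashes differently
depending on order unless the serialisation is pinned down:

```python
import hashlib
import json


def naive_digest(entries):
    """Digest of the manifest exactly as the producer happened to order it."""
    blob = json.dumps(entries).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def canonical_digest(entries):
    """Digest after sorting entries by path and fixing the JSON encoding."""
    ordered = sorted(entries, key=lambda e: e["path"])
    blob = json.dumps(ordered, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# Two producers describe the same root filesystem in a different order.
a = [{"path": "/bin/sh", "sha256": "aa"}, {"path": "/etc/passwd", "sha256": "bb"}]
b = list(reversed(a))
```

With these definitions the naive digests of `a` and `b` differ while the
canonical digests agree, which is the property a signed, reproducible image
would presumably want.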

To show I'm not just picking sides, my own proposal also suffers from
problems:

* Compression is not included at the moment, and chunking is done at
the blob level which is just begging for a breakage (given how
complicated some chunking algorithms can be). It also explodes the
number of blobs being transferred considerably.

* It's not really usable from a kernel driver either. While we could
parse JSON in-kernel, really I think that the rootfs format shouldn't
be JSON at all -- it should be a binary format which can be easily
navigated in-kernel. The nice thing about the files being
content-addressed is we can treat the checksum as a kind of (virtual)
block address.
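
As a rough sketch of what "treating the checksum as a kind of block
address" could mean, here is a toy content-addressed store. The class name
`BlobStore` and its methods are hypothetical, and an in-memory dict stands
in for whatever on-disk or in-kernel structure a real format would use:

```python
import hashlib


class BlobStore:
    """Toy content-addressed store: the SHA-256 digest is the lookup key."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data  # identical content lands in the same slot
        return digest

    def get(self, digest: str) -> bytes:
        data = self._blobs[digest]
        # Re-hash on read so corruption is detected before the data is used.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("blob does not match its address")
        return data
```

Storing the same bytes twice returns the same address and occupies one
slot, so deduplication falls out of the addressing scheme itself.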

Both of our formats also suffer from the issue that while they do allow
for reduced data transfer, they increase round-trips by having many more
blobs. I think it would be useful to consider having the format be such
that you could optimise such transfer problems through HTTP multipart
range requests.
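
For illustration, this is roughly what batching several chunk fetches into
one request could look like on the client side. The helper name
`range_header` is hypothetical, and whether registries would honour
multi-range requests is an assumption; a server is free to ignore the
header and return the whole body:

```python
def range_header(ranges):
    """Build a single HTTP Range header value from (start, end) byte ranges.

    Ranges are inclusive, as in the HTTP range-request syntax; overlapping
    or adjacent ranges are merged so fewer parts are requested.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return "bytes=" + ",".join(f"{s}-{e}" for s, e in merged)
```

A server that supports this answers with a 206 response of type
multipart/byteranges, so many small pieces cost one round-trip instead of
one per blob.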

And the stargz proposal also suffers from a few similar problems as
well. Not to mention that all proposals have effectively copied a
mistake from the original image-spec -- character and block devices
shouldn't be included in images because major/minor numbers can change
between machines (and even reboots of the same machine). That is
something we need to either fix (in the way systemd has done by
leveraging /proc/filesystems) or avoid by not allowing them in images at all.
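
A small sketch of how a tool could at least detect the problem in today's
tar layers, using Python's standard tarfile module. The function names and
the sample layer are made up for illustration; the point is only that the
major/minor pair is frozen into the archive at build time:

```python
import io
import tarfile


def device_entries(tar_bytes: bytes):
    """Return (name, major, minor) for every char/block device in a tar archive."""
    found = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.ischr() or member.isblk():
                found.append((member.name, member.devmajor, member.devminor))
    return found


def sample_layer() -> bytes:
    """Build a tiny in-memory layer with one regular file and one char device."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"root:x:0:0::/root:/bin/sh\n"
        info = tarfile.TarInfo("etc/passwd")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

        dev = tarfile.TarInfo("dev/null")
        dev.type = tarfile.CHRTYPE
        dev.devmajor, dev.devminor = 1, 3  # numbers recorded at build time
        tar.addfile(dev)
    return buf.getvalue()
```

An image builder or validator could use a check like `device_entries` to
reject or warn about such entries, which is one way of reading the "not
allowing them in images at all" option.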

I note that the proposal you have looks like the mtree format. Don't get
me wrong, the mtree format is quite useful (in fact, umoci uses it
internally as part of its diff generation code). And I do appreciate the
wish to simplify the format, though I would argue that JSON is the least
over-complicated or legacy-laden part of the image-spec. And my
experiences with mtree have shown that it's much better equipped as a
"supplementary" manifest format.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

Till Wegmüller

May 24, 2020, 8:37:53 AM
to Aleksa Sarai, dev
On 24.05.20 04:55, Aleksa Sarai wrote:
> On 2020-05-23, Till Wegmüller <toast...@gmail.com> wrote:
>> Hello Everyone.
>>
>> I took the liberty to advance the topic of the OCIv2 discussion a
>> bit and write down my own thoughts about the image spec into a
>> Markdown Document to help build a picture of possibilities. I
>> look forward to discuss these thoughts and add the experience of
>> the rest of the community to either this or the final proposal.
>>
>> Without further ado happy reading.
>>
>> https://gist.github.com/Toasterson/1dff780fe3a6339041f9b7604be3f068
>
>>
> As discussed on the last call I was on, we should first agree on
> requirements before we start discussing concrete proposals. The
> reason is quite simple -- we need to make sure what things are a
> priority and what usecases folks have. And sorry for not getting
> around to this last week, I will set up a HackMD and post it on the
> list on Monday.
>
Yes, that will help a lot with a collaborative drafting process, and
don't worry about rushing it in. I had this at the top of my head and
wanted to write it down and share it.

> To take your proposal as an example:
>
> * It doesn't fulfil the canonical representation criterion,
> meaning that different implementations will generate different
> images. Now, this isn't as bad as with tar layers (the file data
> blobs will be the same) but it does have an impact on image
> reproducibility.

This is not in what I have written yet, yes. I see it as impossible to
find a complete image spec everybody can use, but we can align on a
lowest common denominator of things every implementation fills out
the same way. That will be mostly the metadata we have currently, I think.
Unless there is history which people wanted removed.

>
> * It would be at least slightly frustrating to implement an
> in-kernel filesystem driver based on the manifest. If the intention
> is for this to be one of the things included in the signed image,
> then ideally it should be usable from inside the kernel without any
> modification.
>
> In the actual proposal, it might be necessary for some processing
> steps to be able to run an OCIv2 image inside the kernel-driver,
> but we should avoid any processing steps as much as we can -- since
> each processing step reduces the validity of the image signature.
>
> To show I'm not just picking sides, my own proposal also suffers
> from problems:
>
> * Compression is not included at the moment, and chunking is done
> at the blob level which is just begging for a breakage (given how
> complicated some chunking algorithms can be). It also explodes the
> number of blobs being transferred considerably.
>
> * It's not really usable from a kernel driver either. While we
> could parse JSON in-kernel, really I think that the rootfs format
> shouldn't be JSON at all -- it should be a binary format which can
> be easily navigated in-kernel. The nice thing about the files
> being content-addressed is we can treat the checksum as a kind of
> (virtual) block address.

Any binary format would need unpacking for any filesystem other than
the format it is sent as. And that's what that binary format is:
a filesystem. The only way I am aware of to make that seamless is to
prepare the backing filesystem locally in an empty state and then
populate it with the files directly from the download.

Also, thinking about the kernel driver, this sounds like what SmartOS
did with their images, which are essentially ZFS filesystem blobs
saved as a file and bundled with JSON metadata. The on-disk format of
ZFS is, as far as I have gathered, not that complicated, so it could
give you some inspiration on how to design a filesystem format.

>
> Both of our formats also suffer from the issue that while they do
> allow for reduced data transfer, they increase round-trips by
> having many more blobs. I think it would be useful to consider
> having the format be such that you could optimise such transfer
> problems through HTTP multipart range requests.

My hope was on HTTP/2 or a transport encapsulation. As that can be a
whole spec in itself, I don't want to propose anything in that
direction as of yet.

>
> And the stargz proposal also suffers from a few similar problems
> as well. Not to mention that all proposals have effectively copied
> a mistake from the original image-spec -- character and block
> devices shouldn't be included in images because major/minor numbers
> can change between machines (and even reboots of the same machine).
> That is something we need to either fix (in the way systemd has
> done by leveraging /proc/filesystems) or by not allowing them in
> images at all.
>

Ah, on illumos our /devices is a filesystem that is not present during
the imaging process, only during the runtime of a Zone. And /dev is a set
of forcefully overridden symlinks. I did not consider that we will need
an exception or handling for that in other OSes.

> I note that the proposal you have looks like the mtree format.
> Don't get me wrong, the mtree format is quite useful (in fact,
> umoci uses it internally as part of its diff generation code). And
> I do appreciate the wish to simplify the format, though I would
> argue that JSON is the least over-complicated or legacy-laden part
> of the image-spec. And my experiences with mtree have shown that
> it's much better equipped as a "supplementary" manifest format.

No, it's not mtree, but they probably share the same ancestry. It's the
format used by the Image Packaging System. I personally have not had
such deep experience with JSON. From what I hear it should not be
complicated, but I don't want to just blindly step into that. I am
happy with any formatting we end up with.

Peng Tao

May 24, 2020, 12:42:25 PM
to Aleksa Sarai, Till Wegmüller, dev
On Sun, May 24, 2020 at 10:56 AM Aleksa Sarai <cyp...@cyphar.com> wrote:
>
> On 2020-05-23, Till Wegmüller <toast...@gmail.com> wrote:
> > Hello Everyone.
> >
> > I took the liberty to advance the topic of the OCIv2 discussion a bit
> > and write down my own thoughts about the image spec into a Markdown
> > Document to help build a picture of possibilities. I look forward to
> > discuss these thoughts and add the experience of the rest of the
> > community to either this or the final proposal.
> >
> > Without further ado happy reading.
> >
> > https://gist.github.com/Toasterson/1dff780fe3a6339041f9b7604be3f068
>
> As discussed on the last call I was on, we should first agree on
> requirements before we start discussing concrete proposals. The reason
> is quite simple -- we need to make sure what things are a priority and
> what usecases folks have. And sorry for not getting around to this last
> week, I will set up a HackMD and post it on the list on Monday.
>
Yes indeed. From what I can see, the current image format has the
following drawbacks:
-. While only a small fraction of an image is used by applications,
runtime has to wait for an entire image to be pulled before creating
new containers.
-. Deduplication at image layer level is less efficient
-. Metadata only modification would cause the file data to be saved
again in the new layer
-. Files modified in multiple layers are downloaded multiple times
while only the last modified file data is actually usable for
containers
-. Deleted files/directories are still downloaded when pulling an image
-. Image data are not verifiable after being decompressed
-. The tar format has its own drawbacks
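
The "modified in multiple layers" and "deleted files are still downloaded"
drawbacks above can be sketched with a toy model. Here a layer is just a
dict from path to content, and `None` is a stand-in for a whiteout entry;
none of this mirrors the real tar layer encoding:

```python
WHITEOUT = None  # stand-in marker for "file deleted in this layer"


def flatten(layers):
    """Apply layers bottom-to-top and return the final path -> content view."""
    rootfs = {}
    for layer in layers:
        for path, content in layer.items():
            if content is WHITEOUT:
                rootfs.pop(path, None)
            else:
                rootfs[path] = content
    return rootfs


def transferred_vs_usable(layers):
    """Bytes pulled across all layers versus bytes visible in the final rootfs."""
    pulled = sum(len(c) for layer in layers for c in layer.values() if c is not WHITEOUT)
    usable = sum(len(c) for c in flatten(layers).values())
    return pulled, usable


layers = [
    {"/app/bin": b"v1" * 500, "/tmp/cache": b"x" * 300},
    {"/app/bin": b"v2" * 500},   # same file modified again
    {"/tmp/cache": WHITEOUT},    # file deleted in a later layer
]
```

In this toy example 2300 bytes are pulled but only 1000 bytes end up
visible to the container, since the first version of /app/bin and the
deleted cache file are transferred for nothing.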

We may have different binary format proposals but I think we should
agree on some first rules before designing the binary formats. And
what I have in mind as first rules for v2 image format requirements:
-. Minimize the time spent on pulling images in the container lifecycle
-. Be very efficient about image data storage and data transfer during pulling
-. Support end-to-end data integrity
Yes, indeed. And to further complicate things (a little bit;), we have
implemented a format that allows HTTP range requests and chunk level
deduplication. Some key characteristics are:
-. Container images are downloaded on demand
-. Chunk level data deduplication with configurable chunk size
-. Flatten image metadata and data to remove all intermediate layers
-. Only usable image data is saved when building a container image
-. Only usable image data is downloaded when running a container
-. End-to-end image metadata and data integrity
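
As a rough sketch of chunk-level deduplication with a configurable chunk
size: the code below assumes fixed-size chunks and SHA-256 digests, which
is my assumption for the sake of a short example and not a description of
the format mentioned above. The function names are hypothetical:

```python
import hashlib


def chunk_digests(data: bytes, chunk_size: int):
    """Split data into fixed-size chunks and return their SHA-256 digests in order."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def missing_chunks(wanted, local_store):
    """Digests that still have to be fetched, each listed once."""
    seen = set(local_store)
    todo = []
    for digest in wanted:
        if digest not in seen:
            seen.add(digest)
            todo.append(digest)
    return todo
```

If a new image version shares most chunks with one already on disk, only
the digests returned by `missing_chunks` need to be downloaded, and those
could in turn be fetched on demand.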

> And the stargz proposal also suffers from a few similar problems as
> well. Not to mention that all proposals have effectively copied a
> mistake from the original image-spec -- character and block devices
> shouldn't be included in images because major/minor numbers can change
> between machines (and even reboots of the same machine). That is
> something we need to either fix (in the way systemd has done by
> leveraging /proc/filesystems) or by not allowing them in images at all.
>
I think most of the existing formats are compatible architecturally.
They also suffer from similar or different problems here and there.
What we need is a consensus on what the v2 format should look like, and
then to consolidate that into a format that fits the consensus.

Cheers,
Tao
--
Into Sth. Rich & Strange

Peng Tao

May 24, 2020, 12:52:57 PM
to Till Wegmüller, Aleksa Sarai, dev
Or we do not use a local file system format at all, like all the FUSE
based proposals have done. The local file system is the image itself.
And to address Aleksa's concern, we can bake a format in userspace and
move to a kernel driver when necessary. Similar attempts have been
tried before, like incrementalfs.

Tycho Andersen

May 24, 2020, 1:03:52 PM
to Till Wegmüller, Aleksa Sarai, dev
Squashfs is an example of a self-contained binary format that is
seamless, i.e. has no unpacking step or other fiddling. It is not a
given that there be an "unpack" step.

I think it is very important that we define this format in such a way
that it is easy for the kernel to mount it.

Tycho

Aleksa Sarai

May 24, 2020, 6:48:46 PM
to Till Wegmüller, dev
On 2020-05-24, Till Wegmüller <toast...@gmail.com> wrote:
> On 24.05.20 04:55, Aleksa Sarai wrote:
> > On 2020-05-23, Till Wegmüller <toast...@gmail.com> wrote:
> >> Hello Everyone.
> >>
> >> I took the liberty to advance the topic of the OCIv2 discussion a
> >> bit and write down my own thoughts about the image spec into a
> >> Markdown Document to help build a picture of possibilities. I
> >> look forward to discuss these thoughts and add the experience of
> >> the rest of the community to either this or the final proposal.
> >>
> >> Without further ado happy reading.
> >>
> >> https://gist.github.com/Toasterson/1dff780fe3a6339041f9b7604be3f068
> >
> >>
> > As discussed on the last call I was on, we should first agree on
> > requirements before we start discussing concrete proposals. The
> > reason is quite simple -- we need to make sure what things are a
> > priority and what usecases folks have. And sorry for not getting
> > around to this last week, I will set up a HackMD and post it on the
> > list on Monday.
>
> Yes that will help much with a collaborative drafting process and don't
> worry about rushing it in. I had this on the top of my head and wanted
> to write it down and share it.

Fair enough, I didn't mean to sound grouchy.

That was mostly just a knee-jerk reaction based on something that
happened during the distribution-spec discussions, where a few prototype
proposals (which were a massive departure from the distribution-spec we
have now) flew around and then we spent more time arguing about the
merits of the (almost identical) proposals rather than actually having a
solid and practicable draft specification to go and implement.

I'd like to avoid doing that if at all possible, hence why I made sure
to point out that my own proposal (and all the other proposals I've
seen) have gaps which we should address collectively. Among many other
things we need to discuss is what filesystem metadata should actually be
stored inside the container image.

> > To take your proposal as an example:
> >
> > * It doesn't fulfil the canonical representation criterion,
> > meaning that different implementations will generate different
> > images. Now, this isn't as bad as with tar layers (the file data
> > blobs will be the same) but it does have an impact on image
> > reproducibility.
>
> This is not in what I have written yet yes. I see it as impossible to
> find a complete image spec everybody can use, but we can align on a
> common smallest denominator of things every implementation fills out
> the same. That will be mostly the metadata we have currently I think.
> Unless there is history which people wanted removed.

I was more focusing on the idea that you can (by design) have multiple
representations of the same root filesystem by (for instance)
rearranging the order of entries in the manifest.

I wasn't suggesting that we ship XFS (or whatever) images -- that would
be a bad idea for a variety of reasons. The idea is that we would
develop a kernel driver *for whatever format we end up using*. This
wouldn't be required in order to use OCIv2, but it would be an option
for users that want additional assurances about the code they're
running. And yes, you would only ever want to use the kernel driver for
images which are signed by a trusted vendor.

> Also thinking about the kernel driver, this sounds like what SmartOS
> did with their images, which are essentially ZFS Filesystem blobs
> saved as file and bundled with JSON metadata. The on-disk format of
> ZFS is as far as I have gathered not that complicated so it could give
> you an inspiration on how to make a filesystem format.

I use ZFS myself, and it's an awesome filesystem.

However it would be incredibly unwise to embed ZFS send payloads into
OCI (not to mention it wouldn't actually solve any of the issues we had
with tar archives -- unless you rely on servers running with ZFS
deduplication [which is often not enabled, for good reason] and even
then I would argue it still doesn't solve the primary issue of transfer
duplication). And on systems which don't support ZFS natively you'd have
to run a FUSE driver or other implementation of the ZFS format.

There's also the license issue, but I don't want to rehash that entire
debate. Suffice to say, we can't assume that all OCI users are running
ZFS.

> > Both of our formats also suffer from the issue that while they do
> > allow for reduced data transfer, they increase round-trips by
> > having many more blobs. I think it would be useful to consider
> > having the format be such that you could optimise such transfer
> > problems through HTTP multipart range requests.
>
> My hope was on HTTP2 or a transport encapsulation. As that can be a
> whole spec in itself I don't want to propose anything into that
> direction as of yet.

There is already a spec in OCI for that -- the distribution-spec. This
is actually something we will need to collaborate with them on, so that
we don't cause issues with whatever OCIv2 proposal we have.

> > And the stargz proposal also suffers from a few similar problems
> > as well. Not to mention that all proposals have effectively copied
> > a mistake from the original image-spec -- character and block
> > devices shouldn't be included in images because major/minor numbers
> > can change between machines (and even reboots of the same machine).
> > That is something we need to either fix (in the way systemd has
> > done by leveraging /proc/filesystems) or by not allowing them in
> > images at all.
>
> Ah on illumos our /devices is a filesystem not present during the
> imaging process and only during runtime of a Zone. And /dev a set of
> forcefully overridden symlinks. I did not consider that we will need an
> exception or handling for that in other OSes.

Practically speaking, this is also true for Linux containers (/dev is
configured by the container runtime and has a separate configuration to
the image). However you can always mknod a device anywhere on the
filesystem and if you create a tar archive, you'll get a device inode
which will be unpacked on the destination system.

> > I note that the proposal you have looks like the mtree format.
> > Don't get me wrong, the mtree format is quite useful (in fact,
> > umoci uses it internally as part of its diff generation code). And
> > I do appreciate the wish to simplify the format, though I would
> > argue that JSON is the least over-complicated or legacy-laden part
> > of the image-spec. And my experiences with mtree have shown that
> > it's much better equipped as a "supplementary" manifest format.
>
> No it's not mtree but they probably share the same ancestry. It's the
> format used by the Image Packaging System. I personally have not had
> such deep experience into JSON. From what I hear it should not be
> complicated, but I don't want to just blindly step into that. I am
> happy with any formatting we end up with.

I might have buried the lede with my comment -- I was responding to your
comment in the proposal that we should avoid complicating the format
(and you then mention JSON in the same paragraph). My point was just
that the encoding is probably the least over-complicated thing I can
imagine in the spec -- but it seems we agree that the encoding really
shouldn't be a blocking issue.

Till Wegmüller

May 24, 2020, 7:16:59 PM
to Aleksa Sarai, dev
On 25.05.20 00:48, Aleksa Sarai wrote:

>
> Fair enough, I didn't mean to sound grouchy.
>
> That was mostly just a knee-jerk reaction based on something that
> happened during the distribution-spec discussions, where a few
> prototype proposals (which were a massive departure from the
> distribution-spec we have now) flew around and then we spent more
> time arguing about the merits of the (almost identical) proposals
> rather than actually having a solid and practicable draft
> specification to go and implement.
>
> I'd like to avoid doing that if at all possible, hence why I made
> sure to point out that my own proposal (and all the other proposals
> I've seen) have gaps which we should address collectively. Among
> many other things we need to discuss is what filesystem metadata
> should actually be stored inside the container image.
>

+1

>>> To take your proposal as an example:
>>>
>>> * It doesn't fulfil the canonical representation criterion,
>>> meaning that different implementations will generate different
>>> images. Now, this isn't as bad as with tar layers (the file
>>> data blobs will be the same) but it does have an impact on
>>> image reproducibility.
>>
>> This is not in what I have written yet yes. I see it as
>> impossible to find a complete image spec everybody can use, but
>> we can align on a common smallest denominator of things every
>> implementation fills out the same. That will be mostly the
>> metadata we have currently I think. Unless there is history which
>> people wanted removed.
>
> I was more focusing on the idea that you can (by design) have
> multiple representations of the same root filesystem by (for
> instance) rearranging the order of entries in the manifest.

Oops, I think I may have forgotten to detail the merging part. Yep,
that is actually the case. Yes, you are right, as it is written one can
modify that in unintended ways. I won't continue writing on it, as I
want to move further with a collaborative approach. But there is
supposed to be a mechanism in there that allows only one representation
to be in the manifest and keeps order from affecting the image. Or are
you referring to the "variants" and "facets"?

What additional assurances are you thinking about? Are you thinking
about syscall filtering or speed or immutability?
about Syscall filtering or speed or immutability?

>
>>> Both of our formats also suffer from the issue that while they
>>> do allow for reduced data transfer, they increase round-trips
>>> by having many more blobs. I think it would be useful to
>>> consider having the format be such that you could optimise such
>>> transfer problems through HTTP multipart range requests.
>>
>> My hope was on HTTP2 or a transport encapsulation. As that can be
>> a whole spec in itself I don't want to propose anything into
>> that direction as of yet.
>
> There is already a spec in OCI for that -- the distribution-spec.
> This is actually something we will need to collaborate with them
> on, so that we don't cause issues with whatever OCIv2 proposal we
> have.
>

+1

> I might have buried the lede with my comment -- I was responding to
> your comment in the proposal that we should avoid complicating the
> format (and you then mention JSON in the same paragraph). My point
> was just that the encoding is probably the least over-complicated
> thing I can imagine in the spec -- but it seems we agree that the
> encoding really shouldn't be a blocking issue.
>

Yes encoding is not a blocking issue.

Aleksa Sarai

May 24, 2020, 10:06:04 PM
to Till Wegmüller, dev
It's about eliminating the unpacking step when users go to run an image.
The reason why this matters (for some people) is that if you mandate
that you can only run container images which are signed by a vendor's
key, having an unpacking stage where you take the signed binary and
convert it to a different format (expand it on the filesystem) where it
is no longer signed is hardly ideal.

This is somewhat related to immutability in concept, but it's more about
eliminating an unpacking step which effectively renders the signatures
on images only useful for transfers. Note that this doesn't mean we have
to have a format which is immediately mountable, just that any necessary
manipulations need to be minimal, obvious, and safe.

I don't expect that the in-kernel driver would be any faster at runtime
than if you unpacked it because once unpacked the image is just a
regular directory on a filesystem. I would even argue the in-kernel
driver might be slower because we're not going to end up with a format
which is as optimised as on-disk filesystem formats (unless we just copy
an existing on-disk filesystem format).

Tycho is really the best person to ask though, since he's the guy who
told me about this concern. The best way to think about it is the
squashfs example he gave in a sister email.

Peng Tao

May 24, 2020, 11:43:58 PM
to Aleksa Sarai, Till Wegmüller, dev
There is another way. Given that an image is immutable by nature, the
writable layer is just a local representation, and we can create it
using union mounts like overlayfs on top of a read-only image file
system and a writable directory in the local file system. Then we can
make use of the immutability of an image file system to design a
special-purpose on-disk format and optimize it for the read-only case,
rather than for what a general-purpose fs format would look like.
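
A minimal sketch of that layout, assuming the read-only image is already
mounted somewhere. The helper name and the paths in the usage note are
hypothetical; `lowerdir`, `upperdir` and `workdir` are the standard
overlayfs mount options:

```python
def overlay_mount_argv(image_mnt, upper, work, merged):
    """Build the argv for mounting a writable overlay over a read-only image.

    lowerdir is the (already mounted) read-only image; upperdir and workdir
    must live on the same writable local filesystem.
    """
    opts = f"lowerdir={image_mnt},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", opts, merged]
```

For example, `overlay_mount_argv("/img", "/c/upper", "/c/work", "/c/merged")`
yields the arguments one would hand to something like subprocess.run.
Paths containing commas or colons would need escaping, which this sketch
ignores. All container writes land in the upper directory, so the image
underneath stays byte-for-byte what was signed.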

Cheers,
Tao

Tycho Andersen

May 24, 2020, 11:59:39 PM
to Peng Tao, Aleksa Sarai, Till Wegmüller, dev
Yes, although we'd also like to include all solutions to the
complaints Aleksa has in his blog post about duplication, etc.

I think it's possible to solve all of these problems at once if we
think about it hard enough :)

Tycho

Aleksa Sarai

May 25, 2020, 3:14:48 AM
to Peng Tao, Till Wegmüller, dev
I honestly always assumed it would be a read-only mount with overlayfs
on top -- modifying the image would require updating the digests and
would invalidate the signatures. But even with that constraint, I think
it's fair to say that filesystems are a hard problem.
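
To illustrate why any modification invalidates the signature, here is a
small sketch of the digest chain. The manifest shape (path to file digest)
is made up, and HMAC is used purely as a stand-in for a real public-key
signature so the example stays within the standard library:

```python
import hashlib
import hmac
import json


def manifest_digest(files):
    """Digest over a path -> file-digest mapping, serialised deterministically."""
    blob = json.dumps(files, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def sign(digest, key):
    """HMAC stand-in for a real signature; only the shape of the check matters here."""
    return hmac.new(key, digest.encode("ascii"), hashlib.sha256).hexdigest()


def verify(files, signature, key):
    """True only if the manifest still hashes to what was signed."""
    return hmac.compare_digest(sign(manifest_digest(files), key), signature)
```

Changing a single file changes its digest, which changes the manifest
digest, so `verify` fails for the modified image while the untouched
read-only image keeps verifying.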

Peng Tao

May 25, 2020, 12:27:58 PM
to Aleksa Sarai, Till Wegmüller, dev
Totally agree! Yet we can make it a target to optimize for the read-only
use case, and that would be an important difference compared to
general-purpose file systems. It will also have an impact when it comes
to making decisions about trade-offs in file system design and
implementation.

Till Wegmüller

May 26, 2020, 5:03:08 AM
to Peng Tao, Aleksa Sarai, dev
On 25.05.20 18:27, Peng Tao wrote:
> It's about eliminating the unpacking step when users go to run an image.
> The reason why this matters (for some people) is that if you mandate
> that you can only run container images which are signed by a vendor's
> key, having an unpacking stage where you take the signed binary and
> convert it to a different format (expand it on the filesystem) where it
> is no longer signed is hardly ideal.
>
> This is somewhat related to immutability in concept, but it's more about
> eliminating an unpacking step which effectively renders the signatures
> on images only useful for transfers. Note that this doesn't mean we have
> to have a format which is immediately mountable, just that any necessary
> manipulations need to be minimal, obvious, and safe.
>
> I don't expect that the in-kernel driver would be any faster at runtime
> than if you unpacked it because once unpacked the image is just a
> regular directory on a filesystem. I would even argue the in-kernel
> driver might be slower because we're not going to end up with a format
> which is as optimised as on-disk filesystem formats (unless we just copy
> an existing on-disk filesystem format).

Ah yes, that is a simple requirement. Are you planning to run
compute-heavy verification code? Or to run it often, say at every open
syscall? I don't think a binary format is going to make this any easier
for you.

What I would do is verify the way IPS does: validate the manifest text
file against its attached PGP signature, then check every file against
the digest hash (or the multiple hashes) on record for that file. As
long as the manifest has not been manipulated (which the signature
check tells you), you can verify and re-download every file in the
image. A directory as the boundary of the logical image is sufficient
for that code to work. If you have an overlay filesystem on top of
that, you can be absolutely certain that no manipulation or corruption
of the image has occurred, and that lets you keep the mandated promise
to the vendor.

As for the kernel driver to monitor that, I would configure it from
userland with an easily parseable binary format and keep the text
parsing in userland. That also makes the driver more general purpose
instead of tying it to this specific use case. Personally, I would not
put this code in kernel space at all, but rather write a static Rust
(or Go) binary which runs the verification on a trigger, either
inotify or a cron job.
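
A minimal sketch of the per-file digest check described above, in Python
for brevity. It assumes the manifest's signature has already been
verified by some other tool (PGP verification is not shown), and the
manifest layout (relative path mapped to a SHA-256 hex digest) and all
function names are hypothetical, not the IPS format.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every regular file under root (relative path) to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_tree(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return paths that are missing, modified, or absent from the manifest.

    Assumption: the manifest itself was already checked against its
    signature, so its digests can be trusted.
    """
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    # Files on disk that the manifest does not know about.
    extra = set(build_manifest(root)) - set(manifest)
    return sorted(problems + list(extra))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "bin").mkdir()
        (root / "bin" / "app").write_bytes(b"original")
        manifest = build_manifest(root)
        print(verify_tree(root, manifest))  # untouched tree
        (root / "bin" / "app").write_bytes(b"tampered")
        print(verify_tree(root, manifest))  # modified file is reported
```

Any path returned by verify_tree would be a candidate for re-download.
A real tool would run this from the inotify or cron trigger.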

Tycho Andersen

May 26, 2020, 9:32:46 AM5/26/20
to Till Wegmüller, Peng Tao, Aleksa Sarai, dev
On Tue, May 26, 2020 at 11:03:03AM +0200, Till Wegmüller wrote:
> On 25.05.20 18:27, Peng Tao wrote:
> > It's about eliminating the unpacking step when users go to run an image.
> > The reason why this matters (for some people) is that if you mandate
> > that you can only run container images which are signed by a vendor's
> > key, having an unpacking stage where you take the signed binary and
> > convert it to a different format (expand it on the filesystem) where it
> > is no longer signed is hardly ideal.
> >
> > This is somewhat related to immutability in concept, but it's more about
> > eliminating an unpacking step which effectively renders the signatures
> > on images only useful for transfers. Note that this doesn't mean we have
> > to have a format which is immediately mountable, just that any necessary
> > manipulations need to be minimal, obvious, and safe.
> >
> > I don't expect that the in-kernel driver would be any faster at runtime
> > than if you unpacked it because once unpacked the image is just a
> > regular directory on a filesystem. I would even argue the in-kernel
> > driver might be slower because we're not going to end up with a format
> > which is as optimised as on-disk filesystem formats (unless we just copy
> > an existing on-disk filesystem format).
>
> Ah yes. That is a simple requirement. Are you planning to run a compute
> heavy verification code?

Verifying the image signature itself should be enough to verify the
provenance of a running image. Whether this operation is "compute
heavy" or not depends entirely on how the signature is defined. But
yes, we will verify image signatures.

> Or to run it often say at every open syscall?

One strategy would be to attach IMA metadata to every blob, so that
you only have to run it when the bits of the image are opened, not
bits of the file. So, not necessarily.

> I don't think a binary format is going to make this any easier on
> you.

No, but it does make the image smaller. See Aleksa's OCIv2 talk for
what happens when you expand file metadata into json.
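
As a rough illustration of the size point only (this is not the encoding
from the talk), here is a sketch comparing one file's metadata encoded
as compact JSON against a fixed binary layout. The field set and the
struct layout are assumptions made up for this example.

```python
import json
import struct

# Hypothetical per-file metadata record; the fields are an assumption.
entry = {
    "path": "usr/bin/env",
    "mode": 0o100755,
    "uid": 0,
    "gid": 0,
    "size": 43352,
    "mtime": 1590480000,
}

# Hypothetical little-endian layout: mode, uid, gid as u32; size and
# mtime as u64; path length as u16; the path bytes follow the header.
HEADER = struct.Struct("<IIIQQH")


def encode_json(e: dict) -> bytes:
    """Encode the record as compact JSON (no extra whitespace)."""
    return json.dumps(e, separators=(",", ":")).encode("utf-8")


def encode_binary(e: dict) -> bytes:
    """Encode the record as a fixed header followed by the path."""
    path = e["path"].encode("utf-8")
    header = HEADER.pack(
        e["mode"], e["uid"], e["gid"], e["size"], e["mtime"], len(path)
    )
    return header + path


def decode_binary(buf: bytes) -> dict:
    """Inverse of encode_binary, to show no information is lost."""
    mode, uid, gid, size, mtime, plen = HEADER.unpack_from(buf)
    path = buf[HEADER.size:HEADER.size + plen].decode("utf-8")
    return {"path": path, "mode": mode, "uid": uid, "gid": gid,
            "size": size, "mtime": mtime}


if __name__ == "__main__":
    print("json bytes:  ", len(encode_json(entry)))
    print("binary bytes:", len(encode_binary(entry)))
```

The JSON form repeats every key name in every entry, which is where
most of the per-file overhead comes from in this sketch.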

> What I would do, is to do it like IPS does the verification, by
> validating the manifest text file against its attached PGP signature
> and then checking every file against the digest hash or the multiple
> hashes of that file on record. As long as the manifest is not
> manipulated (which you check by the signature) you can verify and
> re-download every file in the image.

Sure, this is one reasonable implementation. The direct-mount
requirement isn't really about how you verify a signature, but how you
construct the image. For example, tar is a very poor format for this,
since it's not seekable. That's why our implementation currently uses
squashfs layers instead of tar layers.
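
To illustrate the seekability point: a tar archive has no central
index, so finding a member means walking the headers from the start.
The sketch below (Python, standard tarfile module) does that one linear
pass and builds an external index, after which a member can be read
with a single seek. The external index is an assumption for
illustration only; it is not how our squashfs-based implementation
works.

```python
from __future__ import annotations

import io
import tarfile


def make_tar(files: dict[str, bytes]) -> bytes:
    """Build an uncompressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_index(raw: bytes) -> dict[str, tuple[int, int]]:
    """Walk every header once and record name -> (data offset, size).

    This linear pass is the cost tar imposes: there is no table to
    consult, so every header before the wanted member must be visited.
    """
    index = {}
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_via_index(raw: bytes, index: dict[str, tuple[int, int]],
                   name: str) -> bytes:
    """Random access: jump straight to the member's data."""
    offset, size = index[name]
    return raw[offset:offset + size]


if __name__ == "__main__":
    raw = make_tar({"etc/hostname": b"box\n", "bin/app": b"\x7fELF..."})
    index = build_index(raw)
    print(index)
    print(read_via_index(raw, index, "bin/app"))
```

A format that ships such a lookup structure inside the image makes the
scan unnecessary, which is what direct mounting needs.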

> A directory as the boundary of the logical image is sufficient for
> that code to work. If you have an overlay filesystem on top of that,
> you can be absolutely certain that no manipulation or corruption to
> the image has occurred.

Not exactly. What if someone escapes the container and changes the
image bits? This is why you need some additional protection like IMA.

> And therefore
> allow you to keep that mandated promise to the vendor. As for the kernel
> driver to monitor that, I would configure it from userland with the
> easily parseable binary format and keep the text parsing in userland.
> That also makes the driver more general purpose than it being tied into
> this specific use case. I personally would not make this code in kernel
> space but rather write a static Rust binary (or Go) which does the
> verification on a trigger either by inotify or cronjob.

Again, it's not about how to do verification. How do you take the
image and display a filesystem? That code necessarily lives in the
kernel (or in fuse).

Tycho

BoLiu

May 27, 2020, 3:14:19 AM5/27/20
to dev, toast...@gmail.com
Hi Aleksa,


On Saturday, May 23, 2020 at 7:56:06 PM UTC-7, Aleksa Sarai wrote:
> On 2020-05-23, Till Wegmüller <toast...@gmail.com> wrote:
> > Hello Everyone.
> >
> > I took the liberty to advance the topic of the OCIv2 discussion a bit
> > and write down my own thoughts about the image spec into a Markdown
> > Document to help build a picture of possibilities. I look forward to
> > discuss these thoughts and add the experience of the rest of the
> > community to either this or the final proposal.
> >
> > Without further ado happy reading.
> >
> > https://gist.github.com/Toasterson/1dff780fe3a6339041f9b7604be3f068
>
> As discussed on the last call I was on, we should first agree on
> requirements before we start discussing concrete proposals. The reason
> is quite simple -- we need to make sure what things are a priority and
> what usecases folks have. And sorry for not getting around to this last
> week, I will set up a HackMD and post it on the list on Monday.


Seems that I couldn't find it on HackMD now, can you please share the link?

thanks,
liubo

Aleksa Sarai

May 27, 2020, 10:10:56 AM5/27/20
to BoLiu, dev, toast...@gmail.com
On 2020-05-27, BoLiu <obuil...@gmail.com> wrote:
> On Saturday, May 23, 2020 at 7:56:06 PM UTC-7, Aleksa Sarai wrote:
> > On 2020-05-23, Till Wegmüller <toast...@gmail.com> wrote:
> > > I took the liberty to advance the topic of the OCIv2 discussion a bit
> > > and write down my own thoughts about the image spec into a Markdown
> > > Document to help build a picture of possibilities. I look forward to
> > > discuss these thoughts and add the experience of the rest of the
> > > community to either this or the final proposal.
> > >
> > > Without further ado happy reading.
> > >
> > > https://gist.github.com/Toasterson/1dff780fe3a6339041f9b7604be3f068
> >
> > As discussed on the last call I was on, we should first agree on
> > requirements before we start discussing concrete proposals. The reason
> > is quite simple -- we need to make sure what things are a priority and
> > what usecases folks have. And sorry for not getting around to this last
> > week, I will set up a HackMD and post it on the list on Monday.
>
> Seems that I couldn't find it on HackMD now, can you please share the link?

I only just managed to finish the initial draft; you can find it at [1].
It's publicly writeable, and I'll post an email about it tomorrow.

[1]: https://hackmd.io/@cyphar/ociv2-brainstorm