OCI image deltas


Alexander Larsson

May 4, 2020, 12:35:36 PM
to dev
Recently I've been working on a delta system for OCI images. Deltas save a lot of network bandwidth when fetching a new version of an image if you already have a previous version available. This is common when, e.g., updating to the newest version of an app or rebasing it on a new base image version. The focus is on using OCI images in settings with limited bandwidth, such as desktop applications or IoT, but this could be interesting for any widely distributed image.

To facilitate this I wrote a tool called tar-diff, which generates a diff between two tar files (optionally compressed). The tar-diff file can be applied by tar-patch, given the extracted files from the first tar file, thereby reconstructing the (uncompressed) second tar file. The uncompressed tar is exactly what we need, since it is what defines the DiffID of an OCI image layer, allowing us to verify the result before using it.

Tar-diff is based on bsdiff with zstd compression, and is very efficient at creating small diffs for typical binaries. Here are some statistics we collected for existing images. Note for example how the median monthly update delta size of the RHEL8 base image is 2.22% of the size of the full version.

In typical use you would pre-generate deltas (generation is somewhat expensive) to the latest version of an image from some selection of older versions (typically the preceding version, but going back two or more versions could also be useful, depending on how often images are updated). Storing deltas that target older versions is not useful, because a delta only helps a client that is moving forward to the delta's target, and moving to an older version is uncommon. So it is typical to remove the deltas that target older versions when creating new ones.
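The retention policy described above could be sketched roughly as follows. This is an illustration only: the `delta` type and the `deltaSources` and `pruneStale` helpers are hypothetical names, and the version list is assumed to be ordered oldest to newest.

```go
package main

import "fmt"

// delta records which image version a delta upgrades from and to.
type delta struct {
	From, To string
}

// deltaSources returns up to `depth` versions immediately preceding the
// latest one. `versions` must be ordered oldest to newest.
func deltaSources(versions []string, depth int) []string {
	if len(versions) < 2 || depth <= 0 {
		return nil
	}
	start := len(versions) - 1 - depth
	if start < 0 {
		start = 0
	}
	return versions[start : len(versions)-1]
}

// pruneStale keeps only the deltas whose target is the latest version.
func pruneStale(deltas []delta, latest string) []delta {
	var kept []delta
	for _, d := range deltas {
		if d.To == latest {
			kept = append(kept, d)
		}
	}
	return kept
}

func main() {
	versions := []string{"v1", "v2", "v3", "v4"}
	latest := versions[len(versions)-1]

	// Deltas left over from when v3 was the latest version.
	existing := []delta{{"v1", "v3"}, {"v2", "v3"}}
	for _, from := range deltaSources(versions, 2) {
		existing = append(existing, delta{From: from, To: latest})
	}
	fmt.Println(pruneStale(existing, latest))
}
```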

To find the tar-diffs that are relevant for upgrading to an image version, the client downloads a manifest that is stored in the same repository as the images. This delta manifest defines an OCI Artifact where each layer in the artifact refers to a tar-diff blob that can generate one of the layers in the real image, and we use the per-layer annotations to record the DiffIDs the layer goes TO and FROM. For each layer in the image being pulled we try to find a tar-diff layer with the corresponding TO annotation and a FROM annotation that is locally available (as an extracted directory of files). If no match for the layer is found, we fall back to downloading the layer as usual.
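The per-layer lookup described above can be sketched as below. This is a simplified illustration, not the containers/image implementation: the annotation keys `example.delta.to` and `example.delta.from`, the `deltaLayer` type, and the `pickDelta` function are all hypothetical, and the real key names are defined by the delta specification.

```go
package main

import "fmt"

// Hypothetical annotation keys; the real names come from the delta spec.
const (
	annotationTo   = "example.delta.to"
	annotationFrom = "example.delta.from"
)

// deltaLayer is a simplified view of one layer descriptor in a delta manifest.
type deltaLayer struct {
	Digest      string
	Annotations map[string]string
}

// pickDelta returns a tar-diff blob that produces the layer with DiffID `to`
// from a layer whose DiffID is present in `local` (already extracted on disk).
func pickDelta(deltas []deltaLayer, to string, local map[string]bool) (deltaLayer, bool) {
	for _, d := range deltas {
		if d.Annotations[annotationTo] == to && local[d.Annotations[annotationFrom]] {
			return d, true
		}
	}
	return deltaLayer{}, false
}

func main() {
	deltas := []deltaLayer{
		{Digest: "sha256:delta1", Annotations: map[string]string{
			annotationFrom: "sha256:old1", annotationTo: "sha256:new1"}},
		{Digest: "sha256:delta2", Annotations: map[string]string{
			annotationFrom: "sha256:old9", annotationTo: "sha256:new2"}},
	}
	local := map[string]bool{"sha256:old1": true}

	for _, to := range []string{"sha256:new1", "sha256:new2"} {
		if d, ok := pickDelta(deltas, to, local); ok {
			fmt.Printf("%s: apply tar-diff %s\n", to, d.Digest)
		} else {
			fmt.Printf("%s: download full layer\n", to)
		}
	}
}
```

The second layer falls back to a full download because its FROM layer is not available locally.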

Just as normal, when pulling an image we first resolve the tag to the digest, but then we need a way to locate the delta manifest for this particular digest. This is currently done by resolving the single tag `_deltaindex` which is an OCI index where each image is a delta manifest. A key in the annotations for each delta manifest declares what image digest it applies to.
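A rough sketch of that lookup, assuming the `_deltaindex` tag has already been fetched as an OCI image index in JSON form. The annotation key `example.delta.target` and the function `findDeltaManifest` are hypothetical; only the `manifests`, `digest`, and `annotations` fields follow the OCI index layout.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical annotation key naming the image digest a delta manifest applies to.
const annotationTarget = "example.delta.target"

// descriptor and index model only the OCI index fields this sketch needs.
type descriptor struct {
	MediaType   string            `json:"mediaType"`
	Digest      string            `json:"digest"`
	Annotations map[string]string `json:"annotations,omitempty"`
}

type index struct {
	SchemaVersion int          `json:"schemaVersion"`
	Manifests     []descriptor `json:"manifests"`
}

// findDeltaManifest returns the digest of the delta manifest that applies to
// imageDigest, or ok=false when the index has no entry for it.
func findDeltaManifest(rawIndex []byte, imageDigest string) (digest string, ok bool, err error) {
	var idx index
	if err := json.Unmarshal(rawIndex, &idx); err != nil {
		return "", false, err
	}
	for _, m := range idx.Manifests {
		if m.Annotations[annotationTarget] == imageDigest {
			return m.Digest, true, nil
		}
	}
	return "", false, nil
}

func main() {
	raw := []byte(`{
	  "schemaVersion": 2,
	  "manifests": [
	    {"mediaType": "application/vnd.oci.image.manifest.v1+json",
	     "digest": "sha256:deltamanifest1",
	     "annotations": {"example.delta.target": "sha256:image1"}}
	  ]
	}`)
	digest, ok, err := findDeltaManifest(raw, "sha256:image1")
	fmt.Println(digest, ok, err)
}
```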

I currently have an implementation of using deltas in containers/image, and code in skopeo to generate deltas:

 https://github.com/containers/image/pull/902

I also put some example deltas at:

  https://hub.docker.com/repository/docker/alexl/fedoratest

A more detailed specification of the current delta implementation, with some examples, is available here:

  https://hackmd.io/@owtaylor/Sk3O_dTXL

As well as a specification for the tar-diff file format:

 https://hackmd.io/@owtaylor/ByqZMKltL

Note that the specification above uses custom media types for the delta manifests, but the current implementation uses the normal OCI manifest media types. This is in order to be able to put the deltas on registries that do not support OCI artifacts (such as Docker Hub).

I wonder if there is any interest in the OCI community in this. It would be nice to get some feedback about the approach and see whether some parts could be done better. Additionally, there might be interest in actually standardizing this under the OCI umbrella.

Tycho Andersen

May 7, 2020, 3:31:20 PM
to d...@opencontainers.org

Tycho Andersen

May 7, 2020, 3:35:33 PM
to d...@opencontainers.org
Hi all,

Oof, pressed the wrong button in mutt on the last one.
Aleksa has been thinking about OCIv2:

https://github.com/openSUSE/umoci/issues/256
https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar

I've also been interested in working on that, ideally to make it so
that we could move away from the tar format and enable direct-mounting
of OCI images.

I see in your PR that you're pretty close to integrating the tar stuff
with containers/image now, but have you thought at all about what a
clean room format for this would look like?

Tycho

Owen Taylor

May 7, 2020, 6:07:13 PM
to Tycho Andersen, d...@opencontainers.org
On Thu, May 7, 2020 at 3:35 PM Tycho Andersen <ty...@tycho.ws> wrote:

> Aleksa has been thinking about OCIv2:
>
> https://github.com/openSUSE/umoci/issues/256
> https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar
>
> I've also been interested in working on that, ideally to make it so
> that we could move away from the tar format and enable direct-mounting
> of OCI images.
>
> I see in your PR that you're pretty close to integrating the tar stuff
> with containers/image now, but have you thought at all about what a
> clean room format for this would look like?

Hi Tycho -

When we were in the planning phases of this project, we were basically
looking at two options:

* File format that allows just downloading the pieces that changed
(stargz, "OciV2")
* Pre-compile deltas between versions and make them available for download

We settled on the second mostly because it was more efficient - for
one test case, if you compare the compressed size of changed files
with the original compressed layer size, the changed files are 37% of
the original size, but the tar-diff size is 6%. (*) These are pretty
typical stats. We felt that achieving that level of efficiency
enables additional use cases and makes the work considerably more
compelling.

The disadvantage of the delta approach is mostly needing extra work to
maintain a repository - deltas need to be generated, old deltas
potentially need to be removed and garbage collected and so forth.

I spent some time thinking about the intersection of the two
approaches - pre-generating deltas with a new file format, and while
it's possible and gives you the same level of efficiency, there aren't
really intersectional benefits - it doesn't work that much *better*
than with tar files.

By using deltas against tar files, there's also a high degree of
compatibility with the ecosystem - if the deltas aren't available, or
the client doesn't know how to apply them, then you simply fall back
to a full layer download.

Of course, precisely because the delta approach is extremely
compatible with current practice - it doesn't fix any of the other
issues Aleksa identified with the current state - needing to
distribute the files erased by whiteouts, all of the different
variants of tar, and so forth. And you don't get the direct-mount
ability.

- Owen

(*) You maybe can do a bit better than 37% by adding in
content-defined chunking in the style of restic or casync, but it
doesn't really help that much for executables, which tend to be the
majority of changed files for image updates.

Alexander Larsson

May 8, 2020, 2:48:59 AM
to dev

On Thursday, 7 May 2020 21:35:33 UTC+2, Tycho Andersen wrote:

> Aleksa has been thinking about OCIv2:
>
> https://github.com/openSUSE/umoci/issues/256
> https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar
>
> I've also been interested in working on that, ideally to make it so
> that we could move away from the tar format and enable direct-mounting
> of OCI images.
>
> I see in your PR that you're pretty close to integrating the tar stuff
> with containers/image now, but have you thought at all about what a
> clean room format for this would look like?

We came at this in a slightly weird way. I'm the main developer of Flatpak, which normally uses OSTree to distribute images. However, for infrastructure reasons we distribute the Fedora and Red Hat Flatpaks as OCI images, and we were getting complaints that updates of them were too large.

OSTree is to a large degree an OCIv2-style system. I.e., it is single-layer, yet it does maximal sharing of identical files both on disk and over the network (due to using content-addressed storage). There are some details of OSTree that make it not work for OCI as is, but from a very high level it has similar properties to what you'd want.

However, even with this in-built efficiency OSTree added a delta system (which forms the basis of tar-diff), and in practical use (most users updating from previous version to current, all the time) it performs drastically better.

So, as Owen said, we did look into ways of changing the basic form of the layers to get some kind of incrementality, but in the end it just didn't perform nearly well enough to be worth all the effort compared to a delta-style update.

That doesn't mean OCIv2 is not useful in addition to deltas. We could use OCIv2 to improve initial pulls, as well as pulls of independent images that happen to "accidentally" share files (or chunks). However, even in an OCIv2 world I think deltas are going to be important.

Till Wegmüller

May 8, 2020, 4:01:39 AM
to d...@opencontainers.org
On 08.05.20 08:48, Alexander Larsson wrote:
> So, as owen said, we did look into ways of changing the basic form of
> the layers to get some kind of incrementality, but in the end it just
> didn't perform nearly well enough to be worth all the effort compared to
> a delta-style update.
>
> That doesn't mean OCIv2 is not useful in addition to deltas. We could
> use OCIv2 to improve initial pulls, as well as pulls of independent
> images that happen to "accidentally" share files (or chunks). However,
> even in an OCIv2 world I think deltas are going to be important.

Coming from the illumos community, where we use this principle for our
package management, I can name a few more benefits gained from an
approach as discussed in this thread.

The main difference is that you are separating the data and metadata of
the layers into different parts. You can achieve a lot of benefits by
processing the data itself (content-based storage addressing, individual
compression). However, you gain the most benefit from processing the
metadata. We use a set of orthogonal properties to encode details about
a file, which we can then use to install only the subset of parts the
user is interested in.

As an example, take the following manifest (a package in our system):
set key=source.url value=https://github.com/foo/bar
set key=source.version value=v1.0
set key=source.commit.hash value=$SOURCECOMMITHASH
file $CONTENTADDRESSHASH path=/foo/bar pkg.variant=i86pc elfbits=32 pkg.variant.os=illumos
file $CONTENTADDRESSHASH path=/foo/bar pkg.variant=i86pc elfbits=64 pkg.variant.os=illumos
file $CONTENTADDRESSHASH path=/foo/bar pkg.variant=sparc elfbits=32 pkg.variant.os=illumos
file $CONTENTADDRESSHASH path=/foo/bar pkg.variant=sparc elfbits=64 pkg.variant.os=illumos
file $CONTENTADDRESSHASH path=/foo/bar pkg.variant=i86pc elfbits=64 pkg.variant.os=linux

This is a real-life example of a package (or, in the case of OCI, that
would be a layer) that contains one binary in all sorts of flavours.
path is the property we check for duplicates after applying all filters.
In this case we filter for processor architecture (sparc or x86),
bitness, and ABI compatibility. These filters are given by the runtime
that wants to download the image. We also have additional filters
called facets, which are defined by user input and allow trimming fat on
a per-use-case basis.
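To make the filtering step concrete, here is a small sketch that parses `file` lines in the style of the manifest above and keeps the ones matching the variants a runtime asks for. It is not the IPS implementation: the `fileAction` type, the helper names, and the placeholder hashes h1..h5 are invented for illustration, and treating a missing attribute as "matches any value" is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// fileAction is a simplified model of one "file" line in the manifest.
type fileAction struct {
	Hash  string
	Attrs map[string]string
}

// parseFileAction parses "file <hash> key=value ..."; other lines are skipped.
func parseFileAction(line string) (fileAction, bool) {
	fields := strings.Fields(line)
	if len(fields) < 2 || fields[0] != "file" {
		return fileAction{}, false
	}
	a := fileAction{Hash: fields[1], Attrs: map[string]string{}}
	for _, f := range fields[2:] {
		if k, v, ok := strings.Cut(f, "="); ok {
			a.Attrs[k] = v
		}
	}
	return a, true
}

// matches reports whether the action is compatible with every requested
// variant. Assumption: an attribute the action does not carry matches anything.
func matches(a fileAction, want map[string]string) bool {
	for k, v := range want {
		if got, ok := a.Attrs[k]; ok && got != v {
			return false
		}
	}
	return true
}

// selectFiles returns the file actions in the manifest that pass the filter.
func selectFiles(manifest string, want map[string]string) []fileAction {
	var out []fileAction
	for _, line := range strings.Split(manifest, "\n") {
		if a, ok := parseFileAction(line); ok && matches(a, want) {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	manifest := `set key=source.version value=v1.0
file h1 path=/foo/bar pkg.variant=i86pc elfbits=32 pkg.variant.os=illumos
file h2 path=/foo/bar pkg.variant=i86pc elfbits=64 pkg.variant.os=illumos
file h3 path=/foo/bar pkg.variant=sparc elfbits=32 pkg.variant.os=illumos
file h4 path=/foo/bar pkg.variant=sparc elfbits=64 pkg.variant.os=illumos
file h5 path=/foo/bar pkg.variant=i86pc elfbits=64 pkg.variant.os=linux`

	want := map[string]string{
		"pkg.variant": "i86pc", "elfbits": "64", "pkg.variant.os": "illumos",
	}
	for _, a := range selectFiles(manifest, want) {
		fmt.Println(a.Hash, a.Attrs["path"])
	}
}
```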

Another benefit is the capability to easily add key-value pairs to the
metadata, which allows extended audit and mapping use cases such as CVE
vulnerability checks or out of date dependency checks. These can easily
be integrated into any system, as those now don't need to download any
image at all but only need to parse the metadata for the keys they are
interested in.

Hope this gives an idea what forms and use cases this could cover.
Greetings
Till

Tycho Andersen

May 8, 2020, 11:14:29 AM
to Alexander Larsson, dev
Right, and in particular OCIv2 would offer direct-mounting, which is
orthogonal to how the deltas are computed and the "killer feature"
that I'm interested in. I just wanted to make sure you're aware of
(and involved in, assuming we have more) that discussion, since I have
only thought about how to implement direct-mounting and not at all
about how to arrange deltas in a nice way. It would be good to have
someone who has thought about that involved in the design :)

Tycho

Till Wegmüller

May 8, 2020, 12:20:50 PM
to d...@opencontainers.org
On 08.05.20 17:14, Tycho Andersen wrote:
> Right, and in particular OCIv2 would offer direct-mounting, which is
> orthogonal to how the deltas are computed and the "killer feature"
> that I'm interested in. I just wanted to make sure you're aware of
> (and involved in, assuming we have more) that discussion, since I have
> only thought about how to implement direct-mounting and not at all
> about how to arrange deltas in a nice way. It would be good to have
> someone who has thought about that involved in the design :)
>
> Tycho


Unfortunately I was not present when this discussion started. Can you
quickly explain what is meant by the term direct-mounting?

Thanks and Greetings
Till

Tycho Andersen

May 8, 2020, 12:30:33 PM
to Till Wegmüller, d...@opencontainers.org
Sure, the idea is that you don't have to do any post-processing on the
image (e.g. extracting tar files onto disk), you can just mount it
directly as-is. You can enable a form of this today if you use
squashfs as the layer format instead of tar and use overlay to wire
all the squashfs mounts together, but that doesn't really solve the
cross-layer sharing problem.

This means that 1. image signatures are enough to guarantee
provenance and you don't need additional things like IMA signatures,
2. you can verify a currently running image, instead of just at
extraction time, and 3. you don't need to do this extract step.

So ideally OCIv2 would: 1. share data between layers and images in a
more intelligent way (this is the delta work you guys are doing, IIUC)
2. distribute data in a more intelligent way (maybe this is free with
#1 and a good dist-spec design) and 3. be direct-mountable, so you get
all the nice properties above.

Tycho

Alexander Larsson

May 9, 2020, 5:15:36 AM
to dev, toast...@gmail.com

On Friday, 8 May 2020 18:30:33 UTC+2, Tycho Andersen wrote:

> Sure, the idea is that you don't have to do any post-processing on the
> image (e.g. extracting tar files onto disk), you can just mount it
> directly as-is. You can enable a form of this today if you use
> squashfs as the layer format instead of tar and use overlay to wire
> all the squashfs mounts together, but that doesn't really solve the
> cross-layer sharing problem.
>
> This means that 1. image signatures are enough to guarantee
> provenance and you don't need additional things like IMA signatures,
> 2. you can verify a currently running image, instead of just at
> extraction time, and 3. you don't need to do this extract step.
>
> So ideally OCIv2 would: 1. share data between layers and images in a
> more intelligent way (this is the delta work you guys are doing, IIUC)
> 2. distribute data in a more intelligent way (maybe this is free with
> #1 and a good dist-spec design) and 3. be direct-mountable, so you get
> all the nice properties above.

Actually, one important aspect of the deltas is that we are able to, given what we store on the client, reproduce a bitwise identical version of the artifact that is verified by a digest. In the case of the current tar layers, I was lucky in that the part that is verified is the uncompressed tar (via the DiffID), because it is generally hard to reproduce a bitwise identical compressed artifact.

Additionally, delta algorithms work poorly on compressed data, so it's a bad idea to work on the compressed data "blindly". We really need to be able to get at the file content.

So, in the OCIv2 work, keep in mind that we don't want to rely on the checksum of something that is compressed, nor do we want to use a format where the compression is "built into" the system in a way where we can't get at the real data.

Till Wegmüller

May 9, 2020, 5:20:43 AM
to Alexander Larsson, dev
On 09.05.20 11:15, Alexander Larsson wrote:
> Actually, one important aspect of the deltas is that we are able to,
> given what we store on the client, reproduce a bitwise identical version
> of the artifact that is verified by a digest. In the case of the current
> tar layers, I was lucky in that the part that is verified is the
> uncompressed tar (via the DiffID), because it is generally hard to
> reproduce a bitwise identical compressed artifact.
>
> Additionally, delta algorithms work poorly on compressed data, so it's
> a bad idea to work on the compressed data "blindly". We really need to
> be able to get at the file content.
>
> So, in the OCIv2 work, keep in mind that we don't want to rely on the
> checksum of something that is compressed, nor do we want to use a format
> where the compression is "built into" the system in a way where we can't
> get at the real data.


Good point. In IPS we save the checksums of both the uncompressed and compressed
files, together with their compressed and uncompressed sizes, for verification.

Aleksa Sarai

May 9, 2020, 5:45:49 AM
to Alexander Larsson, dev, toast...@gmail.com
For the record, I completely agree. My proposal for OCIv2 doesn't have
compression at all at the moment because I wanted to figure out if we
could just have compression be done when transferring data (because when
you have the rootfs on disk, you don't want the compressed version
anyway).

I think it makes little sense to bake it into the actual objects you're
hashing -- aside from reproducing the compressed stream there's also all
sorts of additional zip-bomb attacks that could be easily mitigated by
not compressing in the on-disk format.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

Aleksa Sarai

May 9, 2020, 6:04:58 AM
to Alexander Larsson, dev, toast...@gmail.com
I think the best solution to get the ball rolling on an OCIv2 is to have
a proper working group set up, so that we can iterate on spec ideas and
hopefully have it done far more quickly than we could individually.

I have been working on some ideas for a while, but I think that any
attempt to redesign the image-spec should be done with a group that has
the broadest possible set of experiences. Several of the things Tycho
brought up in our in-person discussions were things I wouldn't have
considered, as I'm sure would be the case if any of us had such
discussions.

The OCI doesn't really have a process for setting up a working group,
but I suggest we just set one up ourselves.

Aleksa Sarai

May 9, 2020, 6:18:56 AM
to Alexander Larsson, dev
I agree that deltas are definitely a worthwhile addition to update
systems, and are something that should be kept in mind when designing
OCIv2.

My only contention is that deltas should be a strictly opt-in system,
and hosts which don't wish to spend computation power generating deltas
can simply provide the complete blobs (with as much de-duplication as we
can get). I don't want us to require (by design) "smart" image
registries. You could even do static deltas a-la LXD (where they
generate deltas for the 3 previous images, allowing for static hosting
while still providing deltas).

Till Wegmüller

May 9, 2020, 6:54:15 AM
to d...@opencontainers.org
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 09.05.20 12:18, Aleksa Sarai wrote:
> I don't want us to require (by design) "smart" image registries.
> You could even do static deltas a-la LXD (where they generate
> deltas for the 3 previous images, allowing for static hosting while
> still providing deltas).


You can do delta calculation on the client side and request only those
files which you need.
-----BEGIN PGP SIGNATURE-----

iHUEARYIAB0WIQR0Tlx6kDnJJt6txLGYBG8gdxOmQgUCXraL0wAKCRCYBG8gdxOm
QhyaAP96PGvPmudB5EN8SdZLUMUbRz5FSQlnkgxemoKLjvHtEwD/f8EsrUPya0T9
3YsSB/a0Lu68xjpsIWlclQk9/7H1+Q0=
=gUuD
-----END PGP SIGNATURE-----

Aleksa Sarai

May 9, 2020, 8:38:39 AM
to Till Wegmüller, d...@opencontainers.org
On 2020-05-09, Till Wegmüller <toast...@gmail.com> wrote:
> On 09.05.20 12:18, Aleksa Sarai wrote:
> > I don't want us to require (by design) "smart" image registries.
> > You could even do static deltas a-la LXD (where they generate
> > deltas for the 3 previous images, allowing for static hosting while
> > still providing deltas).
>
>
> You can do delta calculation on the client side and request only those
> files which you need.

That's not really a delta -- especially if we're talking about a format
where everything in the rootfs is content-addressed separately. Then
it's just a sane transfer algorithm (and is how my current OCIv2
proposal would work).

The deltas we're talking about are more like minimal diff(1)s for
individual binary blobs. You can't calculate those client-side, because
you need both versions to figure out what the minimal difference is.
It's more akin to delta-RPMs than content-addressed deduplication.
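To illustrate the distinction, here is a sketch of the content-addressed transfer side only. The client can work out by itself which objects it lacks, but a binary that changed by a single byte gets a new digest and is fetched in full; shrinking that transfer to a small diff needs a delta computed by someone who holds both versions. The `missing` helper and the sample data are hypothetical and not taken from any OCIv2 proposal.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// digest returns the hex SHA-256 of an object's content.
func digest(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// missing lists the wanted object digests that the client does not hold.
// This needs only digests, so it can be computed entirely client-side.
func missing(wanted []string, have map[string]bool) []string {
	var out []string
	for _, d := range wanted {
		if !have[d] {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	oldBin := bytes.Repeat([]byte{0x90}, 4096)
	newBin := append([]byte(nil), oldBin...)
	newBin[100] = 0x91 // a one-byte change in a 4 KiB "binary"
	config := []byte("unchanged config file")

	have := map[string]bool{digest(oldBin): true, digest(config): true}
	wanted := []string{digest(newBin), digest(config)}

	need := missing(wanted, have)
	fmt.Printf("objects to fetch: %d (the whole %d-byte binary)\n", len(need), len(newBin))
}
```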

Till Wegmüller

May 9, 2020, 12:16:39 PM
to Aleksa Sarai, d...@opencontainers.org
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Ah yes, I see your point. Let's use the distinct terms.
Do you have your proposal online anywhere? I would like to cross-check
it with what we have in illumos. Maybe I can make a few suggestions. Or
do you want a more sanctified structure first to ideate on the proposal?

Greetings
Till
-----BEGIN PGP SIGNATURE-----

iHUEARYIAB0WIQR0Tlx6kDnJJt6txLGYBG8gdxOmQgUCXrbXYwAKCRCYBG8gdxOm
QkXSAQCASCOHqiUD4kcH7S0cockH7uJU0KonpF4+Y9lYD/FplAD/U7tKJVvxZqzk
fqtc4+2OOF8/CyFY0Z2CCEcAJM/fQwk=
=L0ZN
-----END PGP SIGNATURE-----

Aleksa Sarai

May 12, 2020, 7:27:13 PM
to Till Wegmüller, Aleksa Sarai, d...@opencontainers.org
I have a rough implementation of my proposal[1] and was working on a
proper document, but given how many additional discussions have happened
since then it probably makes more sense to brainstorm a document in a
working group rather than starting a written version of my proposal.

[1]: https://github.com/openSUSE/umoci/tree/experimental/ociv2