OCI and casync


Atlas Kerr

unread,
Feb 20, 2019, 1:10:24 PM2/20/19
to dev
Hi all,

The best discussion on the internet about the relationship between OCI and casync is in a Twitter thread:

https://twitter.com/vbatts/status/1087271584535072769

It seems that if we work casync into the image spec, we could achieve efficient distribution of images without needing to define a spec. Is this accurate?

Are there any drawbacks to casync in the context of OCI?

Best,
Atlas

Aleksa Sarai

unread,
Feb 20, 2019, 1:47:54 PM2/20/19
to Atlas Kerr, dev
On 2019-02-20, Atlas Kerr <atla...@gmail.com> wrote:
> The best discussion on the internet about the relationship between OCI and
> casync is on a twitter thread:
>
> https://twitter.com/vbatts/status/1087271584535072769

This is mainly because Lennart and I haven't had a chance to actually
have a face-to-face discussion so that something can be written down.
Vincent and I have talked about this in London last year, but there
weren't a lot of clear answers to questions we had.

I have been thinking about having a working group for some of the
changes being discussed for OCI images (and I know quite a few folks who
would like to be part of such a group), but I'm not sure how exactly
this would work. I do get that it is quite frustrating that information
about these design discussions only comes out in dribs and drabs.

The blog post I published last month outlines the general issues[1], and
I am currently working on a follow-up with a description of the
alternative I would like to see for OCI -- CVE-2019-5736 took away a lot
of my time (and continues to do so).

> It seems that if we work casync into the image spec, we could achieve
> efficient distribution of images without needing to define a spec. Is this
> accurate?
>
> Are there any drawbacks to casync in the context of OCI?

It wouldn't be reasonable to *just* use casync, because we'd then have
to re-use their .cidx format, which would make it annoying to deal with
casync blobs (you'd need to fetch the .cidx blob and parse that rather
than using the current JSON walk we have). So modifying casync or
otherwise augmenting it is required off the bat.
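
For reference, the "JSON walk" today is roughly: fetch the manifest, then
fetch each layer blob it lists by digest. A minimal sketch in Go (local
struct definitions standing in for the image-spec bindings, plus a
placeholder fetchBlob helper; this is an illustration, not real umoci or
registry code):

package sketch

import (
    "encoding/json"
    "fmt"
)

// Minimal local mirror of the OCI image manifest fields that matter here
// (the real definitions live in the image-spec Go bindings).
type Descriptor struct {
    MediaType string `json:"mediaType"`
    Digest    string `json:"digest"`
    Size      int64  `json:"size"`
}

type Manifest struct {
    Config Descriptor   `json:"config"`
    Layers []Descriptor `json:"layers"`
}

// fetchBlob is a placeholder for "GET /v2/<name>/blobs/<digest>".
func fetchBlob(digest string) ([]byte, error) {
    return nil, fmt.Errorf("not implemented: fetch %s", digest)
}

// walkImage is the "JSON walk": parse the manifest, then fetch each layer
// blob it points at. A casync-style scheme would instead point at an
// index blob that in turn points at chunks.
func walkImage(manifestJSON []byte) error {
    var m Manifest
    if err := json.Unmarshal(manifestJSON, &m); err != nil {
        return err
    }
    for _, layer := range m.Layers {
        if _, err := fetchBlob(layer.Digest); err != nil {
            return err
        }
    }
    return nil
}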

Using casync would mean less work for us (with the downside that we'd
have no freedom in the format, because otherwise we'd be forking casync),
and I'm definitely in favour of less work for us -- but there are a few
things that make me cautious about using casync:

* The format is a canonicalised version of tar. This means that it
inherits many of the problems discussed in [1] off the bat. The only
significant problem it solves is the canonical representation problem
(which is great). Parallelism of extraction is no better (as far as I
understand, .caidx is a lookup table for where chunks come from, not
where tar-level data is stored).

Reproducibility is better than stock tar but still has issues
(hardlinks can result in some complications, and as far as I know
casync doesn't explicitly reject non-canonical catar formats).

* The chunking is done after serialisation, which means that metadata
is combined with data in a way that requires parsing tar in order to
understand what is in an image. I have some fairly radical ideas
about being able to attest to what packages an image contains (in a
verifiable way) that would be rendered difficult with this setup.

It also means that metadata changes will cause duplication of
transfer and storage for small files (though I am unsure how big of
an issue this is -- and there are reasons why you don't want lots of
tiny objects).

* There have been some discussions with folks from Cisco about whether
a new format could make their OCI usage (they use squashfs with
overlayfs so that the executing filesystem is actually signed rather
than just the tar representation) work more generically. I don't
think that the tar-based design of casync would allow for this.

But these are things that I think can be discussed and decided on when
Lennart and I finally have a chance to meet in September (I was meant to
be at FOSDEM but unfortunately didn't get the budget).

One thing that I've come to realise is that a plain Merkle-tree
structure might not make a lot of sense and a flat structure might make
more sense (though we'd have to see how bad duplication might be as a
result). This change of heart is based on some older discussions I had
with Steven as well as thinking about how casync does things.

[1]: https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

Atlas Kerr

unread,
Feb 20, 2019, 2:46:45 PM2/20/19
to Aleksa Sarai, dev
I'd love to be a fly on the wall for the discussion between you and Lennart about this topic.

Maybe we could invite Lennart to one of our weekly meetings to have an open discussion about it?

Stephen Day

unread,
Feb 20, 2019, 3:29:20 PM2/20/19
to Atlas Kerr, Aleksa Sarai, dev
While I think casync would be nice, the licensing makes it a non-starter for a lot of use cases.


Vincent Batts

unread,
Feb 20, 2019, 3:41:44 PM2/20/19
to Aleksa Sarai, Atlas Kerr, dev
On 21/02/19 05:47 +1100, Aleksa Sarai wrote:
>On 2019-02-20, Atlas Kerr <atla...@gmail.com> wrote:
>> The best discussion on the internet about the relationship between OCI and
>> casync is on a twitter thread:
>>
>> https://twitter.com/vbatts/status/1087271584535072769
>
>This is mainly because Lennart and I haven't had a chance to actually
>have a face-to-face discussion so that something can be written down.
>Vincent and I have talked about this in London last year, but there
>weren't a lot of clear answers to questions we had.
>
>I have been thinking about having a working group for some of the
>changes being discussed for OCI images (and I know quite a few folks who
>would like to be part of such a group), but I'm not sure how exactly
>this would work. I do get that it is quite frustrating that the only
>information about design discussions is happening in drips and drabs.

I've since chatted a bit with the casync devs. A working group would be
good.

Largely this "v2" conversation is looking more like reviewing additional
formats for resolving the rootfs. We've focused on alternatives to TAR.
Though we're finding that folks' have different use-cases and what
they're optimizing for. And while the current image has very definite
drawbacks and inefficiencies, I've come to the grouping of three
buckets. 1) container runtime node; 2) bandwidth for distribution; 3)
storage of registries.
What we've got "works", and unless a v2 format doesn't give an order of
magnitude improvement for the general case then, the conversation may
just become an example of how some with a narrow-case could craft their
own format and find their own improvements.
I'm concerned regardless of chunks (i.e. restic or casync) or file level (i.e.
ostree), having chunk reuse across local node, local cluster and remote
CAS registry, and to have any garbage-collection guidance.

The way casync puts the files' metadata inline, and concatenates across
file boundaries even within a chunk, may well cause chunk reuse to go
down a bit (e.g. dropping a new binary into /usr/bin/ may shift the
offsets for that directory). There are some features specific to the
casync tool (maybe not in the clone,
https://github.com/folbricht/desync), but it's not all built into the
format/protocol of casync per se.

The best-case scenario will be an improvement, but not an order of
magnitude. The worst case is possibly worse than file-level CAS (because
existing files get fetched again when the chunks are merely offset).

The other concern I have is fetching all these small files. This will
require something like git's smart packs, or similar.

vb

Stephen Day

unread,
Feb 20, 2019, 3:49:42 PM2/20/19
to Vincent Batts, Aleksa Sarai, Atlas Kerr, dev
I’ve also evaluated restic for this use case and it works great. As Vincent said, a smart pack is necessary for small blob I/O. You solve that with a concept I’ve called chunk maps.

The biggest thing we can do to enable better reuse is to disable compression at the layer level. Doing so will let us provide better deduplication at the tar layer and innovate without changing the underlying format too much. We can progress to all these other methods once we do that.
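
To make the dedup point concrete, here is a minimal Go sketch (my own
illustration, not the chunk-map design) of the kind of file-level walk a
registry or client can only do over *uncompressed* layer blobs; with
per-layer gzip it only ever sees opaque compressed bytes:

package sketch

import (
    "archive/tar"
    "crypto/sha256"
    "io"
)

// fileDigests walks an uncompressed tar stream and records a content
// digest per regular file.
func fileDigests(layer io.Reader) (map[string][sha256.Size]byte, error) {
    digests := map[string][sha256.Size]byte{}
    tr := tar.NewReader(layer)
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            return digests, nil
        }
        if err != nil {
            return nil, err
        }
        if hdr.Typeflag != tar.TypeReg {
            continue
        }
        h := sha256.New()
        if _, err := io.Copy(h, tr); err != nil {
            return nil, err
        }
        var sum [sha256.Size]byte
        copy(sum[:], h.Sum(nil))
        digests[hdr.Name] = sum
    }
}

// sharedContent counts how many files in layer b already exist in layer a
// with the same path and content -- the file-level dedup opportunity.
func sharedContent(a, b map[string][sha256.Size]byte) int {
    n := 0
    for name, sum := range b {
        if prev, ok := a[name]; ok && prev == sum {
            n++
        }
    }
    return n
}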

I also think we should add squashfs support at the base layer level.

Atlas Kerr

unread,
Feb 20, 2019, 4:07:44 PM2/20/19
to Stephen Day, Vincent Batts, Aleksa Sarai, dev
> I've since chatted a bit with the casync devs. A working group would be
> good.

That would be awesome!

What would have to be submitted to the `tob` repo to get the process moving?

Stephen Day

unread,
Feb 20, 2019, 4:08:38 PM2/20/19
to Atlas Kerr, Vincent Batts, Aleksa Sarai, dev
Let’s not overlook the licensing issue: LGPL is not good for general distribution.

Aleksa Sarai

unread,
Feb 20, 2019, 4:16:00 PM2/20/19
to Vincent Batts, Atlas Kerr, dev
On 2019-02-20, Vincent Batts <vba...@redhat.com> wrote:
> On 21/02/19 05:47 +1100, Aleksa Sarai wrote:
> > On 2019-02-20, Atlas Kerr <atla...@gmail.com> wrote:
> > > The best discussion on the internet about the relationship between OCI and
> > > casync is on a twitter thread:
> > >
> > > https://twitter.com/vbatts/status/1087271584535072769
> >
> > This is mainly because Lennart and I haven't had a chance to actually
> > have a face-to-face discussion so that something can be written down.
> > Vincent and I have talked about this in London last year, but there
> > weren't a lot of clear answers to questions we had.
> >
> > I have been thinking about having a working group for some of the
> > changes being discussed for OCI images (and I know quite a few folks who
> > would like to be part of such a group), but I'm not sure how exactly
> > this would work. I do get that it is quite frustrating that the only
> > information about design discussions is happening in drips and drabs.
>
> I've since chatted a bit with the casync devs. A working group would be
> good.

Agreed.

> Largely this "v2" conversation is looking more like reviewing additional
> formats for resolving the rootfs. We've focused on alternatives to TAR.
> Though we're finding that folks' have different use-cases and what
> they're optimizing for. And while the current image has very definite
> drawbacks and inefficiencies, I've come to the grouping of three
> buckets. 1) container runtime node; 2) bandwidth for distribution; 3)
> storage of registries.

I'm not sure I fully understand what you mean? The "v2" conversation was
spurred by the tar issues, and with the hope that we could get some
other things out of a more transparent format (like being able to verify
that an image actually contains some packages). What are your buckets
referring to?

> What we've got "works", and unless a v2 format doesn't give an order of
> magnitude improvement for the general case then, the conversation may
> just become an example of how some with a narrow-case could craft their
> own format and find their own improvements.

Felix did a quick comparison of different approaches[1]. Just with their
base images, the de-duplication benefit was 4x for chunk-based
de-duplication (and a bit less for file-based de-duplication). While
implementing an extraction tool that uses concurrency will take some
effort, I am willing to bet it will be quite a bit faster than the
current system especially when you consider that extraction could occur
*during* download.

If we consider the most common use case of someone downloading an Ubuntu
image for the 5000th time, I'm pretty sure we have a clear argument.

> I'm concerned regardless of chunks (i.e. restic or casync) or file level (i.e.
> ostree), having chunk reuse across local node, local cluster and remote
> CAS registry, and to have any garbage-collection guidance.

I agree garbage-collection is going to be a "fun" issue to solve.

> The other concern I have is the fetching these small files. This will
> require like git's smart pack or similar.

I completely agree, and would like to see what Steven has cooked up to
solve this with "chunk maps". Packing might be the only reasonable
solution at the end of the day (I did have dreams that HTTP/2 could
solve this with server-push but that requires far too much semantic
information in the server about what is being downloaded). :D

[1]: https://github.com/openSUSE/umoci/issues/256#issuecomment-430413607

Aleksa Sarai

unread,
Feb 20, 2019, 4:17:44 PM2/20/19
to Stephen Day, Atlas Kerr, Vincent Batts, dev
On 2019-02-20, Stephen Day <stev...@gmail.com> wrote:
> Let’s not overlook the licensing issue: lgpl is not good for general
> distribution.

I think Atlas was referring to how we should set up a working group, not
about including casync within OCI.

> On Wed, Feb 20, 2019 at 13:07 Atlas Kerr <atla...@gmail.com> wrote:
>
> > > I've since chatted a bit with the casync devs. A working group would be
> > > good.
> >
> > That would be awesome!
> >
> > What would have to be submitted to the `tob` repo to get the process
> > moving?


Atlas Kerr

unread,
Feb 20, 2019, 8:36:44 PM2/20/19
to Aleksa Sarai, Stephen Day, Vincent Batts, dev
Yep, I was talking about setting up a working group.

I'm not saying licensing isn't an issue but I'm sure we can work something out once we all get together.

Akihiro Suda

unread,
Feb 20, 2019, 11:38:48 PM2/20/19
to Atlas Kerr, Aleksa Sarai, Stephen Day, Vincent Batts, dev
> The biggest thing we can do to enable better reuse is to disable compression at the layer level. Doing so will let us provide better deduplication at the tar layer and innovate without changing the underlying format too much. We can progress to all these other methods once we do that.


I have a proposal for this:
https://github.com/AkihiroSuda/filegrain/issues/21

* By pushing uncompressed `vnd.oci.image.layer.v1.tar` blobs to a registry (via `Transfer-Encoding: gzip`), the registry can deduplicate tarballs using an arbitrary chunk-level/file-level algorithm that can reproduce the sha256 of the original tarballs.
* Lazy-pull can be implemented without introducing a new format by seeking to tar headers using HTTP Range Requests, but apparently this is non-optimal because it causes a bunch of small read requests. So my suggestion is to provide a way to return a list of {tarhdr, offset} pairs in a single request (a rough sketch of such an entry follows below).
* By sorting the files in the tarballs in descending order of how likely they are to be needed for starting up the container, a client can fetch the files needed to start the container in a single request.
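
A rough Go sketch of what such a per-file index entry and a single-file
fetch could look like; the TOCEntry shape and field names are
illustrative guesses, not a concrete wire-format proposal:

package sketch

import (
    "fmt"
    "io"
    "net/http"
)

// TOCEntry is a hypothetical {tarhdr, offset} record: enough of the tar
// header to recreate the file, plus where its payload starts inside the
// uncompressed layer blob.
type TOCEntry struct {
    Name   string `json:"name"`
    Mode   int64  `json:"mode"`
    Size   int64  `json:"size"`
    Offset int64  `json:"offset"` // payload offset within the tar blob
}

// fetchFile lazily pulls a single file's payload out of the layer blob
// with one HTTP Range request instead of downloading the whole layer.
func fetchFile(blobURL string, entry TOCEntry) ([]byte, error) {
    req, err := http.NewRequest("GET", blobURL, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("Range",
        fmt.Sprintf("bytes=%d-%d", entry.Offset, entry.Offset+entry.Size-1))
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusPartialContent {
        return nil, fmt.Errorf("range request not honoured: %s", resp.Status)
    }
    return io.ReadAll(resp.Body)
}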



Greg KH

unread,
Feb 21, 2019, 3:40:13 AM2/21/19
to Stephen Day, Atlas Kerr, Aleksa Sarai, dev
On Wed, Feb 20, 2019 at 12:29:07PM -0800, Stephen Day wrote:
> While I think casync would be nice, the licensing makes it a non-starter
> for a lot of use cases.

That is merely an implementation. If we end up using the format that
casync defines, you are free to create whatever other tool you want to
work with that format.

Also, you already use LGPL code just fine in your company's systems, so
I doubt it's a "real" issue :)

thanks,

greg k-h

Aleksa Sarai

unread,
Feb 21, 2019, 3:48:33 AM2/21/19
to Greg KH, Stephen Day, Atlas Kerr, dev
On 2019-02-21, Greg KH <gre...@linuxfoundation.org> wrote:
> On Wed, Feb 20, 2019 at 12:29:07PM -0800, Stephen Day wrote:
> > While I think casync would be nice, the licensing makes it a non-starter
> > for a lot of use cases.
>
> That is merely an implementation, if we end up using the format that
> casync defines, you are free to create what ever other tool you want to
> use to work with that format.

There's even a pre-existing BSD-3-Clause implementation in Go[1].
Personally, my concerns about casync are unrelated to licensing.

[1]: https://github.com/folbricht/desync

Cirujano Cuesta, Silvano

unread,
Feb 21, 2019, 4:59:31 AM2/21/19
to atla...@gmail.com, stev...@gmail.com, d...@opencontainers.org, vba...@redhat.com, cyp...@cyphar.com
On Wed, 2019-02-20 at 15:07 -0600, Atlas Kerr wrote:
> > I've since chatted a bit with the casync devs. A working group would be
> > good. 
>
> That would be awesome!
>
> What would have to be submitted to the `tob` repo to get the process moving?

I'd vote for any measure to "get the process moving" :-)

Getting some improvements in this direction would be amazing for companies with edge use cases (like mine).
I'm sure that there are some of them, although not very visible here.

I've been looking at OSTree as the initial approach, but always with an eye on the casync approach.
I haven't compared the casync and restic approaches, but specifying a format based on one of them sounds good to me.

Silvano

Greg KH

unread,
Feb 21, 2019, 5:10:08 AM2/21/19
to Stephen Day, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev
On Wed, Feb 20, 2019 at 12:49:29PM -0800, Stephen Day wrote:
> I’ve also evaluated restic for this use case and it works great. As Vincent
> said, a smart pack is necessary for small blob io. You solve that with a
> concept I’ve called chunk maps.
>
> The biggest thing we can do to enable better reuse is to disable
> compression at the layer level. Doing so will let us provide better
> deduplication at the tar layer and innovate without changing the underlying
> format too much. We can progress to all these other methods once we do that.
>
> I also think we should add squashfs support at the base layer level.

No, please don't use squashfs for anything new. It really does not work
well for many things. There are other ways to have compressed
filesystems if you _really_ think it is needed (hint, it almost never
is...)

thanks,

greg k-h

Stephen Day

unread,
Feb 21, 2019, 12:21:37 PM2/21/19
to Greg KH, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev
> Also, you already use LGPL code just fine in your company's systems, so
I doubt it's a "real" issue :)

Will you do the work of convincing everyone’s legal teams to adopt new LGPL code? It’s extra work and risk that a lot of organizations are unwilling to take. Trivializing that is pretty condescending.

Stephen Day

unread,
Feb 21, 2019, 12:27:39 PM2/21/19
to Greg KH, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev
> No, please don't use squashfs for anything new. It really does not work
well for many things. There are other ways to have compressed
filesystems if you _really_ think it is needed (hint, it almost never
is...)

Could you expand on this? What are the problems? This is one of the largest asks I’ve been hearing about for support in container images.

vanessa sochat

unread,
Feb 21, 2019, 12:38:31 PM2/21/19
to Stephen Day, Greg KH, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev
I'd be interested to hear your thoughts on this too - I've really liked squashfs (and its derivatives) after seeing the huge reduction in size from ext3/ext4 while (still) having read-only, reproducible containers.



--
Vanessa Villamia Sochat
Stanford University

Shane Canon

unread,
Feb 21, 2019, 12:39:50 PM2/21/19
to Stephen Day, Greg KH, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev

I am curious to hear Greg’s feedback on this too.

We use squashfs under the hood with Shifter to distribute flattened images out to ~10k compute nodes on our Cray systems.
The images are copied onto a global parallel file system. It is super efficient and key to us getting scalable launch times.
I think Singularity is doing the same.  

If there were another fast, safe alternative, it would be interesting to hear.


—Shane



Greg KH

unread,
Feb 21, 2019, 12:55:04 PM2/21/19
to Stephen Day, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev
On Thu, Feb 21, 2019 at 09:21:25AM -0800, Stephen Day wrote:
> > Also, you already use LGPL code just fine in your company's systems, so
> I doubt it's a "real" issue :)
>
> Will you do the work of convincing everyone’s legal teams to adopt new lgpl
> code? It’s extra work and risk that a lot of organizations are unwilling to
> take. Trivializing that is pretty condescending.

I'm not trying to "trivialize" it, other to say that people's "fear" of
the license is usually not real when it comes to actually using
something they need to use (i.e. you are already using code licensed
this way).

As was already pointed out, this is about the spec, not the
implementation; there is already another implementation with a license
you might like better. License issues have no real play here, except the
one governing the spec itself.

thanks,

greg k-h

Greg KH

unread,
Feb 21, 2019, 12:59:41 PM2/21/19
to Stephen Day, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev
On Thu, Feb 21, 2019 at 09:27:27AM -0800, Stephen Day wrote:
> > No, please don't use squashfs for anything new. It really does not work
> well for many things. There are other ways to have compressed
> filesystems if you _really_ think it is needed (hint, it almost never
> is...)
>
> Could you expand on this? What are the problems? This is one of the largest
> asks I’ve been hearing about for support in container images.

Oh where to begin :)

There have been numerous attempts to fix a bunch of the performance and
implementation details of squashfs over the years. All of them ended up
failing, for the simple reason that it turns out not to be all that good
an image format to use as a real filesystem.

Lots of systems are ripping it out now, Android being one well-known
example that is easy to point to. There are other attempts at creating
read-only, compressed filesystems to replace it; erofs is one promising
attempt.

If all you care about is reducing the amount of data over the wire, then
try compressing the filesystem image itself and then expanding it
in-place. That seems to be the goal here, not the fact that you need a
compressed image on disk that blows up huge when loaded into memory,
right?

Anyway, squashfs is very dated, please do not mandate its use in any new
protocol or specification.

thanks,

greg k-h

Greg KH

unread,
Feb 21, 2019, 1:01:45 PM2/21/19
to vanessa sochat, Stephen Day, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev
On Thu, Feb 21, 2019 at 09:38:16AM -0800, vanessa sochat wrote:
> I'd be interested to hear your thoughts on this too - I've really liked
> squashfs (and it's derivatives) after seeing the huge reduction in size
> from ext3/ext4 and (still) having read only reproducible containers.

Size where, as an image in storage, or in memory? In memory, it gets
big. It also suffers real performance issues that are very obvious when
using low-powered devices.

If you want a good read-only filesystem image to base things on, look at
something like erofs for where people have designed it to work better
after learning the problems of squashfs.

thanks,

greg k-h

Stephen Day

unread,
Feb 21, 2019, 1:41:30 PM2/21/19
to Greg KH, vanessa sochat, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev
I'm not saying that your interpretation isn't correct. It's just that when the LGPL is brought up, organizations tend to go into risk mitigation mode. Even if there is an alternative with a separate license (assuming you mean https://github.com/folbricht/desync), there is still the issue of whether or not it is truly unencumbered. It's not necessarily logical or rational, but it is something that should be considered.

If we can avoid encoding the use of an LGPL encumbered technology into the OCI spec, it would be for the better.

This isn't even my opinion necessarily. It's just the reality of how these licenses are interpreted. For an example, see Google's open source policy: https://opensource.google.com/docs/thirdparty/licenses/#types. They do have a dynamic linking exception but that isn't really easy to do with Go.

Either way, it sounds like we can consider a wide range of technologies. casync compatibility shouldn't even be a requirement.

As far as squashfs is concerned, I wouldn't propose making it the base of everything, but rather, just an option for building and mounting images. I do have concerns with it from a security perspective: loading a remote blob into kernel memory seems like there is room for a problem.

Thanks for elaborating on the issues with squashfs. It's good to have these laid out.

Cheers,
Stephen.

Aleksa Sarai

unread,
Feb 21, 2019, 1:48:39 PM2/21/19
to Stephen Day, Greg KH, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev
On 2019-02-21, Stephen Day <stev...@gmail.com> wrote:
> > No, please don't use squashfs for anything new. It really does not work
> well for many things. There are other ways to have compressed
> filesystems if you _really_ think it is needed (hint, it almost never
> is...)
>
> Could you expand on this? What are the problems? This is one of the largest
> asks I’ve been hearing about for support in container images.

One other issue which I heard from Tycho is that squashfs has
effectively no library support and so generating or otherwise operating
on squashfs archives is basically impossible without shelling out to
mksquashfs a bunch of times.

While that's not impossible to work around, it is an issue.

vanessa sochat

unread,
Feb 21, 2019, 1:54:25 PM2/21/19
to Aleksa Sarai, Stephen Day, Greg KH, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev
Do you mean libraries as in APIs to interact with other languages? I also have been looking for a more system-level API, but I never found one aside from the binaries that you mentioned. It would be hugely useful, for reasons outside of container space. But for actually interacting with the images, doesn't it come down to writing clients to parse the binary format? Would that be really hard? I wrote a Python client to parse SIF headers, but I didn't get into the filesystem itself. The tip of the iceberg is likely much simpler than the beast below it.


Stephen Day

unread,
Feb 21, 2019, 2:02:52 PM2/21/19
to vanessa sochat, Aleksa Sarai, Greg KH, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev
I think the biggest problem with making a squashfs client library is getting the unpacking sizes correct and correctly implementing the file system structure. I think the format might actually be defined by the implementation, which can make it hard to follow in a user space library. For example, I remember having a tough time getting the btrfs go bindings to get the right field width for unpacking ioctl structs. It still requires a pass through a C compiler to do that correctly. I think squashfs is less system-specific, but it still may represent a challenge.

I don't think squashfs would ever become a default. This would just be a new, optional layer mediatype that could be understood by implementations. The implementation would invoke the existing tools to build and mount squashfs.

I think we should be focusing on getting several different filesystem representations working, rather than trying to find a single technology that meets everyone's needs.

Greg KH

unread,
Feb 21, 2019, 2:09:16 PM2/21/19
to Stephen Day, vanessa sochat, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev
On Thu, Feb 21, 2019 at 10:41:17AM -0800, Stephen Day wrote:
> If we can avoid encoding the use of an LGPL encumbered technology into the
> OCI spec, it would be for the better.

Again, this isn't a "LGPL encumbered technology". Anything that ends up
in the spec is "encumbered" by the license of the spec. You are free to
write code in any license to match that spec.

> As far as squashfs is concerned, I wouldn't propose making it the base of
> everything, but rather, just an option for building and mounting images. I
> do have concerns with it from a security perspective: loading a remote blob
> into kernel memory seems like there is room for a problem.

That's a known issue, never do that, bad things are guaranteed to
happen. We fixed a number of the "known" problems in that area, but we
also "know" there are many left to be found. Everyone moved away from
squashfs instead of worrying about fixing those.

thanks,

greg k-h

vanessa sochat

unread,
Feb 21, 2019, 3:52:37 PM2/21/19
to Stephen Day, Aleksa Sarai, Greg KH, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev

I think we should be focusing on getting several different filesystems representations working, rather than trying to find a single technology that meets everyone's needs.

+1. We never would :) 

Aleksa Sarai

unread,
Feb 21, 2019, 9:32:00 PM2/21/19
to Stephen Day, Greg KH, vanessa sochat, Vincent Batts, Aleksa Sarai, Atlas Kerr, dev
On 2019-02-21, Stephen Day <stev...@gmail.com> wrote:
> If we can avoid encoding the use of an LGPL encumbered technology into the
> OCI spec, it would be for the better.
>
> This isn't even my opinion necessarily. It's just the reality of how these
> licenses are interpreted. For an example, see Google's open source policy:
> https://opensource.google.com/docs/thirdparty/licenses/#types. They do have
> a dynamic linking exception but that isn't really easy to do with Go.

But again, desync isn't LGPL licensed -- the copyright license of a
project only applies to a particular implementation of something (yes,
you can copyright a spec but as far as I understand you're actually
doing that as a way of providing a patent grant rather than actually
caring if someone copies/modifies/distributes the spec document).

> Either way, it sounds like we can consider a wide range of technologies.
> casync compatibility shouldn't even be a requirement.

I agree.

> As far as squashfs is concerned, I wouldn't propose making it the base of
> everything, but rather, just an option for building and mounting images. I
> do have concerns with it from a security perspective: loading a remote blob
> into kernel memory seems like there is room for a problem.

I wouldn't mind either way if we add a new MIME type for squashfs (as
long as it's optional for implementations to actually implement) but it
should be noted that my original point was that we should see if we can
get squashfs-like benefits (not having an extraction-step) without the
downsides of squashfs or layer-based designs.

The idea Tycho and I discussed was creating an overlay-like filesystem
implementation for OCI images which would allow you to remove the
extraction step. You could do this by either creating an alternative
on-disk representation of an OCI layout, or by reusing the common "dir"
layout (though I think having a single-file representation would be
better for squashfs-like designs).

It's unlikely we'd be able to upstream such a filesystem (unless it was
significantly more generic), but having a project like that would be
useful for quite a lot of people who want to be able to mount signed
images without an extraction step.

In addition, there is quite a lot of interesting work happening in FUSE
(extending it with eBPF to improve performance) and I think we might be
able to take advantage of that quite significantly with a FUSE
implementation of the above idea.

Vincent Batts

unread,
Feb 22, 2019, 9:38:58 AM2/22/19
to Akihiro Suda, Atlas Kerr, Aleksa Sarai, Stephen Day, dev
On 21/02/19 13:38 +0900, Akihiro Suda wrote:
>> The biggest thing we can do to enable better reuse is to disable
>compression at the layer level. Doing so will let us provide better
>deduplication at the tar layer and innovate without changing the underlying
>format too much. We can progress to all these other methods once we do that.
>
>I have a proposal for this:
>https://github.com/AkihiroSuda/filegrain/issues/21
>
>* By pushing uncompressed `vnd.oci.image.layer.v1.tar` blobs to a registry
>(via `Transfer-Encoding: gzip`), the registry can deduplicate tar balls in
>arbitrary chunk-level/file-level algorithm that can reproduce sha256 of
>original tar balls.

Oh fun. Sounds like an iteration of https://github.com/vbatts/tar-split

Vincent Batts

unread,
Feb 22, 2019, 9:42:41 AM2/22/19
to Cirujano Cuesta, Silvano, atla...@gmail.com, stev...@gmail.com, d...@opencontainers.org, cyp...@cyphar.com
On 21/02/19 09:59 +0000, Cirujano Cuesta, Silvano wrote:
>On Wed, 2019-02-20 at 15:07 -0600, Atlas Kerr wrote:
>> > I've since chatted a bit with the casync devs. A working group would be
>> > good. 
>>
>> That would be awesome!
>>
>> What would have to be submitted to the `tob` repo to get the process moving?
>
>I'd vote for any measure to "get the process moving" :-)
>
>Getting some improvements in the direction would be amazing for companies with edge use cases (like mine).
>I'm sure that there are some of them, although not very visible here.
>
>I've been looking at OStree as the initial approach, but always with an eye on casync approach.
>I haven't compared the casync and restic approaches, but specifying a format based on one of them sounds good to me.

From my talks with cgwalters about ostree, I understand the reasons the
decision was made not to include mtime for files. And the farm of
hardlinks is neat, but it sounds like, had ostree been designed again
_today_, it would include mtime and likely use VFS features like
reflinks (found in XFS and btrfs) instead of hardlinks, and have better
chunk deduplication on disk.

I think the next iteration of CoW+file/chunk+whatever will be in this
vein.

Vincent Batts

unread,
Feb 22, 2019, 9:47:01 AM2/22/19
to Greg KH, Stephen Day, Aleksa Sarai, Atlas Kerr, dev
I get that. This and loopback btrfs or whatever is _clever_, but I would
never ever recommend it for a general use case. The only times I hear of
this going to production, it's either YOLO or they own the code from
soup to nuts. You should not mount a filesystem image that you didn't
create yourself (and you should still consider your life choices even
then).

vb

Vincent Batts

unread,
Feb 22, 2019, 10:37:12 AM2/22/19
to Aleksa Sarai, Atlas Kerr, dev
My buckets are referring to areas of optimization: largely storage
consumption, not so much memory consumption.
Currently TAR+CoW is most optimized for the end nodes, and sucks for
bandwidth and remote registries (apart from Akihiro's filegrain
project).
Some of these file/chunk-level ideas could improve the registry and
bandwidth side, but without decent reflinking in the VFS they could make
the end-node storage worse.
the end node storage worse.

Service providers charging for bandwidth and storage surely don't mind
the inefficiencies.

My point is that having options is fine, but the use-case of most folks
is optimizing for the end node, and currently that is "decent" (however
much I don't care for TAR). Optimizing for the other areas at the
expense of the end node would only be interesting to some, not for the
general use-case.

>> What we've got "works", and unless a v2 format doesn't give an order of
>> magnitude improvement for the general case then, the conversation may
>> just become an example of how some with a narrow-case could craft their
>> own format and find their own improvements.
>
>Felix did a quick comparison of different approaches[1]. Just with their
>base images, the de-duplication benefit was 4x for chunk-based
>de-duplication (and a bit less for file-based de-duplication). While
>implementing an extraction tool that uses concurrency will take some
>effort, I am willing to bet it will be quite a bit faster than the
>current system especially when you consider that extraction could occur
>*during* download.
>
>If we consider the most common use case of someone downloading an Ubuntu
>image for the 5000th time, I'm pretty sure we have a clear argument.

But this is the subtlety that I'm looking at. With the chunk/file-level
discussion, the concept of these layers may blend away a bit. With a
name and digest reference to the parent that your new image was derived
from, there is no need to provide all these layers -- only to provide the
Merkle manifest of the chunks to be fetched for your final rootfs view.
If I take that 80MB Ubuntu base and install some package that litters
tiny files across /etc /usr /bin /whatever, then this would have to be
inlined into a streamable Merkle tree of chunks. That may well just
offset the chunks, causing cache misses, despite having derived from the
same Ubuntu base image.

>> I'm concerned regardless of chunks (i.e. restic or casync) or file level (i.e.
>> ostree), having chunk reuse across local node, local cluster and remote
>> CAS registry, and to have any garbage-collection guidance.
>
>I agree garbage-collection is going to be a "fun" issue to solve.
>
>> The other concern I have is the fetching these small files. This will
>> require like git's smart pack or similar.
>
>I completely agree, and would like to see what Steven has cooked up to
>solve this with "chunk maps". Packing might be the only reasonable
>solution at the end of the day (I did have dreams that HTTP/2 could
>solve this with server-push but that requires far too much semantic
>information in the server about what is being downloaded). :D
>
>[1]: https://github.com/openSUSE/umoci/issues/256#issuecomment-430413607
>
>--
>Aleksa Sarai
>Senior Software Engineer (Containers)
>SUSE Linux GmbH
><https://www.cyphar.com/>
>

Aleksa Sarai

unread,
Feb 22, 2019, 11:22:54 AM2/22/19
to Vincent Batts, Atlas Kerr, dev
It's "most optimised" for the storage layout of an overlay filesystem --
not necessarily for the node because you have a lot of duplication on
the filesystem. The same layer duplication issues we have from tar exist
once on the node because you generally will extract each layer to
separate directories (even if it's similar to another one).

> Some of these file/chunk level ideas could improve the registry and
> bandwidth, but without decent reflink'ing in the vfs, then it could make
> the end node storage worse.

Having a filestore with reflinks would solve the issue, and even without
reflinks you're looking at a worst-case of file-based deduplication
which is far better than what we currently have (you would hard-link the
file store for each image you want to use as a lowerdir).

> My point is that having options is fine, but the use-case of most folks
> is optimizing for the end-node, and currently it is "decent" (however
> much I don't care for TAR). And optimizing for the other areas at the
> expense of the end would only be interesting to some, not for the
> general use-case.

I don't agree that we are optimising at the expense of the end node,
quite the opposite.

Stephen Day

unread,
Feb 22, 2019, 4:30:24 PM2/22/19
to Aleksa Sarai, Vincent Batts, Atlas Kerr, dev
Am I hearing that we should make a new filesystem? I think we should make a filesystem that can do CAS, CDC chunking and reflinking.

You can achieve this in FUSE, but it’s slow. Not sure if we could build it with FUSE and then force the data into the VFS buffer cache, or use some BPF magic.

Aleksa Sarai

unread,
Feb 22, 2019, 8:34:44 PM2/22/19
to Stephen Day, Vincent Batts, Atlas Kerr, dev
On 2019-02-22, Stephen Day <stev...@gmail.com> wrote:
> Am I hearing that we should make a new filesystem? I think we should make a
> filesystem that can do cas cdc chunking and ref linking.

I am suggesting that a new filesystem would be a good *optional* way of
optimising usage of OCI images, and it would allow you to have
actually-signed root filesystems (without an extraction step). Tycho has
said that he'd be willing to help us work on this in parallel with
whatever other things we'd like to get done.

Unfortunately you can't use reflinks with CDC chunking directly (with
each chunk being reflinked) because of a fundamental limitation in
filesystems -- you have to reflink chunks that are multiples of the
filesystem block size. But we could create an overlay-like filesystem
that would remove the need for reflinks entirely (and the "extents"
would be the CDC blobs) -- with reflinks or hardlinks being used for the
more traditional extraction-based system that would work without our
filesystem.
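
For anyone who wants to poke at that alignment restriction, here is a
minimal sketch of a ranged reflink in Go, assuming the FICLONERANGE
wrappers in golang.org/x/sys/unix:

package sketch

import (
    "os"

    "golang.org/x/sys/unix"
)

// reflinkRange asks the filesystem (btrfs/XFS) to share a byte range
// between two files without copying the data. The kernel rejects ranges
// whose offsets/length are not multiples of the filesystem block size
// (a range running to EOF being the usual exception), which is why
// arbitrary content-defined chunk boundaries can't simply be reflinked.
func reflinkRange(src, dst *os.File, srcOff, dstOff, length int64) error {
    return unix.IoctlFileCloneRange(int(dst.Fd()), &unix.FileCloneRange{
        Src_fd:      int64(src.Fd()),
        Src_offset:  uint64(srcOff),
        Src_length:  uint64(length),
        Dest_offset: uint64(dstOff),
    })
}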

> You can achieve this in fuse, but it’s slow. Not sure if we could build it
> with fuse and then force the data into the Vfs buffer cache or use some bpf
> magic.

This is what FUSE-eBPF is trying to accomplish -- removing calls to
user-space with pre-emptive caching using eBPF maps. I spoke to the PhD
student working on this (their first focus was metadata caching but the
eBPF map system could be used to tell FUSE that a given inode should be
operated on with data operations -- thus bypassing userspace for the
read/write and some lookup paths).

Greg KH

unread,
Feb 23, 2019, 2:46:20 AM2/23/19
to Stephen Day, Aleksa Sarai, Vincent Batts, Atlas Kerr, dev
On Fri, Feb 22, 2019 at 01:30:11PM -0800, Stephen Day wrote:
> Am I hearing that we should make a new filesystem? I think we should make a
> filesystem that can do cas cdc chunking and ref linking.
>
> You can achieve this in fuse, but it’s slow. Not sure if we could build it
> with fuse and then force the data into the Vfs buffer cache or use some bpf
> magic.

No, never create a new filesystem unless you have 5-10 years to focus
exclusively on it before you can rely on it.

greg k-h

Vincent Batts

unread,
Feb 23, 2019, 8:05:19 AM2/23/19
to Greg KH, Stephen Day, Aleksa Sarai, Atlas Kerr, dev


I was not proposing a new filesystem. I was proposing that something _like_ ostree but making use of reflinks and providing mtimes would be the path.

vanessa sochat

unread,
Feb 24, 2019, 7:40:00 AM2/24/19
to Vincent Batts, Greg KH, Stephen Day, Aleksa Sarai, Atlas Kerr, dev
A bit late to post, but better than never! I stumbled on this nice read just now about Merkle trees (mentioned a few times in this discussion):

In case others find it useful/interesting!


ktokuna...@gmail.com

unread,
Feb 26, 2019, 4:59:49 AM2/26/19
to dev
Hi all,

This is a quite interesting discussion for me. Our private
clouds are managed with container technology, so this kind of
image issue applies to our use case.

Recently I have been looking into image de-dup and lazy-pull.
In particular, I like the idea of CDC block-level (or
file-level) chunking and lazy-pulling of the chunks (as in
Slacker, CernVM-FS, FILEgrain, and so on).

What I have been wondering is:
"Without any modification to runtimes or registries, can we
achieve block-level de-dup and lazy-pull?"
I think we can, and the proposal is the following:

* By introducing a small init program (we call it 'boot')
  into the container, which is responsible for constructing
  the rootfs for the actual application inside the container,
  we can achieve de-dup and lazy-pull without any modifications
  to container runtimes and registries, and without any
  dedicated NFS infrastructure (a rough sketch of such a 'boot'
  program follows below).
* By applying this 'boot' program approach, we can deal with
  similar kinds of drastic changes in the future, in a very
  flexible and compatible way.
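
To make that concrete, here is a very rough Go sketch of such a 'boot'
init; all paths, the mount-helper invocation and the readiness check are
assumptions for illustration, not necessarily how the PoC actually works:

package main

import (
    "log"
    "os"
    "os/exec"
    "syscall"
    "time"
)

// A toy "boot" init: mount the lazily-pulled rootfs with a FUSE helper,
// wait for it to show up, then chroot into it and exec the real
// application. Requires /dev/fuse and enough privileges in the container.
func main() {
    const mnt = "/run/rootfs" // hypothetical mountpoint

    if err := os.MkdirAll(mnt, 0755); err != nil {
        log.Fatal(err)
    }

    // Hypothetical mount-helper invocation (the PoC uses casync/desync).
    mount := exec.Command("casync", "mount", "/image/rootfs.caidx", mnt)
    mount.Stderr = os.Stderr
    if err := mount.Start(); err != nil {
        log.Fatal(err)
    }

    // Crude readiness check: wait until the mountpoint has content.
    for i := 0; i < 100; i++ {
        if entries, err := os.ReadDir(mnt); err == nil && len(entries) > 0 {
            break
        }
        time.Sleep(100 * time.Millisecond)
    }

    // Enter the mounted tree and hand over to the application.
    if err := syscall.Chroot(mnt); err != nil {
        log.Fatal(err)
    }
    if err := os.Chdir("/"); err != nil {
        log.Fatal(err)
    }
    log.Fatal(syscall.Exec("/usr/bin/app", []string{"app"}, os.Environ()))
}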

I have a rough PoC of the image converter internally,
which uses casync and desync. If you are interested,
I can share the source code.

There are some points to discuss in applying the approach:

* It uses FUSE functionality inside the container. That means
  we need to use privileged containers and introduce
  /dev/fuse (this seems to be a common issue: [1]).

* Because desync doesn't talk the registry API, we need an
  extra blob store. But we can solve that by extending
  desync, and the change stays contained inside the container.

Finally, I am willing to support this activity.




poett...@gmail.com

unread,
Feb 27, 2019, 10:56:04 AM2/27/19
to dev, atla...@gmail.com, Lennart Poettering
Sorry for the late reply. I was pointed to this recently and just checked the thread now:


On Wednesday, February 20, 2019 at 7:47:54 PM UTC+1, Aleksa Sarai wrote:

It wouldn't be reasonable to *just* use casync, because we'd then have
to re-use their .cidx format which would make it annoying to handle
dealing with casync blobs (you'd need to fetch the .cidx blob and parse
that rather than using the current JSON walk we have). So modifying
casync or otherwise augmenting it is required off-the-bat.

Parsing caidx is trivial, btw: it's a tiny header followed by a simple array of offset+hash pairs. It's binary, and that's a good thing, I am sure. Doing this in JSON instead comes at a major price: with the current scheme we can locate any byte in the archive in O(log(n)) time, which means we can do random access without having to download and parse a huge JSON tree first. In fact, by keeping this binary we can calculate the index into the chunk list nicely, and even mmap stuff if we want.

This is a biggie, btw, as it means we can FUSE-mount caidx/catar files across the network, with good performance (everything O(log(n))) and with only the blocks actually accessed being retrieved. Now, I doubt the FUSE thing is too relevant for OCI directly, but then again, agreeing on a common format that allows this is a good thing.
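
To make the O(log(n)) point concrete, here is a minimal Go sketch of that
lookup, with the index modelled as a sorted array of end-offset + hash
records (the real header and record layout are casync's and aren't
reproduced here):

package sketch

import "sort"

// IndexEntry models one record of a caidx/caibx-style index: the offset
// at which a chunk ends in the reassembled stream, plus the chunk's
// content hash.
type IndexEntry struct {
    EndOffset uint64   // exclusive end of this chunk in the whole stream
    ChunkID   [32]byte // hash used to fetch the chunk from the store
}

// chunkFor finds the chunk covering byte pos with a binary search over
// the index -- O(log n), without reading any payload at all.
func chunkFor(index []IndexEntry, pos uint64) (int, bool) {
    i := sort.Search(len(index), func(i int) bool {
        return index[i].EndOffset > pos
    })
    if i == len(index) {
        return 0, false // pos is beyond the end of the stream
    }
    return i, true
}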
 
 * The format is a canonicalised version of tar. This means that it
   inherits many of the problems discussed in [1] off the bat. The only
   significant problem it solves is the canonical representation problem
   (which is great). Parallelism of extraction is no better (as far as I
   understand .caidx is a lookup table for where chunks come from not
   where tar-level data is stored).

Not true.

It shares no history with tar really, and it is fully indexed. Finding a directory entry can be done with random access in O(log(n)) time. This is very different from tar where to decode a file at the end of the stream you always have to read everything before it.

At the end of the serialization of each directory there's a short index table which allows quick access to the entries stored before. This makes the format nicely hybrid: you can stream it when decoding/encoding it, but you can also do random access after the fact. Moreover, you can fully parallelize generation, and decoding of it, though the casync implementation currently doesn't. I am currently working on making that part happen though (if you look at the github repo you'll find a PR that is a first step for parallelizing generation btw).

   Reproducibility is better than stock tar but still has issues
   (hardlinks can result in some complications, and as far as I know
   casync doesn't explicitly reject non-canonical catar formats)


Hardlinks are currently not stored as hardlinks, but as individual files. And there's only one valid serialization of any tree, and casync refuses any serialization that doesn't match it. So yes, we explicitly refuse any serialization that is even one bit off. Reproducibility is key for us, always has been, and I keep stressing that in all my talks about the topic.

  * The chunking is done after serialisation, which means that metadata 
   is combined with data in a way that requires parsing tar in order to
   understand what is in an image. I have some fairly radical ideas
   about being able to testify what packages an image contains (in a
   verifiable way) that would be rendered difficult with this setup.

Please don't say "tar", casync makes zero use of the tar concept. Our format has nothing to do with it. We don't read tar, nor generate tar.

To enumerate what is in an a caidx/catar image you have to parse it. That's true for most formats really. You can efficiently skip over the file contents though, but it is true that metadata is distributed over the serialization, which is a side-effect from out emphasis on "composability", i.e. the fact that the serialization of a directory is always the strict concatenation of the serialization of the files and directories inside it, with no pointers from the inner to the outer or the outer to the inner. This is a particularly nice property if you have a lot of similar trees that share common subtrees (such as your typical Linux OS tree).

but anyway: the "casync list" operation (which dumps a list of filenames in an archive) is a lot faster than on tar, simply because we can efficiently skip over payload and do so, and don't have to retrieve payload chunks from the network if they contain no metadata.
 
   It also means that metadata changes will cause duplication of
   transfer and storage for small files (though I am unsure how big of
   an issue this is -- and there are reasons why you don't want lots of
   tiny objects).

casync's emphasis is on evening out chunk sizes. This means small files are joined together until a chunk is full, before a chunk is generated. Moreover, large files are split into smaller pieces to make sure they result in the same average chunk size too. This concept is a *strength* of casync, since for tiny chunks you'd have to pay a high price for delivery (think: if you have tiny chunks you need a ton more HTTP request control data, even if you pipeline). Or in other words: in casync, if any byte of the image changes then the chunk around it will have to be downloaded again, but it's always the same price you pay: you never have to re-download huge files because the file you changed is huge, nor do you have to re-download lots of tiny files again.
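
For readers not familiar with how the "evening out" works in
content-defined chunking generally, here is a toy Go sketch in the
Gear/FastCDC style; the hash, table and all size parameters are made up,
and casync's real chunker differs:

package sketch

import "math/rand"

// gear is a table of pseudo-random values, one per byte value, as used by
// Gear-style rolling hashes (the values here are arbitrary).
var gear = func() [256]uint64 {
    r := rand.New(rand.NewSource(1))
    var t [256]uint64
    for i := range t {
        t[i] = r.Uint64()
    }
    return t
}()

// cdcChunks splits data at content-defined boundaries. The masked bits of
// the rolling hash depend only on the last few bytes seen, so an edit
// early in the stream only moves boundaries near the edit, while the
// min/max limits keep chunk sizes evened out.
func cdcChunks(data []byte) [][]byte {
    const (
        minSize = 16 * 1024
        maxSize = 256 * 1024
        mask    = uint64(1)<<16 - 1 // roughly 64 KiB average chunks
    )
    var chunks [][]byte
    start := 0
    var h uint64
    for i, b := range data {
        h = (h << 1) + gear[b]
        size := i - start + 1
        if (size >= minSize && h&mask == 0) || size >= maxSize {
            chunks = append(chunks, data[start:i+1])
            start = i + 1
            h = 0
        }
    }
    if start < len(data) {
        chunks = append(chunks, data[start:])
    }
    return chunks
}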

Note that ostree works differently in this regard: for them each file is handled separately, hence large files mean large HTTP requests and small files mean lots of small HTTP requests. This was a problem for them, which is why they did binary delta stuff in the end, to deal with that. casync kinda is the way out here: it makes sure that byte changes don't explode into too many metadata changes nor too many payload changes, but always cost the same bounded amount of change.

On Linux we tend to have a lot of tiny files actually (think: /etc), hence this logic actually matters a lot.

 * There have been some discussions with folks from Cisco about whether
   a new format could make their OCI usage (they use squashfs with
   overlayfs so that the executing filesystem is actually signed rather
   than just the tar representation) work more generically. I don't
   think that the tar-based design of casync would allow for this.

casync can operate on two levels: on the block level and on the file system level. When it operates on the block level it packs up raw file systems, such as ext4 or squashfs. If it operates on the file system level it instead serializes files and directories like tar would, but in something we call "catar", which has similar properties, but is strictly reproducible, a lot more careful with file metadata, and guarantees that there's only one valid serialization for each tree.

When operating on the block layer, casync generates a .caibx index file referring to the raw block data. When operating on the file system layer it instead serializes the tree into a .catar, which it then generates a .caidx index file from. The two index file types .caibx and .caidx are actually completely the same; the slightly different suffix is just supposed to give the user a hint whether the contained data is a raw block device or a tar-like serialization.

Now, historically, when you look at casync's block device support it has been pretty good with normal file systems such as ext4 and things. It performs much worse on squashfs though, since squashfs is compressed and thus most of the redundancies and similarities between the trees are already gone, and the fact that casync chunks differently than the compressed blocks in the squashfs file system reduces its effectiveness quite a bit. I am currently working on fixing that, however, by adding a concept that allows casync to recognize where squashfs blocks start/end and thus synchronize its own chunking with them.

Vincent Batts

unread,
Feb 27, 2019, 11:12:54 AM2/27/19
to poett...@gmail.com, dev, atla...@gmail.com, Lennart Poettering
But finding those directory indexes still requires reading through and
fetching chunks. This is not quite seekable. Right?

Vincent Batts

unread,
Feb 27, 2019, 11:23:17 AM2/27/19
to poett...@gmail.com, dev, atla...@gmail.com, Lennart Poettering
On 27/02/19 07:56 -0800, poett...@gmail.com wrote:
Though by concatenating across these file boundaries, dropping a new
file into a directory will very likely shift these offsets, causing new
chunks to be generated. I realize that this could be worked around by
deriving where reflinks could happen when expanding a caidx to disk. But
for the case of /usr/bin when building "layered" containers, it will
cause a fair amount of chunk churn.

Lennart Poettering

unread,
Feb 27, 2019, 11:56:07 AM2/27/19
to Vincent Batts, dev, atla...@gmail.com
On Mi, 27.02.19 11:12, Vincent Batts (vba...@redhat.com) wrote:

> > At the end of the serialization of each directory there's a short index
> > table which allows quick access to the entries stored before. This makes
> > the format nicely hybrid: you can stream it when decoding/encoding it, but
> > you can also do random access after the fact. Moreover, you can fully
> > parallelize generation, and decoding of it, though the casync
> > implementation currently doesn't. I am currently working on making that
> > part happen though (if you look at the github repo you'll find a PR that is
> > a first step for parallelizing generation btw).
>
> But finding those directory indexes still requires reading through and
> fetching chunks. This is not quite seekable. Right?

Some chunks. But only those chunks that actually contain the file
metadata.

Lennart

--
Lennart Poettering, Red Hat

Lennart Poettering

unread,
Feb 27, 2019, 12:18:39 PM2/27/19
to Vincent Batts, poett...@gmail.com, dev, atla...@gmail.com
Well, but these changes propagate pretty minimally, due to the focus
on "composability": any subtrees that don't change at all will always
result in the exact same bit image. But yes, if one file changes size,
then this will affect the immediate chunk around that change, plus the
end-of-directory record for each directory this file is contained in,
all the way up the tree. i.e. if you insert one byte in
/foo/bar/baz.txt then this will affect 4 locations in the stream: the
serialization of the file /foo/bar/baz.txt's payload itself, plus
the end-of-directory record of /foo/bar, of /foo and of /. However
that's it. The number of chunks changing is dependent on the depth of
the directory tree if you so will. But given that directories are a
concept of grouping usually when multiple things change they tend to
be close and thus the end-of-directory records are going to be the
same ones.

> worked around by making deriving where reflinks could happen when
> expanding a caidx to disk. But for the case of /usr/bin on a building
> "layered" containers, it will cause a fair amount of chunk churn.

The end-of-directory records never hit the disk when extracting
archives. In fact if you extract an archive serially (as you normally
do), then the end-of-directory records are pretty much ignored (not
entirely, they are always validated, as we strictly validate every
byte passing through to ensure reproducibility at every step).

Currently, if you use casync to extract a caidx/catar on an existing
directory tree (and the tree was never seen before), then for each
file in the stream it will unpack it into a temporary file first,
placed in the directory it shall end up in. While extracting it will
try to reflink as much as it can from the existing tree (to be
precise: in any file in the tree, the paths don't have to match,
i.e. this is very efficient for file renames and moving files within
the tree to different subtrees). When it is done extracting the file
it will check if the file already exists and, if so, whether it is identical
in contents and metadata. If so, it will remove the temporary file
again and leave the old file in place. If they are different otoh
we'll atomically replace the old file with the temporary file. It does
this to optimize disk space: if the old file is good enough we'll just
keep that one in place, and thus will continue sharing any data it is
sharing with other files. Only if it doesn't match the existing one
we'll replace it and make a change to the disk image. But even then
we'll use reflinks as much as we can, so that we share as much as
possible. I wrote it that way with btrfs subvols + reflinks and xfs
reflinks in mind: so that you can take a btrfs subvol snapshot or a
btrfs/xfs reflink copy, and then "mutate" it with casync replacing
only the files that actually changed with absolute minimal disk ops in
the end. overlayfs should benefit from this too, as this means the
temporary files would be written to the top layer always, and copy-up
never has to take place, as we'll always atomically write each full
file, and then either cancel it as a whole or keep it as a whole.

This behaviour has lots of benefits, including somewhat "atomic"
behaviour: apps either see the old or the new files, but never half-
written files. However it also has some negative effects: for every
file we process, the containing dir's mtime will be changed while we
unpack it and then reset after we are done with the dir. I still think
this is the best behaviour we can have, given the available Linux
APIs...
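
A stripped-down Go sketch of that "unpack to a temporary file, keep the
old one if it already matches, otherwise atomically rename over it" idea,
with the content already in memory and metadata comparison, reflinking
and streaming from the chunk store all left out:

package sketch

import (
    "bytes"
    "os"
    "path/filepath"
)

// replaceIfChanged writes newContent to a temporary file next to the
// target, keeps the existing file if it already matches, and otherwise
// renames the temporary file over it, so readers only ever see a complete
// old file or a complete new one.
func replaceIfChanged(path string, newContent []byte, mode os.FileMode) error {
    if old, err := os.ReadFile(path); err == nil && bytes.Equal(old, newContent) {
        return nil // old file is good enough; keep whatever it already shares
    }
    tmp, err := os.CreateTemp(filepath.Dir(path), ".unpack-*")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name()) // no-op once the rename has succeeded
    if _, err := tmp.Write(newContent); err != nil {
        tmp.Close()
        return err
    }
    if err := tmp.Chmod(mode); err != nil {
        tmp.Close()
        return err
    }
    if err := tmp.Close(); err != nil {
        return err
    }
    return os.Rename(tmp.Name(), path)
}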

ktokuna...@gmail.com

unread,
Mar 1, 2019, 3:25:43 AM3/1/19
to dev
Based on my experience with casync and desync, I think they
have great features:
- We can chunk a rootfs into block-level blobs using casync.
- We can use desync to lazy-pull blobs, caching them in
  arbitrary local directories, and then provision a
  catar file, which is a canonicalised archive
  format (great!).
- The catar file can be FUSE-mounted just like the original
  rootfs using casync.

But from an optimization point of view, I think the following
points should be discussed in quantitative ways:
- Chunking granularity
    - Block-level vs file-level vs hybrid.
- File metadata
    - Composability (which means no pointers between the inside
      and the outside of the stream) vs treating file metadata
      and data separately.
- Blob boundaries
    - Separating blobs at file boundaries (which means each
      blob doesn't cross file boundaries) vs evening out the
      chunk size.

Anyway, the source code of the rough PoC related to my
recent post on this thread has now been published.
> What I concerned about is something like that
> "Without any modification on runtime or registries, can we
> achieve block-level de-dup and lazy-pull?".
> I finally think we can do, and the proposal is following:


Brief summary:
* This is an image converter which aims to chunk an image
  into block-level CDC blobs using casync and to run it in a
  lazy-pull manner using casync and desync, without any
  modifications to container runtimes and registries, and
  without any dedicated NFS infrastructure. Of course, there
  are a lot of TODOs in the implementation... but the concept
  should be clear.

Thanks a lot for casync & desync, which are wonderful tools.