removal of /run/opencontainer/containers

Brandon Philips

Nov 25, 2015, 9:29:47 AM

to dev

Hello Everyone-

A while back we agreed to use a single common directory, now /run/opencontainer/containers, to store state.json files. After talking to a few projects that use filesystem-based APIs, I think we should change or remove this.

1) Correctness of these APIs is difficult. If we do feel we need this directory for state tracking, we need to document atomic file creation and deletion via O_TMPFILE/linkat or some "hidden file" mechanism. Otherwise we are going to go crazy trying to watch for changes.

2) Watching for changes is difficult. The primary use case for this directory was for things like cAdvisor to watch for state changes, but inotify is impossible to use without 100% correctness from the writers, which are going to be out of the control of the readers and likely to have bugs. See above.

3) Garbage collection is unsafe and nearly impossible to get right. This becomes even more true as we add lifecycle hooks that aren't tied to the application process. Someone reading this directory won't know whether a container is "dead" in this case. And if we have a writer that just lets stuff pile up, our reader has no way of figuring it out.

4) Boot time setup is tricky. Who will create this directory in the early boot? What will the permissions, owners, and ACLs be for the directory and where will the configuration be for that?
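
As a rough illustration of the "hidden file" mechanism in point 1, a writer could stage the JSON in a dot-prefixed temporary file in the same directory and then rename it into place, so that readers never observe a half-written state.json. This is only a sketch: publish_state and REGISTRY are hypothetical names, readers are assumed to ignore dot-files, and it does nothing about garbage collection (point 3).

```shell
#!/bin/sh
# Sketch: publish state.json atomically via a hidden temp file plus rename.
# publish_state and REGISTRY are hypothetical names, not part of any spec.
set -eu

publish_state() {
    # $1: registry directory, $2: container ID; state JSON arrives on stdin.
    dir="$1/$2"
    mkdir -p "$dir"
    # The temp file lives in the target directory, so mv is a same-filesystem
    # rename and the final name appears with its full content or not at all.
    tmp=$(mktemp "$dir/.state.json.XXXXXX")
    cat >"$tmp"
    mv -f "$tmp" "$dir/state.json"
}

# Demo against a throwaway directory instead of /run/opencontainer/containers.
REGISTRY=$(mktemp -d)
printf '{"id":"demo","pid":1234}\n' | publish_state "$REGISTRY" demo
cat "$REGISTRY/demo/state.json"
```

A watcher using inotify would then only need to react to the rename into state.json, not to partial writes, though it still depends on every writer following the same convention.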

At this point I think we should just scrap the whole thing and go back to the drawing board. We could do something like `/run/opencontainers/containers/<runtime>`, but most of these issues would remain and just get compartmentalized.

Thoughts?

Brandon

[1] https://github.com/opencontainers/specs/commit/180df9dd8f45417a212b4469e35181bcac11051d

Solomon Hykes

Nov 25, 2015, 10:01:38 AM

to Brandon Philips, dev

I agree. In my opinion this should not be part of the spec: lots of work for not a whole lot of benefit.


Alexander Morozov

Nov 25, 2015, 11:01:39 AM

to Solomon Hykes, Brandon Philips, dev

So, will there be an "action" to get state? Because I'm not sure that
removing state from the spec entirely is possible. However, I agree that
a hardcoded fs schema probably brings more harm than good.

Mrunal Patel

Nov 25, 2015, 11:06:12 AM

to Alexander Morozov, Solomon Hykes, Brandon Philips, dev

I am +1 but agree with Alex that there needs to be an action or API so one can get the state from the runtime, e.g. runc --id <container_id> state

Thanks,

Mrunal

Brandon Philips

Nov 25, 2015, 11:19:31 AM

to Alexander Morozov, Solomon Hykes, dev

On Wed, Nov 25, 2015 at 8:01 AM Alexander Morozov <lk4d...@gmail.com> wrote:

> So, will there be an "action" to get state? Because I'm not sure that
> removing state from the spec entirely is possible. However, I agree that
> a hardcoded fs schema probably brings more harm than good.

What is an "action"?

Mrunal Patel

Nov 25, 2015, 11:30:15 AM

to Brandon Philips, Alexander Morozov, Solomon Hykes, dev

>> So, will there be an "action" to get state? Because I'm not sure that
>> removing state from the spec entirely is possible. However, I agree that
>> a hardcoded fs schema probably brings more harm than good.
>
> What is an "action"?

https://github.com/opencontainers/specs/pull/225

W. Trevor King

Nov 30, 2015, 4:57:43 PM

to Mrunal Patel, Alexander Morozov, Solomon Hykes, Brandon Philips, dev

On Wed, Nov 25, 2015 at 08:06:10AM -0800, Mrunal Patel wrote:
> I am +1…

Me too.

> … but agree with Alex that there needs to be an action or API so one
> can get the state from the runtime, e.g. runc --id <container_id>
> state

Things like “notify cAdvisor of a new container” can be handled easily
by pre-start and post-stop hooks that register and de-register a
container. You could also use hooks like that to emulate the
/run/opencontainer/containers functionality, with a pre-start script
running:

STATE=$(cat)
ID=$(echo "${STATE}" | jq --raw-output .id)
DIR="/run/opencontainer/containers/${ID}"
mkdir "${DIR}"
echo "${STATE}" >"${DIR}/state.json"

and a post-stop script running:

ID=$(jq --raw-output .id)
rm -rf "/run/opencontainer/containers/${ID}"

So I don't see a need to have a separate “give me the state for
$CONTAINER” action.

I don't have a major problem with a “give me the state for $CONTAINER”
action. It seems convenient for users who want to poll containers
instead of getting information pushed from hooks. On the other hand,
it means implementations will need a way for non-start processes (like
‘runc --id <container-id> …’) to get information generated by the
‘runc start’ process. That sounds like it's punting the “how do we
keep (global?) state?” process from “manage /run/opencontainer” to
“manage something inside your implementation”. The extra flexibility
is nice, but I'd rather avoid the problem entirely and not require a
“give me the state…” action. If it turns out that lots of people need
an action like that, it would be easy to write it as a stand-alone
tool and plug that tool into the runtime via hooks. But if the
implementation folks see the “manage something inside your
implementation” approach as a lot easier to implement, then I'm fine
requiring them to do it ;).

Cheers,
Trevor


DL duglin

Dec 1, 2015, 10:12:16 AM

to Brandon Philips, dev

Let’s add this to the agenda for tomorrow’s call. This is pretty critical, especially as CNCF starts its work.

As of now, I kind of see a boolean choice here: either we mandate an external data store that 3rd party tooling can access (e.g. the fs) or we mandate the command line for all OCI implementations and let each impl store this data however they want.

While using the fs might be hard, we could help via some code examples to help people get it right.

Defining the cmd line offers some nice benefits for pluggability/interop, but might feel constraining.

And, of course, we could do both too.

- - - -

This assumes we want to have interop at this level at all, which I do think would be a good thing. No ability to have 3rd party tooling play a role here would limit the benefits of OCI, since we’d be punting this to some higher level (probably CNCF), and if all interoperable interactions with the runtime are at the CNCF level (and not OCI) then I think the benefit for the OCI customers is limited.

-Doug

Solomon Hykes

Dec 1, 2015, 10:37:22 AM

to DL duglin, Brandon Philips, dev

Doug, speaking for myself I don't plan on implementing Google's CNCF and don't consider it a legitimate place of interop (I can think of at least a dozen competing pseudo-standards at that layer). Even if you don't share my view, you should definitely not count on a successful outcome of CNCF to determine your stance in an OCI decision.

W. Trevor King

Dec 8, 2015, 6:51:59 PM

to dev, Alexander Morozov, Solomon Hykes, Brandon Philips, Mrunal Patel, Julian Friedman

On Mon, Nov 30, 2015 at 01:55:40PM -0800, W. Trevor King wrote:
> I don't have a major problem with a “give me the state for $CONTAINER”
> action. It seems convenient for users who want to poll containers
> instead of getting information pushed from hooks. On the other hand,
> it means implementations will need a way for non-start processes (like
> ‘runc --id <container-id> …’) to get information generated by the
> ‘runc start’ process. That sounds like it's punting the “how do we
> keep (global?) state?” process from “manage /run/opencontainer” to
> “manage something inside your implementation”.

There was some more discussion about this in the 2015-12-02 meeting
[1], but I don't think we reached a conclusion. On Linux (but not on
Solaris [2]) actions like ‘pause’ and ‘signal’ will need a way to
figure out which namespaces/cgroups to act on (externalFds? [3]). As Brandon
pointed out, maintaining a global directory of state JSON is tricky,
but Julz floated an option for letting the runtime-caller specify the
state JSON path [4]:

$ funC start --bundle foo/bar --state /my/container/state.json
$ funC pause --state /my/container/state.json

That allows us to punt the tricky parts to higher levels (there's no
built-in mechanism for listing containers, global registry, garbage
collection, …) while still allowing us to perform actions like ‘pause’
that use one runtime process to interact with a container created by
another runtime process.

In the absence of externalFds, Linux could probably get away with:

$ funC pause <PID>

but that's not a portable approach.

With support for unseekable files, callers like Docker could maintain
their state registry internally and use /dev/fd/3 and pipes instead of
paths like /my/container/state.json that point to a non-pseudo
filesystem.
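
The /dev/fd/3 idea can be sketched in shell, assuming a Linux-style /dev/fd. Since funC is a placeholder name, the sketch uses a stand-in function, fake_runtime, that only mimics writing state JSON to whatever path follows --state. The caller hands it a pipe on file descriptor 3 and keeps the JSON in a variable instead of on disk.

```shell
#!/bin/sh
# Sketch: capture state JSON through a pipe on fd 3 rather than a regular file.
# fake_runtime is a hypothetical stand-in for 'funC start --state PATH'.
set -eu

fake_runtime() {
    # Expects: start --state PATH. A real runtime would launch the container;
    # this stand-in only writes a fixed state document to PATH.
    [ "$1" = start ] && [ "$2" = --state ] || return 2
    printf '{"id":"demo","pid":1234}\n' >"$3"
}

# 3>&1 points fd 3 at the command-substitution pipe; >/dev/null then silences
# the runtime's ordinary stdout so that only the state JSON is captured.
STATE=$(fake_runtime start --state /dev/fd/3 3>&1 >/dev/null)
echo "captured state: ${STATE}"
```

The caller is then free to store STATE in whatever internal registry it already maintains.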

--state will be overkill for systems like Solaris that have an
in-kernel state registry, but I'd require all runtimes to support it
for consistent testing [6]. Solaris systems could support additional
options like:

$ funC pause --id <CONTAINER-ID>

for a more native experience, but automated systems like the
conformance tester could rely on all runtimes supporting --state.

So how do folks feel about the --state approach?

Cheers,
Trevor

[1]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2015/opencontainers.2015-12-02-18.01.log.html#l-79
[2]: https://github.com/wking/oci-command-line-api/pull/3#issuecomment-162079383
[3]: https://groups.google.com/a/opencontainers.org/d/msg/dev/z25xQsF3pHA/ixyeTrxyFwAJ
     Subject: Re: Drop /run/opencontainer entirely?
     Message-ID: <20150905043...@odin.tremily.us>
[4]: https://github.com/wking/oci-command-line-api/pull/3#issuecomment-162009033
     Julz actually proposed --state-file, but I don't think the -file
     suffix adds much clarity. I'm happy to add it back if folks
     disagree.
[5]: https://github.com/wking/oci-command-line-api/pull/3#issuecomment-162041009
[6]: https://github.com/wking/oci-command-line-api/pull/3#issuecomment-162082556


W. Trevor King

Dec 17, 2015, 11:43:55 AM

to dev, Alexander Morozov, Solomon Hykes, Brandon Philips, Mrunal Patel, Julian Friedman

On Tue, Dec 08, 2015 at 03:49:57PM -0800, W. Trevor King wrote:
> … but Julz floated an option for letting the runtime-caller specify
> the state JSON path [4]:
>
> $ funC start --bundle foo/bar --state /my/container/state.json
> $ funC pause --state /my/container/state.json

I've written the ‘start’ portion of this up as a command-line API PR
[1], if interested parties want to chip in on review (including
pointing out shortcomings with the approach that I have overlooked).

Cheers,
Trevor

[1]: https://github.com/wking/oci-command-line-api/pull/14


Brandon Philips

Dec 23, 2015, 1:48:07 AM

to W. Trevor King, dev, Alexander Morozov, Solomon Hykes, Mrunal Patel, Julian Friedman

A state flag seems OK. I can't think of any particular downside at the moment.

W. Trevor King

Jan 6, 2016, 4:04:57 PM

to dev, Alexander Morozov, Solomon Hykes, Brandon Philips, Mrunal Patel, Julian Friedman, Rob Dolin, Doug Davis

On Tue, Dec 08, 2015 at 03:49:57PM -0800, W. Trevor King wrote:
> As Brandon pointed out, maintaining a global directory of state JSON
> is tricky, but Julz floated an option for letting the runtime-caller
> specify the state JSON path [4]:
>
> $ funC start --bundle foo/bar --state /my/container/state.json
> $ funC pause --state /my/container/state.json
>
> That allows us to punt the tricky parts to higher levels (there's no
> built-in mechanism for listing containers, global registry, garbage
> collection, …) while still allowing us to perform actions like ‘pause’
> that use one runtime process to interact with a container created by
> another runtime process.

In today's meeting, Rob asked for a concrete example of where this
would be useful [1], so here's a bit more detail on one of the “tricky
parts” I mentioned above.

An unprivileged user wants to launch a bundle, but lacks permission to
write to the host-wide registry. The host-wide registry could be
/run/opencontainer/containers, but the same permissions issue applies
to *any* host-wide registry where unprivileged posts are restricted.
With the current /run/opencontainer/containers requirement [2], that
user is stuck. With
--state, the runtime is punting on registry-management. The
unprivileged author is free to use:

$ funC start --state ~/.oci/my-container/state.json

if they want something like the current filesystem registry where they
have access to it.

Doug still prefers ‘funC --id CONTAINER_ID state’ [3], but that still
means the runtime has to maintain a registry (somewhere). I tried to
address that in [4], where I said:

> Possible alternatives for transmitting state information, and why I
> feel this approach is superior:
>
> [snip: comparing our current global directory and writing state from
> a pre-start hook]
>
> * Requiring runtimes to maintain an internal registry of containers
>   they launch. This gives runtimes more flexibility than having a
>   single, global directory. But ownership/access issues are still
>   difficult (if one unprivileged user registers a container, can
>   another unprivileged user see that entry? What elevated
>   permissions would you need to see that entry? To remove that
>   entry?). And the easiest way to get atomic changes and
>   garbage-collection is by registering with a daemon, while not
>   requiring a daemon is currently the #1 feature listed on the runC
>   homepage.
>
> In the event that any of those arguments seem leaky, callers that
> prefer a different approach can easily use hooks (without setting
> --state) or write wrappers that use a named pipe approach like
> (--state /dev/fd/3) to collect the JSON and then write it to their
> preferred registry. So the --state approach seems easy for the
> runtime to implement reliably, and also compatible with any of the
> suggested alternatives. The converse is not true; requiring a write
> to a global or per-runtime registry is not compatible with use-cases
> that prefer the anonymity of not writing the state at all (which is
> possible just by leaving off the --state option).

I still don't see anything wrong with that argument.

Coming back to the runtime-registry question and unprivileged users,
you *could* define a system where the runtime maintains a registry
that allows unprivileged writes and appropriate removals (this could
just be a sticky directory with global read/write [5], although the
visibility of such a directory would depend on your mount namespaces
and subtree sharing [6]). But do we gain anything by *requiring* such
a registry? As I explain in [4], hooks and/or the --state option make
it easy for you to attach an arbitrary registry on top of a
container-launching runtime. And when there is an opportunity for
clear separation of concerns between complicated functionality
(e.g. launching containers vs. managing a host-wide registry of
container state), I suggest we take advantage of it. I'm not opposed
to a separate opencontainers/ repository that defines a
container-state registry and supplies the hooks (or whatever) for
managing it if folks want something with an OCI blessing on it.
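
For reference, the sticky-directory variant mentioned above amounts to very little shell. The sketch below builds it under a temporary directory rather than /run so it can be tried without privileges; BASE and REGISTRY are hypothetical names. With the sticky bit set, any user can add an entry, but only the entry's owner (or the directory's owner, or root) can remove or rename it.

```shell
#!/bin/sh
# Sketch: a world-writable registry directory protected by the sticky bit.
# BASE and REGISTRY are hypothetical; a host-wide setup would live under /run.
set -eu

BASE=$(mktemp -d)
REGISTRY="$BASE/containers"
mkdir "$REGISTRY"
chmod 1777 "$REGISTRY"   # rwx for everyone, plus the sticky bit (the leading 1)

# Any user may now register a container of their own.
mkdir "$REGISTRY/demo"
printf '{"id":"demo"}\n' >"$REGISTRY/demo/state.json"

ls -ld "$REGISTRY"
```

This covers write and removal permissions only; visibility across mount namespaces and cleanup of stale entries remain open, as discussed above.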

Cheers,
Trevor

[1]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2016/opencontainers.2016-01-06-18.03.log.html#l-34
[2]: https://github.com/opencontainers/specs/blob/6a6ba6775567e726c405456e74b0e52350104eda/runtime.md#state
     This was “should” in v0.1.0 [7], but stiffened to MUST [8]. The
     global directory initially landed in [9]. There's some tiptoeing
     around whether the initial idea was for SHOULD (as specified in
     RFC 2119 [10]) [11,12], but I don't see anything concrete.
[3]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2016/opencontainers.2016-01-06-18.03.log.html#l-17
[4]: https://github.com/wking/oci-command-line-api/pull/14
[5]: https://en.wikipedia.org/wiki/Sticky_bit
[6]: https://kernel.org/doc/Documentation/filesystems/sharedsubtree.txt
[7]: https://github.com/opencontainers/specs/blob/v0.1.1/runtime.md#state
[8]: https://github.com/opencontainers/specs/pull/211
[9]: https://github.com/opencontainers/specs/pull/87
[10]: http://tools.ietf.org/html/rfc2119#section-3
[11]: https://github.com/opencontainers/runc/pull/159
[12]: https://github.com/opencontainers/specs/pull/87#issuecomment-126114680


Doug Davis

Jan 6, 2016, 8:28:23 PM

to W. Trevor King, Brandon Philips, dev, Julian Friedman, Alexander Morozov, Mrunal Patel, Rob Dolin, Solomon Hykes

I've been thinking of this slightly differently. I've been considering this more from the data perspective. Meaning, who should have access to the state file?

1 - the state file is an internal processing thing and therefore where it is stored is up to the impl to decide. If an unprivileged use of runc happens then runc needs to make sure the state is stored in a place it has write access to. Not our problem from a spec perspective. However, I then view this as implying we must standardize on a cmd line so that we can interoperably ask something like "runc --id myapp state" and expect back the json file.

2 - the state file is shared. This is what we have/do today, and to ensure interop we need to specify where it goes. While the spec says it MUST be in a certain dir, I guess we could change that to say we "STRONGLY RECOMMEND" that dir, and give fair warning that not doing so hurts interop.

I don't have a good sense for how 3rd party tooling would prefer to access this information, but I have to admit that I like option 1 because it allows for impls to store their state in a DB, and keeps that decision hidden from the user/tooling. It also avoids some of the sync/deadlock/mux issues people have mentioned concerning file access. Of course, it also forces a different model on us. Today each instance of runc is (for the most part) independent of each other. With #1 this isn't true, because the 2nd instance of runc (e.g. runc state) needs to be able to share info with other instances, and I wonder if there are any pitfalls we'll run into w.r.t. defining the scope of this data sharing?

thanks
-Doug
_______________________________________________________
STSM | IBM Open Source, Cloud Architecture & Technology
(919) 254-6905 | IBM 444-6905 | d...@us.ibm.com
The more I'm around some people, the more I like my dog


W. Trevor King

Jan 6, 2016, 11:51:11 PM

to Doug Davis, Brandon Philips, dev, Julian Friedman, Alexander Morozov, Mrunal Patel, Rob Dolin, Solomon Hykes

On Wed, Jan 06, 2016 at 08:28:06PM -0500, Doug Davis wrote:
> 1 - the state file is an internal processing thing and therefore
> where it is stored is up to the impl to decide. If an unprivileged
> use of runc happens then runc needs to make sure the state is stored
> in a place it has write access to. Not our problem from a spec
> perspective. However, I then view this as implying we must
> standardize on a cmd line so that we can interoperably ask something
> like "runc --id myapp state" and expect back the json file.

This all sounds good to me, but I don't see a benefit to making the
*container launcher* the same program that is handling the state
registry. Making the registry a separate tool / spec gives us
composable, minimal tools [1]. There is a small integration cost to
pay for this flexibility, but (taking a few liberties with the current
hook structure [2]):

"hooks": {
"prestart": [{"args": ["state-registry", "add"]}],
"poststop": [{"args": ["state-registry", "remove"]}]
},

doesn't sound like a high cost to me. If my writing a state-registry
hook that duplicates the existing /run/opencontainer approach would
help this conversation along, I'm happy to do so.
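
To give a feel for the size of such a hook, here is a minimal sketch of what a state-registry command might look like. It is not an existing tool: the add/remove subcommands come from the example above, while STATE_REGISTRY_DIR is a hypothetical override added for testing. It assumes, like the pre-start and post-stop scripts earlier in this thread, that the runtime passes the state JSON on stdin and that jq is installed.

```shell
#!/bin/sh
# Sketch of a hypothetical 'state-registry' hook (not an existing tool).
# Assumes the state JSON arrives on stdin and that jq is available.
set -eu

state_registry() {
    # $1: add | remove
    root="${STATE_REGISTRY_DIR:-/run/opencontainer/containers}"
    state=$(cat)
    id=$(printf '%s' "$state" | jq --raw-output .id)
    # Refuse to touch the registry when the state carries no usable ID.
    if [ -z "$id" ] || [ "$id" = null ]; then
        echo "state-registry: no container id in state" >&2
        return 1
    fi
    case "$1" in
        add)
            mkdir -p "$root/$id"
            printf '%s\n' "$state" >"$root/$id/state.json"
            ;;
        remove)
            rm -rf "${root:?}/${id:?}"
            ;;
        *)
            echo "usage: state-registry add|remove" >&2
            return 2
            ;;
    esac
}
```

For example, `printf '{"id":"demo"}' | state_registry add` would create demo/state.json under the registry root, and the matching `remove` call would delete that entry again.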

> I don't have a good sense for how 3rd party tooling would prefer to
> access this information…

I don't think there's much space between your two options. If there's
a consistent API for accessing it (process invocation or filesystem
walk), then it's a wash for consumers. Folks who choose to have
multiple registries on their host will have multiple places to ask
either way.

> Today each instance of runc is (for the most part) independent of
> each other - with #1 this isn't true because the 2nd instance of
> runc (e.g. runc state) needs to be able to share info with other
> instances and I wonder if there are any pitfalls we'll run into
> w.r.t. defining the scope of this data sharing?

With my --state option, the shared registry (wherever you choose to
keep it) is optional, so systems where persistent state preservation
is worth the trouble can install a state registry (from a set of
existing state-registry implementations), while systems where such
persistent state preservation is not worth the trouble can use
--state ${SOME_AD_HOC_PATH} or skip --state entirely.

Cheers,
Trevor

[1]: https://www.opencontainers.org/governance §7.a and §7.f
[2]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/Cn5L6g1prgA
     Subject: Unify container-process and hook-process structures
     Date: Mon, 4 Jan 2016 21:56:37 -0800
     Message-ID: <2016010505...@odin.tremily.us>
