On Tue, Sep 15, 2015 at 08:28:39AM -0700, Mrunal Patel wrote:
> I could take a stab at how it works today and we could refine it
> from there.
It's been a few days now, so I thought I'd take a stab at this. If we
have enough lead time we might be able to get this finalized at next
week's meeting (on Thursday [1]). This write-up is just about Linux
containers, since I don't understand the other systems well enough to
do them justice.
# Typical lifecycle
A typical lifecycle progresses like this:
1. There is no container or running application
2. A user tells the runtime to start a container+application
3. The runtime creates the container
4. The runtime executes any pre-start hooks
5. The runtime executes the application
6. The application is running
7. A user tells the runtime to send a termination signal to the application
8. The runtime sends a termination signal to the application
9. The application exits
10. The runtime executes any post-stop hooks
11. The runtime removes the container
With steps 7 and 8, the user is explicitly stopping the application
(via the runtime), but it's also possible that the application could
exit for other reasons. In that case we skip directly from 6 to 9.
Failure in a pre-start hook or other setup task can cause a jump
straight to 10.
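To make the two shortcuts concrete, here is a minimal sketch of the steps above as a small state machine. The names (`Step`, `next_step`, `walk`) are mine, not anything from runC or the spec, and I'm assuming "other setup task" means a failure during step 3.

```python
from enum import Enum


class Step(Enum):
    """Lifecycle steps, numbered as in the list above (names are hypothetical)."""
    NO_CONTAINER = 1
    START_REQUESTED = 2
    CREATE = 3
    PRE_START_HOOKS = 4
    EXEC_APPLICATION = 5
    RUNNING = 6
    STOP_REQUESTED = 7
    SIGNAL_SENT = 8
    EXITED = 9
    POST_STOP_HOOKS = 10
    REMOVE = 11


def next_step(step, setup_failed=False, exited_on_its_own=False):
    """Return the step that follows `step`, or None after step 11."""
    if setup_failed and step in (Step.CREATE, Step.PRE_START_HOOKS):
        return Step.POST_STOP_HOOKS  # failed setup jumps straight to 10
    if exited_on_its_own and step is Step.RUNNING:
        return Step.EXITED  # skip the user-driven steps 7 and 8
    if step is Step.REMOVE:
        return None
    return Step(step.value + 1)


def walk(setup_failed_at=None, exited_on_its_own=False):
    """List the step numbers visited during one container run."""
    step, path = Step.NO_CONTAINER, []
    while step is not None:
        path.append(step.value)
        step = next_step(
            step,
            setup_failed=(step is setup_failed_at),
            exited_on_its_own=exited_on_its_own,
        )
    return path
```

Calling `walk()` visits every step in order, `walk(exited_on_its_own=True)` goes from 6 to 9, and `walk(setup_failed_at=Step.PRE_START_HOOKS)` goes from 4 to 10.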
## Create
Create the container: file system, namespaces, cgroups,
capabilities. The invoked process forks, with one branch that stays
in the host namespace and another that enters the container. The
host process carries out all container setup actions, and continues
running for the life of the container so it can perform teardown
after the container process exits. The container process performs
tasks such as username-lookups [2], and then drops privileges in
preparation for the application start. At this point, the host
process writes the state.json file with the host-side version of the
container-process's PID (the container process may be in a PID
namespace) [3].
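As a rough sketch of the fork described above (not how runC actually does it), the host branch learns the host-side PID from the return value of fork() and records it, while the container branch blocks until the host tells it to proceed. The `create` function, the pipe used as the "go ahead" channel, and the single `pid` key are all my assumptions; a real state.json carries more than this, and the namespace and privilege work is only a comment here.

```python
import json
import os


def create(state_dir):
    """Fork a container process and record its host-side PID.

    Returns (pid, start_fd, state_path). Writing one byte to start_fd
    releases the container process.
    """
    start_r, start_w = os.pipe()  # host -> container "go ahead" channel
    pid = os.fork()
    if pid == 0:
        # Container branch. A real runtime would enter namespaces, do
        # its lookups, and drop privileges here; this stand-in only
        # waits to be released.
        try:
            os.close(start_w)
            os.read(start_r, 1)  # block until the host signals start
        finally:
            os._exit(0)  # stand-in for exec'ing the application
    # Host branch: fork() returned the PID as seen from the host.
    os.close(start_r)
    state_path = os.path.join(state_dir, "state.json")
    with open(state_path, "w") as f:
        json.dump({"pid": pid}, f)  # assumption: only the PID is shown
    return pid, start_w, state_path
```

The host would run any pre-start hooks between `create()` returning and writing to `start_fd`, then wait on `pid` to collect the exit status.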
[This is where standard streams get a bit tricky, because both the
host and container processes are sharing the same streams.
Untangling this will probably involve logging (instead of stderr
writing) for the host process, although we'll have to figure out
when to cut over from stderr to logs. Probably some time before we
fork, and definitely before we exec the application.]
## Pre-start hooks
The pre-start hooks are executed after container creation by the
host process.
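A minimal sketch of how the host process might run these hooks, assuming each hook is just an argv list and that the first non-zero exit aborts the rest (which matches the "jump straight to 10" case in the lifecycle above). `run_hooks` is a hypothetical helper, not a runC function, and it ignores per-hook environment and stdin.

```python
import subprocess


def run_hooks(hooks):
    """Run each hook command in order and stop at the first failure.

    `hooks` is a list of argv lists. Returns True if every hook exited 0.
    """
    for argv in hooks:
        if subprocess.run(argv).returncode != 0:
            return False  # caller should skip ahead to the post-stop hooks
    return True
```

The caller would only signal the container process to start when this returns True.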
## Start (process)
After the pre-start hooks complete, the host process signals the
container process to execute the runtime. The runtime execs the
process defined in config.json's ‘process’ attribute [4].
On Linux hosts, some information for this execution may come from
outside the `config.json` and `runtime.json` specifications. See
the Linux-specific notes for details [this is the "additional file
descriptor" stuff that landed in specs#113, see also [5]].
## Stop (process)
Send a termination signal to the application process (can optionally
send other signals to the application process, e.g. a kill signal).
When the process exits, the host process collects its exit status
to return as its own exit status. If there are any remaining
processes in the container's cgroup (and we only support unified
cgroups [6]), the host process kills and reaps them.
[On IRC on 2015-09-15, Michael said: “if the main process dies in
the container, all other process are killed” and “we actually freeze
first, send the KILL, then unfreeze so we don't have races”. “The
main process” is probably “the container process associated with the
host process that created the cgroup”, to distinguish it from
container processes that have subsequently joined the cgroup. And
KILL seems like a harsh starting point, so it might be “TERM, wait
on a clean exit for $TIMEOUT, if processes are still running,
freeze, KILL, unfreeze, and reap”.]
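Here is a minimal sketch of the "TERM, wait, then KILL" escalation for a single process, assuming the host process holds a handle on the application as its direct child. It leaves out the freeze/unfreeze step and the reaping of other cgroup members, since those need access to the cgroup freezer. `stop` and its `timeout` parameter are names I made up for illustration.

```python
import subprocess


def stop(proc, timeout):
    """Send TERM, wait up to `timeout` seconds, then fall back to KILL.

    `proc` is a subprocess.Popen. Returns its return code, which on
    POSIX is the negative signal number if a signal ended the process.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()
```

The returned code is what the host process would carry forward as its own exit status, however it chooses to encode "killed by signal".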
## Post-stop hooks
The post-stop hooks are executed by the host process after the
application process exits.
[I'm not clear on what state.json looks like for these processes.
Does it still have a PID? The container process is dead by this
point and some of the container (e.g. PID namespaces) won't even
exist anymore. How are post-stop hooks supposed to get information
about the container?]
## Cleanup
The host process removes the container: unmounting file systems,
removing namespaces, etc. This is the inverse of create. The host
process then exits with the application's exit status [Julz has
pointed out that this makes it hard to report on the host-process's
own teardown errors].
# Joining existing containers
Joining an existing container looks just like the usual workflow,
except that the container process joins the target container [7] at
the beginning of step three. It can then, depending on its
configuration, continue to create an additional child cgroup
underneath the one it joined.
When exiting, the reaping logic in the ‘stop’ phase is the same. If
the container process created a child cgroup, all other processes in
that child cgroup are reaped. But no other processes in the joined
cgroup (which the container process did not create) are reaped.
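The reaping rule above can be sketched as a pure function, assuming we already have a mapping from cgroup path to the PIDs in it (how that mapping is gathered is out of scope here). `pids_to_reap` and its arguments are hypothetical names.

```python
def pids_to_reap(cgroup_pids, container_pid, created_cgroup=None):
    """Return the PIDs the host process should kill and reap.

    cgroup_pids maps a cgroup path to the set of PIDs inside it.
    created_cgroup is the child cgroup this container process created,
    or None if it only joined an existing cgroup.
    """
    if created_cgroup is None:
        return set()  # we created nothing, so nothing else is ours to reap
    others = set(cgroup_pids.get(created_cgroup, ()))
    others.discard(container_pid)  # its exit status was already collected
    return others
```

For example, with `{"/joined": {10, 11}, "/joined/child": {20, 21, 22}}` and a container process 20 that created `/joined/child`, only 21 and 22 are reaped; 10 and 11 are left alone.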
Does that sound close to what we have now in runC? Can anyone suggest
edits or complete rewrites where I got things wrong? Add clarity to
my [bracketed confusion]?
Cheers,
Trevor
[1]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/pnM9vDNJgrg
[2]: https://github.com/opencontainers/specs/pull/191
[3]: https://github.com/opencontainers/specs/blob/v0.1.1/runtime.md#state
[4]: https://github.com/opencontainers/specs/blob/v0.1.1/config.md#process-configuration
[5]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/D-3t4XHOqnU
     Message-ID: <CAK4o1WzT7rVv16rAG=
     EGHfHKMY+kc9HR-k...@mail.gmail.com>
[6]: https://github.com/opencontainers/specs/blob/v0.1.1/runtime-config-linux.md#control-groups
     “The Spec does not support split hierarchy.”
[7]: https://github.com/opencontainers/specs/blob/v0.1.1/runtime-config-linux.md#control-groups
     cgroupsPath