[PATCH] syscalls: Document OCI seccomp filter interactions & workaround


Florian Weimer

Nov 24, 2020, 7:08:37 AM
to linu...@vger.kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, d...@opencontainers.org, cor...@lwn.net, Carlos O'Donell
This documents a way to safely use new security-related system calls
while preserving compatibility with container runtimes that require
insecure emulation (because they filter the system call by default).
Admittedly, it is somewhat hackish, but it can be implemented by
userspace today, for existing system calls such as faccessat2,
without kernel or container runtime changes.

Signed-off-by: Florian Weimer <fwe...@redhat.com>

---
Documentation/process/adding-syscalls.rst | 37 +++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)

diff --git a/Documentation/process/adding-syscalls.rst b/Documentation/process/adding-syscalls.rst
index a3ecb236576c..7d1e578a1df1 100644
--- a/Documentation/process/adding-syscalls.rst
+++ b/Documentation/process/adding-syscalls.rst
@@ -436,6 +436,40 @@ simulates registers etc). Fixing this is as simple as adding a #define to

#define stub_xyzzy sys_xyzzy

+Container Compatibility and seccomp
+-----------------------------------
+
+The Linux Foundation Open Container Initiative Runtime Specification
+requires that by default, implementations install seccomp system call
+filters which cause system calls to fail with ``EPERM``. As a result,
+all new system calls in such containers fail with ``EPERM`` instead of
+``ENOSYS``. This design is problematic because ``EPERM`` is a
+legitimate system call result which should not trigger fallback to a
+userspace emulation, particularly for security-related system calls.
+(With ``ENOSYS``, it is clear that a fallback implementation has to be
+used to maintain compatibility with older kernels or container
+runtimes.)
+
+New system calls should therefore provide a way to reliably trigger an
+error distinct from ``EPERM``, without any side effects. Some ways to
+achieve that are:
+
+ - ``EBADF`` for the invalid file descriptor -1
+ - ``EFAULT`` for a null pointer
+ - ``EINVAL`` for a contradictory set of flags that will remain invalid
+ in the future
+
+If a system call has such error behavior, upon encountering an
+``EPERM`` error, userspace applications can perform further
+invocations of the same system call to check if the ``EPERM`` error
+persists for those known error conditions. If those also fail with
+``EPERM``, that likely means that the original ``EPERM`` error was the
+result of a seccomp filter, and should be treated like ``ENOSYS``
+(e.g., trigger an alternative fallback implementation). If those
+probing system calls do not fail with ``EPERM``, the error likely came
+from a real implementation, and should be reported to the caller
+directly, without resorting to ``ENOSYS``-style fallback.
+

Other Details
-------------
@@ -575,3 +609,6 @@ References and Sources
- Recommendation from Linus Torvalds that x32 system calls should prefer
compatibility with 64-bit versions rather than 32-bit versions:
https://lkml.org/lkml/2011/8/31/244
+ - Linux Configuration section of the Open Container Initiative
+ Runtime Specification:
+ https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md

--
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill

Christian Brauner

Nov 24, 2020, 7:26:45 AM
to Florian Weimer, linu...@vger.kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, d...@opencontainers.org, cor...@lwn.net, Carlos O'Donell
I'm sorry but I have some doubts about this new "rule". The idea of
being able to reliably trigger an error for a system call other than
EPERM might have merit in some scenarios but justifying it via a bug in
a userspace standard is not enough in my opinion.

The solution is to fix the standard to mandate ENOSYS. This is the
correct error for this exact scenario and standards can be changed.
I don't think it is the kernel's job to work around a deliberate
userspace decision to use EPERM and not ENOSYS. The kernel's system call
design should not be informed by this especially since this is clearly
not a kernel bug.

Apart from that I have doubts that this is in any shape or form
enforceable. Not just because in principle there might be system calls
that only return EPERM on error but also because this requirement feels
arbitrary and I doubt developers will feel bound by it or people will
check for it.

> +
> +If a system call has such error behavior, upon encountering an
> +``EPERM`` error, userspace applications can perform further
> +invocations of the same system call to check if the ``EPERM`` error
> +persists for those known error conditions. If those also fail with
> +``EPERM``, that likely means that the original ``EPERM`` error was the
> +result of a seccomp filter, and should be treated like ``ENOSYS``

I think that this "approach" alone should illustrate that this is the
wrong way to approach this. It's hacky and requires exercising a system
call multiple times just to find out whether or not it is supported.
The only application that would possibly do this is probably glibc.
This seems to be the completely wrong way of solving this problem.

Florian Weimer

Nov 24, 2020, 7:54:41 AM
to Christian Brauner, linu...@vger.kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, d...@opencontainers.org, cor...@lwn.net, Carlos O'Donell
* Christian Brauner:
Thank you for your feedback. I appreciate it.

I agree that the standard should mandate ENOSYS, and I've just proposed
a specification change here:

<https://groups.google.com/a/opencontainers.org/g/dev/c/8Phfq3VBxtw>

However, such a change may take some time to implement.

Meanwhile, we have the problem today with glibc that it wants to use the
faccessat2 system call but it can't. I've been told that it would make
glibc incompatible with the public cloud and Docker. The best solution
I could come up with is this awkward probing sequence. (Just
checking for the zero flags argument is not sufficient because systemd
calls fchmodat with AT_SYMLINK_NOFOLLOW.)
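
For illustration, the probing sequence could look roughly like the sketch
below. It is not the glibc patch. The names faccessat2_probed and
probe_hit_filter are hypothetical, the fallback number 439 for
SYS_faccessat2 is an assumption that holds on most but not all
architectures, and the probe relies on a real implementation rejecting a
null pathname with EFAULT rather than EPERM.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_faccessat2
#define SYS_faccessat2 439 /* assumption: number used on most architectures */
#endif

/* Hypothetical helper: a probe that fails with EPERM points at a filter,
   because a real implementation would report EFAULT for the probe. */
static bool probe_hit_filter(long probe_ret, int probe_errno)
{
    return probe_ret == -1 && probe_errno == EPERM;
}

/* Hypothetical wrapper: behaves like the raw system call, except that an
   EPERM which also shows up for the side-effect-free probe is reported
   as ENOSYS so the caller can use its fallback path. */
static long faccessat2_probed(int dirfd, const char *path, int mode, int flags)
{
    long ret = syscall(SYS_faccessat2, dirfd, path, mode, flags);
    if (ret == 0 || errno != EPERM)
        return ret;

    /* Probe with a null pathname: no side effects, and a real kernel
       implementation is expected to fail with EFAULT here. */
    long probe = syscall(SYS_faccessat2, -1, (const char *)NULL, F_OK, 0);
    int probe_errno = errno;

    errno = probe_hit_filter(probe, probe_errno) ? ENOSYS : EPERM;
    return -1;
}
```

A caller would treat ENOSYS from faccessat2_probed as "use the old
emulation" and pass every other error, including a genuine EPERM,
straight back to the application.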

I do not wish to put the probing sequence into glibc (upstream or
downstream) unless it is blessed to some degree by kernel developers. I
consider it quite ugly and would prefer if more of us share the blame.

We will face the same issue again with fchmodat2 (or fchmodat4 if that's
what its name is going to be). And we have been lucky in recent times
that we didn't need a new system call to fix a security vulnerability in an
existing system call in wide use by userspace (although faccessat2 comes
rather close because it replaces a userspace permission check
approximation with a proper kernel check). The seccomp situation means
that we can't do so reliably, and the probing hack seems to be the way out.
That's another reason for not just putting in the probing sequence
quietly and being done with it: I'd like to discuss this aspect in the
open, before we need it as part of a fix for some embargoed security
vulnerability.

Thanks,
Florian

Aleksa Sarai

Nov 24, 2020, 7:58:22 AM
to Florian Weimer, linu...@vger.kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, d...@opencontainers.org, cor...@lwn.net, Carlos O'Donell
As I mentioned in the runc thread[1], this is really down to Docker's
default policy configuration. The EPERM-everything behaviour in OCI was
inherited from Docker, and it boils down to not having an additional
seccomp rule which does ENOSYS for unknown syscall numbers (Docker can
just add the rule without modifying the OCI runtime-spec -- so it's
something Docker can fix entirely on their own). I'll prepare a patch
for Docker this week.
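
As a rough sketch of what such an additional rule could look like at the
classic BPF level (this is not Docker's or runc's actual profile): system
call numbers above a cut-off that the profile author knew about get
ENOSYS, everything else continues to the rest of the policy. MAX_KNOWN_NR
and its value are hypothetical, the allowlist is collapsed into a single
"allow", and the architecture check that a real filter needs is omitted.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical cut-off: highest system call number known to the profile. */
#define MAX_KNOWN_NR 435

static struct sock_filter enosys_for_unknown[] = {
    /* Load the system call number. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* nr > MAX_KNOWN_NR: fall through to the ENOSYS return,
       otherwise skip it. */
    BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, MAX_KNOWN_NR, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
    /* Placeholder for the real policy (allowlist, EPERM default, ...). */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

static struct sock_fprog enosys_prog = {
    .len = sizeof(enosys_for_unknown) / sizeof(enosys_for_unknown[0]),
    .filter = enosys_for_unknown,
};
```

Installing the program (PR_SET_NO_NEW_PRIVS followed by PR_SET_SECCOMP
with SECCOMP_MODE_FILTER) is left out here. With a rule like this in
front, userspace sees ENOSYS for system calls newer than the profile and
EPERM only for calls the profile deliberately denies.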

IMHO it's also slightly overkill to change the kernel API design
guidelines in response to this issue.

[1]: https://github.com/opencontainers/runc/issues/2151

> Other Details
> -------------
> @@ -575,3 +609,6 @@ References and Sources
> - Recommendation from Linus Torvalds that x32 system calls should prefer
> compatibility with 64-bit versions rather than 32-bit versions:
> https://lkml.org/lkml/2011/8/31/244
> + - Linux Configuration section of the Open Container Initiative
> + Runtime Specification:
> + https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

Florian Weimer

Nov 24, 2020, 8:06:07 AM
to Aleksa Sarai, linu...@vger.kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, d...@opencontainers.org, cor...@lwn.net, Carlos O'Donell
* Aleksa Sarai:

> As I mentioned in the runc thread[1], this is really down to Docker's
> default policy configuration. The EPERM-everything behaviour in OCI was
> inherited from Docker, and it boils down to not having an additional
> seccomp rule which does ENOSYS for unknown syscall numbers (Docker can
> just add the rule without modifying the OCI runtime-spec -- so it's
> something Docker can fix entirely on their own). I'll prepare a patch
> for Docker this week.

Appreciated, thanks.

> IMHO it's also slightly overkill to change the kernel API design
> guidelines in response to this issue.
>
> [1]: https://github.com/opencontainers/runc/issues/2151

Won't this cause Docker to lose OCI compliance? Or is the compliance
testing not that good?

Thanks,
Florian

Christoph Hellwig

Nov 24, 2020, 8:37:24 AM
to Florian Weimer, linu...@vger.kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, d...@opencontainers.org, cor...@lwn.net, Carlos O'Donell
On Tue, Nov 24, 2020 at 01:08:20PM +0100, Florian Weimer wrote:
> This documents a way to safely use new security-related system calls
> while preserving compatibility with container runtimes that require
> insecure emulation (because they filter the system call by default).
> Admittedly, it is somewhat hackish, but it can be implemented by
> userspace today, for existing system calls such as faccessat2,
> without kernel or container runtime changes.

I think this is completely insane. Tell the OCI folks to fix their
completely broken specification instead.

Mark Wielaard

Nov 24, 2020, 9:08:09 AM
to Florian Weimer, Christian Brauner, linu...@vger.kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, d...@opencontainers.org, cor...@lwn.net, Carlos O'Donell
Hi,

Just a reply to note that this isn't just an issue for glibc, but for
any program that might use linux syscalls directly (with fallbacks).

On Tue, 2020-11-24 at 13:54 +0100, Florian Weimer wrote:
>
> I agree that the standard should mandate ENOSYS, and I've just proposed
> a specification change here:
>
> <https://groups.google.com/a/opencontainers.org/g/dev/c/8Phfq3VBxtw>
>
> However, such a change may take some time to implement.

Thanks, that is really appreciated. We face the same issue in valgrind.

> Meanwhile, we have the problem today with glibc that it wants to use the
> faccessat2 system call but it can't. I've been told that it would make
> glibc incompatible with the public cloud and Docker. The best solution
> I could come up with is this awkward probing sequence. (Just
> checking for the zero flags argument is not sufficient because systemd
> calls fchmodat with AT_SYMLINK_NOFOLLOW.)
>
> I do not wish to put the probing sequence into glibc (upstream or
> downstream) unless it is blessed to some degree by kernel developers. I
> consider it quite ugly and would prefer if more of us share the blame.
>
> We will face the same issue again with fchmodat2 (or fchmodat4 if that's
> what its name is going to be).

For valgrind the issue is statx which we try to use before falling back
to stat64, fstatat or stat (depending on architecture, not all define
all of these). The problem with these fallbacks is that under some
containers (libseccomp versions) they might return EPERM instead of
ENOSYS. This causes really obscure errors that are really hard to
diagnose.

Don't you have the same issue with glibc for those architectures that
don't have fstatat or 32bit arches that need 64-bit time_t? And if so,
how are you working around containers possibly returning EPERM instead
of ENOSYS?

Thanks,

Mark

Florian Weimer

Nov 24, 2020, 9:08:26 AM
to Christoph Hellwig, linu...@vger.kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, d...@opencontainers.org, cor...@lwn.net, Carlos O'Donell
* Christoph Hellwig:
Do you categorically reject the general advice, or specific instances as
well? Like this workaround for faccessat that follows the pattern I
outlined:

<https://sourceware.org/pipermail/libc-alpha/2020-November/119955.html>

I value your feedback and want to make sure I capture it accurately.

Thanks,
Florian

Christoph Hellwig

Nov 24, 2020, 11:45:59 AM
to Mark Wielaard, Florian Weimer, Christian Brauner, linu...@vger.kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, d...@opencontainers.org, cor...@lwn.net, Carlos O'Donell
On Tue, Nov 24, 2020 at 03:08:05PM +0100, Mark Wielaard wrote:
> For valgrind the issue is statx which we try to use before falling back
> to stat64, fstatat or stat (depending on architecture, not all define
> all of these). The problem with these fallbacks is that under some
> containers (libseccomp versions) they might return EPERM instead of
> ENOSYS. This causes really obscure errors that are really hard to
> diagnose.

So find a way to detect these completely broken container run times
and refuse to run under them at all. After all they've decided to
deliberately break the syscall ABI. (and yes, we gave them the rope
to do that with seccomp :().

Christoph Hellwig

Nov 24, 2020, 11:47:04 AM
to Florian Weimer, Christoph Hellwig, linu...@vger.kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, d...@opencontainers.org, cor...@lwn.net, Carlos O'Donell
On Tue, Nov 24, 2020 at 03:08:09PM +0100, Florian Weimer wrote:
> Do you categorically reject the general advice, or specific instances as
> well?

All of the above. Really, if people decided to use seccomp to return
nonsensical error codes we should not work around that in new kernel
ABIs.

Florian Weimer

Nov 24, 2020, 11:53:02 AM
to Christoph Hellwig, linu...@vger.kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, d...@opencontainers.org, cor...@lwn.net, Carlos O'Donell
* Christoph Hellwig:
Fair enough, I can work with that. Thanks.

Jann Horn

Nov 24, 2020, 12:07:09 PM
to Christoph Hellwig, Kees Cook, Andy Lutomirski, Will Drewry, Mark Wielaard, Florian Weimer, Christian Brauner, Linux API, open list:DOCUMENTATION, kernel list, d...@opencontainers.org, Jonathan Corbet, Carlos O'Donell
+seccomp maintainers/reviewers
[thread context is at
https://lore.kernel.org/linux-api/87lfer2...@oldenburg2.str.redhat.com/
]
FWIW, if the consensus is that seccomp filters that return -EPERM by
default are categorically wrong, I think it should be fairly easy to
add a check to the seccomp core that detects whether the installed
filter returns EPERM for some fixed unused syscall number and, if so,
prints a warning to dmesg or something along those lines...

Greg KH

Nov 24, 2020, 12:15:42 PM
to Jann Horn, Christoph Hellwig, Kees Cook, Andy Lutomirski, Will Drewry, Mark Wielaard, Florian Weimer, Christian Brauner, Linux API, open list:DOCUMENTATION, kernel list, d...@opencontainers.org, Jonathan Corbet, Carlos O'Donell
Why? seccomp is saying "this syscall is not permitted", so -EPERM seems
like the correct error to provide here. It's not -ENOSYS as the syscall
is present.

As everyone knows, there are other ways to have -EPERM be returned from
a syscall if you don't have the correct permissions to do something.
Why is seccomp being singled out here? It's doing the correct thing.

thanks,

greg k-h

Christian Brauner

Nov 24, 2020, 12:21:57 PM
to Greg KH, Jann Horn, Christoph Hellwig, Kees Cook, Andy Lutomirski, Will Drewry, Mark Wielaard, Florian Weimer, Linux API, open list:DOCUMENTATION, kernel list, d...@opencontainers.org, Jonathan Corbet, Carlos O'Donell
The correct solution to this problem is simple: the standard and the
problematic container runtimes need to be fixed to return ENOSYS as I
said in my first mail. Imho, the kernel should neither need to log
anything nor be opinionated about what error is correct or not. Imho,
this is a broken standard and that's where the story ends.

We've had that argument about ENOSYS being the correct errno in such
scenarios in userspace already and that's been ignored for _years_. Now,
as could be expected it's suddenly the kernel who's supposed to fix
this. That's totally wrong imho.

Christian

Jann Horn

Nov 24, 2020, 12:30:58 PM
to Greg KH, Christoph Hellwig, Kees Cook, Andy Lutomirski, Will Drewry, Mark Wielaard, Florian Weimer, Christian Brauner, Linux API, open list:DOCUMENTATION, kernel list, d...@opencontainers.org, Jonathan Corbet, Carlos O'Donell
AFAIU from what the others have said, it's being singled out because
it means that for two semantically equivalent operations (e.g.
openat() vs open()), one can fail while the other works because the
filter doesn't know about one of the syscalls. Normally semantically
equivalent syscalls are supposed to be subject to the same checks, and
if one of them fails, trying the other one won't help.

But if you can't tell whether the more modern syscall failed because
of a seccomp filter, you may be forced to retry with an older syscall
even on systems where the new syscall works fine, and such a fallback
may reduce security or reliability if you're trying to use some flags
that only the new syscall provides for security, or something like
that. (As a contrived example, imagine being forced to retry any
tgkill() that fails with EPERM as a tkill() just in case you're
running under a seccomp filter.)

Greg KH

Nov 24, 2020, 12:44:20 PM
to Jann Horn, Christoph Hellwig, Kees Cook, Andy Lutomirski, Will Drewry, Mark Wielaard, Florian Weimer, Christian Brauner, Linux API, open list:DOCUMENTATION, kernel list, d...@opencontainers.org, Jonathan Corbet, Carlos O'Donell
They aren't being subject to the same checks, if the seccomp permissions
are different for both of them, they will get different answers.

Trying to use this to determine if the syscall is present or not is not
ok, and as Christian just said, needs to be fixed in userspace. We
can't change the kernel ABI now, odds are someone else relies on the api
we have had in place and it can not be changed :)

thanks,

greg k-h

Jann Horn

Nov 24, 2020, 12:47:53 PM
to Greg KH, Christoph Hellwig, Kees Cook, Andy Lutomirski, Will Drewry, Mark Wielaard, Florian Weimer, Christian Brauner, Linux API, open list:DOCUMENTATION, kernel list, d...@opencontainers.org, Jonathan Corbet, Carlos O'Donell
I don't think anyone was proposing changes to existing kernel API.

Florian Weimer

Nov 24, 2020, 1:03:11 PM
to Jann Horn, Christoph Hellwig, Kees Cook, Andy Lutomirski, Will Drewry, Mark Wielaard, Christian Brauner, Linux API, open list:DOCUMENTATION, kernel list, d...@opencontainers.org, Jonathan Corbet, Carlos O'Donell
* Jann Horn:
But that's playing Core Wars, right? Someone will write a seccomp
filter trying to game that kernel check. I don't really think it solves
anything until there is consensus what a system call filter should do
with system calls not on the permitted list.

Florian Weimer

Nov 24, 2020, 1:09:58 PM
to Mark Wielaard, Christian Brauner, linu...@vger.kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, d...@opencontainers.org, cor...@lwn.net, Carlos O'Donell
* Mark Wielaard:

> For valgrind the issue is statx which we try to use before falling back
> to stat64, fstatat or stat (depending on architecture, not all define
> all of these). The problem with these fallbacks is that under some
> containers (libseccomp versions) they might return EPERM instead of
> ENOSYS. This causes really obscure errors that are really hard to
> diagnose.

The probing sequence I proposed should also work for statx. 8-p
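
A hedged sketch of how that might look for statx: the probe passes the
reserved mask bit that statx(2) documents as always producing EINVAL, so
an EPERM from the probe suggests a filter. The function names are
hypothetical, PROBE_RESERVED_MASK is assumed to match the kernel's
STATX__RESERVED value (0x80000000), and the code assumes headers recent
enough to define SYS_statx.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed to equal STATX__RESERVED as documented in statx(2). */
#define PROBE_RESERVED_MASK 0x80000000U

/* Hypothetical decision helper: fall back to fstatat/stat64/stat only if
   statx is missing (ENOSYS) or its EPERM is reproduced by the probe. */
static bool statx_needs_fallback(int call_errno, int probe_errno)
{
    if (call_errno == ENOSYS)
        return true;
    return call_errno == EPERM && probe_errno == EPERM;
}

/* Hypothetical probe: a side-effect-free statx call that a real
   implementation is expected to reject with EINVAL because of the
   reserved mask bit. Returns the resulting errno, or 0 if the call
   unexpectedly succeeds. */
static int statx_probe_errno(void)
{
    long ret = syscall(SYS_statx, AT_FDCWD, "/", 0, PROBE_RESERVED_MASK, NULL);
    return ret == -1 ? errno : 0;
}
```

The probe only needs to run after statx has already failed with EPERM,
so the common path costs no extra system call.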

> Don't you have the same issue with glibc for those architectures that
> don't have fstatat or 32bit arches that need 64-bit time_t? And if so,
> how are you working around containers possibly returning EPERM instead
> of ENOSYS?

That's a good point. I don't think many people run 32-bit containers in
the cloud. The Y2038 changes in glibc impact 64-bit ports a little, but
mostly on the fringes (e.g., clock_nanosleep vs nanosleep).

Florian Weimer

Nov 24, 2020, 1:17:17 PM
to Jann Horn, Greg KH, Christoph Hellwig, Kees Cook, Andy Lutomirski, Will Drewry, Mark Wielaard, Christian Brauner, Linux API, open list:DOCUMENTATION, kernel list, d...@opencontainers.org, Jonathan Corbet, Carlos O'Donell
* Jann Horn:

> But if you can't tell whether the more modern syscall failed because
> of a seccomp filter, you may be forced to retry with an older syscall
> even on systems where the new syscall works fine, and such a fallback
> may reduce security or reliability if you're trying to use some flags
> that only the new syscall provides for security, or something like
> that. (As a contrived example, imagine being forced to retry any
> tgkill() that fails with EPERM as a tkill() just in case you're
> running under a seccomp filter.)

We have exactly this situation with faccessat2 and faccessat today.
EPERM could mean a rejection from an LSM, and we really don't want to do our
broken fallback in this case because it will mask the EPERM error from
the LSM (and the sole purpose of faccessat2 is to get that error).

This is why I was so eager to start using faccessat2 in glibc, and we
are now encountering breakage with container runtimes. Applications
call faccessat (with a non-zero flags argument) today, and they now get
routed to the faccessat2 entry point, without needing recompilation or
anything like that.

We have the same problem for any new system call, but it's different
this time because it affects 64-bit hosts *and* existing applications.

And as I explained earlier, I want to take this opportunity to get
consensus on how to solve this properly, so that we are ready for a new
system call where incorrect fallback would definitely reintroduce a
security issue. Whether it's that ugly probing sequence, a change to
the OCI specification that gets deployed in a reasonable time frame, or
something else that I haven't thought of—I do not have a very strong
preference, although I lean towards the spec change myself. But I do
feel that we shouldn't throw in a distro-specific patch to paper over
the current faccessat2 issue and forget about it.