Change seccomp default error from EPERM to ENOSYS

Florian Weimer

Nov 24, 2020, 7:22:03 AM
to d...@opencontainers.org, Carlos O'Donell
This is related to the kernel patch I just sent:

<https://groups.google.com/a/opencontainers.org/g/dev/c/gj9ErIn5LQI>

I think it would be nice if we could phase out that ugly userspace
probing sequence eventually, but that requires switching from EPERM to
ENOSYS, with a patch like this:

diff --git a/config-linux.md b/config-linux.md
index 9ea44a0..19278e1 100644
--- a/config-linux.md
+++ b/config-linux.md
@@ -646,7 +646,7 @@ The following parameters can be specified to set up seccomp:

* **`errnoRet`** *(uint, OPTIONAL)* - the errno return code to use.
Some actions like `SCMP_ACT_ERRNO` and `SCMP_ACT_TRACE` allow to specify the errno
- code to return. If not specified its default value is `EPERM`.
+ code to return. If not specified its default value is `ENOSYS`.

* **`args`** *(array of objects, OPTIONAL)* - the specific syscall in seccomp.
Each entry has the following structure:

Thoughts?

Thanks,
Florian
--
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill

Tycho Andersen

Nov 24, 2020, 9:49:30 AM
to Florian Weimer, d...@opencontainers.org, Carlos O'Donell
On Tue, Nov 24, 2020 at 01:21:50PM +0100, Florian Weimer wrote:
> This is related to the kernel patch I just sent:
>
> <https://groups.google.com/a/opencontainers.org/g/dev/c/gj9ErIn5LQI>
>
> I think it would be nice if we could phase out that ugly userspace
> probing sequence eventually, but that requires switching from EPERM to
> ENOSYS, with a patch like this:
>
> diff --git a/config-linux.md b/config-linux.md
> index 9ea44a0..19278e1 100644
> --- a/config-linux.md
> +++ b/config-linux.md
> @@ -646,7 +646,7 @@ The following parameters can be specified to set up seccomp:
>
> * **`errnoRet`** *(uint, OPTIONAL)* - the errno return code to use.
> Some actions like `SCMP_ACT_ERRNO` and `SCMP_ACT_TRACE` allow to specify the errno
> - code to return. If not specified its default value is `EPERM`.
> + code to return. If not specified its default value is `ENOSYS`.
>
> * **`args`** *(array of objects, OPTIONAL)* - the specific syscall in seccomp.
> Each entry has the following structure:
>
> Thoughts?

Sounds reasonable to me -- does any userspace actually depend on this
behavior (i.e. -EPERM instead of -ENOSYS)? Seems like we could
probably switch it without breaking too much...

Tycho

Giuseppe Scrivano

Nov 24, 2020, 11:26:00 AM
to Florian Weimer, d...@opencontainers.org, Carlos O'Donell
Hi Florian,

Florian Weimer <fwe...@redhat.com> writes:

> This is related to the kernel patch I just sent:
>
> <https://groups.google.com/a/opencontainers.org/g/dev/c/gj9ErIn5LQI>
>
> I think it would be nice if we could phase out that ugly userspace
> probing sequence eventually, but that requires switching from EPERM to
> ENOSYS, with a patch like this:
>
> diff --git a/config-linux.md b/config-linux.md
> index 9ea44a0..19278e1 100644
> --- a/config-linux.md
> +++ b/config-linux.md
> @@ -646,7 +646,7 @@ The following parameters can be specified to set up seccomp:
>
> * **`errnoRet`** *(uint, OPTIONAL)* - the errno return code to use.
> Some actions like `SCMP_ACT_ERRNO` and `SCMP_ACT_TRACE` allow to specify the errno
> - code to return. If not specified its default value is `EPERM`.
> + code to return. If not specified its default value is `ENOSYS`.
>
> * **`args`** *(array of objects, OPTIONAL)* - the specific syscall in seccomp.
> Each entry has the following structure:
>
> Thoughts?

I am afraid it will be a breaking change.

I think ENOSYS makes sense only for newly added syscalls, as there is
likely a fallback, but IMO it should be handled at a higher level.

Regards,
Giuseppe

Tycho Andersen

Nov 24, 2020, 1:39:12 PM
to Giuseppe Scrivano, Florian Weimer, d...@opencontainers.org, Carlos O'Donell
On Tue, Nov 24, 2020 at 05:25:41PM +0100, Giuseppe Scrivano wrote:
> Hi Florian,
>
> Florian Weimer <fwe...@redhat.com> writes:
>
> > This is related to the kernel patch I just sent:
> >
> > <https://groups.google.com/a/opencontainers.org/g/dev/c/gj9ErIn5LQI>
> >
> > I think it would be nice if we could phase out that ugly userspace
> > probing sequence eventually, but that requires switching from EPERM to
> > ENOSYS, with a patch like this:
> >
> > diff --git a/config-linux.md b/config-linux.md
> > index 9ea44a0..19278e1 100644
> > --- a/config-linux.md
> > +++ b/config-linux.md
> > @@ -646,7 +646,7 @@ The following parameters can be specified to set up seccomp:
> >
> > * **`errnoRet`** *(uint, OPTIONAL)* - the errno return code to use.
> > Some actions like `SCMP_ACT_ERRNO` and `SCMP_ACT_TRACE` allow to specify the errno
> > - code to return. If not specified its default value is `EPERM`.
> > + code to return. If not specified its default value is `ENOSYS`.
> >
> > * **`args`** *(array of objects, OPTIONAL)* - the specific syscall in seccomp.
> > Each entry has the following structure:
> >
> > Thoughts?
>
> I am afraid it will be a breaking change.

But how, exactly? Existing userspace already needs to be prepared to
handle -ENOSYS and accomplish what they want some other way, because
their kernel might not be new enough.

If they had another test for this weird -EPERM semantic, that code
will be dead with the above change, but it won't break anything.
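
A minimal sketch of the fallback pattern in question, assuming a
hypothetical wrapper (`call_with_fallback`) and stand-in functions that
simulate a filtered syscall; none of these names come from a real library.
It shows that a caller prepared for ENOSYS reaches its fallback, while
EPERM propagates to the caller as an error:

```python
import errno


def call_with_fallback(new_call, old_call):
    """Try the newer syscall wrapper; fall back only on ENOSYS."""
    try:
        return new_call()
    except OSError as exc:
        if exc.errno == errno.ENOSYS:
            # The syscall is reported as missing: use the older interface.
            return old_call()
        # Any other error (including EPERM) is passed to the caller.
        raise


def blocked(code):
    """Build a stand-in for a syscall that a filter rejects with `code`."""
    def call():
        raise OSError(code, "blocked by filter")
    return call


def legacy():
    """Stand-in for the older interface that the filter permits."""
    return "legacy result"
```

With `blocked(errno.ENOSYS)` the wrapper returns the legacy result; with
`blocked(errno.EPERM)` it raises, so the fallback is never tried.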

Tycho

Florian Weimer

Nov 24, 2020, 1:46:29 PM
to Giuseppe Scrivano, d...@opencontainers.org, Carlos O'Donell
* Giuseppe Scrivano:

> Hi Florian,
>
> Florian Weimer <fwe...@redhat.com> writes:
>
>> This is related to the kernel patch I just sent:
>>
>> <https://groups.google.com/a/opencontainers.org/g/dev/c/gj9ErIn5LQI>
>>
>> I think it would be nice if we could phase out that ugly userspace
>> probing sequence eventually, but that requires switching from EPERM to
>> ENOSYS, with a patch like this:
>>
>> diff --git a/config-linux.md b/config-linux.md
>> index 9ea44a0..19278e1 100644
>> --- a/config-linux.md
>> +++ b/config-linux.md
>> @@ -646,7 +646,7 @@ The following parameters can be specified to set up seccomp:
>>
>> * **`errnoRet`** *(uint, OPTIONAL)* - the errno return code to use.
>> Some actions like `SCMP_ACT_ERRNO` and `SCMP_ACT_TRACE` allow to specify the errno
>> - code to return. If not specified its default value is `EPERM`.
>> + code to return. If not specified its default value is `ENOSYS`.
>>
>> * **`args`** *(array of objects, OPTIONAL)* - the specific syscall in seccomp.
>> Each entry has the following structure:
>>
>> Thoughts?
>
> I am afraid it will be a breaking change.

Some bug fixes necessarily are, unfortunately. Not making this change
also breaks things.

For glibc's use, it would be sufficient to attach the EPERM vs ENOSYS
default to the base image and have it inherited by any derived images. (I
have no idea whether such a mechanism exists.) glibc updates typically
happen at distribution release boundaries, and that is more or less a
well-defined event, so the wider impact of the EPERM → ENOSYS transition
could be mentioned in the container image/distribution release notes.

Since the actually permitted set of system calls does not change, it
would be safe to ask the image itself for the preferred error default.

> I think ENOSYS makes sense only for new added syscalls, as likely there
> is a fallback, but IMO it should be handled at a higher level.

For security reasons, I think these kinds of seccomp filters need to
come in the form of a list of permitted (and thus known) system calls.
The system calls not in this list are unknown. There is no third,
fourth &c state here. Where would it come from?

libseccomp could hard-code a particular system call universe at build
time, based on the kernel headers it finds (or built-in tables). This
means that on every libseccomp upgrade, the universe could potentially
change, and with it some system call errors would turn from ENOSYS
(working fallback, previously not within the universe) to EPERM (broken
fallback, now in the universe). Any component that hard-codes such a
universe would have the same issue because the behavior outside the
permitted syscall list is essentially arbitrary. That includes the host
kernel: Assume a new system call is backported. Should containers
change their error code from ENOSYS to EPERM? I don't think so.

In a sense, the new/old system call distinction would turn what is a one
time potential breakage (due to the EPERM → ENOSYS transition) into an
ongoing source of issues related to potential ENOSYS → EPERM changes at
any cluster infrastructure update.
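
A minimal sketch of that drift, assuming a hypothetical policy function
(`denied_errno`) that returns EPERM for denied syscalls inside a hard-coded
universe and ENOSYS for everything else. The syscall name `newcall` is made
up for illustration, and this is not how libseccomp is known to behave:

```python
import errno


def denied_errno(syscall, allowed, universe):
    """Errno under a hypothetical 'known vs. unknown' policy.

    Returns None when the syscall is on the permitted list.
    """
    if syscall in allowed:
        return None
    # Denied syscalls the component knows about get EPERM;
    # anything outside its universe looks unimplemented.
    return errno.EPERM if syscall in universe else errno.ENOSYS


allowed = {"read", "write", "openat"}
old_universe = {"read", "write", "openat", "ptrace"}
# After an upgrade, the universe grows to include the (made-up) new syscall.
new_universe = old_universe | {"newcall"}
```

The same container with the same permitted list sees `newcall` fail with
ENOSYS under `old_universe` and with EPERM under `new_universe`.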

(Note that this message was written based on a view from the outside,
looking at various failure modes and discussions of related topics. I
do not know anything about the inner workings of libseccomp.)

Giuseppe Scrivano

Nov 24, 2020, 3:50:20 PM
to Tycho Andersen, Florian Weimer, d...@opencontainers.org, Carlos O'Donell
The difference I see is that ENOSYS is usually handled as "do not try
this syscall anymore", while EPERM is a temporary failure.

That could be a problem with conditional seccomp rules that allow
syscalls only with certain inputs and currently fail with EPERM in the
remaining cases. e.g. personality(2) in the default Podman profile is
allowed if the first arg matches some specific values.

If userspace sees ENOSYS, it won't likely attempt the syscall again,
even if it could succeed with a different set of inputs.

Regards,
Giuseppe

Florian Weimer

Nov 24, 2020, 4:01:07 PM
to Giuseppe Scrivano, Tycho Andersen, d...@opencontainers.org, Carlos O'Donell
* Giuseppe Scrivano:
ENOSYS also means “try something else if you can”, while EPERM is pretty
much the opposite—the current operation is expected to fail (although
you are right that it would be inappropriate to cache that result,
particularly in an argument-independent fashion).

> That could be a problem with conditional seccomp rules that allow
> syscalls only with certain inputs and currently fail with EPERM in the
> remaining cases. e.g. personality(2) in the default Podman profile is
> allowed if the first arg matches some specific values.

But if you have a rule that inspects arguments, then you know about the
system call, and it's fine to return specific error codes.

I guess this means that the overall default for system calls that do not
have specific rules should be ENOSYS (basically, the unknown case), but
if there is a rule that checks a specific system call and specifies an
error return, the default error code in the absence of an explicit
specification should be EPERM.
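
A minimal sketch of this two-level default, assuming a simplified profile
format in which every rule is an allow rule, only `SCMP_CMP_EQ` is
supported, and all conditions in a rule must hold. The function name
(`filter_result`) and the argument value in the sample profile are
illustrative and not taken from any real profile:

```python
import errno


def filter_result(profile, syscall, args):
    """Return None if the call is allowed, else the default errno."""
    rules = [rule for rule in profile if syscall in rule["names"]]
    if not rules:
        # No rule mentions this syscall: the unknown case.
        return errno.ENOSYS
    for rule in rules:
        conditions = rule.get("args", [])
        if all(args[c["index"]] == c["value"] for c in conditions):
            return None
    # The syscall is known to the profile, but the arguments are rejected.
    return errno.EPERM


profile = [
    {"names": ["read", "write"]},
    {
        "names": ["personality"],
        "args": [{"index": 0, "value": 0, "op": "SCMP_CMP_EQ"}],
    },
]
```

Here `personality` with a first argument other than 0 yields EPERM, while
a syscall the profile never mentions yields ENOSYS.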

I hope the seccomp infrastructure is flexible enough to implement that.

> If userspace sees ENOSYS, it won't likely attempt the syscall again,
> even if it could succeed with a different set of inputs.

Only in some cases on performance-critical system call wrappers. I
believe there are kernels out there that have a copy_file_range
implementation which returns argument-dependent ENOSYS because that was
the least worst option. (On those kernels, copy_file_range is only
supported for network file systems, not for local file systems, due to
kernel limitations.) But definitely not a precedent to follow.

In glibc, we have mostly stopped caching system call availability. If
that causes performance problems, people should use kernels that have
the preferred system calls.

Tycho Andersen

Nov 25, 2020, 2:07:19 PM
to Giuseppe Scrivano, Florian Weimer, d...@opencontainers.org, Carlos O'Donell
But in the case of EPERM from a seccomp filter it's a permanent
failure, because seccomp filters are pure over their arguments
(ignoring USER_NOTIF for a moment) and thus the return code for a
struct seccomp_data can't change for a given filter.

It's still not clear to me why this matters, since applications should
be used to getting ENOSYS and have code to handle things accordingly.

Tycho

Florian Weimer

Nov 25, 2020, 2:10:41 PM
to Tycho Andersen, Giuseppe Scrivano, d...@opencontainers.org, Carlos O'Donell
* Tycho Andersen:

> But in the case of EPERM from a seccomp filter it's a permanent
> failure, because seccomp filters are pure over their arguments
> (ignoring USER_NOTIF for a moment) and thus the return code for a
> struct seccomp_data can't change for a given filter.
>
> It's still not clear to me why this matters, since applications should
> be used to getting ENOSYS and have code to handle things accordingly.

They don't get ENOSYS, they get EPERM today for system calls not
specified in the policy. This appears to be implied by the OCI run-time
spec.

Tycho Andersen

Nov 25, 2020, 2:26:16 PM
to Florian Weimer, Giuseppe Scrivano, d...@opencontainers.org, Carlos O'Donell
On Wed, Nov 25, 2020 at 08:10:29PM +0100, Florian Weimer wrote:
> * Tycho Andersen:
>
> > But in the case of EPERM from a seccomp filter it's a permanent
> > failure, because seccomp filters are pure over their arguments
> > (ignoring USER_NOTIF for a moment) and thus the return code for a
> > struct seccomp_data can't change for a given filter.
> >
> > It's still not clear to me why this matters, since applications should
> > be used to getting ENOSYS and have code to handle things accordingly.
>
> They don't get ENOSYS, they get EPERM today for system calls not
> specified in the policy. This appears to be implied by the OCI run-time
> spec.

yes, I think we're in violent agreement on this :)

Tycho

Giuseppe Scrivano

Nov 25, 2020, 4:45:24 PM
to Florian Weimer, Tycho Andersen, d...@opencontainers.org, Carlos O'Donell
Florian Weimer <fwe...@redhat.com> writes:

> * Tycho Andersen:
>
>> But in the case of EPERM from a seccomp filter it's a permanent
>> failure, because seccomp filters are pure over their arguments
>> (ignoring USER_NOTIF for a moment) and thus the return code for a
>> struct seccomp_data can't change for a given filter.
>>
>> It's still not clear to me why this matters, since applications should
>> be used to getting ENOSYS and have code to handle things accordingly.
>
> They don't get ENOSYS, they get EPERM today for system calls not
> specified in the policy. This appears to be implied by the OCI run-time
> spec.

If a syscall is not specified at all in the configuration, then I agree
ENOSYS makes sense, as that syscall will always fail anyway.

What I am concerned about are seccomp configurations such as:
...
{
    "names": [
        "some_syscall"
    ],
    "action": "SCMP_ACT_ALLOW",
    "args": [
        {
            "index": 0,
            "value": 0,
            "op": "SCMP_CMP_EQ"
        }
    ]
},
...

With such a configuration, `some_syscall(0)` currently succeeds but
`some_syscall(1)` fails with EPERM.

Given that caching the ENOSYS result is a common pattern (e.g. gnulib
accept4[1] wrapper), how would ENOSYS work in this case? I'd expect it
to be consistent and always fail no matter what input is used.
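
A minimal sketch of the concern, assuming a hypothetical caching wrapper
(`CachingWrapper`) in the spirit of the gnulib pattern and a stand-in for
the filtered `some_syscall`; it is not the gnulib code itself:

```python
import errno


def filtered_some_syscall(arg, denied_errno):
    """Stand-in for the rule above: allowed only when argument 0 equals 0."""
    if arg != 0:
        raise OSError(denied_errno, "denied by filter")
    return "ok"


class CachingWrapper:
    """Remembers ENOSYS and never retries the syscall afterwards."""

    def __init__(self, syscall, fallback):
        self.syscall = syscall
        self.fallback = fallback
        self.have_syscall = True

    def __call__(self, arg):
        if self.have_syscall:
            try:
                return self.syscall(arg)
            except OSError as exc:
                if exc.errno != errno.ENOSYS:
                    raise
                # Cache the result regardless of which argument was used.
                self.have_syscall = False
        return self.fallback(arg)
```

If the filter returns ENOSYS for `some_syscall(1)`, a later
`some_syscall(0)` goes straight to the fallback even though the filter
would have allowed it. With EPERM nothing is cached and the later call
reaches the syscall.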

Giuseppe

[1] https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/accept4.c;h=a21f12f0fd5ba54f89a7bdc51c1f4705fc94c716;hb=HEAD#l50

Peng Tao

Nov 25, 2020, 9:28:56 PM
to Giuseppe Scrivano, Florian Weimer, Tycho Andersen, dev, Carlos O'Donell
EPERM means the operation is not permitted. ENOSYS means the syscall
does not exist. Why would we want to unify them? If a syscall is
filtered by seccomp, it is perfectly legitimate to return EPERM. For a
syscall that is not implemented, ENOSYS is the right one. Can we make
seccomp aware of this difference and return the proper error code?

Cheers,
Tao
--
Into Sth. Rich & Strange