recvmmsg() timeout behavior strangeness [RESEND]

Michael Kerrisk (man-pages)

Apr 30, 2014, 9:59:49 AM

to Arnaldo Carvalho de Melo, lkml, mtk.ma...@gmail.com, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Arnaldo,

I raised this issue somewhat more than a year ago, here:
http://thread.gmane.org/gmane.linux.man/3477
but got no reply from you. (Chris Friesen in that thread agreed
that there is a problem though.)

Here is a slightly revised version of that mail, since I've just bumped
into a related problem in a different context...

As part of his attempt to better document the recvmmsg() syscall that
you added in commit a2e2725541fad72416326798c2d7fa4dafb7d337, Elie de
Brauwer alerted me to some strangeness in the timeout behavior of
the syscall. I suspect there's a bug that needs fixing, as detailed
below.

AFAICT, the timeout argument was added to this syscall as a result of
the discussion here:
http://markmail.org/message/m5l2ap4hiiimut6k#query:+page:1+mid:m5l2ap4hiiimut6k+state:results
(20-21 May 2009, "[RFC 1/2] net: Introduce recvmmsg...")

If I understand correctly, the *intended* purpose of the timeout
argument is to set a limit on how long to wait for additional
datagrams after the arrival of an initial datagram. However, the
syscall behaves in quite a different way. Instead, it potentially
blocks forever, regardless of the timeout. The way the timeout seems
to work is as follows:

1. The timeout, T, is armed on receipt of the first datagram, starting at time X.
2. After each further datagram is received, a check is made if we have
reached time X+T. If we have reached that time, then the syscall
returns.

Since the timeout is only checked after the arrival of each datagram,
we can have scenarios like the following:

0. Assume a timeout of 10 seconds, and that vlen is 5.
1. First datagram arrives at time X.
2. Second datagram arrives at time X+2 secs
3. No more datagrams arrive.

In this case, the call blocks forever. Is that intended behavior?
(Basically, if up to vlen-1 datagrams arrive before X+T, but then no
more datagrams arrive, the call will remain blocked forever.) If it's
intended behavior, could you elaborate the use case, since it would be
good to add that to the man page. If not, a fix seems to be needed,
since otherwise, it's hard to see how the recvmmsg() timeout argument
can sanely be used.

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Michael Kerrisk (man-pages)

May 3, 2014, 6:29:12 AM

to Arnaldo Carvalho de Melo, lkml, Michael Kerrisk, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Arnaldo,

On Wed, Apr 30, 2014 at 3:59 PM, Michael Kerrisk (man-pages)
<mtk.ma...@gmail.com> wrote:
> Arnaldo,
>
> I raised this issue somewhat more than a year ago, here:
> http://thread.gmane.org/gmane.linux.man/3477
> but got no reply from you. (Chris Friesen in that thread agreed
> that there is a problem though.)
>
> Here is a slightly revised version of that mail, since I've just bumped
> into a related problem in a different context...
>
> As part of his attempt to better document the recvmmsg() syscall that
> you added in commit a2e2725541fad72416326798c2d7fa4dafb7d337, Elie de
> Brauwer alerted me to some strangeness in the timeout behavior of
> the syscall. I suspect there's a bug that needs fixing, as detailed
> below.
>
> AFAICT, the timeout argument was added to this syscall as a result of
> the discussion here:
> http://markmail.org/message/m5l2ap4hiiimut6k#query:+page:1+mid:m5l2ap4hiiimut6k+state:results
> (20-21 May 2009, "[RFC 1/2] net: Introduce recvmmsg...")
>
> If I understand correctly, the *intended* purpose of the timeout
> argument is to set a limit on how long to wait for additional
> datagrams after the arrival of an initial datagram. However, the
> syscall behaves in quite a different way. Instead, it potentially
> blocks forever, regardless of the timeout. The way the timeout seems
> to work is as follows:

So that the report does not get lost, I've created
https://bugzilla.kernel.org/show_bug.cgi?id=75371
to track it.

Reinvestigating the problem, I see that I got my description of the
behavior slightly wrong, although the fundamental problem remains.
Here's my improved formulation:

    int recvmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen,
                 unsigned int flags, struct timespec *timeout);

As currently implemented, the recvmmsg() timeout feature appears to be
fit for no sane use. The timeout argument is implemented as follows:

    Timer is armed at the time of the call
    while (datagrams-received < vlen) {
        Wait for the next datagram
        Check if timeout has been exceeded; if yes, break out of loop
    }

Since the timeout is only checked after the arrival of each datagram,
we can have scenarios like the following:

0. Assume a timeout of 10 (T) seconds, that vlen is 5, and the call
is made at time X

1. First datagram arrives at time X+2.

2. Second datagram arrives at time X+4 secs

3. Third datagram arrives at time X+6 secs

4. No more datagrams arrive.

In this case, the call blocks forever. It hardly seems that this could
be intended behavior. The problem, of course, is that the timeout is
checked only after receipt of a datagram.

Basically, if up to vlen-1 datagrams arrive before X+T, but then no
more datagrams arrive, the call will remain blocked forever. If it's
intended behavior (seems unlikely), the rationale needs to be
elaborated so that it can be documented in the man page. If not, a fix
seems to be needed, since otherwise, it's hard to see how the
recvmmsg() timeout argument can sanely be used.
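The loop described above can be modelled in user space to make the hang concrete. The following is only a sketch of the described algorithm, not kernel code; the function name simulate_recvmmsg() and its arguments are hypothetical, and arrival times are given in seconds relative to the time of the call (X).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the loop described above; not kernel code.
 * arrivals[] holds datagram arrival times in seconds relative to the
 * call (time X), in increasing order.
 * Returns the number of datagrams delivered when the modelled call
 * returns, or -1 if the modelled call would never return. */
int
simulate_recvmmsg(const double *arrivals, size_t narrivals,
                  unsigned int vlen, double timeout)
{
    unsigned int received = 0;

    assert(arrivals != NULL || narrivals == 0);

    while (received < vlen) {
        if (received >= narrivals)
            return -1;      /* waiting for a datagram that never comes */

        /* "Wait for the next datagram" */
        double now = arrivals[received];
        received++;

        /* The timeout is examined only here, after an arrival */
        if (now >= timeout)
            break;
    }
    return (int) received;
}
```

With arrivals at 2, 4 and 6 seconds, vlen of 5 and a timeout of 10, the model reports that the call never returns. Adding a fourth arrival at 12 seconds makes it return 4, which shows that the timeout takes effect only when a further datagram happens to arrive after expiry.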

Cheers,

Florian Westphal

May 3, 2014, 7:39:55 AM

to Michael Kerrisk (man-pages), Arnaldo Carvalho de Melo, lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Michael Kerrisk (man-pages) <mtk.ma...@gmail.com> wrote:
> Reinvestigating the problem, I see that I got my description of the
> behavior slightly wrong, although the fundamental problem remains.
> Here's my improved formulation:

[..]

> Since the timeout is only checked after the arrival of each datagram,
> we can have scenarios like the following:
>
> 0. Assume a timeout of 10 (T) seconds, that vlen is 5, and the call
> is made at time X
>
> 1. First datagram arrives at time X+2.
>
> 2. Second datagram arrives at time X+4 secs
>
> 3. Third datagram arrives at time X+6 secs
>
> 4. No more datagrams arrive.
>
> In this case, the call blocks forever. It hardly seems that this could
> be intended behavior. The problem, of course is that the timeout is
> checked only after receipt of a datagram.

Isn't that what MSG_WAITFORONE is supposed to solve?

Michael Kerrisk (man-pages)

May 3, 2014, 7:40:10 AM

to Florian Westphal, Arnaldo Carvalho de Melo, lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

On Sat, May 3, 2014 at 1:29 PM, Florian Westphal <f...@strlen.de> wrote:
> Michael Kerrisk (man-pages) <mtk.ma...@gmail.com> wrote:
>> Reinvestigating the problem, I see that I got my description of the
>> behavior slightly wrong, although the fundamental problem remains.
>> Here's my improved formulation:
> [..]
>
>> Since the timeout is only checked after the arrival of each datagram,
>> we can have scenarios like the following:
>>
>> 0. Assume a timeout of 10 (T) seconds, that vlen is 5, and the call
>> is made at time X
>>
>> 1. First datagram arrives at time X+2.
>>
>> 2. Second datagram arrives at time X+4 secs
>>
>> 3. Third datagram arrives at time X+6 secs
>>
>> 4. No more datagrams arrive.
>>
>> In this case, the call blocks forever. It hardly seems that this could
>> be intended behavior. The problem, of course is that the timeout is
>> checked only after receipt of a datagram.
>
> Isn't that what MSG_WAITFORONE is supposed to solve?

I don't think so. I understand the idea of the timeout to be: get as
many datagrams as you can within a certain interval. MSG_WAITFORONE is
orthogonal to that goal (you can specify MSG_WAITFORONE without an
infinite timeout, for example).

Also, consider the algorithm above: if no datagrams arrive, the
timeout is in effect ignored.
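To illustrate that MSG_WAITFORONE acts independently of the timeout argument, here is a minimal sketch that passes the flag with a NULL timeout. It is an assumption-laden demo rather than anything from the thread: it uses an AF_UNIX datagram socket pair instead of UDP so that it is self-contained, and the names waitforone_demo(), VLEN and BUFSZ are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define VLEN  5     /* number of mmsghdr slots offered to recvmmsg() */
#define BUFSZ 64    /* size of each receive buffer */

/* Queue nsend small datagrams on a local socket pair, then ask
 * recvmmsg() for up to VLEN of them with MSG_WAITFORONE and a NULL
 * timeout. Returns recvmmsg()'s result, or -1 on setup failure. */
int
waitforone_demo(int nsend)
{
    int sv[2];
    struct mmsghdr msgs[VLEN];
    struct iovec iov[VLEN];
    char bufs[VLEN][BUFSZ];
    int i, n;

    assert(nsend >= 1 && nsend <= VLEN);

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;

    for (i = 0; i < nsend; i++) {
        if (send(sv[0], "ping", 4, 0) != 4) {
            close(sv[0]);
            close(sv[1]);
            return -1;
        }
    }

    /* One single-element iovec per mmsghdr slot */
    memset(msgs, 0, sizeof(msgs));
    for (i = 0; i < VLEN; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = BUFSZ;
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* Blocks only until the first datagram; later reads are nonblocking */
    n = recvmmsg(sv[1], msgs, VLEN, MSG_WAITFORONE, NULL);

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

Calling waitforone_demo(2) returns once the two queued datagrams have been read, even though five slots were offered and no timeout was given. The flag changes when the call stops waiting, but it does not bound the wait for the first datagram.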

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Michael Kerrisk (man-pages)

May 12, 2014, 6:15:53 AM

to Arnaldo Carvalho de Melo, lkml, Michael Kerrisk, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Hi Arnaldo,

Ping!

Cheers,

Michael

Arnaldo Carvalho de Melo

May 12, 2014, 10:35:48 AM

to Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Em Mon, May 12, 2014 at 12:15:25PM +0200, Michael Kerrisk (man-pages) escreveu:
> Hi Arnaldo,
>
> Ping!

I acknowledge the problem, the timeout has to be passed to the
underlying ->recvmsg() implementations that should return the time spent
waiting for each packet, so that we can accrue that at recvmmsg level.

We can do either passing an extra timeout parameter to the recvmsg
implementations or using some struct sock member to specify that
timeout.

The first approach is intrusive, touches tons of files, so I'll try
making it all mostly transparent by hooking into sock_rcvtimeo()
somehow.

- Arnaldo

Arnaldo Carvalho de Melo

May 21, 2014, 5:06:18 PM

to Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Em Mon, May 12, 2014 at 11:34:51AM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Mon, May 12, 2014 at 12:15:25PM +0200, Michael Kerrisk (man-pages) escreveu:
> > Hi Arnaldo,

> > Ping!

> I acknowledge the problem, the timeout has to be passed to the
> underlying ->recvmsg() implementations that should return the time spent
> waiting for each packet, so that we can accrue that at recvmmsg level.

> We can do either passing an extra timeout parameter to the recvmsg
> implementations or using some struct sock member to specify that
> timeout.

> The first approach is intrusive, touches tons of files, so I'll try
> making it all mostly transparent by hooking into sock_rcvtimeo()
> somehow.

But after thinking a bit more, looks like we need to do that, please
take a look at the attached patch to see if it addresses the problem.

Mostly it adds a new timeop to the per protocol recvmsg()
implementations, that, if not NULL, should be used instead of
SO_RCVTIMEO.

Since the underlying recvmsg implementations already check that timeout,
they return what is remaining, which is then used in subsequent recvmsg
calls; at the end we just convert it back to timespec format.

In most cases it is just passed to skb_recv_datagram, that will check
the pointer, use it and update if not NULL.
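The accounting idea described here (each wait consumes part of one shared time budget) can be sketched in user space. This is not the attached kernel patch, which is not reproduced in the thread; it is a hypothetical helper, recv_batch(), built on poll() and recv(), with buffer layout and parameter names chosen purely for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Milliseconds elapsed on the monotonic clock since *start. */
static long
elapsed_ms(const struct timespec *start)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (long) (now.tv_sec - start->tv_sec) * 1000L +
           (now.tv_nsec - start->tv_nsec) / 1000000L;
}

/* Receive up to vlen datagrams, spending at most timeout_ms in total.
 * Datagram i is stored at buf + i * stride (at most stride bytes) and
 * its length in lens[i]. Returns the number received (0 if the budget
 * ran out first) or -1 on error, including EINTR. */
int
recv_batch(int fd, char *buf, size_t stride, ssize_t *lens,
           unsigned int vlen, long timeout_ms)
{
    struct timespec start;
    unsigned int n = 0;

    assert(timeout_ms >= 0);
    clock_gettime(CLOCK_MONOTONIC, &start);

    while (n < vlen) {
        /* Remaining budget is recomputed before every wait */
        long left = timeout_ms - elapsed_ms(&start);
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int r;

        if (left < 0)
            left = 0;       /* budget used up: poll without waiting */

        r = poll(&pfd, 1, (int) left);
        if (r == -1)
            return -1;
        if (r == 0)
            break;          /* nothing more arrived within the budget */

        lens[n] = recv(fd, buf + n * stride, stride, MSG_DONTWAIT);
        if (lens[n] == -1)
            return -1;
        n++;
    }
    return (int) n;
}
```

Because the wait itself is bounded by the remaining budget, the helper returns when the budget runs out even if no further datagram ever arrives, which is the property the original timeout check lacked.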

Should have no problems, but I only did a boot with a system with this
patch applied, no problems noticed on a normal desktop session, ssh,
etc.

Arnaldo

recvmmsg-timeout.patch

Michael Kerrisk (man-pages)

May 22, 2014, 10:27:59 AM

to Arnaldo Carvalho de Melo, mtk.ma...@gmail.com, lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Hi Arnaldo,

Thanks! I applied this patch against 3.15-rc6.

recvmmsg() now (mostly) does what I expect:
* It waits until either the timeout expires or vlen messages
  have been received.
* If no message is received before the timeout, it returns -1/EAGAIN.
* If vlen messages are received before the timeout expires, then
  the remaining time is returned in timeout.

One question: in the event that the call is interrupted by a signal
handler, it fails (as expected) with EINTR, but the 'timeout' value is
not updated with the remaining time on the timer. Would it be desirable
to emulate the behavior of select() (and other syscalls) in this
respect, and instead return the remaining time if interrupted by
a signal?
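Until the kernel reports the remaining time itself, a caller that wants select()-like behavior after EINTR has to compute it. A minimal sketch of that arithmetic follows; the helper name timespec_remaining() is hypothetical, and both arguments are assumed to be readings of the same clock (for example CLOCK_MONOTONIC), with the deadline computed by the caller before the call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Return deadline - now, clamped at zero. Both arguments must be
 * normalized readings of the same clock. */
struct timespec
timespec_remaining(struct timespec deadline, struct timespec now)
{
    struct timespec rem;

    assert(deadline.tv_nsec >= 0 && deadline.tv_nsec < 1000000000L);
    assert(now.tv_nsec >= 0 && now.tv_nsec < 1000000000L);

    rem.tv_sec = deadline.tv_sec - now.tv_sec;
    rem.tv_nsec = deadline.tv_nsec - now.tv_nsec;
    if (rem.tv_nsec < 0) {      /* borrow one second */
        rem.tv_sec--;
        rem.tv_nsec += 1000000000L;
    }
    if (rem.tv_sec < 0) {       /* deadline already passed */
        rem.tv_sec = 0;
        rem.tv_nsec = 0;
    }
    return rem;
}
```

A caller could record the deadline before calling recvmmsg() and, on EINTR, pass the result of this helper as the timeout when retrying.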

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--

David Miller

May 23, 2014, 3:01:06 PM

to ac...@kernel.org, mtk.ma...@gmail.com, linux-...@vger.kernel.org, linu...@vger.kernel.org, net...@vger.kernel.org, nel...@seznam.cz, caitlin...@gmail.com, nho...@tuxdriver.com, eliede...@gmail.com, st...@chygwyn.com, remi.deni...@nokia.com, pa...@paul-moore.com, chris....@windriver.com

From: Arnaldo Carvalho de Melo <ac...@kernel.org>
Date: Wed, 21 May 2014 18:05:35 -0300

> But after thinking a bit more, looks like we need to do that, please
> take a look at the attached patch to see if it addresses the problem.
>
> Mostly it adds a new timeop to the per protocol recvmsg()
> implementations, that, if not NULL, should be used instead of
> SO_RCVTIMEO.
>
> since the underlying recvmsg implementations already check that timeout,
> return what is remaining, that will then be used in subsequent recvmsg
> calls, at the end we just convert it back to timespec format.
>
> In most cases it is just passed to skb_recv_datagram, that will check
> the pointer, use it and update if not NULL.
>
> Should have no problems, but I only did a boot with a system with this
> patch applied, no problems noticed on a normal desktop session, ssh,
> etc.

This looks fine to me, but I have a small request:

+ return noblock ? 0 : timeop ? *timeop : sk->sk_rcvtimeo;

I keep forgetting which way these expressions associate, so if you could
parenthesize the innermost ?: I'd appreciate it. :)
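For reference, the conditional operator in C groups from right to left, so the unparenthesized expression already means what the parenthesized one says. The sketch below uses hypothetical stand-ins (plain long values instead of the kernel's struct sock field) to compare the two spellings.

```c
#include <assert.h>
#include <stddef.h>

/* Unparenthesized form, as in the quoted patch line */
long
pick_timeout(int noblock, const long *timeop, long sk_rcvtimeo)
{
    return noblock ? 0 : timeop ? *timeop : sk_rcvtimeo;
}

/* Same expression with the innermost ?: made explicit */
long
pick_timeout_explicit(int noblock, const long *timeop, long sk_rcvtimeo)
{
    return noblock ? 0 : (timeop ? *timeop : sk_rcvtimeo);
}

/* Exercise every combination and confirm both forms agree */
void
check_ternary_grouping(void)
{
    long t = 7;
    int noblock;

    for (noblock = 0; noblock <= 1; noblock++) {
        assert(pick_timeout(noblock, &t, 30) ==
               pick_timeout_explicit(noblock, &t, 30));
        assert(pick_timeout(noblock, NULL, 30) ==
               pick_timeout_explicit(noblock, NULL, 30));
    }
}
```

The parentheses therefore change nothing for the compiler; they only spare the reader from having to recall the grouping rule.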

Thanks!

Arnaldo Carvalho de Melo

May 23, 2014, 3:55:42 PM

to David Miller, mtk.ma...@gmail.com, linux-...@vger.kernel.org, linu...@vger.kernel.org, net...@vger.kernel.org, nel...@seznam.cz, caitlin...@gmail.com, nho...@tuxdriver.com, eliede...@gmail.com, st...@chygwyn.com, remi.deni...@nokia.com, pa...@paul-moore.com, chris....@windriver.com

Em Fri, May 23, 2014 at 03:00:55PM -0400, David Miller escreveu:
> From: Arnaldo Carvalho de Melo <ac...@kernel.org>
> Date: Wed, 21 May 2014 18:05:35 -0300

> > But after thinking a bit more, looks like we need to do that, please
> > take a look at the attached patch to see if it addresses the problem.

> > Mostly it adds a new timeop to the per protocol recvmsg()
> > implementations, that, if not NULL, should be used instead of
> > SO_RCVTIMEO.

> > since the underlying recvmsg implementations already check that timeout,
> > return what is remaining, that will then be used in subsequent recvmsg
> > calls, at the end we just convert it back to timespec format.

> > In most cases it is just passed to skb_recv_datagram, that will check
> > the pointer, use it and update if not NULL.

> > Should have no problems, but I only did a boot with a system with this
> > patch applied, no problems noticed on a normal desktop session, ssh,
> > etc.

> This looks fine to me, but I have a small request:

> + return noblock ? 0 : timeop ? *timeop : sk->sk_rcvtimeo;

> I keep forgetting which way these expressions associate, so if you could
> parenthesize the innermost ?: I'd appreciate it. :)

Ok, I actually wrote a sample program to verify that these ternaries did
what I meant 8)

I'll finish the cset log and do this clarification change.

Would be great to get Acked-by tags from the original reporter, Michael
and whoever had a look at this change, if possible. Michael, Elie?

> Thanks!

Thanks a lot for reviewing it!

- Arnaldo

Michael Kerrisk (man-pages)

May 24, 2014, 8:24:52 AM

to Arnaldo Carvalho de Melo, David Miller, mtk.ma...@gmail.com, linux-...@vger.kernel.org, linu...@vger.kernel.org, net...@vger.kernel.org, nel...@seznam.cz, caitlin...@gmail.com, nho...@tuxdriver.com, eliede...@gmail.com, st...@chygwyn.com, remi.deni...@nokia.com, pa...@paul-moore.com, chris....@windriver.com

Arnaldo, I already sent you a reply (will reping on that one),
but got no response. My light testing got the expected results,
but I still had one question about the semantics.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Michael Kerrisk (man-pages)

May 24, 2014, 8:24:57 AM

to Arnaldo Carvalho de Melo, mtk.ma...@gmail.com, lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Ping!

Arnaldo Carvalho de Melo

May 26, 2014, 9:47:00 AM

to Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

I think so, will check how to achieve that!

> Cheers,
>
> Michael
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
> --

Arnaldo Carvalho de Melo

May 26, 2014, 5:17:19 PM

to Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Em Mon, May 26, 2014 at 10:46:47AM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Thu, May 22, 2014 at 04:27:45PM +0200, Michael Kerrisk (man-pages) escreveu:
> > Thanks! I applied this patch against 3.15-rc6.

> > recvmmsg() now (mostly) does what I expect:
> > * it waits until either the timeout expires or vlen messages
> > have been received
> > * If no message is received before timeout, it returns -1/EAGAIN.
> > * If vlen messages are received before the timeout expires, then
> > the remaining time is returned in timeout.

> > One question: in the event that the call is interrupted by a signal
> > handler, it fails (as expected) with EINTR, but the 'timeout' value is
> > not updated with the remaining time on the timer. Would it be desirable
> > to emulate the behavior of select() (and other syscalls) in this
> > respect, and instead return the remaining time if interrupted by
> > a signal?

> I think so, will check how to achieve that!

Can you try the attached patch on top of the first one?

It starts adding explicit parentheses on a ternary, as David requested,
and then should return the remaining timeouts in cases like signals,
etc.

Please let me know if this is enough.

- Arnaldo

P.S. compile testing while sending this message :-)

recvmsg-return-timeout-harder.patch

Michael Kerrisk (man-pages)

May 27, 2014, 12:35:38 PM

to Arnaldo Carvalho de Melo, mtk.ma...@gmail.com, lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Hi Arnaldo,

On 05/26/2014 11:17 PM, Arnaldo Carvalho de Melo wrote:
> Em Mon, May 26, 2014 at 10:46:47AM -0300, Arnaldo Carvalho de Melo escreveu:
>> Em Thu, May 22, 2014 at 04:27:45PM +0200, Michael Kerrisk (man-pages) escreveu:
>>> Thanks! I applied this patch against 3.15-rc6.
>
>>> recvmmsg() now (mostly) does what I expect:
>>> * it waits until either the timeout expires or vlen messages
>>> have been received
>>> * If no message is received before timeout, it returns -1/EAGAIN.
>>> * If vlen messages are received before the timeout expires, then
>>> the remaining time is returned in timeout.
>
>>> One question: in the event that the call is interrupted by a signal
>>> handler, it fails (as expected) with EINTR, but the 'timeout' value is
>>> not updated with the remaining time on the timer. Would it be desirable
>>> to emulate the behavior of select() (and other syscalls) in this
>>> respect, and instead return the remaining time if interrupted by
>>> a signal?
>
>> I think so, will check how to achieve that!
>
> Can you try the attached patch on top of the first one?

Patches on patches is a way to make your testers work unnecessarily
harder. Also, it means that anyone else who was interested in this
thread likely got lost at this point, because they probably didn't
save the first patch. All of this to say: it makes life much easier
if you provide a complete new self-contained patch on each iteration.

> It starts adding explicit parentheses on a ternary, as David requested,
> and then should return the remaining timeouts in cases like signals,
> etc.
>
> Please let me know if this is enough.

Nope, it doesn't fix the problem. (I applied both patches against 3.15-rc7)

> P.S. compile testing while sending this message :-)

Okay -- how about some real testing for the next version ;-). I've appended
my test program below. You can use it as follows:

    ./t_recvmmsg <port> <timeout-in-secs> <bufsize>...

(The timeout can also be '-' meaning use NULL as the timeout argument.)

Cheers,

Michael

/* t_recvmmsg.c

A simple test program for the Linux-specific recvmmsg() system call.
*/
#define _GNU_SOURCE
#include <sys/time.h>
#include <signal.h>
#include <sys/socket.h>
#include <netdb.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

/* Create a socket of the given type bound to the wildcard address on
   the port named by 'service'. Returns the socket, or -1 on error. */

static int
createBoundSocket(const char *service, int type, socklen_t *addrlen)
{
    struct addrinfo hints;
    struct addrinfo *result, *rp;
    int sfd, s;

    memset(&hints, 0, sizeof(struct addrinfo));
    hints.ai_canonname = NULL;
    hints.ai_addr = NULL;
    hints.ai_next = NULL;
    hints.ai_socktype = type;
    hints.ai_family = AF_UNSPEC;        /* Allows IPv4 or IPv6 */
    hints.ai_flags = AI_PASSIVE;        /* Use wildcard IP address */

    s = getaddrinfo(NULL, service, &hints, &result);
    if (s != 0)
        return -1;

    /* Walk through returned list until we find an address structure
       that can be used to successfully create and bind a socket */

    for (rp = result; rp != NULL; rp = rp->ai_next) {
        sfd = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
        if (sfd == -1)
            continue;                   /* On error, try next address */

        if (bind(sfd, rp->ai_addr, rp->ai_addrlen) == 0)
            break;                      /* Success */

        /* bind() failed: close this socket and try next address */

        close(sfd);
    }

    if (rp != NULL && addrlen != NULL)
        *addrlen = rp->ai_addrlen;      /* Return address structure size */

    freeaddrinfo(result);

    return (rp == NULL) ? -1 : sfd;
}

static void
handler(int sig)
{
    (void) sig;         /* Unused; we just interrupt a syscall */
}

int
main(int argc, char *argv[])
{
    int sfd, vlen, j, s;
    struct mmsghdr *msgvecp;
    struct timespec ts;
    struct timespec *tsp;
    struct sigaction sa;

    if (argc < 4) {
        fprintf(stderr, "Usage: %s port tmo-secs buf-len...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    sfd = createBoundSocket(argv[1], SOCK_DGRAM, NULL);
    if (sfd == -1) {
        fprintf(stderr, "Could not create server socket (%s)\n",
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    /* Handle a signal, so we can test behaviour when recvmmsg()
       is interrupted by a signal */

    sa.sa_handler = handler;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGQUIT, &sa, NULL) == -1)
        errExit("sigaction");

    /* argv[2] specifies recvmmsg() timeout in seconds, or is '-', meaning
       using NULL argument to get infinite timeout */

    if (argv[2][0] == '-') {
        tsp = NULL;
    } else {
        ts.tv_sec = atoi(argv[2]);
        ts.tv_nsec = 0;
        tsp = &ts;
    }

    /* Remaining command-line arguments specify the size of recvmmsg()
       buffers */

    /* The second argument to recvmmsg() is a pointer to an array of
       mmsghdr structures. Each element of that array has a field,
       'struct msghdr msg_hdr', that is used to store information from a
       single received datagram. Among the fields in the msghdr structure
       is a 'struct iovec *msg_iov'--that is, a pointer to a scatter/gather
       I/O vector. To keep things simple for this example, our scatter/gather
       vectors always consist of a single element.
     */

    /* Allocate the mmsghdr vector, whose size corresponds to the
       number of remaining command-line arguments */

    vlen = argc - 3;

    msgvecp = calloc(vlen, sizeof(struct mmsghdr));
    if (msgvecp == NULL)
        errExit("calloc");

    for (j = 0; j < vlen; j++) {
        msgvecp[j].msg_hdr.msg_name = NULL;
        msgvecp[j].msg_hdr.msg_namelen = 0;
        msgvecp[j].msg_hdr.msg_control = NULL;
        msgvecp[j].msg_hdr.msg_controllen = 0;

        /* Allocate an iovec for this mmsghdr element. The vector
           contains just a single item. */

        msgvecp[j].msg_hdr.msg_iovlen = 1;

        msgvecp[j].msg_hdr.msg_iov = malloc(sizeof(struct iovec));
        if (msgvecp[j].msg_hdr.msg_iov == NULL)
            errExit("malloc");

        /* The single iovec element contains a pointer to a buffer
           that is sized according to the number given in the
           corresponding command-line argument */

        s = atoi(argv[j + 3]);
        msgvecp[j].msg_hdr.msg_iov[0].iov_len = s;
        msgvecp[j].msg_hdr.msg_iov[0].iov_base = malloc(s);
        if (msgvecp[j].msg_hdr.msg_iov[0].iov_base == NULL)
            errExit("malloc");
    }

    if (tsp != NULL)
        printf("Timespec before call = %ld.%09ld\n",
               (long) tsp->tv_sec, (long) tsp->tv_nsec);

    /* Now we're ready to make the recvmmsg() call */

    s = recvmmsg(sfd, msgvecp, vlen, 0, tsp);
    if (s == -1) {
        if (errno == EINTR)
            printf("EINTR! (interrupted system call)\n");
        else
            errExit("recvmmsg");
    }

    printf("recvmmsg() returned %d\n", s);

    if (tsp != NULL)
        printf("Timespec after call = %ld.%09ld\n",
               (long) tsp->tv_sec, (long) tsp->tv_nsec);

    /* Display datagrams retrieved by recvmmsg() */

    for (j = 0; j < s; j++) {
        printf("%d: %u - %.*s\n", j,
               msgvecp[j].msg_len, msgvecp[j].msg_len,
               (char *) msgvecp[j].msg_hdr.msg_iov[0].iov_base);
    }

    exit(EXIT_SUCCESS);
}

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--

Arnaldo Carvalho de Melo

May 27, 2014, 3:21:24 PM

to Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Em Tue, May 27, 2014 at 06:35:17PM +0200, Michael Kerrisk (man-pages) escreveu:
> On 05/26/2014 11:17 PM, Arnaldo Carvalho de Melo wrote:
> > Can you try the attached patch on top of the first one?

> Patches on patches is a way to make your testers work unnecessarily
> harder. Also, it means that anyone else who was interested in this

It was meant to highlight the changes with regard to the previous patch,
i.e. to make things easier for reviewing.

> thread likely got lost at this point, because they probably didn't
> save the first patch. All of this to say: it makes life much easier
> if you provide a complete new self-contained patch on each iteration.

If you prefer it that way, find one attached, that I was about to send
(but you can wait till I use your program to test it ;-) )

> > It starts adding explicit parentheses on a ternary, as David requested,
> > and then should return the remaining timeouts in cases like signals,
> > etc.
> >
> > Please let me know if this is enough.
>
> Nope, it doesn't fix the problem. (I applied both patches against 3.15-rc7)

What was the problem experienced?

> > P.S. compile testing while sending this message :-)
>
> Okay -- how about some real testing for the next version ;-). I've appended

Hey, you were providing that real testing! Thanks for that! :-)

> my test program below. You can use it as follows:
>

> ./t_recvmmsg <port> <timeout-in-secs> <bufsize>...

>
> (The timeout can also be '-' meaning use NULL as the timeout argument.)

Thanks for the test proggie, will use it.

> Cheers,
>
> Michael

Arnaldo Carvalho de Melo

May 27, 2014, 3:22:38 PM

to Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Em Tue, May 27, 2014 at 04:21:15PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Tue, May 27, 2014 at 06:35:17PM +0200, Michael Kerrisk (man-pages) escreveu:
> > On 05/26/2014 11:17 PM, Arnaldo Carvalho de Melo wrote:
> > > Can you try the attached patch on top of the first one?
>
> > Patches on patches is a way to make your testers work unnecessarily
> > harder. Also, it means that anyone else who was interested in this
>
> It was meant to highlight the changes with regard to the previous patch,
> i.e. to make things easier for reviewing.
>
> > thread likely got lost at this point, because they probably didn't
> > save the first patch. All of this to say: it makes life much easier
> > if you provide a complete new self-contained patch on each iteration.
>
> If you prefer it that way, find one attached, that I was about to send
> (but you can wait till I use your program to test it ;-) )

Really attached this time ;-\

- Arnaldo

recvmmsg-timeout-v2.patch

Michael Kerrisk (man-pages)

May 27, 2014, 3:29:07 PM

to Arnaldo Carvalho de Melo, lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen, Arnaldo Carvalho de Melo

On Tue, May 27, 2014 at 9:21 PM, Arnaldo Carvalho de Melo
<ac...@ghostprotocols.net> wrote:
> Em Tue, May 27, 2014 at 06:35:17PM +0200, Michael Kerrisk (man-pages) escreveu:
>> On 05/26/2014 11:17 PM, Arnaldo Carvalho de Melo wrote:
>> > Can you try the attached patch on top of the first one?
>
>> Patches on patches is a way to make your testers work unnecessarily
>> harder. Also, it means that anyone else who was interested in this
>
> It was meant to highlight the changes with regard to the previous patch,
> i.e. to make things easier for reviewing.

(I don't think that works...)

>> thread likely got lost at this point, because they probably didn't
>> save the first patch. All of this to say: it makes life much easier
>> if you provide a complete new self-contained patch on each iteration.
>
> If you prefer it that way, find one attached, that I was about to send
> (but you can wait till I use your program to test it ;-) )
>
>> > It starts adding explicit parentheses on a ternary, as David requested,
>> > and then should return the remaining timeouts in cases like signals,
>> > etc.
>> >
>> > Please let me know if this is enough.
>>
>> Nope, it doesn't fix the problem. (I applied both patches against 3.15-rc7)
>
> What was the problem experienced?

The problem is that after EINTR, the timeout is not updated with the
remaining time until expiry. (This was true with just patch 1 applied,
and is also true with both patch 1 and patch 2 applied.)

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Arnaldo Carvalho de Melo

May 27, 2014, 4:30:34 PM

to Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Em Tue, May 27, 2014 at 09:28:37PM +0200, Michael Kerrisk (man-pages) escreveu:
> On Tue, May 27, 2014 at 9:21 PM, Arnaldo Carvalho de Melo
> <ac...@ghostprotocols.net> wrote:
> > Em Tue, May 27, 2014 at 06:35:17PM +0200, Michael Kerrisk (man-pages) escreveu:
> >> On 05/26/2014 11:17 PM, Arnaldo Carvalho de Melo wrote:
> >> > Can you try the attached patch on top of the first one?
> >
> >> Patches on patches is a way to make your testers work unnecessarily
> >> harder. Also, it means that anyone else who was interested in this
> >
> > It was meant to highlight the changes with regard to the previous patch,
> > i.e. to make things easier for reviewing.
>
> (I don't think that works...)

Let's try both then: attached is the updated patch, and this is the
diff against the last combined one:

diff --git a/net/socket.c b/net/socket.c
index 310a50971769..379be43879db 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2478,8 +2478,7 @@ SYSCALL_DEFINE5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,

         datagrams = __sys_recvmmsg(fd, mmsg, vlen, flags, &timeout_sys);

-        if (datagrams > 0 &&
-            copy_to_user(timeout, &timeout_sys, sizeof(timeout_sys)))
+        if (copy_to_user(timeout, &timeout_sys, sizeof(timeout_sys)))
                 datagrams = -EFAULT;

         return datagrams;

------------------------------------------

This is a quick change just to show where the problem lies. I need to
think about how to report an -EFAULT at this point properly, i.e. look at
__sys_recvmmsg for something related (returning the number of
datagrams successfully copied to userspace while storing the error for
subsequent reporting):

        if (err == 0)
                return datagrams;

        if (datagrams != 0) {
                /*
                 * We may return less entries than requested (vlen) if
                 * the sock is non block and there aren't enough
                 * datagrams...
                 */
                if (err != -EAGAIN) {
                        /*
                         * ... or if recvmsg returns an error after we
                         * received some datagrams, where we record the
                         * error to return on the next call or if the
                         * app asks about it using getsockopt(SO_ERROR).
                         */
                        sock->sk->sk_err = -err;
                }

                return datagrams;
        }

I.e. userspace would have to use getsockopt(SO_ERROR)... need to think
more about it, sidetracked now, will be back to this.
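
For reference, a minimal sketch of what that userspace query could look like. The function name pending_socket_error is hypothetical; the getsockopt(SOL_SOCKET, SO_ERROR) call itself is the standard interface, and reading it also clears the pending error.

```c
#include <sys/socket.h>

/*
 * Fetch and clear the pending error on a socket.
 * Returns the pending errno value (0 if none), or -1 if the
 * getsockopt() call itself failed (errno is then set).
 */
int pending_socket_error(int fd)
{
        int err = 0;
        socklen_t len = sizeof(err);

        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == -1)
                return -1;
        return err;
}
```

An application that gets fewer datagrams than it asked for could call this right after recvmmsg() to see whether an error was recorded on the socket.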

Anyway, attached goes the current combined patch.

- Arnaldo

> >> thread likely got lost at this point, because they probably didn't
> >> save the first patch. All of this to say: it makes life much easier
> >> if you provide a complete new self-contained patch on each iteration.
> >
> > If you prefer it that way, find one attached, that I was about to send
> > (but you can wait till I use your program to test it ;-) )
> >
> >> > It starts adding explicit parentheses on a ternary, as David requested,
> >> > and then should return the remaining timeouts in cases like signals,
> >> > etc.
> >> >
> >> > Please let me know if this is enough.
> >>
> >> Nope, it doesn't fix the problem. (I applied both patches against 3.15-rc7)
> >
> > What was the problem experienced?
>
> The problem is that after EINTR, the timeout is not updated with the
> remaining time until expiry. (This was true with just patch 1 applied,
> and is also true with both patch 1 and patch 2 applied.)
>
> Cheers,
>
> Michael
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

recvmmsg-timeout-v3.patch

Michael Kerrisk (man-pages)

May 28, 2014, 1:00:18 AM

to Arnaldo Carvalho de Melo, mtk.ma...@gmail.com, lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

On 05/27/2014 10:30 PM, Arnaldo Carvalho de Melo wrote:
> Em Tue, May 27, 2014 at 09:28:37PM +0200, Michael Kerrisk (man-pages) escreveu:
>> On Tue, May 27, 2014 at 9:21 PM, Arnaldo Carvalho de Melo
>> <ac...@ghostprotocols.net> wrote:
>>> Em Tue, May 27, 2014 at 06:35:17PM +0200, Michael Kerrisk (man-pages) escreveu:
>>>> On 05/26/2014 11:17 PM, Arnaldo Carvalho de Melo wrote:
>>>>> Can you try the attached patch on top of the first one?
>>>
>>>> Patches on patches is a way to make your testers work unnecessarily
>>>> harder. Also, it means that anyone else who was interested in this
>>>
>>> It was meant to highlight the changes with regard to the previous patch,
>>> i.e. to make things easier for reviewing.
>>
>> (I don't think that works...)
>
> Lets try both then, attached goes the updated patch, and this is the
> diff to the last combined one:

What tree does this apply to? I tried applying to 3.15-rc7, but a piece
was rejected, and the fix was not obvious.

Cheers,

Michael

drivers/net/tun.c.rej

--- drivers/net/tun.c
+++ drivers/net/tun.c
@@ -1343,7 +1343,7 @@

         /* Read frames from queue */
         skb = __skb_recv_datagram(tfile->socket.sk, noblock ? MSG_DONTWAIT : 0,
-                                  &peeked, &off, &err);
+                                  &peeked, &off, &err, timeop);
         if (skb) {
                 ret = tun_put_user(tun, tfile, skb, iv, len);
                 kfree_skb(skb);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Michael Kerrisk (man-pages)

May 28, 2014, 8:20:32 AM

to Arnaldo Carvalho de Melo, mtk.ma...@gmail.com, lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

On 05/27/2014 10:30 PM, Arnaldo Carvalho de Melo wrote:
> Em Tue, May 27, 2014 at 09:28:37PM +0200, Michael Kerrisk (man-pages) escreveu:
>> On Tue, May 27, 2014 at 9:21 PM, Arnaldo Carvalho de Melo
>> <ac...@ghostprotocols.net> wrote:
>>> Em Tue, May 27, 2014 at 06:35:17PM +0200, Michael Kerrisk (man-pages) escreveu:
>>>> On 05/26/2014 11:17 PM, Arnaldo Carvalho de Melo wrote:
>>>>> Can you try the attached patch on top of the first one?
>>>
>>>> Patches on patches is a way to make your testers work unnecessarily
>>>> harder. Also, it means that anyone else who was interested in this
>>>
>>> It was meant to highlight the changes with regard to the previous patch,
>>> i.e. to make things easier for reviewing.
>>
>> (I don't think that works...)
>
> Lets try both then,

That's better!

So, I applied against net-next as you suggested offlist.
Builds and generally tests fine. Some observations:

* In the case that the call is interrupted by a signal handler and no
datagrams have been received, the call fails with EINTR, as expected.

* The call always updates 'timeout', both in the success case and in the
EINTR case. (That seems fine.)

But, another question...

In the case that the call is interrupted by a signal handler and some
datagrams have already been received, then the call succeeds, and
returns the number of datagrams received, and 'timeout' is updated with
the remaining time. Maybe that's the right behavior, but I just want to
check. There is at least one other possibility:

* Fetch no datagrams (i.e., the datagrams are left to receive in a
future call), and the call fails with EINTR, and 'timeout' is updated.

Maybe that possibility is hard to implement (not sure). But my main point
is to make the current behavior clear, note the alternative, and ask:
is the current behavior the best choice? (I'm not saying it's not, but I
do want the choice to be a conscious one.)

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Arnaldo Carvalho de Melo

May 28, 2014, 11:07:52 AM

to Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Em Wed, May 28, 2014 at 02:20:10PM +0200, Michael Kerrisk (man-pages) escreveu:
> On 05/27/2014 10:30 PM, Arnaldo Carvalho de Melo wrote:
> > attached goes the updated patch, and this is the
> > diff to the last combined one:
> >
> > diff --git a/net/socket.c b/net/socket.c
> > index 310a50971769..379be43879db 100644
> > --- a/net/socket.c
> > +++ b/net/socket.c
> > @@ -2478,8 +2478,7 @@ SYSCALL_DEFINE5(recvmmsg, int, fd, struct mmsghdr __user *, mmsg,
> >
> >         datagrams = __sys_recvmmsg(fd, mmsg, vlen, flags, &timeout_sys);
> >
> > -        if (datagrams > 0 &&
> > -            copy_to_user(timeout, &timeout_sys, sizeof(timeout_sys)))
> > +        if (copy_to_user(timeout, &timeout_sys, sizeof(timeout_sys)))
> >                 datagrams = -EFAULT;
> >
> >         return datagrams;
> >
> > ------------------------------------------
> >
> > This is a quick thing just to show where the problem lies, need to think
> > how to report an -EFAULT at this point properly, i.e. look at

Ok, so I can live with the way things were before this fix, i.e. if the
user specifies a timeout and copying the remaining time to userspace
fails (the copy_to_user call above), then we return -EFAULT.

I.e. there would be no change in behaviour, but then perhaps we should
go with the interface that is already in place for when we have received
some datagrams and then some error happens; see the comment in the
existing code, below:

> > __sys_recvmmsg for something related (returning the number of
> > successfully copied datagrams to userspace while storing the error for
> > subsequent reporting):
> >
> >         if (err == 0)
> >                 return datagrams;
> >
> >         if (datagrams != 0) {
> >                 /*
> >                  * We may return less entries than requested (vlen) if
> >                  * the sock is non block and there aren't enough
> >                  * datagrams...
> >                  */
> >                 if (err != -EAGAIN) {
> >                         /*
> >                          * ... or if recvmsg returns an error after we
> >                          * received some datagrams, where we record the
> >                          * error to return on the next call or if the
> >                          * app asks about it using getsockopt(SO_ERROR).
> >                          */
> >                         sock->sk->sk_err = -err;
> >                 }
> >
> >                 return datagrams;
> >         }
> >
> > I.e. userspace would have to use getsockopt(SO_ERROR)... need to think
> > more about it, sidetracked now, will be back to this.
> >
> > Anyway, attached goes the current combined patch.

> So, I applied against net-next as you suggested offlist.
> Builds and generally tests fine. Some observations:

> * In the case that the call is interrupted by a signal handler and no
> datagrams have been received, the call fails with EINTR, as expected.

Ok

> * The call always updates 'timeout', both in the success case and in the
> EINTR case. (That seems fine.)

Agreed that it is how it should behave.

> But, another question...
>
> In the case that the call is interrupted by a signal handler and some
> datagrams have already been received, then the call succeeds, and
> returns the number of datagrams received, and 'timeout' is updated with
> the remaining time. Maybe that's the right behavior, but I just want to

Note that what the comment in the existing code says should apply here,
namely that the next recv (m or mmsg) syscall on this socket will return
what is in sock->sk->sk_err, that is the signal:

sys_recvmmsg()
    sock->ops->recvmsg() (e.g. inet_recvmsg)
        sk->prot->recvmsg() (e.g., udp_recvmsg)
            skb_recv_datagram()
                wait_for_more_packets()
                    sock_intr_errno()
                    *err = -EINTR
    sock->sk->sk_err = err

Next recv will end up calling skb_recv_datagram and that does:

struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
                                    int *peeked, int *off, int *err, long *timeop)
{
        struct sk_buff *skb, *last;
        long timeo;
        /*
         * Caller is allowed not to check sk->sk_err before
         * skb_recv_datagram()
         */
        int error = sock_error(sk);

        if (error)
                goto no_packet;
<SNIP>
out:
        if (timeop)
                *timeop = timeo;
        return NULL;

no_packet:
        *err = error;
        goto out;
}

So, yes, the user _can_ process the packets already copied to userspace,
i.e. no packet loss, and then, on the next call, will receive the signal
notification.

So, the user can just try the next call and see the signal, and it is
also possible to notice that the timeout didn't expire and less than
vlen packets were received, so something went wrong and calling
getsockopt(SO_ERROR) will clarify things.

This is not some new error reporting facility, it predates recvmmsg,
that merely uses it for consistency.
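
The saved-error mechanism described above can be sketched as a small userspace toy model. Everything here (struct toy_sock, toy_sock_error, toy_recv) is hypothetical and only mirrors the shape of the quoted kernel code: a pending error is reported once, ahead of any queued data, and then cleared.

```c
#include <errno.h>

/* Toy stand-in for the one field of struct sock that matters here. */
struct toy_sock {
        int sk_err;     /* pending error, stored as a positive errno */
};

/*
 * Modeled on the kernel's sock_error(): report the pending error as a
 * negative errno and clear it, so it is delivered exactly once.
 */
int toy_sock_error(struct toy_sock *sk)
{
        int err = sk->sk_err;

        sk->sk_err = 0;
        return -err;
}

/*
 * Toy receive: a pending error wins over queued data, mirroring the
 * "goto no_packet" path quoted above. 'queued' is the number of
 * datagrams waiting; returns 1 if one was taken, or a negative errno.
 */
int toy_recv(struct toy_sock *sk, int *queued)
{
        int error = toy_sock_error(sk);

        if (error)
                return error;
        if (*queued == 0)
                return -EAGAIN;
        (*queued)--;
        return 1;
}
```

In this model a stored EINTR makes the next toy_recv() fail without consuming a datagram, and the call after that proceeds normally.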

How to properly report the -EFAULT when copying the remaining timeout to
userspace is the special case here, with my patches it will:

copy n (less than vlen) packets to userspace successfully
return -EFAULT, not n, just as before the patches being cooked now.

> check. There is at least one other possibility:
>
> * Fetch no datagrams (i.e., the datagrams are left to receive in a
> future call), and the call fails with EINTR, and 'timeout' is updated.

Hmm, then we would have to go back to the protocol layer and re-add the
packets to the queues, etc. Not feasible, I'd say: too much state has
been lost already. We would need some sort of commit interface
(perhaps using peek, but that gets crazy quickly with multithreaded
apps). I guess it's a super intrusive path to follow; not worth it, I
think.
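
As a rough illustration of the "peek, then commit" idea from userspace, here is a minimal sketch using the standard MSG_PEEK flag. The helper names peek_datagram and commit_datagram are hypothetical, and this is only the single-threaded case; it does nothing about the multithreaded problem mentioned above.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Look at the next datagram without dequeuing it. A later recv()
 * without MSG_PEEK sees the same datagram and acts as the "commit".
 * Returns the number of bytes copied, or -1 with errno set.
 */
ssize_t peek_datagram(int fd, void *buf, size_t len)
{
        return recv(fd, buf, len, MSG_PEEK | MSG_DONTWAIT);
}

/* Dequeue the next datagram for real. */
ssize_t commit_datagram(int fd, void *buf, size_t len)
{
        return recv(fd, buf, len, MSG_DONTWAIT);
}
```

With two threads on the same socket, both can peek the same datagram before either commits, which is where the scheme stops being simple.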

> Maybe that possibility is hard to implement (not sure). But my main point
> is to make the current behavior clear, note the alternative, and ask:
> is the current behavior the best choice. (I'm not saying it's not, but I
> do want the choice to be a conscious one.)

Well, my main pet peeve here is how to report that we managed to copy
the datagrams but failed to copy the remaining timeout and then don't
report how many datagrams were successfully copied.

I'm inclined to say that failing to copy the timeout is something so
unlikely, even more since we managed to copy it from userspace to kernel
space at that point in time, that we should keep the current behaviour
and report that something terribly wrong happened, i.e. -EFAULT when
copying the timeout.

Thanks a lot for testing all this; it was a pity you were not around when
we first designed and implemented this syscall.

And also it would be really nice if the people in the CC list commented
on this last round of discussion about fixing the timeout behavior.

- Arnaldo

David Laight

May 28, 2014, 11:18:48 AM

to Arnaldo Carvalho de Melo, Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondrej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

From: Arnaldo Carvalho de Melo

..

> > But, another question...
> >
> > In the case that the call is interrupted by a signal handler and some
> > datagrams have already been received, then the call succeeds, and
> > returns the number of datagrams received, and 'timeout' is updated with
> > the remaining time. Maybe that's the right behavior, but I just want to
>
> Note that what the comment in the existing code says should apply here,
> namely that the next recv (m or mmsg) syscall on this socket will return
> what is in sock->sk->sk_err, that is the signal:
>

..

>
> So, yes, the user _can_ process the packets already copied to userspace,
> i.e. no packet loss, and then, on the next call, will receive the signal
> notification.

The application shouldn't need to see an EINTR response, any signal handler
should be run when the system call returns to user (regardless of the
system call result code).
If that doesn't happen Linux is badly broken!
From an application point of view this is exactly the same as the signal
occurring just before/after the kernel entry/exit for the system call.

The call should just return early with success status.
No need to preserve the EINTR response for later.

The same might be appropriate for other errors - maybe including EFAULT
copying non-initial messages to userspace.
Put the message being processed back on the socket queue and return
success with the (non-zero) partial message count.

David

Arnaldo Carvalho de Melo

May 28, 2014, 3:50:22 PM

to David Laight, Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondrej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Em Wed, May 28, 2014 at 03:17:40PM +0000, David Laight escreveu:
> From: Arnaldo Carvalho de Melo

> ...

> > > But, another question...
> > >
> > > In the case that the call is interrupted by a signal handler and some
> > > datagrams have already been received, then the call succeeds, and
> > > returns the number of datagrams received, and 'timeout' is updated with
> > > the remaining time. Maybe that's the right behavior, but I just want to

> > Note that what the comment in the existing code says should apply here,
> > namely that the next recv (m or mmsg) syscall on this socket will return
> > what is in sock->sk->sk_err, that is the signal:

> ...

> > So, yes, the user _can_ process the packets already copied to userspace,
> > i.e. no packet loss, and then, on the next call, will receive the signal
> > notification.

> The application shouldn't need to see an EINTR response, any signal handler
> should be run when the system call returns to user (regardless of the
> system call result code).
> If that doesn't happen Linux is badly broken!
> From an application point of view this is exactly the same as the signal
> occurring just before/after the kernel entry/exit for the system call.
>
> The call should just return early with success status.
> No need to preserve the EINTR response for later.
>
> The same might be appropriate for other errors - maybe including EFAULT
> copying non-initial messages to userspace.
> Put the message being processed back on the socket queue and return
> success with the (non-zero) partial message count.

We don't need to put anything back, if we get an EFAULT for a datagram,
then we stop processing that packet, _dropping_ it (and that is just
like recvmsg works, look at __skb_recv_datagram, the skb_unlink there,
and udp_recvmsg, what happens if skb_copy_and_csum_datagram_iovec fails)
and stop the batch, and if no datagrams were received, return the error
straight away.

But if some datagrams were successfully received, and at that point
_already_ removed from queues and sent successfully to userspace,
recvmmsg will return the number of successfully copied datagrams and
store the error so that it can be returned on the next syscall.

Please refer to the original discussion on how to report how many
successfully copied datagrams and also report that it stopped before the
timeout and the number of requested datagrams in a batch:

http://lkml.kernel.org/r/200905221022.48790....@nokia.com

What is being discussed here is how to return the EFAULT that may happen
_after_ datagram processing, be it interrupted by an EFAULT, signal, or
plain returning all that was requested, with no errors.

This EFAULT _after_ datagram processing may happen when updating the
remaining timeout, because then how can userspace both receive the
number of successfully copied datagrams (in any of the cases mentioned
in the previous paragraph) and know that that timeout can't be used
because there was a problem while trying to copy it to userspace
(EFAULT)?

- Arnaldo

Chris Friesen

May 28, 2014, 5:35:02 PM

to Arnaldo Carvalho de Melo, David Laight, Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondrej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore

On 05/28/2014 01:50 PM, 'Arnaldo Carvalho de Melo' wrote:

> What is being discussed here is how to return the EFAULT that may happen
> _after_ datagram processing, be it interrupted by an EFAULT, signal, or
> plain returning all that was requested, with no errors.
>
> This EFAULT _after_ datagram processing may happen when updating the
> remaining timeout, because then how can userspace both receive the
> number of successfully copied datagrams (in any of the cases mentioned
> in the previous paragraph) and know that that timeout can't be used
> because there was a problem while trying to copy it to userspace
> (EFAULT)?

How does select() handle this problem? It updates the timeout and also
modifies other data.

Could we just check whether the timeout pointer is valid before doing
anything else? Of course we could still fault the page out while
waiting for messages and then fail to fault it back in later, but that
seems like a not-very-likely scenario.

Chris

Arnaldo Carvalho de Melo

May 28, 2014, 5:49:43 PM

to Chris Friesen, David Laight, Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondrej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore

Em Wed, May 28, 2014 at 03:33:51PM -0600, Chris Friesen escreveu:
> On 05/28/2014 01:50 PM, 'Arnaldo Carvalho de Melo' wrote:

> >What is being discussed here is how to return the EFAULT that may happen
> >_after_ datagram processing, be it interrupted by an EFAULT, signal, or
> >plain returning all that was requested, with no errors.

> >This EFAULT _after_ datagram processing may happen when updating the
> >remaining timeout, because then how can userspace both receive the
> >number of successfully copied datagrams (in any of the cases mentioned
> >in the previous paragraph) and know that that timeout can't be used
> >because there was a problem while trying to copy it to userspace
> >(EFAULT)?

> How does select() handle this problem? It updates the timeout and also
> modifies other data.

> Could we just check whether the timeout pointer is valid before doing
> anything else? Of course we could still fault the page out while waiting
> for messages and then fail to fault it back in later, but that seems like a
> not-very-likely scenario.

I'll check how select behaves, and yes, I think it is not-very-likely
and what we're doing now is reasonable for datagram protocols, i.e. to
return -EFAULT when updating the timeout fails, not reporting whether
packets were successfully received. I.e. they end up being "dropped", as
userspace can't easily figure out whether any were received, short of
painting the buffers with some pattern and then checking which ones no
longer hold that pattern.

- Arnaldo

David Laight

May 29, 2014, 6:54:33 AM

to Arnaldo Carvalho de Melo, Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondrej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

From: 'Arnaldo Carvalho de

That just doesn't make any sense.
Saving an errno code would only make any sense if the error were a
property of the socket - but EFAULT is a property of the system call,
and EINTR a property of the process (it exists so that the process
can return to userspace to execute a signal handler - relying on
SIGALRM to timeout blocking system calls is a recipe for disaster).

The next system call could be from an entirely different process,
neither EFAULT nor EINTR would mean anything to it at all.

ISTR that returning EFAULT generates a signal that will typically
terminate the process.
You definitely don't want to send one to a different process.

> Please refer to the original discussion on how to report how many
> successfully copied datagrams and also report that it stopped before the
> timeout and the number of requested datagrams in a batch:
>
> http://lkml.kernel.org/r/200905221022.48790....@nokia.com

I do remember the original problem.
I don't recall error reporting being referenced.

> What is being discussed here is how to return the EFAULT that may happen
> _after_ datagram processing, be it interrupted by an EFAULT, signal, or
> plain returning all that was requested, with no errors.

I remember some discussions from an XNET standards meeting (I've forgotten
exactly which errors on which calls were being discussed).
My recollection is that you return success with a partial transfer
count for ANY error that happens after some data has been transferred.
The actual error will be returned when it happens again on the next
system call - Note the AGAIN, not a saved error.

Things like blocking send/write being interrupted spring to mind.
Possibly even copyin/out failures part way through a read/write call.

> This EFAULT _after_ datagram processing may happen when updating the
> remaining timeout, because then how can userspace both receive the
> number of successfully copied datagrams (in any of the cases mentioned
> in the previous paragraph) and know that that timeout can't be used
> because there was a problem while trying to copy it to userspace
> (EFAULT)?

Failure to write the control structure back to userspace probably
deserves an EFAULT return - the application is buggy.
IIRC normal recvmsg() copies out the control structure at the end
of processing - that can fail.
I wouldn't worry about datagram discards on any of those late
EFAULT conditions.

David

Arnaldo Carvalho de Melo

May 29, 2014, 9:56:01 AM

to David Laight, Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondrej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Em Thu, May 29, 2014 at 10:53:22AM +0000, David Laight escreveu:
> From: 'Arnaldo Carvalho de

> ...

Yeah for things like EFAULT, storing it in a per socket area for later
reporting is a bug, a separate bug.

> Saving an errno code would only make any sense if the error were a
> property of the socket - but EFAULT is a property of the system call,

Agreed, so for the errors that are socket related, the mechanism should
work. For things that are thread specific it should not; there we should
either signal the error straight away, regardless of any packets
successfully received so far in the batch by the current recvmmsg
syscall, or mimic what would happen if the user issued multiple recvmsg
syscalls instead, i.e. in the next call _for this thread_, the EFAULT
will take place.

> and EINTR a property of the process (it exists so that the process
> can return to userspace to execute a signal handler - relying on
> SIGALRM to timeout blocking system calls is a recipe for disaster).
>
> The next system call could be from an entirely different process,
> neither EFAULT nor EINTR would mean anything to it at all.

Right, storing thread specific errors on the socket is a bug and has to
be fixed. I.e. _if_ we keep the strategy of saving the error for the next
syscall, then that error has, for the per-thread cases, to be stored in a
per-thread error field for socket operations.

> ISTR that returning EFAULT generates a signal that will typically
> terminate the process.
> You definitely don't want to send one to a different process.

Right.

> > Please refer to the original discussion on how to report how many
> > successfully copied datagrams and also report that it stopped before the
> > timeout and the number of requested datagrams in a batch:

> > http://lkml.kernel.org/r/200905221022.48790....@nokia.com

> I do remember the original problem.
> I don't recall error reporting being referenced.

> > What is being discussed here is how to return the EFAULT that may happen
> > _after_ datagram processing, be it interrupted by an EFAULT, signal, or
> > plain returning all that was requested, with no errors.

> I remember some discussions from an XNET standards meeting (I've forgotten
> exactly which errors on which calls were being discussed).
> My recollection is that you return success with a partial transfer
> count for ANY error that happens after some data has been transferred.
> The actual error will be returned when it happens again on the next
> system call - Note the AGAIN, not a saved error.

A saved error, for the right entity, in the recvmmsg case, that
basically is batching multiple recvmsg syscalls, doesn't sound like a
problem, i.e. the idea is to, as much as possible, mimic what multiple
recvmsg calls would do, but reduce its in/out kernel (and inside kernel
subsystems) overhead.

Perhaps we can have something in between, i.e. for things like EFAULT,
we should report straight away, effectively dropping whatever datagrams
successfully received in the current batch, do you agree?

For transient errors the existing mechanism, fixed so that only per
socket errors are saved for later, as today, could be kept?

> Things like blocking send/write being interrupted spring to mind.
> Possibly even copyin/out failures part way through a read/write call.
>
> > This EFAULT _after_ datagram processing may happen when updating the
> > remaining timeout, because then how can userspace both receive the
> > number of successfully copied datagrams (in any of the cases mentioned
> > in the previous paragraph) and know that that timeout can't be used
> > because there was a problem while trying to copy it to userspace
> > (EFAULT)?
>
> Failure to write the control structure back to userspace probably
> deserves an EFAULT return - the application is buggy.
> IIRC normal recvmsg() copies out the control structure at the end
> of processing - that can fail.
> I wouldn't worry about datagram discards on any of those late
> EFAULT conditions.

On this part we all seem to be in agreement, so I'll just leave it as is,
i.e. it doesn't matter that the actual packet receiving part was
(partially) successful, if the copy_to_user(remaining timeout) fails,
EFAULT should be returned.
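
Stated as code, the rule is tiny. This is a toy userspace model, not the kernel source: finish_recvmmsg is a hypothetical name, and the copy_failed flag stands in for copy_to_user() returning nonzero when the remaining timeout is written back.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Toy model of the tail of the recvmmsg() syscall wrapper.
 * 'datagrams' is what the receive loop returned (a count or a
 * negative errno); 'copy_failed' stands in for copy_to_user()
 * failing while writing back the remaining timeout.
 */
int finish_recvmmsg(int datagrams, bool copy_failed)
{
        if (copy_failed)
                return -EFAULT; /* hides any successful count */
        return datagrams;
}
```

The consequence is the one discussed above: on a late EFAULT the caller cannot tell how many datagrams were already copied into its buffers.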

- Arnaldo

David Laight

May 29, 2014, 10:07:22 AM

to Arnaldo Carvalho de Melo, Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondrej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

From: 'Arnaldo Carvalho de Melo'
..

> > I remember some discussions from an XNET standards meeting (I've forgotten
> > exactly which errors on which calls were being discussed).
> > My recollection is that you return success with a partial transfer
> > count for ANY error that happens after some data has been transferred.
> > The actual error will be returned when it happens again on the next
> > system call - Note the AGAIN, not a saved error.
>
> A saved error, for the right entity, in the recvmmsg case, that
> basically is batching multiple recvmsg syscalls, doesn't sound like a
> problem, i.e. the idea is to, as much as possible, mimic what multiple
> recvmsg calls would do, but reduce its in/out kernel (and inside kernel
> subsystems) overhead.
>
> Perhaps we can have something in between, i.e. for things like EFAULT,
> we should report straight away, effectively dropping whatever datagrams
> successfully received in the current batch, do you agree?

Not unreasonable - EFAULT shouldn't happen unless the application
is buggy.

> For transient errors the existing mechanism, fixed so that only per
> socket errors are saved for later, as today, could be kept?

I don't think it is ever necessary to save an errno value for the
next system call at all.
Just process the next system call and see what happens.

If the call returns with less than the maximum number of datagrams
and with a non-zero timeout left - then the application can infer
that it was terminated by an abnormal event of some kind.
This might be a signal.
I'm not sure if an icmp error on a connected datagram socket could
generate a 'disconnect'. It might happen if the interface is being
used for something like SCTP.
In either case the next call will detect the error.
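
A minimal sketch of that inference on the application side, assuming a blocking socket and assuming the call has updated the timeout with the remaining time. The function name ended_abnormally is hypothetical.

```c
#include <stdbool.h>
#include <time.h>

/*
 * Heuristic from the discussion: a short batch with time still left
 * on the clock means something other than the timeout or a full
 * batch ended the call (for example a signal).
 * Assumes 'remaining' was updated by the call, and a blocking socket.
 */
bool ended_abnormally(int received, unsigned int vlen,
                      const struct timespec *remaining)
{
        bool short_batch = received >= 0 && (unsigned int)received < vlen;
        bool time_left = remaining->tv_sec > 0 || remaining->tv_nsec > 0;

        return short_batch && time_left;
}
```

On a non-blocking socket a short batch is normal, so the check would not apply there as written.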

David

Michael Kerrisk (man-pages)

May 29, 2014, 10:07:31 AM5/29/14

to David Laight, Arnaldo Carvalho de Melo, mtk.ma...@gmail.com, lkml, linu...@vger.kernel.org, netdev, Ondrej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

On 05/29/2014 12:53 PM, David Laight wrote:
> From: 'Arnaldo Carvalho de Melo'
> ...

Agreed.

> Saving an errno code would only make any sense if the error were a
> property of the socket -

Back in http://marc.info/?l=linux-netdev&m=124298156121906&w=2
(the follow-on from the discussion that Arnaldo mentions below),
it was noted:

: Normally you'd expect the call to return what it has read without an
: error, and then the socket error would be picked up on the next call.

and the key point in that sentence was "*socket* error."

> but EFAULT is a property of the system call,
> and EINTR a property of the process (it exists so that the process
> can return to userspace to execute a signal handler - relying on
> SIGALRM to timeout blocking system calls is a recipe for disaster).

Exactly. Interruption by a signal should just result in an early
success return, unless no datagrams have been received so far, in
which case it should produce an EINTR failure. No error should be
saved for a future call.

> The next system call could be from an entirely different process,
> neither EFAULT nor EINTR would mean anything to it at all.
>
> ISTR that returning EFAULT generates a signal that will typically
> terminate the process.

Not generally, I think. (I think you're thinking of SIGSEGV when
a process touches a nonexistent address in user mode.)

> You definitely don't want to send one to a different process.

But it's true that the EFAULT or EINTR shouldn't be returned
to another process.

>> Please refer to the original discussion on how to report how many
>> successfully copied datagrams and also report that it stopped before the
>> timeout and the number of requested datagrams in a batch:
>>
>> http://lkml.kernel.org/r/200905221022.48790....@nokia.com
>
> I do remember the original problem.
> I don't recall error reporting being referenced.

(See above.)

>> What is being discussed here is how to return the EFAULT that may happen
>> _after_ datagram processing, be it interrupted by an EFAULT, signal, or
>> plain returning all that was requested, with no errors.
>
> I remember some discussions from an XNET standards meeting (I've forgotten
> exactly which errors on which calls were being discussed).
> My recollection is that you return success with a partial transfer
> count for ANY error that happens after some data has been transferred.
> The actual error will be returned when it happens again on the next
> system call - Note the AGAIN, not a saved error.
>
> Things like blocking send/write being interrupted spring to mind.
> Possibly even copyin/out failures part way through a read/write call.
>
>> This EFAULT _after_ datagram processing may happen when updating the
>> remaining timeout, because then how can userspace both receive the
>> number of successfully copied datagrams (in any of the cases mentioned
>> in the previous paragraph) and know that that timeout can't be used
>> because there was a problem while trying to copy it to userspace
>> (EFAULT)?
>
> Failure to write the control structure back to userspace probably
> deserves an EFAULT return - the application is buggy.
> IIRC normal recvmsg() copies out the control structure at the end
> of processing - that can fail.
> I wouldn't worry about datagram discards on any of those late
> EFAULT conditions.

Agree on all of the above, and that last point certainly seems
like the right approach to me.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Arnaldo Carvalho de Melo

May 29, 2014, 10:17:19 AM5/29/14

to David Laight, Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondrej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Em Thu, May 29, 2014 at 02:06:04PM +0000, David Laight escreveu:
> From: 'Arnaldo Carvalho de Melo'

> ...

> > > I remember some discussions from an XNET standards meeting (I've forgotten
> > > exactly which errors on which calls were being discussed).
> > > My recollection is that you return success with a partial transfer
> > > count for ANY error that happens after some data has been transferred.
> > > The actual error will be returned when it happens again on the next
> > > system call - Note the AGAIN, not a saved error.

> > A saved error, for the right entity, in the recvmmsg case, that
> > basically is batching multiple recvmsg syscalls, doesn't sound like a
> > problem, i.e. the idea is to, as much as possible, mimic what multiple
> > recvmsg calls would do, but reduce its in/out kernel (and inside kernel
> > subsystems) overhead.

> > Perhaps we can have something in between, i.e. for things like EFAULT,
> > we should report straight away, effectively dropping whatever datagrams
> > successfully received in the current batch, do you agree?

> Not unreasonable - EFAULT shouldn't happen unless the application
> is buggy.

Ok.

> > For transient errors the existing mechanism, fixed so that only per
> > socket errors are saved for later, as today, could be kept?

> I don't think it is ever necessary to save an errno value for the
> next system call at all.
> Just process the next system call and see what happens.

> If the call returns with less than the maximum number of datagrams
> and with a non-zero timeout left - then the application can infer
> that it was terminated by an abnormal event of some kind.
> This might be a signal.

Then it could use getsockopt(SO_ERROR) perhaps? I.e. we don't return the
error on the next call, but we provide a way for the app to retrieve the
reason for the smaller than expected batch?
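
On the application side, the retrieval suggested here could look like the sketch below. getsockopt() with SO_ERROR is the standard call, and reading it also clears the pending error. Whether recvmmsg() would leave the reason for a short batch there is exactly what is under discussion, so treat that part as an assumption. pending_socket_error is a hypothetical helper name.

```c
#include <assert.h>
#include <sys/socket.h>

/*
 * Fetch (and thereby clear) the pending error on a socket.
 * Returns the errno-style value (0 if none), or -1 if getsockopt()
 * itself fails, in which case errno describes that failure.
 */
static int pending_socket_error(int fd)
{
	int soerr = 0;
	socklen_t len = sizeof(soerr);

	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &soerr, &len) < 0)
		return -1;
	assert(len == sizeof(soerr));
	return soerr;
}
```

A caller would invoke this only after a batch came back shorter than requested with time still left on the timeout.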

> I'm not sure if an icmp error on a connected datagram socket could
> generate a 'disconnect'. It might happen if the interface is being
> used for something like SCTP.
> In either case the next call will detect the error.

- Arnaldo

David Laight

May 29, 2014, 10:41:22 AM5/29/14

to Arnaldo Carvalho de Melo, Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondrej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

If you really think it is necessary, then you want a field in the
control structure.
But IMHO returning the 'time left' is more than enough.

IIRC the original problem was that the user-specified timeout
was used as an inter-datagram timer instead of an overall timeout.
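
The distinction can be sketched as follows: an overall timeout fixes a single deadline when the call starts and derives every later wait from it, instead of restarting a timer for each datagram. time_left is a hypothetical userspace helper for that calculation, not kernel code. It assumes both arguments come from the same clock (for example CLOCK_MONOTONIC) and are normalized.

```c
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Time left until 'deadline' as seen at 'now'; zero once the deadline has passed. */
static struct timespec time_left(struct timespec deadline, struct timespec now)
{
	struct timespec left = { 0, 0 };

	assert(deadline.tv_nsec >= 0 && deadline.tv_nsec < NSEC_PER_SEC);
	assert(now.tv_nsec >= 0 && now.tv_nsec < NSEC_PER_SEC);

	if (now.tv_sec > deadline.tv_sec ||
	    (now.tv_sec == deadline.tv_sec && now.tv_nsec >= deadline.tv_nsec))
		return left;		/* deadline reached: nothing left */

	left.tv_sec = deadline.tv_sec - now.tv_sec;
	left.tv_nsec = deadline.tv_nsec - now.tv_nsec;
	if (left.tv_nsec < 0) {		/* borrow one second */
		left.tv_sec--;
		left.tv_nsec += NSEC_PER_SEC;
	}
	return left;
}
```

With a 10 second deadline and a datagram arriving 2.5 seconds in, the next wait is bounded by 7.5 seconds rather than by a fresh 10.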

I suspect that most applications won't actually care about the
'time left', nor the actual number of returned datagrams.
They will just process what they are given and then wait for
the next batch.

David

Arnaldo Carvalho de Melo

May 29, 2014, 11:33:54 AM5/29/14

to David Laight, Michael Kerrisk (man-pages), lkml, linu...@vger.kernel.org, netdev, Ondrej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Em Thu, May 29, 2014 at 11:17:05AM -0300, 'Arnaldo Carvalho de Melo' escreveu:
> Em Thu, May 29, 2014 at 02:06:04PM +0000, David Laight escreveu:
> > From: 'Arnaldo Carvalho de Melo'
> > ...
> > > > I remember some discussions from an XNET standards meeting (I've forgotten
> > > > exactly which errors on which calls were being discussed).
> > > > My recollection is that you return success with a partial transfer
> > > > count for ANY error that happens after some data has been transferred.
> > > > The actual error will be returned when it happens again on the next
> > > > system call - Note the AGAIN, not a saved error.

> > > A saved error, for the right entity, in the recvmmsg case, that
> > > basically is batching multiple recvmsg syscalls, doesn't sound like a
> > > problem, i.e. the idea is to, as much as possible, mimic what multiple
> > > recvmsg calls would do, but reduce its in/out kernel (and inside kernel
> > > subsystems) overhead.

> > > Perhaps we can have something in between, i.e. for things like EFAULT,
> > > we should report straight away, effectively dropping whatever datagrams
> > > successfully received in the current batch, do you agree?

> > Not unreasonable - EFAULT shouldn't happen unless the application
> > is buggy.

> Ok.

So the patch below should handle it, and record that the packets were
dropped, not at the transport level, like UDP_MIB_INERRORS, for
instance, would indicate, but at the batching, recvmmsg level, so
perhaps we'll need a MIB variable for that.

Also a counterpart to the trace_kfree_skb(skb, udp_recvmsg) tracepoint
for dropwatch and similar tools to use, Neil?

I'm keeping this separate from the timeout update patch.

- Arnaldo

diff --git a/net/socket.c b/net/socket.c
index abf56b2a14f9..63491f015912 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2415,13 +2415,17 @@ out_put:
 		return datagrams;
 
 	if (datagrams != 0) {
+		if (err == -EFAULT) {
+			atomic_add(datagrams, &sock->sk->sk_drops);
+			return -EFAULT;
+		}
 		/*
 		 * We may return less entries than requested (vlen) if the
 		 * sock is non block and there aren't enough datagrams...
 		 */
 		if (err != -EAGAIN) {
 			/*
-			 * ... or if recvmsg returns an error after we
+			 * ... or if recvmsg returns a socket error after we
 			 * received some datagrams, where we record the
 			 * error to return on the next call or if the
 			 * app asks about it using getsockopt(SO_ERROR).
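
To make the effect of this hunk easier to follow, here is a userspace model of the decision it encodes. This is a sketch, not kernel code: struct model_sock and batch_result are hypothetical names, a plain int stands in for the atomic drop counter, and the `err == 0` test and the final `return err` are assumed from context because the hunk does not show them.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the two socket fields the patch touches. */
struct model_sock {
	int sk_err;	/* pending error, stored as a positive errno */
	int sk_drops;	/* count of dropped datagrams */
};

/* Value the call would return, given the batch count and the last error. */
static int batch_result(struct model_sock *sk, int datagrams, int err)
{
	assert(datagrams >= 0 && err <= 0);

	if (err == 0)
		return datagrams;

	if (datagrams != 0) {
		if (err == -EFAULT) {
			/* New in the patch: discard the batch, report at once. */
			sk->sk_drops += datagrams;
			return -EFAULT;
		}
		/* Short batch: remember the error for later, except EAGAIN. */
		if (err != -EAGAIN)
			sk->sk_err = -err;
		return datagrams;
	}

	return err;	/* nothing received: report the error directly */
}
```

Note that in this model every error other than EFAULT and EAGAIN is still saved, EINTR included, even though the updated comment speaks of socket errors only.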

Michael Kerrisk (man-pages)

Jun 16, 2014, 5:59:20 AM6/16/14

to Arnaldo Carvalho de Melo, David Laight, lkml, linu...@vger.kernel.org, netdev, Ondrej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Hi Arnaldo,

Things have gone quiet ;-). What's the current state of this patch?

Thanks,

Michael

Arnaldo Carvalho de Melo

Jun 24, 2014, 4:26:55 PM6/24/14

to Michael Kerrisk (man-pages), David Laight, lkml, linu...@vger.kernel.org, netdev, Ondrej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

Em Mon, Jun 16, 2014 at 11:58:51AM +0200, Michael Kerrisk (man-pages) escreveu:
> Hi Arnaldo,
>
> Things have gone quiet ;-). What's the current state of this patch?

Yeah, I kept meaning to prod the other people on this thread about what
they thought about my last messages, patches, etc. :-)

Can I have acked-by or even tested-by on those? Is it ok?

- Arnaldo

Michael Kerrisk (man-pages)

Jun 27, 2014, 7:29:42 AM6/27/14

to Arnaldo Carvalho de Melo, mtk.ma...@gmail.com, David Laight, lkml, linu...@vger.kernel.org, netdev, Ondrej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, Rémi Denis-Courmont, Paul Moore, Chris Friesen

On 06/24/2014 10:25 PM, Arnaldo Carvalho de Melo wrote:
> Em Mon, Jun 16, 2014 at 11:58:51AM +0200, Michael Kerrisk (man-pages) escreveu:
>> Hi Arnaldo,
>>
>> Things have gone quiet ;-). What's the current state of this patch?
>
> Yeah, I kept meaning to prod the other people on this thread about what
> they thought about my last messages, patches, etc. :-)
>
> Can I have acked-by or even tested-by on those? Is it ok?

I just need to go back and test one point that sounds like it might still be
broken.

Cheers,

Michael

Michael Kerrisk (man-pages)

Jun 27, 2014, 7:37:43 AM6/27/14

to Arnaldo Carvalho de Melo, mtk.ma...@gmail.com, lkml, linu...@vger.kernel.org, netdev, Ondřej Bílka, Caitlin Bestler, Neil Horman, Elie De Brauwer, David Miller, Steven Whitehouse, David Laight, Paul Moore, Chris Friesen

Hi Arnaldo,

So, returning to your recvmmsg-timeout-v3.patch. I think the behavior as
implemented, and as described above, is okay.

> But, another question...
>
> In the case that the call is interrupted by a signal handler and some
> datagrams have already been received, then the call succeeds, and
> returns the number of datagrams received, and 'timeout' is updated with
> the remaining time. Maybe that's the right behavior, but I just want to
> check. There is at least one other possibility:
>
> * Fetch no datagrams (i.e., the datagrams are left to receive in a
> future call), and the call fails with EINTR, and 'timeout' is updated.
>
> Maybe that possibility is hard to implement (not sure). But my main point
> is to make the current behavior clear, note the alternative, and ask:
> is the current behavior the best choice. (I'm not saying it's not, but I
> do want the choice to be a conscious one.)

So, I think (can't find the mail right now) that you explained elsewhere
that the above would be hard to implement. And in any case, I'm not sure
it's desirable; I only wanted to check that the choice was a deliberate one.

However, there is still a weirdness, which relates to the discussion you
and David Laight had.

Suppose the following scenario.

1. We do a recvmmsg() with 10 second timeout, asking for 5 messages.
2. 3 messages arrive
3. 6 seconds after the call, a signal handler interrupts the call.
4. recvmmsg() returns success, telling us it got 3 messages.

So far, so good. But

5. We make a further recvmmsg() call.
6. That call returns immediately, with an EINTR error.

That really should not be happening. As noted elsewhere in this
thread, EINTR is a property of a specific system call, not of the
thread or the socket. By the time of step 5, the kernel should
already have forgotten about the signal that occurred at step 3.
I don't think I saw any other patch that fixes that behavior.
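
The sequence in steps 1-6 can be reproduced with a small userspace model. This is a sketch under stated assumptions, not kernel code: it assumes a call reports and clears any pending socket error before doing anything else, and that a short batch stores its error on the socket. model_recvmmsg, struct model_sock, and the skip_eintr switch (which models the behavior being asked for) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the socket's pending-error field. */
struct model_sock {
	int sk_err;	/* positive errno, 0 if none */
};

/*
 * received   - datagrams obtained before the call stopped
 * err        - 0, or the negative errno that stopped it
 * skip_eintr - if true, EINTR is never saved on the socket
 */
static int model_recvmmsg(struct model_sock *sk, int received, int err,
			  bool skip_eintr)
{
	assert(received >= 0 && err <= 0);

	/* A pending error is reported first and then cleared. */
	if (sk->sk_err != 0) {
		int pending = -sk->sk_err;

		sk->sk_err = 0;
		return pending;
	}

	if (err == 0)
		return received;
	if (received == 0)
		return err;

	/* Short batch: save the error unless it is EAGAIN (or a skipped EINTR). */
	if (err != -EAGAIN && !(skip_eintr && err == -EINTR))
		sk->sk_err = -err;
	return received;
}
```

With skip_eintr false, a first call interrupted after 3 datagrams returns 3 and the next call returns -EINTR at once, which is the sequence described above. With skip_eintr true, the second call proceeds normally.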

I recall now that this was why I was waiting for you to follow up
in this thread with a new patch.