Possible Poll Corruption


pa...@rcom-software.com

unread,
Apr 28, 2021, 11:13:07 PM4/28/21
to libuv
Hi Folks:

I am experiencing an intermittent problem with uv_poll_init() after the successful establishment and release
of multiple concurrent TCP connections. I am not sure whether this problem is due to a bug, which may
be corrected in another Libuv release, or whether I am doing something wrong when releasing the poll handle.
Do you have any insight into the cause? The details I consider most important follow.

The uv_poll_t callback function reads incoming TCP data as follows.

    if(events & UV_READABLE)
      {
        CONN_DESC *cdesc = (CONN_DESC *)  poll_handle->data;
        n = recv(cdesc->fd, cdesc->rx_buf, RX_BUF_SIZE, MSG_DONTWAIT);
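
For context, a fuller sketch of what such a callback might look like follows; CONN_DESC, RX_BUF_SIZE,
and process_rx_data() stand in for my application-specific types and are illustrative only.

    /* Illustrative sketch only: a complete uv_poll_t callback of this shape. */
    static void poll_callback(uv_poll_t *poll_handle, int status, int events)
    {
        if(status < 0)
          {
            syslog(LOG_ERR, "IO_TASK: Poll error %s", uv_err_name(status));
            return;
          }

        if(events & UV_READABLE)
          {
            CONN_DESC *cdesc = (CONN_DESC *) poll_handle->data;
            ssize_t n = recv(cdesc->fd, cdesc->rx_buf, RX_BUF_SIZE, MSG_DONTWAIT);

            if(n > 0)
                process_rx_data(cdesc, n);       /* hand the data to the protocol layer */
            else if(n == 0)
                ;                                /* peer closed the connection */
            else if(errno != EAGAIN && errno != EWOULDBLOCK)
                syslog(LOG_ERR, "IO_TASK: recv() error %d", errno);
          }
    }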

NOTE: I am using Libuv version 1.41.0 running on Ubuntu Linux version 15.04.

The problem is that uv_poll_init() fails, normally with the EEXIST or EBADF error code, and
my investigation indicates the uv_loop_t Poll_Loop.watchers[N] field is not NULL when it should be,
where N is the TCP socket descriptor. It occurs immediately after the uv_poll_t poll_handle is allocated.
(There is exactly one TCP socket descriptor per poll handle.) The call to uv_poll_init() is as follows
and the socket descriptor is obtained with uv_fileno().

    if( (r = uv_poll_init(&Poll_Loop, poll_handle, cdesc->fd)) != 0)
       {
        fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY -  Polling Initialization Error %d, %s",
             r, uv_err_name(r));
       }
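
(For completeness, cdesc->fd is assumed to be fetched from the accepted uv_tcp_t connection handle roughly
like this:)

    /* Sketch: fetch the OS descriptor from the accepted uv_tcp_t connection handle. */
    uv_os_fd_t fd;

    if( uv_fileno((const uv_handle_t *) conn_handle, &fd) == 0 )
        cdesc->fd = fd;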

It occurs in the IO_Task() thread when multiple TCP socket descriptors are in use. The IO_Task()
releases the poll_handle with the following code sequence when it is notified that polling should stop
via a Libuv async. message from the Protocol_Task() thread.

    if( (r = uv_poll_stop(poll_handle)) )
      {
        fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY - Poll Stop Error %d, %s",
             r, uv_err_name(r) );
      }

    poll_handle->data = (void *) cdesc;
    uv_close((uv_handle_t *) poll_handle, async_close_callback);

The actual release occurs in async_close_callback() as follows, and the Protocol_Task() releases the
TCP socket descriptor with a normal Linux close() after it receives the T_DISMANTLE response message.
    
    VFREE((UCHAR *) handle);
    //
    // Send a notification message to the Protocol_Task.
    //
    msg = MSG_ALLOC(0, FALSE);
    msg->class = C_NOTIFY;
    msg->type = T_DISMANTLE_RSP;
    msg->info = 0;

    SendProtoMsg(cdesc, msg);

I think the underlying cause is that if a new TCP connection uses the same TCP socket descriptor
as one released with uv_poll_stop() and uv_close(), the call to uv_poll_init() occurs before the socket closure
has propagated into the uv_loop_t Poll_Loop.

Best Regards,

Paul Romero

Jameson Nash

unread,
Apr 28, 2021, 11:52:43 PM4/28/21
to libuv
I'm assuming from your description that you've ensured all libuv calls happen from exactly one thread (except uv_async_send). It sounds like you're describing the issue that https://github.com/libuv/libuv/commit/c9406ba0e3d67907c1973a71968b89a6bd83c63c was intended to fix, which was included in v1.41.0.

Note that the Poll_Loop.watchers[N] is not always required to be NULL, so perhaps you've found a case where it is still incorrectly expecting that? You might want to file an issue with code to reproduce the problem, or at least an `rr` trace dump of the code that hits the problem (note that `rr` could potentially upload all of your hard drive, so be careful what you are submitting).
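
As a rough reference, the usual shape of that single-thread discipline looks something like the sketch
below (the work list, its contents, and the loop setup are placeholders, not anything from your code):

    #include <stdlib.h>
    #include <uv.h>

    /* Sketch of the single-thread rule: only the loop thread touches libuv handles;
     * other threads hand it work through a mutex-protected list plus uv_async_send(),
     * the one libuv call that is documented as safe to call from any thread. */
    typedef struct work {
        struct work *next;
        int fd;                                /* whatever the request carries */
    } work_t;

    static uv_async_t wakeup;                  /* lives on the loop thread */
    static uv_mutex_t q_lock;
    static work_t    *q_head;

    static void on_wakeup(uv_async_t *handle)
    {
        work_t *w, *list;
        (void) handle;

        uv_mutex_lock(&q_lock);                /* grab the whole list at once */
        list = q_head;
        q_head = NULL;
        uv_mutex_unlock(&q_lock);

        /* Runs on the loop thread: safe to call uv_poll_init()/uv_poll_stop() here. */
        while ((w = list) != NULL) {
            list = w->next;
            /* ... act on w ... */
            free(w);
        }
    }

    /* Loop thread, once at startup:
     *     uv_mutex_init(&q_lock);
     *     uv_async_init(loop, &wakeup, on_wakeup);
     */

    /* Any other thread: */
    void post_work(int fd)
    {
        work_t *w = malloc(sizeof(*w));
        w->fd = fd;
        uv_mutex_lock(&q_lock);
        w->next = q_head;                      /* LIFO for brevity */
        q_head = w;
        uv_mutex_unlock(&q_lock);
        uv_async_send(&wakeup);                /* wakeups may coalesce; the callback drains the list */
    }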


Paul Romero

unread,
Apr 29, 2021, 12:09:10 PM4/29/21
to li...@googlegroups.com
Hi:

The most common symptom of the problem is that uv_poll_init() fails, which means the problem must
be related to the uv_loop_t Poll_Loop. The relevant code segment is as follows.

   poll_handle = (uv_poll_t *) VALLOC(sizeof(uv_poll_t));


   if( (r = uv_poll_init(&Poll_Loop, poll_handle, cdesc->fd)) != 0)
     {
          fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY -  Polling Initialization Error %d, %s",
                             r, uv_err_name(r));
      }

The following files illustrate the problem and are attached to this message. Please advise me
how to attach files to the list directly.

*  trace_badf.txt: A stack trace when the error code is EBADF.
*  trace_eexist.txt: A stack trace when the error code is EEXIST.
* network_io.c: The IO_Task() thread code. The problem occurs in the poll_proxy() routine.
* main.c:  The main() process/thread code where most initialization occurs.
* transmit.c: The Transmit_Task() thread.
* rx_fsm.c: The heart of the protocol task code.


You are right that only the IO_Task() performs polling related operations.
However, please note that the Transmit_Task() thread uses the underlying TCP socket descriptor
to send data to the network with Linux send(). (See the TX_DATA() routine.) Also,
USE_SSL and USE_SCRAM are not defined.

I will send more information if I am able to reproduce the problem at another point.
Also, I have seen the issue you mentioned before and suspect it may be relevant.
Please let me know if you need anything else.


Best Regards,

Paul R.








-- 


Paul Romero
-----------
RCOM Communications Software
EMAIL: pa...@rcom-software.com
PHONE: (510)482-2769




main.c
network_io.c
rx_fsm.c
trace_badf.txt
trace_eexst.txt
transmit.c

Paul Romero

unread,
Apr 29, 2021, 6:14:07 PM4/29/21
to li...@googlegroups.com
Hi:

I reattached the following files with some confusing debugging output removed; a dump
produced with them follows.

* main.c
* network_io.c
* rx_fsm.c
* async.c

Again this is the relevant code segment.

    case T_POLL_CONFIG_REQ:
        //
        // Main process request to start Libuv/epoll() operation.
        //

        poll_handle = (uv_poll_t *) VALLOC(sizeof(uv_poll_t));

        if( (r = uv_poll_init(&Poll_Loop, poll_handle, cdesc->fd)) != 0)
          {

            fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY -  Polling Initialization Error %d, %s",
                 r, uv_err_name(r));
          }

This is the crash diagnostic.

IO_TASK: POLL_PROXY -  Polling Initialization Error -17, EEXIST
Aborted (core dumped)

And this is the gdb backtrace, which is typical of the problem.

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./pexd'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007ff1da90b267 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
55    ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ff1da90b267 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007ff1da90ceca in __GI_abort () at abort.c:89
#2  0x000000000040d379 in fatal_error (log_prio=2,
    fmt=0x424ce0 "IO_TASK: POLL_PROXY -  Polling Initialization Error %d, %s") at misc.c:135
#3  0x0000000000405b21 in poll_proxy (handle=0x641f80 <IO_Task_Async_Handle>) at network_io.c:287
#4  0x000000000041142b in uv__async_io (loop=0xc26800 <Poll_Loop>, w=<optimized out>, events=<optimized out>)
    at src/unix/async.c:163
#5  0x000000000041d1b5 in uv__io_poll (loop=loop@entry=0xc26800 <Poll_Loop>, timeout=<optimized out>)
    at src/unix/linux-core.c:462
#6  0x0000000000411ba8 in uv_run (loop=0xc26800 <Poll_Loop>, mode=UV_RUN_DEFAULT) at src/unix/core.c:385
#7  0x000000000040580f in IO_Task (arg=0x0) at network_io.c:100
#8  0x00007ff1daca76aa in start_thread (arg=0x7ff1d5a7c700) at pthread_create.c:333
#9  0x00007ff1da9dceed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

At this point cdesc->fd is 22 and Poll_Loop.watchers[22] = 0x17ef458, a valid address ?
async.c
main.c
network_io.c
rx_fsm.c

Paul Romero

unread,
Apr 30, 2021, 2:48:12 PM4/30/21
to li...@googlegroups.com
Hi:

My guess is that there is a subtle, probably timing-related, problem in uv_poll_stop(),
simply because I can't think of any other relevant Libuv routines that are explicitly
called after uv_poll_start() succeeds. What do you think? Of course, there could
be a problem in the way connections are rejected. All that code is in the make_incoming_connection()
callback routine in main.c.

Best Regards,

Paul R.

Paul Romero

unread,
Apr 30, 2021, 4:32:51 PM4/30/21
to li...@googlegroups.com
Hi:

Where can one find documentation on the error codes returned by Libuv routines such as uv_poll_init() ?

Best Regards,

Paul R.

Paul Romero

unread,
Apr 30, 2021, 7:59:30 PM4/30/21
to li...@googlegroups.com

Hi:

I think I've isolated the problem and it appears to be in uv_poll_stop().

In particular, if uv_accept() is called immediately after uv_poll_stop(), and it reallocates the
same socket descriptor just freed by uv_poll_stop(), then uv_poll_init() fails with the error
EEXIST or EBADF. This must have something to do with the uv_loop_t handle passed to uv_poll_init(),
called Poll_Loop in my code.

I had to instrument my code to keep track of Libuv API calls to see this.

The following explanatory items follow.

A - The sequence of calls that produces the problem.
B - Commented syslog() diagnostics.
C - The relevant code segments.
D - The gdb stack backtrace.

Best Regards,

Paul R.


A) The sequence of relevant calls is as follows, and the problem only occurs with this exact
sequence of events.
****************************************************************************************

1) uv_poll_stop(): The uv_poll_t handle is bound to socket descriptor N.
   This should free socket descriptor N.
2) uv_accept(): Bound to socket descriptor N as set by uv_fileno().
   Socket descriptor N should be newly allocated.
3) uv_poll_init(): Returns EBADF with socket descriptor N and Poll_Loop.watchers[N] is NULL.
   Something is wrong !

B) The commented syslog() diagnostic sequence from the instrumented code.
**************************************************************************
* Stop Polling - Connection 1 Fd 21. This frees the FD.

Apr 30 15:46:08 mobius pexd[30963]: POLL_CALL[i 1 fd 21] - IO_TASK - uv_poll_stop() - T_POLL_DISMANTLE_REQ

* Stop Polling - Connection 0 Fd 20. This frees the FD

Apr 30 15:46:08 mobius pexd[30963]: POLL_CALL[i 0 fd 20] - IO_TASK - uv_poll_stop() - T_POLL_DISMANTLE_REQ

* New Connection - Connection 1 Fd 20. This is fine since uv_poll_stop() was called for Fd 20, which
  freed it, and it is newly allocated with uv_accept() and obtained with uv_fileno().

Apr 30 15:46:08 mobius pexd[30963]: POLL_CALL[i 1 fd 20] - IO_TASK - uv_accept() + uv_fileno()

* However we crash because uv_poll_init() returned EBADF !

Apr 30 15:46:08 mobius pexd[30963]: POLL_CALL[i 1 fd 20] - IO_TASK - uv_poll_init() - T_POLL_CONFIG_REQ
Apr 30 15:46:08 mobius pexd[30963]: IO_TASK: POLL_PROXY -  Polling Initialization Error -9, EBADF

C) The edited associated code segments are as follows.
********************************************************
1) Prelude to uv_poll_stop() in poll_proxy() in the IO_Task() thread. (network_io.c)
------------------------------------------------------------------------------------
            case T_POLL_DISMANTLE_REQ:
                ppdata = (uv_handle_t **) &msg->buffer[0];
                poll_handle = (uv_poll_t *) *ppdata;

syslog(LOG_INFO, "POLL_CALL[i %d fd %d] - IO_TASK - uv_poll_stop() - T_POLL_DISMANTLE_REQ\n", cdesc->index, cdesc->fd);


                if( (r = uv_poll_stop(poll_handle)) )
                  {
                    fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY - Poll Stop Error %d, %s",
                         r, uv_err_name(r) );
                  }
                //
                // The callback routine notifies the Protocol_Task() when everything is done.
                //

                poll_handle->data = (void *) cdesc;
                uv_close((uv_handle_t *) poll_handle, async_close_callback);


2) Connection acceptance in make_incoming_connection() in the main() thread. (main.c)
---------------------------------------------------------------------------------------
    conn_handle = (uv_tcp_t *) VALLOC(sizeof(uv_tcp_t));
    uv_tcp_init(&Connect_Loop, conn_handle);

    if((r = uv_accept(listen_handle, (uv_stream_t *) conn_handle)) == 0)
       {
        //
        // A new connection occurred.
        //
            uv_os_fd_t fd;
ENTER_MUTEX(&Protocol_D_Mutex);
        nsock = N_Sockets++;
EXIT_MUTEX(&Protocol_D_Mutex);
        //
        // Fetch the socket descriptor from the connection handle.
        //
        uv_fileno((const uv_handle_t*) conn_handle, &fd);
        //
        // Allocate the connection descriptor.
        //
        cdesc = ALLOC_CONN_DESC(fd);
        if( !cdesc )
          {
            syslog(LOG_DEBUG, "MAIN_TASK: CONNECTION REJECTION - No Connection Descriptors, N =  %d", nsock);

ENTER_MUTEX(&Protocol_D_Mutex);
            N_Sockets--;
            close(fd);
EXIT_MUTEX(&Protocol_D_Mutex);
            uv_close((uv_handle_t *) conn_handle, forced_close_callback);
            return;
          }

syslog(LOG_INFO, "POLL_CALL[i %d fd %d] - IO_TASK - uv_accept() + uv_fileno()\n", cdesc->index, cdesc->fd);
        //
        // Save the connection handle and start polling.
        //
        cdesc->conn_handle = (uv_tcp_t *) conn_handle;

        syslog(LOG_INFO, "MAIN_TASK: NEW CONNECTION ESTABLISHED - CONN %d FD %d", cdesc->index, cdesc->fd);
        //
        // Set up epoll() plumbing by sending a message to IO_Task();
        //
        bzero((void *) &poll_data, sizeof(ASYNC_DATA));
        poll_data.type = T_POLL_CONFIG_REQ;
        poll_data.async_handle = &IO_Task_Async_Handle;
        poll_data.async_q = &IO_Task_Input_Q;
        poll_data.object_handle = NULL;
        //
        // The T_POLL_CONFIG_RSP message will be sent to the Protocol_Task() which
        // is in S_IDLE state.
        //
        send_async_msg(cdesc, &poll_data);
      }

3) Prelude to uv_poll_init() in poll_proxy() in the IO_Task() thread. (network_io.c)
------------------------------------------------------------------------------------

            case T_POLL_CONFIG_REQ:
                //
                // Main process request to start Libuv/epoll() operation.
                //
                poll_handle = (uv_poll_t *) VALLOC(sizeof(uv_poll_t));

syslog(LOG_INFO, "POLL_CALL[i %d fd %d] - IO_TASK - uv_poll_init() - T_POLL_CONFIG_REQ\n", cdesc->index, cdesc->fd);

                if( (r = uv_poll_init(&Poll_Loop, poll_handle, cdesc->fd)) != 0)
                  {

                    fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY -  Polling Initialization Error %d, %s",
                         r, uv_err_name(r));
                  }



D) Stack backtrace.

(gdb) bt
#0  0x00007f62c47b0267 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f62c47b1eca in __GI_abort () at abort.c:89
#2  0x000000000040d411 in fatal_error (log_prio=2,
    fmt=0x424e28 "IO_TASK: POLL_PROXY -  Polling Initialization Error %d, %s") at misc.c:135
#3  0x0000000000405b6d in poll_proxy (handle=0x641f80 <IO_Task_Async_Handle>) at network_io.c:291
#4  0x00000000004114cb in uv__async_io (loop=0xc26800 <Poll_Loop>, w=<optimized out>, events=<optimized out>)
    at src/unix/async.c:163
#5  0x000000000041d255 in uv__io_poll (loop=loop@entry=0xc26800 <Poll_Loop>, timeout=<optimized out>)
    at src/unix/linux-core.c:462
#6  0x0000000000411c48 in uv_run (loop=0xc26800 <Poll_Loop>, mode=UV_RUN_DEFAULT) at src/unix/core.c:385
#7  0x0000000000405835 in IO_Task (arg=0x0) at network_io.c:100
#8  0x00007f62c4b4c6aa in start_thread (arg=0x7f62bbfff700) at pthread_create.c:333
#9  0x00007f62c4881eed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

At this point cdesc->fd is 20 and Poll_Loop.watchers[20] is NULL.

Paul Romero

unread,
May 1, 2021, 3:01:49 AM5/1/21
to li...@googlegroups.com
Hi:

I may have overlooked the following factor.

We know that Libuv is not reentrant and not really multi-task safe. I suspect, but
don't know, that the internal Libuv data structures attached to the uv_loop_t structure,
Poll_Loop in my case, are not protected by internal Libuv Mutexes.

1) uv_accept() and uv_fileno() execute in the main() thread.

2) uv_poll_init(), uv_poll_start() and uv_poll_stop(), execute in the IO_Task() thread.

That means uv_accept() and the Libuv API routines in 2, above, can execute concurrently.
Is it possible that such concurrency could damage the internal Libuv data structures attached to the
uv_loop_t structure? (Again, Poll_Loop in my case.)

Best Regards,

Paul R.

NOTE: I did a few experiments and noticed that calling uv_close() while holding a Mutex often
prevents its callback routine from executing.

Paul Romero

unread,
May 2, 2021, 2:24:59 AM5/2/21
to li...@googlegroups.com
Hi:

A trace giving a more refined picture of the problem and more details is attached
to this note. It dumps what I consider to be the significant fields of the uv_loop_t structure,
the Poll_Loop in my code, and the dynamically allocated uv_poll_t structure surrounding
the essential Libuv API routine calls. I think the format of the dump is fairly intuitive,
but please inform me if you disagree.

My comments about what is happening are identified by the string "COMMENT:" and the
API calls are identified by the string "MPOLL". The trace is redacted to eliminate
irrelevant events.

Please let me know if you see something amiss about the data structure fields at
any point.

My server code is based on a simple event-oriented architecture and uses the following 5 concurrent
tasks.

1) main(): Handles incoming connections and Linux signals. (Libuv based.)
2) IO_Task(): Handles incoming data from the network. (Libuv based.)
3) Protocol_Task(): Does most of the work. (Message based.)
4) DB_Task(): Does database insertions. (Message based.)
5) Timer_Task(): Handles timer management. (Mostly  Libuv based.)

I can send you my server code and a test tool if it would be useful to you.
However, it may take a few days to put it in an easily usable form.

As previously mentioned uv_accept() executes in the main() task and the
rest of the polling related Libuv API routines execute in the IO_Task().

Finally, remember the problem is intermittent. Generally, it takes about
45 minutes to 3 hours of constant concurrent connections for the crash to occur.

Best Regards,

Paul R.
scenario.txt

Jameson Nash

unread,
May 2, 2021, 9:51:50 AM5/2/21
to libuv
You might want to try compiling your application with Clang TSan. Although, since libuv is a single-threaded library (per uv_loop_t), any access from multiple threads is generally just forbidden.

Paul Romero

unread,
May 2, 2021, 12:02:18 PM5/2/21
to li...@googlegroups.com
Hi:

I forgot one task, the Transmit_Task(). It is message based.

PR

Paul Romero

unread,
May 4, 2021, 9:21:05 PM5/4/21
to li...@googlegroups.com
Hi:

I have two questions about Libuv that I haven't seen addressed by the documentation.

First, when uv_poll_init() returns EEXIST, could it be indicative of a transient
problem rather than an operational problem or bug?

Second, this question is related to concurrency but does not involve concurrent execution.
In fact, all the Libuv calls execute serially.

1) The following sequence of calls executes because it has been decided polling needs
   to be terminated for a particular TCP connection.

    uv_poll_stop(poll_handle_1);
    uv_close((uv_handle_t *) poll_handle_1, close_callback);

2) Now it is decided polling needs to be terminated for another TCP connection and
   the following sequence of calls executes.

    uv_poll_stop(poll_handle_2);
    uv_close((uv_handle_t *) poll_handle_2, close_callback);

3) The close_callback() function executes, but we don't know whether it executes first for the
   uv_close() call made with poll_handle_1 or with poll_handle_2.

Can this happen and result in corruption ?
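
For concreteness, here is a minimal sketch of the shape I mean, assuming the per-connection descriptor
is stashed in handle->data before uv_close() so the callback can tell the two closes apart (CONN_DESC,
VFREE, and UCHAR are my application's types/macros):

    /* Sketch only: each uv_close() callback receives the specific handle being
     * closed, so the two closes can be distinguished via handle->data. */
    static void close_callback(uv_handle_t *handle)
    {
        CONN_DESC *cdesc = (CONN_DESC *) handle->data;   /* set before uv_close() */

        syslog(LOG_INFO, "Poll handle for CONN[%d] closed", cdesc->index);
        VFREE((UCHAR *) handle);                         /* safe to free the handle here */
    }

    static void stop_polling(uv_poll_t *poll_handle, CONN_DESC *cdesc)
    {
        /* Issued on the loop thread; the callback only runs later, from uv_run(). */
        uv_poll_stop(poll_handle);
        poll_handle->data = (void *) cdesc;
        uv_close((uv_handle_t *) poll_handle, close_callback);
    }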

Best Regards,

Paul R.



Paul Romero

unread,
May 8, 2021, 12:11:00 AM5/8/21
to li...@googlegroups.com
Hi:

I was able to track down and fix the problem causing uv_poll_init() to return EEXIST.
However, uv_poll_init() still returns EBADF intermittently and I traced the problem
to the following code segment but don't fully understand why.

The EBADF is generated by the following code segment in uv_poll_init() in src/unix/poll.c, around line 90.

  /* If ioctl(FIONBIO) reports ENOTTY, try fcntl(F_GETFL) + fcntl(F_SETFL).
   * Workaround for e.g. kqueue fds not supporting ioctls.
   */
  err = uv__nonblock(fd, 1);
  if (err == UV_ENOTTY)
    if (uv__nonblock == uv__nonblock_ioctl)
      err = uv__nonblock_fcntl(fd, 1);

Do you have any insight about this ?

I was able to verify that the following sequence of events can occur and strongly
suspect it can lead to corruption. This is what causes EEXIST to be returned by
uv_poll_init() calls.


1) The following sequence of calls executes because it has been decided polling needs
   to be terminated for a particular TCP connection.

    uv_poll_stop(poll_handle_1);
    uv_close((uv_handle_t *) poll_handle_1, close_callback);

2) Now it is decided polling needs to be terminated for another TCP connection and
   the following sequence of calls executes.

    uv_poll_stop(poll_handle_2);
    uv_close((uv_handle_t *) poll_handle_2, close_callback);

3) The close_callback() function executes, but we don't know whether it executes first for the
   uv_close() call made with poll_handle_1 or with poll_handle_2.


Handling the problem requires knowing that close_callback() won't execute until
uv_run() runs. (I don't think this is documented.)

The solution is that when a uv_poll_stop()/uv_close() sequence is started, another
such sequence must not be started until both close_callback() and uv_run() have
executed.

In my case, this requires a variable, Poll_Releasing, which is set to TRUE when such a
sequence has been initiated and to FALSE when the close_callback() executes.
It also requires putting T_POLL_DISMANTLE_REQ messages that arrive while Poll_Releasing
is TRUE on a temporary queue. Then, when the IO_Task_Input_Q is empty, the messages from
the temporary queue are moved to IO_Task_Input_Q and the poll_proxy() routine returns, which
allows uv_run() to execute.

The relevant code segments are as follows.

1) In the IO_Task() the poll_proxy() routine in network_io.c.


ROUTINE PRIVATE void poll_proxy(uv_async_t *handle)
{
    CONN_DESC *cdesc;
    MSG *msg, *rsp;
    MSG_FIFO Poll_Release_Q;
    uv_poll_t *poll_handle;
    uv_handle_t **ppdata;
    int r;
    BOOL done;

    //
    // Do nothing if execution is already in progress.
    //
    if(Poll_Proxy_Busy)
        return;

    Poll_Proxy_Busy = TRUE;
    Poll_Release_Q.head = Poll_Release_Q.tail = NULL;
    //
    // Handle messages from other tasks.
    //
    done = FALSE;
    while(done != TRUE)
      {

ENTER_MUTEX(&Async_Mutex);
        msg = remove_msg(&IO_Task_Input_Q);
EXIT_MUTEX(&Async_Mutex);

        if(msg)
          {
            cdesc = (CONN_DESC *) msg->conn_desc;
            syslog(LOG_DEBUG, "IO_TASK(Proxy): RX EVENT - CONN[%d] Type %d\n", cdesc->index, msg->type);

            switch(msg->type)
            {

            case T_POLL_CONFIG_REQ:
                //
                // Main process request to start Libuv/epoll() operation.
                //
                ... Code to handle message ...
                break;
            case T_POLL_DISMANTLE_REQ:
                //
                // If a poll handle is being dismantled, you must
                // wait for the uv_close() callback routine to execute
                // before handling this message. In order for the callback
                // routine to execute uv_run() must be called.
                //
                if(Poll_Releasing)
                  {
                    //
                    // Save the message on a temporary queue
                    // and move the contents of that queue to
                    // the task input queue when this routine returns.
                    //
                    insert_msg(&Poll_Release_Q, msg);
                    msg = NULL;
                    break;
                  }
                else
                    Poll_Releasing = TRUE;
                //
                // Protocol_Task() request to cease Libuv/epoll() operation and
                // release the poll handle and its resources.
                //

                ppdata = (uv_handle_t **) &msg->buffer[0];
                poll_handle = (uv_poll_t *) *ppdata;

                if( (r = uv_poll_stop(poll_handle)) )
                  {
                    fatal_error(LOG_CRIT, "IO_TASK: POLL_PROXY - Poll Stop Error %d, %s",
                         r, uv_err_name(r) );
                  }
                //
                // The callback routine notifies the Protocol_Task() when everything is done.
                //
                poll_handle->data = (void *) cdesc;
                uv_close((uv_handle_t *) poll_handle, async_close_callback);
                break;
            default:
                fatal_error(LOG_EMERG, "IO_TASK: POLL_PROXY - Invalid Message = %d", msg->type);

            }

            if(msg)
                MSG_FREE(msg);
          }
        else
          {
            done = TRUE;
            //
            // The input queue is now empty. Move the messages saved
            // on the temporary queue to it.
            //
            while( (msg = remove_msg(&Poll_Release_Q)) )
               {
ENTER_MUTEX(&Async_Mutex);
                insert_msg(&IO_Task_Input_Q, msg);
EXIT_MUTEX(&Async_Mutex);
               }
          }
      }

    Poll_Proxy_Busy = FALSE;

    return;
}


2) In the IO_Task(), the async_close_callback() routine in async.c.

ROUTINE void async_close_callback(uv_handle_t *handle)
{
    MSG *msg;

    CONN_DESC *cdesc = (CONN_DESC *) handle->data;
    int type = handle->type;


    VFREE((UCHAR *) handle);
    //
    // Send a notification message to the Protocol_Task.
    //
    msg = MSG_ALLOC(0, FALSE);
    msg->class = C_NOTIFY;
    msg->type = T_DISMANTLE_RSP;
    msg->info = 0;

    SendProtoMsg(cdesc, msg);
    //
    // Signal that the poll handle is released.
    //
    if(type == UV_POLL)
      {
        Poll_Releasing = FALSE;
        syslog(LOG_INFO, "IO_Task: CONN[%d] Poll Release Complete", cdesc->index);
      }

    return;
}



pa...@rcom-software.com

unread,
May 8, 2021, 11:27:29 AM5/8/21
to libuv
Hi:

An addition to my last message: when uv__nonblock() fails, it is indicative of a Linux FIONBIO ioctl() failure. What would cause
setting non-blocking mode to fail?

Best Regards,

Paul R.

Paul Romero

unread,
May 8, 2021, 11:49:57 AM5/8/21
to libuv
Hi:

I made a debugging error and the problem has nothing to do with FIONBIO.
The real problem is that epoll_ctl() fails with errno set to EBADF when called as follows.

    epoll_ctl(loop->backend_fd, EPOLL_CTL_ADD, fd, &e)

This happens in uv__check_fd() and means either the backend_fd or fd is bad.
I think it is far more likely that the backend_fd is bad, but I don't know why this would happen.
Any ideas?
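
For what it's worth, a tiny standalone test program (completely separate from my server) confirms that
epoll_ctl() reports EBADF when either descriptor passed to it has already been closed, which would be
consistent with a stale fd reaching uv_poll_init():

    /* Standalone demo: epoll_ctl() sets errno to EBADF if either the epoll fd
     * or the watched fd has already been closed. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <errno.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>

    int main(void)
    {
        int epfd = epoll_create1(0);
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        struct epoll_event e = { .events = EPOLLIN, .data.fd = sock };

        close(sock);                                         /* make the descriptor stale */

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &e) < 0)
            printf("EPOLL_CTL_ADD: %s\n", strerror(errno));  /* prints "Bad file descriptor" */

        close(epfd);
        return 0;
    }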

PR



Jameson Nash

unread,
May 8, 2021, 11:52:02 AM5/8/21
to libuv
Are you still accessing libuv (sans explicitly thread-safe functions such as uv_async_send) from multiple threads, as you mentioned earlier? If so, I'd suggest fixing that first. In conjunction, I recommend running TSan and making sure it runs cleanly before checking for library or logic problems. Then, if it is still a rare failure, I recommend debugging under `rr` as you'll be able to run forward to the problem, then walk backwards through the code to see what happened to your state and file descriptors.

Paul Romero

unread,
May 9, 2021, 1:24:58 PM5/9/21
to li...@googlegroups.com
Hi:

Using TSAN did turn up one very significant problem. The root cause of the TCP socket descriptor corruption
is that accept4() executes concurrently in the main() task with close() in the Protocol_Task().

As an interim measure I avoid the problem by simply not calling close(). So far this has worked
seemlesly and Libuv appears to automatically free socket descriptors. The Libuv documentation
about this is somewhat ambiguous. It indicates that after calling uv_poll_stop() or uv_close(),
for a particular uv_poll_t poll handle, the socket descriptor is returned to the user per the Libuv
contract. I am not sure what that means, and in particular, if it means the socket descriptor is freed.
Can you clarify this ?

A tangential issue is whether Linux accept4() and close() are thread safe. I believe they
are, and the crucial data is protected in the kernel. Is it possible Libuv is not handling the accept4()
return codes correctly? The Linux accept4() man page details how errors should be handled and is
somewhat fussy. The Linux close() man page also details error handling, but it is straightforward.

Also, I haven't been able to make the program compiled with TSAN dump core. Do you have
any suggestions ? Incidentally, I had to use clang rather than the usual gcc to get
TSAN to work on my system.

Best Regards,

Paul R.

Jameson Nash

unread,
May 9, 2021, 2:53:30 PM5/9/21
to libuv
uv_poll_stop and uv_accept must happen on the same thread, so how are they racing? Once uv_poll_stop is called, it should be safe to notify another thread to call close, even if that then happens concurrently, as the kernel should be thread-safe (though weird things may happen in userspace if you close a fd that is concurrently in use on another thread or event queue).

Jameson Nash

unread,
May 9, 2021, 2:58:31 PM5/9/21
to libuv
Are you extracting a fd from a libuv object then calling close or uv_poll on it? Either would cause problems, and shouldn't be done.
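
Roughly, the safer shape is to let the handle that owns the socket close it; a minimal sketch
(tcp_close_callback and teardown_connection are illustrative names, not libuv API):

    /* Sketch: let libuv close the socket it owns, instead of calling close() on an
     * fd extracted with uv_fileno(). uv_close() on the uv_tcp_t closes the fd. */
    static void tcp_close_callback(uv_handle_t *handle)
    {
        free(handle);                  /* or whatever allocator owns the handle */
    }

    static void teardown_connection(uv_tcp_t *conn_handle)
    {
        /* Must run on the thread that owns conn_handle's loop. */
        uv_close((uv_handle_t *) conn_handle, tcp_close_callback);
    }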

Paul Romero

unread,
May 9, 2021, 3:02:13 PM5/9/21
to li...@googlegroups.com
Hi:

uv_accept() is in the main() task and uv_poll_stop() is in the IO_Task(). However, I never observed
them executing concurrently with the same FD. What you wrote does make sense and explains
a lot of things though. I guess that means uv_accept() somehow accesses the same data
as uv_poll_stop() even though the poll handle is never passed to uv_accept() !

Best Regards,

Paul R.

Paul Romero

unread,
May 9, 2021, 3:14:28 PM5/9/21
to li...@googlegroups.com
Hi:

I get the FD immediately after accepting an incoming connection with uv_fileno().
Then subsequently it is used as follows:

* IO_Task(): When the poll callback executes correctly, it is used with Linux recv() to
  read incoming network data in non-blocking mode. (All incoming data and polling
  is done in this thread/task.)

* Transmit_Task(): It is used with Linux send() to send data across the network in non-blocking
   mode. This thread/task is driven by message reception outside Libuv mechanisms.

That's it, other than the final close() we discussed. I never observed any problems unrelated
to Libuv connection acceptance and release, and starting and stopping Libuv polling.

Best Regards,

Paul R.

pa...@rcom-software.com

unread,
May 11, 2021, 9:45:59 AM5/11/21
to libuv
Hi:

Perhaps it would be useful to know that all Libuv epoll() operations occur in the IO_Task(), which
is dedicated to reception of network data. This provides nice modularity and minimizes the
possibility of deadlocks due to interactions with other kinds of functionality.

Best Regards,

Paul R.

pa...@rcom-software.com

unread,
May 14, 2021, 12:37:26 PM5/14/21
to libuv
Hi:

I think it is safe to perform connection acceptance and epoll() oriented TCP socket
transactions in separate thread/tasks if you adhere to the following rules.

My server follows these rules and works seamlessly, with the following exception. If it uses
Linux close() to close TCP socket descriptors after uv_poll_stop() and uv_close() have succeeded for a
particular uv_poll_t connection handle, internal Libuv structures are corrupted
intermittently after a long period of successfully making and releasing many TCP connections.

This implies that Libuv automatically closes TCP socket descriptors because my Server
never runs out of them no matter how long it runs.

RULES
-----
* There is a separate uv_loop_t Loop structure for accepting incoming connections and for epoll() I/O events.

  In my case they are declared as follows:

    uv_loop_t Connect_Loop;        // Libuv loop for incoming connections.
    static uv_loop_t Poll_Loop;    // Libuv loop for reading incoming network data.

* The Loop structures are owned by separate tasks and only accessed by the task that owns the Loop structure.

  In my case the main() task/thread owns Connect_Loop and the IO_Task() thread/task owns Poll_Loop.

* Only the task/thread which owns the incoming connection acceptance Loop calls connection management
  related Libuv API routines, such as uv_accept() and uv_close(), and accesses uv_tcp_t connection
  handles allocated during connection acceptance.

  In my case the main() task/thread follows this rule and owns Connect_Loop.

* Only the task/thread which owns the epoll() I/O event Loop calls polling related Libuv API routines,
  such as uv_poll_start() and uv_poll_stop(), and accesses uv_poll_t connection handles allocated
  during polling initiation.

  In my case the IO_Task() thread/task follows this rule and owns the Poll_Loop. A bare-bones sketch of
  the arrangement follows.
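
A bare-bones sketch of the two-loop arrangement described by these rules, with each loop initialized
and run only by its owning thread (the TCP listener setup and the async plumbing are elided):

    /* Sketch: each uv_loop_t is initialized, run and torn down only by the thread that owns it. */
    #include <uv.h>

    uv_loop_t Connect_Loop;             /* owned by main() */
    static uv_loop_t Poll_Loop;         /* owned by IO_Task() */

    static void IO_Task(void *arg)
    {
        (void) arg;
        uv_loop_init(&Poll_Loop);
        /* ... uv_async_init(&Poll_Loop, &IO_Task_Async_Handle, poll_proxy); ... */
        uv_run(&Poll_Loop, UV_RUN_DEFAULT);
    }

    int main(void)
    {
        uv_thread_t io_tid;

        uv_loop_init(&Connect_Loop);
        uv_thread_create(&io_tid, IO_Task, NULL);
        /* ... set up the TCP listener on Connect_Loop ... */
        uv_run(&Connect_Loop, UV_RUN_DEFAULT);
        uv_thread_join(&io_tid);
        return 0;
    }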


Best Regards,

Paul R.

Paul Romero

unread,
May 31, 2021, 10:51:24 AM5/31/21
to li...@googlegroups.com
Hi Folks:

I worked out the kinks in my multi-threaded, Libuv-oriented application, and an
overview of its salient features follows. It is now highly reliable, and I think this
overview closes the relevant Libuv issues and provides insight about using Libuv
in a threaded environment.

My application includes two Libuv oriented tasks. Their primary functions are performed
with standard Libuv mechanisms. Each task has its own uv_loop_t poll loop structure which
is accessed only by that task and is used to perform related Libuv operations. The tasks
can also perform Libuv operations on behalf of other tasks upon receipt of Libuv async.
messages from other tasks. An overview of the Libuv task functionality is as follows.

Libuv Task/Threads
------------------
* main(): Handles incoming TCP network connection establishment and Linux/Unix signals.
  (SSL connectivity is also handled by extension.)
* IO_Task(): Handles incoming network data.

The relevant files, which are attached for your convenience, are as follows.

Relevant Files
----------------
* main.c: The main() task. The make_incoming_connection() routine handles
  incoming network connections and the connect_proxy() routine handles async.
  messages from other tasks.
* network_io.c: The IO_Task() task. The poll_callback() routine handles
  incoming network data and the poll_proxy() routine handles async. messages
  from other tasks.
* async.c: The send_async_msg() routine is used to send messages to the Libuv oriented tasks.
* ssl.c: Performs SSL related operations. This file is included to resolve ambiguity in
  the details of the make_incoming_connection() routine.

An overview of the other message oriented application tasks is as follows.

Message Oriented Task/Threads
---------------------------------
* Protocol_Task(): Implements the data transfer state machine used for data transfer
  between the Server and a Client on a per TCP connection basis.
* DB_Task(): Performs database insertion operations.
* Scram_Task(): Performs slow SCRAM authentication proof computation operations.
* Transmit_Task(): Transmits data from the Server to the Client on a per TCP connection basis.
* Timer_Task(): Manages protocol and application timers.

Best Regards,

Paul R.
async.c
main.c
network_io.c
ssl.c