Meaning of errors in error log and their reporting to handlers

Hans

Feb 13, 2017, 12:51:19 PM
to nxweb
Hi,

Context: my use case is mainly based on uploads via PUT or POST.
I use .on_request and .on_post_data INWORKER handlers, plus a finalizer callback.

Can you explain the error messages in the error log a bit?

Examples:
error: i=-2 errno=11 state=3 rc=0 br=348
error: i=-1 errno=104 state=2 rc=0 br=348 

I understand the errno, but what do i and state mean? (br always seems to be 348 in my case.)

I found the following:
When state=3, my worker does not get informed of any error.
When state=2, my worker gets informed with nxe_data.ptr->fwb->error == 0, but I see that the uploaded size is wrong.

Is that normal or am I doing something wrong here?

Also, errno 11 means EAGAIN, so you could just retry on the socket. Why do you cut off the connection?
errno 11 shows up on a significant percentage of the connections, so it does bother me.
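
For reference, the conventional pattern I have in mind for a non-blocking socket is sketched below (generic code, not nxweb's actual source): EAGAIN just means "no data available right now", so the reader waits for the next readiness event instead of treating it as fatal.

    #include <errno.h>
    #include <unistd.h>

    /* Generic sketch (not nxweb's code): drain a non-blocking socket.
     * EAGAIN/EWOULDBLOCK only means "no more data right now"; the caller
     * should wait for the next EPOLLIN event rather than drop the connection. */
    static int drain_socket(int fd, char* buf, size_t buf_size) {
      for (;;) {
        ssize_t n = read(fd, buf, buf_size);
        if (n > 0) continue;                                    /* got data: a real server would pass these n bytes on, then keep reading */
        if (n == 0) return 0;                                   /* peer closed cleanly */
        if (errno == EAGAIN || errno == EWOULDBLOCK) return 1;  /* not fatal: retry on next EPOLLIN */
        if (errno == EINTR) continue;                           /* interrupted, just retry */
        return -1;                                              /* real error (e.g. ECONNRESET) */
      }
    }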

thx,
Hans

Yaroslav

Feb 13, 2017, 2:04:57 PM
to nxweb
Hi Hans,

These errors typically mean the connection was dropped. Do they appear often?

The numbers are internal state: br is the number of bytes received, for example. I'd need to look into the source to find out the meaning of i and state. errno is the last errno observed; it does not necessarily mean that the connection was dropped due to that errno. EAGAIN is most certainly not the cause of the error.

Yaroslav

Hans

Feb 14, 2017, 1:35:21 AM
to nxweb
Hi,
Connection errors on nxweb happen in my setup on about 2.5% of all connections, and errno 11 accounts for 80% of them (sample size: 2M connections/uploads).
That is significant.
It happens on a clean, non-overloaded connection with few hops (but over the internet).
The same connection, but to nginx, shows far fewer errors. So to me something is wrong here.

Hans

Hans

Feb 14, 2017, 7:11:31 AM
to nxweb

Hi,


Here is my investigation:


Error rate: 2.5% of requests, over a sample of 2M.

The error cases are situations the server considers faults; they get logged, and then the connection is actively cut off via nxweb_http_server_connection_finalize.

I am worried that this cut-off is triggered in cases where it is not warranted.

The cases that concern me are those where:


rc = 0 (98% of all errors)

state = HSP_RECEIVING_BODY (there are a few logs with HSP_HANDLING and HSP_SENDING_HEADERS, but they seem OK to me)

For i, the distribution is NXE_ERROR: 53%, NXE_RDHUP: 36%, NXD_HSP_READ_TIMEOUT: 10%, NXE_RDCLOSED: 0.3%.

Let’s consider NXD_HSP_READ_TIMEOUT and NXE_RDCLOSED to be valid reasons for cutting off the connection. 

I am dubious about NXE_RDHUP. It would suggest a remote hangup, but I suspect this detection might be flawed, as that number is rather high.

So: I focus on i = NXE_ERROR and NXE_RDHUP.

For errno, the distribution is EAGAIN: 84%, ECONNRESET: 16%. There is no correlation between i and errno; errno is spread out evenly.

I have the impression that the logged errno value is of no use: it is the errno at the moment of logging, not at the moment the error was detected. So it just indicates the last error that occurred somewhere else in the thread, potentially unrelated to the situation that finally provokes the connection cut-off.

So: I ignore errno.
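
If it helps, one way to make the logged errno meaningful would be to snapshot it right where the failing call is detected and carry that value to the log. A rough sketch (generic code, not a patch against nxweb's actual structures):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Generic sketch: save errno immediately after the failing call and log
     * that saved value, instead of reading the thread-wide errno later when
     * the connection gets finalized (by then it may belong to something else). */
    static ssize_t read_and_record(int fd, void* buf, size_t size, int* saved_errno) {
      ssize_t n = read(fd, buf, size);
      *saved_errno = (n < 0) ? errno : 0;   /* snapshot at detection time */
      return n;
    }

    /* ...later, when the connection is finalized and logged... */
    static void log_read_error(int i, int state, int saved_errno) {
      fprintf(stderr, "error: i=%d errno=%d (%s) state=%d\n",
              i, saved_errno, strerror(saved_errno), state);
    }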


br is irrelevant to this case, as it indicates the size of the client-provided headers.


NXE_ERROR:

There is 1 case in sock_data_recv_read(): read() < 0 and errno != EAGAIN (but note that I see many cases of i=NXE_ERROR, errno=EAGAIN logged; then again, see the remark about errno above).

I see 3 cases in nx_event.c, all related to EPOLLERR. I do not (yet) understand how these cases are triggered.


NXE_RDHUP:

I see 3 cases in nx_event.c, all related to EPOLLRDHUP. I do not (yet) understand how these cases are triggered.
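
For context, my mental model of such an epoll loop is the generic shape sketched below (not nx_event.c itself): EPOLLRDHUP is reported when the peer shuts down its write side, EPOLLERR when the kernel has a pending error on the socket, and either can arrive while buffered data is still readable.

    #include <sys/epoll.h>

    /* Generic epoll dispatch sketch (not nxweb's nx_event.c): how EPOLLERR and
     * EPOLLRDHUP typically map onto "error" and "remote hangup" events. Note that
     * EPOLLRDHUP can arrive while data is still buffered, so draining the socket
     * before acting on it avoids losing the tail of an upload. */
    void dispatch(struct epoll_event* ev) {
      if (ev->events & EPOLLERR) {
        /* kernel reports a pending socket error: this is where an
         * NXE_ERROR-like event would originate */
      }
      if (ev->events & (EPOLLRDHUP | EPOLLHUP)) {
        /* peer closed (at least its write side): this is where an
         * NXE_RDHUP-like event would originate */
      }
      if (ev->events & EPOLLIN) {
        /* readable data; drain it before honoring RDHUP */
      }
    }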


I am trying to create a reproducible case with netem, but have not found a good one yet. I will continue trying, but here is my first feedback.


Hans

Yaroslav

Feb 14, 2017, 12:06:10 PM
to nxweb
Hi Hans,

This is a very thorough investigation.

2.5% seems like quite a high error ratio. Have you tried your tests over a local network rather than via the internet? It would be interesting to compare. It could also be interesting to compare with nginx.

It looks like most errors originate from EPOLLERR and EPOLLRDHUP. These are epoll events, documented here: http://man7.org/linux/man-pages/man2/epoll_ctl.2.html

errno is not relevant in these cases because epoll events do not set errno. It holds some previous value observed on the same thread, which might even relate to a different connection.

To investigate further, I would suggest adding extra logging in the places where these events originate, e.g. as described here: http://stackoverflow.com/questions/13150377/how-to-get-errno-when-epoll-wait-returns-epollerr
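
The technique from that link boils down to reading the pending error off the socket with SO_ERROR when EPOLLERR fires (epoll itself does not set errno). A small sketch:

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Sketch of the technique from the link above: when EPOLLERR fires,
     * read the actual pending error off the socket with SO_ERROR. */
    static void log_pending_socket_error(int fd) {
      int err = 0;
      socklen_t len = sizeof(err);
      if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 && err != 0)
        fprintf(stderr, "EPOLLERR on fd %d: %s (%d)\n", fd, strerror(err), err);
    }

Logging this at the point where the EPOLLERR branch runs would tell us what the kernel actually saw, instead of the stale thread-wide errno.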

You said you are using an INWORKER handler. Is that because you are doing some blocking operations? Try commenting out those blocking calls and switching to INPROCESS. This should eliminate any possible cross-thread delays. BTW, are you sure you are not running out of worker threads?

Perhaps you can show me your code, both client and server.

BTW it could be a problem on the client side.

Yaroslav


Hans

Feb 14, 2017, 1:44:41 PM
to nxweb
Hi,

Tests over the local network, but using bad WiFi, show a 1% error rate.

I will set up some slimmed-down test apps and do some stress tests on a simulated "realistic" network. I will also instrument nxweb a bit to get to the root cause of the errors.

By the way, about INWORKER vs INPROCESS (I need INWORKER, as I do blocking I/O with the uploaded data): how would I see that I am running out of worker threads? (Although I am pretty sure I am not, as I have tested the app in the lab both at 1000 times the load of the machine that is complaining now and on a small, CPU-saturated machine, and I saw no issues there.)

Hans

Yaroslav

Feb 14, 2017, 4:05:14 PM
to nxweb
Try testing on localhost without any physical network. Does that further reduce the errors?

About workers: when there are no workers left (the limit is defined by NXWEB_MAX_WORKERS), there will be HTTP 503 errors. Also note that new workers are slow to create, so there can be delays (which theoretically can result in other problems) during an extremely steep warm-up.

It is worth trying a non-blocking INPROCESS handler so we can exclude possible causes on that side.

Try building a minimal reproducible environment without blocking I/O, complicated processing, etc. Something that I can run on my PC, for example.
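
For example, something as small as the client sketch below would do for the load side (the port 8055 and the /upload path are just placeholders; point them at whatever your stripped-down server config listens on). Running the same binary against localhost, the LAN and the problematic source network should show whether the 2.5% is network-related or server-related.

    /* Minimal upload-client sketch for a reproducible test: repeatedly PUTs a
     * fixed-size body and counts dropped connections and non-2xx responses. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define BODY_SIZE 65536
    #define REQUESTS  10000

    int main(void) {
      signal(SIGPIPE, SIG_IGN);  /* a dropped connection must not kill the client */
      static char body[BODY_SIZE];
      memset(body, 'x', sizeof(body));
      char header[256];
      int hdr_len = snprintf(header, sizeof(header),
          "PUT /upload HTTP/1.1\r\nHost: 127.0.0.1\r\n"
          "Content-Length: %d\r\nConnection: close\r\n\r\n", BODY_SIZE);
      int errors = 0;
      for (int i = 0; i < REQUESTS; i++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {.sin_family = AF_INET, .sin_port = htons(8055)};
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
        if (connect(fd, (struct sockaddr*)&addr, sizeof(addr)) < 0 ||
            write(fd, header, hdr_len) != hdr_len ||
            write(fd, body, BODY_SIZE) != BODY_SIZE) {
          errors++;  /* connect or send failed (e.g. reset mid-upload) */
        } else {
          char resp[512];
          ssize_t n = read(fd, resp, sizeof(resp));
          if (n < 10 || strncmp(resp, "HTTP/1.1 2", 10) != 0)
            errors++;  /* dropped before response, or non-2xx (e.g. 503) */
        }
        close(fd);
      }
      printf("%d/%d requests failed\n", errors, REQUESTS);
      return 0;
    }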

Hans

Feb 15, 2017, 2:52:45 AM
to nxweb
Hi,

It seems the errors I am hunting now all come from one particular source network, so I am looking there first, and I am running nginx and nxweb destinations side by side for that network, to be able to compare.
I will instrument nxweb a bit more so I can get more detailed error info, and I am already running full packet dumps alongside to be able to pinpoint the problem.
I have not seen 503 errors in this situation. In the lab I have seen them, but indeed only at steep warm-ups and in CPU-deprived situations, as is to be expected. The nxweb target this is all about is well below saturation.

Hans