[AOLSERVER] Null responses

Cyan ogilvie

unread,

Aug 26, 2014, 2:51:44 PM8/26/14

to aolserv...@lists.sourceforge.net

Hi

I'm hoping there are still some subscribers to this list ;)

I'm trying to debug a strange condition we're seeing on a small percentage of our connections: connections are being closed by the server without any response being sent back on the connection (verified by looking at network packet traces and inserting a logging transparent proxy between the client and server). The network packet pattern we see is:

Request data (in a single frame, or multiple), ACK

Then the connection is closed by the server after 10 - 70 ms, without any data being sent, with a FIN/ACK (still getting confirmation on this - these logs are from the other side of the man-in-the-middle proxy I'm using to get debugging info).

For some of the failed requests, the server processing never gets as far as the start of our Tcl code (a preauth filter that starts with an ns_log that doesn't show up in the server log).

For others the request is processed normally and an access.log message written indicating that a response was generated with HTTP code 200, but no packet shows up on the network.

There is no pattern to the failed requests (sometimes requests for favicon.ico fail), and retrying the exact request shortly afterwards often succeeds.

Has anyone seen anything like this before, or have any advice on how to narrow down the cause further?

We're running a slightly patched version of the last 4.5.2 rc, on Ubuntu 12.04.5 64bit on Amazon EC2 instances, with Tcl 8.6.1

Thanks

Cyan

Peter Sadlon

unread,

Aug 26, 2014, 3:33:01 PM8/26/14

to Cyan ogilvie, aolserv...@lists.sourceforge.net

Not sure if this will help, but....

If you have a preauth filter (possibly other filters too) which does an ns_adp_abort (and/or possibly ns_adp_break) within the filter or within a proc called from the filter then the next time that thread is used, the state will be "abort" and will end the connection.

You can put this line as the first lines of your pre_auth filter:

ns_adp_exception state

ns_log notice $state

If you find states of "abort" in your log, do this as a simple hack to reset the state:

ns_adp_exception state

if {$state=="abort" || $state=="break"} {

catch {ns_adp_return}

set foobar [ns_adp_parse "Hello"]

}

It may be a place to start looking, but if you have a ns_log in your preauth which is not logging then I would guess the issue is somewhat different. But anyone who may use an ns_adp_abort or ns_adp_break from within a filter should be aware of this. I believe using ns_adp_abort in a filter before version 4 worked fine, since 4.0 or 4.5 it causes this abort state issue, I don't remember exactly, it was a long time ago that I had upgraded and had to figure that out.

_Peter

Date: Tue, 26 Aug 2014 18:05:31 +0200
From: cy...@rubylane.com
To: aolserv...@lists.sourceforge.net
Subject: [AOLSERVER] Null responses

------------------------------------------------------------------------------ Slashdot TV. Video for Nerds. Stuff that matters. http://tv.slashdot.org/
_______________________________________________ aolserver-talk mailing list aolserv...@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/aolserver-talk

Jeff Rogers

unread,

Aug 26, 2014, 6:51:43 PM8/26/14

to Cyan ogilvie, aolserv...@lists.sourceforge.net

Hi Cyan,

Yes, there's still a few subscribers.

Do you know if this connection dropping happens mostly when there is a
lot of activity or more frequently when there is very low activity?

I recall a few edge cases in the thread pooling where a thread would in
some circumstances wait until another connection came in before running,
and there might have been a related case where a connection could get
dropped. IIRC, these both happened generally when there was low traffic
(or more specifically low concurrent traffic). Playing with maxconns
might diminish the problem in this case.

You also mention favicon.ico; is it mostly or always that? It's notable
for being a small static file, which could point to other causes, like a
corrupt interpreter state as Peter suggested. Or there might be some
weirdness with mmap if you have that enabled.

One other thought, can you switch to naviserver? The connection
handling there has evolved somewhat differently not to mention more
recently) than aolserver, but programming-wise there are not a lot of
differences.

-J

Tony Bennett (Brown Paper Tickets)

unread,

Aug 27, 2014, 9:47:05 AM8/27/14

to aolserv...@lists.sourceforge.net

Does this only happen under heavy load? You might be able to adjust your init.tcl to handle more connections. I'm not sure what you would need to change so I pulled out any config that might help.

-Tony

ns_section "ns/parameters"
ns_param listenbacklog        32        ;# Max length of pending conn queue

# keepalivetimeout
# In prior versions of AOLserver this may have been under
# ns/servers/$server. Set to 0 to disable Keep-alive support, any other
# value will enable Keep-alive but won't actually configure the timeout,
# which is now configured under the driver.
# (http://aolserver.am.net/docs/tuning.adpx)
ns_param keepalivetimeout    30        ;# Max time conn is kept alive (keepalive)

# maxkeepalive
# In prior versions of AOLserver this may have been under
# ns/servers/$server. In AOLserver 4.0 and newer, this setting exists
# but appears to be non-operational
# (http://aolserver.am.net/docs/tuning.adpx)
ns_param maxkeepalive        100        ;# Max no. of conns in keepalive state

ns_section "ns/server/${server_name}"
ns_limits set default -maxrun 500        ;# the maximum number of connections which can be running simultaneously

### Tuning Options ###
ns_param connsperthread        0        ;# Normally there's one conn per thread
#ns_param flushcontent        false        ;# Flush all data before returning
ns_param maxconnections        100        ;# Max connections per connection thread before it is shut down
#ns_param maxdropped        0        ;# Shut down if dropping too many conns
ns_param maxthreads        20        ;# Tune this to scale your server
ns_param minthreads        0        ;# Tune this to scale your server
ns_param threadtimeout        120        ;# Idle timeout for connection threads
ns_param spread            20        ;# Variance factor for threadtimeout and maxconnections to prevent mass mortality of theads (e.g. +-20%)

ns_section "ns/server/${server_name}/module/nssock"
ns_param Address        $address
ns_param Hostname        $host
ns_param Port            $port
ns_param maxsock        500
ns_param keepwait        30        ;# number of seconds to hang-up on clients while waiting for connection
ns_param socktimeout    30        ;# number of seconds to wait for a client request. increase for file upload
ns_param maxinput        1MB    ;# maximum size of data sent from browser
ns_param maxline        16k        ;# maximum number of bytes for http request or header line.
ns_param maxheader        64k        ;# maximum number of bytes for all HTTP header lines in a request

########################################################################
# Thread library (nsthread) parameters
#
# If the server is crashing with no explanation you may have a corrupted
# data due to a stack overflow. Calculate locally declared data types
# and function parameters to get the stack size needed.
########################################################################
ns_section "ns/threads"
#ns_param mutexmeter        true        ;# measure lock contention
#ns_param stacksize        [expr 128*1024] ;# stack size per thread (in bytes)

Cyan ogilvie

unread,

Aug 27, 2014, 3:02:36 PM8/27/14

to Jeff Rogers, aolserver-talk

On Wed, Aug 27, 2014 at 12:29 AM, Jeff Rogers <dv...@diphi.com> wrote:
>
> Do you know if this connection dropping happens mostly when there is a lot of activity or more frequently when there is very low activity?
>
> I recall a few edge cases in the thread pooling where a thread would in some circumstances wait until another connection came in before running, and there might have been a related case where a connection could get dropped. IIRC, these both happened generally when there was low traffic (or more specifically low concurrent traffic). Playing with maxconns might diminish the problem in this case.

There doesn't seem to be a correlation between the site load and the
connection drops. We never really see very low activity - the design
of the site means there are hundreds of thousands of pages, so we're
constantly being crawled by every bot ever spawned (we serve about 8GB
daily to googlebot alone). But we use fairly small Amazon EC2
instances that are primarily memory limited, so the config is tuned to
reach a memory high water mark of around 700 - 800 MB, which based on
the load testing I did found maxthreads of 6 and maxconnections of 50
(modules/tcl/pools.tcl seems to take the maxconnections ns_param as
the ns_pools -maxconns parameter). The project was most recently
moved from version 3.4 / Tcl 7.6 to 4.5 / Tcl 8.6, so there are a lot
of legacy bits and shims in place. I recall something about the
meaning of the maxconnections ns_param changing meaning between these
versions, but it seems to be working as I would expect).

Server load tends to vary between a loadavg of 0.4 to 2.0, typically
around 0.6 (2 cores). Concurrency is pretty low, it seldom reaches 6,
and usually sits at around 1 - 3, based on the monitoring we've
currently got in place. I'm currently trying to reconstruct exact the
thread / concurrency / request context for the connection drop events
by parsing the log files, I'm hoping that might reveal some pattern in
the failures. But so far they don't seem to correspond with
connection lifecycle events.

> You also mention favicon.ico; is it mostly or always that? It's notable for being a small static file, which could point to other causes, like a corrupt interpreter state as Peter suggested. Or there might be some weirdness with mmap if you have that enabled.

There doesn't seem to be a pattern to the failing requests, sometimes
it's small static files like favicon.ico, but mostly not (although in
our case we're not using fastpath for that - different favicons are
served based on the request context). At the moment I'm leaning
towards some sort of corrupted connection thread state - the failures
tend to cluster by time, server, user - so that, although the failures
are exceedingly rare overall (220 yesterday), it's often the case that
a given user will have to reload a page several times before they get
a successful response. The servers are fronted by haproxy which will
tend to send a given session back to the same server.

> One other thought, can you switch to naviserver? The connection handling there has evolved somewhat differently not to mention more recently) than aolserver, but programming-wise there are not a lot of differences.

It's probably not out of the question if there is a strong argument to
be made that it would fix the problem, we're taking quite a reputation
hit at the moment. I initially attempted to make the site work on
naviserver since that seemed to be more active, but I ran into
problems with the nsdb / nsdbi change and segfaults when I tried to
get nsdb working on it. It's also a 15 year old code base that seemed
to be quite sensitive to the small config and api changes in
naviserver, and the port from 3.4 / Tcl 7.6 was tricky enough as it
was (encoding issues, list parsing differences, regexp syntax, etc)
that the call was made to go with AOLserver 4.5 instead to minimize
the changes required.

But the site has been running well on the new version since March /
April, so porting to naviserver should be feasible, but I'd need to
make a very strong case.

Cyan ogilvie

unread,

Aug 27, 2014, 3:31:05 PM8/27/14

to Tony Bennett (Brown Paper Tickets), aolserver-talk

On Tue, Aug 26, 2014 at 11:05 PM, Tony Bennett (Brown Paper Tickets)
<to...@brownpapertickets.com> wrote:
> Does this only happen under heavy load? You might be able to adjust your
> init.tcl to handle more connections. I'm not sure what you would need to
> change so I pulled out any config that might help.

Thanks, the current config has been been minimally modified from what
was being used for 3.4, so there are a few settings in your example
that we're not explicitly setting and that I'm not familiar with, so
they're good leads to follow.

Torben Brosten

unread,

Aug 27, 2014, 3:54:23 PM8/27/14

to aolserv...@lists.sourceforge.net

On 08/27/2014 12:02 PM, Cyan ogilvie wrote:> ..

> There doesn't seem to be a pattern to the failing requests,
> sometimes it's small static files like favicon.ico, but mostly not
> (although in> our case we're not using fastpath for that - different
favicons are
> served based on the request context). At the moment I'm leaning
> towards some sort of corrupted connection thread state - the failures
> tend to cluster by time, server, user - so that, although the failures
> are exceedingly rare overall (220 yesterday), it's often the case that
> a given user will have to reload a page several times before they get
> a successful response. The servers are fronted by haproxy which will
> tend to send a given session back to the same server.

> ..

Have you ruled out a router issue, such as from ipv4 exhaustion or
localized network flooding?

Cyan ogilvie

unread,

Aug 28, 2014, 7:51:27 AM8/28/14

to Torben Brosten, aolserver-talk

On Wed, Aug 27, 2014 at 9:34 PM, Torben Brosten <tor...@dekka.com> wrote:
>
> On 08/27/2014 12:02 PM, Cyan ogilvie wrote:> ..
> > There doesn't seem to be a pattern to the failing requests,
> > sometimes it's small static files like favicon.ico, but mostly not
> > (although in> our case we're not using fastpath for that - different
> favicons are
> > served based on the request context). At the moment I'm leaning
> > towards some sort of corrupted connection thread state - the failures
> > tend to cluster by time, server, user - so that, although the failures
> > are exceedingly rare overall (220 yesterday), it's often the case that
> > a given user will have to reload a page several times before they get
> > a successful response. The servers are fronted by haproxy which will
> > tend to send a given session back to the same server.
> > ..
>
> Have you ruled out a router issue, such as from ipv4 exhaustion or
> localized network flooding?

I'm pretty sure it's not network related at this stage. To test this I built a man-in-the-middle relay listening on port 8008 running on the same server as nsd, which forwards all connection traffic to 127.0.0.1:80 and records the events and data it sees flowing in both directions, and a packet trace using tcpdump of both the requests arriving on eth0 and the relayed traffic on lo. An iptables nat prerouting rule DNATs connections coming in on eth0 port 80 to the relay's port 8008. Another process watches the haproxy logs for indications of a failed request and retrieves a dump of the ring buffers from the relay and saves them for later analysis.

Here is the packet trace of the relay -> nsd traffic on the lo interface for a typical event (as captured by tcpdump, which indicated that no packets had been dropped):

Time Prot Len Info

34.750440 TCP 76 47576 > http [SYN] Seq=0 Win=32792 Len=0 MSS=16396 SACK_PERM=1 WS=32

34.750465 TCP 76 http > 47576 [SYN, ACK] Seq=0 Ack=1 Win=32768 Len=0 MSS=16396 SACK_PERM=1 WS=32

34.750479 TCP 68 47576 > http [ACK] Seq=1 Ack=1 Win=32800 Len=0

34.750720 TCP 4412 [TCP segment of a reassembled PDU]

34.750756 TCP 68 http > 47576 [ACK] Seq=1 Ack=4345 Win=32768 Len=0

34.751274 HTTP 439 GET /item/531138-RR-1012/Elgin-Art-Deco-Dial-Pocket-Watch HTTP/1.1

34.751295 TCP 68 http > 47576 [ACK] Seq=1 Ack=4716 Win=32768 Len=0

34.751377 TCP 68 http > 47576 [FIN, ACK] Seq=1 Ack=4716 Win=32768 Len=0

34.751492 TCP 68 47576 > http [FIN, ACK] Seq=4716 Ack=2 Win=32800 Len=0

34.751515 TCP 68 http > 47576 [ACK] Seq=2 Ack=4717 Win=32768 Len=0

The connection reaches the ESTABLISHED state and the HTTP request data is sent to nsd which is acked. Then 0.8 ms later the connection is closed by nsd. The first Tcl code that should execute for this request is a preauth filter, which starts by writing a log containing [ns_conn request]. That log message isn't reached in cases like this. In this example the request is quite large (around 4.7 KB) because of some large cookies, but the same pattern happens for requests of around 400 bytes.

There appears to be a less common failure mode where the request processing happens normally and the Tcl code generates a normal response (HTTP code 200), but the response data never hits the network. The network trace looks the same as the example I gave above except that the time between the ACK of the GET request and nsd's FIN, ACK is longer - around 20 - 70 ms which is in-line with the normal times for a successful request. I haven't yet caught the packet trace between the relay and nsd for this case (only the trace on eth0 between haproxy and the relay), so I'm not 100% certain of my interpretation of this failure mode yet.

There is usually a cluster of up to about 6 requests that fail in this way for a given server in the space of around 2 seconds, with long intervals between the clusters where no failures happen.

Cyan

Gustaf Neumann

unread,

Aug 28, 2014, 10:30:36 AM8/28/14

to aolserv...@lists.sourceforge.net

Dear Cyan,

you seem to have invested in detail into the problem. We are using
aolserver and naviserver since many years with heavy traffic
(since two last years just naviserver), but we have not seen this
problem.

Some question pop up:
- you say, you have a legacy application: did the problem show up
just recently or did you become aware of this just recently?

- can you reproduce the problem on are bare metal machine?

- do you have the option to replace haproxy by nginx in front of aolserver?
nginx can use for the intratalk persistent connections (we could reduce
the number of connections to naviserver by some order of magnitude),
nginx has various retry options. maybe haproxy has the same, i have not
worked with this.

-gustaf neumann

Am 28.08.14 13:51, schrieb Cyan ogilvie:

------------------------------------------------------------------------------
Slashdot TV.  
Video for Nerds.  Stuff that matters.
http://tv.slashdot.org/

_______________________________________________
aolserver-talk mailing list
aolserv...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/aolserver-talk

-- 
Univ.Prof. Dr. Gustaf Neumann
WU Vienna
Institute of Information Systems and New Media
Welthandelsplatz 1, A-1020 Vienna, Austria

Jeff Rogers

unread,

Aug 28, 2014, 5:41:49 PM8/28/14

to Cyan ogilvie, aolserver-talk

Do you happen to have a custom error page for 503 errors (service
unavailable)? I'm looking over the code for places where a connection
might be dropped without sending anything, and although this path
doesn't seem to do that it does look possible for it to do other bad
things (i.e., crash). It could correlate with a quick burst of
requests from a single user (loading all the images for a page, for
example).

I'd expect an outright crash if this was the case, which would probably
be more noticeable. However, crashes can be funny things and there's a
chance that it could do something entirely unexpected.

-J

Cyan ogilvie wrote:
> On Wed, Aug 27, 2014 at 9:34 PM, Torben Brosten <tor...@dekka.com

> There is usually a cluster of up to about 6 requests that fail in this
> way for a given server in the space of around 2 seconds, with long
> intervals between the clusters where no failures happen.

Brad Chick

unread,

Sep 5, 2014, 2:49:13 PM9/5/14

to aolserv...@lists.sourceforge.net

On the old AOLserver, under modules, there is nssession. But I can't
seem to get it. cvs seems not to be working:

[brad@repos ~]$ cvs -d :pserver:anon...@cvs.panoptic.com:/home/cvs co
-d nssession aolserver/nssession
cvs [checkout aborted]: connect to [cvs.panoptic.com]:2401 failed:
Protocol not available

Does anyone have that module lying around that you could send along?

Thanks

--
==============================
BRAD CHICK
==============================

Br...@ChickCentral.com
734.662.1701 (h)
734.646.9372 (m)

"Make Some Time for Wasting!"
_
| |
___| |__ ___ ___ _ __ ___
/ __| '_ \ / _ \/ _ \ '__/ __|
(__| | | | __/ __/ | \__ \
\___|_| |_|\___|\___|_| |___/
================================

Jeff Rogers

unread,

Sep 5, 2014, 3:37:09 PM9/5/14

to Brad Chick, aolserv...@lists.sourceforge.net

Brad Chick wrote:
> On the old AOLserver, under modules, there is nssession. But I can't
> seem to get it. cvs seems not to be working:
>
> [brad@repos ~]$ cvs -d :pserver:anon...@cvs.panoptic.com:/home/cvs co
> -d nssession aolserver/nssession
> cvs [checkout aborted]: connect to [cvs.panoptic.com]:2401 failed:
> Protocol not available
>
> Does anyone have that module lying around that you could send along?

Try aolserver.cvs.sourceforge.net rather than cvs.panoptic.com - Dossy
might have finally gotten rid of his copies of old cvs stuff.

-J

Brad Chick

unread,

Sep 5, 2014, 5:05:07 PM9/5/14

to Mat Kovach, aolserv...@lists.sourceforge.net

Brilliant. Thanks much

On 9/5/14, 3:41 PM, Mat Kovach wrote:

> Brad Chick writes:
>> On the old AOLserver, under modules, there is nssession. But I can't
>> seem to get it. cvs seems not to be working:
>>
>> [brad@repos ~]$ cvs -d :pserver:anon...@cvs.panoptic.com:/home/cvs co
>> -d nssession aolserver/nssession
>> cvs [checkout aborted]: connect to [cvs.panoptic.com]:2401 failed:
>> Protocol not available
>>
>> Does anyone have that module lying around that you could send along?

> $ cvs -z3 -d:pserver:anon...@aolserver.cvs.sourceforge.net:/cvsroot/aolserver co -P nssession
> cvs checkout: Updating nssession
> U nssession/Makefile
> U nssession/Makefile.nssession
> U nssession/Makefile.nssessionmanager_file
> U nssession/README
> U nssession/config.tcl
> U nssession/nshash.c
> U nssession/nshash.h
> U nssession/nssession.c
> U nssession/nssession.h
> U nssession/nssessionmanager_file.c
> cvs checkout: Updating nssession/examples
> U nssession/examples/counter.adp
> U nssession/examples/guess.adp
>
> http://sourceforge.net/projects/aolserver/
>
> At least, that is where I still tend to yank my code from.
>
> /mek

Reply all

Reply to author

Forward