Starman backlog


maxwel...@gmail.com

Jan 4, 2017, 9:49:53 PM
to psgi-plack
Hi,

I'm running Starman 0.4014 behind nginx 1.10.0 on Ubuntu 16.04.1
with Linux 4.4.0 (with nginx and Starman on separate servers).

I get immediate 502 errors from Nginx if all five Starman worker
processes are busy, even if I have recently (re-)started the
Starman master process.

As the --backlog option of Starman defaults to 1024,
and /proc/sys/net/core/somaxconn is 128, shouldn't these
requests be queued by Starman on the backend,
rather than being rejected?

Configuration is through Plack::Runner:


my %args = (
  '--port' => 6548,    # 3000 would be the default
  '--server' => "Starman",
);
my $runner = Plack::Runner->new();
$runner->parse_options(%args);
my $app = $runner->locate_app("$FindBin::Bin/app.pl");
$app = $runner->apply_middleware($app, "SizeLimit",
                                 max_unshared_size_in_kb => 300000);
my $loader = $runner->loader;
$loader->preload_app($app);
my $server = $runner->load_server($loader);
$loader->run($server);
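
(For completeness, as far as I can tell the same parse_options call
would also accept Starman's --workers and --backlog directly; the
values below are only illustrative, I haven't actually changed them:)

my %args = (
  '--server'  => "Starman",
  '--port'    => 6548,
  '--workers' => 5,      # the five worker processes mentioned above
  '--backlog' => 1024,   # Starman's documented default; the kernel caps it at somaxconn
);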

- Maxwell

Eric Wong

Jan 4, 2017, 10:00:31 PM
to psgi-...@googlegroups.com
maxwel...@gmail.com wrote:
> Hi,
>
> I'm running Starman 0.4014 behind nginx 1.10.0 on Ubuntu 16.04.1
> with Linux 4.4.0 (with nginx and Starman on separate servers).
>
> I get immediate 502 errors from Nginx if all five Starman worker
> processes are busy, even if I have recently (re-)started the
> Starman master process.
>
> As the --backlog option of Starman defaults to 1024,
> and /proc/sys/net/core/somaxconn is 128, shouldn't these
> requests be queued by Starman on the backend,
> rather than being rejected?

Maybe all 128 slots in the queue are immediately used up by
clients? It happens on busy sites...

The ss(8) tool in Linux can show how much of the backlog you
use in the Recv-Q (2nd) column.

ss -tl

(-t for TCP, -l for LISTEN)

You can leave out the -l to see the client sockets hitting
:6548, too.
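
As a made-up example (the numbers are invented, just to show where
to look), a listener whose accept queue is full would look roughly
like this:

State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  128     128     *:6548              *:*

For a LISTEN socket, Send-Q is the configured backlog and Recv-Q is
how many connections are currently sitting in it waiting to be
accepted.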

maxwel...@gmail.com

Jan 5, 2017, 1:00:25 PM
to psgi-plack, e...@80x24.org
That's a command I didn't know about - thanks!

My traffic is bursty, rather than steadily high.
Whenever I manage to execute "ss -tl", on the load balancer or
the backend, I see all zeros in the Recv-Q column, even if I've
occupied the Starman workers.

To be sure, I'm not certain that I've occupied exactly five workers
at the instants I get the 502 errors.  Serving the application requests involves
some rather heavy processing, and some other system limit, such as memory,
could be coming into play.  However, I've never had problems hitting the
system process limit or having the workers that are running fail due to
memory problems.  So it seems likely to me that Starman closes the
connection "voluntarily," based on its own decision that it can't serve the
incoming request.

In my nginx error.log on the load balancer, I see the message:

"*6528 upstream prematurely closed connection while reading response header from upstream"

Are there other conditions that could cause a connection to be closed prematurely?
Also, where in the source code should I look for this?
Starman itself appears to be pure Perl, with no reference to the "backlog"
parameter that it parses.  Which underlying module is doing the heavy
lifting, and where does it make a decision about returning a 502
or closing a connection?

- Maxwell

Eric Wong

Jan 5, 2017, 10:35:03 PM
to psgi-...@googlegroups.com, maxwel...@gmail.com
maxwel...@gmail.com wrote:
> That's a command I didn't know about - thanks!
>
> My traffic is bursty, rather than steadily high.
> Whenever I manage to execute "ss -tl", on the load balancer or
> the backend, I see all zeros in the Recv-Q column, even if I've
> occupied the Starman workers.

Yeah, if the worker has actually made an accept/accept4 syscall,
that'll move the connection out of the backlog so it won't show
up in Recv-Q.

> To be sure, I'm not certain that I've occupied exactly five workers
> at the instants I get the 502 errors. Serving the application requests involves
> some rather heavy processing, and some other system limit, such as memory,
> could be coming into play. However, I've never had problems hitting the
> system process limit or having the workers that are running fail due to
> memory problems. So it seems likely to me that Starman closes the
> connection "voluntarily," based on its own decision that it can't serve the
> incoming request.

I doubt it, starman is one-connection-per-process, so it
won't make any accept/accept4 calls while it's doing application
processing. You can `strace -p $PID_OF_WORKER -e accept,accept4`
to be sure.

You should also try using curl to hit the starman instances
directly and bypass nginx.
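
Something like this, with the host replaced by wherever starman
actually runs, will tell you whether the backend itself answers or
drops the connection:

curl -v http://STARMAN_HOST:6548/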

> In my nginx error.log on the load balancer, I see the message:
>
> "*6528 upstream prematurely closed connection while reading response header
> from upstream"

If all else fails, perhaps try a simple PSGI app that just calls
"sleep($A_FEW_SECONDS)" before returning to see if you can
simulate it. It may also be a network or firewall problem
between nginx and starman since they're on different boxes.
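
Something along these lines would do for the sleeping app (untested;
the filename and sleep length are arbitrary):

# sleepy.psgi -- every request sleeps a few seconds, so a handful
# of concurrent requests ties up all five workers
my $app = sub {
    my $env = shift;
    sleep 5;
    return [ 200, [ 'Content-Type' => 'text/plain' ], [ "slept\n" ] ];
};
$app;

Then run it with something like

plackup -s Starman --workers 5 --port 6548 sleepy.psgi

and hit it both through nginx and with curl directly.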

> Are there other conditions that could cause a connection to be closed
> prematurely?

Maybe workers are crashing/exiting and getting restarted?

> Also, where in the source code should I look for this?
> Starman itself appears to be pure Perl, with no reference to the "backlog"
> parameter that it parses. Which underlying module is doing the heavy
> lifting, and where does it make a decision about returning a 502
> or closing a connection?

Probably somewhere in the Net::Server::* suite which starman
inherits from. I'm not familiar with the code myself, but
the "backlog" should be the parameter passed to the listen(2)
system call which you can verify via strace.
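
For example, starting it under strace along these lines (adjust the
starman invocation to match yours) will show the listen() calls:

strace -f -e trace=listen starman --port 6548 app.psgi

The second argument to listen() in the trace is the backlog that was
requested; the kernel then caps it at net.core.somaxconn.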

maxwel...@gmail.com

Jan 9, 2017, 10:16:34 PM
to psgi-plack, maxwel...@gmail.com, e...@80x24.org
Indeed, workers were crashing and getting restarted.

What helped me track down which requests were crashing, and how,
amidst the other junk echoed to my console was:

1. export STARMAN_DEBUG=1
2. Stop echoing all bytes sent to the console when STARMAN_DEBUG==1
3. Add a debugging echo upon each read in _prepare_env(), not just _read_headers()
4. Increase read_timeout in Starman/Server.pm to eliminate one source of failures
5. Change the debugging output of Starman/Server.pm to echo the read
information to different files, one named after each worker process
(a rough sketch of this idea follows below).
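
For reference, the per-process part of step 5 boiled down to something
like this (this is not the attached patch, just a rough sketch of the
idea):

# Rough sketch: append debug lines to one file per worker, named
# after the process ID, instead of mixing everything on the console.
sub debug_to_worker_file {
    my ($message) = @_;
    open my $fh, '>>', "/tmp/starman-debug.$$"   # $$ is this worker's PID
        or return;                               # never let debugging kill the worker
    print {$fh} "[$$] $message\n";
    close $fh;
}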

Then I discovered that one of the worker processes was reading a request in its
entirety, but never reaching the point where "Closing connection" would normally
be echoed.  Debugging my application, I confirmed a condition that was actually
sending the worker process a kill signal.

My small and ugly debugging patch to Starman/Server.pm is attached.
Thanks for all the suggestions!

- Maxwell
starmanserver.diff