Measuring wait time inside Thin


Daniel Doubrovkine

Feb 15, 2013, 2:38:07 PM
to thin...@googlegroups.com
I am sure many have read about the RapGenius Heroku debacle. Related to that, I've set out to measure how much time we're spending inside single-threaded Thin waiting for a request to be executed while another request is being processed.

I came up with this:

# Monkey patch of Thin's connection (assumes thin/eventmachine are already loaded):
# record when the connection was opened and when its first bytes arrive.
module Thin
  class Connection < EventMachine::Connection
    attr_accessor :init_ts, :data_ts

    def initialize
      @init_ts = Time.now
      super
    end

    alias_method :_receive_data, :receive_data
    def receive_data(data)
      unless @data_ts
        @data_ts = Time.now
        # note: to_i truncates to whole seconds
        puts "time spent waiting: #{(@data_ts - @init_ts).to_i}"
      end
      _receive_data(data)
    end
  end
end

I suspect my assumptions are incorrect. A keep-alive connection is created for any new web client, and the first data is received when Thin is ready to process it (one request at a time). My question is whether it's possible that another wait will happen on an already open connection for a given client?

If so, what's a good place to stick the start/end times? 

Thx
dB.

--

dB. | Moscow - Geneva - Seattle - New York
dblock.org - @dblockdotorg

Marc-André Cournoyer

Feb 15, 2013, 3:26:20 PM
to thin...@googlegroups.com
Hi Daniel,

Are you trying to measure the time from connection open up to the first byte received? Not sure this will tell you how long requests are kept in the queue.

Perhaps just log the time of the first byte (data_ts) in a class variable and compute the difference with the previous value.
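
Something like this (a rough sketch; a single reactor thread is assumed, so a class variable is good enough for rough numbers, and the alias name is just to avoid clobbering the original method):

module Thin
  class Connection < EventMachine::Connection
    @@prev_first_byte = nil

    alias_method :_orig_receive_data, :receive_data
    def receive_data(data)
      unless @first_byte_seen
        @first_byte_seen = true
        now = Time.now
        # gap between this request's first byte and the previous request's first byte
        puts "since previous first byte: #{now - @@prev_first_byte}s" if @@prev_first_byte
        @@prev_first_byte = now
      end
      _orig_receive_data(data)
    end
  end
end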

--
M-A

Daniel Doubrovkine

Feb 15, 2013, 3:38:41 PM
to thin...@googlegroups.com
No, I am not.

Consider the scenario of 100 concurrent connections to a single-threaded Rails application. Thin will make, say, 10 accept() calls, at which point 9 will wait until the first request is processed. If processing takes 1s per request, then the first will not wait, the second will wait for 1s, the third for 2s, etc. That's the wait time I am trying to measure.

Marc-André Cournoyer

Feb 15, 2013, 3:47:55 PM
to thin...@googlegroups.com
OK, then measure time in Connection#connection_completed and Connection#unbind.

Connection 0: completed0, unbind0
Connection 1: completed1, unbind1

queued time for Connection 1 (time between requests, excluding processing time) = completed1 - unbind0
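
For example (a rough sketch; note that for accepted server-side sockets EM fires post_init rather than connection_completed, so the sketch hooks post_init, and the class variable again assumes a single reactor thread):

module Thin
  class Connection < EventMachine::Connection
    @@last_unbind_ts = nil

    alias_method :_orig_post_init, :post_init
    def post_init
      # time between the previous connection finishing and this one starting
      puts "queued for: #{Time.now - @@last_unbind_ts}s" if @@last_unbind_ts
      _orig_post_init
    end

    alias_method :_orig_unbind, :unbind
    def unbind
      @@last_unbind_ts = Time.now
      _orig_unbind
    end
  end
end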

Daniel Doubrovkine

Feb 15, 2013, 3:53:48 PM
to thin...@googlegroups.com
Doesn't that assume full load? If I have a request every 5 minutes, don't I have 5 minutes between unbind0 and completed1?

Marc-André Cournoyer

Feb 15, 2013, 4:04:52 PM
to thin...@googlegroups.com
Oh I thought you were just benchmarking and controlling the speed of requests.

Check this out: https://gist.github.com/a-warner/f5db30857ed3423cea79. On Heroku you can read env['HTTP_X_REQUEST_START'].
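
For example, a minimal Rack middleware along those lines (a sketch, not the gist itself; QueueTimeLogger is a made-up name, and it assumes the router sends HTTP_X_REQUEST_START as integer milliseconds since the epoch):

class QueueTimeLogger
  def initialize(app)
    @app = app
  end

  def call(env)
    if (start_ms = env['HTTP_X_REQUEST_START'])
      # time between the router stamping the request and the app seeing it
      queued_ms = (Time.now.to_f * 1000).to_i - start_ms.to_i
      env['app.queue_time_ms'] = queued_ms
      puts "queue time: #{queued_ms}ms"
    end
    @app.call(env)
  end
end

Mounted early in the Rack stack, so it runs before anything slow.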

Daniel Doubrovkine

Feb 15, 2013, 4:32:35 PM
to thin...@googlegroups.com
Ok, this is good :)

Shouldn't it add the value of HTTP_X_HEROKU_QUEUE_WAIT_TIME to the queue time to get total queuing? Isn't that the time spent in the routing mesh?

Daniel Doubrovkine

Feb 15, 2013, 4:45:43 PM
to thin...@googlegroups.com
I mean we're replacing the Heroku queue time with the app's queue time (awesome). But there's still potentially queue time in Heroku itself that is getting dropped here. Any idea how to split them for New Relic?

Daniel Doubrovkine

Feb 15, 2013, 5:59:58 PM
to thin...@googlegroups.com
Sorry to be spammy.

I tried all of this out. There's an additional problem: clock skew between the Heroku router and the dyno. I am seeing something like 5-7ms. Curious what you think.

James Tucker

Feb 15, 2013, 7:49:35 PM
to thin...@googlegroups.com
The application's accept(2) calls are separate from TCP-level accepts.

Depending on exactly how the Heroku load balancers are implemented (particularly what layer), you probably have no real defense against this.

Your best option is to tune the accept backlog (assuming a low-layer load balancer); that should inform the OS, and in turn the LB, when your process is "full". We never exposed an API in EventMachine for this tunable, so you may have trouble.
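
For comparison, outside of EM the backlog is just the argument passed to listen(2); a toy sketch of the tunable in plain Ruby (port 3000 is arbitrary):

require 'socket'

server = Socket.new(:INET, :STREAM)
server.setsockopt(:SOCKET, :REUSEADDR, true)
server.bind(Addrinfo.tcp('0.0.0.0', 3000))
server.listen(1)  # tiny backlog: once it's full, further SYNs queue at the OS or go unanswered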

I'd be happy to accept a patch to EM to allow for this; it's presently hardcoded here:


The other problem here is that EM, being concurrency-focused, prefers accepting more than just a single socket per "request" (the per-request aspect comes from how a web server is implemented on top of EM, blocking for a whole request at a time):


Again, this could be made tunable. For a single-threaded Thin-style use case, this loop would essentially want to be eliminated, as concurrency is not desired in this way.
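
In Ruby terms, that loop behaves roughly like this (a sketch only; the real code is C++ inside EM's reactor, and 10 is the hardcoded per-tick limit being referred to):

require 'socket'

listen_socket = TCPServer.new(3000)  # arbitrary port for illustration
pending = []                         # stands in for EM's internal queue of accepted descriptors

10.times do
  begin
    pending << listen_socket.accept_nonblock
  rescue IO::WaitReadable
    break  # listen backlog empty: stop accepting for this tick
  end
end
# the sockets in `pending` are only read from on a later reactor tick
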

What this all really means is that this is the direct cause of why low-layer load balancers cannot balance Thin in a very "fair" way today. Thin will gobble up connections faster than it can service them, so under more than your serviceable load, you will get very unfair scheduling at the load balancer. It is, as always, worth noting that this happens when you have more than your serviceable load, so you should really be provisioning new backends already.

Daniel Doubrovkine

Feb 16, 2013, 8:04:24 PM
to thin...@googlegroups.com
James, this is super helpful.

Daniel Doubrovkine

Feb 16, 2013, 8:10:55 PM
to thin...@googlegroups.com
I meant to ask, looking at the second link: is this running on a separate thread behind a select()? 

For example, if I have a request that takes 20 seconds. First request comes in, the loop runs once, and EM gets it, we're now processing. Then a second request comes in - does this loop run again, or do we wait 19 seconds or so before another accept() is called?

James Tucker

Feb 17, 2013, 2:27:44 PM
to thin...@googlegroups.com
On Feb 16, 2013, at 5:10 PM, Daniel Doubrovkine <dbl...@dblock.org> wrote:

I meant to ask, looking at the second link: is this running on a separate thread behind a select()? 

It is not threaded, so you have some serialization of events. The reactor treats accept descriptors like normal open connections in the core loop, where the "readable" event and "read" action are actually executing the accept loop. This processing happens once per reactor loop. A reactor loop will occur when Thin gives control back to EM (i.e. after each request handler completes). If you're running EM in threaded mode, as far as EM is concerned, the request handler completes very quickly, and EM's loop runs often (which can make the problems discussed here worse).

For example, if I have a request that takes 20 seconds. First request comes in, the loop runs once, and EM gets it, we're now processing.

So IIRC there is actually a one-tick delay between the accept and when socket data will actually get sent to handlers. This is for architectural/concurrency reasons; it could be changed, but it would have an effect on throughput for other applications.

Then a second request comes in - does this loop run again, or do we wait 19 seconds or so before another accept() is called?

Disregarding the additional loop step described above for a moment (it actually makes this problem worse, but is harder to reason about): in this simplified model, and assuming Thin is in single-threaded mode, your description is roughly correct. A request will block the reactor, and only once it has unblocked the reactor will the accept loop run again. The accept loop will pull up to 10 connections off the queue.

In reality, there are a few other things happening:

 * The accept loop runs every "tick" (every gap between request processing; note this can be /concurrent/ request processing, and concurrent does not mean parallel: in reality they'll be serialized)
 * Accepted connections go into a queue and are only read from in the next tick
 * During a tick, all /live/ connections get a large buffer read and are then passed to their respective handlers (often enough for a full request)
 * During a tick, as all of the live connections have whole requests in buffers, EM, and subsequently Thin, will dispatch them all between ticks.

So you get a kind of batching effect here, but it can get quite catastrophic under load:

 * Reactor loop
 * Pull 10 connections off the listen backlog
 * Put them in an internal queue
 * Reactor loop
 * Pull another 10 connections off the listen backlog
 * Put them in an internal queue
 * Read a full buffer off of each connection (the connections from the first loop)
 * Thin will dispatch some or all, depending on buffer contents (if all, and each request takes 20s, this could take 200s - more likely (as this has taken little time), only one or two connections have dispatched)
 * Reactor loop
 * Pull another 10 connections off the listen backlog
 * Put them in an internal queue
 * Read a full buffer off of each connection (the connections from the first and second loops)
 * Thin will dispatch some or all, depending on buffer contents (probably now all remaining from the first loop, and maybe some from the second)
 * Reactor loop

By the end of this little event, depending on client send performance, size of request data, and some other factors, you could have over 25 connections enqueued inside EM/Thin, and another 100 in the OS queue (under very high load, this scenario will happen). Due to the tick differential as many open connections get backed up, coupled with the 10-connection accept loop, this /can/ result in EM/Thin simply pulling an ever-increasing number of connections off the queue. This becomes particularly harmful when those connections are getting so stale that they've already been aborted on the load balancer / client side.

Is this a Thin bug, or an EM bug, or a fault of the system? Well, that's hard to really quantify. It's a mismatch of design goals. EM is designed to enable high performance at medium levels of concurrency, and it really is tuned for concurrency. EM is really not designed for long handler operations; you want to "answer" a receive_data quickly and unblock the reactor. This is not what a Rails app or even a Sinatra application will do. It's possible you could make the latter relatively low latency, but you're still talking the "ms" range generally. If you do the math, say you're responding in 5ms: you are going to peak out at < 200 r/s (1 request / 0.005s = 200 requests/s per process) before you start creating a potentially unserviceable backlog, or at the very least adversely affecting your response rate statistics. If you put this all together, and think carefully about the queue delay caused by long requests, all this should start to make a lot more sense.

Put simply, you want to use systems like EM for queues that contain short processing (or where the processing is split into short (read: concurrent) parts). If you care about reliability and stable response rate statistics, you also want to ensure that the standard deviation of your processing time is very low (at least an order of magnitude lower than your mean processing time). If you don't maintain those tight processing rate statistics, you're lining yourself up for catastrophic queue backlogs in the worst case - all it takes is an unusually high distribution of long processes, and the receiving instance is toast.

If this kind of fast processing isn't the work you're doing, and you don't have the ability to make processing-time distribution guarantees at your load balancer (by URL or such), then you probably don't want a server that has these semantics. If you can make certain assurances about distribution and have plenty of capacity, then you can potentially disregard all this, but I doubt that applies to many Rails web apps.

HTH

Daniel Doubrovkine

Feb 17, 2013, 3:16:28 PM
to thin...@googlegroups.com
Wow, and so, since we're on the Thin mailing list, why is Thin using EM? Sounds like EM only does 1 good thing in this scenario, which is to accept something faster and sit on it, well ... sometimes.

For a web server, for a single-threaded case wouldn't we be better off with a basic loop: blocking accept, process, blocking accept process? And for a MT case a basic threadpool: select / accept -> pass to a thread from the pool?

Marc-André Cournoyer

Feb 17, 2013, 5:34:45 PM
to thin...@googlegroups.com
Wow, and so, since we're on the Thin mailing list, why is Thin using EM? Sounds like EM only does 1 good thing in this scenario, which is to accept something faster and sit on it, well ... sometimes.

That's the only _bad_ thing it's doing in the context of Thin. The problem seems to be that EM gives priority to accept (w/ the for loop) over read/write to other active connections. That's indeed a problem. I'll see if I can submit a patch.

For a web server, for a single-threaded case wouldn't we be better off with a basic loop: blocking accept, process, blocking accept process?

Doing a blocking accept or any other blocking operation will block the event loop from processing (read/write) other active connections.
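
For concreteness, the single-threaded loop you're describing is roughly the following toy (no keep-alive, no real HTTP parsing); inside EM, that blocking accept would freeze every other connection while it waits:

require 'socket'

server = TCPServer.new(3000)
loop do
  client = server.accept                  # blocks until the OS hands over one connection
  client.readpartial(16 * 1024)           # naively read one buffer of request data
  client.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok")
  client.close
end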
 
And for a MT case a basic threadpool: select / accept -> pass to a thread from the pool?

That's exactly what happens if you run in threaded mode via the --threaded option.

--
M-A

James Tucker

Feb 18, 2013, 1:31:39 PM
to thin...@googlegroups.com
On Feb 17, 2013, at 12:16 PM, Daniel Doubrovkine <dbl...@dblock.org> wrote:

Wow, and so, since we're on the Thin mailing list, why is Thin using EM? Sounds like EM only does 1 good thing in this scenario, which is to accept something faster and sit on it, well ... sometimes.

EM doesn't "sit on it". EM is pretty clear about "don't block the reactor", Thin breaches this. Actually, that's not fair on Thin either, as Thin doesn't breach this, applications do.

For a web server, for a single-threaded case wouldn't we be better off with a basic loop: blocking accept, process, blocking accept process?

See Unicorn, etc.

And for a MT case a basic threadpool: select / accept -> pass to a thread from the pool?

This is subject to some load balancing, latency and throughput concerns as well. It is also possible (and often simpler) to use a shared accept model. Please be aware, as I am trying to make clear, that any time you pre-accept, you're creating the potential for a bad case of the scenario the OP was talking about.

Neither of the latter solutions is immune to these problems either; again, it depends on the layer at which the load balancer operates. A low-layer round-robin load balancer, and in some particularly evil cases even a weighted high-layer load balancer, will still make bad decisions unless you make your response rates stable. Without additional configuration, you're using constant algorithms against what is essentially non-constant data; the likelihood of error cases is inherently high.

In the Unicorn case of these scenarios, your OS is actually saving you from some of this, because your OS is smart. The default listen backlog is incorrect for this scenario, and can very much result in the same situation. It's also worth noting that if your OS weren't doing clever tricks, you'd find many layer 7 load balancers would also be freaking out (e.g. haproxy).

Daniel Doubrovkine

Feb 18, 2013, 2:23:57 PM
to thin...@googlegroups.com
I want to thank everyone for detailed explanations. This has been very helpful.

I was asking this in the context of the Heroku load-balancing. We went down the wrong path when first trying to measure the actual wait time inside the web server, blocked on an already-processing request, by monkey-patching Thin. We ended up with what's described in http://artsy.github.com/blog/2013/02/17/impact-of-heroku-routing-mesh-and-random-routing instead, which has its drawbacks.

We'll get cooperation from the Heroku router at some point, but we have also been exploring options like closing the listening socket after a request has come in (see https://github.com/dblock/1aat-ruby), causing the router to try another dyno. I know at least two people who went down the path of trying to get EM to do the same, and now I completely understand why they have not been successful at it.

cheers
dB.

James Tucker

Feb 18, 2013, 9:49:18 PM
to thin...@googlegroups.com
On Feb 18, 2013, at 11:23 AM, Daniel Doubrovkine <dbl...@dblock.org> wrote:

I want to thank everyone for detailed explanations. This has been very helpful.

I was asking this in the context of the Heroku load-balancing. We went down the wrong path when first trying to measure the actual wait time inside the web server, blocked on an already-processing request, by monkey-patching Thin. We ended up with what's described in http://artsy.github.com/blog/2013/02/17/impact-of-heroku-routing-mesh-and-random-routing instead, which has its drawbacks.

We'll get cooperation from the Heroku router at some point, but we have also been exploring options like closing the listening socket after a request has come in (see https://github.com/dblock/1aat-ruby), causing the router to try another dyno. I know at least two people who went down the path of trying to get EM to do the same, and now I completely understand why they have not been successful at it.

You don't need to close the socket, you just need to tune the accept backlog. If the Heroku balancer is written at all sensibly (which I suspect it is), then it'll have a short TCP_CONNECTIONTIMEOUT set, and will move on to another dyno when the backlog is full. The patch isn't that hard at all, you'll mainly have to deal with how to pass the options without breaking the existing APIs.
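
A sketch of what passing such an option might look like (purely hypothetical: EM exposes no backlog keyword today, this is only the shape the patch would take):

# hypothetical API only: EventMachine does not currently accept a backlog option
EventMachine.run do
  EventMachine.start_server('0.0.0.0', 3000, Thin::Connection, backlog: 1)
  # with the backlog full, excess SYNs go unanswered and the router's short
  # connect timeout can kick in and retry another dyno
end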

Daniel Doubrovkine

Feb 19, 2013, 7:58:58 AM
to thin...@googlegroups.com
Unfortunately most operating systems don't treat this number as exact and will adjust it dynamically, usually to something larger than the number set. So it's nothing more than a hint.

James Tucker

Feb 19, 2013, 11:09:52 AM
to thin...@googlegroups.com


On Tuesday, February 19, 2013, Daniel Doubrovkine wrote:
Unfortunately most operating systems don't treat this number as exact and will adjust it dynamically, usually to something larger than the number set. So it's nothing more than a hint.


It's again not that simple. I would recommend you prefer protocol specifications over Stack Overflow; they're often still going to be incomplete, but at least they shouldn't lead you astray.

In TCP, there isn't actually a "this queue is full" response, or even a "this connection is refused" one. Refused is typically indicated with an RST in response to a SYN. This is distinct from a "connection reset", which is when an RST is sent after the SYN-ACK dance completes. In the case where the port is open but the accept backlog is full, the appropriate response is actually not to respond to the SYN at all. The theory is that the client will assume the SYN was lost and later retransmit it, which may succeed. By contrast, sending an RST in this scenario may very well be terminal for the client, and leaves no distinction between misconfiguration and busy. It is of course possible to configure things differently, to put firewalls in the way that provide additional semantics and so on, but this hanging SYN is the typical mechanism used. A lower-layer load balancer will set its TCP connect timeout quite low in such scenarios, so as to time out and move on to another backend more promptly. You can tune such things based on expected latency and allowable packet loss.
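
In Ruby terms, the balancer-side behaviour amounts to something like this (10.0.0.5:3000 is a made-up backend address; a dropped SYN simply surfaces as a connect timeout):

require 'socket'

begin
  # a full accept backlog on the backend means our SYN goes unanswered,
  # and this raises Errno::ETIMEDOUT after 50ms
  Socket.tcp('10.0.0.5', 3000, connect_timeout: 0.05) { |sock| sock.close }
rescue Errno::ETIMEDOUT, Errno::ECONNREFUSED
  # move on to the next backend
end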

 

