Sanity check: Configured registration timeout of 1 second expired, stopping


Alex Cruise

Aug 26, 2013, 2:11:58 PM
to spray...@googlegroups.com
Hey folks, we're seeing our servers occasionally get into this state where they stop accepting new HTTP connections.  At this point we're not sure if we just haven't allocated sufficient resources to handle the load, or if we're doing something wrong, so I thought I'd ask first.
We have our HTTP handler actors behind a round-robin router set to min=5, max=1234 (not the real number :).  Here's how we set that up in our main actor:
val handler = context.actorOf(Props(new RoutingRestService(...)).withRouter(FromConfig), "restHandler")
io.IO(Http)(context.system) ! Http.Bind(handler, myBindAddr, myBindPort)

Here's the relevant snippet of application.conf:
akka.actor.deployment {
  /main/restHandler {
    router = round-robin
    resizer {
      lower-bound = 5
      upper-bound = 1234
    }
  }
}

And, in the handler actor's receive method, we have this clause:
case akka.io.Tcp.Connected(remote, local) =>
sender ! spray.can.Http.Register(self)
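Pieced together, the setup described above reads roughly like the following sketch. The bind interface/port and the body of `RoutingRestService` are placeholders, not the real code from this thread:

```scala
// Sketch of the setup described in this thread; not a drop-in implementation.
import akka.actor.{ Actor, Props }
import akka.io.IO
import akka.routing.FromConfig
import spray.can.Http

// Stand-in for the real RoutingRestService: every routee simply
// accepts whatever connection comes along. (Request handling elided.)
class RoutingRestService extends Actor {
  def receive = {
    case Http.Connected(remote, local) =>
      sender ! Http.Register(self)
  }
}

class Main extends Actor {
  // Router type and resizer bounds come from
  // akka.actor.deployment."/main/restHandler" in application.conf.
  val handler = context.actorOf(
    Props[RoutingRestService].withRouter(FromConfig), "restHandler")

  // Placeholder interface/port; substitute the real values.
  IO(Http)(context.system) ! Http.Bind(handler, interface = "0.0.0.0", port = 8080)

  def receive = Actor.emptyBehavior
}
```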

So, each instance of the handler agrees to handle any connection that comes along. Now, after anywhere from days to weeks of uptime, we see this in the log (the first line happens once, the rest repeat every few seconds):

ERR [c456a1] [20130824-10:05:48.001] akka: Dead letter from Actor[akka://myThing/user/IO-HTTP/listener-0/2172] to Actor[akka://myThing/system/IO-TCP/selectors/$a/2173]! Close
WAR [c456a1] [20130824-10:06:08.202] akka://myThing/user/IO-HTTP/listener-0/2177: Configured registration timeout of 1 second expired, stopping
WAR [c456a1] [20130824-10:06:10.998] akka://myThing/user/IO-HTTP/listener-0/2178: Configured registration timeout of 1 second expired, stopping
WAR [c456a1] [20130824-10:06:26.897] akka://myThing/user/IO-HTTP/listener-0/2182: Configured registration timeout of 1 second expired, stopping
WAR [c456a1] [20130824-10:07:19.998] akka://myThing/user/IO-HTTP/listener-0/2183: Configured registration timeout of 1 second expired, stopping
WAR [c456a1] [20130824-10:07:27.097] akka://myThing/user/IO-HTTP/listener-0/2185: Configured registration timeout of 1 second expired, stopping
WAR [c456a1] [20130824-10:07:35.098] akka://myThing/user/IO-HTTP/listener-0/2186: Configured registration timeout of 1 second expired, stopping

So, the first question is, are we doing it wrong? :)  If it looks like our setup is sane, is this error a simple indication of our servers being asked to handle more than 1234 concurrent requests?

Thanks!

-0xe1a

Age Mooij

Aug 27, 2013, 4:34:17 AM
to spray...@googlegroups.com
Are you using spray-routing or the lower level spray-can stuff?

When using spray-routing you don't have to handle any low-level IO messages at all, so if you are mixing those two up you might be running into weird behavior.

Age

--
You received this message because you are subscribed to the Google Groups "spray-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spray-user+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Mathias Doenitz

Aug 27, 2013, 9:00:38 AM
to spray...@googlegroups.com
Alex,

> So, each instance of the handler agrees to handle any connection that comes
> along. Now, after anywhere from days to weeks of uptime, we see this in the
> log (the first line happens once, the rest repeat every few seconds):
>
> ERR [c456a1] [20130824-10:05:48.001] akka: Dead letter from Actor[akka://myThing/user/IO-HTTP/listener-0/2172] to Actor[akka://myThing/system/IO-TCP/selectors/$a/2173]! Close
> WAR [c456a1] [20130824-10:06:08.202] akka://myThing/user/IO-HTTP/listener-0/2177: Configured registration timeout of 1 second expired, stopping
> WAR [c456a1] [20130824-10:06:10.998] akka://myThing/user/IO-HTTP/listener-0/2178: Configured registration timeout of 1 second expired, stopping
> WAR [c456a1] [20130824-10:06:26.897] akka://myThing/user/IO-HTTP/listener-0/2182: Configured registration timeout of 1 second expired, stopping
> WAR [c456a1] [20130824-10:07:19.998] akka://myThing/user/IO-HTTP/listener-0/2183: Configured registration timeout of 1 second expired, stopping
> WAR [c456a1] [20130824-10:07:27.097] akka://myThing/user/IO-HTTP/listener-0/2185: Configured registration timeout of 1 second expired, stopping
> WAR [c456a1] [20130824-10:07:35.098] akka://myThing/user/IO-HTTP/listener-0/2186: Configured registration timeout of 1 second expired, stopping


The dead letter you are seeing might be nothing that you have to worry about.
All it shows is that the HTTP-level ServerConnection actor is telling the IO-level connection actor to close the connection.
If the IO-level connection actor has already terminated itself because the client closed the connection, this dead letter is to be expected.

The "Configured registration timeout of 1 second expired, stopping" warnings from the HTTP-level ServerConnection actors tell you that, for some reason, they are not receiving the `Http.Register` message.

> So, the first question is, are we doing it wrong? :)

By using a router as the handler you are basically using a "sharded" singleton handler, i.e. each routee will receive the messages from several connections (rather than having one handler actor per HTTP connection). The difference from using a simple singleton actor as the handler is simply that the work can be split across several threads. However, what's the purpose of using such a large number (> 1000) of routees here?
At first glance it doesn't seem valuable to run more routees than cores, no?
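For contrast, the per-connection variant Mathias alludes to might be sketched like this. The class names here are hypothetical, not taken from the thread:

```scala
// Sketch only (spray-can 1.x style APIs); names are illustrative.
import akka.actor.{ Actor, Props }
import spray.can.Http
import spray.http.{ HttpRequest, HttpResponse }

// One actor per HTTP connection: its state is private to that connection.
class PerConnectionHandler extends Actor {
  def receive = {
    case _: HttpRequest => sender ! HttpResponse(entity = "hello")
  }
}

class Listener extends Actor {
  def receive = {
    // Instead of registering a shared router, spin up a dedicated
    // handler for every accepted connection and register that.
    case Http.Connected(remote, local) =>
      sender ! Http.Register(context.actorOf(Props[PerConnectionHandler]))
  }
}
```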

> … is this error a simple indication of our servers being asked
> to handle more than 1234 concurrent requests?

No. Your setup should easily be able to handle 10K requests (and many more).

Can you rule out that your handler router (or a significant subset of its routees) has simply become unresponsive to `Connected` events?

Cheers,
Mathias

---
mat...@spray.io
http://spray.io

Luis Ángel Vicente Sánchez

Aug 27, 2013, 9:28:41 AM
to spray...@googlegroups.com
I had the same problem last week with one of the latest 1.2 nightlies using spray-routing, but I was doing load tests with gatling (> 10k rps). Increasing the registration timeout to 10s did the trick for me, so I figured the problem was that my laptop was not able to complete registration within 1s while I was sending that huge number of requests per second.
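For reference, the timeout Luis mentions lives in spray-can's server config; the path below follows the spray 1.x reference.conf, so verify it against your version:

```hocon
spray.can.server {
  # How long a fresh connection may wait for an Http.Register
  # before spray gives up and closes it (default: 1 s).
  registration-timeout = 10 s
}
```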


2013/8/27 Mathias Doenitz <mat...@spray.io>

Alex Cruise

Aug 27, 2013, 1:45:42 PM
to spray...@googlegroups.com
Hi Mathias, thanks for answering. :)

On Tue, Aug 27, 2013 at 6:00 AM, Mathias Doenitz <mat...@spray.io> wrote:
> The dead letter you are seeing might be nothing that you have to worry about.
> All it shows is that the HTTP-level ServerConnection actor is telling the IO-level connection actor to close the connection.
> If the IO-level connection actor has already terminated itself because the client closed the connection this dead letter can be expected.

Yeah, I usually ignore them unless they happen in loops. :)
 
> The "Configured registration timeout of 1 second expired, stopping" warning from the HTTP-level ServerConnection actors tell you that, for some reason, they are not receiving the `Http.Register` message.

Yep.
 
> So, the first question is, are we doing it wrong? :)

> With using router as handler you are basically using a "sharded" singleton handler, i.e. each routee will receive the messages from several connections (rather than having one handler actor per HTTP connection). The difference to using a simple singleton actor as handler is simply that the work can be split across several threads. However, what's the purpose of using a "large" number (like > 1000) routees here?
> On first glance it doesn't seem valuable to run more routees than cores, no?

The actual number isn't 1234, it's smaller. But it's larger than we'd like because there's a fair amount of blocking in our REST handlers still.
 
> Can you rule out that your handler router (or a significant subset of all routees) simply become unresponsive to `Connected` events?

They won't ever be in a state where they can't handle the Connected event, but they might simply be too busy to handle it.

I guess what I'll try next is:

- switch the router from round-robin to shortest-mailbox
- increase the number of handlers
- increase the connection timeout

Thanks!

-0xe1a

Mathias Doenitz

Aug 27, 2013, 5:05:49 PM
to spray...@googlegroups.com
Alex,

> I guess what I'll try next is:
>
> - switch the router from round-robin to shortest-mailbox
> - increase the number of handlers
> - increase the connection timeout

Ok, let us know how you fare…

Cheers,
Mathias

---
mat...@spray.io
http://spray.io

Alex Cruise

Sep 9, 2013, 2:36:34 PM
to spray...@googlegroups.com
On Tue, Aug 27, 2013 at 2:05 PM, Mathias Doenitz <mat...@spray.io> wrote:
>> - switch the router from round-robin to shortest-mailbox
>> - increase the number of handlers
>> - increase the connection timeout
>
> Ok, let us know how you fare…

OK, I did all these and this morning our server stopped accepting new connections again.  I checked on the active connections, and there aren't anywhere near the number of handlers we have in the router pool. So it seems like the handlers are either hung processing a very long message, crashed silently, or somehow need to be deregistered from spray.

Do we need to be notified and respond when a connection is finished? 

Thanks!

-0xe1a

Johannes Rudolph

Sep 9, 2013, 3:32:27 PM
to spray...@googlegroups.com
On Mon, Sep 9, 2013 at 8:36 PM, Alex Cruise <al...@cluonflux.com> wrote:
> On Tue, Aug 27, 2013 at 2:05 PM, Mathias Doenitz <mat...@spray.io> wrote:
>>> - switch the router from round-robin to shortest-mailbox
>>> - increase the number of handlers
>>> - increase the connection timeout
>>
>> Ok, let us know how you fare…
>
> OK, I did all these and this morning our server stopped accepting new connections again.  I checked on the active connections, and there aren't anywhere near the number of handlers we have in the router pool. So it seems like the handlers are either hung processing a very long message, crashed silently or somehow need to be deregistered from spray.

Have you looked at a stack trace to see what all the threads are doing (`jstack <pid>`)? That's the only sure way to rule out a thread-starvation issue.

If you are actually having a thread-starvation issue, none of the above will be a definite cure. You will need to isolate the blocking calls, hide them behind actors, and configure dedicated dispatchers for those actors so that the problem doesn't take hold of the complete actor system.
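The isolation Johannes describes might look roughly like this in application.conf; the dispatcher name and pool sizes below are illustrative assumptions, not values from this thread:

```hocon
# A dedicated thread pool for actors that make blocking calls,
# so starvation can't spread to the default dispatcher.
blocking-io-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    core-pool-size-min = 8
    core-pool-size-max = 32
  }
  throughput = 1
}
```

The blocking actors would then be created with something like `Props[BlockingWorker].withDispatcher("blocking-io-dispatcher")` (`BlockingWorker` being a hypothetical name).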
  
> Do we need to be notified and respond when a connection is finished?

What do you mean exactly?

--
Johannes

-----------------------------------------------
Johannes Rudolph
http://virtual-void.net

Alex Cruise

Sep 9, 2013, 5:44:46 PM
to spray...@googlegroups.com
On Mon, Sep 9, 2013 at 12:32 PM, Johannes Rudolph <johannes...@googlemail.com> wrote:
> Have you looked at a stack-trace what all the threads are doing (`jstack <pid>`)? That's the only sure way to rule out a thread-starvation issue.

I'll do that, thanks.  I don't think we're running out of threads; I have parallelism-factor=100.

The thread dump doesn't contain anything that's doing actual work, and in any case I'm looking at a server that currently has very little activity except for Pingdom making a request every minute.
 
>> Do we need to be notified and respond when a connection is finished?
>
> What do you mean exactly?

Well, when we accept a new connection, we need to respond to a `Connected` event with a `Register`; I was wondering if there's something analogous for when a connection is no longer in use.

-0xe1a

Mathias Doenitz

Sep 10, 2013, 4:00:22 AM
to spray...@googlegroups.com
Alex,

> Well, when we accept a new connection, we need to respond to a Connected
> request with a Register; I was wondering if there's something analogous
> when a connection is no longer in use.

No, when the connection is closed (for whatever reason) you'll receive an `Http.ConnectionClosed` event and that's it.
If you watch the connection actor from your own code you'll also see a `Terminated` event some time afterwards.

I think in your case I'd try to find out whether the HTTP/IO layers somehow stop sending `Http.Connected` events or whether your application code simply stops responding to them. If you have a way of reproducing the strange state your application gets into, you could set `akka.io.tcp.trace-logging = on` (which will produce a lot more low-level log messages) and then see what happens when your server gets into the "funny state".
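The close notifications Mathias describes might be handled like this sketch (message types per spray 1.x; the clause placement and actor name are assumptions):

```scala
// Sketch: observing connection shutdown in a spray-can handler actor.
import akka.actor.{ Actor, Terminated }
import spray.can.Http

class Handler extends Actor {
  def receive = {
    case Http.Connected(remote, local) =>
      sender ! Http.Register(self)
      context.watch(sender) // opt in to a Terminated for this connection actor

    // (request-handling cases elided)

    case _: Http.ConnectionClosed =>
      // Connection is done (PeerClosed, Closed, ErrorClosed, ...);
      // release any per-connection state here. No reply is required.

    case Terminated(_) =>
      // The watched connection actor itself is gone; final cleanup point.
  }
}
```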

Cheers,
Mathias

---
mat...@spray.io
http://spray.io
