Timeout in grid/node/browser


Kristian Rosenvold

Feb 18, 2012, 5:17:24 AM
to selenium-...@googlegroups.com
We've been discussing this on and off on irc for some time now; as
most of you are aware,
there is no timeout on "get" and "evaluateScript", meaning that the
browser can theoretically
hang indefinitely on "get" or while running javascript code.

The grid actually enforces a timeout with regard to hanging browsers,
and I think my (current) conclusion is that this timeout is under all
circumstances incorrect
and should be removed. The grid should only terminate sessions when
the client decides to die.

Architecturally, it seems to me like the timeout should be in the browser,
implying that under normal operating conditions, no-one else needs to have
timeout handling. As a secondary solution, the node (selenium-server) can
enforce a timeout.

But not the grid. That's just asking for trouble.

WDYT ?

Kristian

David Burns

Feb 18, 2012, 5:33:06 AM
to selenium-...@googlegroups.com
I agree: since Remote just passes things to the browser anyway, having the browser control it sounds like a good plan to me!


David Burns
URL: http://www.theautomatedtester.co.uk/




--
You received this message because you are subscribed to the Google Groups "Selenium Developers" group.
To post to this group, send email to selenium-...@googlegroups.com.
To unsubscribe from this group, send email to selenium-develo...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/selenium-developers?hl=en.


François Reynaud

Feb 18, 2012, 8:07:57 AM
to selenium-...@googlegroups.com
But that's assuming nothing ever goes wrong with the browser, right?

Kristian Rosenvold

Feb 18, 2012, 9:30:17 AM
to selenium-...@googlegroups.com
Well I think there may be room for a well-defined
timeout on the selenium-server node; we're all aware that reality bites ;)

I've been trying to come to grips with what/how this
should actually be done, and this is as far as I've gotten:

The well-behaved webdriver client is expected to
have a try/finally relationship to the quit method:

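    WebDriver driver = new FirefoxDriver();  // or any other WebDriver implementation
    try {
        // ... run test ...
    } finally {
        driver.quit();  // always runs, even when the test throws
    }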

This is normally enforced by the test framework,
where (at least) JUnit ensures this behaviour wrt
@After and @AfterClass.

The way I understand this: if the selenium-server decides to
time-out a get operation, this should be associated with the
selenium-server-session ("isTimedOut"). When the subsequent
call to "quit" arrives, the server must kill the process (at the os-level); no
mr nice guy calling "quit" since the state of the browser is undefined.
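
A minimal sketch of the node-side behaviour described above, under stated assumptions: `BrowserProcess` and `NodeSession` are hypothetical names, not Selenium classes, and the timeout is enforced with a plain `Future.get(timeout)`. It only illustrates the idea that a timed-out session is flagged and that a later "quit" kills the process instead of asking the browser nicely.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical handle on the browser process owned by one node session. */
interface BrowserProcess {
    void quitGracefully(); // normal shutdown through the driver
    void killForcibly();   // OS-level kill, e.g. via Process.destroyForcibly()
}

/** Hypothetical node-side session that remembers whether a command timed out. */
class NodeSession {
    private final BrowserProcess browser;
    private final long commandTimeoutMillis;
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private volatile boolean timedOut;

    NodeSession(BrowserProcess browser, long commandTimeoutMillis) {
        this.browser = browser;
        this.commandTimeoutMillis = commandTimeoutMillis;
    }

    boolean isTimedOut() {
        return timedOut;
    }

    /** Runs one command (for example "get") and marks the session if it times out. */
    <T> T execute(Callable<T> command) throws Exception {
        Future<T> pending = worker.submit(command);
        try {
            return pending.get(commandTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            timedOut = true;      // browser state is undefined from here on
            pending.cancel(true); // best effort; a hung browser may ignore the interrupt
            throw e;              // passed back to the client as a fault
        }
    }

    /** Handles the client's "quit": graceful normally, a hard kill after a timeout. */
    void quit() {
        try {
            if (timedOut) {
                browser.killForcibly();
            } else {
                browser.quitGracefully();
            }
        } finally {
            worker.shutdownNow();
        }
    }
}
```

The flag lives on the session rather than on the command, so every later call, including "quit", can see that the browser is no longer trustworthy.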

As of today, if the browser hangs (on get, javascript or FUD), it eventually
hits the 3 hour max socket timeout I added some versions ago. We frequently
see that happening on our CI.

I am not sure users without grid/server actually *need* any
timeout in their stack, though I'm sure that's just my short-sightedness.

I think this stuff is pretty hard to get right...

Kristian


François Reynaud

Feb 18, 2012, 3:20:44 PM
to selenium-...@googlegroups.com
We cannot expect driver.quit to be called all the time. It is not in my team, at least.
All the tests have a finally quit(), but you always have someone stopping the test in the middle, or closing their laptop to go to a meeting in the middle of a test suite.

In the current model, the grid generates the timeout, which allows all the resources to be released when that happens. If the server now does the timeout, how will you release the session on the grid? Will the browser contact the grid before closing, or something like that?

Kristian Rosenvold

Feb 18, 2012, 3:48:30 PM
to selenium-...@googlegroups.com
That's all right; I was more concerned about the fact that quit /will/
be called, more than the possibility that it might *not* be called.

Not calling quit or the client simply dying can and should be handled
on the grid with the current clean up thread mechanism. That's a
controlled, non-reentrant thing; the grid only deals with client death
and can do all the current stuff when the client has decided to die.

The thing is, grid should not generate or act on timeouts for any
request that is being forwarded to the node (which it currently does).
In this case, acting on timeout is the responsibility of the node (or
even the browser, it's a SEP as seen from the grid). Furthermore, the
node will generate a fault that'll just pass straight through the grid
and on to the client, and it's the client's responsibility to call
quit. (This is where the try/finally bit comes in which most java test
frameworks just do anyway). If the client decides to puke on the
exception and not call quit, it's just the same scenario as client
dying above.
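
To make the split concrete, here is a rough sketch of a grid-side cleanup check that reacts only to client death and never to a request that is still being handled by the node. `HubSession` and `HubCleanup` are hypothetical names for illustration; the actual grid cleanup thread is not shown in this thread and may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hub-side record of one client session. */
class HubSession {
    final String id;
    private long lastClientActivityMillis;
    private boolean requestInFlight;

    HubSession(String id, long nowMillis) {
        this.id = id;
        this.lastClientActivityMillis = nowMillis;
    }

    /** Called when the hub forwards a client request to the node. */
    synchronized void requestForwarded(long nowMillis) {
        requestInFlight = true;
        lastClientActivityMillis = nowMillis;
    }

    /** Called when the node's response (or fault) has been relayed to the client. */
    synchronized void responseRelayed(long nowMillis) {
        requestInFlight = false;
        lastClientActivityMillis = nowMillis;
    }

    /** True only when the client has been silent too long between requests. */
    synchronized boolean isOrphaned(long nowMillis, long maxIdleMillis) {
        // A request in flight is the node's problem, so the hub never times it out.
        return !requestInFlight && nowMillis - lastClientActivityMillis > maxIdleMillis;
    }
}

class HubCleanup {
    /** Returns the sessions whose client appears to have died. */
    static List<HubSession> findOrphans(List<HubSession> sessions, long nowMillis, long maxIdleMillis) {
        List<HubSession> orphans = new ArrayList<>();
        for (HubSession session : sessions) {
            if (session.isOrphaned(nowMillis, maxIdleMillis)) {
                orphans.add(session);
            }
        }
        return orphans;
    }
}
```

With this shape, a browser that hangs on "get" never trips the hub's check; it is the node's fault response that eventually clears the in-flight flag.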

Daniel Wagner-Hall

Feb 19, 2012, 4:08:12 AM
to selenium-...@googlegroups.com, selenium-...@googlegroups.com
On 18 Feb 2012, at 15:30, Kristian Rosenvold <kristian....@gmail.com> wrote:

> As of today, if the browser hangs (on get, javascript or FUD), it eventually
> hits the 3 hour max socket timeout I added some versions ago. We frequently
> see that happening on our CI.

Any idea why? I've very rarely if ever seen this...

Dave Hunt

Feb 19, 2012, 9:12:56 AM
to selenium-...@googlegroups.com
I am also coming to the conclusion that the current timeout situation is causing the majority of our Selenium Grid issues (very long running jobs, browser processes 'orphaned' on nodes, etc). I would support anything that fixes this, as it's happening frequently for us.

The proposal sounds good to me. I suspect you've considered this already, but our framework calls a few additional things before quit (get screenshot, HTML, etc). If we need to modify this to jump straight to the quit on certain exceptions (such as timeouts) then that's fine by me.

Thanks for looking into this Kristian & François!

Dave

Kristian Rosenvold

Feb 19, 2012, 2:41:01 PM
to selenium-...@googlegroups.com
A) Firefox refusing to start due to some funky alert-box that no-one
   knows exists ;)
B) Twilight zone on the network
   The ops-guys have decided they can reboot anything they like
   routinely around 4am. If the tests were running when they (or the
   gnomes they keep in the basement) rebooted the master switch or one
   of its 20 peers, we seem to hit the 3 hour trap. Or maybe it's the
   dns they boot. Or maybe the ops guys boot the switches, the monkeys
   reboot the routers and the gnomes restart all the wildly
   non-transparent "transparent" monitoring stuff in the network.

I've just stopped running the tests from 3am to 6am to avoid the
whole problem; unsure if that is a can of worms we'd ever want to open ;)

Kristian


Kristian Rosenvold

Feb 19, 2012, 2:44:28 PM
to selenium-...@googlegroups.com
2012/2/19 Dave Hunt <dave...@gmail.com>:

>
> The proposal sounds good to me. I suspect you've considered this already, but
> our framework calls a few additional things before quit (get screenshot,
> HTML, etc). If we need to modify this to jump straight to the quit on
> certain exceptions (such as timeouts) then that's fine by me.

This is the point where my eyes go all glazy and I seem to stare
right through you. I hope Simon will correct me on this, but
after a timeout you cannot really safely do any of those things.

If you just had an ordinary error, fine, but after timeout not really.
Even calling "quit" is technically illegal after a timeout, which is
why I suggested that the selenium server kill the process on quit.

Someone please tell me I'm wrong ;)

Kristian

Kristian Rosenvold

Feb 19, 2012, 2:57:05 PM
to selenium-...@googlegroups.com
Technically, I'm talking about timeouts that originate outside
the browser here; when the node/hub decides that you are
timed out. I've already made my case that the hub can't do this.
The node can, as long as it just kills the process.

K


Dave Hunt

Feb 19, 2012, 7:04:27 PM
to selenium-...@googlegroups.com
I think you misunderstood. I don't want to do any of these things after a timeout, but wanted to let you know that my framework (and therefore potentially others) may still try to. I can modify it if needed, but ideally these calls would just fail, as they currently do if a session is no longer available.
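
For what it's worth, a framework teardown along these lines could be sketched as follows. Everything here is hypothetical: `TestDriver` is a stand-in interface rather than the real WebDriver API, and `SessionTimedOutException` is an invented marker for "the node or browser reported a timeout". The point is only that diagnostics are skipped after a timeout and that quit is always attempted.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal view of a driver, just enough for teardown. */
interface TestDriver {
    String captureScreenshot();
    String capturePageSource();
    void quit();
}

/** Hypothetical marker for failures after which the browser state is undefined. */
class SessionTimedOutException extends RuntimeException {
    SessionTimedOutException(String message) {
        super(message);
    }
}

class Teardown {
    /** Collects diagnostics only when the session is still usable, then always quits. */
    static List<String> run(TestDriver driver, Throwable failure) {
        List<String> artifacts = new ArrayList<>();
        try {
            boolean usable = !(failure instanceof SessionTimedOutException);
            if (failure != null && usable) {
                artifacts.add(driver.captureScreenshot());
                artifacts.add(driver.capturePageSource());
            }
        } catch (RuntimeException e) {
            // Diagnostics are best effort; a dead session simply yields fewer artifacts.
        } finally {
            driver.quit(); // jump straight here after a timeout
        }
        return artifacts;
    }
}
```

If the diagnostic calls simply fail on a dead session, the catch block already covers a framework that does not distinguish timeouts at all.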

Santiago Suarez Ordoñez

Mar 3, 2012, 4:51:18 PM
to selenium-...@googlegroups.com
I see your point, Kristian. From a theoretical standpoint, we wouldn't want the grid to interfere with tests and take on responsibilities that the nodes should take on themselves. In practice, though, there are a lot of edge cases that an ideal long-lasting grid will eventually have to deal with.

As most of you might know, I work at Sauce Labs and we have a product (Sauce OnDemand) that basically acts as a shared grid for our customers' tests. Because of that, and particularly because we charge by the minute, we can't afford to overcharge people when edge scenarios force their tests to run longer than expected. Even if it's our customers' fault.

In case it helps as guidance for the project or maybe your own setups, let me go through all the different timeouts we implemented at Sauce. I'd love to hear your opinions on whether we're missing something or if we could rely on different mechanisms for some of these. Here's the list:
  • Browser start timeout: We put a timeout around the getNewBrowserSession for both Selenium 1 and 2. If we don't detect a browser start in 90 seconds, we basically re-assign the test to a different node, recreate the session request and destroy the current node. This is transparent to our customers and can happen up to 3 times before we finally decide we can't start a browser with the configuration sent from the client. This was a huge win around some IE crashes that IIRC happened up to 0.1% of the time.
  • Idle timeout: As a safety net around our customers' tests crashing, laptops closing mid test, or connectivity to sauce just being interrupted, we have a timeout between commands that basically stops the test after we replied to a command and a new one wasn't received. We chose this to be 90 seconds but made it configurable.
  • Selenium command timeout: This one is a safety net around commands while they are running. We are aware that in most cases, this timeout is unnecessary and has potential to affect purposely long-running commands like execute_scripts, long page loads or implicit waits. But we run over a million tests a month and at that size, we start seeing things like java crashes, browser crashes, blocking OS or browser popups more than we wanted. And again, we can't afford to let the timer run incorrectly to later charge a paying customer for the situation. We set this to 300 seconds by default but made it configurable for those who need it.
  • Total test duration: In case we miss something or an unknown edge case occurs (and believe me, they eventually do), this timeout is our ultimate safety net. This won't allow any test to run longer than 30 minutes. For users with the need and at their own risk, we've made this configurable as well.
By setting reasonable default values for all of these, we found they affect a surprisingly small number of customers compared to the ones they help on a regular basis.
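
As an illustration only, the four timeouts and their defaults could be captured in a small policy object like the one below. `TimeoutPolicy` and `TimeoutKind` are hypothetical names and this is not how Sauce implements it; the only values taken from the list above are 90 seconds, 90 seconds, 300 seconds, 30 minutes and the limit of 3 browser start re-assignments.

```java
import java.time.Duration;

/** Which safety net fired, if any. */
enum TimeoutKind { NONE, IDLE, COMMAND, TOTAL_DURATION }

/** Hypothetical, configurable holder for the four timeouts listed above. */
final class TimeoutPolicy {
    final Duration browserStart;
    final int maxBrowserStartReassignments;
    final Duration idle;
    final Duration command;
    final Duration totalTest;

    TimeoutPolicy(Duration browserStart, int maxBrowserStartReassignments,
                  Duration idle, Duration command, Duration totalTest) {
        this.browserStart = browserStart;
        this.maxBrowserStartReassignments = maxBrowserStartReassignments;
        this.idle = idle;
        this.command = command;
        this.totalTest = totalTest;
    }

    /** Defaults quoted in the list: 90 s, 3 re-assignments, 90 s, 300 s, 30 min. */
    static TimeoutPolicy defaults() {
        return new TimeoutPolicy(Duration.ofSeconds(90), 3, Duration.ofSeconds(90),
                Duration.ofSeconds(300), Duration.ofMinutes(30));
    }

    /**
     * Decides which timeout, if any, applies to a running session.
     *
     * @param sinceSessionStart time since the browser session started
     * @param sinceStateChange  time since the current command started (when one is
     *                          running) or since the last reply was sent (when idle)
     * @param commandRunning    whether a command is currently executing
     */
    TimeoutKind evaluate(Duration sinceSessionStart, Duration sinceStateChange, boolean commandRunning) {
        if (sinceSessionStart.compareTo(totalTest) > 0) {
            return TimeoutKind.TOTAL_DURATION; // ultimate safety net wins
        }
        if (commandRunning) {
            return sinceStateChange.compareTo(command) > 0 ? TimeoutKind.COMMAND : TimeoutKind.NONE;
        }
        return sinceStateChange.compareTo(idle) > 0 ? TimeoutKind.IDLE : TimeoutKind.NONE;
    }
}
```

Keeping the idle and command checks mutually exclusive mirrors the list: the idle timer only runs between a reply and the next command, and the command timer only runs while a command is executing.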

I can't think of any reason for which all of these couldn't make it to the grid as well.

Santi


