RemoteWebDriver + Queue + Time outs


Matt Smith

Jun 18, 2024, 12:22:01 AM
to Selenium Users
I have a .NET C# project for smoke testing our applications across all combinations of environments, branches, sites, and servers. Because we are hitting lower environments, a lot of the endpoints take extra time to spin up. Because there are so many combinations, a complete test run can take a while but everything generally works fine when testing locally with the ChromeDriver. The tests run at whatever parallelization I have configured and all tests eventually run to completion.

However, I am starting to now work with Selenium Grid and experiencing some unexpected behavior. I am running Selenium Grid 4.21.0 (revision 79ed462ef4) on a VM with 16 logical processors (same machine I can test locally on). The test run starts out fine but at about 2 to 2.5 mins in, tests start showing up in the Queue. The queueing feature would be great except for the large number of tests that end up failing due to "The HTTP request to the remote WebDriver server for URL timed out after 60 seconds."

For a test run with 150 test cases against our TEST environment, about 50 tests fail with that timed out status. The queue will eventually reach about 50 test cases as well.

Ultimately we are going to have a legit Grid with several nodes supporting dozens of threads. This will definitely help with concurrency and potentially keeping tests out of the queue, but there will still be times when we may want to launch a test run with hundreds of test cases that I worry will still fall over to the queue.

One other thing to throw in the conversation is that there will often be test cases in the Running section that have been there for 5+ minutes and they start to occupy available threads. While annoying, a lot of these situations are tests that have somehow silently failed or not completed. I say this because when I don't run in headless mode and check that VM, there will be quite a few Chrome windows still open after the test run has completed.

So my questions are:
  1. Is there a better way to manage the queue (via code in my test setup or via configuration settings on the node) so that tests don't fail?
  2. Since the tests that fail with the HTTP timeout are timing out at 60 seconds, should I adjust something in code to allow them to run / wait longer than 60 seconds? I'm not really a fan of this, since 60 seconds is generally plenty for a response time, but given that tests can go to a queue and aren't actually executing at that point, maybe this is an option.
  3. For the test cases that appear to have "failed" and are blocking a thread for 5+ minutes, is there something I can configure to fail a test case that has run longer than x minutes?
It seems like some combination of #2 and #3 is the best option, since technically the queue is doing what it is expected to do.

Thanks in advance for any insight you can provide.

Krishnan Mahadevan

Jun 18, 2024, 12:44:51 AM
to seleniu...@googlegroups.com
Matt, 

That’s a very interesting question.


Ideally, your parallelism count (the number of tests that can run in parallel) should always match the max-sessions setting on the Grid (i.e., how many tests the Grid can run in parallel at any given point in time).

This can be determined by querying the endpoint http://localhost:4444/status 

This endpoint would basically list out the nodes and the slots associated with each of them. Now this approach would work ONLY when there is one process that runs your test. If you have two or more processes that are kicking off tests then each of them would assume exclusiveness to the grid slots and end up sending more tests, which will get piled up in the new session queue.
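As a rough illustration of reading that status endpoint from .NET, here is a minimal sketch. It assumes the Grid 4 status payload has the shape `value.nodes[].maxSessions` and `value.nodes[].slots[].session`; that shape, and the `GridStatus` class and method names, are assumptions for illustration and should be checked against the JSON your own Grid returns.

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class GridStatus
{
    // Sums maxSessions across nodes and counts slots that currently hold a session.
    // ASSUMPTION: payload shape is { "value": { "nodes": [ { "maxSessions": n, "slots": [ { "session": ... } ] } ] } }.
    public static (int Capacity, int Active) Summarize(string statusJson)
    {
        using JsonDocument doc = JsonDocument.Parse(statusJson);
        int capacity = 0, active = 0;
        foreach (JsonElement node in doc.RootElement.GetProperty("value").GetProperty("nodes").EnumerateArray())
        {
            capacity += node.GetProperty("maxSessions").GetInt32();
            foreach (JsonElement slot in node.GetProperty("slots").EnumerateArray())
            {
                // A slot with a non-null "session" object is treated as busy.
                if (slot.TryGetProperty("session", out JsonElement session)
                    && session.ValueKind == JsonValueKind.Object)
                {
                    active++;
                }
            }
        }
        return (capacity, active);
    }

    // Fetches /status from the Grid (for example http://localhost:4444) and summarizes it.
    public static async Task<(int Capacity, int Active)> QueryAsync(HttpClient http, Uri gridBase)
    {
        string json = await http.GetStringAsync(new Uri(gridBase, "/status"));
        return Summarize(json);
    }
}
```

The sketch uses maxSessions rather than the raw slot count because a node can advertise more slots (one set per browser) than it will run concurrently. Capacity minus Active is then only an estimate of what the test runner can safely launch.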

Alternatively you can query the endpoint http://localhost:4444/se/grid/newsessionqueue/queue to determine if there are any tests that are in the queue. An empty queue would basically mean that the Grid still has slots that are available which tests can make use of. So you can spin off a new test if and only if the new session queue is empty. There will be times when you might end up doing a dirty read (ie., two threads query the endpoint and get empty values and so both of them assume that the grid can support new sessions and both of them fire a new test at the grid and then one of them gets a slot while the other ends up in the queue). You may have to do some trial and error to determine how to get past this (easiest I can think of is to query the queue continuously 3 times in a row, with a random delay and if the queue is empty all the times then fire up a test else wait at the client side itself)
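The "check the queue several times with a random delay" idea could be sketched along these lines. This assumes the queue endpoint answers with `{"value": [ ... ]}` (one array entry per pending request); that response shape, the `GridQueue` class, and the default delay are assumptions for illustration, not part of the Grid documentation quoted here. `Random.Shared` requires .NET 6 or later.

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class GridQueue
{
    // Returns how many session requests are waiting.
    // ASSUMPTION: the payload looks like { "value": [ ...pending requests... ] }.
    public static int PendingCount(string queueJson)
    {
        using JsonDocument doc = JsonDocument.Parse(queueJson);
        return doc.RootElement.GetProperty("value").GetArrayLength();
    }

    // Polls the queue `checks` times with a random pause between polls and
    // returns true only if every poll saw an empty queue.
    public static async Task<bool> IsStablyEmptyAsync(
        HttpClient http, Uri gridBase, int checks = 3, int maxDelayMs = 500)
    {
        var queueUri = new Uri(gridBase, "/se/grid/newsessionqueue/queue");
        for (int i = 0; i < checks; i++)
        {
            string json = await http.GetStringAsync(queueUri);
            if (PendingCount(json) > 0)
            {
                return false; // something is already waiting; hold the test back
            }
            if (i < checks - 1)
            {
                // Random jitter makes it less likely that two threads poll in lockstep.
                await Task.Delay(Random.Shared.Next(maxDelayMs + 1));
            }
        }
        return true;
    }
}
```

This only narrows the dirty-read window described above; two callers can still both see an empty queue and both start a session.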

Now to determine stale tests you can leverage the timeouts (especially --session-timeout) at the grid side. Please look for timeout related configurations at https://www.selenium.dev/documentation/grid/configuration/cli_options/#node 



Hope that helps you get started.


Thanks & Regards
Krishnan Mahadevan

"All the desirable things in life are either illegal, expensive, fattening or in love with someone else!"
My Scribblings @ http://wakened-cognition.blogspot.com/
My Technical Scribblings @ https://rationaleemotions.com/


Matt Smith

Jun 18, 2024, 12:35:53 PM
to Selenium Users
Thanks for responding, Krishnan.

In this POC that I am currently working on, the parallelism count in my test application equals the number of threads available on the grid, so it was somewhat surprising that I ran into this issue so easily with just one test run calling against the grid. It is also interesting that my test application (running in Visual Studio Test Explorer) never seems to launch a large number of tests asynchronously. It appears that the test application is still throttling itself to 16 concurrent test cases.

Thinking about it a little more, that might be part of the root issue: The test case is somehow finalizing in Test Explorer but the thread is still running in Grid. That would explain why it moves forward with launching new test cases which then end up in the queue.

I like your idea of checking the queue before sending the next request. However, if abandoned threads are the root issue, I need to figure out a way to kill those. I did find this SO post that references timeout and browserTimeout that I need to explore: https://stackoverflow.com/questions/39538249/difference-between-timeout-and-browsertimeout

I tried adding a NUnit Timeout(120000) attribute to my tests which changed up the results and did not end up with a bunch of HTTP timeouts as before, but still ended up with a lot of tests that hit that Timeout value so it didn't really help come up with a good approach for mitigating the underlying issue.

Going back to figuring out a way to "kill" sessions, is there a way to send a command to kill a specific test session? It seems like the test case is finalizing in VS Test Explorer but the session is continuing in Grid. I'm wondering if it would be possible to have some code in [Teardown] that would attempt to send a request to Grid to make sure that the test case is finalized in Grid.

I'll keep digging on my side but would also appreciate any input if anyone has anything.

Thanks!

Krishnan Mahadevan

Jun 19, 2024, 1:03:01 AM
to seleniu...@googlegroups.com
Matt,

>>> The test case is somehow finalizing in Test Explorer but the thread is still running in Grid. That would explain why it moves forward with launching new test cases which then end up in the queue.

I am not quite sure I understand this statement; I am more of a Java guy, and that could be the reason. At any given point in time, the number of occupied slots in the Grid should equal the number of test threads that are currently running. The easiest way of enforcing this would be to inspect the new session queue and ensure that it is empty before scheduling a new test. If the queue is not empty, then the test runner should hold the test back.


>>> I tried adding a NUnit Timeout(120000) attribute to my tests which changed up the results and did not end up with a bunch of HTTP timeouts as before, but still ended up with a lot of tests that hit that Timeout value so it didn't really help come up with a good approach for mitigating the underlying issue.

This would just be a client level timeout wherein the test automatically gets killed by the test runner once it crosses the timeout. But it would still not fix the part wherein the server level timeouts kick in (The default is 5 minutes).


>>> I did find this SO post that references timeout and browserTimeout that I need to explore: https://stackoverflow.com/questions/39538249/difference-between-timeout-and-browsertimeout

Yeah, that’s my answer on SO, but that was for Grid3. Things have changed with respect to Grid4.

If you run "java -jar selenium-server-4.21.0.jar node --help", you should see all the configuration help for the node mode.

The most important configuration would be 


   --session-timeout
      Let X be the session-timeout in seconds. The Node will automatically
      kill a session that has not had any activity in the last X seconds. This
      will release the slot for other tests.
      Default: 300

There are similar configurations at the hub level as well (you can see them if you run "java -jar selenium-server-4.21.0.jar hub --help"). Some important ones are listed below:

   --session-request-timeout
      Timeout in seconds. New incoming session request is added to the queue.
      Requests sitting in the queue for longer than the configured time will
      timeout.
      Default: 300
   --session-request-timeout-period
      In seconds, how often the timeout for new session requests is checked.
      Default: 10
   --session-retry-interval
      Session creation interval in milliseconds. If all slots are busy, new
      session request will be retried after the given interval.
      Default: 15


>>> Going back to figuring out a way to "kill" sessions, is there a way to send a command to kill a specific test session? It seems like the test case is finalizing in VS Test Explorer but the session is continuing in Grid. I'm wondering if it would be possible to have some code in [Teardown] that would attempt to send a request to Grid to make sure that the test case is finalized in Grid.

You should find all the Grid supported API endpoints in https://www.selenium.dev/documentation/grid/advanced_features/endpoints/ 
This page includes API endpoints to kill session as well. You can retrieve the current session via https://www.selenium.dev/selenium/docs/api/dotnet/OpenQA.Selenium.WebDriver.html#OpenQA_Selenium_WebDriver_SessionId 
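In most cases, calling `driver.Quit()` inside the NUnit `[TearDown]` (wrapped in try/catch so a dead session does not mask the test result) is what ends the session and frees the slot. For sessions that survive that, a forced delete against the node could be sketched as below. The path `/se/grid/node/session/{id}` and the `X-REGISTRATION-SECRET` header follow the endpoints page linked above as I understand it; treat both, and the `GridSessionCleanup` helper itself, as assumptions to verify against your Grid version.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class GridSessionCleanup
{
    // Builds a DELETE request for one session on a specific node.
    // ASSUMPTION: nodeBase is the node's own URL, not the hub/router URL.
    public static HttpRequestMessage BuildDeleteRequest(
        Uri nodeBase, string sessionId, string registrationSecret = "")
    {
        var uri = new Uri(nodeBase, "/se/grid/node/session/" + Uri.EscapeDataString(sessionId));
        var request = new HttpRequestMessage(HttpMethod.Delete, uri);
        // The header is sent even when no secret is configured (empty value).
        request.Headers.TryAddWithoutValidation("X-REGISTRATION-SECRET", registrationSecret);
        return request;
    }

    // Sends the delete and reports whether the node answered with a success status.
    public static async Task<bool> ForceDeleteAsync(
        HttpClient http, Uri nodeBase, string sessionId, string registrationSecret = "")
    {
        using HttpRequestMessage request = BuildDeleteRequest(nodeBase, sessionId, registrationSecret);
        using HttpResponseMessage response = await http.SendAsync(request);
        return response.IsSuccessStatusCode;
    }
}
```

The session id would come from the driver's `SessionId` property (converted with `ToString()`), captured before `Quit()` is attempted so it is still available if `Quit()` throws.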




Thanks & Regards
Krishnan Mahadevan


Matt Smith

Jun 19, 2024, 2:20:49 PM
to Selenium Users
Thanks again for your response, Krishnan. 

Funny enough, I was making some headway from this post https://stackoverflow.com/questions/45591976/how-to-terminate-session-in-selenium-gridextras and realized that it was your answer there that was helping me most.

This article also helped with bringing a number of other options to look at: https://stackoverflow.com/questions/22322596/selenium-error-the-http-request-to-the-remote-webdriver-timed-out-after-60-sec. I didn't have any luck with the "no-sandbox" option but ended up making some headway by adding:

ChromeDriver driver = new ChromeDriver(ChromeDriverService.CreateDefaultService(), options, TimeSpan.FromMinutes(3)); 
driver.Manage().Timeouts().PageLoad.Add(System.TimeSpan.FromSeconds(60));

Also, Daniel Charles' response had a link to http://jimevansmusic.blogspot.com/2012/11/net-bindings-whaddaymean-no-response.html which talked about the challenges involved with what to do when an application just stops responding. This prompted me to dig in a little more and I was able to isolate my hung application issue to a specific application in our lower environments. We've got a situation where, for whatever reason, the application just goes into a really weird mode of stalling while sending data back. It doesn't crash or send an error code. You can watch it in F12 Dev Tools and see the HTML gets partially output for the page and then the content quits coming in. The status icon is still spinning on the page but nothing else is coming in. So the test in Test Explorer would eventually fail and allow another test to kick off, but the session would still be active in the Running Sessions view in Grid and the browser would still be open on that VM and still trying to download content. This would happen a bunch of times and use up half my slots but the Test Explorer was told to use 16 parallel threads so it would keep sending the requests. 

Once I made the change to abandon a request after 60 seconds, then my queued sessions went from 40-50 down to 5-6 and most of them would then finish.

The only issue that I am running across now is that I have a flag for whether or not to use Grid... 

            if (config.UseSeleniumGrid)
            {
                driver = new RemoteWebDriver(new Uri(config.SeleniumGridHubUrl), chromeOptions);
            }
            else
            {
                // Local Selenium WebDriver
                driver = new ChromeDriver(ChromeDriverService.CreateDefaultService(), chromeOptions, TimeSpan.FromSeconds(70));
            }
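            // Note: TimeSpan.Add returns a new value and the result is discarded here,
            // so this line leaves the page-load timeout unchanged.
            driver.Manage().Timeouts().PageLoad.Add(TimeSpan.FromSeconds(50));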

but the RemoteWebDriver constructor does not have an overload that accepts both options and a commandTimeout, so I'm not able to specify a commandTimeout when using Grid. Dropping PageLoad to 50 seconds seems to work better, though: I think the default commandTimeout is 60 seconds, and something I read implied that a shorter PageLoad duration lets the failure propagate up to the Grid properly.

        public RemoteWebDriver(DriverOptions options);
        public RemoteWebDriver(ICapabilities capabilities);
        public RemoteWebDriver(Uri remoteAddress, DriverOptions options);
        public RemoteWebDriver(Uri remoteAddress, ICapabilities capabilities);
        public RemoteWebDriver(ICommandExecutor commandExecutor, ICapabilities capabilities);
        public RemoteWebDriver(Uri remoteAddress, ICapabilities capabilities, TimeSpan commandTimeout);

I also found that my Click() helper method had a very naive implementation of the Retry pattern and was swallowing the ElementClickInterceptedException, which might have been causing issues as well by preventing the TearDown method from being called. These two articles helped me implement Microsoft's recommended version of the pattern:
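For reference, a bounded retry that does not swallow the final failure can be sketched roughly as follows. This is a generic illustration, not necessarily what the articles describe or what the poster ended up with; the `Retry` class, its parameters, and the fixed delay are all hypothetical.

```csharp
using System;
using System.Threading;

public static class Retry
{
    // Runs `action`, retrying only when it throws TException, up to maxAttempts times.
    // On the last attempt the exception is not caught, so it propagates to the test
    // and the test framework can still run its teardown.
    public static void Execute<TException>(Action action, int maxAttempts, TimeSpan delay)
        where TException : Exception
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                action();
                return;
            }
            catch (TException) when (attempt < maxAttempts)
            {
                // Only retryable failures before the final attempt land here.
                Thread.Sleep(delay);
            }
        }
    }
}
```

A click helper would then call something like `Retry.Execute<ElementClickInterceptedException>(() => element.Click(), 3, TimeSpan.FromMilliseconds(250));`, where the attempt count and delay are placeholders to tune.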
Okay, making progress and I think I can mostly call this specific issue closed because I've made it mostly go away with the PageLoad duration and also I've identified the root cause in a hanging application on our side.

Still lots of other things to work through but I'll bring those up on other threads.

Thanks!

Krishnan Mahadevan

Jun 20, 2024, 1:12:51 AM
to seleniu...@googlegroups.com
Matt,

>>> but the RemoteWebDriver constructor does not have an option to accept options and a commandTimeout

I think this should work no ? I am not very conversant in Java, but I think you could still implement the ICapabilities interface that internally uses a options class and get this to work right?

public RemoteWebDriver(Uri remoteAddress, ICapabilities capabilities, TimeSpan commandTimeout);

Thanks & Regards
Krishnan Mahadevan


Krishnan Mahadevan

Jun 20, 2024, 1:13:39 AM
to seleniu...@googlegroups.com
I meant to say, I am not very conversant in C#, but more of a Java guy :) 


Thanks & Regards
Krishnan Mahadevan


Matt Smith

Jun 20, 2024, 11:37:57 AM
to Selenium Users
Hi Krishnan,

> I think this should work no ? I am not very conversant in Java, but I think you could still implement the ICapabilities interface that internally uses a options class and get this to work right?

Thanks for the hint on this. I did not realize that the options and capabilities classes were tied together like that. It took me a while to find out exactly how to do it but I eventually found this post which showed how to make that jump: https://stackoverflow.com/questions/69020547/what-is-the-correct-way-to-add-capabilities-using-selenium-in-c

Now my code looks like the following using the chromeOptions.ToCapabilities() method:

            if (config.UseSeleniumGrid)
            {
                driver = new RemoteWebDriver(new Uri(config.SeleniumGridHubUrl), chromeOptions.ToCapabilities(), commandTimeout);
            }
            else
            {
                driver = new ChromeDriver(ChromeDriverService.CreateDefaultService(), chromeOptions, commandTimeout);
            }
            driver.Manage().Timeouts().PageLoad = pageloadTimeout;

Everything appears to be running much better, especially since I changed the PageLoad timeout from calling PageLoad.Add(), which only returns a new TimeSpan and never changed the configured timeout, to assigning the value I want (50 seconds). On my last run of 359 test cases, only 7 hit the 50-second timeout threshold; those sessions were shut down properly in the Grid, the slots were freed, and no queueing occurred. Of course, oddly enough, the application that has been causing issues with the never-ending responses cooperated this time around and failed with a different exception. Long story short: I am no longer receiving the "The HTTP request to the remote WebDriver server for URL timed out after 60 seconds" message, which was the impetus for my post.
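The difference between calling Add() and assigning the property comes down to TimeSpan being an immutable value type: Add() returns a new TimeSpan and leaves the original untouched. The sketch below shows this with a plain local variable standing in for the PageLoad property; the `TimeSpanAddDemo` class and the 300-second starting value are illustrative assumptions.

```csharp
using System;

public static class TimeSpanAddDemo
{
    // Returns the value of a TimeSpan after calling Add() without using the result.
    public static TimeSpan AfterDiscardedAdd()
    {
        TimeSpan pageLoad = TimeSpan.FromSeconds(300); // stand-in starting value
        pageLoad.Add(TimeSpan.FromSeconds(50));        // result is discarded
        return pageLoad;                               // still 300 seconds
    }

    // Returns the value after a plain assignment, which is what the working code does.
    public static TimeSpan AfterAssignment()
    {
        TimeSpan pageLoad = TimeSpan.FromSeconds(300);
        pageLoad = TimeSpan.FromSeconds(50);
        return pageLoad;                               // now 50 seconds
    }
}
```

This is why `driver.Manage().Timeouts().PageLoad = pageloadTimeout;` takes effect while the earlier `PageLoad.Add(...)` calls did not.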

Thanks again for your helpful responses!

Matt

Krishnan Mahadevan

Jun 20, 2024, 11:44:51 AM
to seleniu...@googlegroups.com
Matt 

Am glad to know that you were able to get your issues resolved. 


Thanks & Regards
Krishnan Mahadevan
