Selenium Grid constantly crashing with Timeout exceptions


jc

Sep 2, 2020, 1:42:06 PM
to Selenium Users
Hi all,

I am coming here as a shot in the dark.  We have a suite of 600+ tests that we run on Chrome at desktop resolution, and then again at tablet and mobile resolution using Chrome's mobile emulation.  We also run it on Firefox at desktop resolution.  So that is about 2500 tests against the same application, running nightly through Jenkins.  We are using Geb/Spock as our framework, which is just a wrapper on top of Selenium.
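For anyone unfamiliar with Geb, a setup like this is usually wired up as environments in GebConfig.groovy. The following is only a rough sketch of the idea, not a copy of our actual file: the hub URL, the environment names, and the device name are placeholders I made up for illustration.

```groovy
// GebConfig.groovy (illustrative sketch; hub URL, environment names and
// device name are hypothetical placeholders, not values from our real config)
import org.openqa.selenium.chrome.ChromeOptions
import org.openqa.selenium.firefox.FirefoxOptions
import org.openqa.selenium.remote.RemoteWebDriver

def hubUrl = new URL('http://hub.example.invalid:4444/wd/hub') // placeholder

environments {
    // Selected with -Dgeb.env=chromeDesktop
    chromeDesktop {
        driver = { new RemoteWebDriver(hubUrl, new ChromeOptions()) }
    }
    // Chrome's mobile emulation is switched on through an experimental option
    chromeMobile {
        driver = {
            def options = new ChromeOptions()
            options.setExperimentalOption('mobileEmulation', [deviceName: 'Nexus 5'])
            new RemoteWebDriver(hubUrl, options)
        }
    }
    firefoxDesktop {
        driver = { new RemoteWebDriver(hubUrl, new FirefoxOptions()) }
    }
}
```

A tablet environment would look the same as chromeMobile with a different device name.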

Starting in early July we haven't had many passing builds, because at some point during the tests I get seemingly random WebDriverExceptions.  Most of the time I get errors such as these:
- org.openqa.selenium.WebDriverException: Session [a7d84a66-daaa-48ad-a0cd-c29dfb98d805] was terminated due to BROWSER_TIMEOUT
- org.openqa.selenium.WebDriverException: Session [a7d84a66-daaa-48ad-a0cd-c29dfb98d805] was terminated due to TIMEOUT
- org.openqa.selenium.WebDriverException: Session [a7d84a66-daaa-48ad-a0cd-c29dfb98d805] was terminated due to SO_TIMEOUT
- org.openqa.selenium.WebDriverException: Session [a7d84a66-daaa-48ad-a0cd-c29dfb98d805] was terminated due to ERROR_FORWARDING_TO_NODE (don't see this one often)

Nothing has really changed in the application, so the code in the test suite hasn't changed much either.  The builds are run from Jenkins overnight in the order Chrome Desktop > Chrome Mobile > Chrome Tablet > Firefox Desktop.

The grid is set up with Docker for the hub only, though I am not sure which image it is.  I just know it is the latest stable version, 3.141.59.  The nodes are VMs, all built from the same image.  There are 10 nodes connected to the hub, each offering 5 sessions each of Chrome, Firefox, and IE.  Most nights everything passes except for 1 or 2 test classes that fail halfway through because of the exceptions, and then it seems to get progressively worse, to the point where Firefox, which is last in line, almost always fails somewhere.

I think what is happening is that these tests are crashing either the hub or the nodes, but I don't know why.  I run the tests in parallel with Gradle's maxParallelForks = 5, which launches 1 session on each of 5 different VMs.  I have tried playing with this number, even turning it down to 3, and that did not seem to help.  I have forkEvery = 1 because I believe Firefox had a memory leak issue a while back, and I have decided to leave it.  I also have cacheDriverPerThread = true in my GebConfig.groovy, which is explained in the Gebish.org documentation as "The default caching behavior is to cache the driver globally across the JVM. If you are using Geb in multiple threads this may not be what you want, as neither Geb Browser objects nor WebDriver at the core is thread safe. To remedy this, you can instruct Geb to cache the driver instance per thread by setting the config option cacheDriverPerThread to true".  I can't imagine 1 browser per machine is causing memory issues.
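For reference, the settings mentioned above sit roughly as follows. This is a minimal sketch rather than a copy of our files: only the three values come from this post, and the use of the default `test` task is an assumption.

```groovy
// build.gradle (sketch; assumes the default `test` task from the java/groovy plugin)
test {
    maxParallelForks = 5   // run up to five forked test JVMs at the same time
    forkEvery = 1          // start a fresh JVM after every test class
}
```

```groovy
// GebConfig.groovy
// Cache one WebDriver instance per thread instead of one per JVM
cacheDriverPerThread = true
```

With forkEvery = 1 each test class gets its own JVM, so a cached driver only lives as long as that one class.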

Just last night I decided to use two local Windows laptops, mine and a coworker's, to test what might be happening.  Chrome desktop ran 100% fine, and then Chrome mobile ran fine until about halfway through, when everything broke.  Chrome tablet and Firefox desktop subsequently had failures all over the board.  When we both logged into our machines, my coworker said the node reported that JavaScript ran out of memory.  On my Windows machine Chrome was stuck open and also said "Out of memory".  These machines have 16GB of RAM each, so that is surprising to me.  The hub and Chrome node had actually closed as well; I'm assuming they crashed.  My guess is this is basically what is happening on our Docker hub, but I don't know why.  The VMs have 10GB of RAM each, which should be plenty.

Has anyone else had these issues?  If I should ask this somewhere else, please let me know where.  Any help or troubleshooting tips are appreciated.

Here is an example of our current config:

browserTimeout: 40000
debug: false
jettyMaxThreads: -1
host: redacted
port: 48863
role: node
timeout: 30000
cleanUpCycle: 5000
maxSession: 5
capabilities: Capabilities {browserName: chrome, maxInstances: 5, platform: WIN10, platformName: WIN10, seleniumProtocol: WebDriver, server:CONFIG_UUID: 95554650-d4b5-493b-aad3-c57...}
capabilities: Capabilities {browserName: firefox, maxInstances: 5, platform: WIN10, platformName: WIN10, seleniumProtocol: WebDriver, server:CONFIG_UUID: 99881a28-d6d2-4e87-ab0c-cd5...}
capabilities: Capabilities {browserName: internet explorer, maxInstances: 5, platform: WIN10, platformName: WIN10, seleniumProtocol: WebDriver, server:CONFIG_UUID: 9e29d0e7-cad6-4075-af38-ff7...}
downPollingLimit: 2
hub: redacted
id: redacted:48863
nodePolling: 5000
nodeStatusCheckTimeout: 5000
proxy: org.openqa.grid.selenium.proxy.DefaultRemoteProxy
register: true
registerCycle: 5000
remoteHost: redacted:48863
unregisterIfStillDownAfter: 60000
