Parallel Geb/Spock tests crashing grid


jc

Sep 2, 2020, 10:19:41 AM
to Geb User Mailing List
Hi all,

I am coming here as a shot in the dark. We have a suite of 600+ tests that we run on Chrome in desktop resolution, and we also run them in tablet and mobile resolution using Chrome's mobile emulation. We also run the suite on Firefox in desktop resolution. That comes to about 2500 tests on the same application, running nightly through Jenkins.

Starting in early July we haven't had many passing builds, because at some point during the tests I get seemingly random WebDriverExceptions. Most of the time I get errors such as these:
- org.openqa.selenium.WebDriverException: Session [a7d84a66-daaa-48ad-a0cd-c29dfb98d805] was terminated due to BROWSER_TIMEOUT
- org.openqa.selenium.WebDriverException: Session [a7d84a66-daaa-48ad-a0cd-c29dfb98d805] was terminated due to TIMEOUT
- org.openqa.selenium.WebDriverException: Session [a7d84a66-daaa-48ad-a0cd-c29dfb98d805] was terminated due to SO_TIMEOUT
- org.openqa.selenium.WebDriverException: Session [a7d84a66-daaa-48ad-a0cd-c29dfb98d805] was terminated due to ERROR_FORWARDING_TO_NODE (don't see this one often)

Nothing has really changed in the application, and thus the code in the test suite hasn't really changed all that much. The builds are run from Jenkins overnight in the order Chrome Desktop > Chrome Mobile > Chrome Tablet > Firefox Desktop.

The grid is set up with Docker for the hub only, though I am not sure what image it is. I just know it is the latest stable version, 3.141.59. The nodes are VMs, all built from the same image. There are 10 nodes connected to the hub with 5 sessions each of Chrome, Firefox, and IE. Most nights everything passes except for 1 or 2 test classes that fail halfway through because of the exceptions, and then it seems to get progressively worse, to the point where Firefox, which is last in line, almost always fails somewhere.

I think what is happening is that these tests are crashing either the hub or the nodes, but I don't know why. I run the tests in parallel with Gradle's maxParallelForks = 5, which launches 1 session on each of 5 different VMs. I have tried playing with this number, even turning it down to 3, and that did not seem to help. I have forkEvery = 1 because I believe Firefox had a memory leak issue a while back, and I have decided to leave it. I also have cacheDriverPerThread = true in my GebConfig.groovy. I can't imagine 1 browser per machine is causing memory issues.
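
For reference, the settings described above would look roughly like this in build.gradle. This is a minimal sketch, not the actual build file (which was not posted); it assumes the Geb specs are run by the standard Gradle `test` task.

```groovy
// build.gradle (sketch; assumes the standard `test` task runs the Geb specs)
test {
    // Run at most 5 test JVMs at the same time.
    maxParallelForks = 5

    // Replace each test JVM after it has executed 1 test class.
    forkEvery = 1
}

// GebConfig.groovy is a separate file; the setting mentioned above is:
//   cacheDriverPerThread = true
```

Both Gradle properties control JVMs on the machine that runs the build; they do not configure the hub or the nodes.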

Just last night I decided to use a local Windows laptop of mine and a coworker's to test out what might be happening. Chrome desktop ran 100% fine, and then Chrome mobile ran fine about halfway through until everything broke. Chrome tablet and Firefox desktop subsequently had failures all over the board. When we both logged into our machines, my coworker said the nodes reported that JavaScript ran out of memory. On my Windows machine Chrome was stuck open and said "Out of memory" also. These machines have 16GB of RAM each, so that is surprising to me. The hub and Chrome node were actually closed as well; I'm assuming they crashed. My guess is this is basically what is happening on our Docker hub, but I don't know why.

I guess what I am trying to find out is if this is a Geb/Spock issue and if so, why?  What other settings should I check in my build.gradle or my GebConfig.groovy?  If it's a hub/node issue, what else can we check?  The VMs have 10GB of RAM on them which should be plenty.

Has anyone else had these issues? If I should ask this somewhere else, please let me know where. Any help or troubleshooting tips are appreciated.

tho...@posteo.de

Sep 2, 2020, 11:48:21 AM
to geb-...@googlegroups.com
Hello jc,

sorry for maybe stating the obvious, but the first thing I do in case of
such errors is to make sure that every component in the Grid is running
the latest version, as well as upgrade the Selenium version in my test
project.
Most of the time, this solves this kind of issue for me.
I have also had good experiences with restarting the Grid before running
a major test suite during the night.

The "forkEvery" parameter in Gradle (and similar mechanisms) would only
affect the machine running the test code, not the ones running the
browser code, so that should not have any effect on the Grid.
For the Grid it should only matter how many requests for browser slots
are occurring simultaneously.

Another thing you could check, is whether the browsers get restarted
during the tests.
Maybe there is some sort of memory leak (in the webpage/JavaScript that
you are testing), and restarting could help with that. The browser
itself running out of memory seems suspicious. I have never seen this
before, and I have been using Grid a lot.
A memory leak could also be a result of a version mismatch between
Selenium and the Grid, but I'm speculating here.
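
One way to force a browser restart from the Geb side is to quit the cached driver when a spec finishes. The following is only a sketch: the class name `CheckoutSpec` is hypothetical, and it assumes the specs extend `GebSpec` and use Geb's default driver caching.

```groovy
import geb.driver.CachingDriverFactory
import geb.spock.GebSpec

// Hypothetical spec name, used for illustration only.
class CheckoutSpec extends GebSpec {

    // Runs once after the last feature method of this spec.
    def cleanupSpec() {
        // Quit the cached WebDriver and clear Geb's cache, so the next
        // spec in the same JVM starts a fresh browser session.
        CachingDriverFactory.clearCacheAndQuitDriver()
    }
}
```

With forkEvery = 1 every test class already runs in its own JVM, and Geb by default quits the cached driver when that JVM shuts down, so a restart per class may already be happening; it is worth confirming on the nodes that old browser processes really disappear.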

Depending on how much you have customised your Grid installation, it
might not be an option to do this, but I can also very much recommend
using Zalenium (https://opensource.zalando.com/zalenium/) as a Grid
implementation.
It helped me save a lot of maintenance effort, and it comes with a lot
of great features.

Best regards,
Thomas

jc

Sep 2, 2020, 12:05:08 PM
to Geb User Mailing List
Thanks for the reply, Thomas. For versions, Chrome and Firefox are on their latest versions, as is the Chrome driver. My project is also using Selenium version 3.141.59 for everything, which is the same as the Grid.

When you say forkEvery only affects the machine running the test code, in this case it would be a docker Jenkins instance.  I was thinking maybe Jenkins could be the culprit but I had our DevOps team look into it and they couldn't find anything suspicious.  Would it be reasonable to think it's the "Jenkins machine" at fault?

Zalenium is interesting.  Where would this run?  Currently we have a separate hub, and then 10 separate nodes as VMs.  Are you saying we wouldn't need these 10 nodes anymore?

tho...@posteo.de

Sep 2, 2020, 1:37:37 PM
to geb-...@googlegroups.com
Yes, the Gradle forking options only directly affect the machine that
is running the test code, i.e. Jenkins.
The Gradle process creates more Java processes on that machine,
according to the rules you set, so they will consume more resources on
that machine.
From the point of view of the Grid, or rather, Grid-Hub, it will just
receive more requests, and that will consume resources, i.e. browsers on
the Grid nodes.
So there is a connection there, but it is an indirect one.
From how you describe it, I would not think that Jenkins is the issue.
Running the same suite from another machine, e.g. your local development
machine, should have the same effect on the Grid, if you are using the
same parameters.
I am afraid I cannot offer more specific help right now.

Zalenium could be run as a Docker image, that would be one single
command, and that would spawn the Hub and the Nodes on the same machine.
If you want to have a distributed setup, with more Nodes on more
machines connected to the same Hub, you would need to read up in the
documentation how to do that.
But you could already get a complete Grid + features on a single
machine by running a single command.
See https://opensource.zalando.com/zalenium/#docker

I would definitely recommend looking into this at some point, regardless
of your current problem.