Speeding up the ATH


Jesse Glick

Sep 28, 2017, 9:45:19 AM
to Jenkins Dev
After the fiasco of Jenkins 2.80 some of us have been discussing ways
to speed up `acceptance-test-harness` to the point where it would be
practical to run all or part of it against core more frequently,
perhaps even in core PR builds.

Part of the problem is that the test infrastructure has a ton of
overhead. One suggestion was to use PhantomJS rather than a live
browser like Firefox during typical test runs. I suggested optimizing
the plugin upload mechanism in `LocalController`, perhaps using a mock
local update center URL to avoid network downloads (as well as
simplifying the way `LOCAL_JARS` and similar options work).

Part of the problem is also that there are just a lot of tests, some
of questionable value.

https://github.com/jenkinsci/acceptance-test-harness/blob/0e3d03474c28c047b365d1f29b621af32eb7918c/src/test/java/plugins/JobDslPluginTest.java#L1537-L1563

is an example of why a full ATH run is so ridiculously slow. On one CI
builder this single test case (there are 61 of them in the test suite
for this one plugin) took 2m45s. To test what? That you can run a DSL
script and it produces a `View` with certain matching jobs? This could
run via `JenkinsRule` in a few seconds and be equally effective in
catching regressions. (Probably more so, since it would be part of the
plugin repo and thus run on every plugin PR, catching mistakes in
plugin code. There is a smaller chance that a fundamental change in
core would break this plugin functionality; that is what
`plugin-compat-tester` is for.)
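
(For a concrete picture, a rough sketch of what such a check could look like as a `JenkinsRule` test living in the plugin repo. The `ExecuteDslScripts` setter used below is an assumption from memory and may not match the plugin's current API exactly; the DSL snippet and job names are made up for illustration:)

    import hudson.model.FreeStyleProject;
    import hudson.model.ListView;
    import javaposse.jobdsl.plugin.ExecuteDslScripts;
    import org.junit.Rule;
    import org.junit.Test;
    import org.jvnet.hudson.test.JenkinsRule;
    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertNotNull;

    public class SeedJobTest {
        @Rule public JenkinsRule r = new JenkinsRule();

        @Test
        public void dslScriptCreatesViewWithMatchingJobs() throws Exception {
            FreeStyleProject seed = r.createFreeStyleProject("seed");
            ExecuteDslScripts dsl = new ExecuteDslScripts();
            // setScriptText is assumed here; older plugin versions configured the script differently.
            dsl.setScriptText(
                "job('example-1')\n" +
                "job('example-2')\n" +
                "listView('examples') { jobs { regex('example-.*') } }");
            seed.getBuildersList().add(dsl);
            r.buildAndAssertSuccess(seed);

            // The DSL run should have produced a view containing the two matching jobs.
            ListView view = (ListView) r.jenkins.getView("examples");
            assertNotNull(view);
            assertEquals(2, view.getItems().size());
        }
    }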

We need to just delete stuff like this and make sure ATH is just
testing the UI and critical (“smoke test”) functionality. Think of
test harnesses in a pyramid:

You can have as many unit tests as you want, which test individual
methods or small self-contained components. They should run more or
less instantaneously, and behave essentially the same on any platform.

You should have a bunch of `JenkinsRule` “functional” tests, which are
slower to run (a few seconds) and a little flakier, but which start up
a real Jenkins server and test your plugin’s functionality in a more
realistic way including extensions and background threads and project
builds and settings and everything. These can even start (open-source
Linux) external services using `docker-fixtures` where necessary,
check basic aspects of HTML page rendering including form submissions
with HtmlUnit, run CLI commands, etc.
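
(A minimal sketch of that middle tier: a `JenkinsRule` test that loads the real job configuration page in HtmlUnit, resubmits the form, and checks the value survives the round trip; the job name is arbitrary:)

    import com.gargoylesoftware.htmlunit.html.HtmlForm;
    import hudson.model.FreeStyleProject;
    import org.junit.Rule;
    import org.junit.Test;
    import org.jvnet.hudson.test.JenkinsRule;
    import static org.junit.Assert.assertEquals;

    public class ConfigFormTest {
        @Rule public JenkinsRule r = new JenkinsRule();

        @Test
        public void descriptionSurvivesConfigRoundTrip() throws Exception {
            FreeStyleProject p = r.createFreeStyleProject("demo");
            p.setDescription("hello");

            // Fetch the real /configure page with HtmlUnit and resubmit it unchanged.
            HtmlForm form = r.createWebClient().getPage(p, "configure").getFormByName("config");
            r.submit(form);

            assertEquals("hello", p.getDescription());
        }
    }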

Finally you should have a handful of carefully chosen “acceptance”
tests which are harder to develop, very slow to run, and often flaky,
but provide a convincing demonstration that an entire user-level
feature truly works in a realistic environment including
production-style plugin and class loading (perhaps even contacting
outside services) and driven entirely from the same kind of gestures a
user would make in a web browser.
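
(For contrast, an acceptance test in the ATH drives a full production-style Jenkins through Selenium page objects, roughly like the sketch below; the page-object calls are from memory and may differ in detail from the current harness:)

    import org.jenkinsci.test.acceptance.junit.AbstractJUnitTest;
    import org.jenkinsci.test.acceptance.po.FreeStyleJob;
    import org.junit.Test;

    public class CreateJobAcceptanceTest extends AbstractJUnitTest {
        @Test
        public void createAndBuildJobThroughTheUi() {
            // Every step here is performed through the browser, as a user would do it.
            FreeStyleJob job = jenkins.jobs.create(FreeStyleJob.class);
            job.configure();
            job.addShellStep("echo hello");
            job.save();
            job.startBuild().shouldSucceed();
        }
    }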

Mark Waite

Sep 28, 2017, 10:42:48 AM
to jenkin...@googlegroups.com
Do we have any way of associating historical acceptance test harness failures to a cause of those failures?

If not, could such data be gathered from some sample of recent runs of the acceptance test harness, so that we know which of the tests have been most helpful in the recent past?  

For example, I'd support deleting an acceptance test if it has never detected a bug since it was created.

Alternately, could we add a layering concept to the acceptance test harness?  There could be "precious" tests which run every time and there could be other tests which are part of a collection from which a few tests are selected randomly for execution every time.
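
(One way that layering could look, sketched as a plain JUnit 4 rule; the `Precious` marker and the sampling fraction are hypothetical, not something the ATH has today:)

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;
    import org.junit.Assume;
    import org.junit.rules.TestRule;
    import org.junit.runner.Description;
    import org.junit.runners.model.Statement;

    public class SampledTestRule implements TestRule {
        /** Hypothetical marker: tests that must run on every build. */
        @Retention(RetentionPolicy.RUNTIME)
        @Target({ElementType.METHOD, ElementType.TYPE})
        public @interface Precious {}

        private final double fraction; // chance that a non-precious test runs in a given build

        public SampledTestRule(double fraction) {
            this.fraction = fraction;
        }

        @Override
        public Statement apply(final Statement base, final Description description) {
            return new Statement() {
                @Override
                public void evaluate() throws Throwable {
                    boolean precious = description.getAnnotation(Precious.class) != null
                            || (description.getTestClass() != null
                                && description.getTestClass().isAnnotationPresent(Precious.class));
                    // Precious tests always run; the rest are sampled randomly.
                    Assume.assumeTrue("randomly skipped in this run",
                            precious || Math.random() < fraction);
                    base.evaluate();
                }
            };
        }
    }

A test class would then declare something like `@Rule public SampledTestRule sampling = new SampledTestRule(0.1);` and mark its critical cases with `@Precious`.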

As another alternate, is there a way to make the acceptance test harness run in a massively parallel fashion, with a test per machine, and hundreds of machines running the tests all at once?
 
> You can have as many unit tests as you want, which test individual
> methods or small self-contained components. They should run more or
> less instantaneously, and behave essentially the same on any platform.
>
> You should have a bunch of `JenkinsRule` “functional” tests, which are
> slower to run (a few seconds) and a little flakier, but which start up
> a real Jenkins server and test your plugin’s functionality in a more
> realistic way including extensions and background threads and project
> builds and settings and everything. These can even start (open-source
> Linux) external services using `docker-fixtures` where necessary,
> check basic aspects of HTML page rendering including form submissions
> with HtmlUnit, run CLI commands, etc.
>
> Finally you should have a handful of carefully chosen “acceptance”
> tests which are harder to develop, very slow to run, and often flaky,
> but provide a convincing demonstration that an entire user-level
> feature truly works in a realistic environment including
> production-style plugin and class loading (perhaps even contacting
> outside services) and driven entirely from the same kind of gestures a
> user would make in a web browser.


As a safety check of that concept, did any of the current acceptance tests detect the regression when run with Jenkins 2.80 (or Jenkins 2.80 RC)?

Is there a JenkinsRule test which could reasonably be written to test for the conditions that caused the bug in Jenkins 2.80?

Mark Waite
 

R. Tyler Croy

Sep 28, 2017, 11:46:28 AM
to jenkin...@googlegroups.com
(replies inline)

On Thu, 28 Sep 2017, Jesse Glick wrote:

> After the fiasco of Jenkins 2.80 some of us have been discussing ways
> to speed up `acceptance-test-harness` to the point where it would be
> practical to run all or part of it against core more frequently,
> perhaps even in core PR builds.
>
> Part of the problem is that the test infrastructure has a ton of
> overhead. One suggestion was to use PhantomJS rather than a live
> browser like Firefox during typical test runs. I suggested optimizing
> the plugin upload mechanism in `LocalController`, perhaps using a mock
> local update center URL to avoid network downloads (as well as
> simplifying the way `LOCAL_JARS` and similar options work).



Just an FYI on this point: I have been working on upgrading to Selenium 3.6.0
(https://github.com/jenkinsci/acceptance-test-harness/pull/360), which
introduces support for newer browsers (duh), including their headless modes.


I will be investigating other test speed-up options after that, but I fully
support jglick's approach of deleting pointless/unnecessary test cases. The fastest
Selenium test is one which isn't executed :P
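
(For reference, the headless switch in the Selenium 3.6 API looks roughly like this; the ATH wires its drivers up through its own factory, so this is just a sketch of the underlying call:)

    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.firefox.FirefoxDriver;
    import org.openqa.selenium.firefox.FirefoxOptions;

    public class HeadlessDriverFactory {
        public static WebDriver createHeadlessFirefox() {
            FirefoxOptions options = new FirefoxOptions();
            options.addArguments("-headless"); // requires Firefox 56+ and a recent geckodriver
            return new FirefoxDriver(options);
        }
    }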



- R. Tyler Croy

------------------------------------------------------
Code: <https://github.com/rtyler>
Chatter: <https://twitter.com/agentdero>
xmpp: rty...@jabber.org

% gpg --keyserver keys.gnupg.net --recv-key 1426C7DC3F51E16F
------------------------------------------------------

Jesse Glick

Sep 28, 2017, 1:43:47 PM
to Jenkins Dev
On Thu, Sep 28, 2017 at 10:42 AM, Mark Waite <mark.ea...@gmail.com> wrote:
> Do we have any way of associating historical acceptance test harness
> failures to a cause of those failures?

Not that I am aware of.

> could such data be gathered from some sample of recent runs of the
> acceptance test harness, so that we know which of the tests have been most
> helpful in the recent past?

Maybe. I doubt things are in good enough condition to do that kind of
analysis. AFAIK none of the CI jobs running the ATH have ever had a
single stable run, so there is not really a baseline.

> Alternately, could we add a layering concept to the acceptance test harness?
> There could be "precious" tests which run every time and there could be
> other tests which are part of a collection from which a few tests are
> selected randomly for execution every time.

Yes this is possible.

> is there a way to make the acceptance test harness run
> in a massively parallel fashion

Yes. Does not help with the flakiness, and still wastes a tremendous
amount of cloud machine time.

> As a safety check of that concept, did any of the current acceptance tests
> detect the regression when run with Jenkins 2.80 (or Jenkins 2.80 RC)?

Yes.

> Is there a JenkinsRule test which could reasonably be written to test for
> the conditions that caused the bug in Jenkins 2.80?

Not really; that particular issue was unusual, since it touched on the
setup wizard UI which is normally suppressed by test harnesses.

Mark Waite

Sep 28, 2017, 2:11:32 PM
to jenkin...@googlegroups.com
On Thu, Sep 28, 2017 at 11:43 AM Jesse Glick <jgl...@cloudbees.com> wrote:
> On Thu, Sep 28, 2017 at 10:42 AM, Mark Waite <mark.ea...@gmail.com> wrote:
>> Do we have any way of associating historical acceptance test harness
>> failures to a cause of those failures?
>
> Not that I am aware of.
>
>> could such data be gathered from some sample of recent runs of the
>> acceptance test harness, so that we know which of the tests have been most
>> helpful in the recent past?
>
> Maybe. I doubt things are in good enough condition to do that kind of
> analysis. AFAIK none of the CI jobs running the ATH have ever had a
> single stable run, so there is not really a baseline.


Ah, so that means the current acceptance test harness is unlikely to be trusted in its entirety by anyone.  A failure in the entire acceptance test harness is probably only used rarely to stop a release.

I support deleting tests of questionable value.

>> Alternately, could we add a layering concept to the acceptance test harness?
>> There could be "precious" tests which run every time and there could be
>> other tests which are part of a collection from which a few tests are
>> selected randomly for execution every time.
>
> Yes this is possible.
>
>> is there a way to make the acceptance test harness run
>> in a massively parallel fashion
>
> Yes. Does not help with the flakiness, and still wastes a tremendous
> amount of cloud machine time.


Would we consider a policy that flaky tests are deleted, rather than tolerated?  Possibly with an email notification to each committer that modified any line in the deleted test?

I've understood that Kent Beck has mentioned that they delete flaky tests at facebook, rather than just tolerating them.  Seems dramatic, but increases trust in the acceptance test results, and encourages moving tests from the ATH into JenkinsRule or other locations where they may be easier to diagnose.

Mark Waite
 
>> As a safety check of that concept, did any of the current acceptance tests
>> detect the regression when run with Jenkins 2.80 (or Jenkins 2.80 RC)?
>
> Yes.
>
>> Is there a JenkinsRule test which could reasonably be written to test for
>> the conditions that caused the bug in Jenkins 2.80?
>
> Not really; that particular issue was unusual, since it touched on the
> setup wizard UI which is normally suppressed by test harnesses.


Jesse Glick

Sep 28, 2017, 2:38:11 PM
to Jenkins Dev
On Thu, Sep 28, 2017 at 2:11 PM, Mark Waite <mark.ea...@gmail.com> wrote:
> Would we consider a policy that flaky tests are deleted, rather than
> tolerated?

Or `@Ignore`d, sure.

> I've understood that Kent Beck has mentioned that they delete flaky tests at
> facebook, rather than just tolerating them. Seems dramatic

Not really. Even if the source code lines are physically deleted, the
author can trivially resurrect them from Git history if and when they
have time to figure out what the flakes were about and how to fix
them. Something like

https://wiki.jenkins.io/display/JENKINS/Flaky+Test+Handler+Plugin

would be handy, though today it is not Pipeline-compatible, and IMO
makes several untenable assumptions about project setup.

Stephen Connolly

Sep 28, 2017, 2:52:16 PM
to jenkin...@googlegroups.com
On Thu 28 Sep 2017 at 19:11, Mark Waite <mark.ea...@gmail.com> wrote:
> On Thu, Sep 28, 2017 at 11:43 AM Jesse Glick <jgl...@cloudbees.com> wrote:
>> On Thu, Sep 28, 2017 at 10:42 AM, Mark Waite <mark.ea...@gmail.com> wrote:
>>> Do we have any way of associating historical acceptance test harness
>>> failures to a cause of those failures?
>>
>> Not that I am aware of.
>>
>>> could such data be gathered from some sample of recent runs of the
>>> acceptance test harness, so that we know which of the tests have been most
>>> helpful in the recent past?
>>
>> Maybe. I doubt things are in good enough condition to do that kind of
>> analysis. AFAIK none of the CI jobs running the ATH have ever had a
>> single stable run, so there is not really a baseline.
>
> Ah, so that means the current acceptance test harness is unlikely to be trusted in its entirety by anyone. A failure in the entire acceptance test harness is probably only used rarely to stop a release.
>
> I support deleting tests of questionable value.

I don’t want to knock the contributors to the ATH, but my long standing view is that writing good acceptance tests is a high skill and not something that can be easily crowdsourced. 

Acceptance tests are, by their nature, among the slowest tests.

My long standing view of the ATH is that there is a fundamental flaw in using the same harness to drive as to verify *because* any change to that harness has the potential to invalidate any tests using the modified harness... it is all too easy to turn a good test into a test giving a false positive... I think an ATH that had two sets of tests, one driving via the filesystem / CLI and verifying via the UI while the other drives via the UI and verifies via the filesystem / CLI, would be much much stronger (and faster too). In most cases you wouldn’t be changing the two harnesses at the same time, so a failing test will indicate a genuine failure.

We need an ATH that can be realistically run by humans in an hour or two... trim the test cases to get there, moving the tested functionality higher up to be tests that run faster, e.g. JenkinsRule or better yet pure unit tests.

That’s just my view, feel free to ignore. (Because some people confuse my interaction style I am going to try and clarify.) I’m bowing out unless someone specifically asks me for clarification on my recommendations or I see people misrepresenting my opinion... this is normally the way I interact... it is just that people have a real habit of dragging me back in, and that makes me look like someone who rabbits on and on.


--
Sent from my phone

Jesse Glick

Sep 28, 2017, 3:33:26 PM
to Jenkins Dev
On Thu, Sep 28, 2017 at 2:51 PM, Stephen Connolly
<stephen.al...@gmail.com> wrote:
> writing good acceptance tests is a high skill and not something that
> can be easily crowdsourced.

Agreed. In particular we should not be asking GSoC contributors to
write new acceptance tests. Every PR proposing a new test should be
carefully reviewed with an eye to whether it is testing something that

· could not reasonably be covered by lower-level, faster tests
· is a reasonably important user scenario, for which an accidental
breakage would be considered a noteworthy regression worth holding up
or even blacklisting a release (of core or a plugin)
· is sufficiently distinct from other scenarios already being tested
that we think regressions might otherwise slip by

> there is a fundamental flaw in
> using the same harness to drive as to verify *because* any change to that
> harness has the potential to invalidate any tests using the modified
> harness

Probably true but does not seem to me like the highest priority facing us.

> it is all too easy to turn a good test into a test giving a false
> positive

I am not personally aware of any such historical cases (maybe I missed
some). The immediate problems are slowness, flakiness (tests sometimes
fail for no clear reason), and fragility (tests fail due to trivial
code changes especially in the UI).

> We need an ATH that can be realistically run by humans in an hour or two

Yes that seems like a good goal.

Stephen Connolly

Sep 28, 2017, 4:43:16 PM
to jenkin...@googlegroups.com
On Thu 28 Sep 2017 at 20:33, Jesse Glick <jgl...@cloudbees.com> wrote:
On Thu, Sep 28, 2017 at 2:51 PM, Stephen Connolly
<stephen.al...@gmail.com> wrote:
> writing good acceptance tests is a high skill and not something that
> can be easily crowdsourced.

Agreed. In particular we should not be asking GSoC contributors to
write new acceptance tests. Every PR proposing a new test should be
carefully reviewed with an eye to whether it is testing something that

· could not reasonably be covered by lower-level, faster tests
· is a reasonably important user scenario, for which an accidental
breakage would be considered a noteworthy regression worth holding up
or even blacklisting a release (of core or a plugin)
· is sufficiently distinct from other scenarios already being tested
that we think regressions might otherwise slip by

+1


> there is a fundamental flaw in
> using the same harness to drive as to verify *because* any change to that
> harness has the potential to invalidate any tests using the modified
> harness

Probably true but does not seem to me like the highest priority facing us.

If we want high-value tests, this is the direction I recommend. How we get there (or even whether we get to my recommendation) is a matter for the people driving the effort to decide. You have my advice and you appear to understand why I recommend it. I am not driving this effort, so I will not dictate anything about it.


> it is all too easy to turn a good test into a test giving a false
> positive

I am not personally aware of any such historical cases (maybe I missed
some). The immediate problems are slowness, flakiness (tests sometimes
fail for no clear reason), and fragility (tests fail due to trivial
code changes especially in the UI).

Well, here’s the thing: unless we retest the tests with the feature under test broken, we have no way of knowing. A false test will pass irrespective of whether the feature works or not... most likely the features are working, so nobody will question the passing tests that originally tested the feature but are now simultaneously failing to test and failing to verify.

Fragile tests are a worse problem in my mind.

Useless slow tests are an even worse problem.

But that is just my opinion, how you apply your energy is your call.


> We need an ATH that can be realistically run by humans in an hour or two

Yes that seems like a good goal.


Ullrich Hafner

Sep 28, 2017, 6:31:16 PM
to Jenkins Developers

> On 28 Sep 2017 at 15:45, Jesse Glick <jgl...@cloudbees.com> wrote:
>
> After the fiasco of Jenkins 2.80 some of us have been discussing ways
> to speed up `acceptance-test-harness` to the point where it would be
> practical to run all or part of it against core more frequently,
> perhaps even in core PR builds.

This is already supported (since 2014):
https://github.com/jenkinsci/acceptance-test-harness/blob/master/docs/JUNIT.md#marking-tests-to-be-member-of-the-smoke-test-group
In order to verify core PRs just use the smoke tests. You can select the corresponding tests on your own by changing the annotated tests.
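
(Marking a test as part of that group is just a JUnit category annotation, roughly as below; the package name of `SmokeTest` is from memory, so check the linked doc for the authoritative form:)

    import org.jenkinsci.test.acceptance.junit.AbstractJUnitTest;
    import org.jenkinsci.test.acceptance.junit.SmokeTest;
    import org.junit.Test;
    import org.junit.experimental.categories.Category;

    @Category(SmokeTest.class)
    public class CreateJobSmokeTest extends AbstractJUnitTest {
        @Test
        public void createAndBuildJob() {
            // A trivial end-to-end gesture: create a default job through the UI and build it.
            jenkins.jobs.create().startBuild().shouldSucceed();
        }
    }

The smoke group can then be selected on its own for a core PR job (later in this thread the `-PrunSmokeTests` profile is mentioned for exactly that).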

>
> Part of the problem is that the test infrastructure has a ton of
> overhead. One suggestion was to use PhantomJS rather than a live
> browser like Firefox during typical test runs. I suggested optimizing
> the plugin upload mechanism in `LocalController`, perhaps using a mock
> local update center URL to avoid network downloads (as well as
> simplifying the way `LOCAL_JARS` and similar options work).

There are several things we could do to improve the speed of a single test; however, you typically will not get a good regression test below a minute (maybe 30 seconds).
Currently we have 600 tests, resulting in a runtime of 10 hours. This sounds ridiculously slow; however, if we compare it to the number of plugins it looks different:
i.e. we have one test for two of our plugins, or we spend 30 seconds for each plugin. This is a ridiculously small number of tests! I don’t know any software company
that has so few acceptance tests for its components. If you wanted to do the same amount of testing with a QA team, you would spend weeks!

>
> Part of the problem is also that there are just a lot of tests, some
> of questionable value.
>
> https://github.com/jenkinsci/acceptance-test-harness/blob/0e3d03474c28c047b365d1f29b621af32eb7918c/src/test/java/plugins/JobDslPluginTest.java#L1537-L1563
>
> is an example of why a full ATH run is so ridiculously slow. On one CI
> builder this single test case (there are 61 of them in the test suite
> for this one plugin) took 2m45s. To test what? That you can run a DSL
> script and it produces a `View` with certain matching jobs? This could
> run via `JenkinsRule` in a few seconds and be equally effective in
> catching regressions. (Probably more so, since it would be part of the
> plugin repo and thus run on every plugin PR, catching mistakes in
> plugin code. There is a smaller chance that a fundamental change in
> core would break this plugin functionality; that is what
> `plugin-compat-tester` is for.)
>
> We need to just delete stuff like this and make sure ATH is just
> testing the UI and critical (“smoke test”) functionality.
>

We should not delete any of these tests. If you do not like them, just do not run them. I use these tests (not all, just the tests for my plugins) before I make a plugin release.
This helps me a lot to reduce the manual testing time. They take about an hour to run on my machine, but after running them I am sure that no big issues are part of a release.
I think this is time well invested. (Manual testing would require at least a day.)

> Think of
> test harnesses in a pyramid:
> You can have as many unit tests as you want, which test individual
> methods or small self-contained components. They should run more or
> less instantaneously, and behave essentially the same on any platform.
>
> You should have a bunch of `JenkinsRule` “functional” tests, which are
> slower to run (a few seconds) and a little flakier, but which start up
> a real Jenkins server and test your plugin’s functionality in a more
> realistic way including extensions and background threads and project
> builds and settings and everything. These can even start (open-source
> Linux) external services using `docker-fixtures` where necessary,
> check basic aspects of HTML page rendering including form submissions
> with HtmlUnit, run CLI commands, etc.
>
> Finally you should have a handful of carefully chosen “acceptance”
> tests which are harder to develop, very slow to run, and often flaky,
> but provide a convincing demonstration that an entire user-level
> feature truly works in a realistic environment including
> production-style plugin and class loading (perhaps even contacting
> outside services) and driven entirely from the same kind of gestures a
> user would make in a web browser.
>

Yes, I see it the same way. (At least for the numbers, not for the flakiness; a test should never be flaky.)


Ullrich Hafner

Sep 28, 2017, 6:37:30 PM
to Jenkins Developers

On 28 Sep 2017 at 16:42, Mark Waite <mark.ea...@gmail.com> wrote:

[..]

> Do we have any way of associating historical acceptance test harness failures to a cause of those failures?

I found a lot of regressions in my plugins with these tests, though I don’t have any numbers.

> If not, could such data be gathered from some sample of recent runs of the acceptance test harness, so that we know which of the tests have been most helpful in the recent past?

I do not think that this is possible (or even required).

> For example, I'd support deleting an acceptance test if it has never detected a bug since it was created.

This makes no sense. A good ATH test case is there to prevent a regression. So after what age would you like to remove a test that has caused no failure? Two months, two years?

> Alternately, could we add a layering concept to the acceptance test harness? There could be "precious" tests which run every time and there could be other tests which are part of a collection from which a few tests are selected randomly for execution every time.

This is already possible. You just need someone on the core team who identifies which of the tests are essential for core features. Most ATH tests are testing plugin features.


Jesse Glick

Sep 28, 2017, 6:45:58 PM
to Jenkins Dev
On Thu, Sep 28, 2017 at 6:31 PM, Ullrich Hafner
<ullrich...@gmail.com> wrote:
> In order to verify core PRs just use the smoke tests. You can select the corresponding tests on your own by changing the annotated tests.

Yes I have seen this, though I think there is no one really keeping
track of which tests are annotated in this way and making sure that
the set is sane.

The way ATH is structured it is not suitable for running in core PRs I
think: tests could start failing with no relation to the core change
under test. Even if you pinned a specific revision of
`acceptance-test-harness`, you would still be pulling some unknown
version of each plugin from the update center. Even if you exclude all
tests with `@WithPlugins` (making running the tests far less useful),
many of them have intrinsic external dependencies that we can expect
to be flaky.

OTOH it makes sense to run a defined set of tests on a regular basis,
having someone designated to watch over the results, and ensuring that
weekly releases with known failures are not published.

> we have one test for two of our plugins, or we spend 30 seconds for each plugin. This is a ridiculously small number of tests!

If that were the only test coverage, sure. But the vast majority of
tests are functional tests in plugin repositories. ATH contributes
relatively little to our automated test coverage.

> I use these tests (not all, just the tests for my plugins) before I make a plugin release.

And in fact the tests for the Analysis suite are particularly slow, so
this is a good example of my point. A handful of them test the UI but
most seem as though they could have been written more naturally as
functional tests.

Ullrich Hafner

Sep 28, 2017, 6:56:43 PM
to Jenkins Developers
On 28 Sep 2017 at 20:51, Stephen Connolly <stephen.al...@gmail.com> wrote:

> On Thu 28 Sep 2017 at 19:11, Mark Waite <mark.ea...@gmail.com> wrote:
>> On Thu, Sep 28, 2017 at 11:43 AM Jesse Glick <jgl...@cloudbees.com> wrote:
>>> On Thu, Sep 28, 2017 at 10:42 AM, Mark Waite <mark.ea...@gmail.com> wrote:
>>>> Do we have any way of associating historical acceptance test harness
>>>> failures to a cause of those failures?
>>>
>>> Not that I am aware of.
>>>
>>>> could such data be gathered from some sample of recent runs of the
>>>> acceptance test harness, so that we know which of the tests have been most
>>>> helpful in the recent past?
>>>
>>> Maybe. I doubt things are in good enough condition to do that kind of
>>> analysis. AFAIK none of the CI jobs running the ATH have ever had a
>>> single stable run, so there is not really a baseline.
>>
>> Ah, so that means the current acceptance test harness is unlikely to be trusted in its entirety by anyone. A failure in the entire acceptance test harness is probably only used rarely to stop a release.
>>
>> I support deleting tests of questionable value.
>
> I don’t want to knock the contributors to the ATH, but my long standing view is that writing good acceptance tests is a high skill and not something that can be easily crowdsourced.

But writing code without tests can be crowdsourced?

> Acceptance tests are, by their nature, among the slowest tests.
>
> My long standing view of the ATH is that there is a fundamental flaw in using the same harness to drive as to verify *because* any change to that harness has the potential to invalidate any tests using the modified harness... it is all too easy to turn a good test into a test giving a false positive... I think an ATH that had two sets of tests, one driving via the filesystem / CLI and verifying via the UI while the other drives via the UI and verifies via the filesystem / CLI, would be much much stronger (and faster too). In most cases you wouldn’t be changing the two harnesses at the same time, so a failing test will indicate a genuine failure.
>
> We need an ATH that can be realistically run by humans in an hour or two... trim the test cases to get there, moving the tested functionality higher up to be tests that run faster, e.g. JenkinsRule or better yet pure unit tests.

Who is "we"? The core team? Yes, you should create an ATH that detects core problems (a subset of the current ATH). Create a test suite with 100 ATH tests and you will get your proposed runtime.
But this should not affect the ATH for plugins. You will never get a quality UI test suite for 1500 plugins in an hour. If a plugin author creates a suite with 10-50 tests, so what? This is up to the plugin author.

(In the early days of the ATH we discussed whether it would be better to have only the framework and page objects in the ATH and the individual tests in the plugins. Maybe we should reconsider this topic again…)


Robert Sandell

Sep 29, 2017, 5:40:10 AM
to jenkin...@googlegroups.com
The original "selling points" for me with the ATH was that it provided ways to test plugin functionality together with other plugins already installed.
Running plugin integration tests with "just" the test dependencies is one thing, but will it still work when 250 other plugins are installed alongside this one?
In that scenario even the simplest "happy path" plugin test is worth it in my mind.
The same, I believe, is valid from a core-validation perspective. It's one thing to test that a piece of core functionality works with one plugin installed, but does it still work with 250 other plugins? Or at least with the recommended plugin set installed?
No matter where the fault is when something breaks, the overall "product experience" will be seen as broken if something doesn't work with core + the recommended plugins, imho.

Now getting that mode to work in the ATH is not simple so I guess very few (if anyone) are running it in that way, I don't even know if that run mode even works any more.

So if we run a curated set of tests with the recommended plugins pre-installed we'll at least reduce or remove the plugin-install part of the test time, but then we won't be testing the setup wizard, and the reason why this discussion is happening would be void. But a separate run of "just" the install-experience-oriented tests could be done on the side with a blank JENKINS_HOME :)

/B




--
Robert Sandell
Software Engineer
CloudBees Inc.

Ullrich Hafner

Sep 29, 2017, 6:19:48 AM
to Jenkins Developers

> On 29 Sep 2017 at 00:45, Jesse Glick <jgl...@cloudbees.com> wrote:
>
> On Thu, Sep 28, 2017 at 6:31 PM, Ullrich Hafner
> <ullrich...@gmail.com> wrote:
>> In order to verify core PRs just use the smoke tests. You can select the corresponding tests on your own by changing the annotated tests.
>
> Yes I have seen this, though I think there is no one really keeping
> track of which tests are annotated in this way and making sure that
> the set is sane.
>

Yes, nobody is using or managing this annotation right now. But it would make sense to change this…

> The way ATH is structured it is not suitable for running in core PRs I
> think: tests could start failing with no relation to the core change
> under test. Even if you pinned a specific revision of
> `acceptance-test-harness`, you would still be pulling some unknown
> version of each plugin from the update center. Even if you exclude all
> tests with `@WithPlugins` (making running the tests far less useful),
> many of them have intrinsic external dependencies that we can expect
> to be flaky.
>

During the four months of our testing course the number of flaky tests was just a handful (<10). We should remove/ignore these as already suggested.

> OTOH it makes sense to run a defined set of tests on a regular basis,
> having someone designated to watch over the results, and ensuring that
> weekly releases with known failures are not published.

This would make sense.

>
>> we have one test for two of our plugins, or we spend 30 seconds for each plugin. This is a ridiculously small number of tests!
>
> If that were the only test coverage, sure. But the vast majority of
> tests are functional tests in plugin repositories. ATH contributes
> relatively little to our automated test coverage.
>
>> I use these tests (not all, just the tests for my plugins) before I make a plugin release.
>
> And in fact the tests for the Analysis suite are particularly slow, so
> this is a good example of my point. A handful of them test the UI but
> most seem as though they could have been written more naturally as
> functional tests.
>

Yes, these tests are slow. But converting them to integration tests would not make them much faster.
JenkinsRule tests don’t have the same overhead, but they are also quite slow. (E.g. a Pipeline multibranch test needs about 30 seconds on my machine.)

It would be quite an effort to rewrite these tests just to make them faster. It is mostly historical that these tests are in the ATH; at the time they were created,
the integration test package using JenkinsRule was not in the state it is in now.

I think it makes sense to exclude these tests from a core PR test job; as already mentioned, they are only important for a release of my plugins.
(Actually, a PR to one of my plugins should run these tests!)

Jesse Glick

Sep 29, 2017, 12:52:39 PM
to Jenkins Dev
On Fri, Sep 29, 2017 at 5:40 AM, Robert Sandell <rsan...@cloudbees.com> wrote:
> The original "selling points" for me with the ATH was that it provided ways
> to test plugin functionality together with other plugins already installed. […]
>
> Now getting that mode to work in the ATH is not simple so I guess very few
> (if anyone) are running it in that way, I don't even know if that run mode
> even works any more.

Not sure. If you just run against a regular 2.x WAR and specify
`@WithPlugins`, only the specified plugins (and transitive
dependencies) will be installed. I agree that it would be valuable to
have a mode whereby everything mentioned in `platform-plugins.json`
(or at least those which are `"suggested": true`) get loaded as well.
This would let us know if some extensions or listeners conflict, if
there are weird class loading issues with duplicated libraries, if
some Jelly or JavaScript errors break rendering of unrelated page
sections, and so on.
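
(A sketch of how such a mode could enumerate that set, assuming the setup wizard's file layout — an array of categories, each with a `plugins` array whose entries may carry `"suggested": true` — and using Jackson purely for brevity:)

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    public class SuggestedPlugins {
        /** Collects plugin IDs marked "suggested": true from a local copy of platform-plugins.json. */
        public static List<String> read(File platformPluginsJson) throws Exception {
            JsonNode categories = new ObjectMapper().readTree(platformPluginsJson);
            List<String> suggested = new ArrayList<>();
            for (JsonNode category : categories) {
                for (JsonNode plugin : category.path("plugins")) {
                    if (plugin.path("suggested").asBoolean(false)) {
                        suggested.add(plugin.path("name").asText());
                    }
                }
            }
            return suggested;
        }
    }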

Ullrich Hafner wrote:
> converting them to integration tests would not make them much faster

In my experience ATH-based tests are about an order of magnitude
slower than `JenkinsRule`-based tests. YMMV

martinda

Sep 29, 2017, 4:23:31 PM
to Jenkins Developers
I have two clarification questions regarding ATH, JenkinsRule and plugin-compat-tester.

Q1: Which type of test is good enough to catch that the http_request plugin has a transitive dependency on Guava 11.0.1? Guava 11.0.1 is currently provided by the Jenkins core, but this will change with https://github.com/jenkinsci/maven-plugin/pull/102

Q2: Which type of automated test approach should I use for my Jenkinsfiles and my Pipeline libraries? I want to test against the exact line up of Jenkins war version and plugin versions I use in production. On IRC the suggestion was to create a custom update center containing only the plugin versions I find in production, and run my code using the ATH. There is a chicken and egg problem there to get the UC up, but I can live with it.

Thanks,
Martin

Jesse Glick

Sep 29, 2017, 4:50:22 PM
to Jenkins Dev
On Fri, Sep 29, 2017 at 4:23 PM, martinda <martin....@gmail.com> wrote:
> Which type of test is good enough to catch that the http_request plugin has
> a transitive dependency on Guava 11.0.1? Guava 11.0.1 is currently provided
> by the Jenkins core, but this will change

ATH would probably catch that kind of thing. `JenkinsRule` would
probably not, hence JENKINS-41827.

> Which type of automated test approach should I use for my Jenkinsfiles
> and my Pipeline libraries? I want to test against the exact line up of
> Jenkins war version and plugin versions I use in production.

I do not know that there is any single recommendation. Some people use
JenkinsPipelineUnit, which is fast and lightweight but very low
fidelity. Pipeline component plugins themselves rely heavily on
`JenkinsRule`-based tests, which you could presumably use yourself to
get higher fidelity (including Jenkins core and plugin versions) at
the expense of more effort to mock out irrelevant parts of the
environment.
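
(The higher-fidelity route looks roughly like this — a `JenkinsRule` test with the Pipeline plugins on the test classpath, where the inline script stands in for a real Jenkinsfile or library call:)

    import org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition;
    import org.jenkinsci.plugins.workflow.job.WorkflowJob;
    import org.junit.Rule;
    import org.junit.Test;
    import org.jvnet.hudson.test.JenkinsRule;

    public class PipelineLibraryTest {
        @Rule public JenkinsRule r = new JenkinsRule();

        @Test
        public void runsPipelineScript() throws Exception {
            WorkflowJob p = r.createProject(WorkflowJob.class, "demo");
            // Sandboxed inline Pipeline script; a real test would load the library under test.
            p.setDefinition(new CpsFlowDefinition("node {\n  echo 'hello from the library'\n}", true));
            r.assertLogContains("hello from the library", r.buildAndAssertSuccess(p));
        }
    }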

Oliver Gondža

Oct 2, 2017, 7:52:18 AM
to jenkin...@googlegroups.com
A lot of comments and valuable feedback popped up toward the end of the
week. Let me react here instead of on individual messages.

- No argument on the slow test runs, high flakiness, and fragility. Some of that
is caused by a lack of attention, as the ATH stands a bit apart from the delivery
pipeline. Bringing it into PR builds of core or plugins would point more
people to it (hopefully making it better), but it would be annoying at
first until we make it more mature/stable. The question is whether we are
willing to bite the bullet here.

- With regard to the *kinds* of tests that should be part of the ATH, the
thing is this was never defined, so different kinds got in. I tend to
agree that what can be written against JTH should be. Perhaps it would be
beneficial to write that down and ensure no more such tests are added, to
prevent further bloating. The existing ones can either be deleted,
migrated, or tolerated, based on how much effort it takes to maintain
them.

- Regarding smoke tests: this seems like a good compromise (starting point?)
for PR integration, for several reasons. No question the actual test set
should be revisited. All (sufficiently distinct) core scenarios +
recommended-plugin-set scenarios + some popular scenarios of
non-recommended plugins?

- Speaking of deleting flaky tests: I am willing to give this a try, as
the idea of a stable build sounds wonderful and almost unheard of in ATH
land. As the maintainer of a downstream plugin update center certified
internally by the ATH, I am less excited. Again, I would like to see some
consensus on the process. Sometimes it might be enough to check that JTH
covers the same scenario or that the test can be migrated easily. What if it
does not, and the scenario appears fundamental? How do we tell?

Please let me know if there are any additional thoughts or if I have not
commented on something. Later this week, I intend to summarize this in a
doc or two in the ATH repo to base our future decisions on (w.r.t. removing
tests, not accepting them, etc.) and update the thread with a PR link.

--
oliver

Andres Rodriguez

Oct 3, 2017, 7:35:37 AM
to jenkin...@googlegroups.com
One thing that could encourage moving certain tests from the ATH to the JTH would be JENKINS-41827.





--
Regards,

Andres Rodriguez

Baptiste Mathus

Oct 3, 2017, 4:15:38 PM
to Jenkins Developers
+1 on simply deleting flaky tests.

As an ATH user myself, I see flaky tests as deeply dangerous for the morale and the trust people have in the tool. So much so that even after the quality has improved dramatically (i.e. less flakiness), I have to fight my own first reflex to just ignore any ATH result: "eh, the failures are unrelated anyway"...

"Unrelated" failures shouldn't be tolerated indeed. And I tend to think @Ignore is a poor-man's versioning and as the test suite there should only be critical tests, having less tests is probably for the good.


@Ullrich, about:
> quality UI test suite for 1500 plugins in an hour

Well, I don't know what others think, but for critical usage I don't think we're aiming for 1500. I have no idea what the right number would be, though. How should we base our coverage? I guess it makes no sense to have an ATH test for any plugin with fewer than 5k installs? 10k? More? Less? Or what should the criteria be to be eligible to enter the test suite at all?

Or as an intermediate solution, should we create two test suites?
Like for instance have two modules inside https://github.com/jenkinsci/acceptance-test-harness (we could have two repos, but I tend to think having only one with two modules would give a clearer and more trackable versioning/history when shuffling things around).
One would be something like acceptance-test-harness-critical, and we'd be *very* aggressive about what moves there. And there would be the module acceptance-test-harness (the current one, I would say).
Any new PR adding a new test should always be against acceptance-test-harness, and after a to-be-decided period, if 1) that test is deemed important enough, 2) it has *never* failed in the last X weeks/runs, and 3) it meets other criteria to be decided (test speed, for instance), it could be moved to acceptance-test-harness-critical.


My 2 cents

Jesse Glick

Oct 4, 2017, 12:25:47 PM
to Jenkins Dev
On Tue, Oct 3, 2017 at 4:15 PM, Baptiste Mathus <m...@batmat.net> wrote:
> should we create two test suites?
> Like for instance have two modules inside
> https://github.com/jenkinsci/acceptance-test-harness

Using the existing `SmokeTest` marker (activated with
`-PrunSmokeTests`) seems a lot lighter-weight. As previously
mentioned, the specific set of tests included here definitely needs
updating.

ogondza

Oct 6, 2017, 8:18:41 AM
to Jenkins Developers
Hey,

Here is my draft for the smoke tests https://github.com/jenkinsci/acceptance-test-harness/pull/365
and documentation of the test acceptance / removal: https://github.com/jenkinsci/acceptance-test-harness/pull/364

Please comment there for tuning. Note that things can be further refined after they are merged...

--
oliver