Bazel unstable on Gerrit-CI


Shawn Pearce

Feb 20, 2017, 1:10:56 PM
to repo-discuss
I'm seeing transient build failures in my series with errors like:

JUnit4 Test Runner
.*** Error in `/home/jenkins/.cache/bazel/_bazel_jenkins/3239551e333dc09ba2b5ef07ff4549b6/execroot/gerrit/bazel-out/local-fastbuild/bin/gerrit-gpg/gpg_tests.runfiles/local_jdk/bin/java': free(): invalid pointer: 0x00000007c0035528 ***
external/bazel_tools/tools/test/test-setup.sh: line 105: 13710 Aborted                 (core dumped) "${EXE}" "$@"

*sigh*

luca.mi...@gmail.com

Feb 20, 2017, 5:58:39 PM
to Shawn Pearce, repo-d...@googlegroups.com
I know, we can try to reintroduce the retry to eliminate the residual flakiness.

Luca

Sent from my iPhone
--
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

lucamilanesio

Feb 21, 2017, 2:38:23 AM
to Repo and Gerrit Discussion, s...@google.com
Looking at the failures, I can see the following:

  1. We already run 3 builds for each change, using different NoteDb modes.
  2. When there is some flakiness, like the one mentioned by Shawn, it typically affects only some of the 3 builds.
  3. When all the 3 builds are failing, it is typically a genuine code (or test) error.
I'd suggest then to check the status of the 3 builds and, should some of them fail, go into a retry cycle (3 times?).
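That heuristic can be sketched as follows (illustrative only — the actual CI flow is a Jenkins Workflow script, and the `classify` helper and build names here are hypothetical stand-ins):

```python
# Illustrative sketch of the proposed heuristic, not the actual CI code.

def classify(builds):
    """builds: dict mapping build name -> 'SUCCESS' or 'FAILURE'.

    All green -> pass; all red -> genuine code/test error;
    a mix -> flaky, and the failed builds are candidates for retry.
    """
    failed = sorted(name for name, status in builds.items()
                    if status != 'SUCCESS')
    if not failed:
        return 'pass', []
    if len(failed) == len(builds):
        return 'genuine-failure', failed
    return 'flaky', failed

# Example: only one of the three NoteDb-mode builds failed -> flaky.
verdict, flaky = classify({
    'bazel/notedbPrimary': 'SUCCESS',
    'bazel/notedbReadWrite': 'SUCCESS',
    'bazel/reviewdb': 'FAILURE',
})
# verdict == 'flaky', flaky == ['bazel/reviewdb']
```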

What do you think?

Luca.




Edwin Kempin

Feb 21, 2017, 2:43:28 AM
to lucamilanesio, Repo and Gerrit Discussion, Shawn Pearce
On Tue, Feb 21, 2017 at 8:38 AM, lucamilanesio <luca.mi...@gmail.com> wrote:
Looking at the failures, I can see the following:

  1. We already run 3 builds for each change, using different NoteDb modes.
  2. When there is some flakiness, like the one mentioned by Shawn, it typically affects only some of the 3 builds
  3. When all the 3 builds are failing, it is typically for a genuine code (or test) error
I'd suggest then to check the status of the 3 builds and, should some of them fail, go into a retry cycle (3 times?).

What do you think?
SGTM

lucamilanesio

Feb 22, 2017, 4:07:53 AM
to Repo and Gerrit Discussion, luca.mi...@gmail.com, s...@google.com
I've implemented a change to the CI flow that detects and retries flaky builds.

You'll see something like the following:

** FLAKY Builds detected: [bazel/reviewdb]
ignore(FAILURE) {
    parallel {
        retry (attempt 1) {
            Schedule job Gerrit-verifier-bazel
            Build Gerrit-verifier-bazel #5169 started
            Gerrit-verifier-bazel #5169 completed 
Builds status:
  bazel/notedbPrimary : SUCCESS
    (https://gerrit-ci.gerritforge.com/job/Gerrit-verifier-bazel/5160/console)
  bazel/notedbReadWrite : SUCCESS
    (https://gerrit-ci.gerritforge.com/job/Gerrit-verifier-bazel/5159/console)
  bazel/reviewdb : SUCCESS
    (https://gerrit-ci.gerritforge.com/job/Gerrit-verifier-bazel/5169/console)
        }
    }
    // SUCCESS ignored
}


What that means is that some of the build combinations were flaky and are retried up to 3 times. If they then succeed, the overall change is Verified +1; if the flaky builds still fail after 3 attempts, the change is Verified -1.
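A minimal sketch of that retry cycle (Python pseudocode for illustration only — the real logic lives in the Jenkins Workflow script, and `run_build` is a hypothetical stand-in for scheduling a Gerrit-verifier-bazel job):

```python
def verify_with_retries(run_build, flaky_builds, max_attempts=3):
    """Re-run each flaky build up to max_attempts times.

    run_build: callable taking a build name, returning True on success
               (stand-in for scheduling a Gerrit-verifier-bazel job).
    Returns +1 (Verified) if every flaky build eventually passes,
    -1 if any of them still fails after all attempts.
    """
    for name in flaky_builds:
        for _attempt in range(max_attempts):
            if run_build(name):
                break           # this build recovered; move on
        else:
            return -1           # exhausted retries: treat as genuine failure
    return +1
```

So a build that fails once and then passes on retry still yields Verified +1 overall.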

Let's see if that helps :-)

Luca.



Shawn Pearce

Feb 22, 2017, 10:05:24 AM
to lucamilanesio, Repo and Gerrit Discussion
On Wed, Feb 22, 2017 at 1:07 AM, lucamilanesio <luca.mi...@gmail.com> wrote:
I've implemented a change to the CI flow that detects and retry flaky builds.

Thank you Luca!

Han-Wen Nienhuys

Feb 22, 2017, 11:36:48 AM
to Shawn Pearce, repo-discuss
The JVM shouldn't do invalid malloc/frees. Did you file a bug for this with the Bazel team?



--
Han-Wen Nienhuys
Google Munich
han...@google.com

lucamilanesio

Feb 22, 2017, 5:44:17 PM
to Repo and Gerrit Discussion, s...@google.com
I have to say that when we had Buck and Bazel in parallel, we noticed the Bazel ones were a bit more flaky :-(
However, with the retry logic implemented, things are going much better.

See below one example of a build that would have failed, but succeeded on the 2nd retry:

Builds status:
  bazel/notedbPrimary : SUCCESS
    (https://gerrit-ci.gerritforge.com/job/Gerrit-verifier-bazel/5362/console)
  bazel/reviewdb : SUCCESS
    (https://gerrit-ci.gerritforge.com/job/Gerrit-verifier-bazel/5360/console)
  bazel/notedbReadWrite : FAILURE
    (https://gerrit-ci.gerritforge.com/job/Gerrit-verifier-bazel/5361/console)
        } // failed
    }
    // FAILURE ignored
}
** FLAKY Builds detected: [bazel/notedbReadWrite]
ignore(FAILURE) {
    parallel {
        retry (attempt 1) {
            Schedule job Gerrit-verifier-bazel
            Build Gerrit-verifier-bazel #5363 started
            Gerrit-verifier-bazel #5363 completed 
Builds status:
  bazel/notedbPrimary : SUCCESS
    (https://gerrit-ci.gerritforge.com/job/Gerrit-verifier-bazel/5362/console)
  bazel/reviewdb : SUCCESS
    (https://gerrit-ci.gerritforge.com/job/Gerrit-verifier-bazel/5360/console)
  bazel/notedbReadWrite : SUCCESS
    (https://gerrit-ci.gerritforge.com/job/Gerrit-verifier-bazel/5363/console)
        }
    }
    // SUCCESS ignored
}
----------------------------------------------------------------------------
Gerrit Review: Verified=1 to change 97818/ffb06c28ba727d7b337f7bffea438c7b6d4dc4c1
----------------------------------------------------------------------------
Finished: SUCCESS

David Ostrovsky

Feb 23, 2017, 1:49:03 AM
to Repo and Gerrit Discussion

On Wednesday, February 22, 2017 at 11:44:17 PM UTC+1, lucamilanesio wrote:
I have to say that when we had Buck and Bazel in parallel, we noticed the Bazel ones were a bit more flaky :-(
However, with the retry logic implemented, things are going much better.

See below one example of a build that would have failed, but succeeded on the 2nd retry:

I think Han-Wen's point was to report the breakage to the Bazel team
so they are aware of it and could help us, or even provide a workaround.

But looking at the breakage you referenced, it's a socket closed exception:

1) pushWithoutChangeId(com.google.gerrit.acceptance.git.HttpPushForReviewIT)
org.eclipse.jgit.api.errors.TransportException: Socket closed

Can we reproduce it somehow locally?


luca.mi...@gmail.com

Feb 23, 2017, 2:34:45 AM
to David Ostrovsky, Repo and Gerrit Discussion
As the breakage is random and makes the build flaky, it means that it is sporadic and hard to reproduce :-(

Luca

Sent from my iPhone

David Ostrovsky

Feb 23, 2017, 2:57:22 AM
to Repo and Gerrit Discussion

On Thursday, February 23, 2017 at 8:34:45 AM UTC+1, lucamilanesio wrote:
As the breakage is random and makes the build flaky, it means that it is sporadic and hard to reproduce :-(

I think we have a multi-faceted problem here: Docker, Jenkins, Bazel,
flaky Gerrit tests. Still, we should try to identify similar failure
patterns and try to address them by opening issues on the respective
upstream projects.

Luca Milanesio

Feb 23, 2017, 3:01:00 AM
to David Ostrovsky, Repo and Gerrit Discussion
On 23 Feb 2017, at 07:57, David Ostrovsky <david.o...@gmail.com> wrote:


On Thursday, February 23, 2017 at 8:34:45 AM UTC+1, lucamilanesio wrote:
As the breakage is random and makes the build flaky, it means that it is sporadic and hard to reproduce :-(

I think we have a multi-faceted problem here: Docker

Do you believe the flakiness comes from Docker? I could set up a physical slave to check if that is the case.

, Jenkins

The build runs on a machine that has no Jenkins setup, only the minimum set of packages needed to build Gerrit.

, Bazel,

Yes.

flaky Gerrit tests.

Yes, but the SEGV highlighted by Shawn wasn't on a flaky test.

Still, we should try to identify similar failure patterns and try to
address them by opening issues on the respective upstream projects.

Seems easy to say, more difficult to implement :-(

lucamilanesio

Feb 24, 2017, 3:56:15 AM
to Repo and Gerrit Discussion, david.o...@gmail.com
Looks like the retry logic introduced has provided a lot more stability to the builds :-)

See below the latest graph:


Since the introduction of the change to the Gerrit CI flow ... the green is predominant again :-)


Luca.






Edwin Kempin

Feb 24, 2017, 4:25:03 AM
to lucamilanesio, Repo and Gerrit Discussion, David Ostrovsky
On Fri, Feb 24, 2017 at 9:56 AM, lucamilanesio <luca.mi...@gmail.com> wrote:
Looks like the retry logic introduced has provided a lot more stability to the builds :-)

See below the latest graph:


Since the introduction of the change on the Gerrit CI flow ... the green is predominant again :-)


Nice, this looks much better :)
Still, the number of flaky Gerrit tests is worrisome.

Luca Milanesio

Feb 24, 2017, 4:30:34 AM
to Edwin Kempin, Repo and Gerrit Discussion, David Ostrovsky
Nice, this looks much better :)
Still the number of flaky Gerrit tests is worrisome.

So, to be honest with you, *LOTS* of the tests I have analysed recently are inherently flaky :-(

Shall we start tracking them somewhere?
Our issue tracker?

The reason WHY they are flaky is that they are designed to run in isolation on the dev laptop ... whilst that typically isn't the case in the CI.

1. Bazel runs tests in parallel
2. Tests run at different speeds, sometimes higher sometimes lower, based on the load

The way you write a test is important to making it stable and reproducible.
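One common source of exactly this kind of flakiness is a fixed sleep that is long enough on a developer laptop but not on a loaded CI executor. A generic illustration (not taken from the Gerrit test suite) of polling with a deadline instead:

```python
import time

def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll condition() until it returns True or timeout elapses.

    Unlike a fixed time.sleep(2), this returns quickly on a fast
    machine and still tolerates a slow, heavily loaded CI executor.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one last check at the deadline
```

A test built on `wait_until` passes as soon as the condition holds, instead of betting on how fast the machine happens to be.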

Luca.

Edwin Kempin

Feb 24, 2017, 4:39:32 AM
to Luca Milanesio, Repo and Gerrit Discussion, David Ostrovsky
On Fri, Feb 24, 2017 at 10:30 AM, Luca Milanesio <luca.mi...@gmail.com> wrote:

Nice, this looks much better :)
Still the number of flaky Gerrit tests is worrisome.

So, to be honest with you, *LOTS* of the tests I have analysed recently are inherently flaky :-(

Shall we start tracking them somewhere?
Our issue tracker?
+1, I think it's worth looking at these tests and fixing them

Luca Milanesio

Feb 24, 2017, 4:42:29 AM
to Edwin Kempin, Repo and Gerrit Discussion, David Ostrovsky
Let me see if I can automate something and provide an automatic comment "+Flaky" on a change that has a flaky run associated with it.
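For what it's worth, such a comment could be posted through Gerrit's REST API (`POST /changes/{id}/revisions/{rev}/review`). A hedged sketch — the host, helper names, and message wording are hypothetical, and authentication is omitted:

```python
import json
import urllib.request

def flaky_comment_payload(flaky_builds):
    """ReviewInput body flagging a change whose verification needed retries."""
    return {
        "message": "+Flaky builds detected and retried: "
                   + ", ".join(sorted(flaky_builds)),
    }

def post_review(base_url, change_id, revision, payload,
                opener=urllib.request.urlopen):
    """POST the review comment to Gerrit's changes endpoint."""
    req = urllib.request.Request(
        f"{base_url}/changes/{change_id}/revisions/{revision}/review",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return opener(req)
```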

Luca.