[WARNING] Gerrit CI disabled because of broken build on master


lucamilanesio

unread,
Mar 28, 2017, 11:38:15 AM3/28/17
to Repo and Gerrit Discussion
The Gerrit master tests are failing because of an out of memory error.
I am trying to see if raising the default settings helps or if we have a leak somewhere.

In the meantime, I can't leave the validation enabled because all your builds would fail anyway :-(

Will keep the mailing list posted on the progress ...

Luca.

thomasmu...@yahoo.com

unread,
Mar 28, 2017, 2:53:03 PM3/28/17
to Repo and Gerrit Discussion
I'm guessing, but I think there is a 50/50 chance that https://gerrit-review.googlesource.com/#/c/101173/ fixed it.

Dave Borowitz

unread,
Mar 28, 2017, 2:54:32 PM3/28/17
to thomasmu...@yahoo.com, Repo and Gerrit Discussion
I would guess not, since that bug only affects the UI and shouldn't cause a server to OOM.


Luca Milanesio

unread,
Mar 28, 2017, 5:36:56 PM3/28/17
to Dave Borowitz, thomasmu...@yahoo.com, Repo and Gerrit Discussion
I made many attempts and raised the heap up to 8 GB ... but still an OutOfMemoryError!
I am now convinced we have introduced a memory leak somewhere.

I am trying now to isolate the exact test that triggers the error and then bisect on it.

Hope to post more news and findings very soon ...


David Ostrovsky

unread,
Mar 28, 2017, 5:45:21 PM3/28/17
to Repo and Gerrit Discussion


I was able to reproduce the GC overhead issue in disabled-ReviewDb mode
that the CI reported on this change: [1]. So I bisected it with this command:

  $ bazel test --test_env=GERRIT_NOTEDB=DISABLE_CHANGE_REVIEW_DB //gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change:rest_change_other

11873ef74eb5a8fbce6023a37ad762b3784c4449 is the first bad commit

The failure I'm seeing is: [2].


lucamilanesio

unread,
Mar 28, 2017, 5:45:35 PM3/28/17
to Repo and Gerrit Discussion, dbor...@google.com, thomasmu...@yahoo.com
Reproduced locally: 
bazel test //gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change:rest_change_other

... and then after 2-3 minutes ... BOOM! OutOfMemoryError.

See the VM graphs below:


It seems that the number of threads increases steadily over time, and those threads likely hold on to all their associated resources.


Then very soon the GC goes crazy and is no longer able to release any memory.
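For reference, graphs like these can be approximated without attaching a profiler by sampling the standard JMX beans. A minimal sketch (the class name and sampling interval are arbitrary, not part of Gerrit):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

/** Periodically prints live thread count and heap usage, similar to what the VM graphs show. */
public class VmStatsLogger {
  public static void main(String[] args) throws InterruptedException {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
    while (true) {
      long usedMb = memory.getHeapMemoryUsage().getUsed() / (1024 * 1024);
      System.out.printf("threads=%d heapUsedMb=%d%n", threads.getThreadCount(), usedMb);
      Thread.sleep(5000); // sample every 5 seconds
    }
  }
}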


Starting now bisecting ...


Luca.

thomasmu...@yahoo.com

unread,
Mar 28, 2017, 5:49:34 PM3/28/17
to Repo and Gerrit Discussion
Do we know the date it first started? If we do, we can start from there when deciding which commits to look through.


Luca Milanesio

unread,
Mar 28, 2017, 5:49:49 PM3/28/17
to David Ostrovsky, Repo and Gerrit Discussion
Oh thanks, you saved me lots of time.
:-)

Have you tried simply reverting 11873ef74eb5a8fbce6023a37ad762b3784c4449?

Luca Milanesio

unread,
Mar 28, 2017, 5:51:34 PM3/28/17
to David Ostrovsky, Repo and Gerrit Discussion

Dave Borowitz

unread,
Mar 28, 2017, 5:51:41 PM3/28/17
to Luca Milanesio, Patrick Hiesel, David Ostrovsky, Repo and Gerrit Discussion
+Patrick



Luca Milanesio

unread,
Mar 28, 2017, 5:54:52 PM3/28/17
to Dave Borowitz, Patrick Hiesel, David Ostrovsky, Repo and Gerrit Discussion
So ... that change was actually verified successfully.

The only thing that is a bit "suspect" is the test execution time:
        Testing //gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change:rest_change_other, 553 s

The test took almost 10 minutes.
Let me try locally on that commit and see what the JVM graphs look like ...

Luca.

thomasmu...@yahoo.com

unread,
Mar 28, 2017, 5:55:28 PM3/28/17
to Repo and Gerrit Discussion
The test passes for me on a Mac laptop with 16 GB of RAM.

Though it actually used about 10 GB of RAM.

Anyways

$ bazel test //gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change:rest_change_other

INFO: Found 1 test target...

Target //gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change:rest_change_other up-to-date:

  bazel-bin/gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change/rest_change_other.jar

  bazel-bin/gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change/rest_change_other

INFO: Elapsed time: 147.994s, Critical Path: 144.87s

//gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change:rest_change_other PASSED in 138.0s


Executed 1 out of 1 test: 1 test passes.


I noticed that right at the start it went from 6 GB of RAM to almost 0 GB, but after that the RAM was steady.


thomasmu...@yahoo.com

unread,
Mar 28, 2017, 5:59:49 PM3/28/17
to Repo and Gerrit Discussion
Reverting this https://gerrit-review.googlesource.com/#/c/100783/ seems to give me 1 GB more of RAM, so instead of running out at 1 GB I am at 2.17 GB now.


thomasmu...@yahoo.com

unread,
Mar 28, 2017, 6:01:00 PM3/28/17
to Repo and Gerrit Discussion
But it is a mixed result. I am now between 2.17 and 1.80 GB, better than before. I was expecting it to go to 1 or 2 GB.

bazel test //gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change:rest_change_other

INFO: Found 1 test target...

Target //gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change:rest_change_other up-to-date:

  bazel-bin/gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change/rest_change_other.jar

  bazel-bin/gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change/rest_change_other

INFO: Elapsed time: 174.101s, Critical Path: 166.99s

//gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change:rest_change_other PASSED in 131.5s


Executed 1 out of 1 test: 1 test passes.


thomasmu...@yahoo.com

unread,
Mar 28, 2017, 6:01:33 PM3/28/17
to Repo and Gerrit Discussion
It seems the test runs almost 8 seconds faster with the revert.

Luca Milanesio

unread,
Mar 28, 2017, 6:18:21 PM3/28/17
to thomasmu...@yahoo.com, Repo and Gerrit Discussion, Patrick Hiesel, David Ostrovsky, Dave Borowitz
It is not the speed that is worrying but rather the memory consumption.
Let me try the commit *before* that one and see what the JVM heap graph looks like ...


Luca Milanesio

unread,
Mar 28, 2017, 6:22:03 PM3/28/17
to thomasmu...@yahoo.com, Repo and Gerrit Discussion, Patrick Hiesel, David Ostrovsky, Dave Borowitz
*SAME PROBLEM* even before Patrick's change.
I'll dig now inside those threads to see why they are not getting released ...

thomasmu...@yahoo.com

unread,
Mar 28, 2017, 6:31:19 PM3/28/17
to Repo and Gerrit Discussion
As far as I can see, this https://gerrit-ci.gerritforge.com/job/Gerrit-master/624/ is the first build that started failing with these errors. All builds before that are green.


Luca Milanesio

unread,
Mar 28, 2017, 6:31:26 PM3/28/17
to thomasmu...@yahoo.com, Repo and Gerrit Discussion, Patrick Hiesel, David Ostrovsky, Dave Borowitz
OK, I believe I know what's happening.

The Bazel BUILD rule says:

SUBMIT_UTIL_SRCS = glob(["AbstractSubmit*.java"])

SUBMIT_TESTS = glob(["Submit*IT.java"])

OTHER_TESTS = glob(
    ["*IT.java"],
    exclude = SUBMIT_TESTS,
)

That means that when the OTHER_TESTS group gets executed, it runs *ALL* the remaining IT tests in parallel in the same JVM.
As the number of integration tests grew over time, the JVM got more and more overloaded with in-memory Gerrit instances running in parallel ... and we just reached the breaking point.

I am now going to split OTHER_TESTS into different groups to see if the problem goes away.

Luca.

Luca Milanesio

unread,
Mar 28, 2017, 6:33:03 PM3/28/17
to thomasmu...@yahoo.com, Repo and Gerrit Discussion, Patrick Hiesel, David Ostrovsky, Dave Borowitz
I believe Patrick's change was just the straw that broke the camel's back ...

thomasmu...@yahoo.com

unread,
Mar 28, 2017, 6:37:18 PM3/28/17
to Repo and Gerrit Discussion, thomasmu...@yahoo.com, hie...@google.com, david.o...@gmail.com, dbor...@google.com
Checking out https://gerrit-review.googlesource.com/#/c/100928/ results in no dramatic drops in RAM.

It went from 5.x GB to 3.30 GB and back up to 4.15 GB with the test running.

thomasmu...@yahoo.com

unread,
Mar 28, 2017, 6:37:37 PM3/28/17
to Repo and Gerrit Discussion, thomasmu...@yahoo.com, hie...@google.com, david.o...@gmail.com, dbor...@google.com

bazel test //gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change:rest_change_other

INFO: Found 1 test target...

Target //gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change:rest_change_other up-to-date:

  bazel-bin/gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change/rest_change_other.jar

  bazel-bin/gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change/rest_change_other

INFO: Elapsed time: 143.882s, Critical Path: 137.41s

//gerrit-acceptance-tests/src/test/java/com/google/gerrit/acceptance/rest/change:rest_change_other PASSED in 97.5s


Executed 1 out of 1 test: 1 test passes.


lucamilanesio

unread,
Mar 28, 2017, 6:54:10 PM3/28/17
to Repo and Gerrit Discussion, thomasmu...@yahoo.com, hie...@google.com, david.o...@gmail.com, dbor...@google.com
So, I think I have found the root cause: it seems that com.google.gerrit.acceptance.AbstractDaemonTest
is not able to shut down *ALL* Gerrit threads and leaves some of them open, along with the associated resources.

Even when running a simple test suite (ActionsIT.java), the thread leak is clear.
Let me see if there is a quick fix for it ...

Luca.


lucamilanesio

unread,
Mar 28, 2017, 7:21:33 PM3/28/17
to Repo and Gerrit Discussion, thomasmu...@yahoo.com, hie...@google.com, david.o...@gmail.com, dbor...@google.com
I have to get some sleep, but I believe I can reproduce the leak by simply making a minimal subclass of AbstractDaemonTest and then starting/stopping the common server multiple times.

After the stop I see threads that shouldn't be there:

"sshd-SshServer[4fccdd1d]-nio2-thread-1" #97 daemon prio=5 os_prio=31 tid=0x00007fb780356000 nid=0x7f07 waiting on condition [0x000070000fcda000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00000006c1c207f8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
- None

"ChangeUpdate-0" #54 daemon prio=5 os_prio=31 tid=0x00007fb77f6f4800 nid=0x9703 waiting on condition [0x00007000108fe000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00000006c13a63e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
- None

"sshd-SshServer[4fccdd1d]-timer-thread-1" #39 daemon prio=5 os_prio=31 tid=0x00007fb780340800 nid=0x7b03 waiting on condition [0x000070000fad4000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00000006c1c22108> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
- None


"groups Write-0" #27 daemon prio=5 os_prio=31 tid=0x00007fb77b277800 nid=0x6503 waiting on condition [0x000070000efb3000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00000006c2122dc0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
- None

"groups Commit-0" #25 daemon prio=5 os_prio=31 tid=0x00007fb780328800 nid=0x6103 waiting on condition [0x000070000edad000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00000006c212e038> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
- None

"accounts Commit-0" #23 daemon prio=5 os_prio=31 tid=0x00007fb780000000 nid=0x5d03 waiting on condition [0x000070000eba7000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00000006c20df890> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
- None
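A minimal sketch of that reproduction, for reference: start and stop the in-memory server in a loop and report any threads that survive. The startServer()/stopServer() calls are hypothetical placeholders for whatever AbstractDaemonTest does internally, not the real API.

import java.util.Set;

/** Sketch: start and stop the in-memory server repeatedly and report surviving threads. */
public class ThreadLeakRepro {

  public static void main(String[] args) throws Exception {
    Set<Thread> baseline = Thread.getAllStackTraces().keySet();
    for (int i = 0; i < 10; i++) {
      startServer(); // placeholder: whatever AbstractDaemonTest does to bring the daemon up
      stopServer();  // placeholder: the corresponding shutdown
    }
    // Any thread still alive that was not present before the first start is a leak candidate.
    for (Thread t : Thread.getAllStackTraces().keySet()) {
      if (t.isAlive() && !baseline.contains(t)) {
        System.out.println("Possible leaked thread: " + t.getName());
      }
    }
  }

  // Hypothetical placeholders; wire these to the real in-memory Gerrit start/stop.
  private static void startServer() {}

  private static void stopServer() {}
}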

lucamilanesio

unread,
Mar 29, 2017, 3:32:05 AM3/29/17
to Repo and Gerrit Discussion, thomasmu...@yahoo.com, hie...@google.com, david.o...@gmail.com, dbor...@google.com
First problem: SshDaemon.

Even if we call Acceptor.close(), the SSHD Nio2Acceptor doesn't release any of its threads, so all the SSH threads we used in the tests stay in memory, for every test, along with the associated memory resources.

I'm deep-diving now into Apache SSHD.

Luca.

Patrick Hiesel

unread,
Mar 29, 2017, 3:39:27 AM3/29/17
to lucamilanesio, Repo and Gerrit Discussion, thomasmu...@yahoo.com, David Ostrovsky, Dave Borowitz
Sounds like a possible cause to me. We don't run any of the SSH tests inside Google and don't start the SshDaemon at all. There are more differences of course, but this seems like the best place to start to me.

lucamilanesio

unread,
Mar 29, 2017, 3:42:54 AM3/29/17
to Repo and Gerrit Discussion, luca.mi...@gmail.com, thomasmu...@yahoo.com, david.o...@gmail.com, dbor...@google.com
First bug found and fixed:

We never shut down the SSH daemon executor in the stop process.
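Not the actual patch, but the usual shape of this kind of fix, as a sketch: shut the executor down and give its workers a bounded time to exit (class and method names here are illustrative):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative shutdown of an executor owned by a daemon, called from its stop path. */
final class ExecutorShutdown {
  static void stopQuietly(ExecutorService executor) {
    executor.shutdown(); // stop accepting new tasks
    try {
      if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
        executor.shutdownNow(); // interrupt workers that did not finish in time
      }
    } catch (InterruptedException e) {
      executor.shutdownNow();
      Thread.currentThread().interrupt();
    }
  }
}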

More fixes to come ...

Luca.

lucamilanesio

unread,
Mar 29, 2017, 6:11:31 AM3/29/17
to Repo and Gerrit Discussion, luca.mi...@gmail.com, thomasmu...@yahoo.com, david.o...@gmail.com, dbor...@google.com
Looking at the last check on master ... the build is green and without any flakiness:

Can someone approve the corresponding stable-2.14 fix and merge to master?

We need to resume the CI validations ASAP.

Luca.

lucamilanesio

unread,
Mar 29, 2017, 6:10:18 PM3/29/17
to Repo and Gerrit Discussion, luca.mi...@gmail.com, thomasmu...@yahoo.com, david.o...@gmail.com, dbor...@google.com
Apparently, the fix wasn't enough: we still have OutOfMemoryError failures.

Looking at the graph, it looks obvious when things started to diverge significantly:


Apart from sporadic spikes, the build time was between 30 and 40 minutes ... and then we got spikes of 200+ minute builds ending up in deep red status.


What is strange, though, is that the SAME BUILD, split into three different parallel jobs, is successful on the verifier flow.


Is it possible that we are hitting some leak in Bazel as well?

I know that Bazel is client-server, and if you run multiple targets in the same "build session" they will hit the same Bazel server.


The combined effect of Bazel leaks (just guessing here) and Gerrit leaks could lead to this problem; is that credible, or just pure fantasy?

I don't have any other explanation for why the same build, done in a single run versus three parallel runs on three parallel VMs, can have such different outcomes.


Ideas / feedback is more than welcome :-)


Luca.

lucamilanesio

unread,
Mar 29, 2017, 7:40:02 PM3/29/17
to Repo and Gerrit Discussion, luca.mi...@gmail.com, thomasmu...@yahoo.com, david.o...@gmail.com, dbor...@google.com
I've analysed the situation again on the commits just before Patrick's, and I confirm that the JVM was already in a critical state: the suite completed with significant delay and high GC activity, as you can see below:


Then Patrick's commit introduced a brand new, quite heavy suite (ChangeReviewersByEmailIT.java) that just couldn't be hosted in an already heavily loaded JVM.


Maybe a quick fix is just to create a dedicated new target for Patrick's suite, to temporarily unblock the situation?


Luca.

David Ostrovsky

unread,
Mar 30, 2017, 3:32:10 AM3/30/17
to Repo and Gerrit Discussion

I think the way we do it (generating the test suite in a Skylark rule
and assigning an arbitrary number of test classes to it)
could be problematic from a resource management perspective.

I changed that approach to one native java_test rule invocation per test
class in [1]. Locally, I see that it fixes the JVM GC problem. Can you
activate this CL on the CI and check whether it fixes the problem there
too?

Disadvantage of this approach: it takes much longer to run the tests.


lucamilanesio

unread,
Mar 30, 2017, 3:38:10 AM3/30/17
to Repo and Gerrit Discussion
Hold on, look at what I've found on the JVM command line:

-Djava.io.tmpdir=/private/var/tmp/_bazel_lucamilanesio/05146e5884bb386b6d372310d769d784/execroot/gerrit/_tmp/rest_change_other_1
-Xmx16g
-Xmx256m
-ea
-Dbazel.test_suite=rest_change_otherTestSuite

It seems that EVEN IF I've told Bazel I want 16g of heap for my tests ... it adds the -Xmx256m anyway. 
Bug in Bazel? Special magic setting? 

Possibly the problem is much easier to solve :-)
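(For what it's worth, HotSpot generally honors the last -Xmx it sees on the command line, so the trailing -Xmx256m would win over -Xmx16g. A quick, hypothetical way to confirm the effective limit from inside the test JVM:)

// Prints the effective max heap; useful to confirm which -Xmx actually took effect.
public class MaxHeapCheck {
  public static void main(String[] args) {
    long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
    System.out.println("Effective max heap: " + maxMb + " MB");
  }
}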

Luca.

thomasmu...@yahoo.com

unread,
Mar 30, 2017, 3:47:41 AM3/30/17
to Repo and Gerrit Discussion
Which version of bazel are you using?

lucamilanesio

unread,
Mar 30, 2017, 3:51:37 AM3/30/17
to Repo and Gerrit Discussion
It isn't a Bazel problem ... the 256m is hardcoded in our Gerrit acceptance tests :-(
Going to raise it to at least 8g and see how it goes.

Luca.

lucamilanesio

unread,
Mar 30, 2017, 3:56:30 AM3/30/17
to Repo and Gerrit Discussion
With [1] it is running now with the desired 8g:

-Djava.io.tmpdir=/private/var/tmp/_bazel_lucamilanesio/05146e5884bb386b6d372310d769d784/execroot/gerrit/_tmp/rest_change_other_1
-Xmx8g
-ea
-Dbazel.test_suite=rest_change_otherTestSuite

lucamilanesio

unread,
Mar 30, 2017, 3:59:14 AM3/30/17
to Repo and Gerrit Discussion
OMG, tests are running *SO MUCH FASTER NOW*! Possibly that is the reason why we initially saw Bazel builds going so much slower than Buck :-(
Can someone review [1], please?

Luca.

lucamilanesio

unread,
Mar 30, 2017, 4:25:38 AM3/30/17
to Repo and Gerrit Discussion
Running now the stable-2.14 on [2].

An open point though is why we have leaks when we shut down Gerrit in-process. The remaining ones are actually in Lucene-related code.
Raising the heap to 8g is a quick fix, but sooner or later we'll hit the limit again.

Han-Wen Nienhuys

unread,
Mar 30, 2017, 5:38:47 AM3/30/17
to lucamilanesio, Repo and Gerrit Discussion
Is there any reason why we should specify a heap at all? Presumably
the JVM knows better how much it wants to consume anyway.

Bazel caps parallelism at the CPU count. If you want more or less
parallelism you could specify --jobs=N.



--
Han-Wen Nienhuys
Google Munich
han...@google.com

luca.mi...@gmail.com

unread,
Mar 30, 2017, 5:47:38 AM3/30/17
to Han-Wen Nienhuys, Repo and Gerrit Discussion
In real terms, you want to know whether your app has leaks or not: we do have leaks in the tests (or the app?), and raising the heap limit is only a short-term workaround.

Leaving the heap "unlimited" would not highlight leaks during tests :-(

In theory even 256m would have been enough, if we had no leaks!

Luca

Sent from my iPhone

Han-Wen Nienhuys

unread,
Mar 30, 2017, 5:49:32 AM3/30/17
to lucamilanesio, Repo and Gerrit Discussion
Could we have a test that just brings a server up and down 20 times and checks that the thread count isn't significantly bigger between round 10 and round 20?
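Something along these lines would do it, as a sketch (JUnit 4; restartServer() is a hypothetical placeholder for the real harness start/stop, and the jitter threshold is arbitrary):

import static org.junit.Assert.assertTrue;

import java.lang.management.ManagementFactory;
import org.junit.Test;

/** Sketch of the suggested regression test: thread count must stabilize across restarts. */
public class ServerRestartLeakTest {
  @Test
  public void threadCountStabilizesAcrossRestarts() throws Exception {
    int countAfterRound10 = 0;
    for (int round = 1; round <= 20; round++) {
      restartServer(); // hypothetical: bring the in-memory server up and down once
      if (round == 10) {
        countAfterRound10 = ManagementFactory.getThreadMXBean().getThreadCount();
      }
    }
    int countAfterRound20 = ManagementFactory.getThreadMXBean().getThreadCount();
    // Allow a small amount of jitter, but fail on a steady per-restart leak.
    assertTrue(
        "thread count grew from " + countAfterRound10 + " to " + countAfterRound20,
        countAfterRound20 <= countAfterRound10 + 5);
  }

  private void restartServer() {
    // Placeholder for starting and stopping the in-memory Gerrit server.
  }
}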

Luca Milanesio

unread,
Mar 30, 2017, 6:05:43 AM3/30/17
to Han-Wen Nienhuys, Repo and Gerrit Discussion
Yes, I did that test myself (server up/down/up/down ...) and it is eating up threads and heap.
I started fixing the leaks but it will take a bit of time to fix all of them.

The first fix was on the SSH Server that has been merged.

Others will follow ... but in the meantime we need to quickly restore the CI, as changes were merged without CI validation in the past couple of days.

Luca.

lucamilanesio

unread,
Mar 30, 2017, 9:00:38 AM3/30/17
to Repo and Gerrit Discussion, han...@google.com
Gerrit master is GREEN AGAIN. 🎺
I've re-enabled change verification for 4 runs, just to see how it goes. If everything goes well, I will leave it enabled.

Thanks for the help in getting this problem sorted out.

Luca.

Edwin Kempin

unread,
Mar 30, 2017, 9:25:49 AM3/30/17
to lucamilanesio, Repo and Gerrit Discussion, Han-Wen Nienhuys
On Thu, Mar 30, 2017 at 3:00 PM, lucamilanesio <luca.mi...@gmail.com> wrote:
Gerrit master is GREEN AGAIN. 🎺
Great! Thanks Luca for keeping the CI running!

Luca Milanesio

unread,
Mar 30, 2017, 9:45:06 AM3/30/17
to Edwin Kempin, Repo and Gerrit Discussion, Han-Wen Nienhuys
No problem, happy to contribute to the project :-)

Luca.

lucamilanesio

unread,
Mar 30, 2017, 6:27:02 PM3/30/17
to Repo and Gerrit Discussion, eke...@google.com, han...@google.com
Kudos to Viktar for working around the WCT flakiness with [1].
Now Gerrit master builds in ~20 mins, which is 2x faster than previously (see the last build at [2]).

With change validation, it should go even faster than that, because the three phases execute in parallel.
In theory, 10-15 mins should be the norm from now on.

P.S. We still have the leaks in our acceptance tests; the heap increase was A TEMPORARY WORKAROUND ... we need to fix our problems :-(

Luca.

lucamilanesio

unread,
Mar 31, 2017, 7:15:34 PM3/31/17
to Repo and Gerrit Discussion, han...@google.com
Another leak fixed at [1]: Lucene indexes were not closed and shut down.

A simple loop of start/stop of the in-memory Gerrit server would consume over 2 GB of heap. With this fix, heap consumption stays around 100 MB.
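Not the actual change in [1], but the general shape of such a fix, as a sketch: make sure each Lucene index releases its writer and directory when the server shuts down (the class here is illustrative, not Gerrit's real index code):

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

/** Illustrative close path for a Lucene-backed index held by the server. */
final class LuceneIndexHandle implements AutoCloseable {
  private final IndexWriter writer;
  private final Directory directory;

  LuceneIndexHandle(IndexWriter writer, Directory directory) {
    this.writer = writer;
    this.directory = directory;
  }

  @Override
  public void close() throws IOException {
    // Closing the writer releases its heap buffers; closing the directory releases
    // file handles (or heap, for RAM-based directories).
    writer.close();
    directory.close();
  }
}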


