Next steps with Zuul


James E. Blair

Feb 21, 2021, 10:22:40 PM
to repo-d...@googlegroups.com, Monty Taylor
Hi,

I think we're ready to continue moving forward with Zuul for Gerrit.

We've been running Zuul on plugins for a while, and I think that's been
going relatively well. This past week I did manage to cause a couple of
outages, for which I apologize. In both cases we made improvements to
the system that should make it more robust.

We do already have a job for Gerrit ready to run. It only runs the
build, no tests yet, and it currently takes about 12-14 minutes to run,
which seems like a workable starting point to me. I believe we can
improve on that time, but it seems short enough that we could go ahead
and start running it in parallel with the current CI system and not
inconvenience anyone.

Of that run time, about 6 minutes is setup, and about 6 minutes is the
Bazel build. Monty and I have started work on improving the setup time
and using a Bazel cache to reduce the build time.

Meanwhile, we should also start running tests. If I'm reading the
job definitions correctly, it looks like we're running:

--test_tag_filters=-flaky,elastic,git-protocol-v2

I propose the following iterative approach:

1) Add a checker to the gerrit repo so that we start running the current
build-only job on changes to Gerrit.

2) Update that job to start running the above tests.

3) Add any additional tests.

And as they are ready, we can modify the job to use the Bazel cache and
other improvements to the setup time.

How does that sound? If the Gerrit maintainers are okay with this, I
can go ahead and add the checker.

-Jim

David Ostrovsky

Feb 22, 2021, 5:58:46 AM
to Repo and Gerrit Discussion
James Blair wrote on Monday, 22 February 2021 at 04:22:40 UTC+1:
> Hi,
>
> I think we're ready to continue moving forward with Zuul for Gerrit.
>
> We've been running Zuul on plugins for a while, and I think that's been
> going relatively well. This past week I did manage to cause a couple of
> outages, for which I apologize. In both cases we made improvements to
> the system that should make it more robust.

Thanks for the quick fix.
 


> We do already have a job for Gerrit ready to run. It only runs the
> build, no tests yet, and it currently takes about 12-14 minutes to run,
> which seems like a workable starting point to me. I believe we can
> improve on that time, but it seems short enough that we could go ahead
> and start running it in parallel with the current CI system and not
> inconvenience anyone.
>
> Of that run time, about 6 minutes is setup, and about 6 minutes is the
> Bazel build. Monty and I have started work on improving the setup time
> and using a Bazel cache to reduce the build time.
>
> Meanwhile, we should also start running tests. If I'm reading the
> job definitions correctly, it looks like we're running:
>
> --test_tag_filters=-flaky,elastic,git-protocol-v2

Is there a typo in the above filter? Should it instead be:

--test_tag_filters=-flaky,-elastic,-git-protocol-v2

This filter says: run all tests except flaky tests, Docker-based tests (ES),
and the git protocol v2 tests, which require a very recent git client release (2.17.1).



> I propose the following iterative approach:
>
> 1) Add a checker to the gerrit repo so that we start running the current
> build-only job on changes to Gerrit.
>
> 2) Update that job to start running the above tests.
>
> 3) Add any additional tests.


We recently moved to a remote build execution (RBE) backend and added a
dedicated new checker for it: RBE Build/Test.

We now split the tests between the RBE and non-RBE checkers, keeping both
checkers in place:

RBE Build/Test (filter: --test_tag_filters=-flaky,-elastic,-git-protocol-v2)
Build/Test (filter: --test_tag_filters=-flaky,elastic,git-protocol-v2)
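
To make the difference between the two filters concrete, here is a minimal Python sketch of how Bazel's --test_tag_filters selects tests: a tag prefixed with "-" excludes any test carrying it, and if any unprefixed tags are listed, a test must carry at least one of them. The target names and their tags are hypothetical and only for illustration; this models the documented flag behaviour and is not Bazel's own code.

```python
def parse_tag_filter(spec):
    """Split a comma-separated filter into (included, excluded) tag sets."""
    included, excluded = set(), set()
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("-"):
            excluded.add(item[1:])
        else:
            included.add(item)
    return included, excluded


def matches(test_tags, spec):
    """Return True if a test with these tags is selected by the filter."""
    included, excluded = parse_tag_filter(spec)
    tags = set(test_tags)
    if tags & excluded:
        return False  # any excluded tag rules the test out
    # With no included tags, everything not excluded is selected.
    return not included or bool(tags & included)


# Hypothetical test targets and tags (assumptions, not real Gerrit targets).
TESTS = {
    "//example:plain_test": [],
    "//example:flaky_test": ["flaky"],
    "//example:elastic_test": ["elastic"],
    "//example:git_v2_test": ["git-protocol-v2"],
    "//example:flaky_elastic_test": ["flaky", "elastic"],
}

RBE_FILTER = "-flaky,-elastic,-git-protocol-v2"
NON_RBE_FILTER = "-flaky,elastic,git-protocol-v2"


def select(spec):
    """List the hypothetical targets that the filter would run."""
    return sorted(name for name, tags in TESTS.items() if matches(tags, spec))


print(select(RBE_FILTER))
print(select(NON_RBE_FILTER))
```

Under these assumptions, the first filter selects only the untagged test, while the second selects only the non-flaky elastic and git-protocol-v2 tests, so the two checkers cover disjoint sets.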

As can be seen from this series [1], we got a significant performance improvement:

Build/Tests      Optional  Successful  Feb 21 1:20 PM  23 min 53 sec
RBE Build/Tests  Optional  Successful  Feb 21 1:20 PM   8 min 38 sec
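
As a quick worked check of those two durations, the Python snippet below converts them to seconds and computes the ratio and the time saved. Note that the two checkers run different test subsets, so the ratio is only indicative and not a like-for-like benchmark.

```python
def to_seconds(minutes, seconds):
    """Convert a 'M min S sec' duration to seconds."""
    return minutes * 60 + seconds


non_rbe = to_seconds(23, 53)  # Build/Tests: 23 min 53 sec
rbe = to_seconds(8, 38)       # RBE Build/Tests: 8 min 38 sec

speedup = non_rbe / rbe       # how many times faster the RBE run finished
saved = non_rbe - rbe         # wall-clock seconds saved

print(f"speedup: {speedup:.2f}x, saved: {saved // 60} min {saved % 60} sec")
```

That works out to roughly 2.8 times faster, or a little over 15 minutes saved per run.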

We should also move PolyGerrit verification to RBE, see: [2].

> And as they are ready, we can modify the job to use the Bazel cache and
> other improvements to the setup time.
>
> How does that sound? If the Gerrit maintainers are okay with this, I
> can go ahead and add the checker.

It was agreed at the last ESC meeting [3] to move to Zuul CI, but I think we
should replace the current CI rather than build and test on a second CI in addition.
Of course we should use an iterative approach: say, move the RBE checker from
Jenkins CI to Zuul CI first.

I guess we need a new checker to run on Zuul, so we would add one, say
Zuul RBE Build/Tests, offload the build and tests to RBE, and verify on both
CIs (Jenkins CI and Zuul CI) during a transition period. Once we are
confident that Zuul CI just works, we would shut down the RBE checker on Jenkins
CI and proceed with migrating the other checkers.

On a related note: the checks plugin is deprecated, so we should also start
looking into migrating to the new Checks UI, see [4]. Ben and Patrick would know
more, but do we still have to rely on the deprecated checks plugin and keep adding
new checkers? Couldn't we migrate off Jenkins CI and the checks plugin in one step?
gerrit-review.gs.com is always on @HEAD, so the latest and greatest version
is already deployed there...
 

Edwin Kempin

Feb 22, 2021, 6:16:21 AM
to James E. Blair, Repo and Gerrit Discussion, Monty Taylor
On Mon, Feb 22, 2021 at 4:22 AM James E. Blair <j...@acmegating.com> wrote:
> Hi,
>
> I think we're ready to continue moving forward with Zuul for Gerrit.

Sounds exciting. Happy to see progress on this!

> We've been running Zuul on plugins for a while, and I think that's been
> going relatively well.

+1
The code-owners plugin is using Zuul verification and it's working really well!

> This past week I did manage to cause a couple of
> outages, for which I apologize. In both cases we made improvements to
> the system that should make it more robust.

Thanks a lot, much appreciated!
 

We do already have a job for Gerrit ready to run.  It only runs the
build, no tests yet, and it currently takes about 12-14 minutes to run,
which seems like a workable starting point to me.  I believe we can
improve on that time, but it seems short enough that we could go ahead
and start running it in parallel with the current CI system and not
inconvenience anyone.

Of that run time, about 6 minutes is setup, and about 6 minutes is the
Bazel build.  Monty and I have started work on improving the setup time
and using a Bazel cache to reduce the build time.

Meanwhile, we should also start running tests.  If I'm reading the
job definitions correctly, it looks like we're running:

  --test_tag_filters=-flaky,elastic,git-protocol-v2

I propose the following iterative approach:

1) Add a checker to the gerrit repo so that we start running the current
   build-only job on changes to Gerrit.

2) Update that job to start running the above tests.

3) Add any additional tests.

And as they are ready, we can modify the job to use the Bazel cache and
other improvements to the setup time.

How does that sound?  If the Gerrit maintainers are okay with this, I
can go ahead and add the checker.

-Jim


Luca Milanesio

Feb 22, 2021, 10:42:46 AM
to James E. Blair, Luca Milanesio, repo-d...@googlegroups.com, Monty Taylor


> On 22 Feb 2021, at 03:22, James E. Blair <j...@acmegating.com> wrote:
>
> Hi,
>
> I think we're ready to continue moving forward with Zuul for Gerrit.

+1, looking forward to it :-)

>
> We've been running Zuul on plugins for a while, and I think that's been
> going relatively well. This past week I did manage to cause a couple of
> outages, for which I apologize. In both cases we made improvements to
> the system that should make it more robust.
>
> We do already have a job for Gerrit ready to run. It only runs the
> build, no tests yet, and it currently takes about 12-14 minutes to run,
> which seems like a workable starting point to me.

Do you have the pointer for it?
I believe that by enabling caching and RBE execution, that could be brought down massively, to a couple of minutes.

> I believe we can
> improve on that time, but it seems short enough that we could go ahead
> and start running it in parallel with the current CI system and not
> inconvenience anyone.

Sure, we should get started ASAP with something working.

>
> Of that run time, about 6 minutes is setup, and about 6 minutes is the
> Bazel build. Monty and I have started work on improving the setup time
> and using a Bazel cache to reduce the build time.

You can just point to the existing cache (https://gerrit-ci.gerritforge.com/cache) and it should work out of the box.
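
As a rough sketch of what pointing a build at that cache could look like, the Python helper below assembles a bazelisk command line that passes the cache URL through Bazel's --remote_cache flag. The helper name and the //example:target label are hypothetical; whether the cache needs credentials, or is writable from a Zuul node, is not something this sketch addresses.

```python
def bazel_build_command(target, remote_cache=None):
    """Build an argv list for 'bazelisk build', optionally with a remote cache."""
    cmd = ["bazelisk", "build"]
    if remote_cache:
        # --remote_cache takes the URL of the cache endpoint.
        cmd.append(f"--remote_cache={remote_cache}")
    cmd.append(target)
    return cmd


# The cache URL comes from the message above; the target label is a placeholder.
CACHE_URL = "https://gerrit-ci.gerritforge.com/cache"
command = bazel_build_command("//example:target", remote_cache=CACHE_URL)
print(" ".join(command))
```

The list form can be handed to subprocess.run as-is, which avoids shell quoting issues if the job later grows more flags.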

>
> Meanwhile, we should also start running tests. If I'm reading the
> job definitions correctly, it looks like we're running:
>
> --test_tag_filters=-flaky,elastic,git-protocol-v2
>
> I propose the following iterative approach:
>
> 1) Add a checker to the gerrit repo so that we start running the current
> build-only job on changes to Gerrit.

We do not currently have a check for “build only” but just “build + test”.
Can we also add tests to it so that we can reuse the existing check?

>
> 2) Update that job to start running the above tests.

1) + 2) are currently the same phase, unless you propose to break them down, which would increase the overall build time, as the setup phase would be executed twice, wouldn't it?

>
> 3) Add any additional tests.

Not sure we can do that now, as we would need Docker for the extra tests.
I would prefer to park this phase for now and instead assess whether and how we can get rid of the Docker-based tests.

ES support is currently in doubt and should be either dropped or moved to a libModule: Docker may NOT be needed anymore in Gerrit tests IMHO.

>
> And as they are ready, we can modify the job to use the Bazel cache and
> other improvements to the setup time.
>
> How does that sound? If the Gerrit maintainers are okay with this, I
> can go ahead and add the checker.

We already have the Checker for that, can we just have Zuul use it?
I wouldn't like to add extra complexity, as we do know that Zuul works and we can just use it!

We can run the:
- Code Style
- RBE Build/Test

Thanks again, Jim, for working on that :-)

Luca.

David Ostrovsky

Feb 22, 2021, 3:29:53 PM
to Repo and Gerrit Discussion
We can't. To separate the two CI systems and their corresponding checkers,
we use a scheme-based query [1]:

scheme:'SCHEME': Matches checks with the scheme 'SCHEME'.

The CI scheme values are:

GerritForge-CI: scheme:gerritforge
Zuul-CI: scheme:zuul-check

That's why I said in my previous comment that we would have to duplicate the checkers:
run "Zuul RBE Build/Tests" in parallel with "RBE Build/Tests", and disable the old one
once we are confident that Zuul verification works as expected.
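
To illustrate why the scheme keeps the two CI systems apart, here is a minimal Python sketch that groups checkers by scheme. It assumes checker UUIDs take the form SCHEME:ID; the IDs after the colon are hypothetical, and this is plain string handling, not a call into the checks plugin.

```python
from collections import defaultdict


def scheme_of(checker_uuid):
    """Return the scheme part of a 'SCHEME:ID' checker UUID (assumed format)."""
    scheme, sep, checker_id = checker_uuid.partition(":")
    if not sep or not scheme or not checker_id:
        raise ValueError(f"not in SCHEME:ID form: {checker_uuid!r}")
    return scheme


def group_by_scheme(checker_uuids):
    """Group checker UUIDs by scheme, as a scheme:'SCHEME' query would match them."""
    groups = defaultdict(list)
    for uuid in checker_uuids:
        groups[scheme_of(uuid)].append(uuid)
    return dict(groups)


# Schemes are from the message above; the IDs are hypothetical placeholders.
CHECKERS = [
    "gerritforge:rbe-build-tests",
    "gerritforge:code-style",
    "zuul-check:zuul-rbe-build-tests",
]

groups = group_by_scheme(CHECKERS)
print(groups["gerritforge"])
print(groups["zuul-check"])
```

Under that assumption, a checker belongs to exactly one scheme, which is why a Zuul-side duplicate of an existing checker would be needed rather than sharing one.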

James E. Blair

Feb 22, 2021, 4:39:47 PM
to Luca Milanesio, repo-d...@googlegroups.com, Monty Taylor
Luca Milanesio <luca.mi...@gmail.com> writes:

>> On 22 Feb 2021, at 03:22, James E. Blair <j...@acmegating.com> wrote:
>>
>> Hi,
>>
>> I think we're ready to continue moving forward with Zuul for Gerrit.
>
> +1, looking forward to it :-)
>
>>
>> We've been running Zuul on plugins for a while, and I think that's been
>> going relatively well. This past week I did manage to cause a couple of
>> outages, for which I apologize. In both cases we made improvements to
>> the system that should make it more robust.
>>
>> We do already have a job for Gerrit ready to run. It only runs the
>> build, no tests yet, and it currently takes about 12-14 minutes to run,
>> which seems like a workable starting point to me.
>
> Do you have the pointer for it?
> I believe that enabling caching and RBE execution, that could be
> massively be brought down to a couple of minutes.

Here's an example run of the build-only job:

https://ci.gerritcodereview.com/t/gerrit/build/d0e62ce4190c4b85b1ac2f576af3da63

>> I believe we can
>> improve on that time, but it seems short enough that we could go ahead
>> and start running it in parallel with the current CI system and not
>> inconvenience anyone.
>
> Sure, we should get started ASAP with something working.
>
>>
>> Of that run time, about 6 minutes is setup, and about 6 minutes is the
>> Bazel build. Monty and I have started work on improving the setup time
>> and using a Bazel cache to reduce the build time.
>
> You can just point to the existing cache
> (https://gerrit-ci.gerritforge.com/cache) and it should work out of
> the box.

Ah okay. I also set up a cache as a bucket in GCS. It reduced the
build time from 6 minutes to 3 minutes in my experiments. Monty's work
may shave off even more time.

I tried running the tests with
"--test_tag_filters=-flaky,-elastic,-git-protocol-v2" (thanks David for
the correction). The test run took 37 minutes on a very large test
node, and a number of tests timed out and/or failed. So it may take
some iteration to work that out.

>> Meanwhile, we should also start running tests. If I'm reading the
>> job definitions correctly, it looks like we're running:
>>
>> --test_tag_filters=-flaky,elastic,git-protocol-v2
>>
>> I propose the following iterative approach:
>>
>> 1) Add a checker to the gerrit repo so that we start running the current
>> build-only job on changes to Gerrit.
>
> We do not currently have a check for “build only” but just “build + test”.
> Can we also add tests to it so that we can reuse the existing check?

We will need a new checker for Zuul, but we'll only need one -- we won't
have separate checkers for build only and build+test. When we add it,
we'll see 5 checks on Gerrit changes instead of 4. If we later turn off
some of the Jenkins checks, we can disable those checkers and they will
no longer appear.

>> 2) Update that job to start running the above tests.
>
> 1) + 2) are the same phase currently, unless you propose to break them
> down, which would increase the overall build time though, as the setup
> phase would be executed twice, isn’t it?

Sorry, what I was trying to say is that there is a job already defined
and ready to go that builds Gerrit but doesn't run tests. I propose we
start running it, then add in the "bazelisk test" command to that job
shortly afterwards. I agree, we should not run separate jobs for
{build} and {build+test}.

>> 3) Add any additional tests.
>
> Not sure we can do that now, as we would need Docker for the extra tests.
> I would prefer to park this phase for now and, instead, assessing if
> and how we can get rid of the Docker-based tests.
>
> ES support is currently in doubt and should be either dropped or moved
> to a libModule: Docker may NOT be needed anymore in Gerrit tests IMHO.

Okay, step 3 can be a no-op. I mostly just wanted to establish the
iterative approach by starting by running the build job.

>> And as they are ready, we can modify the job to use the Bazel cache and
>> other improvements to the setup time.
>>
>> How does that sound? If the Gerrit maintainers are okay with this, I
>> can go ahead and add the checker.
>
> We already have the Checker for that, can we just have Zuul use it?
> I wouldn’t like to add extra complexity, as we do know that Zuul works
> and we can just use it !
>
> We can run the:
> - Code Style
> - RBE Build/Test

You may recall there are extra considerations with using RBE due to
Zuul's speculative execution capability. Using RBE without exposing the
credential is not a solved problem (though I'm sure we can solve it).

If it's critical that we use RBE, then let's stop now and design for
that. I'm sure it's possible, but it's going to take some commitment
from a Maintainer with good knowledge of the test system, a Googler with
admin access to the cloud account and more of my time.

It's a matter of priorities. We could start running the build job now,
add on locally-executed tests, and work on addressing any shortfalls.
We'll all be able to see immediate results and iterate in the open. And
we'd know that tests can run in a clean environment.

Or, if we don't want to waste time on trying to get tests to work
outside of an RBE environment, then let's stop and design for that.
We'll need commitment to work on it, and it will probably be a while
before we see visible progress on it.

Thoughts on how we should proceed?

-Jim

Luca Milanesio

Feb 22, 2021, 5:36:13 PM
to David Ostrovsky, Luca Milanesio, Repo and Gerrit Discussion
Mmm... I believe the checks API doesn't really care *who* is calling, does it?
There isn’t anything magic in the scheme about Jenkins or Zuul, unless the Zuul API client requires a ‘zuul’ scheme.

Luca.

Ben Rohlfs

Mar 1, 2021, 10:17:09 AM
to Luca Milanesio, David Ostrovsky, Repo and Gerrit Discussion
Sorry for my late reply.

As of today the new Checks UI is ready to be used and to be developed against. You can see it live on gerrit-review, where it has already replaced the old Checks UI. If all goes well and nobody objects, we would even remove the old Checks UI completely in the 3.4 release.

The plugin API is defined here:

Here you can see how the Checks plugin uses the new UI with data from the Checks backend:

Another example of a plugin using the new Checks UI can be found in Chromium's repo:

If you have an HTTP endpoint for fetching data (based on change and patchset numbers), then I am happy to collaborate with you and set up a new plugin that fetches and converts Zuul data to be consumed by the new Checks UI. Maybe start a separate thread for that. :-)

I am very much looking forward to Zuul instead of Jenkins and to using the new Checks UI for this. Whenever frontend-related questions block you, please reach out to me; I am happy to help!

-Ben






